On the use of applied machine learning and digital infrastructure to leverage social media data in health and epidemiology

Müller, Martin Mathias

doi:10.5075/epfl-thesis-8283

2021

Abstract

The quantification of population-level health behaviors is crucial for guiding public health policy. However, traditional methods for measuring such health behaviors have several shortcomings. In recent years, social media data has been successfully used to measure health behaviors and may be used as a low-cost and real-time addition to traditional data sources. Methods from the field of natural language processing are increasingly used to automatically process, filter and categorize the rapidly growing amount of publicly available social media data. However, a number of methodological challenges limit the rate at which we can generate insight from such data. In this work I will argue that long-term investment in digital infrastructure and open source tooling is required in order to overcome these challenges. In chapter 2 we introduce the Crowdbreaks platform, which is the basis of this thesis. Crowdbreaks is an open source framework for real-time data collection, continuous crowdsourced annotation, and continuous re-training of machine learning classifiers. In contrast to traditional research workflows, projects on Crowdbreaks run over an extended period of time, allowing for the observation of health trends over multiple years while keeping algorithms up-to-date. In chapter 3 we quantify the occurrence of concept drift in vaccine-related Twitter data, which further validates the need for the Crowdbreaks platform. In chapter 4 we use the Crowdbreaks platform to trace sentiment towards the novel gene-editing technology CRISPR/Cas9 back to its first application in 2013 and investigate how public opinion may have been affected in the context of recent scandals surrounding the technology. In chapter 5 we turn our attention to the COVID-19 pandemic and analyze who was speaking and who was heard in the early months of the pandemic. Chapter 6 builds on this work and explores the dynamics of Twitter communities during the COVID-19 pandemic. Lastly, in chapter 7 we introduce COVID-Twitter-BERT, a domain-specific language model which has been used in various downstream natural language processing applications on COVID-19-related Twitter data.

Details

Title On the use of applied machine learning and digital infrastructure to leverage social media data in health and epidemiology

Author(s) Müller, Martin Mathias

Advisor(s) Salathé, Marcel

Pagination 214

Date 2021

Publisher Lausanne, EPFL

Keywords Social media; natural language processing; digital epidemiology; Twitter; health

Language English

DOI https://doi.org/10.5075/epfl-thesis-8283

Laboratories UPSALATHE1

Record creation date 2021-02-22
