Abstract

The quantification of population-level health behaviors is crucial for guiding public health policy. However, traditional methods for measuring such health behaviors have several shortcomings. In recent years social media data has been successfully used to measure health behaviors and may be used as a low-cost and real-time addition to traditional data sources. Methods from the field of natural language processing are increasingly used to automatically process, filter and categorize the rapidly growing amount of publicly available social media data. However, a number of methodological challenges limit the rate at which we can generate insight from such data. In this work I will argue that long-term investment in digital infrastructure and open source tooling is required in order to overcome these challenges. In chapter 2 we introduce the Crowdbreaks platform, which is the basis of this thesis. Crowdbreaks is an open source framework for real-time data collection, continuous crowdsourced annotation, and continuous re-training of machine learning classifiers. In contrast to traditional research workflows, projects on Crowdbreaks run over an extended period of time, allowing for the observation of health trends over multiple years while keeping algorithms up to date. In chapter 3 we quantify the occurrence of concept drift in vaccine-related Twitter data, which further validates the need for the Crowdbreaks platform. In chapter 4 we use the Crowdbreaks platform to trace sentiment towards the novel gene-editing technology CRISPR/Cas9 back to its first application in 2013 and investigate how public opinion may have been affected in the context of recent scandals surrounding the technology. In chapter 5 we turn our attention to the COVID-19 pandemic and analyze who was speaking and who was heard in the early months of the pandemic. Chapter 6 builds on this work and explores the dynamics of Twitter communities during the COVID-19 pandemic. Lastly, in chapter 7 we introduce COVID-Twitter-BERT, a domain-specific language model which has been used in various downstream natural language processing applications on COVID-19-related Twitter data.