“A GeoComputational Approach to Giving Population Context to Social Media: ‘Textation’ without Representation?”
Every day, over 65 million tweets, or short messages, are sent on the social networking platform Twitter (1). Increasingly, researchers are using this voluminous source of social media data to track population trends, monitor illnesses, describe behaviors, and characterize the diffusion of information. Beyond its large volume of messaging, Twitter attracts researchers with its diverse population of over 255 million users (2), its user profiles, and its research-friendly Terms of Service. Public health and epidemiologic researchers are harnessing Twitter and other social media to study health-related characteristics of the population, and they now frequently design health interventions based on social media. From 2010 to 2014, the number of PubMed-referenced studies using social media to study health-related topics nearly doubled, from about 250 to nearly 500. But how representative of the population are the individuals captured in these data? To date, tools to readily answer this question do not exist. Many studies focus on the geographic distribution of individuals, for example by using Twitter to track the spread of influenza in the U.S. (3). While the absolute number of individuals identified with the flu over time and location via Twitter is valuable, it is critical to also know the relative number of individuals affected: do 2,000 individuals represent 5% of the population or 50%?
We aim to address this crucial gap in how public health and epidemiologic studies using social media can be designed and interpreted. By developing computational algorithms that link Twitter feeds with geographic information systems (GIS) and US Census data, we can build a platform for periodic monitoring and reporting of how well Twitter reflects the underlying population. This will enable researchers from all disciplines to draw more meaningful and appropriate conclusions from their findings, and to use more sophisticated study designs informed by an understanding of the populations included. Our key objectives for this proposed work are:
Specific Aims:
1. Characterize Twitter data in relation to the underlying population composition. Using profile, location, and message content information, we will develop an algorithm to integrate geospatial and socio-demographic data and perform automated analyses to characterize the population represented by Twitter data across a range of spatial units.
2. Develop a platform for periodic monitoring and reporting of how Twitter reflects the population. For the work in Aim 1 to have an impact, it must be accessible and current. Thus, we will create a platform that automatically refreshes the analytic results from the Aim 1 algorithm and provides usable reporting tools.
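To illustrate the kind of comparison Aim 1 describes, the following is a minimal Python sketch, not the proposed algorithm itself. It assumes each unique Twitter user has already been assigned to a spatial unit (for example, a county) and that a census population count is available for each unit. The function name `representation_by_unit`, its inputs, and the two summary measures are hypothetical choices made for illustration only.

```python
from collections import Counter


def representation_by_unit(census_population, user_units):
    """Compare located Twitter users with census population per spatial unit.

    census_population: dict mapping unit ID -> resident count (assumed input).
    user_units: iterable with one unit ID per unique located user (assumed input).
    Returns a dict mapping unit ID -> summary dict.
    """
    user_counts = Counter(user_units)
    total_population = sum(census_population.values())
    # Only count users who fall inside a unit that has census data.
    total_users = sum(user_counts[unit] for unit in census_population)

    summary = {}
    for unit, population in census_population.items():
        users = user_counts[unit]
        # Users per 1,000 residents: the "relative number" for this unit.
        rate = 1000 * users / population if population else None
        # Share of all users divided by share of all residents;
        # values above 1 suggest the unit is over-represented on Twitter.
        if population and total_users and total_population:
            ratio = (users / total_users) / (population / total_population)
        else:
            ratio = None
        summary[unit] = {
            "users": users,
            "population": population,
            "users_per_1000": rate,
            "representation_ratio": ratio,
        }
    return summary
```

For example, with 2,000 located users in each of two hypothetical units, one with 40,000 residents and one with 4,000, `users_per_1000` is 50.0 (5% of residents) for the first and 500.0 (50%) for the second, which is the distinction that absolute counts alone cannot show.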