NSF Award # SES-1823633

Twitter Robustness & Reliability

The Generalizability and Replicability of Twitter Data for Population Research: An NSF-Funded Project ($500,000)

Team members: Guangqing Chi (PI), Eric Plutzer, Jennifer Van Hook, Heng Xu, Junjun Yin



Social media data have the potential to track phenomena in real time, such as the percentage of the population that is fearful in the minutes after a disaster or terrorist event, or the degree of anger immediately after the announcement of a jury verdict in a highly publicized case.

In each of these examples, it would be difficult to conduct a field survey in real time, and respondents may not be able to reconstruct how they felt or behaved at the time of the event, even if interviewed just a few days later. Social media data have the potential to overcome these limitations.

The objectives of this project are:

(1) Extending and refining existing methods for imputing the gender, age, race/ethnicity, and county of residence of each Twitter user;

(2) Using these values to assess the representativeness of Twitter samples at the county level and explaining the determinants of bias;

(3) Adapting five methods developed for probability or non-probability surveys to reweight Twitter samples and comparing their performance in producing model estimates that can be used to infer characteristics of the general population; and

(4) Testing the feasibility of using Twitter data to estimate migration at the county level by comparing Twitter data to the Internal Revenue Service migration data and estimating Puerto Rican migration to the continental U.S. after Hurricane Maria.
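As a rough illustration of the comparison described in objective (4), the sketch below correlates Twitter-derived county-to-county flows with a benchmark such as the IRS migration data. The county labels, flow counts, and the choice of Pearson correlation are all assumptions made for this example; the project's actual validation procedure is not described here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical flows keyed by (origin county, destination county).
# Real inputs would be movers inferred from Twitter and IRS migration counts.
twitter_flows = {("A", "B"): 12, ("A", "C"): 30, ("C", "A"): 18}
benchmark_flows = {("A", "B"): 120, ("A", "C"): 300, ("C", "A"): 180}

# Compare only the county pairs observed in both sources.
pairs = sorted(set(twitter_flows) & set(benchmark_flows))
r = pearson(
    [twitter_flows[p] for p in pairs],
    [benchmark_flows[p] for p in pairs],
)
print(f"Correlation across {len(pairs)} county pairs: {r:.3f}")
```

A high correlation would suggest that Twitter flows track the relative size of benchmark flows, although it says nothing about whether the absolute levels agree.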

This project evaluates the extent to which Twitter users represent or misrepresent the population across different demographic groups and tests the feasibility of developing weights that, when applied to Twitter data, make the results more representative of the underlying population.
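One simple way to express over- or under-representation is the ratio of a group's share among Twitter users to its share in the population. The sketch below is a minimal example under that assumption; the function name, age groups, and counts are hypothetical and are not taken from the project's data or package.

```python
def representation_ratios(sample_counts, population_counts):
    """Ratio of each group's sample share to its population share.

    A value above 1 means the group is over-represented in the sample,
    and a value below 1 means it is under-represented.
    """
    sample_total = sum(sample_counts.values())
    population_total = sum(population_counts.values())
    ratios = {}
    for group, population in population_counts.items():
        sample_share = sample_counts.get(group, 0) / sample_total
        population_share = population / population_total
        ratios[group] = sample_share / population_share
    return ratios

# Hypothetical counts for one county: imputed Twitter users vs. census population.
twitter_users = {"18-29": 60, "30-49": 30, "50+": 10}
census = {"18-29": 200, "30-49": 300, "50+": 500}

ratios = representation_ratios(twitter_users, census)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

In this toy county the youngest group is over-represented by roughly a factor of three, while the oldest group appears at about one fifth of its population share.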

The project conducts the research at the county level in the United States from January 2014 to December 2020, using 96% of geotagged tweets from the study period and 100% of tweets from one month.

This project analyzes how the application of survey weighting can rebalance samples of Twitter data, and assesses how well this rebalancing allows valid generalizations about population behaviors.
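To make the idea of rebalancing concrete, the sketch below applies post-stratification, a standard survey weighting technique. It is shown only as an illustration and is not necessarily one of the methods the project adapts; the age groups, outcome values, and population shares are hypothetical.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Return one weight per sampled user.

    Each weight is the group's population share divided by its sample share,
    so the weighted group shares match the population shares.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]

def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical sample: the age group of each user and a binary outcome
# (for example, whether the user posted about a given topic).
groups = ["18-29", "18-29", "18-29", "30+"]
outcome = [1, 1, 0, 0]
population_shares = {"18-29": 0.25, "30+": 0.75}

weights = poststratification_weights(groups, population_shares)
raw_estimate = sum(outcome) / len(outcome)
adjusted_estimate = weighted_mean(outcome, weights)
print(f"Unweighted: {raw_estimate:.3f}, weighted: {adjusted_estimate:.3f}")
```

Because younger users dominate this toy sample, the unweighted estimate overstates the outcome; weighting pulls it toward the value implied by the population's age structure. The adjustment only helps to the extent that the outcome varies with the weighting variables.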

The project provides a foundation for future advances in the use of social media data for scientific, health, and applied research, permitting a wide variety of inferences useful in social policy formulation. A key aspect of the project is new evidence on how accurately migration flows can be measured in real time, which can inform policies for providing assistance after natural disasters.

Python Package and Data:

Chi, Guangqing, Junjun Yin, Jennifer Van Hook, Eric Plutzer, and Heng Xu. 2020. "Estimation of Twitter user demographics in the USA, 2014." Released on November 23, 2020.

Available here