Motivation

The development and incorporation of data science and big data into public health practice require us to learn a new language. Fortunately, many of the new terms have public health equivalents.

As a first step we created an A-Z of data science in public health.

As a next step we have drawn on the literature to create the following tables. We welcome suggestions for additions and improvements.

The glossary below draws extensively from:

Mooney et al table 2 [1], Kohavi et al [2], McGrail et al [3] and Fuller et al [4].

Big data - the 5Vs

A brief definition of big data. Originally big data was described in terms of 3 Vs, but more recent definitions talk of 5 Vs (see diagram). (Adapted from [4])

5 Vs of big data

Briefly:

  • Volume refers to storage - datasets are growing rapidly in size and increasingly need clusters of computers to store. In turn this has required developments in database technology and analytical techniques to be able to manage the data and extract value. This is particularly relevant to digital public health, which generates data from apps or sensors.
  • Velocity refers to the speed at which data is collected. Often in public health we produce annual datasets, or datasets aggregated over even longer time periods. With the advent of digital public health, data can be collected in near-realtime - on a daily or even hourly basis. Increasing the frequency of collection rapidly grows the size of the data.
  • Variety. There is an increasingly complex data landscape for public health. The nature of data and the formats in which it comes is growing rapidly - for example unstructured data including text and social media are increasingly used as public health data sources (see below)
  • Veracity. All that glisters is not gold. Big data can mislead, be inaccurate and be biased.
  • Value. “Data is the new oil”/ “data is the new soil”

Types of big data for public health

Source: Big data for public health [1]
| Source | Examples | ‘Bigness’ | Technical issues | Typical uses |
| --- | --- | --- | --- | --- |
| -omic/biological | Whole exome profiling, metabolomics | Wide | Lab effects, informatics pipeline | Etiologic research, screening |
| Geospatial | Neighborhood characteristics | Wide | Spatial autocorrelation | Etiologic research, surveillance |
| Electronic health records | Records of all patients with hypertension | Tall, often also wide | Data cleaning, natural language | Clinical research, surveillance |
| Personal monitoring | Daily GPS records, Fitbit readings | Tall | Redundancy, inference of intentions | Etiologic research, potentially clinical decision making |
| Effluent data | Google search results, Reddit | Tall | Selection biases, natural language | Surveillance, screening, identification of hidden social networks |

Towards defining public health data science

Donoho provides a useful discussion about the origins and development of data science. [5] In brief he defines “greater data science” as:

The science of data

and sees 6 core areas for data science activity:

  1. Data Exploration and Preparation
  2. Data Representation and Transformation
  3. Computing with Data
  4. Data Modeling
  5. Data Visualization and Presentation
  6. Science about Data Science

The NIH [6] defines data science as:

“the interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/ or complex sets of data”

McGrail et al [3] provide a definition of data science in population health as:

“The science of data about people”

We are working towards a definition of public health data science. Two suggestions are:

“The application of data science to improve and protect health and reduce inequalities”

or:

“The science of data to improve and protect health and reduce inequalities”

Adapted from [3] table 1

| Area | Focus | Multisource data | Primary aim of research | Focus on technical/policy infrastructure |
| --- | --- | --- | --- | --- |
| Data science | Data, esp. “big data” | Not always | Data for actionable information | No |
| Population data science | People, systems, population insights | Linkage and multiple sources | Public value | Key focus - legal, ethical, privacy, data collection |
| Informatics | Providers/ICT systems | Not necessarily | Implementation | Database/technical development |
| Public health data science | Public and population health, healthcare, health systems | Often, linkage important | Improving population health, reducing health inequality | Focus as per population health data science? |

Data science terms for public health

| Data science term | Related PH term or concept |
| --- | --- |
| Accuracy | Proportion of results correctly classified |
| Artificial intelligence | Artificial intelligence (AI) is used to describe machines that perform human-like activities such as learning, perception, problem solving and playing games. AI has been used to engage the public by improving the quality of eHealth interactions. For example, patients can use AI-based eHealth applications to receive personalised information. Chronology MD was developed for patients with Crohn’s disease; this programme allows patients to input their ‘observations of daily living’ and an AI system assists patients with management of their disease (eg, medication reminders, exercise and proper sleep motivation). This case highlights how AI applications can increase the immediacy of eHealth, the development of closeness and the feeling of an authentic, caring relationship. These applications help to provide a human-like element to eHealth exchanges between patients and AI systems. |
| Big data hubris | Big data hubris is the assumption that data with sufficient volume and velocity can compensate for or eliminate the need for high veracity data, high-quality study designs and more traditional forms of data analysis. Lazer and colleagues [7] provide a compelling example of big data hubris in the Google Flu Trends research. An important limitation of Google Flu Trends was that the underlying algorithms and methodology of Google search terms are proprietary and evolving (Google’s Hummingbird algorithm likely uses deep learning). Core scientific principles of replicability and transparency are difficult to uphold when dealing with proprietary data, whether it be from Google, Facebook or others. |
| Confusion matrix | A 2x2 table comparing predicted positive and negative results with observed results |
| Coverage | The proportion of a data set for which a classifier makes a prediction |
| Cross-validation | A method for estimating the accuracy (or error) of an outcome by dividing the data into k mutually exclusive subsets (“folds”) of approximately equal size. The model is trained and tested on the outcome k times. Each time it is trained on the data set minus a fold and tested on that fold. The accuracy estimate is the average accuracy for the k folds. |
| Data mining | Exploratory analysis. Data mining (sometimes referred to as knowledge discovery in databases) is the process of extracting new and at times useful information from data. Data mining and machine learning often use the same statistical techniques and it is difficult to differentiate the two in practice. Some would argue the primary focus of data mining is unsupervised learning (see unsupervised learning). Drug pathway discovery through analysis of published results is an example of data mining in health research. Perhaps data mining can be better conceived as data refining, where large volumes of data are sifted using statistical techniques to find potential associations of interest for researchers. |
| Deep learning | Deep learning is a machine learning technique designed to process signals like a human brain. Instead of using a single machine learning technique on a single type of data, deep learning uses multiple machine learning methods and layers of data to perform abstract learning tasks. To date, population health-related examples of deep learning are difficult to identify. An example of deep learning is an image recognition to image caption process. First, an image detection machine (eg, Vision Deep Convolutional Neural Network) identifies the items in an image, then based on those items a language generating machine (eg, recurrent neural nets) uses the data to generate a caption about the image. These processes can allow for detection and creation of abstract and creative objects such as painting and music. ONS have been using deep learning to identify and count caravans in caravan parks in order to improve the census and reduce the requirement for enumerators. |
| Ensemble learning | A machine-learning approach involving training multiple models on data subsets and combining results from these models when predicting for unobserved inputs. Ensembles can be more accurate (see accuracy) than the individual models. |
| Features | Variables. Measurements recorded for each observation (for example, participant age, sex, and body mass index are all features) |
| Knowledge discovery | The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This is the definition used in “Advances in Knowledge Discovery and Data Mining,” 1996, by Fayyad, Piatetsky-Shapiro, and Smyth. [8] |
| Label | Observed or computed value of an outcome or other variable of interest |
| Labeling | The process of setting a label for a variable, as opposed to leaving the variable’s value unknown |
| Learning algorithm | The set of steps used to train a model automatically from a data set (not to be confused with the model itself; e.g., there are many algorithms to train a neural network, each with different bounds on time, memory, and accuracy) |
| Machine learning | The coining of the term machine learning is often credited to computer scientist Arthur Samuel, who developed a machine that could defeat humans in the game of checkers. More recently, Tom Mitchell has explained machine learning as ‘a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E’. Modern machine learning involves a number of specific methods including neural networks, decision trees, nearest neighbour classifiers, support vector machines and Markov and hidden Markov models. These methods can be used for supervised or unsupervised learning. |
| Natural language | Working with words as data, as in qualitative or mixed-methods research (generally, human readable but not readily machine readable) |
| Noisy labels | Measurement error |
| Out-of-sample | Applying a model fitted to one data set to make predictions in another |
| Overfitting | Fitting a model to random noise or error instead of the actual relationship (due to having either a small number of observations or a large number of parameters relative to the number of observations) |
| Pipeline | (From bioinformatics) The ordered set of tools applied to a data set to move it from its raw state to a final interpretable analytic result |
| Precision | Positive predictive value |
| Recall | Sensitivity |
| Semi-supervised learning | An analytic technique used to fit predictive models to data where many observations are missing outcome data |
| Small-n, large-p | A wide but short data set: n = number of observations, p = number of variables for each observation |
| Supervised learning | An analytic technique in which patterns in covariates that are correlated with observed outcomes are exploited to predict outcomes in a data set or sets in which the correlates were observed but the outcome was unobserved. For example, linear regression and logistic regression are both supervised learning techniques, as are neural networks, boosted trees (xgboost) and penalised regression models (lasso and ridge). |
| Test data set | A subset of a more complete data set used to test empirical performance of an algorithm trained on a training data set |
| Text analytics | (See natural language.) Text analytics refers to the process of compiling and analysing text to derive meaningful information. Machines use algorithms to derive patterns and develop categories within text. Machine learning methods for text analytics can extract specific information, summarise and simplify, provide question and answers (eg, Apple’s Siri) and analyse documents for sentiments and opinions. For example, Twitter data has been used to predict income and socioeconomic status. Preoţiuc-Pietro et al applied supervised learning techniques (logistic regression with Elastic Net regularisation and Support Vector regression with a Radial Basis Function kernel) to profile features, inferred psychological and demographic features, emotions and word clusters from Twitter data to predict income. [9] The correlation between predicted and actual income was 0.63, with a mean absolute error of £9,535. |
| Training | Fitting a model |
| Training data set | A subset of a more complete data set used to train a model whose empirical performance can be tested on a test data set |
| Unsupervised learning | An analytic technique in which data is automatically explored to identify patterns, without reference to outcome information. Latent class analysis (when used without covariates) and k-means clustering are unsupervised learning techniques. |
| Video analytics | Video analytics (also referred to as video content analysis) uses machine learning to evaluate video footage to extract important details. Video analytics have been applied to closed-circuit television and video streaming services, such as YouTube, for object detection and tracking, behavioural analysis, and detection of ‘interesting events’. For example, Zangenehpour et al [10] used 90 hours of video at 23 intersections in Montreal to examine the safety of cyclist–driver interactions at intersections with cycle tracks. The authors used TrafficIntelligence, developed by Dr. Nicolas Saunier, to detect and classify road users, select and predict trajectories, and calculate post encroachment time (a measure of safety). [11] |
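
Several of the terms above (confusion matrix, accuracy, precision, recall, training and test data sets, cross-validation) fit together in a single workflow. The following is a minimal sketch in Python of how they relate, not a recommended implementation. The function names, the toy data and the deliberately simple "learning algorithm" (picking a cut-off on one feature) are all assumptions made for illustration.

```python
import random


def confusion_matrix(observed, predicted):
    """Count the four cells of a 2x2 table for binary labels coded 0/1."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # true positives
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false positives
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # false negatives
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # true negatives
    return tp, fp, fn, tn


def summary_metrics(tp, fp, fn, tn):
    """Accuracy, precision (positive predictive value) and recall (sensitivity)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


def k_fold_indices(n, k, seed=0):
    """Split row indices 0..n-1 into k mutually exclusive folds of similar size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy learning algorithm (hypothetical): pick the cut-off on a single
    feature that gives the highest accuracy on the training data."""
    best_cut, best_accuracy = None, -1.0
    for cut in sorted(set(xs)):
        accuracy = sum(int(x >= cut) == y for x, y in zip(xs, ys)) / len(xs)
        if accuracy > best_accuracy:
            best_cut, best_accuracy = cut, accuracy
    return best_cut


def cross_validate(xs, ys, k=5, seed=0):
    """Train on all folds but one, test on the held-out fold, average accuracy."""
    fold_accuracies = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        cut = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        observed = [ys[i] for i in fold]
        predicted = [int(xs[i] >= cut) for i in fold]
        cells = confusion_matrix(observed, predicted)
        fold_accuracies.append(summary_metrics(*cells)["accuracy"])
    return sum(fold_accuracies) / len(fold_accuracies)


# Toy data (made up): one feature per person and a binary outcome.
toy_feature = list(range(20))
toy_outcome = [int(x >= 10) for x in toy_feature]
print(cross_validate(toy_feature, toy_outcome, k=5))
```

Because each fold is scored on observations the model never saw during training, the averaged figure is an out-of-sample estimate and is less prone to rewarding overfitting than accuracy measured on the training data itself.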

Critiques of big data in public health

Fuller et al [4] provide some useful critiques of big data use for public and population health:

  • Automating research changes the nature of knowledge
  • Claims of objectivity are misleading
  • Bigger is not always better
  • Not all data are equivalent
  • Accessible does not equal ethical
  • Lack of access creates digital divides

Khoury and Ioannidis [12] argue that separating signal from noise in big data for public health benefit requires a stronger epidemiological foundation, improved knowledge translation and integration, and reinvigoration of the principles of evidence-based medicine.

Future directions

Selected machine learning applications in public health

Selected machine learning applications for public health [1]
| Algorithm | Learning type | Example |
| --- | --- | --- |
| K-means clustering | Unsupervised | Hot spot detection |
| Retrospective event detection | Unsupervised | Case ascertainment |
| Content analysis | Unsupervised | Public health surveillance |
| K-nearest neighbors clustering | Supervised | Spatiotemporal hot spot detection; clinical outcomes from genetic data; falls from wearable sensors |
| Naïve Bayes | Supervised | Acute gastrointestinal syndrome surveillance |
| Neural networks | Supervised | Identifying microcalcification clusters in digital mammograms; predicting mortality in head trauma patients; predicting influenza vaccination outcome |
| Support vector machines | Supervised | Diagnosis of diabetes mellitus; detection of depression through Twitter posts |
| Decision trees | Supervised | Identifying infants at high risk for serious bacterial infections; comparing cost-effectiveness of different influenza treatments; and physical activity from wearable sensors |
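
To make the first row of the table more concrete, here is a minimal sketch of k-means clustering (Lloyd's algorithm) applied to point locations, as one might do as a first pass at hot spot detection. It is an illustration under stated assumptions, not the method used in any of the cited studies: the coordinates are made up, the function name `k_means` is hypothetical, and real applications would normally use an established library and account for the underlying population at risk.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Group 2D points into k clusters without using any outcome information.

    Returns the cluster centroids and, for each point, the index of the
    centroid it was assigned to.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


# Made-up case locations forming two obvious groups.
case_locations = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
]
centroids, labels = k_means(case_locations, k=2)
print(centroids)
print(labels)
```

The number of clusters k has to be chosen in advance, and the result can depend on the random starting centroids, so in practice the algorithm is usually run several times with different seeds.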

Selected studies of social media analysis in public health

A number of studies have used social media analysis to predict public health outcomes or health behaviours. These include adverse drug reactions, depressive or suicidal behaviour or other mental health problems, area health statistics, asthma and the presence of ‘food deserts’.

The predictive value of social media data has recently been reviewed [13] - we have adapted the following table.

Table 1: Social media analytical applications for public health [13]; T = Twitter, O = Blogs, F = Facebook, I = Instagram, R = Reddit, TR = Tumblr
| Article | Topic | Data | Data size | Features | Task | Success rate |
| --- | --- | --- | --- | --- | --- | --- |
| Bian | Adverse drug reactions | T | 239 users | N-gram, Semantic, Non-SM | Classification | Acc. 74% |
| Feldman | Adverse drug reactions | O | 41K posts, 5.3K users | Semantic, Non-SM | Classification | F1 0.84 (statins), F1 0.78 (anti-depressants) |
| Nikfarjam | Adverse drug reactions | O | 6.8K posts | Semantic | Classification | F1 0.68 |
| Segura | Adverse drug reactions | O | 400 posts | Semantic, Non-SM | Classification | F1 0.68 |
| Yates | Adverse drug reactions | T, O | 400K forum posts, 2.8B tweets | N-gram, Semantic, Non-SM | Classification | Prec. 0.59 (O), Prec. 0.48 (T) |
| Corley et al | Influenza | T, O | 97.9M posts | Metadata, N-gram | Regression | r = 0.63 |
| Lamb | Influenza | T | 3.8B tweets | N-gram, Semantic | Regression | r = 0.80 |
| Paul | Influenza | T | Not specified | N-gram, Semantic | Regression | 25.3% improvement |
| Bodnar | Influenza | T | 239M tweets | N-gram | Regression | r = 0.88 |
| Zou | Intestinal disease | T | 410M tweets | N-gram | Regression | r = 0.73 (Norovirus), 0.77 (Food poisoning) |
| Zhang | Asthma | T | 5.5M tweets | N-gram | Classification | Acc. 66.3% |
| Chancellor | Mental health | TR | 13K users, 68.3M posts | Metadata, Semantic | Regression | Concordance 0.658 |
| De Choudhury | Mental health | T | 40K tweets | Semantic, Social | Classification | Acc. 80% |
| De Choudhury | Mental health | T | 2.1M tweets | Semantic, Social | Classification | Acc. 70% |
| De Choudhury | Mental health | F, T | 40K tweets, 0.6M posts (F) | Metadata, Semantic, Social | Regression | r² = 0.48 |
| De Choudhury | Mental health | R | 63K posts, 35K users | Metadata, Semantic | Classification | Acc. 80% |
| Burnap | Mental health | T | 2K tweets | N-gram, Semantic | Classification | F1 0.69 |
| Shuai | Mental health | F, I | 63K users (F), 2K users | Metadata, Social, Behavior | Classification | Acc. 78% (I), Acc. 83% (F) |
| Tsugawa | Mental health | T | 209 users, 574K tweets | N-gram, Semantic, Social | Classification | Acc. 66% |
| Won | Mental health | O | 153M posts | N-gram, Non-SM | Regression | Acc. 79% |
| Chancellor | Mental health | I | 100K users | Semantic | Classification | F1 0.81 |
| Lehrman | Mental health | R | 200 posts | N-gram, Sentiment | Classification | Acc. 54.5%, baseline 30.5% |
| Culotta | Health stats | T | 4.3M tweets | Metadata, Semantic, Non-SM | Regression | r = 0.63 |
| De Choudhury | Food Deserts | I | 14M posts | Semantic, Spatial, Non-SM | Classification | Acc. 80% |
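
The success rates in this table use different measures: accuracy and F1 for classification tasks, and the correlation coefficient r for regression tasks. As a rough guide to how two of these are calculated, the sketch below computes F1 (the harmonic mean of precision and recall) and Pearson's r in plain Python. The function names and the example numbers are illustrative assumptions and are not taken from any of the studies listed.

```python
import math


def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up numbers: weekly estimates from a model versus officially reported rates.
predicted_rate = [1.2, 1.9, 3.1, 4.2, 4.8]
reported_rate = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(predicted_rate, reported_rate))
print(f1_score(tp=8, fp=2, fn=2))
```

Accuracy, F1 and r are not directly comparable with each other, so the table is best read row by row rather than as a ranking of studies.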


References

1 Mooney SJ, Pejaver V. Big Data in Public Health: Terminology, Machine Learning, and Privacy. Annual Review of Public Health 2018;1–18. PMID:29261408

2 Kohavi R, Provost F. Glossary of Terms. Machine Learning 1998;30:271–4. doi:10.1023/A:1017181826899

3 McGrail KM, Jones K, Akbari A et al. A Position Statement on Population Data Science: The Science of Data about People. International Journal of Population Data Science 2018;3:1–11. https://ijpds.org/article/view/415

4 Fuller D, Buote R, Stanley K. A glossary for big data in population and public health: Discussion and commentary on terminology and research methods. Journal of Epidemiology and Community Health 2017;71:1113–7. doi:10.1136/jech-2017-209608

5 Donoho D. 50 Years of Data Science. Journal of Computational and Graphical Statistics 2017;26:745–66. doi:10.1080/10618600.2017.1384734

6 National Institutes of Health. NIH Strategic Plan for Data Science. 2018;1–26.

7 Lazer D, Kennedy R, King G et al. The parable of Google Flu: traps in big data analysis. Science 2014;343:1203–5. doi:10.1126/science.1248506

8 Fayyad UM, Piatetsky-Shapiro G, Smyth P et al. Advances in Knowledge Discovery and Data Mining. 1996.

9 Preoţiuc-Pietro D, Volkova S, Lampos V et al. Studying user income through language, behaviour and affect in social media. PLoS ONE 2015;10. doi:10.1371/journal.pone.0138717

10 Zangenehpour S, Miranda-Moreno LF, Saunier N. Automated Classification in Traffic Video at Intersections with Heavy Pedestrian and Bicycle Traffic. 2014 TRB Annual Meeting Compendium of Papers 2014.

11 Saunier N, Sayed T, Ismail K. Large-Scale Automated Analysis of Vehicle Interactions and Collisions. Transportation Research Record: Journal of the Transportation Research Board 2010;2147:42–50. doi:10.3141/2147-06

12 Khoury MJ, Ioannidis JPA. Big data meets public health. Science 2014;346:1054–5. doi:10.1126/science.aaa2709

13 Phillips L, Dowling C, Shaffer K et al. Using Social Media to Predict the Future: A Systematic Literature Review. arXiv 2017;1–55. http://arxiv.org/abs/1706.06134