Motivation

The development and incorporation of data science and big data into public health practice require us to learn a new language. Fortunately, many of the new terms have public health equivalents.

As a first step we created an A-Z of data science in public health.

As a next step we have drawn on the literature to create the following tables. We welcome suggestions for additions and improvements.

The glossary below draws extensively from:

Mooney et al table 2 [1], Kohavi et al [2], McGrail et al [3] and Fuller et al [4].

Big data - the 5Vs

A brief definition of big data. Originally big data was described in terms of 3 Vs, but more recent definitions talk of 5 Vs (see diagram). (Adapted from [4])

5 Vs of big data

Briefly:

Volume refers to storage - datasets are growing rapidly in size and increasingly need clusters of computers to store them. In turn this has required developments in database technology and analytical techniques to be able to manage the data and extract value. This is particularly relevant to digital public health, which generates data from apps or sensors.
Velocity refers to the speed at which data is collected. Often in public health we produce annual datasets or data aggregated over even longer time periods. With the advent of digital public health, data can be collected in near-realtime - on a daily or even hourly basis. Increasing the frequency of collection rapidly grows the size of the data.
Variety. There is an increasingly complex data landscape for public health. The range of data types and formats is growing rapidly - for example, unstructured data including text and social media are increasingly used as public health data sources (see below).
Veracity. All that glisters is not gold. Big data can mislead, be inaccurate and be biased.
Value. “Data is the new oil”/ “data is the new soil”

Types of big data for public health

Source: Big data for public health [1]
Source	Examples	‘Bigness’	Technical issues	Typical uses
-omic/biological	Whole exome profiling, metabolomics	Wide	Lab effects, informatics pipeline	Etiologic research, screening
Geospatial	Neighborhood characteristics	Wide	Spatial autocorrelation	Etiologic research, surveillance
Electronic health records	Records of all patients with hypertension	Tall, often also wide	Data cleaning, natural language	Clinical research, surveillance
Personal monitoring	Daily GPS records, Fitbit readings	Tall	Redundancy, inference of intentions	Etiologic research, potentially clinical decision making
Effluent data	Google search results, Reddit	Tall	Selection biases, natural language	Surveillance, screening, identification of hidden social networks

Towards defining public health data science

Donoho provides a useful discussion about the origins and development of data science. [5] In brief he defines “greater data science” as:

The science of data

and sees 6 core areas for data science activity:

Data Exploration and Preparation
Data Representation and Transformation
Computing with Data
Data Modeling
Data Visualization and Presentation
Science about Data Science

The NIH [6] defines data science as:

“the interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/ or complex sets of data”

McGrail et al [3] provide a definition of data science in population health as:

“The science of data about people”

We are working towards a definition of public health data science. Two suggestions are:

“The application of data science to improve and protect health and reduce inequalities”

or:

“The science of data to improve and protect health and reduce inequalities”

Adapted from [3] table 1

Area	Focus	Multisource data	Primary aim of research	Focus on technical/ policy infrastructure
Data science	Data esp. “big data”	Not always	Data for actionable information	No
Population data science	People, systems, population insights	Linkage and multiple sources	Public value	Key focus - legal, ethical, privacy, data collection
Informatics	Providers/ ICT systems	Not necessarily	Implementation	Database/ technical development
Public health data science	Public and population health, healthcare, health systems	Often, linkage important	Improving population health, reducing health inequality	Focus as per population health data science?

Data science terms for public health

Data science term	Related PH term or concept
Accuracy	Proportion of results correctly classified
Artificial intelligence	Artificial intelligence (AI) is used to describe machines that perform human-like activities such as learning, perception, problem solving and playing games. AI has been used to engage the public by improving the quality of eHealth interactions. For example, patients can use AI-based eHealth applications to receive personalised information. Chronology MD was developed for patients with Crohn’s disease; this programme allows patients to input their ‘observations of daily living’ and an AI system assists patients with management of their disease (eg, medication reminders, exercise and proper sleep motivation). This case highlights how AI applications can increase the immediacy of eHealth, the development of closeness and the feeling of an authentic, caring relationship. These applications help to provide a human-like element to eHealth exchanges between patients and AI systems
Big data hubris	Big data hubris is the assumption that data with sufficient volume and velocity can compensate for or eliminate the need for high veracity data, high-quality study designs and more traditional forms of data analysis. Lazer and colleagues[7] provide a compelling example of big data hubris in the Google Flu Trends research. An important limitation of Google Flu Trends was that the underlying algorithms and methodology of Google search terms are proprietary and evolving (Google’s Hummingbird algorithm likely uses deep learning). Core scientific principles of replicability and transparency are difficult when dealing with proprietary data, whether it be from Google, Facebook or others
Confusion matrix	A 2x2 table comparing predicted positive and negative results with observed results
Coverage	The proportion of a data set for which a classifier makes a prediction.
Cross-validation	A method for estimating the accuracy (or error) of an outcome by dividing the data into k mutually exclusive subsets (“folds”) of approximately equal size. The model is trained and tested on the outcome k times. Each time it is trained on the data set minus a fold and tested on that fold. The accuracy estimate is the average accuracy for the k folds.
Data mining	Exploratory analysis. Data mining (sometimes referred to as knowledge discovery in databases) is the process of extracting new and at times useful information from data. Data mining and machine learning often use the same statistical techniques and it is difficult to differentiate the two in practice. Some would argue the primary focus of data mining is unsupervised learning (see unsupervised learning). Drug pathway discovery through analysis of published results is an example of data mining in health research. Perhaps data mining can be better conceived as data refining, where large volumes of data are sifted using statistical techniques to find potential associations of interest for researchers.
Deep learning	Deep learning is a machine learning technique designed to process signals like a human brain. Instead of using a single machine learning technique on a single type of data, deep learning uses multiple machine learning methods and layers of data to perform abstract learning tasks. To date, population health-related examples of deep learning are difficult to identify. An example of deep learning is an image recognition to image caption process. First, an image detection machine (eg, Vision Deep Convolutional Neural Network) identifies the items in an image, then based on those items a language generating machine (eg, recurrent neural nets) uses the data to generate a caption about the image. These processes can allow for detection and creation of abstract and creative objects such as paintings and music. ONS have been using deep learning to identify and count caravans in caravan parks in order to improve the census and reduce the requirement for enumerators
Ensemble learning	A machine-learning approach involving training multiple models on data subsets and combining results from these models when predicting for unobserved inputs. Ensembles can be more accurate (see accuracy) than the individual models.
Features	Variables. Measurements recorded for each observation (for example, participant age, sex, and body mass index are all features)
Knowledge discovery	The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This is the definition used in “Advances in Knowledge Discovery and Data Mining,” 1996, by Fayyad, Piatetsky-Shapiro, and Smyth.[8]
Label	Observed or computed value of an outcome or other variable of interest
Labeling	The process of setting a label for a variable, as opposed to leaving the variable’s value unknown
Learning algorithm	The set of steps used to train a model automatically from a data set (not to be confused with the model itself; e.g., there are many algorithms to train a neural network, each with different bounds on time, memory, and accuracy)
Machine learning	The coining of the term machine learning is often credited to computer scientist Arthur Samuel, who developed a machine that could defeat humans in the game of checkers. More recently, Tom Mitchell has defined machine learning as follows: ‘a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E’. Modern machine learning involves a number of specific methods including neural networks, decision trees, nearest neighbour classifiers, support vector machines and Markov and hidden Markov models. These methods can be used for supervised or unsupervised learning.
Natural language	Working with words as data, as in qualitative or mixed-methods research (generally, human readable but not readily machine readable)
Noisy labels	Measurement error
Out-of-sample	Applying a model fitted to one data set to make predictions in another
Overfitting	Fitting a model to random noise or error instead of the actual relationship (due to having either a small number of observations or a large number of parameters relative to the number of observations)
Pipeline	(From bioinformatics) The ordered set of tools applied to a data set to move it from its raw state to a final interpretable analytic result
Precision	Positive predictive value
Recall	Sensitivity
Semi-supervised learning	An analytic technique used to fit predictive models to data where many observations are missing outcome data.
Small-n, large-p	A wide but short data set: n = number of observations, p = number of variables for each observation
Supervised learning	An analytic technique in which patterns in covariates that are correlated with observed outcomes are exploited to predict outcomes in a data set or sets in which the correlates were observed but the outcome was unobserved. For example, linear regression and logistic regression are both supervised learning techniques, as are neural networks, boosted trees (xgboost) and penalised regression models (lasso and ridge)
Test data set	A subset of a more complete data set used to test empirical performance of an algorithm trained on a training data set
Text analytics	(See natural language). Text analytics refers to the process of compiling and analysing text to derive meaningful information. Machines use algorithms to derive patterns and develop categories within text. Machine learning methods for text analytics can extract specific information, summarise and simplify, provide question and answers (eg, Apple’s Siri) and analyse documents for sentiments and opinions. For example, Twitter data has been used to predict income and socioeconomic status. Preoţiuc-Pietro et al used Twitter data and supervised learning techniques (logistic regression with Elastic Net regularisation and Support Vector regression with a Radial Basis Function kernel) to predict income from profile features, inferred psychological and demographic features, emotions and word clusters.[9] The correlation between predicted and actual income was 0.63, with a mean absolute error of £9535
Training	Fitting a model
Training data set	A subset of a more complete data set used to train a model whose empirical performance can be tested on a test data set
Unsupervised learning	An analytic technique in which data is automatically explored to identify patterns, without reference to outcome information. Latent class analysis (when used without covariates) and k-means clustering are unsupervised learning techniques
Video analytics	Video analytics (also referred to as video content analysis) uses machine learning to evaluate video footage to extract important details. Video analytics have been applied to closed-circuit television and video streaming services, such as YouTube, for object detection and tracking, behavioural analysis, and detection of ‘interesting events’. For example, Zangenehpour et al [10] used 90 hours of video at 23 intersections in Montreal to examine the safety of cyclist–driver interactions at intersections with cycle tracks. The authors used TrafficIntelligence, developed by Dr. Nicolas Saunier, to detect and classify road users, select and predict trajectories, and calculate post encroachment time (a measure of safety).[11]
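
Several of the terms above (confusion matrix, accuracy, precision, recall, cross-validation, training and test data sets) fit together in a single workflow. The following is a minimal Python sketch of that workflow. It is not taken from the cited sources: the function names, the toy threshold classifier and the example values are all hypothetical and chosen only for illustration.

```python
def confusion_counts(observed, predicted):
    """Count the four cells of a 2x2 confusion matrix for labels coded 1/0."""
    tp = fp = fn = tn = 0
    for obs, pred in zip(observed, predicted):
        if pred == 1 and obs == 1:
            tp += 1          # true positive
        elif pred == 1 and obs == 0:
            fp += 1          # false positive
        elif pred == 0 and obs == 1:
            fn += 1          # false negative
        else:
            tn += 1          # true negative
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Proportion of results correctly classified."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Positive predictive value."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Sensitivity."""
    return tp / (tp + fn)


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k mutually exclusive folds of similar size."""
    return [list(range(start, n, k)) for start in range(k)]


def fit_threshold(xs, ys):
    """Toy 'model' (an assumption for illustration): the midpoint between
    the class means. Assumes both classes appear in the training data."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(threshold, xs):
    """Predict 1 when the feature exceeds the fitted threshold."""
    return [1 if x > threshold else 0 for x in xs]


def cross_validated_accuracy(xs, ys, k):
    """Train on all data minus one fold, test on that fold, average over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        threshold = fit_threshold(train_x, train_y)
        test_x = [xs[i] for i in fold]
        test_y = [ys[i] for i in fold]
        counts = confusion_counts(test_y, predict(threshold, test_x))
        scores.append(accuracy(*counts))
    return sum(scores) / len(scores)


# Hypothetical observed and predicted labels (1 = case, 0 = non-case).
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(observed, predicted)  # (3, 2, 1, 4)
print("accuracy:", accuracy(tp, fp, fn, tn))
print("precision (PPV):", precision(tp, fp))
print("recall (sensitivity):", recall(tp, fn))

# Hypothetical single feature and outcome for 4-fold cross-validation.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
print("cross-validated accuracy:", cross_validated_accuracy(feature, outcome, 4))
```

In practice an established library would normally be used for both the model and the fold splitting. The hand-written version is shown only to make the definitions concrete.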

Critiques of big data in public health

Fuller et al [4] provide some useful critiques of big data use for public and population health:

Automating research changes the nature of knowledge
Claims of objectivity are misleading
Bigger is not always better
Not all data are equivalent
Accessible does not equal ethical
Lack of access creates digital divides

Khoury and Ioannidis [12] argue that separating signal from noise in big data for public health benefit requires a stronger epidemiological foundation, improved knowledge translation and integration, and reinvigoration of the principles of evidence-based medicine.

Future directions

Selected machine learning applications in public health

Selected machine learning applications for public health [1]
Algorithm	Learning type	Example
K-means clustering	Unsupervised	Hot spot detection
Retrospective event detection	Unsupervised	Case ascertainment
Content analysis	Unsupervised	Public health surveillance
K-nearest neighbors clustering	Supervised	Spatiotemporal hot spot detection; clinical outcomes from genetic data; falls from wearable sensors
Naïve Bayes	Supervised	Acute gastrointestinal syndrome surveillance
Neural networks	Supervised	Identifying microcalcification clusters in digital mammograms; predicting mortality in head trauma patients; predicting influenza vaccination outcome
Support vector machines	Supervised	Diagnosis of diabetes mellitus; detection of depression through Twitter posts
Decision trees	Supervised	Identifying infants at high risk for serious bacterial infections; comparing cost-effectiveness of different influenza treatments; and physical activity from wearable sensors

Selected studies of social media analysis in public health

A number of studies have used social media analysis to predict public health outcomes or health behaviours. These include adverse drug reactions, depressive or suicidal behaviour or other mental health problems, area health statistics, asthma and the presence of ‘food deserts’.

The predictive value of social media data has recently been reviewed [13] - we have adapted the following table.

Table 1: Social media analytical applications for public health [13]; T = Twitter, O = Blogs, F = Facebook, I = Instagram, R = Reddit, TR = Tumblr
Article	Topic	Data	Data Size	Features	Task	Success Rate
Bian	Adverse drug reactions	T	239 users	N-gram, Semantic, Non-SM	Classification	Acc. 74%
Feldman	Adverse drug reactions	O	41K posts, 5.3K users	Semantic, Non-SM	Classification	F1 0.84 (statins), F1 0.78 (anti-depressants)
Nikfarjam	Adverse drug reactions	O	6.8K posts	Semantic	Classification	F1 0.68
Segura	Adverse drug reactions	O	400 posts	Semantic, Non-SM	Classification	F1 0.68
Yates	Adverse drug reactions	T, O	400K forum posts, 2.8B tweets	N-gram, Semantic, Non-SM	Classification	Prec. 0.59 (O), Prec. 0.48 (T)
Corley et al	Influenza	T, O	97.9M posts	Metadata, N-gram	Regression	r = 0.63
Lamb	Influenza	T	3.8B tweets	N-gram, Semantic	Regression	r = 0.80
Paul	Influenza	T	Not specified	N-gram, Semantic	Regression	25.3% improvement
Bodnar	Influenza	T	239M tweets	N-gram	Regression	r = 0.88
Zou	Intestinal disease	T	410M tweets	N-gram	Regression	r = 0.73 (Norovirus), 0.77 (Food poisoning)
Zhang	Asthma	T	5.5M tweets	N-gram	Classification	Acc. 66.3%
Chancellor	Mental health	TR	13K users, 68.3M posts	Metadata, Semantic	Regression	Concordance 0.658
De Choudhury	Mental health	T	40K tweets	Semantic, Social	Classification	Acc. 80%
De Choudhury	Mental health	T	2.1M tweets	Semantic, Social	Classification	Acc. 70%
De Choudhury	Mental health	F, T	40K tweets, 0.6M posts (F)	Metadata, Semantic, Social	Regression	2r = 0.48
De Choudhury	Mental health	R	63K posts, 35K users	Metadata, Semantic	Classification	Acc. 80%
Burnap	Mental health	T	2K tweets	N-gram, Semantic	Classification	F1 0.69
Shuai	Mental health	F, I	63K users (F), 2K users	Metadata, Social, Behavior	Classification	Acc. 78% (I), Acc. 83% (F)
Tsugawa	Mental health	T	209 users, 574K tweets	N-gram, Semantic, Social	Classification	Acc. 66%
Won	Mental health	O	153M posts	N-gram, Non-SM	Regression	Acc. 79%
Chancellor	Mental health	I	100K users	Semantic	Classification	F1 0.81
Lehrman	Mental health	R	200 posts	N-gram, Sentiment	Classification	Acc. 54.5%, baseline 30.5%
Culotta	Health stats	T	4.3M tweets	Metadata, Semantic, Non-SM	Regression	r = 0.63
De Choudhury	Food Deserts	I	14M posts	Semantic, Spatial, Non-SM	Classification	Acc. 80%

References

1 Mooney SJ, Pejaver V. Big Data in Public Health: Terminology, Machine Learning, and Privacy. Annual Review of Public Health 2018;1–18. PMID:29261408

2 Kohavi R, Provost F. Glossary of Terms. Machine Learning 1998;30:271–4. doi:10.1023/A:1017181826899

3 McGrail KM, Jones K, Akbari A et al. A Position Statement on Population Data Science: The science of data about people. International Journal of Population Data Science 2018;3:1–11. https://ijpds.org/article/view/415

4 Fuller D, Buote R, Stanley K. A glossary for big data in population and public health: Discussion and commentary on terminology and research methods. Journal of Epidemiology and Community Health 2017;71:1113–7. doi:10.1136/jech-2017-209608

5 Donoho D. 50 Years of Data Science. Journal of Computational and Graphical Statistics 2017;26:745–66. doi:10.1080/10618600.2017.1384734

6 National Institutes of Health. NIH Strategic Plan for Data Science. 2018;1–26.

7 Lazer D, Kennedy R, King G et al. The parable of Google Flu: traps in big data analysis. Science 2014;343:1203–5. doi:10.1126/science.1248506

8 Fayyad UM, Piatetsky-Shapiro G, Smyth P et al. Advances in Knowledge Discovery and Data Mining. 1996.

9 Preoţiuc-Pietro D, Volkova S, Lampos V et al. Studying user income through language, behaviour and affect in social media. PLoS ONE 2015;10. doi:10.1371/journal.pone.0138717

10 Zangenehpour S, Miranda-Moreno LF, Saunier N. Automated Classification in Traffic Video at Intersections with Heavy Pedestrian and Bicycle Traffic. 2014 TRB Annual Meeting Compendium of Papers 2014.

11 Saunier N, Sayed T, Ismail K. Large-Scale Automated Analysis of Vehicle Interactions and Collisions. Transportation Research Record: Journal of the Transportation Research Board 2010;2147:42–50. doi:10.3141/2147-06

12 Khoury MJ, Ioannidis JPA. Big data meets public health. Science 2014;346:1054–5. doi:10.1126/science.aaa2709

13 Phillips L, Dowling C, Shaffer K et al. Using Social Media to Predict the Future: A Systematic Literature Review. arXiv 2017;1–55. http://arxiv.org/abs/1706.06134

Data science glossary for public health

Draft 2

Julian Flowers

2018-04-22