The development and incorporation of data science and big data into public health practice require us to learn a new language. Fortunately, many of the new terms have public health equivalents.
As a first step we created an A-Z of data science in public health.
As a next step we have drawn on the literature to create the following tables. We welcome suggestions for additions and improvements.
The glossary below draws extensively from:
table 2 of Mooney and Pejaver [1], Kohavi and Provost [2], and Fuller et al [3].
A brief definition of big data: big data was originally characterised by the 3 Vs (volume, velocity and variety), but more recent definitions describe 5 Vs, adding veracity and value (see diagram). (Adapted from [3])
[Diagram: the 5 Vs of big data]
Source | Examples | ‘Bigness’ | Technical issues | Typical uses |
---|---|---|---|---|
-omic/biological | Whole exome profiling, metabolomics | Wide | Lab effects, informatics pipeline | Etiologic research, screening |
Geospatial | Neighborhood characteristics | Wide | Spatial autocorrelation | Etiologic research, surveillance |
Electronic health records | Records of all patients with hypertension | Tall, often also wide | Data cleaning, natural language | Clinical research, surveillance |
Personal monitoring | Daily GPS records, Fitbit readings | Tall | Redundancy, inference of intentions | Etiologic research, potentially clinical decision making |
Effluent data | Google search results | Tall | Selection biases, natural language | Surveillance, screening, identification of hidden social networks |
Data science term | Related public health term or concept |
---|---|
Accuracy | Proportion of results correctly classified |
Artificial intelligence | Artificial intelligence (AI) is used to describe machines that perform human-like activities such as learning, perception, problem solving and playing games. AI has been used to engage the public by improving the quality of eHealth interactions. For example, patients can use AI-based eHealth applications to receive personalised information. Chronology MD was developed for patients with Crohn’s disease; this programme allows patients to input their ‘observations of daily living’ and an AI system assists patients with management of their disease (eg, medication reminders, exercise and proper sleep motivation). This case highlights how AI applications can increase the immediacy of eHealth, the development of closeness and the feeling of an authentic, caring relationship. These applications help to provide a human-like element to eHealth exchanges between patients and AI systems |
Big data hubris | Big data hubris is the assumption that data with sufficient volume and velocity can compensate for or eliminate the need for high veracity data, high-quality study designs and more traditional forms of data analysis. Lazer and colleagues[4] provide a compelling example of big data hubris in the Google Flu Trends research. An important limitation of Google Flu Trends was that the underlying algorithms and methodology of Google search terms are proprietary and evolving (Google’s Hummingbird algorithm likely uses deep learning). Core scientific principles of replicability and transparency are difficult when dealing with proprietary data, whether it be from Google, Facebook or others |
Confusion matrix | A 2x2 table comparing predicted positive and negative results with observed results |
Coverage | The proportion of a data set for which a classifier makes a prediction. |
Cross-validation | A method for estimating the accuracy (or error) of a model by dividing the data into k mutually exclusive subsets (“folds”) of approximately equal size. The model is trained and tested k times: each time it is trained on the data set minus one fold and tested on that held-out fold. The accuracy estimate is the average accuracy across the k folds (see the worked sketch after this glossary). |
Data mining | Exploratory analysis. Data mining (sometimes referred to as knowledge discovery in databases) is the process of extracting new and at times useful information from data. Data mining and machine learning often use the same statistical techniques and it is difficult to differentiate the two in practice. Some would argue the primary focus of data mining is unsupervised learning (see unsupervised learning). Drug pathway discovery through analysis of published results is an example of data mining in health research. Perhaps data mining can be better conceived as data refining, where large volumes of data are sifted using statistical techniques to find potential associations of interest for researchers. |
Deep learning | Deep learning is a machine learning technique designed to process signals in a way loosely modelled on the human brain. Instead of using a single machine learning technique on a single type of data, deep learning uses multiple machine learning methods and layers of data to perform abstract learning tasks. To date, population health-related examples of deep learning are difficult to identify. An example of deep learning is an image-recognition-to-image-caption process: first, an image detection model (eg, a deep convolutional neural network) identifies the items in an image, then a language-generating model (eg, a recurrent neural network) uses those items to generate a caption for the image. These processes can also support the detection and creation of abstract and creative objects such as paintings and music. The Office for National Statistics (ONS) has been using deep learning to identify and count caravans in caravan parks in order to improve the census and reduce the requirement for enumerators |
Ensemble learning | A machine-learning approach involving training multiple models on data subsets and combining results from these models when predicting for unobserved inputs. Ensembles can be more accurate (see accuracy) than the individual models. |
Features | Variables. Measurements recorded for each observation (for example, participant age, sex, and body mass index are all features) |
Knowledge discovery | The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This is the definition used in “Advances in Knowledge Discovery and Data Mining,” 1996, by Fayyad, Piatetsky-Shapiro, and Smyth.[5] |
Label | Observed or computed value of an outcome or other variable of interest |
Labeling | The process of setting a label for a variable, as opposed to leaving the variable’s value unknown |
Learning algorithm | The set of steps used to train a model automatically from a data set (not to be confused with the model itself; e.g., there are many algorithms to train a neural network, each with different bounds on time, memory, and accuracy) |
Machine learning | The coining of the term machine learning is often credited to computer scientist Arthur Samuel, who developed a program that learned to play checkers well enough to defeat human players. More recently, Tom Mitchell has explained machine learning as follows: ‘a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E’. Modern machine learning involves a number of specific methods including neural networks, decision trees, nearest neighbour classifiers, support vector machines and Markov and hidden Markov models. These methods can be used for supervised or unsupervised learning. |
Natural language | Working with words as data, as in qualitative or mixed-methods research (generally, human readable but not readily machine readable) |
Noisy labels | Measurement error |
Out-of-sample | Applying a model fitted to one data set to make predictions in another |
Overfitting | Fitting a model to random noise or error instead of the actual relationship (due to having either a small number of observations or a large number of parameters relative to the number of observations) |
Pipeline | (From bioinformatics) The ordered set of tools applied to a data set to move it from its raw state to a final interpretable analytic result |
Precision | Positive predictive value |
Recall | Sensitivity |
Semi-supervised learning | An analytic technique used to fit predictive models to data where many observations are missing outcome data. |
Small-n, large-p | A wide but short data set: n = number of observations, p = number of variables for each observation |
Supervised learning | An analytic technique in which patterns in covariates that are correlated with observed outcomes are exploited to predict outcomes in a data set or sets in which the correlates were observed but the outcome was unobserved. For example, linear regression and logistic regression are both supervised learning techniques, as are neural networks, boosted trees (xgboost) and penalised regression models (lasso and ridge) |
Test data set | A subset of a more complete data set used to test empirical performance of an algorithm trained on a training data set |
Text analytics | (See natural language). Text analytics refers to the process of compiling and analysing text to derive meaningful information. Machines use algorithms to derive patterns and develop categories within text. Machine learning methods for text analytics can extract specific information, summarise and simplify, provide question answering (eg, Apple’s Siri) and analyse documents for sentiments and opinions. For example, Twitter data have been used to predict income and socioeconomic status. Preoţiuc-Pietro et al used Twitter data and supervised learning techniques (logistic regression with Elastic Net regularisation and support vector regression with a radial basis function kernel) applied to profile features, inferred psychological and demographic features, emotions and word clusters to predict income.[6] The correlation between predicted and observed income was 0.63, with a mean average error of £9,535 (a minimal text-classification sketch appears after this glossary) |
Training | Fitting a model |
Training data set | A subset of a more complete data set used to train a model whose empirical performance can be tested on a test data set |
Unsupervised learning | An analytic technique in which data is automatically explored to identify patterns, without reference to outcome information. Latent class analysis (when used without covariates) and k-means clustering are unsupervised learning techniques |
Video analytics | Video analytics (also referred to as video content analysis) uses machine learning to evaluate video footage and extract important details. Video analytics have been applied to closed-circuit television and video streaming services, such as YouTube, for object detection and tracking, behavioural analysis, and detection of ‘interesting events’. For example, Zangenehpour et al [7] used 90 hours of video at 23 intersections in Montreal to examine the safety of cyclist–driver interactions at intersections with cycle tracks. The authors used TrafficIntelligence, developed by Dr. Nicolas Saunier, to detect and classify road users, select and predict trajectories, and calculate post-encroachment time (a measure of safety).[8] |
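Several glossary terms above (features, labels, training and test data sets, confusion matrix, accuracy, precision, recall and cross-validation) fit together in a single workflow. The sketch below is a minimal illustration in Python, assuming scikit-learn and a simulated data set; the choice of logistic regression as the supervised learner and of k = 5 folds are assumptions made purely for illustration, not a prescribed method.

```python
# Minimal sketch of k-fold cross-validation for a supervised classifier,
# tying together several glossary terms. The data are simulated; a real
# analysis would substitute its own features and outcome labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

# Simulated "tall" data set: 1,000 observations, 10 features, binary label
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracies, precisions, recalls = [], [], []

for train_idx, test_idx in kfold.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])      # training on k-1 folds
    y_pred = model.predict(X[test_idx])        # testing on the held-out fold

    # 2x2 confusion matrix: rows = observed, columns = predicted
    tn, fp, fn, tp = confusion_matrix(y[test_idx], y_pred).ravel()
    accuracies.append((tp + tn) / (tp + tn + fp + fn))  # proportion correctly classified
    precisions.append(tp / (tp + fp))                   # precision = positive predictive value
    recalls.append(tp / (tp + fn))                      # recall = sensitivity

# The cross-validated estimate is the average over the k folds
print(f"Accuracy:  {np.mean(accuracies):.3f}")
print(f"Precision: {np.mean(precisions):.3f}")
print(f"Recall:    {np.mean(recalls):.3f}")
```

In public health terms, the averaged recall is a cross-validated estimate of sensitivity and the averaged precision a cross-validated estimate of positive predictive value.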
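In the same spirit, the text analytics entry can be illustrated with a short, hypothetical text-classification sketch: a bag-of-words representation feeding a naive Bayes classifier (see the algorithm table further below). The example documents, labels and flu-surveillance framing are invented for illustration and do not come from the cited studies.

```python
# Minimal, illustrative sketch of text analytics with supervised learning:
# classifying short free-text symptom reports as flu-like or not. All
# documents and labels below are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

documents = [
    "fever cough and aching joints for three days",
    "sore throat high temperature and chills",
    "sprained ankle playing football",
    "routine blood pressure check no symptoms",
]
labels = [1, 1, 0, 0]  # 1 = flu-like illness, 0 = other

# Convert words to count features, then fit a naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(documents, labels)

# New report; likely classified as flu-like given the overlapping vocabulary
print(model.predict(["headache fever and persistent cough"]))
```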
Fuller et al [3] provide some useful critiques of the use of big data for public and population health.
Khoury and Ioannidis [9] argue that separating signal from noise in big data for public health benefit requires a stronger epidemiological foundation, improved knowledge translation and integration, and a reinvigoration of the principles of evidence-based medicine.
Algorithm | Learning type | Example |
---|---|---|
K-means clustering | Unsupervised | Hot spot detection |
Retrospective event detection | Unsupervised | Case ascertainment |
Content analysis | Unsupervised | Public health surveillance |
K-nearest neighbors | Supervised | Spatiotemporal hot spot detection; clinical outcomes from genetic data; falls from wearable sensors |
Naïve Bayes | Supervised | Acute gastrointestinal syndrome surveillance |
Neural networks | Supervised | Identifying microcalcification clusters in digital mammograms; predicting mortality in head trauma patients; predicting influenza vaccination outcome |
Support vector machines | Supervised | Diagnosis of diabetes mellitus; detection of depression through Twitter posts |
Decision trees | Supervised | Identifying infants at high risk for serious bacterial infections; comparing cost-effectiveness of different influenza treatments; physical activity from wearable sensors |
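As a companion to the table above, the sketch below illustrates its first row: unsupervised k-means clustering used for hot spot detection. The case coordinates are simulated, and the choice of k = 3 and of scikit-learn are assumptions for illustration only; real surveillance work would need a considered choice of k and of the spatial representation.

```python
# Minimal sketch of unsupervised learning: k-means clustering applied to
# simulated case coordinates to suggest possible "hot spots".
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Simulate case locations scattered around three notional outbreak centres
centres = np.array([[53.48, -2.24], [53.80, -1.55], [52.49, -1.90]])
cases = np.vstack([c + rng.normal(scale=0.02, size=(50, 2)) for c in centres])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(cases)

# Cluster centres approximate the simulated hot spots;
# labels_ assigns each case to its nearest cluster
print(kmeans.cluster_centers_)
print(np.bincount(kmeans.labels_))  # number of cases per cluster
```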
1 Mooney SJ, Pejaver V. Big Data in Public Health: Terminology, Machine Learning, and Privacy. Annual Review of Public Health 2018;39:95–112. PMID: 29261408
2 Kohavi R, Provost F. Glossary of Terms. Machine Learning 1998;30:271–4. doi:10.1023/A:1017181826899
3 Fuller D, Buote R, Stanley K. A glossary for big data in population and public health: Discussion and commentary on terminology and research methods. Journal of Epidemiology and Community Health 2017;71:1113–7. doi:10.1136/jech-2017-209608
4 Lazer D, Kennedy R, King G et al. The parable of Google Flu: traps in big data analysis. Science 2014;343:1203–5. doi:10.1126/science.1248506
5 Fayyad UM, Piatetsky-Shapiro G, Smyth P et al. Advances in Knowledge Discovery and Data Mining. 1996.
6 Preoţiuc-Pietro D, Volkova S, Lampos V et al. Studying user income through language, behaviour and affect in social media. PLoS ONE 2015;10:e0138717. doi:10.1371/journal.pone.0138717
7 Zangenehpour S, Miranda-Moreno LF, Saunier N. Automated Classification in Traffic Video at Intersections with Heavy Pedestrian and Bicycle Traffic. 2014 TRB Annual Meeting Compendium of Papers 2014.
8 Saunier N, Sayed T, Ismail K. Large-Scale Automated Analysis of Vehicle Interactions and Collisions. Transportation Research Record: Journal of the Transportation Research Board 2010;2147:42–50. doi:10.3141/2147-06
9 Khoury MJ, Ioannidis JPA. Big data meets public health. Science 2014;346:1054–5. doi:10.1126/science.aaa2709
10 Phillips L, Dowling C, Shaffer K et al. Using Social Media to Predict the Future: A Systematic Literature Review. arXiv 2017;1–55. http://arxiv.org/abs/1706.06134