Motivation

The development and incorporation of data science and big data into public health practice require us to learn a new language. Fortunately, many of the new terms have public health equivalents.

As a first step we created an A-Z of data science in public health.

As a next step we have drawn on the literature to create the following tables. The glossary below draws extensively from:

Mooney et al table 2 (Mooney and Pejaver 2018), Kohavi et al (Kohavi and Provost 1998), McGrail et al (Mcgrail et al. 2018) and Fuller et al (Fuller, Buote, and Stanley 2017).

Definitions

Towards defining public health data science

Donoho provides a useful discussion about the origins and development of data science (Donoho 2017). In brief, he defines “greater data science” as:

The science of data

and sees 6 core areas for data science activity:

Data Exploration and Preparation
Data Representation and Transformation
Computing with Data
Data Modeling
Data Visualization and Presentation
Science about Data Science

The NIH (Data and Revolution 2018) defines data science as:

“the interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/ or complex sets of data”

McGrail et al (Mcgrail et al. 2018) provide a definition of data science in population health as:

“The science of data about people”

We are working towards a definition of public health data science. Two suggestions are:

“The application of data science to improve and protect health and reduce inequalities”

or:

“The science of data to improve and protect health and reduce inequalities”

McGrail et al compare the focus of current conceptual data science frameworks in health; this is summarised in the table.

Adapted from (Mcgrail et al. 2018) table 1

Area	Focus	Multisource data	Primary aim of research	Focus on technical/ policy infrastructure
Data science	Data esp. “big data”	Not always	Data for actionable information	No
Population data science	People, systems, population insights	Linkage and multiple sources	Public value	Key focus - legal, ethical, privacy, data collection
Informatics	Providers/ ICT systems	Not necessarily	Implementation	Database/ technical development
Public health data science	Public and population health, healthcare, health systems	Often, linkage important	Improving population health, reducing health inequality	Focus as per population health data science?

Big data: Official definition from the [National Institute of Standards and Technology. NIST Big Data Interoperability Framework: Volume 1, Definitions (NIST Special Publication 1500-1). (2015)](http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-1.pdf). Originally big data was described in terms of 3 Vs, but more recent definitions talk of 5 Vs (see diagram). (Adapted from (Fuller, Buote, and Stanley 2017))

Term	Definition
Big data	Big data consists of extensive datasets, primarily in the characteristics of volume, variety, velocity, and/or variability, that require a scalable architecture for efficient storage, manipulation, and analysis.
Big data engineering	Big data engineering includes advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis.
Big data paradigm	The big data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets.
Computational portability	Computational portability is the movement of the computation to the location of the data.
Data governance	Data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.
Data lifecycle	The data lifecycle is the set of processes that transforms raw data into actionable knowledge, which includes data collection, preparation, analytics, visualization, and access.
Data science	Data science is the empirical synthesis of actionable knowledge from raw data through the complete data life cycle process.
Data science paradigm	The data science paradigm is extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.
Data scientist	A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data life cycle.
Distributed computing	Distributed computing is a computing system in which components located on networked computers communicate and coordinate their actions by passing messages.
Distributed file systems	Distributed file systems contain multi-structured (object) datasets that are distributed across the computing nodes of the server cluster(s).
Federated database system	A federated database system is a type of meta-database management system, which transparently maps multiple autonomous database systems into a single federated database.
Horizontal scaling	Horizontal scaling implies the coordination of individual resources (e.g., server) that are integrated to act in parallel as a single system (i.e., operate as a cluster).
Latency	Latency refers to the delay in processing or in availability.
Massively parallel processing	Massively parallel processing refers to a multitude of individual processors working in parallel to execute a particular program.
Non-relational models	Non-relational models, frequently referred to as NoSQL, refer to logical data models that do not follow relational algebra for the storage and manipulation of data.
Resource negotiation	Resource negotiation consists of built-in data management capabilities that provide the necessary support functions, such as operations management, workflow integration, security, governance, support for additional processing models, and controls for multi-tenant environments, providing higher availability and lower latency applications.
Schema-on-read	Schema-on-read is the application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database.
Shared-disk file systems	Shared-disk file systems, such as storage area networks (SANs) and network attached storage (NAS), use a single storage pool, which is accessed from multiple computing resources.
Validity	Validity refers to appropriateness of the data for its intended use.
Value	Value refers to the inherent wealth, economic and social, embedded in any dataset.
Variability	Variability refers to the change in other data characteristics.
Variety	Variety refers to data from multiple repositories, domains, or types.
Velocity	Velocity refers to the rate of data flow.
Veracity	Veracity refers to the accuracy of the data.
Vertical scaling	Vertical scaling implies increasing the system parameters of processing speed, storage, and memory for greater performance.
Volatility	Volatility refers to the tendency for data structures to change over time.
Volume	Volume refers to the size of the dataset.

Types of big data for public health

Data science terms for public health

Accuracy: Proportion of results correctly classified
Artificial intelligence: Artificial intelligence (AI) is used to describe machines that perform human-like activities such as learning, perception, problem solving and playing games. AI has been used to engage the public by improving the quality of eHealth interactions. For example, patients can use AI-based eHealth applications to receive personalised information. Chronology MD was developed for patients with Crohn’s disease; this programme allows patients to input their ‘observations of daily living’ and an AI system assists patients with management of their disease (eg, medication reminders, exercise and proper sleep motivation). This case highlights how AI applications can increase the immediacy of eHealth, the development of closeness and the feeling of an authentic, caring relationship. These applications help to provide a human-like element to eHealth exchanges between patients and AI systems.
Big data hubris: Big data hubris is the assumption that data with sufficient volume and velocity can compensate for or eliminate the need for high veracity data, high-quality study designs and more traditional forms of data analysis. Lazer and colleagues(Lazer et al. 2014) provide a compelling example of big data hubris in the Google Flu Trends research. An important limitation of Google Flu Trends was that the underlying algorithms and methodology of Google search terms are proprietary and evolving (Google’s Hummingbird algorithm likely uses deep learning). Core scientific principles of replicability and transparency are difficult when dealing with proprietary data, whether it be from Google, Facebook or others
Blockchain: A blockchain, originally block chain, is a growing list of records, called blocks, that are linked using cryptography. Each block contains a cryptographic hash of the previous block, a timestamp, and transaction data (generally represented as a Merkle tree). By design, a blockchain is resistant to modification of the data. It is “an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way”. For use as a distributed ledger, a blockchain is typically managed by a peer-to-peer network collectively adhering to a protocol for inter-node communication and validating new blocks. Once recorded, the data in any given block cannot be altered retroactively without alteration of all subsequent blocks, which requires consensus of the network majority. Although blockchain records are not unalterable, blockchains may be considered secure by design and exemplify a distributed computing system with high Byzantine fault tolerance. Decentralized consensus has therefore been claimed with a blockchain.
CRISP-DM: CRoss-Industry Standard Process for Data Mining. An industry standard for analytical process.(???)
Confusion matrix: A 2x2 table comparing predicted positive and negative results with observed results
Coverage: The proportion of a data set for which a classifier makes a prediction.
Cross-validation: A method for estimating the accuracy (or error) of an outcome by dividing the data into k mutually exclusive subsets (“folds”) of approximately equal size. The model is trained and tested on the outcome k times. Each time it is trained on the data set minus a fold and tested on that fold. The accuracy estimate is the average accuracy for the k folds.
Data mining: Exploratory analysis. Data mining (sometimes referred to as knowledge discovery in databases) is the process of extracting new and at times useful information from data. Data mining and machine learning often use the same statistical techniques and it is difficult to differentiate the two in practice. Some would argue the primary focus of data mining is unsupervised learning (see unsupervised learning). Drug pathway discovery through analysis of published results is an example of data mining in health research. Perhaps data mining can be better conceived as data refining, where large volumes of data are sifted using statistical techniques to find potential associations of interest for researchers.
Deep learning: Deep learning is a machine learning technique designed to process signals like a human brain. Instead of using a single machine learning technique on a single type of data, deep learning uses multiple machine learning methods and layers of data to perform abstract learning tasks. To date, population health-related examples of deep learning are difficult to identify. An example of deep learning is an image recognition to image caption process. First, an image detection machine (eg, Vision Deep Convolutional Neural Network) identifies the items in an image, then based on those items a language generating machine (eg, recurrent neural nets) uses the data to generate a caption about the image. These processes can allow for detection and creation of abstract and creative objects such as paintings and music. ONS have been using deep learning to identify and count caravans in caravan parks in order to improve the census and reduce the requirement for enumerators.
Digital epidemiology (from (Salathé 2018)): The goal of epidemiology, very broadly speaking, is to understand the patterns of disease and health dynamics in populations as well as the causes of these patterns, and to use this understanding to mitigate and prevent disease, and to promote health. The goal of digital epidemiology is exactly the same. So what differentiates (non-digital) epidemiology from digital epidemiology? The broadest definition one can give for digital epidemiology is the following: Digital epidemiology is epidemiology that uses digital data. I expect that this broad and straightforward definition will appeal to many, as it includes any modern approach to epidemiology based on digital sources. I would, however, like to offer an additional and much more narrow definition for digital epidemiology that I personally find more appealing and more thought-provoking, namely the following: Digital epidemiology is epidemiology that uses data that was generated outside the public health system, i.e. with data that was not generated with the primary purpose of doing epidemiology.

Digital epidemiology example - using Google trends

One tool that has been used for digital epidemiology is Google Trends. Some research, e.g. Google Flu Trends, suggests that certain search terms have predictive value for the occurrence of diseases such as flu and norovirus. Flu is the best researched topic, but recent research has found a strong correlation between the frequency of some cancers and Google searches for cancer (Lippi and Cervellin 2019). There is increasing evidence that Google Trends may have some value in NCD surveillance. (R users can access Google Trends with the gtrendsR package, which allows evaluation of up to 5 topics.)

Fig1. Google trends for digital epidemiology and related terms

A related tool provided by Google is Google Correlate. [This example shows that weekly admissions for respiratory disease are highly correlated with searches for “cough”, “linctus” i.e. respiratory symptoms and treatments]. The tool correlates user-provided time series data (weekly or monthly) with comparable Google search data. It can be used to identify terms which predict the data provided, and the relationship can be lagged by a week so that a predictive model can be built. For example, in the NHS increased searches for pneumonia precede hospitalisations for pneumonia, and this has been used to provide early warning to trusts.¹

The use of these tools is also associated with infodemiology and infoveillance (Mavragani and Ochoa 2019).
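The idea of lagging a search series against a health series can be sketched in a few lines of Python. This is only an illustration of lagged correlation, not how Google Correlate or Google Trends work internally. The function names and the weekly figures are hypothetical, and the data are synthetic, constructed so that admissions follow searches by exactly one week.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(searches, admissions, lag):
    """Correlate searches in week t with admissions in week t + lag."""
    if lag < 0 or lag > len(searches) - 2:
        raise ValueError("lag must leave at least two overlapping weeks")
    if lag == 0:
        return pearson(searches, admissions)
    return pearson(searches[:-lag], admissions[lag:])


# Synthetic weekly data (hypothetical, for illustration only):
# admissions are half of the previous week's search volume.
searches = [10, 12, 15, 30, 45, 40, 25, 15, 12, 10]
admissions = [4.0] + [s / 2 for s in searches[:-1]]

for lag in range(3):
    r = lagged_correlation(searches, admissions, lag)
    print(f"lag {lag} week(s): r = {r:.3f}")
```

In real surveillance data the best lag is not known in advance, so it would normally be chosen on one period and checked on another to avoid overfitting.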

Ensemble learning: A machine-learning approach involving training multiple models on data subsets and combining results from these models when predicting for unobserved inputs. Ensembles can be more accurate (see accuracy) than individual models.
F_1 score: A measure of accuracy in machine learning for binary classification, defined as the harmonic mean of precision and recall. \(F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}\)
Features: Variables. Measurements recorded for each observation (for example, participant age, sex, and body mass index are all features)
Knowledge discovery: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This is the definition used in “Advances in Knowledge Discovery and Data Mining,” 1996, by Fayyad, Piatetsky-Shapiro, and Smyth.(Fayyad et al. 1996)
Label: Observed or computed value of an outcome or other variable of interest
Labeling: The process of setting a label for a variable, as opposed to leaving the variable’s value unknown
Learning algorithm: The set of steps used to train a model automatically from a data set (not to be confused with the model itself; e.g., there are many algorithms to train a neural network, each with different bounds on time, memory, and accuracy)
Machine learning: The coining of the term machine learning is often credited to computer scientist Arthur Samuel who developed a machine that could defeat humans in the game of checkers. More recently, Tom Mitchell has explained machine learning as ‘a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E’. Modern machine learning involves a number of specific methods including neural networks, decision trees, nearest neighbour classifiers, support vector machines and Markov and hidden Markov models. These methods can be used for supervised or unsupervised learning.
Natural language: Working with words as data, as in qualitative or mixed-methods research (generally, human readable but not readily machine readable)
Noisy labels: Measurement error
Out-of-sample: Applying a model fitted to one data set to make predictions in another
Overfitting: Fitting a model to random noise or error instead of the actual relationship (due to having either a small number of observations or a large number of parameters relative to the number of observations)
Pipeline: (From bioinformatics) The ordered set of tools applied to a data set to move it from its raw state to a final interpretable analytic result
Precision: Positive predictive value
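The terms accuracy, confusion matrix, precision, recall (sensitivity) and F_1 score all derive from the same four counts. The sketch below shows how, assuming binary labels coded as 1 (positive) and 0 (negative). The function names and the ten example observations are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def confusion_counts(observed, predicted):
    """Return (tp, fp, fn, tn), the four cells of a 2x2 confusion matrix."""
    tp = fp = fn = tn = 0
    for obs, pred in zip(observed, predicted):
        if pred and obs:
            tp += 1  # predicted positive, observed positive
        elif pred and not obs:
            fp += 1  # predicted positive, observed negative
        elif obs:
            fn += 1  # predicted negative, observed positive
        else:
            tn += 1  # predicted negative, observed negative
    return tp, fp, fn, tn


def classification_metrics(observed, predicted):
    """Accuracy, precision (PPV), recall (sensitivity) and F1 score."""
    tp, fp, fn, tn = confusion_counts(observed, predicted)
    total = tp + fp + fn + tn
    nan = float("nan")  # returned when a metric's denominator is zero
    accuracy = (tp + tn) / total if total else nan
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan
    if tp:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = nan
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical example: 4 observed positives, 6 observed negatives.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(confusion_counts(observed, predicted))
print(classification_metrics(observed, predicted))
```

Here three of the four observed positives are found (recall 0.75) and three of the four positive predictions are correct (precision 0.75), while accuracy is 0.8 because the five true negatives also count as correct.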

Predictive analytics

Prescriptive analytics

Recall: Sensitivity
Semi-supervised learning: An analytical technique used to fit predictive models to data where many observations are missing outcome data.
Small-n, large-p: A wide but short data set: n = number of observations, p = number of variables for each observation
Supervised learning: An analytical technique in which patterns in covariates that are correlated with observed outcomes are exploited to predict outcomes in a data set or sets in which the correlates were observed but the outcome was unobserved. For example, linear regression and logistic regression are both supervised learning techniques, as are neural networks, boosted trees (xgboost) and penalised regression models (lasso and ridge)
Test data set: A subset of a more complete data set used to test empirical performance of an algorithm trained on a training data set
Text analytics: (See natural language). Text analytics refers to the process of compiling and analysing text to derive meaningful information. Machines use algorithms to derive patterns and develop categories within text. Machine learning methods for text analytics can extract specific information, summarise and simplify, provide question and answers (eg, Apple’s Siri) and analyse documents for sentiments and opinions. For example, Twitter data has been used to predict income and socioeconomic status. Preoţiuc-Pietro et al used Twitter data and supervised learning techniques (logistic regression with Elastic Net regularisation and Support Vector regression with a Radial Basis Function kernel), applied to profile features, inferred psychological and demographic features, emotions and word clusters, to predict income.(Preoţiuc-Pietro et al. 2015) The correlation between predicted income and income data was 0.63, with a mean absolute error of £9535.
Training: Fitting a model
Training data set: A subset of a more complete data set used to train a model whose empirical performance can be tested on a test data set
Unsupervised learning: An analytic technique in which data is automatically explored to identify patterns, without reference to outcome information. Latent class analysis (when used without covariates) and k-means clustering are unsupervised learning techniques
Video analytics: Video analytics (also referred to as video content analysis) uses machine learning to evaluate video footage to extract important details. Video analytics have been applied to closed-circuit television and video streaming services, such as YouTube, for object detection and tracking, behavioural analysis, and detection of ‘interesting events’. For example, Zangenehpour et al (Zangenehpour, Miranda-Moreno, and Saunier 2014) used 90 hours of video at 23 intersections in Montreal to examine the safety of cyclist–driver interactions at intersections with cycle tracks. The authors used TrafficIntelligence, developed by Dr. Nicolas Saunier, to detect and classify road users, select and predict trajectories, and calculate post encroachment time (a measure of safety).(Saunier, Sayed, and Ismail 2010)
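Several of the terms above (features, labels, learning algorithm, training data set, test data set, cross-validation) fit together in one procedure. The following is a minimal sketch of k-fold cross-validation, assuming a single numeric feature and a deliberately simple nearest-class-mean classifier. All function names and the synthetic data are hypothetical; a real analysis would normally use an established library.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k mutually exclusive folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_nearest_mean(xs, ys):
    """Learning algorithm: store the mean feature value of each label."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict(means, x):
    """Predict the label whose class mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


def cross_validated_accuracy(xs, ys, k=5, seed=0):
    """Average accuracy over k folds, each fold held out once for testing."""
    accuracies = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train_nearest_mean(train_x, train_y)  # train without the fold
        correct = sum(predict(model, xs[i]) == ys[i] for i in fold)
        accuracies.append(correct / len(fold))  # test on the held-out fold
    return sum(accuracies) / k


# Synthetic, well-separated data (hypothetical): one feature, two labels.
xs = list(range(1, 11)) + list(range(21, 31))
ys = [0] * 10 + [1] * 10

print(cross_validated_accuracy(xs, ys, k=5))
```

Because every observation is tested by a model that never saw it during training, the averaged accuracy is an out-of-sample estimate and is less prone to rewarding overfitting than accuracy measured on the training data.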

Precision, prediction and prevention

Precision public health is a relatively new term, first coined in 2013 to parallel precision medicine. It has been defined variously as:

“the application and combination of new and existing technologies, which more precisely describe and analyse individuals and their environment over the life course, to tailor preventive interventions for at-risk groups and improve the overall health of the population.” (Weeramanthri et al. 2018)
“improving the ability to prevent disease, promote health, and reduce health disparities in populations by applying emerging methods and technologies for measuring disease, pathogens, exposures, behaviors, and susceptibility in populations; and developing policies and targeted implementation programs to improve health”(Khoury, Iademarco, and Riley 2016)
“…requires robust primary surveillance data, rapid application of sophisticated analytics to track the geographical distribution of disease, and the capacity to act on such information” (Dowell, Blazes, and Desmond-Hellmann 2016)
“Precision public health is characterized by discovering, validating, and optimizing care strategies for well-characterized population strata”(Arnett and Claas 2016)
“Right intervention for the right population at the right time”(Bayer and Galea 2015)

A more recent definition suggests:

“Precision public health investigates how multiple dimensions of social position interact to confer health risk differently for precisely defined population subgroups according to the social contexts in which they are embedded, while considering relevant biological and behavioural factors. It leverages this information to uncover the precise social structures and processes that pattern health outcomes, and to identify actionable interventions within the social contexts of affected groups”(Olstad and McIntyre 2019)

This links much more with social determinants of health and tackling health inequalities, taking account of social structure and context.

Olstad(Olstad and McIntyre 2019) suggests greater precision in public health can be achieved by:

Precision should be sought in the areas that are the most theoretically meaningful within the context of each individual study, while acknowledging that a minimum of two should be implemented in tandem to constitute an instance of precision public health.

1. Provide explicit and precise descriptions of the theoretical rationale underlying the selection and operationalisation of social positions, social contexts, health outcomes and potential confounders. The proposed causal pathways should be precisely identified a priori.
2. Identify the precise social positions of populations of interest and investigate their associations with health by expanding beyond common master categories to examine other dimensions of social position, and the heterogeneity that exists within social categories. Measures of perceived social position should be explored more fully.
3. Operationalise social position in more precise ways, such as by using continuous measures or more categories, considering qualitative and quantitative features, and considering factors at multiple levels.
4. Describe the precise time and context of measurement of social position and study the health effects of social position in a variety of contexts and at multiple time points across the life course.
5. Use precise language to describe health inequities (eg, inequities in cardiovascular disease according to wealth and gender/sex).
6. Use knowledge of the health effects of individuals’ precise social positions to inform the study of precise contextual mechanisms responsible for situating them there. Leverage this information to propose precise interventions to ameliorate health inequities.

and this requires a shift in our thinking

Move away from…	Move towards…
Biomedical model of health	Social determinants model of health
The functions (eg, surveillance) and methods (eg, big data) of public health	The foundations (eg, social determinants) and core aims (eg, improve population health, reduce health inequities) of public health
Problematising individuals and their behaviours	Problematising the social contexts that create social stratification
Scaled-up versions of individual level interventions	Interventions that address the root causes of health inequities
Precision medicine for the population	Precision public health

Biro et al have reviewed the literature on related terms and suggest 3 potential categories:(Bíró et al. 2018)

Individualised prevention
Personalised prevention
Precision prevention

Their definitions are:

Individualized prevention is a form of prevention in public health, in which health professionals consider the characteristics, lifestyle, family history, anamnesis, risk status and medication of the client when making proposals to maintain or improve the individual’s quality of life.

Personalized prevention is a form of prevention in public health, which includes the activities of individualized prevention and in which health professionals also consider biological information and biomarkers at the level of molecular disease pathways, genetics, transcriptomics, proteomics and metabolomics of the client when making proposals to maintain or improve the individual’s quality of life.

Precision prevention is a form of prevention in public health, which includes the activities of personalized prevention and in which health professionals also consider the socioeconomic status or the opportunities offered by psychological and behavioral data of the client when making proposals to maintain or improve the individual’s quality of life.

Examples

Type	Primary	Secondary	Tertiary
Population-based/ universal	Mass media campaigns/ Smoking bans	Type 2 DM screening	Population wide retinopathy screening
Stratified prevention	Targeted mass media campaigns	Type 2 DM screening in high risk groups	Retinopathy screening in type 2 DM
Individualised prevention	By considering the client’s characteristics, lifestyle and medical record the GP helps the patient to create a healthy lifestyle.	Depending on the client’s characteristics, lifestyle and medical record the GP sends the patient to screen for type 2 diabetes.	The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle and medical record.
Personalized prevention	By considering the client’s characteristics, lifestyle and medical record the GP helps the patient to create a healthy lifestyle.	Depending on the client’s characteristics, lifestyle and medical record the GP sends the patient to screen for type 2 diabetes.	The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle, medical record and genome.
Precision prevention	By considering the client’s characteristics, lifestyle and medical record, genome, psychological profile and socio-economic status the GP helps the patient to create a healthy lifestyle.	Depending on the client’s characteristics, lifestyle and medical record, genome, psychological profile and socio-economic status the GP encourages the patient to go for screening for type 2 diabetes.	The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle and medical record, genome, psychological profile and socio-economic status.

From these suggested definitions we can see that individualised or personalised prevention are services delivered in primary care, aimed at lifestyle modification, early detection and disease management, using ever more personalised information to guide decision making (including patient or personal preference). These definitions are akin to the emerging idea of “lifestyle medicine” which has been defined as:

“Lifestyle medicine is a branch of evidence-based medicine in which comprehensive lifestyle changes (including nutrition, physical activity, stress management, social support and environmental exposures) are used to prevent, treat and reverse the progression of chronic diseases by addressing their underlying causes. Lifestyle medicine interventions include health risk assessment screening, health behavior change counseling and clinical application of lifestyle modifications. Lifestyle medicine is often prescribed in conjunction with pharmacotherapy and other forms of therapy.”1

\[ Precision\ public\ health \neq \sum precision\ medicine\]

Public health 3.0 (O’Carroll et al. 2017)

CDC have recently written about Public Health 3.0 which, from a data perspective, promotes the idea of greater granularity and precision.

Predictive prevention and PHE definition of precision public health

“Predictive prevention does not replace existing public health interventions at population or community level – but it does build on existing data-driven targeting techniques and channels to add another dimension of deeper customer engagement. By continuing to combine behavioural science and digital innovations, we can actively encourage people to make healthier choices and take greater responsibility for their wellbeing.”

“A second more expansive version of precision public health is about the use of data and analytical techniques to design and implement interventions that benefit whole populations. This version emphasises the use of sophisticated surveillance and modelling and has been promoted by the Bill and Melinda Gates Foundation. It is certainly appealing and is typified for example by the recent Global Burden of Disease study and much of PHE’s work on infection control and health improvement.”

Critiques of big data in public health

Fuller et al (Fuller, Buote, and Stanley 2017) provide some useful critiques of big data use for public and population health:

Automating research changes the nature of knowledge
Claims of objectivity are misleading
Bigger is not always better
Not all data are equivalent
Accessible does not equal ethical
Lack of access creates digital divides

Khoury and Ioannidis (Khoury and Ioannidis 2014) argue that separating signal from noise in big data for public health benefit requires a stronger epidemiological foundation, improved knowledge translation and integration, and reinvigoration of the principles of evidence-based medicine.

Future directions

Selected machine learning applications in public health

References

Arnett, Donna K., and Steven A. Claas. 2016. “Precision medicine, genomics, and public health.” Diabetes Care. https://doi.org/10.2337/dc16-1763.

Bayer, Ronald, and Sandro Galea. 2015. “Public Health in the Precision-Medicine Era.” New England Journal of Medicine. https://doi.org/10.1056/NEJMp1506241.

Bíró, K, V Dombrádi, A Jani, K Boruzs, and M Gray. 2018. “Creating a common language: defining individualized, personalized and precision prevention in public health.” Journal of Public Health (Oxford, England), 1–8. https://doi.org/10.1093/pubmed/fdy066.

Data, Big, and Resolution Revolution. 2018. “Nih strategic plan for data science” 2015: 1–26.

Donoho, David. 2017. “50 Years of Data Science.” https://doi.org/10.1080/10618600.2017.1384734.

Dowell, Scott F., David Blazes, and Susan Desmond-Hellmann. 2016. “Four steps to precision public health.” https://doi.org/10.1038/540189a.

Fayyad, Usama M, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. 1996. Advances in Knowledge Discovery and Data Mining.

Fuller, Daniel, Richard Buote, and Kevin Stanley. 2017. “A glossary for big data in population and public health: Discussion and commentary on terminology and research methods.” Journal of Epidemiology and Community Health 71 (11). https://doi.org/10.1136/jech-2017-209608.

Khoury, Muin J., Michael F. Iademarco, and William T. Riley. 2016. “Precision Public Health for the Era of Precision Medicine.” https://doi.org/10.1016/j.amepre.2015.08.031.

Khoury, Muin J, and John P A Ioannidis. 2014. “Medicine. Big data meets public health.” Science 346 (6213): 1054–5. https://doi.org/10.1126/science.aaa2709.

Kohavi, Ron, and Foster Provost. 1998. “Glossary of Terms.” Machine Learning. 30 (2-3): 271–74. https://doi.org/10.1023/A:1017181826899.

Lazer, D, R Kennedy, G King, and A Vespignani. 2014. “The parable of Google Flu: traps in big data analysis.” Science 343: 1203–5. https://doi.org/10.1126/science.1248506.

Lippi, Giuseppe, and Gianfranco Cervellin. 2019. “Is digital epidemiology reliable?—insight from updated cancer statistics.” Annals of Translational Medicine. https://doi.org/10.21037/atm.2018.11.55.

Mavragani, Amaryllis, and Gabriela Ochoa. 2019. “Google trends in infodemiology and infoveillance: Methodology framework.” Journal of Medical Internet Research. https://doi.org/10.2196/13439.

Mcgrail, Kimberlyn M, Kerina Jones, Ashley Akbari, Tellen D Bennett, Andy Boyd, Fabrizio Carinci, Xinjie Cui, et al. 2018. “International Journal of People.” International Journal of Population Data Science 3 (February): 1–11. https://ijpds.org/article/view/415.

Mooney, Stephen J., and Vikas Pejaver. 2018. “Big Data in Public Health: Terminology, Machine Learning, and Privacy.” Annual Review of Public Health. https://doi.org/10.1146/annurev-publhealth-040617-014208.

O’Carroll, Patrick W., Karen B. DeSalvo, Denise Koo, John Auerbach, and Judith A. Monroe. 2017. “Public health 3.0: Time for an upgrade.” In Solving Population Health Problems Through Collaboration. https://doi.org/10.4324/9781315212708.

Olstad, Dana Lee, and Lynn McIntyre. 2019. “Reconceptualising precision public health.” BMJ Open 9 (9): e030279. https://doi.org/10.1136/bmjopen-2019-030279.

Preoţiuc-Pietro, Daniel, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach, and Nikolaos Aletras. 2015. “Studying user income through language, behaviour and affect in social media.” PLoS ONE 10 (9). https://doi.org/10.1371/journal.pone.0138717.

Salathé, Marcel. 2018. “Digital epidemiology: what is it, and where is it going?” Life Sciences, Society and Policy. https://doi.org/10.1186/s40504-017-0065-7.

Saunier, Nicolas, Tarek Sayed, and Karim Ismail. 2010. “Large-Scale Automated Analysis of Vehicle Interactions and Collisions.” Transportation Research Record: Journal of the Transportation Research Board 2147: 42–50. https://doi.org/10.3141/2147-06.

Weeramanthri, Tarun Stephen, Hugh J S Dawkins, Gareth Baynam, Matthew Bellgard, Ori Gudes, and James Bernard Semmens. 2018. “Editorial: Precision Public Health.” Frontiers in Public Health 6 (April): 3–5. https://doi.org/10.3389/fpubh.2018.00121.

Zangenehpour, Sohail, Luis Fernando Miranda-Moreno, and Nicolas Saunier. 2014. “Automated Classification in Traffic Video at Intersections with Heavy Pedestrian and Bicycle Traffic.” 2014 TRB Annual Meeting Compendium of Papers.

¹ Google Correlate is being closed at the end of 2019.

Data science glossary for public health: Draft

Draft-2020-01-02

Julian Flowers

2020-01-02