Motivation

The development and incorporation of data science and big data into public health practice requires us to learn a new language. Fortunately, many of the new terms have public health equivalents.

As a first step we created an A-Z of data science in public health.

As a next step we have drawn on the literature to create the following tables. The glossary below draws extensively from:

Mooney et al table 2 (Mooney and Pejaver 2018), Kohavi et al (Kohavi and Provost 1998), McGrail et al (Mcgrail et al. 2018) and Fuller et al (Fuller, Buote, and Stanley 2017).

Definitions

Towards defining public health data science

Donoho provides a useful discussion about the origins and development of data science (Donoho 2017). In brief, he defines “greater data science” as:

The science of data

and sees 6 core areas for data science activity:

  1. Data Exploration and Preparation
  2. Data Representation and Transformation
  3. Computing with Data
  4. Data Modeling
  5. Data Visualization and Presentation
  6. Science about Data Science

The NIH (Data and Revolution 2018) defines data science as:

“the interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/ or complex sets of data”

McGrail et al (Mcgrail et al. 2018) provide a definition of data science in population health as:

“The science of data about people”

We are working towards a definition of public health data science. Two suggestions are:

“The application of data science to improve and protect health and reduce inequalities”

or:

“The science of data to improve and protect health and reduce inequalities”

McGrail et al compare the focus of current conceptual data science frameworks in health; this is summarised in the table below.

Adapted from (Mcgrail et al. 2018) table 1

| Area | Focus | Multisource data | Primary aim of research | Focus on technical/policy infrastructure |
|---|---|---|---|---|
| Data science | Data, esp. “big data” | Not always | Data for actionable information | No |
| Population data science | People, systems, population insights | Linkage and multiple sources | Public value | Key focus: legal, ethical, privacy, data collection |
| Informatics | Providers/ICT systems | Not necessarily | Implementation | Database/technical development |
| Public health data science | Public and population health, healthcare, health systems | Often, linkage important | Improving population health, reducing health inequality | Focus as per population health data science? |

Big data

Official definitions are taken from the [National Institute of Standards and Technology. NIST Big Data Interoperability Framework: Volume 1, Definitions (NIST Special Publication 1500-1). (2015)](http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-1.pdf). Originally big data was described in terms of 3 Vs, but more recent definitions talk of 5 Vs (see diagram, adapted from (Fuller, Buote, and Stanley 2017)).
  • Big data consists of extensive datasets (primarily in the characteristics of volume, variety, velocity, and/or variability) that require a scalable architecture for efficient storage, manipulation, and analysis.
  • Big data engineering includes advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis.
  • The big data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets.
  • Computational portability is the movement of the computation to the location of the data.
  • Data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.
  • The data lifecycle is the set of processes that transforms raw data into actionable knowledge, which includes data collection, preparation, analytics, visualization, and access.
  • Data science is the empirical synthesis of actionable knowledge from raw data through the complete data life cycle process.
  • The data science paradigm is extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.
  • A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data life cycle.
  • Distributed computing is a computing system in which components located on networked computers communicate and coordinate their actions by passing messages.
  • Distributed file systems contain multi-structured (object) datasets that are distributed across the computing nodes of the server cluster(s).
  • A federated database system is a type of meta-database management system, which transparently maps multiple autonomous database systems into a single federated database.
  • Horizontal scaling implies the coordination of individual resources (e.g., server) that are integrated to act in parallel as a single system (i.e., operate as a cluster).
  • Latency refers to the delay in processing or in availability.
  • Massively parallel processing refers to a multitude of individual processors working in parallel to execute a particular program.
  • Non-relational models, frequently referred to as NoSQL, refer to logical data models that do not follow relational algebra for the storage and manipulation of data.
  • Resource negotiation consists of built-in data management capabilities that provide the necessary support functions, such as operations management, workflow integration, security, governance, support for additional processing models, and controls for multi-tenant environments, providing higher availability and lower latency applications.
  • Schema-on-read is the application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database.
  • Shared-disk file systems, such as storage area networks (SANs) and network attached storage (NAS), use a single storage pool, which is accessed from multiple computing resources.
  • Validity refers to appropriateness of the data for its intended use.
  • Value refers to the inherent wealth, economic and social, embedded in any dataset.
  • Variability refers to the change in other data characteristics.
  • Variety refers to data from multiple repositories, domains, or types.
  • Velocity refers to the rate of data flow.
  • Veracity refers to the accuracy of the data.
  • Vertical scaling implies increasing the system parameters of processing speed, storage, and memory for greater performance.
  • Volatility refers to the tendency for data structures to change over time.
  • Volume refers to the size of the dataset.
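
Of the NIST terms above, schema-on-read is perhaps the easiest to illustrate in a few lines. The following is a minimal Python sketch, not taken from the NIST framework: the raw records, the `SCHEMA` mapping and the helper names are all hypothetical. Records are stored exactly as they arrive, and types and cleaning rules are applied only when the data is read.

```python
import json

# Raw records are stored as they arrived; nothing is validated on write.
# (Hypothetical example data.)
RAW_STORE = [
    '{"id": "1", "age": "42", "area": " Leeds "}',
    '{"id": "2", "age": "unknown", "area": "York"}',
    '{"id": "3", "area": "Hull"}',
]


def to_int_or_none(value):
    """Cast to int, returning None for missing or non-numeric values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def clean_text(value):
    """Trim whitespace from strings; anything else becomes None."""
    return value.strip() if isinstance(value, str) else None


# Hypothetical schema: field name -> function applied at read time.
SCHEMA = {"id": to_int_or_none, "age": to_int_or_none, "area": clean_text}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and apply the schema as the record is read."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: cast(record.get(field)) for field, cast in schema.items()}


records = list(read_with_schema(RAW_STORE, SCHEMA))
```

Because the schema lives with the reader rather than the store, a different analysis could read the same raw lines with a different `SCHEMA` without reloading any data.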

Types of big data for public health

Data science terms for public health

Accuracy
Proportion of results correctly classified.
Artificial intelligence
Artificial intelligence (AI) is used to describe machines that perform human-like activities such as learning, perception, problem solving and playing games. AI has been used to engage the public by improving the quality of eHealth interactions. For example, patients can use AI-based eHealth applications to receive personalised information. Chronology MD was developed for patients with Crohn’s disease; this programme allows patients to input their ‘observations of daily living’ and an AI system assists patients with management of their disease (eg, medication reminders, exercise and proper sleep motivation). This case highlights how AI applications can increase the immediacy of eHealth, the development of closeness and the feeling of an authentic, caring relationship. These applications help to provide a human-like element to eHealth exchanges between patients and AI systems.
Big data hubris
Big data hubris is the assumption that data with sufficient volume and velocity can compensate for or eliminate the need for high veracity data, high-quality study designs and more traditional forms of data analysis. Lazer and colleagues (Lazer et al. 2014) provide a compelling example of big data hubris in the Google Flu Trends research. An important limitation of Google Flu Trends was that the underlying algorithms and methodology of Google search terms are proprietary and evolving (Google’s Hummingbird algorithm likely uses deep learning). Core scientific principles of replicability and transparency are difficult to uphold when dealing with proprietary data, whether it be from Google, Facebook or others.
Blockchain
A blockchain, originally block chain, is a growing list of records, called blocks, that are linked using cryptography. Each block contains a cryptographic hash of the previous block, a timestamp, and transaction data (generally represented as a Merkle tree). By design, a blockchain is resistant to modification of the data. It is “an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way”. For use as a distributed ledger, a blockchain is typically managed by a peer-to-peer network collectively adhering to a protocol for inter-node communication and validating new blocks. Once recorded, the data in any given block cannot be altered retroactively without alteration of all subsequent blocks, which requires consensus of the network majority. Although blockchain records are not unalterable, blockchains may be considered secure by design and exemplify a distributed computing system with high Byzantine fault tolerance. Decentralized consensus has therefore been claimed with a blockchain.
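
The linking of blocks by hash can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than a real blockchain: there is no peer-to-peer network, no consensus protocol and no Merkle tree, and the function names and example records are hypothetical. It shows only why altering an early block is detectable unless every later block is rewritten too.

```python
import hashlib
import json


def block_hash(block):
    """SHA-256 of the block's contents, serialised deterministically."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def add_block(chain, data, timestamp):
    """Append a block that stores the hash of the block before it."""
    previous_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append(
        {"timestamp": timestamp, "data": data, "previous_hash": previous_hash}
    )


def is_valid(chain):
    """Check that each block still points at the true hash of its predecessor."""
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
add_block(chain, "record A", timestamp=1)
add_block(chain, "record B", timestamp=2)
add_block(chain, "record C", timestamp=3)
valid_before = is_valid(chain)

# Edit an early block without recomputing the hashes stored in later blocks.
chain[0]["data"] = "tampered"
valid_after = is_valid(chain)
```

Here `valid_before` is true and `valid_after` is false: the second block still holds the hash of the original first block, so the edit breaks the chain.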
CRISP-DM
CRoss-Industry Standard Process for Data Mining. An industry standard for analytical process.
Confusion matrix
A 2x2 table comparing predicted positive and negative results with observed results.
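
As a minimal sketch of how a confusion matrix and the accuracy defined above relate, the Python below counts the four cells of the 2x2 table for a binary outcome and derives accuracy from them. The function names and the example data are illustrative assumptions, not taken from the cited glossaries.

```python
def confusion_matrix(observed, predicted):
    """Count the four cells of the 2x2 table for binary (0/1) outcomes."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for obs, pred in zip(observed, predicted):
        if pred and obs:
            counts["tp"] += 1  # predicted positive, observed positive
        elif pred and not obs:
            counts["fp"] += 1  # predicted positive, observed negative
        elif obs:
            counts["fn"] += 1  # predicted negative, observed positive
        else:
            counts["tn"] += 1  # predicted negative, observed negative
    return counts


def accuracy(cm):
    """Proportion of results correctly classified."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


# Hypothetical screening results for eight people.
observed = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

cm = confusion_matrix(observed, predicted)
acc = accuracy(cm)
```

With these eight records the table holds 2 true positives, 1 false positive, 1 false negative and 4 true negatives, so accuracy is 6/8 = 0.75.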
Coverage
The proportion of a data set for which a classifier makes a prediction.
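
Coverage matters when a classifier is allowed to abstain. A short Python sketch, assuming (as an illustrative convention only) that an abstention is recorded as `None`:

```python
def coverage(predictions):
    """Share of records for which a prediction was made (None = abstained)."""
    made = sum(1 for p in predictions if p is not None)
    return made / len(predictions)


# Hypothetical output of a classifier that abstains on two of ten records.
predictions = [1, 0, None, 1, None, 0, 0, 1, 1, 0]
cov = coverage(predictions)
```

In this example the classifier makes a prediction for 8 of 10 records, giving a coverage of 0.8.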
Cross-validation
A method for estimating the accuracy (or error) of an outcome by dividing the data into k mutually exclusive subsets (“folds”) of approximately equal size. The model is trained and tested on the outcome k times. Each time it is trained on the data set minus a fold and tested on that fold. The accuracy estimate is the average accuracy for the k folds.
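
The procedure can be sketched in plain Python as follows. The "model" here is a deliberately trivial majority-class predictor, the labels are made up, and the function names are hypothetical; in practice the data would usually be shuffled before splitting and a real learning algorithm would take the place of `train_majority`.

```python
def k_fold_indices(n, k):
    """Split range(n) into k mutually exclusive folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_majority(labels):
    """Toy model: always predict the most common label in the training data."""
    return max(set(labels), key=labels.count)


def cross_validate(labels, k):
    """Average accuracy over k folds, each held out once for testing."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        # Train on the data set minus the fold...
        train = [y for i, y in enumerate(labels) if i not in held_out]
        prediction = train_majority(train)
        # ...and test on that fold.
        correct = sum(1 for i in test_idx if labels[i] == prediction)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical binary outcomes for ten records.
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
mean_accuracy = cross_validate(labels, k=5)
```

Each record is used for testing exactly once and for training k − 1 times, which is what makes the averaged estimate less dependent on any single train/test split.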
Data mining
Exploratory analysis. Data mining (sometimes referred to as knowledge discovery in databases) is the process of extracting new and at times useful information from data. Data mining and machine learning often use the same statistical techniques and it is difficult to differentiate the two in practice. Some would argue the primary focus of data mining is unsupervised learning (see unsupervised learning). Drug pathway discovery through analysis of published results is an example of data mining in health research. Perhaps data mining can be better conceived as data refining, where large volumes of data are sifted using statistical techniques to find potential associations of interest for researchers.
Deep learning
Deep learning is a machine learning technique designed to process signals like a human brain. Instead of using a single machine learning technique on a single type of data, deep learning uses multiple machine learning methods and layers of data to perform abstract learning tasks. To date, population health-related examples of deep learning are difficult to identify. An example of deep learning is an image recognition to image caption process. First, an image detection machine (eg, Vision Deep Convolutional Neural Network) identifies the items in an image, then based on those items a language generating machine (eg, recurrent neural nets) uses the data to generate a caption about the image. These processes can allow for detection and creation of abstract and creative objects such as paintings and music. The Office for National Statistics (ONS) has been using deep learning to identify and count caravans in caravan parks in order to improve the census and reduce the requirement for enumerators.
Digital epidemiology (from (Salathé 2018))
The goal of epidemiology, very broadly speaking, is to understand the patterns of disease and health dynamics in populations as well as the causes of these patterns, and to use this understanding to mitigate and prevent disease, and to promote health. The goal of digital epidemiology is exactly the same. So what differentiates (non-digital) epidemiology from digital epidemiology? The broadest definition one can give for digital epidemiology is the following: Digital epidemiology is epidemiology that uses digital data. I expect that this broad and straightforward definition will appeal to many, as it includes any modern approach to epidemiology based on digital sources. I would, however, like to offer an additional and much more narrow definition for digital epidemiology that I personally find more appealing and more thought-provoking, namely the following: Digital epidemiology is epidemiology that uses data that was generated outside the public health system, i.e. with data that was not generated with the primary purpose of doing epidemiology.

Precision, prediction and prevention

Precision public health is a relatively new term, first coined in 2013 to parallel precision medicine. It has been defined variously as:

  • “the application and combination of new and existing technologies, which more precisely describe and analyse individuals and their environment over the life course, to tailor preventive interventions for at-risk groups and improve the overall health of the population.” (Weeramanthri et al. 2018)
  • “improving the ability to prevent disease, promote health, and reduce health disparities in populations by applying emerging methods and technologies for measuring disease, pathogens, exposures, behaviors, and susceptibility in populations; and developing policies and targeted implementation programs to improve health” (Khoury, Iademarco, and Riley 2016)
  • “…requires robust primary surveillance data, rapid application of sophisticated analytics to track the geographical distribution of disease, and the capacity to act on such information” (Dowell, Blazes, and Desmond-Hellmann 2016)
  • “Precision public health is characterized by discovering, validating, and optimizing care strategies for well-characterized population strata” (Arnett and Claas 2016)
  • “Right intervention for the right population at the right time” (Bayer and Galea 2015)

A more recent definition suggests:

  • “Precision public health investigates how multiple dimensions of social position interact to confer health risk differently for precisely defined population subgroups according to the social contexts in which they are embedded, while considering relevant biological and behavioural factors. It leverages this information to uncover the precise social structures and processes that pattern health outcomes, and to identify actionable interventions within the social contexts of affected groups”(Olstad and McIntyre 2019)

This links much more with social determinants of health and tackling health inequalities, taking account of social structure and context.

Olstad and McIntyre (Olstad and McIntyre 2019) suggest that greater precision in public health can be achieved through the six steps listed here. Precision should be sought in the areas that are the most theoretically meaningful within the context of each individual study, while acknowledging that a minimum of two should be implemented in tandem to constitute an instance of precision public health.

  1. Provide explicit and precise descriptions of the theoretical rationale underlying the selection and operationalisation of social positions, social contexts, health outcomes and potential confounders. The proposed causal pathways should be precisely identified a priori.
  2. Identify the precise social positions of populations of interest and investigate their associations with health by expanding beyond common master categories to examine other dimensions of social position, and the heterogeneity that exists within social categories. Measures of perceived social position should be explored more fully.
  3. Operationalise social position in more precise ways, such as by using continuous measures or more categories, considering qualitative and quantitative features, and considering factors at multiple levels.
  4. Describe the precise time and context of measurement of social position and study the health effects of social position in a variety of contexts and at multiple time points across the life course.
  5. Use precise language to describe health inequities (eg, inequities in cardiovascular disease according to wealth and gender/sex).
  6. Use knowledge of the health effects of individuals’ precise social positions to inform the study of precise contextual mechanisms responsible for situating them there. Leverage this information to propose precise interventions to ameliorate health inequities.

This requires a shift in our thinking:

| Move away from… | Move towards… |
|---|---|
| Biomedical model of health | Social determinants model of health |
| The functions (eg, surveillance) and methods (eg, big data) of public health | The foundations (eg, social determinants) and core aims (eg, improve population health, reduce health inequities) of public health |
| Problematising individuals and their behaviours | Problematising the social contexts that create social stratification |
| Scaled-up versions of individual level interventions | Interventions that address the root causes of health inequities |
| Precision medicine for the population | Precision public health |

Bíró et al have reviewed the literature on related terms and suggest 3 potential categories (Bíró et al. 2018):

  • Individualised prevention
  • Personalised prevention
  • Precision prevention

Their definitions are:

Individualized prevention is a form of prevention in public health, in which health professionals consider the characteristics, lifestyle, family history, anamnesis, risk status and medication of the client when making proposals to maintain or improve the individual’s quality of life.

Personalized prevention is a form of prevention in public health, which includes the activities of individualized prevention and in which health professionals also consider biological information and biomarkers at the level of molecular disease pathways, genetics, transcriptomics, proteomics and metabolomics of the client when making proposals to maintain or improve the individual’s quality of life.

Precision prevention is a form of prevention in public health, which includes the activities of personalized prevention and in which health professionals also consider the socioeconomic status or the opportunities offered by psychological and behavioral data of the client when making proposals to maintain or improve the individual’s quality of life.

Examples

| Type | Primary | Secondary | Tertiary |
|---|---|---|---|
| Population-based/universal | Mass media campaigns/smoking bans | Type 2 DM screening | Population-wide retinopathy screening |
| Stratified prevention | Targeted mass media campaigns | Type 2 DM screening in high-risk groups | Retinopathy screening in type 2 DM |
| Individualised prevention | By considering the client’s characteristics, lifestyle and medical record, the GP helps the patient to create a healthy lifestyle. | Depending on the client’s characteristics, lifestyle and medical record, the GP sends the patient to screen for type 2 diabetes. | The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle and medical record. |
| Personalized prevention | By considering the client’s characteristics, lifestyle and medical record, the GP helps the patient to create a healthy lifestyle. | Depending on the client’s characteristics, lifestyle and medical record, the GP sends the patient to screen for type 2 diabetes. | The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle, medical record and genome. |
| Precision prevention | By considering the client’s characteristics, lifestyle, medical record, genome, psychological profile and socio-economic status, the GP helps the patient to create a healthy lifestyle. | Depending on the client’s characteristics, lifestyle, medical record, genome, psychological profile and socio-economic status, the GP encourages the patient to go for screening for type 2 diabetes. | The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle, medical record, genome, psychological profile and socio-economic status. |

From these suggested definitions we can see that individualised or personalised prevention are services delivered in primary care, aimed at lifestyle modification, early detection and disease management, using ever more personalised information to guide decision making (including patient or personal preference). These definitions are akin to the emerging idea of “lifestyle medicine”, which has been defined as:

“Lifestyle medicine is a branch of evidence-based medicine in which comprehensive lifestyle changes (including nutrition, physical activity, stress management, social support and environmental exposures) are used to prevent, treat and reverse the progression of chronic diseases by addressing their underlying causes. Lifestyle medicine interventions include health risk assessment screening, health behavior change counseling and clinical application of lifestyle modifications. Lifestyle medicine is often prescribed in conjunction with pharmacotherapy and other forms of therapy.”1

\[ \text{Precision public health} \neq \sum \text{precision medicine} \]

Public health 3.0 (O’Carroll et al. 2017)

CDC have recently written about Public Health 3.0 which, from a data perspective, promotes the idea of greater granularity and precision.

Predictive prevention and PHE definition of precision public health

“Predictive prevention does not replace existing public health interventions at population or community level – but it does build on existing data-driven targeting techniques and channels to add another dimension of deeper customer engagement. By continuing to combine behavioural science and digital innovations, we can actively encourage people to make healthier choices and take greater responsibility for their wellbeing.”

“A second more expansive version of precision public health is about the use of data and analytical techniques to design and implement interventions that benefit whole populations. This version emphasises the use of sophisticated surveillance and modelling and has been promoted by the Bill and Melinda Gates Foundation. It is certainly appealing and is typified for example by the recent Global Burden of Disease study and much of PHE’s work on infection control and health improvement.”

Critiques of big data in public health

Fuller et al (Fuller, Buote, and Stanley 2017) provide some useful critiques of big data use for public and population health:

  • Automating research changes the nature of knowledge
  • Claims of objectivity are misleading
  • Bigger is not always better
  • Not all data are equivalent
  • Accessible does not equal ethical
  • Lack of access creates digital divides

Khoury and Ioannidis (Khoury and Ioannidis 2014) argue that separating signal from noise in big data for public health benefit requires a stronger epidemiological foundation, improved knowledge translation and integration, and reinvigoration of the principles of evidence-based medicine.

Future directions

Selected machine learning applications in public health

References

Arnett, Donna K., and Steven A. Claas. 2016. “Precision medicine, genomics, and public health.” Diabetes Care. https://doi.org/10.2337/dc16-1763.

Bayer, Ronald, and Sandro Galea. 2015. “Public Health in the Precision-Medicine Era.” New England Journal of Medicine. https://doi.org/10.1056/NEJMp1506241.

Bíró, K, V Dombrádi, A Jani, K Boruzs, and M Gray. 2018. “Creating a common language: defining individualized, personalized and precision prevention in public health.” Journal of Public Health (Oxford, England), 1–8. https://doi.org/10.1093/pubmed/fdy066.

Data, Big, and Resolution Revolution. 2018. “NIH strategic plan for data science” 2015: 1–26.

Donoho, David. 2017. “50 Years of Data Science.” https://doi.org/10.1080/10618600.2017.1384734.

Dowell, Scott F., David Blazes, and Susan Desmond-Hellmann. 2016. “Four steps to precision public health.” https://doi.org/10.1038/540189a.

Fayyad, Usama M, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. 1996. Advances in Knowledge Discovery and Data Mining.

Fuller, Daniel, Richard Buote, and Kevin Stanley. 2017. “A glossary for big data in population and public health: Discussion and commentary on terminology and research methods.” Journal of Epidemiology and Community Health 71 (11). https://doi.org/10.1136/jech-2017-209608.

Khoury, Muin J., Michael F. Iademarco, and William T. Riley. 2016. “Precision Public Health for the Era of Precision Medicine.” https://doi.org/10.1016/j.amepre.2015.08.031.

Khoury, Muin J, and John P A Ioannidis. 2014. “Medicine. Big data meets public health.” Science 346 (6213): 1054–5. https://doi.org/10.1126/science.aaa2709.

Kohavi, Ron, and Foster Provost. 1998. “Glossary of Terms.” Machine Learning. 30 (2-3): 271–74. https://doi.org/10.1023/A:1017181826899.

Lazer, D, R Kennedy, G King, and A Vespignani. 2014. “The parable of Google Flu: traps in big data analysis.” Science 343: 1203–5. https://doi.org/10.1126/science.1248506.

Lippi, Giuseppe, and Gianfranco Cervellin. 2019. “Is digital epidemiology reliable?—insight from updated cancer statistics.” Annals of Translational Medicine. https://doi.org/10.21037/atm.2018.11.55.

Mavragani, Amaryllis, and Gabriela Ochoa. 2019. “Google trends in infodemiology and infoveillance: Methodology framework.” Journal of Medical Internet Research. https://doi.org/10.2196/13439.

Mcgrail, Kimberlyn M, Kerina Jones, Ashley Akbari, Tellen D Bennett, Andy Boyd, Fabrizio Carinci, Xinjie Cui, et al. 2018. “A position statement on population data science: the science of data about people.” International Journal of Population Data Science 3 (February): 1–11. https://ijpds.org/article/view/415.

Mooney, Stephen J., and Vikas Pejaver. 2018. “Big Data in Public Health: Terminology, Machine Learning, and Privacy.” Annual Review of Public Health. https://doi.org/10.1146/annurev-publhealth-040617-014208.

O’Carroll, Patrick W., Karen B. DeSalvo, Denise Koo, John Auerbach, and Judith A. Monroe. 2017. “Public health 3.0: Time for an upgrade.” In Solving Population Health Problems Through Collaboration. https://doi.org/10.4324/9781315212708.

Olstad, Dana Lee, and Lynn McIntyre. 2019. “Reconceptualising precision public health.” BMJ Open 9 (9): e030279. https://doi.org/10.1136/bmjopen-2019-030279.

Preoţiuc-Pietro, Daniel, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach, and Nikolaos Aletras. 2015. “Studying user income through language, behaviour and affect in social media.” PLoS ONE 10 (9). https://doi.org/10.1371/journal.pone.0138717.

Salathé, Marcel. 2018. “Digital epidemiology: what is it, and where is it going?” Life Sciences, Society and Policy. https://doi.org/10.1186/s40504-017-0065-7.

Saunier, Nicolas, Tarek Sayed, and Karim Ismail. 2010. “Large-Scale Automated Analysis of Vehicle Interactions and Collisions.” Transportation Research Record: Journal of the Transportation Research Board 2147: 42–50. https://doi.org/10.3141/2147-06.

Weeramanthri, Tarun Stephen, Hugh J S Dawkins, Gareth Baynam, Matthew Bellgard, Ori Gudes, and James Bernard Semmens. 2018. “Editorial: Precision Public Health.” Frontiers in Public Health 6 (April): 3–5. https://doi.org/10.3389/fpubh.2018.00121.

Zangenehpour, Sohail, Luis Fernando Miranda-Moreno, and Nicolas Saunier. 2014. “Automated Classification in Traffic Video at Intersections with Heavy Pedestrian and Bicycle Traffic.” 2014 TRB Annual Meeting Compendium of Papers.


  1. Google Correlate is being closed at the end of 2019.