The development and incorporation of data science and big data into public health practice require us to learn a new language. Fortunately, many of the new terms have public health equivalents.
As a first step we created an A-Z of data science in public health.
As a next step we have drawn on the literature to create the following tables. The glossary below draws extensively from:
Mooney et al table 2 (Mooney and Pejaver 2018), Kohavi et al (Kohavi and Provost 1998), McGrail et al (Mcgrail et al. 2018) and Fuller et al (Fuller, Buote, and Stanley 2017).
Donoho (Donoho 2017) provides a useful discussion of the origins and development of data science. In brief, he defines “greater data science” as:
The science of data
and sees 6 core areas for data science activity:
The NIH (Data and Revolution 2018) defines data science as:
“the interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/or complex sets of data”
McGrail et al (Mcgrail et al. 2018) provide a definition of data science in population health as:
“The science of data about people”
We are working towards a definition of public health data science. Two suggestions are:
“The application of data science to improve and protect health and reduce inequalities”
or:
“The science of data to improve and protect health and reduce inequalities”
McGrail et al compare the focus of current conceptual data science frameworks in health; this is summarised in the table below.
Adapted from (Mcgrail et al. 2018) table 1
Area | Focus | Multisource data | Primary aim of research | Focus on technical/ policy infrastructure |
---|---|---|---|---|
Data science | Data esp. “big data” | Not always | Data for actionable information | No |
Population data science | People, systems, population insights | Linkage and multiple sources | Public value | Key focus - legal, ethical, privacy, data collection |
Informatics | Providers/ ICT systems | Not necessarily | Implementation | Database/ technical development |
Public health data science | Public and population health, healthcare, health systems | Often, linkage important | Improving population health, reducing health inequality | Focus as per population health data science? |
Definitions from the NIST Big Data Interoperability Framework: Volume 1, Definitions
Term | Definition |
---|---|
Big data | Extensive datasets, primarily in the characteristics of volume, variety, velocity, and/or variability, that require a scalable architecture for efficient storage, manipulation, and analysis. |
Big data engineering | Advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis. |
Big data paradigm | The distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets. |
Computational portability | The movement of the computation to the location of the data. |
Data governance | The overall management of the availability, usability, integrity, and security of the data employed in an enterprise. |
Data lifecycle | The set of processes that transforms raw data into actionable knowledge, which includes data collection, preparation, analytics, visualization, and access. |
Data science | The empirical synthesis of actionable knowledge from raw data through the complete data life cycle process. |
Data science paradigm | Extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing. |
Data scientist | A practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data life cycle. |
Distributed computing | A computing system in which components located on networked computers communicate and coordinate their actions by passing messages. |
Distributed file systems | File systems that contain multi-structured (object) datasets distributed across the computing nodes of the server cluster(s). |
Federated database system | A type of meta-database management system, which transparently maps multiple autonomous database systems into a single federated database. |
Horizontal scaling | The coordination of individual resources (e.g., server) that are integrated to act in parallel as a single system (i.e., operate as a cluster). |
Latency | The delay in processing or in availability. |
Massively parallel processing | A multitude of individual processors working in parallel to execute a particular program. |
Non-relational models | Frequently referred to as NoSQL; logical data models that do not follow relational algebra for the storage and manipulation of data. |
Resource negotiation | Built-in data management capabilities that provide the necessary support functions, such as operations management, workflow integration, security, governance, support for additional processing models, and controls for multi-tenant environments, providing higher availability and lower latency applications. |
Schema-on-read | The application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database. |
Shared-disk file systems | File systems, such as storage area networks (SANs) and network attached storage (NAS), that use a single storage pool, which is accessed from multiple computing resources. |
Validity | The appropriateness of the data for its intended use. |
Value | The inherent wealth, economic and social, embedded in any dataset. |
Variability | The change in other data characteristics. |
Variety | Data from multiple repositories, domains, or types. |
Velocity | The rate of data flow. |
Veracity | The accuracy of the data. |
Vertical scaling | Increasing the system parameters of processing speed, storage, and memory for greater performance. |
Volatility | The tendency for data structures to change over time. |
Volume | The size of the dataset. |
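Schema-on-read, as defined in the glossary above, can be hard to picture in the abstract. The following is a minimal Python sketch, not taken from the NIST framework: the raw store, the field names and the `read_with_schema` helper are all hypothetical and chosen only for illustration. Records are stored as untyped text, and the schema (types plus light cleansing) is applied only when the data is read.

```python
import json
from datetime import date

# Hypothetical raw store: records are kept exactly as they arrived,
# with no schema enforced at write time.
RAW_STORE = [
    '{"id": "1", "admitted": "2019-01-07", "age": "67"}',
    '{"id": "2", "admitted": "2019-01-08", "age": " 45 "}',
    '{"id": "3", "admitted": "2019-01-08"}',
]

# Hypothetical schema: field name -> function that cleans and types the raw value.
SCHEMA = {
    "id": int,
    "admitted": date.fromisoformat,
    "age": lambda value: int(value.strip()),
}


def read_with_schema(raw_lines, schema):
    """Apply the schema (parsing, cleansing, typing) at read time."""
    for line in raw_lines:
        raw = json.loads(line)
        record = {}
        for field, parse in schema.items():
            value = raw.get(field)
            # Missing fields are kept as None rather than rejected.
            record[field] = parse(value) if value is not None else None
        yield record


records = list(read_with_schema(RAW_STORE, SCHEMA))
```

A different analysis could read the same raw store with a different schema, which is the practical difference from enforcing one schema when the data is written.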
One tool that has been used for digital epidemiology is Google Trends. Some research, e.g. on Google Flu Trends, suggests that certain search terms have predictive value for disease occurrence, e.g. flu and norovirus. Google Flu Trends is the best-researched example, but recent research has found a strong correlation between the frequency of some cancers and Google searches for cancer (Lippi and Cervellin 2019). There is increasing evidence that Google Trends may have some value in NCD surveillance. (R users can access Google Trends with the gtrendsR package, which allows evaluation of up to 5 topics.)
Fig 1. Google Trends for digital epidemiology and related terms
A related tool provided by Google is Google Correlate. [This example shows that weekly admissions for respiratory disease are highly correlated with searches for “cough” and “linctus”, i.e. respiratory symptoms and treatments.] The tool correlates user-provided time series data (weekly or monthly) with comparable Google search data. It can be used to identify search terms that predict the data provided, and the relationship can be lagged by a week so that a predictive model can be built. For example, in the NHS increased searches for pneumonia precede hospitalisations for pneumonia, and this has been used to provide early warning to trusts.¹
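The lagged comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only, not how Google Correlate itself is implemented: the weekly figures are synthetic, and the function names `pearson` and `lagged_correlation` are assumptions made for the example. It correlates searches in week t with admissions in week t + lag and picks the lag with the strongest correlation.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(searches, admissions, lag):
    """Correlate searches in week t with admissions in week t + lag."""
    if lag < 0:
        raise ValueError("lag must be zero or positive")
    if lag == 0:
        return pearson(searches, admissions)
    return pearson(searches[:-lag], admissions[lag:])


# Synthetic weekly data (not real surveillance figures): admissions are
# constructed to follow searches by exactly one week.
searches = [10, 12, 18, 30, 45, 40, 28, 20, 15, 11]
admissions = [22, 20, 24, 36, 60, 90, 80, 56, 40, 30]

# Try lags of 0 to 3 weeks and keep the one with the highest correlation.
best_lag = max(range(4), key=lambda k: lagged_correlation(searches, admissions, k))
print(best_lag)  # 1
```

With real data the best lag would rarely be this clear-cut, and a strong correlation at some lag is only a starting point for building and validating a predictive model.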
The use of these tools is also associated with infodemiology and infoveillance (Mavragani and Ochoa 2019).
Predictive analytics
Prescriptive analytics
Precision public health is a relatively new term, first coined in 2013 to parallel precision medicine. It has been defined variously as:
A more recent definition suggests:
This links much more with social determinants of health and tackling health inequalities, taking account of social structure and context.
Olstad and McIntyre (Olstad and McIntyre 2019) suggest that greater precision in public health can be achieved by:
Precision should be sought in the areas that are the most theoretically meaningful within the context of each individual study, while acknowledging that a minimum of two should be implemented in tandem to constitute an instance of precision public health.

1. Provide explicit and precise descriptions of the theoretical rationale underlying the selection and operationalisation of social positions, social contexts, health outcomes and potential confounders. The proposed causal pathways should be precisely identified a priori.
2. Identify the precise social positions of populations of interest and investigate their associations with health by expanding beyond common master categories to examine other dimensions of social position, and the heterogeneity that exists within social categories. Measures of perceived social position should be explored more fully.
3. Operationalise social position in more precise ways, such as by using continuous measures or more categories, considering qualitative and quantitative features, and considering factors at multiple levels.
4. Describe the precise time and context of measurement of social position and study the health effects of social position in a variety of contexts and at multiple time points across the life course.
5. Use precise language to describe health inequities (eg, inequities in cardiovascular disease according to wealth and gender/sex).
6. Use knowledge of the health effects of individuals’ precise social positions to inform the study of precise contextual mechanisms responsible for situating them there. Leverage this information to propose precise interventions to ameliorate health inequities.
This requires a shift in our thinking:
Move away from… | Move towards… |
---|---|
Biomedical model of health | Social determinants model of health |
The functions (eg, surveillance) and methods (eg, big data) of public health | The foundations (eg, social determinants) and core aims (eg, improve population health, reduce health inequities) of public health |
Problematising individuals and their behaviours | Problematising the social contexts that create social stratification |
Scaled-up versions of individual level interventions | Interventions that address the root causes of health inequities |
Precision medicine for the population | Precision public health |
Bíró et al (Bíró et al. 2018) have reviewed the literature on related terms and suggest 3 potential categories:
Their definitions are:
Individualized prevention is a form of prevention in public health, in which health professionals consider the characteristics, lifestyle, family history, anamnesis, risk status and medication of the client when making proposals to maintain or improve the individual’s quality of life.
Personalized prevention is a form of prevention in public health, which includes the activities of individualized prevention and in which health professionals also consider biological information and biomarkers at the level of molecular disease pathways, genetics, transcriptomics, proteomics and metabolomics of the client when making proposals to maintain or improve the individual’s quality of life.
Precision prevention is a form of prevention in public health, which includes the activities of personalized prevention and in which health professionals also consider the socioeconomic status or the opportunities offered by psychological and behavioral data of the client when making proposals to maintain or improve the individual’s quality of life.
Type | Primary | Secondary | Tertiary |
---|---|---|---|
Population-based/ universal | Mass media campaigns/ Smoking bans | Type 2 DM screening | Population-wide retinopathy screening |
Stratified prevention | Targeted mass media campaigns | Type 2 DM screening in high-risk groups | Retinopathy screening in type 2 DM |
Individualised prevention | By considering the client’s characteristics, lifestyle and medical record, the GP helps the patient to create a healthy lifestyle. | Depending on the client’s characteristics, lifestyle and medical record, the GP sends the patient to screen for type 2 diabetes. | The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle and medical record. |
Personalized prevention | By considering the client’s characteristics, lifestyle and medical record, the GP helps the patient to create a healthy lifestyle. | Depending on the client’s characteristics, lifestyle and medical record, the GP sends the patient to screen for type 2 diabetes. | The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle, medical record and genome. |
Precision prevention | By considering the client’s characteristics, lifestyle, medical record, genome, psychological profile and socio-economic status, the GP helps the patient to create a healthy lifestyle. | Depending on the client’s characteristics, lifestyle, medical record, genome, psychological profile and socio-economic status, the GP encourages the patient to go for screening for type 2 diabetes. | The GP helps the patient with type 2 diabetes to cope with the disease according to the patient’s characteristics, lifestyle, medical record, genome, psychological profile and socio-economic status. |
From these suggested definitions we can see that individualised or personalised prevention are services delivered in primary care, aimed at lifestyle modification, early detection and disease management, using ever more personalised information to guide decision making (including patient or personal preference). These definitions are akin to the emerging idea of “lifestyle medicine”, which has been defined as:
“Lifestyle medicine is a branch of evidence-based medicine in which comprehensive lifestyle changes (including nutrition, physical activity, stress management, social support and environmental exposures) are used to prevent, treat and reverse the progression of chronic diseases by addressing their underlying causes. Lifestyle medicine interventions include health risk assessment screening, health behavior change counseling and clinical application of lifestyle modifications. Lifestyle medicine is often prescribed in conjunction with pharmacotherapy and other forms of therapy.”1
\[ \text{Precision public health} \neq \sum \text{precision medicine} \]
CDC have recently written about Public Health 3.0 which, from a data perspective, promotes the idea of greater granularity and precision.
“Predictive prevention does not replace existing public health interventions at population or community level – but it does build on existing data-driven targeting techniques and channels to add another dimension of deeper customer engagement. By continuing to combine behavioural science and digital innovations, we can actively encourage people to make healthier choices and take greater responsibility for their wellbeing.”
“A second more expansive version of precision public health is about the use of data and analytical techniques to design and implement interventions that benefit whole populations. This version emphasises the use of sophisticated surveillance and modelling and has been promoted by the Bill and Melinda Gates Foundation. It is certainly appealing and is typified for example by the recent Global Burden of Disease study and much of PHE’s work on infection control and health improvement.”
Fuller et al (Fuller, Buote, and Stanley 2017) provide some useful critiques of big data use for public and population health:
Khoury and Ioannidis (Khoury and Ioannidis 2014) argue that separating signal from noise in big data for public health benefit requires a stronger epidemiological foundation, improved knowledge translation and integration, and a reinvigoration of the principles of evidence-based medicine.
Arnett, Donna K., and Steven A. Claas. 2016. “Precision medicine, genomics, and public health.” Diabetes Care. https://doi.org/10.2337/dc16-1763.
Bayer, Ronald, and Sandro Galea. 2015. “Public Health in the Precision-Medicine Era.” New England Journal of Medicine. https://doi.org/10.1056/NEJMp1506241.
Bíró, K, V Dombrádi, A Jani, K Boruzs, and M Gray. 2018. “Creating a common language: defining individualized, personalized and precision prevention in public health.” Journal of Public Health (Oxford, England), 1–8. https://doi.org/10.1093/pubmed/fdy066.
Data, Big, and Resolution Revolution. 2018. “NIH strategic plan for data science” 2015: 1–26.
Donoho, David. 2017. “50 Years of Data Science.” https://doi.org/10.1080/10618600.2017.1384734.
Dowell, Scott F., David Blazes, and Susan Desmond-Hellmann. 2016. “Four steps to precision public health.” https://doi.org/10.1038/540189a.
Fayyad, Usama M, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. 1996. Advances in Knowledge Discovery and Data Mining.
Fuller, Daniel, Richard Buote, and Kevin Stanley. 2017. “A glossary for big data in population and public health: Discussion and commentary on terminology and research methods.” Journal of Epidemiology and Community Health 71 (11). https://doi.org/10.1136/jech-2017-209608.
Khoury, Muin J., Michael F. Iademarco, and William T. Riley. 2016. “Precision Public Health for the Era of Precision Medicine.” https://doi.org/10.1016/j.amepre.2015.08.031.
Khoury, Muin J, and John P A Ioannidis. 2014. “Medicine. Big data meets public health.” Science 346 (6213): 1054–5. https://doi.org/10.1126/science.aaa2709.
Kohavi, Ron, and Foster Provost. 1998. “Glossary of Terms.” Machine Learning. 30 (2-3): 271–74. https://doi.org/10.1023/A:1017181826899.
Lazer, D, R Kennedy, G King, and A Vespignani. 2014. “The parable of Google Flu: traps in big data analysis.” Science 343: 1203–5. https://doi.org/10.1126/science.1248506.
Lippi, Giuseppe, and Gianfranco Cervellin. 2019. “Is digital epidemiology reliable?—insight from updated cancer statistics.” Annals of Translational Medicine. https://doi.org/10.21037/atm.2018.11.55.
Mavragani, Amaryllis, and Gabriela Ochoa. 2019. “Google trends in infodemiology and infoveillance: Methodology framework.” Journal of Medical Internet Research. https://doi.org/10.2196/13439.
Mcgrail, Kimberlyn M, Kerina Jones, Ashley Akbari, Tellen D Bennett, Andy Boyd, Fabrizio Carinci, Xinjie Cui, et al. 2018. “A position statement on population data science: the science of data about people.” International Journal of Population Data Science 3 (February): 1–11. https://ijpds.org/article/view/415.
Mooney, Stephen J., and Vikas Pejaver. 2018. “Big Data in Public Health: Terminology, Machine Learning, and Privacy.” Annual Review of Public Health. https://doi.org/10.1146/annurev-publhealth-040617-014208.
O’Carroll, Patrick W., Karen B. DeSalvo, Denise Koo, John Auerbach, and Judith A. Monroe. 2017. “Public health 3.0: Time for an upgrade.” In Solving Population Health Problems Through Collaboration. https://doi.org/10.4324/9781315212708.
Olstad, Dana Lee, and Lynn McIntyre. 2019. “Reconceptualising precision public health.” BMJ Open 9 (9): e030279. https://doi.org/10.1136/bmjopen-2019-030279.
Preoţiuc-Pietro, Daniel, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach, and Nikolaos Aletras. 2015. “Studying user income through language, behaviour and affect in social media.” PLoS ONE 10 (9). https://doi.org/10.1371/journal.pone.0138717.
Salathé, Marcel. 2018. “Digital epidemiology: what is it, and where is it going?” Life Sciences, Society and Policy. https://doi.org/10.1186/s40504-017-0065-7.
Saunier, Nicolas, Tarek Sayed, and Karim Ismail. 2010. “Large-Scale Automated Analysis of Vehicle Interactions and Collisions.” Transportation Research Record: Journal of the Transportation Research Board 2147: 42–50. https://doi.org/10.3141/2147-06.
Weeramanthri, Tarun Stephen, Hugh J S Dawkins, Gareth Baynam, Matthew Bellgard, Ori Gudes, and James Bernard Semmens. 2018. “Editorial: Precision Public Health.” Frontiers in Public Health 6 (April): 3–5. https://doi.org/10.3389/fpubh.2018.00121.
Zangenehpour, Sohail, Luis Fernando Miranda-Moreno, and Nicolas Saunier. 2014. “Automated Classification in Traffic Video at Intersections with Heavy Pedestrian and Bicycle Traffic.” 2014 TRB Annual Meeting Compendium of Papers.
¹ Google Correlate is being closed at the end of 2019.