Open public health data and analytical tools are becoming increasingly available as part of R packages or through Application Programming Interfaces (APIs).
This article briefly reviews:
WHO provide an R package (WHO
) to access some of their data.
The starting point is the get_codes
function which gives a list of indicators (n = 2767) and respective codes which can then be passed to the get_data
function.
There is a bit of work to do to separate out confidence intervals into separate columns to make the data tidier - see code below for age-standardised prevalence of obesity by country. This shows the universal increase in adult obesity rates across Europe although there is considerable variation between countries. The most dramatic increases have been in Turkish women.
library(WHO)
WHO::get_codes() %>% head()
## # A tibble: 6 x 3
## label display url
## <chr> <chr> <chr>
## 1 MDG_0000000001 Infant mortality rate (probabi… http://apps.who.int/gho/…
## 2 MDG_0000000003 Adolescent birth rate (per 100… http://apps.who.int/gho/…
## 3 MDG_0000000005 Contraceptive prevalence (%) http://apps.who.int/gho/…
## 4 MDG_0000000006 Unmet need for family planning… http://apps.who.int/gho/…
## 5 MDG_0000000007 Under-five mortality rate (pro… http://apps.who.int/gho/…
## 6 MDG_0000000010 Median availability of selecte… http://apps.who.int/gho/…
WHO::get_data(code = "NCD_BMI_30A") %>%
tidyr::separate(value, c("value", "ci"), sep = " ") %>%
mutate(ci = stringr::str_replace(ci, "\\[", ""),
ci = stringr::str_replace(ci, "\\]", "")) %>%
tidyr::separate(ci, remove = FALSE, c("lower", "upper"), sep = "-") %>%
mutate_at(.vars = c("value", "lower", "upper"), .funs = as.numeric) %>%
filter(region == "Europe", sex == "Female") %>%
ggplot(aes(year, value, colour = country)) +
geom_line(aes(group = country)) +
#facet_grid(agegroup~sex) +
theme(legend.position = "bottom") -> g
plotly::ggplotly(g)
Eurostat publish access to data in R via the eurostat
package and have published a helpful cheatsheet as a tutorial. Like the WHO data there is a utility function get_eurostat_toc
which tells you whats in the dataset (9695 indicators) and a data extraction function get_eurostat
.
The example below shows how to extract, plot and map data on amenable mortality across Europe.
library(eurostat)
toc <- eurostat::get_eurostat_toc() ## see what's in the dataset
am <- search_eurostat("Amenable", type = "dataset") ## find the indicator ids for amenable deaths
am_data <- get_eurostat(id = "hlth_cd_apr") ## amenable mortality data
am_data1 <- get_eurostat(id = "hlth_cd_apreu") ## amenable mortality data by ICD10 code
#summary(am_data1)
# am_data %>%
# filter(unit == "RT", values > 10) %>%
# ggplot(aes(time, values, shape = indic_he, colour = geo, group = geo)) +
# geom_line() +
# geom_point() +
# #facet_wrap(indic_he~icd10, scales = "free", nrow = 2) +
# labs(caption = "Source: EUROSTAT",
# title = "Preventable and amenable mortality in Europe \n2010-2014")
am_data1 %>%
filter(unit == "RT", values > 10, !icd10 == "TOTAL") %>%
ggplot(aes(time, values, shape = indic_he, colour = icd10)) +
geom_line() +
geom_point() +
facet_wrap(indic_he~sex, nrow = 2) +
labs(caption = "Source: EUROSTAT",
title = "Preventable and amenable mortality in Europe (EU28) \n2010-2014",
subtitle = "By ICD10 code",
y = "Rate") +
scale_y_log10()
am_data2 <- get_eurostat(id = "hlth_cd_apr") %>% filter(time == max(time), unit == "RT") %>% mutate(cat = cut_to_classes(values, n = 5, decimals = 1)) ## stratify data into quintiles
mapdata <- merge_eurostat_geodata(am_data2, resolution = "20") ## add to mapping function
## amenable mortality European map
ggplot(mapdata, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = cat), colour = "grey", size = .1) +
labs(title = "Amenable mortality by NUTS-3 region, 2014",
subtitle = "Mortality rate per 100,000",
fill = "Amenable mortality \nrate",
caption = "Source: EUROSTAT") +
theme_void() +
scale_fill_brewer(palette = "RdYlBu") +
coord_map(xlim = c(-12, 44), ylim = c(35,67))
The openair
package includes data from pollution monitoring sites from the NERC. There is a good manual for use here.
The data is updated daily and workhorse function is importAURN
. To use it you will need to know the codes of relevant monitoring stations. As an example, the Marylebone Road station is “my1” and we can extract data as below:
pollution_data <- openair::importAURN(site = c("my1", "kc1"), year = 2010:2017)
We can now plot daily concentrations of pollutants at this site. We’ll look at pm2.5 and n02.
summaryPlot(pollution_data)
## date1 date2 o3 no2 co so2 pm10
## "POSIXct" "POSIXt" "numeric" "numeric" "numeric" "numeric" "numeric"
## nox no pm2.5 nv10 v10 nv2.5 v2.5
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## ws wd site code
## "numeric" "numeric" "factor" "factor"
The NOMIS API works in a similar way to the Fingertips API and is well described here. You can construct a URL to download csv files from the site built on common patterns. To download a geographical dataset of a given indicator requires identifying codes for geography
, sex
, item
, and measures
types. These retrieve geographical codes, names and types, numerator and denominator definitions.
In R it seems most straightforward to download the metadata files in json format, but construct data extract as csv files. These always take the root https://www.nomisweb.co.uk/api/v01/dataset/NM_1_1.data.csv.
You can search the definitions and metadata with queries like this https://www.nomisweb.co.uk/api/v01/dataset/def.htm?search=*ethnicity* or https://www.nomisweb.co.uk/api/v01/dataset/def.htm?search=*claimant*
library(jsonlite)
library(listviewer)
datasets <- jsonlite::fromJSON("https://www.nomisweb.co.uk/api/v01/dataset/def.sdmx.json", flatten = TRUE)
str(datasets, max.level = 4)
## List of 1
## $ structure:List of 7
## ..$ header :List of 4
## .. ..$ id : chr "none"
## .. ..$ prepared: chr "2018-03-20T15:36:27Z"
## .. ..$ sender :List of 2
## .. .. ..$ contact:List of 4
## .. .. ..$ id : chr "NOMIS"
## .. ..$ test : chr "false"
## ..$ keyfamilies :List of 1
## .. ..$ keyfamily:'data.frame': 1230 obs. of 14 variables:
## .. .. ..$ agencyid : chr [1:1230] "NOMIS" "NOMIS" "NOMIS" "NOMIS" ...
## .. .. ..$ id : chr [1:1230] "NM_1_1" "NM_2_1" "NM_4_1" "NM_5_1" ...
## .. .. ..$ uri : chr [1:1230] "Nm-1d1" "Nm-2d1" "Nm-4d1" "Nm-5d1" ...
## .. .. ..$ version : num [1:1230] 1 1 1 1 1 1 1 1 1 1 ...
## .. .. ..$ annotations.annotation :List of 1230
## .. .. ..$ components.attribute :List of 1230
## .. .. ..$ components.dimension :List of 1230
## .. .. ..$ components.primarymeasure.conceptref: chr [1:1230] "OBS_VALUE" "OBS_VALUE" "OBS_VALUE" "OBS_VALUE" ...
## .. .. ..$ components.timedimension.codelist : chr [1:1230] "CL_1_1_TIME" "CL_2_1_TIME" "CL_4_1_TIME" "CL_5_1_TIME" ...
## .. .. ..$ components.timedimension.conceptref : chr [1:1230] "TIME" "TIME" "TIME" "TIME" ...
## .. .. ..$ description.value : chr [1:1230] "Records the number of people claiming Jobseeker's Allowance (JSA) and National Insurance credits at Jobcentre P"| __truncated__ "A quartery count of claimants who were claiming Jobseeker's Allowance on the count date analysed by their age a"| __truncated__ "A monthly count of Jobseeker's Allowance (JSA) claimants broken down by age and the duration of claim. Totals e"| __truncated__ "A midyear estimate of the workforce (the denominator) which was used for calculating claimant count rates prior"| __truncated__ ...
## .. .. ..$ description.lang : chr [1:1230] "en" "en" "en" "en" ...
## .. .. ..$ name.value : chr [1:1230] "Jobseeker's Allowance with rates and proportions" "claimant count - age and duration" "Jobseeker's Allowance by age and duration" "claimant count denominators - historical workforce series" ...
## .. .. ..$ name.lang : chr [1:1230] "en" "en" "en" "en" ...
## ..$ xmlns : chr "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"
## ..$ common : chr "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/common"
## ..$ structure : chr "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure"
## ..$ xsi : chr "http://www.w3.org/2001/XMLSchema-instance"
## ..$ schemalocation: chr "http://sdmx.org/docs/2_0/SDMXMessage.xsd"
listviewer::jsonedit(datasets)
## has 1229 indicators
## lets look at indicator names
purrr::map_df(datasets$structure$keyfamilies, "name.value")
## # A tibble: 1,230 x 1
## keyfamily
## <chr>
## 1 Jobseeker's Allowance with rates and proportions
## 2 claimant count - age and duration
## 3 Jobseeker's Allowance by age and duration
## 4 claimant count denominators - historical workforce series
## 5 claimant count - occupation
## 6 claimant count - occupation, age and duration
## 7 Jobseeker's Allowance off-flows by reason, age and duration
## 8 Jobseeker's Allowance flows by age and duration
## 9 Jobseeker's Allowance flows
## 10 Claimant Count - seasonally adjusted
## # ... with 1,220 more rows
## get a dataset
nomis <- read_csv("https://www.nomisweb.co.uk/api/v01/dataset/NM_1_1.data.csv?geography=TYPE486&sex=7&item=1&measures=20100&time=latest") %>%
select(ITEM_NAME, GEOGRAPHY_NAME, GEOGRAPHY_CODE, DATE, MEASURES_NAME, OBS_VALUE, RECORD_COUNT)
head(nomis)
## # A tibble: 6 x 7
## ITEM_NAME GEOGRAPHY_NAME GEOGRAPHY_CODE DATE MEASURES_NAME OBS_VALUE
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 Total cla… Darlington E06000005 2018… Persons claim… 1554
## 2 Total cla… Hartlepool E06000001 2018… Persons claim… 801
## 3 Total cla… Middlesbrough E06000002 2018… Persons claim… 2963
## 4 Total cla… Redcar and Cle… E06000003 2018… Persons claim… 1959
## 5 Total cla… Stockton on Te… E06000004 2018… Persons claim… 2629
## 6 Total cla… Chester-le-Str… E07000054 2018… Persons claim… 450
## # ... with 1 more variable: RECORD_COUNT <int>
test <-"Julian Flowers is "
str_extract_all(test, pattern = "[A-Z]{1}[a-z]*")
## [[1]]
## [1] "Julian" "Flowers"
NOTE - this link does not seem to be working
The rClinicalCodes
package can be used to find code lists used in research on primary care datasets from the clinicalcodes.org repository.
The package has a a small number of functions of which these are particularly useful:
all_ClinicalCodes_articles
gives a list of the research papers for which there are code listsget_ClincalCodes
requires a URL and will download a dataframe for of codes by disease for that article.For example we can download the code list for diabetes used in Doran et al’s paper on withdrawing performance indicators.
#art1 <- rClinicalCodes::all_ClinicalCodes_articles()
# head(rClinicalCodes::get_ClinicalCodes(article_id = 5, codelist_name = "diabetes") %>%
# mutate_if(is.factor, as.character)) %>%
# knitr::kable()
Or we can review all the codes for a given article e.g the QOF business rules v24.
codelists <- get_ClinicalCodes(article_id = 1)
names <- colnames(purrr::map_df(codelists, colnames ))
codes <-get_ClinicalCodes(article_id = 1, codelist_name = names) %>%
mutate_if(is.factor, as.character) %>%
filter(code != "code")
head(codes)
and finally pull the codeset for a given disease area.
codelists <- get_ClinicalCodes(article_id = 1:52) ## pull all the code lists
#codelists[2] ## review
codelists1 <- map(codelists, "list_name") ## find all the list names
## find the list-names for your disease of interest e.g. copd
disease_list <- as.tibble(names(codelists1)) %>% arrange(value) %>%
filter(str_detect(value, "chronic-obstructive|copd")) %>%
pull(value)
complete <- data.frame()
for(i in seq_along(codelists)){
newdf <- codelists[[i]][1:6] %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.numeric, as.character)
complete <- bind_rows(complete, newdf)
}
## get copd codes
copd_codes <- complete %>%
filter(str_detect(list_name, "chronic-obstructive|copd"))
copd_codes %>%
knitr::kable()
The fingertipsR
package is designed to make it easy to extract Fingertips data into R for further analysis. It reads the Fingertips API and is more user-friendly than reading the API directly although for skilled programmers the API offers much more functionality.
The use of the API is introduced by this vignette and more details are available here.
The package can be downloaded from CRAN, or the development version from Github as below.
## Stable version
(install.packages("fingertipsR"))
## development version
devtools::install_github("PublicHealthEngland/fingertipsR")
library(fingertipsR)
We can extract a list of all the indicators:
fingertipsR::indicators_unique()
## # A tibble: 1,948 x 2
## IndicatorID IndicatorName
## <int> <fct>
## 1 247 Dementia QOF prevalence (all ages)
## 2 340 Income Deprivation Affecting Older People Index
## 3 734 Percentage of people in long-term unemployment
## 4 1164 Adults (18-64) with physical disabilities supported by adu…
## 5 1165 Adults (18-64) with learning disabilities supported by adu…
## 6 1166 Adults (18-64) with mental health problems supported by ad…
## 7 1167 Older people (65+) supported by adult social care througho…
## 8 1172 % of the total resident population who are 65-74 years old
## 9 1173 % of the total resident population who are 75-84 years old
## 10 1174 % of the total resident population who are 85 and over
## # ... with 1,938 more rows
Then get the metadata for a set of indicators or profiles.
metadata <- indicator_metadata(IndicatorID = c(247, 340, 734))
metadata %>%
formattable::formattable()
IndicatorID | Indicator | Indicator full name | Definition | Rationale | Policy | Data source | Indicator production | Indicator source | Methodology | Standard population/values | Confidence interval details | Source of numerator | Definition of numerator | Source of denominator | Definition of denominator | Disclosure control | Caveats | Copyright | Data re-use | Links | Indicator number | Notes | Frequency | Rounding | Data quality | Indicator Content | Unit | Value type | Year type | Polarity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
340 | Income deprived older people (60+) | Percentage of people aged 60 and over living in income deprived households (Income Deprivation Affecting Older People Index). | Adults aged 60 or over living in income-deprived households as a percentage of all adults aged 60 or over. The Income Deprivation Affecting Older People Index (IDAOPI) measures the proportion of all adults aged 60 or over living in income deprived families. It is a subset of the Income Deprivation Domain which measures the proportion of the population in an area experiencing deprivation relating to low income. The definition includes adults aged 60 or over receiving Income Support or income-based Jobseekers Allowance or income-based Employment and Support Allowance or Pension Credit (Guarantee). | Poverty is a risk factor for mental health for people of all ages. In older adults, poverty may add to the risk of social isolation already present for this group, as those in poverty have limited money available to participate in social activities. | NA | Department for Communities and Local Government (DCLG) | Public Health Data Science, Public Health England | https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015 | For all geographies apart from wards, the value of the IDAOPI domain score was applied to mid-2012 population estimates to obtain numerator estimates (estimated number of older people living in income-deprived households) for each LSOA. These numerator estimates and population estimates for LSOAs were then aggregated to MSOAs, local authorities and England as a whole (using geographical lookup tables obtained from ONS Geography). Percentage figures were then calculated based on these aggregated data. The estimates for wards were derived by the Local Government Association. LSOAs were matched to the ward in which they had the best fit (by size of area). Estimates were then calculated based on population weighted averages of the LSOA level IDAOPI scores. For GP practice estimates, LSOA level deprivation data were applied proportionally to the practice populations. | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Estimates and projections of the population aged 60 and over can be obtained from Projecting Older People Information System (POPPI) http://www.poppi.org.uk/ | NA | NA | NA | NA | 3 | NA | % | Proportion | Calendar | RAG - Low is good |
734 | Long-term unemployment: % of working age population | Long-term unemployment: % of population aged 16-64 that are long-term unemployed | The percentage of the working age population that have been claiming job seekers allowance for over 12 months. Population is based on mid-year estimate of that year, unless not available in which case previous year’s mid-year estimate is used. | There is a strong evidence to suggest that work is generally good for physical and mental health and wellbeing, taking into account the nature and quality of work and its social context, and that worklessness is associated with poorer physical and mental health. | NA | ONS NOMIS | NA | https://www.nomisweb.co.uk/Default.asp | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Pre 2011 Mid year Populationsare rebased populations from the 2011 census | NA | NA | 3 | NA | % | Crude rate | Calendar | RAG - Low is good |
247 | Dementia recorded prevalence (QOF): % of practice register (all ages) | Dementia recorded prevalence (QOF): % of patients on GP practice registers recorded as having dementia (all ages) | The recorded dementia prevalence is the number of people with dementia recorded on GP practice registers as a proportion of the people (all ages);registered at each GP practice, allocated to a local authority boundary using the postcode of the practice. |
Dementia is a syndrome characterised by an insidious but ultimately catastrophic, progressive global deterioration in intellectual function and is a main cause of late-life disability. The prevalence of dementia increases with age and is estimated to be approximately 20% at 80 years of age. The annual incidence of vascular dementia is 1.2/100 overall person years at risk and is the same in all age groups. Alzheimer’s disease accounts for 50 - 75% of cases of dementia. The annual incidence of dementia of the Alzheimer type rises to 34.3/100 person years at risk in the 90 year age group; the prevalence is higher in women than in men due to the longer lifespan of women. Other types of dementia such as Lewy Body dementia and fronto-temporal dementia are relatively rare but can be very distressing. In a third of cases, dementia is associated with other psychiatric symptoms (depressive disorder, adjustment disorder, generalised anxiety disorder, alcohol related problems). A complaint of subjective memory impairment is an indicator of dementia especially when there is altered functioning in terms of activities of daily living. |
NA | QOF | NA | Data for 2016/17 can be downloaded from http://digital.nhs.uk/catalogue/PUB30124 For other data years please find on the http://content.digital.nhs.uk/qof | NA | NA | NA | Quality Outcomes Framework (QOF), NHS Digital | The number of patients registered with dementia, as recorded on practice disease registers, allocated to a local authority boundary using the postcode of the practice. | Quality Outcomes Framework (QOF), NHS Digital | The number of people registered at GP practices (all ages), allocated to a local authority boundary using the postcode of the practice. | NA | This indicator is a measure of recorded prevalence and not actual prevalence and therefore under-reports groups who are less likely to be registered with a GP, such as ethnic minority populations, homeless people, migrants and travellers. QOF registers are constructed to underpin indicators on quality of care, and they do not necessarily equate to prevalence as may be defined by epidemiologists. Caution should be taken when interpreting this indicator as higher than average value may mean that the prevalence of the condition is high in an area, but it could also indicate that detection is better there; this is for local knowledge to determine.The intended measure is a resident-based measure at local authority geography. However the required base datasets are not available at the local authority level. Therefore a proxy registration based measure has been constructed which illustrates the proportion of individuals with dementia in the registered population apportioned to local authority geographies, based on the postcode of their GP practice. For local authorities that are geographically coterminous with CCGs, please note that values may not be the same as the CCG value. | Re-used with the permission of NHS Digital. All rights reserved. | NA | NA | NA | NHS Digital advises a number of practices in Somerset CCG opted out of national QOF arrangements for 2014/15 and 2015/16. This means that reporting on individual clinical indicators will appear lower than practices who have continued to deliver national QOF | NA | NA | 3 | NA | % | Proportion | Financial | BOB - Blue orange blue |
And lookup which units of analysis are available for these indicators:
ind_areas <- indicator_areatypes() %>%
filter(IndicatorID %in% c(247, 340, 734)) %>%
left_join(metadata) %>%
left_join(area_types()) %>%
select(IndicatorID, Indicator,AreaTypeID, AreaTypeName ) %>%
distinct()
ind_areas
## # A tibble: 15 x 4
## IndicatorID Indicator AreaTypeID AreaTypeName
## <int> <chr> <int> <chr>
## 1 247 Dementia recorded prevale… 153 CCG unchanged plus o…
## 2 247 Dementia recorded prevale… 152 CCG unchanged plus n…
## 3 247 Dementia recorded prevale… 120 Sustainability and T…
## 4 247 Dementia recorded prevale… 102 Counties and Unitary…
## 5 247 Dementia recorded prevale… 46 Sub-region (former L…
## 6 247 Dementia recorded prevale… 7 General Practice
## 7 340 Income deprived older peo… 153 CCG unchanged plus o…
## 8 340 Income deprived older peo… 152 CCG unchanged plus n…
## 9 340 Income deprived older peo… 104 PHEC 2015 new plus P…
## 10 340 Income deprived older peo… 102 Counties and Unitary…
## 11 340 Income deprived older peo… 101 Local authority dist…
## 12 340 Income deprived older peo… 8 Ward
## 13 340 Income deprived older peo… 7 General Practice
## 14 340 Income deprived older peo… 6 Government Office Re…
## 15 734 Long-term unemployment: %… 102 Counties and Unitary…
and then extract the data for each indicator for all available geographies (takes a few minutes).
data <- fingertips_data(IndicatorID = 247, AreaTypeID = c(120, 102, 46) )
head(data)
## # A tibble: 6 x 24
## IndicatorID IndicatorName ParentCode ParentName AreaCode AreaName
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 247 Dementia: QOF pr… <NA> <NA> E920000… England
## 2 247 Dementia: QOF pr… E92000001 England E920000… England
## 3 247 Dementia: QOF pr… E92000001 England E390000… London NHS…
## 4 247 Dementia: QOF pr… E92000001 England E390000… Wessex NHS…
## 5 247 Dementia: QOF pr… E92000001 England E390000… Cheshire a…
## 6 247 Dementia: QOF pr… E92000001 England E390000… Yorkshire …
## # ... with 18 more variables: AreaType <chr>, Sex <chr>, Age <chr>,
## # CategoryType <chr>, Category <chr>, Timeperiod <chr>, Value <dbl>,
## # LowerCI95.0limit <dbl>, UpperCI95.0limit <dbl>,
## # LowerCI99.8limit <dbl>, UpperCI99.8limit <dbl>, Count <dbl>,
## # Denominator <dbl>, Valuenote <chr>, RecentTrend <chr>,
## # ComparedtoEnglandvalueorpercentiles <chr>,
## # Comparedtosubnationalparentvalueorpercentiles <chr>,
## # TimeperiodSortable <int>
Largely designed for infectious disease epidemiology (https://cran.r-project.org/web/packages/epitools/epitools.pdf), but has functions for:
ageadjust.direct
; ageadjust.indirect
binom.conf.int
; pois.conf.int
oddsratio
; rateratio
; riskratio
As an example we can calculate Wilson CI for a proportion:
x <- 40
n <- 100
epitools::binom.wilson(x, n)
## x n proportion lower upper conf.level
## 1 40 100 0.4 0.3094013 0.4979974 0.95
library(epitools)
r243 <- matrix(c(12,2,7,9), 2, 2)
dimnames(r243) <- list(Diarrhea = c("Yes", "No"),
"Antibody level" = c("Low", "High")
)
r243
## Antibody level
## Diarrhea Low High
## Yes 12 7
## No 2 9
r243b <- t(r243)
r243b
## Diarrhea
## Antibody level Yes No
## Low 12 2
## High 7 9
## Chi sq
epitab(r243, rev = "b", verbose = TRUE)
## $tab
## Antibody level
## Diarrhea High p0 Low p1 oddsratio lower upper p.value
## No 9 0.5625 2 0.1428571 1.000000 NA NA NA
## Yes 7 0.4375 12 0.8571429 7.714286 1.283544 46.36398 0.02588706
##
## $measure
## [1] "wald"
##
## $conf.level
## [1] 0.95
##
## $pvalue
## [1] "fisher.exact"
##
## $x
## Antibody level
## Diarrhea High Low
## No 9 2
## Yes 7 12
##
## $data
## Antibody level
## Diarrhea High Low Total
## No 9 2 11
## Yes 7 12 19
## Total 16 14 30
##
## $p.exposed
## Antibody level
## Diarrhea High Low Total
## No 0.5625 0.1428571 0.3666667
## Yes 0.4375 0.8571429 0.6333333
## Total 1.0000 1.0000000 1.0000000
##
## $p.outcome
## Antibody level
## Diarrhea High Low Total
## No 0.8181818 0.1818182 1
## Yes 0.3684211 0.6315789 1
## Total 0.5333333 0.4666667 1
##
## $p.value
## two-sided
## Diarrhea midp.exact fisher.exact chi.square
## No NA NA NA
## Yes 0.02332167 0.02588706 0.01733469
##
## $correction
## [1] FALSE
## Risk ratio
epitab(r243, method="riskratio",rev = "b", verbose = TRUE)
## $tab
## Antibody level
## Diarrhea High p0 Low p1 riskratio lower upper
## No 9 0.8181818 2 0.1818182 1.000000 NA NA
## Yes 7 0.3684211 12 0.6315789 3.473684 0.9468914 12.74326
## Antibody level
## Diarrhea p.value
## No NA
## Yes 0.02588706
##
## $measure
## [1] "wald"
##
## $conf.level
## [1] 0.95
##
## $pvalue
## [1] "fisher.exact"
##
## $x
## Antibody level
## Diarrhea High Low
## No 9 2
## Yes 7 12
##
## $data
## Antibody level
## Diarrhea High Low Total
## No 9 2 11
## Yes 7 12 19
## Total 16 14 30
##
## $p.exposed
## Antibody level
## Diarrhea High Low Total
## No 0.5625 0.1428571 0.3666667
## Yes 0.4375 0.8571429 0.6333333
## Total 1.0000 1.0000000 1.0000000
##
## $p.outcome
## Antibody level
## Diarrhea High Low Total
## No 0.8181818 0.1818182 1
## Yes 0.3684211 0.6315789 1
## Total 0.5333333 0.4666667 1
##
## $p.value
## two-sided
## Diarrhea midp.exact fisher.exact chi.square
## No NA NA NA
## Yes 0.02332167 0.02588706 0.01733469
##
## $correction
## [1] FALSE
## Risk ratio
epitab(matrix(c(41, 15, 28010, 19017),2,2)[2:1,],
method="rateratio", verbose = TRUE)
## $tab
## Outcome
## Predictor Count Person-time rateratio lower upper p.value
## Exposed1 15 19017 1.000000 NA NA NA
## Exposed2 41 28010 1.855759 1.027226 3.352563 0.03545742
##
## $measure
## [1] "wald"
##
## $conf.level
## [1] 0.95
##
## $pvalue
## [1] "midp.exact"
##
## $x
## Outcome
## Predictor Count Person-time
## Exposed1 15 19017
## Exposed2 41 28010
##
## $data
## Outcome
## Predictor Count Person-time
## Exposed1 15 19017
## Exposed2 41 28010
## Total 56 47027
##
## $p.value
## two-sided
## Predictor midp.exact wald
## Exposed1 NA NA
## Exposed2 0.03545742 0.03736289
This package is intended to “provide a common syntax for the most frequent statistical analysis in epidemiology”. It uses the lattice
package for graphics, and has aglm_coef
function for display of regression models (NB the broom
pacakge is also very good for this), and has standard functions for epidemiology
There is a good vignette.
odds_trend(Cancer ~ Age, data = Macmahon)
## Odds Ratios with 95% CIs and p-values
##
## Exposure OR lower upper chi.square fisher.exact
## <20 1.00 NA NA <NA> <NA>
## 20-24 1.21 1.06 1.39 0.007 0.007
## 25-29 1.55 1.35 1.79 < 0.001 < 0.001
## 30-34 1.88 1.60 2.22 < 0.001 < 0.001
## >34 2.41 1.96 2.95 < 0.001 < 0.001
##
##
## Chi-squared Test for Trend in Proportions = 129.012
## 1 d.f., p < 0.001
The dstTest
package calculates directly standardised rates and thier confidence intervals using “several methods”.
These methods include:
From the vignette:
if(!require(dsrTest))install.packages("dsrTest")
library(dsrTest)
library(tidyverse, quietly = TRUE)
data("downs.mi")
b <- downs.mi %>%
filter(BirthOrder == 3)
d1 <- with(b, dsrTest(Cases, Births, Standard, mult = 100000, method = "gamma"))
d2 <- with(b, dsrTest(Cases, Births, Standard, mult = 100000, method = "dobson"))
d3 <- with(b, dsrTest(Cases, Births, Standard, mult = 100000, method = "bootstrap"))
dsr <- list(d1, d2, d3)
formattable::formattable(map_df(dsr, broom::tidy) %>%
cbind(method = c("gamma", "dobson", "boostrap")))
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
85.06922 | 6 | NA | 1e+05 | 77.18345 | 94.22357 | Directly standardized rate: Gamma Method for Weighted Sum of Poissons | two.sided |
85.06922 | 6 | NA | 1e+05 | 77.17350 | 93.52121 | Directly standardized rate: Dobson’s method for Weighted Sum of Poissons with Exact two-sided Poisson test (central method) | two.sided |
85.06922 | 6 | NA | 1e+05 | 77.34172 | 93.52023 | Directly standardized rate: Approximate bootstrap method for Weighted Sum of Poissons | two.sided |
The fantastic hansard
package provides an R interface the to http://www.data.parliament.uk API and returns results in tidy format.
It gives access to information about:
from both the Commons and the Lords.
library(hansard)
## Health topics
topics <- research_subtopics_list()
list <- topics$`Health services and medicine`
For example, the classification of health topics for parliamentary briefings are Alcoholism, Animal experiments, Communicable diseases, Diseases, Drugs misuse, Genetics, Health education and preventive medicine, Health finance, Health services, Health staff and professions, Medical ethics, Medicine, Mental health, Obesity, Patient rights and complaints, Smoking, Vaccination.
We can use this to see what briefings have been created about or including, say obesity.
briefings <- research_briefings(subtopic = list[14] )
briefings %>%
select(title, description)
## # A tibble: 17 x 2
## title description
## <chr> <chr>
## 1 Obesity S… 26% of adults in England are obese. A further 35% are overw…
## 2 World Can… "This House of Lords Library Briefing provides information …
## 3 The effec… "This pack has been prepared ahead of the debate to be held…
## 4 Social In… This Commons Library research briefing summarises a wide ra…
## 5 Key Issue… Key Issues 2017 is a series of short briefings on the topic…
## 6 Allocatio… This pack has been prepared ahead of the debate to be held …
## 7 Sugar and… This POSTnote summarises the health risks associated with e…
## 8 Debate pa… "This debate pack is prepared for the Backbench Business de…
## 9 Debate on… "This pack has been compiled ahead of the debate on A tax o…
## 10 Building … This In Focus briefing provides some statistics relating to…
## 11 Obesity L… This In Focus note examines the increasing number of adults…
## 12 Obesity T… A quarter of adults in the UK are clinically obese and ther…
## 13 Preventin… This POSTnote describes the causes of diabetes and the know…
## 14 Obesity This note sets out the definition and causes of obesity, th…
## 15 The impor… Article examining the relationship between taking part in s…
## 16 Diet and … This POSTnote looks at the latest scientific research, advi…
## 17 Food Prod… Food Products (Marketing to Children) Bill. (Bill 19 of 200…
We can also look at the questions answered by the SofS and the Department of Health.
First we need the id for any given MP (1572 for Jeremy Hunt) or the department name, and we can then extract the data as below.
members_search("Hunt")
## # A tibble: 19 x 12
## mnis_id home_page additional_name_v… family_name_val… full_name_value
## <chr> <chr> <chr> <chr> <chr>
## 1 2374 <NA> Crother Crother-Hunt Lord Crowther-…
## 2 1425 http://www… Simon Djanogly Mr Jonathan Dj…
## 3 2414 <NA> Charles Lindsay Gordon The Marquess o…
## 4 2021 <NA> Charles Gomer Gordon The Marquess o…
## 5 2313 <NA> John Clarence Wes… Hastings Earl of Huntin…
## 6 2022 <NA> Edward Robin Hood Hastings-Bass Earl of Huntin…
## 7 994 <NA> James Fletcher Hunt Lord Hunt of W…
## 8 2003 <NA> Cecil John Hunt Lord Hunt
## 9 1572 http://www… Richard Streynsham Hunt Mr Jeremy Hunt
## 10 2361 <NA> Henderson Hunt Lord Hunt of F…
## 11 2023 <NA> Joseph Benedict Hunt Lord Hunt of T…
## 12 1024 <NA> Leonard Hunt Sir John Hunt
## 13 2543 <NA> Charles Roland Hunt Lord Hunt of C…
## 14 2024 <NA> Alexander Hunt Lord Hunt of K…
## 15 4111 http://www… <NA> Hunt Tristram Hunt
## 16 48 <NA> Robert Frederick … Hunter Mr Andrew Hunt…
## 17 1598 http://www… James Hunter Mark Hunter
## 18 3018 <NA> Brockie Hunter Lord Hunter of…
## 19 119 <NA> <NA> Major Mr John Major
## # ... with 7 more variables: gender_value <chr>, given_name_value <chr>,
## # label_value <chr>, constituency_about <chr>,
## # constituency_label_value <chr>, party_value <chr>, twitter_value <chr>
commons_answered_questions(answered_by = 1572) %>%
select(date_of_answer_value, answer_text_value )
## # A tibble: 35 x 2
## date_of_answer_value answer_text_value
## <dttm> <chr>
## 1 2014-11-25 00:00:00 <p>One year on from the Government’s formal respo…
## 2 2015-01-13 00:00:00 <p>There are over 6,000 more nurses on our hospit…
## 3 2015-02-24 00:00:00 <p>Since May 2010, £179 million of signed savings…
## 4 2015-06-02 00:00:00 <p>Trusts placed in special measures receive tail…
## 5 2015-10-13 00:00:00 <p>£400 million in resilience money has been inve…
## 6 2015-11-17 00:00:00 <p>£400 million in resilience money has been inve…
## 7 2016-02-09 00:00:00 <p>In November, I announced a campaign to halve t…
## 8 2016-07-26 00:00:00 <p>We acknowledge that local government are vital…
## 9 2016-09-08 00:00:00 <p>This is an employer and employee matter betwee…
## 10 2016-09-08 00:00:00 <p>This is an employer and employee matter betwee…
## # ... with 25 more rows
phe_questions <- hansard_all_answered_questions(answering_body = "health", start_date = "2016-01-01")
phe_questions <- phe_questions[, -1] %>%
mutate(row = row_number())
filter(phe_questions, str_detect(answer_text_value, "Public Health England")) %>%
group_by(yearq = zoo::as.yearqtr(date_of_answer_value)) %>%
count() %>%
ggplot(aes(yearq, n)) +
geom_col() +
zoo::scale_x_yearqtr() +
labs(title = "Frequency of answered parliamentary questions involving PHE by quarter")
library(myScrapers)
filter(phe_questions, str_detect(question_text, "Public Health England")) %>%
select(row, date_of_answer_value, ng_dept_short_name_value, question_text, answer_text_value) %>%
create_bigrams(text = answer_text_value) %>%
count(bigram, sort = TRUE) %>%
with(wordcloud::wordcloud(bigram, n, min.freq = 5, max.words = "Inf" ))
Wordcloud of questions asked and answered in Parliament containg ‘Public Health England’
library(quanteda)
pheqn <- filter(phe_questions, str_detect(question_text, "Public Health England")) %>%
select(row, date_of_answer_value, ng_dept_short_name_value, question_text, answer_text_value, tabling_member_printed)
phecorp <- corpus(pheqn$answer_text_value)
docvars(phecorp) <- pheqn$question_text
phedfm <- dfm(phecorp, group = phecorp$docvar1, remove = stopwords("english"), stem = FALSE, remove_punct =TRUE, remove_numbers = TRUE)
phedfm <- dfm_remove(phedfm, c("uploads", "td", "tr", "attachment_data", "http", "https", "p", "_blank", "em", "br", "shif", "wp-content", "et", "al"))
topfeatures(phedfm, 100)
## health phe england public
## 498 222 222 221
## trial local government sexual
## 192 134 92 87
## national data nhs services
## 86 82 75 72
## research evidence including hiv
## 67 67 67 66
## review report department prep
## 64 64 61 58
## published impact commissioning group
## 56 54 52 51
## can clinics cancer programme
## 50 48 47 47
## drug available support information
## 47 46 46 46
## reproductive work service www.gov.uk
## 45 44 43 40
## target mental advice new
## 40 40 39 39
## clinical access number part
## 38 38 37 36
## authorities groups href contraception
## 36 35 35 35
## approval years patients sites
## 35 34 34 34
## risk provide care guidance
## 33 33 33 32
## centre range within provided
## 32 31 30 30
## products plan across commissioners
## 30 30 30 28
## people outcomes site process
## 28 28 28 27
## ensure also guide recommendations
## 27 27 27 26
## action working strong development
## 26 26 26 25
## training england's system framework
## 25 24 23 23
## committee authority eatwell well
## 23 23 23 22
## survey following year therefore
## 22 21 21 21
## reference currently high place
## 21 20 20 20
## officials food use likely
## 20 20 19 19
## needs monitoring women current
## 19 19 19 18
presDfm <- dfm_trim(phedfm, min_count = 3, max_docfreq = 3)
presDistMat <- textstat_dist(dfm_weight(presDfm, "relfreq"))
presCluster <- hclust(presDistMat)
presCluster$labels <- docnames(presDfm)
plot(presCluster, xlab = "", sub = "", main = "Clustering of public health related questions by MP", cex = 0.4, hang = -.5, labels = pheqn$tabling_member_printed )
if (require(topicmodels)) {
myLDAfit20 <- LDA(convert(presDfm, to = "topicmodels"), k = 10)
get_terms(myLDAfit20, 10)
}
## Topic 1 Topic 2 Topic 3
## [1,] "gbd" "fluorosis" "hepatitis"
## [2,] "menus" "dental" "introduction"
## [3,] "cft" "tbody" "genital"
## [4,] "agencies" "mild" "warts"
## [5,] "sugar" "mcgrady" "c"
## [6,] "dph" "newcastle" "hev"
## [7,] "council.lancashire.gov.uk" "manchester" "goals"
## [8,] "mgissuehistoryhome.aspx" "ins" "policies"
## [9,] "iid" "ministerial" "reductions"
## [10,] "papers" "higher" "period"
## Topic 4 Topic 5
## [1,] "emergency" "coverage"
## [2,] "iso" "cervical"
## [3,] "laboratory" "oral"
## [4,] "secretary" "incinerators"
## [5,] "laboratories" "interventions"
## [6,] "accredited" "black"
## [7,] "accreditation" "toolkit"
## [8,] "ripl" "waste"
## [9,] "list" "sharing"
## [10,] "ukas" "disadvantaged"
## Topic 6
## [1,] "transformation"
## [2,] "collections"
## [3,] "liver"
## [4,] "green"
## [5,] "methods"
## [6,] "commissioning-sexual-health-reproductive-health-and-hiv-services"
## [7,] "multidisciplinary"
## [8,] "boroughs"
## [9,] "nmhin"
## [10,] "efficient"
## Topic 7 Topic 8 Topic 9
## [1,] "www.nice.org.uk" "drug-related" "hospital"
## [2,] "vitamin" "royal" "volumes"
## [3,] "d" "performance" "symptoms"
## [4,] "quit" "ambitions" "survival"
## [5,] "zika" "force" "collect"
## [6,] "loss" "adph" "trials"
## [7,] "www.england.nhs.uk" "continence" "systemic"
## [8,] "pdf" "hee" "anti-cancer"
## [9,] "mhra" "pmhlwd" "confidentiality"
## [10,] "cigarettes" "accountability" "fund"
## Topic 10
## [1,] "water"
## [2,] "fluoridation"
## [3,] "reported"
## [4,] "fluoride"
## [5,] "reporting"
## [6,] "limitations"
## [7,] "ncras"
## [8,] "trusts"
## [9,] "authoritative"
## [10,] "sexually"
tidytext::tidy(myLDAfit20) %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = str_trunc(term, width = 20, side = "center")) %>%
#mutate(row = row_number()) %>%
ggplot(aes(reorder(term, beta), beta)) +
geom_col() +
coord_flip() +
theme(axis.text = element_text(size = rel(.8))) +
facet_wrap(~topic, scales = "free") +
labs(x = "terms")
To extract the tabular information from http://jech.bmj.com/content/71/8/827 in a reusable form is complex but potentially useful.
Steps:
tabulizer
package form rOpenSci. This requires rJava
which can be temperamental and might require a series of updates and applying the fix outlined here (for MacOS at least).fulltext
package. This is also available from rOpenSci and provides an interface to a range of bibliographic databases and through which you can download full text articles.It can be quite complex to extract a table even with these helpers and may require a fair bit of data tidying (the tidyr
package is very helpful here) and use of regular expressions.
This example shows how to extract the data in tables 1 and 2 from Masters R, et al. J Epidemiol Community Health 2017;71:827–834. doi:10.1136/jech-2016-208141
# install.packages("rJava")
library(rJava) # load and attach 'rJava' now
# install.packages("devtools")
# devtools::install_github("ropensci/tabulizer", args="--no-multiarch")
#
# devtools::install_github("ropensci/tabulizerjars", args="--no-multiarch")
library(tabulizer)
library(tabulizerjars)
tab <- extract_tables("~/Downloads/827.full.pdf")
data.frame(cbind(tab[[2]][3:11,1:4])) %>%
separate(X1, c("category", "median roi"), sep = "\\s(\\d)") %>%
separate(X2, remove = FALSE, c("roi min", "to", "roi max", "no roi studies", "median cbr"), sep = " ") %>%
select(-c(to, X2)) %>%
filter(category != "Specialism") %>%
rename(cbr_range = X3,
No_CBR_studies = X4 ) %>%
formattable::formattable(caption = "Table 1, Masters R, et al. J Epidemiol Community Health 2017;71:827–834. doi:10.1136/jech-2016-208141 ")
category | median roi | roi min | roi max | no roi studies | median cbr | cbr_range | No_CBR_studies |
---|---|---|---|---|---|---|---|
Overall | 4.3 | –21.3 | 221 | 34 | 8.3 | 0.7 to 29.4 | 23 |
Local level | .1 | 0.9 | 19.3 | 18 | 10.3 | 0.9 to 23.6 | 11 |
National level | 7.2 | –21.3 | 221 | 17 | 17 | 1.2 to 167. | 10 |
Health protection | 4.2 | 0.7 | 221 | 8 | 41.8 | 1.1 to 167 | 10 |
Legislation | 6.5 | 38 | 55 | 2 | 5.8 | 3 to 8.6 | 2 |
Health promotion | .2 | 0.7 | 6.2 | 12 | 14.4 | 2 to 29.4 | 3 |
Healthcare public health | .1 | 1.1 | 19.3 | 6 | None | None reported | None reported |
Wider determinants | .6 | 1.1 | 10.8 | 6 | 7.1 | 0.7 to 23.6 | 6 |
table <- data.frame(tab[[5]][-c(1:3),1:7])
colnames(table) <- tab[[5]][2,1:7]
table
##
## 1 Abelson et al10 Hib vaccination
## 2 Abelson et al10 HIV/AIDS prevention
## 3 Abelson et al10 Measles vaccination
## 4 Abelson et al10 Programmes to reduce rates of coronary heart disease
## 5 Abelson et al10 Programmes to reduce tobacco consumption
## 6 Abelson et al10 Road safety campaigns
## 7 Boccalini et al32 Universal hepatitis B vaccination
## 8
## 9 Bonin et al58 Parenting programmes for the prevention of persistent conduct
## 10 disorders
## 11 Drummond11 Needle exchange
## 12 Evans-Lacko et al12 Antistigma social marketing campaign
## 13 Garpenholt et al38 Hib vaccination
## 14 Gortmaker et al15 Sugar sweetened beverage tax
## 15 Gortmaker et al15 Eliminating tax subsidy of nutritionally poor food advertising to
## 16 children
## 17 Gould8 Lead paint control
## 18
## 19 Holtgrave et al34 HIV counselling, testing, referral and partner notification services
## 20 Hutchinson et al35 Expanded HIV testing
## 21 Kwon et al36 Needle exchange
## 22 Lokkerbol et al55 Telemedicine for depression
## 23
## 24 McGuire et al14 Family planning services
## 25 Miller et al16 Booster seats for 4–7 years olds
## 26 Nguyen et al37 Needle exchange
## 27 Nichol et al7 Influenza vaccination of healthy workers
## 28 Romano et al40 Folic acid fortification of grain
## 29 Trust for America13 Primary and secondary prevention programmes
## 30 Wang et al50 Universal school nursing services
## 31 White et al28 MMR vaccination
## 32 Ding et al30 Hospital-based postpartum influenza vaccination
## 33
## 34 Zhou et al38 Hib vaccination
## 35
## 36 Hib, haemophilus influenzae type b; MMR, measles, mumps and rubella.
## Benefit-cost Return on
## 1 Australia 1.06 Medical
## 2 Australia 4 Medical
## 3 Australia 167 Medical
## 4 Australia 11 Medical
## 5 Australia 2 Medical
## 6 Australia 3 Medical
## 7 Italy €2.78 Medical and
## 8 societal
## 9 England 7.89 Medical and
## 10 societal
## 11 Australia 1.2 Public health
## 12 England £0.7 to £1.90 Unclear
## 13 Sweden 1.59 Societal
## 14 USA $55 Medical
## 15 USA $38 Medical
## 16
## 17 USA $17 to $221 Medical and
## 18 societal
## 19 USA 20.09 Societal
## 20 USA $1.46 to $2.01 Health sector
## 21 Australia $A1.3 to $A5.5 Health sector
## 22 The €1.45 to €1.76 Medical
## 23 Netherlands
## 24 UK 11.09 to 29.39 Medical
## 25 USA 8.6 Medical
## 26 USA $3.48 Medical
## 27 USA −$21.27 to +$174.32 Societal
## 28 USA 4.3 to 6.1 Human capital
## 29 USA $6.2 Medical
## 30 USA $2.20 Societal
## 31 USA 14 Medical
## 32 USA $1.7 Medical and
## 33 societal
## 34 USA 5.4 Medical and
## 35 societal
## 36
## Time
## 1 5% 15 years
## 2 5% 25 years
## 3 5% 33 years
## 4 5% 40 years
## 5 5% 40 years
## 6 5% 40 years
## 7 3% 20 years
## 8
## 9 3.5% 35 years
## 10
## 11 5% Lifetime
## 12 None 1 year
## 13 5% 20 years
## 14 3% 10 years
## 15 3% 10 years
## 16
## 17 None Unclear
## 18
## 19 6% Lifetime
## 20 3% Lifetime
## 21 3% Lifetime
## 22 1.5% costs, 4% 5 years
## 23 effects
## 24 6% Lifetime
## 25 3% 3 years
## 26 None 1 year
## 27 3% 1 year
## 28 4% Lifetime
## 29 0% 10–20 years
## 30 None 1 year
## 31 10% Lifetime
## 32 3% 1 year
## 33
## 34 3% Lifetime
## 35
## 36