Automation holds significant potential to support the United Kingdom health care system and its recovery throughout the Coronavirus (COVID-19) pandemic. The Local Knowledge and Intelligence Service North East and Yorkshire (LKIS NEY) team of the Office for Health Improvement and Disparities (OHID) within the Department of Health and Social Care (DHSC) applies automated data sciences techniques in the context of public health intelligence. This publication presents one successful project that has used code to automatically extract data from a website to support evidence-based decision making on deaths and hospital beds during the pandemic.
The objective of automation is to speed up data extraction by allowing for multiple local authorities and metrics to be downloaded at once. Manual download methods are often restricted to extracting data from one area at a time, for just 1,000 rows and are limited to a maximum number of five metrics on each download.
The method and code discussed in this article can be applied to a range of documents such as briefs, academic papers or dissertations. This report aims to support the use of data sciences in public health intelligence by promoting and facilitating the development and implementation of new technological interventions to assist population health improvement
The method and code presented in this article enable the automatic extraction of data directly from the COVID-19 in the UK dashboard website:- https://coronavirus.data.gov.uk/details/download (accessed in March 2022) for the ‘North East and Yorkshire of England COVID-19 Weekly Briefing’. The data extraction consists of the most recent release of data for COVID-19 case rates, deaths and hospital bed occupancy indicators by the Integrated Care Systems (ICS) of Humber Coast and Vale, North East and North Cumbria, West Yorkshire, and South Yorkshire and Bassetlaw.
The method presented in this report follows one of the described method in developer’s guide ‘Bulk downloads Application Programming Interface (API) v2’ https://coronavirus.data.gov.uk/details/developers-guide/bulk-downloads-api (accessed in March 2022) by using RStudio. The API method of data extraction is chosen because it is the simplest way to gather data for this briefing. A table of the sample dataset is shown after each block of code to aid interpretation and learning.
Import an external ICS lookup in a CSV format using only the ICS and area_code column.
Table 5 shows an example of the ICS and area_code. The areas include the NHS trusts, LTLA and Region.
lkp_ics_codes <- read.csv("ney_ics_lkp.csv", stringsAsFactors = F) %>%
select(area_code, ics)
| area_code | ics |
|---|---|
| E08000019 | SYB |
| E08000035 | WY |
| E06000002 | NENC |
| RXF | WY |
| RJL | HCV |
| R08 | NENC |
This code is used to save the data in an XLSX format. The COVID-19 case rates and deaths by the North East region and LTLA data is saved as ‘Region_LTLA’ and the COVID-19 hospital bed occupancy by the NHS trusts data is saved as ‘NHS_Trust’.
write_xlsx(region_ltla_all_ics, "Region_LTLA.xlsx")
write_xlsx(nhs_all_ics, "NHS_Trust.xlsx")
This section presents all of the discussed code. It begins with the COVID-19 case rates and deaths by the North East region and LTLA. Then, the COVID-19 hospital bed occupancy by the NHS trust. Finally, the datasets are saved in an XLSX format.
##########################################################################################
######################### Region and LTLA ################################################
######################### Extraction for Today on COVID19 case rate and deaths ###########
region_code <- 'E12000001' #North East region
api_link_reg <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
"areaType=region",
"&areaCode=", region_code,
"&metric=newCasesBySpecimenDateRollingRate",
"&metric=newOnsDeathsByRegistrationDate",
"&format=csv")
df_region_today <- read.csv(api_link_reg)
######################### Extraction for Today ALL LTLA ##################################
ltla_code <- c('E08000016', 'E08000017', 'E08000018', 'E08000019', 'E07000171',
'E08000032', 'E08000033', 'E08000034', 'E08000035', 'E08000036',
'E07000163', 'E07000028', 'E07000029', 'E07000030', 'E06000011',
'E06000010', 'E06000012', 'E06000013', 'E06000014', 'E07000164',
'E07000165', 'E07000166', 'E07000167', 'E07000168', 'E07000169',
'E07000026', 'E06000001', 'E06000002', 'E06000003', 'E06000004',
'E06000005', 'E06000047', 'E06000057', 'E08000021', 'E08000022',
'E08000023', 'E08000024', 'E08000037')
df_ltla <- list()
for(i in ltla_code)
{
api_link_la <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
"areaType=ltla",
"&areaCode=", i,
"&metric=newCasesBySpecimenDateRollingRate",
"&metric=newOnsDeathsByRegistrationDate",
"&format=csv")
df <- read.csv(api_link_la)
df_ltla[[i]] <- df
}
df_ltla_today <- do.call(rbind, df_ltla)
######################### Bind LTLA and regional data ##################################
region_ltla_today <- rbind(df_region_today, df_ltla_today)
##########################################################################################
######################### NHS TRUSTS #####################################################
######################### Extraction for Today NHS TRUSTS ################################
Sys.sleep(30)
trust_code <- c('RCF', 'RFF', 'TAD', 'RWY', 'RXP', 'RX4',
'RP5', 'RR7', 'RCD', 'RWA', 'RR8', 'RXF',
'RTD', 'RNN', 'RVW', 'RJL', 'RFR', 'RCU',
'RHQ', 'RTR', 'R0B', 'RCB')
df_nhs = list()
for(i in trust_code)
{
api_link_nhs <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
"areaType=nhsTrust",
"&areaCode=", i,
"&metric=hospitalCases",
"&format=csv")
df_1 <- read.csv(api_link_nhs)
df_nhs[[i]] <- df_1
}
df_nhs <- do.call(rbind, df_nhs)
##########################################################################################
######################### Add in ICS codes ###############################################
lkp_ics_codes <- read.csv("R code/ney_ics_lkp.csv", stringsAsFactors = F) %>%
select(area_code, ics)
region_ltla_all_ics <- left_join(region_ltla_today, lkp_ics_codes,
by = c("areaCode" = "area_code"), keep = F)
nhs_all_ics <- left_join(nhs_all, lkp_ics_codes,
by = c("areaCode" = "area_code"), keep = F)
write_xlsx(region_ltla_all_ics, "Region_LTLA.xlsx")
write_xlsx(nhs_all_ics, "NHS_Trust.xlsx")
##########################################################################################
######################### Region and LTLA ################################################
######################### Extraction for Today on COVID19 case rate and deaths ###########
region_code <- 'E12000001' #North East region
api_link_reg <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
"areaType=region",
"&areaCode=", region_code,
"&metric=newCasesBySpecimenDateRollingRate",
"&metric=newOnsDeathsByRegistrationDate",
"&format=csv")
df_region_today <- read.csv(api_link_reg)
######################### Extraction for Today ALL LTLA ##################################
ltla_code <- c('E08000016', 'E08000017', 'E08000018', 'E08000019', 'E07000171',
'E08000032', 'E08000033', 'E08000034', 'E08000035', 'E08000036',
'E07000163', 'E07000028', 'E07000029', 'E07000030', 'E06000011',
'E06000010', 'E06000012', 'E06000013', 'E06000014', 'E07000164',
'E07000165', 'E07000166', 'E07000167', 'E07000168', 'E07000169',
'E07000026', 'E06000001', 'E06000002', 'E06000003', 'E06000004',
'E06000005', 'E06000047', 'E06000057', 'E08000021', 'E08000022',
'E08000023', 'E08000024', 'E08000037')
df_ltla <- list()
for(i in ltla_code)
{
api_link_la <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
"areaType=ltla",
"&areaCode=", i,
"&metric=newCasesBySpecimenDateRollingRate",
"&metric=newOnsDeathsByRegistrationDate",
"&format=csv")
df <- read.csv(api_link_la)
df_ltla[[i]] <- df
}
df_ltla_today <- do.call(rbind, df_ltla)
######################### Combine New Old of Region and LTLA #############################
region_ltla_today <- rbind(df_region_today, df_ltla_today)
region_ltla_old <- read_excel("Region_LTLA.xlsx") %>% select(-ics)
df_new <- setdiff(region_ltla_today$date, region_ltla_old$date)
rows <- which(region_ltla_today$date %in% df_new)
df_today_new <- region_ltla_today[rows,]
region_ltla_all <- rbind(region_ltla_old,df_today_new)
##########################################################################################
######################### NHS TRUSTS #####################################################
######################### Extraction for Today NHS TRUSTS ################################
Sys.sleep(30)
trust_code <- c('RCF', 'RFF', 'TAD', 'RWY', 'RXP', 'RX4',
'RP5', 'RR7', 'RCD', 'RWA', 'RR8', 'RXF',
'RTD', 'RNN', 'RVW', 'RJL', 'RFR', 'RCU',
'RHQ', 'RTR', 'R0B', 'RCB')
df_nhs = list()
for(i in trust_code)
{
api_link_nhs <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
"areaType=nhsTrust",
"&areaCode=", i,
"&metric=hospitalCases",
"&format=csv")
df_1 <- read.csv(api_link_nhs)
df_nhs[[i]] <- df_1
}
df_nhs <- do.call(rbind, df_nhs)
######################### Combine New and Old NHS TRUST ##################################
nhs_old <- read_excel("NHS_Trust.xlsx") %>% select(-ics)
df_new <- setdiff(df_nhs$date, nhs_old$date)
rows <- which(df_nhs$date %in% df_new)
df_today_new <- df_nhs[rows,]
nhs_all <- rbind(nhs_old,df_today_new)
##########################################################################################
######################### Add in ICS codes ###############################################
lkp_ics_codes <- read.csv("R code/ney_ics_lkp.csv", stringsAsFactors = F) %>%
select(area_code, ics)
region_ltla_all_ics <- left_join(region_ltla_all, lkp_ics_codes,
by = c("areaCode" = "area_code"), keep = F)
nhs_all_ics <- left_join(nhs_all, lkp_ics_codes,
by = c("areaCode" = "area_code"), keep = F)
write_xlsx(region_ltla_all_ics, "Region_LTLA.xlsx")
write_xlsx(nhs_all_ics, "NHS_Trust.xlsx")
The authors wish to thank Eleanor Mackey, Leon Green, Elizabeth Mannion, Toby Johnstone, Kristy Roe, Barbara Coyle, Chris Skelly and Daniel Reidpath for their contributions to this article.