Background

Automation holds significant potential to support the United Kingdom health care system and its recovery throughout the Coronavirus (COVID-19) pandemic. The Local Knowledge and Intelligence Service North East and Yorkshire (LKIS NEY) team of the Office for Health Improvement and Disparities (OHID) within the Department of Health and Social Care (DHSC) applies automated data sciences techniques in the context of public health intelligence. This publication presents one successful project that has used code to automatically extract data from a website to support evidence-based decision making on deaths and hospital beds during the pandemic.

Objective

The objective of automation is to speed up data extraction by allowing for multiple local authorities and metrics to be downloaded at once. Manual download methods are often restricted to extracting data from one area at a time, for just 1,000 rows and are limited to a maximum number of five metrics on each download.

The method and code discussed in this article can be applied to a range of documents such as briefs, academic papers or dissertations. This report aims to support the use of data sciences in public health intelligence by promoting and facilitating the development and implementation of new technological interventions to assist population health improvement

Method

The method and code presented in this article enable the automatic extraction of data directly from the COVID-19 in the UK dashboard website:- https://coronavirus.data.gov.uk/details/download (accessed in March 2022) for the ‘North East and Yorkshire of England COVID-19 Weekly Briefing’. The data extraction consists of the most recent release of data for COVID-19 case rates, deaths and hospital bed occupancy indicators by the Integrated Care Systems (ICS) of Humber Coast and Vale, North East and North Cumbria, West Yorkshire, and South Yorkshire and Bassetlaw.

The method presented in this report follows one of the described method in developer’s guide ‘Bulk downloads Application Programming Interface (API) v2’ https://coronavirus.data.gov.uk/details/developers-guide/bulk-downloads-api (accessed in March 2022) by using RStudio. The API method of data extraction is chosen because it is the simplest way to gather data for this briefing. A table of the sample dataset is shown after each block of code to aid interpretation and learning.

1 Coronavirus case rates and deaths data

This section describes the process of preparing, extracting and combining COVID-19 case rates and deaths data for the briefing. The metrics or indicators of interest are ‘newCasesBySpecimenDateRollingRate’, which refers to the new cases rolling seven day COVID-19 rates, and ‘newOnsDeathsByRegistrationDate’ which refers to new registered COVID-19 deaths reported by the Office for National Statistics (ONS) https://coronavirus.data.gov.uk/metrics/doc/newOnsDeathsByRegistrationDate (accessed in March 2022).

1.1 Extract the latest data

This block of code can be used to extract data for the chosen region and lower tier local authority (LTLA) geographies. For this briefing, it is total of 40 areas consisting of one region and 39 LTLAs.

1.1.1 For region

This code is for the North East of England region. First, assign the relevant ONS area code to region_code. In this case the region code for North East is ‘E12000001’. The region code can be found at https://geoportal.statistics.gov.uk/documents/regions-december-2021-names-and-codes-in-england/about (accessed in March 2022).

The string assigned to api_link_reg is used with the R function read.csv() to establish a connection between the website and RStudio by using the dashboard’s API and extracting the data by specifying the area type as region, area code as the region_code and metrics in comma-separated values (CSV) format.

region_code <- 'E12000001'
api_link_reg <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                        "areaType=region",
                         "&areaCode=", region_code,
                          "&metric=newCasesBySpecimenDateRollingRate",
                           "&metric=newOnsDeathsByRegistrationDate",
                             "&format=csv")
df_region_today <- read.csv(api_link_reg)

After the extraction, the dataset is stored as df_region_today. Table 1 shows a part of the North East new cases rolling seven day COVID-19 rates and registered COVID-19 deaths from the ONS.

Table 1: Example df_region_today
areaCode	areaName	areaType	date	newCasesBySpecimenDateRollingRate	newOnsDeathsByRegistrationDate
E12000001	North East	region	2021-07-13	889.3	NA
E12000001	North East	region	2021-07-12	867.3	NA
E12000001	North East	region	2021-07-11	845.8	NA
E12000001	North East	region	2021-07-10	828.8	NA
E12000001	North East	region	2021-07-09	807.4	NA
E12000001	North East	region	2021-07-08	776.7	NA

1.1.2 For LTLA

The code below uses a loop technique to repeat a sequence of instructions under certain conditions.

The command Sys.sleep(30) is used to pause the code for the 30 seconds to prevent triggering a ‘429 Too Many Requests’ error. If an error still appears, the user may need to increase the time interval that the code is paused for by increasing the number in the bracket.

Before starting the loop all the 39 LTLAs codes are assigned to ltla_code and list() is used to create the list df_ltla, to store the results of the loop. The LTLAs code can be found at https://geoportal.statistics.gov.uk/datasets/ons::lower-tier-local-authority-to-upper-tier-local-authority-april-2021-lookup-in-england-and-wales/about (accessed in March 2022).

This is followed with a dynamic bit of code where an index i is iteratively replaced by each local authority code in the ltla_code. For example, the loop starts by replacing i with E08000016 and runs through everything between {}. Then df_ltla[[i]] stores the extracted data into the created list df_ltla.

Finally, all the rows of the LTLA in the list are combined into one data frame and stored as df_ltla_today.

Sys.sleep(30) 

ltla_code <- c('E08000016', 'E08000017', 'E08000018', 'E08000019', 'E07000171', 
               'E08000032', 'E08000033', 'E08000034', 'E08000035', 'E08000036', 
               'E07000163', 'E07000028', 'E07000029', 'E07000030', 'E06000011', 
               'E06000010', 'E06000012', 'E06000013', 'E06000014', 'E07000164', 
               'E07000165', 'E07000166', 'E07000167', 'E07000168', 'E07000169', 
               'E07000026', 'E06000001', 'E06000002', 'E06000003', 'E06000004', 
               'E06000005', 'E06000047', 'E06000057', 'E08000021', 'E08000022', 
               'E08000023', 'E08000024', 'E08000037')

df_ltla <- list() 

for(i in ltla_code) 
{ 
 api_link_la <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                         "areaType=ltla",
                          "&areaCode=", i,
                           "&metric=newCasesBySpecimenDateRollingRate",  
                            "&metric=newOnsDeathsByRegistrationDate",
                             "&format=csv")
 df  <- read.csv(api_link_la)  
 df_ltla[[i]] <- df 
}

df_ltla_today <- do.call(rbind, df_ltla)

Table 2 shows a part of the 39 LTLAs new cases rolling seven day COVID-19 rates and registered COVID-19 deaths from the ONS.

Table 2: Example df_ltla_today
areaCode	areaName	areaType	date	newCasesBySpecimenDateRollingRate	newOnsDeathsByRegistrationDate
E06000011	East Riding of Yorkshire	ltla	2020-12-11	129.0	26
E06000014	York	ltla	2021-08-05	320.4	NA
E08000033	Calderdale	ltla	2021-03-11	103.6	NA
E07000166	Richmondshire	ltla	2021-07-05	281.0	NA
E07000165	Harrogate	ltla	2021-03-28	26.7	NA
E08000022	North Tyneside	ltla	2021-10-23	774.6	NA

1.2 Merge data

1.2.1 Merge the region and LTLAs data

This line of code is used to merge all area data into one data frame and stored as region_ltla_today.

region_ltla_today <- rbind(df_region_today, df_ltla_today)

Table 3 shows a part of the new COVID-19 cases rates and deaths by the North East region and LTLA.

Table 3: Example region_ltla_today
areaCode	areaName	areaType	date	newCasesBySpecimenDateRollingRate	newOnsDeathsByRegistrationDate
E12000001	North East	region	2022-05-26	69.8	NA
E12000001	North East	region	2022-05-25	72.7	NA
E12000001	North East	region	2022-05-24	74.4	NA
E12000001	North East	region	2022-05-23	77.7	NA
E12000001	North East	region	2022-05-22	81.5	NA
E12000001	North East	region	2022-05-21	84.6	NA

1.2.2 Merge today and previous extracted

This block of code is used to prepare and combine the most recent data and the previously extracted dataset for all areas of interest. Note: when running the method for the first time, skip this block of code.

First, upload the previous (initial) data in an MS Excel (XLSX) format to RStudio and remove the ICS column to region_ltla_old. The ICS may refer to the section 3.
Secondly, identify the new data by comparing the most recent with the previously extracted datasets and assign as df_new. Then, use df_new to identify the most recent dates that appear in region_ltla_today and assign this to df_today_new.
Finally, combine the new and previous extracted data into region_ltla_all.

region_ltla_old <- read_excel("Region_LTLA.xlsx") %>% select(-ics)
df_new <- setdiff(region_ltla_today$date, region_ltla_old$date)
rows <- which(region_ltla_today$date %in% df_new)
df_today_new <- region_ltla_today[rows,]
region_ltla_all <- rbind(region_ltla_old,df_today_new)

2 Coronavirus hospital bed occupancy data

This section describes preparing, extracting and combining COVID-19 hospital bed occupancy data for the briefing. The metric or indicator of interest is ‘hospitalCases’, which refer to daily numbers of confirmed COVID-19 patients in hospital and it includes data from 22 National Health Service (NHS) trusts. The LTLAs code can be found at https://odsportal.digital.nhs.uk/ (accessed in March 2022).

2.1 Extract the latest data

This block of code is also based on loop techniques similar to the method described in section 1.1.2, the code begins with Sys.sleep(30), then lists all the NHS trusts code into the trust_code and finally creates a list to store the results which is the df_nhs.

This is followed by the code where index i is iteratively replaced by each NHS trust code in the trust_code. For example, the loop starts by replacing i with RCF and then runs through all the trust codes between {}, (similar to section 1.1.2). Each iteration is stored in df_nhs[[i]]; this is then converted into a dataframe outside the loop using do.call().

Finally, combine the rows of all the NHS trusts in the list into one data frame and assign as df_nhs.

Sys.sleep(30) 

trust_code <- c('RCF', 'RFF', 'TAD', 'RWY', 'RXP', 'RX4', 
                'RP5', 'RR7', 'RCD', 'RWA', 'RR8', 'RXF', 
                'RTD', 'RNN', 'RVW', 'RJL', 'RFR', 'RCU', 
                'RHQ', 'RTR', 'R0B', 'RCB')

df_nhs <- list()

for(i in trust_code) 
{
 api_link_nhs   <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                            "areaType=nhsTrust",
                             "&areaCode=", i, 
                              "&metric=hospitalCases",
                               "&format=csv")
 df_1 <- read.csv(api_link_nhs)
 df_nhs[[i]] <- df_1
}

df_nhs <- do.call(rbind, df_nhs)

Table 4 shows a part of new hospital bed occupancy by the NHS trusts.

Table 4: Example df_nhs
areaCode	areaName	areaType	date	hospitalCases
RWA	Hull University Teaching Hospitals NHS Trust	nhsTrust	2022-03-19	184
RX4	Cumbria, Northumberland, Tyne and Wear NHS Foundation Trust	nhsTrust	2020-07-27	0
RFR	Rotherham NHS Foundation Trust	nhsTrust	2020-07-06	22
RX4	Cumbria, Northumberland, Tyne and Wear NHS Foundation Trust	nhsTrust	2021-08-26	2
TAD	Bradford District Care NHS Foundation Trust	nhsTrust	2022-03-19	0
RWA	Hull University Teaching Hospitals NHS Trust	nhsTrust	2021-08-06	45

2.2 Merge today and previous extracted data

This block of code is used to prepare and combine the most recent data with the previously extracted dataset for all areas of interest. Note: when running the method for the first time, skip this block of code.

First, upload the previous data ‘NHS_Trust.xlsx’ to RStudio and remove the ICS column. Then, identify any new data by comparing the most recent with the previously extracted datasets and assign as df_new.
Next, use df_new to identify the most recent dates that appear in df_nhs and assign this to df_today_new.
Finally, combine the new and previous extracted data into nhs_all.

nhs_old <- read_excel("NHS_Trust.xlsx") %>% select(-ics) 
df_new <- setdiff(df_nhs$date, nhs_old$date)
rows <- which(df_nhs$date%in% df_new)
df_today_new <- df_nhs[rows,]
nhs_all <- rbind(nhs_old,df_today_new)

3 Classify the areas in ICS

Import an external ICS lookup in a CSV format using only the ICS and area_code column.

Table 5 shows an example of the ICS and area_code. The areas include the NHS trusts, LTLA and Region.

lkp_ics_codes <- read.csv("ney_ics_lkp.csv", stringsAsFactors = F) %>% 
                 select(area_code, ics)

Table 5: Example lkp_ics_codes
area_code	ics
E08000019	SYB
E08000035	WY
E06000002	NENC
RXF	WY
RJL	HCV
R08	NENC

3.1 Coronavirus case rates and deaths data

Merge the COVID-19 case rate and deaths data with the ICS and using area code as key indicator.

Table 6 shows a part of the COVID-19 case rate and deaths with the ICS data.

3.1.1 Running the script for the first time

The following line of code can be run only when using this method for the first time.

region_ltla_all_ics <- left_join(df_ltla_today, lkp_ics_codes, 
                                 by = c("areaCode" = "area_code"), keep = F)

3.1.2 Subsequent running the script

region_ltla_all_ics <- left_join(region_ltla_all, lkp_ics_codes, 
                                 by = c("areaCode" = "area_code"), keep = F)

Table 6: Example region_ltla_all_ics
areaCode	areaName	areaType	date	newCasesBySpecimenDateRollingRate	newOnsDeathsByRegistrationDate	ics
E12000001	North East	region	2022-02-23	251.4	NA	NENC
E08000016	Barnsley	ltla	2022-02-23	258.8	NA	SYB
E08000017	Doncaster	ltla	2022-02-23	195.3	NA	SYB
E08000018	Rotherham	ltla	2022-02-23	227.9	NA	SYB
E08000019	Sheffield	ltla	2022-02-23	264.6	NA	SYB
E07000171	Bassetlaw	ltla	2022-02-23	317.0	NA	SYB

3.2 Coronavirus hospital bed occupancy data

Merge the COVID-19 hospital bed occupancy data using the ICS and using area code as key indicator.

Table 7 shows a part of the COVID-19 hospital bed occupancy with the ICS.

nhs_all_ics <- left_join(df_nhs, lkp_ics_codes, 
                         by = c("areaCode" = "area_code"),keep = F)

nhs_all_ics <- left_join(nhs_all, lkp_ics_codes, 
                         by = c("areaCode" = "area_code"),keep = F)

Table 7: Example nhs_all_ics
areaCode	areaName	areaType	date	hospitalCases	ics
RCB	York Teaching Hospital NHS Foundation Trust	nhsTrust	2020-12-09	82	HCV
RR8	Leeds Teaching Hospitals NHS Trust	nhsTrust	2020-07-06	55	WY
RCF	Airedale NHS Foundation Trust	nhsTrust	2020-08-06	5	WY
RR8	Leeds Teaching Hospitals NHS Trust	nhsTrust	2021-01-21	259	WY
RCF	Airedale NHS Foundation Trust	nhsTrust	2021-02-09	42	WY
RP5	Doncaster and Bassetlaw Teaching Hospitals NHS Foundation Trust	nhsTrust	2021-11-21	64	SYB

4 Save dataset in a working directory in MS Excel (XLSX) format

This code is used to save the data in an XLSX format. The COVID-19 case rates and deaths by the North East region and LTLA data is saved as ‘Region_LTLA’ and the COVID-19 hospital bed occupancy by the NHS trusts data is saved as ‘NHS_Trust’.

write_xlsx(region_ltla_all_ics, "Region_LTLA.xlsx")
write_xlsx(nhs_all_ics, "NHS_Trust.xlsx")

5. The complete script for the method

This section presents all of the discussed code. It begins with the COVID-19 case rates and deaths by the North East region and LTLA. Then, the COVID-19 hospital bed occupancy by the NHS trust. Finally, the datasets are saved in an XLSX format.

5.1 Initial script

##########################################################################################
######################### Region and LTLA ################################################


######################### Extraction for Today on COVID19 case rate and deaths ###########

region_code <- 'E12000001' #North East region

api_link_reg <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                        "areaType=region",
                         "&areaCode=", region_code,
                          "&metric=newCasesBySpecimenDateRollingRate",  
                           "&metric=newOnsDeathsByRegistrationDate",     
                             "&format=csv")

df_region_today <- read.csv(api_link_reg) 



######################### Extraction for Today ALL LTLA ##################################

ltla_code <- c('E08000016', 'E08000017', 'E08000018', 'E08000019', 'E07000171', 
               'E08000032', 'E08000033', 'E08000034', 'E08000035', 'E08000036', 
               'E07000163', 'E07000028', 'E07000029', 'E07000030', 'E06000011', 
               'E06000010', 'E06000012', 'E06000013', 'E06000014', 'E07000164', 
               'E07000165', 'E07000166', 'E07000167', 'E07000168', 'E07000169', 
               'E07000026', 'E06000001', 'E06000002', 'E06000003', 'E06000004', 
               'E06000005', 'E06000047', 'E06000057', 'E08000021', 'E08000022', 
               'E08000023', 'E08000024', 'E08000037')


df_ltla   <- list()

for(i in ltla_code) 
{ 
  api_link_la <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                         "areaType=ltla",
                          "&areaCode=", i,
                           "&metric=newCasesBySpecimenDateRollingRate", 
                            "&metric=newOnsDeathsByRegistrationDate",
                             "&format=csv")
  
  df  <- read.csv(api_link_la)  
  
  df_ltla[[i]] <- df
}

df_ltla_today    <- do.call(rbind, df_ltla)

######################### Bind LTLA and regional data ##################################

region_ltla_today <- rbind(df_region_today, df_ltla_today) 


##########################################################################################
######################### NHS TRUSTS #####################################################


######################### Extraction for Today NHS TRUSTS ################################

Sys.sleep(30) 

trust_code <- c('RCF', 'RFF', 'TAD', 'RWY', 'RXP', 'RX4', 
                'RP5', 'RR7', 'RCD', 'RWA', 'RR8', 'RXF', 
                'RTD', 'RNN', 'RVW', 'RJL', 'RFR', 'RCU', 
                'RHQ', 'RTR', 'R0B', 'RCB')

df_nhs =  list()

for(i in trust_code) 
{
  api_link_nhs   <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                            "areaType=nhsTrust",
                             "&areaCode=", i, 
                              "&metric=hospitalCases",
                               "&format=csv")
  
  df_1 <- read.csv(api_link_nhs)
  
  df_nhs[[i]] <- df_1
}

df_nhs    <- do.call(rbind, df_nhs)



##########################################################################################
######################### Add in ICS codes ###############################################

lkp_ics_codes <- read.csv("R code/ney_ics_lkp.csv", stringsAsFactors = F) %>%
                 select(area_code, ics)
                   
region_ltla_all_ics  <- left_join(region_ltla_today, lkp_ics_codes,
                        by = c("areaCode" = "area_code"), keep = F)
nhs_all_ics <- left_join(nhs_all, lkp_ics_codes, 
               by = c("areaCode" = "area_code"), keep = F)


write_xlsx(region_ltla_all_ics, "Region_LTLA.xlsx")
write_xlsx(nhs_all_ics, "NHS_Trust.xlsx")

5.2 Subsequent script

##########################################################################################
######################### Region and LTLA ################################################


######################### Extraction for Today on COVID19 case rate and deaths ###########

region_code <- 'E12000001' #North East region

api_link_reg <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                        "areaType=region",
                         "&areaCode=", region_code,
                          "&metric=newCasesBySpecimenDateRollingRate",  
                           "&metric=newOnsDeathsByRegistrationDate",     
                             "&format=csv")

df_region_today <- read.csv(api_link_reg) 



######################### Extraction for Today ALL LTLA ##################################

ltla_code <- c('E08000016', 'E08000017', 'E08000018', 'E08000019', 'E07000171', 
               'E08000032', 'E08000033', 'E08000034', 'E08000035', 'E08000036', 
               'E07000163', 'E07000028', 'E07000029', 'E07000030', 'E06000011', 
               'E06000010', 'E06000012', 'E06000013', 'E06000014', 'E07000164', 
               'E07000165', 'E07000166', 'E07000167', 'E07000168', 'E07000169', 
               'E07000026', 'E06000001', 'E06000002', 'E06000003', 'E06000004', 
               'E06000005', 'E06000047', 'E06000057', 'E08000021', 'E08000022', 
               'E08000023', 'E08000024', 'E08000037')


df_ltla   <- list()

for(i in ltla_code) 
{ 
  api_link_la <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                         "areaType=ltla",
                          "&areaCode=", i,
                           "&metric=newCasesBySpecimenDateRollingRate", 
                            "&metric=newOnsDeathsByRegistrationDate",
                             "&format=csv")
  
  df  <- read.csv(api_link_la)  
  
  df_ltla[[i]] <- df
}

df_ltla_today    <- do.call(rbind, df_ltla)



######################### Combine New Old of Region and LTLA #############################

region_ltla_today <- rbind(df_region_today, df_ltla_today)
region_ltla_old <- read_excel("Region_LTLA.xlsx") %>% select(-ics)
df_new <- setdiff(region_ltla_today$date, region_ltla_old$date)
rows <- which(region_ltla_today$date %in% df_new)
df_today_new <- region_ltla_today[rows,]
region_ltla_all <- rbind(region_ltla_old,df_today_new)



##########################################################################################
######################### NHS TRUSTS #####################################################


######################### Extraction for Today NHS TRUSTS ################################

Sys.sleep(30) 

trust_code <- c('RCF', 'RFF', 'TAD', 'RWY', 'RXP', 'RX4', 
                'RP5', 'RR7', 'RCD', 'RWA', 'RR8', 'RXF', 
                'RTD', 'RNN', 'RVW', 'RJL', 'RFR', 'RCU', 
                'RHQ', 'RTR', 'R0B', 'RCB')

df_nhs =  list()

for(i in trust_code) 
{
  api_link_nhs   <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
                            "areaType=nhsTrust",
                             "&areaCode=", i, 
                              "&metric=hospitalCases",
                               "&format=csv")
  
  df_1 <- read.csv(api_link_nhs)
  
  df_nhs[[i]] <- df_1
}

df_nhs    <- do.call(rbind, df_nhs)



######################### Combine New and Old NHS TRUST ##################################

nhs_old <- read_excel("NHS_Trust.xlsx") %>% select(-ics)

df_new <- setdiff(df_nhs$date, nhs_old$date)

rows <- which(df_nhs$date %in% df_new)

df_today_new <- df_nhs[rows,]
nhs_all <- rbind(nhs_old,df_today_new)



##########################################################################################
######################### Add in ICS codes ###############################################

lkp_ics_codes <- read.csv("R code/ney_ics_lkp.csv", stringsAsFactors = F) %>%
                 select(area_code, ics)
                   
region_ltla_all_ics  <- left_join(region_ltla_all, lkp_ics_codes,
                        by = c("areaCode" = "area_code"), keep = F)
nhs_all_ics <- left_join(nhs_all, lkp_ics_codes, 
               by = c("areaCode" = "area_code"), keep = F)


write_xlsx(region_ltla_all_ics, "Region_LTLA.xlsx")
write_xlsx(nhs_all_ics, "NHS_Trust.xlsx")

Acknowledgements

The authors wish to thank Eleanor Mackey, Leon Green, Elizabeth Mannion, Toby Johnstone, Kristy Roe, Barbara Coyle, Chris Skelly and Daniel Reidpath for their contributions to this article.

Contact us

LKISNorthEastandYorkshire@phe.gov.uk

Method for automated data extraction from published sources during the COVID-19 pandemic

Fui-Ching Lam and Simon Orange

24/06/2022