The UK government provides a list of private test providers: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/995458/covid-private-testing-providers-general-testing-220621.csv/preview
A list of websites and emails is fine if you are ordering a kit, but most mail-order kits won’t provide results until at least 24 hours after you have posted them back. If you want to find providers that can do a test on site and give a same-day result, you need to know where the companies are based, information that is not given in the list.
This is a perfect case for web scraping: loop through the URLs, extract UK postcodes and plot them on a map.
Open most browsers, such as Chrome or Firefox, and it is very easy to view a page’s source code (in Chrome, right click and select ‘View Page Source’). The most common R web-scraping tool is rvest. However, after testing a few URLs it soon became apparent that many of the websites are JavaScript based. The RSelenium package gets round this: it essentially piggybacks on the browser’s ability to render the website and enables the page content to be fetched as plain text.
In order for RSelenium to work, the rJava R package must be installed and the computer must have Java itself installed. I suggest installing Java first, followed by the rJava package, as the path to Java is critical for rJava’s smooth operation. Once both have been installed, remember to restart RStudio so that the file paths are updated.
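A minimal sketch of that install order (the JAVA_HOME path shown is a hypothetical example; yours will differ):
##Java itself is installed outside R; once that is done, install the packages
install.packages("rJava")
install.packages("RSelenium")
##if rJava cannot find Java, the path can be set manually before loading it
##(hypothetical example path, adjust to your own installation)
#Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jre1.8.0_291")
library(rJava)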
The code below shows how to start RSelenium. I have used Firefox as the browser; the port number is just four random digits.
library(RSelenium)
##setup the selenium browser driver. This instance uses firefox and a random 4 digit number as the port
rsd <- rsDriver(browser = "firefox", port = 4833L)
remDr <- rsd$client
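Once all the scraping is finished, the browser window and the Selenium server can be shut down again using the two objects created above:
##close the browser session and stop the selenium server when finished
remDr$close()
rsd$server$stop()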
A helpful Stack Overflow post gave me the regex pattern below, which seemed to do a good job of finding postcodes.
##define pattern to search through the text
pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"
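As a quick sanity check, the pattern can be tested on a made-up address string (the address below is purely illustrative):
library(stringr)
##the pattern should pull the postcode out of free text
str_extract("Example Clinic, 1 Test Street, London SW1A 1AA, UK", pattern)
## [1] "SW1A 1AA"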
The CSV was imported and the page linked in the spreadsheet was searched for postcodes. However, a big improvement would be to search all the pages of each site; a rough sketch of this is shown after the scraping code below.
library(stringr)
library(tidyverse)
##import csv from gov
covid <- read.csv("https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/995458/covid-private-testing-providers-general-testing-220621.csv")
##remove empty first row
covid <- covid[2:NROW(covid),]
##find unique websites
covids <- unique(covid$Website)
##create list to populate with loop
all_pcodes <- list()
for (c in covids){
##try catch error function to prevent loop dropping out
tryCatch({
##navigate to the url
remDr$navigate(c)
##extract the page source as raw text (getPageSource returns a list, so take the first element)
pg <- remDr$getPageSource()[[1]]
##search for the first postcode on the page (NA if none is found)
pcodes <- str_extract(pg, pattern)
##if a postcode has been found, write it to a dataframe and then to the list
if(!is.na(pcodes)){
df <- data.frame(Website = c, pcodes)
all_pcodes[[c]] <- df
##print where the loop is up to
print(match(c, covids))
flush.console()
}
##error handler so a single failing site does not stop the loop
}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
##bind list into a data frame
pc_df <- do.call(rbind, all_pcodes)
##save postcodes in case R crashes
saveRDS(pc_df, "postcodes_out.RDS")
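As mentioned above, a big improvement would be to search every page of each site rather than just the landing page. A rough sketch of how that might look with RSelenium is below; the link selector and the same-site filter are assumptions and have not been tested against the listed sites.
##rough sketch: search every internal page of one site, not just the landing page
site <- covids[1]
remDr$navigate(site)
##collect the href attribute of every link on the landing page
links <- remDr$findElements(using = "css selector", "a")
hrefs <- unlist(lapply(links, function(x) x$getElementAttribute("href")))
##keep pages that belong to the same site, then pull the first postcode from each
pages <- unique(c(site, hrefs[startsWith(hrefs, site)]))
pcodes <- unlist(lapply(pages, function(p){
remDr$navigate(p)
str_extract(remDr$getPageSource()[[1]], pattern)
}))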
Each postcode output from the scrape was then looked up with the excellent PostcodesioR package to get its latitude and longitude.
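For a single postcode, postcode_lookup() returns a one-row data frame that includes the coordinates; an illustrative call (the postcode is just an example):
library(PostcodesioR)
##look up one postcode and keep the coordinate columns
postcode_lookup("SW1A 1AA")[, c("postcode", "latitude", "longitude")]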
library(sf)
library(PostcodesioR)
##find all unique websites in the scraped postcodes
id_T <- unique(pc_df$Website)
##loop through each website and look up its postcode with the PostcodesioR package
all_mems <- list()
for (m in id_T){
tryCatch({
##subset the scraped postcodes for this website
df <- filter(pc_df, Website == m)
##look up the postcode on postcodes.io and keep the coordinates
PC <- postcode_lookup(df$pcodes)
df$latitude <- PC$latitude
df$longitude <- PC$longitude
all_mems[[m]] <- df
}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
##bind all together
all_pc <- do.call(rbind, all_mems)
##define crs code for latitude and longitude (EPSG 4326)
latlong = 4326
##filter out websites that have no postcode data and convert to sf objects with latlong crs
all_pc_sf <- all_pc %>%
filter(!is.na(latitude)) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = latlong) %>%
left_join(covid, by = "Website")
## save data as RDS for map to import
saveRDS(all_pc_sf, "map_dat.RDS")
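To turn the saved data into an interactive map, something like the leaflet snippet below could be used. This is a minimal sketch: the popup simply shows the Website column carried through from the scraped data.
library(sf)
library(leaflet)
##read the saved sf object and plot each provider, popping up the website on click
map_dat <- readRDS("map_dat.RDS")
leaflet(map_dat) %>%
addTiles() %>%
addCircleMarkers(popup = ~Website)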
The final output map is available at