Web scraping

Web scraping basics in R

The basic toolkit is:

a URL (link) you want to obtain data from
the rvest package in R, or Beautiful Soup in Python
the SelectorGadget extension for web browsers

Installing the package

The package is only available on GitHub and can be installed using devtools.


library(devtools)
devtools::install_github("julianflowers/myScrapers")
library(myScrapers)
library(tidyverse)  # for %>%, map(), str_remove() and mutate() used below

Simple web scraping

Web scraping is a set of techniques for obtaining information or data from websites. In R, the rvest and httr packages are the mainstays of scraping: they read HTML and XML pages into R, where they can be parsed and analysed.

In myScrapers there are two basic functions:

get_page_links which identifies the links on a webpage
get_page_text which extracts text from a webpage
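
For orientation, functions like these can be built from a handful of rvest calls. The sketch below is not the myScrapers source: the names sketch_page_links and sketch_page_text are hypothetical, and it assumes that links sit in the href attribute of `<a>` tags and that text sits in `<p>` elements. It parses an inline HTML string so that it runs without a network connection; read_html() accepts a URL in the same way.

```r
library(rvest)

# Hypothetical stand-in for get_page_links(): collect every href on a page.
sketch_page_links <- function(x) {
  hrefs <- read_html(x) %>%
    html_elements("a") %>%
    html_attr("href")
  hrefs[!is.na(hrefs)]  # drop <a> tags that have no href
}

# Hypothetical stand-in for get_page_text(): collect the text of each paragraph.
sketch_page_text <- function(x) {
  read_html(x) %>%
    html_elements("p") %>%
    html_text2()
}

# A tiny inline page used instead of a live URL.
demo_html <- paste0(
  "<html><body>",
  "<p>First paragraph.</p>",
  "<a href='/government/statistics/example'>A release</a>",
  "<a>No link here</a>",
  "<p>Second paragraph.</p>",
  "</body></html>"
)

demo_links <- sketch_page_links(demo_html)
demo_text <- sketch_page_text(demo_html)
```

Real pages usually need a narrower CSS selector than "a" or "p", which is where SelectorGadget helps.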

Examples

We can use get_page_links to extract the links from the following page of PHE statistical releases: https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england




url <- "https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england"

get_page_links(url) %>%
  .[19:40]
#>  [1] "/government/statistics/mrsa-mssa-and-e-coli-bacteraemia-and-c-difficile-infection-30-day-all-cause-fatality"   
#>  [2] "/government/collections/healthcare-associated-infections-hcai-guidance-data-and-analysis"                      
#>  [3] "/government/collections/escherichia-coli-e-coli-guidance-data-and-analysis"                                    
#>  [4] "/government/collections/clostridium-difficile-guidance-data-and-analysis"                                      
#>  [5] "/government/collections/staphylococcus-aureus-guidance-data-and-analysis"                                      
#>  [6] "/government/collections/pseudomonas-aeruginosa-guidance-data-and-analysis"                                     
#>  [7] "/government/statistics/weekly-all-cause-mortality-surveillance-2018-to-2019"                                   
#>  [8] "/government/collections/all-cause-mortality-surveillance"                                                      
#>  [9] "/government/statistics/weekly-national-flu-reports-2018-to-2019-season"                                        
#> [10] "/government/collections/weekly-national-flu-reports"                                                           
#> [11] "/government/collections/seasonal-influenza-guidance-data-and-analysis"                                         
#> [12] "/government/statistics/norovirus-national-update"                                                              
#> [13] "/government/collections/rotavirus-guidance-data-and-analysis"                                                  
#> [14] "/government/collections/norovirus-guidance-data-and-analysis"                                                  
#> [15] "/government/statistics/emergency-presentations-of-cancer-quarterly-data"                                       
#> [16] "/government/statistics/klebsiella-spp-bacteraemia-monthly-data-split-by-location-of-onset-by-nhs-trust"        
#> [17] "/government/collections/klebsiella-species-guidance-data-and-analysis"                                         
#> [18] "/government/statistics/clostridium-difficile-infection-monthly-data-by-attributed-clinical-commissioning-group"
#> [19] "/government/collections/clostridium-difficile-guidance-data-and-analysis"                                      
#> [20] "/government/statistics/mssa-bacteraemia-monthly-data-by-nhs-acute-trust"                                       
#> [21] "/government/collections/staphylococcus-aureus-guidance-data-and-analysis"                                      
#> [22] "/government/statistics/p-aeruginosa-bacteraemia-monthly-data-split-by-location-of-onset-by-nhs-trust"
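
The links returned above are site-relative paths, so the domain has to be prepended before they can be visited or passed back to get_page_links. A minimal base R sketch, using two of the links from the output above (the variable names are illustrative only):

```r
base_url <- "https://www.gov.uk"

page_links <- c(
  "/government/statistics/norovirus-national-update",
  "/government/collections/weekly-national-flu-reports",
  "https://www.gov.uk/government/statistics"  # already absolute
)

# Prefix only the links that start with "/"; leave absolute URLs untouched.
full_links <- ifelse(grepl("^/", page_links),
                     paste0(base_url, page_links),
                     page_links)

# Separate individual statistical releases from collection pages.
releases <- full_links[grepl("/government/statistics/", full_links)]
```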

Use cases

We’ll use the GP in-hours syndromic surveillance data to illustrate further uses. This report “monitors the number of people who visit their GP during surgery hours under the syndromic surveillance system.”

The system publishes weekly reports and spreadsheets, so obtaining a year’s worth manually would require 104 separate downloads (52 reports and 52 spreadsheets).

Using a web scraping approach, this can be achieved in a few lines of code.

Identifying reports

The code below identifies the PDF reports on the page. Here head(10) keeps the first 10 matching links and unique() drops any repeats.

urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018"

get_page_links(urls) %>%
  .[grepl("pdf$", .)] %>%
  head(10) %>%
  unique()
#> [1] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/747453/GP_in-hours_weekly_bulletin_week_40.pdf"
#> [2] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/745446/GPinHoursEngBulletin2018Wk39.pdf"       
#> [3] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/743478/GPinHoursEngBulletin2018Wk38.pdf"       
#> [4] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/741881/GPinHoursEngBulletin2018Wk37.pdf"       
#> [5] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/740008/GPinHoursEngBulletin2018Wk36.pdf"

We can then use the downloader package to download the PDFs:


library(downloader)

get_page_links(urls) %>%
  .[grepl("pdf$", .)] %>%
  head(10) %>%
  unique() %>%
  map(., ~download(.x, destfile = basename(.x)))
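
When a download loop like this is re-run, it can help to skip files that are already on disk and to carry on if one link fails. The helper below is a sketch and not part of myScrapers: download_missing and its fetch argument are hypothetical names, and the default fetch uses base R's download.file with mode = "wb" so that binary files such as PDFs are written intact on Windows. The demonstration swaps in a fake fetch function so it runs offline.

```r
download_missing <- function(urls, dest_dir = ".",
                             fetch = function(url, dest) {
                               utils::download.file(url, dest, mode = "wb", quiet = TRUE)
                             }) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(urls))
  for (i in seq_along(urls)) {
    if (file.exists(dest[i])) next  # already downloaded on an earlier run
    tryCatch(
      fetch(urls[i], dest[i]),
      error = function(e) message("Failed: ", urls[i], " (", conditionMessage(e), ")")
    )
  }
  invisible(dest[file.exists(dest)])  # paths of the files now on disk
}

# Offline demonstration: placeholder URLs and a fake fetch that creates empty files.
demo_dir <- file.path(tempdir(), "bulletins_demo")
demo_urls <- c("https://example.org/files/week_39.pdf",
               "https://example.org/files/week_40.pdf")
saved <- download_missing(demo_urls, demo_dir,
                          fetch = function(url, dest) file.create(dest))
```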

Identifying data (spreadsheet links)

We can take a similar approach to spreadsheets.
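
A hedged sketch of what the similar approach might look like: filter the page links on a spreadsheet extension instead of pdf. The links below are placeholders on example.org, not real bulletin URLs, and the assumption that the spreadsheets end in .xls or .xlsx is illustrative.

```r
# Placeholder links standing in for the output of get_page_links(urls).
page_links <- c(
  "https://example.org/file/1/bulletin_week_40.pdf",
  "https://example.org/file/2/data_week_40.xlsx",
  "https://example.org/file/3/data_week_39.xls",
  "https://example.org/file/2/data_week_40.xlsx"
)

# Keep links ending in .xls or .xlsx (case-insensitive), then drop repeats.
is_sheet <- grepl("\\.xlsx?$", page_links, ignore.case = TRUE)
sheet_links <- unique(page_links[is_sheet])

# With real links, the download step would mirror the PDF one, for example:
# purrr::walk(sheet_links, ~ downloader::download(.x, destfile = basename(.x), mode = "wb"))
```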

Having downloaded the reports or spreadsheets it is now straightforward to import them for further analysis.


library(readxl)
files <- list.files(pattern = "\\.xlsx?$")  # match the .xls/.xlsx extension only

data <- map(files, ~(read_excel(.x,  sheet = "Local Authority", na = "*", 
    skip = 4)))

head(data)
#> [[1]]
#> # A tibble: 151 x 27
#>    `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#>    <chr>            <chr>            <chr>            <chr>           
#>  1 E09000002        Barking and Dag… London           X25001AA        
#>  2 E09000003        Barnet           London           X25001AA        
#>  3 E09000004        Bexley           London           X25001AA        
#>  4 E09000005        Brent            London           X25001AA        
#>  5 E09000006        Bromley          London           X25001AA        
#>  6 E09000007        Camden           London           X25001AA        
#>  7 E09000008        Croydon          London           X25001AA        
#>  8 E09000009        Ealing           London           X25001AA        
#>  9 E09000010        Enfield          London           X25001AA        
#> 10 E09000011        Greenwich        London           X25001AA        
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> #   `PHE Region ODS Code` <chr>, `Denominator Population` <dbl>, `Observed
#> #   number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> #   CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> #   `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> #   `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> #   per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> #   SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> #   100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> #   SIR__3` <chr>
#> 
#> [[2]]
#> # A tibble: 151 x 27
#>    `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#>    <chr>            <chr>            <chr>            <chr>           
#>  1 E09000002        Barking and Dag… London           X25001AA        
#>  2 E09000003        Barnet           London           X25001AA        
#>  3 E09000004        Bexley           London           X25001AA        
#>  4 E09000005        Brent            London           X25001AA        
#>  5 E09000006        Bromley          London           X25001AA        
#>  6 E09000007        Camden           London           X25001AA        
#>  7 E09000008        Croydon          London           X25001AA        
#>  8 E09000009        Ealing           London           X25001AA        
#>  9 E09000010        Enfield          London           X25001AA        
#> 10 E09000011        Greenwich        London           X25001AA        
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> #   `PHE Region ODS Code` <chr>, `Denominator Population` <dbl>, `Observed
#> #   number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> #   CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> #   `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> #   `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> #   per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> #   SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> #   100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> #   SIR__3` <chr>
#> 
#> [[3]]
#> # A tibble: 151 x 27
#>    `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#>    <chr>            <chr>            <chr>            <chr>           
#>  1 E09000002        Barking and Dag… London           X25001AA        
#>  2 E09000003        Barnet           London           X25001AA        
#>  3 E09000004        Bexley           London           X25001AA        
#>  4 E09000005        Brent            London           X25001AA        
#>  5 E09000006        Bromley          London           X25001AA        
#>  6 E09000007        Camden           London           X25001AA        
#>  7 E09000008        Croydon          London           X25001AA        
#>  8 E09000009        Ealing           London           X25001AA        
#>  9 E09000010        Enfield          London           X25001AA        
#> 10 E09000011        Greenwich        London           X25001AA        
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> #   `PHE Region ODS Code` <chr>, `Denominator Population` <chr>, `Observed
#> #   number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> #   CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> #   `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> #   `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> #   per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> #   SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> #   100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> #   SIR__3` <chr>
#> 
#> [[4]]
#> # A tibble: 151 x 27
#>    `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#>    <chr>            <chr>            <chr>            <chr>           
#>  1 E09000002        Barking and Dag… London           X25001AA        
#>  2 E09000003        Barnet           London           X25001AA        
#>  3 E09000004        Bexley           London           X25001AA        
#>  4 E09000005        Brent            London           X25001AA        
#>  5 E09000006        Bromley          London           X25001AA        
#>  6 E09000007        Camden           London           X25001AA        
#>  7 E09000008        Croydon          London           X25001AA        
#>  8 E09000009        Ealing           London           X25001AA        
#>  9 E09000010        Enfield          London           X25001AA        
#> 10 E09000011        Greenwich        London           X25001AA        
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> #   `PHE Region ODS Code` <chr>, `Denominator Population` <dbl>, `Observed
#> #   number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> #   CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> #   `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> #   `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> #   per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> #   SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> #   100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> #   SIR__3` <chr>
#> 
#> [[5]]
#> # A tibble: 151 x 27
#>    `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#>    <chr>            <chr>            <chr>            <chr>           
#>  1 E09000002        Barking and Dag… London           X25001AA        
#>  2 E09000003        Barnet           London           X25001AA        
#>  3 E09000004        Bexley           London           X25001AA        
#>  4 E09000005        Brent            London           X25001AA        
#>  5 E09000006        Bromley          London           X25001AA        
#>  6 E09000007        Camden           London           X25001AA        
#>  7 E09000008        Croydon          London           X25001AA        
#>  8 E09000009        Ealing           London           X25001AA        
#>  9 E09000010        Enfield          London           X25001AA        
#> 10 E09000011        Greenwich        London           X25001AA        
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> #   `PHE Region ODS Code` <chr>, `Denominator Population` <chr>, `Observed
#> #   number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> #   CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> #   `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> #   `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> #   per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> #   SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> #   100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> #   SIR__3` <chr>
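
Note that in the output above `Denominator Population` is read as a double in some files and as character in others, so the tibbles will not stack directly with bind_rows. One way around this, sketched below with two toy data frames in place of the real spreadsheets, is to convert every column to character, bind, and then convert back. The column names and the "-" marker are invented for the illustration.

```r
library(dplyr)
library(purrr)

# Toy stand-ins for two weekly files whose column types disagree.
week_a <- data.frame(area = c("E09000002", "E09000003"),
                     population = c(1000, 2000),
                     stringsAsFactors = FALSE)
week_b <- data.frame(area = c("E09000002", "E09000003"),
                     population = c("1000", "-"),
                     stringsAsFactors = FALSE)
weekly <- list(week_a = week_a, week_b = week_b)

combined <- weekly %>%
  # make every column character so the types agree
  map(~ mutate(.x, across(everything(), as.character))) %>%
  # stack the files, recording which one each row came from
  bind_rows(.id = "source") %>%
  # convert back to numeric; non-numeric markers such as "-" become NA
  mutate(population = suppressWarnings(as.numeric(population)))
```

With the real files, passing col_types = "text" to read_excel is another option that avoids the mismatch at import time.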

Analysing Duncan Selbie’s Friday messages

Using simple functions it is relatively easy to scrape Duncan Selbie’s blogs into a data frame for further analysis.

The base url is https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/, and there are 8 pages of results so the first task is to create a list of urls.


url_ds <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/"
url_ds1 <- paste0(url_ds, "page/", 2:8)
urls_ds <- c(url_ds, url_ds1)

Then we can extract the links and isolate those specific to the Friday messages:


links <- map(urls_ds, ~ get_page_links(.x))

friday_message <- links %>%
  flatten() %>%
  .[grepl("duncan-selbies-friday-message", .)] %>%
  .[!grepl("comments", .)] %>%
  unique()

head(friday_message)
#> [[1]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/10/12/duncan-selbies-friday-message-12-october-2018/"
#> 
#> [[2]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/10/05/duncan-selbies-friday-message-5-october-2018/"
#> 
#> [[3]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/28/duncan-selbies-friday-message-28-september-2018/"
#> 
#> [[4]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/14/duncan-selbies-friday-message-14-september-2018/"
#> 
#> [[5]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/07/duncan-selbies-friday-message-7-september-2018/"
#> 
#> [[6]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/08/31/duncan-selbies-friday-message-31-august-2018/"
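
Because flatten() returns a list, head() prints one element per link. For analysis it is often handier to hold the links as a character vector and to recover each post's date from its URL. A base R sketch using two of the links above (the object names are illustrative):

```r
friday_links <- list(
  "https://publichealthmatters.blog.gov.uk/2018/10/12/duncan-selbies-friday-message-12-october-2018/",
  "https://publichealthmatters.blog.gov.uk/2018/10/05/duncan-selbies-friday-message-5-october-2018/"
)

link_vec <- unlist(friday_links)  # list -> character vector

# The publication date is embedded in the path as /YYYY/MM/DD/.
date_part <- regmatches(link_vec, regexpr("[0-9]{4}/[0-9]{2}/[0-9]{2}", link_vec))

posts <- data.frame(
  date = as.Date(date_part, format = "%Y/%m/%d"),
  url = link_vec,
  stringsAsFactors = FALSE
)
```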

and then extract blog text:

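library(tm)
library(magrittr)

# Fetch the text of each post, then strip a newline, the cookie banner and the salutation
blog_text <- map(friday_message, ~(get_page_text(.x)))
blog_text <- map(blog_text, ~(str_remove(.x, "\\n")))
blog_text <- map(blog_text, ~(str_remove(.x, "    GOV.UK blogs use cookies to make the site simpler. Find out more about cookies\n  ")))
blog_text <- map(blog_text, ~(str_remove(.x, "Dear everyone")))

# Use the second element of each post as its title and name the list with it
blog_title <- map(blog_text, 2)
names(blog_text) <- blog_title

# Keep elements 5 to 11 of each post and stack them into a single data frame
blog_text1 <- map(blog_text, extract, 5:11)
blog_text2 <- map(blog_text1, data.frame)
blog_text2 <- map_df(blog_text2, bind_rows)
# .x..i.. is the column name auto-generated by data.frame() for the text
blog_text2 <- blog_text2 %>% mutate(text = clean_texts(.x..i..))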
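
One detail worth checking in the cleaning step: str_remove() deletes only the first match in each string, whereas str_remove_all() deletes every match. Whether that matters depends on how many newlines get_page_text leaves in each element, which is not shown here. A toy comparison on an invented string:

```r
library(stringr)

raw <- "Dear everyone\nFirst line\nSecond line"

first_only <- str_remove(raw, "\\n")    # removes the first newline only
all_gone <- str_remove_all(raw, "\\n")  # removes every newline
```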

We can then visualise with, for example, a wordcloud.

library(quanteda)

corp <- corpus(blog_text2$text)
# Bigram document-feature matrix, excluding the listed bigrams.
# Newer quanteda releases may need tokens_ngrams() and dfm_remove() instead of these arguments.
blog_dfm <- dfm(corp, ngrams = 2,
                remove = c("government_licence", "open_government", "public_health", "official_blog",
                           "blog_public", "health_england", "cancel_reply", "content_available",
                           "health_blog", "licence_v", "best_wishes", "otherwise_stated",
                           "except_otherwise", "friday_messages", "available_open"))

textplot_wordcloud(blog_dfm)
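
The entries in the remove list are joined with underscores because a bigram is formed by pasting each word to the one that follows it. The base R sketch below shows the idea on one invented sentence; it illustrates bigram counting in general and is not quanteda's implementation.

```r
text <- "Public health matters and public health works"

# Lower-case the text and split it into words.
words <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- words[words != ""]

# Pair each word with its successor: w1_w2, w2_w3, ...
bigrams <- paste(head(words, -1), tail(words, -1), sep = "_")
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

# Dropping unwanted bigrams is then a lookup by name.
kept <- bigram_counts[!names(bigram_counts) %in% c("public_health")]
```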

Additional functions

I have added a few functions to the package.

get_dsph_england returns a list of local authorities and their current Directors of Public Health (DsPH). It scrapes https://www.gov.uk/government/publications/directors-of-public-health-in-england--2/directors-of-public-health-in-england


dsph <- get_dsph_england()
dsph %>%
  knitr::kable()

LA	Name
Derby UA	Cate Edwynn
Derbyshire	Dean Wallace
Leicester UA	Ruth Tennant
Leicestershire	Mike Sandys
Lincolnshire	Derek Ward
Northamptonshire	Lucy Wightman
Nottingham UA	Alison Challenger
Nottinghamshire	Jonathan Gribbin
Rutland UA	Mike Sandys
Bedford Borough	Muriel Scott
Cambridgeshire	Liz Robin
Central Bedfordshire UA	Muriel Scott
Essex	Mike Gogarty
Hertfordshire	Jim McManus
Luton UA	Gerry Taylor
Milton Keynes UA	Muriel Scott
Norfolk (covers Great Yarmouth)	Louise Smith
Peterborough UA	Liz Robin
Southend on Sea UA	Andrea Atherton
Suffolk (covers Waveney)	Abdul Razaq
Thurrock UA	Ian Wake
Barking and Dagenham	Matthew Cole
Barnet	Tamara Djuretic
Bexley	Anjan Ghosh
Brent	Melanie Smith
Bromley	Nada Lemic
Camden	Julie Billet
City	Penny Bevan
Croydon	Rachel Flowers
Ealing	Wendy Meredith (acting/interim)
Enfield	Stuart Lines
Greenwich	Steve Whiteman
Hackney	Penny Bevan
Hammersmith and Fulham	Anita Parkin (acting/interim)
Haringey	Will Maimaris (acting/interim)
Harrow	Carole Furlong
Havering	Mark Ansell (acting/interim)
Hillingdon	Steven Hajioff
Hounslow	Laura Maclehose (acting/interim)
Islington	Julie Billet
Kensington and Chelsea	Mike Robinson
Kingston upon Thames	Iona Lidington
Lambeth	Ruth Hutt (acting/interim)
Lewisham	Danny Ruta
Merton	Dagmar Zeuner
Newham	Livia Royle (acting/interim)
Redbridge	Gladys Xavier (acting/interim)
Richmond upon Thames	Houda Al-Sharifi
Southwark	Kevin Fenton
Sutton	Imran Choudhury
Tower Hamlets	Somen Banerjee
Waltham Forest	Joe McDonnell
Wandsworth	Houda Al-Sharifi
Westminster	Mike Robinson
Darlington UA	Miriam Davidson
County Durham UA	Amanda Healy
Gateshead Council	Alice Wiseman
Hartlepool	Peter Brambleby (acting/interim)
Middlesbrough UA	Edward Kunonga
Newcastle upon Tyne	Eugene Milne
North Tyneside	Wendy Burke
Northumberland	Liz Morgan
Redcar and Cleveland	Edward Kunonga (acting/interim)
South Tyneside	Tom Hall
Stockton on Tees UA	Sarah Bowman-Abouna
Sunderland	Gillian Gibson
Blackburn with Darwen	Dominic Harrison
Blackpool	Arif Rajpura
Bolton	David Herne (acting/interim)
Bury	Lesley Jones
Cheshire East UA	Fiona Reynolds
Cheshire West and Chester UA	Ian Ashworth
Cumbria	Colin Cox
Halton UA	Eileen O’Meara
Knowsley	Matthew Ashton
Lancashire	Sakthi Karunanithi
Liverpool	Sandra Davies
Manchester	David Regan
Oldham	Katrina Stephens (acting/interim)
Rochdale	Andrea Fallon
Salford	David Herne
Sefton	Matthew Ashton (acting/interim)
St Helens	Sue Forster
Stockport	Stephen Watkins
Tameside	Jeanelle De Gruchy
Trafford	Eleanor Roafe (acting/interim)
Warrington UA	Muna Abdel Aziz
Wigan	Kate Ardern
Wirral	Fiona Johnstone
West Berkshire UA	Tessa Lindfield
Bracknell Forest	Lisa McNally
Brighton UA	Alistair Hill
Buckinghamshire County Council	Jane O’Grady
East Sussex	– Darrell Gale
Hampshire	Sallie Bacon
Isle of Wight UA	Sallie Bacon (acting/interim)
Kent	Andrew Scott-Clark
Medway UA	James Williams
Oxfordshire	Jonathan McWilliam
Portsmouth UA	Jason Horsley
Reading UA	Tessa Lindfield
Slough UA	Tessa Lindfield
Southampton UA	Jason Horsley
Surrey	Helen Atkinson
West Sussex	- Anna-Marie Raleigh
Windsor and Maidenhead UA	Tessa Lindfield
Wokingham UA	Tessa Lindfield
‘Bathnes’ Bath and North East Somerset	Bruce Laurence
Bournemouth	David Phillips (Sam Crowe covering secondment)
City of Bristol	Susan Milner (acting/interim)
Cornwall UA	Caroline Court (acting/interim)
Devon County Council	Virginia Pearson
Dorset	David Phillips (Sam Crowe covering secondment)
Gloucestershire	Sarah Scott
Isle of Scilly UA	Caroline Court (acting/interim)
North Somerset UA	Andrew Burnett (acting/interim)
Plymouth UA	Ruth Harrell
Poole UA	David Phillips (Sam Crowe covering secondment)
Somerset	Trudi Grant
South Gloucestershire UA	Mark Pietroni / Sara Blackmore
Swindon UA	Cherry Jones
Torbay UA	Caroline Dimond
Wiltshire UA	Tracy Daskiewicz
Birmingham	Becky Pollard (acting/interim)
Coventry	Liz Gaulton
Dudley	Deborah Harkins
Herefordshire	Karen Wright
Sandwell	Ansaf Azhar (acting/interim)
Shropshire UA	Rod Thomson
Solihull	Meradin Peachey
Staffordshire	Richard Harling
Stoke on Trent UA	Paul Edmondson-Jones
Telford and Wrekin UA	Liz Noakes
Walsall	Barbara Watt
Warwickshire	John Linnane
Wolverhampton	John Denley
Worcestershire	Frances Howie
Barnsley	Julia Burrows
Bradford	Sarah Muckle
Calderdale	Paul Butcher
Doncaster	Rupert Suckling
East Riding of Yorkshire UA	Tim Allison
Hull City Council	Julia Weldon
Kirklees	Rachel Spencer-Henshall
Leeds	Ian Cameron
North East Lincolnshire UA	Steve Pintus
North Lincolnshire UA	Penny Spring
North Yorkshire	Lincoln Sargeant
Rotherham	Theresa Roche
Sheffield	Greg Fell
Wakefield	Anna Hartley
York UA	Sharon Stoltz
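
A scraped table like this lends itself to quick checks, for example how many posts are held on an acting or interim basis, or which directors cover more than one authority. The base R sketch below uses five rows copied from the table above and assumes the column names LA and Name shown in its header; the object names are illustrative.

```r
dsph_sample <- data.frame(
  LA = c("Camden", "City", "Ealing", "Hackney", "Islington"),
  Name = c("Julie Billet", "Penny Bevan", "Wendy Meredith (acting/interim)",
           "Penny Bevan", "Julie Billet"),
  stringsAsFactors = FALSE
)

# Flag acting/interim posts, then strip the marker so names can be compared.
dsph_sample$interim <- grepl("acting/interim", dsph_sample$Name, fixed = TRUE)
dsph_sample$person <- trimws(sub("\\(acting/interim\\)", "", dsph_sample$Name))

# Directors who appear against more than one local authority.
la_per_person <- table(dsph_sample$person)
shared <- names(la_per_person[la_per_person > 1])
```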

get_phe_catalogue identifies all the PHE publications on GOV.UK. For this function you have to set the n = argument. We recommend starting at n = 110. This produces an interactive searchable table of links.


cat <- get_phe_catalogue(n = 110)

cat