Web scraping

Julian Flowers

2018-10-15

This vignette illustrates some basics of web scraping and some features of the myScrapers package, in particular its simple web-scraping functions. We also show some functions in the package designed specifically to retrieve public health information for public health practitioners.

Webscraping basics in R

The basic toolkit is:

Installing the package

The package is only available on GitHub and can be installed using devtools.


library(devtools)
devtools::install_github("julianflowers/myScrapers")

Simple web-scraping

Web scraping is a set of techniques for obtaining information or data from websites. In R the rvest and httr packages are the mainstay of scraping. They import and read HTML and XML pages into R, where the content can then be parsed and analysed.
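As a minimal sketch of that read-then-parse workflow, the example below parses a small inline HTML snippet with rvest and pulls out the link targets and the paragraph text. The snippet and its report paths are made up for illustration so that the example runs without a network connection; for a real site you would pass the page URL to read_html() instead.

```r
library(rvest)

# A made-up page, supplied as a literal string instead of a URL
page <- read_html('<html><body>
  <h1>Statistics</h1>
  <p>Weekly flu report</p>
  <a href="/reports/week-1.pdf">Week 1</a>
  <a href="/reports/week-2.pdf">Week 2</a>
</body></html>')

# Select nodes with a CSS selector, then read an attribute or the text
links <- html_attr(html_nodes(page, "a"), "href")
text <- html_text(html_nodes(page, "p"))

links
text
```

The same two steps, selecting nodes and then extracting an attribute or text, underlie most scraping tasks.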

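In myScrapers there are 2 simple scraping functions, both used in this vignette: get_page_links, which extracts the links from a page, and get_page_text, which extracts its text.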

Examples

We can use get_page_links to extract the links from the following page of PHE statistical releases: https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england




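library(myScrapers)
library(magrittr)  # provides the %>% pipe

url <- "https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england"

# Extract every link on the page, then keep entries 19 to 40
get_page_links(url) %>%
  .[19:40]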
#>  [1] "/government/statistics/mrsa-mssa-and-e-coli-bacteraemia-and-c-difficile-infection-30-day-all-cause-fatality"   
#>  [2] "/government/collections/healthcare-associated-infections-hcai-guidance-data-and-analysis"                      
#>  [3] "/government/collections/escherichia-coli-e-coli-guidance-data-and-analysis"                                    
#>  [4] "/government/collections/clostridium-difficile-guidance-data-and-analysis"                                      
#>  [5] "/government/collections/staphylococcus-aureus-guidance-data-and-analysis"                                      
#>  [6] "/government/collections/pseudomonas-aeruginosa-guidance-data-and-analysis"                                     
#>  [7] "/government/statistics/weekly-all-cause-mortality-surveillance-2018-to-2019"                                   
#>  [8] "/government/collections/all-cause-mortality-surveillance"                                                      
#>  [9] "/government/statistics/weekly-national-flu-reports-2018-to-2019-season"                                        
#> [10] "/government/collections/weekly-national-flu-reports"                                                           
#> [11] "/government/collections/seasonal-influenza-guidance-data-and-analysis"                                         
#> [12] "/government/statistics/norovirus-national-update"                                                              
#> [13] "/government/collections/rotavirus-guidance-data-and-analysis"                                                  
#> [14] "/government/collections/norovirus-guidance-data-and-analysis"                                                  
#> [15] "/government/statistics/emergency-presentations-of-cancer-quarterly-data"                                       
#> [16] "/government/statistics/klebsiella-spp-bacteraemia-monthly-data-split-by-location-of-onset-by-nhs-trust"        
#> [17] "/government/collections/klebsiella-species-guidance-data-and-analysis"                                         
#> [18] "/government/statistics/clostridium-difficile-infection-monthly-data-by-attributed-clinical-commissioning-group"
#> [19] "/government/collections/clostridium-difficile-guidance-data-and-analysis"                                      
#> [20] "/government/statistics/mssa-bacteraemia-monthly-data-by-nhs-acute-trust"                                       
#> [21] "/government/collections/staphylococcus-aureus-guidance-data-and-analysis"                                      
#> [22] "/government/statistics/p-aeruginosa-bacteraemia-monthly-data-split-by-location-of-onset-by-nhs-trust"
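
The links in this output are relative to the site root. A minimal sketch for turning them into full URLs follows, assuming every relative link belongs to https://www.gov.uk (the host of the page scraped above). The third link is included only to show that links which are already absolute are left untouched.

```r
# Two relative links from the output above, plus one absolute link
links <- c(
  "/government/statistics/norovirus-national-update",
  "/government/collections/weekly-national-flu-reports",
  "https://www.gov.uk/government/statistics"
)

# Assumption: relative links are relative to the GOV.UK site root
base_url <- "https://www.gov.uk"

# Prefix only the links that do not already start with http:// or https://
is_relative <- !grepl("^https?://", links)
full_links <- ifelse(is_relative, paste0(base_url, links), links)
full_links
```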

Use cases

We’ll use GP in-hours syndromic surveillance data to illustrate further uses. This report “Monitors the number of people who visit their GP during surgery hours under the syndromic surveillance system.”

The system publishes a report and a spreadsheet every week, so obtaining a year’s worth manually would require 104 separate downloads (52 weeks × 2 files).

Using a web-scraping approach this can be achieved in a few lines of code.

Identifying reports

The code below identifies all the pdf reports on the page.
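
One way to do this, sketched under the assumption that the page links have already been collected into a character vector (for example with get_page_links), is to pick out the PDF files by extension. The links below are invented placeholders, not real GOV.UK paths.

```r
# Hypothetical links standing in for the output of get_page_links()
page_links <- c(
  "/example/gp-in-hours-week-40.pdf",
  "/example/gp-in-hours-week-40.xlsx",
  "/example/gp-in-hours-week-41.PDF",
  "/example/about"
)

# Keep only links whose path ends in .pdf, ignoring case
pdf_links <- page_links[grepl("\\.pdf$", page_links, ignore.case = TRUE)]
pdf_links
```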

We can then use the downloader package to download the pdfs:
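
A sketch of the download step, assuming pdf_urls is a character vector of full URLs to the PDF files. The helper download_pdfs and its arguments are hypothetical names introduced here. It defaults to downloader::download, which passes its arguments on to download.file.

```r
# Hypothetical helper: save each URL into dest_dir under its own file name.
# `fetch` defaults to downloader::download; the default is only evaluated
# when it is used, so any function accepting (url, destfile, mode) can be
# supplied instead.
download_pdfs <- function(pdf_urls, dest_dir = tempdir(),
                          fetch = downloader::download) {
  dest_files <- file.path(dest_dir, basename(pdf_urls))
  for (i in seq_along(pdf_urls)) {
    # mode = "wb" writes binary files such as PDFs correctly on Windows
    fetch(pdf_urls[i], destfile = dest_files[i], mode = "wb")
  }
  # Return the local paths invisibly so they can be reused later
  invisible(dest_files)
}
```

Calling download_pdfs(pdf_urls, dest_dir = "reports"), with "reports" being an existing folder, would then fetch every report in one step.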

Analysing Duncan Selbie’s Friday messages

Using simple functions it is relatively easy to scrape Duncan Selbie’s blogs into a data frame for further analysis.

The base URL is https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/, and there are 8 pages of results, so the first task is to create a list of URLs.


url_ds <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/"
url_ds1 <- paste0(url_ds, "page/", 2:8)
urls_ds <- c(url_ds, url_ds1)

Then we can extract the links and isolate those specific to the Friday messages:


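library(purrr)  # map(), flatten()

links <- map(urls_ds, ~ get_page_links(.x))

# Keep only Friday message posts, dropping comment links and duplicates
friday_message <- links %>% flatten() %>%
  .[grepl("duncan-selbies-friday-message", .)] %>%
  .[!grepl("comments", .)] %>% unique()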

head(friday_message)
#> [[1]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/10/12/duncan-selbies-friday-message-12-october-2018/"
#> 
#> [[2]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/10/05/duncan-selbies-friday-message-5-october-2018/"
#> 
#> [[3]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/28/duncan-selbies-friday-message-28-september-2018/"
#> 
#> [[4]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/14/duncan-selbies-friday-message-14-september-2018/"
#> 
#> [[5]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/07/duncan-selbies-friday-message-7-september-2018/"
#> 
#> [[6]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/08/31/duncan-selbies-friday-message-31-august-2018/"

We then extract the blog text:

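library(tm)
library(magrittr)  # extract()
library(purrr)     # map(), map_df()
library(stringr)   # str_remove()
library(dplyr)     # mutate(), bind_rows()

# Scrape the text of each message, then strip a newline, the cookie banner and the salutation
blog_text <- map(friday_message, ~(get_page_text(.x)))
blog_text <- map(blog_text, ~(str_remove(.x, "\\n")))
blog_text <- map(blog_text, ~(str_remove(.x, "    GOV.UK blogs use cookies to make the site simpler. Find out more about cookies\n  ")))
blog_text <- map(blog_text, ~(str_remove(.x, "Dear everyone")))

# The second text element of each page is used as its title
blog_title <- map(blog_text, 2)
names(blog_text) <- blog_title

# Keep elements 5 to 11 of each page and combine them into one data frame
blog_text1 <- map(blog_text, extract, 5:11)
blog_text2 <- map(blog_text1, data.frame)
blog_text2 <- map_df(blog_text2, bind_rows)
blog_text2 <- blog_text2 %>% mutate(text = clean_texts(.x..i..))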

We can then visualise with, for example, a wordcloud.

library(quanteda)

corp <- corpus(blog_text2$text)

# Bigram document-feature matrix, dropping uninformative or boilerplate bigrams
blog_dfm <- dfm(corp, ngrams = 2, remove = c(
  "government_licence", "open_government", "public_health", "official_blog",
  "blog_public", "health_england", "cancel_reply", "content_available",
  "health_blog", "licence_v", "best_wishes", "otherwise_stated",
  "except_otherwise", "friday_messages", "available_open"))

textplot_wordcloud(blog_dfm)
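
The terms passed to remove are bigrams: pairs of adjacent words joined by an underscore. The base R sketch below illustrates how such bigrams are formed and counted. It is an illustration only, not how quanteda implements it, and the helper name make_bigrams and the sample sentence are made up.

```r
# Hypothetical helper: split text into lower-case words and join
# each word to the one that follows it with an underscore
make_bigrams <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  paste(head(words, -1), tail(words, -1), sep = "_")
}

bigrams <- make_bigrams("Public health matters and public health works")

# Count each bigram, most frequent first
counts <- sort(table(bigrams), decreasing = TRUE)
counts
```

Very frequent but uninformative pairs, such as public_health in this sample, would otherwise dominate the wordcloud, which is why they are listed for removal.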

Additional functions

I have added a few functions to the package.

get_dsph_england returns a list of local authorities and their current DsPH. It scrapes https://www.gov.uk/government/publications/directors-of-public-health-in-england--2/directors-of-public-health-in-england


dsph <- get_dsph_england()
dsph %>%
  knitr::kable()
| LA | Name |
|:---|:---|
| Derby UA | Cate Edwynn |
| Derbyshire | Dean Wallace |
| Leicester UA | Ruth Tennant |
| Leicestershire | Mike Sandys |
| Lincolnshire | Derek Ward |
| Northamptonshire | Lucy Wightman |
| Nottingham UA | Alison Challenger |
| Nottinghamshire | Jonathan Gribbin |
| Rutland UA | Mike Sandys |
| Bedford Borough | Muriel Scott |
| Cambridgeshire | Liz Robin |
| Central Bedfordshire UA | Muriel Scott |
| Essex | Mike Gogarty |
| Hertfordshire | Jim McManus |
| Luton UA | Gerry Taylor |
| Milton Keynes UA | Muriel Scott |
| Norfolk (covers Great Yarmouth) | Louise Smith |
| Peterborough UA | Liz Robin |
| Southend on Sea UA | Andrea Atherton |
| Suffolk (covers Waveney) | Abdul Razaq |
| Thurrock UA | Ian Wake |
| Barking and Dagenham | Matthew Cole |
| Barnet | Tamara Djuretic |
| Bexley | Anjan Ghosh |
| Brent | Melanie Smith |
| Bromley | Nada Lemic |
| Camden | Julie Billet |
| City | Penny Bevan |
| Croydon | Rachel Flowers |
| Ealing | Wendy Meredith (acting/interim) |
| Enfield | Stuart Lines |
| Greenwich | Steve Whiteman |
| Hackney | Penny Bevan |
| Hammersmith and Fulham | Anita Parkin (acting/interim) |
| Haringey | Will Maimaris (acting/interim) |
| Harrow | Carole Furlong |
| Havering | Mark Ansell (acting/interim) |
| Hillingdon | Steven Hajioff |
| Hounslow | Laura Maclehose (acting/interim) |
| Islington | Julie Billet |
| Kensington and Chelsea | Mike Robinson |
| Kingston upon Thames | Iona Lidington |
| Lambeth | Ruth Hutt (acting/interim) |
| Lewisham | Danny Ruta |
| Merton | Dagmar Zeuner |
| Newham | Livia Royle (acting/interim) |
| Redbridge | Gladys Xavier (acting/interim) |
| Richmond upon Thames | Houda Al-Sharifi |
| Southwark | Kevin Fenton |
| Sutton | Imran Choudhury |
| Tower Hamlets | Somen Banerjee |
| Waltham Forest | Joe McDonnell |
| Wandsworth | Houda Al-Sharifi |
| Westminster | Mike Robinson |
| Darlington UA | Miriam Davidson |
| County Durham UA | Amanda Healy |
| Gateshead Council | Alice Wiseman |
| Hartlepool | Peter Brambleby (acting/interim) |
| Middlesbrough UA | Edward Kunonga |
| Newcastle upon Tyne | Eugene Milne |
| North Tyneside | Wendy Burke |
| Northumberland | Liz Morgan |
| Redcar and Cleveland | Edward Kunonga (acting/interim) |
| South Tyneside | Tom Hall |
| Stockton on Tees UA | Sarah Bowman-Abouna |
| Sunderland | Gillian Gibson |
| Blackburn with Darwen | Dominic Harrison |
| Blackpool | Arif Rajpura |
| Bolton | David Herne (acting/interim) |
| Bury | Lesley Jones |
| Cheshire East UA | Fiona Reynolds |
| Cheshire West and Chester UA | Ian Ashworth |
| Cumbria | Colin Cox |
| Halton UA | Eileen O’Meara |
| Knowsley | Matthew Ashton |
| Lancashire | Sakthi Karunanithi |
| Liverpool | Sandra Davies |
| Manchester | David Regan |
| Oldham | Katrina Stephens (acting/interim) |
| Rochdale | Andrea Fallon |
| Salford | David Herne |
| Sefton | Matthew Ashton (acting/interim) |
| St Helens | Sue Forster |
| Stockport | Stephen Watkins |
| Tameside | Jeanelle De Gruchy |
| Trafford | Eleanor Roafe (acting/interim) |
| Warrington UA | Muna Abdel Aziz |
| Wigan | Kate Ardern |
| Wirral | Fiona Johnstone |
| West Berkshire UA | Tessa Lindfield |
| Bracknell Forest | Lisa McNally |
| Brighton UA | Alistair Hill |
| Buckinghamshire County Council | Jane O’Grady |
| East Sussex – | Darrell Gale |
| Hampshire | Sallie Bacon |
| Isle of Wight UA | Sallie Bacon (acting/interim) |
| Kent | Andrew Scott-Clark |
| Medway UA | James Williams |
| Oxfordshire | Jonathan McWilliam |
| Portsmouth UA | Jason Horsley |
| Reading UA | Tessa Lindfield |
| Slough UA | Tessa Lindfield |
| Southampton UA | Jason Horsley |
| Surrey | Helen Atkinson |
| West Sussex - | Anna-Marie Raleigh |
| Windsor and Maidenhead UA | Tessa Lindfield |
| Wokingham UA | Tessa Lindfield |
| ‘Bathnes’ Bath and North East Somerset | Bruce Laurence |
| Bournemouth | David Phillips (Sam Crowe covering secondment) |
| City of Bristol | Susan Milner (acting/interim) |
| Cornwall UA | Caroline Court (acting/interim) |
| Devon County Council | Virginia Pearson |
| Dorset | David Phillips (Sam Crowe covering secondment) |
| Gloucestershire | Sarah Scott |
| Isle of Scilly UA | Caroline Court (acting/interim) |
| North Somerset UA | Andrew Burnett (acting/interim) |
| Plymouth UA | Ruth Harrell |
| Poole UA | David Phillips (Sam Crowe covering secondment) |
| Somerset | Trudi Grant |
| South Gloucestershire UA | Mark Pietroni / Sara Blackmore |
| Swindon UA | Cherry Jones |
| Torbay UA | Caroline Dimond |
| Wiltshire UA | Tracy Daskiewicz |
| Birmingham | Becky Pollard (acting/interim) |
| Coventry | Liz Gaulton |
| Dudley | Deborah Harkins |
| Herefordshire | Karen Wright |
| Sandwell | Ansaf Azhar (acting/interim) |
| Shropshire UA | Rod Thomson |
| Solihull | Meradin Peachey |
| Staffordshire | Richard Harling |
| Stoke on Trent UA | Paul Edmondson-Jones |
| Telford and Wrekin UA | Liz Noakes |
| Walsall | Barbara Watt |
| Warwickshire | John Linnane |
| Wolverhampton | John Denley |
| Worcestershire | Frances Howie |
| Barnsley | Julia Burrows |
| Bradford | Sarah Muckle |
| Calderdale | Paul Butcher |
| Doncaster | Rupert Suckling |
| East Riding of Yorkshire UA | Tim Allison |
| Hull City Council | Julia Weldon |
| Kirklees | Rachel Spencer-Henshall |
| Leeds | Ian Cameron |
| North East Lincolnshire UA | Steve Pintus |
| North Lincolnshire UA | Penny Spring |
| North Yorkshire | Lincoln Sargeant |
| Rotherham | Theresa Roche |
| Sheffield | Greg Fell |
| Wakefield | Anna Hartley |
| York UA | Sharon Stoltz |
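
Once the table is in R it can be queried like any other data frame. The sketch below assumes the result has the two columns shown above, LA and Name, and uses five rows copied by hand from the table rather than calling get_dsph_england(), so that it runs offline.

```r
# Five rows copied from the table above (assumed column names: LA, Name)
dsph_sample <- data.frame(
  LA = c("Bedford Borough", "Central Bedfordshire UA", "Milton Keynes UA",
         "Essex", "Ealing"),
  Name = c("Muriel Scott", "Muriel Scott", "Muriel Scott",
           "Mike Gogarty", "Wendy Meredith (acting/interim)"),
  stringsAsFactors = FALSE
)

# Posts currently filled on an acting or interim basis
interim <- dsph_sample[grepl("acting/interim", dsph_sample$Name, fixed = TRUE), ]

# Directors who cover more than one local authority
n_per_dsph <- table(dsph_sample$Name)
shared <- names(n_per_dsph[n_per_dsph > 1])

interim
shared
```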

get_phe_catalogue identifies all the PHE publications on GOV.UK. For this function you have to set the n = argument. We recommend starting at n = 110. This produces an interactive searchable table of links.


# Named `catalogue` rather than `cat` to avoid masking base::cat()
catalogue <- get_phe_catalogue(n = 110)

catalogue