This vignette illustrates some of the basics of web scraping and some features of the myScrapers package, in particular its simple web-scraping functions. We also show some functions in the package designed specifically to retrieve public health information for public health practitioners.
The basic toolkit is the rvest package in R or Beautiful Soup in Python.
The myScrapers package is only available on GitHub and can be installed using devtools.
Web scraping is a set of techniques for obtaining information or data from websites. In R, the rvest and httr packages are the mainstay of scraping. They import and read HTML and XML pages into R, which can then be parsed and analysed.
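As a rough illustration of the import-then-parse step described above, the sketch below uses rvest on a small made-up HTML string rather than a live page, so it runs without network access. The HTML content and variable names are assumptions for illustration, and this is not necessarily how the myScrapers functions are implemented.

```r
library(rvest)  # re-exports read_html() and the %>% pipe

# A small, made-up HTML page supplied as a literal string.
page <- read_html('<html><body>
  <h1>Example releases</h1>
  <p>Weekly report now available.</p>
  <a href="/reports/week-1.pdf">Week 1</a>
  <a href="/reports/week-2.pdf">Week 2</a>
</body></html>')

# Select every <a> node, then read its href attribute.
links <- page %>% html_nodes("a") %>% html_attr("href")

# Select every <p> node, then read its visible text.
paragraphs <- page %>% html_nodes("p") %>% html_text()

print(links)
print(paragraphs)
```

Passing a URL to read_html() instead of a literal string would fetch and parse a live page in the same way.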
In myScrapers there are two basic functions:
get_page_links, which identifies the links on a webpage
get_page_text, which extracts text from a webpage
We can use get_page_links to extract the links from the following page of PHE statistical releases: https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england
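library(myScrapers)
library(magrittr)  # provides the %>% pipe
url <- "https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england"
# Show links 19 to 40 of those found on the page
get_page_links(url) %>%
  .[19:40]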
#> [1] "/government/statistics/mrsa-mssa-and-e-coli-bacteraemia-and-c-difficile-infection-30-day-all-cause-fatality"
#> [2] "/government/collections/healthcare-associated-infections-hcai-guidance-data-and-analysis"
#> [3] "/government/collections/escherichia-coli-e-coli-guidance-data-and-analysis"
#> [4] "/government/collections/clostridium-difficile-guidance-data-and-analysis"
#> [5] "/government/collections/staphylococcus-aureus-guidance-data-and-analysis"
#> [6] "/government/collections/pseudomonas-aeruginosa-guidance-data-and-analysis"
#> [7] "/government/statistics/weekly-all-cause-mortality-surveillance-2018-to-2019"
#> [8] "/government/collections/all-cause-mortality-surveillance"
#> [9] "/government/statistics/weekly-national-flu-reports-2018-to-2019-season"
#> [10] "/government/collections/weekly-national-flu-reports"
#> [11] "/government/collections/seasonal-influenza-guidance-data-and-analysis"
#> [12] "/government/statistics/norovirus-national-update"
#> [13] "/government/collections/rotavirus-guidance-data-and-analysis"
#> [14] "/government/collections/norovirus-guidance-data-and-analysis"
#> [15] "/government/statistics/emergency-presentations-of-cancer-quarterly-data"
#> [16] "/government/statistics/klebsiella-spp-bacteraemia-monthly-data-split-by-location-of-onset-by-nhs-trust"
#> [17] "/government/collections/klebsiella-species-guidance-data-and-analysis"
#> [18] "/government/statistics/clostridium-difficile-infection-monthly-data-by-attributed-clinical-commissioning-group"
#> [19] "/government/collections/clostridium-difficile-guidance-data-and-analysis"
#> [20] "/government/statistics/mssa-bacteraemia-monthly-data-by-nhs-acute-trust"
#> [21] "/government/collections/staphylococcus-aureus-guidance-data-and-analysis"
#> [22] "/government/statistics/p-aeruginosa-bacteraemia-monthly-data-split-by-location-of-onset-by-nhs-trust"
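The links above are relative to the site root, so they need the domain prepended before they can be visited or passed back to a scraping function. A minimal sketch in base R follows; `to_absolute` is a hypothetical helper written for this example, not a myScrapers function.

```r
# Site root for GOV.UK; the links returned above are relative to it.
base_url <- "https://www.gov.uk"

# Two of the relative links from the output above.
rel_links <- c(
  "/government/statistics/norovirus-national-update",
  "/government/collections/weekly-national-flu-reports"
)

# Hypothetical helper: prefix the site root only where a link is not
# already an absolute http(s) URL.
to_absolute <- function(links, base) {
  is_absolute <- grepl("^https?://", links)
  ifelse(is_absolute, links, paste0(base, links))
}

abs_links <- to_absolute(rel_links, base_url)
print(abs_links)
```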
We’ll use the GP in-hours syndromic surveillance data to illustrate further uses. This report “Monitors the number of people who visit their GP during surgery hours under the syndromic surveillance system.”
The system publishes weekly reports and spreadsheets, so obtaining a year’s worth of both manually would require 104 separate downloads (52 reports and 52 spreadsheets).
Using a web-scraping approach, this can be achieved in a few lines of code.
The code below identifies the PDF reports on the page and shows the first few.
urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018"
# Keep only links ending in ".pdf", then drop duplicates among the first ten
get_page_links(urls) %>%
  .[grepl("\\.pdf$", .)] %>%
  head(10) %>%
  unique()
#> [1] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/747453/GP_in-hours_weekly_bulletin_week_40.pdf"
#> [2] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/745446/GPinHoursEngBulletin2018Wk39.pdf"
#> [3] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/743478/GPinHoursEngBulletin2018Wk38.pdf"
#> [4] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/741881/GPinHoursEngBulletin2018Wk37.pdf"
#> [5] "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/740008/GPinHoursEngBulletin2018Wk36.pdf"
We can then use the downloader package to download the PDFs.
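The download step itself might look something like the sketch below. It uses base R's `download.file()` rather than the downloader package, and `pdf_destinations` and `download_pdfs` are hypothetical helpers named for this example. The default folder name `bulletins` is also an assumption.

```r
# Hypothetical helper: build a local file path for each URL from its file name.
pdf_destinations <- function(pdf_urls, dest_dir = "bulletins") {
  file.path(dest_dir, basename(pdf_urls))
}

# Hypothetical helper: download each PDF in turn. mode = "wb" writes binary
# files correctly on Windows. It is defined but not called here because it
# needs network access.
download_pdfs <- function(pdf_urls, dest_dir = "bulletins") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- pdf_destinations(pdf_urls, dest_dir)
  for (i in seq_along(pdf_urls)) {
    utils::download.file(pdf_urls[i], destfile = dest[i], mode = "wb")
    Sys.sleep(1)  # pause briefly between requests
  }
  invisible(dest)
}

# One of the bulletin links found above.
pdf_urls <- "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/745446/GPinHoursEngBulletin2018Wk39.pdf"
print(pdf_destinations(pdf_urls))
```

Calling `download_pdfs(pdf_urls)` would then fetch the files into the chosen folder.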
We can take a similar approach to spreadsheets.
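For spreadsheets, only the filter on the file extension needs to change. The sketch below applies it to a made-up vector of links standing in for the output of get_page_links; the example.org addresses are placeholders, not real files.

```r
# Made-up links standing in for the output of get_page_links().
links <- c(
  "https://example.org/files/bulletin_week_40.pdf",
  "https://example.org/files/data_week_40.xlsx",
  "https://example.org/files/data_week_39.xls",
  "https://example.org/files/data_week_40.xlsx",
  "https://example.org/about"
)

# Keep links ending in .xls or .xlsx (case-insensitive), dropping duplicates.
xls_links <- unique(links[grepl("\\.xlsx?$", links, ignore.case = TRUE)])
print(xls_links)
```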
Having downloaded the reports or spreadsheets it is now straightforward to import them for further analysis.
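library(readxl)
library(purrr)  # provides map()
# Match only files ending in .xls or .xlsx
files <- list.files(pattern = "\\.xlsx?$")
# Read the "Local Authority" sheet of each file, treating "*" as missing
data <- map(files, ~ read_excel(.x, sheet = "Local Authority", na = "*", skip = 4))
head(data)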
#> [[1]]
#> # A tibble: 151 x 27
#> `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#> <chr> <chr> <chr> <chr>
#> 1 E09000002 Barking and Dag… London X25001AA
#> 2 E09000003 Barnet London X25001AA
#> 3 E09000004 Bexley London X25001AA
#> 4 E09000005 Brent London X25001AA
#> 5 E09000006 Bromley London X25001AA
#> 6 E09000007 Camden London X25001AA
#> 7 E09000008 Croydon London X25001AA
#> 8 E09000009 Ealing London X25001AA
#> 9 E09000010 Enfield London X25001AA
#> 10 E09000011 Greenwich London X25001AA
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> # `PHE Region ODS Code` <chr>, `Denominator Population` <dbl>, `Observed
#> # number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> # CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> # `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> # `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> # per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> # SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> # 100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> # SIR__3` <chr>
#>
#> [[2]]
#> # A tibble: 151 x 27
#> `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#> <chr> <chr> <chr> <chr>
#> 1 E09000002 Barking and Dag… London X25001AA
#> 2 E09000003 Barnet London X25001AA
#> 3 E09000004 Bexley London X25001AA
#> 4 E09000005 Brent London X25001AA
#> 5 E09000006 Bromley London X25001AA
#> 6 E09000007 Camden London X25001AA
#> 7 E09000008 Croydon London X25001AA
#> 8 E09000009 Ealing London X25001AA
#> 9 E09000010 Enfield London X25001AA
#> 10 E09000011 Greenwich London X25001AA
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> # `PHE Region ODS Code` <chr>, `Denominator Population` <dbl>, `Observed
#> # number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> # CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> # `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> # `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> # per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> # SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> # 100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> # SIR__3` <chr>
#>
#> [[3]]
#> # A tibble: 151 x 27
#> `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#> <chr> <chr> <chr> <chr>
#> 1 E09000002 Barking and Dag… London X25001AA
#> 2 E09000003 Barnet London X25001AA
#> 3 E09000004 Bexley London X25001AA
#> 4 E09000005 Brent London X25001AA
#> 5 E09000006 Bromley London X25001AA
#> 6 E09000007 Camden London X25001AA
#> 7 E09000008 Croydon London X25001AA
#> 8 E09000009 Ealing London X25001AA
#> 9 E09000010 Enfield London X25001AA
#> 10 E09000011 Greenwich London X25001AA
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> # `PHE Region ODS Code` <chr>, `Denominator Population` <chr>, `Observed
#> # number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> # CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> # `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> # `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> # per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> # SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> # 100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> # SIR__3` <chr>
#>
#> [[4]]
#> # A tibble: 151 x 27
#> `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#> <chr> <chr> <chr> <chr>
#> 1 E09000002 Barking and Dag… London X25001AA
#> 2 E09000003 Barnet London X25001AA
#> 3 E09000004 Bexley London X25001AA
#> 4 E09000005 Brent London X25001AA
#> 5 E09000006 Bromley London X25001AA
#> 6 E09000007 Camden London X25001AA
#> 7 E09000008 Croydon London X25001AA
#> 8 E09000009 Ealing London X25001AA
#> 9 E09000010 Enfield London X25001AA
#> 10 E09000011 Greenwich London X25001AA
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> # `PHE Region ODS Code` <chr>, `Denominator Population` <dbl>, `Observed
#> # number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> # CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> # `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> # `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> # per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> # SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> # 100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> # SIR__3` <chr>
#>
#> [[5]]
#> # A tibble: 151 x 27
#> `ONS Upper Tier… `ONS Upper Tier… `PHE Centre Nam… `PHE Centre ODS…
#> <chr> <chr> <chr> <chr>
#> 1 E09000002 Barking and Dag… London X25001AA
#> 2 E09000003 Barnet London X25001AA
#> 3 E09000004 Bexley London X25001AA
#> 4 E09000005 Brent London X25001AA
#> 5 E09000006 Bromley London X25001AA
#> 6 E09000007 Camden London X25001AA
#> 7 E09000008 Croydon London X25001AA
#> 8 E09000009 Ealing London X25001AA
#> 9 E09000010 Enfield London X25001AA
#> 10 E09000011 Greenwich London X25001AA
#> # ... with 141 more rows, and 23 more variables: `PHE Region Names` <chr>,
#> # `PHE Region ODS Code` <chr>, `Denominator Population` <chr>, `Observed
#> # number of cases` <chr>, `Rate per 100,000` <chr>, SIR <chr>, `SIR
#> # CI` <chr>, `Historic SIR` <chr>, `Observed number of cases__1` <chr>,
#> # `Rate per 100,000__1` <chr>, SIR__1 <chr>, `SIR CI__1` <chr>,
#> # `Historic SIR__1` <chr>, `Observed number of cases__2` <chr>, `Rate
#> # per 100,000__2` <chr>, SIR__2 <chr>, `SIR CI__2` <chr>, `Historic
#> # SIR__2` <chr>, `Observed number of cases__3` <chr>, `Rate per
#> # 100,000__3` <chr>, SIR__3 <chr>, `SIR CI__3` <chr>, `Historic
#> # SIR__3` <chr>
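In the output above, `Denominator Population` is read as `<dbl>` in some sheets and `<chr>` in others, which gets in the way of stacking the sheets into one table. The base R sketch below shows one possible way to harmonise a column's type before combining. The data frames, column names and values are made up for illustration, and `harmonise` and `add_source` are hypothetical helpers.

```r
# Made-up stand-ins for two imported weekly sheets. In the second, the
# population column was read as text.
week1 <- data.frame(area = c("A", "B"), population = c(1000, 2000),
                    stringsAsFactors = FALSE)
week2 <- data.frame(area = c("A", "B"), population = c("1000", "not known"),
                    stringsAsFactors = FALSE)
sheets <- list(week1 = week1, week2 = week2)

# Coerce the column to numeric in every sheet; non-numeric entries become NA.
harmonise <- function(df) {
  df$population <- suppressWarnings(as.numeric(df$population))
  df
}
sheets <- lapply(sheets, harmonise)

# Record which list element each row came from, then stack the sheets.
add_source <- function(df, id) {
  df$source <- id
  df
}
combined <- do.call(rbind, Map(add_source, sheets, names(sheets)))
rownames(combined) <- NULL
print(combined)
```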
Using simple functions it is relatively easy to scrape Duncan Selbie’s blog posts into a data frame for further analysis.
The base URL is https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/, and there are 8 pages of results, so the first task is to create a list of URLs.
url_ds <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/"
url_ds1 <- paste0(url_ds, "page/", 2:8)
urls_ds <- c(url_ds, url_ds1)
Then we can extract the links and isolate those specific to the Friday messages:
links <- map(urls_ds, ~ get_page_links(.x))
friday_message <- links %>% flatten() %>%
  .[grepl("duncan-selbies-friday-message", .)] %>%
  .[!grepl("comments", .)] %>% unique()
head(friday_message)
#> [[1]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/10/12/duncan-selbies-friday-message-12-october-2018/"
#>
#> [[2]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/10/05/duncan-selbies-friday-message-5-october-2018/"
#>
#> [[3]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/28/duncan-selbies-friday-message-28-september-2018/"
#>
#> [[4]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/14/duncan-selbies-friday-message-14-september-2018/"
#>
#> [[5]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/09/07/duncan-selbies-friday-message-7-september-2018/"
#>
#> [[6]]
#> [1] "https://publichealthmatters.blog.gov.uk/2018/08/31/duncan-selbies-friday-message-31-august-2018/"
and then extract blog text:
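library(tm)
library(magrittr)  # provides extract()
library(purrr)     # provides map() and map_df()
library(stringr)   # provides str_remove()
library(dplyr)     # provides mutate() and bind_rows()
# Get the text of each post, then strip the first newline, the cookie banner and the salutation
blog_text <- map(friday_message, ~ get_page_text(.x))
blog_text <- map(blog_text, ~ str_remove(.x, "\\n"))
blog_text <- map(blog_text, ~ str_remove(.x, " GOV.UK blogs use cookies to make the site simpler. Find out more about cookies\n "))
blog_text <- map(blog_text, ~ str_remove(.x, "Dear everyone"))
# Use the second text element of each post as its title
blog_title <- map(blog_text, 2)
names(blog_text) <- blog_title
# Keep elements 5 to 11 of each post and stack them into one data frame
blog_text1 <- map(blog_text, extract, 5:11)
blog_text2 <- map(blog_text1, data.frame)
blog_text2 <- map_df(blog_text2, bind_rows)
# .x..i.. is the column name generated automatically by data.frame() above
blog_text2 <- blog_text2 %>% mutate(text = clean_texts(.x..i..))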
We can then visualise with, for example, a wordcloud.
library(quanteda)
corp <- corpus(blog_text2$text)
# Bigrams to drop: site boilerplate and sign-offs rather than content
stop_bigrams <- c("government_licence", "open_government", "public_health", "official_blog",
                  "blog_public", "health_england", "cancel_reply", "content_available",
                  "health_blog", "licence_v", "best_wishes", "otherwise_stated",
                  "except_otherwise", "friday_messages", "available_open")
dfm <- dfm(corp, ngrams = 2, remove = stop_bigrams)
textplot_wordcloud(dfm)
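A word cloud sizes each term by its frequency. To make the bigram counting step concrete, the base R sketch below counts bigrams in a made-up sentence. It does not use quanteda and is only meant to show the idea behind the bigram features removed above.

```r
# Made-up sentence standing in for cleaned blog text.
text <- "public health matters because public health protects people"

# Split into lower-case word tokens.
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Pair each token with its successor to form bigrams joined by "_".
bigrams <- paste(head(tokens, -1), tail(tokens, -1), sep = "_")

# Count the bigrams and sort with the most frequent first.
counts <- sort(table(bigrams), decreasing = TRUE)
print(counts)
```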
I have added a few further functions to the package.
get_dsph_england returns a list of local authorities and their current Directors of Public Health (DsPH). It scrapes https://www.gov.uk/government/publications/directors-of-public-health-in-england--2/directors-of-public-health-in-england
LA | Name |
---|---|
Derby UA | Cate Edwynn |
Derbyshire | Dean Wallace |
Leicester UA | Ruth Tennant |
Leicestershire | Mike Sandys |
Lincolnshire | Derek Ward |
Northamptonshire | Lucy Wightman |
Nottingham UA | Alison Challenger |
Nottinghamshire | Jonathan Gribbin |
Rutland UA | Mike Sandys |
Bedford Borough | Muriel Scott |
Cambridgeshire | Liz Robin |
Central Bedfordshire UA | Muriel Scott |
Essex | Mike Gogarty |
Hertfordshire | Jim McManus |
Luton UA | Gerry Taylor |
Milton Keynes UA | Muriel Scott |
Norfolk (covers Great Yarmouth) | Louise Smith |
Peterborough UA | Liz Robin |
Southend on Sea UA | Andrea Atherton |
Suffolk (covers Waveney) | Abdul Razaq |
Thurrock UA | Ian Wake |
Barking and Dagenham | Matthew Cole |
Barnet | Tamara Djuretic |
Bexley | Anjan Ghosh |
Brent | Melanie Smith |
Bromley | Nada Lemic |
Camden | Julie Billet |
City | Penny Bevan |
Croydon | Rachel Flowers |
Ealing | Wendy Meredith (acting/interim) |
Enfield | Stuart Lines |
Greenwich | Steve Whiteman |
Hackney | Penny Bevan |
Hammersmith and Fulham | Anita Parkin (acting/interim) |
Haringey | Will Maimaris (acting/interim) |
Harrow | Carole Furlong |
Havering | Mark Ansell (acting/interim) |
Hillingdon | Steven Hajioff |
Hounslow | Laura Maclehose (acting/interim) |
Islington | Julie Billet |
Kensington and Chelsea | Mike Robinson |
Kingston upon Thames | Iona Lidington |
Lambeth | Ruth Hutt (acting/interim) |
Lewisham | Danny Ruta |
Merton | Dagmar Zeuner |
Newham | Livia Royle (acting/interim) |
Redbridge | Gladys Xavier (acting/interim) |
Richmond upon Thames | Houda Al-Sharifi |
Southwark | Kevin Fenton |
Sutton | Imran Choudhury |
Tower Hamlets | Somen Banerjee |
Waltham Forest | Joe McDonnell |
Wandsworth | Houda Al-Sharifi |
Westminster | Mike Robinson |
Darlington UA | Miriam Davidson |
County Durham UA | Amanda Healy |
Gateshead Council | Alice Wiseman |
Hartlepool | Peter Brambleby (acting/interim) |
Middlesbrough UA | Edward Kunonga |
Newcastle upon Tyne | Eugene Milne |
North Tyneside | Wendy Burke |
Northumberland | Liz Morgan |
Redcar and Cleveland | Edward Kunonga (acting/interim) |
South Tyneside | Tom Hall |
Stockton on Tees UA | Sarah Bowman-Abouna |
Sunderland | Gillian Gibson |
Blackburn with Darwen | Dominic Harrison |
Blackpool | Arif Rajpura |
Bolton | David Herne (acting/interim) |
Bury | Lesley Jones |
Cheshire East UA | Fiona Reynolds |
Cheshire West and Chester UA | Ian Ashworth |
Cumbria | Colin Cox |
Halton UA | Eileen O’Meara |
Knowsley | Matthew Ashton |
Lancashire | Sakthi Karunanithi |
Liverpool | Sandra Davies |
Manchester | David Regan |
Oldham | Katrina Stephens (acting/interim) |
Rochdale | Andrea Fallon |
Salford | David Herne |
Sefton | Matthew Ashton (acting/interim) |
St Helens | Sue Forster |
Stockport | Stephen Watkins |
Tameside | Jeanelle De Gruchy |
Trafford | Eleanor Roafe (acting/interim) |
Warrington UA | Muna Abdel Aziz |
Wigan | Kate Ardern |
Wirral | Fiona Johnstone |
West Berkshire UA | Tessa Lindfield |
Bracknell Forest | Lisa McNally |
Brighton UA | Alistair Hill |
Buckinghamshire County Council | Jane O’Grady |
East Sussex | – Darrell Gale |
Hampshire | Sallie Bacon |
Isle of Wight UA | Sallie Bacon (acting/interim) |
Kent | Andrew Scott-Clark |
Medway UA | James Williams |
Oxfordshire | Jonathan McWilliam |
Portsmouth UA | Jason Horsley |
Reading UA | Tessa Lindfield |
Slough UA | Tessa Lindfield |
Southampton UA | Jason Horsley |
Surrey | Helen Atkinson |
West Sussex | - Anna-Marie Raleigh |
Windsor and Maidenhead UA | Tessa Lindfield |
Wokingham UA | Tessa Lindfield |
‘Bathnes’ Bath and North East Somerset | Bruce Laurence |
Bournemouth | David Phillips (Sam Crowe covering secondment) |
City of Bristol | Susan Milner (acting/interim) |
Cornwall UA | Caroline Court (acting/interim) |
Devon County Council | Virginia Pearson |
Dorset | David Phillips (Sam Crowe covering secondment) |
Gloucestershire | Sarah Scott |
Isle of Scilly UA | Caroline Court (acting/interim) |
North Somerset UA | Andrew Burnett (acting/interim) |
Plymouth UA | Ruth Harrell |
Poole UA | David Phillips (Sam Crowe covering secondment) |
Somerset | Trudi Grant |
South Gloucestershire UA | Mark Pietroni / Sara Blackmore |
Swindon UA | Cherry Jones |
Torbay UA | Caroline Dimond |
Wiltshire UA | Tracy Daskiewicz |
Birmingham | Becky Pollard (acting/interim) |
Coventry | Liz Gaulton |
Dudley | Deborah Harkins |
Herefordshire | Karen Wright |
Sandwell | Ansaf Azhar (acting/interim) |
Shropshire UA | Rod Thomson |
Solihull | Meradin Peachey |
Staffordshire | Richard Harling |
Stoke on Trent UA | Paul Edmondson-Jones |
Telford and Wrekin UA | Liz Noakes |
Walsall | Barbara Watt |
Warwickshire | John Linnane |
Wolverhampton | John Denley |
Worcestershire | Frances Howie |
Barnsley | Julia Burrows |
Bradford | Sarah Muckle |
Calderdale | Paul Butcher |
Doncaster | Rupert Suckling |
East Riding of Yorkshire UA | Tim Allison |
Hull City Council | Julia Weldon |
Kirklees | Rachel Spencer-Henshall |
Leeds | Ian Cameron |
North East Lincolnshire UA | Steve Pintus |
North Lincolnshire UA | Penny Spring |
North Yorkshire | Lincoln Sargeant |
Rotherham | Theresa Roche |
Sheffield | Greg Fell |
Wakefield | Anna Hartley |
York UA | Sharon Stoltz |
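A table like the one above can be pulled from a page with rvest's `html_table()`. The sketch below runs on a small made-up HTML table so that it works offline. It is an assumption about the general approach, not the actual implementation of get_dsph_england, and the area and person names are placeholders.

```r
library(rvest)  # re-exports read_html() and the %>% pipe

# A made-up two-row table supplied as literal HTML.
page <- read_html('<table>
  <tr><th>LA</th><th>Name</th></tr>
  <tr><td>Area One</td><td>Person A</td></tr>
  <tr><td>Area Two</td><td>Person B (acting/interim)</td></tr>
</table>')

# html_table() converts each <table> node to a data frame; take the first.
tables <- page %>% html_nodes("table") %>% html_table()
dsph <- as.data.frame(tables[[1]])

# Flag interim appointments, then strip the marker from the name.
dsph$interim <- grepl("(acting/interim)", dsph$Name, fixed = TRUE)
dsph$Name <- trimws(sub("(acting/interim)", "", dsph$Name, fixed = TRUE))
print(dsph)
```

Splitting the interim marker into its own column makes it easier to filter or count acting appointments.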
get_phe_catalogue identifies all the PHE publications on GOV.UK. For this function you have to set the n argument; we recommend starting at n = 110. It produces an interactive, searchable table of links.