PHE - Health Improvement
These slides outline an approach to web scraping local authority joint strategic needs assessments (JSNAs).
Automated:
Web scraping
https://en.wikipedia.org/wiki/Web_scraping
url <- "https://en.wikipedia.org/wiki/Web_scraping"
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
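As a minimal illustration of the idea (not part of the JSNA pipeline below), the Wikipedia page above can itself be scraped. This is a sketch only, assuming the rvest package is available; the "p" CSS selector simply grabs the page's paragraphs.

# minimal web-scraping sketch, assuming rvest is installed (not part of the JSNA pipeline)
library(rvest)
page <- read_html(url)   # fetch and parse the page defined above
paragraphs <- page %>%
  html_nodes("p") %>%    # select the paragraph elements
  html_text()            # extract their text
head(paragraphs, 2)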
The approach uses the following R packages:

- myScrapers - Google searching and link extraction
- tidyverse - data wrangling
- readtext - reading pdfs
- quanteda and tidytext - NLP and text mining
- phecharts - PHE styles and themes

Note - due to limitations placed by Google on web scraping, this process only returns around 100 hits. Scraping ALL JSNAs will require additional work (one possible approach is sketched after the search results below).

We use the googlesearchR function from the myScrapers package to perform an initial search:

library(tidyverse)
library(myScrapers)
library(phecharts)

jsna <- googlesearchR("jsna")
head(jsna)
## [[1]]
## [1] "https://cambridgeshireinsight.org.uk/jsna/"
##
## [[2]]
## [1] "http://www.devonhealthandwellbeing.org.uk/jsna/about/"
##
## [[3]]
## [1] "http://www.devonhealthandwellbeing.org.uk/jsna/"
##
## [[4]]
## [1] "https://www.gov.uk/government/publications/joint-strategic-needs-assessment-and-joint-health-and-wellbeing-strategies-explained"
##
## [[5]]
## [1] "http://www.healthandwellbeingbucks.org/what-is-the-jsna"
##
## [[6]]
## [1] "https://new.enfield.gov.uk/healthandwellbeing/topics/jsna/"
We then use the get_page_links function to extract the links for each page identified by the Google search (this takes some time). We wrap it with purrr::safely to prevent the function from stopping if it encounters links it can't open.

jsna1 <- map_df(jsna, data.frame) # convert the list of search results to a data frame
safe_links <- safely(get_page_links) # error-tolerant version of get_page_links
# extract links per page and store as a tibble
jsna2 <- jsna1 %>%
mutate(links = map(.x..i.., ~ safe_links(.x))) %>% # .x..i.. is the default column name from map_df()
as_tibble()
# remove errors and filter for pages containing 'jsna'
jsna3 <- jsna2 %>%
mutate(links = map(links, "result")) %>% # pull the 'result' element out of each safely() call
filter(!map_lgl(links, is.null)) %>% # drop pages that could not be read
unnest(links) %>%
filter(str_detect(links, "[Jj][Ss][Nn][Aa]"))
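To make the map(links, "result") step clearer: purrr::safely() wraps a function so that it always returns a two-element list, result and error, exactly one of which is NULL. A tiny illustration:

# safely() returns list(result = ..., error = ...); exactly one of the two is NULL
safe_log <- safely(log)
safe_log(10)     # $result = 2.302585, $error = NULL
safe_log("ten")  # $result = NULL, $error = <simpleError: non-numeric argument ...>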
This gives a list of 1630 JSNA-related pages.
We can then search again within these pages, either to detect pdfs we can download and analyse, or to extract further sets of links where reports are web-based.
# search within jsna pages
jsna4 <- jsna3 %>%
mutate(links1 = map(links, ~(safe_links(.x))))
# extract pdfs and create full links to pdfs if required
pdfs <- jsna4 %>%
mutate(links2 = map(links1, "result")) %>%
filter(links2 != "NULL") %>%
unnest(links2) %>%
filter(str_detect(links2, "pdf")) %>%
distinct() %>%
mutate(full_links = ifelse(!str_detect(links2, "^http"), paste0(links, links2), links2))
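Pasting the page url and the relative link together works for links like "reports/doc.pdf" but can produce malformed urls for root-relative links such as "/downloads/doc.pdf". A more robust alternative, sketched here on the assumption that the xml2 package is available, is xml2::url_absolute(), which resolves each link against the page it was found on (and leaves absolute links unchanged):

# alternative sketch: resolve each pdf link against its source page
pdfs <- pdfs %>%
  mutate(full_links = map2_chr(links2, links, ~ xml2::url_absolute(.x, base = .y)))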
# safe readtext function; NB might also try the newer pdftools package
safe_readtext <- safely(readtext::readtext)
safe_pdf <- safely(pdftools::pdf_text)
# try to read the pdfs (readtext version and pdftools version)
pdfs_text <- pdfs %>%
mutate(text = map(full_links, ~(safe_readtext(.x))))
pdfs1_text <- pdfs %>%
mutate(text = map(full_links, ~safe_pdf(.x)))
# tidy - keep the documents that were read successfully
pdfs_text1 <- pdfs_text %>%
mutate(text = map(text, "result")) %>%
filter(!map_lgl(text, is.null)) %>%
unnest(text)
# save file
pdfs_text1 %>%
write_rds("jsna.rds")
From these 1630 pages, this second iteration is able to extract 597 pdfs.
Finally, we can use simple search techniques to identify where and how often PHE is quoted in each report.
jsna1 <- read_rds("jsna.rds")
library(quanteda)
# create corpus
corpus_jsna <- corpus(jsna1$text)
# add back full links
docvars(corpus_jsna, "link") <- pdfs_text1$full_links
# create search
multiword <- c( "phof*", "phout*", "fingert*", "phe") # this could be a miniui shiny app reading the rds file
# run keywords in context
kw <- kwic(corpus_jsna, pattern = phrase(multiword), window = 15)
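As an optional extra, the kwic object can also be visualised as a lexical dispersion plot showing where in each document the search terms occur. Depending on the quanteda version, this function lives in quanteda itself or in the quanteda.textplots companion package, and with hundreds of documents it is only readable on a subset.

# optional: dispersion plot of keyword positions (most useful on a small subset of documents)
quanteda.textplots::textplot_xray(kw)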
# add id field (note: corpus_jsna$documents relies on the pre-v2 quanteda corpus structure)
corp <- corpus_jsna$documents %>%
rownames_to_column("text") %>%
select(text, link)
# join full links back to the keyword table
kw %>%
left_join(corp, by = c("docname" = "text")) %>%
mutate(link1 = paste0("<a href='", link, "'>Links</a>")) %>%
DT::datatable(filter = "top", escape = FALSE)
jsna_dtm <- dfm(corpus_jsna, remove = stopwords("en"), ngrams = 1:2)
jsna_dtm1 <- dfm_trim(jsna_dtm, sparsity = 0.7) # drop sparse features (roughly, those appearing in fewer than 30% of documents)
dict <- create_lookup(phe = c("phe", "health_eng*"),
profile = "profile*",
fingertips = "fingertip*",
local_health = c("localhealth", "local_health"),
phof = c("phof", "health_outcomes_framework"),
indicator = "indicator*")
lookup1 <- dfm_lookup(jsna_dtm1, dictionary = dict)
lu <- lookup1 %>%
convert(., to = "data.frame") %>%
gather(metric, value, 2:ncol(.)) %>%
ggplot(aes(document, metric, fill = value)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "black") +
scale_x_discrete(position = "top") +
theme_phe() +
theme(axis.text.x.top = element_text(angle = 90, hjust = 0))
lu
lookup1 %>%
convert(., to = "data.frame") %>%
gather(metric, value, 2:ncol(.)) %>%
mutate(mention = ifelse(value == 0, 0, 1)) %>%
group_by(metric) %>%
summarise(`% reports with term` = round(100 * mean(mention), 2)) %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = TRUE)
| metric | % reports with term |
|---|---|
| fingertips | 12.81 |
| indicator | 60.07 |
| local_health | 20.63 |
| phe | 34.44 |
| phof | 10.32 |
| profile | 54.08 |
# count frequency of keywords per document
kw %>%
left_join(corp, by = c("docname" = "text")) %>%
group_by(docname, link) %>%
count() %>%
arrange(-n) %>%
ungroup() %>%
select(-docname) %>%
select(n, everything()) %>%
distinct() %>%
knitr::kable()