This note shows how to scrape and analyse PHE’s monthly bulletins. It is inspired by this book.
Having automated the download of bulletins from the web, the next step is to convert the files from PDF to text format. This can be achieved in R with the new pdftools package, available via rOpenSci.
Firstly, load the relevant libraries for conversion and subsequent data wrangling and analysis.
library(pdftools)
library(tm)
library(tidytext)
library(dplyr)
library(stringr)
library(ggplot2)
Then switch to the directory where the files are stored and select only the PDFs.
setwd("~/Documents/R_projects/feeback/monthly_bulletins")
## Match only file names that end in .pdf
files <- list.files(pattern = "\\.pdf$")
file <- tibble(files) %>%
  mutate(doc = row_number())
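Before converting every file, it can help to see what pdf_text() returns for a single document. The sketch below is not part of the original analysis: it writes a throwaway two-page PDF with base R graphics (the file and its contents are made up for illustration) and reads it back.

```r
library(pdftools)

# Create a temporary two-page PDF with base R graphics (hypothetical content)
tmp <- tempfile(fileext = ".pdf")
pdf(tmp)
plot.new(); text(0.5, 0.5, "Page one")
plot.new(); text(0.5, 0.5, "Page two")
dev.off()

# pdf_text() returns a character vector with one element per page
pages <- pdf_text(tmp)
length(pages)
```

Because each file comes back as a vector of pages, applying pdf_text() over many files with lapply() gives a list of character vectors, one per bulletin.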
Then we convert the files to text using the pdftools package and build a corpus with the tm package, which we tidy using the tidy function of the tidytext package. We can then add the file name.
There is extraneous formatting which we can remove using the str_replace_all function of the stringr package.
## Convert each PDF to text (one character vector per file)
l <- lapply(files, pdf_text)
## Build a tm corpus and tidy it into one row per document
corp <- tm::Corpus(tm::VectorSource(l))
corptd <- tidy(corp)
corptd <- corptd %>%
  mutate(doc = row_number()) %>%
  full_join(file, by = "doc") %>%
  select(files, text)
## Remove line breaks
corptd$text <- str_replace_all(corptd$text, "\n", "")
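One thing to watch with this clean-up step: replacing a line break with an empty string joins the last word of one line to the first word of the next. The sketch below uses a made-up string standing in for a page of extracted text and compares that with replacing the line break by a space; str_squish() is a stringr helper that trims and collapses repeated whitespace.

```r
library(stringr)

# Toy text standing in for one page returned by pdf_text()
page <- "PHE Bulletin provides news on PHE\nand the public health landscape.\nFor more\ninformation see our website."

# Replacing "\n" with "" fuses the words either side of each line break
fused <- str_replace_all(page, "\n", "")

# Replacing with a space keeps them apart; str_squish() tidies the whitespace
spaced <- str_squish(str_replace_all(page, "\n", " "))

fused
spaced
```

Which variant is preferable depends on the PDFs; the space version avoids run-together tokens at the tokenising stage.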
In total there are 40 bulletins available for analysis.
The bulletins are arranged in articles. We can visualise how many words there are per bulletin.
corptd %>%
  group_by(files) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  summarise(words = sum(n)) %>%
  ggplot(aes(reorder(files, words), words)) +
  geom_point() +
  coord_flip() +
  labs(x = "",
       y = "Number of words",
       title = "Number of words per bulletin") +
  theme(axis.text.x = element_text(size = 6))
Joining, by = "word"
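To make the counting step concrete, here is a minimal sketch on two made-up documents (the file names and sentences are invented for illustration). It tokenises the text, drops the stop words that ship with tidytext, and counts what remains per file.

```r
library(dplyr)
library(tidytext)

# Two toy documents in the same shape as corptd (files, text)
toy <- tibble(
  files = c("a.pdf", "b.pdf"),
  text  = c("The outbreak of Salmonella is being investigated",
            "Cancer survival continues to improve in England")
)

words_per_file <- toy %>%
  unnest_tokens(word, text) %>%            # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%   # drop common English words
  count(files, name = "words")             # remaining words per file

words_per_file
```

Passing by = "word" to anti_join() is optional; it only suppresses the joining message.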
It looks like a number of terms are common to every document and relate to the publication template rather than the variable content. We add these to the stop word dictionary, which will remove them from the analysis.
bulletin_sw <- tibble(word = c("phe", "published", "bulletin", "press", "phe's", "www.gov.uk", "news", "publications", "gateway", "formore"), lexicon = "SMART")
stop_words1 <- bind_rows(stop_words, bulletin_sw)
library(wordcloud)
cloud <- corptd %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words1) %>%
  count(word, sort = TRUE)
Joining, by = "word"
with(cloud, wordcloud(word, n, max.words = Inf, scale = c(8, 0.2), rot.per = 0.4, random.order = FALSE))
We can also create a bigram wordcloud, which plots two-word combinations rather than single words.
library(wordcloud)
library(tidyr)
cloud1 <- corptd %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
## Split the bigrams back into separate words
cloud_sep <- cloud1 %>%
  separate(bigram, c("word1", "word2"), sep = " ")
## and filter out the 'stop words'
cloud_filt <- cloud_sep %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
## and recombine
cloud_new <- cloud_filt %>%
  unite(bigram, word1, word2, sep = " ")
cloud_new
## and replot the word cloud
with(cloud_new, wordcloud(bigram, n, max.words = Inf, random.order = FALSE, scale = c(8, 0.2), rot.per = .5, colors = brewer.pal(8, "Dark2")))
We can see that some of the commoner terms and phrases used in PHE’s bulletins include ‘local authorities’, ‘nhs england’, ‘mental health’, ‘kevin fenton’, ‘health check’ and ‘intelligence network’.
We can extend the analysis further by looking at the distribution of terms in individual bulletins, and then looking for patterns to see if bulletins can be clustered according to content.
First we can look at a single bulletin:
files | text |
---|---|
Aug_PHE_Bulletin.pdf | PHE Bulletin News and views for the public health sectorPHE publications gateway number 2014301 29 August 2014PHE Bulletin, Public Health England’s regular update, provides news and information on PHEand the public health landscape for all those concerned with the public’s health. For moreinformation see our website: www.gov.uk/phePublic health newsUpdate on Ebola as British national working in Sierra Leone is repatriatedA decision was made last weekend to repatriate a British national healthcare workerresiding in Sierra Leone, who has been diagnosed with Ebola virus disease. Thepatient is being treated in an isolation unit at the Royal Free Hospital, London. Allappropriate protocols have been promptly activated by the Department of Health,PHE and NHS England and protective measures strictly maintained to minimise therisk of transmission to staff. Ebola is a form of viral haemorrhagic fever and istransmitted by direct contact with the blood or bodily fluids of an infected person.Currently more than 3,000 cases have been reported in Guinea, Liberia and SierraLeone, of which there have been more than 1,500 deaths. For more information seethe joint Department of Health, PHE and NHS England press release.New analysis shows improving cancer survival in England using world-leadingcancer staging dataCancer survival in England for breast, lung, prostate, colorectal and ovarian cancercontinues to improve, shows new data published in a report this week from PHE’sNational Cancer Intelligence Network. This work draws on the huge advances in thequality and completeness of cancer staging data by the NHS and the timeliness ofthe information gathered through the National Cancer Registration Service, one ofthe most advanced anywhere in the world. The report describes the 1-year survivalanalysis for patients in England first diagnosed in 2012. For more information seePHE’s press release. 
1PHE investigating national outbreak of SalmonellaPHE is investigating a national outbreak of Salmonella Enteritidis, working with theFood Standards Agency and public health organisations in Europe. As at 22 August,total reported numbers reached 247 cases, with 99 cases in Hampshire, 54 in WestMidlands, 39 in Cheshire and Merseyside and 39 in London. Overall case reportinghas slowed in the last two weeks and PHE assesses the current risk as low. There isevidence to indicate that recent cases in Europe with the same strains of Salmonellainfection were associated with consumption of eggs from a single source fromoutside the UK. For more information see PHE’s press release.Whole genome sequencing to revolutionise investigation of outbreaks ofinfectious diseasePHE is at the forefront of using new whole genome sequencing (WGS) technologiesto improve the diagnosis and control of infectious diseases and is leading theimplementation whole genome sequencing HIV, tuberculosis and hepatitis C in thecontext of the 100,000 genomes project. PHE is also working with GenomicsEngland as part of the genomes project to apply WGS to further our understandingof why some people develop severe reactions to infections. To date the genomes ofover 28,000 bacteria and viruses have been sequenced by PHE. For moreinformation see PHE’s press release.Standardised packaging can save lives and boost local economiesPHE has submitted its response to the Government’s consultation on standardisedtobacco packaging. The response includes new PHE figures on the potential benefitsthat standardised packaging of tobacco products could bring, not only for health, butin savings of around £500 million – providing a real economic boost to the mostdeprived communities. Recent official data from Australia, where standardisedpackaging was introduced in December 2012. Data from the Australian Treasuryshows a 3.4% fall in tobacco sales by volume in the first year following theintroduction of standardised packs. 
For more information, see PHE’s press release.PHE reveals winning entrepreneurs of Health X competitionPHE has unveiled the three winners of the PHE Health X initiative, the firstcompetition of its kind that seeks to bring inspirational technology to the public bymaking motivational health apps widely available for free. The winning apps are: Fee fi fo fit, an intervention product to promote positive changes in young people using a game-based reward system Foodswitch, a smartphone app which provides consumers with nutritional information to help them make healthier choices when shopping 2 Youniverse, a 28-day exercise and diet planner, which generates daily meal plans, shopping lists and exercise ideas.Benefits for each winner include promotional support and help with productdevelopment. For more information see PHE’s press release.Businesses risk losing billions unless they adapt: report reveals the futurecost of dementiaPHE and Alzheimer’s Society have released a new report on the future financialimplications to the nation’s businesses of dementia, and call on employers to adapttheir working environment to support the increasing numbers affected by thecondition. The report, from the Centre for Economics and Business Researchreveals that by 2030, when the number of people with dementia is expected to rise toover 1 million, dementia caring obligations could cost companies more than £3billion. The report predicts this will have a huge impact on businesses as the numberof workers reducing hours, changing work patterns or even quitting is expected togrow. For more information see PHE’s press release.BCG vaccine prevents TB infection in childrenA new study from PHE published in the British Medical Journal has found BacillusCalmette-Guérin (BCG) vaccine may protect against M. tuberculosis (TB) infection,in addition to decreasing progression of TB from infection to disease. 
Prior to thisreview, it has been widely accepted that BCG vaccine protects against the mostsevere forms of disease such as tuberculosis (TB) meningitis in children. The reviewindicates BCG vaccine can also protect against an individual becoming infected. Formore information see PHE’s press release.New cardiovascular disease profiles for clinical commissioning groupsPHE’s National Cardiovascular Intelligence Network has published cardiovasculardisease profiles for each clinical commissioning group in England. The profiles bringtogether information on coronary heart disease, diabetes, kidney disease and strokeand on the leading risk factors such as smoking and obesity for local areas to use intheir service planning. For more information see PHE’s news storySignificant scope to improve antibiotic prescribingA new study by scientists at PHE and University College London has found that thelikelihood of general practitioners prescribing antibiotics for coughs and coldsincreased by 40% between 1999 and 2011, despite Government recommendationsto reduce prescribing for illnesses largely caused by viruses. The researchers alsofound substantial variation in prescribing between general practices, with the highest 3prescribing practices twice as likely to give a prescription for coughs and colds as thelowest prescribers. The research is published in the Journal of AntimicrobialChemotherapy. For more information see PHE’s press release.Earlier screening recommended for serious abnormalities in pregnancyThe UK National Screening Committee (UK NSC) has recommended earlierscreening for the rare chromosomal abnormalities, Edward’s Syndrome (Trisomy 18)and Patau’s Syndrome (Trisomy 13) during pregnancy, giving women the informationthey need to access support and make choices about their pregnancy at an earlierstage. The recommendation was made at the UK NSC committee meeting in June2014 and is published in its minutes. 
The UK NSC recommended against populationscreening for atrial fibrillation, Type 2 diabetes and parvovirus B19 infection. Formore information see PHE’s press release.PHE urges freshers to get MenC vaccine before beginning universityPHE is urging new students (freshers) to ensure they get vaccinated againstmeningococcal C (MenC) infection before beginning university in September. In theUK, all children are offered MenC vaccine to protect them against MenC infection butas the protection can wane, a booster for teenagers was added last year. For thenext few years, university freshers will also be eligible for vaccination until theteenagers who have had the booster reach university age. For more information seePHE’s press release.Guidance on multidisciplinary public health teamsPHE has published guidance on organising and managing multidisciplinary publichealth teams in local government which has been co-produced with the LocalGovernment Association, the Association of Directors of Public Health and theFaculty of Public Health. The guidance concerns the appropriate employment ofpublic health professionals who carry out roles as consultants in public health anddirectors of public health and who are included on the General Medical CouncilSpecialist Register, the General Dental Council Specialist List or the UK PublicHealth Register for Public Health Specialists.Public Health Outcomes Framework data tool updatedThe Public Health Outcomes Framework data tool was updated earlier this month.The web-based tool brings together available indicators from the framework to helplocal authorities and others understand how well the public’s health is beingimproved and protected. Indicators on air pollution, population affected by noise, 4physical activity in adults and preventable sight loss were newly updated in thisquarter’s update.Feedback sought for Child Health Profiles 2015Earlier this year, PHE published Child Health Profiles 2014 for each top tier localauthority in England. 
PHE is currently reviewing the content of last year's profiles andinviting stakeholders to complete a short survey on the priorities they would like tosee reflected next year.Obesity and physical activity facts sheets updatedA series of PHE factsheets which compile up-to-date key information and data aboutobesity and physical activity in an easily readable format have been updated thismonth. The data factsheets will be a useful resource for policy makers, andpractitioners. There is also an update to the physical activity data sources.PHE’s health protection web content to move to Government websiteThe HPA.org.uk website will be closing on 2 September 2014 and PHE’s healthprotection content will move to GOV.UK, the UK public sector information websitecreated to provide a single point of access to HM Government services. Users whocontinue to use the HPA.org.uk web address will be re-directed automatically to thenew content on GOV.UK, or to the National Archive for older, less-used information.This is the start of a process to bring together all PHE’s information in one place andto make sure we are meeting the needs of our users.Results from PHE’s first annual public opinion surveyThe results from PHE’s first annual public opinion survey, carried out by Ipsos MORI,are now available. The PHE commissioned survey complements provides anindependent measurement of how the public perceives public health and PHE’s rolein the delivery of public health services. PHE chief executive Duncan Selbie said “athird of the public say they have heard of us and, when given an explanation of ourrole, two thirds would be confident in our advice. This is a sound beginning for us tobuild on”. 5Recent PHE BlogsWhat does the digital age mean for the public's health? by Kevin Fenton (4 August2014). Through initiatives like HealthX and Be He@lthy, Be Mobile, PHE is alreadyheavily involved in the digital health and mHealth revolution. 
In a new blog, Directorfor Health and Wellbeing Professor Kevin Fenton asks where we go next.Through health and high water: a young carer’s perspective by Viv Bennett andSophie (6 August 2014). Sophie, a young carer, talks to PHE Director of NursingProfessor Viv Bennett about the difficulties she and other young carers experience.We are what we eat by Julian Flowers (7 August 2014). How do differences inincome and education relate to differences in diet? Julian Flowers looks at what thedata show in the first of a series of blogs focusing on the wealth of available publichealth data and what those data can tell us.Just how close is ICT to the frontline these days? by Michael Brodie (20 August2014). The dramatic ways in which modern information and communicationstechnology has improved opportunities for protecting and improving the public'shealth has meant equally dramatic changes in the roles of ICT teams. Finance andCommercial Director Michael Brodie looks at what some of these changes mean forstaff on the frontline.Cancer registration as good as the world’s best by Jem Rashbass (26 August 2014)With this week’s release of new cancer survival and staging data we’ve shown thatour cancer registration service is world-class. Director for Disease Registration JemRashbass explains how we got to this point and why it matters. 6Campaigns newsNew Change4Life smart tools launched, ready for the new school yearGet Going and Smart Restart join the collection of year-round Change4Life smarttools to help families make a healthy change. Get Going provides families withpersonal activity plans to learn what sort of activity is right for them. Smart Restart isanother fun way for parents to create healthier eating and activity habits for theirchildren.PHE annual flu campaign launches in OctoberThe PHE flu campaign will launch on 6 October 2014, running for four weeks andconsisting of press, radio and online search supported by PR. 
The campaign willtarget under 65s with long-term conditions, pregnant women and parents of childrenaged two, three and four. It will be based on a simple message encouraging peoplenot to put off getting their flu vaccination. Printed leaflets are available now from theDepartment of Health orderline and a toolkit for local use will be available in the firstweek in September.New Be Clear on Cancer: “Blood in Pee” campaign briefingA briefing to help local teams prepare for the autumn rerun of PHE’s national BeClear on Cancer “Blood in Pee” campaign has been produced. It includesinformation about the background, aims, and likely impact of the campaign whichstarts on 13 October 2014. For more information see the Cancer Research UKwebsite.Break the habit toolkit for Stoptober 2014PHE’s marketing team at partnerships@phe.gov.uk can provide a new free ‘Breakthe Habit’ workplace toolkit. It is intended to provide employers with the guidance tobring Stoptober to life in the workplace and employees with the means by which tostay motivated throughout the 28-day journey.New NHS Health Checks marketing resourceA new marketing toolkit is available to provide support for all those promoting andimplementing the NHS Health Check programme locally. It includes NHS HealthCheck identity guidelines, templates for press advertising, and other marketingadvice and materials. The resource is on the NHS Health Check website. 7News from other organisationsNHS England’s first Annual Report publishedNHS England has published its first Annual Report and Accounts setting out itsachievements over the last year and its aspiration for 2014-2015. The report featureskey milestones since its creation in April 2013, the annual accounts and a Directors’report.Queen’s Birthday Honours nominationsThe Department of Health is looking for honour nominations of worthy candidateswho have made an outstanding contribution to the health and care system for theQueen’s Birthday 2015 honours round. 
Deadline for submitting nominations is 12September 2014. Nomination forms and guidance are available here.National review of choices in end of life care consultationThe National Council for Palliative Care is running a public consultation on theGovernment’s review of choices in end of life care. The information gathered willoutline the kinds of choices that people would like to be able to make at the end oflife and information about the funding, systems and processes that would be neededto enable choices to be acted upon. The review focuses on end of life care for adultsaged 18 and over, and within the current legal framework. The consultation closeson 30 September 2014.Department of Health factsheet on transfer of children’s public health servicesto local authorities.The Department of Health has published a factsheet on planning and paying forpublic health services for 0 to 5 year olds which will transfer from the NHS to localauthorities in October 2015. 8Events newsPHE annual conference 2014PHE’s second annual conference will be held on 16 to 17 September 2014 atWarwick University and will include keynote addresses by Jeremy Hunt, Secretary ofState for Health, Una O'Brien, Permanent Secretary at the Department of Health andSimon Stevens, chief executive of NHS England. The full programme for theconference and booking details can be found on the conference website.PHE Board’s next open meetingThe eighth open meeting of PHE’s Board will take place on 24 September 2014,from 11am to 3pm, in London. The meeting will include a panel discussion onantimicrobial resistance with external experts. Meeting details, including boardpapers for earlier meetings and information on future board meetings, are availableonline.Regional workshops on the transfer of children's public health commissioningOn 1 October 2015 the responsibility for commissioning public health services for 0-5year olds will transfer from NHS England to local authorities. 
PHE is organising aseries of regional workshops designed to explore the data, information andsupporting IT aspects for those working in the NHS and local authorities on thetransfer of commissioning responsibilities. Workshops will be held in Birmingham,Bristol, Leeds and London in September and October 2014. For more details aboutthe workshops including how to register visit the PHE events website. 9 |
We can then create a per document per term table known as a Document Term Matrix (DTM). First we can count the terms per document.
## Count each term within each document
corp_dtm <- corptd %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(files, word, sort = TRUE)
Joining, by = "word"
corp_dtm
Next we can create the DTM.
corp_dtm
<<DocumentTermMatrix (documents: 40, terms: 11339)>>
Non-/sparse entries: 30830/422730
Sparsity : 93%
Maximal term length: 41
Weighting : term frequency (tf)
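The call that turns the counts into the matrix is not shown above. One way to do it, assuming the counts have one row per document-term pair, is tidytext's cast_dtm(), which takes the document column, the term column and the count column. The sketch below uses made-up counts rather than the bulletin data.

```r
library(dplyr)
library(tidytext)

# Toy per-document word counts in the same shape as count(files, word)
counts <- tibble(
  files = c("a.pdf", "a.pdf", "b.pdf", "b.pdf"),
  word  = c("ebola", "cancer", "cancer", "vaccine"),
  n     = c(3, 1, 2, 4)
)

# cast_dtm(data, document, term, value) returns a tm DocumentTermMatrix
toy_dtm <- counts %>%
  cast_dtm(files, word, n)

toy_dtm
```

With the real data the equivalent call would be cast_dtm(files, word, n) on the count table built in the previous step.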
The next step is topic modelling, a form of cluster analysis, using the Latent Dirichlet Allocation (LDA) algorithm. We’ll start with 10 topics, chosen arbitrarily.
corp_lda
A LDA_VEM topic model with 10 topics.
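The fitting call itself is not shown. A minimal sketch, assuming the topicmodels package was used (its LDA() function fits by VEM by default, which matches the printed model class), looks like this on a small made-up matrix. The argument k sets the number of topics; the seed is only there to make the run repeatable.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy counts for four documents (invented for illustration)
counts <- tibble(
  files = rep(c("a.pdf", "b.pdf", "c.pdf", "d.pdf"), each = 3),
  word  = c("ebola", "outbreak", "virus",
            "cancer", "survival", "staging",
            "ebola", "virus", "cases",
            "cancer", "staging", "registry"),
  n     = c(5, 3, 4, 6, 2, 3, 4, 4, 2, 5, 3, 2)
)

toy_dtm <- cast_dtm(counts, files, word, n)

# Fit a two-topic model; k is the number of topics
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# One row per topic-term pair, with the per-topic word probability in beta
toy_beta <- tidy(toy_lda, matrix = "beta")
toy_beta
```

On the bulletin data the same call would take corp_dtm and a larger k.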
Tidytext gives us the option of returning to a tidy analysis, using the tidy and augment verbs borrowed from the broom package. In particular, we start with the tidy verb.
## Tidy the model and drop the year tokens
corp_lda_tidy <- tidy(corp_lda) %>%
  filter(!term %in% c("2013", "2014"))
The term beta is the probability of a given word being generated for that topic. We evaluate the most common terms per topic.
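As a sketch of that last step, the snippet below picks the two highest-beta terms in each topic from a made-up table shaped like the tidied model output (columns topic, term and beta). The values are invented and are not results from the bulletins.

```r
library(dplyr)

# Toy stand-in for the tidied model: one row per topic-term pair
toy_beta <- tibble(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("ebola", "virus", "health", "cancer", "staging", "health"),
  beta  = c(0.5, 0.3, 0.2, 0.6, 0.3, 0.1)
)

top_terms <- toy_beta %>%
  group_by(topic) %>%
  top_n(2, beta) %>%            # keep the two largest beta values per topic
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```

The same pipeline applied to corp_lda_tidy, with a larger n, gives a table that can be plotted as one bar chart per topic.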