Manually take steps to filter the news results so they show only articles relating to the topic of COVID-19.
Question to ask: does the page I’m viewing have all the information I need, or will I need to take more steps to get all the data I want?
Question to ask: what happens to the URL(s), and how will that inform my web-scraping algorithm?
Answer: developing a logical approach to navigating
Selecting the “Topic” drop-down and filtering the topic to “COVID-19” reduces the news results to only those relating to COVID-19.
Doing this changes the URL, so we’ll need to use the new URL in our web-scraping.
All articles relating to COVID-19 are spread over multiple results pages, so to get them all we’ll need to loop over every page, changing the URL ending “&page=…”.
NOTE: some web pages are dynamically generated using JavaScript, and the URL does not update on user interaction. These pages require extra tools to web-scrape, such as the Selenium package (see the sketch below).
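For illustration, a minimal sketch of pairing {rvest} with the RSelenium package, assuming a hypothetical dynamic page and assuming rsDriver() can set up a Selenium server/driver on your machine (setup varies by browser):

library(RSelenium)
library(rvest)

# Start a Selenium server and browser session
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remote <- driver$client

# Navigate to a (hypothetical) JavaScript-generated page
remote$navigate("https://www.example.com/dynamic-page")

# Once the JavaScript has rendered, hand the generated HTML over to {rvest}
page <- read_html(remote$getPageSource()[[1]])
html_elements(page, "time")

# Tidy up the browser and server
remote$close()
driver$server$stop()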
HTML basics
HTML code is made up of different types of elements.
These elements can be nested: in the snippet below, the title, paragraph and hyperlink are nested in the body of the HTML document.
Elements can have attributes; for example, the hyperlink has an href attribute, which is the URL it points to.
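A minimal illustrative snippet (assumed for illustration, showing the element types discussed here):

<html>
  <body>
    <h1>The page title</h1>
    <p>A paragraph of text.</p>
    <div class="results">
      <a href="https://www.gov.uk">A hyperlink</a>
    </div>
  </body>
</html>

The heading, paragraph and hyperlink all sit inside <body>; the <a> element carries an href attribute and the <div> carries a class attribute, both of which come up below.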
HTML Elements and Attributes
There are many types of elements with different functions and uses in HTML. The elements in the snippet above are ones you’ll see in the worked example.
Attributes are properties of the element, and some are specific to the element they relate to. Another important attribute is the class attribute: it is often used to point to a class name in the webpage’s style sheet.
Worked example: inspect webpage and identify HTML elements
Find the HTML elements that contain the content we want (in most browsers, right-click the content and choose “Inspect” to view the page source).
Make a note of the element names and any unique attributes.
Answer: inspect webpage and identify HTML elements
[“…” indicates removed code, for clarity]
In this case the article dates sit in <time> elements, and the article links are <a> elements (each with an href attribute) nested inside the <div> with the unique class “finder-results js-finder-results”; these are the names used in the code that follows.
The {rvest} package
#install.packages("rvest")library(rvest)url<-"https://www.gov.uk/search/news-and-communications"#Simulating a session in a html browsersimulated_session <-session(url)#Extracting the elements from a websiteextracted_element<-html_elements(simulated_session, 'element.name') #Extracts text inside an elementhtml_text(extracted_element) #Extracts the specified attribute attached to the element/elementshtml_attr(extracted_element, 'attribute.name')#Example - extracts hyperlinks or all attributeshtml_attr(extracted.element, 'href') #html_attr() gets a single attribute; html_attrs(extracted.element) #gets all attributes.
Worked example: write and execute the web-scraping code 1
To pull the article dates from just the first page of news articles
library(rvest)

# This is the gov.uk news page url for the first page, filtered to the COVID-19 topic only
url <- "https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest"

# Simulating an html session in R
url_session <- session(url)

## Use "time" to get all <time> elements (if the page had other <time> elements, we would
## first select a uniquely classed container element, as in the next example)
News_times <- url_session |>
  html_elements("time") |>  # give me all <time> elements in the page source code
  html_text()               # give me the text displayed on the webpage for these elements
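As a quick sanity check, News_times should be a character vector of display dates (the date handling in the appendix implies strings like "5 March 2021"):

head(News_times)  # e.g. "5 March 2021" "4 March 2021" ... (format inferred from the appendix code)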
Worked example: write and execute the web-scraping code 2
To pull the article URLs, for further web-scraping
library(rvest)

# This is the gov.uk news page url for the first page, filtered to the COVID-19 topic only
url <- "https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest"

# Simulating an html session in R
url_session <- session(url)

## More complex example - what if we wanted the article urls to investigate themes
## in our further analysis?
News_urls <- url_session |>
  html_element('div[class="finder-results js-finder-results"]') |>  # take the <div> element with this class
  html_elements("a") |>   # take all the hyperlink <a> elements in the element stated above
  html_attr("href") |>    # give me the href attributes from those hyperlinks (NOT the text displayed)
  data.frame()
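A likely follow-up step, assuming the hrefs come back as relative paths such as "/government/news/…" (check your output first): prepend the domain before scraping the articles themselves.

# Build absolute URLs from relative hrefs (assumes the paths are relative)
Full_urls <- paste0("https://www.gov.uk", News_urls[[1]])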
Worked example: write and execute the web-scraping code 3
To loop over the search results pages to get all the article dates
library(rvest)

Total_number_of_pages <- 3

All_dates <- c()  # create an empty vector to add dates to in the loop

for (i in 1:Total_number_of_pages) {
  # This url is updated with the page number i while looping
  url <- paste0("https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest&page=", i)

  # Simulating an html session in R
  url_session <- session(url)

  ## Use "time" to get all <time> elements on the page
  News_times <- url_session |>
    html_elements("time") |>  # give me all <time> elements in the page source code
    html_text()               # the text displayed on the webpage for these elements

  # Add this page's dates to the vector of all dates
  All_dates <- append(All_dates, News_times)
}
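When looping over many pages it’s good practice to pause between requests so you don’t overload the server, e.g.:

for (i in 1:Total_number_of_pages) {
  Sys.sleep(1)  # be polite: wait one second between page requests
  # ... rest of the loop body as above
}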
Final thoughts
Always take the time to assess whether a project is suitable for web-scraping (sometimes it isn’t possible, or you can save effort by requesting the data!)
Take the time to understand the HTML and look for unique attributes (there may be more than one way to scrape what you need!)
The process is trial-and-error: always check your outputs along the way, as you may be pulling data you don’t expect.
Appendix: COVID-19 articles over time
library(rvest)

Total_number_of_pages <- 119

All_dates <- c()  # create an empty vector to add dates to in the loop

for (i in 1:Total_number_of_pages) {
  print(i)  # track progress through the pages

  # This url is updated with the page number i while looping
  url <- paste0("https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest&page=", i)

  # Simulating an html session in R
  url_session <- session(url)

  ## Use "time" to get all <time> elements on the page
  News_times <- url_session |>
    html_elements("time") |>
    html_text()

  # Add this page's dates to the vector of all dates
  All_dates <- append(All_dates, News_times)
}

# Convert display dates such as "5 March 2021" into "5/3/2021"
All_dates <- gsub(" January ", "/1/", All_dates)
All_dates <- gsub(" February ", "/2/", All_dates)
All_dates <- gsub(" March ", "/3/", All_dates)
All_dates <- gsub(" April ", "/4/", All_dates)
All_dates <- gsub(" May ", "/5/", All_dates)
All_dates <- gsub(" June ", "/6/", All_dates)
All_dates <- gsub(" July ", "/7/", All_dates)
All_dates <- gsub(" August ", "/8/", All_dates)
All_dates <- gsub(" September ", "/9/", All_dates)
All_dates <- gsub(" October ", "/10/", All_dates)
All_dates <- gsub(" November ", "/11/", All_dates)
All_dates <- gsub(" December ", "/12/", All_dates)

# Parse into Date objects and plot the number of articles per month
All_dates <- as.Date(All_dates, "%d/%m/%Y")
hist(All_dates, breaks = "months", xlab = "Articles relating to COVID-19 (per month)")
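An equivalent shortcut, assuming an English system locale: as.Date() can parse full month names directly via "%B", replacing the twelve gsub() calls.

All_dates <- as.Date(All_dates, format = "%d %B %Y")  # parses "5 March 2021" directly in an English locale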