Lab 5: Reddit Data and Web Scraping

Introduction

Today we are going to shift gears slightly to consider alternative sources of textual data of political relevance. We will focus on two alternative sources of data – Reddit and Google News.

What is Reddit?

Reddit is a social news aggregation and discussion website founded in 2005.

Today, the platform boasts the following statistics:

  • 7th most visited site in the US, 29th most visited in the world
  • 8th most used social media platform in the US
  • Eleven million posts are submitted to Reddit each month, over 2 billion comments per year
  • Reddit's user base skews young: 18-29-year-olds make up 64% of users and 30-49-year-olds make up another 29%
  • Roughly 1 in 4 people between the ages of 25 and 29 use Reddit actively
  • 222 million users live within the US, roughly half of the Reddit community
  • 430 million monthly active users, 52 million daily active users
  • 42% of users get news from Reddit

Grabbing information from a subReddit

One of the nice things about Reddit is that they allow for non-authenticated calls to their API by simply adding “.json” to the appropriate part of a given URL. To get started let’s look at the main politics subreddit and then compare to the same URL but with “.json” appended. If we want to read this information into R we can do the following:

library(jsonlite)
library(plyr)      # load plyr first so it does not mask dplyr's verbs
library(tidyverse) # attaches dplyr, among other packages
library(dplyr)
library(httr)
library(xml2)
library(rvest)
library(RSelenium)
library(knitr)
url <- "https://www.reddit.com/r/politics/"
pol <- fromJSON(paste0(url,".json"))
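Unauthenticated requests are sometimes refused or rate-limited, so it can help to send the request yourself and check that it succeeded before parsing. The following is a minimal sketch rather than a required step: the helper names `reddit_json_url()` and `fetch_reddit()` and the user-agent string are placeholders invented for illustration.

```r
# Hypothetical helper: build the ".json" URL for a subreddit's front page.
reddit_json_url <- function(subreddit) {
  paste0("https://www.reddit.com/r/", subreddit, "/.json")
}

# Hypothetical helper: request the page with an explicit user agent and
# stop with an informative error if the server does not return success.
fetch_reddit <- function(subreddit) {
  resp <- httr::GET(
    reddit_json_url(subreddit),
    httr::user_agent("lab5-reddit-example (placeholder)")
  )
  httr::stop_for_status(resp)
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Building the URL needs no network access; fetch_reddit("politics") does.
politics_url <- reddit_json_url("politics")
politics_url
```

Calling `fetch_reddit("politics")` should return the same nested list as `fromJSON()` above, provided the request is accepted.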

Easy peasy. To get at the information relevant to our purposes, we need to dig into the resulting list a little: the post-level fields live in the data frame nested under data and children.

out <- pol$data$children$data
as_tibble(out)
## # A tibble: 27 x 109
##    approved_at_utc subreddit selftext     author_fullname saved mod_reason_title
##    <lgl>           <chr>     <chr>        <chr>           <lgl> <lgl>           
##  1 NA              politics  "Welcome to~ t2_d5h4t        FALSE NA              
##  2 NA              politics  "**Reportin~ t2_onl9u        FALSE NA              
##  3 NA              politics  ""           t2_4ot9563      FALSE NA              
##  4 NA              politics  ""           t2_deaxrt       FALSE NA              
##  5 NA              politics  ""           t2_gdfin        FALSE NA              
##  6 NA              politics  ""           t2_d1sxkm8i     FALSE NA              
##  7 NA              politics  ""           t2_tyyiguo      FALSE NA              
##  8 NA              politics  ""           t2_ba0snhoh     FALSE NA              
##  9 NA              politics  ""           t2_51aar45o     FALSE NA              
## 10 NA              politics  ""           t2_2eef87e      FALSE NA              
## # i 17 more rows
## # i 103 more variables: gilded <int>, clicked <lgl>, title <chr>,
## #   link_flair_richtext <list>, subreddit_name_prefixed <chr>, hidden <lgl>,
## #   pwls <int>, link_flair_css_class <lgl>, downs <int>,
## #   thumbnail_height <int>, top_awarded_type <lgl>, hide_score <lgl>,
## #   name <chr>, quarantine <lgl>, link_flair_text_color <chr>,
## #   upvote_ratio <dbl>, author_flair_background_color <chr>, ...
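With more than a hundred columns, it is often easier to keep only the handful you need. The sketch below uses a small mock data frame in place of `out` (the values are invented) and assumes the standard post fields `title`, `author`, `score`, `num_comments` and `created_utc`; the last of these is stored as seconds since the Unix epoch.

```r
# Mock rows standing in for `out`; the values are invented for illustration.
out_demo <- data.frame(
  title        = c("First post", "Second post"),
  author       = c("user_a", "user_b"),
  score        = c(120L, 45L),
  num_comments = c(30L, 7L),
  created_utc  = c(1685404800, 1685491200),
  subreddit    = c("politics", "politics"),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest.
keep  <- c("title", "author", "score", "num_comments", "created_utc")
posts <- out_demo[, keep]

# created_utc counts seconds since 1970-01-01 UTC, so convert it to a date-time.
posts$created <- as.POSIXct(posts$created_utc, origin = "1970-01-01", tz = "UTC")
posts
```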

There are a few fields which may be of interest to us here. The first is the “title” field:

out$title
##  [1] "The \"What happened in your state last week?\" Megathread, Week 21"                                                                                                         
##  [2] "Discussion Thread: Ongoing Debt Ceiling and Budget Negotiations"                                                                                                            
##  [3] "AOC threatens to leave Twitter after Elon Musk promotes ‘disgusting’ account impersonating her"                                                                             
##  [4] "Ron DeSantis wants to \"make America Florida\": That's a dire threat"                                                                                                       
##  [5] "Ron DeSantis says he will ‘destroy leftism’ in US if elected president"                                                                                                     
##  [6] "The Press is Falling for Anti-Abortion “Fetal Heartbeat” Propaganda"                                                                                                        
##  [7] "Trump’s Lawyers Start to Wonder if One Could Be a Snitch"                                                                                                                   
##  [8] "The Right’s War on Brands Is Stupid and Terrifying: The anti-LGBTQ attacks of Bud Light and Target are no mere boycotts—the aim is to intimidate companies into submission."
##  [9] "Trump pledges to end birthright citizenship on first day in office"                                                                                                         
## [10] "Matt Gaetz joins growing rebellion over McCarthy's debt ceiling deal"                                                                                                       
## [11] "Trump campaign uses Pete Buttigieg’s picture to mock veterans over Memorial Day weekend"                                                                                    
## [12] "Jen Psaki: Despite being a laughingstock, don’t ignore what DeSantis is actually saying"                                                                                    
## [13] "Trump escalates attacks on judges amid increasing legal scrutiny"                                                                                                           
## [14] "With Gov Walz’s signature, Minnesota becomes 23rd state to legalize Marijuana"                                                                                              
## [15] "Don’t rule out a national abortion ban in 2025"                                                                                                                             
## [16] "DeSantis Hit With FEC Complaint Over 'Brazen' Violation of Campaign Finance Law"                                                                                            
## [17] "Oregon voters could decide on ranked-choice voting soon"                                                                                                                    
## [18] "Trump Team Wants to Get Rid of Feds Investigating Him"                                                                                                                      
## [19] "Trump lawyer said to have been waved off searching office for secret records"                                                                                               
## [20] "First Republican publicly supports ousting McCarthy as Speaker"                                                                                                             
## [21] "Far-right members, unhappy with debt deal, float threatening McCarthy's speakership"                                                                                        
## [22] "Uganda’s president signs horrific “Kill the Gays” law. The Biden administration is reevaluating aid."                                                                       
## [23] "Dianne Feinstein once became confused after seeing Vice President Kamala Harris preside over the Senate, report says"                                                       
## [24] "A Republican congressman says DeSantis threatened to primary some House reps if they didn't endorse him over Trump"                                                         
## [25] "Report: Threats against abortion providers have spiked, especially in states like Oregon, Washington"                                                                       
## [26] "‘Numbers Nobody Has Ever Seen’: How the GOP Lost Wisconsin"                                                                                                                 
## [27] "Is Student Debt Forgiveness Happening, or What?"

This all looks great, but there are only a few posts! What if we wanted to grab more of them? The trick is to use the older version of Reddit (old.reddit.com), which splits listings into pages rather than using the “never ending reddit” feature. The technique is exactly the same:

url <- "https://old.reddit.com/r/politics/"
pol_old <- fromJSON(paste0(url,".json"))
out_old <- pol_old$data$children$data
out_old$title
##  [1] "The \"What happened in your state last week?\" Megathread, Week 21"                                                                                                         
##  [2] "Discussion Thread: Ongoing Debt Ceiling and Budget Negotiations"                                                                                                            
##  [3] "AOC threatens to leave Twitter after Elon Musk promotes ‘disgusting’ account impersonating her"                                                                             
##  [4] "Ron DeSantis wants to \"make America Florida\": That's a dire threat"                                                                                                       
##  [5] "Ron DeSantis says he will ‘destroy leftism’ in US if elected president"                                                                                                     
##  [6] "The Press is Falling for Anti-Abortion “Fetal Heartbeat” Propaganda"                                                                                                        
##  [7] "Trump’s Lawyers Start to Wonder if One Could Be a Snitch"                                                                                                                   
##  [8] "The Right’s War on Brands Is Stupid and Terrifying: The anti-LGBTQ attacks of Bud Light and Target are no mere boycotts—the aim is to intimidate companies into submission."
##  [9] "Trump pledges to end birthright citizenship on first day in office"                                                                                                         
## [10] "Matt Gaetz joins growing rebellion over McCarthy's debt ceiling deal"                                                                                                       
## [11] "Trump campaign uses Pete Buttigieg’s picture to mock veterans over Memorial Day weekend"                                                                                    
## [12] "Jen Psaki: Despite being a laughingstock, don’t ignore what DeSantis is actually saying"                                                                                    
## [13] "Trump escalates attacks on judges amid increasing legal scrutiny"                                                                                                           
## [14] "With Gov Walz’s signature, Minnesota becomes 23rd state to legalize Marijuana"                                                                                              
## [15] "Don’t rule out a national abortion ban in 2025"                                                                                                                             
## [16] "DeSantis Hit With FEC Complaint Over 'Brazen' Violation of Campaign Finance Law"                                                                                            
## [17] "Oregon voters could decide on ranked-choice voting soon"                                                                                                                    
## [18] "Trump Team Wants to Get Rid of Feds Investigating Him"                                                                                                                      
## [19] "Trump lawyer said to have been waved off searching office for secret records"                                                                                               
## [20] "First Republican publicly supports ousting McCarthy as Speaker"                                                                                                             
## [21] "Far-right members, unhappy with debt deal, float threatening McCarthy's speakership"                                                                                        
## [22] "Uganda’s president signs horrific “Kill the Gays” law. The Biden administration is reevaluating aid."                                                                       
## [23] "Dianne Feinstein once became confused after seeing Vice President Kamala Harris preside over the Senate, report says"                                                       
## [24] "A Republican congressman says DeSantis threatened to primary some House reps if they didn't endorse him over Trump"                                                         
## [25] "Report: Threats against abortion providers have spiked, especially in states like Oregon, Washington"                                                                       
## [26] "‘Numbers Nobody Has Ever Seen’: How the GOP Lost Wisconsin"                                                                                                                 
## [27] "Is Student Debt Forgiveness Happening, or What?"

To make this actionable, we need to make a few modifications after studying how the links work. After clicking “next”, the URL I get takes the form below, with “.json” inserted in the appropriate place; its after parameter carries the name of the last post on the current page:

out_old$name
##  [1] "t3_13v4ga6" "t3_13vx2bi" "t3_13vycwl" "t3_13voghb" "t3_13vv2qt"
##  [6] "t3_13vm76t" "t3_13vsrdn" "t3_13vu0nz" "t3_13vwilw" "t3_13vp9xf"
## [11] "t3_13vulnw" "t3_13vxtpg" "t3_13vpmq7" "t3_13vwauy" "t3_13vs04d"
## [16] "t3_13vwom0" "t3_13vpb5s" "t3_13vqa7b" "t3_13vosy2" "t3_13vwtan"
## [21] "t3_13vu9z1" "t3_13vwso3" "t3_13vq1bt" "t3_13vxi88" "t3_13vr8s0"
## [26] "t3_13vvzq1" "t3_13vqcnn"
next_page <- paste0("https://old.reddit.com/r/politics/.json?count=25&after=",out_old$name[nrow(out_old)])
pol_next <- fromJSON(next_page)
out_next <- pol_next$data$children$data
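As an aside, each listing also reports an `after` token of its own, which can be used instead of looking up the last row. The sketch below parses a tiny mock listing (the IDs are made up) to show both routes; it assumes the response has the usual `data$after` and `data$children` layout.

```r
library(jsonlite)

# A tiny mock of a Reddit listing; the IDs are made up for illustration.
listing_json <- '{
  "kind": "Listing",
  "data": {
    "after": "t3_bbb",
    "children": [
      {"kind": "t3", "data": {"name": "t3_aaa", "title": "First"}},
      {"kind": "t3", "data": {"name": "t3_bbb", "title": "Second"}}
    ]
  }
}'
listing <- fromJSON(listing_json)

# Route 1: take the name of the last post on the page.
posts     <- listing$data$children$data
last_name <- posts$name[nrow(posts)]

# Route 2: read the token that the listing itself reports.
after_token <- listing$data$after

next_page <- paste0("https://old.reddit.com/r/politics/.json?after=", after_token)
next_page
```

In this mock the two routes agree; on a real page it is worth checking that `after` is not `NULL`, which would signal that there are no further pages.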

After playing around a little bit, the following works for grabbing multiple pages:

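moar_reddit <- function(subreddit,n_pages=1){
  
  base_url <- paste0("https://old.reddit.com/r/",subreddit,"/.json")
  
  out <- list()
  out[[1]] <- fromJSON(base_url,flatten=TRUE)$data$children
  
  # seq_len() gives an empty loop when n_pages is 1 (2:n_pages would count down)
  for(i in seq_len(n_pages-1)+1){
    
    Sys.sleep(1) # pause between requests
    
    # the name of the last post on the previous page marks where to resume
    tmp <- out[[i-1]]
    last_id <- tmp[nrow(tmp),"data.name"]
    
    out[[i]] <- fromJSON(paste0(base_url,"?after=",last_id),flatten = TRUE)$data$children
  }
  
  # rbind.fill() (from plyr) stacks the pages even when their columns differ
  do.call(plyr::rbind.fill,out)
  
}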

Let’s take it for a spin!

some_more_reddit <- moar_reddit("politics",5)
as_tibble(some_more_reddit)
## # A tibble: 127 x 111
##    kind  data.approved_at_utc data.subreddit data.selftext  data.author_fullname
##    <chr> <lgl>                <chr>          <chr>          <chr>               
##  1 t3    NA                   politics       "Welcome to t~ t2_d5h4t            
##  2 t3    NA                   politics       "**Reporting:~ t2_onl9u            
##  3 t3    NA                   politics       ""             t2_4ot9563          
##  4 t3    NA                   politics       ""             t2_deaxrt           
##  5 t3    NA                   politics       ""             t2_gdfin            
##  6 t3    NA                   politics       ""             t2_d1sxkm8i         
##  7 t3    NA                   politics       ""             t2_tyyiguo          
##  8 t3    NA                   politics       ""             t2_ba0snhoh         
##  9 t3    NA                   politics       ""             t2_51aar45o         
## 10 t3    NA                   politics       ""             t2_2eef87e          
## # i 117 more rows
## # i 106 more variables: data.saved <lgl>, data.mod_reason_title <lgl>,
## #   data.gilded <int>, data.clicked <lgl>, data.title <chr>,
## #   data.link_flair_richtext <list>, data.subreddit_name_prefixed <chr>,
## #   data.hidden <lgl>, data.pwls <int>, data.link_flair_css_class <lgl>,
## #   data.downs <int>, data.thumbnail_height <int>, data.top_awarded_type <lgl>,
## #   data.hide_score <lgl>, data.name <chr>, data.quarantine <lgl>, ...

Excellent! What sort of useful text can we grab from this data? Most immediately, we might grab the titles of the posts:

head(some_more_reddit$data.title)
## [1] "The \"What happened in your state last week?\" Megathread, Week 21"                            
## [2] "Discussion Thread: Ongoing Debt Ceiling and Budget Negotiations"                               
## [3] "AOC threatens to leave Twitter after Elon Musk promotes ‘disgusting’ account impersonating her"
## [4] "Ron DeSantis wants to \"make America Florida\": That's a dire threat"                          
## [5] "Ron DeSantis says he will ‘destroy leftism’ in US if elected president"                        
## [6] "The Press is Falling for Anti-Abortion “Fetal Heartbeat” Propaganda"
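As a first pass at preparing titles for text analysis, a few base R string operations go a long way. This is only a sketch: `clean_title()` is a hypothetical helper, and the exact cleaning steps should follow whatever you did for Tweets.

```r
# Hypothetical helper: lowercase, strip punctuation, and tidy whitespace.
clean_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)  # replace runs of punctuation with a space
  x <- gsub("\\s+", " ", x)          # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_title("Ron DeSantis wants to \"make America Florida\": That's a dire threat")
cleaned
```

Note that replacing punctuation with spaces splits contractions ("that's" becomes "that s"); removing apostrophes first is one alternative.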

It would be straightforward to clean these up and process them similarly to how we treated Tweets. We might also be interested in the comments each post generated. To do this, we can use the following field:

head(some_more_reddit$data.permalink)
## [1] "/r/politics/comments/13v4ga6/the_what_happened_in_your_state_last_week/"        
## [2] "/r/politics/comments/13vx2bi/discussion_thread_ongoing_debt_ceiling_and_budget/"
## [3] "/r/politics/comments/13vycwl/aoc_threatens_to_leave_twitter_after_elon_musk/"   
## [4] "/r/politics/comments/13voghb/ron_desantis_wants_to_make_america_florida_thats/" 
## [5] "/r/politics/comments/13vv2qt/ron_desantis_says_he_will_destroy_leftism_in_us/"  
## [6] "/r/politics/comments/13vm76t/the_press_is_falling_for_antiabortion_fetal/"

Remembering that we can append “.json” to just about any Reddit URL to get what we want, let’s do the following.

expand_comments <- function(reddit_data){
  
  base <- "https://old.reddit.com"
  links <- paste0(base,reddit_data$data.permalink,".json")
  
  out <- list()
  
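  for(i in seq_along(links)){ # seq_along() is safe even when there are no links
    
    Sys.sleep(1)
    
    # the response holds two listings, the post and its comments; keep the comments
    out[[i]] <- fromJSON(links[i],flatten = TRUE)$data.children[[2]]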
    
  }
  
  do.call("rbind.fill",out)
}

Let’s take it for a spin on a subset of the data!

some_comments <- expand_comments(some_more_reddit[5,])

The most interesting textual information is contained within the “body” field:

head(some_comments$data.body)
## [1] "\nAs a reminder, this subreddit [is for civil discussion.](/r/politics/wiki/index#wiki_be_civil)\n\nIn general, be courteous to others. Debate/discuss/argue the merits of ideas, don't attack people. Personal insults, shill or troll accusations, hate speech, any suggestion or support of harm, violence, or death, and other rule violations can result in a permanent ban. \n\nIf you see comments in violation of our rules, please report them.\n\n For those who have questions regarding any media outlets being posted on this subreddit, please click [here](https://www.reddit.com/r/politics/wiki/approveddomainslist) to review our details as to our approved domains list and outlet criteria.\n \n\n***\n\n\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/politics) if you have any questions or concerns.*"
## [2] "Is the leftism in the room with us right now?"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [3] "Ah yes, openly admitting to the destruction of democracy. Also known as a dictatorship.  \n\nVote wisely America. This could be your last chance
## [4] "Leftism to DeSantis is putting avocado on a BLT. It’s basically anything outside of his world view
## [5] "They hate us so much for caring about people and the environment. They hate the inclusivity. They hate the logic and reason. They hate the empathy. They want it gone and replaced with authoritarian white nationalism. They are the unapologetic past. They are the unapologetic slavery, the genocide, the segregation, the internment camps. They are the ugliness that doesn’t care, but also doesn’t want the children to know what they did. Every person voting Republican at this point is a traitor to the United States voting for its downfall."                                                                                                                                                                                                                                                                                                                                                          
## [6] "A kind of \"final solution\", if you will.  Hmmm..."

Again, we are in a position where we would easily be able to process this text!
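Before processing, you will probably want to drop automated moderator comments like the first one above and tidy the whitespace. The sketch below uses a mock data frame with invented comments and assumes the flattened comment data has a `data.author` column alongside `data.body`.

```r
# Mock comments; authors and text are invented for illustration.
comments_demo <- data.frame(
  data.author = c("AutoModerator", "user_a", "user_b"),
  data.body   = c("*I am a bot, and this action was performed automatically.*",
                  "A short   reply with  extra spaces.",
                  "First paragraph.\n\nSecond paragraph."),
  stringsAsFactors = FALSE
)

# Drop the bot's comments, then collapse line breaks and repeated spaces.
human <- comments_demo[comments_demo$data.author != "AutoModerator", ]
human$text <- trimws(gsub("\\s+", " ", human$data.body))
human$text
```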

Searching Reddit

Besides scraping subReddits, we can also make use of Reddit’s “search” capabilities. Let’s start by analyzing the results from a search for, say, Mitch McConnell. To get a JSON file, we just need to insert “.json” into the appropriate place. After navigating to the next page, we can see that the URL pattern is similar to the above and so we might write the following function:

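search_reddit <- function(search,n_pages){
  
  # named endpoint (not url) so the string is not confused with the url() function
  endpoint <- "https://old.reddit.com/search.json?"
  query <- gsub(" ","+",search)
  base <- paste0(endpoint,"q=",tolower(query))
  
  out <- list()
  out[[1]] <- fromJSON(url(base),flatten=TRUE)$data$children
  last_id <- out[[1]][nrow(out[[1]])-3,"data.name"]
  
  # seq_len() gives an empty loop when n_pages is 1 (2:n_pages would count down)
  for(i in seq_len(n_pages-1)+1){
    
    Sys.sleep(1) # pause between requests
    
    out[[i]] <- fromJSON(url(paste0(base,"&after=",last_id)),flatten = TRUE)$data$children
    
    last_id <- out[[i]][nrow(out[[i]]),"data.name"]
  }
  
  # rbind.fill() (from plyr) stacks the pages even when their columns differ
  do.call(plyr::rbind.fill,out)
  
}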
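Replacing spaces with “+” works for simple queries, but search terms containing characters such as “&” need proper URL encoding. One possible approach, sketched here with a hypothetical helper `build_search_url()`, is to use `URLencode()` from the utils package:

```r
# Hypothetical helper: build a search URL with a percent-encoded query.
build_search_url <- function(search, after = NULL) {
  query <- utils::URLencode(search, reserved = TRUE)
  out   <- paste0("https://old.reddit.com/search.json?q=", query)
  if (!is.null(after)) out <- paste0(out, "&after=", after)
  out
}

first_url <- build_search_url("Mitch McConnell")
next_url  <- build_search_url("AT&T lobbying", after = "t3_abc123")  # made-up ID
first_url
next_url
```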

Let’s take it for a spin:

search_out <- search_reddit("Mitch McConnell", 3)

From here we can do a few things. First, note that a number of subReddits are identified which are relevant to the search:

table(search_out$data.subreddit)
## 
##          AIVoiceMemes      AngryObservation              antimeme 
##                     1                     1                     1 
##           AskALiberal          AskOldPeople             AskReddit 
##                     1                     2                     2 
##    AustralianPolitics   bestconspiracymemes   BikiniBottomTwitter 
##                     1                     1                     1 
## conservativeterrorism         CringetopiaRM      DecodingTheGurus 
##                     1                     1                     1 
##               Destiny               economy                 esist 
##                     1                     1                     1 
##              facepalm               Fauxmoi       Fuckthealtright 
##                     1                     1                     1 
##           FunnyandSad                   gay       huntingtonbeach 
##                     1                     1                     2 
##     interestingasfuck             inthenews                kansas 
##                     1                     1                     1 
##              Kentucky     LeopardsAteMyFace             lexington 
##                     1                     1                     1 
##            Louisville                 memes                   nba 
##                     1                     2                     1 
##                  news       NewsOfTheStupid               NoRules 
##                     1                     1                     1 
##          philadelphia                  pics PoliticalCompassMemes 
##                     1                     2                     1 
##        PoliticalHumor              politics        PublicFreakout 
##                     1                     8                     1 
##     Qult_Headquarters     SaintMeghanMarkle       shittyaskreddit 
##                     1                     1                     1 
##     ShittyLifeProTips             Spiderman      TheBidenshitshow 
##                     1                     1                     1 
##     TheMajorityReport     therewasanattempt                tumblr 
##                     1                     2                     1 
##     UkrainianConflict               VoteDEM          washingtondc 
##                     1                     1                     1 
##            weedstocks                 Weird      whatisthisanimal 
##                     1                     1                     1 
##    WhitePeopleTwitter     WorkersStrikeBack            WorkReform 
##                     5                     1                     1 
##        WouldYouRather 
##                     1
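A table this long is easier to read when sorted by frequency. A minimal sketch on a made-up vector of subreddit names:

```r
# Made-up subreddit labels standing in for search_out$data.subreddit.
subs <- c("politics", "pics", "politics", "news", "pics", "politics")

# Sort the counts so the most common subreddits come first.
counts <- sort(table(subs), decreasing = TRUE)
head(counts, 2)

top_sub <- names(counts)[1]
top_sub
```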

We also have access to the same sort of permalinks as before which lead to post comments:

head(search_out$data.permalink)
## [1] "/r/pics/comments/12wjh4n/august_2_2022_jon_stewart_as_mitch_mcconnell/"                 
## [2] "/r/AskReddit/comments/13mzli7/who_do_you_believe_is_literally_evil/"                    
## [3] "/r/interestingasfuck/comments/13d69cg/this_tortoises_strange_reaction_to_the_darker/"   
## [4] "/r/FunnyandSad/comments/13ondoy/rich_old_white_nut_jobs_screwing_the_world/"            
## [5] "/r/LeopardsAteMyFace/comments/13fk5po/joe_manchin_sided_with_the_republicans_to_secure/"
## [6] "/r/PublicFreakout/comments/13ky6iw/victoria_secret_meltdown/"

With this sort of information, we can use the above functions to either collect additional posts from those subReddits or expand the comments for posts of interest!

Google News

Google News is a news aggregation service developed by Google and launched in 2002. Boasting over 500 million visits per month, the site aggregates content from more than 20,000 publishers and 50,000 news sources worldwide.

Although not “social media,” this sort of “traditional media” may be helpful for your research projects as it will allow you to systematically collect other (recent) information about your politician of choice. These two media realms also frequently interact, with traditional media citing social media and vice versa. For our purposes it also allows us to practice the skills we have been developing while giving a (soft) tutorial into web-scraping!

To get started, we need two new packages, xml2 and rvest. The first of these will allow us to work with XML files, which you can roughly think of as the older and more general cousin of JSON files. The second is a package of wrappers for the xml2 and httr packages which makes it incredibly easy to download and manipulate both HTML and XML. You will also need the SelectorGadget JavaScript bookmarklet, an interactive CSS selector finder, to get the most out of the tutorial!

Let’s start by taking a quick look at the anatomy of a Google News Query. There are three main components: the link to Google News, the query definition, and the language definition. So, as a first step towards building up to a workable scraping function, we can build up queries as follows:

google_news <- function(search_terms){
  base <- "https://news.google.com/search?q="
  query <- gsub(" ","%20",search_terms)
  lang <- "&hl=en-US&gl=US&ceid=US%3Aen"
  
  url <- paste0(base,query,lang)
  
  url
}


browseURL(google_news("ukraine war"))

browseURL(google_news("Chuck Schumer"))

Great! Next we need to think about how to grab content from these pages. To do this, let’s first grab the underlying HTML for the page. Abusing our (work in progress) function for a moment…

google_news("ukraine war") %>% 
  read_html() -> a_page

a_page
## {html_document}
## <html lang="en-US" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="yDmH0d" jscontroller="pjICDe" jsaction="rcuQ6b:npT2md; click:FA ...

This is more or less exactly what you get when you take a look at the appropriate page source. From here, we just need to know what to look for.

There are two broad approaches to figuring out what we need. For simple websites, SelectorGadget will give us exactly what we need. For more complex websites, we can inspect the source itself. For example, a search result which comes up for “Ukraine War” is the article “U.S. unveils $700 million military aid package for Ukraine; fighting rages in Donbas.” Navigating to the page source and searching for this yields the class corresponding to article titles, which we select by prefixing the class name with “.” (the CSS class selector):

a_page %>% 
  html_nodes(".DY5T1d") %>% 
  html_text() %>% 
  head()
## [1] "Russia-Ukraine War: Live Updates: Drone Strikes Damage Buildings in Moscow as Kyiv Is Hit Again"
## [2] "Russia's war in Ukraine: Live updates"                                                          
## [3] "Ukraine war comes to Moscow as drones strike both capitals"                                     
## [4] "The American military veterans who've fallen in Ukraine"                                        
## [5] "Beyond Ukraine's Offensive: The West Needs to Prepare the Country's Military for a Long War"    
## [6] "Russia-Ukraine war: List of key events, day 456"
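To see how class selectors behave without depending on Google's markup (its class names are generated and may change), here is a small self-contained sketch on made-up HTML. The class names `headline` and `source` are invented for the example.

```r
library(rvest)

# A made-up page with two links of class "headline" and one "source" span.
page <- read_html('<html><body>
  <a class="headline" href="./articles/one">First headline</a>
  <a class="headline" href="./articles/two">Second headline</a>
  <span class="source">Example Times</span>
</body></html>')

# ".headline" matches every element whose class attribute includes "headline".
titles <- page %>% html_nodes(".headline") %>% html_text()
links  <- page %>% html_nodes(".headline") %>% html_attr("href")

titles
links
```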

Reading a little further over, we can see that this element has an “href” attribute which links to the article:

a_page %>% 
  html_nodes(".DY5T1d") %>% 
  html_attr("href") %>% 
  head()
## [1] "./articles/CBMiSGh0dHBzOi8vd3d3Lm55dGltZXMuY29tL2xpdmUvMjAyMy8wNS8zMC93b3JsZC9ydXNzaWEtdWtyYWluZS1kcm9uZXMtbmV3c9IBAA?hl=en-US&gl=US&ceid=US%3Aen"                                                                                                                                          
## [2] "./articles/CBMiUGh0dHBzOi8vd3d3LmNubi5jb20vZXVyb3BlL2xpdmUtbmV3cy9ydXNzaWEtdWtyYWluZS13YXItbmV3cy0wNS0zMC0yMy9pbmRleC5odG1s0gFUaHR0cHM6Ly9hbXAuY25uLmNvbS9jbm4vZXVyb3BlL2xpdmUtbmV3cy9ydXNzaWEtdWtyYWluZS13YXItbmV3cy0wNS0zMC0yMy9pbmRleC5odG1s?hl=en-US&gl=US&ceid=US%3Aen"                
## [3] "./articles/CBMiZ2h0dHBzOi8vd3d3LnJldXRlcnMuY29tL3dvcmxkL2V1cm9wZS91a3JhaW5lLWFpci1kZWZlbmNlcy1iYXR0bGUtZnJlc2gtd2F2ZS1ydXNzaWFuLWF0dGFja3MtMjAyMy0wNS0zMC_SAQA?hl=en-US&gl=US&ceid=US%3Aen"                                                                                                 
## [4] "./articles/CBMibWh0dHBzOi8vd3d3Lndhc2hpbmd0b25wb3N0LmNvbS9uYXRpb25hbC1zZWN1cml0eS8yMDIzLzA1LzI5L21lbW9yaWFsLWRheS1hbWVyaWNhbnMta2lsbGVkLXVrcmFpbmUtcnVzc2lhLXdhci_SAQA?hl=en-US&gl=US&ceid=US%3Aen"                                                                                         
## [5] "./articles/CBMiS2h0dHBzOi8vd3d3LmZvcmVpZ25hZmZhaXJzLmNvbS91a3JhaW5lL3J1c3NpYS13YXItYmV5b25kLXVrcmFpbmVzLW9mZmVuc2l2ZdIBAA?hl=en-US&gl=US&ceid=US%3Aen"                                                                                                                                      
## [6] "./articles/CBMiVmh0dHBzOi8vd3d3LmFsamF6ZWVyYS5jb20vbmV3cy8yMDIzLzUvMjUvcnVzc2lhLXVrcmFpbmUtd2FyLWxpc3Qtb2Yta2V5LWV2ZW50cy1kYXktNDU20gFaaHR0cHM6Ly93d3cuYWxqYXplZXJhLmNvbS9hbXAvbmV3cy8yMDIzLzUvMjUvcnVzc2lhLXVrcmFpbmUtd2FyLWxpc3Qtb2Yta2V5LWV2ZW50cy1kYXktNDU2?hl=en-US&gl=US&ceid=US%3Aen"
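These links are relative to the Google News site, so they need the site's address in front before they can be opened. A minimal sketch with shortened, made-up hrefs:

```r
# Shortened, made-up hrefs in the same "./articles/..." form as above.
hrefs <- c("./articles/abc123?hl=en-US", "./articles/def456?hl=en-US")

# Swap the leading "./" for the site's base address.
full_links <- sub("^\\./", "https://news.google.com/", hrefs)
full_links
```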

Scrolling to the right, we can also find the place where the publication date is kept:

a_page %>% 
  html_nodes(".WW6dff") %>% 
  html_attr("datetime") %>% 
  head()
## [1] "2023-05-30T20:40:16Z" "2023-05-30T18:50:00Z" "2023-05-30T18:33:00Z"
## [4] "2023-05-29T08:00:00Z" "2023-05-10T07:00:00Z" "2023-05-25T02:22:57Z"
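
These timestamps come back as ISO 8601 character strings in UTC. If you want to sort or filter articles by date later on, one option is to convert them to date-time objects. A minimal sketch in base R, using two of the values printed above (the names `stamps` and `parsed` are ours, not part of the lab code):

```r
# Two of the timestamps printed above, as character strings
stamps <- c("2023-05-30T20:40:16Z", "2023-05-29T08:00:00Z")

# Parse as UTC date-times; the "T" and trailing "Z" are matched literally
parsed <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Date-times can now be compared and sorted (oldest first)
sort(parsed)

# Keep only the calendar date if the time of day is not needed
as.Date(parsed)
```

Any string that does not fit the format is returned as `NA`, so it is worth checking `anyNA(parsed)` after converting a full column.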

Likewise, we can find the “vehicle” or “publisher” information in the following class:

a_page %>% 
  html_nodes(".wEwyrc") %>% 
  html_text()
##   [1] "The New York Times"          "CNN"                        
##   [3] "Reuters"                     "The Washington Post"        
##   [5] "Foreign Affairs Magazine"    "Al Jazeera English"         
##   [7] "The Guardian"                "The Independent"            
##   [9] "Air & Space Forces Magazine" "BBC"                        
##  [11] "The Guardian"                "Al Jazeera English"         
##  [13] "BBC"                         "NDTV"                       
##  [15] "PBS NewsHour"                "The Independent"            
##  [17] "The Guardian"                "Sky News"                   
##  [19] "Al Jazeera English"          "Euronews"                   
##  [21] "CNBC"                        "The Washington Post"        
##  [23] "Al Jazeera English"          "BBC"                        
##  [25] "Euronews"                    "The Guardian"               
##  [27] "Sky News"                    "The Associated Press"       
##  [29] "Business Insider"            "Al Jazeera English"         
##  [31] "CBS News"                    "The Independent"            
##  [33] "Breaking Defense"            "The Guardian"               
##  [35] "USA TODAY"                   "CBS News"                   
##  [37] "The New York Times"          "CNBC"                       
##  [39] "BBC"                         "CNN"                        
##  [41] "NBC News"                    "The Guardian"               
##  [43] "BBC"                         "The Guardian"               
##  [45] "Slate"                       "USA TODAY"                  
##  [47] "Reuters"                     "BBC"                        
##  [49] "CNN International"           "Al Jazeera English"         
##  [51] "BBC"                         "The Guardian"               
##  [53] "National Post"               "The Washington Post"        
##  [55] "Vox.com"                     "Reuters.com"                
##  [57] "The Washington Post"         "CNN"                        
##  [59] "Al Jazeera English"          "Reuters"                    
##  [61] "Al Jazeera English"          "Reuters.com"                
##  [63] "The Guardian"                "CNBC"                       
##  [65] "Al Jazeera English"          "CNN"                        
##  [67] "Euronews"                    "Reuters"                    
##  [69] "The Guardian"                "The Washington Post"        
##  [71] "The Washington Post"         "BBC"                        
##  [73] "The Washington Post"         "USA TODAY"                  
##  [75] "Yahoo News"                  "Newsweek"                   
##  [77] "NBC News"                    "HuffPost UK"                
##  [79] "Al Jazeera English"          "The Independent"            
##  [81] "Times of India"              "Al Jazeera English"         
##  [83] "The Globe and Mail"          "Al Jazeera English"         
##  [85] "Reuters"                     "Financial Times"            
##  [87] "Al Jazeera English"          "The Guardian"               
##  [89] "Sky News"                    "Forbes"                     
##  [91] "The Independent"             "Euronews"                   
##  [93] "The Independent"             "South China Morning Post"   
##  [95] "The New York Times"          "Fox News"                   
##  [97] "Al Jazeera English"          "Kyiv Independent"           
##  [99] "The Independent"             "Al Jazeera English"         
## [101] "The Washington Post"         "EURACTIV"                   
## [103] "YouTube"                     "The Strategist"             
## [105] "YouTube"
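
With 105 entries it is hard to see at a glance which outlets dominate the results. A quick tally helps. The sketch below uses only the first twelve values printed above as a stand-in for the full vector; the object names `publisher` and `counts` are assumptions made for this example.

```r
# First twelve publisher strings from the output above
publisher <- c("The New York Times", "CNN", "Reuters", "The Washington Post",
               "Foreign Affairs Magazine", "Al Jazeera English", "The Guardian",
               "The Independent", "Air & Space Forces Magazine", "BBC",
               "The Guardian", "Al Jazeera English")

# Count how often each outlet appears, most frequent first
counts <- sort(table(publisher), decreasing = TRUE)
head(counts, 3)
```

Note that the full output lists both "Reuters" and "Reuters.com", so labels for the same outlet may need to be harmonized before counts are compared.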

Now let’s throw everything together into an updated version of our function!

google_news <- function(search_terms){
  
  base <- "https://news.google.com/search?q="
  query <- gsub(" ","%20",search_terms)
  lang <- "&hl=en-US&gl=US&ceid=US%3Aen"
  
  url <- paste0(base,query,lang)
  
  result <- read_html(url)
  
  result %>% 
    html_nodes(".DY5T1d") %>% 
    html_text() -> title
  
  result %>% 
    html_nodes(".DY5T1d") %>% 
    html_attr("href") -> link
  
  # paste0() is vectorized: drop the leading "." and prepend the site root
  link <- paste0("https://news.google.com", substring(link, 2))
  
  result %>% 
    html_nodes(".WW6dff") %>% 
    html_attr("datetime") -> date
  
  result %>% 
    html_nodes(".wEwyrc") %>% 
    html_text() -> publisher
  
  captured <- min(length(publisher),length(title),length(link),length(date))
  
  out <- data.frame(publisher = publisher[1:captured],
                    title = title[1:captured],
                    link = link[1:captured],
                    date = date[1:captured])
  
  rownames(out) <- NULL
  
  out
}
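
The `gsub()` call inside `google_news()` only handles spaces. If a search term contains other reserved characters, such as `&`, the resulting URL may be misread. One possible alternative is base R's `utils::URLencode()`. The sketch below isolates just the URL-building step; the helper name `build_query_url` is hypothetical and not part of the function above.

```r
# Build the Google News search URL, percent-encoding the whole search term
build_query_url <- function(search_terms) {
  base  <- "https://news.google.com/search?q="
  lang  <- "&hl=en-US&gl=US&ceid=US%3Aen"
  # reserved = TRUE also encodes characters such as "&" and "?"
  query <- utils::URLencode(search_terms, reserved = TRUE)
  paste0(base, query, lang)
}

build_query_url("Joe Biden")
build_query_url("AT&T merger")
```

For a plain two-word search the result is the same as with `gsub()`; the difference only shows up for terms containing characters beyond spaces.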

Let’s take it for a spin!

biden <- google_news("Joe Biden")
as_tibble(biden)
## # A tibble: 112 x 4
##    publisher          title                                          link  date 
##    <fct>              <fct>                                          <fct> <fct>
##  1 The Daily Beast    Biden Accuser Tara Reade Claims She Fled to R~ http~ 2023~
##  2 MSNBC              Biden (literally) laughs off question about p~ http~ 2023~
##  3 Yahoo News         Biden Has Priceless Response To Fox News Ques~ http~ 2023~
##  4 Newsweek           Video of Joe Biden's Response to Potential Tr~ http~ 2023~
##  5 Fox News           Republicans to hold FBI Director Wray in cont~ http~ 2023~
##  6 The New York Times Opinion | Biden’s Age, the Economy and Trump:~ http~ 2023~
##  7 MSNBC              This poll of Black adults looks bad for Biden~ http~ 2023~
##  8 Newsweek           Joe Biden Rewards Donors With Admin Positions~ http~ 2023~
##  9 CNN                Debt ceiling deal exposes Biden and McCarthy’~ http~ 2023~
## 10 NBC News           Debt ceiling deal details: What does the Bide~ http~ 2023~
## # i 102 more rows

Sweet! Two other tasks before we finish. First, we would prefer to have the actual URL of the news article rather than the Google News redirect link. Second, we would like to have more information about the article: a description rather than just the title. To do this, we need to introduce one additional package, RSelenium, and slightly modify the code.

NOTE: this will require the most recent version of Chrome to work, and the “chromever” you pass to rsDriver() needs to be compatible with the Chrome version you have installed. There may be some driver issues with Chrome, but these can be fixed by following the instructions to delete the LICENSE.chromedriver here and setting the appropriate “chromever”; the available options are listed by binman::list_versions(appname = "chromedriver").

expand_results <- function(data_frame){

  # Kill any leftover Selenium server process (taskkill is Windows-only)
  system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
  
  google_link <- data_frame$link
  
  prefs = list("profile.managed_default_content_settings.images" = 2L,
               "profile.default_content_setting_values.notifications" = 2,
               "profile.managed_default_content_settings.stylesheets" = 2,
               "profile.managed_default_content_settings.cookies" = 2,
               "profile.managed_default_content_settings.javascript" = 1,
               "profile.managed_default_content_settings.plugins" = 1,
               "profile.managed_default_content_settings.popups" = 2,
               "profile.managed_default_content_settings.geolocation" = 2,
               "profile.managed_default_content_settings.media_stream" = 2)
  
  cprof <- list(chromeOptions = list(prefs = prefs))
  rD <- rsDriver(browser = "chrome",
                 extraCapabilities = cprof,
                 chromever = "113.0.5672.63",
                 verbose = FALSE)
  remDr <- rD[["client"]]
  
  tmp <- list()
  for(i in seq_along(google_link)){

    url <- google_link[i]
    remDr$navigate(url)

    true_link <- unlist(remDr$getCurrentUrl())

    remDr$getPageSource()[[1]] %>%
      read_html() %>%
      html_node(xpath = '//meta[@name="description"]') %>%
      html_attr("content") -> description

    tmp[[i]] <- data.frame(true_link = true_link,
                           description = description)

  }
  
  tmp <- do.call("rbind",tmp)
  
  # Shut the Selenium server down again (Windows-only command)
  system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
  
  cbind(data_frame,tmp)
  
}
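
A loop like the one in `expand_results()` stops at the first page that fails to load or parse, and everything collected so far is lost. One way to make it more forgiving is to wrap the per-link work in `tryCatch()` and record `NA` on failure. The sketch below uses a stand-in function in place of the real browser calls so that it runs without Selenium; `fetch_one` and `safe_fetch` are hypothetical names, and the example.com links are placeholders.

```r
# Stand-in for the per-link work (navigate, read the page, pull the description).
# It fails on purpose for one input to mimic a page that will not load.
fetch_one <- function(url) {
  if (grepl("broken", url, fixed = TRUE)) stop("page failed to load")
  data.frame(true_link = url,
             description = paste("Description for", url),
             stringsAsFactors = FALSE)
}

# Return a row of NAs instead of aborting when a single link fails
safe_fetch <- function(url) {
  tryCatch(
    fetch_one(url),
    error = function(e) {
      data.frame(true_link = NA_character_,
                 description = NA_character_,
                 stringsAsFactors = FALSE)
    }
  )
}

links <- c("https://example.com/a",
           "https://example.com/broken",
           "https://example.com/c")

# One row per input link, so the result still lines up with the original data
results <- do.call("rbind", lapply(links, safe_fetch))
results
```

Because every link yields exactly one row, the final `cbind()` with the original data frame would still align row by row even when some pages fail.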

Let’s see what we get!

extra <- expand_results(biden[1:10,])
as_tibble(extra)
## # A tibble: 10 x 6
##    publisher          title                    link  date  true_link description
##    <fct>              <fct>                    <fct> <fct> <fct>     <fct>      
##  1 The Daily Beast    Biden Accuser Tara Read~ http~ 2023~ https://~ "In a Russ~
##  2 MSNBC              Biden (literally) laugh~ http~ 2023~ https://~ "If Donald~
##  3 Yahoo News         Biden Has Priceless Res~ http~ 2023~ https://~ "The presi~
##  4 Newsweek           Video of Joe Biden's Re~ http~ 2023~ https://~ "\"Where a~
##  5 Fox News           Republicans to hold FBI~ http~ 2023~ https://~ "House Ove~
##  6 The New York Times Opinion | Biden’s Age, ~ http~ 2023~ https://~ "The group~
##  7 MSNBC              This poll of Black adul~ http~ 2023~ https://~ "Bidens' p~
##  8 Newsweek           Joe Biden Rewards Donor~ http~ 2023~ https://~ "Biden onc~
##  9 CNN                Debt ceiling deal expos~ http~ 2023~ https://~ "President~
## 10 NBC News           Debt ceiling deal detai~ http~ 2023~ https://~ "Joe Biden~

Success! From these principles you can (somewhat) easily expand into scraping the news articles themselves, the links they contain, etc!

Exercises

Before we get ahead of ourselves, we want to make sure that you have the fundamentals in order. Do the following:

Write a script which…

  1. Collect five pages from the subReddit of your choice, clean the “title” field, and create a word cloud. Conduct PCA on the appropriate document-term matrix; are there any notable clusters?
  2. Collect five pages from the Reddit search of your choice, clean the “title” field, and create a word cloud. Conduct PCA on the appropriate document-term matrix; are there any notable clusters?
  3. Collect Google News results for the search term of your choice.
    • Clean the “title” field, create a word cloud, and conduct PCA as above.
    • Now “expand” the data.frame by scraping the resulting links. Clean the “description” field, create a word cloud, and conduct PCA as above.

Save and submit your working R script to the Exercise/Quiz Submission Link by the end of the day (ideally, end of lab session!).

