Data Wrangling with Presidential Speeches

Extracting Text of Presidential Speeches Using R

KEEP THE EVAL=FALSE SETTING IN BLOCK “preslinks” UNLESS YOU REALLY WANT THIS TO REPLACE ALL THE FILES YOU HAVE SAVED.

The data collected here will be used to classify modern presidential candidate speeches as Republican (conservative) or Democratic (liberal).

In defining “modern” I decided to include all presidents starting with John Kennedy. Political science is not my specialty, but it seemed to me that a certain pattern of social and political characteristics has developed in the US since 1960 that makes this a reasonable cutoff. To mention a few of the factors that went into my thinking:
– The breakup of the Solid South and the subsequent shift from Democratic to Republican in southern states.
– The Northeast’s shift from Republican to Democratic.
– The shift of African Americans’ political identification from Republican to Democratic.
– The Republican Party’s embrace of social and religious conservative values.
– The decline of labor unionism.
– The free trade movement.

Pseudocode:
1. Build a list (list #1) of presidents and their party affiliation.
2. Build a list (list #2) of all presidential speech text links.
3. Loop through each president in list #1 and, for each president, subset list #2 to create another list (list #3) containing only the speech links that include that president's name.
4. Loop through list #3 to download the text of the speech and save it to a text file.

knitr::opts_chunk$set(echo = TRUE)


library(stringr)  ## for returning strings matched by regular expressions

library(XML)  ## for parsing html files

Prepare a Control List of Presidential Parties

Create a data frame containing the classifications of presidents as Republican or Democratic.

## Build a data frame that will be used to control the extraction and saving of speech transcripts according to party.

president <- c('kennedy','lbjohnson','nixon', 'ford', 'carter', 'reagan', 'bush', 'clinton', 'gwbush', 'obama')
party <- c('Democratic','Democratic','Republican', 'Republican', 'Democratic', 'Republican', 'Republican', 'Democratic', 'Republican', 'Democratic')

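## Combine the two vectors into one data frame.
## Keep the names as character strings (not factors) so they can be pasted into search patterns later.
presidentDF <- data.frame(president, party, stringsAsFactors = FALSE)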

presidentDF

##    president      party
## 1    kennedy Democratic
## 2  lbjohnson Democratic
## 3      nixon Republican
## 4       ford Republican
## 5     carter Democratic
## 6     reagan Republican
## 7       bush Republican
## 8    clinton Democratic
## 9     gwbush Republican
## 10     obama Democratic

Build List of All Speech Links

Build a data frame containing all the speech transcript links on the Miller Center web site. The data frame will be called allspeechlinksDF.

## URL of the page that has all the links to presidential speeches
mainpageURL <- 'http://millercenter.org/president/speeches'

## Use the XML package to read the entire speech web page code into a parsed document object that can be queried with XPath.
speechlistHTML <- htmlParse(mainpageURL)

## Extract just the nodes that have speech transcript links (hrefs).
speechlinkNodeSet <- getNodeSet(speechlistHTML, '//a[@class="transcript"]/@href')  ## Using an XPath expression to locate links

## Convert the nodeset of links to a data frame
allspeechlinksDF <- as.data.frame(unlist(speechlinkNodeSet), stringsAsFactors = FALSE)

## Provide a useful name for the dataframe column
colnames(allspeechlinksDF) <- 'speechname'
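
To see what the XPath expression selects without contacting the web site, the same steps can be tried on a small inline page. This is only a sketch: the HTML fragment and the link stubs in it are made up for illustration, and the real page layout may differ. It also wraps the result in as.character(), which the code above does not do, to make the column type explicit.

```r
library(XML)

## Hypothetical page fragment; the real page markup may differ.
sampleHTML <- '<html><body>
  <a class="transcript" href="/president/kennedy/speeches/speech-0001">Transcript</a>
  <a class="audio" href="/president/kennedy/speeches/audio-0001">Audio</a>
  <a class="transcript" href="/president/nixon/speeches/speech-0002">Transcript</a>
</body></html>'

## Parse the string instead of a URL.
sampleDoc <- htmlParse(sampleHTML, asText = TRUE)

## Same XPath expression as above: href attributes of <a class="transcript"> only.
sampleNodeSet <- getNodeSet(sampleDoc, '//a[@class="transcript"]/@href')

## Flatten the node set into a one-column data frame of link stubs.
sampleLinksDF <- data.frame(speechname = as.character(unlist(sampleNodeSet)),
                            stringsAsFactors = FALSE)
print(sampleLinksDF)
```

Only the two links with class "transcript" should survive; the "audio" link is filtered out by the XPath predicate.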

Function Definition - Download a Speech and Save it.

Define a function to form a complete URL for a speech file, download a speech transcript and save it.

This function will be called by the preslinksFunc function.

## Input parameter 'x' is a stub of a link to a transcript file found on the Miller Center web site. 
## Will be adding a leading "http://millercenter.org" to it to form a complete URL.
savespeechFunc <- function(x) {

  ## This is the local file name we will use when we save (slashes replaced with underscores).
  speechfileName_local <- gsub("/", "_", x)
  
  ## This is the URL that we will download the speech from.  Print it to show progress.
  speechUrl <- paste("http://millercenter.org", x, sep = '')
  print(speechUrl)
  
  ## This is the full path and file name we will use when we save.
  speechfileFullPath <- paste("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData", speechfileName_local, sep = "/")
  
  ## Read the contents of the speech web page into a variable
  speechCode <- readLines(speechUrl, warn = FALSE)  ## The "warn" argument is set to FALSE to suppress the warning raised for files that are missing a final EOL.

  ## Sometimes web pages have no content.  Save a message to that effect.
  ## Empty pages look like this when extracted:  [1] ""   ""   ""   ""   ""   "\t"   (6 elements).
  if (length(speechCode) < 7) {
    
    fileConn <- file(speechfileFullPath)
    ## fileConn <- file("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData/myfile")
    writeLines("Web page for speech has no content", fileConn)
    close(fileConn)
    ## Return from here if the web page was empty, otherwise do the steps below.
    return()
  }
  
  ## If there is content, use the XML package to parse the speech web page code into a document object that can be queried with XPath.
  htmlOutput <- htmlParse(speechCode)
  
  ## Extract the text of the child nodes of the element where id="transcript".
  ## xpathSApply applies xmlValue to each matched node and returns the results as a character vector rather than a list of nodes (nodeset).
  speechText <- xpathSApply( htmlOutput, '//*[@id = "transcript"]/*', xmlValue)
  
  ## fileConn <- file("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData/myfile")
  fileConn <- file(speechfileFullPath)
  writeLines(speechText, fileConn)
  close(fileConn)
  
  return()
  
}
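
The naming and extraction steps inside the function can be tried in isolation, without downloading anything. The sketch below is an illustration under stated assumptions: the link stub and the page markup are invented, and the output goes to tempdir() rather than the SourceData folder so nothing already saved is overwritten.

```r
library(XML)

## Hypothetical link stub; real stubs come from allspeechlinksDF.
stub <- "/president/kennedy/speeches/speech-0001"

## Same naming rules as savespeechFunc: slashes become underscores,
## and the site prefix is pasted on to form the full URL.
localName <- gsub("/", "_", stub)
fullUrl <- paste("http://millercenter.org", stub, sep = "")

## Hypothetical page content standing in for the readLines() result.
samplePage <- c('<html><body>',
                '<div id="transcript"><h2>Transcript</h2>',
                '<p>First paragraph.</p><p>Second paragraph.</p></div>',
                '</body></html>')

## Parse the lines as one HTML string.
doc <- htmlParse(paste(samplePage, collapse = "\n"), asText = TRUE)

## One character string per child node of the id="transcript" element.
speechText <- xpathSApply(doc, '//*[@id = "transcript"]/*', xmlValue)

## Write to a temporary folder and read the file back to check the round trip.
outPath <- file.path(tempdir(), localName)
writeLines(speechText, outPath)
savedText <- readLines(outPath)
print(savedText)
```

Each child of the transcript element becomes one line in the saved file, so paragraph boundaries in the page carry over to the text file.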

Define Function to Subset Speech Transcripts for a President

This will (1) get the subset of transcript links belonging to a president and (2), using a nested function, get and save the transcripts to disk.

## Define a function to list the links for a particular president
## x = a president's name, without a leading slash; the function adds the forward slash itself.

preslinksFunc <- function(x) {
  
  ## 'x' will be a president's name.
  ## Need the '/' before the name to distinguish between the two Bushes.
  ## presspeechDF will hold the speech links for a single president.
  presspeechDF <- subset(allspeechlinksDF, grepl(paste('/', x, sep = ''), allspeechlinksDF$speechname))
  
  ## nested function:
  ## Using the subset of speech links for the president, get their plain text and save them to files on our computer.
  ## subset() returns a data frame with zero rows (never NULL) when nothing matches, so test the row count.
  if (nrow(presspeechDF) > 0) {
   xyz <- apply(presspeechDF, 1, function(x) savespeechFunc(x) )
  }
  
}
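
The effect of the leading '/' in the search pattern is easy to check on a few made-up links. The stubs below are hypothetical and only mimic the general shape of the real ones.

```r
## Hypothetical link stubs for three presidents.
links <- c("/president/bush/speeches/speech-0003",
           "/president/gwbush/speeches/speech-0004",
           "/president/obama/speeches/speech-0005")
sampleDF <- data.frame(speechname = links, stringsAsFactors = FALSE)

## Without the slash, 'bush' matches both Bushes.
looseMatch <- subset(sampleDF, grepl("bush", sampleDF$speechname))

## With the slash, '/bush' no longer matches '/gwbush'.
strictMatch <- subset(sampleDF, grepl(paste("/", "bush", sep = ""), sampleDF$speechname))

print(nrow(looseMatch))
print(strictMatch$speechname)
```

A name that matches nothing yields a data frame with zero rows rather than NULL, which is worth keeping in mind when deciding whether there is anything to download.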

Build List of Speech Links for Each President

This is the main control step for this program.

KEEP THE EVAL=FALSE SETTING IN THIS CODE BLOCK UNLESS YOU REALLY WANT TO REPLACE ALL THE FILES YOU HAVE SAVED.

This uses a pair of nested functions to first loop through the list of presidents, and then for each president get a list of links to their speeches, retrieve the transcript and save it to disk.

## call the function for each president using mapply 
preslinksList <- mapply(preslinksFunc, presidentDF$president)

print(preslinksList)

Copy Transcripts to Conservative and Liberal Folders

The speeches will all be saved to the folder /Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData. I manually copied and pasted them to /Conservative and /Liberal sub-folders as follows:

Liberal: ‘kennedy’,‘lbjohnson’, ‘carter’, ‘clinton’, ‘obama’

Conservative: ‘nixon’, ‘ford’, ‘reagan’, ‘bush’, ‘gwbush’
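
The copy was done by hand here, but the party column of presidentDF could drive it instead. The following is a rough sketch, not what was actually run: copyByParty is a hypothetical helper, and it assumes the saved file names contain the president's name between underscores (for example "_president_bush_speeches_..."), which depends on the shape of the link stubs.

```r
## Hypothetical helper: copy each president's saved speeches into a
## Liberal or Conservative sub-folder of sourceDir, based on party.
copyByParty <- function(sourceDir, presidentDF) {
  ## Map party affiliation to the sub-folder name.
  folderFor <- c(Democratic = "Liberal", Republican = "Conservative")
  copied <- 0
  for (i in seq_len(nrow(presidentDF))) {
    name <- as.character(presidentDF$president[i])
    party <- as.character(presidentDF$party[i])
    targetDir <- file.path(sourceDir, folderFor[[party]])
    dir.create(targetDir, showWarnings = FALSE)
    ## Assumed file name shape: the name sits between underscores,
    ## so "_bush_" does not pick up "_gwbush_" files.
    files <- list.files(sourceDir, pattern = paste0("_", name, "_"),
                        full.names = TRUE)
    copied <- copied + sum(file.copy(files, targetDir))
  }
  copied  ## number of files copied
}

## Example call (path as used above):
## copyByParty("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData", presidentDF)
```

Because file.copy does not overwrite existing files by default, running the helper a second time leaves earlier copies untouched.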