M. Fawcett - 08/28/2016

Introduction

This project demonstrates how to wrangle the transcripts of large numbers of presidential speeches and prepare them for statistical analysis. My primary data source was the Web site http://millercenter.org/president/speeches, which provides transcripts of major speeches by all US presidents.

The following R code will:
1. Extract the links to speech transcripts found on the Miller Center Web page
2. Open each link
3. “Screen scrape” the text from each link
4. Save the text of each speech to a file on the local hard drive

Extracting Text of Presidential Speeches Using R

KEEP THE EVAL=FALSE SETTING IN BLOCK “preslinks” UNLESS YOU REALLY WANT THIS TO REPLACE ALL THE FILES YOU HAVE SAVED.

The data collected here will be used to classify modern presidential candidate speeches as Republican (conservative) or Democratic (liberal).

In defining “modern” I decided to include all presidents starting with John Kennedy. Political science is not my specialty, but it seems to me that a distinct pattern of social and political characteristics has developed in the US since 1960, which makes this a reasonable cutoff. A few of the factors that went into my thinking:
– The breakup of the Solid South and the subsequent shift from Democratic to Republican in southern states.
– The Northeast’s shift from Republican to Democratic.
– The shift of African Americans' political identification from Republican to Democratic.
– The Republican Party's embrace of social and religious conservative values.
– The decline of labor unionism.
– The free trade movement.

Pseudocode:
1. Build a list (list #1) of presidents and their party affiliation.
2. Build a list (list #2) of all presidential speech text links.
3. Loop through each president in list #1 and, for each one, subset list #2 to create another list (list #3) of the speech links that contain that president's name.
4. Loop through list #3 to download the text of the speech and save it to a text file.
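Step 2 of the pseudocode (building the list of speech links) could look roughly like the sketch below. It is an illustration only: the inline HTML and the link stubs are made-up stand-ins for the real index page, whose markup may differ, and the "/speeches/" filter is an assumed URL pattern. Only the data frame name allspeechlinksDF and its column speechname are taken from the code later in this document.

```r
library(XML)  ## for htmlParse() and xpathSApply()

## Hypothetical stand-in for the speech index page; the real markup may differ.
indexHtml <- '<html><body>
  <a href="/president/kennedy/speeches/speech-3365">Inaugural Address</a>
  <a href="/president/gwbush/speeches/speech-4541">Address to Congress</a>
  <a href="/about/contact">Contact</a>
</body></html>'

## Parse the page and collect the href attribute of every anchor tag.
indexDoc <- htmlParse(indexHtml, asText = TRUE)
allLinks <- as.character(xpathSApply(indexDoc, "//a/@href"))

## Keep only links that look like speech transcripts (assumed URL pattern).
speechLinks <- allLinks[grepl("/speeches/", allLinks, fixed = TRUE)]

## One-column data frame, named to match its use in preslinksFunc.
allspeechlinksDF <- data.frame(speechname = speechLinks, stringsAsFactors = FALSE)
allspeechlinksDF
```

To run this against the live site, the inline string would be replaced by the downloaded page source.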

knitr::opts_chunk$set(echo = TRUE)


library(stringr)  ## for returning strings matched by regular expressions

library(XML)  ## for parsing html files

Prepare a Control List of Presidential Parties

Create a data frame containing the classifications of presidents as Republican or Democratic.

## Build a data frame that will be used to control the extraction and saving of speech transcripts according to party.

president <- c('kennedy','lbjohnson','nixon', 'ford', 'carter', 'reagan', 'bush', 'clinton', 'gwbush', 'obama')
party <- c('Democratic','Democratic','Republican', 'Republican', 'Democratic', 'Republican', 'Republican', 'Democratic', 'Republican', 'Democratic')

## Combine the two vectors into one data frame.
presidentDF <- data.frame(president, party)

presidentDF
##    president      party
## 1    kennedy Democratic
## 2  lbjohnson Democratic
## 3      nixon Republican
## 4       ford Republican
## 5     carter Democratic
## 6     reagan Republican
## 7       bush Republican
## 8    clinton Democratic
## 9     gwbush Republican
## 10     obama Democratic

Function Definition - Download a Speech and Save it.

Define a function to form a complete URL for a speech file, download a speech transcript and save it.

This function will be called by the preslinksFunc function.

## Input parameter 'x' is a stub of a link to a transcript file found on the Miller Center web site. 
## Will be adding a leading "http://millercenter.org" to it to form a complete URL.
savespeechFunc <- function(x) {

  ## Local file name: the link stub with every "/" replaced by "_".
  speechfileName_local <- gsub("/", "_", x)
  
  ## Full URL that the speech will be downloaded from; print it to show progress.
  speechUrl <- paste("http://millercenter.org", x, sep = '')
  print(speechUrl)
  
  ## This is the full path and file name we will use when we save.
  speechfileFullPath <- paste("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData", speechfileName_local, sep = "/")
  
  ## Read the contents of the speech web page into a character vector, one element per line.
  speechCode <- readLines(speechUrl, warn = FALSE)  ## warn = FALSE suppresses the warning for pages that are missing a final EOL.

  ## Sometimes a web page has no content. Save a message to that effect.
  ## An empty page is read as 6 elements:  ""  ""  ""  ""  ""  "\t"
  if (length(speechCode) < 7) {
    
    fileConn <- file(speechfileFullPath)
    ## fileConn <- file("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData/myfile")
    writeLines("Web page for speech has no content", fileConn)
    close(fileConn)
    ## Return from here if the web page was empty; otherwise do the steps below.
    return()
  }
  
  ## If there is content, use the XML package to parse the entire speech web page into a document object that can be queried with XPath.
  htmlOutput <- htmlParse(speechCode)
  
  ## Extract just the nodes that hold the speech text: the children of the element with id="transcript".
  ## xpathSApply applies xmlValue to each matched node and simplifies the result to a character vector instead of returning a nodeset.
  speechText <- xpathSApply( htmlOutput, '//*[@id = "transcript"]/*', xmlValue)
  
  ## fileConn <- file("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData/myfile")
  fileConn <- file(speechfileFullPath)
  writeLines(speechText, fileConn)
  close(fileConn)
  
  return()
  
}
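The two steps of savespeechFunc that do not need a network connection, deriving the local file name and pulling the transcript text out of the page, can be tried on their own. The sketch below is only an illustration: the link stub and the HTML fragment are hypothetical, and the real Miller Center markup may be structured differently.

```r
library(XML)  ## for htmlParse() and xpathSApply()

## Hypothetical link stub, shaped like the input that savespeechFunc expects.
stub <- "/president/kennedy/speeches/speech-3365"
localName <- gsub("/", "_", stub)  ## every "/" becomes "_"

## Hypothetical page source, one element per line as readLines() would return it.
pageLines <- c(
  '<html><body>',
  '<div id="nav"><p>Site menu</p></div>',
  '<div id="transcript"><h2>Transcript</h2>',
  '<p>My fellow citizens.</p><p>Thank you.</p></div>',
  '</body></html>'
)

## Collapse the lines into one string and parse it as HTML text.
doc <- htmlParse(paste(pageLines, collapse = "\n"), asText = TRUE)

## Text of each direct child element of the node with id="transcript".
paragraphs <- xpathSApply(doc, '//*[@id = "transcript"]/*', xmlValue)

localName
paragraphs
```

Only the children of the transcript element are returned, so navigation text elsewhere on the page does not end up in the saved file.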

Define Function to Subset Speech Transcripts for a President

This will (1) get the subset of transcript links belonging to a president and (2) call savespeechFunc on each link to download the transcript and save it to disk.

## Define a function to list the links to a particular president's speeches.
## x = a president's name as it appears in presidentDF; the leading forward slash is added inside the function.

preslinksFunc <- function(x) {
  
  ## 'x' is a president's name, e.g. 'bush'.
  ## The '/' before the name is needed to distinguish between the two Bushes ('/bush' does not match '/gwbush').
  ## presspeechDF will hold the speech links for a single president.
  presspeechDF <- subset(allspeechlinksDF, grepl(paste('/', x, sep = ''), allspeechlinksDF$speechname))
  
  ## Using the subset of speech links for the president, get their plain text and save them to files on our computer.
  ## subset() returns a zero-row data frame, never NULL, when nothing matches, so test the row count.
  if (nrow(presspeechDF) > 0) {
    xyz <- apply(presspeechDF, 1, function(link) savespeechFunc(link))
  }
  
}
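The effect of the leading slash is easy to check on a small example. The link stubs below are invented for illustration and only mimic the general shape of the real links.

```r
## Toy data frame with hypothetical link stubs for three presidents.
toyLinksDF <- data.frame(
  speechname = c("/president/bush/speeches/speech-3425",
                 "/president/gwbush/speeches/speech-4541",
                 "/president/kennedy/speeches/speech-3365"),
  stringsAsFactors = FALSE
)

## Without the slash, "bush" also matches the "gwbush" link.
matchNoSlash <- grepl("bush", toyLinksDF$speechname)

## With the slash, only the link for 'bush' matches.
matchWithSlash <- grepl("/bush", toyLinksDF$speechname)

## The same filter applied through subset(), as in preslinksFunc.
bushOnlyDF <- subset(toyLinksDF, grepl("/bush", speechname))
bushOnlyDF
```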

Copy Transcripts to Conservative and Liberal Folders

The speeches are all saved to the folder /Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData. I then manually copied them to /Conservative and /Liberal sub-folders as follows:

Liberal: ‘kennedy’,‘lbjohnson’, ‘carter’, ‘clinton’, ‘obama’

Conservative: ‘nixon’, ‘ford’, ‘reagan’, ‘bush’, ‘gwbush’
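The manual copy step could also be scripted. The following is a rough sketch, not the procedure actually used here: it works in a temporary directory with placeholder files, and the folder layout and file names are assumptions made for the example.

```r
## Hypothetical folder layout under a temporary directory, not the real project paths.
sourceDir <- file.path(tempdir(), "SourceData")
dir.create(sourceDir, showWarnings = FALSE, recursive = TRUE)

## Two placeholder files standing in for saved transcripts.
writeLines("placeholder", file.path(sourceDir, "_president_kennedy_speeches_speech-1"))
writeLines("placeholder", file.path(sourceDir, "_president_gwbush_speeches_speech-2"))

## Reduced control list with the same columns as presidentDF.
controlDF <- data.frame(
  president = c("kennedy", "gwbush"),
  party = c("Democratic", "Republican"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(controlDF))) {
  ## Map party to the folder names used in this project.
  folder <- if (controlDF$party[i] == "Democratic") "Liberal" else "Conservative"
  targetDir <- file.path(sourceDir, folder)
  dir.create(targetDir, showWarnings = FALSE)

  ## Saved file names have "/" replaced by "_", so match "_<name>_".
  matches <- list.files(sourceDir,
                        pattern = paste0("_", controlDF$president[i], "_"),
                        full.names = TRUE)
  file.copy(matches, targetDir)
}
```

Matching on "_<name>_" keeps 'bush' and 'gwbush' apart in the same way the leading slash does for the links.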