M. Fawcett - 08/28/2016
This project demonstrates how to wrangle the transcripts of large numbers of presidential speeches and prepare them for statistical analysis. My primary data source was the Web site http://millercenter.org/president/speeches, which provides transcripts of major speeches by all US presidents.
The following R code will:
1. Extract the links to speech transcripts found on the Miller Center Web page
2. Open each link
3. “Screen scrape” the text from each link
4. Save the text of each speech to a file on the local hard drive
KEEP THE EVAL=FALSE SETTING IN BLOCK “preslinks” UNLESS YOU REALLY WANT THIS TO REPLACE ALL THE FILES YOU HAVE SAVED.
The data collected here will be used to classify modern presidential candidate speeches as Republican (conservative) or Democratic (liberal).
In defining “modern,” I decided to include all presidents starting with John Kennedy. Political science is not my specialty, but it seems to me that the social and political patterns that have developed in the US since 1960 make this a reasonable cutoff. A few of the factors that went into my thinking:
– The breakup of the Solid South and the subsequent shift of southern states from Democratic to Republican.
– The Northeast’s shift from Republican to Democratic.
– The shift in African Americans’ political identification from Republican to Democratic.
– The Republican Party’s embrace of social and religious conservative values.
– The decline of labor unionism.
– The free trade movement.
Pseudocode:
1. Build a list (list #1) of presidents and their party affiliation.
2. Build a list (list #2) of all presidential speech text links.
3. Loop through each president in list #1 and for each president, subset list #2 to create another list (list #3) of speech links containing just their name.
4. Loop through list #3 to download the text of the speech and save it to a text file.
knitr::opts_chunk$set(echo = TRUE)
library(stringr) ## for returning strings matched by regular expressions
library(XML) ## for parsing html files
Create a data frame containing the classifications of presidents as Republican or Democratic.
## Build a data frame that will be used to control the extraction and saving of speech transcripts according to party.
president <- c('kennedy','lbjohnson','nixon', 'ford', 'carter', 'reagan', 'bush', 'clinton', 'gwbush', 'obama')
party <- c('Democratic','Democratic','Republican', 'Republican', 'Democratic', 'Republican', 'Republican', 'Democratic', 'Republican', 'Democratic')
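## Combine the two vectors into one data frame. Keep the names as character strings (not factors) so they can be pasted into search patterns later.
presidentDF <- data.frame(president, party, stringsAsFactors = FALSE)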
presidentDF
## president party
## 1 kennedy Democratic
## 2 lbjohnson Democratic
## 3 nixon Republican
## 4 ford Republican
## 5 carter Democratic
## 6 reagan Republican
## 7 bush Republican
## 8 clinton Democratic
## 9 gwbush Republican
## 10 obama Democratic
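Because the data frame pairs each president key with a party, it can also serve as a simple lookup table. The following is a minimal sketch rather than part of the original workflow; the helper name partyFor is hypothetical.

```r
## Rebuild the same data frame so this block runs on its own.
president <- c('kennedy', 'lbjohnson', 'nixon', 'ford', 'carter',
               'reagan', 'bush', 'clinton', 'gwbush', 'obama')
party <- c('Democratic', 'Democratic', 'Republican', 'Republican', 'Democratic',
           'Republican', 'Republican', 'Democratic', 'Republican', 'Democratic')
presidentDF <- data.frame(president, party, stringsAsFactors = FALSE)

## Hypothetical helper: return the party for one president key,
## or NA if the key is not in the lookup table.
partyFor <- function(name, lookup = presidentDF) {
  lookup$party[match(name, lookup$president)]
}

partyFor('gwbush')   ## "Republican"

## Count how many presidents fall in each class (5 and 5 here).
partyCounts <- table(presidentDF$party)
partyCounts
```

The classes are balanced by number of presidents, although the number of speeches per president may still differ.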
Build a data frame containing all the speech transcript links on the Miller Center Web site. The data frame will be called allspeechlinksDF.
## URL of the page that has all the links to presidential speeches
mainpageURL <- 'http://millercenter.org/president/speeches'
## Use the XML package to parse the entire speech listing page into an HTML document object that can be queried with XPath.
speechlistHTML <- htmlParse(mainpageURL)
## Extract just the speech transcript links (the href attributes of the matching anchor nodes).
speechlinkNodeSet <- getNodeSet(speechlistHTML, '//a[@class="transcript"]/@href') ## Using an XPath expression to locate links
## Convert the nodeset of links to a data frame
allspeechlinksDF <- as.data.frame(unlist(speechlinkNodeSet), stringsAsFactors = FALSE)
## Provide a useful name for the dataframe column
colnames(allspeechlinksDF) <- 'speechname'
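The XPath step can be tried offline before pointing it at the live site. The sketch below assumes the XML package is installed and uses a made-up HTML fragment; the real page markup and link format may differ. It uses xmlGetAttr to read each href, which is a swapped-in alternative to the /@href expression above.

```r
library(XML)

## Hypothetical stand-in for the speech listing page.
sampleHTML <- '<html><body>
<a class="transcript" href="/president/kennedy/speeches/speech-0001">Transcript</a>
<a class="audio" href="/president/kennedy/speeches/audio-0001">Audio</a>
<a class="transcript" href="/president/nixon/speeches/speech-0002">Transcript</a>
</body></html>'

## Parse the string itself (asText = TRUE) instead of fetching a URL.
sampleDoc <- htmlParse(sampleHTML, asText = TRUE)

## Keep only anchors whose class is "transcript" and read their href attribute.
sampleLinks <- as.character(
  xpathSApply(sampleDoc, '//a[@class="transcript"]', xmlGetAttr, "href")
)

sampleLinksDF <- data.frame(speechname = sampleLinks, stringsAsFactors = FALSE)
sampleLinksDF
```

Only the two transcript links survive; the audio link is filtered out by the class predicate.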
Define a function to form a complete URL for a speech file, download a speech transcript and save it.
This function will be called by the preslinksFunc function.
## Input parameter 'x' is a stub of a link to a transcript file found on the Miller Center web site.
## Will be adding a leading "http://millercenter.org" to it to form a complete URL.
savespeechFunc <- function(x) {
## This is the local file name we will use when we save (slashes replaced with underscores).
speechfileName_local <- gsub("/", "_", x)
## This is the URL that we will download the speech from.
speechUrl <- paste("http://millercenter.org", x, sep = '')
## Print the URL to the speech so progress is visible.
print(speechUrl)
## This is the full path and file name we will use when we save.
speechfileFullPath <- paste("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData", speechfileName_local, sep = "/")
## Read the contents of the speech web page into a variable
speechCode <- readLines(speechUrl, warn = FALSE) ## The "warn" argument is set to FALSE to suppress the warning for pages that are missing a final EOL.
## Sometimes web pages have no content. Save a message to that effect.
## Empty pages look like this when extracted: [1] "" "" "" "" "" "\t" (6 elements).
if (length(speechCode) < 7) {
fileConn <- file(speechfileFullPath)
## fileConn <- file("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData/myfile")
writeLines("Web page for speech has no content", fileConn)
close(fileConn)
return()
## return from here if the web page was empty, otherwise do the steps below.
}
## If there is content, use the XML package to parse the speech web page code into an HTML document object that can be queried with XPath.
htmlOutput <- htmlParse(speechCode)
## Extract just the nodes that hold the speech text: the children of the element with id="transcript".
## xpathSApply applies xmlValue to each matching node and returns a character vector of text rather than a list of nodes (nodeset).
speechText <- xpathSApply( htmlOutput, '//*[@id = "transcript"]/*', xmlValue)
## fileConn <- file("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData/myfile")
fileConn <- file(speechfileFullPath)
writeLines(speechText, fileConn)
close(fileConn)
return()
}
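Since rerunning the download replaces every saved file, one option is to skip transcripts that already exist on disk. The following is a minimal sketch under that assumption; the helper names localPathFor and needsDownload, the overwrite argument, and the sample link stub are all hypothetical and not part of the original code.

```r
## Hypothetical helper: build the local path for a link stub inside a folder,
## using the same slash-to-underscore rule as savespeechFunc.
localPathFor <- function(linkStub, saveDir) {
  file.path(saveDir, gsub("/", "_", linkStub, fixed = TRUE))
}

## Hypothetical helper: TRUE when the transcript should be fetched,
## that is, when overwriting is requested or no local copy exists yet.
needsDownload <- function(linkStub, saveDir, overwrite = FALSE) {
  overwrite || !file.exists(localPathFor(linkStub, saveDir))
}

## Demonstrate in a fresh temporary folder rather than the real SourceData folder.
demoDir <- tempfile("SourceData_demo_")
dir.create(demoDir)
demoStub <- "/president/kennedy/speeches/speech-0001"   ## made-up link stub

firstCheck <- needsDownload(demoStub, demoDir)            ## TRUE: nothing saved yet
writeLines("placeholder text", localPathFor(demoStub, demoDir))
secondCheck <- needsDownload(demoStub, demoDir)           ## FALSE: file now exists
```

A check like this could sit at the top of savespeechFunc, returning early when needsDownload is FALSE.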
This will (1) get the subset of transcript links belonging to a president, and (2) using a nested function, get and save the transcripts to disk.
## Define a function to list the links to a particular president
## x = a president's name; a forward slash is prepended to it inside the function.
preslinksFunc <- function(x) {
## presspeechDF will be a list of speech links for a single president.
presspeechDF <- NULL
## 'x' will be a president's name.
## Need the '/' before the name to distinguish between the two Bushes
presspeechDF <- subset(allspeechlinksDF, grepl(paste('/', x, sep = ''), allspeechlinksDF$speechname))
## nested function:
## Using the subset of speech links for the president, get their plain text and save them to files on our computer.
## subset() always returns a data frame (never NULL), so test for at least one matching link.
if (nrow(presspeechDF) > 0) {
xyz <- apply(presspeechDF, 1, function(x) savespeechFunc(x) )
}
}
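The effect of the leading slash is easiest to see on a few sample links. This sketch uses made-up link stubs; the actual format of the Miller Center links is an assumption.

```r
## Hypothetical link stubs in the style expected by preslinksFunc.
sampleLinks <- c("/president/bush/speeches/speech-0001",
                 "/president/gwbush/speeches/speech-0002",
                 "/president/obama/speeches/speech-0003")

## Without the slash, 'bush' also matches the 'gwbush' link.
withoutSlash <- grepl("bush", sampleLinks, fixed = TRUE)
withoutSlash   ## TRUE TRUE FALSE

## With the slash, only the first link matches.
withSlash <- grepl(paste('/', 'bush', sep = ''), sampleLinks, fixed = TRUE)
withSlash      ## TRUE FALSE FALSE
```

The fixed = TRUE argument treats the pattern as a literal string, which avoids surprises if a name ever contains a regular expression metacharacter.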
This is the main control function for this program.
KEEP THE EVAL=FALSE SETTING IN THIS CODE BLOCK UNLESS YOU REALLY WANT TO REPLACE ALL THE FILES YOU HAVE SAVED.
This uses a pair of nested functions to first loop through the list of presidents, and then for each president get a list of links to their speeches, retrieve the transcript and save it to disk.
## call the function for each president using mapply
preslinksList <- mapply(preslinksFunc, presidentDF$president)
print(preslinksList)
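As written, a single failed download raises an error in readLines and stops the whole run. One possible safeguard, sketched here with a hypothetical helper named safeReadLines, is to trap the error and return an empty character vector, which the existing length check in savespeechFunc would then treat as an empty page.

```r
## Hypothetical helper: read lines from a URL or file path, returning
## character(0) instead of stopping when the source cannot be opened.
safeReadLines <- function(source) {
  tryCatch(
    suppressWarnings(readLines(source, warn = FALSE)),
    error = function(e) character(0)
  )
}

## Demonstrate with local files so no network access is needed.
missingPath <- file.path(tempdir(), "no_such_speech_file.txt")
missingResult <- safeReadLines(missingPath)   ## character(0), no error raised

goodPath <- tempfile(fileext = ".txt")
writeLines(c("line one", "line two"), goodPath)
goodResult <- safeReadLines(goodPath)         ## the two lines written above
```

This trades visibility for robustness, so logging the failed URL inside the error handler may be worth adding.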
The speeches will all be saved to the folder /Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData. I manually copied and pasted them into /Conservative and /Liberal sub-folders as follows:
Liberal: ‘kennedy’,‘lbjohnson’, ‘carter’, ‘clinton’, ‘obama’
Conservative: ‘nixon’, ‘ford’, ‘reagan’, ‘bush’, ‘gwbush’
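The manual sorting step could also be scripted. The sketch below is one possible approach, not what was actually done here; the helper sortSpeeches and the sample file names are hypothetical, and it assumes each saved file name contains an underscore followed by the president key (the result of replacing slashes in the link).

```r
presidentDF <- data.frame(
  president = c('kennedy', 'lbjohnson', 'nixon', 'ford', 'carter',
                'reagan', 'bush', 'clinton', 'gwbush', 'obama'),
  party = c('Democratic', 'Democratic', 'Republican', 'Republican', 'Democratic',
            'Republican', 'Republican', 'Democratic', 'Republican', 'Democratic'),
  stringsAsFactors = FALSE
)

## Map each party to the sub-folder name used above.
folderFor <- c(Democratic = "Liberal", Republican = "Conservative")

## Hypothetical helper: copy saved speech files into party sub-folders.
sortSpeeches <- function(sourceDir, lookup = presidentDF) {
  allFiles <- list.files(sourceDir)
  for (i in seq_len(nrow(lookup))) {
    ## '_name' mirrors the '/name' match used when the links were subset.
    hits <- allFiles[grepl(paste0("_", lookup$president[i]), allFiles, fixed = TRUE)]
    if (length(hits) > 0) {
      targetDir <- file.path(sourceDir, folderFor[[lookup$party[i]]])
      dir.create(targetDir, showWarnings = FALSE)
      file.copy(file.path(sourceDir, hits), targetDir)
    }
  }
  invisible(NULL)
}

## Demonstrate on made-up files in a temporary folder.
demoDir <- tempfile("speeches_")
dir.create(demoDir)
for (f in c("_president_bush_speeches_speech-0001",
            "_president_gwbush_speeches_speech-0002",
            "_president_obama_speeches_speech-0003")) {
  writeLines("placeholder text", file.path(demoDir, f))
}
sortSpeeches(demoDir)
conservativeFiles <- list.files(file.path(demoDir, "Conservative"))
liberalFiles <- list.files(file.path(demoDir, "Liberal"))
```

Copying rather than moving keeps the original downloads intact in the source folder.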