I tried a couple of different approaches for collecting the data, like the one below. I ended up just copying the titles from https://millercenter.org/the-presidency/presidential-speeches and cleaning them into URL slugs, so it looks a little messy. I had to do it this way because the page only loads more speeches as you scroll down, so a plain web scrape doesn't see the full list. (Another option for scraping would be to use RSelenium.)
# packages used below
library(rvest)        # read_html, html_nodes, html_text, %>%
library(tm)           # Corpus, tm_map, TermDocumentMatrix, weightTfIdf
library(lsa)          # lsa
library(caret)        # confusionMatrix
library(gains)        # gains
library(wordcloud)    # wordcloud
library(RColorBrewer) # brewer.pal
tr <- c("January 19, 2021: Farewell Address
video icon audio icon transcript icon
January 13, 2021: Statement about the Violence at the Capitol
video icon audio icon transcript icon
January 7, 2021: Message After Pro-Trump Mob Overruns US Capitol
video icon audio icon transcript icon
January 6, 2021: Speech Urging Supporters to Go Home
video icon audio icon transcript icon
November 5, 2020: Remarks on the 2020 Election
video icon audio icon transcript icon
September 26, 2020: Announcing His Nominee for the US Supreme Court
video icon audio icon transcript icon
September 7, 2020: Labor Day Press Conference
video icon audio icon transcript icon
August 8, 2020: Press Conference on Executive Orders
video icon audio icon transcript icon
July 4, 2020: Remarks at Salute to America
video icon audio icon transcript icon
June 20, 2020: Campaign Rally in Tulsa, Oklahoma
video icon audio icon transcript icon
June 13, 2020: Address at West Point Graduation
video icon audio icon transcript icon
June 1, 2020: Statement on Protests Against Police Brutality
video icon audio icon transcript icon
April 23, 2020: Task Force Briefing on the Coronavirus Pandemic
video icon audio icon transcript icon
April 15, 2020: Press Briefing with the Coronavirus Task Force
video icon audio icon transcript icon
April 13, 2020: Coronavirus Task Force Briefing
video icon audio icon transcript icon
March 13, 2020: Press Conference about the Coronavirus
video icon audio icon transcript icon
March 11, 2020: Statement on the Coronavirus
video icon audio icon transcript icon
February 6, 2020: Remarks after HIs Acquittal
video icon audio icon transcript icon
February 4, 2020: State of the Union Address
video icon audio icon transcript icon
January 24, 2020: Speech at March for Life
video icon audio icon transcript icon
January 8, 2020: Statement on Iran
video icon audio icon transcript icon
January 3, 2020: Remarks on the Killing of Qasem Soleimani
video icon audio icon transcript icon
October 27, 2019: Statement on the the Death of Abu Bakr al-Baghdadi
video icon audio icon transcript icon
September 25, 2019: Press Conference
video icon audio icon transcript icon
September 24, 2019: Remarks at the United Nations General Assembly
video icon audio icon transcript icon
February 15, 2019: Speech Declaring a National Emergency
audio icon transcript icon
February 5, 2019: State of the Union Address
video icon audio icon transcript icon
January 19, 2019: Remarks about the US Southern Border
video icon audio icon transcript icon
September 25, 2018: Address at the 73rd Session of the United Nations General Assembly
video icon audio icon transcript icon
July 24, 2018: Speech at the Veterans of Foreign Wars National Convention
video icon audio icon transcript icon
March 19, 2018: Remarks on Combating the Opioid Crisis
video icon audio icon transcript icon
February 23, 2018: Remarks at the Conservative Political Action Conference
video icon audio icon transcript icon
February 15, 2018: Statement on the School Shooting in Parkland, Florida
video icon audio icon transcript icon
February 1, 2018: Remarks at the House and Senate Republican Member Conference
video icon audio icon transcript icon
January 30, 2018: State of the Union Address
video icon audio icon transcript icon
January 26, 2018: Address at the World Economic Forum
video icon audio icon transcript icon
December 18, 2017: Remarks on National Security Strategy
video icon audio icon transcript icon
September 19, 2017: Address to the United Nations General Assembly
video icon audio icon transcript icon
July 24, 2017: Speech at the Boy Scout Jamboree
video icon audio icon transcript icon
June 29, 2017: Speech at the Unleashing American Energy Event
audio icon transcript icon
February 28, 2017: Address to Joint Session of Congress
video icon audio icon transcript icon
January 20, 2017: Inaugural Address")
# strip punctuation and small words so the titles match the site's URL slugs
t <- gsub(":","",tr)
t <- gsub(",","",t)
t <- gsub(" at "," ",t)
t <- gsub(" the "," ",t)
t <- gsub(" The ", " ",t)
t <- gsub(" for "," ",t)
t <- gsub(" to "," ",t)
t <- gsub(" on "," ",t)
t <- gsub(" in "," ",t)
t <- gsub(" with "," ",t)
t <- gsub(" of "," ",t)
t <- gsub(" a "," ",t)
t <- tolower(t)
t <- strsplit(t,split = "\n")
t <- t[[1]]
# odd lines are the titles; even lines are the leftover icon text
t <- t[seq(1, length(t), by = 2)]
t <- gsub(" ","-",t)
#getting url to transcript for each speech
urlst <- "https://millercenter.org/the-presidency/presidential-speeches/"
lnth_t <- seq(1,length(t))
# paste0 is vectorized, so the whole URL vector can be built in one call
urlt <- paste0(urlst, t)
# these slugs don't follow the generated pattern (some are cut short), so set them by hand
urlt[23] = "https://millercenter.org/the-presidency/presidential-speeches/october-27-2019-statement-death-abu-bakr-al-baghdadi"
urlt[29] = "https://millercenter.org/the-presidency/presidential-speeches/september-25-2018-address-73rd-session-united-nations-general"
urlt[32] = "https://millercenter.org/the-presidency/presidential-speeches/february-23-2018-remarks-conservative-political-action"
urlt[34] = "https://millercenter.org/the-presidency/presidential-speeches/february-1-2018-remarks-house-and-senate-republican-member"
urlt[35] = "https://millercenter.org/the-presidency/presidential-speeches/january-30-2018-state-union-address"
urlt[41] = "https://millercenter.org/the-presidency/presidential-speeches/february-28-2017-address-joint-session-congress"
#Getting speech transcripts
trump <- c()
for (i in lnth_t) {
  trump[i] <- read_html(urlt[i]) %>%
    html_nodes('.transcript-inner') %>%
    html_text()
}
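The title clean-up and URL construction above can also be wrapped in one helper. This is only a minimal sketch: `title_to_slug` is a hypothetical name, not part of the original code, and it lowercases first and uses word boundaries, so it is slightly more aggressive than the chain of `gsub` calls. It can't know about slugs that the site cuts short, so hand-written overrides like the ones above would still be needed.

```r
# Hypothetical helper (not from the original code): turn a speech title into a URL slug.
title_to_slug <- function(title) {
  stop_words <- c("at", "the", "for", "to", "on", "in", "with", "of", "a")
  slug <- tolower(title)
  slug <- gsub("[:,]", "", slug)                      # drop colons and commas
  # \\b marks a word boundary, so repeated stop words ("the the") are all removed
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  slug <- gsub(pattern, " ", slug, perl = TRUE)
  slug <- trimws(gsub("\\s+", " ", slug))             # collapse runs of spaces
  gsub(" ", "-", slug, fixed = TRUE)
}

base_url <- "https://millercenter.org/the-presidency/presidential-speeches/"
titles <- c("January 19, 2021: Farewell Address",
            "October 27, 2019: Statement on the the Death of Abu Bakr al-Baghdadi")
urls <- paste0(base_url, title_to_slug(titles))
print(urls)
```

The second title comes out as the same slug that `urlt[23]` is set to by hand.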
b <- "May 15, 2016: Commencement Address at Rutgers University
video icon audio icon transcript icon
March 22, 2016: Remarks to the People of Cuba
video icon audio icon transcript icon
January 12, 2016: 2016 State of the Union Address
video icon audio icon transcript icon
June 26, 2015: Remarks in Eulogy for the Honorable Reverend Clementa Pickney
video icon audio icon transcript icon
March 7, 2015: Remarks at the 50th Anniversary of the Selma Marches
video icon audio icon transcript icon
January 20, 2015: 2015 State of the Union Address
video icon audio icon transcript icon
November 20, 2014: Address to the Nation on Immigration
video icon audio icon transcript icon
January 28, 2014: 2014 State of the Union Address
video icon audio icon transcript icon
December 4, 2013: Speech on Economic Mobility
video icon audio icon transcript icon
September 10, 2013: Address to the Nation on Syria
video icon audio icon transcript icon
July 24, 2013: Remarks on Education and the Economy
video icon audio icon transcript icon
July 19, 2013: Remarks on Trayvon Martin
video icon audio icon transcript icon
April 8, 2013: Speech on Gun Violence
video icon audio icon transcript icon
March 21, 2013: Address to the People of Israel
video icon audio icon transcript icon
March 1, 2013: Statement on the Government Sequester
video icon audio icon transcript icon
February 13, 2013: 2013 State of the Union Address
video icon audio icon transcript icon
January 29, 2013: Remarks on Immigration Reform
video icon audio icon transcript icon
January 21, 2013: Second Inaugural Address
video icon audio icon transcript icon
December 16, 2012: Remarks on Sandy Hook Elementary Shootings
video icon audio icon transcript icon
November 6, 2012: 2012 Election Night Victory Speech
video icon audio icon transcript icon
September 6, 2012: Nominee Acceptance Speech at 2012 Democratic National Convention
video icon audio icon transcript icon
January 24, 2012: 2012 State of the Union Address
video icon audio icon transcript icon
October 21, 2011: Remarks on the End of the War in Iraq
video icon audio icon transcript icon
September 8, 2011: Address to Congress on the American Jobs Act
video icon audio icon transcript icon
June 22, 2011: Remarks on the Afghanistan Pullout
video icon audio icon transcript icon
May 25, 2011: Address to the British Parliament
video icon audio icon transcript icon
May 19, 2011: Speech on American Diplomacy in the Middle East and North Africa
video icon audio icon transcript icon
May 1, 2011: Remarks on the Death of Osama Bin Laden
video icon audio icon transcript icon
January 25, 2011: 2011 State of the Union Address
video icon audio icon transcript icon
January 12, 2011: Remarks at Memorial for Victims of the Tucson, AZ Shooting
video icon audio icon transcript icon
November 3, 2010: Press Conference After 2010 Midterm Elections
video icon audio icon transcript icon
September 23, 2010: Address to the United Nations
video icon audio icon transcript icon
August 31, 2010: Address on the End of the Combat Mission in Iraq
video icon audio icon transcript icon
June 15, 2010: Speech on the BP Oil Spill
video icon audio icon transcript icon
April 28, 2010: Remarks on Wall Street Reform
video icon audio icon transcript icon
April 15, 2010: Remarks on Space Exploration in the 21st Century
video icon audio icon transcript icon
March 15, 2010: Speech on Health Care Reform
video icon audio icon transcript icon
February 9, 2010: News Conference on Congressional Gridlock
video icon audio icon transcript icon
January 27, 2010: 2010 State of the Union Address
video icon audio icon transcript icon
December 10, 2009: Acceptance of Nobel Peace Prize
video icon audio icon transcript icon
December 1, 2009: Speech on Strategy in Afghanistan and Pakistan
video icon audio icon transcript icon
September 9, 2009: Address to Congress on Health Care
video icon audio icon transcript icon
June 4, 2009: Address at Cairo University
video icon audio icon transcript icon
May 26, 2009: Remarks on Nominating Judge Sonia Sotomayor to the U.S. Supreme Court
audio icon transcript icon
February 24, 2009: Address Before a Joint Session of Congress
video icon audio icon transcript icon
February 7, 2009: Remarks on the American Recovery and Reinvestment Act
video icon audio icon transcript icon
January 29, 2009: Remarks on the Lilly Ledbetter Fair Pay Restoration Act Bill Signing
video icon audio icon transcript icon
January 20, 2009: Inaugural Address
video icon audio icon transcript icon
November 4, 2008: Remarks on Election Night
transcript icon
August 28, 2008: Acceptance Speech at the Democratic National Convention"
# same title clean-up as for the Trump speeches
b <- gsub(":","",b)
b <- gsub(",","",b)
b <- gsub(" at "," ",b)
b <- gsub(" the "," ",b)
b <- gsub(" The ", " ",b)
b <- gsub(" for "," ",b)
b <- gsub(" to "," ",b)
b <- gsub(" on "," ",b)
b <- gsub(" in "," ",b)
b <- gsub(" with "," ",b)
b <- gsub(" of "," ",b)
b <- gsub(" a "," ",b)
# drop trailing words that are missing from these three URL slugs
b <- gsub(" Pickney","",b)
b <- gsub(" National Convention","",b)
b <- gsub(" Africa", "",b)
b <- tolower(b)
b <- strsplit(b,split = "\n")
b <- b[[1]]
# odd lines are the titles; this keeps the first 42, the same number as the Trump set
b <- b[seq(1,84,by = 2)]
b <- gsub(" ","-",b)
#getting url to transcript for each speech
urlsb <- "https://millercenter.org/the-presidency/presidential-speeches/"
lnth_b <- seq(1,length(b))
# paste0 is vectorized, so the whole URL vector can be built in one call
urlb <- paste0(urlsb, b)
#Getting speech transcripts
obama <- c()
for (i in lnth_b) {
  obama[i] <- read_html(urlb[i]) %>%
    html_nodes('.transcript-inner') %>%
    html_text()
}
Dset <- c(trump,obama)
#trump ~ 1, and obama ~ 0
label <- c(rep(1,42),rep(0,42))
# cleaning text: drop line breaks and the "Transcript" heading
Dset <- Dset %>%
  gsub("[\r\n]", "", .) %>%
  gsub("Transcript", "", .)
corp <- Corpus(VectorSource(Dset)) %>%
  # normalization: collapse whitespace, drop punctuation and digits
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  # stopwords
  tm_map(removeWords, stopwords("english")) %>%
  # stemming
  tm_map(stemDocument)
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., stemDocument): transformation drops documents
# latent semantic indexing starts here: build the term-document matrix (terms x speeches)
tdm <- TermDocumentMatrix(corp)
# TF-IDF weights, which emphasize terms that are rare across speeches
tfidf <- weightTfIdf(tdm)
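To make the weighting concrete, here is a small base-R sketch of TF-IDF on a made-up 3 x 3 count matrix. The formula (term frequency divided by document length, times log2 of documents over document frequency) is my understanding of what `weightTfIdf` does with its default `normalize = TRUE`; the counts and term names are invented for illustration.

```r
# Toy term-document matrix (made-up counts): rows are terms, columns are documents.
counts <- matrix(c(2, 1, 3,    # "america": appears in every document
                   4, 0, 0,    # "iraq": appears only in doc1
                   0, 3, 1),   # "job": appears in doc2 and doc3
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("america", "iraq", "job"),
                                 c("doc1", "doc2", "doc3")))

tf  <- sweep(counts, 2, colSums(counts), "/")     # share of each document taken by each term
idf <- log2(ncol(counts) / rowSums(counts > 0))   # rarer terms get a larger value
tfidf_toy <- tf * idf                             # idf has one value per row, so it recycles by term
print(round(tfidf_toy, 3))
```

A term that appears in every document ("america" here) gets weight 0, which is why TF-IDF tends to surface the rarer words.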
# extract 20 concepts
lsa.tfidf <- lsa(tfidf, dims = 20)
# convert to data frame
DF <- as.data.frame(as.matrix(lsa.tfidf$dk))
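Under the hood, latent semantic analysis is a truncated singular value decomposition of the weighted term-document matrix. The sketch below uses base R's `svd()` on random stand-in data; `doc_concepts` plays the role that `lsa.tfidf$dk` plays above (one row per document, one column per concept), though signs and scaling may differ from what the `lsa` package returns.

```r
set.seed(1)
X <- matrix(runif(6 * 5), nrow = 6)   # stand-in for a 6-term x 5-document weighted matrix
k <- 2                                # number of concepts to keep

s <- svd(X)
doc_concepts <- s$v[, 1:k]            # documents x k: coordinates of each document in concept space
X_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # best rank-k approximation of X

print(dim(doc_concepts))
```

Each document is reduced from thousands of term weights to `k` numbers, and those are the predictors the logistic regression uses.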
nspeech <- length(Dset)
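# sample 60% of the speeches for training
# (the output shown below came from a run that sampled round(0.6*11) = 7 speeches,
#  which is why the validation set there has 77 cases)
train <- sample(c(1:nspeech),round(0.6*nspeech))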
#run logistic model on training
trainData = cbind(label = label[train], DF[train,])
reg <- glm(label ~., data = trainData, family = 'binomial')
#compute accuracy on validation set
valid = cbind(label = label[-train], DF[-train,])
pred <- predict(reg, newdata = valid, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
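The rank-deficient warning is what you would expect when the training set has fewer rows than the model has coefficients. The confusion matrix further down totals 77 validation speeches, so this run trained on only 7 of the 84, while the model has 21 coefficients (20 concepts plus an intercept). A minimal sketch with random stand-in data (not the speech data) shows the effect:

```r
set.seed(42)
n_train <- 7    # training rows, as implied by the 77 validation cases
n_pred  <- 20   # one predictor per LSA concept

toy <- as.data.frame(matrix(rnorm(n_train * n_pred), nrow = n_train))
toy$label <- c(1, 0, 1, 0, 1, 0, 1)

# suppressWarnings() hides the separation warnings this tiny fit produces
fit <- suppressWarnings(glm(label ~ ., data = toy, family = "binomial"))

print(fit$rank)                # cannot exceed the number of rows
print(sum(is.na(coef(fit))))   # coefficients glm could not estimate
```

With 7 rows, at most 7 coefficients can be estimated; the rest come back as `NA`, and predictions from such a fit should be read with caution.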
#produce confusion matrix
confusionMatrix(as.factor(ifelse(pred>0.5,1,0)),as.factor(label[-train]))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 26  6
##          1 14 31
##
## Accuracy : 0.7403
## 95% CI : (0.6277, 0.8336)
## No Information Rate : 0.5195
## P-Value [Acc > NIR] : 6.073e-05
##
## Kappa : 0.4839
##
## Mcnemar's Test P-Value : 0.1175
##
## Sensitivity : 0.6500
## Specificity : 0.8378
## Pos Pred Value : 0.8125
## Neg Pred Value : 0.6889
## Prevalence : 0.5195
## Detection Rate : 0.3377
## Detection Prevalence : 0.4156
## Balanced Accuracy : 0.7439
##
## 'Positive' Class : 0
##
Recall that we set Trump's speeches to be 1 and Obama's to be 0, so the 'Positive' class in the output above is Obama.
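The headline statistics can be recomputed by hand from the four counts in the confusion matrix, which is a handy check on which class each one refers to. This sketch just retypes the counts from the output above and treats class 0 as positive, as `confusionMatrix` reported.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(26, 14, 6, 31), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions over all cases
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # share of true class-0 speeches predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])    # share of true class-1 speeches predicted as 1

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4))
```

These come out to 0.7403, 0.6500 and 0.8378, matching the printed values.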
gain = gains(valid$label,pred)
#plotting cumulative gains chart
plot(c(0, gain$cume.pct.of.total * sum(valid$label))~ c(0,gain$cume.obs),
  xlab = "Case #",
  ylab = "Cumulative",
  type = "l"
)
lines(c(0,sum(valid$label))~ c(0,dim(valid)[1]),lty = 2)
# Decile-wise lift chart (a sort of bar chart)
heights = gain$mean.resp/mean(valid$label)
midpoints = barplot(heights,
names.arg = gain$depth,
xlab = "Percentile",
ylab = "Mean Response",
main = "Decile-wise Lift Chart",
ylim = c(0, 2)
)
# Add texts to the lift chart
text(midpoints, heights + 0.5, labels = round(heights, 1), cex = 0.8)
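For intuition about what the two charts show, the same quantities can be computed by hand: sort the cases by predicted probability, then count how many actual 1s have been captured so far. The labels and scores below are made up for illustration and are not output from the model above.

```r
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)                   # made-up true labels
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)   # made-up predicted probabilities

ord <- order(score, decreasing = TRUE)          # rank cases from most to least likely
cum_gain <- cumsum(actual[ord])                 # 1s captured after taking the top k cases
baseline <- seq_along(actual) * mean(actual)    # what a random ordering would capture on average

# lift of the top half: response rate among the top 4 cases vs. the overall rate
lift_top_half <- mean(actual[ord][1:4]) / mean(actual)
print(cum_gain)
print(lift_top_half)
```

The gains chart plots `cum_gain` against the baseline, and each bar in the lift chart is a ratio like `lift_top_half` computed for one slice of the ranking.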
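# total TF-IDF weight of each term, summed over all speeches
v = tfidf %>% as.matrix() %>% rowSums()
# print a summary and a small sample of the weighted matrix
inspect(tfidf)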
## <<TermDocumentMatrix (terms: 15174, documents: 84)>>
## Non-/sparse entries: 76994/1197622
## Sparsity : 94%
## Maximal term length: 26
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Docs
## Terms 11 12 13 23 27
## – 0.0010622486 0.000000000 0.000000000 0.025463028 0.0000000000
## — 0.0081220179 0.000000000 0.011316218 0.000000000 0.0005163110
## ’re 0.0000000000 0.000000000 0.005108688 0.000000000 0.0001522840
## ’s 0.0007985196 0.000000000 0.006607237 0.000000000 0.0000000000
## applaus 0.0011738186 0.000000000 0.000000000 0.000000000 0.0000000000
## insur 0.0000000000 0.000000000 0.000000000 0.000000000 0.0006077377
## iraq 0.0000000000 0.000000000 0.000000000 0.003605405 0.0010326221
## lot 0.0000000000 0.000000000 0.004167804 0.000000000 0.0000000000
## peac 0.0010178962 0.005777438 0.000000000 0.000000000 0.0005823626
## president 0.0000000000 0.000000000 0.019194906 0.000000000 0.0000000000
## Docs
## Terms 35 46 47 6 72
## – 0.0000000000 0.0000000000 0.0086471871 0.0000000000 0.0251215694
## — 0.0015949465 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## ’re 0.0000000000 0.0002717873 0.0010833855 0.0007476469 0.0003147422
## ’s 0.0000000000 0.0016307237 0.0019500940 0.0007476469 0.0028326796
## applaus 0.0000000000 0.0341594306 0.0119442714 0.0000000000 0.0222081073
## insur 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## iraq 0.0005316488 0.0000000000 0.0007346325 0.0000000000 0.0000000000
## lot 0.0000000000 0.0005196827 0.0000000000 0.0007147855 0.0000000000
## peac 0.0000000000 0.0000000000 0.0004143069 0.0000000000 0.0012036330
## president 0.0000000000 0.0015166159 0.0006045462 0.0020859939 0.0000000000
term = names(v)
occurrences = v %>% as.numeric()
# limit words by specifying minimum frequency
wordcloud(term,occurrences, min.freq=1, colors=brewer.pal(6,"Dark2"))
I didn't want to mess with the text too much, so we get some weird things like ’s, –, and peac (peac looks like it could be "impeach", but it is what the stemmer leaves of "peace", which would explain why it shows up in both Trump's and Obama's speeches).
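Tokens like ’s and – survive because they are curly quotes and dashes rather than plain ASCII punctuation. One possible fix, sketched here on a made-up string, is to strip those characters from the text before building the corpus. As far as I know, `removePunctuation` in tm also accepts `ucp = TRUE` for Unicode punctuation, but I haven't checked that against this data.

```r
# Made-up sample: \u2019 is a curly apostrophe and \u2013 an en dash.
x <- "It\u2019s a lot \u2013 really"

cleaned <- gsub("[\u2018\u2019]", "", x)                        # drop curly apostrophes
cleaned <- gsub("[\u201C\u201D\u2013\u2014]", " ", cleaned)     # curly double quotes and dashes become spaces
cleaned <- trimws(gsub("\\s+", " ", cleaned))                   # collapse leftover whitespace
print(cleaned)
```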