The dataset was retrieved from Kaggle (https://www.kaggle.com/adhok93/presidentialaddress/home) and contains all 58 inauguration speeches of presidents from the George Washington’s first address to present day.
inaug_speeches <- read.csv("/Users/ultrajosef/Documents/JosephOMalley_CodePortfolio_staging/R_Text_Analytics/InaugurationSpeehes/data/inaug_speeches.csv")##install needed packages
#install.packages(c("psych", "data.table", "stringr", "lubridate","tidyr", "dplyr", "tidytext",
##"ggplot2", "RWeka", "quanteda", "tm","wordcloud", "lsa", "ggcorrplot", "plotly", "rJava", "NLP",
##"openNLP","openNLPmodels.en", "reshape", "tibble","googleVis", "magrittr", "RColorBrewer"))##initial exploration
str(inaug_speeches)## 'data.frame': 58 obs. of 6 variables:
## $ X : int 4 5 6 7 8 9 10 11 12 13 ...
## $ Name : Factor w/ 39 levels "Abraham Lincoln",..: 13 13 23 32 32 20 20 21 21 25 ...
## $ Inaugural.Address: Factor w/ 5 levels "First Inaugural Address",..: 1 4 3 1 4 1 4 1 4 3 ...
## $ Date : Factor w/ 58 levels "20-Jan-97","Friday, January 20, 1961",..: 35 14 29 54 15 30 39 47 22 5 ...
## $ text : Factor w/ 58 levels " \xa0\xa0ABOUT to add the solemnity of an oath to the obligations imposed by a second call to the st"| __truncated__,..: 58 28 26 43 18 23 1 6 32 7 ...
## $ text2 : Factor w/ 2 levels ""," in some cases as the powers which they respectively claim are often not defined by any distinct lines. Mischi"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
inaug_speeches_dt <- as.data.table(inaug_speeches)
names(inaug_speeches)## [1] "X" "Name" "Inaugural.Address"
## [4] "Date" "text" "text2"
##count unique values in each column
apply(inaug_speeches[, c(1:ncol(inaug_speeches))], 2, function(x) length(unique(x)))## X Name Inaugural.Address Date
## 58 39 5 58
## text text2
## 58 2
##counts of each unique president
dt <- inaug_speeches_dt[, .(number_of_distinct_records = uniqueN(text)), by = Name]
##convert to kable tables
kable(dt, escape = F, "html", table.attr = "class='dtable'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")| Name | number_of_distinct_records |
|---|---|
| George Washington | 2 |
| John Adams | 1 |
| Thomas Jefferson | 2 |
| James Madison | 2 |
| James Monroe | 2 |
| John Quincy Adams | 1 |
| Andrew Jackson | 2 |
| Martin Van Buren | 1 |
| William Henry Harrison | 1 |
| James Knox Polk | 1 |
| Zachary Taylor | 1 |
| Franklin Pierce | 1 |
| James Buchanan | 1 |
| Abraham Lincoln | 2 |
| Ulysses S. Grant | 2 |
| Rutherford B. Hayes | 1 |
| James A. Garfield | 1 |
| Grover Cleveland | 2 |
| Benjamin Harrison | 1 |
| William McKinley | 2 |
| Theodore Roosevelt | 1 |
| William Howard Taft | 1 |
| Woodrow Wilson | 2 |
| Warren G. Harding | 1 |
| Calvin Coolidge | 1 |
| Herbert Hoover | 1 |
| Franklin D. Roosevelt | 4 |
| Harry S. Truman | 1 |
| Dwight D. Eisenhower | 2 |
| John F. Kennedy | 1 |
| Lyndon Baines Johnson | 1 |
| Richard Milhous Nixon | 2 |
| Jimmy Carter | 1 |
| Ronald Reagan | 2 |
| George Bush | 1 |
| Bill Clinton | 2 |
| George W. Bush | 2 |
| Barack Obama | 2 |
| Donald J. Trump | 1 |
##convert strings we will analyze to charachters
inaug_speeches_dt <- inaug_speeches_dt[, text:=as.character(text)]
inaug_speeches_dt <- inaug_speeches_dt[, text2:=as.character(text2)]
##combine text columns in r (only one that couldn't fit into one excel cell)
inaug_speeches_dt$text <- apply(inaug_speeches_dt[ , c("text", "text2") ] ,1 , paste , collapse = "" )
##create duplicate text column (for manipulation)
##trim whitespace and covert text to lowercase
inaug_speeches_dt$text_final <- trimws(inaug_speeches_dt$text)
##convert to regex and add additional text cleanin using stringr
##remove encoding errors using stringr (i.e. <U+AO97>) see <https://stackoverflow.com/questions/39993715/how-to-remove-unicode-u00a6-from-string>
inaug_speeches_dt$text_final <- gsub("\\s*<u\\+\\w+>\\s*", " ", inaug_speeches_dt$text_final)
inaug_speeches_dt$text_final <- gsub("\\s*<U\\+\\w+>\\s*", " ", inaug_speeches_dt$text_final)
##convert whitespace to single space
inaug_speeches_dt$text_final <- gsub("\\s", " ", inaug_speeches_dt$text_final)
##remove non-reg charachters
inaug_speeches_dt$text_final <- gsub("[^[A-Za-z0-9 ][:punct:]]", "", inaug_speeches_dt$text_final)##load require packages
require(stringr)
require(lubridate)
require(tidyr)
require(dplyr)##check unique
#unique(inaug_speeches_dt$Date)
##make copy of original date format for manipulation
inaug_speeches_dt$DateOriginal <- inaug_speeches_dt$Date
##edit missmatched date format (Clinton 1997)
inaug_speeches_dt$Date <- gsub("20-Jan-97", "Monday, January 20, 1997", inaug_speeches_dt$Date)
## split current date formats
inaug_speeches_dt <- separate(inaug_speeches_dt, "Date", c('DayOfWeek', 'MonthDay', 'Year'), sep = ",")
inaug_speeches_dt$MonthDay <- trimws(inaug_speeches_dt$MonthDay)
inaug_speeches_dt$Year <- trimws(inaug_speeches_dt$Year)
##replace NA Values in main date column with
#inaug_speeches_dt$MonthDay <- ifelse(is.na(inaug_speeches_dt$MonthDay), inaug_speeches_dt$DayOfWeek, inaug_speeches_dt$MonthDay)
##recode factors for First Inaugural address
inaug_speeches_dt$Inaugural.Address <- recode_factor(inaug_speeches_dt$Inaugural.Address, "Inaugural Address" = "First Inaugural Address")
##replace non-day of the week with "unknown"
inaug_speeches_dt$DayOfWeek <-
ifelse(inaug_speeches_dt$DayOfWeek == "Monday", "Monday",
ifelse(inaug_speeches_dt$DayOfWeek == "Tuesday", "Tuesday",
ifelse(inaug_speeches_dt$DayOfWeek == "Wednesday", "Wednesday",
ifelse(inaug_speeches_dt$DayOfWeek == "Thursday", "Thursday",
ifelse(inaug_speeches_dt$DayOfWeek == "Friday", "Friday",
ifelse(inaug_speeches_dt$DayOfWeek == "Saturday", "Saturday",
ifelse(inaug_speeches_dt$DayOfWeek == "Sunday", "Sunday", "Unknown")))))))
##check unique days of week
dt <- inaug_speeches_dt[, .(number_of_distinct = uniqueN(text_final)), by = DayOfWeek]
kable(dt, escape = F, "html", table.attr = "class='dtable'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")| DayOfWeek | number_of_distinct |
|---|---|
| Thursday | 9 |
| Monday | 18 |
| Saturday | 9 |
| Wednesday | 8 |
| Tuesday | 7 |
| Friday | 7 |
##concatenate DayMonth and Year
inaug_speeches_dt$Date <- apply(inaug_speeches_dt[, c('MonthDay', 'Year')], 1, paste, collapse = ",")
inaug_speeches_dt$Date <- gsub(",NA", "", inaug_speeches_dt$Date)
##check unique month/day
#unique(inaug_speeches_dt$MonthDay)
##multidate function in r
multidate <- function(data, formats){
data <- gsub("nd", "", data, perl = TRUE)
a<-list()
for(i in 1:length(formats)){
a[[i]]<- as.Date(data,format=formats[i])
a[[1]][!is.na(a[[i]])]<-a[[i]][!is.na(a[[i]])]
}
a[[1]]
}
###converts disparate date types (list) to common format
inaug_speeches_dt$Date <- multidate(inaug_speeches_dt$Date,
c("%m/%d/%Y","%d.%m.%Y","%d %b %y", "%d %b %y", "%B %d, %Y",
"%B %d %Y", "%d-%b-%y", "%m.%d.%Y", "%d %b, %Y", "%A, %m/%d/%y",
"%A, %B %d, %Y", "%d of %B %Y", "%m%d%y", "%d%m%Y", "%d-%m-%Y",
"%A %B %d %Y", "%d %B, %Y", "%d %B %Y", "%d %B %Y", "%B %d %Y",
"%B %d, %Y", "%A %B %d, %Y", "%h %d %Y", "%m/%d%y", "%d/%m/%Y",
"%d/%m/%Y", "%m%d%y", "%d %B %Y", "%B %d,%Y", "%b. %d,%Y",
"%b-%d-%Y", "%m/%d.%Y"))
###change format to m/d/Y
inaug_speeches_dt$Date <- format(inaug_speeches_dt$Date, "%m/%d/%Y")
#unique(inaug_speeches_dt$Date)##calculate number of charachters
inaug_speeches_dt$nchar <- nchar(inaug_speeches_dt$text_final)
##calculate number of words
nwords <- nwords <- function(string, pseudo=F){
ifelse( pseudo,
pattern <- "\\S+",
pattern <- "[[:alpha:]]+"
)
str_count(string, pattern)
}
##calculate number of words using nwords() function
inaug_speeches_dt$num_words <- nwords(inaug_speeches_dt$text_final)
##create new column for column labels for President
inaug_speeches_dt$PresidentNumber <- seq.int(nrow(inaug_speeches_dt))
inaug_speeches_dt$Speech <- apply(inaug_speeches_dt[, c('PresidentNumber', 'Name')], 1, paste, collapse = " ")## Visualizing speech length
require(tidytext)
require(dplyr)
require(ggplot2)##create barchart in ggplot
inaug_speeches_dt %>%
unnest_tokens(word,text_final) %>%
group_by(Speech) %>%
summarise(num_words=n()) %>%
mutate(mean_words=mean(num_words)) %>%
ggplot(aes(x=Speech,y=(num_words)))+geom_bar(stat = "identity",width=0.5,
aes(fill = inaug_speeches_dt$Inaugural.Address)) +
scale_fill_manual(values = c("red", "goldenrod", "blue", "light blue", "black")) +
theme(axis.text.x = element_text(vjust=1,angle=90)) + theme(legend.position="bottom") +
geom_text(aes(label=inaug_speeches_dt$Year), vjust=0,angle=90,size=2.5,hjust=0)+ylim(c(0,11500)) +
labs(title="Speech Length",
caption="United States: Inauguration Speeches")##checkaverage speech length by inaigural address number
inaug_groups <-inaug_speeches_dt %>%
group_by(Inaugural.Address) %>%
summarise_at(vars(num_words), funs(mean(., na.rm=TRUE))) %>%
arrange(desc(num_words)) %>%
mutate(num_words = round(num_words, 2))
inaug_groups <- as.data.table(inaug_groups)
##convert to kable tables
kable(inaug_groups, escape = F, "html", table.attr = "class='dtable'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")| Inaugural.Address | num_words |
|---|---|
| First Inaugural Address | 2701.69 |
| Second Inaugural Address | 1893.12 |
| Third Inaugural Address | 1427.00 |
| Fourth Inaugural Address | 596.00 |
## subset columns
inaug_speeches_clean <- subset(inaug_speeches_dt, select = c("PresidentNumber", "Name","Inaugural.Address",
"DayOfWeek", "Year", "Date", "DateOriginal",
"text_final","nchar", "num_words", "Speech"))
##################################### Export Cleaned Data ##########################################
#setwd('file/path')
#write.csv(inaug_speeches_clean, "inaug_speeches_clean.csv")require(RWeka)
require(quanteda)
require(tm)
require(data.table)Adjust N-Gram selction (one or more) in this section
##check structure of the data
#str(inaug_speeches_clean)
###create data table for conversion of free text to charachters
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)
###specify the text column to be used
text <- inaug_speeches_clean$text_final
###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords", "a0"))
##################### tokenize using quanteda ######################################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(2, 3), concatenator = "_")
###convert to data frame matrix
dfm_toksNgrams <- as.matrix(toksNgrams)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams,
pattern = c(paste0("^", stopwords("english"), "_"),
paste0("_", stopwords("english"), "$")),
valuetype = "regex")
##rename rows of matrix to presidents names
row.names(dfm_toksNgrams) <- inaug_speeches_clean$Speech
##frequency dist
term_frequency <- colSums(dfm_toksNgrams)
# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency <- as.data.frame(term_frequency)
# extract column names (words/ngrams) to new column in r
setDT(term_frequency, keep.rownames = TRUE)[]## rn term_frequency
## 1: united_states 158
## 2: let_us 101
## 3: american_people 41
## 4: federal_government 34
## 5: men_women 28
## ---
## 121852: reign_peace_made 1
## 121853: peace_made_permanent 1
## 121854: made_permanent_government 1
## 121855: permanent_government_liberty 1
## 121856: government_liberty_law 1
# alternative method of extract column names (words/ngrams) to new column in r
# term_frequency$names <- rownames(term_frequency)## write term frequency into new file (optional)
#setwd('local/file/path')
#write.csv(term_frequency, "inaug_term_frequency.csv")require(ggplot2)
library(wordcloud)##subset top 40 rows
term_frequency2 <- term_frequency %>%
arrange(desc(term_frequency)) %>%
head(40)
## Plot a barchart of the most frequent words/phrases
ggplot(term_frequency2, aes(x=reorder(rn, -term_frequency),y=(term_frequency))) +
geom_bar(stat = "identity",width=0.5,
aes(fill = term_frequency2$term_frequency)) +
theme(axis.text.x = element_text(vjust=1,angle=90)) + theme(legend.position="none") +
geom_text(aes(label=term_frequency), vjust=0,angle=90,size=2.5,hjust=0)+
labs(title="Most Common Phrases", caption="United States: Inauguration Speeches")
Adjust N-Gram selction (one or more) in this section
##check structure of the data
#str(inaug_speeches_clean)
###create data table for conversion of free text to charachters
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)
###specify the text column to be used
text <- inaug_speeches_clean$text_final
###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords", "a0"))
####################### tokenize using quanteda #####################################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")
###convert to data frame matrix
dfm_toksNgrams <- as.matrix(toksNgrams)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams,
pattern = c(paste0("^", stopwords("english"), "_"),
paste0("_", stopwords("english"), "$")),
valuetype = "regex")
##rename rows of matrix to presidents names
row.names(dfm_toksNgrams) <- inaug_speeches_clean$Speech
##frequency dist
term_frequency <- colSums(dfm_toksNgrams)
# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency <- as.data.frame(term_frequency)
# extract column names (words/ngrams) to new column in r
setDT(term_frequency, keep.rownames = TRUE)[]
# alternative method of extract column names (words/ngrams) to new column in r
# term_frequency$names <- rownames(term_frequency)##can compare 2 or more texts (president numbers)
dat <- dfm_toksNgrams[c(57, 58),]
##convert from quanteda dfm to tm DTM
dat <- convert(dat, to = "tm")
##convert from DTM to TDM
dat <- as.TermDocumentMatrix(dat)
dat <- as.matrix(dat)
##create clouds from wordcloud package
comparison.cloud(dat,max.words=80,random.order=FALSE,colors=c("#1F497D","#C0504D", "light blue"),
main="Differences Between Inauguration Speeches")commonality.cloud(dat,random.order=FALSE,max.words=50, color="#1F497D",main="Commonalities in Inauguration Speeches")
A measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. We can use this measure to gauge relative similarity of two texts.
library(lsa)
require(ggplot2)
require(RWeka)
require(quanteda)
require(tm)
require(data.table)########################## Create tokens for cosine similarity ##########################
##check structure of the data
#str(inaug_speeches_clean)
###create data table for conversion of free text to charachters
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)
###specify the text column to be used
text <- inaug_speeches_clean$text_final
###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords"))
################################# tokenize using quanteda ###############################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")
###convert to data frame matrix
dfm_toksNgrams <- as.matrix(toksNgrams)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school", "a0"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams,
pattern = c(paste0("^", stopwords("english"), "_"),
paste0("_", stopwords("english"), "$")),
valuetype = "regex")
########################### Transform Matrix ###########################################
## (see https://www.youtube.com/watch?v=7cwBhWYHgsA)
##can compare 2 or more texts
CosineSim_dfm <- dfm_toksNgrams[c(1:nrow(dfm_toksNgrams)),]
##convert from quanteda dfm to tm DTM
CosineSim_dfm <- convert(CosineSim_dfm, to = "tm")
##check dimensions of dfm
dim(CosineSim_dfm)
## get Cosine similarity in transposed Matrix format
CosineSim_Matrix <- cosine(t(as.matrix(CosineSim_dfm)))
##check dimensions and transpose
dim(CosineSim_Matrix)
##convert to data frame
CosineSim_Matrix <- as.data.frame(CosineSim_Matrix)
###################### join to primary dataset (optional) #########################################
##add cosine similarities to original text
## create unique ID to join Presdident Names to Cosine similarity text
CosineSim_Matrix$UniqueResp <- seq.int(nrow(CosineSim_Matrix))
inaug_speeches_clean$UniqueResp <- seq.int(nrow(inaug_speeches_clean))
##cartesian join
#inaug_speeches_wCosine <- merge(x = inaug_speeches_clean, y = CosineSim_Matrix, by = "UniqueResp", all.x = TRUE)require(ggcorrplot)######################### visualize cosine similarity matrix #########################
##Convert RowNames
##convert row and column names
CosineSim_Matrix <- as.data.frame(CosineSim_Matrix)
row.names(CosineSim_Matrix) <- inaug_speeches_clean$Speech
colnames(CosineSim_Matrix) <- t(inaug_speeches_clean$Speech)
##subset matrix of presidents
CosineSim_Matrix_1900 <- CosineSim_Matrix[c(30:32, 34:37, 41:42, 44:46, 48:49, 51, 52, 54, 56, 58),
c(30:32, 34:37, 41:42, 44:46, 48:49, 51, 52, 54, 56, 58)]
##correlation matrix plot
# ggcorrplot(CosineSim_Matrix, type = "lower", lab = TRUE, legend.title = "Cos Sim",
# title = "Cosine Similarity of Inauguration Speeches",
# show.diag = TRUE, outline.color = "black", lab_size = 2)
##correlation matrix plot for presidents 1st Inauguration speeches since 1900
ggcorrplot(CosineSim_Matrix_1900, type = "lower", lab = TRUE, legend.title = "Cos Sim",
title = "Cosine Similarity of Inauguration Speeches",
show.diag = TRUE, outline.color = "black", lab_size = 2)As the visualizations show, there seems to be similar language across the different speeches. Based on our phrase frequencies, we may start to form the idea that these speeches use similar language.
Lexical Diversity refers to “the range of different words used in a text, with a greater range indicating a higher diversity” see https://rdrr.io/github/kbenoit/quanteda/man/textstat_lexdiv.html
require(RWeka)
require(quanteda)
require(tm)##check structure of the data
#str(inaug_speeches_clean)
###create data table for conversion of free text to charachters
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)
###specify the text column to be used
text <- inaug_speeches_clean$text_final
###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords"))
###tokenize using quanteda
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")
###convert to data frame matrix
dfm_toksNgrams <- as.matrix(toksNgrams)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school", "a0"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams,
pattern = c(paste0("^", stopwords("english"), "_"),
paste0("_", stopwords("english"), "$")),
valuetype = "regex")
##calculate lexical diversity
dt <- textstat_lexdiv(dfm_toksNgrams, measure = c("TTR", "C", "R", "CTTR", "U", "S", "Maas"), log.base = 10)
row.names(dt) <- inaug_speeches_clean$Speech
kable(dt, escape = F, "html", table.attr = "class='dtable'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")| document | TTR | C | R | CTTR | U | S | Maas | lgV0 | lgeV0 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 George Washington | text1 | 0.7712121 | 0.9599843 | 19.812798 | 14.009764 | 70.46086 | 0.9606026 | 0.1191313 | 9.664986 | 22.25445 |
| 2 George Washington | text2 | 0.9354839 | 0.9838408 | 7.366007 | 5.208554 | 110.92051 | 0.9720826 | 0.0949498 | 9.849038 | 22.67825 |
| 3 John Adams | text3 | 0.6539179 | 0.9391205 | 21.410195 | 15.139294 | 49.77364 | 0.9433430 | 0.1417426 | 8.282356 | 19.07083 |
| 4 Thomas Jefferson | text4 | 0.7219512 | 0.9514409 | 20.673537 | 14.618398 | 60.00553 | 0.9534554 | 0.1290935 | 9.005973 | 20.73702 |
| 5 Thomas Jefferson | text5 | 0.6702756 | 0.9422172 | 21.364870 | 15.107245 | 52.03791 | 0.9459361 | 0.1386245 | 8.457093 | 19.47318 |
| 6 James Madison | text6 | 0.8251880 | 0.9693874 | 19.033064 | 13.458409 | 89.04555 | 0.9689960 | 0.1059727 | 10.762020 | 24.78047 |
| 7 James Madison | text7 | 0.8051002 | 0.9656333 | 18.864100 | 13.338933 | 79.71589 | 0.9652996 | 0.1120025 | 10.178290 | 23.43638 |
| 8 James Monroe | text8 | 0.5607595 | 0.9214597 | 22.289749 | 15.761233 | 40.72634 | 0.9296517 | 0.1566976 | 7.587212 | 17.47020 |
| 9 James Monroe | text9 | 0.5406343 | 0.9191822 | 24.286457 | 17.173119 | 40.89349 | 0.9295048 | 0.1563770 | 7.713494 | 17.76098 |
| 10 John Quincy Adams | text10 | 0.6336489 | 0.9368476 | 23.479237 | 16.602328 | 49.68414 | 0.9429508 | 0.1418702 | 8.404924 | 19.35305 |
| 11 Andrew Jackson | text11 | 0.7974217 | 0.9640515 | 18.581809 | 13.139323 | 76.07551 | 0.9636099 | 0.1146509 | 9.922223 | 22.84676 |
| 12 Andrew Jackson | text12 | 0.7081081 | 0.9453774 | 16.681921 | 11.795899 | 50.24098 | 0.9443588 | 0.1410818 | 7.958799 | 18.32581 |
| 13 Martin Van Buren | text13 | 0.6190960 | 0.9367861 | 27.471404 | 19.425216 | 52.11271 | 0.9452260 | 0.1385250 | 8.819611 | 20.30790 |
| 14 William Henry Harrison | text14 | 0.4451714 | 0.9018829 | 27.517932 | 19.458116 | 36.50921 | 0.9190651 | 0.1655003 | 7.478819 | 17.22062 |
| 15 James Knox Polk | text15 | 0.5083260 | 0.9124985 | 24.282879 | 17.170588 | 38.38009 | 0.9244132 | 0.1614162 | 7.491096 | 17.24889 |
| 16 Zachary Taylor | text16 | 0.7826087 | 0.9609117 | 18.000000 | 12.727922 | 69.67440 | 0.9602029 | 0.1198018 | 9.452599 | 21.76541 |
| 17 Franklin Pierce | text17 | 0.6320696 | 0.9378663 | 25.353793 | 17.927839 | 51.60738 | 0.9449468 | 0.1392016 | 8.666712 | 19.95584 |
| 18 James Buchanan | text18 | 0.5833940 | 0.9254103 | 21.617091 | 15.285592 | 42.06571 | 0.9322086 | 0.1541828 | 7.661970 | 17.64234 |
| 19 Abraham Lincoln | text19 | 0.5401679 | 0.9169910 | 22.061080 | 15.599539 | 38.81743 | 0.9259377 | 0.1605043 | 7.407035 | 17.05533 |
| 20 Abraham Lincoln | text20 | 0.7758112 | 0.9564288 | 14.284199 | 10.100454 | 58.07040 | 0.9520101 | 0.1312268 | 8.288490 | 19.08495 |
| 21 Ulysses S. Grant | text21 | 0.7025090 | 0.9441687 | 16.594683 | 11.734213 | 49.19525 | 0.9431396 | 0.1425734 | 7.871266 | 18.12426 |
| 22 Ulysses S. Grant | text22 | 0.7046549 | 0.9455988 | 17.588164 | 12.436710 | 51.36812 | 0.9455681 | 0.1395254 | 8.122285 | 18.70225 |
| 23 Rutherford B. Hayes | text23 | 0.5948419 | 0.9267515 | 20.623093 | 14.582729 | 42.04733 | 0.9323763 | 0.1542165 | 7.597807 | 17.49460 |
| 24 James A. Garfield | text24 | 0.6023272 | 0.9304291 | 23.022769 | 16.279556 | 45.48814 | 0.9374074 | 0.1482692 | 8.034674 | 18.50052 |
| 25 Grover Cleveland | text25 | 0.7024691 | 0.9472671 | 19.992622 | 14.136919 | 55.15507 | 0.9492577 | 0.1346503 | 8.597770 | 19.79710 |
| 26 Benjamin Harrison | text26 | 0.5617081 | 0.9247460 | 25.929995 | 18.335275 | 44.23130 | 0.9349412 | 0.1503610 | 8.087798 | 18.62284 |
| 27 Grover Cleveland | text27 | 0.7162837 | 0.9517020 | 22.662203 | 16.024597 | 62.12338 | 0.9549461 | 0.1268740 | 9.300661 | 21.41556 |
| 28 William McKinley | text28 | 0.5554425 | 0.9224672 | 24.628099 | 17.414695 | 42.47985 | 0.9322944 | 0.1534294 | 7.869503 | 18.12020 |
| 29 William McKinley | text29 | 0.6593205 | 0.9404341 | 21.757576 | 15.384929 | 50.98598 | 0.9447161 | 0.1400473 | 8.400955 | 19.34391 |
| 30 Theodore Roosevelt | text30 | 0.6873614 | 0.9386571 | 14.597330 | 10.321871 | 43.26788 | 0.9351472 | 0.1520258 | 7.224440 | 16.63489 |
| 31 William Howard Taft | text31 | 0.4863933 | 0.9083814 | 24.844175 | 17.567485 | 37.29019 | 0.9217889 | 0.1637581 | 7.422019 | 17.08983 |
| 32 Woodrow Wilson | text32 | 0.6724138 | 0.9407596 | 19.160844 | 13.548763 | 49.11435 | 0.9428205 | 0.1426907 | 8.072543 | 18.58772 |
| 33 Woodrow Wilson | text33 | 0.6484490 | 0.9335389 | 16.872141 | 11.930405 | 42.59013 | 0.9339032 | 0.1532306 | 7.371380 | 16.97323 |
| 34 Warren G. Harding | text34 | 0.5912240 | 0.9295215 | 24.605150 | 17.398468 | 45.95088 | 0.9378063 | 0.1475207 | 8.163130 | 18.79630 |
| 35 Calvin Coolidge | text35 | 0.5578947 | 0.9227000 | 24.318068 | 17.195471 | 42.41594 | 0.9322496 | 0.1535449 | 7.847362 | 18.06922 |
| 36 Herbert Hoover | text36 | 0.5204646 | 0.9129286 | 22.130458 | 15.648597 | 37.40836 | 0.9228554 | 0.1634992 | 7.286079 | 16.77682 |
| 37 Franklin D. Roosevelt | text37 | 0.6677350 | 0.9409695 | 20.428764 | 14.445318 | 50.33461 | 0.9441276 | 0.1409505 | 8.259828 | 19.01896 |
| 38 Franklin D. Roosevelt | text38 | 0.6380236 | 0.9342652 | 19.467564 | 13.765646 | 45.16559 | 0.9375166 | 0.1487977 | 7.778886 | 17.91155 |
| 39 Franklin D. Roosevelt | text39 | 0.6779388 | 0.9395618 | 16.894148 | 11.945967 | 46.21399 | 0.9393061 | 0.1471002 | 7.664836 | 17.64894 |
| 40 Franklin D. Roosevelt | text40 | 0.7680608 | 0.9526420 | 12.455854 | 8.807619 | 51.09915 | 0.9451019 | 0.1398921 | 7.581042 | 17.45600 |
| 41 Harry S. Truman | text41 | 0.5627669 | 0.9186355 | 19.257813 | 13.617330 | 37.71372 | 0.9243085 | 0.1628359 | 7.134504 | 16.42780 |
| 42 Dwight D. Eisenhower | text42 | 0.6256117 | 0.9340470 | 21.905347 | 15.489419 | 46.82868 | 0.9394967 | 0.1461316 | 8.077265 | 18.59859 |
| 43 Dwight D. Eisenhower | text43 | 0.6233293 | 0.9295869 | 17.882056 | 12.644523 | 41.40420 | 0.9317621 | 0.1554096 | 7.352399 | 16.92952 |
| 44 John F. Kennedy | text44 | 0.6458037 | 0.9332973 | 17.122934 | 12.107743 | 42.68129 | 0.9340201 | 0.1530669 | 7.399115 | 17.03709 |
| 45 Lyndon Baines Johnson | text45 | 0.6237624 | 0.9280622 | 16.585512 | 11.727728 | 39.60951 | 0.9287027 | 0.1588913 | 7.100592 | 16.34972 |
| 46 Richard Milhous Nixon | text46 | 0.5952153 | 0.9253669 | 19.241204 | 13.605586 | 40.45280 | 0.9298032 | 0.1572265 | 7.370067 | 16.97021 |
| 47 Richard Milhous Nixon | text47 | 0.4850575 | 0.8931095 | 14.307140 | 10.116676 | 27.50027 | 0.8951574 | 0.1906916 | 5.836110 | 13.43814 |
| 48 Jimmy Carter | text48 | 0.6889632 | 0.9417280 | 16.847933 | 11.913288 | 47.65068 | 0.9412113 | 0.1448657 | 7.773747 | 17.89971 |
| 49 Ronald Reagan | text49 | 0.6269036 | 0.9339979 | 21.553086 | 15.240334 | 46.55331 | 0.9391721 | 0.1465631 | 8.032439 | 18.49537 |
| 50 Ronald Reagan | text50 | 0.5798319 | 0.9240608 | 20.978380 | 14.833955 | 41.04521 | 0.9305297 | 0.1560877 | 7.535057 | 17.35011 |
| 51 George Bush | text51 | 0.5766816 | 0.9215483 | 19.256331 | 13.616282 | 38.84269 | 0.9266769 | 0.1604521 | 7.232741 | 16.65400 |
| 52 Bill Clinton | text52 | 0.6428571 | 0.9338782 | 18.160003 | 12.841061 | 43.88878 | 0.9357902 | 0.1509466 | 7.578816 | 17.45087 |
| 53 Bill Clinton | text53 | 0.5384615 | 0.9123765 | 18.418218 | 13.023647 | 35.01554 | 0.9182020 | 0.1689933 | 6.838465 | 15.74615 |
| 54 George W. Bush | text54 | 0.6159509 | 0.9277078 | 17.584293 | 12.433973 | 40.26934 | 0.9297755 | 0.1575842 | 7.234536 | 16.65814 |
| 55 George W. Bush | text55 | 0.5973905 | 0.9261725 | 19.568528 | 13.837039 | 41.04972 | 0.9308283 | 0.1560791 | 7.443269 | 17.13876 |
| 56 Barack Obama | text56 | 0.6619601 | 0.9418403 | 22.969158 | 16.241647 | 52.96845 | 0.9467445 | 0.1374015 | 8.633734 | 19.87991 |
| 57 Barack Obama | text57 | 0.6388368 | 0.9357247 | 20.857800 | 14.748692 | 47.10605 | 0.9400319 | 0.1457007 | 8.032027 | 18.49443 |
| 58 Donald J. Trump | text58 | 0.5948387 | 0.9219186 | 16.559609 | 11.709412 | 37.00370 | 0.9233768 | 0.1643908 | 6.876131 | 15.83288 |
##calculate create new column with chosen lexical diversity
LexDiv_r <- textstat_lexdiv(dfm_toksNgrams, measure = "CTTR", log.base = 10)
## create ID column to join on
LexDiv_r$ID <- 1:nrow(LexDiv_r)
inaug_speeches_clean$ID <- 1:nrow(inaug_speeches_clean)
## join lexical diversity df to full speech df
inaug_speeches_clean <- merge(inaug_speeches_clean, LexDiv_r, by = "ID")require(ggplot2)
require(plotly)##convert strings we will visualize to numeric
inaug_speeches_clean <- inaug_speeches_clean[, Year:=as.numeric(Year)]
## look at number of Inaugurations by Address Number
dt <- inaug_speeches_clean[, .(number_of_distinct = uniqueN(PresidentNumber)), by = Inaugural.Address]
kable(dt, escape = F, "html", table.attr = "class='dtable'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")| Inaugural.Address | number_of_distinct |
|---|---|
| First Inaugural Address | 39 |
| Second Inaugural Address | 17 |
| Third Inaugural Address | 1 |
| Fourth Inaugural Address | 1 |
###Time Series Plot
##subset
first_inaug <-
inaug_speeches_clean %>%
filter(Inaugural.Address == "First Inaugural Address")
##create plot
LexDiv_plot <- ggplot(data=first_inaug, aes(x=Year, y=CTTR)) +
geom_line(colour = "blue", linejoin = "mitre") +
geom_point(colour = "Black") +
theme(axis.text.x = element_text(vjust=1,angle=90)) +
geom_text(aes(label=Name), vjust=0,angle=90,size=2.5,hjust=0) +
scale_x_continuous(breaks = seq(1789,2017,4))
##show plot
LexDiv_plot##convert and show in plotly
LexDiv_plotly <- ggplotly(LexDiv_plot)
LexDiv_plotly##################### scatter plot ##########
##scatter plot of number of words vs Lexical Diversity Measure
scat <- ggplot(inaug_speeches_clean, aes(x = num_words, y = CTTR)) +
geom_point(aes(color = Year))
##convert to plotly and visualize
scat <- ggplotly(scat)
scatWe find that there have been 17 presidents that have been elected for a second term. We also see thaere does not appear to be a trend in the variety of language used over the years (based on CTTR measure of Lexical Doversity). Upon further examination, we see that the Lexical Diversity measure we used may be biased towards longer responses.
This analysis looks to give several methods for which one can efficiently analyze large bodies of free-text. From there, it gives several approaches for quantifying the text in different ways. This is intended for exploratory use and there are several more popular approaches that could be useful here including: sentiment analysis, Named-entity, and different unsupervised word embedding approaches (i.e. - Word2Vec). I hope this analysis can help give you ideas for approaching this in your work.