The dataset was retrieved from Kaggle (https://www.kaggle.com/adhok93/presidentialaddress/home) and contains all 58 presidential inauguration speeches, from George Washington's first address to the present day.


Donald Trump (above) proving his eyes were stronger than the sun.




Import Data

inaug_speeches <- read.csv("/Users/ultrajosef/Documents/JosephOMalley_CodePortfolio_staging/R_Text_Analytics/InaugurationSpeehes/data/inaug_speeches.csv")




Install Necessary Packages

##install needed packages (one-time; knitr and kableExtra are used below for the tables)
#install.packages(c("psych", "data.table", "stringr", "lubridate", "tidyr", "dplyr", "tidytext",
#    "ggplot2", "RWeka", "quanteda", "tm", "wordcloud", "lsa", "ggcorrplot", "plotly", "rJava", "NLP",
#    "openNLP", "openNLPmodels.en", "reshape", "tibble", "googleVis", "magrittr", "RColorBrewer",
#    "knitr", "kableExtra"))

1. Exploratory Data Analysis (EDA)

##initial exploration
require(data.table)   ##needed below for as.data.table() and uniqueN()
str(inaug_speeches)
## 'data.frame':    58 obs. of  6 variables:
##  $ X                : int  4 5 6 7 8 9 10 11 12 13 ...
##  $ Name             : Factor w/ 39 levels "Abraham Lincoln",..: 13 13 23 32 32 20 20 21 21 25 ...
##  $ Inaugural.Address: Factor w/ 5 levels "First Inaugural Address",..: 1 4 3 1 4 1 4 1 4 3 ...
##  $ Date             : Factor w/ 58 levels "20-Jan-97","Friday, January 20, 1961",..: 35 14 29 54 15 30 39 47 22 5 ...
##  $ text             : Factor w/ 58 levels "            \xa0\xa0ABOUT to add the solemnity of an oath to the obligations imposed by a second call to the st"| __truncated__,..: 58 28 26 43 18 23 1 6 32 7 ...
##  $ text2            : Factor w/ 2 levels ""," in some cases  as the powers which they respectively claim are often not defined by any distinct lines. Mischi"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
inaug_speeches_dt <- as.data.table(inaug_speeches)
names(inaug_speeches)
## [1] "X"                 "Name"              "Inaugural.Address"
## [4] "Date"              "text"              "text2"
##count unique values in each column
apply(inaug_speeches[, c(1:ncol(inaug_speeches))], 2, function(x) length(unique(x)))
##                 X              Name Inaugural.Address              Date 
##                58                39                 5                58 
##              text             text2 
##                58                 2
##count distinct speeches per president
dt <- inaug_speeches_dt[, .(number_of_distinct_records = uniqueN(text)), by = Name]

##convert to kable tables (kable() is from knitr, kable_styling() from kableExtra)
require(knitr)
require(kableExtra)
kable(dt, escape = F, "html", table.attr = "class='dtable'")  %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Name number_of_distinct_records
George Washington 2
John Adams 1
Thomas Jefferson 2
James Madison 2
James Monroe 2
John Quincy Adams 1
Andrew Jackson 2
Martin Van Buren 1
William Henry Harrison 1
James Knox Polk 1
Zachary Taylor 1
Franklin Pierce 1
James Buchanan 1
Abraham Lincoln 2
Ulysses S. Grant 2
Rutherford B. Hayes 1
James A. Garfield 1
Grover Cleveland 2
Benjamin Harrison 1
William McKinley 2
Theodore Roosevelt 1
William Howard Taft 1
Woodrow Wilson 2
Warren G. Harding 1
Calvin Coolidge 1
Herbert Hoover 1
Franklin D. Roosevelt 4
Harry S. Truman 1
Dwight D. Eisenhower 2
John F. Kennedy 1
Lyndon Baines Johnson 1
Richard Milhous Nixon 2
Jimmy Carter 1
Ronald Reagan 2
George Bush 1
Bill Clinton 2
George W. Bush 2
Barack Obama 2
Donald J. Trump 1

2. Text Cleaning/Manipulation

https://stackoverflow.com/questions/39993715/how-to-remove-unicode-u00a6-from-string
##convert the text columns we will analyze to character
inaug_speeches_dt <- inaug_speeches_dt[, text:=as.character(text)]
inaug_speeches_dt <- inaug_speeches_dt[, text2:=as.character(text2)]
##combine the two text columns (one speech was too long to fit in a single Excel cell)
inaug_speeches_dt$text <- apply(inaug_speeches_dt[ , c("text", "text2") ] ,1 , paste , collapse = "" )


##create duplicate text column (for manipulation)
##trim leading and trailing whitespace
inaug_speeches_dt$text_final <- trimws(inaug_speeches_dt$text)


##additional text cleaning with regular expressions via gsub
##remove encoding artifacts (e.g. <U+00A0>); see https://stackoverflow.com/questions/39993715/how-to-remove-unicode-u00a6-from-string
inaug_speeches_dt$text_final <- gsub("\\s*<u\\+\\w+>\\s*", " ", inaug_speeches_dt$text_final)
inaug_speeches_dt$text_final <- gsub("\\s*<U\\+\\w+>\\s*", " ", inaug_speeches_dt$text_final)
##collapse runs of whitespace to a single space
inaug_speeches_dt$text_final <- gsub("\\s+", " ", inaug_speeches_dt$text_final)
##remove anything that is not a letter, digit, space, or punctuation
inaug_speeches_dt$text_final <- gsub("[^A-Za-z0-9 [:punct:]]", "", inaug_speeches_dt$text_final)
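As a quick sanity check, the same patterns can be exercised on a made-up string (illustrative only, not from the dataset):

##illustrative check of the cleaning patterns on a toy string
x <- "  WE the people<U+00A0>of the   United\tStates!  "
x <- trimws(x)
x <- gsub("\\s*<U\\+\\w+>\\s*", " ", x)
x <- gsub("\\s+", " ", x)
gsub("[^A-Za-z0-9 [:punct:]]", "", x)
## [1] "WE the people of the United States!"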




3. Date Formatting

Date conversion reference (lubridate cheat sheet): https://www.rstudio.com/resources/cheatsheets/
Date format conventions: https://www.statmethods.net/input/dates.html
##load require packages
require(stringr)
require(lubridate)
require(tidyr)
require(dplyr)
##check unique date values
#unique(inaug_speeches_dt$Date)

##make copy of original date format for manipulation
inaug_speeches_dt$DateOriginal <- inaug_speeches_dt$Date
##fix mismatched date format (Clinton 1997)
inaug_speeches_dt$Date <- gsub("20-Jan-97", "Monday, January 20, 1997", inaug_speeches_dt$Date)

## split current date formats
inaug_speeches_dt <- separate(inaug_speeches_dt, "Date", c('DayOfWeek', 'MonthDay', 'Year'), sep = ",") 

inaug_speeches_dt$MonthDay <- trimws(inaug_speeches_dt$MonthDay)
inaug_speeches_dt$Year <- trimws(inaug_speeches_dt$Year)

##replace NA values in MonthDay with the DayOfWeek value (for dates that lacked a weekday)
#inaug_speeches_dt$MonthDay <- ifelse(is.na(inaug_speeches_dt$MonthDay), inaug_speeches_dt$DayOfWeek, inaug_speeches_dt$MonthDay)

##recode the "Inaugural Address" factor level to "First Inaugural Address"
inaug_speeches_dt$Inaugural.Address <- recode_factor(inaug_speeches_dt$Inaugural.Address, "Inaugural Address" = "First Inaugural Address")

                
##replace any value that is not a day of the week with "Unknown"
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
inaug_speeches_dt$DayOfWeek <- 
  ifelse(inaug_speeches_dt$DayOfWeek %in% days_of_week, inaug_speeches_dt$DayOfWeek, "Unknown")
##check unique days of week
dt <- inaug_speeches_dt[, .(number_of_distinct = uniqueN(text_final)), by = DayOfWeek] 

kable(dt, escape = F, "html", table.attr = "class='dtable'")  %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
DayOfWeek number_of_distinct
Thursday 9
Monday 18
Saturday 9
Wednesday 8
Tuesday 7
Friday 7
##concatenate MonthDay and Year
inaug_speeches_dt$Date <- apply(inaug_speeches_dt[, c('MonthDay', 'Year')], 1, paste, collapse = ",")
inaug_speeches_dt$Date <- gsub(",NA", "", inaug_speeches_dt$Date)

##check unique month/day
#unique(inaug_speeches_dt$MonthDay)

##helper: try multiple date formats, keeping the first successful parse for each element
multidate <- function(data, formats){
  ##drop "nd" so ordinals like "2nd" parse (note: this removes every "nd" in the string)
  data <- gsub("nd", "", data, perl = TRUE)
  a <- list()
  for(i in 1:length(formats)){
    a[[i]] <- as.Date(data, format = formats[i])
    ##fill in any dates that earlier formats failed to parse
    a[[1]][!is.na(a[[i]])] <- a[[i]][!is.na(a[[i]])]
  }
  a[[1]]
}

###convert disparate date formats (list) to a common Date class
inaug_speeches_dt$Date <- multidate(inaug_speeches_dt$Date, 
                                     c("%m/%d/%Y", "%d.%m.%Y", "%d %b %y", "%B %d, %Y",
                                       "%B %d %Y", "%d-%b-%y", "%m.%d.%Y", "%d %b, %Y", "%A, %m/%d/%y",
                                       "%A, %B %d, %Y", "%d of %B %Y", "%m%d%y", "%d%m%Y", "%d-%m-%Y",
                                       "%A %B %d %Y", "%d %B, %Y", "%d %B %Y", "%A %B %d, %Y",
                                       "%h %d %Y", "%m/%d%y", "%d/%m/%Y", "%B %d,%Y", "%b. %d,%Y",
                                       "%b-%d-%Y", "%m/%d.%Y"))

###change format to m/d/Y
inaug_speeches_dt$Date <- format(inaug_speeches_dt$Date, "%m/%d/%Y")
#unique(inaug_speeches_dt$Date)
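To see how multidate() behaves, here is a quick illustration on made-up inputs (assumes an English locale for month names):

##illustrative call on toy inputs (not from the dataset)
multidate(c("January 20,1997", "4-Mar-01"), c("%B %d,%Y", "%d-%b-%y"))
## [1] "1997-01-20" "2001-03-04"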




4. Number of Characters and Number of Words

##calculate number of characters
inaug_speeches_dt$nchar <- nchar(inaug_speeches_dt$text_final)

##function to count words: runs of letters by default; pseudo = TRUE counts any non-whitespace chunk
nwords <- function(string, pseudo = FALSE){
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}
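A quick illustration of the difference between the two counting modes on a toy string:

##toy example: hyphenated and numeric tokens show the difference
nwords("a well-known 2-part plan")                 ## 5 ("well" and "known" count separately)
nwords("a well-known 2-part plan", pseudo = TRUE)  ## 4 (whitespace-delimited chunks)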

##calculate number of words using nwords() function
inaug_speeches_dt$num_words <- nwords(inaug_speeches_dt$text_final)


##create a label column combining the speech number (1-58) and the president's name
inaug_speeches_dt$PresidentNumber <- seq.int(nrow(inaug_speeches_dt))
inaug_speeches_dt$Speech <- apply(inaug_speeches_dt[, c('PresidentNumber', 'Name')], 1, paste, collapse = " ")
## Visualizing speech length

require(tidytext)
require(dplyr)
require(ggplot2)
##create bar chart in ggplot (Inaugural.Address and Year are carried through the grouping
##rather than referenced from the data.table inside aes(), which would rely on row order)

inaug_speeches_dt %>%
  unnest_tokens(word, text_final) %>%
  group_by(Speech, Inaugural.Address, Year) %>%
  summarise(num_words = n()) %>%
  ggplot(aes(x = Speech, y = num_words)) +
  geom_bar(stat = "identity", width = 0.5, aes(fill = Inaugural.Address)) +
  scale_fill_manual(values = c("red", "goldenrod", "blue", "light blue", "black")) +
  theme(axis.text.x = element_text(vjust = 1, angle = 90)) + theme(legend.position = "bottom") +
  geom_text(aes(label = Year), vjust = 0, angle = 90, size = 2.5, hjust = 0) + ylim(c(0, 11500)) +
  labs(title = "Speech Length",
       caption = "United States: Inauguration Speeches")

##check average speech length by inaugural address number
inaug_groups <-inaug_speeches_dt %>%
  group_by(Inaugural.Address) %>%
  summarise_at(vars(num_words), funs(mean(., na.rm=TRUE))) %>% 
  arrange(desc(num_words)) %>% 
  mutate(num_words = round(num_words, 2))


inaug_groups <- as.data.table(inaug_groups)

##convert to kable tables
kable(inaug_groups, escape = F, "html", table.attr = "class='dtable'")  %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Inaugural.Address num_words
First Inaugural Address 2701.69
Second Inaugural Address 1893.12
Third Inaugural Address 1427.00
Fourth Inaugural Address 596.00
## subset columns
inaug_speeches_clean <- subset(inaug_speeches_dt, select = c("PresidentNumber", "Name","Inaugural.Address", 
                                                             "DayOfWeek", "Year", "Date", "DateOriginal", 
                                                             "text_final","nchar", "num_words", "Speech"))

##################################### Export Cleaned Data ##########################################
#setwd('file/path')
#write.csv(inaug_speeches_clean, "inaug_speeches_clean.csv")


5. N-Gram Frequency

require(RWeka)
require(quanteda)
require(tm)
require(data.table)

Adjust the n-gram selection (one or more) in this section.

##check structure of the data
#str(inaug_speeches_clean)

###ensure the speeches are in a data.table
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)

###specify the text column to be used 
text <- inaug_speeches_clean$text_final

###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords", "a0"))

##################### tokenize using quanteda ######################################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(2, 3), concatenator = "_")


###convert tokens to a document-feature matrix (dfm)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, 
                             pattern = c(paste0("^", stopwords("english"), "_"), 
                                         paste0("_", stopwords("english"), "$")), 
                             valuetype = "regex")
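For intuition, here is what the tokenize/clean/n-gram steps produce on a toy sentence (illustrative only):

##illustrative: the same pipeline on a toy sentence
toy <- tokens("We the People of the United States", remove_punct = TRUE)
toy <- tokens_remove(tokens_tolower(toy), stopwords("english"))
tokens_ngrams(toy, n = 2, concatenator = "_")
## bigrams: "people_united" "united_states"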

##rename the dfm rows to the president/speech labels
row.names(dfm_toksNgrams) <- inaug_speeches_clean$Speech

##frequency dist
term_frequency <- colSums(dfm_toksNgrams)

# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency <- as.data.frame(term_frequency)

# move row names (words/ngrams) into their own column
setDT(term_frequency, keep.rownames = TRUE)[]
##                                   rn term_frequency
##      1:                united_states            158
##      2:                       let_us            101
##      3:              american_people             41
##      4:           federal_government             34
##      5:                    men_women             28
##     ---                                            
## 121852:             reign_peace_made              1
## 121853:         peace_made_permanent              1
## 121854:    made_permanent_government              1
## 121855: permanent_government_liberty              1
## 121856:       government_liberty_law              1
# alternative method of moving row names (words/ngrams) into a column
# term_frequency$names <- rownames(term_frequency)
## write term frequency into new file (optional)

#setwd('local/file/path')
#write.csv(term_frequency, "inaug_term_frequency.csv")

Frequency Visualizations

require(ggplot2)
library(wordcloud)
##subset top 40 rows
term_frequency2 <- term_frequency %>% 
  arrange(desc(term_frequency)) %>%
  head(40)

## Plot a bar chart of the most frequent words/phrases 
ggplot(term_frequency2, aes(x = reorder(rn, -term_frequency), y = term_frequency)) + 
            geom_bar(stat = "identity", width = 0.5, 
            aes(fill = term_frequency)) +
            theme(axis.text.x = element_text(vjust = 1, angle = 90)) + theme(legend.position = "none") +
            geom_text(aes(label = term_frequency), vjust = 0, angle = 90, size = 2.5, hjust = 0) +
            labs(title = "Most Common Phrases", caption = "United States: Inauguration Speeches")




6. Comparison Cloud


These comparison/commonality clouds compare the language of the 57th (Obama's second) and 58th (Trump's) speeches in the dataset. The visualizations can compare/contrast more than two texts, however.

Adjust the n-gram selection (one or more) in this section.

##check structure of the data
#str(inaug_speeches_clean)

###ensure the speeches are in a data.table
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)

###specify the text column to be used 
text <- inaug_speeches_clean$text_final

###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords", "a0"))

####################### tokenize using quanteda #####################################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")


###convert tokens to a document-feature matrix (dfm)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, 
                             pattern = c(paste0("^", stopwords("english"), "_"), 
                                         paste0("_", stopwords("english"), "$")), 
                             valuetype = "regex")

##rename the dfm rows to the president/speech labels
row.names(dfm_toksNgrams) <- inaug_speeches_clean$Speech

##frequency dist
term_frequency <- colSums(dfm_toksNgrams)

# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency <- as.data.frame(term_frequency)

# move row names (words/ngrams) into their own column
setDT(term_frequency, keep.rownames = TRUE)[]
# alternative method of moving row names (words/ngrams) into a column
# term_frequency$names <- rownames(term_frequency)
Convert from quanteda to tm: https://rdrr.io/cran/quanteda/man/convert.html
##select the documents to compare (rows 57 and 58; two or more can be compared)
dat <- dfm_toksNgrams[c(57, 58),]
##convert from quanteda dfm to tm DTM
dat <- convert(dat, to = "tm")
##convert from DTM to TDM
dat <- as.TermDocumentMatrix(dat)
dat <- as.matrix(dat)

##create clouds from wordcloud package
comparison.cloud(dat,max.words=80,random.order=FALSE,colors=c("#1F497D","#C0504D", "light blue"),
                                                  main="Differences Between Inauguration Speeches")

commonality.cloud(dat,random.order=FALSE,max.words=50, color="#1F497D",main="Commonalities in Inauguration Speeches")




7. Cosine Similarity


Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, equal to the cosine of the angle between them. We can use it to gauge the relative similarity of two texts.

Cosine similarity: https://www.youtube.com/watch?v=7cwBhWYHgsA
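To make the measure concrete, here is a minimal sketch on two toy term-count vectors (the words and counts are illustrative):

##toy term-count vectors for two short "documents"
doc_a <- c(america = 3, people = 2, freedom = 1, war = 0)
doc_b <- c(america = 2, people = 1, freedom = 0, war = 4)
##cosine similarity = dot product / product of the vector norms
sum(doc_a * doc_b) / (sqrt(sum(doc_a^2)) * sqrt(sum(doc_b^2)))
## [1] 0.4665694  (identical vectors give 1; vectors with no shared terms give 0)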
library(lsa)
require(ggplot2)
require(RWeka)
require(quanteda)
require(tm)
require(data.table)
########################## Create tokens for cosine similarity ##########################

##check structure of the data
#str(inaug_speeches_clean)

###ensure the speeches are in a data.table
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)

###specify the text column to be used 
text <- inaug_speeches_clean$text_final

###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords"))

################################# tokenize using quanteda ###############################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")


###convert tokens to a document-feature matrix (dfm)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school", "a0"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, 
                             pattern = c(paste0("^", stopwords("english"), "_"), 
                                         paste0("_", stopwords("english"), "$")), 
                             valuetype = "regex")


########################### Transform Matrix ###########################################


## (see https://www.youtube.com/watch?v=7cwBhWYHgsA)
##keep all 58 speeches (subset rows here to compare fewer texts)
CosineSim_dfm <- dfm_toksNgrams[1:nrow(dfm_toksNgrams), ]
##convert from quanteda dfm to tm DTM
CosineSim_dfm <- convert(CosineSim_dfm, to = "tm")

##check dimensions of dfm
dim(CosineSim_dfm)

## cosine() from lsa compares matrix columns, so transpose so that documents are the columns
CosineSim_Matrix <- cosine(t(as.matrix(CosineSim_dfm)))
##check dimensions (should be 58 x 58)
dim(CosineSim_Matrix)
##convert to data frame
CosineSim_Matrix <- as.data.frame(CosineSim_Matrix)

###################### join to primary dataset (optional) #########################################
##add cosine similarities to original text
## create unique ID to join president names to the cosine similarity matrix
CosineSim_Matrix$UniqueResp <- seq.int(nrow(CosineSim_Matrix))
inaug_speeches_clean$UniqueResp <- seq.int(nrow(inaug_speeches_clean))
##left join on the ID
#inaug_speeches_wCosine <- merge(x = inaug_speeches_clean, y = CosineSim_Matrix, by = "UniqueResp", all.x = TRUE)

Cosine Similarity Visualization

require(ggcorrplot)
######################### visualize cosine similarity matrix #########################
##convert row and column names to the president/speech labels
row.names(CosineSim_Matrix) <- inaug_speeches_clean$Speech
colnames(CosineSim_Matrix) <- t(inaug_speeches_clean$Speech)

##subset: presidents' first inauguration speeches since 1900
CosineSim_Matrix_1900 <- CosineSim_Matrix[c(30:32, 34:37, 41:42, 44:46, 48:49, 51, 52, 54, 56, 58), 
                                            c(30:32, 34:37, 41:42, 44:46, 48:49, 51, 52, 54, 56, 58)]

##similarity matrix plot for all 58 speeches (commented out)
# ggcorrplot(CosineSim_Matrix, type = "lower", lab = TRUE, legend.title = "Cos Sim", 
#           title = "Cosine Similarity of Inauguration Speeches",
#           show.diag = TRUE, outline.color = "black", lab_size = 2)

##similarity matrix plot for presidents' first inauguration speeches since 1900
ggcorrplot(CosineSim_Matrix_1900, type = "lower", lab = TRUE, legend.title = "Cos Sim", 
           title = "Cosine Similarity of Inauguration Speeches",
           show.diag = TRUE, outline.color = "black", lab_size = 2)

As the visualization shows, the speeches share a good deal of language. Taken together with the phrase frequencies above, this suggests that inaugural addresses draw on a common vocabulary.




8. Lexical Diversity


Lexical diversity refers to "the range of different words used in a text, with a greater range indicating a higher diversity" (see https://rdrr.io/github/kbenoit/quanteda/man/textstat_lexdiv.html).
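Two of the measures computed below can be worked by hand on a toy token vector (CTTR, used later for plotting, is Carroll's corrected type-token ratio):

##toy illustration: TTR = V/N and CTTR = V/sqrt(2N), where V = types, N = tokens
toks <- c("we", "the", "people", "of", "the", "united", "states")
V <- length(unique(toks))  ## 6 distinct words (types)
N <- length(toks)          ## 7 total words (tokens)
V / N                      ## TTR  = 0.857
V / sqrt(2 * N)            ## CTTR = 1.604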

require(RWeka)
require(quanteda)
require(tm)
##check structure of the data
#str(inaug_speeches_clean)

###ensure the speeches are in a data.table
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)

###specify the text column to be used 
text <- inaug_speeches_clean$text_final

###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords"))
###tokenize using quanteda

##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")


###convert tokens to a document-feature matrix (dfm)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school", "a0"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, 
                             pattern = c(paste0("^", stopwords("english"), "_"), 
                                         paste0("_", stopwords("english"), "$")), 
                             valuetype = "regex")

##calculate lexical diversity
dt <- textstat_lexdiv(dfm_toksNgrams, measure = c("TTR", "C", "R", "CTTR", "U", "S", "Maas"), log.base = 10)

row.names(dt) <- inaug_speeches_clean$Speech


kable(dt, escape = F, "html", table.attr = "class='dtable'")  %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Speech document TTR C R CTTR U S Maas lgV0 lgeV0
1 George Washington text1 0.7712121 0.9599843 19.812798 14.009764 70.46086 0.9606026 0.1191313 9.664986 22.25445
2 George Washington text2 0.9354839 0.9838408 7.366007 5.208554 110.92051 0.9720826 0.0949498 9.849038 22.67825
3 John Adams text3 0.6539179 0.9391205 21.410195 15.139294 49.77364 0.9433430 0.1417426 8.282356 19.07083
4 Thomas Jefferson text4 0.7219512 0.9514409 20.673537 14.618398 60.00553 0.9534554 0.1290935 9.005973 20.73702
5 Thomas Jefferson text5 0.6702756 0.9422172 21.364870 15.107245 52.03791 0.9459361 0.1386245 8.457093 19.47318
6 James Madison text6 0.8251880 0.9693874 19.033064 13.458409 89.04555 0.9689960 0.1059727 10.762020 24.78047
7 James Madison text7 0.8051002 0.9656333 18.864100 13.338933 79.71589 0.9652996 0.1120025 10.178290 23.43638
8 James Monroe text8 0.5607595 0.9214597 22.289749 15.761233 40.72634 0.9296517 0.1566976 7.587212 17.47020
9 James Monroe text9 0.5406343 0.9191822 24.286457 17.173119 40.89349 0.9295048 0.1563770 7.713494 17.76098
10 John Quincy Adams text10 0.6336489 0.9368476 23.479237 16.602328 49.68414 0.9429508 0.1418702 8.404924 19.35305
11 Andrew Jackson text11 0.7974217 0.9640515 18.581809 13.139323 76.07551 0.9636099 0.1146509 9.922223 22.84676
12 Andrew Jackson text12 0.7081081 0.9453774 16.681921 11.795899 50.24098 0.9443588 0.1410818 7.958799 18.32581
13 Martin Van Buren text13 0.6190960 0.9367861 27.471404 19.425216 52.11271 0.9452260 0.1385250 8.819611 20.30790
14 William Henry Harrison text14 0.4451714 0.9018829 27.517932 19.458116 36.50921 0.9190651 0.1655003 7.478819 17.22062
15 James Knox Polk text15 0.5083260 0.9124985 24.282879 17.170588 38.38009 0.9244132 0.1614162 7.491096 17.24889
16 Zachary Taylor text16 0.7826087 0.9609117 18.000000 12.727922 69.67440 0.9602029 0.1198018 9.452599 21.76541
17 Franklin Pierce text17 0.6320696 0.9378663 25.353793 17.927839 51.60738 0.9449468 0.1392016 8.666712 19.95584
18 James Buchanan text18 0.5833940 0.9254103 21.617091 15.285592 42.06571 0.9322086 0.1541828 7.661970 17.64234
19 Abraham Lincoln text19 0.5401679 0.9169910 22.061080 15.599539 38.81743 0.9259377 0.1605043 7.407035 17.05533
20 Abraham Lincoln text20 0.7758112 0.9564288 14.284199 10.100454 58.07040 0.9520101 0.1312268 8.288490 19.08495
21 Ulysses S. Grant text21 0.7025090 0.9441687 16.594683 11.734213 49.19525 0.9431396 0.1425734 7.871266 18.12426
22 Ulysses S. Grant text22 0.7046549 0.9455988 17.588164 12.436710 51.36812 0.9455681 0.1395254 8.122285 18.70225
23 Rutherford B. Hayes text23 0.5948419 0.9267515 20.623093 14.582729 42.04733 0.9323763 0.1542165 7.597807 17.49460
24 James A. Garfield text24 0.6023272 0.9304291 23.022769 16.279556 45.48814 0.9374074 0.1482692 8.034674 18.50052
25 Grover Cleveland text25 0.7024691 0.9472671 19.992622 14.136919 55.15507 0.9492577 0.1346503 8.597770 19.79710
26 Benjamin Harrison text26 0.5617081 0.9247460 25.929995 18.335275 44.23130 0.9349412 0.1503610 8.087798 18.62284
27 Grover Cleveland text27 0.7162837 0.9517020 22.662203 16.024597 62.12338 0.9549461 0.1268740 9.300661 21.41556
28 William McKinley text28 0.5554425 0.9224672 24.628099 17.414695 42.47985 0.9322944 0.1534294 7.869503 18.12020
29 William McKinley text29 0.6593205 0.9404341 21.757576 15.384929 50.98598 0.9447161 0.1400473 8.400955 19.34391
30 Theodore Roosevelt text30 0.6873614 0.9386571 14.597330 10.321871 43.26788 0.9351472 0.1520258 7.224440 16.63489
31 William Howard Taft text31 0.4863933 0.9083814 24.844175 17.567485 37.29019 0.9217889 0.1637581 7.422019 17.08983
32 Woodrow Wilson text32 0.6724138 0.9407596 19.160844 13.548763 49.11435 0.9428205 0.1426907 8.072543 18.58772
33 Woodrow Wilson text33 0.6484490 0.9335389 16.872141 11.930405 42.59013 0.9339032 0.1532306 7.371380 16.97323
34 Warren G. Harding text34 0.5912240 0.9295215 24.605150 17.398468 45.95088 0.9378063 0.1475207 8.163130 18.79630
35 Calvin Coolidge text35 0.5578947 0.9227000 24.318068 17.195471 42.41594 0.9322496 0.1535449 7.847362 18.06922
36 Herbert Hoover text36 0.5204646 0.9129286 22.130458 15.648597 37.40836 0.9228554 0.1634992 7.286079 16.77682
37 Franklin D. Roosevelt text37 0.6677350 0.9409695 20.428764 14.445318 50.33461 0.9441276 0.1409505 8.259828 19.01896
38 Franklin D. Roosevelt text38 0.6380236 0.9342652 19.467564 13.765646 45.16559 0.9375166 0.1487977 7.778886 17.91155
39 Franklin D. Roosevelt text39 0.6779388 0.9395618 16.894148 11.945967 46.21399 0.9393061 0.1471002 7.664836 17.64894
40 Franklin D. Roosevelt text40 0.7680608 0.9526420 12.455854 8.807619 51.09915 0.9451019 0.1398921 7.581042 17.45600
41 Harry S. Truman text41 0.5627669 0.9186355 19.257813 13.617330 37.71372 0.9243085 0.1628359 7.134504 16.42780
42 Dwight D. Eisenhower text42 0.6256117 0.9340470 21.905347 15.489419 46.82868 0.9394967 0.1461316 8.077265 18.59859
43 Dwight D. Eisenhower text43 0.6233293 0.9295869 17.882056 12.644523 41.40420 0.9317621 0.1554096 7.352399 16.92952
44 John F. Kennedy text44 0.6458037 0.9332973 17.122934 12.107743 42.68129 0.9340201 0.1530669 7.399115 17.03709
45 Lyndon Baines Johnson text45 0.6237624 0.9280622 16.585512 11.727728 39.60951 0.9287027 0.1588913 7.100592 16.34972
46 Richard Milhous Nixon text46 0.5952153 0.9253669 19.241204 13.605586 40.45280 0.9298032 0.1572265 7.370067 16.97021
47 Richard Milhous Nixon text47 0.4850575 0.8931095 14.307140 10.116676 27.50027 0.8951574 0.1906916 5.836110 13.43814
48 Jimmy Carter text48 0.6889632 0.9417280 16.847933 11.913288 47.65068 0.9412113 0.1448657 7.773747 17.89971
49 Ronald Reagan text49 0.6269036 0.9339979 21.553086 15.240334 46.55331 0.9391721 0.1465631 8.032439 18.49537
50 Ronald Reagan text50 0.5798319 0.9240608 20.978380 14.833955 41.04521 0.9305297 0.1560877 7.535057 17.35011
51 George Bush text51 0.5766816 0.9215483 19.256331 13.616282 38.84269 0.9266769 0.1604521 7.232741 16.65400
52 Bill Clinton text52 0.6428571 0.9338782 18.160003 12.841061 43.88878 0.9357902 0.1509466 7.578816 17.45087
53 Bill Clinton text53 0.5384615 0.9123765 18.418218 13.023647 35.01554 0.9182020 0.1689933 6.838465 15.74615
54 George W. Bush text54 0.6159509 0.9277078 17.584293 12.433973 40.26934 0.9297755 0.1575842 7.234536 16.65814
55 George W. Bush text55 0.5973905 0.9261725 19.568528 13.837039 41.04972 0.9308283 0.1560791 7.443269 17.13876
56 Barack Obama text56 0.6619601 0.9418403 22.969158 16.241647 52.96845 0.9467445 0.1374015 8.633734 19.87991
57 Barack Obama text57 0.6388368 0.9357247 20.857800 14.748692 47.10605 0.9400319 0.1457007 8.032027 18.49443
58 Donald J. Trump text58 0.5948387 0.9219186 16.559609 11.709412 37.00370 0.9233768 0.1643908 6.876131 15.83288
##calculate the chosen lexical diversity measure (CTTR) as its own column
LexDiv_r <- textstat_lexdiv(dfm_toksNgrams, measure = "CTTR", log.base = 10)

## create ID column to join on
LexDiv_r$ID <- 1:nrow(LexDiv_r)
inaug_speeches_clean$ID <- 1:nrow(inaug_speeches_clean)

## join lexical diversity df to full speech df
inaug_speeches_clean <- merge(inaug_speeches_clean, LexDiv_r, by = "ID")

Visualize Lexical Diversity

require(ggplot2)
require(plotly)

Lexical Diversity over time

##convert Year to numeric for plotting
inaug_speeches_clean <- inaug_speeches_clean[, Year:=as.numeric(Year)]

## look at number of Inaugurations by Address Number
dt <- inaug_speeches_clean[, .(number_of_distinct = uniqueN(PresidentNumber)), by = Inaugural.Address]

kable(dt, escape = F, "html", table.attr = "class='dtable'") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Inaugural.Address number_of_distinct
First Inaugural Address 39
Second Inaugural Address 17
Third Inaugural Address 1
Fourth Inaugural Address 1
###Time Series Plot 
##subset
first_inaug <- 
  inaug_speeches_clean %>%
  filter(Inaugural.Address == "First Inaugural Address")
##create plot
LexDiv_plot <- ggplot(data=first_inaug, aes(x=Year, y=CTTR)) +
  geom_line(colour = "blue", linejoin = "mitre") +
  geom_point(colour = "Black") +  
  theme(axis.text.x = element_text(vjust=1,angle=90)) +
  geom_text(aes(label=Name), vjust=0,angle=90,size=2.5,hjust=0) +
  scale_x_continuous(breaks = seq(1789,2017,4))
##show plot
LexDiv_plot

Lexical Diversity in plotly

##convert and show in plotly
LexDiv_plotly <- ggplotly(LexDiv_plot)
LexDiv_plotly

Scatter Plot of Lexical Diversity & Length of Speech

##################### scatter plot ##########

##scatter plot of number of words vs Lexical Diversity Measure
scat <- ggplot(inaug_speeches_clean, aes(x = num_words, y = CTTR)) +
  geom_point(aes(color = Year))

##convert to plotly and visualize
scat <- ggplotly(scat)
scat

We find that 17 presidents have delivered a second inaugural address. We also see that there does not appear to be a trend in the variety of language used over the years (based on the CTTR measure of lexical diversity). Upon further examination, the scatter plot suggests that the measure we used may be biased toward longer speeches.
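One quick way to check that suspicion is to correlate speech length with CTTR (a sketch; output not shown here):

##sketch: correlation between speech length and CTTR
cor(inaug_speeches_clean$num_words, inaug_speeches_clean$CTTR)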



Conclusion

This analysis presents several methods for efficiently exploring large bodies of free text, along with several approaches for quantifying that text in different ways. It is intended for exploratory use; other popular approaches could also be useful here, including sentiment analysis, named-entity recognition, and unsupervised word-embedding methods (e.g., Word2Vec). I hope this analysis gives you ideas for approaching similar problems in your own work.
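As a pointer in that direction, a minimal sentiment-analysis sketch with tidytext might look like the following (an assumption-laden sketch: get_sentiments("bing") may prompt a one-time lexicon download via the textdata package):

##sketch: net sentiment per speech using the Bing lexicon
require(tidytext)
require(dplyr)
require(tidyr)
inaug_speeches_clean %>%
  unnest_tokens(word, text_final) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(Speech, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative) %>%
  arrange(desc(net_sentiment))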