The dataset was retrieved from Kaggle (https://www.kaggle.com/adhok93/presidentialaddress/home) and contains all 58 presidential inauguration speeches, from George Washington's first address to the present day.


Donald Trump (above) proving his eyes were stronger than the sun.




Import Data

inaug_speeches <- read.csv("/Users/ultrajosef/Documents/JosephOMalley_CodePortfolio_staging/R_Text_Analytics/InaugurationSpeehes/data/inaug_speeches.csv")




Install Necessary Packages

##install needed packages (one-time; knitr and kableExtra are used below for the tables)
#install.packages(c("psych", "data.table", "stringr", "lubridate", "tidyr", "dplyr", "tidytext",
#    "ggplot2", "RWeka", "quanteda", "tm", "wordcloud", "lsa", "ggcorrplot", "plotly", "rJava", "NLP",
#    "openNLP", "openNLPmodels.en", "reshape", "tibble", "googleVis", "magrittr", "RColorBrewer",
#    "knitr", "kableExtra"))

1. Exploratory Data Analysis (EDA)

##initial exploration
require(data.table)   ##needed below for as.data.table() and uniqueN()
str(inaug_speeches)
## 'data.frame':    58 obs. of  6 variables:
##  $ X                : int  4 5 6 7 8 9 10 11 12 13 ...
##  $ Name             : Factor w/ 39 levels "Abraham Lincoln",..: 13 13 23 32 32 20 20 21 21 25 ...
##  $ Inaugural.Address: Factor w/ 5 levels "First Inaugural Address",..: 1 4 3 1 4 1 4 1 4 3 ...
##  $ Date             : Factor w/ 58 levels "20-Jan-97","Friday, January 20, 1961",..: 35 14 29 54 15 30 39 47 22 5 ...
##  $ text             : Factor w/ 58 levels "            \xa0\xa0ABOUT to add the solemnity of an oath to the obligations imposed by a second call to the st"| __truncated__,..: 58 28 26 43 18 23 1 6 32 7 ...
##  $ text2            : Factor w/ 2 levels ""," in some cases  as the powers which they respectively claim are often not defined by any distinct lines. Mischi"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
inaug_speeches_dt <- as.data.table(inaug_speeches)
names(inaug_speeches)
## [1] "X"                 "Name"              "Inaugural.Address"
## [4] "Date"              "text"              "text2"
##count unique values in each column
apply(inaug_speeches[, c(1:ncol(inaug_speeches))], 2, function(x) length(unique(x)))
##                 X              Name Inaugural.Address              Date 
##                58                39                 5                58 
##              text             text2 
##                58                 2
##count distinct speeches per president
dt <- inaug_speeches_dt[, .(number_of_distinct_records = uniqueN(text)), by = Name]

##convert to kable tables (kable() is from knitr, kable_styling() from kableExtra)
require(knitr)
require(kableExtra)
kable(dt, escape = F, "html", table.attr = "class='dtable'")  %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Name number_of_distinct_records
George Washington 2
John Adams 1
Thomas Jefferson 2
James Madison 2
James Monroe 2
John Quincy Adams 1
Andrew Jackson 2
Martin Van Buren 1
William Henry Harrison 1
James Knox Polk 1
Zachary Taylor 1
Franklin Pierce 1
James Buchanan 1
Abraham Lincoln 2
Ulysses S. Grant 2
Rutherford B. Hayes 1
James A. Garfield 1
Grover Cleveland 2
Benjamin Harrison 1
William McKinley 2
Theodore Roosevelt 1
William Howard Taft 1
Woodrow Wilson 2
Warren G. Harding 1
Calvin Coolidge 1
Herbert Hoover 1
Franklin D. Roosevelt 4
Harry S. Truman 1
Dwight D. Eisenhower 2
John F. Kennedy 1
Lyndon Baines Johnson 1
Richard Milhous Nixon 2
Jimmy Carter 1
Ronald Reagan 2
George Bush 1
Bill Clinton 2
George W. Bush 2
Barack Obama 2
Donald J. Trump 1

2. Text Cleaning/Manipulation

https://stackoverflow.com/questions/39993715/how-to-remove-unicode-u00a6-from-string
##convert the text columns we will analyze to character
inaug_speeches_dt <- inaug_speeches_dt[, text:=as.character(text)]
inaug_speeches_dt <- inaug_speeches_dt[, text2:=as.character(text2)]
##combine the two text columns (one speech was too long to fit in a single Excel cell)
inaug_speeches_dt$text <- apply(inaug_speeches_dt[ , c("text", "text2") ] ,1 , paste , collapse = "" )


##create duplicate text column (for manipulation)
##trim leading and trailing whitespace
inaug_speeches_dt$text_final <- trimws(inaug_speeches_dt$text)


##additional text cleaning with regular expressions via gsub
##remove encoding artifacts (e.g. <U+00A0>); see https://stackoverflow.com/questions/39993715/how-to-remove-unicode-u00a6-from-string
inaug_speeches_dt$text_final <- gsub("\\s*<u\\+\\w+>\\s*", " ", inaug_speeches_dt$text_final)
inaug_speeches_dt$text_final <- gsub("\\s*<U\\+\\w+>\\s*", " ", inaug_speeches_dt$text_final)
##collapse runs of whitespace to a single space
inaug_speeches_dt$text_final <- gsub("\\s+", " ", inaug_speeches_dt$text_final)
##remove anything that is not a letter, digit, space, or punctuation
inaug_speeches_dt$text_final <- gsub("[^A-Za-z0-9 [:punct:]]", "", inaug_speeches_dt$text_final)
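As a quick sanity check, the same patterns can be exercised on a made-up string (illustrative only, not from the dataset):

##illustrative check of the cleaning patterns on a toy string
x <- "  WE the people<U+00A0>of the   United\tStates!  "
x <- trimws(x)
x <- gsub("\\s*<U\\+\\w+>\\s*", " ", x)
x <- gsub("\\s+", " ", x)
gsub("[^A-Za-z0-9 [:punct:]]", "", x)
## [1] "WE the people of the United States!"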




3. Date Formatting

Date conversion reference (lubridate cheat sheet): https://www.rstudio.com/resources/cheatsheets/
Date format conventions: https://www.statmethods.net/input/dates.html
##load require packages
require(stringr)
require(lubridate)
require(tidyr)
require(dplyr)
##check unique date values
#unique(inaug_speeches_dt$Date)

##make copy of original date format for manipulation
inaug_speeches_dt$DateOriginal <- inaug_speeches_dt$Date
##fix mismatched date format (Clinton 1997)
inaug_speeches_dt$Date <- gsub("20-Jan-97", "Monday, January 20, 1997", inaug_speeches_dt$Date)

## split current date formats
inaug_speeches_dt <- separate(inaug_speeches_dt, "Date", c('DayOfWeek', 'MonthDay', 'Year'), sep = ",") 

inaug_speeches_dt$MonthDay <- trimws(inaug_speeches_dt$MonthDay)
inaug_speeches_dt$Year <- trimws(inaug_speeches_dt$Year)

##replace NA values in MonthDay with the DayOfWeek value (for dates that lacked a weekday)
#inaug_speeches_dt$MonthDay <- ifelse(is.na(inaug_speeches_dt$MonthDay), inaug_speeches_dt$DayOfWeek, inaug_speeches_dt$MonthDay)

##recode the "Inaugural Address" factor level to "First Inaugural Address"
inaug_speeches_dt$Inaugural.Address <- recode_factor(inaug_speeches_dt$Inaugural.Address, "Inaugural Address" = "First Inaugural Address")

                
##replace any value that is not a day of the week with "Unknown"
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
inaug_speeches_dt$DayOfWeek <- 
  ifelse(inaug_speeches_dt$DayOfWeek %in% days_of_week, inaug_speeches_dt$DayOfWeek, "Unknown")
##check unique days of week
dt <- inaug_speeches_dt[, .(number_of_distinct = uniqueN(text_final)), by = DayOfWeek] 

kable(dt, escape = F, "html", table.attr = "class='dtable'")  %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
DayOfWeek number_of_distinct
Thursday 9
Monday 18
Saturday 9
Wednesday 8
Tuesday 7
Friday 7
##concatenate MonthDay and Year
inaug_speeches_dt$Date <- apply(inaug_speeches_dt[, c('MonthDay', 'Year')], 1, paste, collapse = ",")
inaug_speeches_dt$Date <- gsub(",NA", "", inaug_speeches_dt$Date)

##check unique month/day
#unique(inaug_speeches_dt$MonthDay)

##helper: try multiple date formats, keeping the first successful parse for each element
multidate <- function(data, formats){
  ##drop "nd" so ordinals like "2nd" parse (note: this removes every "nd" in the string)
  data <- gsub("nd", "", data, perl = TRUE)
  a <- list()
  for(i in 1:length(formats)){
    a[[i]] <- as.Date(data, format = formats[i])
    ##fill in any dates that earlier formats failed to parse
    a[[1]][!is.na(a[[i]])] <- a[[i]][!is.na(a[[i]])]
  }
  a[[1]]
}

###convert disparate date formats (list) to a common Date class
inaug_speeches_dt$Date <- multidate(inaug_speeches_dt$Date, 
                                     c("%m/%d/%Y", "%d.%m.%Y", "%d %b %y", "%B %d, %Y",
                                       "%B %d %Y", "%d-%b-%y", "%m.%d.%Y", "%d %b, %Y", "%A, %m/%d/%y",
                                       "%A, %B %d, %Y", "%d of %B %Y", "%m%d%y", "%d%m%Y", "%d-%m-%Y",
                                       "%A %B %d %Y", "%d %B, %Y", "%d %B %Y", "%A %B %d, %Y",
                                       "%h %d %Y", "%m/%d%y", "%d/%m/%Y", "%B %d,%Y", "%b. %d,%Y",
                                       "%b-%d-%Y", "%m/%d.%Y"))

###change format to m/d/Y
inaug_speeches_dt$Date <- format(inaug_speeches_dt$Date, "%m/%d/%Y")
#unique(inaug_speeches_dt$Date)
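To see how multidate() behaves, here is a quick illustration on made-up inputs (assumes an English locale for month names):

##illustrative call on toy inputs (not from the dataset)
multidate(c("January 20,1997", "4-Mar-01"), c("%B %d,%Y", "%d-%b-%y"))
## [1] "1997-01-20" "2001-03-04"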




4. Number of Characters and Number of Words

##calculate number of characters
inaug_speeches_dt$nchar <- nchar(inaug_speeches_dt$text_final)

##function to count words: runs of letters by default; pseudo = TRUE counts any non-whitespace chunk
nwords <- function(string, pseudo = FALSE){
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}
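A quick illustration of the difference between the two counting modes on a toy string:

##toy example: hyphenated and numeric tokens show the difference
nwords("a well-known 2-part plan")                 ## 5 ("well" and "known" count separately)
nwords("a well-known 2-part plan", pseudo = TRUE)  ## 4 (whitespace-delimited chunks)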

##calculate number of words using nwords() function
inaug_speeches_dt$num_words <- nwords(inaug_speeches_dt$text_final)


##create a label column combining the speech number (1-58) and the president's name
inaug_speeches_dt$PresidentNumber <- seq.int(nrow(inaug_speeches_dt))
inaug_speeches_dt$Speech <- apply(inaug_speeches_dt[, c('PresidentNumber', 'Name')], 1, paste, collapse = " ")
## Visualizing speech length

require(tidytext)
require(dplyr)
require(ggplot2)
##create bar chart in ggplot (Inaugural.Address and Year are carried through the grouping
##rather than referenced from the data.table inside aes(), which would rely on row order)

inaug_speeches_dt %>%
  unnest_tokens(word, text_final) %>%
  group_by(Speech, Inaugural.Address, Year) %>%
  summarise(num_words = n()) %>%
  ggplot(aes(x = Speech, y = num_words)) +
  geom_bar(stat = "identity", width = 0.5, aes(fill = Inaugural.Address)) +
  scale_fill_manual(values = c("red", "goldenrod", "blue", "light blue", "black")) +
  theme(axis.text.x = element_text(vjust = 1, angle = 90)) + theme(legend.position = "bottom") +
  geom_text(aes(label = Year), vjust = 0, angle = 90, size = 2.5, hjust = 0) + ylim(c(0, 11500)) +
  labs(title = "Speech Length",
       caption = "United States: Inauguration Speeches")

##check average speech length by inaugural address number
inaug_groups <-inaug_speeches_dt %>%
  group_by(Inaugural.Address) %>%
  summarise_at(vars(num_words), funs(mean(., na.rm=TRUE))) %>% 
  arrange(desc(num_words)) %>% 
  mutate(num_words = round(num_words, 2))


inaug_groups <- as.data.table(inaug_groups)

##convert to kable tables
kable(inaug_groups, escape = F, "html", table.attr = "class='dtable'")  %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Inaugural.Address num_words
First Inaugural Address 2701.69
Second Inaugural Address 1893.12
Third Inaugural Address 1427.00
Fourth Inaugural Address 596.00
## subset columns
inaug_speeches_clean <- subset(inaug_speeches_dt, select = c("PresidentNumber", "Name","Inaugural.Address", 
                                                             "DayOfWeek", "Year", "Date", "DateOriginal", 
                                                             "text_final","nchar", "num_words", "Speech"))

##################################### Export Cleaned Data ##########################################
#setwd('file/path')
#write.csv(inaug_speeches_clean, "inaug_speeches_clean.csv")


5. N-Gram Frequency

require(RWeka)
require(quanteda)
require(tm)
require(data.table)

Adjust the n-gram selection (one or more) in this section.

##check structure of the data
#str(inaug_speeches_clean)

###ensure the speeches are in a data.table
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)

###specify the text column to be used 
text <- inaug_speeches_clean$text_final

###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords", "a0"))

##################### tokenize using quanteda ######################################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(2, 3), concatenator = "_")


###convert tokens to a document-feature matrix (dfm)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, 
                             pattern = c(paste0("^", stopwords("english"), "_"), 
                                         paste0("_", stopwords("english"), "$")), 
                             valuetype = "regex")
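For intuition, here is what the tokenize/clean/n-gram steps produce on a toy sentence (illustrative only):

##illustrative: the same pipeline on a toy sentence
toy <- tokens("We the People of the United States", remove_punct = TRUE)
toy <- tokens_remove(tokens_tolower(toy), stopwords("english"))
tokens_ngrams(toy, n = 2, concatenator = "_")
## bigrams: "people_united" "united_states"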

##rename the dfm rows to the president/speech labels
row.names(dfm_toksNgrams) <- inaug_speeches_clean$Speech

##frequency dist
term_frequency <- colSums(dfm_toksNgrams)

# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency <- as.data.frame(term_frequency)

# move row names (words/ngrams) into their own column
setDT(term_frequency, keep.rownames = TRUE)[]
##                                   rn term_frequency
##      1:                united_states            158
##      2:                       let_us            101
##      3:              american_people             41
##      4:           federal_government             34
##      5:                    men_women             28
##     ---                                            
## 121852:             reign_peace_made              1
## 121853:         peace_made_permanent              1
## 121854:    made_permanent_government              1
## 121855: permanent_government_liberty              1
## 121856:       government_liberty_law              1
# alternative method of moving row names (words/ngrams) into a column
# term_frequency$names <- rownames(term_frequency)
## write term frequency into new file (optional)

#setwd('local/file/path')
#write.csv(term_frequency, "inaug_term_frequency.csv")

Frequency Visualizations

require(ggplot2)
library(wordcloud)
##subset top 40 rows
term_frequency2 <- term_frequency %>% 
  arrange(desc(term_frequency)) %>%
  head(40)

## Plot a bar chart of the most frequent words/phrases 
ggplot(term_frequency2, aes(x = reorder(rn, -term_frequency), y = term_frequency)) + 
            geom_bar(stat = "identity", width = 0.5, 
            aes(fill = term_frequency)) +
            theme(axis.text.x = element_text(vjust = 1, angle = 90)) + theme(legend.position = "none") +
            geom_text(aes(label = term_frequency), vjust = 0, angle = 90, size = 2.5, hjust = 0) +
            labs(title = "Most Common Phrases", caption = "United States: Inauguration Speeches")




6. Comparison Cloud


These comparison/commonality clouds compare the language of the 57th (Obama's second) and 58th (Trump's) speeches in the dataset. The visualizations can compare/contrast more than two texts, however.

Adjust the n-gram selection (one or more) in this section.

##check structure of the data
#str(inaug_speeches_clean)

###ensure the speeches are in a data.table
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)

###specify the text column to be used 
text <- inaug_speeches_clean$text_final

###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords", "a0"))

####################### tokenize using quanteda #####################################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")


###convert tokens to a document-feature matrix (dfm)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, 
                             pattern = c(paste0("^", stopwords("english"), "_"), 
                                         paste0("_", stopwords("english"), "$")), 
                             valuetype = "regex")

##rename the dfm rows to the president/speech labels
row.names(dfm_toksNgrams) <- inaug_speeches_clean$Speech

##frequency dist
term_frequency <- colSums(dfm_toksNgrams)

# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency <- as.data.frame(term_frequency)

# move row names (words/ngrams) into their own column
setDT(term_frequency, keep.rownames = TRUE)[]
# alternative method of moving row names (words/ngrams) into a column
# term_frequency$names <- rownames(term_frequency)
Convert from quanteda to tm: https://rdrr.io/cran/quanteda/man/convert.html
##select the documents to compare (rows 57 and 58; two or more can be compared)
dat <- dfm_toksNgrams[c(57, 58),]
##convert from quanteda dfm to tm DTM
dat <- convert(dat, to = "tm")
##convert from DTM to TDM
dat <- as.TermDocumentMatrix(dat)
dat <- as.matrix(dat)

##create clouds from wordcloud package
comparison.cloud(dat,max.words=80,random.order=FALSE,colors=c("#1F497D","#C0504D", "light blue"),
                                                  main="Differences Between Inauguration Speeches")

commonality.cloud(dat,random.order=FALSE,max.words=50, color="#1F497D",main="Commonalities in Inauguration Speeches")




7. Cosine Similarity


Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, equal to the cosine of the angle between them. We can use it to gauge the relative similarity of two texts.

Cosine similarity: https://www.youtube.com/watch?v=7cwBhWYHgsA
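To make the measure concrete, here is a minimal sketch on two toy term-count vectors (the words and counts are illustrative):

##toy term-count vectors for two short "documents"
doc_a <- c(america = 3, people = 2, freedom = 1, war = 0)
doc_b <- c(america = 2, people = 1, freedom = 0, war = 4)
##cosine similarity = dot product / product of the vector norms
sum(doc_a * doc_b) / (sqrt(sum(doc_a^2)) * sqrt(sum(doc_b^2)))
## [1] 0.4665694  (identical vectors give 1; vectors with no shared terms give 0)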
library(lsa)
require(ggplot2)
require(RWeka)
require(quanteda)
require(tm)
require(data.table)
########################## Create tokens for cosine similarity ##########################

##check structure of the data
#str(inaug_speeches_clean)

###ensure the speeches are in a data.table
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)

###specify the text column to be used 
text <- inaug_speeches_clean$text_final

###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords"))

################################# tokenize using quanteda ###############################
##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")


###convert tokens to a document-feature matrix (dfm)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school", "a0"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, 
                             pattern = c(paste0("^", stopwords("english"), "_"), 
                                         paste0("_", stopwords("english"), "$")), 
                             valuetype = "regex")


########################### Transform Matrix ###########################################


## (see https://www.youtube.com/watch?v=7cwBhWYHgsA)
##keep all 58 speeches (subset rows here to compare fewer texts)
CosineSim_dfm <- dfm_toksNgrams[1:nrow(dfm_toksNgrams), ]
##convert from quanteda dfm to tm DTM
CosineSim_dfm <- convert(CosineSim_dfm, to = "tm")

##check dimensions of dfm
dim(CosineSim_dfm)

## cosine() from lsa compares matrix columns, so transpose so that documents are the columns
CosineSim_Matrix <- cosine(t(as.matrix(CosineSim_dfm)))
##check dimensions (should be 58 x 58)
dim(CosineSim_Matrix)
##convert to data frame
CosineSim_Matrix <- as.data.frame(CosineSim_Matrix)

###################### join to primary dataset (optional) #########################################
##add cosine similarities to original text
## create unique ID to join president names to the cosine similarity matrix
CosineSim_Matrix$UniqueResp <- seq.int(nrow(CosineSim_Matrix))
inaug_speeches_clean$UniqueResp <- seq.int(nrow(inaug_speeches_clean))
##left join on the ID
#inaug_speeches_wCosine <- merge(x = inaug_speeches_clean, y = CosineSim_Matrix, by = "UniqueResp", all.x = TRUE)

Cosine Similarity Visualization

require(ggcorrplot)
######################### visualize cosine similarity matrix #########################
##convert row and column names to the president/speech labels
row.names(CosineSim_Matrix) <- inaug_speeches_clean$Speech
colnames(CosineSim_Matrix) <- t(inaug_speeches_clean$Speech)

##subset: presidents' first inauguration speeches since 1900
CosineSim_Matrix_1900 <- CosineSim_Matrix[c(30:32, 34:37, 41:42, 44:46, 48:49, 51, 52, 54, 56, 58), 
                                            c(30:32, 34:37, 41:42, 44:46, 48:49, 51, 52, 54, 56, 58)]

##similarity matrix plot for all 58 speeches (commented out)
# ggcorrplot(CosineSim_Matrix, type = "lower", lab = TRUE, legend.title = "Cos Sim", 
#           title = "Cosine Similarity of Inauguration Speeches",
#           show.diag = TRUE, outline.color = "black", lab_size = 2)

##similarity matrix plot for presidents' first inauguration speeches since 1900
ggcorrplot(CosineSim_Matrix_1900, type = "lower", lab = TRUE, legend.title = "Cos Sim", 
           title = "Cosine Similarity of Inauguration Speeches",
           show.diag = TRUE, outline.color = "black", lab_size = 2)

As the visualization shows, the speeches share a good deal of language. Taken together with the phrase frequencies above, this suggests that inaugural addresses draw on a common vocabulary.




8. Lexical Diversity


Lexical diversity refers to "the range of different words used in a text, with a greater range indicating a higher diversity" (see https://rdrr.io/github/kbenoit/quanteda/man/textstat_lexdiv.html).
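Two of the measures computed below can be worked by hand on a toy token vector (CTTR, used later for plotting, is Carroll's corrected type-token ratio):

##toy illustration: TTR = V/N and CTTR = V/sqrt(2N), where V = types, N = tokens
toks <- c("we", "the", "people", "of", "the", "united", "states")
V <- length(unique(toks))  ## 6 distinct words (types)
N <- length(toks)          ## 7 total words (tokens)
V / N                      ## TTR  = 0.857
V / sqrt(2 * N)            ## CTTR = 1.604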

require(RWeka)
require(quanteda)
require(tm)
##check structure of the data
#str(inaug_speeches_clean)

###ensure the speeches are in a data.table
inaug_speeches_clean <- as.data.table(inaug_speeches_clean)

###specify the text column to be used 
text <- inaug_speeches_clean$text_final

###custom stopwords list
custom_stopwords <- (c(stopwords("english"), "otherwords"))
###tokenize using quanteda

##tokenization and text cleaning
toks <- tokens(text, remove_punct = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
#toks <- tokens_wordstem(toks, language = quanteda_options("language_stemmer"))
toks <- tokens_remove(toks, custom_stopwords)
###create n-grams, specify size (sparsity/computing power are a consideration here)
toksNgrams <- tokens_ngrams(toks, n = c(1), concatenator = "_")


###convert tokens to a document-feature matrix (dfm)
dfm_toksNgrams <- dfm(toksNgrams)
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, c(stopwords("english"), "high_school", "a0"))
###remove leading and trailing stopwords from n-grams
dfm_toksNgrams <- dfm_remove(dfm_toksNgrams, 
                             pattern = c(paste0("^", stopwords("english"), "_"), 
                                         paste0("_", stopwords("english"), "$")), 
                             valuetype = "regex")

##calculate lexical diversity
dt <- textstat_lexdiv(dfm_toksNgrams, measure = c("TTR", "C", "R", "CTTR", "U", "S", "Maas"), log.base = 10)

row.names(dt) <- inaug_speeches_clean$Speech


kable(dt, escape = F, "html", table.attr = "class='dtable'")  %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Speech document TTR C R CTTR U S Maas lgV0 lgeV0
1 George Washington text1 0.7712121 0.9599843 19.812798 14.009764 70.46086 0.9606026 0.1191313 9.664986 22.25445
2 George Washington text2 0.9354839 0.9838408 7.366007 5.208554 110.92051 0.9720826 0.0949498 9.849038 22.67825
3 John Adams text3 0.6539179 0.9391205 21.410195 15.139294 49.77364 0.9433430 0.1417426 8.282356 19.07083
4 Thomas Jefferson text4 0.7219512 0.9514409 20.673537 14.618398 60.00553 0.9534554 0.1290935 9.005973 20.73702
5 Thomas Jefferson text5 0.6702756 0.9422172 21.364870 15.107245 52.03791 0.9459361 0.1386245 8.457093 19.47318
6 James Madison text6 0.8251880 0.9693874 19.033064 13.458409 89.04555 0.9689960 0.1059727 10.762020 24.78047
7 James Madison text7 0.8051002 0.9656333 18.864100 13.338933 79.71589 0.9652996 0.1120025 10.178290 23.43638
8 James Monroe text8 0.5607595 0.9214597 22.289749 15.761233 40.72634 0.9296517 0.1566976 7.587212 17.47020
9 James Monroe text9 0.5406343 0.9191822 24.286457 17.173119 40.89349 0.9295048 0.1563770 7.713494 17.76098
10 John Quincy Adams text10 0.6336489 0.9368476 23.479237 16.602328 49.68414 0.9429508 0.1418702 8.404924 19.35305
11 Andrew Jackson text11 0.7974217 0.9640515 18.581809 13.139323 76.07551 0.9636099 0.1146509 9.922223 22.84676
12 Andrew Jackson text12 0.7081081 0.9453774 16.681921 11.795899 50.24098 0.9443588 0.1410818 7.958799 18.32581
13 Martin Van Buren text13 0.6190960 0.9367861 27.471404 19.425216 52.11271 0.9452260 0.1385250 8.819611 20.30790
14 William Henry Harrison text14 0.4451714 0.9018829 27.517932 19.458116 36.50921 0.9190651 0.1655003 7.478819 17.22062
15 James Knox Polk text15 0.5083260 0.9124985 24.282879 17.170588 38.38009 0.9244132 0.1614162 7.491096 17.24889
16 Zachary Taylor text16 0.7826087 0.9609117 18.000000 12.727922 69.67440 0.9602029 0.1198018 9.452599 21.76541
17 Franklin Pierce text17 0.6320696 0.9378663 25.353793 17.927839 51.60738 0.9449468 0.1392016 8.666712 19.95584
18 James Buchanan text18 0.5833940 0.9254103 21.617091 15.285592 42.06571 0.9322086 0.1541828 7.661970 17.64234
19 Abraham Lincoln text19 0.5401679 0.9169910 22.061080 15.599539 38.81743 0.9259377 0.1605043 7.407035 17.05533
20 Abraham Lincoln text20 0.7758112 0.9564288 14.284199 10.100454 58.07040 0.9520101 0.1312268 8.288490 19.08495
21 Ulysses S. Grant text21 0.7025090 0.9441687 16.594683 11.734213 49.19525 0.9431396 0.1425734 7.871266 18.12426
22 Ulysses S. Grant text22 0.7046549 0.9455988 17.588164 12.436710 51.36812 0.9455681 0.1395254 8.122285 18.70225
23 Rutherford B. Hayes text23 0.5948419 0.9267515 20.623093 14.582729 42.04733 0.9323763 0.1542165 7.597807 17.49460
24 James A. Garfield text24 0.6023272 0.9304291 23.022769 16.279556 45.48814 0.9374074 0.1482692 8.034674 18.50052
25 Grover Cleveland text25 0.7024691 0.9472671 19.992622 14.136919 55.15507 0.9492577 0.1346503 8.597770 19.79710
26 Benjamin Harrison text26 0.5617081 0.9247460 25.929995 18.335275 44.23130 0.9349412 0.1503610 8.087798 18.62284
27 Grover Cleveland text27 0.7162837 0.9517020 22.662203 16.024597 62.12338 0.9549461 0.1268740 9.300661 21.41556
28 William McKinley text28 0.5554425 0.9224672 24.628099 17.414695 42.47985 0.9322944 0.1534294 7.869503 18.12020
29 William McKinley text29 0.6593205 0.9404341 21.757576 15.384929 50.98598 0.9447161 0.1400473 8.400955 19.34391
30 Theodore Roosevelt text30 0.6873614 0.9386571 14.597330 10.321871 43.26788 0.9351472 0.1520258 7.224440 16.63489
31 William Howard Taft text31 0.4863933 0.9083814 24.844175 17.567485 37.29019 0.9217889 0.1637581 7.422019 17.08983
32 Woodrow Wilson text32 0.6724138 0.9407596 19.160844 13.548763 49.11435 0.9428205 0.1426907 8.072543 18.58772
33 Woodrow Wilson text33 0.6484490 0.9335389 16.872141 11.930405 42.59013 0.9339032 0.1532306 7.371380 16.97323
34 Warren G. Harding text34 0.5912240 0.9295215 24.605150 17.398468 45.95088 0.9378063 0.1475207 8.163130 18.79630
35 Calvin Coolidge text35 0.5578947 0.9227000 24.318068 17.195471 42.41594 0.9322496 0.1535449 7.847362 18.06922
36 Herbert Hoover text36 0.5204646 0.9129286 22.130458 15.648597 37.40836 0.9228554 0.1634992 7.286079 16.77682
37 Franklin D. Roosevelt text37 0.6677350 0.9409695 20.428764 14.445318 50.33461 0.9441276 0.1409505 8.259828 19.01896
38 Franklin D. Roosevelt text38 0.6380236 0.9342652 19.467564 13.765646 45.16559 0.9375166 0.1487977 7.778886 17.91155
39 Franklin D. Roosevelt text39 0.6779388 0.9395618 16.894148 11.945967 46.21399 0.9393061 0.1471002 7.664836 17.64894
40 Franklin D. Roosevelt text40 0.7680608 0.9526420 12.455854 8.807619 51.09915 0.9451019 0.1398921 7.581042 17.45600
41 Harry S. Truman text41 0.5627669 0.9186355 19.257813 13.617330 37.71372 0.9243085 0.1628359 7.134504 16.42780
42 Dwight D. Eisenhower text42 0.6256117 0.9340470 21.905347 15.489419 46.82868 0.9394967 0.1461316 8.077265 18.59859
43 Dwight D. Eisenhower text43 0.6233293 0.9295869 17.882056 12.644523 41.40420 0.9317621 0.1554096 7.352399 16.92952
44 John F. Kennedy text44 0.6458037 0.9332973 17.122934 12.107743 42.68129 0.9340201 0.1530669 7.399115 17.03709
45 Lyndon Baines Johnson text45 0.6237624 0.9280622 16.585512 11.727728 39.60951 0.9287027 0.1588913 7.100592 16.34972
46 Richard Milhous Nixon text46 0.5952153 0.9253669 19.241204 13.605586 40.45280 0.9298032 0.1572265 7.370067 16.97021
47 Richard Milhous Nixon text47 0.4850575 0.8931095 14.307140 10.116676 27.50027 0.8951574 0.1906916 5.836110 13.43814
48 Jimmy Carter text48 0.6889632 0.9417280 16.847933 11.913288 47.65068 0.9412113 0.1448657 7.773747 17.89971
49 Ronald Reagan text49 0.6269036 0.9339979 21.553086 15.240334 46.55331 0.9391721 0.1465631 8.032439 18.49537
50 Ronald Reagan text50 0.5798319 0.9240608 20.978380 14.833955 41.04521 0.9305297 0.1560877 7.535057 17.35011
51 George Bush text51 0.5766816 0.9215483 19.256331 13.616282 38.84269 0.9266769 0.1604521 7.232741 16.65400
52 Bill Clinton text52 0.6428571 0.9338782 18.160003 12.841061 43.88878 0.9357902 0.1509466 7.578816 17.45087
53 Bill Clinton text53 0.5384615 0.9123765 18.418218 13.023647 35.01554 0.9182020 0.1689933 6.838465 15.74615
54 George W. Bush text54 0.6159509 0.9277078 17.584293 12.433973 40.26934 0.9297755 0.1575842 7.234536 16.65814
55 George W. Bush text55 0.5973905 0.9261725 19.568528 13.837039 41.04972 0.9308283 0.1560791 7.443269 17.13876
56 Barack Obama text56 0.6619601 0.9418403 22.969158 16.241647 52.96845 0.9467445 0.1374015 8.633734 19.87991
57 Barack Obama text57 0.6388368 0.9357247 20.857800 14.748692 47.10605 0.9400319 0.1457007 8.032027 18.49443
58 Donald J. Trump text58 0.5948387 0.9219186 16.559609 11.709412 37.00370 0.9233768 0.1643908 6.876131 15.83288
##calculate the chosen lexical diversity measure (CTTR) as its own column
LexDiv_r <- textstat_lexdiv(dfm_toksNgrams, measure = "CTTR", log.base = 10)

## create ID column to join on
LexDiv_r$ID <- 1:nrow(LexDiv_r)
inaug_speeches_clean$ID <- 1:nrow(inaug_speeches_clean)

## join lexical diversity df to full speech df
inaug_speeches_clean <- merge(inaug_speeches_clean, LexDiv_r, by = "ID")

Visualize Lexical Diversity

require(ggplot2)
require(plotly)

Lexical Diversity over time

##convert Year to numeric for plotting
inaug_speeches_clean <- inaug_speeches_clean[, Year:=as.numeric(Year)]

## look at number of Inaugurations by Address Number
dt <- inaug_speeches_clean[, .(number_of_distinct = uniqueN(PresidentNumber)), by = Inaugural.Address]

kable(dt, escape = F, "html", table.attr = "class='dtable'") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Inaugural.Address number_of_distinct
First Inaugural Address 39
Second Inaugural Address 17
Third Inaugural Address 1
Fourth Inaugural Address 1
###Time Series Plot 
##subset
first_inaug <- 
  inaug_speeches_clean %>%
  filter(Inaugural.Address == "First Inaugural Address")
##create plot
LexDiv_plot <- ggplot(data=first_inaug, aes(x=Year, y=CTTR)) +
  geom_line(colour = "blue", linejoin = "mitre") +
  geom_point(colour = "Black") +  
  theme(axis.text.x = element_text(vjust=1,angle=90)) +
  geom_text(aes(label=Name), vjust=0,angle=90,size=2.5,hjust=0) +
  scale_x_continuous(breaks = seq(1789,2017,4))
##show plot
LexDiv_plot

Lexical Diversity in plotly

##convert and show in plotly
LexDiv_plotly <- ggplotly(LexDiv_plot)
LexDiv_plotly

Scatter Plot of Lexical Diversity & Length of Speech

##################### scatter plot ##########

##scatter plot of number of words vs Lexical Diversity Measure
scat <- ggplot(inaug_speeches_clean, aes(x = num_words, y = CTTR)) +
  geom_point(aes(color = Year))

##convert to plotly and visualize
scat <- ggplotly(scat)
scat

We find that 17 presidents have delivered a second inaugural address. We also see that there does not appear to be a trend in the variety of language used over the years (based on the CTTR measure of lexical diversity). Upon further examination, the scatter plot suggests that the measure we used may be biased toward longer speeches.
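One quick way to check that suspicion is to correlate speech length with CTTR (a sketch; output not shown here):

##sketch: correlation between speech length and CTTR
cor(inaug_speeches_clean$num_words, inaug_speeches_clean$CTTR)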



Conclusion

This analysis presents several methods for efficiently exploring large bodies of free text, along with several approaches for quantifying that text in different ways. It is intended for exploratory use; other popular approaches could also be useful here, including sentiment analysis, named-entity recognition, and unsupervised word-embedding methods (e.g., Word2Vec). I hope this analysis gives you ideas for approaching similar problems in your own work.
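As a pointer in that direction, a minimal sentiment-analysis sketch with tidytext might look like the following (an assumption-laden sketch: get_sentiments("bing") may prompt a one-time lexicon download via the textdata package):

##sketch: net sentiment per speech using the Bing lexicon
require(tidytext)
require(dplyr)
require(tidyr)
inaug_speeches_clean %>%
  unnest_tokens(word, text_final) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(Speech, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative) %>%
  arrange(desc(net_sentiment))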