This first chunk loads all packages required to complete the analysis.
library(tidyverse)
# Startup messages (abridged): tidyverse 1.3.1 attaches ggplot2 3.3.5, purrr 0.3.4,
# tibble 3.1.6, dplyr 1.0.7, tidyr 1.1.4, stringr 1.4.0, readr 2.1.1, and forcats 0.5.1.
# All of these packages were built under R version 4.0.5.
# Conflicts: dplyr::filter() masks stats::filter(); dplyr::lag() masks stats::lag().
library(DT)
library(tidytext) # package for text analysis
library(readxl) # reads excel files, the format I used for the data
The following chunk reads in the downloaded Excel file, which contains all of the inaugural speeches sorted by name.
inaug_speeches <- read_excel("inaug_speeches.xlsx")
inaug_speeches
This chunk takes the data above and splits each speech into individual words with unnest_tokens(), giving one word per row so the words can be analyzed instead of one long string of text per speech.
inaug_words <- inaug_speeches %>%
  unnest_tokens(word, text)
inaug_words
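To make the tokenizing step concrete, here is a rough base-R sketch of what "one word per row" means. It is only an approximation written for illustration: the sentence is made up, and the splitting rule (lowercase, then split on anything that is not a letter or apostrophe) is an assumption that mimics, but is not the same as, the tokenizer tidytext uses.

```r
# Hypothetical one-sentence "speech", not taken from the data set
speech <- data.frame(author = "Example",
                     text = "We the People, in order to form a more perfect Union.",
                     stringsAsFactors = FALSE)

# Lowercase the text and split on runs of non-letter characters
# (an approximation of word tokenization, assumed for this sketch)
tokens <- strsplit(tolower(speech$text), "[^a-z']+")[[1]]

# One row per word, keeping the author alongside each word
toy_words <- data.frame(author = speech$author, word = tokens,
                        stringsAsFactors = FALSE)
toy_words
```

The result has the same shape as inaug_words: an author column repeated for every word, and a word column holding a single lowercase word per row.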
The chunk below summarizes each author's speech by total number of words, lexical diversity (number of distinct words), and lexical density (number of distinct words divided by total number of words).
inaug_words %>%
  group_by(author) %>%
  summarize(num_words = n(),
            lex_diversity = n_distinct(word),
            lex_density = n_distinct(word) / n())
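As a quick check on the three measures, the same arithmetic can be worked by hand on a tiny made-up word list. The words below are hypothetical and only base R is used, so this is a sketch of the definitions and not the analysis itself.

```r
# Hypothetical tokenized speech of eight words
words <- c("the", "people", "of", "the", "nation", "and", "the", "people")

num_words     <- length(words)            # total words: 8
lex_diversity <- length(unique(words))    # distinct words: the, people, of, nation, and = 5
lex_density   <- lex_diversity / num_words  # 5 / 8 = 0.625

c(num_words = num_words, lex_diversity = lex_diversity, lex_density = lex_density)
```

A density close to 1 means few words are repeated. Longer speeches tend to have lower density simply because common words recur more often.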
This chunk is simpler than the one above and gives the mean word length for each speech, sorted from longest to shortest.
inaug_words %>%
  group_by(author) %>%
  mutate(word_length = nchar(word)) %>%
  summarize(mean_word_length = mean(word_length)) %>%
  arrange(desc(mean_word_length))
This chunk draws a small histogram for each speech showing the distribution of word lengths.
inaug_words %>%
  mutate(word_length = nchar(word)) %>%
  ggplot(aes(word_length)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(vars(author), scales = "free_y") +
  labs(title = "Word Length By Author")
The following chunk first removes stop words, then graphs the five most common remaining words from each speech.
inaug_words %>%
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  group_by(author) %>%
  count(word, sort = TRUE) %>%
  top_n(5, n) %>%  # five most frequent words per author (ties are kept)
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = author)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "Most common words") +
  facet_wrap(vars(author), scales = "free") +
  scale_fill_viridis_d() +
  theme_minimal() +
  coord_flip()
This chunk is required by the one below it: it calculates the tf-idf of every word in each document (here, each author's speech).
inaug_word_counts <- inaug_speeches %>% # Counts each word per author
  unnest_tokens(word, text) %>%
  count(author, word, sort = TRUE)
total_words <- inaug_word_counts %>% # Counts total words per author
  group_by(author) %>%
  summarize(total = sum(n))
inaug_word_counts <- left_join(inaug_word_counts, total_words, by = "author") # Joins the two
inaug_tf_idf <- inaug_word_counts %>% # Calculates tf-idf
  bind_tf_idf(word, author, n)
inaug_tf_idf %>% # Displays it, highest tf-idf first
  arrange(desc(tf_idf))
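To show what the tf-idf numbers mean, the calculation can be reproduced by hand on a tiny made-up count table. The authors, words, and counts below are hypothetical, and the sketch uses base R only. It follows the usual definitions: tf is a word's count divided by the total words in that document, and idf is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Hypothetical word counts for two "documents" (authors A and B)
counts <- data.frame(
  author = c("A", "A", "A", "B", "B"),
  word   = c("liberty", "union", "the", "commerce", "the"),
  n      = c(2, 1, 7, 3, 7),
  stringsAsFactors = FALSE
)

# Term frequency: share of the author's words taken up by this word
counts$total <- ave(counts$n, counts$author, FUN = sum)
counts$tf    <- counts$n / counts$total

# Inverse document frequency: log(documents / documents containing the word).
# Each row is a unique author-word pair, so rows per word = documents with it.
n_docs          <- length(unique(counts$author))
docs_with_word  <- ave(counts$n, counts$word, FUN = length)
counts$idf      <- log(n_docs / docs_with_word)

counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

In this toy table "the" appears in both documents, so its idf is log(2/2) = 0 and its tf-idf is 0 despite being the most frequent word. Words used by only one author get a positive score, which is why tf-idf surfaces distinctive words and not merely common ones.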
This final chunk takes the tf-idf data from above and graphs the five words with the highest tf-idf for each President.
inaug_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(author) %>%
  top_n(5, tf_idf) %>%  # five highest tf-idf words per author (ties are kept)
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = author)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Most distinctive words in Inaugural Speeches",
       x = NULL, y = "tf-idf") +
  facet_wrap(~author, scales = "free") +
  coord_flip() +
  theme_minimal() +
  scale_fill_viridis_d()