Sentiment analysis determines the feelings or tone (sentiments) that a speaker communicates through their words. Sentiments can be binary, like positive or negative, or more specific emotions like fear, anger or optimism. For my analysis, I chose to examine Joe Biden’s and Donald Trump’s acceptance speeches from 2020 and 2016, respectively. President-elect Biden and President Trump belong to different political parties and have promised different leadership styles. However, acceptance speeches are, in general, overwhelmingly positive. This is close to common sense: who would not be optimistic after being elected president of the United States? I hypothesize that both speeches strike a similarly upbeat tone. I also anticipate that President-elect Biden’s speech will speak more about unity, while President Trump’s will speak more to his supporters. First, we will load the necessary packages for our analysis.
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
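The chunk that produced these startup messages is not shown. Judging from the messages and from the functions used later (unnest_tokens(), get_sentiments() and stop_words all come from tidytext), it was presumably something like the sketch below. The exact calls and their order are an assumption on my part.

```r
# Assumed package loads (not shown in the original post)
library(tidyverse)  # ggplot2, dplyr, tidyr, readr, tibble, purrr, stringr, forcats
library(tidytext)   # unnest_tokens(), get_sentiments(), stop_words
library(glue)       # glue(); in the versions listed above, attaching it after
                    # dplyr produces the note about 'collapse' being masked
```

Note that get_sentiments("nrc") also relies on the textdata package to download the NRC lexicon the first time it is used.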
Next, I load in their speeches and format/munge the data.
# get files from the input directory
files <- list.files("C:/Users/Owner/OneDrive/Documents/DA320/Module5/Chapter19/Speeches/")
# build the full path to the first file (Biden's speech); glue() concatenates its unnamed arguments
BidenSpeech <- glue("C:/Users/Owner/OneDrive/Documents/DA320/Module5/Chapter19/Speeches/", files[1])
# remove any leading or trailing whitespace from the path
BidenSpeech <- trimws(BidenSpeech)
# read the speech in as a single string
BidenText <- glue(read_file(BidenSpeech))
# remove dollar signs
BidenText <- gsub("\\$", "", BidenText)
# tokenize the data
# Tokens are individual words, and tokenization is the process of splitting text into tokens, much like parsing
Bidentokens <- tibble(text = BidenText) %>% unnest_tokens(word, text)
# Now for Trump
# build the full path to the second file (Trump's speech); glue() concatenates its unnamed arguments
TrumpSpeech <- glue("C:/Users/Owner/OneDrive/Documents/DA320/Module5/Chapter19/Speeches/", files[2])
# remove any leading or trailing whitespace from the path
TrumpSpeech <- trimws(TrumpSpeech)
# read the speech in as a single string
TrumpText <- glue(read_file(TrumpSpeech))
# remove dollar signs
TrumpText <- gsub("\\$", "", TrumpText)
# tokenize the data
# Tokens are individual words, and tokenization is the process of splitting text into tokens, much like parsing
Trumptokens <- tibble(text = TrumpText) %>% unnest_tokens(word, text)
# Create vector of stop words
stop_words_Vec <- as.vector(stop_words[[1]])
# Create vector of Trump words
trumpwords <- as.vector(Trumptokens[[1]])
# Keep only the words that are not stop words (vectorized, so no loop is needed)
non_stop_trump <- trumpwords[!(trumpwords %in% stop_words_Vec)]
# Save vector back as data frame
non_stop_trump_df <- data.frame(non_stop_trump)
# rename the first column to word
non_stop_trump_df <- non_stop_trump_df %>%
  rename(word = non_stop_trump)
# For Trump words without stop words, group by word, count occurrences, and arrange in descending order
non_stop_trump_df_2 <- non_stop_trump_df %>%
  group_by(word) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
# Keep only the words Trump used at least 4 times
non_stop_trump_df_2 <- non_stop_trump_df_2 %>%
  filter(count >= 4)
# bar plot of Trump top words, sorted by descending count
trump_col <- ggplot(non_stop_trump_df_2, aes(x = reorder(word, -count), y = count)) +
  geom_col(fill = "red", colour = "grey8") +
  labs(title = "President Trump Top 20 Words") + xlab("Word") + ylab("# Times Used")
# Now Biden
# Create vector of Biden words
Bidenwords <- as.vector(Bidentokens[[1]])
# Keep only the words that are not stop words (vectorized, so no loop is needed)
non_stop_Biden <- Bidenwords[!(Bidenwords %in% stop_words_Vec)]
# Save vector back as data frame
non_stop_Biden_df <- data.frame(non_stop_Biden)
# rename the first column to word
non_stop_Biden_df <- non_stop_Biden_df %>%
  rename(word = non_stop_Biden)
# For Biden words without stop words, group by word, count occurrences, and arrange in descending order
non_stop_Biden_df2 <- non_stop_Biden_df %>%
  group_by(word) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
# Keep only the words Biden used at least 5 times
non_stop_Biden_df_2 <- non_stop_Biden_df2 %>%
  filter(count >= 5)
# bar plot of Biden top words, sorted by descending count
Biden_col <- ggplot(non_stop_Biden_df_2, aes(x = reorder(word, -count), y = count)) +
  geom_col(fill = "blue", colour = "grey8") +
  labs(title = "Joe Biden Top 21 Words") + xlab("Word") + ylab("# Times Used")
Our speeches are correctly loaded and ready for analysis. Note that we have 1721 tokens (words) for Biden and 1617 tokens (words) for Trump. First, let us look at their most-used words. Before creating these plots, I filtered out the stop words defined by the tidytext package. In case you are unfamiliar with them, stop words are very common words, such as ‘the’, ‘and’ and ‘of’, that carry little meaning on their own.
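To make the stop-word step concrete, here is a minimal sketch on toy data. The tokens and the three-word stop list are made up for illustration; the real analysis uses the stop_words table from tidytext. It uses dplyr’s anti_join(), which is one common alternative to filtering a vector by hand.

```r
library(dplyr)

# Toy tokens standing in for a tokenized speech (hypothetical data)
toy_tokens <- tibble(word = c("the", "people", "of", "this", "nation", "the", "people"))

# A tiny hand-made stop list; the real analysis uses tidytext::stop_words
toy_stop_words <- tibble(word = c("the", "of", "this"))

# anti_join() keeps only the rows of toy_tokens with no match in toy_stop_words,
# then count() tallies each remaining word, most frequent first
toy_counts <- toy_tokens %>%
  anti_join(toy_stop_words, by = "word") %>%
  count(word, sort = TRUE, name = "count")

print(toy_counts)
```

Only ‘people’ and ‘nation’ survive the filter, which is the same kind of table the bar plots are built from.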
The first two plots do little to help us understand the speeches. However, some words are commonly associated with each candidate. Words like ‘fantastic’ and ‘guy’ sound like President Trump, while ‘folks’ and ‘people’ are reminiscent of President-elect Biden.
Next, we need to take our words and analyze their sentiment. I chose to do this in two separate ways. First, using the ‘bing’ lexicon, I split words into either positive or negative identifiers. For this, I expect Biden and Trump to be relatively similar.
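Before running this on the real speeches, here is a minimal sketch of how binary lexicon scoring works. The five-word lexicon and the sentence are invented for illustration and are not taken from the bing lexicon itself; net_sentiment is a name I chose for the positive-minus-negative score.

```r
library(dplyr)

# Hypothetical mini-lexicon in the same shape as get_sentiments("bing")
toy_lexicon <- tibble(
  word      = c("great", "hope", "win", "fear", "lost"),
  sentiment = c("positive", "positive", "positive", "negative", "negative")
)

# Hypothetical tokenized sentence
toy_speech <- tibble(word = c("we", "have", "great", "hope", "and", "no", "fear", "great"))

# inner_join() drops every token that is not in the lexicon,
# then we count each label and derive the two summary measures
toy_score <- toy_speech %>%
  inner_join(toy_lexicon, by = "word") %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative")
  ) %>%
  mutate(
    net_sentiment    = positive - negative,
    percent_positive = positive / (positive + negative)
  )

print(toy_score)
```

Words outside the lexicon (‘we’, ‘have’, ‘and’, ‘no’) simply do not count, so the score reflects only the share of recognized words that are positive. Note that ‘no fear’ still registers as negative, because this approach looks at single words and ignores negation.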
# get the sentiment from the first text:
Bidentokens %>%
  inner_join(get_sentiments("bing")) %>% # keep only words found in the bing lexicon, a binary positive/negative identifier
  count(sentiment) %>% # count the number of positive and negative words
  spread(sentiment, n, fill = 0) %>% # make the data wide rather than narrow
  mutate(sentiment = positive - negative) %>% # net sentiment: positive count minus negative count
  mutate(percent_positive = positive / (positive + negative)) # share of sentiment words that are positive
## Joining, by = "word"
## # A tibble: 1 x 4
##   negative positive sentiment percent_positive
##      <dbl>    <dbl>     <dbl>            <dbl>
## 1       24       88        64            0.786
# get the sentiment from the second text:
Trumptokens %>%
  inner_join(get_sentiments("bing")) %>% # keep only words found in the bing lexicon, a binary positive/negative identifier
  count(sentiment) %>% # count the number of positive and negative words
  spread(sentiment, n, fill = 0) %>% # make the data wide rather than narrow
  mutate(sentiment = positive - negative) %>% # net sentiment: positive count minus negative count
  mutate(percent_positive = positive / (positive + negative)) # share of sentiment words that are positive
## Joining, by = "word"
## # A tibble: 1 x 4
##   negative positive sentiment percent_positive
##      <dbl>    <dbl>     <dbl>            <dbl>
## 1       26      125        99            0.828
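The two percent_positive values can be compared directly from the counts printed above. The short calculation below is only a worked check of that arithmetic; the variable names are mine.

```r
# Counts copied from the two bing outputs above
biden_pos <- 88;  biden_neg <- 24
trump_pos <- 125; trump_neg <- 26

# Share of sentiment-bearing words that are positive
biden_share <- biden_pos / (biden_pos + biden_neg)
trump_share <- trump_pos / (trump_pos + trump_neg)

# Absolute gap, in percentage points
gap_points <- (trump_share - biden_share) * 100

# Relative gap, as a percentage of Biden's share
relative_gap <- (trump_share / biden_share - 1) * 100

round(c(biden = biden_share, trump = trump_share), 3)
round(c(gap_points = gap_points, relative_gap = relative_gap), 1)
```

The gap of roughly 4.2 is a difference in percentage points. Expressed relative to Biden’s share, the difference is a little over 5 percent.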
Sweet! So Trump’s speech was about 4.2 percentage points more positive than Biden’s (82.8% versus 78.6%). I would have guessed Biden would be more positive, but overall, the numbers are close. Next, I use the ‘NRC’ lexicon. This lexicon splits words into more specific categories: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise and trust. I expect Biden to score higher in trust and surprise and Trump to score higher in anticipation and joy.
# get the NRC sentiment from the first text:
Bidentokens %>%
  inner_join(get_sentiments("nrc")) %>% # keep only words found in the NRC lexicon
  count(sentiment) %>% # count the number of words in each NRC category
  spread(sentiment, n, fill = 0) # make the data wide rather than narrow
## Joining, by = "word"
## # A tibble: 1 x 10
##   anger anticipation disgust  fear   joy negative positive sadness surprise
##   <dbl>        <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1    21           65       6    24    55       33      132      22       18
## # ... with 1 more variable: trust <dbl>
# get the NRC sentiment from the second text:
Trumptokens %>%
  inner_join(get_sentiments("nrc")) %>% # keep only words found in the NRC lexicon
  count(sentiment) %>% # count the number of words in each NRC category
  spread(sentiment, n, fill = 0) # make the data wide rather than narrow
## Joining, by = "word"
## # A tibble: 1 x 10
##   anger anticipation disgust  fear   joy negative positive sadness surprise
##   <dbl>        <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1     5           32       8    10    32       37       83      12       11
## # ... with 1 more variable: trust <dbl>
Biden scored higher in nearly all categories by raw count; Trump edged him out only in disgust and negative. In particular, Biden was notably higher in trust, anticipation and positive by NRC calculations. This could be partially due to Biden’s vocabulary, but also due to the current political landscape that President-elect Biden finds himself in. More uncertain times may lead to more direct speeches. Trump also has a history of using broader, more layman’s terms compared to the President-elect.
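Because the two speeches differ slightly in length, raw counts can be put on a common scale by expressing them per 1,000 tokens. The sketch below does this with the nine categories visible in the printed output. The trust column was truncated in that output, so it is left out here rather than guessed, and using the total token counts quoted earlier (1721 and 1617) as the denominators is my assumption.

```r
# Category counts copied from the two NRC outputs above (trust is not shown there)
categories   <- c("anger", "anticipation", "disgust", "fear", "joy",
                  "negative", "positive", "sadness", "surprise")
biden_counts <- c(21, 65, 6, 24, 55, 33, 132, 22, 18)
trump_counts <- c(5, 32, 8, 10, 32, 37, 83, 12, 11)

# Total tokens per speech, as reported earlier in the post
biden_tokens <- 1721
trump_tokens <- 1617

# Rate of each category per 1,000 tokens
nrc_rates <- data.frame(
  category       = categories,
  biden_per_1000 = round(biden_counts / biden_tokens * 1000, 1),
  trump_per_1000 = round(trump_counts / trump_tokens * 1000, 1)
)

print(nrc_rates)
```

Since the speeches are close in length, the per-1,000 rates tell the same story as the raw counts: Biden leads in most categories, while Trump remains ahead in disgust and negative.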
That’s all for this week. Thanks for reading!
Chris