The United States is currently in the midst of its 59th presidential election, with Donald Trump and Joe Biden as the Republican and Democratic candidates, respectively. On September 29th, 2020, the first presidential debate of this election aired on television. Presidential debates are important for the election since they allow the nation to see the candidates speak on important topics and current issues, and a candidate's 'performance' can have either a positive or negative impact on the election outcome. With Joe Biden and Donald Trump constantly being critiqued in the media, I thought it would be interesting and relevant to analyze the first 2020 presidential debate. In this project, I will analyze which candidate spoke more words, as well as the sentiment of each speaker.
Compare the word count, word choice, and sentiment of each presidential candidate using the transcript of the first 2020 presidential debate, which took place on September 29th, 2020.
First, I located a transcript of the presidential debate at https://www.rev.com/blog/transcripts/donald-trump-joe-biden-1st-presidential-debate-transcript-2020. Then, I converted the transcript to a text file, cleaned the data, and split the transcript into two files, one for Joe Biden and one for Donald Trump. I removed the words spoken by Chris Wallace, the debate's moderator, since his speech would hinder the analysis of each candidate and is not needed for this report. Once I had the transcript for each candidate in a plain text file, I imported the files into R and began my analysis.
First, I imported the packages that would allow me to conduct my analysis.
library(tidyverse)
library(tidytext)
library(ggthemes)
library(wordcloud2)
library(textdata)
library(gridExtra)
library(readr)
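Next, the cleaned transcripts need to be read into R. Below is a minimal sketch of that step, assuming the two plain text files are named 'Biden_Debate.txt' and 'Trump_Debate.txt' (hypothetical file names) and putting each line of text into a column called X1, which the tokenization code further down expects.

# Hypothetical file names; each line of the transcript becomes one row in column X1
Biden_DebateText <- tibble(X1 = read_lines("Biden_Debate.txt"))
Trump_DebateText <- tibble(X1 = read_lines("Trump_Debate.txt"))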
Using these packages and my two data sets, 'Biden_DebateText' and 'Trump_DebateText', I started by tokenizing the text into individual words. Once I had done this for both Joe Biden and Donald Trump, I counted the number of words each candidate said to get each total word count.
# Split Biden's transcript into one word per row, normalizing curly apostrophes
Biden_DebateText %>%
  unnest_tokens(word, X1) %>%
  mutate(word = gsub("\u2019", "'", word)) -> biden_words

# Total number of words Biden spoke
count(biden_words)
| n |
|---|
| 6678 |
# Same steps for Trump's transcript
Trump_DebateText %>%
  unnest_tokens(word, X1) %>%
  mutate(word = gsub("\u2019", "'", word)) -> trump_words

# Total number of words Trump spoke
count(trump_words)
| n |
|---|
| 7272 |
This shows that, according to this transcript, Donald Trump spoke more words at the presidential debate: Donald Trump spoke a total of 7,272 words and Joe Biden spoke a total of 6,678 words.
With this information, multiple questions arise. Is it fair that Donald Trump spoke more even though each candidate was supposed to have two minutes at a time to talk? Is there a reason why Donald Trump spoke more than Joe Biden? Does this mean Donald Trump interrupted Joe Biden more frequently, causing Joe Biden to talk less? These questions cannot be answered directly using this data, but we can conclude that Donald Trump spoke nearly 600 more words than Joe Biden did at the first debate.
Next, for each candidate I removed the stop words from their transcripts. Stop words are words that are said frequently but provide little information useful for analysis; some examples are "I" and "the". Removing the stop words lets us get a clearer look at the meaningful words each candidate said during the debate, i.e., it removes filler words. To remove the stop words, we can use anti_join().
# Drop stop words and transcript artifacts, then tally each remaining word
biden_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% c("crosstalk", "00")) -> filter_bidenwords

trump_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% c("crosstalk", "00")) -> filter_trumpwords
It is important to remove 'crosstalk' and '00' from both candidates' transcripts. During the presidential debate, there were many instances where both candidates, Trump and Biden, were speaking at the same time. This is represented in the transcript as "Crosstalk [time]", with 'time' being when the overlapping speech occurred. Since it was impossible to decipher what each candidate was saying at those moments, only the "Crosstalk [time]" marker was included in the transcript. This is why we remove all instances of 'crosstalk' and '00': the candidates never actually said "crosstalk", so dropping these markers leaves only the words they truly spoke.
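As a quick sanity check, one could count how many crosstalk markers landed in each candidate's token list before filtering them out; a minimal sketch (output not shown here):

# How many crosstalk markers appear in each candidate's tokens (output not shown)
biden_words %>% filter(word == "crosstalk") %>% count()
trump_words %>% filter(word == "crosstalk") %>% count()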
Since we have filtered out "crosstalk" and "00", let's re-examine the counts for each candidate and see whether this makes any difference.
count(filter_bidenwords)
| n |
|---|
| 901 |
count(filter_trumpwords)
| n |
|---|
| 823 |
Here, we are examining the number of distinct words each candidate spoke, since the filtered tables now contain one row per word. These results are interesting: Joe Biden said 901 distinct words, while Donald Trump said 823.
From these results, one might infer that Joe Biden used a larger variety of vocabulary during the debate, and that Donald Trump repeated the same words more often, resulting in a smaller number of distinct words. One way to quantify this is a type-token ratio, sketched below.
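A type-token ratio divides the number of distinct words by the total number of (non-stop) words spoken; a higher value suggests a more varied vocabulary. A minimal sketch (output not shown):

# Type-token ratio: distinct words / total word occurrences (output not shown)
nrow(filter_bidenwords) / sum(filter_bidenwords$n)
nrow(filter_trumpwords) / sum(filter_trumpwords$n)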
Now, let's look at the word clouds for each candidate. This visualization shows the frequency of each word the candidate said: the larger a word appears, the more times it was spoken. The tables below list each candidate's ten most frequent words.
filter_bidenwords %>%
wordcloud2()
| word | n |
|---|---|
| people | 75 |
| deal | 24 |
| president | 22 |
| vote | 21 |
| true | 20 |
| plan | 16 |
| american | 15 |
| tax | 15 |
| jobs | 13 |
| covid | 12 |
filter_trumpwords %>%
wordcloud2()
| word | n |
|---|---|
| people | 67 |
| joe | 32 |
| country | 24 |
| left | 20 |
| million | 20 |
| dollars | 19 |
| president | 19 |
| election | 17 |
| lot | 16 |
| law | 15 |
Now, let's visualize the top 10 words said by each candidate using a bar graph.
filter_bidenwords %>%
  head(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#000099") + coord_flip() + theme_economist() +
  ggtitle("Biden's 10 Most Frequent Words :\n2020 Presidential Debate") +
  xlab("Word") + ylab("Count") +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5)
Here we can see that Biden's most frequently spoken word was 'people'. Frequent words worth pointing out are 'vote' and 'covid'. The word 'covid' is extremely relevant since it refers to the pandemic the whole world is currently dealing with, and it is important for our leaders to discuss this topic. The word 'vote' is also important since many people are encouraging everyone to vote in this election.
filter_trumpwords %>%
  head(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#FF9999") + coord_flip() + theme_economist() +
  ggtitle("Trump's 10 Most Frequent Words :\n2020 Presidential Debate") +
  xlab("Word") + ylab("Count") +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5)
This shows that Trump's most frequently spoken word was 'people' as well. It is interesting that Donald Trump's second most frequent word is 'joe', the first name of his opposing candidate; Donald Trump frequently referenced Joe Biden during this debate. Another frequent word to point out is 'left', which we can assume Donald Trump used to reference the 'left wing' in politics.
Looking at the top 10 words of each candidate, we see an overlap in the words 'people' and 'president'. These results make sense since this is a presidential debate and the audience is the American people. It is also interesting to see 'covid' in Biden's top 10 words but not in Donald Trump's. Is this because Donald Trump refers to the virus by another name or term, or is he avoiding the topic? It is also interesting that Donald Trump frequently says 'Joe' in the debate, whereas Joe Biden does not frequently say 'Donald'.
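To probe whether Trump simply used a different term for the virus, one could search his tokens for other pandemic-related words. A sketch, with a term list that is purely my own guess (output not shown):

# Did Trump use other pandemic-related terms instead of "covid"? (term list is a guess)
trump_words %>%
  filter(word %in% c("covid", "coronavirus", "virus", "pandemic", "plague")) %>%
  count(word, sort = TRUE)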
Now, let's examine the sentiment of each candidate, that is, how positive or negative their words are. There are three different lexicons, or dictionaries, we can use to analyze the sentiment of a text: "afinn", "bing", and "nrc". Each has its own pros and cons, which is why we will use all three to get a more complete sentiment analysis.
First, we will start with the afinn lexicon. Afinn measures the sentiment of a word on an integer scale ranging from -5, the most negative, to +5, the most positive. On this scale, 0 represents a neutral word.
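To get a feel for this lexicon, one can preview a few entries and confirm the range of its scores; a quick sketch (output not shown):

# Preview the AFINN lexicon and confirm its -5 to +5 score range
get_sentiments("afinn") %>% head()
range(get_sentiments("afinn")$value)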
Let's start by taking all of the words Biden said, minus the stop words, and joining them with the afinn lexicon so that each word receives a score. From there we can compute an average score, and then separate the words into positive and negative groups and count their frequencies.
# Keep Biden's non-stop words that appear in the AFINN lexicon, with their scores
biden_words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) -> biden_afinn

# Average sentiment score across Biden's scored words
mean(biden_afinn$value)
## [1] -0.1584507
Here, we can see that Biden's average afinn score is about -0.158. This shows that, according to the afinn lexicon, Joe Biden had a slightly negative average sentiment during this debate.
Now, let’s examine just Joe Biden’s negative words according to the afinn lexicon.
biden_afinn %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  head(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#FF9999") + theme_economist() + coord_flip() +
  ggtitle("Biden's Most \n Frequent Negative \n Words") +
  xlab("Word") + ylab("Count") +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> biden_afinn_neg_graph

biden_afinn %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
| word | n |
|---|---|
| discredited | 7 |
| wrong | 6 |
| killed | 5 |
| violence | 5 |
| died | 4 |
| poor | 4 |
| recession | 4 |
| violent | 4 |
| crime | 3 |
| crisis | 3 |
Now, just as we did before, let's examine Joe Biden's most frequent positive words according to afinn.
biden_afinn %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#000099") + theme_economist() + coord_flip() +
  ggtitle("Biden's Most \n Frequent Positive \n Words") +
  xlab("Word") + ylab("Count") +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> biden_afinn_pos_graph

biden_afinn %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
| word | n |
|---|---|
| true | 20 |
| care | 10 |
| united | 8 |
| matter | 6 |
| support | 5 |
| yeah | 5 |
| god | 4 |
| trust | 4 |
| accepted | 3 |
| fine | 3 |
grid.arrange(biden_afinn_neg_graph, biden_afinn_pos_graph, ncol = 2, top="Joe Biden: Afinn Lexicon")
It is interesting that Joe Biden's most frequent negative word, according to the afinn lexicon, is 'discredited'. This word is important to consider since everything each candidate says is being fact checked, and it is important for all candidates to speak the truth and cite credible sources.
Now that we examined Joe Biden, let’s examine Donald Trump.
# Same join for Trump's words and the AFINN lexicon
trump_words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) -> trump_afinn

mean(trump_afinn$value)
## [1] -0.4247104
Here, we can see that Trump's average afinn score is -0.4247104. Compared to Joe Biden's average afinn score, Trump's mean is more negative, meaning that, on average, Trump's words were more negative than Joe Biden's. Again, this uses the afinn lexicon, where words are rated on a scale from -5 to +5.
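To check whether this gap in average scores is more than noise, one could compare the two score distributions directly. A rough sketch using a two-sample t-test; note that words from the same speech are not truly independent samples, so this is only a sanity check:

# Rough check on the difference in mean AFINN score between the candidates
t.test(biden_afinn$value, trump_afinn$value)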
Now, let’s examine just Donald Trump’s negative words according to the afinn lexicon.
trump_afinn %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  head(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#FF9999") + theme_economist() + coord_flip() +
  ggtitle("Trump's Most \n Frequent Negative \n Words") +
  xlab("Word") + ylab("Count") +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> trump_afinn_neg_graph

trump_afinn %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
| word | n |
|---|---|
| wrong | 10 |
| lost | 8 |
| bad | 7 |
| died | 6 |
| disaster | 6 |
| excuse | 6 |
| leave | 5 |
| fraud | 4 |
| lowest | 4 |
| racist | 3 |
Now, just as we did before, let's examine Donald Trump's most frequent positive words according to afinn.
trump_afinn %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#000099") + theme_economist() + coord_flip() +
  ggtitle("Trump's Most \n Frequent Positive \n Words") +
  xlab("Word") + ylab("Count") +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> trump_afinn_pos_graph

trump_afinn %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
| word | n |
|---|---|
| won | 9 |
| care | 6 |
| super | 5 |
| support | 5 |
| true | 5 |
| agreed | 4 |
| happy | 4 |
| top | 4 |
| agree | 3 |
| fair | 3 |
grid.arrange(trump_afinn_neg_graph, trump_afinn_pos_graph, ncol = 2, top="Donald Trump: Afinn Lexicon")
Overall, the two candidates' negative sentiment words are similar. Both candidates used 'wrong' multiple times, as well as words related to death and violence.
Now, we can use the NRC lexicon to analyze the sentiment of both candidates again. NRC is a lexicon in which words are sorted into eight basic emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust. Words are also sorted into negative and positive sentiment. It is important to remember that the 'positive' and 'negative' categories will generally have larger counts, since most emotion-tagged words also carry a positive or negative tag.
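To see these categories and how many lexicon words fall into each, a quick sketch (output not shown):

# Number of NRC lexicon entries per emotion/sentiment category
get_sentiments("nrc") %>% count(sentiment, sort = TRUE)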
First, we will get the sentiments of each candidate using inner_join(). Then, we can count how often each sentiment occurs. We can then plot the top 10 for each candidate side by side for an easier comparison.
# Join each candidate's filtered words with the NRC lexicon, then tally sentiments
filter_bidenwords %>%
  inner_join(get_sentiments("nrc")) -> biden_nrc

biden_nrc %>%
  count(sentiment, sort = TRUE) %>%
  head(10) -> biden_top_ten

filter_trumpwords %>%
  inner_join(get_sentiments("nrc")) -> trump_nrc

trump_nrc %>%
  count(sentiment, sort = TRUE) %>%
  head(10) -> trump_top_ten
ggplot(trump_top_ten, aes(reorder(sentiment, n), n)) +
  geom_col(fill = "#FF9999") + coord_flip() +
  xlab("Sentiment") + ylab("Count") + ggtitle("Trump NRC") + theme_economist() +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> trump_nrc_plot

ggplot(biden_top_ten, aes(reorder(sentiment, n), n)) +
  geom_col(fill = "#000099") + coord_flip() +
  xlab("Sentiment") + ylab("Count") + ggtitle("Biden NRC") + theme_economist() +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> biden_nrc_plot
grid.arrange(biden_nrc_plot, trump_nrc_plot, ncol = 2)
Comparing both candidates' NRC sentiment analyses side by side shows that 'positive' and 'negative' are the most frequent categories for both. It is important to remember that nearly every word is represented in one of these two categories, hence the outsized numbers. It is also worth recognizing that the emotion 'trust' is frequent for both candidates. This is interesting since we must trust that these candidates will follow through with the plans and actions they propose.
Overall, both candidates have similar results under the NRC lexicon, driven by the overwhelming frequencies of the positive and negative categories. Both candidates had positive as their most frequent sentiment; however, Donald Trump's negative count was a close second.
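Since 'positive' and 'negative' dominate the counts, one could also drop those two categories and compare the eight emotions on their own; a minimal sketch (output not shown):

# Compare only the eight NRC emotions, dropping the broad positive/negative bins
biden_nrc %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  count(sentiment, sort = TRUE)

trump_nrc %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  count(sentiment, sort = TRUE)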
Lastly, we can use the Bing lexicon to analyze the sentiment of both candidates one more time. Bing is a lexicon that categorizes words in a binary fashion into positive and negative categories. To use this lexicon, we will first group by sentiment and then count the number of words considered negative and positive.
filter_bidenwords %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(sentiment) %>%
  count() %>%
  ggplot(aes(sentiment, n, fill = sentiment)) + geom_col() + coord_flip() +
  ylab("Count") + ggtitle("Biden Bing Sentiment Analysis") +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> bidenBing
filter_bidenwords %>%
  inner_join(get_sentiments("bing")) %>%
  filter(sentiment %in% "negative") %>%
  arrange(desc(n)) %>%
  head(10) -> biden_bing_negative

ggplot(biden_bing_negative, aes(reorder(word, n), n)) +
  geom_col(fill = "#FF9999") + coord_flip() + xlab("Word") + ylab("Count") +
  ggtitle("Biden's Top 10 Most \n NEGATIVE Words: \n Bing Lexicon") + theme_economist() +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> biden_bing1

filter_bidenwords %>%
  inner_join(get_sentiments("bing")) %>%
  filter(sentiment %in% "positive") %>%
  arrange(desc(n)) %>%
  head(10) -> biden_bing_positive

ggplot(biden_bing_positive, aes(reorder(word, n), n)) +
  geom_col(fill = "#000099") + coord_flip() + xlab("Word") + ylab("Count") +
  ggtitle("Biden's Top 10 Most \n POSITIVE Words: \n Bing Lexicon") + theme_economist() +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> biden_bing2
grid.arrange(biden_bing1, biden_bing2, ncol = 2)
Looking at this analysis for Joe Biden, it is interesting to see the word 'vice' categorized as negative. This stands out because Joe Biden was the Vice President during the Obama administration, and the audience likely would not consider 'vice' negative in that context. Discrepancies like this are one reason we consider all three lexicons in this analysis. The word 'support' is also notable: it makes sense for Biden to use it, since candidates are asking the American people to support them and vote for them.
Now, let’s do the same for Donald Trump so we can compare the two candidates using the same sentiment lexicon.
filter_trumpwords %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(sentiment) %>%
  count() %>%
  ggplot(aes(sentiment, n, fill = sentiment)) + geom_col() + coord_flip() +
  ylab("Count") + ggtitle("Trump Bing Sentiment Analysis") +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> trumpBing
filter_trumpwords %>%
  inner_join(get_sentiments("bing")) %>%
  filter(sentiment %in% "negative") %>%
  arrange(desc(n)) %>%
  head(10) -> trump_bing_negative

ggplot(trump_bing_negative, aes(reorder(word, n), n)) +
  geom_col(fill = "#FF9999") + coord_flip() + xlab("Word") + ylab("Count") +
  ggtitle("Trump's Top 10 Most \n NEGATIVE Words: \n Bing Lexicon") + theme_economist() +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> trump_bing1

filter_trumpwords %>%
  inner_join(get_sentiments("bing")) %>%
  filter(sentiment %in% "positive") %>%
  # Drop "trump": Bing scores the verb "trump" as positive, but here it is a surname
  filter(!word %in% c("trump")) %>%
  arrange(desc(n)) %>%
  head(10) -> trump_bing_positive

ggplot(trump_bing_positive, aes(reorder(word, n), n)) +
  geom_col(fill = "#000099") + coord_flip() + xlab("Word") + ylab("Count") +
  ggtitle("Trump's Top 10 Most \n POSITIVE Words: \n Bing Lexicon") + theme_economist() +
  geom_text(aes(label = n), hjust = 1.5, vjust = 0, color = "white", size = 3.5) -> trump_bing2
grid.arrange(trump_bing1, trump_bing2, ncol = 2)
Looking at Donald Trump's analysis side by side, it is clear that Trump had a stronger negative sentiment due to the larger negative word counts. It is important to consider how Donald Trump used the word 'radical' and whether it truly carries a negative sentiment. Most likely, we can assume that Trump used this word to describe the 'radical left'. It is then up to one's own opinion and judgment as to the word's sentiment in that context.
Now that we have compared each candidate's most positive and negative words side by side, let's compare the two candidates' negative words against each other, and then do the same for their positive words.
Starting with negative words, we can graph both candidates side by side for simple comparison.
grid.arrange(trump_bing1, biden_bing1, ncol = 2)
Here we can see that both candidates' most frequent negative word, according to this lexicon, is 'wrong'. Both candidates also used the word 'died' frequently, which we can assume is due to the pandemic.
Now we can do the same for positive words.
grid.arrange(trump_bing2, biden_bing2, ncol = 2)
Here we can see that both candidates used the word 'support' frequently. It is also interesting to compare the lengths of these most frequent positive words. One could observe that Donald Trump uses shorter words such as "won", "top", and "fair", whereas Joe Biden uses longer words such as "affordable", "significant", "advantage", and "peaceful".
Now, let’s examine the word count of positive and negative words for each candidate side by side.
grid.arrange(trumpBing, bidenBing, ncol = 2)
Here, we can see that Biden used slightly more negative words than Donald Trump, according to the Bing lexicon. However, it is important to note that Trump also spoke fewer positive words than Joe Biden. A similarity between the two candidates is that they both spoke with a more negative sentiment overall.
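To put a single number on this comparison, one could compute each candidate's share of negative words under Bing, this time weighting by how often each word was said; a sketch (output not shown):

# Share of sentiment-bearing word occurrences that Bing labels negative
filter_bidenwords %>%
  inner_join(get_sentiments("bing")) %>%
  summarize(negative_share = sum(n[sentiment == "negative"]) / sum(n))

filter_trumpwords %>%
  inner_join(get_sentiments("bing")) %>%
  summarize(negative_share = sum(n[sentiment == "negative"]) / sum(n))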
Using the Bing lexicon allows us to analyze the sentiment of each candidate with a different dictionary than afinn and nrc.
Due to the political climate we are currently in, I felt it was relevant and necessary to analyze the first 2020 presidential debate between Joe Biden and Donald Trump. I wanted to determine who spoke more, as well as the sentiment of each speaker.
I was first able to conclude that Donald Trump spoke roughly 600 more words than Joe Biden. However, it is important to take into consideration the words missing from this transcript due to the inability to decipher who said what when the candidates were speaking at the same time.
After filtering out the stop words, or filler words, I concluded that Joe Biden spoke more distinct words than Donald Trump. We can infer that Joe Biden used a larger vocabulary, whereas Donald Trump used the same words more frequently.
After analyzing the word count of each candidate, I conducted a sentiment analysis to see whether either candidate was more positive or negative than the other. Overall, the "afinn", "bing", and "nrc" lexicons yielded similar results for both candidates. One can argue this is a result of both candidates being asked the same questions and being restricted to the same topics. Since both candidates were discussing the same topics and questions given by the moderator, Chris Wallace, it makes sense that their sentiments are similar. To analyze the sentiment of both candidates further, it would be interesting to examine each candidate's rallies, where no moderator questions or assigned talking points constrain what they say.
This report could also be extended once the future presidential debates take place. One could then analyze all three debates to provide an even more in-depth analysis, thanks to the increase in available text.