Using NLP techniques, what can we learn about the debate between Trump and Harris?
First, we will review the most frequently spoken words in both the raw text and the cleaned text, which removes stop words (the, and, if, etc.).
trump_word_count_no_stop <- trump_tidy_speech %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!grepl("^\\d+$", word), !grepl(",", word))
## Joining with `by = join_by(word)`
ggplot(data=trump_word_count_no_stop %>% head(20)) +
geom_bar(aes(y=reorder(word, n), x=n), fill="#56B4E9", stat="identity")+
theme_minimal() +
labs(y="Cleaned Words", x="Count", title="Cleaned Word Count: Trump")
harris_speech <- read_file("assets/harris.txt")
harris_speech_df <- tibble(line=1, text=harris_speech)
harris_tidy_speech <- harris_speech_df %>%
unnest_tokens(word, text)
harris_word_count_with_stop <- harris_tidy_speech %>% count(word, sort = TRUE)
ggplot(data=harris_word_count_with_stop %>% head(10)) +
geom_bar(aes(y=reorder(word, n), x=n), fill = "lightgreen", stat="identity")+
theme_minimal()+
labs(x="Count", y="Raw Words", title="Raw Word Count: Harris")
harris_word_count_no_stop <- harris_tidy_speech %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!grepl("^\\d+$", word))
## Joining with `by = join_by(word)`
harris_word_count_no_stop %>% select(n) %>% sum()
## [1] 1997
ggplot(data=harris_word_count_no_stop %>% head(20)) +
geom_bar(aes(y=reorder(word, n), x=n), fill="lightgreen", stat="identity")+
theme_minimal()+
labs(y="Cleaned Words", x="Count", title="Cleaned Word Count: Harris")
The top words from each candidate’s cleaned word counts provide insight
into the themes and focal points of their speeches:
This shows Trump’s speech focused on grand themes of leadership, the people, and international/national concerns, with an emphasis on scale (millions, billions).
Harris’ top words indicate a focus on leadership, social issues, and perhaps a critique or contrast of Trump, while maintaining a tone of unity and care.
This word frequency highlights how both candidates shaped their messages around leadership but approached their audiences from different thematic perspectives.
total_words_compare <- tibble(
speaker=c("Trump", "Harris"),
total_words=c(trump_word_count_with_stop %>% select(n) %>% sum(),
harris_word_count_with_stop %>% select(n) %>% sum())
)
observed <- c(8118, 5950)
chisq_test <- chisq.test(observed)
chi_squared_results <- data.frame(
Statistic = chisq_test$statistic,
P_Value = formatC(chisq_test$p.value, format = "e", digits = 3),
Degrees_of_Freedom = chisq_test$parameter
)
stargazer::stargazer(chi_squared_results, summary = FALSE, type = "text", digits = 3)
##
## ================================================
## Statistic P_Value Degrees_of_Freedom
## ------------------------------------------------
## X-squared 334.107 1.225e-74 1
## ------------------------------------------------
The result of the chi-squared test indicates a highly significant difference between the word counts of Trump and Harris. Here’s how to interpret the output:
X-squared = 334.11: This is the test statistic. A value this large indicates that the observed counts (8118 for Trump, 5950 for Harris) deviate substantially from what would be expected under the null hypothesis, which for a single vector of counts passed to chisq.test() is that both categories are equally likely.
df = 1: Degrees of freedom, which is 1 here because two categories (Trump vs. Harris) are being compared.
p-value = 1.225e-74: The p-value is extremely small, far below any common significance level (such as 0.05 or 0.01), so the difference in word counts is statistically significant.
This result strongly suggests that the difference in the total number of words spoken by Trump and Harris is not due to random chance. Under the hypothesis that both speakers would speak roughly the same number of words, that hypothesis is rejected based on this p-value.
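As a rough check on the reported statistic, the goodness-of-fit calculation can be reproduced by hand. The sketch below uses base R only and assumes the default behaviour of `chisq.test()` on a single count vector (equal expected shares); the names `expected`, `manual_statistic`, `manual_df`, and `manual_p` are illustrative and not part of the original analysis.

```r
# Observed total word counts reported above (Trump, Harris)
observed <- c(Trump = 8118, Harris = 5950)

# Under the null of equal shares, each speaker is expected to have half the words
expected <- rep(sum(observed) / length(observed), length(observed))

# Pearson chi-squared statistic: sum of (O - E)^2 / E
manual_statistic <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
manual_df <- length(observed) - 1

# Upper-tail p-value from the chi-squared distribution
manual_p <- pchisq(manual_statistic, df = manual_df, lower.tail = FALSE)

# Compare the hand calculation with the built-in test
builtin <- chisq.test(observed)
print(c(manual = manual_statistic, builtin = unname(builtin$statistic)))
```

One caveat when reading the p-value: the test treats every word as an independent draw, which is a strong assumption for running speech, so the result is probably best read as descriptive rather than as a strict inference.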
difference <- 8118-5950
P_Value <- formatC(chisq_test$p.value, format = "e", digits = 3)
ggplot(data=total_words_compare)+
geom_bar(aes(x=speaker, y=total_words, fill=speaker), stat="identity")+
geom_text(aes(x="Harris", y=7500, label=paste("Difference:",difference)))+
geom_text(aes(x="Harris", y=7000, label=paste("Chi Sqr P-Value:",P_Value)))+
scale_fill_manual(values=c("lightgreen", "#56B4E9"))+
theme_minimal()+
labs(x="Speaker", y="Total Word Count", title="Total Count of Spoken Words: Trump vs Harris", fill="")
Observation: Trump has been called one of the slowest speakers among recent U.S. presidents (Article). During the debate, Trump spoke for approximately 42 minutes and 52 seconds, while Harris spoke for 37 minutes and 36 seconds (Article).
As a general rule, a 5-minute speech is roughly 750 words, which is 150 words per minute. Dividing the word totals above by these speaking times, Trump spoke at approximately 189 words per minute (8118 words in 42:52) and Harris at approximately 158 words per minute (5950 words in 37:36). Neither rate is dramatically faster than that benchmark, and no formal test was run against it; however, it is interesting because it suggests both speakers delivered their messages with a somewhat higher density of information than the rule of thumb implies. This may reflect different speaking styles: Trump may have used shorter, more direct sentences, while Harris might have taken a slightly more measured approach. These differences in speaking pace, while not extreme, could have had subtle effects on the audience’s perception and retention of the content, affecting how their messages were received.
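The speaking-rate arithmetic can be sketched in a few lines of base R. This uses the word totals from the chi-squared test and the speaking times quoted above; the variable names are illustrative. Figures derived from a different transcript or a different timing source would shift the result slightly.

```r
# Word totals used in the chi-squared test and the speaking times quoted above
words   <- c(Trump = 8118, Harris = 5950)
minutes <- c(Trump = 42 + 52 / 60, Harris = 37 + 36 / 60)

# Words per minute for each speaker
wpm <- words / minutes

# Rule-of-thumb benchmark: 750 words in a 5-minute speech
benchmark_wpm <- 750 / 5

# Rounded rates and their ratio to the benchmark
print(round(wpm))
print(round(wpm / benchmark_wpm, 2))
```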
unique_words_compare <- tibble(
speaker=c("Trump", "Harris"),
total_words=c(trump_word_count_with_stop %>% nrow(),
harris_word_count_with_stop %>% nrow())
)
observed <- c(1221, 1257)
chisq_test <- chisq.test(observed)
chi_squared_results <- data.frame(
Statistic = chisq_test$statistic,
P_Value = formatC(chisq_test$p.value, format = "e", digits = 3),
Degrees_of_Freedom = chisq_test$parameter
)
stargazer::stargazer(chi_squared_results, summary = FALSE, type = "text", digits = 3)
##
## ================================================
## Statistic P_Value Degrees_of_Freedom
## ------------------------------------------------
## X-squared 0.523 4.696e-01 1
## ------------------------------------------------
# Recompute the difference for unique words; otherwise the plot reuses the total-word difference (8118-5950)
difference <- 1257-1221
P_Value <- formatC(chisq_test$p.value, format = "e", digits = 3)
ggplot(data=unique_words_compare)+
geom_bar(aes(x=speaker, y=total_words, fill=speaker), stat="identity")+
geom_text(aes(x="Trump", y=1400, label=paste("Difference:",difference)))+
geom_text(aes(x="Trump", y=1300, label=paste("Chi Sqr P-Value:",P_Value)))+
scale_fill_manual(values=c("lightgreen", "#56B4E9")) +
labs(title="Unique Word Count: Harris vs Trump")+
theme_minimal()+
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)+
labs(x="Speaker", y="Total Unique Words Spoken", fill="")
The test results show the following: X-squared = 0.523, df = 1, and p-value = 0.470, which is well above the usual 0.05 threshold.
Despite Harris having a slightly higher unique word count (1257 vs. 1221 for Trump), the chi-squared test indicates that this difference could be due to random chance rather than a meaningful variation. Both speakers employed a similar range of unique words, implying that their speeches, in terms of linguistic diversity (unique vocabulary), are quite comparable.
This could suggest that both candidates structured their responses similarly in terms of complexity and vocabulary, using a comparable amount of distinct words to convey their messages. The focus of their speeches likely revolved more around content and themes rather than significantly differing linguistic styles.
trump_sentiment_speech <- trump_tidy_speech %>%
inner_join(get_sentiments("bing"), by="word") %>%
count(sentiment, sort = TRUE) %>%
mutate(speaker="Trump")
harris_sentiment_speech <- harris_tidy_speech %>%
inner_join(get_sentiments("bing"), by="word") %>%
count(sentiment, sort = TRUE) %>%
mutate(speaker="Harris")
sentiment_analysis <- rbind(trump_sentiment_speech, harris_sentiment_speech)
ggplot(data=sentiment_analysis)+
geom_bar(aes(x=n, y=speaker, group=sentiment, fill=sentiment), stat='identity', position='dodge')+
scale_fill_manual(values=c("#FF4500", "#1E90FF"))+
theme_minimal()+
labs(y="Speaker", x="Word Count by Sentiment", fill="", title="Sentiment of Spoken Words: Trump vs Harris")
This plot presents the sentiment analysis of spoken words from Trump and Harris, comparing the counts of positive and negative words used by each speaker.
Tone and Strategy: The difference in sentiment between Trump and Harris could reflect their rhetorical strategies. Trump’s use of more balanced sentiment (with a slight lean toward negativity) might suggest a more critical or combative tone, possibly focusing on challenges or opponents. Harris, on the other hand, seems to have adopted a more positive tone, likely focusing on hope, solutions, or unity.
Impact on Audience: The higher proportion of negative words in Trump’s speech might resonate with individuals concerned about issues and seeking change, while Harris’s positive tone could appeal to those looking for optimism and constructive dialogue.
Context of Speeches: Because these remarks were made in a debate, this sentiment analysis could reflect the nature of each candidate’s message: Trump focusing more on problems or critiques, and Harris potentially emphasizing unity, progress, and solutions.
This sentiment breakdown highlights the contrast in how each speaker communicated their message, with Harris leaning more toward a positive appeal and Trump taking a more balanced but slightly negative approach.
trump_afinn_sentiment <- trump_tidy_speech %>%
inner_join(get_sentiments("afinn"), by="word") %>%
summarise(sentiment_score = sum(value)) %>%
mutate(speaker="Trump")
harris_afinn_sentiment <- harris_tidy_speech %>%
inner_join(get_sentiments("afinn"), by="word") %>%
summarise(sentiment_score = sum(value)) %>%
mutate(speaker="Harris")
sentiment_afinn <- rbind(trump_afinn_sentiment, harris_afinn_sentiment)
ggplot(data=sentiment_afinn)+
geom_bar(aes(x=sentiment_score, y=speaker, fill=speaker), stat='identity')+
scale_fill_manual(values=c("#1E90FF", "#FF4500"))+
theme_minimal()+
labs(y="Speaker", x="Overall Sentiment Score", fill="", title="Overall Sentiment of Spoken Words: Trump vs Harris")
This plot shows the overall sentiment scores for Trump and Harris based on the AFINN sentiment analysis. AFINN assigns positive and negative values to words based on their emotional tone, and the overall score is the sum of those values across the entire text.
Contrast in Tone: This stark contrast between Trump and Harris highlights a major difference in their rhetorical approaches. Trump’s more negative sentiment might align with a strategy focused on pointing out issues, dangers, or challenges. In contrast, Harris appears to have employed more optimistic language, possibly focusing on solutions or unity.
Impact on Audience: The difference in sentiment could influence audience perception. Negative language often drives urgency and emphasizes problems, potentially resonating with voters who feel discontent. Positive language, meanwhile, may appeal to those looking for hope, change, or constructive discourse.
Context of Speeches: The negative sentiment for Trump might also suggest that his speech was more confrontational or critical, possibly aimed at highlighting issues within the current political or social landscape. Harris’s positive sentiment suggests her speech may have been more forward-looking or focused on progress and unification.
This analysis reveals clear differences in emotional tone between the two speakers, which likely reflect their messaging strategies during their speeches.
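To make the scoring step concrete, here is a minimal base R sketch of an AFINN-style sum. The mini-lexicon and the tokens are hypothetical and chosen only for illustration; they are not the real AFINN list or the debate transcript. The real lexicon assigns integer scores from -5 to 5.

```r
# Hypothetical mini-lexicon in the AFINN style (illustrative values only)
toy_lexicon <- c(good = 3, great = 3, hope = 2, bad = -3, disaster = -2, worst = -3)

# Illustrative tokens, not taken from the debate transcript
tokens <- c("we", "have", "great", "hope", "despite", "the", "worst", "year")

# Keep only tokens that appear in the lexicon (the role of the inner join)
matched <- tokens[tokens %in% names(toy_lexicon)]

# Overall score: the sum of the matched words' values
sentiment_score <- sum(toy_lexicon[matched])

# Mean score per matched word, which does not grow with transcript length
mean_score <- sentiment_score / length(matched)

print(sentiment_score)
print(mean_score)
```

Because the overall score is a raw sum, a speaker who says more words can accumulate a larger magnitude in either direction. Dividing by the number of matched words, as in `mean_score`, is one possible way to compare speakers of different lengths.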
wordcloud(words = trump_word_count_no_stop$word,
freq = trump_word_count_no_stop$n,
max.words = 100)
wordcloud(words = harris_word_count_no_stop$word,
freq = harris_word_count_no_stop$n,
max.words = 100)
## Trump Bigrams
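trump_speech_no_stop <- trump_tidy_speech %>%
anti_join(stop_words, by="word") %>%
filter(!grepl("^\\d+$", word), !grepl(",", word)) %>%
# pull() returns the words as a character vector; paste() on a data frame would deparse it as 'c("...", ...)'
pull(word) %>%
paste(collapse=" ")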
flextable(tibble(line=1, text=trump_speech_no_stop) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE) %>%
mutate(frequency = n/sum(n)) %>%
head(10) %>%
rename(Bigram=bigram, Count = n, Frequency=frequency))
The frequent bigrams (pairs of consecutive words) used by Trump provide insights into the key themes and topics in his speech. Here’s a breakdown of what these bigrams suggest:
Economic Focus: Bigrams like “billions dollars,” “millions people,” and “student loans” emphasize Trump’s focus on large-scale economic matters and financial policies. He seems to concentrate on issues with wide-reaching impacts on the population.
Patriotism and Threats: Phrases like “history country” and “destroying country” suggest Trump is intertwining national pride with concerns about perceived threats to the country, which is common in populist rhetoric.
Political Opponents: The mention of “Nancy Pelosi” reflects the adversarial tone of Trump’s speech, where he is directly addressing or criticizing key figures from the opposing party.
These frequent bigrams show that Trump’s rhetoric is centered around economic magnitude, patriotism, and potential threats to the country, as well as critiques of political opponents.
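The bigram counting itself can be sketched in base R as follows. The token vector is illustrative and not taken from the transcript, and the sketch does not use the tidytext pipeline above; it pairs each token with its successor and tabulates the pairs.

```r
# Illustrative tokens after stop-word removal (not from the transcript)
tokens <- c("billions", "dollars", "millions", "people", "billions", "dollars")

# Pair each token with the one that follows it
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Count each bigram and sort from most to least frequent
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

# Relative frequency of each bigram among all bigrams
bigram_freq <- bigram_counts / sum(bigram_counts)

print(bigram_counts)
print(round(bigram_freq, 2))
```

Since stop words are removed before the pairs are formed, a bigram such as “billions dollars” joins words that were separated by a stop word in the original speech (“billions of dollars”). The bigrams are best read as adjacent content words, not literal quotes.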
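harris_speech_no_stop <- harris_tidy_speech %>%
anti_join(stop_words, by="word") %>%
filter(!grepl("^\\d+$", word), !grepl(",", word)) %>%
# pull() returns the words as a character vector; paste() on a data frame would deparse it as 'c("...", ...)'
pull(word) %>%
paste(collapse=" ")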
flextable(tibble(line=1, text=harris_speech_no_stop) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE) %>%
mutate(frequency = n/sum(n)) %>%
head(10) %>%
rename(Bigram=bigram, Count = n, Frequency=frequency))
The frequent bigrams used by Harris provide a window into the primary focus areas and themes of her speech. Let’s break down the top bigrams:
Focus on Donald Trump: The frequent use of “Donald Trump” and “Donald Trump’s” highlights how central Trump is to Harris’s speech. Her speech seems to focus heavily on contrasting her policies with Trump’s actions, particularly criticizing his leadership and decisions.
Healthcare: Phrases like “affordable care,” “care act,” and “health care” indicate that healthcare was a significant focus in her speech, likely defending the Affordable Care Act or advocating for healthcare reforms.
American People and Middle Class: Harris frequently refers to “American people” and “middle class,” suggesting that her speech is focused on economic policies and the well-being of the general populace, likely framed as a fight for equality, opportunity, and economic security.
Leadership and Security: The mention of “vice president,” “president united,” and “national security” points to discussions of leadership and the importance of uniting the country while ensuring national security.
In summary, Harris’s speech focused on criticizing Trump’s leadership, defending and promoting healthcare reforms, and addressing economic concerns related to the middle class. The focus on Trump, the Affordable Care Act, and national security suggests she is positioning herself as a strong alternative to the previous administration while addressing key voter concerns.
flextable(tibble(line=1, text=trump_speech_no_stop) %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
count(trigram, sort = TRUE) %>%
mutate(frequency = n/sum(n)) %>%
head(10) %>%
rename(Trigram=trigram, Count = n, Frequency=frequency))
The trigrams (three-word phrases) used by Trump provide even more nuanced insights into his speech and messaging, emphasizing key themes and framing strategies. Here’s what each trigram suggests:
Economic Focus: Phrases like “hundreds billions dollars” and “billions dollars China” indicate a significant emphasis on large-scale economic figures, trade, and financial dealings, particularly with China.
Immigration: Trigrams such as “people pouring country” and “allowing millions people” highlight Trump’s focus on immigration and his framing of it as a major issue. The use of words like “pouring” suggests a critical perspective on current immigration policies.
Leadership and Legacy: The trigram “president history country” suggests that Trump is concerned with how his presidency will be viewed in a historical context, potentially comparing himself to past leaders.
Controversial Issues: Topics such as “abortion ninth month” and “afraid North Korea” show that Trump addressed sensitive and high-stakes topics, likely aiming to appeal to conservative voters and bolster his foreign policy credentials.
Infrastructure and Policy: Phrases like “biggest pipeline world” and “close student loans” suggest Trump is discussing tangible policy issues, such as energy infrastructure and student loans, possibly positioning his actions or future plans as solutions to these issues.
Overall, Trump’s frequent trigrams point to a speech that mixes economic concerns, immigration policy, and a focus on his presidency’s place in history, with discussions on controversial topics like abortion and international relations. This aligns with his rhetorical style, which often combines grand-scale figures with strong opinions on national security, immigration, and economic matters.
flextable(tibble(line=1, text=harris_speech_no_stop) %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
count(trigram, sort = TRUE) %>%
mutate(frequency = n/sum(n)) %>%
head(10) %>%
rename(Trigram=trigram, Count = n, Frequency=frequency))
The trigrams used by Harris shed light on the specific issues and messaging she focused on in her speech. These trigrams suggest that healthcare, critiques of Donald Trump, and women’s rights were central themes. Here’s a breakdown of each:
Healthcare Focus: The prominence of the “affordable care act” trigram suggests that healthcare is a significant part of Harris’s messaging. She is likely defending the ACA and highlighting its role in providing protections for millions of Americans.
Critique of Donald Trump: Trigrams like “donald trump left,” “understand donald trump,” and “trump left worst” emphasize Harris’s focus on critiquing Trump’s presidency. She is positioning Trump’s leadership as harmful, particularly regarding healthcare and other key policies.
Reproductive Rights: Trigrams such as “protections roe wade,” “abortion ban understand,” and “carry pregnancy term” show that women’s reproductive rights are a major theme in Harris’s speech. She is likely discussing the importance of maintaining protections for women under Roe v. Wade and opposing efforts to restrict access to abortion.
Policy and Leadership: Harris also addresses executive actions and legislative matters, as suggested by “answer question veto” and “ban understand project.” This reflects her focus on leadership, decision-making, and the implications of legislative bans.
Harris’s trigrams reveal a speech that heavily focuses on healthcare, critiques of Trump’s presidency, and women’s reproductive rights. Her frequent references to the Affordable Care Act and Roe v. Wade indicate her dedication to protecting these key pieces of legislation. Additionally, her repeated mention of Trump suggests she is contrasting her platform with his policies, framing her vision as a corrective to the challenges and failures of his administration.
Two readability measures are compared here. The Flesch-Kincaid grade level is driven by sentence length (words per sentence) and word length (syllables per word), so shorter sentences and shorter words produce a lower grade. SMOG instead counts complex words (words with 3+ syllables) relative to the number of sentences, and is commonly used for texts with denser vocabulary, such as healthcare or legal documents.
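For reference, the two standard formulas can be written as small base R functions. This is a sketch of the published formulas applied to hypothetical counts. It is not the code used for the scores in this analysis, and a library implementation may count sentences and syllables differently, so its results need not match a hand calculation exactly.

```r
# Flesch-Kincaid grade level from raw counts
flesch_kincaid_grade <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

# SMOG grade from the number of words with three or more syllables,
# scaled to a 30-sentence sample
smog_grade <- function(n_polysyllables, n_sentences) {
  1.0430 * sqrt(n_polysyllables * (30 / n_sentences)) + 3.1291
}

# Hypothetical counts for illustration only (not from either transcript)
fk   <- flesch_kincaid_grade(n_words = 100, n_sentences = 5, n_syllables = 150)
smog <- smog_grade(n_polysyllables = 12, n_sentences = 5)

print(round(c(Flesch.Kincaid = fk, SMOG = smog), 2))
```

Both scores are expressed as approximate U.S. school grade levels, so a higher value means the text is estimated to be harder to read.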
trump_corpus_speech <- corpus(trump_speech)
harris_corpus_speech <- corpus(harris_speech)
trump_readability_scores <- textstat_readability(trump_corpus_speech, measure = c("Flesch.Kincaid", "SMOG")) %>%
transmute(Document="Trump", Flesch.Kincaid, SMOG)
harris_readability_scores <- textstat_readability(harris_corpus_speech, measure = c("Flesch.Kincaid", "SMOG")) %>%
transmute(Document="Harris", Flesch.Kincaid, SMOG)
readability_scores <- rbind(trump_readability_scores, harris_readability_scores)
flextable(readability_scores)
The Flesch-Kincaid and SMOG scores provide insight into the complexity and readability of the speeches delivered by Trump and Harris. These scores help determine the education level required to comprehend the text fully.
Trump’s Speech: With a lower Flesch-Kincaid and SMOG score, Trump’s speech is simpler in terms of language and structure. This aligns with his often direct, accessible style of communication, which may be intentional to reach a broad audience, including those with varying education levels. His language tends to use shorter sentences and simpler vocabulary, which may help in delivering his message more clearly and directly to the general public.
Harris’s Speech: Harris’s higher reading levels indicate a more sophisticated style, using more advanced vocabulary and complex sentence structures. This could reflect a more formal tone or a focus on policy details that require a deeper understanding. Her speech may appeal to an audience with a higher education level, and her use of more intricate language could convey depth or seriousness about the issues she’s discussing.
These differences in reading levels highlight how the two speakers differ in communication style and in the complexity of the language they use to discuss their topics.