For this homework assignment, I analyze presidential candidates’ campaign speeches, focusing only on campaign rallies. The speeches were sourced from The American Presidency Project at UC Santa Barbara.
I am interested in whether presidential candidates campaign by emphasizing local issues or by relying on national rhetoric.
The CSV file can be downloaded [HERE].
1) Take a small random sample of your documents and read them carefully.
Completed.
2) Tokenize your documents and pre-process them, removing any “extraneous” content you noticed in closely reading a sample of your documents. What content have you removed and why? How do the top 20 features in your documents change with different pre-processing decisions?
library(rprojroot)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(quanteda)
library(quanteda.textmodels)
Warning: package 'quanteda.textmodels' was built under R version 4.3.3
library(quanteda.textplots)
Warning: package 'quanteda.textplots' was built under R version 4.3.3
library(quanteda.textstats)
Warning: package 'quanteda.textstats' was built under R version 4.3.3
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Pre-processing
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# read data into environment
data <- read.csv("~/Desktop/Desktop - Stone’s MacBook Pro/CU Classes/Fall 2024/Text as Data/Data/campaign_data.csv")
data_corpus <- corpus(data$text)
# add metadata (docvars)
docvars(data_corpus, "date") <- data$date
docvars(data_corpus, "person") <- data$person
docvars(data_corpus, "title") <- data$title
docvars(data_corpus, "url") <- data$url
docvars(data_corpus, "location") <- data$location
docvars(data_corpus, "state") <- data$state
# tokens() is a function that tokenizes the text.
# tokenization is the process of splitting a document into its constituent words
# tokenize our campaign speeches.
data_tokens <- data_corpus %>%
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_ngrams(n = 1:2) %>%
  tokens_tolower() #%>%
  #tokens_wordstem()
data_tokens[1:5]
While reading the documents, I noticed a small number of symbols ($, -, :, ;, etc.). These symbols carry no value for my text analysis, so I removed them. I also removed numbers, punctuation, URLs, and English stop words, since these are unrelated to my research question. I chose not to stem the words, although this is easy to change. If I do stem, I will create a new object containing the stemmed version.
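To see how the top features shift with different pre-processing decisions, the same texts can be run through several pipelines and compared side by side. The sketch below is only illustrative: the three toy sentences and all object names (toy_texts, toks_base, toks_clean, toks_stem) are made up and are not taken from the campaign data.

```r
library(quanteda)

# Toy texts standing in for the campaign speeches (hypothetical, not from the data)
toy_texts <- c(
  "Thank you! We are going to rebuild the country. (Applause.)",
  "The people of this country are working harder than ever in 2016.",
  "We will create jobs, jobs, and more jobs for working people."
)
toy_corpus <- corpus(toy_texts)

# Pipeline 1: lowercase only, nothing removed
toks_base <- tokens_tolower(tokens(toy_corpus))

# Pipeline 2: drop punctuation, numbers, symbols, and English stop words
toks_clean <- tokens(toy_corpus,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE)
toks_clean <- tokens_tolower(toks_clean)
toks_clean <- tokens_remove(toks_clean, stopwords("en"))

# Pipeline 3: same as pipeline 2, plus stemming
toks_stem <- tokens_wordstem(toks_clean)

# Compare the most frequent features under each pipeline
top_base  <- topfeatures(dfm(toks_base), 5)
top_clean <- topfeatures(dfm(toks_clean), 5)
top_stem  <- topfeatures(dfm(toks_stem), 5)

print(top_base)
print(top_clean)
print(top_stem)
```

On the real corpus, the same comparison would use data_corpus in place of toy_corpus and 20 features instead of 5.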
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Create a DTM # Document-Term Matrix
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
## create a DTM (DFM) in which the words are pre-processed:
data_dfm <- dfm(data_tokens)
## How many documents? How many features (words)?
data_dfm
Document-feature matrix of: 399 documents, 335,242 features (99.41% sparse) and 6 docvars.
features
docs good see applause thank much president falwell god bless liberty
text1 1 1 55 3 4 11 1 6 2 15
text2 0 3 0 2 0 3 0 4 2 0
text3 5 4 0 2 1 1 0 0 0 0
text4 5 7 15 13 7 17 0 2 0 0
text5 3 2 42 6 3 17 0 4 2 4
text6 17 15 0 7 7 3 0 1 0 0
[ reached max_ndoc ... 393 more documents, reached max_nfeat ... 335,232 more features ]
topfeatures(data_dfm, 20)
people going know country applause president can one
6106 5156 4334 4235 4057 3857 3756 3375
just america us get want make now american
3301 3199 2940 2796 2763 2714 2626 2566
right like every back
2321 2319 2289 2233
## Let's remove very infrequent and very frequent terms
data_dfm_trim <- dfm_trim(data_dfm, min_termfreq = 5, max_termfreq = 200)
topfeatures(data_dfm_trim, 20)
tough running_president spent case
200 200 200 200
south drug standing name
200 200 198 198
programs mexico small_businesses higher
198 198 197 197
pass anybody taken path
197 197 197 197
stood simple allow trump's
196 196 195 195
3) Pick an important source of variation in your data (for example: date, author identity, location, etc.). Subset your data along this dimension and create word clouds for each category. e.g. a male vs. female author word cloud, a before and after a particular date word cloud etc. What do you notice?
Subset by Location and Candidate
# look at word cloud of Hillary Clinton speeches in Florida
# (subset on both conditions; dfm_group() would only group, not filter, by candidate)
dfmat_hill_loc <- dfm_subset(data_dfm_trim, state == "Florida" & person == "Hillary Clinton")
textplot_wordcloud(dfmat_hill_loc, max_words = 200)
dev.off()
null device
1
What I learned:
Despite filtering out stop words, the word cloud still contains many superfluous words. Moving forward, I will have to do a better job of selecting words closely associated with the concepts I am interested in.
Subset by Candidate
dfmat_candidate <- dfm_subset(data_dfm_trim, person %in% c("Kamala Harris", "Donald J. Trump")) %>%
  dfm_group(groups = person)
# Comparison wordcloud
textplot_wordcloud(dfmat_candidate, comparison = TRUE, color = c("red", "blue"), max_words = 200)
dev.off()
null device
1
Rank Frequency Plot:
# Sum the columns of the DFM to get a total count for each word:
freqs = colSums(data_dfm_trim)
# Create a vocabulary vector:
words = colnames(data_dfm_trim)
# Create a data frame that includes the words in the vocabulary and their frequencies:
wordlist = data.frame(words, freqs)
# Re-order the wordlist by decreasing frequency
wordlist = wordlist[order(wordlist[, "freqs"], decreasing = TRUE), ]
# What are the 10 most frequent words?
head(wordlist, 10)
words freqs
tough tough 200
running_president running_president 200
spent spent 200
case case 200
south south 200
drug drug 200
standing standing 198
name name 198
programs programs 198
mexico mexico 198
# Plot the distribution. Does it look Zipfian?
plot(wordlist$freqs, type = "l", lwd = 2, main = "Rank frequency Plot", xlab = "Rank", ylab = "Frequency")
Ranked Frequency (logged)
# Plot the logged distribution. Does it look like a line with slope -1?
plot(wordlist$freqs, type = "l", log = "xy", lwd = 2, main = "Rank frequency Plot (Logged)", xlab = "log-Rank", ylab = "log-Frequency")
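Rather than judging the slope by eye, it can be estimated by regressing log-frequency on log-rank. The sketch below uses simulated frequencies that follow an exact Zipf law, so the fitted slope comes out at -1. The names ranks, freqs_sim, and zipf_fit are illustrative, and the simulated counts are not from the campaign data.

```r
# Simulated term frequencies (hypothetical) following an exact Zipf law: freq = C / rank
ranks <- 1:1000
freqs_sim <- 5000 / ranks

# Regress log-frequency on log-rank; a Zipfian distribution gives a slope near -1
zipf_fit <- lm(log(freqs_sim) ~ log(ranks))
slope <- unname(coef(zipf_fit)[2])
print(slope)
```

For the real data, the same fit could be run as lm(log(wordlist$freqs) ~ log(seq_along(wordlist$freqs))). Because data_dfm_trim drops every term that occurs more than 200 times, the head of the distribution is cut off there, so the untrimmed data_dfm may be the better input for this check.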
Frequency Co-occurrence Matrix
# Create a feature co-occurrence matrix (FCM):
fcm1 = fcm(data_dfm_trim, context = "document", count = "frequency")
## What are the dimensions of our FCM?
dim(fcm1)
[1] 22088 22088
## Show the head of the fcm:
head(fcm1)
Feature co-occurrence matrix of: 6 by 22,088 features.
features
features liberty university thrilled largest christian girl growing
liberty 162 112 25 40 63 65 100
university 0 207 28 57 19 47 111
thrilled 0 0 10 10 3 21 24
largest 0 0 0 59 11 21 83
christian 0 0 0 0 14 11 20
girl 0 0 0 0 0 35 42
features
features wilmington delaware ii
liberty 31 22 31
university 14 40 62
thrilled 3 1 10
largest 6 23 34
christian 4 6 5
girl 6 12 11
[ reached max_nfeat ... 22,078 more features ]
## Visualize the co-occurrences:
# Calculate the total frequency of each feature in the dfm
feature_freq <- colSums(data_dfm_trim)
# Sort the features by frequency in decreasing order and extract the top 60
top_features <- names(sort(feature_freq, decreasing = TRUE)[1:60])
head(top_features)
# Subset the fcm using the top 60 features
fcm_filtered <- fcm_select(fcm1, pattern = top_features, selection = "keep", valuetype = "fixed")
# Plot the co-occurrences between the top features:
fcm_select(fcm_filtered, pattern = top_features) %>%
  textplot_network(min_freq = .8, edge_size = 2)
What I learned is:
The comparison word cloud pools speeches across all dates, which is why Trump’s mentions of Hillary Clinton appear so prominently. Future comparisons need to group by the relevant features, such as date. Additionally, the transcripts do not distinguish between speakers: audience reactions such as “booo” or “laughter” are recorded alongside the candidate’s own words.
4) Create a dictionary (or use a pre-existing dictionary) to measure a topic of interest in your data.
I am interested in how much candidates discuss immigration over time. The dictionary I created contains words associated with immigration. Because I am looking across candidates, I included terms Donald J. Trump is known to use when referring to immigrants.
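A minimal sketch of how a dictionary like this can be built and used to flag documents with quanteda is shown below. The entries in immigration_dict and the two toy speeches are hypothetical placeholders, not the actual dictionary or data used in this assignment.

```r
library(quanteda)

# Hypothetical dictionary entries (placeholders, not the actual term list).
# quanteda matches dictionary values as glob patterns by default,
# so a trailing * covers every word that begins with the stem.
immigration_dict <- dictionary(list(
  immigration = c("immigra*", "migrant*", "border*", "mexic*")
))

# Toy speeches standing in for the real documents (hypothetical)
toy_speeches <- c(
  speech_a = "We will secure the border and fix our immigration system.",
  speech_b = "We are going to create jobs and rebuild our roads."
)
toy_dfm <- dfm(tokens(toy_speeches, remove_punct = TRUE))

# Count dictionary hits per document, then flag documents with at least one hit
dict_counts <- dfm_lookup(toy_dfm, dictionary = immigration_dict)
mentions_immigration <- rowSums(dict_counts) > 0
print(mentions_immigration)
```

Applied to the real data, the resulting logical vector could be stored as a column such as data$immigration, assuming the rows of data are in the same order as the documents in the dfm.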
5) Label your documents according to whether or not they contain the dictionary terms. Visualize the prevalence of your dictionary terms (either over time or by some other dimension of interest).
# date variable was not being treated as a date
data <- data %>% mutate(date = ymd(date))
class(data$date)
# Provides a proportion of speeches that mention immigration by month.
immigration_date = data %>%
  group_by(date = floor_date(date, unit = "month")) %>%
  summarise(prop = sum(immigration) / n()) %>%
  mutate(topic = "immigration")
immigration_date %>%
  ggplot() +
  aes(x = date, y = prop) +
  geom_smooth(method = loess, span = 1) +
  geom_line() +
  labs(y = "Monthly Proportion of Speeches", x = "Date") +
  theme_minimal(base_size = 10)
Proportion of speeches mentioning immigration by month
The plot above tracks the proportion of speeches in each month that mention immigration. A speech is marked TRUE if it contains at least one dictionary term. One limitation is that I have not yet subset the data by party or location; further iterations will plot these proportions by party over time.
6) Choose a random sample of documents that contain your dictionary terms. How many of the documents are actually relevant to the concept you are trying to measure? How might you adapt your dictionary to make it more effective? What are the tradeoffs between precision and recall given what you are trying to measure? If you change your dictionary, repeat step 6. How have your results changed?
Random sample of documents:
set.seed(123) # set seed for reproducibility
sample_data <- data[sample(nrow(data), 5), ] # randomly chose 5 obs.
[179] was flagged as TRUE. 2016-10-22 by Hillary Clinton.
After reading this speech, I believe this is a false positive. The document does contain the word “immigration”, but it appears in a reporter’s question, and Clinton’s response was a non-answer. I don’t believe I can fix this issue by changing my dictionary at this point. Instead, I will need to re-evaluate how I sourced the documents: this document is an exchange with reporters, which differs from the campaign rally speeches I am actually interested in.
[14] was flagged as TRUE. 2015-05-27 by Rick Santorum.
After reading this speech, I believe this is a false positive. The document does contain the word “immigration”. However, it was in reference to a “pro-worker immigration group” in a larger discussion about American workers. To be honest, I have no idea what a “pro-worker immigration group” is, and I’m not sure how to treat it. There isn’t any discussion of immigration, migrants, or border topics within the document.
[195] was flagged as FALSE. 2016-10-31 by Hillary Clinton.
After reading this speech, I believe this is correctly classified. Interestingly, Clinton did mention Mexico, in relation to Trump insulting the President of Mexico, but the dictionary did not pick it up. I think this is because my dictionary contains the stemmed form “mexi”. quanteda matches dictionary entries as glob patterns by default, so “mexi” only matches that exact token; the entry needs a wildcard (“mexi*”) to match “mexico”. I need to fix this.
[306] was flagged as FALSE. 2020-09-21 by Joseph Biden.
After reading this speech, I believe this is correctly classified. The speech was mostly about COVID-19 and Trump.
[118] was flagged as TRUE. 2016-08-24 by Donald J. Trump.
After reading this speech, I believe this is correctly classified. Even so, I should add terms to the dictionary, such as “illegals”, “open borders”, and “illegal immigrant”.
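The five hand-checked speeches above can be turned into a rough precision and recall estimate. The sketch below encodes the flags and my readings exactly as reported above; the object and column names are illustrative, and with only five documents the numbers are a sanity check, not a reliable estimate.

```r
# Hand-coded results for the five sampled speeches, taken from the notes above
hand_check <- data.frame(
  doc      = c(179, 14, 195, 306, 118),
  flagged  = c(TRUE, TRUE, FALSE, FALSE, TRUE),   # label assigned by the dictionary
  relevant = c(FALSE, FALSE, FALSE, FALSE, TRUE)  # judgment after reading the speech
)

tp <- sum(hand_check$flagged & hand_check$relevant)   # true positives
fp <- sum(hand_check$flagged & !hand_check$relevant)  # false positives
fn <- sum(!hand_check$flagged & hand_check$relevant)  # false negatives

# Precision: share of flagged speeches that are actually about immigration
precision <- tp / (tp + fp)
# Recall: share of relevant speeches that the dictionary flagged
recall <- tp / (tp + fn)

print(c(precision = precision, recall = recall))
```

In this sample the dictionary flags three speeches but only one is relevant, so precision is low while recall is perfect. Adding broader terms would likely keep recall high at the cost of more false positives, whereas stricter terms would raise precision but risk missing relevant speeches.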
7) Focusing on the dimension of interest you identified earlier, choose reference documents and run a Wordscores model. How does it perform? Can you improve the performance with additional preprocessing steps?
I struggled with this step. I used the code you provided in the wordscores-2.R file, but the output does not look right.
# Wordscore creation
# create reference score
# Assuming Bernie Sanders and Donald Trump represent the two extremes of ideological positions
sanders_index <- which(data$person == "Bernie Sanders")
trump_index <- which(data$person == "Donald J. Trump")
# create reference score
refscores <- rep(NA, nrow(data_dfm))
refscores[sanders_index] <- -1
refscores[trump_index] <- 1
# Fit a Wordscores model using the ref scores
wordscores_model <- textmodel_wordscores(data_dfm, refscores, scale = "linear", smooth = 1)
# Extract the wordscores, rescale them and then save in data.frame we created earlier.
pred_ws <- predict(wordscores_model, se.fit = TRUE, newdata = data_dfm)
textplot_scale1d(pred_ws)