Homework 1

Author

Stone Neilon

Background

For this homework assignment, I analyze presidential candidates’ campaign speeches, focusing only on campaign rallies. The speeches were sourced from The American Presidency Project at UC Santa Barbara.

I am interested in whether presidential candidates campaign by emphasizing local issues or by relying on national rhetoric.

The download for the csv file can be found [HERE].

1) Take a small random sample of your documents and read them carefully.

Completed.

2) Tokenize your documents and pre-process them, removing any “extraneous” content you noticed in closely reading a sample of your documents. What content have you removed and why? How do the top 20 features in your documents change with different pre-processing decisions? 

library(rprojroot)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(stringr)
library(quanteda)
Warning: package 'quanteda' was built under R version 4.3.3
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.
library(quanteda.corpora)
library(quanteda.textmodels)
Warning: package 'quanteda.textmodels' was built under R version 4.3.3
library(quanteda.textplots)
Warning: package 'quanteda.textplots' was built under R version 4.3.3
library(quanteda.textstats)
Warning: package 'quanteda.textstats' was built under R version 4.3.3
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Pre-processing
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

# read data into environment 
data <- read.csv("~/Desktop/Desktop - Stone’s MacBook Pro/CU Classes/Fall 2024/Text as Data/Data/campaign_data.csv") 

data_corpus <- corpus(data$text)

#add metadata (docvars) 
docvars(data_corpus, "date") <- data$date
docvars(data_corpus, "person") <- data$person
docvars(data_corpus, "title") <- data$title
docvars(data_corpus, "url") <- data$url
docvars(data_corpus, "location") <- data$location
docvars(data_corpus, "state") <- data$state


# tokens() is a function that tokenizes the text. 
# tokenization is the process of splitting a document into its constituent words
# tokenize our campaign speeches. 
data_tokens <- data_corpus %>% 
  tokens(
    remove_punct = TRUE, 
    remove_numbers = TRUE, 
    remove_symbols = TRUE
  ) %>% 
  tokens_remove(stopwords("en")) %>% 
  tokens_ngrams(n = 1:2) %>% 
  tokens_tolower() #%>% 
  #tokens_wordstem()

data_tokens[1:5]
Tokens consisting of 5 documents and 6 docvars.
text1 :
 [1] "good"      "see"       "applause"  "thank"     "applause"  "thank"    
 [7] "much"      "president" "falwell"   "god"       "bless"     "liberty"  
[ ... and 2,571 more ]

text2 :
 [1] "thank"        "months"       "deliberation" "prayer"       "future"      
 [6] "country"      "come"         "tonight"      "make"         "announcement"
[11] "believe"      "can"         
[ ... and 1,731 more ]

text3 :
 [1] "joined"     "progress"   "unknown"    "student"    "speaker"   
 [6] "graduating" "may"        "graduation" "heading"    "annapolis" 
[11] "naval"      "academy"   
[ ... and 1,385 more ]

text4 :
 [1] "rubio"       "thank"       "honor"       "thank"       "much"       
 [6] "fascinating" "week"        "probably"    "historic"    "week"       
[11] "life"        "monday"     
[ ... and 6,079 more ]

text5 :
 [1] "cruz"      "thank"     "good"      "see"       "well"      "god"      
 [7] "bless"     "great"     "state"     "new"       "hampshire" "thing"    
[ ... and 3,519 more ]

While reading the documents, I noticed a small number of symbols ($, -, :, ;, etc.). These carry no information for my analysis, so I removed them. I also removed numbers, punctuation, and English stop words, since they are unrelated to my research question. I chose not to stem the words, although this is easy to change: if I do, I will create a new object containing the stemmed version.
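One way to see how the top features shift with pre-processing is to build the DFM under several variants and compare `topfeatures()` for each. The sketch below does this on two made-up sentences (the `toy` vector is an assumption for illustration, not part of the campaign data); the same three variants could be applied to `data_corpus`.

```r
library(quanteda)

# Made-up stand-in for the speeches (not taken from the corpus)
toy <- c(
  d1 = "Thank you! We are going to create jobs, jobs, and more jobs.",
  d2 = "We created 200 jobs here. Thank you, thank you very much."
)

# Variant 1: strip punctuation, numbers, and symbols, then lowercase
toks_basic <- tokens(toy, remove_punct = TRUE, remove_numbers = TRUE,
                     remove_symbols = TRUE)
toks_basic <- tokens_tolower(toks_basic)

# Variant 2: additionally drop English stop words
toks_nostop <- tokens_remove(toks_basic, stopwords("en"))

# Variant 3: additionally stem the remaining words
toks_stem <- tokens_wordstem(toks_nostop)

# Top five features under each variant
top_basic  <- topfeatures(dfm(toks_basic), 5)
top_nostop <- topfeatures(dfm(toks_nostop), 5)
top_stem   <- topfeatures(dfm(toks_stem), 5)

top_basic
top_nostop
top_stem
```

Comparing the three vectors side by side shows which high-frequency features are stop words and which distinct word forms are merged by stemming.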

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Create a DTM 
# Document-Term Matrix 
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

## create a DTM (DFM) in which the words are pre-processed:
data_dfm <- dfm(data_tokens)

## How many documents? How many features (words)?
data_dfm
Document-feature matrix of: 399 documents, 335,242 features (99.41% sparse) and 6 docvars.
       features
docs    good see applause thank much president falwell god bless liberty
  text1    1   1       55     3    4        11       1   6     2      15
  text2    0   3        0     2    0         3       0   4     2       0
  text3    5   4        0     2    1         1       0   0     0       0
  text4    5   7       15    13    7        17       0   2     0       0
  text5    3   2       42     6    3        17       0   4     2       4
  text6   17  15        0     7    7         3       0   1     0       0
[ reached max_ndoc ... 393 more documents, reached max_nfeat ... 335,232 more features ]
topfeatures(data_dfm, 20)
   people     going      know   country  applause president       can       one 
     6106      5156      4334      4235      4057      3857      3756      3375 
     just   america        us       get      want      make       now  american 
     3301      3199      2940      2796      2763      2714      2626      2566 
    right      like     every      back 
     2321      2319      2289      2233 
## Let's remove very infrequent and very frequent terms
data_dfm_trim <- dfm_trim(data_dfm, min_termfreq = 5, max_termfreq = 200)

topfeatures(data_dfm_trim, 20)
            tough running_president             spent              case 
              200               200               200               200 
            south              drug          standing              name 
              200               200               198               198 
         programs            mexico  small_businesses            higher 
              198               198               197               197 
             pass           anybody             taken              path 
              197               197               197               197 
            stood            simple             allow           trump's 
              196               196               195               195 
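Every count in the trimmed list sits at or just below 200 because `max_termfreq = 200` discards any feature that occurs more than 200 times in total, so the most frequent surviving features are the ones right at the cap. A minimal sketch on a made-up three-document corpus contrasts a cap on total counts with a cap on the share of documents containing a feature. The thresholds here are arbitrary illustrations, not recommendations for the campaign data.

```r
library(quanteda)

# Made-up documents for illustration only
toy <- c(d1 = "people people people jobs border",
         d2 = "people people jobs wall",
         d3 = "people jobs jobs syria")
toy_dfm <- dfm(tokens(toy))

# Cap on total count: "people" (6 occurrences) is dropped, "jobs" (4) survives
trim_count <- dfm_trim(toy_dfm, min_termfreq = 1, max_termfreq = 4)

# Cap on document share: drop features that appear in more than 90% of documents
trim_share <- dfm_trim(toy_dfm, max_docfreq = 0.9, docfreq_type = "prop")

featnames(trim_count)
featnames(trim_share)
```

With the count cap, the top remaining feature is exactly at the threshold, which mirrors the pattern in the output above. The document-share cap instead removes words that nearly every speech uses, regardless of how often.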

3) Pick an important source of variation in your data (for example: date, author identity, location, etc.). Subset your data along this dimension and create word clouds for each category. e.g. a male vs. female author word cloud, a before and after a particular date word cloud etc. What do you notice?

Subset by Location and Candidate

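# look at word cloud of Hillary Clinton speeches in Florida 
# (subset on both state and candidate; dfm_group() would only split the
# Florida speeches into TRUE/FALSE groups rather than filter them)
dfmat_hill_loc <- dfm_subset(data_dfm_trim,
                             state == "Florida" & person == "Hillary Clinton")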

textplot_wordcloud(dfmat_hill_loc, max_words = 200)

dev.off()
null device 
          1 

What I learned:

Despite filtering out stop words, the word cloud still contains many superfluous words. Going forward, I will need to select words more closely associated with the concepts I am interested in.

Subset by Candidate

dfmat_candidate <- dfm_subset(data_dfm_trim, person %in% c("Kamala Harris", "Donald J. Trump")) %>%
  dfm_group(groups = person) 

# Comparison wordcloud
textplot_wordcloud(dfmat_candidate, comparison = TRUE, 
                   color = c("red", "blue"), max_words=200)

dev.off()
null device 
          1 

Rank Frequency Plot:

# Sum the columns of the DFM to get a total count for each word: 
freqs = colSums(data_dfm_trim)

# Create a vocabulary vector:
words = colnames(data_dfm_trim)

# Create a data frame that includes the words in the vocabulary and their frequencies:
wordlist = data.frame(words, freqs)

# Re-order the wordlist by decreasing frequency
wordlist = wordlist[order(wordlist[, "freqs"], decreasing = TRUE), ]

# What are the 10 most frequent words?
head(wordlist, 10)
                              words freqs
tough                         tough   200
running_president running_president   200
spent                         spent   200
case                           case   200
south                         south   200
drug                           drug   200
standing                   standing   198
name                           name   198
programs                   programs   198
mexico                       mexico   198
# Plot the distribution. Does it look Zipfian?
plot(wordlist$freqs , type = "l", lwd=2, main = "Rank frequency Plot", xlab="Rank", ylab ="Frequency")

Ranked Frequency (logged)

# Plot the logged distribution. Does it look like a line with slope -1? 
plot(wordlist$freqs , type = "l", log="xy", lwd=2, main = "Rank frequency Plot (Logged)", xlab="log-Rank", ylab ="log-Frequency")
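The slope question can also be checked numerically by regressing log-frequency on log-rank. The sketch below uses simulated counts that follow Zipf's law exactly (`sim_freqs` is an assumption for illustration), so the fitted slope should come out at -1.

```r
# Simulated frequencies that follow Zipf's law exactly: frequency = C / rank
ranks <- 1:500
sim_freqs <- 1000 / ranks

# Fit a line to the log-log relationship and pull out the slope
zipf_fit <- lm(log(sim_freqs) ~ log(ranks))
slope <- unname(coef(zipf_fit)[2])
slope
```

For the real data, the same regression could be run as `lm(log(wordlist$freqs) ~ log(seq_along(wordlist$freqs)))`. Because `data_dfm_trim` drops every feature with more than 200 occurrences, the head of that distribution is cut off, so a fit on the untrimmed `data_dfm` may be more informative.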

Feature Co-occurrence Matrix

# Create a feature co-occurrence matrix (FCM):
fcm1 = fcm(data_dfm_trim, context = "document", count="frequency")

## What are the dimensions of our FCM?
dim(fcm1)
[1] 22088 22088
## Show the head of the fcm:
head(fcm1)
Feature co-occurrence matrix of: 6 by 22,088 features.
            features
features     liberty university thrilled largest christian girl growing
  liberty        162        112       25      40        63   65     100
  university       0        207       28      57        19   47     111
  thrilled         0          0       10      10         3   21      24
  largest          0          0        0      59        11   21      83
  christian        0          0        0       0        14   11      20
  girl             0          0        0       0         0   35      42
            features
features     wilmington delaware ii
  liberty            31       22 31
  university         14       40 62
  thrilled            3        1 10
  largest             6       23 34
  christian           4        6  5
  girl                6       12 11
[ reached max_nfeat ... 22,078 more features ]
## Visualize the co-occurrences:

# Calculate the total frequency of each feature in the dfm
feature_freq <- colSums(data_dfm_trim)

# Sort the features by frequency in decreasing order and extract the top 60
top_features <- names(sort(feature_freq, decreasing = TRUE)[1:60])
head(top_features)
[1] "tough"             "running_president" "spent"            
[4] "case"              "south"             "drug"             
top_features
 [1] "tough"             "running_president" "spent"            
 [4] "case"              "south"             "drug"             
 [7] "standing"          "name"              "programs"         
[10] "mexico"            "small_businesses"  "higher"           
[13] "pass"              "anybody"           "taken"            
[16] "path"              "stood"             "simple"           
[19] "allow"             "trump's"           "places"           
[22] "members"           "action"            "hold"             
[25] "example"           "fix"               "attack"           
[28] "voted"             "booo_vice"         "often"            
[31] "stay"              "rules"             "asking"           
[34] "telling"           "presidency"        "progress"         
[37] "speak"             "weeks"             "buy"              
[40] "honor"             "prepared"          "going_get"        
[43] "easy"              "michigan"          "growing"          
[46] "exactly"           "might"             "syria"            
[49] "agree"             "middle_east"       "age"              
[52] "global"            "invest"            "weapons"          
[55] "helped"            "rich"              "interest"         
[58] "plans"             "voters"            "constitution"     
# Subset the fcm using the top 60 features
fcm_filtered <- fcm_select(fcm1, pattern = top_features, selection = "keep", valuetype = "fixed")

# Plot the co-occurrences between the top features:
textplot_network(fcm_filtered, min_freq = .8, edge_size = 2)

What I learned:

The comparison word cloud does not group by date, which is why Trump appears to mention Hillary Clinton so prominently. Future comparisons need to group by the relevant features. Additionally, the transcripts do not distinguish between speakers: the “booo” and “laughter” tokens come from the audience, not the candidate.
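Audience reactions can be dropped at the token stage, before n-grams are formed, so that bigrams such as `booo_vice` never appear. The sketch below is a minimal illustration on a made-up transcript line; the `reaction_words` list is hypothetical and would need to be extended after reading more transcripts.

```r
library(quanteda)

# Made-up transcript line for illustration only
toy <- c(d1 = "We will win. [applause] Thank you. AUDIENCE: Booo! [laughter]")

toks <- tokens(toy, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Hypothetical list of audience-reaction markers
reaction_words <- c("applause", "laughter", "booo", "audience")

# Remove the markers before any call to tokens_ngrams()
toks_clean <- tokens_remove(toks, pattern = reaction_words)
as.list(toks_clean)
```

This does not solve the broader problem of multiple speakers in one transcript, but it removes the most visible audience noise from the word clouds and the network plot.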

4) Create a dictionary (or use a pre-existing dictionary) to measure a topic of interest in your data. 

I am interested in how much candidates discuss immigration over time. The dictionary I created contains words associated with immigration. Because I am looking across candidates, I included terms that Donald J. Trump is known to use when referring to immigrants.

immigration <- "border|immigration|refugee|mexi|wall|migra|alien|asylum|native|deport|dreamer"
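Because the dictionary is a regular expression, it matches substrings and is case-sensitive unless told otherwise. The sketch below shows both effects on made-up sentences (`toy_text` is an assumption for illustration). The pattern is written without a trailing `|`, since an empty final alternative matches every string.

```r
library(stringr)

# Dictionary terms joined as alternatives, with no empty final alternative
pattern <- "border|immigration|refugee|mexi|wall|migra|alien|asylum|native|deport|dreamer"

# Made-up sentences for illustration only
toy_text <- c(
  "We will secure the border.",
  "He insulted the President of Mexico.",
  "Wall Street needs an alternative plan.",
  "This speech is about health care."
)

# Default matching is case-sensitive: "Mexico" does not match "mexi"
case_sensitive <- str_detect(toy_text, pattern)

# Ignoring case picks up "Mexico", but "alternative" still matches "native"
case_insensitive <- str_detect(toy_text, regex(pattern, ignore_case = TRUE))

data.frame(toy_text, case_sensitive, case_insensitive)
```

The third sentence is flagged in both columns only because "alternative" contains "native", which illustrates how short stems can cost precision. Wrapping terms in word boundaries (`\\b`) is one possible remedy.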

5) Label your documents according to whether or not they contain the dictionary terms. Visualize the prevalence of your dictionary terms (either over time or by some other dimension of interest).

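# the date variable was not being treated as a date
data <- data %>%
  mutate(date = ymd(date))

class(data$date)

# Label each speech TRUE if its text matches the dictionary pattern.
# (Assumed step: the original labelling code is not shown. The match is
# case-sensitive as written.)
data <- data %>%
  mutate(mentions_immigration = str_detect(text, immigration))

# Proportion of speeches that mention immigration, by month
immigration_date <- data %>% 
  group_by(date = floor_date(date, unit = "month")) %>%
  summarise(prop = mean(mentions_immigration)) %>%
  mutate(topic = "immigration")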

immigration_date %>% 
  ggplot() + 
  aes(x = date, y = prop) + 
  geom_smooth(method=loess, span=1) +
  geom_line()+
  labs(y = "Monthly Proportion of Speeches", x = "Date")+
  theme_minimal(base_size=10)

Proportion of speeches mentioning immigration by month

The plot above tracks the proportion of speeches in each month that mention immigration. A speech is marked TRUE if it contains any term from the dictionary. One limitation is that I have not yet subset the data by party or location; future iterations will need to plot these proportions by party over time.
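A by-party version of the monthly proportion could be computed by adding the grouping variable to `group_by()`. The sketch below uses made-up rows; the `party` column is hypothetical, since the original data only records `person`, and would have to be added through a lookup table.

```r
library(dplyr)
library(lubridate)

# Made-up rows; `party` is a hypothetical column not present in the original data
toy <- data.frame(
  date  = ymd(c("2016-08-03", "2016-08-20", "2016-08-25", "2016-09-10")),
  party = c("Republican", "Republican", "Democratic", "Democratic"),
  mentions_immigration = c(TRUE, FALSE, TRUE, FALSE)
)

# Monthly proportion of flagged speeches within each party,
# keeping the number of speeches behind each proportion
by_party <- toy %>%
  group_by(party, month = floor_date(date, unit = "month")) %>%
  summarise(prop = mean(mentions_immigration),
            n_speeches = n(),
            .groups = "drop")

by_party
```

Keeping `n_speeches` alongside `prop` makes it easier to spot months where the proportion rests on only one or two speeches. The result can be plotted with `colour = party` in the `aes()` call.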

6) Choose a random sample of documents that contain your dictionary terms. How many of the documents are actually relevant to the concept you are trying to measure? How might you adapt your dictionary to make it more effective? What are the tradeoffs between precision and recall given what you are trying to measure? If you change your dictionary, repeat step 6. How have your results changed?

Random sample of documents:

set.seed(123) # set seed for reproducibility
sample_data <- data[sample(nrow(data), 5), ] # randomly chose 5 obs.

[179] was flagged as TRUE. 2016-10-22 by Hillary Clinton.

  • After reading this speech, I believe this is a false positive. The document does contain the word “immigration”. However, the word came from a reporter’s question, and Clinton’s response was a non-answer. I don’t believe I can fix this issue (at this point) by changing my dictionary. Instead, I will need to re-evaluate how I sourced the documents: this document is an exchange with reporters, not a campaign rally speech, which is what I am actually interested in.

[14] was flagged as TRUE. 2015-05-27 by Rick Santorum.

  • After reading this speech, I believe this is a false positive. The document does contain the word “immigration”. However, it was in reference to a “pro-worker immigration group” in a larger discussion about American workers. To be honest, I have no idea what a “pro-worker immigration group” is, and I’m not sure how to treat it. There isn’t any discussion of immigration, migrants, or the border within the document.

[195] was flagged as FALSE. 2016-10-31 by Hillary Clinton.

  • After reading this speech, I believe this is correctly classified. Interestingly, Clinton did mention Mexico, in relation to Trump insulting the President of Mexico. However, it was not picked up, even though my dictionary includes the stem “mexi”. A likely cause is that the match is case-sensitive, so the lowercase “mexi” does not match “Mexico”. I need to fix this.

[306] was flagged as FALSE. 2020-09-21 by Joseph Biden.

  • After reading this speech, I believe this is correctly classified. The speech was mostly about COVID-19 and Trump.

[118] was flagged as TRUE. 2016-08-24 by Donald J. Trump.

  • After reading this speech, I believe this is correctly classified. I need to add terms such as “illegals”, “open borders”, and “illegal immigrant” to the dictionary.
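The five judgements above are enough for a back-of-the-envelope precision and recall calculation. The sketch below encodes them by hand: `flagged` is the dictionary label and `relevant` is the reading of each speech as described in the bullets, with only speech 118 judged to be about immigration.

```r
# Hand-coded summary of the five sampled speeches described above
checked <- data.frame(
  doc      = c(179, 14, 195, 306, 118),
  flagged  = c(TRUE, TRUE, FALSE, FALSE, TRUE),   # dictionary label
  relevant = c(FALSE, FALSE, FALSE, FALSE, TRUE)  # judgement after reading
)

tp <- sum(checked$flagged & checked$relevant)    # flagged and relevant
fp <- sum(checked$flagged & !checked$relevant)   # flagged but not relevant
fn <- sum(!checked$flagged & checked$relevant)   # relevant but missed

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

c(precision = precision, recall = recall)
```

On this sample the dictionary misses nothing but two of its three hits are false positives, which suggests it currently favors recall over precision. Five documents is far too few to trust these numbers; a larger hand-coded sample would be needed before and after any dictionary change.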

7) Focusing on the dimension of interest you identified earlier, choose reference documents and run a Wordscores model. How does it perform? Can you improve the performance with additional preprocessing steps? 

I struggled with this. I used the code you provided in the wordscores-2.R file, but the result doesn’t look right.

# Wordscore creation 

# identify the reference documents
# Assuming Bernie Sanders and Donald Trump represent the two extremes of ideological positions
sanders_index <- which(data$person == "Bernie Sanders")
trump_index <- which(data$person == "Donald J. Trump")


# create reference score 
refscores <- rep(NA, nrow(data_dfm))
refscores[sanders_index] <- -1
refscores[trump_index] <- 1


#Fit a Wordscores model using the ref scores

wordscores_model <- textmodel_wordscores(data_dfm,
                              refscores,
                              scale = "linear",
                              smooth = 1)


# Predict a score (with standard errors) for every speech in the DFM.

pred_ws <- predict(wordscores_model, se.fit = TRUE, newdata = data_dfm)

textplot_scale1d(pred_ws)
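One possible reason the plot is hard to read is that `newdata = data_dfm` scores the reference speeches together with the unlabelled ones. The sketch below is a minimal illustration on a made-up four-document corpus, not a diagnosis of the model above: it fits Wordscores on two reference texts and predicts only for the texts whose reference score is `NA`.

```r
library(quanteda)
library(quanteda.textmodels)

# Made-up documents: two reference texts and two texts to be scored
toy <- c(
  ref_left     = "health care union wage health care",
  ref_right    = "tax cut border wall tax cut",
  virgin_left  = "union wage health",
  virgin_right = "border wall tax"
)
toy_dfm <- dfm(tokens(toy))

# Reference scores for the anchor texts; NA marks texts to be scored
ref <- c(-1, 1, NA, NA)

# Fit on the reference texts (rows with NA are ignored during fitting)
ws <- textmodel_wordscores(toy_dfm, y = ref, smooth = 1)

# Predict only for the unlabelled texts
virgin_dfm <- toy_dfm[is.na(ref), ]
pred <- predict(ws, newdata = virgin_dfm)
pred
```

Raw Wordscores predictions tend to cluster near the middle of the reference scale, so the `rescaling` argument of `predict()` (for example `rescaling = "lbg"`) may make the spread easier to interpret. Using `data_dfm_trim` or a DFM without bigrams would also reduce the very large feature space.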