Text Analysis I at IQMR

Author

Michelle Bueno Vásquez

Published

June 21, 2024

Note

All credit for these materials goes to Professor Ben Noble; I just compiled them into a nice HTML format

Lab 1: Discovering SOTU

Visit the American Presidency Project Website and choose one State of the Union Address delivered orally after 1981, and one written State of the Union Address delivered before 1900.

Download oral addresses here

Download written addresses here

Some R Basics

First, install the packages below with install.packages(), then load the following libraries:

Code
library(quanteda)
library(quanteda.corpora)
library(stringr)
library(tidyverse)
library(tidytext)
library(stringi)
library(textdata)
library(stm)
Note

You'll need to use devtools to install quanteda.corpora:

devtools::install_github("quanteda/quanteda.corpora")

The quanteda.corpora package includes data_corpus_sotu, a corpus object containing all State of the Union Addresses.

Code
data_corpus_sotu
Corpus consisting of 241 documents and 6 docvars.
Washington-1790 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1790b :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1791 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1792 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1793 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1794 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

[ reached max_ndoc ... 235 more documents ]
Code
sou_df <- as_tibble(convert(data_corpus_sotu, to = 'data.frame'))
sou_df
# A tibble: 241 × 8
   doc_id           text     FirstName President Date       delivery type  party
   <chr>            <chr>    <chr>     <chr>     <date>     <fct>    <fct> <fct>
 1 Washington-1790  "Fellow… George    Washingt… 1790-01-08 spoken   SOTU  Inde…
 2 Washington-1790b "Fellow… George    Washingt… 1790-12-08 spoken   SOTU  Inde…
 3 Washington-1791  "Fellow… George    Washingt… 1791-10-25 spoken   SOTU  Inde…
 4 Washington-1792  "Fellow… George    Washingt… 1792-11-06 spoken   SOTU  Inde…
 5 Washington-1793  "Fellow… George    Washingt… 1793-12-03 spoken   SOTU  Inde…
 6 Washington-1794  "Fellow… George    Washingt… 1794-11-19 spoken   SOTU  Inde…
 7 Washington-1795  "Fellow… George    Washingt… 1795-12-08 spoken   SOTU  Inde…
 8 Washington-1796  "Fellow… George    Washingt… 1796-12-07 spoken   SOTU  Inde…
 9 Adams-1797       "Gentle… John      Adams     1797-11-22 spoken   SOTU  Fede…
10 Adams-1798       "Gentle… John      Adams     1798-12-08 spoken   SOTU  Fede…
# ℹ 231 more rows

You can see that our data is now structured like a spreadsheet. Columns contain metadata like the speaker, the date, and the party. The text column contains the raw text of all State of the Union addresses.

You can view the full text of a speech by selecting the row using indexing and pulling the text column using $:

Code
sou_df[1,]$text
[1] "Fellow-Citizens of the Senate and House of Representatives:\n\nI embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity.\n\nIn resuming your consultations for the general good you can not but derive encouragement from the reflection that the measures of the last session have been as satisfactory to your constituents as the novelty and difficulty of the work allowed you to hope. Still further to realize their expectations and to secure the blessings which a gracious Providence has placed within our reach will in the course of the present important session call for the cool and deliberate exertion of your patriotism, firmness, and wisdom.\n\nAmong the many interesting objects which will engage your attention that of providing for the common defense will merit particular regard. To be prepared for war is one of the most effectual means of preserving peace.\n\nA free people ought not only to be armed, but disciplined; to which end a uniform and well-digested plan is requisite; and their safety and interest require that they should promote such manufactories as tend to render them independent of others for essential, particularly military, supplies.\n\nThe proper establishment of the troops which may be deemed indispensable will be entitled to mature consideration. 
In the arrangements which may be made respecting it it will be of importance to conciliate the comfortable support of the officers and soldiers with a due regard to economy.\n\nThere was reason to hope that the pacific measures adopted with regard to certain hostile tribes of Indians would have relieved the inhabitants of our southern and western frontiers from their depredations, but you will perceive from the information contained in the papers which I shall direct to be laid before you (comprehending a communication from the Commonwealth of Virginia) that we ought to be prepared to afford protection to those parts of the Union, and, if necessary, to punish aggressors.\n\nThe interests of the United States require that our intercourse with other nations should be facilitated by such provisions as will enable me to fulfill my duty in that respect in the manner which circumstances may render most conducive to the public good, and to this end that the compensation to be made to the persons who may be employed should, according to the nature of their appointments, be defined by law, and a competent fund designated for defraying the expenses incident to the conduct of foreign affairs.\n\nVarious considerations also render it expedient that the terms on which foreigners may be admitted to the rights of citizens should be speedily ascertained by a uniform rule of naturalization.\n\nUniformity in the currency, weights, and measures of the United States is an object of great importance, and will, I am persuaded, be duly attended to.\n\nThe advancement of agriculture, commerce, and manufactures by all proper means will not, I trust, need recommendation; but I can not forbear intimating to you the expediency of giving effectual encouragement as well to the introduction of new and useful inventions from abroad as to the exertions of skill and genius in producing them at home, and of facilitating the intercourse between the distant parts of our country by a due attention to the post-office and post-roads.\n\nNor am I less persuaded that you will agree with me in opinion that there is nothing which can better deserve your patronage than the promotion of science and literature. Knowledge is in every country the surest basis of public happiness. 
In one in which the measures of government receive their impressions so immediately from the sense of the community as in ours it is proportionably essential.\n\nTo the security of a free constitution it contributes in various ways - by convincing those who are intrusted with the public administration that every valuable end of government is best answered by the enlightened confidence of the people, and by teaching the people themselves to know and to value their own rights; to discern and provide against invasions of them; to distinguish between oppression and the necessary exercise of lawful authority; between burthens proceeding from a disregard to their convenience and those resulting from the inevitable exigencies of society; to discriminate the spirit of liberty from that of licentiousness - cherishing the first, avoiding the last - and uniting a speedy but temperate vigilance against encroachments, with an inviolable respect to the laws.\n\nWhether this desirable object will be best promoted by affording aids to seminaries of learning already established, by the institution of a national university, or by any other expedients will be well worthy of a place in the deliberations of the legislature.\n\nGentlemen of the House of Representatives:\n\nI saw with peculiar pleasure at the close of the last session the resolution entered into by you expressive of your opinion that an adequate provision for the support of the public credit is a matter of high importance to the national honor and prosperity. In this sentiment I entirely concur; and to a perfect confidence in your best endeavors to devise such a provision as will be truly with the end I add an equal reliance on the cheerful cooperation of the other branch of the legislature.\n\nIt would be superfluous to specify inducements to a measure in which the character and interests of the United States are so obviously so deeply concerned, and which has received so explicit a sanction from your declaration.\n\nGentlemen of the Senate and House of Representatives:\n\nI have directed the proper officers to lay before you, respectively, such papers and estimates as regard the affairs particularly recommended to your consideration, and necessary to convey to you that information of the state of the Union which it is my duty to afford.\n\nThe welfare of our country is the great object to which our cares and efforts ought to be directed, and I shall derive great satisfaction from a cooperation with you in the pleasing though arduous task of insuring to our fellow citizens the blessings which they have a right to expect from a free, efficient, and equal government."
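
If you only want a quick glance rather than the whole speech, one option (not part of the original lab) is to preview the first few hundred characters with str_sub() from stringr:

Code
# preview the first 300 characters of Washington's 1790 address
str_sub(sou_df[1,]$text, 1, 300)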

Using the tidyverse suite of tools, you can also perform other operations on our dataframe. For example, if you want to see the State of the Union Addresses given by Lincoln, you can use the %>% and filter() commands:

Code
sou_df %>% 
  filter(President == 'Lincoln')
# A tibble: 4 × 8
  doc_id       text          FirstName President Date       delivery type  party
  <chr>        <chr>         <chr>     <chr>     <date>     <fct>    <fct> <fct>
1 Lincoln-1861 "Fellow-Citi… Abraham   Lincoln   1861-12-03 written  SOTU  Repu…
2 Lincoln-1862 "Fellow-Citi… Abraham   Lincoln   1862-12-01 written  SOTU  Repu…
3 Lincoln-1863 "Fellow-Citi… Abraham   Lincoln   1863-12-08 written  SOTU  Repu…
4 Lincoln-1864 "Fellow-Citi… Abraham   Lincoln   1864-12-06 written  SOTU  Repu…

You might also be interested in creating your own variables. For example, you can use mutate() and if_else(condition, true, false) to create a variable for whether a State of the Union Address was given before or after the U.S. Civil War.

Code
sou_df2 <- sou_df %>% 
  mutate(
    post_war = if_else(Date > '1865-04-09', 1, 0)
    )

You can also use group_by() and summarise() to aggregate and summarize your data.

Code
sou_df2 %>% 
  group_by(post_war) %>% 
  summarise(n = n())
# A tibble: 2 × 2
  post_war     n
     <dbl> <int>
1        0    76
2        1   165

Here, we can see that 76 speeches were given before the Civil War and 165 after.
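
As an aside, dplyr's count() collapses the group_by() and summarise() pattern above into a single call and should return the same totals:

Code
# shorthand for group_by(post_war) %>% summarise(n = n())
sou_df2 %>% 
  count(post_war)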

Finally, tolower() converts text to lower case; this is often helpful because the mix of upper and lowercase can mess with our analysis.

Code
sou_df_lower <- sou_df %>% 
  mutate(
    text = tolower(text)
    )

Lab 2: Getting Our Words Into a Bag

Let's focus on some of the key R code we will need to use to format our corpus as a bag of words. We will do so using the quanteda package.

Code
head(data_corpus_sotu)
Corpus consisting of 6 documents and 6 docvars.
Washington-1790 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1790b :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1791 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1792 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1793 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

Washington-1794 :
"Fellow-Citizens of the Senate and House of Representatives: ..."

This object is already formatted as a corpus; however, we can always convert raw text into a corpus using the corpus() command.
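
For instance, here is a minimal sketch of building a corpus from a couple of made-up sentences (toy text, not part of the lab data):

Code
# build a corpus from a named character vector of toy documents
toy_corp <- corpus(c(doc1 = "Text analysis is fun.",
                     doc2 = "A bag of words ignores word order."))
summary(toy_corp)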

We will practice this skill using a corpus of tweets President Trump sent while in office.

You can import that file directly from the web using the code below. I have made some changes to the column labels to comport with more conventional formatting.

Code
trump_tweets <- read_csv('https://raw.githubusercontent.com/MarkHershey/CompleteTrumpTweetsArchive/master/data/realDonaldTrump_in_office.csv') %>% 
    rename(id = ID, 
           time = Time, 
           url = `Tweet URL`, 
           text = `Tweet Text`)

You can view the basic file structure by using the head() command. You can see that a key column for us will be the text column, which contains the tweet text.

Code
head(trump_tweets)
# A tibble: 6 × 4
  id               time                url                                 text 
  <chr>            <dttm>              <chr>                               <chr>
1 @realDonaldTrump 2017-01-20 06:31:00 https://twitter.com/realDonaldTrum… It a…
2 @realDonaldTrump 2017-01-20 11:51:00 https://twitter.com/realDonaldTrum… Toda…
3 @realDonaldTrump 2017-01-20 11:51:00 https://twitter.com/realDonaldTrum… powe…
4 @realDonaldTrump 2017-01-20 11:52:00 https://twitter.com/realDonaldTrum… What…
5 @realDonaldTrump 2017-01-20 11:53:00 https://twitter.com/realDonaldTrum… Janu…
6 @realDonaldTrump 2017-01-20 11:54:00 https://twitter.com/realDonaldTrum… The …

We start by converting our text to a corpus object using the corpus() command. If you preview this object, you’ll see that it has now been reformatted.

Code
tt_corp <- corpus(trump_tweets)

Next, let's convert this into bag-of-words (BOW) format using a series of quanteda functions.

The tokens() function tokenizes our corpus. The argument what = 'word' indicates that we will be using unigrams (single-word tokens).

Code
tt_dfm <- tokens(tt_corp, what = 'word',
        remove_numbers = TRUE, 
        remove_punct = TRUE, 
        remove_symbols = TRUE,
        remove_separators = TRUE) %>% 
    tokens_tolower() %>%
    tokens_remove(c(stopwords("english"))) %>%
    tokens_select(min_nchar = 3) %>% 
    dfm() %>% 
    dfm_wordstem() %>% 
    dfm_trim(min_termfreq = 5, termfreq_type = 'count',
        min_docfreq = 5, docfreq_type = 'count')
Tip
  1. Here, we remove several types of tokens we do not want, including numbers, punctuation, symbols, and separators.

     First, ask yourself: is there any concern about removing some of these character types from our corpus?

  2. We convert all tokens to lowercase.

  3. We remove a set of stopwords that are standard in the quanteda package.

  4. We remove all words with only one or two characters (e.g., 'a', 'at').

  5. We create a dfm.

  6. We stem words.

  7. We remove rare words: those appearing fewer than 5 times in total, and

  8. those that appear in fewer than 5 documents.

Having applied all of these steps, we can now preview our dfm:

Code
tt_dfm
Document-feature matrix of: 23,073 documents, 5,134 features (99.74% sparse) and 3 docvars.
       features
docs    begin today see a.m movement continu work mere transfer power
  text1     2     1   1   1        1       1    1    0        0     0
  text2     0     1   0   0        0       0    0    1        2     1
  text3     0     0   0   0        0       0    0    0        0     1
  text4     0     0   0   0        0       0    0    0        0     0
  text5     0     0   0   0        0       0    0    0        0     0
  text6     0     0   0   0        0       0    0    0        0     0
[ reached max_ndoc ... 23,067 more documents, reached max_nfeat ... 5,124 more features ]
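
If you also want to see which stems dominate the corpus after all this pre-processing, quanteda's topfeatures() lists the most frequent features (output omitted here):

Code
# 20 most frequent features in the trimmed dfm
topfeatures(tt_dfm, 20)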

We could do some keyword counting. Let's see how many times Trump references Hillary Clinton.

Code
sum(tt_dfm[,'clinton'])
[1] 293

We could run a simple regression using our keyword.

One hypothesis is that as we get farther away from the 2016 election, Trump is less likely to reference Clinton.

We can extract the number of references to Clinton from our dfm:

Code
clinton_ref <- convert(tt_dfm[,'clinton'], 'data.frame')[,2]

We can convert the time the tweet was sent into a year variable and subtract 2016 to determine how many years we are from the election.

Code
years_since_election <- year(docvars(tt_dfm)$time) - 2016

We perform our regression and see that for every year we get farther away from the 2016 election, the number of Clinton references per Trump tweet declines by about 0.008 on average.

Code
summary(lm(clinton_ref ~ years_since_election))

Call:
lm(formula = clinton_ref ~ years_since_election)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.02850 -0.02085 -0.01320 -0.00556  2.99444 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.0361410  0.0026098  13.848   <2e-16 ***
years_since_election -0.0076454  0.0008086  -9.456   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1239 on 23071 degrees of freedom
Multiple R-squared:  0.00386,   Adjusted R-squared:  0.003817 
F-statistic: 89.41 on 1 and 23071 DF,  p-value: < 2.2e-16

We have many other options when it comes to creating our dfm. As one example:

Code
tt_dfm2 <- tokens(tt_corp, what = 'word',
                  remove_numbers = TRUE, 
                  remove_punct = TRUE, 
                  remove_symbols = TRUE,
                  remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(c(stopwords("english"), 
                  'clinton', 
                  'biden')) %>%
  tokens_ngrams(n = 1:2) %>% 
  tokens_select(min_nchar = 3) %>% 
  dfm() %>% 
  dfm_trim(min_termfreq = 50, termfreq_type = 'count',
        min_docfreq = 50, docfreq_type = 'count')
Tip

Here we:

  1. remove the standard quanteda stopwords plus some custom stopwords (perhaps Trump's opponents?),

  2. include all unigrams and bigrams, and

  3. trim more aggressively than before, keeping only features that appear at least 50 times and in at least 50 documents.

Here, we can preview some of the features in our dfm.

Code
head(featnames(tt_dfm2))
[1] "today"     "see"       "a.m"       "continues" "work"      "power"    

We can also count the number of Clinton references without formatting our data as a dfm using the str_count() command from the stringr package.

Code
clinton_ref_df <- trump_tweets %>% 
    mutate(text = tolower(text),
        clinton_ref = str_count(text, 'clinton'),
        years_since_election = year(time) - 2016)

We can see the number of references we found:

Code
table(clinton_ref_df$clinton_ref)

    0     1     2     3 
22780   264    22     7 

Running the same regression as above, we get substantively the same results:

Code
summary(lm(clinton_ref ~ years_since_election, clinton_ref_df))

Call:
lm(formula = clinton_ref ~ years_since_election, data = clinton_ref_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.03053 -0.02266 -0.01478 -0.00691  2.99309 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.0384046  0.0028051  13.691   <2e-16 ***
years_since_election -0.0078748  0.0008691  -9.061   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1331 on 23071 degrees of freedom
Multiple R-squared:  0.003546,  Adjusted R-squared:  0.003503 
F-statistic: 82.11 on 1 and 23071 DF,  p-value: < 2.2e-16

Note that str_count() matches substrings, so it will capture word stems. For example, it will count both 'republican' and 'republicans' when you search for 'republican':

Code
example_text = 'The republican party is made up of republicans.'
str_count(example_text, 'republican') == 2
[1] TRUE
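
If you instead want exact word matches only, you could wrap the pattern in regex word boundaries, as in this small sketch using the same example_text:

Code
# \\b marks a word boundary, so 'republicans' no longer matches; the count is 1
str_count(example_text, '\\brepublican\\b')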

Let's try to replicate Ben Noble's dictionary-based sentiment analysis from lecture using this data set.

We can access the afinn dictionary from the tidytext package.

Code
# This is a data frame of words and sentiment scores
afinn_dict <- get_sentiments('afinn')

# let's give each text a unique id number
clinton_ref_df$unique_id <- 1:nrow(trump_tweets)

# starting from our original data
trump_words <- clinton_ref_df %>% 
    unnest_tokens(word, text)

trump_sentiment <- trump_words %>% 
    left_join(afinn_dict) %>% 
    mutate(value = if_else(is.na(value), 0, value))

trump_score_df <- trump_sentiment %>% 
    group_by(id, 
             time, 
             url, 
             unique_id, 
             clinton_ref, 
             years_since_election) %>% 
    summarise(
      tot_score = sum(value), 
      word_count = n()
      ) %>% 
    mutate(
      sent_score = tot_score / word_count
      )
Tip

Are tweets referencing Clinton more negative than those that do not?

Code
summary(lm(sent_score ~ clinton_ref, trump_score_df))

Call:
lm(formula = sent_score ~ clinton_ref, data = trump_score_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.05404 -0.08852 -0.05404  0.07929  1.94596 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.054041   0.001239  43.626  < 2e-16 ***
clinton_ref -0.073815   0.009236  -7.992 1.39e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1871 on 23070 degrees of freedom
Multiple R-squared:  0.002761,  Adjusted R-squared:  0.002718 
F-statistic: 63.87 on 1 and 23070 DF,  p-value: 1.39e-15

Yes: tweets referencing Clinton are about 0.07 points more negative!

Your Turn!

Starting with the original df of tweets, trump_tweets, do the following:

1) Create a dfm with the following characteristics:

  - unigrams
  - remove punctuation, symbols, and separators, but not numbers
  - convert tokens to lowercase
  - remove stopwords from quanteda
  - remove all letters (single-character words)
  - stem the words
  - trim the corpus so that you keep all words that appear at least 10 times and those that appear across at least 5 tweets

Tip
Code
tt_corp_ex <- corpus(trump_tweets)

tt_dfm_ex <- tokens(tt_corp_ex, what = 'word',
    # here, we remove several types of tokens we do not want, including 
    # punctuation, symbols, and separators, but not numbers
        remove_numbers = FALSE, 
        remove_punct = TRUE, 
        remove_symbols = TRUE,
        remove_separators = TRUE) %>% 
    # we convert all tokens to lowercase
    tokens_tolower() %>%
    # we remove a set of stopwords that are standard in the quanteda package
    tokens_remove(c(stopwords("english"))) %>%
    # we remove all words with only one character
    tokens_select(min_nchar = 2) %>% 
    # we create a dfm
    dfm() %>% 
    # we stem words
    dfm_wordstem() %>% 
    # we remove rare words---those appearing less than 10 times in total and
    # those that appear in fewer than five tweets
    dfm_trim(min_termfreq = 10, termfreq_type = 'count',
        min_docfreq = 5, docfreq_type = 'count')

2) What are N, J, and the sparsity of the dfm?

Tip

N = 23,073 (the number of documents), J = 3,550 (the number of unique features), and the sparsity is 99.63% (the percentage of cells that are 0).
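
You can also pull these quantities directly from the dfm rather than reading them off the printed header (a quick check, assuming the tt_dfm_ex object created above):

Code
ndoc(tt_dfm_ex)      # N: number of documents
nfeat(tt_dfm_ex)     # J: number of unique features
sparsity(tt_dfm_ex)  # proportion of cells that are zero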

3) How often does Trump reference immigration? Let’s suppose we can capture all references to immigration using the stem ‘immigr’ and the stem ‘border’.

Tip
Code
# extract columns with features immigr and border, sum the totals
sum(tt_dfm_ex[,c('immigr', 'border')])
[1] 1428
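
To see how that total splits across the two stems, colSums() on the subsetted dfm reports each feature's count separately (a small extra check, not required by the exercise):

Code
# counts for 'immigr' and 'border' individually
colSums(tt_dfm_ex[, c('immigr', 'border')])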

4) Using the original data, trump_tweets, create a new df where you use str_count() to count instances of both ‘immigr’ (note, this will capture complete words like ‘immigrant’ and ‘immigration’) and ‘border’. Don’t forget to lowercase the text!

Hint: it might be easier to do this in multiple steps. First, count the number of 'immigr' references in a variable, then the number of 'border' references, and finally, add those two variables together for the final count.

Tip
Code
immig_df <- trump_tweets %>% 
# lowercase text
    mutate(text = tolower(text),
        # count 'immigr' references
        immigr = str_count(text, 'immigr'),
        # count 'border' references
        border = str_count(text, 'border'),
        # add both reference types into a single variable
        all_ref = immigr + border)

5) Using the tidytext method, convert the df you just created to one where each row is a word within a document. Before you do this, create unique document ids!

Tip
Code
# assign unique id
immig_df$unique_id <- 1:nrow(immig_df)

immig_toks <- immig_df %>% 
    # create unigrams
    unnest_tokens(word, text)

6) The NRC dictionary contains a df of words and emotion codes, including anger. The code below will read in that dictionary and subset it to just the anger words. Merge this dictionary with the tweet data and aggregate back to the document level, computing the proportion of anger words out of all words in each tweet.

When you group, make sure to include the unique_id and the category that counts all immigration references. Also make sure to convert NA (non-anger) words to 0s!

Tip
Code
# read in nrc dictionary and filter to anger words
nrc_anger <- get_sentiments('nrc') %>% 
    filter(sentiment == 'anger')

anger_df <- immig_toks %>% 
    # merge the nrc anger dict
    left_join(nrc_anger) %>% 
    # convert NAs to 0s
    mutate(anger = if_else(is.na(sentiment), 0, 1)) %>% 
    # grouping variables
    group_by(unique_id, time, url, all_ref) %>% 
    # compute the total anger per tweet and total word count
    summarise(n_anger = sum(anger), wc = n()) %>% 
    # create the proportion of anger words in a tweet
    mutate(anger_pct = n_anger / wc)

7) Run a regression to test whether tweets that reference immigration more are angrier on average. What do you conclude?

Tip
Code
summary(lm(anger_pct ~ all_ref, anger_df))

Call:
lm(formula = anger_pct ~ all_ref, data = anger_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.03882 -0.01923 -0.01923  0.01303  0.98077 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0192274  0.0002332  82.442  < 2e-16 ***
all_ref     0.0048980  0.0007398   6.621 3.65e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03472 on 23070 degrees of freedom
Multiple R-squared:  0.001896,  Adjusted R-squared:  0.001853 
F-statistic: 43.83 on 1 and 23070 DF,  p-value: 3.655e-11

For each additional immigration reference, the proportion of anger words used in a tweet is about 0.5 percentage points higher.

8) Our measure of immigration references is probably imperfect. As a class, let's try to assess how it performs. Run the code below; it will create a csv file in your working directory. Read through each tweet and determine whether or not it is about immigration. If it is, put a 1 in the column named true_immigr; otherwise, put a 0 in this column. If you finish, email me the csv that includes the codes you added.

Tip
Code
# randomly sample 20 immigration rows and 20 non-immigration rows
immig_val <- bind_rows(
        immig_df %>% filter(all_ref == 1) %>% sample_n(20),
        immig_df %>% filter(all_ref == 0) %>% sample_n(20)
    ) %>% 
    # create an empty column for coding
    mutate(true_immigr = NA_real_) %>% 
    # subset to relevant variables
    select(time, text, all_ref, true_immigr)

# write validation set to csv
write_csv(immig_val, 'immig_val.csv')

Lab 3: Using Topic Modeling Software

Note: Part I was hard-coding the topic model by hand; Prof. Noble suggested skipping that and using the software in our own work.

Part II

Code
## Part II: Using Topic Modeling Software
library(tidyverse)
library(quanteda)
library(stm)

# Let's return to Donald Trump's tweets from the last lab
trump_tweets <- read_csv('https://raw.githubusercontent.com/MarkHershey/CompleteTrumpTweetsArchive/master/data/realDonaldTrump_in_office.csv') %>% 
    rename(id = ID, time = Time, url = `Tweet URL`, text = `Tweet Text`)
trump_tweets$uid <- 1:nrow(trump_tweets)

# Using our code from last time, let's create a dfm.
tt_corp <- corpus(trump_tweets)
tt_dfm <- tokens(tt_corp, what = 'word',
        remove_numbers = TRUE, 
        remove_punct = TRUE, 
        remove_symbols = TRUE,
        remove_separators = TRUE) %>% 
    tokens_tolower() %>%
    tokens_remove(c(stopwords("english"))) %>%
    tokens_select(min_nchar = 3) %>% 
    dfm() %>% 
    dfm_wordstem() %>% 
    dfm_trim(min_termfreq = 5, termfreq_type = 'count',
        min_docfreq = 5, docfreq_type = 'count')

# due to the pre-processing, some documents have no tokens left, so let's remove them
tt_omit <- tt_dfm[!rowSums(tt_dfm) == 0,]

This line of code runs our LDA model. It can take a while, as the model runs through several iterations. The default convergence tolerance is 1e-5, but that would take too long for a demo, so I set emtol = 5e-4; you can remove this argument in your actual analysis.

Code
tt_mod <- stm(tt_omit, 
              K = 15, 
              init.type = 'LDA', 
              seed = 20221208, 
              emtol = 5e-4)
Beginning LDA Initialization 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 1 (approx. per word bound = -7.354) 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 2 (approx. per word bound = -7.316, relative change = 5.126e-03) 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 3 (approx. per word bound = -7.293, relative change = 3.099e-03) 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 4 (approx. per word bound = -7.279, relative change = 1.962e-03) 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 5 (approx. per word bound = -7.269, relative change = 1.348e-03) 
Topic 1: good, one, watch, world, u. 
 Topic 2: elect, see, trump, tri, go 
 Topic 3: great, make, america, look, just 
 Topic 4: great, total, big, strong, militari 
 Topic 5: report, fbi, investig, impeach, show 
 Topic 6: biden, joe, russia, corrupt, campaign 
 Topic 7: amp, much, now, back, never 
 Topic 8: american, nation, today, year, histori 
 Topic 9: news, fake, media, rate, number 
 Topic 10: democrat, republican, get, dem, parti 
 Topic 11: presid, @realdonaldtrump, trump, hous, @whitehous 
 Topic 12: countri, border, must, deal, law 
 Topic 13: thank, new, job, first, work 
 Topic 14: hunt, witch, dollar, billion, money 
 Topic 15: state, vote, great, unit, support 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 6 (approx. per word bound = -7.262, relative change = 1.022e-03) 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 7 (approx. per word bound = -7.256, relative change = 8.460e-04) 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 8 (approx. per word bound = -7.250, relative change = 7.445e-04) 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 9 (approx. per word bound = -7.246, relative change = 6.577e-04) 
....................................................................................................
Completed E-Step (2 seconds). 
Completed M-Step. 
Completing Iteration 10 (approx. per word bound = -7.241, relative change = 5.783e-04) 
Topic 1: one, good, u., come, watch 
 Topic 2: elect, see, go, way, tri 
 Topic 3: great, make, america, look, just 
 Topic 4: great, big, total, win, strong 
 Topic 5: report, fbi, show, investig, can 
 Topic 6: biden, joe, russia, obama, campaign 
 Topic 7: amp, now, much, back, never 
 Topic 8: american, nation, year, today, ever 
 Topic 9: news, fake, media, bad, rate 
 Topic 10: democrat, get, republican, dem, impeach 
 Topic 11: presid, @realdonaldtrump, trump, hous, @whitehous 
 Topic 12: countri, border, must, china, deal 
 Topic 13: thank, job, new, work, first 
 Topic 14: hunt, witch, dollar, billion, money 
 Topic 15: state, vote, unit, support, peopl 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Completing Iteration 11 (approx. per word bound = -7.238, relative change = 5.190e-04) 
....................................................................................................
Completed E-Step (1 seconds). 
Completed M-Step. 
Model Converged 

Once the model fits, use the following code to inspect the top words associated with each topic. What do you think each topic represents?

Code
labelTopics(tt_mod)
Topic 1 Top Words:
     Highest Prob: one, good, u., come, watch, day, world 
     FREX: iran, kill, attack, war, terrorist, syria, protest 
     Lift: soleimani, iranian, idlib, provinc, consecut, mph, iran 
     Score: good, u., one, world, watch, meet, day 
Topic 2 Top Words:
     Highest Prob: elect, see, go, way, last, tri, left 
     FREX: night, wrong, candid, secretari, ahead, donald, execut 
     Lift: carlson, verif, dominion, fulton, tucker, ایران, @briankempga 
     Score: elect, see, tri, left, night, go, last 
Topic 3 Top Words:
     Highest Prob: great, make, america, look, just, two, wonder 
     FREX: forward, minist, friend, prime, god, stock, polic 
     Lift: minist, abe, emperor, @justintrudeau, @abeshinzo, @borisjohnson, india 
     Score: america, look, great, make, forward, friend, market 
Topic 4 Top Words:
     Highest Prob: great, big, total, win, strong, militari, tax 
     FREX: complet, endors, vet, amend, militari, love, strong 
     Lift: @jimhagedornmn, #tx13, vet, tubervill, @billhagertytn, tommi, amend 
     Score: great, endors, total, strong, militari, tax, love 
Topic 5 Top Words:
     Highest Prob: report, show, fbi, investig, can, allow, mueller 
     FREX: investig, schiff, mueller, whistleblow, adam, hear, wit 
     Lift: debacl, papadopoulo, whistleblow, horowitz, limbaugh, @repadamschiff, withheld 
     Score: report, investig, fbi, schiff, mueller, show, allow 
Topic 6 Top Words:
     Highest Prob: biden, joe, russia, obama, campaign, corrupt, lie 
     FREX: joe, biden, sleepi, comey, jame, russia, reason 
     Lift: joe, sleepi, biden’, hunter, frack, clapper, comey 
     Score: biden, joe, russia, corrupt, obama, campaign, lie 
Topic 7 Top Words:
     Highest Prob: amp, now, much, never, back, like, just 
     FREX: north, korea, kim, middl, brought, speak, perhap 
     Lift: denuclear, singapor, jong, kim, maggi, occupi, korea 
     Score: amp, much, back, like, don’t, north, never 
Topic 8 Top Words:
     Highest Prob: american, nation, year, today, ever, histori, honor 
     FREX: honor, communiti, women, administr, hero, black, histor 
     Lift: african, merri, tackl, african-american, uniform, communiti, hero 
     Score: american, today, nation, honor, ever, histori, famili 
Topic 9 Top Words:
     Highest Prob: news, fake, media, bad, rate, number, stori 
     FREX: news, media, fake, via, enemi, rate, @breitbartnew 
     Lift: media, news, @breitbartnew, coverag, @washingtonpost, msnbc, enemi 
     Score: news, fake, media, rate, stori, number, bad 
Topic 10 Top Words:
     Highest Prob: democrat, get, republican, dem, impeach, noth, senat 
     FREX: parti, process, enough, obamacar, told, legal, legisl 
     Lift: ebola, obamacar, sullivan, parti, process, feloni, cum 
     Score: democrat, republican, impeach, dem, get, parti, noth 
Topic 11 Top Words:
     Highest Prob: presid, @realdonaldtrump, trump, hous, @whitehous, live, white 
     FREX: white, confer, @gopchairwoman, deliv, @danscavino, @gop, freedom 
     Lift: stu, #mytak, #economi, #varneyco, @trumpstud, @danscavino, easter 
     Score: presid, @realdonaldtrump, trump, @whitehous, hous, live, white 
Topic 12 Top Words:
     Highest Prob: countri, border, must, china, want, deal, law 
     FREX: wall, law, immigr, trade, mexico, secur, southern 
     Lift: apprehend, southern, uncheck, mexico’, barrier, coyot, patrol 
     Score: border, countri, law, must, wall, china, deal 
Topic 13 Top Words:
     Highest Prob: thank, job, new, work, time, first, high 
     FREX: york, respond, hospit, relief, pay, industri, water 
     Lift: @morningsmaria, obamag, york, draw, equip, complex, respond 
     Score: thank, new, job, first, work, york, high 
Topic 14 Top Words:
     Highest Prob: hunt, witch, dollar, money, billion, close, forc 
     FREX: witch, dollar, hillari, clinton, crook, hunt, billion 
     Lift: reset, hillary’, #dobb, stone, witch, hillari, crook 
     Score: witch, hunt, hillari, dollar, billion, clinton, money 
Topic 15 Top Words:
     Highest Prob: state, vote, unit, support, peopl, protect, court 
     FREX: court, suprem, voter, congratul, florida, victori, birthday 
     Lift: bader, suprem, #gasen, @rondesantisfl, #gapol, birthday, @kloeffler 
     Score: state, vote, unit, support, court, congratul, protect 

For one topic, use the following code to inspect some representative docs.

Code
# First, we need to remove the "empty" documents from our corpus so it
# aligns with the trimmed dfm
valid_doc_ids <- docnames(tt_omit)
tt_corp_edit <- tt_corp[docnames(tt_corp) %in% valid_doc_ids]
tt_texts <- as.character(tt_corp_edit)

This code finds 3 representative docs for topic 4 (but you can change these settings).

Code
top4 <- findThoughts(tt_mod, 
                     tt_corp_edit, 
                     n = 3, 
                     topic = 4)
plot(top4, width = 80)

You can also see the topic proportions within each document.

Code
head(as.data.frame(tt_mod$theta))
          V1         V2         V3         V4         V5         V6         V7
1 0.04996729 0.07568687 0.05077997 0.03612348 0.15720353 0.04557277 0.04974693
2 0.14222249 0.04316241 0.04572607 0.02508022 0.10142069 0.04153530 0.04898297
3 0.05408990 0.13126856 0.05203457 0.04299540 0.07062815 0.04931853 0.09138689
4 0.07035653 0.05276560 0.03964565 0.04263955 0.05262365 0.14489002 0.07672538
5 0.12535164 0.05444744 0.05640382 0.04608327 0.04448621 0.03923085 0.05280267
6 0.07589637 0.06282384 0.05280871 0.08467997 0.03644294 0.05839662 0.04706087
          V8         V9        V10        V11        V12        V13        V14
1 0.07751744 0.04418403 0.05617533 0.11009075 0.03775928 0.11410811 0.04563758
2 0.14725731 0.10873508 0.07509485 0.05940872 0.03256668 0.05045103 0.04054266
3 0.13066054 0.04927581 0.06459973 0.05982959 0.04126092 0.05677835 0.04646971
4 0.03949757 0.03874437 0.09443208 0.03997499 0.13576844 0.04469876 0.04884859
5 0.10361064 0.04275511 0.05033472 0.10530737 0.04349439 0.06310097 0.06781145
6 0.15495474 0.03996809 0.04437184 0.06097737 0.05805776 0.13430394 0.03518529
         V15
1 0.04944664
2 0.03781352
3 0.05940335
4 0.07838881
5 0.10477945
6 0.05407164
Code
# let's create a vector of which topic is the most prevalent in each tweet
max_topic <- c()
for (i in 1:nrow(tt_mod$theta)){
  max_topic <- c(max_topic, which.max(tt_mod$theta[i,]))
}

The code was buggy, so here is the ChatGPT/Michelle debugged version:

Code
# Create a vector of which topic is the most prevalent in each tweet
max_topic <- apply(tt_mod$theta, 1, which.max)

# Convert the filtered corpus to a data frame
tt_corp_df <- convert(tt_corp_edit, to = "data.frame")

# Add the max_topic vector to the data frame
tt_corp_df$max_topic <- max_topic

# Extract year from the time column and add it to the data frame
tt_corp_df$year <- format(as.Date(tt_corp_df$time), "%Y")

# Print intermediate data to debug
print(head(tt_corp_df))
  doc_id
1  text1
2  text2
3  text3
4  text4
5  text5
6  text6
                                                                                                                                        text
1                           It all begins today! I will see you at 11:00 A.M. for the swearing-in. THE MOVEMENT CONTINUES - THE WORK BEGINS!
2 Today we are not merely transferring power from one Administration to another, or from one party to another – but we are transferring...
3                                              power from Washington, D.C. and giving it back to you, the American People. # InaugurationDay
4                     What truly matters is not which party controls our government, but whether our government is controlled by the people.
5                                        January 20th 2017, will be remembered as the day the people became the rulers of this nation again.
6             The forgotten men and women of our country will be forgotten no longer. From this moment on, it’s going to be # AmericaFirst
                id                time
1 @realDonaldTrump 2017-01-20 06:31:00
2 @realDonaldTrump 2017-01-20 11:51:00
3 @realDonaldTrump 2017-01-20 11:51:00
4 @realDonaldTrump 2017-01-20 11:52:00
5 @realDonaldTrump 2017-01-20 11:53:00
6 @realDonaldTrump 2017-01-20 11:54:00
                                                            url uid max_topic
1 https://twitter.com/realDonaldTrump/status/822421390125043713   1         5
2 https://twitter.com/realDonaldTrump/status/822501803615014918   2         8
3 https://twitter.com/realDonaldTrump/status/822501939267141634   3         2
4 https://twitter.com/realDonaldTrump/status/822502135233384448   4         6
5 https://twitter.com/realDonaldTrump/status/822502270503972872   5         1
6 https://twitter.com/realDonaldTrump/status/822502450007515137   6         8
  year
1 2017
2 2017
3 2017
4 2017
5 2017
6 2017
Code
print(table(tt_corp_df$year))

2017 2018 2019 2020 2021 
2427 3499 6961 8880  120 
Code
print(table(tt_corp_df$max_topic))

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
 916  986 1291 2134 1099 1293 1437 1580 1376 1724 2126 1939 1537 1125 1324 
Code
# Ensure the year column exists
if (!"year" %in% colnames(tt_corp_df)) {
  stop("Year column not found in the data frame")
}

# See how often each topic is discussed by year
topic_summary <- tt_corp_df %>% 
  group_by(year, max_topic) %>% 
  summarise(n = n(), .groups = 'drop') %>% 
  arrange(year, max_topic)

print(topic_summary)
# A tibble: 75 × 3
   year  max_topic     n
   <chr>     <int> <int>
 1 2017          1   136
 2 2017          2    91
 3 2017          3   203
 4 2017          4   283
 5 2017          5    67
 6 2017          6    94
 7 2017          7   167
 8 2017          8   225
 9 2017          9   205
10 2017         10   173
# ℹ 65 more rows
Code
# Identify the top topic for each year based on the count (n)
top_topics_per_year <- topic_summary %>%
  group_by(year) %>%
  top_n(1, n) %>%
  arrange(year)

# Print the top topics per year
print(top_topics_per_year)
# A tibble: 5 × 3
# Groups:   year [5]
  year  max_topic     n
  <chr>     <int> <int>
1 2017          4   283
2 2018         12   483
3 2019         12   731
4 2020         11  1102
5 2021         15    19
Code
knitr::kable(top_topics_per_year)
year   max_topic      n
2017           4    283
2018          12    483
2019          12    731
2020          11   1102
2021          15     19

Time permitting, change the settings. Try a different number of topics and see what (if anything) changes. Try changing the pre-processing and see whether the resulting 15 topics differ from the original 15 topics.
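
As a starting point for that experimentation, here is a sketch of re-fitting the model with a different number of topics; K = 10 is arbitrary, and the loose emtol is kept only to keep the demo fast:

Code
# re-fit with 10 topics instead of 15 and inspect the new topic labels
tt_mod10 <- stm(tt_omit, 
                K = 10, 
                init.type = 'LDA', 
                seed = 20221208, 
                emtol = 5e-4)
labelTopics(tt_mod10)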