All credit for these materials goes to Professor Ben Noble; I just compiled them into a nice HTML format.
Visit the American Presidency Project Website and choose one State of the Union Address delivered orally after 1981, and one written State of the Union Address delivered before 1900.
First, install the packages with install.packages(), and then load the following libraries:
library(quanteda)
library(quanteda.corpora)
library(stringr)
library(tidyverse)
library(tidytext)
library(stringi)
library(textdata)
library(stm)
Note: you'll need to use devtools for quanteda.corpora:
devtools::install_github("quanteda/quanteda.corpora")
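The CRAN packages can be installed in one go; a sketch, with the package names taken from the library() calls above:
# one-time installation of the CRAN packages used in this lab
install.packages(c('quanteda', 'stringr', 'tidyverse', 'tidytext',
                   'stringi', 'textdata', 'stm', 'devtools'))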
The package quanteda.corpora contains a corpus object with all State of the Union Addresses.
data_corpus_sotu
Corpus consisting of 241 documents and 6 docvars.
Washington-1790 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1790b :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1791 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1792 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1793 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1794 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
[ reached max_ndoc ... 235 more documents ]
sou_df <- as_tibble(convert(data_corpus_sotu, to = 'data.frame'))
sou_df
# A tibble: 241 × 8
doc_id text FirstName President Date delivery type party
<chr> <chr> <chr> <chr> <date> <fct> <fct> <fct>
1 Washington-1790 "Fellow… George Washingt… 1790-01-08 spoken SOTU Inde…
2 Washington-1790b "Fellow… George Washingt… 1790-12-08 spoken SOTU Inde…
3 Washington-1791 "Fellow… George Washingt… 1791-10-25 spoken SOTU Inde…
4 Washington-1792 "Fellow… George Washingt… 1792-11-06 spoken SOTU Inde…
5 Washington-1793 "Fellow… George Washingt… 1793-12-03 spoken SOTU Inde…
6 Washington-1794 "Fellow… George Washingt… 1794-11-19 spoken SOTU Inde…
7 Washington-1795 "Fellow… George Washingt… 1795-12-08 spoken SOTU Inde…
8 Washington-1796 "Fellow… George Washingt… 1796-12-07 spoken SOTU Inde…
9 Adams-1797 "Gentle… John Adams 1797-11-22 spoken SOTU Fede…
10 Adams-1798 "Gentle… John Adams 1798-12-08 spoken SOTU Fede…
# ℹ 231 more rows
You can see that our data is now structured like a spreadsheet. Columns contain metadata like the speaker, the date, and the party. The text
column contains the raw text of all State of the Union addresses.
You can view the full text of a speech by selecting the row using indexing and pulling the text column using $:
sou_df[1,]$text
[1] "Fellow-Citizens of the Senate and House of Representatives:\n\nI embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity.\n\nIn resuming your consultations for the general good you can not but derive encouragement from the reflection that the measures of the last session have been as satisfactory to your constituents as the novelty and difficulty of the work allowed you to hope. Still further to realize their expectations and to secure the blessings which a gracious Providence has placed within our reach will in the course of the present important session call for the cool and deliberate exertion of your patriotism, firmness, and wisdom.\n\nAmong the many interesting objects which will engage your attention that of providing for the common defense will merit particular regard. To be prepared for war is one of the most effectual means of preserving peace.\n\nA free people ought not only to be armed, but disciplined; to which end a uniform and well-digested plan is requisite; and their safety and interest require that they should promote such manufactories as tend to render them independent of others for essential, particularly military, supplies.\n\nThe proper establishment of the troops which may be deemed indispensable will be entitled to mature consideration. 
In the arrangements which may be made respecting it it will be of importance to conciliate the comfortable support of the officers and soldiers with a due regard to economy.\n\nThere was reason to hope that the pacific measures adopted with regard to certain hostile tribes of Indians would have relieved the inhabitants of our southern and western frontiers from their depredations, but you will perceive from the information contained in the papers which I shall direct to be laid before you (comprehending a communication from the Commonwealth of Virginia) that we ought to be prepared to afford protection to those parts of the Union, and, if necessary, to punish aggressors.\n\nThe interests of the United States require that our intercourse with other nations should be facilitated by such provisions as will enable me to fulfill my duty in that respect in the manner which circumstances may render most conducive to the public good, and to this end that the compensation to be made to the persons who may be employed should, according to the nature of their appointments, be defined by law, and a competent fund designated for defraying the expenses incident to the conduct of foreign affairs.\n\nVarious considerations also render it expedient that the terms on which foreigners may be admitted to the rights of citizens should be speedily ascertained by a uniform rule of naturalization.\n\nUniformity in the currency, weights, and measures of the United States is an object of great importance, and will, I am persuaded, be duly attended to.\n\nThe advancement of agriculture, commerce, and manufactures by all proper means will not, I trust, need recommendation; but I can not forbear intimating to you the expediency of giving effectual encouragement as well to the introduction of new and useful inventions from abroad as to the exertions of skill and genius in producing them at home, and of facilitating the intercourse between the distant parts of our country by a due attention to the post-office and post-roads.\n\nNor am I less persuaded that you will agree with me in opinion that there is nothing which can better deserve your patronage than the promotion of science and literature. Knowledge is in every country the surest basis of public happiness. 
In one in which the measures of government receive their impressions so immediately from the sense of the community as in ours it is proportionably essential.\n\nTo the security of a free constitution it contributes in various ways - by convincing those who are intrusted with the public administration that every valuable end of government is best answered by the enlightened confidence of the people, and by teaching the people themselves to know and to value their own rights; to discern and provide against invasions of them; to distinguish between oppression and the necessary exercise of lawful authority; between burthens proceeding from a disregard to their convenience and those resulting from the inevitable exigencies of society; to discriminate the spirit of liberty from that of licentiousness - cherishing the first, avoiding the last - and uniting a speedy but temperate vigilance against encroachments, with an inviolable respect to the laws.\n\nWhether this desirable object will be best promoted by affording aids to seminaries of learning already established, by the institution of a national university, or by any other expedients will be well worthy of a place in the deliberations of the legislature.\n\nGentlemen of the House of Representatives:\n\nI saw with peculiar pleasure at the close of the last session the resolution entered into by you expressive of your opinion that an adequate provision for the support of the public credit is a matter of high importance to the national honor and prosperity. In this sentiment I entirely concur; and to a perfect confidence in your best endeavors to devise such a provision as will be truly with the end I add an equal reliance on the cheerful cooperation of the other branch of the legislature.\n\nIt would be superfluous to specify inducements to a measure in which the character and interests of the United States are so obviously so deeply concerned, and which has received so explicit a sanction from your declaration.\n\nGentlemen of the Senate and House of Representatives:\n\nI have directed the proper officers to lay before you, respectively, such papers and estimates as regard the affairs particularly recommended to your consideration, and necessary to convey to you that information of the state of the Union which it is my duty to afford.\n\nThe welfare of our country is the great object to which our cares and efforts ought to be directed, and I shall derive great satisfaction from a cooperation with you in the pleasing though arduous task of insuring to our fellow citizens the blessings which they have a right to expect from a free, efficient, and equal government."
Using the tidyverse suite of tools, you can also perform other operations on our dataframe. For example, if you want to see the State of the Union Addresses given by Lincoln, you can use the %>% and filter() commands:
sou_df %>%
  filter(President == 'Lincoln')
# A tibble: 4 × 8
doc_id text FirstName President Date delivery type party
<chr> <chr> <chr> <chr> <date> <fct> <fct> <fct>
1 Lincoln-1861 "Fellow-Citi… Abraham Lincoln 1861-12-03 written SOTU Repu…
2 Lincoln-1862 "Fellow-Citi… Abraham Lincoln 1862-12-01 written SOTU Repu…
3 Lincoln-1863 "Fellow-Citi… Abraham Lincoln 1863-12-08 written SOTU Repu…
4 Lincoln-1864 "Fellow-Citi… Abraham Lincoln 1864-12-06 written SOTU Repu…
You might also be interested in creating your own variables. For example, you can use mutate() and if_else(condition, true, false) to create a variable for whether a State of the Union Address was given before or after the U.S. Civil War.
sou_df2 <- sou_df %>%
  mutate(
    post_war = if_else(Date > '1865-04-09', 1, 0)
  )
You can also use group_by()
and summarise()
to aggregate and summarize your data.
sou_df2 %>%
  group_by(post_war) %>%
  summarise(n = n())
# A tibble: 2 × 2
post_war n
<dbl> <int>
1 0 76
2 1 165
Here, we can see that 76 speeches were given before the Civil War and 165 after.
Finally, tolower()
converts text to lower case; this is often helpful because the mix of upper and lowercase can mess with our analysis.
sou_df_lower <- sou_df %>%
  mutate(
    text = tolower(text)
  )
Let's focus on some of the key R code we will need to use to format our corpus as a bag of words. We will do so using the quanteda package.
head(data_corpus_sotu)
Corpus consisting of 6 documents and 6 docvars.
Washington-1790 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1790b :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1791 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1792 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1793 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
Washington-1794 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
This object is already formatted as a corpus; however, we can always put text into a corpus format using the corpus() command.
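For example, if you were starting from the sou_df data frame rather than the built-in corpus, something like the following should work (a sketch; corpus() accepts docid_field and text_field arguments when given a data frame, and the object name sou_corp is just for illustration):
# build a corpus from a data frame; the remaining columns become docvars
sou_corp <- corpus(sou_df, docid_field = 'doc_id', text_field = 'text')
summary(sou_corp, n = 5)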
Next, we will work with a CSV file of Donald Trump's tweets while in office. You can import that file directly from the web using the code below. I have made some changes to the column labels to comport with more conventional formatting.
trump_tweets <- read_csv('https://raw.githubusercontent.com/MarkHershey/CompleteTrumpTweetsArchive/master/data/realDonaldTrump_in_office.csv') %>%
  rename(id = ID,
         time = Time,
         url = `Tweet URL`,
         text = `Tweet Text`)
You can view the basic file structure by using the head()
command. You can see that a key column for us will be the text
column which contains the tweet text.
head(trump_tweets)
# A tibble: 6 × 4
id time url text
<chr> <dttm> <chr> <chr>
1 @realDonaldTrump 2017-01-20 06:31:00 https://twitter.com/realDonaldTrum… It a…
2 @realDonaldTrump 2017-01-20 11:51:00 https://twitter.com/realDonaldTrum… Toda…
3 @realDonaldTrump 2017-01-20 11:51:00 https://twitter.com/realDonaldTrum… powe…
4 @realDonaldTrump 2017-01-20 11:52:00 https://twitter.com/realDonaldTrum… What…
5 @realDonaldTrump 2017-01-20 11:53:00 https://twitter.com/realDonaldTrum… Janu…
6 @realDonaldTrump 2017-01-20 11:54:00 https://twitter.com/realDonaldTrum… The …
We start by converting our text to a corpus object using the corpus()
command. If you preview this object, you’ll see that it has now been reformatted.
tt_corp <- corpus(trump_tweets)
Next, we build our document-feature matrix using a series of quanteda functions. The tokens() function tokenizes our corpus, and the what = 'word' argument indicates that we will be using unigrams.
tt_dfm <- tokens(tt_corp, what = 'word',
                 remove_numbers = TRUE,
                 remove_punct = TRUE,
                 remove_symbols = TRUE,
                 remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(c(stopwords("english"))) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_trim(min_termfreq = 5, termfreq_type = 'count',
           min_docfreq = 5, docfreq_type = 'count')
Here, we:
- remove several types of tokens we do not want, including numbers, punctuation, symbols, and separators (first, ask yourself: is there any concern about removing some of these character types from our corpus?)
- convert all tokens to lowercase
- remove a set of stopwords that are standard in the quanteda package
- remove all words with only one or two characters (e.g., 'a', 'at')
- create a dfm
- stem words
- remove rare words: those appearing fewer than 5 times in total and those that appear in fewer than five documents
We can preview our dfm now after having applied all of these steps
tt_dfm
Document-feature matrix of: 23,073 documents, 5,134 features (99.74% sparse) and 3 docvars.
features
docs begin today see a.m movement continu work mere transfer power
text1 2 1 1 1 1 1 1 0 0 0
text2 0 1 0 0 0 0 0 1 2 1
text3 0 0 0 0 0 0 0 0 0 1
text4 0 0 0 0 0 0 0 0 0 0
text5 0 0 0 0 0 0 0 0 0 0
text6 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 23,067 more documents, reached max_nfeat ... 5,124 more features ]
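For a quick sense of what survived the pre-processing, quanteda's topfeatures() lists the most frequent features (an optional check, not part of the original lab):
# ten most frequent (stemmed) features in the trimmed dfm
topfeatures(tt_dfm, 10)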
We could do some keyword counting. Let's see how many times Trump references Hillary Clinton.
sum(tt_dfm[,'clinton'])
[1] 293
One hypothesis is that as we get farther away from the 2016 election, Trump is less likely to reference Clinton.
We can extract the number of references to Clinton from our dfm
clinton_ref <- convert(tt_dfm[,'clinton'], 'data.frame')[,2]
We can convert the time the tweet was sent into a year variable and subtract 2016 to determine how many years away we are from the election.
years_since_election <- year(docvars(tt_dfm)$time) - 2016
We perform our regression and see that for every year we get farther away from the 2016 election, the number of Clinton references in a Trump tweet declines by about 0.008 on average.
summary(lm(clinton_ref ~ years_since_election))
Call:
lm(formula = clinton_ref ~ years_since_election)
Residuals:
Min 1Q Median 3Q Max
-0.02850 -0.02085 -0.01320 -0.00556 2.99444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0361410 0.0026098 13.848 <2e-16 ***
years_since_election -0.0076454 0.0008086 -9.456 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1239 on 23071 degrees of freedom
Multiple R-squared: 0.00386, Adjusted R-squared: 0.003817
F-statistic: 89.41 on 1 and 23071 DF, p-value: < 2.2e-16
We have many other options when it comes to creating our dfm. As one example:
tt_dfm2 <- tokens(tt_corp, what = 'word',
                  remove_numbers = TRUE,
                  remove_punct = TRUE,
                  remove_symbols = TRUE,
                  remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(c(stopwords("english"),
                  'clinton',
                  'biden')) %>%
  tokens_ngrams(n = 1:2) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 50, termfreq_type = 'count',
           min_docfreq = 50, docfreq_type = 'count')
Here we: 1. remove the set of stopwords that are standard in the quanteda package plus custom stopwords (perhaps Trump's opponents?); 2. include all unigrams and bigrams; 3. trim away more words than we did previously (terms must appear at least 50 times and in at least 50 documents).
Here, we can preview some of the features in our dfm.
head(featnames(tt_dfm2))
[1] "today" "see" "a.m" "continues" "work" "power"
We can also count the number of Clinton references without formatting our data as a dfm, using the str_count() command from the stringr package.
clinton_ref_df <- trump_tweets %>%
  mutate(text = tolower(text),
         clinton_ref = str_count(text, 'clinton'),
         years_since_election = year(time) - 2016)
We can see the number of references we found
table(clinton_ref_df$clinton_ref)
0 1 2 3
22780 264 22 7
Running the same regression as above, we get essentially the same results:
summary(lm(clinton_ref ~ years_since_election, clinton_ref_df))
Call:
lm(formula = clinton_ref ~ years_since_election, data = clinton_ref_df)
Residuals:
Min 1Q Median 3Q Max
-0.03053 -0.02266 -0.01478 -0.00691 2.99309
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0384046 0.0028051 13.691 <2e-16 ***
years_since_election -0.0078748 0.0008691 -9.061 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1331 on 23071 degrees of freedom
Multiple R-squared: 0.003546, Adjusted R-squared: 0.003503
F-statistic: 82.11 on 1 and 23071 DF, p-value: < 2.2e-16
Note that str_count()
will capture some word stems. For example, it will count both ‘republican’ and ‘republicans’ when you search for ‘republican’
example_text = 'The republican party is made up of republicans.'
str_count(example_text, 'republican') == 2
[1] TRUE
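If you only wanted whole-word matches, one option (not used in the lab) is to pass a regular expression with word boundaries to str_count(); stringr patterns are interpreted as regular expressions by default:
# '\\b' marks a word boundary, so 'republicans' no longer matches
str_count(example_text, '\\brepublican\\b') == 1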
We can access the afinn
dictionary from the tidytext
package.
# This is a data frame of words and sentiment scores
afinn_dict <- get_sentiments('afinn')

# let's give each text a unique id number
clinton_ref_df$unique_id <- 1:nrow(trump_tweets)

# starting from our original data
trump_words <- clinton_ref_df %>%
  unnest_tokens(word, text)

trump_sentment <- trump_words %>%
  left_join(afinn_dict) %>%
  mutate(value = if_else(is.na(value), 0, value))

trump_score_df <- trump_sentment %>%
  group_by(id,
           time,
           url,
           unique_id,
           clinton_ref,
           years_since_election) %>%
  summarise(
    tot_score = sum(value),
    word_count = n()
  ) %>%
  mutate(
    sent_score = tot_score / word_count
  )
Are tweets referencing Clinton more negative than those that do not?
summary(lm(sent_score ~ clinton_ref, trump_score_df))
Call:
lm(formula = sent_score ~ clinton_ref, data = trump_score_df)
Residuals:
Min 1Q Median 3Q Max
-3.05404 -0.08852 -0.05404 0.07929 1.94596
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.054041 0.001239 43.626 < 2e-16 ***
clinton_ref -0.073815 0.009236 -7.992 1.39e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1871 on 23070 degrees of freedom
Multiple R-squared: 0.002761, Adjusted R-squared: 0.002718
F-statistic: 63.87 on 1 and 23070 DF, p-value: 1.39e-15
Yes: tweets referencing Clinton are about 0.07 points more negative!
Starting with the original df of tweets, trump_tweets, do the following:
1) Create a dfm with the following characteristics:
- unigrams
- remove punctuation, symbols, and separators, but not numbers
- convert tokens to lowercase
- remove stopwords from quanteda
- remove all letters (single-character words)
- stem the words
- trim the corpus so that you keep all words that appear at least 10 times and those that appear across at least 5 tweets
tt_corp_ex <- corpus(trump_tweets)
tt_dfm_ex <- tokens(tt_corp_ex, what = 'word',
                    # here, we remove several types of tokens we do not want, including
                    # punctuation, symbols, and separators, but not numbers
                    remove_numbers = FALSE,
                    remove_punct = TRUE,
                    remove_symbols = TRUE,
                    remove_separators = TRUE) %>%
  # we convert all tokens to lowercase
  tokens_tolower() %>%
  # we remove a set of stopwords that are standard in the quanteda package
  tokens_remove(c(stopwords("english"))) %>%
  # we remove all words with only one character
  tokens_select(min_nchar = 2) %>%
  # we create a dfm
  dfm() %>%
  # we stem words
  dfm_wordstem() %>%
  # we remove rare words---those appearing less than 10 times in total and
  # those that appear in fewer than five tweets
  dfm_trim(min_termfreq = 10, termfreq_type = 'count',
           min_docfreq = 5, docfreq_type = 'count')
2) What is N, J, and the sparsity of the dfm?
N = 23,073 (the number of documents), J = 3,550 (the number of unique features), and the sparsity is 99.63% (the percentage of cells that are 0).
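You can also pull these quantities directly from the dfm instead of reading them off the printed header, using quanteda's ndoc(), nfeat(), and sparsity() helpers:
# N: number of documents
ndoc(tt_dfm_ex)
# J: number of unique features
nfeat(tt_dfm_ex)
# sparsity: proportion of cells in the dfm that are zero
sparsity(tt_dfm_ex)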
3) How often does Trump reference immigration? Let’s suppose we can capture all references to immigration using the stem ‘immigr’ and the stem ‘border’.
# extract columns with features immigr and border, sum the totals
sum(tt_dfm_ex[,c('immigr', 'border')])
[1] 1428
4) Using the original data, trump_tweets, create a new df where you use str_count() to count instances of both 'immigr' (note, this will capture complete words like 'immigrant' and 'immigration') and 'border'. Don't forget to lowercase the text!
Hint: it might be easier to do this in multiple steps. First, count the number of 'immigr' references in a variable, then the number of 'border' references, and finally, add those two variables together for the final count.
immig_df <- trump_tweets %>%
  # lowercase text
  mutate(text = tolower(text),
         # count 'immigr' references
         immigr = str_count(text, 'immigr'),
         # count 'border' references
         border = str_count(text, 'border'),
         # add both reference types into a single variable
         all_ref = immigr + border)
5) Using the tidytext method, convert the df you just created to one where each row is a word within a document. Before you do this, create unique document ids!
# assign unique id
immig_df$unique_id <- 1:nrow(immig_df)

immig_toks <- immig_df %>%
  # create unigrams
  unnest_tokens(word, text)
6) The NRC dictionary contains a df of words and emotion codes, including anger. The code below will read in that dictionary and subset just to anger words. Merge this dictionary with the tweet data and aggregate back to the document level, computing the proportion of anger words among all words in each tweet.
When you group, make sure to include the unique_id and the variable that counts all immigration references. Also make sure to convert NA (non-anger) words to 0s!
# read in nrc dictionary and filter to anger words
nrc_anger <- get_sentiments('nrc') %>%
  filter(sentiment == 'anger')
anger_df <- immig_toks %>%
  # merge the nrc anger dict
  left_join(nrc_anger) %>%
  # convert NAs to 0s
  mutate(anger = if_else(is.na(sentiment), 0, 1)) %>%
  # grouping variables
  group_by(unique_id, time, url, all_ref) %>%
  # compute the total anger per tweet and total word count
  summarise(n_anger = sum(anger), wc = n()) %>%
  # create the proportion of anger words in a tweet
  mutate(anger_pct = n_anger / wc)
7) Run a regression to test whether tweets that reference immigration more are angrier on average. What do you conclude?
summary(lm(anger_pct ~ all_ref, anger_df))
Call:
lm(formula = anger_pct ~ all_ref, data = anger_df)
Residuals:
Min 1Q Median 3Q Max
-0.03882 -0.01923 -0.01923 0.01303 0.98077
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0192274 0.0002332 82.442 < 2e-16 ***
all_ref 0.0048980 0.0007398 6.621 3.65e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03472 on 23070 degrees of freedom
Multiple R-squared: 0.001896, Adjusted R-squared: 0.001853
F-statistic: 43.83 on 1 and 23070 DF, p-value: 3.655e-11
For each additional immigration reference, the proportion of anger words used in a tweet is about 0.5 percentage points higher on average.
8) Our measure of immigration references is probably imperfect. As a class, let's try to assess how it performs. Run the code below. It will create a csv file on your desktop. Read through each tweet and determine whether it is or isn't about immigration. If so, add a 1 to the column named true_immigr; otherwise, put a 0 in this column. If you finish, email me the csv that includes the codes you added.
# randomly sample 20 immigration rows and 20 non-immigration rows
immig_val <- bind_rows(
  immig_df %>% filter(all_ref == 1) %>% sample_n(20),
  immig_df %>% filter(all_ref == 0) %>% sample_n(20)
) %>%
  # create an empty column for coding
  mutate(true_immigr = NA_real_) %>%
  # subset to relevant variables
  select(time, text, all_ref, true_immigr)

# write validation set to csv
write_csv(immig_val, 'immig_val.csv')
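As a possible follow-up (not part of the original lab), once the hand-coded file comes back you could compare the keyword measure against the hand codes. A minimal sketch, assuming the coded file has been saved as 'immig_val_coded.csv' (a hypothetical name) with the true_immigr column filled in:
# read the hand-coded validation set (hypothetical file name)
coded <- read_csv('immig_val_coded.csv') %>%
  # treat any keyword hit as a positive prediction
  mutate(pred_immigr = as.numeric(all_ref > 0))

# confusion table of the keyword prediction against the hand codes
table(pred = coded$pred_immigr, truth = coded$true_immigr)

# share of tweets where the keyword measure agrees with the hand code
mean(coded$pred_immigr == coded$true_immigr, na.rm = TRUE)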
Note: Part I was the hard coding by hand; Prof. Noble suggested skipping that and using the software in our own work.
## Part II: Using Topic Modeling Software
library(tidyverse)
library(stm)
# Let's return to Donald Trump's tweets from the last lab
trump_tweets <- read_csv('https://raw.githubusercontent.com/MarkHershey/CompleteTrumpTweetsArchive/master/data/realDonaldTrump_in_office.csv') %>%
  rename(id = ID, time = Time, url = `Tweet URL`, text = `Tweet Text`)

trump_tweets$uid <- 1:nrow(trump_tweets)
# Using our code from last time, let's create a dfm.
tt_corp <- corpus(trump_tweets)

tt_dfm <- tokens(tt_corp, what = 'word',
                 remove_numbers = TRUE,
                 remove_punct = TRUE,
                 remove_symbols = TRUE,
                 remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(c(stopwords("english"))) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_trim(min_termfreq = 5, termfreq_type = 'count',
           min_docfreq = 5, docfreq_type = 'count')
# due to the pre-processing, some of the documents have no tokens, let's remove them
tt_omit <- tt_dfm[!rowSums(tt_dfm) == 0,]
This line of code runs our LDA model. This can take a while, as the model runs through several iterations. The default tolerance is 1e-5, but that is going to take too long for a demo, so I set it to 5e-4; you can remove this in actual analysis.
tt_mod <- stm(tt_omit,
              K = 15,
              init.type = 'LDA',
              seed = 20221208,
              emtol = 5e-4)
Beginning LDA Initialization
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 1 (approx. per word bound = -7.354)
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 2 (approx. per word bound = -7.316, relative change = 5.126e-03)
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 3 (approx. per word bound = -7.293, relative change = 3.099e-03)
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 4 (approx. per word bound = -7.279, relative change = 1.962e-03)
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 5 (approx. per word bound = -7.269, relative change = 1.348e-03)
Topic 1: good, one, watch, world, u.
Topic 2: elect, see, trump, tri, go
Topic 3: great, make, america, look, just
Topic 4: great, total, big, strong, militari
Topic 5: report, fbi, investig, impeach, show
Topic 6: biden, joe, russia, corrupt, campaign
Topic 7: amp, much, now, back, never
Topic 8: american, nation, today, year, histori
Topic 9: news, fake, media, rate, number
Topic 10: democrat, republican, get, dem, parti
Topic 11: presid, @realdonaldtrump, trump, hous, @whitehous
Topic 12: countri, border, must, deal, law
Topic 13: thank, new, job, first, work
Topic 14: hunt, witch, dollar, billion, money
Topic 15: state, vote, great, unit, support
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 6 (approx. per word bound = -7.262, relative change = 1.022e-03)
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 7 (approx. per word bound = -7.256, relative change = 8.460e-04)
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 8 (approx. per word bound = -7.250, relative change = 7.445e-04)
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 9 (approx. per word bound = -7.246, relative change = 6.577e-04)
....................................................................................................
Completed E-Step (2 seconds).
Completed M-Step.
Completing Iteration 10 (approx. per word bound = -7.241, relative change = 5.783e-04)
Topic 1: one, good, u., come, watch
Topic 2: elect, see, go, way, tri
Topic 3: great, make, america, look, just
Topic 4: great, big, total, win, strong
Topic 5: report, fbi, show, investig, can
Topic 6: biden, joe, russia, obama, campaign
Topic 7: amp, now, much, back, never
Topic 8: american, nation, year, today, ever
Topic 9: news, fake, media, bad, rate
Topic 10: democrat, get, republican, dem, impeach
Topic 11: presid, @realdonaldtrump, trump, hous, @whitehous
Topic 12: countri, border, must, china, deal
Topic 13: thank, job, new, work, first
Topic 14: hunt, witch, dollar, billion, money
Topic 15: state, vote, unit, support, peopl
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Completing Iteration 11 (approx. per word bound = -7.238, relative change = 5.190e-04)
....................................................................................................
Completed E-Step (1 seconds).
Completed M-Step.
Model Converged
Once the model fits, use the following code to inspect the top words associated with each topic. What do you think each topic represents?
labelTopics(tt_mod)
Topic 1 Top Words:
Highest Prob: one, good, u., come, watch, day, world
FREX: iran, kill, attack, war, terrorist, syria, protest
Lift: soleimani, iranian, idlib, provinc, consecut, mph, iran
Score: good, u., one, world, watch, meet, day
Topic 2 Top Words:
Highest Prob: elect, see, go, way, last, tri, left
FREX: night, wrong, candid, secretari, ahead, donald, execut
Lift: carlson, verif, dominion, fulton, tucker, ایران, @briankempga
Score: elect, see, tri, left, night, go, last
Topic 3 Top Words:
Highest Prob: great, make, america, look, just, two, wonder
FREX: forward, minist, friend, prime, god, stock, polic
Lift: minist, abe, emperor, @justintrudeau, @abeshinzo, @borisjohnson, india
Score: america, look, great, make, forward, friend, market
Topic 4 Top Words:
Highest Prob: great, big, total, win, strong, militari, tax
FREX: complet, endors, vet, amend, militari, love, strong
Lift: @jimhagedornmn, #tx13, vet, tubervill, @billhagertytn, tommi, amend
Score: great, endors, total, strong, militari, tax, love
Topic 5 Top Words:
Highest Prob: report, show, fbi, investig, can, allow, mueller
FREX: investig, schiff, mueller, whistleblow, adam, hear, wit
Lift: debacl, papadopoulo, whistleblow, horowitz, limbaugh, @repadamschiff, withheld
Score: report, investig, fbi, schiff, mueller, show, allow
Topic 6 Top Words:
Highest Prob: biden, joe, russia, obama, campaign, corrupt, lie
FREX: joe, biden, sleepi, comey, jame, russia, reason
Lift: joe, sleepi, biden’, hunter, frack, clapper, comey
Score: biden, joe, russia, corrupt, obama, campaign, lie
Topic 7 Top Words:
Highest Prob: amp, now, much, never, back, like, just
FREX: north, korea, kim, middl, brought, speak, perhap
Lift: denuclear, singapor, jong, kim, maggi, occupi, korea
Score: amp, much, back, like, don’t, north, never
Topic 8 Top Words:
Highest Prob: american, nation, year, today, ever, histori, honor
FREX: honor, communiti, women, administr, hero, black, histor
Lift: african, merri, tackl, african-american, uniform, communiti, hero
Score: american, today, nation, honor, ever, histori, famili
Topic 9 Top Words:
Highest Prob: news, fake, media, bad, rate, number, stori
FREX: news, media, fake, via, enemi, rate, @breitbartnew
Lift: media, news, @breitbartnew, coverag, @washingtonpost, msnbc, enemi
Score: news, fake, media, rate, stori, number, bad
Topic 10 Top Words:
Highest Prob: democrat, get, republican, dem, impeach, noth, senat
FREX: parti, process, enough, obamacar, told, legal, legisl
Lift: ebola, obamacar, sullivan, parti, process, feloni, cum
Score: democrat, republican, impeach, dem, get, parti, noth
Topic 11 Top Words:
Highest Prob: presid, @realdonaldtrump, trump, hous, @whitehous, live, white
FREX: white, confer, @gopchairwoman, deliv, @danscavino, @gop, freedom
Lift: stu, #mytak, #economi, #varneyco, @trumpstud, @danscavino, easter
Score: presid, @realdonaldtrump, trump, @whitehous, hous, live, white
Topic 12 Top Words:
Highest Prob: countri, border, must, china, want, deal, law
FREX: wall, law, immigr, trade, mexico, secur, southern
Lift: apprehend, southern, uncheck, mexico’, barrier, coyot, patrol
Score: border, countri, law, must, wall, china, deal
Topic 13 Top Words:
Highest Prob: thank, job, new, work, time, first, high
FREX: york, respond, hospit, relief, pay, industri, water
Lift: @morningsmaria, obamag, york, draw, equip, complex, respond
Score: thank, new, job, first, work, york, high
Topic 14 Top Words:
Highest Prob: hunt, witch, dollar, money, billion, close, forc
FREX: witch, dollar, hillari, clinton, crook, hunt, billion
Lift: reset, hillary’, #dobb, stone, witch, hillari, crook
Score: witch, hunt, hillari, dollar, billion, clinton, money
Topic 15 Top Words:
Highest Prob: state, vote, unit, support, peopl, protect, court
FREX: court, suprem, voter, congratul, florida, victori, birthday
Lift: bader, suprem, #gasen, @rondesantisfl, #gapol, birthday, @kloeffler
Score: state, vote, unit, support, court, congratul, protect
For one topic, use the following code to inspect some representative docs
# First, we need to remove the "empty" documents from our corpus
tt_corp_edit <- tt_corp[tt_corp$uid %in% tt_omit$uid]

# Ensure the corpus and document-feature matrix alignment
valid_doc_ids <- docnames(tt_omit)
tt_corp_edit <- subset(tt_corp, docid(tt_corp) %in% valid_doc_ids)
tt_texts <- as.character(tt_corp_edit)
This code finds 3 representative docs for topic 4 (but you can change these settings)
top4 <- findThoughts(tt_mod,
                     tt_corp_edit,
                     n = 3,
                     topic = 4)
plot(top4, width = 80)
You can also see the topic proportions within each document.
head(as.data.frame(tt_mod$theta))
V1 V2 V3 V4 V5 V6 V7
1 0.04996729 0.07568687 0.05077997 0.03612348 0.15720353 0.04557277 0.04974693
2 0.14222249 0.04316241 0.04572607 0.02508022 0.10142069 0.04153530 0.04898297
3 0.05408990 0.13126856 0.05203457 0.04299540 0.07062815 0.04931853 0.09138689
4 0.07035653 0.05276560 0.03964565 0.04263955 0.05262365 0.14489002 0.07672538
5 0.12535164 0.05444744 0.05640382 0.04608327 0.04448621 0.03923085 0.05280267
6 0.07589637 0.06282384 0.05280871 0.08467997 0.03644294 0.05839662 0.04706087
V8 V9 V10 V11 V12 V13 V14
1 0.07751744 0.04418403 0.05617533 0.11009075 0.03775928 0.11410811 0.04563758
2 0.14725731 0.10873508 0.07509485 0.05940872 0.03256668 0.05045103 0.04054266
3 0.13066054 0.04927581 0.06459973 0.05982959 0.04126092 0.05677835 0.04646971
4 0.03949757 0.03874437 0.09443208 0.03997499 0.13576844 0.04469876 0.04884859
5 0.10361064 0.04275511 0.05033472 0.10530737 0.04349439 0.06310097 0.06781145
6 0.15495474 0.03996809 0.04437184 0.06097737 0.05805776 0.13430394 0.03518529
V15
1 0.04944664
2 0.03781352
3 0.05940335
4 0.07838881
5 0.10477945
6 0.05407164
# let's create a vector of which topic is the most prevalent in each tweet
max_topic <- c()
for (i in 1:nrow(tt_mod$theta)){
  max_topic <- c(max_topic, which.max(tt_mod$theta[i,]))
}
The code was buggy, so here is the ChatGPT/Michelle debug:
# Create a vector of which topic is the most prevalent in each tweet
max_topic <- apply(tt_mod$theta, 1, which.max)

# Convert the filtered corpus to a data frame
tt_corp_df <- convert(tt_corp_edit, to = "data.frame")

# Add the max_topic vector to the data frame
tt_corp_df$max_topic <- max_topic

# Extract year from the time column and add it to the data frame
tt_corp_df$year <- format(as.Date(tt_corp_df$time), "%Y")

# Print intermediate data to debug
print(head(tt_corp_df))
doc_id
1 text1
2 text2
3 text3
4 text4
5 text5
6 text6
text
1 It all begins today! I will see you at 11:00 A.M. for the swearing-in. THE MOVEMENT CONTINUES - THE WORK BEGINS!
2 Today we are not merely transferring power from one Administration to another, or from one party to another – but we are transferring...
3 power from Washington, D.C. and giving it back to you, the American People. # InaugurationDay
4 What truly matters is not which party controls our government, but whether our government is controlled by the people.
5 January 20th 2017, will be remembered as the day the people became the rulers of this nation again.
6 The forgotten men and women of our country will be forgotten no longer. From this moment on, it’s going to be # AmericaFirst
id time
1 @realDonaldTrump 2017-01-20 06:31:00
2 @realDonaldTrump 2017-01-20 11:51:00
3 @realDonaldTrump 2017-01-20 11:51:00
4 @realDonaldTrump 2017-01-20 11:52:00
5 @realDonaldTrump 2017-01-20 11:53:00
6 @realDonaldTrump 2017-01-20 11:54:00
url uid max_topic
1 https://twitter.com/realDonaldTrump/status/822421390125043713 1 5
2 https://twitter.com/realDonaldTrump/status/822501803615014918 2 8
3 https://twitter.com/realDonaldTrump/status/822501939267141634 3 2
4 https://twitter.com/realDonaldTrump/status/822502135233384448 4 6
5 https://twitter.com/realDonaldTrump/status/822502270503972872 5 1
6 https://twitter.com/realDonaldTrump/status/822502450007515137 6 8
year
1 2017
2 2017
3 2017
4 2017
5 2017
6 2017
print(table(tt_corp_df$year))
2017 2018 2019 2020 2021
2427 3499 6961 8880 120
print(table(tt_corp_df$max_topic))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
916 986 1291 2134 1099 1293 1437 1580 1376 1724 2126 1939 1537 1125 1324
# Ensure the year column exists
if (!"year" %in% colnames(tt_corp_df)) {
stop("Year column not found in the data frame")
}
# See how often each topic is discussed by year
topic_summary <- tt_corp_df %>%
  group_by(year, max_topic) %>%
  summarise(n = n(), .groups = 'drop') %>%
  arrange(year, max_topic)
print(topic_summary)
# A tibble: 75 × 3
year max_topic n
<chr> <int> <int>
1 2017 1 136
2 2017 2 91
3 2017 3 203
4 2017 4 283
5 2017 5 67
6 2017 6 94
7 2017 7 167
8 2017 8 225
9 2017 9 205
10 2017 10 173
# ℹ 65 more rows
# Identify the top topic for each year based on the count (n)
top_topics_per_year <- topic_summary %>%
  group_by(year) %>%
  top_n(1, n) %>%
  arrange(year)
# Print the top topics per year
print(top_topics_per_year)
# A tibble: 5 × 3
# Groups: year [5]
year max_topic n
<chr> <int> <int>
1 2017 4 283
2 2018 12 483
3 2019 12 731
4 2020 11 1102
5 2021 15 19
knitr::kable(top_topics_per_year)
year | max_topic | n |
---|---|---|
2017 | 4 | 283 |
2018 | 12 | 483 |
2019 | 12 | 731 |
2020 | 11 | 1102 |
2021 | 15 | 19 |
Time permitting, change the settings. Try a different number of topics and see what (if anything) changes. Try changing the pre-processing and see whether the resulting 15 topics differ from the original 15 topics.
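For example, a minimal sketch of re-fitting the model with a different number of topics (here K = 10, an arbitrary choice) on the same trimmed dfm:
# re-fit the topic model with a different number of topics
tt_mod_k10 <- stm(tt_omit,
                  K = 10,
                  init.type = 'LDA',
                  seed = 20221208,
                  emtol = 5e-4)

# inspect the top words for the new topics
labelTopics(tt_mod_k10)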