2026-05-20

Warm-up

  • Groups: As you arrive
  • Look at the news blurbs
  • How is word usage different across the articles?
  • Which words carry the strongest emotion?
  • What “topics” are covered in each?

Today’s Class

  • Warm-up: news articles
  • “Families” of Text Analysis
  • Activity: categorization by hand

Wednesday’s Class

  • Text analysis with tidytext

Office Hours

  • Office Hours: Today, Friday 1:30pm-3:00pm (Tyler)
  • Tuesdays, 10:30am-12:00pm (Yao)

Learning Goals

  • Motivate quantitative approaches to text analysis
  • Explore classification strategies
  • Understand term frequencies and document frequencies
  • Understand dictionary-based approaches
  • Discuss semantic approaches
  • (Wednesday) Text analysis with tidytext

Text as Data

Why Text?

  • We live in an increasingly literate and text-producing world
  • Text carries lots of important information!
  • We can observe online interactions

Sources of Text Data

  • Social media interactions (we can observe these directly)
  • News/papers
  • Digitized books and other media

Challenges of Text Data

  • Unlike spatial, categorical, or numeric data, the meaning of text data can be unclear
  • Requires interpretation

Applications of Text Analysis

  • In text analysis, we often want to understand how someone (or some organization) is talking about an event or topic
  • For instance, newspapers might describe the same event very differently
  • Can we quantify these differences?

Families of Text Analysis

Term Frequency

  • An initial way to analyze text data is term frequency
  • The first family, term frequency analysis, represents text as observations that vary in how often certain strings of characters (e.g., words) appear.
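
As a minimal sketch of the idea, the base R snippet below counts terms in two made-up sentences. The documents and the simple tokenizing rule (split on anything that is not a letter or apostrophe) are assumptions for illustration only.

```r
# Toy documents (made up for illustration)
docs <- c(
  paper_a = "The council approved the budget",
  paper_b = "Critics slammed the budget as reckless"
)

# Lowercase, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(docs), "[^a-z']+")
names(tokens) <- names(docs)

# Count how often each term appears in each document
term_counts <- lapply(tokens, table)

term_counts$paper_a["the"]
term_counts$paper_b["budget"]
```

Each document becomes a vector of counts, which is the representation the term frequency family works with.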

Closed and Open Vocabulary

Closed Vocabulary

  • Example: word dictionaries
  • Masculine - feminine
  • Positive - negative
  • Threatening language
  • Moral foundations
  • Used for testing existing theories

Sentiment Dictionary

  • AFINN
  • Scores range from -5 (most negative) to 5 (most positive)
  • Others include Bing (positive/negative) and NRC (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust)
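
To make dictionary scoring concrete, here is a rough sketch in base R. The four-word lexicon and its scores are invented for illustration (they are not real AFINN values), and `score_text` is a hypothetical helper name.

```r
# Hypothetical mini-lexicon on an AFINN-style -5 to 5 scale
# (these scores are invented, not taken from AFINN)
toy_lexicon <- c(good = 3, great = 4, bad = -3, terrible = -4)

# Sum the scores of every word in the text that appears in the lexicon
score_text <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  matched <- lexicon[words[words %in% names(lexicon)]]
  sum(matched)
}

score_text("A great day with good friends", toy_lexicon)
score_text("Terrible service and bad food", toy_lexicon)
```

Words outside the dictionary contribute nothing, which is why the choice of dictionary matters so much in a closed-vocabulary analysis.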

Open Vocabulary

  • Example: differential language analysis
  • Researchers correlate each of the \(k\) most common words with outcome variables
  • What linguistic features are associated with gender, age?
  • Depression?
  • Used for generating new theories
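
A small sketch of the open-vocabulary logic, using base R. The posts and happiness scores are made up, and \(k = 2\) is an arbitrary choice for the example.

```r
# Toy data: five made-up posts and a made-up happiness score for each author
posts <- c("love love weekend", "tired tired work", "love work",
           "tired weekend", "love love love")
happiness <- c(8, 3, 6, 4, 9)

tokens <- strsplit(posts, " ")

# Find the k most common words across all posts
k <- 2
top_words <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:k]

# Count each top word in each post (rows = posts, columns = words)
counts <- sapply(top_words, function(w) {
  sapply(tokens, function(tk) sum(tk == w))
})

# Correlate each word's count with the outcome
word_cor <- apply(counts, 2, cor, y = happiness)
sort(word_cor, decreasing = TRUE)
```

Nothing here says in advance which words should matter. The correlations suggest candidates, which is why this approach is used to generate theories rather than test them.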

Do dictionaries correlate with happiness?

  • We might expect that “positive” words (defined by a dictionary called LIWC) would correlate with happiness
  • This is not always the case!
  • A warning: next slide contains strong language

Do dictionaries correlate with happiness?

  • Jaidka and colleagues (2020) find that positive words are often not associated with happiness

Question

  • In pairs:
  • What is an example of text data? (e.g. tweets)
  • What might we want to know about this data? (e.g. how is it linked to happiness)
  • What kind of dictionary would be useful? (e.g. happy - unhappy words)
  • What kind of outcome would we want to observe? (e.g. survey users on happiness)

Term Frequency Recap

  • Unit of analysis: word
  • Closed vocabulary: we define a dictionary, used to test theories
  • Open vocabulary: we examine word correlations to an outcome, used to generate new theories

Activity

  • Create a dictionary from the words in the handouts
  • Circle the words with the strongest emotion (positive or negative)
  • Rank these from most positive to most negative
  • If time, compare these to AFINN scores (darenr.github.io/afinn)

Document Structure

  • The second family, document structure analysis, assumes one can extract from word co-occurrence statistics what any given document is “about” (i.e., what the appropriate keywords or themes are) and represents text as observations that vary on this feature.

Document Structure

  • Each document has “themes” or “topics”
  • Words that appear together can signal a topic (e.g. “march” alongside “january” vs. “march” alongside “soldier”)

Topic Models

  • Documents are comprised of topics
  • Topics are comprised of words
  • Researchers must specify the length of the document and the proportion of the document expected to come from each topic
  • Algorithm gives us words related to each topic

Latent Dirichlet Allocation

  1. Each document has a distribution over topics, \(\theta\)
  2. Each topic has a distribution over words, \(\phi\)
  3. Together, words and topics make up the document
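
Read generatively, the model says each word in a document comes from first picking a topic, then picking a word from that topic. The sketch below simulates that story in base R. The two topics, the four-word vocabulary, and all probabilities are hypothetical, with `theta` assumed to hold one document's topic proportions and `phi` each topic's word probabilities.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical topic mixture for one document: 70% politics, 30% sports
theta <- c(politics = 0.7, sports = 0.3)

# Hypothetical word probabilities within each topic; each row sums to 1
vocab <- c("vote", "senate", "goal", "match")
phi <- rbind(
  politics = c(0.5, 0.4, 0.05, 0.05),
  sports   = c(0.05, 0.05, 0.5, 0.4)
)
colnames(phi) <- vocab

# Generate a 20-word document: pick a topic for each word, then a word from it
n_words <- 20
topics <- sample(names(theta), n_words, replace = TRUE, prob = theta)
words <- vapply(topics, function(z) sample(vocab, 1, prob = phi[z, ]),
                character(1))

table(topics)
table(words)
```

Fitting a topic model runs this story in reverse: given only the words, the algorithm estimates the topic and word distributions that could plausibly have produced them.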

Topic Models

  • Examples:
  • What topics have newspapers discussed?
  • Academic journals?
  • College application essays?

Community Detection Algorithms

  • The researcher builds networks of documents
  • Recall: two mode networks
  • For text analysis, what is the first level? The second level?

Community Detection Algorithms

  • For example:
  • What library books discuss the same topics?
  • What sermons use similar themes?
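
One way to picture the two-mode setup is a document-by-word matrix that gets projected onto documents. The base R sketch below uses made-up counts for three hypothetical books. Treating documents and words as the two modes is an assumption of this example.

```r
# Toy document-term matrix: rows are documents, columns are words
# (a two-mode network: documents are one mode, words are the other)
dtm <- rbind(
  book_a = c(faith = 2, grace = 1, engine = 0, steam = 0),
  book_b = c(faith = 1, grace = 3, engine = 0, steam = 0),
  book_c = c(faith = 0, grace = 0, engine = 2, steam = 4)
)

# Project onto documents: entry [i, j] sums the products of the counts
# of every word that documents i and j share
doc_net <- dtm %*% t(dtm)
diag(doc_net) <- 0  # drop self-ties

doc_net
```

Here book_a and book_b are tied through shared vocabulary while book_c stands apart. A community detection algorithm run on a network like this would be expected to group the first two together.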

Document Structure Recap

  • Unit of analysis: document
  • The researcher must choose how to define a document (page? chapter? book?)

Activity

  • Identify 3 topics in a document
  • Which words are associated with a given topic?

Semantic Similarity

  • The third family, semantic similarity analysis, attempts to quantify the meaning of strings of characters and represents texts as collections of such meanings.

Semantic Similarity

  • Unit of analysis: shared meanings between words

Within Corpus Approach

  • The researcher finds “semantic neighbors” of words through common usage
  • Unit of analysis: meanings
  • Examples:
  • What language is associated with demographic groups (e.g. race, class, gender) in certain texts?
  • Does language reveal “implicit bias” in how people think about concepts/people/etc.?
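
A common way to measure how close two words are in meaning is the cosine similarity between their vectors. The sketch below uses base R with invented three-dimensional vectors (real word embeddings are learned from a corpus and have many more dimensions), and `cosine` is a helper defined here for the example.

```r
# Hypothetical 3-dimensional word vectors, invented for illustration
embeddings <- rbind(
  doctor = c(0.9, 0.1, 0.3),
  nurse  = c(0.8, 0.2, 0.4),
  banana = c(0.1, 0.9, 0.0)
)

# Cosine similarity: dot product divided by the product of vector lengths
cosine <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine(embeddings["doctor", ], embeddings["nurse", ])
cosine(embeddings["doctor", ], embeddings["banana", ])
```

Words whose vectors point in similar directions score close to 1, so "doctor" and "nurse" come out as nearer semantic neighbors than "doctor" and "banana" in this toy setup.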

Between Corpus Approach

  • Unit of analysis: documents
  • For example:
  • How does gender bias vary across Wikipedia entries?
  • How does discussion of COVID-19 vary across counties?

Semantic Meanings

From Van Loon, 2022

Word Embeddings

From Van Loon, 2022

Semantic Similarity

  • Hoffman (2020) maps semantic similarity of New York Society Library Collection (1789-92)

Activity

  • In an article, identify an area of meaning with “semantically similar” words
  • Are these meanings shared within an article, or across?
  • Can you map the three newspapers (IHE, NYT, NYP) as a network, according to their semantic (dis)similarities?

Tidy Text

Sentiment Analysis

Document Term Matrix

Text Analysis Recap

  • We live in an age where text data is everywhere
  • However, extracting meaning out of text data can be challenging
  • Three “families” of methods:
  • Term frequencies
  • Document structure
  • Semantic similarities
  • We will use tidytext to explore these on Wednesday!

Zoom

Warm-up: Sign up for presentation times!

  • Go to Canvas -> Collaborations
  • Pick a time (if nobody else has it)

Learning Goals

  • Download book data with gutenbergr
  • Download newspaper data with guardianapi
  • Structure data with tidytext
  • Term frequency analysis
  • Sentiment analysis
  • If time, Reddit data with RedditExtractoR

Office Hours

  • Office Hours: Friday 1:30pm-3:00pm (Zoom)
  • Tuesdays, 10:30am-12:00pm (Yao)

Create a Guardian API Key

Gutenberg

What is Project Gutenberg?

  • A huge repository of digitized books (>75,000!)
  • Many are older books with expired US copyright

Explore Gutenberg

  • In pairs/groups:
  • Go to gutenberg.org
  • Pick a book!
  • Copy the id number (EBook-No.)

Downloading Books

  • Use gutenberg_download to download your book
  • Put your id in the gutenberg_id field
#install.packages("gutenbergr")
library(gutenbergr)

# Look at Gutenberg books written by Durkheim
gutenberg_works(author == "Durkheim, Émile")
## # A tibble: 1 × 8
##   gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
##          <int> <chr>     <chr>                <int> <fct>    <chr>              
## 1        41360 The Elem… Durkh…               40654 en       Browsing: Culture/…
## # ℹ 2 more variables: rights <fct>, has_text <lgl>
# download book - notice that we need the id number (also on the gutenberg website)
efrl <- gutenberg_download(gutenberg_id = 41360, 
                           mirror = "http://mirrors.xmission.com/gutenberg/")

Downloading Books

  • Download your book!
# download book - notice that we need the id number (also on the gutenberg website)
efrl <- gutenberg_download(gutenberg_id = 41360, 
                           mirror = "http://mirrors.xmission.com/gutenberg/")

What is TidyText?

What is Tidy Data?

  • Each observation is a row
  • Each variable a column
  • Each type of observational unit a table

What is Tidy Data?

Row | Person | Birthday | Occupation
1 | Joe | 12/3/1963 | Carpenter
2 | Malik | 6/8/1978 | Architect
3 | Suzanna | 4/3/2001 | Student

What is Tidy Data?

Row | County | Temperature | PM2.5
1 | Santa Clara | 78.1 | 12.1
2 | San Mateo | 82.3 | 32.1
3 | San Francisco | 65.4 | 44.7

What is TidyText Data?

  • How should we organize data with text?
  • For example, newspaper articles

What is TidyText Data?

This is nice …

Row | Paper | Article | Text
1 | New York Times | Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke | Using a single gas-stove burner can raise indoor concentrations of benzene, …
2 | New York Times | Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke | For the peer-reviewed study, researchers at Stanford’s Doerr School of Sustainability …
3 | New York Times | Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke | In about a third of the homes, a single gas burner …

What is TidyText Data?

But we often prefer this

Row | Paper | Article | Text
1 | New York Times | Study Compares Gas Stove Pollu… | Using
2 | New York Times | Study Compares Gas Stove Pollu… | a
3 | New York Times | Study Compares Gas Stove Pollu… | single

What is TidyText Data?

  • Take a look at your gutenberg data
  • Is your gutenberg data in tidytext format?

What is TidyText Data?

  • Put it into tidytext format with unnest_tokens
library(tidytext)
library(dplyr)
library(magrittr)

# try to tokenize into single words
efrl %<>%
  unnest_tokens(word, text)
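
To see what unnest_tokens does before running it on a whole book, here is a small sketch on a made-up two-line "book". The `toy_book` data are invented for illustration.

```r
library(dplyr)
library(tidytext)

# A toy two-line "book" (made up for illustration)
toy_book <- tibble(
  line = 1:2,
  text = c("Religion is a social thing.", "Society is the source.")
)

# One row per word; the other columns are carried along
toy_tidy <- toy_book %>%
  unnest_tokens(word, text)

toy_tidy
```

By default unnest_tokens lowercases the text and drops punctuation, so each of the nine words ends up on its own row with its original line number attached.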

Word Frequencies

  • What are the most frequent words in your book?
# count how often each word appears, most frequent first
efrl %>% 
  count(word, sort = TRUE)

Word Frequencies

  • Your most common word might be “the”
  • What should we do now?
  • Remove words like “the” (Try View(stop_words))
  • Standardize documents by word frequency (does our document use “the” more than others?)
library(tidytext)
library(dplyr)
library(magrittr)

# count how often each word appears, most frequent first
efrl %>% 
  count(word, sort = TRUE)
## # A tibble: 11,336 × 2
##    word      n
##    <chr> <int>
##  1 the   16907
##  2 of     9685
##  3 is     5874
##  4 to     5681
##  5 and    4653
##  6 it     4597
##  7 in     4569
##  8 a      4277
##  9 which  3337
## 10 that   3138
## # ℹ 11,326 more rows
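
The second option, standardizing by word frequency, can be sketched with dplyr as below. The two documents and their counts are made up for illustration.

```r
library(dplyr)

# Toy word counts for two made-up documents
word_counts <- tibble(
  doc  = c("a", "a", "a", "b", "b", "b"),
  word = c("the", "religion", "society", "the", "market", "price"),
  n    = c(50, 30, 20, 10, 60, 30)
)

# Convert raw counts to each word's share of its own document
word_share <- word_counts %>%
  group_by(doc) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

word_share %>%
  filter(word == "the")
```

Comparing proportions rather than raw counts makes documents of different lengths comparable, so we can ask whether one text really uses "the" more than another.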

Guardian Data

Downloading Guardian Data

  • First, load the guardianapi package
library(guardianapi)

Set your API Key!

  • Then set your api key
library(guardianapi)

gu_api_key("your key here")

Downloading Guardian Data

  • Search for some terms!
  • It often helps to combine multiple terms to narrow the results
guardian <- gu_content('"San Jose" AND "California"',
                       from_date = "2025-01-01",
                       to_date = "2026-05-11")

Working with TidyText Data

  • Look at your data
  • Are the data in tidytext format?
# look at your guardian data
guardian %>%
  head()
## # A tibble: 6 × 45
##   id        type  section_id section_name web_publication_date web_title web_url
##   <chr>     <chr> <chr>      <chr>        <dttm>               <chr>     <chr>  
## 1 us-news/… arti… us-news    US news      2026-01-30 05:54:00  Matt Mah… https:…
## 2 us-news/… arti… us-news    US news      2025-12-09 05:58:55  Communit… https:…
## 3 us-news/… arti… us-news    US news      2026-04-08 08:16:37  ICE agen… https:…
## 4 sport/20… arti… sport      Sport        2026-01-29 03:37:36  ICE agen… https:…
## 5 us-news/… arti… us-news    US news      2026-04-23 12:44:09  Leading … https:…
## 6 us-news/… arti… us-news    US news      2026-02-27 21:00:18  Californ… https:…
## # ℹ 38 more variables: api_url <chr>, tags <lgl>, is_hosted <lgl>,
## #   pillar_id <chr>, pillar_name <chr>, headline <chr>, standfirst <chr>,
## #   trail_text <chr>, byline <chr>, main <chr>, body <chr>, wordcount <dbl>,
## #   first_publication_date <dttm>, is_inappropriate_for_sponsorship <lgl>,
## #   is_premoderated <lgl>, last_modified <dttm>, production_office <chr>,
## #   publication <chr>, short_url <chr>, should_hide_adverts <lgl>,
## #   show_in_related_content <lgl>, thumbnail <chr>, legally_sensitive <lgl>, …

Working with TidyText Data

  • Let’s put the data in tidytext format
  • Let’s first limit our dataset to liveblogs using filter()
# first, set up liveblog dataframe
tidy_blogs <- guardian %>%
  filter(type == "liveblog")

Working with TidyText Data

  • Use unnest_tokens to put in tidytext format
  • Remember stop_words?
  • We can remove these with an anti_join
# unnest tokens
tidy_blogs %<>%
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words)

Working with TidyText Data

  • What are stop_words?
  • View with View(stop_words)
  • Three different lexicons: SMART, snowball, and onix
  • Stop words are essentially words that are not useful for our analyses, such as “the”
  • Are there any surprising words there?

Working with TidyText Data

  • What happens when we anti_join the stop_words?
  • Let’s take a closer look at joins
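
A quick sketch of the difference between the two joins, using dplyr on a four-word example. The `toy_stops` table is a hypothetical two-word stand-in for the much longer stop_words table.

```r
library(dplyr)

tokens <- tibble(word = c("the", "ritual", "of", "sacrifice"))

# A hypothetical two-word stop list (the real stop_words table is much longer)
toy_stops <- tibble(word = c("the", "of"))

# inner_join keeps only the rows that have a match in the second table
kept_by_inner <- inner_join(tokens, toy_stops, by = "word")

# anti_join keeps only the rows that do NOT have a match
kept_by_anti <- anti_join(tokens, toy_stops, by = "word")

kept_by_inner
kept_by_anti
```

So an anti_join against a stop list is a filter: it drops every token found in the list and leaves the rest of the data untouched.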

Working with TidyText Data

  • Coming back to our blog data, let’s look at the result
# look at examples
tidy_blogs %>%
  select(type, word) %>%
  head()
## # A tibble: 6 × 2
##   type     word        
##   <chr>    <chr>       
## 1 liveblog concludes   
## 2 liveblog coverage    
## 3 liveblog politics    
## 4 liveblog day         
## 5 liveblog reading     
## 6 liveblog developments

Working with TidyText Data

  • Now that we have a tokenized dataset, new analyses become simple
  • For example, we can use the count() function to get word frequencies
# look at blog word frequencies
tidy_blogs %>%
  count(word, sort = TRUE)
## # A tibble: 7,880 × 2
##    word               n
##    <chr>          <int>
##  1 trump            729
##  2 president        251
##  3 donald           245
##  4 house            237
##  5 trump’s          208
##  6 white            182
##  7 people           147
##  8 federal          133
##  9 administration   122
## 10 war              119
## # ℹ 7,870 more rows

Working with TidyText Data

  • What if we repeat the prior steps for articles, not blogs?
  • Try it!
tidy_articles <- guardian %>%
  filter(type == "article")

# make tidytext format, remove stop words
tidy_articles %<>%
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words)
# look at article word frequencies
tidy_articles %>%
  count(word, sort = TRUE)

Working with TidyText Data

  • We can examine word frequencies across categories (like blogs vs. articles)
library(tidyr)
frequency <- bind_rows(tidy_blogs,
                       tidy_articles) %>% 
  count(type, word) %>%
  group_by(type) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = type, values_from = proportion) 

Working with TidyText Data

library(scales)
library(ggplot2)

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = article, y = liveblog, 
                      color = abs(article - liveblog))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  theme(legend.position="none") +
  labs(x = "Articles", y = "Blogs")

Working with TidyText Data

Examine Term Frequency

  • Create term frequency variable
# count how often each word appears in each section
tf_articles <- tidy_articles %>%
  select(section_name, word) %>%
  count(section_name, word, sort = TRUE)

Examine Term Frequency by Section

  • \(TFIDF\), or term frequency-inverse document frequency, shows how often a term occurs in one document relative to how common it is across all documents
  • Try it!
# count how often each word appears in each section
tf_articles <- tidy_articles %>%
  select(section_name, word) %>%
  count(section_name, word, sort = TRUE)

# create tfidf
articles_tf_idf <- tf_articles %>%
  bind_tf_idf(word, section_name, n)
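
To unpack what the tf-idf numbers mean, the sketch below computes them by hand with dplyr on made-up counts. It assumes the usual definition, term frequency times the natural log of (number of documents / number of documents containing the term), and the section names and counts are invented.

```r
library(dplyr)

# Toy counts: how often each word appears in each (made-up) section
toy_counts <- tibble(
  section = c("Sport", "Sport", "Politics", "Politics"),
  word    = c("game", "team", "vote", "team"),
  n       = c(3, 1, 3, 1)
)

n_docs <- n_distinct(toy_counts$section)

toy_tf_idf <- toy_counts %>%
  group_by(section) %>%
  mutate(tf = n / sum(n)) %>%          # word's share of its section
  group_by(word) %>%
  mutate(idf = log(n_docs / n())) %>%  # rarer across sections = larger idf
  ungroup() %>%
  mutate(tf_idf = tf * idf)

toy_tf_idf
```

"team" appears in both sections, so its idf and tf-idf are zero, while "game" and "vote" each appear in only one section and score highest there.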

TFIDF

Sentiment Analysis

Sentiment Analysis

  • What is sentiment analysis?
  • A way to analyze sentiments/emotions using text data
  • Three “sentiment lexicons” are available through tidytext’s get_sentiments(): AFINN, Bing, and NRC (AFINN and NRC are downloaded via the textdata package on first use)
  • Let’s explore these

Sentiment Analysis

  • AFINN
  • Numeric scale, ranging from negative (-5) to positive (5)
# look at afinn lexicon
get_sentiments("afinn")

Sentiment Analysis

  • Bing
  • Positive/Negative
# look at bing lexicon
get_sentiments("bing")

Sentiment Analysis

  • NRC
  • Categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust
# look at nrc lexicon
get_sentiments("nrc")

Sentiment Dictionaries

  • AFINN (-5 to 5)
  • Bing (positive, negative)
  • NRC (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust)
# look at afinn lexicon
get_sentiments("afinn")

# look at bing lexicon
get_sentiments("bing")

# look at nrc lexicon
get_sentiments("nrc")

Add Sentiment to Our Data

  • Let’s join our tidy_articles with the NRC dictionary
  • What kind of join should we perform?

Add Sentiment to Our Data

  • Try an inner_join
# join nrc with tidy articles
tidy_articles %<>%
  inner_join(get_sentiments("nrc"))
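
To see why an inner_join fits here, the sketch below joins four made-up words to a lexicon. It uses Bing rather than NRC only because Bing ships with tidytext and needs no extra download. The word list is invented for illustration.

```r
library(dplyr)
library(tidytext)

toy_words <- tibble(word = c("happy", "terrible", "table", "love"))

# Keep only the words that appear in the lexicon, with their sentiment attached
toy_sentiment <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word")

toy_sentiment %>%
  count(sentiment)
```

Words with no entry in the lexicon, such as "table", drop out of the result, so what remains is exactly the set of words that carry a sentiment label.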

Plot Sentiments!

  • Use ggplot() to plot sentiments
library(ggplot2)
ggplot(tidy_articles, aes(y = sentiment))+
  geom_bar(aes(fill = sentiment))+
  theme_minimal()+
  labs(title = "Sentiments in Guardian Articles about San Jose")

Plot Sentiments!

  • You can use facet_wrap to plot sentiments for each category
  • Try it!
library(ggplot2)

ggplot(tidy_articles, aes(y = sentiment))+
  geom_bar(aes(fill = sentiment))+
  theme_minimal()+
  labs(title = "Sentiments in Articles about San Jose")+
  facet_wrap(~section_name, scales = "free_x")

Plot Sentiments!

Recap

  • Lots of sources of text data (books, news, social media)
  • In general: we want to put our data in tidytext format (one word per row)
  • Which “families” of text analysis have we performed?
  • Next week: topic models, LLMs and AI
  • Be sure to sign up for a presentation time!

Office Hours

  • Office Hours: Friday 1:30pm-3:00pm (Zoom)
  • Tuesdays, 10:30am-12:00pm (Yao)

Extracting Data from Reddit

Extracting Data From Reddit

  • The only thing required is the RedditExtractoR package!
  • First, let’s find the subreddits related to San Jose
library(RedditExtractoR)

# extract sj subreddits
sj_subreddits <- find_subreddits("san jose")

Extracting Data from Reddit

  • There are two ways to get urls
  • First, search by subreddit
# we can get urls of the san jose subreddit
sj_urls <- find_thread_urls(subreddit = "SanJose",
                            period = "day")

Extracting Data from Reddit

  • There are two ways to get urls
  • First, search by subreddit
  • Second, search by keywords
# alternatively, we can find urls of all pages related to san jose
sj_urls <- find_thread_urls(keywords = "san jose", 
                            period = "day")

Extracting Data from Reddit

  • Now that we have urls, let’s get the content
# extract comments from these pages
sj_comments <- get_thread_content(sj_urls$url)

Extracting Data from Reddit

  • Take 5 minutes in groups to pull Reddit data
  • Try pulling urls from a different subreddit or keyword
# we can get urls of the san jose subreddit
sj_urls <- find_thread_urls(subreddit = "SanJose",
                                 period = "day")

# extract comments from these pages
sj_comments <- get_thread_content(sj_urls$url)

Extracting Data from Reddit

  • We want to look at the content!
  • But what type of object do we have?
# check what type of object we have
class(sj_comments)

Extracting Data from Reddit

# look at the columns in the comments data frame
names(sj_comments$comments)

Extracting Data from Reddit

# look at the columns in the threads data frame
names(sj_comments$threads)

Extracting Data from Reddit

  • Let’s put it in TidyText format!
library(dplyr)
library(tidytext)

tidy_comments <- sj_comments$comments %>%
  unnest_tokens(word, comment) %>%
  anti_join(stop_words)

# look at words and timestamps
tidy_comments %>%
  select(timestamp, word)