Intro to Computational Social Science: Week 8

2026-05-20

Warm-up

Groups: As you arrive
Look at the news blurbs
How is word usage different across the articles?
Which words carry strongest emotion?
What “topics” are covered in each?

Today’s Class

Warm-up: news articles
“Families” of Text Analysis
Activity: categorization by hand

Wednesday’s Class

Text analysis with tidytext

Office Hours

Office Hours: Today, Friday 1:30pm-3:00pm (Tyler)
Tuesdays, 10:30am-12:00pm (Yao)

Learning Goals

Motivate quantitative approaches to text analysis
Explore classification strategies
Understand term frequencies and document frequencies
Understand dictionary-based approaches
Discuss semantic approaches
(Wednesday) Text analysis with tidytext

Text as Data

Why Text?

We live in an increasingly literate and text-producing world
Text carries lots of important information!
We can observe online interactions

Sources of Text Data

Social media interactions (we can observe these directly)
News/papers
Digitized books and other media

Challenges of Text Data

Unlike spatial, categorical, or numeric data, the meaning of text data can be unclear
Requires interpretation

Applications of Text Analysis

In text analysis, we often want to understand how someone (or some organization) is talking about an event or topic
For instance, newspapers might describe the same event very differently
Can we quantify these differences?

Families of Text Analysis

Term Frequency

An initial way to analyze text data is term frequency
The first family, term frequency analysis, represents text as observations that vary in how often certain strings of characters (e.g., words) appear.

Closed and Open Vocabulary

Closed Vocabulary

Example: word dictionaries
Masculine - feminine
Positive - negative
Threatening language
Moral foundations
Used for testing existing theories

Sentiment Dictionary

AFINN
Ranges from -5 (most negative) to 5 (most positive)
Others include Bing (positive/negative) and NRC (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust)
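A minimal sketch of how a closed-vocabulary dictionary is applied, in base R. The four-word lexicon and its scores are made up for illustration and are not the real AFINN values; score_text is a hypothetical helper.

```r
# Hypothetical mini-lexicon on an AFINN-style -5..5 scale (scores are invented)
toy_lexicon <- c(good = 3, great = 4, bad = -3, terrible = -4)

score_text <- function(text, lexicon) {
  # lowercase, then split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]
  # look up each word; words missing from the dictionary come back as NA
  scores <- lexicon[words]
  # add up the scores of the matched words (0 if nothing matched)
  sum(scores, na.rm = TRUE)
}

blurb_a <- "A great result, but a bad process."
blurb_b <- "The weather was terrible."

score_a <- score_text(blurb_a, toy_lexicon)
score_b <- score_text(blurb_b, toy_lexicon)
```

Words outside the dictionary contribute nothing to the score, which is why the choice of dictionary matters so much in closed-vocabulary work.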

Open Vocabulary

Example: differential language analysis
Researchers correlate each of the \(k\) most common words with outcome variables
What linguistic features are associated with gender, age?
Depression?
Used for generating new theories
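The open-vocabulary idea can be sketched in base R. The four "documents" and the happiness scores below are invented, and k = 5 is an arbitrary choice; this illustrates the logic only and is not the procedure of any particular study.

```r
# Toy corpus: one text per (invented) author, with a made-up happiness score
docs <- c("sunny walk with friends",
          "tired and stuck at work",
          "friends and a sunny beach",
          "work work work tired")
happiness <- c(8, 3, 9, 2)

# Tokenize, then keep the k most common words across the whole corpus
tokens <- strsplit(docs, " ")
k <- 5
top_words <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:k]

# Relative frequency of each top word within each document
rel_freq <- sapply(top_words, function(w) {
  sapply(tokens, function(tok) mean(tok == w))
})

# Correlate each word's usage with the outcome, strongest positive first
word_cor <- sort(cor(rel_freq, happiness)[, 1], decreasing = TRUE)
word_cor
```

No dictionary is defined in advance: the words that end up at the top and bottom of word_cor are the ones that suggest new hypotheses.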

Do dictionaries correlate with happiness?

We might expect that “positive” words (defined by a dictionary called LIWC) would correlate with happiness
This is not always the case!
A warning: next slide contains strong language

Do dictionaries correlate with happiness?

Jaidka and colleagues (2020) find that positive words are often not associated with happiness

Question

In pairs:
What is an example of text data? (e.g. tweets)
What might we want to know about this data? (e.g. how is it linked to happiness)
What kind of dictionary would be useful? (e.g. happy - unhappy words)
What kind of outcome would we want to observe? (e.g. survey users on happiness)

Term frequency recap

Unit of analysis: word
Closed vocabulary: we define a dictionary, used to test theories
Open vocabulary: we examine word correlations to an outcome, used to generate new theories

Activity

Create a dictionary from the words in the handouts
Circle words with strongest emotion (positive or negative)
Rank these from most positive to most negative
If time, compare these to the AFINN scores at darenr.github.io/afinn

Document Structure

The second family, document structure analysis, assumes one can extract from word co-occurrence statistics what any given document is “about” (i.e., what the appropriate keywords or themes are) and represents text as observations that vary on this feature.

Document Structure

Each document has “themes” or “topics”
Words combined with each other can comprise topics (e.g. “march” with “january” vs. “soldier”)

Topic Models

Documents are comprised of topics
Topics are comprised of words
Researchers must specify the length of the document and the proportion of the document expected to come from each topic
Algorithm gives us words related to each topic

Latent Dirichlet Allocation

Each document has a distribution over topics, \(\theta\)
Each topic has a distribution over words, \(\phi\)
Together, words and topics make up the document
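In the standard formulation of LDA, the generative story can be written out as below. The symbols \(K\) (number of topics), \(\alpha\) and \(\beta\) (Dirichlet hyperparameters), \(z\) (topic assignment), and \(w\) (observed word) do not appear on the slide and are introduced here as notation; \(\theta_d\) is the topic distribution of document \(d\) and \(\phi_k\) is the word distribution of topic \(k\).

\[
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{for each topic } k = 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{for each document } d \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{for each word position } n \text{ in document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
\]

Fitting the model means working backwards from the observed words \(w\) to plausible values of \(\theta\) and \(\phi\).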

Topic Models

Examples:
What topics have newspapers discussed?
Academic journals?
College application essays?

Topic Models

Topics in college essays are related to household income (Alvero et al, 2021)

Community Detection Algorithms

The researcher builds networks of documents
Recall: two mode networks
For text analysis, what is the first level? The second level?

Community Detection Algorithms

For example:
What library books discuss the same topics?
What sermons use similar themes?
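One way to picture the two-mode setup is a document-by-word matrix that is projected onto a document-by-document network. The sketch below uses base R; the sermons, words, and values are invented for illustration.

```r
# Toy two-mode (document x word) incidence matrix: 1 = word appears in document
incidence <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("sermon_a", "sermon_b", "sermon_c"),
                  c("grace", "faith", "harvest", "rain"))
)

# One-mode projection: entry [i, j] counts the words documents i and j share
doc_network <- incidence %*% t(incidence)
diag(doc_network) <- 0  # drop self-ties

doc_network
```

The resulting weighted adjacency matrix is the kind of object a community detection algorithm (for example, one from a network package such as igraph) would take as input.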

Document Structure Recap

Unit of analysis: document
The researcher must choose how to define a document (page? chapter? book?)

Activity

Identify 3 topics in a document
Which words are associated with a given topic?

Semantic Similarity

The third family, semantic similarity analysis, attempts to quantify the meaning of strings of characters and represents texts as collections of such meanings.

Semantic Similarity

Unit of analysis: shared meanings between words

Within Corpus Approach

The researcher finds “semantic neighbors” of words through common usage
Unit of analysis: meanings
Examples:
What language is associated with demographic groups (e.g. race, class, gender) in certain texts?
Does language reveal “implicit bias” in how people think about concepts/people/etc.?
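A common way to operationalize "semantic neighbors" is cosine similarity between word vectors. The sketch below uses base R on a made-up co-occurrence table; real applications typically use learned word embeddings rather than raw counts, and the cosine helper is defined here for illustration.

```r
# Toy co-occurrence counts: rows are target words, columns are context words.
# All numbers are invented for illustration.
cooc <- matrix(
  c(8, 7, 0, 1,
    7, 9, 1, 0,
    0, 1, 9, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doctor", "nurse", "guitar"),
                  c("hospital", "patient", "music", "band"))
)

# Cosine similarity: dot product divided by the product of vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim_doctor_nurse  <- cosine(cooc["doctor", ], cooc["nurse", ])
sim_doctor_guitar <- cosine(cooc["doctor", ], cooc["guitar", ])
```

Words used in similar contexts ("doctor" and "nurse") end up with a similarity close to 1, while words used in different contexts ("doctor" and "guitar") end up close to 0.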

Between Corpus Approach

Unit of analysis: documents
For example:
How does gender bias vary across Wikipedia entries?
How does discussion of COVID-19 vary across counties?

Semantic Meanings

From Van Loon, 2022

Word Embeddings

From Van Loon, 2022

Semantic Similarity

Hoffman (2020) maps semantic similarity of New York Society Library Collection (1789-92)

Activity

In an article, identify an area of meaning with “semantically similar” words
Are these meanings shared within an article, or across?
Can you map the three newspapers (IHE, NYT, NYP) as a network, according to their semantic (dis)similarities?

Tidy Text

Sentiment Analysis

Document Term Matrix

Text Analysis Recap

We live in an age where text data is everywhere
However, extracting meaning out of text data can be challenging
Three “families” of methods:
Term frequencies
Document structure
Semantic similarities
We will use tidytext to explore these on Wednesday!

Zoom

Warm-up: Sign up for presentation times!

Learning Goals

Download book data with gutenbergr
Download newspaper data with guardianapi
Structure data with tidytext
Text frequency analysis
Sentiment analysis
If time, Reddit data with RedditExtractoR

Office Hours

Office Hours: Friday 1:30pm-3:00pm zoom
Tuesdays, 10:30am-12:00pm (Yao)

Create a Guardian API Key

Go to open-platform.theguardian.com/
Sign up for an API key!

Gutenberg

What is Project Gutenberg?

A huge repository of digitized books (>75,000!)
Many are older books with expired US copyright

Explore Gutenberg

In pairs/groups:
Go to gutenberg.org
Pick a book!
Copy the id number (EBook-No.)

Downloading Books

Use gutenberg_download to download your book
Put your id in the gutenberg_id field

#install.packages("gutenbergr")
library(gutenbergr)

# Look at Gutenberg books written by Durkheim
gutenberg_works(author == "Durkheim, Émile")

## # A tibble: 1 × 8
##   gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
##          <int> <chr>     <chr>                <int> <fct>    <chr>              
## 1        41360 The Elem… Durkh…               40654 en       Browsing: Culture/…
## # ℹ 2 more variables: rights <fct>, has_text <lgl>

# download book - notice that we need the id number (also on the gutenberg website)
efrl <- gutenberg_download(gutenberg_id = 41360, 
                           mirror = "http://mirrors.xmission.com/gutenberg/")

Downloading Books

Download your book!

# download book - notice that we need the id number (also on the gutenberg website)
efrl <- gutenberg_download(gutenberg_id = 41360, 
                           mirror = "http://mirrors.xmission.com/gutenberg/")

What is TidyText?

What is Tidy Data?

Each observation is a row
Each variable a column
Each type of observational unit a table

What is Tidy Data?

Row	Person	Birthday	Occupation
1	Joe	12/3/1963	Carpenter
2	Malik	6/8/1978	Architect
3	Suzanna	4/3/2001	Student

What is Tidy Data?

Row	County	Temperature	PM2.5
1	Santa Clara	78.1	12.1
2	San Mateo	82.3	32.1
3	San Francisco	65.4	44.7

What is TidyText Data?

How should we organize data with text?
For example, newspaper articles

What is TidyText Data?

This is nice …

Row	Paper	Article	Text
1	New York Times	Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke	Using a single gas-stove burner can raise indoor concentrations of benzene, …
2	New York Times	Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke	For the peer-reviewed study, researchers at Stanford’s Doerr School of Sustainability …
3	New York Times	Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke	In about a third of the homes, a single gas burner …

What is TidyText Data?

But we often prefer this

Row	Paper	Article	Text
1	New York Times	Study Compares Gas Stove Pollu…	Using
2	New York Times	Study Compares Gas Stove Pollu…	a
3	New York Times	Study Compares Gas Stove Pollu…	single

What is TidyText Data?

Take a look at your gutenberg data
Is your gutenberg data in tidytext format?

What is TidyText Data?

Put it into tidytext format with unnest_tokens

library(tidytext)
library(dplyr)
library(magrittr)

# try to tokenize into single words
efrl %<>%
  unnest_tokens(word, text)
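To see what unnest_tokens does without downloading anything, here is a small sketch on a toy table. The two lines of text and the names toy_book and toy_tidy are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Toy data: one row per line of text (invented for illustration)
toy_book <- tibble(
  line = 1:2,
  text = c("Using a single gas burner", "can raise indoor concentrations")
)

# One row per word; by default words are lowercased and punctuation is dropped
toy_tidy <- toy_book %>%
  unnest_tokens(word, text)

toy_tidy
```

The other columns (here, line) are carried along, so every word still records where it came from.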

Word Frequencies

What are the most frequent words in your book?

# count how often each word appears, most frequent first
efrl %>% 
  count(word, sort = TRUE)

Word Frequencies

Your most common word might be “the”
What should we do now?
Remove words like “the” (Try View(stop_words))
Standardize documents by word frequency (does our document use “the” more than others?)

library(tidytext)
library(dplyr)
library(magrittr)

# count how often each word appears, most frequent first
efrl %>% 
  count(word, sort = TRUE)

## # A tibble: 11,336 × 2
##    word      n
##    <chr> <int>
##  1 the   16907
##  2 of     9685
##  3 is     5874
##  4 to     5681
##  5 and    4653
##  6 it     4597
##  7 in     4569
##  8 a      4277
##  9 which  3337
## 10 that   3138
## # ℹ 11,326 more rows

Guardian Data

Downloading Guardian Data

First, load the guardianapi package

library(guardianapi)

Set your API Key!

Then set your API key

library(guardianapi)

gu_api_key("your key here")

Downloading Guardian Data

Search for some terms!
It often helps to combine multiple terms to narrow the results

guardian <- gu_content('"San Jose" AND "California"',
                       from_date = "2025-01-01",
                       to_date = "2026-05-11")

Working with TidyText Data

Look at your data
Are the data in tidytext format?

# look at your guardian data
guardian %>%
  head()

## # A tibble: 6 × 45
##   id        type  section_id section_name web_publication_date web_title web_url
##   <chr>     <chr> <chr>      <chr>        <dttm>               <chr>     <chr>  
## 1 us-news/… arti… us-news    US news      2026-01-30 05:54:00  Matt Mah… https:…
## 2 us-news/… arti… us-news    US news      2025-12-09 05:58:55  Communit… https:…
## 3 us-news/… arti… us-news    US news      2026-04-08 08:16:37  ICE agen… https:…
## 4 sport/20… arti… sport      Sport        2026-01-29 03:37:36  ICE agen… https:…
## 5 us-news/… arti… us-news    US news      2026-04-23 12:44:09  Leading … https:…
## 6 us-news/… arti… us-news    US news      2026-02-27 21:00:18  Californ… https:…
## # ℹ 38 more variables: api_url <chr>, tags <lgl>, is_hosted <lgl>,
## #   pillar_id <chr>, pillar_name <chr>, headline <chr>, standfirst <chr>,
## #   trail_text <chr>, byline <chr>, main <chr>, body <chr>, wordcount <dbl>,
## #   first_publication_date <dttm>, is_inappropriate_for_sponsorship <lgl>,
## #   is_premoderated <lgl>, last_modified <dttm>, production_office <chr>,
## #   publication <chr>, short_url <chr>, should_hide_adverts <lgl>,
## #   show_in_related_content <lgl>, thumbnail <chr>, legally_sensitive <lgl>, …

Working with TidyText Data

Let’s put the data in tidytext format
Let’s first limit our dataset to blogs using filter()

# first, set up liveblog dataframe
tidy_blogs <- guardian %>%
  filter(type == "liveblog")

Working with TidyText Data

Use unnest_tokens to put in tidytext format
Remember stop_words?
We can remove these with an anti_join

# unnest tokens, then drop stop words
tidy_blogs %<>%
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words, by = "word")

Working with TidyText Data

What are stop_words?
View with View(stop_words)
Three different lexicons: SMART, snowball, and onix
Stop words are essentially words that are not useful for our analyses, such as “the”
Are there any surprising words there?

Working with TidyText Data

What happens when we anti_join the stop_words?
Let’s take a closer look at joins
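A small sketch of the two joins that matter here, using dplyr on toy tables. The three-word stop list toy_stop_words is hypothetical and stands in for the real stop_words table.

```r
library(dplyr)

words <- tibble(word = c("the", "city", "of", "san", "jose"))
toy_stop_words <- tibble(word = c("the", "of", "and"))  # hypothetical stop list

# anti_join keeps rows of `words` that have NO match in the stop list
kept <- anti_join(words, toy_stop_words, by = "word")

# inner_join keeps only rows of `words` that DO have a match
matched <- inner_join(words, toy_stop_words, by = "word")
```

Here kept holds "city", "san", and "jose", while matched holds "the" and "of". Removing stop words is an anti_join; attaching dictionary scores is an inner_join.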

Working with TidyText Data

Coming back to our blog data, let’s look at the result

# look at examples
tidy_blogs %>%
  select(type, word) %>%
  head()

## # A tibble: 6 × 2
##   type     word        
##   <chr>    <chr>       
## 1 liveblog concludes   
## 2 liveblog coverage    
## 3 liveblog politics    
## 4 liveblog day         
## 5 liveblog reading     
## 6 liveblog developments

Working with TidyText Data

Now that we have a tokenized dataset, new analyses become simple
For example, we can use the count() function to get word frequencies

# look at blog word frequencies
tidy_blogs %>%
  count(word, sort = TRUE)

## # A tibble: 7,880 × 2
##    word               n
##    <chr>          <int>
##  1 trump            729
##  2 president        251
##  3 donald           245
##  4 house            237
##  5 trump’s          208
##  6 white            182
##  7 people           147
##  8 federal          133
##  9 administration   122
## 10 war              119
## # ℹ 7,870 more rows

Working with TidyText Data

What if we repeat the prior steps for articles, not blogs?
Try it!

tidy_articles <- guardian %>%
  filter(type == "article")

# make tidytext format, remove stop words
tidy_articles %<>%
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words, by = "word")

# look at article word frequencies
tidy_articles %>%
  count(word, sort = TRUE)

Working with TidyText Data

We can examine word frequencies across categories (like blogs vs. articles)

library(tidyr)
frequency <- bind_rows(tidy_blogs,
                       tidy_articles) %>% 
  count(type, word) %>%
  group_by(type) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = type, values_from = proportion)

Working with TidyText Data

library(scales)
library(ggplot2)

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = article, y = liveblog, 
                      color = abs(article - liveblog))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  theme(legend.position="none") +
  labs(x = "Articles", y = "Blogs")

Working with TidyText Data

Examine Term Frequency

Create term frequency variable

# summarize total words in each section
tf_articles <- tidy_articles %>%
  select(section_name, word) %>%
  count(section_name, word, sort = TRUE)

Examine Term Frequency by Section

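TF-IDF (term frequency-inverse document frequency) weighs how often a term appears in a document against how many documents contain it, highlighting terms that are distinctive to a document. Here, each section is treated as a document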
Try it!

# count how often each word appears in each section
tf_articles <- tidy_articles %>%
  select(section_name, word) %>%
  count(section_name, word, sort = TRUE)

# create tfidf
articles_tf_idf <- tf_articles %>%
  bind_tf_idf(word, section_name, n)
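To make the arithmetic concrete, here is a by-hand version in base R on invented counts. It follows the definitions bind_tf_idf uses: tf is a word's share of its document's words, and idf is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Toy counts: how often each word appears in each section (invented numbers)
counts <- data.frame(
  section = c("Sport", "Sport", "Politics", "Politics"),
  word    = c("game", "city", "vote", "city"),
  n       = c(3, 1, 2, 2)
)

# tf: share of a section's words accounted for by this word
counts$tf <- counts$n / ave(counts$n, counts$section, FUN = sum)

# idf: natural log of (number of sections / number of sections using the word)
n_sections <- length(unique(counts$section))
sections_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_sections / sections_with_word)

counts$tf_idf <- counts$tf * counts$idf
counts
```

"city" appears in every section, so its idf (and tf-idf) is zero, while "game" and "vote" each appear in only one section and score highest there.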

TFIDF

Sentiment Analysis

What is sentiment analysis?
A way to analyze sentiments/emotions using text data
Three “sentiment lexicons” are available through tidytext’s get_sentiments(): AFINN, Bing, and NRC (AFINN and NRC are downloaded via the textdata package on first use)
Let’s explore these

Sentiment Analysis

AFINN
Numeric scale, ranging from negative (-5) to positive (5)

# look at afinn lexicon
get_sentiments("afinn")

Sentiment Analysis

Bing
Positive/Negative

# look at bing lexicon
get_sentiments("bing")

Sentiment Analysis

NRC
Categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust

# look at nrc lexicon
get_sentiments("nrc")

Sentiment Dictionaries

AFINN (-5 to 5)
Bing (positive, negative)
NRC (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust)

# look at afinn lexicon
get_sentiments("afinn")

# look at bing lexicon
get_sentiments("bing")

# look at nrc lexicon
get_sentiments("nrc")

Add Sentiment to Our Data

Let’s join our tidy_articles with the NRC dictionary
What kind of join should we perform?

Add Sentiment to Our Data

Try an inner_join

# join nrc with tidy articles
tidy_articles %<>%
  inner_join(get_sentiments("nrc"), by = "word")
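A toy sketch of what this join does to the shape of the data. The lexicon toy_nrc below is hypothetical and only imitates the NRC layout, in which one word can carry several labels.

```r
library(dplyr)

toy_words <- tibble(word = c("storm", "gift", "table"))

# Hypothetical NRC-style lexicon: a word may appear once per label it carries
toy_nrc <- tibble(
  word      = c("storm", "storm", "gift", "gift"),
  sentiment = c("fear", "negative", "joy", "positive")
)

# Words without a label ("table") are dropped; multi-label words are repeated
joined <- inner_join(toy_words, toy_nrc, by = "word")

sentiment_counts <- count(joined, sentiment)
```

After the join, a row is a word-sentiment pair rather than a word, so the bars in a sentiment plot count labels and not tokens. With real data, where the same word occurs many times, recent versions of dplyr may also warn about a many-to-many relationship.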

Plot Sentiments!

Use ggplot() to plot sentiments

library(ggplot2)
ggplot(tidy_articles, aes(y = sentiment))+
  geom_bar(aes(fill = sentiment))+
  theme_minimal()+
  labs(title = "Sentiments in Guardian Articles about San Jose")

Plot Sentiments!

You can use facet_wrap to plot sentiments for each category
Try it!

library(ggplot2)

ggplot(tidy_articles, aes(y = sentiment))+
  geom_bar(aes(fill = sentiment))+
  theme_minimal()+
  labs(title = "Sentiments in Articles about San Jose")+
  facet_wrap(~section_name, scales = "free_x")

Plot Sentiments!

Recap

Lots of sources of text data (books, news, social media)
In general: we want to put our data in tidytext format (one word per row)
Which “families” of text analysis have we performed?
Next week: topic models, LLMs and AI
Be sure to sign up for a presentation time!

Extracting Data from Reddit

Extracting Data From Reddit

The only thing required is the RedditExtractoR package!
First, let’s find the subreddits related to San Jose

library(RedditExtractoR)

# extract sj subreddits
sj_subreddits <- find_subreddits("san jose")

Extracting Data from Reddit

There are two ways to get urls
First, search by subreddit

# we can get urls of the san jose subreddit
sj_urls <- find_thread_urls(subreddit = "SanJose",
                            period = "day")

Extracting Data from Reddit

There are two ways to get urls
First, search by subreddit
Second, search by keywords

# alternatively, we can find urls of all pages related to san jose
sj_urls <- find_thread_urls(keywords = "san jose", 
                            period = "day")

Extracting Data from Reddit

Now that we have urls, let’s get the content

# extract comments from these pages
sj_comments <- get_thread_content(sj_urls$url)

Extracting Data from Reddit

Take 5 minutes in groups to pull Reddit data
Try pulling urls from a different subreddit or keyword

# we can get urls of the san jose subreddit
sj_urls <- find_thread_urls(subreddit = "SanJose",
                            period = "day")

# extract comments from these pages
sj_comments <- get_thread_content(sj_urls$url)

Extracting Data from Reddit

We want to look at the content!
But what type of object do we have?

# check what type of object we have
class(sj_comments)

Extracting Data from Reddit

# look at the columns of the comments data frame
names(sj_comments$comments)

Extracting Data from Reddit

# look at the columns of the threads data frame
names(sj_comments$threads)

Extracting Data from Reddit

Let’s put it in TidyText format!

library(dplyr)
library(tidytext)

# one row per word, with stop words removed
tidy_comments <- sj_comments$comments %>%
  unnest_tokens(word, comment) %>%
  anti_join(stop_words, by = "word")

# look at words and timestamps
tidy_comments %>%
  select(timestamp, word)