Introduction

The hands-on exercise for this week focuses on: 1) scaling texts; 2) implementing scaling techniques using quanteda.

In this tutorial, you will learn how to:

Scale texts using the “wordfish” algorithm
Scale texts gathered from online sources
Replicate analyses by @kaneko_estimating_2021

Before proceeding, we’ll load the packages we will need for this tutorial.

library(dplyr)
library(quanteda) # for building corpus, tokens and dfm objects
library(quanteda.textmodels) # for estimating the wordfish model
library(quanteda.textplots) # for visualizing text modelling results
library(kableExtra) # for kbl() and kable_styling() tables

In this exercise we’ll be using the dataset we used for the sentiment analysis exercise. The data were collected from the Twitter accounts of the top eight newspapers in the UK by circulation, and include all tweets posted from each outlet’s main account.

Importing data

If you’re working on this document from your own computer (“locally”) you can download the tweets data in the following way:

tweets  <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/sentanalysis/newstweets.rds?raw=true")))

We first take a sample from these data to speed up the runtime of some of the analyses.

tweets <- tweets %>%
  sample_n(20000)
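
Because `sample_n()` draws rows at random, the counts and estimates shown later in this tutorial will differ from run to run unless a seed is set first. A minimal sketch of a reproducible draw, using a small invented data frame (`toy_tweets` is hypothetical and stands in for the tweets data):

```r
library(dplyr)

# Invented stand-in for the tweets data frame
toy_tweets <- data.frame(id = 1:100,
                         user_username = rep(c("guardian", "TheSun"), each = 50))

# Setting the same seed before each draw gives the same sample
set.seed(123L)
draw_one <- toy_tweets %>% sample_n(10)

set.seed(123L)
draw_two <- toy_tweets %>% sample_n(10)

identical(draw_one, draw_two)
```

The seed value itself is arbitrary; what matters is that it is set immediately before the random draw.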

Construct `dfm` object

Then, as in the previous exercise, we create a corpus object, specify the document-level variables by which we want to group, and generate our document feature matrix.

#make corpus object, specifying tweet as text field
tweets_corpus <- corpus(tweets, text_field = "text")

#add in username document-level information
docvars(tweets_corpus, "newspaper") <- tweets$user_username

dfm_tweets <- dfm(tokens(tweets_corpus,
                    remove_punct = TRUE)) %>%
  dfm_select(pattern = stopwords("english"), 
             selection = "remove",
             valuetype = "fixed")
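
To see what each step in this pipeline does, it can help to run it on a couple of short texts first. The sketch below uses two invented sentences (`toy_texts` is not taken from the tweets data) and the same quanteda calls as above:

```r
library(quanteda)

# Two invented example texts (not taken from the tweets data)
toy_texts <- c(text1 = "The police arrested a suspect in Spain.",
               text2 = "Many people believe the earth is flat!")

# Tokenise without punctuation, then count features per document
toy_dfm <- dfm(tokens(corpus(toy_texts), remove_punct = TRUE))

# Drop English stopwords, as in the pipeline above
toy_dfm <- dfm_select(toy_dfm,
                      pattern = stopwords("english"),
                      selection = "remove",
                      valuetype = "fixed")

# Remaining features are lowercased content words
featnames(toy_dfm)
```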

We can then have a look at the number of documents (tweets) we have per newspaper Twitter account.

## number of tweets per newspaper
table(docvars(dfm_tweets, "newspaper"))

## 
##     DailyMailUK     DailyMirror EveningStandard        guardian         MetroUK 
##            2081            5926            2207            2843             943 
##       Telegraph          TheSun        thetimes 
##            1520            3794             686

And this is what our document feature matrix looks like, where each word has a count for each of our 20,000 sampled tweets.

dfm_tweets

## Document-feature matrix of: 20,000 documents, 49,244 features (99.98% sparse) and 31 docvars.
##        features
## docs    british jihadi rapper suspected isis beatle john arrested spain
##   text1       1      2      1         1    1      1    1        1     1
##   text2       0      0      0         0    0      0    0        0     0
##   text3       0      0      0         0    0      0    0        0     0
##   text4       0      0      0         0    0      0    0        0     0
##   text5       0      0      0         0    0      0    0        0     0
##   text6       0      0      0         0    0      0    0        0     0
##        features
## docs    anti-terror
##   text1           1
##   text2           0
##   text3           0
##   text4           0
##   text5           0
##   text6           0
## [ reached max_ndoc ... 19,994 more documents, reached max_nfeat ... 49,234 more features ]

Estimate wordfish model

Once we have our data in this format, we are able to group and trim the document feature matrix before estimating the wordfish model.

# compress the document-feature matrix at the newspaper level
dfm_newstweets <- dfm_group(dfm_tweets, groups = newspaper)

# remove words not used by two or more newspapers
dfm_newstweets <- dfm_trim(dfm_newstweets, 
                                min_docfreq = 2, docfreq_type = "count")

## size of the document-feature matrix
dim(dfm_newstweets)

## [1]     8 11214
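
The grouping and trimming steps are easier to follow on a tiny example. The sketch below uses four invented texts from two hypothetical outlets, "A" and "B"; after grouping, `min_docfreq = 2` keeps only the words that appear in both grouped documents:

```r
library(quanteda)

# Four invented texts with a hypothetical "outlet" document variable
toy_corpus <- corpus(
  c("cops chase dog", "cops rescue cub",
    "ministers debate budget", "ministers face cops"),
  docvars = data.frame(outlet = c("A", "A", "B", "B"))
)
toy_dfm <- dfm(tokens(toy_corpus))

# Sum the counts within each outlet: four documents become two
toy_grouped <- dfm_group(toy_dfm, groups = outlet)

# Keep only the features that occur in at least two grouped documents
toy_trimmed <- dfm_trim(toy_grouped, min_docfreq = 2, docfreq_type = "count")

dim(toy_grouped)
featnames(toy_trimmed)
```

Here only "cops" is used by both outlets, so it is the only feature left after trimming.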

#### estimate the Wordfish model ####
set.seed(123L)
dfm_newstweets_results <- textmodel_wordfish(dfm_newstweets)

The model summary looks like this.

summary(dfm_newstweets_results)

## 
## Call:
## textmodel_wordfish.dfm(x = dfm_newstweets)
## 
## Estimated Document Positions:
##                    theta       se
## DailyMailUK      0.68745 0.012530
## DailyMirror      1.17434 0.006574
## EveningStandard -0.22178 0.015659
## guardian        -0.97854 0.010922
## MetroUK         -0.08295 0.022719
## Telegraph       -0.95957 0.012159
## TheSun           1.44379 0.005828
## thetimes        -1.06274 0.015344
## 
## Estimated Feature Scores:
##      british  jihadi rapper suspected     isis  beatle    john arrested   spain
## beta -0.2076 0.02223 0.4123     0.107 -0.03389  0.6622 -0.5457   0.4938 -0.6025
## psi   3.4092 0.22591 0.6840     1.414  0.32932 -2.0069  2.0595   1.9968  1.5467
##      anti-terror police    many  people believe  earth   flat    nasa
## beta      0.5180 0.1513 -0.6732 -0.2758 -0.2282 0.1516 0.6371  0.3014
## psi      -0.7848 3.6143  2.4356  4.1062  1.0807 1.2319 1.1065 -0.3358
##      conspiracy #emmerdale cancel     set   tours     amid coronavirus pandemic
## beta    -0.0818      2.704 0.1990 -0.1202  0.4068 -0.02598     0.02225  -0.2969
## psi      0.7265     -1.759 0.8275  3.0522 -0.1873  3.47168     5.82679   3.3194
##      spanish       flu survivor  becomes world’s
## beta -0.0837 -0.000629  0.78131 -0.00149  0.0828
## psi   0.8272  1.216601 -0.03734  1.60138  0.7084
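
For reference when reading this summary, Wordfish is commonly written as a Poisson model of word counts, following Slapin and Proksch (2008). In the sketch below, \(y_{ij}\) is the count of word \(j\) in document \(i\) and \(\alpha_i\) is a document fixed effect; these two symbols do not appear in the output above and are introduced here only for illustration.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i
```

Under this reading, theta is a document's position on the latent dimension, beta measures how strongly a word discriminates along that dimension, and psi captures a word's baseline frequency.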

We can then plot our estimates of the \(\theta\)s, i.e., the estimates of the latent newspaper position, as follows.

textplot_scale1d(dfm_newstweets_results)

Interestingly, we seem not to have captured ideology but some other tonal dimension. We see that the tabloid newspapers are scored similarly, and grouped toward the right-hand side of this latent dimension, whereas the broadsheet newspapers have an estimated theta further to the left.
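
Which end of the dimension counts as "left" or "right" is arbitrary: the model is only identified up to a sign flip. `textmodel_wordfish()` has a `dir` argument that takes the indexes of two documents and constrains the first to have a lower theta than the second. A minimal sketch on simulated counts, where every object name and simulated value is invented for illustration:

```r
library(quanteda)
library(quanteda.textmodels)

set.seed(42L)
n_docs <- 6
n_words <- 300

# Invented "true" parameters, used only to simulate counts
theta_true <- seq(-1.5, 1.5, length.out = n_docs)
beta_true <- rnorm(n_words)
psi_true <- rnorm(n_words, mean = 2)

# Expected counts exp(psi_j + beta_j * theta_i), with documents in rows
lambda <- exp(outer(rep(1, n_docs), psi_true) + outer(theta_true, beta_true))
counts <- matrix(rpois(length(lambda), lambda), nrow = n_docs,
                 dimnames = list(paste0("doc", seq_len(n_docs)),
                                 paste0("w", seq_len(n_words))))

# Drop any word that was never drawn, then convert to a dfm
counts <- counts[, colSums(counts) > 0, drop = FALSE]
toy_dfm <- as.dfm(counts)

# Same data, opposite identification constraints
fit_up <- textmodel_wordfish(toy_dfm, dir = c(1, n_docs))
fit_down <- textmodel_wordfish(toy_dfm, dir = c(n_docs, 1))

round(cbind(up = fit_up$theta, down = fit_down$theta), 2)
```

With the newspaper data, one could pass the row indexes of a broadsheet and a tabloid (for instance `dir = c(4, 7)` for guardian and TheSun, assuming the document order shown in the summary above) to fix which end the tabloids fall on.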

Plotting the “features,” i.e., the word-level betas, shows how words are positioned along this dimension, and which words help discriminate between news outlets.

textplot_scale1d(dfm_newstweets_results, margin = "features")
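
To make "discriminate" concrete: in the Wordfish model the log of a word's expected count includes the product of its beta and the document's theta, so the ratio of expected rates between two documents is exp(beta × difference in theta), setting aside differences in document length. The sketch below plugs in values copied from the summary output above; it is a back-of-the-envelope illustration, not part of the original analysis.

```r
# Feature scores (beta) copied from the summary output above
beta <- c("#emmerdale" = 2.704, "john" = -0.5457)

# Document positions (theta) copied from the summary output above
theta <- c(TheSun = 1.44379, guardian = -0.97854)

# Ratio of expected rates, TheSun relative to guardian, ignoring the
# document-length term: exp(beta * (theta_TheSun - theta_guardian))
rate_ratio <- exp(beta * (theta[["TheSun"]] - theta[["guardian"]]))
rate_ratio
```

A ratio above 1 means the model expects the word relatively more often in TheSun; a ratio below 1 means it expects it relatively more often in the guardian.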

And we can also look at these features.

# build the data frame directly so that betas stay numeric
feat_betas <- data.frame(
  features = dfm_newstweets_results[["features"]],
  betas = dfm_newstweets_results[["beta"]]
)

feat_betas %>%
  arrange(desc(betas)) %>%
  top_n(20) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = "striped")

## Selecting by betas

features	betas
🎥	10.328659
creature	6.177819
toddlers	5.944454
crawling	5.399534
cops	5.140734
wwe	4.897089
wriggling	4.798940
sings	4.798940
ronaldo	4.798940
treats	4.798940
underwater	4.600182
medicine	4.600182
smiling	4.600182
cub	4.439275
cristiano	4.439275
mom	4.439275
thrifty	4.439275
diver	4.291178
😭	4.102854
pic	4.100626
cupboard	4.100626

These words do seem to belong to more tabloid-style reportage: they include a film-camera emoji, sports reporting on “cristiano” and “ronaldo,” and more colloquial terms like “cops” and “mom.”
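
The same approach can list the words at the opposite end of the dimension. The sketch below uses a small hand-built data frame (`feat_betas_demo`, a hypothetical name) with six beta values copied from the summary output above, and `slice_min()` from dplyr to pick the most negative ones:

```r
library(dplyr)

# Six feature scores copied from the summary output above
feat_betas_demo <- data.frame(
  features = c("#emmerdale", "survivor", "beatle", "john", "spain", "many"),
  betas = c(2.704, 0.78131, 0.6622, -0.5457, -0.6025, -0.6732)
)

# The three most negative betas, i.e. the words leaning furthest
# toward the negative (broadsheet) end of the dimension
lowest <- feat_betas_demo %>%
  slice_min(betas, n = 3)

lowest
```

Applied to the full `feat_betas` object, this would give the counterpart of the table above for the negative end.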

Replicating Kaneko et al.

This section adapts code from the replication data provided for @kaneko_estimating_2021.

If you’re working locally, you can download the dfm data with:

kaneko_dfm  <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/wordscaling/study1_kaneko.rds?raw=true")))

This data is in the form of a document-feature matrix. We can first manipulate it in the same way as @kaneko_estimating_2021 by grouping at the level of newspaper and removing infrequent words.

table(docvars(kaneko_dfm, "Newspaper"))

## 
##       Asahi     Chugoku    Chunichi    Hokkaido      Kahoku    Mainichi 
##          38          24          47          46          18          26 
##      Nikkei Nishinippon      Sankei     Yomiuri 
##          13          27          14          30

## prepare the newspaper-level document-feature matrix
# compress the document-feature matrix at the newspaper level
kaneko_dfm_study1 <- dfm_group(kaneko_dfm, groups = Newspaper)
# remove words not used by two or more newspapers
kaneko_dfm_study1 <- dfm_trim(kaneko_dfm_study1, min_docfreq = 2, docfreq_type = "count")

## size of the document-feature matrix
dim(kaneko_dfm_study1)

## [1]   10 4660

Exercises

Estimate a wordfish model with the data from Kaneko et al. (2021)

## estimate the Wordfish model
# Step 1: Import and preprocess data

kaneko_dfm <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/wordscaling/study1_kaneko.rds?raw=true")))

# Group and compress the dfm at the newspaper level
kaneko_dfm_study1 <- dfm_group(kaneko_dfm, groups = Newspaper)

# Remove infrequent words not used by two or more newspapers to reduce noise
kaneko_dfm_study1 <- dfm_trim(kaneko_dfm_study1, min_docfreq = 2, docfreq_type = "count")



# Set seed for reproducibility
set.seed(123L)

# Estimate the Wordfish model using textmodel_wordfish function
kaneko_results <- textmodel_wordfish(kaneko_dfm_study1)


# View model summary, including estimated theta (doc positions) and beta (feature scores)
summary(kaneko_results)

## 
## Call:
## textmodel_wordfish.dfm(x = kaneko_dfm_study1)
## 
## Estimated Document Positions:
##               theta      se
## Asahi       -0.9892 0.01651
## Chugoku     -0.4585 0.02446
## Chunichi    -1.2230 0.01232
## Hokkaido    -0.4160 0.01917
## Kahoku      -0.5731 0.02753
## Mainichi    -0.1876 0.02546
## Nikkei       1.2578 0.04403
## Nishinippon -0.2616 0.02750
## Sankei       1.2472 0.03943
## Yomiuri      1.6040 0.02477
## 
## Estimated Feature Scores:
##        安保    法制    国会     一線   越え 安倍内閣  新た 安全保障政策
## beta 0.2157 0.01591 0.09476 -0.05989 -1.362  -1.6217 0.332    -0.006181
## psi  3.6911 3.36064 3.73441 -0.18786 -1.989   0.4758 2.192     0.789972
##      関連法案    閣議    決定   提出 安倍首相    先月     米    議会     演説
## beta  0.01405 -0.1579 -0.2177 0.5569   0.2524 -0.6656 0.5448 -0.9467 -0.22245
## psi   2.41503  0.7228  1.2414 2.5076   2.5758 -0.5345 2.8280  0.5425 -0.08278
##      安全保障    戦後   初めて 大改革      夏   成就    約束    通り   合意
## beta   0.2303 -0.3609 0.003823 -1.362 -0.6648 -2.335 -0.6655 -0.4473 0.4545
## psi    3.0372  2.0478 0.937039 -1.989  0.6898 -2.481  0.1587  0.4404 1.8744
##      歴史的  転換 集団的自衛権    行使
## beta -1.154 0.141       0.2183 0.05474
## psi  -2.105 1.609       3.7932 3.73714

Write a paragraph here explaining and interpreting your results

explanation: Theta is the position that the model assigns to each newspaper on a single latent dimension. Wordfish estimates it from differences in word frequencies alone, so reading it as a political stance is an interpretation rather than something the model guarantees. From the table, three rough groups emerge.

1. Positive group: Yomiuri, Nikkei and Sankei all score positively. Yomiuri (1.60) has the highest score, placing it furthest toward this end of the dimension.

2. Negative group: Chunichi and Asahi sit furthest in the negative direction. Chunichi (-1.22) has the lowest score, followed by Asahi (-0.99).

3. Middle ground: Mainichi (-0.19) and Nishinippon (-0.26) score closest to 0, with Kahoku, Chugoku and Hokkaido lying between them and the negative group. This suggests that their word use is less distinctive along this dimension; it does not by itself show that they take a neutral or moderate stance.
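
One way to check whether newspapers within a group are really distinguishable is to compare approximate 95% intervals built from the reported standard errors (theta ± 1.96 × se). The sketch below copies four rows from the summary output above; the helper `intervals_overlap()` is hypothetical, and the intervals are only as reliable as the standard errors the model reports.

```r
# Estimates copied from the summary output above
positions <- data.frame(
  newspaper = c("Asahi", "Chunichi", "Nikkei", "Sankei"),
  theta = c(-0.9892, -1.2230, 1.2578, 1.2472),
  se = c(0.01651, 0.01232, 0.04403, 0.03943)
)

# Approximate 95% intervals: theta +/- 1.96 * se
positions$lower <- positions$theta - 1.96 * positions$se
positions$upper <- positions$theta + 1.96 * positions$se

# Hypothetical helper: do the intervals of two newspapers overlap?
intervals_overlap <- function(a, b) {
  ra <- positions[positions$newspaper == a, ]
  rb <- positions[positions$newspaper == b, ]
  ra$lower <= rb$upper && rb$lower <= ra$upper
}

intervals_overlap("Nikkei", "Sankei")
intervals_overlap("Asahi", "Chunichi")
```

On these numbers the Nikkei and Sankei intervals overlap, whereas the Asahi and Chunichi intervals do not, so the ordering within the positive group is less certain than the ordering within the negative group.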

Visualize the results

# Plot the estimated positions (theta) of the Japanese newspapers along the latent dimension
library(ggplot2)
textplot_scale1d(kaneko_results, 
                 margin = "documents", 
                 groups = docvars(kaneko_dfm_study1, "Newspaper")) +
  labs(title = "Kaneko et al. (2021) Wordfish",
       subtitle = "Estimated newspaper positions (theta)")

# Plot the features (words) to see which ones discriminate between newspapers
textplot_scale1d(kaneko_results, 
                 margin = "features") +
  labs(title = "Kaneko et al. (2021)",
       subtitle = "Estimated feature scores (beta)")

Write here the interpretation of your plot(s)

interpretation: In the feature plot, the horizontal axis (beta) shows how strongly a word discriminates between newspapers along the latent dimension: the further a word lies toward either end, the more its use is concentrated in the newspapers at that end. If the dimension is read as ideological, positive values would mark terms favoured by conservative-leaning newspapers and negative values those favoured by liberal-leaning ones. The vertical axis (psi) reflects a word’s overall frequency; the higher a word lies on this axis, the more often it appears across all newspapers.

Take the word at the very top of the ‘pyramid’ as an example. Its very high psi value indicates a high-frequency term used by every newspaper (for instance, a word like ‘Japan’ or ‘report’), while its beta value of almost 0 means that it does little to distinguish one newspaper from another.

Don’t forget to knit!

CTA-ED Exercise 4: Scaling techniques (with correct answers)

Marion Lieutaud

6/03/2024
