Text as Data - Exercise 3

This document guides you through exercise 3. Please try to follow the instructions on your own PC and feel free to ask questions if something is unclear. After this exercise you should be able to run a keyword analysis on a document-feature matrix (dfm). In detail, you should be competent in the following operations:

Create a document feature matrix of selected features only
Group the data in a dfm
Convert the dfm to a data frame
Rename variables in a data frame
Sort data in a data frame
Plot data with ggplot
Implement OLS with heteroskedasticity-robust standard errors

Let’s work with news data on immigration from the UK. You can download the data here: link to site

In the code below we clear the environment, load the required packages (which must already be installed) and load the corpus of UK immigration news documents.

rm(list = ls())
library(quanteda)
library(readtext)
library(ggplot2)
load("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/quanteda_corpora/data_corpus_immigrationnews.rda")
corp <- corpus(data_corpus_immigrationnews)

Let’s first have a look at the data and understand what they contain. How many documents does the corpus consist of? What are the document level variables?

summary(corp)

The question we would like to ask in this exercise is the following: Does more recent immigration news refer more frequently to the EU and Europe?

As a first step, we therefore create a document-feature matrix (dfm) that only contains features starting with “euro”. We use so-called unigrams, i.e. single words (in contrast to bigrams or trigrams, which are sequences of adjacent words such as Vice President). Here is the code that extracts only the euro* unigrams from the corpus and saves them in a dfm:

# create dfm of euro* tokens only
dfm_euro <- dfm(corp,
           select="euro*",
           tolower = TRUE,               
           stem = TRUE,               
           ngrams = 1)
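
The call above passes select, stem and ngrams directly to dfm(). To our knowledge, recent quanteda releases (version 3 and later) no longer accept these arguments there and expect separate steps instead. The following is a minimal sketch of such a pipeline, not a definitive replacement. It uses a small made-up corpus (toy_corp and its texts are hypothetical, not the immigration news data), and the order of selecting before stemming is our assumption about how the original call behaves.

```r
library(quanteda)

# hypothetical toy corpus standing in for the immigration news
toy_corp <- corpus(c(d1 = "Europe and the European Union debate immigration",
                     d2 = "The euro fell while Europeans voted",
                     d3 = "No relevant words here"))

# tokens() produces unigrams by default
toy_toks <- tokens(toy_corp)

# build a lowercased dfm, keep only euro* features, then stem them
toy_dfm  <- dfm(toy_toks, tolower = TRUE)
toy_euro <- dfm_select(toy_dfm, pattern = "euro*")
toy_euro <- dfm_wordstem(toy_euro)

# inspect the remaining features
featnames(toy_euro)
```

Documents without any euro* token stay in the dfm with all-zero counts, so the number of documents does not change.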

As we would like to analyse the time trend of the euro features, let’s group the data by day using the dfm_group() command:

# group the dfm by day
dfm_euro_day <- dfm_group(dfm_euro, groups = "day" )
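
Depending on the installed quanteda version, passing the variable name as a quoted string may not be accepted. To our knowledge, recent releases expect the grouping values themselves. A minimal sketch on made-up data (toy_corp and its day values are hypothetical):

```r
library(quanteda)

# hypothetical toy corpus with a "day" document variable
toy_corp <- corpus(c("euro europe", "europe", "euro euro"),
                   docvars = data.frame(day = c(1, 1, 2)))
toy_dfm <- dfm(tokens(toy_corp))

# pass the vector of grouping values, one per document
toy_day <- dfm_group(toy_dfm, groups = docvars(toy_dfm, "day"))

# one row per day, with counts summed within each day
as.matrix(toy_day)
```

The two documents from day 1 are merged into a single row, and their feature counts are added up.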

To illustrate the time trend, we can make use of the ggplot2 package - arguably the most powerful graphing package ever designed! ggplot needs a data frame object as input, so we first convert our dfm to a data frame using the convert() command:

# convert dfm to a data.frame (for plotting with ggplot)
data_frame_dfm_euro_day <- convert(dfm_euro_day, to = "data.frame")
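
It can help to check what convert() returns before working with the result. A small sketch on a made-up dfm (toy_dfm is hypothetical); the name of the first column may differ between quanteda versions, so we print it rather than assume it:

```r
library(quanteda)

# hypothetical dfm with two documents named "1" and "2"
toy_dfm <- dfm(tokens(c(`1` = "euro europe", `2` = "europe")))

toy_df <- convert(toy_dfm, to = "data.frame")

# the first column identifies the document, the others hold feature counts
names(toy_df)
str(toy_df)
```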

We could look at a summary of the created data frame:

# look at summary of newly created data frame
summary(data_frame_dfm_euro_day)

So far, our data frame has one column for each euro* feature, i.e. one for each distinct stemmed token starting with “euro”. If we want an overall picture, we might want to look at the total number of euro* tokens per day. So, let’s sum the counts across all feature columns using the rowSums() command:

# create total euro tokens variable
data_frame_dfm_euro_day$total <- rowSums(data_frame_dfm_euro_day[,c(-1)])
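
To see what this line does, here is a small sketch on a made-up data frame shaped like the converted dfm (toy_df and its column names are hypothetical):

```r
# hypothetical data frame: first column identifies the day,
# the remaining columns hold feature counts
toy_df <- data.frame(doc_id = c("1", "2", "3"),
                     euro   = c(2, 0, 1),
                     europ  = c(1, 3, 0),
                     stringsAsFactors = FALSE)

# drop the first (non-numeric) column, then sum across the count columns
toy_df$total <- rowSums(toy_df[, -1])

toy_df
```

Note that running the rowSums() line a second time would include the existing total column in the sum, so it should be run only once per data frame.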

Then, we might want to rename the running variable for time to “day”:

# rename day variable
names(data_frame_dfm_euro_day)[1] <- "day"

Of course, we need to make sure that this running time variable (the x-axis on our graph) is numeric and sorted:

# convert day variable to numeric
data_frame_dfm_euro_day$day <- as.numeric(data_frame_dfm_euro_day$day)

# sort by day
data_frame_dfm_euro_day <- data_frame_dfm_euro_day[order(data_frame_dfm_euro_day$day),]
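
The conversion to numeric matters because text values are sorted character by character, not by magnitude. A short sketch with made-up day labels (day_chr and day_fac are hypothetical):

```r
# hypothetical day labels stored as text
day_chr <- c("9", "10", "100", "2")

# text sort: compares character by character, so "10" comes before "9"
sort(day_chr)

# numeric sort: 2, 9, 10, 100
sort(as.numeric(day_chr))

# if the column were a factor, convert via character first;
# as.numeric() on a factor returns the internal level codes instead
day_fac <- factor(day_chr)
as.numeric(as.character(day_fac))
```

If the day column in your data frame turns out to be a factor, the same as.numeric(as.character(...)) pattern applies.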

We are now able to graph the trend of Euro mentions in UK immigration news:

# graph
ggplot(data_frame_dfm_euro_day, aes(x=day, y=total)) + 
  geom_point() + geom_line() +
  ggtitle("Euro Tokens")

However, to test statistically whether EU mentions differ between earlier news (let’s say up to day 115) and later news, we need something more formal than eyeballing a graph. Therefore, let’s implement a simple test in regression format, as you are familiar with from econometrics. We would like to test the null hypothesis of no difference in EU mentions between the immigration news before and after day 115 (two-sided test). First, let’s load the packages that we require for regression analysis:

# load packages
library(lmtest)
library(sandwich)

Let’s look at our dependent variable and create our independent variable of interest:

# dependent variable: total EU mentions
summary(data_frame_dfm_euro_day$total)
hist(data_frame_dfm_euro_day$total)

# generate independent variable of interest: indicator variable for after day 115
data_frame_dfm_euro_day$post115 <-  ifelse(data_frame_dfm_euro_day$day > 115,1,0)

One could optionally illustrate the relationship with a simple plot:

# optional illustration:
plot(x= data_frame_dfm_euro_day$day, y= data_frame_dfm_euro_day$total)

Finally, we run a regression analysis, using robust standard errors as always:

# significant difference after day 115?
mod1 <- lm(total ~ post115 , data=data_frame_dfm_euro_day)
# OLS Estimation with robust standard errors
(res1 <- coeftest(mod1, vcov = vcovHC(mod1, "HC2")))
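
To get a feeling for what the robust standard errors change, here is a minimal sketch on simulated data (sim, its 200 days and the size of the shift are all made up for illustration). The error variance is deliberately larger after day 115, so the classical and the robust standard errors can differ.

```r
library(lmtest)
library(sandwich)

# simulated data (hypothetical): a level shift and larger variance after day 115
set.seed(1)
sim <- data.frame(day = 1:200)
sim$post115 <- ifelse(sim$day > 115, 1, 0)
sim$total <- 5 + 2 * sim$post115 + rnorm(200, sd = 1 + sim$post115)

mod_sim <- lm(total ~ post115, data = sim)

# classical standard errors
coeftest(mod_sim)

# heteroskedasticity-robust (HC2) standard errors
(res_sim <- coeftest(mod_sim, vcov = vcovHC(mod_sim, type = "HC2")))
```

With a single indicator as regressor, the coefficient on post115 is simply the difference in mean totals between the two periods; only the standard error, and hence the test statistic, depends on the choice of variance estimator.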

What is the interpretation? How could you in addition test for a time trend in EU mentions?

Congratulations, you made it through exercise 3! If this was far too easy for you and you have some time left, try splitting the UK immigration news by news outlet and creating a graph with two or more time trends of euro features for different outlets.