This document guides you through exercise 2. Please try to follow the instructions on your own PC and feel free to ask questions if something is unclear. After this exercise you should be able to:
Understand the concept of sparsity
Visualize text data with a wordcloud
Plot most frequent words in a Document Feature Matrix
Understand and reproduce Zipf’s Law
As we are going to visualise word frequencies, we need the package ggplot2:
# install.packages("ggplot2")
library(ggplot2)
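A version note (an assumption about your setup): this exercise was written for quanteda version 2. If you have quanteda version 3 or later installed, the textstat_* and textplot_* functions used below have moved to companion packages, which you would need to install and load as well:
# install.packages(c("quanteda.textstats", "quanteda.textplots"))
# library(quanteda.textstats)   # provides textstat_frequency()
# library(quanteda.textplots)   # provides textplot_wordcloud()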
Let’s work with the same data as in exercise 1:
rm(list = ls())
library(quanteda)
library(readtext)
dat <- readtext("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/manifesto_pdfs/*.pdf",
                docvarsfrom = "filenames",
                encoding = "UTF-8")
corp <- corpus(dat)
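To check that the import worked, we can inspect the corpus; summary() reports types, tokens, and sentences per document (a quick sanity check, not part of the original exercise):
summary(corp)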
However, let's not remove punctuation and stopwords when constructing the dfm this time:
dfm <- dfm(corp,
           tolower = TRUE,
           stem = TRUE,
           # remove_punct = TRUE,
           remove_numbers = TRUE,
           # remove = stopwords("German"),
           ngrams = 1)
dfm
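The printed dfm reports its sparsity, i.e. the proportion of cells in the document-feature matrix that are zero. To connect this to the learning goal above, we can also compute it directly; a minimal sketch using quanteda's sparsity() helper:
sparsity(dfm)
# equivalently, by hand: 1 - sum(dfm > 0) / (ndoc(dfm) * nfeat(dfm))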
To visualize the text data, we can draw a wordcloud in which a larger font indicates a higher frequency:
set.seed(100)
textplot_wordcloud(dfm)
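If the cloud is too crowded, textplot_wordcloud() accepts thresholds; for example (the cut-offs here are arbitrary choices, not part of the original exercise):
textplot_wordcloud(dfm, max_words = 100, min_count = 5)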
However, wordclouds are hard to read precisely; it is more informative to plot a fixed number of the most frequent words like this:
features_dfm <- textstat_frequency(dfm, n = 75)
features_dfm$feature <- with(features_dfm, reorder(feature, -frequency))
ggplot(features_dfm, aes(x = feature, y = frequency)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
What happens to the most frequent words when we do remove stopwords and punctuation during pre-processing?
dfm <- dfm(corp,
           tolower = TRUE,
           stem = TRUE,
           remove_punct = TRUE,
           remove_numbers = TRUE,
           remove = stopwords("German"),
           ngrams = 1)
dfm
set.seed(100)
textplot_wordcloud(dfm)
features_dfm <- textstat_frequency(dfm, n = 75)
features_dfm$feature <- with(features_dfm, reorder(feature, -frequency))
ggplot(features_dfm, aes(x = feature, y = frequency)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
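To compare the two pre-processing variants numerically rather than visually, quanteda's topfeatures() returns the most frequent features as a named vector (a quick sketch; the number 20 is an arbitrary choice):
topfeatures(dfm, 20)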
Alright, now we have a good feeling for what frequencies and sparsity mean in our text data. Finally, let's check whether Zipf's Law holds in our data. Remember that Zipf's Law states that a word's frequency is inversely proportional to its frequency rank: the second most frequent word occurs roughly half as often as the most frequent one, the third roughly a third as often, and so on. So, let's plot the frequency rank against the frequency:
plot(1:ncol(dfm), sort(colSums(dfm), decreasing = TRUE),
     main = "Zipf's Law?", ylab = "Frequency", xlab = "Frequency Rank")
Under Zipf's Law, frequency is proportional to 1/rank, so log(frequency) falls linearly in log(rank). Taking logs of both axes should therefore show a roughly straight line:
plot(1:ncol(dfm), sort(colSums(dfm), decreasing = TRUE),
     main = "Zipf's Law?", ylab = "Frequency", xlab = "Frequency Rank",
     log = "xy")
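If the log-log plot is roughly a straight line, its slope estimates the Zipf exponent. A minimal sketch using OLS (this regression is an illustration, not part of the original exercise):
word_freq <- sort(colSums(dfm), decreasing = TRUE)
word_rank <- seq_along(word_freq)
zipf_fit <- lm(log(word_freq) ~ log(word_rank))
coef(zipf_fit)  # a slope close to -1 is consistent with Zipf's Law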
Copyright (c) Felix Hagemeister, 2020