Text as Data - Exercise 1

This document guides you through exercise 1. Please try to follow the instructions on your own PC and feel free to ask questions if something is unclear. After this exercise you should be able to do the following:


  1. Clean the environment in R

  2. Install and load packages

  3. Import text as data using a path

  4. Convert text data to a quanteda corpus

  5. Create a document feature matrix (dfm)

  6. Choose pre-processing steps

  7. Have a first glance at the data


Let’s start by cleaning the environment:

rm(list = ls())

To be able to use the quanteda package, you first need to install it. Remember that you can run a line of code in R with Ctrl + Enter (Windows) or Command + Enter (Mac). As you need to install the package only once, I have commented out the following line (indicated by the #). To run it, you first need to remove the #:

# install.packages("quanteda")

Now we can load the package. Remember that this is required every time you start a new R session and would like to work with quanteda:

library(quanteda)

To read text as data into R, we need to install and load another package called readtext. Again, we install this package only once but load it every time we start an R session:

# install.packages("readtext")
library(readtext)

Import Text

For this exercise, we will use the manifestos of the major German parties for the 2013 and 2017 federal elections. You can download the pdf files here: link to manifesto_pdfs.zip

1. Set Path

We need to tell R where to find the downloaded pdfs. So, create a new folder and place the (unzipped) manifesto pdfs in there. For example, I have saved the pdf files in the folder

C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/manifesto_pdfs/
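
To check that R can actually find the files, you can list the contents of that folder as a quick sanity check (replace the path with your own):

list.files("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/manifesto_pdfs/",
           pattern = "\\.pdf$") # should list the downloaded manifesto pdf files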

2. Read in Text

Now we use the following command to read in the text data (make sure to replace the path with your own and to use only forward slashes, not backslashes):

dat <- readtext("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/manifesto_pdfs/*.pdf", 
                docvarsfrom = "filenames",   # create document variables from the file names
                encoding = "UTF-8")          # specify the character encoding

Note that you can also read in text from various other formats (e.g. .csv, .tab, .json, .xml, .html, .pdf, .doc, .docx, .rtf, .xls, .xlsx).
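
For example, if your texts were stored in a (hypothetical) csv file with one document per row and the text in a column called "text", the call would look very similar; the text_field argument tells readtext which column contains the documents:

# hypothetical example: reading texts from a csv file instead of pdfs
# dat_csv <- readtext("C:/path/to/my_texts.csv", text_field = "text")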

3. Convert to Quanteda Corpus

To be able to run analyses on the imported text, we need to convert it into what is called a corpus. Let's also summarise the corpus with the summary() command:

corp <- corpus(dat)
summary(corp)
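
Since we told readtext to build document variables from the file names, each document in the corpus now carries those variables. You can inspect them with docvars() (the exact columns depend on how your pdf files are named):

docvars(corp)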

Generate a Document Feature Matrix (DFM)

To get a feel for the text data, let's convert the text into a document-feature matrix, a dfm. Here we use as input the corpus we defined above. As pre-processing steps we convert all characters to lower case, stem words, and remove punctuation, numbers, and German stopwords. We use unigrams as features, i.e. units consisting of a single word rather than of two (bigrams) or three (trigrams) words.

dfm <- dfm(corp,
           tolower = TRUE,                # convert to lower case
           stem = TRUE,                   # stem words
           remove_punct = TRUE,           # remove punctuation
           remove_numbers = TRUE,         # remove numbers
           remove = stopwords("german"),  # remove German stopwords
           ngrams = 1)                    # use unigrams
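
Note: if you are working with a newer release of quanteda (version 3.0 or later), these pre-processing arguments have been moved out of dfm() and into the tokens functions. A sketch of the equivalent pipeline under that assumption:

toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE) # tokenise, drop punctuation and numbers
toks <- tokens_tolower(toks)                                     # convert to lower case
toks <- tokens_remove(toks, stopwords("german"))                 # remove German stopwords
toks <- tokens_wordstem(toks, language = "german")               # stem words
dfm <- dfm(toks)                                                 # build the document-feature matrix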

First Glance at Data

Once we have generated our dfm, let's have a first look at its features. The following two commands tell us how many documents and how many features we have.

ndoc(dfm) # number of documents
nfeat(dfm) # number of features

Alternatively, we can just type the name of the dfm:

dfm

We can also ask R to tell us the first five document names, or the first 20 features in the dfm:

head(docnames(dfm), 5)
head(featnames(dfm), 20)

However, we might be more interested in how many tokens each document contains (the row sums), how often each feature occurs in total (the column sums), and which features are mentioned most frequently:

head(rowSums(dfm), 10) # total feature counts per document
head(colSums(dfm), 10) # total counts of each feature
topfeatures(dfm, 10) # most frequent features overall
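
If you would rather see the most frequent features per manifesto than across the whole corpus, topfeatures() can also report them by group, for example by document name:

topfeatures(dfm, 5, groups = docnames(dfm)) # top 5 features per document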

Congratulations, this is the end of the first exercise.

Copyright (c) Felix Hagemeister, 2020