Text as Data - Exercise 4

This document guides you through Exercise 4. Please follow the instructions on your own PC and feel free to ask questions if anything is unclear. After this exercise you should be able to compute TF-IDF weights for a document-feature matrix (dfm) and answer the following questions:

What is term frequency, what is inverse document frequency?
What are TF-IDF weights? Why do we need them?
How do we get TF-IDF weights in a dfm?

Let’s work with review data from Amazon. You can download the data here: link to site

As always, we first clear the environment and load the required packages (install them first if necessary). The Amazon reviews come as a tab-separated .tsv file, which we read with fread() from the data.table package and convert to a data frame.

rm(list = ls())
# install.packages(c("data.table", "quanteda"))
library(data.table)
library(quanteda)
# Load Amazon review data set (adjust the path to where you saved the file)
review_data <- as.data.frame(fread("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/amazon/sample_us.tsv"))

It’s always good to have a look at the data first:

summary(review_data)

Quanteda needs to know which column holds the text and which columns are document variables. By default, corpus() looks for a column named text, so we rename the column that contains the review text (column 14 in this data set) to text.

names(review_data)[14] <- "text"
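
Renaming by position breaks silently if the column order ever changes. A minimal sketch of renaming by name instead, using a made-up miniature data frame; the column name review_body is an assumption about how the review text column is labelled, so check names(review_data) first:

```r
# Hypothetical miniature of the review table (column names are assumptions)
toy_reviews <- data.frame(
  star_rating = c(5, 1),
  review_body = c("Great product", "Stopped working after a week"),
  stringsAsFactors = FALSE
)

# Rename the column by its name rather than by its position
names(toy_reviews)[names(toy_reviews) == "review_body"] <- "text"

names(toy_reviews)
```

Alternatively, corpus() accepts a text_field argument, so the column can be left under its original name and passed there instead.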

Let’s look at the first entries of “text”:

head(review_data$text)

Before constructing the document-feature matrix (dfm), we need to make sure the text column is stored as a character vector:

review_data$text <- as.character(review_data$text)

Now we can convert the data frame to a quanteda corpus and create a dfm.

# convert to quanteda corpus
corp_reviews <- corpus(review_data)

# create dfm
dfm_amazon <- dfm(corp_reviews,
          remove_punct = TRUE,
          remove_numbers = TRUE,
          remove_symbols = TRUE,
          tolower = TRUE,   
          remove = stopwords("english"),
          stem = TRUE,   
          ngrams = 1)
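
The call above passes all preprocessing options directly to dfm(), which is the style of older quanteda releases. In newer releases (3.0 and later, as far as I am aware) these steps are handled at the tokens stage instead. A minimal sketch of that workflow on a two-document toy corpus; the texts and the object names corp_toy, toks, and dfm_toy are made up for illustration:

```r
library(quanteda)

# Hypothetical stand-in for corp_reviews
corp_toy <- corpus(c(r1 = "The battery lasted 3 days!",
                     r2 = "Batteries were dead on arrival."))

# Tokenize and drop punctuation, numbers, and symbols
toks <- tokens(corp_toy,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# Lowercase, remove English stopwords, and stem
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)

# Build the dfm from the cleaned tokens (unigrams are the default)
dfm_toy <- dfm(toks)
featnames(dfm_toy)
```

Because of stemming, "battery" and "batteries" end up as the same feature, which is exactly the effect stem = TRUE has in the call above.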

We can draw a word cloud to get a feel for the most common words in the dfm.

?textplot_wordcloud
textplot_wordcloud(dfm_amazon,  min_size = 0.5, max_size = 4, min_count = 2,
                   max_words = 500, color = "darkblue", adjust = TRUE)

We can list the most frequent features across the whole dfm (“topfeatures”) or within a single document, for example the second one:

topfeatures(dfm_amazon)
topfeatures(dfm_amazon[2,])

Finally, let’s convert our dfm into TF-IDF scores:

# convert to tf-idf (stored under a new name so the function dfm_tfidf() is not masked)
dfm_amazon_tfidf <- dfm_tfidf(dfm_amazon)
# check TF-IDF scores of topfeatures in second document
topfeatures(dfm_amazon_tfidf[2,])
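
To see what this weighting does, it can help to compute TF-IDF by hand on a tiny example. The sketch below uses base R only and a made-up count matrix. It assumes the scheme that, to my understanding, dfm_tfidf() uses by default: the raw count as term frequency, multiplied by log10(N / df), where N is the number of documents and df is the number of documents containing the feature.

```r
# Toy document-feature matrix: 3 documents, 3 features (raw counts, made up)
counts <- matrix(c(2, 1, 0,
                   1, 1, 0,
                   1, 0, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("good", "product", "battery")))

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)        # number of documents containing each feature
idf      <- log10(n_docs / doc_freq)   # inverse document frequency
tfidf    <- sweep(counts, 2, idf, `*`) # multiply each column by its idf

round(tfidf, 3)
```

The feature "good" appears in every document, so its idf is log10(1) = 0 and it receives zero weight everywhere, while "battery", which appears in only one document, is weighted up. This is why TF-IDF is useful: it downweights words that are frequent but do not distinguish documents from one another.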

One could impose further restrictions here:

1. Remove words that occur in more (or fewer) than X% of the documents.
2. Remove words that are mentioned more (or fewer) than X times in total.
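
Both kinds of restriction can be sketched with quanteda's dfm_trim(). The example below uses a made-up four-document toy corpus, and the thresholds (at least 2 occurrences in total, at most 80% of documents) are arbitrary illustrations rather than recommended values:

```r
library(quanteda)

# Hypothetical toy texts standing in for the reviews
toy <- c(d1 = "great battery great price",
         d2 = "great screen",
         d3 = "great battery",
         d4 = "great camera poor battery")
dfm_toy <- dfm(tokens(toy))

# Keep features that occur at least 2 times in total
# and in at most 80% of the documents
dfm_trimmed <- dfm_trim(dfm_toy,
                        min_termfreq = 2,
                        max_docfreq = 0.8,
                        docfreq_type = "prop")

featnames(dfm_trimmed)
```

Trimming is usually applied to the count dfm before weighting, since the thresholds refer to raw counts and document frequencies.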

Congratulations, you made it through exercise 4!