Text as Data - Exercise 6

This document guides you through exercise 6. Please try to follow the instructions on your own PC and feel free to ask questions if something is unclear. After this exercise, you should be able to perform a dictionary analysis. More specifically, make sure you can answer the following questions:

What is a dictionary method?
What is tokenization?
How can a dictionary method be implemented in R?
How can one examine the result of a dictionary method?

The question we would like to answer in this exercise is whether we can gauge how negative or positive an Amazon review is from its text alone. This could provide us with a way to validate the star-based rating.
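To make the idea concrete before turning to quanteda, here is a minimal base-R sketch of what a dictionary method does: split a text into tokens, then count how many tokens appear in each word list. The word lists and the review text are made up for illustration and are not part of the exercise data.

```r
# Toy illustration of a dictionary method in base R (no packages needed).
# The word lists and the review text are invented for this sketch.
toy_dict <- list(
  positive = c("good", "great", "love"),
  negative = c("bad", "broken", "poor")
)

review <- "Great sound and I love the design, but the cable arrived broken."

# Tokenization: lower-case the text and split it on anything that is not a letter
toy_tokens <- strsplit(tolower(review), "[^a-z]+")[[1]]
toy_tokens <- toy_tokens[toy_tokens != ""]

# Dictionary lookup: count how many tokens fall into each category
toy_counts <- sapply(toy_dict, function(words) sum(toy_tokens %in% words))
toy_counts

# A simple net score: positive matches minus negative matches
net_score <- toy_counts[["positive"]] - toy_counts[["negative"]]
net_score
```

The review contains two positive matches and one negative match, so the net score is 1. The quanteda functions used below follow the same logic, but handle tokenization, wildcards, and many documents at once.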

Let’s again use the example data from Amazon reviews: https://s3.amazonaws.com/amazon-reviews-pds/readme.html Information about the columns is available here: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt

The following code clears the environment, loads packages, and reads in the data:

# Clear the environment
rm(list = ls())
# Install the packages once if they are missing
#install.packages(c("data.table", "quanteda"))
library(data.table)
library(quanteda)
# This is a very small sample of the review data; adjust the path to your local copy
review_data <- as.data.frame(fread("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/amazon/sample_us.tsv"))

Since we want to apply a dictionary method, let’s use quanteda’s built-in sentiment dictionary, the 2015 Lexicoder Sentiment Dictionary, which lists negative and positive terms:

# Quanteda's built-in dictionary
?data_dictionary_LSD2015
dict <- data_dictionary_LSD2015
summary(dict)

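What do the negative entries in this dictionary look like? Note that the dictionary is unweighted, so head() simply shows the first few entries rather than the "most negative" words: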

head(dict$negative)
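Many entries in the dictionary end in `*`, which is a wildcard (glob) pattern that matches any continuation of the stem. The following sketch shows how such patterns behave, using a small hypothetical dictionary and an invented sentence rather than the real data:

```r
library(quanteda)

# Hypothetical toy dictionary; "abandon*" and "excellen*" are glob patterns
toy_dict <- dictionary(list(
  negative = c("abandon*", "awful"),
  positive = c("great", "excellen*")
))

toy_toks <- tokens("They abandoned the project, which is awful, but the docs are excellent.")

# valuetype = "glob" is the default of tokens_lookup(); it is spelled out here for clarity
toy_hits <- tokens_lookup(toy_toks, dictionary = toy_dict, valuetype = "glob")
as.list(toy_hits)
```

Here "abandoned" is matched by `abandon*` and "excellent" by `excellen*`, so the sentence yields two negative hits and one positive hit.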

Alright, let’s convert the review data into a quanteda corpus and tokenize it:

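# Note: quanteda needs to know which column holds the text; the remaining columns
# become document variables. Therefore we rename the review_body column to "text".
# Renaming by name is more robust than relying on the column position (14).
names(review_data)[names(review_data) == "review_body"] <- "text"
corp <- corpus(review_data)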

# Tokenize the corpus
?tokens
toks <- tokens(corp)
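By default, tokens() keeps punctuation and numbers as separate tokens. The sketch below, on an invented sentence, compares the default with a cleaned version. Whether cleaning is needed here is a judgment call: punctuation does not match any dictionary entry, so it mainly matters if you later normalize sentiment counts by the number of tokens.

```r
library(quanteda)

toy_text <- "Works great! Battery lasts 2 days."

# Default tokenization: punctuation and numbers are kept as their own tokens
toks_default <- tokens(toy_text)

# Optional cleaning: drop punctuation, symbols and numbers, then lower-case
toks_clean <- tokens(toy_text,
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_numbers = TRUE)
toks_clean <- tokens_tolower(toks_clean)

as.list(toks_default)
as.list(toks_clean)

# Number of tokens per document before and after cleaning
ntoken(toks_default)
ntoken(toks_clean)
```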

Finally, let’s apply the dictionary method:

?tokens_lookup
toks_sentiment <- tokens_lookup(toks, dictionary = dict, levels = 1)
print(toks_sentiment)
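To examine which words triggered a match, it can help to keep the unmatched tokens in place and only replace the matched ones with their category. A minimal sketch, assuming a small hypothetical dictionary and an invented sentence:

```r
library(quanteda)

# Hypothetical toy dictionary and sentence, for illustration only
toy_dict <- dictionary(list(
  positive = c("great", "love"),
  negative = c("broken", "poor")
))
toy_toks <- tokens("Great sound but the cable arrived broken")

# exclusive = FALSE keeps unmatched tokens; capkeys = TRUE writes the
# category names in upper case so they stand out from ordinary words
toy_context <- tokens_lookup(toy_toks, dictionary = toy_dict,
                             exclusive = FALSE, capkeys = TRUE)
as.list(toy_context)
```

Reading the matches in context like this is a quick way to spot false positives, for example a "positive" word that is used sarcastically.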

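For a better overview, we can look at the sentiment tokens in a document-feature matrix (dfm). Each row is a review and each column is a dictionary category, so comparing the positive and negative counts within a row indicates the overall sentiment of that review: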

dfm(toks_sentiment)
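One possible way to turn these counts into a single score per review is to subtract the negative count from the positive count and compare the result with the star rating. The sketch below uses a made-up three-review data set and a hypothetical toy dictionary in place of review_data and the built-in dictionary; with the built-in dictionary you would also have to decide how to treat its negated categories.

```r
library(quanteda)

# Made-up mini data set standing in for review_data (assumption for this sketch)
toy_reviews <- data.frame(
  text = c("Great product, I love it",
           "Poor quality and it arrived broken",
           "Great price but poor packaging"),
  star_rating = c(5, 1, 3),
  stringsAsFactors = FALSE
)
toy_dict <- dictionary(list(
  positive = c("great", "love"),
  negative = c("poor", "broken")
))

# Corpus -> tokens -> dictionary lookup -> document-feature matrix
toy_dfm <- dfm(tokens_lookup(tokens(corpus(toy_reviews)), dictionary = toy_dict))

# One row per review, one column per category
toy_scores <- convert(toy_dfm, to = "data.frame")

# Net sentiment: positive minus negative matches
toy_scores$net <- toy_scores$positive - toy_scores$negative

# Attach the star rating, which was carried along as a document variable
toy_scores$star_rating <- docvars(toy_dfm, "star_rating")
toy_scores

# Correlation between net sentiment and stars as a rough validity check
cor(toy_scores$net, toy_scores$star_rating)
```

On real reviews the relationship will be much noisier than in this constructed example, so a scatter plot or a table of mean net sentiment per star rating is often more informative than a single correlation.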


Congratulations, this is the end of the sixth exercise.