library(DT)
library(tidytext)
library(dplyr)
library(stringr)
library(sentimentr)
library(ggplot2)
library(RColorBrewer)
library(readr)
library(SnowballC)
library(tm)
library(wordcloud)
library(reticulate)

Import Data

This project uses Amazon product reviews from May 1996 to July 2014 to determine whether review text is positive or negative. We look specifically at video game reviews. Because the video game review file is not strict JSON, we use Python to load it into a pandas DataFrame and then export that DataFrame to CSV.

import pandas as pd
import gzip

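def parse(path):
  # The lines are not strict JSON, so each one is evaluated as a Python literal.
  # eval() executes arbitrary code; only use it on a file you trust.
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield eval(l)

def getDF(path):
  # Collect one record per row, keyed by row number
  records = {}
  for i, d in enumerate(parse(path)):
    records[i] = d
  return pd.DataFrame.from_dict(records, orient='index')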

df = getDF('reviews_Video_Games_5.json.gz')

df.to_csv(r'reviews.csv')
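
As a side note on the "not strict JSON" point, here is a minimal sketch of a parser that avoids `eval`. It assumes each line has already been decoded to a string (for example by opening the file in text mode) and that non-strict lines are plain Python literals. The `parse_line` helper and both sample lines are hypothetical and are not taken from the dataset.

```python
import ast
import json

def parse_line(line):
    """Parse one record that is either strict JSON or a Python-style dict literal."""
    try:
        # Strict JSON: double-quoted keys and strings
        return json.loads(line)
    except json.JSONDecodeError:
        # Python literal: single quotes are allowed, and no code is executed
        return ast.literal_eval(line)

# Hypothetical sample lines for illustration only
loose = "{'reviewerID': 'A1', 'overall': 5.0, 'helpful': [1, 2]}"
strict = '{"reviewerID": "A2", "overall": 3.0, "helpful": [0, 0]}'

records = [parse_line(loose), parse_line(strict)]
print(records[0]["overall"], records[1]["reviewerID"])
```

Unlike `eval`, `ast.literal_eval` only accepts literals (strings, numbers, lists, dicts, and so on), so a malformed or malicious line raises an error instead of running code.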

Next, we read that CSV file into an R data frame.

reviews <- readr::read_csv(file = 'reviews.csv')

Preview Data

Now that we have the dataset imported, we can take a peek at the data. The column that contains the review is titled ‘reviewText’ and the column that holds the star rating associated with each review is ‘overall’.

summary(reviews)
##        X1          reviewerID            asin           reviewerName      
##  Min.   :     0   Length:231780      Length:231780      Length:231780     
##  1st Qu.: 57945   Class :character   Class :character   Class :character  
##  Median :115890   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :115890                                                           
##  3rd Qu.:173834                                                           
##  Max.   :231779                                                           
##    helpful           reviewText           overall        summary         
##  Length:231780      Length:231780      Min.   :1.000   Length:231780     
##  Class :character   Class :character   1st Qu.:4.000   Class :character  
##  Mode  :character   Mode  :character   Median :5.000   Mode  :character  
##                                        Mean   :4.086                     
##                                        3rd Qu.:5.000                     
##                                        Max.   :5.000                     
##  unixReviewTime       reviewTime       
##  Min.   :9.399e+08   Length:231780     
##  1st Qu.:1.213e+09   Class :character  
##  Median :1.318e+09   Mode  :character  
##  Mean   :1.277e+09                     
##  3rd Qu.:1.368e+09                     
##  Max.   :1.406e+09

Word Summary

To begin analyzing the sentiment of each review, we look at the sentiment of each individual word. More specifically, we split the review text into one row per word, then remove stop words and any token that contains characters other than lowercase letters and apostrophes.

words <- reviews %>%
  select(reviewerID, asin, overall, reviewText) %>%
  unnest_tokens(word, reviewText) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

datatable(head(words))
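
To make the filtering step concrete, here is a rough Python sketch of the same idea: one row per word, with stop words and non-alphabetic tokens dropped. It is not how `unnest_tokens` works internally; the whitespace split, the punctuation trimming, the tiny `STOP_WORDS` set, and the sample review are all simplifying assumptions. Only the regular expression is taken from the R code above.

```python
import re

# Tiny hypothetical stop word list; tidytext's stop_words table is much larger
STOP_WORDS = {"the", "is", "a", "and", "it", "i", "this", "but"}

# Same pattern as the str_detect() filter: lowercase letters and apostrophes only
WORD_PATTERN = re.compile(r"^[a-z']+$")

def tokenize(text):
    """Lowercase, split on whitespace, trim edge punctuation, then filter tokens."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,!?;:\"()[]")
        if token and token not in STOP_WORDS and WORD_PATTERN.match(token):
            tokens.append(token)
    return tokens

# Hypothetical review row: (reviewerID, overall, reviewText)
row = ("A1", 4, "This game is great, but the 2nd level isn't fun!")

# One output row per surviving word, carrying the review's ID and rating along
words = [(row[0], row[1], w) for w in tokenize(row[2])]
print(words)
```

The token `2nd` is dropped because it contains a digit, which is the effect of the `^[a-z']+$` pattern in the R filter.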

Sentiment Analysis with Afinn

To predict the sentiment of words in this dataset, we use the Afinn list of English words and associated ratings. Each word is scored from -5 to 5, where 5 is the most positive and -5 is the most negative. By joining the Afinn sentiment scores to our words dataframe, we can compare two ways of rating a word: the Amazon star rating of the review it appears in and its Afinn sentiment score.

afinn <- get_sentiments("afinn") %>% mutate(word = wordStem(word))
reviews.afinn <- words %>%
  inner_join(afinn, by = "word")
head(reviews.afinn)
## # A tibble: 6 x 5
##   reviewerID     asin       overall word  score
##   <chr>          <chr>        <dbl> <chr> <int>
## 1 A2HD75EMZR8QLN 0700099867       1 live      2
## 2 A2HD75EMZR8QLN 0700099867       1 dirt     -2
## 3 A3UR8NLLY1ZHCX 0700099867       4 huge      1
## 4 A3UR8NLLY1ZHCX 0700099867       4 fan       3
## 5 A1INA0F5CWW3J4 0700099867       1 fake     -3
## 6 A1INA0F5CWW3J4 0700099867       1 fake     -3
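
The join itself can be pictured as a dictionary lookup. The sketch below is an illustration in plain Python rather than the tidytext implementation: `LEXICON` holds only the four word scores visible in the output above, and the `word_rows` values are hypothetical.

```python
# Mini-lexicon using the four scores shown in the output above
LEXICON = {"fan": 3, "fake": -3, "huge": 1, "dirt": -2}

# Hypothetical (reviewerID, overall, word) rows
word_rows = [
    ("A1", 1, "fake"),
    ("A1", 1, "game"),   # not in the lexicon, so the inner join drops it
    ("A2", 4, "huge"),
    ("A2", 4, "fan"),
]

# Inner join: keep only rows whose word has a lexicon entry, and attach the score
scored = [
    (reviewer_id, overall, word, LEXICON[word])
    for reviewer_id, overall, word in word_rows
    if word in LEXICON
]
print(scored)
```

Because this is an inner join, any word without a sentiment score disappears from the result, so the joined table covers only part of each review's vocabulary.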

Most Common Words

Here, we see the most common words and the average ratings and sentiment scores associated with each word.

word_summary <- reviews.afinn %>%
  group_by(word) %>%
  summarise(mean_rating = mean(overall), score = max(score), count_word = n()) %>%
  arrange(desc(count_word))
datatable(head(word_summary))
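
For readers who want to see the aggregation spelled out, here is a small Python sketch of the same group-and-summarise logic: mean rating, maximum score, and count per word, sorted by count. The `scored_rows` values are hypothetical, and the standard library stands in for dplyr.

```python
from collections import defaultdict

# Hypothetical (overall, word, score) rows after the lexicon join
scored_rows = [
    (1, "fake", -3),
    (1, "fake", -3),
    (4, "fan", 3),
    (5, "fan", 3),
    (4, "huge", 1),
]

# Group rows by word
groups = defaultdict(list)
for overall, word, score in scored_rows:
    groups[word].append((overall, score))

# One summary record per word, mirroring the summarise() columns
summary = []
for word, items in groups.items():
    ratings = [overall for overall, _ in items]
    summary.append({
        "word": word,
        "mean_rating": sum(ratings) / len(ratings),
        "score": max(score for _, score in items),
        "count_word": len(items),
    })

# Most frequent words first, like arrange(desc(count_word))
summary.sort(key=lambda record: record["count_word"], reverse=True)
for record in summary:
    print(record)
```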

Most Common Words View

We can visualize each word by its average Amazon review rating and its sentiment score. Most video game ratings fall between 3.5 and 4.5 in this Amazon dataset, so we set this range as the filter. The plot below shows that many of these words fall into two clusters: one with a positive sentiment score and one with a negative sentiment score. The number of words with positive Amazon ratings but negative sentiment scores is concerning, so we will look into the effect this has on product-level sentiment later on.

ggplot(filter(word_summary, count_word < 50000), aes(mean_rating, score)) + geom_text(aes(label = word, color = count_word, size=count_word), position= position_jitter()) + scale_color_gradient