MATH2349 Semester 2, 2019

Required packages

library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(Hmisc)
library(outliers)

Executive Summary

First of all, we have found and loaded two data files containing information about phones available via Amazon e-shop and their customer reviews. We have joined the two data files into one data set containing all the information we need.

Next, we looked at the data types and data structure and applied appropriate type conversions and transformations. We created an unordered factor variable, a proper date format, made sure that the numeric columns were integer where needed and fixed the prices.

Next, we created three variables based on the existing data. We scanned the data for missing values and replaced all of them with suitable values.

Finally, we checked fthe data for outliers and applied suitable transformations to mitigate their effect where needed.

Data

The two data sets used in this assignment comes from Kaggle. The data set contains information about Amazon cell phones reviews.

There are two files - items and reviews. The items data file contains information about the phones. It has \(9\) variables and \(792\) observations. Each observation contains:

ASIN - Amazon Standard Identification Number
Brand - The phone’s brand
Title - A product title
URL - A URL to the Amazon website
Image - A URL of the product image
Rating - Product average rating
Review URL - A URL of product review page
Total Reviews - A number of product reviews
Prices - Product prices

The reviews data file contains information about the reviews. It has \(8\) variables and \(82,815\) observations. Each observation contains:

ASIN - Amazon Standard Identification Number
Name - A name of the reviewer
Rating - 1-5-scale rating
Date - a date of the review
Verified - Binary variable whether or not the reviewer is valid
Title - A title of the review
Body - A body of the review
Helpful Votes - A number of thumbs-ups

We used readr library to read the two data files. We used head() and dim() functions to check the data and their dimensions.

Finally we joined the data sets using the left_join() function from the dyplyr package. The data were joined by the asin column. We used left join to extend the reviews data set with the information of the reviewed items. The resulting data set consists of \(16\) variables (\(8\) variables from reviews data set, \(9\) from the items data set, however asin only appears once) and \(82,815\) observations. The number of observations in the resulting data set shows us that there is no review for non-existing products.

items <- read_csv("20190928-items.csv")

## Parsed with column specification:
## cols(
##   asin = col_character(),
##   brand = col_character(),
##   title = col_character(),
##   url = col_character(),
##   image = col_character(),
##   rating = col_double(),
##   reviewUrl = col_character(),
##   totalReviews = col_double(),
##   prices = col_character()
## )

head(items)

dim(items)

## [1] 792   9

reviews <- read_csv("20190928-reviews.csv")

## Parsed with column specification:
## cols(
##   asin = col_character(),
##   name = col_character(),
##   rating = col_double(),
##   date = col_character(),
##   verified = col_logical(),
##   title = col_character(),
##   body = col_character(),
##   helpfulVotes = col_double()
## )

head(reviews)

dim(reviews)

## [1] 82815     8

joined <- reviews %>% left_join(items, by = "asin")
head(joined)

dim(joined)

## [1] 82815    16

Understand

From the head() function used in the previous section, we noticed that some columns have ambiguous names. Because there were some columns with the same names in both data sets, the join operation renamed them. We would like to use different names to differentiate between the variables easier:

rating.x variable contains values from the reviews dataset. We would like to rename it to userRating
rating.y variable contains values from the items dataset. We would like to rename it to averageRating
title.x variable contains values from the reviews dataset. We would like to rename it to reviewTitle
title.y variable contains values from the items dataset. We would like to rename it to description

We used the names() function to verify that the renaming operation was successful.

joined <- joined %>% rename(
  userRating = rating.x,
  averageRating = rating.y,
  reviewTitle = title.x,
  description = title.y
)

names(joined)

##  [1] "asin"          "name"          "userRating"    "date"         
##  [5] "verified"      "reviewTitle"   "body"          "helpfulVotes" 
##  [9] "brand"         "description"   "url"           "image"        
## [13] "averageRating" "reviewUrl"     "totalReviews"  "prices"

We checked the data structure using the str() function.

str(joined, vec.len = 1)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 82815 obs. of  16 variables:
##  $ asin         : chr  "B0000SX2UC" ...
##  $ name         : chr  "Janet" ...
##  $ userRating   : num  3 1 ...
##  $ date         : chr  "October 11, 2005" ...
##  $ verified     : logi  FALSE ...
##  $ reviewTitle  : chr  "Def not best, but not worst" ...
##  $ body         : chr  "I had the Samsung A600 for awhile which is absolute doo doo. You can read my review on it and detect my rage at"| __truncated__ ...
##  $ helpfulVotes : num  1 17 ...
##  $ brand        : chr  "Nokia" ...
##  $ description  : chr  "Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice Activated Dialing & Bright White Backlit Screen" ...
##  $ url          : chr  "https://www.amazon.com/Dual-Band-Tri-Mode-Activated-Dialing-Backlit/dp/B0000SX2UC" ...
##  $ image        : chr  "https://m.media-amazon.com/images/I/2143EBQ210L._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" ...
##  $ averageRating: num  3 3 ...
##  $ reviewUrl    : chr  "https://www.amazon.com/product-reviews/B0000SX2UC" ...
##  $ totalReviews : num  14 14 ...
##  $ prices       : chr  NA ...

There are several type conversions that need to be done:

userRating is of type double and contains integer values from 1 to 5. One option would be to convert this variable to a factor variable. However, we have decided to convert it to integer to easily perform statistical operations.
date variable is of type character. We would like to transform it to a date type.
helpfulVotes is of type double and we would like to transform it to an integer as with userRating.
brand is of type character, but we would like to convert it to a factor. We have \(10\) different values, and the variable is unordered. We add labels to the levels, which is the first two letters of the name of the brand.
totalReviews is of type double. We would like it to be integer.
prices is of type character but we would like it to be numeric. Some of the prices in prices have more than one value, so we check the maximum amount of dollar signs to figure out how many prices are given for each value. The result is two. So we create two columns from prices, one called price1 and one called price2, and if there exist a second price in prices it is now stored in price2. Next we need to transform it to numeric, and to do so, we need to remove the dollar signs. This is done by using gsub().

#find the maximum number of prices in each observation
max(str_count(joined$prices, "[$]"), na.rm = T)

## [1] 2

#Separates the prices columns in two
joined <- joined %>% separate(prices, into = c("price1", "price2"), sep=",[$]")

#Transform the variables so they have the right format
joined <- joined %>% transform(
  userRating = as.integer(userRating),
  date = as.Date(date,format="%B %d, %Y"),
  helpfulVotes = as.integer(helpfulVotes),
  brand = factor(brand, 
                 levels=c("Nokia", "Motorola", "Sony", "Samsung","HUAWEI", 
                          "Apple", "OnePlus", "Google", "ASUS", "Xiaomi"),
                 labels=c("no", "mo", "so", "sa", "hu", "ap", "on", "go", "as", "xi")),
  totalReviews = as.integer(totalReviews),
  price1 = as.numeric(gsub('(,|[$])', '', price1)),
  price2 = as.numeric(gsub('(,|[$])', '', price2))
  )

dim(joined)

## [1] 82815    17

head(joined)

Tidy & Manipulate Data I

The three rules from (Hadley Wickham and Grolemund (2016)) which make a tidy data set are:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

In the previous section, we have separated the prices variable into two variables. We had to do it in order to apply proper data type conversion. If we didn’t do it, the third rule - Each value must have its own cell - would be violated.

Hence, our data set is in a tidy format based on the three criteria above.

Tidy & Manipulate Data II

We created three new variables by using the mutate() function from dyplyr package. We created a variable reviewChars which contains the number of characters in the review body. Next, we created reviewWords variable with the number of words in the review body. Finally, we created a varRating variable that contains the variance of the userRating column grouped by the unique item code (asin column). All of these variables are numeric.

joined <-
  joined %>% group_by(asin) %>% mutate(
    reviewChars = str_length(body),
    reviewWords = sapply(str_split(body, " "), length),
    varRating = var(userRating, na.rm = T)
  )

is.numeric(joined$reviewChars)

## [1] TRUE

is.numeric(joined$reviewWords)

## [1] TRUE

is.numeric(joined$varRating)

## [1] TRUE

dim(joined)

## [1] 82815    20

head(joined)

Scan I

We have scanned the columns for NA values. We have found several columns with missing values:

joined %>% is.na() %>% colSums()

##          asin          name    userRating          date      verified 
##             0             2             0             0             0 
##   reviewTitle          body  helpfulVotes         brand   description 
##             3            16         49681             0             0 
##           url         image averageRating     reviewUrl  totalReviews 
##             0             0             0             0             0 
##        price1        price2   reviewChars   reviewWords     varRating 
##         23670         71390            16             0            44

In the name column, there are two missing values. This variable denotes the name of the reviewer. We have decided to set those values to “unknown”.

There are three missing values in the reviewTitle column. We have decided to use the first two words from the body column as the value for the missing titles. We could do it because the intersection between the missing values of reviewTitle column and body column is empty.

There are \(16\) missing values in the body column. We have decided to use use the review title (in the reviewTitle column) as the body where the body is empty.

In case of the helpfulVotes column, there are \(49,681\) missing values. These values mean that no one voted for the review. We can safely replace the missing values with \(0\).

There are \(23,670\) missing values in the price1 column. We have decided to replace the missing values with the mean of the prices by brands. We have created a mean_prices dataframe with the mean prices per brand. We have done this because in the joined dataframe, the prices are repeated for each review.

There are \(71,344\) empty values in the price2 column. price2 column is empty when price1 column is empty and there is only one price for the product in the data set. We have decided to use the value from price1 column for each missing value in price2 column.

Because we have updated the values of the body column, we need to recompute the values of reviewChars and reviewWords columns.

Lastly, there are \(44\) missing values for our created variable varRating. We have found out that these values are missing when there is only one review per product, because the var() function returns NA when there is only one observation. We can set those values to \(0\) denoting the variance of the ratings is zero.

# name
joined$name[is.na(joined$name)] <- "unknown"

# reviewTitle
intersect(which(is.na(joined$reviewTitle)), which(is.na(joined$body)))

## integer(0)

joined$reviewTitle[is.na(joined$reviewTitle)] <- word(string = joined$body[is.na(joined$reviewTitle)], start = 1, end = 2)

# body
joined$body[is.na(joined$body)] <- joined$reviewTitle[is.na(joined$body)]

# helpfulVotes
joined$helpfulVotes[is.na(joined$helpfulVotes)] <- 0

# price1
mean_prices <-
  joined %>% 
  distinct(asin, .keep_all = T) %>% 
  group_by(brand) %>% 
  mutate(mean_price = mean(price1, na.rm = T)) %>% 
  select(brand, mean_price) %>% 
  distinct()

joined <- joined %>% left_join(mean_prices, by = "brand")
joined$price1[is.na(joined$price1)] <- joined$mean_price[is.na(joined$price1)]
joined <- subset(joined, select=-mean_price)

# price2
joined$price2[is.na(joined$price2)] <- joined$price1[is.na(joined$price2)]

# reviewChars, reviewWords
joined <-
  joined %>% mutate(
    reviewChars = str_length(body),
    reviewWords = sapply(str_split(body, " "), length)
  )

# varRating
joined$varRating[is.na(joined$varRating)] <- 0


joined %>% is.na() %>% colSums()

##          asin          name    userRating          date      verified 
##             0             0             0             0             0 
##   reviewTitle          body  helpfulVotes         brand   description 
##             0             0             0             0             0 
##           url         image averageRating     reviewUrl  totalReviews 
##             0             0             0             0             0 
##        price1        price2   reviewChars   reviewWords     varRating 
##             0             0             0             0             0

We have checked for special values, such as Infinite and NaN values for all numeric columns in our data set. There are no special values for any numeric columns.

is.special <- function(x) {
  if (is.numeric(x)) is.infinite(x) | is.nan(x)
}

sum(is.special(joined$userRating))

## [1] 0

sum(is.special(joined$helpfulVotes))

## [1] 0

sum(is.special(joined$averageRating))

## [1] 0

sum(is.special(joined$totalReviews))

## [1] 0

sum(is.special(joined$price1))

## [1] 0

sum(is.special(joined$price2))

## [1] 0

sum(is.special(joined$reviewChars))

## [1] 0

sum(is.special(joined$reviewWords))

## [1] 0

sum(is.special(joined$varRating))

## [1] 0

Scan II

The userRating column should only contain numbers between \(1\) and \(5\). We can check manually if it is true.

sum(joined$userRating < 1 | joined$userRating > 5)

## [1] 0

We have checked whether the helpfulVotes column is normal distributed by using a histogram. It turns out not to be, so we cannot use the z-scores method to find outliers. We have created a boxplot to visualize the outliers. Because we do not have any missing values, boxplot is safe to use. The boxplot shows a considerable amount of outliers. However, these outliers are not a result of data entry or processing error, so we have decided not to impute nor remove them. We will transform the values in the next section to mitigate the effect of those outliers.

joined$helpfulVotes %>% hist(xlab="Helpful Votes", density=20, xlim = c(0, 800), main="Histogram of helpful votes")

joined$helpfulVotes %>% boxplot(main = "Box Plot of Helpful Votes column", ylab = "Votes", col = "grey")

The averageRating column should contain the values between \(1\) and \(5\), because it is based on the userRating column. We have doublechecked this manually.

sum(joined$averageRating < 1 | joined$averageRating > 5)

## [1] 0

The totalReviews column contains the number of reviews per product. This variable is unlikely to contain outliers as it is derived from the actual number of reviews from the Amazon website. Still, we have created a boxplot to make sure, there are no outliers.

joined$totalReviews %>% boxplot(main = "Box Plot of Total Reviews column", ylab = "Reviews", col = "grey")

The price1 column is not normally distributed, so we have shown the boxplot to detect outliers. The boxplot also shows a considerable amount of outliers. We looked at the data to see which products have the highest price detected as an outlier, so see if the outliers are result of data entry errors, and it turned out to be Iphones, so the prices are assumed to be correct. So, we do not replace the values, but we will apply some transformation to mitigate the effect of outliers.

The same applies to the price2 column.

joined$price1 %>% hist(xlab="Prices", density=20, main="Histogram of prices 1")

joined$price1 %>% boxplot(main = "Box Plot of Prices 1", ylab = "Price", col = "grey")

joined$price2 %>% hist(xlab="Prices", density=20, main="Histogram of prices 2")

joined$price2 %>% boxplot(main = "Box Plot of Prices 2", ylab = "Price", col = "grey")

We have used a boxplot to visualize the outliers in the reviewChars and reviewWords columns. We have found some values that seem to be outliers, but after manually checking of those values, we have found that the reviews are actually that long and are valid. Apparently, Amazon does not have a limit for the number of characters or words given in a review.

We will not do anything with those values as it would require us to update the body column.

joined$reviewChars %>% boxplot(main = "Box Plot of Review Characters", ylab = "Characters", col = "grey")

joined$reviewWords %>% boxplot(main = "Box Plot of Review Words", ylab = "Words", col = "grey")

There is no point of checking the outlier in the varRating column as this column contains the variance of the user ratings for each product.

Transform

As mentioned above we have transformed three variables to mitigate the effect of outliers. For the helpfulVotes column, we have used the min-max normalization. We cannot use the log transformation in this case as we have values of zero in this column.

For the price1 and price2 columns we have used the log transformation. The histograms show that the data are closer to be symmetric after the transformation.

minmaxnormalise <- function(x){(x- min(x)) / (max(x)-min(x))}

joined %>% transform(helpfulVotes = minmaxnormalise(helpfulVotes))

joined <- joined %>% transform(price1 = log10(price1))
joined <- joined %>% transform(price2 = log10(price2))

joined$price1 %>% hist(xlab="Log Prices", density=20, main="Histogram of logarithm of prices 1")

joined$price2 %>% hist(xlab="Log Prices", density=20, main="Histogram of logarithm of prices 2")