Required packages

library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(Hmisc)
library(outliers)

Executive Summary

First of all, we have found and loaded two data files containing information about phones available via Amazon e-shop and their customer reviews. We have joined the two data files into one data set containing all the information we need.

Next, we looked at the data types and data structure and applied appropriate type conversions and transformations. We created an unordered factor variable, a proper date format, made sure that the numeric columns were integer where needed and fixed the prices.

Next, we created three variables based on the existing data. We scanned the data for missing values and replaced all of them with suitable values.

Finally, we checked fthe data for outliers and applied suitable transformations to mitigate their effect where needed.

Data

The two data sets used in this assignment comes from Kaggle. The data set contains information about Amazon cell phones reviews.

There are two files - items and reviews. The items data file contains information about the phones. It has \(9\) variables and \(792\) observations. Each observation contains:

The reviews data file contains information about the reviews. It has \(8\) variables and \(82,815\) observations. Each observation contains:

We used readr library to read the two data files. We used head() and dim() functions to check the data and their dimensions.

Finally we joined the data sets using the left_join() function from the dyplyr package. The data were joined by the asin column. We used left join to extend the reviews data set with the information of the reviewed items. The resulting data set consists of \(16\) variables (\(8\) variables from reviews data set, \(9\) from the items data set, however asin only appears once) and \(82,815\) observations. The number of observations in the resulting data set shows us that there is no review for non-existing products.

items <- read_csv("20190928-items.csv")
## Parsed with column specification:
## cols(
##   asin = col_character(),
##   brand = col_character(),
##   title = col_character(),
##   url = col_character(),
##   image = col_character(),
##   rating = col_double(),
##   reviewUrl = col_character(),
##   totalReviews = col_double(),
##   prices = col_character()
## )
head(items)
dim(items)
## [1] 792   9
reviews <- read_csv("20190928-reviews.csv")
## Parsed with column specification:
## cols(
##   asin = col_character(),
##   name = col_character(),
##   rating = col_double(),
##   date = col_character(),
##   verified = col_logical(),
##   title = col_character(),
##   body = col_character(),
##   helpfulVotes = col_double()
## )
head(reviews)
dim(reviews)
## [1] 82815     8
joined <- reviews %>% left_join(items, by = "asin")
head(joined)
dim(joined)
## [1] 82815    16

Understand

From the head() function used in the previous section, we noticed that some columns have ambiguous names. Because there were some columns with the same names in both data sets, the join operation renamed them. We would like to use different names to differentiate between the variables easier:

We used the names() function to verify that the renaming operation was successful.

joined <- joined %>% rename(
  userRating = rating.x,
  averageRating = rating.y,
  reviewTitle = title.x,
  description = title.y
)

names(joined)
##  [1] "asin"          "name"          "userRating"    "date"         
##  [5] "verified"      "reviewTitle"   "body"          "helpfulVotes" 
##  [9] "brand"         "description"   "url"           "image"        
## [13] "averageRating" "reviewUrl"     "totalReviews"  "prices"

We checked the data structure using the str() function.

str(joined, vec.len = 1)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 82815 obs. of  16 variables:
##  $ asin         : chr  "B0000SX2UC" ...
##  $ name         : chr  "Janet" ...
##  $ userRating   : num  3 1 ...
##  $ date         : chr  "October 11, 2005" ...
##  $ verified     : logi  FALSE ...
##  $ reviewTitle  : chr  "Def not best, but not worst" ...
##  $ body         : chr  "I had the Samsung A600 for awhile which is absolute doo doo. You can read my review on it and detect my rage at"| __truncated__ ...
##  $ helpfulVotes : num  1 17 ...
##  $ brand        : chr  "Nokia" ...
##  $ description  : chr  "Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice Activated Dialing & Bright White Backlit Screen" ...
##  $ url          : chr  "https://www.amazon.com/Dual-Band-Tri-Mode-Activated-Dialing-Backlit/dp/B0000SX2UC" ...
##  $ image        : chr  "https://m.media-amazon.com/images/I/2143EBQ210L._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" ...
##  $ averageRating: num  3 3 ...
##  $ reviewUrl    : chr  "https://www.amazon.com/product-reviews/B0000SX2UC" ...
##  $ totalReviews : num  14 14 ...
##  $ prices       : chr  NA ...

There are several type conversions that need to be done:

#find the maximum number of prices in each observation
max(str_count(joined$prices, "[$]"), na.rm = T) 
## [1] 2
#Separates the prices columns in two
joined <- joined %>% separate(prices, into = c("price1", "price2"), sep=",[$]")

#Transform the variables so they have the right format
joined <- joined %>% transform(
  userRating = as.integer(userRating),
  date = as.Date(date,format="%B %d, %Y"),
  helpfulVotes = as.integer(helpfulVotes),
  brand = factor(brand, 
                 levels=c("Nokia", "Motorola", "Sony", "Samsung","HUAWEI", 
                          "Apple", "OnePlus", "Google", "ASUS", "Xiaomi"),
                 labels=c("no", "mo", "so", "sa", "hu", "ap", "on", "go", "as", "xi")),
  totalReviews = as.integer(totalReviews),
  price1 = as.numeric(gsub('(,|[$])', '', price1)),
  price2 = as.numeric(gsub('(,|[$])', '', price2))
  )

dim(joined)
## [1] 82815    17
head(joined)

Tidy & Manipulate Data I

The three rules from (Hadley Wickham and Grolemund (2016)) which make a tidy data set are:

In the previous section, we have separated the prices variable into two variables. We had to do it in order to apply proper data type conversion. If we didn’t do it, the third rule - Each value must have its own cell - would be violated.

Hence, our data set is in a tidy format based on the three criteria above.

Tidy & Manipulate Data II

We created three new variables by using the mutate() function from dyplyr package. We created a variable reviewChars which contains the number of characters in the review body. Next, we created reviewWords variable with the number of words in the review body. Finally, we created a varRating variable that contains the variance of the userRating column grouped by the unique item code (asin column). All of these variables are numeric.

joined <-
  joined %>% group_by(asin) %>% mutate(
    reviewChars = str_length(body),
    reviewWords = sapply(str_split(body, " "), length),
    varRating = var(userRating, na.rm = T)
  )

is.numeric(joined$reviewChars)
## [1] TRUE
is.numeric(joined$reviewWords)
## [1] TRUE
is.numeric(joined$varRating)
## [1] TRUE
dim(joined)
## [1] 82815    20
head(joined)

Scan I

We have scanned the columns for NA values. We have found several columns with missing values:

joined %>% is.na() %>% colSums()
##          asin          name    userRating          date      verified 
##             0             2             0             0             0 
##   reviewTitle          body  helpfulVotes         brand   description 
##             3            16         49681             0             0 
##           url         image averageRating     reviewUrl  totalReviews 
##             0             0             0             0             0 
##        price1        price2   reviewChars   reviewWords     varRating 
##         23670         71390            16             0            44

In the name column, there are two missing values. This variable denotes the name of the reviewer. We have decided to set those values to “unknown”.

There are three missing values in the reviewTitle column. We have decided to use the first two words from the body column as the value for the missing titles. We could do it because the intersection between the missing values of reviewTitle column and body column is empty.

There are \(16\) missing values in the body column. We have decided to use use the review title (in the reviewTitle column) as the body where the body is empty.

In case of the helpfulVotes column, there are \(49,681\) missing values. These values mean that no one voted for the review. We can safely replace the missing values with \(0\).

There are \(23,670\) missing values in the price1 column. We have decided to replace the missing values with the mean of the prices by brands. We have created a mean_prices dataframe with the mean prices per brand. We have done this because in the joined dataframe, the prices are repeated for each review.

There are \(71,344\) empty values in the price2 column. price2 column is empty when price1 column is empty and there is only one price for the product in the data set. We have decided to use the value from price1 column for each missing value in price2 column.

Because we have updated the values of the body column, we need to recompute the values of reviewChars and reviewWords columns.

Lastly, there are \(44\) missing values for our created variable varRating. We have found out that these values are missing when there is only one review per product, because the var() function returns NA when there is only one observation. We can set those values to \(0\) denoting the variance of the ratings is zero.

# name
joined$name[is.na(joined$name)] <- "unknown"

# reviewTitle
intersect(which(is.na(joined$reviewTitle)), which(is.na(joined$body)))
## integer(0)
joined$reviewTitle[is.na(joined$reviewTitle)] <- word(string = joined$body[is.na(joined$reviewTitle)], start = 1, end = 2)

# body
joined$body[is.na(joined$body)] <- joined$reviewTitle[is.na(joined$body)]

# helpfulVotes
joined$helpfulVotes[is.na(joined$helpfulVotes)] <- 0

# price1
mean_prices <-
  joined %>% 
  distinct(asin, .keep_all = T) %>% 
  group_by(brand) %>% 
  mutate(mean_price = mean(price1, na.rm = T)) %>% 
  select(brand, mean_price) %>% 
  distinct()

joined <- joined %>% left_join(mean_prices, by = "brand")
joined$price1[is.na(joined$price1)] <- joined$mean_price[is.na(joined$price1)]
joined <- subset(joined, select=-mean_price)

# price2
joined$price2[is.na(joined$price2)] <- joined$price1[is.na(joined$price2)]

# reviewChars, reviewWords
joined <-
  joined %>% mutate(
    reviewChars = str_length(body),
    reviewWords = sapply(str_split(body, " "), length)
  )

# varRating
joined$varRating[is.na(joined$varRating)] <- 0


joined %>% is.na() %>% colSums()
##          asin          name    userRating          date      verified 
##             0             0             0             0             0 
##   reviewTitle          body  helpfulVotes         brand   description 
##             0             0             0             0             0 
##           url         image averageRating     reviewUrl  totalReviews 
##             0             0             0             0             0 
##        price1        price2   reviewChars   reviewWords     varRating 
##             0             0             0             0             0

We have checked for special values, such as Infinite and NaN values for all numeric columns in our data set. There are no special values for any numeric columns.

is.special <- function(x) {
  if (is.numeric(x)) is.infinite(x) | is.nan(x)
}

sum(is.special(joined$userRating))
## [1] 0
sum(is.special(joined$helpfulVotes))
## [1] 0
sum(is.special(joined$averageRating))
## [1] 0
sum(is.special(joined$totalReviews))
## [1] 0
sum(is.special(joined$price1))
## [1] 0
sum(is.special(joined$price2))
## [1] 0
sum(is.special(joined$reviewChars))
## [1] 0
sum(is.special(joined$reviewWords))
## [1] 0
sum(is.special(joined$varRating))
## [1] 0

Scan II

The userRating column should only contain numbers between \(1\) and \(5\). We can check manually if it is true.

sum(joined$userRating < 1 | joined$userRating > 5)
## [1] 0

We have checked whether the helpfulVotes column is normal distributed by using a histogram. It turns out not to be, so we cannot use the z-scores method to find outliers. We have created a boxplot to visualize the outliers. Because we do not have any missing values, boxplot is safe to use. The boxplot shows a considerable amount of outliers. However, these outliers are not a result of data entry or processing error, so we have decided not to impute nor remove them. We will transform the values in the next section to mitigate the effect of those outliers.

joined$helpfulVotes %>% hist(xlab="Helpful Votes", density=20, xlim = c(0, 800), main="Histogram of helpful votes")

joined$helpfulVotes %>% boxplot(main = "Box Plot of Helpful Votes column", ylab = "Votes", col = "grey")

The averageRating column should contain the values between \(1\) and \(5\), because it is based on the userRating column. We have doublechecked this manually.

sum(joined$averageRating < 1 | joined$averageRating > 5)
## [1] 0

The totalReviews column contains the number of reviews per product. This variable is unlikely to contain outliers as it is derived from the actual number of reviews from the Amazon website. Still, we have created a boxplot to make sure, there are no outliers.

joined$totalReviews %>% boxplot(main = "Box Plot of Total Reviews column", ylab = "Reviews", col = "grey")

The price1 column is not normally distributed, so we have shown the boxplot to detect outliers. The boxplot also shows a considerable amount of outliers. We looked at the data to see which products have the highest price detected as an outlier, so see if the outliers are result of data entry errors, and it turned out to be Iphones, so the prices are assumed to be correct. So, we do not replace the values, but we will apply some transformation to mitigate the effect of outliers.

The same applies to the price2 column.

joined$price1 %>% hist(xlab="Prices", density=20, main="Histogram of prices 1")

joined$price1 %>% boxplot(main = "Box Plot of Prices 1", ylab = "Price", col = "grey")

joined$price2 %>% hist(xlab="Prices", density=20, main="Histogram of prices 2")

joined$price2 %>% boxplot(main = "Box Plot of Prices 2", ylab = "Price", col = "grey")

We have used a boxplot to visualize the outliers in the reviewChars and reviewWords columns. We have found some values that seem to be outliers, but after manually checking of those values, we have found that the reviews are actually that long and are valid. Apparently, Amazon does not have a limit for the number of characters or words given in a review.

We will not do anything with those values as it would require us to update the body column.

joined$reviewChars %>% boxplot(main = "Box Plot of Review Characters", ylab = "Characters", col = "grey")

joined$reviewWords %>% boxplot(main = "Box Plot of Review Words", ylab = "Words", col = "grey")

There is no point of checking the outlier in the varRating column as this column contains the variance of the user ratings for each product.

Transform

As mentioned above we have transformed three variables to mitigate the effect of outliers. For the helpfulVotes column, we have used the min-max normalization. We cannot use the log transformation in this case as we have values of zero in this column.

For the price1 and price2 columns we have used the log transformation. The histograms show that the data are closer to be symmetric after the transformation.

minmaxnormalise <- function(x){(x- min(x)) / (max(x)-min(x))}

joined %>% transform(helpfulVotes = minmaxnormalise(helpfulVotes))
joined <- joined %>% transform(price1 = log10(price1))
joined <- joined %>% transform(price2 = log10(price2))

joined$price1 %>% hist(xlab="Log Prices", density=20, main="Histogram of logarithm of prices 1")

joined$price2 %>% hist(xlab="Log Prices", density=20, main="Histogram of logarithm of prices 2")