Packages required to reproduce the report.
library(readr)
library(tidyr)
library(dplyr)
library(lubridate)
library(outliers)
library(knitr)
library(magrittr)
library(Hmisc)
library(editrules)
library(forecast)
library(infotheo)
library(mlr)
This report outlines the steps carried out to preprocess the Amazon Cell Phones Reviews dataset. The dataset is collected from www.kaggle.com. The dataset is inspected for variables, their types, meaning and the number of observations. The dataset came in two different csv files and was merged to form a single dataset on which further preprocessing was done. On examining the data types, there were some misinterpretations. Most of the data types were characters including the date. Data types were converted into meaningful types (factors, characters, numerics, date, integers) instead of only characters and numerics. The dataset was already in tidy format so no action was taken. An extra column/attribute was added to show the proportion of “helpful review” among the total number of reviews after the data set was scanned for missing values that were handled by imputation with the mean prices and mean helpfulVotes (the columns with missing values) of each cell phone brand respectively. Further, the dataset was scanned for outliers using two types of univariate outlier detection methods (box plot and z score). According to the summary statistics of each variable, it was inferred that the observed possible outliers may not exactly be outliers but can be extreme values that need not be handled. The outliers for some variables (<5%) were handled by excluding them from the datsset. The dataset’s numeric attributes were then transformed by Boxcox, square root and logarithmic transformations to obtain normal distribution and was also normalised for ease of further statistical analysis. In the final dataset, attributes/features such as url, name, image were filtered out since they were incomprehensible and counterproductive.
The data set is called Amazon Cell Phones Reviews and provides informationon on cell phone ratings and reviews on Amazon for the 10 brands: : ASUS, Apple, Google, HUAWEI, Motorola, Nokia, OnePlus, Samsung, Sony, and Xiaomi. The dataset is obtained from www.kaggle.com where the data was scraped from Amazon.com. This dataset has data retreived on 28.09.2019 and is in two csv files: items.csv and reviews.csv.
There are 792 observations and 9 attributes in items.csv with item details and ratings.
and there are 82815 observations and 8 attributes in reviews.csv has user details and their ratings and reviews.
The two datasets can be joined using inner join so that the rows common in both the datasets will be retained. Now, the merged dataset has 82815 observations and 16 variables in total and will be used for further preprocessing.
# Reading dataset 1
items <- read_csv("20190928-items.csv")
Parsed with column specification:
cols(
asin = col_character(),
brand = col_character(),
title = col_character(),
url = col_character(),
image = col_character(),
rating = col_double(),
reviewUrl = col_character(),
totalReviews = col_double(),
prices = col_character()
)
str(items)
Classes âspec_tbl_dfâ, âtbl_dfâ, âtblâ and 'data.frame': 792 obs. of 9 variables:
$ asin : chr "B0000SX2UC" "B0009N5L7K" "B000SKTZ0S" "B00198M12M" ...
$ brand : chr "Nokia" "Motorola" "Motorola" "Nokia" ...
$ title : chr "Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice Activated Dialing & Bright White Backlit Screen" "Motorola I265 phone" "MOTOROLA C168i AT&T CINGULAR PREPAID GOPHONE CELL PHONE" "Nokia 6500 Slide Black/silver Unlocked Cell Phone" ...
$ url : chr "https://www.amazon.com/Dual-Band-Tri-Mode-Activated-Dialing-Backlit/dp/B0000SX2UC" "https://www.amazon.com/Motorola-i265-I265-phone/dp/B0009N5L7K" "https://www.amazon.com/MOTOROLA-C168i-CINGULAR-PREPAID-GOPHONE/dp/B000SKTZ0S" "https://www.amazon.com/Nokia-6500-Slide-silver-Unlocked/dp/B00198M12M" ...
$ image : chr "https://m.media-amazon.com/images/I/2143EBQ210L._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" "https://m.media-amazon.com/images/I/419WBAVDARL._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" "https://m.media-amazon.com/images/I/71b+q3ydkIS._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" "https://m.media-amazon.com/images/I/41ss4HpLkLL._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" ...
$ rating : num 3 2.9 2.6 2.4 3.3 3.2 2 2.7 3.2 3.5 ...
$ reviewUrl : chr "https://www.amazon.com/product-reviews/B0000SX2UC" "https://www.amazon.com/product-reviews/B0009N5L7K" "https://www.amazon.com/product-reviews/B000SKTZ0S" "https://www.amazon.com/product-reviews/B00198M12M" ...
$ totalReviews: num 14 7 22 5 21 12 1 3 8 133 ...
$ prices : chr NA "$49.95" NA NA ...
- attr(*, "spec")=
.. cols(
.. asin = col_character(),
.. brand = col_character(),
.. title = col_character(),
.. url = col_character(),
.. image = col_character(),
.. rating = col_double(),
.. reviewUrl = col_character(),
.. totalReviews = col_double(),
.. prices = col_character()
.. )
head(items)
# Reading dataset 2
reviews <- read_csv("20190928-reviews.csv")
Parsed with column specification:
cols(
asin = col_character(),
name = col_character(),
rating = col_double(),
date = col_character(),
verified = col_logical(),
title = col_character(),
body = col_character(),
helpfulVotes = col_double()
)
str(reviews)
Classes âspec_tbl_dfâ, âtbl_dfâ, âtblâ and 'data.frame': 82815 obs. of 8 variables:
$ asin : chr "B0000SX2UC" "B0000SX2UC" "B0000SX2UC" "B0000SX2UC" ...
$ name : chr "Janet" "Luke Wyatt" "Brooke" "amy m. teague" ...
$ rating : num 3 1 5 3 4 4 5 4 5 3 ...
$ date : chr "October 11, 2005" "January 7, 2004" "December 30, 2003" "March 18, 2004" ...
$ verified : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ title : chr "Def not best, but not worst" "Text Messaging Doesn't Work" "Love This Phone" "Love the Phone, BUT...!" ...
$ body : chr "I had the Samsung A600 for awhile which is absolute doo doo. You can read my review on it and detect my rage at"| __truncated__ "Due to a software issue between Nokia and Sprint this phone's text messaging capabilities don't work with Sprin"| __truncated__ "This is a great, reliable phone. I also purchased this phone after my samsung A460 died. The menu is easily com"| __truncated__ "I love the phone and all, because I really did need one, but I didn't expect the price of the bill when I recei"| __truncated__ ...
$ helpfulVotes: num 1 17 5 1 1 NA 2 2 7 3 ...
- attr(*, "spec")=
.. cols(
.. asin = col_character(),
.. name = col_character(),
.. rating = col_double(),
.. date = col_character(),
.. verified = col_logical(),
.. title = col_character(),
.. body = col_character(),
.. helpfulVotes = col_double()
.. )
head(reviews)
# Merging the datasets using inner join
amazon_cell <- items %>% inner_join(reviews, by = "asin")
dim(amazon_cell)
[1] 82815 16
head(amazon_cell)
There are 11 variables with character data types, 4 with numeric data types and 1 with logical data type.
The dataset has a date attribute and a brand attribute which are in character format. They can be converted into their respective Date and factor types.
The date is in Month, Day, Year format and hence mdy() is used to convert the type. The brand has been converted into a factor which is already neatly labelled.
Verified users and unverified users can be categorised and labelled for better understanding.
totalReviews, reviewerRating and helpfulVotes can be converted to integers. avgRating can be left as numberic as it has decimal values.
The rating and title attribute names can be renamed to uniquely identify product’s and reviewer’s columns.
The prices are read as characters because of the ‘$’ sybmol. It can be removed and the type can be converted to numeric for ease of any analysis with the product prices. parse_number from the readr package was used to do this step.
# Renaming attributes
names(amazon_cell)[3] <- "productTitle"
names(amazon_cell)[6] <- "avgRating"
names(amazon_cell)[11] <- "reviewerRating"
names(amazon_cell)[14] <- "reviewTitle"
# Converting data types
amazon_cell$prices <- parse_number(amazon_cell$prices)
amazon_cell$date <- mdy(amazon_cell$date)
amazon_cell$brand <- as.factor(amazon_cell$brand)
amazon_cell$verified <- factor(amazon_cell$verified, levels = c("FALSE", "TRUE"), labels = c("Unverified", "Verified"))
amazon_cell$totalReviews <- as.integer(amazon_cell$totalReviews)
amazon_cell$reviewerRating <- as.integer(amazon_cell$reviewerRating)
amazon_cell$helpfulVotes <- as.integer(amazon_cell$helpfulVotes)
# Check for corrected data types
str(amazon_cell)
Classes âspec_tbl_dfâ, âtbl_dfâ, âtblâ and 'data.frame': 82815 obs. of 16 variables:
$ asin : chr "B0000SX2UC" "B0000SX2UC" "B0000SX2UC" "B0000SX2UC" ...
$ brand : Factor w/ 10 levels "Apple","ASUS",..: 6 6 6 6 6 6 6 6 6 6 ...
$ productTitle : chr "Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice Activated Dialing & Bright White Backlit Screen" "Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice Activated Dialing & Bright White Backlit Screen" "Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice Activated Dialing & Bright White Backlit Screen" "Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice Activated Dialing & Bright White Backlit Screen" ...
$ url : chr "https://www.amazon.com/Dual-Band-Tri-Mode-Activated-Dialing-Backlit/dp/B0000SX2UC" "https://www.amazon.com/Dual-Band-Tri-Mode-Activated-Dialing-Backlit/dp/B0000SX2UC" "https://www.amazon.com/Dual-Band-Tri-Mode-Activated-Dialing-Backlit/dp/B0000SX2UC" "https://www.amazon.com/Dual-Band-Tri-Mode-Activated-Dialing-Backlit/dp/B0000SX2UC" ...
$ image : chr "https://m.media-amazon.com/images/I/2143EBQ210L._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" "https://m.media-amazon.com/images/I/2143EBQ210L._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" "https://m.media-amazon.com/images/I/2143EBQ210L._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" "https://m.media-amazon.com/images/I/2143EBQ210L._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" ...
$ avgRating : num 3 3 3 3 3 3 3 3 3 3 ...
$ reviewUrl : chr "https://www.amazon.com/product-reviews/B0000SX2UC" "https://www.amazon.com/product-reviews/B0000SX2UC" "https://www.amazon.com/product-reviews/B0000SX2UC" "https://www.amazon.com/product-reviews/B0000SX2UC" ...
$ totalReviews : int 14 14 14 14 14 14 14 14 14 14 ...
$ prices : num NA NA NA NA NA NA NA NA NA NA ...
$ name : chr "Janet" "Luke Wyatt" "Brooke" "amy m. teague" ...
$ reviewerRating: int 3 1 5 3 4 4 5 4 5 3 ...
$ date : Date, format: "2005-10-11" "2004-01-07" ...
$ verified : Factor w/ 2 levels "Unverified","Verified": 1 1 1 1 1 1 1 1 1 1 ...
$ reviewTitle : chr "Def not best, but not worst" "Text Messaging Doesn't Work" "Love This Phone" "Love the Phone, BUT...!" ...
$ body : chr "I had the Samsung A600 for awhile which is absolute doo doo. You can read my review on it and detect my rage at"| __truncated__ "Due to a software issue between Nokia and Sprint this phone's text messaging capabilities don't work with Sprin"| __truncated__ "This is a great, reliable phone. I also purchased this phone after my samsung A460 died. The menu is easily com"| __truncated__ "I love the phone and all, because I really did need one, but I didn't expect the price of the bill when I recei"| __truncated__ ...
$ helpfulVotes : int 1 17 5 1 1 NA 2 2 7 3 ...
head(amazon_cell)
The data is in tidy format because of the following reasons:
# Total missing values (NAs)
sum(is.na(amazon_cell))
[1] 73372
# Number of missing values in each column
colSums(is.na(amazon_cell))
asin brand productTitle url image
0 0 0 0 0
avgRating reviewUrl totalReviews prices name
0 0 0 23670 2
reviewerRating date verified reviewTitle body
0 0 0 3 16
helpfulVotes
49681
# Handling missing values
amazon_cell_imputed <- amazon_cell %>% group_by(brand) %>% mutate(prices = ifelse(is.na(prices), mean(prices, na.rm=TRUE), prices))
amazon_cell_imputed <- amazon_cell_imputed %>% group_by(brand) %>% mutate(helpfulVotes = ifelse(is.na(helpfulVotes), mean(helpfulVotes, na.rm=TRUE), helpfulVotes))
# Re-check for missing values
colSums(is.na(amazon_cell_imputed))
asin brand productTitle url image
0 0 0 0 0
avgRating reviewUrl totalReviews prices name
0 0 0 0 2
reviewerRating date verified reviewTitle body
0 0 0 3 16
helpfulVotes
0
The summary statistcis for each column shows that there are no errors or inconsistency as most of the data types are characters and the numeric and integer values lie in the range that they should be (by looking at the maximum and minimum values).
There are 73372 missing values in total. That is prices have 35141(42%) and helpfulVotes have 49681 (60%) missing values. The column sum of missing values show that name, body and reviewTitle have 2, 16 and 3 missing values respectively.
The missing value of each numeric observation is handled by imputing the mean price values of the respective column after being grouped by brand.
A new variable ‘ratioHelpful’ can be created to find out the ratio of helpful feedback present in the total reviews of a product from which we can infer how much the product is helpful.
amazon_cell_imputed <- mutate(amazon_cell_imputed,
ratioHelpful = helpfulVotes / totalReviews
)
head(amazon_cell_imputed)
# Univarate outlier detection
amazon_cell_imputed$reviewerRating %>% boxplot(main="Box Plot to Detect Reviwer Ratings Outliers", ylab="Reviwer Ratings")
amazon_cell_imputed$avgRating %>% boxplot(main="Box Plot to Detect Average Ratings Outliers", ylab="Average Ratings")
amazon_cell_imputed$totalReviews %>% boxplot(main="Box Plot to Detect Total Reviews Outliers", ylab="Total Reviews")
amazon_cell_imputed$prices %>% boxplot(main="Box Plot to Detect Price Outliers", ylab="Price")
amazon_cell_imputed$ratioHelpful %>% boxplot(main="Box Plot to Detect Ratio Helpful Votes Outliers", ylab="Ratio Helpful Votes")
amazon_cell_imputed$helpfulVotes %>% boxplot(main="Box Plot to Detect Helpful Votes Outliers", ylab="Helpful Votes")
# z-score test to detect outliers location and count
zscore_avg <- amazon_cell_imputed$avgRating %>% scores(type = "z")
length (which(abs(zscore_avg)>3))/82815 *100
[1] 0.2113144
zscore_tot <- amazon_cell_imputed$totalReviews %>% scores(type = "z")
length (which(abs(zscore_tot)>3))/82815 *100
[1] 0
zscore_price <- amazon_cell_imputed$prices %>% scores(type = "z")
length (which(abs(zscore_price)>3))/82815 *100
[1] 2.571998
zscore_ratio <- amazon_cell_imputed$ratioHelpful %>% scores(type = "z")
length (which(abs(zscore_ratio)>3))/82815 *100
[1] 0.1919942
zscore_vote <- amazon_cell_imputed$helpfulVotes %>% scores(type = "z")
length (which(abs(zscore_vote)>3))/82815 *100
[1] 0.7643543
# Handling outliers
amazon_cell_avgRating <- amazon_cell_imputed$avgRating[ -which(abs(zscore_avg) >3 )]
amazon_cell_prices <- amazon_cell_imputed$prices[ -which(abs(zscore_price) >3 )]
amazon_cell_ratioHelpful <- amazon_cell_imputed$ratioHelpful[ -which(abs(zscore_ratio) >3 )]
amazon_cell_helpfulVotes <- amazon_cell_imputed$helpfulVotes[ -which(abs(zscore_vote) >3 )]
# Re-checking outliers
amazon_cell_avgRating %>% boxplot(main="Box Plot to Detect Average Ratings Outliers", ylab="Average Ratings")
amazon_cell_prices %>% boxplot(main="Box Plot to Detect Price Outliers", ylab="Price")
amazon_cell_ratioHelpful %>% boxplot(main="Box Plot to Detect Ratio Helpful Votes Outliers", ylab="Ratio Helpful Votes")
amazon_cell_helpfulVotes %>% boxplot(main="Box Plot to Detect Helpful Votes Outliers", ylab="Helpful Votes")
From the univariate outlier detection methods (box plots and z scores), it is evident that reviewerRating and totalReviews do not have any outliers.
The z score method of finding the outliers state that the number of outliers are less than 5%. Hence, it is better to exclude the obserevations from the dataset.
After excluding, there are some outliers that are still remaining. They may be extreme values which need not be handled.
# Transforming attributes
hist(amazon_cell_imputed$avgRating, main="Histogram for Average Rating", xlab = "Average Rating")
amazon_cell_imputed$avgRating<- BoxCox(amazon_cell_imputed$avgRating,lambda = "auto")
hist(amazon_cell_imputed$avgRating, main="Histogram for Average Rating", xlab = "Average Rating")
hist(amazon_cell_imputed$ratioHelpful, main="Histogram for Ratio of Helpful Votes", xlab = "Ratio of Helpful Votes")
amazon_cell_imputed$ratioHelpful<- log10(amazon_cell_imputed$ratioHelpful)
hist(amazon_cell_imputed$ratioHelpful, main="Histogram for Ratio of Helpful Votes", xlab = "Ratio of Helpful Votes")
hist(amazon_cell_imputed$totalReviews, main = "Histogram for Total Reviews", xlab = "Total Reviews")
amazon_cell_imputed$totalReviews<- log10(amazon_cell_imputed$totalReviews)
amazon_cell_imputed$totalReviews<- amazon_cell_imputed$totalReviews^2
hist(amazon_cell_imputed$totalReviews, main = "Histogram for Total Reviews", xlab = "Total Reviews")
hist(amazon_cell_imputed$prices, main = "Histogram for Prices", xlab = "Prices")
amazon_cell_imputed$prices<- BoxCox(amazon_cell_imputed$prices,lambda = "auto")
amazon_cell_imputed$prices<- amazon_cell_imputed$prices^2
hist(amazon_cell_imputed$prices, main = "Histogram for Prices", xlab = "Prices")
The aim to do transformation is to convert the non-normal distribution to a normal distribution that would be preffered for analysis. From the histograms, depending on the skewness, transformation methods were decided. Boxcox transformation is used to trnasform avgRating and prices because this transformation is a type of a power transformation that can transform data to normal distribution automtcally finding a parameter that fits normality assumption. For right skewed distributions, log transformation was used since it works well to reduce right skewness hence log base 10 transformation is done. Multiple transformation is done to bring the distributions close to normal.
For such an analysis, the urls (url, image, reviewUrl, name) are not needed and hence can be featured out. This can be the final preprocessed data.
# feature drop
amazon_final <- amazon_cell_imputed[, -c(3,4,5,7,10,14,15)]
head(amazon_final)