library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(Hmisc)
library(outliers)
First of all, we have found and loaded two data files containing information about phones available via Amazon e-shop and their customer reviews. We have joined the two data files into one data set containing all the information we need.
Next, we looked at the data types and data structure and applied appropriate type conversions and transformations. We created an unordered factor variable, a proper date format, made sure that the numeric columns were integer where needed and fixed the prices.
Next, we created three variables based on the existing data. We scanned the data for missing values and replaced all of them with suitable values.
Finally, we checked fthe data for outliers and applied suitable transformations to mitigate their effect where needed.
The two data sets used in this assignment comes from Kaggle. The data set contains information about Amazon cell phones reviews.
There are two files - items and reviews. The items data file contains information about the phones. It has \(9\) variables and \(792\) observations. Each observation contains:
The reviews data file contains information about the reviews. It has \(8\) variables and \(82,815\) observations. Each observation contains:
We used readr library to read the two data files. We used head() and dim() functions to check the data and their dimensions.
Finally we joined the data sets using the left_join() function from the dyplyr package. The data were joined by the asin column. We used left join to extend the reviews data set with the information of the reviewed items. The resulting data set consists of \(16\) variables (\(8\) variables from reviews data set, \(9\) from the items data set, however asin only appears once) and \(82,815\) observations. The number of observations in the resulting data set shows us that there is no review for non-existing products.
items <- read_csv("20190928-items.csv")
## Parsed with column specification:
## cols(
## asin = col_character(),
## brand = col_character(),
## title = col_character(),
## url = col_character(),
## image = col_character(),
## rating = col_double(),
## reviewUrl = col_character(),
## totalReviews = col_double(),
## prices = col_character()
## )
head(items)
dim(items)
## [1] 792 9
reviews <- read_csv("20190928-reviews.csv")
## Parsed with column specification:
## cols(
## asin = col_character(),
## name = col_character(),
## rating = col_double(),
## date = col_character(),
## verified = col_logical(),
## title = col_character(),
## body = col_character(),
## helpfulVotes = col_double()
## )
head(reviews)
dim(reviews)
## [1] 82815 8
joined <- reviews %>% left_join(items, by = "asin")
head(joined)
dim(joined)
## [1] 82815 16
From the head() function used in the previous section, we noticed that some columns have ambiguous names. Because there were some columns with the same names in both data sets, the join operation renamed them. We would like to use different names to differentiate between the variables easier:
rating.x variable contains values from the reviews dataset. We would like to rename it to userRatingrating.y variable contains values from the items dataset. We would like to rename it to averageRatingtitle.x variable contains values from the reviews dataset. We would like to rename it to reviewTitletitle.y variable contains values from the items dataset. We would like to rename it to descriptionWe used the names() function to verify that the renaming operation was successful.
joined <- joined %>% rename(
userRating = rating.x,
averageRating = rating.y,
reviewTitle = title.x,
description = title.y
)
names(joined)
## [1] "asin" "name" "userRating" "date"
## [5] "verified" "reviewTitle" "body" "helpfulVotes"
## [9] "brand" "description" "url" "image"
## [13] "averageRating" "reviewUrl" "totalReviews" "prices"
We checked the data structure using the str() function.
str(joined, vec.len = 1)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 82815 obs. of 16 variables:
## $ asin : chr "B0000SX2UC" ...
## $ name : chr "Janet" ...
## $ userRating : num 3 1 ...
## $ date : chr "October 11, 2005" ...
## $ verified : logi FALSE ...
## $ reviewTitle : chr "Def not best, but not worst" ...
## $ body : chr "I had the Samsung A600 for awhile which is absolute doo doo. You can read my review on it and detect my rage at"| __truncated__ ...
## $ helpfulVotes : num 1 17 ...
## $ brand : chr "Nokia" ...
## $ description : chr "Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice Activated Dialing & Bright White Backlit Screen" ...
## $ url : chr "https://www.amazon.com/Dual-Band-Tri-Mode-Activated-Dialing-Backlit/dp/B0000SX2UC" ...
## $ image : chr "https://m.media-amazon.com/images/I/2143EBQ210L._AC_UY218_SEARCH213888_FMwebp_QL75_.jpg" ...
## $ averageRating: num 3 3 ...
## $ reviewUrl : chr "https://www.amazon.com/product-reviews/B0000SX2UC" ...
## $ totalReviews : num 14 14 ...
## $ prices : chr NA ...
There are several type conversions that need to be done:
userRating is of type double and contains integer values from 1 to 5. One option would be to convert this variable to a factor variable. However, we have decided to convert it to integer to easily perform statistical operations.
date variable is of type character. We would like to transform it to a date type.
helpfulVotes is of type double and we would like to transform it to an integer as with userRating.
brand is of type character, but we would like to convert it to a factor. We have \(10\) different values, and the variable is unordered. We add labels to the levels, which is the first two letters of the name of the brand.
totalReviews is of type double. We would like it to be integer.
prices is of type character but we would like it to be numeric. Some of the prices in prices have more than one value, so we check the maximum amount of dollar signs to figure out how many prices are given for each value. The result is two. So we create two columns from prices, one called price1 and one called price2, and if there exist a second price in prices it is now stored in price2. Next we need to transform it to numeric, and to do so, we need to remove the dollar signs. This is done by using gsub().
#find the maximum number of prices in each observation
max(str_count(joined$prices, "[$]"), na.rm = T)
## [1] 2
#Separates the prices columns in two
joined <- joined %>% separate(prices, into = c("price1", "price2"), sep=",[$]")
#Transform the variables so they have the right format
joined <- joined %>% transform(
userRating = as.integer(userRating),
date = as.Date(date,format="%B %d, %Y"),
helpfulVotes = as.integer(helpfulVotes),
brand = factor(brand,
levels=c("Nokia", "Motorola", "Sony", "Samsung","HUAWEI",
"Apple", "OnePlus", "Google", "ASUS", "Xiaomi"),
labels=c("no", "mo", "so", "sa", "hu", "ap", "on", "go", "as", "xi")),
totalReviews = as.integer(totalReviews),
price1 = as.numeric(gsub('(,|[$])', '', price1)),
price2 = as.numeric(gsub('(,|[$])', '', price2))
)
dim(joined)
## [1] 82815 17
head(joined)
The three rules from (Hadley Wickham and Grolemund (2016)) which make a tidy data set are:
In the previous section, we have separated the prices variable into two variables. We had to do it in order to apply proper data type conversion. If we didn’t do it, the third rule - Each value must have its own cell - would be violated.
Hence, our data set is in a tidy format based on the three criteria above.
We created three new variables by using the mutate() function from dyplyr package. We created a variable reviewChars which contains the number of characters in the review body. Next, we created reviewWords variable with the number of words in the review body. Finally, we created a varRating variable that contains the variance of the userRating column grouped by the unique item code (asin column). All of these variables are numeric.
joined <-
joined %>% group_by(asin) %>% mutate(
reviewChars = str_length(body),
reviewWords = sapply(str_split(body, " "), length),
varRating = var(userRating, na.rm = T)
)
is.numeric(joined$reviewChars)
## [1] TRUE
is.numeric(joined$reviewWords)
## [1] TRUE
is.numeric(joined$varRating)
## [1] TRUE
dim(joined)
## [1] 82815 20
head(joined)
We have scanned the columns for NA values. We have found several columns with missing values:
joined %>% is.na() %>% colSums()
## asin name userRating date verified
## 0 2 0 0 0
## reviewTitle body helpfulVotes brand description
## 3 16 49681 0 0
## url image averageRating reviewUrl totalReviews
## 0 0 0 0 0
## price1 price2 reviewChars reviewWords varRating
## 23670 71390 16 0 44
In the name column, there are two missing values. This variable denotes the name of the reviewer. We have decided to set those values to “unknown”.
There are three missing values in the reviewTitle column. We have decided to use the first two words from the body column as the value for the missing titles. We could do it because the intersection between the missing values of reviewTitle column and body column is empty.
There are \(16\) missing values in the body column. We have decided to use use the review title (in the reviewTitle column) as the body where the body is empty.
In case of the helpfulVotes column, there are \(49,681\) missing values. These values mean that no one voted for the review. We can safely replace the missing values with \(0\).
There are \(23,670\) missing values in the price1 column. We have decided to replace the missing values with the mean of the prices by brands. We have created a mean_prices dataframe with the mean prices per brand. We have done this because in the joined dataframe, the prices are repeated for each review.
There are \(71,344\) empty values in the price2 column. price2 column is empty when price1 column is empty and there is only one price for the product in the data set. We have decided to use the value from price1 column for each missing value in price2 column.
Because we have updated the values of the body column, we need to recompute the values of reviewChars and reviewWords columns.
Lastly, there are \(44\) missing values for our created variable varRating. We have found out that these values are missing when there is only one review per product, because the var() function returns NA when there is only one observation. We can set those values to \(0\) denoting the variance of the ratings is zero.
# name
joined$name[is.na(joined$name)] <- "unknown"
# reviewTitle
intersect(which(is.na(joined$reviewTitle)), which(is.na(joined$body)))
## integer(0)
joined$reviewTitle[is.na(joined$reviewTitle)] <- word(string = joined$body[is.na(joined$reviewTitle)], start = 1, end = 2)
# body
joined$body[is.na(joined$body)] <- joined$reviewTitle[is.na(joined$body)]
# helpfulVotes
joined$helpfulVotes[is.na(joined$helpfulVotes)] <- 0
# price1
mean_prices <-
joined %>%
distinct(asin, .keep_all = T) %>%
group_by(brand) %>%
mutate(mean_price = mean(price1, na.rm = T)) %>%
select(brand, mean_price) %>%
distinct()
joined <- joined %>% left_join(mean_prices, by = "brand")
joined$price1[is.na(joined$price1)] <- joined$mean_price[is.na(joined$price1)]
joined <- subset(joined, select=-mean_price)
# price2
joined$price2[is.na(joined$price2)] <- joined$price1[is.na(joined$price2)]
# reviewChars, reviewWords
joined <-
joined %>% mutate(
reviewChars = str_length(body),
reviewWords = sapply(str_split(body, " "), length)
)
# varRating
joined$varRating[is.na(joined$varRating)] <- 0
joined %>% is.na() %>% colSums()
## asin name userRating date verified
## 0 0 0 0 0
## reviewTitle body helpfulVotes brand description
## 0 0 0 0 0
## url image averageRating reviewUrl totalReviews
## 0 0 0 0 0
## price1 price2 reviewChars reviewWords varRating
## 0 0 0 0 0
We have checked for special values, such as Infinite and NaN values for all numeric columns in our data set. There are no special values for any numeric columns.
is.special <- function(x) {
if (is.numeric(x)) is.infinite(x) | is.nan(x)
}
sum(is.special(joined$userRating))
## [1] 0
sum(is.special(joined$helpfulVotes))
## [1] 0
sum(is.special(joined$averageRating))
## [1] 0
sum(is.special(joined$totalReviews))
## [1] 0
sum(is.special(joined$price1))
## [1] 0
sum(is.special(joined$price2))
## [1] 0
sum(is.special(joined$reviewChars))
## [1] 0
sum(is.special(joined$reviewWords))
## [1] 0
sum(is.special(joined$varRating))
## [1] 0
The userRating column should only contain numbers between \(1\) and \(5\). We can check manually if it is true.
sum(joined$userRating < 1 | joined$userRating > 5)
## [1] 0
We have checked whether the helpfulVotes column is normal distributed by using a histogram. It turns out not to be, so we cannot use the z-scores method to find outliers. We have created a boxplot to visualize the outliers. Because we do not have any missing values, boxplot is safe to use. The boxplot shows a considerable amount of outliers. However, these outliers are not a result of data entry or processing error, so we have decided not to impute nor remove them. We will transform the values in the next section to mitigate the effect of those outliers.
joined$helpfulVotes %>% hist(xlab="Helpful Votes", density=20, xlim = c(0, 800), main="Histogram of helpful votes")
joined$helpfulVotes %>% boxplot(main = "Box Plot of Helpful Votes column", ylab = "Votes", col = "grey")
The averageRating column should contain the values between \(1\) and \(5\), because it is based on the userRating column. We have doublechecked this manually.
sum(joined$averageRating < 1 | joined$averageRating > 5)
## [1] 0
The totalReviews column contains the number of reviews per product. This variable is unlikely to contain outliers as it is derived from the actual number of reviews from the Amazon website. Still, we have created a boxplot to make sure, there are no outliers.
joined$totalReviews %>% boxplot(main = "Box Plot of Total Reviews column", ylab = "Reviews", col = "grey")
The price1 column is not normally distributed, so we have shown the boxplot to detect outliers. The boxplot also shows a considerable amount of outliers. We looked at the data to see which products have the highest price detected as an outlier, so see if the outliers are result of data entry errors, and it turned out to be Iphones, so the prices are assumed to be correct. So, we do not replace the values, but we will apply some transformation to mitigate the effect of outliers.
The same applies to the price2 column.
joined$price1 %>% hist(xlab="Prices", density=20, main="Histogram of prices 1")
joined$price1 %>% boxplot(main = "Box Plot of Prices 1", ylab = "Price", col = "grey")
joined$price2 %>% hist(xlab="Prices", density=20, main="Histogram of prices 2")
joined$price2 %>% boxplot(main = "Box Plot of Prices 2", ylab = "Price", col = "grey")
We have used a boxplot to visualize the outliers in the reviewChars and reviewWords columns. We have found some values that seem to be outliers, but after manually checking of those values, we have found that the reviews are actually that long and are valid. Apparently, Amazon does not have a limit for the number of characters or words given in a review.
We will not do anything with those values as it would require us to update the body column.
joined$reviewChars %>% boxplot(main = "Box Plot of Review Characters", ylab = "Characters", col = "grey")
joined$reviewWords %>% boxplot(main = "Box Plot of Review Words", ylab = "Words", col = "grey")
There is no point of checking the outlier in the varRating column as this column contains the variance of the user ratings for each product.
As mentioned above we have transformed three variables to mitigate the effect of outliers. For the helpfulVotes column, we have used the min-max normalization. We cannot use the log transformation in this case as we have values of zero in this column.
For the price1 and price2 columns we have used the log transformation. The histograms show that the data are closer to be symmetric after the transformation.
minmaxnormalise <- function(x){(x- min(x)) / (max(x)-min(x))}
joined %>% transform(helpfulVotes = minmaxnormalise(helpfulVotes))
joined <- joined %>% transform(price1 = log10(price1))
joined <- joined %>% transform(price2 = log10(price2))
joined$price1 %>% hist(xlab="Log Prices", density=20, main="Histogram of logarithm of prices 1")
joined$price2 %>% hist(xlab="Log Prices", density=20, main="Histogram of logarithm of prices 2")