library(dplyr)
library(magrittr)
library(tidyr)
library(stringr)
library(knitr)
library(ggplot2)
library(readr)
library(Hmisc)
library(forecast)
Mobile phones are among the most frequently bought items in today's world. Many companies make mobile phones, which leaves buyers unsure which one to purchase. Online reviews help individuals decide which mobile phone to buy.
Data source: https://www.kaggle.com/grikomsn/amazon-cell-phones-reviews/download#20190928-items.csv
The data used for this assignment are the mobile phone details and the reviews from Amazon. The items data describes the mobile phones and contains the variables asin, brand, title, url, image, rating, reviewUrl, totalReviews and prices; these variables are numeric or character. The reviews data holds the reviews given by individuals and contains the variables asin, name, rating, date, verified, title, body and helpfulVotes; these variables are numeric, character or logical.
The two data sets were merged with a full join, which keeps every observation from both tables; this matters because the reviews data contains more observations than the items data. The join key is asin, a variable common to both data sets.
items <- read_csv("20190928-items.csv")
Parsed with column specification:
cols(
  asin = col_character(),
  brand = col_character(),
  title = col_character(),
  url = col_character(),
  image = col_character(),
  rating = col_double(),
  reviewUrl = col_character(),
  totalReviews = col_double(),
  prices = col_character()
)
head(items)
reviews <- read_csv("20190928-reviews.csv")
Parsed with column specification:
cols(
  asin = col_character(),
  name = col_character(),
  rating = col_double(),
  date = col_character(),
  verified = col_logical(),
  title = col_character(),
  body = col_character(),
  helpfulVotes = col_double()
)
head(reviews)
new_dataset <- full_join(items,reviews,by = "asin")
head(new_dataset)
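The merged data contains columns such as rating.x and title.y because both input tables share the column names rating and title. The following is a minimal sketch on made-up toy tables (the values and the objects toy_items, toy_reviews and toy_joined are hypothetical, not taken from the Kaggle files) showing how full_join keeps unmatched rows from both sides and adds the .x and .y suffixes:

```r
library(dplyr)

# Hypothetical toy tables that share the columns asin, title and rating
toy_items <- tibble(
  asin   = c("A1", "A2"),
  title  = c("Phone A", "Phone B"),
  rating = c(4.1, 3.5)
)
toy_reviews <- tibble(
  asin   = c("A1", "A1", "A3"),
  title  = c("Great", "Okay", "Bad"),
  rating = c(5, 3, 1)
)

# Full join on asin: rows without a match on either side are kept with NA
toy_joined <- full_join(toy_items, toy_reviews, by = "asin")

names(toy_joined)  # shared non-key columns get the suffixes .x and .y
nrow(toy_joined)   # two rows for A1, one for A2, one for A3
```

Columns suffixed .x come from the first table (the items) and columns suffixed .y from the second (the reviews).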
To understand the data, the class and dimensions of the merged data are checked first. The rating.y variable, which is the rating a user gave to a mobile phone, is then converted to an ordered factor with levels 1, 2, 3, 4 and 5.
class(new_dataset)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dim(new_dataset)
[1] 82815 16
names(new_dataset)
[1] "asin" "brand" "title.x" "url" "image" "rating.x" "reviewUrl" "totalReviews"
[9] "prices" "name" "rating.y" "date" "verified" "title.y" "body" "helpfulVotes"
sapply(new_dataset,class)
asin brand title.x url image rating.x reviewUrl totalReviews prices
"character" "character" "character" "character" "character" "numeric" "character" "numeric" "character"
name rating.y date verified title.y body helpfulVotes
"character" "numeric" "character" "logical" "character" "character" "numeric"
new_dataset$rating.y <- factor(new_dataset$rating.y,levels = c(1,2,3,4,5), ordered = TRUE)
sapply(new_dataset,class)
$asin
[1] "character"
$brand
[1] "character"
$title.x
[1] "character"
$url
[1] "character"
$image
[1] "character"
$rating.x
[1] "numeric"
$reviewUrl
[1] "character"
$totalReviews
[1] "numeric"
$prices
[1] "character"
$name
[1] "character"
$rating.y
[1] "ordered" "factor"
$date
[1] "character"
$verified
[1] "logical"
$title.y
[1] "character"
$body
[1] "character"
$helpfulVotes
[1] "numeric"
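As a small illustration of what the conversion changes, the sketch below uses made-up values (toy_ratings and toy_df are hypothetical names). It shows that an ordered factor supports order comparisons, and that sapply returns a list rather than a character vector once one column carries two classes, which is why the second class listing above is printed differently from the first:

```r
# Hypothetical ratings, converted the same way as rating.y
toy_ratings <- factor(c(5, 1, 3, 5), levels = c(1, 2, 3, 4, 5), ordered = TRUE)

# Comparisons respect the level order 1 < 2 < 3 < 4 < 5
high <- toy_ratings >= "4"

# A data frame with one numeric and one ordered-factor column
toy_df <- data.frame(score = c(4.1, 3.5, 4.8, 2.0), rating = toy_ratings)

# The ordered factor has two classes, so sapply cannot simplify to a vector
classes <- sapply(toy_df, class)
classes
```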
The columns rating.x, rating.y, title.x and title.y were renamed to item rating, review rating, item name and review header for better understanding. The totalReviews variable was removed because it is redundant: the merged table already includes all the reviews. The prices variable was separated into two new variables, low_variant_price and high_variant_price, since it contained both prices. The '$' character was removed from prices before separating, and the new variables were then converted from character to numeric.
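The code for this step is not shown in the report. The following is a minimal sketch of how it could be done on a toy table, not the author's actual code. It assumes that prices holds either one price or two comma-separated prices (for example "$49.95,$199.99"); the separator, the toy values and the object names toy and cleaned are assumptions.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for a few columns of the merged data
toy <- tibble(
  asin         = c("A1", "A2", "A3"),
  title.x      = c("Phone A", "Phone B", "Phone C"),
  rating.x     = c(4.1, 3.5, 4.8),
  totalReviews = c(10, 20, 30),
  prices       = c("$49.95,$199.99", "$99.00", NA),
  rating.y     = c(5, 3, 4),
  title.y      = c("Great", "Okay", "Good")
)

cleaned <- toy %>%
  # Give the suffixed columns descriptive names
  rename(`item name` = title.x, `item rating` = rating.x,
         `review rating` = rating.y, `review header` = title.y) %>%
  # Drop the redundant review count
  select(-totalReviews) %>%
  # Strip the dollar sign, then split on the assumed comma separator;
  # rows with a single price get NA in high_variant_price
  mutate(prices = str_remove_all(prices, fixed("$"))) %>%
  separate(prices, into = c("low_variant_price", "high_variant_price"),
           sep = ",", fill = "right") %>%
  # Convert both price columns from character to numeric
  mutate(low_variant_price = as.numeric(low_variant_price),
         high_variant_price = as.numeric(high_variant_price))

cleaned
```

Rows with only one price end up with a missing high_variant_price, which is consistent with that column having the most missing values in the check that follows.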
A new variable, overall_rating, was created with the mutate function. It expresses the item's rating as a percentage, excluding the reviewed rating.
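The report does not show the formula behind overall_rating, so the sketch below is only one possible reading and should not be taken as the author's calculation. It assumes the item rating is on a 5-star scale and expresses it as a percentage of the maximum; the toy values and the object name toy_pct are hypothetical.

```r
library(dplyr)

# Hypothetical item ratings on an assumed 1 to 5 scale
toy_pct <- tibble(`item rating` = c(4.1, 3.5, 5.0))

# Assumed formula: item rating as a percentage of the 5-star maximum
toy_pct <- toy_pct %>%
  mutate(overall_rating = `item rating` / 5 * 100)

toy_pct
```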
The data provided does not contain any special values, but merging the tables may introduce missing values. This is checked in the first step below. Missing values in the name column, which comes from the reviews data, were replaced with the constant "No name" because a text field has no meaningful numeric summary. Similarly, missing values in the review header and body columns were replaced with the constants "No title" and "No review body". Missing values in helpfulVotes, low_variant_price and high_variant_price were replaced with the column means because these variables are numeric (votes and prices). The mean was preferred over the mode because imputing the mode could distort the distribution: a low-priced phone could be assigned a price well above its actual price.
colSums(is.na(new_dataset))
asin brand item name url image item rating
0 0 0 0 0 0
reviewUrl low_variant_price high_variant_price name review rating date
0 35141 72235 2 0 0
verified review header body helpfulVotes overall_rating
0 3 16 49681 0
new_dataset$name[is.na(new_dataset$name)] <- "No name"
new_dataset$`review header`[is.na(new_dataset$`review header`)] <- "No title"
new_dataset$body[is.na(new_dataset$body)] <- "No review body"
new_dataset$helpfulVotes <- impute(new_dataset$helpfulVotes, fun = mean)
new_dataset$low_variant_price <- impute(new_dataset$low_variant_price, fun = mean)
new_dataset$high_variant_price <- impute(new_dataset$high_variant_price, fun = mean)
colSums(is.na(new_dataset))
asin brand item name url image item rating
0 0 0 0 0 0
reviewUrl low_variant_price high_variant_price name review rating date
0 0 0 0 0 0
verified review header body helpfulVotes overall_rating
0 0 0 0 0
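Hmisc's impute returns a vector of class "impute" rather than a plain numeric vector. If a plain numeric result is preferred, mean imputation can also be written in base R. The following is a minimal sketch on made-up values; impute_mean is a hypothetical helper and not part of Hmisc or of the report's code.

```r
# Hypothetical helper: replace NA with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy_prices <- c(100, NA, 300, NA)
filled <- impute_mean(toy_prices)

filled         # the two NA values become the mean of 100 and 300
class(filled)  # stays "numeric"
```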
The first step checks the classes of all the variables. The item rating and overall_rating variables contained outliers, which were corrected using the capping (Winsorising) technique because these values were not data entry errors. Moreover, they derive from ratings given by individuals, so replacing them with the mean would erase genuinely extreme opinions (strong like or dislike). It is therefore more logical to pull each outlier back to a less extreme value: the cap function replaces values beyond the 1.5 × IQR fences with the 5th or 95th percentile.
sapply(new_dataset,class)
$asin
[1] "character"
$brand
[1] "character"
$`item name`
[1] "character"
$url
[1] "character"
$image
[1] "character"
$`item rating`
[1] "numeric"
$reviewUrl
[1] "character"
$low_variant_price
[1] "impute"
$high_variant_price
[1] "impute"
$name
[1] "character"
$`review rating`
[1] "ordered" "factor"
$date
[1] "character"
$verified
[1] "logical"
$`review header`
[1] "character"
$body
[1] "character"
$helpfulVotes
[1] "impute"
$overall_rating
[1] "numeric"
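# Cap (Winsorise) outliers: values beyond the 1.5 * IQR fences are
# replaced by the 5th or 95th percentile
cap <- function(x){
  quantiles <- quantile(x, c(.05, .25, .75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]  # low outliers
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]  # high outliers
  x
}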
new_dataset$`item rating` %>% boxplot(main="Box Plot of item rating", ylab="item rating", col = "grey")
item_rating_capped <- new_dataset$`item rating` %>% cap()
item_rating_capped %>% boxplot(main="Box Plot of capped item rating", ylab="item rating", col = "grey")
new_dataset$overall_rating %>% boxplot(main="Box Plot of overall rating", ylab="overall rating", col = "grey")
overall_rating_capped <- new_dataset$overall_rating %>% cap()
overall_rating_capped %>% boxplot(main="Box Plot of capped overall rating", ylab="overall rating", col = "grey")
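To make the effect of cap() concrete, the sketch below applies a copy of the same logic to a small made-up vector with one obvious outlier. The vector toy_x and the name cap_demo are hypothetical; only the function body mirrors the report.

```r
# Copy of the capping logic used above, under a separate name
cap_demo <- function(x) {
  quantiles <- quantile(x, c(.05, .25, .75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Hypothetical data: nine ordinary values and one extreme value
toy_x <- c(10, 11, 12, 13, 14, 15, 16, 17, 18, 100)
toy_capped <- cap_demo(toy_x)

# Only the extreme value changes; it is set to the 95th percentile of toy_x
toy_capped
```

Note that the replacement value is a percentile of the original data, so with a very extreme outlier it can still lie above the upper fence; the function limits the outlier rather than moving it to the nearest non-outlying observation.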
The low_variant_price variable was chosen for transformation because its distribution is right-skewed, as the histogram shows. The variable was transformed using a square root transformation. In the first step, a histogram of the variable is plotted to check the skewness. The second step applies the sqrt function to the variable, and the final step plots a histogram of the transformed variable.
hist(new_dataset$low_variant_price, main = "Histogram of low_variant_price", xlab = "Price", col = "limegreen")
sqrt_lvp <- sqrt(new_dataset$low_variant_price)
hist(sqrt_lvp, main = "Histogram of square root low_variant_price", xlab = "Square root of price", col = "limegreen")
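The histograms give a visual check; a numeric check is also possible. The following is a minimal sketch on simulated right-skewed data rather than the real prices, assuming the moment-based sample skewness as the measure. The helper skewness_demo and the simulated vector toy_price are hypothetical.

```r
# Hypothetical helper: moment-based sample skewness
skewness_demo <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

# Simulated right-skewed "prices" (exponential), not the Amazon data
set.seed(1)
toy_price <- rexp(1000, rate = 1 / 200)

skew_before <- skewness_demo(toy_price)
skew_after  <- skewness_demo(sqrt(toy_price))

# The square root compresses large values, so the skewness drops
c(before = skew_before, after = skew_after)
```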