Main Body
Packages
The packages used are readr, dplyr, caret, ggplot2, olsrr, reshape2, and BBmisc.
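They are loaded up front:
#Load packages ----
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(olsrr)
library(reshape2)
library(BBmisc)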
Importing the .csv
I worked with two .csv files: historical sales data and new product data.
existing_products <- read.csv("existingproductattributes2017.csv")
newproductattributes <- read.csv("newproductattributes2017.csv")
1st exploration
glimpse(existing_products)
## Observations: 80
## Variables: 18
## $ ProductType <fct> PC, PC, PC, Laptop, Laptop, Accessories, A…
## $ ProductNum <int> 101, 102, 103, 104, 105, 106, 107, 108, 10…
## $ Price <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99, …
## $ x5StarReviews <int> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 7…
## $ x4StarReviews <int> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8…
## $ x3StarReviews <int> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, …
## $ x2StarReviews <int> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, …
## $ x1StarReviews <int> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, …
## $ PositiveServiceReview <int> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44…
## $ NegativeServiceReview <int> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3,…
## $ Recommendproduct <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.…
## $ BestSellersRank <int> 1967, 4806, 12076, 109, 268, 64, NA, 2, NA…
## $ ShippingWeight <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7.3…
## $ ProductDepth <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80, 6…
## $ ProductWidth <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10.30…
## $ ProductHeight <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 11.…
## $ ProfitMargin <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05, …
## $ Volume <int> 12, 8, 12, 196, 232, 332, 44, 132, 64, 40,…
glimpse(newproductattributes)
## Observations: 24
## Variables: 18
## $ ProductType <fct> PC, PC, Laptop, Laptop, Laptop, Netbook, N…
## $ ProductNum <int> 171, 172, 173, 175, 176, 178, 180, 181, 18…
## $ Price <dbl> 699.00, 860.00, 1199.00, 1199.00, 1999.00,…
## $ x5StarReviews <int> 96, 51, 74, 7, 1, 19, 312, 23, 3, 296, 943…
## $ x4StarReviews <int> 26, 11, 10, 2, 1, 8, 112, 18, 4, 66, 437, …
## $ x3StarReviews <int> 14, 10, 3, 1, 1, 4, 28, 7, 0, 30, 224, 12,…
## $ x2StarReviews <int> 14, 10, 3, 1, 3, 1, 31, 22, 1, 21, 160, 16…
## $ x1StarReviews <int> 25, 21, 11, 1, 0, 10, 47, 18, 0, 36, 247, …
## $ PositiveServiceReview <int> 12, 7, 11, 2, 0, 2, 28, 5, 1, 28, 90, 8, 1…
## $ NegativeServiceReview <int> 3, 5, 5, 1, 1, 4, 16, 16, 0, 9, 23, 6, 6, …
## $ Recommendproduct <dbl> 0.7, 0.6, 0.8, 0.6, 0.3, 0.6, 0.7, 0.4, 0.…
## $ BestSellersRank <int> 2498, 490, 111, 4446, 2820, 4140, 2699, 17…
## $ ShippingWeight <dbl> 19.90, 27.00, 6.60, 13.00, 11.60, 5.80, 4.…
## $ ProductDepth <dbl> 20.63, 21.89, 8.94, 16.30, 16.81, 8.43, 10…
## $ ProductWidth <dbl> 19.25, 27.01, 12.80, 10.80, 10.90, 11.42, …
## $ ProductHeight <dbl> 8.39, 9.13, 0.68, 1.40, 0.88, 1.20, 0.95, …
## $ ProfitMargin <dbl> 0.25, 0.20, 0.10, 0.15, 0.23, 0.08, 0.09, …
## $ Volume <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Pre-process
This step includes dummifying the data, removing the BestSellersRank column, and cleaning outliers from Volume.
Dummify
At this point, I had to dummify the data. That means removing the ProductType column and creating one binary column per product type, with a 1 if the product belongs to that type and a 0 otherwise.
After this process, the data set looks like this:
#Dummify the data -----
newDataFrame <- dummyVars(" ~ .", data = existing_products)
readyData <- data.frame(predict(newDataFrame, newdata = existing_products))
#Explore
glimpse(readyData)
## Observations: 80
## Variables: 29
## $ ProductType.Accessories <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ ProductType.Display <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.ExtendedWarranty <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.GameConsole <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Laptop <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Netbook <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.PC <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Printer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.PrinterSupplies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Smartphone <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Software <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Tablet <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductNum <dbl> 101, 102, 103, 104, 105, 106, 107, …
## $ Price <dbl> 949.00, 2249.99, 399.00, 409.99, 10…
## $ x5StarReviews <dbl> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10…
## $ x4StarReviews <dbl> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2…
## $ x3StarReviews <dbl> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2,…
## $ x2StarReviews <dbl> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3,…
## $ x1StarReviews <dbl> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15,…
## $ PositiveServiceReview <dbl> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9…
## $ NegativeServiceReview <dbl> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2…
## $ Recommendproduct <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, …
## $ BestSellersRank <dbl> 1967, 4806, 12076, 109, 268, 64, NA…
## $ ShippingWeight <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.…
## $ ProductDepth <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, …
## $ ProductWidth <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00…
## $ ProductHeight <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.…
## $ ProfitMargin <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05,…
## $ Volume <dbl> 12, 8, 12, 196, 232, 332, 44, 132, …
Delete BestSellersRank
#Deleting feature with NA----
readyData$BestSellersRank <- NULL
Clean Outliers
#Outliers
boxplot(Volume ~ ProductType, data = existing_products)$out
## [1] 11204 0 20 824
outliers <- boxplot(Volume ~ ProductType, data = existing_products)$out
outliers_list <- existing_products[existing_products$Volume %in% outliers,]
readyData_model <- readyData[-which(readyData$Volume %in% outliers),]
Normalize
Finally, we have to normalize some of the features so that they are on a comparable scale.
#Normalize
readyData_model[14:27] <- normalize(readyData_model[14:27], method = "standardize", range = c(0, 1))
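Note that with method = "standardize", BBmisc::normalize centers each column to mean 0 and scales it to unit variance; the range argument is only used with method = "range", so it has no effect here. A quick illustration:
#Standardizing a toy vector: the result has mean 0 and sd 1
normalize(c(1, 2, 3), method = "standardize")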
1st Model
Before doing the feature selection, I created a model to see the importance of each variable for predicting the sales volume.
I split the data into 75% train and 25% test, and set up repeated cross-validation with number = 3 and repeats = 5. The first model was “svmLinear”, considering all the features; a sketch of this setup follows.
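The split and training code is not echoed above; a minimal sketch, reusing the object names that appear later (train, test, ctrl) and an assumed seed:
#Train/test split and repeated cross-validation setup (sketch)
set.seed(123)
in_train <- createDataPartition(readyData_model$Volume, p = 0.75, list = FALSE)
train <- readyData_model[in_train, ]
test <- readyData_model[-in_train, ]
ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
model_svm_1 <- caret::train(Volume ~ ., data = train,
                            method = "svmLinear", trControl = ctrl)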
model_svm_1
## Support Vector Machines with Linear Kernel
##
## 57 samples
## 27 predictors
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times)
## Summary of sample sizes: 39, 37, 38, 37, 39, 38, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 903.1777 0.5358367 444.4644
##
## Tuning parameter 'C' was held constant at a value of 1
Feature engineering
Variable importance
This model does not perform well, but the most important thing to check at this point is the importance of each variable:
imp <- varImp(model_svm_1)
print(imp)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 27)
##
## Overall
## x5StarReviews 100.0000
## x4StarReviews 82.2252
## x3StarReviews 61.5416
## PositiveServiceReview 57.4461
## ProductType.GameConsole 46.3359
## x2StarReviews 32.7700
## ProductNum 30.0723
## NegativeServiceReview 27.2093
## ProductDepth 10.7801
## ShippingWeight 9.5332
## ProductWidth 8.3349
## ProfitMargin 5.1048
## Price 4.6801
## Recommendproduct 3.5220
## ProductHeight 3.0507
## ProductType.ExtendedWarranty 2.6580
## ProductType.Printer 2.6057
## ProductType.PC 1.7962
## ProductType.Netbook 1.0408
## ProductType.Laptop 0.7157
As we can see, x5StarReviews has a perfect correlation with Volume, so we have to take this feature out. ProductType.GameConsole also has a strong correlation, but we need to focus on other product types, so we took it out too.
After reviewing this table, we select only these features:
- x4StarReviews
- x3StarReviews
- PositiveServiceReview
- x2StarReviews
Checking collinearity
The next step is checking for collinearity. Collinearity is a phenomenon in which one feature variable in a regression model is highly linearly correlated with another feature variable. A strong correlation between two features is a problem because the model will confound them when predicting, so we can keep just one of them.
rel_features_1 <- c("x4StarReviews", "x3StarReviews",
"x2StarReviews")
collinearity <- readyData_model[,rel_features_1]
corrData <- cor(collinearity)
melted_correlation <- melt(corrData)
melted_correlation
## Var1 Var2 value
## 1 x4StarReviews x4StarReviews 1.0000000
## 2 x3StarReviews x4StarReviews 0.9202076
## 3 x2StarReviews x4StarReviews 0.6319090
## 4 x4StarReviews x3StarReviews 0.9202076
## 5 x3StarReviews x3StarReviews 1.0000000
## 6 x2StarReviews x3StarReviews 0.8503649
## 7 x4StarReviews x2StarReviews 0.6319090
## 8 x3StarReviews x2StarReviews 0.8503649
## 9 x2StarReviews x2StarReviews 1.0000000
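The melted correlation matrix can also be inspected visually; a minimal heatmap sketch using the packages already loaded:
#Correlation heatmap of the candidate features (sketch)
ggplot(melted_correlation, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile()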
As we can see, there is strong collinearity between 3 stars and 4 stars, and between 3 stars and 2 stars. I decided to remove 3 stars because 4 stars is more strongly correlated with the volume (a quick check is sketched below).
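That correlation check is not echoed above; it could look like this (sketch):
#Correlation of each candidate with Volume (sketch)
cor(readyData_model$x3StarReviews, readyData_model$Volume)
cor(readyData_model$x4StarReviews, readyData_model$Volume)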
With that settled, we select only these features:
- x4StarReviews
- PositiveServiceReview
- x2StarReviews
Cleaning outliers: x4StarReviews, PositiveServiceReview and x2StarReviews
After selecting all the features that we are going to consider in our model, it is time to clean the outliers from these features (the outlier vectors used below are sketched after the list):
- x4StarReviews
- PositiveServiceReview
- x2StarReviews
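The out_* vectors are not defined in the echoed code; a minimal sketch using the same boxplot rule applied elsewhere in this analysis:
#Outlier values for the selected features (sketch)
out_4stars <- boxplot(readyData$x4StarReviews)$out
out_2stars <- boxplot(readyData$x2StarReviews)$out
out_positive <- boxplot(readyData$PositiveServiceReview)$out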
readyData_model_after <- readyData[-which(readyData$x4StarReviews %in% out_4stars),]
readyData_model_after <- readyData_model_after[-which(readyData_model_after$x2StarReviews %in% out_2stars),]
readyData_model_after <- readyData_model_after[-which(readyData_model_after$PositiveServiceReview %in% out_positive),]
Modeling after feature engineering
Now I created a model with only the features selected in the previous step, plus Volume (the dependent variable).
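The data was re-split after the cleaning above (the models below train on 37 samples); a minimal sketch of that step, assuming the same 75/25 partition:
#New train/test split on the cleaned data (sketch)
set.seed(123)
in_train_2 <- createDataPartition(readyData_model_after$Volume, p = 0.75, list = FALSE)
train <- readyData_model_after[in_train_2, ]
test <- readyData_model_after[-in_train_2, ]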
SVM
rel_features_2 <- c("x4StarReviews","PositiveServiceReview",
"x2StarReviews", "Volume")
model_svm_2 <- caret::train(Volume ~.,
data = train[,rel_features_2],
method = "svmLinear",
trControl = ctrl)
model_svm_2
## Support Vector Machines with Linear Kernel
##
## 37 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times)
## Summary of sample sizes: 25, 24, 25, 23, 25, 26, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 567.9654 0.5139188 270.5452
##
## Tuning parameter 'C' was held constant at a value of 1
Results in train
train_results <-predict(model_svm_2, newdata=train)
postResample(train$Volume, train_results)
## RMSE Rsquared MAE
## 377.2857514 0.6814975 176.2551628
Results in test
test_results <- predict(model_svm_2, newdata=test)
postResample(test$Volume, test_results)
## RMSE Rsquared MAE
## 757.8170048 0.7428785 253.8865292
Errors
predictions_svm <- predict(model_svm_2, newdata=test)
p_results_svm <- test
p_results_svm$predictions <- c(predictions_svm)
ggplot(p_results_svm) +
  geom_point(aes(x = Volume, y = predictions, col = "Predictions")) +
  geom_line(aes(x = Volume, y = Volume, col = "Volume"))
Random Forest
random_forest <- caret::train(Volume ~.,
data = train[,rel_features_2],
method = "rf",
trControl = ctrl)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
random_forest
## Random Forest
##
## 37 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times)
## Summary of sample sizes: 25, 24, 25, 26, 24, 24, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 181.9917 0.9207401 97.64999
## 3 163.1099 0.9338889 84.56727
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
Results in train
train_results_rf <-predict(random_forest, newdata=train)
postResample(train$Volume, train_results_rf)
## RMSE Rsquared MAE
## 79.7289479 0.9853145 36.4292024
Results in test
test_results_rf <- predict(random_forest, newdata=test)
postResample(test$Volume, test_results_rf)
## RMSE Rsquared MAE
## 944.7101471 0.5944375 279.4627115
Errors
#Apply the model to the test
predictions_rf <- predict(random_forest, newdata=test)
p_results <- test
p_results$predictions <- c(predictions_rf)
ggplot(p_results) +
  geom_point(aes(x = Volume, y = predictions, col = "Predictions")) +
  geom_line(aes(x = Volume, y = Volume, col = "Volume"))
KNN
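The training call is not echoed in the original output; a minimal sketch, assuming the same features and control object (the object name knn is taken from the predict calls below, and the centering and scaling match the printed pre-processing):
#KNN model (sketch)
knn <- caret::train(Volume ~ .,
                    data = train[,rel_features_2],
                    method = "knn",
                    preProcess = c("center", "scale"),
                    trControl = ctrl)
knn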
## k-Nearest Neighbors
##
## 37 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (3 fold, repeated 5 times)
## Summary of sample sizes: 24, 26, 24, 25, 25, 24, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 322.0185 0.7946666 180.6591
## 7 368.5619 0.7289864 200.0899
## 9 412.1854 0.7250478 244.2898
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Results in train
#Checking PostResample
train_results_knn <-predict(knn, newdata=train)
postResample(train$Volume, train_results_knn)
## RMSE Rsquared MAE
## 205.2140290 0.9117773 94.1750322
Results in test
test_results_knn <- predict(knn, newdata=test)
postResample(test$Volume, test_results_knn)
## RMSE Rsquared MAE
## 1007.8728478 0.4139313 293.6354497
Errors
predictions_knn <- predict(knn, newdata=test)
p_results_knn <- test
p_results_knn$predictions <- c(predictions_knn)
ggplot(p_results_knn) +
  geom_point(aes(x = Volume, y = predictions, col = "Predictions")) +
  geom_line(aes(x = Volume, y = Volume, col = "Volume"))
Performance metrics
Comparing the cross-validated metrics of the three models, the best algorithm in this case is Random Forest:
#Comparison between three models
Classifier <- c('SVM','Random Forest','KNN')
RMSE <- c(567.9654, 163.1099, 322.0185)
Rsquared <- c(0.5139188, 0.9338889, 0.7946666)
metrics <- data.frame(Classifier, RMSE, Rsquared)
metrics
## Classifier RMSE Rsquared
## 1 SVM 567.9654 0.5139188
## 2 Random Forest 163.1099 0.9338889
## 3 KNN 322.0185 0.7946666
Apply the model to new products
newproductattributes <- read.csv("newproductattributes2017.csv")
#Pre-process
newproductattributes[4:10] <- normalize(newproductattributes[4:10], method = "standardize", range = c(0, 1))
#Predictions
predictions_final <- predict(random_forest, newdata=newproductattributes)
finalPred <- newproductattributes
finalPred$predictions <- c(predictions_final)
write.csv(finalPred, file="C2.T3finalPred.csv", row.names = TRUE)
Results
Expected profit is estimated as predicted volume × profit margin × unit price; for example, for product 171: 1325.00 × 0.25 × 699.00 ≈ 231,544.
finalPred$profit <- finalPred$predictions * finalPred$ProfitMargin * finalPred$Price
features_final <- c("PC","Laptop",
"Netbook", "Smartphone")
output <- finalPred[which(finalPred$ProductType %in% features_final),]
output <- output[order(-output$profit),]
output <- output[c(1,2,3,19,20)]
output
## ProductType ProductNum Price predictions profit
## 1 PC 171 699.00 1325.00293 231544.263
## 3 Laptop 173 1199.00 1306.98173 156707.110
## 7 Netbook 180 329.00 1650.91427 48883.571
## 2 PC 172 860.00 224.83360 38671.379
## 5 Laptop 176 1999.00 37.98027 17462.187
## 8 Netbook 181 439.00 251.14120 12127.609
## 13 Smartphone 194 49.00 1421.39160 8357.783
## 15 Smartphone 196 300.00 251.14120 8287.660
## 12 Smartphone 193 199.00 334.03707 7312.071
## 4 Laptop 175 1199.00 37.98027 6830.751
## 14 Smartphone 195 149.00 206.86373 4623.404
## 6 Netbook 178 399.99 40.73093 1303.357
## 9 Netbook 183 330.00 37.98027 1128.014
write.csv(output, file="output_1.csv", row.names = TRUE)
Impact of service reviews
I was asked to assess the impact of service reviews on sales of different product types. To do that, I had to consider the two service review features, the volume, and the product types.
Remove outliers
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. We have to remove the outliers because, if we include them, the fitted function will be pulled in their direction and will no longer represent what is really happening with the rest of the data.
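Throughout this section, outliers are flagged with R's boxplot, which essentially applies the 1.5 × IQR rule (boxplot uses Tukey's hinges, so results can differ slightly from quantile()). A sketch of the same rule by hand:
#The 1.5 * IQR outlier rule applied manually (sketch)
q <- quantile(existing_products$Volume, c(0.25, 0.75))
iqr_range <- diff(q)
out_manual <- existing_products$Volume[existing_products$Volume < q[1] - 1.5 * iqr_range |
                                       existing_products$Volume > q[2] + 1.5 * iqr_range]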
From Volume
out_volume <- existing_products[which(existing_products$Volume %in% boxplot(existing_products$Volume)$out),]
p <- existing_products[-which(existing_products$Volume %in% boxplot(existing_products$Volume)$out),]
From PositiveServiceReview
#Remove the outliers from PositiveServiceReview
out_positive <- p[which(p$PositiveServiceReview %in% boxplot(p$PositiveServiceReview)$out),]
p <- p[-which(p$PositiveServiceReview %in% boxplot(p$PositiveServiceReview)$out),]
From NegativeServiceReview
out_negative <- p[which(p$NegativeServiceReview %in% boxplot(p$NegativeServiceReview)$out),]
p <- p[-which(p$NegativeServiceReview %in% boxplot(p$NegativeServiceReview)$out),]
Outliers
outliers_serviceReviews <- rbind(out_volume, out_positive, out_negative)
## ProductType ProductNum PositiveReview NegativeReview Volume
## 50 Accessories 150 536 22 11204
## 73 GameConsole 198 56 13 7036
## 18 Accessories 118 310 6 680
## 23 Software 123 144 112 2052
## 34 ExtendedWarranty 134 280 8 1232
## 35 ExtendedWarranty 135 280 8 1232
## 36 ExtendedWarranty 136 280 8 1232
## 37 ExtendedWarranty 137 280 8 1232
## 38 ExtendedWarranty 138 280 8 1232
## 39 ExtendedWarranty 139 280 8 1232
## 40 ExtendedWarranty 140 280 8 1232
## 41 ExtendedWarranty 141 280 8 1232
## 48 Accessories 148 120 15 2140
## 53 Accessories 153 80 2 1896
## 4 Laptop 104 7 8 196
## 5 Laptop 105 7 20 232
## 22 Software 122 55 38 1576
## 26 Display 126 42 12 1224
## 65 Printer 165 4 7 80
## 67 Printer 167 42 50 824
## 69 Printer 169 8 13 396
## 80 GameConsole 200 29 14 1684
Plot the relation between Services Reviews and Volume
Considering only 4 Product Types
#Create a new column with the sum of PositiveServiceReview and NegativeServiceReview in order to plot both in one graphic
p$ServicesReviews <- p$PositiveServiceReview + p$NegativeServiceReview
#We want to check the impact on four product types only, so group them
types <- c("Laptop", "Netbook", "PC", "Smartphone")
#Now it's time to create a data set with only these four product types
p1 <- p[which(p$ProductType %in% types),]
#Plot
ggplot(p1, aes(x = ServicesReviews, y = Volume)) +
  geom_point(aes(x = PositiveServiceReview, y = Volume, col = "Positive")) +
  geom_point(aes(x = NegativeServiceReview, y = Volume, col = "Negative")) +
  facet_grid(ProductType ~ .)
Considering all the Product Types
As we can see in the previous graphic, there is not enough data to draw a conclusion about a relationship between the service reviews, volume, and product type. So, we have to look for a relationship between service reviews and volume only.
serv_reviews <- ggplot(p, aes(x = ServicesReviews, y = Volume)) +
  geom_point(aes(x = PositiveServiceReview, y = Volume, col = "Positive")) +
  geom_point(aes(x = NegativeServiceReview, y = Volume, col = "Negative")) +
  geom_smooth(aes(x = NegativeServiceReview, y = Volume, col = "Negative")) +
  geom_smooth(aes(x = PositiveServiceReview, y = Volume, col = "Positive"))
serv_reviews
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Impact of customer reviews
If we consider only the 4 product types (PC, Laptop, Netbook, and Smartphone), we have only 13 products.
s <- existing_products
s1 <- s[which(s$ProductType %in% types),]
summary(s1$ProductType)
## Accessories Display ExtendedWarranty GameConsole
## 0 0 0 0
## Laptop Netbook PC Printer
## 3 2 4 0
## PrinterSupplies Smartphone Software Tablet
## 0 4 0 0
Considering only 4 product types
If we try to plot, we get warnings because we do not have enough data:
ggplot(s1[which(s1$ProductType %in% c("PC", "Laptop", "Netbook", "Smartphone")),],
aes(x= total_stars, y=Volume)) +
geom_point(aes(x=x1StarReviews, y= Volume, col= "1x")) +
geom_point(aes(x=x2StarReviews, y=Volume, col="2x")) +
geom_point(aes(x=x3StarReviews, y=Volume, col="3x")) +
geom_point(aes(x=x4StarReviews, y=Volume, col="4x")) +
geom_point(aes(x=x5StarReviews, y=Volume, col="5x"))
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_point).
Considering all the product types
As we can see in the previous graphic, there is not enough data to draw a conclusion about a relationship between the customer reviews, volume, and product type. So, we have to look for a relationship between customer reviews and volume only.
Remove the outliers
Remove from Volume
s3 <- existing_products[-which(existing_products$Volume %in% boxplot(existing_products$Volume)$out),]
Remove from x5StarReviews
There are no outliers in x5StarReviews; a quick check is sketched below.
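#Confirming that the boxplot rule flags nothing for x5StarReviews (sketch)
boxplot(s3$x5StarReviews)$out  #returns numeric(0) when there are no outliers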
Remove from x4StarReviews
s3 <- s3[-which(s3$x4StarReviews %in%
boxplot(s3$x4StarReviews)$out),]
Remove from x3StarReviews
s3 <- s3[-which(s3$x3StarReviews %in%
boxplot(s3$x3StarReviews)$out),]
Remove from x2StarReviews
s3 <- s3[-which(s3$x2StarReviews %in%
boxplot(s3$x2StarReviews)$out),]
Remove from x1StarReviews
s3 <- s3[-which(s3$x1StarReviews %in%
boxplot(s3$x1StarReviews)$out),]
Plot the relationship
s3$total_stars <- s3$x1StarReviews + s3$x2StarReviews + s3$x3StarReviews +
s3$x4StarReviews + s3$x5StarReviews
customers_reviews <- ggplot(s3, aes(x=total_stars,y=Volume)) +
geom_point(aes(x=x1StarReviews, y= Volume, col= "1x")) +
geom_point(aes(x=x2StarReviews, y=Volume, col="2x")) +
geom_point(aes(x=x3StarReviews, y=Volume, col="3x")) +
geom_point(aes(x=x4StarReviews, y=Volume, col="4x")) +
geom_point(aes(x=x5StarReviews, y=Volume, col="5x")) +
geom_smooth(aes(x=x1StarReviews, y=Volume, col="1x")) +
geom_smooth(aes(x=x2StarReviews, y=Volume, col="2x")) +
geom_smooth(aes(x=x3StarReviews, y=Volume, col="3x")) +
geom_smooth(aes(x=x4StarReviews, y=Volume, col="4x")) +
geom_smooth(aes(x=x5StarReviews, y=Volume, col="5x"))
customers_reviews
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'