Summary

Purpose and Results

The goal of this project is to predict sales of four different product types: PC, Laptops, Netbooks and Smartphones and assess the impact services reviews and customer reviews have on sales of different product types.

Predicting sales

Using the historical sales data, I made sales volume predictions for the different product types:

output_final <- read.csv("output_1.csv") 

output_final

##     X ProductType ProductNum   Price predictions     profit
## 1   1          PC        171  699.00  1325.00293 231544.263
## 2   3      Laptop        173 1199.00  1306.98173 156707.110
## 3   7     Netbook        180  329.00  1650.91427  48883.571
## 4   2          PC        172  860.00   224.83360  38671.379
## 5   5      Laptop        176 1999.00    37.98027  17462.187
## 6   8     Netbook        181  439.00   251.14120  12127.609
## 7  13  Smartphone        194   49.00  1421.39160   8357.783
## 8  15  Smartphone        196  300.00   251.14120   8287.660
## 9  12  Smartphone        193  199.00   334.03707   7312.071
## 10  4      Laptop        175 1199.00    37.98027   6830.751
## 11 14  Smartphone        195  149.00   206.86373   4623.404
## 12  6     Netbook        178  399.99    40.73093   1303.357
## 13  9     Netbook        183  330.00    37.98027   1128.014

The most profiable product is the PC (Product Number 171), followed by a Laptop (Prooduct Number 173).

Grouping the results into Product Type groups:

group <- aggregate(output_final$profit, by=list(ProductType=output_final$ProductType), FUN=sum)

group <- group[order(-group$x),] 

group

##   ProductType         x
## 3          PC 270215.64
## 1      Laptop 181000.05
## 2     Netbook  63442.55
## 4  Smartphone  28580.92

The most profiable Product Type is the PC, followed by a Laptop.

Impact services reviews

I didn’t have enough data to figure out a clear relationship between Service Reviews (Positive and Negative) with the Volume of sales according to each type of product. To do that, I need more products and more data about this products.

Nevertheless I found a relation relationship between Service Reviews (Positive and Negative) with the Volume considering all the types of products:

The Volume of sales increase until it has around 40 Positive Reviews.

Impact customer reviews

I need more data to check for a clear relationship between Customer Reviews (5x,4x,3x,2x,1x) with the Volume of sales according to each type of product. Considering all the types of product:

The Volume of sales start to increase less when we have more than 40 4xStars Reviews.

Key points

Relevant Features

The features that have a strong correlation with Volume of sales are the followings:

x4StarReviews
PositiveServiceReview
x2StarReviews

Performance metrics from other individual classifiers

The metrics from the classifiers SVM, Random Forest (the best algorithm for this case), and KNN are the followings:

#Comparison between three models 
Classifier <- c('SVM','Random Forest','KNN')
RMSE <- c(567.9654, 163.1099, 322.0185)
Rsquared <- c(0.5139188, 0.9338889, 0.7946666)
metrics <- data.frame(Classifier, RMSE, Rsquared)
metrics

##      Classifier     RMSE  Rsquared
## 1           SVM 567.9654 0.5139188
## 2 Random Forest 163.1099 0.9338889
## 3           KNN 322.0185 0.7946666

Main Body

Packages

The packages used are readr, dplyr, caret, ggplot2, olsrr, reshape2, and BBmisc.

Importing the .csv

I worked with two .cvs: historical sales data and new product data.

existing_products <- read.csv("existingproductattributes2017.csv") 
newproductattributes <- read.csv("newproductattributes2017.csv")

1st exploration

glimpse(existing_products)

glimpse(existing_products)

## Observations: 80
## Variables: 18
## $ ProductType           <fct> PC, PC, PC, Laptop, Laptop, Accessories, A…
## $ ProductNum            <int> 101, 102, 103, 104, 105, 106, 107, 108, 10…
## $ Price                 <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99, …
## $ x5StarReviews         <int> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 7…
## $ x4StarReviews         <int> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8…
## $ x3StarReviews         <int> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, …
## $ x2StarReviews         <int> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, …
## $ x1StarReviews         <int> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, …
## $ PositiveServiceReview <int> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44…
## $ NegativeServiceReview <int> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3,…
## $ Recommendproduct      <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.…
## $ BestSellersRank       <int> 1967, 4806, 12076, 109, 268, 64, NA, 2, NA…
## $ ShippingWeight        <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7.3…
## $ ProductDepth          <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80, 6…
## $ ProductWidth          <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10.30…
## $ ProductHeight         <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 11.…
## $ ProfitMargin          <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05, …
## $ Volume                <int> 12, 8, 12, 196, 232, 332, 44, 132, 64, 40,…

glimpse(newproductattributes)

glimpse(newproductattributes)

## Observations: 24
## Variables: 18
## $ ProductType           <fct> PC, PC, Laptop, Laptop, Laptop, Netbook, N…
## $ ProductNum            <int> 171, 172, 173, 175, 176, 178, 180, 181, 18…
## $ Price                 <dbl> 699.00, 860.00, 1199.00, 1199.00, 1999.00,…
## $ x5StarReviews         <int> 96, 51, 74, 7, 1, 19, 312, 23, 3, 296, 943…
## $ x4StarReviews         <int> 26, 11, 10, 2, 1, 8, 112, 18, 4, 66, 437, …
## $ x3StarReviews         <int> 14, 10, 3, 1, 1, 4, 28, 7, 0, 30, 224, 12,…
## $ x2StarReviews         <int> 14, 10, 3, 1, 3, 1, 31, 22, 1, 21, 160, 16…
## $ x1StarReviews         <int> 25, 21, 11, 1, 0, 10, 47, 18, 0, 36, 247, …
## $ PositiveServiceReview <int> 12, 7, 11, 2, 0, 2, 28, 5, 1, 28, 90, 8, 1…
## $ NegativeServiceReview <int> 3, 5, 5, 1, 1, 4, 16, 16, 0, 9, 23, 6, 6, …
## $ Recommendproduct      <dbl> 0.7, 0.6, 0.8, 0.6, 0.3, 0.6, 0.7, 0.4, 0.…
## $ BestSellersRank       <int> 2498, 490, 111, 4446, 2820, 4140, 2699, 17…
## $ ShippingWeight        <dbl> 19.90, 27.00, 6.60, 13.00, 11.60, 5.80, 4.…
## $ ProductDepth          <dbl> 20.63, 21.89, 8.94, 16.30, 16.81, 8.43, 10…
## $ ProductWidth          <dbl> 19.25, 27.01, 12.80, 10.80, 10.90, 11.42, …
## $ ProductHeight         <dbl> 8.39, 9.13, 0.68, 1.40, 0.88, 1.20, 0.95, …
## $ ProfitMargin          <dbl> 0.25, 0.20, 0.10, 0.15, 0.23, 0.08, 0.09, …
## $ Volume                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Pre-process

This process includes Dummify the data, remove column BestSellerRank, cleaning outliers from Volume.

Dummify

At this point, I had to dummify the data. That means remove the column Product Type and create as much as columns as types of data are, and put 1 if the product corresponds to these type of data, and put a 0 if its not.

After this process the data, we have this data set:

#Dummify the data -----
newDataFrame <- dummyVars(" ~ .", data = existing_products)
readyData <- data.frame(predict(newDataFrame, newdata = existing_products))

#Explore
glimpse(readyData)

## Observations: 80
## Variables: 29
## $ ProductType.Accessories      <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ ProductType.Display          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.ExtendedWarranty <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.GameConsole      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Laptop           <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Netbook          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.PC               <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Printer          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.PrinterSupplies  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Smartphone       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Software         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductType.Tablet           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ProductNum                   <dbl> 101, 102, 103, 104, 105, 106, 107, …
## $ Price                        <dbl> 949.00, 2249.99, 399.00, 409.99, 10…
## $ x5StarReviews                <dbl> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10…
## $ x4StarReviews                <dbl> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2…
## $ x3StarReviews                <dbl> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2,…
## $ x2StarReviews                <dbl> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3,…
## $ x1StarReviews                <dbl> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15,…
## $ PositiveServiceReview        <dbl> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9…
## $ NegativeServiceReview        <dbl> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2…
## $ Recommendproduct             <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, …
## $ BestSellersRank              <dbl> 1967, 4806, 12076, 109, 268, 64, NA…
## $ ShippingWeight               <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.…
## $ ProductDepth                 <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, …
## $ ProductWidth                 <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00…
## $ ProductHeight                <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.…
## $ ProfitMargin                 <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05,…
## $ Volume                       <dbl> 12, 8, 12, 196, 232, 332, 44, 132, …

Delete BestSeller

#Deleting feature with NA----
readyData$BestSellersRank <- NULL

Clean Outliers

#Outliers
boxplot(Volume~ProductType,data=existing_products)$out

## [1] 11204     0    20   824

outliers <- boxplot(Volume~ProductType,data=existing_products)$out

outliers_list <- existing_products[existing_products$Volume %in% 
                                boxplot(Volume~ProductType,data=existing_products)
                              $out, ]
readyData_model <- readyData[-which(readyData$Volume %in% outliers),]

Normalize

Finally, we have to normalize some features to compare between them

#Normalize 

readyData_model[14:27] <-normalize(readyData_model[14:27], method = "standardize", range = c(0, 1))

1st Model

Before doing the feature selection I created a Model to see the importance of each variable to predict the volume of sales.

I divided the data in 75% train, 25% test, and set a Cross validation with number = 3, and repeats = 5. And the first model was “svmLinear” considering all the features.

model_svm_1

## Support Vector Machines with Linear Kernel 
## 
## 57 samples
## 27 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times) 
## Summary of sample sizes: 39, 37, 38, 37, 39, 38, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   903.1777  0.5358367  444.4644
## 
## Tuning parameter 'C' was held constant at a value of 1

Feature engineering

Variable importance

This model has no good results, but checking the importance of each variable is the most important thing to check at this point:

imp <- varImp(model_svm_1)
print(imp)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 27)
## 
##                               Overall
## x5StarReviews                100.0000
## x4StarReviews                 82.2252
## x3StarReviews                 61.5416
## PositiveServiceReview         57.4461
## ProductType.GameConsole       46.3359
## x2StarReviews                 32.7700
## ProductNum                    30.0723
## NegativeServiceReview         27.2093
## ProductDepth                  10.7801
## ShippingWeight                 9.5332
## ProductWidth                   8.3349
## ProfitMargin                   5.1048
## Price                          4.6801
## Recommendproduct               3.5220
## ProductHeight                  3.0507
## ProductType.ExtendedWarranty   2.6580
## ProductType.Printer            2.6057
## ProductType.PC                 1.7962
## ProductType.Netbook            1.0408
## ProductType.Laptop             0.7157

As we can see, we have a perfect correlation with x5StarReviews. So, we have to take this feature out. ProductType.GameConsole also has a strong correlation, but we need to focus on other type of products, so we took this out too.

After looking this table, we only select these features:

x4StarReviews
x3StarReviews
PositiveServiceReview
x2StarReviews

Checking collinearity

The next step is checking the collinearity.Collinearity is a phenomenon in which one feature variable in a regression model is highly linearly correlated with another feature variable. If we have a strong correlation between two variables this is a problem because our model is going to confuse them trying to predict. We can choose one of them.

rel_features_1 <- c("x4StarReviews", "x3StarReviews", 
                  "x2StarReviews")
collinearity <- readyData_model[,rel_features_1]
corrData <- cor(collinearity)
melted_correlation <- melt(corrData)
melted_correlation

##            Var1          Var2     value
## 1 x4StarReviews x4StarReviews 1.0000000
## 2 x3StarReviews x4StarReviews 0.9202076
## 3 x2StarReviews x4StarReviews 0.6319090
## 4 x4StarReviews x3StarReviews 0.9202076
## 5 x3StarReviews x3StarReviews 1.0000000
## 6 x2StarReviews x3StarReviews 0.8503649
## 7 x4StarReviews x2StarReviews 0.6319090
## 8 x3StarReviews x2StarReviews 0.8503649
## 9 x2StarReviews x2StarReviews 1.0000000

As we can see, there is a strong collinearity between 3 stars and 4 stars, and between 3 stars and 2 stars. I decided to remove 3 stars because 4 stars has more correlation with the volume.

After looking this table, we only select these features: * x4StarReviews
* PositiveServiceReview * x2StarReviews

Cleaning outliers: x4StarReviews, PositiveServiceReview and x2StarReviews

After select all the features that we are going to consider in our model it is time to clean the outliers from this features:

x4StarReviews
PositiveServiceReview
x2StarReviews

readyData_model_after <- readyData[-which(readyData$x4StarReviews %in% out_4stars),]

readyData_model_after <- readyData_model_after[-which(readyData$x2StarReviews %in% out_2stars),]

readyData_model_after <- readyData_model_after[-which(readyData$PositiveServiceReview %in% out_positive),]

Modeling after feature engineering

Now I created a model only with the futures selected in the previous step, and Volume (dependent variable).

SVM

rel_features_2 <- c("x4StarReviews","PositiveServiceReview", 
                    "x2StarReviews", "Volume")

model_svm_2 <- caret::train(Volume ~.,
                                data = train[,rel_features_2], 
                                method = "svmLinear",
                                trControl = ctrl)


model_svm_2

## Support Vector Machines with Linear Kernel 
## 
## 37 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times) 
## Summary of sample sizes: 25, 24, 25, 23, 25, 26, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   567.9654  0.5139188  270.5452
## 
## Tuning parameter 'C' was held constant at a value of 1

Results in train

train_results <-predict(model_svm_2, newdata=train)
postResample(train$Volume, train_results)

##        RMSE    Rsquared         MAE 
## 377.2857514   0.6814975 176.2551628

Resuls in test

test_results <- predict(model_svm_2, newdata=test)
postResample(test$Volume, test_results)

##        RMSE    Rsquared         MAE 
## 757.8170048   0.7428785 253.8865292

Errors

predictions_svm <- predict(model_svm_2, newdata=test)
p_results_svm <- test
p_results_svm$predictions <- c(predictions_svm)
ggplot(p_results_svm) + 
  geom_point(aes (x=Volume, y= predictions, col= "Predictions")) +
  geom_line (aes (x=Volume, y= Volume, col= "Volumen"))

Random Forest

random_forest <- caret::train(Volume ~.,
                                data = train[,rel_features_2], 
                                method = "rf",
                                trControl = ctrl)

## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .

random_forest

## Random Forest 
## 
## 37 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times) 
## Summary of sample sizes: 25, 24, 25, 26, 24, 24, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     181.9917  0.9207401  97.64999
##   3     163.1099  0.9338889  84.56727
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.

Results in train

train_results_rf <-predict(random_forest, newdata=train)
postResample(train$Volume, train_results_rf)

##       RMSE   Rsquared        MAE 
## 79.7289479  0.9853145 36.4292024

Results in test

test_results_rf <- predict(random_forest, newdata=test)
postResample(test$Volume, test_results_rf)

##        RMSE    Rsquared         MAE 
## 944.7101471   0.5944375 279.4627115

Errors

#Apply the model to the test
predictions_rf <- predict(random_forest, newdata=test)
p_results <- test
p_results$predictions <- c(predictions_rf)
ggplot(p_results) + 
  geom_point(aes (x=Volume, y= predictions, col= "Predictions")) +
  geom_line (aes (x=Volume, y= Volume, col= "Volumen"))

KNN

## k-Nearest Neighbors 
## 
## 37 samples
##  3 predictor
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (3 fold, repeated 5 times) 
## Summary of sample sizes: 24, 26, 24, 25, 25, 24, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  322.0185  0.7946666  180.6591
##   7  368.5619  0.7289864  200.0899
##   9  412.1854  0.7250478  244.2898
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

Results in train

#Checking PostResample
train_results_knn <-predict(knn, newdata=train)
postResample(train$Volume, train_results_knn)

##        RMSE    Rsquared         MAE 
## 205.2140290   0.9117773  94.1750322

Results in test

test_results_knn <- predict(knn, newdata=test)
postResample(test$Volume, test_results_knn)

##         RMSE     Rsquared          MAE 
## 1007.8728478    0.4139313  293.6354497

Errors

predictions_knn <- predict(random_forest, newdata=test)
p_results_knn <- test
p_results_knn$predictions <- c(predictions_knn)
ggplot(p_results_knn) + 
  geom_point(aes (x=Volume, y= predictions, col= "Predictions")) +
  geom_line (aes (x=Volume, y= Volume, col= "Volumen"))

Performance metrics

The best algorithm in this case is Random Forest

#Comparison between three models 
Classifier <- c('SVM','Random Forest','KNN')
RMSE <- c(567.9654, 163.1099, 322.0185)
Rsquared <- c(0.5139188, 0.9338889, 0.7946666)
metrics <- data.frame(Classifier, RMSE, Rsquared)
metrics

##      Classifier     RMSE  Rsquared
## 1           SVM 567.9654 0.5139188
## 2 Random Forest 163.1099 0.9338889
## 3           KNN 322.0185 0.7946666

Apply the model to new products

newproductattributes <- read.csv("newproductattributes2017.csv")

#Pre-process 
newproductattributes[4:10] <-normalize(newproductattributes[4:10], method = "standardize", range = c(0, 1))

#Predictions 
predictions_final <- predict(random_forest, newdata=newproductattributes)
finalPred <- newproductattributes
finalPred$predictions <- c(predictions_final)

write.csv(finalPred, file="C2.T3finalPred.csv", row.names = TRUE)

Results

finalPred$profit <- finalPred$predictions * finalPred$ProfitMargin * finalPred$Price

features_final <- c("PC","Laptop", 
                    "Netbook", "Smartphone")

output <- finalPred[which(finalPred$ProductType %in% features_final),]

output <- output[order(-output$profit),] 

output <- output[c(1,2,3,19,20)]

output

##    ProductType ProductNum   Price predictions     profit
## 1           PC        171  699.00  1325.00293 231544.263
## 3       Laptop        173 1199.00  1306.98173 156707.110
## 7      Netbook        180  329.00  1650.91427  48883.571
## 2           PC        172  860.00   224.83360  38671.379
## 5       Laptop        176 1999.00    37.98027  17462.187
## 8      Netbook        181  439.00   251.14120  12127.609
## 13  Smartphone        194   49.00  1421.39160   8357.783
## 15  Smartphone        196  300.00   251.14120   8287.660
## 12  Smartphone        193  199.00   334.03707   7312.071
## 4       Laptop        175 1199.00    37.98027   6830.751
## 14  Smartphone        195  149.00   206.86373   4623.404
## 6      Netbook        178  399.99    40.73093   1303.357
## 9      Netbook        183  330.00    37.98027   1128.014

write.csv(output, file="output_1.csv", row.names = TRUE)

Impact services reviews

I was asked to assess the impact of service reviews on sales of different product types. To do that I had consider the two features about service reviews, the volume and the product types.

Remove outliers

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. We have to remove the outliers because if we include them our function is going to move in its direction and this is wrong because is not going to represent what is really happening with the rest of the data.

Remove the outliers from Volume

out_volume <- existing_products[which(existing_products$Volume %in% 
                                boxplot(existing_products$Volume)$out
                              ),]

p <- existing_products[-which(existing_products$Volume %in% 
                                boxplot(existing_products$Volume)$out
                              ),]

From PositiveServiceReview

#Remove the outliers from PositiveServiceReview
out_positive <- p[which(p$PositiveServiceReview %in% 
                                boxplot(p$PositiveServiceReview)$out
                              ),]
p <- p[-which(p$PositiveServiceReview %in% 
                                boxplot(p$PositiveServiceReview)$out
                              ),]

From NegativeServiceReview

out_negative <- p[which(p$NegativeServiceReview %in% 
                                boxplot(p$NegativeServiceReview)$out
                              ),]
p <- p[-which(p$NegativeServiceReview %in% 
                                boxplot(p$NegativeServiceReview)$out
                              ),]

Outliers

outliers_serviceReviews <- rbind(out_volume, out_positive, out_negative)

##         ProductType ProductNum PositiveReview NegativeReview Volume
## 50      Accessories        150            536             22  11204
## 73      GameConsole        198             56             13   7036
## 18      Accessories        118            310              6    680
## 23         Software        123            144            112   2052
## 34 ExtendedWarranty        134            280              8   1232
## 35 ExtendedWarranty        135            280              8   1232
## 36 ExtendedWarranty        136            280              8   1232
## 37 ExtendedWarranty        137            280              8   1232
## 38 ExtendedWarranty        138            280              8   1232
## 39 ExtendedWarranty        139            280              8   1232
## 40 ExtendedWarranty        140            280              8   1232
## 41 ExtendedWarranty        141            280              8   1232
## 48      Accessories        148            120             15   2140
## 53      Accessories        153             80              2   1896
## 4            Laptop        104              7              8    196
## 5            Laptop        105              7             20    232
## 22         Software        122             55             38   1576
## 26          Display        126             42             12   1224
## 65          Printer        165              4              7     80
## 67          Printer        167             42             50    824
## 69          Printer        169              8             13    396
## 80      GameConsole        200             29             14   1684

Plot the relation between Services Reviews and Volume

Considering only 4 Product Types

#I had to create a new column with the sum of PositiveServiceReview and NegativeSerciveReview in order to plot both of them in one graphic. 

p$ServicesReviews <- p$PositiveServiceReview + p$NegativeServiceReview

#As we want to check the impact only in four types of products, I had to group them.
types <- c("Laptop", "Netbook", "PC", "Smartphone")
 
#Now its time to created a data set considering only the four types of products 
p1 <- p[which(p$ProductType %in% types),]

#Plot
ggplot(p1, 
       aes(x= ServicesReviews, y=Volume)) + 
  geom_point(aes(x=PositiveServiceReview, y= Volume, col= "Positive")) + 
  geom_point(aes(x=NegativeServiceReview, y=Volume, col="Negative")) +
  facet_grid(p1$ProductType)

Considering all the Product Types

As we can see in the previous graphic the data is not enough to make a conclusion about a relationship between the Service Reviews, Volume and Product Type. So, we have to look for a relationship only between Service Reviews and Volume

serv_reviews <- ggplot(p, 
       aes(x= ServicesReviews, y=Volume)) + 
  geom_point(aes(x=PositiveServiceReview, y= Volume, col= "Positive")) + 
  geom_point(aes(x=NegativeServiceReview, y=Volume, col="Negative")) + 
  geom_smooth(aes(x=NegativeServiceReview, y=Volume, col="Negative")) +
  geom_smooth(aes(x=PositiveServiceReview, y=Volume, col="Positive"))

serv_reviews

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Impact customers reviews

If we consider only the 4 type of products (PC, Laptop, Netbook, and Smartphone) we only have 13 products.

s <- existing_products
s1 <- s[which(s$ProductType %in% types),]
summary(s1$ProductType)

##      Accessories          Display ExtendedWarranty      GameConsole 
##                0                0                0                0 
##           Laptop          Netbook               PC          Printer 
##                3                2                4                0 
##  PrinterSupplies       Smartphone         Software           Tablet 
##                0                4                0                0

Considering only 4 type of products

If we try to plot we found some errors because we don’t have enough data

ggplot(s1[which(s$ProductType %in% c("PC", "Laptop", "Netbook", "Smartphone")),], 
       aes(x= total_stars, y=Volume)) + 
  geom_point(aes(x=x1StarReviews, y= Volume, col= "1x")) + 
  geom_point(aes(x=x2StarReviews, y=Volume, col="2x")) + 
  geom_point(aes(x=x3StarReviews, y=Volume, col="3x")) +
  geom_point(aes(x=x4StarReviews, y=Volume, col="4x")) +
  geom_point(aes(x=x5StarReviews, y=Volume, col="5x"))

## Warning: Removed 4 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing missing values (geom_point).

Considering all the type of products

As we can see in the previous graphic the data is not enough to make a conclusion about a relationship between the Customers Reviews, Volume and Product Type. So, we have to look for a relationship only between Customers Reviews and Volume

Remove the outliers

Remove from Volume

s3 <- existing_products[-which(existing_products$Volume %in% 
                                boxplot(existing_products$Volume)$out
                              ),]

Remove from x5StarReviews

There is no outlier in x5StarReviews

Remove from x4StarReviews

s3 <- s3[-which(s3$x4StarReviews %in% 
                boxplot(s3$x3StarReviews)$out),]

Remove from x3StarReviews

s3 <- s3[-which(s3$x3StarReviews %in% 
                boxplot(s3$x3StarReviews)$out),]

Remove from x2StarReviews

s3 <- s3[-which(s3$x2StarReviews %in% 
                boxplot(s3$x2StarReviews)$out),]

Remove from x1StarReviews

s3 <- s3[-which(s3$x1StarReviews %in% 
                boxplot(s3$x1StarReviews)$out),]

Plot the relationship

s3$total_stars <- s3$x1StarReviews + s3$x2StarReviews + s3$x3StarReviews +
  s3$x4StarReviews + s3$x5StarReviews

customers_reviews <- ggplot(s3, aes(x=total_stars,y=Volume)) + 
  geom_point(aes(x=x1StarReviews, y= Volume, col= "1x")) + 
  geom_point(aes(x=x2StarReviews, y=Volume, col="2x")) + 
  geom_point(aes(x=x3StarReviews, y=Volume, col="3x")) +
  geom_point(aes(x=x4StarReviews, y=Volume, col="4x")) +
  geom_point(aes(x=x5StarReviews, y=Volume, col="5x")) +
  geom_smooth(aes(x=x1StarReviews, y=Volume, col="1x")) +
  geom_smooth(aes(x=x2StarReviews, y=Volume, col="2x")) +
  geom_smooth(aes(x=x3StarReviews, y=Volume, col="3x")) +
  geom_smooth(aes(x=x4StarReviews, y=Volume, col="4x")) +
  geom_smooth(aes(x=x5StarReviews, y=Volume, col="5x"))

customers_reviews

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Recommendations

The PCs are the most profitable type of product. The most profitable product is the Number 171 (PC).

Laptop is the second type of product that generates more profit. The product that generates more profit is the number 173.

SmartPhone is the type of product that generates less profit.

If we want to assess the impact services reviews and customer reviews have on sales of different product types we need more data and reviews from our customers. We only have 3 Tablets, 4 PCs, 4 Smartphone and 3 Laptops, its not possible to find a clear correlation only with these quantity of products.

Multiple Regression in R

Gabriel Miadosqui

18-06-2019

Summary

Purpose and Results

Key points

Main Body

Packages

Importing the .csv

1st exploration

glimpse(existing_products)

glimpse(newproductattributes)

Pre-process

Dummify

Delete BestSeller

Clean Outliers

Normalize

1st Model

Feature engineering

Variable importance

Checking collinearity

Cleaning outliers: x4StarReviews, PositiveServiceReview and x2StarReviews

Modeling after feature engineering

SVM

Results in train

Resuls in test

Errors

Random Forest

Results in train

Results in test

Errors

KNN

Results in train

Results in test

Errors

Performance metrics

Apply the model to new products

Results

Impact services reviews

Remove outliers

Remove the outliers from Volume

From PositiveServiceReview

From NegativeServiceReview

Outliers

Plot the relation between Services Reviews and Volume

Considering only 4 Product Types

Considering all the Product Types

Impact customers reviews

Considering only 4 type of products

Considering all the type of products

Remove the outliers

Remove from Volume

Remove from x5StarReviews

Remove from x4StarReviews

Remove from x3StarReviews

Remove from x2StarReviews

Remove from x1StarReviews

Plot the relationship

Recommendations