Objective

We have been asked by an Electronics Company to forecast sales of new products that they are considering for building up inventory. Specifically, they would like for us to predict sales volume for four different product types: PCs, Laptops, Netbooks, and Smartphones, which will be used to determine which new products will be brought into store inventory.

They would also like for us to analyze the impact Customer Reviews and Service Reviews have on sales volume.

Goal

Our goals are to 1) assess the impact Customer Reviews and Service Reviews have on sales, and 2) build a model that can predict sales volume on new products with at least 85% level of certainty (R2) and minimal error (RMSE). We will build three different algorithms (Random Forest, Support Vector Machines RBF kernel, and Gradient Boosting) and utilize a variety of automatic and manual tuning mechanisms to optimize models.

Data Description

Data consists of service reviews, customer reviews, historical sales data, and product descriptions for all products currently in inventory. The target variable for this project is sales ‘Volume’ for the product types: PC, Laptops, Netbooks, and Smartphones.

Loading packages

library(tidyverse)
library(caret)
library(ggplot2)
library(corrplot)
library(openxlsx)
library(kableExtra)
library(dplyr)
library(scales)

Import data

existing <- read.csv(file.path('C:/Users/jlbro/OneDrive/C3T3', 'existing.csv'), 
                     stringsAsFactors = TRUE)

Check structure

str(existing)
## 'data.frame':    80 obs. of  18 variables:
##  $ ProductType          : Factor w/ 12 levels "Accessories",..: 7 7 7 5 5 1 1 1 1 1 ...
##  $ ProductNum           : int  101 102 103 104 105 106 107 108 109 110 ...
##  $ Price                : num  949 2250 399 410 1080 ...
##  $ x5StarReviews        : int  3 2 3 49 58 83 11 33 16 10 ...
##  $ x4StarReviews        : int  3 1 0 19 31 30 3 19 9 1 ...
##  $ x3StarReviews        : int  2 0 0 8 11 10 0 12 2 1 ...
##  $ x2StarReviews        : int  0 0 0 3 7 9 0 5 0 0 ...
##  $ x1StarReviews        : int  0 0 0 9 36 40 1 9 2 0 ...
##  $ PositiveServiceReview: int  2 1 1 7 7 12 3 5 2 2 ...
##  $ NegativeServiceReview: int  0 0 0 8 20 5 0 3 1 0 ...
##  $ Recommendproduct     : num  0.9 0.9 0.9 0.8 0.7 0.3 0.9 0.7 0.8 0.9 ...
##  $ BestSellersRank      : int  1967 4806 12076 109 268 64 NA 2 NA 18 ...
##  $ ShippingWeight       : num  25.8 50 17.4 5.7 7 1.6 7.3 12 1.8 0.75 ...
##  $ ProductDepth         : num  23.9 35 10.5 15 12.9 ...
##  $ ProductWidth         : num  6.62 31.75 8.3 9.9 0.3 ...
##  $ ProductHeight        : num  16.9 19 10.2 1.3 8.9 ...
##  $ ProfitMargin         : num  0.15 0.25 0.08 0.08 0.09 0.05 0.05 0.05 0.05 0.05 ...
##  $ Volume               : int  12 8 12 196 232 332 44 132 64 40 ...

Notice that an important variable, ‘ProductType’ has 12 levels. Regression algorithms can easily misinterpret categorical variables in which there are more than 2 values (it may treat them as ranked or ordinal values), so we need to turn each ProductType level into a ‘dummy’ variable. This will turn each ‘ProductType’ into it’s own variable with only 2 values, ‘0’ for false and ‘1’ for true.

dummy <- dummyVars(' ~ .', data = existing)
existing2 <- data.frame(predict(dummy, newdata = existing))
# check structure
str(existing2)
## 'data.frame':    80 obs. of  29 variables:
##  $ ProductType.Accessories     : num  0 0 0 0 0 1 1 1 1 1 ...
##  $ ProductType.Display         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.ExtendedWarranty: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.GameConsole     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Laptop          : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ ProductType.Netbook         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.PC              : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ ProductType.Printer         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.PrinterSupplies : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Smartphone      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Software        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Tablet          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductNum                  : num  101 102 103 104 105 106 107 108 109 110 ...
##  $ Price                       : num  949 2250 399 410 1080 ...
##  $ x5StarReviews               : num  3 2 3 49 58 83 11 33 16 10 ...
##  $ x4StarReviews               : num  3 1 0 19 31 30 3 19 9 1 ...
##  $ x3StarReviews               : num  2 0 0 8 11 10 0 12 2 1 ...
##  $ x2StarReviews               : num  0 0 0 3 7 9 0 5 0 0 ...
##  $ x1StarReviews               : num  0 0 0 9 36 40 1 9 2 0 ...
##  $ PositiveServiceReview       : num  2 1 1 7 7 12 3 5 2 2 ...
##  $ NegativeServiceReview       : num  0 0 0 8 20 5 0 3 1 0 ...
##  $ Recommendproduct            : num  0.9 0.9 0.9 0.8 0.7 0.3 0.9 0.7 0.8 0.9 ...
##  $ BestSellersRank             : num  1967 4806 12076 109 268 ...
##  $ ShippingWeight              : num  25.8 50 17.4 5.7 7 1.6 7.3 12 1.8 0.75 ...
##  $ ProductDepth                : num  23.9 35 10.5 15 12.9 ...
##  $ ProductWidth                : num  6.62 31.75 8.3 9.9 0.3 ...
##  $ ProductHeight               : num  16.9 19 10.2 1.3 8.9 ...
##  $ ProfitMargin                : num  0.15 0.25 0.08 0.08 0.09 0.05 0.05 0.05 0.05 0.05 ...
##  $ Volume                      : num  12 8 12 196 232 332 44 132 64 40 ...

Notice each product type is it’s own variable. The regression algorithms will now be able to better understand each. We only need PC, Laptop, Netbook, and Smartphone, so let’s remove all other product types.

Check summary for descriptive stats and NAs

summary(existing2)
##  ProductType.Accessories ProductType.Display ProductType.ExtendedWarranty
##  Min.   :0.000           Min.   :0.0000      Min.   :0.000               
##  1st Qu.:0.000           1st Qu.:0.0000      1st Qu.:0.000               
##  Median :0.000           Median :0.0000      Median :0.000               
##  Mean   :0.325           Mean   :0.0625      Mean   :0.125               
##  3rd Qu.:1.000           3rd Qu.:0.0000      3rd Qu.:0.000               
##  Max.   :1.000           Max.   :1.0000      Max.   :1.000               
##                                                                          
##  ProductType.GameConsole ProductType.Laptop ProductType.Netbook ProductType.PC
##  Min.   :0.000           Min.   :0.0000     Min.   :0.000       Min.   :0.00  
##  1st Qu.:0.000           1st Qu.:0.0000     1st Qu.:0.000       1st Qu.:0.00  
##  Median :0.000           Median :0.0000     Median :0.000       Median :0.00  
##  Mean   :0.025           Mean   :0.0375     Mean   :0.025       Mean   :0.05  
##  3rd Qu.:0.000           3rd Qu.:0.0000     3rd Qu.:0.000       3rd Qu.:0.00  
##  Max.   :1.000           Max.   :1.0000     Max.   :1.000       Max.   :1.00  
##                                                                               
##  ProductType.Printer ProductType.PrinterSupplies ProductType.Smartphone
##  Min.   :0.00        Min.   :0.0000              Min.   :0.00          
##  1st Qu.:0.00        1st Qu.:0.0000              1st Qu.:0.00          
##  Median :0.00        Median :0.0000              Median :0.00          
##  Mean   :0.15        Mean   :0.0375              Mean   :0.05          
##  3rd Qu.:0.00        3rd Qu.:0.0000              3rd Qu.:0.00          
##  Max.   :1.00        Max.   :1.0000              Max.   :1.00          
##                                                                        
##  ProductType.Software ProductType.Tablet   ProductNum        Price        
##  Min.   :0.000        Min.   :0.0000     Min.   :101.0   Min.   :   3.60  
##  1st Qu.:0.000        1st Qu.:0.0000     1st Qu.:120.8   1st Qu.:  52.66  
##  Median :0.000        Median :0.0000     Median :140.5   Median : 132.72  
##  Mean   :0.075        Mean   :0.0375     Mean   :142.6   Mean   : 247.25  
##  3rd Qu.:0.000        3rd Qu.:0.0000     3rd Qu.:160.2   3rd Qu.: 352.49  
##  Max.   :1.000        Max.   :1.0000     Max.   :200.0   Max.   :2249.99  
##                                                                           
##  x5StarReviews    x4StarReviews    x3StarReviews    x2StarReviews   
##  Min.   :   0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  10.0   1st Qu.:  2.75   1st Qu.:  2.00   1st Qu.:  1.00  
##  Median :  50.0   Median : 22.00   Median :  7.00   Median :  3.00  
##  Mean   : 176.2   Mean   : 40.20   Mean   : 14.79   Mean   : 13.79  
##  3rd Qu.: 306.5   3rd Qu.: 33.00   3rd Qu.: 11.25   3rd Qu.:  7.00  
##  Max.   :2801.0   Max.   :431.00   Max.   :162.00   Max.   :370.00  
##                                                                     
##  x1StarReviews     PositiveServiceReview NegativeServiceReview Recommendproduct
##  Min.   :   0.00   Min.   :  0.00        Min.   :  0.000       Min.   :0.100   
##  1st Qu.:   2.00   1st Qu.:  2.00        1st Qu.:  1.000       1st Qu.:0.700   
##  Median :   8.50   Median :  5.50        Median :  3.000       Median :0.800   
##  Mean   :  37.67   Mean   : 51.75        Mean   :  6.225       Mean   :0.745   
##  3rd Qu.:  15.25   3rd Qu.: 42.00        3rd Qu.:  6.250       3rd Qu.:0.900   
##  Max.   :1654.00   Max.   :536.00        Max.   :112.000       Max.   :1.000   
##                                                                                
##  BestSellersRank ShippingWeight     ProductDepth      ProductWidth   
##  Min.   :    1   Min.   : 0.0100   Min.   :  0.000   Min.   : 0.000  
##  1st Qu.:    7   1st Qu.: 0.5125   1st Qu.:  4.775   1st Qu.: 1.750  
##  Median :   27   Median : 2.1000   Median :  7.950   Median : 6.800  
##  Mean   : 1126   Mean   : 9.6681   Mean   : 14.425   Mean   : 7.819  
##  3rd Qu.:  281   3rd Qu.:11.2050   3rd Qu.: 15.025   3rd Qu.:11.275  
##  Max.   :17502   Max.   :63.0000   Max.   :300.000   Max.   :31.750  
##  NA's   :15                                                          
##  ProductHeight     ProfitMargin        Volume     
##  Min.   : 0.000   Min.   :0.0500   Min.   :    0  
##  1st Qu.: 0.400   1st Qu.:0.0500   1st Qu.:   40  
##  Median : 3.950   Median :0.1200   Median :  200  
##  Mean   : 6.259   Mean   :0.1545   Mean   :  705  
##  3rd Qu.:10.300   3rd Qu.:0.2000   3rd Qu.: 1226  
##  Max.   :25.800   Max.   :0.4000   Max.   :11204  
## 

Reveals 15 NAs for ‘BestSellersRank’. Deleting as it is the only variable with NAs and was determined to have low correlation to target variable (0.12)

existing2$BestSellersRank <- NULL

EDA

Correlation matrix of all variables

corrData <- cor(existing2)
corrplot(corrData)

Corrplot is unreadable with so many variables, but we examined the corrData, which revealed that x5StarReviews had a perfect correlation of 1 to our target variable, ‘volume,’ which risks overfitting. We will remove x5StarReviews, product types that are not PC, Laptop, Netbook, Smartphone, and other variables with less than .18 correlation to ‘volume.’

# remove x5Star, low correlated, and product types not of interest to client
existing2 <- subset(existing2, select = -c(1:4, 8:9, 11:12, 15, 24:27))
str(existing2)
## 'data.frame':    80 obs. of  15 variables:
##  $ ProductType.Laptop    : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ ProductType.Netbook   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.PC        : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ ProductType.Smartphone: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductNum            : num  101 102 103 104 105 106 107 108 109 110 ...
##  $ Price                 : num  949 2250 399 410 1080 ...
##  $ x4StarReviews         : num  3 1 0 19 31 30 3 19 9 1 ...
##  $ x3StarReviews         : num  2 0 0 8 11 10 0 12 2 1 ...
##  $ x2StarReviews         : num  0 0 0 3 7 9 0 5 0 0 ...
##  $ x1StarReviews         : num  0 0 0 9 36 40 1 9 2 0 ...
##  $ PositiveServiceReview : num  2 1 1 7 7 12 3 5 2 2 ...
##  $ NegativeServiceReview : num  0 0 0 8 20 5 0 3 1 0 ...
##  $ Recommendproduct      : num  0.9 0.9 0.9 0.8 0.7 0.3 0.9 0.7 0.8 0.9 ...
##  $ ShippingWeight        : num  25.8 50 17.4 5.7 7 1.6 7.3 12 1.8 0.75 ...
##  $ Volume                : num  12 8 12 196 232 332 44 132 64 40 ...
# View correlation heatmap
corrData <- cor(existing2)
corrplot(corrData)

# view enhanced correlation heatmap with unnecessary variables removed
corrplot(corrData, method = 'shade', shade.col = NA, tl.col = 'black', 
         type = 'upper', tl.srt = 45)

As you can see, x4Star, x3Star, x2Star, and PositiveService Review have highest correlation to target variable ‘Volume.’

ggplot(data = existing2, mapping = aes(x = Volume)) +
  geom_histogram()

Histogram of Volume, reveals a couple outliers

ggplot(data = existing, aes(x = ProductType, y = Volume, fill = ProductType)) +
  geom_bar(stat = 'identity') + 
  guides(fill=FALSE) +
  coord_flip()

Observation

  • Notice Accessories by far has most volume sold. With such an imbalance in volume, we will need to use log to accurately to tell the impact of Star Reviews and Service Reviews on Sales Volume. We will first create a dataset that filters out only the products we are interested in: PCs, Netbooks, Laptops, Smartphones
# data frame of only products of interest
existing3 <- filter(existing, (ProductType=='PC' | ProductType=='Netbook' | ProductType=='Laptop' | ProductType=='Smartphone')) 

Plot association of 4 Star Reviews on Sales Volume

ggplot(data=existing3, aes(x=x4StarReviews, y=Volume)) + 
  geom_point(aes(color=ProductType, size=1)) +
  theme_bw() +
  scale_x_continuous(trans = 'log2') + 
  scale_y_continuous(trans = 'log2') +
  geom_line() +
  facet_wrap(~ProductType) + 
  xlab('Number of 4 Star Reviews') +
  ylab('Sales Volume') +
  ggtitle('Effect of 4 Star Reviews on Sales Volume')

Plot association of 3 Star Reviews on Sales Volume

ggplot(data=existing3, aes(x=x3StarReviews, y=Volume)) + 
  geom_point(aes(color=ProductType, size=1)) +
  theme_bw() +
  scale_x_continuous(trans = 'log2') + 
  scale_y_continuous(trans = 'log2') +
  geom_line() +
  facet_wrap(~ProductType) + 
  xlab('Number of 3 Star Reviews') +
  ylab('Sales Volume') +
  ggtitle('Effect of 3 Star Reviews on Sales Volume')

Plot association of 2 Star Reviews on Sales Volume

ggplot(data=existing3, aes(x=x2StarReviews, y=Volume)) + 
  geom_point(aes(color=ProductType, size=1)) +
  theme_bw() +
  scale_x_continuous(trans = 'log2') + 
  scale_y_continuous(trans = 'log2') +
  geom_line() +
  facet_wrap(~ProductType) + 
  xlab('Number of 2 Star Reviews') +
  ylab('Sales Volume') +
  ggtitle('Effect of 2 Star Reviews on Sales Volume')

Plot association of 1 Star Reviews on Sales Volume

ggplot(data=existing3, aes(x=x1StarReviews, y=Volume)) + 
  geom_point(aes(color=ProductType, size=1)) +
  theme_bw() +
  scale_x_continuous(trans = 'log2') + 
  scale_y_continuous(trans = 'log2') +
  geom_line() +
  facet_wrap(~ProductType) + 
  xlab('Number of 1 Star Reviews') +
  ylab('Sales Volume') +
  ggtitle('Effect of 1 Star Reviews on Sales Volume')

Observations

  • Interestingly, all Star reviews seem to have positive correlation to sales volume, perhaps this is due to the fact that higher volume of reviews is obvious indication of more products being sold.

Plot impact of Positive Service Reviews on Sales Volume

ggplot(data=existing3, aes(x=PositiveServiceReview, y=Volume)) + 
  geom_point(aes(color=ProductType, size=1)) +
  theme_bw() +
  scale_x_continuous(trans = 'log2') + 
  scale_y_continuous(trans = 'log2') +
  geom_line() +
  facet_wrap(~ProductType) + 
  xlab('Number of Positive Service Reviews') +
  ylab('Sales Volume') +
  ggtitle('Effect of Positive Service Reviews on Sales Volume')

Plot impact of Negative Service Reviews on Sales Volume

ggplot(data=existing3, aes(x=NegativeServiceReview, y=Volume)) + 
  geom_point(aes(color=ProductType, size=1)) +
  theme_bw() +
  scale_x_continuous(trans = 'log2') + 
  scale_y_continuous(trans = 'log2') +
  geom_line() +
  facet_wrap(~ProductType) + 
  xlab('Number of Negative Service Reviews') +
  ylab('Sales Volume') +
  ggtitle('Effect of Negative Service Reviews on Sales Volume')

Observations

  • Positive service reviews are positively associated with sales volume.
  • Negative service reviews are not quite as linearly related to volume, however, there is a slight positive association.
  • It is important to note that there are very few negative service reviews.

Modeling

set.seed(123)

# CreateDataPartition() 75% and 25%
index1 <- createDataPartition(existing2$Volume, p=0.75, list = FALSE)
train1 <- existing2[ index1,]
test1 <- existing2[-index1,]

# Check structure of train1
str(train1)
## 'data.frame':    61 obs. of  15 variables:
##  $ ProductType.Laptop    : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ ProductType.Netbook   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.PC        : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ ProductType.Smartphone: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductNum            : num  101 102 103 104 106 107 108 109 110 111 ...
##  $ Price                 : num  949 2250 399 410 114 ...
##  $ x4StarReviews         : num  3 1 0 19 30 3 19 9 1 2 ...
##  $ x3StarReviews         : num  2 0 0 8 10 0 12 2 1 2 ...
##  $ x2StarReviews         : num  0 0 0 3 9 0 5 0 0 4 ...
##  $ x1StarReviews         : num  0 0 0 9 40 1 9 2 0 15 ...
##  $ PositiveServiceReview : num  2 1 1 7 12 3 5 2 2 2 ...
##  $ NegativeServiceReview : num  0 0 0 8 5 0 3 1 0 1 ...
##  $ Recommendproduct      : num  0.9 0.9 0.9 0.8 0.3 0.9 0.7 0.8 0.9 0.5 ...
##  $ ShippingWeight        : num  25.8 50 17.4 5.7 1.6 7.3 12 1.8 0.75 1 ...
##  $ Volume                : num  12 8 12 196 332 44 132 64 40 84 ...
# Set cross validation
control1 <- trainControl(method = 'repeatedcv',
                         number = 10,
                         repeats = 1)

Random forest model and tuning

# set seed
set.seed(123)

# Creating dataframe for manual tuning
rfGrid <- expand.grid(mtry = c(2,3,4,5,6,7,8))

rf1 <- train(Volume ~ x4StarReviews + PositiveServiceReview + x2StarReviews + x3StarReviews + 
               x1StarReviews + NegativeServiceReview + Recommendproduct + ShippingWeight + Price,
             data = train1,
             method = 'rf',
             trControl = control1,
             tuneGrid = rfGrid)

rf1
## Random Forest 
## 
## 61 samples
##  9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     869.2921  0.8755901  416.0646
##   3     849.5229  0.8871013  400.1741
##   4     824.7775  0.8939530  386.7741
##   5     827.7373  0.8980015  384.5929
##   6     801.6069  0.9043345  372.1979
##   7     802.9288  0.9073910  372.2425
##   8     788.6419  0.9081729  365.2084
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 8.

Level of importance for variables in model

ggplot(varImp(rf1, scale=FALSE)) +
  geom_bar(stat = 'identity') +
  ggtitle('Variable Importance of Random Forest 1 on Sales Volume')

Plot residuals

rf1resid <- residuals(rf1)
plot(train1$Volume, rf1resid, 
     xlab = 'Sales Volume', 
     ylab = 'Residuals', 
     main ='Predicted Sales Volume Residuals Plot',
     abline(0,0))

Observations

  • Overall the residual plot looks good, except there are two outliers that will likely skew our R2 and RMSE results, especially if they are not even for products of interest.
  • A deeper dive reveals that both outliers are for accessories which are not of interest in this project. We will remove these two outliers from the test data set
# Removing 2 outlier rows #18 and #48 from test set
test1_rem_out <- test1[!rownames(test1) %in% c('18', '48'), ]

Predict rf1 on test1

rf1Preds <- predict(rf1, newdata = test1_rem_out)
summary(rf1Preds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.11   19.05  102.97  425.15  719.71 1271.74
plot(rf1Preds)

Observation

  • A symmetrical pattern means a good residual plot
# postResample to test if model will do well on new data
PR_rf1 <- data.frame(postResample(rf1Preds, test1_rem_out$Volume))
PR_rf1
##          postResample.rf1Preds..test1_rem_out.Volume.
## RMSE                                      190.4253816
## Rsquared                                    0.9452884
## MAE                                        98.7387608

CV RMSE=788, R2=.908

PostResample RMSE=190, R2=.945

Our Cross Validation R2 is .908 after tuning and feature selection, which is excellent. Our postResample R2 is even better, at .945. If cross validation was above 94-95%, it would be a red-flag for overfitting, but postResample in upper 90s means it will generalize well on new data (and thus is not overfitting).

Random Forest using feature selection

set.seed(123)

rf2 <- train(Volume ~ x4StarReviews + PositiveServiceReview + x2StarReviews,
             data = train1,
             method = 'rf',
             trControl = control1)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
rf2
## Random Forest 
## 
## 61 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     771.2710  0.9218973  349.2301
##   3     745.3771  0.9284383  338.8776
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.

Variable importance

ggplot(varImp(rf2, scale=FALSE)) +
  geom_bar(stat = 'identity') +
  ggtitle('Variable Importance of Random Forest 2 on Sales Volume')

Plotting the residuals against the actual values for Volume. The graph below shows a couple volume outliers, and further research reveals both outliers are for accessories, which are not products of interest.

resid_rf2 <- residuals(rf2)
plot(train1$Volume, resid_rf2, 
     xlab = 'Sales Volume', 
     ylab = 'Residuals', 
     main ='Predicted Sales Volume Residuals Plot',
     abline(0,0))

Predict rf2 on test1 with outliers removed

rf2Preds <- predict(rf2, newdata = test1_rem_out)
summary(rf2Preds)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.742   14.336  103.066  419.753  714.428 1379.420
plot(rf2Preds)

# postResample
PR_rf2 <- data.frame(postResample(rf2Preds, test1_rem_out$Volume))
PR_rf2
##          postResample.rf2Preds..test1_rem_out.Volume.
## RMSE                                      153.8183544
## Rsquared                                    0.9718405
## MAE                                        74.7555212

CV RMSE = 745, R2=.928

PostResample RMSE=153, R2=.972

The postResample R2 and RMSE for a regression model is excellent.

Support Vector Machines – RBF Kernel Feature Selection

set.seed(123)

# Creating dataframe for manual tuning
rbfGrid <- expand.grid(sigma = c(.01, .015, .2),
                       C = c(10, 100, 1000))

rbf1 <- train(Volume ~ x4StarReviews + x3StarReviews + PositiveServiceReview,
              data = train1,
              method = 'svmRadial',
              trControl = control1,
              tuneGrid = rbfGrid,
              preProc = c('center','scale'))

rbf1
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 61 samples
##  3 predictor
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C     RMSE       Rsquared   MAE     
##   0.010    10   944.2228  0.8594778  489.1231
##   0.010   100   930.7863  0.8149199  473.1634
##   0.010  1000  1190.2695  0.8309208  580.6230
##   0.015    10   986.8673  0.8419784  507.9508
##   0.015   100   940.5278  0.8123009  480.0726
##   0.015  1000  1236.2804  0.8613697  590.2852
##   0.200    10   913.7802  0.9182651  467.5324
##   0.200   100   879.2184  0.9194906  436.6931
##   0.200  1000   949.7834  0.9093674  462.1128
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.2 and C = 100.

Predict rbf1 on test1

rbf1Preds <- predict(rbf1, newdata = test1_rem_out)
summary(rbf1Preds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.81   94.91  267.83  463.59  552.12 2146.20
plot(rbf1Preds)

postResample

PR_rbf1 <- data.frame(postResample(rbf1Preds, test1_rem_out$Volume))
PR_rbf1
##          postResample.rbf1Preds..test1_rem_out.Volume.
## RMSE                                       264.0730623
## Rsquared                                     0.8148197
## MAE                                        177.1172248

CV RMSE=879, R2=.919

PostResample RMSE=264, R2=.815

Support Vector Machines – RBF Kernel

set.seed(123)

# Creating dataframe for manual tuning
rbfGrid <- expand.grid(sigma = c(.01, .015, .2),
                       C = c(10, 100, 1000))

rbf2 <- train(Volume ~ .,
              data = train1,
              method = 'svmRadial',
              trControl = control1,
              tuneGrid = rbfGrid,
              preProc = c('center','scale'))

rbf2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 61 samples
## 14 predictors
## 
## Pre-processing: centered (14), scaled (14) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C     RMSE       Rsquared   MAE     
##   0.010    10   929.4771  0.7527835  516.5685
##   0.010   100   852.7698  0.7998966  471.1629
##   0.010  1000   792.9114  0.7582943  435.2153
##   0.015    10   942.7868  0.7078926  527.4695
##   0.015   100   855.0032  0.7762986  471.0963
##   0.015  1000   813.0011  0.7517215  443.8939
##   0.200    10  1032.4257  0.5438955  573.8091
##   0.200   100  1014.3794  0.5534847  561.6915
##   0.200  1000   998.9260  0.5646485  562.9418
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01 and C = 1000.

Variable importance

ggplot(varImp(rbf2, scale=FALSE)) +
  geom_bar(stat = 'identity') +
  ggtitle('Variable Importance of Support Vector RBF Model on Sales Volume')

Predicting rbf2 on test1

rbf2Preds <- predict(rbf2, newdata = test1_rem_out)
summary(rbf2Preds)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -269.904    7.444  157.157  435.433  608.986 1570.714
# postResample to test if it will do well on new data or if overfitting
PR_rbf2 <- data.frame(postResample(rbf2Preds, test1_rem_out$Volume))
PR_rbf2
##          postResample.rbf2Preds..test1_rem_out.Volume.
## RMSE                                        318.555567
## Rsquared                                      0.757511
## MAE                                         228.271384

CV RMSE=657, R2=.909

PostResample RMSE=420, R2=.704

Contains negative predictions

Gradient Boosting with feature selection

set.seed(123)

gbm1 <- train(Volume ~ x4StarReviews + x2StarReviews + PositiveServiceReview,
              data = train1,
              method = 'gbm',
              trControl = control1,
              preProc = c('center','scale'))
gbm1
## Stochastic Gradient Boosting 
## 
## 61 samples
##  3 predictor
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      1010.966  0.8249911  571.2535
##   1                  100      1054.100  0.8371555  585.4725
##   1                  150      1024.901  0.8667286  552.3807
##   2                   50      1010.350  0.8575585  557.7797
##   2                  100      1046.985  0.8593534  568.5074
##   2                  150      1053.486  0.8588568  578.0264
##   3                   50      1010.362  0.8472539  564.3698
##   3                  100      1038.869  0.8615773  564.2031
##   3                  150      1055.289  0.8588360  567.2579
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth =
##  2, shrinkage = 0.1 and n.minobsinnode = 10.

Plotting the residuals against the actual values for Volume. The graph below shows a couple volume outliers, and further research reveals both outliers are for accessories, which are not products of interest.

resid_gbm1 <- residuals(gbm1)
plot(train1$Volume, resid_gbm1, 
     xlab = 'Sales Volume', 
     ylab = 'Residuals', 
     main ='Predicted Sales Volume Residuals Plot',
     abline(0,0))

Predicting gbm on test1 with outliers removed

gbmPreds <- predict(gbm1, newdata = test1_rem_out)
summary(gbmPreds)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    9.966   35.707   40.684  587.565 1340.268 2091.828

postResample

PR_gbm1 <- data.frame(postResample(gbmPreds, test1_rem_out$Volume))
PR_gbm1
##          postResample.gbmPreds..test1_rem_out.Volume.
## RMSE                                      266.4904990
## Rsquared                                    0.9105057
## MAE                                       172.6952417

CV RMSE=1010, R2=.858

PostResample RMSE=266, R2=.911

Gradient Boosting

set.seed(123)

gbm2 <- train(Volume ~ .,
              data = train1,
              method = 'gbm',
              trControl = control1,
              preProc = c('center','scale'))
gbm2

Plotting the residuals against the actual values for Volume. The graph below shows a couple volume outliers, and further research reveals both outliers are for accessories, which are not products of interest.

resid_gbm2 <- residuals(gbm2)
plot(train1$Volume, resid_gbm2, 
     xlab = 'Sales Volume', 
     ylab = 'Residuals', 
     main ='Predicted Sales Volume Residuals Plot',
     abline(0,0))

Predicting gbm2 on test1 with outliers removed

gbm2Preds <- predict(gbm2, newdata = test1_rem_out)
summary(gbm2Preds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -159.76   98.77  195.42  530.69  997.01 1531.33

postResample

PR_gbm2 <- data.frame(postResample(gbm2Preds, test1_rem_out$Volume))
PR_gbm2
##          postResample.gbm2Preds..test1_rem_out.Volume.
## RMSE                                       321.8757140
## Rsquared                                     0.7409195
## MAE                                        250.5706319

CV RMSE=813, R2=.962

PostResample RMSE=321, R2=.74

Contains negative predictions

view actual vs predicted results in data frame for all models

Act_v_Pred_NoOutlier <- data.frame(test1_rem_out %>% 
                                     select(ProductNum, Volume),
                                   rf1Preds, rf2Preds, rbf1Preds, rbf2Preds, gbmPreds, gbm2Preds) %>% 
  mutate_if(is.numeric, round)

Compare predictions to actual volume sold for each model

kable(Act_v_Pred_NoOutlier, format = 'html', caption = 'Actual Sales Compared to Model Predictions', digits=3) %>% kable_styling(bootstrap_options = 'striped', full_width = FALSE) %>% 
  column_spec(2, border_right = TRUE)
Actual Sales Compared to Model Predictions
ProductNum Volume rf1Preds rf2Preds rbf1Preds rbf2Preds gbmPreds gbm2Preds
105 232 450 324 279 72 526 689
112 300 375 305 403 157 10 133
115 1396 1264 1302 1027 1571 2092 1428
117 44 31 31 67 92 20 333
127 32 14 14 89 484 41 -113
133 20 19 14 30 -125 41 161
136 1232 1232 1234 1401 1410 1356 1297
142 84 84 69 207 44 36 -160
143 88 103 103 268 184 251 -43
147 836 720 714 552 1470 1478 1531
149 52 84 79 199 -56 21 195
153 1896 1272 1379 2146 1553 2092 1387
164 8 14 3 89 -101 41 121
166 0 17 12 95 7 41 99
169 396 433 387 301 302 585 950
190 16 17 13 126 -270 20 15
197 1472 1099 1154 603 609 1340 997
# compare postResample metrics across all models
PostResample_AllModels <- data.frame(cbind(PR_rf1, PR_rf2, PR_rbf1, PR_rbf2, PR_gbm1, PR_gbm2))

kable(PostResample_AllModels, format = 'html', caption = 'PostResample Result Comparison for All Models',
      col.names = c('RF1','RF2','SVM RBF1','SVM RBF2','GBoost1','GBoost2')) %>% 
  kable_styling(bootstrap_options = 'striped', full_width = FALSE) %>% 
  column_spec(3, background = '#8494a9', color = 'white') %>% 
  column_spec(1, border_right = TRUE)
PostResample Result Comparison for All Models
RF1 RF2 SVM RBF1 SVM RBF2 GBoost1 GBoost2
RMSE 190.4253816 153.8183544 264.0730623 318.555567 266.4904990 321.8757140
Rsquared 0.9452884 0.9718405 0.8148197 0.757511 0.9105057 0.7409195
MAE 98.7387608 74.7555212 177.1172248 228.271385 172.6952417 250.5706319

Top Model: rf2

Use top model to make predictions on new product dataset

Import data

new <- read.csv(file.path('C:/Users/jlbro/OneDrive/C3T3', 'new.csv'), stringsAsFactors = TRUE)

# check structure
str(new)
## 'data.frame':    24 obs. of  18 variables:
##  $ ProductType          : Factor w/ 12 levels "Accessories",..: 7 7 5 5 5 6 6 6 6 12 ...
##  $ ProductNum           : int  171 172 173 175 176 178 180 181 183 186 ...
##  $ Price                : num  699 860 1199 1199 1999 ...
##  $ x5StarReviews        : int  96 51 74 7 1 19 312 23 3 296 ...
##  $ x4StarReviews        : int  26 11 10 2 1 8 112 18 4 66 ...
##  $ x3StarReviews        : int  14 10 3 1 1 4 28 7 0 30 ...
##  $ x2StarReviews        : int  14 10 3 1 3 1 31 22 1 21 ...
##  $ x1StarReviews        : int  25 21 11 1 0 10 47 18 0 36 ...
##  $ PositiveServiceReview: int  12 7 11 2 0 2 28 5 1 28 ...
##  $ NegativeServiceReview: int  3 5 5 1 1 4 16 16 0 9 ...
##  $ Recommendproduct     : num  0.7 0.6 0.8 0.6 0.3 0.6 0.7 0.4 0.7 0.8 ...
##  $ BestSellersRank      : int  2498 490 111 4446 2820 4140 2699 1704 5128 34 ...
##  $ ShippingWeight       : num  19.9 27 6.6 13 11.6 5.8 4.6 4.8 4.3 3 ...
##  $ ProductDepth         : num  20.63 21.89 8.94 16.3 16.81 ...
##  $ ProductWidth         : num  19.2 27 12.8 10.8 10.9 ...
##  $ ProductHeight        : num  8.39 9.13 0.68 1.4 0.88 1.2 0.95 1.5 0.97 0.37 ...
##  $ ProfitMargin         : num  0.25 0.2 0.1 0.15 0.23 0.08 0.09 0.11 0.09 0.1 ...
##  $ Volume               : int  0 0 0 0 0 0 0 0 0 0 ...

Process data same as training data set

# create dummy variables
newDummy <- dummyVars(' ~ .', data = new)

new2 <- data.frame(predict(newDummy, newdata = new))

# check structure
str(new2)
## 'data.frame':    24 obs. of  29 variables:
##  $ ProductType.Accessories     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Display         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.ExtendedWarranty: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.GameConsole     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Laptop          : num  0 0 1 1 1 0 0 0 0 0 ...
##  $ ProductType.Netbook         : num  0 0 0 0 0 1 1 1 1 0 ...
##  $ ProductType.PC              : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Printer         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.PrinterSupplies : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Smartphone      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Software        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductType.Tablet          : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ ProductNum                  : num  171 172 173 175 176 178 180 181 183 186 ...
##  $ Price                       : num  699 860 1199 1199 1999 ...
##  $ x5StarReviews               : num  96 51 74 7 1 19 312 23 3 296 ...
##  $ x4StarReviews               : num  26 11 10 2 1 8 112 18 4 66 ...
##  $ x3StarReviews               : num  14 10 3 1 1 4 28 7 0 30 ...
##  $ x2StarReviews               : num  14 10 3 1 3 1 31 22 1 21 ...
##  $ x1StarReviews               : num  25 21 11 1 0 10 47 18 0 36 ...
##  $ PositiveServiceReview       : num  12 7 11 2 0 2 28 5 1 28 ...
##  $ NegativeServiceReview       : num  3 5 5 1 1 4 16 16 0 9 ...
##  $ Recommendproduct            : num  0.7 0.6 0.8 0.6 0.3 0.6 0.7 0.4 0.7 0.8 ...
##  $ BestSellersRank             : num  2498 490 111 4446 2820 ...
##  $ ShippingWeight              : num  19.9 27 6.6 13 11.6 5.8 4.6 4.8 4.3 3 ...
##  $ ProductDepth                : num  20.63 21.89 8.94 16.3 16.81 ...
##  $ ProductWidth                : num  19.2 27 12.8 10.8 10.9 ...
##  $ ProductHeight               : num  8.39 9.13 0.68 1.4 0.88 1.2 0.95 1.5 0.97 0.37 ...
##  $ ProfitMargin                : num  0.25 0.2 0.1 0.15 0.23 0.08 0.09 0.11 0.09 0.1 ...
##  $ Volume                      : num  0 0 0 0 0 0 0 0 0 0 ...
new2$BestSellersRank <- NULL
new2 <- subset(new2, select = -c(1:4, 8:9, 11:12, 15, 24:27))

str(new2)

Predict rf1 on ‘new’ product data

set.seed(123)

Predicted_Volume <- predict(rf2, newdata = new2)

Add predictions to the ‘new’ product dataframe

Preds_rf2_df <- data.frame(new2 %>% select(ProductType.Laptop, ProductType.Netbook, ProductType.PC, ProductType.Smartphone, ProductNum, x4StarReviews), Predicted_Volume) %>% 
  mutate_if(is.numeric, round)

write.xlsx(Preds_rf2_df, 'C:/Users/jlbro/OneDrive/Predict Sales Volume//Rf1Preds.xlsx')

Preds_rf2_df <- read.xlsx(file.path('C:/Users/jlbro/OneDrive/Predict Sales Volume', 'Rf1Predictions.xlsx'))

View our sales predictions for 4 product types on a new dataset. All products not of interest were removed.

kable(Preds_rf2_df, format = 'html', caption = 'Forecasted Sales for New Product Types',
      col.names = c('Product Type', 'Product Number', '4 Star Reviews', 'Predicted Volume'), align = 'lccc') %>% 
  kable_styling(bootstrap_options = 'striped', full_width = FALSE) %>% 
  column_spec(4, background = '#8494a9', color = 'white')
Forecasted Sales for New Product Types
Product Type Product Number 4 Star Reviews Predicted Volume
PC 171 26 479
PC 172 11 157
Laptop 173 10 187
Laptop 175 2 37
Laptop 176 1 14
Netbook 178 8 56
Netbook 180 112 1234
Netbook 181 18 129
Netbook 183 4 19
Smartphone 193 26 445
Smartphone 194 26 650
Smartphone 195 8 87
Smartphone 196 19 159

Actionable Insights

1) The more x4 and x5 Star Reviews a product has, the higher the sales volume. From a business perspective, consider focusing more on product types with higher reviews.

2) The more Positive Service Reviews, the higher the sales volume for most products. Because of this, consider boosting Customer Service training during employee on-boarding and regularly thereafter as part of a Customer- and Employee-centric Company mission.

3) PC #171, Laptop #173, Netbook #180, and Smartphone #650 are projected to outperform other products within each type quite handily

4) Consider using predicted sales volume to help determine new products for building store inventory

Report analysis by Jennifer Brosnahan