Objective
We have been asked by an Electronics Company to forecast sales of new products they are considering adding to inventory. Specifically, they would like us to predict sales volume for four product types: PCs, Laptops, Netbooks, and Smartphones. These predictions will be used to determine which new products are brought into store inventory.
They would also like us to analyze the impact Customer Reviews and Service Reviews have on sales volume.
Goal
Our goals are to 1) assess the impact Customer Reviews and Service Reviews have on sales, and 2) build a model that can predict sales volume for new products with at least an 85% level of certainty (R2) and minimal error (RMSE). We will train three different algorithms (Random Forest, Support Vector Machines with an RBF kernel, and Gradient Boosting) and use a variety of automatic and manual tuning mechanisms to optimize the models.
Data Description
The data consists of service reviews, customer reviews, historical sales data, and product descriptions for all products currently in inventory. The target variable for this project is sales 'Volume' for the product types PCs, Laptops, Netbooks, and Smartphones.
Loading packages
library(tidyverse)
library(caret)
library(ggplot2)
library(corrplot)
library(openxlsx)
library(kableExtra)
library(dplyr)
library(scales)
Import data
existing <- read.csv(file.path('C:/Users/jlbro/OneDrive/C3T3', 'existing.csv'),
stringsAsFactors = TRUE)
Check structure
str(existing)
## 'data.frame': 80 obs. of 18 variables:
## $ ProductType : Factor w/ 12 levels "Accessories",..: 7 7 7 5 5 1 1 1 1 1 ...
## $ ProductNum : int 101 102 103 104 105 106 107 108 109 110 ...
## $ Price : num 949 2250 399 410 1080 ...
## $ x5StarReviews : int 3 2 3 49 58 83 11 33 16 10 ...
## $ x4StarReviews : int 3 1 0 19 31 30 3 19 9 1 ...
## $ x3StarReviews : int 2 0 0 8 11 10 0 12 2 1 ...
## $ x2StarReviews : int 0 0 0 3 7 9 0 5 0 0 ...
## $ x1StarReviews : int 0 0 0 9 36 40 1 9 2 0 ...
## $ PositiveServiceReview: int 2 1 1 7 7 12 3 5 2 2 ...
## $ NegativeServiceReview: int 0 0 0 8 20 5 0 3 1 0 ...
## $ Recommendproduct : num 0.9 0.9 0.9 0.8 0.7 0.3 0.9 0.7 0.8 0.9 ...
## $ BestSellersRank : int 1967 4806 12076 109 268 64 NA 2 NA 18 ...
## $ ShippingWeight : num 25.8 50 17.4 5.7 7 1.6 7.3 12 1.8 0.75 ...
## $ ProductDepth : num 23.9 35 10.5 15 12.9 ...
## $ ProductWidth : num 6.62 31.75 8.3 9.9 0.3 ...
## $ ProductHeight : num 16.9 19 10.2 1.3 8.9 ...
## $ ProfitMargin : num 0.15 0.25 0.08 0.08 0.09 0.05 0.05 0.05 0.05 0.05 ...
## $ Volume : int 12 8 12 196 232 332 44 132 64 40 ...
Notice that an important variable, 'ProductType', has 12 levels. Regression algorithms can easily misinterpret categorical variables with more than 2 values (treating them as ranked or ordinal), so we need to turn each ProductType level into a 'dummy' variable. This turns each level into its own variable with only 2 values: '0' for false and '1' for true.
dummy <- dummyVars(' ~ .', data = existing)
existing2 <- data.frame(predict(dummy, newdata = existing))
# check structure
str(existing2)
## 'data.frame': 80 obs. of 29 variables:
## $ ProductType.Accessories : num 0 0 0 0 0 1 1 1 1 1 ...
## $ ProductType.Display : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.ExtendedWarranty: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.GameConsole : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Laptop : num 0 0 0 1 1 0 0 0 0 0 ...
## $ ProductType.Netbook : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.PC : num 1 1 1 0 0 0 0 0 0 0 ...
## $ ProductType.Printer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.PrinterSupplies : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Smartphone : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Software : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Tablet : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductNum : num 101 102 103 104 105 106 107 108 109 110 ...
## $ Price : num 949 2250 399 410 1080 ...
## $ x5StarReviews : num 3 2 3 49 58 83 11 33 16 10 ...
## $ x4StarReviews : num 3 1 0 19 31 30 3 19 9 1 ...
## $ x3StarReviews : num 2 0 0 8 11 10 0 12 2 1 ...
## $ x2StarReviews : num 0 0 0 3 7 9 0 5 0 0 ...
## $ x1StarReviews : num 0 0 0 9 36 40 1 9 2 0 ...
## $ PositiveServiceReview : num 2 1 1 7 7 12 3 5 2 2 ...
## $ NegativeServiceReview : num 0 0 0 8 20 5 0 3 1 0 ...
## $ Recommendproduct : num 0.9 0.9 0.9 0.8 0.7 0.3 0.9 0.7 0.8 0.9 ...
## $ BestSellersRank : num 1967 4806 12076 109 268 ...
## $ ShippingWeight : num 25.8 50 17.4 5.7 7 1.6 7.3 12 1.8 0.75 ...
## $ ProductDepth : num 23.9 35 10.5 15 12.9 ...
## $ ProductWidth : num 6.62 31.75 8.3 9.9 0.3 ...
## $ ProductHeight : num 16.9 19 10.2 1.3 8.9 ...
## $ ProfitMargin : num 0.15 0.25 0.08 0.08 0.09 0.05 0.05 0.05 0.05 0.05 ...
## $ Volume : num 12 8 12 196 232 332 44 132 64 40 ...
Notice each product type is now its own variable, which the regression algorithms can interpret correctly. We only need PC, Laptop, Netbook, and Smartphone, so we will remove all other product types shortly.
Check summary for descriptive stats and NAs
summary(existing2)
## ProductType.Accessories ProductType.Display ProductType.ExtendedWarranty
## Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.000 Median :0.0000 Median :0.000
## Mean :0.325 Mean :0.0625 Mean :0.125
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.000 Max. :1.0000 Max. :1.000
##
## ProductType.GameConsole ProductType.Laptop ProductType.Netbook ProductType.PC
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00
## Median :0.000 Median :0.0000 Median :0.000 Median :0.00
## Mean :0.025 Mean :0.0375 Mean :0.025 Mean :0.05
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.00
## Max. :1.000 Max. :1.0000 Max. :1.000 Max. :1.00
##
## ProductType.Printer ProductType.PrinterSupplies ProductType.Smartphone
## Min. :0.00 Min. :0.0000 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.00
## Median :0.00 Median :0.0000 Median :0.00
## Mean :0.15 Mean :0.0375 Mean :0.05
## 3rd Qu.:0.00 3rd Qu.:0.0000 3rd Qu.:0.00
## Max. :1.00 Max. :1.0000 Max. :1.00
##
## ProductType.Software ProductType.Tablet ProductNum Price
## Min. :0.000 Min. :0.0000 Min. :101.0 Min. : 3.60
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:120.8 1st Qu.: 52.66
## Median :0.000 Median :0.0000 Median :140.5 Median : 132.72
## Mean :0.075 Mean :0.0375 Mean :142.6 Mean : 247.25
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:160.2 3rd Qu.: 352.49
## Max. :1.000 Max. :1.0000 Max. :200.0 Max. :2249.99
##
## x5StarReviews x4StarReviews x3StarReviews x2StarReviews
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 10.0 1st Qu.: 2.75 1st Qu.: 2.00 1st Qu.: 1.00
## Median : 50.0 Median : 22.00 Median : 7.00 Median : 3.00
## Mean : 176.2 Mean : 40.20 Mean : 14.79 Mean : 13.79
## 3rd Qu.: 306.5 3rd Qu.: 33.00 3rd Qu.: 11.25 3rd Qu.: 7.00
## Max. :2801.0 Max. :431.00 Max. :162.00 Max. :370.00
##
## x1StarReviews PositiveServiceReview NegativeServiceReview Recommendproduct
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. :0.100
## 1st Qu.: 2.00 1st Qu.: 2.00 1st Qu.: 1.000 1st Qu.:0.700
## Median : 8.50 Median : 5.50 Median : 3.000 Median :0.800
## Mean : 37.67 Mean : 51.75 Mean : 6.225 Mean :0.745
## 3rd Qu.: 15.25 3rd Qu.: 42.00 3rd Qu.: 6.250 3rd Qu.:0.900
## Max. :1654.00 Max. :536.00 Max. :112.000 Max. :1.000
##
## BestSellersRank ShippingWeight ProductDepth ProductWidth
## Min. : 1 Min. : 0.0100 Min. : 0.000 Min. : 0.000
## 1st Qu.: 7 1st Qu.: 0.5125 1st Qu.: 4.775 1st Qu.: 1.750
## Median : 27 Median : 2.1000 Median : 7.950 Median : 6.800
## Mean : 1126 Mean : 9.6681 Mean : 14.425 Mean : 7.819
## 3rd Qu.: 281 3rd Qu.:11.2050 3rd Qu.: 15.025 3rd Qu.:11.275
## Max. :17502 Max. :63.0000 Max. :300.000 Max. :31.750
## NA's :15
## ProductHeight ProfitMargin Volume
## Min. : 0.000 Min. :0.0500 Min. : 0
## 1st Qu.: 0.400 1st Qu.:0.0500 1st Qu.: 40
## Median : 3.950 Median :0.1200 Median : 200
## Mean : 6.259 Mean :0.1545 Mean : 705
## 3rd Qu.:10.300 3rd Qu.:0.2000 3rd Qu.: 1226
## Max. :25.800 Max. :0.4000 Max. :11204
##
The summary reveals 15 NAs for 'BestSellersRank'. We delete this variable, as it is the only one with NAs and was determined to have low correlation (0.12) with the target variable.
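That 0.12 figure can be checked before dropping the column by using a pairwise-complete correlation; a quick sketch:
# correlation of BestSellersRank with Volume, ignoring the 15 NAs
cor(existing2$BestSellersRank, existing2$Volume, use = 'complete.obs')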
existing2$BestSellersRank <- NULL
EDA
Correlation matrix of all variables
corrData <- cor(existing2)
corrplot(corrData)

The corrplot is unreadable with this many variables, but examining corrData reveals that x5StarReviews has a perfect correlation of 1 with our target variable 'Volume', which risks overfitting. We will remove x5StarReviews, the product types other than PC, Laptop, Netbook, and Smartphone, and all other variables with less than .18 correlation to 'Volume'.
# remove x5Star, low correlated, and product types not of interest to client
existing2 <- subset(existing2, select = -c(1:4, 8:9, 11:12, 15, 24:27))
str(existing2)
## 'data.frame': 80 obs. of 15 variables:
## $ ProductType.Laptop : num 0 0 0 1 1 0 0 0 0 0 ...
## $ ProductType.Netbook : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.PC : num 1 1 1 0 0 0 0 0 0 0 ...
## $ ProductType.Smartphone: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductNum : num 101 102 103 104 105 106 107 108 109 110 ...
## $ Price : num 949 2250 399 410 1080 ...
## $ x4StarReviews : num 3 1 0 19 31 30 3 19 9 1 ...
## $ x3StarReviews : num 2 0 0 8 11 10 0 12 2 1 ...
## $ x2StarReviews : num 0 0 0 3 7 9 0 5 0 0 ...
## $ x1StarReviews : num 0 0 0 9 36 40 1 9 2 0 ...
## $ PositiveServiceReview : num 2 1 1 7 7 12 3 5 2 2 ...
## $ NegativeServiceReview : num 0 0 0 8 20 5 0 3 1 0 ...
## $ Recommendproduct : num 0.9 0.9 0.9 0.8 0.7 0.3 0.9 0.7 0.8 0.9 ...
## $ ShippingWeight : num 25.8 50 17.4 5.7 7 1.6 7.3 12 1.8 0.75 ...
## $ Volume : num 12 8 12 196 232 332 44 132 64 40 ...
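For readability, the same removal can be written by column name rather than position; an equivalent sketch using the column names above (running either version is sufficient):
# name-based equivalent of the index-based subset above
drop_cols <- c('ProductType.Accessories', 'ProductType.Display',
               'ProductType.ExtendedWarranty', 'ProductType.GameConsole',
               'ProductType.Printer', 'ProductType.PrinterSupplies',
               'ProductType.Software', 'ProductType.Tablet',
               'x5StarReviews', 'ProductDepth', 'ProductWidth',
               'ProductHeight', 'ProfitMargin')
existing2 <- existing2[ , !(names(existing2) %in% drop_cols)]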
# View correlation heatmap
corrData <- cor(existing2)
corrplot(corrData)

# view enhanced correlation heatmap with unnecessary variables removed
corrplot(corrData, method = 'shade', shade.col = NA, tl.col = 'black',
type = 'upper', tl.srt = 45)

As you can see, x4StarReviews, x3StarReviews, x2StarReviews, and PositiveServiceReview have the highest correlation to the target variable 'Volume.'
ggplot(data = existing2, mapping = aes(x = Volume)) +
geom_histogram()

The histogram of Volume reveals a couple of outliers.
ggplot(data = existing, aes(x = ProductType, y = Volume, fill = ProductType)) +
geom_bar(stat = 'identity') +
guides(fill = 'none') +
coord_flip()

Observation
- Notice that Accessories has by far the most volume sold. With such an imbalance in volume, we will use log scales to more accurately tell the impact of Star Reviews and Service Reviews on Sales Volume. We will first create a dataset that filters out only the products we are interested in: PCs, Netbooks, Laptops, and Smartphones.
# data frame of only products of interest
existing3 <- filter(existing, ProductType %in% c('PC', 'Netbook', 'Laptop', 'Smartphone'))
Plot association of 4 Star Reviews on Sales Volume
ggplot(data=existing3, aes(x=x4StarReviews, y=Volume)) +
geom_point(aes(color = ProductType), size = 1) +
theme_bw() +
scale_x_continuous(trans = 'log2') +
scale_y_continuous(trans = 'log2') +
geom_line() +
facet_wrap(~ProductType) +
xlab('Number of 4 Star Reviews') +
ylab('Sales Volume') +
ggtitle('Effect of 4 Star Reviews on Sales Volume')

Plot association of 3 Star Reviews on Sales Volume
ggplot(data=existing3, aes(x=x3StarReviews, y=Volume)) +
geom_point(aes(color = ProductType), size = 1) +
theme_bw() +
scale_x_continuous(trans = 'log2') +
scale_y_continuous(trans = 'log2') +
geom_line() +
facet_wrap(~ProductType) +
xlab('Number of 3 Star Reviews') +
ylab('Sales Volume') +
ggtitle('Effect of 3 Star Reviews on Sales Volume')

Plot association of 2 Star Reviews on Sales Volume
ggplot(data=existing3, aes(x=x2StarReviews, y=Volume)) +
geom_point(aes(color = ProductType), size = 1) +
theme_bw() +
scale_x_continuous(trans = 'log2') +
scale_y_continuous(trans = 'log2') +
geom_line() +
facet_wrap(~ProductType) +
xlab('Number of 2 Star Reviews') +
ylab('Sales Volume') +
ggtitle('Effect of 2 Star Reviews on Sales Volume')

Plot association of 1 Star Reviews on Sales Volume
ggplot(data=existing3, aes(x=x1StarReviews, y=Volume)) +
geom_point(aes(color = ProductType), size = 1) +
theme_bw() +
scale_x_continuous(trans = 'log2') +
scale_y_continuous(trans = 'log2') +
geom_line() +
facet_wrap(~ProductType) +
xlab('Number of 1 Star Reviews') +
ylab('Sales Volume') +
ggtitle('Effect of 1 Star Reviews on Sales Volume')

Observations
- Interestingly, all star-review counts show a positive correlation with sales volume. This is likely because a higher number of reviews, whatever their rating, is an indication of more products being sold.
Plot impact of Positive Service Reviews on Sales Volume
ggplot(data=existing3, aes(x=PositiveServiceReview, y=Volume)) +
geom_point(aes(color = ProductType), size = 1) +
theme_bw() +
scale_x_continuous(trans = 'log2') +
scale_y_continuous(trans = 'log2') +
geom_line() +
facet_wrap(~ProductType) +
xlab('Number of Positive Service Reviews') +
ylab('Sales Volume') +
ggtitle('Effect of Positive Service Reviews on Sales Volume')

Plot impact of Negative Service Reviews on Sales Volume
ggplot(data=existing3, aes(x=NegativeServiceReview, y=Volume)) +
geom_point(aes(color = ProductType), size = 1) +
theme_bw() +
scale_x_continuous(trans = 'log2') +
scale_y_continuous(trans = 'log2') +
geom_line() +
facet_wrap(~ProductType) +
xlab('Number of Negative Service Reviews') +
ylab('Sales Volume') +
ggtitle('Effect of Negative Service Reviews on Sales Volume')

Observations
- Positive service reviews are positively associated with sales volume.
- Negative service reviews are not as linearly related to volume; however, there is a slight positive association.
- It is important to note that there are very few negative service reviews.
Modeling
set.seed(123)
# createDataPartition(): split 75% train / 25% test
index1 <- createDataPartition(existing2$Volume, p=0.75, list = FALSE)
train1 <- existing2[ index1,]
test1 <- existing2[-index1,]
# Check structure of train1
str(train1)
## 'data.frame': 61 obs. of 15 variables:
## $ ProductType.Laptop : num 0 0 0 1 0 0 0 0 0 0 ...
## $ ProductType.Netbook : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.PC : num 1 1 1 0 0 0 0 0 0 0 ...
## $ ProductType.Smartphone: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductNum : num 101 102 103 104 106 107 108 109 110 111 ...
## $ Price : num 949 2250 399 410 114 ...
## $ x4StarReviews : num 3 1 0 19 30 3 19 9 1 2 ...
## $ x3StarReviews : num 2 0 0 8 10 0 12 2 1 2 ...
## $ x2StarReviews : num 0 0 0 3 9 0 5 0 0 4 ...
## $ x1StarReviews : num 0 0 0 9 40 1 9 2 0 15 ...
## $ PositiveServiceReview : num 2 1 1 7 12 3 5 2 2 2 ...
## $ NegativeServiceReview : num 0 0 0 8 5 0 3 1 0 1 ...
## $ Recommendproduct : num 0.9 0.9 0.9 0.8 0.3 0.9 0.7 0.8 0.9 0.5 ...
## $ ShippingWeight : num 25.8 50 17.4 5.7 1.6 7.3 12 1.8 0.75 1 ...
## $ Volume : num 12 8 12 196 332 44 132 64 40 84 ...
# Set cross validation
control1 <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 1)
Random forest model and tuning
# set seed
set.seed(123)
# Creating dataframe for manual tuning
rfGrid <- expand.grid(mtry = c(2,3,4,5,6,7,8))
rf1 <- train(Volume ~ x4StarReviews + PositiveServiceReview + x2StarReviews + x3StarReviews +
x1StarReviews + NegativeServiceReview + Recommendproduct + ShippingWeight + Price,
data = train1,
method = 'rf',
trControl = control1,
tuneGrid = rfGrid)
rf1
## Random Forest
##
## 61 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 869.2921 0.8755901 416.0646
## 3 849.5229 0.8871013 400.1741
## 4 824.7775 0.8939530 386.7741
## 5 827.7373 0.8980015 384.5929
## 6 801.6069 0.9043345 372.1979
## 7 802.9288 0.9073910 372.2425
## 8 788.6419 0.9081729 365.2084
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 8.
Level of importance for variables in model
ggplot(varImp(rf1, scale=FALSE)) +
geom_bar(stat = 'identity') +
ggtitle('Variable Importance of Random Forest 1 on Sales Volume')

Plot residuals
rf1resid <- residuals(rf1)
plot(train1$Volume, rf1resid,
xlab = 'Sales Volume',
ylab = 'Residuals',
main ='Predicted Sales Volume Residuals Plot',
abline(0,0))

Observations
- Overall the residual plot looks good, except for two outliers that will likely skew our R2 and RMSE results, especially since they are not products of interest.
- A deeper dive (sketched below) reveals that both outliers are Accessories, which are not of interest in this project. We will remove these two outliers from the test data set.
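A minimal sketch of that deeper dive, using an illustrative Volume cutoff of 3,000 to isolate the extremes:
# flag extreme-Volume rows in the test set and look up their product type
out_rows <- rownames(test1)[test1$Volume > 3000]   # illustrative threshold
existing[out_rows, c('ProductType', 'ProductNum', 'Volume')]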
# Removing 2 outlier rows #18 and #48 from test set
test1_rem_out <- test1[!rownames(test1) %in% c('18', '48'), ]
Predict rf1 on test1
rf1Preds <- predict(rf1, newdata = test1_rem_out)
summary(rf1Preds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.11 19.05 102.97 425.15 719.71 1271.74

Observation
- The roughly symmetrical scatter of residuals around zero in the plot above indicates a good fit.
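As a complementary check (not part of the original output), a predicted-vs-actual scatter on the held-out set makes the fit easy to eyeball:
# plot predictions against actuals; points near the 45-degree line are accurate
plot(test1_rem_out$Volume, rf1Preds,
     xlab = 'Actual Sales Volume',
     ylab = 'Predicted Sales Volume',
     main = 'RF1: Predicted vs Actual (Test Set)')
abline(0, 1)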
# postResample to test if model will do well on new data
PR_rf1 <- data.frame(postResample(rf1Preds, test1_rem_out$Volume))
PR_rf1
## postResample.rf1Preds..test1_rem_out.Volume.
## RMSE 190.4253816
## Rsquared 0.9452884
## MAE 98.7387608
CV: RMSE = 788, R2 = .908
PostResample: RMSE = 190, R2 = .945
Our cross-validation R2 is .908 after tuning and feature selection, which is excellent. Our postResample R2 is even better, at .945. A cross-validation R2 above roughly 94-95% would be a red flag for overfitting, but a postResample R2 in the mid-.90s means the model will generalize well to new data (and thus is not overfitting).
Random Forest using feature selection
set.seed(123)
rf2 <- train(Volume ~ x4StarReviews + PositiveServiceReview + x2StarReviews,
data = train1,
method = 'rf',
trControl = control1)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
rf2
## Random Forest
##
## 61 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 771.2710 0.9218973 349.2301
## 3 745.3771 0.9284383 338.8776
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
Variable importance
ggplot(varImp(rf2, scale=FALSE)) +
geom_bar(stat = 'identity') +
ggtitle('Variable Importance of Random Forest 2 on Sales Volume')

Plotting the residuals against the actual values for Volume. The graph below shows a couple of volume outliers; further research reveals both are Accessories, which are not products of interest.
resid_rf2 <- residuals(rf2)
plot(train1$Volume, resid_rf2,
xlab = 'Sales Volume',
ylab = 'Residuals',
main ='Predicted Sales Volume Residuals Plot',
abline(0,0))

Predict rf2 on test1 with outliers removed
rf2Preds <- predict(rf2, newdata = test1_rem_out)
summary(rf2Preds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.742 14.336 103.066 419.753 714.428 1379.420

# postResample
PR_rf2 <- data.frame(postResample(rf2Preds, test1_rem_out$Volume))
PR_rf2
## postResample.rf2Preds..test1_rem_out.Volume.
## RMSE 153.8183544
## Rsquared 0.9718405
## MAE 74.7555212
CV: RMSE = 745, R2 = .928
PostResample: RMSE = 153, R2 = .972
The postResample R2 and RMSE are excellent for a regression model.
Support Vector Machines – RBF Kernel with Feature Selection
set.seed(123)
# Creating dataframe for manual tuning
rbfGrid <- expand.grid(sigma = c(.01, .015, .2),
C = c(10, 100, 1000))
rbf1 <- train(Volume ~ x4StarReviews + x3StarReviews + PositiveServiceReview,
data = train1,
method = 'svmRadial',
trControl = control1,
tuneGrid = rbfGrid,
preProc = c('center','scale'))
rbf1
## Support Vector Machines with Radial Basis Function Kernel
##
## 61 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ...
## Resampling results across tuning parameters:
##
## sigma C RMSE Rsquared MAE
## 0.010 10 944.2228 0.8594778 489.1231
## 0.010 100 930.7863 0.8149199 473.1634
## 0.010 1000 1190.2695 0.8309208 580.6230
## 0.015 10 986.8673 0.8419784 507.9508
## 0.015 100 940.5278 0.8123009 480.0726
## 0.015 1000 1236.2804 0.8613697 590.2852
## 0.200 10 913.7802 0.9182651 467.5324
## 0.200 100 879.2184 0.9194906 436.6931
## 0.200 1000 949.7834 0.9093674 462.1128
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.2 and C = 100.
Predict rbf1 on test1 with outliers removed
rbf1Preds <- predict(rbf1, newdata = test1_rem_out)
summary(rbf1Preds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.81 94.91 267.83 463.59 552.12 2146.20

postResample
PR_rbf1 <- data.frame(postResample(rbf1Preds, test1_rem_out$Volume))
PR_rbf1
## postResample.rbf1Preds..test1_rem_out.Volume.
## RMSE 264.0730623
## Rsquared 0.8148197
## MAE 177.1172248
CV: RMSE = 879, R2 = .919
PostResample: RMSE = 264, R2 = .815
Support Vector Machines – RBF Kernel (all predictors)
set.seed(123)
# Creating dataframe for manual tuning
rbfGrid <- expand.grid(sigma = c(.01, .015, .2),
C = c(10, 100, 1000))
rbf2 <- train(Volume ~ .,
data = train1,
method = 'svmRadial',
trControl = control1,
tuneGrid = rbfGrid,
preProc = c('center','scale'))
rbf2
## Support Vector Machines with Radial Basis Function Kernel
##
## 61 samples
## 14 predictors
##
## Pre-processing: centered (14), scaled (14)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ...
## Resampling results across tuning parameters:
##
## sigma C RMSE Rsquared MAE
## 0.010 10 929.4771 0.7527835 516.5685
## 0.010 100 852.7698 0.7998966 471.1629
## 0.010 1000 792.9114 0.7582943 435.2153
## 0.015 10 942.7868 0.7078926 527.4695
## 0.015 100 855.0032 0.7762986 471.0963
## 0.015 1000 813.0011 0.7517215 443.8939
## 0.200 10 1032.4257 0.5438955 573.8091
## 0.200 100 1014.3794 0.5534847 561.6915
## 0.200 1000 998.9260 0.5646485 562.9418
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01 and C = 1000.
Variable importance
ggplot(varImp(rbf2, scale=FALSE)) +
geom_bar(stat = 'identity') +
ggtitle('Variable Importance of Support Vector RBF Model on Sales Volume')

Predicting rbf2 on test1 with outliers removed
rbf2Preds <- predict(rbf2, newdata = test1_rem_out)
summary(rbf2Preds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -269.904 7.444 157.157 435.433 608.986 1570.714
# postResample to test if it will do well on new data or if overfitting
PR_rbf2 <- data.frame(postResample(rbf2Preds, test1_rem_out$Volume))
PR_rbf2
## postResample.rbf2Preds..test1_rem_out.Volume.
## RMSE 318.555567
## Rsquared 0.757511
## MAE 228.271384
CV: RMSE = 793, R2 = .758 (for the selected sigma = 0.01, C = 1000)
PostResample: RMSE = 319, R2 = .758
Note: the predictions contain negative values, which are not meaningful for sales volume.
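Since sales volume cannot be negative, a simple post-processing guard (not part of the original pipeline) is to floor predictions at zero:
# floor negative volume predictions at zero before reporting
rbf2Preds_clipped <- pmax(rbf2Preds, 0)
summary(rbf2Preds_clipped)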
Gradient Boosting with feature selection
set.seed(123)
gbm1 <- train(Volume ~ x4StarReviews + x2StarReviews + PositiveServiceReview,
data = train1,
method = 'gbm',
trControl = control1,
preProc = c('center','scale'))
gbm1
## Stochastic Gradient Boosting
##
## 61 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 55, 53, 55, 55, 55, 56, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 1010.966 0.8249911 571.2535
## 1 100 1054.100 0.8371555 585.4725
## 1 150 1024.901 0.8667286 552.3807
## 2 50 1010.350 0.8575585 557.7797
## 2 100 1046.985 0.8593534 568.5074
## 2 150 1053.486 0.8588568 578.0264
## 3 50 1010.362 0.8472539 564.3698
## 3 100 1038.869 0.8615773 564.2031
## 3 150 1055.289 0.8588360 567.2579
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 10.
Plotting the residuals against the actual values for Volume. The graph below shows a couple of volume outliers; further research reveals both are Accessories, which are not products of interest.
resid_gbm1 <- residuals(gbm1)
plot(train1$Volume, resid_gbm1,
xlab = 'Sales Volume',
ylab = 'Residuals',
main ='Predicted Sales Volume Residuals Plot',
abline(0,0))

Predicting gbm1 on test1 with outliers removed
gbmPreds <- predict(gbm1, newdata = test1_rem_out)
summary(gbmPreds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.966 35.707 40.684 587.565 1340.268 2091.828
postResample
PR_gbm1 <- data.frame(postResample(gbmPreds, test1_rem_out$Volume))
PR_gbm1
## postResample.gbmPreds..test1_rem_out.Volume.
## RMSE 266.4904990
## Rsquared 0.9105057
## MAE 172.6952417
CV: RMSE = 1010, R2 = .858
PostResample: RMSE = 266, R2 = .911
Gradient Boosting (all predictors)
set.seed(123)
gbm2 <- train(Volume ~ .,
data = train1,
method = 'gbm',
trControl = control1,
preProc = c('center','scale'))
gbm2
Plotting the residuals against the actual values for Volume. The graph below shows a couple of volume outliers; further research reveals both are Accessories, which are not products of interest.
resid_gbm2 <- residuals(gbm2)
plot(train1$Volume, resid_gbm2,
xlab = 'Sales Volume',
ylab = 'Residuals',
main ='Predicted Sales Volume Residuals Plot',
abline(0,0))

Predicting gbm2 on test1 with outliers removed
gbm2Preds <- predict(gbm2, newdata = test1_rem_out)
summary(gbm2Preds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -159.76 98.77 195.42 530.69 997.01 1531.33
postResample
PR_gbm2 <- data.frame(postResample(gbm2Preds, test1_rem_out$Volume))
PR_gbm2
## postResample.gbm2Preds..test1_rem_out.Volume.
## RMSE 321.8757140
## Rsquared 0.7409195
## MAE 250.5706319
CV: RMSE = 813, R2 = .962
PostResample: RMSE = 322, R2 = .741
These predictions also contain negative values; the same pmax() floor shown above would apply.
View actual vs. predicted results in a data frame for all models
Act_v_Pred_NoOutlier <- data.frame(test1_rem_out %>%
select(ProductNum, Volume),
rf1Preds, rf2Preds, rbf1Preds, rbf2Preds, gbmPreds, gbm2Preds) %>%
mutate_if(is.numeric, round)
Compare predictions to actual volume sold for each model
kable(Act_v_Pred_NoOutlier, format = 'html', caption = 'Actual Sales Compared to Model Predictions', digits=3) %>% kable_styling(bootstrap_options = 'striped', full_width = FALSE) %>%
column_spec(2, border_right = TRUE)
Actual Sales Compared to Model Predictions
| ProductNum | Volume | rf1Preds | rf2Preds | rbf1Preds | rbf2Preds | gbmPreds | gbm2Preds |
|-----------:|-------:|---------:|---------:|----------:|----------:|---------:|----------:|
| 105 | 232 | 450 | 324 | 279 | 72 | 526 | 689 |
| 112 | 300 | 375 | 305 | 403 | 157 | 10 | 133 |
| 115 | 1396 | 1264 | 1302 | 1027 | 1571 | 2092 | 1428 |
| 117 | 44 | 31 | 31 | 67 | 92 | 20 | 333 |
| 127 | 32 | 14 | 14 | 89 | 484 | 41 | -113 |
| 133 | 20 | 19 | 14 | 30 | -125 | 41 | 161 |
| 136 | 1232 | 1232 | 1234 | 1401 | 1410 | 1356 | 1297 |
| 142 | 84 | 84 | 69 | 207 | 44 | 36 | -160 |
| 143 | 88 | 103 | 103 | 268 | 184 | 251 | -43 |
| 147 | 836 | 720 | 714 | 552 | 1470 | 1478 | 1531 |
| 149 | 52 | 84 | 79 | 199 | -56 | 21 | 195 |
| 153 | 1896 | 1272 | 1379 | 2146 | 1553 | 2092 | 1387 |
| 164 | 8 | 14 | 3 | 89 | -101 | 41 | 121 |
| 166 | 0 | 17 | 12 | 95 | 7 | 41 | 99 |
| 169 | 396 | 433 | 387 | 301 | 302 | 585 | 950 |
| 190 | 16 | 17 | 13 | 126 | -270 | 20 | 15 |
| 197 | 1472 | 1099 | 1154 | 603 | 609 | 1340 | 997 |
# compare postResample metrics across all models
PostResample_AllModels <- data.frame(cbind(PR_rf1, PR_rf2, PR_rbf1, PR_rbf2, PR_gbm1, PR_gbm2))
kable(PostResample_AllModels, format = 'html', caption = 'PostResample Result Comparison for All Models',
col.names = c('RF1','RF2','SVM RBF1','SVM RBF2','GBoost1','GBoost2')) %>%
kable_styling(bootstrap_options = 'striped', full_width = FALSE) %>%
column_spec(3, background = '#8494a9', color = 'white') %>%
column_spec(1, border_right = TRUE)
PostResample Result Comparison for All Models
| Metric | RF1 | RF2 | SVM RBF1 | SVM RBF2 | GBoost1 | GBoost2 |
|:-------|----:|----:|---------:|---------:|--------:|--------:|
| RMSE | 190.4253816 | 153.8183544 | 264.0730623 | 318.555567 | 266.4904990 | 321.8757140 |
| Rsquared | 0.9452884 | 0.9718405 | 0.8148197 | 0.757511 | 0.9105057 | 0.7409195 |
| MAE | 98.7387608 | 74.7555212 | 177.1172248 | 228.271385 | 172.6952417 | 250.5706319 |
Top Model: rf2 (lowest RMSE and MAE, highest R2 on the held-out test set)
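The winner can also be picked programmatically from the comparison table; a small sketch:
# select the model column with the smallest test RMSE
rmse_row <- unlist(PostResample_AllModels['RMSE', ])
names(PostResample_AllModels)[which.min(rmse_row)]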
Use top model to make predictions on new product dataset
Import data
new <- read.csv(file.path('C:/Users/jlbro/OneDrive/C3T3', 'new.csv'), stringsAsFactors = TRUE)
# check structure
str(new)
## 'data.frame': 24 obs. of 18 variables:
## $ ProductType : Factor w/ 12 levels "Accessories",..: 7 7 5 5 5 6 6 6 6 12 ...
## $ ProductNum : int 171 172 173 175 176 178 180 181 183 186 ...
## $ Price : num 699 860 1199 1199 1999 ...
## $ x5StarReviews : int 96 51 74 7 1 19 312 23 3 296 ...
## $ x4StarReviews : int 26 11 10 2 1 8 112 18 4 66 ...
## $ x3StarReviews : int 14 10 3 1 1 4 28 7 0 30 ...
## $ x2StarReviews : int 14 10 3 1 3 1 31 22 1 21 ...
## $ x1StarReviews : int 25 21 11 1 0 10 47 18 0 36 ...
## $ PositiveServiceReview: int 12 7 11 2 0 2 28 5 1 28 ...
## $ NegativeServiceReview: int 3 5 5 1 1 4 16 16 0 9 ...
## $ Recommendproduct : num 0.7 0.6 0.8 0.6 0.3 0.6 0.7 0.4 0.7 0.8 ...
## $ BestSellersRank : int 2498 490 111 4446 2820 4140 2699 1704 5128 34 ...
## $ ShippingWeight : num 19.9 27 6.6 13 11.6 5.8 4.6 4.8 4.3 3 ...
## $ ProductDepth : num 20.63 21.89 8.94 16.3 16.81 ...
## $ ProductWidth : num 19.2 27 12.8 10.8 10.9 ...
## $ ProductHeight : num 8.39 9.13 0.68 1.4 0.88 1.2 0.95 1.5 0.97 0.37 ...
## $ ProfitMargin : num 0.25 0.2 0.1 0.15 0.23 0.08 0.09 0.11 0.09 0.1 ...
## $ Volume : int 0 0 0 0 0 0 0 0 0 0 ...
Process the data the same way as the training set
# create dummy variables
newDummy <- dummyVars(' ~ .', data = new)
new2 <- data.frame(predict(newDummy, newdata = new))
# check structure
str(new2)
## 'data.frame': 24 obs. of 29 variables:
## $ ProductType.Accessories : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Display : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.ExtendedWarranty: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.GameConsole : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Laptop : num 0 0 1 1 1 0 0 0 0 0 ...
## $ ProductType.Netbook : num 0 0 0 0 0 1 1 1 1 0 ...
## $ ProductType.PC : num 1 1 0 0 0 0 0 0 0 0 ...
## $ ProductType.Printer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.PrinterSupplies : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Smartphone : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Software : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductType.Tablet : num 0 0 0 0 0 0 0 0 0 1 ...
## $ ProductNum : num 171 172 173 175 176 178 180 181 183 186 ...
## $ Price : num 699 860 1199 1199 1999 ...
## $ x5StarReviews : num 96 51 74 7 1 19 312 23 3 296 ...
## $ x4StarReviews : num 26 11 10 2 1 8 112 18 4 66 ...
## $ x3StarReviews : num 14 10 3 1 1 4 28 7 0 30 ...
## $ x2StarReviews : num 14 10 3 1 3 1 31 22 1 21 ...
## $ x1StarReviews : num 25 21 11 1 0 10 47 18 0 36 ...
## $ PositiveServiceReview : num 12 7 11 2 0 2 28 5 1 28 ...
## $ NegativeServiceReview : num 3 5 5 1 1 4 16 16 0 9 ...
## $ Recommendproduct : num 0.7 0.6 0.8 0.6 0.3 0.6 0.7 0.4 0.7 0.8 ...
## $ BestSellersRank : num 2498 490 111 4446 2820 ...
## $ ShippingWeight : num 19.9 27 6.6 13 11.6 5.8 4.6 4.8 4.3 3 ...
## $ ProductDepth : num 20.63 21.89 8.94 16.3 16.81 ...
## $ ProductWidth : num 19.2 27 12.8 10.8 10.9 ...
## $ ProductHeight : num 8.39 9.13 0.68 1.4 0.88 1.2 0.95 1.5 0.97 0.37 ...
## $ ProfitMargin : num 0.25 0.2 0.1 0.15 0.23 0.08 0.09 0.11 0.09 0.1 ...
## $ Volume : num 0 0 0 0 0 0 0 0 0 0 ...
new2$BestSellersRank <- NULL
new2 <- subset(new2, select = -c(1:4, 8:9, 11:12, 15, 24:27))
str(new2)
Predict rf2 on 'new' product data
set.seed(123)
Predicted_Volume <- predict(rf2, newdata = new2)
Add predictions to the ‘new’ product dataframe
Preds_rf2_df <- data.frame(new2 %>% select(ProductType.Laptop, ProductType.Netbook, ProductType.PC, ProductType.Smartphone, ProductNum, x4StarReviews), Predicted_Volume) %>%
mutate_if(is.numeric, round)
# export predictions, then re-import after manual consolidation in Excel
# (dummy columns collapsed into a single Product Type column; note the new file name)
write.xlsx(Preds_rf2_df, 'C:/Users/jlbro/OneDrive/Predict Sales Volume/Rf1Preds.xlsx')
Preds_rf2_df <- read.xlsx(file.path('C:/Users/jlbro/OneDrive/Predict Sales Volume', 'Rf1Predictions.xlsx'))
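The same consolidation can be done directly in R instead of in Excel; a sketch using the dummy-column names above:
# collapse dummy columns into one Product Type label and keep products of interest
Preds_rf2_df <- data.frame(new2, Predicted_Volume) %>%
  filter(ProductType.PC == 1 | ProductType.Laptop == 1 |
         ProductType.Netbook == 1 | ProductType.Smartphone == 1) %>%
  mutate(ProductType = case_when(ProductType.PC == 1 ~ 'PC',
                                 ProductType.Laptop == 1 ~ 'Laptop',
                                 ProductType.Netbook == 1 ~ 'Netbook',
                                 TRUE ~ 'Smartphone')) %>%
  select(ProductType, ProductNum, x4StarReviews, Predicted_Volume) %>%
  mutate_if(is.numeric, round)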
View our sales predictions for the 4 product types of interest on the new dataset. All products not of interest were removed.
kable(Preds_rf2_df, format = 'html', caption = 'Forecasted Sales for New Product Types',
col.names = c('Product Type', 'Product Number', '4 Star Reviews', 'Predicted Volume'), align = 'lccc') %>%
kable_styling(bootstrap_options = 'striped', full_width = FALSE) %>%
column_spec(4, background = '#8494a9', color = 'white')
Forecasted Sales for New Product Types
| Product Type | Product Number | 4 Star Reviews | Predicted Volume |
|:-------------|:--------------:|:--------------:|:----------------:|
| PC | 171 | 26 | 479 |
| PC | 172 | 11 | 157 |
| Laptop | 173 | 10 | 187 |
| Laptop | 175 | 2 | 37 |
| Laptop | 176 | 1 | 14 |
| Netbook | 178 | 8 | 56 |
| Netbook | 180 | 112 | 1234 |
| Netbook | 181 | 18 | 129 |
| Netbook | 183 | 4 | 19 |
| Smartphone | 193 | 26 | 445 |
| Smartphone | 194 | 26 | 650 |
| Smartphone | 195 | 8 | 87 |
| Smartphone | 196 | 19 | 159 |
Actionable Insights
1) The more 4- and 5-star reviews a product has, the higher its sales volume. From a business perspective, consider focusing more on product types with higher reviews.
2) The more Positive Service Reviews, the higher the sales volume for most products. Because of this, consider boosting customer service training during employee onboarding, and regularly thereafter, as part of a customer- and employee-centric company mission.
3) PC #171, Laptop #173, Netbook #180, and Smartphone #194 are projected to outperform the other products within each type quite handily.
4) Consider using predicted sales volume to help determine which new products to bring into store inventory.
Report analysis by Jennifer Brosnahan