Executive summary

Case

The sales team of Blackwell Electronics asks for another analysis of the sales performance of certain products in one of their stores. As before, the analysis aims to predict sales performance, but this time taking the product type into account to see whether and how specific product types outperform others. The historical sales data therefore has to be analysed again, first to assess whether including 'product type' improves predictability and then to predict the sales volume of new products. The four product types are:
PCs, Laptops, Netbooks and Smartphones.
Additionally, Blackwell is interested in assessing the impact of service reviews and customer reviews on sales performance.

Procedure

The analysis followed the common data mining approach, consisting of the steps data exploration, pre-processing, modelling and optimization, prediction, and evaluation.
The analysis was carried out in the statistical programming language R with different packages (e.g. 'caret' was used to apply several prediction techniques).
In particular, a correlation matrix, Random Forest and an ANOVA test were applied to assess the importance of 'product type' and the other attributes.

Results

The predicted sales volumes, ordered from highest to lowest, are shown in Table 1.

Table 1: Pred. Volume and profit by brand

Product Type   Brand       Pred. Volume   Profit (in k$)
Netbook        Acer                 734               22
PC             Dell1                267               47
Smartphone     Samsung2             267                2
Laptop         Apple                225               27
Smartphone     Motorola1            181                4
Smartphone     Motorola2            167                6
PC             Dell2                152               26
Netbook        Asus                 137                7
Smartphone     HTC                  134                3
Netbook        hp                   101                3
Netbook        Samsung1              95                3
Laptop         Toshiba               77               14
Laptop         Razer                  0                0

Table 2: Total volume and profit by product category

Category      Total Pred. Volume   Profit (in k$)
Netbooks                    1067               35
Smartphones                  749               15
PCs                          419               73
Laptops                      302               41

The results show that, in terms of predicted sales, the best-selling product type would be Netbooks; within this category our model predicts by far the highest sales volume for the Acer netbook. Smartphones come in second, with the Samsung model being the most popular. PCs follow, and Laptops come last, close behind. However, the last two categories (PCs and Laptops) generate the highest profit.
Addressing the interest in incorporating Product Type into the analysis: it turned out that Product Type has no measurable impact when predicting sales volume.
Regarding the impact of service and customer reviews, we concluded the following: positive service reviews, followed by 4-star and 3-star customer reviews, had the highest impact in our prediction models.
Additionally, the predicted results agree with the previous analysis using RapidMiner.

Limitations

The dataset contains a limited number of products (80); a much larger number would increase the accuracy of the predictions. Some of the attributes of each product are barely relevant to our analysis; for instance, the product dimensions (width, height and depth) cannot predict sales in our case.
Similarly, the Best Sellers Rank is missing many values and had to be excluded from the analysis.

Recommendations

The results should be treated as estimates only and should contribute to decision-making indicatively.
For further analysis, we recommend expanding the dataset with more products and starting to record sales from now on, so that future analyses become more meaningful.

Technical documentation

Table of contents

  1. Data Exploration and Preprocessing
    1. Dealing with repeated products
    2. Removing useless columns
    3. Checking for outliers
    4. Scaling values
  2. Feature Engineering
    1. Dealing with non-numeric features (Dummifying)
    2. Dealing with Collinearity
  3. Modelling
  4. Analyzing and Plotting Errors
  5. Final Predictions

1. DATA EXPLORATION AND PREPROCESSING
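#The packages assumed throughout this documentation are loaded below. This is a sketch of the setup rather than part of the original script: the exact package providing createDummyFeatures() (here 'mlr') and mgsub() (here 'textclean') is an assumption.

library(dplyr)          # dplyr::select() and pipes
library(ggplot2)        # plots
library(caret)          # createDataPartition(), train(), trainControl(), findCorrelation(), postResample()
library(randomForest)   # randomForest(), varImpPlot(), importance()
library(corrplot)       # corrplot()
library(mlr)            # createDummyFeatures(); mlr also exports train(), hence caret::train() below
library(reshape2)       # melt()
library(textclean)      # mgsub() (assumption; other packages also provide an mgsub())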

1.1. Dealing with repeated products

#Because we have worked with these products before, and because the dataset is small, we can spot duplicates just by looking at the df itself. All 'ExtendedWarranty' products share the same values in every column (including the exact number of reviews of all types) except for a small difference in price. In business terms, we conclude this is because the same warranty is sold alongside different products and its price varies in proportion to the product's price. We should therefore treat them as the same product and remove all but one.

#Showing the 6 duplicates:
#IMPORTANT NOTE: It's good practice to remove columns by name instead of by column number, so that if we change the dataset later and reload these commands, we won't drop unintended columns.

sum(duplicated(ExistProd[,c(1,4:18)]))
## [1] 6
#Removing them, but first assigning the average price to the remaining warranty product:
#ExistProd <- ExistProd[-c(35:41),]
Duplicates <- ExistProd[c(32:41),]
Duplicates[3,3] <- sum(Duplicates[3:10,3])/8
Duplicates <- Duplicates[1:3,]
ExistProd  <- ExistProd[-c(32:41),]
ExistProd  <- rbind(ExistProd, Duplicates)
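#For reference, the same idea expressed without hard-coded row numbers: collapse rows that are identical in everything except ProductNum and Price, keeping one row per group with the mean price. This is an illustrative alternative to the index-based code above, not an additional step.

library(dplyr)
ExistProd_dedup <- ExistProd %>%
  group_by(across(-c(ProductNum, Price))) %>%      # group on all other columns
  summarise(ProductNum = first(ProductNum),        # keep one product number per group
            Price      = mean(Price),              # average the warranty prices
            .groups    = "drop")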

1.2. Removing useless columns

#We checked in the previous project that some columns were not useful and just added noise to our prediction model. The columns are:
#1. Product dimensions (Width, Depth and Height)
#2. Best Sellers Rank
#3. Shipping Weight
#4. Product Num (we don't want the id to add 'noise')

#We proceed to get rid of the aforementioned columns in both df (they don't need to be the same for prediction, but it's good practice)

ExistProd <- dplyr::select(ExistProd,-c(ProductNum,BestSellersRank,ShippingWeight,ProductDepth, ProductWidth, ProductHeight))

NewProd  <- dplyr::select(NewProd,-c(ProductNum,BestSellersRank,ShippingWeight,ProductDepth, ProductWidth, ProductHeight))
#BAD EXAMPLE: ExistProd <- ExistProd[,-c(2,12:16)]
            #NewProd    <- NewProd[,-c(2,12:16)]

1.3. Checking for outliers

#We find lots of outliers in almost all columns. Removing them all would drastically change our already very small dataset, so we don't remove all of them.
ggplot(stack(ExistProd), aes(x=ind, y=values)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

#But we should at least look more closely at the outliers of the dependent variable (Volume)
ggplot(ExistProd, aes(x=ProductType, y=Volume)) +geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

#We have a few outliers here as well. The built-in boxplot() function will tell us their values.
boxplot(ExistProd$Volume, plot = FALSE)$out
## [1]  2052  2140 11204  1896  7036  1684
#But we only want to remove the two extreme outliers because our dataset is small, and because these outliers don't belong to the product types we want to predict.
ExistProd <- ExistProd[-which(ExistProd$Volume > 7000),]
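#Hedged sketch of the same "remove only the extreme outliers" idea expressed as a rule rather than a manual 7000 cutoff: boxplot.stats() with coef = 3 returns only values lying more than 3*IQR beyond the hinges. It would be run before the filtering step above and may flag slightly more or fewer rows than the manual choice.

extreme <- boxplot.stats(ExistProd$Volume, coef = 3)$out   # "extreme" outliers only
#ExistProd <- ExistProd[!(ExistProd$Volume %in% extreme), ]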

1.4. Scaling values

#Since different variables have different units, we are going to scale all numeric variables (except the label, Volume):
for (i in c(2:11)){ExistProd[,i] <- scale(ExistProd[,i])}

#We do the same in NewProd df
for (i in c(2:11)){NewProd[,i] <- scale(NewProd[,i])}
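#An equivalent, name-based alternative to the two index loops above (a sketch assuming dplyr >= 1.0, not an additional step): selecting the columns to scale by type and name is more robust to later changes in column order.

library(dplyr)
ExistProd <- ExistProd %>%
  mutate(across(where(is.numeric) & !any_of("Volume"), ~ as.numeric(scale(.x))))
NewProd   <- NewProd %>%
  mutate(across(where(is.numeric) & !any_of("Volume"), ~ as.numeric(scale(.x))))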

2. FEATURE ENGINEERING

2.1. Dealing with non-numeric features (Dummifying)

#We are going to dummify (turn categorical values into binary indicator columns)
#We need to dummify Product Type only. We check if it's a factor first
str(ExistProd$ProductType)
##  Factor w/ 12 levels "Accessories",..: 7 7 7 5 5 1 1 1 1 1 ...
#The correlation matrix only works with numeric values, so we create a separate df (to avoid altering the current one) and run the correlation matrix on it. We do this before dummifying, because after dummification we'll have many more variables and the correlation matrix will be much harder to read.
ExistProd_corr <- ExistProd[,-c(1)]

CorrData2 <- cor(ExistProd_corr)
corrplot(CorrData2, order="hclust", method = "number",
        tl.col = "black", tl.srt = 90 , tl.cex = 0.7, tl.pos = "t")

#In order to find out the relationship between the label and the ProductType (categorical), we are going to use the ANOVA test.
ANOVA <- aov(ExistProd$Volume~ ExistProd$ProductType)
summary(ANOVA)
##                       Df   Sum Sq Mean Sq F value Pr(>F)
## ExistProd$ProductType 11  5075621  461420   1.473  0.166
## Residuals             59 18487579  313349
#Since the p-value is higher than 0.05, we fail to reject the Null Hypothesis that the mean Volume is the same across product types. Therefore, this variable is not useful for predicting Volume.

#Now we will confirm this with a Random Forest on dummy features
ExistProd2 <- createDummyFeatures(obj = ExistProd, cols = "ProductType")
NewProd2   <- createDummyFeatures(obj = NewProd, cols = "ProductType")

set.seed(107)

inTrain  <- createDataPartition(ExistProd2$Volume,p = 0.75, list = FALSE)
training <- ExistProd2[inTrain,]
testing  <- ExistProd2[-inTrain,]
ctrl     <- trainControl(method = "cv", number = 10)


mRF  <- randomForest(Volume ~., ExistProd2, ntree = 80)
plot (mRF, main = "Performance of RF depending on the number of trees")

ctrl <- trainControl(method="repeatedcv",repeats = 3, number = 5) #,classProbs=TRUE,summaryFunction = twoClassSummary)
mRF2 <- caret::train(Volume ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 2)

varImpPlot(mRF, main = "Variable importance of the RF model")
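#To read the importances as numbers rather than only from the plots, a quick sketch (IncNodePurity for the regression forest; varImp() for the caret-tuned model):

imp <- randomForest::importance(mRF)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
caret::varImp(mRF2)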

#As a conclusion, we get rid of Product Type
ExistProd$ProductType <- NULL



2.2. Dealing with Collinearity

#We now look for collinearity among the remaining numeric variables
CorrData <- cor(ExistProd)
CorNames <- findCorrelation(CorrData, cutoff = 0.9, verbose = TRUE, names = TRUE)
## Compare row 4  and column  3 with corr  0.907 
##   Means:  0.554 vs 0.374 so flagging column 4 
## Compare row 5  and column  6 with corr  0.979 
##   Means:  0.478 vs 0.343 so flagging column 5 
## Compare row 2  and column  11 with corr  1 
##   Means:  0.476 vs 0.3 so flagging column 2 
## All correlations <= 0.9
#The function above flags the columns we should remove because of collinearity, so we proceed to do that:
ExistProd <- ExistProd[, - which(names(ExistProd) %in% CorNames)]

#BAD PRACTICE
#ExistProd[ ,c(2,4,5)] <- list(NULL)
#NewProd[ ,c(2,4,5)]   <- list(NULL)
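#For consistency with section 1.2 (where the same columns were removed from both data frames), the flagged columns could also be dropped from NewProd by name; a brief sketch (optional here, since the final predictions below are made on NewProd2):

NewProd <- NewProd[, !(names(NewProd) %in% CorNames)]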

3. MODELLING

set.seed(107)

inTrain  <- createDataPartition(ExistProd2$Volume,p = 0.75, list = FALSE)
training <- ExistProd2[inTrain,]
testing  <- ExistProd2[-inTrain,]
ctrl     <- trainControl(method = "cv", number = 10)

nrow(training)
## [1] 55
nrow(testing)
## [1] 16
#The chosen models are looped below
Five_Models   <- c("lm", "rf", "knn", "svmLinear", "gbm")

#We create an empty variable so we can later assign all results to it
compare.model <- c()

#We run all models in one loop and collect the test-set metrics in 'compare.model'.
#Note that the loop overwrites 'model' on each iteration, so only the last fitted model object is kept; the metrics, however, are accumulated in 'compare.model'.
for (i in Five_Models){
  model         <- caret::train(Volume ~., data = training, method = i)
  pred          <- predict(model, newdata = testing)
  pred.metric   <- postResample(pred = pred, obs = testing$Volume)
  compare.model <- cbind(pred.metric, compare.model)
}
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1   313823.5317             nan     0.1000 40051.1907
##      2   281141.9833             nan     0.1000 36576.1443
##      3   243944.3453             nan     0.1000 25128.4651
##      4   227056.8211             nan     0.1000 18409.0107
##      5   211217.7925             nan     0.1000 19245.5370
##      6   189509.0779             nan     0.1000 17120.2941
##      7   175637.2247             nan     0.1000 2595.9761
##      8   171221.4563             nan     0.1000  372.4642
##      9   160521.7337             nan     0.1000 11451.3583
##     10   155619.3410             nan     0.1000  724.0868
##     20    97964.3675             nan     0.1000 3784.9466
##     40    71691.6121             nan     0.1000 -512.2182
##     50    68044.7417             nan     0.1000 -607.8096
#That's why we also store every fitted model in a list, so that we can later pick one of them for the final predictions
methods <- list()

#We will save them in the variable previously created
for (i in Five_Models) {
  methods[[i]] <- caret::train(Volume ~., data = training, method = i )
}
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1   307578.8280             nan     0.1000 41207.7924
##      2   286305.2500             nan     0.1000 19030.2813
##      3   261112.1845             nan     0.1000 28267.1541
##      4   241705.6523             nan     0.1000 14618.5178
##      5   204843.1469             nan     0.1000 26150.5048
##      6   185556.7377             nan     0.1000 18609.4629
##      7   178005.6144             nan     0.1000 7763.0654
##      8   159496.5745             nan     0.1000 13674.8079
##      9   139299.9802             nan     0.1000 9006.8488
##     10   129438.4131             nan     0.1000 5994.8550
##     20    83051.1139             nan     0.1000 -1019.0330
##     40    65818.0608             nan     0.1000 -1494.6146
##     60    58964.5250             nan     0.1000 -799.1243
##     80    54570.5628             nan     0.1000 -182.9114
##    100    50063.0809             nan     0.1000 -689.4158
##    120    44184.3397             nan     0.1000 -860.5722
##    140    42799.7584             nan     0.1000 -209.3923
##    150    40323.4292             nan     0.1000 -518.1395
methods
## $lm
## Linear Regression 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results:
## 
##   RMSE          Rsquared  MAE         
##   5.406146e-13  1         3.916189e-13
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## $rf
## Random Forest 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    310.5234  0.8459020  201.14610
##   12    180.6845  0.9361108   96.03043
##   22    150.1520  0.9521491   77.92924
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 22.
## 
## $knn
## k-Nearest Neighbors 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  250.5506  0.8560911  160.4576
##   7  254.6291  0.8658619  169.5987
##   9  271.0767  0.8670381  183.6637
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
## 
## $svmLinear
## Support Vector Machines with Linear Kernel 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   507.7142  0.8739201  306.0704
## 
## Tuning parameter 'C' was held constant at a value of 1
## 
## $gbm
## Stochastic Gradient Boosting 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      319.8130  0.7433712  222.8188
##   1                  100      317.0902  0.7356551  227.5105
##   1                  150      322.5018  0.7255873  232.8695
##   2                   50      317.0595  0.7405268  219.8445
##   2                  100      317.3141  0.7307384  226.8979
##   2                  150      315.9586  0.7289831  227.2538
##   3                   50      316.2784  0.7406168  219.6941
##   3                  100      317.0800  0.7363365  225.7095
##   3                  150      318.8884  0.7271765  227.6880
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
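#The listing above is long; if we just want the best resampled RMSE of each model in the list, a one-line sketch (assuming caret's standard 'results' element on each train object):

sapply(methods, function(m) min(m$results$RMSE))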
colnames(compare.model) <- Five_Models
compare.model
##                  lm          rf         knn  svmLinear          gbm
## RMSE     295.757896 359.1920424 214.8119177 82.8867239 1.150406e-12
## Rsquared   0.690411   0.9533011   0.8578734  0.9829067 1.000000e+00
## MAE      209.847853 125.9877635 157.9000000 46.9278500 4.392184e-13

4. ANALYZING AND PLOTTING ERRORS

compare.model.melt <- melt(compare.model, varnames = c("metric", "model"))
compare.model.melt <- as.data.frame(compare.model.melt)
compare.model.melt
##      metric     model        value
## 1      RMSE        lm 2.957579e+02
## 2  Rsquared        lm 6.904110e-01
## 3       MAE        lm 2.098479e+02
## 4      RMSE        rf 3.591920e+02
## 5  Rsquared        rf 9.533011e-01
## 6       MAE        rf 1.259878e+02
## 7      RMSE       knn 2.148119e+02
## 8  Rsquared       knn 8.578734e-01
## 9       MAE       knn 1.579000e+02
## 10     RMSE svmLinear 8.288672e+01
## 11 Rsquared svmLinear 9.829067e-01
## 12      MAE svmLinear 4.692785e+01
## 13     RMSE       gbm 1.150406e-12
## 14 Rsquared       gbm 1.000000e+00
## 15      MAE       gbm 4.392184e-13
#The models give quite different results, which suggests the predictions will vary considerably between models.
#Based on the test-set errors, svmLinear stands out with the lowest MAE and RMSE apart from gbm, whose near-zero errors and R-squared of 1 amount to a perfect fit that we are reluctant to trust on such a small test set. svmLinear also has the second-highest R-squared, so we choose this model.
for(i in c("RMSE","Rsquared","MAE")) {
  metric <-  compare.model.melt %>%  filter(metric == i)
  gg     <- ggplot(metric, aes(model,value))
  print(gg + geom_bar(stat = "identity") + ggtitle(i))
}

5. FINAL PREDICTIONS

predictions_svm <- predict(methods$svmLinear, NewProd2)
predictions_svm
##          1          2          3          4          5          6 
##  306.10722  203.20065  227.05095   43.66592  -48.88348  132.54230 
##          7          8          9         10         11         12 
##  767.52884  142.75748   79.33107  684.13076 2504.82682  325.20232 
##         13         14         15         16         17         18 
##  394.90194  180.11386  222.16805 1059.01335   97.12584  165.24101 
##         19         20         21         22         23         24 
##  154.88560  170.53507  234.14689  133.09492   37.65265 2598.67423
#Adding our predictions to the new products data
Prod_Sales_Pred                 <- NewProd_Raw[,-c(3:ncol(NewProd_Raw))]   # keep only ProductType and ProductNum
Prod_Sales_Pred$predictions_svm <- predictions_svm
Prod_Sales_Pred                 <- Prod_Sales_Pred[-c(10:11,16:24),]       # keep the 13 products of the four requested product types


Prod_Sales_Pred$ProductNum <- mgsub(x = Prod_Sales_Pred$ProductNum,pattern = c(171,172,173,175,176,178,180,181,183,193,194,195,196), replacement = c("Dell1","Dell2","Apple","Toshiba","Razer","hp","Acer","Asus","Samsung1","Motorola1","Samsung2","HTC","Motorola2"))

#We can observe how sales vary among the different products
ggplot(Prod_Sales_Pred, aes(x=ProductNum,y=predictions_svm, fill=ProductType)) + geom_col() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
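#The profit figures reported in Tables 1 and 2 of the executive summary are derived from these predicted volumes. A hedged sketch of that calculation, assuming (not shown above) that NewProd_Raw contains 'Price' and 'ProfitMargin' columns and that the margin is stored as a fraction:

idx <- as.integer(rownames(Prod_Sales_Pred))               # original row positions in NewProd_Raw
Prod_Sales_Pred$Profit_k <- round(Prod_Sales_Pred$predictions_svm *
                                  NewProd_Raw$Price[idx] *
                                  NewProd_Raw$ProfitMargin[idx] / 1000)    # profit in k$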