Executive summary

Case

The sales team of Blackwell Electronics ask for another analysis regarding sales performance of certain products in one of their stores. This time, the analysis aims again at the prediction of sales performance but taking the product type into account to see whether and how specific product types outperform others. Therefore, the historical sales data on sales volume has to be analysed again to assesses whether the inclusion of ‘product type’ has positive impact on predictability and finally to predict sales volume of new products. The four product types are:
PC, Laptops, Netbooks and Smartphones.
Additionally, Blackwell expresses their interest in assessing the impact of service reviews and customer reviews on sales performance.

Procedure

The analysis followed the common data mining approach containing data the steps exploration, pre-processing, modelling and optimization, prediction, and evaluation.
For the analysis the statistical programming language R with different packages (e.g. ‘caret’ was used to apply several prediction techniques).
In particular, correlation Matrix, Random Forest and ANOVA test are applied to assess the importance of ‘product type’ and other attributes.

Results

The predicted sales volume is ordered from highest to lowest sales volume in Table 1.

Table 1: Pred. Volume and profit by brand

Product Type Brand Pred. Volume Profit (in k$)
Netbook Acer 734 22
PC Dell1 267 47
Smartphone Samsung2 267 2
Laptop Apple 225 27
Smartphone Motorola1 181 4
Smartphone Motorola2 167 6
PC Dell2 152 26
Netbook Asus 137 7
Smartphone HTC 134 3
Netbook hp 101 3
Netbook Samsung1 95 3
Laptop Toshiba 77 14
Laptop Razer 0 0

Table 2 - Total volume and profit by product categories

Category Total Sales Profit (in k$)
Netbooks 1067 35
Smartphones 749 15
PC 419 73
Laptop 302 41

The results show that, in terms of sales, the most sold product type would be Netbooks. In this category, our model has predicted the highest sales volume by far to be the Acer Netbook. Following in categories, Smartphones come in second, with Samsung being the most popular. Next are PCs, and last are laptops, coming close. However, the last two (PCs and Laptops) have the highest profit.
Addressing the interest in incorporating Product Type into our analysis, it unveiled that there is no impact on using the mentioned Product Type when we are trying to predict sales.
As far as the impact is on service and customer reviews, we concluded the following: The Positive Reviews, followed by the 4 and 3 stars had the highest impact on our prediction models.
Additionally, the predicted results agree with the previous analysis using RapidMiner.

Limitations

The dataset has limited products (80). A much larger number would increase the accuracy of the prediction. Some of the parameters of each product are barely relevant for our analysis. For instance, the dimensions (width, height and depth) cannot predict sales in our case.
Similarly, the Best Sellers Rank is missing many values and had to be excluded for analysis.

Recommendations

Consider the results to be only estimates and should only contribute to decision-making indicatively.
For further analysis, an expanded dataset in terms of products and a starting to record sales from now on with the purpose of contributing to a more meaningful analysis.

Technical documentation

Table of contents

  1. Data Exploration and preprocessing
    1. Dealing with repeated products
    2. Removing useless columns
    3. Checking for outliers
  2. Feature Engineering
  3. Modelling
  4. Analyzing and Plotting Errors
  5. Final Predictions

1.DATA EXPLORATION AND PREPROCESSING

1.1.Dealing with repeated products

#Because we have worked with these products before, and because the dataset is small, we can observe duplicates just by looking at the df itself. We find out all 'ExtendedWarranty' products share the same values in everything (including exact number of reviews of all types) except for a small different in price. In a business sense, We conclude this is because warranties are sold alongside different products, only the same type of warranty is included, but will vary in price proportionally to the price itself. We should treat them as the same products and remove all but one.

#Showing the 6 duplicates:
#IMPORTANT NOTE: It's good practice to remove the columns by name instead of by column number, so that if we do changes in dataset later, and we reload these commands, we won't mess up unwanted columns.

sum(duplicated(ExistProd[,c(1,4:18)]))
## [1] 6
#Removing them, but setting an average price for the remaining product of Warranty:
#ExistProd <- ExistProd[-c(35:41),]
Duplicates <- ExistProd[c(32:41),]
Duplicates[3,3] <- sum(Duplicates[3:10,3])/8
Duplicates <- Duplicates[1:3,]
ExistProd  <- ExistProd[-c(32:41),]
ExistProd  <- rbind(ExistProd, Duplicates)

1.2.Removing useless columns

#We checked in the previous project that some columns were not useful and just added noise to our prediction model. The columns are:
#1. Product dimensions (width, Depth and Height)
#2. Best Sellers Rank
#3. Shipping Weight
#4. Product Num (we don't want the id to add 'noise')

#We proceed to get rid of the aforementioned columns in both df (they don't need to be the same for prediction, but it's good practice)

ExistProd <- dplyr::select(ExistProd,-c(ProductNum,BestSellersRank,ShippingWeight,ProductDepth, ProductWidth, ProductHeight))

NewProd  <- dplyr::select(NewProd,-c(ProductNum,BestSellersRank,ShippingWeight,ProductDepth, ProductWidth, ProductHeight))
#BAD EXAMPLE: ExistProd <- ExistProd[,-c(2,12:16)]
            #NewProd    <- NewProd[,-c(2,12:16)]

1.3.Checking for outliers

#We find lots of outliers in almost all columns. If we remove them, it will drastically change our dataset (already very small), and so we don't remove all of them.
ggplot(stack(ExistProd), aes(x=ind, y=values)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

#But we should look more in depth at least in the outliers of the dependent column
ggplot(ExistProd, aes(x=ProductType, y=Volume)) +geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

#We have few outliers here as well. The R Built-in function will tell us the values.
boxplot(ExistProd$Volume, plot = FALSE)$out
## [1]  2052  2140 11204  1896  7036  1684
#But we only want to remove the two extreme outliers because our dataset is small, and because these outliers don't belong to the product types we want to predict.
ExistProd <- ExistProd[-which(ExistProd$Volume > 7000),]

1.4.Scaling values

#Since different variables have different units, we are going to scale all numeric variables (except label):
for (i in c(2:11)){ExistProd[,i] <- scale(ExistProd[,i])}

#We do the same in NewProd df
for (i in c(2:11)){NewProd[,i] <- scale(NewProd[,i])}

2. FEATURE ENGINEERING

2.1.Dealing with non-numeric features (Dummifying)

#We are going to dummify(turn categorical values into logical ones)
#We need to dummify Product Type only. We check if it's a factor first
str(ExistProd$ProductType)
##  Factor w/ 12 levels "Accessories",..: 7 7 7 5 5 1 1 1 1 1 ...
#The Correlation Matrix is only going to work with numeric values, so we create another df as to avoid messing with the current one, so that we can run a correlation matrix. We do this before dummifying, because once we dummify we'll have lots of variables and the correlation matrix will be way harder to read.
ExistProd_corr <- ExistProd[,-c(1)]

CorrData2 <- cor(ExistProd_corr)
corrplot(CorrData2, order="hclust", method = "number",
        tl.col = "black", tl.srt = 90 , tl.cex = 0.7, tl.pos = "t")

#In order to find out the relationship between the label and the ProductType (categorical), we are going to use the ANOVA test.
ANOVA <- aov(ExistProd$Volume~ ExistProd$ProductType)
summary(ANOVA)
##                       Df   Sum Sq Mean Sq F value Pr(>F)
## ExistProd$ProductType 11  5075621  461420   1.473  0.166
## Residuals             59 18487579  313349
#Since the p-value is higher than 0.05, we reject the Null Hypothesis, which states that both variables come from the same population. Therefore, it's not useful to use this variable to predict Volume.

#Now we will confirm this with a Random Forest w/ dummyfeatures
ExistProd2 <- createDummyFeatures(obj = ExistProd, cols = "ProductType")
NewProd2   <- createDummyFeatures(obj = NewProd, cols = "ProductType")

set.seed(107)

inTrain  <- createDataPartition(ExistProd2$Volume,p = 0.75, list = FALSE)
training <- ExistProd2[inTrain,]
testing  <- ExistProd2[-inTrain,]
ctrl     <- trainControl(method = "cv", number = 10)


mRF  <- randomForest(Volume ~., ExistProd2, ntree = 80)
plot (mRF, main = "Performance of RF depending on the number of trees")

ctrl <- trainControl(method="repeatedcv",repeats = 3, number = 5) #,classProbs=TRUE,summaryFunction = twoClassSummary)
mRF2 <- caret::train(Volume ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 2)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut
## = 10, : These variables have zero variances: ProductType.GameConsole
varImpPlot(mRF, main = "Performance of RF depending on the number of trees")

#As a conclusion, we get rid of Product Type
ExistProd$ProductType <- NULL


#ExistProd2 <- createDummyFeatures(obj = ExistProd, cols = "ProductType")
#NewProd2   <- createDummyFeatures(obj = NewProd, cols = "ProductType")

2.2.Dealing with Collinearity

#We now look for collinearity in the newly dummified df
CorrData <- cor(ExistProd)
CorNames <- findCorrelation(CorrData, cutoff = 0.9, verbose = TRUE, names = TRUE)
## Compare row 4  and column  3 with corr  0.907 
##   Means:  0.554 vs 0.374 so flagging column 4 
## Compare row 5  and column  6 with corr  0.979 
##   Means:  0.478 vs 0.343 so flagging column 5 
## Compare row 2  and column  11 with corr  1 
##   Means:  0.476 vs 0.3 so flagging column 2 
## All correlations <= 0.9
#The above function recommends what columns we should erase based on collinearity. So we proceed to do that:
ExistProd <- ExistProd[, - which(names(ExistProd) %in% CorNames)]

#BAD PRACTICE
#ExistProd[ ,c(2,4,5)] <- list(NULL)
#NewProd[ ,c(2,4,5)]   <- list(NULL)

3. Modelling

set.seed(107)

inTrain  <- createDataPartition(ExistProd2$Volume,p = 0.75, list = FALSE)
training <- ExistProd2[inTrain,]
testing  <- ExistProd2[-inTrain,]
ctrl     <- trainControl(method = "cv", number = 10)

nrow(training)
## [1] 55
nrow(testing)
## [1] 16
#The chosen models are looped below
Five_Models   <- c("lm", "rf", "knn", "svmLinear", "gbm")

#We create an empty variable so we can later assign all results to it
compare.model <- c()

#We run all models at once and assign the results in 'compare.model'.
#However, now the loop is overwriting each model as it goes through them, so it will only keep the last one saved.
for (i in Five_Models){
  model         <- caret::train(Volume ~., data = training, method = i, )
  pred          <- predict(model, newdata = testing)
  pred.metric   <- postResample(testing$Volume, obs = pred)
  compare.model <- cbind(pred.metric, compare.model)
}
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1   303176.4348             nan     0.1000 40462.3957
##      2   272477.1231             nan     0.1000 32410.1335
##      3   248886.5191             nan     0.1000 26451.8675
##      4   219753.3517             nan     0.1000 17516.6251
##      5   199465.8015             nan     0.1000 18759.1620
##      6   184440.3625             nan     0.1000 12458.1945
##      7   177159.9269             nan     0.1000 7471.0441
##      8   167689.6991             nan     0.1000 11325.1351
##      9   160824.8732             nan     0.1000 7906.7223
##     10   152524.7895             nan     0.1000 8120.2568
##     20   118293.4321             nan     0.1000 2588.7671
##     40    84604.2931             nan     0.1000 -419.9515
##     50    79927.8822             nan     0.1000  426.1485
#That's why we create a list with all models so that we can later choose the model
methods <- list()

#We will save them in the variable previously created
for (i in Five_Models) {
  methods[[i]] <- caret::train(Volume ~., data = training, method = i )
}
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1   306825.8840             nan     0.1000 40990.9882
##      2   273575.5778             nan     0.1000 31439.9033
##      3   244391.8178             nan     0.1000 25477.7753
##      4   213145.3700             nan     0.1000 23615.1157
##      5   192634.3575             nan     0.1000 21341.3448
##      6   178418.6702             nan     0.1000 9397.9167
##      7   168919.2731             nan     0.1000 10973.9090
##      8   164326.1448             nan     0.1000 4768.9323
##      9   152837.0199             nan     0.1000 10086.1343
##     10   143640.9480             nan     0.1000 9912.1428
##     20   108881.9898             nan     0.1000 1317.7514
##     40    85196.5922             nan     0.1000 -3028.3143
##     50    82590.4110             nan     0.1000 -771.6307
methods
## $lm
## Linear Regression 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results:
## 
##   RMSE                 Rsquared  MAE                  
##   0.00000000000119855  1         0.0000000000004984662
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## $rf
## Random Forest 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    295.3924  0.8501766  195.52305
##   12    161.4392  0.9456535   83.72705
##   22    142.8953  0.9557016   73.59713
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 22.
## 
## $knn
## k-Nearest Neighbors 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  285.9488  0.8475635  183.1795
##   7  300.7292  0.8717571  191.4858
##   9  322.9252  0.8747862  207.6705
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
## 
## $svmLinear
## Support Vector Machines with Linear Kernel 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   558.2278  0.8263847  336.3741
## 
## Tuning parameter 'C' was held constant at a value of 1
## 
## $gbm
## Stochastic Gradient Boosting 
## 
## 55 samples
## 22 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      327.2453  0.7341861  221.1909
##   1                  100      329.7403  0.7267396  231.7280
##   1                  150      337.1263  0.7152389  241.4511
##   2                   50      336.3849  0.7307282  228.8012
##   2                  100      341.1190  0.7156439  240.9551
##   2                  150      345.5710  0.7062829  246.0999
##   3                   50      333.5261  0.7325947  224.1092
##   3                  100      335.1053  0.7229182  236.9917
##   3                  150      340.5911  0.7099806  247.3737
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 1, shrinkage = 0.1 and n.minobsinnode = 10.
colnames(compare.model) <- Five_Models
compare.model
##                   lm          rf         knn   svmLinear
## RMSE     261.3892308 507.6748333 213.8073900 116.1175572
## Rsquared   0.7631923   0.8111341   0.8573653   0.9557089
## MAE      189.5483306 282.5984336 104.9000000  64.6621250
##                             gbm
## RMSE     0.00000000000010835913
## Rsquared 1.00000000000000000000
## MAE      0.00000000000008451837

4.Analyzing and plotting errors

compare.model.melt <- melt(compare.model, varnames = c("metric", "model"))
compare.model.melt <- as.data.frame(compare.model.melt)
compare.model.melt
##      metric     model                    value
## 1      RMSE        lm 261.38923076776160314694
## 2  Rsquared        lm   0.76319231883129745597
## 3       MAE        lm 189.54833064362532013547
## 4      RMSE        rf 507.67483326984211089439
## 5  Rsquared        rf   0.81113408132053355093
## 6       MAE        rf 282.59843355652418495083
## 7      RMSE       knn 213.80738995647459432803
## 8  Rsquared       knn   0.85736534432595234989
## 9       MAE       knn 104.89999999999999147349
## 10     RMSE svmLinear 116.11755715242007624965
## 11 Rsquared svmLinear   0.95570893357672337398
## 12      MAE svmLinear  64.66212500000003160494
## 13     RMSE       gbm   0.00000000000010835913
## 14 Rsquared       gbm   1.00000000000000000000
## 15      MAE       gbm   0.00000000000008451837
#All models offer quite different results, which indicates the predictions will have high variance between models.  
#Based on errors, the SVM stands out in MAE (smaller error than others). RMSE is pretty similar in all. As for the RSquared, RF performes better than SVM here, but SVM comes in second, so we choose this model.
for(i in c("RMSE","Rsquared","MAE")) {
  metric <-  compare.model.melt %>%  filter(metric == i)
  gg     <- ggplot(metric, aes(model,value))
  print(gg + geom_bar(stat = "identity") + ggtitle(i))
}

5.Final Predictions

predictions_svm <- predict(methods$svmLinear, NewProd2)
predictions_svm
##         1         2         3         4         5         6         7 
## 124.30139 107.98660 122.76846  74.39960  45.70958  97.43330 260.85973 
##         8         9        10        11        12        13        14 
## 142.03468  82.34863 220.68163 708.08504 129.82936 170.79756  99.08641 
##        15        16        17        18        19        20        21 
## 133.68933 263.95151  90.67295 105.75206  86.92859 107.73372  67.57322 
##        22        23        24 
##  87.96442  59.24840 499.01275
#Adding our predictions to the test set
Prod_Sales_Pred                 <- NewProd_Raw[,-c(3:nrow(NewProd_Raw))]
Prod_Sales_Pred$predictions_svm <- predictions_svm
Prod_Sales_Pred                 <- Prod_Sales_Pred[-c(10:11,16:24),]


Prod_Sales_Pred$ProductNum <- mgsub(x = Prod_Sales_Pred$ProductNum,pattern = c(171,172,173,175,176,178,180,181,183,193,194,195,196), replacement = c("Dell1","Dell2","Apple","Toshiba","Razer","hp","Acer","Asus","Samsung1","Motorola1","Samsung2","HTC","Motorola2"))

#We can observe how sales vary among the different products
ggplot(Prod_Sales_Pred, aes(x=ProductNum,y=predictions_svm, fill=ProductType)) + geom_col() + theme(axis.text.x = element_text(angle = 45, hjust = 1))