The sales team of Blackwell Electronics asks for another analysis of the sales performance of certain products in one of their stores. This time, the analysis again aims at predicting sales performance, but takes the product type into account to see whether and how specific product types outperform others. Therefore, the historical data on sales volume has to be analysed again to assess whether the inclusion of ‘product type’ has a positive impact on predictability, and finally to predict the sales volume of new products. The four product types are:
PCs, Laptops, Netbooks and Smartphones.
Additionally, Blackwell expresses their interest in assessing the impact of service reviews and customer reviews on sales performance.
Procedure
The analysis followed the common data mining approach comprising the steps data exploration, pre-processing, modelling and optimization, prediction, and evaluation.
The analysis was carried out in the statistical programming language R with several packages; ‘caret’, for example, was used to apply the different prediction techniques.
In particular, a correlation matrix, a Random Forest variable-importance measure and an ANOVA test were applied to assess the importance of ‘product type’ and the other attributes.
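For illustration, here is a minimal sketch of these three checks. It assumes a hypothetical data frame sales with numeric attributes, a numeric Volume label and a categorical ProductType; the actual code appears further below.

cor(sales[sapply(sales, is.numeric)])              #pairwise correlations of the numeric attributes
summary(aov(Volume ~ ProductType, data = sales))   #ANOVA: does mean Volume differ across product types?
rf <- randomForest::randomForest(Volume ~ ., data = sales)
randomForest::importance(rf)                       #Random Forest variable importance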
Results
Table 1 lists the predicted sales volume per product, ordered from highest to lowest.
Table 1: Pred. Volume and profit by brand
| Product Type | Brand | Pred. Volume | Profit (in k$) |
|---|---|---|---|
| Netbook | Acer | 734 | 22 |
| PC | Dell1 | 267 | 47 |
| Smartphone | Samsung2 | 267 | 2 |
| Laptop | Apple | 225 | 27 |
| Smartphone | Motorola1 | 181 | 4 |
| Smartphone | Motorola2 | 167 | 6 |
| PC | Dell2 | 152 | 26 |
| Netbook | Asus | 137 | 7 |
| Smartphone | HTC | 134 | 3 |
| Netbook | hp | 101 | 3 |
| Netbook | Samsung1 | 95 | 3 |
| Laptop | Toshiba | 77 | 14 |
| Laptop | Razer | 0 | 0 |
Table 2: Total pred. volume and profit by product category
| Category | Total Pred. Volume | Profit (in k$) |
|---|---|---|
| Netbooks | 1067 | 35 |
| Smartphones | 749 | 15 |
| PC | 419 | 73 |
| Laptop | 302 | 41 |
The results show that, in terms of units sold, the best-selling product type would be Netbooks; within this category, our model predicts by far the highest sales volume for the Acer netbook. Smartphones come second, with Samsung being the most popular. PCs follow, with Laptops close behind in last place. However, these last two categories (PCs and Laptops) yield the highest profit.
Addressing the interest in incorporating product type into our analysis: it turned out that including ‘product type’ has no impact on the prediction of sales.
As for the impact of service and customer reviews, we concluded the following: the positive service reviews, followed by the 4-star and 3-star customer reviews, had the highest impact in our prediction models.
Additionally, the predicted results agree with the previous analysis using RapidMiner.
Limitations
The dataset contains only a limited number of products (80); a much larger number would increase the accuracy of the prediction. Some of the attributes of each product are barely relevant for our analysis; for instance, the product dimensions (width, height and depth) cannot predict sales in our case.
Similarly, the Best Sellers Rank is missing many values and had to be excluded from the analysis.
Recommendations
The results should be considered estimates only and should contribute to decision-making merely as an indication.
For further analysis, we recommend expanding the dataset in terms of products and starting to record sales from now on, in order to enable a more meaningful analysis.
#Because we have worked with these products before, and because the dataset is small, we can spot duplicates just by looking at the df itself. It turns out all 'ExtendedWarranty' products share the same values in every attribute (including the exact number of reviews of all types) except for a small difference in price. In a business sense, we conclude this is because warranties are sold alongside different products: it is the same type of warranty, but its price varies proportionally to the product's price. We should treat them as the same product and remove all but one.
#Counting the 6 duplicates (ignoring the columns that legitimately differ between them):
#IMPORTANT NOTE: It's good practice to remove columns by name instead of by column number, so that if we make changes to the dataset later and reload these commands, we won't drop unwanted columns.
sum(duplicated(ExistProd[,c(1,4:18)]))
## [1] 6
#Removing them, but setting an average price for the remaining Warranty product:
#ExistProd <- ExistProd[-c(35:41),]
Duplicates <- ExistProd[c(32:41),]            #the block of rows containing the warranties
Duplicates[3,3] <- sum(Duplicates[3:10,3])/8  #mean price of the 8 warranty rows, stored in the one we keep
Duplicates <- Duplicates[1:3,]                #keep the first three rows, i.e. a single warranty row
ExistProd <- ExistProd[-c(32:41),]            #drop the whole block from the df...
ExistProd <- rbind(ExistProd, Duplicates)     #...and append the de-duplicated rows
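#A name-based alternative (just a sketch, assuming the price column is named 'Price' and the warranties carry ProductType == 'ExtendedWarranty'; it follows the note above about avoiding hard-coded row numbers):
#warranties <- dplyr::filter(ExistProd, ProductType == "ExtendedWarranty")
#warranties$Price[1] <- mean(warranties$Price)     #one row kept, carrying the average price
#ExistProd <- rbind(dplyr::filter(ExistProd, ProductType != "ExtendedWarranty"), warranties[1,])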
#We checked in the previous project that some columns were not useful and just added noise to our prediction model. These columns are:
#1. Product dimensions (Width, Depth and Height)
#2. Best Sellers Rank
#3. Shipping Weight
#4. Product Num (we don't want the ID to add 'noise')
#We proceed to get rid of these columns in both dfs (they don't need to be identical for prediction, but it's good practice)
ExistProd <- dplyr::select(ExistProd,-c(ProductNum,BestSellersRank,ShippingWeight,ProductDepth, ProductWidth, ProductHeight))
NewProd <- dplyr::select(NewProd,-c(ProductNum,BestSellersRank,ShippingWeight,ProductDepth, ProductWidth, ProductHeight))
#BAD EXAMPLE: ExistProd <- ExistProd[,-c(2,12:16)]
#NewProd <- NewProd[,-c(2,12:16)]
#We find lots of outliers in almost all columns. Removing them all would drastically change our already very small dataset, so we don't remove all of them.
ggplot(stack(ExistProd), aes(x=ind, y=values)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
#But we should at least take a closer look at the outliers of the dependent variable
ggplot(ExistProd, aes(x=ProductType, y=Volume)) +geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
#We have a few outliers here as well. The built-in boxplot function will tell us their values.
boxplot(ExistProd$Volume, plot = FALSE)$out
## [1] 2052 2140 11204 1896 7036 1684
#But we only want to remove the two extreme outliers because our dataset is small, and because these outliers don't belong to the product types we want to predict.
ExistProd <- ExistProd[-which(ExistProd$Volume > 7000),]
#Since different variables have different units, we are going to scale all numeric variables (except label):
for (i in c(2:11)){ExistProd[,i] <- scale(ExistProd[,i])}
#We do the same in NewProd df
for (i in c(2:11)){NewProd[,i] <- scale(NewProd[,i])}
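#A hedged aside: scaling NewProd with its own means/sds can differ from the scaling of the training data. caret::preProcess, fitted on ExistProd and applied to both dfs, would keep them consistent (a sketch, assuming columns 2:11 are the numeric predictors in both dfs):
#pp <- caret::preProcess(ExistProd[,c(2:11)], method = c("center","scale"))
#ExistProd[,c(2:11)] <- predict(pp, ExistProd[,c(2:11)])
#NewProd[,c(2:11)] <- predict(pp, NewProd[,c(2:11)])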
#We are going to dummify (turn categorical values into binary indicator variables) the data
#We only need to dummify Product Type. We check whether it's a factor first
str(ExistProd$ProductType)
## Factor w/ 12 levels "Accessories",..: 7 7 7 5 5 1 1 1 1 1 ...
#The correlation matrix only works with numeric values, so we create another df (to avoid messing with the current one) on which we can run it. We do this before dummifying, because once we dummify we'll have many more variables and the correlation matrix will be much harder to read.
ExistProd_corr <- ExistProd[,-c(1)]
CorrData2 <- cor(ExistProd_corr)
corrplot(CorrData2, order="hclust", method = "number",
tl.col = "black", tl.srt = 90 , tl.cex = 0.7, tl.pos = "t")
#In order to find out the relationship between the label and the ProductType (categorical), we are going to use the ANOVA test.
ANOVA <- aov(ExistProd$Volume~ ExistProd$ProductType)
summary(ANOVA)
## Df Sum Sq Mean Sq F value Pr(>F)
## ExistProd$ProductType 11 5075621 461420 1.473 0.166
## Residuals 59 18487579 313349
#Since the p-value is higher than 0.05, we fail to reject the Null Hypothesis, which states that the mean Volume is the same across all product types. In other words, Product Type shows no significant relationship with Volume, so it's not useful for predicting it.
#Now we will confirm this with a Random Forest with dummy features
ExistProd2 <- createDummyFeatures(obj = ExistProd, cols = "ProductType")
NewProd2 <- createDummyFeatures(obj = NewProd, cols = "ProductType")
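#An equivalent with caret (a sketch): dummyVars builds the encoder once and predict() applies it to any df, which guarantees identical dummy columns in both sets:
#dv <- caret::dummyVars(~ ProductType, data = ExistProd)
#ExistProd2 <- cbind(ExistProd[, names(ExistProd) != "ProductType"], predict(dv, ExistProd))
#NewProd2 <- cbind(NewProd[, names(NewProd) != "ProductType"], predict(dv, NewProd))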
set.seed(107)
inTrain <- createDataPartition(ExistProd2$Volume,p = 0.75, list = FALSE)
training <- ExistProd2[inTrain,]
testing <- ExistProd2[-inTrain,]
ctrl <- trainControl(method = "cv", number = 10)
mRF <- randomForest(Volume ~., ExistProd2, ntree = 80)
plot (mRF, main = "Performance of RF depending on the number of trees")
ctrl <- trainControl(method="repeatedcv",repeats = 3, number = 5) #,classProbs=TRUE,summaryFunction = twoClassSummary)
mRF2 <- caret::train(Volume ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 2)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut
## = 10, : These variables have zero variances: ProductType.GameConsole
varImpPlot(mRF, main = "Variable importance in the Random Forest model")
#As a conclusion, we get rid of Product Type
ExistProd$ProductType <- NULL
#ExistProd2 <- createDummyFeatures(obj = ExistProd, cols = "ProductType")
#NewProd2 <- createDummyFeatures(obj = NewProd, cols = "ProductType")
#We now look for collinearity among the remaining (all-numeric) variables
CorrData <- cor(ExistProd)
CorNames <- findCorrelation(CorrData, cutoff = 0.9, verbose = TRUE, names = TRUE)
## Compare row 4 and column 3 with corr 0.907
## Means: 0.554 vs 0.374 so flagging column 4
## Compare row 5 and column 6 with corr 0.979
## Means: 0.478 vs 0.343 so flagging column 5
## Compare row 2 and column 11 with corr 1
## Means: 0.476 vs 0.3 so flagging column 2
## All correlations <= 0.9
#The above function recommends which columns we should erase based on collinearity, so we proceed to do that:
ExistProd <- ExistProd[, - which(names(ExistProd) %in% CorNames)]
#BAD PRACTICE
#ExistProd[ ,c(2,4,5)] <- list(NULL)
#NewProd[ ,c(2,4,5)] <- list(NULL)
set.seed(107)
inTrain <- createDataPartition(ExistProd2$Volume,p = 0.75, list = FALSE)
training <- ExistProd2[inTrain,]
testing <- ExistProd2[-inTrain,]
ctrl <- trainControl(method = "cv", number = 10)
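#Note (an observation, not a fix): the partition above is built on ExistProd2, which was dummified before ProductType and the collinear columns were dropped from ExistProd. ExistProd2 therefore still contains the dummy columns and the variable pair flagged above with a correlation of 1, which would explain near-perfect fits in some of the models below.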
nrow(training)
## [1] 55
nrow(testing)
## [1] 16
#The chosen models are looped below
Five_Models <- c("lm", "rf", "knn", "svmLinear", "gbm")
#We create an empty variable so we can later assign all results to it
compare.model <- c()
#We run all models at once and store the test-set metrics in 'compare.model'.
#Note that 'model' is overwritten on every pass, so only the metrics survive; the last fitted model is the only one left in memory.
#Also note that cbind(pred.metric, compare.model) prepends each new column, so the columns end up in the reverse order of Five_Models.
for (i in Five_Models){
model <- caret::train(Volume ~., data = training, method = i)
pred <- predict(model, newdata = testing)
pred.metric <- postResample(pred = pred, obs = testing$Volume)  #postResample expects the predictions first, the observed values second
compare.model <- cbind(pred.metric, compare.model)
}
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 303176.4348 nan 0.1000 40462.3957
## 2 272477.1231 nan 0.1000 32410.1335
## 3 248886.5191 nan 0.1000 26451.8675
## 4 219753.3517 nan 0.1000 17516.6251
## 5 199465.8015 nan 0.1000 18759.1620
## 6 184440.3625 nan 0.1000 12458.1945
## 7 177159.9269 nan 0.1000 7471.0441
## 8 167689.6991 nan 0.1000 11325.1351
## 9 160824.8732 nan 0.1000 7906.7223
## 10 152524.7895 nan 0.1000 8120.2568
## 20 118293.4321 nan 0.1000 2588.7671
## 40 84604.2931 nan 0.1000 -419.9515
## 50 79927.8822 nan 0.1000 426.1485
#That's why we additionally keep every fitted model in a list, so that we can later pick one for the final prediction
methods <- list()
#We save each fitted model in the list created above
for (i in Five_Models) {
methods[[i]] <- caret::train(Volume ~., data = training, method = i)
}
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 306825.8840 nan 0.1000 40990.9882
## 2 273575.5778 nan 0.1000 31439.9033
## 3 244391.8178 nan 0.1000 25477.7753
## 4 213145.3700 nan 0.1000 23615.1157
## 5 192634.3575 nan 0.1000 21341.3448
## 6 178418.6702 nan 0.1000 9397.9167
## 7 168919.2731 nan 0.1000 10973.9090
## 8 164326.1448 nan 0.1000 4768.9323
## 9 152837.0199 nan 0.1000 10086.1343
## 10 143640.9480 nan 0.1000 9912.1428
## 20 108881.9898 nan 0.1000 1317.7514
## 40 85196.5922 nan 0.1000 -3028.3143
## 50 82590.4110 nan 0.1000 -771.6307
methods
## $lm
## Linear Regression
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.00000000000119855 1 0.0000000000004984662
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
## $rf
## Random Forest
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 295.3924 0.8501766 195.52305
## 12 161.4392 0.9456535 83.72705
## 22 142.8953 0.9557016 73.59713
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 22.
##
## $knn
## k-Nearest Neighbors
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 285.9488 0.8475635 183.1795
## 7 300.7292 0.8717571 191.4858
## 9 322.9252 0.8747862 207.6705
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
##
## $svmLinear
## Support Vector Machines with Linear Kernel
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 558.2278 0.8263847 336.3741
##
## Tuning parameter 'C' was held constant at a value of 1
##
## $gbm
## Stochastic Gradient Boosting
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 327.2453 0.7341861 221.1909
## 1 100 329.7403 0.7267396 231.7280
## 1 150 337.1263 0.7152389 241.4511
## 2 50 336.3849 0.7307282 228.8012
## 2 100 341.1190 0.7156439 240.9551
## 2 150 345.5710 0.7062829 246.0999
## 3 50 333.5261 0.7325947 224.1092
## 3 100 335.1053 0.7229182 236.9917
## 3 150 340.5911 0.7099806 247.3737
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth
## = 1, shrinkage = 0.1 and n.minobsinnode = 10.
colnames(compare.model) <- Five_Models
compare.model
## lm rf knn svmLinear
## RMSE 261.3892308 507.6748333 213.8073900 116.1175572
## Rsquared 0.7631923 0.8111341 0.8573653 0.9557089
## MAE 189.5483306 282.5984336 104.9000000 64.6621250
## gbm
## RMSE 0.00000000000010835913
## Rsquared 1.00000000000000000000
## MAE 0.00000000000008451837
compare.model.melt <- melt(compare.model, varnames = c("metric", "model"))
compare.model.melt <- as.data.frame(compare.model.melt)
compare.model.melt
## metric model value
## 1 RMSE lm 261.38923076776160314694
## 2 Rsquared lm 0.76319231883129745597
## 3 MAE lm 189.54833064362532013547
## 4 RMSE rf 507.67483326984211089439
## 5 Rsquared rf 0.81113408132053355093
## 6 MAE rf 282.59843355652418495083
## 7 RMSE knn 213.80738995647459432803
## 8 Rsquared knn 0.85736534432595234989
## 9 MAE knn 104.89999999999999147349
## 10 RMSE svmLinear 116.11755715242007624965
## 11 Rsquared svmLinear 0.95570893357672337398
## 12 MAE svmLinear 64.66212500000003160494
## 13 RMSE gbm 0.00000000000010835913
## 14 Rsquared gbm 1.00000000000000000000
## 15 MAE gbm 0.00000000000008451837
#All models offer quite different results, which indicates the predictions will vary considerably between models.
#Based on the errors in the table, svmLinear stands out with the lowest RMSE and MAE and the highest R-squared; the near-zero gbm errors amount to a perfect fit and look suspect rather than impressive. We therefore choose the SVM model.
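#A programmatic way to pick the winner (a sketch): take the lowest test RMSE, leaving out the suspect near-zero column:
#compare.model.melt %>%
#  dplyr::filter(metric == "RMSE", model != "gbm") %>%
#  dplyr::slice_min(value)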
for(i in c("RMSE","Rsquared","MAE")) {
metric <- compare.model.melt %>% filter(metric == i)
gg <- ggplot(metric, aes(model,value))
print(gg + geom_bar(stat = "identity") + ggtitle(i))
}
predictions_svm <- predict(methods$svmLinear, NewProd2)
predictions_svm
## 1 2 3 4 5 6 7
## 124.30139 107.98660 122.76846 74.39960 45.70958 97.43330 260.85973
## 8 9 10 11 12 13 14
## 142.03468 82.34863 220.68163 708.08504 129.82936 170.79756 99.08641
## 15 16 17 18 19 20 21
## 133.68933 263.95151 90.67295 105.75206 86.92859 107.73372 67.57322
## 22 23 24
## 87.96442 59.24840 499.01275
#Adding our predictions to the new-products set (keeping only the identifying first two columns)
Prod_Sales_Pred <- NewProd_Raw[,-c(3:ncol(NewProd_Raw))]
Prod_Sales_Pred$predictions_svm <- predictions_svm
#Keep only the rows belonging to the four product types of interest, then replace the product numbers with readable brand labels
Prod_Sales_Pred <- Prod_Sales_Pred[-c(10:11,16:24),]
Prod_Sales_Pred$ProductNum <- mgsub(x = Prod_Sales_Pred$ProductNum,pattern = c(171,172,173,175,176,178,180,181,183,193,194,195,196), replacement = c("Dell1","Dell2","Apple","Toshiba","Razer","hp","Acer","Asus","Samsung1","Motorola1","Samsung2","HTC","Motorola2"))
#We can observe how sales vary among the different products
ggplot(Prod_Sales_Pred, aes(x=ProductNum,y=predictions_svm, fill=ProductType)) + geom_col() + theme(axis.text.x = element_text(angle = 45, hjust = 1))