The sales team of Blackwell Electronics has asked for another analysis of the sales performance of certain products in one of their stores. This time, the analysis again aims at predicting sales performance, but takes the product type into account to see whether and how specific product types outperform others. The historical sales data therefore has to be analysed again, first to assess whether including ‘product type’ has a positive impact on predictability, and finally to predict the sales volume of new products. The four product types are:
PCs, Laptops, Netbooks and Smartphones.
Additionally, Blackwell has expressed interest in assessing the impact of service reviews and customer reviews on sales performance.
Procedure
The analysis followed the common data mining approach, comprising the steps data exploration, pre-processing, modelling and optimization, prediction, and evaluation.
The analysis was carried out in the statistical programming language R, using several packages (e.g. ‘caret’) to apply the prediction techniques.
In particular, a correlation matrix, a Random Forest, and an ANOVA test were applied to assess the importance of ‘product type’ and the other attributes.
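As an illustration only (a minimal sketch on R's built-in mtcars data, not on the Blackwell dataset), the three techniques can be combined as follows; the full analysis code appears further below.
library(caret)
cor(mtcars) # correlation matrix of the numeric attributes
summary(aov(mpg ~ factor(cyl), data = mtcars)) # ANOVA: does a categorical attribute explain the target?
set.seed(1)
fit <- caret::train(mpg ~ ., data = mtcars, method = "rf", trControl = trainControl(method = "cv", number = 5))
caret::varImp(fit) # attribute importance from the Random Forest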
Results
Table 1 lists the predicted sales volume per product, ordered from highest to lowest.
Table 1: Predicted volume and profit by brand
Product Type | Brand | Predicted Volume | Profit (in k$) |
---|---|---|---|
Netbook | Acer | 734 | 22 |
PC | Dell1 | 267 | 47 |
Smartphone | Samsung2 | 267 | 2 |
Laptop | Apple | 225 | 27 |
Smartphone | Motorola1 | 181 | 4 |
Smartphone | Motorola2 | 167 | 6 |
PC | Dell2 | 152 | 26 |
Netbook | Asus | 137 | 7 |
Smartphone | HTC | 134 | 3 |
Netbook | hp | 101 | 3 |
Netbook | Samsung1 | 95 | 3 |
Laptop | Toshiba | 77 | 14 |
Laptop | Razer | 0 | 0 |
Table 2: Total predicted volume and profit by product category
Category | Total Predicted Volume | Profit (in k$) |
---|---|---|
Netbooks | 1067 | 35 |
Smartphones | 749 | 15 |
PC | 419 | 73 |
Laptop | 302 | 41 |
The results show that, in terms of predicted sales volume, Netbooks would be the best-selling product type; within this category, the Acer Netbook has by far the highest predicted volume. Smartphones come second, with Samsung being the most popular brand, followed by PCs and, close behind, Laptops. However, these last two categories (PCs and Laptops) generate the highest profit.
Regarding the interest in incorporating Product Type into the analysis, we found that including it has no measurable impact when predicting sales volume.
Concerning the impact of service and customer reviews, we concluded the following: positive service reviews, followed by 4-star and 3-star customer reviews, had the highest impact on our prediction models.
Additionally, the predicted results agree with the previous analysis using RapidMiner.
Limitations
The dataset contains only 80 products; a considerably larger number would increase the accuracy of the predictions. Some attributes of each product are barely relevant for our analysis; for instance, the product dimensions (width, height and depth) cannot predict sales in our case.
Similarly, the Best Sellers Rank is missing many values and had to be excluded from the analysis.
Recommendations
The results should be considered estimates only and should contribute to decision-making in an indicative manner.
For further analysis, we recommend expanding the dataset in terms of products and starting to record sales systematically from now on, so that future analyses become more meaningful.
#Because we have worked with these products before, and because the dataset is small, we can spot duplicates just by looking at the df itself.
#All 'ExtendedWarranty' products share the same values in every attribute (including the exact number of reviews of all types) except for a small difference in price.
#In a business sense, we conclude this is because warranties are sold alongside different products: it is the same type of warranty, but its price varies proportionally to the product price.
#We should therefore treat them as the same product and remove all but one.
#Counting the 6 duplicates:
#IMPORTANT NOTE: It's good practice to remove columns by name instead of by column number, so that if we make changes to the dataset later and reload these commands, we won't accidentally affect the wrong columns.
sum(duplicated(ExistProd[,c(1,4:18)]))
## [1] 6
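#Illustrative: to actually display the duplicate rows (first occurrences included) rather than just count them, we can index with duplicated() scanned from both directions:
ExistProd[duplicated(ExistProd[,c(1,4:18)]) | duplicated(ExistProd[,c(1,4:18)], fromLast = TRUE),]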
#Removing the duplicates, but setting the average price on the single warranty product we keep:
#ExistProd <- ExistProd[-c(35:41),]
Duplicates <- ExistProd[c(32:41),] #rows 32:41 hold two regular products followed by the eight ExtendedWarranty rows
Duplicates[3,3] <- sum(Duplicates[3:10,3])/8 #set the price of the first warranty row to the average of all eight warranty prices
Duplicates <- Duplicates[1:3,] #keep the two regular products and a single warranty product
ExistProd <- ExistProd[-c(32:41),] #drop all ten rows from the main df...
ExistProd <- rbind(ExistProd, Duplicates) #...and add the de-duplicated rows back
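#Hedged sanity check: re-running the duplicate count on the same column subset should now come out lower (0 if the warranty rows accounted for all duplicates).
sum(duplicated(ExistProd[,c(1,4:18)]))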
#We checked in the previous project that some columns were not useful and just added noise to our prediction model. The columns are:
#1. Product dimensions (width, Depth and Height)
#2. Best Sellers Rank
#3. Shipping Weight
#4. Product Num (we don't want the id to add 'noise')
#We proceed to get rid of the aforementioned columns in both dfs (they don't need to have identical columns for prediction, but it's good practice)
ExistProd <- dplyr::select(ExistProd,-c(ProductNum,BestSellersRank,ShippingWeight,ProductDepth, ProductWidth, ProductHeight))
NewProd <- dplyr::select(NewProd,-c(ProductNum,BestSellersRank,ShippingWeight,ProductDepth, ProductWidth, ProductHeight))
#BAD EXAMPLE: ExistProd <- ExistProd[,-c(2,12:16)]
#NewProd <- NewProd[,-c(2,12:16)]
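#Quick check (illustrative): after dropping by name, the two data frames should differ at most by the label column, depending on whether NewProd carries an (empty) Volume column.
setdiff(names(ExistProd), names(NewProd))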
#We find lots of outliers in almost all columns. Removing them all would drastically change our (already very small) dataset, so we don't remove all of them.
ggplot(stack(ExistProd), aes(x=ind, y=values)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
#But we should at least look more closely at the outliers of the dependent variable (Volume).
ggplot(ExistProd, aes(x=ProductType, y=Volume)) +geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
#We have few outliers here as well. The R Built-in function will tell us the values.
boxplot(ExistProd$Volume, plot = FALSE)$out
## [1] 2052 2140 11204 1896 7036 1684
#But we only want to remove the two extreme outliers because our dataset is small, and because these outliers don't belong to the product types we want to predict.
ExistProd <- ExistProd[-which(ExistProd$Volume > 7000),]
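#Illustrative re-check: the extreme values should no longer appear among the boxplot outliers.
boxplot(ExistProd$Volume, plot = FALSE)$out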
#Since different variables have different units, we are going to scale all numeric variables (except label):
for (i in c(2:11)){ExistProd[,i] <- scale(ExistProd[,i])}
#We do the same in NewProd df
for (i in c(2:11)){NewProd[,i] <- scale(NewProd[,i])}
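#Brief check (illustrative) that the scaling worked: each scaled column should now have mean ~0 and standard deviation 1.
round(sapply(ExistProd[,2:11], mean), 3)
round(sapply(ExistProd[,2:11], sd), 3)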
#We are going to dummify (turn categorical values into binary indicator columns).
#We only need to dummify ProductType. We check that it is a factor first:
str(ExistProd$ProductType)
## Factor w/ 12 levels "Accessories",..: 7 7 7 5 5 1 1 1 1 1 ...
#The correlation matrix only works with numeric values, so we create a separate df (to avoid modifying the current one) on which to run it. We do this before dummifying, because after dummification we would have many more variables and the correlation matrix would be much harder to read.
ExistProd_corr <- ExistProd[,-c(1)]
CorrData2 <- cor(ExistProd_corr)
corrplot(CorrData2, order="hclust", method = "number",
tl.col = "black", tl.srt = 90 , tl.cex = 0.7, tl.pos = "t")
#In order to find out the relationship between the label and the ProductType (categorical), we are going to use the ANOVA test.
ANOVA <- aov(ExistProd$Volume~ ExistProd$ProductType)
summary(ANOVA)
## Df Sum Sq Mean Sq F value Pr(>F)
## ExistProd$ProductType 11 5075621 461420 1.473 0.166
## Residuals 59 18487579 313349
#Since the p-value is higher than 0.05, we fail to reject the null hypothesis that the mean Volume is the same across product types. Therefore, ProductType does not appear useful for predicting Volume.
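#For context (illustrative), the per-type mean volumes that the ANOVA compares can be inspected directly:
aggregate(Volume ~ ProductType, data = ExistProd, FUN = mean)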
#Now we will confirm this with a Random Forest trained on dummy features
ExistProd2 <- createDummyFeatures(obj = ExistProd, cols = "ProductType")
NewProd2 <- createDummyFeatures(obj = NewProd, cols = "ProductType")
set.seed(107)
inTrain <- createDataPartition(ExistProd2$Volume,p = 0.75, list = FALSE)
training <- ExistProd2[inTrain,]
testing <- ExistProd2[-inTrain,]
ctrl <- trainControl(method = "cv", number = 10)
mRF <- randomForest(Volume ~., ExistProd2, ntree = 80)
plot (mRF, main = "Performance of RF depending on the number of trees")
ctrl <- trainControl(method="repeatedcv",repeats = 3, number = 5) #,classProbs=TRUE,summaryFunction = twoClassSummary)
mRF2 <- caret::train(Volume ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 2)
varImpPlot(mRF, main = "Variable importance of the Random Forest")
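#To complement the plot (illustrative), the numeric importance scores can be listed; the ProductType dummies are expected near the bottom of the ranking.
imp <- randomForest::importance(mRF)
imp[order(imp[,1], decreasing = TRUE), , drop = FALSE]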
#As a conclusion, we get rid of Product Type
ExistProd$ProductType <- NULL
#ExistProd2 <- createDummyFeatures(obj = ExistProd, cols = "ProductType")
#NewProd2 <- createDummyFeatures(obj = NewProd, cols = "ProductType")
#We now look for collinearity among the remaining (all-numeric) variables
CorrData <- cor(ExistProd)
CorNames <- findCorrelation(CorrData, cutoff = 0.9, verbose = TRUE, names = TRUE)
## Compare row 4 and column 3 with corr 0.907
## Means: 0.554 vs 0.374 so flagging column 4
## Compare row 5 and column 6 with corr 0.979
## Means: 0.478 vs 0.343 so flagging column 5
## Compare row 2 and column 11 with corr 1
## Means: 0.476 vs 0.3 so flagging column 2
## All correlations <= 0.9
#The above function recommends which columns we should remove based on collinearity, so we proceed to do that:
ExistProd <- ExistProd[, - which(names(ExistProd) %in% CorNames)]
#BAD PRACTICE
#ExistProd[ ,c(2,4,5)] <- list(NULL)
#NewProd[ ,c(2,4,5)] <- list(NULL)
set.seed(107)
inTrain <- createDataPartition(ExistProd2$Volume,p = 0.75, list = FALSE)
training <- ExistProd2[inTrain,]
testing <- ExistProd2[-inTrain,]
ctrl <- trainControl(method = "cv", number = 10)
nrow(training)
## [1] 55
nrow(testing)
## [1] 16
#The chosen models are trained in the loop below
Five_Models <- c("lm", "rf", "knn", "svmLinear", "gbm")
#We create an empty variable so we can later assign all results to it
compare.model <- c()
#We run all models in a loop and collect the results in 'compare.model'.
#Note, however, that 'model' is overwritten on every iteration, so only the last fitted model is kept in that variable.
for (i in Five_Models){
model <- caret::train(Volume ~ ., data = training, method = i)
pred <- predict(model, newdata = testing)
pred.metric <- postResample(testing$Volume, obs = pred)
compare.model <- cbind(pred.metric, compare.model)
}
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 313823.5317 nan 0.1000 40051.1907
## 2 281141.9833 nan 0.1000 36576.1443
## 3 243944.3453 nan 0.1000 25128.4651
## 4 227056.8211 nan 0.1000 18409.0107
## 5 211217.7925 nan 0.1000 19245.5370
## 6 189509.0779 nan 0.1000 17120.2941
## 7 175637.2247 nan 0.1000 2595.9761
## 8 171221.4563 nan 0.1000 372.4642
## 9 160521.7337 nan 0.1000 11451.3583
## 10 155619.3410 nan 0.1000 724.0868
## 20 97964.3675 nan 0.1000 3784.9466
## 40 71691.6121 nan 0.1000 -512.2182
## 50 68044.7417 nan 0.1000 -607.8096
#That's why we instead create a list to hold all trained models, so that we can later choose among them
methods <- list()
#We save each trained model into the list created above
for (i in Five_Models) {
methods[[i]] <- caret::train(Volume ~., data = training, method = i )
}
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 307578.8280 nan 0.1000 41207.7924
## 2 286305.2500 nan 0.1000 19030.2813
## 3 261112.1845 nan 0.1000 28267.1541
## 4 241705.6523 nan 0.1000 14618.5178
## 5 204843.1469 nan 0.1000 26150.5048
## 6 185556.7377 nan 0.1000 18609.4629
## 7 178005.6144 nan 0.1000 7763.0654
## 8 159496.5745 nan 0.1000 13674.8079
## 9 139299.9802 nan 0.1000 9006.8488
## 10 129438.4131 nan 0.1000 5994.8550
## 20 83051.1139 nan 0.1000 -1019.0330
## 40 65818.0608 nan 0.1000 -1494.6146
## 60 58964.5250 nan 0.1000 -799.1243
## 80 54570.5628 nan 0.1000 -182.9114
## 100 50063.0809 nan 0.1000 -689.4158
## 120 44184.3397 nan 0.1000 -860.5722
## 140 42799.7584 nan 0.1000 -209.3923
## 150 40323.4292 nan 0.1000 -518.1395
methods
## $lm
## Linear Regression
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.406146e-13 1 3.916189e-13
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
## $rf
## Random Forest
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 310.5234 0.8459020 201.14610
## 12 180.6845 0.9361108 96.03043
## 22 150.1520 0.9521491 77.92924
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 22.
##
## $knn
## k-Nearest Neighbors
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 250.5506 0.8560911 160.4576
## 7 254.6291 0.8658619 169.5987
## 9 271.0767 0.8670381 183.6637
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
##
## $svmLinear
## Support Vector Machines with Linear Kernel
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 507.7142 0.8739201 306.0704
##
## Tuning parameter 'C' was held constant at a value of 1
##
## $gbm
## Stochastic Gradient Boosting
##
## 55 samples
## 22 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 55, 55, 55, 55, 55, 55, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 319.8130 0.7433712 222.8188
## 1 100 317.0902 0.7356551 227.5105
## 1 150 322.5018 0.7255873 232.8695
## 2 50 317.0595 0.7405268 219.8445
## 2 100 317.3141 0.7307384 226.8979
## 2 150 315.9586 0.7289831 227.2538
## 3 50 316.2784 0.7406168 219.6941
## 3 100 317.0800 0.7363365 225.7095
## 3 150 318.8884 0.7271765 227.6880
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
colnames(compare.model) <- Five_Models
compare.model
## lm rf knn svmLinear gbm
## RMSE 295.757896 359.1920424 214.8119177 82.8867239 1.150406e-12
## Rsquared 0.690411 0.9533011 0.8578734 0.9829067 1.000000e+00
## MAE 209.847853 125.9877635 157.9000000 46.9278500 4.392184e-13
compare.model.melt <- melt(compare.model, varnames = c("metric", "model"))
compare.model.melt <- as.data.frame(compare.model.melt)
compare.model.melt
## metric model value
## 1 RMSE lm 2.957579e+02
## 2 Rsquared lm 6.904110e-01
## 3 MAE lm 2.098479e+02
## 4 RMSE rf 3.591920e+02
## 5 Rsquared rf 9.533011e-01
## 6 MAE rf 1.259878e+02
## 7 RMSE knn 2.148119e+02
## 8 Rsquared knn 8.578734e-01
## 9 MAE knn 1.579000e+02
## 10 RMSE svmLinear 8.288672e+01
## 11 Rsquared svmLinear 9.829067e-01
## 12 MAE svmLinear 4.692785e+01
## 13 RMSE gbm 1.150406e-12
## 14 Rsquared gbm 1.000000e+00
## 15 MAE gbm 4.392184e-13
#The models give quite different results, which indicates that the predictions will vary considerably between models.
#Based on the errors, the SVM stands out in MAE (smaller error than the others), while the RMSE values are broadly comparable. As for R-squared, RF performs better than SVM here, but SVM comes in second, so we choose the SVM model.
for(i in c("RMSE","Rsquared","MAE")) {
metric <- compare.model.melt %>% filter(metric == i)
gg <- ggplot(metric, aes(model,value))
print(gg + geom_bar(stat = "identity") + ggtitle(i))
}
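#As an additional comparison (a sketch, not part of the original workflow), caret can summarise the resampling results of the stored models side by side:
resamp <- caret::resamples(methods)
summary(resamp)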
predictions_svm <- predict(methods$svmLinear, NewProd2)
predictions_svm
## 1 2 3 4 5 6
## 306.10722 203.20065 227.05095 43.66592 -48.88348 132.54230
## 7 8 9 10 11 12
## 767.52884 142.75748 79.33107 684.13076 2504.82682 325.20232
## 13 14 15 16 17 18
## 394.90194 180.11386 222.16805 1059.01335 97.12584 165.24101
## 19 20 21 22 23 24
## 154.88560 170.53507 234.14689 133.09492 37.65265 2598.67423
#Adding our predictions to the new-products data
Prod_Sales_Pred <- NewProd_Raw[,-c(3:ncol(NewProd_Raw))] #keep only the first two columns (ProductType and ProductNum)
Prod_Sales_Pred$predictions_svm <- predictions_svm
Prod_Sales_Pred <- Prod_Sales_Pred[-c(10:11,16:24),] #keep only the 13 products belonging to the four product types of interest
#Replace the product numbers with the corresponding brand names for readability
Prod_Sales_Pred$ProductNum <- mgsub(x = Prod_Sales_Pred$ProductNum,pattern = c(171,172,173,175,176,178,180,181,183,193,194,195,196), replacement = c("Dell1","Dell2","Apple","Toshiba","Razer","hp","Acer","Asus","Samsung1","Motorola1","Samsung2","HTC","Motorola2"))
#We can observe how the predicted sales volumes vary among the different products
ggplot(Prod_Sales_Pred, aes(x=ProductNum,y=predictions_svm, fill=ProductType)) + geom_col() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
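#To echo Table 2 of the report (illustrative), the predicted volumes can also be totalled per product type:
Prod_Sales_Pred %>% dplyr::group_by(ProductType) %>% dplyr::summarise(Total_Pred_Volume = sum(predictions_svm))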