Pre-processing

The first thing we are going to do is clean the Existing and New Products data sets.

Changing Column Names

We renamed the columns to get clearer, easier-to-understand graphs, as sketched below.
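A minimal sketch of the renaming (the original column names shown here are assumptions, since the report does not list them):

#Renaming columns for readability; the original names are assumptions
names(Existing_Products)[names(Existing_Products) == "ProductType"] <- "PType"
names(Existing_Products)[names(Existing_Products) == "x4StarReviews"] <- "x4Star"
names(Existing_Products)[names(Existing_Products) == "PositiveServiceReview"] <- "PSRev"
names(Existing_Products)[names(Existing_Products) == "BestSellersRank"] <- "BSRank"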

Looking for NAs

After several trials and errors we decided to change the NAs to 0s, even though this is not the best thing to do. Since all NAs are in the BSRank column, the best approach for us would be to remove the rows with NAs, then create a correlation matrix and see whether BSRank has a high correlation with Volume. If it does, we should train a model to predict the NAs; if it doesn't, it is fine to remove the entire BSRank column and move on.
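A minimal sketch of that correlation check on the complete cases (the helper object complete_rows is ours):

#Checking the correlation between BSRank and Volume on rows without NAs
complete_rows <- Existing_Products[!is.na(Existing_Products$BSRank), ]
cor(complete_rows$BSRank, complete_rows$Volume)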

Graph showing the weight of NAs per column

#Finding NAs weight per variable (VIM package, aggr function)
library(VIM)
aggr(Existing_Products)

Changing NAs to 0

We discovered 15 NAs, all of them in the BSRank column, and we changed them to 0, as sketched below.
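A minimal sketch of the replacement (the object name Existing_Products_noNA matches the one used in the next steps):

#Replacing the 15 NAs in BSRank with 0
Existing_Products_noNA <- Existing_Products
Existing_Products_noNA$BSRank[is.na(Existing_Products_noNA$BSRank)] <- 0
sum(is.na(Existing_Products_noNA)) #should now return 0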

Attribute selection

Now it is time to see which variables are the most important for predicting Volume. To do that we created a correlation matrix plot to see it clearly.

#Looking for most relevant attributes to predict Volume
library(corrplot)
Existing_Products_noNA.cor <- cor(Existing_Products_noNA[2:18])
corrplot(Existing_Products_noNA.cor, method = "color", addCoef.col = "gray")

corrplot(Existing_Products_noNA.cor, method = "ellipse")

As PType is not a numeric column, we exclude it from the correlation matrix. We need another way to see whether there is a relationship between PType and Volume; this is where we can use ANOVA to discover whether we should create clusters or not. As the p-value is greater than 0.05, we can assume that there is no need to take PType into account to predict Volume. This might change if we wanted to predict the number of 5-star reviews, but for predicting Volume it does not make sense.

#Time to see if PType is relevant for predicting volume using ANOVA
x <- aov(formula = Volume~PType, data = Existing_Products_noNA)
summary(x)
##             Df    Sum Sq Mean Sq F value Pr(>F)  
## PType       11  38888822 3535347   1.683 0.0961 .
## Residuals   68 142859034 2100868                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the correlation matrix, we decided to use only 2 attributes to create our model (see the subsetting sketch after this list):

  • x4Star
  • PSRev
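A minimal sketch of the corresponding subsetting (we keep PType and Volume alongside the two predictors, since both are needed for the duplicate check and the ANOVA re-run below; the object name Selected_Attributes is ours):

#Keeping only the attributes needed from here on
Selected_Attributes <- Existing_Products_noNA[, c("PType", "x4Star", "PSRev", "Volume")]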

Looking for duplicates

This is something that could have been done before the ANOVA test, because any duplicates would certainly affect the test results. But it could be that duplicates only appear after attribute selection, so we prefer to do it afterwards and, if duplicates appear, run the ANOVA test again.

We noticed that there are several duplicates, so we removed them. Now we should run the ANOVA test again without these data points and see if it makes sense to create clusters.
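A minimal sketch of the removal (checking for duplicates over the selected attributes is our assumption; ANOVA_test2_noDup is the object used in the re-run below):

#Removing duplicated rows from the selected attributes
dup_rows <- duplicated(Selected_Attributes)
ANOVA_test2_noDup <- Selected_Attributes[!dup_rows, ]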

We still conclude that there is no relation between PType and Volume, so we can move on.

y <- aov(formula = Volume~PType, data = ANOVA_test2_noDup)
summary(y)
##             Df    Sum Sq Mean Sq F value Pr(>F)
## PType       11  38152029 3468366   1.496  0.157
## Residuals   61 141465303 2319103

Looking for Outliers

Before we build our model we need to check whether there are outliers and exclude them. When we draw a boxplot of the Volume column, some outliers appear.

#Looking for Outliers
boxplot(Ex_Prod_noNA_noDup$Volume, main = "Volume Boxplot")

We excluded only 2 of them, as we do not consider the rest to be outliers. Removing data is something you should try to avoid, and since the remaining points are so close to the maximum we prefer to keep them.

#Removing Outliers
Ex_Prod_noNA_noDup_noOut <- Ex_Prod_noNA_noDup[-which(Ex_Prod_noNA_noDup$Volume > 5000) , ]
boxplot(Ex_Prod_noNA_noDup_noOut$Volume)

Data Normalisation

The last step of this pre-processing is data normalisation. We keep the code here because normalisation should help improve the results of our distance-based models, but in this case it did not help, so we did not use it.

#Normalising x4Star and PSRev for better results with distance models
#Ex_Prod_noNA_noDup_noOut$x4Star <- scale(Ex_Prod_noNA_noDup_noOut$x4Star)
#Ex_Prod_noNA_noDup_noOut$PSRev <- scale(Ex_Prod_noNA_noDup_noOut$PSRev)
#Ex_Prod_noNA_noDup_noOut

Creating Training and Testing Data Sets

We split our data set as follows:

  • TrainSet: 70%
  • TestSet: 30%

#Creating Train and Test Data Sets (caret package)
library(caret)
set.seed(145)
in_training <- createDataPartition(Ex_Prod_noNA_noDup_noOut$Volume, p = 0.7, list = FALSE)
TrainSet <- Ex_Prod_noNA_noDup_noOut[in_training,]
TestSet <- Ex_Prod_noNA_noDup_noOut[-in_training,]

Comparing different Models

We used a loop to compare 7 different models quickly: a linear model, two k-nearest-neighbour variants, random forest, and three SVM kernels.

#Creating a loop to test different models with repeated cross-validation
models <- c("lm", "knn", "rf", "svmRadial", "kknn", "svmLinear", "svmPoly")
fit_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
compare <- c()
all_pred <- c()
for(i in models){
   model <- train(Volume ~ ., data = TrainSet, method = i, trControl = fit_control)
   pred <- predict(model, newdata = TestSet)
   per <- postResample(pred, TestSet$Volume)
   compare <- cbind(compare, per)
   all_pred <- cbind(all_pred, pred)
}
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

Here we can see the performance of each model.

#Comparing the Models Accuracy
colnames(compare) <- models
knitr::kable(compare)
|          |          lm |        knn |          rf |   svmRadial |        kknn |   svmLinear |     svmPoly |
|----------|-------------|------------|-------------|-------------|-------------|-------------|-------------|
| RMSE     | 298.6099844 | 390.931012 | 342.5214695 | 474.4796239 | 347.6666583 | 256.2710595 | 527.9877831 |
| Rsquared |   0.9106918 |   0.923536 |   0.9213091 |   0.8000836 |   0.9168945 |   0.9084097 |   0.8502989 |
| MAE      | 194.6747725 | 212.136090 | 202.0089166 | 267.9594583 | 178.4210526 | 141.7828645 | 255.5176046 |
#Comparing Models prediction
#colnames(all_pred) <- models
#all_pred

#Melting All Models results
#compare_melt1 <- melt(compare, varnames = c("metric", "model"))
#compare_melt1 <- data.table(compare_melt1)
#compare_melt1

#Plotting Models Error
#ggplot(compare_melt1, aes(x = model, y = value)) + geom_bar(stat = "identity") + facet_grid(metric~., scales = "free")

Final Model

After trying the different models separately, we decided to use Random Forest for our prediction, as it is the one with the best results.

#Creating the Random Forest Model
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
RFPrediction <- train(Volume ~ ., data = TrainSet, method = "rf", trControl = fitControl, tuneLength = 2)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
RFPrediction
## Random Forest 
## 
## 50 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 45, 45, 45, 45, 46, 46, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   182.4473  0.9303902  110.5815
## 
## Tuning parameter 'mtry' was held constant at a value of 2

Final Results

Here is the table with the final prediction results.

#Doing the Final Prediction
RFFinalPrediction <- predict(RFPrediction, New_Products_FINAL)
New_Products_FINAL$Volume <- RFFinalPrediction
New_Products_FINAL$PType <- New_Products$PType
New_Products_FINAL$PNum <- New_Products$PNum
knitr::kable(New_Products_FINAL)
| x4Star | PSRev |     Volume | PType            | PNum |
|-------:|------:|-----------:|------------------|-----:|
|     26 |    12 |  581.28008 | PC               |  171 |
|     11 |     7 |  170.61893 | PC               |  172 |
|     10 |    11 |  450.63160 | Laptop           |  173 |
|      2 |     2 |   29.93422 | Laptop           |  175 |
|      1 |     0 |   10.40133 | Laptop           |  176 |
|      8 |     2 |   49.08240 | Netbook          |  178 |
|    112 |    28 |  969.26347 | Netbook          |  180 |
|     18 |     5 |  133.26813 | Netbook          |  181 |
|      4 |     1 |   20.89600 | Netbook          |  183 |
|     66 |    28 | 1077.00933 | Tablet           |  186 |
|    437 |    90 | 1228.73053 | Tablet           |  187 |
|     26 |     8 |  275.90248 | Smartphone       |  193 |
|     26 |    14 |  687.71688 | Smartphone       |  194 |
|      8 |     4 |   81.97747 | Smartphone       |  195 |
|     19 |     5 |  140.18947 | Smartphone       |  196 |
|     97 |    32 |  963.77867 | GameConsole      |  199 |
|      0 |     1 |   10.79485 | Display          |  201 |
|      1 |     2 |   27.23049 | Accessories      |  301 |
|      2 |     2 |   29.93422 | Accessories      |  302 |
|     18 |     4 |  130.28480 | Software         |  303 |
|      8 |     5 |   81.80200 | Printer          |  304 |
|      0 |     1 |   10.79485 | PrinterSupplies  |  305 |
|      1 |     0 |   10.40133 | ExtendedWarranty |  306 |
|    252 |    59 | 1175.21720 | GameConsole      |  307 |