The first thing we do is clean the Existing and New Products data sets.
We renamed the columns so that the resulting graphs are more visual and easier to understand.
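For illustration, the renaming could be done as in the sketch below. The original names on the left are placeholders, since only the new short names (used throughout the rest of the code) are known.
#Hypothetical renaming example: the left-hand names are placeholders,
#the right-hand names are the ones used in the rest of this analysis
names(Existing_Products)[names(Existing_Products) == "ProductType"] <- "PType"
names(Existing_Products)[names(Existing_Products) == "BestSellersRank"] <- "BSRank"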
After several trials and errors we decided to replace the NAs with 0s, even though this is not ideal. Since all the NAs are in the BSRank column, the best approach for us would be to remove the rows containing NAs, build a correlation matrix, and check whether BSRank has a high correlation with Volume. If it does, we should fit a model to predict (impute) the missing values; if it does not, it is fine to remove the entire BSRank column and move on.
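For reference, a minimal sketch of that alternative check (which we did not run in the final analysis) could look like this:
#Hypothetical sketch of the alternative approach described above
complete_rows <- Existing_Products[!is.na(Existing_Products$BSRank), ]
cor(complete_rows$BSRank, complete_rows$Volume) #high -> impute the NAs, low -> drop BSRank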
#Finding NAs weight per variable (VIM package, aggr function)
library(VIM)
aggr(Existing_Products)
We discovered 15 NAs, all of them in the BSRank column, and we replaced them with 0.
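The replacement itself is a one-liner; here is a minimal sketch (the resulting object name, Existing_Products_noNA, is the one used in the code below):
#Replacing the 15 NAs in BSRank with 0
Existing_Products_noNA <- Existing_Products
Existing_Products_noNA$BSRank[is.na(Existing_Products_noNA$BSRank)] <- 0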
Now it is time to see which variables are the most important for predicting Volume. To do that we created a correlation matrix plot, which makes the relationships easy to see.
#Looking for most relevant attributes to predict Volume
library(corrplot)
Existing_Products_noNA.cor <- cor(Existing_Products_noNA[2:18])
corrplot(Existing_Products_noNA.cor, method = "color", addCoef.col = "gray")
corrplot(Existing_Products_noNA.cor, method = "ellipse")
As PType is not a numeric column, we excluded it from the correlation matrix. We need another way to check whether there is a relationship between PType and Volume, and that is where we can use ANOVA to discover whether we should create clusters or not. Since the p-value is greater than 0.05, we can assume there is no need to take PType into account to predict Volume. This might change if we wanted to predict the number of 5-star reviews, but for predicting Volume it does not make sense.
#Time to see if PType is relevant for predicting Volume using ANOVA
x <- aov(formula = Volume~PType, data = Existing_Products_noNA)
summary(x)
## Df Sum Sq Mean Sq F value Pr(>F)
## PType 11 38888822 3535347 1.683 0.0961 .
## Residuals 68 142859034 2100868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Then we decided to use only two attributes to build our model: x4Star and PSRev.
Removing duplicates is something that could have been done before the ANOVA test, because any duplicates would certainly affect the test results. However, duplicates may only appear after the attribute selection, so we prefer to do it afterwards and, if duplicates do appear, run the ANOVA test again.
We noticed that there are several duplicates, so we proceeded to remove them (a sketch is shown below). Now we should run the ANOVA test again without these data points and see if it still makes sense to create clusters.
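The deduplication step is not shown in the original code, so here is a hedged sketch. Only the object names ANOVA_test2_noDup and Ex_Prod_noNA_noDup come from the code below; the intermediate name ANOVA_test2 and the exact column selection are assumptions.
#Hedged sketch: keeping the selected attributes (plus PType for the ANOVA)
#and dropping duplicated rows
ANOVA_test2 <- Existing_Products_noNA[, c("PType", "x4Star", "PSRev", "Volume")]
ANOVA_test2_noDup <- ANOVA_test2[!duplicated(ANOVA_test2), ]
Ex_Prod_noNA_noDup <- ANOVA_test2_noDup[, c("x4Star", "PSRev", "Volume")]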
We still conclude that there is no relationship between PType and Volume (p = 0.157, again above 0.05), so we can move on.
y <- aov(formula = Volume~PType, data = ANOVA_test2_noDup)
summary(y)
## Df Sum Sq Mean Sq F value Pr(>F)
## PType 11 38152029 3468366 1.496 0.157
## Residuals 61 141465303 2319103
Before building our model we need to check for outliers and exclude them. When we draw a boxplot of the Volume column, some outliers appear.
#Looking for Outliers
boxplot(Ex_Prod_noNA_noDup$Volume, main = "Volume Boxplot")
We excluded only two of them, as we do not consider the rest to be outliers. Removing data is something you should try to avoid, and since the remaining flagged points are so close to the maximum of the rest of the data, we prefer to keep them.
#Removing Outliers
Ex_Prod_noNA_noDup_noOut <- Ex_Prod_noNA_noDup[-which(Ex_Prod_noNA_noDup$Volume > 5000), ]
boxplot(Ex_Prod_noNA_noDup_noOut$Volume)
The last step in the pre-processing is data normalisation. We keep the code here (commented out) because normalising should help improve the results of distance-based models, but in this case it did not help, so we did not use it.
#Normalising x4Star and PSRev for better results with distance models
#Ex_Prod_noNA_noDup_noOut$x4Star <- scale(Ex_Prod_noNA_noDup_noOut$x4Star)
#Ex_Prod_noNA_noDup_noOut$PSRev <- scale(Ex_Prod_noNA_noDup_noOut$PSRev)
#Ex_Prod_noNA_noDup_noOut
We split our data set as follows:
- TrainSet: 70%
- TestSet: 30%
#Creating Train and Test Data Sets
library(caret)
set.seed(145)
in_training <- createDataPartition(Ex_Prod_noNA_noDup_noOut$Volume, p = 0.7, list = F)
TrainSet <- Ex_Prod_noNA_noDup_noOut[in_training,]
TestSet <- Ex_Prod_noNA_noDup_noOut[-in_training,]
We used a loop to quickly compare seven different models: linear regression, two k-nearest-neighbours variants, random forest, and three support vector machine kernels (radial, linear, and polynomial).
#Creating a Loop to test different Models
models <- c("lm", "knn", "rf", "svmRadial", "kknn", "svmLinear", "svmPoly")
compare <- c()
all_pred <- c()
for(i in models){
  #10-fold cross-validation repeated 3 times must be passed through trainControl
  model <- train(Volume~., data = TrainSet, method = i,
                 trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3))
  pred <- predict(model, newdata = TestSet)
  per <- postResample(pred, TestSet$Volume) #postResample expects (predictions, observed)
  compare <- cbind(compare, per)
  all_pred <- cbind(all_pred, pred)
}
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
Here we can see the performance of each model.
#Comparing the Models Accuracy
colnames(compare) <- models
knitr::kable(compare)
| | lm | knn | rf | svmRadial | kknn | svmLinear | svmPoly |
|---|---|---|---|---|---|---|---|
| RMSE | 298.6099844 | 390.931012 | 342.5214695 | 474.4796239 | 347.6666583 | 256.2710595 | 527.9877831 |
| Rsquared | 0.9106918 | 0.923536 | 0.9213091 | 0.8000836 | 0.9168945 | 0.9084097 | 0.8502989 |
| MAE | 194.6747725 | 212.136090 | 202.0089166 | 267.9594583 | 178.4210526 | 141.7828645 | 255.5176046 |
#Comparing Models prediction
#colnames(all_pred) <- models
#all_pred
#Melting All Models results
#compare_melt1 <- melt(compare, varnames = c("metric", "model"))
#compare_melt1 <- data.table(compare_melt1)
#compare_melt1
#Plotting Models Error
#ggplot(compare_melt1, aes(x = model, y = value)) + geom_bar(stat = "identity") + facet_grid(metric~., scales = "free")
After trying the different models separately, we decided to use Random Forest for our prediction, as it is the one with the best results.
#Creating the Random Forest Model
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
RFPrediction <- train(Volume~., data = TrainSet, method = "rf", trControl = fitControl, tuneLength = 2)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
RFPrediction
## Random Forest
##
## 50 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 45, 45, 45, 45, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 182.4473 0.9303902 110.5815
##
## Tuning parameter 'mtry' was held constant at a value of 2
Here is the table with the final prediction results.
#Doing the Final Prediction
RFFinalPrediction <- predict(RFPrediction, New_Products_FINAL)
New_Products_FINAL$Volume <- RFFinalPrediction
New_Products_FINAL$PType <- New_Products$PType
New_Products_FINAL$PNum <- New_Products$PNum
knitr::kable(New_Products_FINAL)
| x4Star | PSRev | Volume | PType | PNum |
|---|---|---|---|---|
| 26 | 12 | 581.28008 | PC | 171 |
| 11 | 7 | 170.61893 | PC | 172 |
| 10 | 11 | 450.63160 | Laptop | 173 |
| 2 | 2 | 29.93422 | Laptop | 175 |
| 1 | 0 | 10.40133 | Laptop | 176 |
| 8 | 2 | 49.08240 | Netbook | 178 |
| 112 | 28 | 969.26347 | Netbook | 180 |
| 18 | 5 | 133.26813 | Netbook | 181 |
| 4 | 1 | 20.89600 | Netbook | 183 |
| 66 | 28 | 1077.00933 | Tablet | 186 |
| 437 | 90 | 1228.73053 | Tablet | 187 |
| 26 | 8 | 275.90248 | Smartphone | 193 |
| 26 | 14 | 687.71688 | Smartphone | 194 |
| 8 | 4 | 81.97747 | Smartphone | 195 |
| 19 | 5 | 140.18947 | Smartphone | 196 |
| 97 | 32 | 963.77867 | GameConsole | 199 |
| 0 | 1 | 10.79485 | Display | 201 |
| 1 | 2 | 27.23049 | Accessories | 301 |
| 2 | 2 | 29.93422 | Accessories | 302 |
| 18 | 4 | 130.28480 | Software | 303 |
| 8 | 5 | 81.80200 | Printer | 304 |
| 0 | 1 | 10.79485 | PrinterSupplies | 305 |
| 1 | 0 | 10.40133 | ExtendedWarranty | 306 |
| 252 | 59 | 1175.21720 | GameConsole | 307 |