Pre-processing

The first thing we are going to do is clean the Existing and New Products data sets.

Changing Column Names

We renamed the columns to get clearer, easier-to-understand graphs, as sketched below.
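A minimal sketch of the renaming (the original column names shown here are assumptions, since the report does not list them):

#Renaming columns for readability; the original names are assumptions
names(Existing_Products)[names(Existing_Products) == "ProductType"] <- "PType"
names(Existing_Products)[names(Existing_Products) == "x4StarReviews"] <- "x4Star"
names(Existing_Products)[names(Existing_Products) == "PositiveServiceReview"] <- "PSRev"
names(Existing_Products)[names(Existing_Products) == "BestSellersRank"] <- "BSRank"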

Looking for NAs

After several trials and errors we decided to change the NAs to 0s, even though this is not the best thing to do. Since all NAs are in the BSRank column, the best approach for us would be to remove the rows with NAs, then create a correlation matrix and see whether BSRank has a high correlation with Volume. If it does, we should train a model to predict the NAs; if it doesn't, it is fine to remove the entire BSRank column and move on.
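A minimal sketch of that correlation check on the complete cases (the helper object complete_rows is ours):

#Checking the correlation between BSRank and Volume on rows without NAs
complete_rows <- Existing_Products[!is.na(Existing_Products$BSRank), ]
cor(complete_rows$BSRank, complete_rows$Volume)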

Graph showing the weight of NAs per column

#Finding NAs weight per variable (VIM package, aggr function)
library(VIM)
aggr(Existing_Products)

Changing NAs to 0

We discovered 15 NAs, all of them in the BSRank column, and we changed them to 0, as sketched below.
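A minimal sketch of the replacement (the object name Existing_Products_noNA matches the one used in the next steps):

#Replacing the 15 NAs in BSRank with 0
Existing_Products_noNA <- Existing_Products
Existing_Products_noNA$BSRank[is.na(Existing_Products_noNA$BSRank)] <- 0
sum(is.na(Existing_Products_noNA)) #should now return 0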

Attribute selection

Now it is time to see which variables are the most important for predicting Volume. To do that we created a correlation matrix plot to see it clearly.

#Looking for most relevant attributes to predict Volume
library(corrplot)
Existing_Products_noNA.cor <- cor(Existing_Products_noNA[2:18])
corrplot(Existing_Products_noNA.cor, method = "color", addCoef.col = "gray")

corrplot(Existing_Products_noNA.cor, method = "ellipse")

As PType is not a numeric column, we exclude it from the correlation matrix. We need another way to see whether there is a relationship between PType and Volume; this is where we can use ANOVA to discover whether we should create clusters or not. As the p-value is greater than 0.05, we can assume that there is no need to take PType into account to predict Volume. This might change if we wanted to predict the number of 5-star reviews, but for predicting Volume it does not make sense.

#Time to see if PType is relevant for predicting volume using ANOVA
x <- aov(formula = Volume~PType, data = Existing_Products_noNA)
summary(x)
##             Df    Sum Sq Mean Sq F value Pr(>F)  
## PType       11  38888822 3535347   1.683 0.0961 .
## Residuals   68 142859034 2100868                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the correlation matrix, we decided to use only 2 attributes to create our model (see the subsetting sketch after this list):

  • x4Star
  • PSRev
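A minimal sketch of the corresponding subsetting (we keep PType and Volume alongside the two predictors, since both are needed for the duplicate check and the ANOVA re-run below; the object name Selected_Attributes is ours):

#Keeping only the attributes needed from here on
Selected_Attributes <- Existing_Products_noNA[, c("PType", "x4Star", "PSRev", "Volume")]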

Looking for duplicates

This is something that could have been done before the ANOVA test, because any duplicates would certainly affect the test results. But it could be that duplicates only appear after attribute selection, so we prefer to do it afterwards and, if duplicates appear, run the ANOVA test again.

We noticed that there are several duplicates, so we removed them. Now we should run the ANOVA test again without these data points and see if it makes sense to create clusters.
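A minimal sketch of the removal (checking for duplicates over the selected attributes is our assumption; ANOVA_test2_noDup is the object used in the re-run below):

#Removing duplicated rows from the selected attributes
dup_rows <- duplicated(Selected_Attributes)
ANOVA_test2_noDup <- Selected_Attributes[!dup_rows, ]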

We still conclude that there is no relation between PType and Volume, so we can move on.

y <- aov(formula = Volume~PType, data = ANOVA_test2_noDup)
summary(y)
##             Df    Sum Sq Mean Sq F value Pr(>F)
## PType       11  38152029 3468366   1.496  0.157
## Residuals   61 141465303 2319103

Looking for Outliers

Before we build our model we need to check whether there are outliers and exclude them. When we draw a boxplot of the Volume column, some outliers appear.

#Looking for Outliers
boxplot(Ex_Prod_noNA_noDup$Volume, main = "Volume Boxplot")

We excluded only 2 of them, as we do not consider the rest to be outliers. Removing data is something you should try to avoid, and since the remaining points are so close to the maximum we prefer to keep them.

#Removing Outliers
Ex_Prod_noNA_noDup_noOut <- Ex_Prod_noNA_noDup[-which(Ex_Prod_noNA_noDup$Volume > 5000) , ]
boxplot(Ex_Prod_noNA_noDup_noOut$Volume)

Data Normalisation

The last step of this pre-processing is data normalisation. We keep the code here because normalisation should help improve the results of our distance-based models, but in this case it did not help, so we did not use it.

#Normalising x4Star and PSRev for better results with distance models
#Ex_Prod_noNA_noDup_noOut$x4Star <- scale(Ex_Prod_noNA_noDup_noOut$x4Star)
#Ex_Prod_noNA_noDup_noOut$PSRev <- scale(Ex_Prod_noNA_noDup_noOut$PSRev)
#Ex_Prod_noNA_noDup_noOut

Creating Training and Testing Data Sets

We split our data set as follows:

  • TrainSet: 70%
  • TestSet: 30%

#Creating Train and Test Data Sets (caret package)
library(caret)
set.seed(145)
in_training <- createDataPartition(Ex_Prod_noNA_noDup_noOut$Volume, p = 0.7, list = FALSE)
TrainSet <- Ex_Prod_noNA_noDup_noOut[in_training,]
TestSet <- Ex_Prod_noNA_noDup_noOut[-in_training,]

Comparing different Models

We used a loop to compare 7 different models quickly: a linear model, two k-nearest-neighbour variants, random forest, and three SVM kernels.

#Creating a loop to test different models with repeated cross-validation
models <- c("lm", "knn", "rf", "svmRadial", "kknn", "svmLinear", "svmPoly")
fit_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
compare <- c()
all_pred <- c()
for(i in models){
   model <- train(Volume ~ ., data = TrainSet, method = i, trControl = fit_control)
   pred <- predict(model, newdata = TestSet)
   per <- postResample(pred, TestSet$Volume)
   compare <- cbind(compare, per)
   all_pred <- cbind(all_pred, pred)
}
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

Here we can see the performance of each model.

#Comparing the Models Accuracy
colnames(compare) <- models
knitr::kable(compare)
|          |          lm |        knn |          rf |   svmRadial |        kknn |   svmLinear |     svmPoly |
|----------|-------------|------------|-------------|-------------|-------------|-------------|-------------|
| RMSE     | 298.6099844 | 390.931012 | 342.5214695 | 474.4796239 | 347.6666583 | 256.2710595 | 527.9877831 |
| Rsquared |   0.9106918 |   0.923536 |   0.9213091 |   0.8000836 |   0.9168945 |   0.9084097 |   0.8502989 |
| MAE      | 194.6747725 | 212.136090 | 202.0089166 | 267.9594583 | 178.4210526 | 141.7828645 | 255.5176046 |
#Comparing Models prediction
#colnames(all_pred) <- models
#all_pred

#Melting All Models results
#compare_melt1 <- melt(compare, varnames = c("metric", "model"))
#compare_melt1 <- data.table(compare_melt1)
#compare_melt1

#Plotting Models Error
#ggplot(compare_melt1, aes(x = model, y = value)) + geom_bar(stat = "identity") + facet_grid(metric~., scales = "free")

Final Model

After trying the different models separately, we decided to use Random Forest for our prediction, as it is the one with the best results.

#Creating the Random Forest Model
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
RFPrediction <- train(Volume ~ ., data = TrainSet, method = "rf", trControl = fitControl, tuneLength = 2)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
RFPrediction
## Random Forest 
## 
## 50 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 45, 45, 45, 45, 46, 46, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   182.4473  0.9303902  110.5815
## 
## Tuning parameter 'mtry' was held constant at a value of 2

Final Results

Here is the table with the final prediction results.

#Doing the Final Prediction
RFFinalPrediction <- predict(RFPrediction, New_Products_FINAL)
New_Products_FINAL$Volume <- RFFinalPrediction
New_Products_FINAL$PType <- New_Products$PType
New_Products_FINAL$PNum <- New_Products$PNum
knitr::kable(New_Products_FINAL)
| x4Star | PSRev |     Volume | PType            | PNum |
|-------:|------:|-----------:|------------------|-----:|
|     26 |    12 |  581.28008 | PC               |  171 |
|     11 |     7 |  170.61893 | PC               |  172 |
|     10 |    11 |  450.63160 | Laptop           |  173 |
|      2 |     2 |   29.93422 | Laptop           |  175 |
|      1 |     0 |   10.40133 | Laptop           |  176 |
|      8 |     2 |   49.08240 | Netbook          |  178 |
|    112 |    28 |  969.26347 | Netbook          |  180 |
|     18 |     5 |  133.26813 | Netbook          |  181 |
|      4 |     1 |   20.89600 | Netbook          |  183 |
|     66 |    28 | 1077.00933 | Tablet           |  186 |
|    437 |    90 | 1228.73053 | Tablet           |  187 |
|     26 |     8 |  275.90248 | Smartphone       |  193 |
|     26 |    14 |  687.71688 | Smartphone       |  194 |
|      8 |     4 |   81.97747 | Smartphone       |  195 |
|     19 |     5 |  140.18947 | Smartphone       |  196 |
|     97 |    32 |  963.77867 | GameConsole      |  199 |
|      0 |     1 |   10.79485 | Display          |  201 |
|      1 |     2 |   27.23049 | Accessories      |  301 |
|      2 |     2 |   29.93422 | Accessories      |  302 |
|     18 |     4 |  130.28480 | Software         |  303 |
|      8 |     5 |   81.80200 | Printer          |  304 |
|      0 |     1 |   10.79485 | PrinterSupplies  |  305 |
|      1 |     0 |   10.40133 | ExtendedWarranty |  306 |
|    252 |    59 | 1175.21720 | GameConsole      |  307 |