The task is to redo the profitability prediction analysis, but this time using R. We are given four product types, and we need to find the predicted volumes for each of those types. Then we need to assess the impact services reviews and customer reviews have on sales of different product types.
We need to clean our data from any missing values or inconsistent data format in our features. It will also be required to create dummy data for our Product.Type feature to convert it from categorical to numerical attribute. Feature selection and engineering happens in this step as well.
In order to choose the features in the model, a correlation matrix was created in order to see the relationship of all the features with one another. This was done after creating dummy features for each product type as the correlation matrix only works on numerical features. We got the outcome in a table format (hidden, too big). We can see that x5StarReviews is perfectly correlated with the Volume, this is not good as it will overfit the model and needs further investigation as reviews can vary greatly. We also see that 4starreviews is also correlated along with positiveservicereview. This is why we featured engineered sumReviews which includes highly correlated features but are not collinear with any other features.
#creating new variable sumReviews
existingData$sumReviews <- rowSums(existingData[,c("x5StarReviews",
"x4StarReviews",
"x3StarReviews",
"x2StarReviews",
"x1StarReviews")])
#scatterplot showing a clear positive correlation between sumReviews and Volume
ggplot(data = existingData, aes(x=sumReviews, y=Volume))+
geom_point()+
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Next, I had to find if my data contains any outliers
#save outliers and remove the rows with outliers
filteredCols <- select(existingData, c(29,30,20,21,"ProductType.Netbook"))
boxplot <-boxplot(filteredCols,
main = "Multiple boxplots for comparision",
names = c("Volume", "SumRev", "PosRev", "NegRev",
"ProdType.Netbook"),
par(cex.axis=0.8),
las=2)
From the boxplot, we can see that volume has 2 big outliers compared to all the other features. In order to have closer predictions, we opt to remove those 2 outliers and clean our data.
outliers <- boxplot(existingData$Volume, plot=FALSE)$out
existingData <- existingData[-which(existingData$Volume %in% outliers),]
#loading r files (to minimize the rmd file size)
read_chunk('models.R')
read_chunk('parameterTables.R')
K nearest neighbor algorithm is very simple. It works based on minimum distance from the query instance to the training samples to determine the K-nearest neighbors. After we gather K nearest neighbors, we take simple majority of these K-nearest neighbors to be the prediction of the query instance. That’s why it’s good in small datasets, doesn’t need a huge number of observations to create a good model
Random Forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is good against overfitting due to the randomness of the algorithm.
SVMs are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). A SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
The aim of linear regression is to model a continuous variable (Volume) as a mathematical function of one or more X variable(s), so that we can use this regression model to predict the Volume when only the X variables are known.
#Set seed to remember the randomization
set.seed(99)
#Creating subset of the dataset
existingData %>% select(c(Volume,
PositiveServiceReview,
NegativeServiceReview,
sumReviews,
ProductType.Netbook)) -> ExistingSub
#Normalise/scale attributes except the dependent feature
ExistingSub[,c(2:5)] <- lapply(ExistingSub[,c(2:5)] , scale)
#Splitting data into training set and testing set for cross validation
inTraining <- createDataPartition(ExistingSub$Volume, p = .75, list = FALSE)
training <- ExistingSub[inTraining,]
testing <- ExistingSub[-inTraining,]
#filtering test set to include only netbook products
testingLimited <- testing %>%
dplyr::filter(testing$ProductType.Netbook > 1)
Below are a few cases where the regression model is used on the training set and testing set in order to find the best performing model for each Product type.
rfControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
search="random")
#rfGrid <- expand.grid(mtry=c(9,10,11,12,13))
#Set seed to know the random order
set.seed(33)
#Creating RF model
rfModel <- train(Volume~., data = training,
method = "rf",
trControl=rfControl,
#tuneGrid=rfGrid,
#classProbs = TRUE,
tuneLength=4,
importance=T)
#view model
rfModel
## Random Forest
##
## 60 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 54, 55, 54, 55, 56, 54, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 163.2945 0.9410443 94.09219
## 4 141.6765 0.9517085 81.97853
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
#varimp
varImp(rfModel)
## rf variable importance
##
## Overall
## sumReviews 100.00
## PositiveServiceReview 39.79
## ProductType.Netbook 10.18
## NegativeServiceReview 0.00
#predict on testing
pred <- predict(rfModel, newdata = testingLimited)
#summary(pred)
postResample(pred, testingLimited$Volume)
## RMSE Rsquared MAE
## 5.352667 NA 5.352667
knnControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE)
#always set seed before running models, to be able to recreate it
set.seed(998)
knnModel <- train(Volume ~., data = training, method = "knn",
knnControl=knnControl,
tuneGrid = expand.grid(k = c(1:20)),
tuneLength = 10,
importance = T)
#Run model and show output
knnModel
## k-Nearest Neighbors
##
## 60 samples
## 4 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 60, 60, 60, 60, 60, 60, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 238.3725 0.8539807 126.1933
## 2 224.2186 0.8696911 117.5413
## 3 204.6779 0.8905569 107.7150
## 4 192.9165 0.9047350 106.4514
## 5 197.9975 0.9031988 109.1531
## 6 203.7193 0.9060980 110.4996
## 7 213.9738 0.8998088 115.0937
## 8 222.0604 0.8946905 119.4785
## 9 233.8670 0.8890846 124.0909
## 10 245.1505 0.8815652 131.9592
## 11 255.0431 0.8695729 139.4645
## 12 265.0895 0.8676412 145.0892
## 13 271.3346 0.8671004 149.3319
## 14 280.1443 0.8630071 156.2646
## 15 289.8259 0.8604927 164.2969
## 16 301.6022 0.8549850 172.5339
## 17 307.6597 0.8605604 177.9734
## 18 318.3979 0.8572318 184.4209
## 19 330.2090 0.8544466 191.9054
## 20 339.8995 0.8502308 197.9088
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 4.
#importance of each attribute
varImp(knnModel)
## loess r-squared variable importance
##
## Overall
## sumReviews 100.00
## PositiveServiceReview 97.17
## NegativeServiceReview 23.53
## ProductType.Netbook 0.00
#predict on testing
pred <- predict(knnModel, newdata = testingLimited)
summary(pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 91.43 91.43 91.43 91.43 91.43 91.43
postResample(pred, testingLimited$Volume)
## RMSE Rsquared MAE
## 3.428571 NA 3.428571
SVMControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
search="random")
SVMgrid <- expand.grid(C = c(1,5,10,18,50))
#Set seed to know the random order
set.seed(150)
SVMModel <- train(Volume ~., data = training, method = "svmLinear",
trControl=SVMControl,
tuneGrid = SVMgrid)
#Run model and show output
SVMModel
## Support Vector Machines with Linear Kernel
##
## 60 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 55, 53, 53, 54, 53, 53, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1 436.1306 0.9152330 248.9349
## 5 477.1820 0.8585140 263.2508
## 10 487.3140 0.8528044 266.9367
## 18 484.9899 0.8488825 263.6817
## 50 477.3821 0.8496195 257.2033
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 1.
#importance of each attribute
varImp(SVMModel)
## loess r-squared variable importance
##
## Overall
## sumReviews 100.00
## PositiveServiceReview 97.17
## NegativeServiceReview 23.53
## ProductType.Netbook 0.00
#predict on testing set
pred <- predict(SVMModel, newdata = testingLimited)
postResample(pred, testingLimited$Volume)
## RMSE Rsquared MAE
## 49.24266 NA 49.24266
| Parameters | RandomForest | kNN | SupportVectorMachines |
|---|---|---|---|
| Folds | 10 | 10 | 10 |
| repeats | 3 | 3 | 3 |
| tuneLength | 10 | None | None |
| tuneGrid | None | [1:5] | (1,5,10,18,50) |
| Best mtry/k/C | mtry = 4 | k = 3 | C = 50 |
| Training Metrics | |||
| RMSE | 124.2 | 185.3 | 174.4 |
| R Squared | 0.969 | 0.915 | 0.96 |
| Testing Metrics | |||
| RMSE | 7.2 | 23.2 | 14.9 |
| R Squared | NA | NA | NA |
| Parameters | RandomForest | kNN | SupportVectorMachines |
|---|---|---|---|
| Folds | 10 | 10 | 10 |
| repeats | 3 | 3 | 3 |
| tuneLength | 10 | None | None |
| tuneGrid | None | [1:5] | (1,5,10,18,50) |
| Best mtry/k/C | mtry = 4 | k = 3 | C = 50 |
| Training Metrics | |||
| RMSE | 128.5 | 184.4 | 161.3 |
| R Squared | 0.964 | 0.908 | 0.966 |
| Testing Metrics | |||
| RMSE | 3.6 | 26.2 | 31.8 |
| R Squared | 1 | 1 | 1 |
| Parameters | RandomForest | kNN | SupportVectorMachines |
|---|---|---|---|
| Folds | 10 | 10 | 10 |
| repeats | 3 | 3 | 3 |
| tuneLength | 10 | None | None |
| tuneGrid | None | [1:5] | (1,5,10,18,50) |
| Best mtry/k/C | mtry = 3 | k = 4 | C = 1 |
| Training Metrics | |||
| RMSE | 144.5 | 196.8 | 457.7 |
| R Squared | 0.949 | 0.912 | 0.87 |
| Testing Metrics | |||
| RMSE | 3.5 | 62 | 17.9 |
| R Squared | NA | NA | NA |
| Parameters | RandomForest | kNN | SupportVectorMachines |
|---|---|---|---|
| Folds | 10 | 10 | 10 |
| repeats | 3 | 3 | 3 |
| tuneLength | 10 | None | None |
| tuneGrid | None | [1:10] | (1,5,10,18,50) |
| Best mtry/k/C | mtry = 4 | k = 6 | C = 1 |
| Training Metrics | |||
| RMSE | 142.6 | 224.4 | 463.6 |
| R Squared | 0.95 | 0.87 | 0.837 |
| Testing Metrics | |||
| RMSE | 44 | 217.7 | 66.7 |
| R Squared | 1 | 1 | 1 |
The best regression model for each product type is Random Forest. It was the best model in predicting specific product type, this is because of the randomness and the bagging method it utilizes when constructing the trees. Support Vector machine performed slightly worse because it needs a bigger dataset in order to reduce error during training. As for kNN, if the dataset was slightly noisy (and this one was) it will confuse the model and the prediction error on our testing set will be higher.
#load allpredictions file
allpredictions <- read.csv("predictedvolumes.csv")
#adding col
existingDf[c("dataStatus")] <- "Old"
allpredictions[c("dataStatus")] <- "New"
commonCols <- intersect(names(allpredictions), names(existingDf))
allData <-rbind(allpredictions[commonCols], existingDf[commonCols])
allData$sumReviews <- rowSums(allData[,c("x5StarReviews",
"x4StarReviews",
"x3StarReviews",
"x2StarReviews",
"x1StarReviews")])
filteredProducts <- subset(allData, ProductType=="Laptop" | ProductType=="PC" | ProductType=="Smartphone" | ProductType=="Netbook")
summary(filteredProducts)
## ProductType ProductNum Price x5StarReviews
## Smartphone :8 Min. :101.0 Min. : 49.0 Min. : 1.00
## Laptop :6 1st Qu.:150.0 1st Qu.: 307.2 1st Qu.: 4.75
## Netbook :6 Median :177.5 Median : 405.0 Median : 22.50
## PC :6 Mean :165.2 Mean : 628.7 Mean : 58.08
## Accessories:0 3rd Qu.:190.8 3rd Qu.: 837.6 3rd Qu.: 61.00
## Display :0 Max. :197.0 Max. :2250.0 Max. :368.00
## (Other) :0
## x4StarReviews x3StarReviews x2StarReviews x1StarReviews
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.25 1st Qu.: 1.250 1st Qu.: 1.000 1st Qu.: 1.25
## Median : 10.50 Median : 5.000 Median : 3.000 Median :10.50
## Mean : 16.42 Mean : 7.654 Mean : 7.769 Mean :14.19
## 3rd Qu.: 23.50 3rd Qu.:10.750 3rd Qu.:10.750 3rd Qu.:21.75
## Max. :112.00 Max. :37.000 Max. :33.000 Max. :48.00
##
## PositiveServiceReview NegativeServiceReview Recommendproduct
## Min. : 0.000 Min. : 0.000 Min. :0.3000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.:0.6000
## Median : 5.000 Median : 3.000 Median :0.7000
## Mean : 6.462 Mean : 4.577 Mean :0.6423
## 3rd Qu.: 7.750 3rd Qu.: 5.750 3rd Qu.:0.7750
## Max. :28.000 Max. :20.000 Max. :0.9000
##
## BestSellersRank ShippingWeight ProductDepth ProductWidth
## Min. : 109 Min. : 0.700 Min. : 2.600 Min. : 0.300
## 1st Qu.: 1277 1st Qu.: 0.950 1st Qu.: 3.225 1st Qu.: 5.308
## Median : 2723 Median : 4.900 Median : 8.685 Median : 8.685
## Mean : 5766 Mean : 9.655 Mean :11.079 Mean : 9.836
## 3rd Qu.: 5742 3rd Qu.:12.650 3rd Qu.:15.975 3rd Qu.:10.875
## Max. :44465 Max. :50.000 Max. :35.000 Max. :31.750
## NA's :1
## ProductHeight ProfitMargin Volume dataStatus
## Min. : 0.300 Min. :0.0800 Min. : 4.0 Length:26
## 1st Qu.: 0.425 1st Qu.:0.0925 1st Qu.: 39.0 Class :character
## Median : 0.995 Median :0.1100 Median : 100.0 Mode :character
## Mean : 4.170 Mean :0.1312 Mean : 256.6
## 3rd Qu.: 6.668 3rd Qu.:0.1500 3rd Qu.: 288.0
## Max. :20.710 Max. :0.2500 Max. :1472.0
##
## sumReviews
## Min. : 3.00
## 1st Qu.: 9.75
## Median : 61.00
## Mean :104.12
## 3rd Qu.:123.00
## Max. :530.00
##
We can see a wide variance of predictions, not particularly similar to our existing data. In order to verify that our predictions are correct, we need to look at the data tables and check the predictions.
#new var to change the wording in legend of ggplot
status <- factor(filteredProducts$dataStatus)
#ploting the distribution of errors compared to the rest of the data
ggplot(data=filteredProducts) +
geom_point(mapping =aes(x=ProductType, y=Volume, color=status))
metrics <- data.frame(filteredProducts$ProductType, filteredProducts$Volume, filteredProducts$dataStatus, filteredProducts$sumReviews, filteredProducts$PositiveServiceReview, filteredProducts$NegativeServiceReview)
metrics <- metrics[with(metrics, order(filteredProducts.ProductType, filteredProducts.Volume, filteredProducts.dataStatus)), ]
rownames(metrics) <- NULL
rownames(metrics) <- seq(length=nrow(metrics))
The table below shows the main features and how they are affecting the volume.
| ProductType | Volume | Status | Sum Reviews | Positive Service Review | Negative Service Review |
|---|---|---|---|---|---|
| Laptop | 38 | New | 6 | 0 | 1 |
| Laptop | 39 | New | 12 | 2 | 1 |
| Laptop | 88 | Old | 51 | 6 | 2 |
| Laptop | 196 | Old | 88 | 7 | 8 |
| Laptop | 232 | Old | 143 | 7 | 20 |
| Laptop | 515 | New | 101 | 11 | 5 |
kable(netbook, col.names = c("ProductType", "Volume", "Status", "Sum Reviews", "Positive Service Review", "Negative Service Review"),padding=-3L) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T, full_width = F, position = "left")
| ProductType | Volume | Status | Sum Reviews | Positive Service Review | Negative Service Review |
|---|---|---|---|---|---|
| Netbook | 4 | Old | 3 | 0 | 1 |
| Netbook | 39 | New | 8 | 1 | 0 |
| Netbook | 80 | New | 42 | 2 | 4 |
| Netbook | 88 | Old | 50 | 3 | 3 |
| Netbook | 137 | New | 88 | 5 | 16 |
| Netbook | 1181 | New | 530 | 28 | 16 |
kable(pc, col.names = c("ProductType", "Volume", "Status", "Sum Reviews", "Positive Service Review", "Negative Service Review"),padding=-3L) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T, full_width = F, position = "left")
| ProductType | Volume | Status | Sum Reviews | Positive Service Review | Negative Service Review |
|---|---|---|---|---|---|
| PC | 8 | Old | 3 | 1 | 0 |
| PC | 12 | Old | 8 | 2 | 0 |
| PC | 12 | Old | 3 | 1 | 0 |
| PC | 84 | Old | 43 | 5 | 3 |
| PC | 297 | New | 103 | 7 | 5 |
| PC | 460 | New | 175 | 12 | 3 |
kable(smartphone, col.names = c("ProductType", "Volume", "Status", "Sum Reviews", "Positive Service Review", "Negative Service Review"),padding=-3L) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T, full_width = F, position = "left")
| ProductType | Volume | Status | Sum Reviews | Positive Service Review | Negative Service Review |
|---|---|---|---|---|---|
| Smartphone | 16 | Old | 9 | 1 | 1 |
| Smartphone | 72 | Old | 55 | 5 | 4 |
| Smartphone | 112 | New | 67 | 4 | 1 |
| Smartphone | 248 | Old | 120 | 9 | 3 |
| Smartphone | 261 | New | 124 | 5 | 7 |
| Smartphone | 480 | New | 188 | 8 | 6 |
| Smartphone | 501 | New | 244 | 14 | 6 |
| Smartphone | 1472 | Old | 443 | 22 | 3 |
Whenever we have high customer and service reviews, the higher the volume is. 3, 4 and 5 star reviews as well as positive service reviews are our main volume influences based on the correlation matrix we created earlier in the report.
Checking the correlation for each of the product types, we find a consistent relationship between total customer reviews and service reviews with the volume. This means whenever we have a lot of reviews on an item, the odds of selling increase. Looking at the table above, we can clearly notice higher volumes for products with higher reviews. Also 4 and 5 stars have a bigger effet on volumes than 1 and 2 stars.