Business Question

The task is to redo the profitability prediction analysis, but this time using R. We are given four product types, and we need to find the predicted volumes for each of those types. Then we need to assess the impact services reviews and customer reviews have on sales of different product types.

Preprocessing Data

We need to clean our data from any missing values or inconsistent data format in our features. It will also be required to create dummy data for our Product.Type feature to convert it from categorical to numerical attribute. Feature selection and engineering happens in this step as well.

In order to choose the features in the model, a correlation matrix was created in order to see the relationship of all the features with one another. This was done after creating dummy features for each product type as the correlation matrix only works on numerical features. We got the outcome in a table format (hidden, too big). We can see that x5StarReviews is perfectly correlated with the Volume, this is not good as it will overfit the model and needs further investigation as reviews can vary greatly. We also see that 4starreviews is also correlated along with positiveservicereview. This is why we featured engineered sumReviews which includes highly correlated features but are not collinear with any other features.

#creating new variable sumReviews
existingData$sumReviews <- rowSums(existingData[,c("x5StarReviews", 
                                                   "x4StarReviews", 
                                                   "x3StarReviews", 
                                                   "x2StarReviews", 
                                                   "x1StarReviews")])

#scatterplot showing a clear positive correlation between sumReviews and Volume
ggplot(data = existingData, aes(x=sumReviews, y=Volume))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Next, I had to find if my data contains any outliers

#save outliers and remove the rows with outliers
filteredCols <- select(existingData, c(29,30,20,21,"ProductType.Netbook"))

boxplot <-boxplot(filteredCols,
        main = "Multiple boxplots for comparision",
        names = c("Volume", "SumRev", "PosRev", "NegRev",
                  "ProdType.Netbook"),
        par(cex.axis=0.8),
        las=2)

From the boxplot, we can see that volume has 2 big outliers compared to all the other features. In order to have closer predictions, we opt to remove those 2 outliers and clean our data.

outliers <- boxplot(existingData$Volume, plot=FALSE)$out
existingData <- existingData[-which(existingData$Volume %in% outliers),]



#loading r files (to minimize the rmd file size)
read_chunk('models.R')
read_chunk('parameterTables.R')

Algorithms tried

k-NN Model

K nearest neighbor algorithm is very simple. It works based on minimum distance from the query instance to the training samples to determine the K-nearest neighbors. After we gather K nearest neighbors, we take simple majority of these K-nearest neighbors to be the prediction of the query instance. That’s why it’s good in small datasets, doesn’t need a huge number of observations to create a good model

Random Forest

Random Forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is good against overfitting due to the randomness of the algorithm.

Support Vector Machines

SVMs are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). A SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

Creating a Linear model

The aim of linear regression is to model a continuous variable (Volume) as a mathematical function of one or more X variable(s), so that we can use this regression model to predict the Volume when only the X variables are known.

#Set seed to remember the randomization
set.seed(99)

#Creating subset of the dataset
existingData %>% select(c(Volume,
                          PositiveServiceReview, 
                          NegativeServiceReview, 
                          sumReviews, 
                          ProductType.Netbook)) -> ExistingSub

#Normalise/scale attributes except the dependent feature
ExistingSub[,c(2:5)] <- lapply(ExistingSub[,c(2:5)] , scale)

#Splitting data into training set and testing set for cross validation
inTraining <- createDataPartition(ExistingSub$Volume, p = .75, list = FALSE)
training <- ExistingSub[inTraining,]
testing <- ExistingSub[-inTraining,]

#filtering test set to include only netbook products
testingLimited <- testing %>% 
  dplyr::filter(testing$ProductType.Netbook > 1)

Sample of the Cases done

Below are a few cases where the regression model is used on the training set and testing set in order to find the best performing model for each Product type.

RF Netbook
rfControl <- trainControl(method = "repeatedcv", 
                          number = 10, 
                          repeats = 3, 
                          search="random")

#rfGrid <- expand.grid(mtry=c(9,10,11,12,13))

#Set seed to know the random order
set.seed(33)

#Creating RF model
rfModel <- train(Volume~., data = training, 
                 method = "rf",
                 trControl=rfControl, 
                 #tuneGrid=rfGrid, 
                 #classProbs = TRUE,
                 tuneLength=4, 
                 importance=T)

#view model
rfModel
## Random Forest 
## 
## 60 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 54, 55, 54, 55, 56, 54, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     163.2945  0.9410443  94.09219
##   4     141.6765  0.9517085  81.97853
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
#varimp
varImp(rfModel)
## rf variable importance
## 
##                       Overall
## sumReviews             100.00
## PositiveServiceReview   39.79
## ProductType.Netbook     10.18
## NegativeServiceReview    0.00
#predict on testing
pred <- predict(rfModel, newdata = testingLimited)
#summary(pred)

postResample(pred, testingLimited$Volume)
##     RMSE Rsquared      MAE 
## 5.352667       NA 5.352667
kNN-Netbook
knnControl <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 3,
                           classProbs = TRUE)
#always set seed before running models, to be able to recreate it
set.seed(998)
knnModel <- train(Volume ~., data = training, method = "knn",
                  knnControl=knnControl,
                  tuneGrid = expand.grid(k = c(1:20)),
                  tuneLength = 10,
                  importance = T)



#Run model and show output
knnModel
## k-Nearest Neighbors 
## 
## 60 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 60, 60, 60, 60, 60, 60, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    1  238.3725  0.8539807  126.1933
##    2  224.2186  0.8696911  117.5413
##    3  204.6779  0.8905569  107.7150
##    4  192.9165  0.9047350  106.4514
##    5  197.9975  0.9031988  109.1531
##    6  203.7193  0.9060980  110.4996
##    7  213.9738  0.8998088  115.0937
##    8  222.0604  0.8946905  119.4785
##    9  233.8670  0.8890846  124.0909
##   10  245.1505  0.8815652  131.9592
##   11  255.0431  0.8695729  139.4645
##   12  265.0895  0.8676412  145.0892
##   13  271.3346  0.8671004  149.3319
##   14  280.1443  0.8630071  156.2646
##   15  289.8259  0.8604927  164.2969
##   16  301.6022  0.8549850  172.5339
##   17  307.6597  0.8605604  177.9734
##   18  318.3979  0.8572318  184.4209
##   19  330.2090  0.8544466  191.9054
##   20  339.8995  0.8502308  197.9088
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 4.
#importance of each attribute
varImp(knnModel)
## loess r-squared variable importance
## 
##                       Overall
## sumReviews             100.00
## PositiveServiceReview   97.17
## NegativeServiceReview   23.53
## ProductType.Netbook      0.00
#predict on testing
pred <- predict(knnModel, newdata = testingLimited)
summary(pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   91.43   91.43   91.43   91.43   91.43   91.43
postResample(pred, testingLimited$Volume)
##     RMSE Rsquared      MAE 
## 3.428571       NA 3.428571
SVM Netbook
SVMControl <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 3,
                           search="random")
SVMgrid <- expand.grid(C = c(1,5,10,18,50))

#Set seed to know the random order
set.seed(150)

SVMModel <- train(Volume ~., data = training, method = "svmLinear",
                  trControl=SVMControl,
                  tuneGrid = SVMgrid)

#Run model and show output
SVMModel
## Support Vector Machines with Linear Kernel 
## 
## 60 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 55, 53, 53, 54, 53, 53, ... 
## Resampling results across tuning parameters:
## 
##   C   RMSE      Rsquared   MAE     
##    1  436.1306  0.9152330  248.9349
##    5  477.1820  0.8585140  263.2508
##   10  487.3140  0.8528044  266.9367
##   18  484.9899  0.8488825  263.6817
##   50  477.3821  0.8496195  257.2033
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 1.
#importance of each attribute
varImp(SVMModel)
## loess r-squared variable importance
## 
##                       Overall
## sumReviews             100.00
## PositiveServiceReview   97.17
## NegativeServiceReview   23.53
## ProductType.Netbook      0.00
#predict on testing set
pred <- predict(SVMModel, newdata = testingLimited)
postResample(pred, testingLimited$Volume)
##     RMSE Rsquared      MAE 
## 49.24266       NA 49.24266

Cases done based on Product.Type:

Netbook
Parameters RandomForest kNN SupportVectorMachines
Folds 10 10 10
repeats 3 3 3
tuneLength 10 None None
tuneGrid None [1:5] (1,5,10,18,50)
Best mtry/k/C mtry = 4 k = 3 C = 50
Training Metrics
RMSE 124.2 185.3 174.4
R Squared 0.969 0.915 0.96
Testing Metrics
RMSE 7.2 23.2 14.9
R Squared NA NA NA
PC
Parameters RandomForest kNN SupportVectorMachines
Folds 10 10 10
repeats 3 3 3
tuneLength 10 None None
tuneGrid None [1:5] (1,5,10,18,50)
Best mtry/k/C mtry = 4 k = 3 C = 50
Training Metrics
RMSE 128.5 184.4 161.3
R Squared 0.964 0.908 0.966
Testing Metrics
RMSE 3.6 26.2 31.8
R Squared 1 1 1
Laptop
Parameters RandomForest kNN SupportVectorMachines
Folds 10 10 10
repeats 3 3 3
tuneLength 10 None None
tuneGrid None [1:5] (1,5,10,18,50)
Best mtry/k/C mtry = 3 k = 4 C = 1
Training Metrics
RMSE 144.5 196.8 457.7
R Squared 0.949 0.912 0.87
Testing Metrics
RMSE 3.5 62 17.9
R Squared NA NA NA
Smartphone
Parameters RandomForest kNN SupportVectorMachines
Folds 10 10 10
repeats 3 3 3
tuneLength 10 None None
tuneGrid None [1:10] (1,5,10,18,50)
Best mtry/k/C mtry = 4 k = 6 C = 1
Training Metrics
RMSE 142.6 224.4 463.6
R Squared 0.95 0.87 0.837
Testing Metrics
RMSE 44 217.7 66.7
R Squared 1 1 1

The best regression model for each product type is Random Forest. It was the best model in predicting specific product type, this is because of the randomness and the bagging method it utilizes when constructing the trees. Support Vector machine performed slightly worse because it needs a bigger dataset in order to reduce error during training. As for kNN, if the dataset was slightly noisy (and this one was) it will confuse the model and the prediction error on our testing set will be higher.

Predicting the volume for each product type

#load allpredictions file
allpredictions <- read.csv("predictedvolumes.csv")

#adding col
existingDf[c("dataStatus")] <- "Old"
allpredictions[c("dataStatus")] <- "New"

commonCols <- intersect(names(allpredictions), names(existingDf)) 
allData <-rbind(allpredictions[commonCols], existingDf[commonCols]) 

allData$sumReviews <- rowSums(allData[,c("x5StarReviews", 
                                                   "x4StarReviews", 
                                                   "x3StarReviews", 
                                                   "x2StarReviews", 
                                                   "x1StarReviews")])

filteredProducts <- subset(allData, ProductType=="Laptop" | ProductType=="PC" | ProductType=="Smartphone" | ProductType=="Netbook")

summary(filteredProducts)
##       ProductType   ProductNum        Price        x5StarReviews   
##  Smartphone :8    Min.   :101.0   Min.   :  49.0   Min.   :  1.00  
##  Laptop     :6    1st Qu.:150.0   1st Qu.: 307.2   1st Qu.:  4.75  
##  Netbook    :6    Median :177.5   Median : 405.0   Median : 22.50  
##  PC         :6    Mean   :165.2   Mean   : 628.7   Mean   : 58.08  
##  Accessories:0    3rd Qu.:190.8   3rd Qu.: 837.6   3rd Qu.: 61.00  
##  Display    :0    Max.   :197.0   Max.   :2250.0   Max.   :368.00  
##  (Other)    :0                                                     
##  x4StarReviews    x3StarReviews    x2StarReviews    x1StarReviews  
##  Min.   :  0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:  3.25   1st Qu.: 1.250   1st Qu.: 1.000   1st Qu.: 1.25  
##  Median : 10.50   Median : 5.000   Median : 3.000   Median :10.50  
##  Mean   : 16.42   Mean   : 7.654   Mean   : 7.769   Mean   :14.19  
##  3rd Qu.: 23.50   3rd Qu.:10.750   3rd Qu.:10.750   3rd Qu.:21.75  
##  Max.   :112.00   Max.   :37.000   Max.   :33.000   Max.   :48.00  
##                                                                    
##  PositiveServiceReview NegativeServiceReview Recommendproduct
##  Min.   : 0.000        Min.   : 0.000        Min.   :0.3000  
##  1st Qu.: 2.000        1st Qu.: 1.000        1st Qu.:0.6000  
##  Median : 5.000        Median : 3.000        Median :0.7000  
##  Mean   : 6.462        Mean   : 4.577        Mean   :0.6423  
##  3rd Qu.: 7.750        3rd Qu.: 5.750        3rd Qu.:0.7750  
##  Max.   :28.000        Max.   :20.000        Max.   :0.9000  
##                                                              
##  BestSellersRank ShippingWeight    ProductDepth     ProductWidth   
##  Min.   :  109   Min.   : 0.700   Min.   : 2.600   Min.   : 0.300  
##  1st Qu.: 1277   1st Qu.: 0.950   1st Qu.: 3.225   1st Qu.: 5.308  
##  Median : 2723   Median : 4.900   Median : 8.685   Median : 8.685  
##  Mean   : 5766   Mean   : 9.655   Mean   :11.079   Mean   : 9.836  
##  3rd Qu.: 5742   3rd Qu.:12.650   3rd Qu.:15.975   3rd Qu.:10.875  
##  Max.   :44465   Max.   :50.000   Max.   :35.000   Max.   :31.750  
##  NA's   :1                                                         
##  ProductHeight     ProfitMargin        Volume        dataStatus       
##  Min.   : 0.300   Min.   :0.0800   Min.   :   4.0   Length:26         
##  1st Qu.: 0.425   1st Qu.:0.0925   1st Qu.:  39.0   Class :character  
##  Median : 0.995   Median :0.1100   Median : 100.0   Mode  :character  
##  Mean   : 4.170   Mean   :0.1312   Mean   : 256.6                     
##  3rd Qu.: 6.668   3rd Qu.:0.1500   3rd Qu.: 288.0                     
##  Max.   :20.710   Max.   :0.2500   Max.   :1472.0                     
##                                                                       
##    sumReviews    
##  Min.   :  3.00  
##  1st Qu.:  9.75  
##  Median : 61.00  
##  Mean   :104.12  
##  3rd Qu.:123.00  
##  Max.   :530.00  
## 

Distribution of predictions compared to existing data

We can see a wide variance of predictions, not particularly similar to our existing data. In order to verify that our predictions are correct, we need to look at the data tables and check the predictions.

#new var to change the wording in legend of ggplot
status <- factor(filteredProducts$dataStatus)
#ploting the distribution of errors compared to the rest of the data
ggplot(data=filteredProducts) +
  geom_point(mapping =aes(x=ProductType, y=Volume, color=status))

metrics <- data.frame(filteredProducts$ProductType, filteredProducts$Volume, filteredProducts$dataStatus, filteredProducts$sumReviews, filteredProducts$PositiveServiceReview, filteredProducts$NegativeServiceReview)
metrics <- metrics[with(metrics, order(filteredProducts.ProductType, filteredProducts.Volume, filteredProducts.dataStatus)), ]

rownames(metrics) <- NULL 
rownames(metrics) <- seq(length=nrow(metrics)) 

Product Type Prediction Analysis

The table below shows the main features and how they are affecting the volume.

Laptop
ProductType Volume Status Sum Reviews Positive Service Review Negative Service Review
Laptop 38 New 6 0 1
Laptop 39 New 12 2 1
Laptop 88 Old 51 6 2
Laptop 196 Old 88 7 8
Laptop 232 Old 143 7 20
Laptop 515 New 101 11 5
Netbook
kable(netbook, col.names = c("ProductType", "Volume", "Status", "Sum Reviews", "Positive Service Review", "Negative Service Review"),padding=-3L) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T, full_width = F, position = "left")
ProductType Volume Status Sum Reviews Positive Service Review Negative Service Review
Netbook 4 Old 3 0 1
Netbook 39 New 8 1 0
Netbook 80 New 42 2 4
Netbook 88 Old 50 3 3
Netbook 137 New 88 5 16
Netbook 1181 New 530 28 16
PC
kable(pc, col.names = c("ProductType", "Volume", "Status", "Sum Reviews", "Positive Service Review", "Negative Service Review"),padding=-3L) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T, full_width = F, position = "left")
ProductType Volume Status Sum Reviews Positive Service Review Negative Service Review
PC 8 Old 3 1 0
PC 12 Old 8 2 0
PC 12 Old 3 1 0
PC 84 Old 43 5 3
PC 297 New 103 7 5
PC 460 New 175 12 3
Smartphone
kable(smartphone, col.names = c("ProductType", "Volume", "Status", "Sum Reviews", "Positive Service Review", "Negative Service Review"),padding=-3L) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T, full_width = F, position = "left")
ProductType Volume Status Sum Reviews Positive Service Review Negative Service Review
Smartphone 16 Old 9 1 1
Smartphone 72 Old 55 5 4
Smartphone 112 New 67 4 1
Smartphone 248 Old 120 9 3
Smartphone 261 New 124 5 7
Smartphone 480 New 188 8 6
Smartphone 501 New 244 14 6
Smartphone 1472 Old 443 22 3

Whenever we have high customer and service reviews, the higher the volume is. 3, 4 and 5 star reviews as well as positive service reviews are our main volume influences based on the correlation matrix we created earlier in the report.

Assessing the Impact of Customer & Service Reviews on sales of different product type

Checking the correlation for each of the product types, we find a consistent relationship between total customer reviews and service reviews with the volume. This means whenever we have a lot of reviews on an item, the odds of selling increase. Looking at the table above, we can clearly notice higher volumes for products with higher reviews. Also 4 and 5 stars have a bigger effet on volumes than 1 and 2 stars.

Conclusions

  • Random Forest performed the best compared to the other regression models. This is because RF uses a randomized bagging technique which is very effective against over-fitting. In order to optimize the model, I changed the set.seed() number a few times until I found the best randomized set of trees (forest). The support vector machine model performed poorly due to the low number of observation in our data, as it needs higher training sample to give better output. Our kNN model performed the poorest; this is because we had a very small sample size with high number of features (5). It performs better with lower number of features.
  • Positive impact for customer and service reviews; when they both go up, the demand on the product goes up. But the weight of positive service review and 4/5 star reviews is much bigger than negative and lower star reviews.
  • Clear correlation between volume and total customer reviews
  • Removed two outliers found in Volume: 11204 and 7036
  • Top 5 product types by profitability are:
    1. PC - 80385
    2. Laptop - 61748.5
    3. PC - 51084
    4. Netbook - 34969.41
    5. Laptop - 17471.26