Informal Report <task 3>

Y.S. Kim

14-02-2020

Task objectives

The task is to predict the sales of four different product types (PC, Laptops, Netbooks and Smartphones) while assessing the effects of service and customer reviews on sales. The existingproductattributes2017 data were used to build the models, which were then used to make predictions for the newproductattributes2017 data.

Pre-processing

First, outliers were identified using boxplots and scatterplots and removed in Excel before the file was imported into R. The data frame was named ‘existing’.

# Load the data inside your system
existing <- read.csv("C:/Users/Y.S. Kim/Desktop/Ubiqum/R/Task 3 Multiple Regression/existingproductattributes2017_cleaned_without_outlier.csv")

The data was checked for missing values:

## [1] TRUE

The result TRUE shows that there are missing values. The summary() function shows 15 missing values in the attribute BestSellersRank, so this column (column 12) was deleted. Afterwards, any(is.na()) confirmed that the dataset was complete.

existing.new <- existing[-12]
## [1] FALSE
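
The missing-value check described above can be sketched as follows. This is a minimal illustration on a small made-up data frame (the name toy and all values in it are assumptions, not the project data). It shows the overall check, a per-column count to locate the affected attribute, and removal of that column by name rather than by position.

```r
# Hypothetical toy data standing in for the product data (values are made up)
toy <- data.frame(
  Price           = c(949, 399, 409.99, 1079.99),
  BestSellersRank = c(1967, NA, 219, NA),
  Volume          = c(12, 12, 196, 232)
)

# Is there any missing value at all?
has_na_before <- any(is.na(toy))

# Count missing values per column to locate the problem attribute
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Drop the affected column by name instead of by index
toy_clean <- toy[, setdiff(names(toy), "BestSellersRank")]

# Confirm that no missing values remain
has_na_after <- any(is.na(toy_clean))
```

Removing by name gives the same result as existing[-12] as long as BestSellersRank is the twelfth column, but it keeps working if the column order changes.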

Using the glimpse function, the data types were checked:

## Observations: 65
## Variables: 17
## $ ProductType           <fct> PC, PC, Laptop, Laptop, Accessories, Accessor...
## $ ProductNum            <int> 101, 103, 104, 105, 106, 107, 108, 109, 110, ...
## $ Price                 <dbl> 949.00, 399.00, 409.99, 1079.99, 114.22, 379....
## $ x5StarReviews         <int> 3, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 10,...
## $ x4StarReviews         <int> 3, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8, 62, ...
## $ x3StarReviews         <int> 2, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13, 27...
## $ x2StarReviews         <int> 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, 7, 2, ...
## $ x1StarReviews         <int> 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, 16, 5,...
## $ PositiveServiceReview <int> 2, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 57, ...
## $ NegativeServiceReview <int> 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3, 3, 0,...
## $ Recommendproduct      <dbl> 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, 0.9, ...
## $ ShippingWeight        <dbl> 25.80, 17.40, 5.70, 7.00, 1.60, 7.30, 12.00, ...
## $ ProductDepth          <dbl> 23.94, 10.50, 15.00, 12.90, 5.80, 6.70, 7.90,...
## $ ProductWidth          <dbl> 6.62, 8.30, 9.90, 0.30, 4.00, 10.30, 6.70, 9....
## $ ProductHeight         <dbl> 16.89, 10.20, 1.30, 8.90, 1.00, 11.50, 2.20, ...
## $ ProfitMargin          <dbl> 0.15, 0.08, 0.08, 0.09, 0.05, 0.05, 0.05, 0.0...
## $ Volume                <int> 12, 12, 196, 232, 332, 44, 132, 64, 40, 84, 3...

After loading the caret package, the attribute with categorical data (ProductType) was converted to numerical data, by dummifying the data, since we are dealing with a regression problem:

#dummify the data 
existing.dummified <- dummyVars(" ~ .", data = existing.new)
existing.transformed <- data.frame(predict(existing.dummified, newdata = existing.new))

A correlation matrix was created to inspect the correlation coefficients of the attributes and, after installing the package ‘corrplot’, a heat map was created to visualize the correlation matrix.

After evaluating the correlation coefficients, feature selection was performed. Attributes were omitted based on collinearity with other predictors (correlation > 0.90) and on their relevance for the output variable Volume. x5StarReviews was omitted as well because of its perfect correlation with Volume.

The ProductType attributes with low correlation coefficient were removed as well, leaving GameConsole and Accessories in the model.

The following features were selected:

## [1] "ProductType.Accessories" "ProductType.GameConsole"
## [3] "x4StarReviews"           "x2StarReviews"          
## [5] "x1StarReviews"           "PositiveServiceReview"  
## [7] "Volume"
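
The collinearity screening described above can be sketched in base R as follows. This is an illustration on synthetic data, not the selection that was actually run: the data frame toy, its columns a, b and c_var, and the rule "drop the member of a collinear pair that correlates less with Volume" are assumptions made for the example.

```r
set.seed(1)
n <- 50
# Hypothetical predictors: b is nearly a copy of a (collinear), c_var is independent
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.05)
c_var <- rnorm(n)
volume <- 3 * a + rnorm(n)
toy <- data.frame(a = a, b = b, c_var = c_var, Volume = volume)

# Correlation matrix of all attributes
cor_mat <- cor(toy)

# Find predictor pairs whose absolute correlation exceeds the 0.90 threshold
pred_names <- setdiff(names(toy), "Volume")
pred_cor <- abs(cor_mat[pred_names, pred_names])
pred_cor[upper.tri(pred_cor, diag = TRUE)] <- 0   # look at each pair once
high_pairs <- which(pred_cor > 0.90, arr.ind = TRUE)

# For each collinear pair, drop the member less correlated with Volume
to_drop <- character(0)
for (i in seq_len(nrow(high_pairs))) {
  p1 <- pred_names[high_pairs[i, "row"]]
  p2 <- pred_names[high_pairs[i, "col"]]
  weaker <- if (abs(cor_mat[p1, "Volume"]) < abs(cor_mat[p2, "Volume"])) p1 else p2
  to_drop <- union(to_drop, weaker)
}
selected <- setdiff(pred_names, to_drop)
print(selected)
```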

Training model

The dataset was split into a trainingset (80%) and a testset (20%):

#creating train and testdata (80/20%)
set.seed(532)

trainSize<-round(nrow(existing.transformed)*0.8)
testSize<-nrow(existing.transformed)-trainSize

training_indices<-sample(seq_len(nrow(existing.transformed)),size =trainSize)
trainSet<-existing.transformed[training_indices,]
testSet<-existing.transformed[-training_indices,] 

Linear Regression Model

First, a linear regression model was trained, starting with PositiveServiceReview as the independent variable and Volume as the dependent variable. Cross-validation was performed using “repeatedcv” with number = 10 and repeats = 1.

#10 fold cross validation
set.seed(532)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

#train model: LM
mod_lm <- train(Volume~PositiveServiceReview,
                data = trainSet,
                method="lm",
                trControl = fitControl)
## Linear Regression 
## 
## 52 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 48, 46, 48, 47, 46, 47, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   189.7157  0.9148687  124.2393
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -559.18  -70.08  -40.49  -10.61  738.22 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             19.434     34.266   0.567    0.573    
## PositiveServiceReview   32.470      2.356  13.783   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 201.7 on 50 degrees of freedom
## Multiple R-squared:  0.7916, Adjusted R-squared:  0.7875 
## F-statistic:   190 on 1 and 50 DF,  p-value: < 2.2e-16

Attributes were added stepwise one-by-one to the model, and were included if the adjusted R-squared improved. This resulted in the following model with 3 predictors: PositiveServiceReview, x4StarReviews and ProductType.GameConsole.

#10 fold cross validation
set.seed(532)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

mod_lm5 <- train(Volume~PositiveServiceReview+x4StarReviews+ProductType.GameConsole,
                 data = trainSet,
                 method="lm",
                 trControl = fitControl)
## Linear Regression 
## 
## 52 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 48, 46, 48, 47, 46, 47, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   183.0133  0.9283009  113.6324
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -492.60  -55.73  -28.45   -1.35  806.33 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               12.835     31.072   0.413 0.681379    
## PositiveServiceReview     27.617      3.369   8.198 1.11e-10 ***
## x4StarReviews              1.616      1.614   1.002 0.321514    
## ProductType.GameConsole  729.655    187.935   3.882 0.000315 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 173 on 48 degrees of freedom
## Multiple R-squared:  0.8527, Adjusted R-squared:  0.8435 
## F-statistic: 92.65 on 3 and 48 DF,  p-value: < 2.2e-16
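
The stepwise addition of attributes based on adjusted R-squared can be sketched as a small forward-selection loop. This is a minimal illustration with plain lm() on synthetic data; the data frame toy and the variables x1, x2, x3 and y are hypothetical, and the loop is not necessarily the exact procedure followed in this report.

```r
set.seed(2)
n <- 60
# Hypothetical data: y depends on x1 and x2; x3 is pure noise
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + 1 * toy$x2 + rnorm(n)

candidates <- c("x1", "x2", "x3")
selected <- character(0)
best_adj_r2 <- -Inf

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break

  # Adjusted R-squared of the current model plus each remaining candidate
  adj_r2 <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "y")
    summary(lm(f, data = toy))$adj.r.squared
  })

  # Stop when no candidate improves the adjusted R-squared
  if (max(adj_r2) <= best_adj_r2) break
  best_adj_r2 <- max(adj_r2)
  selected <- c(selected, names(which.max(adj_r2)))
}

print(selected)
print(best_adj_r2)
```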

Since the residuals of the linear model are not normally distributed, I decided to train non-parametric models instead of a linear regression model.
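
How the normality of the residuals was assessed is not shown in this report. One common way to check it is a Shapiro-Wilk test combined with a Q-Q plot, sketched below on synthetic data. The variables x and y and the skewed error term are assumptions chosen only to illustrate the check.

```r
set.seed(3)
n <- 52
# Hypothetical data with right-skewed errors (an assumption for illustration)
x <- runif(n, 0, 50)
y <- 20 + 30 * x + (rexp(n, rate = 1 / 150) - 150)
fit <- lm(y ~ x)

res <- residuals(fit)

# Shapiro-Wilk test: a small p-value is evidence against normally distributed residuals
sw <- shapiro.test(res)
print(sw$p.value)

# Visual check (uncomment in an interactive session)
# qqnorm(res); qqline(res)
```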

Non-parametric Machine Learning models

Creation of trainset and testset

set.seed(999)
#split to train and test data 80/20
inTraining <- createDataPartition(existing.transformed$Volume, p = 0.8, list = FALSE)
trainset2 <- existing.transformed[inTraining,]
testset2 <- existing.transformed[-inTraining,]

Random Forest

After splitting the data into a trainset and testset, the Random Forest model was trained. The same 3 predictors were included to predict the output variable Volume. Pre-processing with “center” and “scale” was applied.

set.seed(999)

#10 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

mod_rf <- train(Volume~PositiveServiceReview+x4StarReviews+ProductType.GameConsole,
                data = trainset2,
                method="rf",
                trControl = fitControl,
                preProcess = c("center","scale"),
                Tunelength = 20)  # caret's argument is tuneLength; 'Tunelength' is not matched, so the default grid is used
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
mod_rf
## Random Forest 
## 
## 53 samples
##  3 predictor
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 48, 48, 49, 47, 48, 48, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     162.3054  0.8985241  92.55416
##   3     160.3005  0.8881183  93.04651
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.

PostResample was performed to evaluate the performance of the model on the testset:

#Results on testset
test_results_rf <- predict(object = mod_rf, 
                           newdata = testset2)
postResample(testset2$Volume, test_results_rf) 
##        RMSE    Rsquared         MAE 
## 245.5874723   0.9671046 128.3724468

The gap between the cross-validated RMSE on the trainset and the RMSE on the testset (160 vs. 245, respectively) indicates that the model is somewhat overfitting.

k-NN

Next, another non-parametric model, k-NN, was trained after installing the “ISLR” package.

set.seed(999)

#10 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

mod_kNN <- train(Volume~PositiveServiceReview+x4StarReviews+ProductType.GameConsole,
                 data = trainset2,
                 method = "knn",
                 trControl = fitControl,
                 preProcess = c("center","scale"),
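                 Tunelength=20)  # caret's argument is tuneLength; 'Tunelength' is not matched, so the default grid (3 values of k) is used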
mod_kNN  
## k-Nearest Neighbors 
## 
## 53 samples
##  3 predictor
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 48, 48, 49, 47, 48, 48, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE      
##   5  157.0234  0.9109473   88.46938
##   7  169.3998  0.8965086   97.49935
##   9  179.6905  0.8787475  104.34117
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

Results on the testset:

#Results on testset
test_results_kNN <- predict(object = mod_kNN, 
                            newdata = testset2)
postResample(testset2$Volume, test_results_kNN) 
##        RMSE    Rsquared         MAE 
## 318.2856579   0.8952037 151.5333333

The gap between the cross-validated RMSE (157) and the RMSE on the testset (318) indicates that the k-NN model is overfitting the data.

Tuning parameters

The next step was tuning the parameters of both models to reduce overfitting.

Random Forest

First the number of folds was reduced to 5.

#reduce number of folds to 5
set.seed(999)

#5 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)

mod_rf2 <- train(Volume~PositiveServiceReview+x4StarReviews+ProductType.GameConsole,
                data = trainset2,
                method="rf",
                trControl = fitControl,
                preProcess = c("center","scale"),
                Tunelength = 20)  # caret's argument is tuneLength; 'Tunelength' is not matched, so the default grid is used
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
mod_rf2
## Random Forest 
## 
## 53 samples
##  3 predictor
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 43, 41, 43, 43, 42 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     175.8397  0.8772036  93.59335
##   3     175.6154  0.8835685  94.94684
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
#Results on testset
test_results_rf2 <- predict(object = mod_rf2, 
                           newdata = testset2)
postResample(testset2$Volume, test_results_rf2) 
##        RMSE    Rsquared         MAE 
## 242.2859652   0.9667954 125.7231904

The difference in RMSE (176 cross-validated vs. 242 on the testset) still indicates overfitting, so in the next step we removed the attribute x4StarReviews:

#remove 1 attribute (x4StarReviews)
set.seed(999)

#5 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)

mod_rf3 <- train(Volume~PositiveServiceReview+ProductType.GameConsole,
                data = trainset2,
                method="rf",
                trControl = fitControl,
                preProcess = c("center","scale"),
                Tunelength = 20)  # caret's argument is tuneLength; 'Tunelength' is not matched, so the default grid is used
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
mod_rf3
## Random Forest 
## 
## 53 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 43, 41, 43, 43, 42 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   178.9265  0.8707863  104.4397
## 
## Tuning parameter 'mtry' was held constant at a value of 2
#Results on testset
test_results_rf3 <- predict(object = mod_rf3, 
                           newdata = testset2)
postResample(testset2$Volume, test_results_rf3) 
##        RMSE    Rsquared         MAE 
## 225.6951958   0.9676274 132.5573636

We reduced the number of folds further to 3.

#reduce number of folds to 3
set.seed(999)

#3 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 1, verboseIter = FALSE)

mod_rf4 <- train(Volume~PositiveServiceReview+ProductType.GameConsole,
                data = trainset2,
                method="rf",
                trControl = fitControl,
                preProcess = c("center","scale"),
                Tunelength = 20)  # caret's argument is tuneLength; 'Tunelength' is not matched, so the default grid is used
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
mod_rf4 
## Random Forest 
## 
## 53 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (3 fold, repeated 1 times) 
## Summary of sample sizes: 36, 35, 35 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   181.6469  0.8403215  111.3117
## 
## Tuning parameter 'mtry' was held constant at a value of 2
#Results on testset
test_results_rf4 <- predict(object = mod_rf4, 
                           newdata = testset2)
postResample(testset2$Volume, test_results_rf4) 
##        RMSE    Rsquared         MAE 
## 228.8082080   0.9689043 134.3636493

The difference in RMSE (182 cross-validated on the trainset vs. 229 on the testset) was considered acceptable.

k-NN

The parameters of the k-NN model were tuned to reduce overfitting. First the number of folds was reduced to 5.

#reduce number of folds to 5
set.seed(999)

#5 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)

mod_kNN2 <- train(Volume~PositiveServiceReview+x4StarReviews+ProductType.GameConsole,
                 data = trainset2,
                 method = "knn",
                 trControl = fitControl,
                 preProcess = c("center","scale"),
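                 Tunelength=20)  # caret's argument is tuneLength; 'Tunelength' is not matched, so the default grid (3 values of k) is used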
mod_kNN2
## k-Nearest Neighbors 
## 
## 53 samples
##  3 predictor
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 43, 41, 43, 43, 42 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE      
##   5  167.5157  0.8945650   88.18387
##   7  194.6543  0.8518548   98.19663
##   9  212.0035  0.8179905  105.98942
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
#Results on testset
test_results_kNN2 <- predict(object = mod_kNN2, 
                            newdata = testset2)
postResample(testset2$Volume, test_results_kNN2)
##        RMSE    Rsquared         MAE 
## 318.2856579   0.8952037 151.5333333

Reducing the number of folds did not reduce overfitting: k = 5 was selected again, so the final model and its RMSE on the testset (318) are unchanged. In the next step the attribute x4StarReviews was removed.

#remove 1 attribute (x4StarReviews)
set.seed(999)

#5 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)

mod_kNN3 <- train(Volume~PositiveServiceReview+ProductType.GameConsole,
                 data = trainset2,
                 method = "knn",
                 trControl = fitControl,
                 preProcess = c("center","scale"),
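                 Tunelength=20)  # caret's argument is tuneLength; 'Tunelength' is not matched, so the default grid (3 values of k) is used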
mod_kNN3  
## k-Nearest Neighbors 
## 
## 53 samples
##  2 predictor
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 43, 41, 43, 43, 42 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  172.5964  0.9157711  100.0659
##   7  190.2809  0.9193296  113.7852
##   9  215.0542  0.9096555  120.5562
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
#Results on testset
test_results_kNN3 <- predict(object = mod_kNN3, 
                            newdata = testset2)
postResample(testset2$Volume, test_results_kNN3) 
##        RMSE    Rsquared         MAE 
## 279.1377010   0.9045463 148.6719308

Tuning other parameters, including tuneLength and the number of folds, did not lead to a further reduction of overfitting.

Predicting Volume using Random Forest

Both models were applied to the testing data using the predict() function. The difference between predicted and observed values was calculated and a new dataframe was created with the columns predicted, observed and difference values.

#Random Forest
Predict_rf <- predict(mod_rf4, newdata=testset2)
#predicted values
Predict_rf 
##          2         18         19         22         31         39         42 
##   15.91488  145.03963  109.94639 1132.49107  172.60415 1378.70987  292.75786 
##         45         54         60         61         65 
##  145.03963   64.10508   17.79085  337.56770 1115.79293
#observed values
testset2$Volume 
##  [1]   12   80  136 1224   88 1896  360   84   32   12  248 1684
#Dataframe created with extra column (difference between the predicted and observed)
df_rf
##     Predicted Observed  Difference
## 2    15.91488       12    3.914883
## 18  145.03963       80   65.039627
## 19  109.94639      136  -26.053609
## 22 1132.49107     1224  -91.508933
## 31  172.60415       88   84.604146
## 39 1378.70987     1896 -517.290133
## 42  292.75786      360  -67.242141
## 45  145.03963       84   61.039627
## 54   64.10508       32   32.105076
## 60   17.79085       12    5.790854
## 61  337.56770      248   89.567695
## 65 1115.79293     1684 -568.207067
summary(difference_rf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -568.207  -73.309    4.853  -77.353   62.040   89.568

The median difference between predicted and observed values was 4.853.
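
The code that builds df_rf is not shown above. A minimal sketch of how such a table and the corresponding error metrics could be computed is given below, using the predicted and observed values copied from the output. The name df_sketch is hypothetical, and the metrics are computed by hand in base R rather than with postResample().

```r
# Predicted and observed values copied from the output above
predicted <- c(15.91488, 145.03963, 109.94639, 1132.49107, 172.60415, 1378.70987,
               292.75786, 145.03963, 64.10508, 17.79085, 337.56770, 1115.79293)
observed  <- c(12, 80, 136, 1224, 88, 1896, 360, 84, 32, 12, 248, 1684)

# Data frame with an extra column for the difference (predicted minus observed)
df_sketch <- data.frame(Predicted = predicted, Observed = observed)
df_sketch$Difference <- df_sketch$Predicted - df_sketch$Observed

# Distribution of the differences
difference_summary <- summary(df_sketch$Difference)
print(difference_summary)

# Error metrics computed by hand from the differences
rmse <- sqrt(mean(df_sketch$Difference^2))
mae  <- mean(abs(df_sketch$Difference))
print(c(RMSE = rmse, MAE = mae))
```

The median hides the two large underpredictions for the highest-volume products (observed 1896 and 1684), which is why the mean difference (-77.353) and the RMSE give a less favourable picture than the median alone.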

Predicting Volume using k-NN

#k-NN
Predict_kNN <- predict(mod_kNN3, newdata=testset2)
#predicted values
Predict_kNN
##  [1]   16.00000  111.38462  110.00000 1294.40000  184.72727 1294.40000
##  [7]  294.66667  111.38462   85.66667   18.50000  310.66667  945.60000
#observed values
testset2$Volume
##  [1]   12   80  136 1224   88 1896  360   84   32   12  248 1684
#Dataframe 
df_kNN
##     Predicted Observed Difference
## 1    16.00000       12    4.00000
## 2   111.38462       80   31.38462
## 3   110.00000      136  -26.00000
## 4  1294.40000     1224   70.40000
## 5   184.72727       88   96.72727
## 6  1294.40000     1896 -601.60000
## 7   294.66667      360  -65.33333
## 8   111.38462       84   27.38462
## 9    85.66667       32   53.66667
## 10   18.50000       12    6.50000
## 11  310.66667      248   62.66667
## 12  945.60000     1684 -738.40000
summary(difference_kNN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -738.40  -35.83   16.94  -89.88   55.92   96.73

The median difference between predicted and observed values was 16.94.

Comparing both models, Random Forest was considered the best model, given its lower RMSE on the testset (229 vs. 279 for k-NN) and acceptable performance metrics on both trainset and testset. The smaller median difference between predicted and observed values in the testset (4.853 vs. 16.94) suggests that Random Forest predicts new data more accurately than k-NN.

As the final step, the newproductattributes2017.csv file was imported. The new products data set was pre-processed so that it has the same structure as the existing products data. The ProductType data were dummified using the dummyVars() function, and the attributes that had been removed from the existing products data set were removed here as well.

#loading 
newproductattributes2017 <- read.csv("C:/Users/Y.S. Kim/Desktop/Ubiqum/R/Task 3 Multiple Regression/newproductattributes2017.csv")

#renaming
new <- newproductattributes2017

#preprocessing/feature selection, removing attributes
new[2:4] <- NULL
new[3] <- NULL
new[6:13] <- NULL

# dummify the data in ProductType and remove attributes
new.dummified <- dummyVars(" ~ .", data = new)
new.transformed <- data.frame(predict(new.dummified, newdata = new))
new.transformed[2:3] <- NULL
new.transformed[3:10] <- NULL
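
Removing columns by index, as above, depends on the exact column order of the file. A hedged alternative is to select the model inputs by name. The sketch below uses a made-up stand-in for the new-products file: new_toy, the Price and x4StarReviews values, and the helper names model_cols and new_model_input are assumptions for illustration only.

```r
# Hypothetical stand-in for the new-products file (subset of columns, partly made-up values)
new_toy <- data.frame(
  ProductType           = c("PC", "GameConsole", "Laptop"),
  ProductNum            = c(171, 199, 173),
  Price                 = c(699, 249.99, 1199),
  x4StarReviews         = c(26, 100, 10),
  PositiveServiceReview = c(12, 32, 11),
  Volume                = c(0, 0, 0)
)

# Dummy-code the single product-type level the model needs, by name
new_toy$ProductType.GameConsole <- as.numeric(new_toy$ProductType == "GameConsole")

# Keep only the columns the model formula refers to
model_cols <- c("PositiveServiceReview", "ProductType.GameConsole")
new_model_input <- new_toy[, model_cols, drop = FALSE]
print(new_model_input)
```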

The selected model Random Forest (mod_rf4) was used to predict Volume in this file.

#predict on newproductattributes using random forest
finalPred <- predict(mod_rf4, newdata=new.transformed)
##          1          2          3          4          5          6          7 
##  488.15447  292.75786  483.13180   48.27178   17.79085   48.27178 1084.82960 
##          8          9         10         11         12         13         14 
##  109.94639   15.91488 1084.82960 1378.70987  342.32043  624.45480  145.03963 
##         15         16         17         18         19         20         21 
##  109.94639 1150.88027   15.91488   48.27178   48.27178  145.03963  109.94639 
##         22         23         24 
##   15.91488   17.79085 1378.70987

Top 10 new products sorted by predicted volume (Random Forest):

#Add predictions to the new products data set 
output <- newproductattributes2017 
output$predictions <- finalPred
##    ProductType ProductNum PositiveServiceReview predictions
## 1       Tablet        187                    90   1378.7099
## 2  GameConsole        307                    59   1378.7099
## 3  GameConsole        199                    32   1150.8803
## 4      Netbook        180                    28   1084.8296
## 5       Tablet        186                    28   1084.8296
## 6   Smartphone        194                    14    624.4548
## 7           PC        171                    12    488.1545
## 8       Laptop        173                    11    483.1318
## 9   Smartphone        193                     8    342.3204
## 10          PC        172                     7    292.7579

The task was to make sales predictions for the 4 target product types (PC, Laptops, Netbooks and Smartphones):

##    ProductType ProductNum PositiveServiceReview predictions
## 1           PC        171                    12   488.15447
## 2           PC        172                     7   292.75786
## 3       Laptop        173                    11   483.13180
## 4       Laptop        175                     2    48.27178
## 5       Laptop        176                     0    17.79085
## 6      Netbook        178                     2    48.27178
## 7      Netbook        180                    28  1084.82960
## 8      Netbook        181                     5   109.94639
## 9      Netbook        183                     1    15.91488
## 10  Smartphone        193                     8   342.32043
## 11  Smartphone        194                    14   624.45480
## 12  Smartphone        195                     4   145.03963
## 13  Smartphone        196                     5   109.94639

Mean sales prediction, grouped by Product Type

## # A tibble: 4 x 2
##   ProductType  mean
##   <fct>       <dbl>
## 1 PC           390.
## 2 Netbook      315.
## 3 Smartphone   305.
## 4 Laptop       183.
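
The code for the grouped means is not shown above. As a sketch, the same summary can be reproduced in base R from the Random Forest predictions listed in the previous table. The data frame name target is hypothetical, and aggregate() is used here in place of whichever grouping function produced the tibble.

```r
# Random Forest predictions for the four target product types, copied from the table above
target <- data.frame(
  ProductType = c("PC", "PC", "Laptop", "Laptop", "Laptop",
                  "Netbook", "Netbook", "Netbook", "Netbook",
                  "Smartphone", "Smartphone", "Smartphone", "Smartphone"),
  predictions = c(488.15447, 292.75786, 483.13180, 48.27178, 17.79085,
                  48.27178, 1084.82960, 109.94639, 15.91488,
                  342.32043, 624.45480, 145.03963, 109.94639)
)

# Mean prediction per product type, sorted from highest to lowest
means <- aggregate(predictions ~ ProductType, data = target, FUN = mean)
means <- means[order(-means$predictions), ]
print(means)
```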

After predicting volume using the Random Forest model, the k-NN model was applied to the new products dataset to see whether there is a general agreement between both models, which would give extra strength to the predictions.

##predict on newproductattributes using kNN
finalPred2 <- predict(mod_kNN3, newdata=new.transformed)
##  [1]  570.4000  294.6667  525.3333   48.0000   18.5000   48.0000  876.0000
##  [8]  110.0000   16.0000  876.0000 1294.4000  310.6667  576.0000  111.3846
## [15]  110.0000 1044.0000   16.0000   48.0000   48.0000  111.3846  110.0000
## [22]   16.0000   18.5000 1294.4000

Top 10 new products sorted by predicted volume (by k-NN):

#Add predictions to the new products data set 
output3 <- newproductattributes2017 
output3$predictions <- finalPred2
##    ProductType ProductNum PositiveServiceReview predictions
## 1       Tablet        187                    90   1294.4000
## 2  GameConsole        307                    59   1294.4000
## 3  GameConsole        199                    32   1044.0000
## 4      Netbook        180                    28    876.0000
## 5       Tablet        186                    28    876.0000
## 6   Smartphone        194                    14    576.0000
## 7           PC        171                    12    570.4000
## 8       Laptop        173                    11    525.3333
## 9   Smartphone        193                     8    310.6667
## 10          PC        172                     7    294.6667

Sales predictions for the 4 target product types: PC, Laptops, Netbooks and Smartphones

##    ProductType ProductNum PositiveServiceReview predictions
## 1           PC        171                    12    570.4000
## 2           PC        172                     7    294.6667
## 3       Laptop        173                    11    525.3333
## 4       Laptop        175                     2     48.0000
## 5       Laptop        176                     0     18.5000
## 6      Netbook        178                     2     48.0000
## 7      Netbook        180                    28    876.0000
## 8      Netbook        181                     5    110.0000
## 9      Netbook        183                     1     16.0000
## 10  Smartphone        193                     8    310.6667
## 11  Smartphone        194                    14    576.0000
## 12  Smartphone        195                     4    111.3846
## 13  Smartphone        196                     5    110.0000

Mean sales prediction, grouped by Product Type

## # A tibble: 4 x 2
##   ProductType  mean
##   <fct>       <dbl>
## 1 PC           433.
## 2 Smartphone   277.
## 3 Netbook      262.
## 4 Laptop       197.

The two models (Random Forest and k-NN) were in general agreement. The top 10 products sorted by predicted volume were the same and in the same order for both models. Among the four target product types, PC had the highest mean predicted Volume and Laptop the lowest in both models; Netbook ranked second and Smartphone third for Random Forest, and the reverse for k-NN. Descriptive statistics showed comparable values for PositiveServiceReview and x4StarReviews in the existing and new data, supporting generalization of the model to the new data.
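
The agreement between the two models can also be expressed as a number. The sketch below computes the Pearson and Spearman correlation between the two sets of 24 predictions, copied from the outputs of finalPred and finalPred2 above. This check was not part of the original analysis and is offered only as one possible way to quantify the agreement.

```r
# Random Forest predictions for the 24 new products (copied from finalPred above)
rf_pred <- c(488.15447, 292.75786, 483.13180, 48.27178, 17.79085, 48.27178, 1084.82960,
             109.94639, 15.91488, 1084.82960, 1378.70987, 342.32043, 624.45480, 145.03963,
             109.94639, 1150.88027, 15.91488, 48.27178, 48.27178, 145.03963, 109.94639,
             15.91488, 17.79085, 1378.70987)

# k-NN predictions for the same products (copied from finalPred2 above)
knn_pred <- c(570.4000, 294.6667, 525.3333, 48.0000, 18.5000, 48.0000, 876.0000,
              110.0000, 16.0000, 876.0000, 1294.4000, 310.6667, 576.0000, 111.3846,
              110.0000, 1044.0000, 16.0000, 48.0000, 48.0000, 111.3846, 110.0000,
              16.0000, 18.5000, 1294.4000)

# Linear agreement of the predicted volumes
pearson <- cor(rf_pred, knn_pred)

# Agreement of the product ranking
spearman <- cor(rf_pred, knn_pred, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```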