The task is to predict sales for four product types (PC, Laptop, Netbook and Smartphone) while assessing the effects of service and customer reviews on sales. The existingproductattributes2017 data were used to build the models and to make predictions on the newproductattributes2017 data.
First, outliers were identified using boxplots and scatterplots and removed in Excel before the file was imported into R. The dataframe was renamed 'existing'.
# Load the data inside your system
existing <- read.csv("C:/Users/Y.S. Kim/Desktop/Ubiqum/R/Task 3 Multiple Regression/existingproductattributes2017_cleaned_without_outlier.csv")
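Although the outlier screening was done in Excel, an equivalent check could be run in R. A minimal sketch, using Volume and one of the review attributes as examples:
#hedged sketch: outlier inspection with a boxplot and a scatterplot
boxplot(existing$Volume, main = "Volume")
plot(existing$x4StarReviews, existing$Volume, xlab = "x4StarReviews", ylab = "Volume")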
The data were checked for missing values:
any(is.na(existing))
## [1] TRUE
This shows that there are missing values. The summary() function revealed 15 missing values in the attribute BestSellersRank, so this column (index 12) was deleted. Afterwards, any(is.na()) confirmed that the dataset was complete:
existing.new <- existing[-12]
any(is.na(existing.new))
## [1] FALSE
The data types were checked using the glimpse() function (dplyr):
glimpse(existing.new)
## Observations: 65
## Variables: 17
## $ ProductType <fct> PC, PC, Laptop, Laptop, Accessories, Accessor...
## $ ProductNum <int> 101, 103, 104, 105, 106, 107, 108, 109, 110, ...
## $ Price <dbl> 949.00, 399.00, 409.99, 1079.99, 114.22, 379....
## $ x5StarReviews <int> 3, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 10,...
## $ x4StarReviews <int> 3, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8, 62, ...
## $ x3StarReviews <int> 2, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13, 27...
## $ x2StarReviews <int> 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, 7, 2, ...
## $ x1StarReviews <int> 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, 16, 5,...
## $ PositiveServiceReview <int> 2, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 57, ...
## $ NegativeServiceReview <int> 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3, 3, 0,...
## $ Recommendproduct <dbl> 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, 0.9, ...
## $ ShippingWeight <dbl> 25.80, 17.40, 5.70, 7.00, 1.60, 7.30, 12.00, ...
## $ ProductDepth <dbl> 23.94, 10.50, 15.00, 12.90, 5.80, 6.70, 7.90,...
## $ ProductWidth <dbl> 6.62, 8.30, 9.90, 0.30, 4.00, 10.30, 6.70, 9....
## $ ProductHeight <dbl> 16.89, 10.20, 1.30, 8.90, 1.00, 11.50, 2.20, ...
## $ ProfitMargin <dbl> 0.15, 0.08, 0.08, 0.09, 0.05, 0.05, 0.05, 0.0...
## $ Volume <int> 12, 12, 196, 232, 332, 44, 132, 64, 40, 84, 3...
After loading the caret package, the categorical attribute (ProductType) was converted to numerical data by dummifying it, since we are dealing with a regression problem:
#dummify the data
existing.dummified <- dummyVars(" ~ .", data = existing.new)
existing.transformed <- data.frame(predict(existing.dummified, newdata = existing.new))
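A quick check of the resulting columns can confirm the conversion; caret names the dummies ProductType.&lt;level&gt; (e.g. ProductType.Accessories). A minimal sketch:
#inspect the first few dummified column names
head(names(existing.transformed))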
A correlation matrix was created to inspect the correlation coefficients of the attributes, and after installing the 'corrplot' package, a heat map was created to visualize the correlation matrix:
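The corrplot call itself is not shown in the original; a minimal sketch, assuming default corrplot() settings:
#hedged sketch: correlation matrix and heat map
library(corrplot)
corrData <- cor(existing.transformed)
corrplot(corrData, method = "color", tl.cex = 0.6)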
After evaluating the correlation coefficients, feature selection was performed. Attributes were omitted based on collinearity (correlation > 0.90) and low relevance for the output variable Volume; x5StarReviews was also dropped because of its perfect correlation with Volume.
The ProductType dummy attributes with low correlation coefficients were removed as well, leaving GameConsole and Accessories in the model.
The following features were selected:
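The subsetting code is not shown in the original; a hedged reconstruction (the object names features and existing.selected are assumptions):
#keep only the selected features
features <- c("ProductType.Accessories", "ProductType.GameConsole",
              "x4StarReviews", "x2StarReviews",
              "x1StarReviews", "PositiveServiceReview", "Volume")
existing.selected <- existing.transformed[, features]
names(existing.selected)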
## [1] "ProductType.Accessories" "ProductType.GameConsole"
## [3] "x4StarReviews" "x2StarReviews"
## [5] "x1StarReviews" "PositiveServiceReview"
## [7] "Volume"
The dataset was split into a training set (80%) and a test set (20%):
#creating train and test data (80/20%)
set.seed(532)
trainSize <- round(nrow(existing.transformed) * 0.8)
testSize <- nrow(existing.transformed) - trainSize
training_indices <- sample(seq_len(nrow(existing.transformed)), size = trainSize)
trainSet <- existing.transformed[training_indices, ]
testSet <- existing.transformed[-training_indices, ]
First, a linear regression model was trained with PositiveServiceReview as the independent variable and Volume as the dependent variable. Cross-validation was performed using method = "repeatedcv" with number = 10 and repeats = 1.
#10 fold cross validation
set.seed(532)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#train model: LM
mod_lm <- train(Volume ~ PositiveServiceReview,
                data = trainSet,
                method = "lm",
                trControl = fitControl)
mod_lm
## Linear Regression
##
## 52 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 48, 46, 48, 47, 46, 47, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 189.7157 0.9148687 124.2393
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(mod_lm)
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -559.18 -70.08 -40.49 -10.61 738.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.434 34.266 0.567 0.573
## PositiveServiceReview 32.470 2.356 13.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 201.7 on 50 degrees of freedom
## Multiple R-squared: 0.7916, Adjusted R-squared: 0.7875
## F-statistic: 190 on 1 and 50 DF, p-value: < 2.2e-16
Attributes were then added to the model stepwise, one at a time, and kept only if the adjusted R-squared improved (one such step is sketched below). This resulted in a model with 3 predictors: PositiveServiceReview, x4StarReviews and ProductType.GameConsole.
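A minimal sketch of a single forward-selection step, assuming plain lm() fits for the adjusted R-squared comparison (the object names mod_a and mod_b are hypothetical):
#hedged sketch: compare adjusted R-squared before and after adding a predictor
mod_a <- lm(Volume ~ PositiveServiceReview, data = trainSet)
mod_b <- lm(Volume ~ PositiveServiceReview + x4StarReviews, data = trainSet)
summary(mod_a)$adj.r.squared
summary(mod_b)$adj.r.squared
#x4StarReviews is kept only if the adjusted R-squared increases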
#10 fold cross validation
set.seed(532)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
mod_lm5 <- train(Volume ~ PositiveServiceReview + x4StarReviews + ProductType.GameConsole,
                 data = trainSet,
                 method = "lm",
                 trControl = fitControl)
mod_lm5
## Linear Regression
##
## 52 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 48, 46, 48, 47, 46, 47, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 183.0133 0.9283009 113.6324
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(mod_lm5)
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -492.60 -55.73 -28.45 -1.35 806.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.835 31.072 0.413 0.681379
## PositiveServiceReview 27.617 3.369 8.198 1.11e-10 ***
## x4StarReviews 1.616 1.614 1.002 0.321514
## ProductType.GameConsole 729.655 187.935 3.882 0.000315 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 173 on 48 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8435
## F-statistic: 92.65 on 3 and 48 DF, p-value: < 2.2e-16
Since the errors of the linear model are not normally distributed, I decided to train non-parametric models instead of a linear regression model.
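A minimal sketch of how this residual check could be done on the final lm fit (the shapiro.test() call is an assumption; the original does not show this step):
#hedged sketch: check normality of the residuals
res <- residuals(mod_lm5$finalModel)
qqnorm(res)
qqline(res)
shapiro.test(res)   #a small p-value suggests non-normal errors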
Creation of the train set and test set:
set.seed(999)
#split to train and test data 80/20
inTraining <- createDataPartition(existing.transformed$Volume, p = 0.8, list = FALSE)
trainset2 <- existing.transformed[inTraining,]
testset2 <- existing.transformed[-inTraining,]
After splitting the data into a train set and a test set, the Random Forest model was trained with the same 3 predictors for the output variable Volume. Centering and scaling were applied via preProcess.
set.seed(999)
#10 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
mod_rf <- train(Volume ~ PositiveServiceReview + x4StarReviews + ProductType.GameConsole,
                data = trainset2,
                method = "rf",
                trControl = fitControl,
                preProcess = c("center", "scale"),
                tuneLength = 20)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
mod_rf
## Random Forest
##
## 53 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 48, 48, 49, 47, 48, 48, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 162.3054 0.8985241 92.55416
## 3 160.3005 0.8881183 93.04651
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
postResample() was used to evaluate the performance of the model on the test set:
#Results on testset
test_results_rf <- predict(object = mod_rf,
                           newdata = testset2)
#argument order is postResample(pred, obs)
postResample(test_results_rf, testset2$Volume)
## RMSE Rsquared MAE
## 245.5874723 0.9671046 128.3724468
The difference between the RMSE on the train and test results (160 vs. 245, respectively) indicates that the model is somewhat overfitting.
After that, another non-parametric model, k-NN, was trained (caret's "knn" method).
set.seed(999)
#10 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#NB: caret's argument is spelled tuneLength; the misspelled "Tunelength" below (here and in the later k-NN fits) is ignored, so the default grid (k = 5, 7, 9) was used
mod_kNN <- train(Volume ~ PositiveServiceReview + x4StarReviews + ProductType.GameConsole,
                 data = trainset2,
                 method = "knn",
                 trControl = fitControl,
                 preProcess = c("center", "scale"),
                 Tunelength = 20)
mod_kNN
## k-Nearest Neighbors
##
## 53 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 48, 48, 49, 47, 48, 48, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 157.0234 0.9109473 88.46938
## 7 169.3998 0.8965086 97.49935
## 9 179.6905 0.8787475 104.34117
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Results on the testset:
#Results on testset
test_results_kNN <- predict(object = mod_kNN,
                            newdata = testset2)
postResample(test_results_kNN, testset2$Volume)
## RMSE Rsquared MAE
## 318.2856579 0.8952037 151.5333333
The results indicate that the k-NN model is overfitting the data.
The next step was tuning the parameters of both models to reduce overfitting.
First, the number of folds was reduced to 5.
#reduce number of folds to 5
set.seed(999)
#5 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
mod_rf2 <- train(Volume ~ PositiveServiceReview + x4StarReviews + ProductType.GameConsole,
                 data = trainset2,
                 method = "rf",
                 trControl = fitControl,
                 preProcess = c("center", "scale"),
                 tuneLength = 20)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
mod_rf2
## Random Forest
##
## 53 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 43, 41, 43, 43, 42
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 175.8397 0.8772036 93.59335
## 3 175.6154 0.8835685 94.94684
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
#Results on testset
test_results_rf2 <- predict(object = mod_rf2,
                            newdata = testset2)
postResample(test_results_rf2, testset2$Volume)
## RMSE Rsquared MAE
## 242.2859652 0.9667954 125.7231904
The difference in RMSE still indicated overfitting, so in the next step the attribute x4StarReviews was removed:
#remove 1 attribute (x4StarReviews)
set.seed(999)
#5 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
mod_rf3 <- train(Volume ~ PositiveServiceReview + ProductType.GameConsole,
                 data = trainset2,
                 method = "rf",
                 trControl = fitControl,
                 preProcess = c("center", "scale"),
                 tuneLength = 20)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
mod_rf3
## Random Forest
##
## 53 samples
## 2 predictor
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 43, 41, 43, 43, 42
## Resampling results:
##
## RMSE Rsquared MAE
## 178.9265 0.8707863 104.4397
##
## Tuning parameter 'mtry' was held constant at a value of 2
#Results on testset
test_results_rf3 <- predict(object = mod_rf3,
                            newdata = testset2)
postResample(test_results_rf3, testset2$Volume)
## RMSE Rsquared MAE
## 225.6951958 0.9676274 132.5573636
The number of folds was then reduced further to 3.
#reduce number of folds to 3
set.seed(999)
#3 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 1, verboseIter = FALSE)
mod_rf4 <- train(Volume ~ PositiveServiceReview + ProductType.GameConsole,
                 data = trainset2,
                 method = "rf",
                 trControl = fitControl,
                 preProcess = c("center", "scale"),
                 tuneLength = 20)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
mod_rf4
## Random Forest
##
## 53 samples
## 2 predictor
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (3 fold, repeated 1 times)
## Summary of sample sizes: 36, 35, 35
## Resampling results:
##
## RMSE Rsquared MAE
## 181.6469 0.8403215 111.3117
##
## Tuning parameter 'mtry' was held constant at a value of 2
#Results on testset
test_results_rf4 <- predict(object = mod_rf4,
                            newdata = testset2)
postResample(test_results_rf4, testset2$Volume)
## RMSE Rsquared MAE
## 228.8082080 0.9689043 134.3636493
The difference in RMSE (182 on the train set vs. 229 on the test set) was considered acceptable.
The parameters of the k-NN model were tuned in the same way to reduce overfitting. First, the number of folds was reduced to 5.
#reduce number of folds to 5
set.seed(999)
#5 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
mod_kNN2 <- train(Volume ~ PositiveServiceReview + x4StarReviews + ProductType.GameConsole,
                  data = trainset2,
                  method = "knn",
                  trControl = fitControl,
                  preProcess = c("center", "scale"),
                  Tunelength = 20)
mod_kNN2
## k-Nearest Neighbors
##
## 53 samples
## 3 predictor
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 43, 41, 43, 43, 42
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 167.5157 0.8945650 88.18387
## 7 194.6543 0.8518548 98.19663
## 9 212.0035 0.8179905 105.98942
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
#Results on testset
test_results_kNN2 <- predict(object = mod_kNN2,
                             newdata = testset2)
postResample(test_results_kNN2, testset2$Volume)
## RMSE Rsquared MAE
## 318.2856579 0.8952037 151.5333333
Reducing the number of folds did not significantly reduce overfitting. In the next step the attribute x4StarReviews was removed.
#remove 1 attribute (x4StarReviews)
set.seed(999)
#5 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
mod_kNN3 <- train(Volume ~ PositiveServiceReview + ProductType.GameConsole,
                  data = trainset2,
                  method = "knn",
                  trControl = fitControl,
                  preProcess = c("center", "scale"),
                  Tunelength = 20)
mod_kNN3
## k-Nearest Neighbors
##
## 53 samples
## 2 predictor
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 43, 41, 43, 43, 42
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 172.5964 0.9157711 100.0659
## 7 190.2809 0.9193296 113.7852
## 9 215.0542 0.9096555 120.5562
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
#Results on testset
test_results_kNN3 <- predict(object = mod_kNN3,
                             newdata = testset2)
postResample(test_results_kNN3, testset2$Volume)
## RMSE Rsquared MAE
## 279.1377010 0.9045463 148.6719308
Tuning other parameters, including tuneLength and the number of folds, did not lead to further reduction of overfitting.
Both models were applied to the test data using the predict() function. The difference between predicted and observed values was calculated, and a new dataframe was created with columns for the predicted values, the observed values and their difference.
#Random Forest
Predict_rf <- predict(mod_rf4, newdata = testset2)
#predicted values
Predict_rf
## 2 18 19 22 31 39 42
## 15.91488 145.03963 109.94639 1132.49107 172.60415 1378.70987 292.75786
## 45 54 60 61 65
## 145.03963 64.10508 17.79085 337.56770 1115.79293
#observed values
testset2$Volume
## [1] 12 80 136 1224 88 1896 360 84 32 12 248 1684
#Dataframe with an extra column (difference between the predicted and observed values)
df_rf <- data.frame(Predicted = Predict_rf, Observed = testset2$Volume)
df_rf$Difference <- df_rf$Predicted - df_rf$Observed
difference_rf <- df_rf$Difference
df_rf
## Predicted Observed Difference
## 2 15.91488 12 3.914883
## 18 145.03963 80 65.039627
## 19 109.94639 136 -26.053609
## 22 1132.49107 1224 -91.508933
## 31 172.60415 88 84.604146
## 39 1378.70987 1896 -517.290133
## 42 292.75786 360 -67.242141
## 45 145.03963 84 61.039627
## 54 64.10508 32 32.105076
## 60 17.79085 12 5.790854
## 61 337.56770 248 89.567695
## 65 1115.79293 1684 -568.207067
summary(difference_rf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -568.207 -73.309 4.853 -77.353 62.040 89.568
The median difference between predicted and observed values was 4.853; the mean (-77.35) is pulled down by the two large underpredictions.
#k-NN
Predict_kNN <- predict(mod_kNN3, newdata = testset2)
#predicted values
Predict_kNN
## [1] 16.00000 111.38462 110.00000 1294.40000 184.72727 1294.40000
## [7] 294.66667 111.38462 85.66667 18.50000 310.66667 945.60000
#observed values
testset2$Volume
## [1] 12 80 136 1224 88 1896 360 84 32 12 248 1684
#Dataframe with the difference between the predicted and observed values
df_kNN <- data.frame(Predicted = Predict_kNN, Observed = testset2$Volume)
df_kNN$Difference <- df_kNN$Predicted - df_kNN$Observed
difference_kNN <- df_kNN$Difference
df_kNN
## Predicted Observed Difference
## 1 16.00000 12 4.00000
## 2 111.38462 80 31.38462
## 3 110.00000 136 -26.00000
## 4 1294.40000 1224 70.40000
## 5 184.72727 88 96.72727
## 6 1294.40000 1896 -601.60000
## 7 294.66667 360 -65.33333
## 8 111.38462 84 27.38462
## 9 85.66667 32 53.66667
## 10 18.50000 12 6.50000
## 11 310.66667 248 62.66667
## 12 945.60000 1684 -738.40000
summary(difference_kNN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -738.40 -35.83 16.94 -89.88 55.92 96.73
The median difference between predicted and observed values was 16.94, with a mean of -89.88.
Evaluating both models, Random Forest was considered the best model, given its lower RMSE and acceptable performance metrics on both the training and test sets. The lower median difference between predicted and observed values on the test set shows that Random Forest predicts new data more accurately than k-NN.
As the final step, the newproductattributes2017.csv file was imported. The new products dataset was preprocessed for prediction in the same way as the existing products dataset: the ProductType data were dummified using the dummyVars() function, and the attributes that had been removed from the existing products dataset were removed here as well.
#loading
newproductattributes2017 <- read.csv("C:/Users/Y.S. Kim/Desktop/Ubiqum/R/Task 3 Multiple Regression/newproductattributes2017.csv")
#renaming
new <- newproductattributes2017
#preprocessing/feature selection: remove the attributes dropped earlier
new[2:4] <- NULL    #ProductNum, Price, x5StarReviews
new[3] <- NULL      #x3StarReviews
new[6:13] <- NULL   #NegativeServiceReview through ProfitMargin (incl. BestSellersRank)
# dummify the data in ProductType and remove the unused dummy columns
new.dummified <- dummyVars(" ~ .", data = new)
new.transformed <- data.frame(predict(new.dummified, newdata = new))
new.transformed[2:3] <- NULL    #ProductType.Display, ProductType.ExtendedWarranty
new.transformed[3:10] <- NULL   #remaining ProductType dummies except Accessories and GameConsole
The selected model Random Forest (mod_rf4) was used to predict Volume in this file.
#predict on newproductattributes using Random Forest
finalPred <- predict(mod_rf4, newdata = new.transformed)
finalPred
## 1 2 3 4 5 6 7
## 488.15447 292.75786 483.13180 48.27178 17.79085 48.27178 1084.82960
## 8 9 10 11 12 13 14
## 109.94639 15.91488 1084.82960 1378.70987 342.32043 624.45480 145.03963
## 15 16 17 18 19 20 21
## 109.94639 1150.88027 15.91488 48.27178 48.27178 145.03963 109.94639
## 22 23 24
## 15.91488 17.79085 1378.70987
#Add predictions to the new products data set
output <- newproductattributes2017
output$predictions <- finalPred
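The ten products with the highest predicted volumes are shown below. A hedged dplyr reconstruction of this view (the column selection is an assumption):
#hedged sketch: top 10 products by predicted volume
library(dplyr)
output %>%
  select(ProductType, ProductNum, PositiveServiceReview, predictions) %>%
  arrange(desc(predictions)) %>%
  head(10)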
## ProductType ProductNum PositiveServiceReview predictions
## 1 Tablet 187 90 1378.7099
## 2 GameConsole 307 59 1378.7099
## 3 GameConsole 199 32 1150.8803
## 4 Netbook 180 28 1084.8296
## 5 Tablet 186 28 1084.8296
## 6 Smartphone 194 14 624.4548
## 7 PC 171 12 488.1545
## 8 Laptop 173 11 483.1318
## 9 Smartphone 193 8 342.3204
## 10 PC 172 7 292.7579
The task was to make sales predictions for the 4 target product types (PC, Laptop, Netbook and Smartphone); the filter sketched below produces the table that follows.
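A hedged reconstruction of the filter, reusing the dplyr pipeline above:
#hedged sketch: keep only the four target product types
output %>%
  filter(ProductType %in% c("PC", "Laptop", "Netbook", "Smartphone")) %>%
  select(ProductType, ProductNum, PositiveServiceReview, predictions)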
## ProductType ProductNum PositiveServiceReview predictions
## 1 PC 171 12 488.15447
## 2 PC 172 7 292.75786
## 3 Laptop 173 11 483.13180
## 4 Laptop 175 2 48.27178
## 5 Laptop 176 0 17.79085
## 6 Netbook 178 2 48.27178
## 7 Netbook 180 28 1084.82960
## 8 Netbook 181 5 109.94639
## 9 Netbook 183 1 15.91488
## 10 Smartphone 193 8 342.32043
## 11 Smartphone 194 14 624.45480
## 12 Smartphone 195 4 145.03963
## 13 Smartphone 196 5 109.94639
Mean sales prediction, grouped by ProductType:
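A hedged reconstruction of the grouped mean (dplyr assumed; the column name mean matches the tibble below):
#hedged sketch: mean predicted volume per target product type
output %>%
  filter(ProductType %in% c("PC", "Laptop", "Netbook", "Smartphone")) %>%
  group_by(ProductType) %>%
  summarise(mean = mean(predictions)) %>%
  arrange(desc(mean))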
## # A tibble: 4 x 2
## ProductType mean
## <fct> <dbl>
## 1 PC 390.
## 2 Netbook 315.
## 3 Smartphone 305.
## 4 Laptop 183.
After predicting Volume with the Random Forest model, the k-NN model was applied to the new products dataset to check for general agreement between the two models, which would lend extra confidence to the predictions.
#predict on newproductattributes using k-NN
finalPred2 <- predict(mod_kNN3, newdata = new.transformed)
finalPred2
## [1] 570.4000 294.6667 525.3333 48.0000 18.5000 48.0000 876.0000
## [8] 110.0000 16.0000 876.0000 1294.4000 310.6667 576.0000 111.3846
## [15] 110.0000 1044.0000 16.0000 48.0000 48.0000 111.3846 110.0000
## [22] 16.0000 18.5000 1294.4000
#Add predictions to the new products data set
output3 <- newproductattributes2017
output3$predictions <- finalPred2
## ProductType ProductNum PositiveServiceReview predictions
## 1 Tablet 187 90 1294.4000
## 2 GameConsole 307 59 1294.4000
## 3 GameConsole 199 32 1044.0000
## 4 Netbook 180 28 876.0000
## 5 Tablet 186 28 876.0000
## 6 Smartphone 194 14 576.0000
## 7 PC 171 12 570.4000
## 8 Laptop 173 11 525.3333
## 9 Smartphone 193 8 310.6667
## 10 PC 172 7 294.6667
Sales predictions for the 4 target product types (PC, Laptop, Netbook and Smartphone), using the same filter as above:
## ProductType ProductNum PositiveServiceReview predictions
## 1 PC 171 12 570.4000
## 2 PC 172 7 294.6667
## 3 Laptop 173 11 525.3333
## 4 Laptop 175 2 48.0000
## 5 Laptop 176 0 18.5000
## 6 Netbook 178 2 48.0000
## 7 Netbook 180 28 876.0000
## 8 Netbook 181 5 110.0000
## 9 Netbook 183 1 16.0000
## 10 Smartphone 193 8 310.6667
## 11 Smartphone 194 14 576.0000
## 12 Smartphone 195 4 111.3846
## 13 Smartphone 196 5 110.0000
Mean sales prediction, grouped by ProductType (same grouping as above, applied to output3):
## # A tibble: 4 x 2
## ProductType mean
## <fct> <dbl>
## 1 PC 433.
## 2 Smartphone 277.
## 3 Netbook 262.
## 4 Laptop 197.
Using both models (Random Forest and k-NN), a general agreement could be seen: there was a clear overlap among the top products sorted by predicted volume. PC had the highest mean predicted Volume under both models, followed by Netbook and Smartphone (whose order is swapped between the two models) and Laptop. Descriptive statistics showed comparable values for PositiveServiceReview and x4StarReviews in the existing and new datasets, justifying generalization of the model to the new data.