In this work we have to predict the sales volume of four different product types while assessing the effect that service and customer reviews have on sales.
We'll use regression to build machine learning models for this analysis. Once we have determined which algorithm works best on the provided data set, we will predict the sales of the four product types in the new products list.
To sum up:
What are we trying to predict?
We need to predict the sales volume for the new products list.
What type of problem is it? Classification or Regression? Binary or Multi-class? Uni-variate or Multi-variate?
It is a multiple regression problem with multiple features.
What type of data do we have?
We have two files: one used to build the predictive models (existingproducts.csv) and one containing the products whose sales volume will be predicted with the chosen model (newproducts).
First, we need to load the packages used during the project.
library(dplyr)
library(caret)
library(readr)
library(dummies)
library(corrplot)
library(normalr)
library(randomForest) # Random forest in regression problems
library(rknn) # k-NN in regression problems
library(kknn) # k-NN in regression with train() function
library(DMwR) # regr.eval function
library(kernlab) # svm radial kernel
library(gbm)
library(plotly)
library(gridExtra) # grid.arrange function, used for the side-by-side plots
The next step is to load the data we are going to work with.
With the functions glimpse() and summary() we can do an initial exploration of the data to better understand what type of data we are handling.
## Observations: 80
## Variables: 18
## $ ProductType <chr> "PC", "PC", "PC", "Laptop", "Laptop", "Access...
## $ ProductNum <dbl> 101, 102, 103, 104, 105, 106, 107, 108, 109, ...
## $ Price <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99, 114...
## $ x5StarReviews <dbl> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, ...
## $ x4StarReviews <dbl> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8, 6...
## $ x3StarReviews <dbl> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13,...
## $ x2StarReviews <dbl> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, 7, ...
## $ x1StarReviews <dbl> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, 16,...
## $ PositiveServiceReview <dbl> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 5...
## $ NegativeServiceReview <dbl> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3, 3,...
## $ Recommendproduct <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, ...
## $ BestSellersRank <dbl> 1967, 4806, 12076, 109, 268, 64, NA, 2, NA, 1...
## $ ShippingWeight <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7.30, ...
## $ ProductDepth <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80, 6.70...
## $ ProductWidth <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10.30, 6...
## $ ProductHeight <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 11.50,...
## $ ProfitMargin <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05, 0.0...
## $ Volume <dbl> 12, 8, 12, 196, 232, 332, 44, 132, 64, 40, 84...
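The loading step itself is not shown above. A minimal sketch of loading and a first look, using base R and a tiny stand-in CSV written to a temporary file so that the snippet runs on its own. The three rows copy the first three products from the glimpse() output; in the real project the path would presumably point to existingproducts.csv, and the object name existing_products_demo is only illustrative.

```r
# Stand-in for existingproducts.csv: three rows copied from the glimpse() output,
# written to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ProductType,ProductNum,Price,Volume",
  "PC,101,949.00,12",
  "PC,102,2249.99,8",
  "PC,103,399.00,12"
), csv_path)

# Read the file; stringsAsFactors = FALSE keeps ProductType as character ("chr").
existing_products_demo <- read.csv(csv_path, stringsAsFactors = FALSE)

str(existing_products_demo)      # structure: one line per column, similar to glimpse()
summary(existing_products_demo)  # per-column summary statistics
```

readr's read_csv() would return a tibble instead of a plain data frame, which is what the tibble printouts later in the report suggest was used.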
Now we will check for outliers.
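The outlier check is not shown, so the following is only a sketch of one common approach, the 1.5 * IQR boxplot rule, applied to a toy Volume vector. The two extreme values are made up, and naming the result outliers is an assumption that matches the object used later during pre-processing.

```r
# Toy Volume vector: the two extreme values are made up for illustration.
volume <- c(12, 8, 12, 196, 232, 332, 44, 132, 64, 40, 84, 7000, 11000)

# boxplot.stats() flags points lying more than 1.5 * IQR beyond the hinges.
outliers <- boxplot.stats(volume)$out
outliers

# boxplot(volume) would draw the same points individually beyond the whiskers.
```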
To finish the initial exploration we will check for missing values and, if we find any, decide how to treat them.
## [1] 15
There are missing values, and from the glimpse() output above we can see that they all belong to the same attribute (BestSellersRank). In the next part, pre-processing, we will treat the outliers and the missing values.
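A count like the one above can be obtained with is.na(). A small sketch on a toy data frame whose values are copied from the first nine rows of the glimpse() output; the per-column count shows which attribute holds the missing values.

```r
# Toy data frame: the first nine BestSellersRank and Volume values
# from the glimpse() output above (two ranks are NA).
toy <- data.frame(
  BestSellersRank = c(1967, 4806, 12076, 109, 268, 64, NA, 2, NA),
  Volume          = c(12, 8, 12, 196, 232, 332, 44, 132, 64)
)

sum(is.na(toy))      # total number of missing cells
colSums(is.na(toy))  # missing cells per column, shows which attribute is affected
```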
Most data sets contain a mixture of numeric and nominal data, so we need to understand how to incorporate both when developing regression models and making predictions.
Categorical variables may be used directly as predictor or predicted variables in a multiple regression model as long as they've been converted to binary values. To pre-process the sales data as needed, we first convert all factor or 'chr' columns to binary features that contain '0' and '1' values. Fortunately, caret has a method for creating these 'dummy variables', as follows:
# dummify the data
existing_products_2 <- dummyVars(" ~ .", data = existing_products)
existing_products_3 <- data.frame(predict(existing_products_2, newdata = existing_products))
Now it is time to remove the outliers.
# The outliers are an Accessory and a Game Console; none of them is a PC, Laptop, Netbook or Smartphone (our target product types), so we can remove them.
existing_products_3 <- existing_products_3[!(existing_products_3$Volume %in% outliers), ]
Once the data is dummified and no categorical data remains, we can treat the missing values. In this case, all missing values belong to the same attribute, so we have decided to delete this attribute.
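Both clean-up steps, dropping the outlier rows and deleting the attribute that holds the missing values, can be sketched on toy data as follows. All values are hypothetical, and the outliers vector is assumed to come from an earlier outlier check.

```r
# Toy data; the values are hypothetical.
toy <- data.frame(
  Volume          = c(12, 196, 7000, 44, 11000),
  BestSellersRank = c(1967, 109, NA, NA, 2),
  Price           = c(949, 410, 130, 20, 300)
)
outliers <- c(7000, 11000)  # assumed to come from an earlier outlier check

# Logical indexing keeps every row when nothing matches, whereas
# toy[-which(cond), ] would return zero rows in that case.
toy_clean <- toy[!(toy$Volume %in% outliers), ]

# Drop the attribute that holds the missing values, then verify none remain.
toy_clean$BestSellersRank <- NULL
anyNA(toy_clean)
```

Deleting the column keeps all remaining rows, which matters with only 80 observations; removing the incomplete rows with na.omit() would shrink the data set instead.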
After dummifying the data and dealing with outliers and missing values, we have to check the correlation among the features and between each feature and the dependent variable.
We will use the cor() function to create a correlation matrix and visualize the correlation between the features.
corrData <- cor(existing_products_3)
corrplot(corrData) # There are so many variables that it is difficult to distinguish among them, but the values can be checked by printing the corrData object.
From corrData we can observe that the following variable is very highly correlated with the dependent variable, which can lead to overfitting:
x5StarReviews
Now we have to check whether there is high correlation among the features; keeping such features can add too much noise.
x4StarReviews and x3StarReviews have a correlation of 0.937. To decide which one to remove, we check the correlation of each feature with the dependent variable.
x4StarReviews has the greater correlation with Volume, so x3StarReviews is going to be removed.
The same happens with x2StarReviews and x1StarReviews, where x1StarReviews is going to be removed.
We have checked and removed the features that are most highly correlated with the dependent variable and with each other. Now we have to treat the features that have a low correlation with the dependent variable, such as profit margin and the physical attributes, which have a correlation below 0.2.
The variables that have low correlation with the dependent variable are the ones listed in drop_variables in the code below.
In the case of the dummified data we will not remove the ProductTypeXXX columns because we don't want to bias the model.
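The selection rules described above can also be applied programmatically rather than by reading the matrix. A minimal sketch on simulated data: the feature names, the 0.9 collinearity threshold and the simulated relationships are assumptions for illustration, while the 0.2 threshold is the one quoted in the text.

```r
set.seed(123)
n <- 60
# Hypothetical features: b is nearly a copy of a, c_noise is unrelated noise.
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.1)
c_noise <- rnorm(n)
y <- 3 * a + rnorm(n, sd = 0.5)      # target depends on a
toy <- data.frame(a, b, c_noise, y)

corr <- cor(toy)
features <- setdiff(colnames(corr), "y")

# 1) Pairs of features whose absolute correlation exceeds a threshold.
feat_corr <- corr[features, features]
high <- which(abs(feat_corr) > 0.9 & upper.tri(feat_corr), arr.ind = TRUE)

# 2) In each pair, drop the feature less correlated with the target.
to_drop <- character(0)
for (i in seq_len(nrow(high))) {
  pair <- features[high[i, ]]
  to_drop <- c(to_drop, pair[which.min(abs(corr[pair, "y"]))])
}

# 3) Features with weak correlation to the target (threshold 0.2, as in the text).
weak <- features[abs(corr[features, "y"]) < 0.2]

unique(to_drop)
weak
```

caret also provides findCorrelation() for the pairwise step, although it decides which column to drop from the mean absolute correlation with the other features rather than from the correlation with the target.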
# Variables almost perfectly correlated with the dependent variable are always deleted.
existing_products_3$x5StarReviews <- NULL
# Features that are highly correlated with each other are also removed; reducing noise helps with any algorithm.
existing_products_3$x1StarReviews <- NULL
existing_products_3$x3StarReviews <- NULL
existing_products_3$ProductNum <- NULL
# We create Existing_Products by removing all the variables that the correlation matrix tells us to delete.
drop_variables <- c("ProductDepth", "ProductHeight", "ProductWidth", "ProfitMargin", "ShippingWeight")
Existing_Products <- existing_products_3[ , !(names(existing_products_3) %in% drop_variables)]
Once the data is pre-processed, it's time to create the training and testing sets for the predictive models.
First, we split the data into a training set and a testing set with the createDataPartition() function, which performs a stratified random split of the data.
set.seed(123) # The seed is a number you choose as the starting point for generating a sequence of random numbers.
inTrain <- createDataPartition(y = Existing_Products$Volume, p=.75, list = FALSE)
# To partition the data:
training <- Existing_Products[ inTrain,]
testing <- Existing_Products[-inTrain,]
We will run four different methods and then compare them. The best method will be used to predict Volume for our new products list.
These algorithms can be compared fairly only if they all use the same resampling method and the same number of repetitions. The trainControl() function can be used to set the resampling method.
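To illustrate why a shared resampling scheme matters, here is a base R sketch in which two simple models are scored on exactly the same 10 folds. The data are simulated and lm() merely stands in for the algorithms used in this report; the trainControl() call mentioned in the final comment is an assumption, since the definition of fitControl is not shown.

```r
set.seed(123)
# Hypothetical data: y depends linearly on x.
toy <- data.frame(x = runif(50, 0, 10))
toy$y <- 2 * toy$x + rnorm(50)

# Assign every row to one of 10 folds once, and reuse the folds for all models.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

cv_rmse <- function(formula) {
  sapply(seq_len(k), function(f) {
    fit <- lm(formula, data = toy[fold_id != f, ])
    held_out <- toy[fold_id == f, ]
    rmse(held_out$y, predict(fit, newdata = held_out))
  })
}

# Two candidate models scored on identical folds, so the comparison is paired.
rmse_linear   <- cv_rmse(y ~ x)
rmse_baseline <- cv_rmse(y ~ 1)   # intercept-only baseline
c(linear = mean(rmse_linear), baseline = mean(rmse_baseline))

# In caret, a comparable setup might be trainControl(method = "cv", number = 10)
# passed as trControl to every train() call; the original fitControl definition
# is not shown, so that exact call is an assumption.
```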
The first method is the Support Vector Machine algorithm. To run any algorithm, the train() function is used. In this case, the method is called 'svmRadial', one among many.
set.seed(123) # set.seed() must be called before every function that generates something randomly.
svmGrid <- expand.grid(sigma=0.02541634, C=8)
svmfit <- train(Volume ~ ., data = training, method = "svmRadial", trControl=fitControl, tuneGrid=svmGrid, preProc = c("center", "scale")) # This model can be tuned by first training with different tuneLength values and then fixing the best parameters with tuneGrid.
Once the model is trained, it is used to predict on the testing set.
svm_pred <- predict(svmfit, newdata = testing)
actuals_preds_svm <- data.frame(cbind(actuals=testing$Volume, predicteds=svm_pred)) # make actuals_predicteds dataframe.
actuals_preds_svm
## actuals predicteds
## 1 232 356.60341
## 2 132 272.62175
## 3 300 177.75650
## 4 60 126.18329
## 5 1576 1230.77253
## 6 2052 648.13852
## 7 32 -25.62101
## 8 20 119.63613
## 9 1232 1174.19628
## 10 88 239.77103
## 11 1536 540.28954
## 12 836 1596.56955
## 13 904 583.01019
## 14 232 326.43896
## 15 8 -30.56946
## 16 80 124.19303
## 17 0 33.32721
## 18 16 -53.75056
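The table above can be summarized with the usual regression metrics. A short sketch using the values copied from that output; Rsquared is computed here as the squared correlation between actual and predicted values, which is one common definition and not necessarily the one used elsewhere in this report.

```r
# Values copied from the actuals/predicteds table above.
actuals <- c(232, 132, 300, 60, 1576, 2052, 32, 20, 1232,
             88, 1536, 836, 904, 232, 8, 80, 0, 16)
predicteds <- c(356.60341, 272.62175, 177.75650, 126.18329, 1230.77253,
                648.13852, -25.62101, 119.63613, 1174.19628, 239.77103,
                540.28954, 1596.56955, 583.01019, 326.43896, -30.56946,
                124.19303, 33.32721, -53.75056)

errors <- actuals - predicteds
mae  <- mean(abs(errors))            # mean absolute error
rmse <- sqrt(mean(errors^2))         # root mean squared error
rsq  <- cor(actuals, predicteds)^2   # squared correlation between actual and predicted

c(MAE = mae, RMSE = rmse, Rsquared = rsq)

# Sales volume cannot be negative, so negative predictions could be clipped at zero.
predicteds_clipped <- pmax(predicteds, 0)
sum(predicteds < 0)   # number of negative predictions
```

The negative predictions are a reminder that a radial SVM has no built-in constraint keeping the predicted volume at or above zero.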
The next method is Random Forest. In this case, the caret method is called 'ranger'.
set.seed(123)
rfGrid <- expand.grid(min.node.size=5, mtry=18, splitrule="variance")
rffit <- train(Volume ~., data = training, method = "ranger", trControl=fitControl, tuneGrid=rfGrid, preProc = c("center", "scale"))
Once the model is trained, it is used to predict on the testing set.
The third method is k-Nearest Neighbours. In this case, the caret method is called 'kknn'.
set.seed(123)
knnGrid <- expand.grid(distance=2, kernel="optimal", kmax=7)
knnfit <- train(Volume ~ ., data = training,method = "kknn", trControl=fitControl, tuneGrid=knnGrid, preProc = c("center", "scale"))
Once the model is trained, it is used to predict on the testing set.
The last method is the Gradient Boosted Tree algorithm. In this case, the caret method is called 'gbm'.
set.seed(123)
gbtGrid <- expand.grid(shrinkage=0.1, n.trees = 100, n.minobsinnode=10, interaction.depth =4)
gbtfit <- train(Volume ~ ., data = training, method = "gbm", trControl=fitControl, tuneGrid=gbtGrid, preProc = c("center", "scale"))
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 303746.3704 nan 0.1000 47364.5116
## 2 260991.7002 nan 0.1000 40660.7765
## 3 228188.0744 nan 0.1000 31542.3772
## 4 201687.0054 nan 0.1000 23702.3657
## 5 176467.0886 nan 0.1000 21064.6207
## 6 162489.4085 nan 0.1000 16992.8083
## 7 148508.5083 nan 0.1000 13369.1008
## 8 126896.9738 nan 0.1000 16100.6165
## 9 114511.0398 nan 0.1000 7738.4310
## 10 106768.3731 nan 0.1000 7161.4203
## 20 71298.5542 nan 0.1000 1300.9095
## 40 50777.4050 nan 0.1000 -254.5416
## 60 45668.5482 nan 0.1000 -617.5422
## 80 42421.8598 nan 0.1000 -698.2938
## 100 40242.1494 nan 0.1000 -81.6501
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 279000.0293 nan 0.1000 42606.3028
## 2 234876.7483 nan 0.1000 32479.0555
## 3 206898.8307 nan 0.1000 29253.0048
## 4 179537.2676 nan 0.1000 22677.3041
## 5 162157.8396 nan 0.1000 17296.0541
## 6 150346.5125 nan 0.1000 12112.2650
## 7 136244.1924 nan 0.1000 12770.1771
## 8 127507.6999 nan 0.1000 9652.6774
## 9 117391.6930 nan 0.1000 9992.3059
## 10 107396.0696 nan 0.1000 9752.8844
## 20 69262.4632 nan 0.1000 409.4036
## 40 54199.0372 nan 0.1000 -549.3564
## 60 49271.8126 nan 0.1000 -311.7903
## 80 45506.9843 nan 0.1000 -998.6933
## 100 40748.6882 nan 0.1000 -602.2334
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 312193.3912 nan 0.1000 34195.1299
## 2 276087.1119 nan 0.1000 40318.1071
## 3 239211.5124 nan 0.1000 30065.8308
## 4 210351.5949 nan 0.1000 30054.4391
## 5 190591.0971 nan 0.1000 21728.5235
## 6 165952.8206 nan 0.1000 21829.7154
## 7 152798.8857 nan 0.1000 15612.3851
## 8 146771.8333 nan 0.1000 4026.9346
## 9 133027.0861 nan 0.1000 13515.8725
## 10 124729.1518 nan 0.1000 9822.2569
## 20 78499.3704 nan 0.1000 2012.5070
## 40 53864.4807 nan 0.1000 -183.5246
## 60 50516.5432 nan 0.1000 -937.1841
## 80 48112.6756 nan 0.1000 -944.7866
## 100 45875.0999 nan 0.1000 -1476.9462
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 291577.6355 nan 0.1000 43965.3134
## 2 255277.6244 nan 0.1000 32708.0358
## 3 226661.5523 nan 0.1000 31752.3964
## 4 203771.8177 nan 0.1000 25356.7952
## 5 182793.3788 nan 0.1000 22963.0732
## 6 166199.2215 nan 0.1000 17863.4436
## 7 149779.0236 nan 0.1000 15774.4652
## 8 143744.0590 nan 0.1000 1621.0729
## 9 127592.8051 nan 0.1000 11418.7959
## 10 110911.8719 nan 0.1000 12687.3787
## 20 71546.4889 nan 0.1000 5522.9451
## 40 51312.6798 nan 0.1000 -981.1606
## 60 47681.0697 nan 0.1000 -664.4383
## 80 46065.5137 nan 0.1000 -114.8126
## 100 43506.8636 nan 0.1000 -800.3095
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 292259.9739 nan 0.1000 46037.0871
## 2 252206.5544 nan 0.1000 39922.0617
## 3 230701.4885 nan 0.1000 23804.0082
## 4 196126.8157 nan 0.1000 22457.5388
## 5 172374.3921 nan 0.1000 24080.2368
## 6 154833.6069 nan 0.1000 18979.2596
## 7 136690.2565 nan 0.1000 15966.0847
## 8 120086.3442 nan 0.1000 14837.5269
## 9 113570.2024 nan 0.1000 4105.3163
## 10 97036.4696 nan 0.1000 9643.4460
## 20 60248.7276 nan 0.1000 1737.0030
## 40 50975.6983 nan 0.1000 -1451.4538
## 60 44663.2720 nan 0.1000 -370.3443
## 80 43068.8728 nan 0.1000 -1049.3787
## 100 40678.5028 nan 0.1000 -150.8069
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 308563.9268 nan 0.1000 33046.9787
## 2 270072.8043 nan 0.1000 37805.5938
## 3 236401.2020 nan 0.1000 31437.2617
## 4 203316.5173 nan 0.1000 23900.9670
## 5 181760.8152 nan 0.1000 19024.8861
## 6 164823.5820 nan 0.1000 14918.7401
## 7 147541.3252 nan 0.1000 18218.4492
## 8 141854.3670 nan 0.1000 3909.6725
## 9 131719.1849 nan 0.1000 10221.0323
## 10 122936.8906 nan 0.1000 8338.3240
## 20 88864.2185 nan 0.1000 -1296.2423
## 40 62302.0910 nan 0.1000 -936.0026
## 60 59142.1164 nan 0.1000 -2127.4887
## 80 55055.9752 nan 0.1000 -853.4398
## 100 49965.4377 nan 0.1000 -1414.0418
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 262491.8344 nan 0.1000 34922.0824
## 2 223935.8545 nan 0.1000 34604.0746
## 3 203199.9011 nan 0.1000 21915.4769
## 4 182522.7743 nan 0.1000 22600.7615
## 5 156221.9110 nan 0.1000 25466.3351
## 6 140623.1158 nan 0.1000 12171.4237
## 7 124398.2199 nan 0.1000 12464.0935
## 8 102921.3673 nan 0.1000 12354.9772
## 9 93123.9137 nan 0.1000 8358.3391
## 10 81563.4372 nan 0.1000 9931.1841
## 20 50396.6401 nan 0.1000 3145.6571
## 40 34322.0680 nan 0.1000 -37.4712
## 60 31256.1992 nan 0.1000 -995.6498
## 80 29691.7007 nan 0.1000 -484.9028
## 100 28667.5623 nan 0.1000 -245.2765
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 293664.7015 nan 0.1000 38511.5089
## 2 253738.3940 nan 0.1000 36885.7181
## 3 208855.3054 nan 0.1000 28616.4559
## 4 189135.1123 nan 0.1000 22755.7028
## 5 168707.3085 nan 0.1000 19903.8352
## 6 155116.2446 nan 0.1000 14245.4270
## 7 141195.7249 nan 0.1000 14923.4580
## 8 121309.2212 nan 0.1000 15835.0693
## 9 113041.8535 nan 0.1000 8029.9109
## 10 97321.4836 nan 0.1000 11198.1425
## 20 51314.9081 nan 0.1000 175.8949
## 40 37242.5268 nan 0.1000 -439.5313
## 60 34284.1921 nan 0.1000 -486.1084
## 80 32723.3787 nan 0.1000 -955.6775
## 100 30937.5057 nan 0.1000 -480.2995
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 293303.4687 nan 0.1000 45634.3342
## 2 256423.8214 nan 0.1000 38822.5546
## 3 223370.5023 nan 0.1000 32482.3584
## 4 202198.0183 nan 0.1000 25206.1748
## 5 177054.0688 nan 0.1000 18356.2085
## 6 151520.4865 nan 0.1000 21273.8680
## 7 140638.8680 nan 0.1000 12000.1885
## 8 128833.3336 nan 0.1000 11786.7128
## 9 124501.3004 nan 0.1000 2943.1144
## 10 118144.6434 nan 0.1000 7492.2680
## 20 70876.6431 nan 0.1000 1150.9235
## 40 47806.3920 nan 0.1000 -28.7751
## 60 42920.1400 nan 0.1000 -452.8661
## 80 40550.3434 nan 0.1000 -390.6150
## 100 37854.7878 nan 0.1000 -4.1366
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 302333.1727 nan 0.1000 43549.0515
## 2 257588.8094 nan 0.1000 38222.4656
## 3 214455.7553 nan 0.1000 27053.0815
## 4 189366.6964 nan 0.1000 24218.0897
## 5 163090.7211 nan 0.1000 18747.4759
## 6 143861.9725 nan 0.1000 14637.2499
## 7 129931.6083 nan 0.1000 13576.7613
## 8 119769.7380 nan 0.1000 10339.1323
## 9 109162.7182 nan 0.1000 11243.3283
## 10 102913.8702 nan 0.1000 7425.2574
## 20 64028.4293 nan 0.1000 27.4634
## 40 44216.3133 nan 0.1000 -1507.1932
## 60 42388.6488 nan 0.1000 -557.7306
## 80 39204.9094 nan 0.1000 -131.6251
## 100 37949.1674 nan 0.1000 -61.6298
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 285122.7488 nan 0.1000 44596.6164
## 2 242059.2342 nan 0.1000 40923.2986
## 3 211139.3179 nan 0.1000 28140.8434
## 4 181733.6434 nan 0.1000 27124.9700
## 5 157314.1365 nan 0.1000 20926.0057
## 6 139749.4615 nan 0.1000 18565.5969
## 7 128966.6920 nan 0.1000 12462.7460
## 8 120068.9842 nan 0.1000 11034.9951
## 9 107813.3343 nan 0.1000 11092.6384
## 10 96307.7366 nan 0.1000 10165.9504
## 20 56702.6647 nan 0.1000 706.3690
## 40 36629.4567 nan 0.1000 -681.1539
## 60 32881.0053 nan 0.1000 -1443.9054
## 80 29737.9461 nan 0.1000 -272.2517
## 100 28424.7429 nan 0.1000 -946.3329
Once the model is trained, it is used to predict on the testing set.
After making the predictions on the testing set, we use the resamples() function to collect the resampling metrics of the four models and compare them.
resamps <- resamples(list(svm = svmfit, rf = rffit, knn = knnfit, gbt = gbtfit))
summary(resamps) # Shows the MAE, RMSE and R-squared of each model. Remember that the number of resamples was set to 10 in the fitControl object.
##
## Call:
## summary.resamples(object = resamps)
##
## Models: svm, rf, knn, gbt
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## svm 30.37503 219.85782 313.7654 292.61240 407.8387 483.0093 0
## rf 14.99248 34.15706 100.0813 96.95222 151.7929 191.6567 0
## knn 32.26114 188.98975 235.3296 210.16284 248.5188 286.6808 0
## gbt 63.01270 112.17930 154.1390 152.32937 179.9038 263.5565 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## svm 34.93878 308.25199 564.8230 476.7672 681.9374 714.7242 0
## rf 22.26736 58.75182 183.8599 173.5783 264.5494 361.9389 0
## knn 45.31928 295.99877 340.9257 341.7493 425.5244 564.0861 0
## gbt 91.09679 162.28138 231.7509 226.0522 246.6022 429.1217 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## svm 0.3002866 0.4913675 0.7429490 0.7039975 0.9577268 0.9981831 0
## rf 0.7868799 0.9062093 0.9664738 0.9298072 0.9880728 0.9987420 0
## knn 0.5653646 0.6915781 0.7458344 0.7702468 0.8640047 0.9884488 0
## gbt 0.7576783 0.8293145 0.8603423 0.8768935 0.9527890 0.9856353 0
diffs <- diff(resamps)
summary(diffs) # Gives the pairwise differences between the models in the resamps object. Another way to check which model fits better.
##
## Call:
## summary.diff.resamples(object = diffs)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## MAE
## svm rf knn gbt
## svm 195.66 82.45 140.28
## rf 0.004519 -113.21 -55.38
## knn 0.627520 0.010850 57.83
## gbt 0.150495 0.113840 0.364229
##
## RMSE
## svm rf knn gbt
## svm 303.19 135.02 250.72
## rf 0.002998 -168.17 -52.47
## knn 0.703751 0.019815 115.70
## gbt 0.089718 0.808675 0.149251
##
## Rsquared
## svm rf knn gbt
## svm -0.22581 -0.06625 -0.17290
## rf 0.1204 0.15956 0.05291
## knn 1.0000 0.0673 -0.10665
## gbt 0.5268 0.8341 0.3316
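The layout of this output (difference estimates above the diagonal, Bonferroni-adjusted p-values for H0: difference = 0 below it) can be mimicked in base R with paired t-tests. The sketch below uses made-up per-fold RMSE values for three models, so its numbers are not those of the report, and it only approximates what the caret summary computes.

```r
# Hypothetical per-fold RMSE values for three models (10 folds each).
fold_rmse <- data.frame(
  svm = c(35, 310, 560, 480, 680, 710, 300, 590, 620, 480),
  rf  = c(22,  60, 180, 170, 260, 360,  55, 190, 270, 170),
  knn = c(45, 300, 340, 345, 425, 560, 290, 335, 430, 350)
)

pairs <- combn(names(fold_rmse), 2, simplify = FALSE)

# Paired t-test per model pair: same folds, so differences are taken fold by fold.
raw_p <- sapply(pairs, function(p) {
  t.test(fold_rmse[[p[1]]], fold_rmse[[p[2]]], paired = TRUE)$p.value
})
estimates <- sapply(pairs, function(p) mean(fold_rmse[[p[1]]] - fold_rmse[[p[2]]]))

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1.
adj_p <- p.adjust(raw_p, method = "bonferroni")

data.frame(
  comparison = sapply(pairs, paste, collapse = " - "),
  estimate   = estimates,
  p_adjusted = adj_p
)
```

The cap at 1 is why an adjusted p-value of exactly 1.0000 can appear in such a table.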
Next, we generate a data frame that contains the evaluation metrics of all the models.
algorithm_comparison <- data.frame(Model="SVM", MAE=mean(resamps$values$`svm~MAE`), RMSE=mean(resamps$values$`svm~RMSE`), Rsquared=mean(resamps$values$`svm~Rsquared`))
algorithm_comparison <- rbind(algorithm_comparison,data.frame(Model="RF", MAE=mean(resamps$values$`rf~MAE`), RMSE=mean(resamps$values$`rf~RMSE`), Rsquared=mean(resamps$values$`rf~Rsquared`)))
algorithm_comparison <- rbind(algorithm_comparison,data.frame(Model="kNN", MAE=mean(resamps$values$`knn~MAE`), RMSE=mean(resamps$values$`knn~RMSE`), Rsquared=mean(resamps$values$`knn~Rsquared`)))
algorithm_comparison <- rbind(algorithm_comparison,data.frame(Model="GBT", MAE=mean(resamps$values$`gbt~MAE`), RMSE=mean(resamps$values$`gbt~RMSE`), Rsquared=mean(resamps$values$`gbt~Rsquared`)))
algorithm_comparison
## Model MAE RMSE Rsquared
## 1 SVM 292.61240 476.7672 0.7039975
## 2 RF 96.95222 173.5783 0.9298072
## 3 kNN 210.16284 341.7493 0.7702468
## 4 GBT 152.32937 226.0522 0.8768935
Among the methods used, Random Forest gives the best results, so it is the predictive model that will be applied to predict the sales Volume of the new products.
Finally, we apply the trained model to the whole existing products data set and not just to the testing set.
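The choice can be checked directly from the comparison table. A small sketch that rebuilds the table from the mean values printed above and confirms that all three metrics point to the same model; the object name comparison is only illustrative.

```r
# Mean resampling metrics copied from the algorithm_comparison table above.
comparison <- data.frame(
  Model    = c("SVM", "RF", "kNN", "GBT"),
  MAE      = c(292.61240, 96.95222, 210.16284, 152.32937),
  RMSE     = c(476.7672, 173.5783, 341.7493, 226.0522),
  Rsquared = c(0.7039975, 0.9298072, 0.7702468, 0.8768935),
  stringsAsFactors = FALSE
)

# Rank by RMSE (lower is better) and check that the other metrics agree.
comparison[order(comparison$RMSE), ]
best_by_rmse <- comparison$Model[which.min(comparison$RMSE)]
best_by_mae  <- comparison$Model[which.min(comparison$MAE)]
best_by_rsq  <- comparison$Model[which.max(comparison$Rsquared)]
c(best_by_rmse, best_by_mae, best_by_rsq)
```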
rf_prediction_allData <- predict(rffit, newdata = Existing_Products)
actuals_preds_allData_rf <- data.frame(cbind(actuals=Existing_Products$Volume, predicteds=rf_prediction_allData)) # make actuals_predicteds dataframe.
correlation_accuracy_allData_rf <- cor(actuals_preds_allData_rf) # Correlation between actual and predicted values, reported as 96.1%.
The final step is to apply the trained predictive model to the new products in order to predict the Volume of each product.
To apply the predictive model, the new data set must undergo the same transformations.
Load the data.
Dummify the data as before.
# dummify the data
new_products_2 <- dummyVars(" ~ .", data = new_products)
new_products_3 <- data.frame(predict(new_products_2, newdata = new_products))
Omit the variables that were considered unnecessary.
new_products_3$x5StarReviews <- NULL
new_products_3$x1StarReviews <- NULL
new_products_3$x3StarReviews <- NULL
new_products_3$BestSellersRank <- NULL
new_products_3$ProductNum <- NULL
new_products_4 <- new_products_3[ , !(names(new_products_3) %in% drop_variables)]
Finally, the predictive model is applied to the modified new products data set, and a CSV file containing the predictions is generated.
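The prediction and export code is not shown, so here is a minimal sketch of that last step on toy data. lm() stands in for the random forest, and the data values, object names and output file name are all hypothetical. The check before predict() guards against the new data lacking a column the model was trained on, which can happen when dummy variables are created separately for each data set.

```r
# Hypothetical training and new-product data; lm() stands in for the random forest.
train_toy <- data.frame(
  x4StarReviews         = c(3, 19, 31, 30, 3, 19, 9, 25),
  PositiveServiceReview = c(2, 7, 7, 12, 3, 5, 2, 9),
  Volume                = c(12, 196, 232, 332, 44, 132, 64, 300)
)
new_toy <- data.frame(
  x4StarReviews         = c(26, 11),
  PositiveServiceReview = c(12, 7)
)

fit <- lm(Volume ~ ., data = train_toy)

# Every predictor the model was trained on must be present in the new data.
predictors <- setdiff(names(train_toy), "Volume")
missing_cols <- setdiff(predictors, names(new_toy))
stopifnot(length(missing_cols) == 0)

# Predict, attach the predictions, and write them to a CSV file.
new_toy$Volume <- predict(fit, newdata = new_toy)
out_path <- file.path(tempdir(), "new_products_predictions.csv")  # hypothetical file name
write.csv(new_toy, out_path, row.names = FALSE)
```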
Now it's time to analyze the four product types that we have been asked to assess.
PC_new <- subset(New_Products, ProductType == "PC") # Group the PCs of the New Products data set.
drop_variables <- c(drop_variables, "x5StarReviews", "x3StarReviews", "x1StarReviews", "BestSellersRank") # Add to the previously generated drop_variables object the attributes that were omitted when building the predictive model.
PC_new_1 <- PC_new[ , !(names(PC_new) %in% drop_variables)] # Generate a data frame with only the attributes that were used to build the predictive model.
PC_existing <- subset(existing_products, ProductType == "PC") # Group the PCs of the Existing Products data set before it was dummified.
PC_existing_1 <- PC_existing[ , !(names(PC_existing) %in% drop_variables)] # Generate a data frame with only the attributes that were used to build the predictive model.
PC_analysis <- rbind(PC_new_1, PC_existing_1)
PC_analysis
## # A tibble: 6 x 9
## ProductType ProductNum Price x4StarReviews x2StarReviews PositiveService~
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PC 171 699 26 14 12
## 2 PC 172 860 11 10 7
## 3 PC 101 949 3 0 2
## 4 PC 102 2250. 1 0 1
## 5 PC 103 399 0 0 1
## 6 PC 142 610. 7 0 5
## # ... with 3 more variables: NegativeServiceReview <dbl>,
## # Recommendproduct <dbl>, Volume <dbl>
PC_analysis$ProductNum <- as.factor(PC_analysis$ProductNum)
#ggplot(PC_analysis, aes( Volume , color=ProductNum, fill=ProductNum)) + geom_col() + labs(title = "PC") + stat_count()
#analysis <- rbind(PC_new_1, Smartphone_new_1, Netbook_new_1, Laptop_new_1)
Laptop_new <- subset(New_Products, ProductType == "Laptop") # Group the Laptops of the New Products data set.
Laptop_new_1 <- Laptop_new[ , !(names(Laptop_new) %in% drop_variables)] # Generate a data frame with only the attributes that were used to build the predictive model.
Laptop_existing <- subset(existing_products, ProductType == "Laptop") # Group the Laptops of the Existing Products data set before it was dummified.
Laptop_existing_1 <- Laptop_existing[ , !(names(Laptop_existing) %in% drop_variables)] # Generate a data frame with only the attributes that were used to build the predictive model.
Laptop_analysis <- rbind(Laptop_new_1, Laptop_existing_1)
Laptop_analysis
## # A tibble: 6 x 9
## ProductType ProductNum Price x4StarReviews x2StarReviews PositiveService~
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Laptop 173 1199 10 3 11
## 2 Laptop 175 1199 2 1 2
## 3 Laptop 176 1999 1 3 0
## 4 Laptop 104 410. 19 3 7
## 5 Laptop 105 1080. 31 7 7
## 6 Laptop 143 771. 14 5 6
## # ... with 3 more variables: NegativeServiceReview <dbl>,
## # Recommendproduct <dbl>, Volume <dbl>
Laptop_analysis$ProductNum <- as.factor(Laptop_analysis$ProductNum)
ggplot(Laptop_analysis, aes(ProductNum , Volume, color=ProductNum, fill=ProductNum)) + geom_point() + labs(title = "Laptop")
Netbook_new <- subset(New_Products, ProductType == "Netbook") # Group the Netbooks of the New Products data set.
Netbook_new_1 <- Netbook_new[ , !(names(Netbook_new) %in% drop_variables)] # Generate a data frame with only the attributes that were used to build the predictive model.
Netbook_existing <- subset(existing_products, ProductType == "Netbook") # Group the Netbooks of the Existing Products data set before it was dummified.
Netbook_existing_1 <- Netbook_existing[ , !(names(Netbook_existing) %in% drop_variables)] # Generate a data frame with only the attributes that were used to build the predictive model.
Netbook_analysis <- rbind(Netbook_new_1, Netbook_existing_1)
Netbook_analysis
## # A tibble: 6 x 9
## ProductType ProductNum Price x4StarReviews x2StarReviews PositiveService~
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Netbook 178 400. 8 1 2
## 2 Netbook 180 329 112 31 28
## 3 Netbook 181 439 18 22 5
## 4 Netbook 183 330 4 1 1
## 5 Netbook 177 380. 0 1 0
## 6 Netbook 182 350. 10 2 3
## # ... with 3 more variables: NegativeServiceReview <dbl>,
## # Recommendproduct <dbl>, Volume <dbl>
Netbook_analysis$ProductNum <- as.factor(Netbook_analysis$ProductNum)
ggplot(Netbook_analysis, aes(ProductNum , Volume, color=ProductNum, fill=ProductNum)) + geom_point() + labs(title = "Netbook")
Smartphone_new <- subset(New_Products, ProductType == "Smartphone") # Group the Smartphones of the New Products data set.
Smartphone_new_1 <- Smartphone_new[ , !(names(Smartphone_new) %in% drop_variables)] # Generate a data frame with only the attributes that were used to build the predictive model.
Smartphone_existing <- subset(existing_products, ProductType == "Smartphone") # Group the Smartphones of the Existing Products data set before it was dummified.
Smartphone_existing_1 <- Smartphone_existing[ , !(names(Smartphone_existing) %in% drop_variables)] # Generate a data frame with only the attributes that were used to build the predictive model.
Smartphone_analysis <- rbind(Smartphone_new_1, Smartphone_existing_1)
Smartphone_analysis
## # A tibble: 8 x 9
## ProductType ProductNum Price x4StarReviews x2StarReviews PositiveService~
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Smartphone 193 199 26 16 8
## 2 Smartphone 194 49 26 33 14
## 3 Smartphone 195 149 8 4 4
## 4 Smartphone 196 300 19 20 5
## 5 Smartphone 190 199 1 2 1
## 6 Smartphone 191 200 25 11 9
## 7 Smartphone 192 99 17 2 5
## 8 Smartphone 197 499 28 10 22
## # ... with 3 more variables: NegativeServiceReview <dbl>,
## # Recommendproduct <dbl>, Volume <dbl>
Smartphone_analysis$ProductNum <- as.factor(Smartphone_analysis$ProductNum)
ggplot(Smartphone_analysis, aes(ProductNum , Volume, color=ProductNum, fill=ProductNum)) + geom_point() + labs(title = "Smartphone")
First, we generate a data set that contains only the product type, the review variables and Volume.
drop_variables_asses <- c("Price","ProductDepth", "ProductHeight", "ProductWidth", "ProfitMargin", "ProductNum", "ShippingWeight","Recommendproduct", "BestSellersRank")
existing_products_review <- existing_products[ , !(names(existing_products) %in% drop_variables_asses)]
As before, we are going to omit outliers.
On the one hand, we will assess the effect of the Service Review variables, PositiveServiceReview and NegativeServiceReview.
PositiveReview <- ggplot(existing_products_review , aes(PositiveServiceReview, Volume)) + geom_point() + geom_smooth(color="red") + labs(title = "Positive Service Review")
NegativeReview <- ggplot(existing_products_review , aes(NegativeServiceReview, Volume)) + geom_point() + geom_smooth(color="blue") + labs(title = "Negative Service Review")
grid.arrange(PositiveReview, NegativeReview, ncol = 2)
It looks like there are some points that don't follow the roughly linear trend. In the next lines we extract them with the filter() function.
Positive_outliers <- filter(existing_products_review, PositiveServiceReview > 200)
Positive_outliers
## # A tibble: 9 x 9
## ProductType x5StarReviews x4StarReviews x3StarReviews x2StarReviews
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Accessories 170 100 23 20
## 2 ExtendedWa~ 308 27 8 3
## 3 ExtendedWa~ 308 27 8 3
## 4 ExtendedWa~ 308 27 8 3
## 5 ExtendedWa~ 308 27 8 3
## 6 ExtendedWa~ 308 27 8 3
## 7 ExtendedWa~ 308 27 8 3
## 8 ExtendedWa~ 308 27 8 3
## 9 ExtendedWa~ 308 27 8 3
## # ... with 4 more variables: x1StarReviews <dbl>, PositiveServiceReview <dbl>,
## # NegativeServiceReview <dbl>, Volume <dbl>
On the other hand, we will assess the effect of the Star Review variables.
x4starReview <- ggplot(existing_products_review , aes(x4StarReviews, Volume)) + geom_point() + geom_smooth(color="blue") + labs(title = "4 Star Review")
x2starReview <- ggplot(existing_products_review , aes(x2StarReviews, Volume)) + geom_point() + geom_smooth(color="yellow") + labs(title = "2 Star Review")
grid.arrange(x4starReview, x2starReview, ncol = 2)