1 Introduction

In this work we predict sales volume for four different product types while assessing the effect that service and customer reviews have on sales.

We will use regression to build machine learning models for this analysis. Once we have determined which algorithm works best on the provided data set, we will predict the sales of the four product types in the new products list.

To sum up:

What are we trying to predict?

We need to predict the sales volume for the new products list.

What type of problem is it? Classification or Regression? Binary or Multi-class? Uni-variate or Multi-variate?

It is a multiple regression problem: a numeric target predicted from multiple features.

What type of data do we have?

We have two files: one to build the predictive models (existingproducts.csv) and another with the new products whose sales volume we will predict (newproducts.csv).

3 Import Data

The next step is to load the data we are going to be working with.
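A minimal sketch of this step, assuming the readr package and the file names from the project brief (the object names existing and newprods are ours):

```r
library(readr)

existing <- read_csv("existingproducts.csv")  # set used to build the models
newprods <- read_csv("newproducts.csv")       # products whose Volume we predict
```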

4 Initial exploration of data

With the glimpse() and summary() functions we can do an initial exploration of the data to better understand what kind of data we are handling.
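A minimal sketch producing the output below (assuming the data frame is named existing, as above):

```r
library(dplyr)    # glimpse() is re-exported by dplyr
glimpse(existing)
```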

## Observations: 80
## Variables: 18
## $ ProductType           <chr> "PC", "PC", "PC", "Laptop", "Laptop", "Access...
## $ ProductNum            <dbl> 101, 102, 103, 104, 105, 106, 107, 108, 109, ...
## $ Price                 <dbl> 949.00, 2249.99, 399.00, 409.99, 1079.99, 114...
## $ x5StarReviews         <dbl> 3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, ...
## $ x4StarReviews         <dbl> 3, 1, 0, 19, 31, 30, 3, 19, 9, 1, 2, 25, 8, 6...
## $ x3StarReviews         <dbl> 2, 0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13,...
## $ x2StarReviews         <dbl> 0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3, 0, 8, 7, ...
## $ x1StarReviews         <dbl> 0, 0, 0, 9, 36, 40, 1, 9, 2, 0, 15, 3, 1, 16,...
## $ PositiveServiceReview <dbl> 2, 1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 5...
## $ NegativeServiceReview <dbl> 0, 0, 0, 8, 20, 5, 0, 3, 1, 0, 1, 2, 0, 3, 3,...
## $ Recommendproduct      <dbl> 0.9, 0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, ...
## $ BestSellersRank       <dbl> 1967, 4806, 12076, 109, 268, 64, NA, 2, NA, 1...
## $ ShippingWeight        <dbl> 25.80, 50.00, 17.40, 5.70, 7.00, 1.60, 7.30, ...
## $ ProductDepth          <dbl> 23.94, 35.00, 10.50, 15.00, 12.90, 5.80, 6.70...
## $ ProductWidth          <dbl> 6.62, 31.75, 8.30, 9.90, 0.30, 4.00, 10.30, 6...
## $ ProductHeight         <dbl> 16.89, 19.00, 10.20, 1.30, 8.90, 1.00, 11.50,...
## $ ProfitMargin          <dbl> 0.15, 0.25, 0.08, 0.08, 0.09, 0.05, 0.05, 0.0...
## $ Volume                <dbl> 12, 8, 12, 196, 232, 332, 44, 132, 64, 40, 84...

Now we will check for outliers.
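One simple way to do this is a box plot of the dependent variable; the plotting choice here is ours, not prescribed by the original analysis:

```r
# Extreme Volume values stand out clearly in a horizontal box plot
boxplot(existing$Volume, horizontal = TRUE, main = "Volume")
```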

To finish the initial exploration we will check for missing values; if we find any, we will deal with them.
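Counting the NA cells over the whole data frame yields the figure below:

```r
sum(is.na(existing))  # total number of missing values
```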

## [1] 15

There are missing values, and from the glimpse() output above we can see that they all correspond to the same attribute, BestSellersRank. In the next section, Pre-processing, we will treat both outliers and missing values.

5 Pre-processing

Most data sets contain a mixture of numeric and nominal variables, so we need to understand how to incorporate both when developing regression models and making predictions.

Categorical variables may be used directly as predictor or predicted variables in a multiple regression model as long as they have been converted to binary values. In order to pre-process the sales data as needed, we first have to convert all factor or 'chr' columns into binary features containing 0 and 1 values. Fortunately, caret has a method for creating these "dummy variables", as follows:
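A sketch of the dummification step (the object names are ours):

```r
library(caret)

# dummyVars() builds an encoder that turns every non-numeric column
# into one 0/1 indicator column per level
dummies    <- dummyVars(~ ., data = existing)
existing_d <- data.frame(predict(dummies, newdata = existing))
```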

Now it is time to remove the outliers.
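One common rule of thumb, used here as an assumption since the original cutoff is not shown, is to drop observations lying beyond 1.5 times the interquartile range of Volume:

```r
# Drop rows whose Volume exceeds Q3 + 1.5 * IQR
qs  <- quantile(existing_d$Volume, c(0.25, 0.75))
cut <- qs[2] + 1.5 * (qs[2] - qs[1])
existing_d <- existing_d[existing_d$Volume <= cut, ]
```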

Once the data is dummified and we no longer have categorical variables, we can treat the missing values. In this case, all missing values come from the same attribute, so we have decided to delete that attribute.
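Since every NA sits in BestSellersRank (see the glimpse output above), dropping that column removes all missing values at once:

```r
existing_d$BestSellersRank <- NULL  # the only attribute containing NAs
```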

After dummifying the data and omitting outliers and missing values, we have to check the correlation among the features and between each feature and the dependent variable.

6 Feature Engineering

We will use the cor() function to create a correlation matrix and visualize the correlations between the features.
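A sketch of that step; the corrplot visualization is our addition, not part of the original report:

```r
corrData <- cor(existing_d)
round(corrData["Volume", ], 2)   # correlations with the dependent variable

library(corrplot)
corrplot(corrData, tl.cex = 0.6) # graphical overview of the whole matrix
```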

From corrData we can observe that the following feature has a very high correlation with the dependent variable and can lead to overfitting:

  • x5StarReviews

Now we have to check whether there is high correlation among the features themselves; keeping highly collinear features can introduce too much noise.

x4StarReviews and x3StarReviews have a correlation of 0.937. In order to decide which one to remove, we check the correlation of each feature with the dependent variable.

x4StarReviews has the greater correlation with Volume, so x3StarReviews is removed.

The same happens with x2StarReviews and x1StarReviews, where x1StarReviews is the one removed.

We have checked and removed the features that correlate most strongly with the dependent variable and with each other. Now we have to deal with the features that correlate weakly with the dependent variable, such as profit margin and the physical attributes, all with a correlation below 0.2.

The variables that have low correlation with the dependent variable are:

  • Display
  • Extended Warranty
  • Laptop
  • Printer Supplies
  • Smartphone
  • Software
  • Tablet
  • Product Depth
  • Profit Margin

In the case of the dummified data we will not remove the ProductTypeXXX features, because we do not want to end up with an overly biased model. A sketch of the resulting feature removal follows.
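The column names come from the discussion above; the dummified product-type columns are deliberately kept:

```r
drop_cols <- c("x5StarReviews",                   # near-perfect correlation with Volume
               "x3StarReviews", "x1StarReviews",  # collinear with x4/x2StarReviews
               "ProductDepth", "ProfitMargin")    # weak correlation with Volume
existing_d <- existing_d[, setdiff(names(existing_d), drop_cols)]
```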

7 Training and Testing

Once the data is preprocessed, it is time to create the training and testing sets for the predictive models.

First, we will split the data into two sets, training and testing, with the createDataPartition() function, which does a stratified random split of the data.
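A minimal sketch, assuming a 75/25 split (the exact proportion and seed are not stated in the original):

```r
set.seed(123)  # arbitrary seed for reproducibility

inTrain  <- createDataPartition(existing_d$Volume, p = 0.75, list = FALSE)
training <- existing_d[inTrain, ]
testing  <- existing_d[-inTrain, ]
```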

8 Modelling

We will run four different methods and then compare them. The best method will be used to predict the Volume of our new product list.

These algorithms can only be compared if they all use the same resampling method and the same number of repetitions. The resampling scheme can be set with the trainControl() function.
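The resamples summary in section 9 shows 10 resamples, consistent with 10-fold cross-validation, so a plausible control object is:

```r
fitControl <- trainControl(method = "cv", number = 10)  # 10-fold CV
```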

8.1 Support Vector Machine

The first method to be trained is the Support Vector Machine algorithm. To run any algorithm the train() function is used; in this case the caret method is called "svmRadial", one among many.
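A sketch of the call (centering and scaling is our assumption, since SVMs usually benefit from it):

```r
svmFit <- train(Volume ~ ., data = training,
                method     = "svmRadial",
                trControl  = fitControl,
                preProcess = c("center", "scale"))
```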

Once the model is trained, we use it to predict on the testing set.
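The comparison table below can be produced as follows (the column names match the output):

```r
svmPred <- predict(svmFit, newdata = testing)
data.frame(actuals = testing$Volume, predicteds = svmPred)
```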

##    actuals predicteds
## 1      232  356.60341
## 2      132  272.62175
## 3      300  177.75650
## 4       60  126.18329
## 5     1576 1230.77253
## 6     2052  648.13852
## 7       32  -25.62101
## 8       20  119.63613
## 9     1232 1174.19628
## 10      88  239.77103
## 11    1536  540.28954
## 12     836 1596.56955
## 13     904  583.01019
## 14     232  326.43896
## 15       8  -30.56946
## 16      80  124.19303
## 17       0   33.32721
## 18      16  -53.75056

8.4 Gradient Boosted Trees

The last method is the Gradient Boosted Trees algorithm. In this case, the caret method is called "gbm".
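A sketch of the call; note that gbm prints a deviance trace for every resampling fold (shown below), which can be silenced by passing verbose = FALSE through train():

```r
gbmFit <- train(Volume ~ ., data = training,
                method    = "gbm",
                trControl = fitControl)
```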

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1   303746.3704             nan     0.1000 47364.5116
##      2   260991.7002             nan     0.1000 40660.7765
##      3   228188.0744             nan     0.1000 31542.3772
##      4   201687.0054             nan     0.1000 23702.3657
##      5   176467.0886             nan     0.1000 21064.6207
##      6   162489.4085             nan     0.1000 16992.8083
##      7   148508.5083             nan     0.1000 13369.1008
##      8   126896.9738             nan     0.1000 16100.6165
##      9   114511.0398             nan     0.1000 7738.4310
##     10   106768.3731             nan     0.1000 7161.4203
##     20    71298.5542             nan     0.1000 1300.9095
##     40    50777.4050             nan     0.1000 -254.5416
##     60    45668.5482             nan     0.1000 -617.5422
##     80    42421.8598             nan     0.1000 -698.2938
##    100    40242.1494             nan     0.1000  -81.6501
## 
## (analogous deviance traces for the remaining resampling folds omitted)

Once the model is trained, it is again used to predict on the testing set.
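As with the SVM, a one-line prediction on the held-out data:

```r
gbmPred <- predict(gbmFit, newdata = testing)
```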

9 Resamples

After making the predictions using the test set, we use the resamples() function to collect each model's resampling metrics (errors versus the ground truth) and compare the models.
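A sketch producing the summaries below, assuming the Random Forest and kNN fits from the intervening sections are stored as rfFit and knnFit (those names are ours; resamps and diffs match the calls shown in the output):

```r
resamps <- resamples(list(svm = svmFit, rf = rfFit,
                          knn = knnFit, gbt = gbmFit))
summary(resamps)

diffs <- diff(resamps)  # pairwise differences, Bonferroni-adjusted p-values
summary(diffs)
```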

## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: svm, rf, knn, gbt 
## Number of resamples: 10 
## 
## MAE 
##         Min.   1st Qu.   Median      Mean  3rd Qu.     Max. NA's
## svm 30.37503 219.85782 313.7654 292.61240 407.8387 483.0093    0
## rf  14.99248  34.15706 100.0813  96.95222 151.7929 191.6567    0
## knn 32.26114 188.98975 235.3296 210.16284 248.5188 286.6808    0
## gbt 63.01270 112.17930 154.1390 152.32937 179.9038 263.5565    0
## 
## RMSE 
##         Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## svm 34.93878 308.25199 564.8230 476.7672 681.9374 714.7242    0
## rf  22.26736  58.75182 183.8599 173.5783 264.5494 361.9389    0
## knn 45.31928 295.99877 340.9257 341.7493 425.5244 564.0861    0
## gbt 91.09679 162.28138 231.7509 226.0522 246.6022 429.1217    0
## 
## Rsquared 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## svm 0.3002866 0.4913675 0.7429490 0.7039975 0.9577268 0.9981831    0
## rf  0.7868799 0.9062093 0.9664738 0.9298072 0.9880728 0.9987420    0
## knn 0.5653646 0.6915781 0.7458344 0.7702468 0.8640047 0.9884488    0
## gbt 0.7576783 0.8293145 0.8603423 0.8768935 0.9527890 0.9856353    0
## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## MAE 
##     svm      rf       knn      gbt    
## svm           195.66    82.45   140.28
## rf  0.004519          -113.21   -55.38
## knn 0.627520 0.010850            57.83
## gbt 0.150495 0.113840 0.364229        
## 
## RMSE 
##     svm      rf       knn      gbt    
## svm           303.19   135.02   250.72
## rf  0.002998          -168.17   -52.47
## knn 0.703751 0.019815           115.70
## gbt 0.089718 0.808675 0.149251        
## 
## Rsquared 
##     svm    rf       knn      gbt     
## svm        -0.22581 -0.06625 -0.17290
## rf  0.1204           0.15956  0.05291
## knn 1.0000 0.0673            -0.10665
## gbt 0.5268 0.8341   0.3316

We then build a data frame that gathers the mean evaluation metrics of all the models.
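One way to build it, pulling the means straight out of the resamples summary (a sketch; the original construction is not shown):

```r
stats <- summary(resamps)$statistics
data.frame(Model    = c("SVM", "RF", "kNN", "GBT"),
           MAE      = stats$MAE[, "Mean"],
           RMSE     = stats$RMSE[, "Mean"],
           Rsquared = stats$Rsquared[, "Mean"],
           row.names = NULL)
```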

##   Model       MAE     RMSE  Rsquared
## 1   SVM 292.61240 476.7672 0.7039975
## 2    RF  96.95222 173.5783 0.9298072
## 3   kNN 210.16284 341.7493 0.7702468
## 4   GBT 152.32937 226.0522 0.8768935

Among the methods tried, Random Forest gives the best results, so Random Forest is the predictive model we will apply to predict the sales Volume of the new products.

Finally, we retrain our predictive model on the whole existing-products set, not just on the training split.
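A sketch of the final fit and prediction; new_d stands for the new-products data after the same preprocessing, and the name is ours:

```r
rfFinal   <- train(Volume ~ ., data = existing_d,
                   method = "rf", trControl = fitControl)
finalPred <- predict(rfFinal, newdata = new_d)
```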

11 New Product analysis

Now it is time to analyze the four product types that we have been asked to assess.

11.0.1 PC
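The table below can be produced with dplyr, assuming a combined data frame (here called combined, a name of ours) that stacks the new products with their predicted Volume on top of the existing ones:

```r
combined %>%
  filter(ProductType == "PC") %>%
  head()
```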

## # A tibble: 6 x 9
##   ProductType ProductNum Price x4StarReviews x2StarReviews PositiveService~
##   <chr>            <dbl> <dbl>         <dbl>         <dbl>            <dbl>
## 1 PC                 171  699             26            14               12
## 2 PC                 172  860             11            10                7
## 3 PC                 101  949              3             0                2
## 4 PC                 102 2250.             1             0                1
## 5 PC                 103  399              0             0                1
## 6 PC                 142  610.             7             0                5
## # ... with 3 more variables: NegativeServiceReview <dbl>,
## #   Recommendproduct <dbl>, Volume <dbl>

12 Assessment of service and customer review variables

First, we will generate a data set that contains only the variables we used to make the predictive model.
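A sketch with dplyr::select(), using the column names that appear in the table further below:

```r
model_vars <- existing %>%
  select(ProductType,
         x5StarReviews, x4StarReviews, x3StarReviews,
         x2StarReviews, x1StarReviews,
         PositiveServiceReview, NegativeServiceReview,
         Volume)
```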

As before, we are going to omit outliers.

12.0.1 Service Reviews

On one hand, we will assess the effect of the service review variables, PositiveServiceReview and NegativeServiceReview.

It looks like there are some points that do not follow the roughly linear pattern of the fitted model. In the next lines we extract them with the filter() function.
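A hypothetical filter that reproduces rows like those below; the real cutoff depends on the scatterplot and is not stated in the original:

```r
model_vars %>%
  filter(x5StarReviews > 150)  # assumed threshold for the deviating points
```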

## # A tibble: 9 x 9
##   ProductType x5StarReviews x4StarReviews x3StarReviews x2StarReviews
##   <chr>               <dbl>         <dbl>         <dbl>         <dbl>
## 1 Accessories           170           100            23            20
## 2 ExtendedWa~           308            27             8             3
## 3 ExtendedWa~           308            27             8             3
## 4 ExtendedWa~           308            27             8             3
## 5 ExtendedWa~           308            27             8             3
## 6 ExtendedWa~           308            27             8             3
## 7 ExtendedWa~           308            27             8             3
## 8 ExtendedWa~           308            27             8             3
## 9 ExtendedWa~           308            27             8             3
## # ... with 4 more variables: x1StarReviews <dbl>, PositiveServiceReview <dbl>,
## #   NegativeServiceReview <dbl>, Volume <dbl>