Example: Estimating Wine Quality

Step 1: Collecting data

To develop the wine rating model, we will use data donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. The data include examples of red and white Vinho Verde wines from Portugal—one of the world’s leading wine-producing countries. Because the factors that contribute to a highly rated wine may differ between the red and white varieties, for this analysis we will examine only the more popular white wines.

Step 2: Exploring and preparing the data

wine <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/whitewines.csv")

examine the wine data

str(wine)
'data.frame':   4898 obs. of  12 variables:
 $ fixed.acidity       : num  6.7 5.7 5.9 5.3 6.4 7 7.9 6.6 7 6.5 ...
 $ volatile.acidity    : num  0.62 0.22 0.19 0.47 0.29 0.14 0.12 0.38 0.16 0.37 ...
 $ citric.acid         : num  0.24 0.2 0.26 0.1 0.21 0.41 0.49 0.28 0.3 0.33 ...
 $ residual.sugar      : num  1.1 16 7.4 1.3 9.65 0.9 5.2 2.8 2.6 3.9 ...
 $ chlorides           : num  0.039 0.044 0.034 0.036 0.041 0.037 0.049 0.043 0.043 0.027 ...
 $ free.sulfur.dioxide : num  6 41 33 11 36 22 33 17 34 40 ...
 $ total.sulfur.dioxide: num  62 113 123 74 119 95 152 67 90 130 ...
 $ density             : num  0.993 0.999 0.995 0.991 0.993 ...
 $ pH                  : num  3.41 3.22 3.49 3.48 2.99 3.25 3.18 3.21 2.88 3.28 ...
 $ sulphates           : num  0.32 0.46 0.42 0.54 0.34 0.43 0.47 0.47 0.47 0.39 ...
 $ alcohol             : num  10.4 8.9 10.1 11.2 10.9 ...
 $ quality             : int  5 6 6 4 6 6 6 6 6 7 ...

the distribution of quality ratings

hist(wine$quality)

summary statistics of the wine data

summary(wine)
 fixed.acidity    volatile.acidity  citric.acid    
 Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
 1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
 Median : 6.800   Median :0.2600   Median :0.3200  
 Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
 3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
 Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
 residual.sugar     chlorides       free.sulfur.dioxide
 Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
 1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
 Median : 5.200   Median :0.04300   Median : 34.00     
 Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
 3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
 Max.   :65.800   Max.   :0.34600   Max.   :289.00     
 total.sulfur.dioxide    density             pH       
 Min.   :  9.0        Min.   :0.9871   Min.   :2.720  
 1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090  
 Median :134.0        Median :0.9937   Median :3.180  
 Mean   :138.4        Mean   :0.9940   Mean   :3.188  
 3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280  
 Max.   :440.0        Max.   :1.0390   Max.   :3.820  
   sulphates         alcohol         quality     
 Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
 1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
 Median :0.4700   Median :10.40   Median :6.000  
 Mean   :0.4898   Mean   :10.51   Mean   :5.878  
 3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
 Max.   :1.0800   Max.   :14.20   Max.   :9.000  

Set train and test data

wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
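
The split above takes the first 3750 of the 4898 rows (roughly three quarters) for training and the rest for testing. That is only a fair split if the rows are not sorted in some meaningful way, which this notebook does not check. As a minimal sketch, assuming the row order might matter, a random split can be drawn with `sample()`. The `toy` data frame and the seed below are made up for illustration and stand in for `wine`.

```r
# Hypothetical stand-in for the wine data: 100 rows with a quality score
set.seed(123)  # arbitrary seed, used only to make the example reproducible
toy <- data.frame(id = 1:100,
                  quality = sample(3:9, 100, replace = TRUE))

# Draw 75% of the row indices at random for training
train_rows <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))

toy_train <- toy[train_rows, ]   # rows selected for training
toy_test  <- toy[-train_rows, ]  # all remaining rows

nrow(toy_train)
nrow(toy_test)
```

The same pattern would apply to `wine` by replacing `toy` and choosing `size = 3750`.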

Step 3: Training a model on the data

train a regression tree model on the data using rpart

library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)

get basic information about the tree

m.rpart
n= 3750 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 3750 2945.53200 5.870933  
   2) alcohol< 10.85 2372 1418.86100 5.604975  
     4) volatile.acidity>=0.2275 1611  821.30730 5.432030  
       8) volatile.acidity>=0.3025 688  278.97670 5.255814 *
       9) volatile.acidity< 0.3025 923  505.04230 5.563380 *
     5) volatile.acidity< 0.2275 761  447.36400 5.971091 *
   3) alcohol>=10.85 1378 1070.08200 6.328737  
     6) free.sulfur.dioxide< 10.5 84   95.55952 5.369048 *
     7) free.sulfur.dioxide>=10.5 1294  892.13600 6.391036  
      14) alcohol< 11.76667 629  430.11130 6.173291  
        28) volatile.acidity>=0.465 11   10.72727 4.545455 *
        29) volatile.acidity< 0.465 618  389.71680 6.202265 *
      15) alcohol>=11.76667 665  403.99400 6.596992 *

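Because alcohol was used first in the tree, it is the single most important predictor of wine quality. To read the output: of the 3750 observations at the root node, 2372 have alcohol < 10.85 and the other 1378 have alcohol >= 10.85. Each line also shows the node's deviance and its mean quality (yval); lines marked with * are terminal nodes, whose mean is the predicted quality.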
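
The printed tree can be read as a set of nested if/else rules. The sketch below hard-codes the splits and leaf means from the output above into a hypothetical helper, `predict_quality()`. It is not part of rpart and is meant only to show how one wine is routed to a leaf; `predict(m.rpart, newdata)` does this for real data.

```r
# Hand-written copy of the printed tree; thresholds and leaf means are
# taken from the m.rpart output above. predict_quality() is a made-up name.
predict_quality <- function(alcohol, volatile.acidity, free.sulfur.dioxide) {
  if (alcohol < 10.85) {
    if (volatile.acidity >= 0.3025) return(5.255814)  # node 8
    if (volatile.acidity >= 0.2275) return(5.563380)  # node 9
    return(5.971091)                                  # node 5
  }
  if (free.sulfur.dioxide < 10.5) return(5.369048)    # node 6
  if (alcohol >= 11.76667) return(6.596992)           # node 15
  if (volatile.acidity >= 0.465) return(4.545455)     # node 28
  6.202265                                            # node 29
}

# First wine shown by str(wine): alcohol 10.4, volatile acidity 0.62,
# free sulfur dioxide 6. It falls into node 8.
first_wine <- predict_quality(alcohol = 10.4,
                              volatile.acidity = 0.62,
                              free.sulfur.dioxide = 6)
first_wine
```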

get more detailed information about the tree

summary(m.rpart)
Call:
rpart(formula = quality ~ ., data = wine_train)
  n= 3750 

          CP nsplit rel error    xerror       xstd
1 0.15501053      0 1.0000000 1.0003632 0.02445718
2 0.05098911      1 0.8449895 0.8499376 0.02348303
3 0.02796998      2 0.7940004 0.8051823 0.02291590
4 0.01970128      3 0.7660304 0.7864921 0.02202804
5 0.01265926      4 0.7463291 0.7601014 0.02092604
6 0.01007193      5 0.7336698 0.7521970 0.02075419
7 0.01000000      6 0.7235979 0.7434783 0.02056119

Variable importance
             alcohol              density 
                  34                   21 
    volatile.acidity            chlorides 
                  15                   11 
total.sulfur.dioxide  free.sulfur.dioxide 
                   7                    6 
      residual.sugar            sulphates 
                   3                    1 
         citric.acid 
                   1 

Node number 1: 3750 observations,    complexity param=0.1550105
  mean=5.870933, MSE=0.7854751 
  left son=2 (2372 obs) right son=3 (1378 obs)
  Primary splits:
      alcohol              < 10.85    to the left,  improve=0.15501050, (0 missing)
      density              < 0.992035 to the right, improve=0.10915940, (0 missing)
      chlorides            < 0.0395   to the right, improve=0.07682258, (0 missing)
      total.sulfur.dioxide < 158.5    to the right, improve=0.04089663, (0 missing)
      citric.acid          < 0.235    to the left,  improve=0.03636458, (0 missing)
  Surrogate splits:
      density              < 0.991995 to the right, agree=0.869, adj=0.644, (0 split)
      chlorides            < 0.0375   to the right, agree=0.757, adj=0.339, (0 split)
      total.sulfur.dioxide < 103.5    to the right, agree=0.690, adj=0.155, (0 split)
      residual.sugar       < 5.375    to the right, agree=0.667, adj=0.094, (0 split)
      sulphates            < 0.345    to the right, agree=0.647, adj=0.038, (0 split)

Node number 2: 2372 observations,    complexity param=0.05098911
  mean=5.604975, MSE=0.5981709 
  left son=4 (1611 obs) right son=5 (761 obs)
  Primary splits:
      volatile.acidity    < 0.2275   to the right, improve=0.10585250, (0 missing)
      free.sulfur.dioxide < 13.5     to the left,  improve=0.03390500, (0 missing)
      citric.acid         < 0.235    to the left,  improve=0.03204075, (0 missing)
      alcohol             < 10.11667 to the left,  improve=0.03136524, (0 missing)
      chlorides           < 0.0585   to the right, improve=0.01633599, (0 missing)
  Surrogate splits:
      pH                   < 3.485    to the left,  agree=0.694, adj=0.047, (0 split)
      sulphates            < 0.755    to the left,  agree=0.685, adj=0.020, (0 split)
      total.sulfur.dioxide < 105.5    to the right, agree=0.683, adj=0.011, (0 split)
      residual.sugar       < 0.75     to the right, agree=0.681, adj=0.007, (0 split)
      chlorides            < 0.0285   to the right, agree=0.680, adj=0.003, (0 split)

Node number 3: 1378 observations,    complexity param=0.02796998
  mean=6.328737, MSE=0.7765472 
  left son=6 (84 obs) right son=7 (1294 obs)
  Primary splits:
      free.sulfur.dioxide  < 10.5     to the left,  improve=0.07699080, (0 missing)
      alcohol              < 11.76667 to the left,  improve=0.06210660, (0 missing)
      total.sulfur.dioxide < 67.5     to the left,  improve=0.04438619, (0 missing)
      residual.sugar       < 1.375    to the left,  improve=0.02905351, (0 missing)
      fixed.acidity        < 7.35     to the right, improve=0.02613259, (0 missing)
  Surrogate splits:
      total.sulfur.dioxide < 53.5     to the left,  agree=0.952, adj=0.214, (0 split)
      volatile.acidity     < 0.875    to the right, agree=0.940, adj=0.024, (0 split)

Node number 4: 1611 observations,    complexity param=0.01265926
  mean=5.43203, MSE=0.5098121 
  left son=8 (688 obs) right son=9 (923 obs)
  Primary splits:
      volatile.acidity    < 0.3025   to the right, improve=0.04540111, (0 missing)
      alcohol             < 10.05    to the left,  improve=0.03874403, (0 missing)
      free.sulfur.dioxide < 13.5     to the left,  improve=0.03338886, (0 missing)
      chlorides           < 0.0495   to the right, improve=0.02574623, (0 missing)
      citric.acid         < 0.195    to the left,  improve=0.02327981, (0 missing)
  Surrogate splits:
      citric.acid          < 0.215    to the left,  agree=0.633, adj=0.141, (0 split)
      free.sulfur.dioxide  < 20.5     to the left,  agree=0.600, adj=0.063, (0 split)
      chlorides            < 0.0595   to the right, agree=0.593, adj=0.047, (0 split)
      residual.sugar       < 1.15     to the left,  agree=0.583, adj=0.023, (0 split)
      total.sulfur.dioxide < 219.25   to the right, agree=0.582, adj=0.022, (0 split)

Node number 5: 761 observations
  mean=5.971091, MSE=0.5878633 

Node number 6: 84 observations
  mean=5.369048, MSE=1.137613 

Node number 7: 1294 observations,    complexity param=0.01970128
  mean=6.391036, MSE=0.6894405 
  left son=14 (629 obs) right son=15 (665 obs)
  Primary splits:
      alcohol              < 11.76667 to the left,  improve=0.06504696, (0 missing)
      chlorides            < 0.0395   to the right, improve=0.02758705, (0 missing)
      fixed.acidity        < 7.35     to the right, improve=0.02750932, (0 missing)
      pH                   < 3.055    to the left,  improve=0.02307356, (0 missing)
      total.sulfur.dioxide < 191.5    to the right, improve=0.02186818, (0 missing)
  Surrogate splits:
      density              < 0.990885 to the right, agree=0.720, adj=0.424, (0 split)
      volatile.acidity     < 0.2675   to the left,  agree=0.637, adj=0.253, (0 split)
      chlorides            < 0.0365   to the right, agree=0.630, adj=0.238, (0 split)
      residual.sugar       < 1.475    to the left,  agree=0.575, adj=0.126, (0 split)
      total.sulfur.dioxide < 128.5    to the right, agree=0.574, adj=0.124, (0 split)

Node number 8: 688 observations
  mean=5.255814, MSE=0.4054895 

Node number 9: 923 observations
  mean=5.56338, MSE=0.5471747 

Node number 14: 629 observations,    complexity param=0.01007193
  mean=6.173291, MSE=0.6838017 
  left son=28 (11 obs) right son=29 (618 obs)
  Primary splits:
      volatile.acidity     < 0.465    to the right, improve=0.06897561, (0 missing)
      total.sulfur.dioxide < 200      to the right, improve=0.04223066, (0 missing)
      residual.sugar       < 0.975    to the left,  improve=0.03061714, (0 missing)
      fixed.acidity        < 7.35     to the right, improve=0.02978501, (0 missing)
      sulphates            < 0.575    to the left,  improve=0.02165970, (0 missing)
  Surrogate splits:
      citric.acid          < 0.045    to the left,  agree=0.986, adj=0.182, (0 split)
      total.sulfur.dioxide < 279.25   to the right, agree=0.986, adj=0.182, (0 split)

Node number 15: 665 observations
  mean=6.596992, MSE=0.6075098 

Node number 28: 11 observations
  mean=4.545455, MSE=0.9752066 

Node number 29: 618 observations
  mean=6.202265, MSE=0.6306098 
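
Several numbers in the summary can be reproduced from the deviances printed in the basic tree listing. The sketch below redoes that arithmetic for the root split, using values copied from the output above (the variable names are chosen here for illustration): a node's MSE is its deviance divided by its number of observations, and `improve` is the fraction of the parent's deviance removed by the split.

```r
# Values copied from the printed tree
root_dev  <- 2945.532   # deviance at node 1 (root)
root_n    <- 3750       # observations at the root
left_dev  <- 1418.861   # deviance at node 2 (alcohol <  10.85)
right_dev <- 1070.082   # deviance at node 3 (alcohol >= 10.85)

# MSE reported for the root: deviance per observation
root_mse <- root_dev / root_n

# improve for the root split: share of deviance removed by splitting
improve <- 1 - (left_dev + right_dev) / root_dev

# rel error after one split in the CP table: share of deviance remaining
rel_error_1 <- 1 - improve

round(c(root_mse = root_mse, improve = improve, rel_error_1 = rel_error_1), 4)
```

Up to rounding of the printed deviances, these agree with the summary: MSE=0.7854751, improve=0.15501050 (also the CP of the first row), and rel error 0.8449895 at nsplit 1.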

use the rpart.plot package to create a visualization

library(rpart.plot)
package ‘rpart.plot’ was built under R version 3.3.2
Loading required package: rpart

a basic decision tree diagram

rpart.plot(m.rpart, digits = 3)

a few adjustments to the diagram

rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

Step 4: Evaluate model performance

generate predictions for the testing dataset

p.rpart <- predict(m.rpart, wine_test)
p.rpart
    3751     3752     3753     3754     3755     3756 
6.596992 5.255814 6.202265 5.971091 5.563380 6.596992 
    3757     3758     3759     3760     3761     3762 
5.255814 5.255814 6.596992 5.563380 5.563380 5.563380 
    3763     3764     3765     3766     3767     3768 
5.255814 5.971091 5.255814 5.369048 6.596992 5.563380 
    3769     3770     3771     3772     3773     3774 
5.255814 5.971091 5.971091 5.971091 5.563380 5.255814 
    3775     3776     3777     3778     3779     3780 
5.971091 5.563380 5.255814 5.255814 6.202265 6.202265 
    3781     3782     3783     3784     3785     3786 
5.255814 5.971091 5.255814 6.202265 6.202265 5.971091 
    3787     3788     3789     3790     3791     3792 
5.255814 6.202265 6.596992 6.202265 5.971091 5.563380 
    3793     3794     3795     3796     3797     3798 
6.202265 5.971091 5.563380 5.563380 5.563380 6.596992 
    3799     3800     3801     3802     3803     3804 
5.255814 6.202265 6.596992 5.563380 6.202265 6.202265 
    3805     3806     3807     3808     3809     3810 
6.596992 6.202265 6.202265 6.596992 4.545455 5.563380 
    3811     3812     3813     3814     3815     3816 
5.255814 5.563380 5.563380 5.563380 5.563380 5.255814 
    3817     3818     3819     3820     3821     3822 
5.563380 5.563380 5.971091 6.202265 6.596992 6.202265 
    3823     3824     3825     3826     3827     3828 
5.255814 5.255814 5.971091 6.596992 5.971091 5.563380 
    3829     3830     3831     3832     3833     3834 
5.971091 5.255814 5.255814 5.971091 5.563380 6.202265 
    3835     3836     3837     3838     3839     3840 
5.563380 5.971091 5.369048 6.596992 5.971091 5.563380 
    3841     3842     3843     3844     3845     3846 
6.596992 5.971091 5.255814 6.202265 5.971091 5.255814 
    3847     3848     3849     3850     3851     3852 
5.563380 6.596992 6.202265 5.255814 6.202265 6.596992 
    3853     3854     3855     3856     3857     3858 
5.971091 5.563380 5.971091 6.202265 5.255814 6.202265 
    3859     3860     3861     3862     3863     3864 
5.255814 6.202265 6.202265 5.563380 5.971091 6.202265 
    3865     3866     3867     3868     3869     3870 
5.255814 5.369048 5.255814 5.563380 5.971091 6.202265 
    3871     3872     3873     3874     3875     3876 
5.971091 5.971091 5.255814 5.563380 5.563380 6.202265 
    3877     3878     3879     3880     3881     3882 
5.255814 5.255814 5.255814 5.563380 6.202265 5.971091 
    3883     3884     3885     3886     3887     3888 
5.971091 5.255814 5.563380 6.202265 5.255814 5.971091 
    3889     3890     3891     3892     3893     3894 
5.563380 5.971091 6.202265 5.563380 5.563380 6.202265 
    3895     3896     3897     3898     3899     3900 
5.971091 6.202265 5.971091 6.596992 5.255814 5.255814 
    3901     3902     3903     3904     3905     3906 
5.563380 5.971091 5.255814 5.563380 6.596992 6.596992 
    3907     3908     3909     3910     3911     3912 
5.971091 6.596992 5.563380 5.971091 6.202265 5.563380 
    3913     3914     3915     3916     3917     3918 
5.971091 6.202265 5.563380 6.202265 6.202265 5.255814 
    3919     3920     3921     3922     3923     3924 
5.971091 5.255814 5.971091 5.971091 5.971091 5.971091 
    3925     3926     3927     3928     3929     3930 
6.596992 5.255814 5.255814 5.971091 6.596992 5.563380 
    3931     3932     3933     3934     3935     3936 
5.563380 5.971091 5.255814 5.971091 6.202265 5.971091 
    3937     3938     3939     3940     3941     3942 
5.255814 6.596992 5.563380 6.596992 5.255814 5.563380 
    3943     3944     3945     3946     3947     3948 
5.563380 6.596992 6.202265 6.596992 5.563380 6.202265 
    3949     3950     3951     3952     3953     3954 
6.596992 6.596992 6.596992 6.202265 6.596992 5.971091 
    3955     3956     3957     3958     3959     3960 
5.971091 6.596992 5.563380 6.202265 6.202265 6.596992 
    3961     3962     3963     3964     3965     3966 
5.255814 5.255814 5.255814 6.202265 5.255814 6.202265 
    3967     3968     3969     3970     3971     3972 
5.563380 5.255814 5.971091 5.971091 6.202265 6.202265 
    3973     3974     3975     3976     3977     3978 
5.255814 5.255814 6.596992 5.971091 5.255814 5.971091 
    3979     3980     3981     3982     3983     3984 
5.563380 5.971091 6.596992 5.971091 5.971091 5.971091 
    3985     3986     3987     3988     3989     3990 
6.202265 5.255814 6.202265 5.971091 5.563380 6.202265 
    3991     3992     3993     3994     3995     3996 
5.563380 5.971091 5.563380 5.563380 5.971091 5.563380 
    3997     3998     3999     4000     4001     4002 
6.596992 5.971091 6.596992 6.596992 5.971091 5.255814 
    4003     4004     4005     4006     4007     4008 
5.563380 6.202265 5.563380 5.563380 6.596992 6.596992 
    4009     4010     4011     4012     4013     4014 
5.971091 6.202265 6.202265 5.563380 5.563380 5.563380 
    4015     4016     4017     4018     4019     4020 
6.202265 6.202265 6.202265 5.255814 5.563380 5.971091 
    4021     4022     4023     4024     4025     4026 
6.202265 5.971091 6.596992 5.563380 5.255814 6.596992 
    4027     4028     4029     4030     4031     4032 
5.255814 6.202265 5.971091 6.202265 5.971091 5.971091 
    4033     4034     4035     4036     4037     4038 
5.971091 5.563380 5.255814 5.255814 5.255814 5.563380 
    4039     4040     4041     4042     4043     4044 
5.563380 5.563380 5.563380 5.369048 6.202265 5.563380 
    4045     4046     4047     4048     4049     4050 
6.596992 6.596992 6.596992 5.971091 6.202265 6.202265 
    4051     4052     4053     4054     4055     4056 
6.596992 6.596992 5.255814 5.971091 5.255814 5.971091 
    4057     4058     4059     4060     4061     4062 
5.971091 5.971091 5.971091 5.255814 5.971091 6.202265 
    4063     4064     4065     4066     4067     4068 
5.563380 6.596992 5.971091 5.971091 5.255814 5.563380 
    4069     4070     4071     4072     4073     4074 
5.255814 6.596992 5.563380 5.563380 5.563380 5.563380 
    4075     4076     4077     4078     4079     4080 
5.563380 6.596992 5.255814 6.202265 5.971091 6.596992 
    4081     4082     4083     4084     4085     4086 
5.563380 5.971091 5.971091 5.255814 5.563380 5.563380 
    4087     4088     4089     4090     4091     4092 
5.971091 6.596992 5.971091 6.202265 6.596992 5.563380 
    4093     4094     4095     4096     4097     4098 
6.596992 5.563380 6.202265 5.255814 5.255814 6.202265 
    4099     4100     4101     4102     4103     4104 
5.563380 6.202265 5.971091 6.596992 5.255814 6.202265 
    4105     4106     4107     4108     4109     4110 
6.596992 6.596992 5.255814 6.596992 5.563380 6.202265 
    4111     4112     4113     4114     4115     4116 
5.255814 5.563380 6.596992 6.202265 6.596992 5.971091 
    4117     4118     4119     4120     4121     4122 
5.563380 5.255814 5.971091 6.202265 5.563380 6.596992 
    4123     4124     4125     4126     4127     4128 
5.255814 5.971091 6.596992 6.596992 5.255814 5.369048 
    4129     4130     4131     4132     4133     4134 
5.255814 5.563380 6.202265 6.202265 5.255814 6.202265 
    4135     4136     4137     4138     4139     4140 
5.255814 5.563380 5.255814 6.202265 5.563380 5.255814 
    4141     4142     4143     4144     4145     4146 
6.596992 5.971091 5.369048 5.563380 5.971091 6.596992 
    4147     4148     4149     4150     4151     4152 
6.596992 6.202265 5.971091 5.255814 6.596992 5.971091 
    4153     4154     4155     4156     4157     4158 
5.971091 5.971091 5.971091 6.596992 5.563380 5.971091 
    4159     4160     4161     4162     4163     4164 
5.563380 6.596992 6.596992 5.563380 5.255814 6.202265 
    4165     4166     4167     4168     4169     4170 
6.596992 6.202265 6.596992 6.202265 5.971091 5.563380 
    4171     4172     4173     4174     4175     4176 
6.596992 5.255814 5.563380 5.255814 5.563380 6.596992 
    4177     4178     4179     4180     4181     4182 
6.596992 5.563380 6.202265 6.596992 5.255814 5.563380 
    4183     4184     4185     4186     4187     4188 
5.255814 5.563380 6.596992 5.255814 6.596992 6.596992 
    4189     4190     4191     4192     4193     4194 
6.202265 6.596992 5.563380 5.563380 5.563380 5.971091 
    4195     4196     4197     4198     4199     4200 
6.202265 6.596992 6.596992 5.971091 5.255814 6.596992 
    4201     4202     4203     4204     4205     4206 
5.255814 6.202265 5.255814 5.563380 6.202265 6.596992 
    4207     4208     4209     4210     4211     4212 
5.255814 5.563380 5.971091 5.563380 5.971091 5.971091 
    4213     4214     4215     4216     4217     4218 
5.563380 5.255814 5.563380 5.563380 5.971091 5.255814 
    4219     4220     4221     4222     4223     4224 
6.202265 6.596992 5.563380 5.563380 5.971091 5.971091 
    4225     4226     4227     4228     4229     4230 
6.202265 5.563380 6.596992 6.596992 5.255814 6.202265 
    4231     4232     4233     4234     4235     4236 
6.202265 5.563380 5.971091 6.202265 5.563380 6.202265 
    4237     4238     4239     4240     4241     4242 
5.563380 6.202265 5.563380 5.255814 6.596992 5.563380 
    4243     4244     4245     4246     4247     4248 
5.971091 5.563380 6.596992 5.255814 5.255814 5.255814 
    4249     4250     4251     4252     4253     4254 
5.971091 6.202265 5.255814 5.563380 6.596992 5.971091 
    4255     4256     4257     4258     4259     4260 
5.255814 6.596992 5.369048 6.202265 5.255814 6.596992 
    4261     4262     4263     4264     4265     4266 
5.255814 6.596992 6.202265 6.202265 5.563380 6.596992 
    4267     4268     4269     4270     4271     4272 
5.563380 5.971091 5.563380 5.369048 5.563380 5.255814 
    4273     4274     4275     4276     4277     4278 
6.202265 6.202265 5.563380 5.563380 6.596992 6.596992 
    4279     4280     4281     4282     4283     4284 
6.202265 5.971091 6.596992 5.971091 5.971091 6.596992 
    4285     4286     4287     4288     4289     4290 
5.563380 5.563380 5.971091 5.563380 5.971091 6.596992 
    4291     4292     4293     4294     4295     4296 
6.202265 5.255814 6.202265 5.971091 5.255814 6.596992 
    4297     4298     4299     4300     4301     4302 
6.596992 5.563380 6.202265 6.596992 5.563380 5.971091 
    4303     4304     4305     4306     4307     4308 
5.971091 5.971091 5.255814 5.255814 6.202265 5.971091 
    4309     4310     4311     4312     4313     4314 
5.971091 6.596992 6.596992 6.596992 5.563380 5.255814 
    4315     4316     4317     4318     4319     4320 
6.596992 6.202265 6.202265 5.255814 6.202265 5.255814 
    4321     4322     4323     4324     4325     4326 
6.202265 6.202265 5.971091 6.202265 5.563380 6.202265 
    4327     4328     4329     4330     4331     4332 
5.971091 5.563380 6.596992 5.255814 6.202265 5.255814 
    4333     4334     4335     4336     4337     4338 
5.255814 6.202265 5.563380 5.971091 6.596992 5.563380 
    4339     4340     4341     4342     4343     4344 
5.255814 5.971091 5.255814 6.202265 5.563380 6.202265 
    4345     4346     4347     4348     4349     4350 
6.202265 6.596992 5.971091 5.255814 5.255814 5.971091 
    4351     4352     4353     4354     4355     4356 
6.202265 6.596992 5.255814 5.971091 5.971091 5.563380 
    4357     4358     4359     4360     4361     4362 
5.971091 6.202265 5.563380 5.255814 5.255814 5.255814 
    4363     4364     4365     4366     4367     4368 
5.971091 6.202265 5.255814 6.596992 5.563380 5.971091 
    4369     4370     4371     4372     4373     4374 
5.971091 6.596992 5.563380 5.563380 5.255814 6.596992 
    4375     4376     4377     4378     4379     4380 
5.563380 5.971091 6.596992 5.563380 6.202265 6.202265 
    4381     4382     4383     4384     4385     4386 
5.563380 6.596992 6.596992 5.563380 5.971091 6.596992 
    4387     4388     4389     4390     4391     4392 
5.369048 6.202265 6.202265 6.202265 5.563380 5.255814 
    4393     4394     4395     4396     4397     4398 
5.563380 6.596992 5.971091 6.202265 5.563380 6.202265 
    4399     4400     4401     4402     4403     4404 
6.202265 5.563380 5.563380 5.971091 6.596992 5.255814 
    4405     4406     4407     4408     4409     4410 
5.563380 6.596992 6.202265 5.563380 5.971091 6.202265 
    4411     4412     4413     4414     4415     4416 
5.563380 5.255814 5.255814 5.971091 6.596992 5.255814 
    4417     4418     4419     4420     4421     4422 
6.596992 5.255814 5.255814 5.255814 6.596992 5.971091 
    4423     4424     4425     4426     4427     4428 
5.563380 6.596992 5.255814 6.596992 5.255814 5.255814 
    4429     4430     4431     4432     4433     4434 
6.202265 5.563380 5.255814 5.563380 5.971091 6.596992 
    4435     4436     4437     4438     4439     4440 
6.596992 5.255814 6.596992 6.596992 5.971091 5.255814 
    4441     4442     4443     4444     4445     4446 
5.563380 5.563380 5.563380 5.563380 5.971091 6.596992 
    4447     4448     4449     4450     4451     4452 
5.971091 5.563380 5.971091 5.563380 5.255814 5.563380 
    4453     4454     4455     4456     4457     4458 
6.202265 6.202265 6.596992 6.596992 5.971091 5.971091 
    4459     4460     4461     4462     4463     4464 
6.202265 5.563380 5.971091 5.971091 6.202265 5.563380 
    4465     4466     4467     4468     4469     4470 
6.596992 5.971091 6.202265 6.596992 6.202265 5.563380 
    4471     4472     4473     4474     4475     4476 
5.563380 5.563380 5.971091 5.255814 6.202265 5.563380 
    4477     4478     4479     4480     4481     4482 
6.596992 5.971091 6.596992 5.971091 5.255814 6.596992 
    4483     4484     4485     4486     4487     4488 
5.971091 5.563380 6.202265 6.596992 6.596992 6.202265 
    4489     4490     4491     4492     4493     4494 
5.255814 6.596992 6.202265 5.971091 6.596992 6.596992 
    4495     4496     4497     4498     4499     4500 
5.971091 5.563380 5.971091 5.971091 6.596992 6.202265 
    4501     4502     4503     4504     4505     4506 
6.202265 6.596992 6.596992 5.255814 6.596992 5.563380 
    4507     4508     4509     4510     4511     4512 
6.202265 5.971091 6.596992 5.971091 5.563380 5.563380 
    4513     4514     4515     4516     4517     4518 
5.563380 6.202265 6.596992 5.563380 6.596992 5.255814 
    4519     4520     4521     4522     4523     4524 
6.596992 6.202265 5.563380 5.971091 6.202265 6.596992 
    4525     4526     4527     4528     4529     4530 
5.563380 5.255814 5.563380 5.971091 6.202265 6.596992 
    4531     4532     4533     4534     4535     4536 
5.563380 5.971091 5.563380 5.563380 5.971091 5.255814 
    4537     4538     4539     4540     4541     4542 
6.596992 5.255814 6.202265 5.971091 6.596992 6.596992 
    4543     4544     4545     4546     4547     4548 
5.563380 5.971091 6.202265 5.563380 6.596992 5.971091 
    4549     4550     4551     4552     4553     4554 
6.202265 5.255814 6.596992 5.563380 6.596992 5.971091 
    4555     4556     4557     4558     4559     4560 
6.596992 6.596992 6.596992 5.255814 5.971091 5.255814 
    4561     4562     4563     4564     4565     4566 
5.563380 5.255814 6.596992 6.596992 6.596992 5.369048 
    4567     4568     4569     4570     4571     4572 
6.596992 5.255814 5.255814 5.563380 6.202265 6.202265 
    4573     4574     4575     4576     4577     4578 
6.596992 5.255814 6.596992 6.596992 5.971091 5.255814 
    4579     4580     4581     4582     4583     4584 
5.255814 5.971091 5.971091 5.971091 5.971091 6.202265 
    4585     4586     4587     4588     4589     4590 
5.255814 5.255814 5.563380 6.596992 5.255814 5.971091 
    4591     4592     4593     4594     4595     4596 
6.596992 5.255814 5.255814 6.596992 5.971091 5.563380 
    4597     4598     4599     4600     4601     4602 
5.971091 6.202265 5.255814 5.563380 5.255814 5.255814 
    4603     4604     4605     4606     4607     4608 
5.971091 5.971091 5.255814 6.202265 5.971091 5.563380 
    4609     4610     4611     4612     4613     4614 
6.202265 6.202265 6.202265 6.596992 6.596992 5.563380 
    4615     4616     4617     4618     4619     4620 
6.596992 5.563380 5.971091 5.971091 6.596992 6.596992 
    4621     4622     4623     4624     4625     4626 
5.563380 5.971091 5.971091 6.202265 6.202265 5.255814 
    4627     4628     4629     4630     4631     4632 
6.596992 6.596992 5.563380 5.255814 5.255814 5.563380 
    4633     4634     4635     4636     4637     4638 
6.202265 6.202265 5.255814 5.255814 6.596992 5.563380 
    4639     4640     4641     4642     4643     4644 
5.255814 6.202265 6.202265 5.563380 6.202265 5.563380 
    4645     4646     4647     4648     4649     4650 
5.255814 5.971091 5.971091 5.255814 5.971091 5.563380 
    4651     4652     4653     4654     4655     4656 
5.971091 5.563380 5.563380 6.202265 5.563380 5.563380 
    4657     4658     4659     4660     4661     4662 
5.563380 6.596992 5.563380 6.202265 5.563380 6.202265 
    4663     4664     4665     4666     4667     4668 
5.255814 5.255814 6.596992 5.563380 5.971091 5.971091 
    4669     4670     4671     4672     4673     4674 
5.255814 5.563380 5.971091 5.971091 5.255814 6.596992 
    4675     4676     4677     4678     4679     4680 
5.971091 5.971091 5.971091 5.255814 6.596992 5.971091 
    4681     4682     4683     4684     4685     4686 
5.563380 6.202265 5.255814 6.596992 5.255814 5.563380 
    4687     4688     4689     4690     4691     4692 
5.563380 6.202265 5.971091 5.255814 5.255814 5.563380 
    4693     4694     4695     4696     4697     4698 
5.563380 6.202265 6.202265 5.971091 5.563380 6.596992 
    4699     4700     4701     4702     4703     4704 
6.596992 5.971091 6.202265 6.596992 5.563380 5.369048 
    4705     4706     4707     4708     4709     4710 
5.255814 5.255814 6.202265 5.255814 6.202265 5.563380 
    4711     4712     4713     4714     4715     4716 
5.255814 6.202265 6.596992 5.971091 5.971091 6.596992 
    4717     4718     4719     4720     4721     4722 
6.202265 5.255814 5.563380 6.202265 6.202265 5.971091 
    4723     4724     4725     4726     4727     4728 
5.563380 5.255814 5.369048 5.563380 6.596992 6.202265 
    4729     4730     4731     4732     4733     4734 
5.563380 6.596992 5.563380 5.971091 5.255814 5.563380 
    4735     4736     4737     4738     4739     4740 
6.596992 6.202265 5.971091 5.971091 6.596992 5.563380 
    4741     4742     4743     4744     4745     4746 
5.971091 5.255814 5.563380 5.563380 5.255814 6.596992 
    4747     4748     4749     4750 
5.255814 6.202265 5.255814 6.596992 
 [ reached getOption("max.print") -- omitted 148 entries ]

compare the distribution of predicted values vs. actual values

summary(p.rpart)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.545   5.563   5.971   5.893   6.202   6.597 
summary(wine_test$quality)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   5.000   6.000   5.901   6.000   9.000 
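
The predictions span a much narrower range (4.545 to 6.597) than the actual scores (3 to 9). One reason is structural: a regression tree can only predict the mean of one of its terminal nodes, and this tree has seven. The sketch below illustrates the same effect on R's built-in `mtcars` data, used here only as a stand-in so the example runs without the wine file.

```r
library(rpart)

# Fit a regression tree on a built-in dataset (stand-in for the wine data)
fit  <- rpart(mpg ~ ., data = mtcars)
pred <- predict(fit, mtcars)

# Terminal nodes are the rows of fit$frame labelled "<leaf>"
n_leaves   <- sum(fit$frame$var == "<leaf>")
n_distinct <- length(unique(pred))

n_leaves          # number of terminal nodes
n_distinct        # number of distinct predicted values
range(pred)       # leaf means stay inside the observed range
range(mtcars$mpg)
```

Because every prediction is an average over a leaf, the extremes of the target are pulled toward the middle, which matches the pattern in the two summaries above.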

compare the correlation

cor(p.rpart, wine_test$quality) # about 0.54: a moderate positive association between predicted and actual quality
[1] 0.5369525
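
Correlation measures how closely predictions track the actual values, not how far off they are. As a small illustration with made-up numbers, predictions that are always exactly two points too high still correlate perfectly with the truth:

```r
# Made-up quality scores and predictions that are uniformly 2 points too high
actual  <- c(5, 6, 6, 4, 7, 5, 8, 6)
shifted <- actual + 2

# Same definition of mean absolute error as used in this notebook
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

cor_shifted <- cor(shifted, actual)  # perfect linear association
mae_shifted <- MAE(actual, shifted)  # yet every prediction misses by 2

cor_shifted
mae_shifted
```

This is why an error measure such as the mean absolute error is worth checking alongside the correlation.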

function to calculate the mean absolute error

MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}

mean absolute error between predicted and actual values

MAE(wine_test$quality, p.rpart)
[1] 0.5872652

This implies that, on average, the difference between our model’s predictions and the true quality score was about 0.59. On a quality scale from zero to 10, this seems to suggest that our model is doing fairly well.

mean absolute error between actual values and mean value

mean(wine_train$quality) # result = 5.87
[1] 5.870933
MAE(5.87, wine_test$quality)
[1] 0.6722474

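If we predicted the value 5.87 for every wine sample, we would have a mean absolute error of only about 0.67. The regression tree (MAE of about 0.59) comes closer to the true quality score than this baseline, but not by much, so there is room for improvement.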
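
To put the two errors side by side, the sketch below computes the absolute and relative reduction in MAE, using the two values printed above (the variable names are chosen here for illustration).

```r
# MAE values copied from the output above
mae_tree     <- 0.5872652  # regression tree on the test set
mae_baseline <- 0.6722474  # always predicting the training mean, 5.87

# Absolute and relative reduction in error achieved by the tree
abs_gain <- mae_baseline - mae_tree
rel_gain <- abs_gain / mae_baseline

round(abs_gain, 3)        # about 0.085 quality points
round(100 * rel_gain, 1)  # about 12.6 percent
```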

Step 5: Improving model performance using RWeka package

train an M5’ model tree (requires the RWeka package, which was not installed when this notebook was run, so the steps below show no output)

library(RWeka)
Error in library(RWeka) : there is no package called ‘RWeka’
m.m5p <- M5P(quality ~ ., data = wine_train)

display the tree

m.m5p

get a summary of the model’s performance

summary(m.m5p)

generate predictions for the model

p.m5p <- predict(m.m5p, wine_test)

summary statistics about the predictions

summary(p.m5p)

correlation between the predicted and true values

cor(p.m5p, wine_test$quality)

mean absolute error of predicted and true values

(uses a custom function defined above)

MAE(wine_test$quality, p.m5p)
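
The error above shows that RWeka was not installed when this notebook was run, so the Step 5 commands produced no output. A minimal sketch for checking availability before running them is shown below. The variable name `has_rweka` is our own, and the mention of Java reflects RWeka's dependency on the rJava package.

```r
# Check for RWeka without raising an error when it is absent
has_rweka <- requireNamespace("RWeka", quietly = TRUE)

if (has_rweka) {
  message("RWeka is available; the Step 5 code can be run.")
} else {
  message("RWeka is not installed; try install.packages(\"RWeka\") ",
          "(a Java runtime is also needed).")
}
```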