The training dataset contains 12795 rows and 16 columns related to properties of commercial wines. A summary and boxplots of all variables suggest some columns have missing data, and some may contain outliers. The variable histogram suggests normal data distribution.
A visual correlation matrix reveals little relation among independent variables. The predictors LabelAppeal and STARS do have some connections with the target dependent variable.
dim(df.train.orig)
## [1] 12795 16
str(df.train.orig)
## 'data.frame': 12795 obs. of 16 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
Displaying first few rows of the dataset
head(df.train.orig)
## INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1 1 3 3.2 1.160 -0.98 54.2
## 2 2 3 4.5 0.160 -0.81 26.1
## 3 4 5 7.1 2.640 -0.88 14.8
## 4 5 3 5.7 0.385 0.04 18.8
## 5 6 4 8.0 0.330 -1.26 9.4
## 6 7 0 11.3 0.320 0.59 2.2
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates
## 1 -0.567 NA 268 0.99280 3.33 -0.59
## 2 -0.425 15 -327 1.02792 3.38 0.70
## 3 0.037 214 142 0.99518 3.12 0.48
## 4 -0.425 22 115 0.99640 2.24 1.83
## 5 NA -167 108 0.99457 3.12 1.77
## 6 0.556 -37 15 0.99940 3.20 1.29
## Alcohol LabelAppeal AcidIndex STARS
## 1 9.9 0 8 2
## 2 NA -1 7 3
## 3 22.0 -1 8 3
## 4 6.2 -1 6 1
## 5 13.7 0 9 2
## 6 15.4 0 11 NA
summary(df.train.orig)
## INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
Box plots showing graphically the summaries of the variables
Frequency plots of the variables showing nearly normal distribution
aggr_plot <- aggr(df.train.orig, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df.train.orig), cex.axis=.7, gap=5, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## STARS 0.26252442
## Sulphates 0.09456819
## TotalSulfurDioxide 0.05330207
## Alcohol 0.05103556
## FreeSulfurDioxide 0.05056663
## Chlorides 0.04986323
## ResidualSugar 0.04814381
## pH 0.03087143
## INDEX 0.00000000
## TARGET 0.00000000
## FixedAcidity 0.00000000
## VolatileAcidity 0.00000000
## CitricAcid 0.00000000
## Density 0.00000000
## LabelAppeal 0.00000000
## AcidIndex 0.00000000
Use the MICE package to impute missing data. The resulting density plot suggest the imputed values are reasonable compared to the existing distribution.
tempData <- mice(df.train.orig,m=1,maxit=1)
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
densityplot(tempData)
df.train.nomissing <- complete(tempData)
Any outliers outside of lower 1.5IQR would be capped at 5th %ile, and observations above the upper 1.5IQR would be capped at 95th %ile.
df.train.nooutliers <- df.train.nomissing
id <- c(2:16)
for (val in id) {
qnt <- quantile(df.train.nooutliers[,val], probs=c(.25, .75), na.rm = T)
caps <- quantile(df.train.nooutliers[,val], probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(df.train.nooutliers[,val], na.rm = T)
df.train.nooutliers[,val][df.train.nooutliers[,val] < (qnt[1] - H)] <- caps[1]
df.train.nooutliers[,val][df.train.nooutliers[,val] > (qnt[2] + H)] <- caps[2]
}
df.train.transformed <- df.train.nooutliers
Split the training data set such that 80% of the observations will be used to train the model and 20% will be used to evaluate.
Evaluate performance of models based on comparing the predicted target values with holdout’s mean and standard error below.
## [1] 3.026417
## [1] 0.01698166
p1 <- glm(TARGET ~ . -INDEX, data=ntraining, family="poisson")
# summary of the model
summary(p1)
##
## Call:
## glm(formula = TARGET ~ . - INDEX, family = "poisson", data = ntraining)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9133 -0.6860 0.1377 0.6367 2.5006
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.447e+00 2.238e-01 6.463 1.03e-10 ***
## FixedAcidity -3.288e-04 1.034e-03 -0.318 0.750568
## VolatileAcidity -4.153e-02 8.097e-03 -5.129 2.91e-07 ***
## CitricAcid 1.421e-02 7.376e-03 1.927 0.053993 .
## ResidualSugar 7.329e-05 1.784e-04 0.411 0.681215
## Chlorides -3.590e-02 1.909e-02 -1.881 0.060003 .
## FreeSulfurDioxide 1.363e-04 3.905e-05 3.491 0.000481 ***
## TotalSulfurDioxide 1.161e-04 2.867e-05 4.049 5.15e-05 ***
## Density -2.411e-01 2.182e-01 -1.105 0.269252
## pH -1.748e-02 9.625e-03 -1.816 0.069398 .
## Sulphates -1.121e-02 6.704e-03 -1.673 0.094378 .
## Alcohol 5.051e-03 1.774e-03 2.847 0.004415 **
## LabelAppeal 1.417e-01 6.821e-03 20.769 < 2e-16 ***
## AcidIndex -1.054e-01 5.804e-03 -18.158 < 2e-16 ***
## STARS 3.391e-01 6.286e-03 53.940 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18262 on 10235 degrees of freedom
## Residual deviance: 12897 on 10221 degrees of freedom
## AIC: 38456
##
## Number of Fisher Scoring iterations: 5
Evaluate the model’s mean and standard error using holdout dataset:
mean(floor(predict(p1, ntesting, type = "response")))
## [1] 2.51739
std.error(floor(predict(p1, ntesting, type = "response")))
## [1] 0.01219045
p2 <- glm(TARGET ~ . -INDEX -FixedAcidity -ResidualSugar -Density,
data=ntraining, family="poisson")
summary(p2)
##
## Call:
## glm(formula = TARGET ~ . - INDEX - FixedAcidity - ResidualSugar -
## Density, family = "poisson", data = ntraining)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9317 -0.6870 0.1380 0.6371 2.5059
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.209e+00 6.177e-02 19.568 < 2e-16 ***
## VolatileAcidity -4.163e-02 8.097e-03 -5.142 2.72e-07 ***
## CitricAcid 1.418e-02 7.376e-03 1.923 0.054466 .
## Chlorides -3.660e-02 1.908e-02 -1.918 0.055066 .
## FreeSulfurDioxide 1.362e-04 3.905e-05 3.487 0.000488 ***
## TotalSulfurDioxide 1.158e-04 2.865e-05 4.044 5.26e-05 ***
## pH -1.751e-02 9.622e-03 -1.820 0.068771 .
## Sulphates -1.111e-02 6.701e-03 -1.658 0.097321 .
## Alcohol 5.041e-03 1.774e-03 2.841 0.004494 **
## LabelAppeal 1.418e-01 6.820e-03 20.788 < 2e-16 ***
## AcidIndex -1.059e-01 5.730e-03 -18.477 < 2e-16 ***
## STARS 3.392e-01 6.284e-03 53.986 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18262 on 10235 degrees of freedom
## Residual deviance: 12898 on 10224 degrees of freedom
## AIC: 38451
##
## Number of Fisher Scoring iterations: 5
Evaluate the model’s mean and standard error using holdout dataset:
mean(floor(predict(p2, ntesting, type = "response")))
## [1] 2.517155
std.error(floor(predict(p2, ntesting, type = "response")))
## [1] 0.01218848
nb1 <- glm.nb(TARGET ~ . -INDEX, data=ntraining)
summary(nb1)
##
## Call:
## glm.nb(formula = TARGET ~ . - INDEX, data = ntraining, init.theta = 48733.23342,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9132 -0.6860 0.1377 0.6367 2.5006
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.447e+00 2.238e-01 6.463 1.03e-10 ***
## FixedAcidity -3.288e-04 1.034e-03 -0.318 0.750561
## VolatileAcidity -4.153e-02 8.097e-03 -5.129 2.91e-07 ***
## CitricAcid 1.421e-02 7.376e-03 1.927 0.054000 .
## ResidualSugar 7.330e-05 1.784e-04 0.411 0.681209
## Chlorides -3.590e-02 1.909e-02 -1.881 0.060006 .
## FreeSulfurDioxide 1.363e-04 3.906e-05 3.491 0.000481 ***
## TotalSulfurDioxide 1.161e-04 2.867e-05 4.049 5.15e-05 ***
## Density -2.411e-01 2.182e-01 -1.105 0.269264
## pH -1.748e-02 9.626e-03 -1.816 0.069396 .
## Sulphates -1.121e-02 6.704e-03 -1.673 0.094383 .
## Alcohol 5.051e-03 1.774e-03 2.847 0.004416 **
## LabelAppeal 1.417e-01 6.822e-03 20.768 < 2e-16 ***
## AcidIndex -1.054e-01 5.805e-03 -18.158 < 2e-16 ***
## STARS 3.391e-01 6.287e-03 53.938 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(48733.23) family taken to be 1)
##
## Null deviance: 18262 on 10235 degrees of freedom
## Residual deviance: 12896 on 10221 degrees of freedom
## AIC: 38458
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 48733
## Std. Err.: 63499
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -38425.91
Evaluate the model’s mean and standard error using holdout dataset:
mean(floor(predict(nb1, ntesting, type = "response")))
## [1] 2.51739
std.error(floor(predict(nb1, ntesting, type = "response")))
## [1] 0.01219045
nb2 <- glm.nb(TARGET ~ . -INDEX -FixedAcidity -ResidualSugar -Density,
data=ntraining)
summary(nb2)
##
## Call:
## glm.nb(formula = TARGET ~ . - INDEX - FixedAcidity - ResidualSugar -
## Density, data = ntraining, init.theta = 48721.75002, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.932 -0.687 0.138 0.637 2.506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.209e+00 6.177e-02 19.567 < 2e-16 ***
## VolatileAcidity -4.164e-02 8.097e-03 -5.142 2.72e-07 ***
## CitricAcid 1.418e-02 7.376e-03 1.923 0.054472 .
## Chlorides -3.660e-02 1.908e-02 -1.918 0.055069 .
## FreeSulfurDioxide 1.362e-04 3.905e-05 3.487 0.000488 ***
## TotalSulfurDioxide 1.158e-04 2.865e-05 4.044 5.26e-05 ***
## pH -1.751e-02 9.623e-03 -1.820 0.068770 .
## Sulphates -1.111e-02 6.701e-03 -1.658 0.097326 .
## Alcohol 5.041e-03 1.774e-03 2.841 0.004496 **
## LabelAppeal 1.418e-01 6.820e-03 20.787 < 2e-16 ***
## AcidIndex -1.059e-01 5.731e-03 -18.477 < 2e-16 ***
## STARS 3.392e-01 6.284e-03 53.985 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(48721.75) family taken to be 1)
##
## Null deviance: 18262 on 10235 degrees of freedom
## Residual deviance: 12898 on 10224 degrees of freedom
## AIC: 38453
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 48722
## Std. Err.: 63485
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -38427.42
Evaluate the model’s mean and standard error using holdout dataset:
mean(floor(predict(nb2, ntesting, type = "response")))
## [1] 2.517155
std.error(floor(predict(nb2, ntesting, type = "response")))
## [1] 0.01218848
zi1 <- zeroinfl(TARGET ~ . -INDEX |STARS, data=ntraining, dist="negbin")
summary(zi1)
##
## Call:
## zeroinfl(formula = TARGET ~ . - INDEX | STARS, data = ntraining,
## dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.12705 -0.49934 0.06438 0.49034 2.08559
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.520e+00 2.311e-01 6.576 4.82e-11 ***
## FixedAcidity 2.373e-04 1.066e-03 0.223 0.8238
## VolatileAcidity -1.734e-02 8.375e-03 -2.071 0.0384 *
## CitricAcid 3.897e-03 7.570e-03 0.515 0.6067
## ResidualSugar -1.156e-04 1.836e-04 -0.630 0.5289
## Chlorides -1.911e-02 1.970e-02 -0.970 0.3319
## FreeSulfurDioxide 3.943e-05 4.003e-05 0.985 0.3247
## TotalSulfurDioxide -1.411e-05 2.906e-05 -0.485 0.6274
## Density -2.770e-01 2.250e-01 -1.231 0.2183
## pH 5.195e-03 9.961e-03 0.522 0.6020
## Sulphates -2.062e-03 6.937e-03 -0.297 0.7663
## Alcohol 8.079e-03 1.827e-03 4.422 9.77e-06 ***
## LabelAppeal 2.259e-01 7.143e-03 31.619 < 2e-16 ***
## AcidIndex -3.848e-02 6.312e-03 -6.096 1.09e-09 ***
## STARS 1.164e-01 7.018e-03 16.588 < 2e-16 ***
## Log(theta) 2.301e+01 NA NA NA
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4699 0.1171 21.09 <2e-16 ***
## STARS -2.8004 0.1039 -26.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 9835737329.486
## Number of iterations in BFGS optimization: 67
## Log-likelihood: -1.72e+04 on 18 Df
Evaluate the model’s mean and standard error using holdout dataset:
mean(floor(predict(zi1, ntesting, type = "response")))
## [1] 2.538492
std.error(floor(predict(zi1, ntesting, type = "response")))
## [1] 0.01246765
zi2 <- zeroinfl(TARGET ~ . -INDEX -FixedAcidity -ResidualSugar -Density -CitricAcid -Chlorides -FreeSulfurDioxide -TotalSulfurDioxide -pH -Sulphates |STARS, data=ntraining, dist="negbin")
summary(zi2)
##
## Call:
## zeroinfl(formula = TARGET ~ . - INDEX - FixedAcidity - ResidualSugar -
## Density - CitricAcid - Chlorides - FreeSulfurDioxide - TotalSulfurDioxide -
## pH - Sulphates | STARS, data = ntraining, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.13221 -0.49523 0.06362 0.49137 2.07155
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.263002 0.053928 23.420 < 2e-16 ***
## VolatileAcidity -0.017495 0.008365 -2.091 0.0365 *
## Alcohol 0.008133 0.001825 4.457 8.32e-06 ***
## LabelAppeal 0.226149 0.007136 31.693 < 2e-16 ***
## AcidIndex -0.038786 0.006222 -6.234 4.56e-10 ***
## STARS 0.116290 0.007010 16.588 < 2e-16 ***
## Log(theta) 17.344846 1.442677 12.023 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4719 0.1172 21.09 <2e-16 ***
## STARS -2.8023 0.1040 -26.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 34101294.411
## Number of iterations in BFGS optimization: 39
## Log-likelihood: -1.721e+04 on 9 Df
Evaluate the model’s mean and standard error using holdout dataset:
mean(floor(predict(zi2, ntesting, type = "response")))
## [1] 2.538726
std.error(floor(predict(zi2, ntesting, type = "response")))
## [1] 0.01249403
Although not a big difference in the performance of all models, the last model performs slightly better with closer values to the baseline’s mean and standard error.
The Zero-Inflated with Significant variables model will be used to predict the evaluation dataset.
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
Export to working directory the evaluation dataset with predicted TARGET to working directory
# Export
write.csv(eval_df,file="HW5_Wine.csv")