In this homework assignment, you will explore, analyze, and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based on the wine characteristics. If the manufacturer can predict the number of cases, it will be able to adjust its wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).
library(dplyr)
library(ggplot2)
library(ggpubr)
library(psych)
library(knitr)
library(Amelia)
library(tidyr)
library(mice) #missing values
library(caTools)
library(MASS)
##Load dataset
wine.train <- read.csv("wine-training-data.csv", header= TRUE)
wine.train <- wine.train[,-1]
str(wine.train)
## 'data.frame': 12795 obs. of 15 variables:
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
## vars n mean sd median trimmed mad min
## TARGET 1 12795 3.03 1.93 3.00 3.05 1.48 0.00
## FixedAcidity 2 12795 7.08 6.32 6.90 7.07 3.26 -18.10
## VolatileAcidity 3 12795 0.32 0.78 0.28 0.32 0.43 -2.79
## CitricAcid 4 12795 0.31 0.86 0.31 0.31 0.42 -3.24
## ResidualSugar 5 12179 5.42 33.75 3.90 5.58 15.72 -127.80
## Chlorides 6 12157 0.05 0.32 0.05 0.05 0.13 -1.17
## FreeSulfurDioxide 7 12148 30.85 148.71 30.00 30.93 56.34 -555.00
## TotalSulfurDioxide 8 12113 120.71 231.91 123.00 120.89 134.92 -823.00
## Density 9 12795 0.99 0.03 0.99 0.99 0.01 0.89
## pH 10 12400 3.21 0.68 3.20 3.21 0.39 0.48
## Sulphates 11 11585 0.53 0.93 0.50 0.53 0.44 -3.13
## Alcohol 12 12142 10.49 3.73 10.40 10.50 2.37 -4.70
## LabelAppeal 13 12795 -0.01 0.89 0.00 -0.01 1.48 -2.00
## AcidIndex 14 12795 7.77 1.32 8.00 7.64 1.48 4.00
## STARS 15 9436 2.04 0.90 2.00 1.97 1.48 1.00
## max range skew kurtosis se
## TARGET 8.00 8.00 -0.33 -0.88 0.02
## FixedAcidity 34.40 52.50 -0.02 1.67 0.06
## VolatileAcidity 3.68 6.47 0.02 1.83 0.01
## CitricAcid 3.86 7.10 -0.05 1.84 0.01
## ResidualSugar 141.15 268.95 -0.05 1.88 0.31
## Chlorides 1.35 2.52 0.03 1.79 0.00
## FreeSulfurDioxide 623.00 1178.00 0.01 1.84 1.35
## TotalSulfurDioxide 1057.00 1880.00 -0.01 1.67 2.11
## Density 1.10 0.21 -0.02 1.90 0.00
## pH 6.13 5.65 0.04 1.65 0.01
## Sulphates 4.24 7.37 0.01 1.75 0.01
## Alcohol 26.50 31.20 -0.03 1.54 0.03
## LabelAppeal 2.00 4.00 0.01 -0.26 0.01
## AcidIndex 17.00 13.00 1.65 5.19 0.01
## STARS 4.00 3.00 0.45 -0.69 0.01
The data set contains 12,795 rows and 16 columns (15 once the index column is dropped): the response variable TARGET (number of cases purchased) and 14 potential predictors describing each wine's characteristics.
par(mfrow=c(2,2))
# One colour per distinct TARGET value, reused for every plot
target.cols <- rainbow(length(unique(wine.train$TARGET)))
# Horizontal boxplot of each predictor grouped by TARGET
for (v in setdiff(names(wine.train), "TARGET")) {
  boxplot(wine.train[[v]] ~ wine.train$TARGET, horizontal=TRUE, col=target.cols, main=v)
}
# par(mfrow) has no effect on ggplot2 objects; ggarrange() handles the layout.
# Bin widths are set per variable because the value ranges differ widely.
h1 <- ggplot(data = wine.train, aes(x = FixedAcidity)) + geom_histogram(binwidth = 1, fill="blue")
h2 <- ggplot(data = wine.train, aes(x = ResidualSugar)) + geom_histogram(binwidth = 10, fill="blue")
h3 <- ggplot(data = wine.train, aes(x = FreeSulfurDioxide)) + geom_histogram(binwidth = 10, fill="blue")
h4 <- ggplot(data = wine.train, aes(x = TotalSulfurDioxide)) + geom_histogram(binwidth = 10, fill="blue")
h5 <- ggplot(data = wine.train, aes(x = pH)) + geom_histogram(binwidth = 0.1, fill="blue")
h6 <- ggplot(data = wine.train, aes(x = Alcohol)) + geom_histogram(binwidth = 1, fill="blue")
h7 <- ggplot(data = wine.train, aes(x = AcidIndex)) + geom_histogram(binwidth = 1, fill="blue")
ggarrange(h1,h2,h3,h4,h5,h6,h7, ncol = 3, nrow = 3)
## Warning: Removed 616 rows containing non-finite values (stat_bin).
## Warning: Removed 647 rows containing non-finite values (stat_bin).
## Warning: Removed 682 rows containing non-finite values (stat_bin).
## Warning: Removed 395 rows containing non-finite values (stat_bin).
## Warning: Removed 653 rows containing non-finite values (stat_bin).
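A single fixed bin width rarely suits every variable here: according to the describe() output, pH spans about 5.65 units while TotalSulfurDioxide spans 1,880. One common rule of thumb for picking a width per variable is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). The sketch below applies it to simulated data; `fd_binwidth` and `ph_like` are hypothetical names used only for illustration, not part of the original analysis.

```r
# Freedman-Diaconis bin width: 2 * IQR / n^(1/3), ignoring missing values.
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1 / 3)
}

# Simulated stand-in for a pH-like variable (made-up values).
set.seed(3)
ph_like <- rnorm(1000, mean = 3.2, sd = 0.7)

bw <- fd_binwidth(ph_like)
print(bw)
```

The resulting value could be passed as `binwidth` to `geom_histogram()` for the corresponding variable.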
The following variables seem to have a strong correlation with the response variable TARGET:
- Chlorides, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
The following variables seem to have a mild correlation with the response variable TARGET:
- FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, FreeSulfurDioxide, TotalSulfurDioxide
## TARGET FixedAcidity VolatileAcidity
## TARGET 1.0000000000 -0.012538100 -0.0759978765
## FixedAcidity -0.0125380998 1.000000000 0.0190109733
## VolatileAcidity -0.0759978765 0.019010973 1.0000000000
## CitricAcid 0.0023450490 0.014000376 -0.0234315631
## ResidualSugar 0.0035195999 -0.015429391 0.0015279517
## Chlorides -0.0304301331 -0.006104447 0.0148489225
## FreeSulfurDioxide 0.0226398054 0.015438463 -0.0114408079
## TotalSulfurDioxide 0.0216020726 -0.023323485 -0.0007434083
## Density -0.0475989086 0.011574241 0.0130977690
## pH 0.0002198557 -0.004553886 0.0072030364
## Sulphates -0.0212203783 0.042229181 0.0015161001
## Alcohol 0.0737771084 -0.013085026 0.0002603082
## LabelAppeal 0.4979464796 0.011375965 -0.0202419713
## AcidIndex -0.1676430648 0.154167846 0.0250529742
## STARS 0.5546857223 -0.004937345 -0.0402432388
## CitricAcid ResidualSugar Chlorides
## TARGET 0.0023450490 0.003519600 -0.0304301331
## FixedAcidity 0.0140003760 -0.015429391 -0.0061044471
## VolatileAcidity -0.0234315631 0.001527952 0.0148489225
## CitricAcid 1.0000000000 -0.009843146 -0.0335608661
## ResidualSugar -0.0098431456 1.000000000 0.0041215692
## Chlorides -0.0335608661 0.004121569 1.0000000000
## FreeSulfurDioxide 0.0121132485 0.021959113 -0.0204924876
## TotalSulfurDioxide -0.0099174506 0.017030939 0.0004188605
## Density -0.0169919691 -0.007120841 0.0206724860
## pH -0.0007581304 0.017563769 -0.0179702278
## Sulphates -0.0144237270 -0.002705775 0.0026187777
## Alcohol 0.0169864284 -0.018943324 -0.0228849573
## LabelAppeal 0.0153315666 -0.004579308 -0.0063870237
## AcidIndex 0.0545838104 -0.020301890 -0.0017134096
## STARS 0.0071401699 0.019665541 -0.0063242568
## FreeSulfurDioxide TotalSulfurDioxide Density
## TARGET 0.022639805 0.0216020726 -0.047598909
## FixedAcidity 0.015438463 -0.0233234848 0.011574241
## VolatileAcidity -0.011440808 -0.0007434083 0.013097769
## CitricAcid 0.012113248 -0.0099174506 -0.016991969
## ResidualSugar 0.021959113 0.0170309394 -0.007120841
## Chlorides -0.020492488 0.0004188605 0.020672486
## FreeSulfurDioxide 1.000000000 0.0134616726 -0.008663509
## TotalSulfurDioxide 0.013461673 1.0000000000 0.023167955
## Density -0.008663509 0.0231679548 1.000000000
## pH -0.002008516 -0.0034227601 -0.002019229
## Sulphates 0.026829029 0.0025040509 -0.010609294
## Alcohol -0.023867458 -0.0168515467 -0.006128355
## LabelAppeal 0.014960087 -0.0027237419 -0.018094403
## AcidIndex -0.014733717 -0.0221292631 0.047778830
## STARS -0.015390398 0.0220949002 -0.028492455
## pH Sulphates Alcohol LabelAppeal
## TARGET 0.0002198557 -0.021220378 0.0737771084 0.4979464796
## FixedAcidity -0.0045538857 0.042229181 -0.0130850260 0.0113759650
## VolatileAcidity 0.0072030364 0.001516100 0.0002603082 -0.0202419713
## CitricAcid -0.0007581304 -0.014423727 0.0169864284 0.0153315666
## ResidualSugar 0.0175637691 -0.002705775 -0.0189433242 -0.0045793083
## Chlorides -0.0179702278 0.002618778 -0.0228849573 -0.0063870237
## FreeSulfurDioxide -0.0020085157 0.026829029 -0.0238674577 0.0149600871
## TotalSulfurDioxide -0.0034227601 0.002504051 -0.0168515467 -0.0027237419
## Density -0.0020192285 -0.010609294 -0.0061283546 -0.0180944026
## pH 1.0000000000 0.010449255 -0.0122034469 0.0002181758
## Sulphates 0.0104492547 1.000000000 0.0108443299 0.0037686996
## Alcohol -0.0122034469 0.010844330 1.0000000000 -0.0006449123
## LabelAppeal 0.0002181758 0.003768700 -0.0006449123 1.0000000000
## AcidIndex -0.0537128921 0.031071782 -0.0558919056 0.0103009840
## STARS -0.0044002985 -0.023135130 0.0648544864 0.3188970216
## AcidIndex STARS
## TARGET -0.16764306 0.554685722
## FixedAcidity 0.15416785 -0.004937345
## VolatileAcidity 0.02505297 -0.040243239
## CitricAcid 0.05458381 0.007140170
## ResidualSugar -0.02030189 0.019665541
## Chlorides -0.00171341 -0.006324257
## FreeSulfurDioxide -0.01473372 -0.015390398
## TotalSulfurDioxide -0.02212926 0.022094900
## Density 0.04777883 -0.028492455
## pH -0.05371289 -0.004400299
## Sulphates 0.03107178 -0.023135130
## Alcohol -0.05589191 0.064854486
## LabelAppeal 0.01030098 0.318897022
## AcidIndex 1.00000000 -0.095482582
## STARS -0.09548258 1.000000000
The correlation matrix is preliminary: it was computed after dropping rows with missing values, and many predictors have missing values. Even so, STARS (0.55), LabelAppeal (0.50) and AcidIndex (-0.17) show the strongest correlation with TARGET, followed by VolatileAcidity (-0.08) and Alcohol (0.07). We will proceed with data preparation to handle the missing values more carefully.
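The call that produced the matrix is not shown. As a hedged illustration of how the treatment of missing values changes a correlation matrix, the sketch below compares the `use = "complete.obs"` and `use = "pairwise.complete.obs"` options of base R's `cor()` on simulated data. The data frame `toy` and its columns are made up for this example.

```r
# Simulated data: z is correlated with x, y is independent noise.
set.seed(1)
toy <- data.frame(x = rnorm(50), y = rnorm(50))
toy$z <- toy$x + rnorm(50, sd = 0.5)

# Knock out two values of y to mimic a predictor with missing entries.
toy$y[c(3, 10)] <- NA

# Listwise deletion: any row with an NA is dropped for every pair.
cor_listwise <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair uses all rows where both values are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_listwise, 3))
print(round(cor_pairwise, 3))
```

With listwise deletion the x-z correlation is computed on 48 rows, while pairwise deletion uses all 50, so the two estimates can differ slightly.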
## Missing values
# Count non-missing and missing values per variable
Non_NAs <- sapply(wine.train, function(y) sum(!is.na(y)))
NAs <- sapply(wine.train, function(y) sum(is.na(y)))
# Proportion (not percentage) of missing values per variable
NA_Percent <- NAs / (NAs + Non_NAs)
NA_SUMMARY <- data.frame(Non_NAs,NAs,NA_Percent)
kable(NA_SUMMARY)
Variable | Non_NAs | NAs | NA_Percent |
---|---|---|---|
TARGET | 12795 | 0 | 0.0000000 |
FixedAcidity | 12795 | 0 | 0.0000000 |
VolatileAcidity | 12795 | 0 | 0.0000000 |
CitricAcid | 12795 | 0 | 0.0000000 |
ResidualSugar | 12179 | 616 | 0.0481438 |
Chlorides | 12157 | 638 | 0.0498632 |
FreeSulfurDioxide | 12148 | 647 | 0.0505666 |
TotalSulfurDioxide | 12113 | 682 | 0.0533021 |
Density | 12795 | 0 | 0.0000000 |
pH | 12400 | 395 | 0.0308714 |
Sulphates | 11585 | 1210 | 0.0945682 |
Alcohol | 12142 | 653 | 0.0510356 |
LabelAppeal | 12795 | 0 | 0.0000000 |
AcidIndex | 12795 | 0 | 0.0000000 |
STARS | 9436 | 3359 | 0.2625244 |
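The assignment hint notes that the fact a value is missing can itself be predictive, and STARS is missing for about 26% of the rows. One way to keep that information before imputing is to derive an indicator column. The sketch below uses a tiny made-up data frame; `toy` and `STARS_missing` are hypothetical names, and the numbers are not taken from the wine data.

```r
# Toy stand-in for wine.train; values are invented for illustration only.
toy <- data.frame(
  TARGET = c(3, 0, 4, 0, 5, 2),
  STARS  = c(2, NA, 3, NA, 4, 1)
)

# Hypothetical derived flag: 1 when STARS was not recorded, 0 otherwise.
toy$STARS_missing <- as.integer(is.na(toy$STARS))

# Compare the mean TARGET of rated and unrated wines.
means_by_flag <- tapply(toy$TARGET, toy$STARS_missing, mean)
print(means_by_flag)
```

If the group means differ markedly on the real data, the flag could be added as a predictor alongside the imputed STARS value.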
## Split data sets
set.seed(999)
sampl = sample.split(wine.train$TARGET, SplitRatio = .80)
wine.train1 <- subset(wine.train, sampl == TRUE)
wine.test1 <- subset(wine.train, sampl == FALSE)
## Data Imputation
wine.train2 <- mice(wine.train1, m=1, maxit = 5, seed = 42)
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
wine.train2 <- complete(wine.train2)
wine.train2 <- as.data.frame(wine.train2)
wine.test2 <- mice(wine.test1, m=1, maxit = 5, seed = 42)
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
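The training and validation sets above are imputed independently with mice. A simpler alternative, shown here only as a sketch and not as the method used in this report, is to learn the fill values on the training set and reuse them on the validation set, so that both sets are filled consistently. The data frames and the helper `impute_with` are hypothetical, and median imputation is swapped in for mice to keep the example in base R.

```r
# Tiny made-up training and validation sets with missing values.
train <- data.frame(pH = c(3.1, NA, 3.3, 3.5), Alcohol = c(10, 12, NA, 11))
test  <- data.frame(pH = c(NA, 3.0),           Alcohol = c(9, NA))

# Learn one fill value per column from the training data only.
train_medians <- vapply(train, median, numeric(1), na.rm = TRUE)

# Replace NAs in each column of df with the supplied fill value.
impute_with <- function(df, fill) {
  for (col in names(fill)) {
    df[[col]][is.na(df[[col]])] <- fill[[col]]
  }
  df
}

train_imp <- impute_with(train, train_medians)
test_imp  <- impute_with(test, train_medians)
print(test_imp)
```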
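wine.test2 <- complete(wine.test2)
model1 <- glm(TARGET ~ ., data=wine.train1, family=poisson)
summary(model1)
##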
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine.train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2128 -0.2757 0.0647 0.3766 1.6981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.608e+00 2.796e-01 5.750 8.90e-09 ***
## FixedAcidity 6.705e-04 1.177e-03 0.570 0.56901
## VolatileAcidity -2.750e-02 9.283e-03 -2.963 0.00305 **
## CitricAcid -3.835e-03 8.519e-03 -0.450 0.65259
## ResidualSugar 1.828e-05 2.152e-04 0.085 0.93232
## Chlorides -3.764e-02 2.314e-02 -1.627 0.10377
## FreeSulfurDioxide 5.671e-05 4.892e-05 1.159 0.24630
## TotalSulfurDioxide 2.230e-05 3.177e-05 0.702 0.48274
## Density -4.025e-01 2.749e-01 -1.464 0.14326
## pH 2.307e-04 1.085e-02 0.021 0.98303
## Sulphates -5.984e-03 7.973e-03 -0.751 0.45293
## Alcohol 3.262e-03 2.004e-03 1.628 0.10360
## LabelAppeal 1.730e-01 8.858e-03 19.530 < 2e-16 ***
## AcidIndex -4.967e-02 6.666e-03 -7.451 9.28e-14 ***
## STARS 1.929e-01 8.328e-03 23.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4720.5 on 5143 degrees of freedom
## Residual deviance: 3242.8 on 5129 degrees of freedom
## (5093 observations deleted due to missingness)
## AIC: 18545
##
## Number of Fisher Scoring iterations: 5
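Because the Poisson model uses a log link, each coefficient is a change in the log of the expected count, and exponentiating it gives a multiplicative rate ratio. The sketch below does this for three estimates copied from the summary above; the vector name `coefs` is just for illustration.

```r
# Estimates copied from the Poisson summary above.
coefs <- c(LabelAppeal = 0.1730, AcidIndex = -0.04967, STARS = 0.1929)

# Rate ratio: multiplicative change in expected cases per one-unit increase.
rate_ratios <- exp(coefs)
print(round(rate_ratios, 3))
```

For example, exp(0.1929) is about 1.21, so under this model each additional star is associated with roughly 21% more expected cases, holding the other predictors fixed.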
model2 <- glm(TARGET ~ .-FixedAcidity-CitricAcid-ResidualSugar-Chlorides-FreeSulfurDioxide-TotalSulfurDioxide-Density-pH-Sulphates-Alcohol, data=wine.train1, family=poisson)
summary(model2)
##
## Call:
## glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar -
## Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - Density -
## pH - Sulphates - Alcohol, family = poisson, data = wine.train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1898 -0.2777 0.0622 0.3764 1.6086
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.251442 0.054724 22.868 < 2e-16 ***
## VolatileAcidity -0.027581 0.009278 -2.973 0.00295 **
## LabelAppeal 0.173177 0.008853 19.562 < 2e-16 ***
## AcidIndex -0.050616 0.006553 -7.724 1.13e-14 ***
## STARS 0.194208 0.008292 23.421 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4720.5 on 5143 degrees of freedom
## Residual deviance: 3253.1 on 5139 degrees of freedom
## (5093 observations deleted due to missingness)
## AIC: 18535
##
## Number of Fisher Scoring iterations: 5
model3 = glm(TARGET ~ VolatileAcidity+LabelAppeal+AcidIndex+STARS, data=wine.train2, family=poisson)
summary(model3)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + LabelAppeal + AcidIndex +
## STARS, family = poisson, data = wine.train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0381 -0.6778 0.1239 0.6394 2.6618
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.201889 0.042133 28.526 < 2e-16 ***
## VolatileAcidity -0.043501 0.007274 -5.981 2.22e-09 ***
## LabelAppeal 0.143130 0.006779 21.113 < 2e-16 ***
## AcidIndex -0.102810 0.004986 -20.621 < 2e-16 ***
## STARS 0.340243 0.006238 54.545 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18291 on 10236 degrees of freedom
## Residual deviance: 12832 on 10232 degrees of freedom
## AIC: 38400
##
## Number of Fisher Scoring iterations: 5
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
##
## Call:
## glm.nb(formula = TARGET ~ ., data = wine.train1, init.theta = 138898.9965,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2127 -0.2757 0.0647 0.3766 1.6981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.608e+00 2.796e-01 5.750 8.91e-09 ***
## FixedAcidity 6.705e-04 1.177e-03 0.570 0.56900
## VolatileAcidity -2.750e-02 9.283e-03 -2.963 0.00305 **
## CitricAcid -3.835e-03 8.519e-03 -0.450 0.65259
## ResidualSugar 1.828e-05 2.152e-04 0.085 0.93231
## Chlorides -3.764e-02 2.314e-02 -1.627 0.10378
## FreeSulfurDioxide 5.671e-05 4.892e-05 1.159 0.24630
## TotalSulfurDioxide 2.230e-05 3.177e-05 0.702 0.48275
## Density -4.025e-01 2.750e-01 -1.464 0.14326
## pH 2.307e-04 1.085e-02 0.021 0.98303
## Sulphates -5.984e-03 7.973e-03 -0.751 0.45293
## Alcohol 3.262e-03 2.004e-03 1.628 0.10360
## LabelAppeal 1.730e-01 8.858e-03 19.529 < 2e-16 ***
## AcidIndex -4.967e-02 6.666e-03 -7.451 9.28e-14 ***
## STARS 1.929e-01 8.328e-03 23.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(138899) family taken to be 1)
##
## Null deviance: 4720.4 on 5143 degrees of freedom
## Residual deviance: 3242.7 on 5129 degrees of freedom
## (5093 observations deleted due to missingness)
## AIC: 18547
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 138899
## Std. Err.: 259921
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -18515.07
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
##
## Call:
## glm.nb(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar -
## Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - Density -
## pH - Sulphates - Alcohol, data = wine.train1, init.theta = 138402.1806,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1898 -0.2777 0.0622 0.3764 1.6086
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.251443 0.054725 22.868 < 2e-16 ***
## VolatileAcidity -0.027581 0.009279 -2.973 0.00295 **
## LabelAppeal 0.173177 0.008853 19.562 < 2e-16 ***
## AcidIndex -0.050616 0.006553 -7.724 1.13e-14 ***
## STARS 0.194209 0.008292 23.421 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(138402.2) family taken to be 1)
##
## Null deviance: 4720.4 on 5143 degrees of freedom
## Residual deviance: 3253.0 on 5139 degrees of freedom
## (5093 observations deleted due to missingness)
## AIC: 18537
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 138402
## Std. Err.: 258834
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -18525.37
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + LabelAppeal + AcidIndex +
## STARS, data = wine.train2, init.theta = 48614.35988, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0380 -0.6778 0.1239 0.6394 2.6617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.201895 0.042134 28.526 < 2e-16 ***
## VolatileAcidity -0.043502 0.007274 -5.981 2.22e-09 ***
## LabelAppeal 0.143130 0.006780 21.112 < 2e-16 ***
## AcidIndex -0.102812 0.004986 -20.621 < 2e-16 ***
## STARS 0.340248 0.006238 54.543 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(48614.36) family taken to be 1)
##
## Null deviance: 18290 on 10236 degrees of freedom
## Residual deviance: 12831 on 10232 degrees of freedom
## AIC: 38402
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 48614
## Std. Err.: 62794
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -38389.98
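The fitted theta values above are very large (for example 138,899 with a standard error of 259,921), and the fitting routine reports that it reached its iteration limit. A negative binomial model with a very large theta is effectively a Poisson model, which is consistent with the near-identical coefficients and deviances in the two sets of summaries. The residual deviance of the first Poisson fit (3242.8 on 5129 degrees of freedom, a ratio of about 0.63) also gives no sign of overdispersion. A common quick check is the Pearson dispersion statistic, sketched below on simulated data; `x`, `y` and `fit` are illustrative names.

```r
# Simulate counts that really are Poisson, so dispersion should be near 1.
set.seed(42)
x <- rnorm(500)
y <- rpois(500, lambda = exp(0.5 + 0.3 * x))

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion: sum of squared Pearson residuals over residual df.
# Values well above 1 suggest overdispersion; values near or below 1 do not.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```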
##
## Call:
## lm(formula = TARGET ~ ., data = wine.train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5119 -0.9973 0.1659 1.0271 4.2662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.936e+00 5.334e-01 7.380 1.71e-13 ***
## FixedAcidity 2.874e-04 2.253e-03 0.128 0.89847
## VolatileAcidity -1.245e-01 1.790e-02 -6.960 3.62e-12 ***
## CitricAcid 2.889e-02 1.628e-02 1.775 0.07598 .
## ResidualSugar 4.461e-04 4.131e-04 1.080 0.28023
## Chlorides -1.963e-01 4.391e-02 -4.471 7.88e-06 ***
## FreeSulfurDioxide 2.881e-04 9.384e-05 3.070 0.00214 **
## TotalSulfurDioxide 2.285e-04 5.997e-05 3.810 0.00014 ***
## Density -1.101e+00 5.254e-01 -2.096 0.03614 *
## pH -3.884e-02 2.067e-02 -1.879 0.06030 .
## Sulphates -3.503e-02 1.516e-02 -2.310 0.02089 *
## Alcohol 1.167e-02 3.775e-03 3.090 0.00201 **
## LabelAppeal 4.400e-01 1.642e-02 26.799 < 2e-16 ***
## AcidIndex -2.509e-01 1.099e-02 -22.825 < 2e-16 ***
## STARS 1.158e+00 1.664e-02 69.584 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.415 on 10222 degrees of freedom
## Multiple R-squared: 0.4615, Adjusted R-squared: 0.4608
## F-statistic: 625.8 on 14 and 10222 DF, p-value: < 2.2e-16
model8 <- lm(TARGET ~ .-FixedAcidity-CitricAcid-ResidualSugar-Density-pH-Sulphates, data = wine.train2)
summary(model8)
##
## Call:
## lm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar -
## Density - pH - Sulphates, data = wine.train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4936 -1.0062 0.1739 1.0227 4.3350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.701e+00 1.035e-01 26.108 < 2e-16 ***
## VolatileAcidity -1.259e-01 1.790e-02 -7.037 2.09e-12 ***
## Chlorides -1.978e-01 4.391e-02 -4.503 6.76e-06 ***
## FreeSulfurDioxide 2.860e-04 9.387e-05 3.047 0.002315 **
## TotalSulfurDioxide 2.316e-04 5.997e-05 3.862 0.000113 ***
## Alcohol 1.181e-02 3.775e-03 3.129 0.001762 **
## LabelAppeal 4.399e-01 1.642e-02 26.780 < 2e-16 ***
## AcidIndex -2.502e-01 1.076e-02 -23.248 < 2e-16 ***
## STARS 1.160e+00 1.664e-02 69.682 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.416 on 10228 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.4602
## F-statistic: 1092 on 8 and 10228 DF, p-value: < 2.2e-16
## Model Validation (Validation set)
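modelVal <- function(model, testset){
  # Predict on the held-out set. Note: for glm objects predict() defaults to
  # the link (log) scale; type = "response" would return predicted counts.
  pred = predict(model, testset)
  # Squared differences between predictions and observed case counts
  diffMat = as.numeric(pred) - as.numeric(testset$TARGET)
  diffMat = diffMat^2
  # Root mean squared error
  rmse <- sqrt(mean(diffMat))
  return(rmse)
}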
modelVal(model1, wine.test2)
## [1] 2.528024
## [1] 2.527991
## [1] 2.618705
## [1] 2.528024
## [1] 2.527991
## [1] 2.618704
## [1] 1.420752
## [1] 1.422984
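Model 8 (multiple linear regression on the imputed data, keeping only the most significant predictors) is the preferred model. Its validation RMSE (1.423) is essentially tied with that of the full linear model (1.421), and it is simpler because it drops the six predictors that were not significant at the 1% level. The Poisson and negative binomial models score between 2.53 and 2.62, but modelVal calls predict() with its defaults, which for glm objects returns predictions on the link (log) scale, so those values are not directly comparable with the linear-model RMSEs.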
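When comparing RMSE across model families, predictions need to be on the same scale as TARGET. For a glm, `predict()` returns the linear predictor by default, and `type = "response"` returns expected counts. The sketch below shows the difference on simulated Poisson data; all object names are illustrative and none of the numbers come from the wine data.

```r
# Simulated count data with a single predictor.
set.seed(7)
x <- runif(300)
y <- rpois(300, lambda = exp(1 + 0.5 * x))
train <- data.frame(x = x[1:200],   y = y[1:200])
test  <- data.frame(x = x[201:300], y = y[201:300])

fit <- glm(y ~ x, family = poisson, data = train)

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Default predictions are on the log (link) scale, not the count scale.
rmse_link <- rmse(predict(fit, test), test$y)

# type = "response" gives expected counts, comparable with the observed y.
rmse_response <- rmse(predict(fit, test, type = "response"), test$y)

print(c(link = rmse_link, response = rmse_response))
```

Re-running the validation with `type = "response"` for the Poisson and negative binomial models would put their RMSEs on the same footing as the linear models.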
wine.eval <- read.csv("wine-evaluation-data.csv", header= TRUE)
## Data Imputation
wine.eval2 <- mice(wine.eval, m=1, maxit = 5, seed = 42)
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## Warning: Number of logged events: 1