Cover Page
Data 621 - Week 5 HW
Baron Curtin
CUNY School of Professional Studies
Introduction
The purpose of this assignment is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine
Data Exploration
Non Visual Exploration
Variables
- Response Variable: TARGET
- Number of cases of wine expected the sell
- Explanatory Variables
- 12 Variables describing the chemical composition of the wine
- 2 Variables describing non-chemical chracteristics of the wine
- Identification Variable: INDEX
- Will not be used in the analysis
## Observations: 12,795
## Variables: 15
## $ TARGET <int> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4...
## $ FixedAcidity <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14...
## $ VolatileAcidity <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0...
## $ CitricAcid <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0....
## $ ResidualSugar <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21....
## $ Chlorides <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0...
## $ FreeSulfurDioxide <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213,...
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA,...
## $ Density <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457...
## $ pH <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3...
## $ Sulphates <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, ...
## $ Alcohol <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6,...
## $ LabelAppeal <int> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2...
## $ AcidIndex <int> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8,...
## $ STARS <int> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3...
The glimpse function of dplyr shows that there are 12795 observations and 15 variables (without INDEX) * All of the variables are numeric * Various fields have missing values
- 11 variables are of the double type
- 5 variables of the integer type
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
- The summary function provides information on 8 fields that are missing values
- The pH range is from ~0.5 to ~6.1
- These values indicate that all of these wines are closer to acids rather than bases
- Being alcohol, one would expect the pHs to be more basic
Basic Stats
## # A tibble: 15 x 16
## variables n na_vals obs mean min q1 median q3
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AcidIndex 12795 0 12795 7.77e+0 4 7 8 8
## 2 Alcohol 12795 653 12142 1.05e+1 -4.7 9 10.4 12.4
## 3 Chlorides 12795 638 12157 5.48e-2 -1.17 -0.031 0.046 0.153
## 4 CitricAc~ 12795 0 12795 3.08e-1 -3.24 0.03 0.31 0.580
## 5 Density 12795 0 12795 9.94e-1 0.888 0.988 0.994 1.00
## 6 FixedAci~ 12795 0 12795 7.08e+0 -18.1 5.2 6.9 9.5
## 7 FreeSulf~ 12795 647 12148 3.08e+1 -555 0 30 70
## 8 LabelApp~ 12795 0 12795 -9.07e-3 -2 -1 0 1
## 9 pH 12795 395 12400 3.21e+0 0.48 2.96 3.2 3.47
## 10 Residual~ 12795 616 12179 5.42e+0 -128. -2 3.9 15.9
## 11 STARS 12795 3359 9436 2.04e+0 1 1 2 3
## 12 Sulphates 12795 1210 11585 5.27e-1 -3.13 0.28 0.5 0.86
## 13 TARGET 12795 0 12795 3.03e+0 0 2 3 4
## 14 TotalSul~ 12795 682 12113 1.21e+2 -823 27 123 208
## 15 Volatile~ 12795 0 12795 3.24e-1 -2.79 0.13 0.28 0.64
## # ... with 7 more variables: max <dbl>, sd <dbl>, var <dbl>, range <dbl>,
## # iqr <dbl>, skewness <dbl>, kurtosis <dbl>
- basisStats further confirms the existence of missing values in the aforemention variables
- No variable appears to suffer from severe skew or kurtosis
- There are a number of variables with negative values that should not contain negatives
- These will be addressed in data preparation
- All of the variables below except for LabelAppeal are chemical properties yet all should not contain negatives
Correlation
- The correlation table above reveals that STARS and LabelAppeal have the greatest correlation with selling cases of wine
- AcidIndex has the highest negative correlation indicating that the higher the acidity, the less cases will be sold
- The correlation matrix further confirms that values from the correlation table
- LabelAppeal and STARS show a weak positive correlation
- It would appear that LabelAppeal has an impact on STARS
Visual Inspection
Density Plots
## No id variables; using all as measure variables
The density plots reveal some of the variables are approximately normally distributed and others that are not: * Normally Distributed Variables: + FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol * Non-normal Variables: + TARGET, LabelAppeal, AcidIndex, STARS
Histograms
- The histograms are able to refute some of the conclusions made in the density plots showing that Alcohol, Density, and pH as that approximately normal distributions
- The histograms also reveal right skew in Sulphates, VolatileAcidity, TotalSulfurDioxide, ResidualSugar, FixedAcidity, Chlorides
Box Plots
- The box plots are able to confirm the existence of skew in VolatileAcidity, CitricAcid, Sulphates, LabelAppeal, AcidIndex
- The box plots further confirm that high variability that exists ResidualSugar, TotalSulfurDioxide
Data Preparation
INDEX removed from datasets
As INDEX is just an identification variable, it will be removed
Fields with Negatives Set to 0
## # A tibble: 10 x 1
## variables
## <chr>
## 1 Alcohol
## 2 Chlorides
## 3 CitricAcid
## 4 FixedAcidity
## 5 FreeSulfurDioxide
## 6 LabelAppeal
## 7 ResidualSugar
## 8 Sulphates
## 9 TotalSulfurDioxide
## 10 VolatileAcidity
The variables above contained values lower then zero. All of them are chemical properties of the wine (except LabelAppeal) and should not be 0. They have been changed so that their lowest value is 0. LabelAppeal also receives the same treatment because there aren’t many rating scales that begin in the negatives.
Missing Value Imputation
THe missing values will need to be handled prior to model development. To do so, the missForest package will be used
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## removed variable(s) 1 due to the missingness of all entries
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
Build Models
Poisson Regression Model
The glmulti package provides a convinient way to generate the best models. To generate the models, we will use that package
Best Formulas
## [[1]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS
## <environment: 0x0000000027a16e10>
##
## [[2]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + pH + Sulphates + Alcohol + LabelAppeal +
## AcidIndex + STARS
## <environment: 0x0000000027a16e10>
##
## [[3]]
## TARGET ~ 1 + FixedAcidity + VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
## Alcohol + LabelAppeal + AcidIndex + STARS
## <environment: 0x0000000027a16e10>
##
## [[4]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + ResidualSugar + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
## Alcohol + LabelAppeal + AcidIndex + STARS
## <environment: 0x0000000027a16e10>
##
## [[5]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + LabelAppeal +
## AcidIndex + STARS
## <environment: 0x0000000027a16e10>
Model 1: 13 Variables
##
## Call:
## glm(formula = TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
## Alcohol + LabelAppeal + AcidIndex + STARS, family = "poisson",
## data = imputed$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9377 -0.7520 0.1119 0.6389 2.8292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.457e+00 1.960e-01 7.436 1.04e-13 ***
## VolatileAcidity -6.774e-02 9.523e-03 -7.113 1.14e-12 ***
## CitricAcid 1.925e-02 8.398e-03 2.292 0.02189 *
## Chlorides -9.086e-02 2.519e-02 -3.607 0.00031 ***
## FreeSulfurDioxide 2.087e-04 5.139e-05 4.061 4.88e-05 ***
## TotalSulfurDioxide 1.177e-04 3.024e-05 3.892 9.95e-05 ***
## Density -3.509e-01 1.921e-01 -1.826 0.06778 .
## pH -1.772e-02 7.620e-03 -2.326 0.02003 *
## Sulphates -2.647e-02 7.861e-03 -3.367 0.00076 ***
## Alcohol 2.990e-03 1.428e-03 2.094 0.03625 *
## LabelAppeal 1.458e-01 8.896e-03 16.389 < 2e-16 ***
## AcidIndex -9.822e-02 4.469e-03 -21.977 < 2e-16 ***
## STARS 3.614e-01 5.862e-03 61.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 16644 on 12782 degrees of freedom
## AIC: 48612
##
## Number of Fisher Scoring iterations: 5
- The AIC for this model is 48651
- The model includes 11 significant variables at the 5% level
- AcidIndex has the highest negative impact on TARGET
- STARS has the highest positive impact on TARGET
- This aligns with conventional wisdom as one would expect previous reviews to influence future buyers
Model 2: All Variables
##
## Call:
## glm(formula = TARGET ~ 1 + ., family = "poisson", data = imputed$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9375 -0.7516 0.1103 0.6390 2.8308
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.457e+00 1.960e-01 7.436 1.04e-13 ***
## FixedAcidity -5.757e-04 9.689e-04 -0.594 0.552384
## VolatileAcidity -6.768e-02 9.523e-03 -7.107 1.19e-12 ***
## CitricAcid 1.926e-02 8.399e-03 2.293 0.021853 *
## ResidualSugar 3.705e-05 2.336e-04 0.159 0.873990
## Chlorides -9.095e-02 2.519e-02 -3.610 0.000306 ***
## FreeSulfurDioxide 2.089e-04 5.139e-05 4.064 4.82e-05 ***
## TotalSulfurDioxide 1.175e-04 3.025e-05 3.884 0.000103 ***
## Density -3.507e-01 1.921e-01 -1.825 0.067940 .
## pH -1.771e-02 7.621e-03 -2.323 0.020162 *
## Sulphates -2.637e-02 7.863e-03 -3.353 0.000799 ***
## Alcohol 2.994e-03 1.428e-03 2.097 0.036015 *
## LabelAppeal 1.457e-01 8.897e-03 16.381 < 2e-16 ***
## AcidIndex -9.780e-02 4.525e-03 -21.612 < 2e-16 ***
## STARS 3.614e-01 5.863e-03 61.638 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 16644 on 12780 degrees of freedom
## AIC: 48616
##
## Number of Fisher Scoring iterations: 5
- The AIC value ias 48654, slightly higher than model 1
- It reintroduces some insignificant variables that may overfit the data
- The highest positive impactful variable is STARS, aligning with conventional wisdom
- The highest negative variable is AcidIndex
Negative Binomial Regression
The glmulti package provides a convinient way to generate the best models. To generate the models, we will use that package
Best Formulas
## [[1]]
## TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex +
## STARS
## <environment: 0x000000005c280218>
##
## [[2]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex +
## STARS
## <environment: 0x000000005c280218>
##
## [[3]]
## TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + LabelAppeal +
## AcidIndex + STARS
## <environment: 0x000000005c280218>
##
## [[4]]
## TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + LabelAppeal + AcidIndex +
## STARS
## <environment: 0x000000005c280218>
##
## [[5]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + LabelAppeal +
## AcidIndex + STARS
## <environment: 0x000000005c280218>
Model 3: 9 Variables
##
## Call:
## glm.nb(formula = TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex +
## STARS, data = imputed$train, init.theta = 44857.248, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9310 -0.7486 0.1083 0.6398 2.8329
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.149e+00 4.716e-02 24.368 < 2e-16 ***
## VolatileAcidity -6.782e-02 9.523e-03 -7.121 1.07e-12 ***
## Chlorides -9.352e-02 2.518e-02 -3.714 0.000204 ***
## FreeSulfurDioxide 2.067e-04 5.139e-05 4.023 5.75e-05 ***
## TotalSulfurDioxide 1.154e-04 3.022e-05 3.819 0.000134 ***
## pH -1.784e-02 7.619e-03 -2.342 0.019172 *
## Sulphates -2.623e-02 7.860e-03 -3.337 0.000847 ***
## LabelAppeal 1.457e-01 8.895e-03 16.377 < 2e-16 ***
## AcidIndex -9.836e-02 4.455e-03 -22.076 < 2e-16 ***
## STARS 3.627e-01 5.843e-03 62.076 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(44857.25) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 16657 on 12785 degrees of freedom
## AIC: 48622
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 44857
## Std. Err.: 53571
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -48599.69
- This model has an AIC of 48661, higher than the poisson models
- All of the variables are significant
- The highest positive coefficient is STARS, aligning again with conventional wisdom
- The lowest negative coefficient is AcidIndex, similar to the poisson regressions
Model 4: Log Transform 9 Variables
##
## Call:
## glm.nb(formula = TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex +
## STARS, data = ., init.theta = 52851.92895, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7821 -0.3463 0.1144 0.3702 1.3945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.956417 0.163722 5.842 5.17e-09 ***
## VolatileAcidity -0.116464 0.026142 -4.455 8.38e-06 ***
## Chlorides -0.114019 0.052053 -2.190 0.028491 *
## FreeSulfurDioxide 0.014240 0.004190 3.399 0.000677 ***
## TotalSulfurDioxide 0.010643 0.003837 2.774 0.005536 **
## pH -0.081187 0.047520 -1.708 0.087548 .
## Sulphates -0.060223 0.023680 -2.543 0.010985 *
## LabelAppeal 0.046602 0.023306 2.000 0.045548 *
## AcidIndex -0.799153 0.062611 -12.764 < 2e-16 ***
## STARS 0.986068 0.030149 32.707 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(52851.93) family taken to be 1)
##
## Null deviance: 8120.9 on 12794 degrees of freedom
## Residual deviance: 6491.5 on 12785 degrees of freedom
## AIC: 30175
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 52852
## Std. Err.: 61681
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -30152.53
- The log transformation resulted in a much lower AIC at 30191
- Many of the variables that were highly significant previously, became less significant
- The highest positive coefficient is STARS, aligning with conventional wisdom
- The lowest negative coefficient is AcidIndex
Multiple Linear Regression
For multiple linear regression, we can use the leaps package to return the best subset
## Subset selection object
## Call: regsubsets.formula(TARGET ~ ., data = imputed$train, method = "exhaustive",
## nvmax = NULL, nbest = 1)
## 14 Variables (and intercept)
## Forced in Forced out
## FixedAcidity FALSE FALSE
## VolatileAcidity FALSE FALSE
## CitricAcid FALSE FALSE
## ResidualSugar FALSE FALSE
## Chlorides FALSE FALSE
## FreeSulfurDioxide FALSE FALSE
## TotalSulfurDioxide FALSE FALSE
## Density FALSE FALSE
## pH FALSE FALSE
## Sulphates FALSE FALSE
## Alcohol FALSE FALSE
## LabelAppeal FALSE FALSE
## AcidIndex FALSE FALSE
## STARS FALSE FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: exhaustive
## FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " " " "*"
## 6 ( 1 ) " " "*" " " " " "*"
## 7 ( 1 ) " " "*" " " " " "*"
## 8 ( 1 ) " " "*" " " " " "*"
## 9 ( 1 ) " " "*" " " " " "*"
## 10 ( 1 ) " " "*" "*" " " "*"
## 11 ( 1 ) " " "*" "*" " " "*"
## 12 ( 1 ) " " "*" "*" " " "*"
## 13 ( 1 ) "*" "*" "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*" "*"
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " " " "*"
## 9 ( 1 ) "*" "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*"
## Alcohol LabelAppeal AcidIndex STARS
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) " " " " "*" "*"
## 3 ( 1 ) " " "*" "*" "*"
## 4 ( 1 ) " " "*" "*" "*"
## 5 ( 1 ) " " "*" "*" "*"
## 6 ( 1 ) " " "*" "*" "*"
## 7 ( 1 ) " " "*" "*" "*"
## 8 ( 1 ) " " "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
Determine Best Subset
- From the Cp plot, 12 variables is the best subset
- These variables are
## (Intercept) FixedAcidity VolatileAcidity
## TRUE FALSE TRUE
## CitricAcid ResidualSugar Chlorides
## TRUE FALSE TRUE
## FreeSulfurDioxide TotalSulfurDioxide Density
## TRUE TRUE TRUE
## pH Sulphates Alcohol
## TRUE TRUE TRUE
## LabelAppeal AcidIndex STARS
## TRUE TRUE TRUE
Model 5: 12 Variables
##
## Call:
## lm(formula = TARGET ~ . - ResidualSugar - FixedAcidity, data = imputed$train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3425 -1.1117 0.1344 1.0604 4.6378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.465e+00 4.963e-01 6.983 3.04e-12 ***
## VolatileAcidity -1.884e-01 2.314e-02 -8.145 4.15e-16 ***
## CitricAcid 5.612e-02 2.167e-02 2.590 0.009603 **
## Chlorides -2.913e-01 6.239e-02 -4.669 3.06e-06 ***
## FreeSulfurDioxide 5.854e-04 1.334e-04 4.387 1.16e-05 ***
## TotalSulfurDioxide 3.140e-04 7.816e-05 4.017 5.92e-05 ***
## Density -1.019e+00 4.876e-01 -2.089 0.036706 *
## pH -4.204e-02 1.934e-02 -2.174 0.029731 *
## Sulphates -7.560e-02 1.960e-02 -3.858 0.000115 ***
## Alcohol 1.227e-02 3.619e-03 3.390 0.000702 ***
## LabelAppeal 5.395e-01 2.484e-02 21.721 < 2e-16 ***
## AcidIndex -2.443e-01 9.967e-03 -24.510 < 2e-16 ***
## STARS 1.249e+00 1.641e-02 76.117 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.461 on 12782 degrees of freedom
## Multiple R-squared: 0.4254, Adjusted R-squared: 0.4249
## F-statistic: 788.6 on 12 and 12782 DF, p-value: < 2.2e-16
- All of the variables in Model 5 are significant at the 5% level
- The most significant predictors are VolatileAcidity, STARS, AcidIndex, LabelAppeal, Alcohol, Sulphates, TotalSulfurDioxide, FreeSulfurDioxide, Chlorides
- FreeSulfurDioxide has the highest positive impact on TARGET, contrary to the belief about STARS and LabelAppeal
- Sulphates has the highest negative impacto on TARGET
- The adjusted R^2 is .4219 leaving room for improvement
Model 6: All Variables
##
## Call:
## lm(formula = TARGET ~ ., data = imputed$train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3452 -1.1145 0.1333 1.0606 4.6370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4698486 0.4963990 6.990 2.88e-12 ***
## FixedAcidity -0.0011419 0.0024425 -0.468 0.640130
## VolatileAcidity -0.1883262 0.0231386 -8.139 4.35e-16 ***
## CitricAcid 0.0559966 0.0216699 2.584 0.009775 **
## ResidualSugar -0.0001492 0.0005963 -0.250 0.802467
## Chlorides -0.2917076 0.0624020 -4.675 2.97e-06 ***
## FreeSulfurDioxide 0.0005864 0.0001335 4.394 1.12e-05 ***
## TotalSulfurDioxide 0.0003142 0.0000782 4.018 5.90e-05 ***
## Density -1.0193362 0.4875970 -2.091 0.036590 *
## pH -0.0419544 0.0193413 -2.169 0.030089 *
## Sulphates -0.0753582 0.0196068 -3.843 0.000122 ***
## Alcohol 0.0122428 0.0036201 3.382 0.000722 ***
## LabelAppeal 0.5394189 0.0248408 21.715 < 2e-16 ***
## AcidIndex -0.2434396 0.0101406 -24.006 < 2e-16 ***
## STARS 1.2491617 0.0164156 76.096 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.461 on 12780 degrees of freedom
## Multiple R-squared: 0.4254, Adjusted R-squared: 0.4248
## F-statistic: 675.8 on 14 and 12780 DF, p-value: < 2.2e-16
- The full model reintroduces insignificant variables
- FreeSulfurDioxide has the highest positive impact on TARGET, contract to the belief about STARS and LabelAppeal
- Sulphates have the highest negative impact on TARGET
- The adjusted R^2 is .4218, lower than the subsetted model
Select Models
The model we will use for prediction is Model 3. It had the highest AIC, all variables were significant, and the coefficients aligned with conventional thinking
Model 3
- The histogram of the residuals do not show a normal distribution
- The qqplot shows a fairly linear relationship with the tails of the plot venturing away from the line
- The residual plot does not display contant variance
Test Model
Code Appendix
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)
loadPkg <- function(x) {
if (!require(x, character.only = T))
install.packages(x, dependencies = T)
require(x, character.only = T)
}
libs <- c("knitr", "magrittr", "data.table", "kableExtra", "caret", "pROC",
"missForest", "zoo", "ISLR", "leaps", "fBasics", "reshape2", "GGally", "gridExtra",
"ROCR", "dummies", "pscl", "bestglm", "glmulti", "meifly", "MASS", "tidyverse")
lapply(libs, loadPkg)
train <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week5/wine-training-data.csv") %>%
as_tibble()
test <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week5/wine-evaluation-data.csv") %>%
as_tibble() %>% rename(INDEX = IN)
# list of datasets
data <- list(train = train, test = test)
data_frame(exp_vars = names(train)) %>% filter(!exp_vars %in% c("TARGET", "INDEX")) %>%
arrange(exp_vars) %>% mutate(tag = if_else(exp_vars %in% c("STARS", "LabelAppeal"),
"Non-Chemical", "Chemical"))
# remove INDEX from datasets
data %<>% map(function(df) {
df %<>% dplyr::select(-INDEX)
return(df)
})
glimpse(data$train)
data$train %>% sapply(typeof) %>% as.data.frame() %>% rownames_to_column(var = "variable") %>%
rename(vartype = 2) %>% group_by(vartype) %>% summarise(count = n())
summary(data$train)
stats <- data$train %>% basicStats() %>% as_tibble() %>% rownames_to_column() %>%
gather(var, value, -rowname) %>% spread(rowname, value) %>% dplyr::rename_all(str_to_lower) %>%
dplyr::rename_all(str_trim) %>% dplyr::rename(variables = "var", q1 = `1. quartile`,
q3 = `3. quartile`, max = maximum, min = minimum, na_vals = nas, n = nobs,
sd = stdev, var = variance) %>% mutate(obs = n - na_vals, range = max -
min, iqr = q3 - q1) %>% dplyr::select(variables, n, na_vals, obs, mean,
min, q1, median, q3, max, sd, var, range, iqr, skewness, kurtosis) %>% print
stats %>% filter(min < 0) %>% select(variables)
# helper function
setToZero <- function(x) {
if (is.double(x)) {
if_else(!is.na(x) & x < 0, as.double(0), as.double(x))
} else {
if_else(!is.na(x) & x < 0, 0, x)
}
}
data %<>% map(function(df) {
# vars with negatives
negs <- stats %>% filter(min < 0) %>% select(variables)
# mutate values below 0 to be 0
df %<>% mutate_all(as.numeric) %>% mutate_at(negs[["variables"]], setToZero)
# return df
return(df)
})
data$train %>% cor(use = "na.or.complete") %>% as.data.frame() %>% rownames_to_column(var = "exp_var") %>%
as_data_frame() %>% select(exp_var, TARGET) %>% filter(!exp_var %in% c("TARGET",
"INDEX")) %>% arrange(desc(TARGET))
ggcorr(data$train, palette = "RdBu", label = T, geom = "tile", size = 2)
vis <- data$train %>% melt()
ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable,
scales = "free")
data$train %>% gather(key = "variable", value = "value") %>% ggplot(aes(value)) +
geom_histogram(fill = "red", binwidth = function(x) 2 * IQR(x)/(length(x)^(1/3))) +
facet_wrap(~variable, scales = "free")
ggplot(vis, aes(x = 1, y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean,
color = "red", geom = "point", shape = 18, size = 3) + facet_wrap(~variable,
scales = "free") + coord_flip()
negs <- stats %>% filter(min < 0) %>% select(variables) %>% print
forests <- data %>% map(function(df) {
df %<>% as.data.frame() %>% missForest()
# return df
return(df)
})
imputed <- forests %>% map(function(x) {
return(x$ximp)
})
prm <- glmulti(TARGET ~ ., data = imputed$train, level = 1, method = "h", confsetsize = 5,
plotty = F, report = F, fitfunction = "glm", family = "poisson")
prm@formulas
m1 <- glm(TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide +
TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + LabelAppeal +
AcidIndex + STARS, data = imputed$train, family = "poisson")
summary(m1)
m2 <- glm(TARGET ~ 1 + ., data = imputed$train, family = "poisson")
summary(m2)
nbr <- glmulti(TARGET ~ ., data = imputed$train, level = 1, method = "h", confsetsize = 5,
plotty = F, report = F, family = negative.binomial(1))
nbr@formulas
m3 <- glm.nb(TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide +
TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex + STARS, data = imputed$train)
summary(m3)
m4 <- imputed$train %>% mutate_all(~log(1 + .x)) %>% glm.nb(TARGET ~ 1 + VolatileAcidity +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates + LabelAppeal +
AcidIndex + STARS, data = .)
summary(m4)
mlr <- regsubsets(TARGET ~ ., data = imputed$train, method = "exhaustive", nvmax = NULL,
nbest = 1)
mlr.summary <- summary(mlr)
print(mlr.summary)
# determine best subset
plot(mlr.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(mlr.summary$cp), mlr.summary$cp[which.min(mlr.summary$cp)],
pch = 20, col = "red")
# cp plot par(mfrow=c(1,2))
plot(mlr, scale = "Cp", main = "Cp")
# r^2 splot
plot(mlr, scale = "adjr2", main = "Adjusted R^2")
(summary(mlr))$which[12, ]
m5 <- lm(TARGET ~ . - ResidualSugar - FixedAcidity, data = imputed$train)
summary(m5)
m6 <- lm(TARGET ~ ., data = imputed$train)
summary(m6)
par(mfrow = c(2, 2))
plot(m3)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals)
m3_results <- predict(m3, newdata = imputed$test)
test.final <- imputed$test %>% bind_cols(data_frame(predicted = m3_results))
m3.training <- predict(m3, newdata = imputed$train)
training.final <- imputed$train %>% bind_cols(data_frame(predicted = m3.training)) %>%
mutate(log_target = log(1 + TARGET))