The data used for this project consists of 12,795 observations of 16 variables.

INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
1 3 3.2 1.160 -0.98 54.2 -0.567 NA 268 0.99280 3.33 -0.59 9.9 0 8 2
2 3 4.5 0.160 -0.81 26.1 -0.425 15 -327 1.02792 3.38 0.70 NA -1 7 3
4 5 7.1 2.640 -0.88 14.8 0.037 214 142 0.99518 3.12 0.48 22.0 -1 8 3
5 3 5.7 0.385 0.04 18.8 -0.425 22 115 0.99640 2.24 1.83 6.2 -1 6 1
6 4 8.0 0.330 -1.26 9.4 NA -167 108 0.99457 3.12 1.77 13.7 0 9 2
7 0 11.3 0.320 0.59 2.2 0.556 -37 15 0.99940 3.20 1.29 15.4 0 11 NA

 
 

Data Exploration

Very little correlation exists between most pairs of variables. The exceptions are the positive correlations between TARGET and LabelAppeal (0.56) and between TARGET and STARS (0.33). The variables that describe the chemical content of the wine exhibit excess kurtosis and appear to be leptokurtic, with values clustered tightly around the center and heavy tails. This suggests that the contents of the wines in this data are fairly consistent across the board.
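As a minimal sketch of what "leptokurtic" means numerically, the snippet below computes excess kurtosis (fourth central moment divided by the squared variance, minus 3) with the Python standard library. The function name and the two toy samples are illustrative assumptions, not values from the wine data; a positive result indicates a leptokurtic shape.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

# Toy samples (hypothetical): one flat, one peaked with two far-out values.
flat = [1, 2, 3, 4, 5]
peaked = [0] * 8 + [-5, 5]

print(excess_kurtosis(flat))    # negative: flatter than normal
print(excess_kurtosis(peaked))  # positive: leptokurtic
```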

Number of Cases Bought
Cases Frequency Proportion
0 2734 0.2136772
1 244 0.0190699
2 1091 0.0852677
3 2611 0.2040641
4 3177 0.2483001
5 2014 0.1574052
6 765 0.0597890
7 142 0.0110981
8 17 0.0013286
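The proportions in this table follow directly from the counts. The short sketch below recomputes them and also derives the sizes of the two extreme groups (1 or fewer cases, 5 or more cases) that the comparisons in this report rely on. The variable names are illustrative only.

```python
# Counts of wines by number of cases bought, copied from the table above.
counts = {0: 2734, 1: 244, 2: 1091, 3: 2611, 4: 3177,
          5: 2014, 6: 765, 7: 142, 8: 17}

total = sum(counts.values())
proportions = {cases: n / total for cases, n in counts.items()}

# Sizes of the two extreme groups.
bottom = sum(n for cases, n in counts.items() if cases <= 1)
top = sum(n for cases, n in counts.items() if cases >= 5)

for cases, share in proportions.items():
    print(f"{cases} {counts[cases]} {share:.7f}")
print("total:", total, "bottom:", bottom, "top:", top)
```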

 
 

 
 

 
 
 
 

Content Comparisons (grouped by similarly ranged values)

 

The plots shown on the following pages group and compare the two extremes: wines that have sold 5 or more cases against wines that have sold 1 or fewer. Comparisons are also made between four-star and one-star ratings. The observations are filtered accordingly into two datasets, from which the graphs below are generated.

 

 
 
 
 
 

In comparing chemical contents against success (determined by the extremes of ‘cases bought’ and ‘stars received’), a subtle pattern seems to emerge. In both comparisons, a wine’s success appears to rely on a consistent production process. This applies more to the number of cases bought than to the number of stars received, which makes sense. A wine buyer (whether a person, a restaurant, etc.) looking to purchase wine, in bulk or not, will likely be inclined to purchase what they know will be consistent. However, it should be noted that this pattern could simply be due to market forces. Producers capable of making large quantities of wine will have a more established and consistent production process in place, which in turn would likely mean their position in the market is relatively stable.

 

It is important to note the differences in size between these comparisons.

 

In comparing ‘STARS’, there are:
612 four-star observations
3042 one-star observations

Given that the four-star sample is much smaller than the one-star sample, and given its distributions above, it could be inferred that four-star wine is again very consistent, with 50% of observed values falling within a very tight range.

 

In comparing ‘TARGET’, there are:
2978 one-or-fewer observations
2938 five-or-more observations

The similar sizes of these two groups support treating the plots above as a fair comparison. The comparisons show less consistency and a wider range of ‘content values’ for the bottom-performing group than for the top performers. While the data shows that a high star rating is almost certain to result in more cases of wine being sold, the inverse is not true, and the table below shows what can be seen on the top plot of page 6.

Stars v Cases
Stars Cases Sold Range n
1 0-6 6 3042
2 0-7 7 3570
3 2-7 5 2212
4 4-8 4 612

 
 

While a one- or two-star rating can certainly result in zero cases being bought, it is also possible for a wine with this rating to be among the top performers. This is entirely reasonable when seen in terms of supply and demand. As quality decreases, so too does price. Depending on the consumer, it may be preferable to buy more of a lower-quality wine than less of a higher-quality wine, or vice versa. There are many possible combinations here.

 
 
 
 

Data Preparation

The above graphic shows only the primary variables with interacting missing values; it does not include all variables with missing values. The variables whose NA values are replaced with their mean are: Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, and pH. The NAs for these variables are assumed to be missing data points, as these contents are present in all wines.

The variables whose NA values are replaced with a value of ‘0’ are: Sulphates and Alcohol. These values are replaced with zero because sulphate-free and alcohol-free wines are produced, so it is assumed that their NA values indicate the absence of these substances.
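The two replacement rules above can be sketched as follows. This is only an illustration in plain Python, assuming each observation is a dictionary with `None` standing in for NA; the helper name `impute` and the three-row sample (a few columns taken from the first rows of the data listing) are assumptions made for the example, not the code used for this report.

```python
from statistics import mean

# Columns filled with the column mean, and columns filled with zero.
MEAN_FILL = ["ResidualSugar", "Chlorides", "FreeSulfurDioxide",
             "TotalSulfurDioxide", "pH"]
ZERO_FILL = ["Sulphates", "Alcohol"]

def impute(rows):
    """Return copies of rows with None replaced by the column mean or by 0."""
    col_means = {}
    for col in MEAN_FILL:
        observed = [r[col] for r in rows if r.get(col) is not None]
        if observed:
            col_means[col] = mean(observed)
    filled = []
    for r in rows:
        new = dict(r)  # leave the input rows untouched
        for col in MEAN_FILL:
            if col in new and new[col] is None and col in col_means:
                new[col] = col_means[col]
        for col in ZERO_FILL:
            if col in new and new[col] is None:
                new[col] = 0.0
        filled.append(new)
    return filled

# Small sample; None marks NA.
rows = [
    {"INDEX": 1, "Chlorides": -0.567, "FreeSulfurDioxide": None, "Alcohol": 9.9},
    {"INDEX": 2, "Chlorides": -0.425, "FreeSulfurDioxide": 15, "Alcohol": None},
    {"INDEX": 6, "Chlorides": None, "FreeSulfurDioxide": -167, "Alcohol": 13.7},
]
clean = impute(rows)
for r in clean:
    print(r)
```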

 

Based on these assumptions about Sulphates and Alcohol, along with the graphic above, it is further assumed that when a missing STARS value coincides with a missing value in either of these variables, the rating is missing because no review was conducted, and the NA is therefore replaced with 0. All other missing STARS values are replaced with 1 unless more than three NA values are present for a given index. This is not perfect, but as seen on page 4, the distribution of one-star ratings most closely takes the shape of the ‘cases bought’ histogram on page 3.

The information below contrasts the STARS variable for wines with 5 or more cases bought against wines with 1 or fewer.

5 or more: 2938 observations, 143 NAs (min=1, max=4)

1 or fewer: 2978 observations, 2164 NAs (min=1, max=2)

 
 
 
 

Booster Variable Creation

In an attempt to give more weight to the values of the ‘content variables’ that lie outside the inner quartiles, a ‘Booster’ variable has been created for each variable in the plots on pages 7 and 8. These variables were created using only the comparison by number of cases bought.

 

A temporary dataframe was used to hold the values of each individual variable from the primary data after the missing values were removed or replaced. After a variable was added, its values were divided evenly into 200 bins, numbered in order from 1 to 200, and a new variable recording each observation’s bin number (its rank) was created. Example below:

Variable Creation – Value Rank
FixedAcidity faD VolatileAcidity vaD CitricAcid caD
3.2 82 1.160 123 -0.98 64
4.5 87 0.160 92 -0.81 69
7.1 96 2.640 168 -0.88 67
5.7 91 0.385 99 0.04 93
8.0 100 0.330 97 -1.26 56
11.3 112 0.320 97 0.59 108
7.7 99 0.290 96 -0.40 81
6.5 94 -1.220 49 0.34 101
14.8 126 0.270 95 1.05 121
5.5 90 -0.220 80 0.39 103
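One way to read the ranking step is as equal-width binning over the variable's observed range. The sketch below illustrates that reading only; the function name `bin_rank`, the right-closed interval convention, and the range 0 to 20 are assumptions chosen for the example, not the range of any variable in this data.

```python
import math

def bin_rank(value, lo, hi, n_bins=200):
    """Map value to a bin number 1..n_bins over [lo, hi].

    Bins are equal-width and right-closed, so a value exactly on a boundary
    belongs to the lower bin; the minimum itself is placed in bin 1.
    """
    k = math.ceil((value - lo) * n_bins / (hi - lo))
    return min(max(k, 1), n_bins)

# Hypothetical range, for illustration only.
lo, hi = 0.0, 20.0
for v in (0.0, 3.25, 10.0, 20.0):
    print(v, bin_rank(v, lo, hi))
```

Because every observation is ranked against the full range before the data is split, the same bin number refers to the same interval in both the top and bottom groups.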

 

The ‘TARGET’ variable was appended to this dataframe to allow sorting and separation by ‘Cases Bought,’ which was again done by ‘five or more’ and ‘one or fewer.’ Once this was prepared, the density value was calculated for each range of values that fell within an observation’s assigned rank (demonstrated above where ‘vaD’ = 97).

Since the ranking value applied to the entire data before separation, it was then possible to find the difference in density values between the top and bottom ~23% of wines that were bought. This difference was calculated and the data was regrouped in its original form with the new ‘Booster’ variables included.
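The difference step can be sketched roughly as below. This is a simplified stand-in, assuming that the "density" of a bin is the share of a group's observations that fall in it; the report's actual density calculation may differ, and the function name `booster` is hypothetical. The six bin numbers and TARGET values are the first six rows of the tables in this section, so the output is a toy illustration and will not match the reported Booster values.

```python
from collections import Counter

def booster(bins, targets, top_min=5, bottom_max=1):
    """Per-observation difference between the top and bottom groups' bin shares."""
    top = Counter(b for b, t in zip(bins, targets) if t >= top_min)
    bottom = Counter(b for b, t in zip(bins, targets) if t <= bottom_max)
    n_top = sum(top.values()) or 1        # avoid division by zero
    n_bottom = sum(bottom.values()) or 1
    # Counter returns 0 for bins a group never visits.
    diff = {b: top[b] / n_top - bottom[b] / n_bottom for b in set(bins)}
    return [diff[b] for b in bins]

fa_bins = [82, 87, 96, 91, 100, 112]   # FixedAcidity bin numbers
target = [3, 3, 5, 3, 4, 0]            # cases bought
print(booster(fa_bins, target))
```

Bins visited only by middle-performing wines receive a value of 0, which mirrors the many zero entries in the Booster columns.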

Data – Includes Booster Variable
INDEX TARGET FixedAcidity faD VolatileAcidity vaD CitricAcid caD
1 3 3.2 0.0006781 1.160 0.0002012 -0.98 0.0053076
2 3 4.5 0.0000000 0.160 0.0420080 -0.81 0.0000000
4 5 7.1 0.0223944 2.640 -0.0002899 -0.88 0.0028418
5 3 5.7 0.0114750 0.385 -0.0012175 0.04 -0.0127964
6 4 8.0 -0.0002560 0.330 0.0000000 -1.26 0.0005886
7 0 11.3 -0.0085295 0.320 0.0000000 0.59 -0.0003150
8 0 7.7 0.0000000 0.290 0.0072418 -0.40 -0.0052081
11 4 6.5 0.0051996 -1.220 -0.0006350 0.34 0.0293321
12 3 14.8 0.0019500 0.270 0.0000000 1.05 -0.0022201
13 6 5.5 0.0010389 -0.220 0.0000000 0.39 -0.0035294

However, a flaw presented itself in this created variable: the position of a value relative to the middle ~50% of the data was not accounted for in the calculations. To resolve this, the distance from the inner ~50% was calculated for both the top and bottom performers.

Once this distance was found for both groups, it had to be determined whether the difference in extremes represented actual high performance overall or simply a difference in extremes. A position variable was created to measure how close a Boosted value was to the middle data: the closer it was, the closer the ‘Boost’ was pulled toward 0, and the farther it was, the more the position variable exaggerated the ‘Boost’ value. The Boost variable is represented in the bottom two plots below.

 
 

 
 
 
 

Models

For all following models, the variables are selected via backward stepwise selection, and the final model is presented.

1.1 - Poisson

## 
## Call:
## glm(formula = Y ~ X, family = "poisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1375  -0.6828   0.1227   0.6041   2.8521  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          4.854e-01  1.326e-02  36.602  < 2e-16 ***
## XVolatileAcidity    -2.803e-02  6.604e-03  -4.244 2.20e-05 ***
## XvaD                 2.644e+00  3.766e-01   7.022 2.19e-12 ***
## XcaD                 1.317e+00  2.379e-01   5.535 3.11e-08 ***
## XrsD                 5.790e-01  8.767e-02   6.604 4.00e-11 ***
## XChlorides          -3.614e-02  1.649e-02  -2.191 0.028427 *  
## XchD                 9.789e-01  1.115e-01   8.776  < 2e-16 ***
## XFreeSulfurDioxide   1.159e-04  3.517e-05   3.296 0.000979 ***
## XfsdD                1.514e+00  1.713e-01   8.841  < 2e-16 ***
## XTotalSulfurDioxide  6.799e-05  2.290e-05   2.969 0.002990 ** 
## XtsdD                2.888e+00  3.066e-01   9.418  < 2e-16 ***
## XdD                  1.493e+00  2.128e-01   7.016 2.29e-12 ***
## XphD                 4.611e+00  1.106e+00   4.169 3.07e-05 ***
## XSulphates          -1.244e-02  5.648e-03  -2.202 0.027640 *  
## XslphD               1.533e+00  6.951e-01   2.205 0.027431 *  
## XalcD                2.354e+00  4.819e-01   4.885 1.03e-06 ***
## XLabelAppeal         1.462e-01  6.095e-03  23.988  < 2e-16 ***
## XaiD                 9.859e-01  8.339e-02  11.823  < 2e-16 ***
## XSTARS               3.012e-01  5.371e-03  56.075  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 15519  on 12776  degrees of freedom
## AIC: 47499
## 
## Number of Fisher Scoring iterations: 5

This model produces an AIC value much higher than its Poisson comparisons (models 1.2 and 1.3), and also has a higher RMSE value. The signs of the coefficients all make sense. Testing for overdispersion shows there is no need to use a negative binomial model. This model will not be kept.

 
 

1.2 - Hurdle Poisson

## 
## Call:
## hurdle(formula = Y ~ X | X1, dist = "poisson", link = "logit")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25511 -0.43294  0.02967  0.42416  7.57606 
## 
## Count model coefficients (truncated poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.035316   0.017875  57.919  < 2e-16 ***
## XcaD         1.063463   0.254520   4.178 2.94e-05 ***
## XchD         0.537298   0.118250   4.544 5.53e-06 ***
## XfsdD        0.773635   0.182356   4.242 2.21e-05 ***
## XdD          1.103349   0.222893   4.950 7.42e-07 ***
## XAlcohol     0.003400   0.001266   2.685  0.00726 ** 
## XalcD        2.807490   0.510890   5.495 3.90e-08 ***
## XLabelAppeal 0.246079   0.006560  37.514  < 2e-16 ***
## XSTARS       0.097996   0.005994  16.349  < 2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.9990257  0.1618130  -6.174 6.66e-10 ***
## X1VolatileAcidity    -0.1117414  0.0344915  -3.240  0.00120 ** 
## X1vaD                10.6059536  2.2189321   4.780 1.76e-06 ***
## X1caD                 3.5768263  1.3562796   2.637  0.00836 ** 
## X1rsD                 3.1743268  0.4158827   7.633 2.30e-14 ***
## X1Chlorides          -0.1876219  0.0862815  -2.175  0.02967 *  
## X1chD                 4.5927659  0.6641207   6.916 4.66e-12 ***
## X1FreeSulfurDioxide   0.0004811  0.0001879   2.560  0.01047 *  
## X1fsdD                3.6933388  0.7374297   5.008 5.49e-07 ***
## X1TotalSulfurDioxide  0.0006100  0.0001199   5.089 3.60e-07 ***
## X1tsdD               12.1521843  1.4214979   8.549  < 2e-16 ***
## X1dD                  5.1431695  1.1721187   4.388 1.14e-05 ***
## X1pH                 -0.1055445  0.0404786  -2.607  0.00912 ** 
## X1phD                28.9694160  5.9103422   4.901 9.51e-07 ***
## X1Sulphates          -0.1541194  0.0299663  -5.143 2.70e-07 ***
## X1Alcohol            -0.0472699  0.0065592  -7.207 5.73e-13 ***
## X1alcD                5.3157786  2.6925118   1.974  0.04835 *  
## X1LabelAppeal        -0.4423819  0.0314124 -14.083  < 2e-16 ***
## X1aiD                 4.6974842  0.3928270  11.958  < 2e-16 ***
## X1STARS               2.4917448  0.0650581  38.300  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 17 
## Log-likelihood: -2.074e+04 on 29 Df

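In the zero hurdle component, the coefficients for vaD, tsdD, and phD are notably large (10.6, 12.2, and 29.0), which partly reflects the very small scale of the Booster variables. LabelAppeal is negative in the zero hurdle component (-0.44) but positive in the count component (0.25): a more appealing label is associated with a lower chance of any cases being bought, but with more cases among the wines that do sell. Given the lack of overdispersion in the data, this model is promising.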
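To make the two-part structure concrete, here is a minimal sketch of a hurdle Poisson probability mass function. A binary part decides whether the count is positive, and a zero-truncated Poisson describes the size of positive counts. The parameter values below are made up for illustration and are not the fitted values from model 1.2.

```python
import math

def hurdle_poisson_pmf(k, p_positive, lam):
    """P(Y = k) for a hurdle Poisson.

    p_positive: probability of crossing the hurdle (Y > 0), from the binary part.
    lam: rate of the Poisson that is truncated at zero for the positive counts.
    """
    if k == 0:
        return 1.0 - p_positive
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return p_positive * poisson / (1.0 - math.exp(-lam))

# Hypothetical parameters.
p_positive, lam = 0.8, 2.5
pmf = [hurdle_poisson_pmf(k, p_positive, lam) for k in range(60)]
print(pmf[:5])
print(sum(pmf))  # close to 1
```

In this parameterization a positive coefficient in the binary part raises the probability of a non-zero count, which is why STARS carries a positive sign in the zero hurdle component.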

 
 

1.3 - Zero-inflated Poisson

## 
## Call:
## zeroinfl(formula = Y ~ X | X1, dist = "poisson", link = "logit")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.23308 -0.41578  0.03286  0.41186  8.82854 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.031147   0.017566  58.703  < 2e-16 ***
## XcaD         0.962806   0.251065   3.835 0.000126 ***
## XchD         0.503305   0.116107   4.335 1.46e-05 ***
## XfsdD        0.775146   0.178079   4.353 1.34e-05 ***
## XdD          1.033867   0.217667   4.750 2.04e-06 ***
## XAlcohol     0.003280   0.001230   2.667 0.007659 ** 
## XalcD        2.780746   0.498511   5.578 2.43e-08 ***
## XLabelAppeal 0.235873   0.006348  37.156  < 2e-16 ***
## XSTARS       0.106517   0.005899  18.058  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           8.378e-01  1.866e-01   4.490 7.11e-06 ***
## X1VolatileAcidity     1.130e-01  4.000e-02   2.826  0.00472 ** 
## X1vaD                -1.196e+01  2.671e+00  -4.477 7.56e-06 ***
## X1caD                -4.306e+00  1.653e+00  -2.605  0.00919 ** 
## X1rsD                -3.464e+00  4.719e-01  -7.340 2.13e-13 ***
## X1Chlorides           2.050e-01  1.004e-01   2.043  0.04105 *  
## X1chD                -5.184e+00  7.974e-01  -6.501 8.00e-11 ***
## X1FreeSulfurDioxide  -5.649e-04  2.195e-04  -2.573  0.01008 *  
## X1fsdD               -3.624e+00  8.490e-01  -4.269 1.97e-05 ***
## X1TotalSulfurDioxide -6.803e-04  1.392e-04  -4.889 1.02e-06 ***
## X1tsdD               -1.382e+01  1.616e+00  -8.552  < 2e-16 ***
## X1dD                 -5.859e+00  1.371e+00  -4.274 1.92e-05 ***
## X1pH                  1.162e-01  4.705e-02   2.470  0.01351 *  
## X1phD                -3.227e+01  6.888e+00  -4.685 2.80e-06 ***
## X1Sulphates           1.875e-01  3.526e-02   5.317 1.05e-07 ***
## X1Alcohol             5.734e-02  7.581e-03   7.565 3.89e-14 ***
## X1LabelAppeal         6.652e-01  3.931e-02  16.920  < 2e-16 ***
## X1aiD                -5.360e+00  4.541e-01 -11.804  < 2e-16 ***
## X1STARS              -2.696e+00  7.744e-02 -34.809  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 38 
## Log-likelihood: -2.081e+04 on 28 Df

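Many coefficients in the zero-inflation component are negative, including the one for STARS, and they are almost the mirror image of the zero hurdle coefficients in model 1.2. This is expected: the zero-inflation component models the probability of an excess zero, whereas the zero hurdle component models the probability of a non-zero count. The AIC for this model (41679.43) is higher than for model 1.2 (41540.90).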

 
 

2.1 - Negative Binomial

## 
## Call:
## glm.nb(formula = Y ~ X, init.theta = 50093.12961, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1374  -0.6828   0.1227   0.6041   2.8521  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          4.854e-01  1.326e-02  36.600  < 2e-16 ***
## XVolatileAcidity    -2.803e-02  6.605e-03  -4.244 2.20e-05 ***
## XvaD                 2.644e+00  3.766e-01   7.021 2.20e-12 ***
## XcaD                 1.317e+00  2.379e-01   5.535 3.11e-08 ***
## XrsD                 5.790e-01  8.767e-02   6.604 4.00e-11 ***
## XChlorides          -3.614e-02  1.649e-02  -2.191 0.028428 *  
## XchD                 9.789e-01  1.116e-01   8.776  < 2e-16 ***
## XFreeSulfurDioxide   1.159e-04  3.517e-05   3.296 0.000979 ***
## XfsdD                1.514e+00  1.713e-01   8.841  < 2e-16 ***
## XTotalSulfurDioxide  6.800e-05  2.290e-05   2.969 0.002991 ** 
## XtsdD                2.888e+00  3.066e-01   9.418  < 2e-16 ***
## XdD                  1.493e+00  2.128e-01   7.016 2.29e-12 ***
## XphD                 4.611e+00  1.106e+00   4.169 3.07e-05 ***
## XSulphates          -1.244e-02  5.649e-03  -2.202 0.027641 *  
## XslphD               1.533e+00  6.951e-01   2.205 0.027435 *  
## XalcD                2.354e+00  4.819e-01   4.885 1.03e-06 ***
## XLabelAppeal         1.462e-01  6.096e-03  23.987  < 2e-16 ***
## XaiD                 9.859e-01  8.339e-02  11.823  < 2e-16 ***
## XSTARS               3.012e-01  5.371e-03  56.073  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(50093.13) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 15519  on 12776  degrees of freedom
## AIC: 47502
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  50093 
##           Std. Err.:  56210 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -47461.65

This model will not be kept for the same reasons described with model 1.1.

 
 

2.2 - Hurdle Negative Binomial

## 
## Call:
## hurdle(formula = Y ~ X | X1, dist = "negbin", link = "logit")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25511 -0.43294  0.02967  0.42416  7.57607 
## 
## Count model coefficients (truncated negbin with log link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.035318   0.017875  57.919  < 2e-16 ***
## XcaD          1.063464   0.254520   4.178 2.94e-05 ***
## XchD          0.537303   0.118250   4.544 5.53e-06 ***
## XfsdD         0.773584   0.182356   4.242 2.21e-05 ***
## XdD           1.103419   0.222893   4.950 7.40e-07 ***
## XAlcohol      0.003400   0.001266   2.685  0.00726 ** 
## XalcD         2.807634   0.510889   5.496 3.89e-08 ***
## XLabelAppeal  0.246080   0.006560  37.514  < 2e-16 ***
## XSTARS        0.097996   0.005994  16.349  < 2e-16 ***
## Log(theta)   17.834858   1.383143  12.894  < 2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.9990257  0.1618130  -6.174 6.66e-10 ***
## X1VolatileAcidity    -0.1117414  0.0344915  -3.240  0.00120 ** 
## X1vaD                10.6059536  2.2189321   4.780 1.76e-06 ***
## X1caD                 3.5768263  1.3562796   2.637  0.00836 ** 
## X1rsD                 3.1743268  0.4158827   7.633 2.30e-14 ***
## X1Chlorides          -0.1876219  0.0862815  -2.175  0.02967 *  
## X1chD                 4.5927659  0.6641207   6.916 4.66e-12 ***
## X1FreeSulfurDioxide   0.0004811  0.0001879   2.560  0.01047 *  
## X1fsdD                3.6933388  0.7374297   5.008 5.49e-07 ***
## X1TotalSulfurDioxide  0.0006100  0.0001199   5.089 3.60e-07 ***
## X1tsdD               12.1521843  1.4214979   8.549  < 2e-16 ***
## X1dD                  5.1431695  1.1721187   4.388 1.14e-05 ***
## X1pH                 -0.1055445  0.0404786  -2.607  0.00912 ** 
## X1phD                28.9694160  5.9103422   4.901 9.51e-07 ***
## X1Sulphates          -0.1541194  0.0299663  -5.143 2.70e-07 ***
## X1Alcohol            -0.0472699  0.0065592  -7.207 5.73e-13 ***
## X1alcD                5.3157786  2.6925118   1.974  0.04835 *  
## X1LabelAppeal        -0.4423819  0.0314124 -14.083  < 2e-16 ***
## X1aiD                 4.6974842  0.3928270  11.958  < 2e-16 ***
## X1STARS               2.4917448  0.0650581  38.300  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 55664756.2322
## Number of iterations in BFGS optimization: 44 
## Log-likelihood: -2.074e+04 on 30 Df

The coefficients are essentially identical to those of model 1.2, including the large values for vaD, tsdD, and phD and the negative LabelAppeal in the zero hurdle component. With theta this large, the negative binomial is effectively a Poisson. This was not the chosen model, but it will be kept in case overdispersion should arise in the data.

 
 

2.3 - Zero-inflated Negative Binomial

## 
## Call:
## zeroinfl(formula = Y ~ X | X1, dist = "negbin", link = "logit")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.23308 -0.41578  0.03286  0.41186  8.82828 
## 
## Count model coefficients (negbin with log link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.031147   0.017566  58.703  < 2e-16 ***
## XcaD          0.962790   0.251065   3.835 0.000126 ***
## XchD          0.503305   0.116107   4.335 1.46e-05 ***
## XfsdD         0.775151   0.178079   4.353 1.34e-05 ***
## XdD           1.033874   0.217667   4.750 2.04e-06 ***
## XAlcohol      0.003280   0.001230   2.667 0.007659 ** 
## XalcD         2.780724   0.498511   5.578 2.43e-08 ***
## XLabelAppeal  0.235873   0.006348  37.156  < 2e-16 ***
## XSTARS        0.106517   0.005899  18.058  < 2e-16 ***
## Log(theta)   17.860893        NaN     NaN      NaN    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           8.378e-01  1.866e-01   4.490 7.11e-06 ***
## X1VolatileAcidity     1.130e-01  4.000e-02   2.826  0.00472 ** 
## X1vaD                -1.196e+01  2.671e+00  -4.477 7.56e-06 ***
## X1caD                -4.307e+00  1.653e+00  -2.605  0.00917 ** 
## X1rsD                -3.464e+00  4.719e-01  -7.340 2.13e-13 ***
## X1Chlorides           2.050e-01  1.004e-01   2.043  0.04105 *  
## X1chD                -5.184e+00  7.974e-01  -6.501 8.00e-11 ***
## X1FreeSulfurDioxide  -5.649e-04  2.195e-04  -2.573  0.01007 *  
## X1fsdD               -3.624e+00  8.490e-01  -4.269 1.97e-05 ***
## X1TotalSulfurDioxide -6.803e-04  1.392e-04  -4.889 1.02e-06 ***
## X1tsdD               -1.382e+01  1.616e+00  -8.552  < 2e-16 ***
## X1dD                 -5.859e+00  1.371e+00  -4.274 1.92e-05 ***
## X1pH                  1.162e-01  4.705e-02   2.470  0.01351 *  
## X1phD                -3.227e+01  6.888e+00  -4.684 2.81e-06 ***
## X1Sulphates           1.875e-01  3.526e-02   5.317 1.05e-07 ***
## X1Alcohol             5.734e-02  7.581e-03   7.565 3.89e-14 ***
## X1LabelAppeal         6.652e-01  3.931e-02  16.920  < 2e-16 ***
## X1aiD                -5.360e+00  4.541e-01 -11.804  < 2e-16 ***
## X1STARS              -2.696e+00  7.744e-02 -34.809  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 57133024.4533 
## Number of iterations in BFGS optimization: 38 
## Log-likelihood: -2.081e+04 on 29 Df

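STARS has a negative coefficient in the zero-inflation component. Because that component models the probability of an excess zero, this means a higher rating lowers the chance of zero cases being bought, not that it lowers sales. The coefficients for vaD, tsdD, and phD are again large in magnitude. The standard error for Log(theta) could not be estimated (NaN) and theta is extremely large, so the negative binomial adds nothing over the Poisson version. This model will not be kept.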

 
 

3.1 - Multiple Linear Regression - Default Variables only

## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2904 -1.0317  0.1718  1.0444  5.2458 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.545e+00  4.800e-01   5.303 1.16e-07 ***
## XFixedAcidity       -8.951e-03  2.023e-03  -4.425 9.75e-06 ***
## XVolatileAcidity    -1.242e-01  1.632e-02  -7.613 2.86e-14 ***
## XChlorides          -1.760e-01  4.115e-02  -4.276 1.91e-05 ***
## XFreeSulfurDioxide   4.412e-04  8.814e-05   5.006 5.64e-07 ***
## XTotalSulfurDioxide  3.105e-04  5.663e-05   5.482 4.28e-08 ***
## XDensity            -1.416e+00  4.814e-01  -2.942  0.00326 ** 
## XSulphates          -6.293e-02  1.419e-02  -4.436 9.25e-06 ***
## XLabelAppeal         4.109e-01  1.498e-02  27.438  < 2e-16 ***
## XSTARS               1.150e+00  1.403e-02  81.942  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.444 on 12785 degrees of freedom
## Multiple R-squared:  0.4386, Adjusted R-squared:  0.4382 
## F-statistic:  1110 on 9 and 12785 DF,  p-value: < 2.2e-16

This linear model was fitted to compare the original variables with the created ones. While there is nothing notable about the content variables, the coefficients for LabelAppeal and STARS both make sense.

 
 

3.2 - Multiple Linear Regression - Created Variables only

## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1390 -1.2539  0.2445  1.2553  5.7049 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.93292    0.01967 149.111  < 2e-16 ***
## XfaD         6.57918    1.72174   3.821 0.000133 ***
## XvaD        12.68931    1.20544  10.527  < 2e-16 ***
## XcaD         6.40934    0.76617   8.365  < 2e-16 ***
## XrsD         1.67788    0.26015   6.450 1.16e-10 ***
## XchD         4.75876    0.36122  13.174  < 2e-16 ***
## XfsdD        5.40947    0.47622  11.359  < 2e-16 ***
## XtsdD       10.47594    0.88417  11.848  < 2e-16 ***
## XdD          6.98614    0.66872  10.447  < 2e-16 ***
## XphD        20.34403    3.44408   5.907 3.57e-09 ***
## XslphD      10.73206    2.16247   4.963 7.03e-07 ***
## XalcD       13.21605    1.54490   8.555  < 2e-16 ***
## XaiD         3.42672    0.25791  13.287  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.792 on 12782 degrees of freedom
## Multiple R-squared:  0.1354, Adjusted R-squared:  0.1346 
## F-statistic: 166.8 on 12 and 12782 DF,  p-value: < 2.2e-16

This is the model that model 3.1 was compared against; it consists only of the created ‘Booster’ variables. It fits much more poorly than model 3.1, as seen in their R^2 values. The pH boost variable (phD) has the largest positive coefficient, followed by the boost variables for Alcohol, VolatileAcidity, Sulphates, and TotalSulfurDioxide.

 
 

3.3 - Multiple Linear Regression - All Variables

## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7538 -0.9275  0.1540  0.9668  4.8665 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.965e+00  4.590e-01   4.282 1.87e-05 ***
## XVolatileAcidity    -7.518e-02  1.576e-02  -4.772 1.85e-06 ***
## XvaD                 7.729e+00  9.378e-01   8.241  < 2e-16 ***
## XcaD                 3.873e+00  5.898e-01   6.568 5.31e-11 ***
## XrsD                 1.363e+00  2.000e-01   6.813 9.97e-12 ***
## XChlorides          -1.184e-01  3.929e-02  -3.013  0.00259 ** 
## XchD                 3.050e+00  2.785e-01  10.953  < 2e-16 ***
## XFreeSulfurDioxide   2.937e-04  8.423e-05   3.487  0.00049 ***
## XfsdD                3.533e+00  3.673e-01   9.619  < 2e-16 ***
## XTotalSulfurDioxide  1.746e-04  5.435e-05   3.213  0.00132 ** 
## XtsdD                7.000e+00  6.830e-01  10.248  < 2e-16 ***
## XDensity            -7.761e-01  4.604e-01  -1.686  0.09191 .  
## XdD                  4.400e+00  5.163e-01   8.521  < 2e-16 ***
## XphD                 1.361e+01  2.648e+00   5.140 2.79e-07 ***
## XSulphates          -4.151e-02  1.355e-02  -3.063  0.00220 ** 
## XalcD                8.607e+00  1.189e+00   7.240 4.74e-13 ***
## XLabelAppeal         4.469e-01  1.433e-02  31.196  < 2e-16 ***
## XaiD                 2.581e+00  1.891e-01  13.647  < 2e-16 ***
## XSTARS               1.035e+00  1.377e-02  75.194  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.377 on 12776 degrees of freedom
## Multiple R-squared:  0.4897, Adjusted R-squared:  0.4889 
## F-statistic:   681 on 18 and 12776 DF,  p-value: < 2.2e-16

This model finds the most significant variables among both those that were created and those that were not. By using both types, the R^2 value increases by 5.11 percentage points over model 3.1 (from 0.4386 to 0.4897), so that this model explains 48.97% of the variation in the response. This model will be kept while models 3.1 and 3.2 will not.

 
 
 
 

Evaluating Models

Among the three multiple linear regression models, the model with the original variables fits much better (R^2=0.4386) than the model with only the created variables (R^2=0.1354). However, when both sets of variables were offered to Model 3.3, the variation explained by the model rose by 5.11 percentage points (R^2=0.4897).
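Since model 3.3 uses twice as many predictors as model 3.1, it is worth confirming that the gain survives the penalty for extra terms. The sketch below applies the standard adjusted R^2 formula to the rounded R^2 values, sample size, and predictor counts reported in the model summaries; because the inputs are rounded, the results can differ from the reported adjusted values in the last digit.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n = 12795
models = {
    "3.1": (0.4386, 9),    # original variables only
    "3.2": (0.1354, 12),   # created variables only
    "3.3": (0.4897, 18),   # both
}
for name, (r2, p) in models.items():
    print(name, round(adjusted_r2(r2, n, p), 4))

# Gain of model 3.3 over model 3.1, in percentage points.
print(round((0.4897 - 0.4386) * 100, 2))
```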

 

Poisson and Negative Binomial Model Evaluation
Model AIC RMSE
1.1 47499.41 1.485925
1.2 41540.90 1.355723
1.3 41679.43 1.355261
2.1 47501.65 1.485925
2.2 41542.91 1.355723
2.3 41681.43 1.355261
## 
##  Overdispersion test
## 
## data:  poisson
## z = -11.588, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.8714349
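For intuition about what a dispersion value below 1 means, a common quick check is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below shows that statistic on made-up numbers. It is not necessarily the estimator used by the test output above, and the function name and toy data are assumptions.

```python
def pearson_dispersion(y, mu, n_params):
    """Sum of squared Pearson residuals over residual degrees of freedom.

    Values near 1 are consistent with a Poisson model; values well above 1
    suggest overdispersion, values below 1 suggest underdispersion.
    """
    if len(y) != len(mu):
        raise ValueError("y and mu must have the same length")
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Toy observed counts and fitted means (hypothetical).
y = [0, 1, 2, 3, 4]
mu = [1.0, 1.0, 2.0, 3.0, 3.0]
print(pearson_dispersion(y, mu, n_params=1))
```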

 
 

The chosen model is model 1.2, the Hurdle Poisson model. Its RMSE (1.355723) is marginally higher than that of model 1.3 (1.355261), but that difference is far too small to justify an AIC that is roughly 139 higher (41679.43 versus 41540.90). The same pattern holds for both the Poisson (1.x) and negative binomial (2.x) models. The decision was then between model 1.2 (Hurdle Poisson) and model 2.2 (Hurdle Negative Binomial), and since overdispersion is not present in the data, the best model is 1.2.
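The AIC values being compared come from the usual definition, AIC = 2k - 2 log L, where k is the number of estimated parameters. As a quick sketch, the snippet below reproduces the AIC of model 2.1 from the "2 x log-likelihood" line in its summary (with k = 20: 19 coefficients plus theta) and computes the AIC gaps discussed above from the table values.

```python
def aic(log_lik, k):
    """Akaike information criterion for log-likelihood log_lik and k parameters."""
    return 2 * k - 2 * log_lik

# Model 2.1: the summary reports 2 x log-likelihood = -47461.65.
print(round(aic(-47461.65 / 2, 20), 2))

# AIC values from the evaluation table.
table = {"1.2": 41540.90, "1.3": 41679.43, "2.2": 41542.91, "2.3": 41681.43}
print(round(table["1.3"] - table["1.2"], 2))  # zero-inflated vs hurdle, Poisson
print(round(table["2.2"] - table["1.2"], 2))  # cost of the extra theta parameter
```

The gap of about 2 between models 1.2 and 2.2 is what one extra parameter costs when it does not improve the likelihood.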

 

This model is not perfect. It predicts far fewer values of 0 than are present in the actual data. When the actual TARGET value is 0, the model predicts values ranging from 0 to 5, including 8 predictions of 5 and 49 predictions of 4. The model does a good job of predicting a value within a small range of the actual value a majority of the time. It can produce relatively accurate predictions, but it should not be relied upon entirely.
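The two checks described here, overall RMSE and the spread of predictions for wines that actually sold zero cases, can be sketched as follows. The helper names and the five-observation toy data are assumptions for illustration; they are not the model's actual predictions.

```python
import math
from collections import Counter

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def predictions_when_zero(actual, predicted):
    """Tally rounded predictions for observations whose actual value is 0."""
    return Counter(round(p) for a, p in zip(actual, predicted) if a == 0)

# Toy data (hypothetical).
actual = [0, 0, 3, 4, 0]
predicted = [0.4, 2.6, 3.2, 3.9, 1.1]

print(rmse(actual, predicted))
print(sorted(predictions_when_zero(actual, predicted).items()))
```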