The data used for this project consist of 12,795 observations of 16 variables.
INDEX | TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 3.2 | 1.160 | -0.98 | 54.2 | -0.567 | NA | 268 | 0.99280 | 3.33 | -0.59 | 9.9 | 0 | 8 | 2 |
2 | 3 | 4.5 | 0.160 | -0.81 | 26.1 | -0.425 | 15 | -327 | 1.02792 | 3.38 | 0.70 | NA | -1 | 7 | 3 |
4 | 5 | 7.1 | 2.640 | -0.88 | 14.8 | 0.037 | 214 | 142 | 0.99518 | 3.12 | 0.48 | 22.0 | -1 | 8 | 3 |
5 | 3 | 5.7 | 0.385 | 0.04 | 18.8 | -0.425 | 22 | 115 | 0.99640 | 2.24 | 1.83 | 6.2 | -1 | 6 | 1 |
6 | 4 | 8.0 | 0.330 | -1.26 | 9.4 | NA | -167 | 108 | 0.99457 | 3.12 | 1.77 | 13.7 | 0 | 9 | 2 |
7 | 0 | 11.3 | 0.320 | 0.59 | 2.2 | 0.556 | -37 | 15 | 0.99940 | 3.20 | 1.29 | 15.4 | 0 | 11 | NA |
Very little correlation exists between most pairs of variables. The exceptions are positive correlations between TARGET and LabelAppeal (0.56) and between TARGET and STARS (0.33). The variables that describe the chemical content of the wine exhibit excess kurtosis, specifically what appears to be a leptokurtic distribution: values cluster tightly around the center, with a small number of extreme values in the tails. This suggests that the contents of the wines in this data are largely consistent across the board.
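As a rough illustration of what "leptokurtic" means here, the sketch below computes population excess kurtosis (fourth central moment over squared variance, minus 3). The function name and both toy samples are hypothetical and are not taken from the wine data; a positive result indicates a peaked, heavy-tailed shape.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Hypothetical toy samples (NOT from the wine data).
peaked = [0.0] * 8 + [-3.0, 3.0]  # most mass at the center, two extreme values
flat = [-1.0, 1.0]                # no concentration at the center

print(excess_kurtosis(peaked))  # positive: leptokurtic
print(excess_kurtosis(flat))    # negative: platykurtic
```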
Cases of Wine Bought | Frequency | Relative Frequency |
---|---|---|
0 | 2734 | 0.2136772 |
1 | 244 | 0.0190699 |
2 | 1091 | 0.0852677 |
3 | 2611 | 0.2040641 |
4 | 3177 | 0.2483001 |
5 | 2014 | 0.1574052 |
6 | 765 | 0.0597890 |
7 | 142 | 0.0110981 |
8 | 17 | 0.0013286 |
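The third column of the table above is each frequency divided by the total number of observations. A minimal sketch that reproduces it from the counts in the table, and that also derives the sizes of the "one or fewer" and "five or more" groups used in the comparisons that follow (variable names are illustrative):

```python
# Frequencies copied from the table above (cases bought -> number of wines).
counts = {0: 2734, 1: 244, 2: 1091, 3: 2611, 4: 3177,
          5: 2014, 6: 765, 7: 142, 8: 17}

total = sum(counts.values())
relative = {cases: n / total for cases, n in counts.items()}

# Sizes of the two extreme groups compared later in the report.
bottom_n = sum(n for cases, n in counts.items() if cases <= 1)
top_n = sum(n for cases, n in counts.items() if cases >= 5)

for cases, share in relative.items():
    print(f"{cases}: {share:.7f}")
print(total, bottom_n, top_n)
```

Each extreme group holds roughly 23% of the observations.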
The plots shown on the following pages group and compare the polar ends: wines that sold 5 or more cases against wines that sold 1 or fewer. Comparisons are also made between 4-star and 1-star rankings. The observations for each comparison are filtered into two datasets, from which the graphs below are generated.
In comparing chemical contents against success (determined by the extremes of ‘cases bought’ and ‘stars received’), a subtle pattern seems to emerge. In both comparisons, a wine’s success appears to rely on a consistent production process. This applies more to the number of cases bought than to the number of stars received, which makes sense: a wine buyer (whether a person, a restaurant, etc.) looking to purchase wine, in bulk or not, will likely be inclined to purchase what they know will be consistent. However, this pattern could simply be due to market forces. Producers capable of making large quantities of wine will have a more established and consistent production process in place, which in turn likely means their position in the market is relatively stable.
It is important to note the differences in size between these comparisons.
In comparing ‘STARS’, there are:
612 four-star observations
3042 one-star observations
Although the four-star sample is much smaller than the one-star sample, its distributions above suggest that four-star wine is again very consistent, with the middle 50% of observed values falling within a very tight range.
In comparing ‘TARGET’, there are:
2978 one-or-fewer observations
2938 five-or-more observations
The similar sizes of these two groups support a fair comparison in the plots above. The comparisons show less consistency and a wider range of ‘content values’ for the bottom-performing group than for the top performers. While the data shows that a high star ranking is almost certain to result in more cases of wine being sold, the inverse is not true; the table below shows what can be seen in the top plot on page 6.
Stars | Cases Sold | Range | n |
---|---|---|---|
1 | 0-6 | 6 | 3042 |
2 | 0-7 | 7 | 3570 |
3 | 2-7 | 5 | 2212 |
4 | 4-8 | 4 | 612 |
While a 1 or 2 star rating can certainly result in zero cases being bought, a wine with this sort of ranking can still be among the top performers. This is reasonable when seen through the lens of supply and demand: as quality decreases, so too does price. Depending on the consumer, it could be preferable to buy more lower-quality wine rather than less higher-quality wine, or vice versa. There are many possible combinations here.
The above graphic shows only the primary variables with interacting missing values; it does not include all variables with missing values. The variables whose NA values are replaced with the variable mean are: ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, and pH. The NAs for these variables are assumed to be missing data points, as these contents are present in all wines.
The variables whose NA values are replaced with ‘0’ are: Sulphates and Alcohol. These values are replaced with zero because sulphate-free and alcohol-free wines are produced, so it is assumed that their NA values indicate the absence of these substances.
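The two replacement rules above can be sketched as follows. This is an illustration only: the helper name `impute` is hypothetical, `None` stands in for NA, and the six rows are the sample rows from the first table, so the means here are computed on those six rows rather than on the full data. The STARS rule is not covered.

```python
MEAN_FILL = ["ResidualSugar", "Chlorides", "FreeSulfurDioxide",
             "TotalSulfurDioxide", "pH"]
ZERO_FILL = ["Sulphates", "Alcohol"]
COLUMNS = MEAN_FILL + ZERO_FILL

# Sample rows from the first table (None marks NA), in COLUMNS order.
raw = [
    (54.2, -0.567, None, 268, 3.33, -0.59, 9.9),
    (26.1, -0.425, 15, -327, 3.38, 0.70, None),
    (14.8, 0.037, 214, 142, 3.12, 0.48, 22.0),
    (18.8, -0.425, 22, 115, 2.24, 1.83, 6.2),
    (9.4, None, -167, 108, 3.12, 1.77, 13.7),
    (2.2, 0.556, -37, 15, 3.20, 1.29, 15.4),
]
rows = [dict(zip(COLUMNS, r)) for r in raw]


def impute(rows):
    """Return copies of rows with NA filled by the column mean or by zero."""
    means = {}
    for col in MEAN_FILL:
        present = [r[col] for r in rows if r[col] is not None]
        means[col] = sum(present) / len(present)
    filled = []
    for r in rows:
        new = dict(r)  # copy so the input rows are left untouched
        for col in MEAN_FILL:
            if new[col] is None:
                new[col] = means[col]
        for col in ZERO_FILL:
            if new[col] is None:
                new[col] = 0.0
        filled.append(new)
    return filled


filled = impute(rows)
print(filled[0]["FreeSulfurDioxide"], filled[4]["Chlorides"], filled[1]["Alcohol"])
```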
Based on these assumptions about Sulphates and Alcohol, along with the graphic above, it is further assumed that if a missing STARS value interacts with (occurs alongside) a missing value in either of these variables, the missing value is due to no review having been conducted, so the NA is replaced with 0. All other missing STARS values are replaced with 1, unless more than three NA values are present for a given index. This is not perfect, but as seen on page 4, the distribution of one-star ratings most closely takes the shape of the ‘cases bought’ histogram on page 3.
The information below illustrates the distinction between >=5 cases and <=1 cases bought for the STARS variable.
5 or more: 2938 observations, 143 NAs (min=1, max=4)
1 or fewer: 2978 observations, 2164 NAs (min=1, max=2)
In an attempt to give more weight to values of the ‘content variables’ that lie outside the interquartile range, a ‘Booster’ variable was created for each variable in the plots on pages 7 and 8. These variables were built only from the differences in the number of cases bought.
A temporary dataframe was used to hold the values of each individual variable from the primary data after the missing values were removed or replaced. As each variable was added, its range of values was split into 200 equal intervals, numbered 1:200 in order, and a companion variable was created holding the interval number (rank) for each observation. Example below:
FixedAcidity | faD | VolatileAcidity | vaD | CitricAcid | caD |
---|---|---|---|---|---|
3.2 | 82 | 1.160 | 123 | -0.98 | 64 |
4.5 | 87 | 0.160 | 92 | -0.81 | 69 |
7.1 | 96 | 2.640 | 168 | -0.88 | 67 |
5.7 | 91 | 0.385 | 99 | 0.04 | 93 |
8.0 | 100 | 0.330 | 97 | -1.26 | 56 |
11.3 | 112 | 0.320 | 97 | 0.59 | 108 |
7.7 | 99 | 0.290 | 96 | -0.40 | 81 |
6.5 | 94 | -1.220 | 49 | 0.34 | 101 |
14.8 | 126 | 0.270 | 95 | 1.05 | 121 |
5.5 | 90 | -0.220 | 80 | 0.39 | 103 |
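The ranking step illustrated in the table above can be sketched as equal-width binning. This is a minimal sketch under assumptions: the function name `bin_ranks` is hypothetical, intervals are taken as closed on the left, and the exact edge handling in the original analysis may differ. The example values are a toy input, not wine data.

```python
def bin_ranks(values, n_bins=200):
    """Assign each value the 1-based index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls in the first interval.
        return [1 for _ in values]
    ranks = []
    for v in values:
        idx = int((v - lo) / width) + 1
        ranks.append(min(idx, n_bins))  # the maximum value goes in the last interval
    return ranks


# Toy input chosen so that each interval has width 1.
print(bin_ranks([0.0, 50.0, 100.0, 200.0]))  # [1, 51, 101, 200]
```

Because the intervals are computed on the full data before any split by TARGET, the same rank refers to the same value range in every subgroup.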
The ‘TARGET’ variable was appended to this dataframe so that observations could be sorted and separated by ‘Cases Bought’, again into ‘five or more’ and ‘one or fewer’. Once this was prepared, a density value was calculated for each range of values that fell within an assigned rank (as in the rows above where ‘vaD’ = 97).
Since the ranking was applied to the entire data before separation, it was then possible to find the difference in density values between the top and bottom ~23% of wines bought. This difference was calculated, and the data was regrouped in its original form with the new ‘Booster’ variables included.
INDEX | TARGET | FixedAcidity | faD | VolatileAcidity | vaD | CitricAcid | caD |
---|---|---|---|---|---|---|---|
1 | 3 | 3.2 | 0.0006781 | 1.160 | 0.0002012 | -0.98 | 0.0053076 |
2 | 3 | 4.5 | 0.0000000 | 0.160 | 0.0420080 | -0.81 | 0.0000000 |
4 | 5 | 7.1 | 0.0223944 | 2.640 | -0.0002899 | -0.88 | 0.0028418 |
5 | 3 | 5.7 | 0.0114750 | 0.385 | -0.0012175 | 0.04 | -0.0127964 |
6 | 4 | 8.0 | -0.0002560 | 0.330 | 0.0000000 | -1.26 | 0.0005886 |
7 | 0 | 11.3 | -0.0085295 | 0.320 | 0.0000000 | 0.59 | -0.0003150 |
8 | 0 | 7.7 | 0.0000000 | 0.290 | 0.0072418 | -0.40 | -0.0052081 |
11 | 4 | 6.5 | 0.0051996 | -1.220 | -0.0006350 | 0.34 | 0.0293321 |
12 | 3 | 14.8 | 0.0019500 | 0.270 | 0.0000000 | 1.05 | -0.0022201 |
13 | 6 | 5.5 | 0.0010389 | -0.220 | 0.0000000 | 0.39 | -0.0035294 |
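One possible reading of the density-difference step is sketched below. It is an assumption, not the report's exact method: "density" is taken as the share of a group's observations that fall in a given rank interval, and the function name `booster` and the toy inputs are hypothetical. Both groups are assumed to be non-empty.

```python
from collections import Counter


def booster(ranks, targets, top_min=5, bottom_max=1):
    """Per-observation difference in interval share: top group minus bottom group."""
    top = [r for r, t in zip(ranks, targets) if t >= top_min]
    bottom = [r for r, t in zip(ranks, targets) if t <= bottom_max]
    top_counts = Counter(top)        # missing intervals count as 0
    bottom_counts = Counter(bottom)
    diff = {}
    for b in set(ranks):
        diff[b] = top_counts[b] / len(top) - bottom_counts[b] / len(bottom)
    # Every observation, including mid-range TARGET values, gets its interval's difference.
    return [diff[r] for r in ranks]


# Toy example: interval 1 is favored by top sellers, interval 2 by bottom sellers.
ranks = [1, 1, 1, 2, 2, 3]
targets = [5, 6, 0, 1, 1, 3]
print(booster(ranks, targets))
```

Under this reading, a value of 0 means the interval is equally common (or equally absent) among top and bottom sellers, which is consistent with the many zeros in the table above.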
However, a flaw presented itself in this created variable: the relative position from the middle ~50% of data was not accounted for in the calculations. To resolve this, the distance from the inner ~50% was calculated for both the top and bottom performers.
Once this distance was found for both groups, the extent of the ‘Boost’ depended on whether the difference in extremes was representative of actual high performance overall or simply a difference in extremes. A position variable was created to measure how close a Boosted value was to the middle data: the closer it was, the closer ‘Boost’ got to 0, and if the Boosted value was far from the middle data, the position variable exaggerated the ‘Boost’ value. The Boost variable is represented in the bottom two plots below.
For all following models, the variables are selected via backward stepwise selection, and the final model is presented.
##
## Call:
## glm(formula = Y ~ X, family = "poisson")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1375 -0.6828 0.1227 0.6041 2.8521
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.854e-01 1.326e-02 36.602 < 2e-16 ***
## XVolatileAcidity -2.803e-02 6.604e-03 -4.244 2.20e-05 ***
## XvaD 2.644e+00 3.766e-01 7.022 2.19e-12 ***
## XcaD 1.317e+00 2.379e-01 5.535 3.11e-08 ***
## XrsD 5.790e-01 8.767e-02 6.604 4.00e-11 ***
## XChlorides -3.614e-02 1.649e-02 -2.191 0.028427 *
## XchD 9.789e-01 1.115e-01 8.776 < 2e-16 ***
## XFreeSulfurDioxide 1.159e-04 3.517e-05 3.296 0.000979 ***
## XfsdD 1.514e+00 1.713e-01 8.841 < 2e-16 ***
## XTotalSulfurDioxide 6.799e-05 2.290e-05 2.969 0.002990 **
## XtsdD 2.888e+00 3.066e-01 9.418 < 2e-16 ***
## XdD 1.493e+00 2.128e-01 7.016 2.29e-12 ***
## XphD 4.611e+00 1.106e+00 4.169 3.07e-05 ***
## XSulphates -1.244e-02 5.648e-03 -2.202 0.027640 *
## XslphD 1.533e+00 6.951e-01 2.205 0.027431 *
## XalcD 2.354e+00 4.819e-01 4.885 1.03e-06 ***
## XLabelAppeal 1.462e-01 6.095e-03 23.988 < 2e-16 ***
## XaiD 9.859e-01 8.339e-02 11.823 < 2e-16 ***
## XSTARS 3.012e-01 5.371e-03 56.075 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 15519 on 12776 degrees of freedom
## AIC: 47499
##
## Number of Fisher Scoring iterations: 5
This model (1.1, the standard Poisson GLM) produces an AIC value much higher than the other Poisson models (1.2 and 1.3), and also has a higher RMSE value. The coefficients present all make sense. Testing for overdispersion indicates there is no need to use a negative binomial model. This model will not be kept.
##
## Call:
## hurdle(formula = Y ~ X | X1, dist = "poisson", link = "logit")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.25511 -0.43294 0.02967 0.42416 7.57606
##
## Count model coefficients (truncated poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.035316 0.017875 57.919 < 2e-16 ***
## XcaD 1.063463 0.254520 4.178 2.94e-05 ***
## XchD 0.537298 0.118250 4.544 5.53e-06 ***
## XfsdD 0.773635 0.182356 4.242 2.21e-05 ***
## XdD 1.103349 0.222893 4.950 7.42e-07 ***
## XAlcohol 0.003400 0.001266 2.685 0.00726 **
## XalcD 2.807490 0.510890 5.495 3.90e-08 ***
## XLabelAppeal 0.246079 0.006560 37.514 < 2e-16 ***
## XSTARS 0.097996 0.005994 16.349 < 2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9990257 0.1618130 -6.174 6.66e-10 ***
## X1VolatileAcidity -0.1117414 0.0344915 -3.240 0.00120 **
## X1vaD 10.6059536 2.2189321 4.780 1.76e-06 ***
## X1caD 3.5768263 1.3562796 2.637 0.00836 **
## X1rsD 3.1743268 0.4158827 7.633 2.30e-14 ***
## X1Chlorides -0.1876219 0.0862815 -2.175 0.02967 *
## X1chD 4.5927659 0.6641207 6.916 4.66e-12 ***
## X1FreeSulfurDioxide 0.0004811 0.0001879 2.560 0.01047 *
## X1fsdD 3.6933388 0.7374297 5.008 5.49e-07 ***
## X1TotalSulfurDioxide 0.0006100 0.0001199 5.089 3.60e-07 ***
## X1tsdD 12.1521843 1.4214979 8.549 < 2e-16 ***
## X1dD 5.1431695 1.1721187 4.388 1.14e-05 ***
## X1pH -0.1055445 0.0404786 -2.607 0.00912 **
## X1phD 28.9694160 5.9103422 4.901 9.51e-07 ***
## X1Sulphates -0.1541194 0.0299663 -5.143 2.70e-07 ***
## X1Alcohol -0.0472699 0.0065592 -7.207 5.73e-13 ***
## X1alcD 5.3157786 2.6925118 1.974 0.04835 *
## X1LabelAppeal -0.4423819 0.0314124 -14.083 < 2e-16 ***
## X1aiD 4.6974842 0.3928270 11.958 < 2e-16 ***
## X1STARS 2.4917448 0.0650581 38.300 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 17
## Log-likelihood: -2.074e+04 on 29 Df
This model has notably large coefficients in the zero hurdle component for the variables vaD, tsdD, and phD. LabelAppeal is also negative in that component, which likely offsets the high values observed in the other coefficients. Given the lack of overdispersion in the data, this model is promising.
##
## Call:
## zeroinfl(formula = Y ~ X | X1, dist = "poisson", link = "logit")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.23308 -0.41578 0.03286 0.41186 8.82854
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.031147 0.017566 58.703 < 2e-16 ***
## XcaD 0.962806 0.251065 3.835 0.000126 ***
## XchD 0.503305 0.116107 4.335 1.46e-05 ***
## XfsdD 0.775146 0.178079 4.353 1.34e-05 ***
## XdD 1.033867 0.217667 4.750 2.04e-06 ***
## XAlcohol 0.003280 0.001230 2.667 0.007659 **
## XalcD 2.780746 0.498511 5.578 2.43e-08 ***
## XLabelAppeal 0.235873 0.006348 37.156 < 2e-16 ***
## XSTARS 0.106517 0.005899 18.058 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.378e-01 1.866e-01 4.490 7.11e-06 ***
## X1VolatileAcidity 1.130e-01 4.000e-02 2.826 0.00472 **
## X1vaD -1.196e+01 2.671e+00 -4.477 7.56e-06 ***
## X1caD -4.306e+00 1.653e+00 -2.605 0.00919 **
## X1rsD -3.464e+00 4.719e-01 -7.340 2.13e-13 ***
## X1Chlorides 2.050e-01 1.004e-01 2.043 0.04105 *
## X1chD -5.184e+00 7.974e-01 -6.501 8.00e-11 ***
## X1FreeSulfurDioxide -5.649e-04 2.195e-04 -2.573 0.01008 *
## X1fsdD -3.624e+00 8.490e-01 -4.269 1.97e-05 ***
## X1TotalSulfurDioxide -6.803e-04 1.392e-04 -4.889 1.02e-06 ***
## X1tsdD -1.382e+01 1.616e+00 -8.552 < 2e-16 ***
## X1dD -5.859e+00 1.371e+00 -4.274 1.92e-05 ***
## X1pH 1.162e-01 4.705e-02 2.470 0.01351 *
## X1phD -3.227e+01 6.888e+00 -4.685 2.80e-06 ***
## X1Sulphates 1.875e-01 3.526e-02 5.317 1.05e-07 ***
## X1Alcohol 5.734e-02 7.581e-03 7.565 3.89e-14 ***
## X1LabelAppeal 6.652e-01 3.931e-02 16.920 < 2e-16 ***
## X1aiD -5.360e+00 4.541e-01 -11.804 < 2e-16 ***
## X1STARS -2.696e+00 7.744e-02 -34.809 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 38
## Log-likelihood: -2.081e+04 on 28 Df
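This model has many negative coefficients in its zero-inflation component, including for STARS. They appear almost inverse to those of model 1.2, which is expected: the zero-inflation component models the probability of an excess zero, whereas the zero hurdle component of model 1.2 models the probability of a non-zero count. The AIC for this model is higher than that of model 1.2.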
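To make the sign reversal between the two zero components concrete, the sketch below uses the textbook forms of the hurdle Poisson and zero-inflated Poisson models. Function names and the linear-predictor values are hypothetical, and the parameterization (hurdle component as the probability of a positive count, zero-inflation component as the probability of an excess zero) is assumed to match the fitted models above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hurdle_poisson(eta_count, eta_zero):
    """Return (P(Y = 0), E[Y]) for a hurdle Poisson model."""
    lam = math.exp(eta_count)
    p_positive = sigmoid(eta_zero)                 # zero component: P(Y > 0)
    mean_positive = lam / (1.0 - math.exp(-lam))   # mean of a zero-truncated Poisson
    return 1.0 - p_positive, p_positive * mean_positive


def zero_inflated_poisson(eta_count, eta_zero):
    """Return (P(Y = 0), E[Y]) for a zero-inflated Poisson model."""
    lam = math.exp(eta_count)
    p_excess = sigmoid(eta_zero)                   # zero component: P(excess zero)
    p_zero = p_excess + (1.0 - p_excess) * math.exp(-lam)
    return p_zero, (1.0 - p_excess) * lam


# Raising the zero-component predictor moves P(Y = 0) in opposite directions.
for eta_zero in (0.0, 2.0):
    print(eta_zero, hurdle_poisson(0.0, eta_zero), zero_inflated_poisson(0.0, eta_zero))
```

A positive coefficient in the hurdle component therefore lowers the chance of a zero, while a positive coefficient in the zero-inflation component raises it.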
##
## Call:
## glm.nb(formula = Y ~ X, init.theta = 50093.12961, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1374 -0.6828 0.1227 0.6041 2.8521
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.854e-01 1.326e-02 36.600 < 2e-16 ***
## XVolatileAcidity -2.803e-02 6.605e-03 -4.244 2.20e-05 ***
## XvaD 2.644e+00 3.766e-01 7.021 2.20e-12 ***
## XcaD 1.317e+00 2.379e-01 5.535 3.11e-08 ***
## XrsD 5.790e-01 8.767e-02 6.604 4.00e-11 ***
## XChlorides -3.614e-02 1.649e-02 -2.191 0.028428 *
## XchD 9.789e-01 1.116e-01 8.776 < 2e-16 ***
## XFreeSulfurDioxide 1.159e-04 3.517e-05 3.296 0.000979 ***
## XfsdD 1.514e+00 1.713e-01 8.841 < 2e-16 ***
## XTotalSulfurDioxide 6.800e-05 2.290e-05 2.969 0.002991 **
## XtsdD 2.888e+00 3.066e-01 9.418 < 2e-16 ***
## XdD 1.493e+00 2.128e-01 7.016 2.29e-12 ***
## XphD 4.611e+00 1.106e+00 4.169 3.07e-05 ***
## XSulphates -1.244e-02 5.649e-03 -2.202 0.027641 *
## XslphD 1.533e+00 6.951e-01 2.205 0.027435 *
## XalcD 2.354e+00 4.819e-01 4.885 1.03e-06 ***
## XLabelAppeal 1.462e-01 6.096e-03 23.987 < 2e-16 ***
## XaiD 9.859e-01 8.339e-02 11.823 < 2e-16 ***
## XSTARS 3.012e-01 5.371e-03 56.073 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(50093.13) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 15519 on 12776 degrees of freedom
## AIC: 47502
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 50093
## Std. Err.: 56210
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -47461.65
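This model will not be kept, for the same reasons described for model 1.1. The very large estimated theta (about 50093) also indicates that the negative binomial fit is effectively the same as the Poisson fit, as the near-identical coefficients show.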
##
## Call:
## hurdle(formula = Y ~ X | X1, dist = "negbin", link = "logit")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.25511 -0.43294 0.02967 0.42416 7.57607
##
## Count model coefficients (truncated negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.035318 0.017875 57.919 < 2e-16 ***
## XcaD 1.063464 0.254520 4.178 2.94e-05 ***
## XchD 0.537303 0.118250 4.544 5.53e-06 ***
## XfsdD 0.773584 0.182356 4.242 2.21e-05 ***
## XdD 1.103419 0.222893 4.950 7.40e-07 ***
## XAlcohol 0.003400 0.001266 2.685 0.00726 **
## XalcD 2.807634 0.510889 5.496 3.89e-08 ***
## XLabelAppeal 0.246080 0.006560 37.514 < 2e-16 ***
## XSTARS 0.097996 0.005994 16.349 < 2e-16 ***
## Log(theta) 17.834858 1.383143 12.894 < 2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9990257 0.1618130 -6.174 6.66e-10 ***
## X1VolatileAcidity -0.1117414 0.0344915 -3.240 0.00120 **
## X1vaD 10.6059536 2.2189321 4.780 1.76e-06 ***
## X1caD 3.5768263 1.3562796 2.637 0.00836 **
## X1rsD 3.1743268 0.4158827 7.633 2.30e-14 ***
## X1Chlorides -0.1876219 0.0862815 -2.175 0.02967 *
## X1chD 4.5927659 0.6641207 6.916 4.66e-12 ***
## X1FreeSulfurDioxide 0.0004811 0.0001879 2.560 0.01047 *
## X1fsdD 3.6933388 0.7374297 5.008 5.49e-07 ***
## X1TotalSulfurDioxide 0.0006100 0.0001199 5.089 3.60e-07 ***
## X1tsdD 12.1521843 1.4214979 8.549 < 2e-16 ***
## X1dD 5.1431695 1.1721187 4.388 1.14e-05 ***
## X1pH -0.1055445 0.0404786 -2.607 0.00912 **
## X1phD 28.9694160 5.9103422 4.901 9.51e-07 ***
## X1Sulphates -0.1541194 0.0299663 -5.143 2.70e-07 ***
## X1Alcohol -0.0472699 0.0065592 -7.207 5.73e-13 ***
## X1alcD 5.3157786 2.6925118 1.974 0.04835 *
## X1LabelAppeal -0.4423819 0.0314124 -14.083 < 2e-16 ***
## X1aiD 4.6974842 0.3928270 11.958 < 2e-16 ***
## X1STARS 2.4917448 0.0650581 38.300 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta: count = 55664756.2322
## Number of iterations in BFGS optimization: 44
## Log-likelihood: -2.074e+04 on 30 Df
As in model 1.2, the zero hurdle coefficients for vaD, tsdD, and phD are notably large, and LabelAppeal is negative in that component, which likely offsets the high values observed in the other coefficients. This was not the chosen model, but it will be kept in case overdispersion should arise in the data.
##
## Call:
## zeroinfl(formula = Y ~ X | X1, dist = "negbin", link = "logit")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.23308 -0.41578 0.03286 0.41186 8.82828
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.031147 0.017566 58.703 < 2e-16 ***
## XcaD 0.962790 0.251065 3.835 0.000126 ***
## XchD 0.503305 0.116107 4.335 1.46e-05 ***
## XfsdD 0.775151 0.178079 4.353 1.34e-05 ***
## XdD 1.033874 0.217667 4.750 2.04e-06 ***
## XAlcohol 0.003280 0.001230 2.667 0.007659 **
## XalcD 2.780724 0.498511 5.578 2.43e-08 ***
## XLabelAppeal 0.235873 0.006348 37.156 < 2e-16 ***
## XSTARS 0.106517 0.005899 18.058 < 2e-16 ***
## Log(theta) 17.860893 NaN NaN NaN
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.378e-01 1.866e-01 4.490 7.11e-06 ***
## X1VolatileAcidity 1.130e-01 4.000e-02 2.826 0.00472 **
## X1vaD -1.196e+01 2.671e+00 -4.477 7.56e-06 ***
## X1caD -4.307e+00 1.653e+00 -2.605 0.00917 **
## X1rsD -3.464e+00 4.719e-01 -7.340 2.13e-13 ***
## X1Chlorides 2.050e-01 1.004e-01 2.043 0.04105 *
## X1chD -5.184e+00 7.974e-01 -6.501 8.00e-11 ***
## X1FreeSulfurDioxide -5.649e-04 2.195e-04 -2.573 0.01007 *
## X1fsdD -3.624e+00 8.490e-01 -4.269 1.97e-05 ***
## X1TotalSulfurDioxide -6.803e-04 1.392e-04 -4.889 1.02e-06 ***
## X1tsdD -1.382e+01 1.616e+00 -8.552 < 2e-16 ***
## X1dD -5.859e+00 1.371e+00 -4.274 1.92e-05 ***
## X1pH 1.162e-01 4.705e-02 2.470 0.01351 *
## X1phD -3.227e+01 6.888e+00 -4.684 2.81e-06 ***
## X1Sulphates 1.875e-01 3.526e-02 5.317 1.05e-07 ***
## X1Alcohol 5.734e-02 7.581e-03 7.565 3.89e-14 ***
## X1LabelAppeal 6.652e-01 3.931e-02 16.920 < 2e-16 ***
## X1aiD -5.360e+00 4.541e-01 -11.804 < 2e-16 ***
## X1STARS -2.696e+00 7.744e-02 -34.809 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 57133024.4533
## Number of iterations in BFGS optimization: 38
## Log-likelihood: -2.081e+04 on 29 Df
The coefficient that stands out in this model is STARS, which is negative in the zero-inflation component (it is positive in the count component). There are also notable differences in vaD, tsdD, and phD. This model will not be kept.
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2904 -1.0317 0.1718 1.0444 5.2458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.545e+00 4.800e-01 5.303 1.16e-07 ***
## XFixedAcidity -8.951e-03 2.023e-03 -4.425 9.75e-06 ***
## XVolatileAcidity -1.242e-01 1.632e-02 -7.613 2.86e-14 ***
## XChlorides -1.760e-01 4.115e-02 -4.276 1.91e-05 ***
## XFreeSulfurDioxide 4.412e-04 8.814e-05 5.006 5.64e-07 ***
## XTotalSulfurDioxide 3.105e-04 5.663e-05 5.482 4.28e-08 ***
## XDensity -1.416e+00 4.814e-01 -2.942 0.00326 **
## XSulphates -6.293e-02 1.419e-02 -4.436 9.25e-06 ***
## XLabelAppeal 4.109e-01 1.498e-02 27.438 < 2e-16 ***
## XSTARS 1.150e+00 1.403e-02 81.942 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.444 on 12785 degrees of freedom
## Multiple R-squared: 0.4386, Adjusted R-squared: 0.4382
## F-statistic: 1110 on 9 and 12785 DF, p-value: < 2.2e-16
This linear model was fit to compare the original variables with the created ‘Booster’ variables. While there is nothing to note regarding the content variables, the coefficients for LabelAppeal and STARS both make sense.
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1390 -1.2539 0.2445 1.2553 5.7049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.93292 0.01967 149.111 < 2e-16 ***
## XfaD 6.57918 1.72174 3.821 0.000133 ***
## XvaD 12.68931 1.20544 10.527 < 2e-16 ***
## XcaD 6.40934 0.76617 8.365 < 2e-16 ***
## XrsD 1.67788 0.26015 6.450 1.16e-10 ***
## XchD 4.75876 0.36122 13.174 < 2e-16 ***
## XfsdD 5.40947 0.47622 11.359 < 2e-16 ***
## XtsdD 10.47594 0.88417 11.848 < 2e-16 ***
## XdD 6.98614 0.66872 10.447 < 2e-16 ***
## XphD 20.34403 3.44408 5.907 3.57e-09 ***
## XslphD 10.73206 2.16247 4.963 7.03e-07 ***
## XalcD 13.21605 1.54490 8.555 < 2e-16 ***
## XaiD 3.42672 0.25791 13.287 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.792 on 12782 degrees of freedom
## Multiple R-squared: 0.1354, Adjusted R-squared: 0.1346
## F-statistic: 166.8 on 12 and 12782 DF, p-value: < 2.2e-16
This model is what model 3.1 was compared to; it consists only of the created ‘Booster’ variables. It fits much worse than model 3.1, as seen in their R^2 values (0.1354 versus 0.4386). The pH boost variable has the largest positive coefficient, followed by the boost variables for Alcohol, VolatileAcidity, Sulphates, and TotalSulfurDioxide.
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7538 -0.9275 0.1540 0.9668 4.8665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.965e+00 4.590e-01 4.282 1.87e-05 ***
## XVolatileAcidity -7.518e-02 1.576e-02 -4.772 1.85e-06 ***
## XvaD 7.729e+00 9.378e-01 8.241 < 2e-16 ***
## XcaD 3.873e+00 5.898e-01 6.568 5.31e-11 ***
## XrsD 1.363e+00 2.000e-01 6.813 9.97e-12 ***
## XChlorides -1.184e-01 3.929e-02 -3.013 0.00259 **
## XchD 3.050e+00 2.785e-01 10.953 < 2e-16 ***
## XFreeSulfurDioxide 2.937e-04 8.423e-05 3.487 0.00049 ***
## XfsdD 3.533e+00 3.673e-01 9.619 < 2e-16 ***
## XTotalSulfurDioxide 1.746e-04 5.435e-05 3.213 0.00132 **
## XtsdD 7.000e+00 6.830e-01 10.248 < 2e-16 ***
## XDensity -7.761e-01 4.604e-01 -1.686 0.09191 .
## XdD 4.400e+00 5.163e-01 8.521 < 2e-16 ***
## XphD 1.361e+01 2.648e+00 5.140 2.79e-07 ***
## XSulphates -4.151e-02 1.355e-02 -3.063 0.00220 **
## XalcD 8.607e+00 1.189e+00 7.240 4.74e-13 ***
## XLabelAppeal 4.469e-01 1.433e-02 31.196 < 2e-16 ***
## XaiD 2.581e+00 1.891e-01 13.647 < 2e-16 ***
## XSTARS 1.035e+00 1.377e-02 75.194 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.377 on 12776 degrees of freedom
## Multiple R-squared: 0.4897, Adjusted R-squared: 0.4889
## F-statistic: 681 on 18 and 12776 DF, p-value: < 2.2e-16
This model finds the most significant variables among both the created variables and the original ones. By using both types, the R^2 value increases by 5.11 percentage points (from 0.4386 to 0.4897), so that this model explains 48.97% of the variation in the data. This model will be kept, while models 3.1 and 3.2 will not.
Between the three multiple linear regression models, the model with the original variables has a much better fit (R^2=0.4386) than the model with only the created variables (R^2=0.1354). However, when both kinds of variables are modeled together in model 3.3, the variation explained increases by 5.11 percentage points (R^2=0.4897).
Model | AIC | RMSE |
---|---|---|
1.1 | 47499.41 | 1.485925 |
1.2 | 41540.90 | 1.355723 |
1.3 | 41679.43 | 1.355261 |
2.1 | 47501.65 | 1.485925 |
2.2 | 41542.91 | 1.355723 |
2.3 | 41681.43 | 1.355261 |
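For reference, the two metrics in the table above follow standard definitions: AIC is twice the number of estimated parameters minus twice the log-likelihood, and RMSE is the square root of the mean squared prediction error. The sketch below checks the AIC for model 2.1 from its printed output (2 x log-likelihood of -47461.65, with 19 coefficients plus theta). The helper names are illustrative, and the RMSE inputs are hypothetical toy values.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * logLik."""
    return 2 * n_params - 2 * log_likelihood


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Model 2.1: "2 x log-likelihood: -47461.65", 19 coefficients + theta = 20 parameters.
print(round(aic(-47461.65 / 2, 20), 2))  # 47501.65, matching the table

# Hypothetical toy values, only to show the RMSE call.
print(rmse([0, 3, 4], [1, 3, 2]))
```

The same definition explains why each negative binomial model's AIC sits about 2 above its Poisson counterpart: the fits are nearly identical, and theta adds one parameter.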
##
## Overdispersion test
##
## data: poisson
## z = -11.588, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 0.8714349
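A simpler and commonly used check, which is not the same statistic as the formal test printed above, is the Pearson dispersion estimate: the sum of squared Pearson residuals divided by the residual degrees of freedom. Values near or below 1 give no sign of overdispersion. The function name and toy inputs below are hypothetical.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom (Poisson variance = mu)."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)


# Hypothetical observed counts and fitted means, NOT from the wine data.
y = [0, 1, 2, 3]
mu = [1.0, 1.0, 2.0, 2.0]
print(pearson_dispersion(y, mu, n_params=2))  # 0.75
```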
The model chosen is model 1.2, the Hurdle Poisson Model. While its RMSE is only the second lowest, the difference is not large enough to justify the sacrifice of roughly 139 in AIC (41679.43 versus 41540.90). The same pattern holds in both the Poisson (1.x) and negative binomial (2.x) models. The decision was then between model 1.2 (Hurdle Poisson) and model 2.2 (Hurdle Negative Binomial), and since overdispersion is not present in the data, the best model is 1.2.
This model is not perfect. It predicts far fewer values of 0 than are present in the actual data. When the actual TARGET value is 0, the model’s predictions range from 0 to 5, including 8 predictions of 5 and 49 predictions of 4. Most of the time, the model predicts a value within a small range of the actual value. It can certainly produce relatively accurate predictions; however, it should not be relied upon entirely.