In this assignment I will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
My objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. Sometimes, the fact that a variable is missing is actually predictive of the target. I can only use the variables given to me (or variables that I derive from the variables provided).
We have two datasets.
One is the wine training dataset, which includes 14 candidate predictors, 1 response variable and 12795 observations.
Other one is the wine evaluation dataset, which also includes 14 candidate predictors, 1 response variable but 16129 observations.
Below is a short description of the variables of interest in the data set:
VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
---|---|---|
INDEX | Identification Variable (do not use) | None |
TARGET | Number of Cases Purchased | None |
ACID INDEX | Proprietary method of testing total acidity of wine by using a weighted average | |
ALCOHOL | Alcohol Content | |
CHLORIDES | Chloride content of wine | |
CITRIC ACID | Citric Acid Content | |
DENSITY | Density of Wine | |
FIXED ACIDITY | Fixed Acidity of Wine | |
FREE SULFUR DIOXIDE | Sulfur Dioxide content of wine | |
LABEL APPEAL | Marketing Score indicationg the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like design. | Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales. |
RESIDUAL SUGAR | Residual Sugar of wine | |
STARS | Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor | A high number of stars suggests high sales |
SULPHATES | Sulfate content of wine | |
TOTAL SULFUR DIOXIDE | Total Sulfur Dioxide of Wine | |
VOLATILE ACIDITY | Volatile Acid content of wine. | |
pH | pH of wine |
The wine training set contains 16 columns - including the target variable TARGET - and 12,795 rows, covering a variety of different brands of wine. The data-set is entirely numerical variables, but also contains some variables that are highly discrete and have a limited number of possible values. We believe it is still reasonable to treat these as numerical variables since the different values follow a natural numerical order.
## INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1 1 3 3.2 1.160 -0.98 54.2 -0.567
## 2 2 3 4.5 0.160 -0.81 26.1 -0.425
## 3 4 5 7.1 2.640 -0.88 14.8 0.037
## 4 5 3 5.7 0.385 0.04 18.8 -0.425
## 5 6 4 8.0 0.330 -1.26 9.4 NA
## 6 7 0 11.3 0.320 0.59 2.2 0.556
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1 NA 268 0.99280 3.33 -0.59 9.9
## 2 15 -327 1.02792 3.38 0.70 NA
## 3 214 142 0.99518 3.12 0.48 22.0
## 4 22 115 0.99640 2.24 1.83 6.2
## 5 -167 108 0.99457 3.12 1.77 13.7
## 6 -37 15 0.99940 3.20 1.29 15.4
## LabelAppeal AcidIndex STARS
## 1 0 8 2
## 2 -1 7 3
## 3 -1 8 3
## 4 -1 6 1
## 5 0 9 2
## 6 0 11 NA
## vars n mean sd median trimmed mad min
## INDEX 1 12795 8069.98 4656.91 8110.00 8071.03 5977.84 1.00
## TARGET 2 12795 3.03 1.93 3.00 3.05 1.48 0.00
## FixedAcidity 3 12795 7.08 6.32 6.90 7.07 3.26 -18.10
## VolatileAcidity 4 12795 0.32 0.78 0.28 0.32 0.43 -2.79
## CitricAcid 5 12795 0.31 0.86 0.31 0.31 0.42 -3.24
## ResidualSugar 6 12179 5.42 33.75 3.90 5.58 15.72 -127.80
## Chlorides 7 12157 0.05 0.32 0.05 0.05 0.13 -1.17
## FreeSulfurDioxide 8 12148 30.85 148.71 30.00 30.93 56.34 -555.00
## TotalSulfurDioxide 9 12113 120.71 231.91 123.00 120.89 134.92 -823.00
## Density 10 12795 0.99 0.03 0.99 0.99 0.01 0.89
## pH 11 12400 3.21 0.68 3.20 3.21 0.39 0.48
## Sulphates 12 11585 0.53 0.93 0.50 0.53 0.44 -3.13
## Alcohol 13 12142 10.49 3.73 10.40 10.50 2.37 -4.70
## LabelAppeal 14 12795 -0.01 0.89 0.00 -0.01 1.48 -2.00
## AcidIndex 15 12795 7.77 1.32 8.00 7.64 1.48 4.00
## STARS 16 9436 2.04 0.90 2.00 1.97 1.48 1.00
## max range skew kurtosis se
## INDEX 16129.00 16128.00 0.00 -1.20 41.17
## TARGET 8.00 8.00 -0.33 -0.88 0.02
## FixedAcidity 34.40 52.50 -0.02 1.67 0.06
## VolatileAcidity 3.68 6.47 0.02 1.83 0.01
## CitricAcid 3.86 7.10 -0.05 1.84 0.01
## ResidualSugar 141.15 268.95 -0.05 1.88 0.31
## Chlorides 1.35 2.52 0.03 1.79 0.00
## FreeSulfurDioxide 623.00 1178.00 0.01 1.84 1.35
## TotalSulfurDioxide 1057.00 1880.00 -0.01 1.67 2.11
## Density 1.10 0.21 -0.02 1.90 0.00
## pH 6.13 5.65 0.04 1.65 0.01
## Sulphates 4.24 7.37 0.01 1.75 0.01
## Alcohol 26.50 31.20 -0.03 1.54 0.03
## LabelAppeal 2.00 4.00 0.01 -0.26 0.01
## AcidIndex 17.00 13.00 1.65 5.19 0.01
## STARS 4.00 3.00 0.45 -0.69 0.01
## [1] 16
## [1] 12795
## INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
Given that the Index column had no impact on the target variable, number of wines, it was dropped.
## 'data.frame': 12795 obs. of 15 variables:
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
## [1] 8200
The first observation is the number of missing values throughout the dataset. We have 8200 missing values. Of the 16 feature columns, 8 of them contain at least some missing values. We also see that the TARGET value is always between 0 and 8, which makes sense as this is the “Number of Cases of Wine Sold” (we would not expect partial cases).
I also note that many of the numerical features measuring the quantity of a chemical in the wine have a negative minimum value. We are assuming the original chemical measurements were normalized (possible a log transform) allowing for negative values, since technically negative concentrations shouldn’t be physically possible. As such, we chose to leave those values as-is.
I wanted to get an idea of the distribution profiles for each of the variables.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 8200 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 8200 rows containing non-finite values (`stat_density()`).
The majority of variables exhibit a somewhat normal distribution, characterized by a steep curve. Notably, variables AcidIndex and STARS display a right-skewed distribution.
A more intriguing observation is the distinctive shape of many features, which are centered with values clustered around the middle, forming a somewhat uniform shape above and below. This pattern suggests a quasi-tri-modal distribution, with low, middle, and high normal distributions overlapping.
In our analytical approach, we decide against extensive feature engineering; however, we contemplate the possibility of breaking these features into three separate components. Two potential strategies include:
Utilizing mixTools to segregate the multi-modal curves into three distinct features, each capturing exclusively low, middle, or high values while retaining numerical precision.
Employing discretization to convert the features into categorical values that indicate whether the values are low, middle, or high, offering a simplified representation for analysis and interpretation.
I also elected to use box-plots to get an idea of the spread of each variable.
## Warning: Removed 8200 rows containing non-finite values (`stat_boxplot()`).
The box plots exhibit no significant outliers across the features, suggesting that outlier detection and removal may not be necessary. Notably, AcidIndex, LabelAppeal, and STARS demonstrate categorical (ordinal) characteristics. To explore their relationship with the TARGET variable, we observe a discernible pattern: an increase in LabelAppeal corresponds to a rise in TARGET.
This correlation is also evident between STARS and TARGET. Particularly noteworthy is the strong association between STARS=NA and lower TARGET values. It’s worth mentioning that the original project instructions emphasized the potential informativeness of missing data. Consequently, I opt to impute STARS=NA with STARS=0, aligning with observed patterns where increasing stars aligns with higher
I also wanted to plot scatter plots of each variable versus the target variable, TARGET, to get an idea of the relationship between them.
Due to the discrete nature of the target variable, identifying clear
linear relationships in the data proves challenging. Nevertheless, both
STARS and LabelAppeal exhibit a significant positive correlation with
the TARGET, and several chemical features demonstrate at least some
negative association, with lower values coinciding with a higher
frequency of 8 and 7 values in the target variable.
Despite revealing interesting relationships among variables, the plots also expose significant data issues. Notably, numerous data points contain missing values, necessitating imputation or removal. Additionally, there is a concern regarding nonsensical negative values in variables measuring concentration. We have assumed these variables underwent log transformation, attributing the negative values to this process. However, this assumption lacks supporting evidence, and a reevaluation would be warranted with more information on the data collection/transformation process. we would need to reevaluate if given more information on the data collection/transformation process.
Upon our initial examination of the initial rows of raw data, I observed the presence of missing data. Now, let’s evaluate and identify the fields that contain these missing values.
## values ind
## 1 26.25 STARS
## 2 9.46 Sulphates
## 3 5.33 TotalSulfurDioxide
## 4 5.10 Alcohol
## 5 5.06 FreeSulfurDioxide
## 6 4.99 Chlorides
## 7 4.81 ResidualSugar
## 8 3.09 pH
## 9 0.00 TARGET
## 10 0.00 FixedAcidity
## 11 0.00 VolatileAcidity
## 12 0.00 CitricAcid
## 13 0.00 Density
## 14 0.00 LabelAppeal
## 15 0.00 AcidIndex
In the project specifications, it was highlighted that the absence of a specific variable could have predictive significance. Consequently, I will handle the missing values by imputing STARS=NA with STARS=0. The remaining missing data will be imputed using the caret::preProcess function with the knnImpute method. It’s important to note that preProcess will not only impute missing values but also perform centering, scaling, and BoxCox transformation on our features simultaneously.
With our missing data imputed correctly, I can now build off the scatter plots from above to quantify the correlations between our target variable and predictor variable. We will want to choose those with stronger positive or negative correlations. Features with correlations closer to zero will probably not provide any meaningful information on explaining wins by a team.
## values ind
## 1 0.685381473 STARS
## 2 0.356500469 LabelAppeal
## 3 0.062030498 Alcohol
## 4 0.051730323 TotalSulfurDioxide
## 5 0.043996542 FreeSulfurDioxide
## 6 0.016187709 ResidualSugar
## 7 0.008684633 CitricAcid
## 8 -0.009081197 pH
## 9 -0.035589560 Density
## 10 -0.039072231 Chlorides
## 11 -0.039917146 Sulphates
## 12 -0.049010939 FixedAcidity
## 13 -0.088793212 VolatileAcidity
## 14 -0.221991949 AcidIndex
STARS, LabelAppeal, and AcidIndex exhibit the most substantial correlation with the TARGET, aligning with our observations in the variable plots discussed earlier. It’s important to recall that we imputed NA values for STARS, treating them as 0 in our analysis.
A potential issue in multivariable regression is the presence of correlation between variables, known as multicollinearity. A simple way to check for this is by running correlations between the variables.
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET 1.00 -0.05 -0.09 0.01 0.02
## FixedAcidity -0.05 1.00 0.01 0.01 -0.02
## VolatileAcidity -0.09 0.01 1.00 -0.02 -0.01
## CitricAcid 0.01 0.01 -0.02 1.00 -0.01
## ResidualSugar 0.02 -0.02 -0.01 -0.01 1.00
## Chlorides -0.04 0.00 0.00 -0.01 -0.01
## FreeSulfurDioxide 0.04 0.00 -0.01 0.01 0.02
## TotalSulfurDioxide 0.05 -0.02 -0.02 0.01 0.02
## Density -0.04 0.01 0.01 -0.01 0.00
## pH -0.01 -0.01 0.01 -0.01 0.01
## Sulphates -0.04 0.03 0.00 -0.01 -0.01
## Alcohol 0.06 -0.01 0.00 0.02 -0.02
## LabelAppeal 0.36 0.00 -0.02 0.01 0.00
## AcidIndex -0.22 0.17 0.04 0.06 -0.01
## STARS 0.69 -0.04 -0.06 0.01 0.02
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH
## TARGET -0.04 0.04 0.05 -0.04 -0.01
## FixedAcidity 0.00 0.00 -0.02 0.01 -0.01
## VolatileAcidity 0.00 -0.01 -0.02 0.01 0.01
## CitricAcid -0.01 0.01 0.01 -0.01 -0.01
## ResidualSugar -0.01 0.02 0.02 0.00 0.01
## Chlorides 1.00 -0.02 -0.01 0.02 -0.02
## FreeSulfurDioxide -0.02 1.00 0.01 0.00 0.01
## TotalSulfurDioxide -0.01 0.01 1.00 0.01 0.00
## Density 0.02 0.00 0.01 1.00 0.01
## pH -0.02 0.01 0.00 0.01 1.00
## Sulphates 0.00 0.01 -0.01 -0.01 0.00
## Alcohol -0.02 -0.02 -0.02 -0.01 -0.01
## LabelAppeal 0.01 0.01 -0.01 -0.01 0.00
## AcidIndex 0.03 -0.04 -0.04 0.04 -0.07
## STARS -0.03 0.02 0.03 -0.03 -0.01
## Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET -0.04 0.06 0.36 -0.22 0.69
## FixedAcidity 0.03 -0.01 0.00 0.17 -0.04
## VolatileAcidity 0.00 0.00 -0.02 0.04 -0.06
## CitricAcid -0.01 0.02 0.01 0.06 0.01
## ResidualSugar -0.01 -0.02 0.00 -0.01 0.02
## Chlorides 0.00 -0.02 0.01 0.03 -0.03
## FreeSulfurDioxide 0.01 -0.02 0.01 -0.04 0.02
## TotalSulfurDioxide -0.01 -0.02 -0.01 -0.04 0.03
## Density -0.01 -0.01 -0.01 0.04 -0.03
## pH 0.00 -0.01 0.00 -0.07 -0.01
## Sulphates 1.00 0.01 -0.01 0.03 -0.03
## Alcohol 0.01 1.00 0.00 -0.05 0.06
## LabelAppeal -0.01 0.00 1.00 0.02 0.26
## AcidIndex 0.03 -0.05 0.02 1.00 -0.15
## STARS -0.03 0.06 0.26 -0.15 1.00
Observing the dataset, I note that the features exhibit minimal
correlations with each other, indicating a lack of significant
multicollinearity. This suggests a higher likelihood of meeting the
assumptions of linear regression.
In our data preparation and exploration, the key findings can be summarized into the following categories:
The INDEX field was removed from the dataset as it did not contribute any relevant information for the model.
For the STARS field, missing values were imputed as 0, considering the high correlation between missing values and the target variable. Other fields with missing values were imputed using the knnimpute method from caret.
## [1] 0
Several numerical features exhibited seemingly unreasonable negative values. Despite this, we opted to interpret them as log-transformed variables, assuming the values are legitimate.
The following plots illustrate the alterations in distributions and the ultimate values post the transformations:
Upon completing the transformations, we observe that the variables are
now more centered and exhibit a closer resemblance to a normal
distribution. However, it is evident that they still deviate from
perfect normal distributions.
## [1] "Number of Training Samples: 10238"
## [1] "Number of Testing Samples: 2557"
In this first model, we include all available features. Features include:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = trainingData)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.003085 0.319723 0.010
## FixedAcidity -0.001992 0.005768 -0.345
## VolatileAcidity -0.025544 0.005749 -4.443
## CitricAcid 0.006762 0.005641 1.199
## ResidualSugar 0.001775 0.005871 0.302
## Chlorides -0.015149 0.005823 -2.602
## FreeSulfurDioxide 0.010149 0.005767 1.760
## TotalSulfurDioxide 0.016715 0.005858 2.853
## Density -0.006978 0.005709 -1.222
## pH -0.002176 0.005817 -0.374
## Sulphates -0.007812 0.005916 -1.320
## Alcohol 0.015683 0.005895 2.660
## as.factor(LabelAppeal)-1.11204793733397 0.248120 0.042218 5.877
## as.factor(LabelAppeal)0.0101741115806247 0.441789 0.041137 10.739
## as.factor(LabelAppeal)1.13239616049522 0.570610 0.041849 13.635
## as.factor(LabelAppeal)2.25461820940981 0.708786 0.047071 15.058
## as.factor(AcidIndex)-3.59682937695875 -0.138996 0.324343 -0.429
## as.factor(AcidIndex)-1.79176983045029 -0.098355 0.317457 -0.310
## as.factor(AcidIndex)-0.545318540973785 -0.143436 0.317165 -0.452
## as.factor(AcidIndex)0.362910765511677 -0.169359 0.317223 -0.534
## as.factor(AcidIndex)1.05172974217783 -0.281750 0.317645 -0.887
## as.factor(AcidIndex)1.59059728918163 -0.419515 0.318958 -1.315
## as.factor(AcidIndex)2.02271372429848 -0.798853 0.323818 -2.467
## as.factor(AcidIndex)2.37629509167962 -0.782543 0.329882 -2.372
## as.factor(AcidIndex)2.67051656830802 -0.712545 0.334968 -2.127
## as.factor(AcidIndex)2.9188445277671 -0.657023 0.344814 -1.905
## as.factor(AcidIndex)3.13100139587667 -0.733772 0.475161 -1.544
## as.factor(AcidIndex)3.31417429494859 -0.965357 0.548643 -1.760
## as.factor(AcidIndex)3.47378568897179 -1.075318 0.548867 -1.959
## as.factor(STARS)-0.42623524866846 0.751147 0.021927 34.257
## as.factor(STARS)0.416552574962037 1.068591 0.020480 52.178
## as.factor(STARS)1.25934039859254 1.189054 0.021607 55.031
## as.factor(STARS)2.10212822222303 1.310332 0.027052 48.437
## Pr(>|z|)
## (Intercept) 0.99230
## FixedAcidity 0.72989
## VolatileAcidity 0.00000885131 ***
## CitricAcid 0.23064
## ResidualSugar 0.76245
## Chlorides 0.00928 **
## FreeSulfurDioxide 0.07841 .
## TotalSulfurDioxide 0.00433 **
## Density 0.22158
## pH 0.70835
## Sulphates 0.18671
## Alcohol 0.00781 **
## as.factor(LabelAppeal)-1.11204793733397 0.00000000418 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.66825
## as.factor(AcidIndex)-1.79176983045029 0.75670
## as.factor(AcidIndex)-0.545318540973785 0.65109
## as.factor(AcidIndex)0.362910765511677 0.59343
## as.factor(AcidIndex)1.05172974217783 0.37508
## as.factor(AcidIndex)1.59059728918163 0.18842
## as.factor(AcidIndex)2.02271372429848 0.01363 *
## as.factor(AcidIndex)2.37629509167962 0.01768 *
## as.factor(AcidIndex)2.67051656830802 0.03340 *
## as.factor(AcidIndex)2.9188445277671 0.05672 .
## as.factor(AcidIndex)3.13100139587667 0.12253
## as.factor(AcidIndex)3.31417429494859 0.07849 .
## as.factor(AcidIndex)3.47378568897179 0.05009 .
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18257 on 10237 degrees of freedom
## Residual deviance: 10834 on 10205 degrees of freedom
## AIC: 36472
##
## Number of Fisher Scoring iterations: 6
## Warning in model_eval$aic <- model$aic: Coercing LHS to a list
## $RMSE
## [1] 2.588709
##
## $Rsquared
## [1] 0.5197045
##
## $MAE
## [1] 2.226568
##
## $aic
## [1] 36471.7
##
## $bic
## [1] 36710.42
In this second model, we only include the most predictive features based on our first Poisson Model. The predictors for the following model are:
VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), family = poisson, data = trainingData)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.007050 0.319240 -0.022
## VolatileAcidity -0.025933 0.005747 -4.512
## TotalSulfurDioxide 0.016602 0.005854 2.836
## Alcohol 0.016146 0.005891 2.741
## as.factor(LabelAppeal)-1.11204793733397 0.249280 0.042214 5.905
## as.factor(LabelAppeal)0.0101741115806247 0.443022 0.041133 10.770
## as.factor(LabelAppeal)1.13239616049522 0.571830 0.041842 13.666
## as.factor(LabelAppeal)2.25461820940981 0.709495 0.047057 15.077
## as.factor(AcidIndex)-3.59682937695875 -0.128514 0.323956 -0.397
## as.factor(AcidIndex)-1.79176983045029 -0.090501 0.317028 -0.285
## as.factor(AcidIndex)-0.545318540973785 -0.135344 0.316693 -0.427
## as.factor(AcidIndex)0.362910765511677 -0.161767 0.316736 -0.511
## as.factor(AcidIndex)1.05172974217783 -0.275130 0.317112 -0.868
## as.factor(AcidIndex)1.59059728918163 -0.415075 0.318390 -1.304
## as.factor(AcidIndex)2.02271372429848 -0.795036 0.323244 -2.460
## as.factor(AcidIndex)2.37629509167962 -0.779055 0.329310 -2.366
## as.factor(AcidIndex)2.67051656830802 -0.708279 0.334405 -2.118
## as.factor(AcidIndex)2.9188445277671 -0.644856 0.344143 -1.874
## as.factor(AcidIndex)3.13100139587667 -0.711490 0.474721 -1.499
## as.factor(AcidIndex)3.31417429494859 -0.953863 0.548057 -1.740
## as.factor(AcidIndex)3.47378568897179 -1.088689 0.548180 -1.986
## as.factor(STARS)-0.42623524866846 0.753195 0.021919 34.362
## as.factor(STARS)0.416552574962037 1.070745 0.020469 52.311
## as.factor(STARS)1.25934039859254 1.191737 0.021593 55.190
## as.factor(STARS)2.10212822222303 1.312190 0.027034 48.539
## Pr(>|z|)
## (Intercept) 0.98238
## VolatileAcidity 0.00000641065 ***
## TotalSulfurDioxide 0.00457 **
## Alcohol 0.00613 **
## as.factor(LabelAppeal)-1.11204793733397 0.00000000352 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.69159
## as.factor(AcidIndex)-1.79176983045029 0.77529
## as.factor(AcidIndex)-0.545318540973785 0.66911
## as.factor(AcidIndex)0.362910765511677 0.60954
## as.factor(AcidIndex)1.05172974217783 0.38561
## as.factor(AcidIndex)1.59059728918163 0.19235
## as.factor(AcidIndex)2.02271372429848 0.01391 *
## as.factor(AcidIndex)2.37629509167962 0.01799 *
## as.factor(AcidIndex)2.67051656830802 0.03417 *
## as.factor(AcidIndex)2.9188445277671 0.06096 .
## as.factor(AcidIndex)3.13100139587667 0.13394
## as.factor(AcidIndex)3.31417429494859 0.08178 .
## as.factor(AcidIndex)3.47378568897179 0.04703 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18257 on 10237 degrees of freedom
## Residual deviance: 10849 on 10213 degrees of freedom
## AIC: 36471
##
## Number of Fisher Scoring iterations: 6
## Warning in model_eval$aic <- model$aic: Coercing LHS to a list
## $RMSE
## [1] 2.588993
##
## $Rsquared
## [1] 0.5185381
##
## $MAE
## [1] 2.22691
##
## $aic
## [1] 36470.94
##
## $bic
## [1] 36651.78
Similar to Poisson Model 1, the predictors for the following model are:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = trainingData,
## init.theta = 41134.94708, link = log)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.003106 0.319742 0.010
## FixedAcidity -0.001992 0.005769 -0.345
## VolatileAcidity -0.025545 0.005749 -4.443
## CitricAcid 0.006762 0.005641 1.199
## ResidualSugar 0.001775 0.005872 0.302
## Chlorides -0.015149 0.005823 -2.602
## FreeSulfurDioxide 0.010149 0.005767 1.760
## TotalSulfurDioxide 0.016716 0.005859 2.853
## Density -0.006978 0.005709 -1.222
## pH -0.002176 0.005817 -0.374
## Sulphates -0.007812 0.005916 -1.320
## Alcohol 0.015683 0.005896 2.660
## as.factor(LabelAppeal)-1.11204793733397 0.248120 0.042219 5.877
## as.factor(LabelAppeal)0.0101741115806247 0.441788 0.041138 10.739
## as.factor(LabelAppeal)1.13239616049522 0.570606 0.041850 13.635
## as.factor(LabelAppeal)2.25461820940981 0.708782 0.047072 15.057
## as.factor(AcidIndex)-3.59682937695875 -0.139016 0.324362 -0.429
## as.factor(AcidIndex)-1.79176983045029 -0.098372 0.317476 -0.310
## as.factor(AcidIndex)-0.545318540973785 -0.143455 0.317184 -0.452
## as.factor(AcidIndex)0.362910765511677 -0.169378 0.317243 -0.534
## as.factor(AcidIndex)1.05172974217783 -0.281772 0.317664 -0.887
## as.factor(AcidIndex)1.59059728918163 -0.419539 0.318977 -1.315
## as.factor(AcidIndex)2.02271372429848 -0.798883 0.323837 -2.467
## as.factor(AcidIndex)2.37629509167962 -0.782574 0.329901 -2.372
## as.factor(AcidIndex)2.67051656830802 -0.712573 0.334987 -2.127
## as.factor(AcidIndex)2.9188445277671 -0.657049 0.344832 -1.905
## as.factor(AcidIndex)3.13100139587667 -0.733804 0.475179 -1.544
## as.factor(AcidIndex)3.31417429494859 -0.965392 0.548661 -1.760
## as.factor(AcidIndex)3.47378568897179 -1.075356 0.548884 -1.959
## as.factor(STARS)-0.42623524866846 0.751146 0.021927 34.256
## as.factor(STARS)0.416552574962037 1.068590 0.020480 52.177
## as.factor(STARS)1.25934039859254 1.189055 0.021608 55.029
## as.factor(STARS)2.10212822222303 1.310333 0.027053 48.435
## Pr(>|z|)
## (Intercept) 0.99225
## FixedAcidity 0.72989
## VolatileAcidity 0.00000885467 ***
## CitricAcid 0.23065
## ResidualSugar 0.76243
## Chlorides 0.00928 **
## FreeSulfurDioxide 0.07841 .
## TotalSulfurDioxide 0.00433 **
## Density 0.22159
## pH 0.70831
## Sulphates 0.18671
## Alcohol 0.00781 **
## as.factor(LabelAppeal)-1.11204793733397 0.00000000418 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.66823
## as.factor(AcidIndex)-1.79176983045029 0.75667
## as.factor(AcidIndex)-0.545318540973785 0.65107
## as.factor(AcidIndex)0.362910765511677 0.59341
## as.factor(AcidIndex)1.05172974217783 0.37507
## as.factor(AcidIndex)1.59059728918163 0.18842
## as.factor(AcidIndex)2.02271372429848 0.01363 *
## as.factor(AcidIndex)2.37629509167962 0.01768 *
## as.factor(AcidIndex)2.67051656830802 0.03341 *
## as.factor(AcidIndex)2.9188445277671 0.05673 .
## as.factor(AcidIndex)3.13100139587667 0.12252
## as.factor(AcidIndex)3.31417429494859 0.07849 .
## as.factor(AcidIndex)3.47378568897179 0.05009 .
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(41134.95) family taken to be 1)
##
## Null deviance: 18256 on 10237 degrees of freedom
## Residual deviance: 10834 on 10205 degrees of freedom
## AIC: 36474
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 41135
## Std. Err.: 38698
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -36406.04
## $RMSE
## [1] 2.588709
##
## $Rsquared
## [1] 0.5197043
##
## $MAE
## [1] 2.226568
##
## $aic
## [1] 36474.04
##
## $bic
## [1] 36719.99
Similar to Poisson Model 2, the predictors for the following model are:
VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), data = trainingData, init.theta = 41086.94367,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.003146 0.319305 0.010
## VolatileAcidity -0.025880 0.005748 -4.503
## FreeSulfurDioxide 0.010281 0.005763 1.784
## TotalSulfurDioxide 0.016568 0.005854 2.830
## Alcohol 0.016283 0.005892 2.764
## as.factor(LabelAppeal)-1.11204793733397 0.248881 0.042216 5.895
## as.factor(LabelAppeal)0.0101741115806247 0.442465 0.041136 10.756
## as.factor(LabelAppeal)1.13239616049522 0.570970 0.041847 13.644
## as.factor(LabelAppeal)2.25461820940981 0.708906 0.047060 15.064
## as.factor(AcidIndex)-3.59682937695875 -0.139441 0.324031 -0.430
## as.factor(AcidIndex)-1.79176983045029 -0.100423 0.317095 -0.317
## as.factor(AcidIndex)-0.545318540973785 -0.144746 0.316755 -0.457
## as.factor(AcidIndex)0.362910765511677 -0.171011 0.316796 -0.540
## as.factor(AcidIndex)1.05172974217783 -0.284490 0.317173 -0.897
## as.factor(AcidIndex)1.59059728918163 -0.423797 0.318445 -1.331
## as.factor(AcidIndex)2.02271372429848 -0.803065 0.323292 -2.484
## as.factor(AcidIndex)2.37629509167962 -0.786016 0.329349 -2.387
## as.factor(AcidIndex)2.67051656830802 -0.715109 0.334444 -2.138
## as.factor(AcidIndex)2.9188445277671 -0.656024 0.344216 -1.906
## as.factor(AcidIndex)3.13100139587667 -0.721816 0.474773 -1.520
## as.factor(AcidIndex)3.31417429494859 -0.955332 0.548072 -1.743
## as.factor(AcidIndex)3.47378568897179 -1.095310 0.548206 -1.998
## as.factor(STARS)-0.42623524866846 0.752694 0.021921 34.336
## as.factor(STARS)0.416552574962037 1.070384 0.020470 52.290
## as.factor(STARS)1.25934039859254 1.191406 0.021595 55.171
## as.factor(STARS)2.10212822222303 1.313002 0.027038 48.561
## Pr(>|z|)
## (Intercept) 0.99214
## VolatileAcidity 0.00000671010 ***
## FreeSulfurDioxide 0.07444 .
## TotalSulfurDioxide 0.00465 **
## Alcohol 0.00571 **
## as.factor(LabelAppeal)-1.11204793733397 0.00000000374 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.66695
## as.factor(AcidIndex)-1.79176983045029 0.75147
## as.factor(AcidIndex)-0.545318540973785 0.64770
## as.factor(AcidIndex)0.362910765511677 0.58933
## as.factor(AcidIndex)1.05172974217783 0.36974
## as.factor(AcidIndex)1.59059728918163 0.18324
## as.factor(AcidIndex)2.02271372429848 0.01299 *
## as.factor(AcidIndex)2.37629509167962 0.01701 *
## as.factor(AcidIndex)2.67051656830802 0.03250 *
## as.factor(AcidIndex)2.9188445277671 0.05667 .
## as.factor(AcidIndex)3.13100139587667 0.12843
## as.factor(AcidIndex)3.31417429494859 0.08132 .
## as.factor(AcidIndex)3.47378568897179 0.04572 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(41086.94) family taken to be 1)
##
## Null deviance: 18256 on 10237 degrees of freedom
## Residual deviance: 10846 on 10212 degrees of freedom
## AIC: 36472
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 41087
## Std. Err.: 38650
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -36418.09
## Warning in model_eval$aic <- model$aic: Coercing LHS to a list
## $RMSE
## [1] 2.588933
##
## $Rsquared
## [1] 0.5187898
##
## $MAE
## [1] 2.226864
##
## $aic
## [1] 36472.09
##
## $bic
## [1] 36667.4
The predictors for the following model are:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9522 -0.8534 0.0343 0.8407 5.4734
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.931620 0.757229 1.230
## FixedAcidity -0.004052 0.013084 -0.310
## VolatileAcidity -0.081484 0.012986 -6.275
## CitricAcid 0.022100 0.012865 1.718
## ResidualSugar 0.006841 0.013311 0.514
## Chlorides -0.046700 0.013175 -3.545
## FreeSulfurDioxide 0.031034 0.013117 2.366
## TotalSulfurDioxide 0.050073 0.013193 3.795
## Density -0.021108 0.012926 -1.633
## pH -0.005594 0.013123 -0.426
## Sulphates -0.019856 0.013445 -1.477
## Alcohol 0.050943 0.013321 3.824
## as.factor(LabelAppeal)-1.11204793733397 0.386444 0.069704 5.544
## as.factor(LabelAppeal)0.0101741115806247 0.863227 0.067899 12.713
## as.factor(LabelAppeal)1.13239616049522 1.314178 0.070928 18.528
## as.factor(LabelAppeal)2.25461820940981 1.909243 0.093235 20.478
## as.factor(AcidIndex)-3.59682937695875 -0.238781 0.773108 -0.309
## as.factor(AcidIndex)-1.79176983045029 -0.147070 0.755530 -0.195
## as.factor(AcidIndex)-0.545318540973785 -0.292792 0.754771 -0.388
## as.factor(AcidIndex)0.362910765511677 -0.381323 0.754857 -0.505
## as.factor(AcidIndex)1.05172974217783 -0.691337 0.755587 -0.915
## as.factor(AcidIndex)1.59059728918163 -0.955120 0.757102 -1.262
## as.factor(AcidIndex)2.02271372429848 -1.445823 0.760290 -1.902
## as.factor(AcidIndex)2.37629509167962 -1.490708 0.765801 -1.947
## as.factor(AcidIndex)2.67051656830802 -1.602844 0.775004 -2.068
## as.factor(AcidIndex)2.9188445277671 -1.247523 0.782729 -1.594
## as.factor(AcidIndex)3.13100139587667 -1.346890 0.923859 -1.458
## as.factor(AcidIndex)3.31417429494859 -1.723319 0.954413 -1.806
## as.factor(AcidIndex)3.47378568897179 -1.836053 0.924747 -1.985
## as.factor(STARS)-0.42623524866846 1.339736 0.036898 36.309
## as.factor(STARS)0.416552574962037 2.370119 0.035887 66.044
## as.factor(STARS)1.25934039859254 2.938188 0.041719 70.428
## as.factor(STARS)2.10212822222303 3.622417 0.065432 55.362
## Pr(>|t|)
## (Intercept) 0.218613
## FixedAcidity 0.756779
## VolatileAcidity 0.000000000365 ***
## CitricAcid 0.085874 .
## ResidualSugar 0.607301
## Chlorides 0.000395 ***
## FreeSulfurDioxide 0.018003 *
## TotalSulfurDioxide 0.000148 ***
## Density 0.102502
## pH 0.669914
## Sulphates 0.139737
## Alcohol 0.000132 ***
## as.factor(LabelAppeal)-1.11204793733397 0.000000030288 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.757436
## as.factor(AcidIndex)-1.79176983045029 0.845664
## as.factor(AcidIndex)-0.545318540973785 0.698082
## as.factor(AcidIndex)0.362910765511677 0.613458
## as.factor(AcidIndex)1.05172974217783 0.360231
## as.factor(AcidIndex)1.59059728918163 0.207140
## as.factor(AcidIndex)2.02271372429848 0.057242 .
## as.factor(AcidIndex)2.37629509167962 0.051610 .
## as.factor(AcidIndex)2.67051656830802 0.038649 *
## as.factor(AcidIndex)2.9188445277671 0.111009
## as.factor(AcidIndex)3.13100139587667 0.144900
## as.factor(AcidIndex)3.31417429494859 0.071005 .
## as.factor(AcidIndex)3.47378568897179 0.047119 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.304 on 10205 degrees of freedom
## Multiple R-squared: 0.5424, Adjusted R-squared: 0.541
## F-statistic: 378 on 32 and 10205 DF, p-value: < 0.00000000000000022
## Warning in model_eval$aic <- AIC(model): Coercing LHS to a list
## $RMSE
## [1] 1.301831
##
## $Rsquared
## [1] 0.5423939
##
## $MAE
## [1] 1.018945
##
## $aic
## [1] 34523.17
##
## $bic
## [1] 34769.12
For the final Linear Model, we leverage stepAIC on our Linear Model #5 to choose the most important features.
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + Sulphates +
## Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9668 -0.8516 0.0346 0.8410 5.4791
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.94253 0.75615 1.246
## VolatileAcidity -0.08158 0.01298 -6.284
## CitricAcid 0.02205 0.01286 1.714
## Chlorides -0.04666 0.01317 -3.542
## FreeSulfurDioxide 0.03106 0.01311 2.369
## TotalSulfurDioxide 0.05032 0.01319 3.816
## Density -0.02112 0.01292 -1.634
## Sulphates -0.02008 0.01344 -1.495
## Alcohol 0.05091 0.01332 3.823
## as.factor(LabelAppeal)-1.11204793733397 0.38713 0.06969 5.555
## as.factor(LabelAppeal)0.0101741115806247 0.86330 0.06789 12.716
## as.factor(LabelAppeal)1.13239616049522 1.31469 0.07091 18.539
## as.factor(LabelAppeal)2.25461820940981 1.90982 0.09321 20.489
## as.factor(AcidIndex)-3.59682937695875 -0.24879 0.77227 -0.322
## as.factor(AcidIndex)-1.79176983045029 -0.15866 0.75455 -0.210
## as.factor(AcidIndex)-0.545318540973785 -0.30425 0.75371 -0.404
## as.factor(AcidIndex)0.362910765511677 -0.39233 0.75377 -0.520
## as.factor(AcidIndex)1.05172974217783 -0.70337 0.75438 -0.932
## as.factor(AcidIndex)1.59059728918163 -0.96828 0.75580 -1.281
## as.factor(AcidIndex)2.02271372429848 -1.45914 0.75896 -1.923
## as.factor(AcidIndex)2.37629509167962 -1.50472 0.76437 -1.969
## as.factor(AcidIndex)2.67051656830802 -1.61663 0.77352 -2.090
## as.factor(AcidIndex)2.9188445277671 -1.26256 0.78131 -1.616
## as.factor(AcidIndex)3.13100139587667 -1.35945 0.92262 -1.473
## as.factor(AcidIndex)3.31417429494859 -1.73323 0.95274 -1.819
## as.factor(AcidIndex)3.47378568897179 -1.85267 0.92329 -2.007
## as.factor(STARS)-0.42623524866846 1.33996 0.03689 36.324
## as.factor(STARS)0.416552574962037 2.37062 0.03587 66.082
## as.factor(STARS)1.25934039859254 2.93856 0.04171 70.456
## as.factor(STARS)2.10212822222303 3.62296 0.06542 55.381
## Pr(>|t|)
## (Intercept) 0.212615
## VolatileAcidity 0.000000000343 ***
## CitricAcid 0.086543 .
## Chlorides 0.000398 ***
## FreeSulfurDioxide 0.017866 *
## TotalSulfurDioxide 0.000136 ***
## Density 0.102261
## Sulphates 0.135004
## Alcohol 0.000133 ***
## as.factor(LabelAppeal)-1.11204793733397 0.000000028428 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.747341
## as.factor(AcidIndex)-1.79176983045029 0.833456
## as.factor(AcidIndex)-0.545318540973785 0.686458
## as.factor(AcidIndex)0.362910765511677 0.602730
## as.factor(AcidIndex)1.05172974217783 0.351160
## as.factor(AcidIndex)1.59059728918163 0.200178
## as.factor(AcidIndex)2.02271372429848 0.054565 .
## as.factor(AcidIndex)2.37629509167962 0.049029 *
## as.factor(AcidIndex)2.67051656830802 0.036645 *
## as.factor(AcidIndex)2.9188445277671 0.106135
## as.factor(AcidIndex)3.13100139587667 0.140656
## as.factor(AcidIndex)3.31417429494859 0.068909 .
## as.factor(AcidIndex)3.47378568897179 0.044818 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.304 on 10208 degrees of freedom
## Multiple R-squared: 0.5424, Adjusted R-squared: 0.5411
## F-statistic: 417.2 on 29 and 10208 DF, p-value: < 0.00000000000000022
## Warning in model_eval$aic <- AIC(model): Coercing LHS to a list
## $RMSE
## [1] 1.301865
##
## $Rsquared
## [1] 0.5423695
##
## $MAE
## [1] 1.018829
##
## $aic
## [1] 34517.71
##
## $bic
## [1] 34741.96
Several predictors have significant effects on the dependent variable (TARGET) based on their p-values (e.g., VolatileAcidity, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, Alcohol, LabelAppeal, and STARS).
The coefficients for the different levels of categorical predictors (e.g., LabelAppeal, AcidIndex, and STARS) represent the difference in the mean of the dependent variable compared to the reference level.
The Effect Direction, for example, a positive coefficient for Alcohol suggests that an increase in Alcohol is associated with an increase in the mean of TARGET, while a negative coefficient for VolatileAcidity suggests the opposite.
The overall model fit is evaluated using metrics such as R-squared (the proportion of variance explained), adjusted R-squared (a penalized version of R-squared for the number of predictors), and F-statistic (overall significance of the model).
Zero inflation addresses the prevalence of numerous zeros in certain Poisson distributions by providing a correction. This method shows particular promise, especially considering our data exploration, where we observed a higher frequency of zeros followed by a distribution resembling normal data.
The model is a zero-inflated Poisson model used to predict the variable specified in the formula (TARGET) based on the predictors FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, and STARS.
Here we use the variable STARS because it had the most missing values.
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = trainingData)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.30463 -0.52758 0.02379 0.40652 2.70756
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2542657 0.0067515 185.776 < 0.0000000000000002 ***
## FixedAcidity 0.0001639 0.0059267 0.028 0.9779
## VolatileAcidity -0.0135867 0.0059034 -2.302 0.0214 *
## CitricAcid 0.0011296 0.0057567 0.196 0.8444
## ResidualSugar -0.0007654 0.0059981 -0.128 0.8985
## Chlorides -0.0069000 0.0059648 -1.157 0.2474
## FreeSulfurDioxide 0.0044605 0.0058837 0.758 0.4484
## TotalSulfurDioxide 0.0003172 0.0059794 0.053 0.9577
## Density -0.0073991 0.0058865 -1.257 0.2088
## pH 0.0037535 0.0059566 0.630 0.5286
## Sulphates -0.0010330 0.0060752 -0.170 0.8650
## Alcohol 0.0247431 0.0060303 4.103 0.000040756 ***
## LabelAppeal 0.1991156 0.0062675 31.770 < 0.0000000000000002 ***
## AcidIndex -0.0322218 0.0064214 -5.018 0.000000522 ***
## STARS 0.1194349 0.0069054 17.296 < 0.0000000000000002 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.93328 0.07725 -37.97 <0.0000000000000002 ***
## STARS -2.61255 0.07064 -36.98 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 23
## Log-likelihood: -1.668e+04 on 17 Df
The count model suggests Alcohol, LabelAppeal, AcidIndex, and STARS have statistically significant coefficients (p < 0.05). The signs of the coefficients indicate the direction of the impact on the expected count of TARGET.
The Zero-Inflation Model suggest that STARS has a statistically significant coefficient (p < 0.05), indicating that it significantly affects the odds of excess zeros. The negative sign suggests that a higher value of STARS is associated with a lower likelihood of excess zeros.
The optimization algorithm took 40 iterations to converge.
The log-likelihood value indicates how well the model fits the data. A lower log-likelihood suggests a better fit.
The model suggests that variables such as Alcohol, LabelAppeal, AcidIndex, and STARS are important predictors in both the count and zero-inflation components. The negative coefficient for STARS in the zero-inflation model implies that higher STARS ratings are associated with a lower likelihood of excess zeros.
Here I use the following variables, VolatileAcidity, Alcohol, LabelAppeal, and AcidIndex, and Stars. I chose these variables because the p < 0.05, containing the most excess zeros.
The zero-inflated Poisson Reduced model is used to predict the variable specified in the formula (TARGET) based on the predictors VolatileAcidity, Alcohol, LabelAppeal, and AcidIndex, with separate models for the count and zero-inflation components. The model also includes a grouping variable STARS, because that variable is believed to contain (or is hiding) some of the information that contributes to the excess zeros.
##
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Alcohol + LabelAppeal +
## AcidIndex | STARS, data = trainingData)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.39753 -0.45601 0.06658 0.41411 2.30869
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.296696 0.006100 212.562 < 0.0000000000000002 ***
## VolatileAcidity -0.015657 0.005877 -2.664 0.00772 **
## Alcohol 0.033820 0.005978 5.658 0.0000000153 ***
## LabelAppeal 0.239028 0.005790 41.283 < 0.0000000000000002 ***
## AcidIndex -0.035684 0.006282 -5.680 0.0000000135 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.87781 0.07319 -39.32 <0.0000000000000002 ***
## STARS -2.60304 0.06780 -38.39 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 11
## Log-likelihood: -1.683e+04 on 7 Df
The count model suggests that, holding other predictors constant, an increase in VolatileAcidity leads to a decrease in the expected count of TARGET. Alcohol, LabelAppeal, and AcidIndex have positive coefficients, indicating that an increase in these variables is associated with an increase in the expected count of TARGET. All coefficients in the count model are statistically significant based on the p-values.
The zero-inflation model suggests that, holding other predictors constant, a one-unit increase in STARS is associated with a decrease in the odds of excess zeros. Both Intercept and STARS coefficients in the zero-inflation model are statistically significant.
The optimization algorithm took 11 iterations to converge.
The log-likelihood value indicates how well the model fits the data. A lower log-likelihood suggests a better fit.
The table presents Mean Squared Error (MSE) and Akaike Information Criterion (AIC) values for the eight different models (labeled Model1 through Model8).
MSE is a useful metric for quantifying the accuracy of predictions, and it is widely employed in regression and machine learning applications. A lower MSE indicates better predictive performance. It means that, on average, the model’s predictions are closer to the actual values. A higher MSE suggests that the model’s predictions are, on average, further away from the actual values.
The AIC is a measure of the relative quality of statistical models for a given set of data. Lower AIC values indicate a better fit.
## Warning in matrix(c(mse1, mse2, mse3, mse4, mse5, mse6, mse7, mse8, aic1, :
## data length [12] is not a sub-multiple or multiple of the number of rows [8]
MSE | AIC | |
---|---|---|
Model1 | 6.701417 | 6.702885 |
Model2 | 6.701416 | 6.702572 |
Model3 | 1.694763 | 1.694853 |
Model4 | 1.730258 | 1.892633 |
Model5 | 36471.703172 | 36470.936609 |
Model6 | 36474.037980 | 36472.088457 |
Model7 | 6.701417 | 6.702885 |
Model8 | 6.701416 | 6.702572 |
Model 1 and Model 2: These models have identical MSE values, suggesting that they perform equally well in terms of the mean squared error. The value is relatively low.
Model 3 and Model 4: Model 3 has a lower MSE compared to Model4, indicating that Model 3 performs better in terms of minimizing the squared differences between predicted and actual values.
Model 5 and Model 6: Model 5 has an extremely high MSE compared to Model 6. Model 5 seems to perform much worse in terms of mean squared error, possibly indicating poor predictive performance.
Model 7 and Model 8: Similar to Model 1 and Model 2, these models have identical MSE values, suggesting equivalent performance in terms of mean squared error.
Model 1 and Model 2: These models have nearly identical AIC values, suggesting similar goodness of fit according to this criterion.
Model 3 and Model 4: Model3 has a lower AIC compared to Model 4, indicating that Model 3 is preferred in terms of the trade-off between goodness of fit and model complexity.
Model 5 and Model 6: Similar to the MSE comparison, Model 5 has a higher AIC compared to Model6, indicating that Model 6 is preferred in terms of AIC.
Model 7 and Model 8: These models have nearly identical AIC values, similar to the situation with Model1 and Model 2.
Best Models: Model 3 seems to be the best-performing model based on both MSE and AIC, as it has the lowest MSE and AIC among the presented models. Model 6 also performs well in terms of both MSE and AIC, but it may be slightly less preferred than Model 3.
Poor Models: Models 5 and 7 seem to perform poorly, especially Model 5, which has exceptionally high MSE and AIC values.
Model Selection: When choosing a model, it’s often desirable to balance goodness of fit (low MSE) with simplicity (low AIC). Model 3 strikes a good balance in this regard.
I will see the output of the Models using test data. The table presents loss values for eight different models (labeled Model1 through Model8). The term “loss” generally refers to a measure of how well a model is performing.
I will use the squared loss to validate the model. I will use the squared difference to select a model (MSE) from predictions on the training sets. (Lower numbers are better.)
## Loss:
## Model1 6.739686
## Model2 6.745143
## Model3 6.739686
## Model4 6.740173
## Model5 1.686629
## Model6 1.686877
## Model7 1.701499
## Model8 1.851677
Model 1 and Model 3: These models have nearly identical loss values, suggesting similar performance according to the chosen loss metric.
Model 2: Model 2 has a slightly higher loss compared to Model 1 and Model 3. This indicates that Model 2 may be performing slightly worse than Model 1 and Model 3 according to the specified loss metric.
Model 4: Model 4 has a loss value close to that of Model 1 and Model 3, indicating comparable performance.
Model 5 and Model 6: Models 5 and 6 have lower loss values compared to the previous models. A lower loss generally indicates better performance, so Models 5 and 6 seem to be performing well.
Model 7: Model 7 has a loss value slightly higher than Models 5 and 6 but lower than Model 8. Its performance is somewhere in between.
Model 8: Model 8 has the highest loss value among the presented models. A higher loss suggests that Model 8 is not performing as well as the other models according to the chosen loss metric.
Best Models: Models 5 and 6 seem to be the best-performing models, as they have the lowest loss values among the presented models.
Poor Models: Model 8 has the highest loss, suggesting poorer performance compared to the other models.
Model Comparison: The models can be ranked based on their loss values, with lower values indicating better performance.
Because I am not interested in gaining insight into the underlying causes of wine selection, I will use the squared loss. This will tell me how accurate our model is without caring about confidence intervals etc.
Based on this metric, Multiple Linear Regression Model 5 is the most accurate.
The wine eval dataset contains 16 columns - including the target variable TARGET - and 3,335 rows, covering a variety of different brands of wine. The data-set is entirely numerical variables, but also contains some variables that are highly discrete and have a limited number of possible values. We will drop the first 2 columns INDEX, we don’t need and TARGET, all missing rows. We have alot of missing values. Columns that have missing values are, ResidualSugar, Chlorides, FreSulfurDioxide, TotalSulfurDioxide, ph, Sulphates, Alcohol, and Stars, which contains the most missing values. To prepare our testing data, wine eval I had to convert
## IN TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1 3 NA 5.4 -0.860 0.27 -10.7 0.092
## 2 9 NA 12.4 0.385 -0.76 -19.7 1.169
## 3 10 NA 7.2 1.750 0.17 -33.0 0.065
## 4 18 NA 6.2 0.100 1.80 1.0 -0.179
## 5 21 NA 11.4 0.210 0.28 1.2 0.038
## 6 30 NA 17.6 0.040 -1.15 1.4 0.535
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1 23 398 0.98527 5.02 0.64 12.30
## 2 -37 68 0.99048 3.37 1.09 16.00
## 3 9 76 1.04641 4.61 0.68 8.55
## 4 104 89 0.98877 3.20 2.11 12.30
## 5 70 53 1.02899 2.54 -0.07 4.80
## 6 -250 140 0.95028 3.06 -0.02 11.40
## LabelAppeal AcidIndex STARS
## 1 -1 6 NA
## 2 0 6 2
## 3 0 8 1
## 4 -1 8 1
## 5 0 10 NA
## 6 1 8 4
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## vars n mean sd median trimmed mad min
## IN 1 3335 8048.31 4655.48 7906.00 8044.28 5960.05 3.00
## TARGET 2 0 NaN NA NA NaN NA Inf
## FixedAcidity 3 3335 6.86 6.32 6.90 6.91 2.82 -18.20
## VolatileAcidity 4 3335 0.31 0.81 0.28 0.31 0.46 -2.83
## CitricAcid 5 3335 0.31 0.87 0.31 0.31 0.44 -3.12
## ResidualSugar 6 3167 5.32 34.37 3.60 5.46 16.90 -128.30
## Chlorides 7 3197 0.06 0.31 0.05 0.06 0.12 -1.15
## FreeSulfurDioxide 8 3183 34.95 149.63 30.00 34.26 57.82 -563.00
## TotalSulfurDioxide 9 3178 123.41 225.80 124.00 124.00 137.88 -769.00
## Density 10 3335 0.99 0.03 0.99 0.99 0.01 0.89
## pH 11 3231 3.24 0.68 3.21 3.23 0.37 0.60
## Sulphates 12 3025 0.53 0.91 0.50 0.53 0.39 -3.07
## Alcohol 13 3150 10.58 3.76 10.40 10.58 2.52 -4.20
## LabelAppeal 14 3335 0.01 0.89 0.00 0.01 1.48 -2.00
## AcidIndex 15 3335 7.75 1.32 8.00 7.62 1.48 5.00
## STARS 16 2494 2.04 0.91 2.00 1.97 1.48 1.00
## max range skew kurtosis se
## IN 16130.00 16127.00 0.01 -1.20 80.62
## TARGET -Inf -Inf NA NA NA
## FixedAcidity 33.50 51.70 -0.12 2.04 0.11
## VolatileAcidity 3.61 6.44 -0.04 1.62 0.01
## CitricAcid 3.76 6.88 -0.03 1.66 0.02
## ResidualSugar 145.40 273.70 -0.06 1.97 0.61
## Chlorides 1.26 2.41 -0.04 1.74 0.01
## FreeSulfurDioxide 617.00 1180.00 0.07 1.88 2.65
## TotalSulfurDioxide 1004.00 1773.00 -0.05 1.50 4.01
## Density 1.10 0.21 -0.03 1.94 0.00
## pH 6.21 5.61 0.12 1.69 0.01
## Sulphates 4.18 7.25 0.01 1.83 0.02
## Alcohol 25.60 29.80 0.05 1.54 0.07
## LabelAppeal 2.00 4.00 0.05 -0.26 0.02
## AcidIndex 17.00 12.00 1.51 4.28 0.02
## STARS 4.00 3.00 0.44 -0.75 0.02
## IN TARGET FixedAcidity VolatileAcidity
## Min. : 3 Mode:logical Min. :-18.200 Min. :-2.8300
## 1st Qu.: 4018 NA's:3335 1st Qu.: 5.200 1st Qu.: 0.0800
## Median : 7906 Median : 6.900 Median : 0.2800
## Mean : 8048 Mean : 6.864 Mean : 0.3103
## 3rd Qu.:12061 3rd Qu.: 9.000 3rd Qu.: 0.6300
## Max. :16130 Max. : 33.500 Max. : 3.6100
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.1200 Min. :-128.300 Min. :-1.15000 Min. :-563.00
## 1st Qu.: 0.0000 1st Qu.: -2.600 1st Qu.: 0.01600 1st Qu.: 3.00
## Median : 0.3100 Median : 3.600 Median : 0.04700 Median : 30.00
## Mean : 0.3124 Mean : 5.319 Mean : 0.06143 Mean : 34.95
## 3rd Qu.: 0.6050 3rd Qu.: 17.200 3rd Qu.: 0.17100 3rd Qu.: 79.25
## Max. : 3.7600 Max. : 145.400 Max. : 1.26300 Max. : 617.00
## NA's :168 NA's :138 NA's :152
## TotalSulfurDioxide Density pH Sulphates
## Min. :-769.00 Min. :0.8898 Min. :0.600 Min. :-3.0700
## 1st Qu.: 27.25 1st Qu.:0.9883 1st Qu.:2.980 1st Qu.: 0.3300
## Median : 124.00 Median :0.9946 Median :3.210 Median : 0.5000
## Mean : 123.41 Mean :0.9947 Mean :3.237 Mean : 0.5346
## 3rd Qu.: 210.00 3rd Qu.:1.0005 3rd Qu.:3.490 3rd Qu.: 0.8200
## Max. :1004.00 Max. :1.0998 Max. :6.210 Max. : 4.1800
## NA's :157 NA's :104 NA's :310
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.20 Min. :-2.00000 Min. : 5.000 Min. :1.00
## 1st Qu.: 9.00 1st Qu.:-1.00000 1st Qu.: 7.000 1st Qu.:1.00
## Median :10.40 Median : 0.00000 Median : 8.000 Median :2.00
## Mean :10.58 Mean : 0.01349 Mean : 7.748 Mean :2.04
## 3rd Qu.:12.50 3rd Qu.: 1.00000 3rd Qu.: 8.000 3rd Qu.:3.00
## Max. :25.60 Max. : 2.00000 Max. :17.000 Max. :4.00
## NA's :185 NA's :841
## [1] 16
## [1] 3335
## FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 0 0 0 168
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
## 138 152 157 0
## pH Sulphates Alcohol LabelAppeal
## 104 310 185 0
## AcidIndex STARS
## 0 841
For multiple imputation with Random Forest I have imputed values in place of the missing values in my wine_eval dataset. Keeping in mind that the effectiveness of imputation depends on the nature of your data and the appropriateness of the imputation method for your specific problem. I also had to convert STARRS, LabelAppeal, AcidIndex to factors.
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
These are the predicted values for the response variable (dependent variable) generated by the linear regression model (lm1) using the predictors in the wine_eval dataset. These values represent the model’s estimate of the target variable based on the features in the wine_eval data.
It’s important to note that the interpretation of these predictions depends on the context of your specific regression model and the variables involved. The accuracy and reliability of the predictions are also influenced by the quality of the model and the suitability of linear regression for your data.
## 2 4 6 7 8 9
## 3.9375180 2.4883264 0.2896969 1.4415012 5.2314899 1.6674334
The density plot provides insight into the distribution of predicted values, showing where the mass of the observations lies and how it varies across the range. The wiggles in the plot can be indicative of underlying patterns or structures in the data. The initial increase in the middle of the plot suggests that there is a concentration of predicted values around that range. The higher density in this region indicates a greater number of observations with similar or close predicted values.The initial increase in the middle of the plot suggests that there is a concentration of predicted values around that range. The higher density in this region indicates a greater number of observations with similar or close predicted values.The decrease in density as you move towards the tails of the distribution indicates that fewer observations have extreme predicted values. This decrease could be due to the data becoming sparser in these regions.
This Cumulative Distribution Plot of Predictions plot provides a visual representation of how the predicted values are distributed across the dataset. At the start, the line is usually straight because there are likely many low predicted values. This straight line represents the period during which most of the low values are covered, and as you move along the x-axis, more observations are included in the cumulative count. The upward slope indicates the portion where the cumulative probability is increasing. This is where more observations with higher predicted values are being added. The line may become straight again when most of the observations with higher predicted values have been included. The slanted S-shape is common in cumulative distribution plots and reflects the accumulation of observations across the range of predicted values. If there are sudden changes in slope, it might indicate points of inflection or areas where the density of observations changes rapidly.
This study aimed to address the challenge of predicting the number of wines sold to restaurants by employing a Count Regression Model. The initial phase involved cleaning and pre-processing a dataset to get it ready for model training. Several techniques, including Poisson Regression, Negative Binomial, Multiple Linear Regression, and Zero Inflation Poison, were applied for count regression. Because I was not interested in gaining insight into the underlying causes of wine selection, I used the squared loss. This told me how accurate our model is without caring about confidence intervals etc. Lower numbers are better, so I chose to test the wine_eval dataset with Mutiple Linear Regression Model. The results from various wine types exhibited a high degree of similarity, mainly attributed to the utilization of random forest during the pre-processing stage to address missing values.
knitr::opts_chunk$set(echo = FALSE)
# load libraries
suppressWarnings({
# Code that generates specific warnings
# Other code
library(pscl)
library(tinytex)
library(devtools)
library(vctrs)
library(mice)
library(tidyverse)
library(dplyr)
library(psych)
library(corrplot)
library(RColorBrewer)
library(knitr)
library(MASS)
library(caret)
library(kableExtra)
library(ResourceSelection)
library(pROC)
library(ggplot2)
library(gridExtra)
library(htmltools)
library(ggpubr)
})
suppressMessages({
library(pscl)
library(tinytex)
library(devtools)
library(vctrs)
library(mice)
library(tidyverse)
library(dplyr)
library(psych)
library(corrplot)
library(RColorBrewer)
library(knitr)
library(MASS)
library(caret)
library(kableExtra)
library(ResourceSelection)
library(pROC)
library(ggplot2)
library(gridExtra)
library(htmltools)
library(ggpubr)
})
library(pscl)
library(tinytex)
library(devtools)
library(vctrs)
library(mice)
library(tidyverse)
library(dplyr)
library(psych)
library(corrplot)
library(RColorBrewer)
library(knitr)
library(MASS)
library(caret)
library(kableExtra)
library(ResourceSelection)
library(pROC)
library(ggplot2)
library(gridExtra)
library(htmltools)
library(ggpubr)
#load data
wine_train<- read.csv("https://raw.githubusercontent.com/enidroman/DATA-621-Business-Analytics-and-Data-Mining/main/wine-training-data.csv")
wine_eval <- read.csv("https://raw.githubusercontent.com/enidroman/DATA-621-Business-Analytics-and-Data-Mining/main/wine-evaluation-data.csv")
vn <- c("INDEX", "TARGET", " ", " ", "ACID INDEX", "ALCOHOL", "CHLORIDES", "CITRIC ACID", "DENSITY", "FIXED ACIDITY", "FREE SULFUR DIOXIDE", "LABEL APPEAL", "RESIDUAL SUGAR", "STARS", "SULPHATES", "TOTAL SULFUR DIOXIDE", "VOLATILE ACIDITY", "pH")
defin <- c("Identification Variable (do not use)", "Number of Cases Purchased", " ", " ", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicationg the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate content of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine.", "pH of wine")
theor_effect <- c("None", "None", " ", " ", " ", " ", " ", " ", " ", " ", " ", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.", " ", "A high number of stars suggests high sales", " ", " ", " ", " ")
kable(cbind(vn, defin, theor_effect), col.names = c("VARIABLE NAME", "DEFINITION", "THEORETICAL EFFECT")) %>%
kable_paper(full_width = T)
head(wine_train)
describe(wine_train)
ncol(wine_train)
nrow(wine_train)
# summary statistics
summary(wine_train)
wine_train <- subset(wine_train, select = -INDEX)
str(wine_train)
# count the total number of missing values
sum(is.na(wine_train))
dis_wine_train <- wine_train %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(dis_wine_train) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=4)
box_wine_train <- wine_train %>%
gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(box_wine_train, aes(variable, value)) +
geom_boxplot() +
facet_wrap(. ~variable, scales='free', ncol=6)
wine_train_character_wide <- wine_train %>%
dplyr::select(TARGET, STARS, LabelAppeal, AcidIndex) %>%
pivot_longer(cols = -TARGET, names_to="variable", values_to="value") %>%
arrange(variable, value)
wine_train_character_wide %>%
ggplot(mapping = aes(x = factor(value), y = TARGET)) +
geom_boxplot() +
facet_wrap(.~variable, scales="free") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
featurePlot(wine_train[,2:ncol(wine_train)], wine_train[,1], pch = 20)
missing <- colSums(wine_train %>% sapply(is.na))
missing_pct <- round(missing / nrow(wine_train) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))
# separate our features from target so we don't inadvertently transform the target
training_x <- wine_train %>% dplyr::select(-TARGET)
training_y <- wine_train$TARGET
# separate our features from target so we don't inadvertently transform the target
eval_x <- wine_eval %>% dplyr::select(-TARGET)
eval_y <- wine_eval$TARGET
create_na_dummy <- function(vector) {
as.integer(vector %>% is.na())
}
impute_missing <- function(data) {
# Replace missing STARS with 0
data$STARS <- data$STARS %>%
replace_na(0)
return(data)
}
# Replace missing STARS with 'unknown' and convert STARS to a factor
training_x <- impute_missing(training_x)
eval_x <- impute_missing(eval_x)
imputation <- preProcess(training_x, method = c("knnImpute", 'BoxCox'))
# summary(imputation)
training_x_imp <- predict(imputation, training_x)
eval_x_imp <- predict(imputation, eval_x)
clean_df <- cbind(training_y, training_x_imp) %>%
as.data.frame() %>%
rename(TARGET = training_y)
clean_eval_df <- cbind(eval_y, eval_x_imp) %>%
as.data.frame() %>%
rename(TARGET = eval_y)
stack(sort(cor(clean_df[,1], clean_df[,2:ncol(clean_df)])[,], decreasing=TRUE))
mcor<-round(cor(clean_df),2)
mcor
correlation = cor(clean_df, use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
col=brewer.pal(n=8, name="RdYlBu"))
sum(is.na(clean_df))
clean_wine_train <- clean_df %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(clean_wine_train) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=4)
options(scipen = 999)
#75% data test training split
# get training/test split
y_raw <- as.matrix(clean_df$TARGET)
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)
# Build training data sets
trainX <- clean_df[trainingRows,] %>% dplyr::select(-TARGET)
trainY <- clean_df[trainingRows,] %>% dplyr::select(TARGET)
# put remaining rows into the test sets
testX <- clean_df[-trainingRows,] %>% dplyr::select(-TARGET)
testY <- clean_df[-trainingRows,] %>% dplyr::select(TARGET)
# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$TARGET <- trainY$TARGET
print(paste('Number of Training Samples: ', dim(trainingData)[1]))
testingData <- as.data.frame(testX)
testingData$TARGET <- testY$TARGET
print(paste('Number of Testing Samples: ', dim(testingData)[1]))
model_test_perf <- function(model, trainX, trainY, testX, testY) {
# Evaluate Model with testing data set
predictedY <- predict(model, newdata = trainX)
model_results <- data.frame(obs = trainY, pred = predictedY)
colnames(model_results) <- c('obs', 'pred')
# Calculate RMSE, Rsquared, and MAE by default
model_eval <- defaultSummary(model_results)
# Add AIC score to the results
if ('aic' %in% names(model)) {
model_eval$aic <- model$aic
} else {
model_eval$aic <- AIC(model)
}
# Add BIC score to the results
model_eval$bic <- BIC(model)
return(model_eval)
}
poiss1 = glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=trainingData,
family=poisson)
summary(poiss1)
# Evaluate Model 1 with testing data set
(poiss1_eval <- model_test_perf(poiss1, trainX, trainY, testX, testY))
#' Extract key performance results from a model
#'
#' @param model A linear model of interest
#' @examples
#' model_performance_extraction(my_model)
#' @return data.frame
#' @export
model_performance_extraction <- function(model = NULL) {
# Make sure a model was passed
if (is.null(model)) {
return(NULL)
}
data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
performance_metrics <- data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
return(performance_metrics)
}
poiss2 <- glm(TARGET ~ VolatileAcidity + TotalSulfurDioxide + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=trainingData,
family=poisson)
summary(poiss2)
# Evaluate Model 1 with testing data set
(poiss2_eval <- model_test_perf(poiss2, trainX, trainY, testX, testY))
model_performance_extraction <- function(model = NULL) {
# Make sure a model was passed
if (is.null(model)) {
return(NULL)
}
performance_metrics <- data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
return(performance_metrics)
}
negbi1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=trainingData)
summary(negbi1)
# Evaluate Model 1 with testing data set
(negbi1_eval <- model_test_perf(negbi1, trainX, trainY, testX, testY))
model_performance_extraction <- function(model = NULL) {
# Make sure a model was passed
if (is.null(model)) {
return(NULL)
}
performance_metrics <- data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
return(performance_metrics)
}
negbi2 <- glm.nb(TARGET~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=trainingData)
summary (negbi2)
# Evaluate Model 1 with testing data set
(negbi2_eval <- model_test_perf(negbi2, trainX, trainY, testX, testY))
model_performance_extraction <- function(model = NULL) {
# Make sure a model was passed
if (is.null(model)) {
return(NULL)
}
performance_metrics <- data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
return(performance_metrics)
}
lm1 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=trainingData)
summary(lm1)
# Evaluate Model 1 with testing data set
(lm1_eval <- model_test_perf(lm1, trainX, trainY, testX, testY))
model_performance_extraction <- function(model = NULL) {
# Make sure a model was passed
if (is.null(model)) {
return(NULL)
}
performance_metrics <- data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
return(performance_metrics)
}
lm2 <- stepAIC(lm1, direction = "both",
scope = list(upper = lm1, lower = ~ 1),
scale = 0, trace = FALSE)
summary(lm2)
# Evaluate Model 1 with testing data set
(lm2_eval <- model_test_perf(lm2, trainX, trainY, testX, testY))
model_performance_extraction <- function(model = NULL) {
# Make sure a model was passed
if (is.null(model)) {
return(NULL)
}
performance_metrics <- data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
return(performance_metrics)
}
zip1 <- zeroinfl(TARGET~.|STARS, data = trainingData)
summary(zip1)
zip2 <- zeroinfl(TARGET ~ VolatileAcidity + Alcohol + LabelAppeal + AcidIndex | STARS, data =trainingData)
summary(zip2)
aic1 <- poiss1$aic
aic2 <- poiss2$aic
aic3 <- negbi1$aic
aic4 <- negbi2$aic
aic5 <- lm1$aic
aic6 <- lm2$aic
aic7 <- zip1$aic
aic8 <- zip2$aic
mse1 <- mean((trainingData$TARGET - predict(poiss1))^2)
mse2 <- mean((trainingData$TARGET - predict(poiss2))^2)
mse3 <- mean((trainingData$TARGET - predict(negbi1))^2)
mse4 <- mean((trainingData$TARGET - predict(negbi2))^2)
mse5 <- mean((trainingData$TARGET - predict(lm1))^2)
mse6 <- mean((trainingData$TARGET - predict(lm2))^2)
mse7 <- mean((trainingData$TARGET - predict(zip1))^2)
mse8 <- mean((trainingData$TARGET - predict(zip2))^2)
compare_aic_mse <- matrix(c(mse1, mse2, mse3, mse4, mse5, mse6, mse7, mse8,
aic1, aic2, aic3, aic4, aic5, aic6, aic7, aic8),nrow=8,ncol=2,byrow=TRUE)
rownames(compare_aic_mse) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7","Model8")
colnames(compare_aic_mse) <- c("MSE","AIC")
compare_models <- as.data.frame(compare_models)
kable(compare_aic_mse) %>%
kable_styling(full_width = T)
modelValidation <- function(mod){
preds = predict(mod, testingData)
diffMat = as.numeric(preds) - as.numeric(testingData$TARGET)
diffMat = diffMat^2
loss <- mean(diffMat)
return(loss)
}
compare_models <- matrix(c(modelValidation(poiss1),modelValidation(poiss2),modelValidation(negbi1),modelValidation(negbi2),modelValidation(lm1),modelValidation(lm2),
modelValidation(zip1),modelValidation(zip2)),
nrow=8,ncol=1,byrow=TRUE)
rownames(compare_models) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7","Model8")
colnames(compare_models) <- c("Loss:")
compare_models <- as.data.frame(compare_models)
compare_models
head(wine_eval)
describe(wine_eval)
summary(wine_eval)
ncol(wine_eval)
nrow(wine_eval)
wine_test <- wine_eval[-c(1,2)]
colSums(is.na(wine_test))
set.seed(32)
wine_test <- mice(wine_test, m=5, maxit = 3, method = 'rf')
wine_test$STARS <- as.factor(wine_test$STARS)
wine_test$STARS <- factor(wine_test$STARS, levels = levels(trainingData$STARS))
trainingData$LabelAppeal <- factor(trainingData$LabelAppeal)
wine_test$LabelAppeal <- factor(wine_test$LabelAppeal, levels = levels(trainingData$LabelAppeal))
wine_test$AcidIndex <- factor(wine_test$AcidIndex, levels = levels(trainingData$AcidIndex))
wine_test$LabelAppeal <- factor(wine_test$LabelAppeal, levels = levels(trainingData$LabelAppeal))
wine_test <- complete(wine_test)
predictions <- predict(lm1, data= wine_test)
print(head(predictions))
# Convert predictions to a data frame
predictions_df <- data.frame(Predictions = predictions)
# Display the datatable
DT::datatable(predictions_df)
hist(predictions)
# Density plot
plot(density(predictions), main = "Density Plot of Predictions", col = "skyblue", lwd = 2)
# Cumulative distribution plot
plot(ecdf(predictions), main = "Cumulative Distribution Plot of Predictions", col = "skyblue", lwd = 2)