INTRODUCTION

In this assignment I will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

My objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. Sometimes, the fact that a variable is missing is actually predictive of the target. I can only use the variables given to me (or variables that I derive from the variables provided).

We have two datasets.

One is the wine training dataset, which includes 14 candidate predictors, 1 response variable and 12795 observations.

The other is the wine evaluation dataset, which also includes 14 candidate predictors and 1 response variable, but 16129 observations.

Below is a short description of the variables of interest in the data set:

DATA SET

VARIABLE NAME | DEFINITION | THEORETICAL EFFECT
INDEX | Identification variable (do not use) | None
TARGET | Number of cases purchased | None
ACID INDEX | Proprietary method of testing total acidity of wine by using a weighted average |
ALCOHOL | Alcohol content |
CHLORIDES | Chloride content of wine |
CITRIC ACID | Citric acid content |
DENSITY | Density of wine |
FIXED ACIDITY | Fixed acidity of wine |
FREE SULFUR DIOXIDE | Sulfur dioxide content of wine |
LABEL APPEAL | Marketing score indicating the appeal of the label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design. | Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
RESIDUAL SUGAR | Residual sugar of wine |
STARS | Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor | A high number of stars suggests high sales
SULPHATES | Sulfate content of wine |
TOTAL SULFUR DIOXIDE | Total sulfur dioxide of wine |
VOLATILE ACIDITY | Volatile acid content of wine |
pH | pH of wine |

DATA EXPLORATION

DATA WINE TRAINING SET

The wine training set contains 16 columns, including the target variable TARGET, and 12,795 rows, covering a variety of different brands of wine. All variables are numerical, but some are highly discrete and take only a limited number of possible values. We believe it is still reasonable to treat these as numerical variables since their values follow a natural numerical order.

##   INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1     1      3          3.2           1.160      -0.98          54.2    -0.567
## 2     2      3          4.5           0.160      -0.81          26.1    -0.425
## 3     4      5          7.1           2.640      -0.88          14.8     0.037
## 4     5      3          5.7           0.385       0.04          18.8    -0.425
## 5     6      4          8.0           0.330      -1.26           9.4        NA
## 6     7      0         11.3           0.320       0.59           2.2     0.556
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                NA                268 0.99280 3.33     -0.59     9.9
## 2                15               -327 1.02792 3.38      0.70      NA
## 3               214                142 0.99518 3.12      0.48    22.0
## 4                22                115 0.99640 2.24      1.83     6.2
## 5              -167                108 0.99457 3.12      1.77    13.7
## 6               -37                 15 0.99940 3.20      1.29    15.4
##   LabelAppeal AcidIndex STARS
## 1           0         8     2
## 2          -1         7     3
## 3          -1         8     3
## 4          -1         6     1
## 5           0         9     2
## 6           0        11    NA

DATA SUMMARY STATS

##                    vars     n    mean      sd  median trimmed     mad     min
## INDEX                 1 12795 8069.98 4656.91 8110.00 8071.03 5977.84    1.00
## TARGET                2 12795    3.03    1.93    3.00    3.05    1.48    0.00
## FixedAcidity          3 12795    7.08    6.32    6.90    7.07    3.26  -18.10
## VolatileAcidity       4 12795    0.32    0.78    0.28    0.32    0.43   -2.79
## CitricAcid            5 12795    0.31    0.86    0.31    0.31    0.42   -3.24
## ResidualSugar         6 12179    5.42   33.75    3.90    5.58   15.72 -127.80
## Chlorides             7 12157    0.05    0.32    0.05    0.05    0.13   -1.17
## FreeSulfurDioxide     8 12148   30.85  148.71   30.00   30.93   56.34 -555.00
## TotalSulfurDioxide    9 12113  120.71  231.91  123.00  120.89  134.92 -823.00
## Density              10 12795    0.99    0.03    0.99    0.99    0.01    0.89
## pH                   11 12400    3.21    0.68    3.20    3.21    0.39    0.48
## Sulphates            12 11585    0.53    0.93    0.50    0.53    0.44   -3.13
## Alcohol              13 12142   10.49    3.73   10.40   10.50    2.37   -4.70
## LabelAppeal          14 12795   -0.01    0.89    0.00   -0.01    1.48   -2.00
## AcidIndex            15 12795    7.77    1.32    8.00    7.64    1.48    4.00
## STARS                16  9436    2.04    0.90    2.00    1.97    1.48    1.00
##                         max    range  skew kurtosis    se
## INDEX              16129.00 16128.00  0.00    -1.20 41.17
## TARGET                 8.00     8.00 -0.33    -0.88  0.02
## FixedAcidity          34.40    52.50 -0.02     1.67  0.06
## VolatileAcidity        3.68     6.47  0.02     1.83  0.01
## CitricAcid             3.86     7.10 -0.05     1.84  0.01
## ResidualSugar        141.15   268.95 -0.05     1.88  0.31
## Chlorides              1.35     2.52  0.03     1.79  0.00
## FreeSulfurDioxide    623.00  1178.00  0.01     1.84  1.35
## TotalSulfurDioxide  1057.00  1880.00 -0.01     1.67  2.11
## Density                1.10     0.21 -0.02     1.90  0.00
## pH                     6.13     5.65  0.04     1.65  0.01
## Sulphates              4.24     7.37  0.01     1.75  0.01
## Alcohol               26.50    31.20 -0.03     1.54  0.03
## LabelAppeal            2.00     4.00  0.01    -0.26  0.01
## AcidIndex             17.00    13.00  1.65     5.19  0.01
## STARS                  4.00     3.00  0.45    -0.69  0.01
## [1] 16
## [1] 12795
##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

Because the INDEX column is only a row identifier and carries no information about the target variable (the number of cases purchased), it was dropped.
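A minimal sketch of that step, using a toy data frame built from the first rows shown above. The object name `wine` is my assumption; the report does not show the name of the training data frame at this point.

```r
# Toy stand-in for the training data (values copied from the first rows above).
wine <- data.frame(
  INDEX  = c(1, 2, 4, 5),
  TARGET = c(3L, 3L, 5L, 3L),
  STARS  = c(2L, 3L, 3L, 1L)
)

# Drop the identifier by name so the remaining columns keep their order.
wine <- wine[, setdiff(names(wine), "INDEX"), drop = FALSE]

ncol(wine)   # number of columns after the drop
names(wine)
```

Dropping by name rather than by position keeps the step valid even if the column order of the input file changes.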

DATA STRUCTURE

## 'data.frame':    12795 obs. of  15 variables:
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...
## [1] 8200

The first observation concerns the number of missing values throughout the dataset: there are 8200 in total. Of the 14 predictor columns, 8 contain at least some missing values. We also see that TARGET is always an integer between 0 and 8, which makes sense as this is the “Number of Cases of Wine Sold” (we would not expect partial cases).
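These counts can be reproduced with a few base R calls. The sketch below uses only the first six rows printed earlier (so the totals are tiny), and the data frame name `wine` is an assumption.

```r
# First six rows of a few columns, as printed in the head of the training set.
wine <- data.frame(
  TARGET    = c(3L, 3L, 5L, 3L, 4L, 0L),
  Chlorides = c(-0.567, -0.425, 0.037, -0.425, NA, 0.556),
  Alcohol   = c(9.9, NA, 22.0, 6.2, 13.7, 15.4),
  STARS     = c(2L, 3L, 3L, 1L, 2L, NA)
)

# Total number of missing cells in the data frame.
total_missing <- sum(is.na(wine))

# Missing cells per column, and the columns with at least one missing value.
missing_by_col    <- colSums(is.na(wine))
cols_with_missing <- names(missing_by_col)[missing_by_col > 0]

# Sanity check on the response: whole, non-negative case counts.
target_is_count <- all(wine$TARGET >= 0 & wine$TARGET == round(wine$TARGET))

total_missing
cols_with_missing
target_is_count
```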

I also note that many of the numerical features measuring the quantity of a chemical in the wine have a negative minimum value. We assume the original chemical measurements were normalized or transformed (possibly with a log transform), which would allow negative values, since negative concentrations are not physically possible. As such, we chose to leave those values as-is.
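To see how widespread the negative values are, one option is to compute the share of negative entries per column. This is a sketch on the first six rows printed earlier, not a result for the full data; the name `wine` is an assumption.

```r
# First six rows of three chemical columns from the head of the training set.
wine <- data.frame(
  FixedAcidity      = c(3.2, 4.5, 7.1, 5.7, 8.0, 11.3),
  CitricAcid        = c(-0.98, -0.81, -0.88, 0.04, -1.26, 0.59),
  FreeSulfurDioxide = c(NA, 15, 214, 22, -167, -37)
)

# Share of non-missing values that are below zero, per column.
neg_share <- sapply(wine, function(x) mean(x < 0, na.rm = TRUE))
round(neg_share, 2)
```

A large share of negatives in a column would be hard to explain as a handful of data-entry errors, which is one reason to treat the values as transformed rather than remove them.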

DISTRIBUTION

I wanted to get an idea of the distribution profiles for each of the variables.

The majority of variables are roughly symmetric and bell-shaped, with a sharp central peak (excess kurtosis between about 1.5 and 1.9 in the summary statistics). Notably, AcidIndex and STARS are right-skewed (skew of 1.65 and 0.45, respectively).
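For reference, skewness and excess kurtosis can be computed directly from the sample moments. The helper names below are hypothetical, and this simple moment-based version may differ slightly from the values in the summary table, which were produced by a package function that may apply a small-sample correction.

```r
# Moment-based skewness: third central moment over variance^(3/2).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3
# so that a normal distribution scores about 0.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

set.seed(1)
symmetric    <- rnorm(5000)  # simulated symmetric variable
right_skewed <- rexp(5000)   # simulated right-skewed variable

skewness(symmetric)
skewness(right_skewed)
excess_kurtosis(symmetric)
```

Positive excess kurtosis, as seen for most chemical features here, indicates a sharper peak and heavier tails than a normal distribution.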

A more intriguing observation is the distinctive shape of many features, which are centered with values clustered around the middle, forming a somewhat uniform shape above and below. This pattern suggests a quasi-tri-modal distribution, with low, middle, and high normal distributions overlapping.

In our analytical approach, we decide against extensive feature engineering; however, we contemplate the possibility of breaking these features into three separate components. Two potential strategies include:

Using the mixtools package to separate the multi-modal curves into three distinct features, each capturing exclusively low, middle, or high values while retaining numerical precision.

Employing discretization to convert the features into categorical values that indicate whether the values are low, middle, or high, offering a simplified representation for analysis and interpretation.
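The second strategy can be sketched with base R's `cut()`. The data below are simulated to look tri-modal, and splitting at the terciles is my assumption; the report does not say where the cut points would be placed.

```r
set.seed(42)

# Simulated feature with low, middle and high clusters (illustrative only).
x <- c(rnorm(300, mean = -2, sd = 0.5),
       rnorm(400, mean =  0, sd = 0.3),
       rnorm(300, mean =  2, sd = 0.5))

# Cut points at the terciles of the observed values.
breaks <- quantile(x, probs = c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE)

# Ordered three-level factor; include.lowest keeps the minimum in "low".
level <- cut(x, breaks = breaks,
             labels = c("low", "middle", "high"),
             include.lowest = TRUE, ordered_result = TRUE)

table(level)
```

Tercile cut points give balanced groups by construction; cut points chosen from the valleys of the density plot would follow the visible clusters more closely.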

BOXPLOT

I also elected to use box-plots to get an idea of the spread of each variable.

The box plots show no significant outliers across the features, suggesting that outlier detection and removal may not be necessary. Notably, AcidIndex, LabelAppeal, and STARS behave like categorical (ordinal) variables. Plotting them against TARGET reveals a discernible pattern: as LabelAppeal increases, TARGET tends to rise.

A similar positive relationship is evident between STARS and TARGET. Particularly noteworthy is the strong association between STARS=NA and lower TARGET values. The original project instructions emphasized that missing data can be informative. Consequently, I opt to impute STARS=NA with STARS=0, in line with the observed pattern that more stars correspond to higher TARGET values.
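A short sketch of this imputation, using the first ten TARGET and STARS values shown in the structure output. The data frame name `wine` is an assumption.

```r
# First ten TARGET and STARS values from the str() output.
wine <- data.frame(
  TARGET = c(3L, 3L, 5L, 3L, 4L, 0L, 0L, 4L, 3L, 6L),
  STARS  = c(2L, 3L, 3L, 1L, 2L, NA, NA, 3L, NA, 4L)
)

# Treat "no rating" as its own lowest level by recoding NA to 0.
wine$STARS[is.na(wine$STARS)] <- 0L

# Mean number of cases per STARS level, including the new 0 level.
aggregate(TARGET ~ STARS, data = wine, FUN = mean)
```

Even in these ten rows the unrated wines have the lowest average TARGET, which matches the pattern described for the full data.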

VARIABLE PLOTS

I also wanted to plot scatter plots of each variable versus the target variable, TARGET, to get an idea of the relationship between them.

Due to the discrete nature of the target variable, identifying clear linear relationships in the data proves challenging. Nevertheless, both STARS and LabelAppeal exhibit a significant positive correlation with the TARGET, and several chemical features demonstrate at least some negative association, with lower values coinciding with a higher frequency of 8 and 7 values in the target variable.

Despite revealing interesting relationships among variables, the plots also expose significant data issues. Notably, numerous data points contain missing values, necessitating imputation or removal. Additionally, there is a concern regarding nonsensical negative values in variables measuring concentration. We have assumed these variables underwent a log transformation, attributing the negative values to this process. However, this assumption lacks supporting evidence, and we would need to reevaluate it if given more information on the data collection and transformation process.

MISSING DATA

In our initial look at the first rows of raw data, I observed missing values. Now, let’s identify which fields contain them; the table below gives the percentage of missing values per field.

##    values                ind
## 1   26.25              STARS
## 2    9.46          Sulphates
## 3    5.33 TotalSulfurDioxide
## 4    5.10            Alcohol
## 5    5.06  FreeSulfurDioxide
## 6    4.99          Chlorides
## 7    4.81      ResidualSugar
## 8    3.09                 pH
## 9    0.00             TARGET
## 10   0.00       FixedAcidity
## 11   0.00    VolatileAcidity
## 12   0.00         CitricAcid
## 13   0.00            Density
## 14   0.00        LabelAppeal
## 15   0.00          AcidIndex
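The `values` / `ind` layout of the table above is what base R's `stack()` returns, so a table of this shape can be built as sketched below. I am assuming that is how it was produced; the sketch uses only the first ten rows of three columns, and the name `wine` is an assumption.

```r
# First ten values of three columns from the str() output.
wine <- data.frame(
  TARGET  = c(3L, 3L, 5L, 3L, 4L, 0L, 0L, 4L, 3L, 6L),
  Alcohol = c(9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6, 15.0, 12.6),
  STARS   = c(2L, 3L, 3L, 1L, 2L, NA, NA, 3L, NA, 4L)
)

# Percentage of missing values per column, rounded to two decimals.
pct_missing <- round(colMeans(is.na(wine)) * 100, 2)

# Sort from most to least missing and reshape into a two-column table.
missing_table <- stack(sort(pct_missing, decreasing = TRUE))
missing_table
```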

The project specifications highlighted that the absence of a value for a variable can itself be predictive. Consequently, I will handle the missing STARS values by imputing STARS=NA with STARS=0. The remaining missing data will be imputed using the caret::preProcess function with the knnImpute method. It’s important to note that this preProcess call does more than impute: knnImpute centers and scales the features as part of the imputation, and a BoxCox transformation is applied in the same step.
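A hedged sketch of what such a call could look like on simulated data. The exact arguments used in this report are not shown, so the `method` vector is an assumption, as are the toy column values. The code only runs the caret step if the caret and RANN packages are installed (to my knowledge caret uses RANN for the neighbor search); otherwise `imputed` stays NULL.

```r
set.seed(123)

# Simulated stand-in for a few predictors (not the real wine data).
toy <- data.frame(
  pH        = round(rnorm(200, mean = 3.2,  sd = 0.4), 2),
  Alcohol   = round(rnorm(200, mean = 10.5, sd = 3.5), 1),
  Sulphates = round(rnorm(200, mean = 0.5,  sd = 0.9), 2)  # has negatives
)
toy$Alcohol[sample(200, 10)]   <- NA
toy$Sulphates[sample(200, 15)] <- NA

imputed <- NULL
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("RANN", quietly = TRUE)) {
  imputed <- tryCatch({
    # Learn the imputation and transformation from the data ...
    pp <- caret::preProcess(toy, method = c("knnImpute", "BoxCox"))
    # ... then apply it; the result is centered, scaled and NA-free.
    predict(pp, newdata = toy)
  }, error = function(e) NULL)
}

if (!is.null(imputed)) print(summary(imputed))
```

Because the features come out centered and scaled, integer-coded variables such as LabelAppeal no longer take integer values afterwards, which explains the non-integer factor levels in the model output later in the report.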

MULTICOLLINEARITY

A potential issue in multivariable regression is the presence of correlation between variables, known as multicollinearity. A simple way to check for this is by running correlations between the variables.

##                    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET               1.00        -0.05           -0.09       0.01          0.02
## FixedAcidity        -0.05         1.00            0.01       0.01         -0.02
## VolatileAcidity     -0.09         0.01            1.00      -0.02         -0.01
## CitricAcid           0.01         0.01           -0.02       1.00         -0.01
## ResidualSugar        0.02        -0.02           -0.01      -0.01          1.00
## Chlorides           -0.04         0.00            0.00      -0.01         -0.01
## FreeSulfurDioxide    0.04         0.00           -0.01       0.01          0.02
## TotalSulfurDioxide   0.05        -0.02           -0.02       0.01          0.02
## Density             -0.04         0.01            0.01      -0.01          0.00
## pH                  -0.01        -0.01            0.01      -0.01          0.01
## Sulphates           -0.04         0.03            0.00      -0.01         -0.01
## Alcohol              0.06        -0.01            0.00       0.02         -0.02
## LabelAppeal          0.36         0.00           -0.02       0.01          0.00
## AcidIndex           -0.22         0.17            0.04       0.06         -0.01
## STARS                0.69        -0.04           -0.06       0.01          0.02
##                    Chlorides FreeSulfurDioxide TotalSulfurDioxide Density    pH
## TARGET                 -0.04              0.04               0.05   -0.04 -0.01
## FixedAcidity            0.00              0.00              -0.02    0.01 -0.01
## VolatileAcidity         0.00             -0.01              -0.02    0.01  0.01
## CitricAcid             -0.01              0.01               0.01   -0.01 -0.01
## ResidualSugar          -0.01              0.02               0.02    0.00  0.01
## Chlorides               1.00             -0.02              -0.01    0.02 -0.02
## FreeSulfurDioxide      -0.02              1.00               0.01    0.00  0.01
## TotalSulfurDioxide     -0.01              0.01               1.00    0.01  0.00
## Density                 0.02              0.00               0.01    1.00  0.01
## pH                     -0.02              0.01               0.00    0.01  1.00
## Sulphates               0.00              0.01              -0.01   -0.01  0.00
## Alcohol                -0.02             -0.02              -0.02   -0.01 -0.01
## LabelAppeal             0.01              0.01              -0.01   -0.01  0.00
## AcidIndex               0.03             -0.04              -0.04    0.04 -0.07
## STARS                  -0.03              0.02               0.03   -0.03 -0.01
##                    Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET                 -0.04    0.06        0.36     -0.22  0.69
## FixedAcidity            0.03   -0.01        0.00      0.17 -0.04
## VolatileAcidity         0.00    0.00       -0.02      0.04 -0.06
## CitricAcid             -0.01    0.02        0.01      0.06  0.01
## ResidualSugar          -0.01   -0.02        0.00     -0.01  0.02
## Chlorides               0.00   -0.02        0.01      0.03 -0.03
## FreeSulfurDioxide       0.01   -0.02        0.01     -0.04  0.02
## TotalSulfurDioxide     -0.01   -0.02       -0.01     -0.04  0.03
## Density                -0.01   -0.01       -0.01      0.04 -0.03
## pH                      0.00   -0.01        0.00     -0.07 -0.01
## Sulphates               1.00    0.01       -0.01      0.03 -0.03
## Alcohol                 0.01    1.00        0.00     -0.05  0.06
## LabelAppeal            -0.01    0.00        1.00      0.02  0.26
## AcidIndex               0.03   -0.05        0.02      1.00 -0.15
## STARS                  -0.03    0.06        0.26     -0.15  1.00

Looking at the correlation matrix, I note that the predictors exhibit minimal correlations with each other (the largest is 0.26, between LabelAppeal and STARS), indicating a lack of significant multicollinearity. Coefficient estimates in the regression models should therefore not be distorted by collinear predictors.
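A correlation matrix like the one above can be computed with `cor()`. The sketch below uses simulated data, and `use = "pairwise.complete.obs"` is my assumption about how missing values were handled; the report does not state it.

```r
set.seed(7)
n <- 500

# Simulated data: TARGET depends on STARS, Alcohol is unrelated noise.
stars <- sample(1:4, n, replace = TRUE)
wine <- data.frame(
  TARGET  = rpois(n, lambda = exp(0.3 + 0.3 * stars)),
  STARS   = stars,
  Alcohol = rnorm(n, mean = 10.5, sd = 3.7)
)
wine$STARS[sample(n, 100)] <- NA  # introduce missing ratings

# Pairwise-complete correlations, rounded as in the printed matrix.
corr <- round(cor(wine, use = "pairwise.complete.obs"), 2)
corr

# Largest absolute correlation among the predictors only (TARGET excluded).
pred_corr   <- corr[-1, -1]
max_offdiag <- max(abs(pred_corr[upper.tri(pred_corr)]))
max_offdiag
```

Pairwise correlations only detect collinearity between two variables at a time; variance inflation factors would be needed to rule out relationships involving several predictors.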

DATA PREPARATION

In our data preparation and exploration, the key findings can be summarized into the following categories:

REMOVED FIELDS:

The INDEX field was removed from the dataset as it did not contribute any relevant information for the model.

MISSING VALUES:

For the STARS field, missing values were imputed as 0, given the strong association between missing STARS values and low values of the target variable. Other fields with missing values were imputed using the knnImpute method from caret.

## [1] 0

OUTLIERS:

Several numerical features exhibited seemingly unreasonable negative values. Despite this, we opted to interpret them as log-transformed variables, assuming the values are legitimate.

TRANSFORM NON NORMAL VARIABLES

The following plots illustrate how the distributions change and the final values after the transformations:

Upon completing the transformations, we observe that the variables are now more centered and exhibit a closer resemblance to a normal distribution. However, it is evident that they still deviate from perfect normal distributions.

FINALIZATION OF DATA PREPARATION

## [1] "Number of Training Samples:  10238"
## [1] "Number of Testing Samples:  2557"
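The sample counts above correspond to roughly an 80/20 split. A minimal base R sketch of such a split follows; the seed, the simulated data, and the name `testingData` are assumptions, while `trainingData` is the name used in the model calls below. A simple random split gives 10236 training rows, so the 10238 reported above suggests a stratified partitioning function was used instead.

```r
set.seed(2024)  # arbitrary seed so the split is reproducible

# Toy stand-in with the same number of rows as the training data.
wine <- data.frame(TARGET = rpois(12795, lambda = 3))

# Sample 80% of the row indices for training; the rest form the test set.
n_train   <- round(0.8 * nrow(wine))
train_idx <- sample(seq_len(nrow(wine)), size = n_train)

trainingData <- wine[train_idx, , drop = FALSE]
testingData  <- wine[-train_idx, , drop = FALSE]

paste("Number of Training Samples: ", nrow(trainingData))
paste("Number of Testing Samples: ", nrow(testingData))
```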

MODEL BUILDING

POISSON REGRESSION MODEL 1

In this first model, we include all available features. Features include:

FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = trainingData)
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                               0.003085   0.319723   0.010
## FixedAcidity                             -0.001992   0.005768  -0.345
## VolatileAcidity                          -0.025544   0.005749  -4.443
## CitricAcid                                0.006762   0.005641   1.199
## ResidualSugar                             0.001775   0.005871   0.302
## Chlorides                                -0.015149   0.005823  -2.602
## FreeSulfurDioxide                         0.010149   0.005767   1.760
## TotalSulfurDioxide                        0.016715   0.005858   2.853
## Density                                  -0.006978   0.005709  -1.222
## pH                                       -0.002176   0.005817  -0.374
## Sulphates                                -0.007812   0.005916  -1.320
## Alcohol                                   0.015683   0.005895   2.660
## as.factor(LabelAppeal)-1.11204793733397   0.248120   0.042218   5.877
## as.factor(LabelAppeal)0.0101741115806247  0.441789   0.041137  10.739
## as.factor(LabelAppeal)1.13239616049522    0.570610   0.041849  13.635
## as.factor(LabelAppeal)2.25461820940981    0.708786   0.047071  15.058
## as.factor(AcidIndex)-3.59682937695875    -0.138996   0.324343  -0.429
## as.factor(AcidIndex)-1.79176983045029    -0.098355   0.317457  -0.310
## as.factor(AcidIndex)-0.545318540973785   -0.143436   0.317165  -0.452
## as.factor(AcidIndex)0.362910765511677    -0.169359   0.317223  -0.534
## as.factor(AcidIndex)1.05172974217783     -0.281750   0.317645  -0.887
## as.factor(AcidIndex)1.59059728918163     -0.419515   0.318958  -1.315
## as.factor(AcidIndex)2.02271372429848     -0.798853   0.323818  -2.467
## as.factor(AcidIndex)2.37629509167962     -0.782543   0.329882  -2.372
## as.factor(AcidIndex)2.67051656830802     -0.712545   0.334968  -2.127
## as.factor(AcidIndex)2.9188445277671      -0.657023   0.344814  -1.905
## as.factor(AcidIndex)3.13100139587667     -0.733772   0.475161  -1.544
## as.factor(AcidIndex)3.31417429494859     -0.965357   0.548643  -1.760
## as.factor(AcidIndex)3.47378568897179     -1.075318   0.548867  -1.959
## as.factor(STARS)-0.42623524866846         0.751147   0.021927  34.257
## as.factor(STARS)0.416552574962037         1.068591   0.020480  52.178
## as.factor(STARS)1.25934039859254          1.189054   0.021607  55.031
## as.factor(STARS)2.10212822222303          1.310332   0.027052  48.437
##                                                      Pr(>|z|)    
## (Intercept)                                           0.99230    
## FixedAcidity                                          0.72989    
## VolatileAcidity                                 0.00000885131 ***
## CitricAcid                                            0.23064    
## ResidualSugar                                         0.76245    
## Chlorides                                             0.00928 ** 
## FreeSulfurDioxide                                     0.07841 .  
## TotalSulfurDioxide                                    0.00433 ** 
## Density                                               0.22158    
## pH                                                    0.70835    
## Sulphates                                             0.18671    
## Alcohol                                               0.00781 ** 
## as.factor(LabelAppeal)-1.11204793733397         0.00000000418 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                 0.66825    
## as.factor(AcidIndex)-1.79176983045029                 0.75670    
## as.factor(AcidIndex)-0.545318540973785                0.65109    
## as.factor(AcidIndex)0.362910765511677                 0.59343    
## as.factor(AcidIndex)1.05172974217783                  0.37508    
## as.factor(AcidIndex)1.59059728918163                  0.18842    
## as.factor(AcidIndex)2.02271372429848                  0.01363 *  
## as.factor(AcidIndex)2.37629509167962                  0.01768 *  
## as.factor(AcidIndex)2.67051656830802                  0.03340 *  
## as.factor(AcidIndex)2.9188445277671                   0.05672 .  
## as.factor(AcidIndex)3.13100139587667                  0.12253    
## as.factor(AcidIndex)3.31417429494859                  0.07849 .  
## as.factor(AcidIndex)3.47378568897179                  0.05009 .  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18257  on 10237  degrees of freedom
## Residual deviance: 10834  on 10205  degrees of freedom
## AIC: 36472
## 
## Number of Fisher Scoring iterations: 6
## $RMSE
## [1] 2.588709
## 
## $Rsquared
## [1] 0.5197045
## 
## $MAE
## [1] 2.226568
## 
## $aic
## [1] 36471.7
## 
## $bic
## [1] 36710.42
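The sketch below shows one way to fit a Poisson model and compute the same five metrics on held-out data. It runs on simulated data, and the helper `eval_model` is hypothetical: the report's own evaluation function is not shown, so this is not necessarily how the numbers above were obtained.

```r
set.seed(99)
n <- 1000

# Simulated data: counts driven by a rating and one chemical feature.
dat <- data.frame(
  STARS           = sample(0:4, n, replace = TRUE),
  VolatileAcidity = rnorm(n)
)
dat$TARGET <- rpois(n, exp(0.2 + 0.3 * dat$STARS - 0.05 * dat$VolatileAcidity))

train <- dat[1:800, ]
test  <- dat[801:1000, ]

# Poisson regression with a log link; STARS enters as a factor.
fit <- glm(TARGET ~ VolatileAcidity + as.factor(STARS),
           family = poisson, data = train)

# type = "response" returns expected counts; the default is the log scale.
pred <- predict(fit, newdata = test, type = "response")

# Hypothetical helper collecting error metrics and information criteria.
eval_model <- function(obs, pred, model) {
  list(
    RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,
    MAE      = mean(abs(obs - pred)),
    aic      = AIC(model),
    bic      = BIC(model)
  )
}

metrics <- eval_model(test$TARGET, pred, fit)
str(metrics)
```

Because the model uses a log link, coefficients are on the log scale and `exp(coef(fit))` gives multiplicative effects on the expected count. For instance, the largest STARS coefficient in the output above (1.310) corresponds to roughly 3.7 times the expected cases of the baseline STARS level.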

POISSON REGRESSION REDUCED MODEL 2

In this second model, we only include the most predictive features based on our first Poisson Model. The predictors for the following model are:

VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS), family = poisson, data = trainingData)
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                              -0.007050   0.319240  -0.022
## VolatileAcidity                          -0.025933   0.005747  -4.512
## TotalSulfurDioxide                        0.016602   0.005854   2.836
## Alcohol                                   0.016146   0.005891   2.741
## as.factor(LabelAppeal)-1.11204793733397   0.249280   0.042214   5.905
## as.factor(LabelAppeal)0.0101741115806247  0.443022   0.041133  10.770
## as.factor(LabelAppeal)1.13239616049522    0.571830   0.041842  13.666
## as.factor(LabelAppeal)2.25461820940981    0.709495   0.047057  15.077
## as.factor(AcidIndex)-3.59682937695875    -0.128514   0.323956  -0.397
## as.factor(AcidIndex)-1.79176983045029    -0.090501   0.317028  -0.285
## as.factor(AcidIndex)-0.545318540973785   -0.135344   0.316693  -0.427
## as.factor(AcidIndex)0.362910765511677    -0.161767   0.316736  -0.511
## as.factor(AcidIndex)1.05172974217783     -0.275130   0.317112  -0.868
## as.factor(AcidIndex)1.59059728918163     -0.415075   0.318390  -1.304
## as.factor(AcidIndex)2.02271372429848     -0.795036   0.323244  -2.460
## as.factor(AcidIndex)2.37629509167962     -0.779055   0.329310  -2.366
## as.factor(AcidIndex)2.67051656830802     -0.708279   0.334405  -2.118
## as.factor(AcidIndex)2.9188445277671      -0.644856   0.344143  -1.874
## as.factor(AcidIndex)3.13100139587667     -0.711490   0.474721  -1.499
## as.factor(AcidIndex)3.31417429494859     -0.953863   0.548057  -1.740
## as.factor(AcidIndex)3.47378568897179     -1.088689   0.548180  -1.986
## as.factor(STARS)-0.42623524866846         0.753195   0.021919  34.362
## as.factor(STARS)0.416552574962037         1.070745   0.020469  52.311
## as.factor(STARS)1.25934039859254          1.191737   0.021593  55.190
## as.factor(STARS)2.10212822222303          1.312190   0.027034  48.539
##                                                      Pr(>|z|)    
## (Intercept)                                           0.98238    
## VolatileAcidity                                 0.00000641065 ***
## TotalSulfurDioxide                                    0.00457 ** 
## Alcohol                                               0.00613 ** 
## as.factor(LabelAppeal)-1.11204793733397         0.00000000352 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                 0.69159    
## as.factor(AcidIndex)-1.79176983045029                 0.77529    
## as.factor(AcidIndex)-0.545318540973785                0.66911    
## as.factor(AcidIndex)0.362910765511677                 0.60954    
## as.factor(AcidIndex)1.05172974217783                  0.38561    
## as.factor(AcidIndex)1.59059728918163                  0.19235    
## as.factor(AcidIndex)2.02271372429848                  0.01391 *  
## as.factor(AcidIndex)2.37629509167962                  0.01799 *  
## as.factor(AcidIndex)2.67051656830802                  0.03417 *  
## as.factor(AcidIndex)2.9188445277671                   0.06096 .  
## as.factor(AcidIndex)3.13100139587667                  0.13394    
## as.factor(AcidIndex)3.31417429494859                  0.08178 .  
## as.factor(AcidIndex)3.47378568897179                  0.04703 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18257  on 10237  degrees of freedom
## Residual deviance: 10849  on 10213  degrees of freedom
## AIC: 36471
## 
## Number of Fisher Scoring iterations: 6
## Warning in model_eval$aic <- model$aic: Coercing LHS to a list
## $RMSE
## [1] 2.588993
## 
## $Rsquared
## [1] 0.5185381
## 
## $MAE
## [1] 2.22691
## 
## $aic
## [1] 36470.94
## 
## $bic
## [1] 36651.78

NEGATIVE BINOMIAL MODEL 3

Similar to Poisson Model 1, the predictors for the following model are:

FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = trainingData, 
##     init.theta = 41134.94708, link = log)
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                               0.003106   0.319742   0.010
## FixedAcidity                             -0.001992   0.005769  -0.345
## VolatileAcidity                          -0.025545   0.005749  -4.443
## CitricAcid                                0.006762   0.005641   1.199
## ResidualSugar                             0.001775   0.005872   0.302
## Chlorides                                -0.015149   0.005823  -2.602
## FreeSulfurDioxide                         0.010149   0.005767   1.760
## TotalSulfurDioxide                        0.016716   0.005859   2.853
## Density                                  -0.006978   0.005709  -1.222
## pH                                       -0.002176   0.005817  -0.374
## Sulphates                                -0.007812   0.005916  -1.320
## Alcohol                                   0.015683   0.005896   2.660
## as.factor(LabelAppeal)-1.11204793733397   0.248120   0.042219   5.877
## as.factor(LabelAppeal)0.0101741115806247  0.441788   0.041138  10.739
## as.factor(LabelAppeal)1.13239616049522    0.570606   0.041850  13.635
## as.factor(LabelAppeal)2.25461820940981    0.708782   0.047072  15.057
## as.factor(AcidIndex)-3.59682937695875    -0.139016   0.324362  -0.429
## as.factor(AcidIndex)-1.79176983045029    -0.098372   0.317476  -0.310
## as.factor(AcidIndex)-0.545318540973785   -0.143455   0.317184  -0.452
## as.factor(AcidIndex)0.362910765511677    -0.169378   0.317243  -0.534
## as.factor(AcidIndex)1.05172974217783     -0.281772   0.317664  -0.887
## as.factor(AcidIndex)1.59059728918163     -0.419539   0.318977  -1.315
## as.factor(AcidIndex)2.02271372429848     -0.798883   0.323837  -2.467
## as.factor(AcidIndex)2.37629509167962     -0.782574   0.329901  -2.372
## as.factor(AcidIndex)2.67051656830802     -0.712573   0.334987  -2.127
## as.factor(AcidIndex)2.9188445277671      -0.657049   0.344832  -1.905
## as.factor(AcidIndex)3.13100139587667     -0.733804   0.475179  -1.544
## as.factor(AcidIndex)3.31417429494859     -0.965392   0.548661  -1.760
## as.factor(AcidIndex)3.47378568897179     -1.075356   0.548884  -1.959
## as.factor(STARS)-0.42623524866846         0.751146   0.021927  34.256
## as.factor(STARS)0.416552574962037         1.068590   0.020480  52.177
## as.factor(STARS)1.25934039859254          1.189055   0.021608  55.029
## as.factor(STARS)2.10212822222303          1.310333   0.027053  48.435
##                                                      Pr(>|z|)    
## (Intercept)                                           0.99225    
## FixedAcidity                                          0.72989    
## VolatileAcidity                                 0.00000885467 ***
## CitricAcid                                            0.23065    
## ResidualSugar                                         0.76243    
## Chlorides                                             0.00928 ** 
## FreeSulfurDioxide                                     0.07841 .  
## TotalSulfurDioxide                                    0.00433 ** 
## Density                                               0.22159    
## pH                                                    0.70831    
## Sulphates                                             0.18671    
## Alcohol                                               0.00781 ** 
## as.factor(LabelAppeal)-1.11204793733397         0.00000000418 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                 0.66823    
## as.factor(AcidIndex)-1.79176983045029                 0.75667    
## as.factor(AcidIndex)-0.545318540973785                0.65107    
## as.factor(AcidIndex)0.362910765511677                 0.59341    
## as.factor(AcidIndex)1.05172974217783                  0.37507    
## as.factor(AcidIndex)1.59059728918163                  0.18842    
## as.factor(AcidIndex)2.02271372429848                  0.01363 *  
## as.factor(AcidIndex)2.37629509167962                  0.01768 *  
## as.factor(AcidIndex)2.67051656830802                  0.03341 *  
## as.factor(AcidIndex)2.9188445277671                   0.05673 .  
## as.factor(AcidIndex)3.13100139587667                  0.12252    
## as.factor(AcidIndex)3.31417429494859                  0.07849 .  
## as.factor(AcidIndex)3.47378568897179                  0.05009 .  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41134.95) family taken to be 1)
## 
##     Null deviance: 18256  on 10237  degrees of freedom
## Residual deviance: 10834  on 10205  degrees of freedom
## AIC: 36474
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41135 
##           Std. Err.:  38698 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -36406.04
## $RMSE
## [1] 2.588709
## 
## $Rsquared
## [1] 0.5197043
## 
## $MAE
## [1] 2.226568
## 
## $aic
## [1] 36474.04
## 
## $bic
## [1] 36719.99
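
The very large theta estimate (about 41,135) and the iteration-limit warning suggest that the negative binomial fit is collapsing toward a Poisson model, which is what happens when the counts are not overdispersed (the negative binomial variance is mu + mu^2/theta, so a huge theta leaves only the Poisson variance mu). One common check is the Pearson dispersion statistic of the corresponding Poisson fit. The sketch below uses simulated data and base R only, not the wine data; the helper name dispersion is illustrative.

```r
set.seed(7)
n  <- 2000
x  <- rnorm(n)
mu <- exp(0.8 + 0.4 * x)

y_pois <- rpois(n, mu)                     # equidispersed counts
y_nb   <- rnbinom(n, mu = mu, size = 1.5)  # overdispersed counts (theta = 1.5)

# Pearson chi-square divided by the residual degrees of freedom
dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

fit_pois <- glm(y_pois ~ x, family = poisson)
fit_nb   <- glm(y_nb   ~ x, family = poisson)

c(poisson_data = dispersion(fit_pois), overdispersed_data = dispersion(fit_nb))
```

A value near 1 indicates no overdispersion, in which case a negative binomial model adds little over Poisson. A value well above 1 favors the negative binomial.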

NEGATIVE BINOMIAL REDUCED MODEL 4

Similar to Poisson Model 2, the predictors for the following model are:

VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS), data = trainingData, init.theta = 41086.94367, 
##     link = log)
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                               0.003146   0.319305   0.010
## VolatileAcidity                          -0.025880   0.005748  -4.503
## FreeSulfurDioxide                         0.010281   0.005763   1.784
## TotalSulfurDioxide                        0.016568   0.005854   2.830
## Alcohol                                   0.016283   0.005892   2.764
## as.factor(LabelAppeal)-1.11204793733397   0.248881   0.042216   5.895
## as.factor(LabelAppeal)0.0101741115806247  0.442465   0.041136  10.756
## as.factor(LabelAppeal)1.13239616049522    0.570970   0.041847  13.644
## as.factor(LabelAppeal)2.25461820940981    0.708906   0.047060  15.064
## as.factor(AcidIndex)-3.59682937695875    -0.139441   0.324031  -0.430
## as.factor(AcidIndex)-1.79176983045029    -0.100423   0.317095  -0.317
## as.factor(AcidIndex)-0.545318540973785   -0.144746   0.316755  -0.457
## as.factor(AcidIndex)0.362910765511677    -0.171011   0.316796  -0.540
## as.factor(AcidIndex)1.05172974217783     -0.284490   0.317173  -0.897
## as.factor(AcidIndex)1.59059728918163     -0.423797   0.318445  -1.331
## as.factor(AcidIndex)2.02271372429848     -0.803065   0.323292  -2.484
## as.factor(AcidIndex)2.37629509167962     -0.786016   0.329349  -2.387
## as.factor(AcidIndex)2.67051656830802     -0.715109   0.334444  -2.138
## as.factor(AcidIndex)2.9188445277671      -0.656024   0.344216  -1.906
## as.factor(AcidIndex)3.13100139587667     -0.721816   0.474773  -1.520
## as.factor(AcidIndex)3.31417429494859     -0.955332   0.548072  -1.743
## as.factor(AcidIndex)3.47378568897179     -1.095310   0.548206  -1.998
## as.factor(STARS)-0.42623524866846         0.752694   0.021921  34.336
## as.factor(STARS)0.416552574962037         1.070384   0.020470  52.290
## as.factor(STARS)1.25934039859254          1.191406   0.021595  55.171
## as.factor(STARS)2.10212822222303          1.313002   0.027038  48.561
##                                                      Pr(>|z|)    
## (Intercept)                                           0.99214    
## VolatileAcidity                                 0.00000671010 ***
## FreeSulfurDioxide                                     0.07444 .  
## TotalSulfurDioxide                                    0.00465 ** 
## Alcohol                                               0.00571 ** 
## as.factor(LabelAppeal)-1.11204793733397         0.00000000374 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                 0.66695    
## as.factor(AcidIndex)-1.79176983045029                 0.75147    
## as.factor(AcidIndex)-0.545318540973785                0.64770    
## as.factor(AcidIndex)0.362910765511677                 0.58933    
## as.factor(AcidIndex)1.05172974217783                  0.36974    
## as.factor(AcidIndex)1.59059728918163                  0.18324    
## as.factor(AcidIndex)2.02271372429848                  0.01299 *  
## as.factor(AcidIndex)2.37629509167962                  0.01701 *  
## as.factor(AcidIndex)2.67051656830802                  0.03250 *  
## as.factor(AcidIndex)2.9188445277671                   0.05667 .  
## as.factor(AcidIndex)3.13100139587667                  0.12843    
## as.factor(AcidIndex)3.31417429494859                  0.08132 .  
## as.factor(AcidIndex)3.47378568897179                  0.04572 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41086.94) family taken to be 1)
## 
##     Null deviance: 18256  on 10237  degrees of freedom
## Residual deviance: 10846  on 10212  degrees of freedom
## AIC: 36472
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41087 
##           Std. Err.:  38650 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -36418.09
## Warning in model_eval$aic <- model$aic: Coercing LHS to a list
## $RMSE
## [1] 2.588933
## 
## $Rsquared
## [1] 0.5187898
## 
## $MAE
## [1] 2.226864
## 
## $aic
## [1] 36472.09
## 
## $bic
## [1] 36667.4

MULTIPLE LINEAR REGRESSION MODEL 5

The predictors for the following model are:

FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9522 -0.8534  0.0343  0.8407  5.4734 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                               0.931620   0.757229   1.230
## FixedAcidity                             -0.004052   0.013084  -0.310
## VolatileAcidity                          -0.081484   0.012986  -6.275
## CitricAcid                                0.022100   0.012865   1.718
## ResidualSugar                             0.006841   0.013311   0.514
## Chlorides                                -0.046700   0.013175  -3.545
## FreeSulfurDioxide                         0.031034   0.013117   2.366
## TotalSulfurDioxide                        0.050073   0.013193   3.795
## Density                                  -0.021108   0.012926  -1.633
## pH                                       -0.005594   0.013123  -0.426
## Sulphates                                -0.019856   0.013445  -1.477
## Alcohol                                   0.050943   0.013321   3.824
## as.factor(LabelAppeal)-1.11204793733397   0.386444   0.069704   5.544
## as.factor(LabelAppeal)0.0101741115806247  0.863227   0.067899  12.713
## as.factor(LabelAppeal)1.13239616049522    1.314178   0.070928  18.528
## as.factor(LabelAppeal)2.25461820940981    1.909243   0.093235  20.478
## as.factor(AcidIndex)-3.59682937695875    -0.238781   0.773108  -0.309
## as.factor(AcidIndex)-1.79176983045029    -0.147070   0.755530  -0.195
## as.factor(AcidIndex)-0.545318540973785   -0.292792   0.754771  -0.388
## as.factor(AcidIndex)0.362910765511677    -0.381323   0.754857  -0.505
## as.factor(AcidIndex)1.05172974217783     -0.691337   0.755587  -0.915
## as.factor(AcidIndex)1.59059728918163     -0.955120   0.757102  -1.262
## as.factor(AcidIndex)2.02271372429848     -1.445823   0.760290  -1.902
## as.factor(AcidIndex)2.37629509167962     -1.490708   0.765801  -1.947
## as.factor(AcidIndex)2.67051656830802     -1.602844   0.775004  -2.068
## as.factor(AcidIndex)2.9188445277671      -1.247523   0.782729  -1.594
## as.factor(AcidIndex)3.13100139587667     -1.346890   0.923859  -1.458
## as.factor(AcidIndex)3.31417429494859     -1.723319   0.954413  -1.806
## as.factor(AcidIndex)3.47378568897179     -1.836053   0.924747  -1.985
## as.factor(STARS)-0.42623524866846         1.339736   0.036898  36.309
## as.factor(STARS)0.416552574962037         2.370119   0.035887  66.044
## as.factor(STARS)1.25934039859254          2.938188   0.041719  70.428
## as.factor(STARS)2.10212822222303          3.622417   0.065432  55.362
##                                                      Pr(>|t|)    
## (Intercept)                                          0.218613    
## FixedAcidity                                         0.756779    
## VolatileAcidity                                0.000000000365 ***
## CitricAcid                                           0.085874 .  
## ResidualSugar                                        0.607301    
## Chlorides                                            0.000395 ***
## FreeSulfurDioxide                                    0.018003 *  
## TotalSulfurDioxide                                   0.000148 ***
## Density                                              0.102502    
## pH                                                   0.669914    
## Sulphates                                            0.139737    
## Alcohol                                              0.000132 ***
## as.factor(LabelAppeal)-1.11204793733397        0.000000030288 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                0.757436    
## as.factor(AcidIndex)-1.79176983045029                0.845664    
## as.factor(AcidIndex)-0.545318540973785               0.698082    
## as.factor(AcidIndex)0.362910765511677                0.613458    
## as.factor(AcidIndex)1.05172974217783                 0.360231    
## as.factor(AcidIndex)1.59059728918163                 0.207140    
## as.factor(AcidIndex)2.02271372429848                 0.057242 .  
## as.factor(AcidIndex)2.37629509167962                 0.051610 .  
## as.factor(AcidIndex)2.67051656830802                 0.038649 *  
## as.factor(AcidIndex)2.9188445277671                  0.111009    
## as.factor(AcidIndex)3.13100139587667                 0.144900    
## as.factor(AcidIndex)3.31417429494859                 0.071005 .  
## as.factor(AcidIndex)3.47378568897179                 0.047119 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.304 on 10205 degrees of freedom
## Multiple R-squared:  0.5424, Adjusted R-squared:  0.541 
## F-statistic:   378 on 32 and 10205 DF,  p-value: < 0.00000000000000022
## Warning in model_eval$aic <- AIC(model): Coercing LHS to a list
## $RMSE
## [1] 1.301831
## 
## $Rsquared
## [1] 0.5423939
## 
## $MAE
## [1] 1.018945
## 
## $aic
## [1] 34523.17
## 
## $bic
## [1] 34769.12

MULTIPLE LINEAR REGRESSION MODEL REDUCED 6

For the final linear model, I apply stepAIC to linear Model 5 to select the subset of predictors with the lowest AIC. The procedure drops FixedAcidity, ResidualSugar, and pH.
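
A minimal sketch of how such a selection step could look, assuming the MASS package is installed and using simulated data in place of trainingData (the variables x1 to x3 are hypothetical):

```r
library(MASS)  # provides stepAIC()

set.seed(42)
n  <- 200
x1 <- rnorm(n)  # informative predictor
x2 <- rnorm(n)  # informative predictor
x3 <- rnorm(n)  # pure noise, unrelated to y
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

full_fit <- lm(y ~ x1 + x2 + x3, data = sim)

# Backward elimination: drop a term whenever doing so lowers the AIC
reduced_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

formula(reduced_fit)
c(full = AIC(full_fit), reduced = AIC(reduced_fit))
```

Backward elimination never returns a model with a higher AIC than the one it started from, which is why the reduced model's AIC is at most that of the full model.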

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + Sulphates + 
##     Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS), data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9668 -0.8516  0.0346  0.8410  5.4791 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                               0.94253    0.75615   1.246
## VolatileAcidity                          -0.08158    0.01298  -6.284
## CitricAcid                                0.02205    0.01286   1.714
## Chlorides                                -0.04666    0.01317  -3.542
## FreeSulfurDioxide                         0.03106    0.01311   2.369
## TotalSulfurDioxide                        0.05032    0.01319   3.816
## Density                                  -0.02112    0.01292  -1.634
## Sulphates                                -0.02008    0.01344  -1.495
## Alcohol                                   0.05091    0.01332   3.823
## as.factor(LabelAppeal)-1.11204793733397   0.38713    0.06969   5.555
## as.factor(LabelAppeal)0.0101741115806247  0.86330    0.06789  12.716
## as.factor(LabelAppeal)1.13239616049522    1.31469    0.07091  18.539
## as.factor(LabelAppeal)2.25461820940981    1.90982    0.09321  20.489
## as.factor(AcidIndex)-3.59682937695875    -0.24879    0.77227  -0.322
## as.factor(AcidIndex)-1.79176983045029    -0.15866    0.75455  -0.210
## as.factor(AcidIndex)-0.545318540973785   -0.30425    0.75371  -0.404
## as.factor(AcidIndex)0.362910765511677    -0.39233    0.75377  -0.520
## as.factor(AcidIndex)1.05172974217783     -0.70337    0.75438  -0.932
## as.factor(AcidIndex)1.59059728918163     -0.96828    0.75580  -1.281
## as.factor(AcidIndex)2.02271372429848     -1.45914    0.75896  -1.923
## as.factor(AcidIndex)2.37629509167962     -1.50472    0.76437  -1.969
## as.factor(AcidIndex)2.67051656830802     -1.61663    0.77352  -2.090
## as.factor(AcidIndex)2.9188445277671      -1.26256    0.78131  -1.616
## as.factor(AcidIndex)3.13100139587667     -1.35945    0.92262  -1.473
## as.factor(AcidIndex)3.31417429494859     -1.73323    0.95274  -1.819
## as.factor(AcidIndex)3.47378568897179     -1.85267    0.92329  -2.007
## as.factor(STARS)-0.42623524866846         1.33996    0.03689  36.324
## as.factor(STARS)0.416552574962037         2.37062    0.03587  66.082
## as.factor(STARS)1.25934039859254          2.93856    0.04171  70.456
## as.factor(STARS)2.10212822222303          3.62296    0.06542  55.381
##                                                      Pr(>|t|)    
## (Intercept)                                          0.212615    
## VolatileAcidity                                0.000000000343 ***
## CitricAcid                                           0.086543 .  
## Chlorides                                            0.000398 ***
## FreeSulfurDioxide                                    0.017866 *  
## TotalSulfurDioxide                                   0.000136 ***
## Density                                              0.102261    
## Sulphates                                            0.135004    
## Alcohol                                              0.000133 ***
## as.factor(LabelAppeal)-1.11204793733397        0.000000028428 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522   < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981   < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875                0.747341    
## as.factor(AcidIndex)-1.79176983045029                0.833456    
## as.factor(AcidIndex)-0.545318540973785               0.686458    
## as.factor(AcidIndex)0.362910765511677                0.602730    
## as.factor(AcidIndex)1.05172974217783                 0.351160    
## as.factor(AcidIndex)1.59059728918163                 0.200178    
## as.factor(AcidIndex)2.02271372429848                 0.054565 .  
## as.factor(AcidIndex)2.37629509167962                 0.049029 *  
## as.factor(AcidIndex)2.67051656830802                 0.036645 *  
## as.factor(AcidIndex)2.9188445277671                  0.106135    
## as.factor(AcidIndex)3.13100139587667                 0.140656    
## as.factor(AcidIndex)3.31417429494859                 0.068909 .  
## as.factor(AcidIndex)3.47378568897179                 0.044818 *  
## as.factor(STARS)-0.42623524866846        < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037        < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254         < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.304 on 10208 degrees of freedom
## Multiple R-squared:  0.5424, Adjusted R-squared:  0.5411 
## F-statistic: 417.2 on 29 and 10208 DF,  p-value: < 0.00000000000000022
## Warning in model_eval$aic <- AIC(model): Coercing LHS to a list
## $RMSE
## [1] 1.301865
## 
## $Rsquared
## [1] 0.5423695
## 
## $MAE
## [1] 1.018829
## 
## $aic
## [1] 34517.71
## 
## $bic
## [1] 34741.96

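Several predictors have statistically significant effects on TARGET at the 0.05 level: VolatileAcidity, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, every level of LabelAppeal and STARS, and three of the higher AcidIndex levels. CitricAcid, Density, and Sulphates were retained by stepAIC but are not significant at that level (p = 0.087, 0.102, and 0.135).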

The coefficients for the different levels of categorical predictors (e.g., LabelAppeal, AcidIndex, and STARS) represent the difference in the mean of the dependent variable compared to the reference level.
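
The reference-level interpretation can be illustrated with a small simulated example (hypothetical data, a single factor named rating). With only one factor in the model, each coefficient equals the difference between that level's mean and the reference level's mean exactly:

```r
set.seed(1)
# Hypothetical rating with three levels; "1" is the reference level
rating <- factor(rep(c("1", "2", "3"), each = 50))
y <- c(rnorm(50, mean = 2), rnorm(50, mean = 3), rnorm(50, mean = 4.5))

fit <- lm(y ~ rating)
group_means <- tapply(y, rating, mean)

coef(fit)                            # intercept = mean of level "1"
group_means["2"] - group_means["1"]  # same value as coef(fit)["rating2"]
```

In the wine models, which include other predictors, the coefficients are differences from the reference level after adjusting for those predictors.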

As for effect direction, the positive coefficient for Alcohol means that higher Alcohol is associated with a higher mean TARGET, while the negative coefficient for VolatileAcidity indicates the opposite.

The overall model fit is evaluated using metrics such as R-squared (the proportion of variance explained), adjusted R-squared (a penalized version of R-squared for the number of predictors), and F-statistic (overall significance of the model).
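
As a quick check on how these quantities relate, the adjusted R-squared and F-statistic for Model 6 can be recomputed from the reported R-squared and degrees of freedom. The variable names are illustrative; the inputs are taken from the summary output above.

```r
r2       <- 0.5423695  # $Rsquared reported for Model 6
df_resid <- 10208      # residual degrees of freedom
p        <- 29         # numerator degrees of freedom of the F-statistic
n        <- df_resid + p + 1  # number of observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F-statistic: explained variance per model df over unexplained variance per residual df
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

round(adj_r2, 4)  # agrees with the reported 0.5411
round(f_stat, 1)  # agrees with the reported 417.2
```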

ZERO INFLATION POISSON MODEL 7

A zero-inflated model handles count data that contain more zeros than a Poisson distribution predicts by combining a separate process that generates excess zeros with an ordinary Poisson count process. This approach is promising here because, during data exploration, we observed a spike of zeros followed by a roughly bell-shaped distribution of the positive counts.
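
The mixture can be written down directly. The sketch below defines a zero-inflated Poisson probability mass function in base R (the helper dzip and the parameter values are illustrative, not estimates from the wine data) and compares it with a plain Poisson:

```r
# Zero-inflated Poisson probability mass function (illustrative helper)
#   pi_zero: probability of a structural ("excess") zero
#   lambda : mean of the Poisson count component
dzip <- function(k, lambda, pi_zero) {
  ifelse(k == 0,
         pi_zero + (1 - pi_zero) * dpois(0, lambda),
         (1 - pi_zero) * dpois(k, lambda))
}

k <- 0:8
round(rbind(poisson = dpois(k, lambda = 3.5),
            zip     = dzip(k, lambda = 3.5, pi_zero = 0.2)), 3)
```

The zero-inflated version puts extra mass at zero and scales down every positive count by the same factor, which matches the shape described above.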

The model is a zero-inflated Poisson model used to predict the variable specified in the formula (TARGET) based on the predictors FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, and STARS.

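STARS is the only predictor in the zero-inflation component (the part of the formula after the | symbol). I chose it because it had the most missing values, and a missing rating may itself be predictive of a wine receiving no orders.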

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = trainingData)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30463 -0.52758  0.02379  0.40652  2.70756 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)         1.2542657  0.0067515 185.776 < 0.0000000000000002 ***
## FixedAcidity        0.0001639  0.0059267   0.028               0.9779    
## VolatileAcidity    -0.0135867  0.0059034  -2.302               0.0214 *  
## CitricAcid          0.0011296  0.0057567   0.196               0.8444    
## ResidualSugar      -0.0007654  0.0059981  -0.128               0.8985    
## Chlorides          -0.0069000  0.0059648  -1.157               0.2474    
## FreeSulfurDioxide   0.0044605  0.0058837   0.758               0.4484    
## TotalSulfurDioxide  0.0003172  0.0059794   0.053               0.9577    
## Density            -0.0073991  0.0058865  -1.257               0.2088    
## pH                  0.0037535  0.0059566   0.630               0.5286    
## Sulphates          -0.0010330  0.0060752  -0.170               0.8650    
## Alcohol             0.0247431  0.0060303   4.103          0.000040756 ***
## LabelAppeal         0.1991156  0.0062675  31.770 < 0.0000000000000002 ***
## AcidIndex          -0.0322218  0.0064214  -5.018          0.000000522 ***
## STARS               0.1194349  0.0069054  17.296 < 0.0000000000000002 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -2.93328    0.07725  -37.97 <0.0000000000000002 ***
## STARS       -2.61255    0.07064  -36.98 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 23 
## Log-likelihood: -1.668e+04 on 17 Df

The count model shows statistically significant coefficients (p < 0.05) for VolatileAcidity, Alcohol, LabelAppeal, AcidIndex, and STARS. The signs indicate the direction of each effect on the expected count of TARGET: negative for VolatileAcidity and AcidIndex, positive for Alcohol, LabelAppeal, and STARS.

The zero-inflation model shows that STARS has a statistically significant coefficient (p < 0.05), so it significantly affects the odds of an excess zero. The negative sign means that a higher STARS value is associated with a lower likelihood of an excess zero.

The BFGS optimization took 23 iterations to converge.

The log-likelihood (-16,680 on 17 degrees of freedom) indicates how well the model fits the data. A higher log-likelihood (closer to zero) indicates a better fit, although it can only be compared across models fitted to the same data.

Overall, VolatileAcidity, Alcohol, LabelAppeal, AcidIndex, and STARS are the important predictors in the count component, while STARS is the only predictor in the zero-inflation component.
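
To make the zero-inflation coefficients concrete, the logit-scale estimates reported above can be converted to probabilities with the inverse logit. This sketch assumes STARS enters the model on the standardized scale seen in the factor labels of the earlier models. The value used for the lowest rating (about -1.269) is my extrapolation from the equal spacing of those labels, not a figure from the output.

```r
b0 <- -2.93328  # zero-inflation intercept (Model 7 output)
b1 <- -2.61255  # zero-inflation coefficient for STARS (Model 7 output)

# Standardized STARS values; the first is an assumed value for the lowest rating
stars_std <- c(-1.269, -0.42623524866846, 0.416552574962037,
               1.25934039859254, 2.10212822222303)

p_excess_zero <- plogis(b0 + b1 * stars_std)  # inverse logit
odds_ratio    <- exp(b1)  # multiplicative change in odds per one-unit increase

round(p_excess_zero, 4)
round(odds_ratio, 4)
```

The probability of an excess zero falls steadily as STARS rises, and an odds ratio below 1 expresses the same negative association on the odds scale.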

ZERO INFLATION POISSON REDUCED MODEL 8

Here I use VolatileAcidity, Alcohol, LabelAppeal, and AcidIndex in the count component and STARS in the zero-inflation component. I chose the count predictors because they were statistically significant (p < 0.05) in the full zero-inflated model.

The reduced zero-inflated Poisson model predicts TARGET from VolatileAcidity, Alcohol, LabelAppeal, and AcidIndex, with separate models for the count and zero-inflation components. STARS is the sole predictor of the zero-inflation component because it is believed to carry some of the information that explains the excess zeros.

## 
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Alcohol + LabelAppeal + 
##     AcidIndex | STARS, data = trainingData)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39753 -0.45601  0.06658  0.41411  2.30869 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)      1.296696   0.006100 212.562 < 0.0000000000000002 ***
## VolatileAcidity -0.015657   0.005877  -2.664              0.00772 ** 
## Alcohol          0.033820   0.005978   5.658         0.0000000153 ***
## LabelAppeal      0.239028   0.005790  41.283 < 0.0000000000000002 ***
## AcidIndex       -0.035684   0.006282  -5.680         0.0000000135 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -2.87781    0.07319  -39.32 <0.0000000000000002 ***
## STARS       -2.60304    0.06780  -38.39 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 11 
## Log-likelihood: -1.683e+04 on 7 Df

The count model suggests that, holding other predictors constant, increases in VolatileAcidity and AcidIndex are associated with a decrease in the expected count of TARGET, while increases in Alcohol and LabelAppeal are associated with an increase. All coefficients in the count model are statistically significant based on the p-values.

The zero-inflation model suggests that, holding other predictors constant, a one-unit increase in STARS is associated with a decrease in the odds of excess zeros. Both Intercept and STARS coefficients in the zero-inflation model are statistically significant.

The optimization algorithm took 11 iterations to converge.

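The log-likelihood (-16,830 on 7 degrees of freedom) indicates how well the model fits the data. A higher log-likelihood (closer to zero) indicates a better fit, so on this measure the reduced model fits somewhat worse than the full zero-inflated model (-16,680), as expected for a model with fewer parameters.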
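
The two zero-inflated fits can be compared more formally. The reduced count model is nested in the full one and both share the same zero-inflation component, so a likelihood-ratio test applies. The sketch below uses the log-likelihoods exactly as printed in the two summaries (four significant digits), so every result is approximate.

```r
# Log-likelihoods and parameter counts as printed in the two zeroinfl summaries
ll_full    <- -16680; k_full    <- 17
ll_reduced <- -16830; k_reduced <- 7

# AIC = 2k - 2 * log-likelihood (lower is better)
aic_full    <- 2 * k_full    - 2 * ll_full
aic_reduced <- 2 * k_reduced - 2 * ll_reduced

# Likelihood-ratio test of the reduced model against the full model
lr_stat <- 2 * (ll_full - ll_reduced)
p_value <- pchisq(lr_stat, df = k_full - k_reduced, lower.tail = FALSE)

c(aic_full = aic_full, aic_reduced = aic_reduced,
  lr_stat = lr_stat, p_value = p_value)
```

On these rounded figures the full model has the lower AIC and the likelihood-ratio test rejects the reduced model, so the simpler model does lose some fit.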

MODEL SELECTION

MODEL COMPARISON BY MSE/AIC

The table presents Mean Squared Error (MSE) and Akaike Information Criterion (AIC) values for the eight different models (labeled Model1 through Model8).

MSE is a useful metric for quantifying the accuracy of predictions, and it is widely employed in regression and machine learning applications. A lower MSE indicates better predictive performance. It means that, on average, the model’s predictions are closer to the actual values. A higher MSE suggests that the model’s predictions are, on average, further away from the actual values.

The AIC is a measure of the relative quality of statistical models for a given set of data. Lower AIC values indicate a better fit.

## Warning in matrix(c(mse1, mse2, mse3, mse4, mse5, mse6, mse7, mse8, aic1, :
## data length [12] is not a sub-multiple or multiple of the number of rows [8]
MSE AIC
Model1 6.701417 6.702885
Model2 6.701416 6.702572
Model3 1.694763 1.694853
Model4 1.730258 1.892633
Model5 36471.703172 36470.936609
Model6 36474.037980 36472.088457
Model7 6.701417 6.702885
Model8 6.701416 6.702572

Reading the table: The warning printed above the table matters. The comparison matrix received only 12 values (eight MSEs and four AICs), because the linear and zero-inflated model objects do not store an aic component, and it was filled row by row. The row and column labels therefore do not match the contents. Rows Model1 to Model4 hold the eight MSE values in pairs, rows Model5 and Model6 hold the AIC values of Models 1 to 4, and rows Model7 and Model8 repeat the first two rows. Matched to the correct models, the values are:

Model     MSE         AIC
Model1    6.701417    36471.703172
Model2    6.702885    36470.936609
Model3    6.701416    36474.037980
Model4    6.702572    36472.088457
Model5    1.694763    not recorded
Model6    1.694853    not recorded
Model7    1.730258    not recorded
Model8    1.892633    not recorded

Model 1 and Model 2 (Poisson): These models have nearly identical MSE values (6.7014 and 6.7029) and AIC values (36471.70 and 36470.94). The reduced Model 2 has the slightly lower AIC.

Model 3 and Model 4 (Negative Binomial): Their MSE values are almost the same as those of the Poisson models, and their AIC values are about 1 to 2 points higher, which suggests that the extra dispersion parameter adds little.

Model 5 and Model 6 (Multiple Linear Regression): These models have the lowest MSE values (1.6948 and 1.6949). The stepwise Model 6 matches the full Model 5 almost exactly.

Model 7 and Model 8 (Zero-Inflated Poisson): Model 7 (1.7303) is close to the linear models, while the reduced Model 8 (1.8926) is somewhat worse.

Scale of the MSE values: For the Poisson and Negative Binomial models, predict() returns the linear predictor (log scale) by default, so the MSE values of Models 1 to 4 compare log-scale predictions with raw counts. They are not directly comparable with those of Models 5 to 8, whose predictions are on the count scale.

Best Models: Model 5 has the lowest MSE, closely followed by Model 6. Among the models with a recorded AIC, Model 2 has the lowest value.

Poor Models: Among the models that predict on the count scale, Model 8 has the highest MSE.

Model Selection: AIC already balances goodness of fit against the number of parameters, so a lower AIC is preferred, while MSE measures predictive accuracy only. On MSE, Model 5 is the preferred model.
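The following sketch illustrates, on the built-in mtcars data rather than the wine models, why a comparison table can end up short of AIC values and one way to build it so that each model stays on its own row. The models m_pois and m_lm and the object names are hypothetical stand-ins.

```r
# Hypothetical models on a built-in data set (not the wine models)
m_pois <- glm(carb ~ wt + hp, data = mtcars, family = poisson)
m_lm   <- lm(carb ~ wt + hp, data = mtcars)

is.null(m_lm$aic)                 # TRUE: lm objects store no aic component
length(c(m_pois$aic, m_lm$aic))   # 1, because c() silently drops NULL

# AIC() works for both classes; building the table column by column
# keeps each model on its own row
models <- list(Poisson = m_pois, Linear = m_lm)
comparison <- data.frame(
  MSE = sapply(models, function(m) {
    mean((mtcars$carb - predict(m, type = "response"))^2)
  }),
  AIC = sapply(models, AIC)
)
comparison
```

Collecting the models in a named list and applying one function per column avoids having to keep the order of sixteen separate values in step with the row labels.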

MODEL COMPARISON BY LOSS

I now evaluate the models on the held-out test data. The table presents loss values for the eight models (labeled Model1 through Model8). Here, “loss” is the mean squared difference between the predicted and observed TARGET values, that is, the MSE on the test set.

I will use the squared loss to validate the models and to select one of them. (Lower numbers are better.)

##           Loss:
## Model1 6.739686
## Model2 6.745143
## Model3 6.739686
## Model4 6.740173
## Model5 1.686629
## Model6 1.686877
## Model7 1.701499
## Model8 1.851677

Model 1 and Model 3 (full Poisson and full Negative Binomial): These models have the same loss (6.739686), so the Negative Binomial fit adds nothing over the Poisson fit on this metric.

Model 2 and Model 4 (reduced Poisson and reduced Negative Binomial): Model 2 has the highest loss in the table (6.745143). Model 4 (6.740173) is close to Models 1 and 3.

Scale of Models 1 to 4: For these four models predict() returns values on the log (link) scale by default, so their losses compare log-scale predictions with raw counts and overstate their error relative to Models 5 to 8.

Model 5 and Model 6 (Multiple Linear Regression): These models have the lowest loss values (1.686629 and 1.686877). A lower loss indicates better performance, so Models 5 and 6 perform best.

Model 7 (Zero-Inflated Poisson): Model 7 has a loss (1.701499) slightly higher than Models 5 and 6 but lower than Model 8.

Model 8 (reduced Zero-Inflated Poisson): Model 8 has the highest loss (1.851677) among Models 5 to 8, although it is well below the losses of Models 1 to 4.

Best Models: Models 5 and 6 are the best-performing models, as they have the lowest loss values in the table.

Poor Models: Models 1 to 4 have the highest losses, at least partly because of the scale issue noted above. Among the models that predict on the count scale, Model 8 is the weakest.

Model Comparison: The models can be ranked by their loss values, with lower values indicating better performance.

Because I am not interested in gaining insight into the underlying causes of wine selection, I will use the squared loss. This will tell me how accurate our model is without caring about confidence intervals etc.

Based on this metric, Multiple Linear Regression Model 5 is the most accurate.
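When comparing squared loss across model types, it helps to check which scale predict() returns. The sketch below uses simulated data (not the wine data; the names x, y and fit are illustrative) to show that for a Poisson glm the default predictions are on the log scale, and that the squared loss differs substantially depending on the scale used.

```r
# Simulated count data (hypothetical, for illustration only)
set.seed(2)
x <- rnorm(300)
y <- rpois(300, lambda = exp(1 + 0.4 * x))
fit <- glm(y ~ x, family = poisson)

# Default predictions are on the log (link) scale
pred_link <- predict(fit)
# Predictions on the count scale
pred_resp <- predict(fit, type = "response")

# The two are related by the inverse link, exp()
all.equal(exp(pred_link), pred_resp)

# Squared loss computed against the observed counts on each scale
mse_link <- mean((y - pred_link)^2)
mse_resp <- mean((y - pred_resp)^2)
c(link = mse_link, response = mse_resp)
```

Only the response-scale value is comparable with the squared loss of a linear model, whose predictions are already on the count scale.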

PREDICTION

DATA WINE EVAL DATA SET

The wine eval dataset contains 16 columns, including the target variable TARGET, and 3,335 rows, covering a variety of different brands of wine. All of the variables are numeric, but some are highly discrete and take only a limited number of values. I will drop the first two columns: IN, the index, which is not needed, and TARGET, which is missing in every row. There are many missing values. The columns with missing values are ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS, which has the most (841). Preparing this data for prediction is described in the next section.

##   IN TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1  3     NA          5.4          -0.860       0.27         -10.7     0.092
## 2  9     NA         12.4           0.385      -0.76         -19.7     1.169
## 3 10     NA          7.2           1.750       0.17         -33.0     0.065
## 4 18     NA          6.2           0.100       1.80           1.0    -0.179
## 5 21     NA         11.4           0.210       0.28           1.2     0.038
## 6 30     NA         17.6           0.040      -1.15           1.4     0.535
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                23                398 0.98527 5.02      0.64   12.30
## 2               -37                 68 0.99048 3.37      1.09   16.00
## 3                 9                 76 1.04641 4.61      0.68    8.55
## 4               104                 89 0.98877 3.20      2.11   12.30
## 5                70                 53 1.02899 2.54     -0.07    4.80
## 6              -250                140 0.95028 3.06     -0.02   11.40
##   LabelAppeal AcidIndex STARS
## 1          -1         6    NA
## 2           0         6     2
## 3           0         8     1
## 4          -1         8     1
## 5           0        10    NA
## 6           1         8     4
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
##                    vars    n    mean      sd  median trimmed     mad     min
## IN                    1 3335 8048.31 4655.48 7906.00 8044.28 5960.05    3.00
## TARGET                2    0     NaN      NA      NA     NaN      NA     Inf
## FixedAcidity          3 3335    6.86    6.32    6.90    6.91    2.82  -18.20
## VolatileAcidity       4 3335    0.31    0.81    0.28    0.31    0.46   -2.83
## CitricAcid            5 3335    0.31    0.87    0.31    0.31    0.44   -3.12
## ResidualSugar         6 3167    5.32   34.37    3.60    5.46   16.90 -128.30
## Chlorides             7 3197    0.06    0.31    0.05    0.06    0.12   -1.15
## FreeSulfurDioxide     8 3183   34.95  149.63   30.00   34.26   57.82 -563.00
## TotalSulfurDioxide    9 3178  123.41  225.80  124.00  124.00  137.88 -769.00
## Density              10 3335    0.99    0.03    0.99    0.99    0.01    0.89
## pH                   11 3231    3.24    0.68    3.21    3.23    0.37    0.60
## Sulphates            12 3025    0.53    0.91    0.50    0.53    0.39   -3.07
## Alcohol              13 3150   10.58    3.76   10.40   10.58    2.52   -4.20
## LabelAppeal          14 3335    0.01    0.89    0.00    0.01    1.48   -2.00
## AcidIndex            15 3335    7.75    1.32    8.00    7.62    1.48    5.00
## STARS                16 2494    2.04    0.91    2.00    1.97    1.48    1.00
##                         max    range  skew kurtosis    se
## IN                 16130.00 16127.00  0.01    -1.20 80.62
## TARGET                 -Inf     -Inf    NA       NA    NA
## FixedAcidity          33.50    51.70 -0.12     2.04  0.11
## VolatileAcidity        3.61     6.44 -0.04     1.62  0.01
## CitricAcid             3.76     6.88 -0.03     1.66  0.02
## ResidualSugar        145.40   273.70 -0.06     1.97  0.61
## Chlorides              1.26     2.41 -0.04     1.74  0.01
## FreeSulfurDioxide    617.00  1180.00  0.07     1.88  2.65
## TotalSulfurDioxide  1004.00  1773.00 -0.05     1.50  4.01
## Density                1.10     0.21 -0.03     1.94  0.00
## pH                     6.21     5.61  0.12     1.69  0.01
## Sulphates              4.18     7.25  0.01     1.83  0.02
## Alcohol               25.60    29.80  0.05     1.54  0.07
## LabelAppeal            2.00     4.00  0.05    -0.26  0.02
## AcidIndex             17.00    12.00  1.51     4.28  0.02
## STARS                  4.00     3.00  0.44    -0.75  0.02
##        IN         TARGET         FixedAcidity     VolatileAcidity  
##  Min.   :    3   Mode:logical   Min.   :-18.200   Min.   :-2.8300  
##  1st Qu.: 4018   NA's:3335      1st Qu.:  5.200   1st Qu.: 0.0800  
##  Median : 7906                  Median :  6.900   Median : 0.2800  
##  Mean   : 8048                  Mean   :  6.864   Mean   : 0.3103  
##  3rd Qu.:12061                  3rd Qu.:  9.000   3rd Qu.: 0.6300  
##  Max.   :16130                  Max.   : 33.500   Max.   : 3.6100  
##                                                                    
##    CitricAcid      ResidualSugar        Chlorides        FreeSulfurDioxide
##  Min.   :-3.1200   Min.   :-128.300   Min.   :-1.15000   Min.   :-563.00  
##  1st Qu.: 0.0000   1st Qu.:  -2.600   1st Qu.: 0.01600   1st Qu.:   3.00  
##  Median : 0.3100   Median :   3.600   Median : 0.04700   Median :  30.00  
##  Mean   : 0.3124   Mean   :   5.319   Mean   : 0.06143   Mean   :  34.95  
##  3rd Qu.: 0.6050   3rd Qu.:  17.200   3rd Qu.: 0.17100   3rd Qu.:  79.25  
##  Max.   : 3.7600   Max.   : 145.400   Max.   : 1.26300   Max.   : 617.00  
##                    NA's   :168        NA's   :138        NA's   :152      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-769.00    Min.   :0.8898   Min.   :0.600   Min.   :-3.0700  
##  1st Qu.:  27.25    1st Qu.:0.9883   1st Qu.:2.980   1st Qu.: 0.3300  
##  Median : 124.00    Median :0.9946   Median :3.210   Median : 0.5000  
##  Mean   : 123.41    Mean   :0.9947   Mean   :3.237   Mean   : 0.5346  
##  3rd Qu.: 210.00    3rd Qu.:1.0005   3rd Qu.:3.490   3rd Qu.: 0.8200  
##  Max.   :1004.00    Max.   :1.0998   Max.   :6.210   Max.   : 4.1800  
##  NA's   :157                         NA's   :104     NA's   :310      
##     Alcohol       LabelAppeal         AcidIndex          STARS     
##  Min.   :-4.20   Min.   :-2.00000   Min.   : 5.000   Min.   :1.00  
##  1st Qu.: 9.00   1st Qu.:-1.00000   1st Qu.: 7.000   1st Qu.:1.00  
##  Median :10.40   Median : 0.00000   Median : 8.000   Median :2.00  
##  Mean   :10.58   Mean   : 0.01349   Mean   : 7.748   Mean   :2.04  
##  3rd Qu.:12.50   3rd Qu.: 1.00000   3rd Qu.: 8.000   3rd Qu.:3.00  
##  Max.   :25.60   Max.   : 2.00000   Max.   :17.000   Max.   :4.00  
##  NA's   :185                                         NA's   :841
## [1] 16
## [1] 3335
##       FixedAcidity    VolatileAcidity         CitricAcid      ResidualSugar 
##                  0                  0                  0                168 
##          Chlorides  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                138                152                157                  0 
##                 pH          Sulphates            Alcohol        LabelAppeal 
##                104                310                185                  0 
##          AcidIndex              STARS 
##                  0                841

PREPARE DATA WINE EVAL DATA SET

I used multiple imputation with random forests (the mice package with method 'rf', 5 imputations and 3 iterations) to fill in the missing values in the wine_eval dataset. The effectiveness of imputation depends on the nature of the data and on how appropriate the imputation method is for the problem. STARS, LabelAppeal, and AcidIndex also had to be treated as factors, as in the training models.

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

PREDICTION USING LINEAR REGRESSION MODEL

These are the first six predicted values of the response variable generated by the linear regression model (lm1) in the prediction step for the wine_eval dataset. Each value is the model’s estimate of the number of cases purchased, given the predictor values. The labels above the values are row names. They match rows of the training data, which is what predict() returns when the new data is not supplied through the newdata argument.

The accuracy and reliability of these predictions depend on the quality of the model and on how suitable linear regression is for a count response such as TARGET.

##         2         4         6         7         8         9 
## 3.9375180 2.4883264 0.2896969 1.4415012 5.2314899 1.6674334
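One detail worth checking when scoring a new dataset with a linear model is the name of the argument that carries the new data. The sketch below uses the built-in mtcars data and a hypothetical new_cars data frame (neither is part of this analysis) to show the difference between data and newdata in predict().

```r
fit <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))  # hypothetical new observations

# `data` is not an argument of predict() for lm objects; it is absorbed by `...`
# and the fitted values for the 32 training rows come back
length(predict(fit, data = new_cars))

# `newdata` scores the new rows
length(predict(fit, newdata = new_cars))
```

The names of the returned vector are a quick check: with newdata they are the row names of the new data frame, otherwise they are the row names of the training data.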

NUMBER OF CASES PURCHASED AS PREDICTED

HISTOGRAM OF PREDICTIONS

DENSITY PLOT OF PREDICTIONS

The density plot shows how the predicted values are distributed: where most of the predictions lie and how the density varies across the range. The wiggles in the plot may reflect underlying patterns or structure in the data. The rise in the middle of the plot shows a concentration of predicted values in that range, that is, many observations with similar predictions. The decrease in density towards the tails shows that fewer observations have extreme predicted values.

CUMULATIVE PLOT OF PREDICTIONS

This cumulative distribution plot shows, for each value on the x-axis, the proportion of predictions at or below that value. The curve is nearly flat at the low end, where there are few predictions, and rises steeply through the middle of the range, where most of the predictions are concentrated. It flattens again at the top once nearly all observations have been included. This slanted S-shape is common in cumulative distribution plots. A sudden change in slope would indicate a region where the density of predictions changes rapidly.
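To make the reading of a cumulative distribution plot concrete, here is a small base R sketch. The vector preds is simulated as a stand-in for model predictions and is not the output of lm1.

```r
# Simulated stand-in for a vector of model predictions
set.seed(3)
preds <- rnorm(500, mean = 3, sd = 1.3)

Fn <- ecdf(preds)       # step function: share of values at or below x
Fn(3)                   # share of predictions at or below 3
quantile(preds, 0.5)    # median, where the curve crosses 0.5

# The curve starts at 0 below the smallest value and reaches 1 at the largest
Fn(min(preds) - 1)
Fn(max(preds))
```

The steepest part of the curve sits where the density plot peaks, since both describe the same concentration of values.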

CONCLUSION

This study aimed to predict the number of cases of wine sold to restaurants using count regression models. The initial phase involved cleaning and pre-processing the dataset to get it ready for model training. Several techniques were applied: Poisson Regression, Negative Binomial Regression, Multiple Linear Regression, and Zero-Inflated Poisson Regression. Because I was not interested in gaining insight into the underlying causes of wine selection, I used the squared loss, which measures how accurate a model is without regard to confidence intervals. Lower numbers are better, so I chose the Multiple Linear Regression model, which had the lowest loss, to generate predictions for the wine_eval dataset. The results for the various wine types were highly similar, which may be attributable to the use of random forest imputation during pre-processing to fill in missing values.

knitr::opts_chunk$set(echo = FALSE) 
# load libraries (each package once, with startup messages and warnings suppressed)
suppressWarnings(suppressPackageStartupMessages({
  library(pscl)
  library(tinytex)
  library(devtools)
  library(vctrs)
  library(mice)
  library(tidyverse)
  library(dplyr)
  library(psych)
  library(corrplot)
  library(RColorBrewer)
  library(knitr)
  library(MASS)
  library(caret)
  library(kableExtra)
  library(ResourceSelection)
  library(pROC)
  library(ggplot2)
  library(gridExtra)
  library(htmltools)
  library(ggpubr)
}))
#load data
wine_train<- read.csv("https://raw.githubusercontent.com/enidroman/DATA-621-Business-Analytics-and-Data-Mining/main/wine-training-data.csv")
wine_eval <- read.csv("https://raw.githubusercontent.com/enidroman/DATA-621-Business-Analytics-and-Data-Mining/main/wine-evaluation-data.csv")
vn <- c("INDEX", "TARGET", " ", " ", "ACID INDEX", "ALCOHOL", "CHLORIDES", "CITRIC ACID", "DENSITY", "FIXED ACIDITY", "FREE SULFUR DIOXIDE", "LABEL APPEAL", "RESIDUAL SUGAR", "STARS", "SULPHATES", "TOTAL SULFUR DIOXIDE", "VOLATILE ACIDITY", "pH")
defin <- c("Identification Variable (do not use)", "Number of Cases Purchased", " ", " ", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicationg the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate content of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine.", "pH of wine")
theor_effect <- c("None", "None", " ", " ", " ", " ", " ", " ", " ", " ", " ", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.", " ", "A high number of stars suggests high sales", " ", " ", " ", " ")

kable(cbind(vn, defin, theor_effect), col.names = c("VARIABLE NAME", "DEFINITION", "THEORETICAL EFFECT")) %>% 
 kable_paper(full_width = T)

head(wine_train)
describe(wine_train)
ncol(wine_train)
nrow(wine_train)
# summary statistics
summary(wine_train)
wine_train <- subset(wine_train, select = -INDEX)
str(wine_train)
# count the total number of missing values 
sum(is.na(wine_train))
dis_wine_train <- wine_train %>% 
  gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(dis_wine_train) + 
  geom_histogram(aes(x=value, y = ..density..), bins=30) + 
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=4)
box_wine_train <- wine_train %>% 
  gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(box_wine_train, aes(variable, value)) + 
  geom_boxplot() + 
  facet_wrap(. ~variable, scales='free', ncol=6)
  

wine_train_character_wide <- wine_train %>% 
  dplyr::select(TARGET, STARS, LabelAppeal, AcidIndex) %>%
  pivot_longer(cols = -TARGET, names_to="variable", values_to="value") %>%
  arrange(variable, value)
wine_train_character_wide %>% 
  ggplot(mapping = aes(x = factor(value), y = TARGET)) +
    geom_boxplot() + 
    facet_wrap(.~variable, scales="free") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90))

featurePlot(wine_train[,2:ncol(wine_train)], wine_train[,1], pch = 20)
missing <- colSums(wine_train %>% sapply(is.na))
missing_pct <- round(missing / nrow(wine_train) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))
# separate our features from target so we don't inadvertently transform the target
training_x <- wine_train %>% dplyr::select(-TARGET)
training_y <- wine_train$TARGET
# separate our features from target so we don't inadvertently transform the target
eval_x <- wine_eval %>% dplyr::select(-TARGET)
eval_y <- wine_eval$TARGET
create_na_dummy <- function(vector) {
  as.integer(vector %>% is.na())
}
impute_missing <- function(data) {
  # Replace missing STARS with 0 
  data$STARS <- data$STARS %>%
    replace_na(0)
  return(data)
}
# Replace missing STARS with 0 in both the training and evaluation predictors
training_x <- impute_missing(training_x)
eval_x <- impute_missing(eval_x)
imputation <- preProcess(training_x, method = c("knnImpute", 'BoxCox'))
# summary(imputation)
training_x_imp <- predict(imputation, training_x)
eval_x_imp <- predict(imputation, eval_x)
clean_df <- cbind(training_y, training_x_imp) %>% 
  as.data.frame() %>%
  rename(TARGET = training_y)
clean_eval_df <- cbind(eval_y, eval_x_imp) %>% 
  as.data.frame() %>%
  rename(TARGET = eval_y)
  
stack(sort(cor(clean_df[,1], clean_df[,2:ncol(clean_df)])[,], decreasing=TRUE))
mcor<-round(cor(clean_df),2)
mcor

correlation = cor(clean_df, use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
         col=brewer.pal(n=8, name="RdYlBu"))
sum(is.na(clean_df))
clean_wine_train <- clean_df %>% 
  gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(clean_wine_train) + 
  geom_histogram(aes(x=value, y = ..density..), bins=30) + 
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=4)
options(scipen = 999)
# 80/20 training/test split (p = 0.8)
y_raw <- as.matrix(clean_df$TARGET)
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)
# Build training data sets
trainX <- clean_df[trainingRows,] %>% dplyr::select(-TARGET)
trainY <- clean_df[trainingRows,] %>% dplyr::select(TARGET)
# put remaining rows into the test sets
testX <- clean_df[-trainingRows,] %>% dplyr::select(-TARGET)
testY <- clean_df[-trainingRows,] %>% dplyr::select(TARGET)
# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$TARGET <- trainY$TARGET
print(paste('Number of Training Samples: ', dim(trainingData)[1]))
testingData <- as.data.frame(testX)
testingData$TARGET <- testY$TARGET
print(paste('Number of Testing Samples: ', dim(testingData)[1]))

model_test_perf <- function(model, trainX, trainY, testX, testY) {
  # Predict on the training predictors (testX and testY are accepted but not used here)
  predictedY <- predict(model, newdata = trainX)
  model_results <- data.frame(obs = trainY, pred = predictedY)
  colnames(model_results) <- c('obs', 'pred')
  
  # Calculate RMSE, Rsquared, and MAE; convert to a list so entries can be added with $
  model_eval <- as.list(defaultSummary(model_results))
  
  # Add AIC score to the results
  if ('aic' %in% names(model)) {
    model_eval$aic <- model$aic
  } else {
    model_eval$aic <- AIC(model)
  }
  
  # Add BIC score to the results
  model_eval$bic <- BIC(model)
  
  return(model_eval)
}

poiss1 = glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
                Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                pH + Sulphates + Alcohol + 
                as.factor(LabelAppeal) +
                as.factor(AcidIndex) +
                as.factor(STARS),
              data=trainingData, 
              family=poisson)
summary(poiss1)
# Evaluate Model 1 with testing data set
(poiss1_eval <- model_test_perf(poiss1, trainX, trainY, testX, testY))

  
#' Extract key performance results from a model
#'
#' @param model A linear model of interest (an lm object)
#' @examples
#' model_performance_extraction(my_model)
#' @return data.frame
#' @export
model_performance_extraction <- function(model = NULL) {
  # Make sure a model was passed
  if (is.null(model)) {
    return(NULL)
  }
  
  # sigma, adj.r.squared and fstatistic are components of summary(model),
  # not of the lm object itself
  model_summary <- summary(model)
  performance_metrics <- data.frame("RSE" = model_summary$sigma,
                                    "Adj R2" = model_summary$adj.r.squared,
                                    "F-Statistic" = model_summary$fstatistic[1])
  return(performance_metrics)
}
poiss2 <- glm(TARGET ~ VolatileAcidity + TotalSulfurDioxide + Alcohol + 
                as.factor(LabelAppeal) + 
                as.factor(AcidIndex) + 
                as.factor(STARS),
              data=trainingData, 
              family=poisson)
summary(poiss2)
# Evaluate Model 1 with testing data set
(poiss2_eval <- model_test_perf(poiss2, trainX, trainY, testX, testY))

negbi1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
                Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                pH + Sulphates + Alcohol + 
                as.factor(LabelAppeal) +
                as.factor(AcidIndex) +
                as.factor(STARS),
              data=trainingData)
summary(negbi1)

# Evaluate Model 1 with testing data set
(negbi1_eval <- model_test_perf(negbi1, trainX, trainY, testX, testY))

negbi2 <- glm.nb(TARGET~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + Alcohol +
                as.factor(LabelAppeal) +
                as.factor(AcidIndex) + 
                as.factor(STARS),
              data=trainingData)
summary (negbi2)
# Evaluate Model 1 with testing data set
(negbi2_eval <- model_test_perf(negbi2, trainX, trainY, testX, testY))

lm1 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
                Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                pH + Sulphates + Alcohol + 
                as.factor(LabelAppeal) +
                as.factor(AcidIndex) +
                as.factor(STARS),
              data=trainingData)
summary(lm1)
# Evaluate Model 1 with testing data set
(lm1_eval <- model_test_perf(lm1, trainX, trainY, testX, testY))

lm2 <- stepAIC(lm1, direction = "both",
               scope = list(upper = lm1, lower = ~ 1),
               scale = 0, trace = FALSE)
summary(lm2)
# Evaluate Model 1 with testing data set
(lm2_eval <- model_test_perf(lm2, trainX, trainY, testX, testY))


zip1 <- zeroinfl(TARGET~.|STARS, data = trainingData)
summary(zip1)


zip2 <- zeroinfl(TARGET ~ VolatileAcidity +  Alcohol + LabelAppeal + AcidIndex | STARS, data =trainingData)

summary(zip2)


# Use AIC() rather than $aic: lm and zeroinfl objects store no aic component,
# so model$aic is NULL for Models 5 to 8
aic1 <- AIC(poiss1)
aic2 <- AIC(poiss2)
aic3 <- AIC(negbi1)
aic4 <- AIC(negbi2)
aic5 <- AIC(lm1)
aic6 <- AIC(lm2)
aic7 <- AIC(zip1)
aic8 <- AIC(zip2)
# Training-set MSE. Note: predict() on the glm and glm.nb fits returns the
# linear predictor (log scale) unless type = "response" is given
mse1 <- mean((trainingData$TARGET - predict(poiss1))^2)
mse2 <- mean((trainingData$TARGET - predict(poiss2))^2)
mse3 <- mean((trainingData$TARGET - predict(negbi1))^2)
mse4 <- mean((trainingData$TARGET - predict(negbi2))^2)
mse5 <- mean((trainingData$TARGET - predict(lm1))^2)
mse6 <- mean((trainingData$TARGET - predict(lm2))^2)
mse7 <- mean((trainingData$TARGET - predict(zip1))^2)
mse8 <- mean((trainingData$TARGET - predict(zip2))^2)

# Fill column by column (byrow = FALSE) so that column 1 is MSE and column 2 is AIC
compare_aic_mse <- matrix(c(mse1, mse2, mse3, mse4, mse5, mse6, mse7, mse8, 
                            aic1, aic2, aic3, aic4, aic5, aic6, aic7, aic8),nrow=8,ncol=2,byrow=FALSE)

rownames(compare_aic_mse) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7","Model8")
colnames(compare_aic_mse) <- c("MSE","AIC")
compare_aic_mse <- as.data.frame(compare_aic_mse)

kable(compare_aic_mse)  %>% 
  kable_styling(full_width = T)
modelValidation <- function(mod){
  preds = predict(mod, testingData)
  diffMat = as.numeric(preds) - as.numeric(testingData$TARGET)
  diffMat = diffMat^2
  loss <- mean(diffMat)
  return(loss)
}

compare_models <- matrix(c(modelValidation(poiss1),modelValidation(poiss2),modelValidation(negbi1),modelValidation(negbi2),modelValidation(lm1),modelValidation(lm2),
                           modelValidation(zip1),modelValidation(zip2)),
                         nrow=8,ncol=1,byrow=TRUE)

rownames(compare_models) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7","Model8")
colnames(compare_models) <- c("Loss:")
compare_models <- as.data.frame(compare_models)
compare_models
head(wine_eval)
describe(wine_eval)
summary(wine_eval)
ncol(wine_eval)
nrow(wine_eval)
# Drop the identifier column (IN) and the all-missing TARGET column
wine_test <- wine_eval[-c(1,2)]
colSums(is.na(wine_test))
set.seed(32)
# Random-forest multiple imputation; mice() returns a 'mids' object, not a data frame
wine_imp <- mice(wine_test, m=5, maxit = 3, method = 'rf')
# Extract a completed data frame (the first of the 5 imputations) before using its columns
wine_test <- complete(wine_imp)
# lm1 was fit on predictors transformed by the `imputation` preProcess object,
# so the same transformation is applied to the evaluation predictors
wine_test <- predict(imputation, wine_test)
# The model formula wraps STARS, LabelAppeal and AcidIndex in as.factor(),
# so no manual factor conversion is needed here
# predict() takes `newdata`; a `data` argument is ignored and fitted values
# for the training rows are returned instead
predictions <- predict(lm1, newdata = wine_test)
print(head(predictions))

# Convert predictions to a data frame
predictions_df <- data.frame(Predictions = predictions)

# Display the datatable
DT::datatable(predictions_df)
hist(predictions)
# Density plot
plot(density(predictions), main = "Density Plot of Predictions", col = "skyblue", lwd = 2)
# Cumulative distribution plot
plot(ecdf(predictions), main = "Cumulative Distribution Plot of Predictions", col = "skyblue", lwd = 2)