HW3: Wine

Author

Will Brewster

Published

April 27, 2026

Introduction

This report explores and analyzes a data set containing information on approximately 12,000 commercially available wines, with variables relating mostly to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine purchased by wine distribution companies after sampling a wine; these cases are in turn used by restaurants and wine stores. The higher the number of sample cases purchased, the more likely the wine is to be sold.

The objective of the analysis is to predict the number of cases of wine that will be sold, given certain properties of the wine, using the following variables:

Variable Name        Definition
TARGET               Number of Cases Purchased
Alcohol              Alcohol Content
AcidIndex            Proprietary method of testing total acidity of wine by using a weighted average
Chlorides            Chloride content of wine
CitricAcid           Citric Acid Content
Density              Density of Wine
FixedAcidity         Fixed Acidity of Wine
FreeSulfurDioxide    Sulfur Dioxide content of wine
LabelAppeal          Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
ResidualSugar        Residual Sugar of wine
STARS                Wine rating by a team of experts; 4 Stars = Excellent, 1 Star = Poor
TotalSulfurDioxide   Total Sulfur Dioxide of Wine
VolatileAcidity      Volatile Acid content of wine
pH                   pH of wine
library(stargazer)  # text-formatted summary tables
stargazer(train_data1, type = "text", title = "Descriptive statistics")

Descriptive statistics
=============================================================
Statistic            N     Mean   St. Dev.   Min       Max   
-------------------------------------------------------------
TARGET             10,236  3.026   1.927      0         8    
FixedAcidity       10,236  7.090   6.333   -18.000   34.400  
VolatileAcidity    10,236  0.327   0.784    -2.790    3.680  
CitricAcid         10,236  0.309   0.863    -3.240    3.860  
ResidualSugar      9,742   5.275   33.941  -127.800  141.150 
Chlorides          9,711   0.055   0.319    -1.170    1.351  
FreeSulfurDioxide  9,697  30.899  147.309  -546.000  622.000 
TotalSulfurDioxide 9,704  121.823 231.038  -823.000 1,057.000
Density            10,236  0.994   0.027    0.889     1.099  
pH                 9,915   3.212   0.675    0.480     6.050  
Sulphates          9,276   0.529   0.940    -3.130    4.240  
Alcohol            9,717  10.497   3.717    -4.700   26.100  
LabelAppeal        10,236 -0.011   0.895      -2        2    
AcidIndex          10,236  7.781   1.339      4        17    
STARS              7,548   2.045   0.905      1         4    
-------------------------------------------------------------
psych::describe(train_data1)
                   vars     n   mean     sd median trimmed    mad     min
TARGET                1 10236   3.03   1.93   3.00    3.05   1.48    0.00
FixedAcidity          2 10236   7.09   6.33   6.90    7.08   3.26  -18.00
VolatileAcidity       3 10236   0.33   0.78   0.28    0.33   0.43   -2.79
CitricAcid            4 10236   0.31   0.86   0.31    0.31   0.42   -3.24
ResidualSugar         5  9742   5.28  33.94   3.90    5.49  15.72 -127.80
Chlorides             6  9711   0.05   0.32   0.05    0.05   0.14   -1.17
FreeSulfurDioxide     7  9697  30.90 147.31  30.00   30.96  54.86 -546.00
TotalSulfurDioxide    8  9704 121.82 231.04 124.00  122.02 133.43 -823.00
Density               9 10236   0.99   0.03   0.99    0.99   0.01    0.89
pH                   10  9915   3.21   0.68   3.20    3.21   0.37    0.48
Sulphates            11  9276   0.53   0.94   0.50    0.53   0.44   -3.13
Alcohol              12  9717  10.50   3.72  10.40   10.51   2.52   -4.70
LabelAppeal          13 10236  -0.01   0.90   0.00   -0.01   1.48   -2.00
AcidIndex            14 10236   7.78   1.34   8.00    7.65   1.48    4.00
STARS                15  7548   2.05   0.90   2.00    1.97   1.48    1.00
                       max   range  skew kurtosis   se
TARGET                8.00    8.00 -0.32    -0.88 0.02
FixedAcidity         34.40   52.40 -0.01     1.73 0.06
VolatileAcidity       3.68    6.47  0.05     1.82 0.01
CitricAcid            3.86    7.10 -0.03     1.89 0.01
ResidualSugar       141.15  268.95 -0.07     1.90 0.34
Chlorides             1.35    2.52  0.02     1.76 0.00
FreeSulfurDioxide   622.00 1168.00  0.01     1.82 1.50
TotalSulfurDioxide 1057.00 1880.00  0.00     1.76 2.35
Density               1.10    0.21 -0.01     1.94 0.00
pH                    6.05    5.57  0.04     1.69 0.01
Sulphates             4.24    7.37  0.00     1.77 0.01
Alcohol              26.10   30.80 -0.04     1.55 0.04
LabelAppeal           2.00    4.00  0.01    -0.27 0.01
AcidIndex            17.00   13.00  1.66     5.25 0.01
STARS                 4.00    3.00  0.44    -0.70 0.01

I. Data Exploration

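We first note that most of the chemical variables take negative values, which is physically impossible for raw measurements such as acidity, sulfur dioxide, or alcohol content. We take this to indicate that the data have already been transformed, possibly log transformed, in order to show a normal distribution, although the size of some ranges (for example FreeSulfurDioxide, from -546 to 622) suggests that centering or some other rescaling may also have been applied. In addition, there is a large number of missing values, with the most NAs coming from the Stars variable (2,688). Given the positive correlation between Target and Stars, a missing rating is plausibly informative in itself, so we can impute missing values for Stars as zero, treating unrated wines as a level below one star.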
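The zero-imputation idea for Stars can be sketched as follows. This is a minimal illustration on a small made-up data frame: `toy` is hypothetical and stands in for `train_data1`, and the indicator column `STARS_missing` is an added assumption, not part of the original data, kept so that the fact that a rating was absent is not lost after imputation.

```r
# Toy stand-in for train_data1 (hypothetical values, for illustration only)
toy <- data.frame(
  TARGET = c(4, 0, 3, 5, 2),
  STARS  = c(3, NA, 2, 4, NA)
)

# Flag rows where the expert rating is absent (assumed helper column)
toy$STARS_missing <- as.integer(is.na(toy$STARS))

# Treat "no rating" as a level below the lowest observed rating (1 star)
toy$STARS[is.na(toy$STARS)] <- 0

print(toy)
```

Keeping the flag alongside the imputed value lets a later model distinguish "unrated" from a genuine numeric effect of the rating, if that distinction turns out to matter.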

With regard to skewness, the data appear to be relatively normally distributed and centered, with slightly negative (left) skewness for Target, FixedAcidity, CitricAcid, ResidualSugar, Density, and Alcohol. AcidIndex is the most skewed (1.66, to the right). Kurtosis is greater than 1 for all variables except Target, LabelAppeal, and Stars; since these are excess-kurtosis values, those distributions are leptokurtic, which is perhaps indicative of several outlier values in the data in its original form. The graph below demonstrates these distributions:

library(tidyr)    # gather(), %>%
library(ggplot2)

# Reshape to long format: one row per (variable, value) pair
gather_df <- train_data1 %>% 
  gather(key = 'variable', value = 'value')

# Histogram plots of each variable, with a density curve overlaid
ggplot(gather_df) + 
  geom_histogram(aes(x = value, y = after_stat(density)), bins = 30) + 
  geom_density(aes(x = value), color = 'red') +
  facet_wrap(. ~ variable, scales = 'free', ncol = 4)
Warning: Removed 6578 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 6578 rows containing non-finite outside the scale range
(`stat_density()`).
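To make the skewness and kurtosis figures concrete, the sketch below computes both by hand for a single numeric vector. It assumes the moment-based definitions with the sample standard deviation in the denominator, which is believed to correspond to the default (`type = 3`) in `psych::describe`; the helper names `sample_skew` and `excess_kurtosis` are hypothetical and not part of any package.

```r
# Moment-based skewness: third central moment divided by sd^3
sample_skew <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

# Excess kurtosis: fourth central moment divided by sd^4, minus 3
# (so a normal distribution scores roughly 0)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m4 <- mean((x - mean(x))^4)
  m4 / sd(x)^4 - 3
}

x <- c(1, 2, 3, 4, 5, NA)   # symmetric toy vector with one missing value
sample_skew(x)       # 0 for a symmetric sample
excess_kurtosis(x)   # negative: flatter than a normal distribution
```

A negative skewness means a longer left tail, and an excess kurtosis above zero means heavier tails than a normal distribution.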

We see that just over 26% of the values for Stars are missing, with the full breakdown of the percentage of missing values in the data set below:

# Count of NA values per column
missing <- colSums(is.na(train_data1))

missing_pct <- round(missing / nrow(train_data1) * 100, 2)

stack(sort(missing_pct, decreasing = TRUE))
   values                ind
1   26.26              STARS
2    9.38          Sulphates
3    5.27  FreeSulfurDioxide
4    5.20 TotalSulfurDioxide
5    5.13          Chlorides
6    5.07            Alcohol
7    4.83      ResidualSugar
8    3.14                 pH
9    0.00             TARGET
10   0.00       FixedAcidity
11   0.00    VolatileAcidity
12   0.00         CitricAcid
13   0.00            Density
14   0.00        LabelAppeal
15   0.00          AcidIndex

The full breakdown of the correlations is given below. Stars (0.555) and LabelAppeal (0.491) are the most correlated with Target, while AcidIndex (-0.156), VolatileAcidity (-0.088), Density (-0.041), Chlorides (-0.039), FixedAcidity (-0.018), Sulphates (-0.014), and pH (-0.011) all show negative correlation with the Target variable, though apart from AcidIndex these are very weak.

library(corrplot)

# Correlation matrix using only rows with no missing values
cor_train_data1 <- cor(train_data1, use = "complete.obs")

corrplot(cor_train_data1, method = 'square', type = 'lower',
         tl.col = 'darkblue', addgrid.col = 'black', order = 'original',
         addshade = 'all', tl.cex = 0.75, number.cex = 0.75, tl.srt = 45,
         mar = c(0, 0, 0, 0), diag = FALSE)

# Full breakdown of correlation coefficients
correlations <- round(cor_train_data1, digits = 3)
correlations
                   TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
TARGET              1.000       -0.018          -0.088      0.018         0.003
FixedAcidity       -0.018        1.000           0.013      0.018        -0.015
VolatileAcidity    -0.088        0.013           1.000     -0.034         0.006
CitricAcid          0.018        0.018          -0.034      1.000         0.002
ResidualSugar       0.003       -0.015           0.006      0.002         1.000
Chlorides          -0.039       -0.014           0.016     -0.039         0.004
FreeSulfurDioxide   0.029        0.022          -0.003      0.017         0.010
TotalSulfurDioxide  0.017       -0.027           0.009     -0.010         0.018
Density            -0.041        0.018           0.020     -0.021        -0.014
pH                 -0.011       -0.003           0.007     -0.001         0.019
Sulphates          -0.014        0.055           0.000     -0.015        -0.009
Alcohol             0.066       -0.018          -0.002      0.022        -0.028
LabelAppeal         0.491        0.011          -0.032      0.025        -0.003
AcidIndex          -0.156        0.164           0.020      0.063        -0.024
STARS               0.555       -0.004          -0.053      0.016         0.021
                   Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
TARGET                -0.039             0.029              0.017  -0.041
FixedAcidity          -0.014             0.022             -0.027   0.018
VolatileAcidity        0.016            -0.003              0.009   0.020
CitricAcid            -0.039             0.017             -0.010  -0.021
ResidualSugar          0.004             0.010              0.018  -0.014
Chlorides              1.000            -0.012             -0.002   0.017
FreeSulfurDioxide     -0.012             1.000              0.014  -0.010
TotalSulfurDioxide    -0.002             0.014              1.000   0.018
Density                0.017            -0.010              0.018   1.000
pH                    -0.020            -0.008             -0.001   0.000
Sulphates              0.007             0.026              0.005  -0.020
Alcohol               -0.023            -0.034             -0.020   0.001
LabelAppeal           -0.007             0.013             -0.004  -0.008
AcidIndex             -0.002            -0.014             -0.020   0.051
STARS                 -0.014            -0.010              0.021  -0.020
                       pH Sulphates Alcohol LabelAppeal AcidIndex  STARS
TARGET             -0.011    -0.014   0.066       0.491    -0.156  0.555
FixedAcidity       -0.003     0.055  -0.018       0.011     0.164 -0.004
VolatileAcidity     0.007     0.000  -0.002      -0.032     0.020 -0.053
CitricAcid         -0.001    -0.015   0.022       0.025     0.063  0.016
ResidualSugar       0.019    -0.009  -0.028      -0.003    -0.024  0.021
Chlorides          -0.020     0.007  -0.023      -0.007    -0.002 -0.014
FreeSulfurDioxide  -0.008     0.026  -0.034       0.013    -0.014 -0.010
TotalSulfurDioxide -0.001     0.005  -0.020      -0.004    -0.020  0.021
Density             0.000    -0.020   0.001      -0.008     0.051 -0.020
pH                  1.000     0.011  -0.018      -0.015    -0.063 -0.006
Sulphates           0.011     1.000   0.019       0.005     0.034 -0.022
Alcohol            -0.018     0.019   1.000      -0.005    -0.056  0.070
LabelAppeal        -0.015     0.005  -0.005       1.000     0.022  0.312
AcidIndex          -0.063     0.034  -0.056       0.022     1.000 -0.090
STARS              -0.006    -0.022   0.070       0.312    -0.090  1.000
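Reading the Target column of a wide correlation matrix is easier when it is sorted. The sketch below shows one way to do that on a small simulated data frame; `toy` and its values are hypothetical stand-ins for `train_data1`, not the wine data. It uses `use = "complete.obs"` as in the report, so any row with a missing value is dropped before the correlations are computed, and the last line shows how many rows actually entered the calculation.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not the wine data)
n <- 200
toy <- data.frame(STARS = sample(1:4, n, replace = TRUE),
                  AcidIndex = rnorm(n, mean = 8))
toy$TARGET <- 2 * toy$STARS - toy$AcidIndex + rnorm(n)
toy$STARS[sample(n, 20)] <- NA   # inject some missing ratings

# Correlation matrix on fully observed rows only
cor_mat <- cor(toy, use = "complete.obs")

# Correlation of each predictor with TARGET, sorted from high to low
predictors <- setdiff(rownames(cor_mat), "TARGET")
target_cor <- sort(cor_mat[predictors, "TARGET"], decreasing = TRUE)
round(target_cor, 3)

# Number of rows that actually entered the calculation
sum(complete.cases(toy))
```

With roughly a quarter of Stars missing in the real data, the complete-case correlations are based on noticeably fewer rows than the full training set, which is worth keeping in mind when comparing coefficients.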

II. Data Preparation

III. Build Models

IV. Select Models

Conclusion