Section 1 Data exploration

Let’s take a look at some visuals to get an idea on the data set

    ï..INDEX         TARGET       FixedAcidity     VolatileAcidity  
 Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
 1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
 Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
 Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
 3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
 Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
                                                                    
   CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
 Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
 1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
 Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
 Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
 3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
 Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
                   NA's   :616        NA's   :638       NA's   :647      
 TotalSulfurDioxide    Density             pH          Sulphates      
 Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
 1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
 Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
 Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
 3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
 Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
 NA's   :682                         NA's   :395     NA's   :1210     
    Alcohol       LabelAppeal          AcidIndex          STARS      
 Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
 1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
 Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
 Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
 3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
 Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
 NA's   :653                                          NA's   :3359   

Warning: Removed 616 rows containing non-finite values (stat_boxplot).

Warning: Removed 638 rows containing non-finite values (stat_boxplot).

Warning: Removed 647 rows containing non-finite values (stat_boxplot).

Warning: Removed 682 rows containing non-finite values (stat_boxplot).

Warning: Removed 395 rows containing non-finite values (stat_boxplot).

Warning: Removed 1210 rows containing non-finite values (stat_boxplot).

Warning: Removed 653 rows containing non-finite values (stat_boxplot).

Warning: Removed 3359 rows containing non-finite values (stat_boxplot).

Warning: Removed 616 rows containing non-finite values (stat_bin).

Warning: Removed 638 rows containing non-finite values (stat_bin).

Warning: Removed 647 rows containing non-finite values (stat_bin).

Warning: Removed 682 rows containing non-finite values (stat_bin).

Warning: Removed 395 rows containing non-finite values (stat_bin).

Warning: Removed 1210 rows containing non-finite values (stat_bin).

Warning: Removed 653 rows containing non-finite values (stat_bin).

Warning: Removed 3359 rows containing non-finite values (stat_bin).

Looks like there are a few columns with NAs.

Residual Sugar Cholrides FreeSulfurDixode TotalSulfurDioxide pH Sulphates Alcohol Stars

Lets replace those NAs with the median value.

    ï..INDEX         TARGET       FixedAcidity     VolatileAcidity  
 Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
 1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
 Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
 Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
 3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
 Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
   CitricAcid      ResidualSugar        Chlorides        FreeSulfurDioxide
 Min.   :-3.2400   Min.   :-127.800   Min.   :-1.17100   Min.   :-555.0   
 1st Qu.: 0.0300   1st Qu.:   0.900   1st Qu.: 0.00000   1st Qu.:   5.0   
 Median : 0.3100   Median :   3.900   Median : 0.04600   Median :  30.0   
 Mean   : 0.3084   Mean   :   5.346   Mean   : 0.05438   Mean   :  30.8   
 3rd Qu.: 0.5800   3rd Qu.:  14.900   3rd Qu.: 0.12800   3rd Qu.:  64.0   
 Max.   : 3.8600   Max.   : 141.150   Max.   : 1.35100   Max.   : 623.0   
 TotalSulfurDioxide    Density             pH          Sulphates      
 Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
 1st Qu.:  34.0     1st Qu.:0.9877   1st Qu.:2.970   1st Qu.: 0.3400  
 Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
 Mean   : 120.8     Mean   :0.9942   Mean   :3.207   Mean   : 0.5245  
 3rd Qu.: 198.0     3rd Qu.:1.0005   3rd Qu.:3.450   3rd Qu.: 0.7700  
 Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
    Alcohol       LabelAppeal          AcidIndex          STARS      
 Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
 1st Qu.: 9.10   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:2.000  
 Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
 Mean   :10.48   Mean   :-0.009066   Mean   : 7.773   Mean   :2.031  
 3rd Qu.:12.20   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:2.000  
 Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  

From the boxplots andd historgrams it looks like some of our variables have a high kurtosis let’s run the kurtosis numbers to confirm

kurtosis(wine_train$FixedAcidity)
[1] 4.67573
kurtosis(wine_train$VolatileAcidity)
[1] 4.832966
kurtosis(wine_train$CitricAcid)
[1] 4.838696
kurtosis(wine_train$ResidualSugar)
[1] 5.131142
kurtosis(wine_train$Chlorides)
[1] 5.040543
kurtosis(wine_train$FreeSulfurDioxide)
[1] 5.094918
kurtosis(wine_train$TotalSulfurDioxide)
[1] 4.938647
kurtosis(wine_train$Density)
[1] 4.900725
kurtosis(wine_train$pH)
[1] 4.795074
kurtosis(wine_train$Sulphates)
[1] 5.249137
kurtosis(wine_train$Alcohol)
[1] 4.78399
kurtosis(wine_train$LabelAppeal)
[1] 2.738136
kurtosis(wine_train$AcidIndex)
[1] 8.191373
kurtosis(wine_train$STARS)
[1] 3.156654

Section 2 Data Preparation

Let’s create a new variable that is the ratio of fixed vs volatile acidity and total sulfur vs free sulfur.

Let’s also create a binary sale or no sale of our target variable. If target varialbe is 1> then it is a sales or if 0 then no sale denoted by binary variable 0= no sale 1= sale.

Residual Sugar has a high kurtosis so let’s create a few buckets:

1= less than -100 2= between -100 and -50 3= between -50 and 0 4= between 0 and 50 5= between 50 and 100 6= greater than 100

Let’s create a few combination variables.

Stars * Acidindex

Alcohol + Sulphates

Label Appeal + Stars

Section 3 Build Models

Lets create 3 different models. One regression and two logistic regression models.

Let’s start with the regression model.


Call:
lm(formula = TARGET_1 ~ acid_ratio + sulfur_ratio + CitricAcid + 
    pH + ResidualSugar.buckets + Stars_Acidindex + Alcohol_Suphates + 
    Labelappeal_Starts, data = wine_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8919  0.1356  0.1974  0.2339  0.4156 

Coefficients:
                           Estimate   Std. Error t value             Pr(>|t|)
(Intercept)             0.808003005  0.046946382  17.211 < 0.0000000000000002
acid_ratio              0.000005536  0.000047933   0.115             0.908059
sulfur_ratio            0.000178400  0.000226863   0.786             0.431661
CitricAcid              0.002538428  0.004187221   0.606             0.544371
pH                     -0.018318906  0.005396369  -3.395             0.000689
ResidualSugar.buckets2 -0.025519528  0.044481681  -0.574             0.566175
ResidualSugar.buckets3 -0.000115974  0.042263265  -0.003             0.997811
ResidualSugar.buckets4 -0.003154046  0.041673578  -0.076             0.939671
ResidualSugar.buckets5  0.004890718  0.043237528   0.113             0.909943
ResidualSugar.buckets6  0.000722736  0.056050921   0.013             0.989712
Stars_Acidindex        -0.001509554  0.000803263  -1.879             0.060229
Alcohol_Suphates       -0.000594445  0.000964802  -0.616             0.537819
Labelappeal_Starts      0.034162806  0.003803747   8.981 < 0.0000000000000002
                          
(Intercept)            ***
acid_ratio                
sulfur_ratio              
CitricAcid                
pH                     ***
ResidualSugar.buckets2    
ResidualSugar.buckets3    
ResidualSugar.buckets4    
ResidualSugar.buckets5    
ResidualSugar.buckets6    
Stars_Acidindex        .  
Alcohol_Suphates          
Labelappeal_Starts     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.408 on 12782 degrees of freedom
Multiple R-squared:  0.01042,   Adjusted R-squared:  0.009488 
F-statistic: 11.21 on 12 and 12782 DF,  p-value: < 0.00000000000000022

Call:
glm(formula = TARGET_1 ~ acid_ratio + sulfur_ratio + CitricAcid + 
    pH + ResidualSugar.buckets + LabelAppeal + AcidIndex + Stars_Acidindex + 
    Alcohol_Suphates + Labelappeal_Starts, family = "binomial", 
    data = wine_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4864   0.3532   0.5439   0.6995   2.3343  

Coefficients:
                         Estimate Std. Error z value             Pr(>|z|)    
(Intercept)             4.6460611  0.5137727   9.043 < 0.0000000000000002 ***
acid_ratio              0.0002207  0.0002932   0.753              0.45162    
sulfur_ratio            0.0013716  0.0013293   1.032              0.30216    
CitricAcid              0.0729345  0.0265964   2.742              0.00610 ** 
pH                     -0.1720916  0.0342243  -5.028          0.000000495 ***
ResidualSugar.buckets2 -0.0820614  0.2826145  -0.290              0.77154    
ResidualSugar.buckets3 -0.0074060  0.2700126  -0.027              0.97812    
ResidualSugar.buckets4  0.0030714  0.2663137   0.012              0.99080    
ResidualSugar.buckets5  0.0072229  0.2762871   0.026              0.97914    
ResidualSugar.buckets6 -0.0789080  0.3558426  -0.222              0.82451    
LabelAppeal            -0.5974383  0.2140376  -2.791              0.00525 ** 
AcidIndex              -0.4907750  0.0525577  -9.338 < 0.0000000000000002 ***
Stars_Acidindex         0.0208169  0.0263108   0.791              0.42883    
Alcohol_Suphates       -0.0115924  0.0060958  -1.902              0.05721 .  
Labelappeal_Starts      0.4661038  0.2108762   2.210              0.02708 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13276  on 12794  degrees of freedom
Residual deviance: 12029  on 12780  degrees of freedom
AIC: 12059

Number of Fisher Scoring iterations: 4

Call:
glm(formula = TARGET_1 ~ Stars_Acidindex + pH + CitricAcid + 
    pH + STARS + AcidIndex + LabelAppeal, family = "binomial", 
    data = wine_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5086   0.3532   0.5437   0.7006   2.3656  

Coefficients:
                Estimate Std. Error z value             Pr(>|z|)    
(Intercept)      4.51630    0.43885  10.291 < 0.0000000000000002 ***
Stars_Acidindex  0.02105    0.02631   0.800              0.42371    
pH              -0.17107    0.03419  -5.003          0.000000563 ***
CitricAcid       0.07255    0.02658   2.729              0.00635 ** 
STARS            0.46242    0.21093   2.192              0.02836 *  
AcidIndex       -0.49022    0.05257  -9.324 < 0.0000000000000002 ***
LabelAppeal     -0.13039    0.02651  -4.918          0.000000873 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13276  on 12794  degrees of freedom
Residual deviance: 12035  on 12788  degrees of freedom
AIC: 12049

Number of Fisher Scoring iterations: 4

Call:
glm(formula = TARGET_1 ~ STARS, family = "binomial", data = wine_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7749   0.3915   0.6810   0.6810   0.8797  

Coefficients:
            Estimate Std. Error z value            Pr(>|z|)    
(Intercept)  0.15644    0.06161   2.539              0.0111 *  
STARS        0.59343    0.03132  18.950 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13276  on 12794  degrees of freedom
Residual deviance: 12882  on 12793  degrees of freedom
AIC: 12886

Number of Fisher Scoring iterations: 4

The variables that are statistically significant are Citric Acid, pH, LabelAppeal, AcidIndex, and stars.

Section 4 Select model

We are going to use AIC as the criteria. My glm model 2 has the lowest AIC score of 12049. This moel is also relatively simple becuae it is using the variables that were considerable significant. And the Stars and Acid_index relationship seem to have a significant affect on the model’s performance.

Section 5 Formula

[1] "TARGET_1 = 4.51629922 + 0.02105146 * Stars_Acidindex - 0.17107175 * pH + 0.07254590 * CitricAcid + 0.46242288 * STARS - 0.49021771 * AcidIndex - 0.13039031 * LabelAppeal"