Section 1 Data exploration
Let’s take a look at some visuals to get an idea on the data set
ï..INDEX TARGET FixedAcidity VolatileAcidity
Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
NA's :616 NA's :638 NA's :647
TotalSulfurDioxide Density pH Sulphates
Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
NA's :682 NA's :395 NA's :1210
Alcohol LabelAppeal AcidIndex STARS
Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
NA's :653 NA's :3359
Warning: Removed 616 rows containing non-finite values (stat_boxplot).
Warning: Removed 638 rows containing non-finite values (stat_boxplot).
Warning: Removed 647 rows containing non-finite values (stat_boxplot).
Warning: Removed 682 rows containing non-finite values (stat_boxplot).
Warning: Removed 395 rows containing non-finite values (stat_boxplot).
Warning: Removed 1210 rows containing non-finite values (stat_boxplot).
Warning: Removed 653 rows containing non-finite values (stat_boxplot).
Warning: Removed 3359 rows containing non-finite values (stat_boxplot).
Warning: Removed 616 rows containing non-finite values (stat_bin).
Warning: Removed 638 rows containing non-finite values (stat_bin).
Warning: Removed 647 rows containing non-finite values (stat_bin).
Warning: Removed 682 rows containing non-finite values (stat_bin).
Warning: Removed 395 rows containing non-finite values (stat_bin).
Warning: Removed 1210 rows containing non-finite values (stat_bin).
Warning: Removed 653 rows containing non-finite values (stat_bin).
Warning: Removed 3359 rows containing non-finite values (stat_bin).
Looks like there are a few columns with NAs.
Residual Sugar Cholrides FreeSulfurDixode TotalSulfurDioxide pH Sulphates Alcohol Stars
Lets replace those NAs with the median value.
ï..INDEX TARGET FixedAcidity VolatileAcidity
Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
Min. :-3.2400 Min. :-127.800 Min. :-1.17100 Min. :-555.0
1st Qu.: 0.0300 1st Qu.: 0.900 1st Qu.: 0.00000 1st Qu.: 5.0
Median : 0.3100 Median : 3.900 Median : 0.04600 Median : 30.0
Mean : 0.3084 Mean : 5.346 Mean : 0.05438 Mean : 30.8
3rd Qu.: 0.5800 3rd Qu.: 14.900 3rd Qu.: 0.12800 3rd Qu.: 64.0
Max. : 3.8600 Max. : 141.150 Max. : 1.35100 Max. : 623.0
TotalSulfurDioxide Density pH Sulphates
Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
1st Qu.: 34.0 1st Qu.:0.9877 1st Qu.:2.970 1st Qu.: 0.3400
Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
Mean : 120.8 Mean :0.9942 Mean :3.207 Mean : 0.5245
3rd Qu.: 198.0 3rd Qu.:1.0005 3rd Qu.:3.450 3rd Qu.: 0.7700
Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
Alcohol LabelAppeal AcidIndex STARS
Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
1st Qu.: 9.10 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:2.000
Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
Mean :10.48 Mean :-0.009066 Mean : 7.773 Mean :2.031
3rd Qu.:12.20 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:2.000
Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
From the boxplots andd historgrams it looks like some of our variables have a high kurtosis let’s run the kurtosis numbers to confirm
kurtosis(wine_train$FixedAcidity)
[1] 4.67573
kurtosis(wine_train$VolatileAcidity)
[1] 4.832966
kurtosis(wine_train$CitricAcid)
[1] 4.838696
kurtosis(wine_train$ResidualSugar)
[1] 5.131142
kurtosis(wine_train$Chlorides)
[1] 5.040543
kurtosis(wine_train$FreeSulfurDioxide)
[1] 5.094918
kurtosis(wine_train$TotalSulfurDioxide)
[1] 4.938647
kurtosis(wine_train$Density)
[1] 4.900725
kurtosis(wine_train$pH)
[1] 4.795074
kurtosis(wine_train$Sulphates)
[1] 5.249137
kurtosis(wine_train$Alcohol)
[1] 4.78399
kurtosis(wine_train$LabelAppeal)
[1] 2.738136
kurtosis(wine_train$AcidIndex)
[1] 8.191373
kurtosis(wine_train$STARS)
[1] 3.156654
Section 2 Data Preparation
Let’s create a new variable that is the ratio of fixed vs volatile acidity and total sulfur vs free sulfur.
Let’s also create a binary sale or no sale of our target variable. If target varialbe is 1> then it is a sales or if 0 then no sale denoted by binary variable 0= no sale 1= sale.
Residual Sugar has a high kurtosis so let’s create a few buckets:
1= less than -100 2= between -100 and -50 3= between -50 and 0 4= between 0 and 50 5= between 50 and 100 6= greater than 100
Let’s create a few combination variables.
Stars * Acidindex
Alcohol + Sulphates
Label Appeal + Stars
Section 3 Build Models
Lets create 3 different models. One regression and two logistic regression models.
Let’s start with the regression model.
Call:
lm(formula = TARGET_1 ~ acid_ratio + sulfur_ratio + CitricAcid +
pH + ResidualSugar.buckets + Stars_Acidindex + Alcohol_Suphates +
Labelappeal_Starts, data = wine_train)
Residuals:
Min 1Q Median 3Q Max
-0.8919 0.1356 0.1974 0.2339 0.4156
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.808003005 0.046946382 17.211 < 0.0000000000000002
acid_ratio 0.000005536 0.000047933 0.115 0.908059
sulfur_ratio 0.000178400 0.000226863 0.786 0.431661
CitricAcid 0.002538428 0.004187221 0.606 0.544371
pH -0.018318906 0.005396369 -3.395 0.000689
ResidualSugar.buckets2 -0.025519528 0.044481681 -0.574 0.566175
ResidualSugar.buckets3 -0.000115974 0.042263265 -0.003 0.997811
ResidualSugar.buckets4 -0.003154046 0.041673578 -0.076 0.939671
ResidualSugar.buckets5 0.004890718 0.043237528 0.113 0.909943
ResidualSugar.buckets6 0.000722736 0.056050921 0.013 0.989712
Stars_Acidindex -0.001509554 0.000803263 -1.879 0.060229
Alcohol_Suphates -0.000594445 0.000964802 -0.616 0.537819
Labelappeal_Starts 0.034162806 0.003803747 8.981 < 0.0000000000000002
(Intercept) ***
acid_ratio
sulfur_ratio
CitricAcid
pH ***
ResidualSugar.buckets2
ResidualSugar.buckets3
ResidualSugar.buckets4
ResidualSugar.buckets5
ResidualSugar.buckets6
Stars_Acidindex .
Alcohol_Suphates
Labelappeal_Starts ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.408 on 12782 degrees of freedom
Multiple R-squared: 0.01042, Adjusted R-squared: 0.009488
F-statistic: 11.21 on 12 and 12782 DF, p-value: < 0.00000000000000022
Call:
glm(formula = TARGET_1 ~ acid_ratio + sulfur_ratio + CitricAcid +
pH + ResidualSugar.buckets + LabelAppeal + AcidIndex + Stars_Acidindex +
Alcohol_Suphates + Labelappeal_Starts, family = "binomial",
data = wine_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4864 0.3532 0.5439 0.6995 2.3343
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.6460611 0.5137727 9.043 < 0.0000000000000002 ***
acid_ratio 0.0002207 0.0002932 0.753 0.45162
sulfur_ratio 0.0013716 0.0013293 1.032 0.30216
CitricAcid 0.0729345 0.0265964 2.742 0.00610 **
pH -0.1720916 0.0342243 -5.028 0.000000495 ***
ResidualSugar.buckets2 -0.0820614 0.2826145 -0.290 0.77154
ResidualSugar.buckets3 -0.0074060 0.2700126 -0.027 0.97812
ResidualSugar.buckets4 0.0030714 0.2663137 0.012 0.99080
ResidualSugar.buckets5 0.0072229 0.2762871 0.026 0.97914
ResidualSugar.buckets6 -0.0789080 0.3558426 -0.222 0.82451
LabelAppeal -0.5974383 0.2140376 -2.791 0.00525 **
AcidIndex -0.4907750 0.0525577 -9.338 < 0.0000000000000002 ***
Stars_Acidindex 0.0208169 0.0263108 0.791 0.42883
Alcohol_Suphates -0.0115924 0.0060958 -1.902 0.05721 .
Labelappeal_Starts 0.4661038 0.2108762 2.210 0.02708 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13276 on 12794 degrees of freedom
Residual deviance: 12029 on 12780 degrees of freedom
AIC: 12059
Number of Fisher Scoring iterations: 4
Call:
glm(formula = TARGET_1 ~ Stars_Acidindex + pH + CitricAcid +
pH + STARS + AcidIndex + LabelAppeal, family = "binomial",
data = wine_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5086 0.3532 0.5437 0.7006 2.3656
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.51630 0.43885 10.291 < 0.0000000000000002 ***
Stars_Acidindex 0.02105 0.02631 0.800 0.42371
pH -0.17107 0.03419 -5.003 0.000000563 ***
CitricAcid 0.07255 0.02658 2.729 0.00635 **
STARS 0.46242 0.21093 2.192 0.02836 *
AcidIndex -0.49022 0.05257 -9.324 < 0.0000000000000002 ***
LabelAppeal -0.13039 0.02651 -4.918 0.000000873 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13276 on 12794 degrees of freedom
Residual deviance: 12035 on 12788 degrees of freedom
AIC: 12049
Number of Fisher Scoring iterations: 4
Call:
glm(formula = TARGET_1 ~ STARS, family = "binomial", data = wine_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7749 0.3915 0.6810 0.6810 0.8797
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.15644 0.06161 2.539 0.0111 *
STARS 0.59343 0.03132 18.950 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13276 on 12794 degrees of freedom
Residual deviance: 12882 on 12793 degrees of freedom
AIC: 12886
Number of Fisher Scoring iterations: 4
The variables that are statistically significant are Citric Acid, pH, LabelAppeal, AcidIndex, and stars.
Section 4 Select model
We are going to use AIC as the criteria. My glm model 2 has the lowest AIC score of 12049. This moel is also relatively simple becuae it is using the variables that were considerable significant. And the Stars and Acid_index relationship seem to have a significant affect on the model’s performance.
Section 5 Formula
[1] "TARGET_1 = 4.51629922 + 0.02105146 * Stars_Acidindex - 0.17107175 * pH + 0.07254590 * CitricAcid + 0.46242288 * STARS - 0.49021771 * AcidIndex - 0.13039031 * LabelAppeal"