A summary of the data shows many missing values in the ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS fields, ranging from 395 for pH to 3,359 for STARS. The STARS and LabelAppeal columns are both ordinal variables and may need to be transformed into dummy variables.
##      TARGET       FixedAcidity      VolatileAcidity     CitricAcid
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600
##
##  ResidualSugar        Chlorides        FreeSulfurDioxide  TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00    Min.   :-823.0
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00    1st Qu.:  27.0
##  Median :   3.900   Median : 0.0460   Median :  30.00    Median : 123.0
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85    Mean   : 120.7
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00    3rd Qu.: 208.0
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00    Max.   :1057.0
##  NA's   :616        NA's   :638       NA's   :647        NA's   :682
##     Density             pH           Sulphates          Alcohol
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50
##                   NA's   :395     NA's   :1210      NA's   :653
##   LabelAppeal           AcidIndex          STARS
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000
##  Median : 0.000000   Median : 8.000   Median :2.000
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000
##                                       NA's   :3359
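The missing-value counts in the summary can also be tabulated directly, which makes it easier to compare columns. The snippet below is a minimal sketch in base R; the `wine` data frame and its values are made up for illustration and stand in for the real dataset.

```r
# Toy data frame standing in for the wine data (values are invented).
wine <- data.frame(
  TARGET  = c(3, 4, 0, 2, 5),
  pH      = c(3.2, NA, 3.1, 2.9, NA),
  Alcohol = c(10.4, 9.0, NA, 12.4, 11.0),
  STARS   = c(2, NA, NA, 1, NA)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(wine))

# Share of rows that are missing in each column, as a percentage.
na_pct <- round(100 * na_counts / nrow(wine), 1)

print(data.frame(missing = na_counts, percent = na_pct))
```

Sorting this table by `missing` gives a quick ranking of which columns need the most attention during imputation.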
The following histograms help visualize the distributions of the numerical variables in this dataset. Many of the predictor variables have a narrow spread, with most observations concentrated at the center of the distribution. Normalizing the data may help bring these distributions closer to normal.
This correlation plot shows no strong pairwise correlations among the predictors, which suggests that multicollinearity is not a concern in this dataset. STARS, AcidIndex, and LabelAppeal are each strongly correlated with TARGET. The remaining predictors have little to no correlation with TARGET.
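The numbers behind a correlation plot like this one can be inspected directly. The following is a rough sketch in base R on simulated data: the `wine` data frame, its values, and the choice of `use = "pairwise.complete.obs"` to cope with missing values are all assumptions, not the settings used to produce the plot above.

```r
# Simulated stand-in for the wine data: TARGET is built to depend on STARS,
# while pH is pure noise with some values removed.
set.seed(1)
n <- 200
stars <- sample(1:4, n, replace = TRUE)
wine <- data.frame(
  STARS  = stars,
  pH     = rnorm(n, mean = 3.2, sd = 0.3),
  TARGET = stars + rnorm(n, mean = 0, sd = 0.5)
)
wine$pH[sample(n, 20)] <- NA

# Pearson correlations, using every pair of rows where both values are present.
cor_mat <- cor(wine, use = "pairwise.complete.obs")

# Correlation of each predictor with TARGET, strongest first.
target_cor <- cor_mat[, "TARGET"]
target_cor <- target_cor[names(target_cor) != "TARGET"]
sorted <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(sorted, 2))

# Largest absolute correlation between any two predictors, as a rough
# screen for pairwise collinearity.
pred <- setdiff(colnames(cor_mat), "TARGET")
pred_cor <- cor_mat[pred, pred]
max_pair <- max(abs(pred_cor[upper.tri(pred_cor)]))
print(round(max_pair, 2))
```

A small `max_pair` only rules out strong pairwise relationships. Collinearity that involves three or more predictors would need a different check, such as variance inflation factors.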
The weak correlations between most of the predictors and TARGET were surprising. The following box plots provide a more in-depth look at the relationship between the predictors and the target variable. The plots confirm that the relationship between TARGET and most of the features appears limited.
The STARS and LabelAppeal columns contain ordinal data. Using ordinal variables as-is in a model requires the assumption that categories are equally spaced. Since stars and label appeal are both subjective labels, this assumption may not hold. To resolve this, these ordinal columns will be encoded into dummy variables.
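One way to carry out this encoding in base R is to convert each column to a factor and expand it with `model.matrix()`. This is a sketch under assumptions: the toy `wine` data frame is invented, and the level ranges (1 to 4 for STARS, -2 to 2 for LabelAppeal) are taken from the summary output above.

```r
# Toy data standing in for the two ordinal columns (values are invented).
wine <- data.frame(
  STARS       = c(1, 2, 4, 3, 2),
  LabelAppeal = c(-2, 0, 1, 2, -1)
)

# Treat each column as categorical, listing every level explicitly so that
# a level absent from a sample still gets a dummy column.
wine$STARS       <- factor(wine$STARS, levels = 1:4)
wine$LabelAppeal <- factor(wine$LabelAppeal, levels = -2:2)

# Expand to 0/1 indicator columns. The first level of each factor is the
# reference category, and the intercept column is dropped.
dummies <- model.matrix(~ STARS + LabelAppeal, data = wine)[, -1]

print(colnames(dummies))
print(dummies)
```

With the default `na.action`, `model.matrix()` drops rows that contain missing values, so missing STARS ratings would need to be imputed or given their own level before this step.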
The following plots provide a visualization of the missing data. There appears to be a pattern in the missing values, so it will be useful to include a flag for missing data. Missing values will be filled in using KNN imputation, which is unsupervised, meaning it does not require a target variable. A train-test split was performed earlier so that only predictor data is used for imputation.
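The flagging and imputation steps can be sketched in base R as follows. The source does not say which KNN implementation was used, so `knn_impute()` is a hypothetical helper written for illustration, and the toy `train` data, the choice of `k`, and the `_missing` suffix are all assumptions. TARGET is removed before imputation so that only predictor columns inform the neighbours.

```r
# Hypothetical helper: fill each missing cell with the mean of that column
# over the k nearest rows. Distance is the root mean squared difference on
# standardized columns that are observed in both rows.
knn_impute <- function(x, k = 3) {
  x <- as.matrix(x)
  z <- scale(x)  # standardize so no single column dominates the distance
  out <- x
  for (i in which(rowSums(is.na(x)) > 0)) {
    for (j in which(is.na(x[i, ]))) {
      donors <- which(!is.na(x[, j]))  # rows where column j is observed
      d <- sapply(donors, function(r) {
        shared <- !is.na(z[i, ]) & !is.na(z[r, ])
        if (!any(shared)) return(Inf)
        sqrt(mean((z[i, shared] - z[r, shared])^2))
      })
      nearest <- donors[order(d)[seq_len(min(k, length(donors)))]]
      out[i, j] <- mean(x[nearest, j])
    }
  }
  out
}

# Toy training data (values are invented).
train <- data.frame(
  TARGET  = c(3, 4, 0, 2, 5, 3),
  pH      = c(3.2, NA, 3.1, 2.9, 3.4, 3.3),
  Alcohol = c(10.4, 9.0, NA, 12.4, 11.0, 10.0),
  Density = c(0.994, 0.990, 0.996, 1.001, 0.992, 0.995)
)

# Keep predictors only, so TARGET plays no part in the imputation.
predictors <- train[, setdiff(names(train), "TARGET")]

# 0/1 flags recording where values were missing, created before imputing.
flags <- is.na(predictors) * 1L
colnames(flags) <- paste0(colnames(predictors), "_missing")

imputed <- knn_impute(predictors, k = 2)
train_ready <- cbind(TARGET = train$TARGET, imputed, flags)
print(train_ready)
```

This helper imputes within a single table. To keep test data out of the fit, test rows would need their neighbours drawn from the training rows only, which established packages handle through a separate fit-and-apply step.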