This report explores a dataset containing quality and attributes for approximately 4,900 white wines.

```
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
```

```
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
```
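The two listings above have the shape of `str()` and `summary()` output. As a minimal sketch, assuming the data comes from a CSV file whose first column is an unnamed row index (which would explain the `X` variable), the loading step could look like this. The temporary file and its three rows are made-up stand-ins, not the real wine file.

```r
# Toy stand-in for the wine file: three made-up rows, not real measurements.
toy <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1),
  alcohol       = c(8.8, 9.5, 10.1),
  quality       = c(6L, 6L, 6L)
)

# write.csv() stores row names in an unnamed first column by default.
path <- tempfile(fileext = ".csv")
write.csv(toy, path)

# read.csv() names that unnamed column "X", matching the structure shown above.
df <- read.csv(path)

str(df)      # variable types and first values
summary(df)  # min, quartiles, mean and max per column
```

Since `X` only counts rows, it carries no information about the wines and can be left out of later summaries.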

It looks like there are just 19 zeros in Citric Acid, and no NA values in the entire dataset.

`## [1] "Non-Zero values in columns"`

```
## [1] "4898 X"
## [1] "4898 fixed.acidity"
## [1] "4898 volatile.acidity"
## [1] "4879 citric.acid" "19 citric.acid"
## [1] "4898 residual.sugar"
## [1] "4898 chlorides"
## [1] "4898 free.sulfur.dioxide"
## [1] "4898 total.sulfur.dioxide"
## [1] "4898 density"
## [1] "4898 pH"
## [1] "4898 sulphates"
## [1] "4898 alcohol"
## [1] "4898 quality"
```

`## [1] "Non-NA values in columns"`

```
## [1] "4898 X"
## [1] "4898 fixed.acidity"
## [1] "4898 volatile.acidity"
## [1] "4898 citric.acid"
## [1] "4898 residual.sugar"
## [1] "4898 chlorides"
## [1] "4898 free.sulfur.dioxide"
## [1] "4898 total.sulfur.dioxide"
## [1] "4898 density"
## [1] "4898 pH"
## [1] "4898 sulphates"
## [1] "4898 alcohol"
## [1] "4898 quality"
```
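Counts like the ones above can be obtained in several ways. The following is one possible sketch in base R, using a small made-up data frame in place of the wine data.

```r
# Toy data with one zero and no missing values (made-up numbers).
df <- data.frame(
  citric.acid = c(0.36, 0.00, 0.40, 0.32),
  pH          = c(3.00, 3.30, 3.26, 3.19)
)

# Comparisons give logical matrices; colSums() counts TRUE values per column.
non_zero <- colSums(df != 0)
non_na   <- colSums(!is.na(df))

print(non_zero)
print(non_na)
```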

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
```

Fixed Acidity looks approximately normally distributed, with only minor skewness and just a few outliers. It may be helpful in further analysis due to its relation with pH and Volatile Acidity.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
```

Volatile Acidity also looks roughly normally distributed, although it is positively skewed and has more than a few outliers. It may be helpful in further analysis due to its relation with pH and Fixed Acidity.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
```

Interestingly, a relatively large number of wines have Citric Acid values of 0.49 and 0.74. Apart from that, the distributions look normal so far. There is an extreme outlier here with value 1.66. Since the acidity attributes (i.e. Fixed Acidity, Volatile Acidity and Citric Acid) are related to each other, I believe those three may be significant in further analysis.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
```

For Residual Sugar, the values pile up most densely below 3, and half of all values are below 5.2, even though there are relatively huge outliers such as 65.8. As the summary above also shows, Residual Sugar is highly skewed, so a log10 transformation may be helpful for further analysis.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
```

`## [1] "Lower bound of range: 0.98878"`

`## [1] "Upper bound of range: 1.0003"`

`## [1] "Standard Deviation: 0.00279160061693161"`

To show Density more clearly, the x-axes are limited to omit outliers. Although the standard deviation is very small, I think Density may be somewhat correlated with Quality. There are some peaks throughout the distribution. However, the mean and median are very close to each other, and the chart itself does not look skewed. Excluding outliers, the range and variance of Density are relatively small.
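How the bounds of the plotted range were chosen is not stated. One common approach, assumed here purely for illustration, is to trim at fixed quantiles. In the sketch below the 1% and 99% cutoffs and the simulated values are assumptions, not the report's actual settings or data.

```r
set.seed(1)
# Simulated density-like values plus one extreme point (illustrative only).
density <- c(rnorm(500, mean = 0.994, sd = 0.003), 1.039)

# Assumed cutoffs: keep the central 98% of the values.
bounds  <- quantile(density, probs = c(0.01, 0.99))
trimmed <- density[density >= bounds[1] & density <= bounds[2]]

cat("Lower bound of range:", bounds[[1]], "\n")
cat("Upper bound of range:", bounds[[2]], "\n")
cat("Standard deviation (trimmed):", sd(trimmed), "\n")
cat("Standard deviation (all):    ", sd(density), "\n")
```

Trimming removes the extreme point, so the standard deviation of the trimmed values is smaller than that of the full vector. It is worth stating which of the two a reported figure refers to.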

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
```

pH also looks normally distributed, with no extreme outliers and no noticeable skewness. Half of the pH values are between 3.09 and 3.28, with a mean of 3.1882666.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
```

Alcohol values are positively skewed. As expected, there are no extreme outliers. The values range from 8 to 14.2. My gut feeling is that Alcohol plays a crucial role in the quality of a white wine.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
```

Quality is also roughly normally distributed, as expected, with a range of 6 (scores from 3 to 9). It is interesting that there are just a few white wines with a Quality score of 3 or 9. So, there is no perfect white wine! There are also no white wines with a quality score of 0, 1 or 2.

There are 4,898 white wines in the dataset with 12 features (Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Density, pH, Sulphates, Alcohol, and Quality). All variables are numeric.

Other observations:

- Fixed Acidity is mostly about 7.
- 90% of Volatile Acidity values are below 0.4.
- There are two unusual peaks in Citric Acid values at 0.49 and 0.74.
- Residual Sugar is the most positively skewed attribute.
- Although the maximum of Chlorides values is 0.346, 92% of Chlorides values are below 0.06.
- The range of Free Sulfur Dioxide is 287, and the mean is 35.3080849.
- The mean of Total Sulfur Dioxide is 138.3606574.
- The range of Density is 17.3425658 times its standard deviation.
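Figures such as the share of values below a threshold, or the range measured in standard deviations, are short one-liners in R. The sketch below uses ten made-up values rather than the real column, so its numbers differ from those in the list.

```r
# Made-up values standing in for one column of the data set.
volatile.acidity <- c(0.27, 0.30, 0.28, 0.23, 0.45, 0.22, 0.32, 0.27, 0.30, 0.62)

# The mean of a logical vector is the share of TRUE values.
share_below <- mean(volatile.acidity < 0.4)

# Range expressed in units of standard deviation.
spread_ratio <- diff(range(volatile.acidity)) / sd(volatile.acidity)

cat(sprintf("%.0f%% of values are below 0.4\n", 100 * share_below))
cat(sprintf("Range is %.2f standard deviations wide\n", spread_ratio))
```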

The main features in the data set are Residual Sugar, Density and Alcohol. I would like to determine which features are most useful for predicting Quality. I do not think all variables in the data set are needed to build a model that predicts Quality.

The acidity attributes (i.e. Fixed Acidity, Volatile Acidity and Citric Acid), as well as pH, may also be useful for predicting Quality.

Later in the analysis, I will add a factor version of Quality and some bucketed variables.
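A factor version of a score and a bucketed numeric variable might be created as sketched below. The column names `quality.factor` and `alcohol.bucket`, the break points, and the six rows of data are all hypothetical choices made for illustration.

```r
# Made-up scores and alcohol values for illustration.
df <- data.frame(
  quality = c(3L, 5L, 6L, 6L, 7L, 9L),
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.4, 13.9)
)

# Ordered factor: keeps the ranking of scores while treating them as categories.
df$quality.factor <- factor(df$quality, ordered = TRUE)

# Hypothetical alcohol buckets; the break points are arbitrary choices.
df$alcohol.bucket <- cut(df$alcohol, breaks = c(8, 10, 12, 15),
                         include.lowest = TRUE)

table(df$alcohol.bucket)
```

An ordered factor is convenient for box plots grouped by score, since the groups stay sorted from lowest to highest.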

For better visualization, I mostly limit the x-axis to exclude outliers. I also apply a log10 transformation to Residual Sugar since it is highly skewed; after the transformation it looks like a bimodal distribution with two peaks around 1.2 and 10.
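The effect of a log10 transformation on a right-skewed variable can be sketched with simulated data. The values below are drawn from a log-normal distribution and are not the real sugar measurements; the `skewness` helper is defined locally so that no extra package is assumed.

```r
set.seed(42)
# Simulated right-skewed values (log-normal), not the real sugar measurements.
residual.sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.9)

# Simple moment-based skewness, defined here to avoid extra packages.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(residual.sugar)
skew_log <- skewness(log10(residual.sugar))

# Histogram on the log10 scale; the long right tail is compressed.
hist(log10(residual.sugar), breaks = 30,
     main = "Residual sugar (log10 scale)", xlab = "log10(residual sugar)")

cat("Skewness raw:", skew_raw, " log10:", skew_log, "\n")
```

The transformation requires strictly positive values, which holds for Residual Sugar since its minimum is 0.6.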

“X” denotes the pairs with insignificant p-values. The Residual Sugar - Density and Alcohol - Density pairs have strong correlations, while the Density - Quality and Alcohol - Quality correlations are only weak.
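The numbers behind such a correlation plot can be computed in base R. The sketch below is an assumption about the general approach, not the code used for the plot: it builds a Pearson correlation matrix, a matching matrix of `cor.test()` p-values, and marks pairs above an assumed 0.05 threshold with "X". The data is simulated.

```r
set.seed(7)
n <- 200
# Simulated columns: density depends on alcohol, noise is unrelated to both.
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
df <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001),
  noise   = rnorm(n)
)

r <- cor(df)  # Pearson correlation matrix

# p-value of cor.test() for every pair of columns.
vars <- names(df)
p <- sapply(vars, function(a) sapply(vars, function(b) {
  if (a == b) NA else cor.test(df[[a]], df[[b]])$p.value
}))

# Mark pairs whose p-value exceeds the assumed 0.05 threshold with "X".
marks <- ifelse(!is.na(p) & p > 0.05, "X", "")
print(round(r, 2))
print(marks)
```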

```
##
## Pearson's product-moment correlation
##
## data: df$quality and df$fixed.acidity
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.14121974 -0.08592991
## sample estimates:
## cor
## -0.1136628
```

According to this box plot, lower Fixed Acidity goes with better Quality in the Quality range 3-7. However, although the correlation is statistically significant (p-value = 1.48e-15), it is very weak (r = -0.1136628). For better visualization, the lowest and highest 1% are omitted to remove outliers.
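The test output above reports a very small p-value together with a small coefficient, and the two measure different things. As a worked check, the t statistic and p-value can be recomputed from the reported coefficient and sample size with the standard formula for a Pearson correlation test.

```r
# Values taken from the test output above.
r <- -0.1136628
n <- 4898

# t statistic for a Pearson correlation: t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of variance in one variable linearly explained by the other.
r_squared <- r^2

cat("t =", round(t_stat, 3), " p =", signif(p_value, 3),
    " r^2 =", round(r_squared, 4), "\n")
```

With almost 4,900 observations even a coefficient near -0.11 gives a t statistic of about -8, so the p-value is tiny. At the same time, r squared is only about 0.013, meaning roughly 1.3% of the variance is linearly shared between the two variables.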

```
##
## Pearson's product-moment correlation
##
## data: df$quality and df$total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2017563 -0.1474524
## sample estimates:
## cor
## -0.1747372
```

Although the correlation between Total Sulfur Dioxide and Quality is weak (r = -0.1747372), the box plot is interesting. The range tends to decrease as Quality increases. So, Total Sulfur Dioxide may also be useful for predicting the Quality score.
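A narrowing spread across groups can also be checked numerically rather than read off the box plot. The following sketch computes the range and interquartile range per Quality score on simulated data that was deliberately constructed so the spread shrinks as Quality rises; it illustrates the calculation only.

```r
set.seed(3)
# Simulated data in which spread shrinks as quality rises (illustrative only).
quality <- rep(3:8, each = 50)
total.sulfur.dioxide <- rnorm(length(quality), mean = 140,
                              sd = 60 - 6 * (quality - 3))
df <- data.frame(quality, total.sulfur.dioxide)

# Range and interquartile range of the attribute within each quality score.
range_by_quality <- tapply(df$total.sulfur.dioxide, df$quality,
                           function(v) diff(range(v)))
iqr_by_quality   <- tapply(df$total.sulfur.dioxide, df$quality, IQR)

print(round(range_by_quality, 1))
print(round(iqr_by_quality, 1))
```

The interquartile range is less sensitive to single extreme wines than the full range, and group sizes matter too: a score with only a handful of wines will tend to show a smaller range simply because it has fewer observations.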

```
##
## Pearson's product-moment correlation
##
## data: df$quality and df$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
```

In the above box plot, the lowest 1% and highest 21% are omitted to remove outliers and improve the visualization. There is a weak negative correlation between Density and Quality, with the value -0.3071233. So, this attribute may be useful for predicting the Quality score. The relation is especially noticeable above a Quality score of 5.
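A box plot with a quantile-limited axis might be drawn as sketched below in base R. The trimming probabilities used here (1% and 99%) are placeholder assumptions, and the data is simulated.

```r
set.seed(11)
# Simulated data: density drifts down slightly as quality rises (illustrative).
quality <- sample(3:9, 400, replace = TRUE)
density <- 0.997 - 0.0005 * quality + rnorm(400, sd = 0.002)
df <- data.frame(quality, density)

# Assumed trimming probabilities; adjust to whatever cutoffs are wanted.
limits <- quantile(df$density, probs = c(0.01, 0.99))

# Restricting ylim only zooms the view; the box statistics still use all rows.
boxplot(density ~ quality, data = df, ylim = limits,
        xlab = "Quality", ylab = "Density")
```

Zooming the axis and filtering the rows are not equivalent: filtering before plotting changes the medians and quartiles of each box, while zooming leaves them untouched and only hides the extreme points.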

```
##
## Pearson's product-moment correlation
##
## data: df$quality and df$pH
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07162022 0.12707983
## sample estimates:
## cor
## 0.09942725
```

The relationship between pH and Quality appears negative at low qualities such as 3, 4 and 5, and then turns positive. However, with a coefficient of only 0.0994272, it is hard to say that there is a meaningful correlation.

```
##
## Pearson's product-moment correlation
##
## data: df$quality and df$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
```

Visually, Alcohol appears to have some effect on Quality. The relationship is negative at low qualities such as 3, 4 and 5 and then turns positive, pH’s effect on Quality also