Wine Quality

Introduction

Data collected by Cortez et al. for data mining purposes is explored in this report, with the red wine dataset as the focus. The data contains 1599 observations of 12 features. Eleven of the features are objective laboratory tests such as density and pH. The final feature is a sensory quality score, obtained by taking the median of the scores (out of ten) given by at least three wine experts.

The original publication from Cortez et al. can be found online.

Setup

## Load Libraries
library(tidyverse)              # Creating clean and tidy data
library(knitr)                  # Dynamic report generation
library(kableExtra)             # Creation of complex tables
library(moments)                # Calculate skewness/kurtosis

Load Data

red_wine <- read.csv("winequality-red.csv", sep = ';', header = TRUE)
names(red_wine) <- gsub(x = names(red_wine), pattern = "\\.", replacement = "_") # modify variable names for readability
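
As a small illustration of the renaming step: with its default `check.names = TRUE`, `read.csv` passes column headers through `make.names`, which turns spaces into dots, and the `gsub` call then swaps those dots for underscores. The header strings below are a hand-typed sample that is assumed to resemble the raw file's headers; they are not read from the file.

```r
# Hypothetical sample of raw headers (assumed, not read from the CSV)
raw_headers <- c("fixed acidity", "free sulfur dioxide", "pH")

# read.csv applies make.names() by default, replacing spaces with dots
syntactic <- make.names(raw_headers)

# Same substitution as in the loading step: dots become underscores
clean <- gsub(pattern = "\\.", replacement = "_", x = syntactic)
print(clean)
```

Headers that contain no spaces, such as `pH`, pass through unchanged.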

Exploratory Data Analysis

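# Summarise one column: class, missing count, five-number summary, mean, SD, skewness
new_sum <- function(x){
  metrics <- list(Class = class(x),
                  NA_Vals = sum(is.na(x)),
                  Min = min(x),
                  Q1 = quantile(x, probs = .25),
                  Median = median(x),
                  Mean = mean(x),
                  Q3 = quantile(x, probs = .75),
                  Max = max(x),
                  SD = sd(x),
                  Skewness = skewness(x)    # from the moments package
  )
  # Round every numeric metric (everything except Class) to 3 decimals
  metrics[-1] <- metrics[-1] %>% sapply(round, 3)

  return(metrics)
}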

summary_stat <- red_wine %>% 
  sapply(new_sum)

summary_stat %>%
  kbl(caption = paste0('Summary Statistics:<br>','Dims: ',nrow(red_wine),' X ',ncol(red_wine))) %>%
  kable_classic('striped', full_width = TRUE) %>%
  pack_rows('Data',1,2) %>%
  pack_rows('Distribution Stats',3,8) %>%
  pack_rows('Deviation',9,10)

Summary Statistics:
Dims: 1599 X 12
	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	quality
Data
Class	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric	integer
NA_Vals	0	0	0	0	0	0	0	0	0	0	0	0
Distribution Stats
Min	4.6	0.12	0	0.9	0.012	1	6	0.99	2.74	0.33	8.4	3
Q1	7.1	0.39	0.09	1.9	0.07	7	22	0.996	3.21	0.55	9.5	5
Median	7.9	0.52	0.26	2.2	0.079	14	38	0.997	3.31	0.62	10.2	6
Mean	8.32	0.528	0.271	2.539	0.087	15.875	46.468	0.997	3.311	0.658	10.423	5.636
Q3	9.2	0.64	0.42	2.6	0.09	21	62	0.998	3.4	0.73	11.1	6
Max	15.9	1.58	1	15.5	0.611	72	289	1.004	4.01	2	14.9	8
Deviation
SD	1.741	0.179	0.195	1.41	0.047	10.46	32.895	0.002	0.154	0.17	1.066	0.808
Skewness	0.982	0.671	0.318	4.536	5.675	1.249	1.514	0.071	0.194	2.426	0.86	0.218

Summary Statistic Analysis

The red_wine data has 1599 observations of 12 features. The data is fairly clean: there are no missing values anywhere in the dataset, and there are no obviously erroneous entries, such as impossible values, that would need to be excluded. Several features are very likely to contain outliers. In particular, total_sulfur_dioxide and residual_sugar have clear outliers at the upper end, based on the large difference between their Q3 and maximum values.
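
The gap between Q3 and the maximum can be put in numbers. The sketch below copies Q1, Q3, and Max for the two features from the summary table and compares each maximum with the conventional boxplot upper fence. The 1.5 * IQR rule is an assumption here, since the analysis does not name a specific criterion.

```r
# Q1, Q3 and Max copied from the summary statistics table
stats <- data.frame(
  feature = c("residual_sugar", "total_sulfur_dioxide"),
  Q1  = c(1.9, 22),
  Q3  = c(2.6, 62),
  Max = c(15.5, 289)
)

stats$IQR <- stats$Q3 - stats$Q1
# Conventional boxplot upper fence (Tukey's 1.5 * IQR rule; an assumed criterion)
stats$upper_fence <- stats$Q3 + 1.5 * stats$IQR
# Distance of the maximum above Q3, measured in IQRs
stats$max_in_iqrs <- (stats$Max - stats$Q3) / stats$IQR

print(stats)
```

Both maxima lie well beyond their upper fence, which is consistent with the reading of the table given above.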

To further explore the distributions of the features, boxplots of all features are included below. free_sulfur_dioxide and total_sulfur_dioxide are plotted separately because of their larger scale. As expected, outliers are present in all features. For a number of features, however, the outliers are minor rather than extreme and do not need to be addressed. These features include quality, alcohol, sulphates, pH, density, chlorides, citric_acid, volatile_acidity, and free_sulfur_dioxide. The remaining features, residual_sugar, fixed_acidity, and total_sulfur_dioxide, have extreme outliers that will need to be addressed.

par(mar = c(1,2,.1,.1))
red_wine %>%
  select(!c(free_sulfur_dioxide, total_sulfur_dioxide)) %>% 
  stack() %>% 
  ggplot(aes(x = ind, y = values)) +
  geom_boxplot(outlier.color = 'red') +
  coord_flip() +
  ggtitle("Boxplots of Features", subtitle = "Excluding sulfur dioxide variables") +
  xlab("Features") +
  ylab("Values")

red_wine %>%
  select(c(free_sulfur_dioxide, total_sulfur_dioxide)) %>% 
  stack() %>% 
  ggplot(aes(x = ind, y = values)) +
  geom_boxplot(outlier.color = 'red') +
  coord_flip() +
  ggtitle("Boxplots of Features", subtitle = "Sulfur dioxide variables") +
  xlab("Features") +
  ylab("Values")
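
The distinction between minor and extreme outliers is read off the boxplots by eye. One way to make it explicit, offered only as a sketch, is Tukey's fences: points beyond 1.5 * IQR from the quartiles count as mild outliers and points beyond 3 * IQR as extreme. The helper `count_outliers` and the toy vector are hypothetical and not part of the original analysis, and the 3 * IQR threshold is an assumption about what "extreme" means here.

```r
# Hypothetical helper: classify outliers with Tukey's fences
# (mild: beyond 1.5 * IQR from the quartiles; extreme: beyond 3 * IQR)
count_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  extreme <- x < q[1] - 3 * iqr | x > q[2] + 3 * iqr
  beyond_inner <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
  c(mild = sum(beyond_inner & !extreme), extreme = sum(extreme))
}

# Toy vector (not wine data): one mild and one extreme high value
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40)
print(count_outliers(toy))
```

Applied to the data, `sapply(red_wine, count_outliers)` would return the two counts for every feature, which could then be compared against the visual grouping above.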

Histograms of Features with Extreme Outliers and Skewness

The three features with extreme outliers, together with the chlorides feature, are investigated further in the histograms below. The chlorides feature is included because of the high skewness identified in the summary statistics. Each of these four features has a positive skew and extreme values that will likely need to be removed. Addressing the outliers will help normalize the distributions and reduce the skewness present in the data. Without subject matter expertise it is difficult to determine the cause of these outliers; however, it is likely that there was an issue with either the testing procedure or the sample.

par(mar = c(2,2,.1,.1))
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(red_wine, aes(residual_sugar)) +
  geom_histogram(aes(y = after_stat(density)), color = 'black', fill = 'pink', bins = 15) +
  geom_density(fill = 'lightblue', color = 'blue', alpha = .3) +
  ggtitle("Residual Sugar")

ggplot(red_wine, aes(fixed_acidity)) +
  geom_histogram(aes(y = after_stat(density)), color = 'black', fill = 'pink', bins = 15) +
  geom_density(fill = 'lightblue', color = 'blue', alpha = .3) +
  ggtitle("Fixed Acidity")

ggplot(red_wine, aes(total_sulfur_dioxide)) +
  geom_histogram(aes(y = after_stat(density)), color = 'black', fill = 'pink', bins = 15) +
  geom_density(fill = 'lightblue', color = 'blue', alpha = .3) +
  ggtitle("Total Sulfur Dioxide")

ggplot(red_wine, aes(chlorides)) +
  geom_histogram(aes(y = after_stat(density)), color = 'black', fill = 'pink', bins = 15) +
  geom_density(fill = 'lightblue', color = 'blue', alpha = .3) +
  ggtitle("Chlorides")
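
To illustrate why addressing extreme values reduces skewness, the sketch below computes the moment-based skewness in base R (the same definition that `moments::skewness` uses) on a toy vector, with and without a single extreme value. The helper `skew` and the toy numbers are hypothetical. Dropping the value is for demonstration only and is not a statement of how the outliers in the wine data are handled.

```r
# Moment-based sample skewness: m3 / m2^(3/2), central moments divided by n.
# Written in base R so the sketch has no package dependency.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Toy symmetric values plus one extreme high value (not wine data)
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 40)

skew_before <- skew(toy)
skew_after  <- skew(toy[toy < 40])   # drop the extreme value
print(c(before = skew_before, after = skew_after))
```

A single extreme value is enough to produce a strongly positive skew in otherwise symmetric data, which is the pattern the histograms above show for the four features.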