Introduction

Data collected by Cortex et al. for the purpose of data mining will be explored and mined. A dataset of red wine will be the focus. This data contains 1599 observations with 12 features. 11 of the features are objective laboratory tests such as density and pH. The final feature is an objective quality score that was obtained by taking the median score out of ten from three wine experts.

The original publication from Cortex et al. can be found online.

Setup

## Load Libraries
library(tidyverse)              # Creating clean and tidy data
library(knitr)                  # Dynamic report generation
library(kableExtra)             # Creation of complex tables
library(moments)                # Calculate skewness/kurtosis

Load Data

red_wine <- read.csv("winequality-red.csv", sep = ';', header = TRUE)
names(red_wine) <- gsub(x = names(red_wine), pattern = "\\.", replacement = "_") # modify variable names for readability

Exploratory Data Analysis

new_sum <- function(x){
  metrics <- list(Class = class(x),
                  NA_Vals = sum(is.na(x)),
                  Min = min(x),
                  Q1 = quantile(x,probs = .25),
                  Median = median(x),
                  Mean = mean(x),
                  Q3 = quantile(x,probs = .75),
                  Max = max(x),
                  SD = sd(x),
                  Skewness = skewness(x)
  ) 
  metrics[-1] <-  metrics[-1] %>% sapply(round,3)
  
  return(metrics)
}

summary_stat <- red_wine %>% 
  sapply(new_sum)

summary_stat %>%
  kbl(caption = paste0('Summary Statistics:<br>','Dims: ',nrow(red_wine),' X ',ncol(red_wine))) %>%
  kable_classic('striped',full_width = T) %>%
  pack_rows('Data',1,2) %>%
  pack_rows('Distribution Stats',3,8) %>%
  pack_rows('Deviation',9,10)
Summary Statistics:
Dims: 1599 X 12
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality
Data
Class numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric integer
NA_Vals 0 0 0 0 0 0 0 0 0 0 0 0
Distribution Stats
Min 4.6 0.12 0 0.9 0.012 1 6 0.99 2.74 0.33 8.4 3
Q1 7.1 0.39 0.09 1.9 0.07 7 22 0.996 3.21 0.55 9.5 5
Median 7.9 0.52 0.26 2.2 0.079 14 38 0.997 3.31 0.62 10.2 6
Mean 8.32 0.528 0.271 2.539 0.087 15.875 46.468 0.997 3.311 0.658 10.423 5.636
Q3 9.2 0.64 0.42 2.6 0.09 21 62 0.998 3.4 0.73 11.1 6
Max 15.9 1.58 1 15.5 0.611 72 289 1.004 4.01 2 14.9 8
Deviation
SD 1.741 0.179 0.195 1.41 0.047 10.46 32.895 0.002 0.154 0.17 1.066 0.808
Skewness 0.982 0.671 0.318 4.536 5.675 1.249 1.514 0.071 0.194 2.426 0.86 0.218

Summary Statistic Analysis

The red_wine data has 1599 observations with 12 features. The data is fairly clean and does not appear to have any immediate errors that can be excluded, such as impossible values. No missing values are present in the entire dataset. There are a number of features where there is a very high likelihood of outliers. Specifically, the upper limit of the total_sulfur_dioxide and residual_sugar features have clear outliers based on the significant different between the Q3 and maximum values.

To further explore distributions of the features, boxplots of all features are included below. Free_sulfur_dioxide and total_sulfur_dioxide are separated due to scaling. As expected there are outliers present in all features. While outliers are present a number of features do not need to have outliers addressed as they are minor and are not extreme outliers. There features include quality, alcohol, sulphates, pH, density, chlorides, citric_acid, volatile_acidity, and free_sulfur_dioxide.The remaining features, residual_sugar, fixed_acidity, and total_sulfur_dioxide have extreme outliers that will need to be addressed.

par(mar = c(1,2,.1,.1))
red_wine %>%
  select(!c(free_sulfur_dioxide, total_sulfur_dioxide)) %>% 
  stack() %>% 
  ggplot(aes(x = ind, y = values)) +
  geom_boxplot(outlier.color = 'red') +
  coord_flip() +
  ggtitle("Boxplots of Features", subtitle = "Excluding sulfur dioxide variables") +
  xlab("Features") +
  ylab("Values")

red_wine %>%
  select(c(free_sulfur_dioxide, total_sulfur_dioxide)) %>% 
  stack() %>% 
  ggplot(aes(x = ind, y = values)) +
  geom_boxplot(outlier.color = 'red') +
  coord_flip() +
  ggtitle("Boxplots of Features", subtitle = "Sulfur dioxide variables") +
  xlab("Features") +
  ylab("Values")

Histograms of Features with Extreme Outliers and Skewness

Further investigation of the three features with extreme outliers and the chlorides variables are visualized below. The chlorides feature was included based on the high level of skewness identified in the summary statistics. Each of these four features have a positive skew. Each of these four features have extreme values that will likely need to be removed. Addressing the outliers will assist in normalizing the distributions and sknewness present in the data. Without subject matter expertise it is difficult to assign the cause of these outliers; however, it is likely that there was either an issue with the testing procedure or sample.

par(mar = c(2,2,.1,.1))
ggplot(red_wine, aes(residual_sugar)) + geom_histogram(aes(y = ..density..),color = 'black',fill = 'pink',bins = 15) +
  geom_density(fill = 'lightblue',color = 'blue',alpha = .3) +
  ggtitle("Residual Sugar")
  
ggplot(red_wine, aes(fixed_acidity)) + geom_histogram(aes(y = ..density..),color = 'black',fill = 'pink',bins = 15) +
  geom_density(fill = 'lightblue',color = 'blue',alpha = .3) +
  ggtitle("Fixed Acidity")

ggplot(red_wine, aes(total_sulfur_dioxide)) + geom_histogram(aes(y = ..density..),color = 'black',fill = 'pink',bins = 15) +
  geom_density(fill = 'lightblue',color = 'blue',alpha = .3) +
  ggtitle("Total Sulfur Dioxide")

ggplot(red_wine, aes(chlorides)) + geom_histogram(aes(y = ..density..),color = 'black',fill = 'pink',bins = 15) +
  geom_density(fill = 'lightblue',color = 'blue',alpha = .3) +
  ggtitle("Chlorides")