Data collected by Cortex et al. for the purpose of data mining will be explored and mined. A dataset of red wine will be the focus. This data contains 1599 observations with 12 features. 11 of the features are objective laboratory tests such as density and pH. The final feature is an objective quality score that was obtained by taking the median score out of ten from three wine experts.
The original publication from Cortex et al. can be found online.
## Load Libraries
library(tidyverse) # Creating clean and tidy data
library(knitr) # Dynamic report generation
library(kableExtra) # Creation of complex tables
library(moments) # Calculate skewness/kurtosis
red_wine <- read.csv("winequality-red.csv", sep = ';', header = TRUE)
names(red_wine) <- gsub(x = names(red_wine), pattern = "\\.", replacement = "_") # modify variable names for readability
new_sum <- function(x){
metrics <- list(Class = class(x),
NA_Vals = sum(is.na(x)),
Min = min(x),
Q1 = quantile(x,probs = .25),
Median = median(x),
Mean = mean(x),
Q3 = quantile(x,probs = .75),
Max = max(x),
SD = sd(x),
Skewness = skewness(x)
)
metrics[-1] <- metrics[-1] %>% sapply(round,3)
return(metrics)
}
summary_stat <- red_wine %>%
sapply(new_sum)
summary_stat %>%
kbl(caption = paste0('Summary Statistics:<br>','Dims: ',nrow(red_wine),' X ',ncol(red_wine))) %>%
kable_classic('striped',full_width = T) %>%
pack_rows('Data',1,2) %>%
pack_rows('Distribution Stats',3,8) %>%
pack_rows('Deviation',9,10)
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data | ||||||||||||
| Class | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | integer |
| NA_Vals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Distribution Stats | ||||||||||||
| Min | 4.6 | 0.12 | 0 | 0.9 | 0.012 | 1 | 6 | 0.99 | 2.74 | 0.33 | 8.4 | 3 |
| Q1 | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7 | 22 | 0.996 | 3.21 | 0.55 | 9.5 | 5 |
| Median | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14 | 38 | 0.997 | 3.31 | 0.62 | 10.2 | 6 |
| Mean | 8.32 | 0.528 | 0.271 | 2.539 | 0.087 | 15.875 | 46.468 | 0.997 | 3.311 | 0.658 | 10.423 | 5.636 |
| Q3 | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21 | 62 | 0.998 | 3.4 | 0.73 | 11.1 | 6 |
| Max | 15.9 | 1.58 | 1 | 15.5 | 0.611 | 72 | 289 | 1.004 | 4.01 | 2 | 14.9 | 8 |
| Deviation | ||||||||||||
| SD | 1.741 | 0.179 | 0.195 | 1.41 | 0.047 | 10.46 | 32.895 | 0.002 | 0.154 | 0.17 | 1.066 | 0.808 |
| Skewness | 0.982 | 0.671 | 0.318 | 4.536 | 5.675 | 1.249 | 1.514 | 0.071 | 0.194 | 2.426 | 0.86 | 0.218 |
The red_wine data has 1599 observations with 12 features. The data is fairly clean and does not appear to have any immediate errors that can be excluded, such as impossible values. No missing values are present in the entire dataset. There are a number of features where there is a very high likelihood of outliers. Specifically, the upper limit of the total_sulfur_dioxide and residual_sugar features have clear outliers based on the significant different between the Q3 and maximum values.
To further explore distributions of the features, boxplots of all features are included below. Free_sulfur_dioxide and total_sulfur_dioxide are separated due to scaling. As expected there are outliers present in all features. While outliers are present a number of features do not need to have outliers addressed as they are minor and are not extreme outliers. There features include quality, alcohol, sulphates, pH, density, chlorides, citric_acid, volatile_acidity, and free_sulfur_dioxide.The remaining features, residual_sugar, fixed_acidity, and total_sulfur_dioxide have extreme outliers that will need to be addressed.
par(mar = c(1,2,.1,.1))
red_wine %>%
select(!c(free_sulfur_dioxide, total_sulfur_dioxide)) %>%
stack() %>%
ggplot(aes(x = ind, y = values)) +
geom_boxplot(outlier.color = 'red') +
coord_flip() +
ggtitle("Boxplots of Features", subtitle = "Excluding sulfur dioxide variables") +
xlab("Features") +
ylab("Values")
red_wine %>%
select(c(free_sulfur_dioxide, total_sulfur_dioxide)) %>%
stack() %>%
ggplot(aes(x = ind, y = values)) +
geom_boxplot(outlier.color = 'red') +
coord_flip() +
ggtitle("Boxplots of Features", subtitle = "Sulfur dioxide variables") +
xlab("Features") +
ylab("Values")
Further investigation of the three features with extreme outliers and the chlorides variables are visualized below. The chlorides feature was included based on the high level of skewness identified in the summary statistics. Each of these four features have a positive skew. Each of these four features have extreme values that will likely need to be removed. Addressing the outliers will assist in normalizing the distributions and sknewness present in the data. Without subject matter expertise it is difficult to assign the cause of these outliers; however, it is likely that there was either an issue with the testing procedure or sample.
par(mar = c(2,2,.1,.1))
ggplot(red_wine, aes(residual_sugar)) + geom_histogram(aes(y = ..density..),color = 'black',fill = 'pink',bins = 15) +
geom_density(fill = 'lightblue',color = 'blue',alpha = .3) +
ggtitle("Residual Sugar")
ggplot(red_wine, aes(fixed_acidity)) + geom_histogram(aes(y = ..density..),color = 'black',fill = 'pink',bins = 15) +
geom_density(fill = 'lightblue',color = 'blue',alpha = .3) +
ggtitle("Fixed Acidity")
ggplot(red_wine, aes(total_sulfur_dioxide)) + geom_histogram(aes(y = ..density..),color = 'black',fill = 'pink',bins = 15) +
geom_density(fill = 'lightblue',color = 'blue',alpha = .3) +
ggtitle("Total Sulfur Dioxide")
ggplot(red_wine, aes(chlorides)) + geom_histogram(aes(y = ..density..),color = 'black',fill = 'pink',bins = 15) +
geom_density(fill = 'lightblue',color = 'blue',alpha = .3) +
ggtitle("Chlorides")