## Wine qualities
The description of the data set we are going to explore defines two groups of variables: input variables based on physicochemical tests, and one output variable based on sensory data, 12 - quality (a score between 0 and 10).
library(readr)
# Read the data file
# rw_df: red wine data frame
rw_df <- read_delim("data/winequality-red.csv", delim = ";")
## Rows: 1599 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(rw_df)
sample_size <- length(rw_df$quality)
The sample size is 1599.
Summary statistics for the data set:
summary(rw_df)
##  fixed acidity  volatile acidity  citric acid    residual sugar
##  Min.   : 4.60  Min.   :0.1200    Min.   :0.000  Min.   : 0.900
##  1st Qu.: 7.10  1st Qu.:0.3900    1st Qu.:0.090  1st Qu.: 1.900
##  Median : 7.90  Median :0.5200    Median :0.260  Median : 2.200
##  Mean   : 8.32  Mean   :0.5278    Mean   :0.271  Mean   : 2.539
##  3rd Qu.: 9.20  3rd Qu.:0.6400    3rd Qu.:0.420  3rd Qu.: 2.600
##  Max.   :15.90  Max.   :1.5800    Max.   :1.000  Max.   :15.500
##  chlorides        free sulfur dioxide  total sulfur dioxide  density
##  Min.   :0.01200  Min.   : 1.00        Min.   :  6.00        Min.   :0.9901
##  1st Qu.:0.07000  1st Qu.: 7.00        1st Qu.: 22.00        1st Qu.:0.9956
##  Median :0.07900  Median :14.00        Median : 38.00        Median :0.9968
##  Mean   :0.08747  Mean   :15.87        Mean   : 46.47        Mean   :0.9967
##  3rd Qu.:0.09000  3rd Qu.:21.00        3rd Qu.: 62.00        3rd Qu.:0.9978
##  Max.   :0.61100  Max.   :72.00        Max.   :289.00        Max.   :1.0037
##  pH             sulphates       alcohol        quality
##  Min.   :2.740  Min.   :0.3300  Min.   : 8.40  Min.   :3.000
##  1st Qu.:3.210  1st Qu.:0.5500  1st Qu.: 9.50  1st Qu.:5.000
##  Median :3.310  Median :0.6200  Median :10.20  Median :6.000
##  Mean   :3.311  Mean   :0.6581  Mean   :10.42  Mean   :5.636
##  3rd Qu.:3.400  3rd Qu.:0.7300  3rd Qu.:11.10  3rd Qu.:6.000
##  Max.   :4.010  Max.   :2.0000  Max.   :14.90  Max.   :8.000
spec(rw_df)
## cols(
## `fixed acidity` = col_double(),
## `volatile acidity` = col_double(),
## `citric acid` = col_double(),
## `residual sugar` = col_double(),
## chlorides = col_double(),
## `free sulfur dioxide` = col_double(),
## `total sulfur dioxide` = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_double()
## )
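Several column names contain spaces, so they have to be quoted every time they are used. A minimal sketch of one way to avoid that is shown here; `clean_names` and `rw_clean` are hypothetical names that are not part of the original analysis, and the toy values are for illustration only.

```r
# Hypothetical helper: replace spaces in column names with underscores
# so that columns can be referenced without quoting.
clean_names <- function(df) {
  names(df) <- gsub(" ", "_", names(df), fixed = TRUE)
  df
}

# Toy data frame that mimics the naming pattern of the wine data.
toy <- data.frame(`fixed acidity` = c(7.4, 7.8), pH = c(3.51, 3.20),
                  check.names = FALSE)
print(names(clean_names(toy)))

# Applied to the wine data only if it has been loaded; rw_df stays unchanged.
if (exists("rw_df")) rw_clean <- clean_names(rw_df)
```

The rest of this section keeps the original names, so the renamed copy is optional.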
Fixed acidity: the values at the edges are close to each other, so there are no potential outliers. The distribution is skewed to the right.
hist(rw_df$'fixed acidity')
boxplot(rw_df$'fixed acidity')
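The skew read from the histogram can also be expressed as a number. The sketch below uses the moment-based sample skewness, which is only one of several common definitions and is my assumption here; `sample_skewness` is a hypothetical helper. A positive value points to a longer right tail.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy vectors for illustration only.
print(sample_skewness(c(1, 2, 3, 4, 5)))   # symmetric data
print(sample_skewness(c(1, 1, 2, 2, 10)))  # long right tail

# Applied to the wine data only if it has been loaded.
if (exists("rw_df")) print(sample_skewness(rw_df$`fixed acidity`))
```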
Volatile acidity: the top value, 1.58, is distant from the others and could be a potential outlier. The distribution is slightly skewed to the right.
hist(rw_df$'volatile acidity')
boxplot(rw_df$'volatile acidity')
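The points that a boxplot draws beyond its whiskers can be listed directly. This is a sketch that assumes the default whisker rule (1.5 times the interquartile range beyond the hinges); `whisker_outliers` is a hypothetical helper. The rule may flag more points than the visual reading used in this section.

```r
# Values that fall outside the boxplot whiskers.
# coef = 1.5 is the default used by boxplot().
whisker_outliers <- function(x, coef = 1.5) {
  boxplot.stats(x, coef = coef)$out
}

# Toy vector for illustration only.
print(whisker_outliers(c(1, 2, 3, 4, 5, 100)))

# Applied to the wine data only if it has been loaded.
if (exists("rw_df")) print(sort(whisker_outliers(rw_df$`volatile acidity`)))
```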
Citric acid: the top value, 1, is on the edge and distant from the others, so it could be a potential outlier. According to an additional source, it should be in a range between 0 and 0.5. The histogram is very far from a Normal distribution.
hist(rw_df$'citric acid')
boxplot(rw_df$'citric acid')
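A reference range like the one quoted above can be checked by listing the values that fall outside it. This is a minimal sketch; `outside_range` is a hypothetical helper, and the bounds 0 and 0.5 are simply the ones mentioned in the text.

```r
# Return the values of x that lie outside the closed interval [lower, upper].
outside_range <- function(x, lower, upper) {
  x[!is.na(x) & (x < lower | x > upper)]
}

# Toy vector for illustration only.
print(outside_range(c(0.0, 0.26, 0.49, 0.76, 1.0), lower = 0, upper = 0.5))

# Applied to the wine data only if it has been loaded:
# how many citric acid values exceed the quoted range.
if (exists("rw_df")) {
  print(length(outside_range(rw_df$`citric acid`, lower = 0, upper = 0.5)))
}
```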
Residual sugar: I suspect all values of 10 and above can be treated as outliers. The distribution is skewed to the right.
hist(rw_df$'residual sugar')
boxplot(rw_df$'residual sugar')
Chlorides: the top value, 0.611, is on the edge and distant from the others, so it could be a potential outlier. The distribution is skewed to the right.
hist(rw_df$'chlorides')
boxplot(rw_df$'chlorides')
Free sulfur dioxide: the top 3 values are distant from the others and could be potential outliers. The distribution is skewed to the right.
hist(rw_df$'free sulfur dioxide')
boxplot(rw_df$'free sulfur dioxide')
Total sulfur dioxide: the top 2 values, both above 250, are distant from the others and are evident outliers. The distribution is skewed to the right.
hist(rw_df$'total sulfur dioxide')
boxplot(rw_df$'total sulfur dioxide')
Density: surprisingly, the distribution is not skewed and looks like a Normal distribution. There are no evident outliers.
hist(rw_df$'density')
boxplot(rw_df$'density')
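The visual impression of normality can be supplemented with a formal check. The sketch below uses the Shapiro-Wilk test from base R together with a normal Q-Q plot; `check_normality` is a hypothetical helper and the simulated data is for illustration only. With a sample as large as this one, the test may reject normality even for small deviations, so the result is best read alongside the plots.

```r
# Shapiro-Wilk normality test after dropping missing values.
# shapiro.test() accepts between 3 and 5000 observations.
check_normality <- function(x) {
  x <- x[!is.na(x)]
  shapiro.test(x)
}

# Simulated normal data for illustration only.
set.seed(1)
res <- check_normality(rnorm(200))
print(res$p.value)

# Applied to the wine data only if it has been loaded.
if (exists("rw_df")) {
  print(check_normality(rw_df$density))
  qqnorm(rw_df$density)  # points near the line suggest normality
  qqline(rw_df$density)
}
```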
pH: this one is very tricky. The lowest value, 2.74, should be treated as an outlier, while the highest value, 4.01, can be realistic. The distribution looks Normal, with no skew.
hist(rw_df$'pH')
boxplot(rw_df$'pH')
Sulphates: the top 2 values, both above 1.5, are distant from the others and could be outliers. The distribution is skewed to the right.
hist(rw_df$'sulphates')
boxplot(rw_df$'sulphates')
Alcohol: the top value, 14.9, is distant from the others and could be an outlier. The distribution is skewed to the right.
hist(rw_df$'alcohol')
boxplot(rw_df$'alcohol')
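The same pair of plots is drawn for every variable above, so the repetition could be folded into a loop. This is a sketch only; `plot_distributions` is a hypothetical helper that is not part of the original analysis.

```r
# Draw a histogram and a boxplot for every numeric column of a data frame.
# Returns the names of the plotted columns invisibly.
plot_distributions <- function(df) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  for (col in numeric_cols) {
    hist(df[[col]], main = paste("Histogram of", col), xlab = col)
    boxplot(df[[col]], main = paste("Boxplot of", col), ylab = col)
  }
  invisible(numeric_cols)
}

# Applied to the wine data only if it has been loaded.
if (exists("rw_df")) plot_distributions(rw_df)
```

Using `df[[col]]` avoids quoting column names that contain spaces.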