The Study of Wine Quality

## Wine qualities

The data set we are going to explore has the following input variables (based on physicochemical tests):

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

library(readr)
# read the data file
# rw_df: red wine data frame
rw_df <- read_delim("data/winequality-red.csv", delim = ";")

## Rows: 1599 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(rw_df)

General characteristics

a. What is the sample size?

sample_size <- length(rw_df$quality)

The sample size is 1599.

c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

Data set statistics are:

summary(rw_df)

##  fixed acidity   volatile acidity  citric acid    residual sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free sulfur dioxide total sulfur dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
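`summary()` reports location and quartiles but no direct measure of spread. As a minimal sketch, the helper below adds the standard deviation and the interquartile range for each numeric column; the name `describe_numeric` and the toy data frame are illustrative assumptions and not part of the lab.

```r
# Hypothetical helper: one row of statistics per numeric column.
describe_numeric <- function(df) {
  stats <- sapply(df, function(x) {
    c(mean   = mean(x),
      sd     = sd(x),
      median = median(x),
      iqr    = IQR(x),
      min    = min(x),
      max    = max(x))
  })
  t(stats)  # transpose so that variables are rows
}

# Toy data for illustration only (NOT taken from the wine data set).
toy_df <- data.frame(a = c(1, 2, 3, 4, 10), b = c(5, 5, 6, 7, 7))
toy_stats <- describe_numeric(toy_df)
print(round(toy_stats, 2))
```

On the real data the call would be `describe_numeric(rw_df)`, since all 12 columns are numeric.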

spec(rw_df)

## cols(
##   `fixed acidity` = col_double(),
##   `volatile acidity` = col_double(),
##   `citric acid` = col_double(),
##   `residual sugar` = col_double(),
##   chlorides = col_double(),
##   `free sulfur dioxide` = col_double(),
##   `total sulfur dioxide` = col_double(),
##   density = col_double(),
##   pH = col_double(),
##   sulphates = col_double(),
##   alcohol = col_double(),
##   quality = col_double()
## )

Below I combine the answers to three lab questions (b, d and e) and group them by variable.

b. Any outliers? Do you have any concerns about the data quality?

d. How can you visualize the distribution of each variable?

e. Do you see any skewed distributions?

1 - fixed acidity

The extreme values are close to the rest of the data, so there are no potential outliers. The distribution is skewed to the right.

hist(rw_df$'fixed acidity')

boxplot(rw_df$'fixed acidity')
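The "skewed to the right" reading of a histogram can be backed by a number. The sketch below computes a moment-based skewness coefficient; the function name `skewness` and the toy vectors are illustrative assumptions, not part of the lab or of base R. A positive result indicates a longer right tail, and a value near zero indicates a roughly symmetric distribution.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2).
# Positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]          # ignore missing values
  centered <- x - mean(x)
  m2 <- mean(centered^2)     # second central moment
  m3 <- mean(centered^3)     # third central moment
  m3 / m2^1.5
}

# Toy vectors for illustration only (NOT taken from the wine data set).
right_skewed <- c(1, 1, 2, 2, 3, 10)
symmetric    <- c(1, 2, 3, 4, 5)

skew_right <- skewness(right_skewed)  # positive: long right tail
skew_sym   <- skewness(symmetric)     # zero: symmetric around the mean
```

On the real data the same helper could be called as `skewness(rw_df$'fixed acidity')`.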

2 - volatile acidity

The top value, 1.58, is distant from the others and could be an outlier. The distribution is slightly skewed to the right.

hist(rw_df$'volatile acidity')

boxplot(rw_df$'volatile acidity')
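The points that `boxplot()` draws individually can also be listed explicitly. A minimal sketch, assuming the usual 1.5 × IQR rule is an acceptable screening criterion; the wrapper name `iqr_outliers` and the toy vector are hypothetical, while `boxplot.stats()` is the base R function that `boxplot()` itself relies on.

```r
# Hypothetical wrapper: return the values lying more than
# coef * (hinge spread) beyond the hinges (Tukey's rule).
iqr_outliers <- function(x, coef = 1.5) {
  boxplot.stats(x, coef = coef)$out
}

# Toy vector for illustration only (NOT taken from the wine data set).
toy <- c(0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 1.6)
flagged <- iqr_outliers(toy)
print(flagged)
```

For strongly skewed variables this rule tends to flag many points, so the result is a list of candidates to inspect rather than a verdict.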

3 - citric acid

The top value, 1, is distant from the others and could be an outlier. According to an additional source, citric acid should be in the range between 0 and 0.5.
The histogram is very far from a Normal distribution.

hist(rw_df$'citric acid')

boxplot(rw_df$'citric acid')
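A visual impression of non-normality can be complemented by a formal check. The sketch below applies the Shapiro-Wilk test from base R to a toy vector; the vector is an illustrative assumption and not taken from the wine data. A small p-value is evidence against normality. With a sample as large as this one the test becomes sensitive to very small deviations, so a Q-Q plot (`qqnorm()` followed by `qqline()`) is a useful companion.

```r
# Shapiro-Wilk test: the null hypothesis is that the sample is Normal.
# shapiro.test() accepts between 3 and 5000 observations.

# Toy vector for illustration only (NOT taken from the wine data set).
toy <- c(0, 0, 0, 0.02, 0.05, 0.1, 0.2, 0.25, 0.3, 0.45, 0.5, 1)

sw <- shapiro.test(toy)
print(sw$statistic)  # W statistic; values close to 1 look Normal
print(sw$p.value)    # a small p-value is evidence against normality
```

On the real data the call would be `shapiro.test(rw_df$'citric acid')`.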

4 - residual sugar

I suspect that all values of 10 and above can be treated as outliers. The distribution is skewed to the right.

hist(rw_df$'residual sugar')

boxplot(rw_df$'residual sugar')

5 - chlorides

The top value, 0.611, is distant from the others and could be an outlier. The distribution is skewed to the right.

hist(rw_df$'chlorides')

boxplot(rw_df$'chlorides')

6 - free sulfur dioxide

The top 3 values are distant from the others and could be outliers. The distribution is skewed to the right.

hist(rw_df$'free sulfur dioxide')

boxplot(rw_df$'free sulfur dioxide')
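To see how far the largest observations sit from the rest, it can help to list them next to the gap that separates them from the next value. A minimal sketch; the helper name `top_values` and the toy vector are hypothetical and not part of the lab.

```r
# Hypothetical helper: return the k largest values of x, largest first.
top_values <- function(x, k = 3) {
  head(sort(x, decreasing = TRUE), k)
}

# Toy vector for illustration only (NOT taken from the wine data set).
toy <- c(14, 7, 21, 60, 15, 65, 3, 58)

top3 <- top_values(toy)
# Gap between the smallest of the top 3 and the next value down.
gap <- min(top3) - sort(toy, decreasing = TRUE)[4]
print(top3)
print(gap)
```

On the real data the call would be `top_values(rw_df$'free sulfur dioxide', 3)`.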

7 - total sulfur dioxide

The top 2 values, both above 250, are distant from the others and are evident outliers. The distribution is skewed to the right.

hist(rw_df$'total sulfur dioxide')

boxplot(rw_df$'total sulfur dioxide')

8 - density

Surprisingly, the distribution is not skewed and looks like a Normal distribution. There are no evident outliers.

hist(rw_df$'density')

boxplot(rw_df$'density')

9 - pH

This one is very tricky. The lowest value, 2.74, should be treated as an outlier,
while the highest value, 4.01, can be realistic. The distribution looks Normal, with no visible skew.

hist(rw_df$'pH')

boxplot(rw_df$'pH')

10 - sulphates

The top 2 values, both above 1.5, are distant from the others and could be outliers. The distribution is skewed to the right.

hist(rw_df$'sulphates')

boxplot(rw_df$'sulphates')

11 - alcohol

The top value, 14.9, is distant from the others and could be an outlier. The distribution is skewed to the right.

hist(rw_df$'alcohol')

boxplot(rw_df$'alcohol')
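The histogram and boxplot calls above repeat the same pattern for every variable. As a sketch of one way to avoid the repetition, the helper below loops over the columns of a data frame and draws both plots side by side; the function name `plot_distributions` is a hypothetical choice, and only base R graphics functions are used.

```r
# Hypothetical helper: histogram and boxplot side by side for each column.
plot_distributions <- function(df) {
  old_par <- par(mfrow = c(1, 2))  # two panels per variable
  on.exit(par(old_par))            # restore the previous layout on exit
  for (name in names(df)) {
    hist(df[[name]], main = name, xlab = name)
    boxplot(df[[name]], main = name, horizontal = TRUE)
  }
  invisible(names(df))             # return the plotted column names
}
```

On the real data the call would be `plot_distributions(rw_df)`. Indexing with `df[[name]]` also avoids quoting column names that contain spaces.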

Wine Project - part A - With R Notebook

Vlad Lushpin