1. Load and Check Data

This step checks the variable types and the number of observations. The data frame has 13 columns: a row index X, 11 physicochemical variables that might be related to wine quality, and quality as the output variable.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The quality column ranges from 3 to 9, with 9 as the highest quality level observed.
We need to find out what the other 11 variables represent and try to work out their potential relationship with quality. First we check the dataset's documentation of the attributes.

White Wine Attributes Information

Input variables (based on physicochemical tests) & attributes description:

  1. fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  2. volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

  3. citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  4. residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

  5. chlorides (sodium chloride - g / dm^3): the amount of salt in the wine

  6. free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content

  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  10. sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  11. alcohol (% by volume): the percent alcohol content of the wine

Output variable (based on sensory data):

  1. quality (score between 0 and 10)
##                    X        fixed.acidity     volatile.acidity 
##                    0                    0                    0 
##          citric.acid       residual.sugar            chlorides 
##                    0                    0                    0 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                    0                    0                    0 
##                   pH            sulphates              alcohol 
##                    0                    0                    0 
##              quality 
##                    0

Based on the counts above, there are no missing values in the dataset, so we probably won’t need to clean the data.
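Per-column counts like the ones above can be produced by counting NA cells column by column. A minimal sketch in base R, assuming a tiny made-up data frame (`toy`, with one NA injected on purpose) rather than the real `wine` data:

```r
# Hypothetical toy data: three wines, one missing pH value injected on purpose
toy <- data.frame(
  pH      = c(3.00, NA, 3.26),
  alcohol = c(8.8, 9.5, 10.1)
)

# is.na() marks each missing cell; colSums() counts them per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column has any missing value
all_complete <- all(na_counts == 0)
```

On the wine data every count is 0, which is why no cleaning step is needed there.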

We won’t need the X variable (it is only a row index), so we drop it:

wine$X <- NULL

Next, we need to find the correlation between each variable and quality.

Univariate Plots Section

Wine Quality

First we take a look at the distribution of white wine quality.

##   quality freq
## 1       3   20
## 2       4  163
## 3       5 1457
## 4       6 2198
## 5       7  880
## 6       8  175
## 7       9    5
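As a quick consistency check, the frequency table above can be expanded back into one score per wine and compared with the earlier summary. A small sketch in base R; the variable names are introduced here for illustration only:

```r
# Counts per quality score, copied from the frequency table above
quality_levels <- 3:9
quality_freq   <- c(20, 163, 1457, 2198, 880, 175, 5)

# Rebuild one score per wine from the counts
quality_scores <- rep(quality_levels, times = quality_freq)

n_wines      <- length(quality_scores)   # total number of wines
mean_score   <- mean(quality_scores)
median_score <- median(quality_scores)

# Share of wines scoring 5, 6 or 7
share_mid <- mean(quality_scores %in% 5:7)
```

The counts add up to 4898 wines with a median of 6 and a mean that rounds to 5.878, in agreement with the summary output. About 93% of the wines score 5, 6 or 7, so the extreme quality levels are rare.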

The dataset has 11 variables that might influence the quality of the wine. We will take a look at fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide and total sulfur dioxide, together with the remaining variables.

We will divide the variable analysis into 4 blocks:

  1. Acidity: includes fixed acidity, volatile acidity, citric acid and pH
  2. Sugar & Salt: includes residual sugar, chlorides and sulphates (sulphates are salts of sulfuric acid)
  3. Sulfur: includes free sulfur dioxide and total sulfur dioxide
  4. Alcohol & Density

Fixed Acidity, Volatile Acidity and Citric Acidity, pH

# Histogram helper: pink bars with black outlines and a caller-chosen bin width
create_hist <- function(varname, binwidth) {
  ggplot(aes(x = varname), data = wine) +
    geom_histogram(binwidth = binwidth, fill = "#FF9999", colour = "black")
}

As the charts above show, fixed acidity, volatile acidity and citric acid are roughly normally distributed but slightly right-skewed: most values sit toward the low end, with a few high-acidity outliers forming a tail on the right. The pH variable is approximately normal, with most values between 2.8 and 3.5, which means all the wines are acidic (pH < 7).

To check these variables for outliers, we will later look at boxplots of them together with the other variables.

Residual Sugar, Chlorides & Sulphates

Here we take a look at sugar and salt.

grid.arrange(ggplot(wine, aes( x = 1 , y = residual.sugar ) ) + 
               geom_jitter(alpha = 0.1 ) +
               geom_boxplot(alpha = 0.2, color = 'red' ) ,
             ggplot(wine, aes( x   = residual.sugar  ) ) + 
                   geom_histogram(bins=30 ),ncol=2)

We can see that residual sugar and chlorides are heavily right-skewed (most values are low, with a long tail of high values): most wines have very little residual sugar (1-2 g/dm^3) and chlorides of around 0.04-0.05 g/dm^3. Sulphates are also somewhat right-skewed, with most values around 0.4-0.5 g/dm^3.

The skew in the residual sugar and chloride distributions can be reduced by plotting them on a log10 scale.

RS2 <- ggplot(aes(x=residual.sugar),data=wine) +
  geom_histogram()+
  scale_x_log10( breaks = scales::trans_breaks("log10", function(x) 10^x),
   labels = scales::trans_format("log10", scales::math_format(10^.x))
 ) +
  ggtitle("Residual Sugar") +
  xlab("Log 10 Residual Sugar") +
  ylab("Count")
  
 

CL2 <-ggplot(aes(x=chlorides),data=wine) +
  geom_histogram()+
  scale_x_log10( breaks = scales::trans_breaks("log10", function(x) 10^x),
   labels = scales::trans_format("log10", scales::math_format(10^.x))
 ) +
  ggtitle("Chlorides") +
  xlab("Log 10 Chlorides") +
  ylab("Count")

  

grid.arrange(RS2, CL2, ncol =1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
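To illustrate why a log10 scale helps with a long right tail, the sketch below compares the sample skewness of a variable before and after the transform. It uses simulated log-normal values, not the wine data, and the `skewness` helper is defined here for illustration:

```r
# Simulated right-skewed data (NOT the wine data): log-normal values
set.seed(42)
x <- rlnorm(5000, meanlog = 1.5, sdlog = 0.8)

# Sample skewness: the third standardized moment
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / (mean(d^2))^1.5
}

skew_raw <- skewness(x)          # clearly positive: long right tail
skew_log <- skewness(log10(x))   # close to 0 after the transform
```

A positive skewness indicates a right tail. A value near zero after the transform is what makes the log-scaled histogram look closer to symmetric.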

Sulfurs & Density

All the other concentration variables are in g/dm^3, but free sulfur dioxide, total sulfur dioxide and density use different units, so we check whether they need converting to g/dm^3.

Current variable units:

  • free sulfur dioxide (mg / dm^3)
  • total sulfur dioxide (mg / dm^3)
  • density (g / cm^3)

Convert the two sulfur dioxide variables to g/dm^3 by dividing by 1000.

Density is left as it is: 1 g/cm^3 equals 1 kg/dm^3 (1000 g/dm^3), so converting it would only multiply every value by 1000.

Drop the pre-conversion variables.
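The conversion and clean-up steps can be sketched as follows. This is an illustration on the first four rows shown by `str()` earlier, held in a small stand-in data frame called `so2` (a name introduced here); the target column names `Free_SO2` and `Total_SO2` are the ones that appear in later output:

```r
# First four rows of the two sulfur dioxide columns, as shown by str() earlier
so2 <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# mg/dm^3 -> g/dm^3: divide by 1000
so2$Free_SO2  <- so2$free.sulfur.dioxide / 1000
so2$Total_SO2 <- so2$total.sulfur.dioxide / 1000

# Drop the pre-conversion columns
so2$free.sulfur.dioxide  <- NULL
so2$total.sulfur.dioxide <- NULL
```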

Outlier Analysis

We can see that all the variables have outliers. Only residual sugar and density have very few; the rest, which relate to acid and sulphate (fixed, volatile and citric acidity, chlorides, pH, free SO2 and total SO2), have many outliers.
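Boxplots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. A short sketch of that rule, applied to the fixed acidity quartiles from the summary output; `iqr_fences` is a helper introduced here, and a plotting library may compute its hinges slightly differently from `summary()`:

```r
# Boxplot fences: points beyond k * IQR from the quartiles count as outliers
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles of fixed.acidity from the summary: 1st Qu. 6.3, 3rd Qu. 7.3
fa_fences <- iqr_fences(6.3, 7.3)

# The reported minimum (3.8) and maximum (14.2) both fall outside the fences
min_is_outlier <- 3.8 < fa_fences[["lower"]]
max_is_outlier <- 14.2 > fa_fences[["upper"]]
```

For fixed acidity the fences come out at 4.8 and 8.8, so both extremes of the variable are flagged.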

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 775            9.1             0.27        0.45           10.6     0.035
## 821            6.6             0.36        0.29            1.6     0.021
## 828            7.4             0.24        0.36            2.0     0.031
## 877            6.9             0.36        0.34            4.2     0.018
## 1606           7.1             0.26        0.49            2.2     0.032
##      density   pH sulphates alcohol quality Free_SO2 Total_SO2
## 775  0.99700 3.20      0.46    10.4       9    0.028     0.124
## 821  0.98965 3.41      0.61    12.4       9    0.024     0.085
## 828  0.99055 3.28      0.48    12.5       9    0.027     0.139
## 877  0.98980 3.28      0.36    12.7       9    0.057     0.119
## 1606 0.99030 3.37      0.42    12.9       9    0.031     0.113

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wines in this data set with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality).

The following observations are made from dataSet:

  • Fixed acidity has a wide range, from a minimum of 3.8 to a maximum of 14.2, while volatile acidity ranges from 0.08 to 1.1 and citric acid from 0 to 1.66 g/dm^3.
  • The alcohol content varies from 8.00 to 14.20.
  • The quality of the samples ranges from 3 to 9, with a median of 6 and a mean of around 5.88.
  • pH varies from 2.720 to 3.820, with a median of 3.180 and a mean of 3.188.

I am interested in taking a closer look at residual sugar. Low-quality white wines tend to have high residual sugar, but some quality-9 samples also have high residual sugar, so I want to examine this variable and its influence on quality more closely.

What other features in the dataset do you think will help support your

Another feature that might help is the age of the wine. We often hear that older wines have better quality, but this hypothesis should be tested as well.

Did you create any new variables from existing variables in the dataset?

I didn’t create any new variables from existing ones; I only converted some to the same units. I converted free.sulfur.dioxide and total.sulfur.dioxide from mg/dm^3 into Free_SO2 and Total_SO2 in g/dm^3.

Of the features you investigated, were there any unusual distributions?

I changed quality into a factor, so that we can split the plots into facets by quality.

Unusual distributions can be found in residual sugar and chlorides, which are heavily right-skewed. The skew can be reduced by plotting them on a log10 scale.

Bivariate Plots Section

Relationship between variables

Quality Correlation

As we can see above, quality is weakly negatively correlated with fixed acidity (-0.11), volatile acidity (-0.19), chlorides (-0.21) and density (-0.31), and moderately positively correlated with alcohol (0.44).

Among the input variables, we can see a strong correlation between density and residual sugar (positive) and between density and alcohol (negative). We should probably drop density if we want to predict the quality of other white wines, since it is strongly correlated with other predictors.

Taking a closer look at the relationships between variables

Density

##                         [,1]
## fixed.acidity     0.26533101
## volatile.acidity  0.02711385
## citric.acid       0.14950257
## residual.sugar    0.83896645
## chlorides         0.25721132
## density           1.00000000
## pH               -0.09359149
## sulphates         0.07449315
## alcohol          -0.78013762
## quality          -0.30712331
## Free_SO2          0.29421041
## Total_SO2         0.52988132

From both scatterplots we can see that wines with high alcohol tend to have lower density, and wines with high residual sugar tend to have higher density. There is a strong positive relationship between density and residual sugar (0.83896645) and a strong negative relationship between density and alcohol (-0.78013762).
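A one-column correlation table like the one above is what `cor()` returns when a data frame is correlated against a single vector. A minimal sketch using only the five rows printed in the Outlier Analysis section, so the values will not match the full-data figures; `nine` is a name introduced here:

```r
# Five rows copied from the listing in the Outlier Analysis section
# (a tiny subset, not the full dataset)
nine <- data.frame(
  residual.sugar = c(10.6, 1.6, 2.0, 4.2, 2.2),
  alcohol        = c(10.4, 12.4, 12.5, 12.7, 12.9),
  density        = c(0.99700, 0.98965, 0.99055, 0.98980, 0.99030)
)

# Correlate every column with density; the result is a one-column matrix
density_cor <- cor(nine, nine$density)
print(density_cor)
```

Even in this small subset the signs agree with the full-data result: residual sugar is positively correlated with density and alcohol negatively.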

QUALITY

For chlorides and the acidity variables, the correlation with quality is weakly negative: lower levels of chlorides, volatile acidity, fixed acidity and density tend to go with slightly higher quality.

Alcohol tends to have a positive relationship with quality.

##                          [,1]
## fixed.acidity    -0.113662831
## volatile.acidity -0.194722969
## citric.acid      -0.009209091
## residual.sugar   -0.097576829
## chlorides        -0.209934411
## density          -0.307123313
## pH                0.099427246
## sulphates         0.053677877
## alcohol           0.435574715
## quality           1.000000000
## Free_SO2          0.008158067
## Total_SO2        -0.174737218
## 
## Call:
## lm(formula = quality ~ ., data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.502e+02  1.880e+01   7.987 1.71e-15 ***
## fixed.acidity     6.552e-02  2.087e-02   3.139  0.00171 ** 
## volatile.acidity -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## citric.acid       2.209e-02  9.577e-02   0.231  0.81759    
## residual.sugar    8.148e-02  7.527e-03  10.825  < 2e-16 ***
## chlorides        -2.473e-01  5.465e-01  -0.452  0.65097    
## density          -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## pH                6.863e-01  1.054e-01   6.513 8.10e-11 ***
## sulphates         6.315e-01  1.004e-01   6.291 3.44e-10 ***
## alcohol           1.935e-01  2.422e-02   7.988 1.70e-15 ***
## Free_SO2          3.733e+00  8.441e-01   4.422 9.99e-06 ***
## Total_SO2        -2.857e-01  3.781e-01  -0.756  0.44979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16

Bivariate Analysis

In the bivariate analysis we focus on two variables, quality and density. We pick quality as the main feature because we want to find out what makes a good-quality white wine, and we look more closely at density because it is strongly correlated with several other independent variables in the dataset.

Talk about some of the relationships you observed in this part

Quality: there are weak negative correlations between quality and fixed acidity (-0.11), volatile acidity (-0.19), chlorides (-0.21) and density (-0.31).

There is also a moderate positive correlation between quality and alcohol (0.44).

Did you observe any interesting relationships between the other features

As mentioned, the relationship between residual sugar and density is strongly positive (0.83896645), and the relationship between alcohol and density is strongly negative (-0.78013762). This makes sense: more alcohol in the wine lowers its density, and more sugar raises it.

What was the strongest relationship you found?

The strongest relationship I found was: density with residual sugar (0.83896645) and density with alcohol(-0.78013762)

Multivariate Plots Section

quality.cut <- cut(wine$quality, breaks = c(0,4,8,10))
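To make the grouping concrete: `cut()` with these breaks produces three right-closed intervals. The sketch below applies the same breaks to scores rebuilt from the quality frequency table in the univariate section; `scores` and `groups` are illustrative names:

```r
# One score per wine, rebuilt from the quality frequency table
scores <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Same breaks as above: (0,4] low, (4,8] middle, (8,10] high
groups <- cut(scores, breaks = c(0, 4, 8, 10))

# Number of wines in each group
group_sizes <- table(groups)
print(group_sizes)
```

The groups are very unbalanced: 183 wines fall in the low group, 4710 in the middle group and only 5 in the high group, which is worth keeping in mind when reading the colored scatterplots.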

We can see that high alcohol combined with low residual sugar shows an occurrence of high-quality wine, but this is a very weak relationship.

ggplot(wine, aes(x = alcohol, y = pH , color = quality.cut)) +
  coord_cartesian(
      xlim = c(quantile(wine$alcohol, .01), quantile(wine$alcohol, .99)),
      ylim = c(quantile(wine$pH, .01), quantile(wine$pH, .99))
      ) +
  geom_jitter(alpha = 1, size = 1.5) +
  scale_color_brewer(palette = "Set1") +
  theme_dark() +
  ggtitle("Alcohol vs pH") +
  xlab("Alcohol") +
  ylab("pH")

ggplot(aes(x = fixed.acidity, 
           y = density   ), 
       data = wine) +
      geom_point(alpha = 0.1, size = 0.5) +
      geom_smooth(method = "lm", se = FALSE,size=1)

ggplot(aes(factor(quality), 
            alcohol), 
        data = wine) +
  geom_jitter( alpha = .3)  +
  geom_boxplot( alpha = .5,color = 'blue')+
  stat_summary(fun = "mean", 
               geom = "point", 
               color = "red", 
               shape = 8, 
               size = 4)

The green dots show that the combination of higher pH and higher alcohol yields a few occurrences of high-quality white wine. The influence is not strong, but these variables have some bearing on white wine quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part

Alcohol and pH strengthen each other and produce occurrences of high-quality wine.

Were there any interesting or surprising interactions between features?

The interesting interaction is between alcohol and residual sugar, which have negative and positive relationships with density respectively; their combination produces no occurrences of high-quality wine.

OPTIONAL: Did you create any models with your dataset? Discuss the
No


Final Plots and Summary

Plot One

Description One

This plot shows that alcohol has a moderate positive relationship with quality.

Plot Two

Description Two

This shows that fixed and volatile acidity, chlorides and density have weak negative relationships with quality.

Plot Three

Description Three

The green dots show that the combination of higher pH and higher alcohol yields a few occurrences of high-quality white wine. The influence is not strong, but these variables have some bearing on white wine quality.

Reflection

Struggles: sometimes I wanted to use a variable such as quality as a number in a plot, but couldn’t when quality was formatted as a factor. On the other hand, I couldn’t get faceting to work when quality was a number or integer, so I had to keep switching the variable's type.
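One possible way around the type switching is to keep the numeric column and add a separate factor copy. The sketch below uses a few illustrative scores instead of the real column, and `quality_f` is a hypothetical name. It also shows a common pitfall when converting a factor back to numbers:

```r
# Illustrative scores; in the report this would be wine$quality
quality <- c(6L, 3L, 9L, 6L)

# Keep the numeric column and add a factor copy for faceting and grouping
quality_f <- factor(quality)

# Pitfall: as.integer() on a factor returns level codes, not the scores
codes <- as.integer(quality_f)

# Go through the labels to recover the original scores
recovered <- as.integer(as.character(quality_f))
```

Here `codes` is 2, 1, 3, 2 (positions among the levels 3, 6, 9), while `recovered` gives back 6, 3, 9, 6.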

Another struggle is that I couldn’t get a variance inflation factor (VIF) library to work to quantify the severity of multicollinearity in the regression analysis.
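If a VIF package is not available, the quantity can also be computed by hand in base R, since the VIF of predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing that predictor on all the others. A minimal sketch on simulated predictors (not the wine data); `vif_manual` is a helper written for this illustration:

```r
# Simulated predictors (NOT the wine data): x2 is built to track x1 closely
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # nearly collinear with x1
x3 <- rnorm(n)                        # independent of the others
preds <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(preds)
print(round(vifs, 2))
```

Applied to the wine predictors (with quality excluded), a large value for density would be consistent with its strong correlations with residual sugar and alcohol. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.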

Future work would be building a multiple regression model to predict the quality of white wine.