Analysis of White Wine Quality

Introduction

This dataset I use for analysis was created by Paulo Cortez (University of Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (2009), which is availabe puplic for reseach purpose. The dataset is about the physicochemical variables that affects the quality of the Portuguese “Vinho Verde” White wine, more description about the dataset can be found on this link.

# Load all of the packages that to be used for the analysis

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(ggthemes)
library(tibble)
library(knitr)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(reshape2)
library(GGally)

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

library(GGally)
library(scales)
library(memisc)

## Loading required package: lattice

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## 
## Attaching package: 'memisc'

## The following object is masked from 'package:scales':
## 
##     percent

## The following objects are masked from 'package:dplyr':
## 
##     collect, query, recode, rename

## The following objects are masked from 'package:stats':
## 
##     contrasts, contr.sum, contr.treatment

## The following object is masked from 'package:base':
## 
##     as.array

library(lattice)
library(MASS)
library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:memisc':
## 
##     recode

## The following object is masked from 'package:dplyr':
## 
##     recode

library(reshape)

## 
## Attaching package: 'reshape'

## The following object is masked from 'package:memisc':
## 
##     rename

## The following objects are masked from 'package:reshape2':
## 
##     colsplit, melt, recast

## The following object is masked from 'package:dplyr':
## 
##     rename

library (plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:reshape':
## 
##     rename, round_any

## The following object is masked from 'package:memisc':
## 
##     rename

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

Load the Dataset

getwd()

## [1] "/home/banito/Downloads/project4"

setwd('/home/banito/Downloads/project4')
wht_wine_quality<-read.csv('wineQualityWhites.csv')
head(wht_wine_quality)

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

names(wht_wine_quality)

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

The dataset consist of 4898 observation with 12 varaibles and the dataset doesn’t have any missing values # Univariate Plots Section. In this section we will have some insights about the dataset by simply plotting the varaibles and see thier distibutions.The fixed.acidity is the none-volitile acid amount found on wine, it have maximum and minimum of 3.8 and 14.2 in units of (g/dm^3). To make the plot more readable and remove some outliers in the right side of the plot, 99 percentile is used to limit the upper bound of the xlimit. The histogram now look more a normal distribution.

## Fuction for plotting

# Make summary of attribute varaible 
summary(wht_wine_quality$fixed.acidity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

# Histogram using 'ggplot'function
ggplot(aes(x = fixed.acidity), data = wht_wine_quality) + 
  geom_histogram(binwidth =  0.2,color='black', fill='#FFCC67') + 
  xlim(3.8, 14.2)

## Warning: Removed 1 rows containing missing values (geom_bar).

# limit the upper value of x-axis to its 95 perecentile
ggplot(aes(x = fixed.acidity), data = wht_wine_quality) +
  geom_histogram(binwidth =  0.2,color='black', fill='#FFCC67') + xlim(3.8,    quantile(wht_wine_quality$fixed.acidity,0.99))

## Warning: Removed 45 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

The volatile.acidity indicates the amount of acetic acid in the wine, its content ranges from 0.08 to 1.1 in gm/dm^3. Let us plot the histogram for volitle acidity found in white wine for the whole observation dataset. The amount of acetic acid in the wine is mostly bound between 0.1 and 0.8 as depicted by the histogram

# Make summary of attribute varaible
summary(wht_wine_quality$volatile.acidity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

# Histogram using 'gglot'function
ggplot(aes(x =volatile.acidity), data = wht_wine_quality) + scale_x_continuous()

# Limit the x-axis to fit the data
ggplot(aes(x = citric.acid), data = wht_wine_quality) + 
  geom_histogram(binwidth =  0.02,color='black', fill='#FFCC67') +
  xlim(0, quantile(wht_wine_quality$citric.acid, 0.99))

## Warning: Removed 22 rows containing non-finite values (stat_bin).

Citric acid: found in small quantities can add ‘freshness’ and flavor to wines. The amount of citric acid in g/dm^3 ranges from 0 to 1.66. The historgram more or less exibits normal distribution.

# Make summary of attribute varaible
summary(wht_wine_quality$citric.acid)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

# plotting histogram and
# limit the upper value of xlim to a 99 percentile

ggplot(aes(x = citric.acid), data = wht_wine_quality) + 
  geom_histogram(binwidth =  0.02,color='black', fill='#FFCC67') + 
  xlim(0, quantile(wht_wine_quality$citric.acid,0.99))

## Warning: Removed 22 rows containing non-finite values (stat_bin).

Residual sugar is the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.The maximum and minimum residual sugar found in the wine is 0.6 and 65 respectively. The residual.sugar varaible is a tailed distribution which is skwed to the right. As the distribution of residual sugar is very skwed scale_x_log10 function is used for clearer view.

# Make summary of attribute varaible
summary(wht_wine_quality$residual.sugar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

# Histogram using 'gglot'function
# Limit the x-axis to reflect the range of values 
ggplot(aes(x = residual.sugar), data = wht_wine_quality) + 
  geom_histogram(binwidth =  .05,color='black', fill='#FFBBAA') +
scale_x_log10(lim = c(min(wht_wine_quality$residual.sugar), quantile(wht_wine_quality$residual.sugar, 0.95)), 
                  breaks = seq(min(wht_wine_quality$residual.sugar), 
                     quantile(wht_wine_quality$residual.sugar, 0.95), 1))

## Warning: Removed 240 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

Chlorides indicates the amount of salt (Sodium Chloride , g/dm^3) presence in the wine. In the dataset the value ranges from 0.009 to 0.34. The histogram for Chloride depicts that the distribution exibits normal. The 95 perentile is used here to remove outlies.

# Make summary of attribute varaible
summary(wht_wine_quality$chlorides)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

# Histogram using 'qplot'function
qplot(x = chlorides, data = wht_wine_quality) + scale_x_continuous()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Limit the x-axis to fit the data
ggplot(aes(x = chlorides), data = wht_wine_quality) + geom_histogram(binwidth =  .001,color='black', fill='#09BBAA') + 
  xlim(0, quantile(wht_wine_quality$chlorides,0.95))

## Warning: Removed 237 rows containing non-finite values (stat_bin).

Free sulfur dioxide is the free form of SO2 (in mg / dm^3) that exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. Its value ranges from 2.0 to 289. The histogram indicates normal distribution for sulfur dioxide. I used the 95 percentile for a clearer veiw and removing of outliers.

# Make summary of attribute varaible
summary(wht_wine_quality$free.sulfur.dioxide)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

# Histogram using 'qplot'function
qplot(x = free.sulfur.dioxide, data = wht_wine_quality)+ scale_x_continuous()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Limit the x-axis to fit the data , the 95 percentile is used to 
# limit the upper value
ggplot(aes(x = free.sulfur.dioxide), data = wht_wine_quality) + 
  geom_histogram(binwidth =  2,color='black', fill='#09BBAA') +  xlim(0,quantile(wht_wine_quality$free.sulfur.dioxide,0.95))

## Warning: Removed 237 rows containing non-finite values (stat_bin).

The total sulfur dioxide is the sum of amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. Summary for the variable shows the maximum and minimum value is 9.0 to 440 in mg/dm^3 respectively. The histogram dipicts closer to normal distribution.

# Make summary of attribute varaible
summary(wht_wine_quality$total.sulfur.dioxide)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

# Histogram using 'qplot'function
qplot(x = total.sulfur.dioxide, data = wht_wine_quality) + scale_x_continuous()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Limit the x-axis to fit the data, used the 99 percentile for better view
ggplot(aes(x = total.sulfur.dioxide), data = wht_wine_quality) + 
  geom_histogram(binwidth =  2,color='black', fill='#079CDF') + xlim(0,300)

## Warning: Removed 6 rows containing non-finite values (stat_bin).

The density of the wine range between 0.9871 and 1.039 in g/cm^3, the histogram depicts normal distibution of the density varaible.

# Make summary of attribute varaible
summary(wht_wine_quality$density)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

# Histogram using 'qplot'function
qplot(x = density, data = wht_wine_quality) + scale_x_continuous()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Limit the x-axis to fit the data
ggplot(aes(x = density), data = wht_wine_quality) + geom_histogram(binwidth =  0.0001,color='black', fill='#079CDF') + xlim(0.9871,1.01)

## Warning: Removed 3 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

The pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale. The value for pH in the observation ranges from 2.720 to 3.82. The histogram exibihit normal distrubution for pH.

# Make summary of attribute varaible
summary(wht_wine_quality$pH)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

# Histogram using 'qplot'function
qplot(x = pH, data = wht_wine_quality) + scale_x_continuous()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Limit the x-axis to fit the data
ggplot(aes(x = pH), data = wht_wine_quality) + 
  geom_histogram(binwidth =  0.01,color='black', fill='#FFCC67') + 
  xlim(2.7, 3.85)

## Warning: Removed 1 rows containing missing values (geom_bar).

Sulphates is a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant. The amount of sulphates in the wine ranges from 0.22 to 1.08.The histogram shows close to normal distibution of Sulphate content in the wine. To limit the x-axis to fit the data ,I applied 95 percentile to see clear picture.

# Make summary of attribute varaible
summary(wht_wine_quality$sulphates)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

# Histogram using 'ggplot'function

ggplot(aes(x = sulphates), data = wht_wine_quality) + 
  geom_histogram(binwidth =  0.01,color='black', fill='#FFCC67') + 
  xlim(0.22, quantile(wht_wine_quality$sulphates,.95))

## Warning: Removed 229 rows containing non-finite values (stat_bin).

The percentage alcohol content of the wine in ppm (parts per million) is the other input varaible, it varies from 8.0 to 14.2. The histogram show a distribution skwed to the right.

# Make summary of attribute varaible
summary(wht_wine_quality$alcohol)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

# Histogram using 'qplot'function
qplot(x = alcohol, data = wht_wine_quality) + scale_x_continuous()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram using 'ggplot'and set xlim
ggplot(aes(x = alcohol), data = wht_wine_quality)  + geom_histogram(binwidth =  0.05,color='black', fill='#FFCC67')  + xlim(8, 14.5)

## Warning: Removed 1 rows containing missing values (geom_bar).

The last attribute varaibale in the dataset is the quality which is the output variable that is affected by the combination of the other input varaible. The volatile.acidity indicates the amount of acetic acid in the wine, its content ranges from 0.08 to 1.1 in gm/dm^3. Let us plot the histogram for volitle acidity found in white wine for the whole observation dataset. The amount of acetic acid in the wine is mostly bound between 0.1 and 0.8 as depicted by the histogram

# Make summary of attribute varaible
summary(wht_wine_quality$quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

# Histogram using 'qplot'function
ggplot(wht_wine_quality, aes(x=as.factor(quality))) + geom_bar()

Univariate Analysis

What is the structure of your dataset?

The dataset contains 4898 observations with 12 variables generally, 11 input variables based on physicochemical tests which affect wine quality and 1 output variable which is the quality of the wine. The input varaibles are:- fixed acidity (tartaric acid - g / dm^3), volatile acidity (acetic acid - g / dm^3), citric acid (g / dm^3), residual sugar (g / dm^3), chlorides (sodium chloride - g / dm^3, free sulfur dioxide (mg / dm^3), total sulfur dioxide (mg / dm^3), density (g / cm^3), pH, sulphates (potassium sulphate - g / dm3), alcohol (% by volume) and the output varaible based on sensor data is thw qaulity score between 0 and 10.At least 3 wine experts rated the quality of each wine, with 0 as the lowest rating and 10 as the highest rating.

What is/are the main feature(s) of interest in your dataset?

The quality of the white wine is what ulimately matters the most, all the other inputs combination is to prouce a certain kind of test which is attributed to quality of the wine. Most wine quality is concentrated in the cagegories 5, 6, 7, small amount of whitle wine falls to categores 3, 4, 8 and 9 and not in the categories 1,2, and 10.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

In the univariate plot section, I have done histogram for all varaibles in the dataset and studied the distibution, most of the varariables are close to normal distributions. Alcohol and residual sugar exibits distribution skwed to the right. Volitile.acidity and citric acid show some irregularies.

Did you create any new variables from existing variables in the dataset?

I didn’t create any new varaible here ### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When plotting histograms of the attribute variables, I applied the xlim and the 99 or 95 percentile (quantile function) to limit the upper x-axis value to remove outlier and for better visualization and to see clearer distibutio of that varaible.

Bivariate Plots Section

This dataset contains 12 varaibles and 11 of them physicochemical input varaible that impact the test or quality of the white wine. So below I attempt to see how these input varaibles are related with the output varaible (quality) using boxplot plotting each input varaible againest the quality of the white wine. And the relation between quality and the other input varibles is quantified by computing the cooralation

The correlation coefficient between alcohol and quality of white wine is 0.4355747, that shows there is postive correlation. The boxplot dipict that most of the quality of wine is in the observation are in 5,6, and 7 quality level. The boxplot also displays outlier in the dataset. The linear model also show there exists a positve correlation between alchol and quality.

cor.test(wht_wine_quality$quality, wht_wine_quality$alcohol,method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

ggplot(aes(factor(quality), alcohol), data = wht_wine_quality) +
  geom_boxplot()+
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,alcohol), method = 'lm', color = 'red') + 
  xlab('quality') + 
  ylab('alcohol') +
  ggtitle('Quality vs alcohol of White Wine')

The correlation coefficient is -0.1136628, which show a negative correrlation between quality and fixed.acidity. The box plot show outliers in the dataset for most of the quality levels. The linear model also show a negarive correlation between fixed.acidity and quality. Quality levels 5, 6 and 7 shows that most observation in this level contained fixed acid between 4 and 9.

cor.test(wht_wine_quality$quality, wht_wine_quality$fixed.acidity, 
         method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$fixed.acidity
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14121974 -0.08592991
## sample estimates:
##        cor 
## -0.1136628

ggplot(aes(factor(quality), fixed.acidity), data = wht_wine_quality) +
  geom_boxplot()+
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,fixed.acidity), method = 'lm', color = 'red')

The is a very weak or no correlation between citric.acid and quality, the correaltion cofficient is given as -0.009209091. Boxplots for this case show that most observation in white wine contains citric acid between 0.2 and 0.75 and there are also many outlier in the dataset.

cor.test(wht_wine_quality$quality, wht_wine_quality$citric.acid, 
         method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$citric.acid
## t = -0.6444, df = 4896, p-value = 0.5193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03720595  0.01880221
## sample estimates:
##          cor 
## -0.009209091

ggplot(aes(factor(quality), citric.acid), data = wht_wine_quality) +
  geom_boxplot()+
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,citric.acid), method = 'lm', color = 'red')

Negative correlation exists between quality and residual.sugar. Though the redisual sugar goes as high as 60 , most observation show that the residual.sugar is below 20. The linear model show a nagative correaltions between quality and residual.sugar. The coord_cartesian fucntion is used to remove outlies fro a better view of the plot.

cor.test(wht_wine_quality$quality, wht_wine_quality$residual.sugar, 
         method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$residual.sugar
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12524103 -0.06976101
## sample estimates:
##         cor 
## -0.09757683

ggplot(aes(factor(quality), residual.sugar), data = wht_wine_quality) +
  geom_boxplot() + coord_cartesian(ylim = c(0,20)) +
  geom_jitter(alpha = 0.05)  + 
  geom_smooth(aes(quality-2,residual.sugar), method = 'lm', color = 'red')

Negative correaltion between quality and chlorides with correaltion cofficient of -0.2099344. Many observation for chloride are fall into outlies, the chloride amount in the dataset is below 0.1 for most of the observations.

cor.test(wht_wine_quality$quality, wht_wine_quality$chlorides, 
         method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344

ggplot(aes(factor(quality), chlorides), data = wht_wine_quality) +
  geom_boxplot()+
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,chlorides), method = 'lm', color = 'red')

A weak positive correlation between free.surfur.dioxide and quality.The boxplot dipicts that for most of the observation th free.sulfur.dioxide is below 100 ,though it goes as high as 300 for some observation which is an outlier.

cor.test(wht_wine_quality$quality, wht_wine_quality$free.sulfur.dioxide, 
         method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$free.sulfur.dioxide
## t = 0.57085, df = 4896, p-value = 0.5681
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01985292  0.03615626
## sample estimates:
##         cor 
## 0.008158067

ggplot(aes(factor(quality), free.sulfur.dioxide), data = wht_wine_quality) +
  geom_boxplot()+
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,free.sulfur.dioxide), method = 'lm', color = 'red')

Negative correlation between total.sulfur.dioxide and quality. The correlation cofficient between quality and total.sulfur.dioxide is 0.00815 which is a very weak relation. Event though, the values goes as high as 400, the boxplot show that most observations the amount of total sulfer dioxide is below 250.

cor.test(wht_wine_quality$quality, wht_wine_quality$total.sulfur.dioxide, 
         method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2017563 -0.1474524
## sample estimates:
##        cor 
## -0.1747372

ggplot(aes(factor(quality), total.sulfur.dioxide), data = wht_wine_quality) +
  geom_boxplot()+ geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,total.sulfur.dioxide), method = 'lm',color = 'red')

Still the correaltion between density and quality is negative with a correlation coefficient of -0.3071233. From the boxplot we can infer that in most of the observations the density of the wine is below 1, which is less than the density of water. There are also some outlier of density observation for while quality level of 6.

cor.test(wht_wine_quality$quality, wht_wine_quality$density, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233

ggplot(aes(factor(quality), density), data = wht_wine_quality) +
  geom_boxplot()+
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,density), method = 'lm', color = 'red')

Positive correaltion between quality and pH of white wine with correlation cofficient of 0.09942725. The boxplot show that there are many outlier and the pH of most observation falls between 3 and 3.3 which is acidic catagory.

cor.test(wht_wine_quality$quality, wht_wine_quality$pH, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$pH
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07162022 0.12707983
## sample estimates:
##        cor 
## 0.09942725

ggplot(aes(factor(quality), pH), data = wht_wine_quality) +
  geom_boxplot()+
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,pH), method = 'lm', color = 'red')

Positive correlation exists between quality and and surphates in white wine, the correaltion coffecient is 0.05367788. The plot depicts that there are many outliers in the observation dataset. Most observation that contains higher level of sulphate fall to quality levels fo 5,6 and 7.

cor.test(wht_wine_quality$quality, wht_wine_quality$sulphates, 
         method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$sulphates
## t = 3.7613, df = 4896, p-value = 0.000171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02571007 0.08156172
## sample estimates:
##        cor 
## 0.05367788

ggplot(aes(factor(quality), sulphates), data = wht_wine_quality) +
  geom_boxplot()+
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,sulphates), method = 'lm', color = 'red')

Overlaying summary with raw data plot show the graph of raw data of quality and alchol. The plot also overlays graph of the grand mean (i.e. the mean of the alcoholic content of the wine for each catagory or level) , the black line, the 50th percentile, the green line, the 10th percentile, the blue line and the 90th percentile, the red line for each white wine quality catagory. All plots show that there is an increase in the amount of alcohol percentage presence in wine for qulity levels above 5.

ggplot(aes(x=quality, y = alcohol), data = wht_wine_quality) + 
  geom_point(alpha = 0.1, position = position_jitter(h=0), color='orange') + 
 coord_cartesian(xlim = c(2,8)) + coord_cartesian(ylim = c(0.8,1.2)) +
  coord_trans(y = 'sqrt') + 
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = 'quantile', fun.args=list(probs=0.1), 
            linetype =2, color = 'blue')+ 
   geom_line(stat = 'summary', fun.y = 'quantile', fun.args=list(probs=0.9), 
             linetype =2, color = 'red') +
geom_line(stat = 'summary', fun.y = 'quantile', fun.args=list(probs=0.5), 
          linetype =2, color = 'green')

Let us plot key varaible against each other using ‘ggpiars’, in the above plots we only see how the input varaibles are related or correlated with quality. The ggpairs plot show not only the relation between input variables and the quality of white wine but also the relation among each other. The ggpairs plot dicpicts the scatter plot, the histogram and correlation , which is very helpful for quick summary of the given dataset. What can be seen from the corration coefficient from the upper right part of the graph is that the correaltion among each varaible is not that strong, it is below 0.5 for most of the cases.

ggpairs(wht_wine_quality,2:13, title = 'Summary of relation among varaibles')

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In the above plots, I have tried to see the correaltion and distribution of each input vararaible with the output varaible (i.e. the quality of white wine). Among the input varaibles alcohol,free.sulfur.dioxide and sulphate have a positve correaltion (correlation coefficient of 0.4355747) with quality, alcohol being having a stong correaltion with quality, that is the highest the alcoholic content in percentage found in the white wine the best is the quality of the wine. Other physicochemical variables are nagatively correlated with the quality of the wine, citric.acid has the weakest correlation with quality (correaltion coefficient of -0.009209091).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Except the three input varaibles, the other varaibles have a week negative correaltion with the quality of the white wine.

What was the strongest relationship you found?

Among all the input varaibles, the quality of the white wine is strongely correlated with the alcohol level in the wine.

Multivariate Plots Section

A simple linear multiple regression model is developed to see how well the input variable could be used to predict the output varaible (i.e. the quality of white wine).The regression model can be given in the form below.

quality = alcohol0.1935 - chlorides0.2473 + citric.acid0.021 - density0.0229 +
fixed.acidity0.0655 + free.sulfur.dioxide0.0037 + pH0.686 + residual.sugar0.0814 + sulphates0.635 -total.sulfur.dioxide0.000286 -
volatile.acidity*1.863 + 0.015

From the coefficient of the linear regeression model we see that some varaibles have little impact in predicting quality of the white wine, for example free.sulfur.dioxide, total.sulfur.dioxide has little contribution in modeling quality of the wine.

The residual that is the difference between the observed output varaible and the predicted variable is also plotted. The residual show the performance of the model, the smaller the residual the better the our model is. The histogram of the residual error dipicts a normal distibution which is a good indication of the model as the residuals don’t skwed to one side only.

attach(wht_wine_quality)
lnr_mod <-lm(formula=quality~alcohol + chlorides + citric.acid + density + 
          fixed.acidity + free.sulfur.dioxide + pH +   
          residual.sugar + sulphates+total.sulfur.dioxide + volatile.acidity)

summary(lnr_mod)

## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density + 
##     fixed.acidity + free.sulfur.dioxide + pH + residual.sugar + 
##     sulphates + total.sulfur.dioxide + volatile.acidity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.502e+02  1.880e+01   7.987 1.71e-15 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16

hist(lnr_mod$residuals)

ggplot(aes(x = lnr_mod$residuals), data = lnr_mod) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Below is the density vs alchol plotted againest quality of the white wine. we can see that most white wine that contains less than 10% of alcohol have relatively less denser and those white wine with relatively high alcolic percentage have lesser density.

ggplot(data = wht_wine_quality, aes(x = alcohol, y = density, 
    color = factor(quality))) + geom_point(alpha = 1/2, 
    position = position_jitter(h = 0), size = 5) +
    coord_cartesian(ylim = c(0.987, 1.003)) + 
    scale_colour_brewer(palette="Set1")

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Employing the ‘ggpairs’ function in GGally, I have plotted a chart that shows a quick summary of the different varaibles such as scatter plot, boxplot , correaltion cofficient etc.

Were there any interesting or surprising interactions between features?

From the cofficient multiple linear regeression model , we can see that all are less than 0.5, which shows most of the input variables are not strong predictors of quality of wine.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model

Yes, I have build a multiple linear regression model. The limitation of the model would be that the data is only collected for one type of wine varity and only from one manufacturer and could not represent to model the quality of wine based on physicochemical input varaibles for other manufacturer and wine varities in general.

Final Plots and Summary

Plot One

ggplot(wht_wine_quality, aes(x=as.factor(quality)) ) + geom_bar() + 
  xlab('Quality in levels range from 0 to 10') +  
  ylab('Wine Counti in numbers') + ggtitle("White Wine Quality")

Description One

The histogram plot of quality of the White wine from all observations in the data set depicts that most of white wine quality in the observation falls to quality level of 6, 5, and 7 respectively. Only few numbers of observationf of the white wine falls in quality level of 8,4 and 3. None of the observations falls into the highest quality of wine level 9 and 10 and either of the lowest quality level of 0, 1 and 2.

Plot Two

ggplot(aes(factor(quality), alcohol), data = wht_wine_quality) +
  geom_boxplot() + 
  geom_jitter(alpha = 0.05)  +
  geom_smooth(aes(quality-2,alcohol), method = 'lm', color = 'red') + 
  xlab('quality in level  range 1 to 10') + 
  ylab('alcohol in % by volume') +
  ggtitle('Quality vs alcohol of White Wine')

Description Two

Above we did correaltion between the quality of the white one and the 11 input varaible that affect the quality of wine. The alcohol level have show srong correlation as compared to other varaibles in the observation. This plot show the box plot for each quality level or catagory againest the alcoholic level presence and also the linear fit line. The boxplot for quality levels 5, 6 and 7 show that there are many outliers that lies below the 25 and above the 75 percentile. The linear fit line depicts the fact that ther is a postive correlation between quality and alcohol level of the white wine.

Plot Three

ggplot(data = wht_wine_quality, aes(x = alcohol, y = density, 
color = factor(quality))) +
geom_point(alpha = 1/2, position = position_jitter(h = 0), size = 5) +
coord_cartesian(ylim = c(0.987, 1.003)) + scale_colour_brewer(palette="Set1") + 
geom_smooth(method = "lm", se = FALSE, size = 1) +
   xlab('alcohol amount in % by volume') +
   ylab('density in g/cm^3') +
   ggtitle('Density vs. Alcohol by quality')

Description Three

The plot shows the density plottes againest alcohol level and classified by quality of the white wine. It show that white wines with medium level of quality (6) tends to have low alcohol and high density and white wines with higher quality (8 and 9) tends to have low density and higher alcoholic percentage presence in the wine. The linear model fit between the density and alcohol also shows that there is a negative correaltion between them for each quality category or level.

Reflection

In Summary, the above analysis have show that the quality of the white wine depends on the physicochemical input varaibles. The quality of the wine is posively correlated with alcohol, free.sulfur.dioxide, sulphate and pH and it is negatively correlated with the other physicochemical input varaibles. The correlation among each varaibles is not that strong, all correlation coefficient is below 0.5 in abosute terms, the execption is that there is a strong correlation between density and total.sulfur dioxide which is above 0.8. The challenge while I was working on this data analysis was to identify and figure out which relation, pattern , or correlation among the varialbes is the most important. Which kind of plot I should use to visualize the relation and find out patterns was the other challenge. To have a quality anlysis and get some useful insights from a dataset, one should have a good deal of understanding about the dataset before diving direcly to any kind of analysis on a given data . This would have made it much easier to figure out or sepeculate relationships, patterns or trends among each varaible of interest.