Checking the variable types and number of observations. Here we have 13 columns: the row index X, 11 variables that might be related to wine quality, and quality as the output variable.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The quality column ranges from 3 to 9, with 9 as the highest observed level of quality.
We need to find out what the other 11 variables represent and try to figure out their potential relationship with quality. First we check the dataset documentation regarding the attributes.
fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
chlorides (sodium chloride - g / dm^3): the amount of salt in the wine
free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol (% by volume): the percent alcohol content of the wine
## X fixed.acidity volatile.acidity
## 0 0 0
## citric.acid residual.sugar chlorides
## 0 0 0
## free.sulfur.dioxide total.sulfur.dioxide density
## 0 0 0
## pH sulphates alcohol
## 0 0 0
## quality
## 0
Based on the counts above, there are no missing values in the dataset, so we probably won’t need to clean the data.
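The missing-value counts above can be reproduced with a call along these lines. This is a minimal sketch: `wine_head` is a hypothetical stand-in holding a few values copied from the structure listing earlier in this report, not the full dataset.

```r
# Toy stand-in for the first rows of `wine` (values copied from the structure listing)
wine_head <- data.frame(
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2),
  pH            = c(3, 3.3, 3.26, 3.19, 3.19),
  quality       = c(6L, 6L, 6L, 6L, 6L)
)

# Number of missing values per column
na_counts <- colSums(is.na(wine_head))
print(na_counts)

# TRUE when no column contains a missing value
no_missing <- all(na_counts == 0)
print(no_missing)
```

On the real data the same `colSums(is.na(...))` call would be applied to `wine` directly.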
We won’t need the X variable (it is only a row index), so we drop it.
wine$X <- NULL
NEXT WE NEED TO FIND THE CORRELATION BETWEEN THE VARIABLES AND QUALITY
First we take a look at the white wine quality distribution.
## quality freq
## 1 3 20
## 2 4 163
## 3 5 1457
## 4 6 2198
## 5 7 880
## 6 8 175
## 7 9 5
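As a quick check of the table above, the quality column can be rebuilt from the printed frequencies and summarised in base R. This is a sketch: the vector is reconstructed from the counts shown here rather than read from the data file, and the variable names are illustrative.

```r
# Rebuild the quality column from the frequency table printed above
quality_levels <- 3:9
quality_freq   <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(quality_levels, times = quality_freq)

# Frequency table with the same column names as the printed output
freq <- as.data.frame(table(quality), responseName = "freq")
print(freq)

# Share of wines per quality level, in percent
share <- round(100 * quality_freq / sum(quality_freq), 1)
names(share) <- quality_levels
print(share)

# Central tendency of the quality score
mean_quality   <- mean(quality)
median_quality <- median(quality)
```

Almost 45% of the wines are rated 6, and levels 5 to 7 together cover the large majority of the samples, so the extreme quality levels are thinly represented.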
The dataset has 11 variables that might influence the quality of the wine. We will start with fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide and total sulfur dioxide.
We will divide the variable analysis into 4 blocks: 1. Acidity: includes fixed acidity, volatile acidity, citric acid and pH
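# Histogram helper: `varname` is expected to be a numeric vector such as wine$pH,
# `binwidth` is the width of each bin in the units of that variable.
create_hist <- function(varname, binwidth) {
  ggplot(data = wine, aes(x = varname)) +
    geom_histogram(binwidth = binwidth, fill = "#FF9999", colour = "black")
}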
We can see from the charts above that fixed acidity, volatile acidity and citric acid have roughly normal distributions that are slightly right-skewed, probably with a few high-acidity outliers, while pH is roughly normally distributed, mostly between 2.8 and 3.5, which means all the wines are acidic (pH < 7).
To check these variables for outliers, we will look at boxplots later, together with the other variables.
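One rough way to judge the direction of skew without the plots is to compare each mean with its median. The sketch below uses values copied from the summary output earlier in this report; a mean above the median is only a hint of a longer right tail, not proof.

```r
# Median and mean per variable, copied from the summary output earlier in this report
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid", "pH"),
  median   = c(6.800, 0.2600, 0.3200, 3.180),
  mean     = c(6.855, 0.2782, 0.3342, 3.188)
)

# Relative gap between mean and median in percent;
# a positive gap suggests a longer tail toward high values
stats$rel_gap_pct <- round(100 * (stats$mean - stats$median) / stats$median, 2)
print(stats)
```

All four gaps are positive, with volatile acidity showing the largest relative gap and pH the smallest.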
Here we take a look at sugar and salt.
grid.arrange(
  ggplot(wine, aes(x = 1, y = residual.sugar)) +
    geom_jitter(alpha = 0.1) +
    geom_boxplot(alpha = 0.2, color = 'red'),
  ggplot(wine, aes(x = residual.sugar)) +
    geom_histogram(bins = 30),
  ncol = 2
)
We can see that residual sugar and chlorides are heavily right-skewed: most wines have very little residual sugar (1-2 g/dm^3) and chlorides around 0.04-0.05 g/dm^3. Sulphates are also somewhat right-skewed, with most values around 0.4-0.5 g/dm^3.
Because the residual sugar and chlorides distributions are right-skewed, plotting them on a log10 scale makes their shape easier to read.
RS2 <- ggplot(aes(x = residual.sugar), data = wine) +
  geom_histogram() +
  scale_x_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  ggtitle("Residual Sugar") +
  xlab("Log 10 Residual Sugar") +
  ylab("Count")
CL2 <- ggplot(aes(x = chlorides), data = wine) +
  geom_histogram() +
  scale_x_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  ggtitle("Chlorides") +
  xlab("Log 10 Chlorides") +
  ylab("Count")
grid.arrange(RS2, CL2, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
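To illustrate why a log scale helps with this kind of distribution, the sketch below measures skewness before and after a log10 transform. The values are synthetic (drawn from a log-normal distribution as a stand-in for residual sugar), and `skewness()` is a small helper defined here, not a function from a package.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1)
# Synthetic right-skewed values (NOT the real residual sugar data)
toy_sugar <- rlnorm(5000, meanlog = 1.5, sdlog = 0.9)

skew_raw <- skewness(toy_sugar)         # clearly positive: long right tail
skew_log <- skewness(log10(toy_sugar))  # close to zero after the transform

print(c(raw = skew_raw, log10 = skew_log))
```

The same helper could be applied to `wine$residual.sugar` and `wine$chlorides` to quantify what the histograms show.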
Most of the other variables are in g/dm^3, but free sulfur dioxide, total sulfur dioxide and density use different scales, so we convert the sulfur dioxide variables to g/dm^3.
Current variable scales:
- free sulfur dioxide (mg / dm^3)
- total sulfur dioxide (mg / dm^3)
- density (g / cm^3)
Convert free and total sulfur dioxide to g/dm^3 by dividing by 1000.
Density is kept in g/cm^3: 1 g/cm^3 equals 1000 g/dm^3 (1 kg/dm^3), so a conversion would only rescale the values by a constant factor.
Drop the pre-conversion variables.
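The conversion step could look roughly like this. It is a sketch on a toy data frame: the values are the first rows shown in the structure listing, and the column names `Free_SO2` and `Total_SO2` follow the output printed later in this report.

```r
# First rows of the two sulfur dioxide columns (mg/dm^3), copied from the structure listing
wine_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# 1 g = 1000 mg, so dividing mg/dm^3 by 1000 gives g/dm^3
wine_head$Free_SO2  <- wine_head$free.sulfur.dioxide / 1000
wine_head$Total_SO2 <- wine_head$total.sulfur.dioxide / 1000

# Drop the pre-conversion columns
wine_head$free.sulfur.dioxide  <- NULL
wine_head$total.sulfur.dioxide <- NULL

print(wine_head)
```

Rescaling by a constant does not change correlations with quality; it only changes the size of the regression coefficients.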
We can see that all the variables have outliers. Only residual sugar and density have very few outliers; the rest, which relate to acids and sulphates, such as fixed, volatile and citric acidity, chlorides, pH, free SO2 and total SO2, have many outliers.
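Boxplots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. As a worked example, the fences can be computed from the quartiles in the summary output earlier in this report. This is a sketch: the three variables were picked for illustration, and boxplot implementations may compute hinges slightly differently from these quartiles.

```r
# Quartiles and maxima copied from the summary output earlier in this report
q <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides"),
  q1  = c(6.3, 1.7, 0.036),
  q3  = c(7.3, 9.9, 0.050),
  max = c(14.2, 65.8, 0.346)
)

# 1.5 * IQR rule: values outside the fences are drawn as outliers
iqr <- q$q3 - q$q1
q$lower_fence <- q$q1 - 1.5 * iqr
q$upper_fence <- q$q3 + 1.5 * iqr
q$max_is_outlier <- q$max > q$upper_fence

print(q)
```

Residual sugar has a wide interquartile range, so its upper fence sits far from the median, which fits the observation that it shows few outliers. Chlorides have a narrow range and therefore a tight fence.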
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 775 9.1 0.27 0.45 10.6 0.035
## 821 6.6 0.36 0.29 1.6 0.021
## 828 7.4 0.24 0.36 2.0 0.031
## 877 6.9 0.36 0.34 4.2 0.018
## 1606 7.1 0.26 0.49 2.2 0.032
## density pH sulphates alcohol quality Free_SO2 Total_SO2
## 775 0.99700 3.20 0.46 10.4 9 0.028 0.124
## 821 0.98965 3.41 0.61 12.4 9 0.024 0.085
## 828 0.99055 3.28 0.48 12.5 9 0.027 0.139
## 877 0.98980 3.28 0.36 12.7 9 0.057 0.119
## 1606 0.99030 3.37 0.42 12.9 9 0.031 0.113
There are 4898 white wines in this data set with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality)
The following observations are made from the dataset:
The quality of the samples ranges from 3 to 9, with a median of 6 and a mean of around 5.88.
The pH value varies from 2.720 to 3.820, with a median of 3.180 and a mean of 3.188.
I am interested in taking a closer look at the residual sugar variable. Low quality white wines tend to have high residual sugar, but at quality level 9 we also see high residual sugar in some samples, so I want to look more closely at this variable and its influence on quality.
Another feature of interest that might support the analysis is the age of the wine. We often hear that long-aged wine has better quality, but this hypothesis should be tested as well.
I didn’t create any new variables from existing variables, but rather just converted them to the same scale. I converted free.sulfur.dioxide and total.sulfur.dioxide into Free_SO2 and Total_SO2 in g/dm^3 from the initial scale of mg/dm^3.
I changed quality into a factor, so we can split the plots into facets by quality.
An unusual distribution can be found in residual sugar and chlorides, which are right-skewed. This can be handled by plotting them on a log10 scale.
As we can see above, quality is weakly negatively correlated with fixed acidity (-0.11), volatile acidity (-0.19), chlorides (-0.21) and density (-0.31), and has a moderate positive correlation with alcohol (0.44).
For correlation between variables, we can see a strong correlation between density and residual sugar (positive) and between density and alcohol (negative). We probably need to drop density if we want to predict the quality of other white wines, since it is strongly correlated with other input variables.
TAKING A CLOSER LOOK AT THE RELATIONSHIPS BETWEEN VARIABLES
## [,1]
## fixed.acidity 0.26533101
## volatile.acidity 0.02711385
## citric.acid 0.14950257
## residual.sugar 0.83896645
## chlorides 0.25721132
## density 1.00000000
## pH -0.09359149
## sulphates 0.07449315
## alcohol -0.78013762
## quality -0.30712331
## Free_SO2 0.29421041
## Total_SO2 0.52988132
From both scatterplots we can see that wines with a high level of alcohol tend to have lower density, and wines with a high level of residual sugar tend to have higher density. There is a strong positive relationship between density and residual sugar (0.83896645) and a strong negative relationship between density and alcohol (-0.78013762).
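To put these two correlations in perspective, squaring a correlation coefficient gives the share of variance that a simple one-predictor linear fit would explain. A short worked sketch using the two values quoted above:

```r
# Correlations with density, copied from the output above
r_density <- c(residual.sugar = 0.83896645, alcohol = -0.78013762)

# Squared correlation: share of density variance explained by each variable alone
r_squared <- r_density^2
print(round(r_squared, 3))
```

Taken one at a time, residual sugar accounts for roughly 70% of the variance in density and alcohol for roughly 61%. The two shares cannot simply be added, because residual sugar and alcohol are themselves correlated.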
In terms of chlorides and acidity, we can see that the correlation with quality is weakly negative: wines with lower levels of chlorides, volatile acidity, fixed acidity and density tend, weakly, to have higher quality.
Alcohol tends to have a positive relationship with quality.
## [,1]
## fixed.acidity -0.113662831
## volatile.acidity -0.194722969
## citric.acid -0.009209091
## residual.sugar -0.097576829
## chlorides -0.209934411
## density -0.307123313
## pH 0.099427246
## sulphates 0.053677877
## alcohol 0.435574715
## quality 1.000000000
## Free_SO2 0.008158067
## Total_SO2 -0.174737218
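The correlations with quality are easier to compare once they are sorted by absolute size. The sketch below copies the values from the output above into a named vector; `cor_quality` and `ranked` are illustrative names.

```r
# Correlations with quality, copied from the output above (quality itself omitted)
cor_quality <- c(
  fixed.acidity = -0.113662831, volatile.acidity = -0.194722969,
  citric.acid = -0.009209091, residual.sugar = -0.097576829,
  chlorides = -0.209934411, density = -0.307123313,
  pH = 0.099427246, sulphates = 0.053677877,
  alcohol = 0.435574715, Free_SO2 = 0.008158067,
  Total_SO2 = -0.174737218
)

# Sort by absolute value, strongest relationship first
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(round(ranked, 2))
```

Alcohol, density and chlorides come out on top, while citric acid and free SO2 show almost no linear relationship with quality.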
##
## Call:
## lm(formula = quality ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.880e+01 7.987 1.71e-15 ***
## fixed.acidity 6.552e-02 2.087e-02 3.139 0.00171 **
## volatile.acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## citric.acid 2.209e-02 9.577e-02 0.231 0.81759
## residual.sugar 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## chlorides -2.473e-01 5.465e-01 -0.452 0.65097
## density -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## sulphates 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## Free_SO2 3.733e+00 8.441e-01 4.422 9.99e-06 ***
## Total_SO2 -2.857e-01 3.781e-01 -0.756 0.44979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
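The adjusted R-squared in this output can be checked by hand from the other reported numbers. A minimal sketch, with the inputs copied from the regression summary and the observation count from the structure listing:

```r
n  <- 4898    # observations
df <- 4886    # residual degrees of freedom from the summary
r2 <- 0.2819  # multiple R-squared from the summary

# Number of predictors implied by the degrees of freedom (intercept excluded)
p <- n - df - 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df
print(c(predictors = p, adjusted_r2 = round(adj_r2, 4)))
```

With close to 4900 observations and only 11 predictors the penalty is tiny, which is why the two values are nearly identical. Either way, the model explains only about 28% of the variance in quality.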
In the bivariate analysis we take a look at two variables, quality and density. We pick quality as the main feature because we want to find out what makes a good quality white wine, and we look more closely at density because it is strongly correlated with several other input variables in the dataset.
Quality: there are weak negative correlations between quality and fixed acidity (-0.11), volatile acidity (-0.19), chlorides (-0.21) and density (-0.31).
There is also a moderate positive correlation between quality and alcohol (0.44).
As mentioned, the relationship between residual sugar and density is strongly positive, and the relationship between alcohol and density is strongly negative. This makes sense, since alcohol is less dense than water, so more alcohol lowers the density, while dissolved sugar raises it.
There is a strong positive relationship between density and residual sugar (0.83896645) and a strong negative relationship between density and alcohol (-0.78013762).
The strongest relationships I found were density with residual sugar (0.83896645) and density with alcohol (-0.78013762).
quality.cut <- cut(wine$quality, breaks = c(0,4,8,10))
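To see what these breaks do, the sketch below applies the same `cut()` call to a quality vector rebuilt from the frequency table shown earlier. The reconstruction is for illustration only; the breaks are the ones used above.

```r
# Quality scores rebuilt from the frequency table shown earlier
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Same breaks as above: intervals are open on the left and closed on the right
quality.cut <- cut(quality, breaks = c(0, 4, 8, 10))

# Number of wines in each group
group_sizes <- table(quality.cut)
print(group_sizes)
```

The groups are very unbalanced: 183 wines fall in (0,4], 4710 in (4,8] and only 5 in (8,10], so the top group in the plots consists of just the five wines rated 9.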
We can see that the combination of high alcohol and low residual sugar shows one occurrence of high quality wine, but this is a very weak relationship.
ggplot(wine, aes(x = alcohol, y = pH, color = quality.cut)) +
  coord_cartesian(
    xlim = c(quantile(wine$alcohol, .01), quantile(wine$alcohol, .99)),
    ylim = c(quantile(wine$pH, .01), quantile(wine$pH, .99))
  ) +
  geom_jitter(alpha = 1, size = 1.5) +
  scale_color_brewer(palette = "Set1") +
  theme_dark() +
  ggtitle("Alcohol vs pH") +
  xlab("Alcohol") +
  ylab("pH")
ggplot(aes(x = fixed.acidity,
           y = density),
       data = wine) +
  geom_point(alpha = 0.1, size = 0.5) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0 (older versions use size = 1)
  geom_smooth(method = "lm", se = FALSE, linewidth = 1)
ggplot(aes(factor(quality),
           alcohol),
       data = wine) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .5, color = 'blue') +
  # fun replaces the deprecated fun.y in ggplot2 >= 3.3.0
  stat_summary(fun = "mean",
               geom = "point",
               color = "red",
               shape = 8,
               size = 4)
The green dots (high quality) show that the combination of higher pH and higher alcohol yields a few occurrences of high quality white wine. The influence is not strong, but these variables have some significance for white wine quality.
Alcohol and pH reinforce each other and produce occurrences of high quality wine.
The interaction between alcohol and residual sugar is interesting: the two have opposite (negative and positive) relationships with density, and their combination produces no clear cluster of high quality wine.
This plot shows that alcohol has a positive relationship with quality (correlation 0.44), the strongest among the input variables.
This shows that fixed and volatile acidity, chlorides and density have weak negative relationships with quality.
Struggles: sometimes I want to use a variable such as quality as a number for plotting, which does not work if quality is formatted as a factor. But faceting does not work well if quality is stored as a number or integer, so I had to keep switching the variable type.
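One pitfall when switching between the two types is that `as.integer()` on a factor returns the internal level codes, not the original scores. A small sketch with made-up quality values in the 3-9 range:

```r
# Illustrative quality scores (not taken from the dataset)
quality <- c(6L, 5L, 9L, 3L, 7L)

# Factor version, e.g. for faceting or grouped boxplots
quality_factor <- factor(quality, levels = 3:9)

# Pitfall: this returns the position of each level (1 to 7), not the score
codes <- as.integer(quality_factor)

# Going through character recovers the original scores
back <- as.integer(as.character(quality_factor))

print(rbind(original = quality, codes = codes, recovered = back))
```

An alternative that avoids the back-and-forth is to keep the column numeric and wrap it in `factor()` only inside the plot call, as in the `factor(quality)` boxplot earlier in this report.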
Another struggle is that I could not get a variance inflation factor (VIF) library to work in order to quantify the severity of multicollinearity in the regression analysis.
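If a VIF package is not available, the factor can be computed with base R alone: regress each predictor on all the others and take 1 / (1 - R^2). The sketch below does this on synthetic data in which a density-like column is built from the sugar and alcohol columns. `vif_manual` is a hypothetical helper written for this example, and the toy numbers are not the wine data.

```r
# Hypothetical helper: VIF for every column of a data frame of predictors
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress this predictor on all remaining predictors
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(42)
n <- 500
# Synthetic predictors: density depends on sugar and alcohol, ph is independent
toy <- data.frame(sugar = rnorm(n), alcohol = rnorm(n))
toy$density <- 0.8 * toy$sugar - 0.7 * toy$alcohol + rnorm(n, sd = 0.2)
toy$ph <- rnorm(n)

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

A common rule of thumb treats values above 5 or 10 as a sign of problematic multicollinearity. For a fitted `lm` object, `car::vif()` is the usual packaged alternative, assuming the car package can be installed.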
Future work would be building a multiple regression model to predict the quality of white wine.