Target of the Project

In this project, I analyze which factors determine the quality of red wine. After the analysis, I build a linear model to predict wine quality from the given characteristics.

Structure of Dataframe

## [1] 1599   12

Our data set consists of 12 variables (11 chemical and physical measurements plus the quality rating) and 1599 observations.

Summary of the Data Frame

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
## [1] "3" "4" "5" "6" "7" "8"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
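The second `str()` output above shows `quality` converted to an ordered factor. The report does not show the conversion code, so the following is only a minimal sketch of how it might be done, using the first ten quality values printed above. The names `quality_sample` and `quality_ord` are illustrative.

```r
# First ten quality scores, copied from the str() output above.
quality_sample <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Declare all six levels explicitly so the factor keeps the ordering
# "3" < "4" < ... < "8" even when a subset lacks some of the levels.
quality_ord <- factor(quality_sample, levels = 3:8, ordered = TRUE)

print(levels(quality_ord))      # the six levels, in order
print(is.ordered(quality_ord))  # TRUE for an ordered factor

# Ordered factors support comparisons against a level label.
n_good <- sum(quality_ord >= "7")
print(n_good)
```

Passing `levels = 3:8` explicitly is a design choice: without it, a sample that happens to contain only 5, 6 and 7 would produce a three-level factor.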

Univariate plot section

I plot the distribution of each variable in the dataset to understand its shape (normal, right-skewed or left-skewed) and to spot extreme outliers.

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Observation:

Most wines in our dataset are of average quality (5 or 6); poor and good quality wines are comparatively rare. This imbalance may bias the results and reduce the accuracy of the wine quality model.
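To put a number on this imbalance, the counts from the table above can be turned into proportions. This is a small sketch that uses only the printed counts; the variable names are illustrative.

```r
# Wine counts per quality level, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)               # should match the 1599 rows
share <- quality_counts / total            # proportion per quality level
average_share <- sum(share[c("5", "6")])   # wines rated 5 or 6

print(round(100 * share, 1))
print(round(100 * average_share, 1))
```

Wines rated 5 or 6 account for 1319 of the 1599 observations, i.e. roughly 82% of the data.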

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Observation:

The distribution of fixed acidity is right (positively) skewed, with a median of 7.90 and the mean pulled up to 8.32 by outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Observation:

The distribution of volatile acidity appears to be bimodal, with peaks near 0.4 and 0.6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Observation:

The citric acid distribution does not seem to follow any standard distribution and shows more than two peaks. 75% of the values are less than or equal to 0.42, yet the maximum is 1.0, which indicates the presence of outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Observation:

Residual sugar is also right-skewed, like fixed acidity, with the mean greater than the median due to outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Observation:

The distribution of chlorides is similar to that of residual sugar, i.e. right-skewed due to the presence of outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Observation:

Free sulfur dioxide also follows a positively skewed distribution, like most of the variables, with a peak at 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Observation:

Total sulfur dioxide follows a similar pattern to free sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Observation:

Density seems to be approximately normally distributed, with the mean and median both close to 0.997.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Observation:

The pH distribution is also approximately normal, similar to density.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Observation:

Sulphates also follow a positively skewed distribution, like most of the variables, with a peak at 0.6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Observation:

Alcohol is also right-skewed, but with less skewness and fewer outliers than the other skewed variables.

Univariate Plot Analysis

What is the Structure of the dataset?

The red wine dataset has 1599 observations of 12 variables. Quality is the only categorical variable; the remaining 11 are numerical and reflect the chemical and physical properties of the wine.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is ‘Quality’. I would like to find out which factors best determine the quality of a wine.

What are your thoughts before starting the analysis about the dataset?

I think ‘Alcohol’ might play a key role in determining the quality of red wine. ‘Acidity’ may also affect the quality. Considering taste, I believe ‘residual sugar’ will affect the quality as well.

Of the features you have investigated, what different distributions did you observe for the variables in the dataset?

  1. Of all the variables, only pH and density follow an approximately normal distribution.
  2. Fixed and volatile acidity, total and free sulfur dioxide, alcohol and sulphates seem to be right-skewed due to the presence of outliers.
  3. Residual sugar and chlorides seem to have more extreme outliers than the other variables.
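One rough way to compare how extreme the upper tails are is to measure how far each maximum lies above the third quartile, in units of the interquartile range (IQR), alongside Tukey's conventional upper fence of Q3 + 1.5 × IQR. The sketch below uses only quartiles and maxima copied from the `summary()` output. The choice of variables and the column names are illustrative, and this is a summary-based approximation rather than a full outlier analysis.

```r
# Quartiles and maxima copied from the summary() output earlier in the report.
q <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides",
               "total.sulfur.dioxide", "pH", "alcohol"),
  q1  = c(7.10, 1.90, 0.070, 22, 3.21, 9.5),
  q3  = c(9.20, 2.60, 0.090, 62, 3.40, 11.1),
  max = c(15.90, 15.50, 0.611, 289, 4.01, 14.9)
)

iqr <- q$q3 - q$q1

# Tukey's upper fence: values above it are conventionally flagged as outliers.
q$upper_fence <- q$q3 + 1.5 * iqr

# How many IQRs the maximum lies above the third quartile.
q$max_in_iqr <- (q$max - q$q3) / iqr

print(q[order(q$max_in_iqr, decreasing = TRUE), ])
```

On these figures, chlorides and residual sugar have maxima furthest from the bulk of the data, which agrees with point 3 above.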

Bivariate Plot Section

I created a correlation matrix of all the variables to get an overview of which variables may affect quality and which variables may be correlated with each other.

Observations:

  1. Density has a fairly strong correlation with fixed acidity, with a Pearson coefficient of 0.67.
  2. Volatile acidity and alcohol seem to be the strongest factors influencing the quality of the wine.
  3. Alcohol has a negative correlation with density, which makes sense because alcohol is less dense than water.
  4. Strangely, volatile acidity seems to have a positive correlation with pH, which is unexpected because pH decreases as acidity increases.
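The correlation matrix itself is not reproduced in this report. As a hedged illustration of the technique, the sketch below builds a small simulated data frame and ranks the predictors by the absolute value of their Pearson correlation with quality. The data are made up for the example (the simulation parameters are assumptions, not estimates from the wine data); with the real data one would pass the numeric columns of the data frame to `cor()` instead.

```r
set.seed(42)

# Simulated stand-in data; NOT the wine dataset.
n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- round(5.6 + 0.36 * (alcohol - 10.4) -
                 1.3 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.7))
toy <- data.frame(alcohol, volatile.acidity, quality)

# Pearson correlation matrix of all numeric columns.
cor_mat <- cor(toy, method = "pearson")

# Correlations of each predictor with quality, strongest first.
quality_cor <- cor_mat["quality", colnames(cor_mat) != "quality"]
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]

print(round(cor_mat, 2))
print(round(ranked, 2))
```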

Now, let's take a closer look to check which variables are helpful in determining the quality of red wine.

The plot indicates that fixed acidity does not have a significant impact on the quality of red wine, as the median values remain approximately the same as quality increases.

The plot clearly shows that volatile acidity is negatively correlated with wine quality: as volatile acidity decreases, quality increases.

Citric acid also seems to play a role in determining the quality of red wine. Good wines have a higher concentration of citric acid.

This contradicts my initial assumption: residual sugar does not seem to affect quality, as its median values are approximately the same across quality levels.

The plots above imply that chlorides follow a similar pattern to residual sugar and have no major impact on the quality of red wines.

This is an interesting observation: bad and good quality wines seem to have a low concentration of free sulfur dioxide, while average quality wines seem to have a high concentration.

As expected, total sulfur dioxide follows a similar pattern to free sulfur dioxide.

Better wines seem to be less dense than bad quality wines. This may be due to their higher alcohol concentration.

Better wines seem to have lower pH, i.e. they are more acidic.

Let's check how the various acids affect pH.

It is strange that volatile acidity has a positive correlation with pH. It might be due to Simpson's paradox or the limited number of observations.

The plot clearly shows that better wines seem to have a higher concentration of sulphates.

Alcohol seems to have the strongest correlation with the quality of red wines among all the variables. It can be clearly seen that better quality wines have a higher alcohol concentration than poor quality wines.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation?

  1. Fixed Acidity seems to have almost no effect on quality.
  2. Volatile Acidity seems to have a negative correlation with the quality.
  3. Better wines seem to have higher concentration of Citric Acid.
  4. Better wines seem to have higher alcohol percentages.
  5. Chlorides and residual sugar seem to have no impact on the quality of the wines.
  6. Better wines seem to have lower densities. This may be due to the higher alcohol content in them.
  7. Better wines seem to be more acidic.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile acidity had a positive correlation with pH, which was unexpected. This may be due to Simpson's paradox or the limited number of observations.

What was the strongest relationship you found?

Alcohol seems to be the strongest factor in determining wine quality, with the highest Pearson correlation with quality (0.48) among all the variables.
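As a consistency check, the Pearson coefficient can be recovered from the single-predictor regressions reported in the Linear Modelling section: with one predictor, R-squared equals the squared correlation, and the sign of the correlation matches the sign of the slope. The sketch below uses R-squared values copied from those `lm()` summaries; the variable names are illustrative.

```r
# R-squared values and slope signs from the single-predictor lm() summaries.
r_squared  <- c(alcohol = 0.2267, volatile.acidity = 0.1525, sulphates = 0.0632)
slope_sign <- c(alcohol = 1, volatile.acidity = -1, sulphates = 1)

# With one predictor, R-squared = r^2, so r = sign(slope) * sqrt(R-squared).
pearson_r <- slope_sign * sqrt(r_squared)

print(round(pearson_r, 2))
```

For alcohol this gives about 0.48, matching the correlation quoted above.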

Multivariate Plot section

As we saw, alcohol seems to be the strongest factor in determining the quality of wines. So let's plot it against the other significant variables for a deeper view.

Observation:

Higher alcohol and higher citric acid concentrations seem to produce better wines.

Observation:

Higher alcohol content seems to produce better wines when the sulphates concentration is also high.

Observation:

This plot suggests that better wines have high alcohol but low volatile acidity.

Observation:

It shows that high alcohol and low pH tend to produce better quality wines.

Let's analyze how the other acids affect pH:

It shows that high citric acid and low pH tend to produce high quality wines.

Observation:

This was clearly not expected: pH increases as volatile acidity increases, and this holds for the better quality wines as well.

Linear Modelling

Now, after all the analysis, let's determine how much each variable actually contributes to the quality of wines.

Before deciding on the final model, I first fit a separate linear model of quality on each independent variable. Based on each variable's contribution, I decide which variables to use to predict the quality of red wines.

Quality vs Alcohol

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.12503    0.17471  -0.716    0.474    
## alcohol      0.36084    0.01668  21.639   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16

Observation:

Based on the R-squared value of 0.2267, alcohol alone explains about 23% of the variance in the quality of red wine.
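One detail of the `lm()` calls in this section is worth checking. They apply `as.numeric()` to `quality`, which was converted to an ordered factor earlier. In R, `as.numeric()` on a factor returns the internal level codes (1 to 6 here) rather than the labels 3 to 8, so the models appear to have been fitted to quality minus 2. Slopes, R-squared values and p-values are unaffected by such a constant shift; only the intercept changes. The sketch below illustrates this with the first ten quality values from the `str()` output and the reported alcohol coefficients.

```r
# First ten quality values, as an ordered factor with levels "3" < ... < "8".
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8, ordered = TRUE)

codes  <- as.numeric(quality)                # internal level codes: 1..6
scores <- as.numeric(as.character(quality))  # original scores: 3..8

print(codes)
print(scores)
print(unique(scores - codes))  # constant offset between the two scales

# An OLS line passes through the means, so intercept + slope * mean(alcohol)
# equals the mean of the response the model was actually fitted to.
fitted_mean <- -0.12503 + 0.36084 * 10.42
print(fitted_mean)  # compare with the mean quality of 5.636 minus the offset
```

If predictions on the original 3 to 8 scale are wanted, fitting with `as.numeric(as.character(quality))` instead would be one way to get them directly.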

Quality vs Sulphates

## 
## Call:
## lm(formula = as.numeric(quality) ~ sulphates, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2432 -0.5424  0.1102  0.4456  2.3977 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.84775    0.07842   36.31   <2e-16 ***
## sulphates    1.19771    0.11539   10.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7819 on 1597 degrees of freedom
## Multiple R-squared:  0.0632, Adjusted R-squared:  0.06261 
## F-statistic: 107.7 on 1 and 1597 DF,  p-value: < 2.2e-16

Observation:

Sulphates explain only about 6% of the variance in the quality of red wine. This is a low value, but it is significant because the p-value is less than 0.05.

Quality vs pH

## 
## Call:
## lm(formula = as.numeric(quality) ~ pH, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6817 -0.6394  0.3032  0.3878  2.4874 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6359     0.4332  10.703   <2e-16 ***
## pH           -0.3020     0.1307  -2.311    0.021 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8065 on 1597 degrees of freedom
## Multiple R-squared:  0.003333,   Adjusted R-squared:  0.002709 
## F-statistic:  5.34 on 1 and 1597 DF,  p-value: 0.02096

Observation:

pH explains only about 0.3% of the variance in wine quality (R-squared of 0.0033), although its coefficient is significant at the 5% level (p = 0.021).

Quality vs density

## 
## Call:
## lm(formula = as.numeric(quality) ~ density, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7885 -0.6216  0.1554  0.4271  2.5177 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    78.24      10.51   7.446 1.57e-13 ***
## density       -74.85      10.54  -7.100 1.87e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7954 on 1597 degrees of freedom
## Multiple R-squared:  0.0306, Adjusted R-squared:  0.02999 
## F-statistic: 50.41 on 1 and 1597 DF,  p-value: 1.875e-12

Observation: Like pH, density explains little of the variance in wine quality, only about 3%. So density is not a strong contributor to the quality of the wines.

Quality vs citric acid

## 
## Call:
## lm(formula = as.numeric(quality) ~ citric.acid, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0011 -0.5976  0.1021  0.5057  2.5901 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.38172    0.03372 100.294   <2e-16 ***
## citric.acid  0.93845    0.10104   9.288   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7869 on 1597 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.05065 
## F-statistic: 86.26 on 1 and 1597 DF,  p-value: < 2.2e-16

Observation:

Citric acid explains only about 5% of the variance in the quality of red wine. This is a low value, but it is statistically significant.

Quality vs Volatile acidity

## 
## Call:
## lm(formula = as.numeric(quality) ~ volatile.acidity, data = redwine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.79071 -0.54411 -0.00687  0.47350  2.93148 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.56575    0.05791   78.85   <2e-16 ***
## volatile.acidity -1.76144    0.10389  -16.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared:  0.1525, Adjusted R-squared:  0.152 
## F-statistic: 287.4 on 1 and 1597 DF,  p-value: < 2.2e-16

Observation:

This is not what I expected: volatile acidity explains about 15% of the variance in the quality of red wine.

Quality vs Fixed acidity

## 
## Call:
## lm(formula = as.numeric(quality) ~ fixed.acidity, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8248 -0.6061  0.1925  0.4341  2.5550 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.15732    0.09789  32.253  < 2e-16 ***
## fixed.acidity  0.05754    0.01152   4.996  6.5e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8016 on 1597 degrees of freedom
## Multiple R-squared:  0.01539,    Adjusted R-squared:  0.01477 
## F-statistic: 24.96 on 1 and 1597 DF,  p-value: 6.496e-07

Observation:

Fixed acidity contributes very little to the quality of red wine, with an R-squared value of only about 1.5%.
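To compare the seven single-predictor models at a glance, their R-squared values can be collected and sorted. The sketch below copies the "Multiple R-squared" figures from the summaries above; the vector name is illustrative.

```r
# Multiple R-squared from each single-predictor model above.
r_squared <- c(
  alcohol          = 0.2267,
  sulphates        = 0.0632,
  pH               = 0.003333,
  density          = 0.0306,
  citric.acid      = 0.05124,
  volatile.acidity = 0.1525,
  fixed.acidity    = 0.01539
)

# Rank predictors by the share of variance in quality they explain alone.
ranked <- sort(r_squared, decreasing = TRUE)
print(round(100 * ranked, 1))  # percent of variance explained
```

Alcohol, volatile acidity, sulphates and citric acid come out on top, in that order.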

Final Linear Model

Considering the individual R-squared values and the correlations with quality, I decided to build the final linear model with the four independent variables below:

Dependent Variable: Quality

Independent variables: Alcohol, Sulphates, Citric acid, Volatile Acidity

Here are the model results:

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = redwine)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = redwine)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + citric.acid, 
##     data = redwine)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + citric.acid + 
##     volatile.acidity, data = redwine)
## 
## ============================================================================
##                          m1            m2            m3            m4       
## ----------------------------------------------------------------------------
##   (Intercept)          -0.125        -0.625***     -0.566**       0.646**   
##                        (0.175)       (0.177)       (0.176)       (0.201)    
##   alcohol               0.361***      0.346***      0.338***      0.309***  
##                        (0.017)       (0.016)       (0.016)       (0.016)    
##   sulphates                           0.994***      0.814***      0.696***  
##                                      (0.102)       (0.107)       (0.103)    
##   citric.acid                                       0.513***     -0.079     
##                                                    (0.093)       (0.104)    
##   volatile.acidity                                               -1.265***  
##                                                                  (0.113)    
## ----------------------------------------------------------------------------
##   R-squared             0.227         0.270         0.284         0.336     
##   adj. R-squared        0.226         0.269         0.282         0.334     
##   sigma                 0.710         0.690         0.684         0.659     
##   F                   468.267       294.988       210.501       201.777     
##   p                     0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1675.142     -1659.955     -1599.093     
##   Deviance            805.870       760.894       746.576       691.852     
##   AIC                3448.114      3358.284      3329.910      3210.186     
##   BIC                3464.245      3379.793      3356.795      3242.448     
##   N                  1599          1599          1599          1599         
## ============================================================================
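The stars in the table come from t-tests on each coefficient. As a rough check, the t-values and two-sided p-values for model m4 can be recomputed from the rounded estimates and standard errors printed above, so the results are approximate. The residual degrees of freedom are taken as N minus the five estimated parameters.

```r
# Estimates and standard errors for m4, copied from the table above (rounded).
est <- c(alcohol = 0.309, sulphates = 0.696,
         citric.acid = -0.079, volatile.acidity = -1.265)
se  <- c(alcohol = 0.016, sulphates = 0.103,
         citric.acid = 0.104, volatile.acidity = 0.113)

n <- 1599
df_resid <- n - (length(est) + 1)  # four slopes plus the intercept

t_value <- est / se
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(t_value, 2))
print(signif(p_value, 2))
```

The citric acid coefficient is smaller than its standard error in m4, so it is no longer significant once volatile acidity enters the model, while the other three predictors remain highly significant.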

Multivariate Plot Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  1. High alcohol and high sulphates concentrations seem to produce good quality wines.
  2. Citric acid shows similar behaviour.

Linear Model Summary

I fit several single-variable linear models to observe each variable's individual contribution to explaining the variance in red wine quality. Alcohol and volatile acidity come out as the most important factors, explaining about 23% and 15% of the variance respectively.

Combining the key variables, my final model includes alcohol, volatile acidity, citric acid and sulphates, which together explain approximately 33% of the variance in the quality of red wines (adjusted R-squared of 0.334). Note that the citric acid coefficient is no longer significant once volatile acidity is added (model m4).
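As a hedged sketch of how the final model could be used for prediction, the function below plugs the rounded m4 coefficients into the linear equation. It assumes the response was the factor level code (quality minus 2), because the models were fitted with `as.numeric()` on an ordered factor whose levels start at "3"; the `offset` argument maps the result back to the 3 to 8 scale. The function name and the offset handling are illustrative, not part of the original analysis.

```r
# Rounded m4 coefficients, copied from the model table.
m4_coef <- c(intercept = 0.646, alcohol = 0.309, sulphates = 0.696,
             citric.acid = -0.079, volatile.acidity = -1.265)

# Predict quality on the original 3..8 scale.
# `offset = 2` is an assumption: the response appears to be the level code.
predict_quality <- function(alcohol, sulphates, citric.acid,
                            volatile.acidity, offset = 2) {
  code <- m4_coef[["intercept"]] +
    m4_coef[["alcohol"]] * alcohol +
    m4_coef[["sulphates"]] * sulphates +
    m4_coef[["citric.acid"]] * citric.acid +
    m4_coef[["volatile.acidity"]] * volatile.acidity
  code + offset
}

# First wine in the dataset (values from the str() output); its quality is 5.
first_wine <- predict_quality(alcohol = 9.4, sulphates = 0.56,
                              citric.acid = 0, volatile.acidity = 0.70)
print(round(first_wine, 2))
```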

This modest result may be due to the fact that our dataset consists mainly of ‘Average’ quality wines, with very few observations of ‘Good’ and ‘Bad’ quality wines. More observations of ‘bad’ and ‘good’ quality wines would have helped me predict the quality of red wine more effectively.

Final Plots & Summary

Plot 1

Description -

  1. This plot shows that alcohol percentage plays a key role in determining the quality of wines.
  2. The higher the alcohol percentage, the better the wine quality.
  3. Even though most of the observations in this dataset are for average quality wine, the plot shows that the mean and median nearly coincide in every box, suggesting that alcohol is roughly symmetrically distributed within each quality level.
  4. Also, from our linear model, the R-squared value shows that alcohol alone explains about 23% of the variance in wine quality.

Plot 2

Description -

  1. This plot shows that volatile acidity also plays a significant role in determining the quality of wines.
  2. The lower the volatile acidity, the better the wine quality. This is not the case for the other acid parameters in the data set.
  3. Also, from our linear model, volatile acidity explains around 15% of the variance in the quality of red wines.

Plot 3

Description -

  1. This is the most unexpected result of the whole analysis.
  2. It is strange that volatile acidity has a positive correlation with pH. Generally, pH decreases as acidity increases.
  3. The reason is probably the limited number of observations for qualities 3, 4, 7 and 8, as most of the observations are for average quality wines, i.e. 5 and 6.
  4. Another possible reason is Simpson's paradox.

Reflection

The red wine dataset consists of 1599 observations of 12 variables describing various physical and chemical properties of red wine.

First I plotted each variable on its own to examine its distribution (univariate analysis), and then I plotted each variable against quality to see which are significant in determining wine quality. I saw that the factors that affected wine quality the most were alcohol percentage, sulphates, citric acid and volatile acidity.

I tried to figure out the effect of each individual acid on the overall pH of the wine. Here I found a very strange phenomenon: for volatile acidity, pH increased with acidity, which goes against the general property of acids.

In the final part of my analysis, I plotted multivariate plots to see if there were some interesting combinations of variables which together affected the overall quality of the wine. These plots also helped me choose the final variables to put in the linear model to predict wine quality.

While performing the analysis, my main struggle was to get a higher confidence level for predicting the ‘Good’ and ‘Bad’ quality wines. The data was heavily concentrated around ‘Average’ quality, and the dataset did not have enough observations at the extremes (qualities 3, 4, 7 and 8) to build a model that predicts wine quality from the other variables with a smaller margin of error. In future, if I can get a red wine dataset with more observations for every quality level, I can build more effective models.