Title: Red Wine By Johnny Sch. Antoine

Introduction

This project explores a red wine dataset to see how chemical properties influence the quality of red wines. The original dataset is composed of 1599 observations and 13 variables (the summary below shows 14 because it includes one derived variable, ratio.freesulfur.dioxide). I will use univariate, bivariate, and multivariate exploration techniques. I will look for correlations between variables and use a few plots to explain the dataset. Finally, I will fit a linear model to predict the quality of a wine based on its chemical composition.



Dataset Summary

## [1] 1599   14
##  [1] "X"                        "fixed.acidity"           
##  [3] "volatile.acidity"         "citric.acid"             
##  [5] "residual.sugar"           "chlorides"               
##  [7] "free.sulfur.dioxide"      "total.sulfur.dioxide"    
##  [9] "density"                  "pH"                      
## [11] "sulphates"                "alcohol"                 
## [13] "quality"                  "ratio.freesulfur.dioxide"
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity           : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity        : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid             : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar          : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides               : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide     : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide    : num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density                 : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                      : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates               : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol                 : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality                 : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ ratio.freesulfur.dioxide: num  0.324 0.373 0.278 0.283 0.324 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      ratio.freesulfur.dioxide
##  Min.   : 8.40   Min.   :3.000   Min.   :0.02273         
##  1st Qu.: 9.50   1st Qu.:5.000   1st Qu.:0.25926         
##  Median :10.20   Median :6.000   Median :0.37500         
##  Mean   :10.42   Mean   :5.636   Mean   :0.38231         
##  3rd Qu.:11.10   3rd Qu.:6.000   3rd Qu.:0.48485         
##  Max.   :14.90   Max.   :8.000   Max.   :0.85714
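
The ratio.freesulfur.dioxide column in the summary above is not one of the original 13 variables. Its first printed values match free.sulfur.dioxide divided by total.sulfur.dioxide, so the sketch below shows one way such a column could be derived. The data frame redwine_head is a hypothetical stand-in built from the first five rows of the str() output, and the definition of the ratio is an assumption inferred from those values, not the code used for this report.

```r
# Hypothetical stand-in for the first five rows of the dataset,
# copied from the str() output above.
redwine_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34)
)

# Assumed definition: the share of total sulfur dioxide that is free.
redwine_head$ratio.freesulfur.dioxide <-
  redwine_head$free.sulfur.dioxide / redwine_head$total.sulfur.dioxide

print(round(redwine_head$ratio.freesulfur.dioxide, 3))
```

The rounded values can be compared by eye with the ratio.freesulfur.dioxide row of the str() output.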

Univariate Plots Section

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Removed 78 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Fixed acidity appears positively skewed, and a big chunk of the data falls between 7 and 9. We took care of a few outliers by excluding the top 5%.
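
Excluding the top 5% can be done by cutting at the 95th percentile. The sketch below is a minimal illustration in base R, assuming the cutoff is taken with quantile(); the vector holds only the first ten fixed.acidity values from the str() output, so it is not the full variable, and the plotting code used for the report may have applied the cutoff differently (for example through the axis limits).

```r
# First ten fixed.acidity values, copied from the str() output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# 95th percentile, used here as the (assumed) cutoff for the "top 5%".
cutoff <- quantile(fixed_acidity, probs = 0.95)

# Keep only the observations at or below the cutoff.
trimmed <- fixed_acidity[fixed_acidity <= cutoff]

cat("cutoff:", cutoff, "\n")
cat("kept", length(trimmed), "of", length(fixed_acidity), "values\n")
```

On the full dataset, 5% of 1599 observations is about 80 rows, which is in line with the row counts reported in the warnings of this section.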



Volatile acidity appears to be approximately normally distributed, and a big chunk of the data falls between 0.3 and 0.7. We use the log10 transformation because the distribution is better represented on that scale.



## Warning: Removed 78 rows containing non-finite values (stat_bin).

The histogram of citric acid shows that this variable is neither normally distributed nor clearly skewed. There are spikes between 0 and 0.04, between 0.05 and 0.10, between 0.2 and 0.3, and also between 0.4 and 0.6. We also took care of some outliers by excluding the top 5%.



## Warning: Removed 79 rows containing non-finite values (stat_bin).

Residual sugar looks normally distributed once the top 5% is excluded. Had I not excluded the top 5%, there would be a long tail on the right side of the plot.



The chlorides histogram is mostly normally distributed with a long tail on the right. The range for this variable is really small: according to the summary, it lies between 0.012 and 0.611. We use the log10 transformation, which makes the histogram look a little more normal.
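
To illustrate why a log10 transformation can make a long right tail look more symmetric, the sketch below compares the sample skewness of a variable before and after the transformation. The values in x are hypothetical (evenly spaced on the log10 scale), not the chlorides data, and the skewness() helper is defined here for illustration only.

```r
# Sample skewness (third standardized moment); positive means a right tail.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical right-skewed values, evenly spaced on the log10 scale.
x <- 10^seq(-2, 0, length.out = 21)

cat("skewness, raw scale:  ", skewness(x), "\n")
cat("skewness, log10 scale:", skewness(log10(x)), "\n")
```

On the raw scale the skewness is positive, while on the log10 scale it is essentially zero for this constructed example. Real data will rarely be this clean.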



## Warning: Removed 77 rows containing non-finite values (stat_bin).

The histogram of free sulfur dioxide shows that the data is positively skewed, with a peak between 5 and 8. Most of the data lies between 0 and 40. We excluded the top 5%, which helps with outliers.



## Warning: Removed 80 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

The histogram of total sulfur dioxide shows that the data is positively skewed. Most of the data falls between 0 and 110. We excluded the top 5%, which helps with outliers.



The density histogram shows a normal distribution when I use the logarithmic scale.



Just like the density histogram, the pH histogram looks fairly normally distributed when using the logarithmic scale.



The sulphates histogram shows more of a normal shape when we use the logarithmic function as a transformation.



## Warning: Removed 70 rows containing non-finite values (stat_bin).

The alcohol histogram is positively skewed.



## Warning: Ignoring unknown parameters: binwidth, bins, pad

The quality histogram looks approximately normally distributed. It also shows that the worst and the best wines are outliers and that most of the wines are just average.



The ratio.freesulfur.dioxide histogram is negatively skewed when using the logarithmic function.




Univariate Analysis

What is the structure of your dataset?

I chose the red wine dataset. The original dataset was composed of 1599 observations and 13 variables. I treat the quality feature as an ordered categorical variable, while all the others are numeric or integer variables. Most of the variables have a skewed distribution with a long tail either on the left or on the right.
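
In the str() output, quality is stored as an integer. The sketch below shows one way it could be converted to an ordered categorical variable and tabulated. It uses only the first ten quality scores printed in the str() output, and the choice of levels 3 to 8 is an assumption based on the minimum and maximum in the summary.

```r
# First ten quality scores, copied from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Treat quality as an ordered categorical variable on the 3-8 scale
# reported in the summary, so that unused levels are kept.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

counts <- table(quality_f)
print(counts)
```

Keeping all six levels means a bar chart of the counts would still show the empty categories, which makes the concentration around 5 and 6 easier to see.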

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality, the quality score of each wine. We are going to analyze the data to see which features have a significant influence on the quality of a red wine. The quality scores in this dataset lie between 3 and 8.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Without analyzing the data, I would say almost all the variables will help support my investigation: fixed acidity, volatile acidity, citric acid, pH, residual sugar, etc. I expect all of them to be a factor in the quality of a wine.

Did you create any new variables from existing variables in the dataset?

Yes. I created one new variable, ratio.freesulfur.dioxide, which is the ratio of free sulfur dioxide to total sulfur dioxide.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Yes. While plotting the variables, I excluded the top 5% in some plots to get rid of outliers. I also used the log10 transformation in some plots, because some features are much better represented on the logarithmic scale than on the linear scale.






Bivariate Plots Section

cor(redwine,method="pearson")
##                                     X fixed.acidity volatile.acidity
## X                         1.000000000   -0.26848392     -0.008815099
## fixed.acidity            -0.268483920    1.00000000     -0.256130895
## volatile.acidity         -0.008815099   -0.25613089      1.000000000
## citric.acid              -0.153551355    0.67170343     -0.552495685
## residual.sugar           -0.031260835    0.11477672      0.001917882
## chlorides                -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide       0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide     -0.117849669   -0.11318144      0.076470005
## density                  -0.368372087    0.66804729      0.022026232
## pH                        0.136005328   -0.68297819      0.234937294
## sulphates                -0.125306999    0.18300566     -0.260986685
## alcohol                   0.245122841   -0.06166827     -0.202288027
## quality                   0.066452608    0.12405165     -0.390557780
## ratio.freesulfur.dioxide  0.335438942   -0.13081236     -0.072618561
##                          citric.acid residual.sugar    chlorides
## X                        -0.15355136   -0.031260835 -0.119868519
## fixed.acidity             0.67170343    0.114776724  0.093705186
## volatile.acidity         -0.55249568    0.001917882  0.061297772
## citric.acid               1.00000000    0.143577162  0.203822914
## residual.sugar            0.14357716    1.000000000  0.055609535
## chlorides                 0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide      -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide      0.03553302    0.203027882  0.047400468
## density                   0.36494718    0.355283371  0.200632327
## pH                       -0.54190414   -0.085652422 -0.265026131
## sulphates                 0.31277004    0.005527121  0.371260481
## alcohol                   0.10990325    0.042075437 -0.221140545
## quality                   0.22637251    0.013731637 -0.128906560
## ratio.freesulfur.dioxide -0.16693889   -0.070626080 -0.105156413
##                          free.sulfur.dioxide total.sulfur.dioxide
## X                                0.090479643          -0.11784967
## fixed.acidity                   -0.153794193          -0.11318144
## volatile.acidity                -0.010503827           0.07647000
## citric.acid                     -0.060978129           0.03553302
## residual.sugar                   0.187048995           0.20302788
## chlorides                        0.005562147           0.04740047
## free.sulfur.dioxide              1.000000000           0.66766645
## total.sulfur.dioxide             0.667666450           1.00000000
## density                         -0.021945831           0.07126948
## pH                               0.070377499          -0.06649456
## sulphates                        0.051657572           0.04294684
## alcohol                         -0.069408354          -0.20565394
## quality                         -0.050656057          -0.18510029
## ratio.freesulfur.dioxide         0.327240869          -0.37143493
##                              density          pH    sulphates     alcohol
## X                        -0.36837209  0.13600533 -0.125306999  0.24512284
## fixed.acidity             0.66804729 -0.68297819  0.183005664 -0.06166827
## volatile.acidity          0.02202623  0.23493729 -0.260986685 -0.20228803
## citric.acid               0.36494718 -0.54190414  0.312770044  0.10990325
## residual.sugar            0.35528337 -0.08565242  0.005527121  0.04207544
## chlorides                 0.20063233 -0.26502613  0.371260481 -0.22114054
## free.sulfur.dioxide      -0.02194583  0.07037750  0.051657572 -0.06940835
## total.sulfur.dioxide      0.07126948 -0.06649456  0.042946836 -0.20565394
## density                   1.00000000 -0.34169933  0.148506412 -0.49617977
## pH                       -0.34169933  1.00000000 -0.196647602  0.20563251
## sulphates                 0.14850641 -0.19664760  1.000000000  0.09359475
## alcohol                  -0.49617977  0.20563251  0.093594750  1.00000000
## quality                  -0.17491923 -0.05773139  0.251397079  0.47616632
## ratio.freesulfur.dioxide -0.26497991  0.18489507 -0.010459139  0.24627450
##                              quality ratio.freesulfur.dioxide
## X                         0.06645261               0.33543894
## fixed.acidity             0.12405165              -0.13081236
## volatile.acidity         -0.39055778              -0.07261856
## citric.acid               0.22637251              -0.16693889
## residual.sugar            0.01373164              -0.07062608
## chlorides                -0.12890656              -0.10515641
## free.sulfur.dioxide      -0.05065606               0.32724087
## total.sulfur.dioxide     -0.18510029              -0.37143493
## density                  -0.17491923              -0.26497991
## pH                       -0.05773139               0.18489507
## sulphates                 0.25139708              -0.01045914
## alcohol                   0.47616632               0.24627450
## quality                   1.00000000               0.19411335
## ratio.freesulfur.dioxide  0.19411335               1.00000000
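
The matrix above includes X, which the str() output shows to be a row index running from 1 to 1599, so its correlations are not chemically meaningful. The sketch below shows one way to drop such a column and list each pair of variables once, ordered by the absolute value of the correlation. The data frame toy and its columns a, b and d are hypothetical, used only to keep the example self-contained.

```r
# Hypothetical toy data; X plays the role of a row index.
toy <- data.frame(
  X = 1:6,
  a = c(1, 2, 3, 4, 5, 6),
  b = c(6, 5, 4, 3, 2, 1),
  d = c(1, 3, 2, 5, 4, 6)
)

# Drop the index column before computing Pearson correlations.
cm <- cor(toy[, names(toy) != "X"], method = "pearson")

# Each unordered pair appears exactly once in the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx],
  stringsAsFactors = FALSE
)

# Sort by absolute value so strong negative correlations rank high too.
pair_table <- pair_table[order(abs(pair_table$r), decreasing = TRUE), ]
print(pair_table)
```

Sorting on the absolute value matters because a correlation of -0.68 is stronger than one of 0.67, even though it is smaller as a signed number.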

This graph shows the correlations between the variables in the dataset.



This graph shows the correlations between the variables in the dataset.



Looking at this graph, we can see a trend: the quality of the wine tends to increase as the alcohol content increases.
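
A trend like this can also be checked numerically by averaging alcohol within each quality score. The sketch below is a minimal illustration using tapply() on the first ten rows printed in the str() output. Ten rows are far too few to show the trend reliably, so this only demonstrates the technique; on the full dataset the correlation between alcohol and quality is 0.476.

```r
# First ten rows of alcohol and quality, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean alcohol content within each quality score.
mean_alcohol <- tapply(alcohol, quality, mean)
print(round(mean_alcohol, 2))
```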



The correlation is not that big (0.25), but it is positive: the more sulphates in the wine, the better the quality tends to be.



Quality and volatile acidity are negatively correlated (-0.39): the less volatile acidity a wine has, the better its quality tends to be.



The quality of the wine is positively correlated with citric acid, although the correlation is fairly weak (0.23).




The following plots show some features other than quality that are strongly correlated with each other.

pH and fixed acidity are strongly and negatively correlated (-0.68).



Free sulfur dioxide and total sulfur dioxide are strongly and positively correlated (0.67).



Fixed acidity and density are strongly and positively correlated (0.67).



Citric acid and fixed acidity are strongly and positively correlated (0.67).



Volatile acidity and citric acid are negatively correlated (-0.55).




Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I was surprised to see that only a few features were noticeably correlated with the quality of a wine. I was shocked to realize that residual sugar and pH do not play a big role in the quality of a wine. It did not come as a surprise that sulphates, volatile acidity and alcohol are among the features most strongly correlated with quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

There were some strong correlations between pH and fixed acidity, free sulfur dioxide and total sulfur dioxide, density and fixed acidity, citric acid and volatile acidity, and citric acid and fixed acidity. Fixed acidity has a strong correlation with several other features.

What was the strongest relationship you found?

The strongest relationship (correlation) is between fixed acidity and pH (-0.683). It is followed by fixed acidity and citric acid (0.672), then fixed acidity and density (0.668), and total sulfur dioxide and free sulfur dioxide (0.668).






Multivariate Plots Section



We can see that quality is better when volatile acidity is low and alcohol is high.



We can see that quality is better as alcohol and sulphates increase.



We can see that quality is better with increasing sulphates and lower volatile acidity.



We can see that quality is better with increasing citric acid and decreasing volatile acidity.





## 
## Calls:
## m1: lm(formula = quality ~ log10(volatile.acidity), data = redwine)
## m2: lm(formula = quality ~ log10(volatile.acidity) + alcohol, data = redwine)
## m3: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates), 
##     data = redwine)
## m4: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid, data = redwine)
## m5: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid + fixed.acidity, data = redwine)
## m6: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid + fixed.acidity + residual.sugar, data = redwine)
## m7: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid + fixed.acidity + residual.sugar + log10(pH), 
##     data = redwine)
## m8: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid + fixed.acidity + residual.sugar + log10(pH) + 
##     log10(chlorides), data = redwine)
## m9: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid + fixed.acidity + residual.sugar + log10(pH) + 
##     log10(chlorides) + log10(density), data = redwine)
## m10: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid + fixed.acidity + residual.sugar + log10(pH) + 
##     log10(chlorides) + log10(density) + free.sulfur.dioxide, 
##     data = redwine)
## m11: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid + fixed.acidity + residual.sugar + log10(pH) + 
##     log10(chlorides) + log10(density) + free.sulfur.dioxide + 
##     total.sulfur.dioxide, data = redwine)
## m12: lm(formula = quality ~ log10(volatile.acidity) + alcohol + log10(sulphates) + 
##     citric.acid + fixed.acidity + residual.sugar + log10(pH) + 
##     log10(chlorides) + log10(density) + free.sulfur.dioxide + 
##     total.sulfur.dioxide + ratio.freesulfur.dioxide, data = redwine)
## 
## ====================================================================================================================================================================================================
##                                  m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11           m12       
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                   5.012***      1.938***      2.415***      2.438***      1.969***      1.970***      3.397***      3.505***      3.073***      3.140***      4.024***      3.942***  
##                                (0.041)       (0.165)       (0.171)       (0.172)       (0.207)       (0.208)       (0.659)       (0.657)       (0.796)       (0.797)       (0.819)       (0.820)    
##   log10(volatile.acidity)      -2.057***     -1.567***     -1.300***     -1.372***     -1.455***     -1.455***     -1.452***     -1.301***     -1.273***     -1.277***     -1.151***     -1.166***  
##                                (0.121)       (0.112)       (0.114)       (0.133)       (0.134)       (0.135)       (0.135)       (0.140)       (0.143)       (0.143)       (0.145)       (0.145)    
##   alcohol                                     0.309***      0.299***      0.299***      0.308***      0.308***      0.318***      0.299***      0.279***      0.275***      0.267***      0.267***  
##                                              (0.016)       (0.016)       (0.016)       (0.016)       (0.016)       (0.017)       (0.017)       (0.026)       (0.027)       (0.027)       (0.027)    
##   log10(sulphates)                                          1.517***      1.557***      1.549***      1.549***      1.544***      1.760***      1.810***      1.837***      1.832***      1.819***  
##                                                            (0.177)       (0.181)       (0.181)       (0.181)       (0.181)       (0.188)       (0.195)       (0.196)       (0.195)       (0.195)    
##   citric.acid                                                            -0.109        -0.466***     -0.465***     -0.519***     -0.414**      -0.406**      -0.403**      -0.242        -0.235     
##                                                                          (0.105)       (0.137)       (0.138)       (0.140)       (0.142)       (0.142)       (0.142)       (0.146)       (0.146)    
##   fixed.acidity                                                                         0.052***      0.052***      0.036*        0.030*        0.049         0.047         0.023         0.020     
##                                                                                        (0.013)       (0.013)       (0.015)       (0.015)       (0.025)       (0.025)       (0.026)       (0.026)    
##   residual.sugar                                                                                     -0.000        -0.000         0.003         0.012         0.016         0.018         0.017     
##                                                                                                      (0.012)       (0.012)       (0.012)       (0.015)       (0.015)       (0.015)       (0.015)    
##   log10(pH)                                                                                                        -2.649*       -3.392**      -2.579        -2.566        -3.730*       -3.737*    
##                                                                                                                    (1.161)       (1.171)       (1.443)       (1.442)       (1.460)       (1.460)    
##   log10(chlorides)                                                                                                               -0.534***     -0.510***     -0.517***     -0.583***     -0.576***  
##                                                                                                                                  (0.134)       (0.137)       (0.137)       (0.137)       (0.137)    
##   log10(density)                                                                                                                              -49.448       -52.290       -34.609       -20.123     
##                                                                                                                                               (51.307)      (51.337)      (51.230)      (51.959)    
##   free.sulfur.dioxide                                                                                                                                        -0.002         0.004        -0.001     
##                                                                                                                                                              (0.002)       (0.002)       (0.004)    
##   total.sulfur.dioxide                                                                                                                                                     -0.003***     -0.001     
##                                                                                                                                                                            (0.001)       (0.001)    
##   ratio.freesulfur.dioxide                                                                                                                                                                0.354     
##                                                                                                                                                                                          (0.216)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                     0.153         0.311         0.341         0.342         0.348         0.348         0.350         0.357         0.357         0.358         0.365         0.366     
##   adj. R-squared                0.153         0.310         0.340         0.340         0.346         0.346         0.347         0.353         0.353         0.354         0.361         0.361     
##   sigma                         0.743         0.671         0.656         0.656         0.653         0.653         0.652         0.649         0.649         0.649         0.646         0.645     
##   F                           288.647       359.989       275.225       206.705       170.163       141.714       122.534       110.189        98.045        88.466        82.943        76.337     
##   p                             0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood            -1793.802     -1628.955     -1593.104     -1592.555     -1584.495     -1584.495     -1581.883     -1573.978     -1573.511     -1572.601     -1563.504     -1562.150     
##   Deviance                    882.635       718.183       686.690       686.218       679.335       679.335       677.119       670.457       670.066       669.303       661.731       660.611     
##   AIC                        3593.604      3265.910      3196.209      3197.110      3182.990      3184.990      3181.765      3167.956      3169.022      3169.202      3153.009      3152.299     
##   BIC                        3609.735      3287.419      3223.095      3229.372      3220.630      3228.007      3230.159      3221.728      3228.170      3233.728      3222.912      3227.579     
##   N                          1599          1599          1599          1599          1599          1599          1599          1599          1599          1599          1599          1599         
## ====================================================================================================================================================================================================

The largest linear regression we fit (m12) explains only 36.6% of the variance in wine quality (R-squared = 0.366).
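
The models in the table are nested: each one adds a predictor to the previous one. The sketch below shows this pattern on a small scale with lm() and update(), and how R-squared and adjusted R-squared can be read from summary(). The data frame toy and the variables y, x1 and x2 are hypothetical; this is not the code or data used for the table.

```r
# Hypothetical toy data: y depends on x1 and x2 plus a deterministic wobble.
x1 <- 1:20
x2 <- rep(c(0, 1), times = 10)
y  <- 1 + 0.5 * x1 + 2 * x2 + sin(x1)
toy <- data.frame(y = y, x1 = x1, x2 = x2)

# Nested models: m2 adds one predictor to m1.
m1 <- lm(y ~ x1, data = toy)
m2 <- update(m1, . ~ . + x2)

# R-squared is the share of the variance in y explained by the model.
r2     <- c(m1 = summary(m1)$r.squared,     m2 = summary(m2)$r.squared)
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
print(round(rbind(r2, adj_r2), 3))
```

Because R-squared cannot decrease when a predictor is added to a least-squares model, the adjusted R-squared, AIC and BIC rows of the table are a fairer basis for comparing m1 to m12 than R-squared alone.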




Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Alcohol is the feature with the strongest correlation with quality. I was rather surprised to see that residual sugar did not have a stronger correlation with quality. Fixed acidity has a strong correlation with three other features: citric acid, density and pH.

Were there any interesting or surprising interactions between features?

As mentioned before, fixed acidity has a strong correlation with pH, density and citric acid. I think it is interesting to see that one feature has a strong correlation with three other features.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I fit a series of nested linear regression models (m1 to m12). Even the largest one accounts for only 36.6% of the variance in wine quality.
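
One limitation of comparing the models by R-squared alone is that it always favors the largest model. As a small worked example, the sketch below copies the AIC and BIC rows from the model table above and reports which model each criterion prefers. Only the numbers are taken from the table; the vector names are chosen here for illustration.

```r
# AIC and BIC for m1..m12, copied from the model table above.
aic <- c(3593.604, 3265.910, 3196.209, 3197.110, 3182.990, 3184.990,
         3181.765, 3167.956, 3169.022, 3169.202, 3153.009, 3152.299)
bic <- c(3609.735, 3287.419, 3223.095, 3229.372, 3220.630, 3228.007,
         3230.159, 3221.728, 3228.170, 3233.728, 3222.912, 3227.579)
names(aic) <- names(bic) <- paste0("m", 1:12)

# Lower is better for both criteria; BIC penalizes extra terms more heavily.
cat("lowest AIC:", names(which.min(aic)), "\n")
cat("lowest BIC:", names(which.min(bic)), "\n")
```

The two criteria need not agree, since BIC applies a heavier penalty for each additional term. Either way, the differences among the larger models are small compared with the jump from m1 to m2, when alcohol is added.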







Final Plots and Summary

Plot One

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Description One

The graph above shows that most of the wines have a quality of 5 or 6, and just a few have a quality of 3, 4 or 8. It simply shows the distribution of the wines by quality.



Plot Two

## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

Description Two

Those two features have one of the strongest correlations in the dataset. I find it interesting because people tend to think that a negative correlation cannot be a strong one. I wanted to show a graph with a strong negative correlation, and that is the kind of relationship we have between fixed acidity and pH in this dataset.



Plot Three

Description Three

The graph above shows the correlation between quality and alcohol. It is positive, and it is the strongest correlation that quality has with any feature in the dataset.






Reflection

The red wine dataset is composed of 1599 observations and 13 variables. Eleven of them are physicochemical attributes. The main focus of this analysis was to find out how much those physicochemical attributes contribute to the quality of a wine. Even though there are 11 physicochemical variables, only 4 of them have a somewhat strong correlation with the quality of a wine. In other words, only a small share of those features contributes noticeably to the quality of the wine. Those 4 features are alcohol, volatile acidity, sulphates and citric acid.

I struggled with some plots that had odd shapes, and I tried some transformations that did not work. I struggled a lot with my linear regression because I could not accept the fact that it shows the chemical variables account for only 36.6% of the variance in wine quality. Then I came to a realization: maybe the quality of a wine is not something objective after all.

I liked the fact that it was so easy to plot the quality variable, and my two correlation plots match the results I got when I computed the correlations between all the variables. Those are some of the things that went right.

It was really surprising to see that residual sugar was not much of a factor when it comes to predicting wine quality. Before doing any analysis, I was expecting residual sugar to play a big role in the quality of a wine.

As I mentioned before, my regression model was not accurate when it comes to predicting the quality of a wine. This leads me to ask: can a model predict anything with 100% accuracy, and how do we use regression to predict something that is mainly subjective? How can we make up for a lack of accuracy when trying to predict something?