Introduction

Wine quality is hard to define because evaluations combine many layers of extrinsic and intrinsic factors. The subjective perception of quality in particular complicates evaluation and makes it difficult to build an accurate and uniform wine rating system. Wine quality is assessed by either sensory tests or physicochemical tests. The latter analyze a wine’s chemical traits, such as pH, sugar, and alcohol level. Sensory tests, on the other hand, rely on human senses to evaluate appearance, aroma, flavor, and so on.

Since many wine rating systems and sensory analyses are performed by humans, the results are prone to subjective factors. In an attempt to minimize the bias and the discrepancy between the two kinds of tests, a 2009 study used a data mining approach to predict human wine tasting preferences from analytical data (Cortez et al., 2009). The study aimed to build a computationally efficient model with accurate predictive performance that not only supports oenologists’ wine evaluations, but also improves the quality and speed of their decision making. It enabled wine tests to distinguish subtle variations in visual attributes of the product. A similar work by Verissimo explored the impact of the chemical profile on sensory evaluations. The objective of that study was to analyze the correlation between the chemical composition of red wines and the sensory perception of the products (Verissimo, C. M., 2021). The analysis focused on the correlations between visual attributes and sensory characteristics, and tested whether the evaluation of wine color attributes can predict other sensory characteristics related to aromas and flavours.

While other studies explored specific aspects of taste and perception and how they are interwoven, in this work we evaluate whether there is a correlation between the chemical composition of wines and the probability that they succeed with sommeliers. Answering this question has more practical implications for a winemaker, as judges’ opinions often open the doors to a larger wine market. The assigned quality scores, given on a 0-10 scale, were chosen as the response variable. The scores represent the median value of three gradings given by independent judges for each wine sample. Applying logistic regression, we model a transformed version of this variable on different sets of predictors to discover the combination that is most effective at predicting the probability of a wine having high quality.

We expect the final model to include volatile acidity, residual sugar, citric acid, chlorides, and alcohol. These variables are specifically related to the taste of wine and how it behaves in a glass. We assume that chlorides and volatile acidity will negatively affect the score, as they both strengthen unwanted characteristics. On the other hand, residual sugar, citric acid, and alcohol are expected to correlate positively with the scores: residual sugar and alcohol add sweet or sour overtones to wine, whereas citric acid adds freshness.

Materials and Methods

We obtained the Wine Quality dataset from Kaggle, where it is open and available for public research. The data sets contain information on red and white samples of the Portuguese “Vinho Verde” wine. Initially we opted to include only the red wine data set in our analysis. However, during the initial exploratory data analysis (EDA), it became clear that including a grouping variable would give us more insight into the relationships between groups. The red wine and white wine data sets share the same columns, so we added a column named ‘type’ to each data set to indicate the type of wine. Next, we combined the two data sets into one. Adding this categorical variable was important, as it widens the scope of generalizability to white wines and enables us to look for more easily interpretable interactions between the type of wine and other variables.
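
As an illustration only, the combining step described above could look like the following sketch. The two small data frames are made-up stand-ins for the real data sets and their values are hypothetical; only the names `redwine`, `whitewine`, `allwine`, and `type` are taken from this report.

```r
# Toy stand-ins for the red and white data sets (values are made up).
redwine   <- data.frame(alcohol = c(9.4, 9.8), quality = c(5L, 6L))
whitewine <- data.frame(alcohol = c(8.8, 10.1, 11.0), quality = c(6L, 6L, 7L))

# Tag each data set with its wine type before stacking.
redwine$type   <- "red"
whitewine$type <- "white"

# rbind() needs matching column names, which holds here by construction.
allwine <- rbind(redwine, whitewine)
table(allwine$type)
```

Stacking the rows this way keeps one row per sample and lets `type` enter a model as an ordinary predictor.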

Some of the variables in the dataset were measured on a rather small scale (most values lower than 1), which could complicate interpretation. Therefore, we recoded volatile acidity, chlorides, and density by multiplying them by 10, 100, and 1000 respectively to make subsequent interpretations easier (Table 1, 3, 4). This means that a one-unit change in the model coefficients for these three variables corresponds to a step of 0.1, 0.01, and 0.001 original units respectively. In addition, we recoded the type variable into a dummy variable by assigning 1 to all red wines and 0 to all white wines (Figure 10-11).
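
A minimal sketch of this recoding, using two made-up rows rather than the real data, is shown below. The column names follow the model output later in the report; storing the dummy as a factor with levels 0 and 1 is an assumption inferred from the coefficient name `type1`.

```r
# Toy rows with made-up values; column names follow the model output.
wine <- data.frame(
  volatile_acidity = c(0.70, 0.27),
  chlorides        = c(0.076, 0.045),
  density          = c(0.9978, 1.0010),
  type             = c("red", "white")
)

# Rescale so that one unit equals 0.1, 0.01, and 0.001 original units.
wine$volatile_acidity <- wine$volatile_acidity * 10
wine$chlorides        <- wine$chlorides * 100
wine$density          <- wine$density * 1000

# Dummy-code type: red = 1, white = 0 (stored as a factor, an assumption).
wine$type <- factor(ifelse(wine$type == "red", 1, 0))
wine
```

Because only the scale changes, the fitted probabilities are unaffected; the coefficients are simply divided by the same factor of 10, 100, or 1000.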

Initially a linear regression model was considered to predict the variation in the quality scores; however, this idea was set aside due to the design limitations of the response variable. The scores can only take a small number of integer values, which prevented us from properly assessing the conditions required for a linear model. A logistic regression model therefore appeared to be a more appropriate way to approach the variation in the scores. In order to apply logistic regression, the quality variable was split into a two-level categorical variable. A wine sample is considered “Good” if its initial score was above 6 (the 75th percentile (Table 16)) and “Bad” if its score was 6 or below. The newly created quality categories, while not symmetrical in terms of the number of observations, both have an adequate number of cases to work with (Figure 18).
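
The split can be sketched as follows. The scores are made up, and coding “Good” as 1 and “Bad” as 0 is an assumption; `cat_score` is the response name that appears in the model output.

```r
# Made-up quality scores for six samples.
quality <- c(3L, 5L, 6L, 6L, 7L, 8L)

# "Good" (1) if the score is above 6, "Bad" (0) if it is 6 or below.
cat_score <- ifelse(quality > 6, 1L, 0L)

# Cross-tabulate to confirm where the cutoff falls.
table(quality, cat_score)
```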

To find the most effective logistic regression model, the new categorical variable was modeled on a wide variety of explanatory variables, using stepwise AIC selection and Lasso regression. The stepwise method produced a model that included 11 predictors (Table 6), whereas the Lasso regression suggested that only 9 should be considered (Table 7). The AIC-based model was deemed preferable, as it had more statistically significant predictors than the one yielded by the Lasso regression. Lastly, the exploratory data analysis led us to suspect an interaction between total sulfur dioxide and type. Hence, an interaction term between the total amount of sulfur dioxide and the type of wine was added to the final model (Table 2). The following section includes a detailed comparison between the model with the interaction term and the logistic regression model without one.
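
The selection workflow can be sketched on simulated data as below. This is not the code used for the report: the data are generated at random, `tsd` and `noise` are hypothetical stand-in columns, and base R's `step()` is assumed here as one way to run AIC-based stepwise selection. The Lasso step is omitted.

```r
set.seed(1)
n <- 500

# Simulated stand-in data (not the wine data); `noise` is unrelated to the response.
sim <- data.frame(
  alcohol = rnorm(n, mean = 10.5, sd = 1.2),
  tsd     = rnorm(n, mean = 115, sd = 55),
  noise   = rnorm(n),
  type    = factor(rbinom(n, size = 1, prob = 0.25))
)
eta <- -9 + 0.8 * sim$alcohol - 0.004 * sim$tsd
sim$cat_score <- rbinom(n, size = 1, prob = plogis(eta))

# Start from the full main-effects model and let AIC drop unhelpful terms.
full    <- glm(cat_score ~ alcohol + tsd + noise + type,
               family = binomial, data = sim)
stepped <- step(full, trace = 0)

# Compare main-effects and interaction versions of the same model by AIC.
no_int   <- glm(cat_score ~ alcohol + tsd + type, family = binomial, data = sim)
with_int <- glm(cat_score ~ alcohol + tsd * type, family = binomial, data = sim)

formula(stepped)
AIC(no_int, with_int)
```

A lower AIC for the model with `tsd * type` would favor keeping the interaction, which mirrors the comparison made between Table 2 and Table 6.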

Variable Chart
Name	Description	Variable Role	Type	Values	Units
Fixed Acidity	Volume of acids in a sample	Explanatory	Numeric	4.6 - 15.9	g / dm^3
Volatile Acidity	Volume of acetic acid	Explanatory	Numeric	0.12 - 1.58	g / dm^3
Citric Acid	Citric acid can add ‘freshness’ and flavor to wines	Explanatory	Numeric	0 - 1	g / dm^3
Residual Sugar	Amount of sugar left after fermentation	Explanatory	Numeric	0.9 - 15.5	g / dm^3
Chlorides	Amount of salt in wine	Explanatory	Numeric	0.01 - 0.61	g / dm^3
Free sulfur dioxide	Amount of unbound sulfur dioxide that inhibits microbial growth and oxidation	Explanatory	Numeric	1 - 72	mg / dm^3
Total sulfur dioxide	Total amount of sulfur dioxide in wine (both free and bound)	Explanatory	Numeric	0 - 289	mg / dm^3
Density	Density of wine is close to that of water, depending on the percent alcohol and sugar content	Explanatory	Numeric	0.99 - 1	g / cm^3
pH	Describes how acidic or basic a wine is	Explanatory	Numeric	2.74 - 4.01	NA
Sulphates	A wine additive which can contribute to sulfur dioxide gas levels	Explanatory	Numeric	0.33 - 2	g / dm^3
Alcohol	Percent alcohol content of the wine	Explanatory	Numeric	8.4 - 14.9	%
Quality	Score between 0 and 10	Response	Numeric	3 - 8	NA

Results

Correlation Matrix

Because our initial plan was to run a linear regression model, we chose to visualize the relationships between all variables using a correlation matrix for red wine and white wine separately. The plot shows some correlation between pairs of variables such as ‘quality vs. alcohol’, ‘quality vs. volatile acidity’, ‘quality vs. sulphates’, ‘quality vs. density’, and ‘density vs. alcohol’, among others. However, the main goal was to identify the predictors with the greatest impact on wine quality. In the correlation plot, alcohol, volatile acidity, chlorides, and density appeared to have the highest correlations with the quality variable.

# Table 2
summary(finallogmod)

## 
## Call:
## glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
##     residual_sugar + chlorides + type1 + free_sulfur_dioxide + 
##     total_sulfur_dioxide + pH + fixed_acidity + density + total_sulfur_dioxide:type1, 
##     family = binomial, data = allwine_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6258  -0.6264  -0.3692  -0.1618   3.0086  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                432.590020  67.144848   6.443 1.17e-10 ***
## alcohol                      0.422343   0.081280   5.196 2.03e-07 ***
## volatile_acidity            -0.360168   0.036892  -9.763  < 2e-16 ***
## sulphates                    2.521693   0.291719   8.644  < 2e-16 ***
## residual_sugar               0.227178   0.026723   8.501  < 2e-16 ***
## chlorides                   -0.081727   0.025215  -3.241 0.001190 ** 
## type1                        1.545325   0.317531   4.867 1.13e-06 ***
## free_sulfur_dioxide          0.010263   0.002973   3.452 0.000557 ***
## total_sulfur_dioxide        -0.001982   0.001418  -1.398 0.162068    
## pH                           2.577454   0.362937   7.102 1.23e-12 ***
## fixed_acidity                0.478028   0.066084   7.234 4.70e-13 ***
## density                     -0.454322   0.068174  -6.664 2.66e-11 ***
## type1:total_sulfur_dioxide  -0.012521   0.003498  -3.579 0.000344 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6439.6  on 6496  degrees of freedom
## Residual deviance: 5064.5  on 6484  degrees of freedom
## AIC: 5090.5
## 
## Number of Fisher Scoring iterations: 6

The final model includes 11 predictors and an interaction term between the total sulfur dioxide amount and the type of wine (Table 2). The model passed the drop-in-deviance test, as its p-value is very close to zero (Table 4). This means that there is statistically significant evidence that the model is more effective at predicting the probability of a wine sample having good quality than a model with no predictors. Furthermore, the model performed well on the cross-validation test: the mean sensitivity across folds was around 0.82 and the mean specificity around 0.95 (Table 8-9). Lastly, the model passed the empirical logit (emplogit) check, as all the points were close to the line (Figure 3). This finding indicated that no additional variable transformations were needed.
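
The drop-in-deviance test can be reproduced by hand from the deviances and degrees of freedom printed in Table 2. The sketch below only re-does that arithmetic; the variable names are our own.

```r
# Values copied from the model summary (Table 2).
null_dev  <- 6439.6; null_df  <- 6496
resid_dev <- 5064.5; resid_df <- 6484

# Test statistic: drop in deviance, compared against a chi-square distribution
# whose degrees of freedom equal the number of added terms.
G  <- null_dev - resid_dev
df <- null_df - resid_df
p  <- pchisq(G, df = df, lower.tail = FALSE)

c(G = G, df = df, p = p)
```

The statistic of 1375.1 on 12 degrees of freedom matches the `anova()` output in Table 4.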

Table 2 shows that most predictors are statistically significant, except for total sulfur dioxide, whose explanatory weight is also captured by the interaction term with the wine type. Based on the interaction term, we would expect total sulfur dioxide in a red wine sample to have a stronger negative relationship with the odds of a sample being good: the odds would drop by about $1.4$ percent per one-unit ($\frac{mg}{dm^3}$) increase in total sulfur dioxide while keeping other variables constant. In a white wine sample the relationship is not as extreme: the odds of a sample being of good quality would drop by only about $0.2$ percent per one-unit increase in total sulfur dioxide while keeping other variables constant.
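
As a check, the type-specific slopes for total sulfur dioxide can be recomputed from the Table 2 estimates. The sketch assumes, as stated in the Methods, that `type1` equals 1 for red wines and 0 for white wines; the variable names are our own.

```r
# Estimates copied from Table 2.
b_tsd <- -0.001982   # total_sulfur_dioxide
b_int <- -0.012521   # type1:total_sulfur_dioxide

# The interaction term only contributes when type1 = 1 (red).
slope_white <- b_tsd
slope_red   <- b_tsd + b_int

# Percent change in the odds per one-unit increase in total sulfur dioxide.
pct <- 100 * (exp(c(white = slope_white, red = slope_red)) - 1)
round(pct, 2)
```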

Even though the interaction coefficient is statistically significant, the main effect of total sulfur dioxide has a high p-value in the model ($p = 0.16$). With that in mind, the binary quality variable was modeled on the same set of 11 predictors without the interaction (Table 6). This additional regression model yielded a statistically significant coefficient for total sulfur dioxide. Namely, the odds of a wine sample being good are expected to fall by about 0.4 percent per one-unit increase in total sulfur dioxide while holding other predictors constant, a result that is significant at the 0.01 level ($p = 0.004$). Although this model provides evidence of a statistically significant relationship between total sulfur dioxide and quality, it has a slightly higher AIC (5103 vs. 5090).

To further investigate the relationship between type and total sulfur dioxide, two auxiliary models were run. In each case the binary quality variable was modeled on 10 predictors using the samples of a single wine type. The model run only for red wines had a statistically significant coefficient for total sulfur dioxide (t = -3.13, p < 0.01) (Table 10); however, this model had statistically insignificant pH and free sulfur dioxide coefficients. The “white wines” model, on the other hand, had statistically insignificant total sulfur dioxide and alcohol coefficients, whereas the other 8 did not lose their explanatory power (Table 11).

Other important predictors in the final model include volatile acidity, density, and alcohol, consonant with the predictions derived from the correlation matrix. Namely, the odds of a wine sample being good are expected to increase by a factor of 1.53 per one-unit increase in the percentage of alcohol (as long as its concentration stays between 8.4% and 14.9% and other variables are kept constant). Interestingly, we would expect the odds of a red wine sample being rated favorably by judges to be about 4.69 times (369% higher than) those of a white wine sample when no total sulfur dioxide is present, while keeping other predictors constant. Extrapolating the interpretation of this coefficient to values of total sulfur dioxide that are not present in the dataset is not entirely unreasonable, as the minimum value of total sulfur dioxide found in a wine sample is very close to zero (Figure 2).

# Table 3 
exp(finallogmod$coefficients)

##                (Intercept)                    alcohol 
##              7.438044e+187               1.525532e+00 
##           volatile_acidity                  sulphates 
##               6.975595e-01               1.244966e+01 
##             residual_sugar                  chlorides 
##               1.255053e+00               9.215233e-01 
##                      type1        free_sulfur_dioxide 
##               4.689497e+00               1.010316e+00 
##       total_sulfur_dioxide                         pH 
##               9.980197e-01               1.316359e+01 
##              fixed_acidity                    density 
##               1.612891e+00               6.348784e-01 
## type1:total_sulfur_dioxide 
##               9.875566e-01

Total Sulfur Dioxide

Among the negative coefficients, the most notable are volatile acidity, density, and chlorides. Volatile acidity and chlorides were part of our initial predictions, and all three are statistically significant. Based on the final model, there is statistically significant evidence ($z = -9.763, p < 0.01$) that the odds of receiving a high score are expected to decrease by 30 percent per 0.1 unit ($\frac{g}{dm^3}$) increase in volatile acidity while keeping the other variables constant. Similarly, the odds of a wine sample receiving a high score are expected to go down by 8% per 0.01 unit increase in chlorides ($z = -3.241, p < 0.01$) while keeping other variables constant. Lastly, the odds that a wine has good quality are expected to drop by about 37% per one-thousandth of a unit increase in density ($\frac{g}{cm^3}$), while keeping other variables constant.
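
The conversion from a logistic regression coefficient to a percent change in the odds can be sketched as follows. The estimates are copied from Table 2, the names `est`, `odds_ratio`, and `pct_change` are our own, and the units are the rescaled ones described in the Methods.

```r
# Selected estimates copied from Table 2 (rescaled units).
est <- c(alcohol = 0.422343, volatile_acidity = -0.360168,
         chlorides = -0.081727, type1 = 1.545325, density = -0.454322)

# Exponentiating a coefficient gives the multiplicative change in the odds
# per one-unit increase; subtracting 1 turns it into a percent change.
odds_ratio <- exp(est)
pct_change <- 100 * (odds_ratio - 1)

round(cbind(odds_ratio, pct_change), 2)
```

An odds ratio above 1 corresponds to a positive percent change and one below 1 to a negative change, so an odds ratio of about 4.69 means odds that are about 369% higher.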

Discussion

The final model results confirmed most of the initial assumptions about how taste-related variables are associated with the probability of receiving high scores. Residual sugar and alcohol concentration both had statistically significant positive coefficients for the odds of a wine being rated above 6. This is an intuitive outcome, as higher amounts of sugar and alcohol contribute to the taste characteristics of wine that judges normally look for, namely sweetness and sourness. On the other hand, volatile acidity and chlorides boost the qualities of wine that customers and judges alike do not look favorably upon. Larger concentrations of chlorides lead to a saltier taste, whereas higher volatile acidity adds a vinegar-like taste to wines. The model demonstrates that both of these variables are negatively associated with the odds of a wine sample receiving a high score.

Something we failed to include in our initial hypothesis was the importance of density, sulphates, and the different forms of sulfur dioxide. The positive relationship between sulphates and the odds of good quality (while controlling for other variables) is intuitive, as sulphates are used by wine producers to slow down the oxidation of their product. In other words, sulphates kill bacteria and thereby prevent wine from turning into vinegar and acquiring negative taste and smell characteristics (e.g., boiled egg). Similarly, the negative relationship between density and wine quality did not come as a surprise, as the density of a sample is usually inversely related to the amount of alcohol in it. As we established, alcohol is generally positively associated with the quality of wine due to the taste characteristics it brings out. If a wine has higher density, it indicates that there is less alcohol present, and the sample would feel either too sweet or watery. The positive relationship between a wine’s pH and the odds of good quality is less obvious: because higher pH values correspond to lower acidity, the positive coefficient suggests that, with the other variables held constant, judges tended to favor samples with a less tart and sour taste.

The most interesting results are the relationships between free sulfur dioxide and quality, between total sulfur dioxide and quality, and the interaction between type and total sulfur dioxide. In both logistic regression models (with and without the interaction), the amount of free sulfur dioxide is positively associated with the quality of wine, whereas the total amount is negatively associated with the response. The result for free sulfur dioxide is expected, as it is often used in wine production to improve a wine’s longevity and taste. In addition, larger amounts of free sulfur dioxide indicate that the winemaker is careful and aims to produce a longer-lived wine, which tends to be regarded more highly by critics.

To better understand the results, it is crucial to first explain the chemical composition of the variable and its effect on wine. Total sulfur dioxide encompasses both free SO2 (which is a statistically significant predictor) and bound SO2. After conducting additional research, we found a possible explanation for the change in the variable’s significance across wine types. Namely, bound SO2 (which is included in total sulfur dioxide but not in free sulfur dioxide) is used exclusively in red wine production for colour retention. This may explain why total sulfur dioxide had a statistically significant coefficient in the model that only included red wines, as only this type of wine requires colour preservation. Likewise, free sulfur dioxide is used primarily for white wines, and therefore it is not a reliable predictor of the odds of good quality for red wines.

These explanations are supported by the models built for each type of wine (Table 10-11). The model that only predicts the probability of a good score for white wines showed a positive and statistically significant association between free sulfur dioxide and the response variable. Likewise, the “red wine” model produced a statistically significant association between total sulfur dioxide and the odds of a good score, although that association was negative. The reasons for the negative coefficient remain unclear; the relationship, which would otherwise be expected to be positive, could be influenced by the specific type of wine produced and the quality of the sulfur added.

Furthermore, there may be a plethora of hidden interactions in the model that affect the relationship between the odds of good quality and total sulfur dioxide. The sheer number of numeric predictors made locating interactions time-consuming. There were 11 numeric predictors to choose from, and some interactions we thought could be present in the linear regression model did not automatically carry over to the logistic regression model. On a similar note, the model used in this paper had highly collinear variables. Sugar levels, alcohol concentration, and density are closely related to each other, as alcohol is inversely related to the density of wine, and density is positively correlated with the amount of sugar. The model retained all three variables because excluding any one of them led to a large drop in performance. However, future research should consider including only one of alcohol concentration and density.
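
One way to quantify the collinearity described above is the variance inflation factor (VIF), which was not computed for this report. The sketch below uses simulated data in which density is built from alcohol and sugar; the relationship and all values are made up for illustration.

```r
set.seed(42)
n <- 300

# Simulated predictors: density falls with alcohol and rises with sugar (made up).
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
sugar   <- rexp(n, rate = 1 / 5)
density <- 1000 - 1.2 * alcohol + 0.4 * sugar + rnorm(n, sd = 0.5)
X <- data.frame(alcohol, sugar, density)

# VIF for each predictor: regress it on the others and use 1 / (1 - R^2).
vif <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 1)
```

Large values indicate that a predictor is nearly a linear combination of the others, which inflates the standard errors of its coefficient.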

Confounding variables not captured in the data are another important area of concern. Because the dataset was not collected as part of an experiment, there could be confounding variables that affected how each wine sample performed with the judges. Although the wine samples share the same origin and designation, the age of each wine, the types of grapes the wines were produced from, and the conditions in which the samples were stored are unknown. All these variables could be affecting the relationships present in the model at hand. Lastly, judges’ scores are not an entirely objective measure of quality, as they can vary from judge to judge, which presents a challenge to the replicability of this research project.

To remedy this issue, this report could be extended by using a more objective response variable, market prices, and using judges’ scores as another explanatory variable. While our model only predicts the probability of a wine being good or bad, creating a model that uses specific scores and connects them with the subsequent prices of wine would shed some light on whether manufacturers can alter the amounts of specific chemical components to achieve a higher price for their product, and whether the score given by a judge changes the economic performance of a sample on the market.

Another drawback that stems from the nature of the response variable is the model’s inability to predict a specific score. Because the model includes mostly chemical components, it will be of use only to a narrow audience of winemakers and experts. The logistic regression in this report would be most beneficial to winemakers who seek to produce wine of good quality without making it top-notch. However, the model would not be of much use to business owners whose goal is to produce the best wine possible, aiming for scores of 9-10. We were unable to come up with a linear regression model with a strong ability to explain the variability in the data, due to the limited spread of the response variable and the inability to test the linearity conditions.

Despite the limitations of the models, the results they produced are consonant with those of the larger field. The reference article on the effect of the chemical profile on the sensory evaluation of red wines (Verissimo, C. M., 2021) found that the attributes related to possible defects include titratable acidity and volatile acidity. While we did not include titratable acidity in our variables, we did conclude that volatile acidity is associated with a lower quality score. The study also reported that panelists with short-term training are able to perceive the influence of physicochemical variables on wines. Additionally, its correlation analysis enabled the identification of production adjustments and a better understanding of the consumer perception of a complex product.

A study conducted in 2009 was also motivated by building an accurate and efficient model for wine analysis (Cortez et al., 2009). In our analysis, we used the classic logistic/multiple regression approach. Their study instead proposed a data mining approach aiming to extract high-level knowledge from raw data, using techniques such as neural networks and support vector machines. The goal of their study was not only to find out what makes the best wine, but also to evaluate data mining methods and techniques, as well as their predictive patterns and accuracy. The results of the study highlighted factors that can be used to improve wine quality. These factors include monitoring the grape sugar content and controlling variables such as alcohol concentration and volatile acidity. This is consistent with our findings that alcohol content and residual sugar are positively associated with wine quality, while volatile acidity, density, and chlorides are negatively associated with it. While there is a wide range of techniques and algorithms that can be used for wine analysis, we are able to identify certain common key predictors. Future research can continue to improve on these techniques in ways that support the growth of the wine industry.

Annotated Appendix

Data Cleaning

To transform the data, we combined “redwine” and “whitewine” into one dataset stored in “allwine”, which is the main dataset we used throughout the modeling. The ‘quality’ column was converted to an integer variable for better graphics. There were no missing data in either data set.

Outlier Analysis

No outliers were removed as the emplogit for the main model satisfied the linearity condition.

Hypothesis Testing

Based on the correlation matrix (Figure 1), we wanted to first consider the four predictors that have the strongest association with the response variable. Namely, we considered alcohol, volatile acidity, density, and chlorides, which had correlation coefficients of 0.44, -0.26, -0.31, and -0.20 respectively. We then used scatter plots and smoothers to examine how each of the selected variables related to the response, while also considering the type variable to look for possible interactions. After examining the scatter plots, we only saw an interaction effect between type and chlorides, as the lines representing the two types of wine have noticeably different slopes. To test this hypothesis, we will run a linear regression model with the interaction term.

Interpreting coefficients

Most of the interpretations were based on the final logistic regression model. To arrive at these interpretations, we exponentiated the model coefficients (log odds) to obtain odds ratios.

Tables and Figures

emplogitplot1(cat_score ~ alcohol + volatile_acidity + sulphates + 
    residual_sugar + chlorides + type1 + free_sulfur_dioxide + 
    total_sulfur_dioxide + pH + fixed_acidity + density, data = allwine_logistic, 
    out = TRUE, ngroup = 10)

emplogit plot, Arrow 3

##    Group Cases      XMin XMax     XMean NumYes  Prop AdjProp      Logit
## 1      1   707  8.000000  9.1  8.918317     93 0.132   0.132 -1.8833898
## 2      2   798  9.200000  9.4  9.307498     14 0.018   0.018 -3.9992196
## 3      3   562  9.500000  9.6  9.533926     21 0.037   0.038 -3.2314283
## 4      4   535  9.633333  9.9  9.799159     29 0.054   0.055 -2.8438517
## 5      5   693  9.950000 10.3 10.126215     68 0.098   0.099 -2.2083854
## 6      6   695 10.400000 10.7 10.528465    111 0.160   0.160 -1.6582281
## 7      7   652 10.750000 11.1 10.941150    163 0.250   0.250 -1.0986123
## 8      8   564 11.200000 11.5 11.344651    161 0.285   0.286 -0.9148912
## 9      9   680 11.550000 12.3 11.959407    288 0.424   0.424 -0.3063742
## 10    10   611 12.333333 14.9 12.819673    329 0.538   0.538  0.1522937

# Table 4
anova(nulllog, finallogmod, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: cat_score ~ 1
## Model 2: cat_score ~ alcohol + volatile_acidity + sulphates + residual_sugar + 
##     chlorides + type1 + free_sulfur_dioxide + total_sulfur_dioxide + 
##     pH + fixed_acidity + density + total_sulfur_dioxide:type1
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      6496     6439.6                          
## 2      6484     5064.5 12   1375.1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Table 5
summary(finallogmod)

## 
## Call:
## glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
##     residual_sugar + chlorides + type1 + free_sulfur_dioxide + 
##     total_sulfur_dioxide + pH + fixed_acidity + density + total_sulfur_dioxide:type1, 
##     family = binomial, data = allwine_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6258  -0.6264  -0.3692  -0.1618   3.0086  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                432.590020  67.144848   6.443 1.17e-10 ***
## alcohol                      0.422343   0.081280   5.196 2.03e-07 ***
## volatile_acidity            -0.360168   0.036892  -9.763  < 2e-16 ***
## sulphates                    2.521693   0.291719   8.644  < 2e-16 ***
## residual_sugar               0.227178   0.026723   8.501  < 2e-16 ***
## chlorides                   -0.081727   0.025215  -3.241 0.001190 ** 
## type1                        1.545325   0.317531   4.867 1.13e-06 ***
## free_sulfur_dioxide          0.010263   0.002973   3.452 0.000557 ***
## total_sulfur_dioxide        -0.001982   0.001418  -1.398 0.162068    
## pH                           2.577454   0.362937   7.102 1.23e-12 ***
## fixed_acidity                0.478028   0.066084   7.234 4.70e-13 ***
## density                     -0.454322   0.068174  -6.664 2.66e-11 ***
## type1:total_sulfur_dioxide  -0.012521   0.003498  -3.579 0.000344 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6439.6  on 6496  degrees of freedom
## Residual deviance: 5064.5  on 6484  degrees of freedom
## AIC: 5090.5
## 
## Number of Fisher Scoring iterations: 6

# Table 6
summary(finallogmod1)

## 
## Call:
## glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
##     residual_sugar + chlorides + type1 + free_sulfur_dioxide + 
##     total_sulfur_dioxide + pH + fixed_acidity + density, family = binomial, 
##     data = allwine_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7977  -0.6285  -0.3687  -0.1769   3.0537  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          405.043172  65.553568   6.179 6.46e-10 ***
## alcohol                0.445830   0.079964   5.575 2.47e-08 ***
## volatile_acidity      -0.357205   0.036655  -9.745  < 2e-16 ***
## sulphates              2.449671   0.285114   8.592  < 2e-16 ***
## residual_sugar         0.220186   0.026250   8.388  < 2e-16 ***
## chlorides             -0.077348   0.024865  -3.111 0.001866 ** 
## type1                  0.785649   0.244041   3.219 0.001285 ** 
## free_sulfur_dioxide    0.010886   0.002952   3.688 0.000226 ***
## total_sulfur_dioxide  -0.003796   0.001330  -2.855 0.004300 ** 
## pH                     2.606564   0.360909   7.222 5.11e-13 ***
## fixed_acidity          0.483907   0.065558   7.381 1.57e-13 ***
## density               -0.426715   0.066582  -6.409 1.47e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6439.6  on 6496  degrees of freedom
## Residual deviance: 5079.3  on 6485  degrees of freedom
## AIC: 5103.3
## 
## Number of Fisher Scoring iterations: 6

# Table 7
summary(lassomodel)

## 
## Call:
## glm(formula = cat_score ~ fixed_acidity + volatile_acidity + 
##     residual_sugar + chlorides + free_sulfur_dioxide + pH + sulphates + 
##     alcohol + total_sulfur_dioxide, data = allwine_logistic)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.87171  -0.23381  -0.09198   0.04866   1.07596  
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.3791316  0.1343650 -10.264  < 2e-16 ***
## fixed_acidity         0.0052278  0.0042072   1.243  0.21406    
## volatile_acidity     -0.0331764  0.0032274 -10.280  < 2e-16 ***
## residual_sugar        0.0061537  0.0011486   5.358 8.72e-08 ***
## chlorides            -0.0049353  0.0015700  -3.144  0.00168 ** 
## free_sulfur_dioxide   0.0016790  0.0003667   4.578 4.78e-06 ***
## pH                    0.0493245  0.0328550   1.501  0.13333    
## sulphates             0.2193904  0.0347232   6.318 2.82e-10 ***
## alcohol               0.1313215  0.0043842  29.953  < 2e-16 ***
## total_sulfur_dioxide -0.0005145  0.0001310  -3.927 8.70e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1289921)
## 
##     Null deviance: 1026.00  on 6496  degrees of freedom
## Residual deviance:  836.77  on 6487  degrees of freedom
## AIC: 5143.8
## 
## Number of Fisher Scoring iterations: 2

# Table 8
t(cv2)

##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.8201015 0.8182237 0.8198247 0.8189632 0.8194552

mean(cv2)

## [1] 0.8193136

# Table 9
lr2spec

## [1] 0.9535300 0.9521147 0.9524833 0.9539014 0.9523670

mean(lr2spec)

## [1] 0.9528793

# Table 10
summary(finallogmodR)

## 
## Call:
## glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
##     residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
##     pH + fixed_acidity + density, family = binomial, data = all_red)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0376  -0.4305  -0.2211  -0.1217   2.9988  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          231.721801 106.897675   2.168 0.030182 *  
## alcohol                0.777758   0.126725   6.137 8.39e-10 ***
## volatile_acidity      -0.288552   0.064750  -4.456 8.33e-06 ***
## sulphates              3.732812   0.541171   6.898 5.29e-12 ***
## residual_sugar         0.241220   0.073634   3.276 0.001053 ** 
## chlorides             -0.083778   0.033076  -2.533 0.011312 *  
## free_sulfur_dioxide    0.010129   0.012180   0.832 0.405610    
## total_sulfur_dioxide  -0.016225   0.004902  -3.310 0.000933 ***
## pH                     0.106809   0.984552   0.108 0.913611    
## fixed_acidity          0.294208   0.122230   2.407 0.016084 *  
## density               -0.246467   0.109213  -2.257 0.024024 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1269.92  on 1598  degrees of freedom
## Residual deviance:  871.32  on 1588  degrees of freedom
## AIC: 893.32
## 
## Number of Fisher Scoring iterations: 6

# Table 11
summary(finallogmodW)

## 
## Call:
## glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
##     residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
##     pH + fixed_acidity + density, family = binomial, data = all_white)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1241  -0.6703  -0.4113  -0.1799   2.7724  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           6.441e+02  9.396e+01   6.855 7.15e-12 ***
## alcohol               1.240e-01  1.133e-01   1.094  0.27392    
## volatile_acidity     -3.609e-01  4.784e-02  -7.543 4.60e-14 ***
## sulphates             2.144e+00  3.468e-01   6.182 6.34e-10 ***
## residual_sugar        2.967e-01  3.561e-02   8.333  < 2e-16 ***
## chlorides            -1.268e-01  3.786e-02  -3.349  0.00081 ***
## free_sulfur_dioxide   8.755e-03  3.131e-03   2.796  0.00517 ** 
## total_sulfur_dioxide -4.766e-04  1.504e-03  -0.317  0.75136    
## pH                    3.361e+00  4.262e-01   7.885 3.16e-15 ***
## fixed_acidity         5.307e-01  8.969e-02   5.917 3.28e-09 ***
## density              -6.670e-01  9.524e-02  -7.003 2.50e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5116.8  on 4897  degrees of freedom
## Residual deviance: 4146.7  on 4887  degrees of freedom
## AIC: 4168.7
## 
## Number of Fisher Scoring iterations: 6

#Volatile acidity, Table 12 
favstats(~volatile_acidity, data = allwine)

##  min  Q1 median Q3  max    mean       sd    n missing
##  0.8 2.3    2.9  4 15.8 3.39666 1.646365 6497       0

#Alcohol, Table 13
favstats(~alcohol, data = allwine)

##  min  Q1 median   Q3  max    mean       sd    n missing
##    8 9.5   10.3 11.3 14.9 10.4918 1.192712 6497       0

#Density, Table 14 
favstats(~density, data = allwine)

##     min     Q1 median     Q3     max     mean       sd    n missing
##  987.11 992.34 994.89 996.99 1038.98 994.6966 2.998673 6497       0

#Chlorides, Table 15
favstats(~chlorides, data = allwine)

##  min  Q1 median  Q3  max     mean      sd    n missing
##  0.9 3.8    4.7 6.5 61.1 5.603386 3.50336 6497       0

#Quality, Table 16
favstats(~quality, data = allwine)

##  min Q1 median Q3 max     mean        sd    n missing
##    3  5      6  6   9 5.818378 0.8732553 6497       0
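
The favstats() summaries in Tables 12 to 16 could also be collected into one tibble that prints as a single table. The following is only a minimal sketch of that idea using dplyr's summarize(); `toy_wine` and its values are hypothetical stand-ins for `allwine`, and only the column names are borrowed from this report.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `allwine`; the values are made up for illustration.
toy_wine <- tibble(
  alcohol   = c(9.4, 9.8, 10.5, 11.2, 12.8),
  chlorides = c(7.6, 9.8, 4.5, 3.9, 5.0)
)

# One row per variable, one column per summary statistic.
summary_tbl <- toy_wine %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarize(
    min    = min(value),
    median = median(value),
    max    = max(value),
    mean   = mean(value),
    sd     = sd(value),
    n      = n(),
    .groups = "drop"
  )

summary_tbl
```

A tibble like `summary_tbl` can then be passed to kbl(), in the same way the variable chart is formatted in the source code of this report.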

Figure: Volatile acidity

Figure: Alcohol

Figure: Density

Figure: Chlorides

Figure: Quality

Figure: Volatile acidity, chlorides, alcohol vs. quality

Figure: Red vs. white wine quality distribution (bar chart)

Figure: Red vs. white wine quality distribution (boxplot)

Figure: Alcohol vs. quality

Figure: pH vs. quality

Figure: Density vs. fixed acidity

Figure: Density vs. alcohol

Figure: Sulphates vs. chlorides

Figure: Condition plots for AIC model

Figure: Distribution of cases across the two wine types

CIR office hour

We first went through the structure of the report as a whole. The variable chart is not required to appear in the main body, but we agreed that, if space allows, readers might find it helpful to have the variables and their descriptions in a single table (instead of a separate list). Also, because we have been careful not to exceed the page limit, most of our plots and tables are stored in the appendix. For now, we used the favstats() function to produce statistical summaries of the predictor variables, but its output is difficult to put in a table format. We could instead build a tibble using the summarize() function, which will print out nicely. Before turning in the final report, we will select a number of helpful figures and move them to the main sections. Though it might be unnecessary, it might be worthwhile to format the equations using LaTeX, which can be done later.
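
As an illustration of the LaTeX formatting mentioned above, the model in Table 6 (finallogmod1) might be typeset along the following lines. This is a sketch only: the symbols $\hat{\pi}$ and $\hat{\beta}_j$ are notation assumed here and do not appear elsewhere in the report. The worked odds ratio uses the alcohol estimate from Table 6.

```latex
% pi-hat: estimated probability that cat_score = 1 (quality > 6)
\[
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right)
  = \hat{\beta}_0
  + \hat{\beta}_1\,\text{alcohol}
  + \hat{\beta}_2\,\text{volatile\_acidity}
  + \dots
  + \hat{\beta}_{11}\,\text{density}
\]

% Worked example with the alcohol estimate from Table 6:
\[
\hat{\beta}_{\text{alcohol}} = 0.445830
\quad\Longrightarrow\quad
e^{0.445830} \approx 1.56
\]
```

Read this way, holding the other predictors fixed, a one-unit increase in alcohol multiplies the estimated odds of a high quality score by roughly 1.56.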

References

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling 
wine preferences by data mining from physicochemical properties. Decision 
Support Systems, 47(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016

This article uses a data mining approach to predict human wine tasting preferences from analytical data. The main objective of the study is to build a computationally efficient model with accurate predictive performance that not only supports oenologists’ wine evaluations, but also improves the quality and speed of their decision making. Since many wine rating systems and sensory analyses are performed by humans, the results are prone to subjective factors.

The conclusion is that the proposed data-driven approach is based on objective tests and thus it can be integrated into a decision support system. The results of the study highlighted factors that can be used to improve wine quality. These factors include monitoring the grape sugar content and controlling variables such as alcohol concentration and volatile acidity.

Veríssimo, C. M., Alcântara, R. L., Lima, L. L., Pereira, G. E., & Maciel, 
M. I. (2021). Impact of chemical profile on sensory evaluation of Tropical 
Red Wines. International Journal of Food Science & Technology, 56(7), 
3588–3599. https://doi.org/10.1111/ijfs.14987

This article examines the correlation between the physicochemical properties and the sensory variables of red wines. Sensory descriptors were established by trained panellists, and sensory analyses were performed. A bivariate correlation matrix was generated to analyze the correlation between sensory and chemical variables and to display the correlation coefficients.

Among the sensory variables, the term ‘sweet-ish’ was positively correlated with ‘alcohol content’. The perception of ‘sour’ taste varied among samples; variables that interact with the perception of ‘sour’ include pH, titratable acidity, and volatile acidity. There is also a positive correlation between ‘bitter’ and ‘anthocyanins’, which suggests that bitter perception is high in aged wines. Furthermore, the interaction between these extrinsic and intrinsic variables in the product is predominant in the perception of the sensory attributes of wines.

Source code and annotated outputs

library(knitr)
knitr::opts_chunk$set(fig.pos = 'h', out.extra = "")
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(mosaic)
library(dplyr)
library(MASS)
library(car)
library(leaps)
library(Stat2Data)
library(glmnet)
library(corrplot)
library(boot)
library(kableExtra)
library(GGally)
library(graphics)
library(janitor)
library(cowplot)
library(gridExtra)
library(ggthemes)
library(patchwork)
library(tidyverse)

library(lemon)
knit_print.data.frame <- lemon_print

allwine <- read_csv("wine data/allwine.csv") %>%
     as_tibble() %>%
     clean_names() %>%
     rename(pH = p_h)

dictionary <- read_csv("wine data/272-data dictionary.csv")

red_wine <- read_csv("wine data/red_wine.csv")

white_wine <- read_csv("wine data/white_wine.csv")

dictionary_2<- read_csv("wine data/dictionary_with_desc.csv")
allwine <- allwine %>%
  mutate(type1 = ifelse(type == "red", 1, 0),
         volatile_acidity = volatile_acidity * 10,
         density = density*1000,
         chlorides = chlorides *100)
p1 <- ggplot(allwine, aes(volatile_acidity, quality, color = type)) +
  geom_point() +
  geom_smooth(se = FALSE, method= "lm") +
  theme_classic() +
  theme(legend.position = "none") +
 coord_fixed(ratio=2)

p2 <- ggplot(allwine, aes(chlorides, quality, color = type)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  theme_classic() +
  theme(legend.position = "none") +
  labs(y = NULL) +
  coord_fixed(ratio=10)

p3 <- ggplot(allwine, aes(alcohol, quality, color = type)) +
  geom_point() +
  geom_smooth(se = FALSE, method= "lm") +
  theme_classic() +
  labs(y = NULL) +
  coord_fixed(ratio = 1)
#selecting a better model 

#Using AIC 
fullmodel <- lm(quality ~ fixed_acidity + volatile_acidity + citric_acid + 
                residual_sugar + chlorides + free_sulfur_dioxide + 
                total_sulfur_dioxide + density + pH + sulphates + alcohol + type1,
                data = allwine)

stepAIC(fullmodel, direction = "both", trace = FALSE)

finalmodel1 <- lm(quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
                 free_sulfur_dioxide + total_sulfur_dioxide + density + 
                   pH + sulphates + alcohol + type1, data = allwine)
par(mfrow = c(2, 2), mar = c(2, 2, 2, 2)) 
summary(finalmodel1)
plot(finalmodel1)

#Using lasso 

fullmodela <- model.matrix(fullmodel)[,-1]
response <- as.numeric(allwine$quality)
fit.lasso <- glmnet(fullmodela, response, alpha = 1)
fit_lasso_cv <- cv.glmnet(fullmodela, response, alpha = 1, nfolds = 5)
fit_lasso_cv$lambda.min
fit_lasso_cv$lambda.1se
number_preds <- 12
a1 = as.matrix(coef(fit_lasso_cv, s = "lambda.min"))
a2 = coef(fit_lasso_cv, s = "lambda.1se")[1:(number_preds+1)]
cbind(a1, a2)

#Predictors used in the lasso-based model: volatile_acidity, residual_sugar, 
#free_sulfur_dioxide, total_sulfur_dioxide, sulphates, alcohol, pH

finalmodel2 <- lm(quality ~ volatile_acidity + residual_sugar + 
                 free_sulfur_dioxide + total_sulfur_dioxide + 
                   pH + sulphates + alcohol, data = allwine)
summary(finalmodel2)


anova(finalmodel1, finalmodel2)

#Transforming the variables that had problems with linearity 
allwine_end <- allwine %>%
  mutate(end_score = log(quality/(10 - quality)),
         new_va = sqrt(volatile_acidity),
         new_sugar = (residual_sugar)^0.25)
ggplot(allwine_end, aes(new_sugar, colour = type)) + geom_histogram(binwidth = 0.01, fill = "white")
ggplot(allwine_end, aes(new_va, colour = type)) + geom_histogram(binwidth = 0.01, fill = "white")

finalmodel3 <- lm(end_score ~ fixed_acidity + new_va + 
                 free_sulfur_dioxide + total_sulfur_dioxide + density + 
                   pH + sulphates + alcohol + type1 + new_sugar, data = allwine_end)
#par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
knit_print.table <- lemon_print
summary(finalmodel3)

#plot(finalmodel3)
# Trying out logistic regression 
favstats(~quality, data = allwine)

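# Binary response: 1 = high quality (quality > 6), 0 = otherwise
allwine_logistic <- allwine %>%
  mutate(cat_score = ifelse(quality > 6, 1, 0)) 
# Drop the first column of the data frame
allwine_logistic <- allwine_logistic[-1]
# Full model; used only to supply the upper scope for stepwise selection
fulllogitmodel <- glm(cat_score ~ fixed_acidity + citric_acid + chlorides 
                + free_sulfur_dioxide + 
                total_sulfur_dioxide + density + pH + sulphates + alcohol + type1
                + volatile_acidity + residual_sugar,
                family = binomial, data = allwine_logistic)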
#AIC for logistic regression
nulllog <- glm(cat_score ~ 1, family = binomial, data = allwine_logistic)
step(nulllog, direction = "both", scope = formula(fulllogitmodel))

#Checking conditions
emplogitplot1(cat_score ~ alcohol + volatile_acidity + sulphates + 
    residual_sugar + chlorides + type1 + free_sulfur_dioxide + 
    total_sulfur_dioxide + pH + fixed_acidity + density, data = allwine_logistic, 
    out = TRUE, ngroup = 10)

finallogmod1 <- glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
    residual_sugar + chlorides + type1 + free_sulfur_dioxide + 
    total_sulfur_dioxide + pH + fixed_acidity + density, family = binomial, 
    data = allwine_logistic)

finallogmod <- glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
    residual_sugar + chlorides + type1 + free_sulfur_dioxide + 
    total_sulfur_dioxide + pH + fixed_acidity + density
    + total_sulfur_dioxide:type1, family = binomial, 
    data = allwine_logistic)

ggplot(data = allwine_logistic, aes(x = volatile_acidity, fill = as.factor(cat_score))) + 
  geom_density(position = 'fill', alpha = 0.5) 

anova(nulllog, finallogmod, test = "Chisq") 

summary(finallogmod)
summary(finallogmod1)

#coefficients 

exp(finallogmod$coefficients)



#separate models 

all_red <- allwine_logistic %>%
  filter(type1 == 1)

finallogmodR <- glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
    residual_sugar + chlorides + free_sulfur_dioxide + 
   total_sulfur_dioxide+ pH + fixed_acidity + density, family = binomial, 
    data = all_red)
summary(finallogmodR)

all_white <- allwine_logistic %>%
  filter(type1 == 0)

finallogmodW <- glm(formula = cat_score ~ alcohol + volatile_acidity + sulphates + 
    residual_sugar + chlorides + free_sulfur_dioxide + 
   total_sulfur_dioxide+ pH + fixed_acidity + density, family = binomial, 
    data = all_white)
summary(finallogmodW)


predict.regsubsets <- function (object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}

k = 5
set.seed(2)

folds <- sample(1:k, nrow(allwine), replace = TRUE) 

cv.errors <- matrix(NA, k, 10, dimnames = list(NULL, paste(1:10)))
for(j in 1:k){
  best.fit <- regsubsets(quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
                 free_sulfur_dioxide + total_sulfur_dioxide + density + 
                   pH + sulphates + alcohol + type1, data = allwine[folds != j, ], 
      nvmax = 10)
  for(i in 1:10) {
    pred <- predict(best.fit, allwine[folds == j, ], id = i)
    cv.errors[j, i] <- mean((allwine$quality[folds == j] - pred)^2)
  }
}
cv.errors

mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
plot(mean.cv.errors, type = 'b')

#Cross-fold validation for logistic regression 

#Overall accuracy: proportion of cases classified correctly at a 0.5 cutoff
#(correct Yes's and correct No's are counted together, so this is not sensitivity)
costSe <- function(r, pi = 0) {mean((r == 1 & pi > 0.5) | (r == 0 & pi < 0.5))}
cv2 <- matrix(0, nrow = 5, ncol = 1)   

#Repeat 5-fold CV five times; delta[2] is the bias-adjusted estimate
for(i in 1:5) {cv2[i] = cv.glm(data = allwine_logistic, glmfit = finallogmod, 
                                cost = costSe, K = 5)$delta[2]}
t(cv2)          
mean(cv2)

#Specificity (the proportion of correctly predicted No's)
costSp <- function(r, pi = 0) {sum(r == 0 & pi < 0.5) / sum(r == 0)}
lr2spec <- rep(0, 5)

for(i in 1:5)  {lr2spec[i] <- cv.glm(data = allwine_logistic, glmfit = finallogmod, 
                                     cost = costSp, K = 5)$delta[2]}
lr2spec
mean(lr2spec)

#Using Lasso 

X <- model.matrix(cat_score ~ ., allwine_logistic)[,-1] 
y <- as.factor(allwine_logistic$cat_score)

# Had to add family = "binomial" for logistic regression
fit.lasso <- glmnet(X, y, alpha = 1, family = "binomial")
plot(fit.lasso, xvar = "lambda", label = TRUE)

# Coefficients using lambda with minimum deviance, and 1 SE away
fit.lasso.cv <- cv.glmnet(X, y, alpha = 1, nfolds = 5, family = "binomial")
plot(fit.lasso.cv)
coef(fit.lasso.cv, s = "lambda.min")
coef(fit.lasso.cv, s = "lambda.1se")


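#Refitting the predictors kept by the lasso as a logistic regression 
#family = binomial is needed for a logistic fit; the Table 7 output in this
#report shows a gaussian family, i.e. it came from a fit without this argument
lassomodel <- glm(cat_score ~ fixed_acidity + volatile_acidity + residual_sugar +
                    chlorides + free_sulfur_dioxide + pH + sulphates + alcohol +
                    total_sulfur_dioxide,
                    family = binomial, data = allwine_logistic)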
summary(lassomodel)


anova(nulllog, lassomodel, test = "Chisq") 


kbl(dictionary_2, caption = "Variable Chart", booktabs = TRUE) %>%
  kable_styling(position = "center") %>%
  kable_classic(html_font = "Cambria") %>%
  column_spec(2, width = "15em") %>%
 row_spec(0, bold = TRUE)
par(mfrow = c(1,2))

corrplot(cor(red_wine))

corrplot(cor(white_wine))

par(mfrow= c(1,1))
# Table 2
summary(finallogmod)
# Table 3 
exp(finallogmod$coefficients)

ggplot(allwine, aes(total_sulfur_dioxide, fill = type)) +
  geom_density() +
  theme_classic()


#s1 + s2 + plot_layout()
emplogitplot1(cat_score ~ alcohol + volatile_acidity + sulphates + 
    residual_sugar + chlorides + type1 + free_sulfur_dioxide + 
    total_sulfur_dioxide + pH + fixed_acidity + density, data = allwine_logistic, 
    out = TRUE, ngroup = 10)
# Table 4
anova(nulllog, finallogmod, test = "Chisq") 
# Table 5
summary(finallogmod)
# Table 6
summary(finallogmod1)
# Table 7
summary(lassomodel)
# Table 8
t(cv2)          
mean(cv2)
# Table 9
lr2spec
mean(lr2spec)
# Table 10
summary(finallogmodR)
# Table 11
summary(finallogmodW)
#Volatile acidity, Table 12 
favstats(~volatile_acidity, data = allwine)
#Alcohol, Table 13
favstats(~alcohol, data = allwine)
#Density, Table 14 
favstats(~density, data = allwine)
#Chlorides, Table 15
favstats(~chlorides, data = allwine)
#Quality, Table 16
favstats(~quality, data = allwine)
#Volatile acidity 

ggplot(allwine, aes(volatile_acidity, color = type)) + geom_histogram(binwidth = 0.1, fill = "white")
#Alcohol 
ggplot(allwine, aes(alcohol, color = type)) + geom_histogram(binwidth = 1, fill = "white")
#density 
ggplot(allwine, aes(density, color = type)) + geom_histogram(binwidth = 1, fill = "white")
#chlorides 
ggplot(allwine, aes(chlorides, colour = type)) + geom_histogram(binwidth = 1, fill = "white")
#Quality 

ggplot(allwine, aes(quality, color = type)) + geom_histogram(binwidth = 1, fill = "white")

p1 + p2 + p3 + plot_layout(ncol = 3)
# red vs white quality distribution

ggplot(allwine, aes(x = as.factor(quality), fill = type)) +
  geom_bar(alpha = 0.8) +
  facet_wrap( ~type, nrow = 2) +
  theme_bw()
#(The differences in bar height come from the different numbers of observations for white and red wine. Both distributions nevertheless look approximately normal.)

ggplot(allwine, aes(type, quality, fill = type)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", col = "black") + 
  stat_summary(fun = mean, geom = "text", col = "black",    
               vjust = 1.5, aes(label = paste("Mean:", round(after_stat(y), digits = 2))))

# alcohol vs quality (pretty bad correlation)

ggplot(allwine, aes(as.factor(quality), alcohol)) +
 geom_boxplot() +
   facet_wrap( ~type)
# pH vs quality
ggplot(allwine, aes(pH, as.factor(quality))) +
  geom_col(aes(fill = quality)) +
  facet_wrap(~type) +
  coord_flip() +
  theme(axis.title.x=element_blank())
  
#density vs. fixed.acidity (highest,0.668)
ggplot(allwine, aes(density, fixed_acidity, color = type)) +
  geom_point(alpha = 1/8, position = position_jitter(height = 0 ,width = 0), size = 0.8) +
   geom_smooth(method = 'lm') +
  coord_cartesian(xlim = c(min(allwine$density), 1000) , ylim = c(0,14)) +
  labs(title = "Density vs. Fixed acidity") +
  theme_classic()

#density vs alcohol (second, -0.496)
ggplot(allwine, aes(density, alcohol, color = type)) +
  geom_point(alpha = 1/8, position = position_jitter(height = 0.1 ,width = 0), size = 0.8) +
   geom_smooth(method = 'lm') +
  coord_cartesian(xlim = c(min(allwine$density), 1000), ylim = c(5,15)) +
  labs(title = "Density vs. alcohol") +
  theme_classic()

# sulphates vs. chlorides

ggplot(allwine, aes(chlorides, sulphates, color = type)) +
  geom_point(alpha = .07, position = position_jitter(height = 0.02 ,width = 0.02), 
             size = 1) +
   geom_smooth(method = 'lm') +
  scale_x_log10() +
  coord_cartesian(ylim = c(min(allwine$sulphates), 1.005)) +
  labs(title = "Sulphates vs. chlorides") +
  theme_classic()

par(mfrow = c(2,2))
plot(finalmodel3)
#Barplot (count of samples by wine type)
ggplot(allwine_logistic, aes(x = as.factor(type1))) + 
  geom_bar(colour = "black", fill = "orange") + theme_clean() +
  labs(x = "Wine type; 1 = Red, 0 = White", y = "Count",
       title = "Cases by type")

Wine Quality Analysis

Claire Wu and Pavel-Christian