#Project Title: Data Analytics with Managerial Application Internship
#NAME: Nimit Dhalia
#EMAIL: lucky.dhalia8@gmail.com
#COLLEGE : IIT Roorkee
Wine is an alcoholic beverage made from grapes, generally Vitis vinifera, fermented without the addition of sugars, acids, enzymes, water, or other nutrients (source: Wikipedia). Of the many varieties of wine, two are known worldwide by name: red wine and white wine. Red wine is made from dark-colored grape varieties, and its production involves extracting color and flavor components from the grape skin. Its color ranges from violet, typical of young wines, through red for mature wines, to brown for older red wines. White wine can be straw-yellow, yellow-green, or yellow-gold. It is produced by fermenting the non-colored pulp of grapes, which are typically green or yellow.
To support the growth of the wine market, the industry is investing in new technologies for both winemaking and selling. Quality assessment is a key element in this context: it can be used to improve winemaking (by identifying the most influential factors) and to stratify wines, for example into premium brands (useful for setting prices). Quality is assessed by physicochemical and sensory tests. Physicochemical laboratory tests characterize a wine by measurements such as density, alcohol content, or pH, while sensory tests rely on human experts. The relationship between physicochemical and sensory analysis is complex and still not fully understood.
In this project, I investigate how physicochemical properties are related to wine quality, using analytical data available in the UCI repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). Such a study is useful for wine producers, who could use it to improve revenue, marketing strategy, and decision making.
The study concerns wine quality based on samples of physicochemical and sensory data for two types of Portuguese wine: red (1599 entries) and white (4898 entries). Once viewed as a luxury good, wine is now enjoyed by a wider range of consumers. Assessing wine is a challenging and complex task; wines are examined and evaluated for certification and to safeguard health. Wine contains many chemical compounds similar or identical to those in fruits, vegetables, and spices. Its sweetness is determined by the amount of residual sugar left after fermentation, relative to the acidity present. Portugal has a large number of local grape varieties and produces a wide range of wines with distinctive personalities. Portugal was the 10th largest wine-exporting country as of 2013 and the 11th largest wine-producing country in 2014, according to the Food and Agriculture Organization (FAO), an agency of the United Nations; this is the latest information available from the FAO.
In this study, I investigate the influence of different factors on the taste of wine. With this knowledge, a producer could price different classes of wine strategically and choose a suitable type of grape as the raw material for a desired wine. I study empirically how acidity, residual sugar, alcohol, and other chemical properties affect the taste, and thus the quality score, of wine. My regression analysis shows that wine quality can be explained, with a certain level of accuracy, by four physicochemical predictors for each wine type (red and white). The two types differ in nature and inherent quality, so a separate regression model is fitted for each. For red wine, the response variable quality is related to alcohol, total sulfur dioxide, volatile acidity, and sulphates; for white wine, the four predictors are alcohol, total sulfur dioxide, and volatile acidity (as for red wine), plus density.
Wine is a beverage made from fermented grape and other fruit juices, with a relatively low alcohol content. Wine quality is graded on taste and vintage, and wine tasting is as ancient as wine itself. Beyond flavor, many other factors or attributes come into consideration when assessing quality. The dataset I chose to analyze, ‘Wine Quality’, represents the quality of white and red wines based on different physicochemical attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol). The quality score for each wine is on a scale from 0 (lowest) to 10 (highest); the scores observed in the data range from 3 to 9. This analysis uncovers some important relationships between the chemical content of wine, such as acidity and sugar levels, and its quality. The dataset covers a wide and distinct range of chemical and acidic combinations for the two types of wine. With careful data analysis, we can unearth a handful of important and interesting insights that help predict wine quality and that are also valuable to the economic, financial, and business functions of the producing company.
The datasets are publicly available for research purposes, and the details are described in [Cortez et al., 2009]. I collected the data from the UCI repository website (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). The dataset is large enough to rely on its results: it contains 4898 white and 1599 red samples of Vinho Verde, a wine from the northwest region of Portugal.
These datasets include physicochemical and sensory data for red and white Vinho Verde wine samples. The data were collected from May 2004 to February 2007 using only protected designation of origin samples that were tested at the official certification entity (CVRVV). The CVRVV is an inter-professional organization with the goal of improving the quality and marketing of Vinho Verde. The data were recorded by a computerized system (iLab), which automatically manages the process of wine sample testing from producer requests to the laboratory and sensory analysis. Each entry denotes a given test (analytical or sensory), and the final database was exported into a single sheet (.csv).
Description of attributes:
Fixed acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
Volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste
Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
Chlorides: the amount of salt in the wine
Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
Total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
Alcohol: the percent alcohol content of the wine
Quality: output variable (based on sensory data, score between 0 and 10)
\[Quality_r= \alpha_0 + \alpha_1 VolatileAcidity + \alpha_2 Alcohol + \alpha_3 Sulphates + \alpha_4 TotalSulfurDioxide + \epsilon\]
\[Quality_w= \alpha_0 + \alpha_1 Density + \alpha_2 TotalSulfurDioxide + \alpha_3 Alcohol + \alpha_4 VolatileAcidity + \epsilon\]
redWine<- read.csv("winequality-red.csv", sep = ";")
rModel2<- quality ~ volatile.acidity + alcohol + sulphates + total.sulfur.dioxide
rfit2<- lm(rModel2 , data=redWine)
# OLS Model
summary(rfit2)
##
## Call:
## lm(formula = rModel2, data = redWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72716 -0.38486 -0.06503 0.44980 2.13257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8258128 0.2006892 14.081 < 2e-16 ***
## volatile.acidity -1.1985632 0.0966011 -12.407 < 2e-16 ***
## alcohol 0.2953105 0.0160331 18.419 < 2e-16 ***
## sulphates 0.7121396 0.1005146 7.085 2.08e-12 ***
## total.sulfur.dioxide -0.0022354 0.0005108 -4.376 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.655 on 1594 degrees of freedom
## Multiple R-squared: 0.3438, Adjusted R-squared: 0.3421
## F-statistic: 208.8 on 4 and 1594 DF, p-value: < 2.2e-16
For red wine, I model the effect of volatile acidity, alcohol, total sulfur dioxide, and sulphates on the response variable quality with a simple linear model estimated by ordinary least squares (OLS). The negative coefficients indicate that wines with higher volatile acidity or total sulfur dioxide tend to receive lower quality scores and are more likely to be judged faulty. In contrast, wines with higher alcohol and sulphate levels tend to be rated as better quality (Table 1).
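As a worked example of reading these estimates, the sketch below plugs the coefficients from the summary output above into the fitted equation for a hypothetical red wine. Its predictor values are assumed to equal the rounded red-wine sample means from the descriptive statistics; the names `coefs` and `wine` are illustrative only.

```r
# Coefficients copied from the summary(rfit2) output above
coefs <- c(
  intercept            =  2.8258128,
  volatile.acidity     = -1.1985632,
  alcohol              =  0.2953105,
  sulphates            =  0.7121396,
  total.sulfur.dioxide = -0.0022354
)

# Hypothetical wine: predictor values set to the rounded red-wine sample means
wine <- c(
  volatile.acidity     = 0.53,
  alcohol              = 10.42,
  sulphates            = 0.66,
  total.sulfur.dioxide = 46.47
)

# Predicted quality = intercept + sum of (coefficient * predictor value)
predicted <- coefs[["intercept"]] + sum(coefs[names(wine)] * wine)

# Same wine with alcohol raised by one percentage point, other predictors fixed
wine_plus <- wine
wine_plus[["alcohol"]] <- wine[["alcohol"]] + 1
predicted_plus <- coefs[["intercept"]] + sum(coefs[names(wine_plus)] * wine_plus)

print(round(predicted, 2))
print(round(predicted_plus - predicted, 4))
```

The prediction lands close to the sample mean quality of about 5.64, as expected for an OLS fit evaluated near the predictor means, and the difference between the two predictions equals the alcohol coefficient.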
whiteWine<- read.csv("winequality-white.csv", sep = ";")
wModel2<- quality ~ density + total.sulfur.dioxide + alcohol + volatile.acidity
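# Fit on the white-wine data. The summary printed below shows data = redWine
# and 1594 residual degrees of freedom, so it reflects the red-wine samples.
wfit2<- lm(wModel2 , data=whiteWine)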
# OLS Model
summary(wfit2)
##
## Call:
## lm(formula = wModel2, data = redWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.61607 -0.39286 -0.06512 0.46152 2.19572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.694e+01 1.026e+01 -1.650 0.099130 .
## density 2.010e+01 1.019e+01 1.972 0.048737 *
## total.sulfur.dioxide -1.929e-03 5.169e-04 -3.732 0.000196 ***
## alcohol 3.203e-01 1.874e-02 17.088 < 2e-16 ***
## volatile.acidity -1.353e+00 9.524e-02 -14.211 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6644 on 1594 degrees of freedom
## Multiple R-squared: 0.3248, Adjusted R-squared: 0.3231
## F-statistic: 191.7 on 4 and 1594 DF, p-value: < 2.2e-16
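For white wine, the regression output above suggests that, as for red wine, quality is adversely affected by higher volatile acidity and total sulfur dioxide, while higher density and alcohol are associated with better quality (Table 2). Holding the other predictors constant, a one-unit increase in density corresponds to a 20.10 increase in the quality score; since density varies by only about 0.05 across the white-wine samples, a more realistic change of 0.01 corresponds to roughly 0.2 points. Note that the Call line of this output reads data = redWine and the residual degrees of freedom are 1594, so these estimates were computed on the red-wine samples and should be re-estimated on whiteWine before being relied on for white wine.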
The red and white wine datasets show different relationships between quality and the physicochemical variables, so two separate models are fitted. The simple linear models explain about 32-34% of the variation in quality (multiple R-squared of 0.3438 and 0.3248 in the outputs above). They do not reveal more abstract relations between the variables, but we can say roughly that the quality of the wine depends mainly on volatile acidity, alcohol, total sulfur dioxide, and sulphates for red wine, and on volatile acidity, alcohol, total sulfur dioxide, and density for white wine. The sign and strength of each regressor's effect on quality (Tables 1 and 2) is useful information that can inform managerial decisions in different business scenarios.
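The share of variation explained is the R-squared statistic, defined as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on simulated data (the data frame `sim` and its coefficients are made up for illustration and are not the wine data) suggests how the manual calculation lines up with what `summary()` reports:

```r
set.seed(1)

# Simulated stand-in data; NOT the wine dataset
n <- 300
sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.1),
  volatile.acidity = rnorm(n, mean = 0.5, sd = 0.2)
)
sim$quality <- 3 + 0.3 * sim$alcohol - 1.2 * sim$volatile.acidity +
  rnorm(n, sd = 0.65)

fit <- lm(quality ~ alcohol + volatile.acidity, data = sim)

# R-squared = 1 - RSS / TSS
rss <- sum(residuals(fit)^2)                     # residual sum of squares
tss <- sum((sim$quality - mean(sim$quality))^2)  # total sum of squares
r2_manual <- 1 - rss / tss

print(c(manual = r2_manual, reported = summary(fit)$r.squared))
```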
This paper was motivated by the need for research that improves the understanding of how wine quality is influenced by its physicochemical properties. Identifying which of the eleven physicochemical attributes have a statistically significant influence on quality is an essential finding. Using linear regression analysis, we arrive at a model that highlights the significant attributes in both datasets. The results will be helpful in production and in quality prediction, since they quantify the impact of those attributes on quality. There is room for further analysis to reveal more interesting patterns and to apply more rigorous analytical tools to build a more sophisticated model.
An important insight is how much each chemical component contributes to the quality of the wine, and how the quality level of a newly produced wine can be graded. This helps the producer identify the distinguishing factors that affect quality and thereby set a reasonable price. It also becomes easier to filter and choose the raw material (grapes, fruits, or vegetables) with a prior laboratory test for the chemical properties required for the quality of wine demanded in the market. Moreover, different cultures and nations demand specific types of wine because of differences in climate or in the taste preferences of the people living in those areas. In such scenarios, the producer can collect data on the taste preferences of different cultures and prepare a suitable wine for each area to improve marketing, sales, and revenue.
Portuguese wine. Available from: https://en.wikipedia.org/wiki/Portuguese_wine [27 Jan 2018].
List of wine-producing countries. Available from: https://en.wikipedia.org/wiki/Wine#Vintages [27 Jan 2018].
Wine awards for Portuguese wine. Available from: http://www.winesofportugal.com/us/news-and-events/awards1/ [27 Jan 2018].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Table 1: OLS coefficient estimates for the red-wine model (rModel2)

| | \(\beta\) | SE | t-statistic |
|---|---|---|---|
| Intercept | 2.8258128 | 0.2006892 | 14.081 |
| volatile.acidity | -1.1985632 | 0.0966011 | -12.407 |
| alcohol | 0.2953105 | 0.0160331 | 18.419 |
| sulphates | 0.7121396 | 0.1005146 | 7.085 |
| total.sulfur.dioxide | -0.0022354 | 0.0005108 | -4.376 |
Table 2: OLS coefficient estimates for the white-wine model (wModel2), as printed in the summary(wfit2) output

| | \(\beta\) | SE | t-statistic |
|---|---|---|---|
| Intercept | -1.694e+01 | 1.026e+01 | -1.650 |
| density | 2.010e+01 | 1.019e+01 | 1.972 |
| total.sulfur.dioxide | -1.929e-03 | 5.169e-04 | -3.732 |
| alcohol | 3.203e-01 | 1.874e-02 | 17.088 |
| volatile.acidity | -1.353e+00 | 9.524e-02 | -14.211 |
# Summarize the Data
library(psych)
describe(redWine)
vars n mean sd median trimmed mad min
fixed.acidity 1 1599 8.32 1.74 7.90 8.15 1.48 4.60
volatile.acidity 2 1599 0.53 0.18 0.52 0.52 0.18 0.12
citric.acid 3 1599 0.27 0.19 0.26 0.26 0.25 0.00
residual.sugar 4 1599 2.54 1.41 2.20 2.26 0.44 0.90
chlorides 5 1599 0.09 0.05 0.08 0.08 0.01 0.01
free.sulfur.dioxide 6 1599 15.87 10.46 14.00 14.58 10.38 1.00
total.sulfur.dioxide 7 1599 46.47 32.90 38.00 41.84 26.69 6.00
density 8 1599 1.00 0.00 1.00 1.00 0.00 0.99
pH 9 1599 3.31 0.15 3.31 3.31 0.15 2.74
sulphates 10 1599 0.66 0.17 0.62 0.64 0.12 0.33
alcohol 11 1599 10.42 1.07 10.20 10.31 1.04 8.40
quality 12 1599 5.64 0.81 6.00 5.59 1.48 3.00
max range skew kurtosis se
fixed.acidity 15.90 11.30 0.98 1.12 0.04
volatile.acidity 1.58 1.46 0.67 1.21 0.00
citric.acid 1.00 1.00 0.32 -0.79 0.00
residual.sugar 15.50 14.60 4.53 28.49 0.04
chlorides 0.61 0.60 5.67 41.53 0.00
free.sulfur.dioxide 72.00 71.00 1.25 2.01 0.26
total.sulfur.dioxide 289.00 283.00 1.51 3.79 0.82
density 1.00 0.01 0.07 0.92 0.00
pH 4.01 1.27 0.19 0.80 0.00
sulphates 2.00 1.67 2.42 11.66 0.00
alcohol 14.90 6.50 0.86 0.19 0.03
quality 8.00 5.00 0.22 0.29 0.02
describe(whiteWine)
vars n mean sd median trimmed mad min
fixed.acidity 1 4898 6.85 0.84 6.80 6.82 0.74 3.80
volatile.acidity 2 4898 0.28 0.10 0.26 0.27 0.09 0.08
citric.acid 3 4898 0.33 0.12 0.32 0.33 0.09 0.00
residual.sugar 4 4898 6.39 5.07 5.20 5.80 5.34 0.60
chlorides 5 4898 0.05 0.02 0.04 0.04 0.01 0.01
free.sulfur.dioxide 6 4898 35.31 17.01 34.00 34.36 16.31 2.00
total.sulfur.dioxide 7 4898 138.36 42.50 134.00 136.96 43.00 9.00
density 8 4898 0.99 0.00 0.99 0.99 0.00 0.99
pH 9 4898 3.19 0.15 3.18 3.18 0.15 2.72
sulphates 10 4898 0.49 0.11 0.47 0.48 0.10 0.22
alcohol 11 4898 10.51 1.23 10.40 10.43 1.48 8.00
quality 12 4898 5.88 0.89 6.00 5.85 1.48 3.00
max range skew kurtosis se
fixed.acidity 14.20 10.40 0.65 2.17 0.01
volatile.acidity 1.10 1.02 1.58 5.08 0.00
citric.acid 1.66 1.66 1.28 6.16 0.00
residual.sugar 65.80 65.20 1.08 3.46 0.07
chlorides 0.35 0.34 5.02 37.51 0.00
free.sulfur.dioxide 289.00 287.00 1.41 11.45 0.24
total.sulfur.dioxide 440.00 431.00 0.39 0.57 0.61
density 1.04 0.05 0.98 9.78 0.00
pH 3.82 1.10 0.46 0.53 0.00
sulphates 1.08 0.86 0.98 1.59 0.00
alcohol 14.20 6.20 0.49 -0.70 0.02
quality 9.00 6.00 0.16 0.21 0.01
# Influential variables
# How strongly each variable correlates with quality:
cor(x=redWine[,c(1:12)], y=redWine$quality) #red wine
[,1]
fixed.acidity 0.12405165
volatile.acidity -0.39055778
citric.acid 0.22637251
residual.sugar 0.01373164
chlorides -0.12890656
free.sulfur.dioxide -0.05065606
total.sulfur.dioxide -0.18510029
density -0.17491923
pH -0.05773139
sulphates 0.25139708
alcohol 0.47616632
quality 1.00000000
cor(x=whiteWine[,c(1:12)], y=whiteWine$quality) #white wine
[,1]
fixed.acidity -0.113662831
volatile.acidity -0.194722969
citric.acid -0.009209091
residual.sugar -0.097576829
chlorides -0.209934411
free.sulfur.dioxide 0.008158067
total.sulfur.dioxide -0.174737218
density -0.307123313
pH 0.099427246
sulphates 0.053677877
alcohol 0.435574715
quality 1.000000000
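The correlation vectors above include the trivial correlation of quality with itself and are listed in column order. The following is a small sketch, on simulated data with made-up columns (not the wine data), of dropping the response from the predictors and ranking them by absolute correlation:

```r
set.seed(42)

# Simulated stand-in data; NOT the wine dataset
n <- 500
sim <- data.frame(
  alcohol          = rnorm(n),
  volatile.acidity = rnorm(n),
  pH               = rnorm(n)
)
sim$quality <- 5 + 0.8 * sim$alcohol - 0.3 * sim$volatile.acidity +
  rnorm(n, sd = 0.5)

# Correlate every predictor (all columns except the response) with quality
predictors <- setdiff(names(sim), "quality")
cors <- cor(sim[, predictors], sim$quality)[, 1]

# Rank by strength of association, ignoring sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Applied to the wine data, the same pattern would use `redWine` or `whiteWine` in place of `sim`.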
tapply(redWine$alcohol, redWine$quality, mean) #red wine
3 4 5 6 7 8
9.955000 10.265094 9.899706 10.629519 11.465913 12.094444
tapply(whiteWine$alcohol, whiteWine$quality, mean) #white wine
3 4 5 6 7 8 9
10.34500 10.15245 9.80884 10.57537 11.36794 11.63600 12.18000
tapply(redWine$volatile.acidity, redWine$quality, mean) #red wine
3 4 5 6 7 8
0.8845000 0.6939623 0.5770411 0.4974843 0.4039196 0.4233333
tapply(whiteWine$volatile.acidity, whiteWine$quality, mean) #white wine
3 4 5 6 7 8 9
0.3332500 0.3812270 0.3020110 0.2605641 0.2627670 0.2774000 0.2980000
tapply(redWine$total.sulfur.dioxide, redWine$quality, mean) #red wine
3 4 5 6 7 8
24.90000 36.24528 56.51395 40.86991 35.02010 33.44444
tapply(whiteWine$total.sulfur.dioxide, whiteWine$quality, mean) #white wine
3 4 5 6 7 8 9
170.6000 125.2791 150.9046 137.0473 125.1148 126.1657 116.0000
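The per-quality means above are computed one variable at a time with `tapply`. As an alternative sketch, `aggregate` can tabulate several variables by quality score in a single call; the data below are simulated placeholders, not the wine data.

```r
set.seed(7)

# Simulated stand-in data; NOT the wine dataset
n <- 200
sim <- data.frame(
  quality          = sample(3:8, n, replace = TRUE),
  alcohol          = rnorm(n, mean = 10.4, sd = 1.1),
  volatile.acidity = rnorm(n, mean = 0.5, sd = 0.2)
)

# Mean of each listed variable within each quality score
by_quality <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality,
  data = sim,
  FUN  = mean
)
print(by_quality)

# The alcohol column agrees with the tapply() style used above
alcohol_means <- tapply(sim$alcohol, sim$quality, mean)
```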
library(coefplot)
coefplot(rfit2, intercept=FALSE) #red wine
coefplot(wfit2, intercept=FALSE) #white wine
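A coefficient plot displays each estimate with an interval around it. As a rough numeric counterpart, the sketch below builds approximate 95% intervals (estimate plus or minus 1.96 standard errors, a normal approximation assumed here) from the red-wine estimates and standard errors reported in the summary output; the object name `est` is illustrative.

```r
# Estimates and standard errors copied from the summary(rfit2) output
est <- data.frame(
  term     = c("volatile.acidity", "alcohol", "sulphates",
               "total.sulfur.dioxide"),
  estimate = c(-1.1985632, 0.2953105, 0.7121396, -0.0022354),
  se       = c(0.0966011, 0.0160331, 0.1005146, 0.0005108)
)

# Approximate 95% interval: estimate +/- 1.96 * SE (normal approximation)
est$lower <- est$estimate - 1.96 * est$se
est$upper <- est$estimate + 1.96 * est$se

# An interval that excludes zero agrees with significance at roughly the 5% level
est$excludes_zero <- est$lower * est$upper > 0
print(est)
```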
# T-tests comparing red and white wine, to check whether separate models are needed.
mean(redWine$quality)
## [1] 5.636023
mean(whiteWine$quality)
## [1] 5.877909
t.test(redWine$quality,whiteWine$quality)
##
## Welch Two Sample t-test
##
## data: redWine$quality and whiteWine$quality
## t = -10.149, df = 2950.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2886173 -0.1951564
## sample estimates:
## mean of x mean of y
## 5.636023 5.877909
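The Welch statistic reported above can be reproduced approximately from summary statistics alone. The sketch below uses the means printed above and the standard deviations and sample sizes from the `describe()` output; because those standard deviations are rounded to two decimals, the result only approximates the reported t = -10.149 and df = 2950.8.

```r
# Summary statistics for quality (means from the output above,
# rounded sd and n from the describe() tables)
m_red   <- 5.636023; s_red   <- 0.81; n_red   <- 1599
m_white <- 5.877909; s_white <- 0.89; n_white <- 4898

# Squared standard error of each sample mean
v_red   <- s_red^2 / n_red
v_white <- s_white^2 / n_white

# Welch t statistic: difference in means over the unpooled standard error
t_stat <- (m_red - m_white) / sqrt(v_red + v_white)

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (v_red + v_white)^2 /
  (v_red^2 / (n_red - 1) + v_white^2 / (n_white - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), welch_df)
print(c(t = t_stat, df = welch_df, p = p_value))
```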
mean(redWine$alcohol)
## [1] 10.42298
mean(whiteWine$alcohol)
## [1] 10.51427
t.test(redWine$alcohol,whiteWine$alcohol)
##
## Welch Two Sample t-test
##
## data: redWine$alcohol and whiteWine$alcohol
## t = -2.859, df = 3100.5, p-value = 0.004278
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.15388669 -0.02868117
## sample estimates:
## mean of x mean of y
## 10.42298 10.51427