Analysis of Wine Data

Arpan Dutta, Debanjan Bhattacharjee, Soumyajit Roy

Used Libraries

Packages

require(MASS) #for lda and qda
require(ggplot2)
require(car)
require(heplots)
require(klaR)
require(psych)
require(GPArotation)
require(rgl)
require(kableExtra)

Introduction

The data consist of the results of a chemical analysis of 13 constituents (columns 2-14) of wines grown in the same region of Italy but derived from three different cultivars (given by column 1). The constituent attributes measured in the chemical analysis are:

Alcohol : Percentage of alcohol in the wine.
Malic acid : Influences the acidity and flavour profile of the wine.
Ash : Inorganic mineral content of the wine.
Alkalinity of ash : Measures the alkalinity of the ash content. It affects the pH of the wine and can impact its stability.
Magnesium : Measures the magnesium content of the wine. It influences fermentation and aroma.
Total phenols : Total concentration of phenolic compounds in the wine.
Flavanoids : Concentration of flavonoid compounds. Impacts color, flavor complexity, and health benefits.
Nonflavanoid phenols : Concentration of non-flavonoid phenolic compounds. These compounds contribute to wine color, flavor, and mouthfeel.
Proanthocyanins : Concentration of proanthocyanidin compounds. They contribute to wine bitterness, astringency, and color stability.
Color intensity : Measures the intensity of the wine’s color. This variable describes the depth and richness of the wine’s hue.
Hue : Indicates the hue or tint of the wine, ranging from reddish-purple to orange-yellow.
od (OD280/OD315 of diluted wines) : Ratio of absorbance measurements at two wavelengths (280 nm and 315 nm). This ratio provides information about the concentration of specific compounds in the wine.
Proline : Concentration of the amino acid proline in the wine. Proline can influence wine fermentation, stability, and sensory attributes.

Data Pre-processing

We load the dataset.
There are 178 observations of 14 variables, with no NA values.
We convert the first column, cultivars, to a categorical variable with 3 levels: cultivars-1, cultivars-2, cultivars-3.
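
These pre-processing steps can be sketched as follows. This is a minimal illustration, not the code used for this report: the file name and reading call are not given here, so the snippet reads only the first five rows (copied from the glimpse of the dataset in this report) from an inline string. The name `raw` and the comma-separated layout are assumptions.

```r
# Column names as used in this report.
cols <- c("cultivars", "alcohol", "malic_acid", "ash", "ash_alkalinity",
          "magnesium", "total_phenols", "flavanoids", "nonflavanoids",
          "Proanthocyanins", "color_intensity", "hue", "od", "proline")

# Assumption: inline stand-in for the raw file (first five rows only).
raw <- "1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735"
wine <- read.csv(text = raw, header = FALSE, col.names = cols)

# Count missing values, then make the cultivar label a factor with three levels.
n_missing <- sum(is.na(wine))
wine$cultivars <- factor(wine$cultivars, levels = 1:3)
str(wine)
```

With the full file, `read.csv()` would be pointed at the file path instead of `text = raw`.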

Glimpse of the dataset

wine data
cultivars	alcohol	malic_acid	ash	ash_alkalinity	magnesium	total_phenols	flavanoids	nonflavanoids	Proanthocyanins	color_intensity	hue	od	proline
1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735

Summary of the data

 cultivars    alcohol        malic_acid         ash        ash_alkalinity 
 1:59      Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
 2:71      1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
 3:48      Median :13.05   Median :1.865   Median :2.360   Median :19.50  
           Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
           3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
           Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
   magnesium      total_phenols     flavanoids    nonflavanoids   
 Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300  
 1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700  
 Median : 98.00   Median :2.355   Median :2.135   Median :0.3400  
 Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619  
 3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375  
 Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600  
 Proanthocyanins color_intensity       hue               od       
 Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
 1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
 Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
 Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
 3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
 Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
    proline      
 Min.   : 278.0  
 1st Qu.: 500.5  
 Median : 673.5  
 Mean   : 746.9  
 3rd Qu.: 985.0  
 Max.   :1680.0

Exploratory data analysis

Scatterplot of some Variables

Correlation plot

for the full data

for cultivars 1

for cultivars 2

for cultivars 3

Histograms for whole data

Histograms for cultivars-1

Histograms for cultivars-2

Histograms for cultivars-3

Checking for Normality

Shapiro-Wilk test for cultivars-1

[1] "P-value for alcohol = 0.47907 ***"
[1] "P-value for malic_acid = 0 "
[1] "P-value for ash = 0.15556 ***"
[1] "P-value for ash_alkalinity = 0.21609 ***"
[1] "P-value for magnesium = 0.08617 ***"
[1] "P-value for total_phenols = 0.0203 "
[1] "P-value for flavanoids = 0.63873 ***"
[1] "P-value for nonflavanoids = 0.03015 "
[1] "P-value for Proanthocyanins = 0.0315 "
[1] "P-value for color_intensity = 0.12509 ***"
[1] "P-value for hue = 0.15082 ***"
[1] "P-value for od = 0.07745 ***"
[1] "P-value for proline = 0.52324 ***"

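Comment: malic_acid, total_phenols, nonflavanoids and Proanthocyanins do not follow a normal distribution (p-value below 0.05). Note that in this output *** marks the variables for which normality is not rejected at the 0.05 level.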
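
A per-variable Shapiro-Wilk check of this kind might look like the sketch below. It is only an illustration: the built-in `iris` data stand in for the wine data, and the helper name `shapiro_by_group` is hypothetical. The `***` flag follows the convention of the output above (normality not rejected at 0.05).

```r
# Stand-in data (assumption): iris replaces the wine data, Species replaces cultivars.
dat <- iris
group_col <- "Species"

# Shapiro-Wilk p-value for every numeric column within one group.
shapiro_by_group <- function(data, group_col, level) {
  sub <- data[data[[group_col]] == level, setdiff(names(data), group_col)]
  sapply(sub, function(x) shapiro.test(x)$p.value)
}

pvals <- shapiro_by_group(dat, group_col, "setosa")
for (v in names(pvals)) {
  flag <- if (pvals[[v]] > 0.05) "***" else ""
  print(paste("P-value for", v, "=", round(pvals[[v]], 5), flag))
}
```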

Shapiro-Wilk test for cultivars-2

[1] "P-value for alcohol = 0.11396 ***"
[1] "P-value for malic_acid = 0 "
[1] "P-value for ash = 0.61976 ***"
[1] "P-value for ash_alkalinity = 0.07397 ***"
[1] "P-value for magnesium = 0 "
[1] "P-value for total_phenols = 0.31801 ***"
[1] "P-value for flavanoids = 0.0015 "
[1] "P-value for nonflavanoids = 0.3128 ***"
[1] "P-value for Proanthocyanins = 0.00815 "
[1] "P-value for color_intensity = 0.00082 "
[1] "P-value for hue = 0.22493 ***"
[1] "P-value for od = 0.08904 ***"
[1] "P-value for proline = 0.00177 "

comment: malic_acid, magnesium, flavanoids, Proanthocyanins, color_intensity, proline are not normally distributed.

Shapiro-Wilk test for cultivars-3

[1] "P-value for alcohol = 0.64084 ***"
[1] "P-value for malic_acid = 0.73772 ***"
[1] "P-value for ash = 0.10923 ***"
[1] "P-value for ash_alkalinity = 0.09874 ***"
[1] "P-value for magnesium = 0.03865 "
[1] "P-value for total_phenols = 0.01577 "
[1] "P-value for flavanoids = 0.00036 "
[1] "P-value for nonflavanoids = 0.02284 "
[1] "P-value for Proanthocyanins = 0.00025 "
[1] "P-value for color_intensity = 0.08775 ***"
[1] "P-value for hue = 0.02819 "
[1] "P-value for od = 0.08311 ***"
[1] "P-value for proline = 0.45849 ***"

comment: magnesium, total_phenols, flavanoids, nonflavanoids, Proanthocyanins, hue are not normally distributed.

Violation of normality

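malic_acid
total_phenols
nonflavanoids
Proanthocyanins
magnesium
flavanoids
color_intensity
proline

These variables are not normally distributed in at least one cultivar group. (hue is also rejected for cultivars-3, p = 0.028, but it is left untransformed in the final data.)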

Box-Cox transformation on whole dataset

We apply a Box-Cox transformation to each of these variables, using the whole dataset.
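
A Box-Cox step for a single variable can be sketched as below. The report does not state which function or which values of lambda were used, so this is an assumption-laden illustration: lambda is chosen by maximum likelihood with `MASS::boxcox()` on an intercept-only model, the data are simulated, and `box_cox` is a hypothetical helper. The transformation requires strictly positive values.

```r
library(MASS)  # for boxcox()

# Simulated right-skewed, strictly positive variable (assumption: stand-in for one wine variable).
set.seed(1)
y <- rlnorm(200, meanlog = 0, sdlog = 0.5)

# Profile log-likelihood of lambda for the intercept-only model y ~ 1.
bc <- boxcox(y ~ 1, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Box-Cox transform: log(y) when lambda is 0, (y^lambda - 1) / lambda otherwise.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
y_t <- box_cox(y, lambda_hat)

# Compare the Shapiro-Wilk p-values before and after the transformation.
c(before = shapiro.test(y)$p.value, after = shapiro.test(y_t)$p.value)
```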

Checking for normality for cultivars-1 after transformation

Shapiro-Wilk test for cultivars-1 after transformation

[1] "P-value for alcohol = 0.47907 ***"
[1] "P-value for malic_acid = 0 "
[1] "P-value for ash = 0.15556 ***"
[1] "P-value for ash_alkalinity = 0.21609 ***"
[1] "P-value for magnesium = 0.50006 ***"
[1] "P-value for total_phenols = 0.05264 ***"
[1] "P-value for flavanoids = 0.77356 ***"
[1] "P-value for nonflavanoids = 0.36968 ***"
[1] "P-value for Proanthocyanins = 0.1333 ***"
[1] "P-value for color_intensity = 0.78541 ***"
[1] "P-value for hue = 0.15082 ***"
[1] "P-value for od = 0.07745 ***"
[1] "P-value for proline = 0.39451 ***"

comment: only malic_acid is not normally distributed.

qqPlot for cultivars-1

2D scatterplot and Confidence ellipsoid for cultivars-1

We now draw scatterplots of some pairs of variables and overlay a confidence ellipse at level 0.90.

comment: almost 90% of the data points are inside the ellipse.
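
The coverage claim can be checked numerically with squared Mahalanobis distances, which also underlie the chi-square plot. The sketch below uses a chi-square cutoff and two `iris` variables as a stand-in for a pair of wine variables; both choices are assumptions, and the plotting function used for the figures in this report may define the ellipse radius slightly differently.

```r
# Stand-in data (assumption): two iris variables for one species.
x <- as.matrix(iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")])

# Squared Mahalanobis distance of each point from the sample mean.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# A point lies inside the 0.90 ellipse when d2 is at most the chi-square
# quantile with degrees of freedom equal to the number of variables.
cutoff <- qchisq(0.90, df = ncol(x))
coverage <- mean(d2 <= cutoff)
coverage

# Coordinates of the chi-square plot: ordered distances against chi-square quantiles.
n <- nrow(x)
q <- qchisq((seq_len(n) - 0.5) / n, df = ncol(x))
```

Calling `plot(q, sort(d2))` followed by `abline(0, 1)` would draw the chi-square plot; points close to the line are compatible with multivariate normality.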

3D plot for cultivars-1

Chisq plot for cultivars-1

Royston test for cultivars-1

            Royston test for Multivariate Normality 

  data : boxdata1[, -1] 

  R               : 45.97313 
  p-value         : 1.177547e-05 

  Result  : Data are not multivariate normal (sig.level = 0.05)

Comment: The p-value is less than 0.05, so we reject the null hypothesis of multivariate normality at the 0.05 level.

Checking for normality for cultivars-2 after transformation

Shapiro-Wilk test for cultivars-2 after transformation

[1] "P-value for alcohol = 0.11396 ***"
[1] "P-value for malic_acid = 0.33092 ***"
[1] "P-value for ash = 0.61976 ***"
[1] "P-value for ash_alkalinity = 0.07397 ***"
[1] "P-value for magnesium = 0.00102 "
[1] "P-value for total_phenols = 0.66604 ***"
[1] "P-value for flavanoids = 0.06954 ***"
[1] "P-value for nonflavanoids = 0.57319 ***"
[1] "P-value for Proanthocyanins = 0.04581 "
[1] "P-value for color_intensity = 0.4973 ***"
[1] "P-value for hue = 0.22493 ***"
[1] "P-value for od = 0.08904 ***"
[1] "P-value for proline = 0.65382 ***"

Comment: magnesium and Proanthocyanins are not normally distributed, although the p-value for Proanthocyanins is 0.046, which is close to 0.05.

qqPlot for cultivars-2

2D scatterplot and Confidence ellipsoid for cultivars-2

We now draw scatterplots of some pairs of variables and overlay a confidence ellipse at level 0.90.

comment: almost 90% of the data points are inside the ellipse.

3D plot for cultivars-2

Chisq plot for cultivars-2

Royston test for cultivars-2

            Royston test for Multivariate Normality 

  data : boxdata2[, -1] 

  R               : 30.0038 
  p-value         : 0.004233989 

  Result  : Data are not multivariate normal (sig.level = 0.05)

Comment: The p-value is less than 0.05, so we reject the null hypothesis of multivariate normality at the 0.05 level.

Checking for normality for cultivars-3 after transformation

Shapiro-Wilk test for cultivars-3 after transformation

[1] "P-value for alcohol = 0.64084 ***"
[1] "P-value for malic_acid = 0.0015 "
[1] "P-value for ash = 0.10923 ***"
[1] "P-value for ash_alkalinity = 0.09874 ***"
[1] "P-value for magnesium = 0.31166 ***"
[1] "P-value for total_phenols = 0.07849 ***"
[1] "P-value for flavanoids = 0.00223 "
[1] "P-value for nonflavanoids = 0.00106 "
[1] "P-value for Proanthocyanins = 0.01529 "
[1] "P-value for color_intensity = 0.10767 ***"
[1] "P-value for hue = 0.02819 "
[1] "P-value for od = 0.08311 ***"
[1] "P-value for proline = 0.78278 ***"

comment: malic_acid, flavanoids, nonflavanoids, Proanthocyanins, hue are not normal.

qqPlot for cultivars-3

2D scatterplot and Confidence ellipsoid for cultivars-3

We now draw scatterplots of some pairs of variables and overlay a confidence ellipse at level 0.90.

comment: almost 90% of the data points are inside the ellipse.

3D plot for cultivars-3

Chisq plot for cultivars-3

Royston test for cultivars-3

            Royston test for Multivariate Normality 

  data : boxdata3[, -1] 

  R               : 52.27771 
  p-value         : 6.024032e-07 

  Result  : Data are not multivariate normal (sig.level = 0.05)

Comment: The p-value is less than 0.05, so we reject the null hypothesis of multivariate normality at the 0.05 level.

Final data

'data.frame':   178 obs. of  14 variables:
 $ cultivars      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
 $ malic_acid     : num  0.496 0.529 0.757 0.605 0.828 ...
 $ ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ ash_alkalinity : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ magnesium      : num  0.713 0.713 0.713 0.713 0.713 ...
 $ total_phenols  : num  1.51 1.4 1.51 2.24 1.51 ...
 $ flavanoids     : num  1.7 1.48 1.82 2 1.43 ...
 $ nonflavanoids  : num  -1.06 -1.11 -1.01 -1.16 -0.82 ...
 $ Proanthocyanins: num  1.073 0.266 1.431 0.994 0.721 ...
 $ color_intensity: num  1.89 1.59 1.9 2.28 1.58 ...
 $ hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ od             : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ proline        : num  5.02 5.01 5.07 5.18 4.83 ...

MANOVA

Box’s M test

We now check whether the covariance matrices of the three population groups are equal. We test \[ H_0: \Sigma_1 = \Sigma_2 = \Sigma_3 \quad \text{vs} \quad H_1: H_0 \text{ is not true} \]


    Box's M-test for Homogeneity of Covariance Matrices

data:  Y
Chi-Sq (approx.) = 622.97, df = 182, p-value < 2.2e-16
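
For reference, the chi-square approximation to Box’s M statistic can be written out in base R. The sketch below is a hand-rolled illustration on simulated data, not the routine that produced the output above; `box_m` is a hypothetical helper. The simulated data use the same dimensions as this report (13 variables, group sizes 59, 71 and 48), so the degrees of freedom p(p+1)(g-1)/2 come out as 182.

```r
# Box's M test with the chi-square approximation (hypothetical helper).
box_m <- function(X, group) {
  X <- as.matrix(X)
  group <- as.factor(group)
  p <- ncol(X); g <- nlevels(group); N <- nrow(X)
  n <- as.vector(table(group))

  # Group covariance matrices and the pooled covariance matrix.
  covs <- lapply(levels(group), function(l) cov(X[group == l, , drop = FALSE]))
  pooled <- Reduce(`+`, Map(function(S, ni) (ni - 1) * S, covs, n)) / (N - g)

  logdet <- function(S) as.numeric(determinant(S, logarithm = TRUE)$modulus)
  M <- (N - g) * logdet(pooled) - sum((n - 1) * sapply(covs, logdet))

  # Scaling constant of the chi-square approximation.
  c1 <- (sum(1 / (n - 1)) - 1 / (N - g)) *
    (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
  stat <- M * (1 - c1)
  df <- p * (p + 1) * (g - 1) / 2
  list(statistic = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data (assumption) with equal covariance matrices in all groups.
set.seed(1)
sizes <- c(59, 71, 48)
group <- factor(rep(1:3, times = sizes))
Y <- matrix(rnorm(sum(sizes) * 13), ncol = 13)
res <- box_m(Y, group)
res$df  # 182
```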

MANOVA

We now test the equality of the mean vectors of the three population groups. We test \[ H_0: \mu_1 = \mu_2 = \mu_3 \quad \text{vs} \quad H_1: H_0 \text{ is not true} \]

            Df Pillai approx F num Df den Df    Pr(>F)    
fdata[, 1]   2 1.7337   82.119     26    328 < 2.2e-16 ***
Residuals  175                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comment: The MANOVA test gives a very small p-value (< 2.2e-16), so we reject the null hypothesis that the three cultivars have the same mean vector. We conclude that the mean vectors of the three cultivars are not all equal.
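
A one-way MANOVA with Pillai’s trace can be run with base R as sketched below. The data are simulated (an assumption) with the same dimensions as in this report, so the approximate F statistic has the same 26 and 328 degrees of freedom as in the table above; the object names are illustrative only.

```r
# Simulated data (assumption): 13 variables, group sizes 59, 71 and 48.
set.seed(1)
sizes <- c(59, 71, 48)
group <- factor(rep(1:3, times = sizes))
Y <- matrix(rnorm(sum(sizes) * 13), ncol = 13)

# Shift the mean of the first variable in group 2 so that H0 is false.
Y[group == 2, 1] <- Y[group == 2, 1] + 1

# One-way MANOVA; Pillai's trace is the statistic reported in the table above.
fit <- manova(Y ~ group)
tab <- summary(fit, test = "Pillai")$stats
tab
```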

Principal Component Analysis

Scree Plot

We are interested in dimension reduction. Here we have 13 covariates, so we want to find the principal components that explain most of the variance.

Comment: The scree plot shows that the largest eigenvalue is much larger than the remaining eigenvalues.

Principal component

Comment: This plot shows the same thing. The first PC explains most of the variability.
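
The quantities behind a scree plot can be computed as in the sketch below. It uses `iris` as a stand-in for the wine data and standardizes the variables before PCA; both are assumptions, since the report does not state whether the covariance or the correlation matrix was analysed.

```r
# Stand-in data (assumption): the four numeric iris variables.
X <- iris[, 1:4]

# PCA on centred and scaled variables (that is, on the correlation matrix).
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigenvalues and the share of total variance explained by each component.
eig <- pca$sdev^2
prop_var <- eig / sum(eig)
cum_var <- cumsum(prop_var)
round(rbind(eigenvalue = eig, proportion = prop_var, cumulative = cum_var), 3)
```

`screeplot(pca, type = "lines")` and `biplot(pca)` would then draw the scree plot and the bi-plot from the same object.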

Bi-plot

The bi-plot shows the data along with the projections of the original variables onto the first two components.

Classification

We split the whole data into two parts: 100 randomly chosen observations form the training data and the remaining 78 form the test data.

LDA

Let us perform Linear Discriminant Analysis (even though the assumption of equal covariance matrices is not satisfied).

   
pre  1  2  3
  1 27  1  0
  2  0 33  1
  3  0  0 16

Comment: LDA performs very well. Reading the table with predicted classes in rows and true classes in columns, one wine from cultivar 2 is classified as cultivar 1 and one wine from cultivar 3 is classified as cultivar 2; there are no other misclassifications. So 76 of the 78 test observations are classified correctly.
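
The split-and-classify procedure can be sketched as follows, using `iris` as a stand-in for the wine data (an assumption); the seed and object names are illustrative. As in the table above, predicted classes are in rows and true classes in columns.

```r
library(MASS)  # for lda()

# Stand-in data (assumption): iris, with Species playing the role of cultivars.
set.seed(1)
dat <- iris

# 100 random observations for training, the rest for testing.
train_idx <- sample(nrow(dat), 100)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Fit LDA on the training data and predict the test classes.
fit <- lda(Species ~ ., data = train)
pre <- predict(fit, newdata = test)$class

# Confusion matrix and test accuracy.
conf <- table(pre, test$Species)
conf
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```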

Visualization

Here we perform a Linear Discriminant Analysis using only two covariates, so that we can visualize the discriminant rule.

LDA using Leave one out

Step 1: Remove one observation from the data set and fit LDA on the remaining observations.
Step 2: Predict the class of the removed observation using the fitted model.
Step 3: Compare the prediction with the true class, and repeat for every observation.
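
The three steps above translate directly into a loop. The sketch below again uses `iris` as a stand-in for the wine data (an assumption) and illustrative object names.

```r
library(MASS)  # for lda()

# Stand-in data (assumption): iris, with Species playing the role of cultivars.
dat <- iris
n <- nrow(dat)
loo_pred <- character(n)

for (i in seq_len(n)) {
  # Step 1: fit LDA without observation i.
  fit <- lda(Species ~ ., data = dat[-i, ])
  # Step 2: predict the class of the held-out observation.
  loo_pred[i] <- as.character(predict(fit, newdata = dat[i, ])$class)
}

# Step 3: compare the predictions with the true classes.
loo_error <- mean(loo_pred != dat$Species)
loo_error
```

`lda(..., CV = TRUE)` offers a built-in leave-one-out option; its results can differ slightly from the explicit loop because the class priors are handled differently.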

Groupwise Histogram

QDA

Now let us see how QDA performs (though we know that violation of the normality assumption affects it considerably).

   
pre  1  2  3
  1 27  0  0
  2  0 34  0
  3  0  0 17

Comments: QDA misclassifies no test observation, so it performs better than LDA on this split. Note that Box’s M test indicated that the covariance matrices of the cultivars are not homogeneous, whereas LDA assumes a common covariance matrix. This is consistent with QDA performing better than LDA here.
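
A side-by-side comparison of the two classifiers on one train/test split can be sketched as below, with `iris` standing in for the wine data (an assumption). On other data or other splits the ranking of the two methods may differ.

```r
library(MASS)  # for lda() and qda()

# Stand-in data (assumption): iris, split into 100 training and 50 test rows.
set.seed(1)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# LDA pools one covariance matrix; QDA estimates one per class.
fit_lda <- lda(Species ~ ., data = train)
fit_qda <- qda(Species ~ ., data = train)

pre_lda <- predict(fit_lda, newdata = test)$class
pre_qda <- predict(fit_qda, newdata = test)$class

# Confusion matrix for QDA (predicted in rows, true in columns) and both error rates.
conf_qda <- table(pre_qda, test$Species)
conf_qda
errors <- c(lda = mean(pre_lda != test$Species), qda = mean(pre_qda != test$Species))
errors
```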

QDA using leave one out

Step 1: Remove one observation from the data set and fit QDA on the remaining observations.
Step 2: Predict the class of the removed observation using the fitted model.
Step 3: Compare the prediction with the true class, and repeat for every observation.

Factor analysis

Kaiser-Meyer-Olkin (KMO) test

Before starting factor analysis, let us check whether the data (or the covariance matrix) are suitable for FA.

[1] 0.5

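Comments: The covariance matrix is non-singular. The value 0.5 is the KMO measure of sampling adequacy, an index between 0 and 1 rather than a p-value. Since 0.5 is commonly taken as the minimum acceptable value, we conclude that we can carry out FA for this data.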
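
The overall KMO index compares squared correlations with squared partial correlations. A hand-written base R sketch is given below; `kmo_overall` is a hypothetical helper, `iris` is a stand-in for the wine data, and the value printed in this report was presumably obtained from a package routine rather than from this code.

```r
# Overall KMO measure of sampling adequacy (hypothetical helper).
kmo_overall <- function(X) {
  R <- cor(X)
  R_inv <- solve(R)  # fails if the correlation matrix is singular

  # Partial correlations given all other variables, from the inverse correlation matrix.
  Q <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))

  # Use off-diagonal entries only.
  off <- row(R) != col(R)
  sum(R[off]^2) / (sum(R[off]^2) + sum(Q[off]^2))
}

# Stand-in data (assumption): the four numeric iris variables.
kmo <- kmo_overall(iris[, 1:4])
kmo
```

Values close to 1 indicate that the correlations are largely explained by common factors, while values below about 0.5 are usually considered unsuitable for factor analysis.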

Factor Analysis

Let us fit FA models with various numbers of factors.

Comments: The 6-factor model is appropriate.
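
One common way to compare models with different numbers of factors is the likelihood-ratio test reported by `factanal()`, whose null hypothesis is that k factors are sufficient. The sketch below is an illustration on simulated data with a known 2-factor structure (all names and the data are assumptions); the report does not state which criterion led to the 6-factor choice. For p variables and k factors the test has ((p - k)^2 - (p + k)) / 2 degrees of freedom, which is 15 for p = 13 and k = 6.

```r
# Simulated data (assumption): 8 variables driven by 2 independent factors.
set.seed(1)
n <- 300
true_scores <- matrix(rnorm(n * 2), ncol = 2)
L <- rbind(cbind(rep(0.8, 4), 0), cbind(0, rep(0.8, 4)))  # 8 x 2 loadings
X <- true_scores %*% t(L) + matrix(rnorm(n * 8, sd = 0.6), ncol = 8)

# Fit maximum-likelihood factor models with k = 1 and k = 2 factors
# and collect the p-values of the "k factors are sufficient" test.
pvals <- sapply(1:2, function(k) factanal(X, factors = k)$PVAL)
names(pvals) <- paste0(1:2, "-factor")
pvals

# Degrees of freedom of the test for p variables and k factors.
fa_df <- function(p, k) ((p - k)^2 - (p + k)) / 2
fa_df(13, 6)  # 15
```

A small p-value suggests that more factors are needed; the smallest k that is not rejected is a natural candidate.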

Interpreting Factor analysis with 2-factor model

Comment: We take two factors. This figure shows the relation between the factors and the variables, with the respective loadings.

Factor Score (Bartlett’s Method)

Now let us check whether the factor scores from the 6-factor model satisfy the assumptions.

Verifying assumptions

\(E(F)=0\)
\(Cov(\textbf F)=I\)
\(L'\psi^{-1}L=Diagonal\) (constraint)
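
These three checks can be sketched with `factanal()` as below. The data are simulated with a 2-factor structure and all object names are assumptions; this is not the code behind the results in this report. The loadings are left unrotated, because a rotation such as varimax generally destroys the diagonal form of \(L'\psi^{-1}L\).

```r
# Simulated data (assumption): 8 variables driven by 2 independent factors.
set.seed(1)
n <- 300
true_scores <- matrix(rnorm(n * 2), ncol = 2)
L <- rbind(cbind(rep(0.8, 4), 0), cbind(0, rep(0.8, 4)))
X <- true_scores %*% t(L) + matrix(rnorm(n * 8, sd = 0.6), ncol = 8)

# Unrotated maximum-likelihood fit with Bartlett factor scores.
fa <- factanal(X, factors = 2, rotation = "none", scores = "Bartlett")
L_hat <- unclass(fa$loadings)  # estimated loadings (8 x 2)
psi <- fa$uniquenesses         # estimated specific variances

# Check 1: the factor scores should have mean zero.
score_means <- colMeans(fa$scores)

# Check 2: the covariance matrix of the scores, to be compared with the identity.
score_cov <- cov(fa$scores)

# Check 3: L' psi^{-1} L should be diagonal for the unrotated solution.
constraint <- t(L_hat) %*% diag(1 / psi) %*% L_hat

round(score_means, 6)
round(score_cov, 3)
round(constraint, 3)
```

Bartlett scores carry estimation error, so their sample covariance matrix is typically somewhat larger than the identity even when the model holds.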

Expectations

comments: \(E(F)=0\) assumption is satisfied.

Variance

comments: The figure provides enough evidence to support the claim that \(Cov(\textbf F)=I\).

Specific Variance

	Factor1	Factor2	Factor3	Factor4	Factor5	Factor6
Factor1	-1.228	0.063	-0.016	-0.109	-0.063	0.030
Factor2	0.063	-1.083	-0.004	0.048	-0.013	0.004
Factor3	-0.016	-0.004	-1.056	-0.012	0.011	0.011
Factor4	-0.109	0.048	-0.012	-1.262	0.078	0.027
Factor5	-0.063	-0.013	0.011	0.078	-1.184	0.002
Factor6	0.030	0.004	0.011	0.027	0.002	-1.008

comments: The specific variance matrix is not diagonal. We have also verified that the constraint \(L'\psi^{-1}L=Diagonal\) is not actually satisfied.

Acknowledgement

We want to say a big thank you to everyone who helped us complete this project successfully. We would like to express our deep sense of gratitude to Professor Swagata Nandi for assigning this topic and for providing the necessary guidance. We also want to give special thanks to Subhrangsu, Sourav and Subhendu for helping us throughout the whole process.