The data are the results of a chemical analysis of 13 constituents (columns 2-14) of wines grown in the same region of Italy but derived from three different cultivars (given in column 1). The constituents measured in the chemical analysis are:
- Alcohol: percentage of alcohol in the wine.
- Malic acid: influences the acidity and flavour profile of the wine.
- Ash: inorganic mineral content of the wine.
- Alkalinity of ash: alkalinity of the ash content. It affects the pH of the wine and can impact its stability.
- Magnesium: magnesium content of the wine. It influences wine fermentation and aroma.
- Total phenols: total concentration of phenolic compounds in the wine.
- Flavanoids: concentration of flavonoid compounds. It impacts color, flavor complexity, and health benefits.
- Nonflavanoid phenols: concentration of non-flavonoid phenolic compounds. These compounds contribute to wine color, flavor, and mouthfeel.
- Proanthocyanins: concentration of proanthocyanidin compounds. They contribute to wine bitterness, astringency, and color stability.
- Color intensity: intensity of the wine’s color. It provides information about the depth and richness of the wine’s hue.
- Hue: hue or tint of the wine. It reflects the wine’s color spectrum, ranging from reddish-purple to orange-yellow.
- od (OD280/OD315 of diluted wines): ratio of absorbance measurements at two different wavelengths. This ratio provides information about the concentration of specific compounds in the wine.
- Proline: concentration of the amino acid proline in the wine. Proline can influence wine fermentation, stability, and sensory attributes.

The cultivar is treated as a categorical variable with 3 levels: cultivars-1, cultivars-2, cultivars-3.

| cultivars | alcohol | malic_acid | ash | ash_alkalinity | magnesium | total_phenols | flavanoids | nonflavanoids | Proanthocyanins | color_intensity | hue | od | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |

Summary of the data. Observations per cultivar: 1: 59, 2: 71, 3: 48.

| variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| alcohol | 11.03 | 12.36 | 13.05 | 13.00 | 13.68 | 14.83 |
| malic_acid | 0.740 | 1.603 | 1.865 | 2.336 | 3.083 | 5.800 |
| ash | 1.360 | 2.210 | 2.360 | 2.367 | 2.558 | 3.230 |
| ash_alkalinity | 10.60 | 17.20 | 19.50 | 19.49 | 21.50 | 30.00 |
| magnesium | 70.00 | 88.00 | 98.00 | 99.74 | 107.00 | 162.00 |
| total_phenols | 0.980 | 1.742 | 2.355 | 2.295 | 2.800 | 3.880 |
| flavanoids | 0.340 | 1.205 | 2.135 | 2.029 | 2.875 | 5.080 |
| nonflavanoids | 0.1300 | 0.2700 | 0.3400 | 0.3619 | 0.4375 | 0.6600 |
| Proanthocyanins | 0.410 | 1.250 | 1.555 | 1.591 | 1.950 | 3.580 |
| color_intensity | 1.280 | 3.220 | 4.690 | 5.058 | 6.200 | 13.000 |
| hue | 0.4800 | 0.7825 | 0.9650 | 0.9574 | 1.1200 | 1.7100 |
| od | 1.270 | 1.938 | 2.780 | 2.612 | 3.170 | 4.000 |
| proline | 278.0 | 500.5 | 673.5 | 746.9 | 985.0 | 1680.0 |

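
As a minimal sketch of how the preview and summary above could be produced in R, the block below rebuilds only the five displayed rows by hand. The object name `wine_head` is an assumption; the report does not show how the full file was read.

```r
# First five rows of the data set, copied from the preview table.
# `wine_head` is a hypothetical name used only for this illustration.
wine_head <- data.frame(
  cultivars       = factor(c(1, 1, 1, 1, 1), levels = 1:3),
  alcohol         = c(14.23, 13.20, 13.16, 14.37, 13.24),
  malic_acid      = c(1.71, 1.78, 2.36, 1.95, 2.59),
  ash             = c(2.43, 2.14, 2.67, 2.50, 2.87),
  ash_alkalinity  = c(15.6, 11.2, 18.6, 16.8, 21.0),
  magnesium       = c(127, 100, 101, 113, 118),
  total_phenols   = c(2.80, 2.65, 2.80, 3.85, 2.80),
  flavanoids      = c(3.06, 2.76, 3.24, 3.49, 2.69),
  nonflavanoids   = c(0.28, 0.26, 0.30, 0.24, 0.39),
  Proanthocyanins = c(2.29, 1.28, 2.81, 2.18, 1.82),
  color_intensity = c(5.64, 4.38, 5.68, 7.80, 4.32),
  hue             = c(1.04, 1.05, 1.03, 0.86, 1.04),
  od              = c(3.92, 3.40, 3.17, 3.45, 2.93),
  proline         = c(1065, 1050, 1185, 1480, 735)
)

# Counts per factor level and six-number summaries per numeric column.
summary(wine_head)
```

Storing `cultivars` as a factor is what makes `summary()` print level counts (59, 71 and 48 on the full data) instead of quartiles for that column.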
[1] "P-value for alcohol = 0.47907 ***"
[1] "P-value for malic_acid = 0 "
[1] "P-value for ash = 0.15556 ***"
[1] "P-value for ash_alkalinity = 0.21609 ***"
[1] "P-value for magnesium = 0.08617 ***"
[1] "P-value for total_phenols = 0.0203 "
[1] "P-value for flavanoids = 0.63873 ***"
[1] "P-value for nonflavanoids = 0.03015 "
[1] "P-value for Proanthocyanins = 0.0315 "
[1] "P-value for color_intensity = 0.12509 ***"
[1] "P-value for hue = 0.15082 ***"
[1] "P-value for od = 0.07745 ***"
[1] "P-value for proline = 0.52324 ***"
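comment (cultivar 1): malic_acid, total_phenols, nonflavanoids and Proanthocyanins do not follow a normal distribution (p-value below 0.05). In these listings, `***` marks a p-value above 0.05, i.e. a variable for which normality is not rejected.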
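
The report does not name the univariate test behind these p-values. A minimal sketch, assuming the Shapiro-Wilk test (`shapiro.test` in base R) applied to each numeric column within one group, could look like this. The function and argument names are hypothetical, and the built-in `iris` data stand in for the wine data.

```r
# Shapiro-Wilk p-value for every numeric column within one group.
# `data`, `group_col` and `level` are hypothetical argument names.
normality_pvalues <- function(data, group_col, level) {
  rows <- data[[group_col]] == level
  num  <- data[rows, sapply(data, is.numeric), drop = FALSE]
  sapply(num, function(x) shapiro.test(x)$p.value)
}

# Stand-in data: built-in iris, since the wine file is not bundled here.
p <- normality_pvalues(iris, "Species", "setosa")

# Print in the same style as the listings: "***" flags p > 0.05.
for (v in names(p)) {
  flag <- if (p[[v]] > 0.05) "***" else ""
  print(paste("P-value for", v, "=", round(p[[v]], 5), flag))
}
```

Note that this flag convention is the reverse of R's usual significance stars, so the listings should be read with care.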
[1] "P-value for alcohol = 0.11396 ***"
[1] "P-value for malic_acid = 0 "
[1] "P-value for ash = 0.61976 ***"
[1] "P-value for ash_alkalinity = 0.07397 ***"
[1] "P-value for magnesium = 0 "
[1] "P-value for total_phenols = 0.31801 ***"
[1] "P-value for flavanoids = 0.0015 "
[1] "P-value for nonflavanoids = 0.3128 ***"
[1] "P-value for Proanthocyanins = 0.00815 "
[1] "P-value for color_intensity = 0.00082 "
[1] "P-value for hue = 0.22493 ***"
[1] "P-value for od = 0.08904 ***"
[1] "P-value for proline = 0.00177 "
comment (cultivar 2): malic_acid, magnesium, flavanoids, Proanthocyanins, color_intensity and proline are not normally distributed.
[1] "P-value for alcohol = 0.64084 ***"
[1] "P-value for malic_acid = 0.73772 ***"
[1] "P-value for ash = 0.10923 ***"
[1] "P-value for ash_alkalinity = 0.09874 ***"
[1] "P-value for magnesium = 0.03865 "
[1] "P-value for total_phenols = 0.01577 "
[1] "P-value for flavanoids = 0.00036 "
[1] "P-value for nonflavanoids = 0.02284 "
[1] "P-value for Proanthocyanins = 0.00025 "
[1] "P-value for color_intensity = 0.08775 ***"
[1] "P-value for hue = 0.02819 "
[1] "P-value for od = 0.08311 ***"
[1] "P-value for proline = 0.45849 ***"
comment (cultivar 3): magnesium, total_phenols, flavanoids, nonflavanoids, Proanthocyanins and hue are not normally distributed.
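The following variables fail the normality check in at least one group: malic_acid, total_phenols, nonflavanoids, Proanthocyanins, magnesium, flavanoids, color_intensity and proline. (hue also fails in the third group, with p-value 0.02819, but it is left untransformed.)
We apply a Box-Cox transformation to each of these eight variables over the whole dataset and repeat the normality checks group by group.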
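
A minimal sketch of the Box-Cox step, assuming `MASS::boxcox` on an intercept-only model was used to choose the power. The report does not state the chosen lambdas; the value -0.3 for malic_acid below is an inference, made because it reproduces the transformed values printed later by `str()` (0.496, 0.529, 0.757, 0.605, 0.828) from the raw values in the preview table.

```r
library(MASS)  # ships with R; provides boxcox()

# Box-Cox transform of a positive vector x for a given lambda.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Raw malic_acid values from the first five rows of the data preview.
df <- data.frame(malic_acid = c(1.71, 1.78, 2.36, 1.95, 2.59))

# Profile log-likelihood over the default lambda grid (-2 to 2 in steps of 0.1).
# With only five values this estimate is illustrative, not the report's lambda.
prof <- boxcox(malic_acid ~ 1, data = df, plotit = FALSE)
lambda_hat <- prof$x[which.max(prof$y)]

# Assumed lambda = -0.3 (inferred from the str() output, not stated in the report).
transformed <- box_cox(df$malic_acid, lambda = -0.3)
round(transformed, 3)
```

On the full data one would run `boxcox()` once per variable over all 178 rows and then apply `box_cox()` with the selected lambda.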
[1] "P-value for alcohol = 0.47907 ***"
[1] "P-value for malic_acid = 0 "
[1] "P-value for ash = 0.15556 ***"
[1] "P-value for ash_alkalinity = 0.21609 ***"
[1] "P-value for magnesium = 0.50006 ***"
[1] "P-value for total_phenols = 0.05264 ***"
[1] "P-value for flavanoids = 0.77356 ***"
[1] "P-value for nonflavanoids = 0.36968 ***"
[1] "P-value for Proanthocyanins = 0.1333 ***"
[1] "P-value for color_intensity = 0.78541 ***"
[1] "P-value for hue = 0.15082 ***"
[1] "P-value for od = 0.07745 ***"
[1] "P-value for proline = 0.39451 ***"
comment (cultivar 1, after transformation): only malic_acid is still not normally distributed.
Now we draw scatterplots of some pairs of variables and add a confidence ellipse at level 0.90.
comment: almost 90% of the data points lie inside the ellipse.
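
The ellipse check can be sketched in base R as follows. This assumes a chi-square based ellipse built from the sample mean and covariance; the report does not say which plotting function or ellipse definition it used, and the data here are simulated stand-ins.

```r
set.seed(1)

# Stand-in bivariate sample (simulated); replace with two columns of one cultivar.
n  <- 500
x  <- rnorm(n)
y  <- 0.6 * x + rnorm(n, sd = 0.8)
xy <- cbind(x, y)

centre <- colMeans(xy)
S      <- cov(xy)
r2     <- qchisq(0.90, df = 2)   # squared radius of the 90% ellipse

# Share of points whose squared Mahalanobis distance falls inside the ellipse.
inside <- mean(mahalanobis(xy, centre, S) <= r2)

# Ellipse boundary: unit circle mapped through chol(S), scaled and shifted.
theta   <- seq(0, 2 * pi, length.out = 200)
circle  <- cbind(cos(theta), sin(theta))
ellipse <- sweep(sqrt(r2) * circle %*% chol(S), 2, centre, "+")

if (interactive()) {
  plot(xy, pch = 16, col = "grey40")
  lines(ellipse, col = "red", lwd = 2)
}
inside
```

Under bivariate normality `inside` should be close to 0.90; a clearly smaller share would suggest heavy tails or outliers in that pair of variables.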
Royston test for Multivariate Normality
data : boxdata1[, -1]
R : 45.97313
p-value : 1.177547e-05
Result : Data are not multivariate normal (sig.level = 0.05)
comment: the p-value is less than 0.05, so we reject the null hypothesis of multivariate normality at level 0.05.
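
As a cross-check that needs no extra package, multivariate normality can also be examined with Mardia's skewness and kurtosis statistics. This is a different test from Royston's and is shown only as a hedged sketch on simulated data; the function name `mardia` is hypothetical.

```r
# Mardia's multivariate skewness and kurtosis tests (base R only).
mardia <- function(X) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  p  <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  S  <- crossprod(Xc) / n                 # covariance with divisor n
  D  <- Xc %*% solve(S) %*% t(Xc)         # n x n matrix of d_ij

  b1 <- sum(D^3) / n^2                    # multivariate skewness
  b2 <- mean(diag(D)^2)                   # multivariate kurtosis

  skew_stat <- n * b1 / 6
  skew_df   <- p * (p + 1) * (p + 2) / 6
  kurt_z    <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)

  c(skew_p = pchisq(skew_stat, skew_df, lower.tail = FALSE),
    kurt_p = 2 * pnorm(abs(kurt_z), lower.tail = FALSE))
}

set.seed(2)
# Simulated, strongly skewed data: the skewness test should reject.
skewed <- matrix(exp(rnorm(600)), nrow = 200, ncol = 3)
res_mardia <- mardia(skewed)
res_mardia
```

Small p-values for either component indicate a departure from multivariate normality, in the same spirit as the Royston result above.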
[1] "P-value for alcohol = 0.11396 ***"
[1] "P-value for malic_acid = 0.33092 ***"
[1] "P-value for ash = 0.61976 ***"
[1] "P-value for ash_alkalinity = 0.07397 ***"
[1] "P-value for magnesium = 0.00102 "
[1] "P-value for total_phenols = 0.66604 ***"
[1] "P-value for flavanoids = 0.06954 ***"
[1] "P-value for nonflavanoids = 0.57319 ***"
[1] "P-value for Proanthocyanins = 0.04581 "
[1] "P-value for color_intensity = 0.4973 ***"
[1] "P-value for hue = 0.22493 ***"
[1] "P-value for od = 0.08904 ***"
[1] "P-value for proline = 0.65382 ***"
comment (cultivar 2, after transformation): magnesium and Proanthocyanins are still not normally distributed, although the p-value for Proanthocyanins (0.04581) is close to 0.05.
Now we draw scatterplots of some pairs of variables and add a confidence ellipse at level 0.90.
comment: almost 90% of the data points lie inside the ellipse.
Royston test for Multivariate Normality
data : boxdata2[, -1]
R : 30.0038
p-value : 0.004233989
Result : Data are not multivariate normal (sig.level = 0.05)
comment: the p-value is less than 0.05, so we reject the null hypothesis of multivariate normality at level 0.05.
[1] "P-value for alcohol = 0.64084 ***"
[1] "P-value for malic_acid = 0.0015 "
[1] "P-value for ash = 0.10923 ***"
[1] "P-value for ash_alkalinity = 0.09874 ***"
[1] "P-value for magnesium = 0.31166 ***"
[1] "P-value for total_phenols = 0.07849 ***"
[1] "P-value for flavanoids = 0.00223 "
[1] "P-value for nonflavanoids = 0.00106 "
[1] "P-value for Proanthocyanins = 0.01529 "
[1] "P-value for color_intensity = 0.10767 ***"
[1] "P-value for hue = 0.02819 "
[1] "P-value for od = 0.08311 ***"
[1] "P-value for proline = 0.78278 ***"
comment (cultivar 3, after transformation): malic_acid, flavanoids, nonflavanoids, Proanthocyanins and hue are not normally distributed.
Now we draw scatterplots of some pairs of variables and add a confidence ellipse at level 0.90.
comment: almost 90% of the data points lie inside the ellipse.
Royston test for Multivariate Normality
data : boxdata3[, -1]
R : 52.27771
p-value : 6.024032e-07
Result : Data are not multivariate normal (sig.level = 0.05)
Comment: the p-value is less than 0.05, so we reject the null hypothesis of multivariate normality at level 0.05.
'data.frame': 178 obs. of 14 variables:
$ cultivars : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
$ malic_acid : num 0.496 0.529 0.757 0.605 0.828 ...
$ ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ ash_alkalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ magnesium : num 0.713 0.713 0.713 0.713 0.713 ...
$ total_phenols : num 1.51 1.4 1.51 2.24 1.51 ...
$ flavanoids : num 1.7 1.48 1.82 2 1.43 ...
$ nonflavanoids : num -1.06 -1.11 -1.01 -1.16 -0.82 ...
$ Proanthocyanins: num 1.073 0.266 1.431 0.994 0.721 ...
$ color_intensity: num 1.89 1.59 1.9 2.28 1.58 ...
$ hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ od : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
$ proline : num 5.02 5.01 5.07 5.18 4.83 ...
We now check whether the covariance matrices of the three cultivar populations are equal. We test \[ H_0: \Sigma_1 = \Sigma_2 = \Sigma_3 \quad \text{vs.} \quad H_1: \Sigma_i \neq \Sigma_j \ \text{for some} \ i \neq j \]
Box's M-test for Homogeneity of Covariance Matrices
data: Y
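Chi-Sq (approx.) = 622.97, df = 182, p-value < 2.2e-16
Comment: The p-value is far below 0.05, so we reject \(H_0\) and conclude that the covariance matrices of the three cultivars are not all equal.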
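
For reference, the chi-square approximation to Box's M can be written out in base R. This is a hedged sketch of the textbook formula, not necessarily the exact routine the report called, and it is demonstrated on the built-in `iris` data. The degrees of freedom are p(p+1)(g-1)/2, which for p = 13 variables and g = 3 groups gives the 182 reported above.

```r
# Box's M test with the chi-square approximation (base R only).
# `box_m` is a hypothetical helper name.
box_m <- function(X, g) {
  X   <- as.matrix(X)
  g   <- factor(g)
  p   <- ncol(X)
  k   <- nlevels(g)
  N   <- nrow(X)
  n_i <- as.vector(table(g))

  # Per-group covariance matrices and their pooled version.
  covs   <- lapply(levels(g), function(l) cov(X[g == l, , drop = FALSE]))
  pooled <- Reduce(`+`, Map(function(S, n) (n - 1) * S, covs, n_i)) / (N - k)

  logdet <- function(S) as.numeric(determinant(S, logarithm = TRUE)$modulus)
  M <- (N - k) * logdet(pooled) - sum((n_i - 1) * sapply(covs, logdet))

  # Scaling constant of the chi-square approximation.
  C <- (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (k - 1)) *
    (sum(1 / (n_i - 1)) - 1 / (N - k))

  stat <- M * (1 - C)
  df   <- p * (p + 1) * (k - 1) / 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

res_boxm <- box_m(iris[, 1:4], iris$Species)   # stand-in data
res_boxm

# Degrees of freedom for the wine data: 13 variables, 3 cultivars.
13 * (13 + 1) * (3 - 1) / 2
```

Box's M is known to be sensitive to non-normality, so a rejection on data that already fail the normality checks should be read with some caution.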
We now test the equality of the mean vectors of the three cultivar populations. We test \[ H_0: \mu_1 = \mu_2 = \mu_3 \quad \text{vs.} \quad H_1: \mu_i \neq \mu_j \ \text{for some} \ i \neq j \]
Df Pillai approx F num Df den Df Pr(>F)
fdata[, 1] 2 1.7337 82.119 26 328 < 2.2e-16 ***
Residuals 175
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comment: The MANOVA test gives a very small p-value, so the null hypothesis that the three cultivars have the same mean vector is rejected. We conclude that the three cultivars do not all have the same mean.
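
The F approximation in the MANOVA table can be checked by hand from Pillai's trace. The sketch below uses the standard conversion with p = 13 responses, g = 3 groups and N = 178 observations; the function name `pillai_to_F` is hypothetical.

```r
# Convert Pillai's trace V into its approximate F statistic.
pillai_to_F <- function(V, p, g, N) {
  q <- g - 1                      # hypothesis degrees of freedom
  v <- N - g                      # residual degrees of freedom
  s <- min(p, q)
  m <- (abs(p - q) - 1) / 2
  n <- (v - p - 1) / 2

  df1   <- s * (2 * m + s + 1)
  df2   <- s * (2 * n + s + 1)
  Fstat <- (2 * n + s + 1) / (2 * m + s + 1) * V / (s - V)

  c(F = Fstat, num_df = df1, den_df = df2,
    p_value = pf(Fstat, df1, df2, lower.tail = FALSE))
}

res_pillai <- pillai_to_F(V = 1.7337, p = 13, g = 3, N = 178)
res_pillai
```

This gives 26 and 328 degrees of freedom and an F statistic of about 82.1, matching the table up to the rounding of the printed trace.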
We are also interested in dimension reduction. Here we have 13 covariates, so we look for the principal components that explain most of the variance.
Comment: The scree plot shows that the largest eigenvalue is much larger than the remaining eigenvalues.
Comment: This plot shows the same thing: the first PC explains most of the variability.
The next plot (a biplot) shows the data along with the projections of the original variables on the first two components.
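
A minimal sketch of the principal component step with `prcomp`, using the built-in `iris` data as a stand-in. Whether the report standardised the variables is not stated; `scale. = TRUE` below is an assumption, and without it a variable measured on a much larger scale can dominate the first component.

```r
# PCA on standardised variables (stand-in data: iris measurements).
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
round(cumsum(var_explained), 3)

# Scree plot and biplot, drawn only in an interactive session.
if (interactive()) {
  screeplot(pca, type = "lines")
  biplot(pca)
}
```

Comparing the cumulative proportions with and without scaling is a quick way to see whether one dominant first component reflects structure in the data or only differences in measurement units.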
We split the data into two parts: 100 randomly chosen observations form the training set and the remaining 78 form the test set.
Let us perform linear discriminant analysis (LDA), although the assumption of equal covariance matrices is not satisfied.
pre 1 2 3
1 27 1 0
2 0 33 1
3 0 0 16
Comment: LDA performs very well. With rows giving the predicted class (`pre`) and columns the true cultivar, one wine from cultivar 2 is classified as cultivar 1, one wine from cultivar 3 is classified as cultivar 2, and there is no other misclassification. In total 76 of the 78 test observations (about 97.4%) are classified correctly.
Here we perform a linear discriminant analysis with two covariates so that we can visualize the discriminant rule.
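
The split-and-classify procedure can be sketched with `MASS::lda`. The built-in `iris` data stand in for the wine data here, and the seed is arbitrary, so the resulting table is not the one shown above.

```r
library(MASS)  # ships with R; provides lda()

set.seed(1)

# 100 random rows for training, the rest for testing (stand-in data: iris).
train_idx <- sample(nrow(iris), 100)
train     <- iris[train_idx, ]
test      <- iris[-train_idx, ]

# Fit LDA on the training rows and predict the held-out rows.
fit_lda <- lda(Species ~ ., data = train)
pre     <- predict(fit_lda, newdata = test)$class

# Confusion matrix: rows are predicted classes, columns are true classes.
tab_lda <- table(pre, truth = test$Species)
tab_lda
```

Passing the prediction as the first argument of `table()` puts it on the rows, which is the same orientation as the `pre` label in the report's output.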
1 2 3
1 59 0 0
2 1 69 1
3 0 1 47
Now let us see how quadratic discriminant analysis (QDA) performs, although we know that the violation of the normality assumption affects it considerably.
pre 1 2 3
1 27 0 0
2 0 34 0
3 0 0 17
Comments: With QDA no test observation is misclassified, so it performs better than LDA here. Recall that Box’s M test rejected the homogeneity of the covariance matrices across cultivars. LDA relies on that homogeneity assumption while QDA does not, which is consistent with QDA performing better.
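
To put the two test-set tables on a common scale, the accuracy can be computed directly from the counts printed above. The block also shows a `MASS::qda` fit on stand-in data (`iris`); the helper name `accuracy` is hypothetical.

```r
library(MASS)  # ships with R; provides qda()

# Share of correctly classified observations in a square confusion matrix.
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

# Test-set confusion matrices as printed in the report (rows = predicted).
lda_tab <- matrix(c(27,  1,  0,
                     0, 33,  1,
                     0,  0, 16), nrow = 3, byrow = TRUE)
qda_tab <- diag(c(27, 34, 17))

accuracy(lda_tab)   # 76 of 78 correct
accuracy(qda_tab)   # 78 of 78 correct

# QDA on stand-in data with the same 100-row training design.
set.seed(1)
train_idx <- sample(nrow(iris), 100)
fit_qda   <- qda(Species ~ ., data = iris[train_idx, ])
pre_qda   <- predict(fit_qda, newdata = iris[-train_idx, ])$class
acc_qda   <- accuracy(table(pre_qda, iris$Species[-train_idx]))
acc_qda
```

The gap between the two methods is two observations out of 78 on a single random split, so repeating the split or using cross-validation would give a firmer comparison.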
1 2 3
1 59 0 0
2 1 70 0
3 0 0 48
Before starting factor analysis (FA), let us check whether the data, or rather the covariance matrix, is suitable for FA.
[1] 0.5
Comments: The covariance matrix is non-singular. The reported value is 0.5, so we do not reject the test and conclude that we can apply FA to these data.
Let us fit FA models with different numbers of factors.
Comments: The 6-factor model is appropriate.
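
One common way to compare numbers of factors is the likelihood-ratio test reported by `factanal`. The sketch below assumes that criterion (the report does not say how six factors were chosen) and uses simulated data with a known two-factor structure.

```r
set.seed(1)

# Simulated data from a two-factor model (stand-in for the wine covariates).
n <- 500
true_scores   <- matrix(rnorm(n * 2), n, 2)
true_loadings <- cbind(c(0.8, 0.7, 0.6, 0.7, 0, 0, 0, 0),
                       c(0, 0, 0, 0, 0.8, 0.7, 0.6, 0.7))
X <- true_scores %*% t(true_loadings) + matrix(rnorm(n * 8, sd = 0.6), n, 8)
colnames(X) <- paste0("v", 1:8)

# p-value of the test "k factors are sufficient" for k = 1, 2, 3.
# A fit that fails to converge is recorded as NA.
pvals <- sapply(1:3, function(k) {
  fit <- tryCatch(factanal(X, factors = k), error = function(e) NULL)
  if (is.null(fit)) NA_real_ else unname(fit$PVAL)
})
names(pvals) <- paste0(1:3, " factor(s)")
pvals
```

A small p-value says that k factors are not enough; one would typically keep the smallest k whose p-value is not small.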
Comment: We take two factors. The figure shows the relation between the factors and the variables, with their respective loadings.
Now let us check whether the factor scores from the 6-factor model satisfy the assumptions.
comments: The \(E(F)=0\) assumption is satisfied.
comments: The figure gives enough evidence to support the claim that \(Cov(\textbf F)=I\).
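
The two checks on the scores can be sketched numerically as well. This assumes regression-type scores from `factanal` and uses simulated data, so the numbers are not those of the report.

```r
set.seed(3)

# Simulated data with two underlying factors (stand-in for the wine covariates).
n <- 400
f <- matrix(rnorm(n * 2), n, 2)
L <- cbind(c(0.9, 0.8, 0.7, 0, 0, 0),
           c(0, 0, 0, 0.9, 0.8, 0.7))
X <- f %*% t(L) + matrix(rnorm(n * 6, sd = 0.5), n, 6)
colnames(X) <- paste0("v", 1:6)

# Fit the model and extract estimated factor scores.
fit_fa <- factanal(X, factors = 2, scores = "regression")
F_hat  <- fit_fa$scores

# Check E(F) = 0: column means of the scores.
score_means <- colMeans(F_hat)
round(score_means, 10)

# Check Cov(F) = I: sample covariance of the scores.
score_cov <- cov(F_hat)
round(score_cov, 3)
```

Estimated scores are centred by construction, while their sample covariance is only approximately the identity, so moderate deviations on the diagonal are expected.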
| | Factor1 | Factor2 | Factor3 | Factor4 | Factor5 | Factor6 |
|---|---|---|---|---|---|---|
| Factor1 | -1.228 | 0.063 | -0.016 | -0.109 | -0.063 | 0.030 |
| Factor2 | 0.063 | -1.083 | -0.004 | 0.048 | -0.013 | 0.004 |
| Factor3 | -0.016 | -0.004 | -1.056 | -0.012 | 0.011 | 0.011 |
| Factor4 | -0.109 | 0.048 | -0.012 | -1.262 | 0.078 | 0.027 |
| Factor5 | -0.063 | -0.013 | 0.011 | 0.078 | -1.184 | 0.002 |
| Factor6 | 0.030 | 0.004 | 0.011 | 0.027 | 0.002 | -1.008 |
comments: The specific variance matrix is not diagonal. We also checked the constraint that \(L'\psi^{-1}L\) is diagonal, and it is not satisfied.
We want to say a big thank you to everyone who helped us complete this project successfully. We would like to express our deep gratitude to Professor Swagata Nandi for assigning this topic and for providing the necessary guidance. We also give special thanks to Subhrangsu, Sourav and Subhendu for helping us throughout the whole process.
Indian Statistical Institute, Delhi