The data shows color and phenolic measurements from three diffrent types of red wine Tempranillo, Graciano, and Cabernet Sauvignon and from times 1.5 to 26 months. In the description of the data it is not mentioned exactly how the data was collected but is from M. MOnagas, P.J. Martin-Alvarez, B. Bartolome, C. Gomez-Cordoves (2006). “Statistical Interpretaion of the Color Parameters of Red Wines in Function of their Phenolic Composition During Aging in Bottle,” European Food Research Technology, Vol/ 222, pp. 702-209.
WineType (categorical): Type of wine (1 = Tempranillo, 2 = Graciano, 3 = Cabernet Sauvignon). Months (continuous): Time points in months (1.5, 3, 7, 9, 12, 19.5, 26).
Colorimetric Indices (continuous): CI_abs: Color Intensity. PctYellow: Percentage of yellow color. PctRed: Percentage of red color. PctBlue: Percentage of blue color. PctdA: Percentage of pure red. Tint: Measure of hue.
CIELAB Variables (continuous): L: Lightness. C: Chroma. h: Hue angle. a: Red/green component. b*: Yellow/blue component.
Phenolic Concentrations (mg/L) (continuous): GLU: Glucosides. AC: Acetyl-Glucoside. CIN: Cinnamolyl Glucosides. PYRUVILICS: Piruvylics. VINYLICS: Vinylics. FLAVANOLS: Flavanols. FLAVONOLS: Flavonols. CINNAMICS: Cinnamics. BENZOICS: Benzoics.
One question we can answer from this dataset are How does the color of each wine change over time?
wine = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/main/redwine_phenolics_color.csv")
pander(head(wine))
WineType | Months | CI_abs | PctYellow | PctRed | PctBlue | PctdA | Tint |
---|---|---|---|---|---|---|---|
1 | 1.5 | 0.54 | 39.63 | 47.3 | 13.08 | 44.28 | 0.84 |
1 | 3 | 0.53 | 40.11 | 47.6 | 12.29 | 44.95 | 0.84 |
1 | 7 | 0.59 | 41.14 | 45.87 | 13 | 40.98 | 0.9 |
1 | 9 | 0.61 | 42.41 | 43.55 | 14.03 | 35.19 | 0.97 |
1 | 12 | 0.58 | 43.12 | 43.17 | 13.71 | 34.18 | 1 |
1 | 19.5 | 0.65 | 43.53 | 42.53 | 13.91 | 32.43 | 1.02 |
L_star | C_star | h_star | a_star | b_star | GLU | AC | CIN |
---|---|---|---|---|---|---|---|
70.85 | 26.76 | 8.42 | 26.47 | 3.92 | 294 | 25 | 41.41 |
71.3 | 27.39 | 9.3 | 27.03 | 4.43 | 291.2 | 24.82 | 39.52 |
69.13 | 27.74 | 14.71 | 26.84 | 7.05 | 241.7 | 18.05 | 31.77 |
68.9 | 25.68 | 19.17 | 24.26 | 8.43 | 200.8 | 18.94 | 28.93 |
70.15 | 25.03 | 22.18 | 23.18 | 9.45 | 195.8 | 17.05 | 26.49 |
67.6 | 26.69 | 25.09 | 24.16 | 11.32 | 172.1 | 12.33 | 23.31 |
PYRUVILICS | VINYLICS | FLAVANOLS | FLAVONOLS | CINNAMICS | BENZOICS |
---|---|---|---|---|---|
3.54 | 0.85 | 40.4 | 10.16 | 0.73 | 20.54 |
3.44 | 0.84 | 36.43 | 11.2 | 1.06 | 20.54 |
3.09 | 0.85 | 28.98 | 10.08 | 1.46 | 20.46 |
2.51 | 0.85 | 26.03 | 10.02 | 1.56 | 19.69 |
2.31 | 0.9 | 23.78 | 8.29 | 1.59 | 17.94 |
1.98 | 0.99 | 25.54 | 8.99 | 2.07 | 16.47 |
cor(wine[, c("PctYellow" , "PctRed" , "PctBlue" , "PctdA" , "Tint" , "L_star" , "C_star" , "h_star" , "a_star" , "b_star")])
PctYellow PctRed PctBlue PctdA Tint L_star
PctYellow 1.0000000 -0.9967832 0.9470656 -0.9971463 0.9971792 0.8891158
PctRed -0.9967832 1.0000000 -0.9697443 0.9954049 -0.9925643 -0.8636488
PctBlue 0.9470656 -0.9697443 1.0000000 -0.9499914 0.9385418 0.7521327
PctdA -0.9971463 0.9954049 -0.9499914 1.0000000 -0.9991542 -0.8777421
Tint 0.9971792 -0.9925643 0.9385418 -0.9991542 1.0000000 0.8895708
L_star 0.8891158 -0.8636488 0.7521327 -0.8777421 0.8895708 1.0000000
C_star -0.9663385 0.9556817 -0.8854307 0.9578633 -0.9624469 -0.9713457
h_star 0.9202595 -0.9124449 0.8511498 -0.9369638 0.9369379 0.7443927
a_star -0.9742234 0.9635554 -0.8928374 0.9675497 -0.9718639 -0.9661964
b_star 0.7636888 -0.7742779 0.7747803 -0.7779744 0.7680176 0.4480203
C_star h_star a_star b_star
PctYellow -0.9663385 0.9202595 -0.9742234 0.7636888
PctRed 0.9556817 -0.9124449 0.9635554 -0.7742779
PctBlue -0.8854307 0.8511498 -0.8928374 0.7747803
PctdA 0.9578633 -0.9369638 0.9675497 -0.7779744
Tint -0.9624469 0.9369379 -0.9718639 0.7680176
L_star -0.9713457 0.7443927 -0.9661964 0.4480203
C_star 1.0000000 -0.8294239 0.9991101 -0.5982883
h_star -0.8294239 1.0000000 -0.8519375 0.9019515
a_star 0.9991101 -0.8519375 1.0000000 -0.6274512
b_star -0.5982883 0.9019515 -0.6274512 1.0000000
cor(wine[, c("GLU", "AC", "CIN", "PYRUVILICS", "VINYLICS", "FLAVANOLS", "FLAVONOLS", "CINNAMICS", "BENZOICS")])
GLU AC CIN PYRUVILICS VINYLICS
GLU 1.000000000 0.06646864 0.98425506 -0.006723363 0.4140117
AC 0.066468642 1.00000000 -0.05598275 0.933226840 -0.4080035
CIN 0.984255058 -0.05598275 1.00000000 -0.132460744 0.4483935
PYRUVILICS -0.006723363 0.93322684 -0.13246074 1.000000000 -0.4566231
VINYLICS 0.414011714 -0.40800350 0.44839346 -0.456623104 1.0000000
FLAVANOLS 0.114840758 0.36881975 0.10232394 0.391422113 -0.7746567
FLAVONOLS -0.276634829 0.09139609 -0.26011469 0.158195327 -0.8936213
CINNAMICS -0.197715491 -0.34981902 -0.15028912 -0.445395098 0.6741770
BENZOICS -0.184364479 -0.25095925 -0.15279299 -0.103557707 -0.6970817
FLAVANOLS FLAVONOLS CINNAMICS BENZOICS
GLU 0.1148408 -0.27663483 -0.1977155 -0.1843645
AC 0.3688197 0.09139609 -0.3498190 -0.2509593
CIN 0.1023239 -0.26011469 -0.1502891 -0.1527930
PYRUVILICS 0.3914221 0.15819533 -0.4453951 -0.1035577
VINYLICS -0.7746567 -0.89362126 0.6741770 -0.6970817
FLAVANOLS 1.0000000 0.85777860 -0.7266122 0.6674665
FLAVONOLS 0.8577786 1.00000000 -0.6295157 0.8783409
CINNAMICS -0.7266122 -0.62951570 1.0000000 -0.6054222
BENZOICS 0.6674665 0.87834089 -0.6054222 1.0000000
We look at the correlation between varibales and see lots of correlation which could lead to multicollinearity. We should section these variables and group them.
wine_filtered <- wine %>%
filter(WineType != 1) %>%
select(-PctYellow, -PctRed, -PctBlue, -PctdA, -L_star, -C_star, -h_star, -a_star, -b_star, -Tint, -WineType) %>%
mutate(Phenolics_Concentration = (GLU + AC + CIN + PYRUVILICS + VINYLICS + FLAVANOLS + FLAVONOLS + CINNAMICS + BENZOICS) / 9) %>%
select(-GLU, -AC, -CIN, -PYRUVILICS, -VINYLICS, -FLAVANOLS, -FLAVONOLS, -CINNAMICS, -BENZOICS)
str(wine_filtered)
'data.frame': 14 obs. of 3 variables:
$ Months : num 1.5 3 7 9 12 19.5 26 1.5 3 7 ...
$ CI_abs : num 1.13 1.12 1.1 1.07 1.03 1.03 0.96 1.06 1.01 1.01 ...
$ Phenolics_Concentration: num 52 48.9 41.8 30.7 28.2 ...
We filter out Wine 1 because we only want to look at Red Wine and not other colors which was included. We also delete the color variables because they are telling the same story as Ci_abs. We group the Phenolic analysis togethor because they are correlated and again tell the same story.
plot(wine_filtered$Months, wine_filtered$CI_abs, main = "Color vs Time", xlab = "Time", ylab = "Color Intensity")
We see here that there is some correlation between time of how long the
wine aged and the Color Intensity.
full.model = lm(CI_abs~ ., data = wine_filtered)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1.1377981 | 0.1201053 | 9.4733359 | 0.0000013 |
Months | -0.0051872 | 0.0035208 | -1.4733070 | 0.1686988 |
Phenolics_Concentration | -0.0008608 | 0.0024276 | -0.3545693 | 0.7296163 |
table(wine_filtered$WineType)
< table of extent 0 >
cor(wine_filtered)
Months CI_abs Phenolics_Concentration
Months 1.0000000 -0.6490598 -0.9148849
CI_abs -0.6490598 1.0000000 0.5611678
Phenolics_Concentration -0.9148849 0.5611678 1.0000000
We begin our serach for the final model by looking at a model that includes all the predictor values.
par(mfrow=c(2,2))
plot(full.model)
We see her that our residuals are all over ther place, most likely
because our lack of variables. Our looks slighlty normal with
someoutliers but we cannot assume normality. We also see unequal
variance.
vif(full.model)
Months Phenolics_Concentration
6.135513 6.135513
Our VIF values are a little big but we can continue our analysis.
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
boxcox(CI_abs ~ Phenolics_Concentration + log(Months)
, data = wine_filtered, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": log months")))
boxcox(CI_abs ~ Phenolics_Concentration + Months, data = wine_filtered, lambda = seq(-0.5, 1, length = 10),
xlab=expression(paste(lambda, ": months")))
boxcox(CI_abs ~ log(1+Phenolics_Concentration) + Months, data = wine_filtered, lambda = seq(-0.5, 1, length = 10),
xlab=expression(paste(lambda, ": log phenolics")))
##
boxcox(CI_abs ~ log(1+Months) + log(Phenolics_Concentration)
, data = wine_filtered, lambda = seq(-0.5, 1, length = 10),
xlab=expression(paste(lambda, ": log-months, log.phenolics")))
We do a Box-Cox transformation to try and fix our non normal QQ plot and
unequal variance.
sqrt.CI = lm((CI_abs)^0.5 ~ Months + Phenolics_Concentration, data = wine_filtered)
kable(summary(sqrt.CI)$coef, caption = "log-transformed model")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1.0694735 | 0.0586378 | 18.2386389 | 0.0000000 |
Months | -0.0026035 | 0.0017189 | -1.5146053 | 0.1580653 |
Phenolics_Concentration | -0.0004623 | 0.0011852 | -0.3900359 | 0.7039565 |
par(mfrow = c(2,2))
plot(sqrt.CI)
Firstly we do a Square-root transformation. We can see that we still
have non-constant variance and a non-normal QQ plot.
log.CI = lm(log(CI_abs) ~ Months + Phenolics_Concentration, data = wine_filtered)
kable(summary(log.CI)$coef, caption = "log-transformed model")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 0.1400287 | 0.1145686 | 1.2222259 | 0.2471659 |
Months | -0.0052261 | 0.0033585 | -1.5560876 | 0.1479720 |
Phenolics_Concentration | -0.0009855 | 0.0023157 | -0.4255795 | 0.6786214 |
par(mfrow = c(2,2))
plot(log.CI)
We then do a log transformation of the Color Intensity based off the Box
Cox transformation and build a linear regressionbased on log CI. Again
these plots do not help our non constant variance and non-normal QQ
plot.
par(pty = "s", mfrow = c(1, 3))
qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)
qqnorm(log.CI$residuals, main = "Log-CI")
qqline(log.CI$residuals)
qqnorm(sqrt.CI$residuals, main = "sqrt CI")
qqline(sqrt.CI$residuals)
select=function(m){
e = m$resid
n0 = length(e)
SSE=(m$df)*(summary(m)$sigma)^2
R.sq=summary(m)$r.squared
R.adj=summary(m)$adj.r
MSE=(summary(m)$sigma)^2
Cp=(SSE/MSE)-(n0-2*(n0-m$df))
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)
X=model.matrix(m)
H=X%*%solve(t(X)%*%X)%*%t(X)
d=e/(1-diag(H))
PRESS=t(d)%*%d
tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
tbl
}
We then do some goodness of fit measures for each of the three models
output.sum = rbind(select(full.model), select(sqrt.CI), select(log.CI))
row.names(output.sum) = c("full.model", "sqrt.CI", "log.CI")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS | |
---|---|---|---|---|---|---|---|
full.model | 0.0209950 | 0.4278182 | 0.3237851 | 3 | -85.03541 | -83.11823 | 0.0324937 |
sqrt.CI | 0.0050043 | 0.4322969 | 0.3290781 | 3 | -105.11112 | -103.19394 | 0.0077449 |
log.CI | 0.0191039 | 0.4368550 | 0.3344651 | 3 | -86.35687 | -84.43970 | 0.0295675 |
We see that our third model is the best out of the three models based on R R squared
kable(summary(log.CI)$coef, caption = "Inferential Statistics of Final Model")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 0.1400287 | 0.1145686 | 1.2222259 | 0.2471659 |
Months | -0.0052261 | 0.0033585 | -1.5560876 | 0.1479720 |
Phenolics_Concentration | -0.0009855 | 0.0023157 | -0.4255795 | 0.6786214 |
In our final model we see that our p-values are not significant and we would need to do some type of variable selection.
Since we have a non-normal and unequal variance we really have no statistical findings we can take away. We have a very small sample because of the deletion of many variables and combination of others. For later projects we should combine the colors instead of deleting them so hopefully we can assume normatlity and constant variance.
We performed a Box Cox transformation to serch for a final model for our data. We have a lot of violations so again we cannot make any inferences about the data.