Brief intro
I retrieved this data from an obscure but fun collection of datasets available here. The data were the subject of a 2013 paper in the Journal of the Science of Food and Agriculture. The authors looked at compounds in 40 lager beers to determine the contribution of phenolic and melanoidin content to antioxidant activity, which makes beers stay flavorful longer. Hence, “Make Beer Taste Better.”
Exploration
Thankfully, there are no missing data in this set. The variables are all continuous except for “beer”, an ID number. For some unknown reason, the predictor ‘Total sulphur dioxide’ is not included in the data.
This study looked separately at the contribution of total phenolic content and melanoidin content (predictors) to each of five measures of antioxidant activity (dependent variables).
Variables
beer – Beer ID
tpc – Total phenolic content (predictor)
ma – melanoidin content (predictor)
dsa – DPPH radical scavenging activity (dependent)
asa – ABTS radical cation scavenging activity (dependent)
orac – Oxygen radical absorbence activity (dependent)
rp – Reducing Power (dependent)
mca – Metal Chelating Activity (dependent)
Summary statistics
beer_df <- data.frame(read.csv("~/Dropbox/CUNY/IS605/Discussions/lager_antioxidant_reg.csv"))
describe(beer_df)[, -c(1,6,7,10)]## n mean sd median min max skew kurtosis se
## beer 40 20.50 11.69 20.50 1.00 40.00 0.00 -1.29 1.85
## tpc 40 168.23 41.74 163.11 84.64 267.27 0.26 -0.35 6.60
## ma 40 10.30 2.84 10.71 1.64 14.83 -1.01 1.01 0.45
## dsa 40 0.61 0.17 0.60 0.24 0.97 0.17 -0.20 0.03
## asa 40 1.34 0.40 1.36 0.16 2.23 -0.46 0.58 0.06
## orac 40 3.57 2.44 3.29 0.01 9.12 0.41 -0.69 0.39
## rp 40 0.69 0.18 0.66 0.38 1.30 1.02 1.48 0.03
## mca 40 29.71 22.20 21.56 5.36 80.89 1.14 -0.03 3.51
Check for correlations among variables
corPlot(beer_df[,c(2:8)])A look at the numbers shows that tpc and ma are modestly correlated (0.48) and tpc has a stronger positive correlation to the antioxidant measures than ma:
round(cor(beer_df[,c(2:8)]),4)## tpc ma dsa asa orac rp mca
## tpc 1.0000 0.4805 0.8319 0.5326 0.3580 0.7396 0.6219
## ma 0.4805 1.0000 0.3869 0.2638 -0.0650 0.3802 0.3167
## dsa 0.8319 0.3869 1.0000 0.4552 0.5360 0.6132 0.5406
## asa 0.5326 0.2638 0.4552 1.0000 0.2003 0.6614 0.3523
## orac 0.3580 -0.0650 0.5360 0.2003 1.0000 0.3190 0.1791
## rp 0.7396 0.3802 0.6132 0.6614 0.3190 1.0000 0.3743
## mca 0.6219 0.3167 0.5406 0.3523 0.1791 0.3743 1.0000
Distributions
Nothing too crazy here. Some skew in ma, rp, mca. As expected.
varnames <- colnames(beer_df)[1:8]
par(mfrow=c(1,2))
for (i in 2:8){
hist(beer_df[,i], freq=F, col='lightblue',
main=paste('Hist of ', varnames[i], split=" "), breaks=10)
lines(density(beer_df[,i]), col='red', lwd=3)
}
Outliers?
Some potential outliers. In a subsequent refinement, one could experiment with data transformations. But for this data, based on results to follow, probably not worth it.
par(mfrow=c(1,2))
for (i in 2:8){
boxplot(beer_df[,i], freq=F, col='lightblue',
main=paste('Boxplot of ', varnames[i], split=" "))
}Build models to predict each of the five antioxidant measures.
fit_dsa <- lm(dsa ~ tpc+ma, data=beer_df)
fit_asa <- lm(asa ~ tpc+ma, data=beer_df)
fit_orac <- lm(orac ~ tpc+ma, data=beer_df)
fit_rp <- lm(rp ~ tpc+ma, data=beer_df)
fit_mca <- lm(mca ~ tpc+ma, data=beer_df)
summary(fit_asa)##
## Call:
## lm(formula = asa ~ tpc + ma, data = beer_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09710 -0.13768 0.07892 0.17627 0.94328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.473614 0.253542 1.868 0.0697 .
## tpc 0.005032 0.001513 3.326 0.0020 **
## ma 0.001439 0.022261 0.065 0.9488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3459 on 37 degrees of freedom
## Multiple R-squared: 0.2837, Adjusted R-squared: 0.245
## F-statistic: 7.327 on 2 and 37 DF, p-value: 0.002086
summary(fit_dsa)##
## Call:
## lm(formula = dsa ~ tpc + ma, data = beer_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12939 -0.05549 -0.01595 0.02306 0.37028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0391199 0.0714976 0.547 0.588
## tpc 0.0034459 0.0004267 8.076 1.1e-09 ***
## ma -0.0010021 0.0062775 -0.160 0.874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09755 on 37 degrees of freedom
## Multiple R-squared: 0.6922, Adjusted R-squared: 0.6756
## F-statistic: 41.61 on 2 and 37 DF, p-value: 3.405e-10
summary(fit_mca)##
## Call:
## lm(formula = mca ~ tpc + ma, data = beer_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.089 -12.140 -0.752 10.300 45.019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.81143 13.07941 -2.050 0.047511 *
## tpc 0.32485 0.07806 4.162 0.000181 ***
## ma 0.18198 1.14838 0.158 0.874953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.85 on 37 degrees of freedom
## Multiple R-squared: 0.3872, Adjusted R-squared: 0.354
## F-statistic: 11.69 on 2 and 37 DF, p-value: 0.0001164
summary(fit_orac)##
## Call:
## lm(formula = orac ~ tpc + ma, data = beer_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1563 -1.4871 0.1516 1.3949 5.1867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.329002 1.637808 0.811 0.42230
## tpc 0.029523 0.009774 3.020 0.00456 **
## ma -0.264516 0.143800 -1.839 0.07388 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.235 on 37 degrees of freedom
## Multiple R-squared: 0.2012, Adjusted R-squared: 0.158
## F-statistic: 4.66 on 2 and 37 DF, p-value: 0.01567
summary(fit_rp)##
## Call:
## lm(formula = rp ~ tpc + ma, data = beer_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31842 -0.03862 -0.00862 0.05294 0.32473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1308174 0.0932954 1.402 0.169
## tpc 0.0031986 0.0005568 5.745 1.39e-06 ***
## ma 0.0020961 0.0081914 0.256 0.799
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1273 on 37 degrees of freedom
## Multiple R-squared: 0.5478, Adjusted R-squared: 0.5234
## F-statistic: 22.42 on 2 and 37 DF, p-value: 4.196e-07
Results
It looks like the two compounds are best at predicting DPPH radical scavenging activity (dsa) with adjusted R2 = 0.675. However, the melanoidin content (ma) is not significant at p < 0.5 in any of the models, so it appears that Total phenolic content (tpc) is responsible for most of the antioxidant activity measured.
Residual plots
Residual plots for each of the models show that errors are evenly disbursed (for the most part). Assumptions of linearity, normality, independence and constant variance appear to be – in general – satisified for these models, which could be tuned with attention to outliers and data transformations.
plot.new()
par(mfrow=c(2,2))
plot(fit_asa)plot.new()
par(mfrow=c(2,2))
plot(fit_dsa)plot.new()
par(mfrow=c(2,2))
plot(fit_mca)plot.new()
par(mfrow=c(2,2))
plot(fit_orac)plot.new()
par(mfrow=c(2,2))
plot(fit_rp)Linear regression is a good fit for the dsa assay
x <- beer_df$tpc
y <- beer_df$dsa
plot(x,y, main='Predicting DSA', col='blue')
abline(lm(dsa~tpc, data=beer_df), col='red', lwd=3)