Make Beer Taste Better

Brief intro

I retrieved this data from an obscure but fun collection of datasets available here. The data were the subject of a 2013 paper in the Journal of the Science of Food and Agriculture. The authors looked at compounds in 40 lager beers to determine the contribution of phenolic and melanoidin content to antioxidant activity, which makes beers stay flavorful longer. Hence, “Make Beer Taste Better.”

Exploration

Thankfully, there are no missing data in this set. The variables are all continuous except for “beer”, an ID number. For some unknown reason, the predictor ‘Total sulphur dioxide’ is not included in the data.

This study looked separately at the contribution of total phenolic content and melanoidin content (predictors) to each of five measures of antioxidant activity (dependent variables).

Variables
beer – Beer ID
tpc – Total phenolic content (predictor)
ma – melanoidin content (predictor)
dsa – DPPH radical scavenging activity (dependent)
asa – ABTS radical cation scavenging activity (dependent)
orac – Oxygen radical absorbence activity (dependent)
rp – Reducing Power (dependent)
mca – Metal Chelating Activity (dependent)

Summary statistics

beer_df <- data.frame(read.csv("~/Dropbox/CUNY/IS605/Discussions/lager_antioxidant_reg.csv"))
describe(beer_df)[, -c(1,6,7,10)]

##       n   mean    sd median   min    max  skew kurtosis   se
## beer 40  20.50 11.69  20.50  1.00  40.00  0.00    -1.29 1.85
## tpc  40 168.23 41.74 163.11 84.64 267.27  0.26    -0.35 6.60
## ma   40  10.30  2.84  10.71  1.64  14.83 -1.01     1.01 0.45
## dsa  40   0.61  0.17   0.60  0.24   0.97  0.17    -0.20 0.03
## asa  40   1.34  0.40   1.36  0.16   2.23 -0.46     0.58 0.06
## orac 40   3.57  2.44   3.29  0.01   9.12  0.41    -0.69 0.39
## rp   40   0.69  0.18   0.66  0.38   1.30  1.02     1.48 0.03
## mca  40  29.71 22.20  21.56  5.36  80.89  1.14    -0.03 3.51

Check for correlations among variables

corPlot(beer_df[,c(2:8)])

A look at the numbers shows that tpc and ma are modestly correlated (0.48) and tpc has a stronger positive correlation to the antioxidant measures than ma:

round(cor(beer_df[,c(2:8)]),4)

##         tpc      ma    dsa    asa    orac     rp    mca
## tpc  1.0000  0.4805 0.8319 0.5326  0.3580 0.7396 0.6219
## ma   0.4805  1.0000 0.3869 0.2638 -0.0650 0.3802 0.3167
## dsa  0.8319  0.3869 1.0000 0.4552  0.5360 0.6132 0.5406
## asa  0.5326  0.2638 0.4552 1.0000  0.2003 0.6614 0.3523
## orac 0.3580 -0.0650 0.5360 0.2003  1.0000 0.3190 0.1791
## rp   0.7396  0.3802 0.6132 0.6614  0.3190 1.0000 0.3743
## mca  0.6219  0.3167 0.5406 0.3523  0.1791 0.3743 1.0000

Distributions

Nothing too crazy here. Some skew in ma, rp, mca. As expected.

varnames <- colnames(beer_df)[1:8]
par(mfrow=c(1,2))
for (i in 2:8){
 hist(beer_df[,i], freq=F, col='lightblue',
      main=paste('Hist of ', varnames[i], split=" "), breaks=10)
 lines(density(beer_df[,i]), col='red', lwd=3)
}

Outliers?

Some potential outliers. In a subsequent refinement, one could experiment with data transformations. But for this data, based on results to follow, probably not worth it.

par(mfrow=c(1,2))
for (i in 2:8){
 boxplot(beer_df[,i], freq=F, col='lightblue',
      main=paste('Boxplot of ', varnames[i], split=" "))
}

Build models to predict each of the five antioxidant measures.

fit_dsa <- lm(dsa ~ tpc+ma, data=beer_df)
fit_asa <- lm(asa ~ tpc+ma, data=beer_df)
fit_orac <- lm(orac ~ tpc+ma, data=beer_df)
fit_rp <- lm(rp ~ tpc+ma, data=beer_df)
fit_mca <- lm(mca ~ tpc+ma, data=beer_df)


summary(fit_asa)

## 
## Call:
## lm(formula = asa ~ tpc + ma, data = beer_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09710 -0.13768  0.07892  0.17627  0.94328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 0.473614   0.253542   1.868   0.0697 . 
## tpc         0.005032   0.001513   3.326   0.0020 **
## ma          0.001439   0.022261   0.065   0.9488   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3459 on 37 degrees of freedom
## Multiple R-squared:  0.2837, Adjusted R-squared:  0.245 
## F-statistic: 7.327 on 2 and 37 DF,  p-value: 0.002086

summary(fit_dsa)

## 
## Call:
## lm(formula = dsa ~ tpc + ma, data = beer_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12939 -0.05549 -0.01595  0.02306  0.37028 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0391199  0.0714976   0.547    0.588    
## tpc          0.0034459  0.0004267   8.076  1.1e-09 ***
## ma          -0.0010021  0.0062775  -0.160    0.874    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09755 on 37 degrees of freedom
## Multiple R-squared:  0.6922, Adjusted R-squared:  0.6756 
## F-statistic: 41.61 on 2 and 37 DF,  p-value: 3.405e-10

summary(fit_mca)

## 
## Call:
## lm(formula = mca ~ tpc + ma, data = beer_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.089 -12.140  -0.752  10.300  45.019 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -26.81143   13.07941  -2.050 0.047511 *  
## tpc           0.32485    0.07806   4.162 0.000181 ***
## ma            0.18198    1.14838   0.158 0.874953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.85 on 37 degrees of freedom
## Multiple R-squared:  0.3872, Adjusted R-squared:  0.354 
## F-statistic: 11.69 on 2 and 37 DF,  p-value: 0.0001164

summary(fit_orac)

## 
## Call:
## lm(formula = orac ~ tpc + ma, data = beer_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1563 -1.4871  0.1516  1.3949  5.1867 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1.329002   1.637808   0.811  0.42230   
## tpc          0.029523   0.009774   3.020  0.00456 **
## ma          -0.264516   0.143800  -1.839  0.07388 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.235 on 37 degrees of freedom
## Multiple R-squared:  0.2012, Adjusted R-squared:  0.158 
## F-statistic:  4.66 on 2 and 37 DF,  p-value: 0.01567

summary(fit_rp)

## 
## Call:
## lm(formula = rp ~ tpc + ma, data = beer_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31842 -0.03862 -0.00862  0.05294  0.32473 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.1308174  0.0932954   1.402    0.169    
## tpc         0.0031986  0.0005568   5.745 1.39e-06 ***
## ma          0.0020961  0.0081914   0.256    0.799    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1273 on 37 degrees of freedom
## Multiple R-squared:  0.5478, Adjusted R-squared:  0.5234 
## F-statistic: 22.42 on 2 and 37 DF,  p-value: 4.196e-07

Results

It looks like the two compounds are best at predicting DPPH radical scavenging activity (dsa) with adjusted R² = 0.675. However, the melanoidin content (ma) is not significant at p < 0.5 in any of the models, so it appears that Total phenolic content (tpc) is responsible for most of the antioxidant activity measured.

Residual plots

Residual plots for each of the models show that errors are evenly disbursed (for the most part). Assumptions of linearity, normality, independence and constant variance appear to be – in general – satisified for these models, which could be tuned with attention to outliers and data transformations.

plot.new()
par(mfrow=c(2,2))
plot(fit_asa)

plot.new()
par(mfrow=c(2,2))
plot(fit_dsa)

plot.new()
par(mfrow=c(2,2))
plot(fit_mca)

plot.new()
par(mfrow=c(2,2))
plot(fit_orac)

plot.new()
par(mfrow=c(2,2))
plot(fit_rp)

Linear regression is a good fit for the dsa assay

x <- beer_df$tpc
y <- beer_df$dsa

plot(x,y, main='Predicting DSA', col='blue')
abline(lm(dsa~tpc, data=beer_df), col='red', lwd=3)