Project Background

This project stems from a dataset with 73861 entries regarding the Original Gravity (OG), Final Gravity (FG), Color (according to the Standard Reference Method of measurement), International Bitterness Units (IBU), Alcohol by Volume (ABV), and Boil Time (BT) of various beers.

The purpose of the present research is to determine if ABV may be predicted by OG, FG, Color, IBU, and Boil Time. All variables are continuous and are reported in the units specific to their formulas (e.g., IBU is reported in units “IBU”).

Descriptives

Alcohol by Volume (ABV)

The mean alcohol by volume of the beer is 6.137% with a standard deviation of 1.884%. The median alcohol by volume is 5.790%. Also, there is a wide range of values in the dataset. The minimum alcohol by volume is 0.000%, and the maximum alcohol by volume is 54.720%.

library(readxl)
recipeData <- read_xlsx ("C:/Users/Admin/Downloads/recipeData.xlsx")
attach(recipeData)
summary(ABV)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.080   5.790   6.137   6.830  54.720
sd(ABV)
## [1] 1.88351

Original Gravity (OG)

The mean original gravity of the beer is 1.406 units with a standard deviation of 2.197 units. The median original gravity is 1.058 units.

summary(OG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.051   1.058   1.406   1.069  34.035
sd(OG)
## [1] 2.196908

Final Gravity (FG)

The mean final gravity of the beer is 1.076 units with a standard deviation of 0.433 units. The median final gravity is 1.013 units.

summary(FG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.003   1.011   1.013   1.076   1.017  23.425
sd(FG)
## [1] 0.4325241

Color

The mean color of the beer (according to Standard Reference Method units) is 13.400 units with a standard deviation of 11.945 units. The median color is 8.440 units.

summary(Color)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.17    8.44   13.40   16.79  186.00
sd(Color)
## [1] 11.94451

International Bitterness Units (IBU)

The mean IBU (International Bitterness Units) of the beer is 44.280 units with a standard deviation of 42.946 units. The median IBU is 35.770 units. The IBU values range from 0.000 to 3409.300 units.

summary(IBU)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   23.37   35.77   44.28   56.38 3409.30
sd(IBU)
## [1] 42.94551

Boil Time (BT)

The mean boil time is 65.070 minutes with a standard deviation of 15.024 minutes. The median boil time is 60.000 minutes.

summary(BoilTime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   60.00   60.00   65.07   60.00  240.00
sd(BoilTime)
## [1] 15.02423
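
The per-variable summaries above could also be produced in a single pass. The following is a minimal sketch, assuming a small made-up data frame (`toy`, a hypothetical stand-in for `recipeData`); the values are illustrative only and are not taken from the dataset.

```r
# Hypothetical toy data standing in for recipeData (values are made up)
toy <- data.frame(
  ABV = c(5.1, 5.8, 6.8, 9.2),
  OG  = c(1.051, 1.058, 1.069, 1.090),
  IBU = c(23.4, 35.8, 56.4, 80.0)
)

# One row per variable: mean, standard deviation, and median
descriptives <- t(sapply(toy, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x))
}))

print(round(descriptives, 3))
```

Applying the same `sapply()` call to the numeric columns of the real data would give one table in place of six separate `summary()` and `sd()` calls.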

Visualization

library(ggplot2)
ggplot(recipeData, aes(x = ABV)) +
  geom_histogram(fill="orange2") + geom_vline(xintercept=mean(ABV),color="red")+
  labs(title = "Alcohol by Volume Distribution of Beers",
       x = "Alcohol by Volume (%)", 
       y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The red line indicates the mean ABV. A few outliers are present, but they are barely visible on the graph. These high values pull the mean above the mode.
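
The `stat_bin()` message above suggests choosing a bin width explicitly. One common choice is the Freedman-Diaconis rule, which is swapped in here as an illustration and is not part of the original analysis. The sketch below applies it to the ABV quartiles and sample size reported earlier in this report.

```r
# Quartiles of ABV and sample size, copied from the summary output in this report
q1 <- 5.080
q3 <- 6.830
n  <- 73861

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * (q3 - q1) / n^(1/3)

print(round(fd_binwidth, 3))
```

The resulting value could be passed to `geom_histogram(binwidth = ...)`. Because of the extreme maximum (54.72%), such a narrow width produces several hundred bins, so a wider value may be preferable for display.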

library(ggplot2)
ggplot(recipeData, aes(x = Color)) +
  geom_histogram(fill="orange2")+geom_vline(xintercept=mean(Color),color="red")+
  labs(title = "Color Distribution of Beers",
       x = "Color", 
       y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The red line indicates the mean Color. A few outliers are present, but they are barely visible on the graph. These high values pull the mean above the mode.

library(ggplot2)
ggplot(recipeData, aes(x = OG)) +
  geom_histogram(fill="orange2")+ geom_vline(xintercept=mean(OG),color="red")+
  labs(title = "Original Gravity Distribution of Beers",
       x = "Original Gravity", 
       y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The mean OG of the beer is indicated by the red line. Some entries are significantly greater than the mean and mode.

library(ggplot2)
ggplot(recipeData, aes(x = FG)) +
  geom_histogram(fill="orange2")+ geom_vline(xintercept=mean(FG),color="red")+
  labs(title = "Final Gravity Distribution of Beers",
       x = "Final Gravity", 
       y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The FG distribution of the beer appears similar to the OG distribution. Some entries are greater than the mean and mode, thereby raising the mean.

library(ggplot2)
ggplot(recipeData, aes(x = IBU)) +
  geom_histogram(fill="orange2")+ geom_vline(xintercept=mean(IBU),color="red")+
  labs(title = "International Bitterness Units Distribution of Beers",
       x = "IBU", 
       y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The mean IBU is indicated by the red line. While there are only a few outliers, the mean appears higher than the mode.

library(ggplot2)
ggplot(recipeData, aes(x = BoilTime)) +
  geom_histogram(fill="orange2")+ geom_vline(xintercept=mean(BoilTime),color="red")+
  labs(title = "Boil Time Distribution of Beers",
       x = "Boil Time (minutes)", 
       y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of BT indicates that most beers take between approximately 50 and 100 minutes to boil, with a few outliers.

Multiple Linear Regression

Statement of Full Regression Line

model <- lm(ABV ~ OG + FG + Color + IBU + BoilTime, data = recipeData)
summary(model)
## 
## Call:
## lm(formula = ABV ~ OG + FG + Color + IBU + BoilTime, data = recipeData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.630  -0.884  -0.235   0.621  47.507 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.5178856  0.0449163 100.584  < 2e-16 ***
## OG           0.0538659  0.0082575   6.523 6.92e-11 ***
## FG          -0.1815787  0.0419520  -4.328 1.51e-05 ***
## Color        0.0395175  0.0005351  73.850  < 2e-16 ***
## IBU          0.0118083  0.0001489  79.328  < 2e-16 ***
## BoilTime     0.0105421  0.0004256  24.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.729 on 73855 degrees of freedom
## Multiple R-squared:  0.1572, Adjusted R-squared:  0.1571 
## F-statistic:  2755 on 5 and 73855 DF,  p-value: < 2.2e-16

The full regression line is as follows, such that \(Y=ABV\): \(Y=4.52+0.05X_{OG}-0.18X_{FG}+0.04X_{Color}+0.01X_{IBU}+0.01X_{BT}\)

The p-value for the F-statistic for this model is nearly zero (\(p < 2.2 \times 10^{-16}\)). A formal test for the significance of the regression line follows.

Test for Significance of Regression Line

one_summary <- summary(model)
# Format a p-value for reporting: values that round to 0 at four decimals are shown as "p < 0.0001"
p.value.string <- function(p.value){
  p.value <- round(p.value, digits=4)
  if (p.value == 0) {
    return("p < 0.0001")
  } else {
    return(paste0("p = ", format(p.value, scientific = FALSE)))
  }
}

Hypotheses

   \(H_0: \ \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5=0\)
   \(H_1:\) at least one \(\beta_i \ne 0\)

Test Statistic

   \(F_0 = 2754.87\).

p-value

   \(p < 0.0001\).

Rejection Region

   Reject if \(p < \alpha\), where \(\alpha=0.05\).

Conclusion and Interpretation

   Reject \(H_0\). There is sufficient evidence that the regression line is significant. Partial F tests will follow.
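
As a cross-check, the overall F statistic can be recovered from the multiple \(R^2\) through \(F_0 = \frac{R^2/k}{(1-R^2)/(n-k-1)}\). The sketch below uses the rounded values printed in the `summary(model)` output, so the result agrees with the reported statistic only up to rounding.

```r
# Values copied from the summary(model) output in this report
r2  <- 0.1572   # Multiple R-squared (rounded as printed)
k   <- 5        # number of predictors
dfe <- 73855    # residual degrees of freedom (n - k - 1)

# Overall F statistic expressed through R-squared
f0 <- (r2 / k) / ((1 - r2) / dfe)

# Upper-tail p-value from the F distribution
p <- pf(f0, df1 = k, df2 = dfe, lower.tail = FALSE)

print(round(f0, 1))
print(p < 0.0001)
```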

Partial F tests

OG Alone

full = lm(ABV ~ OG, data=recipeData)

out <- anova(full) 
library(pander)
## Warning: package 'pander' was built under R version 4.0.3
pander(out, style='rmarkdown')

Analysis of Variance Table

            Df      Sum Sq   Mean Sq   F value   Pr(>F)
OG          1       227      227       64.04     1.238e-15
Residuals   73859   261800   3.545     NA        NA

FG after adjusting for OG

full = lm(ABV ~ OG + FG, data=recipeData) # Full Model
reduced = lm(ABV ~ OG, data=recipeData) # Reduced model

out <- anova(reduced, full) # Compare the models
library(pander)
pander(out, style='rmarkdown')

Analysis of Variance Table

Res.Df   RSS      Df   Sum of Sq   F        Pr(>F)
73859    261800   NA   NA          NA       NA
73858    261798   1    1.065       0.3004   0.5836

Color after adjusting for OG and FG

full = lm(ABV ~ OG + FG + Color, data=recipeData) # Full Model
reduced = lm(ABV ~ OG + FG, data=recipeData) # Reduced model

out <- anova(reduced, full) # Compare the models

Analysis of Variance Table

Res.Df   RSS      Df   Sum of Sq   F      Pr(>F)
73858    261798   NA   NA          NA     NA
73857    242502   1    19297       5877   0

IBU after adjusting for OG, FG, and Color

full = lm(ABV ~ OG + FG + Color + IBU, data=recipeData) # Full Model
reduced = lm(ABV ~ OG + FG + Color, data=recipeData) # Reduced model

out <- anova(reduced, full) # Compare the models

Analysis of Variance Table

Res.Df   RSS      Df   Sum of Sq   F      Pr(>F)
73857    242502   NA   NA          NA     NA
73856    222673   1    19828       6577   0

Boil Time after adjusting for OG, FG, Color, and IBU

full = lm(ABV ~ OG + FG + Color + IBU + BoilTime, data=recipeData) # Full Model
reduced = lm(ABV ~ OG + FG + Color + IBU, data=recipeData) # Reduced model

out <- anova(reduced, full) # Compare the models

Analysis of Variance Table

Res.Df   RSS      Df   Sum of Sq   F       Pr(>F)
73856    222673   NA   NA          NA      NA
73855    220839   1    1834        613.4   7.151e-135
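
Each partial F statistic above is the added sum of squares divided by the mean squared error of the larger model, \(F = \frac{SS_{added}/1}{RSS_{full}/df_{full}}\). The sketch below recomputes the four statistics from the rounded table entries, so small differences from the printed values are expected.

```r
# "Sum of Sq", RSS, and residual df of the larger model, copied from the tables above
ss_added <- c(FG = 1.065,  Color = 19297,  IBU = 19828,  BoilTime = 1834)
rss_full <- c(FG = 261798, Color = 242502, IBU = 222673, BoilTime = 220839)
df_full  <- c(FG = 73858,  Color = 73857,  IBU = 73856,  BoilTime = 73855)

# Partial F: one added predictor, so the numerator df is 1
partial_f <- ss_added / (rss_full / df_full)

# Upper-tail p-values on (1, df_full) degrees of freedom
partial_p <- pf(partial_f, df1 = 1, df2 = df_full, lower.tail = FALSE)

print(round(partial_f, 4))
print(signif(partial_p, 4))
```

Only the FG test fails to reject at \(\alpha = 0.05\), which matches the tables above.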

Reduced Model

model2 <- lm(ABV ~ OG + Color + IBU + BoilTime, data = recipeData)
summary(model2)
## 
## Call:
## lm(formula = ABV ~ OG + Color + IBU + BoilTime, data = recipeData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.631  -0.884  -0.235   0.622  47.513 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.3704743  0.0292871 149.229  < 2e-16 ***
## OG          0.0203987  0.0028983   7.038 1.96e-12 ***
## Color       0.0394044  0.0005345  73.718  < 2e-16 ***
## IBU         0.0118084  0.0001489  79.319  < 2e-16 ***
## BoilTime    0.0105518  0.0004257  24.788  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.729 on 73856 degrees of freedom
## Multiple R-squared:  0.157,  Adjusted R-squared:  0.1569 
## F-statistic:  3438 on 4 and 73856 DF,  p-value: < 2.2e-16

The reduced model (without FG as a predictor) is the following:

\(Y=4.37+0.02X_{OG}+0.04X_{Color}+0.01X_{IBU}+0.01X_{BT}\)
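
To illustrate how the reduced model is used, the sketch below computes a predicted ABV from the coefficient estimates in the `summary(model2)` output, including the intercept. The input beer is a hypothetical one whose OG, Color, IBU, and Boil Time are set to the sample medians reported in the Descriptives section.

```r
# Coefficient estimates copied from the summary(model2) output in this report
coefs <- c(intercept = 4.3704743, OG = 0.0203987, Color = 0.0394044,
           IBU = 0.0118084, BoilTime = 0.0105518)

# Hypothetical beer with every predictor at its reported sample median
median_beer <- c(OG = 1.058, Color = 8.44, IBU = 35.77, BoilTime = 60)

# Prediction = intercept + sum of (slope * predictor value)
pred_abv <- unname(coefs["intercept"] + sum(coefs[names(median_beer)] * median_beer))

print(round(pred_abv, 2))
```

The prediction comes out near 5.78%, close to the median ABV of 5.790%. Most of that value comes from the intercept.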

Test for Significance of Reduced Model

one_summary <- summary(model2)

Hypotheses

   \(H_0: \ \beta_1 = \beta_2 = \beta_3 = \beta_4=0\)
   \(H_1:\) at least one \(\beta_i \ne 0\)

Test Statistic

   \(F_0 = 3438.08\).

p-value

   \(p < 0.0001\).

Rejection Region

   Reject if \(p < \alpha\), where \(\alpha=0.05\).

Conclusion and Interpretation

   Reject \(H_0\). There is sufficient evidence to suggest that the regression line is significant.

95% confidence interval for \(\beta_i\)

library(tibble)
## Warning: package 'tibble' was built under R version 4.0.3
one_ci <- as_tibble(confint(model2, level=0.95))

The confidence interval for the slope of OG is (0.0147, 0.0261).

The confidence interval for the slope of Color is (0.0384, 0.0405).

The confidence interval for the slope of IBU is (0.0115, 0.0121).

The confidence interval for the slope of Boil Time is (0.0097, 0.0114).
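
These intervals follow the usual form \(\hat{\beta}_i \pm t_{0.975,\,df} \cdot SE(\hat{\beta}_i)\). The sketch below rebuilds them from the estimates and standard errors printed in the `summary(model2)` output, as a check on what `confint()` returns.

```r
# Estimates and standard errors copied from the summary(model2) output
est <- c(OG = 0.0203987, Color = 0.0394044, IBU = 0.0118084, BoilTime = 0.0105518)
se  <- c(OG = 0.0028983, Color = 0.0005345, IBU = 0.0001489, BoilTime = 0.0004257)
df_resid <- 73856  # residual degrees of freedom of the reduced model

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

# Lower and upper limits for each slope
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(round(ci, 4))
```

None of the intervals contains zero, consistent with the significance of each slope in the reduced model.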

Adjusted \(R^2\)

summary(model2)$adj.r.squared 
## [1] 0.1569295
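
The adjusted \(R^2\) printed above can be cross-checked by hand with \(R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-k-1}\). The sketch below uses the rounded multiple \(R^2\) from the `summary(model2)` output, so it matches the printed value only to about three decimals.

```r
# Values copied from this report
r2 <- 0.157   # Multiple R-squared of the reduced model (rounded as printed)
n  <- 73861   # number of entries in the dataset
k  <- 4       # number of predictors in the reduced model

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adj_r2, 3))
```

With a sample this large, the adjustment is negligible: the adjusted and unadjusted values differ only in the fourth decimal place.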

The adjusted \(R^2\) value is \(0.1569295\).

Conclusion

The partial F test indicated that Final Gravity did not add significantly to a model already containing Original Gravity (\(p = 0.5836\)), so Final Gravity was removed from the original model. In the reduced model, Original Gravity, Color, International Bitterness Units, and Boil Time are all statistically significant predictors of alcohol by volume. However, with an adjusted \(R^2\) of about 0.157, the model explains only about 16% of the variance in alcohol by volume, so its predictions should be interpreted with caution.