Beer Profile and Ratings Analysis

About the Data:

The Beer Profile and Ratings dataset from Kaggle was used for the project. The main data set (beer_profile_and_ratings.csv) contains the following columns:

(General)
• Name: Beer name (label)
• Style: Beer Style
• Brewery: Brewery name
• Beer Name: Complete beer name (Brewery + Brew Name)
• Description: Notes on the beer if available
• ABV: Alcohol content of beer (% by volume)
• Min IBU: The minimum IBU value each beer can possess
• Max IBU: The maximum IBU value each beer can possess

(Mouth feel)
• Astringency
• Body
• Alcohol

(Taste)
• Bitter
• Sweet
• Sour
• Salty

(Flavor And Aroma)
• Fruits
• Hoppy
• Spices
• Malty

(Reviews)
• review_aroma
• review_appearance
• review_palate
• review_taste
• review_overall
• number_of_reviews

Loading the libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(dplyr)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(boot)

# Loading the dataset in beers data frame

beers <- read.csv("/Users/bhavyakalra/Desktop/Stats R/Final_project_beer/beer_profile_and_ratings.csv")
head(beers)

##                           Name   Style
## 1                        Amber Altbier
## 2                   Double Bag Altbier
## 3               Long Trail Ale Altbier
## 4                 Doppelsticke Altbier
## 5 Sleigh'r Dark Doüble Alt Ale Altbier
## 6                       Sticke Altbier
##                                            Brewery
## 1                              Alaskan Brewing Co.
## 2                           Long Trail Brewing Co.
## 3                           Long Trail Brewing Co.
## 4 Uerige Obergärige Hausbrauerei GmbH / Zum Uerige
## 5                          Ninkasi Brewing Company
## 6 Uerige Obergärige Hausbrauerei GmbH / Zum Uerige
##                                                       Beer.Name..Full.
## 1                                    Alaskan Brewing Co. Alaskan Amber
## 2                                    Long Trail Brewing Co. Double Bag
## 3                                Long Trail Brewing Co. Long Trail Ale
## 4 Uerige Obergärige Hausbrauerei GmbH / Zum Uerige Uerige Doppelsticke
## 5                 Ninkasi Brewing Company Sleigh'r Dark Doüble Alt Ale
## 6       Uerige Obergärige Hausbrauerei GmbH / Zum Uerige Uerige Sticke
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Notes:Richly malty and long on the palate, with just enough hop backing to make this beautiful amber colored "alt" style beer notably well balanced.\\t
## 2 Notes:This malty, full-bodied double alt is also known as “Stickebier” – German slang for “secret brew”. Long Trail Double Bag was originally offered only in our brewery taproom as a special treat to our visitors. With an alcohol content of 7.2%, please indulge in moderation. The Long Trail Brewing Company is proud to have Double Bag named Malt Advocate’s “Beer of the Year” in 2001. Malt Advocate is a national magazine devoted to “expanding the boundaries of fine drinks”. Their panel of judges likes to keep things simple, and therefore of thousands of eligible competitors they award only two categories: “Imported” and “Domestic”. It is a great honor to receive this recognition.33 IBU\\t
## 3                                                                                                                                                                                                                                                                                           Notes:Long Trail Ale is a full-bodied amber ale modeled after the “Alt-biers” of Düsseldorf, Germany. Our top fermenting yeast and cold finishing temperature result in a complex, yet clean, full flavor. Originally introduced in November of 1989, Long Trail Ale beer quickly became, and remains, the largest selling craft-brew in Vermont. It is a multiple medal winner at the Great American Beer Festival.25 IBU\\t
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Notes:
## 5                                                                                                                                                     Notes:Called 'Dark Double Alt' on the label.Seize the season with Sleigh'r. Layers of deeply toasted malt are balanced by just enough hop bitterness to make it deceivingly drinkable. Paired with a dry finish, Sleigh’r is anything but your typical winter brew.An Alt ferments with Ale yeast at colder lagering temperatures. This effect gives Alts a more refined, crisp lager-like flavor than traditional ales. The Alt has been “Ninkasified” raising the ABV and IBUs. Sleigh'r has a deep, toasted malt flavor that finishes dry and balanced.50 IBU\\t
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Notes:
##   ABV Min.IBU Max.IBU Astringency Body Alcohol Bitter Sweet Sour Salty Fruits
## 1 5.3      25      50          13   32       9     47    74   33     0     33
## 2 7.2      25      50          12   57      18     33    55   16     0     24
## 3 5.0      25      50          14   37       6     42    43   11     0     10
## 4 8.5      25      50          13   55      31     47   101   18     1     49
## 5 7.2      25      50          25   51      26     44    45    9     1     11
## 6 6.0      25      50          22   45      13     46    62   25     1     34
##   Hoppy Spices Malty review_aroma review_appearance review_palate review_taste
## 1    57      8   111     3.498994          3.636821      3.556338     3.643863
## 2    35     12    84     3.798337          3.846154      3.904366     4.024948
## 3    54      4    62     3.409814          3.667109      3.600796     3.631300
## 4    40     16   119     4.148098          4.033967      4.150815     4.205163
## 5    51     20    95     3.625000          3.973958      3.734375     3.765625
## 6    60      4   103     4.007937          4.007937      4.087302     4.192063
##   review_overall number_of_reviews
## 1       3.847082               497
## 2       4.034304               481
## 3       3.830239               377
## 4       4.005435               368
## 5       3.817708                96
## 6       4.230159               315
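Before subsetting, it can help to confirm that the columns used later are present, numeric and free of missing values. The following is a minimal sketch on a small hypothetical data frame (toy_beers, with three rows copied from the output above). The helper name check_columns is illustrative and not part of the original analysis.

```r
# Hypothetical stand-in for the first rows of `beers` (values copied from head(beers))
toy_beers <- data.frame(
  Name = c("Amber", "Double Bag", "Long Trail Ale"),
  ABV = c(5.3, 7.2, 5.0),
  Min.IBU = c(25, 25, 25),
  Max.IBU = c(50, 50, 50),
  review_aroma = c(3.498994, 3.798337, 3.409814)
)

# Illustrative helper: for each wanted column, report whether it exists,
# whether it is numeric, and how many values are missing
check_columns <- function(df, cols) {
  data.frame(
    column = cols,
    present = cols %in% names(df),
    numeric = vapply(cols, function(cn) cn %in% names(df) && is.numeric(df[[cn]]),
                     logical(1), USE.NAMES = FALSE),
    n_missing = vapply(cols, function(cn) {
      if (cn %in% names(df)) sum(is.na(df[[cn]])) else NA_integer_
    }, integer(1), USE.NAMES = FALSE),
    row.names = NULL
  )
}

# "Sweet" is deliberately absent from the toy data to show a failed check
report <- check_columns(toy_beers, c("ABV", "Min.IBU", "Max.IBU", "review_aroma", "Sweet"))
print(report)
```

On the real data the same call could be made with beers in place of toy_beers.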

Making sets of variable combinations from the beers dataset.

Set 1: Review Ratings

# Create a subset of columns related to review ratings
review_data <- beers[, c("review_aroma", "review_appearance", "review_palate", "review_taste", "review_overall")]

# Calculate the Average Review Rating
review_data$Average_Rating <- rowMeans(review_data)

response_variable <- review_data$Average_Rating

# Explanatory variables (individual review ratings)
explanatory_variables <- review_data[, c("review_aroma", "review_appearance", "review_palate", "review_taste")]


head(review_data)

##   review_aroma review_appearance review_palate review_taste review_overall
## 1     3.498994          3.636821      3.556338     3.643863       3.847082
## 2     3.798337          3.846154      3.904366     4.024948       4.034304
## 3     3.409814          3.667109      3.600796     3.631300       3.830239
## 4     4.148098          4.033967      4.150815     4.205163       4.005435
## 5     3.625000          3.973958      3.734375     3.765625       3.817708
## 6     4.007937          4.007937      4.087302     4.192063       4.230159
##   Average_Rating
## 1       3.636620
## 2       3.921622
## 3       3.627852
## 4       4.108696
## 5       3.783333
## 6       4.105080

head(response_variable)

## [1] 3.636620 3.921622 3.627852 4.108696 3.783333 4.105080

head(explanatory_variables)

##   review_aroma review_appearance review_palate review_taste
## 1     3.498994          3.636821      3.556338     3.643863
## 2     3.798337          3.846154      3.904366     4.024948
## 3     3.409814          3.667109      3.600796     3.631300
## 4     4.148098          4.033967      4.150815     4.205163
## 5     3.625000          3.973958      3.734375     3.765625
## 6     4.007937          4.007937      4.087302     4.192063
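As a worked check of the Average_Rating column, the sketch below recomputes it for the first two beers from the values printed above. Note that rowMeans() was applied to all five review columns, so review_overall is part of the average. The data frame name first_two is an assumption made for this illustration.

```r
# Review scores of the first two beers, copied from head(review_data)
first_two <- data.frame(
  review_aroma      = c(3.498994, 3.798337),
  review_appearance = c(3.636821, 3.846154),
  review_palate     = c(3.556338, 3.904366),
  review_taste      = c(3.643863, 4.024948),
  review_overall    = c(3.847082, 4.034304)
)

# Same operation as in the analysis: mean over all five review columns per row
avg_all <- rowMeans(first_two)

# Alternative definition (not the one used above): mean of the four component scores only
avg_components <- rowMeans(first_two[, names(first_two) != "review_overall"])

print(round(avg_all, 6))
print(round(avg_components, 6))
```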

# Hypothesis: beers with a higher aroma rating tend to be more appreciated, i.e. they receive a higher average rating. Let's find out
# Scatterplot for Average Rating vs. Aroma
plot(explanatory_variables$review_aroma, response_variable, main="Average Rating vs. Aroma", 
     xlab="Aroma", ylab="Average Rating")
abline(lm(response_variable ~ explanatory_variables$review_aroma), col="blue")

cat("Based on the scatterplot, there appears to be a positive relationship between aroma and average rating. As aroma rating increases, average rating tends to increase as well.")

## Based on the scatterplot, there appears to be a positive relationship between aroma and average rating. As aroma rating increases, average rating tends to increase as well.

cat("The blue regression line suggests a strong positive linear trend.")

## The blue regression line suggests a strong positive linear trend.
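Because Average_Rating is computed from review_aroma among other columns, part of the positive trend is built in. A hedged way to check the relationship is to correlate aroma with the average of the remaining ratings. The sketch below does this on the six rows printed earlier; the object names are assumptions for this illustration, and six rows are far too few to draw conclusions from.

```r
# First six rows of the review columns, copied from head(review_data)
reviews <- data.frame(
  review_aroma      = c(3.498994, 3.798337, 3.409814, 4.148098, 3.625000, 4.007937),
  review_appearance = c(3.636821, 3.846154, 3.667109, 4.033967, 3.973958, 4.007937),
  review_palate     = c(3.556338, 3.904366, 3.600796, 4.150815, 3.734375, 4.087302),
  review_taste      = c(3.643863, 4.024948, 3.631300, 4.205163, 3.765625, 4.192063),
  review_overall    = c(3.847082, 4.034304, 3.830239, 4.005435, 3.817708, 4.230159)
)

# Average over all five columns (as in the analysis) and over the four non-aroma columns
avg_all <- rowMeans(reviews)
avg_without_aroma <- rowMeans(reviews[, names(reviews) != "review_aroma"])

# Correlation of aroma with each version of the average
r_all <- cor(reviews$review_aroma, avg_all)
r_without <- cor(reviews$review_aroma, avg_without_aroma)

# Simple linear regression of the aroma-free average on aroma
fit <- lm(avg_without_aroma ~ reviews$review_aroma)

print(c(with_aroma = r_all, without_aroma = r_without))
print(coef(fit))
```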

# Create boxplots for review variables
boxplot(review_data[, c("review_aroma", "review_appearance", "review_palate", "review_taste")], 
        main = "Boxplots of Review Variables",
        ylab = "Review Scores",
        col = c("lightblue", "lightgreen", "lightpink", "lightyellow"),
        border = c("blue", "green", "red", "yellow"),
        names = c("Aroma", "Appearance", "Palate", "Taste"))

cat("The boxplots of the review variables show many outliers in the review scores")

## The boxplots of the review variables show many outliers in the review scores
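The points that boxplot() draws as outliers are the values lying more than 1.5 times the box height beyond the box edges (the hinges). The sketch below uses boxplot.stats() to list them for a small made-up vector; the scores are hypothetical and only illustrate the rule.

```r
# Hypothetical review scores: six typical values and one very low score
scores <- c(3.5, 3.8, 3.4, 4.1, 3.6, 4.0, 1.2)

# boxplot.stats() returns the five-number summary ($stats) and the outliers ($out)
bs <- boxplot.stats(scores)

# Fences used to flag outliers: hinges -/+ 1.5 * (upper hinge - lower hinge)
hinge_spread <- bs$stats[4] - bs$stats[2]
fences <- c(lower = bs$stats[2] - 1.5 * hinge_spread,
            upper = bs$stats[4] + 1.5 * hinge_spread)

outliers <- bs$out
print(fences)
print(outliers)
```

Applied to a column such as review_data$review_aroma, length(boxplot.stats(x)$out) would give the number of flagged points.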

correlation_matrix <- cor(review_data)
correlation_matrix

##                   review_aroma review_appearance review_palate review_taste
## review_aroma         1.0000000         0.8556294     0.9074331    0.9366702
## review_appearance    0.8556294         1.0000000     0.8699784    0.8506928
## review_palate        0.9074331         0.8699784     1.0000000    0.9465200
## review_taste         0.9366702         0.8506928     0.9465200    1.0000000
## review_overall       0.8705039         0.8134436     0.9181542    0.9371016
## Average_Rating       0.9594218         0.9139756     0.9716572    0.9800607
##                   review_overall Average_Rating
## review_aroma           0.8705039      0.9594218
## review_appearance      0.8134436      0.9139756
## review_palate          0.9181542      0.9716572
## review_taste           0.9371016      0.9800607
## review_overall         1.0000000      0.9505558
## Average_Rating         0.9505558      1.0000000

heatmap(correlation_matrix, 
        col = colorRampPalette(c("blue", "white", "red"))(50),
        main = "Correlation Matrix Heatmap",
        xlab = "Variables",
        ylab = "Variables")

cat("The correlation matrix and heatmap show that the review variables are strongly positively correlated with each other (all pairwise correlations are above 0.81).")

## The correlation matrix and heatmap show that the review variables are strongly positively correlated with each other (all pairwise correlations are above 0.81).
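One caveat when reading the plot: base R's heatmap() reorders rows and columns by clustering and, unless symm = TRUE, centers and scales each row by default, so the colors do not map directly to the correlation values. The sketch below, which assumes the four review correlations printed above, draws the matrix without reordering or scaling. Passing zlim through to image() to fix the color range to [-1, 1] is a design choice made here, not something the original analysis did.

```r
# Correlations among the four review variables, copied from the matrix above
vars <- c("review_aroma", "review_appearance", "review_palate", "review_taste")
cm <- matrix(c(1.0000000, 0.8556294, 0.9074331, 0.9366702,
               0.8556294, 1.0000000, 0.8699784, 0.8506928,
               0.9074331, 0.8699784, 1.0000000, 0.9465200,
               0.9366702, 0.8506928, 0.9465200, 1.0000000),
             nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Rowv/Colv = NA: keep the original variable order (no dendrogram)
# symm = TRUE, scale = "none": treat the matrix as symmetric and plot raw values
# zlim = c(-1, 1): color scale spans the full range of a correlation
heatmap(cm, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        col = colorRampPalette(c("blue", "white", "red"))(50),
        zlim = c(-1, 1),
        main = "Correlation Matrix Heatmap (unscaled)")
```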

library(boot)

boot_ci <- function (v, func = median, conf = 0.95, n_iter = 1000) {
  # Define the bootstrapped function
  boot_func <- function(x, i) func(x[i])
  
  # Perform the bootstrap resampling
  b <- boot(data = v, statistic = boot_func, R = n_iter)
  
  # Compute the confidence intervals
  boot.ci(b, conf = conf, type = "perc")
}

boot_ci(review_data$Average_Rating, mean, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 3.687,  3.716 )  
## Calculations and Intervals on Original Scale

cat("The mean of Average rating is: ",mean(review_data$Average_Rating), "which lies within the confidence interval above")

## The mean of Average rating is:  3.700726 which lies within the confidence interval above

cat("The 95% percentile bootstrap interval gives a plausible range for the population mean of Average rating; it is an estimate, not a proof.")

## The 95% percentile bootstrap interval gives a plausible range for the population mean of Average rating; it is an estimate, not a proof.
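To make the bootstrap step less of a black box, here is a minimal sketch of a percentile bootstrap written out by hand: resample the data with replacement, recompute the statistic each time, and take the 2.5% and 97.5% quantiles of the results. The ratings vector is simulated (an assumption, not the beer data), and boot.ci() uses a slightly different quantile interpolation, so its numbers would agree only approximately.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for review_data$Average_Rating (hypothetical values)
ratings <- rnorm(200, mean = 3.7, sd = 0.4)

# Resample with replacement and recompute the mean, n_iter times
n_iter <- 2000
boot_means <- replicate(n_iter, mean(sample(ratings, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrapped means
ci <- quantile(boot_means, probs = c(0.025, 0.975))

print(mean(ratings))
print(ci)
```

The sample mean sits near the middle of such an interval by construction, so finding it inside the interval says nothing new. The useful information is the width of the interval.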

Set 2: Beer Characteristics

# Calculate Bitterness as the average of Min IBU and Max IBU. The existing "Bitter" column is a taste score for the beer;
# the new "Bitterness" column instead expresses bitterness through the minimum and maximum IBU values.
beers$Bitterness <- (beers$Min.IBU + beers$Max.IBU) / 2

sweetness <- beers$Sweet

# Making a column Balance as the difference between Sweetness and Bitterness
beers$Balance <- sweetness - beers$Bitterness

Beer_char <- beers[, c("Sweet", "Sour", "Fruits", "Bitterness")]
# Response variable
response_variable <- beers$Bitterness

# Explanatory variables
explanatory_variables <- beers[, c("Sweet", "Sour", "Fruits")]

# View the modified dataset
head(Beer_char)

##   Sweet Sour Fruits Bitterness
## 1    74   33     33       37.5
## 2    55   16     24       37.5
## 3    43   11     10       37.5
## 4   101   18     49       37.5
## 5    45    9     11       37.5
## 6    62   25     34       37.5

head(response_variable)

## [1] 37.5 37.5 37.5 37.5 37.5 37.5

head(explanatory_variables)

##   Sweet Sour Fruits
## 1    74   33     33
## 2    55   16     24
## 3    43   11     10
## 4   101   18     49
## 5    45    9     11
## 6    62   25     34
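The Bitterness value of 37.5 in every row shown follows directly from the formula: all six beers are Altbiers with Min IBU 25 and Max IBU 50, so (25 + 50) / 2 = 37.5. The output suggests that the IBU range is tied to the style rather than to the individual beer, although only the full data can confirm this. The sketch below reproduces the arithmetic on a hypothetical three-row data frame named styles.

```r
# Three Altbier rows with the IBU bounds shown in head(beers)
styles <- data.frame(
  Name = c("Amber", "Double Bag", "Sticke"),
  Style = "Altbier",
  Min.IBU = 25,
  Max.IBU = 50
)

# Same formula as in the analysis: midpoint of the IBU range
styles$Bitterness <- (styles$Min.IBU + styles$Max.IBU) / 2

# Number of distinct Bitterness values within the style
n_distinct <- length(unique(styles$Bitterness))

print(styles)
print(n_distinct)
```

If the same holds across the full dataset, Bitterness takes only as many distinct values as there are distinct IBU ranges, which is worth keeping in mind when reading the scatterplot and the bootstrap interval for the median.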

# Hypothesis: a beer with higher Bitterness also tends to be more Sour. Let's find out (the scatterplot below uses Sweet; Sour appears in the correlation matrix further down)
# Scatterplot for Bitterness vs. Sweet
plot(explanatory_variables$Sweet, response_variable, main="Bitterness vs. Sweet", 
     xlab="Sweet", ylab="Bitterness")
abline(lm(response_variable ~ beers$Sweet), col="blue")

cat("Based on the scatterplot, there is at most a weak relationship between sweetness and bitterness. Bitterness does not clearly increase as sweetness increases.")

## Based on the scatterplot, there is at most a weak relationship between sweetness and bitterness. Bitterness does not clearly increase as sweetness increases.

# Create a boxplot for multiple variables (e.g., Bitterness, Sweet, Sour, Fruits)
boxplot(Beer_char[, c("Bitterness", "Sweet", "Sour", "Fruits")], 
        main = "Boxplot of Bitterness, Sweet, Sour, and Fruits",
        ylab = "Values",
        col = c("lightblue", "lightgreen", "lightpink", "lightyellow"),
        border = c("blue", "green", "red", "yellow"),
        names = c("Bitterness", "Sweet", "Sour", "Fruits"))

cat("The boxplots of the beer characteristics show more outliers in Sweet, Sour and Fruits than in Bitterness.")

## The boxplots of the beer characteristics show more outliers in Sweet, Sour and Fruits than in Bitterness.

correlation_matrix <- cor(Beer_char)
correlation_matrix

##                Sweet        Sour    Fruits  Bitterness
## Sweet      1.0000000  0.25791256 0.4820299  0.26747034
## Sour       0.2579126  1.00000000 0.7858825 -0.05667624
## Fruits     0.4820299  0.78588254 1.0000000  0.13692900
## Bitterness 0.2674703 -0.05667624 0.1369290  1.00000000

heatmap(correlation_matrix, 
        col = colorRampPalette(c("blue", "white", "red"))(50),
        main = "Correlation Matrix Heatmap",
        xlab = "Variables",
        ylab = "Variables")

cat("The correlation matrix and heatmap show that the beer characteristic variables are correlated to varying degrees. Sour and Fruits have the highest correlation (0.786): beers with higher Fruits scores tend to have higher Sour scores as well.")

## The correlation matrix and heatmap show that the beer characteristic variables are correlated to varying degrees. Sour and Fruits have the highest correlation (0.786): beers with higher Fruits scores tend to have higher Sour scores as well.
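A correlation coefficient describes how two scores move together; by itself it does not show that raising one would raise the other. To attach some uncertainty to a coefficient, cor.test() from the stats package reports a confidence interval and a p-value. The sketch below runs it on the six Sour and Fruits values printed earlier, which is an assumption made purely for illustration; with only six rows the interval will be very wide.

```r
# Sour and Fruits scores of the first six beers, copied from head(Beer_char)
sour   <- c(33, 16, 11, 18, 9, 25)
fruits <- c(33, 24, 10, 49, 11, 34)

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(sour, fruits)

print(ct$estimate)   # sample correlation
print(ct$conf.int)   # 95% confidence interval for the correlation
print(ct$p.value)
```

On the real data the call would be cor.test(Beer_char$Sour, Beer_char$Fruits).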

boot_ci(Beer_char$Bitterness, median, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (25.0, 27.5 )  
## Calculations and Intervals on Original Scale

cat("The median value of Bitterness variable is: ", median(Beer_char$Bitterness)," which equals the lower end of the interval above")

## The median value of Bitterness variable is:  25  which equals the lower end of the interval above

cat("The 95% percentile bootstrap interval gives a plausible range for the population median of Bitterness; it is an estimate, not a proof.")

## The 95% percentile bootstrap interval gives a plausible range for the population median of Bitterness; it is an estimate, not a proof.
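The interval for the median looks coarse (25.0 to 27.5, with the sample median on its edge) because Bitterness takes only a few distinct values. When a variable has many ties, the bootstrapped medians can only land on those same values, so the percentile interval jumps between them. The sketch below illustrates this with a made-up vector; the values and counts are hypothetical and not taken from the dataset.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical IBU midpoints with heavy ties (101 values, odd length)
ibu_mid <- rep(c(15, 25, 27.5, 37.5, 50), times = c(15, 36, 20, 20, 10))

# Bootstrap the median: resample with replacement, recompute, repeat
boot_medians <- replicate(1000, median(sample(ibu_mid, replace = TRUE)))

# With an odd sample size every bootstrapped median is one of the original values
print(table(boot_medians))

# Percentile interval of the bootstrapped medians
ci <- quantile(boot_medians, probs = c(0.025, 0.975))
print(median(ibu_mid))
print(ci)
```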

Set 3: Flavour Profile

# Create a subset of columns related to flavor profile
flavor_data <- beers[, c("Astringency", "Body", "Alcohol","Bitter", "Sweet", "Sour", "Salty", "Fruits", "Hoppy", "Spices", "Malty")]

# Calculate Flavor Intensity as the sum of flavor profile attributes
flavor_data$Flavor_Intensity <- rowSums(flavor_data)

# Response variable
response_variable <- flavor_data$Flavor_Intensity

# Explanatory variables (individual flavor profile attributes)
explanatory_variables <- flavor_data[, c("Astringency", "Body", "Alcohol", "Bitter", "Sweet", "Sour", "Salty", "Fruits", "Hoppy", "Spices", "Malty")]

# View the response and explanatory variables
head(flavor_data)

##   Astringency Body Alcohol Bitter Sweet Sour Salty Fruits Hoppy Spices Malty
## 1          13   32       9     47    74   33     0     33    57      8   111
## 2          12   57      18     33    55   16     0     24    35     12    84
## 3          14   37       6     42    43   11     0     10    54      4    62
## 4          13   55      31     47   101   18     1     49    40     16   119
## 5          25   51      26     44    45    9     1     11    51     20    95
## 6          22   45      13     46    62   25     1     34    60      4   103
##   Flavor_Intensity
## 1              417
## 2              346
## 3              283
## 4              490
## 5              378
## 6              415

head(response_variable)

## [1] 417 346 283 490 378 415

head(explanatory_variables)

##   Astringency Body Alcohol Bitter Sweet Sour Salty Fruits Hoppy Spices Malty
## 1          13   32       9     47    74   33     0     33    57      8   111
## 2          12   57      18     33    55   16     0     24    35     12    84
## 3          14   37       6     42    43   11     0     10    54      4    62
## 4          13   55      31     47   101   18     1     49    40     16   119
## 5          25   51      26     44    45    9     1     11    51     20    95
## 6          22   45      13     46    62   25     1     34    60      4   103
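As a worked check of Flavor_Intensity, the sketch below sums the eleven attributes for the first two beers and also shows what share of the total comes from Malty. Because the sum is unweighted, attributes with large scores contribute more to it. The names flavor_rows and malty_share are assumptions for this illustration.

```r
# Flavor attributes of the first two beers, copied from head(flavor_data)
flavor_rows <- data.frame(
  Astringency = c(13, 12), Body = c(32, 57), Alcohol = c(9, 18),
  Bitter = c(47, 33), Sweet = c(74, 55), Sour = c(33, 16),
  Salty = c(0, 0), Fruits = c(33, 24), Hoppy = c(57, 35),
  Spices = c(8, 12), Malty = c(111, 84)
)

# Same operation as in the analysis: unweighted row sum of all attributes
intensity <- rowSums(flavor_rows)

# Share of the total that comes from the Malty score alone
malty_share <- flavor_rows$Malty / intensity

print(intensity)
print(round(malty_share, 3))
```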

# Scatterplot for Flavor Intensity vs. Astringency
plot(explanatory_variables$Astringency, response_variable, main="Flavor Intensity vs. Astringency", 
     xlab="Astringency", ylab="Flavor Intensity")
abline(lm(response_variable ~ explanatory_variables$Astringency), col="blue")

# Conclusion based on the plot
cat("Based on the scatterplot, there appears to be only a weak positive relationship between astringency and flavor intensity.")

## Based on the scatterplot, there appears to be only a weak positive relationship between astringency and flavor intensity.

# Create boxplots for flavor profile variables
boxplot(flavor_data[, c("Astringency", "Body", "Alcohol", "Bitter", "Sweet", "Sour", "Salty", "Fruits", "Hoppy", "Spices", "Malty")], 
        main = "Boxplots of Flavor Profile Variables",
        ylab = "Values",
        col = c("lightblue", "lightgreen", "lightpink", "black", "black","lightcyan", "lightgray", "lightyellow", "black", "black", "black"),
        border = c("blue", "green", "red", "purple", "orange", "cyan", "gray", "yellow", "brown", "red", "violet"),
        names = c("Astringency", "Body", "Alcohol", "Bitter", "Sweet", "Sour", "Salty", "Fruits", "Hoppy", "Spices", "Malty"))

cat("The boxplots of the flavour variables show many outliers in the flavour scores")

## The boxplots of the flavour variables show many outliers in the flavour scores

correlation_matrix <- cor(flavor_data)
correlation_matrix

##                  Astringency        Body      Alcohol       Bitter       Sweet
## Astringency       1.00000000 -0.05953971 -0.171986878  0.114685977 -0.02145640
## Body             -0.05953971  1.00000000  0.268885007  0.542236421  0.45884180
## Alcohol          -0.17198688  0.26888501  1.000000000  0.009087782  0.52703889
## Bitter            0.11468598  0.54223642  0.009087782  1.000000000  0.09170547
## Sweet            -0.02145640  0.45884180  0.527038889  0.091705467  1.00000000
## Sour              0.57102991 -0.12673331  0.048767388 -0.136913688  0.25791256
## Salty             0.34715504 -0.09927735 -0.094329293  0.004692825 -0.13191783
## Fruits            0.34523213 -0.04815457  0.254299063 -0.093449864  0.48202994
## Hoppy             0.33095085  0.07013823 -0.079949288  0.712886753 -0.03432745
## Spices           -0.08379502  0.18512299  0.252875793 -0.084048103  0.10754762
## Malty            -0.08208537  0.75422818  0.270105608  0.565570029  0.47103197
## Flavor_Intensity  0.31689528  0.63087079  0.454995695  0.551453125  0.71674819
##                          Sour        Salty      Fruits       Hoppy       Spices
## Astringency       0.571029913  0.347155038  0.34523213  0.33095085 -0.083795019
## Body             -0.126733314 -0.099277352 -0.04815457  0.07013823  0.185122992
## Alcohol           0.048767388 -0.094329293  0.25429906 -0.07994929  0.252875793
## Bitter           -0.136913688  0.004692825 -0.09344986  0.71288675 -0.084048103
## Sweet             0.257912561 -0.131917834  0.48202994 -0.03432745  0.107547623
## Sour              1.000000000  0.098172842  0.78588254  0.06889461  0.001831036
## Salty             0.098172842  1.000000000  0.02691958  0.17260618 -0.023078823
## Fruits            0.785882542  0.026919585  1.00000000  0.11040734  0.148281264
## Hoppy             0.068894607  0.172606178  0.11040734  1.00000000 -0.131963707
## Spices            0.001831036 -0.023078823  0.14828126 -0.13196371  1.000000000
## Malty            -0.303266373 -0.028241289 -0.19688969  0.19576698  0.061398556
## Flavor_Intensity  0.421956100  0.035660418  0.56310324  0.43843528  0.258095332
##                        Malty Flavor_Intensity
## Astringency      -0.08208537       0.31689528
## Body              0.75422818       0.63087079
## Alcohol           0.27010561       0.45499569
## Bitter            0.56557003       0.55145312
## Sweet             0.47103197       0.71674819
## Sour             -0.30326637       0.42195610
## Salty            -0.02824129       0.03566042
## Fruits           -0.19688969       0.56310324
## Hoppy             0.19576698       0.43843528
## Spices            0.06139856       0.25809533
## Malty             1.00000000       0.58987363
## Flavor_Intensity  0.58987363       1.00000000

heatmap(correlation_matrix, 
        col = colorRampPalette(c("blue", "white", "red"))(50),
        main = "Correlation Matrix Heatmap",
        xlab = "Variables",
        ylab = "Variables")

cat("The correlation matrix and heatmap show that the flavour variables are correlated to varying degrees. Fruits and Sour have the highest correlation (0.786): beers with higher Fruits scores tend to have higher Sour scores as well.")

## The correlation matrix and heatmap show that the flavour variables are correlated to varying degrees. Fruits and Sour have the highest correlation (0.786): beers with higher Fruits scores tend to have higher Sour scores as well.
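With a large matrix, the strongest pair is easier to find programmatically than by eye. The sketch below masks the diagonal and the lower triangle and then looks up the largest absolute correlation. To stay self-contained it uses the smaller four-variable matrix printed in Set 2 rather than the full flavor matrix; the object names are assumptions for this illustration.

```r
# Correlations copied from the Set 2 matrix (Sweet, Sour, Fruits, Bitterness)
vars <- c("Sweet", "Sour", "Fruits", "Bitterness")
cm <- matrix(c( 1.0000000,  0.2579126, 0.4820299,  0.2674703,
                0.2579126,  1.0000000, 0.7858825, -0.0566762,
                0.4820299,  0.7858825, 1.0000000,  0.1369290,
                0.2674703, -0.0566762, 0.1369290,  1.0000000),
             nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep only the upper triangle so each pair is counted once and the diagonal is ignored
off <- cm
off[!upper.tri(off)] <- NA

# Position of the largest absolute correlation among the remaining entries
idx <- which(abs(off) == max(abs(off), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
strongest_r <- cm[idx[1, "row"], idx[1, "col"]]

print(strongest)
print(strongest_r)
```

The same steps would work on cor(flavor_data), ideally after dropping Flavor_Intensity, which is a sum of the other columns.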

boot_ci(flavor_data$Flavor_Intensity, mean, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (376.5, 386.5 )  
## Calculations and Intervals on Original Scale

cat("The mean of Flavor intensity is: ", mean(flavor_data$Flavor_Intensity), " which lies within the confidence interval above.\n")

## The mean of Flavor intensity is:  381.63  which lies within the confidence interval above.

cat("The 95% percentile bootstrap interval gives a plausible range for the population mean of Flavor intensity; it is an estimate, not a proof.")