Beer Recipe Data analysis by Paola Correa

Introduction:

This data was obtained from the Kaggle website. The original CSV file contains roughly 75,000 homebrew beer recipes (73,861 rows in the copy analyzed here). Beer records are user-reported, and each is classified into one of 176 different styles. These recipes go into as much or as little detail as the user provided. The data contains 23 variables, most of which are quantitative, with 2 categorical ones. It can be downloaded at: https://www.kaggle.com/jtrofe/beer-recipes#recipeData.csv

Univariate Plots Section

Data structure and summary:

In the following structure we can see that the data has several factor, num and int variables. Several N/A values are also embedded in some of the variables. I will clean these so that we can compute more statistics.

## 'data.frame':    73861 obs. of  23 variables:
##  $ BeerID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name         : Factor w/ 59149 levels "'  #1 All American Liberty Lager",..: 55922 50153 59068 59065 6597 48143 45725 50494 14409 34475 ...
##  $ URL          : Factor w/ 73861 levels "/homebrew/recipe/view/10003/yuen-porter-clone",..: 2448 2469 68305 68249 73416 20359 35605 72561 21627 70920 ...
##  $ Style        : Factor w/ 176 levels "Altbier","Alternative Grain Beer",..: 45 85 7 7 20 10 86 45 129 86 ...
##  $ StyleID      : int  45 85 7 7 20 10 86 45 129 86 ...
##  $ Size.L.      : num  21.8 20.8 18.9 22.7 50 ...
##  $ OG           : num  1.05 1.08 1.06 1.06 1.06 ...
##  $ FG           : num  1.01 1.02 1.02 1.02 1.01 ...
##  $ ABV          : num  5.48 8.16 5.91 5.8 6.48 5.58 7.09 5.36 5.77 8.22 ...
##  $ IBU          : num  17.6 60.6 59.2 54.5 17.8 ...
##  $ Color        : num  4.83 15.64 8.98 8.5 4.57 ...
##  $ BoilSize     : num  28.4 24.6 22.7 26.5 60 ...
##  $ BoilTime     : int  75 60 60 60 90 70 90 75 75 60 ...
##  $ BoilGravity  : Factor w/ 510 levels "0","1","1.001",..: 40 72 510 510 52 49 510 42 44 60 ...
##  $ Efficiency   : num  70 70 70 70 72 79 75 70 73 70 ...
##  $ MashThickness: Factor w/ 568 levels "0","0.13","0.22",..: 568 568 568 568 568 568 568 99 568 568 ...
##  $ SugarScale   : Factor w/ 2 levels "Plato","Specific Gravity": 2 2 2 2 2 2 2 2 2 2 ...
##  $ BrewMethod   : Factor w/ 4 levels "All Grain","BIAB",..: 1 1 3 1 1 1 1 1 1 1 ...
##  $ PitchRate    : Factor w/ 10 levels "0","0.35","0.5",..: 10 10 10 10 10 5 10 10 10 10 ...
##  $ PrimaryTemp  : Factor w/ 218 levels "-0.56","-1.11",..: 75 218 218 218 92 218 218 218 218 111 ...
##  $ PrimingMethod: Factor w/ 875 levels " ??????"," 40g/5 US G / 19L. -Brew sugar",..: 266 651 651 651 804 651 651 266 266 269 ...
##  $ PrimingAmount: Factor w/ 1898 levels "---"," @ 15 psi",..: 1203 1866 1866 1866 1442 1866 1866 1178 1149 1220 ...
##  $ UserId       : int  116 955 NA NA 18325 5889 1051 116 116 NA ...

##      BeerID                  Name      
##  Min.   :    1   Awesome Recipe: 1311  
##  1st Qu.:18466   IPA           :  197  
##  Median :36931   Saison        :  181  
##  Mean   :36931   Kölsch        :  173  
##  3rd Qu.:55396   Black IPA     :  129  
##  Max.   :73861   Stout         :  129  
##                  (Other)       :71741  
##                                                 URL       
##  /homebrew/recipe/view/10003/yuen-porter-clone    :    1  
##  /homebrew/recipe/view/100034/double-dog-citra-ipa:    1  
##  /homebrew/recipe/view/100068/hop-rocket          :    1  
##  /homebrew/recipe/view/100096/evie-pale-ale-ag    :    1  
##  /homebrew/recipe/view/100147/battle-of-flodden   :    1  
##  /homebrew/recipe/view/10022/california-apa       :    1  
##  (Other)                                          :73855  
##                   Style          StyleID          Size.L.       
##  American IPA        :11940   Min.   :  1.00   Min.   :   1.00  
##  American Pale Ale   : 7581   1st Qu.: 10.00   1st Qu.:  18.93  
##  Saison              : 2617   Median : 35.00   Median :  20.82  
##  American Light Lager: 2277   Mean   : 60.18   Mean   :  43.93  
##  American Amber Ale  : 2038   3rd Qu.:111.00   3rd Qu.:  23.66  
##  Blonde Ale          : 1753   Max.   :176.00   Max.   :9200.00  
##  (Other)             :45655                                     
##        OG               FG              ABV              IBU         
##  Min.   : 1.000   Min.   :-0.003   Min.   : 0.000   Min.   :   0.00  
##  1st Qu.: 1.051   1st Qu.: 1.011   1st Qu.: 5.080   1st Qu.:  23.37  
##  Median : 1.058   Median : 1.013   Median : 5.790   Median :  35.77  
##  Mean   : 1.406   Mean   : 1.076   Mean   : 6.137   Mean   :  44.28  
##  3rd Qu.: 1.069   3rd Qu.: 1.017   3rd Qu.: 6.830   3rd Qu.:  56.38  
##  Max.   :34.035   Max.   :23.425   Max.   :54.720   Max.   :3409.30  
##                                                                      
##      Color           BoilSize          BoilTime       BoilGravity   
##  Min.   :  0.00   Min.   :   1.00   Min.   :  0.00   N/A    : 2990  
##  1st Qu.:  5.17   1st Qu.:  20.82   1st Qu.: 60.00   1.044  : 2502  
##  Median :  8.44   Median :  27.44   Median : 60.00   1.042  : 2470  
##  Mean   : 13.40   Mean   :  49.73   Mean   : 65.07   1.043  : 2438  
##  3rd Qu.: 16.79   3rd Qu.:  30.00   3rd Qu.: 60.00   1.041  : 2391  
##  Max.   :186.00   Max.   :9700.00   Max.   :240.00   1.04   : 2304  
##                                                      (Other):58766  
##    Efficiency     MashThickness              SugarScale   
##  Min.   :  0.00   N/A    :29864   Plato           : 1902  
##  1st Qu.: 65.00   1.5    :15499   Specific Gravity:71959  
##  Median : 70.00   3      : 8312                           
##  Mean   : 66.35   1.25   : 4923                           
##  3rd Qu.: 75.00   2.5    : 1864                           
##  Max.   :100.00   1.3    : 1110                           
##                   (Other):12289                           
##         BrewMethod      PitchRate      PrimaryTemp       PrimingMethod  
##  All Grain   :49692   N/A    :39252   N/A    :22662   N/A       :67094  
##  BIAB        :12016   0.35   : 9477   20     :14185   Corn Sugar:  715  
##  extract     : 8626   0.75   : 9002   21.11  : 4622   Dextrose  :  503  
##  Partial Mash: 3527   0.5    : 5469   18.33  : 4182   corn sugar:  360  
##                       1      : 5194   18     : 4129   Keg       :  330  
##                       1.25   : 2405   19     : 2674   (Other)   : 4858  
##                       (Other): 3062   (Other):21407   NA's      :    1  
##  PrimingAmount       UserId      
##  N/A    :69084   Min.   :    49  
##  5 oz   :  205   1st Qu.: 20984  
##  3/4 cup:  110   Median : 42897  
##  4 oz   :  106   Mean   : 43078  
##  1 cup  :  102   3rd Qu.: 57841  
##  (Other): 4253   Max.   :134362  
##  NA's   :    1   NA's   :50490
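The cleaning code itself is not shown in this report. As a minimal sketch of one way it could be done in base R: several numeric-looking columns in the summary above (BoilGravity, PitchRate and others) are stored as factors because they contain the string "N/A". The toy data frame and the helper name `to_numeric` below are assumptions made for illustration only.

```r
# Toy stand-in for a few columns of recipeData.csv (values are illustrative).
raw <- data.frame(
  BoilGravity = c("1.038", "1.07", "N/A", "N/A"),
  PitchRate   = c("N/A", "0.35", "0.75", "N/A"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: convert a factor/character column to numeric,
# mapping the string "N/A" to a real missing value.
# as.character() comes first; as.numeric() on a factor returns level codes.
to_numeric <- function(x) {
  x <- as.character(x)
  x[x == "N/A"] <- NA
  as.numeric(x)
}

clean <- as.data.frame(lapply(raw, to_numeric))
str(clean)
colSums(is.na(clean))  # number of missing values per column
```

An alternative would be to handle this at import time, for example with `read.csv(..., na.strings = c("N/A", ""))`, so that these columns are read as numeric from the start.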

We will start by analysing the Color variable for all the beer recipes. First we look at the summary of the Color variable to get an idea of the data limits, and then plot the raw counts. As we can see from the raw data, there are very few counts above 80.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.17    8.44   13.40   16.79  186.00

Based on the graph above, I plot the log10 transformation to help visualize the data better.

This graph shows us that the distribution is slightly bimodal. Higher counts are seen at 0.8 and 1.6 on the log10 scale (colors of roughly 6 and 40), suggesting that beers are mostly made in the lighter shades, with quite a few in the darker ones.
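To illustrate the effect of the log10 transformation, here is a small base R sketch. It uses synthetic data made up to have two modes, not the real Color values, and it bins the values without drawing a plot.

```r
set.seed(1)
# Synthetic right-skewed, two-mode stand-in for the Color variable (made up).
color <- c(rlnorm(800, meanlog = log(6), sdlog = 0.4),
           rlnorm(200, meanlog = log(40), sdlog = 0.3))

# Bin the raw values and the log10-transformed values without drawing.
h_raw <- hist(color, breaks = 30, plot = FALSE)
h_log <- hist(log10(color), breaks = 30, plot = FALSE)

# Centre of the tallest bin on the log10 scale, back-transformed to color units.
peak_log <- h_log$mids[which.max(h_log$counts)]
10^peak_log
```

Note that log10(0) is -Inf, so zero values in the real data would need to be dropped or offset before transforming, as is done with log10(IBU + 1) for the IBU variable.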

Next we will plot histograms for several other variables: ABV, IBU, BoilTime, OG (specific gravity of the wort before fermentation) and FG (specific gravity after fermentation).

Here we see that the ABV distribution is approximately normal, with the highest counts around 6-8% ABV.

Here we see that the IBU distribution is also approximately normal after the log10(IBU + 1) transformation, with the highest counts at about 1.5 on that scale (roughly 30 IBU).

Here we see that BoilTime has the highest counts at 1 hr (60 min) and the next highest at 1.5 hrs (90 min).

Here we see that the OG distribution is centered around its median (colored in red). The summary above gives this median as 1.058.

Here we see that the FG distribution is centered around its median (colored in red). The summary above gives this median as 1.013.

I am interested in looking in more detail at the alcohol by volume (ABV) across beer samples. To do this I constructed a new categorical variable called ABV_ranges. The first table shows the count for each range. I then plotted these ABV categories as percentages in a pie chart.

## # A tibble: 6 x 2
##   ABV_ranges     n
##   <fct>      <int>
## 1 <NA>         168
## 2 10 to 50    1218
## 3 0 to 3      2779
## 4 8 to 10     3252
## 5 6 to 8     12028
## 6 3 to 4     13276

As we can see, the most common ABV ranges are the Medium and Medium High ranges.
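A categorical variable like ABV_ranges can be built with `cut()`. The sketch below is only an illustration: the ABV values are invented, and the break points and labels (including a "4 to 6" range) are my assumptions, since the exact definition used for ABV_ranges is not shown here.

```r
# Illustrative ABV values (percent); not taken from the data set.
abv <- c(2.5, 3.8, 5.2, 5.9, 6.4, 7.1, 8.5, 9.9, 12.0, 60.0)

# Hypothetical break points; intervals are closed on the right, e.g. (3, 4].
breaks <- c(0, 3, 4, 6, 8, 10, 50)
labels <- c("0 to 3", "3 to 4", "4 to 6", "6 to 8", "8 to 10", "10 to 50")

abv_ranges <- cut(abv, breaks = breaks, labels = labels, include.lowest = TRUE)

# Counts and percentages per range; values outside 0-50 become NA.
counts <- table(abv_ranges, useNA = "ifany")
round(100 * prop.table(counts), 1)
```

Values that fall outside every interval are returned as NA, which is one possible source of the NA row in a count table like the one above.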

Next we plot the top 10 beer styles.

Here we see that American IPA and American Brown Ale are the most and least common of the top 10 beer styles, respectively.
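As a rough sketch of how such a ranking can be computed in base R, the counts below are copied from the six styles listed in the summary output earlier in this report. With the full data, the same named vector would come from `table(beer$Style)`, where `beer` is the data frame name that appears in the test outputs.

```r
# Style counts copied from the summary output (six most frequent styles).
style_counts <- c("American IPA" = 11940, "American Pale Ale" = 7581,
                  "Saison" = 2617, "American Light Lager" = 2277,
                  "American Amber Ale" = 2038, "Blonde Ale" = 1753)

# Sort by frequency and keep at most the ten most common styles.
top <- head(sort(style_counts, decreasing = TRUE), 10)

# Share of all 73861 recipes covered by each style, in percent.
round(100 * top / 73861, 1)
```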

Univariate Analysis

What is the structure of your dataset?

My data set has 73,861 observations and 23 variables. Only 2 of the 23 variables are categorical; the rest are quantitative. The scales and distributions of most of these variables are quite different. For instance, the Color variable, which describes beer color from lightest to darkest, has a bimodal distribution, with one mode at the lighter colors and another at the darker ones. ABV, FG and OG all have approximately normal, unimodal distributions. The IBU distribution is right-skewed (a long tail toward high values), but once log-transformed (as seen above) this skewness goes away. The BoilTime variable has very few unique values, the most common boil time being 60 min. Plotting our 2 categorical variables, we can see that the most common Brew Method is All Grain and the most common Sugar Scale is Specific Gravity.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest for me is the ABV (alcohol by volume), as I would like to know whether homebrewers are making more alcoholic beers in general. Also, I am a homebrewer myself and would like to find out whether any variable of the brewing process could lead to higher alcohol content.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It would be nice to see whether any of the variables that describe the recipe are correlated with ABV content. This is done in the next sections. Another step that I thought would help visualize the data was converting the ABV variable into ranges, so that we could see which alcohol levels are more or less common.

Did you create any new variables from existing variables in the dataset?

As mentioned above, I created a couple of new variables. ABV_ranges was created to see the distribution of alcohol content in ranges that go from Very Low to Extremely High. I also found the top 10 most common styles of beer and broke them up by ABV_ranges and Brew Method.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

For the Color variable I applied the log10 function, as it seemed to be skewed and to have 2 modes. This transformation helped to see the symmetry of the data better. Similarly, IBU was transformed with log10(IBU + 1) to reduce the skewness of the data. For OG and FG I used coord_cartesian, so that the median drawn on the graph is the median of the whole data rather than of the zoomed-in range. OG and FG both have a few extreme outliers, which is why the mean of each lies far from most of the data points. This is why I plotted the median, which in this case gives a better idea of typical OG/FG measurements.
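The point about the mean and the median can be illustrated with a small base R sketch on synthetic gravities (the numbers are made up, chosen only so that a few values sit on a much larger scale than the rest). In ggplot2, `coord_cartesian()` zooms the view without discarding data; the base R stand-in below instead subsets the values for the histogram while computing the median from every value.

```r
set.seed(2)
# Synthetic gravities: most near 1.06, a few on a much larger scale (made up).
og <- c(rnorm(980, mean = 1.06, sd = 0.015), runif(20, min = 10, max = 30))

mean(og)    # pulled upward by the 20 extreme values
median(og)  # stays close to 1.06

# Restrict only the histogram to a window around the bulk of the data;
# the median above is still computed from every value.
in_window <- og[og >= 1.0 & og <= 1.15]
h <- hist(in_window, breaks = 30, plot = FALSE)
```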

Bivariate Plots Section

As a follow-up to the last section, I will plot the top 10 most common beer styles with the proportions of the ABV ranges and the Brew Method breakdowns.

Here we can see that, similar to the overall distribution of ABV_ranges seen in the pie chart above, the most common styles also follow the trend of having medium to medium-high ABV_ranges.

Here we see that the most common Brew Method in the top 10 styles is All Grain, and the second most common is BIAB. To follow this up, I will plot the Brew Method proportions for the whole data set. I will break up Brew Method by Sugar Scale to see whether any Brew Method is used mostly with a particular Sugar Scale.

This graph shows that, as with the top 10 beer styles, All Grain is the most common brewing method in the whole data set. Additionally, the most prevalent Sugar Scale within every Brew Method is Specific Gravity; this Sugar Scale is therefore preferred for all brewing methods.

Next, I will plot ABV against IBU to test whether this variable correlates with ABV. I will also compute the Pearson correlation coefficient to confirm whether the correlation is positive, negative or absent.
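The proportions behind a plot like this can be tabulated with `table()` and `prop.table()`. The sketch below uses a handful of toy recipes, since the real cross-tabulation is not reproduced in this report.

```r
# Toy recipes; the real cross-tabulation is not reproduced here.
toy <- data.frame(
  BrewMethod = c("All Grain", "All Grain", "All Grain", "BIAB", "BIAB", "extract"),
  SugarScale = c("Specific Gravity", "Specific Gravity", "Plato",
                 "Specific Gravity", "Specific Gravity", "Specific Gravity")
)

tab <- table(toy$BrewMethod, toy$SugarScale)

# Row-wise proportions: share of each sugar scale within a brew method.
within_method <- prop.table(tab, margin = 1)
round(within_method, 2)
```

Using `margin = 1` makes each row sum to 1, so the comparison between brew methods is not affected by how many recipes each method has.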

## 
##  Pearson's product-moment correlation
## 
## data:  beer$ABV and beer$IBU
## t = 82.628, df = 73859, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.2853385 1.0000000
## sample estimates:
##       cor 
## 0.2908885

Here we see that there is a slight positive correlation (r = 0.29). However, the plot may be affected by overplotting. Next we subset the data to see whether this helps to visualize a potential correlation. Since we know from previous graphs that the most common ABV values lie between 3 and 8, we subset the data to this range, plot it again and calculate the correlation coefficient (Spearman's rank correlation this time).
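For reference, a test with the same form as the output above ("true correlation is greater than 0") can be run with `cor.test()`. The sketch uses synthetic stand-ins for ABV and IBU; the coefficients in the data-generating line are arbitrary assumptions.

```r
set.seed(3)
# Synthetic, positively related stand-ins for ABV and IBU (not the real data).
abv <- rnorm(500, mean = 6, sd = 1.5)
ibu <- 20 + 5 * abv + rnorm(500, sd = 25)

# One-sided Pearson test of the alternative "correlation greater than 0".
res <- cor.test(abv, ibu, method = "pearson", alternative = "greater")
res$estimate   # sample correlation r
res$conf.int   # one-sided interval: the upper limit is fixed at 1
```

This is why the confidence interval in the output runs up to 1.0000000: with a one-sided alternative only the lower bound is informative.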

## 
##  Spearman's rank correlation rho
## 
## data:  ABV and IBU
## S = 2.7557e+13, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.3862732

The correlation within this subset of the data seems more positive (rho = 0.39), although Spearman's rho and Pearson's r are not directly comparable. Additionally, the relationship is easier to see once we look only at the 3-8 range. Next, we will see whether other variables are related to each other, such as OG and FG.
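A minimal sketch of the subsetting step followed by a Spearman test, again on synthetic data. The column names follow the outputs in this report; the data-generating numbers are assumptions.

```r
set.seed(4)
# Synthetic stand-in data frame (not the real recipes).
beer <- data.frame(ABV = runif(1000, min = 0, max = 15))
beer$IBU <- pmax(0, 10 + 6 * beer$ABV + rnorm(1000, sd = 20))

# Keep only the most common ABV window (3 to 8) before testing.
mid <- subset(beer, ABV >= 3 & ABV <= 8)

# Spearman's rho works on ranks, so it is less sensitive to extreme values.
# exact = FALSE skips the exact p-value computation, which warns when ties occur.
res <- cor.test(mid$ABV, mid$IBU, method = "spearman", exact = FALSE)
res$estimate
```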

## 
##  Pearson's product-moment correlation
## 
## data:  beer$OG and beer$FG
## t = 724.83, df = 73859, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.9355969 1.0000000
## sample estimates:
##       cor 
## 0.9363471

As we can see from the correlation coefficient (r = 0.94), OG and FG have a strong positive correlation. The graph of these variables shows this as well. Next, we will explore any relationships of ABV with categorical variables.

As we can see, there isn't any particular trend with either Brew Method or Sugar Scale.

Grouping variables and further plotting

After having plotted the raw variables, I realized that visualization might be better if we group them first and then plot. Here I am grouping the data by IBU, BoilTime and OG and plotting each against the mean ABV (ABV_mean) of those groups. After each plot I run a correlation test to confirm what I find in the graphs.

## 
##  Pearson's product-moment correlation
## 
## data:  beer_by_IBUNOzeros$ABV_mean and beer_by_IBUNOzeros$IBU
## t = 56.345, df = 12584, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.4370592 1.0000000
## sample estimates:
##       cor 
## 0.4488452

Here we see that the correlation between mean ABV and grouped IBU (r = 0.45) is stronger than in the ungrouped data (r = 0.29). We see this in both the coefficient and the plot.
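The grouping step can be sketched in base R with `aggregate()`. The data are synthetic, and the name `no_zeros` is a hypothetical stand-in for the zero-free subset suggested by the object name in the output above.

```r
set.seed(5)
# Synthetic recipes; IBU rounded so that several recipes share each value.
beer <- data.frame(IBU = round(runif(2000, min = 0, max = 100)))
beer$ABV <- 4 + 0.03 * beer$IBU + rnorm(2000, sd = 1.5)

# Drop zero-IBU recipes, then average ABV within each distinct IBU value.
no_zeros <- beer[beer$IBU > 0, ]
by_ibu <- aggregate(ABV ~ IBU, data = no_zeros, FUN = mean)
names(by_ibu)[2] <- "ABV_mean"

# Averaging removes within-group noise, which typically raises the correlation.
r_raw     <- cor(beer$ABV, beer$IBU)
r_grouped <- cor(by_ibu$ABV_mean, by_ibu$IBU)
c(raw = r_raw, grouped = r_grouped)
```

A correlation computed on group means describes how the averages move together, so it tends to overstate how well IBU predicts the ABV of an individual recipe.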

## 
##  Pearson's product-moment correlation
## 
## data:  beer_by_OG$ABV_mean and beer_by_OG$OG
## t = -5.5381, df = 2034, p-value = 1
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  -0.1576452  1.0000000
## sample estimates:
##        cor 
## -0.1218817

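Here we see that the correlation of ABV with OG is slightly negative (r = -0.12); because the test was one-sided (correlation greater than 0), the reported p-value is 1. However, this is hard to interpret, as the distribution of OG is complex: most data points lie around 1, while others lie between 10 and 30. The latter are probably recipes recorded on the Plato sugar scale rather than as specific gravity.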
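The p-value of 1 in the output above follows from the direction of the test rather than from the data being uninformative. A short sketch on synthetic data with a weak negative relationship shows how the p-value changes with the `alternative` argument of `cor.test()`.

```r
set.seed(6)
# Synthetic variables with a weak negative relationship (illustrative only).
x <- rnorm(300)
y <- -0.3 * x + rnorm(300)

# Same data, three alternatives: only the direction being tested changes.
p_greater <- cor.test(x, y, alternative = "greater")$p.value
p_less    <- cor.test(x, y, alternative = "less")$p.value
p_two     <- cor.test(x, y, alternative = "two.sided")$p.value

c(greater = p_greater, less = p_less, two.sided = p_two)
```

When the sign of the relationship is not known in advance, the two-sided alternative is the safer choice.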

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I mainly attempted to correlate the alcohol content (ABV) with quantitative and categorical variables. Among the quantitative variables, ABV seems to have a slight positive correlation with IBU and Boil Time. However, the relationship with OG is close to 0, so not very strong. There was no clear relationship with the categorical variables BrewMethod or SugarScale.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The original and the final gravities have a very strong positive correlation. This suggests that final gravity scales roughly linearly with original gravity. Nonetheless, OG has a neutral to negative correlation with ABV_mean.

What was the strongest relationship you found?

The correlation between the original and final gravities.

Multivariate Plots Section

First we will consider only the quantitative data and transform everything to numeric, then plot a correlation matrix of the remaining 12 variables.

Similar to what we saw in the bivariate analysis, the OG and FG relationship is strong. The only variable with which ABV seems to have a positive correlation is IBU. This correlation is similar to what we observed in the bivariate analysis. Interestingly, the only strong negative correlations we can see involve Efficiency and MashThickness.

Next we are going to reduce the dimensions of the data with principal component analysis (PCA).
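A correlation matrix of this kind can be computed with `cor()`. The sketch below uses four synthetic columns as stand-ins for the 12 variables; the relationships between them are assumptions chosen only to give the matrix some structure.

```r
set.seed(7)
# Synthetic numeric columns standing in for a few of the 12 variables.
n   <- 400
og  <- rnorm(n, mean = 1.06, sd = 0.015)
ibu <- rlnorm(n, meanlog = 3.5, sdlog = 0.6)
num <- data.frame(
  OG  = og,
  FG  = 1 + 0.25 * (og - 1) + rnorm(n, sd = 0.002),
  IBU = ibu,
  ABV = 5 + 0.02 * ibu + rnorm(n, sd = 1)
)
num$IBU[sample(n, 20)] <- NA  # mimic missing entries

# Pairwise-complete correlations use every row available for each pair.
cm <- cor(num, use = "pairwise.complete.obs")
round(cm, 2)
```

With `use = "pairwise.complete.obs"`, a missing value in one column only removes that row from the pairs involving that column, rather than from the whole matrix.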

## Importance of components:
##                             PC1      PC2       PC3      PC4      PC5
## Standard deviation     265.0874 209.8677 102.04665 59.58779 42.83803
## Proportion of Variance   0.5376   0.3369   0.07966  0.02716  0.01404
## Cumulative Proportion    0.5376   0.8745   0.95415  0.98132  0.99535
##                             PC6      PC7      PC8     PC9    PC10    PC11
## Standard deviation     15.36514 14.05888 12.55853 2.98178 1.96147 1.76230
## Proportion of Variance  0.00181  0.00151  0.00121 0.00007 0.00003 0.00002
## Cumulative Proportion   0.99716  0.99867  0.99988 0.99995 0.99998 1.00000
##                          PC12
## Standard deviation     0.1489
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.5735 1.3661 1.2567 1.1545 1.1126 0.94902 0.88643
## Proportion of Variance 0.2063 0.1555 0.1316 0.1111 0.1032 0.07505 0.06548
## Cumulative Proportion  0.2063 0.3618 0.4934 0.6045 0.7077 0.78272 0.84820
##                           PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.8117 0.78558 0.69123 0.24929 0.07545
## Proportion of Variance 0.0549 0.05143 0.03982 0.00518 0.00047
## Cumulative Proportion  0.9031 0.95453 0.99435 0.99953 1.00000

The first table shows the PCA without scale standardization and the second with scale standardization. Standardization is needed because the variables differ widely in scale and magnitude; without it, the variables with the largest variances dominate the first components. I will use the results of the second, standardized table for further plotting.

In this graph we can see that the first and second components together explain approximately 36% of the variance in the data.

Here we plot the first 2 components and see 2 trends, one positive and one negative. We will color these by the categorical variables in the last section.
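The effect of standardization can be seen on a toy example with `prcomp()`. The two synthetic variables below are unrelated and differ only in scale; their magnitudes are assumptions loosely modeled on a gravity-like and an IBU-like variable.

```r
set.seed(8)
# Two unrelated synthetic variables on very different scales (illustrative).
toy <- data.frame(
  small = rnorm(200, mean = 1.06, sd = 0.02),  # gravity-like magnitude
  large = rnorm(200, mean = 45, sd = 40)       # IBU-like magnitude
)

pca_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance attributed to the first component in each case.
prop_raw    <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)
prop_scaled <- pca_scaled$sdev[1]^2 / sum(pca_scaled$sdev^2)
c(raw = prop_raw, scaled = prop_scaled)
```

Without scaling, the first component is essentially the large-scale variable on its own; with scaling, the two variables contribute on an equal footing.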
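The share of variance explained by the first components can be recomputed directly from the standard deviations in the standardized table above, as a quick check:

```r
# Standard deviations of the 12 components, copied from the standardized table.
sdev <- c(1.5735, 1.3661, 1.2567, 1.1545, 1.1126, 0.94902,
          0.88643, 0.8117, 0.78558, 0.69123, 0.24929, 0.07545)

prop_var <- sdev^2 / sum(sdev^2)  # proportion of variance per component
cum_var  <- cumsum(prop_var)      # cumulative proportion

round(cum_var[2], 4)        # share explained by PC1 and PC2 together
which(cum_var >= 0.95)[1]   # number of components needed to reach 95%
```

The squared standard deviations sum to about 12, as expected for 12 standardized variables, so each proportion is simply the component's variance divided by 12.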

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the correlation matrix we can see that there are several positive correlations, some of which are similar to what I observed in the bivariate analysis. However, the matrix gives us a more global view of the whole quantitative data set. ABV was modestly correlated with IBU and Color. I also used PCA to try to reduce the dimensionality of this data and found that about 95% of the variance could be explained by 9 components. This could help to model the data in the future.

Were there any interesting or surprising interactions between features?

The correlation matrix additionally shows that BoilGravity seems to be slightly correlated with ABV and more strongly with OG and FG. When plotting the first 2 components of the PCA, we get some clustering of the data by Sugar Scale. This was not observed with ABV_ranges (data not shown).

Final Plots and Summary

Plot One

Description One

My interest was to figure out alcohol preferences in homebrew beer. For this it was useful to set ranges of ABV and then analyse the proportion of each in the beer recipe data. Here we can clearly see the percentages of the ABV (alcohol by volume) ranges in the entire data set. We can observe that the most common ABV_ranges in the data are at intermediate to high levels of alcohol. Very few homebrewers make beers with very high or very low alcohol content.

Plot Two

Description Two

Another interest of mine was to find the correlations between ABV and other variables in the data set. This correlation matrix shows all the correlations among the quantitative variables of the beer recipes. As a sanity check, ABV has a perfect positive correlation with itself, suggesting the method is working. We can also see at a glance its positive correlation with IBU and, to a lesser degree, with Boil Time and Boil Gravity.

Plot Three

Description Three

Here I included 2 PCA plots because I wanted to contrast a weak grouping variable (ABV_ranges) with a modest one (SugarScale), and to include categorical variables in order to see more global data patterns. Here we reduced the dimensions of the data to 2 components. It is surprising that there seem to be both a positive and a negative trend between the two components. These 2 trends cannot be separated by ABV_ranges, but with SugarScale they line up nicely.

Reflection

This dataset was very interesting. I believe the univariate and bivariate analyses went well and showed some interesting trends. I think it makes sense that IBU goes hand in hand with ABV, as they showed a positive correlation. It was very surprising that people mostly brew beers of intermediate alcohol content instead of going for higher ABVs. This was my first time using R for such a big data set. I struggled a bit with the bivariate and PCA plots, as the outputs were slow to render on my computer. As a first-timer this can be a little frustrating, since I needed to adapt the code and rerun the script many times. For future work I would like to learn more about modeling the data. I would like to find a model whose predictions are based not on just 1 variable but on a set of variables.