This data was obtained from the Kaggle website. The original CSV file contains roughly 75,000 homebrew beer recipes (73,861 rows once loaded, as shown below), each classified by the user into one of 176 different styles. Records are user-reported, so the recipes go into as much or as little detail as the user provided. The data contains 23 variables, most of which are quantitative, with 2 categorical ones. The data can be downloaded at: https://www.kaggle.com/jtrofe/beer-recipes#recipeData.csv
In the following structure we can see that the data has several factors, nums and ints. Several variables also contain embedded "N/A" values. I will clean these so that we can compute more statistics.
## 'data.frame': 73861 obs. of 23 variables:
## $ BeerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 59149 levels "' #1 All American Liberty Lager",..: 55922 50153 59068 59065 6597 48143 45725 50494 14409 34475 ...
## $ URL : Factor w/ 73861 levels "/homebrew/recipe/view/10003/yuen-porter-clone",..: 2448 2469 68305 68249 73416 20359 35605 72561 21627 70920 ...
## $ Style : Factor w/ 176 levels "Altbier","Alternative Grain Beer",..: 45 85 7 7 20 10 86 45 129 86 ...
## $ StyleID : int 45 85 7 7 20 10 86 45 129 86 ...
## $ Size.L. : num 21.8 20.8 18.9 22.7 50 ...
## $ OG : num 1.05 1.08 1.06 1.06 1.06 ...
## $ FG : num 1.01 1.02 1.02 1.02 1.01 ...
## $ ABV : num 5.48 8.16 5.91 5.8 6.48 5.58 7.09 5.36 5.77 8.22 ...
## $ IBU : num 17.6 60.6 59.2 54.5 17.8 ...
## $ Color : num 4.83 15.64 8.98 8.5 4.57 ...
## $ BoilSize : num 28.4 24.6 22.7 26.5 60 ...
## $ BoilTime : int 75 60 60 60 90 70 90 75 75 60 ...
## $ BoilGravity : Factor w/ 510 levels "0","1","1.001",..: 40 72 510 510 52 49 510 42 44 60 ...
## $ Efficiency : num 70 70 70 70 72 79 75 70 73 70 ...
## $ MashThickness: Factor w/ 568 levels "0","0.13","0.22",..: 568 568 568 568 568 568 568 99 568 568 ...
## $ SugarScale : Factor w/ 2 levels "Plato","Specific Gravity": 2 2 2 2 2 2 2 2 2 2 ...
## $ BrewMethod : Factor w/ 4 levels "All Grain","BIAB",..: 1 1 3 1 1 1 1 1 1 1 ...
## $ PitchRate : Factor w/ 10 levels "0","0.35","0.5",..: 10 10 10 10 10 5 10 10 10 10 ...
## $ PrimaryTemp : Factor w/ 218 levels "-0.56","-1.11",..: 75 218 218 218 92 218 218 218 218 111 ...
## $ PrimingMethod: Factor w/ 875 levels " ??????"," 40g/5 US G / 19L. -Brew sugar",..: 266 651 651 651 804 651 651 266 266 269 ...
## $ PrimingAmount: Factor w/ 1898 levels "---"," @ 15 psi",..: 1203 1866 1866 1866 1442 1866 1866 1178 1149 1220 ...
## $ UserId : int 116 955 NA NA 18325 5889 1051 116 116 NA ...
## BeerID Name
## Min. : 1 Awesome Recipe: 1311
## 1st Qu.:18466 IPA : 197
## Median :36931 Saison : 181
## Mean :36931 Kölsch : 173
## 3rd Qu.:55396 Black IPA : 129
## Max. :73861 Stout : 129
## (Other) :71741
## URL
## /homebrew/recipe/view/10003/yuen-porter-clone : 1
## /homebrew/recipe/view/100034/double-dog-citra-ipa: 1
## /homebrew/recipe/view/100068/hop-rocket : 1
## /homebrew/recipe/view/100096/evie-pale-ale-ag : 1
## /homebrew/recipe/view/100147/battle-of-flodden : 1
## /homebrew/recipe/view/10022/california-apa : 1
## (Other) :73855
## Style StyleID Size.L.
## American IPA :11940 Min. : 1.00 Min. : 1.00
## American Pale Ale : 7581 1st Qu.: 10.00 1st Qu.: 18.93
## Saison : 2617 Median : 35.00 Median : 20.82
## American Light Lager: 2277 Mean : 60.18 Mean : 43.93
## American Amber Ale : 2038 3rd Qu.:111.00 3rd Qu.: 23.66
## Blonde Ale : 1753 Max. :176.00 Max. :9200.00
## (Other) :45655
## OG FG ABV IBU
## Min. : 1.000 Min. :-0.003 Min. : 0.000 Min. : 0.00
## 1st Qu.: 1.051 1st Qu.: 1.011 1st Qu.: 5.080 1st Qu.: 23.37
## Median : 1.058 Median : 1.013 Median : 5.790 Median : 35.77
## Mean : 1.406 Mean : 1.076 Mean : 6.137 Mean : 44.28
## 3rd Qu.: 1.069 3rd Qu.: 1.017 3rd Qu.: 6.830 3rd Qu.: 56.38
## Max. :34.035 Max. :23.425 Max. :54.720 Max. :3409.30
##
## Color BoilSize BoilTime BoilGravity
## Min. : 0.00 Min. : 1.00 Min. : 0.00 N/A : 2990
## 1st Qu.: 5.17 1st Qu.: 20.82 1st Qu.: 60.00 1.044 : 2502
## Median : 8.44 Median : 27.44 Median : 60.00 1.042 : 2470
## Mean : 13.40 Mean : 49.73 Mean : 65.07 1.043 : 2438
## 3rd Qu.: 16.79 3rd Qu.: 30.00 3rd Qu.: 60.00 1.041 : 2391
## Max. :186.00 Max. :9700.00 Max. :240.00 1.04 : 2304
## (Other):58766
## Efficiency MashThickness SugarScale
## Min. : 0.00 N/A :29864 Plato : 1902
## 1st Qu.: 65.00 1.5 :15499 Specific Gravity:71959
## Median : 70.00 3 : 8312
## Mean : 66.35 1.25 : 4923
## 3rd Qu.: 75.00 2.5 : 1864
## Max. :100.00 1.3 : 1110
## (Other):12289
## BrewMethod PitchRate PrimaryTemp PrimingMethod
## All Grain :49692 N/A :39252 N/A :22662 N/A :67094
## BIAB :12016 0.35 : 9477 20 :14185 Corn Sugar: 715
## extract : 8626 0.75 : 9002 21.11 : 4622 Dextrose : 503
## Partial Mash: 3527 0.5 : 5469 18.33 : 4182 corn sugar: 360
## 1 : 5194 18 : 4129 Keg : 330
## 1.25 : 2405 19 : 2674 (Other) : 4858
## (Other): 3062 (Other):21407 NA's : 1
## PrimingAmount UserId
## N/A :69084 Min. : 49
## 5 oz : 205 1st Qu.: 20984
## 3/4 cup: 110 Median : 42897
## 4 oz : 106 Mean : 43078
## 1 cup : 102 3rd Qu.: 57841
## (Other): 4253 Max. :134362
## NA's : 1 NA's :50490
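The cleaning step mentioned before the structure output could look roughly like the sketch below. It assumes the missing values are stored as the literal string "N/A" (as the summary suggests) and uses a tiny inline sample instead of the real recipeData.csv; the rows and the object name `beer_small` are made up for illustration.

```r
# A minimal sketch: treat the literal string "N/A" as missing while reading,
# so numeric columns are parsed as numbers instead of factors.
# The rows below are invented sample data, not real recipes.
csv_text <- "BeerID,ABV,BoilGravity,MashThickness,PitchRate
1,5.48,1.038,N/A,N/A
2,8.16,1.070,N/A,N/A
3,5.91,N/A,N/A,0.75
4,5.36,1.040,1.5,N/A"

beer_small <- read.csv(text = csv_text,
                       na.strings = c("N/A", ""),
                       stringsAsFactors = FALSE)

str(beer_small)               # BoilGravity, MashThickness, PitchRate are numeric
colSums(is.na(beer_small))    # number of missing values per column
```

With the real file, the same `na.strings` argument would be passed to `read.csv()` together with the path to the downloaded CSV.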
We will start by analysing the Color variable for all the beer recipes. First we look at the summary of Color to get an idea of the data limits, and then plot the raw counts. As the raw data show, there are very few counts above 80.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.17 8.44 13.40 16.79 186.00
Based on the graph above, I plot the log10 transformation to help visualize the data better.
This graph shows that the distribution is slightly bimodal. Higher counts are seen at about 0.8 and 1.6 on the log10 scale (roughly 6 and 40 in the original color units), suggesting that most beers are brewed at the lighter end of the scale, with quite a few at the darker end.
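As a rough illustration of the log10 step, the sketch below builds a histogram of log10-transformed values and converts the two peak positions back to the original scale. The `color` vector is simulated (an assumption, not the real Color column), and base R `hist()` is used here only to keep the example self-contained.

```r
# Simulated stand-in for beer$Color (assumption: not the real data).
set.seed(1)
color <- c(rlnorm(700, meanlog = log(6),  sdlog = 0.3),  # lighter beers
           rlnorm(300, meanlog = log(40), sdlog = 0.3),  # darker beers
           0)                                            # a zero, like the real minimum

# log10(0) is -Inf, so zeros are dropped before transforming.
log_color <- log10(color[color > 0])

h <- hist(log_color, breaks = 30, plot = FALSE)  # set plot = TRUE to draw it
h$mids[which.max(h$counts)]                      # location of the tallest bin

# Back-transform the two peaks reported in the text to the original scale.
10^c(0.8, 1.6)
```

Dropping zeros is one possible choice; adding a small constant before taking the logarithm is another.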
Next we will plot histograms for several other variables: ABV, IBU, BoilTime, OG (specific gravity of the wort before fermentation) and FG (specific gravity after fermentation).
Here we see that the ABV distribution is approximately normal, with the highest counts around 6-8% ABV.
Here we see that the IBU distribution is also approximately normal once log10-transformed, with the highest counts at about 1.5 on the log10 scale (roughly 30 IBU).
Here we see that BoilTime has its highest counts at 1 hr, followed by 1.5 hrs.
Here we see that the OG distribution is centered around the median (colored in red), which the summary above gives as 1.058.
Here we see that the FG distribution is centered around the median (colored in red), which the summary above gives as 1.013.
I am interested in looking in more detail at the alcohol by volume across beer samples. To do this I constructed a new categorical variable called ABV_ranges. The first table shows the count for each range. I then plotted these ABV categories as percentages in a pie chart.
## # A tibble: 6 x 2
## ABV_ranges n
## <fct> <int>
## 1 <NA> 168
## 2 10 to 50 1218
## 3 0 to 3 2779
## 4 8 to 10 3252
## 5 6 to 8 12028
## 6 3 to 4 13276
As we can see the most common ABV ranges are the Medium and Medium High ranges.
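A binned variable like ABV_ranges can be built with `cut()`. The sketch below is one way to do it; the break points are an assumption inferred from the labels in the table above (the "4 to 6" bin does not appear in that output), and the ABV values are made up.

```r
# Example ABV values (made up for illustration).
abv <- c(0, 2.5, 3.5, 5.8, 6.5, 9.2, 12, 54.72)

# Break points are an assumption inferred from the labels in the table above.
breaks <- c(0, 3, 4, 6, 8, 10, 50)
labels <- c("0 to 3", "3 to 4", "4 to 6", "6 to 8", "8 to 10", "10 to 50")

# cut() uses intervals of the form (a, b]; include.lowest = TRUE keeps ABV == 0.
abv_ranges <- cut(abv, breaks = breaks, labels = labels, include.lowest = TRUE)

counts <- table(abv_ranges, useNA = "ifany")
round(100 * prop.table(counts), 1)   # percentages, as used for a pie chart
```

Values outside the outermost breaks become NA, and without `include.lowest = TRUE` an ABV of exactly 0 would too. Either could be a source of the `<NA>` row in the table, though that is a guess.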
Next we plot the top 10 beer styles.
Here we see that, among the top 10, American IPA is the most common beer style and American Brown Ale the least common.
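Picking the most common styles comes down to counting and sorting. A minimal sketch follows, using a toy `style` vector and a hypothetical helper `top_n_styles()` (neither is from the original analysis):

```r
# Toy style column (assumption: stands in for beer$Style).
style <- c(rep("American IPA", 5), rep("American Pale Ale", 3),
           rep("Saison", 2), "Blonde Ale")

# Count each style, sort from most to least common, keep the top n.
top_n_styles <- function(x, n = 10) {
  head(sort(table(x), decreasing = TRUE), n)
}

top3 <- top_n_styles(style, n = 3)
top3
```

The names of the result can then be used to filter the full data frame down to the rows belonging to those styles.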
My data set has 73,861 observations and 23 variables. Only 2 of the 23 variables are categorical; the rest are quantitative. The scales and distributions of most of these variables are quite different. For instance, the Color variable, which describes beer color from lightest to darkest, has a bimodal distribution, with one mode at the lighter colors and another at the darker ones. ABV, FG and OG all have roughly normal, unimodal distributions. The IBU distribution is skewed to the right (its mean of 44.28 sits well above its median of 35.77), but once log-transformed (as seen here) this skewness goes away. The BoilTime variable has very few unique values, the most common being 60 min. Plotting our 2 categorical variables, we can see that the most common Brew Method is All Grain and the most common Sugar Scale is Specific Gravity.
The main feature of interest for me is ABV (alcohol by volume), as I would like to know whether homebrewers are making more alcoholic beers in general. Also, I am a homebrewer myself and would like to find out whether any variable of the brewing process could lead to higher alcohol content.
It would be nice to see whether any of the variables that describe the recipe are correlated with ABV. This will be done in the next sections. I also thought that converting the ABV variable into ranges would help visualize the data, so that we could see which alcohol levels are more or less common.
As mentioned above, I created a couple of new variables. ABV_ranges was created to show the distribution of alcohol content in ranges that go from Very Low to Extremely High. I also found the top 10 most common styles of beer and broke them down by ABV_ranges and Brew Method.
For the Color variable I applied the log10 function, as it seemed to be skewed and to have 2 modes. This transformation made the shape of the data easier to see. Similarly, IBU was transformed with log10(IBU + 1) to reduce the skewness of the data. For OG and FG I used coord_cartesian so that the median drawn on the graph is the median of the whole data set rather than of the zoomed-in range. OG and FG both have a few extreme outliers, which is why the mean of each lies far from most of the data points. This is why I plotted the median, which in this case gives a better idea of typical OG/FG measurements.
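The reasoning about outliers, the median and coord_cartesian can be sketched as follows. The `og` vector is simulated to mimic a bulk of values near 1.06 plus a few much larger entries (an assumption), and the ggplot2 part only runs if that package is installed.

```r
# Simulated OG values: mostly specific gravity near 1.06, plus a few much
# larger entries (assumption: mimics the outliers described above).
set.seed(42)
og <- c(rnorm(950, mean = 1.058, sd = 0.012), runif(50, min = 10, max = 30))

mean(og)     # pulled upward by the large values
median(og)   # stays near the bulk of the data

# log10(x + 1) keeps zeros finite, which plain log10(x) would not.
log10(c(0, 9, 99) + 1)

# Zooming with coord_cartesian() keeps every row, so a median computed from
# the full vector still matches the data the plot is based on.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(og = og), aes(x = og)) +
    geom_histogram(binwidth = 0.005) +
    geom_vline(xintercept = median(og), colour = "red") +
    coord_cartesian(xlim = c(1.0, 1.15))
  # print(p) would draw the zoomed histogram with the median line
}
```

Setting axis limits through the scale instead would drop the out-of-range rows before any statistics are computed, which is why the zoom is done with coord_cartesian here.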
As a follow-up to the last section, I will plot the top 10 most common beer styles with the proportions of the ABV ranges and the Brew Method breakdowns.
Here we can see that, similar to the whole distribution of ABV_ranges seen in the pie chart above, the most common styles also follow the trend of having medium to medium-high ABV_ranges.
Here we see that the most common Brew Method in the top 10 styles is All Grain, and the second most common is BIAB. To follow this up I will plot the Brew Method proportions for the whole data set. I will break down Brew Method by Sugar Scale to see whether any Brew Method is used mostly with a particular Sugar Scale.
This graph shows that, as in the top 10 beer styles, All Grain is the most common brewing method in the whole data set. Additionally, the most prevalent Sugar Scale within every Brew Method is Specific Gravity; this Sugar Scale is therefore preferred across all Brew Methods.
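The breakdown of one categorical variable by another can also be checked numerically with a two-way table. A small sketch with toy data (the `brew` data frame is an assumption standing in for the BrewMethod and SugarScale columns):

```r
# Toy data (assumption: stands in for beer$BrewMethod and beer$SugarScale).
brew <- data.frame(
  BrewMethod = c("All Grain", "All Grain", "All Grain", "BIAB", "BIAB", "extract"),
  SugarScale = c("Specific Gravity", "Specific Gravity", "Plato",
                 "Specific Gravity", "Specific Gravity", "Specific Gravity")
)

tab <- table(brew$BrewMethod, brew$SugarScale)

# margin = 1: proportions within each brew method (each row sums to 1).
within_method <- prop.table(tab, margin = 1)
round(within_method, 2)
```

Using `margin = 1` answers "which scale dominates within each method", which is the comparison made in the text; `margin = 2` would instead show how each scale is split across methods.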
Next, I will plot ABV against IBU to test whether this variable correlates with ABV. Additionally, I will check the Pearson correlation coefficient to confirm whether the correlation is positive, negative or absent.
##
## Pearson's product-moment correlation
##
## data: beer$ABV and beer$IBU
## t = 82.628, df = 73859, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
## 0.2853385 1.0000000
## sample estimates:
## cor
## 0.2908885
Here we see that there is a slightly positive correlation (r = 0.29). However, overplotting makes the relationship hard to see in the graph. Next we subset the data to see whether this helps to visualize a potential correlation. Since we know from previous graphs that the most common ABV values lie between 3 and 8, we will subset the data to this range before plotting and calculating correlation coefficients.
##
## Spearman's rank correlation rho
##
## data: ABV and IBU
## S = 2.7557e+13, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.3862732
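The correlation in the subset of the data seems more positive (Spearman's rho = 0.39, compared with Pearson's r = 0.29 on the full data, although the two coefficients are not directly comparable). Additionally, we can visualize this relationship better once we look at the 3-8 range. Next, we will see whether other variables have any relationships, such as OG and FG.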
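The two tests reported so far use different methods: Pearson on the full data and Spearman on the subset. The sketch below runs both with `cor.test()` on simulated ABV/IBU pairs (an assumption, not the real recipes); the data frame name `beer_sim` is illustrative.

```r
# Simulated ABV/IBU pairs (assumption: not the real recipes).
set.seed(7)
abv <- runif(500, min = 0, max = 12)
ibu <- 10 + 5 * abv + rnorm(500, sd = 20)
beer_sim <- data.frame(ABV = abv, IBU = ibu)

# Pearson on the full data measures linear association.
pearson_all <- cor.test(beer_sim$ABV, beer_sim$IBU, method = "pearson")

# Spearman on the 3-8 ABV subset measures monotonic association on ranks.
# exact = FALSE requests the approximate p-value, which also copes with ties.
sub <- subset(beer_sim, ABV >= 3 & ABV <= 8)
spearman_sub <- cor.test(sub$ABV, sub$IBU, method = "spearman", exact = FALSE)

c(pearson_all  = unname(pearson_all$estimate),
  spearman_sub = unname(spearman_sub$estimate))
```

Running the same method on both the full data and the subset would make the before/after comparison cleaner.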
##
## Pearson's product-moment correlation
##
## data: beer$OG and beer$FG
## t = 724.83, df = 73859, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
## 0.9355969 1.0000000
## sample estimates:
## cor
## 0.9363471
As we can see from the correlation coefficient (r = 0.94), OG and FG have a strong correlation. The graph of these variables shows this as well. Next, we will explore any relationships of ABV with categorical variables.
As we can see, there isn’t any particular trend with either Brew Method or Sugar Scale.
After plotting the raw variables, I realized that the visualization might be better if we group them first and then plot. Here I am grouping the data by IBU, BoilTime and OG and plotting each against the ABV_mean of those groups. After each plot I run a correlation test to confirm what I find in the graphs.
##
## Pearson's product-moment correlation
##
## data: beer_by_IBUNOzeros$ABV_mean and beer_by_IBUNOzeros$IBU
## t = 56.345, df = 12584, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
## 0.4370592 1.0000000
## sample estimates:
## cor
## 0.4488452
Here we see that the correlation between ABV and the grouped IBU data (r = 0.45) is stronger than in the raw data (r = 0.29). We see this in both the coefficient and the plot.
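The grouping step can be sketched with base R `aggregate()`. The toy data frame and the name `beer_by_ibu` are assumptions; the original analysis may well have used a different approach to build its grouped table.

```r
# Toy recipes (assumption: stands in for the beer data frame).
beer_toy <- data.frame(
  IBU = c(0, 20, 20, 35, 35, 35, 60),
  ABV = c(4.0, 5.0, 5.4, 5.8, 6.0, 6.2, 7.5)
)

# Drop IBU == 0, then average ABV within each distinct IBU value.
no_zeros <- subset(beer_toy, IBU > 0)
beer_by_ibu <- aggregate(ABV ~ IBU, data = no_zeros, FUN = mean)
names(beer_by_ibu)[2] <- "ABV_mean"

beer_by_ibu
cor(beer_by_ibu$IBU, beer_by_ibu$ABV_mean)
```

Averaging within groups removes recipe-to-recipe scatter, so correlations computed on group means tend to come out larger than on the raw rows. That is worth keeping in mind when comparing the grouped and ungrouped coefficients.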
##
## Pearson's product-moment correlation
##
## data: beer_by_OG$ABV_mean and beer_by_OG$OG
## t = -5.5381, df = 2034, p-value = 1
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
## -0.1576452 1.0000000
## sample estimates:
## cor
## -0.1218817
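Here we see that the correlation of ABV with OG is slightly negative (r = -0.12). The p-value of 1 reflects the one-sided alternative (correlation greater than 0), which a negative estimate cannot support. However, this is hard to interpret as the distribution of OG is complex: most data points are around 1, and others lie between 10 and 30.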
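The effect of the one-sided alternative on the p-value can be seen with a small simulation. The data below are made up so that the true correlation is negative; only the `alternative` argument differs between the two calls.

```r
# Simulated negatively related pair (assumption: illustration only).
set.seed(3)
x <- rnorm(200)
y <- -0.3 * x + rnorm(200)

one_sided <- cor.test(x, y, alternative = "greater")    # H1: correlation > 0
two_sided <- cor.test(x, y, alternative = "two.sided")  # H1: correlation != 0

c(estimate    = unname(one_sided$estimate),
  p_greater   = one_sided$p.value,
  p_two_sided = two_sided$p.value)
```

When the direction of the relationship is not known in advance, the two-sided test is usually the safer default.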
I mainly attempted to correlate the alcohol content (ABV) with quantitative and categorical variables. Among the quantitative variables, ABV seems to have a slightly positive correlation with IBU and Boil Time. However, the relationship with OG is close to 0, so not very strong. There was no clear relationship with the categorical variables BrewMethod or SugarScale.
The original and final gravities have a very strong positive correlation. This suggests that final gravity increases roughly linearly with original gravity. Nonetheless, OG has a neutral to negative correlation with ABV_mean.
The strongest relationship I found was the correlation between the original and final gravities.
First we will consider only the quantitative data and convert everything to numeric; then we will plot a correlation matrix of the remaining 12 variables.
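Converting factor columns to numeric has a well-known pitfall in R: calling `as.numeric()` directly on a factor returns the internal level codes. The sketch below shows the safer route through `as.character()` and then a correlation matrix that tolerates missing values. All values are made up, and the column names only mimic those in the structure output.

```r
# A factor whose levels are numbers stored as text, plus "N/A"
# (assumption: mimics columns such as BoilGravity in the structure output).
boil_gravity <- factor(c("1.038", "1.07", "N/A", "1.04"))

as.numeric(boil_gravity)   # level codes, not the gravities

# Convert via character; "N/A" becomes NA (the warning is suppressed on purpose).
bg_num <- suppressWarnings(as.numeric(as.character(boil_gravity)))
bg_num

# Correlation matrix that handles missing values pair by pair.
toy <- data.frame(BoilGravity = bg_num,
                  OG  = c(1.055, 1.083, 1.063, 1.050),
                  ABV = c(5.48, 8.16, 5.91, 5.36))
round(cor(toy, use = "pairwise.complete.obs"), 2)
```

With `use = "pairwise.complete.obs"`, each pair of columns uses whichever rows are complete for that pair, so columns with many missing values do not shrink the sample for every other pair.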
Similar to what we saw in the bivariate analysis, the OG and FG relationship is strong. The only variable with which ABV seems to have a positive correlation is IBU; this correlation is similar to what we observed in the bivariate analysis. Interestingly, the only strong negative correlations that we can see involve Efficiency and MashThickness.
Next we will reduce the dimensions of the data with PCA.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 265.0874 209.8677 102.04665 59.58779 42.83803
## Proportion of Variance 0.5376 0.3369 0.07966 0.02716 0.01404
## Cumulative Proportion 0.5376 0.8745 0.95415 0.98132 0.99535
## PC6 PC7 PC8 PC9 PC10 PC11
## Standard deviation 15.36514 14.05888 12.55853 2.98178 1.96147 1.76230
## Proportion of Variance 0.00181 0.00151 0.00121 0.00007 0.00003 0.00002
## Cumulative Proportion 0.99716 0.99867 0.99988 0.99995 0.99998 1.00000
## PC12
## Standard deviation 0.1489
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5735 1.3661 1.2567 1.1545 1.1126 0.94902 0.88643
## Proportion of Variance 0.2063 0.1555 0.1316 0.1111 0.1032 0.07505 0.06548
## Cumulative Proportion 0.2063 0.3618 0.4934 0.6045 0.7077 0.78272 0.84820
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.8117 0.78558 0.69123 0.24929 0.07545
## Proportion of Variance 0.0549 0.05143 0.03982 0.00518 0.00047
## Cumulative Proportion 0.9031 0.95453 0.99435 0.99953 1.00000
The first table shows the PCA without scale standardization and the second with scale standardization. Standardization is needed because the variables differ widely in scale and magnitude; without it, the variables with the largest variances dominate the first components. I will use the results of the second, standardized table for further plotting.
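The contrast between the two tables can be reproduced in miniature with `prcomp()`. The sketch below uses three simulated variables on very different scales (an assumption, not the beer data) and compares the variance proportions with and without `scale. = TRUE`.

```r
# Simulated variables on very different scales (assumption: not the beer data).
set.seed(10)
n <- 300
sim <- data.frame(
  gravity = rnorm(n, mean = 1.06, sd = 0.01),   # tiny spread
  abv     = rnorm(n, mean = 6,    sd = 1.5),
  boil_l  = rnorm(n, mean = 50,   sd = 200)     # huge spread
)

pca_raw    <- prcomp(sim, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(sim, center = TRUE, scale. = TRUE)

# Proportion of variance carried by each component.
summary(pca_raw)$importance["Proportion of Variance", ]
summary(pca_scaled)$importance["Proportion of Variance", ]
```

In the unscaled run nearly all variance lands on the first component simply because one variable has the largest units, which is the pattern visible in the first table.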
In this graph we can see that the first and second components together explain approximately 36% of the variance in the data.
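The share of variance explained by the leading components can be recomputed directly from the standard deviations printed in the standardized table. A short worked example (the vector name `sdev` mirrors the `prcomp` field but is typed in by hand here):

```r
# Standard deviations copied from the standardized PCA table above.
sdev <- c(1.5735, 1.3661, 1.2567, 1.1545, 1.1126, 0.94902,
          0.88643, 0.8117, 0.78558, 0.69123, 0.24929, 0.07545)

var_explained <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_explained <- cumsum(var_explained)  # running total

round(cum_explained[2], 3)              # first two components together
which(cum_explained >= 0.95)[1]         # components needed to reach 95%
```

The same two lines applied to `pca$sdev` from a fitted `prcomp` object give the "Cumulative Proportion" row without reading it off the printed summary.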
Here we plot the first 2 components and see 2 trends, one positive and one negative. We will color these by the categorical variables in the last section.
In the correlation matrix we can see that there are several positive correlations, some of which are similar to what I observed in the bivariate analysis. However, the matrix gives us a more global view of the whole quantitative data set. ABV was modestly correlated with IBU and Color. I also used PCA to try to reduce the dimensionality of this data and found that about 95% of the variance could be explained by 9 components (10 components are needed to pass 98%). This could help to model the data in the future.
The correlation matrix additionally shows that Boil Gravity seems to be slightly correlated with ABV and more strongly with OG and FG. When plotting the first 2 components of the PCA, we get some clustering of the data by Sugar Scale. This was not observed for ABV_ranges (data not shown).
My interest was to figure out alcohol preferences in homebrew beer. For this it was useful to set ranges of ABV and then analyse the proportion of each in the beer recipe data. Here we can clearly see the percentages of the ABV (alcohol by volume) ranges in the entire data set. We can observe that the most common ABV_ranges in the data are the intermediate to high levels of alcohol. Very few homebrewers make beers with very high or very low alcohol content.
Another interest of mine was to find the correlations between ABV and other variables in the data set. This correlation matrix shows all the correlations among the quantitative variables of the beer recipes. As expected, ABV correlates perfectly with itself, which serves as a sanity check on the method, and we can see at a glance its positive correlation with IBU and, to a lesser degree, with Boil Time and Boil Gravity.
Here I included 2 PCA plots because I wanted to contrast a weak grouping variable (ABV_ranges) with a modest one (SugarScale), and to bring in categorical variables to reveal more global data patterns. Here we reduced the dimensions of the data to 2 components. It is surprising that there seem to be both a positive and a negative trend between the components; these 2 trends cannot be separated by ABV_ranges, but with SugarScale they line up nicely.
This dataset was very interesting. I believe the univariate and bivariate analyses went well and showed some interesting trends. I think it made sense that IBU goes hand in hand with ABV, as they showed a positive correlation. It was very surprising that people brew mostly beers of intermediate alcohol content instead of going for higher ABVs. This was my first time using R for such a big data set. I struggled a bit with the bivariate and PCA plots, as the outputs were not quick enough on my computer. As a first-timer this can be a little frustrating, since I needed to adapt the code and rerun the script many times. For future work I would like to learn more about modeling the data. I would like to find a model that predicts based not only on 1 variable but on a set of variables.