R Bridge Course Final Project

Select a data set and analyze it. The presentation approach is up to you, but it should contain the following:

  1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
  2. Data Wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense you could sum two columns together).
  3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
  4. Meaningful Question for Analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Setup

The data set contains a list of 2,410 US craft beers and 510 US breweries. The data was collected in January 2017 on CraftCans.com. Data is courtesy of Jean-Nicholas Hould at Kaggle.com. The data set covers only canned craft beer, a growing segment of the craft beer market, so any further mention of beer refers only to canned beer.

# Load packages
library(RCurl)
library(psych)
library(ggplot2)
library(reshape2)

# Load data file from GitHub
beers <- read.csv(text=getURL("https://raw.githubusercontent.com/ilyakats/CUNY-R-Bridge-Workshop/master/beers.csv"), header = TRUE, sep = ",")
breweries <- read.csv(text=getURL("https://raw.githubusercontent.com/ilyakats/CUNY-R-Bridge-Workshop/master/breweries.csv"), header = TRUE, sep = ",")

# Rename X column in the Breweries data set to match Beers data set for easier merging
colnames(breweries)[1] <- "brewery_id"

# Trim whitespaces
breweries$state <- trimws(breweries$state)
breweries$name <- trimws(breweries$name)
beers$style <- trimws(beers$style)

Analysis

Initial data review/summary - means, median, min/max, a few rows of data…

describe(beers)
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
##            vars    n    mean     sd  median trimmed    mad min     max
## X             1 2410 1204.50 695.85 1204.50 1204.50 893.27 0.0 2409.00
## abv           2 2348    0.06   0.01    0.06    0.06   0.01 0.0    0.13
## ibu           3 1405   42.71  25.95   35.00   40.14  25.20 4.0  138.00
## id            4 2410 1431.11 752.46 1453.50 1446.41 934.78 1.0 2692.00
## name*         5 2410 1152.36 663.06 1162.50 1152.63 851.01 1.0 2305.00
## style*        6 2410     NaN     NA      NA     NaN     NA Inf    -Inf
## brewery_id    7 2410  231.75 157.69  205.00  224.32 194.22 0.0  557.00
## ounces        8 2410   13.59   2.35   12.00   13.33   0.00 8.4   32.00
##              range  skew kurtosis    se
## X          2409.00  0.00    -1.20 14.17
## abv           0.13  0.96     1.14  0.00
## ibu         134.00  0.79    -0.14  0.69
## id         2691.00 -0.12    -1.09 15.33
## name*      2304.00 -0.01    -1.19 13.51
## style*        -Inf    NA       NA    NA
## brewery_id  557.00  0.31    -1.09  3.21
## ounces       23.60  2.04     9.01  0.05
head(beers)
##   X   abv ibu   id                name                          style
## 1 0 0.050  NA 1436            Pub Beer            American Pale Lager
## 2 1 0.066  NA 2265         Devil's Cup        American Pale Ale (APA)
## 3 2 0.071  NA 2264 Rise of the Phoenix                   American IPA
## 4 3 0.090  NA 2263            Sinister American Double / Imperial IPA
## 5 4 0.075  NA 2262       Sex and Candy                   American IPA
## 6 5 0.077  NA 2261        Black Exodus                  Oatmeal Stout
##   brewery_id ounces
## 1        408     12
## 2        177     12
## 3        177     12
## 4        177     12
## 5        177     12
## 6        177     12
head(breweries)
##   brewery_id                      name          city state
## 1          0         NorthGate Brewing   Minneapolis    MN
## 2          1 Against the Grain Brewery    Louisville    KY
## 3          2  Jack's Abby Craft Lagers    Framingham    MA
## 4          3 Mike Hess Brewing Company     San Diego    CA
## 5          4   Fort Point Beer Company San Francisco    CA
## 6          5     COAST Brewing Company    Charleston    SC
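
The coercion warnings and the NaN row for style in the describe() output most likely come from the character column that trimws() produces, since describe() expects numeric input. A minimal sketch of restricting a summary to numeric columns first; toy.beers is a hypothetical stand-in with made-up values, not the real data:

```r
# Hypothetical stand-in for the beers data frame (values are made up)
toy.beers <- data.frame(
  abv = c(0.050, 0.066, NA),
  ibu = c(NA, 35, 60),
  style = c("American IPA", "Oatmeal Stout", "American IPA"),
  stringsAsFactors = FALSE
)

# Flag the numeric columns, then summarize only those
num.cols <- sapply(toy.beers, is.numeric)
toy.numeric <- toy.beers[, num.cols, drop = FALSE]
summary(toy.numeric)
```

On the real data the same selection could be passed to describe() instead of summary().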

I would like to look into IBU values. IBU stands for International Bitterness Unit and is used to quantify the bitterness of beer. The higher the number, the more bitter the beer.

boxplot(beers$ibu)

IBU values over 100 are of limited practical use, since a drinker cannot perceive them. The chart shows that half of the beers with listed IBU values fall between 21 and 64, a very usable range. There are, of course, a few outliers above the upper fence of 128.5. Such values make a great marketing tool but are hardly usable.
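
The upper fence of 128.5 follows Tukey's rule: the upper hinge plus 1.5 times the spread between the hinges, here 64 + 1.5 × (64 − 21) = 128.5. A small sketch of that calculation; tukey.upper.fence is a hypothetical helper name, and the toy vector is made up so that its hinges match the ones quoted in the text:

```r
# Hypothetical helper: upper fence computed from the hinges that boxplot() uses
tukey.upper.fence <- function(x) {
  h <- fivenum(x, na.rm = TRUE)  # min, lower hinge, median, upper hinge, max
  h[4] + 1.5 * (h[4] - h[2])
}

# Made-up values chosen so that the hinges are 21 and 64, as in the text
toy.ibu <- c(21, 21, 35, 64, 64, NA)
fence <- tukey.upper.fence(toy.ibu)
fence
```

On the real data this would be tukey.upper.fence(beers$ibu), and boxplot.stats(beers$ibu)$out would list the individual points drawn beyond the whiskers.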

Check out the histogram as well…

ggplot(beers) + geom_histogram(aes(x = ibu), binwidth = 5, na.rm = TRUE)

Get a list of the 20 most bitter styles (including the average ABV for those styles)…

style.ibu <- aggregate(cbind(ibu, abv) ~ style, beers, mean)
style.ibu <- style.ibu[order(-style.ibu$ibu), ]

style.ibu[1:20, ]
##                                  style      ibu        abv
## 7                  American Barleywine 96.00000 0.09900000
## 12      American Double / Imperial IPA 93.32000 0.08769333
## 80              Russian Imperial Stout 86.50000 0.09950000
## 13  American Double / Imperial Pilsner 85.00000 0.07500000
## 30             Belgian Strong Dark Ale 72.00000 0.09200000
## 8                   American Black Ale 68.90000 0.07315000
## 16                        American IPA 67.63455 0.06480731
## 43                  English Barleywine 66.66667 0.10766667
## 50                       English Stout 66.00000 0.08000000
## 23                 American Strong Ale 65.41667 0.07608333
## 15           American India Pale Lager 63.33333 0.06266667
## 14    American Double / Imperial Stout 62.00000 0.09666667
## 28                         Belgian IPA 57.00000 0.07100000
## 47        English India Pale Ale (IPA) 54.71429 0.06214286
## 26                       Baltic Porter 54.00000 0.09366667
## 51                  English Strong Ale 54.00000 0.08233333
## 81                            Rye Beer 52.00000 0.06611111
## 24                  American White IPA 48.83333 0.06200000
## 54 Extra Special / Strong Bitter (ESB) 45.71429 0.05750000
## 17             American Pale Ale (APA) 44.94118 0.05497386
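
One detail worth keeping in mind: with the formula interface, aggregate() by default drops every row that has an NA in either ibu or abv, so the average ABV in the table is computed only over beers that also list an IBU. A sketch on made-up data (toy is hypothetical) comparing the default with per-column NA handling:

```r
# Made-up data: the second beer has an ABV but no IBU
toy <- data.frame(
  style = c("A", "A", "B"),
  ibu = c(10, NA, 40),
  abv = c(0.05, 0.07, 0.06)
)

# Default: na.action = na.omit removes the whole second row
by.default <- aggregate(cbind(ibu, abv) ~ style, toy, mean)

# Alternative: keep all rows and let mean() skip NAs column by column
by.column <- aggregate(cbind(ibu, abv) ~ style, toy, mean,
                       na.rm = TRUE, na.action = na.pass)

by.default
by.column
```

In the toy example the average ABV for style A differs between the two results, because the second approach also counts the beer without an IBU value. Which behavior is preferable depends on the question being asked.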

Graph scatterplot of bitterness (IBU) and strength (ABV)…

ggplot(beers, aes(x = ibu, y = abv)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", se = FALSE, na.rm = TRUE)

The scatterplot is fairly spread out. There appears to be a weak dependency, if any, between the two variables.
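
A visual impression of spread can be checked numerically with a correlation coefficient. A minimal sketch on made-up values; toy is hypothetical and its result says nothing about the real data:

```r
# Made-up values; only the first three rows have both measurements
toy <- data.frame(
  ibu = c(10, 20, 30, NA, 50),
  abv = c(0.04, 0.05, 0.06, 0.07, NA)
)

# Pearson correlation over the rows where both values are present
r <- cor(toy$ibu, toy$abv, use = "complete.obs")
r
```

On the real data the call would be cor(beers$ibu, beers$abv, use = "complete.obs"). A value near 0 would support the reading of a weak relationship, while a value closer to 1 or -1 would point to a stronger linear one.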

I would also like to consider the state each beer is brewed in and see whether any states favor more bitter beer than others. Merge the beers and breweries data sets and display a histogram of IBU for each state.

loc.beers <- merge(beers, breweries, by = "brewery_id")
names(loc.beers)[names(loc.beers) == "name.x"] <- "beer_name"
names(loc.beers)[names(loc.beers) == "name.y"] <- "brewery_name"
ggplot(loc.beers, aes(x = ibu)) + geom_histogram(binwidth = 5, na.rm=TRUE) + facet_wrap(~state)
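
Faceted histograms can be hard to compare by eye across many panels, so a per-state summary table may complement them. A sketch on made-up data (toy is hypothetical) that ranks states by median IBU:

```r
# Made-up data standing in for the merged loc.beers data frame
toy <- data.frame(
  state = c("CO", "CO", "CO", "IN", "IN"),
  ibu = c(60, 80, NA, 20, 40)
)

# Median IBU per state (rows with a missing IBU are dropped by default)
state.ibu <- aggregate(ibu ~ state, toy, median)

# Most bitter states first
state.ibu <- state.ibu[order(-state.ibu$ibu), ]
state.ibu
```

The same two calls applied to loc.beers would give one row per state, which makes it easier to pick candidates for a closer look.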

A few interesting findings based on the state histograms:

I think it is worthwhile to look into two states with a large number of beers and spikes at both ends of the bitterness range: Colorado and Indiana. Prepare a subset for each state.

# Subsets for Colorado and Indiana beers
co.beers <- loc.beers[loc.beers$state == "CO", ]
in.beers <- loc.beers[loc.beers$state == "IN", ]

# Counts of beers for each style
co.sums <- as.data.frame(table(co.beers$style))
in.sums <- as.data.frame(table(in.beers$style))

# Merge style counts for both states, adjust column names and set NAs to 0
stl.cnt <- merge(co.sums[co.sums$Freq > 4, ], in.sums[in.sums$Freq > 4, ], by = "Var1", all = TRUE)
colnames(stl.cnt) <- c("style", "CO.count", "IN.count")
stl.cnt[is.na(stl.cnt)] <- 0

# Melt the data for plotting
stl.cnt.m <- melt(stl.cnt, id.vars='style')

levels(stl.cnt.m$variable)[levels(stl.cnt.m$variable) == "CO.count"] <- "Colorado"
levels(stl.cnt.m$variable)[levels(stl.cnt.m$variable) == "IN.count"] <- "Indiana"

ggplot(stl.cnt.m, aes(style, value, fill = variable)) +   
  geom_bar(position = "dodge", stat="identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.title=element_blank()) +
  ggtitle("Number of Beers per Style (for Styles with at Least 5 Beers in CO or IN)") +
  labs(x = "Beer Style", y = "Number of Beers")

Although it may appear that the American Pale Ale (APA) category has a significant discrepancy between the two states, the average IBU for APAs is only 44.9. More significant are the discrepancies in IPAs (average IBU of 67.6) and Imperial IPAs (average IBU of 93.3). I want to see which breweries release these styles.

co.ipas <- as.data.frame(table(co.beers[co.beers$style == "American IPA" | co.beers$style == "American Double / Imperial IPA", ]$brewery_name))
colnames(co.ipas) <- c("Brewery", "No of IPAs")
co.ipas[order(-co.ipas[,2]), ]
##                          Brewery No of IPAs
## 16           Oskar Blues Brewery         14
## 15   New Belgium Brewing Company          4
## 5        Bonfire Brewing Company          3
## 14  Great Divide Brewing Company          3
## 19      Renegade Brewing Company          3
## 21           Ska Brewing Company          3
## 3          Avery Brewing Company          2
## 7          Dad & Dude's Breweria          2
## 11 Eddyline Brewery & Restaurant          2
## 26       Upslope Brewing Company          2
## 1          Asher Brewing Company          1
## 2          Aspen Brewing Company          1
## 4             Big Choice Brewing          1
## 6           Breckenridge Brewery          1
## 8            Denver Beer Company          1
## 9          Dolores River Brewery          1
## 10      Dry Dock Brewing Company          1
## 12                  Epic Brewing          1
## 13          Fate Brewing Company          1
## 17      Palisade Brewing Company          1
## 18    Pikes Peak Brewing Company          1
## 20             Silverton Brewery          1
## 22    Steamworks Brewing Company          1
## 23     Telluride Brewing Company          1
## 24          Tommyknocker Brewery          1
## 25  Twisted Pine Brewing Company          1
in.ipas <- as.data.frame(table(in.beers[in.beers$style == "American IPA" | in.beers$style == "American Double / Imperial IPA", ]$brewery_name))
colnames(in.ipas) <- c("Brewery", "No of IPAs")
in.ipas[order(-in.ipas[,2]), ]
##                      Brewery No of IPAs
## 11  Sun King Brewing Company          7
## 1        18th Street Brewery          4
## 2  450 North Brewing Company          3
## 4           Burn 'Em Brewing          3
## 8       Four Fathers Brewing          2
## 9     Great Crescent Brewery          2
## 10  People's Brewing Company          2
## 12   Tin Man Brewing Company          2
## 13    Upland Brewing Company          2
## 3         Bare Hands Brewery          1
## 5    Cutters Brewing Company          1
## 6  Daredevil Brewing Company          1
## 7          Flat 12 Bierwerks          1

I wanted to see whether a single Colorado brewery with a tendency to release strong, bitter beers is responsible for the spike in bitter beer across the state. Colorado brewers produce 54 IPAs or Double IPAs (DIPAs); Indiana brewers produce 31. Oskar Blues in Colorado is responsible for a significant portion of the bitter beers, but I don’t think it is enough to explain the spike. A significant number of Colorado breweries release at least one bitter beer, and I can only conclude that Colorado enjoys bitter beer more than other states do. As an interesting side note, Oskar Blues was the first craft brewery to can its beer, back in 2002. It has never released any beers in bottles.
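
The share of each state's IPAs and DIPAs that comes from its single largest producer can be worked out directly from the two brewery tables. The counts below are copied from those tables; the variable names are only illustrative:

```r
# IPA/DIPA counts per brewery, copied from the Colorado and Indiana tables
co.counts <- c(14, 4, 3, 3, 3, 3, 2, 2, 2, 2, rep(1, 16))
in.counts <- c(7, 4, 3, 3, 2, 2, 2, 2, 2, rep(1, 4))

sum(co.counts)  # 54 Colorado IPAs/DIPAs
sum(in.counts)  # 31 Indiana IPAs/DIPAs

# Share produced by the top brewery in each state
co.top.share <- max(co.counts) / sum(co.counts)  # Oskar Blues: 14 / 54
in.top.share <- max(in.counts) / sum(in.counts)  # Sun King: 7 / 31
round(c(Colorado = co.top.share, Indiana = in.top.share), 2)
```

Oskar Blues accounts for roughly a quarter of Colorado's IPAs and DIPAs, and Sun King for a similar share in Indiana, which is consistent with the reading that one brewery alone does not explain the difference between the two states.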

Conclusion

I have looked into the numbers related to canned craft beers in the United States. Specifically, I concentrated on the bitterness of these beers, and then narrowed the focus further to the bitter beers of Colorado. After this brief, initial analysis it does appear that Colorado has a taste for bitter beers. Of course, the data set covers only canned beers. Although canning is on the rise, the majority of craft beer is still bottled. It is possible that the “pioneers” of the craft beer world tend both to brew bitter beer and to can their products. Further analysis using data on all craft beer should be performed. Since craft brewing is a hobby of mine, it was interesting to see that Colorado has a disproportionate number of bitter beers, since California is traditionally considered the purveyor of hoppy beers.