Select a data set and analyze. The presentation approach is up to you, but it should contain the following:
The data set contains a list of 2,410 US craft beers and 510 US breweries. The data was collected in January 2017 on CraftCans.com. Data is courtesy of Jean-Nicholas Hould at Kaggle.com. The data set deals only with canned craft beer, a growing craft beer segment, so any further mention of beer refers only to canned beer.
# Load packages
library(RCurl)
library(psych)
library(ggplot2)
library(reshape2)
# Load data file from GitHub
beers <- read.csv(text=getURL("https://raw.githubusercontent.com/ilyakats/CUNY-R-Bridge-Workshop/master/beers.csv"), header = TRUE, sep = ",")
breweries <- read.csv(text=getURL("https://raw.githubusercontent.com/ilyakats/CUNY-R-Bridge-Workshop/master/breweries.csv"), header = TRUE, sep = ",")
# Rename X column in the Breweries data set to match Beers data set for easier merging
colnames(breweries)[1] <- "brewery_id"
# Trim whitespaces
breweries$state <- trimws(breweries$state)
breweries$name <- trimws(breweries$name)
beers$style <- trimws(beers$style)
Initial data review/summary - means, median, min/max, a few rows of data…
describe(beers)
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
##            vars    n    mean     sd  median trimmed    mad min     max
## X             1 2410 1204.50 695.85 1204.50 1204.50 893.27 0.0 2409.00
## abv           2 2348    0.06   0.01    0.06    0.06   0.01 0.0    0.13
## ibu           3 1405   42.71  25.95   35.00   40.14  25.20 4.0  138.00
## id            4 2410 1431.11 752.46 1453.50 1446.41 934.78 1.0 2692.00
## name*         5 2410 1152.36 663.06 1162.50 1152.63 851.01 1.0 2305.00
## style*        6 2410     NaN     NA      NA     NaN     NA Inf    -Inf
## brewery_id    7 2410  231.75 157.69  205.00  224.32 194.22 0.0  557.00
## ounces        8 2410   13.59   2.35   12.00   13.33   0.00 8.4   32.00
##              range  skew kurtosis    se
## X          2409.00  0.00    -1.20 14.17
## abv           0.13  0.96     1.14  0.00
## ibu         134.00  0.79    -0.14  0.69
## id         2691.00 -0.12    -1.09 15.33
## name*      2304.00 -0.01    -1.19 13.51
## style*        -Inf    NA       NA    NA
## brewery_id  557.00  0.31    -1.09  3.21
## ounces       23.60  2.04     9.01  0.05
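The warnings and the NaN row for style* most likely trace back to the earlier trimws() call: trimws() returns a character vector, so style is no longer a factor by the time describe() sees it, while name apparently remains a factor and is summarised through its integer codes. The following is a minimal sketch of that effect in base R, under the assumption that the columns were read in as factors; the data frame `toy` is made up for illustration and is not part of the data set.

```r
# Hypothetical stand-in for `beers`; stringsAsFactors = TRUE mimics
# the read.csv() default in R versions before 4.0
toy <- data.frame(
  abv   = c(0.050, 0.066, NA),
  style = c(" American IPA", "American IPA ", "Oatmeal Stout"),
  stringsAsFactors = TRUE
)

class(toy$style)                 # "factor", with 3 distinct untrimmed levels

# trimws() works on the labels and hands back a character vector
trimmed <- trimws(toy$style)
class(trimmed)                   # "character"

# Converting back to a factor keeps the column usable in numeric summaries
toy$style <- factor(trimmed)
nlevels(toy$style)               # 2 levels once the whitespace is gone

# Alternatively, summarise only the numeric columns
num.cols <- vapply(toy, is.numeric, logical(1))
summary(toy[, num.cols, drop = FALSE])
```

Either option (re-wrapping the trimmed column in factor(), or passing only numeric columns to the summary function) should avoid the coercion warnings.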
head(beers)
##   X   abv ibu   id                name                          style
## 1 0 0.050  NA 1436            Pub Beer            American Pale Lager
## 2 1 0.066  NA 2265         Devil's Cup        American Pale Ale (APA)
## 3 2 0.071  NA 2264 Rise of the Phoenix                   American IPA
## 4 3 0.090  NA 2263            Sinister American Double / Imperial IPA
## 5 4 0.075  NA 2262       Sex and Candy                   American IPA
## 6 5 0.077  NA 2261        Black Exodus                  Oatmeal Stout
##   brewery_id ounces
## 1        408     12
## 2        177     12
## 3        177     12
## 4        177     12
## 5        177     12
## 6        177     12
head(breweries)
##   brewery_id                      name          city state
## 1          0         NorthGate Brewing   Minneapolis    MN
## 2          1 Against the Grain Brewery    Louisville    KY
## 3          2  Jack's Abby Craft Lagers    Framingham    MA
## 4          3 Mike Hess Brewing Company     San Diego    CA
## 5          4   Fort Point Beer Company San Francisco    CA
## 6          5     COAST Brewing Company    Charleston    SC
I would like to look into IBU values. IBU stands for International Bitterness Unit and is used to quantify the bitterness of beer. The higher the number, the more bitter the beer.
boxplot(beers$ibu)
IBU values over 100 are somewhat useless, since a drinker cannot perceive them. The boxplot shows that half of the beers with listed IBU values fall between 21 and 64, a very usable range. There are, of course, a few outliers above the upper fence of 128.5 (Q3 + 1.5 × IQR = 64 + 1.5 × 43). Such numbers are a great marketing tool, but hardly usable.
Check out the histogram as well…
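The fence arithmetic can be reproduced directly. The sketch below uses a small helper, `tukey_fences` (a name introduced here for illustration), applied to the quartiles quoted in the text, and then shows how base R's boxplot.stats() reports the same kind of information on raw data. The vector `toy.ibu` is made up and does not come from the data set.

```r
# Tukey fences from the first and third quartiles (k = 1.5 is the usual choice)
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles quoted in the text
tukey_fences(21, 64)             # lower = -43.5, upper = 128.5

# On raw data, boxplot.stats() reports the hinges and the flagged outliers.
# `toy.ibu` is hypothetical; missing values are ignored.
toy.ibu <- c(12, 21, 30, 35, 50, 64, 80, 150, NA)
stats <- boxplot.stats(toy.ibu)
stats$stats                      # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                        # points beyond the fences
```

Note that boxplot() uses hinges rather than quantile() quartiles, so the two can differ slightly on small samples.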
ggplot(beers) + geom_histogram(aes(x = ibu), binwidth = 5, na.rm = TRUE)
Get a list of the 20 most bitter styles (including the average ABV for those styles)…
style.ibu <- aggregate(cbind(ibu, abv) ~ style, beers, mean)
style.ibu <- style.ibu[order(-style.ibu$ibu), ]
style.ibu[1:20, ]
## style ibu abv
## 7 American Barleywine 96.00000 0.09900000
## 12 American Double / Imperial IPA 93.32000 0.08769333
## 80 Russian Imperial Stout 86.50000 0.09950000
## 13 American Double / Imperial Pilsner 85.00000 0.07500000
## 30 Belgian Strong Dark Ale 72.00000 0.09200000
## 8 American Black Ale 68.90000 0.07315000
## 16 American IPA 67.63455 0.06480731
## 43 English Barleywine 66.66667 0.10766667
## 50 English Stout 66.00000 0.08000000
## 23 American Strong Ale 65.41667 0.07608333
## 15 American India Pale Lager 63.33333 0.06266667
## 14 American Double / Imperial Stout 62.00000 0.09666667
## 28 Belgian IPA 57.00000 0.07100000
## 47 English India Pale Ale (IPA) 54.71429 0.06214286
## 26 Baltic Porter 54.00000 0.09366667
## 51 English Strong Ale 54.00000 0.08233333
## 81 Rye Beer 52.00000 0.06611111
## 24 American White IPA 48.83333 0.06200000
## 54 Extra Special / Strong Bitter (ESB) 45.71429 0.05750000
## 17 American Pale Ale (APA) 44.94118 0.05497386
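One detail worth keeping in mind when reading this table: the formula interface of aggregate() drops every row that has an NA in any variable of the formula (its default is na.action = na.omit). The ABV averages above are therefore computed only over beers that also list an IBU. A minimal sketch with made-up data (`toy` is hypothetical) contrasts this with averaging each column over its own non-missing values:

```r
# Hypothetical data: style A has one beer with ABV but no IBU
toy <- data.frame(
  style = c("A", "A", "B", "B"),
  ibu   = c(60, NA, 20, 30),
  abv   = c(0.06, 0.08, 0.05, 0.05)
)

# Formula interface: the row with the missing IBU is dropped entirely,
# so the ABV mean for style A uses only the first beer
complete.rows <- aggregate(cbind(ibu, abv) ~ style, toy, mean)
complete.rows

# Data-frame interface with na.rm = TRUE: each column is averaged
# over its own non-missing values
per.column <- aggregate(toy[, c("ibu", "abv")],
                        by = list(style = toy$style),
                        FUN = mean, na.rm = TRUE)
per.column
```

Which variant is appropriate depends on the question; for a bitterness ranking, restricting to beers with a listed IBU is a defensible choice.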
Graph a scatterplot of bitterness (IBU) and strength (ABV)…
ggplot(beers, aes(x = ibu, y = abv)) + geom_point(na.rm = TRUE) + geom_smooth(method = "lm", se = FALSE, na.rm = TRUE)
The scatterplot is fairly spread out. There is a weak dependency, if any, between the two variables.
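A visual impression like this can be backed by a number. The sketch below shows one way to do it in base R: a Pearson correlation restricted to complete pairs, plus the R-squared of the same linear fit that geom_smooth(method = lm) draws. The data frame `toy` is made up for illustration; the values it produces say nothing about the real data set.

```r
# Hypothetical data with missing values in both columns
toy <- data.frame(
  ibu = c(10, 20, NA, 40, 65, 90),
  abv = c(0.045, 0.050, 0.060, NA, 0.070, 0.085)
)

# Pearson correlation using only rows where both values are present
r <- cor(toy$ibu, toy$abv, use = "complete.obs")
r

# The same relationship as a linear model; lm() drops incomplete rows by default
fit <- lm(abv ~ ibu, data = toy)
r.squared <- summary(fit)$r.squared
r.squared                        # equals r^2 for a one-predictor model
```

Running the same two calls on beers$ibu and beers$abv would give a concrete measure of how weak or strong the dependency is.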
I would also like to consider the state the beer is brewed in and see whether any states favor more bitter beer than others. Merge the beers and breweries data sets and display a histogram for each state.
loc.beers <- merge(beers, breweries, by = "brewery_id")
names(loc.beers)[names(loc.beers) == "name.x"] <- "beer_name"
names(loc.beers)[names(loc.beers) == "name.y"] <- "brewery_name"
ggplot(loc.beers, aes(x = ibu)) + geom_histogram(binwidth = 5, na.rm=TRUE) + facet_wrap(~state)
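Faceted histograms are easier to compare when paired with a per-state summary table. The following is a sketch of one way to build it in base R, assuming a merged data frame with `state` and `ibu` columns like loc.beers; the data frame `toy` and its values are made up.

```r
# Hypothetical stand-in for the merged beers/breweries data
toy <- data.frame(
  state = c("CO", "CO", "CO", "IN", "IN", "CA"),
  ibu   = c(65, 85, NA, 20, 30, 70)
)

# Number of beers with a listed IBU and their median IBU, per state.
# The formula interface drops rows with a missing IBU.
n.ibu   <- aggregate(ibu ~ state, toy, length)
med.ibu <- aggregate(ibu ~ state, toy, median)

by.state <- merge(n.ibu, med.ibu, by = "state", suffixes = c(".n", ".median"))

# Most bitter states first
by.state[order(-by.state$ibu.median), ]
```

Showing the count next to the median helps avoid over-reading states that have only a handful of beers with a listed IBU.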
A few interesting findings based on the state histograms:
I think it is worthwhile to look into two states with a large number of beers and spikes on opposite sides of the bitterness scale: Colorado and Indiana. Prepare a subset for each state.
# Subsets for Colorado and Indiana beers
co.beers <- loc.beers[loc.beers$state == "CO", ]
in.beers <- loc.beers[loc.beers$state == "IN", ]
# Counts of beers for each style
co.sums <- as.data.frame(table(co.beers$style))
in.sums <- as.data.frame(table(in.beers$style))
# Merge style counts for both states, adjust column names and set NAs to 0
stl.cnt <- merge(co.sums[co.sums$Freq > 4, ], in.sums[in.sums$Freq > 4, ], by = "Var1", all = TRUE)
colnames(stl.cnt) <- c("style", "CO.count", "IN.count")
stl.cnt[is.na(stl.cnt)] <- 0
# Melt the data for plotting
stl.cnt.m <- melt(stl.cnt, id.vars='style')
levels(stl.cnt.m$variable)[levels(stl.cnt.m$variable) == "CO.count"] <- "Colorado"
levels(stl.cnt.m$variable)[levels(stl.cnt.m$variable) == "IN.count"] <- "Indiana"
ggplot(stl.cnt.m, aes(style, value, fill = variable)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.title = element_blank()) +
  ggtitle("Number of Beers per Style (for Styles with at Least 5 Beers in CO or IN)") +
  labs(x = "Beer Style", y = "Number of Beers")
Although the American Pale Ale (APA) category appears to show a significant discrepancy between the two states, the average IBU for APAs is only 44.9, so that gap says little about bitterness. More significant are the discrepancies in IPAs (average IBU of 67.6) and Imperial IPAs (average IBU of 93.3). I want to see which breweries release these styles.
co.ipas <- as.data.frame(table(co.beers[co.beers$style == "American IPA" | co.beers$style == "American Double / Imperial IPA", ]$brewery_name))
colnames(co.ipas) <- c("Brewery", "No of IPAs")
co.ipas[order(-co.ipas[,2]), ]
## Brewery No of IPAs
## 16 Oskar Blues Brewery 14
## 15 New Belgium Brewing Company 4
## 5 Bonfire Brewing Company 3
## 14 Great Divide Brewing Company 3
## 19 Renegade Brewing Company 3
## 21 Ska Brewing Company 3
## 3 Avery Brewing Company 2
## 7 Dad & Dude's Breweria 2
## 11 Eddyline Brewery & Restaurant 2
## 26 Upslope Brewing Company 2
## 1 Asher Brewing Company 1
## 2 Aspen Brewing Company 1
## 4 Big Choice Brewing 1
## 6 Breckenridge Brewery 1
## 8 Denver Beer Company 1
## 9 Dolores River Brewery 1
## 10 Dry Dock Brewing Company 1
## 12 Epic Brewing 1
## 13 Fate Brewing Company 1
## 17 Palisade Brewing Company 1
## 18 Pikes Peak Brewing Company 1
## 20 Silverton Brewery 1
## 22 Steamworks Brewing Company 1
## 23 Telluride Brewing Company 1
## 24 Tommyknocker Brewery 1
## 25 Twisted Pine Brewing Company 1
in.ipas <- as.data.frame(table(in.beers[in.beers$style == "American IPA" | in.beers$style == "American Double / Imperial IPA", ]$brewery_name))
colnames(in.ipas) <- c("Brewery", "No of IPAs")
in.ipas[order(-in.ipas[,2]), ]
## Brewery No of IPAs
## 11 Sun King Brewing Company 7
## 1 18th Street Brewery 4
## 2 450 North Brewing Company 3
## 4 Burn 'Em Brewing 3
## 8 Four Fathers Brewing 2
## 9 Great Crescent Brewery 2
## 10 People's Brewing Company 2
## 12 Tin Man Brewing Company 2
## 13 Upland Brewing Company 2
## 3 Bare Hands Brewery 1
## 5 Cutters Brewing Company 1
## 6 Daredevil Brewing Company 1
## 7 Flat 12 Bierwerks 1
I wanted to see whether a single Colorado brewery with a tendency to release strong, bitter beers is responsible for the spike in bitter beer across the state. Colorado brewers produce 54 IPAs or Double IPAs (DIPAs); Indiana brewers produce 31. Oskar Blues is responsible for a significant portion of Colorado’s bitter beers (14 of 54, or about 26%), but I don’t think that is enough to explain the spike. A significant number of Colorado breweries release at least one bitter beer, and I can only conclude that Colorado enjoys bitter beer more than other states do. As an interesting side note, Oskar Blues was the first craft brewery to can its beer, back in 2002. They have never released any beers in bottles.
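The arithmetic behind this reasoning can be checked with a few lines of base R. The counts below are transcribed from the two brewery tables above; the variable names are introduced here for illustration only.

```r
# IPA / DIPA counts per brewery, transcribed from the tables above
co.counts <- c(14, 4, 3, 3, 3, 3, 2, 2, 2, 2, rep(1, 16))   # Colorado
in.counts <- c(7, 4, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1)       # Indiana

sum(co.counts)                   # 54 beers from 26 breweries
sum(in.counts)                   # 31 beers from 13 breweries

# Share of each state's IPAs released by its most prolific brewery
co.top.share <- max(co.counts) / sum(co.counts)
in.top.share <- max(in.counts) / sum(in.counts)
round(c(CO = co.top.share, IN = in.top.share), 2)   # CO 0.26, IN 0.23

# Colorado's total with its top brewery (Oskar Blues) left out
co.without.top <- sum(co.counts) - max(co.counts)
co.without.top                   # 40, still more than Indiana's 31
```

The top brewery's share is similar in both states, and Colorado still lists more IPAs than Indiana with Oskar Blues removed, which is in line with the reading that one brewery alone does not account for the difference.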
I have looked into the numbers related to canned craft beers in the United States. Specifically, I concentrated on the bitterness of these beers, and then narrowed the focus further to the bitter beers of Colorado. After this brief, initial analysis, it does appear that Colorado has a taste for bitter beers. Of course, the data set covers only canned beers. Although canning is on the rise, the majority of craft beer is still bottled. It is possible that the “pioneers” of the craft beer world tend both to brew bitter beer and to can their products. Further analysis using data on all craft beer should be performed. Since craft brewing is a hobby of mine, it was interesting to see that Colorado has a disproportionate number of bitter beers, since California is traditionally considered the purveyor of hoppy beers.