For my data analysis project, I decided to look at this beer dataset and see if I could find any interesting patterns using cluster analysis. I’m curious to see if beers naturally group together based on things like how strong they are (ABV) and how bitter they taste (IBU). This could be really useful for understanding what kinds of beers are similar to each other, which might help me find new beers to try based on ones I already like.
## 'data.frame': 199 obs. of 8 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ abv : num 0.05 0.066 0.071 0.09 0.075 0.077 0.045 0.065 0.055 0.086 ...
## $ ibu : int NA NA NA NA NA NA NA NA NA NA ...
## $ id : int 1436 2265 2264 2263 2262 2261 2260 2259 2258 2131 ...
## $ name : chr "Pub Beer" "Devil's Cup" "Rise of the Phoenix" "Sinister" ...
## $ style : chr "American Pale Lager" "American Pale Ale (APA)" "American IPA" "American Double / Imperial IPA" ...
## $ brewery_id: int 408 177 177 177 177 177 177 177 177 177 ...
## $ ounces : num 12 12 12 12 12 12 12 12 12 12 ...
## X abv ibu id name style
## 1 0 0.050 NA 1436 Pub Beer American Pale Lager
## 2 1 0.066 NA 2265 Devil's Cup American Pale Ale (APA)
## 3 2 0.071 NA 2264 Rise of the Phoenix American IPA
## 4 3 0.090 NA 2263 Sinister American Double / Imperial IPA
## 5 4 0.075 NA 2262 Sex and Candy American IPA
## 6 5 0.077 NA 2261 Black Exodus Oatmeal Stout
## brewery_id ounces
## 1 408 12
## 2 177 12
## 3 177 12
## 4 177 12
## 5 177 12
## 6 177 12
## abv ibu ounces
## Min. :0.0320 Min. : 4.00 Min. : 8.40
## 1st Qu.:0.0500 1st Qu.: 18.00 1st Qu.:12.00
## Median :0.0590 Median : 35.00 Median :12.00
## Mean :0.0607 Mean : 39.82 Mean :13.05
## 3rd Qu.:0.0700 3rd Qu.: 60.00 3rd Qu.:16.00
## Max. :0.1250 Max. :138.00 Max. :16.00
## NA's :78
## X abv ibu id name style brewery_id
## 0 0 78 0 0 0 0
## ounces
## 0
# Fix data types and remove unnecessary index column
beers$abv <- as.numeric(as.character(beers$abv))
beers$ibu <- as.numeric(as.character(beers$ibu))
beers$ounces <- as.numeric(as.character(beers$ounces))
beers <- beers %>% select(-X)
I’ve found that the IBU column has a lot of missing values. Let’s look at the distribution of what we do have:
# Distribution of non-missing IBU values
ggplot(beers %>% filter(!is.na(ibu)), aes(x = ibu)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "Distribution of IBU Values",
x = "International Bitterness Units (IBU)",
y = "Count") +
theme_minimal()
I need to handle these missing values to do my clustering:
# Calculate median IBU by style
style_ibu_median <- beers %>%
filter(!is.na(ibu)) %>%
group_by(style) %>%
summarize(median_ibu = median(ibu, na.rm = TRUE))
# Function to fill in missing IBU values
impute_ibu <- function(beer_df) {
df <- beer_df
for (i in 1:nrow(df)) {
if (is.na(df$ibu[i])) {
# Try to use the style's median IBU
style_median <- style_ibu_median$median_ibu[style_ibu_median$style == df$style[i]]
if (length(style_median) > 0 && !is.na(style_median)) {
df$ibu[i] <- style_median
} else {
# Otherwise use overall median
df$ibu[i] <- median(df$ibu, na.rm = TRUE)
}
}
}
return(df)
}
beers_imputed <- impute_ibu(beers)
sum(is.na(beers_imputed$ibu)) # Check if we fixed all missing values
## [1] 0
Now I’ll prepare the data for clustering:
Let’s see if there are any interesting patterns before we do clustering:
# ABV vs IBU scatterplot
ggplot(beers_imputed, aes(x = abv, y = ibu)) +
geom_point(alpha = 0.6) +
labs(title = "Relationship between ABV and IBU",
x = "Alcohol by Volume (ABV)",
y = "International Bitterness Units (IBU)") +
theme_minimal()
# Correlation matrix
cor_matrix <- cor(cluster_data)
corrplot(cor_matrix, method = "circle", type = "upper",
tl.col = "black", addCoef.col = "black")
Let’s also look at the most common beer styles:
# Count beers per style
style_counts <- beers %>%
count(style) %>%
arrange(desc(n))
# Top 10 styles
top_styles <- style_counts %>% head(10)
ggplot(top_styles, aes(x = reorder(style, n), y = n)) +
geom_bar(stat = "identity", fill = "darkgoldenrod1") +
coord_flip() +
labs(title = "Top 10 Beer Styles", x = "Style", y = "Count") +
theme_minimal()
To figure out how many clusters to use, I’ll try the elbow method:
# Elbow method
set.seed(123)
wss <- sapply(1:10, function(k) {
kmeans(cluster_data_scaled, centers = k, nstart = 25)$tot.withinss
})
# Plot elbow method
ggplot(data.frame(k = 1:10, wss = wss), aes(x = k, y = wss)) +
geom_line() +
geom_point(size = 2) +
labs(title = "Elbow Method for Optimal k",
x = "Number of Clusters (k)",
y = "Total Within-Cluster Sum of Squares") +
theme_minimal()
The elbow looks like it might be at k=4, so I’ll use that for my analysis.
set.seed(123)
k <- 4
km <- kmeans(cluster_data_scaled, centers = k, nstart = 25)
# Add cluster assignments to our data
beers_clustered <- beers_imputed %>%
select(name, abv, ibu, style, brewery_id, ounces) %>%
mutate(cluster = km$cluster)
# Convert scaled centers back to original scale
centers_unscaled <- as.data.frame(km$centers)
for (i in 1:ncol(cluster_data)) {
centers_unscaled[, i] <- centers_unscaled[, i] * sd(cluster_data[, i]) + mean(cluster_data[, i])
}
# Look at cluster centers
centers_unscaled
## abv ibu ounces
## 1 0.05190323 26.90323 16.000
## 2 0.07322917 68.14583 11.925
## 3 0.05410204 25.28061 12.000
## 4 0.07518182 63.27273 16.000
##
## 1 2 3 4
## 31 48 98 22
# PCA for 2D visualization
pca_result <- prcomp(cluster_data_scaled)
pca_data <- as.data.frame(pca_result$x[, 1:2])
pca_data$cluster <- factor(km$cluster)
# Plot clusters
ggplot(pca_data, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.7) +
stat_ellipse(level = 0.95) +
scale_color_brewer(palette = "Set1") +
labs(title = "Beer Clusters on Principal Components") +
theme_minimal()
# ABV vs IBU colored by cluster
ggplot(beers_clustered, aes(x = abv, y = ibu, color = factor(cluster))) +
geom_point(alpha = 0.7) +
scale_color_brewer(palette = "Set1", name = "Cluster") +
labs(title = "Beer Clusters by ABV and IBU",
x = "Alcohol by Volume (ABV)",
y = "International Bitterness Units (IBU)") +
theme_minimal()
Let’s see what makes each cluster different:
# Average values for each cluster
cluster_profiles <- beers_clustered %>%
group_by(cluster) %>%
summarize(
count = n(),
avg_abv = mean(abv, na.rm = TRUE),
avg_ibu = mean(ibu, na.rm = TRUE),
avg_ounces = mean(ounces, na.rm = TRUE)
)
cluster_profiles
## # A tibble: 4 × 5
## cluster count avg_abv avg_ibu avg_ounces
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 31 0.0519 26.9 16
## 2 2 48 0.0732 68.1 11.9
## 3 3 98 0.0541 25.3 12
## 4 4 22 0.0752 63.3 16
# Most common styles in each cluster
beers_clustered %>%
group_by(cluster, style) %>%
summarize(count = n()) %>%
arrange(cluster, desc(count)) %>%
group_by(cluster) %>%
slice_head(n = 3)
## # A tibble: 12 × 3
## # Groups: cluster [4]
## cluster style count
## <int> <chr> <int>
## 1 1 American Pale Ale (APA) 4
## 2 1 Kölsch 3
## 3 1 American IPA 2
## 4 2 American IPA 26
## 5 2 American Double / Imperial IPA 7
## 6 2 American Black Ale 2
## 7 3 American Pale Ale (APA) 14
## 8 3 American Amber / Red Ale 9
## 9 3 American Blonde Ale 6
## 10 4 American IPA 12
## 11 4 American Double / Imperial IPA 4
## 12 4 American Amber / Red Ale 1
Based on my analysis, I can describe the four clusters as:
Cluster 1: Standard Beers - These have medium alcohol content (around 5.5%) and medium bitterness. Most come in 12oz cans, and include many pale ales and amber ales.
Cluster 2: Strong Beers - Higher alcohol (7-9%) and more bitter. This includes IPAs, Double IPAs, and other strong ales.
Cluster 3: Light Beers - Lower alcohol content (under 5%) and less bitter. These are your session beers, lagers, and blonde ales.
Cluster 4: Tallboy Beers - Defined mostly by the 16oz can format. These have varying alcohol and bitterness levels.
This cluster analysis helped me see that beers naturally group not just by taste characteristics, but also by packaging size. The relationship between alcohol content and bitterness is particularly interesting - as alcohol content goes up, bitterness often increases too, especially in the IPA family.
If I were to continue this project, I’d want to include more variables like color, price, and maybe even user ratings to see if the clusters change. ```