A craft brewery in Buffalo, New York. (Photo: Andre Carrotflower, Wikimedia Commons, CC BY-SA 4.0)
Craft beer is very popular in the United States. But the breweries are not spread out evenly. Some cities (like San Diego or Portland) have a lot of them, and many rural areas have almost none. I wanted to study this pattern using the spatial machine learning tools from this course.
Here I use a big dataset of 5,394 craft breweries in the United States
The main question is simple:
Where do breweries cluster, why do they cluster there, and where is the map still “empty” even though the conditions look good?
The questions I try to answer:
The last point, the “under-served” market, is the part I find most interesting.
I use three data sources and join them together.
| Layer | Source | What it gives |
|---|---|---|
| Breweries (points) | Open Brewery DB (API) | 5,394 operating breweries with coordinates |
| County borders | TIGER / tigris |
the polygons for the analysis, in Albers CRS (EPSG:5070) |
| Socio-economic data | ACS 5-year (2018–22), tidycensus |
income, education, age, share of young adults, unemployment |
Open Brewery DB has 8,207 breweries in the US. Only about 72% of them have real coordinates, which is one weak point of the data. I keep the contiguous US, remove the breweries that are “planning” or “closed”, and I get 5,394 breweries. I assign every brewery to a county, so in total I work with 3,109 counties. I also group the brewery types into three roles: consumption (brewpub, taproom), small production (micro, nano) and large production (regional, large).
Number of breweries in each county. We can already see the pattern: California, the Pacific Northwest, Colorado, the Upper Midwest and the Northeast have the most.
First I check if location really matters. I build neighbour weights (queen contiguity and k-nearest neighbours), and I also make “neighbour” variables, for example the average income of the neighbour counties. Then I group counties into urban / intermediate / rural, like a simple version of DEGURBA.
deg <- modelsp %>% st_drop_geometry() %>%
group_by(degurba) %>%
summarise(counties = n(), breweries = sum(n_brew),
`mean per county` = round(mean(n_brew), 2),
`mean per 100k people` = round(mean(brew_per_100k, na.rm = TRUE), 2),
.groups = "drop")
kab(deg, caption = "Breweries by urban / rural class")| degurba | counties | breweries | mean per county | mean per 100k people |
|---|---|---|---|---|
| rural | 1715 | 688 | 0.40 | 2.25 |
| intermediate | 1062 | 1613 | 1.52 | 1.49 |
| urban | 332 | 3093 | 9.32 | 1.96 |
Urban counties have most of the breweries in total. But if we look per person, the rural counties are actually the highest (2.25 per 100k, vs 1.96 in urban and 1.49 in intermediate). So small towns also like their local brewery. This is a small surprise.
lw <- nb2listw(poly2nb(modelsp, queen = TRUE), style = "W", zero.policy = TRUE)
v <- modelsp$brew_per_100k; v[is.na(v)] <- mean(v, na.rm = TRUE)
mi_n <- moran.test(modelsp$n_brew, lw, zero.policy = TRUE)$estimate[1]
mi_pc <- moran.test(v, lw, zero.policy = TRUE)$estimate[1]
data.frame(variable = c("brewery count", "breweries per 100k"),
`Moran's I` = round(c(mi_n, mi_pc), 3), check.names = FALSE) |> kab()| variable | Moran’s I |
|---|---|
| brewery count | 0.304 |
| breweries per 100k | 0.189 |
Moran’s I is positive and significant (p is almost 0). This means a county with many breweries usually has neighbours with many breweries too. So the location is not random, and using spatial methods makes sense here.
I measure clustering in three ways.
ce <- as.data.frame(aggl$ce_by) %>%
transmute(role = type_group, n,
`observed NN (km)` = round(obs_nn_km, 1),
`expected NN (km)` = round(exp_nn_km, 1),
`Clark-Evans R` = round(R, 3))
kab(ce, caption = "Clark-Evans index (R below 1 means clustering)")| role | n | observed NN (km) | expected NN (km) | Clark-Evans R |
|---|---|---|---|---|
| consumption | 1897 | 16.1 | 32.1 | 0.501 |
| large_prod | 230 | 39.8 | 92.3 | 0.432 |
| small_prod | 3266 | 10.8 | 24.5 | 0.440 |
The Clark-Evans index for all breweries is R = 0.41. The real average distance to the nearest brewery is about 7.9 km, but under a random pattern it would be about 19 km. So breweries are around 2.4 times closer than random. The SPAG-style coverage index gives the same message: equal circles around the breweries cover only 25% of the country.
One detail is interesting: the production breweries cluster a bit more (R ≈ 0.44) than the consumption ones (R ≈ 0.50). Production follows the beer regions, and consumption follows the people.
Ripley’s L stays above zero at every distance, so breweries are clustered at all scales.
I also compare the two main roles with kernel density. The correlation between the consumption map and the production map is 0.93, so they are mostly in the same places, only production is a little more concentrated.
Two of the biggest hubs in my data: Mission Brewery in San Diego, the number-one cluster with 380 breweries (left), and Deschutes Brewery in Portland, Oregon (right). (Photos: Wikimedia Commons, CC BY-SA 3.0)
To find the clusters of points I use DBSCAN (25 km, 10 breweries). It finds 88 hubs, and HDBSCAN gives almost the same (89). About 35% of breweries are “isolated”, which is again the rural part.
kab(head(as.data.frame(hubs), 10) %>%
transmute(hub, n, state, `anchor city` = anchor_city,
`% consumption` = pct_consumption, `% small prod` = pct_small_prod),
caption = "The ten biggest beer hubs")| hub | n | state | anchor city | % consumption | % small prod |
|---|---|---|---|---|---|
| 3 | 380 | California | San Diego | 26 | 68 |
| 19 | 199 | California | San Francisco | 39 | 54 |
| 16 | 194 | Washington | Seattle | 32 | 64 |
| 5 | 174 | Colorado | Denver | 30 | 63 |
| 10 | 148 | Maryland | Baltimore | 27 | 70 |
| 2 | 147 | Oregon | Portland | 39 | 57 |
| 14 | 119 | Illinois | Chicago | 39 | 57 |
| 40 | 108 | Massachusetts | Boston | 26 | 63 |
| 46 | 80 | Pennsylvania | Philadelphia | 48 | 48 |
| 44 | 79 | New York | Brooklyn | 23 | 71 |
Beer hubs from DBSCAN. Grey points are the isolated breweries.
The hubs are the famous beer cities: San Diego (380 breweries), the Bay Area, Seattle, Denver, Portland, Chicago, Boston. They are also a bit different from each other. San Diego is more about production (68% micro), but Philadelphia and Detroit have more brewpubs, so they are more about drinking on the spot.
Now I try to predict the number of breweries in a county. I use
Random Forest (ranger) and a small
neural network (nnet). I compare two
feature sets: one with only normal variables (aspatial), and one that
also adds spatial variables (coordinates, neighbour values, distance to
the nearest brewery). I also use two kinds of cross-validation: normal
random folds, and spatial folds (blocks in space).
cv <- readRDS(P("supervised_results.rds"))$cv
names(cv) <- c("model", "random CV", "spatial CV")
kab(cv, caption = "Cross-validation error (RMSE, brewery-count scale)")| model | random CV | spatial CV |
|---|---|---|
| RF aspatial | 4.155 | 4.425 |
| RF spatial | 3.997 | 4.314 |
| ANN spatial | 5.138 | 5.796 |
I notice a few things here. The spatial features make the RF a little better (3.997 vs 4.155). The spatial cross-validation gives higher error than the random one, which is normal: random folds are too easy because the test points sit next to the training points. And the Random Forest is clearly better than the simple neural network here. The final RF has OOB R² = 0.84, which is quite good.
impdf <- data.frame(feature = names(imp), importance = as.numeric(imp)) |>
head(10) |> arrange(importance)
ggplot(impdf, aes(reorder(feature, importance), importance)) +
geom_col(fill = "#b2182b") + coord_flip() +
labs(x = NULL, y = "permutation importance") + theme_minimal(base_size = 12)Variable importance from the Random Forest. The distance to the nearest brewery is more important than population.
The most important variable is the distance to the nearest brewery, even more than population. I think this is the clearest result of the project: breweries appear where there are already other breweries. This is what economists call agglomeration.
When I use counties as the unit, the result can depend on the shape and size of the units. This is called the Modifiable Areal Unit Problem (MAUP). To check it, I run the same Random Forest again, but on regular square grids (50 km and 100 km) instead of counties. I move the breweries and the socio-economic data onto the grid cells by area weighting.
maup <- readRDS(P("grid_maup.rds"))
kab(maup, caption = "The same model on counties vs. two grid sizes")| resolution | cells | RMSE aspatial (spatial CV) | RMSE spatial (spatial CV) | OOB R2 | top predictor |
|---|---|---|---|---|---|
| county | 3109 | 4.425 | 4.314 | 0.838 | dist_nearest_brew_km |
| 50 km | 3373 | 4.270 | 3.726 | 0.831 | dist_nearest_brew_km |
| 100 km | 896 | 9.388 | 9.014 | 0.831 | dist_nearest_brew_km |
The story does not change. For every spatial unit the OOB R² stays around 0.83, the spatial features are always better than the aspatial ones, and the distance to the nearest brewery is always the most important variable. So my main results are robust to MAUP. The RMSE looks bigger for the 100 km grid, but this is only because a big cell contains more breweries, so the counts (and the errors) are larger.
Breweries per 50 km grid cell. The same pattern as the county map, but on a regular grid (class 7).
In this part I make map surfaces. The first one is a kernel density of the brewery points. The second one uses kriging to take the RF prediction from the county points and make a smooth surface for the whole country.
A. Kernel density of brewery locations.
B. RF prediction, kriged from points to a smooth surface.
I also tried to krige the RF residuals (the part the model cannot explain). But the variogram is almost only nugget (54.9 against a small structure of 0.65). This actually means a good thing: the Random Forest already used the spatial information, so what is left has almost no spatial pattern. Because of this, it is better to look at the “opportunity” at the county level, not as a smooth surface.
Here I define gap = real number − expected number of breweries in a county.
Opportunity gap by county. Red = more breweries than expected (saturated). Blue = fewer than expected (under-served).
us <- pred %>% st_drop_geometry() %>% arrange(opportunity_gap) %>%
transmute(county, state = STUSPS, actual = n_brew,
expected = round(pred_n_brew, 1), gap = round(opportunity_gap, 1)) %>% head(6)
sat <- pred %>% st_drop_geometry() %>% arrange(desc(opportunity_gap)) %>%
transmute(county, state = STUSPS, actual = n_brew,
expected = round(pred_n_brew, 1), gap = round(opportunity_gap, 1)) %>% head(6)
kab(us, caption = "Most UNDER-served counties (much fewer breweries than expected)")| county | state | actual | expected | gap |
|---|---|---|---|---|
| Utah | UT | 1 | 10.5 | -9.5 |
| DeKalb | GA | 4 | 10.9 | -6.9 |
| New York | NY | 2 | 8.6 | -6.6 |
| Hudson | NJ | 1 | 7.4 | -6.4 |
| Denton | TX | 4 | 10.1 | -6.1 |
| Gwinnett | GA | 1 | 6.9 | -5.9 |
| county | state | actual | expected | gap |
|---|---|---|---|---|
| San Diego | CA | 148 | 29.0 | 119.0 |
| Los Angeles | CA | 118 | 37.2 | 80.8 |
| King | WA | 88 | 26.4 | 61.6 |
| Multnomah | OR | 81 | 22.4 | 58.6 |
| Denver | CO | 60 | 15.1 | 44.9 |
| Cook | IL | 64 | 20.5 | 43.5 |
I did not tell the model anything about laws or rent, but the results still make sense. The most under-served places are Utah County and Davis County in Utah, which have strict alcohol rules, and also Manhattan and Jersey City, where the rent and the licences are very expensive. The most saturated places are exactly the big beer cities (San Diego, Los Angeles, Seattle, Portland, Chicago, Denver). They have even more breweries than the model expects.
Finally I use association rules. I cut the county variables and also the neighbour variables into levels, then I look for rules. Adding the neighbour items is what makes the rules “spatial”.
kab(head(arules::DATAFRAME(rules$high), 6),
caption = "Top rules for HIGH number of breweries (sorted by lift)")| LHS | RHS | support | confidence | coverage | lift | count | |
|---|---|---|---|---|---|---|---|
| 1 | {young=young_high,urb=urb_urban,nbr_brew=nbr_brew_high} | {brew=brew_high} | 0.0292699 | 0.5582822 | 0.0524284 | 8.550243 | 91 |
| 5 | {income=inc_high,young=young_high,urb=urb_urban} | {brew=brew_high} | 0.0279833 | 0.5370370 | 0.0521068 | 8.224868 | 87 |
| 2 | {age=age_young,urb=urb_urban,nbr_brew=nbr_brew_high} | {brew=brew_high} | 0.0257317 | 0.5194805 | 0.0495336 | 7.955985 | 80 |
| 4 | {educ=edu_high,young=young_high,urb=urb_urban} | {brew=brew_high} | 0.0360244 | 0.5185185 | 0.0694757 | 7.941252 | 112 |
| 3 | {young=young_high,urb=urb_urban,nbr_inc=nbr_inc_high} | {brew=brew_high} | 0.0305564 | 0.5135135 | 0.0595047 | 7.864599 | 95 |
A county with many breweries is usually young, urban,
educated and rich, and also has neighbours with many breweries and high
income (nbr_brew_high, nbr_inc_high
come back many times, with lift around 8). The opposite case:
kab(head(arules::DATAFRAME(rules$none), 4),
caption = "Top rules for NO breweries (sorted by lift)")| LHS | RHS | support | confidence | coverage | lift | count | |
|---|---|---|---|---|---|---|---|
| 99 | {educ=edu_low,nbr_brew=nbr_brew_low} | {brew=brew_none} | 0.1630749 | 0.9458955 | 0.1724027 | 1.526890 | 507 |
| 96 | {educ=edu_low,nbr_inc=nbr_inc_low} | {brew=brew_none} | 0.1852686 | 0.9335494 | 0.1984561 | 1.506960 | 576 |
| 88 | {income=inc_low,educ=edu_low} | {brew=brew_none} | 0.1984561 | 0.9195231 | 0.2158250 | 1.484318 | 617 |
| 100 | {educ=edu_low,urb=urb_rural} | {brew=brew_none} | 0.2277260 | 0.9170984 | 0.2483114 | 1.480404 | 708 |
A county with no brewery is usually low education, rural, and surrounded by poor neighbours with few breweries (confidence about 0.92–0.95). So the neighbours matter in both directions.
Hops growing on an old factory wall in Warsaw, the home city of this course. (Photo: Panek, Wikimedia Commons, CC BY-SA 4.0)
Conclusion. The US craft beer industry is strongly clustered (Clark-Evans R = 0.41, positive Moran’s I). And the clustering itself is the best predictor of where breweries are, which fits the idea of agglomeration. Spatial features and spatial cross-validation are useful to get a fair, not too optimistic, error. The opportunity gap is the most practical part: it can find places that have fewer breweries than their conditions suggest, and it also catches real things like alcohol laws and high rent without being told.
Limitations. (1) Open Brewery DB is made by volunteers, and only 72% of US breweries have coordinates, so the coverage can be biased. (2) I work at county level, so there is the Modifiable Areal Unit Problem (MAUP); a grid version would be a good robustness check. (3) My DEGURBA is only an approximation at county level, not the real grid version. (4) The kriged surface should be read in a general way, because the country has big empty areas. (5) These are correlations, not causal effects.
Data and image sources. Breweries: Open Brewery DB. Socio-economic data: U.S. Census Bureau, ACS 5-year (2018–22) and TIGER/Line. Photos from Wikimedia Commons (CC BY-SA 4.0): brewery in Buffalo (Andre Carrotflower), beer flight (Gerry Dincher), hops in Warsaw (Panek).
Note on tools. I used an AI assistant (Claude) to help me write and debug the R code. I chose the topic, the data and the methods, I ran all the analysis myself, and I checked the results and the numbers. Any mistakes that are left are my own.
References.