1 What this project is about

A craft brewery in Buffalo, New York. (Photo: Andre Carrotflower, Wikimedia Commons, CC BY-SA 4.0)

Craft beer is very popular in the United States. But the breweries are not spread out evenly. Some cities (like San Diego or Portland) have a lot of them, and many rural areas have almost none. I wanted to study this pattern using the spatial machine learning tools from this course.

Here I use a big dataset of 5,394 craft breweries in the contiguous United States.

The main question is simple:

Where do breweries cluster, why do they cluster there, and where is the map still “empty” even though the conditions look good?

The questions I try to answer:

  1. How strongly do breweries cluster in space?
  2. What factors explain where breweries are, and do spatial variables help the prediction?
  3. Can I turn the model predictions from points into a smooth map surface?
  4. Where are the markets that look “under-served”?
  5. What kind of place usually has many breweries?

The fourth question, the “under-served” market, is the part I find most interesting.

2 Data

I use three data sources and join them together.

| Layer | Source | What it gives |
| --- | --- | --- |
| Breweries (points) | Open Brewery DB (API) | 5,394 operating breweries with coordinates |
| County borders | TIGER / tigris | the polygons for the analysis, in Albers CRS (EPSG:5070) |
| Socio-economic data | ACS 5-year (2018–22), tidycensus | income, education, age, share of young adults, unemployment |

Open Brewery DB has 8,207 breweries in the US. Only about 72% of them have real coordinates, which is one weak point of the data. I keep the contiguous US, remove the breweries that are “planning” or “closed”, and I get 5,394 breweries. I assign every brewery to a county, so in total I work with 3,109 counties. I also group the brewery types into three roles: consumption (brewpub, taproom), small production (micro, nano) and large production (regional, large).
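
The filtering and grouping steps can be sketched in base R. This is a minimal sketch, not the code that built the real dataset: the toy rows are invented, the column names `brewery_type`, `latitude` and `longitude` follow the Open Brewery DB fields, and the role labels mirror the ones used later in this report.

```r
# Toy records (hypothetical values) in the shape of Open Brewery DB rows
raw <- data.frame(
  name         = c("A", "B", "C", "D", "E", "F"),
  brewery_type = c("micro", "brewpub", "planning", "regional", "closed", "nano"),
  latitude     = c(32.7, 45.5, 40.7, NA, 39.7, 47.6),
  longitude    = c(-117.2, -122.7, -74.0, NA, -105.0, -122.3),
  stringsAsFactors = FALSE
)

# Keep only operating breweries that have coordinates
keep <- !is.na(raw$latitude) & !is.na(raw$longitude) &
        !(raw$brewery_type %in% c("planning", "closed"))
brew <- raw[keep, ]

# Map the detailed type to one of the three roles
# (a type that is not listed here would become NA in this sketch)
role_of <- c(brewpub = "consumption", taproom = "consumption",
             micro = "small_prod", nano = "small_prod",
             regional = "large_prod", large = "large_prod")
brew$type_group <- unname(role_of[brew$brewery_type])

table(brew$type_group)
```

In this toy input, three of the six rows survive: the "planning" and "closed" rows and the row without coordinates are dropped.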

knitr::include_graphics("figures/02_brewery_counts_by_county.png")
Number of breweries in each county. We can already see the pattern: California, the Pacific Northwest, Colorado, the Upper Midwest and the Northeast have the most.

3 Is there a spatial pattern?

First I check if location really matters. I build neighbour weights (queen contiguity and k-nearest neighbours), and I also make “neighbour” variables, for example the average income of the neighbour counties. Then I group counties into urban / intermediate / rural, like a simple version of DEGURBA.
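
As an illustration of a “neighbour” variable, the sketch below computes the average income of the neighbouring counties from a row-standardised weights matrix. The four-county adjacency and the income values are made up; in the real workflow the weights come from `spdep` (`poly2nb()` and `nb2listw()`), not from a matrix typed by hand.

```r
# Hypothetical adjacency for four counties in a row: 1-2-3-4
adj <- matrix(c(0, 1, 0, 0,
                1, 0, 1, 0,
                0, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Row-standardise so each row sums to 1 (the "W" style)
W <- adj / rowSums(adj)

income <- c(40, 60, 80, 100)             # made-up county incomes (thousand USD)
nbr_income <- as.vector(W %*% income)    # mean income of each county's neighbours
nbr_income
```

County 2, for example, gets the mean of counties 1 and 3, so the new column describes the surroundings of a county and not the county itself.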

deg <- modelsp %>% st_drop_geometry() %>%
  group_by(degurba) %>%
  summarise(counties = n(), breweries = sum(n_brew),
            `mean per county` = round(mean(n_brew), 2),
            `mean per 100k people` = round(mean(brew_per_100k, na.rm = TRUE), 2),
            .groups = "drop")
kab(deg, caption = "Breweries by urban / rural class")
Breweries by urban / rural class
| degurba | counties | breweries | mean per county | mean per 100k people |
| --- | --- | --- | --- | --- |
| rural | 1715 | 688 | 0.40 | 2.25 |
| intermediate | 1062 | 1613 | 1.52 | 1.49 |
| urban | 332 | 3093 | 9.32 | 1.96 |

Urban counties have most of the breweries in total. But if we look per person (the mean of the county rates), the rural counties are actually the highest (2.25 per 100k, vs 1.96 in urban and 1.49 in intermediate). So small towns also like their local brewery. This is a small surprise.

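# Queen-contiguity neighbours with row-standardised weights (style "W")
lw <- nb2listw(poly2nb(modelsp, queen = TRUE), style = "W", zero.policy = TRUE)
# Fill missing per-capita rates with the mean so that moran.test() can run
v  <- modelsp$brew_per_100k; v[is.na(v)] <- mean(v, na.rm = TRUE)
# Moran's I for the raw count and for the per-capita rate
mi_n  <- moran.test(modelsp$n_brew, lw, zero.policy = TRUE)$estimate[1]
mi_pc <- moran.test(v, lw, zero.policy = TRUE)$estimate[1]
data.frame(variable = c("brewery count", "breweries per 100k"),
           `Moran's I` = round(c(mi_n, mi_pc), 3), check.names = FALSE) |> kab()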
| variable | Moran’s I |
| --- | --- |
| brewery count | 0.304 |
| breweries per 100k | 0.189 |

Moran’s I is positive and significant (p is almost 0). This means a county with many breweries usually has neighbours with many breweries too. So the location is not random, and using spatial methods makes sense here.
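
To make the statistic concrete, here is a hand-rolled Moran's I on a toy example. With weights w_ij and centred values z_i it is I = (n / S0) * sum_ij(w_ij z_i z_j) / sum_i(z_i^2), where S0 is the sum of all weights. The chain of six units and both value vectors are invented for illustration; `spdep::moran.test()` stays the right tool for the real data because it also gives the significance test.

```r
morans_i <- function(x, W) {
  z  <- x - mean(x)                     # centred values
  s0 <- sum(W)                          # sum of all weights
  (length(x) / s0) * sum(W * outer(z, z)) / sum(z^2)
}

# Chain of six units: each unit is a neighbour of the next one
n <- 6
adj <- matrix(0, n, n)
adj[cbind(1:(n - 1), 2:n)] <- 1
adj <- adj + t(adj)
W <- adj / rowSums(adj)                 # row-standardised weights

clustered   <- c(9, 8, 7, 1, 0, 0)      # high values sit together
alternating <- c(9, 0, 9, 0, 9, 0)      # high and low values alternate

morans_i(clustered, W)     # positive: similar values are neighbours
morans_i(alternating, W)   # negative: neighbours are dissimilar
```

The brewery counts behave like the first vector: high values sit next to high values, so the index is positive.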

4 How much do breweries cluster?

I measure clustering in three ways.

ce <- as.data.frame(aggl$ce_by) %>%
  transmute(role = type_group, n,
            `observed NN (km)` = round(obs_nn_km, 1),
            `expected NN (km)` = round(exp_nn_km, 1),
            `Clark-Evans R` = round(R, 3))
kab(ce, caption = "Clark-Evans index (R below 1 means clustering)")
Clark-Evans index (R below 1 means clustering)
| role | n | observed NN (km) | expected NN (km) | Clark-Evans R |
| --- | --- | --- | --- | --- |
| consumption | 1897 | 16.1 | 32.1 | 0.501 |
| large_prod | 230 | 39.8 | 92.3 | 0.432 |
| small_prod | 3266 | 10.8 | 24.5 | 0.440 |

The Clark-Evans index for all breweries is R = 0.41. The real average distance to the nearest brewery is about 7.9 km, but under a random pattern it would be about 19 km. So breweries are around 2.4 times closer than random. The SPAG-style coverage index gives the same message: equal circles around the breweries cover only 25% of the country.

One detail is interesting: the production breweries cluster a bit more (R ≈ 0.44) than the consumption ones (R ≈ 0.50). Production follows the beer regions, and consumption follows the people.
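
The Clark-Evans ratio divides the observed mean nearest-neighbour distance by the value expected under complete spatial randomness, 1 / (2 * sqrt(lambda)), where lambda is the number of points per unit area. The sketch below uses base R and ignores edge correction; the two toy patterns and the window area are assumptions for illustration. For real point patterns `spatstat` provides `clarkevans()`.

```r
clark_evans <- function(xy, area) {
  d <- as.matrix(dist(xy))
  diag(d) <- Inf                               # ignore the distance to self
  obs_nn <- mean(apply(d, 1, min))             # observed mean NN distance
  exp_nn <- 1 / (2 * sqrt(nrow(xy) / area))    # expectation under randomness
  c(observed = obs_nn, expected = exp_nn, R = obs_nn / exp_nn)
}

# Regular pattern: 5 x 5 lattice with spacing 1 in a 5 x 5 window
lattice <- as.matrix(expand.grid(x = 0:4 + 0.5, y = 0:4 + 0.5))
clark_evans(lattice, area = 25)     # R = 2, more regular than random

# Clustered pattern: the same 25 points squeezed into one corner
squeezed <- lattice / 10
clark_evans(squeezed, area = 25)    # R = 0.2, strongly clustered

# The overall figures quoted in the text (rounded inputs)
7.9 / 19                            # about 0.42, close to the reported R = 0.41
```

The small difference between 0.42 and the reported 0.41 comes only from the rounding of the two distances.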

knitr::include_graphics("figures/08_ripley_L.png")
Ripley's L stays above zero at every distance, so breweries are clustered at all scales.

I also compare the two main roles with kernel density. The correlation between the consumption map and the production map is 0.93, so they are mostly in the same places, only production is a little more concentrated.

5 The beer hubs

Two of the biggest hubs in my data: Mission Brewery in San Diego, the number-one cluster with 380 breweries (left), and Deschutes Brewery in Portland, Oregon (right). (Photos: Wikimedia Commons, CC BY-SA 3.0)

To find the clusters of points I use DBSCAN (search radius eps = 25 km, at least minPts = 10 breweries). It finds 88 hubs, and HDBSCAN gives almost the same (89). About 35% of breweries are “isolated” (DBSCAN noise points), which is again the rural part.
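
To show what the two DBSCAN parameters do, here is a small base-R version of the algorithm on invented points. It is only a teaching sketch: `dbscan_simple()` is a hypothetical helper, reading “25 km, 10 breweries” as the radius and the minimum number of points is my interpretation, and for real data `dbscan::dbscan(x, eps, minPts)` is much faster.

```r
dbscan_simple <- function(xy, eps, min_pts) {
  d    <- as.matrix(dist(xy))
  n    <- nrow(xy)
  nbrs <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))  # includes self
  core <- lengths(nbrs) >= min_pts      # core points have enough neighbours
  cluster <- integer(n)                 # 0 = noise or not yet assigned
  cid <- 0L
  for (i in seq_len(n)) {
    if (cluster[i] != 0L || !core[i]) next
    cid <- cid + 1L                     # start a new cluster at this core point
    cluster[i] <- cid
    queue <- i
    while (length(queue) > 0) {
      j <- queue[1]; queue <- queue[-1]
      if (!core[j]) next                # border points join but do not expand
      new <- nbrs[[j]][cluster[nbrs[[j]]] == 0L]
      cluster[new] <- cid
      queue <- c(queue, new)
    }
  }
  cluster
}

# Toy coordinates in km: two dense groups and three lonely points
angles <- seq(0, 2 * pi, length.out = 13)[-13]           # 12 directions
hub_a  <- cbind(5 * cos(angles), 5 * sin(angles))        # 12 points near (0, 0)
hub_b  <- cbind(200 + 5 * cos(angles), 5 * sin(angles))  # 12 points near (200, 0)
lone   <- rbind(c(100, 100), c(-150, 50), c(300, 300))   # 3 isolated points
xy <- rbind(hub_a, hub_b, lone)

cl <- dbscan_simple(xy, eps = 25, min_pts = 10)
table(cl)   # cluster 0 holds the isolated points
```

With these settings the two dense groups become two hubs, and the three lonely points stay in cluster 0, like the grey “isolated” breweries on the map.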

kab(head(as.data.frame(hubs), 10) %>%
      transmute(hub, n, state, `anchor city` = anchor_city,
                `% consumption` = pct_consumption, `% small prod` = pct_small_prod),
    caption = "The ten biggest beer hubs")
The ten biggest beer hubs
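| hub | n | state | anchor city | % consumption | % small prod |
| --- | --- | --- | --- | --- | --- |
| 3 | 380 | California | San Diego | 26 | 68 |
| 19 | 199 | California | San Francisco | 39 | 54 |
| 16 | 194 | Washington | Seattle | 32 | 64 |
| 5 | 174 | Colorado | Denver | 30 | 63 |
| 10 | 148 | Maryland | Baltimore | 27 | 70 |
| 2 | 147 | Oregon | Portland | 39 | 57 |
| 14 | 119 | Illinois | Chicago | 39 | 57 |
| 40 | 108 | Massachusetts | Boston | 26 | 63 |
| 46 | 80 | Pennsylvania | Philadelphia | 48 | 48 |
| 44 | 79 | New York | Brooklyn | 23 | 71 |
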
knitr::include_graphics("figures/06_beer_hubs.png")
Beer hubs from DBSCAN. Grey points are the isolated breweries.

The hubs are the famous beer cities: San Diego (380 breweries), the Bay Area, Seattle, Denver, Portland, Chicago, Boston. They are also a bit different from each other. San Diego is more about production (68% small production), but Philadelphia (48% consumption) and Detroit have more brewpubs and taprooms, so they are more about drinking on the spot.

6 What explains the pattern?

Now I try to predict the number of breweries in a county. I use Random Forest (ranger) and a small neural network (nnet). I compare two feature sets: one with only normal variables (aspatial), and one that also adds spatial variables (coordinates, neighbour values, distance to the nearest brewery). I also use two kinds of cross-validation: normal random folds, and spatial folds (blocks in space).

cv <- readRDS(P("supervised_results.rds"))$cv
names(cv) <- c("model", "random CV", "spatial CV")
kab(cv, caption = "Cross-validation error (RMSE, brewery-count scale)")
Cross-validation error (RMSE, brewery-count scale)
| model | random CV | spatial CV |
| --- | --- | --- |
| RF aspatial | 4.155 | 4.425 |
| RF spatial | 3.997 | 4.314 |
| ANN spatial | 5.138 | 5.796 |

I notice a few things here. The spatial features make the RF a little better (3.997 vs 4.155). The spatial cross-validation gives higher error than the random one, which is normal: random folds are too easy because the test points sit next to the training points. And the Random Forest is clearly better than the simple neural network here. The final RF has OOB R² = 0.84, which is quite good.
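
The difference between the two kinds of folds can be sketched in base R. The helper `spatial_block_folds()` and the simulated coordinates are hypothetical and only show the idea of holding out whole blocks of space; the block size and the number of folds here are not the settings of my real models.

```r
spatial_block_folds <- function(x, y, block_size, k) {
  # Label every point with the square block it falls into
  block_id <- paste(floor(x / block_size), floor(y / block_size), sep = "_")
  blocks   <- unique(block_id)
  # Deal the blocks out to k folds in turn (a real run would shuffle them first)
  fold_of_block <- setNames((seq_along(blocks) - 1L) %% k + 1L, blocks)
  unname(fold_of_block[block_id])
}

# Hypothetical county centroids in projected km
set.seed(1)
x <- runif(200, 0, 1000)
y <- runif(200, 0, 1000)

fold_spatial <- spatial_block_folds(x, y, block_size = 250, k = 5)
fold_random  <- sample(rep(1:5, length.out = 200))   # ordinary random folds

# Points in the same 250 km block always share a spatial fold
block <- paste(floor(x / 250), floor(y / 250))
same_fold <- all(tapply(fold_spatial, block, function(f) length(unique(f)) == 1))
same_fold
```

With random folds a test county usually has training counties right next to it, which is why the random CV error looks lower than the spatial one.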

# Keep the ten most important features (sort first, in case imp is not sorted)
impdf <- data.frame(feature = names(imp), importance = as.numeric(imp)) |>
  arrange(desc(importance)) |> head(10)
ggplot(impdf, aes(reorder(feature, importance), importance)) +
  geom_col(fill = "#b2182b") + coord_flip() +
  labs(x = NULL, y = "permutation importance") + theme_minimal(base_size = 12)
Variable importance from the Random Forest. The distance to the nearest brewery is more important than population.

The most important variable is the distance to the nearest brewery, even more than population. I think this is the clearest result of the project: breweries appear where there are already other breweries. This is what economists call agglomeration.

7 Is the result robust? Gridded model and MAUP

When I use counties as the unit, the result can depend on the shape and size of the units. This is called the Modifiable Areal Unit Problem (MAUP). To check it, I run the same Random Forest again, but on regular square grids (50 km and 100 km) instead of counties. I move the breweries and the socio-economic data onto the grid cells by area weighting.

maup <- readRDS(P("grid_maup.rds"))
kab(maup, caption = "The same model on counties vs. two grid sizes")
The same model on counties vs. two grid sizes
| resolution | cells | RMSE aspatial (spatial CV) | RMSE spatial (spatial CV) | OOB R2 | top predictor |
| --- | --- | --- | --- | --- | --- |
| county | 3109 | 4.425 | 4.314 | 0.838 | dist_nearest_brew_km |
| 50 km | 3373 | 4.270 | 3.726 | 0.831 | dist_nearest_brew_km |
| 100 km | 896 | 9.388 | 9.014 | 0.831 | dist_nearest_brew_km |

The story does not change. For every spatial unit the OOB R² stays around 0.83, the spatial features are always better than the aspatial ones, and the distance to the nearest brewery is always the most important variable. So my main results are robust to MAUP. The RMSE looks bigger for the 100 km grid, but this is only because a big cell contains more breweries, so the counts (and the errors) are larger.
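
The effect of the cell size on the counts can be seen in a small base-R sketch. The coordinates are invented (in projected metres, as in EPSG:5070) and `count_per_cell()` is a hypothetical helper. It only counts points per square cell; the area weighting of the socio-economic data is not shown.

```r
# Hypothetical brewery coordinates in projected metres
x <- c(10e3, 20e3, 45e3, 60e3, 120e3, 130e3)
y <- c( 5e3, 40e3, 49e3, 10e3,  80e3,  95e3)

count_per_cell <- function(x, y, cell) {
  # A cell is identified by its column and row index on the square grid
  cell_id <- paste(floor(x / cell), floor(y / cell), sep = "_")
  as.data.frame(table(cell = cell_id), responseName = "n_brew")
}

g50  <- count_per_cell(x, y, cell = 50e3)    # 50 km cells
g100 <- count_per_cell(x, y, cell = 100e3)   # 100 km cells
g50
g100   # fewer cells, larger counts per cell
```

The same six points give three occupied cells at 50 km but only two at 100 km, with a larger maximum count. This is why the RMSE on the count scale grows with the cell size even when the fit is equally good.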

knitr::include_graphics("figures/10_grid_actual.png")
Breweries per 50 km grid cell. The same pattern as the county map, but on a regular grid (class 7).

8 From points to a surface: kriging

In this part I make map surfaces. The first one is a kernel density of the brewery points. The second one uses kriging to take the RF prediction from the county points and make a smooth surface for the whole country.

knitr::include_graphics("figures/07A_kde_surface.png")
A. Kernel density of brewery locations.

knitr::include_graphics("figures/07B_kriged_predicted.png")
B. RF prediction, kriged from points to a smooth surface.

I also tried to krige the RF residuals (the part the model cannot explain). But the variogram is almost only nugget (a nugget of 54.9 against a spatially structured part, the partial sill, of only 0.65). This actually means a good thing: the Random Forest already used the spatial information, so what is left has almost no spatial pattern. Because of this, it is better to look at the “opportunity” at the county level, not as a smooth surface.
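
A quick way to read these two numbers is the share of the nugget in the total sill. The arithmetic below takes 0.65 as the partial sill (the spatially structured variance), which is my interpretation of the fitted variogram.

```r
nugget       <- 54.9    # variance with no spatial structure
partial_sill <- 0.65    # spatially structured variance (assumed meaning of 0.65)

nugget_share <- nugget / (nugget + partial_sill)
round(nugget_share, 3)  # 0.988
```

So about 99% of the residual variance has no spatial structure, and a kriged residual surface would add almost nothing.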

9 The opportunity gap

Here I define gap = real number − expected number of breweries in a county.

knitr::include_graphics("figures/07C_opportunity_gap.png")
Opportunity gap by county. Red = more breweries than expected (saturated). Blue = fewer than expected (under-served).

us <- pred %>% st_drop_geometry() %>% arrange(opportunity_gap) %>%
  transmute(county, state = STUSPS, actual = n_brew,
            expected = round(pred_n_brew, 1), gap = round(opportunity_gap, 1)) %>% head(6)
sat <- pred %>% st_drop_geometry() %>% arrange(desc(opportunity_gap)) %>%
  transmute(county, state = STUSPS, actual = n_brew,
            expected = round(pred_n_brew, 1), gap = round(opportunity_gap, 1)) %>% head(6)
kab(us,  caption = "Most UNDER-served counties (much fewer breweries than expected)")
Most UNDER-served counties (much fewer breweries than expected)
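| county | state | actual | expected | gap |
| --- | --- | --- | --- | --- |
| Utah | UT | 1 | 10.5 | -9.5 |
| DeKalb | GA | 4 | 10.9 | -6.9 |
| New York | NY | 2 | 8.6 | -6.6 |
| Hudson | NJ | 1 | 7.4 | -6.4 |
| Denton | TX | 4 | 10.1 | -6.1 |
| Gwinnett | GA | 1 | 6.9 | -5.9 |
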
kab(sat, caption = "Most SATURATED counties (much more than expected)")
Most SATURATED counties (much more than expected)
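| county | state | actual | expected | gap |
| --- | --- | --- | --- | --- |
| San Diego | CA | 148 | 29.0 | 119.0 |
| Los Angeles | CA | 118 | 37.2 | 80.8 |
| King | WA | 88 | 26.4 | 61.6 |
| Multnomah | OR | 81 | 22.4 | 58.6 |
| Denver | CO | 60 | 15.1 | 44.9 |
| Cook | IL | 64 | 20.5 | 43.5 |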

I did not tell the model anything about laws or rent, but the results still make sense. Among the most under-served places are Utah County in Utah (and also Davis County, which is not among the six rows shown), where the alcohol rules are strict, and Manhattan (New York County) and Jersey City (Hudson County), where the rent and the licences are very expensive. The most saturated places are exactly the big beer cities: San Diego, Los Angeles, Seattle (King County), Portland (Multnomah County), Denver and Chicago (Cook County). They have even more breweries than the model expects.
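
The gap itself is a simple subtraction. The sketch below repeats it for four rows copied from the two tables; the `status` labels follow the wording of the map caption and are only for illustration.

```r
# Rows copied from the under-served and saturated tables
gap_tab <- data.frame(
  county   = c("Utah", "New York", "San Diego", "Denver"),
  state    = c("UT", "NY", "CA", "CO"),
  actual   = c(1, 2, 148, 60),
  expected = c(10.5, 8.6, 29.0, 15.1)
)

# gap = actual number minus expected number of breweries
gap_tab$gap    <- gap_tab$actual - gap_tab$expected
gap_tab$status <- ifelse(gap_tab$gap < 0, "under-served", "saturated")
gap_tab[order(gap_tab$gap), ]
```

A negative gap means fewer breweries than the model expects for a county with these conditions, and a positive gap means more.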

10 What kind of place has many breweries?

Finally I use association rules. I cut the county variables and also the neighbour variables into levels, then I look for rules. Adding the neighbour items is what makes the rules “spatial”.

kab(head(arules::DATAFRAME(rules$high), 6),
    caption = "Top rules for HIGH number of breweries (sorted by lift)")
Top rules for HIGH number of breweries (sorted by lift)
|  | LHS | RHS | support | confidence | coverage | lift | count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | {young=young_high,urb=urb_urban,nbr_brew=nbr_brew_high} | {brew=brew_high} | 0.0292699 | 0.5582822 | 0.0524284 | 8.550243 | 91 |
| 5 | {income=inc_high,young=young_high,urb=urb_urban} | {brew=brew_high} | 0.0279833 | 0.5370370 | 0.0521068 | 8.224868 | 87 |
| 2 | {age=age_young,urb=urb_urban,nbr_brew=nbr_brew_high} | {brew=brew_high} | 0.0257317 | 0.5194805 | 0.0495336 | 7.955985 | 80 |
| 4 | {educ=edu_high,young=young_high,urb=urb_urban} | {brew=brew_high} | 0.0360244 | 0.5185185 | 0.0694757 | 7.941252 | 112 |
| 3 | {young=young_high,urb=urb_urban,nbr_inc=nbr_inc_high} | {brew=brew_high} | 0.0305564 | 0.5135135 | 0.0595047 | 7.864599 | 95 |

A county with many breweries is usually young, urban, educated and rich, and also has neighbours with many breweries and high income (nbr_brew_high, nbr_inc_high come back many times, with lift around 8). The opposite case:

kab(head(arules::DATAFRAME(rules$none), 4),
    caption = "Top rules for NO breweries (sorted by lift)")
Top rules for NO breweries (sorted by lift)
|  | LHS | RHS | support | confidence | coverage | lift | count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 99 | {educ=edu_low,nbr_brew=nbr_brew_low} | {brew=brew_none} | 0.1630749 | 0.9458955 | 0.1724027 | 1.526890 | 507 |
| 96 | {educ=edu_low,nbr_inc=nbr_inc_low} | {brew=brew_none} | 0.1852686 | 0.9335494 | 0.1984561 | 1.506960 | 576 |
| 88 | {income=inc_low,educ=edu_low} | {brew=brew_none} | 0.1984561 | 0.9195231 | 0.2158250 | 1.484318 | 617 |
| 100 | {educ=edu_low,urb=urb_rural} | {brew=brew_none} | 0.2277260 | 0.9170984 | 0.2483114 | 1.480404 | 708 |

A county with no brewery is usually low education, rural, and surrounded by poor neighbours with few breweries (confidence about 0.92–0.95). So the neighbours matter in both directions.
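
To read the rule tables, it helps to rebuild the measures of one rule by hand. The sketch below does it for the first HIGH rule (young, urban, neighbours with many breweries). Only `n_both` and the 3,109 counties come directly from the report; `n_lhs` and `n_rhs` are back-calculated from the printed coverage and lift, so they are assumptions.

```r
n_counties <- 3109   # counties in the analysis
n_both     <- 91     # "count" column: LHS and RHS both true
n_lhs      <- 163    # back-calculated from coverage (0.0524284 * 3109)
n_rhs      <- 203    # back-calculated from confidence / lift

support    <- n_both / n_counties                 # P(LHS and RHS)
coverage   <- n_lhs / n_counties                  # P(LHS)
confidence <- n_both / n_lhs                      # P(RHS | LHS)
lift       <- confidence / (n_rhs / n_counties)   # confidence relative to the base rate

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```

A lift of about 8.5 means that counties matching the left-hand side are about 8.5 times more likely to be in the brew_high class than an average county. The NO-brewery rules have high confidence but low lift (about 1.5), because having no brewery is already the most common case.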

11 Conclusion and limitations

Hops growing on an old factory wall in Warsaw, the home city of this course. (Photo: Panek, Wikimedia Commons, CC BY-SA 4.0)

Conclusion. The US craft beer industry is strongly clustered (Clark-Evans R = 0.41, positive Moran’s I). And the clustering itself is the best predictor of where breweries are, which fits the idea of agglomeration. Spatial features and spatial cross-validation are useful to get a fair, not too optimistic, error. The opportunity gap is the most practical part: it can find places that have fewer breweries than their conditions suggest, and it also catches real things like alcohol laws and high rent without being told.

Limitations. (1) Open Brewery DB is made by volunteers, and only 72% of US breweries have coordinates, so the coverage can be biased. (2) I work mainly at county level, so there is the Modifiable Areal Unit Problem (MAUP); the grid models in Section 7 are a robustness check, but they cover only two cell sizes (50 km and 100 km). (3) My DEGURBA is only an approximation at county level, not the real grid version. (4) The kriged surface should be read in a general way, because the country has big empty areas. (5) These are correlations, not causal effects.

Data and image sources. Breweries: Open Brewery DB. Socio-economic data: U.S. Census Bureau, ACS 5-year (2018–22) and TIGER/Line. Photos from Wikimedia Commons (CC BY-SA 4.0): brewery in Buffalo (Andre Carrotflower), beer flight (Gerry Dincher), hops in Warsaw (Panek).

Note on tools. I used an AI assistant (Claude) to help me write and debug the R code. I chose the topic, the data and the methods, I ran all the analysis myself, and I checked the results and the numbers. Any mistakes that are left are my own.
