1 What this project is about

A craft brewery in Buffalo, New York. (Photo: Andre Carrotflower, Wikimedia Commons, CC BY-SA 4.0)

Craft beer is very popular in the United States. But the breweries are not spread out evenly. Some cities (like San Diego or Portland) have a lot of them, and many rural areas have almost none. I wanted to study this pattern using the spatial machine learning tools from this course.

Here I use a big dataset of 5,394 craft breweries in the United States

The main question is simple:

Where do breweries cluster, why do they cluster there, and where is the map still “empty” even though the conditions look good?

The questions I try to answer:

How strongly do breweries cluster in space?
What factors explain where breweries are, and do spatial variables help the prediction?
Can I turn the model predictions from points into a smooth map surface?
Where are the markets that look “under-served”?
What kind of place usually has many breweries?

The last point, the “under-served” market, is the part I find most interesting.

2 Data

I use three data sources and join them together.

Layer	Source	What it gives
Breweries (points)	Open Brewery DB (API)	5,394 operating breweries with coordinates
County borders	TIGER / `tigris`	the polygons for the analysis, in Albers CRS (EPSG:5070)
Socio-economic data	ACS 5-year (2018–22), `tidycensus`	income, education, age, share of young adults, unemployment

Open Brewery DB has 8,207 breweries in the US. Only about 72% of them have real coordinates, which is one weak point of the data. I keep the contiguous US, remove the breweries that are “planning” or “closed”, and I get 5,394 breweries. I assign every brewery to a county, so in total I work with 3,109 counties. I also group the brewery types into three roles: consumption (brewpub, taproom), small production (micro, nano) and large production (regional, large).

knitr::include_graphics("figures/02_brewery_counts_by_county.png")

Number of breweries in each county. We can already see the pattern: California, the Pacific Northwest, Colorado, the Upper Midwest and the Northeast have the most.

3 Is there a spatial pattern?

First I check if location really matters. I build neighbour weights (queen contiguity and k-nearest neighbours), and I also make “neighbour” variables, for example the average income of the neighbour counties. Then I group counties into urban / intermediate / rural, like a simple version of DEGURBA.

deg <- modelsp %>% st_drop_geometry() %>%
  group_by(degurba) %>%
  summarise(counties = n(), breweries = sum(n_brew),
            `mean per county` = round(mean(n_brew), 2),
            `mean per 100k people` = round(mean(brew_per_100k, na.rm = TRUE), 2),
            .groups = "drop")
kab(deg, caption = "Breweries by urban / rural class")

Breweries by urban / rural class
degurba	counties	breweries	mean per county	mean per 100k people
rural	1715	688	0.40	2.25
intermediate	1062	1613	1.52	1.49
urban	332	3093	9.32	1.96

Urban counties have most of the breweries in total. But if we look per person, the rural counties are actually the highest (2.25 per 100k, vs 1.96 in urban and 1.49 in intermediate). So small towns also like their local brewery. This is a small surprise.

lw <- nb2listw(poly2nb(modelsp, queen = TRUE), style = "W", zero.policy = TRUE)
v  <- modelsp$brew_per_100k; v[is.na(v)] <- mean(v, na.rm = TRUE)
mi_n  <- moran.test(modelsp$n_brew, lw, zero.policy = TRUE)$estimate[1]
mi_pc <- moran.test(v, lw, zero.policy = TRUE)$estimate[1]
data.frame(variable = c("brewery count", "breweries per 100k"),
           `Moran's I` = round(c(mi_n, mi_pc), 3), check.names = FALSE) |> kab()

variable	Moran’s I
brewery count	0.304
breweries per 100k	0.189

Moran’s I is positive and significant (p is almost 0). This means a county with many breweries usually has neighbours with many breweries too. So the location is not random, and using spatial methods makes sense here.

4 How much do breweries cluster?

I measure clustering in three ways.

ce <- as.data.frame(aggl$ce_by) %>%
  transmute(role = type_group, n,
            `observed NN (km)` = round(obs_nn_km, 1),
            `expected NN (km)` = round(exp_nn_km, 1),
            `Clark-Evans R` = round(R, 3))
kab(ce, caption = "Clark-Evans index (R below 1 means clustering)")

Clark-Evans index (R below 1 means clustering)
role	n	observed NN (km)	expected NN (km)	Clark-Evans R
consumption	1897	16.1	32.1	0.501
large_prod	230	39.8	92.3	0.432
small_prod	3266	10.8	24.5	0.440

The Clark-Evans index for all breweries is R = 0.41. The real average distance to the nearest brewery is about 7.9 km, but under a random pattern it would be about 19 km. So breweries are around 2.4 times closer than random. The SPAG-style coverage index gives the same message: equal circles around the breweries cover only 25% of the country.

One detail is interesting: the production breweries cluster a bit more (R ≈ 0.44) than the consumption ones (R ≈ 0.50). Production follows the beer regions, and consumption follows the people.

knitr::include_graphics("figures/08_ripley_L.png")

Ripley’s L stays above zero at every distance, so breweries are clustered at all scales.

I also compare the two main roles with kernel density. The correlation between the consumption map and the production map is 0.93, so they are mostly in the same places, only production is a little more concentrated.

5 The beer hubs

Two of the biggest hubs in my data: Mission Brewery in San Diego, the number-one cluster with 380 breweries (left), and Deschutes Brewery in Portland, Oregon (right). (Photos: Wikimedia Commons, CC BY-SA 3.0)

To find the clusters of points I use DBSCAN (25 km, 10 breweries). It finds 88 hubs, and HDBSCAN gives almost the same (89). About 35% of breweries are “isolated”, which is again the rural part.

kab(head(as.data.frame(hubs), 10) %>%
      transmute(hub, n, state, `anchor city` = anchor_city,
                `% consumption` = pct_consumption, `% small prod` = pct_small_prod),
    caption = "The ten biggest beer hubs")

The ten biggest beer hubs
hub	n	state	anchor city	% consumption	% small prod
3	380	California	San Diego	26	68
19	199	California	San Francisco	39	54
16	194	Washington	Seattle	32	64
5	174	Colorado	Denver	30	63
10	148	Maryland	Baltimore	27	70
2	147	Oregon	Portland	39	57
14	119	Illinois	Chicago	39	57
40	108	Massachusetts	Boston	26	63
46	80	Pennsylvania	Philadelphia	48	48
44	79	New York	Brooklyn	23	71

knitr::include_graphics("figures/06_beer_hubs.png")

Beer hubs from DBSCAN. Grey points are the isolated breweries.

The hubs are the famous beer cities: San Diego (380 breweries), the Bay Area, Seattle, Denver, Portland, Chicago, Boston. They are also a bit different from each other. San Diego is more about production (68% micro), but Philadelphia and Detroit have more brewpubs, so they are more about drinking on the spot.

6 What explains the pattern?

Now I try to predict the number of breweries in a county. I use Random Forest (ranger) and a small neural network (nnet). I compare two feature sets: one with only normal variables (aspatial), and one that also adds spatial variables (coordinates, neighbour values, distance to the nearest brewery). I also use two kinds of cross-validation: normal random folds, and spatial folds (blocks in space).

cv <- readRDS(P("supervised_results.rds"))$cv
names(cv) <- c("model", "random CV", "spatial CV")
kab(cv, caption = "Cross-validation error (RMSE, brewery-count scale)")

Cross-validation error (RMSE, brewery-count scale)
model	random CV	spatial CV
RF aspatial	4.155	4.425
RF spatial	3.997	4.314
ANN spatial	5.138	5.796

I notice a few things here. The spatial features make the RF a little better (3.997 vs 4.155). The spatial cross-validation gives higher error than the random one, which is normal: random folds are too easy because the test points sit next to the training points. And the Random Forest is clearly better than the simple neural network here. The final RF has OOB R² = 0.84, which is quite good.

impdf <- data.frame(feature = names(imp), importance = as.numeric(imp)) |>
  head(10) |> arrange(importance)
ggplot(impdf, aes(reorder(feature, importance), importance)) +
  geom_col(fill = "#b2182b") + coord_flip() +
  labs(x = NULL, y = "permutation importance") + theme_minimal(base_size = 12)

Variable importance from the Random Forest. The distance to the nearest brewery is more important than population.

The most important variable is the distance to the nearest brewery, even more than population. I think this is the clearest result of the project: breweries appear where there are already other breweries. This is what economists call agglomeration.

7 Is the result robust? Gridded model and MAUP

When I use counties as the unit, the result can depend on the shape and size of the units. This is called the Modifiable Areal Unit Problem (MAUP). To check it, I run the same Random Forest again, but on regular square grids (50 km and 100 km) instead of counties. I move the breweries and the socio-economic data onto the grid cells by area weighting.

maup <- readRDS(P("grid_maup.rds"))
kab(maup, caption = "The same model on counties vs. two grid sizes")

The same model on counties vs. two grid sizes
resolution	cells	RMSE aspatial (spatial CV)	RMSE spatial (spatial CV)	OOB R2	top predictor
county	3109	4.425	4.314	0.838	dist_nearest_brew_km
50 km	3373	4.270	3.726	0.831	dist_nearest_brew_km
100 km	896	9.388	9.014	0.831	dist_nearest_brew_km

The story does not change. For every spatial unit the OOB R² stays around 0.83, the spatial features are always better than the aspatial ones, and the distance to the nearest brewery is always the most important variable. So my main results are robust to MAUP. The RMSE looks bigger for the 100 km grid, but this is only because a big cell contains more breweries, so the counts (and the errors) are larger.

knitr::include_graphics("figures/10_grid_actual.png")

Breweries per 50 km grid cell. The same pattern as the county map, but on a regular grid (class 7).

8 From points to a surface: kriging

In this part I make map surfaces. The first one is a kernel density of the brewery points. The second one uses kriging to take the RF prediction from the county points and make a smooth surface for the whole country.

knitr::include_graphics("figures/07A_kde_surface.png")

A. Kernel density of brewery locations.

knitr::include_graphics("figures/07B_kriged_predicted.png")

B. RF prediction, kriged from points to a smooth surface.

I also tried to krige the RF residuals (the part the model cannot explain). But the variogram is almost only nugget (54.9 against a small structure of 0.65). This actually means a good thing: the Random Forest already used the spatial information, so what is left has almost no spatial pattern. Because of this, it is better to look at the “opportunity” at the county level, not as a smooth surface.

9 The opportunity gap

Here I define gap = real number − expected number of breweries in a county.

knitr::include_graphics("figures/07C_opportunity_gap.png")

Opportunity gap by county. Red = more breweries than expected (saturated). Blue = fewer than expected (under-served).

us <- pred %>% st_drop_geometry() %>% arrange(opportunity_gap) %>%
  transmute(county, state = STUSPS, actual = n_brew,
            expected = round(pred_n_brew, 1), gap = round(opportunity_gap, 1)) %>% head(6)
sat <- pred %>% st_drop_geometry() %>% arrange(desc(opportunity_gap)) %>%
  transmute(county, state = STUSPS, actual = n_brew,
            expected = round(pred_n_brew, 1), gap = round(opportunity_gap, 1)) %>% head(6)
kab(us,  caption = "Most UNDER-served counties (much fewer breweries than expected)")

Most UNDER-served counties (much fewer breweries than expected)
county	state	actual	expected	gap
Utah	UT	1	10.5	-9.5
DeKalb	GA	4	10.9	-6.9
New York	NY	2	8.6	-6.6
Hudson	NJ	1	7.4	-6.4
Denton	TX	4	10.1	-6.1
Gwinnett	GA	1	6.9	-5.9

kab(sat, caption = "Most SATURATED counties (much more than expected)")

Most SATURATED counties (much more than expected)
county	state	actual	expected	gap
San Diego	CA	148	29.0	119.0
Los Angeles	CA	118	37.2	80.8
King	WA	88	26.4	61.6
Multnomah	OR	81	22.4	58.6
Denver	CO	60	15.1	44.9
Cook	IL	64	20.5	43.5

I did not tell the model anything about laws or rent, but the results still make sense. The most under-served places are Utah County and Davis County in Utah, which have strict alcohol rules, and also Manhattan and Jersey City, where the rent and the licences are very expensive. The most saturated places are exactly the big beer cities (San Diego, Los Angeles, Seattle, Portland, Chicago, Denver). They have even more breweries than the model expects.

10 What kind of place has many breweries?

Finally I use association rules. I cut the county variables and also the neighbour variables into levels, then I look for rules. Adding the neighbour items is what makes the rules “spatial”.

kab(head(arules::DATAFRAME(rules$high), 6),
    caption = "Top rules for HIGH number of breweries (sorted by lift)")

Top rules for HIGH number of breweries (sorted by lift)
	LHS	RHS	support	confidence	coverage	lift	count
1	{young=young_high,urb=urb_urban,nbr_brew=nbr_brew_high}	{brew=brew_high}	0.0292699	0.5582822	0.0524284	8.550243	91
5	{income=inc_high,young=young_high,urb=urb_urban}	{brew=brew_high}	0.0279833	0.5370370	0.0521068	8.224868	87
2	{age=age_young,urb=urb_urban,nbr_brew=nbr_brew_high}	{brew=brew_high}	0.0257317	0.5194805	0.0495336	7.955985	80
4	{educ=edu_high,young=young_high,urb=urb_urban}	{brew=brew_high}	0.0360244	0.5185185	0.0694757	7.941252	112
3	{young=young_high,urb=urb_urban,nbr_inc=nbr_inc_high}	{brew=brew_high}	0.0305564	0.5135135	0.0595047	7.864599	95

A county with many breweries is usually young, urban, educated and rich, and also has neighbours with many breweries and high income (nbr_brew_high, nbr_inc_high come back many times, with lift around 8). The opposite case:

kab(head(arules::DATAFRAME(rules$none), 4),
    caption = "Top rules for NO breweries (sorted by lift)")

Top rules for NO breweries (sorted by lift)
	LHS	RHS	support	confidence	coverage	lift	count
99	{educ=edu_low,nbr_brew=nbr_brew_low}	{brew=brew_none}	0.1630749	0.9458955	0.1724027	1.526890	507
96	{educ=edu_low,nbr_inc=nbr_inc_low}	{brew=brew_none}	0.1852686	0.9335494	0.1984561	1.506960	576
88	{income=inc_low,educ=edu_low}	{brew=brew_none}	0.1984561	0.9195231	0.2158250	1.484318	617
100	{educ=edu_low,urb=urb_rural}	{brew=brew_none}	0.2277260	0.9170984	0.2483114	1.480404	708

A county with no brewery is usually low education, rural, and surrounded by poor neighbours with few breweries (confidence about 0.92–0.95). So the neighbours matter in both directions.

11 Conclusion and limitations

Hops growing on an old factory wall in Warsaw, the home city of this course. (Photo: Panek, Wikimedia Commons, CC BY-SA 4.0)

Conclusion. The US craft beer industry is strongly clustered (Clark-Evans R = 0.41, positive Moran’s I). And the clustering itself is the best predictor of where breweries are, which fits the idea of agglomeration. Spatial features and spatial cross-validation are useful to get a fair, not too optimistic, error. The opportunity gap is the most practical part: it can find places that have fewer breweries than their conditions suggest, and it also catches real things like alcohol laws and high rent without being told.

Limitations. (1) Open Brewery DB is made by volunteers, and only 72% of US breweries have coordinates, so the coverage can be biased. (2) I work at county level, so there is the Modifiable Areal Unit Problem (MAUP); a grid version would be a good robustness check. (3) My DEGURBA is only an approximation at county level, not the real grid version. (4) The kriged surface should be read in a general way, because the country has big empty areas. (5) These are correlations, not causal effects.

Data and image sources. Breweries: Open Brewery DB. Socio-economic data: U.S. Census Bureau, ACS 5-year (2018–22) and TIGER/Line. Photos from Wikimedia Commons (CC BY-SA 4.0): brewery in Buffalo (Andre Carrotflower), beer flight (Gerry Dincher), hops in Warsaw (Panek).

Note on tools. I used an AI assistant (Claude) to help me write and debug the R code. I chose the topic, the data and the methods, I ran all the analysis myself, and I checked the results and the numbers. Any mistakes that are left are my own.

References.

Elzinga, K. G., Tremblay, C. H., & Tremblay, V. J. (2015). Craft beer in the United States: history, numbers, and geography. Journal of Wine Economics, 10(3), 242–274.
Nilsson, I., Reid, N., & Lehnert, M. (2018). Geographic patterns of craft breweries at the intraurban scale. The Professional Geographer, 70(1), 114–125.
Hoalst-Pullen, N., & Patterson, M. W. (Eds.) (2020). The Geography of Beer: Culture and Economics. Springer.
Kopczewska, K. (2025). Modelling Spatial Density: Data, Methods, and R Applications in Statistics, Econometrics, and Machine Learning. Oxford University Press.
Schnell, S. M., & Reese, J. F. (2003). Microbreweries as tools of local identity. Journal of Cultural Geography, 21(1), 45–69. ```

Hops & Hotspots

Where U.S. craft breweries are, and why: a spatial machine learning project

Xiao Li, Spatial Machine Learning in R

June 2026