Introduction and Project Overview

Food systems are incredibly complex. Comprised of a myriad stakeholders—some of whose interests align and some of whose do not—they take shape both organically and as a result of intentional development. The challenge of understanding any given food system is thus commensurately complex.

As communities begin to reflect upon their local food systems and understand their agency in shaping what those systems look like in the future, a growing number are forming what are being broadly categorized as “food policy councils.” While the exact structure and role of these boards, committees, and councils differ, they are generally tasked with recommending policies to the appropriate governing bodies in an effort to shape the development of the food system within which they are situated.

One strategy many of these food policy councils have taken is to encourage the development of local food retailers in certain areas to encourage people in those areas to buy more of their food from local farms.

Now in its second year, the Missoula City-County Food Policy Advisory Board (FPAB) is considering such a measure. This, of course, raises the question of efficacy.

Is the number of people who primarily source their food directly from local farmers (through farm stands/stores or farmers markets) positively correlated to the density of those types of local food retailers in their area? Put another way, is getting people to source more of their food directly from local farmers simply a matter of increasing the density of those opportunities in their area?

Our independent variable is the density (per 1,000 people) of local food retailers (farm stands/stores, farmers markets).

Our dependent variable is the percentage of people who primarily source their food from local food retailers.

The Data

To answer this question, I’ll be looking at several datasets.

Missoula Food Retailers Data

Missoula Food Retailers contains all food retailers in Missoula County that meet the following criteria:

They offer multiple items from at least one food group in their raw form.

The criteria were designed to capture only places at which one could reasonably shop for groceries. For example,

  • a farm stand with a selection of fresh produce would be included
  • a farm store with a selection of frozen meat would be included
  • a box store with an aisle of sodas/shelf stable juices, an aisle of candy, and an aisle of frozen entrees would not be included
  • a corner store that primarily offers beer and wine but also has chips, salsa, etc. would not be included
  • a coffee shop that has a stand of fresh bananas at the counter and cooler of drinks-to-go would not be included

Let’s take a look.

# set wd
setwd("~/Documents/R Projects/Data")

# read data
foodRetailers <- read_excel("Missoula Food System.xlsx")

# view
foodRetailers
## # A tibble: 26 × 6
##    Category         Name          Neighborhood    Ward Address          Duration
##    <chr>            <chr>         <chr>          <dbl> <chr>            <chr>   
##  1 farm stand/store Clark Fork O… <NA>              NA 542 Tower St, M… summer  
##  2 farm stand/store Frank's Litt… <NA>              NA 203 N Curtis St… summer  
##  3 farm stand/store Oxbow F2M     <NA>              NA 6215 Haugan Dr,… year-ro…
##  4 farm stand/store Turner Farms  <NA>              NA 3515 S 3rd St W… year-ro…
##  5 farmers market   Clark Fork M… <NA>              NA 101 Carousel Dr… summer  
##  6 farmers market   Missoula Far… Heart of Miss…    NA 534 N Higgins A… summer  
##  7 farmers market   Missoula Val… <NA>              NA 2901 Brooks St,… winter  
##  8 farmers market   Orchard Home… <NA>              NA 2537 S 3rd St W… summer  
##  9 farmers market   Target Range… <NA>              NA 4095 South Ave … summer  
## 10 grocery store    Albertsons    Heart of Miss…     1 1003 E Broadway… year-ro…
## # … with 16 more rows

Other than missing values in Neighborhood and Ward, this looks good. Because I won’t be looking at either of those variables in this analysis, I’ll create a subset of the dataset with only the variables I’m interested in before taking a closer look.

# create subset as datatable
foodRetailersDT <- foodRetailers %>%
  subset(select = c("Category", "Name", "Address", "Duration")) %>%
  datatable(rownames = FALSE,
            caption = 'Food Retailers in Missoula')

# view
foodRetailersDT

Let’s take a closer look. We now have four variables:

  1. Category refers to the type of food retailer (farm stand/store, farmers market, grocery store).
  2. Name refers to the name of the entity.
  3. Address is the street address.
  4. Duration refers to the season of operation (year-round, summer, winter).

There are twenty-six food retailers in Missoula County that meet the above criteria. Of those, only the farm stands/stores and farmers markets meet the criteria to be considered local food retailers, meaning that they are outlets for local farmers to sell their product directly to consumers.

We can see that each variable has its own column, each observation has its own row, and each value has its own cell. It thus meets the criteria for a tidy dataset.1 Furthermore, everything appears to be spelled correctly and consistently, as well as formatted consistently, which is important for accurate calculations.

To prepare for calculating the density of local-only outlets in each zip code, I’ll first calculate the number of those outlets in each zip code in each season.

# calculate number in summer
localSummer <- foodRetailers %>%
  separate(Address, c("Street Address", "Zip"), sep = -5) %>%
  filter(Category == "farm stand/store" | Category == "farmers market" & Duration == "summer") %>%
  count(Zip) %>%
  rename("Summer" = n)

# calculate number in winter
localWinter <- foodRetailers %>%
  separate(Address, c("Street Address", "Zip"), sep = -5) %>%
  filter(Category == "farm stand/store" | Category == "farmers market" & Duration == "winter") %>%
  count(Zip) %>%
  rename("Winter" = n)

# join
local <- full_join(localSummer, localWinter, by = "Zip")

# view
local
## # A tibble: 4 × 3
##   Zip   Summer Winter
##   <chr>  <int>  <int>
## 1 59801      1      2
## 2 59802      2     NA
## 3 59803      1      1
## 4 59804      4      2

Now that I’ve calculated that, I’m ready to add in the population data to calculate density.

Population Data

The second dataset I’ll be working with is the population of each zip code in Missoula. All population data were collected by the U.S. Census Bureau and retrieved from data.census.gov.

I’ll join it to the above table so that we have the population and the number of local food retailers in each zip code in one place.

# set wd
setwd("~/Documents/R Projects/Data")

# read data
zipPop <- read_excel("Decennial 2010 Pop by Zip.xlsx", sheet = "Data") %>%
  rename("59801" = "ZCTA5 59801",
         "59802" = "ZCTA5 59802",
         "59803" = "ZCTA5 59803",
         "59804" = "ZCTA5 59804",
         "59808" = "ZCTA5 59808") %>%
 pivot_longer(
    cols = c(`59801`, `59802`, `59803`, `59804`, `59808`),
    names_to = "Zip",
    values_to = "Total"
  ) %>%
  rename("Population" = "Total") %>%
  subset(select = c("Zip", "Population"))

# join local food retailer data with population data
zipDT <- full_join(local, zipPop, by = "Zip")

# view
zipDT
## # A tibble: 5 × 4
##   Zip   Summer Winter Population
##   <chr>  <int>  <int>      <dbl>
## 1 59801      1      2      30940
## 2 59802      2     NA      18407
## 3 59803      1      1      16203
## 4 59804      4      2       7533
## 5 59808     NA     NA      17904

Now let’s calculate the density of local food retailers in each zip code in each season.

# calculate density per 1000 people for each zip code
zipDensityDT <- zipDT %>%
  mutate("Summer Density" = round(Summer/(Population/1000), digits = 2),
         "Winter Density" = round(Winter/(Population/1000), digits = 2))

# view
zipDensityDT
## # A tibble: 5 × 6
##   Zip   Summer Winter Population `Summer Density` `Winter Density`
##   <chr>  <int>  <int>      <dbl>            <dbl>            <dbl>
## 1 59801      1      2      30940             0.03             0.06
## 2 59802      2     NA      18407             0.11            NA   
## 3 59803      1      1      16203             0.06             0.06
## 4 59804      4      2       7533             0.53             0.27
## 5 59808     NA     NA      17904            NA               NA

Now we can add in the data about purchasing behavior.

Community Food Assessment Data

The third dataset I’ll be looking at is the results from an online survey conducted by the FPAB as a part of its 2021 Rapid Community Food Assessment. The online survey was open for six weeks and was promoted through various City and County channels as well as through the networks of each board member.

While the survey asked about many things, the two questions I’m interested in are:

  1. In a typical week during the spring and summer seasons, where does your household primarily source its food? —> Primary Source in Summer
  2. In a typical week during the fall and winter seasons, where does your household primarily source its food? —> Primary Source in Winter

Respondents could answer one of seven ways:

  1. Local grocery store “Local” here means locally owned and does not indicate that the food is from local farms.
  2. Chain supermarket
  3. Food pantry/food bank
  4. Hunting, fishing
  5. Gathering, foraging
  6. Home or community garden
  7. Direct from local farmers Through farm stands/stores or farmers markets.

Both of these questions represent categorical variables, each with seven possible values. For my analysis, however, I am only interested in whether respondents source primarily direct from local farmers (option 7) or not (options 1-6). I will essentially be reducing the measurement level from categorical to binary.

Let’s take a look.

# set wd
setwd("~/Documents/R Projects/Data/Rapid CFA data")

# read data
CFA <- read_excel("FPAB Community Food Assessment Survey Data.xlsx", sheet = "Raw") %>%
  rename("Zip" = "Zip Code",
         "Identify As" = 14,
         "Primary Source in Summer" = 31,
         "Primary Source in Winter" = 39) %>%
  select(c("Zip",
           "Identify As",
           "Primary Source in Summer",
           "Primary Source in Winter")) %>%
  filter(`Identify As` == "Consumer", `Zip` == 59801 | `Zip` == 59802 | `Zip` == 59803 | `Zip` == 59804 | `Zip` == 59808) %>%
  replace_na(list("Primary Source in Summer" = "Other", "Primary Source in Winter" = "Other")) %>%
  select(c("Zip", "Primary Source in Summer", "Primary Source in Winter"))

# view
CFA
## # A tibble: 325 × 3
##      Zip `Primary Source in Summer` `Primary Source in Winter`
##    <dbl> <chr>                      <chr>                     
##  1 59802 Other                      Other                     
##  2 59801 Direct from local farmers  Other                     
##  3 59801 Direct from local farmers  Other                     
##  4 59803 Other                      Other                     
##  5 59801 Other                      Other                     
##  6 59802 Direct from local farmers  Direct from local farmers 
##  7 59803 Direct from local farmers  Other                     
##  8 59808 Direct from local farmers  Other                     
##  9 59801 Other                      Other                     
## 10 59802 Direct from local farmers  Other                     
## # … with 315 more rows

Having examined the raw data in Excel closely, I was able to do some initial cleaning and wrangling immediately upon reading in the data. I renamed a few columns to make them more intuitive and easier to work with and selected only the columns I’m interested in. I’ve also filtered to include only those respondents who reside in Missoula and consider themselves a “consumer” in the context of the food system.

I’m now going to use the table() function to get a count of how many respondents who answered each way and use that to then calculate the percentage of respondents in each zip code who primarily source their food directly from farmers.

# use table to get counts
summerSource <- CFA %>%
  group_by(`Zip`) %>% # group by zip code
  select("Primary Source in Summer") %>% # select specific column
  table() %>%
  as.data.frame.matrix(row.names = FALSE) # coerce back to dataframe

# create vector of zip codes
Zip <- c("59801", "59802", "59803", "59804", "59808")

# bind zip code vector to front of dataframe
summerSource <- cbind(Zip, summerSource)

# mutate to get percentage of people in each zip code who primarily buy from local farmers in the summer
summerSource %>%
  mutate("Total" = `Direct from local farmers` + `Other`) %>%
  mutate("Percent Local" = round(`Direct from local farmers`/`Total` * 100)) %>%
  select(c("Zip", "Percent Local")) # use select to effectively remove unnecessary columns
##     Zip Percent Local
## 1 59801            64
## 2 59802            69
## 3 59803            56
## 4 59804            83
## 5 59808            48

Now that I have that, I’m going to join it to our density data in order to visualize the relationship between the two variables.

# join density and percent data
seasonFull <- full_join(zipDensityDT, summerSource, by = "Zip")

# create subset
summerDT <- seasonFull %>%
  select(c("Zip", "Summer Density", "Direct from local farmers")) %>%
  rename("Density" = "Summer Density",
         "Percent" = "Direct from local farmers") %>%
  replace_na(list("Density" = 0))

# view
summerDT
## # A tibble: 5 × 3
##   Zip   Density Percent
##   <chr>   <dbl>   <int>
## 1 59801    0.03      84
## 2 59802    0.11      74
## 3 59803    0.06      19
## 4 59804    0.53      19
## 5 59808    0         14

For each of the five zip codes that comprise the City of Missoula we can see:

  1. The density of local food retailers — Our independent variable, and
  2. The percentage of people who source their food primarily from local food retailers — Our dependent variable

Analysis

Let’s first take a look at the distribution of each of the variables.

# create col plot
ggplot(data = summerDT) +
  geom_col(mapping = aes(y = Percent, x = Density, fill = Zip)) +
  labs(title = "Density of Local Food Retailers and Sourcing Behavior in Missoula, MT", y = "Percent of People Primarily Sourcing Directly from Farms", x = "Density of Local Food Retailers") +
  theme_minimal()

This graphic clearly shows the difference in the minimum and maximum values for both variables. We can also use the summarize() function to calculate the actual values.

The minimum density is 0 and the maximum is 0.53.

The minimum percentage is 14 and the maximum is 84.

The range of both variables is quite large.

Since we’re ultimately interested in the correlation between the two variables, let’s look at the data in a scatterplot next.

# create scatterplot
pScatter <- ggplot(summerDT, aes(x = Density, y = Percent, label = Zip)) +
  geom_point() +
  geom_text(size = 3,
            nudge_x = 0,
            nudge_y = 2.5) +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Density of Local Food Retailers and Sourcing Behavior in Missoula, MT", y = "Percent of People Primarily Sourcing Directly from Farms", x = "Density of Local Food Retailers") +
  theme_minimal()

pScatter

Based on the above visualization, it appears that the variables are slightly inversely correlated. Using the cor() function to test for correlation gives us a correlation coefficient of -0.2956464, confirming the slight inverse correlation.

Discussion

It’s important to first note that the sample size in this analysis is extremely small. Too small, in fact, to reliably draw any conclusions about the relationship between the two variables. It is my intent that this analysis be used a framework for examining the relationship moving forward.

Where people source their food is likely informed by many variables not even touched upon here: age, gender, education, culture, income, hours worked per week, having a partner and/or dependents or not, etc. Any one of these, or any combination of these, could be a more significant predictor of whether someone will primarily source their food directly from local farmers.

Limitations and Challenges

  1. The sample represented here is likely not representative of Missoula as a whole. Outreach relied heavily upon mobilizing the networks of each individual board member and could have thus disproportionately reached people who are highly engaged with the local food system. If true, the real percentage of people who purchase directly from farmers could be much, much smaller. The number of respondents from each zip code were wildly different (and not proportional) as well.

  2. In the survey questions of interest, “primarily” was not defined so there is likely some variability in respondents’ threshold for answering those questions. E.g. If one respondent took “primarily” to mean > 95% and one took it to mean > 60% and both assessed their own percentage at 80%, they would have answered differently.

  3. Self-reporting is inherently problematic—people misremember and response bias is common. It’s possible that respondents assessed their own sourcing behavior a particular way because they assumed that was the desired answer.

Conclusion

The relationship between these two variables is worth analyzing in a more comprehensive way. As stated above, it’s my intent that this analysis serve as a starting point for doing just that. The FPAB plans to conduct a Rapid Community Food Assessment every year, which will allow us to gather more data from our own community. A comprehensive outreach plan will help us reach a more representative sample of our community.

I’d also like to see analyses of similar communities. The bulk of studies looking at retail food density come from urban centers, which differ from Missoula in some important ways. Nestled in a deeply rural state, nearly everyone in Missoula has a car (or access to one through a household member). Local food retailer density may simply not matter as much in a community in which most people are not constrained by distance or public transportation.

In conclusion, then, this area of study is ripe for further research and it is my hope that this analysis provides a framework for moving forward as we attempt to better understand the relationship between the density of local food retailers and sourcing behavior.


  1. Hadley Wickham and Garrett Grolemund, R for Data Science (Sebastopol: O’Reilly Media, Inc., 2017), 149.↩︎