This data comes from the U.S. Department of Agriculture's Economic Research Service and describes people across the country who have limited food access. According to the USDA, low food access is determined by accessibility to sources of healthy food, individual factors that may affect accessibility, and neighborhood-level indicators. The variables include vehicle access, housing data, and the number of children, seniors, and low-income individuals who are considered to have low food access. These groups are divided by their distance to a supermarket: beyond a half mile, one mile, 10 miles, and 20 miles. Using this dataset, I plan to investigate the relationship between low food access and low-income populations.
#loading the data and related libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 3142 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): County, State
dbl (23): Population, Housing Data.Residing in Group Quarters, Housing Data....
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(foodaccess) #brief glimpse of the dataset
# A tibble: 6 × 25
County Population State Housing Data.Residin…¹ Housing Data.Total H…²
<chr> <dbl> <chr> <dbl> <dbl>
1 Autauga County 54571 Alaba… 455 20221
2 Baldwin County 182265 Alaba… 2307 73180
3 Barbour County 27457 Alaba… 3193 9820
4 Bibb County 22915 Alaba… 2224 7953
5 Blount County 57322 Alaba… 489 21578
6 Bullock County 10914 Alaba… 1690 3745
# ℹ abbreviated names: ¹`Housing Data.Residing in Group Quarters`,
# ²`Housing Data.Total Housing Units`
# ℹ 20 more variables: `Vehicle Access.1 Mile` <dbl>,
# `Vehicle Access.1/2 Mile` <dbl>, `Vehicle Access.10 Miles` <dbl>,
# `Vehicle Access.20 Miles` <dbl>,
# `Low Access Numbers.Children.1 Mile` <dbl>,
# `Low Access Numbers.Children.1/2 Mile` <dbl>, …
names(foodaccess) #viewing the variables the dataset contains
#sorting
#note: the quoted name is read as a string constant rather than a column, so the row order is left unchanged
tinyfoodaccess <- foodaccess |>
  arrange(desc("Low Access Numbers.People.10.Miles")) |>
  #eliminating unnecessary columns and only keeping relevant ones
  select(!("Housing Data.Residing in Group Quarters":"Low Access Numbers.Low Income People.1/2 Mile")) |>
  #keep low income (10) column
  select(!("Low Access Numbers.Low Income People.20 Miles":"Low Access Numbers.People.1/2 Mile")) |>
  select(!("Low Access Numbers.People.20 Miles":"Low Access Numbers.Seniors.20 Miles"))
Linear Regression Analysis
For my linear regression, I chose to compare the rate of low-access individuals per population with the rate of low-income individuals per population, both at the 10-mile radius. Low food access is my dependent variable and low income is the independent variable. I expect the two variables to have a strong relationship, since having a low income can make it much more difficult to access food: low-income residents are less likely to be able to afford food or to have a means of transportation to their nearest supermarket.
names(tinyfoodaccess)[4] <- "LowAccessLowIncome10Miles"
names(tinyfoodaccess)[5] <- "LowAccessPeople10Miles"
tinyfoodaccess <- tinyfoodaccess |>
  #new column for % low access, 10 miles per county population
  mutate(lowaccessrate10 = (LowAccessPeople10Miles / Population) * 100) |>
  #new column for % low income, 10 miles
  mutate(lowincomerate10 = (LowAccessLowIncome10Miles / Population) * 100)
#getting rid of scientific notation
tinyfoodaccess$lowaccessrate10 <- format(tinyfoodaccess$lowaccessrate10, scientific = FALSE)
tinyfoodaccess$lowincomerate10 <- format(tinyfoodaccess$lowincomerate10, scientific = FALSE)
#format() returns character strings, so I'm switching them back to numeric
tinyfoodaccess$lowincomerate10 <- as.numeric(tinyfoodaccess$lowincomerate10)
tinyfoodaccess$lowaccessrate10 <- as.numeric(tinyfoodaccess$lowaccessrate10)
#round numbers to two decimal places
tinyfoodaccess$lowaccessrate10 <- round(tinyfoodaccess$lowaccessrate10, digits = 2)
tinyfoodaccess$lowincomerate10 <- round(tinyfoodaccess$lowincomerate10, digits = 2)
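As a quick sanity check on the rate calculation, here is a minimal base R sketch that recomputes the low-access rate for the first three counties, using the population and low-access counts that appear in the tibble printed later in this report. The data frame name `toy` is only illustrative. The sketch rounds the numeric values directly, on the assumption that scientific notation only affects how numbers are printed, not how they are stored.

```r
# First three counties, with values copied from the printed tibble in this report
toy <- data.frame(
  County = c("Autauga County", "Baldwin County", "Barbour County"),
  Population = c(54571, 182265, 27457),
  LowAccessPeople10Miles = c(5119, 2308, 4643)
)

# Percentage of each county's population with low access, rounded to two decimals
toy$lowaccessrate10 <- round(toy$LowAccessPeople10Miles / toy$Population * 100, digits = 2)

print(toy)
```

If the rounded values match the ones produced by the longer format-and-convert route, the shorter version may be enough for this dataset.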
Linear Model
This is the resulting linear model, with low food access (beyond 10 miles) as the dependent variable and low income (beyond 10 miles) as the independent variable. As expected, there is a strong correlation between the two. The p-value is below 2.2e-16, so the relationship is statistically significant. The adjusted R-squared is about 0.88, meaning the model explains roughly 88% of the variation in the low-access rate.
#create linear model and assign it to variable "foodlm"
foodlm <- lm(lowaccessrate10 ~ lowincomerate10, data = tinyfoodaccess)
summary(foodlm)
Call:
lm(formula = lowaccessrate10 ~ lowincomerate10, data = tinyfoodaccess)
Residuals:
Min 1Q Median 3Q Max
-64.635 -1.206 -1.090 0.391 61.693
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.19682 0.13468 8.886 <2e-16 ***
lowincomerate10 2.45275 0.01603 153.045 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.534 on 3140 degrees of freedom
Multiple R-squared: 0.8818, Adjusted R-squared: 0.8818
F-statistic: 2.342e+04 on 1 and 3140 DF, p-value: < 2.2e-16
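To make the coefficients in the summary above more concrete, here is a small worked example that plugs them into the fitted line by hand. The intercept and slope are copied from the output. The 10% input is a hypothetical county chosen only for illustration.

```r
# Coefficients copied from the summary(foodlm) output above
intercept <- 1.19682
slope <- 2.45275

# Hypothetical county: 10% of residents are low income and live beyond 10 miles
lowincomerate10 <- 10

# Fitted line: predicted low-access rate = intercept + slope * low-income rate
predicted <- intercept + slope * lowincomerate10
print(predicted)
```

Under this model, such a county would be expected to have a low-access rate of about 25.7%. With the fitted model object available, `predict(foodlm, newdata = data.frame(lowincomerate10 = 10))` should give essentially the same number.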
Data Visualization
#group states by region to add third variable
Northeast <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island",
               "Vermont", "New Jersey", "New York", "Pennsylvania")
Midwest <- c("Indiana", "Illinois", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas",
             "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota")
South <- c("Delaware", "District of Columbia", "Florida", "Georgia", "Maryland",
           "North Carolina", "South Carolina", "Virginia", "West Virginia", "Alabama",
           "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas")
West <- c("Arizona", "Colorado", "Idaho", "New Mexico", "Montana", "Utah", "Nevada",
          "Wyoming", "Alaska", "California", "Hawaii", "Oregon", "Washington")
regionlist <- list(Northeast = Northeast, Midwest = Midwest, South = South, West = West)
tinyfoodaccess <- tinyfoodaccess |>
  mutate(Region = "x")
tinyfoodaccess$Region <- sapply(tinyfoodaccess$State, function(x) names(regionlist)[grepl(x, regionlist)])
tibble(tinyfoodaccess)
# A tibble: 3,142 × 8
County Population State LowAccessLowIncome10…¹ LowAccessPeople10Miles
<chr> <dbl> <chr> <dbl> <dbl>
1 Autauga County 54571 Alab… 2307 5119
2 Baldwin County 182265 Alab… 846 2308
3 Barbour County 27457 Alab… 2440 4643
4 Bibb County 22915 Alab… 102 365
5 Blount County 57322 Alab… 0 0
6 Bullock County 10914 Alab… 1267 2586
7 Butler County 20947 Alab… 556 1334
8 Calhoun County 118572 Alab… 0 0
9 Chambers Coun… 34215 Alab… 292 680
10 Cherokee Coun… 25989 Alab… 34 91
# ℹ 3,132 more rows
# ℹ abbreviated name: ¹LowAccessLowIncome10Miles
# ℹ 3 more variables: lowaccessrate10 <dbl>, lowincomerate10 <dbl>,
# Region <chr>
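The region assignment above relies on `grepl()`, which does pattern matching rather than exact comparison, so a state name contained inside another one (for example "Kansas" inside "Arkansas") can match more than one region. A possible alternative, sketched here in base R, is a named lookup vector that matches state names exactly. The shortened `regionlist`, the `lookup` vector, and the `states` vector are illustrative stand-ins, not the full lists from this report.

```r
# Two shortened regions for illustration only; the full lists are defined above
regionlist <- list(
  Midwest = c("Kansas", "Iowa"),
  South   = c("Arkansas", "Texas")
)

# Named vector: names are state names, values are region names
lookup <- setNames(
  rep(names(regionlist), lengths(regionlist)),
  unlist(regionlist, use.names = FALSE)
)

# Hypothetical State column
states <- c("Kansas", "Arkansas", "Iowa")

# Indexing by name matches whole strings only
region <- unname(lookup[states])
print(region)

# For comparison: substring matching treats "Kansas" as found in "Arkansas"
print(grepl("Kansas", "Arkansas"))
```

With exact matching, each state maps to one region, so the result stays a plain character vector that can be stored directly in a `Region` column.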
#scatterplot of low income vs. low access
fa <- tinyfoodaccess |>
  ggplot(aes(lowaccessrate10, lowincomerate10, text = paste("State:", State, "\nCounty:", County))) +
  geom_point(aes(color = Region)) +
  labs(x = "% of Population with Low Food Access",
       y = "% of Population with Low Income",
       title = "Low Food Access Data by County",
       caption = "Source: U.S. Department of Agriculture - Economic Research Service") +
  scale_color_brewer(palette = "PuRd") +
  theme_bw()
ggplotly(fa)
This dataset contains information on populations that have low food access in every county in the U.S., organized by residents' proximity to a supermarket, grocery store, or other source of healthy food. Residents are grouped by whether they live beyond a half mile, one mile, ten miles, or twenty miles from a healthy food source. For my project, I chose to focus on the ten-mile level. Since I was comparing that variable with the low-income population count, I dropped all columns except the total low-access population count per county and the low-income population count per county, both at the ten-mile level. Initially, I wanted to compare the low-access population with the group housing data to find a connection, as those who live in group quarters almost always have food provided to them. However, because the group housing data is calculated by households instead of individuals, my linear regression analysis became a bit inconsistent, so I switched my focus to low income.
My visualization is a scatterplot comparing, for each county, the percentage of the population that is both low income and low access with the percentage that is low access overall. To keep as much data as possible, I didn't want to filter down to a few states, but I did want to include region as a third variable, which I did by adding a column that assigns each county a region based on its state. In the graph, practically all the counties in the Northeast have very low rates of low food access, although the region also has fewer counties overall. The Midwest and South are more spread out, but overall these variables have a positive linear relationship. Across counties, the rates of low food access and low-income populations move closely together.
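One way to back up the regional pattern described above with numbers would be to average the low-access rate within each region. The sketch below uses base R's `aggregate()` on a tiny made-up data frame. The values in `toy` are hypothetical and are not taken from the USDA data; only the column names mirror the ones used in this report.

```r
# Hypothetical rows, for illustration only (not real county values)
toy <- data.frame(
  Region = c("Northeast", "Northeast", "South", "South"),
  lowaccessrate10 = c(0.5, 1.5, 10, 20)
)

# Mean low-access rate per region
by_region <- aggregate(lowaccessrate10 ~ Region, data = toy, FUN = mean)
print(by_region)
```

Run against `tinyfoodaccess` instead of `toy`, the same call would give one average rate per region, which could be compared with what the scatterplot suggests.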