My data set contains information about US county’s ability to access supermarkets, supercenters, grocery stores, or other sources of healthy and affordable food. The name of the data set is food_acces.csv. The organization that collected the data is the United States Department of Agriculture’s Economic Research Service.
The data set additionally has information on measures of how individuals and neighborhoods are able to access food are based on the following indicators:
Accessibility to sources of healthy food, as measured by distance to a store or by the number of stores in an area. - Individual-level resources that may affect accessibility, such as family income or vehicle availability. - Neighborhood-level indicators of resources, such as the average income of the neighborhood and the availability of public transportation.
My data set has 3142 observations and 13 variables.
Background on data and why I choose it:
According to (World Bank Group, 2024), food security is defined when all people, at all times, have physical and economic access to sufficient safe and nutritious food that meets their dietary needs and food preferences for an active and healthy life. The reason why I choose this data is because food security is a real issue that affects millions people around the world and this data set observes, number of stores, family income, state, population vehicle transport all which affect food security which I found to be impactful.
World Bank Group. (2024). What is food security. In World Bank. https://www.worldbank.org/en/topic/agriculture/brief/food-security-update/what-is-food-security
Methodology:
The USDA’s (United States Department of Agriculture) methodology on their website is: They first collected geographic data from 2010 census and other sources not listed. Next, they used distance to stores, the number of stores and availability of transport to measure for accessibility. Futhermore, they aggregated the individual and community factors such as vehicle access and income. Lastly, they cleaned the data and standardized the values for accuracy.
(Food Access Research Atlas - About the Atlas | Economic Research Service, n.d.)
Variables used:
state county population low_access_numbers_children_1_mile low_access_numbers_people_1_mile low_access_numbers_seniors_1_mile low_access_numbers_low_income_people_1_mile vehicle_access_1_mile
Research question:
“How does the percentage of people with low access to supermarkets (within 1 mile) vary across U.S. Virginia, Maryland and Pensslyvania states?”
Loading library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
In order to explore my potential predictors, I used a correlation heatmap of all numeric variables. The heatmap displayed a very strong correlation between, low_access_numbers_children_1_mile, low_access_numbers_seniors_1_mile and low_access_numbers_people_1_2_mile and since they are the same variable calculated at different distances and age groups it would be redundant and show co linearity they were not good predictors for my final model. The predictors were selected because they had the highest correlation to low access to supermarket (1 mile) and were exploring different aspects such as population demographics, transport and economic disadvantages. Based on my exploration, population, vehicle_access_1_mile, low_access_numbers_low_income_people_1_mile were the best fitted predictors for low_access_numbers_people_1_mile.
Multiple Regression Model
model<-lm( low_access_numbers_people_1_mile ~ population + vehicle_access_1_mile+ low_access_numbers_low_income_people_1_mile , data = food_access_2)autoplot(model, 1:4,nrow=2,ncol=2) ##Got this from correlation scatter plots and regressions tutorial, to see the diagnostic plots
Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Summary of model
summary(model)
Call:
lm(formula = low_access_numbers_people_1_mile ~ population +
vehicle_access_1_mile + low_access_numbers_low_income_people_1_mile,
data = food_access_2)
Residuals:
Min 1Q Median 3Q Max
-198059 -5256 -1704 849 174853
Coefficients:
Estimate Std. Error t value
(Intercept) 1.680e+03 4.440e+02 3.784
population 4.624e-02 1.630e-03 28.373
vehicle_access_1_mile 1.687e+01 7.153e-01 23.592
low_access_numbers_low_income_people_1_mile 1.937e+00 4.618e-02 41.953
Pr(>|t|)
(Intercept) 0.000157 ***
population < 2e-16 ***
vehicle_access_1_mile < 2e-16 ***
low_access_numbers_low_income_people_1_mile < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20590 on 3138 degrees of freedom
Multiple R-squared: 0.899, Adjusted R-squared: 0.8989
F-statistic: 9314 on 3 and 3138 DF, p-value: < 2.2e-16
Linear regression analysis
This regression model predicts for the number of people in a county that are living 1 mile away from a supermarket based on the predictors population size,vehicle availability, and the number of low-income residents that have limited food access.
Model equation
The model equation for this multiple linear regression is as follows: y= a+b1x1+b2x2+ b3x3
where: a= intercept y= low_access_numbers_people_1_mile b1= effect of population b2= effect of vehicle access(1 mile) b3= effect of low-income population with low food access (1 mile)
Fitted equation low_access_numbers_people_1_mile= 1680+ 0.046(population)+16.87(vehichle acces (1 mile))+ 1.94 (low-income population with low food access)
Interpretation This means that: The number of people with low food access increase by about 0.046 for each additionally person in a county this a small effect per person but meaningful overall. The number of people with low food access increases as people without vehicle access increases. Low income population has a moderate effect on people with low food access, number of people with low food access increases as the low income population increases.
P-value and adjusted R^2 value
The three predictors were all very statistically significant with p value= < 2e-16. Thus, showing strong evidence that vehicle access, population and income are strongly associated with number of people with low food access.
The Adjusted R-squared is 0.8989 for this linear regression model. This means that the model explains 89.89% of the variation in food access across the country. The model overall is statistically significant at p value =< 2.2e-16.
Diagnostic plots
Residuals vs fitted plot : the blue line is slightly bent downwards, the residuals are centered around 0 however the wider spread of residuals are around the counties with large fitted values suggesting Heteroscedasticity in extreme high fitted residual counties.
Normal Q-Q plot : Majority of the points on the graph follow the diagonal line, meaning the residuals are approximately normal, but there are a few outliers on both tails (such as observations 205, 611,and 2681) that deviate from normality.
Scale-Location plot : The blue line upwards shows an increase in variance and the spread of residuals increases with fitted values, indicating that there is mild heteroscedasticity.
Cook’s Distance : some of the few points that are influential and could affect the model results are (205,611, and 2631) indicating counties that strongly affect the model but dont invalidate the entire model.
Source of information:
https://rpubs.com/rsaidi/950425
Final visualization 1:
food_access_filtered <-food_access_2 |>filter(population>=10000) |>filter(state %in%c("Virginia", "Maryland", "Pennsylvania")) |>select(county, state, population, vehicle_access_1_mile,low_access_numbers_people_1_mile, low_access_numbers_low_income_people_1_mile) # I choose 10,000 since I didnt want small counties skewing resultsfood_access_filtered
# A tibble: 204 × 6
county state population vehicle_access_1_mile low_access_numbers_p…¹
<chr> <chr> <dbl> <dbl> <dbl>
1 Allegany County Mary… 75087 1051 46218
2 Anne Arundel C… Mary… 537656 2276 209639
3 Baltimore Coun… Mary… 805029 4094 228432
4 Baltimore city Mary… 620961 2867 28080
5 Calvert County Mary… 88737 583 71158
6 Caroline County Mary… 33066 456 25269
7 Carroll County Mary… 167134 1320 108830
8 Cecil County Mary… 101108 1033 73410
9 Charles County Mary… 146551 850 94730
10 Dorchester Cou… Mary… 32618 543 22745
# ℹ 194 more rows
# ℹ abbreviated name: ¹low_access_numbers_people_1_mile
# ℹ 1 more variable: low_access_numbers_low_income_people_1_mile <dbl>
Making percent of low access
food_access_filtered<-food_access_filtered |>mutate(percent_low_access= (low_access_numbers_low_income_people_1_mile/population)*100) # I choose percents since its gives a fairer comparisonfood_access_filtered
# A tibble: 204 × 7
county state population vehicle_access_1_mile low_access_numbers_p…¹
<chr> <chr> <dbl> <dbl> <dbl>
1 Allegany County Mary… 75087 1051 46218
2 Anne Arundel C… Mary… 537656 2276 209639
3 Baltimore Coun… Mary… 805029 4094 228432
4 Baltimore city Mary… 620961 2867 28080
5 Calvert County Mary… 88737 583 71158
6 Caroline County Mary… 33066 456 25269
7 Carroll County Mary… 167134 1320 108830
8 Cecil County Mary… 101108 1033 73410
9 Charles County Mary… 146551 850 94730
10 Dorchester Cou… Mary… 32618 543 22745
# ℹ 194 more rows
# ℹ abbreviated name: ¹low_access_numbers_people_1_mile
# ℹ 2 more variables: low_access_numbers_low_income_people_1_mile <dbl>,
# percent_low_access <dbl>