My data set contains information about US county’s ability to access supermarkets, supercenters, grocery stores, or other sources of healthy and affordable food. The name of the data set is food_acces.csv. The organization that collected the data is the United States Department of Agriculture’s Economic Research Service.
The data set additionally has information on measures of how individuals and neighborhoods are able to access food are based on the following indicators:
Accessibility to sources of healthy food, as measured by distance to a store or by the number of stores in an area.
Individual-level resources that may affect accessibility, such as family income or vehicle availability.
Neighborhood-level indicators of resources, such as the average income of the neighborhood and the availability of public transportation.
(Food Access Research Atlas - Download the Data | Economic Research Service, 2021)
My data set has 3142 observations and 13 variables.
Background on data and why I choose it:
According to (World Bank Group, 2024), food security is defined when all people, at all times, have physical and economic access to sufficient safe and nutritious food that meets their dietary needs and food preferences for an active and healthy life. The reason why I choose this data is because food security is a real issue that affects millions people around the world and this data set observes, number of stores, family income, state, population vehicle transport all which affect food security which I found to be impactful.
World Bank Group. (2024). What is food security. In World Bank. https://www.worldbank.org/en/topic/agriculture/brief/food-security-update/what-is-food-security
Methodology:
The USDA’s (United States Department of Agriculture) methodology on their website is: They first collected geographic data from 2010 census and other sources not listed. Next, they used distance to stores, the number of stores and availability of transport to measure for accessibility. Futhermore, they aggregated the individual and community factors such as vehicle access and income. Lastly, they cleaned the data and standardized the values for accuracy.
(Food Access Research Atlas - About the Atlas | Economic Research Service, n.d.)
Variables used:
Variables
Definition
State (categorical)
The name of the U.S state where the county is located.
County (categorical)
The county that is named in the data set.
Population (quantitative)
The total accumulation of number of residents living the county.
low_access_numbers_people_1_mile (quantitative)
The number of people who live in a county and live more that 1 mile away from a food store.
The number of people who live in a county and live more that 1 mile away from a food store and are low income.
vehicle_access_1_mile (quantitative)
Housing units without vehicle count beyond 1 mile from supermarket
Research question:
How does the percentage of people with low access to supermarkets (beyond a mile) vary across the states Virginia, Maryland and Pensslyvania?
Loading library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
In order to explore my potential predictors, I used a correlation heatmap of all numeric variables. The heatmap displayed a very strong correlation between, low_access_numbers_children_1_mile, low_access_numbers_seniors_1_mile and low_access_numbers_people_1_2_mile and since they are the same variable calculated at different distances and age groups it would be redundant and show co linearity they were not good predictors for my final model. The predictors were selected because they had the highest correlation to low access to supermarket (1 mile) and were exploring different aspects such as population demographics, transport and economic disadvantages. Based on my exploration, population, vehicle_access_1_mile, low_access_numbers_low_income_people_1_mile were the best fitted predictors for low_access_numbers_people_1_mile.
Multiple Regression Model
model<-lm( low_access_numbers_people_1_mile ~ population + vehicle_access_1_mile+ low_access_numbers_low_income_people_1_mile , data = food_access_2)autoplot(model, 1:4,nrow=2,ncol=2) ##Got this from correlation scatter plots and regressions tutorial, to see the diagnostic plots
Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Summary of model
summary(model)
Call:
lm(formula = low_access_numbers_people_1_mile ~ population +
vehicle_access_1_mile + low_access_numbers_low_income_people_1_mile,
data = food_access_2)
Residuals:
Min 1Q Median 3Q Max
-198059 -5256 -1704 849 174853
Coefficients:
Estimate Std. Error t value
(Intercept) 1.680e+03 4.440e+02 3.784
population 4.624e-02 1.630e-03 28.373
vehicle_access_1_mile 1.687e+01 7.153e-01 23.592
low_access_numbers_low_income_people_1_mile 1.937e+00 4.618e-02 41.953
Pr(>|t|)
(Intercept) 0.000157 ***
population < 2e-16 ***
vehicle_access_1_mile < 2e-16 ***
low_access_numbers_low_income_people_1_mile < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20590 on 3138 degrees of freedom
Multiple R-squared: 0.899, Adjusted R-squared: 0.8989
F-statistic: 9314 on 3 and 3138 DF, p-value: < 2.2e-16
Linear regression analysis
This regression model predicts for the number of people in a county that are living 1 mile away from a supermarket based on the predictors population size,vehicle availability, and the number of low-income residents that have limited food access.
Model equation
The model equation for this multiple linear regression is as follows: y= a+b1x1+b2x2+ b3x3
where: a= intercept y= low_access_numbers_people_1_mile b1= effect of population b2= effect of vehicle access(1 mile) b3= effect of low-income population with low food access (1 mile)
Fitted equation
low_access_numbers_people_1_mile= 1680+ 0.046(population)+16.87(vehichle acces (1 mile))+ 1.94 (low-income population with low food access)
Interpretation This means that: The number of people with low food access increase by about 0.046 for each additionally person in a county this a small effect per person but meaningful overall. The number of people with low food access increases as people without vehicle access increases. Low income population has a moderate effect on people with low food access, number of people with low food access increases as the low income population increases.
P-value and adjusted R^2 value
The three predictors were all very statistically significant with p value= < 2e-16. Thus, showing strong evidence that vehicle access, population and income are strongly associated with number of people with low food access.
The Adjusted R-squared is 0.8989 for this linear regression model. This means that the model explains 89.89% of the variation in food access across the country. The model overall is statistically significant at p value =< 2.2e-16.
Diagnostic plots
Residuals vs fitted plot : the blue line is slightly bent downwards, the residuals are centered around 0 however the wider spread of residuals are around the counties with large fitted values suggesting Heteroscedasticity in extreme high fitted residual counties.
Normal Q-Q plot : Majority of the points on the graph follow the diagonal line, meaning the residuals are approximately normal, but there are a few outliers on both tails (such as observations 205, 611,and 2681) that deviate from normality.
Scale-Location plot : The blue line upwards shows an increase in variance and the spread of residuals increases with fitted values, indicating that there is mild heteroscedasticity.
Cook’s Distance : some of the few points that are influential and could affect the model results are (205,611, and 2631) indicating counties that strongly affect the model but dont invalidate the entire model.
Source of information:
https://rpubs.com/rsaidi/950425
Final visualization 1:
food_access_filtered <-food_access_2 |>filter(population>=10000) |>filter(state %in%c("Virginia", "Maryland", "Pennsylvania")) |>select(county, state, population, vehicle_access_1_mile,low_access_numbers_people_1_mile, low_access_numbers_low_income_people_1_mile) # I choose 10,000 since I didnt want small counties skewing resultsfood_access_filtered
# A tibble: 204 × 6
county state population vehicle_access_1_mile low_access_numbers_p…¹
<chr> <chr> <dbl> <dbl> <dbl>
1 Allegany County Mary… 75087 1051 46218
2 Anne Arundel C… Mary… 537656 2276 209639
3 Baltimore Coun… Mary… 805029 4094 228432
4 Baltimore city Mary… 620961 2867 28080
5 Calvert County Mary… 88737 583 71158
6 Caroline County Mary… 33066 456 25269
7 Carroll County Mary… 167134 1320 108830
8 Cecil County Mary… 101108 1033 73410
9 Charles County Mary… 146551 850 94730
10 Dorchester Cou… Mary… 32618 543 22745
# ℹ 194 more rows
# ℹ abbreviated name: ¹low_access_numbers_people_1_mile
# ℹ 1 more variable: low_access_numbers_low_income_people_1_mile <dbl>
Making percent of low access
food_access_filtered<-food_access_filtered |>mutate(percent_low_access= (low_access_numbers_low_income_people_1_mile/population)*100) # I choose percents since its gives a fairer comparisonfood_access_filtered
# A tibble: 204 × 7
county state population vehicle_access_1_mile low_access_numbers_p…¹
<chr> <chr> <dbl> <dbl> <dbl>
1 Allegany County Mary… 75087 1051 46218
2 Anne Arundel C… Mary… 537656 2276 209639
3 Baltimore Coun… Mary… 805029 4094 228432
4 Baltimore city Mary… 620961 2867 28080
5 Calvert County Mary… 88737 583 71158
6 Caroline County Mary… 33066 456 25269
7 Carroll County Mary… 167134 1320 108830
8 Cecil County Mary… 101108 1033 73410
9 Charles County Mary… 146551 850 94730
10 Dorchester Cou… Mary… 32618 543 22745
# ℹ 194 more rows
# ℹ abbreviated name: ¹low_access_numbers_people_1_mile
# ℹ 2 more variables: low_access_numbers_low_income_people_1_mile <dbl>,
# percent_low_access <dbl>
Final visualization: static
options (scipen=999) # to remove significant figuresfinal_plot_static <-ggplot (food_access_filtered, aes(x= vehicle_access_1_mile, y= low_access_numbers_people_1_mile, color = percent_low_access, )) +geom_point(alpha=0.6) +scale_x_log10()+scale_y_log10()+labs(x="Households without vehicle beyond 1 mile(log-scale)", y="People with low food access 1 mile (log-scale per 100)",title=" Vehicle access vs low food access Mid-atlantic (log scale)",caption="Source:United States Department of Agriculture’s Economic Research Service.", color="% Low Access", )+facet_wrap(~state)+scale_colour_viridis_c()+theme_bw()+theme(plot.title =element_text(face="bold",size=14,hjust=0.5,family ="mono"),axis.text.x =element_text(size =9),axis.text.y =element_text(size =9))final_plot_static
Final visualization 1: using plotly
options (scipen=999) # to remove significant figuresfinal_plot <-ggplot(food_access_filtered, aes(x= vehicle_access_1_mile, y= low_access_numbers_people_1_mile, color = percent_low_access,text=paste( "County:",county, "\n","People with low food access:", low_access_numbers_people_1_mile, "\n","No vehicle household:",vehicle_access_1_mile, "\n","low access:",round(percent_low_access, 1)))) +geom_point(alpha=0.6) +scale_x_log10()+scale_y_log10()+labs(x="Households without vehicle beyond 1 mile(log-scale)", y="People with low food access 1 mile (log-scale per 100)",title=" Vehicle access vs low food access Mid-atlantic (log scale)",caption="Source:United States Department of Agriculture’s Economic Research Service.", color="% Low Access", )+facet_wrap(~state)+scale_colour_viridis_c()+#https://ggplot2.tidyverse.org/reference/scale_viridis.html found the colour theme using the reference.theme_bw()+theme(plot.title =element_text(face="bold",size=14,hjust=0.5,family ="mono"),axis.text.x =element_text(size =9),axis.text.y =element_text(size =9))final_plot<-ggplotly(final_plot, tooltip="text")
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
final_plot
What the visualization shows
In this visualization the x-axis showed the number of households that do not have vehicle access and the y access showed the number of people that do not have access to a grocery store less than 1 mile away.I used the log scale to remove the clustering of the small counties on the left side of the graph. The color was a scale that showed the highest percentages of low food access for each counties.I was surprised to see that all three states had similar trends, even though Virginia had more counties that didn’t disrupt the number of people with low food access or affect the correlation. (The conclusion has more information)
# A tibble: 3 × 3
state average_pct_low_access access_level
<chr> <dbl> <chr>
1 Maryland 12.0 Low
2 Pennsylvania 16.8 Medium
3 Virginia 19.6 High
Visualization 2
ggplot(state_summary, aes(x = state, y =average_pct_low_access, fill = access_level)) +geom_col(width =0.7) +labs( title ="Average Percent Of Low Food Access vs State",x="State",y="Average Percent of Population with Low Food Access",caption="Source: USDA Economic Research with Low Food Access",fill ="food access level") +scale_fill_brewer(palette ="Set2") +theme_bw() +theme( plot.title =element_text(hjust =0.5, face="bold"),legend.position ="right")
What the visualization shows:
The bar chart above show the comparison of the average percentage of residents with low food access (1 mile away from supermarket) across the states; Maryland, Virginia and Pennsylvania. Each bar on the chart indicates the average (mean) percentage of people living 1 mile away from a supermarket. The color’s represent the levels of food access based on the data observed. Virginia had the highest average percent of low food access with Maryland is the lowest. The differences highlight that there is more people in Virginia suffering from low food access.
Disclaimer The access levels are categorized to understand the data better and are not based off official policy thresholds.
Conclusion:
My faceted scatterplot highlighted a strong linear correlation between vehicle access and low food access in Maryland, Virginia, and Pennsylvania. The x-axis showed the number of households that do not have vehicle access and the y access showed the number of people that do not have access to a grocery store less than 1 mile away.I used the log scale to remove the clustering of the small counties on the left side of the graph. The color was a scale that showed the highest percentages of low food access for each counties.I was surprised to see that all three states had similar trends, even though Virginia had more counties that didn’t disrupt the number of people with low food access or affect the correlation. Virginia has the highest number of low food access out of the three states. This visualization suggested that transportation accessibility has an effect on the number of people with access to food.
The bar chart above show the comparison of the average percentage of residents with low food access (1 mile away from supermarket) across the states; Maryland, Virginia and Pennsylvania. Each bar on the chart indicates the average (mean) percentage of people living 1 mile away from a supermarket. The color’s represent the levels of food access based on the data observed. Virginia had the highest average percent of low food access with Maryland is the lowest. The differences highlight that there is more people in Virginia suffering from low food access.
How I cleaned the data set.
I cleaned the data set by first changing all my names of the variables to lowercase.This was done by first going into the data set and then making all of them lowercase with ‘tolower’ “names(food_acess) <- tolower(names(food_access))”. Next, I checked for NA’s using ‘colSums’ and since there were no NA’s in my data set I did not have to filter anything. Lastly, I checked for what unique states were in the data set to see if the 3 states I wanted to investigate were there, and I filtered for Maryland,Virginia and Pennsylvania. Additionally, I filtered for population>= 10000 since less than that would be too small of population and I selected my variables. Lastly, I summarized, for mean percent low acess and used mutate and case_when to group the access levels.
Additional improvements I thought about including but couldn’t:
I thought about initially including the entire DMV, however, District of Columbia only had one row which was too little information for me to investigate and would misconstrue the data presented. Also, I wanted to use highcharter and plot a bubble graph for the different groups children, seniors and low- income people with low access to food, however when I used it in the same document as plotly my document would not render. I then tried a bar graph with high charter but it would not show a legend therefore I got my second model after different attempts.