Final project 2

Author

Ayan Elmi

https://picryl.com/media/good-food-display-nci-visuals-online-bb10ce

source: https://picryl.com/media/good-food-display-nci-visuals-online-bb10ce

Introduction

My data set contains information about US county’s ability to access supermarkets, supercenters, grocery stores, or other sources of healthy and affordable food. The name of the data set is food_acces.csv. The organization that collected the data is the United States Department of Agriculture’s Economic Research Service.

The data set additionally has information on measures of how individuals and neighborhoods are able to access food are based on the following indicators:

Accessibility to sources of healthy food, as measured by distance to a store or by the number of stores in an area.

  • Individual-level resources that may affect accessibility, such as family income or vehicle availability.

  • Neighborhood-level indicators of resources, such as the average income of the neighborhood and the availability of public transportation.

(Food Access Research Atlas - Download the Data | Economic Research Service, 2021)

My data set has 3142 observations and 13 variables.

Background on data and why I choose it:

According to (World Bank Group, 2024), food security is defined when all people, at all times, have physical and economic access to sufficient safe and nutritious food that meets their dietary needs and food preferences for an active and healthy life. The reason why I choose this data is because food security is a real issue that affects millions people around the world and this data set observes, number of stores, family income, state, population vehicle transport all which affect food security which I found to be impactful.

World Bank Group. (2024). What is food security. In World Bank. https://www.worldbank.org/en/topic/agriculture/brief/food-security-update/what-is-food-security

Methodology:

The USDA’s (United States Department of Agriculture) methodology on their website is: They first collected geographic data from 2010 census and other sources not listed. Next, they used distance to stores, the number of stores and availability of transport to measure for accessibility. Futhermore, they aggregated the individual and community factors such as vehicle access and income. Lastly, they cleaned the data and standardized the values for accuracy.

(Food Access Research Atlas - About the Atlas | Economic Research Service, n.d.)

Variables used:

Variables Definition
State (categorical) The name of the U.S state where the county is located.
County (categorical) The county that is named in the data set.
Population (quantitative) The total accumulation of number of residents living the county.
low_access_numbers_people_1_mile (quantitative) The number of people who live in a county and live more that 1 mile away from a food store.
low_access_numbers_low_income_people_1_mile (quantitative) The number of people who live in a county and live more that 1 mile away from a food store and are low income.
vehicle_access_1_mile (quantitative) Housing units without vehicle count beyond 1 mile from supermarket

Research question:

How does the percentage of people with low access to supermarkets (beyond a mile) vary across the states Virginia, Maryland and Pensslyvania?

Loading library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Setting working directory

setwd("~/Desktop/Data 110")
food_access<-read_csv("food_access (1).csv")

Checking the head and structure of the dataset

#str(food_access) checked the structure but did this for rendering.
head(food_access)
# A tibble: 6 × 25
  County         Population State  Housing Data.Residin…¹ Housing Data.Total H…²
  <chr>               <dbl> <chr>                   <dbl>                  <dbl>
1 Autauga County      54571 Alaba…                    455                  20221
2 Baldwin County     182265 Alaba…                   2307                  73180
3 Barbour County      27457 Alaba…                   3193                   9820
4 Bibb County         22915 Alaba…                   2224                   7953
5 Blount County       57322 Alaba…                    489                  21578
6 Bullock County      10914 Alaba…                   1690                   3745
# ℹ abbreviated names: ¹​`Housing Data.Residing in Group Quarters`,
#   ²​`Housing Data.Total Housing Units`
# ℹ 20 more variables: `Vehicle Access.1 Mile` <dbl>,
#   `Vehicle Access.1/2 Mile` <dbl>, `Vehicle Access.10 Miles` <dbl>,
#   `Vehicle Access.20 Miles` <dbl>,
#   `Low Access Numbers.Children.1 Mile` <dbl>,
#   `Low Access Numbers.Children.1/2 Mile` <dbl>, …

Cleaning the data

names(food_access) <- gsub("[(). \\-]", "_", names(food_access)) # replace ., (), space, with dash
names(food_access) <- gsub("_$", "", names(food_access))  # remove trailing underscore
names(food_access) <- tolower(names(food_access))         # lowercase

head(food_access) #verify
# A tibble: 6 × 25
  county         population state  housing_data_residin…¹ housing_data_total_h…²
  <chr>               <dbl> <chr>                   <dbl>                  <dbl>
1 Autauga County      54571 Alaba…                    455                  20221
2 Baldwin County     182265 Alaba…                   2307                  73180
3 Barbour County      27457 Alaba…                   3193                   9820
4 Bibb County         22915 Alaba…                   2224                   7953
5 Blount County       57322 Alaba…                    489                  21578
6 Bullock County      10914 Alaba…                   1690                   3745
# ℹ abbreviated names: ¹​housing_data_residing_in_group_quarters,
#   ²​housing_data_total_housing_units
# ℹ 20 more variables: vehicle_access_1_mile <dbl>,
#   `vehicle_access_1/2_mile` <dbl>, vehicle_access_10_miles <dbl>,
#   vehicle_access_20_miles <dbl>, low_access_numbers_children_1_mile <dbl>,
#   `low_access_numbers_children_1/2_mile` <dbl>,
#   low_access_numbers_children_10_miles <dbl>, …

Checking for Na’s

colSums(is.na(food_access))
                                       county 
                                            0 
                                   population 
                                            0 
                                        state 
                                            0 
      housing_data_residing_in_group_quarters 
                                            0 
             housing_data_total_housing_units 
                                            0 
                        vehicle_access_1_mile 
                                            0 
                      vehicle_access_1/2_mile 
                                            0 
                      vehicle_access_10_miles 
                                            0 
                      vehicle_access_20_miles 
                                            0 
           low_access_numbers_children_1_mile 
                                            0 
         low_access_numbers_children_1/2_mile 
                                            0 
         low_access_numbers_children_10_miles 
                                            0 
         low_access_numbers_children_20_miles 
                                            0 
  low_access_numbers_low_income_people_1_mile 
                                            0 
low_access_numbers_low_income_people_1/2_mile 
                                            0 
low_access_numbers_low_income_people_10_miles 
                                            0 
low_access_numbers_low_income_people_20_miles 
                                            0 
             low_access_numbers_people_1_mile 
                                            0 
           low_access_numbers_people_1/2_mile 
                                            0 
           low_access_numbers_people_10_miles 
                                            0 
           low_access_numbers_people_20_miles 
                                            0 
            low_access_numbers_seniors_1_mile 
                                            0 
          low_access_numbers_seniors_1/2_mile 
                                            0 
          low_access_numbers_seniors_10_miles 
                                            0 
          low_access_numbers_seniors_20_miles 
                                            0 

Checking the number of states

#unique(states) 
#I checked for the number of states in the data set but I commented it out for rendering

Selecting possible predictors for the model

food_access_2<-food_access |>
  select (state,county, population, low_access_numbers_children_1_mile, low_access_numbers_people_1_mile, low_access_numbers_seniors_1_mile, low_access_numbers_low_income_people_1_mile,vehicle_access_1_mile)
head(food_access_2)
# A tibble: 6 × 8
  state   county        population low_access_numbers_c…¹ low_access_numbers_p…²
  <chr>   <chr>              <dbl>                  <dbl>                  <dbl>
1 Alabama Autauga Coun…      54571                   9973                  37424
2 Alabama Baldwin Coun…     182265                  30633                 132442
3 Alabama Barbour Coun…      27457                   3701                  19007
4 Alabama Bibb County        22915                   4198                  17560
5 Alabama Blount County      57322                  12575                  50848
6 Alabama Bullock Coun…      10914                   1872                   8709
# ℹ abbreviated names: ¹​low_access_numbers_children_1_mile,
#   ²​low_access_numbers_people_1_mile
# ℹ 3 more variables: low_access_numbers_seniors_1_mile <dbl>,
#   low_access_numbers_low_income_people_1_mile <dbl>,
#   vehicle_access_1_mile <dbl>

Reason for the predictors chosen

In order to explore my potential predictors, I used a correlation heatmap of all numeric variables. The heatmap displayed a very strong correlation between, low_access_numbers_children_1_mile, low_access_numbers_seniors_1_mile and low_access_numbers_people_1_2_mile and since they are the same variable calculated at different distances and age groups it would be redundant and show co linearity they were not good predictors for my final model. The predictors were selected because they had the highest correlation to low access to supermarket (1 mile) and were exploring different aspects such as population demographics, transport and economic disadvantages. Based on my exploration, population, vehicle_access_1_mile, low_access_numbers_low_income_people_1_mile were the best fitted predictors for low_access_numbers_people_1_mile.

Multiple Regression Model

model<- lm(  low_access_numbers_people_1_mile  ~ population + vehicle_access_1_mile+ low_access_numbers_low_income_people_1_mile  , data = food_access_2)
autoplot(model, 1:4,nrow=2,ncol=2) ##Got this from correlation scatter plots and regressions tutorial, to see the diagnostic plots
Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Summary of model

summary(model)

Call:
lm(formula = low_access_numbers_people_1_mile ~ population + 
    vehicle_access_1_mile + low_access_numbers_low_income_people_1_mile, 
    data = food_access_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-198059   -5256   -1704     849  174853 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                 1.680e+03  4.440e+02   3.784
population                                  4.624e-02  1.630e-03  28.373
vehicle_access_1_mile                       1.687e+01  7.153e-01  23.592
low_access_numbers_low_income_people_1_mile 1.937e+00  4.618e-02  41.953
                                            Pr(>|t|)    
(Intercept)                                 0.000157 ***
population                                   < 2e-16 ***
vehicle_access_1_mile                        < 2e-16 ***
low_access_numbers_low_income_people_1_mile  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20590 on 3138 degrees of freedom
Multiple R-squared:  0.899, Adjusted R-squared:  0.8989 
F-statistic:  9314 on 3 and 3138 DF,  p-value: < 2.2e-16

Linear regression analysis

This regression model predicts for the number of people in a county that are living 1 mile away from a supermarket based on the predictors population size,vehicle availability, and the number of low-income residents that have limited food access.

Model equation

The model equation for this multiple linear regression is as follows: y= a+b1x1+b2x2+ b3x3

where: a= intercept y= low_access_numbers_people_1_mile b1= effect of population b2= effect of vehicle access(1 mile) b3= effect of low-income population with low food access (1 mile)

Fitted equation

low_access_numbers_people_1_mile= 1680+ 0.046(population)+16.87(vehichle acces (1 mile))+ 1.94 (low-income population with low food access)

Interpretation This means that: The number of people with low food access increase by about 0.046 for each additionally person in a county this a small effect per person but meaningful overall. The number of people with low food access increases as people without vehicle access increases. Low income population has a moderate effect on people with low food access, number of people with low food access increases as the low income population increases.

P-value and adjusted R^2 value

The three predictors were all very statistically significant with p value= < 2e-16. Thus, showing strong evidence that vehicle access, population and income are strongly associated with number of people with low food access.

The Adjusted R-squared is 0.8989 for this linear regression model. This means that the model explains 89.89% of the variation in food access across the country. The model overall is statistically significant at p value =< 2.2e-16.

Diagnostic plots

Residuals vs fitted plot : the blue line is slightly bent downwards, the residuals are centered around 0 however the wider spread of residuals are around the counties with large fitted values suggesting Heteroscedasticity in extreme high fitted residual counties.

Normal Q-Q plot : Majority of the points on the graph follow the diagonal line, meaning the residuals are approximately normal, but there are a few outliers on both tails (such as observations 205, 611,and 2681) that deviate from normality.

Scale-Location plot : The blue line upwards shows an increase in variance and the spread of residuals increases with fitted values, indicating that there is mild heteroscedasticity.

Cook’s Distance : some of the few points that are influential and could affect the model results are (205,611, and 2631) indicating counties that strongly affect the model but dont invalidate the entire model.

Source of information:

https://rpubs.com/rsaidi/950425

Final visualization 1:

food_access_filtered <-food_access_2 |>
  filter(population>= 10000) |>
  filter(state %in% c("Virginia", "Maryland", "Pennsylvania")) |>
  select(county, state, population, vehicle_access_1_mile,low_access_numbers_people_1_mile, low_access_numbers_low_income_people_1_mile)      # I choose 10,000 since I didnt want small counties skewing results
food_access_filtered
# A tibble: 204 × 6
   county          state population vehicle_access_1_mile low_access_numbers_p…¹
   <chr>           <chr>      <dbl>                 <dbl>                  <dbl>
 1 Allegany County Mary…      75087                  1051                  46218
 2 Anne Arundel C… Mary…     537656                  2276                 209639
 3 Baltimore Coun… Mary…     805029                  4094                 228432
 4 Baltimore city  Mary…     620961                  2867                  28080
 5 Calvert County  Mary…      88737                   583                  71158
 6 Caroline County Mary…      33066                   456                  25269
 7 Carroll County  Mary…     167134                  1320                 108830
 8 Cecil County    Mary…     101108                  1033                  73410
 9 Charles County  Mary…     146551                   850                  94730
10 Dorchester Cou… Mary…      32618                   543                  22745
# ℹ 194 more rows
# ℹ abbreviated name: ¹​low_access_numbers_people_1_mile
# ℹ 1 more variable: low_access_numbers_low_income_people_1_mile <dbl>

Making percent of low access

food_access_filtered<-food_access_filtered |>
  mutate(
    percent_low_access= (low_access_numbers_low_income_people_1_mile/population)*100) # I choose percents since its gives a fairer comparison
food_access_filtered 
# A tibble: 204 × 7
   county          state population vehicle_access_1_mile low_access_numbers_p…¹
   <chr>           <chr>      <dbl>                 <dbl>                  <dbl>
 1 Allegany County Mary…      75087                  1051                  46218
 2 Anne Arundel C… Mary…     537656                  2276                 209639
 3 Baltimore Coun… Mary…     805029                  4094                 228432
 4 Baltimore city  Mary…     620961                  2867                  28080
 5 Calvert County  Mary…      88737                   583                  71158
 6 Caroline County Mary…      33066                   456                  25269
 7 Carroll County  Mary…     167134                  1320                 108830
 8 Cecil County    Mary…     101108                  1033                  73410
 9 Charles County  Mary…     146551                   850                  94730
10 Dorchester Cou… Mary…      32618                   543                  22745
# ℹ 194 more rows
# ℹ abbreviated name: ¹​low_access_numbers_people_1_mile
# ℹ 2 more variables: low_access_numbers_low_income_people_1_mile <dbl>,
#   percent_low_access <dbl>

Final visualization: static

options (scipen= 999) # to remove significant figures
final_plot_static <-ggplot (food_access_filtered, aes(x= vehicle_access_1_mile, y= low_access_numbers_people_1_mile, color = percent_low_access,
                  )) +

   geom_point(alpha=0.6) +
  scale_x_log10()+
  scale_y_log10()+
  labs(x= "Households without vehicle beyond 1 mile(log-scale)", y= "People with low food access 1 mile (log-scale per 100)",
       title= "    Vehicle access vs low food access Mid-atlantic (log scale)",
       caption= "Source:United States Department of Agriculture’s Economic Research Service.", color= "% Low Access", )+
   facet_wrap(~state)+
  scale_colour_viridis_c()+
theme_bw()+
  theme(plot.title = element_text(face="bold",size=14,hjust=0.5,family = "mono"),
        axis.text.x = element_text(size = 9),
        axis.text.y = element_text(size = 9))

final_plot_static 

Final visualization 1: using plotly

options (scipen= 999) # to remove significant figures
final_plot <-ggplot(food_access_filtered, aes(x= vehicle_access_1_mile, y= low_access_numbers_people_1_mile, color = percent_low_access,
                  text= paste( 
                    "County:",county, "\n",
                    "People with low food access:", low_access_numbers_people_1_mile, "\n",
                     "No vehicle household:",vehicle_access_1_mile, "\n",
                     "low access:",round(percent_low_access, 1)))) +
   geom_point(alpha=0.6) +
  scale_x_log10()+
  scale_y_log10()+
  labs(x= "Households without vehicle beyond 1 mile(log-scale)", y= "People with low food access 1 mile (log-scale per 100)",
       title= "    Vehicle access vs low food access Mid-atlantic (log scale)",
       caption= "Source:United States Department of Agriculture’s Economic Research Service.", color= "% Low Access", )+
   facet_wrap(~state)+
  scale_colour_viridis_c()+  #https://ggplot2.tidyverse.org/reference/scale_viridis.html found the colour theme using the reference.
  theme_bw()+
  theme(plot.title = element_text(face="bold",size=14,hjust=0.5,family = "mono"),
        axis.text.x = element_text(size = 9),
        axis.text.y = element_text(size = 9))

final_plot<- ggplotly(final_plot, tooltip="text") 
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
final_plot   

What the visualization shows

In this visualization the x-axis showed the number of households that do not have vehicle access and the y access showed the number of people that do not have access to a grocery store less than 1 mile away.I used the log scale to remove the clustering of the small counties on the left side of the graph. The color was a scale that showed the highest percentages of low food access for each counties.I was surprised to see that all three states had similar trends, even though Virginia had more counties that didn’t disrupt the number of people with low food access or affect the correlation. (The conclusion has more information)

State summary for average percent low access:

state_summary<- food_access_filtered |>
  group_by(state) |> #grouping by the states
  summarise( average_pct_low_access = mean(percent_low_access, na.rm = TRUE) ) |>
  mutate (access_level= case_when (average_pct_low_access < 14 ~ "Low",
                               average_pct_low_access >= 14 & average_pct_low_access < 18 ~ "Medium",
                               average_pct_low_access >= 18 ~"High"))
state_summary
# A tibble: 3 × 3
  state        average_pct_low_access access_level
  <chr>                         <dbl> <chr>       
1 Maryland                       12.0 Low         
2 Pennsylvania                   16.8 Medium      
3 Virginia                       19.6 High        

Visualization 2

ggplot(state_summary, aes(x = state, y =average_pct_low_access, fill = access_level)) +
  geom_col(width = 0.7) +
  labs( title = "Average Percent Of Low Food Access vs State",
        x= "State",
        y= "Average Percent of Population with Low Food Access",
        caption= "Source: USDA Economic Research with Low Food Access",
        fill = "food access level") +
  scale_fill_brewer(palette = "Set2") +
  theme_bw() +
  theme( plot.title = element_text(hjust = 0.5, face= "bold"),
         legend.position = "right")

What the visualization shows:

The bar chart above show the comparison of the average percentage of residents with low food access (1 mile away from supermarket) across the states; Maryland, Virginia and Pennsylvania. Each bar on the chart indicates the average (mean) percentage of people living 1 mile away from a supermarket. The color’s represent the levels of food access based on the data observed. Virginia had the highest average percent of low food access with Maryland is the lowest. The differences highlight that there is more people in Virginia suffering from low food access.

Disclaimer The access levels are categorized to understand the data better and are not based off official policy thresholds.

Conclusion:

My faceted scatterplot highlighted a strong linear correlation between vehicle access and low food access in Maryland, Virginia, and Pennsylvania. The x-axis showed the number of households that do not have vehicle access and the y access showed the number of people that do not have access to a grocery store less than 1 mile away.I used the log scale to remove the clustering of the small counties on the left side of the graph. The color was a scale that showed the highest percentages of low food access for each counties.I was surprised to see that all three states had similar trends, even though Virginia had more counties that didn’t disrupt the number of people with low food access or affect the correlation. Virginia has the highest number of low food access out of the three states. This visualization suggested that transportation accessibility has an effect on the number of people with access to food.

The bar chart above show the comparison of the average percentage of residents with low food access (1 mile away from supermarket) across the states; Maryland, Virginia and Pennsylvania. Each bar on the chart indicates the average (mean) percentage of people living 1 mile away from a supermarket. The color’s represent the levels of food access based on the data observed. Virginia had the highest average percent of low food access with Maryland is the lowest. The differences highlight that there is more people in Virginia suffering from low food access.

How I cleaned the data set.

I cleaned the data set by first changing all my names of the variables to lowercase.This was done by first going into the data set and then making all of them lowercase with ‘tolower’ “names(food_acess) <- tolower(names(food_access))”. Next, I checked for NA’s using ‘colSums’ and since there were no NA’s in my data set I did not have to filter anything. Lastly, I checked for what unique states were in the data set to see if the 3 states I wanted to investigate were there, and I filtered for Maryland,Virginia and Pennsylvania. Additionally, I filtered for population>= 10000 since less than that would be too small of population and I selected my variables. Lastly, I summarized, for mean percent low acess and used mutate and case_when to group the access levels.

Additional improvements I thought about including but couldn’t:

I thought about initially including the entire DMV, however, District of Columbia only had one row which was too little information for me to investigate and would misconstrue the data presented. Also, I wanted to use highcharter and plot a bubble graph for the different groups children, seniors and low- income people with low access to food, however when I used it in the same document as plotly my document would not render. I then tried a bar graph with high charter but it would not show a legend therefore I got my second model after different attempts.

Citation:

https://www.ers.usda.gov/data-products/food-access-research-atlas/download-the-data https://ggplot2.tidyverse.org/reference/scale_viridis.html https://rpubs.com/rsaidi/950425 https://picryl.com/media/good-food-display-nci-visuals-online-bb10ce https://www.worldbank.org/en/topic/agriculture/brief/food-security-update/what-is-food-security