For this project I selected the food desert dataset from: https://www.kaggle.com/tcrammond/food-access-and-food-deserts. I wanted to explore this dataset to determine how populations are affected by food deserts. I used USDA’s designation of food deserts “living more than one mile from a supermarket in urban/suburban areas, and more than 10 miles from a supermarket in rural areas”. I specifically selected this variable because I believe that walking more than one mile or drive longer tan 10 miles can carry a huge burden to any household.
library(tidyverse) #installs tidyverse package
## Warning: package 'tidyverse' was built under R version 4.0.4
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr) #installs dplyr package
library(plotly) # will use for plotly graphs
## Warning: package 'plotly' was built under R version 4.0.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(htmlwidgets) #for plotly graphs
## Warning: package 'htmlwidgets' was built under R version 4.0.4
setwd("C:/Users/Dano/Documents/") #sets working directoy
food <- read_csv("food_access.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## CensusTract = col_character(),
## State = col_character(),
## County = col_character()
## )
## i Use `spec()` for the full column specifications.
# uploads data
names(food) <- tolower(names(food)) # converts headings to lowercase column names
#In this analysis I din't use the command "na.omit" because it would remove the columns that have "0" and in this dataset, "0" is used when a variable is not applicable.
cleanfood <- food %>% mutate(across(is.numeric, ~ round(., 2))) #rounds to two decimal places
## Warning: Problem with `mutate()` input `..1`.
## i Predicate functions must be wrapped in `where()`.
##
## # Bad
## data %>% select(is.numeric)
##
## # Good
## data %>% select(where(is.numeric))
##
## i Please update your code.
## This message is displayed once per session.
## i Input `..1` is `across(is.numeric, ~round(., 2))`.
From the dataset, I chose the variables that I want to explore.
food1 <- cleanfood %>%
select(censustract, state, urban, pop2010, povertyrate, medianfamilyincome, lilatracts_1and10, la1and10, tractlowi, lapop1share, lawhite1share, lablack1share, laasian1share, lanhopi1share, laaian1share, laomultir1share, lahisp1share, lapop10share, lawhite10share, lablack10share, laasian10share, lanhopi10share, laaian10share, laomultir10share, lahisp10share, lasnap1share, lasnap10share )
# selects the variables that I want to explore
view(food1)
First, I want to explore how urban and rural populations are compared in relation to poverty rate. To do this I split the data into urbal or rural based on the variable “urban”. In this dataset, urban is marked by 1 and rural by 0.
Graph 01 shows that in urban areas the poverty rate is higher in non food deserts census tracts than in food deserts (one mile from a supermarket in urban/suburban areas) census tracts. My first finding is surprising because I thought that I would find higher poverty rate in food deserts census tracts.
foodurban<- food1 %>%
filter(urban==1)
graph01 <- foodurban%>%
ggplot(aes(x = la1and10, y=povertyrate)) +
geom_boxplot(aes(fill=factor(la1and10))) +
geom_point(aes(color = factor(la1and10))) +
scale_x_continuous(breaks = c(0, 1)) +
scale_colour_manual(name= "Food Access Status" , breaks = c("0", "1"),
labels = c("0 Non Food Desert", "1 Food Desert"),
values = c("forestgreen", "darkgoldenrod1")) +
scale_fill_manual(name= "Food Access Status" , breaks = c("0", "1"),
labels = c("0 Non Food Desert", "1 Food Desert"),
values = c("forestgreen", "darkgoldenrod1")) +
labs(title="Poverty Rate Percentage vs. Food Acess Status in Urban Census Tracts ", x="Food Access Status", y = "Poverty Rate") + theme_classic()
graph01
My second exploration is to see how poverty percentage compares in rural census tract areas. Based on graph 02, the poverty rate in rural areas is higher in food deserts than in non food deserts census tracts. Even though the median in the food desert areas is higher than in non food deserts census tracts, the difference is relatively small.
foodrural <- food1 %>%
filter(urban==0)
graph02 <- foodrural%>%
ggplot(aes(x = la1and10, y=povertyrate)) +
geom_boxplot(aes(fill=factor(la1and10))) +
geom_point(aes(color = factor(la1and10))) +
scale_x_continuous(breaks = c(0, 1)) +
scale_colour_manual(name= "Food Access Status" , breaks = c("0", "1"),
labels = c("0 Non Food Desert", "1 Food Desert"),
values = c("forestgreen", "darkgoldenrod1")) +
scale_fill_manual(name= "Food Access Status" , breaks = c("0", "1"),
labels = c("0 Non Food Desert", "1 Food Desert"),
values = c("forestgreen", "darkgoldenrod1")) +
labs(title="Poverty Rate vs. Food Acess Status in Rural Census Tracts " , x="Food Access Status", y = "Poverty Rate") + theme_classic()
graph02
I also want to understand the relationship between race and food deserts. In the first scatter plot I compared the share of the population that are beyond 1 mile from supermarket against the share of the population that are white beyond 1 mile from supermarket.
correlationp1mile<- food1 %>%
ggplot(aes(`lapop1share`, `lawhite1share`))+
geom_point(color = "gold2", shape = 20, alpha=.1) +
geom_smooth(method='lm',color="black",formula=y~x, size = 0.2, se = FALSE)+
labs(x="Population Share Beyond 1 Mile",y="White Share Population", title="Correlation Between Population and White Population")+
theme_light()+ theme(legend.position = "none") + theme( title=element_text(size=7,face="bold"))
correlationp1mile
fig1 <- ggplotly(correlationp1mile)
fig1
The first scatter plot provides the fraction of people that are in food deserts versus the fraction of people that are white and in food deserts. The regression coefficient from the regression line shows that on average, 82% of people that are on a food deserts are white.
fit1 <- lm(lawhite1share ~ 0 + lapop1share, data = food1)
summary (fit1)
##
## Call:
## lm(formula = lawhite1share ~ 0 + lapop1share, data = food1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82520 -0.00301 0.00000 0.04041 0.17480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lapop1share 0.8251978 0.0007637 1080 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1147 on 72863 degrees of freedom
## Multiple R-squared: 0.9413, Adjusted R-squared: 0.9413
## F-statistic: 1.167e+06 on 1 and 72863 DF, p-value: < 2.2e-16
The second scatter plot provides the fraction of people that are in food deserts versus the fraction of people that are black and in food deserts. The regression coefficient from the regression line shows that on average, 8.9% of people that are on a food deserts are black.
correlationp1mile<- food1%>%
ggplot(aes(`lapop1share`, `lablack1share`)) +
geom_point(color ="gold2", shape = 20, size = 1 , alpha=.1) +
geom_smooth(method='lm',color="black",formula=y~x, size = 0.2, se = FALSE)+
labs(x="Population Share 1 Mile",y="Black Population Share", title="Correlation Between Population and Black Population")+
theme_light()+ theme(legend.position = "none") + theme( title=element_text(size=7,face="bold"))
correlationp1mile
fig2 <- ggplotly(correlationp1mile)
fig2
fit2 <- lm(lablack1share ~ 0 + lapop1share, data = food1)
summary (fit2)
##
## Call:
## lm(formula = lablack1share ~ 0 + lapop1share, data = food1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08836 -0.03836 -0.00060 0.00000 0.91164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lapop1share 0.0883635 0.0006401 138 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09612 on 72863 degrees of freedom
## Multiple R-squared: 0.2073, Adjusted R-squared: 0.2073
## F-statistic: 1.906e+04 on 1 and 72863 DF, p-value: < 2.2e-16
The third scatter plot provides the fraction of people that are in food deserts versus the fraction of people that are hispanic and in food deserts. The regression coefficient from the regression line shows that on average, 8.4% of people that are on a food deserts are hispanic.
correlationp1mile<- food1%>%
ggplot(aes(`lapop1share`, `lahisp1share`)) +
geom_point(color ="gold2", shape = 20, size = 1 , alpha=.1) +
geom_smooth(method='lm',color="black",formula=y~x, size = 0.2, se = FALSE)+
labs(x="Population Share Beyond 1 Mile", y="Hispanic Share Population", title="Correlation Between Population and Hispanic Share Population")+
theme_light()+ theme(legend.position = "none") + theme( title=element_text(size=7,face="bold"))
correlationp1mile
fig3 <- ggplotly(correlationp1mile)
fig3
fit3 <- lm(lahisp1share ~ 0 + lapop1share, data = food1)
summary (fit3)
##
## Call:
## lm(formula = lahisp1share ~ 0 + lapop1share, data = food1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08384 -0.02875 0.00000 0.00000 0.91616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lapop1share 0.0838361 0.0005318 157.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07986 on 72863 degrees of freedom
## Multiple R-squared: 0.2543, Adjusted R-squared: 0.2543
## F-statistic: 2.485e+04 on 1 and 72863 DF, p-value: < 2.2e-16
The fourth scatter plot provides the fraction of people that are in food deserts versus the fraction of people that are native american and in food deserts. The regression coefficient from the regression line shows that the on average, 1.3% of people that are on a food desert are native american.
correlationp1mile<- food1%>%
ggplot(aes(`lapop1share`, `laaian1share`)) +
geom_point(color ="gold2", shape = 20, size = 1 , alpha=.1) +
geom_smooth(method='lm',color="black",formula=y~x, size = 0.2, se = FALSE)+
labs(x="Population Share Beyond 1 Mile", y="Native American Share Population", title="Correlation Between Population and Hispanic Share Population")+
theme_light()+ theme(legend.position = "none") + theme( title=element_text(size=7,face="bold"))
correlationp1mile
fig4 <- ggplotly(correlationp1mile)
fig4
fit3 <- lm(laaian1share ~ 0 + lapop1share, data = food1)
summary (fit3)
##
## Call:
## lm(formula = laaian1share ~ 0 + lapop1share, data = food1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01378 -0.00772 -0.00179 0.00000 0.98622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lapop1share 0.0137846 0.0002838 48.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04261 on 72863 degrees of freedom
## Multiple R-squared: 0.03137, Adjusted R-squared: 0.03136
## F-statistic: 2360 on 1 and 72863 DF, p-value: < 2.2e-16
In this project I explored the food desert dataset. I explored categorical and numerical variables to find trends between food deserts and race, and food deserts and urban/rural areas. I chose this dataset because my dad recently noted how much he misses eating watermelon. He noted that watermelons are heavy so carrying one home can be burdensome. I then noticed that it would take me about 1.3 miles to walk one way to the closest grocery store and if I didn’t own a car, grabbing weekly groceries would be a hurdle. I then wanted to understand how poor households compare to wealthier households in relation to accessibility to healthy food sources.
Accessibility to grocery stores and supermarkets allow households to have healthy food. Considering that accessibility to a supermarket is a necessity, I first decided to explore all households that are beyond one mile from a supermarket or 10 miles beyond a supermarket in rural areas. The first two graphs surprised me because I thought that food deserts are more prevalent in poor urban areas but per the analysis, there are more food deserts in poor rural areas. Graph 01 shows that in urban areas the poverty rate is higher in non food deserts census tracts than in food deserts. Graph 02 shows that the poverty rate in rural areas is higher in food deserts than in non food deserts census tracts.
I also wanted to explore how different races are affected by food deserts. Graph 01-04 show the fraction of people that are in food deserts versus the fraction of people that are white, black, hispanic and native american. The scatter plots show that the biggest group (in rural and urban areas) that are on food deserts are whites, followed by blacks, hispanics and native americans.
Considering the US population makeup, finding that the white population has a high food inaccessibility is not surprising because it follows the US population composition. I did further research and found USDA’s 2017 report on SNAP (food stamps) benefit participants and the makeup of participants is similar. Whites account for 35% of participants, blacks account for 26% of participants, hispanics account for 17% of participants, asians make 3% of participants and native americans make about 2% of participants (https://fns-prod.azureedge.net/sites/default/files/resource-files/Characteristics2017.pdf).
This dataset is very interesting but there is more that I could have compared such as SNAP participation in urban areas. I also would have liked to analyze factors contributing to inaccessibility to supermarkets/grocery stores. For example, I found an interesting article from The Week, that discussed the impact of current zoning laws on location of small grocery stores within neighborhoods so exploring these factors would provide more information on programs that could be implemented to increase food accessibility (https://theweek.com/articles/977300/mystery-americas-small-groceries).