For this project, we were tasked with analyzing data gathered by The Trust for Public Land and identifying two to three questions that might lead to policy decisions. The data was broken down by city. It included information like park size and amenities as well as two years worth of data ranking each city’s parks for equity.
Our first step was to organize and clean our data. We began by reading data into R.
data <- read_csv("data_parks.csv")
equity <- read_csv("equitydata_2022_New.csv")
equity2021 <- read_csv("equitydata_2021_NEW.csv")
Once our datasets were loaded, we took some time to look it over and see what cleaning it needed. We noted that the years for each dataset didn’t quite match up. Some had just 2022 or 2021, and one had some data going back to 2012. We needed to re-organize things so that we only looked at years that had observations for every city and eliminate years that had missing data.
# arrange the years in an ascending order
data_arranged <- arrange(data, year)
# descriptive stats
summary(data_arranged)
# cities are coded with state information. separate city and state.
data_sep <- separate(data,
col = city,
into = c("city", "state"),
sep = ",")
# identify the years that each city appear in the dataset
city_details <- data_sep %>%
group_by(city) %>%
summarise(years_covered = n(),
first_year = min(year),
last_year = max(year),
list_of_years = list(unique(year))) %>%
arrange(-years_covered)
city_details
Next, we needed to joining the equity datasets to the main data because the equity datasets covered 2022 and 2021. We created separate datasets for those two years using the main dataset (“data parks”) and then joined it with the equity data to create one dataset that included the general parks variables and the equity variables for the years 2022 and 2021.
data2022 <- data%>%
filter(year == 2022)
data2021 <- data %>%
filter(year == 2021)
joined2021 <- inner_join(data2021, equity2021, by = "city") # 2021 data with equity scores
joined_data <- inner_join(data2022, equity, by = "city") # 2022 data with equity scores
To combine both datasets, we needed to rename the columns so that they matched.
joined2021 <- joined2021%>%
rename(lowincome_data = income_data,
lowincome_points = income_points,
distribution_poc_data = ratio_data,
distribution_poc_points = ratio_points,
distribution_income_data = ratio_income_data,
distribution_income_points = ratio_income_points,
equity_points = equity)
joined_data <- joined_data %>%
rename(distribution_poc_data = distribuion_poc_data) # note the typo
Finally, we put the 2021 and 2022 data together and then re-separate the city from the state names.
joined21_22 <- rbind(joined2021, joined_data)
joined21_22=
separate(joined21_22,
"city",
into = c("city", "state"),
sep= ",")
We knew we wanted to visualize equity categories as part of our analysis. To this end we created a categorical variable from the variables “equity points” and “total points”. We used cut off points to create three categories, “low”, “medium”, and “high”. Once we had our categories, we added them to our total dataset.
breaks <-c(0, 33, 66, 100)
labels <-c("low", "medium", "high")
equity_category <- cut(joined21_22$equity_points,
breaks = breaks,
labels = labels,
include.lowest = TRUE)
points_category <- cut(joined21_22$total_points,
breaks = breaks,
labels = labels,
include.lowest = TRUE)
joined21_22$equity_category <- equity_category
joined21_22$points_category <- points_category
Our initial hypotheses were that states with a higher equity categories will have larger parks and more amenities. If we were correct, this would indicate ideal places to invest future funding and ensure that equitable parks are also welcoming parks.
We first wanted to look at the relationship between equity category and park size:
joined21_22 %>%
group_by(equity_category) %>%
summarise(avg_parksize = mean(med_park_size_data)) %>%
arrange(desc(avg_parksize))
## # A tibble: 3 × 2
## equity_category avg_parksize
## <fct> <dbl>
## 1 low 8.87
## 2 medium 6.26
## 3 high 4.02
We also looked at equity category and amenities; as expected, where there was a higher equity category, there were also higher amenity points.
joined21_22 %>%
group_by(equity_category) %>%
summarise(avg_amenities = mean(amenities_points)) %>%
arrange(desc(avg_amenities))
## # A tibble: 3 × 2
## equity_category avg_amenities
## <fct> <dbl>
## 1 high 58.4
## 2 medium 49.3
## 3 low 38.9
We were also interested in the change in park size over the years. This showed little change between 2021 and 2022, but intrigued us enough to create a plot with a longer timeframe.
joined21_22 %>%
group_by(year) %>%
summarise(avg_parksize_points = mean(med_park_size_points)) %>%
arrange(desc(avg_parksize_points))
## # A tibble: 2 × 2
## year avg_parksize_points
## <dbl> <dbl>
## 1 2021 49.7
## 2 2022 49.5
We were interested in the change in park size between 2012 and 2022. We decided to create a dumbell plot too visualize total points change over the course of that decade. Before plotting, we removed Arlington from the full dataset. It is coded twice because there are two cities named Arlington and became a duplicate after we separated the state information. Given that this problem only impacted two cities, and our interest was in overall trends, we decided it was simpler to simply remove it than to find another way to keep both cities but distinguish them from each other
From this dumbell plot, we can make recommendations on where we believe funding should be invested to improve park scores in cities that have lagged between 2012 to 2022.For example, Virginia Beach, Phoenix, and Sacramento stand out has having negative changes in park scores from 2012 to 2022 and would benefit from funding to improve.
result =
data_sep %>%
filter(city!= "Arlington") %>%
group_by(city)%>% # group by city
reframe(total_pct2022 = total_points[year == 2022],
total_pct2012 = total_pct[year == 2012],
difference = abs(total_pct2022 - total_pct2012))
# reorder the data
result <- result %>%
mutate(city = reorder(city, -total_pct2022))
# graph the dumbbell plot
result <- result %>%
mutate(city = reorder(city, -total_pct2022)) #reorder the data so the values are in ascending order in the graph
# assign colors to years
colors <- c("2012" = "#c0d7f0", "2022" = "#416296")
ggplot(result, aes(y = city)) +
geom_segment(aes(x = total_pct2012, xend = total_pct2022, yend = city), color = "grey")+
geom_point(aes(x = total_pct2012, color = "2012"), size = 3.5, shape = 20) +
geom_point(aes(x = total_pct2022, color = "2022"), size = 3.5, shape = 20) +
labs(
x = "Percentage",
y = "City",
title = "Change in Park Scores from 2012 to 2022"
)+
scale_color_manual(values = colors, name = "Year")+
theme_minimal()+
theme(panel.grid.major.x = element_line(color = "#cdd1cd", size = 0.3),
panel.grid.major.y = element_blank(), panel.grid.minor = element_blank())
Our aim with this second plot was to visualize the relationship between a citie’s park equity score and its overall park score and park size. We included a regression line, which indicates that there is a positive relationship between the equity score and the park score; that is, parks with a higher equity rating also generally have a higher rating overall (indicating more amenities, greater park spending per resident, etc). We used color to more easily see the three categories of equity score, “low”, “medium” and “high”. The “high” and “medium” equity categories are fairly tidily split, while the “low” equity category cities are nestled within the “medium” cities.
joined21_22%>%
ggplot()+
geom_smooth(mapping = aes(x = spend_per_resident_data_dollar,
y = equity_points),
method = "lm",
color = "#de806f",
se = FALSE)+
geom_point(mapping = aes(x = spend_per_resident_data_dollar,
y = equity_points,
color = points_category),
shape = 20,
alpha = 0.5)+
labs(
x = "Spending Per-Resident ($)",
y = "Equity Score",
title = "Relationship Between Per-Resident Spending and Equity Score",
color = " Overall Parks Score",
size = "Median Park Size (In acres)")+
theme_minimal()+
theme(panel.grid.major.y = element_line(color = "#cdd1cd", size = 0.3),
panel.grid.major.x = element_blank(), panel.grid.minor = element_blank())+
geom_hline(yintercept = 0, color = "#6f706f")+
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, by = 25))