Exploring and cleaning the data
Data exploration
Let’s check out our dataset.
rmarkdown::paged_table(head(parks))
Our dataset has 27 variables, including year, city, and other discrete and continuous variables. We can see that some of these variables end with “data” while others end with “points”. According to the Trust for Public Land, variables ending with “data” contain raw values, while variables ending with “points” are yearly normalized scores (a higher number is better).
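As a quick way to see which columns follow each naming convention, we can peek at them with tidyselect helpers. This is only a sketch; it assumes the tidyverse is loaded and that the suffixes are literally “data” and “points”.
library(dplyr)
# Raw columns vs. their normalized counterparts
parks %>% select(ends_with("data")) %>% glimpse()
parks %>% select(ends_with("points")) %>% glimpse()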
We’re interested specifically in the following variables:
- year, the year of measurement;
- city, the city name;
- med_park_size_points, the median park size points;
- park_pct_city_points, the parkland as percentage of city area points;
- pct_near_park_points, the percent of residents within a 10 minute walk to park points;
- spend_per_resident_points, the spending per resident in points;
- amenities_points, the amenities points total (i.e. recreational area);
- total_points, the total points.
Our four main metrics are represented by five variables: med_park_size_points and park_pct_city_points (acreage), pct_near_park_points (accessibility), spend_per_resident_points (investment), and amenities_points (amenities). Note that the acreage metric is computed from both the med_park_size_points and park_pct_city_points variables.
We can start off by checking which cities are in the dataset.
cities <- parks %>%
count(city)
rmarkdown::paged_table(cities)
There are 102 cities in total, and two of them should be renamed for consistency: “Washington, DC” becomes “Washington, D.C.” and “Charlotte/Mecklenburg County” becomes “Charlotte”.
parks$city <- case_when(
parks$city == "Washington, DC" ~ "Washington, D.C.",
parks$city == "Charlotte/Mecklenburg County" ~ "Charlotte",
TRUE ~ parks$city
)
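A quick sanity check (a sketch that simply looks for both the old and the new labels) confirms that the renaming worked:
# The new labels should each appear with a positive count,
# while the old labels should return no rows at all
parks %>%
  filter(city %in% c("Washington, D.C.", "Charlotte",
                     "Washington, DC", "Charlotte/Mecklenburg County")) %>%
  count(city)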
Let’s get rid of the unnecessary city_dup column…
parks <- parks %>%
select(-city_dup)
We can compute the average score for each of the main metrics by year.
parks %>%
select(year, city, amenities_points, spend_per_resident_points, pct_near_park_points, park_pct_city_points,
med_park_size_points, total_points, total_pct) %>%
group_by(year) %>%
summarise(avg_amenities = mean(amenities_points),
avg_spend = mean(spend_per_resident_points),
avg_near_park = mean(pct_near_park_points),
avg_pct_city = mean(park_pct_city_points),
avg_med_park_size = mean(med_park_size_points),
avg_total_pct = mean(total_pct)) %>%
kable(caption = "Average metric scores") %>%
kable_minimal()
Average metric scores

| year | avg_amenities | avg_spend | avg_near_park | avg_pct_city | avg_med_park_size | avg_total_pct |
|------|---------------|-----------|---------------|--------------|-------------------|---------------|
| 2012 | NA | 9.525000 | 22.05000 | 10.27500 | 10.90000 | 50.90000 |
| 2013 | NA | 10.340000 | 23.16000 | 9.74000 | 10.56000 | 51.41000 |
| 2014 | NA | 10.216667 | 23.33333 | 10.03333 | 10.16667 | 51.85000 |
| 2015 | 39.213333 | 9.813333 | 24.93333 | 10.14667 | 10.56000 | 54.50000 |
| 2016 | 10.173469 | 10.295918 | 22.40816 | 10.54082 | 10.62245 | 53.35204 |
| 2017 | 9.868687 | 10.303030 | 22.10101 | 10.19192 | 10.61616 | NA |
| 2018 | 19.556701 | 19.690722 | 22.38144 | 10.48454 | 10.58763 | 51.61134 |
| 2019 | 49.592784 | 50.438144 | 58.01546 | 25.43814 | 26.49485 | 52.49485 |
| 2020 | 49.291753 | 54.278351 | 57.15464 | 25.71134 | 25.41237 | 52.96495 |
It seems like average scores are more or less consistent in the first few years, but change quite a bit in the last two years. Perhaps the maximum score varies each year. Let’s take a look at the maximum values for all years.
parks %>%
select(year, city, amenities_points, spend_per_resident_points, pct_near_park_points, park_pct_city_points,
med_park_size_points, total_points, total_pct) %>%
group_by(year) %>%
summarise(max_amenities = max(amenities_points),
max_spend = max(spend_per_resident_points),
max_near_park = max(pct_near_park_points),
max_pct_city = max(park_pct_city_points),
max_med_park_size = max(med_park_size_points),
max_total_points = max(total_points)) %>%
kable(caption = "Maximum metric scores") %>%
kable_minimal()
Maximum metric scores

| year | max_amenities | max_spend | max_near_park | max_pct_city | max_med_park_size | max_total_points |
|------|---------------|-----------|---------------|--------------|-------------------|------------------|
| 2012 | NA | 20 | 40 | 20 | 20 | 89.0 |
| 2013 | NA | 20 | 40 | 20 | 20 | 97.0 |
| 2014 | NA | 20 | 40 | 20 | 20 | NA |
| 2015 | 77.0 | 20 | 40 | 20 | 20 | 19.0 |
| 2016 | 19.0 | 20 | 40 | 20 | 20 | 104.0 |
| 2017 | 19.0 | 20 | 40 | 20 | 20 | 87.5 |
| 2018 | 37.0 | 40 | 40 | 20 | 20 | 135.0 |
| 2019 | 90.8 | 100 | 100 | 50 | 50 | NA |
| 2020 | 92.5 | 100 | 100 | 50 | 50 | 341.0 |
As the last table shows, the maximum score for these variables has changed over the years. For instance, the maximum possible score for the amenities_points variable appears to have gone from 80 in 2015, to 20 in 2016 and 2017, to 40 in 2018, and to 100 in 2019 and 2020. The maximum potential scores for the other metrics are slightly more consistent, but there is still a big change in the scoring methodology in 2019 and 2020. We’ll standardize the scores in the next section.
Standardizing the data
Let’s standardize the scores across all years so that they are comparable over time. That is, the amenities, investment and accessibility scores should each have a maximum of 100. The acreage score is constructed from the park_pct_city_points and med_park_size_points variables, so each of these should have a maximum of 50.
parks_standardized <- parks %>%
select(year, city, amenities_points, spend_per_resident_points, pct_near_park_points, park_pct_city_points,
med_park_size_points, total_points, total_pct, rank) %>%
mutate(amenities_points =
case_when(
year == 2015 ~ amenities_points*1.25,
year == 2016 ~ amenities_points*5,
year == 2017 ~ amenities_points*5,
year == 2018 ~ amenities_points*2.5,
TRUE ~ amenities_points),
spend_per_resident_points =
case_when(
year %in% c(2012, 2013, 2014, 2015, 2016, 2017) ~ spend_per_resident_points*5,
year == 2018 ~ spend_per_resident_points*2.5,
TRUE ~ spend_per_resident_points),
pct_near_park_points =
case_when(
year %in% c(2012, 2013, 2014, 2015, 2016, 2017, 2018) ~ pct_near_park_points*2.5,
TRUE ~ pct_near_park_points),
park_pct_city_points =
case_when(
year %in% c(2012, 2013, 2014, 2015, 2016, 2017, 2018) ~ park_pct_city_points*2.5,
TRUE ~ park_pct_city_points),
med_park_size_points =
case_when(
year %in% c(2012, 2013, 2014, 2015, 2016, 2017, 2018) ~ med_park_size_points*2.5,
TRUE ~ med_park_size_points))
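As an aside, instead of hard-coding the multipliers, we could rescale each points column against its yearly maximum. This is only a sketch (the name parks_rescaled is mine), and it gives similar results only when some city actually attains, or comes close to, the maximum possible score in a given year:
# Generic alternative: rescale each metric to a 0-100 range per year,
# and the two acreage components to a 0-50 range.
# Years where a column is entirely NA (amenities before 2015) trigger a warning and stay NA.
parks_rescaled <- parks %>%
  group_by(year) %>%
  mutate(across(c(amenities_points, spend_per_resident_points, pct_near_park_points),
                ~ .x / max(.x, na.rm = TRUE) * 100),
         across(c(park_pct_city_points, med_park_size_points),
                ~ .x / max(.x, na.rm = TRUE) * 50)) %>%
  ungroup()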
Let’s check the scores again…
parks_standardized %>%
select(year, city, amenities_points, spend_per_resident_points, pct_near_park_points, park_pct_city_points,
med_park_size_points, total_points, total_pct) %>%
group_by(year) %>%
summarise(avg_amenities = mean(amenities_points),
avg_spend = mean(spend_per_resident_points),
avg_near_park = mean(pct_near_park_points),
avg_pct_city = mean(park_pct_city_points),
avg_med_park_size = mean(med_park_size_points)) %>%
kable(caption = "Average metric scores") %>%
kable_minimal()
Average metric scores

| year | avg_amenities | avg_spend | avg_near_park | avg_pct_city | avg_med_park_size |
|------|---------------|-----------|---------------|--------------|-------------------|
| 2012 | NA | 47.62500 | 55.12500 | 25.68750 | 27.25000 |
| 2013 | NA | 51.70000 | 57.90000 | 24.35000 | 26.40000 |
| 2014 | NA | 51.08333 | 58.33333 | 25.08333 | 25.41667 |
| 2015 | 49.01667 | 49.06667 | 62.33333 | 25.36667 | 26.40000 |
| 2016 | 50.86735 | 51.47959 | 56.02041 | 26.35204 | 26.55612 |
| 2017 | 49.34343 | 51.51515 | 55.25253 | 25.47980 | 26.54040 |
| 2018 | 48.89175 | 49.22680 | 55.95361 | 26.21134 | 26.46907 |
| 2019 | 49.59278 | 50.43814 | 58.01546 | 25.43814 | 26.49485 |
| 2020 | 49.29175 | 54.27835 | 57.15464 | 25.71134 | 25.41237 |
The yearly averages are now much more consistent, which suggests the variables have been standardized correctly.
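As a final check, the yearly maxima of the standardized scores should now be (close to) 100 for amenities, investment and accessibility, and 50 for the two acreage components. A quick verification sketch:
# Maxima after standardization; amenities is entirely NA before 2015,
# so those years return -Inf (with a warning)
parks_standardized %>%
  group_by(year) %>%
  summarise(max_amenities = max(amenities_points, na.rm = TRUE),
            max_spend = max(spend_per_resident_points, na.rm = TRUE),
            max_near_park = max(pct_near_park_points, na.rm = TRUE),
            max_pct_city = max(park_pct_city_points, na.rm = TRUE),
            max_med_park_size = max(med_park_size_points, na.rm = TRUE))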
Creating the categorical region variable
We can create the categorical variable region to enrich the analysis later on. We’ll use four categories: west, midwest, south and northeast, and each city in our dataset will fall under exactly one of them.
parks_standardized <- parks_standardized %>%
mutate(region =
case_when(
city %in% c("Albuquerque", "Anaheim", "Anchorage", "Aurora", "Bakersfield", "Boise", "Chandler",
"Chula Vista", "Colorado Springs", "Denver", "Henderson", "Honolulu", "Irvine",
"Las Vegas", "Long Beach", "Los Angeles", "North Las Vegas", "Oakland", "Phoenix",
"Portland", "Reno", "Riverside", "Sacramento", "San Diego", "San Francisco", "Santa Ana",
"Scottsdale", "Seattle") ~ "west",
city %in% c("Arlington, Texas", "Arlington, Virginia", "Atlanta", "Austin", "Baton Rouge",
"Charlotte", "Chesapeake", "Corpus Christi", "Dallas", "Durham", "El Paso",
"Fort Worth", "Fremont",
"Fresno", "Garland", "Glendale", "Greensboro", "Hialeah", "Houston", "Irving",
"Jacksonville", "Laredo", "Lexington", "Louisville", "Lubbock", "Memphis", "Mesa",
"Miami", "Nashville", "New Orleans", "Norfolk", "Oklahoma City", "Orlando", "Plano",
"Raleigh", "Richmond", "San Antonio", "San Jose", "St. Petersburg", "Stockton", "Tampa",
"Tucson", "Tulsa", "Virginia Beach", "Washington, D.C.", "Winston-Salem") ~ "south",
city %in% c("Baltimore", "Boston", "Buffalo", "Jersey City", "New York", "Newark", "Philadelphia",
"Pittsburgh") ~ "northeast",
city %in% c("Chicago", "Cincinnati", "Cleveland", "Columbus", "Des Moines", "Detroit", "Fort Wayne",
"Indianapolis", "Kansas City", "Lincoln", "Madison", "Milwaukee", "Minneapolis",
"Omaha", "St. Louis", "St. Paul", "Toledo", "Wichita") ~ "midwest"))
parks_standardized$region <- as_factor(parks_standardized$region)
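A quick tally per region (a sketch using a single year so each city is counted once) helps confirm that no city was left without a region:
# Cities per region in 2020; an NA region would flag a city
# missing from the case_when() above
parks_standardized %>%
  filter(year == 2020) %>%
  count(region, sort = TRUE)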
Getting the coordinates for US cities
We’ll also need coordinates for the cities in our dataset in order to create a map of the U.S. later on. These can be obtained from the us.cities dataframe (shipped with the ‘maps’ package), which lists coordinates for medium to large U.S. cities; the ‘usmap’ package is then used to transform the coordinates for mapping. Below is the code used to prepare the data.
#Clean the us.cities dataset (from the 'maps' package)
us.cities$name <- str_sub(us.cities$name, end = -4)
us.cities$name <- str_replace(us.cities$name, "Saint", "St.")
us.cities$name <- str_replace(us.cities$name, "WASHINGTON", "Washington, D.C.")
parks_2020 <- parks_standardized %>%
filter(year == 2020)
#Join datasets
us.cities <- left_join(x = us.cities, y = parks_2020, by = c("name" = "city"))
#Remove rows with missing values
us.cities <- us.cities %>%
drop_na()
#Identify duplicate cities
duplicates <- us.cities %>%
add_count(name) %>%
filter(n > 1)
#Manually check and remove duplicates
duplicates <- duplicates[-c(2, 3, 4, 6, 9, 10, 12, 14, 16, 18, 19),]
duplicates <- duplicates %>%
select(-n)
#Merge duplicates dataset to us.cities
us.cities <- filter(us.cities,
!name %in% c("Aurora", "Columbus", "Glendale", "Jacksonville",
"Kansas City", "Lincoln", "Madison", "Newark", "Portland"))
us.cities <- rbind(us.cities, duplicates)
#Prepare dataset for mapping
us.cities_map <- us.cities[, c(5, 4, 1, 2, 14, 16)]
us.cities_map <- us.cities_map %>%
rename(state = country.etc)
us.cities_mapt <- usmap_transform(us.cities_map)
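For reference, here is a minimal sketch of the kind of map this data enables, using ggplot2’s built-in state polygons (via the ‘maps’ package) rather than usmap’s transformed coordinates. It assumes us.cities_map contains long, lat and total_points columns (the actual column names may differ).
library(ggplot2)
# Sketch: continental U.S. basemap with one point per city, colored by total score.
# Alaska and Hawaii are omitted here; the usmap approach above handles them.
states <- map_data("state")
ggplot() +
  geom_polygon(data = states, aes(long, lat, group = group),
               fill = "grey95", colour = "grey70") +
  geom_point(data = us.cities_map, aes(long, lat, colour = total_points), size = 2) +
  scale_colour_gradient(low = "yellow", high = "red") +
  coord_quickmap() +
  theme_void()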
We’re now ready to visualize our data.
Visualizing the data
The graph below shows the ten U.S. cities with the best overall park scores for each year. As the graph shows, some cities make the list only in certain years, while others appear almost every year. For example, Sacramento was among the cities with the best parks only in the first couple of years (2012 and 2013), while Washington, D.C. has had one of the best scores every year since 2012 (except in 2014). Lastly, Minneapolis has had the best parks every year since 2013, except in 2019 (note: its data for 2017 are missing).
Line graph - cities

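A sketch of how a chart along these lines might be built, assuming the rank column holds each city’s numeric overall ranking within a year:
library(ggplot2)
# Sketch: track the cities ranked in the top ten in any given year
parks_standardized %>%
  filter(rank <= 10) %>%
  ggplot(aes(x = year, y = rank, colour = city)) +
  geom_line() +
  geom_point() +
  scale_y_reverse(breaks = 1:10) +   # rank 1 at the top
  labs(x = NULL, y = "Overall rank", colour = "City")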
We can also take a look at major U.S. cities and compare their park scores.

Among major U.S. cities, Washington, D.C. has had the best parks since 2015. Overall, the trends for these cities are relatively stable. Northeastern cities also seem to have substantially better scores than cities in other regions, especially those in the South, such as Houston or Miami. Is this pattern the same for all 102 cities in the analysis?
The plot below shows the average park score of each U.S. region by year. Northeastern cities do indeed have higher scores, on average, than other regions, especially in the earlier years, when the Northeast had the best scores. However, Midwestern cities’ scores have increased consistently, and since 2018 the Midwest has had the best parks in the U.S.
Bar charts - US regions

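A sketch of the computation behind a chart like the one above, using the average total_pct per region and year as the overall score (an assumption on my part; note that total_pct is missing for 2017 in this dataset):
library(ggplot2)
# Average overall score per region and year
parks_standardized %>%
  group_by(year, region) %>%
  summarise(avg_score = mean(total_pct, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = factor(year), y = avg_score, fill = region)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "Average total score (%)", fill = "Region")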
What makes Midwestern cities’ parks perform so well? Do Midwestern cities outperform other U.S. cities in one particular metric? For example, they might simply invest more in their parks.
Let’s explore the relationship between each of the four metrics and the total park score for all cities. We’ll focus only on the most recent year’s data (2020), which should be enough to get an idea of the relationship between each metric and the total score. I’ve also applied a LOESS smoother to help visualize each relationship.
Scatter plots - cities

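A sketch of how such a panel could be assembled: pivot the 2020 metric scores into long form, facet by metric, and add a LOESS smoother per panel. Unlike in the charts above, the two acreage components are kept separate here for simplicity.
library(tidyr)
library(ggplot2)
# Metric score vs. total score for 2020, one facet per metric
parks_standardized %>%
  filter(year == 2020) %>%
  pivot_longer(cols = c(med_park_size_points, park_pct_city_points,
                        pct_near_park_points, spend_per_resident_points,
                        amenities_points),
               names_to = "metric", values_to = "score") %>%
  ggplot(aes(x = score, y = total_points, colour = region)) +
  geom_point(alpha = 0.7) +
  geom_smooth(aes(group = 1), method = "loess", se = FALSE, colour = "grey40") +
  facet_wrap(~ metric, scales = "free_x") +
  labs(x = "Metric score", y = "Total score", colour = "Region")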
In chart A above, which shows the relationship between acreage score and total score, each point represents one city, colored by its region. There is no obvious positive relationship between acreage score and total points for any region except the South. In fact, for Western and Midwestern cities the relationship appears to follow an inverted U-shape: bigger parks are generally better, but only up to a point; past some acreage score, a higher value does not necessarily translate into better parks. Finally, as we saw before, Southern cities tend to have lower total scores than the rest of U.S. cities. This can also be visualized in the map above, which shows all cities analyzed by the TPL in 2020: points in red are cities with high total scores, whereas points in yellow have low total scores.
Unlike chart A, charts B, C and D show a clearer positive relationship between each metric and total points across all regions. The relationship between investment score and total points appears to be the strongest of the four: a city with a high investment score is very likely to also have a high total score. Contrast this with the acreage score, where a high value does not necessarily imply a high total score.
Although the charts above are useful for visualizing the relationships between the four metrics and the total score, a correlogram (correlation matrix) makes it easier to assess the strength of each relationship.
Correlogram - metric scores

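A sketch of how a correlogram like the one above could be produced; it uses the ‘corrplot’ package and the 2020 data (both assumptions, since the original package and year aren’t shown) and keeps the two acreage components separate:
library(corrplot)
# Correlation matrix of the metric scores and the total score for 2020
metric_cors <- parks_standardized %>%
  filter(year == 2020) %>%
  select(amenities_points, spend_per_resident_points, pct_near_park_points,
         park_pct_city_points, med_park_size_points, total_points) %>%
  cor(use = "pairwise.complete.obs")
corrplot(metric_cors, method = "color", type = "lower",
         addCoef.col = "black", tl.col = "black")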
The correlogram shows the correlation between each of the metrics, including total score. A correlation of 1 (shaded in dark blue) indicates a perfect positive correlation and a correlation of -1 (shaded in dark red) indicates a perfect inverse correlation.
The correlogram reveals a couple of insights. First, as we saw before, the correlation between investment and total score is very strong (0.89), while the correlation between acreage and total score is fairly weak (0.28). Second, we now also have the correlations between the metrics themselves. Most are positive, with two exceptions: acreage and amenities (-0.13), and acreage and accessibility (-0.27).
These last two correlations suggest that as park size increases, parks have fewer amenities and are less accessible. This shouldn’t come as too big a surprise: it may be harder to build amenities in large parks, and relatively large parks may sit on the outskirts of a city, making them less accessible. Furthermore, these large parks could act as local “nature parks” where amenities aren’t really needed; as in proper nature parks, visitors may come mainly to enjoy the landscape and nature.
Although these findings are interesting, we should be careful not to attribute causality to these relationships. They do, however, provide a starting point for further (causal) analyses and suggest interesting hypotheses, such as those described in the previous paragraph.