Exploring and cleaning the data
Data exploration
Let’s check out our dataset.
rmarkdown::paged_table(head(parks))
Our dataset has 27 variables, including year, city, and other discrete and continuous variables. We can see that some of these variables end with “data” while others end with “points”. According to the Trust for Public Land, variables ending with “data” contain raw values, while variables ending with “points” are yearly normalized scores (a higher number is better).
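As a quick way to see which columns follow each naming convention, we can peek at them with tidyselect helpers. This is only a sketch; it assumes the tidyverse is loaded and that the suffixes are literally “data” and “points”.
library(dplyr)
# Raw columns vs. their normalized counterparts
parks %>% select(ends_with("data")) %>% glimpse()
parks %>% select(ends_with("points")) %>% glimpse()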
We’re interested specifically in the following variables:
- year, the year of measurement;
- city, the city name;
- med_park_size_points, the median park size points;
- park_pct_city_points, the parkland as percentage of city area points;
- pct_near_park_points, the percent of residents within a 10 minute walk to park points;
- spend_per_resident_points, the spending per resident in points;
- amenities_points, the amenities points total (i.e. recreational area);
- total_points, the total points.
Our four main metrics are represented by five variables: med_park_size_points and park_pct_city_points (acreage), pct_near_park_points (accessibility), spend_per_resident_points (investment), and amenities_points (amenities). Note that the acreage metric is computed from both the med_park_size_points and park_pct_city_points variables.
We can start off by checking which cities are in the dataset.
cities <- parks %>%
count(city)
rmarkdown::paged_table(cities)
There are 102 cities in total, and two of them should be renamed for consistency: “Washington, DC” becomes “Washington, D.C.” and “Charlotte/Mecklenburg County” becomes “Charlotte”.
parks$city <- case_when(
parks$city == "Washington, DC" ~ "Washington, D.C.",
parks$city == "Charlotte/Mecklenburg County" ~ "Charlotte",
TRUE ~ parks$city
)
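A quick sanity check (a sketch that simply looks for both the old and the new labels) confirms that the renaming worked:
# The new labels should each appear with a positive count,
# while the old labels should return no rows at all
parks %>%
  filter(city %in% c("Washington, D.C.", "Charlotte",
                     "Washington, DC", "Charlotte/Mecklenburg County")) %>%
  count(city)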
Let’s get rid of the unnecessary city_dup column…
parks <- parks %>%
select(-city_dup)
We can compute the average score for each of the main metrics by year.
parks %>%
select(year, city, amenities_points, spend_per_resident_points, pct_near_park_points, park_pct_city_points,
med_park_size_points, total_points, total_pct) %>%
group_by(year) %>%
summarise(avg_amenities = mean(amenities_points),
avg_spend = mean(spend_per_resident_points),
avg_near_park = mean(pct_near_park_points),
avg_pct_city = mean(park_pct_city_points),
avg_med_park_size = mean(med_park_size_points),
avg_total_pct = mean(total_pct)) %>%
kable(caption = "Average metric scores") %>%
kable_minimal()
Average metric scores

| year | avg_amenities | avg_spend | avg_near_park | avg_pct_city | avg_med_park_size | avg_total_pct |
|------|---------------|-----------|---------------|--------------|-------------------|---------------|
| 2012 | NA | 9.525000 | 22.05000 | 10.27500 | 10.90000 | 50.90000 |
| 2013 | NA | 10.340000 | 23.16000 | 9.74000 | 10.56000 | 51.41000 |
| 2014 | NA | 10.216667 | 23.33333 | 10.03333 | 10.16667 | 51.85000 |
| 2015 | 39.213333 | 9.813333 | 24.93333 | 10.14667 | 10.56000 | 54.50000 |
| 2016 | 10.173469 | 10.295918 | 22.40816 | 10.54082 | 10.62245 | 53.35204 |
| 2017 | 9.868687 | 10.303030 | 22.10101 | 10.19192 | 10.61616 | NA |
| 2018 | 19.556701 | 19.690722 | 22.38144 | 10.48454 | 10.58763 | 51.61134 |
| 2019 | 49.592784 | 50.438144 | 58.01546 | 25.43814 | 26.49485 | 52.49485 |
| 2020 | 49.291753 | 54.278351 | 57.15464 | 25.71134 | 25.41237 | 52.96495 |
It seems like average scores are more or less consistent in the first few years, but change quite a bit in the last two years. Perhaps the maximum score varies each year. Let’s take a look at the maximum values for all years.
parks %>%
select(year, city, amenities_points, spend_per_resident_points, pct_near_park_points, park_pct_city_points,
med_park_size_points, total_points, total_pct) %>%
group_by(year) %>%
summarise(max_amenities = max(amenities_points),
max_spend = max(spend_per_resident_points),
max_near_park = max(pct_near_park_points),
max_pct_city = max(park_pct_city_points),
max_med_park_size = max(med_park_size_points),
max_total_points = max(total_points)) %>%
kable(caption = "Maximum metric scores") %>%
kable_minimal()
Maximum metric scores

| year | max_amenities | max_spend | max_near_park | max_pct_city | max_med_park_size | max_total_points |
|------|---------------|-----------|---------------|--------------|-------------------|------------------|
| 2012 | NA | 20 | 40 | 20 | 20 | 89.0 |
| 2013 | NA | 20 | 40 | 20 | 20 | 97.0 |
| 2014 | NA | 20 | 40 | 20 | 20 | NA |
| 2015 | 77.0 | 20 | 40 | 20 | 20 | 19.0 |
| 2016 | 19.0 | 20 | 40 | 20 | 20 | 104.0 |
| 2017 | 19.0 | 20 | 40 | 20 | 20 | 87.5 |
| 2018 | 37.0 | 40 | 40 | 20 | 20 | 135.0 |
| 2019 | 90.8 | 100 | 100 | 50 | 50 | NA |
| 2020 | 92.5 | 100 | 100 | 50 | 50 | 341.0 |
As the last table shows, the maximum score for these variables has changed over the years. For instance, the maximum possible score for the amenities_points variable appears to have gone from 80 in 2015, to 20 in 2016 and 2017, to 40 in 2018, and to 100 in 2019 and 2020. The maximum potential scores for the other metrics are slightly more consistent, but there is still a big change in the scoring methodology in 2019 and 2020. We’ll standardize the scores in the next section.
Standardizing the data
Let’s standardize the scores across all years so that they are comparable over time. That is, the amenities, investment and accessibility scores should each have a maximum of 100. The acreage score is constructed from the park_pct_city_points and med_park_size_points variables, so each of these should have a maximum of 50.
parks_standardized <- parks %>%
select(year, city, amenities_points, spend_per_resident_points, pct_near_park_points, park_pct_city_points,
med_park_size_points, total_points, total_pct, rank) %>%
mutate(amenities_points =
case_when(
year == 2015 ~ amenities_points*1.25,
year == 2016 ~ amenities_points*5,
year == 2017 ~ amenities_points*5,
year == 2018 ~ amenities_points*2.5,
TRUE ~ amenities_points),
spend_per_resident_points =
case_when(
year %in% c(2012, 2013, 2014, 2015, 2016, 2017) ~ spend_per_resident_points*5,
year == 2018 ~ spend_per_resident_points*2.5,
TRUE ~ spend_per_resident_points),
pct_near_park_points =
case_when(
year %in% c(2012, 2013, 2014, 2015, 2016, 2017, 2018) ~ pct_near_park_points*2.5,
TRUE ~ pct_near_park_points),
park_pct_city_points =
case_when(
year %in% c(2012, 2013, 2014, 2015, 2016, 2017, 2018) ~ park_pct_city_points*2.5,
TRUE ~ park_pct_city_points),
med_park_size_points =
case_when(
year %in% c(2012, 2013, 2014, 2015, 2016, 2017, 2018) ~ med_park_size_points*2.5,
TRUE ~ med_park_size_points))
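As an aside, instead of hard-coding the multipliers, we could rescale each points column against its yearly maximum. This is only a sketch (the name parks_rescaled is mine), and it gives similar results only when some city actually attains, or comes close to, the maximum possible score in a given year:
# Generic alternative: rescale each metric to a 0-100 range per year,
# and the two acreage components to a 0-50 range.
# Years where a column is entirely NA (amenities before 2015) trigger a warning and stay NA.
parks_rescaled <- parks %>%
  group_by(year) %>%
  mutate(across(c(amenities_points, spend_per_resident_points, pct_near_park_points),
                ~ .x / max(.x, na.rm = TRUE) * 100),
         across(c(park_pct_city_points, med_park_size_points),
                ~ .x / max(.x, na.rm = TRUE) * 50)) %>%
  ungroup()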
Let’s check the scores again…
parks_standardized %>%
select(year, city, amenities_points, spend_per_resident_points, pct_near_park_points, park_pct_city_points,
med_park_size_points, total_points, total_pct) %>%
group_by(year) %>%
summarise(avg_amenities = mean(amenities_points),
avg_spend = mean(spend_per_resident_points),
avg_near_park = mean(pct_near_park_points),
avg_pct_city = mean(park_pct_city_points),
avg_med_park_size = mean(med_park_size_points)) %>%
kable(caption = "Average metric scores") %>%
kable_minimal()
Average metric scores

| year | avg_amenities | avg_spend | avg_near_park | avg_pct_city | avg_med_park_size |
|------|---------------|-----------|---------------|--------------|-------------------|
| 2012 | NA | 47.62500 | 55.12500 | 25.68750 | 27.25000 |
| 2013 | NA | 51.70000 | 57.90000 | 24.35000 | 26.40000 |
| 2014 | NA | 51.08333 | 58.33333 | 25.08333 | 25.41667 |
| 2015 | 49.01667 | 49.06667 | 62.33333 | 25.36667 | 26.40000 |
| 2016 | 50.86735 | 51.47959 | 56.02041 | 26.35204 | 26.55612 |
| 2017 | 49.34343 | 51.51515 | 55.25253 | 25.47980 | 26.54040 |
| 2018 | 48.89175 | 49.22680 | 55.95361 | 26.21134 | 26.46907 |
| 2019 | 49.59278 | 50.43814 | 58.01546 | 25.43814 | 26.49485 |
| 2020 | 49.29175 | 54.27835 | 57.15464 | 25.71134 | 25.41237 |
The yearly averages are now much more consistent, which suggests the variables have been standardized correctly.
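As a final check, the yearly maxima of the standardized scores should now be (close to) 100 for amenities, investment and accessibility, and 50 for the two acreage components. A quick verification sketch:
# Maxima after standardization; amenities is entirely NA before 2015,
# so those years return -Inf (with a warning)
parks_standardized %>%
  group_by(year) %>%
  summarise(max_amenities = max(amenities_points, na.rm = TRUE),
            max_spend = max(spend_per_resident_points, na.rm = TRUE),
            max_near_park = max(pct_near_park_points, na.rm = TRUE),
            max_pct_city = max(park_pct_city_points, na.rm = TRUE),
            max_med_park_size = max(med_park_size_points, na.rm = TRUE))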
Creating the categorical region variable
We can create the categorical variable region to enrich the analysis later on. We’ll use four categories: west, midwest, south and northeast, and each city in our dataset will fall under exactly one of them.
parks_standardized <- parks_standardized %>%
mutate(region =
case_when(
city %in% c("Albuquerque", "Anaheim", "Anchorage", "Aurora", "Bakersfield", "Boise", "Chandler",
"Chula Vista", "Colorado Springs", "Denver", "Henderson", "Honolulu", "Irvine",
"Las Vegas", "Long Beach", "Los Angeles", "North Las Vegas", "Oakland", "Phoenix",
"Portland", "Reno", "Riverside", "Sacramento", "San Diego", "San Francisco", "Santa Ana",
"Scottsdale", "Seattle") ~ "west",
city %in% c("Arlington, Texas", "Arlington, Virginia", "Atlanta", "Austin", "Baton Rouge",
"Charlotte", "Chesapeake", "Corpus Christi", "Dallas", "Durham", "El Paso",
"Fort Worth", "Fremont",
"Fresno", "Garland", "Glendale", "Greensboro", "Hialeah", "Houston", "Irving",
"Jacksonville", "Laredo", "Lexington", "Louisville", "Lubbock", "Memphis", "Mesa",
"Miami", "Nashville", "New Orleans", "Norfolk", "Oklahoma City", "Orlando", "Plano",
"Raleigh", "Richmond", "San Antonio", "San Jose", "St. Petersburg", "Stockton", "Tampa",
"Tucson", "Tulsa", "Virginia Beach", "Washington, D.C.", "Winston-Salem") ~ "south",
city %in% c("Baltimore", "Boston", "Buffalo", "Jersey City", "New York", "Newark", "Philadelphia",
"Pittsburgh") ~ "northeast",
city %in% c("Chicago", "Cincinnati", "Cleveland", "Columbus", "Des Moines", "Detroit", "Fort Wayne",
"Indianapolis", "Kansas City", "Lincoln", "Madison", "Milwaukee", "Minneapolis",
"Omaha", "St. Louis", "St. Paul", "Toledo", "Wichita") ~ "midwest"))
parks_standardized$region <- as_factor(parks_standardized$region)
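A quick tally per region (a sketch using a single year so each city is counted once) helps confirm that no city was left without a region:
# Cities per region in 2020; an NA region would flag a city
# missing from the case_when() above
parks_standardized %>%
  filter(year == 2020) %>%
  count(region, sort = TRUE)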
Getting the coordinates for US cities
We’ll also need coordinates for the cities in our dataset in order to create a map of the U.S. later on. These can be obtained from the us.cities dataframe (shipped with the ‘maps’ package), which lists coordinates for medium to large U.S. cities; the ‘usmap’ package is then used to transform the coordinates for mapping. Below is the code used to prepare the data.
#Clean the us.cities dataset (from the 'maps' package)
us.cities$name <- str_sub(us.cities$name, end = -4)
us.cities$name <- str_replace(us.cities$name, "Saint", "St.")
us.cities$name <- str_replace(us.cities$name, "WASHINGTON", "Washington, D.C.")
parks_2020 <- parks_standardized %>%
filter(year == 2020)
#Join datasets
us.cities <- left_join(x = us.cities, y = parks_2020, by = c("name" = "city"))
#Remove rows with missing values
us.cities <- us.cities %>%
drop_na()
#Identify duplicate cities
duplicates <- us.cities %>%
add_count(name) %>%
filter(n > 1)
#Manually check and remove duplicates
duplicates <- duplicates[-c(2, 3, 4, 6, 9, 10, 12, 14, 16, 18, 19),]
duplicates <- duplicates %>%
select(-n)
#Merge duplicates dataset to us.cities
us.cities <- filter(us.cities,
!name %in% c("Aurora", "Columbus", "Glendale", "Jacksonville",
"Kansas City", "Lincoln", "Madison", "Newark", "Portland"))
us.cities <- rbind(us.cities, duplicates)
#Prepare dataset for mapping
us.cities_map <- us.cities[, c(5, 4, 1, 2, 14, 16)]
us.cities_map <- us.cities_map %>%
rename(state = country.etc)
us.cities_mapt <- usmap_transform(us.cities_map)
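For reference, here is a minimal sketch of the kind of map this data enables, using ggplot2’s built-in state polygons (via the ‘maps’ package) rather than usmap’s transformed coordinates. It assumes us.cities_map contains long, lat and total_points columns (the actual column names may differ).
library(ggplot2)
# Sketch: continental U.S. basemap with one point per city, colored by total score.
# Alaska and Hawaii are omitted here; the usmap approach above handles them.
states <- map_data("state")
ggplot() +
  geom_polygon(data = states, aes(long, lat, group = group),
               fill = "grey95", colour = "grey70") +
  geom_point(data = us.cities_map, aes(long, lat, colour = total_points), size = 2) +
  scale_colour_gradient(low = "yellow", high = "red") +
  coord_quickmap() +
  theme_void()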
We’re now ready to visualize our data.
Visualizing the data
The graph below shows the ten U.S. cities with the best overall park scores for each year. As the graph shows, some cities make the list only in certain years, while others appear almost every year. For example, Sacramento was among the cities with the best parks only in the first couple of years (2012 and 2013), while Washington, D.C. has had one of the best scores every year since 2012 (except in 2014). Lastly, Minneapolis has had the best parks every year since 2013, except in 2019 (note: its data for 2017 are missing).
Line graph - cities

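A sketch of how a chart along these lines might be built, assuming the rank column holds each city’s numeric overall ranking within a year:
library(ggplot2)
# Sketch: track the cities ranked in the top ten in any given year
parks_standardized %>%
  filter(rank <= 10) %>%
  ggplot(aes(x = year, y = rank, colour = city)) +
  geom_line() +
  geom_point() +
  scale_y_reverse(breaks = 1:10) +   # rank 1 at the top
  labs(x = NULL, y = "Overall rank", colour = "City")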
We can also take a look at major U.S. cities and compare their park scores.

Among major U.S. cities, Washington, D.C. has had the best parks since 2015. Overall, the trends for these cities are relatively stable. Northeastern cities also seem to have substantially better scores than cities in other regions, especially those in the South, such as Houston or Miami. Is this pattern the same for all 102 cities in the analysis?
The plot below shows the average park score of each U.S. region by year. Northeastern cities do indeed have higher scores, on average, than other regions, especially in the earlier years, when the Northeast had the best scores. However, Midwestern cities’ scores have increased consistently, and since 2018 the Midwest has had the best parks in the U.S.
Bar charts - US regions

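A sketch of the computation behind a chart like the one above, using the average total_pct per region and year as the overall score (an assumption on my part; note that total_pct is missing for 2017 in this dataset):
library(ggplot2)
# Average overall score per region and year
parks_standardized %>%
  group_by(year, region) %>%
  summarise(avg_score = mean(total_pct, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = factor(year), y = avg_score, fill = region)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "Average total score (%)", fill = "Region")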
What makes Midwestern cities’ parks perform so well? Do Midwestern cities outperform other U.S. cities in one particular metric? For example, they might simply invest more in their parks.
Let’s explore the relationship between each of the four metrics and the total park score for all cities. We’ll focus only on the most recent year’s data (2020), which should be enough to get an idea of the relationship between each metric and the total score. I’ve also applied a LOESS smoother to help visualize each relationship.
Scatter plots - cities

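A sketch of how such a panel could be assembled: pivot the 2020 metric scores into long form, facet by metric, and add a LOESS smoother per panel. Unlike in the charts above, the two acreage components are kept separate here for simplicity.
library(tidyr)
library(ggplot2)
# Metric score vs. total score for 2020, one facet per metric
parks_standardized %>%
  filter(year == 2020) %>%
  pivot_longer(cols = c(med_park_size_points, park_pct_city_points,
                        pct_near_park_points, spend_per_resident_points,
                        amenities_points),
               names_to = "metric", values_to = "score") %>%
  ggplot(aes(x = score, y = total_points, colour = region)) +
  geom_point(alpha = 0.7) +
  geom_smooth(aes(group = 1), method = "loess", se = FALSE, colour = "grey40") +
  facet_wrap(~ metric, scales = "free_x") +
  labs(x = "Metric score", y = "Total score", colour = "Region")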
In chart A above, which shows the relationship between acreage score and total score, each point represents one city, colored by its region. There is no obvious positive relationship between acreage score and total points for any region except the South. In fact, for Western and Midwestern cities the relationship appears to follow an inverted U-shape: bigger parks are generally better, but only up to a point; past some acreage score, a higher value does not necessarily translate into better parks. Finally, as we saw before, Southern cities tend to have lower total scores than the rest of U.S. cities. This can also be visualized in the map above, which shows all cities analyzed by the TPL in 2020: points in red are cities with high total scores, whereas points in yellow have low total scores.
Unlike chart A, charts B, C and D show a clearer positive relationship between each metric and total points across all regions. The relationship between investment score and total points appears to be the strongest of the four: a city with a high investment score is very likely to also have a high total score. Contrast this with the acreage score, where a high value does not necessarily imply a high total score.
Although the charts above are useful for visualizing the relationships between the four metrics and the total score, a correlogram (correlation matrix) makes it easier to assess the strength of each relationship.
Correlogram - metric scores

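A sketch of how a correlogram like the one above could be produced; it uses the ‘corrplot’ package and the 2020 data (both assumptions, since the original package and year aren’t shown) and keeps the two acreage components separate:
library(corrplot)
# Correlation matrix of the metric scores and the total score for 2020
metric_cors <- parks_standardized %>%
  filter(year == 2020) %>%
  select(amenities_points, spend_per_resident_points, pct_near_park_points,
         park_pct_city_points, med_park_size_points, total_points) %>%
  cor(use = "pairwise.complete.obs")
corrplot(metric_cors, method = "color", type = "lower",
         addCoef.col = "black", tl.col = "black")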
The correlogram shows the correlation between each of the metrics, including total score. A correlation of 1 (shaded in dark blue) indicates a perfect positive correlation and a correlation of -1 (shaded in dark red) indicates a perfect inverse correlation.
The correlogram reveals a couple of insights. First, as we saw before, the correlation between investment and total score is very strong (0.89), while the correlation between acreage and total score is fairly weak (0.28). Second, we now also have the correlations between the metrics themselves. Most are positive, with two exceptions: acreage and amenities (-0.13), and acreage and accessibility (-0.27).
These last two correlations suggest that as park size increases, parks have fewer amenities and are less accessible. This shouldn’t come as too big a surprise: it may be harder to build amenities in large parks, and relatively large parks may sit on the outskirts of a city, making them less accessible. Furthermore, these large parks could act as local “nature parks” where amenities aren’t really needed; as in proper nature parks, visitors may come mainly to enjoy the landscape and nature.
Although these findings are interesting, we should be careful not to attribute causality to these relationships. They do, however, provide a starting point for further (causal) analyses and suggest interesting hypotheses, such as those described in the previous paragraph.