Describe the Data

For my final project, I will be analyzing how the distribution of singletrack trails correlates with population density across Utah Census Tracts. To do this, I will be using two datasets. The first contains Census Tract data from 2010. This is a shapefile which also includes area and population variables of each tract. This will be used to calculate population density. The second shapefile consists of identified trails across the state. The datasets are further discussed below:

Q1) Describe and Cite Data

Census Tract Data

This Utah Census Tract dataset has been used in several of our assignments throughout the semester. While we have used tract populations in some of our analysis, we have not yet combined them with the corresponding Tract area to compute population density. I feel that this is a much more sound measure as it allows us to better compare each tract against each other.

Data Citation: UGRC, 2010, 2010 U.S. Census Bureau Data, Utah Division of Geographic Information, https://gis.utah.gov/data/demographic/census/


Trail Data

This map layer represents off-street features making up Utah’s recreation trail and transportation network. This data was originally collected and purchased in 2014 and revised by several entities throughout the years. The latest update was made in November 2023. Each trail and pathway is classified based on the surface and user type. This will allow me to focus solely on soft-surface singletrack trails designated for hiking and biking. There is no particular reason for this focus other than personal interest. This said, the results may conclude a lack of singletrack trails in a particular tract while other trail types are abundant. This bias and error must be taken into account. Furthermore, it is important to note that while great efforts have been made to include as many trails and pathways as possible, this is not a complete list. This said, this will introduce error into our results.

Data Citation: UGRC, 2020, Utah Trails and Pathways, Utah Division of Geographic Information, https://gis.utah.gov/data/recreation/trails/


Q3-Q4) Research Questions

RQ1: How does the number of single track trails change with population density?

To address this question, I will first create a new variable with values representing the population density of each observation (tract). This is easily computed by dividing the population (POP100) by the tract area (AREALAND). Two considerations need to be made for this. First, I will be strictly using land area and not the combination of land area and water area. This is due to the assumption that none of the population lives within the water area and no trails are present on the water area. Secondly, the tract area is presented in square meters. To increase the readability of the population density values, I will be dividing the area by 1,000,000 to represent square kilometers. Secondly, I will use the tract polygons to add a new variable to the trail dataset, which includes which tract the trail resides in. I can then do a simple analysis of the average trail accessibility in relation to the population density.

One potential limitation to this analysis is that trails may intersect multiple tracts. To address this, I will assign the trail to both tracts as it is assumed that residents of each would have access to the trail.


RQ2: How do the kilometers of single track trails change with population density?

This analysis would be very similar to RQ1 but instead of simply calculating the number of directly available trails, I would calculate the number of kilometers of singletrack trail within a particular census tract. this may be a more reliable and comparable measure as two tracts may have the same number of trails, yet one may have longer trails, thus being more acceptable.


RQ3: Are census tracts with more singletrack trails generally more densely populated? Is this a linear relationship?

This question builds off of the previous questions by taking the trail-density value of each tract and determining if there is a linear relationship between the two.



Data Analysis


Set Up

library("tidyverse")
library("ggExtra")
library("sf")
library("spdep")
library("spgwr")
library("spatialreg")
library("scales")
library("dplyr")
library(tmap)



Loading Data


Loading Census Tract Data
## Reading layer `CensusTracts2010' from data source 
##   `/Users/glendonvansandt/Desktop/NR6950 - R/Final/Utah_Census_Tracts_2010/CensusTracts2010.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 588 features and 32 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -114.053 ymin: 36.99795 xmax: -109.0411 ymax: 42.00162
## Geodetic CRS:  WGS 84


Loading Park Data
## Reading layer `TrailsAndPathways' from data source 
##   `/Users/glendonvansandt/Desktop/NR6950 - R/Final/Utah_Trails_and_Pathways/TrailsAndPathways.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 41727 features and 27 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: -114.0435 ymin: 36.93035 xmax: -108.9807 ymax: 42.09105
## Geodetic CRS:  WGS 84


Q2) Cleaning the Data

Tidy??

To be classified as tidy, data must follow these three rules:

  1. Every column is a variable.

  2. Every row is an observation.

  3. Every cell is a single value.

view(trails)

view(tract)

when inspecting the data, it is clear that both datasets follow each of these rules, thus making them tidy.


Cleaning data

Fortunately, both datasets are relatively clean, thus making it quite easy for me to use. While its not technically ‘cleaning’ the data, I will need to omit all existing trails not classified as dirt or unpaved or permitting hiking and/or biking.

hbst <- trails[trails$SurfaceTyp == "Unpaved" & trails$Status == "EXISTING" & trails$Class == "Trail" & trails$CartoCode %in% c("1 - Hiking Only", "2 - Hiking and Biking Allowed", "5 - Biking Only"), ]


Additionally, the land area variable was categorized as categorical even though it was numerical. Ill have to change this. While I’m at it, I might as well create new variables. One of which is converting the area to square kilometers. The other is the computed population density.

tract$AREALAND <- as.numeric(tract$AREALAND)

tract <- tract %>%
  mutate(AREALAND_km2 = AREALAND / 1e6)

tract <- tract %>%
  mutate(popdens = POP100 / AREALAND_km2)



Exploring the Data

Q5-Q7) Visualize the Center and Spread of Key Variables and identify outliers

Tract Data

I started by simply looking at the summary statistics of each of my variables.

summary(tract$POP100)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3438    4442    4700    5786   21591
# The minimum tract population is 0 residents while the maximum is 21591 residents. The average population of the tract is 4700 residents. 
summary(tract$AREALAND_km2)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.467     2.067     4.059   361.936    22.716 14822.635
# The smallest tract had a land area of 0.467 square kilometers while the largest was 14822 square kilometers. This said, the average land area of all Utah Census Tracts was 4.059 square kilometers.
summary(tract$popdens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   230.9  1185.9  1330.2  2004.5 11352.3
# Tract population density ranged from 0 people per square kilometer to 11352.3 people per square kilometer. The average population density was 1185.9 people per square kilometer. 

Because the tract data was numerical, I could visualize the variables better than the trail data.

# There are several outliers, but because the data seems to be accurate, I will not be removing it from the dataset. This said, I will log-transform the right-skewed variables to make future analysis more accurate. One thing to note here is that because the minimum tract population is 0, I had to add a small constant to the values to avoid a value of negative infinity. 

boxplot(tract$POP100, main = "Distribution of Tract Population")

small_constant <- 0.01 
tract$log_POP100 <- log(tract$POP100 + small_constant)

boxplot(tract$log_POP100)

summary(tract$log_POP100)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -4.605   8.143   8.399   8.301   8.663   9.980
# Much like the population data, there are no land areas that I deemed necessary for removal. This said, there was an extreme variance of values, thus leading me to perform a log transformation.  

boxplot(tract$AREALAND_km2, main = "Distribution of Land Area")

tract$log_AREALAND_km2 <- log(tract$AREALAND_km2)
boxplot(tract$log_AREALAND_km2)

summary(tract$log_AREALAND_km2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7618  0.7262  1.4009  2.2569  3.1231  9.6039
# I will do the same log transformation for population density and add the small constant.  

boxplot(tract$popdens, main = "Distribution of Population Density")

tract$log_popdens <- log(tract$popdens + small_constant)
boxplot(tract$log_popdens)

summary(tract$log_POP100)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -4.605   8.143   8.399   8.301   8.663   9.980

I also wanted to plot the log population density to gain a better understanding of how the data was spatially related.

ggplot() +
geom_sf(data = tract, aes(fill = log_popdens)) +
scale_fill_distiller(palette = "spectral", breaks = pretty_breaks(n = 5)) +
labs(title = "Population Density by Census Tract", fill = "Population Density") +
theme_minimal()
## Warning in pal_name(palette, type): Unknown palette spectral

Looking at the population of the tract data, an issue was noticed. There are three areas representing public lands with no permanent residents, thus having a population of 0. One in particular represents the Wasatch–Cache National Forest. This tract contains the main source of trails for Residents of Cache Valley. This means that while cache valley has lots of singletrack available, their trail-density value will be very low. To address this, I will create a variable that includes trails within a 10km radios of the tract boundary. This would allow Logan, UT tracts to include some of the trails in Logan Canyon and Green Canyon.


Trail Data

First, I wanted to visualize the trail user type distribution.

# This has shown that the majority of the filtered trails are multi-use with hiking-only trails being more predominant than cycling-only trails. These results are expected as most trails are managed as multi-use. Furthermore, wilderness areas do not allow bicycles which may provide one explanation for the difference between hiking-only and biking-only trails. 

barplot(table(hbst$CartoCode), main = "Distribution of User Types", xlab = "User Types", ylab = "Number of Trails")

Second, I wanted to explore the lengths of trails

# Here, I created a new trail length variable. I wanted the distances shown as kilometers and scientific notation turned off. I also log transformed as the data was skewed to the right. Lastly, I removed all trails with a trail length of 0. 

hbst$trail_length <- st_length(hbst$geometry) / 1000
hbst$trail_length <- as.numeric(gsub("[^0-9.]", "", as.character(hbst$trail_length)))
hbst$log_trail_length <- log(hbst$trail_length)
hbst <- subset(hbst, log_trail_length != 0)

summary(hbst$log_trail_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf -2.1446 -0.6972    -Inf  0.5342  3.6989
boxplot(hbst$log_trail_length)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn


I wanted to visualize the trails a couple of ways.

# Simply plotting the trails

ggplot() +
  geom_sf(data = hbst,) +
  theme_minimal()

Part 2 Q1) Projection

Both datasets are projected using WGS 84. No reprojection is needed.

proj_tract <- st_crs(tract)
proj_hbst <- st_crs(hbst)
print(proj_tract)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]
print(proj_hbst)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]


Part 2 Q3) Data Subsets

It was at this point, when I realized that trying to compute the relationship of every trail and tract in Utah going to push my computer past its limits. For the sake of time and sanity, I will be strictly focusing my analysis on the tracts here in Cache County. I will make this clear in the final presentation

To accomplish this, I needed to clip both the tract and trail datasets to Cache County. This was quite easy as both datasets included either county name or county FIPS variables.

cc_tract <- tract[tract$COUNTYFP10 == "005", ]
ggplot() +
  geom_sf(data = cc_tract) +
  theme_minimal()

cc_trails <- hbst[hbst$County %in% c("CACHE"), ]
ggplot() +
  geom_sf(data = cc_trails) +
  theme_minimal()

#overlaying Cache County trails onto Cache County census tracts. 

ggplot() +
  geom_sf(data = cc_tract, fill = "lightgray", color = "black") +
  geom_sf(data = cc_trails, aes(color = CartoCode), size = 3) +  
  scale_color_manual(name = "User Types", values = c("1 - Hiking Only" = "red", "2 - Hiking and Biking Allowed" = "blue", "5 - Biking Only" = "green")) +
  labs(title = "Trails and Census Tracts in Cache County") +
  theme_minimal()


Part 2 Q2 & Q4) Spatial Object and merge

Although it is not a part of my original analysis, I also wanted to add trail head data just for an added visualization later on. The dataset does not include any county identifier so I needed to create a new spatial object based on the trailhead point coordinates.

trailheads <- st_read("/Users/glendonvansandt/Desktop/NR6950 - R/Final/Utah_Trailheads/Trailheads.shp")
## Reading layer `Trailheads' from data source 
##   `/Users/glendonvansandt/Desktop/NR6950 - R/Final/Utah_Trailheads/Trailheads.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 568 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -114.0238 ymin: 36.99999 xmax: -109.128 ymax: 41.9744
## Geodetic CRS:  WGS 84
cc_trailheads <- st_intersection(trailheads, cc_tract)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
th_coords <- st_coordinates(cc_trailheads)

#trails and trailheads overlaid onto census tract population density. I will create a better visualization after the analysis in part 3. 
plot(cc_tract$geometry, col = colorRampPalette(c("lightblue", "darkblue"))(length(cc_tract$log_popdens)), main = "Trails and Trailheads in Cache County")
points(th_coords, col = "green", pch = 18)
plot(cc_trails$geometry, col = "coral", add = TRUE)

In order to answer the RQ1, I need to create a new spatial object representing a 10km buffer around each census tract. This was done as some tracts may not have any trails residing within their tract boundaries, yet residents would still have access to trails outside of their tract if they travel a short distance.

cc_tract_buff <- st_buffer(cc_tract, dist = 10000)
cc_bufftracts <- st_union(cc_tract, cc_tract_buff)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

secondly, I needed to create a new spatial object that merged the Cache County tract data with the trail count information.

trails_in_bufftracts <- st_intersects(cc_bufftracts, cc_trails) %>%
  lengths()

tract_trail_count <- data.frame(TRACTCE10 = cc_tract$TRACTCE10, trail_count = trails_in_bufftracts)

cc_tract2 <- merge(cc_tract, tract_trail_count, by = "TRACTCE10", all.x = TRUE)

cc_tract2_sf <- st_sf(cc_tract2)


Part 2 Q5 & Q6) Spatial visualizations

ggplot() +
  geom_sf(data = cc_tract2_sf, aes(fill = trail_count)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", breaks = seq(min(cc_tract2_sf$trail_count), max(cc_tract2_sf$trail_count), length.out = 6), guide = "legend") +
  theme_minimal() +
  labs(title = "Number of Trails in each Tract (Including Buffer)", fill = "Trail Count")

tm_shape(cc_tract2_sf) +
  tm_fill("trail_count", palette = "Blues", style = "quantile") +
  tm_borders() +
  tm_layout(title = "Number of Trails in each Tract (Including Buffer)", 
            legend.text.size = 0.7,
            legend.show = TRUE)

These visualizations show the number of trails available to each tract. All census tracts have at least 14 trails available within a 10km distance. the Wasatch–Cache National Forest tract has the most with 274 available trails. Most citizens of Cache Valley rely on the national forest for outdoor recreation opportunities. This map shows this very well as the urbanized tracts have fewer trails than those in and along the forest boundary.


Next, I followed the same method for the number of trails but for the length of trails. This will help us answer RQ2

trail_length_in_bufftracts <- st_intersects(cc_bufftracts, cc_trails) %>%
  lengths() %>%
  sapply(function(indices) sum(cc_trails$trail_length[indices]))

tract_trail_length <- data.frame(TRACTCE10 = cc_tract$TRACTCE10, trail_length = trail_length_in_bufftracts)

cc_tract3 <- merge(cc_tract, tract_trail_length, by = "TRACTCE10", all.x = TRUE)

cc_tract3_sf <- st_sf(cc_tract3)



cc_tract3_total <- cc_tract3_sf %>%
  group_by(TRACTCE10) %>%
  summarise(total_trail_length = sum(trail_length))

#Plotting the length of trail accessible to each tract.
ggplot() +
  geom_sf(data = cc_tract3_total, aes(fill = total_trail_length)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
       breaks = seq(min(cc_tract3_total$total_trail_length), 
          max(cc_tract3_total$total_trail_length), 
          length.out = 6), 
       guide = "legend") +
  theme_minimal() +
  labs(title = "Total Trail Length in each Tract (Including Buffer)", fill = "Total Trail Length")

tm_map <- tm_shape(cc_tract3_total) +
  tm_fill("total_trail_length", palette = "Blues", style = "cont") +
  tm_borders()
tm_map <- tm_map + tm_layout(title = "Total Trail Length in each Tract (Including Buffer)")
tm_view(tm_map)
## $tm_layout
## $tm_layout$alpha

## 
## $tm_layout$style
## [1] NA
## 
## 
## attr(,"class")
## [1] "tm"

These visualizations show the kilometers of trail available to each tract. While the results of this are visually similar to the previous visualization, there are a couple interesting findings. Firstly, while the number of trails available to Cache Valley residents is minimal, the total distance of trails available to them is actually quite high. Secondly, trail length seems to be a better measure than trail count as it considers the span of each trail and doesn’t just assume each trail is the distance. For example, one might assume that because some of the logan urban tracts have relatively few available trails, that the available trail distance would also be low. By analyzing total trail length, we can see that this is not necessarily true.