Reading in data that might explain urban development patterns for Salt Lake City

library(terra)
## terra 1.8.60
#land use data
NLCD_2001 <- rast("NLCD_2001_SL.tif")
NLCD_2004 <- rast("NLCD_2004_SL.tif")
NLCD_2006 <- rast("NLCD_2006_SL.tif")
NLCD_2008 <- rast("NLCD_2008_SL.tif")
NLCD_2011 <- rast("NLCD_2011_SL.tif")
NLCD_2013 <- rast("NLCD_2013_SL.tif")
NLCD_2016 <- rast("NLCD_2016_SL.tif")
# Distance to parks and protected areas in km (Euclidean) for the study area
Park_dist <- rast("Parks_dist_SL.tif")
# Road density for a 1 km neighborhood
Rd_dns1km <- rast("Rd_dns1km_SL.tif")
# Distance to water bodies in km (Euclidean)
WaterDist <- rast("WaterDist_SL.tif")
# elevation
DEM <- rast("DEM_SL.tif")

Comparing two different years of land use data.

plot(NLCD_2001)

plot(NLCD_2016)
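
To make the visual comparison easier, the two years can be drawn side by side; stacking them and calling plot() gives one panel per layer (a minimal sketch):

# plot both layers as panels of one figure
plot(c(NLCD_2001, NLCD_2016), main = c("NLCD 2001", "NLCD 2016"))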

Combining all the rasters from the different years into one stack and converting it to a data frame for analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks terra::extract()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
allrasters <- c(NLCD_2001, NLCD_2004, NLCD_2006, NLCD_2008, NLCD_2011, NLCD_2013, NLCD_2016, Park_dist, Rd_dns1km, WaterDist, DEM)

allrastersSL <- as.data.frame(allrasters, xy=TRUE)
#filtering out NA or no data (these have a pixel code of 128 in the raster)
allrastersSL <- allrastersSL %>%
  filter(NLCD_2001_SL != 128)
head(allrastersSL)
##          x       y NLCD_2001_SL NLCD_2004_SL NLCD_2006_SL NLCD_2008_SL
## 1 -1330860 2102550           11           52           52           95
## 2 -1330830 2102550           11           52           52           95
## 3 -1330800 2102550           11           11           11           11
## 4 -1330770 2102550           11           11           11           11
## 5 -1330740 2102550           11           11           11           11
## 6 -1330710 2102550           11           11           71           71
##   NLCD_2011_SL NLCD_2013_SL NLCD_2016_SL Parks_dist_SL Rd_dns1km_SL
## 1           95           95           95      1358.308            0
## 2           95           95           95      1343.317            0
## 3           95           11           11      1328.834            0
## 4           95           11           11      1314.876            0
## 5           95           11           95      1301.461            0
## 6           71           71           71      1288.255            0
##   WaterDist_SL DEM_SL
## 1     301.4963   1281
## 2     305.9412   1281
## 3     313.2092   1281
## 4     323.1099   1281
## 5     330.0000   1281
## 6     331.3608   1281
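
Only the 2001 column was filtered above, so as a quick sanity check we can count how many pixels still carry the 128 no-data code in each year (assuming 128 marks no data in every year, as it does for 2001):

# count remaining 128 (no data) codes in each NLCD column
sapply(allrastersSL[ , grepl("NLCD_", names(allrastersSL))], function(x) sum(x == 128, na.rm = TRUE))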

Drawing a random sample of 200 cells from the raster stack.

library(leaflet)
sampleSLrnd <- spatSample(allrasters, size = 200, method = "random", cells = TRUE, xy = TRUE)
head(sampleSLrnd)
##      cell        x       y NLCD_2001_SL NLCD_2004_SL NLCD_2006_SL NLCD_2008_SL
## 1 2267811 -1305780 2065950           42           42           42           42
## 2  860975 -1305420 2088690           42           42           42           42
## 3  680294 -1324890 2091600           NA           NA           NA           NA
## 4  870339 -1302900 2088540           NA           NA           NA           NA
## 5 2833901 -1305480 2056800           NA           NA           NA           NA
## 6 1795515 -1331940 2073570           22           22           22           22
##   NLCD_2011_SL NLCD_2013_SL NLCD_2016_SL Parks_dist_SL Rd_dns1km_SL
## 1           42           42           42      2427.591   0.02754821
## 2           42           42           42      6033.656   0.00000000
## 3           NA           NA           NA            NA           NA
## 4           NA           NA           NA            NA           NA
## 5           NA           NA           NA            NA           NA
## 6           22           22           22      8885.319   0.26446280
##   WaterDist_SL DEM_SL
## 1     3428.148   2408
## 2     2774.311   2107
## 3           NA     NA
## 4           NA     NA
## 5           NA     NA
## 6     1682.676   1332
plot(sampleSLrnd$x, sampleSLrnd$y)
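
leaflet is loaded above but never used; it expects longitude/latitude, while the sample coordinates are in the rasters' projected CRS. A minimal sketch of viewing the sample on an interactive basemap, reprojecting with terra first:

# drop NA rows, build points in the rasters' CRS, then reproject to lon/lat
pts <- vect(na.omit(sampleSLrnd), geom = c("x", "y"), crs = crs(allrasters))
coords <- crds(project(pts, "EPSG:4326"))
leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = coords[, 1], lat = coords[, 2], radius = 3)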

Checking the distribution of distance to parks.

hist(sampleSLrnd$Parks_dist_SL)
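
Distances like this are often right-skewed; a log transform can make the shape easier to judge (a sketch; log1p handles zero distances):

hist(log1p(sampleSLrnd$Parks_dist_SL), main = "log(1 + distance to parks)")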

Looking at spatial autocorrelation in the sampled predictor variables using Moran's I, computed here for road density (the other predictors can be tested the same way, as sketched after the output below).

# data frame of the sampled cells with their x/y coordinates
flat_data <- as.data.frame(sampleSLrnd)
# calculating distances between points
dist_matrix <- as.matrix(dist(cbind(flat_data$x, flat_data$y)))
# matrix of inverse distance weights, with zeros on the diagonal
dist_matrix.inv <- 1/dist_matrix
diag(dist_matrix.inv) <- 0

library(ape)
## 
## Attaching package: 'ape'
## The following object is masked from 'package:dplyr':
## 
##     where
## The following objects are masked from 'package:terra':
## 
##     rotate, trans, zoom
Moran.I(sampleSLrnd$Rd_dns1km_SL, dist_matrix.inv, na.rm = TRUE)
## $observed
## [1] 0.2180653
## 
## $expected
## [1] -0.005025126
## 
## $sd
## [1] 0.0137177
## 
## $p.value
## [1] 0
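
The observed value of 0.218 is well above the expected value under no autocorrelation, with p ≈ 0, so road density is positively spatially autocorrelated. The same test applies to the other predictors; a sketch:

# repeat Moran's I for the remaining continuous predictors
predictors <- c("Parks_dist_SL", "WaterDist_SL", "DEM_SL")
lapply(flat_data[predictors], function(v) Moran.I(v, dist_matrix.inv, na.rm = TRUE))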

I want to see which land cover codes appear most often in my sample.

SLcols <- sampleSLrnd[ , grepl("NLCD_", names(sampleSLrnd))]
lapply(SLcols, function(x) table(x, useNA = "ifany"))
## $NLCD_2001_SL
## x
##   11   21   22   23   24   31   41   42   52   71   81   82   90   95 <NA> 
##    7    9   20   12    5    3   19   14   30    7    5    4    3    2   60 
## 
## $NLCD_2004_SL
## x
##   11   21   22   23   24   31   41   42   52   71   81   82   90   95 <NA> 
##    5    9   20   12    5    5   19   14   34    5    4    3    3    2   60 
## 
## $NLCD_2006_SL
## x
##   11   21   22   23   24   31   41   42   52   71   81   82   90   95 <NA> 
##    5   11   19   18    5    5   18   14   30    5    3    3    2    2   60 
## 
## $NLCD_2008_SL
## x
##   11   21   22   23   24   31   41   42   52   71   81   82   90   95 <NA> 
##    4   11   19   18    5    6   18   15   29    5    3    3    2    2   60 
## 
## $NLCD_2011_SL
## x
##   11   21   22   23   24   31   41   42   52   71   81   82   90   95 <NA> 
##    4   12   20   18    5    5   18   15   28    4    3    3    2    3   60 
## 
## $NLCD_2013_SL
## x
##   11   21   22   23   24   31   41   42   52   71   81   82   90   95 <NA> 
##    5   12   20   18    5    5   18   15   28    4    3    3    2    2   60 
## 
## $NLCD_2016_SL
## x
##   11   21   22   23   24   31   41   42   52   71   81   82   90   95 <NA> 
##    3   12   20   18    5    5   18   15   28    4    3    3    2    4   60
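
The codes follow the standard NLCD legend, so a small lookup vector makes these tables easier to read (a sketch):

# standard NLCD legend labels for the codes above
nlcd_labels <- c("11" = "Open Water", "21" = "Developed, Open Space",
                 "22" = "Developed, Low Intensity", "23" = "Developed, Medium Intensity",
                 "24" = "Developed, High Intensity", "31" = "Barren Land",
                 "41" = "Deciduous Forest", "42" = "Evergreen Forest",
                 "52" = "Shrub/Scrub", "71" = "Grassland/Herbaceous",
                 "81" = "Pasture/Hay", "82" = "Cultivated Crops",
                 "90" = "Woody Wetlands", "95" = "Emergent Herbaceous Wetlands")
t2016 <- table(sampleSLrnd$NLCD_2016_SL)
setNames(as.vector(t2016), nlcd_labels[names(t2016)])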

Looking specifically at which forested or scrub areas have been converted to other uses. Per the NLCD legend, 41 is deciduous forest, 42 is evergreen forest, 52 is shrub/scrub, 71 is grassland/herbaceous, and 95 is emergent herbaceous wetlands.

natural <- c(41, 42, 52, 71, 95)
allrastersSL <- allrastersSL %>%
  mutate(
    natTransition = (NLCD_2001_SL %in% natural) &
                    !(NLCD_2016_SL %in% natural) &
                    !is.na(NLCD_2016_SL)
  )
#plotting natural areas that stayed natural from 2001 to 2016 vs natural areas that changed to any other land cover type
#TRUE = natural in 2001 but lost by 2016; FALSE = everything else (stayed natural or was never natural)
ggplot(allrastersSL, aes(x = x, y = y, color = natTransition)) +
  geom_point(size = 2, shape = 15) +
  scale_color_manual(values = c("FALSE" = "darkgreen", "TRUE" = "lightblue")) +
  coord_equal() +
  theme_minimal() +
  labs(color = "Natural → Other")

Let’s see what type of land use these lost natural areas became.

allrastersSL <- allrastersSL %>%
  mutate(
    natToClass = ifelse(NLCD_2001_SL %in% natural &
                        !(NLCD_2016_SL %in% natural),
                        NLCD_2016_SL,
                        NA)
  )
table(allrastersSL$natToClass, useNA = "ifany")
## 
##      11      21      22      23      24      31      43      81      82      90 
##    6032   14483   15763   23147   10436    4532     260     317    4088     624 
##    <NA> 
## 2243467
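
Expressed as shares of only the converted pixels (table() drops the NA majority by default), a sketch:

# percentage breakdown of destination classes among converted natural pixels
round(100 * prop.table(table(allrastersSL$natToClass)), 1)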

Calculating the percentage decrease in natural areas from 2001 to 2016.

#natural pixels in 2001 and 2016 (counts)
natural2001 <- sum(allrastersSL$NLCD_2001_SL %in% natural, na.rm = TRUE)
natural2016 <- sum(allrastersSL$NLCD_2016_SL %in% natural, na.rm = TRUE)
naturalChanged <- sum(
  (allrastersSL$NLCD_2001_SL %in% natural) &
  !(allrastersSL$NLCD_2016_SL %in% natural),
  na.rm = TRUE
)

percentNaturalChanged <- naturalChanged / natural2001 * 100
percentNaturalChanged
## [1] 6.16201
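
natural2016 is computed above but not used: the 6.2% figure counts only losses. The net change, which also credits pixels that became natural by 2016, would be:

# net change in natural pixels relative to 2001 (losses minus gains)
percentNetChange <- (natural2001 - natural2016) / natural2001 * 100
percentNetChange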

Fitting a generalized linear model (logistic regression) for natural area loss.

natural <- c(41, 42, 52, 71, 95)

allrastersSL <- allrastersSL %>%
  mutate(
    natTransition = ifelse(
      (NLCD_2001_SL %in% natural) &
      !(NLCD_2016_SL %in% natural) &
      !is.na(NLCD_2016_SL),
      1,
      0
    )
  )
# presence points
naturalLoss <- allrastersSL %>% filter(natTransition == 1)
# absence points (non-loss)
nonLoss <- allrastersSL %>% filter(natTransition == 0)
# sample absence points 2x the presence
set.seed(123)
index <- sample(1:nrow(nonLoss), nrow(naturalLoss) * 2)
nonLossSample <- nonLoss[index, ]
SLsample <- rbind(naturalLoss, nonLossSample)
natLoss_glm <- glm(
  natTransition ~ Parks_dist_SL + Rd_dns1km_SL + WaterDist_SL + DEM_SL,
  data = SLsample,
  family = binomial
)
summary(natLoss_glm)
## 
## Call:
## glm(formula = natTransition ~ Parks_dist_SL + Rd_dns1km_SL + 
##     WaterDist_SL + DEM_SL, family = binomial, data = SLsample)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.399e+00  4.021e-02   84.55   <2e-16 ***
## Parks_dist_SL  1.192e-04  1.031e-06  115.55   <2e-16 ***
## Rd_dns1km_SL  -8.666e-01  4.355e-02  -19.90   <2e-16 ***
## WaterDist_SL   1.803e-04  5.289e-06   34.09   <2e-16 ***
## DEM_SL        -3.397e-03  2.893e-05 -117.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 304312  on 239045  degrees of freedom
## Residual deviance: 253179  on 239041  degrees of freedom
## AIC: 253189
## 
## Number of Fisher Scoring iterations: 6
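
The coefficients are on the log-odds scale; exponentiating them gives odds ratios per unit of each predictor, which are easier to interpret (a sketch):

# odds ratios: values above 1 raise the odds of loss, values below 1 lower them
exp(coef(natLoss_glm))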

Checking goodness of fit with an ROC curve and AUC.

library(ROCR)
# generate predicted probabilities
pred <- prediction(
  predict(natLoss_glm, type = "response"),
  SLsample$natTransition
)
# ROC curve
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)

# AUC
auc_ROCR <- performance(pred, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
auc_ROCR
## [1] 0.7564264
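
A confusion matrix at a 0.5 probability cutoff gives another view of performance (a sketch; the cutoff is an arbitrary choice):

# classify at a 0.5 cutoff and cross-tabulate against the observed outcome
predClass <- as.integer(predict(natLoss_glm, type = "response") > 0.5)
table(observed = SLsample$natTransition, predicted = predClass)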

Cluster analysis

library(sf)
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(dplyr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
amenData <- st_read("AmenDataAll.shp")
## Reading layer `AmenDataAll' from data source 
##   `/Users/ansleighgunter/Desktop/Lab5prj/AmenDataAll.shp' using driver `ESRI Shapefile'
## Simple feature collection with 79971 features and 138 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -2356114 ymin: 272043.6 xmax: 2263886 ymax: 3182044
## Projected CRS: NAD83 / Conus Albers
# keep a copy of the geometry so it can be re-attached after clustering
geomData <- st_geometry(amenData)
# select clustering variables
clusterVars <- c("GolfCount", "ShopCount", "HigherEdCo")
amenDataDF_numeric <- as.data.frame(amenData)[, clusterVars]
# scale each variable to mean 0, sd 1
cluster_scaled <- scale(amenDataDF_numeric)
# determine the number of clusters with the elbow method
wss <- numeric(10)
for (k in 1:10) {
  wss[k] <- kmeans(cluster_scaled, centers = k, nstart = 25)$tot.withinss
}
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares",
     main = "Elbow Method to Determine K")

# k-means with k = 5 clusters, chosen from the elbow plot
set.seed(123)  # for reproducibility
fit <- kmeans(cluster_scaled, centers = 5, nstart = 25)
# add the cluster assignments back
amenData$cluster <- fit$cluster
# re-attach the geometry
st_geometry(amenData) <- geomData
# plot
ggplot() + 
  geom_sf(data = amenData, aes(fill = as.factor(cluster)), color = NA) +
  theme_minimal() +
  ggtitle("Clusters based on Golf, Shop, HigherEd Counts") +
  scale_fill_viridis_d(name = "Cluster")

# cluster means of the scaled variables, for interpretation
cluster_summary <- aggregate(cluster_scaled, by = list(fit$cluster), FUN = mean)
colnames(cluster_summary)[1] <- "Cluster"
print(cluster_summary)
##   Cluster  GolfCount   ShopCount  HigherEdCo
## 1       1  2.6513089 -0.09718645  0.22216358
## 2       2 -0.2542386 -0.09718645 -0.05063041
## 3       3  2.4170949  7.95160435  0.83767199
## 4       4  2.1281651  3.13413897 17.78189806
## 5       5  2.3523914  1.08129693 90.00966249
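
The table above is in standardized units; cluster means on the original counts are often easier to interpret (a sketch):

# cluster means on the original (unscaled) amenity counts
aggregate(amenDataDF_numeric, by = list(Cluster = fit$cluster), FUN = mean)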

Questions

1. I did kind of the reverse of what we did in the lab tutorial, to see whether the pattern is the opposite or just a little different. I wanted to see which previously natural areas in Salt Lake City have been lost since 2001, whereas our tutorial focused on the amount of urban development that was gained. I'm wondering whether the predictors of distance to parks, road density, distance to water, and elevation are just as good at predicting where natural sites will be lost as they were at predicting areas of increased urban development. I'll just be looking at whether each estimate is positive or negative to get a sense of its effect. Park distance has a slight positive effect: if a natural area is farther from parks, it has a higher chance of being lost, which makes sense, as these parks are protected areas. Water distance is slightly positive as well, so natural areas farther from water have a higher chance of loss. Elevation is negative, so higher elevation means a lower chance of loss, which makes sense because higher-elevation areas are harder to access and are not the low-hanging fruit that would experience land-use change. Road density is negative, so higher road density means a lower chance of natural area loss? This one doesn't make as much sense to me, since road density seems like it would be a pretty good predictor of natural area loss. All of the coefficients are significant, as the stars next to the p-values show. The goodness-of-fit check produced an AUC of 0.7564264, which means my model is a pretty good predictor of where natural areas will be lost. An interesting variable to add would be population density for each pixel. This would let us see whether natural areas are more likely to be lost near dense urban centers, perhaps because of the need for more housing, or next to low-density areas, possibly as a result of continued suburban sprawl.

2. I chose to look at GolfCount, ShopCount, and HigherEdCo in the amenity data because they cover a spread of recreational, commercial, and educational sectors, and these seem like amenities people would want to live near. I used the elbow method to determine the number of clusters, which was 5. Positive values in the summary table represent above-average presence and negative values below-average presence. Cluster 1 has an above-average golf count, below-average shops, and above-average higher ed; this is likely an area dominated by golf. Cluster 2 is below average for all three amenities, likely an isolated residential area or an overall amenity-poor area. Cluster 3 is above average for all three amenities but very high for shops; this could be a mixed-use area with lots of amenities and recreation opportunities, maybe shopping centers. Cluster 4 is above average for all three once again, but this time very high in the higher-ed category; it probably represents university campuses, with recreational amenities and some shops but dominated by higher ed. Cluster 5 is very similar to cluster 4, with above-average values for all three amenities but a super high value for higher education; maybe this is an area dominated by a big school, kind of like Ann Arbor! The majority of the US falls in cluster 2, which is below average for all three of my selected amenities; vast areas of the West are only cluster 2 pixels. Cluster 1 is relatively common and appears interspersed around a lot of major cities, including New York City, Detroit, Chicago, and Orlando. The rarest clusters seem to be 3, 4, and 5: cluster 3 is the category with a very high prevalence of all three amenities, while clusters 4 and 5 are the higher-ed-heavy ones. Because the few cluster 5 pixels on the map don't seem to be around any major cities, these are likely small towns dominated by a large public college. It would be interesting to zoom to different parts of the map; one way to do that is sketched below.
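
A minimal sketch of one way to zoom: coord_sf() can clip the map to a bounding box. The xlim/ylim values below are placeholders in the layer's projected CRS (NAD83 / Conus Albers), not a specific city:

# zoom the cluster map to a sub-region; limits are placeholder coordinates
ggplot() +
  geom_sf(data = amenData, aes(fill = as.factor(cluster)), color = NA) +
  coord_sf(xlim = c(-1500000, -1000000), ylim = c(1900000, 2300000)) +
  scale_fill_viridis_d(name = "Cluster") +
  theme_minimal()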