Reading in data that might explain urban development patterns for Salt Lake City
library(terra)
## terra 1.8.60
#land use data
NLCD_2001 <- rast("NLCD_2001_SL.tif")
NLCD_2004 <- rast("NLCD_2004_SL.tif")
NLCD_2006 <- rast("NLCD_2006_SL.tif")
NLCD_2008 <- rast("NLCD_2008_SL.tif")
NLCD_2011 <- rast("NLCD_2011_SL.tif")
NLCD_2013 <- rast("NLCD_2013_SL.tif")
NLCD_2016 <- rast("NLCD_2016_SL.tif")
# Distance to parks and protected areas in km (Euclidian) for the study area
Park_dist <- rast("Parks_dist_SL.tif")
# Road density for a 1 km neighborhood
Rd_dns1km <- rast("Rd_dns1km_SL.tif")
# Distance to water bodies in km (Euclidean)
WaterDist <- rast("WaterDist_SL.tif")
# elevation
DEM <- rast("DEM_SL.tif")
Comparing two different years of land use data.
plot(NLCD_2001)
plot(NLCD_2016)
Combining all the rasters from all the different years and changing it into a data frame for data analysis
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks terra::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
allrasters <- c(NLCD_2001, NLCD_2004, NLCD_2006, NLCD_2008, NLCD_2011, NLCD_2013, NLCD_2016, Park_dist, Rd_dns1km, WaterDist,DEM)
allrastersSL <- as.data.frame(allrasters, xy=TRUE)
#filtering out NA or no data (these have a pixel code of 128 in the raster)
allrastersSL <- allrastersSL %>%
filter (NLCD_2001_SL != 128)
head(allrastersSL)
## x y NLCD_2001_SL NLCD_2004_SL NLCD_2006_SL NLCD_2008_SL
## 1 -1330860 2102550 11 52 52 95
## 2 -1330830 2102550 11 52 52 95
## 3 -1330800 2102550 11 11 11 11
## 4 -1330770 2102550 11 11 11 11
## 5 -1330740 2102550 11 11 11 11
## 6 -1330710 2102550 11 11 71 71
## NLCD_2011_SL NLCD_2013_SL NLCD_2016_SL Parks_dist_SL Rd_dns1km_SL
## 1 95 95 95 1358.308 0
## 2 95 95 95 1343.317 0
## 3 95 11 11 1328.834 0
## 4 95 11 11 1314.876 0
## 5 95 11 95 1301.461 0
## 6 71 71 71 1288.255 0
## WaterDist_SL DEM_SL
## 1 301.4963 1281
## 2 305.9412 1281
## 3 313.2092 1281
## 4 323.1099 1281
## 5 330.0000 1281
## 6 331.3608 1281
Creating our sample size
library(leaflet)
sampleSLrnd <- spatSample(allrasters, size=200, "random", cells=TRUE, xy=TRUE)
head(sampleSLrnd)
## cell x y NLCD_2001_SL NLCD_2004_SL NLCD_2006_SL NLCD_2008_SL
## 1 2267811 -1305780 2065950 42 42 42 42
## 2 860975 -1305420 2088690 42 42 42 42
## 3 680294 -1324890 2091600 NA NA NA NA
## 4 870339 -1302900 2088540 NA NA NA NA
## 5 2833901 -1305480 2056800 NA NA NA NA
## 6 1795515 -1331940 2073570 22 22 22 22
## NLCD_2011_SL NLCD_2013_SL NLCD_2016_SL Parks_dist_SL Rd_dns1km_SL
## 1 42 42 42 2427.591 0.02754821
## 2 42 42 42 6033.656 0.00000000
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 22 22 22 8885.319 0.26446280
## WaterDist_SL DEM_SL
## 1 3428.148 2408
## 2 2774.311 2107
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 1682.676 1332
plot(sampleSLrnd$x, sampleSLrnd$y)
Checking for distribution
hist(sampleSLrnd$Parks_dist_SL)
Looking at the spatial autocorrelation between the 4 selected independent variables that we stacked into the raster
#dataframe that now has lat and long
flat_data <- as.data.frame(sampleSLrnd)
#calculating distances btwn points
dist_matrix <- as.matrix(dist(cbind(flat_data$x, flat_data$y)))
#matrix of inverse distance weights?
dist_matix.inv <- 1/dist_matrix
diag(dist_matix.inv) <- 0
library(ape)
##
## Attaching package: 'ape'
## The following object is masked from 'package:dplyr':
##
## where
## The following objects are masked from 'package:terra':
##
## rotate, trans, zoom
Moran.I(sampleSLrnd$Rd_dns1km_SL, dist_matix.inv, na.rm = TRUE)
## $observed
## [1] 0.2180653
##
## $expected
## [1] -0.005025126
##
## $sd
## [1] 0.0137177
##
## $p.value
## [1] 0
I want to see what codes commonly are appearing in my data frame.
SLcols <- sampleSLrnd[ , grepl("NLCD_", names(sampleSLrnd))]
lapply(SLcols, function(x) table(x, useNA = "ifany"))
## $NLCD_2001_SL
## x
## 11 21 22 23 24 31 41 42 52 71 81 82 90 95 <NA>
## 7 9 20 12 5 3 19 14 30 7 5 4 3 2 60
##
## $NLCD_2004_SL
## x
## 11 21 22 23 24 31 41 42 52 71 81 82 90 95 <NA>
## 5 9 20 12 5 5 19 14 34 5 4 3 3 2 60
##
## $NLCD_2006_SL
## x
## 11 21 22 23 24 31 41 42 52 71 81 82 90 95 <NA>
## 5 11 19 18 5 5 18 14 30 5 3 3 2 2 60
##
## $NLCD_2008_SL
## x
## 11 21 22 23 24 31 41 42 52 71 81 82 90 95 <NA>
## 4 11 19 18 5 6 18 15 29 5 3 3 2 2 60
##
## $NLCD_2011_SL
## x
## 11 21 22 23 24 31 41 42 52 71 81 82 90 95 <NA>
## 4 12 20 18 5 5 18 15 28 4 3 3 2 3 60
##
## $NLCD_2013_SL
## x
## 11 21 22 23 24 31 41 42 52 71 81 82 90 95 <NA>
## 5 12 20 18 5 5 18 15 28 4 3 3 2 2 60
##
## $NLCD_2016_SL
## x
## 11 21 22 23 24 31 41 42 52 71 81 82 90 95 <NA>
## 3 12 20 18 5 5 18 15 28 4 3 3 2 4 60
Looking at specifically what forested or scrub areas have been developed. 41 is deciduous forest, 42 is evergreen, 52 is shrub/scrub, 71 is grassland/herbaceous, 95 is emergent herbaceous wetlands.
natural <- c(41, 42, 52, 71, 95)
allrastersSL <- allrastersSL %>%
mutate(
natTransition = (NLCD_2001_SL %in% natural) &
!(NLCD_2016_SL %in% natural) &
!is.na(NLCD_2016_SL)
)
#plotting natural areas that stayed natural from 2001 to 2016 vs natural areas that changed to any other land coverage type
#true areas = change, false = stayed natural
ggplot(allrastersSL, aes(x = x, y = y, color = natTransition)) +
geom_point(size = 2, shape = 15) +
scale_color_manual(values = c("FALSE" = "darkgreen", "TRUE" = "lightblue")) +
coord_equal() +
theme_minimal() +
labs(color = "Natural → Other")
Let’s see what type of land use these lost natural areas became.
allrastersSL <- allrastersSL %>%
mutate(
natToClass = ifelse(NLCD_2001_SL %in% natural &
!(NLCD_2016_SL %in% natural),
NLCD_2016_SL,
NA)
)
table(allrastersSL$natToClass, useNA = "ifany")
##
## 11 21 22 23 24 31 43 81 82 90
## 6032 14483 15763 23147 10436 4532 260 317 4088 624
## <NA>
## 2243467
Calculting percent of decrease in natural areas in 2016 since 2001.
#natural pixels in 2001 and 2016 (counts)
natural2001 <- sum(allrastersSL$NLCD_2001_SL %in% natural, na.rm = TRUE)
natural2016 <- sum(allrastersSL$NLCD_2016_SL %in% natural, na.rm = TRUE)
naturalChanged <- sum(
(allrastersSL$NLCD_2001_SL %in% natural) &
!(allrastersSL$NLCD_2016_SL %in% natural),
na.rm = TRUE
)
percentNaturalChanged <- naturalChanged / natural2001 * 100
percentNaturalChanged
## [1] 6.16201
Generalized linear model for my data of interest
natural <- c(41, 42, 52, 71, 95)
allrastersSL <- allrastersSL %>%
mutate(
natTransition = ifelse(
(NLCD_2001_SL %in% natural) &
!(NLCD_2016_SL %in% natural) &
!is.na(NLCD_2016_SL),
1,
0
)
)
# presence points
naturalLoss <- allrastersSL %>% filter(natTransition == 1)
# absence points (non-loss)
nonLoss <- allrastersSL %>% filter(natTransition == 0)
# sample absence points 2x the presence
set.seed(123)
index <- sample(1:nrow(nonLoss), nrow(naturalLoss) * 2)
nonLossSample <- nonLoss[index, ]
SLsample <- rbind(naturalLoss, nonLossSample)
natLoss_glm <- glm(
natTransition ~ Parks_dist_SL + Rd_dns1km_SL + WaterDist_SL + DEM_SL,
data = SLsample,
family = binomial
)
summary(natLoss_glm)
##
## Call:
## glm(formula = natTransition ~ Parks_dist_SL + Rd_dns1km_SL +
## WaterDist_SL + DEM_SL, family = binomial, data = SLsample)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.399e+00 4.021e-02 84.55 <2e-16 ***
## Parks_dist_SL 1.192e-04 1.031e-06 115.55 <2e-16 ***
## Rd_dns1km_SL -8.666e-01 4.355e-02 -19.90 <2e-16 ***
## WaterDist_SL 1.803e-04 5.289e-06 34.09 <2e-16 ***
## DEM_SL -3.397e-03 2.893e-05 -117.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 304312 on 239045 degrees of freedom
## Residual deviance: 253179 on 239041 degrees of freedom
## AIC: 253189
##
## Number of Fisher Scoring iterations: 6
Checking for goodness of fit.
library(ROCR)
# generate predicted probabilities
pred <- prediction(
predict(natLoss_glm, type = "response"),
SLsample$natTransition
)
# ROC curve
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
# AUC
auc_ROCR <- performance(pred, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
auc_ROCR
## [1] 0.7564264
Cluster analysis
library(sf)
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(dplyr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
amenData <- st_read("AmenDataAll.shp")
## Reading layer `AmenDataAll' from data source
## `/Users/ansleighgunter/Desktop/Lab5prj/AmenDataAll.shp' using driver `ESRI Shapefile'
## Simple feature collection with 79971 features and 138 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -2356114 ymin: 272043.6 xmax: 2263886 ymax: 3182044
## Projected CRS: NAD83 / Conus Albers
geomData <- st_sfc(amenData$geometry)
#select vars
clusterVars <- c("GolfCount", "ShopCount", "HigherEdCo")
amenDataDF_numeric <- as.data.frame(amenData)[, clusterVars]
#scale
cluster_scaled <- scale(amenDataDF_numeric)
#determine number of clusters
wss <- numeric(10)
for (k in 1:10) {
wss[k] <- sum(kmeans(cluster_scaled, centers = k, nstart = 25)$withinss)
}
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares",
main = "Elbow Method to Determine K")
#kmeans
set.seed(123) # for reproducibility
fit <- kmeans(cluster_scaled, centers = 5, nstart = 25)
#add back
amenData$cluster <- fit$cluster
#add geom back
st_geometry(amenData) <- geomData
#plot
ggplot() +
geom_sf(data = amenData, aes(fill = as.factor(cluster)), color = NA) +
theme_minimal() +
ggtitle("Clusters based on Golf, Shop, HigherEd Counts") +
scale_fill_viridis_d(name = "Cluster")
#for interpretation
cluster_summary <- aggregate(cluster_scaled, by = list(fit$cluster), FUN = mean)
colnames(cluster_summary)[1] <- "Cluster"
print(cluster_summary)
## Cluster GolfCount ShopCount HigherEdCo
## 1 1 2.6513089 -0.09718645 0.22216358
## 2 2 -0.2542386 -0.09718645 -0.05063041
## 3 3 2.4170949 7.95160435 0.83767199
## 4 4 2.1281651 3.13413897 17.78189806
## 5 5 2.3523914 1.08129693 90.00966249
Questions 1. I did kind of the reverse of what we did for the lab tutorial, to see if it’s the opposite pattern or a little different. I wanted to see what previously natural areas in Salt Lake City had been lost since 2001, whereas our tutorial focusing on the amount of urban development that was gained. I’m wondering if the predictors of distance to parks, roads, water, and elevation are just as good at predicting the places where these natural sites will be as they were at predicting the areas of increased urban development. I’ll just be looking at whether the estimate is positive or negative to get a sense of how good of a predictor it is. Park distance has a slight positive effect. If a natural area is farther away from parks, that means it has a higher chance of being lost, which makes sense, as these parks are protected areas. Water distance is slightly positive as well, so natural areas further away from water have a higher chance of loss. Elevation is negative, so higher elevation mean lower chance of loss, which makes sense, because these higher elevation areas are harder to access and not the low hanging fruit that would experience land-use change. Road density is negative, so higher road density means a lower chance of natural area loss? This one doesn’t make as much sense to me, as road density, seems it would be a pretty good predictor for natural area loss. All of the values are significant, which can be seen from the stars next to the p-values. The goodness of fit test generated a value of 0.7564264, which means my model is a pretty good predictor of where natural areas will be lost. I think an interesting variable to look at would be population density per each pixel. This would allow us to visualize if natural areas are more likely to be lost near dense urban centers, perhaps because of the need for more housing, or next to low density areas, possibly as a result of continued suburban sprawl.