This markdown is an updated version of a module written by Ken Steif in 2019. It was an unpublished chapter from his book Public Policy Analytics, and it is therefore very detailed. It has been updated to keep it aligned with current packages. For the most part, the original text is still used here, out of respect for the original material and the author. However, alterations, annotations and supplemental code are throughout.

For this project, we built a growth model based on a binary logistic regression that predicts the probability of land cover change at a raster cell level over a specified interval as a function of a) Existing land cover, b) The location of infrastructure and c) Demographic and economic spatial variables.

While we have used logistic regression to model spatial phenomena like flooding before, this workflow is different in that we wrangle raster data in R rather than doing it in ArcGIS.

The appendices to this document add some additional code and alternative code blocks to what is in the chapter itself. You can use these to augment the workflow to run entirely in R and load, downsample and wrangle rasters.

Thanks very much to Jenna Epstein for helping to update and edit this material in 2022. A note from Professor Michael Fichman.

1. Introduction

Regional urban development is an emergent outcome resulting from the autonomous decisions of many different agents, including developers, real estate buyers and tenants as well as planners and regulators. Each of these groups optimize for a different set of bottom lines. Developers seek profit, buying land at a low price, improving on that land, and selling it at a premium. Households and firms consume housing and commercial real estate balancing price constraints with access to amenities and customers, respectively.

Planners regulate development by trading-off economic growth with the mitigation of negative externalities toward the goal of economic and environmental sustainability. This is a case study in managing these trade-offs. The focus of this chapter is on Cook, DuPage, and Will Counties in Illinois in the Chicago region. Chicago is located in Cook County, IL. As a sprawling Metropolitan area, climate change and equity concerns are forcing cities like Chicago to reconsider the role of land use planning.

Land use planning that is both economically productive and sustainable requires both supply and demand-side insights. Specifically, the Planner must understand future demand for development and how that demand contrasts with the supply of environmentally sensitive land. The goal of this chapter is to model this interplay.

The next section wrangles a host of datasets including, Land Cover and Land Cover Change from the USGS, Census demographics, and transportation. Spatial lag features are also engineered from the land cover change data hypothesizing that the time/space scale of development between 2011 and 2021 can help predict new development in 2031.

Exploratory analysis is undertaken to investigate the relationship between development and the aforementioned features. Findings then motivate the estimation of a geospatial predictive model trained on new development between 2011 and 2021. Those dynamics are then harnessed to predict Development_Demand for 2031 - predictions interpreted as the probability of new development here. The goal is not to create the most accurate model, but to demonstrate the role for geospatial machine learning in the land use modeling process.

Once predictions are validated for their accuracy, we evaluate their confluence with environmentally sensitive land in counties throughout the region, with particular emphasis on landscape fragmentation. The final section uses predictions to ‘allocate’ new development across to places where growth can have an economic impact without impeding sustainability goals. It also tests a scenario for attracting new development with an extension of regional rail (new transportation infrastructure).

1.2. Setup

Below we load the libraries needed for the analysis as well as a mapTheme and plotTheme. A set of palette colors are also specified.

library(tidyverse)
library(sf)
library(terra)
library(raster)
library(knitr)
library(kableExtra)
library(tidycensus)
library(tigris)
library(FNN)
#library(QuantPsyc) # JE Note: in R 4.1, QuantPsyc package not available.
library(caret)
library(yardstick)
library(pscl)
library(plotROC) 
library(ggrepel)
library(pROC)
library(grid)
library(gridExtra)
library(viridis)
library(igraph)
library(dplyr)
library(scales)
library(ggplot2)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.75),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette2 <- c("#41b6c4","#253494")
palette4 <- c("#a1dab4","#41b6c4","#2c7fb8","#253494")
palette5 <- c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494")
palette6 <- c("#253494","#2c7fb8","#41b6c4","#a1dab4","#ffffcc")
palette10 <- c("#f7fcf0","#e0f3db","#ccebc5","#a8ddb5","#7bccc4",
               "#4eb3d3","#2b8cbe","#0868ac","#084081","#f7fcf0")

We also include several helper functions. quintilesBreaks takes a dataframe and a column and outputs the quintiles breaks, helping shorten the below ggplot calls.

It takes longer to ggplot a polygon fishnet with geom_sf than it does to plot geom_point. To cut down on plotting time, the xyC (for ‘XY Coordinates’) takes a fishnet sf and converts it to a dataframe of grid cell centroid coordinates.

rast is a function allowing us to quickly plot raster values in ggplot.

#this function converts a column in to quintiles. It is used for mapping.
quintileBreaks <- function(df,variable) {
    as.character(quantile(df[[variable]],
                          c(.01,.2,.4,.6,.8),na.rm=T))
}

#This function can be used to convert a polygon sf to centroids xy coords.
xyC <- function(aPolygonSF) {
  as.data.frame(
    cbind(x=st_coordinates(st_centroid(aPolygonSF))[,1],
          y=st_coordinates(st_centroid(aPolygonSF))[,2]))
} 

#this function convert a raster to a data frame so it can be plotted in ggplot
rast <- function(inRaster) {
  data.frame(
    xyFromCell(inRaster, 1:ncell(inRaster)), 
    value = getValues(inRaster)) }

2. Data Wrangling & Feature Engineering

In this section a considerable amount of vector and raster data is wrangled together into a regression-ready dataset. The following datasets are used:

2.2 - 2.3: Land cover data downloaded from the Multi-Resolution Land Characteristics Consortium’s National Land Cover Database (NLCD) includes annual land cover and land cover change raster data for the entire country. These data are sampled to a 4,000 by 4,000 ft^2 fishnet, which will be used for this analysis. 2.4: Population data is downloaded from the U.S. Census and joined to the fishnet by distributing Census Tract population totals proportionally to each grid cell. 2.5: Highway vectors are downloaded from the U.S. Census TIGER Line shapefiles for 2019 and used to wrangle highway proximity features. Regional rail vectors are downloaded from the City of Chicago. It includes all Metra commuter rail lines in the Chicagoland region. For the Chicago downtown area, we created a polygon for the Chicago Loop and imported the shapefile. 2.6: The land cover change data is used to engineer spatial lag features. 2.7: County polygons are downloaded using the tigris package. 2.8: Each feature is wrangled into a final dataset.

It is important to remember that land cover is not ‘land use’. Typically, the former refers to phenomena on the Earth’s surface including both the built environment and natural resources, while the latter typically refers only to variation in the built environment.

Other raster features are created such as distance to highways, for instance. These rasters are then integrated with a vector fishnet. Additional feature engineering is performed on the vector-side providing a simple, but comprehensive snapshot of the development process in and around Chicago and Cook, DuPage, and Will Counties between 2011 and 2021.

2.2. Land Cover Change Data

The dependent variable we wish to forecast is land cover change between 2011 and 2021. In this section, the land cover raster data is loaded, reclassified and integrated with a vector fishnet. As before, the fishnet will allow us to parametrize spatial relationships in a regression context.

The table below shows descriptions of each categorical land cover type in the land cover data. Below, we will reclassify these data into more useful categories.

JE Note: Land cover categories can be found here - will replace with markdown table at a later point.

Several raster layers have been provided for this analysis:

  • We read in ThreeCountyArea - this is the extent of the study area

  • ThreeCountyLC_2011 is a raster of land cover in 2011 for the three counties: Cook, DuPage, and Will Counties. ThreeCountyLC_2021 is a raster of land cover in 2021 for the same geography. We have to calculate land cover change - where there were conversions between one land cover and another on the time frame 2011-2021. We plot the raster using ggplot and the rast function specified above.

Note that these rasters are projected as NAD 1983 NAD83 Illinois East, crs = 3435. The original land cover raster is at a 30 meter by 30 meter resolution. The rasters provided are ultimately resampled up to 4000 feet by 4000 feet. The Cook, DuPage, and Will County areas are 1,635 sq. miles; 336 sq. miles; and 849 sq. miles, respectively. Together, the total area for all three counties is about 2820 sq. miles, so this metro area is quite large. A vector fishnet dataset of this size would be too computationally intensive. By resampling we gain some computational efficiency but lose some accuracy. Nevertheless, this approach works well for educational purposes.

# Load required libraries
library(raster)

# Read the shapefile
IL_counties <- st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/IL_BNDY_County_Py.shp")
# Change the coordinate system to NAD83 Illinois East
IL_counties <- st_transform(IL_counties, crs = 3435)
# Subset the data - Select only the 3 counties that we need: Cook, Du Page, and Will
IL_counties_subset <- IL_counties[IL_counties$COUNTY_NAM %in% c("WILL", "COOK", "DUPAGE"), c("COUNTY_NAM", "CO_FIPS", "geometry")]

# Write the subsetted data back to a shapefile
# st_write(IL_counties_subset, "C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/ThreeCountyArea.shp")
ThreeCountyArea <- st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/ThreeCountyArea.shp")

# Specify the coordinate system for the 3 IL counties to NAD83 Illinois East
ThreeCountyArea <- st_transform(ThreeCountyArea, crs = 3435)

# Set the desired raster resolution
#resolution <- 100  # Choose a suitable resolution (in meters or degrees, depending on your data)
# Define the extent of the raster
#extent <- extent(ThreeCountyArea)
# Create an empty raster layer with the specified resolution and extent
#empty_raster1 <- raster(extent, res = resolution)
# Rasterize the shapefile onto the empty raster layer
#rasterized <- rasterize(ThreeCountyArea, empty_raster1, field="CO_FIPS")
# Save the raster to a file
# writeRaster(rasterized, "C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/ThreeCountyArea.tif", format = "GTiff")

ThreeCountyArea_Boundary <- raster("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/ThreeCountyArea.tif")

# Read the two land cover rasters, one for 2011 and one for 2021
ThreeCountyLC_2011 <- raster("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/LC_2011_2021/3CountyLC_2011.tif")

ThreeCountyLC_2021 <- raster("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/LC_2011_2021/3CountyLC_2021.tif")

# Reproject the rasters to the same coordinate system as the other data layers: NAD83 Illinois East, crs = 3435
ThreeCountyLC_2011 <- projectRaster(ThreeCountyLC_2011, crs = 3435)
ThreeCountyLC_2021 <- projectRaster(ThreeCountyLC_2021, crs = 3435)

We now calculate land cover change from 2011 to 2021. Reclassify 2011 and 2021 land cover databases to consist of 1 and 0 observations (e.g. 1 is the developed classes 13-24, 0 is everything else).

reclassMatrix <- 
  matrix(c(
    -5,12,0,
    12,24,1,
    24,Inf,0),
  ncol=3, byrow=T)
Developed_2011 <- 
  reclassify(ThreeCountyLC_2011,reclassMatrix)
# you can see the frequency of the values
freq(Developed_2011)
##      value   count
## [1,]     0 4184672
## [2,]     1 3576202
## [3,]    NA  156298
Developed_2021 <- 
  reclassify(ThreeCountyLC_2021,reclassMatrix)
freq(Developed_2021)
##      value   count
## [1,]     0 4152752
## [2,]     1 3608122
## [3,]    NA  156298

Then do some map algebra to find the places where land cover changed (we calculate Development_change). Let’s see a quick histogram of the values - these should range from 0 (undeveloped in 2011, undeveloped in 2021), 1 (undeveloped in 2011, developed in 2021 (presuming nothing went from developed to undeveloped)), and 2 (developed in both periods). The 1’s represent the change. You can see the number of the values (0’s, 1’s, and 2’s) in the frequency table.

Development_change <- Developed_2011+Developed_2021
freq(Development_change)
##      value   count
## [1,]     0 4141691
## [2,]     1   54042
## [3,]     2 3565141
## [4,]    NA  156298
hist(Development_change, xlim = c(0,2))

We now plot the 3 County Area and the land cover change from 2011-2021. Again, the 1s represent where land cover changed from not developed to developed between 2011 and 2021. The 2s represent land cover that was developed both in 2011 and 2021, so there was no change. There is not a lot of development change because it is only one decade and Chicago is already pretty developed, but there are some red areas mostly in Will County to the southwest.

ggplot() +
  geom_raster(data=rast(Development_change) %>% na.omit %>% filter(value > 0), 
              aes(x,y,fill=as.factor(value))) +
 # scale_fill_viridis(direction = -1, discrete=TRUE, option = "A", name ="Land Cover\nChange") +
  scale_fill_manual(values = c("2" = "gray", "1" = "red"), name = "Land Cover\nChange") +
  labs(title = "Land Cover Change for the Three County Area, 2011-2021") +
 mapTheme +
  theme(legend.direction="horizontal") +
geom_sf(data=ThreeCountyArea, fill = "transparent", colour = "black")

Next, we reclassify the raster such that all the developed grid cell values receive a value of 1 and all other values receive a value of 0. This is done using a reclassify matrix. The matrix reads row by row. Row 1 says any grid cell ranging from 0 to 12 takes a value of 0; 13 or greater through 24, a value of 1; and all other values take 0.

reclassMatrix2 <- 
  matrix(c(
    0,0,
    1,1,
    2,0),
  ncol=2, byrow=T)

reclassMatrix2
##      [,1] [,2]
## [1,]    0    0
## [2,]    1    1
## [3,]    2    0

Now reclassify and convert all 0’s to NA. We apply a name to the raster with names. This is done to make it faster to join raster to the fishnet below. You can see the frequency table of values with freq(Development_change2). There are 52042 areas that changed from underdeveloped to developed in the data.

Development_change2 <- 
  reclassify(Development_change,reclassMatrix2)

Development_change2[Development_change2 < 1] <- NA

names(Development_change2) <- "Dev_change"
freq(Development_change2)
##      value   count
## [1,]     1   54042
## [2,]    NA 7863130
ggplot() +
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  geom_raster(data=rast(Development_change2) %>% na.omit, 
              aes(x,y,fill=as.factor(value))) +
  scale_fill_viridis(discrete=TRUE, name ="Land Cover\nChange") + 
  labs(title="Development Land Use Change, 2011-2021") +
  mapTheme

Next, the fishnet is created at 4000 by 4000 foot resolution and subset it to the Three County Area with st_intersection.

ThreeCounty_fishnet <- 
  st_make_grid(ThreeCountyArea, 4000) %>%
  st_sf()

ThreeCounty_fishnet <-
  ThreeCounty_fishnet[ThreeCountyArea,]

The vector fishnet is then plotted. Note that this plot takes a bit of time to render because there are thousands of polygons.

ggplot() +
  geom_sf(data=ThreeCounty_fishnet) +
  labs(title="Fishnet, 4000 Foot Resolution") +
  mapTheme

Then the raster is converted to points, which makes its joining to the vector fishnet a bit faster. Now to extract the raster values into the fishnet. There is a function in the raster package called RasterToPolygon but it is quite slow.

Below, a slightly faster approach is develop that converts the raster to an sf point layer and then joins the points to the fishnet with aggregate. This works well because the raster and the fishnet are of the same spatial resolution. Finally, the fishnet variable Dev_change is created that is 1 where new development has occurred and 0 where it has not. This is our dependent variable and encoded as a factor.

To speed up the mapping process, fishnet polygons are converted to centroid points using the xyC function.

changePoints <-
  rasterToPoints(Development_change2) %>%
  as.data.frame() %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(ThreeCounty_fishnet))

fishnet <- 
  aggregate(changePoints, ThreeCounty_fishnet, sum) %>%
  mutate(Dev_change = ifelse(is.na(Dev_change),0,1),
         Dev_change = as.factor(Dev_change))

ggplot() +
  geom_sf(data=ThreeCounty_fishnet) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)$x, y=xyC(fishnet)$y, colour=Dev_change)) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name = "") +
  labs(title = "Land Cover Development Change", subtitle = "As fishnet centroids") +
  mapTheme

2.3. Land Cover in 2011

It is reasonable to hypothesize that the propensity of new development is a function of existing land cover categories. In this section we identify these other land cover categories from 2011 and integrate each with the fishnet.

# Our shapefile with the boundary for the three counties is ThreeCountyArea
# Our land cover raster for 2011 for the three counties is ThreeCountyLC_2011

# ggplot() +
#   geom_sf(data=ThreeCountyArea) +
#   geom_raster(data=rast(ThreeCountyLC_2011) %>% na.omit %>% filter(value > 0), 
#               aes(x,y,fill=as.factor(value))) +
#   scale_fill_viridis(discrete=TRUE, name ="") +
#   labs(title = "Land Cover, 2011") +
#   mapTheme +
#   theme(legend.direction="horizontal")

The table below shows the approach taken to recoded existing land cover codes into the categories used in our analysis. In the code block below new rasters are generated and names are applied. Naming ensures that when the raster is integrated with the fishnet, the column reflects the appropriate raster.

Old_Classification New_Classification
Open Space as well as Low, Medium and High Intensity Development Developed
Deciduous, Evergreen, and Mixed Forest Forest
Pasture/Hay and Cultivated Crops Farm
Woody and Emergent Herbaceous Wetlands Woodlands
Barren Land, Dwarf Scrub, and Grassland/Herbaceous Other Undeveloped
Water Water
developed <- ThreeCountyLC_2011 == 21 | ThreeCountyLC_2011 == 22 | ThreeCountyLC_2011 == 23 | ThreeCountyLC_2011 == 24
forest <- ThreeCountyLC_2011 == 41 | ThreeCountyLC_2011 == 42 | ThreeCountyLC_2011 == 43 
farm <- ThreeCountyLC_2011 == 81 | ThreeCountyLC_2011 == 82 
wetlands <- ThreeCountyLC_2011 == 90 | ThreeCountyLC_2011 == 95 
otherUndeveloped <- ThreeCountyLC_2011 == 52 | ThreeCountyLC_2011 == 71 | ThreeCountyLC_2011 == 31 
water <- ThreeCountyLC_2011 == 11

names(developed) <- "developed"
names(forest) <- "forest"
names(farm) <- "farm"
names(wetlands) <- "wetlands"
names(otherUndeveloped) <- "otherUndeveloped"
names(water) <- "water"

Next, each raster is aggregated to the fishnet by way of a function called aggregateRaster. Here, the process used above to To do this, a function is created below that loops through a list of rasters, converts the ith raster to points, filters only points that have value of 1 (ie. is the ith land cover type), and then aggregates to the fishnet.

Here is the function.

aggregateRaster <- function(inputRasterList, theFishnet) {
  #create an empty fishnet with the same dimensions as the input fishnet
  theseFishnets <- theFishnet %>% dplyr::select()
  #for each raster in the raster list
  for (i in inputRasterList) {
  #create a variable name corresponding to the ith raster
  varName <- names(i)
  #convert raster to points as an sf
    thesePoints <-
      rasterToPoints(i) %>%
      as.data.frame() %>%
      st_as_sf(coords = c("x", "y"), crs = st_crs(theFishnet)) %>%
      filter(.[[1]] == 1)
  #aggregate to the fishnet
    thisFishnet <-
      aggregate(thesePoints, theFishnet, length) %>%
      mutate(!!varName := ifelse(is.na(.[[1]]),0,1))
  #add to the larger fishnet
    theseFishnets <- cbind(theseFishnets,thisFishnet)
  }
  #output all aggregates as one large fishnet
   return(theseFishnets)
  }

The theRasterList of land cover types in 2011 is created and then fed into aggregateRaster. The result is converted to long form grid cell centroids and plot as small multiple maps.

Note the inclusion of st_cast here which convert all geometries to POLYGON. If you create a frequency table of geometry types in aggregatedRasters, you will notice some and handful of MULTIPOLYGONS. Try table(st_geometry_type(aggregatedRasters)). These rogue multipolygons break the xyC function which is designed to find grid cell centroids. After all, there is no one centroid of several combined polygons. Thus st_cast ensures all geometries are just POLYGON. Look out for this function throughout the remainder of this chapter.

theRasterList <- c(developed,forest,farm,wetlands,otherUndeveloped,water)

aggregatedRasters <-
  aggregateRaster(theRasterList, ThreeCounty_fishnet) %>%
  dplyr::select(developed,forest,farm,wetlands,otherUndeveloped,water) %>%
  mutate_if(is.numeric,as.factor)

aggregatedRasters %>%
  gather(var,value,developed:water) %>%
  st_cast("POLYGON") %>%    #just to make sure no weird geometries slipped in
  mutate(X = xyC(.)$x,
         Y = xyC(.)$y) %>%
  ggplot() +
    geom_sf(data=ThreeCountyArea) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land Cover Types, 2011",
         subtitle = "As fishnet centroids") +
   mapTheme

2.4. Census Data

Population and population change is obviously an critical demand-side component of predicting Development_Demand. Census data for both 2011 and 2021 can be downloaded quickly using the tidycensus package. As illustrated below, these data are downloaded at a census tract geography and thus, an approach is needed to reconcile tracts and fishnet geometries. This is accomplished using a technique called areal weighted interpolation.

Recall, you will need a census API key to download the census data which must be input with census_api_key.

census_api_key("e5e96d76285beca3c6e1a9110d762a430ced4811", overwrite = TRUE)

First data is pulled for 2011 and reprojected.

# Pull population data for 2011
library(tidycensus)
library(tigris)
library(sf)

# Define the counties you want to retrieve data for
counties <- c("Cook", "DuPage", "Will County")

# Retrieve population data for 2011 and tract geometry
Pop_2011 <- get_acs(geography = "tract",
                            variables = "B01003_001",
                            year = 2011,
                            state = "IL",
                            county = counties,
                            geometry = TRUE) %>%
  rename(Pop_2011 = estimate) %>%
  st_transform(st_crs(ThreeCounty_fishnet))

# Write the spatial dataframe to a shapefile
st_write(Pop_2011,"C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Population/Pop_2011.shp", append = TRUE)

# Clip Pop_2011 to the extent of ThreeCounty_fishnet. Otherwise, it includes an area of Lake Michigan
Pop_2011 <- st_intersection(Pop_2011, ThreeCounty_fishnet)

Now data for 2021 is downloaded. In this instance, st_buffer is used to buffer the the tracts by -1ft. This is done because tidycensus appears to return geometries that are problematic when subjected to the area weighted interpolation function below. As done in previous chapters, a very small buffer is used to correct the geometries.

counties <- c("Cook", "DuPage", "Will County")

# Pull population data for 2021
Pop_2021 <- get_acs(geography = "tract",
                            variables = "B01003_001",
                            year = 2021,
                            state = "IL",
                            county = counties,
                            geometry = TRUE) %>%
  rename(Pop_2021 = estimate) %>%
  st_transform(st_crs(ThreeCounty_fishnet)) %>%
  st_buffer(-1)
st_write(Pop_2021,"C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Population/Pop_2021.shp", append = TRUE)

Both years of census data are then plotted.

class(Pop_2011)
## [1] "sf"         "data.frame"
str(Pop_2011)
## Classes 'sf' and 'data.frame':   10316 obs. of  6 variables:
##  $ GEOID   : chr  "17197884004" "17197884004" "17197884004" "17197884004" ...
##  $ NAME    : chr  "Census Tract 8840.04, Will County, Illinois" "Census Tract 8840.04, Will County, Illinois" "Census Tract 8840.04, Will County, Illinois" "Census Tract 8840.04, Will County, Illinois" ...
##  $ variable: chr  "B01003_001" "B01003_001" "B01003_001" "B01003_001" ...
##  $ Pop_2011: num  2552 2552 2552 2552 1555 ...
##  $ moe     : num  221 221 221 221 184 221 184 184 184 184 ...
##  $ geometry:sfc_GEOMETRY of length 10316; first list element: List of 1
##   ..$ : num [1:14, 1:2] 1010924 1009934 1008790 1008790 1008779 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:5] "GEOID" "NAME" "variable" "Pop_2011" ...
str(Pop_2021)
## Classes 'sf' and 'data.frame':   1723 obs. of  6 variables:
##  $ GEOID   : chr  "17031230200" "17197880113" "17031320400" "17031828701" ...
##  $ NAME    : chr  "Census Tract 2302, Cook County, Illinois" "Census Tract 8801.13, Will County, Illinois" "Census Tract 3204, Cook County, Illinois" "Census Tract 8287.01, Cook County, Illinois" ...
##  $ variable: chr  "B01003_001" "B01003_001" "B01003_001" "B01003_001" ...
##  $ Pop_2021: num  1979 4004 3058 4571 1201 ...
##  $ moe     : num  359 573 366 623 163 830 592 441 507 424 ...
##  $ geometry:sfc_POLYGON of length 1723; first list element: List of 1
##   ..$ : num [1:10, 1:2] 1152980 1153860 1153860 1154304 1154763 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:5] "GEOID" "NAME" "variable" "Pop_2021" ...
# If the geometry column is missing or has a different name, rename it
Pop_2011 <- st_set_geometry(Pop_2011, "geometry")
Pop_2021 <- st_set_geometry(Pop_2021, "geometry")

library(gridExtra)

grid.arrange(
  ggplot() +
    geom_sf(data = Pop_2011, aes(fill = factor(ntile(Pop_2011, 5))), colour = NA) +
    scale_fill_manual(values = palette5,
                      labels = quintileBreaks(Pop_2011, "Pop_2011"),
                      name = "Quintile\nBreaks") +
    labs(title = "Population per Census Tract in Cook, Du Page, and Will Counties: 2011") +
    mapTheme,

  ggplot() +
    geom_sf(data = Pop_2021, aes(fill = factor(ntile(Pop_2021, 5))), colour = NA) +
    scale_fill_manual(values = palette5,
                      labels = quintileBreaks(Pop_2021, "Pop_2021"),
                      name = "Quintile\nBreaks") +
    labs(title = "Population per Census Tract in Cook, Du Page, and Will Counties: 2021") +
    mapTheme, 
  ncol = 2
)

Now to reconcile tract boundaries and fishnet grid cells. I’d like you to pay particular attention to this process.

A spatial join would be inappropriate as it would assign the same population value from one tract to the many intersecting grid cells. Instead, the area weighted interpolation function, st_interpolate_aw, assigns a proportion of a tract’s population to a grid cell weighted by the proportion of the tract that intersects the grid cell. This works best of course, when we assume that the tract population is uniformly distributed across the tract. This is typically not a great assumption. However, it is a reasonable here particularly given population is a feature in a regression and not an outcome that needs to be measured with significant precision.

The Census data Pop_2011, has a different spatial extent than ThreeCounty_fishnet. Most notably, there are no vectors where water is present. To maintain the needed 3851 grid cell units (nrow(ThreeCounty_fishnet)), a a unique id is created, fishnetID. Then the area weighted interpolation is performed on population for the 2011 and 2021 layers. Finally, the results are joined back (left_join) to ThreeCounty_fishnet. This approach maintains a consistent spatial extent.

It would be helpful for you to spend some time running through each code block line by line. Areal weighted interpolation is a really strong spatial analysis skill to have. You can do this in ArcGIS but there is not automated approach.

ThreeCounty_fishnet <-
  ThreeCounty_fishnet %>%
  rownames_to_column("fishnetID") %>% 
  mutate(fishnetID = as.numeric(fishnetID)) %>%
  dplyr::select(fishnetID)

fishnetPopulation11 <-
  st_interpolate_aw(Pop_2011["Pop_2011"], ThreeCounty_fishnet, extensive=TRUE) %>%
  as.data.frame(.) %>%
  rownames_to_column(var = "fishnetID") %>%
  left_join(ThreeCounty_fishnet %>%
              mutate(fishnetID = as.character(fishnetID)),
            ., by=c("fishnetID"='fishnetID')) %>% 
  mutate(Pop_2011 = replace_na(Pop_2011,0)) %>%
  dplyr::select(Pop_2011)

fishnetPopulation21 <-
  st_interpolate_aw(Pop_2021["Pop_2021"],ThreeCounty_fishnet, extensive=TRUE) %>%
  as.data.frame(.) %>%
  rownames_to_column(var = "fishnetID") %>%
  left_join(ThreeCounty_fishnet %>%
              mutate(fishnetID = as.character(fishnetID)),
            ., by=c("fishnetID"='fishnetID')) %>% 
  mutate(Pop_2021 = replace_na(Pop_2021,0)) %>%
  dplyr::select(Pop_2021)

fishnetPopulation <- 
  cbind(fishnetPopulation11,fishnetPopulation21) %>%
  dplyr::select(Pop_2011,Pop_2021) %>%
  mutate(pop_Change = Pop_2021 - Pop_2011)

For comparison purposes, both the 2021 census tract geometries and the population weighted grid cells are plotted.

grid.arrange(
ggplot() +
  geom_sf(data=Pop_2021, aes(fill=factor(ntile(Pop_2021,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=substr(quintileBreaks(Pop_2021,"Pop_2021"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population for Cook, Du Page, and Will Counties: 2021",
       subtitle="Represented as tracts; Boundaries omitted") +
  mapTheme,

ggplot() +
  geom_sf(data=fishnetPopulation, aes(fill=factor(ntile(Pop_2021,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                   labels=substr(quintileBreaks(fishnetPopulation,"Pop_2021"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population for Cook, Du Page, and Will Counties: 2021",
       subtitle="Represented as fishnet gridcells; Boundaries omitted") +
  mapTheme, ncol=2)

2.5. Highway Distance

Accessibility is a key determinant of development potential particularly in a sprawling city like Chicago. Accessibility features are engineered by measuring distance from each grid cell to its nearest highway.

First highway vectors are downloaded from the U.S. Census TIGER Line 2019 datasets in a shapefile format; projected and subset to the subset using st_intersection. Below, new development is mapped with the highway overlay.

ThreeCounties_Highways <-
  st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_highways/IL_highways/tl_2019_17_prisecroads.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=Dev_change),size=1.5) +
  geom_sf(data=ThreeCounties_Highways, colour = "red") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Highways",
       subtitle = "As fishnet centroids") +
  mapTheme

Below are some great r-based raster skills. The distance from each grid cell to its nearest highway segment is measured.

First, you can convert a the highway layer to raster. This can be done by creating an emptyRaster of NA grid cells at the same spatial extent as Development_change. Then, highway_raster is created by converting ThreeCounties_Highways to sp form and then to applying rasterize. The raster is then converted to points with rasterToPoints and st_as_sf, then aggregate is used to calculate mean distance by grid cell.

You may (but likely not) be interested in learning that sp is the older spatial data convention in R. Although sf is the new convention, raster/vector interactions still require sp. The as function converts.

Instead, we used a slightly different method to calculate distance to highways. Here is the code we didn’t use:

# Original code (not using):
# emptyRaster <- Development_change
# emptyRaster[] <- NA
# 
# highway_raster <- 
#   as(ThreeCounties_Highways,'Spatial') %>%
#   rasterize(.,emptyRaster)
# 
# highway_raster_distance <- distance(highway_raster)
# names(highway_raster_distance) <- "distance_highways"
# 
# highwayPoints <-
#   rasterToPoints(highway_raster_distance) %>%
#   as.data.frame() %>%
#   st_as_sf(coords = c("x", "y"), crs = st_crs(ThreeCounty_fishnet))
# 
# highwayPoints_fishnet <- 
#   aggregate(highwayPoints, ThreeCounty_fishnet, mean) %>%
#   mutate(distance_highways = ifelse(is.na(distance_highways),0,distance_highways))
# 
# ggplot() +
#   geom_sf(data=ThreeCountyArea) +
#   geom_point(data=highwayPoints_fishnet, aes(x=xyC(highwayPoints_fishnet)[,1], 
#                                              y=xyC(highwayPoints_fishnet)[,2], 
#                  colour=factor(ntile(distance_highways,5))),size=1.5) +
#   scale_colour_manual(values = palette5,
#                       labels=substr(quintileBreaks(highwayPoints_fishnet,"distance_highways"),1,8),
#                       name="Quintile\nBreaks") +
#   geom_sf(data=houstonHighways, colour = "red") +
#   labs(title = "Distance to Highways",
#        subtitle = "As fishnet centroids; Highways visualized in red") +
#   mapTheme

Instead, we read in the highways raster and the ThreeCounty_fishnet and used st_nearest_feature to calculate the distance from each fishnet cell centroid to the nearest highway. The variable highway_dist includes these calculated distances to the nearest highway.

#Read in highways and transform to appropriate projection
# It's called "ThreeCounties_Highways" - we read this in earlier in the code

#Convert fishnet to centroids
centroid <- ThreeCounty_fishnet %>%
  st_centroid()

#Determine nereast highway to each centroid
nearest_feat <- st_nearest_feature(centroid,ThreeCounties_Highways)

#Calcuate distance from each grid square centroid to nearest highway
ThreeCounty_fishnet$highway_dist <- as.double(st_distance(centroid, ThreeCounties_Highways[nearest_feat,], by_element=TRUE))

Highway_fishnet <- ThreeCounty_fishnet

#Make a quick sample map of the results
ggplot()+
  geom_sf(data=Highway_fishnet,aes(fill=highway_dist),color='transparent')+
  scale_fill_viridis_c(name='Distance to Highway (feet)')+
  geom_sf(data=ThreeCounties_Highways,color='red')+
  labs(title = "Distance to Highways",
       subtitle = "Using fishnet centroids") +
  theme_void()

There are a lot of federal and state highways in the three county area, particularly closer to Chicago. There are some more rural and suburban areas in the southern portion of the study area that are further from highways.

2.5.b Distance to Regional Rail Lines (Metra)

Transit accessibility is a key determinant of development potential particularly in a city like Chicago. Transit-oriented development is commonly prioritized to create compact development. Accessibility features are engineered by measuring distance from each grid cell to its nearest regional rail line.

First regional rail line vectors are downloaded from the City of Chicago open data website in shp format; projected and subset to the subset using st_intersection. Below, new development is mapped with the regional rail line overlay. The regional rail lines are a part of the Metra system for the Chicagoland region.

ThreeCounties_RegRail <-
  st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Metra_Lines/MetraLinesshp.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)

Next, we map the existing regional rail lines with our analysis of development change from 2011-2021.

ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=Dev_change),size=1.5) +
  geom_sf(data=ThreeCounties_RegRail, colour = "red") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Regional Rail Lines",
       subtitle = "As fishnet centroids") +
  mapTheme

Next, we calculate the distance to the nearest regional rail line for each fishnet cell centroid and map it. This variable is called RegRail_dist.

#Determine nearest rail line to each centroid
nearest_rail <- st_nearest_feature(centroid,ThreeCounties_RegRail)

#Calcuate distance from each grid square centroid to nearest rail line
ThreeCounty_fishnet$regrail_dist <- as.double(st_distance(centroid, ThreeCounties_RegRail[nearest_rail,], by_element=TRUE))

RegRail_fishnet <- ThreeCounty_fishnet #%>%
#  select(fishnetID, geometry, regrail_dist)

#Make a quick sample map of the results
ggplot()+
  geom_sf(data=RegRail_fishnet,aes(fill=regrail_dist),color='transparent')+
  scale_fill_viridis_c(name='Distance to Regional Rail Lines (feet)')+
  geom_sf(data=ThreeCounties_RegRail,color='red')+
  labs(title = "Distance to Regional Rail Lines",
       subtitle = "Using fishnet centroids") +
  theme_void()

As you can see, most of the three county area is within 25,000 feet (or a little over 4.5 miles) of a regional rail line. This is too large of a distance for walking commuters, but it is accessible for people who bike or drive or take other modes to the regional rail stations to access the Chicago city region. To improve housing density, it is ideal to create more developments in close proximity to regional rail stations, ideally within a half mile for walkers or a mile or more for bike/bus/car/other modal commutes.

2.5.c Distance to Downtown Chicago (The Loop)

Proximity to employment centers is a key determinant of development particularly in a major city like Chicago and its surrounding commuter suburbs. Features are engineered by measuring distance from each grid cell to downtown Chicago, also known as the Loop.

First a polygon feature is traced around Downtown Chicago using ArcGIS Pro. Below, new development is mapped with the Downtown_Chicago feature overlay.

ThreeCounties_Downtown <-
 st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Downtown_Chicago/Downtown_Chicago.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
## Reading layer `Downtown_Chicago' from data source 
##   `C:\Users\3lpaw\Desktop\ArcGIS Pro 3.2\EnvModeling\04_24_24_UrbanGrowthModeling\Downloaded_Data\Downtown_Chicago\Downtown_Chicago.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.63707 ymin: 41.85304 xmax: -87.61388 ymax: 41.88842
## Geodetic CRS:  NAD83

Next, we plot the Chicago Loop area seen in red and show the development change from 2011-2021 in the three county area.

ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=Dev_change),size=1.5) +
  geom_sf(data=ThreeCounties_Downtown, colour = "black", fill = "red") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Proximity to Downtown",
       subtitle = "As fishnet centroids") +
  mapTheme

Next, we plot the proximity to downtown variable (downtown_dist) using fishnet centroids and map it.

library(scales)

#Determine distance to City Center centroid
near_downtown <- st_nearest_feature(centroid,ThreeCounties_Downtown)

#Calculate distance from each grid square centroid to the city center centroid
ThreeCounty_fishnet$downtown_dist <- as.double(st_distance(centroid, ThreeCounties_Downtown[near_downtown,], by_element=TRUE))

Downtown_fishnet <- ThreeCounty_fishnet 
#%>%   select(fishnetID, geometry, downtown_dist)

ggplot()+
geom_sf(data=Downtown_fishnet,aes(fill=downtown_dist),color='transparent')+
  scale_fill_viridis_c(name='Proximity to Downtown (feet)', labels = comma)+
  geom_sf(data=ThreeCounties_Downtown,color="black", fill = "red")+
  labs(title = "Distance to Downtown Chicago",
       subtitle = "Using fishnet centroids") +
  theme_void()

2.5.d Distance to Lake Michigan

Proximity to the waterfront is a key determinant of development particularly in a major city like Chicago and its surrounding commuter suburbs. Features are engineered by measuring distance from each grid cell to the Lake Michigan shoreline.

Below, new development is mapped with the Lake Michigan shoreline feature overlay, which is shown in red.

ThreeCounties_LakeMichigan <-
 st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Lake_Michigan_Shoreline/lake_shore.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
## Reading layer `lake_shore' from data source 
##   `C:\Users\3lpaw\Desktop\ArcGIS Pro 3.2\EnvModeling\04_24_24_UrbanGrowthModeling\Downloaded_Data\Lake_Michigan_Shoreline\lake_shore.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -87.77002 ymin: 41.68835 xmax: -87.52472 ymax: 42.15241
## Geodetic CRS:  NAD83
ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=Dev_change),size=1.5) +
  geom_sf(data=ThreeCounties_LakeMichigan, colour = "red") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Proximity to Lake Michigan",
       subtitle = "As fishnet centroids") +
  mapTheme

Next, we calculate the distance to the lake (lake_dist) and map this variable.

library(scales)

#Determine distance to City Center centroid
near_lake <- st_nearest_feature(centroid,ThreeCounties_LakeMichigan)

#Calculate distance from each grid square centroid to the lake centroid
ThreeCounty_fishnet$lake_dist <- as.double(st_distance(centroid, ThreeCounties_LakeMichigan[near_lake,], by_element=TRUE))

LakeMichigan_fishnet <- ThreeCounty_fishnet 
#%>% select(fishnetID, geometry, lake_dist)

ggplot()+
geom_sf(data=LakeMichigan_fishnet,aes(fill=lake_dist),color='transparent')+
  scale_fill_viridis_c(name='Proximity to Lake Michigan (feet)', labels = comma)+
  geom_sf(data=ThreeCounties_LakeMichigan,color="black", fill = "red")+
  labs(title = "Distance to Lake Michigan",
       subtitle = "Using fishnet centroids") +
  theme_void()

2.6. The Spatial Lag of Development

At the center of our model is a hypothesis that development demand must in part, be a function of the pattern of existing development. Development occurs where the market believes a higher and better use may bring an investment return. In the case of sprawling region like Houston, assuming the requisite demand, there is a clear return on investment for converting farmland to suburban housing.

The traditional ‘bid-rent’ economic model of development posits that development demand is a function of accessibility. This model works well in cities where centralized locations offer the most accessibility. However, it assumes that all consumers of land share the same preferences for central city access. While urban land is valuable, contemporary urbanism in regions like Houston show us that suburban locations can be quite desirable as well.

Why is that? Hinterland locations do not offer direct access to jobs and cultural amenities. Instead, residents trade-off accessibility for larger lots and bigger homes; as well as a bundle of public services like school quality. Developers are attracted to suburban and exurban locations because of cheap land on ‘greenfield’ sites like farms and open space.

The demand for greenfield development can vary substantially depending on the existing spatial configuration of development. If accessibility to central locations was the only underlying consideration, developers would sprawl directly out to the periphery, much like the dynamic we modeled in the Urban Growth Boundary chapter (Chapter 2). As a space/time process, this would look much like spilled milk, emanating from a central point outward across the kitchen table.

Another option for developers is to move beyond the periphery onto greenfield sites that are cheaper because they are even less accessible. In this case, the space/time process looks small ‘patches’ of new development dotting the landscape and “leapfrogging” from one greenfield to the next. The economic incentive is to always develop beyond the periphery, where land is cheapest. There are some real costs to this model however. For one, when development is so diffuse, it is more burdensome to efficiently deploy infrastructure like roads, sewers and electicity. Second, leapfrog development fragmentats natural areas reducing biodiversity and stressing the natural habitat of species that need continuous open space to thrive.

In Chicago, as in many sprawling regions of the U.S. the economic incentives that underlie sprawl likely encourage both the accessibility and leapfrog models of development. For our purposes however, features must be created to associate these patterns with development. Without them, the model may lack the appropriate spatial experience on which to forecast growth.

To keep it simple, we develop features associated with accessibility-based patterns. In reality, the analyst should develop a series of applicable features and test which best associate with the outcome of interest. The problem becomes infinitely more difficult when one realizes that sprawl patterns may differ throughout the study area - if for instance, land use restrictions varied by county. Below we estimate models using logistic regression, but higher level machine learning algorithms, most notably, Random Forest, are more adept at dealing with non-linearities across space.

Accessibility is measured by way of a spatial lag hypothesizing that new development is a function of distance to existing development. The shorter the distance, the more accessible a grid cell is to existing development. This is measured by calculating the average distance from each grid cell to its 2 nearest developed neighboring grid cells in 2011 using the nn_function. The function below calculates average nearest neighbor distance between k point layers. The first parameter specifies coordinates that we want to measureFrom, in this case, fishnet centroids. The second, indicates the point layer we wish to measureTo, in this case, the fishnet centroids that were developed in 2011.

nn_function <- function(measureFrom,measureTo,k) {
  #convert the sf layers to matrices
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
    output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

Why k=2? As k fluctuates, so does the hypothesized scale of accessibility. One can test the effect of different k parameters on model goodness of fit, but as mentioned, a more sophisticated model would hypothesize that this scale can vary significantly from city to suburb to rural town.

Next, the function appending the lag distance to fishnet. There are 3 inputs. The fishnet which is converted to a coordinate data frame with the xyC function. 2011 developed areas are created using filter. The map below illustrates relative accessibility from every grid cell to nearby development.

fishnet$lagDevelopment <-
    nn_function(xyC(fishnet),
                xyC(filter(aggregatedRasters,developed==1)),
                2)

ggplot() +
  geom_sf(data = ThreeCountyArea, fill = "transparent") +
  geom_point(data = fishnet, 
             aes(x = xyC(fishnet)[, 1], y = xyC(fishnet)[, 2], color = lagDevelopment), 
             size = 1.5) +
  scale_color_gradientn(colors = palette5, 
                        limits = range(fishnet$lagDevelopment),
                        name = "Lag Development",
                        labels = scales::comma) +
  labs(title = "Spatial Lag to 2011 Development",
       subtitle = "As fishnet centroids") +
  mapTheme

# ggplot() +
#   geom_sf(data=ThreeCountyArea, fill = "transparent") +
#   geom_point(data=fishnet, 
#              aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2], 
#                  colour=factor(ntile(lagDevelopment,5))), size=1.5) +
#   scale_colour_manual(values = palette5,
#                      labels=substr(quintileBreaks(fishnet,"lagDevelopment"),1,7),
#                      name="Quintile\nBreaks") +
#   labs(title = "Spatial Lag to 2011 Development",
#        subtitle = "As fishnet centroids") +
#   mapTheme

2.7. Study Area Counties

The tigris package allows Illinois county geometries to be downloaded. A spatial subset returns only the three counties we are looking at. Note that the subset includes a negative 1000ft st_buffer. This is done because the spatial extent of ThreeCountyArea intersects county boundaries that are actually outside of our study area. Buffering ThreeCountyArea slightly limits the intersection range to only those counties in the study area.

Once studyAreaCounties is created, it is st_joined with dat such that each grid cells knows which county it’s in.

We already have a shapefile and raster of the ThreeCountyArea, so we skipped this code.

# We already have a shapefile of the 3 county study area.
# options(tigris_class = "sf")
# 
# studyAreaCounties <- 
#   counties("Illinois") %>%
#   st_transform(st_crs(ThreeCountyArea)) %>%
#   dplyr::select(NAME) %>%
#   .[st_buffer(ThreeCountyArea,-1000), , op=st_intersects]

We mapped the three counties for reference.

library(ggplot2)

# Calculate centroids for the polygons
county_centroids <- st_centroid(IL_counties_subset)

# Extract centroid coordinates
county_centroid_coords <- st_coordinates(county_centroids)

# Create a data frame with centroid coordinates
county_centroid_df <- data.frame(X = county_centroid_coords[, "X"], Y = county_centroid_coords[, "Y"])

# Add COUNTY_NAM column to centroid_df
county_centroid_df$COUNTY_NAM <- county_centroids$COUNTY_NAM

# Plot with text labels
ggplot() +
  geom_sf(data = IL_counties_subset) +
  geom_text(data = county_centroid_df, aes(label = COUNTY_NAM, x = X, y = Y), size = 3, color = "black") +
  labs(title = "Study Area Counties") +
  mapTheme

2.8. Create the Final Dataset

The last step is to bring together all the disparate feature layers into a final dataset that can be used for analysis. The various fishnet layers are cbind together, needed features are extracted and the final fishnet, dat is then joined with ThreeCountyArea to assign each grid cell to a county. developed21 is created to designate those areas that have already been developed through 2021. Finally, any grid cell that has a water land cover designation is removed.

dat <- 
  cbind(
    fishnet, fishnetPopulation, Highway_fishnet, RegRail_fishnet, Downtown_fishnet, LakeMichigan_fishnet,  aggregatedRasters) %>%
  dplyr::select(Dev_change, developed, forest, farm, wetlands, otherUndeveloped, water,
                Pop_2011, Pop_2021, pop_Change, highway_dist, regrail_dist, downtown_dist, lake_dist, lagDevelopment) %>%
  st_join(ThreeCountyArea) %>%
  mutate(developed21 = ifelse(Dev_change == 1 & developed == 1, 0, developed)) %>%
  filter(water == 0) 

3. Exploratory Analysis

In this section we explore the extent to which each features is associated with development change. If the goal was to predict a continuous variable, scatterplots and correlation coefficients make this process straightforward and relatively easy to explain to a non-technical decison maker.

In this case however, the dependent variable is a binary outcome - either a grid cell was developed between 2011 and 2021 or it wasn’t. In this case, the relevant question is whether for a given feature, there is a statistically significant difference between areas that changed and areas that did not. These differences are explored in a set of plots below. For models with lots of features, these plots could be compliment by a series of difference in means statistical tests.

The below code block selects the distance variables and spatial lag features, converts each to long form and plots each as bar plots. Note that geom_bar calculates the mean.

dat %>%
  dplyr::select(pop_Change, highway_dist, regrail_dist, downtown_dist, lake_dist, lagDevelopment, Dev_change) %>%
  gather(Variable, Value, -Dev_change, -geometry) %>%
  ggplot(., aes(Dev_change, Value, fill=Dev_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New Development as a Function of the Continuous Variables") +
    plotTheme

There are significant differences between no change and new development (from 2011-2021) for the lake distance and downtown Chicago distance variables. The bar plots for the other variables (Development lag, distance to highways, population change, and distance to regional rail) are above.

Next, the same visualization is created for the population related variables. These plots inform which features should be included in the model.

dat %>%
  dplyr::select(Pop_2011, Pop_2021, pop_Change, Dev_change) %>%
  gather(Variable, Value, -Dev_change, -geometry) %>%
  ggplot(., aes(Dev_change, Value, fill=Dev_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New Development as a Function of Factor Variables") +
    plotTheme

The population change variable is a bit of a tough one, because the number, size, and shape of the Census tracts changed from the 2010 to 2020 decennial Census. If a Census tract had an increase in population, then it split into two or more Census tracts and was given a different tract ID number and boundary. This makes it a bit hard to compare geographies across the decade, but we used the spatial lag feature to address that issue.

Next, a table of land cover conversion between 2011 and 2021 is created. The table suggests for instance, that 9% of farmland regionally was converted to development between 2011 and 2021. This indicator should be interpreted in the context of the scale changes we imposed on the data by moving from a 30ft by 30ft raster to a 4000ft by 4000ft fishnet. This is the same reason why the table suggests developed area was then “developed”.

dat %>%
  dplyr::select(Dev_change:otherUndeveloped,developed) %>%
  gather(Land_Cover_Type, Value, -Dev_change, -geometry) %>%
   st_set_geometry(NULL) %>%
     group_by(Dev_change, Land_Cover_Type) %>%
     summarize(n = sum(as.numeric(Value))) %>%
     ungroup() %>%
    mutate(Conversion_Rate = paste0(round(100 * n/sum(n), 2), "%")) %>%
    filter(Dev_change == 1) %>%
  dplyr::select(Land_Cover_Type,Conversion_Rate) %>%
  kable() %>% kable_styling(full_width = F)
Land_Cover_Type Conversion_Rate
developed 13.35%
farm 9.1%
forest 7.1%
otherUndeveloped 6.23%
wetlands 5.09%

4. Predicting for 2031

In this section, six separate logistic regression models are estimated to predict development change between 2011 and 2021 - with each subsequent model more sophisticated then the last. To do so, the data is split into 50% training/test sets. Models are estimated on the training set.

Normally, as in previous chapters, a results table row would be generated for each model describing the accuracy and generalizability of predictions for each specification. For brevity, a less sophisticated approach is taken here, judging each by the McFadden or “Psuedo” R Squared statistic on the test set. The model with the greatest goodness of fit is then used for the purposes of prediction.

4.2. Modeling

First, dat is split into training and test sets. Note how imbalanced the panel is with table(datTrain$developed).

set.seed(3456)
trainIndex <- 
  createDataPartition(dat$developed, p = .50,
                                  list = FALSE,
                                  times = 1)
datTrain <- dat[ trainIndex,]
datTest  <- dat[-trainIndex,]

nrow(dat)
## [1] 2808
table(datTrain$developed)
## 
##    0    1 
##  151 1254

Next six separate glm models are estimated adding new variables for each. Figure 4.1 shows the Psuedo R-Squared associated with each model.

Model1 includes only the 2011 land cover types. Model2 adds the lagDevelopment. Models 3, 4 and 5 attempt three different approaches for modeling population change. Model3 uses population in 2011; Model4 uses 2011 and 2021 population; and Model5 uses population change. All are significant so which population feature should be chosen? The answer lies in how the model will be used to forecast. By modeling population change between 2011 and 2021, the model is well specified to forecast 2031 development by having pop_Change indicate change between 2021 and 2031. Model6 includes distance to the highways and all other variables, and is the final model employed for prediction.

Model1 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped, 
              family="binomial"(link="logit"), data = datTrain)

Model2 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment, 
              family="binomial"(link="logit"), data = datTrain)
              
Model3 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + Pop_2011, 
              family="binomial"(link="logit"), data = datTrain)          
              
Model4 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + Pop_2011 + 
              Pop_2021, 
              family="binomial"(link="logit"), data = datTrain)              
            
Model5 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change, 
              family="binomial"(link="logit"), data = datTrain)              
              
Model6 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + 
              highway_dist + downtown_dist + lake_dist + regrail_dist, 
              family="binomial"(link="logit"), data = datTrain) 

Working carefully through the below code block, a very concise approach for creating a data frame of psudeo R Squares for each model and plotting them for comparison. Recall, pR2 is the function for psuedo R squared. Dissect the line that uses the map_dfc function to see how this approach loops through the models retrieving the goodness of fit for each.

modelList <- paste0("Model", 1:6)
map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("Model",1:6)) %>%
  gather(Model,McFadden) %>%
  ggplot(aes(Model,McFadden)) +
    geom_bar(stat="identity") +
    labs(title= "McFadden R-Squared by Model") +
    plotTheme
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2

Next, a data frame is created that includes columns for the observed development change, Dev_change, and one that includes predicted probabilities for Model6. This data frame is then used as an input to a density plot visualizing the distribution of predicted probabilities by observed class. Only a small number of predicted probabilities are greater than or equal to 50% (nrow(filter(testSetProbs, probs >= .50)) / nrow(datTest)). This makes good sense, given how rare of an event development is in our dataset. Ultimately, in order to judge our model with a confusion matrix, a smaller development classification threshold must be employed.

testSetProbs <- 
  data.frame(class = datTest$Dev_change,
             probs = predict(Model6, datTest, type="response")) 
  
ggplot(testSetProbs, aes(probs)) +
  geom_density(aes(fill=class), alpha=0.5) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development")) +
  labs(title = "Histogram of test set predicted probabilities",
       x="Predicted Probabilities",y="Density") +
  plotTheme

4.3. Accuracy

Now to pick a predicted probability threshold to classify an area as having new development. Recall, Sensitivity or the True Positive rate is the proportion of actual positives (1’s) that were predicted to be positive. For example, the Sensitivity in our model is the rate of developed areas actually predicted as such. Specificity or True Negative Rate is the proportion of actual negatives (0’s) that were predicted to be negatives. For example, the Specificity in our model is the rate of No Change areas that were correctly predicted as No change.

It is important to consider what Planners would typically optimize for given this use case. One approach is to maximize the number of 1’s predicted correctly (Sensitivity) so as to not under or over-predict new development. It may okay in this use case to incorrectly predict no change as changed (Specificity). An abundance of False Negative errors may be reasonable if Planners don’t mind over emphasizing development potential. It is important to remember that below this potential will evaluated alongside supply-side indicators such as the presence of sensitive land.

There are some clear tradeoffs between Sensitivity and Specificity in our model that deserve some exploration. To illustrate, two different thresholds of 5% and 17% are explored. Predicted classes for both thresholds are generated and instead of using the confusionMatrix function from caret as we have in the past, here confusion matrix metrics are derived from the yardstick package. This allows us to group_by the threshold and summarize the metrics of interest.

The options call below is required to tell yardstick that the positive factor class in testSetProbs is 1. Without it, yardstick will by default, see the first factor level as 0 and flip the confusion metrics around.

options(yardstick.event_first = FALSE)

testSetProbs <- 
  testSetProbs %>% 
  mutate(predClass_05 = as.factor(ifelse(testSetProbs$probs >= 0.05 ,1,0)),
         predClass_17 = as.factor(ifelse(testSetProbs$probs >= 0.17 ,1,0))) 

testSetProbs %>%
  dplyr::select(-probs) %>%
  gather(Variable, Value, -class) %>%
  group_by(Variable) %>%
  summarize(Sensitivity = round(yardstick::sens_vec(class,factor(Value)),2),
            Specificity = round(yardstick::spec_vec(class,factor(Value)),2),
            Accuracy = round(yardstick::accuracy_vec(class,factor(Value)),2)) %>% 
  kable() %>%
  kable_styling(full_width = F)
Variable Sensitivity Specificity Accuracy
predClass_05 0.03 0.99 0.35
predClass_17 0.21 0.91 0.44

The 5% threshold correctly predicts a lower number of new development areas (Sensitivity), but incorrectly predicts a higher number of no change areas (Specificity). As there are far more no change areas in the data, this is reflected in a lower overall accuracy. Conversely, the 17% threshold has a lower higher rate and a slightly lower Specificity rate. Again, because the dataset is majority no change areas, this leads to a higher Accuracy rate.

Given the use case, and the spatial distribution of land cover change, it may be more useful to have a model that predicts generally where new development occurs rather than one that predicts precisely where. As illustrated below, the 17% threshold provides this outcome. These trade-offs can be visualized in the plot below. Here the model is used to predict for the entire dat dataset. Which threshold looks more reasonable given the distribution of observed development change?

Note that these indicators are converted as.factor so they can be mapped with scale_color_manual.

predsForMap <-         
  dat %>%
    mutate(probs = predict(Model6, dat, type="response") ,
           Threshold_5_Pct = as.factor(ifelse(probs >= 0.05 ,1,0)),
           Threshold_17_Pct =  as.factor(ifelse(probs >= 0.17 ,1,0))) %>%
    dplyr::select(Dev_change,Threshold_5_Pct,Threshold_17_Pct) %>%
    gather(Variable,Value, -geometry) %>%
    st_cast("POLYGON")
ggplot() +
  geom_point(data=predsForMap, aes(x=xyC(predsForMap)[,1], y=xyC(predsForMap)[,2], colour=Value)) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("No Change","New Development"),
                      name="") +
  labs(title="Development Predictions - Low Threshold") + 
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme

To provide a bit more insight, the code block below produces both true positives (Sensitivity) and true negatives (Specificity) for each grid cell by threshold type. Notice how the spatial pattern of Sensitivity for both thresholds is relatively consistent, but the 5% threshold misses most the study area with respect to Specificity.

ConfusionMatrix.metrics <-
  dat %>%
    mutate(probs = predict(Model6, dat, type="response") ,
           Threshold_5_Pct = as.factor(ifelse(probs >= 0.05 ,1,0)),
           Threshold_17_Pct =  as.factor(ifelse(probs >= 0.17 ,1,0))) %>%
    mutate(TrueP_05 = ifelse(Dev_change  == 1 & Threshold_5_Pct == 1, 1,0),
           TrueN_05 = ifelse(Dev_change  == 0 & Threshold_5_Pct == 0, 1,0),
           TrueP_17 = ifelse(Dev_change  == 1 & Threshold_17_Pct == 1, 1,0),
           TrueN_17 = ifelse(Dev_change  == 0 & Threshold_17_Pct == 0, 1,0)) %>%
    dplyr::select(., starts_with("True")) %>%
    gather(Variable, Value, -geometry) %>%
    st_cast("POLYGON")
ggplot(data=ConfusionMatrix.metrics) +
  geom_point(aes(x=xyC(ConfusionMatrix.metrics)[,1], 
                 y=xyC(ConfusionMatrix.metrics)[,2], colour = as.factor(Value))) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("Correct","Incorrect"),
                       name="") +
  labs(title="Development Predictions - Low Threshold") + 
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme

4.4 Generalizability

For this use case, it matters little whether the model generalizes well across random holdouts. Thus, regular cross-validation is substituted for spatial cross-validation. The latter is explicitly concerned with generalizability across space. The approach helps us understand whether our model is comparable to each county in the study area despite any possible differences in land use or land use planning.

To test across-space generalizability, spatialCV function is run, which iteratively loops through dat having each county take a turn as the hold out test set. This is also called ‘Leave-one-group-out cross validation.’. A model is estimated for the n - 1 counties that remain and used to predict for the hold out county.

spatialCV <- function(dataFrame, uniqueID, dependentVariable, modelName) {

#initialize a data frame 
endList <- list()

#create a list that is all the spatial group unqiue ids in the data frame (ie counties)    
  uniqueID_List <- unique(dataFrame[[uniqueID]])  
  x <- 1
  y <- length(uniqueID_List)
  
#create a counter and while it is less than the number of counties...  
  while(x <= y) 
  {
#call a current county    
    currentUniqueID <- uniqueID_List[x]
#create a training set comprised of units not in that county and a test set of units
#that are that county
    training <- dataFrame[ which(dataFrame[[uniqueID]] != uniqueID_List[x]),]
    testing <- dataFrame[ which(dataFrame[[uniqueID]] == uniqueID_List[x]),]
#create seperate xy vectors
    trainingX <- training[ , -which(names(training) %in% c(dependentVariable))]
    testingX <- testing[ , -which(names(testing) %in% c(dependentVariable))]
    
    trainY <- training[[dependentVariable]]
    testY <- testing[[dependentVariable]]
#Calculate predictions on the test county as part of a data frame including the observed
#outcome and the unique county ID    
   thisPrediction <- 
     data.frame(class = testY,
                probs = predict(modelName, testingX, type="response"),
                county = currentUniqueID) 

#Row bind the predictions to a data farme
   endList <- rbind(endList, thisPrediction)
#iterate counter    
    x <- x + 1 
  } 
#return the final list of counties and associated predictions  
  return (as.data.frame(endList))
}

Now the function is run; a 17% predicted probability threshhold is set and a facetted ROC plot for each county is created.

spatialCV_counties <-
  spatialCV(dat,"COUNTY_NAM","Dev_change", Model6) %>%
  mutate(predClass = as.factor(ifelse(probs >= 0.17 ,1,0)))

To investigate the across-3county-area generalizability of the model, the code block below produces and maps confusion matrix statistics by county. It is important to ensure as above, that the yardstick.event_first option is set.

Some interesting patterns emerge. First, results for those counties with little Observed_Change are not meaningful. In this case, all three counties have some “Observed_Change”. In places with substantial new development, Sensitivity rates are comparable with the results from the test set results on the entire study area. In DuPage County, Sensitivity is 0.00, implying that the model didn’t correctly identify any of the observed changes in DUPAGE county. This suggests the model may struggle with predicting positive cases in this county, but this is likely because there is little development change occurring in general. Specificity is high across the board, because there are a lot of existing developed areas and the model predicts an abundance of developed areas, where there was no change during the specified time period. Again, this is less of a concern because these estimates will be offset by sensitive land cover in Section 7 below. Will County has the highest Sensitivity level and the highest Accuracy rate of the three counties. Overall, these confusion matrix metrics help us to understand how well the model is performing in terms of the true positive and true negative rates, and they suggest the model is generalizable to those counties that underwent significant development change like Will County.

spatialCV_metrics <-
  spatialCV_counties %>% 
    group_by(county) %>% 
    summarize(Observed_Change = sum(as.numeric(as.character(class))),
              Sensitivity = round(yardstick::sens_vec(class,predClass),2),
              Specificity = round(yardstick::spec_vec(class,predClass),2),
              Accuracy = round(yardstick::accuracy_vec(class,predClass),2)) 

spatialCV_metrics %>%
  kable() %>%
  kable_styling(full_width = F)
county Observed_Change Sensitivity Specificity Accuracy
COOK 346 0.19 0.90 0.39
DUPAGE 124 0.00 1.00 0.30
WILL 456 0.29 0.93 0.54

5. Predicting Land Cover Demand for 2031

At this point, a simple but useful model has been trained to predict urban development between 2011 and 2021 as a function of baseline features from 2011 including land cover, built environment and population. Next, we are going to updated our features to reflect a 2021 baseline. Having done so, predictions from our new model would then be fore 2031.

Generalizability is always the concern when forecasting, and for this use case Planners must ask themselves whether the 2011-2021 3 IL County experience generalizes to the 2021-2031 3 IL County experience. In other words, have the macroeconomic real estate conditions changed dramatically between the two time periods? This is question with no definitive answer, but it useful to consider the exogenous factors that may differentiate today’s Chicago area from that of 2011. The big for instance is climate change. If the real estate market capitalized flood risk into devastated areas going forward, this would effectively change the nature of real estate demand in the region. Thus the pre-flood, 2011 experienced is no longer entirely relevant.

This would not completely invalidate the model if these changes are marginal, as development demand predictions can be adjusted in Section 7 below. However, consider the usefulness of this approach for a lakefront city that loses say 10% of its developable land to sea level rise in the following decade.

For brevity, we only update two features in our model. First, population change (pop_change) is updated using county level population projections visualized in the plot below. The second is lagDevelopment, which describes how predicted new development relates in space to old development.

Once the features are updated, 2031 predictions are estimated and mapped.

Below, lagDevelopment is mutate describing average distance to 2021 development. Note that the field name, lagDevelopment is unchanged (ie. not updated to lagDevelopment_2021). This is done purposefully as model6 has a regression coefficient called lagDevelopment. If this variable wasn’t present in our updated data frame then the predict command would fail.

dat <-
  dat %>%
  mutate(lagDevelopment = nn_function(xyC(.), xyC(filter(.,developed21 == 1)),2))

Now to update population change. A new data frame, Pop_2031 is created which includes 2021 population counts and 2031 projections for each county in the study area. Population is plotted by year and by county. Chicago’s Cook County is projected to see the greatest population gains by far. Anecdotally, we know that much of Cook County is already developed, which suggests its development scenario will involve more ‘infill’ development then sprawl.

ThreeCounties_Pop_2021 <- get_acs(geography = "county",
                            variables = "B01003_001",
                            year = 2021,
                            state = "IL",
                            county = counties,
                            geometry = TRUE) %>%
  mutate(County = gsub(" County, Illinois", "", NAME)) %>%
  rename(CountyPop_2021 = estimate) %>%
  st_transform(st_crs(ThreeCounty_fishnet))
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=======                                                               |  11%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |================                                                      |  24%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |============================                                          |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |============================                                          |  41%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |==========================================================            |  84%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |====================================================================  |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================|  99%
  |                                                                            
  |======================================================================| 100%
# Calculate the sum of CountyPop_2021 for all counties
all_counties_pop_2021 <- ThreeCounties_Pop_2021 %>%
  summarize(AllCounties_Pop2021 = sum(CountyPop_2021))

# Add the sum to ThreeCounties_Pop_2021 as a new column
ThreeCounties_Pop_2021 <- ThreeCounties_Pop_2021 %>%
  mutate(AllCounties_Pop2021 = all_counties_pop_2021$AllCounties_Pop2021)

# Need to include the population projections for 2031 for the three counties. 
#these are population in 2020 from IL Department of Health linked here: https://dph.illinois.gov/content/dam/soi/en/web/idph/files/publications/population-projections-report-2010-2030.pdf
# Code for joining and plotting 2021 population and 2031 population projection: 
# Perform the left join
Pop_2031 <- 
  data.frame(
    COUNTY_NAM = c("Cook", "DuPage", "Will"),
    pop_2021 = c(5265398, 934094, 696403),
    pop_2031 = c(4689134, 946910, 902476)
  )%>%
  mutate(Counties = toupper(COUNTY_NAM))

# Left join with ThreeCounties_Pop_2021
Pop_2031 <- left_join(
  Pop_2031,
  ThreeCounties_Pop_2021 %>%
    dplyr::select(County, CountyPop_2021, AllCounties_Pop2021) %>%
    st_set_geometry(NULL),
  by = c("COUNTY_NAM" = "County")
)
# Plotting 2021 and 2031 populations side by side
Pop_2031 %>%
  pivot_longer(cols = starts_with("pop_"), names_to = "Year", values_to = "Population") %>%
  ggplot(aes(x = reorder(COUNTY_NAM, -Population), y = Population, fill = Year)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), color = "black") +
  scale_fill_manual(values = palette2, labels = c("2021", "2031"), name = "Year") +
  labs(title = "Population Change by County: 2021 - 2031",
       x = "County", y = "Population") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme

Interestingly enough, Cook County is projected to have a decline in population from 2021 to 2031. This is largely due to outmigration in recent years with thousands of residents leaving the county to live elsewhere. The rise of remote work schedules and the costs of housing have driven people to move to new areas outside of the county, often in other states. For Will County, the population is projected to increase for 2031, likely in the areas closer to the Lake Michigan shoreline and transit routes. Population is projected to largely remain the same for DuPage County when comparing 2021 to 2031.

5.2. Predicting Development Demand

Next, the Pop_2031 table is joined to dat and pop_change in order to ‘distribute’ the new population across the study area. To do so, the the allocation of new population is weighted by a grid cell’s existing population (pop_2031.infill). 2010 population is subtracted from this figure to get pop_Change. Finally, Model6 is used to predict for 2020 given the updated population change and lag development features.

The map of predicted probabilities that results is best thought of as a measure of predicted development demand in 2031.

dat_infill <-
  dat %>%
  #calculate population change
    left_join(Pop_2031, by = c("COUNTY_NAM" = "Counties")) %>%
    mutate(proportion_of_county_pop = pop_2021 / AllCounties_Pop2021,
           pop_2031.infill = proportion_of_county_pop * pop_2031,
           pop_Change = round(pop_2031.infill - pop_2021),2) %>%
    dplyr::select(-pop_2031, -AllCounties_Pop2021, 
                  -proportion_of_county_pop, -pop_2031.infill) %>%
  #predict for 2031
    mutate(predict_2031.infill = predict(Model6,. , type="response"))

dat_infill %>%
  ggplot() +  
  geom_point(aes(x=xyC(dat_infill)[,1], y=xyC(dat_infill)[,2], colour = factor(ntile(predict_2031.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels=substr(quintileBreaks(dat_infill,"predict_2031.infill"),1,4),
                    name="Quintile\nBreaks") +
  geom_sf(data=ThreeCountyArea, fill=NA, colour="black", size=1) +
  labs(title= "Development Demand in 2031: Predicted Probabilities") +
  mapTheme

There are higher probabilities for development demand in Will County, in the south portion of the study area.

6. Comparing Predicted Development Demand & Environmental Sensitivity

We now have a really strong indicator of development demand for 2031 to help guide local land use planning. Demand however, is only one side of the equation. It must balanced with the supply of environmentally sensitive land. Understanding the interplay between demand and supply is the first stage of the ‘Allocation’ phase, where Planners ultimately decide which land should be developed and which should not.

For this analysis farmland and undeveloped land are be deemed Suitable, while environmentally sensitive areas like wetlands and forest are be deemed Not Suitable. Below, 2021 land cover data is read in and several measures of environmental sensitivity are created by county. These include:

  1. The total amount of wetlands and forest land cover area in 2021.
  2. The amount of sensitive land (wetland and forest) lost between 2011 and 2021.
  3. The total area of large sensitive landscape ‘patches’ in 2021.

The third metric warrants some further discussion. In the context of leapfrog development, Section 2.6 discusses the concept of landscape fragmentation - the idea that discontinuous development across space carves out disjointed slivers of wilderness. This fragmentation reduces biodiversity particularly for species that need room to roam. Below, environmentally sensitive_regions are created to represent large areas of unfragmented natural resources. We then consider the total area of these clumps for each county.

6.2. 2021 Land Cover Data

To begin, the 2021 Land Cover data is read in and reclassified.

# We already did this. It's called ThreeCountyLC_2021

developed21 <- ThreeCountyLC_2021 == 21 | ThreeCountyLC_2021 == 22 | ThreeCountyLC_2021 == 23 | ThreeCountyLC_2021 == 24
forest21 <- ThreeCountyLC_2021 == 41 | ThreeCountyLC_2021 == 42 | ThreeCountyLC_2021 == 43 
farm21 <- ThreeCountyLC_2021 == 81 | ThreeCountyLC_2021 == 82 
wetlands21 <- ThreeCountyLC_2021 == 90 | ThreeCountyLC_2021 == 95 
otherUndeveloped21 <- ThreeCountyLC_2021 == 52 | ThreeCountyLC_2021 == 71 | ThreeCountyLC_2021 == 31 
water21 <- ThreeCountyLC_2021 == 11

names(developed21) <- "dev21"
names(forest21) <- "forest21"
names(farm21) <- "farm21"
names(wetlands21) <- "wetlands21"
names(otherUndeveloped21) <- "otherUndeveloped21"
names(water21) <- "water21"

This next step takes too long to plot because the rasters are large. But the code is there just in case.

# ggplot() +
#   geom_sf(data=ThreeCountyArea) +
#   geom_raster(data = rbind(rast(ThreeCountyLC_2011) %>% mutate(label = "2011"),
#                            rast(ThreeCountyLC_2021) %>% mutate(label = "2021")) %>% 
#               na.omit %>% filter(value > 0), 
#               aes(x,y,fill=as.factor(value))) +
#   facet_wrap(~label) +
#   scale_fill_viridis(discrete=TRUE, name ="") +
#   labs(title = "Land Cover, 2011 & 2021") +
#   mapTheme #+ theme(legend.position = "none")

Next, each raster is aggregated to the fishnet using the aggregateRaster function and 2021 land cover types are mapped.

theRasterList21 <- c(developed21,forest21,farm21,wetlands21,otherUndeveloped21,water21)

dat2 <-
  aggregateRaster(theRasterList21, dat) %>%
  dplyr::select(dev21,forest21,farm21,wetlands21,otherUndeveloped21,water21) %>%
  st_set_geometry(NULL) %>%
  bind_cols(.,dat) %>%
  st_sf() %>%
  st_cast("POLYGON")

dat2 %>%
  gather(var,value,dev21:water21) %>%
  st_centroid() %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  ggplot() +
    geom_sf(data=ThreeCountyArea) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land Cover Types, 2021",
         subtitle = "As fishnet centroids") +
   mapTheme

Note that there are wetlands and forest sprinkled throughout the three county area, and there is a lot of farmland concentrated in the southwest portion in Will County.

6.3. Sensitive Land Cover Lost

Below an indicator sensitive_lost is created indicating grid cells that were either forest or wetlands in 2011 but were no longer so in 2021. The output layer, sensitive_land_lost, gives a sense for how development in the recent past has effected the natural environment.

dat2 <-
  dat2 %>%
   mutate(sensitive_lost21 = ifelse(forest == 1 & forest21 == 0 |
                                    wetlands == 1 & wetlands21 == 0,1,0))
                      
ggplot() +
  geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitive_lost21))) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","Sensitive Lost"),
                      name = "") +
  labs(title = "Sensitive lands lost: 2011 - 2021",
       subtitle = "As fishnet centroids") +
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme

This type of analysis is helpful to see where sensitive lands are being lost throughout the three county area due to development. Thankfully, there are only a small number of these lost areas from 2011-2021. If there were large areas lost, it would be helpful for planners to implement iniatives to protect those sensitive areas from development encroachment and allow them to continue providing environmental benefits.

6.4 Landscape Fragmentation

In this section, the wetlands21 and forest21 rasters are converted to contiguous sensitive_regions using the raster::clump function. This is equivalent to Region Group in ArcGIS. The raster clumps are then converted to vector sf layers; dissolved into unique regions; Acres are calculated; and the layers are converted back to raster to be extracted back to the fishnet with aggregateRaster. It is worth going through this code block line by line. Note that only sensitive_regions with areas greater than 1 acre are included.

We ended up not including this section in our analysis.

#  emptyRaster <- Development_change
#  emptyRaster[] <- NA
# 
# sensitiveRegions <- 
#   raster::clump(wetlands21 + forest21) %>%
#   rasterToPolygons() %>%
#   st_as_sf() %>%
#   group_by(clumps) %>% 
#   summarize() %>%
#     mutate(Acres = as.numeric(st_area(.) * 0.0000229568)) %>%
#     filter(Acres > 3954)  %>%
#   dplyr::select() %>%
#   raster::rasterize(.,emptyRaster) 
# sensitiveRegions[sensitiveRegions > 0] <- 1  
# names(sensitiveRegions) <- "sensitiveRegions"
# 
# dat2 <-
#   aggregateRaster(c(sensitiveRegions), dat2) %>%
#   dplyr::select(sensitiveRegions) %>%
#   st_set_geometry(NULL) %>%
#   bind_cols(.,dat2) %>%
#   st_sf()
# 
# ggplot() +
#   geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitiveRegions))) +
#   scale_colour_manual(values = palette2,
#                       labels=c("Other","Sensitive Regions"),
#                       name="") +
#   labs(title = "Sensitive regions",
#        subtitle = "Continous areas of either wetlands or forests\ngreater than 1 acre") +
#   mapTheme

6.5. Summarize by County

The below dplyr statement takes as its input, dat2, which was created in Sections 6.2 - 6.4 and wrangles together a table of county-level, supply and demand metrics which can be used to analyze suitability by county.

county_specific_metrics <- 
  dat2 %>%
  #predict development demand from our model
  mutate(Development_Demand = predict(Model6, dat2, type="response")) %>%
  #get a count count of grid cells by county which we can use to calculate rates below
  left_join(st_set_geometry(dat, NULL) %>% group_by(COUNTY_NAM) %>% summarize(count = n())) %>%
  #calculate summary statistics by county
  group_by(COUNTY_NAM) %>%
  summarize(Total_Farmland = sum(farm21) / max(count),
            Total_Forest = sum(forest21) / max(count),
            Total_Wetlands = sum(wetlands21) / max(count),
            Total_Undeveloped = sum(otherUndeveloped21) / max(count),
            Sensitive_Land_Lost = sum(sensitive_lost21) / max(count),
            #Sensitive_Regions = sum(sensitiveRegions) / max(count),
            Mean_Development_Demand = mean(Development_Demand)) %>%
  #get population data by county
  left_join(Pop_2031, by = c("COUNTY_NAM" = "Counties")) %>% 
            mutate(Population_Change = pop_2031 - pop_2021,
                   Population_Change_Rate = Population_Change / pop_2031) %>%
            dplyr::select(COUNTY_NAM,Total_Farmland, Total_Forest, Total_Wetlands, Total_Undeveloped, Sensitive_Land_Lost, Mean_Development_Demand, Population_Change_Rate)

Now a small multiple plot can be created providing both supply and demand side analytics by county. The plot gives a sense for development demand (Demand-Side), suitable land for development (Suitable) and sensitive land (Not Suitable).

In Will County, south of Chicago, the data suggests both population and development demand will increase. At the same time, there is a high rate of developable farmland. Will County has a large amount of forest and wetlands areas, but otherwise it is well suitable to new development.

Conversely, DuPage County, the county east of Chicago and Cook County, has a significant amount of forest and wetland areas like Will County, but it has a slightly higher amount of sensitive lands lost. Comparatively, it has a lot less undeveloped land and farmland that is suitable for new development. There are very low rates of mean development demand and population growth.

Cook County has the lowest amounts of sensitive lands but also low amounts of undeveloped areas and farmland as well. This is because Chicago and most of the county has already been developed. And the population is expected to continue to decline due to outmigration.

In these counties, there are some real trade-offs to be made between suitable/sensitive land and development pressure.

county_specific_metrics %>%
  gather(Variable, Value, -COUNTY_NAM, -geometry) %>%
  mutate(Variable = factor(Variable, levels=c("Population_Change_Rate","Mean_Development_Demand",
                                              "Total_Farmland","Total_Undeveloped","Total_Forest",
                                              "Total_Wetlands","Sensitive_Land_Lost",
                                              ordered = TRUE))) %>%
  mutate(Planning_Designation = case_when(
    Variable == "Population_Change_Rate" | Variable == "Mean_Development_Demand" ~ "Demand-Side",
    Variable == "Total_Farmland" | Variable == "Total_Undeveloped"               ~ "Suitable",
    TRUE                                                                         ~ "Not Suitable")) %>%
  ggplot(aes(x=Variable, y=Value, fill=Planning_Designation)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    facet_wrap(~COUNTY_NAM, ncol=5) +
    coord_flip() +
    scale_y_continuous(breaks = seq(-.25, 1, by = .25)) +
    geom_vline(xintercept = 2.5) + geom_vline(xintercept = 4.5) +
    scale_fill_manual(values=c("black","red","darkgreen")) +
    labs(title= "County Specific Allocation Metrics", subtitle= "As rates", x="Indicator", y="Rate") +
    plotTheme + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="bottom")

7. Allocation

Allocation is the final stage of the urban growth modeling process. Now that both demand and supply is understood, Planners can allocate development rights accordingly. Of course, this could take many forms of regulation including zoning, subdivision approval or outright conservation. In this section, demand and supply are visualized for Will County.

First, development demand is predicted for Will County. Then a layer, WillCounty_landUse is created, that includes indicators for both previously developed land and environmentally unsuitable land. This layer then is overlayed atop development demand and projected population change to give the full supply and demand-side picture in Will County.

There are some clear opportunities for development in Will County. Significant infill opportunities exist along the southern boundary where population change is projected to be greatest. There is also a good deal of environmentally suitable land in the center of the county and closer to the Lake Michigan shoreline to the east. This would be ideal space for large housing or commercial developments. Will County has a lot of forests and wetlands in the southwest corner of the county, so there are sensitive land areas not suitable for development there.

WillCounty <- dat2 %>%
  mutate(Development_Demand = predict(Model6, dat2, type = "response")) %>%
  filter(COUNTY_NAM == "WILL") 

WillCounty_landUse <- rbind(
  filter(WillCounty, forest21 == 1 | wetlands21 == 1) %>%
    dplyr::select() %>%
    mutate(Land_Use = "Not Suitable"),
  filter(WillCounty, dev21 == 1) %>%
    dplyr::select() %>%
    mutate(Land_Use = "Developed"))

# Calculate quantiles for Development Demand and Population Change
dev_demand_quantiles <- quantile(WillCounty$Development_Demand, probs = seq(0, 1, by = 0.25)) # Changed to 4 quantiles
pop_change_quantiles <- quantile(WillCounty$pop_Change, probs = seq(0, 1, by = 0.25)) # Calculate quantiles for pop_Change
 
grid.arrange(
  ggplot() +
    geom_sf(data = WillCounty, aes(fill = cut(Development_Demand, breaks = dev_demand_quantiles)), colour = NA) +
    geom_point(data = WillCounty_landUse, aes(x = xyC(WillCounty_landUse)[, 1], 
                                              y = xyC(WillCounty_landUse)[, 2], 
                                              colour = Land_Use),
               shape = 15, size = 2) +
    geom_sf(data = st_intersection(ThreeCounties_Highways, filter(ThreeCountyArea, COUNTY_NAM == "WILL")), size = 2, colour = "gray") +
    scale_fill_manual(values = palette5, name = "Development Demand",
                      labels = as.character(round(dev_demand_quantiles, digits = 2))) +
    scale_colour_manual(values = c("black", "red")) + 
    labs(title = "Development Potential, 2031: Will County") + mapTheme +
    guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)),

  ggplot() +
    geom_sf(data = WillCounty, aes(fill = cut(pop_Change, breaks = pop_change_quantiles)), colour = NA) +
    geom_point(data = WillCounty_landUse, aes(x = xyC(WillCounty_landUse)[, 1], 
                                              y = xyC(WillCounty_landUse)[, 2], 
                                              colour = Land_Use),
               shape = 15, size = 2) +
    geom_sf(data = st_intersection(ThreeCounties_Highways, filter(ThreeCountyArea, COUNTY_NAM == "WILL")), size = 2, colour = "gray") +
    scale_fill_manual(values = palette6, name = "Population Change",
                      labels = as.character(round(pop_change_quantiles, digits = 2))) +
    scale_colour_manual(values = c("black", "red")) + 
    labs(title = "Projected Population, 2031: Will County") + mapTheme +
    guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)), ncol = 2)

The plots above are created using a ggplot trick to show what appears to be overlayed polygons (fishnet grid cells). ggplot does not natively allow multiple aesthetics (aes) of the same style. In practice, this means it is not possible to have two scale_fill_manual parameters and thus, two legends for the same map. This limitation is cleverly avoided by plotting WillCounty_landUse as colored points (as opposed to filled grid cells). In the geom_point parameter above, the points are set to shape = 15, which is a filled box. This box can then be sized to make it appear like a fishnet grid cell.

We stop short in actually allocating land to development. While the model is well suited for understanding sprawl-style development, it is not useful for understanding how new demand might be absorbed by upzoning and densification of existing development. It would not be wise to allocate the entire projected population to undeveloped land. Instead, we’d prefer a more nuanced understanding of how local land use laws might play a role. At this stage in the analysis however, the Planner has all she needs to engage local stakeholders about future development decisions.

7.2 Scenario 2: Estimating the Effect of New Transportation

We created a new regional rail line to simulate the extension of the Metra rail system in the Chicago region. The rail extension expands the system southwest into Will County. You can see the existing system and the extension plotted.

MetraLineswExtension <-
  st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Metra_Lines/MetraLineswExtension.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
library(gridExtra)

# Adjust the widths parameter to control the size of each plot
grid.arrange(
  ggplot() +
    geom_sf(data=ThreeCountyArea, fill = "transparent") +
    geom_sf(data=ThreeCounties_RegRail, color = "red") +
    labs(title = "Existing Regional Rail Lines") +
    mapTheme,

  ggplot() +
    geom_sf(data=ThreeCountyArea, fill = "transparent") +
    geom_sf(data=MetraLineswExtension, color = "red") +
    labs(title = "Regional Rail Lines with Extension") +
    mapTheme,
  
  # Adjust widths as needed to control the size of each plot
  nrow = 1, 
  widths = c(2, 2)
)

Now we recreate the regional rail fishnet, using the shapefile that has the new rail extension and we calculate the distance to nearest rail line again.

#Determine nearest regional rail line to each centroid
nearest_metra <- st_nearest_feature(centroid,MetraLineswExtension)

#Calcuate distance from each grid square centroid to nearest regional rail line
ThreeCounty_fishnet$metra_dist <- as.double(st_distance(centroid, MetraLineswExtension[nearest_metra,], by_element=TRUE))

Metra_fishnet <- ThreeCounty_fishnet #%>%
#  select(fishnetID, geometry, regrail_dist)

#Make a quick sample map of the results
ggplot()+
  geom_sf(data=Metra_fishnet,aes(fill=metra_dist),color='transparent')+
  scale_fill_viridis_c(name='Distance to Regional Rail Lines (feet)')+
  geom_sf(data=MetraLineswExtension,color='red')+
  labs(title = "Distance to Regional Rail Lines with Extension",
       subtitle = "Using fishnet centroids") +
  theme_void()

Next, we recreate the final dataset to include the new variable (regional rail with extension) also called metra_dist here. The other fishnets are combined here to include all of the previous variables.

dat3 <-
  cbind(
    fishnet, fishnetPopulation, Highway_fishnet, RegRail_fishnet, Metra_fishnet, Downtown_fishnet, LakeMichigan_fishnet, aggregatedRasters) %>%
  dplyr::select(Dev_change, developed, forest, farm, wetlands, otherUndeveloped, water,
                Pop_2011, Pop_2021, pop_Change, highway_dist, regrail_dist, metra_dist, downtown_dist, lake_dist, lagDevelopment) %>%
  st_join(ThreeCountyArea) %>%
  mutate(developed21 = ifelse(Dev_change == 1 & developed == 1, 0, developed)) %>%
  filter(water == 0)

Exploratory analysis:

In this section we explore the extent to which each features is associated with development change. If the goal was to predict a continuous variable, scatterplots and correlation coefficients make this process straightforward and relatively easy to explain to a non-technical decison maker.

In this case however, the dependent variable is a binary outcome - either a grid cell was developed between 2011 and 2021 or it wasn’t. In this case, the relevant question is whether for a given feature, there is a statistically significant difference between areas that changed and areas that did not. These differences are explored in a set of plots below. For models with lots of features, these plots could be compliment by a series of difference in means statistical tests.

The below code block selects the distance and spatial lag features, converts each to long form and plots each as bar plots. Note that geom_bar calculates the mean.

dat3 %>%
  dplyr::select(pop_Change, highway_dist, regrail_dist, metra_dist, downtown_dist, lake_dist, lagDevelopment, Dev_change) %>%
  gather(Variable, Value, -Dev_change, -geometry) %>%
  ggplot(., aes(Dev_change, Value, fill=Dev_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New Development as a Function of the Continuous Variables") +
    plotTheme

The plot for metra_dist (distance to nearest regional rail line, including the new extension) does not look that different from the plot for regrail_dist (includes only existing regional rail lines). However, there is a slight difference between no change and new development for the metra_dist variable.

7.2.a Modeling

First, dat3 is split into training and test sets.

set.seed(3456)
trainIndex3 <- 
  createDataPartition(dat3$developed, p = .50,
                                  list = FALSE,
                                  times = 1)
dat3Train <- dat3[ trainIndex3,]
dat3Test  <- dat3[-trainIndex3,]

nrow(dat3)
## [1] 2808

Model7 includes distance to the highways and all other variables including the new rail extension, and is the final model employed for prediction. I removed the regrail_dist because it is the existing rail lines without the new extension.

Model1 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped, 
              family="binomial"(link="logit"), data = datTrain)

Model2 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment, 
              family="binomial"(link="logit"), data = datTrain)
              
Model3 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + Pop_2011, 
              family="binomial"(link="logit"), data = datTrain)          
              
Model4 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + Pop_2011 + 
              Pop_2021, 
              family="binomial"(link="logit"), data = datTrain)              
            
Model5 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change, 
              family="binomial"(link="logit"), data = datTrain)              
              
Model6 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + 
              highway_dist + downtown_dist + lake_dist + regrail_dist, 
              family="binomial"(link="logit"), data = datTrain) 

Model7 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + 
              highway_dist + downtown_dist + lake_dist + metra_dist, 
              family="binomial"(link="logit"), data = dat3Train) 

Comparing models:

modelList <- paste0("Model", 1:7)
map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("Model",1:7)) %>%
  gather(Model,McFadden) %>%
  ggplot(aes(Model,McFadden)) +
    geom_bar(stat="identity") +
    labs(title= "McFadden R-Squared by Model") +
    plotTheme
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2

Next, a data frame is created that includes columns for the observed development change, Dev_change, and one that includes predicted probabilities for Model7. This data frame is then used as an input to a density plot visualizing the distribution of predicted probabilities by observed class. Only a small number of predicted probabilities are greater than or equal to 50% (nrow(filter(testSetProbs, probs >= .50)) / nrow(datTest)). This makes good sense, given how rare of an event development is in our dataset. Ultimately, in order to judge our model with a confusion matrix, a smaller development classification threshold must be employed.

testSetProbs2 <- 
  data.frame(class = dat3Test$Dev_change,
             probs = predict(Model7, dat3Test, type="response")) 
  
ggplot(testSetProbs2, aes(probs)) +
  geom_density(aes(fill=class), alpha=0.5) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development")) +
  labs(title = "Histogram of test set predicted probabilities",
       x="Predicted Probabilities",y="Density") +
  plotTheme

Calculating Accuracy for this model:

options(yardstick.event_first = FALSE)

testSetProbs2 <- 
  testSetProbs2 %>% 
  mutate(predClass_05 = as.factor(ifelse(testSetProbs2$probs >= 0.05 ,1,0)),
         predClass_17 = as.factor(ifelse(testSetProbs2$probs >= 0.17 ,1,0))) 

testSetProbs2 %>%
  dplyr::select(-probs) %>%
  gather(Variable, Value, -class) %>%
  group_by(Variable) %>%
  summarize(Sensitivity = round(yardstick::sens_vec(class,factor(Value)),2),
            Specificity = round(yardstick::spec_vec(class,factor(Value)),2),
            Accuracy = round(yardstick::accuracy_vec(class,factor(Value)),2)) %>% 
  kable() %>%
  kable_styling(full_width = F)
Variable Sensitivity Specificity Accuracy
predClass_05 0.03 0.99 0.35
predClass_17 0.24 0.91 0.46

The results are a bit similar to model 6, but the accuracy increased by 0.01 for each variable.

predsForMap2 <-         
  dat3 %>%
    mutate(probs = predict(Model7, dat3, type="response") ,
           Threshold_5_Pct = as.factor(ifelse(probs >= 0.05 ,1,0)),
           Threshold_17_Pct =  as.factor(ifelse(probs >= 0.17 ,1,0))) %>%
    dplyr::select(Dev_change,Threshold_5_Pct,Threshold_17_Pct) %>%
    gather(Variable,Value, -geometry) %>%
    st_cast("POLYGON")
ggplot() +
  geom_point(data=predsForMap2, aes(x=xyC(predsForMap2)[,1], y=xyC(predsForMap2)[,2], colour=Value)) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("No Change","New Development"),
                      name="") +
  labs(title="Development Predictions - Low Threshold") + 
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme

Predicting Development Demand

Next, the Pop_2031 table is joined to dat3 and pop_change in order to ‘distribute’ the new population across the study area. To do so, the the allocation of new population is weighted by a grid cell’s existing population (pop_2031.infill). 2010 population is subtracted from this figure to get pop_Change. Finally, Model7 is used to predict for 2031 given the updated population change and lag development features.

The map of predicted probabilities that results is best thought of as a measure of predicted development demand in 2031.

dat3_infill <-
  dat3 %>%
  #calculate population change
    left_join(Pop_2031, by = c("COUNTY_NAM" = "Counties")) %>%
    mutate(proportion_of_county_pop = pop_2021 / AllCounties_Pop2021,
           pop_2031.infill = proportion_of_county_pop * pop_2031,
           pop_Change = round(pop_2031.infill - pop_2021),2) %>%
    dplyr::select(-pop_2031, -AllCounties_Pop2021, 
                  -proportion_of_county_pop, -pop_2031.infill) %>%
  #predict for 2031
    mutate(predict_2031.infill = predict(Model7,. , type="response"))

# Calculate quintile breaks
quintile_breaks <- quintileBreaks(dat3_infill, "predict_2031.infill")

# Sort the breaks in ascending order
sorted_breaks <- sort(quintile_breaks)

dat3_infill %>%
  ggplot() +  
  geom_point(aes(x=xyC(dat3_infill)[,1], y=xyC(dat3_infill)[,2], colour = factor(ntile(predict_2031.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels = substr(sorted_breaks, 1, 4),
                    name="Quintile\nBreaks") +
  geom_sf(data=ThreeCountyArea, fill=NA, colour="black", size=1) +
  labs(title= "Development Demand in 2031: Predicted Probabilities") +
  mapTheme

Summarize by County

The below dplyr statement takes as its input, dat3, which wrangles together a table of county-level, supply and demand metrics which can be used to analyze suitability by county.

Next, each raster is aggregated to the fishnet using the aggregateRaster function and 2021 land cover types are mapped.

theRasterList21 <- c(developed21,forest21,farm21,wetlands21,otherUndeveloped21,water21)

dat4 <-
  aggregateRaster(theRasterList21, dat3) %>%
  dplyr::select(dev21,forest21,farm21,wetlands21,otherUndeveloped21,water21) %>%
  st_set_geometry(NULL) %>%
  bind_cols(.,dat) %>%
  st_sf() %>%
  st_cast("POLYGON")

dat4 <- dat4 %>%
  mutate(metra_dist = dat3$metra_dist)

dat4 %>%
  gather(var,value,dev21:water21) %>%
  st_centroid() %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  ggplot() +
    geom_sf(data=ThreeCountyArea) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land Cover Types, 2021",
         subtitle = "As fishnet centroids") +
   mapTheme

Below an indicator sensitive_lost is created indicating grid cells that were either forest or wetlands in 2011 but were no longer so in 2021. The output layer, sensitive_land_lost, gives a sense for how development in the recent past has effected the natural environment.

(We are repeating these steps from before, but with the updated final dataset (dat4).

dat4 <-
  dat4 %>%
   mutate(sensitive_lost21 = ifelse(forest == 1 & forest21 == 0 |
                                    wetlands == 1 & wetlands21 == 0,1,0))
                      
ggplot() +
  geom_point(data=dat4, aes(x=xyC(dat4)[,1], y=xyC(dat4)[,2], colour=as.factor(sensitive_lost21))) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","Sensitive Lost"),
                      name = "") +
  labs(title = "Sensitive lands lost: 2011 - 2021",
       subtitle = "As fishnet centroids") +
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme

Next, we plot county-specific metrics.

county_specific_metrics_2 <- 
  dat4 %>%
  #predict development demand from our model
  mutate(Development_Demand = predict(Model7, dat4, type="response")) %>%
  #get a count count of grid cells by county which we can use to calculate rates below
  left_join(st_set_geometry(dat, NULL) %>% group_by(COUNTY_NAM) %>% summarize(count = n())) %>%
  #calculate summary statistics by county
  group_by(COUNTY_NAM) %>%
  summarize(Total_Farmland = sum(farm21) / max(count),
            Total_Forest = sum(forest21) / max(count),
            Total_Wetlands = sum(wetlands21) / max(count),
            Total_Undeveloped = sum(otherUndeveloped21) / max(count),
            Sensitive_Land_Lost = sum(sensitive_lost21) / max(count),
            #Sensitive_Regions = sum(sensitiveRegions) / max(count),
            Mean_Development_Demand = mean(Development_Demand)) %>%
  #get population data by county
  left_join(Pop_2031, by = c("COUNTY_NAM" = "Counties")) %>% 
            mutate(Population_Change = pop_2031 - pop_2021,
                   Population_Change_Rate = Population_Change / pop_2031) %>%
            dplyr::select(COUNTY_NAM,Total_Farmland, Total_Forest, Total_Wetlands, Total_Undeveloped, Sensitive_Land_Lost, Mean_Development_Demand, Population_Change_Rate)

Now a small multiple plot can be created providing both supply and demand side analytics by county. The plot gives a sense for development demand (Demand-Side), suitable land for development (Suitable) and sensitive land (Not Suitable).

This plot is similar to the results from the previous model, but now we can see that development demand has increased for Will County. Mean_Development_Demand increased slightly for Will County because we used Model7 and the Metra line extension variable, compared to Model6 which used only existing Metra rail lines. Adding the regional rail line extension into Will County would likely contribute to an increase in development demand for areas near the transit route.

county_specific_metrics_2 %>%
  gather(Variable, Value, -COUNTY_NAM, -geometry) %>%
  mutate(Variable = factor(Variable, levels=c("Population_Change_Rate","Mean_Development_Demand",
                                              "Total_Farmland","Total_Undeveloped","Total_Forest",
                                              "Total_Wetlands","Sensitive_Land_Lost",
                                              ordered = TRUE))) %>%
  mutate(Planning_Designation = case_when(
    Variable == "Population_Change_Rate" | Variable == "Mean_Development_Demand" ~ "Demand-Side",
    Variable == "Total_Farmland" | Variable == "Total_Undeveloped"               ~ "Suitable",
    TRUE                                                                         ~ "Not Suitable")) %>%
  ggplot(aes(x=Variable, y=Value, fill=Planning_Designation)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    facet_wrap(~COUNTY_NAM, ncol=5) +
    coord_flip() +
    scale_y_continuous(breaks = seq(-.25, 1, by = .25)) +
    geom_vline(xintercept = 2.5) + geom_vline(xintercept = 4.5) +
    scale_fill_manual(values=c("black","red","darkgreen")) +
    labs(title= "County Specific Allocation Metrics", subtitle= "As rates", x="Indicator", y="Rate") +
    plotTheme + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="bottom")

8. Appendix

This is new material describing how you can determine land cover change from land cover rasters for two time periods

8.1. Calculating Land Cover Change

For your assignment, you are going to need to get land cover data for two new time periods and figure out what areas developed in that interval (and then model it). In the workflow above, we used a the NLCD land cover change data set, but we could also have calculated our own version using the 2001 and 2011 data sets. You could do this in R, or ArcGIS for your assignment. Here is an abbreviated workflow for doing it in R using the data from this exercise:

Reclassify 2001 and 2011 land cover databases to consist of 1 and 0 observations (e.g. 1 is the developed classes 13-24, 0 is everything else).

# reclassMatrix <- 
#   matrix(c(
#     0,12,0,
#     12,24,1,
#     24,Inf,0),
#   ncol=3, byrow=T)
# developed_2001 <- 
#   reclassify(lc_2001,reclassMatrix)
# 
# developed_2011 <- 
#   reclassify(lc_2011,reclassMatrix)

Then do some map algebra to find the places where land cover changed. Let’s see a quick histogram of the values - these should range from 0 (undeveloped in 2001, undeveloped in 2011), 1 (undeveloped in 2001, developed in 2011 (presuming nothing went from developed to undeveloped)), and 2 (developed in both periods). The 1’s represent the change.

# 
# development_change <- developed_2001+developed_2011
# 
# hist(development_change)

We can subsequently turn any of the 0’s and 1’s to NA

# development_change[development_change != 1] <- NA
# 
# ggplot() +
#   geom_sf(data=houstonMSA) +
#   geom_raster(data=rast(development_change) %>% na.omit, 
#               aes(x,y,fill=as.factor(value))) +
#   scale_fill_viridis(discrete=TRUE, name ="Land Cover\nChange") + 
#   labs(title="Development land use change") +
#   mapTheme

8.2. Downsampling Rasters

Notice that we used 4000x4000 unit grid cells in this analysis to keep small grid cell sizes from crushing our laptops while we did this plotting and geo-processing. This is very simple to do in R - the below code takes our development_change raster and downsamples it by a factor of two using the aggregate function. You could load an original data set in at the beginning of your analysis and downsample it before you get started.

# 
# development_change
# 
# aggregate(development_change, fact = 2)

8.3. Cropping Rasters

Say you have rasters that you would like to manipulate in R instead of in ArcGIS. If you have data covering the study area, you can use this as the extent to which you would like to crop the data, and then use mask to clip the data to the exact boundaries.

First, read in the resampled raster of land cover around Atlanta, Georgia from 2001 and then plot it.

# lc_atl_2001 <- raster("https://raw.githubusercontent.com/mafichman/CPLN_675/main/Week_14_15/data/atl_lc01_resamp_new.tif")
# 
# plot(lc_atl_2001)

We will also read in Atlanta counties for the extent of the bounding box, and then crop the raster to the counties.

# atl_counties <- st_read("https://raw.githubusercontent.com/mafichman/CPLN_675/main/Week_14_15/data/Counties_Atlanta_Region.geojson") %>% st_transform("ESRI:102667") # ESRI 1983 state plane GA west
# 
# lc_atl_2001_crop <- crop(lc_atl_2001, extent(atl_counties))
# 
# plot(lc_atl_2001_crop)

Then, mask the raster using the Atlanta are counties. This is similar to the “clipping” process for vector data.

# lc_atl_2001_mask <- mask(lc_atl_2001_crop, atl_counties)
# plot(lc_atl_2001_mask)

8.4 Updated Census Data Calls

In section 2.4, census data was pulled using the get_decennial; however, if you are using a different timeframe, you you will use data from the American Community Survey (ACS). You will replace get_decennial with get_acs in your project’s workflow. The code chunk below shows how to use get_acs to obtain population data from the counties in the Houston MSA in 2019. Note that get_acs returns both an estimate (denoted with an “E”) and a margin of error (denoted with an “M”). We use the select command in dplyr to only retain estimate version of the variable.

# # Specify which variable(s) you would like to grab. Here, only one (Total Population) is listed, but you could add more to the call.
# acs_vars <- c("B02001_001E")
# 
# # Using "tract" as the geography and 2019 as the year, download data data for the Houston MSA counties listed.
# houstonPop19 <- get_acs(geography = "tract", 
#                         variables = acs_vars, 
#                         year = 2019,
#                         state = 48, 
#                         geometry = TRUE, 
#                         output = "wide",
#                         county=c("Harris COunty","San Jacinto","Montgomery","Liberty","Waller",
#                          "Austin","Chambers","Fort Bend","Brazoria","Galveston")) %>%
#                 rename(pop2019 = B02001_001E) %>%
#                 dplyr::select(-starts_with("B"))
# 
# # Make sure to transform to the crs of the fishnet!
# houstonPop19 <- houstonPop19 %>%
#   st_transform(st_crs(houstonMSA_fishnet))
---
title: "CPLN 675 - Urban Growth Modeling for the Chicago Region, 2024"
author: "Lauren Pawlowski and Sophie Maes"
date: "5/10/2024"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

<style>
  .superbigimage{
      overflow-x:scroll;
      white-space: nowrap;
  }

  .superbigimage img{
     max-width: none;
  }


</style>

_This markdown is an updated version of a module written by Ken Steif in 2019. It was an unpublished chapter from his book Public Policy Analytics, and it is therefore very detailed. It has been updated to keep it aligned with current packages. For the most part, the original text is still used here, out of respect for the original material and the author. However, alterations, annotations and supplemental code are throughout._

_For this project, we built a growth model based on a binary logistic regression that predicts the probability of land cover change at a raster cell level over a specified interval as a function of a) Existing land cover, b) The location of infrastructure and c) Demographic and economic spatial variables._

_While we have used logistic regression to model spatial phenomena like flooding before, this workflow is different in that we wrangle raster data in R rather than doing it in ArcGIS._

_The appendices to this document add some additional code and alternative code blocks to what is in the chapter itself. You can use these to augment the workflow to run entirely in R and load, downsample and wrangle rasters._

_Thanks very much to Jenna Epstein for helping to update and edit this material in 2022. A note from Professor Michael Fichman._


# 1. Introduction

Regional urban development is an emergent outcome resulting from the autonomous decisions of many different agents, including developers, real estate buyers and tenants as well as planners and regulators. Each of these groups optimize for a different set of bottom lines. Developers seek profit, buying land at a low price, improving on that land, and selling it at a premium. Households and firms consume housing and commercial real estate balancing price constraints with access to amenities and customers, respectively.

Planners regulate development by trading-off economic growth with the mitigation of negative externalities toward the goal of economic and environmental sustainability. This is a case study in managing these trade-offs. The focus of this chapter is on Cook, DuPage, and Will Counties in Illinois in the Chicago region. Chicago is located in Cook County, IL. As a sprawling Metropolitan area, climate change and equity concerns are forcing cities like Chicago to reconsider the role of land use planning.

Land use planning that is both economically productive and sustainable requires both supply and demand-side insights. Specifically, the Planner must understand future demand for development and how that demand contrasts with the supply of environmentally sensitive land. The goal of this chapter is to model this interplay.

The next section wrangles a host of datasets including, Land Cover and Land Cover Change from the USGS, Census demographics, and transportation. Spatial lag features are also engineered from the land cover change data hypothesizing that the time/space scale of development between 2011 and 2021 can help predict new development in 2031.

Exploratory analysis is undertaken to investigate the relationship between development and the aforementioned features. Findings then motivate the estimation of a geospatial predictive model trained on new development between 2011 and 2021. Those dynamics are then harnessed to predict Development_Demand for 2031 - predictions interpreted as the probability of new development here. The goal is not to create the most accurate model, but to demonstrate the role for geospatial machine learning in the land use modeling process.

Once predictions are validated for their accuracy, we evaluate their confluence with environmentally sensitive land in counties throughout the region, with particular emphasis on landscape fragmentation. The final section uses predictions to ‘allocate’ new development across to places where growth can have an economic impact without impeding sustainability goals. It also tests a scenario for attracting new development with an extension of regional rail (new transportation infrastructure).

# 1.2. Setup

Below we load the libraries needed for the analysis as well as a `mapTheme` and `plotTheme`. A set of palette colors are also specified.

```{r load_packages, message=FALSE, warning=FALSE, results = "hide"}
library(tidyverse)
library(sf)
library(terra)
library(raster)
library(knitr)
library(kableExtra)
library(tidycensus)
library(tigris)
library(FNN)
#library(QuantPsyc) # JE Note: in R 4.1, QuantPsyc package not available.
library(caret)
library(yardstick)
library(pscl)
library(plotROC) 
library(ggrepel)
library(pROC)
library(grid)
library(gridExtra)
library(viridis)
library(igraph)
library(dplyr)
library(scales)
library(ggplot2)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.75),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette2 <- c("#41b6c4","#253494")
palette4 <- c("#a1dab4","#41b6c4","#2c7fb8","#253494")
palette5 <- c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494")
palette6 <- c("#253494","#2c7fb8","#41b6c4","#a1dab4","#ffffcc")
palette10 <- c("#f7fcf0","#e0f3db","#ccebc5","#a8ddb5","#7bccc4",
               "#4eb3d3","#2b8cbe","#0868ac","#084081","#f7fcf0")
```

We also include several helper functions. `quintilesBreaks` takes a dataframe and a column and outputs the quintiles breaks, helping shorten the below `ggplot` calls.

It takes longer to `ggplot` a polygon fishnet with `geom_sf` than it does to plot `geom_point`. To cut down on plotting time, the `xyC` (for ‘XY Coordinates’) takes a fishnet `sf` and converts it to a dataframe of grid cell centroid coordinates.

`rast` is a function allowing us to quickly plot raster values in `ggplot`.

```{r 3, warning = FALSE, message = FALSE}
#this function converts a column in to quintiles. It is used for mapping.
quintileBreaks <- function(df,variable) {
    as.character(quantile(df[[variable]],
                          c(.01,.2,.4,.6,.8),na.rm=T))
}

#This function can be used to convert a polygon sf to centroids xy coords.
xyC <- function(aPolygonSF) {
  as.data.frame(
    cbind(x=st_coordinates(st_centroid(aPolygonSF))[,1],
          y=st_coordinates(st_centroid(aPolygonSF))[,2]))
} 

#this function convert a raster to a data frame so it can be plotted in ggplot
rast <- function(inRaster) {
  data.frame(
    xyFromCell(inRaster, 1:ncell(inRaster)), 
    value = getValues(inRaster)) }
```

# 2. Data Wrangling & Feature Engineering

In this section a considerable amount of vector and raster data is wrangled together into a regression-ready dataset. The following datasets are used:

2.2 - 2.3: Land cover data [downloaded](https://www.mrlc.gov/data/nlcd-land-cover-change-index-conus) from the Multi-Resolution Land Characteristics Consortium’s National Land Cover Database (NLCD) includes annual land cover and land cover change raster data for the entire country. These data are sampled to a 4,000 by 4,000 ft^2 fishnet, which will be used for this analysis. 2.4: Population data is downloaded from the U.S. Census and joined to the fishnet by distributing Census Tract population totals proportionally to each grid cell. 2.5: Highway vectors are downloaded from the U.S. Census TIGER Line shapefiles for 2019 and used to wrangle highway proximity features. Regional rail vectors are downloaded from the City of Chicago. It includes all Metra commuter rail lines in the Chicagoland region. For the Chicago downtown area, we created a polygon for the Chicago Loop and imported the shapefile. 2.6: The land cover change data is used to engineer spatial lag features. 2.7: County polygons are downloaded using the `tigris` package. 2.8: Each feature is wrangled into a final dataset.

It is important to remember that land cover is not ‘land use’. Typically, the former refers to phenomena on the Earth’s surface including both the built environment and natural resources, while the latter typically refers only to variation in the built environment.

Other raster features are created such as distance to highways, for instance. These rasters are then integrated with a vector fishnet. Additional feature engineering is performed on the vector-side providing a simple, but comprehensive snapshot of the development process in and around Chicago and Cook, DuPage, and Will Counties between 2011 and 2021.

## 2.2. Land Cover Change Data

The dependent variable we wish to forecast is land cover change between 2011 and 2021. In this section, the land cover raster data is loaded, reclassified and integrated with a vector fishnet. As before, the fishnet will allow us to parametrize spatial relationships in a regression context.

The table below shows descriptions of each categorical land cover type in the land cover data. Below, we will reclassify these data into more useful categories.

 _JE Note: Land cover categories can be found [here](https://www.mrlc.gov/data/legends/national-land-cover-database-class-legend-and-description) - will replace with markdown table at a later point._ 
 
Several raster layers have been provided for this analysis: 

- We read in `ThreeCountyArea` - this is the extent of the study area 

- `ThreeCountyLC_2011` is a raster of land cover in 2011 for the three counties: Cook, DuPage, and Will Counties. `ThreeCountyLC_2021` is a raster of land cover in 2021 for the same geography. We have to calculate land cover change - where there were conversions between one land cover and another on the time frame 2011-2021. We plot the raster using `ggplot` and the `rast` function specified above.

Note that these rasters are projected as `NAD 1983 NAD83 Illinois East, crs = 3435`. The original land cover raster is at a 30 meter by 30 meter resolution. The rasters provided are ultimately resampled up to 4000 feet by 4000 feet. The Cook, DuPage, and Will County areas are 1,635 sq. miles; 336 sq. miles; and 849 sq. miles, respectively. Together, the total area for all three counties is about 2820 sq. miles, so this metro area is quite large. A vector fishnet dataset of this size would be too computationally intensive. By resampling we gain some computational efficiency but lose some accuracy. Nevertheless, this approach works well for educational purposes.


```{r load_data, warning = FALSE, message = FALSE, results = "hide"}
# Load required libraries
library(raster)

# Read the shapefile
IL_counties <- st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/IL_BNDY_County_Py.shp")
# Change the coordinate system to NAD83 Illinois East
IL_counties <- st_transform(IL_counties, crs = 3435)
# Subset the data - Select only the 3 counties that we need: Cook, Du Page, and Will
IL_counties_subset <- IL_counties[IL_counties$COUNTY_NAM %in% c("WILL", "COOK", "DUPAGE"), c("COUNTY_NAM", "CO_FIPS", "geometry")]

# Write the subsetted data back to a shapefile
# st_write(IL_counties_subset, "C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/ThreeCountyArea.shp")
ThreeCountyArea <- st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/ThreeCountyArea.shp")

# Specify the coordinate system for the 3 IL counties to NAD83 Illinois East
ThreeCountyArea <- st_transform(ThreeCountyArea, crs = 3435)

# Set the desired raster resolution
#resolution <- 100  # Choose a suitable resolution (in meters or degrees, depending on your data)
# Define the extent of the raster
#extent <- extent(ThreeCountyArea)
# Create an empty raster layer with the specified resolution and extent
#empty_raster1 <- raster(extent, res = resolution)
# Rasterize the shapefile onto the empty raster layer
#rasterized <- rasterize(ThreeCountyArea, empty_raster1, field="CO_FIPS")
# Save the raster to a file
# writeRaster(rasterized, "C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/ThreeCountyArea.tif", format = "GTiff")

ThreeCountyArea_Boundary <- raster("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_Counties/ThreeCountyArea.tif")

# Read the two land cover rasters, one for 2011 and one for 2021
ThreeCountyLC_2011 <- raster("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/LC_2011_2021/3CountyLC_2011.tif")

ThreeCountyLC_2021 <- raster("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/LC_2011_2021/3CountyLC_2021.tif")

# Reproject the rasters to the same coordinate system as the other data layers: NAD83 Illinois East, crs = 3435
ThreeCountyLC_2011 <- projectRaster(ThreeCountyLC_2011, crs = 3435)
ThreeCountyLC_2021 <- projectRaster(ThreeCountyLC_2021, crs = 3435)
```

We now calculate land cover change from 2011 to 2021. 
Reclassify 2011 and 2021 land cover databases to consist of 1 and 0 observations (e.g. 1 is the developed classes 13-24, 0 is everything else).

```{r reclass_Matrix, warning= FALSE, message= FALSE}
reclassMatrix <- 
  matrix(c(
    -5,12,0,
    12,24,1,
    24,Inf,0),
  ncol=3, byrow=T)
```

```{r plot_developedchange, warning= FALSE, message= FALSE}
Developed_2011 <- 
  reclassify(ThreeCountyLC_2011,reclassMatrix)
# you can see the frequency of the values
freq(Developed_2011)

Developed_2021 <- 
  reclassify(ThreeCountyLC_2021,reclassMatrix)
freq(Developed_2021)
```

Then do some map algebra to find the places where land cover changed (we calculate Development_change). Let’s see a quick histogram of the values - these should range from 0 (undeveloped in 2011, undeveloped in 2021), 1 (undeveloped in 2011, developed in 2021 (presuming nothing went from developed to undeveloped)), and 2 (developed in both periods). The 1’s represent the change. You can see the number of the values (0's, 1's, and 2's) in the frequency table. 

```{r plot_dev_change_counties, warning= FALSE, message= FALSE}
Development_change <- Developed_2011+Developed_2021
freq(Development_change)
hist(Development_change, xlim = c(0,2))
```

We now plot the 3 County Area and the land cover change from 2011-2021. Again, the 1s represent where land cover changed from not developed to developed between 2011 and 2021. The 2s represent land cover that was developed both in 2011 and 2021, so there was no change. There is not a lot of development change because it is only one decade and Chicago is already pretty developed, but there are some red areas mostly in Will County to the southwest. 

```{r plot_devchange, warning= FALSE, message= FALSE}
ggplot() +
  geom_raster(data=rast(Development_change) %>% na.omit %>% filter(value > 0), 
              aes(x,y,fill=as.factor(value))) +
 # scale_fill_viridis(direction = -1, discrete=TRUE, option = "A", name ="Land Cover\nChange") +
  scale_fill_manual(values = c("2" = "gray", "1" = "red"), name = "Land Cover\nChange") +
  labs(title = "Land Cover Change for the Three County Area, 2011-2021") +
 mapTheme +
  theme(legend.direction="horizontal") +
geom_sf(data=ThreeCountyArea, fill = "transparent", colour = "black")
```

Next, we reclassify the raster such that all the developed grid cell values receive a value of 1 and all other values receive a value of 0. This is done using a reclassify matrix. The matrix reads row by row. Row 1 says any grid cell ranging from 0 to 12 takes a value of 0; 13 or greater through 24, a value of 1; and all other values take 0.

```{r 9, warning = FALSE, message = FALSE}
reclassMatrix2 <- 
  matrix(c(
    0,0,
    1,1,
    2,0),
  ncol=2, byrow=T)

reclassMatrix2
```

Now `reclassify` and convert all 0’s to `NA`. We apply a name to the raster with `names`. This is done to make it faster to join raster to the fishnet below. You can see the frequency table of values with `freq(Development_change2)`. There are 52042 areas that changed from underdeveloped to developed in the data.

```{r 10, warning = FALSE, message = FALSE}
Development_change2 <- 
  reclassify(Development_change,reclassMatrix2)

Development_change2[Development_change2 < 1] <- NA

names(Development_change2) <- "Dev_change"
freq(Development_change2)
ggplot() +
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  geom_raster(data=rast(Development_change2) %>% na.omit, 
              aes(x,y,fill=as.factor(value))) +
  scale_fill_viridis(discrete=TRUE, name ="Land Cover\nChange") + 
  labs(title="Development Land Use Change, 2011-2021") +
  mapTheme
```

Next, the fishnet is created at 4000 by 4000 foot resolution and subset it to the Three County Area with `st_intersection`. 

```{r 11, warning = FALSE, message = FALSE}
ThreeCounty_fishnet <- 
  st_make_grid(ThreeCountyArea, 4000) %>%
  st_sf()

ThreeCounty_fishnet <-
  ThreeCounty_fishnet[ThreeCountyArea,]
```

The vector fishnet is then plotted. Note that this plot takes a bit of time to render because there are thousands of polygons.

```{r plot_fishnet, warning = FALSE, message= FALSE}
ggplot() +
  geom_sf(data=ThreeCounty_fishnet) +
  labs(title="Fishnet, 4000 Foot Resolution") +
  mapTheme
```

Then the raster is converted to points, which makes its joining to the vector fishnet a bit faster. Now to extract the raster values into the fishnet. There is a function in the `raster` package called `RasterToPolygon` but it is quite slow.

Below, a slightly faster approach is develop that converts the raster to an `sf` point layer and then joins the points to the fishnet with `aggregate`. This works well because the raster and the fishnet are of the same spatial resolution. Finally, the fishnet variable `Dev_change` is created that is `1` where new development has occurred and `0` where it has not. This is our dependent variable and encoded as a factor.

To speed up the mapping process, fishnet polygons are converted to centroid points using the `xyC` function. 


```{r plot_dev_change2, warning = FALSE, message = FALSE}
changePoints <-
  rasterToPoints(Development_change2) %>%
  as.data.frame() %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(ThreeCounty_fishnet))

fishnet <- 
  aggregate(changePoints, ThreeCounty_fishnet, sum) %>%
  mutate(Dev_change = ifelse(is.na(Dev_change),0,1),
         Dev_change = as.factor(Dev_change))

ggplot() +
  geom_sf(data=ThreeCounty_fishnet) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)$x, y=xyC(fishnet)$y, colour=Dev_change)) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name = "") +
  labs(title = "Land Cover Development Change", subtitle = "As fishnet centroids") +
  mapTheme
```


## 2.3. Land Cover in 2011

It is reasonable to hypothesize that the propensity of new development is a function of existing land cover categories. In this section we identify these other land cover categories from 2011 and integrate each with the fishnet.

```{r plot_LC_2011, warning = FALSE, message = FALSE}
# Our shapefile with the boundary for the three counties is ThreeCountyArea
# Our land cover raster for 2011 for the three counties is ThreeCountyLC_2011

# ggplot() +
#   geom_sf(data=ThreeCountyArea) +
#   geom_raster(data=rast(ThreeCountyLC_2011) %>% na.omit %>% filter(value > 0), 
#               aes(x,y,fill=as.factor(value))) +
#   scale_fill_viridis(discrete=TRUE, name ="") +
#   labs(title = "Land Cover, 2011") +
#   mapTheme +
#   theme(legend.direction="horizontal")
```

The table below shows the approach taken to recoded existing land cover codes into the categories used in our analysis. In the code block below new rasters are generated and `names` are applied. Naming ensures that when the raster is integrated with the fishnet, the column reflects the appropriate raster.

| Old_Classification             | New_Classification                                  |
|--------------------------------|-----------------------------------------------------|
| Open Space as well as Low, Medium and High Intensity Development | Developed |
| Deciduous, Evergreen, and Mixed Forest |  Forest |
| Pasture/Hay and Cultivated Crops | Farm |
| Woody and Emergent Herbaceous Wetlands | Woodlands |
| Barren Land, Dwarf Scrub, and Grassland/Herbaceous | Other Undeveloped |
| Water | Water |

```{r 15, warning = FALSE, message = FALSE}
developed <- ThreeCountyLC_2011 == 21 | ThreeCountyLC_2011 == 22 | ThreeCountyLC_2011 == 23 | ThreeCountyLC_2011 == 24
forest <- ThreeCountyLC_2011 == 41 | ThreeCountyLC_2011 == 42 | ThreeCountyLC_2011 == 43 
farm <- ThreeCountyLC_2011 == 81 | ThreeCountyLC_2011 == 82 
wetlands <- ThreeCountyLC_2011 == 90 | ThreeCountyLC_2011 == 95 
otherUndeveloped <- ThreeCountyLC_2011 == 52 | ThreeCountyLC_2011 == 71 | ThreeCountyLC_2011 == 31 
water <- ThreeCountyLC_2011 == 11

names(developed) <- "developed"
names(forest) <- "forest"
names(farm) <- "farm"
names(wetlands) <- "wetlands"
names(otherUndeveloped) <- "otherUndeveloped"
names(water) <- "water"
```

Next, each raster is aggregated to the fishnet by way of a function called `aggregateRaster`. Here, the process used above to To do this, a function is created below that loops through a list of rasters, converts the _ith_ raster to points, filters only points that have value of `1` (ie. is the _ith_ land cover type), and then aggregates to the fishnet.

Here is the function.

```{r 16, warning = FALSE, message = FALSE}
aggregateRaster <- function(inputRasterList, theFishnet) {
  #create an empty fishnet with the same dimensions as the input fishnet
  theseFishnets <- theFishnet %>% dplyr::select()
  #for each raster in the raster list
  for (i in inputRasterList) {
  #create a variable name corresponding to the ith raster
  varName <- names(i)
  #convert raster to points as an sf
    thesePoints <-
      rasterToPoints(i) %>%
      as.data.frame() %>%
      st_as_sf(coords = c("x", "y"), crs = st_crs(theFishnet)) %>%
      filter(.[[1]] == 1)
  #aggregate to the fishnet
    thisFishnet <-
      aggregate(thesePoints, theFishnet, length) %>%
      mutate(!!varName := ifelse(is.na(.[[1]]),0,1))
  #add to the larger fishnet
    theseFishnets <- cbind(theseFishnets,thisFishnet)
  }
  #output all aggregates as one large fishnet
   return(theseFishnets)
  }
```

The `theRasterList` of land cover types in 2011 is created and then fed into `aggregateRaster`. The result is converted to long form grid cell centroids and plot as small multiple maps.

Note the inclusion of `st_cast` here which convert all geometries to `POLYGON`. If you create a frequency table of geometry types in `aggregatedRasters`, you will notice some and handful of `MULTIPOLYGONS`. Try `table(st_geometry_type(aggregatedRasters)`). These rogue multipolygons break the `xyC` function which is designed to find grid cell centroids. After all, there is no one centroid of several combined polygons. Thus `st_cast` ensures all geometries are just `POLYGON`. Look out for this function throughout the remainder of this chapter.

```{r, warning = FALSE, message = FALSE}
theRasterList <- c(developed,forest,farm,wetlands,otherUndeveloped,water)

aggregatedRasters <-
  aggregateRaster(theRasterList, ThreeCounty_fishnet) %>%
  dplyr::select(developed,forest,farm,wetlands,otherUndeveloped,water) %>%
  mutate_if(is.numeric,as.factor)

aggregatedRasters %>%
  gather(var,value,developed:water) %>%
  st_cast("POLYGON") %>%    #just to make sure no weird geometries slipped in
  mutate(X = xyC(.)$x,
         Y = xyC(.)$y) %>%
  ggplot() +
    geom_sf(data=ThreeCountyArea) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land Cover Types, 2011",
         subtitle = "As fishnet centroids") +
   mapTheme
```

## 2.4. Census Data

Population and population change is obviously an critical demand-side component of predicting `Development_Demand`. Census data for both 2011 and 2021 can be downloaded quickly using the `tidycensus` package. As illustrated below, these data are downloaded at a census tract geography and thus, an approach is needed to reconcile tracts and fishnet geometries. This is accomplished using a technique called areal weighted interpolation.

Recall, you will need a census API key to download the census data which must be input with `census_api_key`. 

```{r load_key_hide, warning= FALSE, include=FALSE}
# census_key <- read.table("~/GitHub/census_key.txt", quote="\"", comment.char="")
# census_api_key(census_key[1] %>% as.character(), overwrite = TRUE)
```

```{r load_key, warning = FALSE, eval = FALSE}
census_api_key("e5e96d76285beca3c6e1a9110d762a430ced4811", overwrite = TRUE)
```

First data is pulled for 2011 and reprojected.

```{r, warning = FALSE, message = FALSE, results = "hide"}
# Pull population data for 2011
library(tidycensus)
library(tigris)
library(sf)

# Define the counties you want to retrieve data for
counties <- c("Cook", "DuPage", "Will County")

# Retrieve population data for 2011 and tract geometry
Pop_2011 <- get_acs(geography = "tract",
                            variables = "B01003_001",
                            year = 2011,
                            state = "IL",
                            county = counties,
                            geometry = TRUE) %>%
  rename(Pop_2011 = estimate) %>%
  st_transform(st_crs(ThreeCounty_fishnet))

# Write the spatial dataframe to a shapefile
st_write(Pop_2011,"C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Population/Pop_2011.shp", append = TRUE)

# Clip Pop_2011 to the extent of ThreeCounty_fishnet. Otherwise, it includes an area of Lake Michigan
Pop_2011 <- st_intersection(Pop_2011, ThreeCounty_fishnet)
```

Now data for 2021 is downloaded. In this instance, `st_buffer` is used to buffer the the tracts by -1ft. This is done because `tidycensus` appears to return geometries that are problematic when subjected to the area weighted interpolation function below. As done in previous chapters, a very small buffer is used to correct the geometries.

```{r, warning = FALSE, message = FALSE, results = "hide"}

counties <- c("Cook", "DuPage", "Will County")

# Pull population data for 2021
Pop_2021 <- get_acs(geography = "tract",
                            variables = "B01003_001",
                            year = 2021,
                            state = "IL",
                            county = counties,
                            geometry = TRUE) %>%
  rename(Pop_2021 = estimate) %>%
  st_transform(st_crs(ThreeCounty_fishnet)) %>%
  st_buffer(-1)
st_write(Pop_2021,"C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Population/Pop_2021.shp", append = TRUE)

```

Both years of census data are then plotted.

<div class="superbigimage">
```{r, warning = FALSE, message = FALSE, fig.height= 8, fig.width= 11}
class(Pop_2011)
str(Pop_2011)
str(Pop_2021)

# If the geometry column is missing or has a different name, rename it
Pop_2011 <- st_set_geometry(Pop_2011, "geometry")
Pop_2021 <- st_set_geometry(Pop_2021, "geometry")

library(gridExtra)

grid.arrange(
  ggplot() +
    geom_sf(data = Pop_2011, aes(fill = factor(ntile(Pop_2011, 5))), colour = NA) +
    scale_fill_manual(values = palette5,
                      labels = quintileBreaks(Pop_2011, "Pop_2011"),
                      name = "Quintile\nBreaks") +
    labs(title = "Population per Census Tract in Cook, Du Page, and Will Counties: 2011") +
    mapTheme,

  ggplot() +
    geom_sf(data = Pop_2021, aes(fill = factor(ntile(Pop_2021, 5))), colour = NA) +
    scale_fill_manual(values = palette5,
                      labels = quintileBreaks(Pop_2021, "Pop_2021"),
                      name = "Quintile\nBreaks") +
    labs(title = "Population per Census Tract in Cook, Du Page, and Will Counties: 2021") +
    mapTheme, 
  ncol = 2
)
```
</div>

Now to reconcile tract boundaries and fishnet grid cells. I’d like you to pay particular attention to this process.

A spatial join would be inappropriate as it would assign the same population value from one tract to the many intersecting grid cells. Instead, the area weighted interpolation function, `st_interpolate_aw`, assigns a proportion of a tract’s population to a grid cell weighted by the proportion of the tract that intersects the grid cell. This works best of course, when we assume that the tract population is uniformly distributed across the tract. This is typically not a great assumption. However, it is a reasonable here particularly given population is a feature in a regression and not an outcome that needs to be measured with significant precision.

The Census data `Pop_2011`, has a different spatial extent than `ThreeCounty_fishnet`. Most notably, there are no vectors where water is present. To maintain the needed 3851 grid cell units `(nrow(ThreeCounty_fishnet))`, a a unique id is created, `fishnetID`. Then the area weighted interpolation is performed on `population` for the 2011 and 2021 layers. Finally, the results are joined back (`left_join`) to `ThreeCounty_fishnet`. This approach maintains a consistent spatial extent.

It would be helpful for you to spend some time running through each code block line by line. Areal weighted interpolation is a really strong spatial analysis skill to have. You can do this in ArcGIS but there is not automated approach.

```{r, warning = FALSE, message = FALSE}
ThreeCounty_fishnet <-
  ThreeCounty_fishnet %>%
  rownames_to_column("fishnetID") %>% 
  mutate(fishnetID = as.numeric(fishnetID)) %>%
  dplyr::select(fishnetID)

fishnetPopulation11 <-
  st_interpolate_aw(Pop_2011["Pop_2011"], ThreeCounty_fishnet, extensive=TRUE) %>%
  as.data.frame(.) %>%
  rownames_to_column(var = "fishnetID") %>%
  left_join(ThreeCounty_fishnet %>%
              mutate(fishnetID = as.character(fishnetID)),
            ., by=c("fishnetID"='fishnetID')) %>% 
  mutate(Pop_2011 = replace_na(Pop_2011,0)) %>%
  dplyr::select(Pop_2011)

fishnetPopulation21 <-
  st_interpolate_aw(Pop_2021["Pop_2021"],ThreeCounty_fishnet, extensive=TRUE) %>%
  as.data.frame(.) %>%
  rownames_to_column(var = "fishnetID") %>%
  left_join(ThreeCounty_fishnet %>%
              mutate(fishnetID = as.character(fishnetID)),
            ., by=c("fishnetID"='fishnetID')) %>% 
  mutate(Pop_2021 = replace_na(Pop_2021,0)) %>%
  dplyr::select(Pop_2021)

fishnetPopulation <- 
  cbind(fishnetPopulation11,fishnetPopulation21) %>%
  dplyr::select(Pop_2011,Pop_2021) %>%
  mutate(pop_Change = Pop_2021 - Pop_2011)
```

For comparison purposes, both the 2021 census tract geometries and the population weighted grid cells are plotted.

<div class="superbigimage">
```{r, warning = FALSE, message = FALSE, fig.height = 8, fig.width= 11}
grid.arrange(
ggplot() +
  geom_sf(data=Pop_2021, aes(fill=factor(ntile(Pop_2021,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=substr(quintileBreaks(Pop_2021,"Pop_2021"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population for Cook, Du Page, and Will Counties: 2021",
       subtitle="Represented as tracts; Boundaries omitted") +
  mapTheme,

ggplot() +
  geom_sf(data=fishnetPopulation, aes(fill=factor(ntile(Pop_2021,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                   labels=substr(quintileBreaks(fishnetPopulation,"Pop_2021"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population for Cook, Du Page, and Will Counties: 2021",
       subtitle="Represented as fishnet gridcells; Boundaries omitted") +
  mapTheme, ncol=2)
```
</div>

## 2.5. Highway Distance

Accessibility is a key determinant of development potential particularly in a sprawling city like Chicago.  Accessibility features are engineered by measuring distance from each grid cell to its nearest highway.

First highway vectors are downloaded from the U.S. Census TIGER Line 2019 datasets in a shapefile format; projected and subset to the subset using `st_intersection`. Below, new development is mapped with the highway overlay.

```{r, warning = FALSE, message = FALSE, results = "hide"}
ThreeCounties_Highways <-
  st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/IL_highways/IL_highways/tl_2019_17_prisecroads.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
```

```{r plot_highway_devchange, warning = FALSE, message= FALSE}
ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=Dev_change),size=1.5) +
  geom_sf(data=ThreeCounties_Highways, colour = "red") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Highways",
       subtitle = "As fishnet centroids") +
  mapTheme
```

Below are some great r-based raster skills. The distance from each grid cell to its nearest highway segment is measured.

First, you can convert a the highway layer to raster. This can be done by creating an `emptyRaster` of `NA` grid cells at the same spatial extent as `Development_change`. Then, `highway_raster` is created by converting `ThreeCounties_Highways` to `sp` form and then to applying `rasterize`. The raster is then converted to points with `rasterToPoints` and `st_as_sf`, then `aggregate` is used to calculate mean distance by grid cell.

You may (but likely not) be interested in learning that `sp` is the older spatial data convention in R. Although `sf` is the new convention, raster/vector interactions still require `sp`. The `as` function converts.

Instead, we used a slightly different method to calculate distance to highways. Here is the code we didn't use:

```{r, warning = FALSE, message = FALSE}
# Original code (not using):
# emptyRaster <- Development_change
# emptyRaster[] <- NA
# 
# highway_raster <- 
#   as(ThreeCounties_Highways,'Spatial') %>%
#   rasterize(.,emptyRaster)
# 
# highway_raster_distance <- distance(highway_raster)
# names(highway_raster_distance) <- "distance_highways"
# 
# highwayPoints <-
#   rasterToPoints(highway_raster_distance) %>%
#   as.data.frame() %>%
#   st_as_sf(coords = c("x", "y"), crs = st_crs(ThreeCounty_fishnet))
# 
# highwayPoints_fishnet <- 
#   aggregate(highwayPoints, ThreeCounty_fishnet, mean) %>%
#   mutate(distance_highways = ifelse(is.na(distance_highways),0,distance_highways))
# 
# ggplot() +
#   geom_sf(data=ThreeCountyArea) +
#   geom_point(data=highwayPoints_fishnet, aes(x=xyC(highwayPoints_fishnet)[,1], 
#                                              y=xyC(highwayPoints_fishnet)[,2], 
#                  colour=factor(ntile(distance_highways,5))),size=1.5) +
#   scale_colour_manual(values = palette5,
#                       labels=substr(quintileBreaks(highwayPoints_fishnet,"distance_highways"),1,8),
#                       name="Quintile\nBreaks") +
#   geom_sf(data=houstonHighways, colour = "red") +
#   labs(title = "Distance to Highways",
#        subtitle = "As fishnet centroids; Highways visualized in red") +
#   mapTheme
```

Instead, we read in the highways raster and the ThreeCounty_fishnet and used st_nearest_feature to calculate the distance from each fishnet cell centroid to the nearest highway. The variable highway_dist includes these calculated distances to the nearest highway.

```{r}
#Read in highways and transform to appropriate projection
# It's called "ThreeCounties_Highways" - we read this in earlier in the code

#Convert fishnet to centroids
centroid <- ThreeCounty_fishnet %>%
  st_centroid()

#Determine nereast highway to each centroid
nearest_feat <- st_nearest_feature(centroid,ThreeCounties_Highways)

#Calcuate distance from each grid square centroid to nearest highway
ThreeCounty_fishnet$highway_dist <- as.double(st_distance(centroid, ThreeCounties_Highways[nearest_feat,], by_element=TRUE))

Highway_fishnet <- ThreeCounty_fishnet

#Make a quick sample map of the results
ggplot()+
  geom_sf(data=Highway_fishnet,aes(fill=highway_dist),color='transparent')+
  scale_fill_viridis_c(name='Distance to Highway (feet)')+
  geom_sf(data=ThreeCounties_Highways,color='red')+
  labs(title = "Distance to Highways",
       subtitle = "Using fishnet centroids") +
  theme_void()
```

There are a lot of federal and state highways in the three county area, particularly closer to Chicago. There are some more rural and suburban areas in the southern portion of the study area that are further from highways.


## 2.5.b Distance to Regional Rail Lines (Metra)

Transit accessibility is a key determinant of development potential particularly in a city like Chicago. Transit-oriented development is commonly prioritized to create compact development. Accessibility features are engineered by measuring distance from each grid cell to its nearest regional rail line.

First regional rail line vectors are downloaded from the City of Chicago open data website in `shp` format; projected and subset to the subset using `st_intersection`. Below, new development is mapped with the regional rail line overlay. The regional rail lines are a part of the Metra system for the Chicagoland region.


```{r, warning = FALSE, message = FALSE, results = "hide"}
ThreeCounties_RegRail <-
  st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Metra_Lines/MetraLinesshp.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
```

Next, we map the existing regional rail lines with our analysis of development change from 2011-2021.

```{r plot_highway, warning = FALSE, message= FALSE}
ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=Dev_change),size=1.5) +
  geom_sf(data=ThreeCounties_RegRail, colour = "red") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Regional Rail Lines",
       subtitle = "As fishnet centroids") +
  mapTheme
```

Next, we calculate the distance to the nearest regional rail line for each fishnet cell centroid and map it. This variable is called RegRail_dist.

```{r}
#Determine nearest rail line to each centroid
nearest_rail <- st_nearest_feature(centroid,ThreeCounties_RegRail)

#Calcuate distance from each grid square centroid to nearest rail line
ThreeCounty_fishnet$regrail_dist <- as.double(st_distance(centroid, ThreeCounties_RegRail[nearest_rail,], by_element=TRUE))

RegRail_fishnet <- ThreeCounty_fishnet #%>%
#  select(fishnetID, geometry, regrail_dist)

#Make a quick sample map of the results
ggplot()+
  geom_sf(data=RegRail_fishnet,aes(fill=regrail_dist),color='transparent')+
  scale_fill_viridis_c(name='Distance to Regional Rail Lines (feet)')+
  geom_sf(data=ThreeCounties_RegRail,color='red')+
  labs(title = "Distance to Regional Rail Lines",
       subtitle = "Using fishnet centroids") +
  theme_void()
```

As you can see, most of the three county area is within 25,000 feet (or a little over 4.5 miles) of a regional rail line. This is too large of a distance for walking commuters, but it is accessible for people who bike or drive or take other modes to the regional rail stations to access the Chicago city region. To improve housing density, it is ideal to create more developments in close proximity to regional rail stations, ideally within a half mile for walkers or a mile or more for bike/bus/car/other modal commutes. 


# 2.5.c Distance to Downtown Chicago (The Loop)

Proximity to employment centers is a key determinant of development particularly in a major city like Chicago and its surrounding commuter suburbs. Features are engineered by measuring distance from each grid cell to downtown Chicago, also known as the Loop.

First a polygon feature is traced around Downtown Chicago using ArcGIS Pro. Below, new development is mapped with the Downtown_Chicago feature overlay.

```{r}
ThreeCounties_Downtown <-
 st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Downtown_Chicago/Downtown_Chicago.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
```

Next, we plot the Chicago Loop area seen in red and show the development change from 2011-2021 in the three county area.

```{r}
ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=Dev_change),size=1.5) +
  geom_sf(data=ThreeCounties_Downtown, colour = "black", fill = "red") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Proximity to Downtown",
       subtitle = "As fishnet centroids") +
  mapTheme
```

Next, we plot the proximity to downtown variable (downtown_dist) using fishnet centroids and map it. 

```{r}
library(scales)

#Determine distance to City Center centroid
near_downtown <- st_nearest_feature(centroid,ThreeCounties_Downtown)

#Calculate distance from each grid square centroid to the city center centroid
ThreeCounty_fishnet$downtown_dist <- as.double(st_distance(centroid, ThreeCounties_Downtown[near_downtown,], by_element=TRUE))

Downtown_fishnet <- ThreeCounty_fishnet 
#%>%   select(fishnetID, geometry, downtown_dist)

ggplot()+
geom_sf(data=Downtown_fishnet,aes(fill=downtown_dist),color='transparent')+
  scale_fill_viridis_c(name='Proximity to Downtown (feet)', labels = comma)+
  geom_sf(data=ThreeCounties_Downtown,color="black", fill = "red")+
  labs(title = "Distance to Downtown Chicago",
       subtitle = "Using fishnet centroids") +
  theme_void()
```



# 2.5.d Distance to Lake Michigan

Proximity to the waterfront is a key determinant of development particularly in a major city like Chicago and its surrounding commuter suburbs. Features are engineered by measuring distance from each grid cell to the Lake Michigan shoreline.

Below, new development is mapped with the Lake Michigan shoreline feature overlay, which is shown in red.

```{r}
ThreeCounties_LakeMichigan <-
 st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Lake_Michigan_Shoreline/lake_shore.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
```


```{r}
ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=Dev_change),size=1.5) +
  geom_sf(data=ThreeCounties_LakeMichigan, colour = "red") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Proximity to Lake Michigan",
       subtitle = "As fishnet centroids") +
  mapTheme
```

Next, we calculate the distance to the lake (lake_dist) and map this variable. 

```{r}
library(scales)

#Determine distance to City Center centroid
near_lake <- st_nearest_feature(centroid,ThreeCounties_LakeMichigan)

#Calculate distance from each grid square centroid to the lake centroid
ThreeCounty_fishnet$lake_dist <- as.double(st_distance(centroid, ThreeCounties_LakeMichigan[near_lake,], by_element=TRUE))

LakeMichigan_fishnet <- ThreeCounty_fishnet 
#%>% select(fishnetID, geometry, lake_dist)

ggplot()+
geom_sf(data=LakeMichigan_fishnet,aes(fill=lake_dist),color='transparent')+
  scale_fill_viridis_c(name='Proximity to Lake Michigan (feet)', labels = comma)+
  geom_sf(data=ThreeCounties_LakeMichigan,color="black", fill = "red")+
  labs(title = "Distance to Lake Michigan",
       subtitle = "Using fishnet centroids") +
  theme_void()
```



## 2.6. The Spatial Lag of Development

At the center of our model is a hypothesis that development demand must in part, be a function of the pattern of existing development. Development occurs where the market believes a higher and better use may bring an investment return. In the case of sprawling region like Houston, assuming the requisite demand, there is a clear return on investment for converting farmland to suburban housing.

The traditional ‘bid-rent’ economic model of development posits that development demand is a function of accessibility. This model works well in cities where centralized locations offer the most accessibility. However, it assumes that all consumers of land share the same preferences for central city access. While urban land is valuable, contemporary urbanism in regions like Houston show us that suburban locations can be quite desirable as well.

Why is that? Hinterland locations do not offer direct access to jobs and cultural amenities. Instead, residents trade-off accessibility for larger lots and bigger homes; as well as a bundle of public services like school quality. Developers are attracted to suburban and exurban locations because of cheap land on ‘greenfield’ sites like farms and open space.

The demand for greenfield development can vary substantially depending on the existing spatial configuration of development. If accessibility to central locations was the only underlying consideration, developers would sprawl directly out to the periphery, much like the dynamic we modeled in the Urban Growth Boundary chapter (Chapter 2). As a space/time process, this would look much like spilled milk, emanating from a central point outward across the kitchen table.

Another option for developers is to move beyond the periphery onto greenfield sites that are cheaper because they are even less accessible. In this case, the space/time process looks small ‘patches’ of new development dotting the landscape and “leapfrogging” from one greenfield to the next. The economic incentive is to always develop beyond the periphery, where land is cheapest. There are some real costs to this model however. For one, when development is so diffuse, it is more burdensome to efficiently deploy infrastructure like roads, sewers and electicity. Second, leapfrog development fragmentats natural areas reducing biodiversity and stressing the natural habitat of species that need continuous open space to thrive.

In Chicago, as in many sprawling regions of the U.S. the economic incentives that underlie sprawl likely encourage both the accessibility and leapfrog models of development. For our purposes however, features must be created to associate these patterns with development. Without them, the model may lack the appropriate spatial experience on which to forecast growth.

To keep it simple, we develop features associated with accessibility-based patterns. In reality, the analyst should develop a series of applicable features and test which best associate with the outcome of interest. The problem becomes infinitely more difficult when one realizes that sprawl patterns may differ throughout the study area - if for instance, land use restrictions varied by county. Below we estimate models using logistic regression, but higher level machine learning algorithms, most notably, Random Forest, are more adept at dealing with non-linearities across space.

Accessibility is measured by way of a spatial lag hypothesizing that new development is a function of distance to existing development. The shorter the distance, the more accessible a grid cell is to existing development. This is measured by calculating the average distance from each grid cell to its 2 nearest developed neighboring grid cells in 2011 using the `nn_function`. The function below calculates average nearest neighbor distance between k point layers. The first parameter specifies coordinates that we want to `measureFrom`, in this case, `fishnet` centroids. The second, indicates the point layer we wish to `measureTo`, in this case, the fishnet centroids that were developed in 2011.

```{r, warning = FALSE, message = FALSE}
nn_function <- function(measureFrom,measureTo,k) {
  #convert the sf layers to matrices
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
    output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}
```

Why `k=2`? As `k` fluctuates, so does the hypothesized scale of accessibility. One can test the effect of different k parameters on model goodness of fit, but as mentioned, a more sophisticated model would hypothesize that this scale can vary significantly from city to suburb to rural town.

Next, the function appending the lag distance to `fishnet`. There are 3 inputs. The `fishnet` which is converted to a coordinate data frame with the `xyC` function. 2011 developed areas are created using `filter`. The map below illustrates relative accessibility from every grid cell to nearby development.

```{r, warning = FALSE, message = FALSE}
fishnet$lagDevelopment <-
    nn_function(xyC(fishnet),
                xyC(filter(aggregatedRasters,developed==1)),
                2)

ggplot() +
  geom_sf(data = ThreeCountyArea, fill = "transparent") +
  geom_point(data = fishnet, 
             aes(x = xyC(fishnet)[, 1], y = xyC(fishnet)[, 2], color = lagDevelopment), 
             size = 1.5) +
  scale_color_gradientn(colors = palette5, 
                        limits = range(fishnet$lagDevelopment),
                        name = "Lag Development",
                        labels = scales::comma) +
  labs(title = "Spatial Lag to 2011 Development",
       subtitle = "As fishnet centroids") +
  mapTheme

# ggplot() +
#   geom_sf(data=ThreeCountyArea, fill = "transparent") +
#   geom_point(data=fishnet, 
#              aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2], 
#                  colour=factor(ntile(lagDevelopment,5))), size=1.5) +
#   scale_colour_manual(values = palette5,
#                      labels=substr(quintileBreaks(fishnet,"lagDevelopment"),1,7),
#                      name="Quintile\nBreaks") +
#   labs(title = "Spatial Lag to 2011 Development",
#        subtitle = "As fishnet centroids") +
#   mapTheme
```


## 2.7. Study Area Counties

The `tigris` package allows Illinois county geometries to be downloaded. A spatial subset returns only the three counties we are looking at. Note that the subset includes a negative 1000ft `st_buffer`. This is done because the spatial extent of `ThreeCountyArea` intersects county boundaries that are actually outside of our study area. Buffering `ThreeCountyArea` slightly limits the intersection range to only those counties in the study area.

Once `studyAreaCounties` is created, it is `st_join`ed with `dat` such that each grid cells knows which county it’s in. 

We already have a shapefile and raster of the ThreeCountyArea, so we skipped this code. 

```{r, warning = FALSE, message = FALSE, results = "hide"}
# We already have a shapefile of the 3 county study area.
# options(tigris_class = "sf")
# 
# studyAreaCounties <- 
#   counties("Illinois") %>%
#   st_transform(st_crs(ThreeCountyArea)) %>%
#   dplyr::select(NAME) %>%
#   .[st_buffer(ThreeCountyArea,-1000), , op=st_intersects]
```

We mapped the three counties for reference.

```{r, warning = FALSE, message = FALSE}
library(ggplot2)

# Calculate centroids for the polygons
county_centroids <- st_centroid(IL_counties_subset)

# Extract centroid coordinates
county_centroid_coords <- st_coordinates(county_centroids)

# Create a data frame with centroid coordinates
county_centroid_df <- data.frame(X = county_centroid_coords[, "X"], Y = county_centroid_coords[, "Y"])

# Add COUNTY_NAM column to centroid_df
county_centroid_df$COUNTY_NAM <- county_centroids$COUNTY_NAM

# Plot with text labels
ggplot() +
  geom_sf(data = IL_counties_subset) +
  geom_text(data = county_centroid_df, aes(label = COUNTY_NAM, x = X, y = Y), size = 3, color = "black") +
  labs(title = "Study Area Counties") +
  mapTheme
```


## 2.8. Create the Final Dataset

The last step is to bring together all the disparate feature layers into a final dataset that can be used for analysis. The various fishnet layers are `cbind` together, needed features are extracted and the final fishnet, `dat` is then joined with `ThreeCountyArea` to assign each grid cell to a county. `developed21` is created to designate those areas that have already been developed through 2021. Finally, any grid cell that has a `water` land cover designation is removed.

```{r, warning = FALSE, message = FALSE}
dat <- 
  cbind(
    fishnet, fishnetPopulation, Highway_fishnet, RegRail_fishnet, Downtown_fishnet, LakeMichigan_fishnet,  aggregatedRasters) %>%
  dplyr::select(Dev_change, developed, forest, farm, wetlands, otherUndeveloped, water,
                Pop_2011, Pop_2021, pop_Change, highway_dist, regrail_dist, downtown_dist, lake_dist, lagDevelopment) %>%
  st_join(ThreeCountyArea) %>%
  mutate(developed21 = ifelse(Dev_change == 1 & developed == 1, 0, developed)) %>%
  filter(water == 0) 
```


# 3. Exploratory Analysis

In this section we explore the extent to which each features is associated with development change. If the goal was to predict a continuous variable, scatterplots and correlation coefficients make this process straightforward and relatively easy to explain to a non-technical decison maker.

In this case however, the dependent variable is a binary outcome - either a grid cell was developed between 2011 and 2021 or it wasn’t. In this case, the relevant question is whether for a given feature, there is a statistically significant difference between areas that changed and areas that did not. These differences are explored in a set of plots below. For models with lots of features, these plots could be compliment by a series of difference in means statistical tests.

The below code block `select`s the distance variables and spatial lag features, converts each to long form and plots each as bar plots. Note that `geom_bar` calculates the `mean`.

```{r, warning = FALSE, message = FALSE}
dat %>%
  dplyr::select(pop_Change, highway_dist, regrail_dist, downtown_dist, lake_dist, lagDevelopment, Dev_change) %>%
  gather(Variable, Value, -Dev_change, -geometry) %>%
  ggplot(., aes(Dev_change, Value, fill=Dev_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New Development as a Function of the Continuous Variables") +
    plotTheme
```

There are significant differences between no change and new development (from 2011-2021) for the lake distance and downtown Chicago distance variables. The bar plots for the other variables (Development lag, distance to highways, population change, and distance to regional rail) are above.

Next, the same visualization is created for the population related variables. These plots inform which features should be included in the model.

```{r, warning = FALSE, message = FALSE}
dat %>%
  dplyr::select(Pop_2011, Pop_2021, pop_Change, Dev_change) %>%
  gather(Variable, Value, -Dev_change, -geometry) %>%
  ggplot(., aes(Dev_change, Value, fill=Dev_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New Development as a Function of Factor Variables") +
    plotTheme
```

The population change variable is a bit of a tough one, because the number, size, and shape of the Census tracts changed from the 2010 to 2020 decennial Census. If a Census tract had an increase in population, then it split into two or more Census tracts and was given a different tract ID number and boundary. This makes it a bit hard to compare geographies across the decade, but we used the spatial lag feature to address that issue. 


Next, a table of land cover conversion between 2011 and 2021 is created. The table suggests for instance, that 9% of farmland regionally was converted to development between 2011 and 2021. This indicator should be interpreted in the context of the scale changes we imposed on the data by moving from a 30ft by 30ft raster to a 4000ft by 4000ft fishnet. This is the same reason why the table suggests `developed` area was then “developed”.

```{r, warning = FALSE, message = FALSE}
dat %>%
  dplyr::select(Dev_change:otherUndeveloped,developed) %>%
  gather(Land_Cover_Type, Value, -Dev_change, -geometry) %>%
   st_set_geometry(NULL) %>%
     group_by(Dev_change, Land_Cover_Type) %>%
     summarize(n = sum(as.numeric(Value))) %>%
     ungroup() %>%
    mutate(Conversion_Rate = paste0(round(100 * n/sum(n), 2), "%")) %>%
    filter(Dev_change == 1) %>%
  dplyr::select(Land_Cover_Type,Conversion_Rate) %>%
  kable() %>% kable_styling(full_width = F)
```

# 4. Predicting for 2031

In this section, six separate logistic regression models are estimated to predict development change between 2011 and 2021 - with each subsequent model more sophisticated then the last. To do so, the data is split into 50% training/test sets. Models are estimated on the training set.

Normally, as in previous chapters, a results table row would be generated for each model describing the accuracy and generalizability of predictions for each specification. For brevity, a less sophisticated approach is taken here, judging each by the McFadden or “Psuedo” R Squared statistic on the test set. The model with the greatest goodness of fit is then used for the purposes of prediction.

## 4.2. Modeling

First, `dat` is split into training and test sets. Note how imbalanced the panel is with `table(datTrain$developed)`.

```{r 46, warning = FALSE, message = FALSE}
set.seed(3456)
trainIndex <- 
  createDataPartition(dat$developed, p = .50,
                                  list = FALSE,
                                  times = 1)
datTrain <- dat[ trainIndex,]
datTest  <- dat[-trainIndex,]

nrow(dat)
table(datTrain$developed)
```

Next six separate `glm` models are estimated adding new variables for each. Figure 4.1 shows the Psuedo R-Squared associated with each model.

`Model1` includes only the 2011 land cover types. `Model2` adds the `lagDevelopment`. Models 3, 4 and 5 attempt three different approaches for modeling population change. `Model3` uses population in 2011; `Model4` uses 2011 and 2021 population; and `Model5` uses population change. All are significant so which population feature should be chosen? The answer lies in how the model will be used to forecast. By modeling population change between 2011 and 2021, the model is well specified to forecast 2031 development by having `pop_Change` indicate change between 2021 and 2031. `Model6` includes distance to the highways and all other variables, and is the final model employed for prediction.

```{r 47, warning = FALSE, message = FALSE}
Model1 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped, 
              family="binomial"(link="logit"), data = datTrain)

Model2 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment, 
              family="binomial"(link="logit"), data = datTrain)
              
Model3 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + Pop_2011, 
              family="binomial"(link="logit"), data = datTrain)          
              
Model4 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + Pop_2011 + 
              Pop_2021, 
              family="binomial"(link="logit"), data = datTrain)              
            
Model5 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change, 
              family="binomial"(link="logit"), data = datTrain)              
              
Model6 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + 
              highway_dist + downtown_dist + lake_dist + regrail_dist, 
              family="binomial"(link="logit"), data = datTrain) 
```

Working carefully through the below code block, a very concise approach for creating a data frame of psudeo R Squares for each model and plotting them for comparison. Recall, `pR2` is the function for psuedo R squared. Dissect the line that uses the `map_dfc` function to see how this approach loops through the models retrieving the goodness of fit for each.

```{r 48, warning = FALSE, message = FALSE}
modelList <- paste0("Model", 1:6)
map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("Model",1:6)) %>%
  gather(Model,McFadden) %>%
  ggplot(aes(Model,McFadden)) +
    geom_bar(stat="identity") +
    labs(title= "McFadden R-Squared by Model") +
    plotTheme
```

Next, a data frame is created that includes columns for the observed development change, `Dev_change`, and one that includes predicted probabilities for `Model6`. This data frame is then used as an input to a density plot visualizing the distribution of predicted probabilities by observed class. Only a small number of predicted probabilities are greater than or equal to 50% `(nrow(filter(testSetProbs, probs >= .50)) / nrow(datTest))`. This makes good sense, given how rare of an event development is in our dataset. Ultimately, in order to judge our model with a confusion matrix, a smaller development classification threshold must be employed.

```{r, warning = FALSE, message = FALSE}
testSetProbs <- 
  data.frame(class = datTest$Dev_change,
             probs = predict(Model6, datTest, type="response")) 
  
ggplot(testSetProbs, aes(probs)) +
  geom_density(aes(fill=class), alpha=0.5) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development")) +
  labs(title = "Histogram of test set predicted probabilities",
       x="Predicted Probabilities",y="Density") +
  plotTheme
```

## 4.3. Accuracy

Now to pick a predicted probability threshold to classify an area as having new development. Recall, *Sensitivity* or the True Positive rate is the proportion of actual positives (1’s) that were predicted to be positive. For example, the Sensitivity in our model is the rate of developed areas actually predicted as such. *Specificity* or True Negative Rate is the proportion of actual negatives (0’s) that were predicted to be negatives. For example, the Specificity in our model is the rate of No Change areas that were correctly predicted as No change.

It is important to consider what Planners would typically optimize for given this use case. One approach is to maximize the number of 1’s predicted correctly (Sensitivity) so as to not under or over-predict new development. It may okay in this use case to incorrectly predict no change as changed (Specificity). An abundance of False Negative errors may be reasonable if Planners don’t mind over emphasizing development potential. It is important to remember that below this potential will evaluated alongside supply-side indicators such as the presence of sensitive land.

There are some clear tradeoffs between Sensitivity and Specificity in our model that deserve some exploration. To illustrate, two different thresholds of 5% and 17% are explored. Predicted classes for both thresholds are generated and instead of using the `confusionMatrix` function from `caret` as we have in the past, here confusion matrix metrics are derived from the `yardstick` package. This allows us to `group_by` the threshold and `summarize` the metrics of interest.

The `options` call below is required to tell `yardstick` that the positive factor class in `testSetProbs` is `1`. Without it, yardstick will by default, see the first factor level as `0` and flip the confusion metrics around.

```{r, warning = FALSE, message = FALSE}
options(yardstick.event_first = FALSE)

testSetProbs <- 
  testSetProbs %>% 
  mutate(predClass_05 = as.factor(ifelse(testSetProbs$probs >= 0.05 ,1,0)),
         predClass_17 = as.factor(ifelse(testSetProbs$probs >= 0.17 ,1,0))) 

testSetProbs %>%
  dplyr::select(-probs) %>%
  gather(Variable, Value, -class) %>%
  group_by(Variable) %>%
  summarize(Sensitivity = round(yardstick::sens_vec(class,factor(Value)),2),
            Specificity = round(yardstick::spec_vec(class,factor(Value)),2),
            Accuracy = round(yardstick::accuracy_vec(class,factor(Value)),2)) %>% 
  kable() %>%
  kable_styling(full_width = F)
```

The 5% threshold correctly predicts a lower number of new development areas (Sensitivity), but incorrectly predicts a higher number of no change areas (Specificity). As there are far more no change areas in the data, this is reflected in a lower overall accuracy. Conversely, the 17% threshold has a lower higher rate and a slightly lower Specificity rate. Again, because the dataset is majority no change areas, this leads to a higher Accuracy rate.

Given the use case, and the spatial distribution of land cover change, it may be more useful to have a model that predicts generally where new development occurs rather than one that predicts precisely where. As illustrated below, the 17% threshold provides this outcome. These trade-offs can be visualized in the plot below. Here the model is used to predict for the entire `dat` dataset. Which threshold looks more reasonable given the distribution of observed development change?

Note that these indicators are converted `as.factor` so they can be mapped with `scale_color_manual`.

```{r, warning = FALSE, message = FALSE}
predsForMap <-         
  dat %>%
    mutate(probs = predict(Model6, dat, type="response") ,
           Threshold_5_Pct = as.factor(ifelse(probs >= 0.05 ,1,0)),
           Threshold_17_Pct =  as.factor(ifelse(probs >= 0.17 ,1,0))) %>%
    dplyr::select(Dev_change,Threshold_5_Pct,Threshold_17_Pct) %>%
    gather(Variable,Value, -geometry) %>%
    st_cast("POLYGON")
```


<div class="superbigimage">
```{r 52, warning = FALSE, message= FALSE, fig.height = 6, fig.width= 8}
ggplot() +
  geom_point(data=predsForMap, aes(x=xyC(predsForMap)[,1], y=xyC(predsForMap)[,2], colour=Value)) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("No Change","New Development"),
                      name="") +
  labs(title="Development Predictions - Low Threshold") + 
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme
```
</div>

To provide a bit more insight, the code block below produces both true positives (Sensitivity) and true negatives (Specificity) for each grid cell by threshold type. Notice how the spatial pattern of Sensitivity for both thresholds is relatively consistent, but the 5% threshold misses most the study area with respect to Specificity.

```{r 53, warning = FALSE, message = FALSE}
ConfusionMatrix.metrics <-
  dat %>%
    mutate(probs = predict(Model6, dat, type="response") ,
           Threshold_5_Pct = as.factor(ifelse(probs >= 0.05 ,1,0)),
           Threshold_17_Pct =  as.factor(ifelse(probs >= 0.17 ,1,0))) %>%
    mutate(TrueP_05 = ifelse(Dev_change  == 1 & Threshold_5_Pct == 1, 1,0),
           TrueN_05 = ifelse(Dev_change  == 0 & Threshold_5_Pct == 0, 1,0),
           TrueP_17 = ifelse(Dev_change  == 1 & Threshold_17_Pct == 1, 1,0),
           TrueN_17 = ifelse(Dev_change  == 0 & Threshold_17_Pct == 0, 1,0)) %>%
    dplyr::select(., starts_with("True")) %>%
    gather(Variable, Value, -geometry) %>%
    st_cast("POLYGON")
```


<div class="superbigimage">
```{r 54, warning = FALSE, message = FALSE, fig.height= 8, fig.width= 8 }
ggplot(data=ConfusionMatrix.metrics) +
  geom_point(aes(x=xyC(ConfusionMatrix.metrics)[,1], 
                 y=xyC(ConfusionMatrix.metrics)[,2], colour = as.factor(Value))) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("Correct","Incorrect"),
                       name="") +
  labs(title="Development Predictions - Low Threshold") + 
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme
```
</div>

## 4.4 Generalizability

For this use case, it matters little whether the model generalizes well across random holdouts. Thus, regular cross-validation is substituted for spatial cross-validation. The latter is explicitly concerned with generalizability across space. The approach helps us understand whether our model is comparable to each county in the study area despite any possible differences in land use or land use planning.

To test across-space generalizability, `spatialCV` function is run, which iteratively loops through `dat` having each county take a turn as the hold out test set. This is also called ‘Leave-one-group-out cross validation.’. A model is estimated for the n - 1 counties that remain and used to `predict` for the hold out county.

```{r 55, warning = FALSE, message = FALSE}
spatialCV <- function(dataFrame, uniqueID, dependentVariable, modelName) {

#initialize a data frame 
endList <- list()

#create a list that is all the spatial group unqiue ids in the data frame (ie counties)    
  uniqueID_List <- unique(dataFrame[[uniqueID]])  
  x <- 1
  y <- length(uniqueID_List)
  
#create a counter and while it is less than the number of counties...  
  while(x <= y) 
  {
#call a current county    
    currentUniqueID <- uniqueID_List[x]
#create a training set comprised of units not in that county and a test set of units
#that are that county
    training <- dataFrame[ which(dataFrame[[uniqueID]] != uniqueID_List[x]),]
    testing <- dataFrame[ which(dataFrame[[uniqueID]] == uniqueID_List[x]),]
#create seperate xy vectors
    trainingX <- training[ , -which(names(training) %in% c(dependentVariable))]
    testingX <- testing[ , -which(names(testing) %in% c(dependentVariable))]
    
    trainY <- training[[dependentVariable]]
    testY <- testing[[dependentVariable]]
#Calculate predictions on the test county as part of a data frame including the observed
#outcome and the unique county ID    
   thisPrediction <- 
     data.frame(class = testY,
                probs = predict(modelName, testingX, type="response"),
                county = currentUniqueID) 

#Row bind the predictions to a data farme
   endList <- rbind(endList, thisPrediction)
#iterate counter    
    x <- x + 1 
  } 
#return the final list of counties and associated predictions  
  return (as.data.frame(endList))
}
```

Now the function is run; a 17% predicted probability threshhold is set and a facetted ROC plot for each county is created. 

```{r 56, warning = FALSE, message = FALSE}
spatialCV_counties <-
  spatialCV(dat,"COUNTY_NAM","Dev_change", Model6) %>%
  mutate(predClass = as.factor(ifelse(probs >= 0.17 ,1,0)))
```

To investigate the across-3county-area generalizability of the model, the code block below produces and maps confusion matrix statistics by county. It is important to ensure as above, that the `yardstick.event_first` option is set.

Some interesting patterns emerge. First, results for those counties with little `Observed_Change` are not meaningful. In this case, all three counties have some "Observed_Change". In places with substantial new development, Sensitivity rates are comparable with the results from the test set results on the entire study area. In DuPage County, Sensitivity is 0.00, implying that the model didn't correctly identify any of the observed changes in DUPAGE county. This suggests the model may struggle with predicting positive cases in this county, but this is likely because there is little development change occurring in general. Specificity is high across the board, because there are a lot of existing developed areas and the model predicts an abundance of developed areas, where there was no change during the specified time period. Again, this is less of a concern because these estimates will be offset by sensitive land cover in Section 7 below. Will County has the highest Sensitivity level and the highest Accuracy rate of the three counties. Overall, these confusion matrix metrics help us to understand how well the model is performing in terms of the true positive and true negative rates, and they suggest the model is generalizable to those counties that underwent significant development change like Will County.


```{r 57, warning = FALSE, message = FALSE}
spatialCV_metrics <-
  spatialCV_counties %>% 
    group_by(county) %>% 
    summarize(Observed_Change = sum(as.numeric(as.character(class))),
              Sensitivity = round(yardstick::sens_vec(class,predClass),2),
              Specificity = round(yardstick::spec_vec(class,predClass),2),
              Accuracy = round(yardstick::accuracy_vec(class,predClass),2)) 

spatialCV_metrics %>%
  kable() %>%
  kable_styling(full_width = F)
```

# 5. Predicting Land Cover Demand for 2031

At this point, a simple but useful model has been trained to predict urban development between 2011 and 2021 as a function of baseline features from 2011 including land cover, built environment and population. Next, we are going to updated our features to reflect a 2021 baseline. Having done so, predictions from our new model would then be fore 2031.

Generalizability is always the concern when forecasting, and for this use case Planners must ask themselves whether the 2011-2021 3 IL County experience generalizes to the 2021-2031 3 IL County experience. In other words, have the macroeconomic real estate conditions changed dramatically between the two time periods? This is question with no definitive answer, but it useful to consider the exogenous factors that may differentiate today’s Chicago area from that of 2011. The big for instance is climate change. If the real estate market capitalized flood risk into devastated areas going forward, this would effectively change the nature of real estate demand in the region. Thus the pre-flood, 2011 experienced is no longer entirely relevant.

This would not completely invalidate the model if these changes are marginal, as development demand predictions can be adjusted in Section 7 below. However, consider the usefulness of this approach for a lakefront city that loses say 10% of its developable land to sea level rise in the following decade.

For brevity, we only update two features in our model. First, population change (`pop_change`) is updated using county level population projections visualized in the plot below. The second is `lagDevelopment`, which describes how predicted new development relates in space to old development.

Once the features are updated, 2031 predictions are estimated and mapped.

Below, `lagDevelopment` is mutate describing average distance to 2021 development. Note that the field name, `lagDevelopment` is unchanged (ie. not updated to `lagDevelopment_2021`). This is done purposefully as model6 has a regression coefficient called `lagDevelopment`. If this variable wasn’t present in our updated data frame then the `predict` command would fail.

```{r 58, warning = FALSE, message = FALSE}
dat <-
  dat %>%
  mutate(lagDevelopment = nn_function(xyC(.), xyC(filter(.,developed21 == 1)),2))
```

Now to update population change. A new data frame, `Pop_2031` is created which includes 2021 population counts and 2031 projections for each county in the study area. Population is plotted by year and by county. Chicago’s Cook County is projected to see the greatest population gains by far. Anecdotally, we know that much of Cook County is already developed, which suggests its development scenario will involve more ‘infill’ development then sprawl.

```{r 59, warning = FALSE, message = FALSE}
ThreeCounties_Pop_2021 <- get_acs(geography = "county",
                            variables = "B01003_001",
                            year = 2021,
                            state = "IL",
                            county = counties,
                            geometry = TRUE) %>%
  mutate(County = gsub(" County, Illinois", "", NAME)) %>%
  rename(CountyPop_2021 = estimate) %>%
  st_transform(st_crs(ThreeCounty_fishnet))

# Calculate the sum of CountyPop_2021 for all counties
all_counties_pop_2021 <- ThreeCounties_Pop_2021 %>%
  summarize(AllCounties_Pop2021 = sum(CountyPop_2021))

# Add the sum to ThreeCounties_Pop_2021 as a new column
ThreeCounties_Pop_2021 <- ThreeCounties_Pop_2021 %>%
  mutate(AllCounties_Pop2021 = all_counties_pop_2021$AllCounties_Pop2021)

# Need to include the population projections for 2031 for the three counties. 
#these are population in 2020 from IL Department of Health linked here: https://dph.illinois.gov/content/dam/soi/en/web/idph/files/publications/population-projections-report-2010-2030.pdf
# Code for joining and plotting 2021 population and 2031 population projection: 
# Perform the left join
Pop_2031 <- 
  data.frame(
    COUNTY_NAM = c("Cook", "DuPage", "Will"),
    pop_2021 = c(5265398, 934094, 696403),
    pop_2031 = c(4689134, 946910, 902476)
  )%>%
  mutate(Counties = toupper(COUNTY_NAM))

# Left join with ThreeCounties_Pop_2021
Pop_2031 <- left_join(
  Pop_2031,
  ThreeCounties_Pop_2021 %>%
    dplyr::select(County, CountyPop_2021, AllCounties_Pop2021) %>%
    st_set_geometry(NULL),
  by = c("COUNTY_NAM" = "County")
)

```

```{r 60, warning = FALSE, message = FALSE}
# Plotting 2021 and 2031 populations side by side
Pop_2031 %>%
  pivot_longer(cols = starts_with("pop_"), names_to = "Year", values_to = "Population") %>%
  ggplot(aes(x = reorder(COUNTY_NAM, -Population), y = Population, fill = Year)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), color = "black") +
  scale_fill_manual(values = palette2, labels = c("2021", "2031"), name = "Year") +
  labs(title = "Population Change by County: 2021 - 2031",
       x = "County", y = "Population") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme
```


Interestingly enough, Cook County is projected to have a decline in population from 2021 to 2031. This is largely due to outmigration in recent years with thousands of residents leaving the county to live elsewhere. The rise of remote work schedules and the costs of housing have driven people to move to new areas outside of the county, often in other states. For Will County, the population is projected to increase for 2031, likely in the areas closer to the Lake Michigan shoreline and transit routes. Population is projected to largely remain the same for DuPage County when comparing 2021 to 2031.


## 5.2. Predicting Development Demand

Next, the `Pop_2031` table is joined to `dat` and `pop_change` in order to ‘distribute’ the new population across the study area. To do so, the the allocation of new population is weighted by a grid cell’s existing population (`pop_2031.infill`). 2010 population is subtracted from this figure to get `pop_Change`. Finally, `Model6` is used to predict for 2020 given the updated population change and lag development features.

The map of predicted probabilities that results is best thought of as a measure of predicted development demand in 2031.

```{r 61, warning = FALSE, message = FALSE}
dat_infill <-
  dat %>%
  #calculate population change
    left_join(Pop_2031, by = c("COUNTY_NAM" = "Counties")) %>%
    mutate(proportion_of_county_pop = pop_2021 / AllCounties_Pop2021,
           pop_2031.infill = proportion_of_county_pop * pop_2031,
           pop_Change = round(pop_2031.infill - pop_2021),2) %>%
    dplyr::select(-pop_2031, -AllCounties_Pop2021, 
                  -proportion_of_county_pop, -pop_2031.infill) %>%
  #predict for 2031
    mutate(predict_2031.infill = predict(Model6,. , type="response"))

dat_infill %>%
  ggplot() +  
  geom_point(aes(x=xyC(dat_infill)[,1], y=xyC(dat_infill)[,2], colour = factor(ntile(predict_2031.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels=substr(quintileBreaks(dat_infill,"predict_2031.infill"),1,4),
                    name="Quintile\nBreaks") +
  geom_sf(data=ThreeCountyArea, fill=NA, colour="black", size=1) +
  labs(title= "Development Demand in 2031: Predicted Probabilities") +
  mapTheme

```

There are higher probabilities for development demand in Will County, in the south portion of the study area.


# 6. Comparing Predicted Development Demand & Environmental Sensitivity

We now have a really strong indicator of development demand for 2031 to help guide local land use planning. Demand however, is only one side of the equation. It must balanced with the supply of environmentally sensitive land. Understanding the interplay between demand and supply is the first stage of the ‘Allocation’ phase, where Planners ultimately decide which land should be developed and which should not.

For this analysis farmland and undeveloped land are be deemed `Suitable`, while environmentally sensitive areas like wetlands and forest are be deemed `Not Suitable`. Below, 2021 land cover data is read in and several measures of environmental sensitivity are created by county. These include:

1. The total amount of wetlands and forest land cover area in 2021.
2. The amount of sensitive land (wetland and forest) lost between 2011 and 2021.
3. The total area of large sensitive landscape ‘patches’ in 2021.

The third metric warrants some further discussion. In the context of leapfrog development, Section 2.6 discusses the concept of landscape fragmentation - the idea that discontinuous development across space carves out disjointed slivers of wilderness. This fragmentation reduces biodiversity particularly for species that need room to roam. Below, environmentally `sensitive_regions` are created to represent large areas of unfragmented natural resources. We then consider the total area of these clumps for each county.

## 6.2. 2021 Land Cover Data

To begin, the 2021 Land Cover data is read in and reclassified.

```{r 62, warning = FALSE, message = FALSE}
# We already did this. It's called ThreeCountyLC_2021

developed21 <- ThreeCountyLC_2021 == 21 | ThreeCountyLC_2021 == 22 | ThreeCountyLC_2021 == 23 | ThreeCountyLC_2021 == 24
forest21 <- ThreeCountyLC_2021 == 41 | ThreeCountyLC_2021 == 42 | ThreeCountyLC_2021 == 43 
farm21 <- ThreeCountyLC_2021 == 81 | ThreeCountyLC_2021 == 82 
wetlands21 <- ThreeCountyLC_2021 == 90 | ThreeCountyLC_2021 == 95 
otherUndeveloped21 <- ThreeCountyLC_2021 == 52 | ThreeCountyLC_2021 == 71 | ThreeCountyLC_2021 == 31 
water21 <- ThreeCountyLC_2021 == 11

names(developed21) <- "dev21"
names(forest21) <- "forest21"
names(farm21) <- "farm21"
names(wetlands21) <- "wetlands21"
names(otherUndeveloped21) <- "otherUndeveloped21"
names(water21) <- "water21"
```

This next step takes too long to plot because the rasters are large. But the code is there just in case.

```{r 62a, warning = FALSE, message = FALSE, eval=FALSE}
# ggplot() +
#   geom_sf(data=ThreeCountyArea) +
#   geom_raster(data = rbind(rast(ThreeCountyLC_2011) %>% mutate(label = "2011"),
#                            rast(ThreeCountyLC_2021) %>% mutate(label = "2021")) %>% 
#               na.omit %>% filter(value > 0), 
#               aes(x,y,fill=as.factor(value))) +
#   facet_wrap(~label) +
#   scale_fill_viridis(discrete=TRUE, name ="") +
#   labs(title = "Land Cover, 2011 & 2021") +
#   mapTheme #+ theme(legend.position = "none")
```


Next, each raster is aggregated to the fishnet using the `aggregateRaster` function and 2021 land cover types are mapped.

```{r 63, warning = FALSE, message = FALSE}
theRasterList21 <- c(developed21,forest21,farm21,wetlands21,otherUndeveloped21,water21)

dat2 <-
  aggregateRaster(theRasterList21, dat) %>%
  dplyr::select(dev21,forest21,farm21,wetlands21,otherUndeveloped21,water21) %>%
  st_set_geometry(NULL) %>%
  bind_cols(.,dat) %>%
  st_sf() %>%
  st_cast("POLYGON")

dat2 %>%
  gather(var,value,dev21:water21) %>%
  st_centroid() %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  ggplot() +
    geom_sf(data=ThreeCountyArea) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land Cover Types, 2021",
         subtitle = "As fishnet centroids") +
   mapTheme
```

Note that there are wetlands and forest sprinkled throughout the three county area, and there is a lot of farmland concentrated in the southwest portion in Will County.


## 6.3. Sensitive Land Cover Lost

Below an indicator `sensitive_lost` is created indicating grid cells that were either forest or wetlands in 2011 but were no longer so in 2021. The output layer, `sensitive_land_lost`, gives a sense for how development in the recent past has effected the natural environment.

```{r 64, warning = FALSE, message = FALSE, fig.height = 6, fig.width= 6}
dat2 <-
  dat2 %>%
   mutate(sensitive_lost21 = ifelse(forest == 1 & forest21 == 0 |
                                    wetlands == 1 & wetlands21 == 0,1,0))
                      
ggplot() +
  geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitive_lost21))) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","Sensitive Lost"),
                      name = "") +
  labs(title = "Sensitive lands lost: 2011 - 2021",
       subtitle = "As fishnet centroids") +
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme
```

This type of analysis is helpful to see where sensitive lands are being lost throughout the three county area due to development. Thankfully, there are only a small number of these lost areas from 2011-2021. If there were large areas lost, it would be helpful for planners to implement iniatives to protect those sensitive areas from development encroachment and allow them to continue providing environmental benefits. 


## 6.4 Landscape Fragmentation

In this section, the `wetlands21` and `forest21` rasters are converted to contiguous `sensitive_regions` using the `raster::clump` function. This is equivalent to Region Group in ArcGIS. The raster clumps are then converted to vector `sf` layers; dissolved into unique regions; Acres are calculated; and the layers are converted back to raster to be extracted back to the fishnet with `aggregateRaster`. It is worth going through this code block line by line. Note that only `sensitive_regions` with areas greater than 1 acre are included.

We ended up not including this section in our analysis.

```{r 65, warning = FALSE, message = FALSE, fig.height = 6, fig.width= 6}

#  emptyRaster <- Development_change
#  emptyRaster[] <- NA
# 
# sensitiveRegions <- 
#   raster::clump(wetlands21 + forest21) %>%
#   rasterToPolygons() %>%
#   st_as_sf() %>%
#   group_by(clumps) %>% 
#   summarize() %>%
#     mutate(Acres = as.numeric(st_area(.) * 0.0000229568)) %>%
#     filter(Acres > 3954)  %>%
#   dplyr::select() %>%
#   raster::rasterize(.,emptyRaster) 
# sensitiveRegions[sensitiveRegions > 0] <- 1  
# names(sensitiveRegions) <- "sensitiveRegions"
# 
# dat2 <-
#   aggregateRaster(c(sensitiveRegions), dat2) %>%
#   dplyr::select(sensitiveRegions) %>%
#   st_set_geometry(NULL) %>%
#   bind_cols(.,dat2) %>%
#   st_sf()
# 
# ggplot() +
#   geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitiveRegions))) +
#   scale_colour_manual(values = palette2,
#                       labels=c("Other","Sensitive Regions"),
#                       name="") +
#   labs(title = "Sensitive regions",
#        subtitle = "Continous areas of either wetlands or forests\ngreater than 1 acre") +
#   mapTheme
```

## 6.5. Summarize by County

The below `dplyr` statement takes as its input, `dat2`, which was created in Sections 6.2 - 6.4 and wrangles together a table of county-level, supply and demand metrics which can be used to analyze suitability by county.

```{r 66, warning = FALSE, message = FALSE}
county_specific_metrics <- 
  dat2 %>%
  #predict development demand from our model
  mutate(Development_Demand = predict(Model6, dat2, type="response")) %>%
  #get a count count of grid cells by county which we can use to calculate rates below
  left_join(st_set_geometry(dat, NULL) %>% group_by(COUNTY_NAM) %>% summarize(count = n())) %>%
  #calculate summary statistics by county
  group_by(COUNTY_NAM) %>%
  summarize(Total_Farmland = sum(farm21) / max(count),
            Total_Forest = sum(forest21) / max(count),
            Total_Wetlands = sum(wetlands21) / max(count),
            Total_Undeveloped = sum(otherUndeveloped21) / max(count),
            Sensitive_Land_Lost = sum(sensitive_lost21) / max(count),
            #Sensitive_Regions = sum(sensitiveRegions) / max(count),
            Mean_Development_Demand = mean(Development_Demand)) %>%
  #get population data by county
  left_join(Pop_2031, by = c("COUNTY_NAM" = "Counties")) %>% 
            mutate(Population_Change = pop_2031 - pop_2021,
                   Population_Change_Rate = Population_Change / pop_2031) %>%
            dplyr::select(COUNTY_NAM,Total_Farmland, Total_Forest, Total_Wetlands, Total_Undeveloped, Sensitive_Land_Lost, Mean_Development_Demand, Population_Change_Rate)

```


Now a small multiple plot can be created providing both supply and demand side analytics by county. The plot gives a sense for development demand (`Demand-Side`), suitable land for development (`Suitable`) and sensitive land (`Not Suitable`).

In Will County, south of Chicago, the data suggests both population and development demand will increase. At the same time, there is a high rate of developable farmland. Will County has a large amount of forest and wetlands areas, but otherwise it is well suitable to new development.

Conversely, DuPage County, the county east of Chicago and Cook County, has a significant amount of forest and wetland areas like Will County, but it has a slightly higher amount of sensitive lands lost. Comparatively, it has a lot less undeveloped land and farmland that is suitable for new development. There are very low rates of mean development demand and population growth. 

Cook County has the lowest amounts of sensitive lands but also low amounts of undeveloped areas and farmland as well. This is because Chicago and most of the county has already been developed. And the population is expected to continue to decline due to outmigration.

In these counties, there are some real trade-offs to be made between suitable/sensitive land and development pressure.

```{r 67, warning = FALSE, message = FALSE}
county_specific_metrics %>%
  gather(Variable, Value, -COUNTY_NAM, -geometry) %>%
  mutate(Variable = factor(Variable, levels=c("Population_Change_Rate","Mean_Development_Demand",
                                              "Total_Farmland","Total_Undeveloped","Total_Forest",
                                              "Total_Wetlands","Sensitive_Land_Lost",
                                              ordered = TRUE))) %>%
  mutate(Planning_Designation = case_when(
    Variable == "Population_Change_Rate" | Variable == "Mean_Development_Demand" ~ "Demand-Side",
    Variable == "Total_Farmland" | Variable == "Total_Undeveloped"               ~ "Suitable",
    TRUE                                                                         ~ "Not Suitable")) %>%
  ggplot(aes(x=Variable, y=Value, fill=Planning_Designation)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    facet_wrap(~COUNTY_NAM, ncol=5) +
    coord_flip() +
    scale_y_continuous(breaks = seq(-.25, 1, by = .25)) +
    geom_vline(xintercept = 2.5) + geom_vline(xintercept = 4.5) +
    scale_fill_manual(values=c("black","red","darkgreen")) +
    labs(title= "County Specific Allocation Metrics", subtitle= "As rates", x="Indicator", y="Rate") +
    plotTheme + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="bottom")
```


# 7. Allocation

Allocation is the final stage of the urban growth modeling process. Now that both demand and supply is understood, Planners can allocate development rights accordingly. Of course, this could take many forms of regulation including zoning, subdivision approval or outright conservation. In this section, demand and supply are visualized for Will County. 

First, development demand is predicted for Will County. Then a layer, `WillCounty_landUse` is created, that includes indicators for both previously developed land and environmentally unsuitable land. This layer then is overlayed atop development demand and projected population change to give the full supply and demand-side picture in Will County.

There are some clear opportunities for development in Will County. Significant infill opportunities exist along the southern boundary where population change is projected to be greatest. There is also a good deal of environmentally suitable land in the center of the county and closer to the Lake Michigan shoreline to the east. This would be ideal space for large housing or commercial developments. Will County has a lot of forests and wetlands in the southwest corner of the county, so there are sensitive land areas not suitable for development there.

<div class="superbigimage">
```{r 68, warning = FALSE, message = FALSE, fig.height= 8, fig.width= 11}
WillCounty <- dat2 %>%
  mutate(Development_Demand = predict(Model6, dat2, type = "response")) %>%
  filter(COUNTY_NAM == "WILL") 

WillCounty_landUse <- rbind(
  filter(WillCounty, forest21 == 1 | wetlands21 == 1) %>%
    dplyr::select() %>%
    mutate(Land_Use = "Not Suitable"),
  filter(WillCounty, dev21 == 1) %>%
    dplyr::select() %>%
    mutate(Land_Use = "Developed"))

# Calculate quantiles for Development Demand and Population Change
dev_demand_quantiles <- quantile(WillCounty$Development_Demand, probs = seq(0, 1, by = 0.25)) # Changed to 4 quantiles
pop_change_quantiles <- quantile(WillCounty$pop_Change, probs = seq(0, 1, by = 0.25)) # Calculate quantiles for pop_Change
 
grid.arrange(
  ggplot() +
    geom_sf(data = WillCounty, aes(fill = cut(Development_Demand, breaks = dev_demand_quantiles)), colour = NA) +
    geom_point(data = WillCounty_landUse, aes(x = xyC(WillCounty_landUse)[, 1], 
                                              y = xyC(WillCounty_landUse)[, 2], 
                                              colour = Land_Use),
               shape = 15, size = 2) +
    geom_sf(data = st_intersection(ThreeCounties_Highways, filter(ThreeCountyArea, COUNTY_NAM == "WILL")), size = 2, colour = "gray") +
    scale_fill_manual(values = palette5, name = "Development Demand",
                      labels = as.character(round(dev_demand_quantiles, digits = 2))) +
    scale_colour_manual(values = c("black", "red")) + 
    labs(title = "Development Potential, 2031: Will County") + mapTheme +
    guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)),

  ggplot() +
    geom_sf(data = WillCounty, aes(fill = cut(pop_Change, breaks = pop_change_quantiles)), colour = NA) +
    geom_point(data = WillCounty_landUse, aes(x = xyC(WillCounty_landUse)[, 1], 
                                              y = xyC(WillCounty_landUse)[, 2], 
                                              colour = Land_Use),
               shape = 15, size = 2) +
    geom_sf(data = st_intersection(ThreeCounties_Highways, filter(ThreeCountyArea, COUNTY_NAM == "WILL")), size = 2, colour = "gray") +
    scale_fill_manual(values = palette6, name = "Population Change",
                      labels = as.character(round(pop_change_quantiles, digits = 2))) +
    scale_colour_manual(values = c("black", "red")) + 
    labs(title = "Projected Population, 2031: Will County") + mapTheme +
    guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)), ncol = 2)

```
</div>
The plots above are created using a `ggplot` trick to show what appears to be overlayed polygons (fishnet grid cells). ggplot does not natively allow multiple aesthetics (`aes`) of the same style. In practice, this means it is not possible to have two `scale_fill_manual` parameters and thus, two legends for the same map. This limitation is cleverly avoided by plotting `WillCounty_landUse` as colored points (as opposed to filled grid cells). In the `geom_point` parameter above, the points are set to `shape = 15`, which is a filled box. This box can then be sized to make it appear like a fishnet grid cell.

We stop short in actually allocating land to development. While the model is well suited for understanding sprawl-style development, it is not useful for understanding how new demand might be absorbed by upzoning and densification of existing development. It would not be wise to allocate the entire projected population to undeveloped land. Instead, we’d prefer a more nuanced understanding of how local land use laws might play a role. At this stage in the analysis however, the Planner has all she needs to engage local stakeholders about future development decisions.


# 7.2 Scenario 2: Estimating the Effect of New Transportation

We created a new regional rail line to simulate the extension of the Metra rail system in the Chicago region. The rail extension expands the system southwest into Will County. You can see the existing system and the extension plotted. 

```{r, warning = FALSE, message = FALSE, results = "hide"}
MetraLineswExtension <-
  st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/04_24_24_UrbanGrowthModeling/Downloaded_Data/Metra_Lines/MetraLineswExtension.shp") %>%
  st_transform(st_crs(ThreeCountyArea)) %>%
  st_intersection(ThreeCountyArea)
```

```{r 70}
library(gridExtra)

# Adjust the widths parameter to control the size of each plot
grid.arrange(
  ggplot() +
    geom_sf(data=ThreeCountyArea, fill = "transparent") +
    geom_sf(data=ThreeCounties_RegRail, color = "red") +
    labs(title = "Existing Regional Rail Lines") +
    mapTheme,

  ggplot() +
    geom_sf(data=ThreeCountyArea, fill = "transparent") +
    geom_sf(data=MetraLineswExtension, color = "red") +
    labs(title = "Regional Rail Lines with Extension") +
    mapTheme,
  
  # Adjust widths as needed to control the size of each plot
  nrow = 1, 
  widths = c(2, 2)
)

```

Now we recreate the regional rail fishnet, using the shapefile that has the new rail extension and we calculate the distance to nearest rail line again.


```{r}
#Determine nearest regional rail line to each centroid
nearest_metra <- st_nearest_feature(centroid,MetraLineswExtension)

#Calcuate distance from each grid square centroid to nearest regional rail line
ThreeCounty_fishnet$metra_dist <- as.double(st_distance(centroid, MetraLineswExtension[nearest_metra,], by_element=TRUE))

Metra_fishnet <- ThreeCounty_fishnet #%>%
#  select(fishnetID, geometry, regrail_dist)

#Make a quick sample map of the results
ggplot()+
  geom_sf(data=Metra_fishnet,aes(fill=metra_dist),color='transparent')+
  scale_fill_viridis_c(name='Distance to Regional Rail Lines (feet)')+
  geom_sf(data=MetraLineswExtension,color='red')+
  labs(title = "Distance to Regional Rail Lines with Extension",
       subtitle = "Using fishnet centroids") +
  theme_void()
```

Next, we recreate the final dataset to include the new variable (regional rail with extension) also called metra_dist here. The other fishnets are combined here to include all of the previous variables.

```{r}
dat3 <-
  cbind(
    fishnet, fishnetPopulation, Highway_fishnet, RegRail_fishnet, Metra_fishnet, Downtown_fishnet, LakeMichigan_fishnet, aggregatedRasters) %>%
  dplyr::select(Dev_change, developed, forest, farm, wetlands, otherUndeveloped, water,
                Pop_2011, Pop_2021, pop_Change, highway_dist, regrail_dist, metra_dist, downtown_dist, lake_dist, lagDevelopment) %>%
  st_join(ThreeCountyArea) %>%
  mutate(developed21 = ifelse(Dev_change == 1 & developed == 1, 0, developed)) %>%
  filter(water == 0)
```

Exploratory analysis: 

In this section we explore the extent to which each features is associated with development change. If the goal was to predict a continuous variable, scatterplots and correlation coefficients make this process straightforward and relatively easy to explain to a non-technical decison maker.

In this case however, the dependent variable is a binary outcome - either a grid cell was developed between 2011 and 2021 or it wasn’t. In this case, the relevant question is whether for a given feature, there is a statistically significant difference between areas that changed and areas that did not. These differences are explored in a set of plots below. For models with lots of features, these plots could be compliment by a series of difference in means statistical tests.

The below code block `select`s the distance and spatial lag features, converts each to long form and plots each as bar plots. Note that `geom_bar` calculates the `mean`. 

```{r, warning = FALSE, message = FALSE}
dat3 %>%
  dplyr::select(pop_Change, highway_dist, regrail_dist, metra_dist, downtown_dist, lake_dist, lagDevelopment, Dev_change) %>%
  gather(Variable, Value, -Dev_change, -geometry) %>%
  ggplot(., aes(Dev_change, Value, fill=Dev_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New Development as a Function of the Continuous Variables") +
    plotTheme
```

The plot for metra_dist (distance to nearest regional rail line, including the new extension) does not look that different from the plot for regrail_dist (includes only existing regional rail lines). However, there is a slight difference between no change and new development for the metra_dist variable.


## 7.2.a Modeling

First, `dat3` is split into training and test sets. 

```{r, warning = FALSE, message = FALSE}
set.seed(3456)
trainIndex3 <- 
  createDataPartition(dat3$developed, p = .50,
                                  list = FALSE,
                                  times = 1)
dat3Train <- dat3[ trainIndex3,]
dat3Test  <- dat3[-trainIndex3,]

nrow(dat3)
```


`Model7` includes distance to the highways and all other variables including the new rail extension, and is the final model employed for prediction. I removed the regrail_dist because it is the existing rail lines without the new extension.

```{r}
Model1 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped, 
              family="binomial"(link="logit"), data = datTrain)

Model2 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment, 
              family="binomial"(link="logit"), data = datTrain)
              
Model3 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + Pop_2011, 
              family="binomial"(link="logit"), data = datTrain)          
              
Model4 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + Pop_2011 + 
              Pop_2021, 
              family="binomial"(link="logit"), data = datTrain)              
            
Model5 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change, 
              family="binomial"(link="logit"), data = datTrain)              
              
Model6 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + 
              highway_dist + downtown_dist + lake_dist + regrail_dist, 
              family="binomial"(link="logit"), data = datTrain) 

Model7 <- glm(Dev_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + 
              highway_dist + downtown_dist + lake_dist + metra_dist, 
              family="binomial"(link="logit"), data = dat3Train) 
```

Comparing models:

```{r, warning = FALSE, message = FALSE}
modelList <- paste0("Model", 1:7)
map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("Model",1:7)) %>%
  gather(Model,McFadden) %>%
  ggplot(aes(Model,McFadden)) +
    geom_bar(stat="identity") +
    labs(title= "McFadden R-Squared by Model") +
    plotTheme
```

Next, a data frame is created that includes columns for the observed development change, `Dev_change`, and one that includes predicted probabilities for `Model7`. This data frame is then used as an input to a density plot visualizing the distribution of predicted probabilities by observed class. Only a small number of predicted probabilities are greater than or equal to 50% `(nrow(filter(testSetProbs, probs >= .50)) / nrow(datTest))`. This makes good sense, given how rare of an event development is in our dataset. Ultimately, in order to judge our model with a confusion matrix, a smaller development classification threshold must be employed.

```{r, warning = FALSE, message = FALSE}
testSetProbs2 <- 
  data.frame(class = dat3Test$Dev_change,
             probs = predict(Model7, dat3Test, type="response")) 
  
ggplot(testSetProbs2, aes(probs)) +
  geom_density(aes(fill=class), alpha=0.5) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development")) +
  labs(title = "Histogram of test set predicted probabilities",
       x="Predicted Probabilities",y="Density") +
  plotTheme
```

Calculating Accuracy for this model:

```{r, warning = FALSE, message = FALSE}
options(yardstick.event_first = FALSE)

testSetProbs2 <- 
  testSetProbs2 %>% 
  mutate(predClass_05 = as.factor(ifelse(testSetProbs2$probs >= 0.05 ,1,0)),
         predClass_17 = as.factor(ifelse(testSetProbs2$probs >= 0.17 ,1,0))) 

testSetProbs2 %>%
  dplyr::select(-probs) %>%
  gather(Variable, Value, -class) %>%
  group_by(Variable) %>%
  summarize(Sensitivity = round(yardstick::sens_vec(class,factor(Value)),2),
            Specificity = round(yardstick::spec_vec(class,factor(Value)),2),
            Accuracy = round(yardstick::accuracy_vec(class,factor(Value)),2)) %>% 
  kable() %>%
  kable_styling(full_width = F)
```

The results are a bit similar to model 6, but the accuracy increased by 0.01 for each variable.

```{r, warning = FALSE, message = FALSE}
predsForMap2 <-         
  dat3 %>%
    mutate(probs = predict(Model7, dat3, type="response") ,
           Threshold_5_Pct = as.factor(ifelse(probs >= 0.05 ,1,0)),
           Threshold_17_Pct =  as.factor(ifelse(probs >= 0.17 ,1,0))) %>%
    dplyr::select(Dev_change,Threshold_5_Pct,Threshold_17_Pct) %>%
    gather(Variable,Value, -geometry) %>%
    st_cast("POLYGON")
```


<div class="superbigimage">
```{r, warning = FALSE, message= FALSE, fig.height = 6, fig.width= 8}
ggplot() +
  geom_point(data=predsForMap2, aes(x=xyC(predsForMap2)[,1], y=xyC(predsForMap2)[,2], colour=Value)) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("No Change","New Development"),
                      name="") +
  labs(title="Development Predictions - Low Threshold") + 
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme
```


Predicting Development Demand

Next, the `Pop_2031` table is joined to `dat3` and `pop_change` in order to ‘distribute’ the new population across the study area. To do so, the the allocation of new population is weighted by a grid cell’s existing population (`pop_2031.infill`). 2010 population is subtracted from this figure to get `pop_Change`. Finally, `Model7` is used to predict for 2031 given the updated population change and lag development features.

The map of predicted probabilities that results is best thought of as a measure of predicted development demand in 2031.

```{r, warning = FALSE, message = FALSE}
dat3_infill <-
  dat3 %>%
  #calculate population change
    left_join(Pop_2031, by = c("COUNTY_NAM" = "Counties")) %>%
    mutate(proportion_of_county_pop = pop_2021 / AllCounties_Pop2021,
           pop_2031.infill = proportion_of_county_pop * pop_2031,
           pop_Change = round(pop_2031.infill - pop_2021),2) %>%
    dplyr::select(-pop_2031, -AllCounties_Pop2021, 
                  -proportion_of_county_pop, -pop_2031.infill) %>%
  #predict for 2031
    mutate(predict_2031.infill = predict(Model7,. , type="response"))

# Calculate quintile breaks
quintile_breaks <- quintileBreaks(dat3_infill, "predict_2031.infill")

# Sort the breaks in ascending order
sorted_breaks <- sort(quintile_breaks)

dat3_infill %>%
  ggplot() +  
  geom_point(aes(x=xyC(dat3_infill)[,1], y=xyC(dat3_infill)[,2], colour = factor(ntile(predict_2031.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels = substr(sorted_breaks, 1, 4),
                    name="Quintile\nBreaks") +
  geom_sf(data=ThreeCountyArea, fill=NA, colour="black", size=1) +
  labs(title= "Development Demand in 2031: Predicted Probabilities") +
  mapTheme

```

Summarize by County

The below `dplyr` statement takes as its input, `dat3`, which wrangles together a table of county-level, supply and demand metrics which can be used to analyze suitability by county.

Next, each raster is aggregated to the fishnet using the `aggregateRaster` function and 2021 land cover types are mapped.

```{r, warning = FALSE, message = FALSE}
theRasterList21 <- c(developed21,forest21,farm21,wetlands21,otherUndeveloped21,water21)

dat4 <-
  aggregateRaster(theRasterList21, dat3) %>%
  dplyr::select(dev21,forest21,farm21,wetlands21,otherUndeveloped21,water21) %>%
  st_set_geometry(NULL) %>%
  bind_cols(.,dat) %>%
  st_sf() %>%
  st_cast("POLYGON")

dat4 <- dat4 %>%
  mutate(metra_dist = dat3$metra_dist)

dat4 %>%
  gather(var,value,dev21:water21) %>%
  st_centroid() %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  ggplot() +
    geom_sf(data=ThreeCountyArea) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land Cover Types, 2021",
         subtitle = "As fishnet centroids") +
   mapTheme
```


Below an indicator `sensitive_lost` is created indicating grid cells that were either forest or wetlands in 2011 but were no longer so in 2021. The output layer, `sensitive_land_lost`, gives a sense for how development in the recent past has effected the natural environment.

(We are repeating these steps from before, but with the updated final dataset (dat4).

```{r, warning = FALSE, message = FALSE, fig.height = 6, fig.width= 6}
dat4 <-
  dat4 %>%
   mutate(sensitive_lost21 = ifelse(forest == 1 & forest21 == 0 |
                                    wetlands == 1 & wetlands21 == 0,1,0))
                      
ggplot() +
  geom_point(data=dat4, aes(x=xyC(dat4)[,1], y=xyC(dat4)[,2], colour=as.factor(sensitive_lost21))) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","Sensitive Lost"),
                      name = "") +
  labs(title = "Sensitive lands lost: 2011 - 2021",
       subtitle = "As fishnet centroids") +
  geom_sf(data=ThreeCountyArea, fill = "transparent") +
  mapTheme
```

Next, we plot county-specific metrics.

```{r, warning = FALSE, message = FALSE}
county_specific_metrics_2 <- 
  dat4 %>%
  #predict development demand from our model
  mutate(Development_Demand = predict(Model7, dat4, type="response")) %>%
  #get a count count of grid cells by county which we can use to calculate rates below
  left_join(st_set_geometry(dat, NULL) %>% group_by(COUNTY_NAM) %>% summarize(count = n())) %>%
  #calculate summary statistics by county
  group_by(COUNTY_NAM) %>%
  summarize(Total_Farmland = sum(farm21) / max(count),
            Total_Forest = sum(forest21) / max(count),
            Total_Wetlands = sum(wetlands21) / max(count),
            Total_Undeveloped = sum(otherUndeveloped21) / max(count),
            Sensitive_Land_Lost = sum(sensitive_lost21) / max(count),
            #Sensitive_Regions = sum(sensitiveRegions) / max(count),
            Mean_Development_Demand = mean(Development_Demand)) %>%
  #get population data by county
  left_join(Pop_2031, by = c("COUNTY_NAM" = "Counties")) %>% 
            mutate(Population_Change = pop_2031 - pop_2021,
                   Population_Change_Rate = Population_Change / pop_2031) %>%
            dplyr::select(COUNTY_NAM,Total_Farmland, Total_Forest, Total_Wetlands, Total_Undeveloped, Sensitive_Land_Lost, Mean_Development_Demand, Population_Change_Rate)

```


Now a small multiple plot can be created providing both supply and demand side analytics by county. The plot gives a sense for development demand (`Demand-Side`), suitable land for development (`Suitable`) and sensitive land (`Not Suitable`).

This plot is similar to the results from the previous model, but now we can see that development demand has increased for Will County. Mean_Development_Demand increased slightly for Will County because we used Model7 and the Metra line extension variable, compared to Model6 which used only existing Metra rail lines. Adding the regional rail line extension into Will County would likely contribute to an increase in development demand for areas near the transit route.

```{r, warning = FALSE, message = FALSE}
county_specific_metrics_2 %>%
  gather(Variable, Value, -COUNTY_NAM, -geometry) %>%
  mutate(Variable = factor(Variable, levels=c("Population_Change_Rate","Mean_Development_Demand",
                                              "Total_Farmland","Total_Undeveloped","Total_Forest",
                                              "Total_Wetlands","Sensitive_Land_Lost",
                                              ordered = TRUE))) %>%
  mutate(Planning_Designation = case_when(
    Variable == "Population_Change_Rate" | Variable == "Mean_Development_Demand" ~ "Demand-Side",
    Variable == "Total_Farmland" | Variable == "Total_Undeveloped"               ~ "Suitable",
    TRUE                                                                         ~ "Not Suitable")) %>%
  ggplot(aes(x=Variable, y=Value, fill=Planning_Designation)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    facet_wrap(~COUNTY_NAM, ncol=5) +
    coord_flip() +
    scale_y_continuous(breaks = seq(-.25, 1, by = .25)) +
    geom_vline(xintercept = 2.5) + geom_vline(xintercept = 4.5) +
    scale_fill_manual(values=c("black","red","darkgreen")) +
    labs(title= "County Specific Allocation Metrics", subtitle= "As rates", x="Indicator", y="Rate") +
    plotTheme + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="bottom")
```



# 8. Appendix

_This is new material describing how you can determine land cover change from land cover rasters for two time periods_

## 8.1. Calculating Land Cover Change

For your assignment, you are going to need to get land cover data for two new time periods and figure out what areas developed in that interval (and then model it). In the workflow above, we used a the NLCD land cover change data set, but we could also have calculated our own version using the 2001 and 2011 data sets. You could do this in R, or ArcGIS for your assignment. Here is an abbreviated workflow for doing it in R using the data from this exercise:

Reclassify 2001 and 2011 land cover databases to consist of 1 and 0 observations (e.g. 1 is the developed classes 13-24, 0 is everything else). 

```{r}
# reclassMatrix <- 
#   matrix(c(
#     0,12,0,
#     12,24,1,
#     24,Inf,0),
#   ncol=3, byrow=T)
```

```{r, warning = FALSE, message = FALSE}
# developed_2001 <- 
#   reclassify(lc_2001,reclassMatrix)
# 
# developed_2011 <- 
#   reclassify(lc_2011,reclassMatrix)

```

Then do some map algebra to find the places where land cover changed. Let's see a quick histogram of the values - these should range from 0 (undeveloped in 2001, undeveloped in 2011), 1 (undeveloped in 2001, developed in 2011 (presuming nothing went from developed to undeveloped)), and 2 (developed in both periods). The 1's represent the change.

```{r, warning = FALSE, message = FALSE}
# 
# development_change <- developed_2001+developed_2011
# 
# hist(development_change)
```

We can subsequently turn any of the 0's and 1's to NA

```{r, warning = FALSE, message = FALSE}
# development_change[development_change != 1] <- NA
# 
# ggplot() +
#   geom_sf(data=houstonMSA) +
#   geom_raster(data=rast(development_change) %>% na.omit, 
#               aes(x,y,fill=as.factor(value))) +
#   scale_fill_viridis(discrete=TRUE, name ="Land Cover\nChange") + 
#   labs(title="Development land use change") +
#   mapTheme
```

## 8.2. Downsampling Rasters

Notice that we used 4000x4000 unit grid cells in this analysis to keep small grid cell sizes from crushing our laptops while we did this plotting and geo-processing. This is very simple to do in R - the below code takes our `development_change` raster and downsamples it by a factor of two using the `aggregate` function. You could load an original data set in at the beginning of your analysis and downsample it before you get started.

```{r, warning = FALSE}
# 
# development_change
# 
# aggregate(development_change, fact = 2)

```

## 8.3. Cropping Rasters
Say you have rasters that you would like to manipulate in R instead of in ArcGIS. If you have data covering the study area, you can use this as the extent to which you would like to crop the data, and then use `mask` to clip the data to the exact boundaries.

First, read in the resampled raster of land cover around Atlanta, Georgia from 2001 and then plot it.

```{r message=FALSE, warning=FALSE}
# lc_atl_2001 <- raster("https://raw.githubusercontent.com/mafichman/CPLN_675/main/Week_14_15/data/atl_lc01_resamp_new.tif")
# 
# plot(lc_atl_2001)
```


We will also read in Atlanta counties for the extent of the bounding box, and then `crop` the raster to the counties.

```{r message=FALSE, warning=FALSE}
# atl_counties <- st_read("https://raw.githubusercontent.com/mafichman/CPLN_675/main/Week_14_15/data/Counties_Atlanta_Region.geojson") %>% st_transform("ESRI:102667") # ESRI 1983 state plane GA west
# 
# lc_atl_2001_crop <- crop(lc_atl_2001, extent(atl_counties))
# 
# plot(lc_atl_2001_crop)
```


Then, `mask` the raster using the Atlanta are counties. This is similar to the "clipping" process for vector data.

```{r message=FALSE, warning=FALSE}
# lc_atl_2001_mask <- mask(lc_atl_2001_crop, atl_counties)
# plot(lc_atl_2001_mask)
```


## 8.4 Updated Census Data Calls

In section 2.4, census data was pulled using the `get_decennial`; however, if you are using a different timeframe, you you will use data from the American Community Survey (ACS). You will replace `get_decennial` with `get_acs` in your project's workflow. The code chunk below shows how to use `get_acs` to obtain population data from the counties in the Houston MSA in 2019. Note that `get_acs` returns both an estimate (denoted with an "E") and a margin of error (denoted with an "M"). We use the `select` command in `dplyr` to only retain estimate version of the variable.


```{r, warning = FALSE, message = FALSE, results = "hide"}
# # Specify which variable(s) you would like to grab. Here, only one (Total Population) is listed, but you could add more to the call.
# acs_vars <- c("B02001_001E")
# 
# # Using "tract" as the geography and 2019 as the year, download data data for the Houston MSA counties listed.
# houstonPop19 <- get_acs(geography = "tract", 
#                         variables = acs_vars, 
#                         year = 2019,
#                         state = 48, 
#                         geometry = TRUE, 
#                         output = "wide",
#                         county=c("Harris COunty","San Jacinto","Montgomery","Liberty","Waller",
#                          "Austin","Chambers","Fort Bend","Brazoria","Galveston")) %>%
#                 rename(pop2019 = B02001_001E) %>%
#                 dplyr::select(-starts_with("B"))
# 
# # Make sure to transform to the crs of the fishnet!
# houstonPop19 <- houstonPop19 %>%
#   st_transform(st_crs(houstonMSA_fishnet))

```
