Predicting Residential Housing Activity

Introduction

New construction permit issued by government is an early indicator of housing market activity. While housing market is generally considered as one of the first economic sectors to rise or fall when economic conditions improve or degrade, New Construction Permit count becomes a crucial parameter in guaging economic condition. Since new residential housing construction is directly related to housing needs from the residents, it is meaningful to explore the trend and underlying reasons of such trend behind new construction permits. This is relevant to policy discussion concerneing gentrification, for new residential development brings new investsment from richer people yet sometimes eviction and displacement for the long-time residents and poor locals. This project focuses on predicting new residential construction permits in Philadelphia, a typical city recognized as being gentrified in the recent decade by the major media and social science researches. In Philadelphia, various neighborhoods are being shaken as newcomers move in. It is thus of great interest and an imperative goal for government to tackle the problems related to gentrification.

To predict for new construction permit, we created a series of Time/Space models that capture dynamics using both space and time measures. These models explictly accounted for time by controlling for a host of quarterly fixed effects; accounted for space by including engineered spatial lag features.The two hypothesis are posted here: 1) the new construction permit counts at this time is in part a function of values at previous time periods; 2) the new construction permit counts at this place is in part a function of values at its neighborhood. The main use case here is philadelphia local government, who could use this modeling approach to closely monitor neighborhood with high new construction activities; to allocate tangible resources such as affordable housing and provide housing assistance; and to design or improve a range of neighborhood revitalization programs.

For a multimedia presentation of this project, please go visit https://www.youtube.com/watch?v=5K8lWgJNzD8.

Setup & Styles

library(tidyverse)
library(tidycensus)
library(sf)
library(FNN)
library(jsonlite)
library(knitr)
library(kableExtra)
library(censusapi)
library(dplyr)
library(RSocrata)
library(units)
library(tmap)
library(ggplot2)
library(maptools)
library(rgdal)
library(raster)
library(rgeos)
library(lubridate)
library(tigris)
library(gganimate)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(sp)

options(scipen=999)
options(tigris_class = "sf")

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text(size = 12, color = "black", family="San Serif"),
    plot.title = element_text(size = 13,colour = "black", family="Sans Serif"),
    plot.subtitle=element_text(size = 12, face="italic", family="Serif"),
    plot.caption=element_text(size = 10, hjust=0, family = "Serif"),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "lightgrey", fill=NA, size= 0)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black", family="San Serif"),
    plot.title = element_text(size = 13, colour = "black", family="Sans Serif"),
    plot.subtitle = element_text(size = 12, face="italic", family="Serif"),
    plot.caption = element_text(size = 10, hjust=0, family = "Serif"),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey50", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "grey", fill=NA, size= 0),
    strip.background = element_rect(fill = "grey70", color = "white"),
    strip.text = element_text(size=11),
    axis.title = element_text(size=11),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(size = 8, colour = "grey", face = "italic"),
    legend.text = element_text(size = 9, colour = "grey", face = "italic"),
    strip.text.x = element_text(size = 11)
  )
}

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                 c(.01,.2,.4,.6,.8), na.rm=T)
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

palette2 <- c("#FF6347","#3182bd")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette5.1 <- c("#bdd7e7","#bdd7e7","#6baed6","#3182bd","#08519c")
palette5.2 <- c("#bdd7e7","#bdd7e7","#bdd7e7","#3182bd","#08519c")
palette5.4 <- c("#bdd7e7","#bdd7e7","#bdd7e7","#bdd7e7","#08519c")

Data Wrangling

We prepared data mainly by using sources on Open Data Philly¹, PASDA², and US Census Bureau³. The new construction permits dataset was extracted from Licenses & Inspection Building and Zoning Permit Department⁴. As socioeconomic data is anticipated to be indicative in real estate industry, we researched for a few of them, including distance to and counts of parks, SEPTA bus stops, and high quality healthy food stores⁵, and numerous demographic ones from Census Bureau. In addition, Tidycensus data⁶ was crucial in defining the spatial context in terms of geometry and projection. To capture the temporal context, we would introduce later a reasonable time unit–quarter– to observe data. As a result, we assembled all datasets into a complete panel of observation that includes every possible space-time combination.

Here are some more details about the specific fata we were dealing with.

The new construction permit data in Philadelphia from year 2010 to 2017 was chosen, as we were interested in activities of the mosst recent decade. To characterize “residential housing construction activity”, we actively selected the new construction permits data input that were marked as “completed” and “residential” type. This was to observe only the successful implementation of the related construction plans in residential setting. In other words, wen ensured that the result housing was actively recruiting newcomers to settle in. For the temporal context, quarter based on the issue date of each such new construction permit so as to explore possible quarterly trend. We tried month before the decision of using quarter. Yet due to the lack of enough data points in the former, the later made more sense in the temporal part of the model.

#new construction for all data 2010-2017 
newconstruction_all <-read.csv("C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Final/philly4_final_2_joinR.csv")
newconstruction_all <-
  newconstruction_all %>%
  subset(newconstruction_all$year<=2017) 
newconstruction_all <-
  newconstruction_all %>%
  subset(newconstruction_all$year >=2010)


#Create the quarter
newconstruction_all$quarter <- ifelse(newconstruction_all$month< 4, "1", 
                                            ifelse(newconstruction_all$month < 7,"2", 
                                                   ifelse(newconstruction_all$month < 10, "3", "4")))

# Convert "quarter" variable into numeric
newconstruction_all$quarter <- as.numeric(newconstruction_all$quarter)

Hypothesizing that multiple socioeconomic factors play a role in construction interests, we applied census data fromn years in correspondance to those of the new construction permit data. Since gentrification is often seen in an area where experiences an increase in home values and income, and the number of highly educated residents, we decided to take a broad look at variables across years that are related to population, median household income, median household value, population growth, education attainment of bachelor degree, number of rich and poor renters, percentage of vancancy lands, etc. We also added the data concerning the number of parks, bus stops and healthy food stores to each census tract by means of feature engineering. Below, we visualized the location of all SEPTA stops in Greater Philadelphia (blue dots) and Parks(green dots) to cross check the data validity.

install.packages('curl', repos = "http://cran.us.r-project.org")
census_api_key("f2f4e556fa5b8c5b288989bff2ea03483661218f", overwrite = TRUE)
install = TRUE

var2010 <- load_variables(2010, "acs5", cache = TRUE)
var2010

tracts2010 <- get_acs(geography = "tract", 
                      variable = c("B19013_001",
                                   "B25118_022",
                                   "B25106_006",
                                   "B25003_003",
                                   "B06012_004",
                                   "B25077_001", 
                                   "B16010_041",
                                   "B01003_001",
                                   "B25002_003",
                                   "B25002_002"),
                      year=2011, state = 42, county = 101, geometry = TRUE) %>%
  st_transform(102728)
table(tracts2010$variable)

tracts2010 <- 
  tracts2010 %>%
  dplyr::select(-moe, -NAME) %>%
  spread(variable, estimate)

tracts2010 <- 
  tracts2010 %>%
  rename(nfMHI= B19013_001,
         ctRichRenter= B25118_022,
         ctPoorRenter= B25106_006,
         ctRenterHH= B25003_003,
         ctPoverty150= B06012_004 ,
         nfMHV= B25077_001 , 
         ctBach= B16010_041 ,
         ctpop= B01003_001,
         ctvacant= B25002_003,
         ctoccupy= B25002_002)%>%
  mutate(pctBach= ctBach/ctpop,
         pctPoverty150=ctPoverty150/ctpop,
         pctRichRenter=ctRichRenter/ctRenterHH,
         pctPoorRenter=ctPoorRenter/ctRenterHH,
         MHI=1.18*nfMHI,
         MHV=1.18*nfMHV,
         pctvacant=ctvacant/(ctvacant+ctoccupy))
# "B25118_022": TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS:   Estimate!!Total!!Renter occupied!!$50,000 to $74,999
#"B25106_006": Less than $20,000!!30 percent or more
# "B06012_004": Estimate!!Total!!At or above 150 percent of the poverty level
# NOt available"B07413_003: Estimate!!Total living in area 1 year ago!!Same house 1 year ago!!Householder lived in renter housing

#2011 data>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tracts2011 <- get_acs(geography = "tract", 
                      variable = c("B19013_001",
                                   "B25118_022",
                                   "B25106_006",
                                   "B25003_003",
                                   "B06012_004",
                                   "B25077_001", 
                                   "B16010_041",
                                   "B01003_001",
                                   "B25002_003",
                                   "B25002_002"),
                      year=2011, state = 42, county = 101, geometry = TRUE) %>%
  st_transform(102728)
table(tracts2011$variable)

tracts2011 <- 
  tracts2011 %>%
  dplyr::select(-moe, -NAME) %>%
  spread(variable, estimate)

tracts2011 <- 
  tracts2011 %>%
  rename(nfMHI= B19013_001,
         ctRichRenter= B25118_022,
         ctPoorRenter= B25106_006,
         ctRenterHH= B25003_003,
         ctPoverty150= B06012_004 ,
         nfMHV= B25077_001 , 
         ctBach= B16010_041 ,
         ctpop= B01003_001,
         ctvacant= B25002_003,
         ctoccupy= B25002_002)%>%
  mutate(pctBach= ctBach/ctpop,
         pctPoverty150=ctPoverty150/ctpop,
         pctRichRenter=ctRichRenter/ctRenterHH,
         pctPoorRenter=ctPoorRenter/ctRenterHH,
         MHI=1.14*nfMHI,
         MHV=1.14*nfMHV,
         pctvacant=ctvacant/(ctvacant+ctoccupy))

#2012>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tracts2012 <- get_acs(geography = "tract", 
                      variable = c("B19013_001",
                                   "B25118_022",
                                   "B25106_006",
                                   "B25003_003",
                                   "B06012_004",
                                   "B25077_001", 
                                   "B16010_041",
                                   "B01003_001",
                                   "B25002_003",
                                   "B25002_002"),
                      year=2012, state = 42, county = 101, geometry = TRUE) %>%
  st_transform(102728)
table(tracts2012$variable)

tracts2012 <- 
  tracts2012 %>%
  dplyr::select(-moe, -NAME) %>%
  spread(variable, estimate)

tracts2012 <- 
  tracts2012 %>%
  rename(nfMHI= B19013_001,
         ctRichRenter= B25118_022,
         ctPoorRenter= B25106_006,
         ctRenterHH= B25003_003,
         ctPoverty150= B06012_004 ,
         nfMHV= B25077_001 , 
         ctBach= B16010_041 ,
         ctpop= B01003_001,
         ctvacant= B25002_003,
         ctoccupy= B25002_002)%>%
  mutate(pctBach= ctBach/ctpop,
         pctPoverty150=ctPoverty150/ctpop,
         pctRichRenter=ctRichRenter/ctRenterHH,
         pctPoorRenter=ctPoorRenter/ctRenterHH,
         MHI=1.10*nfMHI,
         MHV=1.10*nfMHV,
         pctvacant=ctvacant/(ctvacant+ctoccupy))


#2013>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tracts2013 <- get_acs(geography = "tract", 
                      variable = c("B19013_001",
                                   "B25118_022",
                                   "B25106_006",
                                   "B25003_003",
                                   "B06012_004",
                                   "B25077_001", 
                                   "B16010_041",
                                   "B01003_001",
                                   "B25002_003",
                                   "B25002_002"),
                      year=2013, state = 42, county = 101, geometry = TRUE) %>%
  st_transform(102728)
table(tracts2012$variable)

tracts2013 <- 
  tracts2013 %>%
  dplyr::select(-moe, -NAME) %>%
  spread(variable, estimate)

tracts2013 <- 
  tracts2013 %>%
  rename(nfMHI= B19013_001,
         ctRichRenter= B25118_022,
         ctPoorRenter= B25106_006,
         ctRenterHH= B25003_003,
         ctPoverty150= B06012_004 ,
         nfMHV= B25077_001 , 
         ctBach= B16010_041 ,
         ctpop= B01003_001,
         ctvacant= B25002_003,
         ctoccupy= B25002_002)%>%
  mutate(pctBach= ctBach/ctpop,
         pctPoverty150=ctPoverty150/ctpop,
         pctRichRenter=ctRichRenter/ctRenterHH,
         pctPoorRenter=ctPoorRenter/ctRenterHH,
         MHI=1.10*nfMHI,
         MHV=1.10*nfMHV,
         pctvacant=ctvacant/(ctvacant+ctoccupy))

#2014>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tracts2014 <- get_acs(geography = "tract", 
                      variable = c("B19013_001",
                                   "B25118_022",
                                   "B25106_006",
                                   "B25003_003",
                                   "B06012_004",
                                   "B25077_001", 
                                   "B16010_041",
                                   "B01003_001",
                                   "B25002_003",
                                   "B25002_002"),
                      year=2014, state = 42, county = 101, geometry = TRUE) %>%
  st_transform(102728)
table(tracts2014$variable)

tracts2014 <- 
  tracts2014 %>%
  dplyr::select(-moe, -NAME) %>%
  spread(variable, estimate)

tracts2014 <- 
  tracts2014 %>%
  rename(nfMHI= B19013_001,
         ctRichRenter= B25118_022,
         ctPoorRenter= B25106_006,
         ctRenterHH= B25003_003,
         ctPoverty150= B06012_004 ,
         nfMHV= B25077_001 , 
         ctBach= B16010_041 ,
         ctpop= B01003_001,
         ctvacant= B25002_003,
         ctoccupy= B25002_002)%>%
  mutate(pctBach= ctBach/ctpop,
         pctPoverty150=ctPoverty150/ctpop,
         pctRichRenter=ctRichRenter/ctRenterHH,
         pctPoorRenter=ctPoorRenter/ctRenterHH,
         MHI=1.09*nfMHI,
         MHV=1.09*nfMHV,
         pctvacant=ctvacant/(ctvacant+ctoccupy))

#2015>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tracts2015 <- get_acs(geography = "tract", 
                      variable = c("B19013_001",
                                   "B25118_022",
                                   "B25106_006",
                                   "B25003_003",
                                   "B06012_004",
                                   "B25077_001", 
                                   "B16010_041",
                                   "B01003_001",
                                   "B25002_003",
                                   "B25002_002"),
                      year=2015, state = 42, county = 101, geometry = TRUE) %>%
  st_transform(102728)
table(tracts2015$variable)

tracts2015 <- 
  tracts2015 %>%
  dplyr::select(-moe, -NAME) %>%
  spread(variable, estimate)

tracts2015 <- 
  tracts2015 %>%
  rename(nfMHI= B19013_001,
         ctRichRenter= B25118_022,
         ctPoorRenter= B25106_006,
         ctRenterHH= B25003_003,
         ctPoverty150= B06012_004 ,
         nfMHV= B25077_001 , 
         ctBach= B16010_041 ,
         ctpop= B01003_001,
         ctvacant= B25002_003,
         ctoccupy= B25002_002)%>%
  mutate(pctBach= ctBach/ctpop,
         pctPoverty150=ctPoverty150/ctpop,
         pctRichRenter=ctRichRenter/ctRenterHH,
         pctPoorRenter=ctPoorRenter/ctRenterHH,
         MHI=1.09*nfMHI,
         MHV=1.09*nfMHV,
         pctvacant=ctvacant/(ctvacant+ctoccupy))

#2016>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tracts2016 <- get_acs(geography = "tract", 
                      variable = c("B19013_001",
                                   "B25118_022",
                                   "B25106_006",
                                   "B25003_003",
                                   "B06012_004",
                                   "B25077_001", 
                                   "B16010_041",
                                   "B01003_001",
                                   "B25002_003",
                                   "B25002_002"),
                      year=2012, state = 42, county = 101, geometry = TRUE) %>%
  st_transform(102728)
table(tracts2016$variable)

tracts2016 <- 
  tracts2016 %>%
  dplyr::select(-moe, -NAME) %>%
  spread(variable, estimate)

tracts2016 <- 
  tracts2016 %>%
  rename(nfMHI= B19013_001,
         ctRichRenter= B25118_022,
         ctPoorRenter= B25106_006,
         ctRenterHH= B25003_003,
         ctPoverty150= B06012_004 ,
         nfMHV= B25077_001 , 
         ctBach= B16010_041 ,
         ctpop= B01003_001,
         ctvacant= B25002_003,
         ctoccupy= B25002_002)%>%
  mutate(pctBach= ctBach/ctpop,
         pctPoverty150=ctPoverty150/ctpop,
         pctRichRenter=ctRichRenter/ctRenterHH,
         pctPoorRenter=ctPoorRenter/ctRenterHH,
         MHI=1.07*nfMHI,
         MHV=1.07*nfMHV,
         pctvacant=ctvacant/(ctvacant+ctoccupy))

#2017>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tracts2017 <- get_acs(geography = "tract", 
                      variable = c("B19013_001",
                                   "B25118_022",
                                   "B25106_006",
                                   "B25003_003",
                                   "B06012_004",
                                   "B25077_001", 
                                   "B16010_041",
                                   "B01003_001",
                                   "B25002_003",
                                   "B25002_002"),
                      year=2012, state = 42, county = 101, geometry = TRUE) %>%
  st_transform(102728)
table(tracts2017$variable)

tracts2017 <- 
  tracts2017 %>%
  dplyr::select(-moe, -NAME) %>%
  spread(variable, estimate)

tracts2017 <- 
  tracts2017 %>%
  rename(nfMHI= B19013_001,
         ctRichRenter= B25118_022,
         ctPoorRenter= B25106_006,
         ctRenterHH= B25003_003,
         ctPoverty150= B06012_004 ,
         nfMHV= B25077_001 , 
         ctBach= B16010_041 ,
         ctpop= B01003_001,
         ctvacant= B25002_003,
         ctoccupy= B25002_002)%>%
  mutate(pctBach= ctBach/ctpop,
         pctPoverty150=ctPoverty150/ctpop,
         pctRichRenter=ctRichRenter/ctRenterHH,
         pctPoorRenter=ctPoorRenter/ctRenterHH,
         MHI=1.05*nfMHI,
         MHV=1.05*nfMHV,
         pctvacant=ctvacant/(ctvacant+ctoccupy))

tracts2010$year<-2010
tracts2011$year<-2011
tracts2012$year<-2012
tracts2013$year<-2013
tracts2014$year<-2014
tracts2015$year<-2015
tracts2016$year<-2016
tracts2017$year<-2017

#Other variables associated with the census tracts>>>>>>>>
#count how many SEPTA stop there are in any census tract in Philly 
stops <-read.csv("C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Final/SEPTAStopsByLineSpring2016.csv")
stops <- stops[!duplicated(stops[c("STOPID", "STOPNAME")]),]
require(rgdal)
censustracts <- readOGR(dsn="C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Final/Philly_Censustract", 
                        layer="Census_Tracts_2010")

censustracts_only <-censustracts
require(maptools)
require(ggplot2)

censustracts <- fortify(censustracts, region="GEOID10")
coords <- stops[c("LON", "LAT")]
coords <- coords[complete.cases(coords),]

library(sp)
sp <- SpatialPoints(coords)
proj4string(sp) <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
proj4string(sp)

by_tract_stops <- over(sp, censustracts_only)
kable(head(by_tract_stops, 5))

require(dplyr)
by_tract_stops <- by_tract_stops %>%
  group_by(GEOID10) %>%
  summarise(total=n())

by_tract_stops <- by_tract_stops[!is.na(by_tract_stops$GEOID10),]
kable(head(by_tract_stops,5))

colnames(by_tract_stops) <- c("GEOID", "ctstops")
by_tract_stops$GEOID <- as.character(by_tract_stops$GEOID) #now the GEOID is character
censustracts <- censustracts %>%
  rename(GEOID="id")

#count how many parks in certain census tract
parks <-read.csv("C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Final/PhillyFPC_PPR_parks_crs.csv")
require(rgdal)
censustracts1 <- readOGR(dsn="C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Final/Philly_Censustract", 
                         layer="Census_Tracts_2010")
censustracts1_only <-censustracts1
require(maptools)
require(ggplot2)

censustracts1 <- fortify(censustracts1, region="GEOID10")
coords1 <- parks[c("LNG", "LAT")]
coords1 <- coords1[complete.cases(coords1),]

library(sp)
sp1 <- SpatialPoints(coords1)
proj4string(sp1) <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
proj4string(sp1)

plot(censustracts1_only)
plot(sp, col= "lightblue", pch = 1, cex = 1, main = "SEPTA stops", add=TRUE)

plot(censustracts1_only)
plot(sp1, col= "darkgreen", pch = 1, size = 1,main = "Parks", add=TRUE)

by_tract_park <- over(sp1, censustracts1_only)
kable(head(by_tract_park, 5))

require(dplyr)
by_tract_park <- by_tract_park %>%
  group_by(GEOID10) %>%
  summarise(total=n())

by_tract_park <- by_tract_park[!is.na(by_tract_park$GEOID10),]
kable(head(by_tract_park,5))

colnames(by_tract_park) <- c("GEOID", "ctparks")
by_tract_park$GEOID <- as.character(by_tract_park$GEOID) #now the GEOID is character
#censustracts <- censustracts %>%
#  rename(GEOID="id")

#count of healthy food high produce stores 
by_tract_food <-
  read.csv("C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Final/FoodRetail_HPSS_final.csv")
by_tract_food$GEOID <-as.character(by_tract_food$GEOID10)


#join all the essential elements: 
Join_map1 <- left_join(censustracts,by_tract_stops)
Join_map2 <- left_join(Join_map1, by_tract_park)
Join_map3 <- left_join(Join_map2, by_tract_food)
write.csv(Join_map3, file="Join_map3.csv")

Join_map3_new <- read.csv("C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Final/Join_map3_new.csv")
# Join_map3_new= GEOID, ctstops, ctparks, ctfood; 384 obs

Join_map3_new$ctparks[is.na(Join_map3_new$ctparks)] <- 0

Join_map3_final <-
  Join_map3_new[!duplicated(Join_map3_new[c("GEOID","ctstops")]),]
Join_map3_final$GEOID <-as.character(Join_map3_final$GEOID)
#this is the final part of the data collection!!!


#Join 2010 with the other data hooked to the census tract
tracts2010$GEOID <-as.character(tracts2010$GEOID)
Join_2010_map3 <-left_join(Join_map3_final, tracts2010)

#Join 2011 with the other data hooked to the census tract
tracts2011$GEOID <-as.character(tracts2011$GEOID)
Join_2011_map3 <-left_join(Join_map3_final, tracts2011)

#Join 2012 with the other data hooked to the census tract
tracts2012$GEOID <-as.character(tracts2012$GEOID)
Join_2012_map3 <-left_join(Join_map3_final, tracts2012)

#Join 2013 with the other data hooked to the census tract
Join_2013_map3 <-left_join(Join_map3_final, tracts2013)
tracts2013$GEOID <-as.character(tracts2013$GEOID)

#Join 2014 with the other data hooked to the census tract
tracts2014$GEOID <-as.character(tracts2014$GEOID)
Join_2014_map3 <-left_join(Join_map3_final, tracts2014)

#Join 2015 with the other data hooked to the census tract
tracts2015$GEOID <-as.character(tracts2015$GEOID)
Join_2015_map3 <-left_join(Join_map3_final, tracts2015)

#Join 2016 with the other data hooked to the census tract
tracts2016$GEOID <-as.character(tracts2016$GEOID)
Join_2016_map3 <-left_join(Join_map3_final, tracts2016)

#Join 2017 with the other data hooked to the census tract
tracts2017$GEOID <-as.character(tracts2017$GEOID)
Join_2017_map3 <-left_join(Join_map3_final, tracts2017)

tracts_all_full <- rbind(Join_2017_map3,Join_2016_map3, Join_2015_map3, Join_2014_map3, Join_2013_map3,
                    Join_2012_map3, Join_2011_map3, Join_2010_map3)
write.csv(tracts_all_full, file="2010-2017 5yr demographic and other data.csv")
names (tracts_all_full)

After wrangling with the datasets above, we created a panel dataset study.panel below to combine space and time intervals. While spatial context was captured by Census tract area, time intervals in quarter groups every 3 month starting January of 2010 to December of 2017. As a result of combining all possibilities of spatial and temporal intervals, we had a panel consisted of 12,288 grids by 32 unique quarters. In addition, this panel had assigned “permit_count” for each time and space observation. This panel is completed by finding all unique space/time intervals, inserting 0 permit count where necessary, and then performing some additional feature engineering which adds all the geometry and socioeconomic features.

#subset census tracts to create a full panel
vars_tracts <- c("year", "GEOID")
year_GEOID <- tracts_all_full[vars_tracts]
write.csv(year_GEOID, file="year_GEOID_old.csv")

#create the study panel from the census tract, quarter, and year
study.panel <-read.csv("C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Final/year_GEOID.csv")
# 12288 grids on the study panel

#panel creation after the expand grid 
tracts_all_full$GEOID <-as.character(tracts_all_full$GEOID)
newconstruction_all$GEOID<-as.character(newconstruction_all$GEOID)
study.panel$GEOID <-as.character(study.panel$GEOID)

newconstruction_all$quarter <-as.numeric(newconstruction_all$quarter)
study.panel$quarter <-as.numeric(study.panel$quarter)

tracts_all_full$year <-as.numeric(tracts_all_full$year)
newconstruction_all$year<-as.numeric(newconstruction_all$year)
study.panel$year <-as.numeric(study.panel$year)


newconstruction_all.panel<-
  newconstruction_all %>%
  mutate(permit_counter = 1) %>%
  right_join(study.panel)  %>%
  group_by(year,quarter, GEOID) %>% 
  summarize(permit_count=sum(permit_counter, na.rm = T)) %>%  
  left_join(tracts_all_full, by=c("GEOID"="GEOID", "year"="year")) %>%
  st_sf() %>%
  arrange(year, quarter, GEOID)

sum(newconstruction_all.panel$permit_count)

Next, we mined data for time trend through some additional feature engineering and capture the result by time lags.

newconstruction_all.panel <-
  newconstruction_all.panel %>%
  arrange(GEOID, year, quarter) %>%  
  mutate(lag1Quarter = dplyr::lag(permit_count, 1),
         lag2Quarter = dplyr::lag(permit_count, 2),
         lag3Quarter = dplyr::lag(permit_count, 3))

dplyr::select(st_set_geometry(newconstruction_all.panel, NULL), 
              year,quarter, GEOID, permit_count, 
              lag1Quarter, lag2Quarter, lag3Quarter)

Finally, we created training and testing datasets to evaluate the observed trend.Mainly to test generalizability, we trained on a subset of years and tested on years to follow. More specifically, the model is trained on data of years 2010 - 2016 and tested it on data of year 2017.

newconstruction_all.panel.train <- filter(newconstruction_all.panel, year <= 2016)
newconstruction_all.panel.test <- filter(newconstruction_all.panel, year > 2016)

Exploratory Analysis

In this section, we explored our new construction permit data for time, space, and demographic trends. In order to forecast well, these features must help generate reasonable results based on the time/space trends in our data model. The hypothsesis is that much of the variation in permit counts can be predicted by the combinate of temporal and spatial effects. Various data visualizations are presented as a approach for this.

Firstly, our time series hypothesis is that new construction permit count varies as a function of season (quarter) and year. Several tests of serial correlation is tested below by way of data visualization.

# visualize all year data for this 
newconstruction_all.panel %>%
  group_by(year, quarter)%>%
  summarize(permit_count=sum(permit_count))%>%
  ggplot(aes(quarter, permit_count))+
  geom_point()+ 
  geom_line() +
  geom_smooth(method="lm", se=F)+
  facet_wrap(~year, ncol=8)

newconstruction_all.panel <-
  newconstruction_all.panel %>%
  arrange(GEOID, year, quarter) %>%  
  mutate(lag1Quarter = dplyr::lag(permit_count, 1),
         lag2Quarter = dplyr::lag(permit_count, 2),
         lag3Quarter = dplyr::lag(permit_count, 3))

dplyr::select(st_set_geometry(newconstruction_all.panel, NULL), 
              year,quarter, GEOID, permit_count, 
              lag1Quarter, lag2Quarter, lag3Quarter)

## # A tibble: 12,288 x 7
## # Groups:   year, quarter [32]
##     year quarter GEOID       permit_count lag1Quarter lag2Quarter lag3Quarter
##  * <dbl>   <dbl> <chr>              <dbl>       <dbl>       <dbl>       <dbl>
##  1  2010       1 42101000100            0          NA          NA          NA
##  2  2010       2 42101000100            0          NA          NA          NA
##  3  2010       3 42101000100            0          NA          NA          NA
##  4  2010       4 42101000100            0          NA          NA          NA
##  5  2011       1 42101000100            0          NA          NA          NA
##  6  2011       2 42101000100            0          NA          NA          NA
##  7  2011       3 42101000100            0          NA          NA          NA
##  8  2011       4 42101000100            0          NA          NA          NA
##  9  2012       1 42101000100            0          NA          NA          NA
## 10  2012       2 42101000100            1          NA          NA          NA
## # ... with 12,278 more rows

newconstruction_all.panel.train <- filter(newconstruction_all.panel, year <= 2016)
newconstruction_all.panel.test <- filter(newconstruction_all.panel, year > 2016)


# test for all data 
st_set_geometry(
  rbind(
    mutate(newconstruction_all.panel.train, label = "Training"), 
    mutate(newconstruction_all.panel.test, label = "Testing")), NULL) %>%
  group_by(label, year, quarter) %>% 
  summarize(permit_count = sum(permit_count)) %>% #don't understand
  ggplot(aes(quarter, permit_count, colour = label)) + 
  geom_line() +
  ylim(0,500) +
  labs(title="New permit counts in Philadelphia by year: 2010 to 2017",
       x="Quarter Number", y="Permit Counts") +
  plotTheme() + # theme(panel.grid.major = element_blank()) +   
  scale_colour_manual(values = palette2) +
  facet_wrap(~year, ncol=8)

As you can see, the distribution of permit counts in these two groups is shown in the above plots. There are some interesting trends to note here. Firstly, an apparent rising trend of new construction permits from 2010 to 2016 is demonstrated in the plot, and then drops in 2017. Second, the peak of permit counts arises in the second quarter of 2016 (over 350 permits), suggesting a fairly vigorous new construction activity in that period. Indeed, Philadelphia has experienced a building boom, with more than 1,200 building permits issued for new construction in 2016, which is a 20 percent increase from 2015. Major reason accounting for this phenomenon is a dynamic economic and population growth. From 2010 to 2016, Philadelphia region has added 14,064 residents each year on average to its population, which is a 2.1 percent increase over that time span. As more jobs and housing popped up in Philly, the local economy has gradually been recovering from economic recession of 2007-9.

The plot below breaks out the same time series in small multiple form by year. This approach provides a better and clearer illustration, for which we reputed our hypothesis that the permit counts actually do not have a seasonal trend.

#Create new permit 
as.data.frame(newconstruction_all.panel) %>%
  group_by(year,quarter) %>%  
  summarize(permit_count = sum(permit_count)) %>%
  ggplot(aes(quarter,permit_count)) +  
  geom_line() +
  plotTheme() +
  facet_wrap(~year, scales="free", ncol=3) +
  ylim(0,500) +
  labs(title="New Permit counts in Philadelphia by year: From 2010 to 2017",
       x="Year", y="Permit Counts")

Next, the usefulness of the time lag features are tested. We created “plotData.lag” to pull out space and time intervals for one year, groups by quarter to remove any spatial variation and outputs means for several variables. Pearson correlation is then calculated for each and output below, and we can see that all the correlation coefficient is 1. The visualization is also shown below. Due to the fact that our permit dataset was quite limited in number compared with the study panel, and the summarizing time unit–quarter–only rendered a few data points, the result was taken less reliablely.

plotData.lag <-
  as.data.frame(newconstruction_all.panel) %>%
  filter(year == 2013) %>%
  group_by(quarter) %>%     #group_by(year.x, month) %>% 
  summarise_at(vars(starts_with("lag"), "permit_count"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -quarter, -permit_count) %>%  # gather(Variable, Value, -year.x, -month, -permit_count) %>%
  mutate(Variable = factor(Variable, levels=c("lag1Quarter","lag2Quarter","lag3Quarter")))


correlation.lag <-
  plotData.lag %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, permit_count),2))


correlation.lag

## # A tibble: 3 x 2
##   Variable    correlation
##   <fct>             <dbl>
## 1 lag1Quarter           1
## 2 lag2Quarter           1
## 3 lag3Quarter           1

plotData.lag %>%
  ggplot(aes(Value, permit_count)) + 
  geom_point() + geom_smooth(method = "lm", se = F) + 
  facet_wrap(~Variable) +
  geom_text(data=correlation.lag, aes(label=paste("R =", correlation)),colour = "grey", 
            x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
  labs(title = "New Residential Construction permit Counts as a function of Time Lags in quarter", 
       subtitle= "New Residential Construction Permit Counts in 2013, Philadelphia", x = "Lag Permit Count") +
  plotTheme()

For the purpose of exploratory analysis, we briefly analyzed the correlation between new residential construction permit counts by census tract for one year (2013) as a function of 6 Census variables, namely the count the parks, the count of healthy food stores, median household income, median household value, the count of bachelor degree, and the count of bus stops. We hope that these socio-economic variables could inform the prediction model later.

#delete those with value 0  
socioeco.panel <- newconstruction_all.panel 
socioeco.panel  <- socioeco.panel [!(socioeco.panel $ctpop == 0),] # elminate all tracts with no residents
socioeco.panel  <- socioeco.panel [!is.na(socioeco.panel $nfMHI),] # eliminate all tracts without mhi
socioeco.panel  <- socioeco.panel [!is.na(socioeco.panel $nfMHV),]# eliminate all tracts without mhv


plotData.census1 <-
  as.data.frame(socioeco.panel ) %>%
  filter(year == 2013) %>%
  group_by(GEOID,quarter) %>%
  summarize(permit_count = sum(permit_count))  %>%
  left_join(Join_2013_map3, by= c("GEOID"="GEOID")) %>%
  summarise_at(vars(starts_with("ct"),starts_with("pct"),"permit_count", "MHI", "MHV"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -GEOID, -permit_count) %>% 
  mutate(Variable = factor(Variable, levels=c(
    "ctparks","ctHPSS_P", "MHI", "MHV","ctBach", "ctstops"
  )))


plotData.census1 <- plotData.census1[!is.na(plotData.census1$GEOID),]
plotData.census1 <- plotData.census1[!is.na(plotData.census1$Variable),]


correlation.census1 <-
  plotData.census1 %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, permit_count),2))


correlation.census1

## # A tibble: 6 x 2
##   Variable correlation
##   <fct>          <dbl>
## 1 ctparks         0.14
## 2 ctHPSS_P        0.17
## 3 MHI             0.18
## 4 MHV             0.19
## 5 ctBach          0.24
## 6 ctstops         0.11

# plot the correlation
ggplot (plotData.census1, aes(Value,log(permit_count))) + 
  geom_point(cez=1, color= "grey") + 
  geom_smooth(method="lm", se = F, color = "orange") +
  facet_wrap(~Variable, scales="free", ncol=3) +
  geom_text(data=correlation.census1, aes(label=paste("R =", correlation)),
            colour = "black", x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
  plotTheme() +
  labs(title="One year (2013) of New Construction Permit (log) by Census Tract\nas a function of selected Census variables")

The following visualizations told a vivid story about spatial autocorrelation and helped us to understand if there are systematic permit counts patterns across space that can be harnessed when forecasting. The first figure showed the distribution of new residential construction permit count by quarter. There were lots of census tracts without any new residential construction permit over the past decade, most of which are concentrated on the edge of the city, such as Upper North Philadelphia and Near Northeast Philadelphia neighborhoods. Moreover, quarter 1 has the most concentrated permit issuance across census tracts (around Center City neighborhood area).

ggplot() + 
  geom_sf(data=newconstruction_all.panel, fill = "grey") + 
  labs(title = "Study Area Tracts") +
  mapTheme()

#quartile with 0
temp.DataFrame_0 <-
  newconstruction_all.panel %>% 
  group_by(quarter, GEOID) %>%
  summarize(Sum_permit_count = sum(permit_count))
qBr(temp.DataFrame_0, "Sum_permit_count")

## [1] "0" "0" "0" "1" "3"

#[1] "0" "0" "0" "1" "3"
# including 0
newconstruction_all.panel %>% 
  group_by(quarter, GEOID) %>%
  summarize(Sum_Permit_Count = sum(permit_count, na.rm=T)) %>%
  ggplot() + geom_sf(aes(fill = q5(Sum_Permit_Count))) +
  facet_wrap(~quarter, ncol = 4) +
  scale_fill_manual(values = palette5.2,
                    labels = c("0", "0", "0", "1", "3"),
                    name = "Permit Count (with zeros)") +
  labs(title="Sum of new permit counts by tract and quarter") +
  mapTheme() + theme(legend.position = "bottom")

Next, we visualized 8 years of new residential construction permit data by quarter, and the quartile is based on all census tracts with non-zero permit-counts, where we saw that there is no significant difference in time pattern but in spatial context: new residential construction activity is the most vigorous in north and south Philadelphia near center city. This was consistent with our local knowledge that these areas are most gentrified in the recent decade.

#quartile without 0
temp.DataFrame_wo0 <-
  newconstruction_all.panel %>% 
  group_by(quarter, GEOID) %>%
  summarize(Sum_permit_count = sum(permit_count))%>%
  filter(Sum_permit_count >0)
qBr(temp.DataFrame_wo0, "Sum_permit_count")

## [1] "1"  "1"  "2"  "5"  "16"

#[1] "1"  "1"  "2"  "5"  "16"
# not including 0
newconstruction_all.panel_wo0 <-
  newconstruction_all.panel %>%
  group_by(quarter, GEOID) %>%
  summarize(Sum_permit_count = sum(permit_count, na.rm=T))
newconstruction_all.panel_wo0 <-
  newconstruction_all.panel_wo0%>%
  subset(Sum_permit_count>0)
newconstruction_all.panel_wo0%>%
  ggplot() +  
  geom_sf((aes(fill = q5(Sum_permit_count)))) +
  facet_wrap(~quarter, ncol = 4) +
  scale_fill_manual(values = palette5.1,
                    labels = c("1","1","2","5","16"),
                    name = "Permit Count (w/o zeros)") +
  labs(title="Sum of new permit counts by tract and quarter based on non-zero permit count") +
  mapTheme() + theme(legend.position = "bottom")

The plot below demonstrates that only 20% of the permit count for census tract in each year combination on study panel has any new residential construction activity. It could be seen that more census tracts have new residential construction activities as time become more recent.

# by year:
temp.DataFrame_year_0 <-
  newconstruction_all.panel %>% 
  group_by(year, GEOID) %>%
  summarize(Sum_permit_count = sum(permit_count))
qBr(temp.DataFrame_year_0, "Sum_permit_count")

## [1] "0" "0" "0" "0" "1"

#[1] "0" "0" "0" "0" "1"

newconstruction_all.panel %>% 
  group_by(year, GEOID) %>%
  summarize(Sum_Permit_Count = sum(permit_count)) %>%
  ggplot() + geom_sf(aes(fill = q5(Sum_Permit_Count))) +
  facet_wrap(~year, ncol = 4) +
  scale_fill_manual(values = palette5.4,
                    labels = c("0", "0", "0", "0", "1"),
                    name = "Permit Count") +
  labs(title="Sum of New Residential Construction Permit Counts by Tract and Year") +
  mapTheme() + theme(legend.position = "bottom")

The following plot further illustrates that among all census tracts that have any new residential construction activity, north and south to center city has the most active new residential constructions than other areas. Among all the neighborhoods, Southwest Philadelphia, Kensington-Richmond area and Far Northeast Philadelphia are among the most significantly changed neighborhoods since 2010 in terms of new residential construction activities.

#quartile without 0
temp.DataFrame_year_wo0 <-
  newconstruction_all.panel %>% 
  group_by(year, GEOID) %>%
  summarize(Sum_permit_count = sum(permit_count))%>%
  filter(Sum_permit_count >0)
qBr(temp.DataFrame_year_wo0, "Sum_permit_count")

## [1] "1"  "1"  "2"  "4"  "10"

#[1] "1"  "1"  "2"  "4"  "10"

# not including 0
newconstruction_all.panel_year_wo0 <-
  newconstruction_all.panel %>%
  group_by(year, GEOID) %>%
  summarize(Sum_permit_count = sum(permit_count, na.rm=T))
newconstruction_all.panel_year_wo0 <-
  newconstruction_all.panel_year_wo0 %>%
  subset(Sum_permit_count>0)
newconstruction_all.panel_year_wo0 %>%
  ggplot() +  
  geom_sf((aes(fill = q5(Sum_permit_count)))) +
  facet_wrap(~year, ncol = 4) +
  scale_fill_manual(values = palette5.1,
                    labels = c("1","1","2","4","10"),
                    name = "Permit Count (w/o zeros)") +
  labs(title="Sum of New Residential Construction Permit Counts by Tract and Year") +
  mapTheme() + theme(legend.position = "bottom")

Lastly, we built an animated map of new residential construction permits over space and time, which provides further evidence of the space dependencies in the data. This could also be a good demonstration for an online app, government research website, or for other communication purposes.

# space time animation
library(reprex)
library(dplyr)
library(ggplot2)
library(devtools)
devtools::install_github("thomasp85/gganimate")
devtools::install_github("r-lib/pkgbuild")
library(pkgbuild)
find_rtools() 
devtools::install_github("thomasp85/gganimate", ref="v0.1.1")
library(gganimate)
library(animation)
library(gapminder)
library(ggplot2)


year2012.panel <-
  newconstruction_all.panel %>%
  filter(year == 2012)


Join_2012_map3$GEOID <-as.numeric(Join_2012_map3$GEOID)
year2012.panel$GEOID <-as.numeric(year2012.panel$GEOID)



permit.animation.data <-
  year2012.panel %>%
  group_by(quarter, GEOID) %>%
  summarize(Permit_Count = sum(permit_count, na.rm=T)) %>% 
  left_join(Join_2012_map3,by=c("GEOID" = "GEOID")) %>%
  st_sf() %>%
  mutate(Permits = case_when(Permit_Count == 0 ~ "0 permits",
                             Permit_Count > 0 & Permit_Count <= 5 ~ "1-5 permits",
                             Permit_Count > 5 & Permit_Count <= 10 ~ "6-10 permits",
                             Permit_Count > 10 & Permit_Count <= 20 ~ "11-20 permits",
                             Permit_Count > 20 ~ "21+ permits")) %>%
  mutate(Permits  = factor(Permits, levels=c("0 permits","1-5 permits","6-10 permits","11-20 permits","21+ permits")))


Permit_animation <-
  ggplot() +
  geom_sf(data=permit.animation.data, aes(fill=Permits)) +
  scale_fill_manual(values = palette5) +
  labs(title = "New Residential Construction Permits in 2012, Philadelphia",
       caption = "Data: L&I Department") +
  transition_manual(quarter) +
  mapTheme()

animate(Permit_animation, duration=5)
#anim_save("Permit_animation.gif")


#year>>>>> animation

year_all.panel <-
  newconstruction_all.panel 

#>#>#>#>#>#>##>
year_all.panel$GEOID <-as.numeric(year_all.panel$GEOID)
newconstruction_all.panel $GEOID <-as.numeric(newconstruction_all.panel$GEOID)

#>#>#>#>#>#>#>#>#>

permit_all.animation.data <-
  year_all.panel %>%
  group_by(year, GEOID) %>%
  summarize(Permit_Count = sum(permit_count, na.rm=T)) %>% 
  st_sf() %>%
  mutate(Permits = case_when(Permit_Count == 0 ~ "0 permits",
                             Permit_Count > 0 & Permit_Count <= 10 ~ "1-10 permits",
                             Permit_Count > 10 & Permit_Count <= 20 ~ "10-20 permits",
                             Permit_Count > 20 & Permit_Count <= 30 ~ "21-30 permits",
                             Permit_Count > 30 ~ "30+ permits")) %>%
  mutate(Permits  = factor(Permits, levels=c("0 permits","1-10 permits","10-20 permits","21-30 permits","30+ permits")))


Permit_all_animation <-
  ggplot() +
  geom_sf(data=permit_all.animation.data, aes(fill=Permits)) +
  scale_fill_manual(values = palette5) +
  labs(title = "New Residential Construction Permits in  2010-2017, Philadelphia",
       caption = "Data: L&I Department") +
  transition_manual(year) +
  mapTheme()


animate(Permit_all_animation, duration=10)

#anim_save("Permit_animation2.gif")

Here is the animated map for New Permits in 2012 across four quarters. The Center City Area (where the tighter grids are) suggest high activity year long, while other activity spot are near Center City or along major roads.

Animation 1

In another time scale, Year, we saw that patterns across year, as it gets closer to 2017, activities havee been spreading out and increasing in activity intensity. Places such as West Kensington, Fishtown that is northeast to Center City, indeed witnessed shift of demographics and housing gentrification.

Animation 2

Modeling and Validation

In this section, we developed several models from our training data and test those models on our testing data. Four regressions are run and predictions are created in a nested data frame allowing us to concisely validate each model.

Below are four different ordinary least squares(OLS) regressions estimated on training dataset. Most of the space and time features included are fixed effects. For instance, each model with quarter effects, hypothesizes that quarter of the year is an important prediction of new residential construction permit counts across space and time. The variable of GEOID will be accounted for as a set of character fixed effects. Controlling for GEOID hypothesizes that new residential construction permit count origins play a significant role in forecasting permit counts.

Here is a breakdown of the 4 regressions estimated below: The first focuses on just time effects.

“reg1” includes quarterly fixed effects and six socioeconomic variables, while lag effects are not included for now; “reg2”, replaces the time fixed effects with a vector of GEOID (which represents census tract) fixed effects; “reg3” includes both time and space fixed effects; “reg4” adds the time lag features.

# Model Estimation
# estimate OLS regression on newconstruction_all.panel_train
# quarter fixed effect= hypothesize that quarter of the year is an important predictor of the new premit count across space and time
# character fixed effect= use GEOID ,hypothesize that the census tract location plays a big role in new permit count prediction
# we have 4 reg to compare
# reg1 is just time effect (aka quarter fixed effect)
# reg2 is just character fixed effect 
# reg3 is both time and character
# reg4 is time, character and time lag
newconstruction_all.panel.train_1<- na.omit(newconstruction_all.panel.train)
# from 9216 to 8752
newconstruction_all.panel.test_1<- na.omit(newconstruction_all.panel.test)
#from 3072 to 2920

reg1 <-
  lm(permit_count ~ quarter + ctBach + ctparks +
       ctHPSS_P + MHI + MHV + ctstops, data = newconstruction_all.panel.train_1)

reg2 <-
  lm(permit_count ~ GEOID + ctBach+ ctparks +
       ctHPSS_P + MHI + MHV + ctstops, data = newconstruction_all.panel.train_1)

reg3 <-
  lm(permit_count ~ GEOID + quarter + ctparks +
       ctHPSS_P + MHI + MHV + ctstops, data = newconstruction_all.panel.train_1)

reg4 <-
  lm(permit_count ~ GEOID + quarter + ctparks +
       ctHPSS_P + MHI + MHV + ctstops
     + lag1Quarter + lag2Quarter
     + lag3Quarter, data = newconstruction_all.panel.train_1)

Next, goodness of fit, specifically Mean Absolute Error (MAE), was calculated on testing data for each model. A quick function was also created to predict for new construction permit counts in each year.

#Validate test set by time
newconstruction_all.panel.test.yearNest <- 
  newconstruction_all.panel.test_1 %>%
  nest(-year) 

newconstruction_all.panel.test.yearNest

## # A tibble: 1 x 2
## # Groups:   year [1]
##    year data                 
##   <dbl> <list>               
## 1  2017 <tibble [1,460 x 27]>

#used to predict each year
model_pred <- function(dat, fit){
  pred <- predict(fit, newdata = dat)}

Below shows the way to calculate year_predictions for each year’s permit count prediction in our testing dataset. Each new column is a list of predictions for each model by year.

year_predictions <- 
  newconstruction_all.panel.test.yearNest  %>% 
  mutate(A_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
         B_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
         C_Time_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
         D_Time_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred)
  )

year_predictions

## # A tibble: 1 x 6
## # Groups:   year [1]
##    year data         A_Time_FE   B_Space_FE  C_Time_Space_FE D_Time_Space_FE_ti~
##   <dbl> <list>       <list>      <list>      <list>          <list>             
## 1  2017 <tibble [1,~ <dbl [1,46~ <dbl [1,46~ <dbl [1,460]>   <dbl [1,460]>

According to the table below, the most powerful model was our third model, which had the least mean errors between 0.68 and 2.21 new construction permits across each space/time interval. As we can map through the nested data to calculate the mean permit counts by year, the results showed that our model was not very robust.

year_predictions <-
  year_predictions %>% 
  gather(Regression, Prediction, -data, -year) %>%
  mutate(Observed = map(data, pull, permit_count),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean),
         sd_AE = map_dbl(Absolute_Error, sd))

year_predictions

## # A tibble: 4 x 8
## # Groups:   year [1]
##    year data      Regression     Prediction Observed  Absolute_Error   MAE sd_AE
##   <dbl> <list>    <chr>          <list>     <list>    <list>         <dbl> <dbl>
## 1  2017 <tibble ~ A_Time_FE      <dbl [1,4~ <dbl [1,~ <dbl [1,460]>  0.974  2.50
## 2  2017 <tibble ~ B_Space_FE     <dbl [1,4~ <dbl [1,~ <dbl [1,460]>  0.699  2.21
## 3  2017 <tibble ~ C_Time_Space_~ <dbl [1,4~ <dbl [1,~ <dbl [1,460]>  0.686  2.21
## 4  2017 <tibble ~ D_Time_Space_~ <dbl [1,4~ <dbl [1,~ <dbl [1,460]>  0.694  2.21

newconstruction_all.panel.test.yearNest %>%
  mutate(permit_count = map(data, pull, permit_count),
         Mean_Permit_Count = map_dbl(permit_count, mean))

## # A tibble: 1 x 4
## # Groups:   year [1]
##    year data                  permit_count  Mean_Permit_Count
##   <dbl> <list>                <list>                    <dbl>
## 1  2017 <tibble [1,460 x 27]> <dbl [1,460]>             0.655

The MAE can also be plotted by model by year as below. We again saw that the quarterly effects did not bring much additional predictive power to the model.

year_predictions %>%
  dplyr::select(year, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -year) %>%
  ggplot(aes(year, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette4) +
  labs(title = "Mean Absolute Errors by model specification and year") +
  plotTheme()

For each model, the predicted and observed permt counts is plotted in time series form below, which explained that we are predicting for each census tract, on average how many new construction permit in each quarter. Both observed and predicted quantitites are taken out of their spatial context and summed only by each quarter. It is also clear that the time fixed effects alone don’t help predict the trend, and time lag effects also add just a poor amount of predictive power to this model. Thus, new construction permit count patterns from the very recent past does little help to prediction in the near future.

year_predictions %>% 
  mutate(quarter = map(data, pull, quarter),
         GEOID = map(data, pull, GEOID)) %>%
  dplyr::select(quarter,GEOID, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -quarter,-GEOID, -year) %>%
  group_by(Regression, Variable, quarter) %>%
  summarize(Value = mean(Value)) %>%
  ggplot(aes(quarter, Value, colour=Variable)) + 
  geom_line(size = 1.1) + 
  facet_wrap(~Regression, ncol=1) +
  scale_colour_manual(values = palette4) +
  labs(title = "Mean Predicted/Observed permit counts by quarter interval on the census tract level", 
       subtitle = "Philadelphia; A test set of 4 quarters in 2017", x = "Quarter", y= "Permit Counts") +
  plotTheme()

Nevertheless, given the available data for analysis, we mapped Mean Absolute Error for choosing the most optimal regressional model of all. Here, all errors for all time periods were averaged by GEOID variable to provide a measure of Mean Absolute Error by tract. The highest permit count MAE occured in the north area to Center City, where more construction activity had obviously occured. At the first glance, regardless of the overall power of the model, the model is better at predicting in space than in time.

#validate test set by space
myData <-
  filter(year_predictions, Regression == "C_Time_Space_FE") %>% 
  unnest() %>% 
  st_sf() %>%
  dplyr::select(GEOID, Absolute_Error, geometry) %>%
  gather(Variable, Value, -GEOID, -geometry,-year) %>%
  group_by(Variable, GEOID) %>%
  summarize(MAE = mean(Value))

quantile(myData$"MAE")

##           0%          25%          50%          75%         100% 
##  0.007755582  0.023974687  0.095123276  0.503877791 14.245338133

#  0.006854681,  0.023860059,  0.095222699,  0.503427340, 14.244957731 

filter(year_predictions, Regression == "C_Time_Space_FE") %>% 
  unnest() %>% 
  st_sf() %>%
  dplyr::select(GEOID, Absolute_Error, geometry, quarter) %>%
  gather(Variable, Value, -GEOID, -geometry,-year) %>%
  group_by(Variable, GEOID) %>%
  summarize(MAE = mean(Value)) %>%
  ggplot() + 
  geom_sf(aes(fill = q5(MAE))) +
  scale_fill_manual(values = palette5,
                    labels = c("0.01", "0.02", "0.10", "0.50", "14.24"),
                    name = "MAE") +
  labs(title="Mean Absolute Error by regression and tract") +
  mapTheme() + theme(legend.position="bottom")

Finally, we also mapped a time/space MAE series for four quarters of year 2017. A very vague diffrence among each plots further prove the fact that there was very few systematic pattern for time correlation in our model.

myData2<-
  filter(year_predictions, Regression == "C_Time_Space_FE") %>% 
  unnest() %>% 
  st_sf() %>%
  dplyr::select(GEOID, Absolute_Error, geometry, quarter) %>%
  gather(Variable, Value, -quarter, -GEOID, -geometry) %>%
  group_by(quarter, GEOID) %>%
  summarize(MAE = mean(Value)) 

quantile(myData2$"MAE")

##       0%      25%      50%      75%     100% 
## 1008.500 1008.510 1008.536 1008.712 1034.245

# 1008.500, 1008.510, 1008.535, 1008.709, 1034.245


filter(year_predictions, Regression == "C_Time_Space_FE") %>% 
  unnest() %>% 
  st_sf() %>%
  dplyr::select(GEOID, Absolute_Error, geometry, quarter) %>%
  gather(Variable, Value, -quarter, -GEOID, -geometry) %>%
  group_by(quarter, GEOID) %>%
  summarize(MAE = mean(Value)) %>%
  ggplot() + 
  geom_sf(aes(fill = q5(MAE))) +
  facet_wrap(~quarter, ncol = 8) +
  scale_fill_manual(values = palette5,
                    labels = c("1008.50", "1008.51", "1008.54", "1008.71", "1034.25"),
                    name = "MAE (trip count)") +
  labs(title="Mean absolute permit count error by tract and quarter",
       subtitle = "For the year of 2017") +
  mapTheme() + theme(legend.position="bottom")

Discussion & Conclusion

Although we can not negelct the fact that these are linear regressions and not powerful machine learning models, the predictive results’ accuracy is still not satisfactory enough. We indeed found some evidence of spatial autocorrelation, but we don’t predict nearly as strong across time. Considering that our citywide prediction model isn’t quite accurate and significant, the next step would be looking at smaller scales by filtering out the census tracts of interest. Below shows an example of code to demonstrate this interest, where we split test and train dataset based on the data of the selected study tracts.

#want to see one census tract and how it works

#newconstruction_subset <- 
#  newconstruction_all.panel %>%
#  subset(newconstruction_all.panel$GEOID == 42101036700)

#newconstruction_subset.train <- filter(newconstruction_subset, year == 2013)
#newconstruction_subset.test <- filter(newconstruction_subset, year == 2016)

In conclusion, despite of the inherent data limitations, our model still provides values from which Philadelphia local government could benefit a lot. It could serve as a guidance for government to understand the trend of new construction activity from northern and southern parts of Philadelphia to center city, and then using the prediction of new construction permit to inform possible gentrification in the future. According to National Community Reinvestment Coalition⁷, from 2000 to 2013, 57 different census tracts in Philly became gentrified, such as spanning Northern Liberties, West Poplar, and Callowhill. However, when doing municipal plans, government usually are not informed of neighborhood level activity or effectively integrating them as solid part of the planning goals&bjectives. Since city officials care about the neighborhoods and communities that will be vulnerable to the next wave of gentrification, they could utilize this model to strategically allocate resources for various affordable housing programs, prioritize the public infrastructure investment and make policies like “Right to Stay”.

References in Brief

{Open Data Philly: https://www.opendataphilly.org/}↩
{PASDA: https://www.pasda.psu.edu/}↩
{US Census Bureau: https://www.socialexplorer.com/}↩
{Licenses and Inspections Building Permits Department: https://data.phila.gov/visualizations/li-building-permits}↩
{Parks: http://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=7021 SEPTA subway stops: https://www.opendataphilly.org/dataset/septa-sms-transit High Quality Healthy Food Stores: https://www.opendataphilly.org/dataset/licenses-and-inspections-business-licenses/resource/385d7476-c1bc-4a51-bd72-cc086c1cb9ab}↩
{Tidycensus Data (through R CRAN packages–https://cran.r-project.org/web/packages/tidycensus/tidycensus.pdf}↩
{National Community Reinvestment Coalition: https://ncrc.org/ }↩