Spatial analysis of the factors pushing from and pulling migrants to the suburbs

Introduction

In this study, we examine part of the dataset from the paper “Causes of the spatially uneven outflow of Warsaw inhabitants to the city’s suburbs: an economic analysis of the problem” (Bogusz (2021)), the reviewed version of which was submitted to the Central Eastern Economic Journal. In it, I investigate the suburban migration patterns of Warsaw inhabitants (the outflow from inner city to suburban boroughs) and aim to identify the features of boroughs which are key pulling factors for migrants. I built a correlated random effects panel model with Mundlak device. I estimated the extended gravity model of migration equation by feasible GLS, incorporating Mundlak style time averages of variables which might be correlated with transient errors. That kind of specification allowed me to overcome problems of possible endogeneity, as well as transient errors collinearity. My work on this topic also included parsing unyielding data from Polish county offices. These data contains each property market transaction from 2008-2019, based on which I wanted to calculate the average price of housing on a borough level. This variable can be, however, the source of endogeneity in the model.

This topic is relevant, as when those pulling and pushing factors are identified, adequate spatial planning can be executed in order to curb negative consequences of unrestricted urban sprawl. The panel model in my article is based on the extended gravity model of migration.

However, one caveat remains and it is addressed in this assignment. Namely, the suburban boroughs are spatial entities and can, thus, interact with each other to some extent. It is reasonable to assume that spatial autocorrelation might be present in the model. That is why it is feasible to try modeling suburban migration with a spatial approach - in our case, with the spatial lag model.

A thorough literature review of the topic can be found in one of the working papers:

In this assignment, we focus on (1) visualisation of spatial data; (2) spatial panel modeling.

The full repository of the following project can be found on Github: https://github.com/szymekszymek/Project_SE

Below, we run the initial setup for analysis.

# 0. Initial set-up
{
  # 0.1. clear env
rm(list = ls())

  # 0.2. load packages
requiredPackages <- c('tidyverse', 'haven', 'readxl', 'utf8', 'rjson', 'osmdata', 'sf', 'sp', 'broom', 'viridis', 'cowplot', 'reshape2', 'spatialreg', 'spdep', 'plm', 'splm', 'Metrics', 'stargazer', 'kableExtra', 'dplyr')
for (i in requiredPackages) {
  if (!require(i,character.only = T)) {
    install.packages(i, character.only = T)
  }
}

  # 0.3. define custom functions
r2 <- function (actual, res) {
  rss <- sum(res^2)
  tss <- sum((actual - mean(actual)) ^ 2)
  r2 <- 1 - rss/tss
  return (r2)
}

r2_adj <- function (actual, res, p = 29) {
  n <- 759
  r2_adj <- 1 - ((1 - r2(actual, res)) * (n - 1) / (n - p - 1))
  return (r2_adj)
}

rmse <- function (res) {
  return (sqrt(sum(res^2)/759))
}

# https://stackoverflow.com/questions/46186527/how-to-calculate-bic-and-aic-for-a-gmm-model-in-r-using-plm
AIC_adj <- function(mod, p){
  # Number of observations
  n.N   <- nrow(mod$model)
  # Residuals vector
  u.hat <- residuals(mod)
  # Variance estimation
  s.sq  <- log( (sum(u.hat^2)/(n.N)))
  # Number of parameters (incl. constant) + one additional for variance estimation
  p     <-  p
  
  # Note: minus sign cancels in log likelihood
  aic <- 2*p  +  n.N * (  log(2*pi) + s.sq  + 1 ) 
  
  return(aic)
}
}

Exploratory Data Analysis

We conduct a spatial panel analysis of migration from Warsaw to 69 suburban boroughs in between 2008 and 2018. The dependent variable is the number of migrants who moved out of Warsaw and reported residence in a borough. Below, we describe the dataset and visualize the data that we have at our disposal.

Variables

The regressors used in the analysis are.

Population density - one of the key variables of the gravity model of migration.
Distance (straight line) - the second key variable of the gravity model of migration.
Relative income per capita - average income in a borough (in PLN) divided by the average income in all boroughs weighted by population (in PLN). Calculated based on personal income tax data.
Unemployment rate - measures local labor market condition.
Total greenery spaces (parks, green plazas, urban forests, etc.) - an amenity.
Ratio of the forest area and the total area of a borough - measures to what extend the presence of forests hinders habitable zones (e.g. it is forbidden to settle in close proximity to the Kampinowski National Park).
Number of kindergartens per 100 000 population - an amenity.
Mean price of m2 of housing in a borough divided by the same for Warsaw - a relative measure of the prices of housing is constructed. Below, I include a brief summary from my article “Causes of the spatially uneven outflow of Warsaw inhabitants to the city’s suburbs: an economic analysis of the problem” about how the price indicator was constructed based on the transactional data obtained from Polish county offices.
Borough type, according to the polish administrative classification. Possible types are urban, urban-rural, and rural.
Presence of suburban train station - a dummy indicating if there is a train station in a borough.

Below a disclaimer from Bogusz (2021) about how the relative price of housing was calculated:

Information about prices of housing available at the Polish Statistical Office are limited to the county level in years 2015-2018, which is a very rough measure for prices in boroughs. In my attempt to find a better one, I obtained real estate transactional data from Polish county offices for each of the 69 boroughs in the years 2008-2019 for the following property types: flats, houses, and parcels. Based on that parsed data, I constructed an indicator of the relative price of housing. I included observations for each borough, type, and year where the number of transactions was greater than 10. The average price of m2 of housing was obtained by weighting average prices of property types (in each borough and each year) by the number of transactions corresponding to different types. However, in terms of flats, the price is related to the living space, while it corresponds to the area of a parcel for both house and parcel. This is due to the reason that no information about the living space of houses was available for about half of the boroughs in my study. The availability of details of transactions depends on the data classifying and storing system a specific county office uses and there is no standardization of such a system in Poland. In addition, I limited the area of parcels to observations greater than 200 m2 and smaller than 2000 m2. The reason for this distinction is that properties classified as houses or parcels are not necessarily lots habitable by private agents, but can be also service establishments. On the one hand, 200 m2 is the minimum parcel size, where a habitable house, rather than holiday cabin can be built. On the other, 2000 m2 is a reasonable upper boundary since private houses are rarely built on larger parcels in the suburbs of Warsaw (the average parcel size is between 1400-2000 in 90% of the boroughs in years examined) giving rise to the presumption that larger parcels function as service establishments in the vast majority.That obtained average price was divided by the average price in Warsaw as obtained from the Polish Statistical Office for 2015-2018 and extrapolated for 2008-2019 by the use of the index of prices of housing for Warsaw for 2008-2019. Hence, a price relative to the one in Warsaw was constructed. Furthermore, no transactions were observed for all types or at all for some boroughs in some years. To complete the unbalanced panel, the well-known multiple imputation technique was used. Since in my setting observations for only one regressor were missing, predictive mean matching was applied. When imputing the missing values of a continuous variable, PMM may be preferable to linear regression when the normality of the underlying model is suspect (Little, 1988; Rubin, 1986). The imputation was made before calculating the time average or price relative to the price in Warsaw. (Bogusz (2021))

The table below presents variable names, units, and sources.

names <- c("check_in", "pop_density", "dist", "income", "unempl", "greenery", "forest", "average_b_waw", "bu", "bur", "train")
units <- c("people", "people/km2", "km", "ratio", "rate", "ha", "ratio", "ratio", "binary", "binary", "binary")
sources <- c("Polish Statistical Office","Polish Statistical Office","Polish Statistical Office","Polish Statistical Office","Polish Statistical Office","Polish Statistical Office","Polish Statistical Office","Suburban County Offices","Polish Statistical Office","Polish Statistical Office","Polish Statistical Office")
df <- data.frame(names, units, sources)
colnames(df) <- c("Variable Name", "Unit", "Source")
df %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "responsive", "hover")) %>%
  scroll_box(width = "100%")

Variable Name	Unit	Source
check_in	people	Polish Statistical Office
pop_density	people/km2	Polish Statistical Office
dist	km	Polish Statistical Office
income	ratio	Polish Statistical Office
unempl	rate	Polish Statistical Office
greenery	ha	Polish Statistical Office
forest	ratio	Polish Statistical Office
average_b_waw	ratio	Suburban County Offices
bu	binary	Polish Statistical Office
bur	binary	Polish Statistical Office
train	binary	Polish Statistical Office

The below map shows the administrative division of the suburban ring of Warsaw, i.e. where are respective boroughs located.

  # 1. plot of boroughs with names
{
  # 1.1. preapre data with polygons' centroids to add labels to map
  locs_cent_sp <- st_transform(locs$osm_multipolygon, crs = 2180) %>%
    st_geometry() %>%
    st_centroid() %>%
    st_transform(., crs ="+proj=longlat +datum=WGS84") %>%
    as(., 'Spatial') %>%
    as.data.frame() %>%
    mutate (
      borough = locs$osm_multipolygons$borough,
      x = as.numeric(coords.x1),
      y = as.numeric(coords.x2)
    ) %>%
    dplyr::select(borough, x, y)
  
  # 1.2. for some boroughs, numbers will be used instead of names (for readability purposes)
  boroughs_numbers <- c('Nowy Dwór Mazowiecki', 'Piastów', 'Podkowa Leśna', 'Milanówek', 'Mińsk Mazowiecki', 'Mińsk Mazowiecki 2', 'Legionowo', 'Pruszków', 'Sulejówek')
  
  # 1.3. filter for boroughs where numbers will be shown
  locs_cent_sp_number <- locs_cent_sp[locs_cent_sp$borough %in% boroughs_numbers, ] %>%
    arrange(borough)
  
  # 1.4. add unique numbers to borough
  locs_cent_sp_number$number <- 1:nrow(locs_cent_sp_number)
  
  # 1.5. filter for boroughs where names will be shown
  locs_cent_sp_name <- locs_cent_sp[!locs_cent_sp$borough %in% boroughs_numbers, ]
  
  # 1.6. create data frame to add the names of borough denoted with numbers
  (max(locs_cent_sp_name$y) - min(locs_cent_sp_name$y)) / 9 # -0.0706
  
  map_legend <- data.frame(
    number = locs_cent_sp_number$number,
    borough = locs_cent_sp_number$borough,
    x = max(locs_cent_sp_name$x) + 0.15,
    y = seq(max(locs_cent_sp_name$y), min(locs_cent_sp_name$y), -0.0709)
  )
  
  # 1.7. add enters to names that have more than one word (for readability purposes)
  for (i in (1:nrow(locs_cent_sp_name))) {
    name <- locs_cent_sp_name[i, 'borough']
    if (grepl(' ', name) ||  grepl('-', name)) {
      name <- ifelse(grepl(' ', name), strsplit(name, ' '), strsplit(name, '-'))[[1]]
      name <- paste(name,  collapse = '\n')
      locs_cent_sp_name[i, 'borough'] <- name
    }
  }
    # 1.8. manually adjust locations to better fit the name
  locs_cent_sp_name[locs_cent_sp_name$borough == 'Lesznowola', 'y'] <- locs_cent_sp_name[locs_cent_sp_name$borough == 'Lesznowola', 'y'] + 0.01
  locs_cent_sp_name[locs_cent_sp_name$borough == 'Ożarów\nMazowiecki', 'y'] <- locs_cent_sp_name[locs_cent_sp_name$borough == 'Ożarów\nMazowiecki', 'y'] + 0.005
  locs_cent_sp_name[locs_cent_sp_name$borough == 'Wiązowna', 'y'] <- locs_cent_sp_name[locs_cent_sp_name$borough == 'Wiązowna', 'y'] + 0.01
  locs_cent_sp_name[locs_cent_sp_name$borough == 'Wiązowna', 'x'] <- locs_cent_sp_name[locs_cent_sp_name$borough == 'Wiązowna', 'x'] - 0.01
  locs_cent_sp_name[locs_cent_sp_name$borough == 'Celestynów', 'x'] <- locs_cent_sp_name[locs_cent_sp_name$borough == 'Celestynów', 'x'] - 0.01
  locs_cent_sp_name[locs_cent_sp_name$borough == 'Pomiechówek', 'x'] <- locs_cent_sp_name[locs_cent_sp_name$borough == 'Pomiechówek', 'x'] - 0.01
  locs_cent_sp_name[locs_cent_sp_name$borough == 'Pomiechówek', 'y'] <- locs_cent_sp_name[locs_cent_sp_name$borough == 'Pomiechówek', 'y'] - 0.014
  locs_cent_sp_number[locs_cent_sp_number$borough == 'Mińsk Mazowiecki 2', 'y'] <- locs_cent_sp_number[locs_cent_sp_number$borough == 'Mińsk Mazowiecki 2', 'y'] - 0.01
  locs_cent_sp_number[locs_cent_sp_number$borough == 'Mińsk Mazowiecki 2', 'x'] <- locs_cent_sp_number[locs_cent_sp_number$borough == 'Mińsk Mazowiecki 2', 'x'] + 0.045
  locs_cent_sp_number[locs_cent_sp_number$borough == 'Mińsk Mazowiecki 1', 'y'] <- locs_cent_sp_number[locs_cent_sp_number$borough == 'Mińsk Mazowiecki 1', 'y'] + 0.005
  locs_cent_sp_number[locs_cent_sp_number$borough == 'Nowy Dwór Mazowiecki', 'y'] <- locs_cent_sp_number[locs_cent_sp_number$borough == 'Nowy Dwór Mazowiecki', 'y'] - 0.011
  locs_cent_sp_number[locs_cent_sp_number$borough == 'Nowy Dwór Mazowiecki', 'x'] <- locs_cent_sp_number[locs_cent_sp_number$borough == 'Nowy Dwór Mazowiecki', 'x'] + 0.02
  locs_cent_sp_number[locs_cent_sp_number$borough == 'Podkowa Leśna', 'y'] <- locs_cent_sp_number[locs_cent_sp_number$borough == 'Podkowa Leśna', 'y'] - 0.007
  locs_cent_sp_number[locs_cent_sp_number$borough == 'Podkowa Leśna', 'x'] <- locs_cent_sp_number[locs_cent_sp_number$borough == 'Podkowa Leśna', 'x'] + 0.018
  
  # 1.9. plot
  map_names <- ggplot(locs_sp, aes(x=long, y=lat, group=group)) +
    # borough outlines and fill
    geom_polygon(color = 'black', fill = '#f9f9f9', size=1) +
    # number labels on the map
    geom_label(
      data = locs_cent_sp_number,
      mapping = aes(x=x, y=y, label = number),
      size = 2.5,
      color = 'black',
      fontface = 'bold',
      inherit.aes = F) +
    # number labels on the legend
    geom_label(
      data = map_legend,
      mapping = aes(x=x, y=y, label = number),
      size = 2.5,
      color = 'black',
      fontface = 'bold',
      inherit.aes = F) +
    # text labels on the legend
    geom_text(
      data = map_legend,
      mapping = aes(x=x + 0.03, y=y, label = borough),
      size = 2.5,
      color = 'black',
      fontface = 'bold',
      hjust = 0,
      inherit.aes = F) +
    # empty points to fit the legend's text labels
    geom_point(
      data = map_legend,
      mapping = aes(x=x + 0.15, y=y),
      size = 0,
      color = NA,
      inherit.aes = F) +
    # text labels on the map
    geom_text(
      data = locs_cent_sp_name,
      mapping = aes(x=x, y=y, label = borough),
      size = 2,
      color = 'black',
      fontface = 'bold',
      angle = 20,
      inherit.aes = F) +
    coord_fixed() +
    theme_void()
  
  # 3.1.10. save plot
  png('img/map_names.png', width = 3508, height = 1700, res=300)
  map_names
  dev.off()
}

Visualisation 1

We start the analysis by examining the distribution of the dependent variable, and the key variables of the gravity model of migration, as well as relative income per capita, in boroughs in 2008 and 2018 (start and end of the analysis). I look also at income, because, according to the existing literature on suburbanisation, it is supposed to have great positive influence on the number of suburban migrants (Mieszkowski and Mills (1993)).

  # 2. variable levels plots

  # 2.1.  define plot function
plt_metric <- function (var, year, title, caption, legend, breaks) {
  var <- enquo(var)
  ggplot() +
    geom_polygon(data = locs_sp[locs_sp$year == year,], aes(x=long, y=lat, group=group, fill=!!var), colour='black', size=0.5, alpha=0.9) +
    coord_sf() +
    scale_fill_viridis(trans = "log1p",
                       breaks = breaks, labels = breaks
    ) +
    labs(
      caption = caption,
      title = title,
      fill = legend
    ) +
    theme_void() +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.border = element_blank(),
      plot.background = element_blank(),
      plot.title = element_text(color="#000000", face="bold", size=16, hjust=0),
      plot.caption = element_text(color="#000000", face="bold", size=10, hjust=1),
      plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"),
      strip.text.x = element_text(color="#505050", face="bold", size=12, hjust=0.5),
      legend.title = element_text(color="#505050", face="bold", size=12, hjust=1),
      legend.text = element_text(color="#505050", size=10, angle = 0),
      legend.position = 'bottom'
    ) +
    guides(
      fill = guide_colourbar(
        ticks.colour = "white",
        ticks.linewidth = 1.2,
        barwidth=15,
        label.position="bottom"))
}

  # 2.2. define plot function to compare 2 years
plt_compare <- function (var, year1, year2, title, caption, legend, breaks) {
  var <- enquo(var)
  ggplot() +
    geom_polygon(data = locs_sp[locs_sp$year %in% c(year1, year2),], aes(x=long, y=lat, group=group, fill=!!var), colour='black', size=0.5, alpha=0.9) +
    coord_sf() +
    facet_wrap(~year) +
    scale_fill_viridis(trans = "log1p",
                       breaks = breaks, labels = breaks
    ) +
    labs(
      caption = caption,
      title = title,
      fill = legend
    ) +
    theme_void() +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.border = element_blank(),
      plot.background = element_blank(),
      plot.title = element_text(color="#000000", face="bold", size=16, hjust=0),
      plot.caption = element_text(color="#000000", face="bold", size=10, hjust=1),
      plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"),
      strip.text.x = element_text(color="#505050", face="bold", size=12, hjust=0.5),
      legend.title = element_text(color="#505050", face="bold", size=12, hjust=1),
      legend.text = element_text(color="#505050", size=10, angle = 0),
      legend.position = 'bottom'
    ) +
    guides(
      fill = guide_colourbar(
        ticks.colour = "white",
        ticks.linewidth = 1.2,
        barwidth=15,
        label.position="bottom"))
}

  # 2.3. plot bu/bur/br
plt_bu_bur_bu <- locs_sp %>%
  mutate(
    b = case_when(
      bu == '1' ~ 'urban',
      bur == '1' ~ 'urban-rural',
      T ~ 'rural'
    )) %>%
  filter(
    year == 2018
  ) %>%
  ggplot() +
  geom_polygon(aes(x=long, y=lat, group=group, fill=b), colour='black', size=0.5, alpha=0.9) +
  coord_sf() +
  scale_fill_viridis(discrete=T) +
  labs(
    caption = 'Data Source: Polish Statistical Office',
    title = 'a) type of borough',
    fill = 'type'
  ) +
  theme_void() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    plot.background = element_blank(),
    plot.title = element_text(color="#000000", face="bold", size=16, hjust=0),
    plot.caption = element_text(color="#000000", face="bold", size=10, hjust=1),
    plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"),
    strip.text.x = element_text(color="#505050", face="bold", size=12, hjust=0.5),
    legend.title = element_text(color="#505050", face="bold", size=12, hjust=1),
    legend.text = element_text(color="#505050", size=10, angle = 0),
    legend.position = 'bottom'
  )

  # 2.4. plots for time-invariant variables
plt_dist <- plt_metric(dist, 2018, 'c) distance', 'Data Source: Google Maps', 'km', c(10, 12, 15, 20, 25, 30, 40, 50))

plt_train <- locs_sp %>%
  mutate(
    train = case_when(
      train == '1' ~ 'presence of suburban train station',
      T ~ 'lack of suburban train station'
    )) %>%
  filter(
    year == 2018
  ) %>%
  ggplot() +
  geom_polygon(aes(x=long, y=lat, group=group, fill=train), colour='black', size=0.5, alpha=0.9) +
  coord_sf() +
  scale_fill_viridis(discrete=T) +
  labs(
    caption = 'Data Source: Google Maps',
    title = 'b) suburban train station',
    fill = ''
  ) +
  theme_void() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    plot.background = element_blank(),
    plot.title = element_text(color="#000000", face="bold", size=16, hjust=0),
    plot.caption = element_text(color="#000000", face="bold", size=10, hjust=1),
    plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"),
    strip.text.x = element_text(color="#505050", face="bold", size=12, hjust=0.5),
    legend.title = element_text(color="#505050", face="bold", size=12, hjust=1),
    legend.text = element_text(color="#505050", size=10, angle = 0),
    legend.position = 'bottom'
  )


  # 2.5. combined plot for all variables
plt_check_in <- plt_compare(check_in, 2008, 2018, 'd) migrants', 'Data Source: Polish Statistical Office', '# of migrants', c(1, 2, 5, 10, 25, 50, 100, 250, 750))

plt_income <- plt_compare(income, 2008, 2018, 'e) income per capita', 'Data Source: own calculation based on data from Polish Statistical Office', 'ratio', c(0.3, 0.5, 0.75, 1, 1.5, 2, 2.5, 3))

plt_kinder <- plt_compare(kinder, 2008, 2018, 'f) number of kindergartens', 'Data Source: Polish Statistical Office', '# of kindergartens', c(1, 2, 5, 10, 25, 50, 75, 100))

plt_unempl <- plt_compare(unempl, 2008, 2018, 'g) unemployment rate', 'Data Source: Polish Statistical Office', 'ratio', seq(0, 0.1, 0.01))

plt_pop_density <- plt_compare(pop_density, 2008, 2018, 'h) population density', 'Data Source: Polish Statistical Office', '', c(0.5, 1, 2, 3, 5, 7, 10, 15, 20, 30)) +
  labs(fill = expression(bold(paste('# of people / ', km^2))))

plt_greenery <- plt_compare(greenery, 2008, 2018, 'i) area of green amenities (parks etc.)', 'Data Source: Polish Statistical Office', 'area in ha', c(2, 5, 10, 25, 50, 100, 250))

plt_average_b_waw <- plt_compare(average_b_waw, 2008, 2018, 'j) relative price of housing', 'Data Source: Polish Statistical Office & suburban county officies', 'ratio', c(0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6))


png('img/maps_combined.png', width = 3508, height = 2*2480, res=300)
plot_grid(plt_bu_bur_bu, plt_dist, plt_train, plt_check_in, plt_income, plt_kinder, plt_unempl, plt_pop_density, plt_greenery, plt_average_b_waw, ncol = 2, greedy=F)
dev.off()

Two things are visible on the above plots. First of all, that the levels of variables changed between the first and the last year in our analysis, e.g. the number of kindergartens per 100 000 people increased in all boroughs. Second, great disparities between boroughs are visible for most of the variables, e.g. in income per capita or area of green amenities.

Visualisation 2

We also happen to have the location data about roads and buildings in each of the boroughs in our analysis, that we scraped from the OSM for our Master thesis. Here, we exploit it for additional visual inspection of the degree of urbanisation of the suburban ring. However, these location data won’t be used in our spatial models. Below, we plot the locations of roads and buildings for three representative boroughs: Pruszków, which is classified as type “urban,” Konstancin-Jeziorna (type “urban-rural”), and Baranów (type “rural”).

These plots serve rather the purpose of presenting an interesting visualisation, than contributing to out spatial models.

  # 3. plot with highways and buildings
{
  # 3.1. large highway types
  hghws_tp_lrg <- c('motorway', 'trunk', 'primary', 'secondary', 'tertiary')
  
  # 3.2. select only large highways
  hghws_lrg <- hghws$osm_lines[hghws$osm_lines$highway %in% hghws_tp_lrg,]
  
  # 3.3. function to plot boroughs_osm
  plt_loc <- function(borough = 'all', l=T, b=T, h=T, hl=T) {
    # create spatial object for loc, bldgs, hghws based on selection go borough
    if (borough=='all') {
      locs_sp <- as(locs$osm_multipolygon, 'Spatial')
      bldgs_sp <- as(bldgs$osm_polygons, 'Spatial')
      hghws_sp <- as(hghws$osm_lines, 'Spatial')
      hghws_lrg_sp <- as(hghws_lrg, 'Spatial')
    } else if (class(borough) == 'character') {
      locs_sp <- as(locs$osm_multipolygon[locs$osm_multipolygon$borough %in% borough,], 'Spatial')
      bldgs_sp <- as(bldgs$osm_polygons[bldgs$osm_polygons$borough %in% borough,], 'Spatial')
      hghws_sp <- as(hghws$osm_lines[hghws$osm_lines$borough %in% borough,], 'Spatial')
      hghws_lrg_sp <- as(hghws_lrg[hghws_lrg$borough %in% borough,], 'Spatial')
    } else if (class(borough) == 'numeric') {
      locs_sp <- as(locs$osm_multipolygon[borough,], 'Spatial')
      bldgs_sp <- as(bldgs$osm_polygons[borough,], 'Spatial')
      hghws_sp <- as(hghws$osm_lines[borough,], 'Spatial')
      hghws_lrg_sp <- as(hghws_lrg[borough,], 'Spatial')
    }
    
    legend <- c()
    if (l) {legend <- append(legend, c('border' = 'black'))}
    if (b) {legend <- append(legend, c('building' = '#482677'))}
    if (hl) {legend <- append(legend, c('large road' = '#1F968B'))}
    if (h) {legend <- append(legend, c('road' = '#73D055'))}
    
    # plot
    p <- ggplot() +
      {if (h) geom_line(data = hghws_sp, aes(x=long, y=lat, group=group, colour = 'road'), size=0.2)} +
      {if (hl) geom_line(data = hghws_lrg_sp, aes(x=long, y=lat, group=group, colour = 'large road'), size=0.4)} +
      {if (b) geom_polygon(data = bldgs_sp, aes(x=long, y=lat, group=group, colour = 'building'), fill = '#482677', size=0)} +
      {if (l) geom_polygon(data = locs_sp, aes(x=long, y=lat, group=group, colour = 'border'), fill=NA, size=1)} +
      labs(
        title = borough,
        color = NULL) +
      scale_color_manual(values = legend) +
      theme_void() +
      coord_fixed() +
      theme(
        plot.title = element_text(color="#000000", face="bold", size=16, hjust=0.5),
        legend.position = 'bottom',
        legend.text = element_text(size=10)
      ) +
      guides(color = guide_legend(
        keyheight = 1, 
        keywidth = 1,
        override.aes = list(fill = legend)
      ))
    
    print(p)
    
    return(p)
  }
  
  # save plot for Pruszków
  png('img/map_pruszkow.png', width = 3508, height = 2000, res=300)
  plt_loc('Pruszków')
  dev.off()
  
  # save plot for Konstancin Jeziorna
  png('img/map_konstancin.png', width = 3508, height = 2000, res=300)
  plt_loc('Konstancin-Jeziorna')
  dev.off()
  
  # save plot for Jaktorów
  png('img/map_jaktorow.png', width = 3508, height = 2000, res=300)
  plt_loc('Jaktorów')
  dev.off()
}

It can be seen that the map for Pruszków is much denser than the map of Konstancin-Jeziorna, which, in turn, is much denser than the map for Jaktorów. As expected, the degree of urbanization varies between boroughs greatly and it is the highest for the urban borough and the lowest for the rural borough.

Spatial Models

Data Preparation

Below we run the necessary data preparation before we can run the spatial models. It includes loading all necessary datasets, cleaning, and mergind them. Especially, we need to merge data from the Polish Statistical Office with OSM locations of boroughs.

# 1. Since our code was deleted, we will be using the files that were saved during development, but the webscraping code is unavailable.
{
  # 1.1. file with dependent and independent variables in years 2008-2019 (only check_in is available for 2019)
data <- read.csv('data/data.csv', sep = ',')

  # 1.2. file with OSM IDs for all borough (was required for downloading the data from OSM and will be used to sort the observations)
boroughs_osm <- read.csv('data/boroughs_osm.csv', sep = ',')

  # 1.3. file with borough polygons (downloaded from OSM using osmdata package and transformed to correct CRS)
locs <- readRDS('data/locs.RDS')

  # 1.4. file with borough polygons joined with information from data file (locs + data)
locs_sp <- readRDS('data/locs_sp.RDS')

  # 1.5. file with buildings
# bldgs <- readRDS('data/bldgs.RDS')

  # 1.6 file with roads
# hghws <- readRDS('data/hghws.RDS')
}

# 2. Let's prepare data for modeling
{
  # 2.1. filter out 2019 since it only contains target variable
data_model <- data %>%
  filter(year != 2019)

  # 2.2. some variables do not change over time, and thus we should find only spatial lags for them
variables_constant <- c('bu', 'bur', 'br', 'dist', 'train')
    
  # 2.3. now let's select the variables that do change over time
variables_temporal <- c('check_in', 'income', 'kinder', 'unempl', 'pop_density', 'greenery', 'average_b_waw')

  # 2.4. convert binary variables to categorical
data_model <- data_model %>%
  mutate_at(c('bu', 'bur', 'br', 'train'), as.factor)

  # 2.4. add osm_id and select columns for modeling
data_model <- data_model %>%
  merge(boroughs_osm, by = 'borough') %>%
  dplyr::select(osm_id, year, variables_constant[variables_constant!='br'], all_of(variables_temporal)) %>%
  arrange(osm_id, year)

  # 2.5. create spatial weights matrix
locs$osm_multipolygons <- arrange(locs$osm_multipolygons, rownames(locs$osm_multipolygons))

cont.nb <- poly2nb(as(locs$osm_multipolygons, 'Spatial'))
cont.listw <- nb2listw(cont.nb, style="W")
}

Modeling

We begin the modeling by setting two equations: one with all the variables, and the second one with only time-varying variables, due to the impossibility of including time-fixed variables in the spatial lag model with individual effects.

# 4. Modeling

  # 4.1. create formulas - form_all (all variables) & form_time (time-varying variables)
form_all <- check_in ~ bu + bur + dist + train + income + kinder + unempl + pop_density + greenery + average_b_waw
form_time <- check_in ~ income + kinder + unempl + pop_density + greenery + average_b_waw

We test for spatial autocorrelation and random regional effects on both formulas. In each case, we get a result that spatial autocorrelation (dependence), as well as random regional effects are present in our data.

  # 4.2 testing - performed on both formulas
bsjktest(form_all, data=data_model, listw=cont.listw, test="C.1") # H0 rejected -> 'spatial dependence in error terms, sub RE and serial corr.'

## 
##  Baltagi, Song, Jung and Koh C.1 conditional test
## 
## data:  check_in ~ bu + bur + dist + train + income + kinder + unempl +     pop_density + greenery + average_b_waw
## LM = 20.684, df = 1, p-value = 5.418e-06
## alternative hypothesis: spatial dependence in error terms, sub RE and serial corr.

bsjktest(form_time, data=data_model, listw=cont.listw, test="J") # H0 rejected -> 'random effects or serial corr. spatial dependence in error terms'

## 
##  Baltagi, Song, Jung and Koh joint test (J)
## 
## data:  check_in ~ income + kinder + unempl + pop_density + greenery +     average_b_waw
## LM = 2229.5, df = 3, p-value < 2.2e-16
## alternative hypothesis: random effects or serial corr. or spatial dependence in error terms

bsktest(form_all, data=data_model, listw=cont.listw, test="LM1", standardize=TRUE) # H0 rejected -> Random effects

## 
##  Baltagi, Song and Koh SLM1 marginal test
## 
## data:  check_in ~ bu + bur + dist + train + income + kinder + unempl +     pop_density + greenery + average_b_waw
## SLM1 = 51.713, p-value < 2.2e-16
## alternative hypothesis: Random effects

bsktest(form_time, data=data_model, listw=cont.listw, test="LM1", standardize=TRUE) # H0 rejected -> Random effects

## 
##  Baltagi, Song and Koh SLM1 marginal test
## 
## data:  check_in ~ income + kinder + unempl + pop_density + greenery +     average_b_waw
## SLM1 = 49.321, p-value < 2.2e-16
## alternative hypothesis: Random effects

bsktest(form_all, data=data_model, listw=cont.listw, test="LM2", standardize=TRUE) # H0 rejected -> Spatial autocorrelation

## 
##  Baltagi, Song and Koh LM2 marginal test
## 
## data:  check_in ~ bu + bur + dist + train + income + kinder + unempl +     pop_density + greenery + average_b_waw
## SLM2 = 3.739, p-value = 0.0001847
## alternative hypothesis: Spatial autocorrelation

bsktest(form_time, data=data_model, listw=cont.listw, test="LM2", standardize=TRUE) # H0 rejected -> Spatial autocorrelation

## 
##  Baltagi, Song and Koh LM2 marginal test
## 
## data:  check_in ~ income + kinder + unempl + pop_density + greenery +     average_b_waw
## SLM2 = 3.7418, p-value = 0.0001827
## alternative hypothesis: Spatial autocorrelation

Then, we move to testing for the specification of our model with the Hausman test for spatial panel data models. Rejecting the null hypothesis, we get the result that fixed effect model is a correct specification.

sphtest(form_time, data=data_model, listw=cont.listw, spatial.model="error", method="ML", errors = 'BSK') # H0 rejected -> Fixed Effects

## 
##  Hausman test for spatial models
## 
## data:  x
## chisq = 25.933, df = 6, p-value = 0.0002291
## alternative hypothesis: one model is inconsistent

sphtest(form_time, data=data_model, listw=cont.listw, spatial.model="lag", method="ML", errors = 'BSK') # H0 rejected -> Fixed Effects

## 
##  Hausman test for spatial models
## 
## data:  x
## chisq = 20.543, df = 6, p-value = 0.002216
## alternative hypothesis: one model is inconsistent

sphtest(form_time, data=data_model, listw=cont.listw, spatial.model="sarar", method="ML", errors = 'BSK') # H0 rejected -> Fixed Effects

## 
##  Hausman test for spatial models
## 
## data:  x
## chisq = 11.433, df = 6, p-value = 0.07587
## alternative hypothesis: one model is inconsistent

Thus, we can move to modeling. First, we run regular panel fixed effect model to compare the results with the spatial models later. We try the “within” FE method with borough- and time-fixed effects. We also try a simple pooled model for comparison with a simple pooled spatial model. At the end of this section, we compare all models and interpret the results of the best one - with regards to adjusted AIC, and adjusted \(R^2\).

  # 4.3. plm models - without spatial errors and coefficients

    # 4.3.1. FE for time
model_plm_1 <- plm(form_all, data=data_model, model="within", effect="time")

    # 4.3.2 FE for boroughs
model_plm_2 <- plm(form_time, data=data_model, model="within", effect="individual")

    # 4.3.3. FE for time & boroughs
model_plm_3 <- plm(form_time, data=data_model, model="within", effect="twoway")

    # 4.3.4. pooling
model_plm_4 <- plm(form_all, data=data_model, model="pooling")

We run 8 different spatial models. They differ with respect to the estimation type (spatio-temporal lag vs. pooling) and the type of fixed effects included.

  # 4.4. spml models - with spatial errors and coefficients

    # 4.4.1. FE for time with Baltagi errors and y lags
model_spml_1 <-spml(form_all, data=data_model, listw=cont.listw, model="within", spatial.error="b", lag=T, effect="time")

    # 4.4.2. FE for time with Baltagi errors 
model_spml_2<-spml(form_all, data=data_model, listw=cont.listw, model="within", spatial.error="b", lag=F, effect="time")

    # 4.4.3. FE for boroughs with Baltagi errors and y lags
model_spml_3<-spml(form_time, data=data_model, listw=cont.listw, model="within", spatial.error="b", lag=T, effect="individual")

    # 4.4.4. FE for boroughs with Baltagi errors
model_spml_4<-spml(form_time, data=data_model, listw=cont.listw, model="within", spatial.error="b", lag=F, effect="individual")

    # 4.4.5. FE for time & boroughs with Baltagi errors and y lags
model_spml_5<-spml(form_time, data=data_model, listw=cont.listw, model="within", spatial.error="b", lag=T, effect="twoway")

    # 4.4.6. FE for time & boroughs with Baltagi errors
model_spml_6<-spml(form_time, data=data_model, listw=cont.listw, model="within", spatial.error="b", lag=F, effect="twoway")

    # 4.4.7. pooling with Baltagi errors and y lags
model_spml_7<-spml(form_all, data=data_model, listw=cont.listw, model="pooling", spatial.error="b", lag=T)

    # 4.4.8. pooling with Baltagi errors
model_spml_8<-spml(form_all, data=data_model, listw=cont.listw, model="pooling", spatial.error="b", lag=F)

  # 4.5. models' statistics
{
  model_stats <- data.frame(
    model_name = rep('', 12),
    params = c(11 + 11, 7 + 69, 7 + 11 + 69, 11, 11 + 11 + 2, 11 + 11 + 1, 7 + 69 + 2, 7 + 69 + 1, 7 + 11 + 69 + 2, 7 + 11 + 69 + 1, 11 + 2, 11 + 1),
    formula = c('form_all', rep('form_time', 2), rep('form_all', 3), rep('form_time', 4), rep('form_all', 2)),
    type = c(rep('plm', 4), rep('spml', 8)),
    param_model = c(rep("within", 3), 'pooling', rep("within", 6), rep('pooling', 2)),
    param_spatial.error = c(rep(NA, 4), rep("b", 8)),
    param_lag = c(rep(NA, 4), rep(c(T, F), 4)),
    param_effect = c('time', 'individual', 'twoway', NA, rep('time', 2), rep('individual', 2), rep('twoway', 2), rep(NA, 2)),
    rmse = rep(0, 12),
    r2 = rep(0, 12),
    r2_adj = rep(0, 12),
    aic_adj = rep(0, 12)
  )
  
  i <- 1
  
  for (model in c(paste0('model_plm_', seq(1,4)), paste0('model_spml_', seq(1,8)))) {
    
    model_stats[i,'model_name'] <- model
    model_stats[i,'rmse'] <- rmse(get(model)$residuals)
    model_stats[i,'r2'] <- r2(data_model$check_in, get(model)$residuals)
    model_stats[i,'r2_adj'] <- r2_adj(data_model$check_in, get(model)$residuals, model_stats[i,'params'])
    model_stats[i,'aic_adj'] <- AIC_adj(get(model), model_stats[i,'params'])
    i <- i+1
  }
}

Then, we print the models results and we can choose the best one.

model_stats %>%
  arrange(desc(r2_adj)) %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "responsive", "hover")) %>%
  scroll_box(width = "100%")

model_name	params	formula	type	param_model	param_spatial.error	param_lag	param_effect	rmse	r2	r2_adj	aic_adj
model_spml_5	89	form_time	spml	within	b	TRUE	twoway	28.86045	0.9639835	0.9591921	7436.181
model_plm_3	87	form_time	plm	within	NA	NA	twoway	28.91338	0.9638513	0.9591643	7434.963
model_spml_6	88	form_time	spml	within	b	FALSE	twoway	28.91432	0.9638489	0.9591007	7437.012
model_plm_2	76	form_time	plm	within	NA	NA	individual	30.65947	0.9593533	0.9548238	7501.974
model_spml_4	77	form_time	spml	within	b	FALSE	individual	30.68255	0.9592921	0.9546893	7505.116
model_spml_3	78	form_time	spml	within	b	TRUE	individual	32.28592	0.9549264	0.9497562	7584.439
model_spml_1	24	form_all	spml	within	b	TRUE	time	92.46229	0.6303209	0.6182333	9073.632
model_plm_1	22	form_all	plm	within	NA	NA	time	92.69488	0.6284587	0.6173528	9073.446
model_spml_2	23	form_all	spml	within	b	FALSE	time	92.70981	0.6283390	0.6167088	9075.691
model_plm_4	11	form_all	plm	pooling	NA	NA	NA	95.63200	0.6045405	0.5987171	9098.799
model_spml_8	12	form_all	spml	pooling	b	FALSE	NA	95.73107	0.6037207	0.5973462	9102.371
model_spml_7	13	form_all	spml	pooling	b	TRUE	NA	95.83346	0.6028725	0.5959428	9105.994

# model_spml_5 has lowest RMSE and highest R2 Adjusted, but model_plm_3 has the lowest Adjusted AIC

The models are sorted with respect to the adjusted \(R^2\). It can be seen that spatial model 5 performs the best. It is a model with FEs for time & boroughs with Baltagi errors and y lags. In terms of adjusted \(R^2\), it is followed by the fixed effects panel model with FEs for time & boroughs, which also exhibits the lowest adjusted AIC. Thus, the outcome of those two models will be interpreted.

  # 4.6. final model
    # 4.6.1. summary
summary(model_spml_5)

## Spatial panel fixed effects sarar model
##  
## 
## Call:
## spml(formula = form_time, data = data_model, listw = cont.listw, 
##     model = "within", effect = "twoway", lag = T, spatial.error = "b")
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -164.2932  -13.4055    0.2155   12.3383  210.4628 
## 
## Spatial error parameter:
##     Estimate Std. Error t-value Pr(>|t|)
## rho 0.062121   0.378587  0.1641   0.8697
## 
## Spatial autoregressive coefficient:
##        Estimate Std. Error t-value Pr(>|t|)
## lambda -0.12398    0.39531 -0.3136   0.7538
## 
## Coefficients:
##                  Estimate  Std. Error t-value  Pr(>|t|)    
## income          45.618448   11.996806  3.8025 0.0001432 ***
## kinder          -0.251682    0.145335 -1.7317 0.0833206 .  
## unempl        -143.171586  201.772821 -0.7096 0.4779719    
## pop_density      0.225314    2.628047  0.0857 0.9316777    
## greenery        -0.097578    0.073041 -1.3359 0.1815706    
## average_b_waw    0.144876   15.525485  0.0093 0.9925546    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model_plm_3)

## Twoways effects Within Model
## 
## Call:
## plm(formula = form_time, data = data_model, effect = "twoway", 
##     model = "within")
## 
## Balanced Panel: n = 69, T = 11, N = 759
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -1.6747e+02 -1.2706e+01  1.5163e-04  1.2299e+01  2.1213e+02 
## 
## Coefficients:
##                  Estimate  Std. Error t-value  Pr(>|t|)    
## income          45.188565   12.755764  3.5426 0.0004235 ***
## kinder          -0.249277    0.154610 -1.6123 0.1073653    
## unempl        -132.678645  212.238455 -0.6251 0.5320911    
## pop_density      0.141593    2.787537  0.0508 0.9595038    
## greenery        -0.097419    0.077638 -1.2548 0.2099894    
## average_b_waw    0.244109   16.453777  0.0148 0.9881674    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    652580
## Residual Sum of Squares: 634510
## R-Squared:      0.027681
## Adj. R-Squared: -0.093498
## F-statistic: 3.19803 on 6 and 674 DF, p-value: 0.0042043

Both models are consistent in that they identify relative income to be the sole strongly significant pulling factor for migrants. According to both models, if the relative income rises by 1, the number of migrants rises by approximately 45 people. Such order of magnitude is reasonable, since 1 is a huge change in relative income, due to how it is defined. This result conforms to the “natural evolution theory” by Mieszkowski and Mills (1993). Below, we examine the direct and indirect impacts and individual effects. We inspect only the borough-fixed effects because all time effects turned out insignificant in our best models.

  # 4.6.2. impacts
impacts <- impacts(model_spml_5, listw = cont.listw, time = 11)

impacts <- data.frame(
  Variable = attributes(impacts)$bnames,
  Direct = round(impacts$res[1][[1]], 2),
  Indirect = round(impacts$res[2][[1]], 2),
  Total = round(impacts$res[3][[1]], 2)
)

impacts %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "responsive", "hover")) %>%
  scroll_box(width = "100%")

Variable	Direct	Indirect	Total
income	45.76	-5.18	40.59
kinder	-0.25	0.03	-0.22
unempl	-143.62	16.24	-127.38
pop_density	0.23	-0.03	0.20
greenery	-0.10	0.01	-0.09
average_b_waw	0.15	-0.02	0.13

It can be seen that the indirect effect of the relative income is negative, while the direct effect is positive. The negative indirect impact can be essentially interpreted in such a way that people are less likely to move to a particular borough, if the surrounding boroughs have high relative income, which makes a lot of sense.

As a last step, we can inspect the borough individual effects.

  # 4.6.3 individual effects (boroughs)
effects_individual <- effects(model_spml_5)$SETable %>%
  as.data.frame() %>%
  mutate(
    borough = locs$osm_multipolygons$borough,
    osm_id = locs$osm_multipolygons$osm_id,
    effect_bin = factor(case_when(
      Estimate > -250 & Estimate <= 0 ~ '-250 - 0',
      Estimate > 0 & Estimate <= 250 ~ '0 - 250',
      Estimate > 250 & Estimate <= 500 ~ '250 - 500',
      Estimate > 500 & Estimate <= 1000 ~ '500 - 1000'),
      levels = c('-250 - 0', '0 - 250', '250 - 500', '500 - 1000')),
    effect_significant = ifelse(`Pr(>|t|)` <= 0.05, 'T', 'F')
  )

  # 4.6.4. map of individual effects
locs_sp <- locs_sp %>% 
  merge(., effects_individual, by.x='id', by.y='osm_id')

map_effects <- ggplot() +
  geom_polygon_pattern(data = locs_sp[locs_sp$effect_significant == 'F', ], aes(x=long, y=lat, group=group, fill=effect_bin), pattern = 'magick', colour='black', size=0.5, alpha=0.9) +
  geom_polygon(data = locs_sp[locs_sp$effect_significant == 'T', ], aes(x=long, y=lat, group=group, fill=effect_bin), colour='black', size=0.5, alpha=0.9) +
  coord_sf() +
  scale_fill_viridis(discrete=T) +
  #annotate(geom = 'text', label = '', x = 50, y = -2) + 
  labs(
    caption = 'Data Source: own calculation',
    title = 'Individual Effects',
    fill = 'dots represent insignificant effects (5%)'
  ) +
  theme_void() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    plot.background = element_blank(),
    plot.title = element_text(color="#000000", face="bold", size=16, hjust=0.5),
    plot.caption = element_text(color="#000000", face="bold", size=10, hjust=1),
    plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"),
    strip.text.x = element_text(color="#505050", face="bold", size=12, hjust=0.5),
    legend.title = element_text(color="#505050", size=12, hjust=0.5),
    legend.text = element_text(color="#505050", size=10, angle = 0),
    legend.position = 'bottom'
  ) + 
  guides(fill = guide_legend(title.position = "bottom"))

png('img/map_effects.png', width = 3508, height = 2480, res=300)
map_effects
dev.off()
}

The majority of the individual effects are significant at 5% level. It can be seen that more than 50% of the boroughs have negative individual effects, i.e. people rather do not migrate there. However, it might be the case that relative income alone captures migration to those places.

Conclusions

We received a consistent result from both, the FE within panel model, and the FE within spatial panel model, i.e. that income is the only significant predictor of spatial suburban movement of people. We are a little bit concerned that our models yielded only one highly significant regressor, but this issue can be potentially backed up with migration theory and it is probable that the “richness” of a destination place, relative to other potential places, is a sole determinant of suburban migration choices.

Yet, one caveat remains to be addressed. Our two best models use 89 and 87 predictors, respectively. At the same time, we model 759 observations. The included ratio of observations to regressors violates the rule of thumb (traditionally, it is 10). Thus, an issue of overfitting can be present in our models. Overcoming such issue would require adding a penalty on the parameters. An elastic-net style penalty can possibly be added to any regression model. This is, however, out of the scope of this assignment. However, if our model was intended for publication, the potential issue of overfitting would require more insight.

References

Bogusz, H. 2021. “Causes of the Spatially Uneven Outflow of Warsaw Inhabitants to the City’s Suburbs: An Economic Analysis of the Problem.” WNE UW Working Papers. Warsaw: Faculty of Economic Sciences.

Mieszkowski, P., and E. S. Mills. 1993. “The Causes of Metropolitan Suburbanization.” Journal of Economic Perspectives 7 (3): 135–47.