In this study, we examine part of the dataset from the paper “Causes of the spatially uneven outflow of Warsaw inhabitants to the city’s suburbs: an economic analysis of the problem” (Bogusz, 2020), which is currently under review in Geographia Polonica. In that paper, I investigate the suburban migration patterns of Warsaw inhabitants (the outflow from the inner city to suburban boroughs) and aim to identify the borough characteristics that act as key pulling factors for migrants. I built a correlated random effects panel model with the Mundlak device: I estimated the extended gravity model of migration equation by feasible GLS, incorporating Mundlak-style time averages of the variables that might be correlated with the transient errors. This specification allowed me to address possible endogeneity as well as collinearity with the transient errors. My work on this topic also included parsing unwieldy data from Polish county offices. These data contain every property market transaction from 2008-2019, from which I wanted to calculate the average price of housing at the borough level. This variable can, however, be a source of endogeneity in the model.
This topic is relevant because, once these pulling and pushing factors are identified, adequate spatial planning can be carried out to curb the negative consequences of unrestricted urban sprawl. The panel model in my article is based on the extended gravity model of migration.
Here, however, I examine only a subset of the data, corresponding to the year 2018, using non-linear algorithms. As single trees are of limited use for predicting a continuous outcome, we use two distinct approaches that combine multiple models in different ways: bootstrap aggregating (bagging) and boosting. Random Forest (Breiman (2001)) is an example of the bagging approach. It estimates multiple independent tree models, each trained on a different bootstrap sample of the original dataset; in addition, at each split of each tree, only a random subset of the predictors is considered. Extreme Gradient Boosting (XGB) (Chen and Guestrin (2016)) is an example of the boosting approach, which is also usually based on tree models. It builds the model iteratively, at each step trying to improve on the current ensemble by fitting a new tree to the observations that the previous steps fit poorly. In addition to capturing non-linear relationships, XGB is also capable of regularization, for example by shrinkage as in the Elastic Net. This exercise can give deeper insight into whether the relationships between the dependent variable and the regressors are linear or not, and can thus be useful for my research. To uncover the inner workings of these black-box algorithms, the XAI methods of permutation-based feature importance (Zhao and Hastie (2019)) and ALE plots (Apley and Zhu (2019)) are used.
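To make the bagging idea concrete, a minimal hand-rolled sketch built on single regression trees is given below. It is purely illustrative and uses my own naming; the actual analysis relies on the randomForest and xgboost packages, and a random forest additionally samples a random subset of predictors at each split.
library(rpart)
# Bagging in its simplest form: fit one tree per bootstrap sample and average
# the predictions of all trees (illustration only).
bagged_predict <- function(formula, data, newdata, n_trees = 100) {
  preds <- replicate(n_trees, {
    boot_idx <- sample(nrow(data), replace = TRUE)   # bootstrap sample of rows
    tree <- rpart(formula, data = data[boot_idx, ])  # one regression tree per sample
    predict(tree, newdata)
  })
  rowMeans(matrix(preds, nrow = nrow(newdata)))      # average over the ensemble
}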
Permutation-based variable importance was first introduced by Breiman (2001) for the Random Forest algorithm. Further research was done by Fisher, Rudin, and Dominici (2019), who proposed a model-agnostic tool for calculating the contribution of individual features to prediction accuracy. Variable importance is calculated by randomly permuting a variable and computing the increase in prediction error on the learning sample modified in this way. The loss function, which quantifies the goodness-of-fit of the model for each variable, is plotted for visual inspection of variable importance. Permutation Feature Importance (PFI) thus allows ranking the regressors by their contribution to prediction accuracy, and this ranking is applied to the results of each of our models. Furthermore, the most commonly used black-box visualization tool is the Partial Dependence Plot (PDP) introduced by Friedman (2001). It depicts the marginal effect of an input variable on the model outcome (ceteris paribus) and is a graphical representation of predictions: for a given variable \(X_i\), the PDP averages model predictions while keeping the values of \(X_i\) fixed. However, this method assumes no correlation between predictors, because the averaging ignores the dependence between features. As we show in the next chapter, that assumption is unrealistic in our case. Recently, Apley and Zhu (2019) proposed an extension of the PDP that takes this correlation bias into account, called Accumulated Local Effects (ALE). ALE also relates the average predicted outcome to the predictor value, but with a modification: for a given predictor, it averages changes in predictions over observations in a close neighbourhood of each value. The graphical representation is then analogous to the PDP. We plot the effect of the most important predictors (as indicated by Permutation Feature Importance) on the outcome variable for each of the models using ALE plots.
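As an illustration of the permutation idea, a minimal hand-rolled version is sketched below, mirroring the logic behind the vi_permute() function (from the vip package) used later in this document. It assumes a fitted model `fit` with a predict() method that accepts a data frame, a data frame `df`, the name of the target column, and RMSE as the loss; all names are illustrative.
# Hand-rolled permutation importance (illustration only).
permutation_importance <- function(fit, df, target, nsim = 100) {
  rmse_loss <- function(actual, preds) sqrt(mean((actual - preds)^2))
  baseline  <- rmse_loss(df[[target]], predict(fit, df))
  features  <- setdiff(names(df), target)
  sapply(features, function(feat) {
    mean(replicate(nsim, {
      df_perm <- df
      df_perm[[feat]] <- sample(df_perm[[feat]])  # break the link between feature and outcome
      rmse_loss(df[[target]], predict(fit, df_perm)) - baseline
    }))
  })
}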
We conduct a cross-sectional analysis of migration from Warsaw to 69 suburban boroughs in 2018. The dependent variable is the number of migrants who moved out of Warsaw and registered residence in a given borough. The regressors are:
Population density - one of the key variables of the gravity model of migration.
Distance (straight line) - the second key variable of the gravity model of migration.
Relative income per capita - average income in a borough (in PLN) divided by the average income in all boroughs weighted by population (in PLN). Calculated based on personal income tax data.
Unemployment rate - measures local labor market condition.
Total greenery spaces (parks, green plazas, urban forests, etc.) - an amenity.
Ratio of the forest area to the total area of a borough - measures to what extent the presence of forests constrains habitable zones (e.g. it is forbidden to settle in close proximity to the Kampinos National Park).
Number of kindergartens per 100 000 population - an amenity.
Mean price of m2 of housing in a borough divided by the same for Warsaw - a relative measure of housing prices. Below, I include a brief summary from my article “Causes of the spatially uneven outflow of Warsaw inhabitants to the city’s suburbs: an economic analysis of the problem” of how the price indicator was constructed from the transactional data obtained from Polish county offices.
Information about housing prices available from the Polish Statistical Office is limited to the county level for the years 2015-2018, which is a very rough measure of prices in boroughs. In my attempt to find a better one, I obtained real estate transactional data from Polish county offices for each of the 69 boroughs for the years 2008-2019 for the following property types: flats, houses, and parcels. Based on these parsed data, I constructed an indicator of the relative price of housing. I included observations for each borough, type, and year where the number of transactions was greater than 10. The average price of m2 of housing was obtained by weighting the average prices of the property types (in each borough and each year) by the number of transactions corresponding to each type. For flats, the price refers to the living space, while for houses and parcels it refers to the area of the parcel. The reason is that no information about the living space of houses was available for about half of the boroughs in my study; the availability of transaction details depends on the data classification and storage system a given county office uses, and there is no standardization of such systems in Poland. In addition, I limited the area of parcels to observations greater than 200 m2 and smaller than 2000 m2. The reason for this restriction is that properties classified as houses or parcels are not necessarily lots habitable by private agents, but can also be service establishments. On the one hand, 200 m2 is the minimum parcel size on which a habitable house, rather than a holiday cabin, can be built. On the other hand, 2000 m2 is a reasonable upper boundary, since private houses are rarely built on larger parcels in the suburbs of Warsaw (the average parcel size is between 1400 and 2000 m2 in 90% of the boroughs in the years examined), which suggests that larger parcels function as service establishments in the vast majority of cases. The average price obtained in this way was divided by the average price in Warsaw, taken from the Polish Statistical Office for 2015-2018 and extrapolated to 2008-2019 using the index of housing prices for Warsaw for 2008-2019. Hence, a price relative to the one in Warsaw was constructed. Furthermore, for some boroughs in some years no transactions were observed for some types, or at all. To complete the unbalanced panel, the well-known multiple imputation technique was used. Since in my setting observations were missing for only one regressor, predictive mean matching was applied: when imputing the missing values of a continuous variable, PMM may be preferable to linear regression when the normality of the underlying model is suspect (Little, 1988; Rubin, 1986). The imputation was made before calculating the time averages or the price relative to the price in Warsaw. (Bogusz, 2020)
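As an illustration of the weighting and rescaling steps described above, a dplyr sketch is shown below. The object and column names (transactions, avg_price_m2, n_transactions, warsaw_prices, price_warsaw) are hypothetical; the real preprocessing in the article is more involved, since the county office files are not standardized.
library(dplyr)
# Hypothetical input `transactions`: one row per borough x year x property type,
# with the average price of m2 and the number of transactions behind that average.
# `warsaw_prices` is a hypothetical lookup table with columns year and price_warsaw.
price_indicator <- transactions %>%
  filter(n_transactions > 10) %>%                     # keep borough/type/year cells with > 10 transactions
  group_by(borough, year) %>%
  summarise(
    price_borough = weighted.mean(avg_price_m2, w = n_transactions),
    .groups = "drop"
  ) %>%
  left_join(warsaw_prices, by = "year") %>%
  mutate(price_b_w = price_borough / price_warsaw)    # price relative to Warsaw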
I start the analysis by examining the distribution of the dependent variable, the key variables of the gravity model of migration, and relative income per capita across boroughs in 2018. I also look at income because, according to the existing literature on suburbanisation, it is supposed to have a strong positive influence on the number of suburban migrants (Mieszkowski and Mills (1993)). Mind the comment in the code.
# I load the dataset with metrics.
panel_metrics <- read.xlsx('/Users/honoratabogusz/Documents/Documents Mac/Data Science WNE 2019-2021/Master Thesis/data analysis/panel_metrics_poprawne.xlsx')
# I create logarithms for better clarity on the graph.
panel_metrics$l_check_in <- log(panel_metrics$check_in_2019)
panel_metrics$l_popdens <- log(panel_metrics$pop_density)
panel_metrics$l_dist <- log(panel_metrics$dist)
panel_metrics$l_income <- log(panel_metrics$income)
# I load an external file with coordinates previously scraped from OpenStreetMap.
locs <- readRDS('/Users/honoratabogusz/Documents/Documents Mac/Data Science WNE 2019-2021/Master Thesis/data analysis/locs.RDS')
# I prepare the data for plotting and merge the two datasets.
locs_sp <- as(locs$osm_multipolygon, 'Spatial')
panel_metrics <- merge(panel_metrics, locs_sp@data[,c('borough', 'osm_id')], by='borough')
locs_sp <- tidy(locs_sp)
locs_sp <- merge(locs_sp, panel_metrics, by.x='id', by.y='osm_id')
# Here is a custom function for a plot.
# I use the Python Viridis palette in all visualizations,
# as it is robust to all kinds of color blindness.
plot_metrics <- function (var, title, caption, legend) {
var <- enquo(var)
ggplot() +
geom_polygon(data = locs_sp, aes(x=long, y=lat, group=group, fill=!!var), colour='black', size=1) +
theme_bw() +
theme(
plot.title = element_text(color="#000000", face="bold", size=12, hjust=0, family='Palatino'),
axis.title = element_text(family='Palatino'),
axis.text = element_text(family='Palatino'),
legend.title = element_text(family='Palatino', face='bold'),
legend.text = element_text(family='Palatino')
) +
coord_fixed() +
scale_fill_viridis(name = legend) +
theme_void() +
labs(
title=title,
caption=caption
) +
theme(
plot.title = element_text(color="#000000", face="bold", size=12, hjust=0.5),
plot.caption = element_text(color="#000000", face="bold", size=8, hjust=1),
legend.title = element_text(color="#000000", face="bold", size=8, hjust=1),
legend.text = element_text(color="#000000", face="bold", size=8)
)
}
migrants <- plot_metrics(l_check_in, 'number of migrants', 'Data Source: Polish Statistical Office', '# of migrants')
population_density <- plot_metrics(l_popdens, 'population density', 'Data Source: Polish Statistical Office', ' # of people / km^2')
distance <- plot_metrics(l_dist, 'distance (straight line)', 'Data Source: Google Maps', 'km')
income <- plot_metrics(l_income, 'relative income per capita', 'Data Source: Polish Statistical Office', 'ratio')
grid.arrange(migrants, population_density, distance, income, ncol=2, nrow = 2)
panel_metrics <- NULL
It can be seen that none of the variables was evenly distributed across boroughs in that year. Some of the boroughs with a higher logarithm of the number of migrants are those with a higher logarithm of population density, a smaller logarithm of distance, and a higher logarithm of average relative income, which matches the anticipated relationships between the dependent variable and the three plotted regressors. Furthermore, we expect larger greenery spaces and more kindergartens to be pulling factors. Following the literature (Loibl (2004)), migrants should choose municipalities with lower residential lot prices relative to those in the core city. We do not expect the local job market conditions, represented by the unemployment rate, to play a role in the case of suburban migration, because migrants living in the suburban ring usually still commute to work in Warsaw’s core.
In my paper “Causes of the spatially uneven outflow of Warsaw inhabitants to the city’s suburbs: an economic analysis of the problem”, all continuous, non-ratio variables were log-transformed for two reasons. First, this conforms to the form of the extended gravity model of migration (Poot et al. (2016)). Secondly, log transformations are often convenient in linear modelling. However, there is no reason for functional transformations of the variables in tree-based models: since RF and XGB require no prior assumptions about the data and can adjust to it flexibly (Hastie, Tibshirani, and Friedman (2009)), no functional transformations are made. Below, I nevertheless check the relationships between the dependent variable and the four regressors that I expect to have the largest influence on the number of migrants: population density, distance, relative income per capita, and the relative price of housing.
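As a reminder of why logarithms arise naturally in the linear setting, a generic log-linear gravity specification is sketched below; this is an illustrative form, not necessarily the exact equation estimated in the article.
\[
\ln M_{ij} = \beta_0 + \beta_1 \ln P_i + \beta_2 \ln P_j - \beta_3 \ln D_{ij} + \gamma' X_{j} + \varepsilon_{ij},
\]
where \(M_{ij}\) is the number of migrants from origin \(i\) (Warsaw) to destination borough \(j\), \(P_i\) and \(P_j\) are the population sizes (or densities) of origin and destination, \(D_{ij}\) is the distance between them, and \(X_j\) collects the additional borough characteristics that extend the basic model. In a cross-section with a single origin, the \(P_i\) term is absorbed by the constant.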
# I load the dataset
suburbs <- read.xlsx('/Users/honoratabogusz/Documents/Documents Mac/Data Science WNE 2019-2021/III semester/ML2/suburbs.xlsx')
dim(suburbs)
## [1] 69 11
head(suburbs)
## borough migrants pop_density distance income unempl greenery
## 1 Baranow 21.0 0.7006902 38.61 0.5133392 0.018 5.45
## 2 Blonie 62.0 2.5245384 27.23 0.7403060 0.015 119.99
## 3 Brwinow 195.0 3.8095582 22.26 1.1664167 0.028 78.82
## 4 Ceglow 0.5 0.6435219 50.32 0.4977847 0.028 9.01
## 5 Celestynow 22.0 1.3135403 31.88 0.8076207 0.026 19.04
## 6 Czosnow 63.0 0.7743869 26.34 0.8492052 0.029 11.28
## forest kinder price_borough price_warsaw
## 1 0.001725511 18.94298 NA 8524.537
## 2 0.001859079 37.02847 4475.93993 10873.180
## 3 0.078390124 37.90032 2283.88973 11008.680
## 4 0.360807278 32.49919 53.26426 8795.534
## 5 0.521763383 25.68493 107.39982 10105.350
## 6 0.274057610 80.42626 219.59008 10331.180
set.seed(435632)
# some observations of the price of m2 in boroughs are missing, so they are imputed using predictive mean matching
imp_single <- mice(suburbs, m =1, method = "pmm") # impute only once
##
## iter imp variable
## 1 1 price_borough
## 2 1 price_borough
## 3 1 price_borough
## 4 1 price_borough
## 5 1 price_borough
suburbs <- complete(imp_single)
# check the shape of relationships between the dependent variable and the regressors
suburbs$price_b_w <- suburbs$price_borough/suburbs$price_warsaw
suburbs$price_borough <- NULL
suburbs$price_warsaw <- NULL
suburbs1 <- suburbs
suburbs$borough <- NULL
a <- ggplot(suburbs, aes(x = pop_density, y = migrants)) +
geom_point() +
labs(x="population density",y="number of migrants") +
geom_smooth(colour="green",method="loess", size=2) +
theme_classic()
b <- ggplot(suburbs, aes(x = distance, y = migrants)) +
geom_point() +
labs(x="distance (straight line in km)",y="number of migrants") +
geom_smooth(colour="green",method="loess", size=2) +
theme_classic()
c <- ggplot(suburbs, aes(x = income, y = migrants)) +
geom_point() +
labs(x="relative income per capita",y="number of migrants") +
geom_smooth(colour="green",method="loess", size=2) +
theme_classic()
d <- ggplot(suburbs, aes(x = price_b_w, y = migrants)) +
geom_point() +
labs(x="price of m2 in borough/price of m2 in Warsaw",y="number of migrants") +
geom_smooth(colour="green",method="loess", size=2) +
theme_classic()
grid.arrange(a, b, c, d, nrow = 2)
A quick approximation by LOESS suggests that the relationships might indeed be non-linear.
The LOOCV and the hyperparameter search are coded by hand here, so below I first include some helper functions.
# R2
r2 <- function (actual, preds) {
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
r2 <- 1 - rss/tss
return (r2)
}
# Adjusted R2
r2_adj <- function (actual, preds, p = 29) {
n <- length(actual)
r2_adj <- 1 - ((1 - r2(actual, preds)) * (n - 1) / (n - p - 1))
return (r2_adj)
}
# place holder for model stats on validation and training sample
model_stats <- data.frame(
model = character(),
rmse_valid = numeric(),
mae_valid = numeric(),
r2_valid = numeric(),
r2_adj_valid = numeric(),
rmse_train = numeric(),
mae_train = numeric(),
r2_train = numeric(),
r2_adj_train = numeric()
)
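The rmse() and mae() functions used in the grid searches below are not defined in this document; presumably they come from a package such as Metrics (this is an assumption on my part). Hand-written equivalents, following the rmse(actual, predicted) argument order used throughout, would be:
# Equivalent loss helpers (assuming the (actual, predicted) argument order used below)
rmse <- function(actual, preds) sqrt(mean((actual - preds)^2))
mae  <- function(actual, preds) mean(abs(actual - preds))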
I begin by setting a range of possible parameters. For each parameter combination, I run the LOOCV (69 models, one per left-out borough) to choose the optimal number of trees, the number of variables considered at each split, the minimum number of observations in a terminal node, and the maximum number of terminal nodes. After narrowing the range of parameters, I run the LOOCV once more to choose the final model. Mind that the two chunks of code below are computationally expensive; thus, only the second one is run here when compiling to html. The first one is included for demonstration purposes but is not run.
# this chunk is computationally expensive
# setting a range of possible parameters
ntrees <- 10 * 1:15
# number of variables considered at each split (mtry)
mtrys <- c(3, 5, 10, 15, 20)
# min number of observations in a terminal node
nodesizes <- c(5, 10, 15, 20)
# max number of terminal nodes (NULL means that the tree is grown to max possible extent)
maxnodes <- c(3, 5, 10)
rf_cv <- data.frame(
ntree = numeric(),
mtry = numeric(),
nodesize = numeric(),
maxnode = numeric(),
rmse = numeric(),
mae = numeric(),
r2 = numeric(),
r2_adj = numeric()
)
for (ntree in ntrees) {
print(ntree)
for (mtry in mtrys) {
for (nodesize in nodesizes) {
for (maxnode in maxnodes) {
rf_res_test <- c()
for (i in 1:69) {
# divide in train and test dataset
x <- 1:69
suburbs_train <- suburbs[x[-i],]
suburbs_test <- suburbs[x[i],]
# train rf
set.seed(435632)
rf <- randomForest(migrants~., data=suburbs_train, ntree = ntree, mtry = mtry, replace = F, nodesize = nodesize, maxnode = maxnode)
# calculate prediction and residual
rf_res_test_tmp <- as.data.frame(predict(rf, suburbs_test[2:length(suburbs_test)]))
rf_res_test <- append(rf_res_test, rf_res_test_tmp[1,1])
}
# append results
row <- data.frame(
ntree = ntree,
mtry = mtry,
nodesize = nodesize,
maxnode = maxnode,
rmse = rmse(suburbs$migrants, rf_res_test),
mae = mae(suburbs$migrants, rf_res_test),
r2 = r2(suburbs$migrants, rf_res_test),
r2_adj = r2_adj(suburbs$migrants, rf_res_test)
)
rf_cv <- rbind(rf_cv, row)
}
}
}
}
# choose the optimal model
rf_cv[rf_cv$rmse == min (rf_cv$rmse) | rf_cv$mae == min (rf_cv$mae) | rf_cv$r2 == max (rf_cv$r2), ]
The chunk of code below (the last part of the RF optimization) has to be run, as its results are needed later to construct the final RF model.
start_time <- Sys.time()
# set some parameters accordingly
ntrees <- 30:70
# number of variables considered at each split (mtry)
mtrys <- 7:13
# min number of observations in a terminal node
nodesizes <- 3:5
# max number of terminal nodes (NULL means that the tree is grown to max possible extent)
maxnodes <- 10
rf_cv_2 <- data.frame(
ntree = numeric(),
mtry = numeric(),
nodesize = numeric(),
maxnode = numeric(),
rmse = numeric(),
mae = numeric(),
r2 = numeric(),
r2_adj = numeric()
)
for (ntree in ntrees) {
print(ntree)
for (mtry in mtrys) {
for (nodesize in nodesizes) {
for (maxnode in maxnodes) {
rf_res_test <- c()
for (i in 1:69) {
# divide in train and test dataset
x <- 1:69
suburbs_train <- suburbs[x[-i],]
suburbs_test <- suburbs[x[i],]
# train rf
set.seed(435632)
rf <- randomForest(migrants~., data=suburbs_train, ntree = ntree, mtry = mtry, replace = F, nodesize = nodesize, maxnode = maxnode)
# calculate prediction and residual
rf_res_test_tmp <- as.data.frame(predict(rf, suburbs_test[2:length(suburbs_test)]))
rf_res_test <- append(rf_res_test, rf_res_test_tmp[1,1])
}
# append results
row <- data.frame(
ntree = ntree,
mtry = mtry,
nodesize = nodesize,
maxnode = maxnode,
rmse = rmse(suburbs$migrants, rf_res_test),
mae = mae(suburbs$migrants, rf_res_test),
r2 = r2(suburbs$migrants, rf_res_test),
r2_adj = r2_adj(suburbs$migrants, rf_res_test)
)
rf_cv_2 <- rbind(rf_cv_2, row)
}
}
}
}
## (progress output: ntree values 30 through 70 printed by the loop)
# choose the optimal model
rf_cv_2[rf_cv_2$rmse == min (rf_cv_2$rmse) | rf_cv_2$mae == min (rf_cv_2$mae) | rf_cv_2$r2 == max (rf_cv_2$r2), ]
## ntree mtry nodesize maxnode rmse mae r2 r2_adj
## 24 31 7 5 10 112.4533 70.85600 0.4923836 0.1149253
## 171 38 7 5 10 112.3918 71.97873 0.4929393 0.1158942
end_time <- Sys.time()
end_time - start_time
## Time difference of 9.627332 mins
We run the Random Forest with the optimal parameters found. I examine the importance plot, as well as some model statistics: RMSE, MAE, and R\(^2\).
set.seed(435632)
rf <- randomForest(migrants~., data=suburbs, importance = T, ntree=38, mtry=7, replace=F, nodesize=5, maxnode=10)
importance(rf)
## %IncMSE IncNodePurity
## pop_density 1.4730385 81099.66
## distance 2.8880587 456851.33
## income 3.6564708 92707.15
## unempl -2.3262192 19957.93
## greenery 1.7112943 146320.30
## forest -0.5695084 12252.34
## kinder 0.8652800 115156.37
## price_b_w 2.3197050 62554.37
rf
##
## Call:
## randomForest(formula = migrants ~ ., data = suburbs, importance = T, ntree = 38, mtry = 7, replace = F, nodesize = 5, maxnode = 10)
## Type of random forest: regression
## Number of trees: 38
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 13960.23
## % Var explained: 43.96
varImpPlot(rf)
rf_res <- as.data.frame(predict(rf, suburbs[2:length(suburbs)]))
rf_res$res <- rf_res[,1]
rmse(suburbs$migrants, rf_res$res)
## [1] 60.30409
mae(suburbs$migrants, rf_res$res)
## [1] 37.2874
r2(suburbs$migrants, rf_res$res)
## [1] 0.854023
# append model stats
row <- data.frame(
model = 'rf',
rmse_valid = rf_cv_2[rf_cv_2$mae == min (rf_cv_2$mae), 'rmse'],
mae_valid =rf_cv_2[rf_cv_2$mae == min (rf_cv_2$mae), 'mae'],
r2_valid = rf_cv_2[rf_cv_2$mae == min (rf_cv_2$mae), 'r2'],
rmse_train = rmse(suburbs$migrants, rf_res$res),
mae_train = mae(suburbs$migrants, rf_res$res),
r2_train = r2(suburbs$migrants, rf_res$res)
)
model_stats <- rbind(model_stats[model_stats$model != 'rf',], row)
model_stats
## model rmse_valid mae_valid r2_valid rmse_train mae_train r2_train
## 1 rf 112.4533 70.856 0.4923836 60.30409 37.2874 0.854023
As expected, the most important variables are relative income per capita, distance, and the price of housing relative to the one in Warsaw. Inspecting the node purity plot, we can conclude that the most useful variable in predicting the number of migrants is distance (straight line).
Next, we can plot the actual vs. predicted values of the target variable.
# plot results
rf_res_graph <- data.frame(
borough = suburbs1$borough,
actual = suburbs$migrants,
pred = round(rf_res$res, 0)
)
rf_res_graph$actual_vs_pred <- ifelse(rf_res_graph$actual > rf_res_graph$pred, 'lower', 'higher')
rf_plot <- ggplot(rf_res_graph, aes(x = borough, y = actual)) +
geom_bar(col = 'blue', stat='identity', width = 1) +
geom_bar(data = rf_res_graph, aes(x = borough, y = pred, col = actual_vs_pred), stat='identity', width=0.2) +
theme(
plot.title = element_text(color="#000000", face="bold", size=24, hjust=0.5),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ggtitle('RF Results')
rf_plot
The plot above shows that most of the values were predicted quite well, except for a few outliers: Ożarów Mazowiecki, Józefów, Marki, Ząbki. Overall, I conclude that the quality of the prediction is quite good, especially given the very small number of observations.
We move on to inspecting the same problem with XGB. XGB has many more hyperparameters than RF and thus requires a longer optimization. This optimization is done in a similar manner to the RF one presented above and is also computationally expensive. The first chunk does not need to be run when compiling to html, because the parameters have already been found; I include it for demonstration purposes, without running it.
# this chunk is computationally expensive
# first learning rate & iteration & after that tree-specific parameters & after that boosting parameters again
# number of iterations
nroundss <- 3:7
booster <- 'gbtree'
# learning rate - shrinking the feature weights after each iteration to prevent overfitting
etas <- 0.05 * 1:19
# minimum loss reduction required to make a further partition on a leaf node of the tree
gammas <- c(0, 10, 100, 1000, 5000, 10000, 50000, 100000)
# tree construction - since our sample is very small we will use 'exact' (the greedy approach)
tree_method <- 'exact'
xgb_gbtree_cv_1 <- data.frame(
booster = character(),
nrounds = numeric(),
eta = numeric(),
gamma = numeric(),
rmse = numeric(),
mae = numeric(),
r2 = numeric(),
r2_adj = numeric()
)
for (eta in etas) {
print(eta)
for (gamma in gammas) {
params <- list(
booster = booster,
eta = eta,
gamma = gamma,
tree_method = tree_method,
objective = "reg:squarederror",
eval_metric = "rmse")
for (nrounds in nroundss) {
xgb_res_test <- c()
for (i in 1:69) {
# divide in train and test dataset
x <- 1:69
suburbs_train <- suburbs[x[-i],]
suburbs_test <- suburbs[x[i],]
suburbs_train_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_train[,-1]), label = suburbs_train[,1])
suburbs_test_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_test[,-1]), label = suburbs_test[,1])
# train model
set.seed(435632)
xgb <- xgb.train(params = params, data = suburbs_train_dmatrix, nrounds = nrounds, verbose = 0)
# calculate prediction and residual
xgb_res_test_tmp <- as.data.frame(predict(xgb, suburbs_test_dmatrix))
xgb_res_test <- append(xgb_res_test, xgb_res_test_tmp[1,1])
}
# append results
row <- data.frame(
booster = 'gbtree',
nrounds = nrounds,
eta = eta,
gamma = gamma,
rmse = rmse(suburbs$migrants, xgb_res_test),
mae = mae(suburbs$migrants, xgb_res_test),
r2 = r2(suburbs$migrants, xgb_res_test),
r2_adj = r2_adj(suburbs$migrants, xgb_res_test)
)
xgb_gbtree_cv_1 <- rbind(xgb_gbtree_cv_1, row)
}
}
}
# choosing the optimal model
xgb_gbtree_cv_1[xgb_gbtree_cv_1$rmse == min (xgb_gbtree_cv_1$rmse) | xgb_gbtree_cv_1$mae == min (xgb_gbtree_cv_1$mae) | xgb_gbtree_cv_1$r2 == max (xgb_gbtree_cv_1$r2), ]
# number of iterations
nrounds <- 5
booster <- 'gbtree'
# learning rate - shrinking the feature weights after each iteration to prevent overfitting
eta <- 0.45
# minimum loss reduction required to make a further partition on a leaf node of the tree
gamma <- 10000
# L2 regularization param
lambdas <- c(0, 1, 10, 100, 1000, 10000, 100000)
# L1 regularization param
alphas <- c(0, 1, 10, 100, 1000, 10000, 100000)
# tree construction - since our sample is very small we will use 'exact' (the greedy approach)
tree_method <- 'exact'
xgb_gbtree_cv_2 <- data.frame(
lambda = character(),
alpha = numeric(),
rmse = numeric(),
mae = numeric(),
r2 = numeric(),
r2_adj = numeric()
)
for (lambda in lambdas) {
print(lambda)
for (alpha in alphas) {
params <- list(
booster = booster,
eta = eta,
gamma = gamma,
alpha = alpha,
lambda = lambda,
tree_method = tree_method,
objective = "reg:squarederror",
eval_metric = "rmse")
xgb_res_test <- c()
for (i in 1:69) {
# divide in train and test dataset
x <- 1:69
suburbs_train <- suburbs[x[-i],]
suburbs_test <- suburbs[x[i],]
suburbs_train_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_train[,-1]), label = suburbs_train[,1])
suburbs_test_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_test[,-1]), label = suburbs_test[,1])
# train model
set.seed(435632)
xgb <- xgb.train(params = params, data = suburbs_train_dmatrix, nrounds = nrounds, verbose = 0)
# calculate prediction and residual
xgb_res_test_tmp <- as.data.frame(predict(xgb, suburbs_test_dmatrix))
xgb_res_test <- append(xgb_res_test, xgb_res_test_tmp[1,1])
}
# append results
row <- data.frame(
lambda = lambda,
alpha = alpha,
rmse = rmse(suburbs$migrants, xgb_res_test),
mae = mae(suburbs$migrants, xgb_res_test),
r2 = r2(suburbs$migrants, xgb_res_test),
r2_adj = r2_adj(suburbs$migrants, xgb_res_test)
)
xgb_gbtree_cv_2 <- rbind(xgb_gbtree_cv_2, row)
}
}
# choosing the optimal model
xgb_gbtree_cv_2[xgb_gbtree_cv_2$rmse == min (xgb_gbtree_cv_2$rmse) | xgb_gbtree_cv_2$mae == min (xgb_gbtree_cv_2$mae) | xgb_gbtree_cv_2$r2 == max (xgb_gbtree_cv_2$r2), ]
# number of iterations
nroundss <- 7
booster <- 'gbtree'
# learning rate - shrinking the feature weights after each iteration to prevent overfitting
etas <- 0.3
# minimum loss reduction required to make a further partition on a leaf node of the tree
gamma <- 10000
# L2 regularization param
lambda <- 1
# L1 regularization param
alpha <- 1
# max tree depth
max_depths <- 2:8
# minimum number of instances needed to be in each node
min_child_weights <- c(7, 10, 12, 15, 20)
# data subsample for each iteration
subsamples <- 0.2 * 1:5
# ratio of columns for each tree
colsamples_bytree <- 0.2 * 1:5
# tree construction - since our sample is very small we will use 'exact' (the greedy approach)
tree_method <- 'exact'
xgb_gbtree_cv_3 <- data.frame(
nrounds = numeric(),
eta = numeric(),
max_depths = numeric(),
min_child_weights = numeric(),
subsamples = numeric(),
colsamples_bytree = numeric(),
rmse = numeric(),
mae = numeric(),
r2 = numeric(),
r2_adj = numeric()
)
for (max_depth in max_depths) {
print(max_depth)
for (min_child_weight in min_child_weights) {
for (subsample in subsamples) {
for (colsample_bytree in colsamples_bytree) {
for (nrounds in nroundss) {
for (eta in etas) {
params <- list(
booster = booster,
eta = eta,
gamma = gamma,
alpha = alpha,
lambda = lambda,
max_depth = max_depth,
min_child_weight = min_child_weight,
subsample = subsample,
colsample_bytree = colsample_bytree,
tree_method = tree_method,
objective = "reg:squarederror",
eval_metric = "rmse")
xgb_res_test <- c()
for (i in 1:69) {
# divide in train and test dataset
x <- 1:69
suburbs_train <- suburbs[x[-i],]
suburbs_test <- suburbs[x[i],]
suburbs_train_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_train[,-1]), label = suburbs_train[,1])
suburbs_test_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_test[,-1]), label = suburbs_test[,1])
# train model
set.seed(435632)
xgb <- xgb.train(params = params, data = suburbs_train_dmatrix, nrounds = nrounds, verbose = 0)
# calculate prediction and residual
xgb_res_test_tmp <- as.data.frame(predict(xgb, suburbs_test_dmatrix))
xgb_res_test <- append(xgb_res_test, xgb_res_test_tmp[1,1])
}
# append results
row <- data.frame(
nrounds = nrounds,
eta = eta,
max_depths = max_depth,
min_child_weights = min_child_weight,
subsamples = subsample,
colsamples_bytree = colsample_bytree,
rmse = rmse(suburbs$migrants, xgb_res_test),
mae = mae(suburbs$migrants, xgb_res_test),
r2 = r2(suburbs$migrants, xgb_res_test),
r2_adj = r2_adj(suburbs$migrants, xgb_res_test)
)
xgb_gbtree_cv_3 <- rbind(xgb_gbtree_cv_3, row)
}
}
}
}
}
}
# choosing the optimal model
xgb_gbtree_cv_3[xgb_gbtree_cv_3$rmse == min (xgb_gbtree_cv_3$rmse) | xgb_gbtree_cv_3$mae == min (xgb_gbtree_cv_3$mae) | xgb_gbtree_cv_3$r2 == max (xgb_gbtree_cv_3$r2), ]
The chunk below (the last part of the XGB optimization) needs to be run, as its results are used later to construct the final XGB model.
start_time <- Sys.time()
# number of iterations
nroundss <- 7
booster <- 'gbtree'
# learning rate - shrinking the feature weights after each iteration to prevent overfitting
etas <- 0.3
# minimum loss reduction required to make a further partition on a leaf node of the tree
gammas <- 1
# L2 regularization parameter
lambdas <- 1
# L1 regularization parameter
alphas <- 10 * 1:10
# max tree depth
max_depth <- 5
# minimum number of instances needed to be in each node
min_child_weight <- 7
# data subsample for each iteration
subsample <- 0.8
# ratio of columns for each tree
colsample_bytree <- 0.4
# tree construction - since our sample is very small we will use 'exact' (the greedy approach)
tree_method <- 'exact'
xgb_gbtree_cv_4 <- data.frame(
booster = character(),
nrounds = numeric(),
eta = numeric(),
gamma = numeric(),
lambda = numeric(),
alpha = numeric(),
rmse = numeric(),
mae = numeric(),
r2 = numeric(),
r2_adj = numeric()
)
for (eta in etas) {
print(eta)
for (gamma in gammas) {
for (lambda in lambdas) {
for (alpha in alphas) {
params <- list(
booster = booster,
eta = eta,
gamma = gamma,
lambda = lambda,
alpha = alpha,
max_depth = max_depth,
min_child_weight = min_child_weight,
subsample = subsample,
colsample_bytree = colsample_bytree,
tree_method = tree_method,
objective = "reg:squarederror",
eval_metric = "rmse")
for (nrounds in nroundss) {
xgb_res_test <- c()
for (i in 1:69) {
# divide in train and test dataset
x <- 1:69
suburbs_train <- suburbs[x[-i],]
suburbs_test <- suburbs[x[i],]
suburbs_train_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_train[,-1]), label = suburbs_train[,1])
suburbs_test_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_test[,-1]), label = suburbs_test[,1])
# train model
set.seed(435632)
xgb <- xgb.train(params = params, data = suburbs_train_dmatrix, nrounds = nrounds, verbose = 0)
# calculate prediction and residual
xgb_res_test_tmp <- as.data.frame(predict(xgb, suburbs_test_dmatrix))
xgb_res_test <- append(xgb_res_test, xgb_res_test_tmp[1,1])
}
# append results
row <- data.frame(
booster = 'gbtree',
nrounds = nrounds,
eta = eta,
gamma = gamma,
lambda = lambda,
alpha = alpha,
rmse = rmse(suburbs$migrants, xgb_res_test),
mae = mae(suburbs$migrants, xgb_res_test),
r2 = r2(suburbs$migrants, xgb_res_test),
r2_adj = r2_adj(suburbs$migrants, xgb_res_test)
)
xgb_gbtree_cv_4 <- rbind(xgb_gbtree_cv_4, row)
}
}
}
}
}
## [1] 0.3
# choose the optimal model
xgb_gbtree_cv_4[xgb_gbtree_cv_4$rmse == min (xgb_gbtree_cv_4$rmse) | xgb_gbtree_cv_4$mae == min (xgb_gbtree_cv_4$mae) | xgb_gbtree_cv_4$r2 == max (xgb_gbtree_cv_4$r2), ]
## booster nrounds eta gamma lambda alpha rmse mae r2 r2_adj
## 4 gbtree 7 0.3 1 1 40 104.8612 62.43703 0.5586118 0.230400
## 5 gbtree 7 0.3 1 1 50 104.7637 63.17775 0.5594325 0.231831
end_time <- Sys.time()
end_time-start_time
## Time difference of 4.111133 secs
With the optimal hyperparameters found, we can run the XGB.
# list of chosen parameters
params <- list(
booster = 'gbtree',
eta = 0.3,
gamma = 1,
alpha = 40,
lambda = 1,
max_depth = 3,
min_child_weight = 7,
subsample = 0.8,
colsample_bytree = 0.4,
tree_method = 'exact',
objective = 'reg:squarederror',
eval_metric = 'rmse')
suburbs_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs[,-1]), label = suburbs[,1])
set.seed(435632)
xgb_gbtree <- xgb.train(params = params, data = suburbs_dmatrix, nrounds = 8, verbose = 2)
xgb_gbtree_res <- as.data.frame(predict(xgb_gbtree, suburbs_dmatrix))
xgb_gbtree_res$res <- xgb_gbtree_res[,1]
rmse(suburbs$migrants, xgb_gbtree_res$res)
## [1] 88.13577
mae(suburbs$migrants, xgb_gbtree_res$res)
## [1] 48.23574
r2(suburbs$migrants, xgb_gbtree_res$res)
## [1] 0.6881863
# LOOCV stats
for (i in 1:69) {
if (i==1) {
xgb_res_test <- c()
}
# divide in train and test dataset
x <- 1:69
suburbs_train <- suburbs[x[-i],]
suburbs_test <- suburbs[x[i],]
suburbs_train_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_train[,-1]), label = suburbs_train[,1])
suburbs_test_dmatrix <- xgb.DMatrix(data = data.matrix(suburbs_test[,-1]), label = suburbs_test[,1])
# train model
set.seed(435632)
xgb <- xgb.train(params = params, data = suburbs_train_dmatrix, nrounds = 8, verbose = 0)
# calculate prediction and residual
xgb_res_test_tmp <- as.data.frame(predict(xgb, suburbs_test_dmatrix))
xgb_res_test <- append(xgb_res_test, xgb_res_test_tmp[1,1])
if (i==69) {
cat('rmse: \t', rmse(suburbs$migrants, xgb_res_test), '\n')
cat('mae: \t', mae(suburbs$migrants, xgb_res_test), '\n')
cat('r2: \t', r2(suburbs$migrants,xgb_res_test), '\n')
}
}
## rmse: 101.9836
## mae: 62.51926
## r2: 0.5825044
# append stats
row <- data.frame(
model = 'xgb',
rmse_valid = xgb_gbtree_cv_4[xgb_gbtree_cv_4$rmse == min (xgb_gbtree_cv_4$rmse) & xgb_gbtree_cv_4$r2 == max (xgb_gbtree_cv_4$r2), 'rmse'],
mae_valid = xgb_gbtree_cv_4[xgb_gbtree_cv_4$rmse == min (xgb_gbtree_cv_4$rmse) & xgb_gbtree_cv_4$r2 == max (xgb_gbtree_cv_4$r2), 'mae'],
r2_valid = xgb_gbtree_cv_4[xgb_gbtree_cv_4$rmse == min (xgb_gbtree_cv_4$rmse) & xgb_gbtree_cv_4$r2 == max (xgb_gbtree_cv_4$r2), 'r2'],
rmse_train = rmse(suburbs$migrants, xgb_gbtree_res$res),
mae_train = mae(suburbs$migrants, xgb_gbtree_res$res),
r2_train = r2(suburbs$migrants, xgb_gbtree_res$res)
)
model_stats <- rbind(model_stats[model_stats$model != 'xgb',], row)
model_stats
## model rmse_valid mae_valid r2_valid rmse_train mae_train r2_train
## 1 rf 112.4533 70.85600 0.4923836 60.30409 37.28740 0.8540230
## 2 xgb 104.7637 63.17775 0.5594325 88.13577 48.23574 0.6881863
It can be seen that the XGB outperforms the RF on the validation sample, both in terms of lower RMSE and MAE and in terms of higher R\(^2\). On the training sample, the RF appears to be overfitted, judging by its much higher training R\(^2\). We can plot the actual vs. predicted values.
# plot results
xgb_gbtree_res_graph <- data.frame(
borough = suburbs1$borough,
actual = suburbs$migrants,
pred = round(xgb_res_test, 0)
)
xgb_gbtree_res_graph$actual_vs_pred <- ifelse(xgb_gbtree_res_graph$actual > xgb_gbtree_res_graph$pred, 'lower', 'higher')
xgb_gbtree_plot <- ggplot(xgb_gbtree_res_graph, aes(x = borough, y = actual)) +
geom_bar(col = 'blue', stat='identity', width = 1) +
geom_bar(data = xgb_gbtree_res_graph, aes(x = borough, y = pred, col = actual_vs_pred), stat='identity', width=0.2) +
theme(
plot.title = element_text(color="#000000", face="bold", size=24, hjust=0.5),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ggtitle('XGB Results')
xgb_gbtree_plot
It appears that the XGB predicted the number of migrants reasonably well in the majority of boroughs. The exceptions, with particularly poor performance, are: Lesznowola, Łomianki, Marki, Piaseczno, Zielonka.
In order to assess which variables contribute most to prediction accuracy, we calculate the PFI of each regressor for both RF and XGB. The PFI is computed as the change in RMSE after permuting each variable, repeated 1000 times (Monte Carlo). As a result, we obtain the mean importance of each variable for RF and XGB, together with a corresponding importance rank.
# RF
yhat_rf <- function(X.model, newdata) as.numeric(predict(X.model, newdata))
set.seed(435632)
vi_rf <- as.data.frame(vi_permute(rf, train = suburbs, target = 'migrants', metric='rmse', pred_wrapper = yhat_rf, type='ratio', nsim=1000))
vi_rf <- vi_rf %>%
arrange(desc(ifelse(Importance==1, NA, Importance))) %>%
mutate(
rank_rf = ifelse(Importance==1, NA, row_number()),
Importance = ifelse(Importance==1, NA, Importance - 1),
StDev = ifelse(StDev==0, NA, StDev)
) %>%
rename(
mean_imp_rf = Importance,
sd_imp_rf = StDev
)
vi_rf
## Variable mean_imp_rf sd_imp_rf rank_rf
## 1 distance 0.96526194 0.153869338 1
## 2 greenery 0.36214931 0.055362278 2
## 3 kinder 0.34104585 0.077677753 3
## 4 income 0.23460697 0.034938313 4
## 5 price_b_w 0.15850729 0.031116264 5
## 6 pop_density 0.11388529 0.029272840 6
## 7 forest 0.03601710 0.008114264 7
## 8 unempl 0.01863727 0.011836609 8
# XGB
yhat_xgb <- function(X.model, newdata) as.numeric(predict(X.model, data.matrix(newdata)))
set.seed(435632)
vi_xgb <- as.data.frame(vi_permute(xgb_gbtree, train = data.matrix(suburbs[,-1]), target = suburbs[,1], metric='rmse', pred_wrapper = yhat_xgb, type='ratio', nsim=1000))
vi_xgb <- vi_xgb %>%
arrange(desc(ifelse(Importance==1, NA, Importance))) %>%
mutate(
rank_xgb = ifelse(Importance==1, NA, row_number()),
Importance = ifelse(Importance==1, NA, Importance - 1),
StDev = ifelse(StDev==0, NA, StDev)
) %>%
rename(
mean_imp_xgb = Importance,
sd_imp_xgb = StDev
)
vi_xgb
## Variable mean_imp_xgb sd_imp_xgb rank_xgb
## 1 distance 0.393326390 0.058441394 1
## 2 price_b_w 0.257571584 0.052865750 2
## 3 income 0.093618385 0.019358312 3
## 4 kinder 0.084401615 0.025078584 4
## 5 unempl 0.044434310 0.025787604 5
## 6 forest 0.019399703 0.008503558 6
## 7 pop_density 0.013942463 0.009890154 7
## 8 greenery 0.005730307 0.002483963 8
It can be seen that distance is the most important variable, with a mean PFI of about 96% for RF and 39% for XGB. However, the next top-rated variables differ across the two algorithms. In RF, total greenery spaces and the number of kindergartens per 100 000 people were ranked second and third, while in XGB these places were taken by the relative price of housing and relative income, respectively. Relative income took fourth place in RF, while the number of kindergartens per 100 000 people took it in XGB. It appears that distance, relative income, and the number of kindergartens are the most important variables in predicting the number of migrants who move out of Warsaw to the suburban boroughs.
In order to interpret the relationships between the number of migrants and the regressors, we plot Accumulated Local Effects. Below is a custom function for this purpose.
# fix scales
scale_color <- viridis(2)
names(scale_color) <- c('rf', 'xgb')
scale_line <- c('solid', 'solid')
names(scale_line) <- c('rf', 'xgb')
# function
ALEplot <- function (var_nm, letter) {
var <- which(colnames(suburbs) == var_nm) - 1
ale_rf <- as.data.frame(ALEPlot(suburbs[,-1], rf, pred.fun=yhat_rf, J=var, K=69, NA.plot = TRUE))
ale_rf$model <- 'rf'
ale_xgb <- as.data.frame(ALEPlot(suburbs[,-1], xgb_gbtree, pred.fun=yhat_xgb, J=var, K=69, NA.plot = TRUE))
ale_xgb$model <- 'xgb'
ale <- rbind(ale_rf, ale_xgb)
ale_drop <- ale %>%
group_by(model) %>%
summarise(sum = sum(f.values)) %>%
as.data.frame() %>%
filter(sum==0)
ale_drop <- c(ale_drop$model)
ale <- ale %>%
filter(!model %in% ale_drop)
plt <- ggplot(ale, aes(x= x.values, y = f.values, linetype = model, color = model)) +
geom_line(size=1) +
scale_colour_manual(name = 'model', values = scale_color) +
scale_linetype_manual(name = 'model', values = scale_line) +
ggtitle(paste(letter, var_nm, sep=') ')) +
theme_classic() +
theme(
plot.title = element_text(color="#000000", face="bold", size=16, hjust=0
#, family='Palatino'
),
#axis.title = element_text(family='Palatino'),
#axis.text = element_text(family='Palatino'),
legend.title = element_text(
#family='Palatino',
face='bold')
#,legend.text = element_text(family='Palatino')
)
# return(plt)
}
I run the “ALEplot” function 8 times, once for each of the 8 regressors. I do not include that code here, because the underlying ALEPlot function (from the package of the same name) displays intermediate plot outlines, and we are only interested in the resulting grid, presented below.
grid.arrange(ALE1, ALE2, ALE3, ALE4, ALE5, ALE6, ALE7, ALE8, ncol=2, nrow=4)
First of all, it can be seen that while all 8 regressors were used in RF, XGB discarded population density and unemployment rate. From the ALE plot for population density, it can nevertheless be concluded that it has a positive influence on the number of migrants: their number rises for population density between 0 and about 13 people per km2, after which it remains constant. Distance, on the contrary, has a negative influence on the number of migrants: their number decreases from 100 to -50 for distances greater than 15 km. These two findings conform to the gravity model of migration. The relationship between relative income and the number of migrants is positive, as anticipated, and the number of migrants rises once relative income exceeds 0.5 of the mean income in all boroughs. The relationship between the relative price of housing and the number of migrants is, surprisingly, also positive. It may be that this variable is highly correlated with population density and that, for this reason, neither RF nor XGB handled the presence of both well, but this is difficult to verify. Alternatively, it may be that in the particular year of 2018, social affiliations played a more significant role than cheaper housing: some parts of the suburbs of Warsaw are enclaves (Konstancin-Jeziorna, Lesznowola, Łomianki) rather than places where the poor reside. In the panel model in my article, this relationship was negative over the years 2008-2019. The unemployment rate appears to have a negative impact on the number of migrants; however, this variable is rarely significant, as people who live in the suburbs usually still commute to work in Warsaw (for example, it was not used in XGB at all). Surprisingly, the relationships between the number of migrants and both the ratio of forest area to total area and the total greenery spaces are positive. Apparently, the forest ratio also acts as an amenity, although the main point of including that variable was to measure to what extent forests constrain habitable zones. Finally, as anticipated, the number of kindergartens has a positive influence on the number of migrants.
Apley, D. W., and J. Zhu. 2019. “Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models,” December.
Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Chen, T., and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Fisher, A. J., C. Rudin, and F. Dominici. 2019. “All Models Are Wrong, but Many Are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously.” Journal of Machine Learning Research 20: 1–81.
Friedman, J. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29 (November). https://doi.org/10.1214/aos/1013203451.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
Loibl, W. 2004. “Simulation of Suburban Migration: Driving Forces, Socio-Economic Characteristics, Migration Behaviour and Resulting Land-Use Patterns.” In, 201–23.
Mieszkowski, P., and E. S. Mills. 1993. “The Causes of Metropolitan Suburbanization.” Journal of Economic Perspectives 7 (3): 135–47.
Poot, J., O. Alimi, M. P. Cameron, and D. C. Maré. 2016. “The Gravity Model of Migration: The Successful Comeback of an Ageing Superstar in Regional Science” 2016: 63–86.
Zhao, Q., and T. Hastie. 2019. “Causal Interpretations of Black-Box Models.” Journal of Business & Economic Statistics, 1–19. https://doi.org/10.1080/07350015.2019.1624293.