Why would building M4 reduce education inequalities more than M3? A data-driven approach to planning public transport in Warsaw

Szymczak Wojciech and Pignataro Danielle

14-12-2023


The story of Warsaw Metro extension

In 2023, the Warsaw City Council published a plan for extending the public transport system (see below). According to the City Council, 60% of Warsaw's residents use public transport daily, and students unsurprisingly make up a large share of these passengers. However, not all schools are well connected to public transport, which may become a real problem. Students who attend schools with poor connections to the public system are more likely to miss classes in the event of transportation issues. In addition, areas without rapid public transport, such as the metro, may struggle with heavier traffic. Marc L. Stein and Jeffrey A. Grigg (2019) from Johns Hopkins University showed that longer transportation times often lead to higher absenteeism among 9th-grade students. Moreover, students from disadvantaged backgrounds are more likely to suffer from public transport issues.

Source: Warsaw City Council
Source: Warsaw City Council

Since there seems to be a connection between the public transport system and school outcomes, we turn to this problem in our project. Although students' achievements are not the critical factor in decisions about extending the metro system, they are an important social issue that should be examined. As the M2 line was recently extended, we estimate the influence of these developments on students' achievements. Based on the estimated models, we predict the impact of three extension scenarios: M3, M4, and M5. This is all the more important because social partners argue that the planned M3 line does not match the city's needs well.

Research outline

In this document, we report our findings. We divide our analysis into six points.

  1. Download the data from the available online sources

  2. Prepare the data for the analysis

  3. Conduct simple Difference-in-Differences

  4. Analyse the spatiotemporal clustering membership

  5. Run regression and ML models

  6. Predict the benefits of M3, M4 and M5 extension

We conclude with limitations of the analysis and possible further research.

Required Packages

# adding a buffer to the borders to mitigate the uneven line problem

# WAW_box2<-getbb("Warszawa, Polska")
# smallbuffer <- st_buffer(waw.pov.sf, dist = 0.0007)
# query<-opq(WAW_box2) %>% add_osm_feature(key='admin_level', value='9') # get Warsaw districts
# bound_waw.sf<-osmdata_sf(query) # get the Warsaw district lines
# 
# # borders of districts in Warsaw
# wawDistBorders <- st_intersection(bound_waw.sf$osm_lines %>% 
#                                     st_transform(crs = "+proj=longlat +datum=NAD83"), 
#                                   smallbuffer)

Before starting the analysis, we have to install and load the required packages. We keep this chunk commented out to speed up knitting, but you can copy and uncomment it in your own R script.

# requiredPackages = c( "imola", "readr", "tidyverse", "stringr", "colourpicker",
#                       "sf", "sp", "spdep", "spatialreg", "RColorBrewer", "ggplot2",
#                       "ggmap", "tibble", "dplyr", "tidygeocoder", "rvest", 
#                       "spatstat", "GISTools", "raster", "keras", "dbscan", "factoextra", 
#                       "fpc", "osmdata", "OpenStreetMap", "cowplot", "viridis", "leaflet", "geodist", 
#                       "GWmodel", "SpatialML", "fossil", "bayesbio") # Required Packages 
# 
# for(i in requiredPackages){if(!require(i,character.only = TRUE)) install.packages(i)}
# for(i in requiredPackages){if(!require(i,character.only = TRUE)) library(i,character.only = TRUE) } 

Data Preparation

First, we read the raw data from the national register. We use tidygeocoder to find the geolocation of all schools in Warsaw. Because of the time and computational cost involved, we keep this chunk commented out.

# math <- read_csv("math_19_22_data.csv")
# 
# math$country <- "Poland"
# df <- math %>%
#   tidygeocoder::geocode(street = street, city = city, country = country, method = "osm" )
# 
# 
# df <- df %>%
#   filter(!is.na(lat))
# 
# save(df, file = "Math_Results_2019_Warsaw.Rdata")

Instead of geocoding the data again, we load the already geocoded data directly and prepare it for the public transport analysis.

# load the data instead of working on them again 
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Math_Results_2019_Warsaw.Rdata")

# Find data for metro stations
# Web scraping from Wikipedia
metro_stations_m1 <- c("Kabaty", "Natolin", "Imielin", "Ursynów", "Stokłosy",
                    "Służew", "Wilanowska", "Wierzbno", "Racławicka", "Pole_Mokotowskie", 
                    "Politechnika", "Centrum", "Świętokrzyska", "Ratusz_Arsenał",
                    "Dworzec_Gdański", "Plac_Wilsona", "Marymont", 
                    "Słodowiec", "Stare_Bielany", "Wawrzyszew", "Młociny")

metro_stations_m2019 <- c("Rondo_Daszyńskiego", "Rondo_ONZ", "Nowy_Świat-Uniwersytet",
                          "Centrum_Nauki_Kopernik", "Stadion_Narodowy", "Dworzec_Wileński")

metro_stations_m2020 <- c("Księcia_Janusza",
                          "Młynów", "Płocka", "Szwedzka", 
                          "Targówek_Mieszkaniowy", "Trocka")

metro_stations_m2022 <- c("Bemowo", "Ulrychów", "Zacisze", 
                          "Kondratowicza", "Bródno")

metro <- as.data.frame( c(metro_stations_m2019, 
                         metro_stations_m2020,
                         metro_stations_m2022,
                         metro_stations_m1), dnn = list("Station"), responseName = "freq")

metro <- metro %>%
  rename("Station" = "c(metro_stations_m2019, metro_stations_m2020, metro_stations_m2022, metro_stations_m1)"  )

head(metro)
##                  Station
## 1     Rondo_Daszyńskiego
## 2              Rondo_ONZ
## 3 Nowy_Świat-Uniwersytet
## 4 Centrum_Nauki_Kopernik
## 5       Stadion_Narodowy
## 6       Dworzec_Wileński

As no available dataset includes the locations of metro stations in Warsaw, we scrape them directly from Wikipedia.

# Web scraping information on station locations in Warsaw
for (i in 1:length(metro$Station)){
  url <- paste("https://pl.wikipedia.org/wiki/", metro[i, 1],"_(stacja_metra)", sep = "", collapse = NULL)
    x <- tryCatch({
      url %>% 
        read_html() %>% 
        html_node(xpath = '//*[@id="coordinates"]') %>% 
        html_text() %>% 
        as.character() %>% 
        str_split(pattern = "/")
    }, error = function(e) {
      # handle HTTP error 404
      if (grepl("HTTP error 404", e$message)) {
        message(paste("Error: ", e$message, " - Skipping URL ", url))
        return(NULL)
      }
      else {
        # re-throw any other errors
        stop(e)
      }
    })
    if (!is.null(x)) {
      metro[i, 2] <- as.character(str_extract_all(gsub(",", ".", x[[1]][2]), "[+-]?([0-9]*[.])?[0-9]+")[[1]])[1]
      metro[i, 3] <- as.character(str_extract_all(gsub(",", ".", x[[1]][2]), "[+-]?([0-9]*[.])?[0-9]+")[[1]])[2]
    }
  }
metro <- metro %>% 
  rename ("Latitude" = "V2", 
          "Longitude" = "V3")

# Map all metro points
metro_map <- data.frame(name = metro$Station,
                        lon= as.numeric(metro$Longitude),
                         lat = as.numeric(metro$Latitude),
                         label = "metro")
# School maps
school_map <- data.frame(name = df$name, 
                         lon = as.numeric(df$long),
                         lat =  as.numeric(df$lat), 
                         label = "school")

# Map both metro
map_points <- rbind(metro_map, school_map)
head(map_points)
##                     name      lon      lat label
## 1     Rondo_Daszyńskiego 20.98306 52.23000 metro
## 2              Rondo_ONZ 20.99833 52.23306 metro
## 3 Nowy_Świat-Uniwersytet 21.01639 52.23667 metro
## 4 Centrum_Nauki_Kopernik 21.03139 52.24000 metro
## 5       Stadion_Narodowy 21.04306 52.24639 metro
## 6       Dworzec_Wileński 21.03528 52.25417 metro
schools <- map_points$name[map_points$name %in% df$name] 
head(schools)
## [1] "21 Społeczne Liceum Ogólnokształcące im. Jerzego Grotowskiego"                                                        
## [2] "Zespół Szkół Samochodowych i Licealnych nr2"                                                                          
## [3] "Niepubliczne Liceum Ogólnokształcące nr 81 SGH"                                                                       
## [4] "Technikum Łączności im. prof dr.inż. Janusza Groszkowskiego w Zespole Szkół Nr 37 im. Agnieszki Osieckiej w Warszawie"
## [5] "Liceum Ogólnokształcące Niepubliczne dla Dorosłych nr 16"                                                             
## [6] "Liceum Ogólnokształcące dla Dorosłych \"ELITA\""

We use leaflet to visualize the location of schools and metro stations.

pal <- colorFactor(c("navy", "red"), domain = c("metro", "school"))


leaflet() %>% addTiles() %>% 
  addCircleMarkers(data = map_points,
                   lat = ~lat, lng = ~lon,
                   color = ~pal(label),
                   radius = 4,
                   stroke = FALSE,
                   fillOpacity = 0.5,
                   label = map_points$label)
# Save map with M1 and M2 stations for further analysis
metro$Label <- ifelse(metro$Station %in% metro_stations_m1, "M1", "M2")
metrom1m2 <- metro %>%  
  rename( "latitude" = "Latitude" , 
          "longitude" = "Longitude")
save(metrom1m2, file = "metro_m1m2.Rdata")

To analyse the impact of extending the metro system in Warsaw, we compute the distance between each school and its nearest metro station at three points in time: first using the M1 stations together with the M2 stations open in 2019, then adding the M2 stations opened in 2020, and finally adding those opened in 2022.

df$distance_2019 <- rep(NA, length(df$name))
df$distance_2020 <- rep(NA, length(df$name))
df$distance_2022 <- rep(NA, length(df$name))

for (i in 1:length(df$name)){
  # working matrix 
  D2019 <- df[i, c("name",   "long", "lat")]
  
  # matrix for M1 stations
  Dmetro1 <- data.frame(name = metro$Station[metro$Station %in% metro_stations_m1], 
                        lat =  metro$Latitude[metro$Station %in% metro_stations_m1], 
                        long = metro$Longitude[metro$Station %in% metro_stations_m1])
  
  # matrix for M2 stations opened in 2019
  Dmetro2_2019 <- data.frame(name = metro$Station[metro$Station %in% metro_stations_m2019], 
                        lat =  metro$Latitude[metro$Station %in% metro_stations_m2019], 
                        long = metro$Longitude[metro$Station %in% metro_stations_m2019])
  
  # matrix for M2 stations opened in 2020
  Dmetro2_2020 <- data.frame(name = metro$Station[metro$Station %in% metro_stations_m2020], 
                             lat =  metro$Latitude[metro$Station %in% metro_stations_m2020], 
                             long = metro$Longitude[metro$Station %in% metro_stations_m2020])
  
  # matrix for M2 stations opened in 2022
  Dmetro2_2022 <- data.frame(name = metro$Station[metro$Station %in% metro_stations_m2022], 
                             lat =  metro$Latitude[metro$Station %in% metro_stations_m2022], 
                             long = metro$Longitude[metro$Station %in% metro_stations_m2022])
  
  dist2019 <- rbind(D2019, Dmetro1, Dmetro2_2019)
  dist2020 <- rbind(D2019, Dmetro1, Dmetro2_2019, Dmetro2_2020)
  dist2022 <- rbind(D2019, Dmetro1, Dmetro2_2019, Dmetro2_2020, Dmetro2_2022)
  
  
  distance_matrix2019 <- geodist(dist2019, measure = 'geodesic' )/1000 #converting it to km
  distance_matrix2020 <- geodist(dist2020, measure = 'geodesic' )/1000 #converting it to km
  distance_matrix2022 <- geodist(dist2022, measure = 'geodesic' )/1000 #converting it to km
  
  df[i, 21] <- min(distance_matrix2019[1,2:28])
  df[i, 22] <- min(distance_matrix2020[1,2:34])
  df[i, 23] <- min(distance_matrix2022[1,2:39])
}
save(df, file = "loc_math_1922.Rdata")
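
# A more compact alternative (a sketch, kept in the comment and not run here):
# geodist() accepts two data frames and returns the full school-by-station
# distance matrix at once, so the nearest-station distance can be taken as a
# row-wise minimum instead of rebuilding the matrices inside the loop.
# stations19 <- data.frame(
#   lon = as.numeric(metro$Longitude[metro$Station %in% c(metro_stations_m1,
#                                                          metro_stations_m2019)]),
#   lat = as.numeric(metro$Latitude[metro$Station %in% c(metro_stations_m1,
#                                                         metro_stations_m2019)]))
# schools_xy <- data.frame(lon = as.numeric(df$long), lat = as.numeric(df$lat))
# d19 <- geodist(schools_xy, stations19, measure = "geodesic") / 1000  # metres to km
# df$distance_2019 <- apply(d19, 1, min)  # nearest station for every school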
## Summarize the change in the distance 
vis_distance <- data.frame(id = 1:length(df$distance_2019))
vis_distance <- cbind(vis_distance, 
                      "distance2019" = df$distance_2019, 
                      "distance2020" = df$distance_2020,
                      "distance2022" = df$distance_2022)


vis_distance_l <- reshape(vis_distance, direction = "long", idvar="id",
                          varying = 2:4, sep = "")
vis_distance_l$short <- ifelse(vis_distance_l$distance < 0.5, 1, 0)
vis_distance_l$long <- ifelse(vis_distance_l$distance > 3, 1, 0)

p <- ggplot(vis_distance_l, aes(x = as.factor(time), y = distance, fill = as.factor(time)))+
  geom_bar( stat = "summary", fun.y = "mean") +
  scale_fill_manual(values = c("#B6163A", "#999999", "#000000"))+
  scale_color_manual(values = c("#B6163A","#999999", "#000000")) +
  theme_minimal() +
  labs(title = "") +
  xlab("Year") +
  ylab("Distance between school and metro station") +
  theme(
    text = element_text(family = "Roboto Light"),
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12),
    axis.title=element_text(size = 12),
    axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
    axis.ticks = element_line(size = 0.5, color="grey")
  )

p

Based on the data, we observe that the average distance between schools and metro stations decreased from around 2 km to 1.75 km by 2022. This change is relatively small on average. However, its impact is heterogeneous, as only some schools are affected.
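
To quantify how many schools are actually affected, one could compute the share of schools whose nearest-station distance changed at all; a quick sketch, kept in the comment like the other optional chunks:

# Share of schools whose nearest-station distance decreased between 2019 and 2022
# mean(df$distance_2022 < df$distance_2019)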

Difference in Differences Application

We plot the relationship between average achievements in mathematics in 2019 and 2022. We observe that schools, in general, score similarly across the years. However, we find higher variation in the lower part of the scatter plot. In line with the international literature, schools that do poorly are more likely to move up or down in the ranking from year to year, while schools at the top are robust enough to keep their position regardless of the exam difficulty.

# Basic scatter plot
ggplot(df, aes(x= math2019, y = math2022)) + 
  geom_point() +
  geom_smooth(method = lm, color="black")+
  labs(title = "",
       x = "Average score in 2019", y = "Average score in 2022")+
  theme(
    text = element_text(family = "Roboto Light"),
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12),
    axis.title=element_text(size = 12),
    axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
    axis.ticks = element_line(size = 0.5, color="grey")
  )

We propose three different designs for analysing the impact of the public transport extension:

  1. Distance-based

     Treatment: Schools for which the distance to the nearest metro station decreased

     Control: All schools already close to a metro station (less than 0.5 km in 2019)

  2. Distance-based with pre-treatment exclusion

     Treatment: Schools for which the distance to the nearest metro station decreased

     Control: Schools more than 1.5 km from a metro station in 2019, i.e. excluding schools that were already close before the extension

  3. Distance-based with post-treatment exclusion

     Treatment: Only those schools for which the distance decreased from more than 1 km to less than 0.7 km (walkable distance)

     Control: Schools already within walkable distance (less than 0.7 km) of a metro station in 2019
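
Because math_diff is already a first difference (2022 minus 2019), regressing it on the treatment dummy gives the difference-in-differences estimate directly: the coefficient equals the mean change among treated schools minus the mean change among controls. The sketch below (kept in the comment, and unweighted, unlike the weighted regressions reported further down) illustrates this equivalence for the first design, using the objects created in the chunk that follows.

# DiD "by hand" for design 1: difference of mean changes, treated vs control
# with(df_1, mean(math_diff[treat_1 == 1]) - mean(math_diff[treat_1 == 0]))
# # This equals coef(lm(math_diff ~ treat_1, data = df_1))["treat_1"]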

df$math_diff <- df$math2022 - df$math2019
# 1) Different treatment definitions
df$treat_1 <- ifelse(df$distance_2019 > df$distance_2022, 1, 0)
df$control_1 <- ifelse(df$distance_2019 < 0.5, 1, 0)

df_1 <- df %>% 
  dplyr:: filter(treat_1 == 1 | control_1 == 1)

# 2) Get rid of those for which the distance in 2019 was already low
df$treat_2 <- ifelse(df$distance_2019 > df$distance_2022, 1, 0)
df$control_2 <- ifelse(df$distance_2019 > 1.5, 1, 0)

df_2 <- df %>% 
  dplyr:: filter(treat_2 == 1 | control_2 == 1)

# 3) Treated: distance 2022 is lower than 0.7km, although it was higher in 2019
df$treat_3 <- ifelse(df$distance_2022 < 0.7 & df$distance_2019 > 1, 1, 0)
df$control_3 <- ifelse(df$distance_2019 == df$distance_2022 & df$distance_2019 < 0.7, 1, 0)

df_3 <- df %>% 
  dplyr:: filter(treat_3 == 1 | control_3 == 1)

#DID in regression form (with statistical significance)
reg1 <- lm(math_diff ~  treat_1, data = df_1, weights = df_1$n_students2019)
reg2 <- lm(math_diff ~  treat_2, data = df_2, weights = df_2$n_students2019)
reg3 <- lm(math_diff ~  treat_3, data = df_3, weights = df_3$n_students2019)

stargazer::stargazer(reg1, reg2, reg3, type = 
                       "text")
## 
## ============================================================================
##                                       Dependent variable:                   
##                     --------------------------------------------------------
##                                            math_diff                        
##                            (1)                (2)                (3)        
## ----------------------------------------------------------------------------
## treat_1                   1.277                                             
##                          (1.811)                                            
##                                                                             
## treat_2                                      0.541                          
##                                             (1.639)                         
##                                                                             
## treat_3                                                         -0.675      
##                                                                (4.975)      
##                                                                             
## Constant                  1.028              1.765*             0.430       
##                          (1.109)            (0.907)            (1.021)      
##                                                                             
## ----------------------------------------------------------------------------
## Observations                64                 79                 54        
## R2                        0.008              0.001              0.0004      
## Adjusted R2               -0.008             -0.012             -0.019      
## Residual Std. Error  60.695 (df = 62)   57.854 (df = 77)   64.773 (df = 52) 
## F Statistic         0.497 (df = 1; 62) 0.109 (df = 1; 77) 0.018 (df = 1; 52)
## ============================================================================
## Note:                                            *p<0.1; **p<0.05; ***p<0.01

We find no significant impact of the metro extension on school achievements. Although the first and second designs show some positive effects, they are not statistically significant in the frequentist sense. However, we point to the spatial character of the relationship between distance and achievements; in a later part, we explore whether spatial analysis is justified. In addition, our difference-in-differences design has evident flaws:

  • Parallel Trends Assumption: DiD assumes that the treated and untreated groups would have followed the same trend over time in the absence of the treatment. If historical trends in academic achievement are similar between treated and untreated groups before the metro extension, this strengthens the validity of the DiD approach. However, we did not have enough periods to test the parallel trends assumption (a hypothetical sketch of such a test follows this list).

  • Unobserved Confounding Factors: DiD assumes that no unobserved confounding factors influence the treatment and control groups differently. In our case, these could include extensions of other forms of public transport (i.e., buses and trams). At the same time, from a spatial perspective, some areas of Warsaw are more deprived in terms of socioeconomic outcomes, and the pandemic might have hit disadvantaged students harder than affluent ones.
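
If additional pre-treatment exam years were available, a standard event-study check of the parallel trends assumption could be run. The sketch below is purely hypothetical and kept in the comment: the variables math2017 and math2018 do not exist in our data and are used only to illustrate the idea.

# Hypothetical pre-trend (event-study) check, assuming extra pre-treatment
# exam years (math2017, math2018) that are NOT available in our data.
# panel <- df %>%
#   tidyr::pivot_longer(c(math2017, math2018, math2019),
#                       names_to = "year", values_to = "score",
#                       names_prefix = "math") %>%
#   dplyr::mutate(year = as.integer(year))
# # Interactions of treatment with pre-treatment years should be close to zero
# # if treated and control schools followed parallel trends before the extension.
# summary(lm(score ~ treat_1 * factor(year), data = panel))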

In the concluding part, we point to the main limitations of our analysis.

Data Preparation for Spatial Analysis

Because we have relatively few observations per area, we conduct the spatial analysis on a gridded map. We gratefully acknowledge the code provided by Maria Kubara in this part.

# Kubara, M. (2023). Spatiotemporal localisation patterns of technological startups: 
# the case for recurrent neural networks in predicting urban startup clusters. 
# The Annals of Regional Science, 1-33.

# Link to Kubara's GitHub repository: https://github.com/mariakubara/RNN_startup_clusters

pov.sf <- st_read("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/powiaty.shp")
## Reading layer `powiaty' from data source 
##   `C:\Users\wojci\OneDrive\WSZ\CausalMLinR\Project\Data\powiaty.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 380 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 171677.6 ymin: 133223.7 xmax: 861895.7 ymax: 774923.7
## Projected CRS: ETRS89 / Poland CS92
pov.sf <- st_transform(pov.sf, crs = "+proj=longlat +datum=NAD83")
waw.pov.sf <- pov.sf %>% filter(jpt_nazwa_=='powiat Warszawa') 
load('2nd Week/RNN_startup_clusters-main/data/gridPopWaw.RData')

#pop.df.waw - data from grid
#pop.grid.waw - spatial polygons object
#pop.waw Spatial Polygons Data Frame - all in one
pop.grid.waw.sf <- st_as_sf(pop.grid.waw, crs = "+proj=longlat +datum=NAD83")


#Grid ID to order and merge the data 
pop.grid.waw.sf$ID_grid <- 1:601

# We use a function provided by Dr Earl W. Duncan (2021) to get the centroids of the polygons
get.centroid.bb <- function(x){
  N <- length(x)  # Number of polygons
  # Initialise data.frame
  Centroids.bb <- data.frame(matrix(NA, N, 2, dimnames = list(NULL, c("long", "lat"))))
  for(i in 1:N){
    # Bounding box of polygon
    bb <- bbox(x@polygons[[i]])
    # Compute centroid
    Centroids.bb[i,] <- c(
      0.5 * (bb[1,1] + bb[1,2]),
      0.5 * (bb[2,1] + bb[2,2]))
  }
  return(Centroids.bb)
}

pop.grid.waw$lat <- get.centroid.bb(pop.grid.waw)[,1]
pop.grid.waw$long <- get.centroid.bb(pop.grid.waw)[,2]
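# Note: get.centroid.bb() returns longitude in its first column ("long") and
# latitude in its second ("lat"), so the two assignments above swap the labels.
# The swap is applied consistently in the code below (coordinates() is later
# set with c("lat", "long")), so the resulting spatial objects still line up.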


#################################################################
 meansByYearT <- matrix(data=NA, nrow=1, ncol=601)

  
  coordinates(df)<-c("long","lat") # class changed for SpatialPoints

  pop.grid.waw$ID<-rownames(pop.df.waw) # identification of units in grid
  pop.df.waw$ID<-rownames(pop.df.waw) # identification of units in data.frame
  crs(df)<-crs(pop.grid.waw) # equalize the projection between objects
  df$ID<-over(df, pop.grid.waw) # assign grid ID to schools
  
  crds<-coordinates(pop.grid.waw) # centroids of grid cells

We aggregate several variables at the grid level:

  • Math19 - average result in the basic math exam in 2019

  • Math22 - average result in the basic math exam in 2022

  • Dist19 - average distance to the nearest metro station in 2019

  • Dist20 - average distance to the nearest metro station in 2020

  • Dist22 - average distance to the nearest metro station in 2022

  • nstudent19 - number of students attending schools in the grid cell in 2019

  • nstudent22 - number of students attending schools in the grid cell in 2022

  • public - share of public schools in the grid cell

  • psych - share of schools in the grid cell with access to psychological help

# point data aggregation by grid ID
  df.agg19 <-aggregate(df$math2019, by=list(df$ID$ID), FUN = mean)
  df.agg22 <-aggregate(df$math2022, by=list(df$ID$ID), FUN = mean)
  df.agg_dist19 <-aggregate(df$distance_2019, by=list(df$ID$ID), FUN = mean)
  df.agg_dist20 <-aggregate(df$distance_2020, by=list(df$ID$ID), FUN = mean)
  df.agg_dist22 <-aggregate(df$distance_2022, by=list(df$ID$ID), FUN = mean)
  df.agg_nstudent19 <-aggregate(df$n_students2019, by=list(df$ID$ID), FUN = sum)
  df.agg_nstudent22 <-aggregate(df$n_students22, by=list(df$ID$ID), FUN = sum)
  df.public <- aggregate(df$Public, by=list(df$ID$ID), FUN = mean)
  df.psych <- aggregate(df$psych, by=list(df$ID$ID), FUN = mean)
  
  
  # merge datasets 
  pop.df.waw.m <- merge(pop.df.waw, df.agg19, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[18] <- "math19"
  pop.df.waw.m<-merge(pop.df.waw.m, df.agg22, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[19] <- "math22"
  
  pop.df.waw.m<-merge(pop.df.waw.m, df.agg_dist19, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[20] <- "dist19"
  pop.df.waw.m<-merge(pop.df.waw.m, df.agg_dist20, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[21] <- "dist20"
  pop.df.waw.m<-merge(pop.df.waw.m, df.agg_dist22, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[22] <- "dist22"
  
  pop.df.waw.m<-merge(pop.df.waw.m, df.agg_nstudent19, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[23] <- "nstudent19"
  pop.df.waw.m<-merge(pop.df.waw.m, df.agg_nstudent22, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[24] <- "nstudent22"
  
  pop.df.waw.m<-merge(pop.df.waw.m, df.public, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[25] <- "public"
  
  pop.df.waw.m<-merge(pop.df.waw.m, df.psych, by.x="ID", by.y="Group.1", all.x=TRUE)
  names(pop.df.waw.m)[26] <- "psych"
  

  # clean the NA values
  m1<-which(is.na(pop.df.waw.m$math19))
  pop.df.waw.m$math19[m1]<-0
  pop.df.waw.m$math22[m1]<-0
  pop.df.waw.m$dist19[m1]<-0
  pop.df.waw.m$dist20[m1]<-0
  pop.df.waw.m$dist22[m1]<-0
  pop.df.waw.m$nstudent19[m1]<-0
  pop.df.waw.m$nstudent22[m1]<-0
  
  
  #merging with results matrix
  countsByYear_math19<- rbind(meansByYearT,
                        pop.df.waw.m$math19)
  countsByYear_math22<- rbind(meansByYearT,
                              pop.df.waw.m$math22)
  countsByYear_dist19<- rbind(meansByYearT,
                              pop.df.waw.m$dist19)
  countsByYear_dist20<- rbind(meansByYearT,
                              pop.df.waw.m$dist20)
  countsByYear_dist22<- rbind(meansByYearT,
                              pop.df.waw.m$dist22)
  countsByYear_nstudent19 <- rbind(meansByYearT,
                              pop.df.waw.m$nstudent19)
  countsByYear_nstudent22 <- rbind(meansByYearT,
                                  pop.df.waw.m$nstudent22)
  countsByYear_public <- rbind(meansByYearT,
                                   pop.df.waw.m$public)
  countsByYear_psych <- rbind(meansByYearT,
                                   pop.df.waw.m$psych)
  

  totalCountInGrid1 <- colSums(countsByYear_math19,na.rm=T)
  totalCountInGrid2 <- colSums(countsByYear_math22,na.rm=T)
  totalCountInGrid3 <- colSums(countsByYear_dist19,na.rm=T)
  totalCountInGrid4 <- colSums(countsByYear_dist20,na.rm=T)
  totalCountInGrid5 <- colSums(countsByYear_dist22,na.rm=T)
  totalCountInGrid6 <- colSums(countsByYear_nstudent19,na.rm=T)
  totalCountInGrid7 <- colSums(countsByYear_nstudent22,na.rm=T)
  totalCountInGrid8 <- colSums(countsByYear_public,na.rm=T)
  totalCountInGrid9 <- colSums(countsByYear_psych,na.rm=T)

  
  # bind population grid data with the school-level aggregates, variable by variable
  pop.df.waw.schools<- cbind(pop.df.waw, 
                             totalCountInGrid1,
                             totalCountInGrid2,
                             totalCountInGrid3,
                             totalCountInGrid4,
                             totalCountInGrid5,
                             totalCountInGrid6,
                             totalCountInGrid7,
                             totalCountInGrid8,
                             totalCountInGrid9)
  head(pop.df.waw.schools)
##         TOT TOT_0_14 TOT_15_64 TOT_65__ TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## 206436 4841      696      3586      559     2339    2502       365       1739
## 206464  137       19        98       20       61      76        10         46
## 206476  269       38       199       32      124     145        23         89
## 206492    0        0         0        0        0       0         0          0
## 206505  386       63       276       47      180     206        26        137
## 206602   86       11        62       13       41      45         7         28
##        MALE_65__ FEM_0_14 FEM_15_64 FEM_65__ FEM_RATIO SHAPE_Leng SHAPE_Area
## 206436       235      331      1847      324  106.9688   3997.988   998991.0
## 206464         5        9        52       15  124.5902   3998.006   999000.1
## 206476        12       15       110       20  116.9355   3997.989   998991.6
## 206492         0        0         0        0    0.0000   3998.008   999000.9
## 206505        17       37       139       30  114.4444   3998.005   998999.3
## 206602         6        4        34        7  109.7561   3998.009   999001.7
##                                   CODE     ID totalCountInGrid1
## 206436 CRS3035RES1000mN3286000E5059000 206436                 0
## 206464 CRS3035RES1000mN3298000E5059000 206464                 0
## 206476 CRS3035RES1000mN3287000E5059000 206476                 0
## 206492 CRS3035RES1000mN3299000E5059000 206492                 0
## 206505 CRS3035RES1000mN3297000E5059000 206505                 0
## 206602 CRS3035RES1000mN3300000E5059000 206602                 0
##        totalCountInGrid2 totalCountInGrid3 totalCountInGrid4 totalCountInGrid5
## 206436                 0                 0                 0                 0
## 206464                 0                 0                 0                 0
## 206476                 0                 0                 0                 0
## 206492                 0                 0                 0                 0
## 206505                 0                 0                 0                 0
## 206602                 0                 0                 0                 0
##        totalCountInGrid6 totalCountInGrid7 totalCountInGrid8 totalCountInGrid9
## 206436                 0                 0                 0                 0
## 206464                 0                 0                 0                 0
## 206476                 0                 0                 0                 0
## 206492                 0                 0                 0                 0
## 206505                 0                 0                 0                 0
## 206602                 0                 0                 0                 0
  names(pop.df.waw.schools)[18] <- "math19_avg"
  names(pop.df.waw.schools)[19] <- "math22_avg"
  names(pop.df.waw.schools)[20] <- "dist_19_avg"
  names(pop.df.waw.schools)[21] <- "dist_20_avg"
  names(pop.df.waw.schools)[22] <- "dist_22_avg"
  names(pop.df.waw.schools)[23] <- "nstudents19"
  names(pop.df.waw.schools)[24] <- "nstudents22"
  names(pop.df.waw.schools)[25] <- "public"
  names(pop.df.waw.schools)[26] <- "psych"
  
  pop.df.waw.schools$long <- pop.grid.waw$long
  pop.df.waw.schools$lat <- pop.grid.waw$lat

# Save file for the analysis 
head(pop.df.waw.schools)
##         TOT TOT_0_14 TOT_15_64 TOT_65__ TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## 206436 4841      696      3586      559     2339    2502       365       1739
## 206464  137       19        98       20       61      76        10         46
## 206476  269       38       199       32      124     145        23         89
## 206492    0        0         0        0        0       0         0          0
## 206505  386       63       276       47      180     206        26        137
## 206602   86       11        62       13       41      45         7         28
##        MALE_65__ FEM_0_14 FEM_15_64 FEM_65__ FEM_RATIO SHAPE_Leng SHAPE_Area
## 206436       235      331      1847      324  106.9688   3997.988   998991.0
## 206464         5        9        52       15  124.5902   3998.006   999000.1
## 206476        12       15       110       20  116.9355   3997.989   998991.6
## 206492         0        0         0        0    0.0000   3998.008   999000.9
## 206505        17       37       139       30  114.4444   3998.005   998999.3
## 206602         6        4        34        7  109.7561   3998.009   999001.7
##                                   CODE     ID math19_avg math22_avg dist_19_avg
## 206436 CRS3035RES1000mN3286000E5059000 206436          0          0           0
## 206464 CRS3035RES1000mN3298000E5059000 206464          0          0           0
## 206476 CRS3035RES1000mN3287000E5059000 206476          0          0           0
## 206492 CRS3035RES1000mN3299000E5059000 206492          0          0           0
## 206505 CRS3035RES1000mN3297000E5059000 206505          0          0           0
## 206602 CRS3035RES1000mN3300000E5059000 206602          0          0           0
##        dist_20_avg dist_22_avg nstudents19 nstudents22 public psych     long
## 206436           0           0           0           0      0     0 52.19197
## 206464           0           0           0           0      0     0 52.29843
## 206476           0           0           0           0      0     0 52.20085
## 206492           0           0           0           0      0     0 52.30730
## 206505           0           0           0           0      0     0 52.28956
## 206602           0           0           0           0      0     0 52.31617
##             lat
## 206436 20.84561
## 206464 20.87183
## 206476 20.84779
## 206492 20.87402
## 206505 20.86964
## 206602 20.87621
  save(pop.df.waw.schools, file = 'C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/data_schools_grided.RData')

Based on the prepared data, we conduct Moran's I test to determine whether spatial analysis is justified.

rm(list = ls())
load('C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/data_schools_grided.RData')
set.seed(424178)
setwd("C:/Users/wojci/OneDrive/WSZ/CausalMLinR")

schools.df <- pop.df.waw.schools
schools.df$no_school <- ifelse(schools.df$nstudents19 == 0, 1, 0)
schools.sp <- schools.df
coordinates(schools.sp) <- c("lat", "long")
proj4string(schools.sp)<-as(st_crs("+proj=longlat +ellps=WGS84"), "CRS")
summary(schools.sp) 
## Object of class SpatialPointsDataFrame
## Coordinates:
##           min      max
## lat  20.84561 21.27214
## long 52.09673 52.37289
## Is projected: FALSE 
## proj4string : [+proj=longlat +ellps=WGS84 +no_defs]
## Number of points: 601
## Data attributes:
##       TOT           TOT_0_14        TOT_15_64        TOT_65__     
##  Min.   :    0   Min.   :   0.0   Min.   :    0   Min.   :   0.0  
##  1st Qu.:  155   1st Qu.:  24.0   1st Qu.:  109   1st Qu.:  14.0  
##  Median :  697   Median : 117.0   Median :  499   Median :  77.0  
##  Mean   : 2917   Mean   : 385.7   Mean   : 2030   Mean   : 500.8  
##  3rd Qu.: 3734   3rd Qu.: 551.0   3rd Qu.: 2614   3rd Qu.: 489.0  
##  Max.   :21531   Max.   :2672.0   Max.   :15020   Max.   :5297.0  
##     TOT_MALE       TOT_FEM        MALE_0_14        MALE_15_64    
##  Min.   :   0   Min.   :    0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:  76   1st Qu.:   78   1st Qu.:  12.0   1st Qu.:  55.0  
##  Median : 330   Median :  365   Median :  61.0   Median : 242.0  
##  Mean   :1340   Mean   : 1576   Mean   : 197.4   Mean   : 957.1  
##  3rd Qu.:1760   3rd Qu.: 2044   3rd Qu.: 294.0   3rd Qu.:1240.0  
##  Max.   :9531   Max.   :12000   Max.   :1370.0   Max.   :7242.0  
##    MALE_65__         FEM_0_14        FEM_15_64       FEM_65__     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0   Min.   :   0.0  
##  1st Qu.:   5.0   1st Qu.:  12.0   1st Qu.:  56   1st Qu.:   8.0  
##  Median :  31.0   Median :  57.0   Median : 252   Median :  46.0  
##  Mean   : 185.8   Mean   : 188.3   Mean   :1073   Mean   : 314.9  
##  3rd Qu.: 192.0   3rd Qu.: 266.0   3rd Qu.:1420   3rd Qu.: 287.0  
##  Max.   :1684.0   Max.   :1302.0   Max.   :7778   Max.   :3625.0  
##    FEM_RATIO       SHAPE_Leng     SHAPE_Area         CODE          
##  Min.   :  0.0   Min.   :3998   Min.   :998991   Length:601        
##  1st Qu.:101.5   1st Qu.:3998   1st Qu.:999043   Class :character  
##  Median :111.2   Median :3998   Median :999078   Mode  :character  
##  Mean   :102.4   Mean   :3998   Mean   :999082                     
##  3rd Qu.:118.5   3rd Qu.:3998   3rd Qu.:999118                     
##  Max.   :200.0   Max.   :3998   Max.   :999193                     
##       ID              math19_avg       math22_avg     dist_19_avg     
##  Length:601         Min.   : 0.000   Min.   : 0.00   Min.   : 0.0000  
##  Class :character   1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.: 0.0000  
##  Mode  :character   Median : 0.000   Median : 0.00   Median : 0.0000  
##                     Mean   : 9.076   Mean   : 9.23   Mean   : 0.3362  
##                     3rd Qu.: 0.000   3rd Qu.: 0.00   3rd Qu.: 0.0000  
##                     Max.   :94.471   Max.   :95.18   Max.   :13.3007  
##   dist_20_avg       dist_22_avg       nstudents19      nstudents22    
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 0.0000   Median : 0.0000   Median :  0.00   Median :  0.00  
##  Mean   : 0.2998   Mean   : 0.2904   Mean   : 21.87   Mean   : 25.86  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :13.3007   Max.   :13.3007   Max.   :545.00   Max.   :755.00  
##      public           psych             no_school     
##  Min.   :0.0000   Min.   :-1.000000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 0.000000   1st Qu.:1.0000  
##  Median :0.0000   Median : 0.000000   Median :1.0000  
##  Mean   :0.1164   Mean   :-0.002496   Mean   :0.8519  
##  3rd Qu.:0.0000   3rd Qu.: 0.000000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   : 1.000000   Max.   :1.0000
schools.sf <- st_as_sf(schools.sp)
head(schools.sf)
## Simple feature collection with 6 features and 27 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 20.84561 ymin: 52.19197 xmax: 20.87621 ymax: 52.31617
## Geodetic CRS:  +proj=longlat +ellps=WGS84 +no_defs
##         TOT TOT_0_14 TOT_15_64 TOT_65__ TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## 206436 4841      696      3586      559     2339    2502       365       1739
## 206464  137       19        98       20       61      76        10         46
## 206476  269       38       199       32      124     145        23         89
## 206492    0        0         0        0        0       0         0          0
## 206505  386       63       276       47      180     206        26        137
## 206602   86       11        62       13       41      45         7         28
##        MALE_65__ FEM_0_14 FEM_15_64 FEM_65__ FEM_RATIO SHAPE_Leng SHAPE_Area
## 206436       235      331      1847      324  106.9688   3997.988   998991.0
## 206464         5        9        52       15  124.5902   3998.006   999000.1
## 206476        12       15       110       20  116.9355   3997.989   998991.6
## 206492         0        0         0        0    0.0000   3998.008   999000.9
## 206505        17       37       139       30  114.4444   3998.005   998999.3
## 206602         6        4        34        7  109.7561   3998.009   999001.7
##                                   CODE     ID math19_avg math22_avg dist_19_avg
## 206436 CRS3035RES1000mN3286000E5059000 206436          0          0           0
## 206464 CRS3035RES1000mN3298000E5059000 206464          0          0           0
## 206476 CRS3035RES1000mN3287000E5059000 206476          0          0           0
## 206492 CRS3035RES1000mN3299000E5059000 206492          0          0           0
## 206505 CRS3035RES1000mN3297000E5059000 206505          0          0           0
## 206602 CRS3035RES1000mN3300000E5059000 206602          0          0           0
##        dist_20_avg dist_22_avg nstudents19 nstudents22 public psych no_school
## 206436           0           0           0           0      0     0         1
## 206464           0           0           0           0      0     0         1
## 206476           0           0           0           0      0     0         1
## 206492           0           0           0           0      0     0         1
## 206505           0           0           0           0      0     0         1
## 206602           0           0           0           0      0     0         1
##                         geometry
## 206436 POINT (20.84561 52.19197)
## 206464 POINT (20.87183 52.29843)
## 206476 POINT (20.84779 52.20085)
## 206492  POINT (20.87402 52.3073)
## 206505 POINT (20.86964 52.28956)
## 206602 POINT (20.87621 52.31617)
# spatial weights matrix – from sf
crds.sf <- st_centroid(st_geometry(schools.sf)) # centroids
knn.schools <- knearneigh(crds.sf, k=5)     # 5 nearest points
schools.knn.nb.sf <- knn2nb(knn.schools)
schools.knn.sym.listw <- nb2listw(make.sym.nb(schools.knn.nb.sf))

# Start with Moran test for spatial relationship
moran.test(schools.sf$math22_avg, schools.knn.sym.listw)
## 
##  Moran I test under randomisation
## 
## data:  schools.sf$math22_avg  
## weights: schools.knn.sym.listw    
## 
## Moran I statistic standard deviate = 15.93, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.371669929      -0.001666667       0.000549262
moran.test(schools.sf$math19_avg, schools.knn.sym.listw)
## 
##  Moran I test under randomisation
## 
## data:  schools.sf$math19_avg  
## weights: schools.knn.sym.listw    
## 
## Moran I statistic standard deviate = 15.971, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.3727464078     -0.0016666667      0.0005496188
# Moran's I statistic is about 0.37 in both years (standard deviate of about 15.9)
# Corresponding p-value < 2.2e-16
# We reject the null hypothesis of a random spatial distribution of the data

The null hypothesis of Moran's test is that the values are randomly distributed over space. Based on the test, we reject the null hypothesis. Conclusion: spatial analysis is justified.
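
As an optional follow-up (a sketch only, kept in the comment like the other optional chunks), the global statistic can be decomposed into local contributions with spdep's localmoran(), which highlights the grid cells that drive the spatial clustering.

# Local Moran's I for the 2022 averages (optional diagnostic, not run here)
# lisa22 <- localmoran(schools.sf$math22_avg, schools.knn.sym.listw)
# head(lisa22)  # local Ii, expectation, variance, z-score and p-value per cell
# moran.plot(schools.sf$math22_avg, schools.knn.sym.listw)  # Moran scatterplot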

Spatiotemporal clustering of 2019 and 2022 achievements

Before running the models, we also check whether there are some differences in clustering between 2019 and 2022 results.

We start with a simple visualization of the data.

We find differences between the 2019 and 2022 results. Using each year's results, we apply k-means to cluster schools based on their location and achievements in mathematics. We then use the Rand index and Jaccard similarity to compare the clusterings between the years.

# Select data for clustering
df_clustering19 <- df[,c(11, 19, 20)]
df_clustering22 <- df[,c(14, 19, 20)]

# we set the maximum number of clusters
k.max <- 20
df_clustering19_scaled <- scale(df_clustering19)
df_clustering22_scaled <- scale(df_clustering22)

wss19 <- sapply(1:k.max, 
              function(k){kmeans(df_clustering19_scaled, k, nstart=50,iter.max = 15 )$tot.withinss})
wss19
##  [1] 492.00000 349.94171 278.67844 230.15365 190.98734 166.47400 145.66602
##  [8] 128.12113 116.89859 107.92195  99.16845  90.32335  84.39915  79.12764
## [15]  74.81848  69.43243  65.44149  62.28323  57.49721  56.76225
wss22 <- sapply(1:k.max, 
              function(k){kmeans(df_clustering22_scaled, k, nstart=50,iter.max = 15 )$tot.withinss})
wss22
##  [1] 492.00000 353.51456 276.71519 226.29575 190.27458 168.03346 149.38144
##  [8] 132.01353 120.11473 108.52911  99.95816  92.23566  84.90176  81.27866
## [15]  75.58690  71.00520  66.55689  61.90323  60.36939  57.77518
plot(1:k.max, wss19,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

plot(1:k.max, wss22,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

fviz_nbclust(df_clustering19_scaled, kmeans, method = "silhouette")

fviz_nbclust(df_clustering22_scaled, kmeans, method = "silhouette")

We can’t see any reasonable point when using within sum-of-sqaures measure. However, when we use silhouette measure, we find that both in 2019 and 2022 the optimal number of clusters is 9! It needs to be noted, that such outcome is rarely found, as silhouette measure returns in majority of the analysed cases two clusters as an optimal solution.

clustering19 <- kmeans(df_clustering19_scaled, 9, nstart = 25)
fviz_cluster(clustering19, df_clustering19_scaled, frame = FALSE, geom = "point")

df$cluster19 <- clustering19$cluster

clustering22 <- kmeans(df_clustering22_scaled, 9, nstart = 25)
fviz_cluster(clustering22, df_clustering22_scaled, frame = FALSE, geom = "point")

df$cluster22 <- clustering22$cluster

df %>% 
  group_by(cluster19) %>% 
  summarise("AVG_Math19" = mean(math2019))
## # A tibble: 9 × 2
##   cluster19 AVG_Math19
##       <int>      <dbl>
## 1         1       69.1
## 2         2       37.6
## 3         3       78.7
## 4         4       63.4
## 5         5       43.6
## 6         6       57.8
## 7         7       70.8
## 8         8       86.1
## 9         9       49.3
df %>% 
  group_by(cluster22) %>% 
  summarise("AVG_Math22" = mean(math2022))
## # A tibble: 9 × 2
##   cluster22 AVG_Math22
##       <int>      <dbl>
## 1         1       63.8
## 2         2       75.0
## 3         3       44.6
## 4         4       48.0
## 5         5       81.4
## 6         6       47.0
## 7         7       70.7
## 8         8       84.0
## 9         9       43.1

We see large differences between clusters. In 2019, students from Cluster 8 scored more than twice as high as those from Cluster 2. We observe a similar pattern in 2022, when students from Cluster 8 scored almost twice as high as their peers from Cluster 9. Although the number of clusters is the same in 2019 and 2022, we are unsure whether these clusters are the same. By applying the Rand index, we check whether schools remained in the same clusters over the years.

rand.index(df$cluster19, df$cluster22)
## [1] 0.884405
jaccard(df$cluster19, df$cluster22)
## [1] 1

The Rand index is equal to 0.884405, which means that the clusterings are quite similar. Jaccard similarity does not take into account observations that were originally assigned to different clusters. We find a Jaccard similarity equal to 1, meaning that the temporal composition of the clusters did not change. To verify this, we map these points below.
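
One optional robustness check before the map (a sketch, kept in the comment): the raw Rand index does not correct for agreement expected by chance, whereas the fossil package also provides a chance-corrected version.

# Adjusted Rand index corrects the raw index for agreement expected by chance
# adj.rand.index(df$cluster19, df$cluster22)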

# Read data for Metro M1, M2 
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/metro_m1m2.Rdata")
plot_cluster19 <- ggplot()  +
  geom_point(data = metrom1m2, aes(x = as.numeric(longitude), y = as.numeric(latitude), fill = factor(Label)), shape = 23, size = 2) +
  geom_point(data = df, aes(x = long, y = lat, size = n_students2019, col = factor(cluster19)), shape = 20, alpha = 0.9) + 
  geom_sf(waw.pov.sf.line, mapping = aes(geometry = geometry)) +
  scale_color_manual(values = c("#FF0000",
                     "#0000FF",
                     "#008000",
                     "#FFFF00",
                     "#800080",
                     "#FFA500",
                     "#40E0D0",
                     "#FFC0CB",
                     "#A52A2A"))+
  theme_bw() +
  theme(text = element_text(color = "grey20", size = 12, family = "Roboto"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none",
        legend.key.width = unit(1.5, 'cm')) 

plot_cluster22 <- ggplot()  +
  geom_point(data = metrom1m2, aes(x = as.numeric(longitude), y = as.numeric(latitude), fill = factor(Label)), shape = 23, size = 2) +
  geom_point(data = df, aes(x = long, y = lat, size = n_students22, col = factor(cluster22)), shape = 20, alpha = 0.9) + 
  geom_sf(waw.pov.sf.line, mapping = aes(geometry = geometry)) +
  scale_color_manual(values = c("#FF0000",
                     "#0000FF",
                     "#008000",
                     "#FFFF00",
                     "#800080",
                     "#FFA500",
                     "#40E0D0",
                     "#FFC0CB",
                     "#A52A2A"))+
  theme_bw() +
  theme(text = element_text(color = "grey20", size = 12, family = "Roboto"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none",
        legend.key.width = unit(1.5, 'cm')) 

ggpubr::ggarrange(plot_cluster19, plot_cluster22, widths = 20, heights = 16) 

Visually, we see that schools were clustered similarly. However, if we look closer at the western part of Warsaw, near the new M2 stations (blue diamonds), there are some migration patterns.

df %>% 
  filter(area == "Wola" | area == "Bemowo") %>% 
  group_by(cluster22) %>% 
  summarise(Freq = n())
## # A tibble: 5 × 2
##   cluster22  Freq
##       <int> <int>
## 1         2     3
## 2         3     1
## 3         4    16
## 4         5     4
## 5         9     1

Schools in Wola and Bemowo were grouped into clusters 1 and 6 in 2019 and have since moved to other clusters. Unfortunately, for the schools that migrated from cluster 6 (as of 2019) to clusters 3 and 4 in 2022, we observe a decrease in average results. Hence, instead of improving students' results by extending public transport, we observe the opposite effect. It should be noted, however, that due to data limitations our model does not consider socioeconomic factors. Wola is characterised by poorer labour market outcomes, and the impact of COVID-19 might have affected more deprived areas on a larger scale, leading to deteriorating student results regardless of improved access to public transport.
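
A simple cross-tabulation would make this migration explicit (a sketch, kept in the comment; note that k-means cluster labels are arbitrary, so only the pattern of movement between years is informative).

# Cross-tabulate 2019 vs 2022 cluster membership for schools in Wola and Bemowo
# df %>%
#   filter(area %in% c("Wola", "Bemowo")) %>%
#   count(cluster19, cluster22)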

Regression Analysis

As we have already shown, a spatial relationship with average achievements appears to be present in the data. To model it, we propose two quantitative approaches: Geographically Weighted Regression and Causal Forests. First, we clean our environment to ensure that the loaded data are correct.

rm(list = ls())
load('C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/data_schools_grided.RData')
set.seed(424178)
setwd("C:/Users/wojci/OneDrive/WSZ/CausalMLinR")

# We need three datasets
# data frames
# data frame in SP format
# data frame in SF format

schools.df <- pop.df.waw.schools
schools.df$no_school <- ifelse(schools.df$nstudents19 == 0, 1, 0)
schools.sp <- schools.df
coordinates(schools.sp) <- c("lat", "long")
proj4string(schools.sp)<-as(st_crs("+proj=longlat +ellps=WGS84"), "CRS")
summary(schools.sp) 
## Object of class SpatialPointsDataFrame
## Coordinates:
##           min      max
## lat  20.84561 21.27214
## long 52.09673 52.37289
## Is projected: FALSE 
## proj4string : [+proj=longlat +ellps=WGS84 +no_defs]
## Number of points: 601
## Data attributes:
##       TOT           TOT_0_14        TOT_15_64        TOT_65__     
##  Min.   :    0   Min.   :   0.0   Min.   :    0   Min.   :   0.0  
##  1st Qu.:  155   1st Qu.:  24.0   1st Qu.:  109   1st Qu.:  14.0  
##  Median :  697   Median : 117.0   Median :  499   Median :  77.0  
##  Mean   : 2917   Mean   : 385.7   Mean   : 2030   Mean   : 500.8  
##  3rd Qu.: 3734   3rd Qu.: 551.0   3rd Qu.: 2614   3rd Qu.: 489.0  
##  Max.   :21531   Max.   :2672.0   Max.   :15020   Max.   :5297.0  
##     TOT_MALE       TOT_FEM        MALE_0_14        MALE_15_64    
##  Min.   :   0   Min.   :    0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:  76   1st Qu.:   78   1st Qu.:  12.0   1st Qu.:  55.0  
##  Median : 330   Median :  365   Median :  61.0   Median : 242.0  
##  Mean   :1340   Mean   : 1576   Mean   : 197.4   Mean   : 957.1  
##  3rd Qu.:1760   3rd Qu.: 2044   3rd Qu.: 294.0   3rd Qu.:1240.0  
##  Max.   :9531   Max.   :12000   Max.   :1370.0   Max.   :7242.0  
##    MALE_65__         FEM_0_14        FEM_15_64       FEM_65__     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0   Min.   :   0.0  
##  1st Qu.:   5.0   1st Qu.:  12.0   1st Qu.:  56   1st Qu.:   8.0  
##  Median :  31.0   Median :  57.0   Median : 252   Median :  46.0  
##  Mean   : 185.8   Mean   : 188.3   Mean   :1073   Mean   : 314.9  
##  3rd Qu.: 192.0   3rd Qu.: 266.0   3rd Qu.:1420   3rd Qu.: 287.0  
##  Max.   :1684.0   Max.   :1302.0   Max.   :7778   Max.   :3625.0  
##    FEM_RATIO       SHAPE_Leng     SHAPE_Area         CODE          
##  Min.   :  0.0   Min.   :3998   Min.   :998991   Length:601        
##  1st Qu.:101.5   1st Qu.:3998   1st Qu.:999043   Class :character  
##  Median :111.2   Median :3998   Median :999078   Mode  :character  
##  Mean   :102.4   Mean   :3998   Mean   :999082                     
##  3rd Qu.:118.5   3rd Qu.:3998   3rd Qu.:999118                     
##  Max.   :200.0   Max.   :3998   Max.   :999193                     
##       ID              math19_avg       math22_avg     dist_19_avg     
##  Length:601         Min.   : 0.000   Min.   : 0.00   Min.   : 0.0000  
##  Class :character   1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.: 0.0000  
##  Mode  :character   Median : 0.000   Median : 0.00   Median : 0.0000  
##                     Mean   : 9.076   Mean   : 9.23   Mean   : 0.3362  
##                     3rd Qu.: 0.000   3rd Qu.: 0.00   3rd Qu.: 0.0000  
##                     Max.   :94.471   Max.   :95.18   Max.   :13.3007  
##   dist_20_avg       dist_22_avg       nstudents19      nstudents22    
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 0.0000   Median : 0.0000   Median :  0.00   Median :  0.00  
##  Mean   : 0.2998   Mean   : 0.2904   Mean   : 21.87   Mean   : 25.86  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :13.3007   Max.   :13.3007   Max.   :545.00   Max.   :755.00  
##      public           psych             no_school     
##  Min.   :0.0000   Min.   :-1.000000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 0.000000   1st Qu.:1.0000  
##  Median :0.0000   Median : 0.000000   Median :1.0000  
##  Mean   :0.1164   Mean   :-0.002496   Mean   :0.8519  
##  3rd Qu.:0.0000   3rd Qu.: 0.000000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   : 1.000000   Max.   :1.0000
schools.sf <- st_as_sf(schools.sp)
head(schools.sf)
## Simple feature collection with 6 features and 27 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 20.84561 ymin: 52.19197 xmax: 20.87621 ymax: 52.31617
## Geodetic CRS:  +proj=longlat +ellps=WGS84 +no_defs
##         TOT TOT_0_14 TOT_15_64 TOT_65__ TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## 206436 4841      696      3586      559     2339    2502       365       1739
## 206464  137       19        98       20       61      76        10         46
## 206476  269       38       199       32      124     145        23         89
## 206492    0        0         0        0        0       0         0          0
## 206505  386       63       276       47      180     206        26        137
## 206602   86       11        62       13       41      45         7         28
##        MALE_65__ FEM_0_14 FEM_15_64 FEM_65__ FEM_RATIO SHAPE_Leng SHAPE_Area
## 206436       235      331      1847      324  106.9688   3997.988   998991.0
## 206464         5        9        52       15  124.5902   3998.006   999000.1
## 206476        12       15       110       20  116.9355   3997.989   998991.6
## 206492         0        0         0        0    0.0000   3998.008   999000.9
## 206505        17       37       139       30  114.4444   3998.005   998999.3
## 206602         6        4        34        7  109.7561   3998.009   999001.7
##                                   CODE     ID math19_avg math22_avg dist_19_avg
## 206436 CRS3035RES1000mN3286000E5059000 206436          0          0           0
## 206464 CRS3035RES1000mN3298000E5059000 206464          0          0           0
## 206476 CRS3035RES1000mN3287000E5059000 206476          0          0           0
## 206492 CRS3035RES1000mN3299000E5059000 206492          0          0           0
## 206505 CRS3035RES1000mN3297000E5059000 206505          0          0           0
## 206602 CRS3035RES1000mN3300000E5059000 206602          0          0           0
##        dist_20_avg dist_22_avg nstudents19 nstudents22 public psych no_school
## 206436           0           0           0           0      0     0         1
## 206464           0           0           0           0      0     0         1
## 206476           0           0           0           0      0     0         1
## 206492           0           0           0           0      0     0         1
## 206505           0           0           0           0      0     0         1
## 206602           0           0           0           0      0     0         1
##                         geometry
## 206436 POINT (20.84561 52.19197)
## 206464 POINT (20.87183 52.29843)
## 206476 POINT (20.84779 52.20085)
## 206492  POINT (20.87402 52.3073)
## 206505 POINT (20.86964 52.28956)
## 206602 POINT (20.87621 52.31617)

We start with a simple OLS regression. We also estimate models weighted by the number of students, because larger schools need not be randomly distributed in space; moreover, planning a public transport extension should rest on a cost-benefit analysis that asks how to help the largest number of students at the lowest cost.

# basic equation and OLS models
eq22 <- math22_avg ~  dist_22_avg  + public + psych + no_school
eq19 <- math19_avg ~  dist_19_avg  + public + psych + no_school

# OLS with distance to metro, school type, psychologist availability and no-school indicator
mod_ols22 <- lm(eq22, data = schools.df)
mod_ols19 <- lm(eq19, data = schools.df)


# Weighted OLS (by the number of students in the school)
mod_ols_weights22 <- lm(eq22, data = schools.df, 
                      weights = schools.df$nstudents22)
mod_ols_weights19 <- lm(eq19, data = schools.df, 
                      weights = schools.df$nstudents19)


# Summarise the results 
stargazer::stargazer(mod_ols22,
                     mod_ols_weights22,
                     mod_ols19, 
                     mod_ols_weights19, type = "text")
## 
## ================================================================================================================
##                                                         Dependent variable:                                     
##                     --------------------------------------------------------------------------------------------
##                                      math22_avg                                     math19_avg                  
##                                (1)                    (2)                    (3)                     (4)        
## ----------------------------------------------------------------------------------------------------------------
## dist_22_avg                   0.162                  0.818                                                      
##                              (0.279)                (0.702)                                                     
##                                                                                                                 
## dist_19_avg                                                                 0.180                  1.190*       
##                                                                            (0.234)                 (0.657)      
##                                                                                                                 
## public                        -1.239                 2.422                  -0.591                  1.440       
##                              (1.958)                (4.916)                (1.652)                 (4.306)      
##                                                                                                                 
## psych                        5.804**                12.945*                7.543***                12.943*      
##                              (2.287)                (7.220)                (1.940)                 (6.550)      
##                                                                                                                 
## no_school                   -63.084***                                    -61.472***                            
##                              (1.883)                                       (1.610)                              
##                                                                                                                 
## Constant                    63.084***              62.208***              61.472***               61.451***     
##                              (1.859)                (4.565)                (1.590)                 (4.029)      
##                                                                                                                 
## ----------------------------------------------------------------------------------------------------------------
## Observations                   601                    601                    601                     601        
## R2                            0.916                  0.053                  0.936                   0.074       
## Adjusted R2                   0.915                  0.019                  0.936                   0.042       
## Residual Std. Error      6.740 (df = 596)      198.990 (df = 85)       5.710 (df = 596)       165.818 (df = 85) 
## F Statistic         1,623.113*** (df = 4; 596) 1.577 (df = 3; 85) 2,187.997*** (df = 4; 596) 2.273* (df = 3; 85)
## ================================================================================================================
## Note:                                                                                *p<0.1; **p<0.05; ***p<0.01

We additionally run an instrumental variable estimation, instrumenting the average 2022 distance with the 2019 distance. The reasoning is that the location of the metro might be endogenous to important unobservable factors, such as the number of students living in the area. However, as Ichino and Winter-Ebmer (1999) note, the usefulness of an instrument depends strongly on the estimation goal. In our case, we want to see how deviations from the average distance influence the results, so such an instrument may serve its purpose quite well.

iv_dist22 <- ivreg(math22_avg ~  dist_22_avg + public + psych + no_school  | dist_19_avg + public + psych + no_school , data = schools.df)
summary(iv_dist22)
## 
## Call:
## ivreg(formula = math22_avg ~ dist_22_avg + public + psych + no_school | 
##     dist_19_avg + public + psych + no_school, data = schools.df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.626e+01 -1.421e-13 -1.421e-13 -1.421e-13  3.268e+01 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.47630    1.87364  33.879   <2e-16 ***
## dist_22_avg   0.02547    0.29015   0.088   0.9301    
## public       -1.39627    1.96059  -0.712   0.4766    
## psych         5.81782    2.28781   2.543   0.0112 *  
## no_school   -63.47630    1.89718 -33.458   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.741 on 596 degrees of freedom
## Multiple R-Squared: 0.9159,  Adjusted R-squared: 0.9153 
## Wald test:  1622 on 4 and 596 DF,  p-value: < 2.2e-16
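Before leaning on the IV results, it is worth checking that the instrument is relevant. A minimal first-stage sketch, assuming schools.df as loaded above (first_stage is our illustrative name); if ivreg comes from the AER package, summary(iv_dist22, diagnostics = TRUE) should also report weak-instrument diagnostics.

# first stage (sketch): regress the instrumented 2022 distance on the 2019 distance
# and the exogenous controls; a high R-squared suggests a relevant instrument
first_stage <- lm(dist_22_avg ~ dist_19_avg + public + psych + no_school, data = schools.df)
summary(first_stage)$r.squared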

Simple models find, in general, no significant impact of the metro extension on school achievements. We find similar results when using IV estimation. However, the parameter for the distance seems to be overestimated in the simple regression. Let's see whether spatial models can address the problem.

# We run the model only on grids with some data 
dMat <- gw.dist(cbind(schools.df$lat, schools.df$long)) # coordinates not from sp
bw19 <- GWmodel::bw.gwr(eq19, data = schools.sp, kernel='gaussian', adaptive = TRUE, dMat = dMat)
## Adaptive bandwidth: 379 CV score: 22258.42 
## Adaptive bandwidth: 242 CV score: 22203.21 
## Adaptive bandwidth: 157 CV score: 22346.28 
## Adaptive bandwidth: 294 CV score: 22234.95 
## Adaptive bandwidth: 209 CV score: 22206.9 
## Adaptive bandwidth: 261 CV score: 22217.53 
## Adaptive bandwidth: 228 CV score: 22214.95 
## Adaptive bandwidth: 248 CV score: 22205.33 
## Adaptive bandwidth: 235 CV score: 22208.69 
## Adaptive bandwidth: 243 CV score: 22203.37 
## Adaptive bandwidth: 238 CV score: 22197.7 
## Adaptive bandwidth: 239 CV score: 22200.09 
## Adaptive bandwidth: 241 CV score: 22203.18 
## Adaptive bandwidth: 239 CV score: 22200.09 
## Adaptive bandwidth: 240 CV score: 22197.71 
## Adaptive bandwidth: 239 CV score: 22200.09 
## Adaptive bandwidth: 239 CV score: 22200.09 
## Adaptive bandwidth: 238 CV score: 22197.7
bw22 <- GWmodel::bw.gwr(eq22, data = schools.sp, kernel='gaussian', adaptive = TRUE, dMat = dMat)
## Adaptive bandwidth: 379 CV score: 31178.65 
## Adaptive bandwidth: 242 CV score: 31154.39 
## Adaptive bandwidth: 157 CV score: 31437.63 
## Adaptive bandwidth: 294 CV score: 31171.91 
## Adaptive bandwidth: 209 CV score: 31170.57 
## Adaptive bandwidth: 261 CV score: 31161.1 
## Adaptive bandwidth: 228 CV score: 31173.72 
## Adaptive bandwidth: 248 CV score: 31152.27 
## Adaptive bandwidth: 254 CV score: 31153.66 
## Adaptive bandwidth: 246 CV score: 31152.2 
## Adaptive bandwidth: 243 CV score: 31154.35 
## Adaptive bandwidth: 246 CV score: 31152.2
# note: bw19/bw22 were selected with adaptive = TRUE (numbers of neighbours) and are
# passed here with adaptive = FALSE, i.e. they are interpreted as fixed bandwidths
modelGWR19<-GWmodel::gwr.basic(eq19, data = schools.sp, kernel='gaussian', adaptive=FALSE, bw=bw19, dMat=dMat)  
modelGWR22<-GWmodel::gwr.basic(eq22, data = schools.sp, kernel='gaussian', adaptive=FALSE, bw=bw22, dMat=dMat)  

modelGWR.df19<-as.data.frame(modelGWR19$SDF)        # class data.frame
modelGWR.df22<-as.data.frame(modelGWR22$SDF)        # class data.frame

schools.sf$coef_dist19 <- modelGWR.df19$dist_19_avg     # local coefficient on dist_19_avg
schools.sf$coef_dist22 <- modelGWR.df22$dist_22_avg     # local coefficient on dist_22_avg
pov.sf <- st_read("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/powiaty.shp")
## Reading layer `powiaty' from data source 
##   `C:\Users\wojci\OneDrive\WSZ\CausalMLinR\Project\Data\powiaty.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 380 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 171677.6 ymin: 133223.7 xmax: 861895.7 ymax: 774923.7
## Projected CRS: ETRS89 / Poland CS92
pov.sf <- st_transform(pov.sf, crs = "+proj=longlat +datum=NAD83")
waw.pov.sf <- pov.sf %>% filter(jpt_nazwa_=='powiat Warszawa') 
waw.pov.sf.line <- st_cast(waw.pov.sf,"MULTILINESTRING")
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/loc_math_1922.Rdata")

We map the estimated local coefficients and observe some spatial patterns across the Warsaw area.

impact_gwr19 <- ggplot() +
  geom_point(data = df, aes(x = long, y = lat, size = n_students2019), shape = 20, alpha = 0.9, col = "#000000") +
  geom_sf(schools.sf, mapping = aes(color = coef_dist19), shape = 15, alpha = 0.4, size = 7.5) +
  geom_sf(waw.pov.sf.line, mapping = aes(geometry = geometry), inherit.aes = FALSE) +
  theme_bw() +
  theme(text = element_text(color = "grey20", size = 12,family = "Roboto"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "right",
        legend.key.width = unit(1.5, 'cm')) +
  scale_color_gradient(low = wes_palette("Zissou1", 5, type = "continuous")[1], 
                       high = wes_palette("Zissou1", 5, type = "continuous")[4]) +
  labs(color = "Dist. Coeff.", size = "School size")

impact_gwr22 <- ggplot() +
  geom_point(data = df, aes(x = long, y = lat, size = n_students22), shape = 20, alpha = 0.9, col = "#000000") +
  geom_sf(schools.sf, mapping = aes(color = coef_dist22), shape = 15, alpha = 0.4, size = 7.5) +
  geom_sf(waw.pov.sf.line, mapping = aes(geometry = geometry), inherit.aes = FALSE) +
  theme_bw() +
  theme(text = element_text(color = "grey20", size = 12, family = "Roboto"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "right",
        legend.key.width = unit(1.5, 'cm')) +
  scale_color_gradient(low = wes_palette("Zissou1", 5, type = "continuous")[1], 
                       high = wes_palette("Zissou1", 5, type = "continuous")[4]) +
  labs(color = "Dist. Coeff.", size = "School size")
  
# plot the graphs together
ggpubr::ggarrange(impact_gwr19, impact_gwr22,
                    common.legend = TRUE, widths = 20, heights = 16) 

However, as the maps show, the spatial heterogeneity is negligible. In fact, the difference in coefficients between Ursus and Wesoła appears only at the seventh decimal place!
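As a quick numeric check of this claim, one can inspect the spread of the local coefficients attached to schools.sf above (a sketch):

# spread of the local GWR distance coefficients across all grid cells
range(schools.sf$coef_dist19)
range(schools.sf$coef_dist22)
diff(range(schools.sf$coef_dist22))  # total spread of the local coefficient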

We turn our attention to causal forests. Since this setting corresponds to an analysis with treatment and control groups, we treat as treated the schools whose distance to the metro decreased in 2022. The control group contains only schools that were already close to a metro station.

# load the data once again
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/loc_math_1922.Rdata")
df$treatment <- ifelse(df$distance_2022 < df$distance_2019, 1, 0)
df$control <- ifelse(df$distance_2019 < 0.5, 1, 0)
df <- df %>% 
  filter(control == 1 | treatment == 1)
table(df$treatment)
## 
##  0  1 
## 34 30
table(df$control)
## 
##  0  1 
## 30 34
# no intersection between control and treatment

We end up with an (almost) balanced assignment: 30 schools treated and 34 in control. We move on to estimating the causal forest.

# Point to the analysed variables
Y <- df$math2022 - df$math2019 # outcome - difference in results 
X <- as.data.frame(cbind(df$Public, df$size, df$Oriented, df$psych, df$lat, df$long)) # covariates: school type, size, profile, psychologist availability, location
W <- df$treatment # treatment

# E[Y | X] with no test/train split
forest.Y <- regression_forest(X, Y)
Y.hat_train <- predict(forest.Y)$predictions


# Estimate propensity scores 
forest.W <- regression_forest(X, W)
W.hat_train <- predict(forest.W)$predictions

#Train the causal forest (weighted by propensity score)
c.forest <- causal_forest(X, Y, W, Y.hat_train, W.hat_train, min.node.size = 3)
tau.hat <- predict(c.forest, X)$predictions
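Before assessing the fit, it may help to see which covariates the forest splits on most often; a minimal sketch using grf's variable_importance() (vi is our illustrative name, and the column order follows the cbind() call above):

# split-based variable importance of the fitted causal forest
vi <- variable_importance(c.forest)
data.frame(variable = c("Public", "size", "Oriented", "psych", "lat", "long"),
           importance = round(as.numeric(vi), 3))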

We plot the predicted values against the observed outcomes to assess the goodness of fit.

pred <- data.frame(value = append(as.numeric(tau.hat), as.numeric(Y)))
pred$label <- c(rep("Prediction", 64), rep("Real", 64))


ggplot(data = pred) + 
  geom_histogram(aes(x = value, fill = label), 
                 alpha = 0.5,
                 position = "identity") + 
  theme_minimal() + 
  labs(title = "") +
  xlab("Predictions and Real values") +
  scale_y_continuous(name = "Density") +
  theme(
    text = element_text(family = "Roboto Light"),
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12),
    axis.title=element_text(size = 12),
    axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
    axis.ticks = element_line(size = 0.5, color="grey")
  ) + scale_fill_manual(values = c("#B6163A", "#000000"))

We find that our model struggles to predict score differences that deviate far from the average.
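A more formal counterpart to this visual impression is grf's omnibus calibration test, which regresses the outcome on the mean forest prediction and the centred individual predictions; coefficients close to 1 would indicate that the forest captures both the average effect and the heterogeneity. A minimal sketch:

# omnibus "best linear predictor" test of the causal forest's calibration
test_calibration(c.forest)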

#Find ATE
average_treatment_effect(c.forest, target.sample = "all") #ATE for all observations
##  estimate   std.err 
## 0.8509355 1.9379617
average_treatment_effect(c.forest, target.sample = "treated") #ATE only for treated 
## estimate  std.err 
## 1.735434 2.445492

We find positive treatment effects of the public transport extension; however, these estimates are not statistically significant. We further apply the model to the simulated data for the M3, M4 and M5 extension scenarios.
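To make the lack of significance explicit, one can form an approximate 95% confidence interval from the reported estimate and standard error (a normal-approximation sketch; ate is our illustrative name):

# approximate 95% CI for the ATE; with the values above this is roughly (-2.9, 4.6),
# so zero lies comfortably inside the interval
ate <- average_treatment_effect(c.forest, target.sample = "all")
ate["estimate"] + c(-1.96, 1.96) * ate["std.err"]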

M3, M4 and M5 urgency - projection from the models

We again begin with data preparation and compute each school's distance to the planned M3, M4 and M5 lines.

# load the data instead of working on them again 
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Math_Results_2019_Warsaw.Rdata")

# Metro Coordinates 
metro_cord345 <- read.csv("Project/Data/Metro_Pred345.csv")

metro_stations_m3 <- metro_cord345[metro_cord345$Label == "M3",]

metro_stations_m4 <- metro_cord345[metro_cord345$Label == "M4",]

metro_stations_m5 <- metro_cord345[metro_cord345$Label == "M5",]

# School maps
school_map <- data.frame(Station = df$name,
                         longitude = as.numeric(df$long),
                         latitude =  as.numeric(df$lat), 
                         Label = "school")

# Map both metro
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/metro_m1m2.Rdata")
map_points <- rbind(metro_cord345, school_map, metrom1m2)

table(map_points$Label)
## 
##     M1     M2     M3     M4     M5 school 
##     21     17     13     20     17    165
pal <- colorFactor(c( "green", "yellow", "purple", "brown", "red", "navy"),
                   domain = c("M3", "M4", "M5", "school", "M1", "M2"))


leaflet() %>% addTiles() %>% 
  addCircleMarkers(data = map_points,
                   lat = ~as.numeric(latitude), lng = ~as.numeric(longitude),
                   color = ~pal(map_points$Label),
                   radius = 4,
                   stroke = FALSE,
                   fillOpacity = 0.5,
                   label = map_points$Label) %>% 
  addLegend("bottomright", pal = pal, values = map_points$Label)
##########################################################################################################
## Calculate distance to metro stations 

df$distanceM3 <- rep(NA, length(df$name))
df$distanceM4 <- rep(NA, length(df$name))
df$distanceM5 <- rep(NA, length(df$name))

for (i in 1:length(df$name)){
  # working matrix 
  D2019 <- df[i, c("name",   "long", "lat")]
  
  # matrix for M3 stations
  DmetroM3 <- data.frame(name = map_points$Station[map_points$Label %in% c("M1", "M2", "M3")], 
                        lat =  as.numeric(map_points$latitude[map_points$Label %in% c("M1", "M2", "M3")]), 
                        long = as.numeric(map_points$longitude[map_points$Label %in% c("M1", "M2", "M3")]))

  # matrix for M4 stations
  DmetroM4 <- data.frame(name = map_points$Station[map_points$Label %in% c("M1", "M2", "M4")], 
                         lat =  as.numeric(map_points$latitude[map_points$Label %in% c("M1", "M2", "M4")]), 
                         long = as.numeric(map_points$longitude[map_points$Label %in% c("M1", "M2", "M4")]))
  
  # matrix for M5 stations
  DmetroM5 <- data.frame(name = map_points$Station[map_points$Label %in% c("M1", "M2", "M5")], 
                         lat =  as.numeric(map_points$latitude[map_points$Label %in% c("M1", "M2", "M5")]), 
                         long = as.numeric(map_points$longitude[map_points$Label %in% c("M1", "M2", "M5")]))
  
  
  distM3 <- rbind(D2019, DmetroM3)
  distM4 <- rbind(D2019, DmetroM4)
  distM5 <- rbind(D2019, DmetroM5)
  
  
  distance_matrix_M3 <- geodist(distM3, measure = 'geodesic' )/1000 #converting it to km
  distance_matrix_M4 <- geodist(distM4, measure = 'geodesic' )/1000 #converting it to km
  distance_matrix_M5 <- geodist(distM5, measure = 'geodesic' )/1000 #converting it to km
  
  # columns 21-23 of df hold distanceM3, distanceM4 and distanceM5
  df[i, 21] <- min(distance_matrix_M3[1, 2:52])   # nearest of the 51 M1/M2/M3 stations
  df[i, 22] <- min(distance_matrix_M4[1, 2:59])   # nearest of the 58 M1/M2/M4 stations
  df[i, 23] <- min(distance_matrix_M5[1, 2:56])   # nearest of the 55 M1/M2/M5 stations
}
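The per-school loop above could also be written without an explicit loop, since geodist() accepts two coordinate data frames and returns the full cross-distance matrix. A minimal sketch for the M3 scenario, assuming df and map_points as above (schools_xy, stations_m3, cross_m3 and distanceM3_alt are our illustrative names; M4 and M5 follow analogously):

# all school-to-station distances at once, then the nearest station per school
schools_xy <- data.frame(long = as.numeric(df$long), lat = as.numeric(df$lat))
stations_m3 <- data.frame(
  long = as.numeric(map_points$longitude[map_points$Label %in% c("M1", "M2", "M3")]),
  lat  = as.numeric(map_points$latitude[map_points$Label %in% c("M1", "M2", "M3")]))
cross_m3 <- geodist(schools_xy, stations_m3, measure = "geodesic") / 1000  # in km
df$distanceM3_alt <- apply(cross_m3, 1, min)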
################################## VISUALISATION ##################################
## Summarize the change in the distance 
vis_distance <- data.frame(id = 1:length(df$distanceM3))
vis_distance <- cbind(vis_distance, df$distanceM3, df$distanceM4, df$distanceM5)


vis_distance_l <- reshape(vis_distance, direction = "long", idvar="id",
                          varying = 2:4, sep = "")
vis_distance_l<- vis_distance_l %>%  
  rename("distance" = `df$distanceM` )

vis_distance_l$short <- ifelse(vis_distance_l$distance < 0.5, 1, 0)
vis_distance_l$long <- ifelse(vis_distance_l$distance > 3, 1, 0)

m345_proj_dist <- ggplot(vis_distance_l, aes(x = as.factor(time), y = distance, fill = as.factor(time)))+
  geom_bar( stat = "summary", fun = mean) +
  theme_minimal() + 
  labs(title = "") +
  scale_x_discrete(name = "Metro expansion scenario",
                   labels = c(
                     "3" = "M3",
                     "4" = "M4",
                     "5" = "M5")) +
  scale_y_continuous(name = "Average distance to station (in km)") +
  theme(
    text = element_text(family = "Roboto Light"),
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12),
    axis.title=element_text(size = 12),
    axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
    axis.ticks = element_line(size = 0.5, color="grey")
  ) + scale_fill_manual(values = c("#B6163A", "#000000", "#E3A2AF"))


# Number of schools with large distance 
m345_proj_n <- ggplot(vis_distance_l, aes(x = as.factor(time), y = long, fill = as.factor(time)))+
  geom_bar(stat = "summary", fun = mean) +
  theme_minimal() + 
  labs(title = "") +
  scale_x_discrete(name = "Metro expansion scenario",
                   labels = c(
                     "3" = "M3",
                     "4" = "M4",
                     "5" = "M5")) +
  scale_y_continuous(name = "Share of schools with no close access to metro") +
  theme(
    text = element_text(family = "Roboto Light"),
    panel.grid.major = element_blank(),
    legend.position = "none", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12),
    axis.title=element_text(size = 12),
    axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
    axis.ticks = element_line(size = 0.5, color="grey")
  ) + scale_fill_manual(values = c("#B6163A", "#000000", "#E3A2AF"))

ggpubr::ggarrange(m345_proj_dist, m345_proj_n)

The descriptive analysis shows that the M5 expansion scenario would have the highest impact on improving metro access to schools. We test this with the causal forest model estimated in the previous part.

df_test <- df
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/loc_math_1922.Rdata")
df_test$distance2019 <- df$distance_2019
df_test$distance2022 <- df$distance_2022

# Treatment, control assignment for M3
df_test$treatmentM3 <- ifelse(df_test$distanceM3 < df_test$distance2022, 1, 0)
df_test$controlM3 <- ifelse(df_test$distance2022 < 0.5 & df_test$treatmentM3 == 0, 1, 0)
df_testM3 <- df_test %>% 
  filter(controlM3 == 1 | treatmentM3 == 1)
table(df_testM3$treatmentM3)
## 
##  0  1 
## 35 65
table(df_testM3$controlM3)
## 
##  0  1 
## 65 35
# Treatment, control assignment for M4
df_test$treatmentM4 <- ifelse(df_test$distanceM4 < df_test$distance2022, 1, 0)
df_test$controlM4 <- ifelse(df_test$distance2022 < 0.5 & df_test$treatmentM4 == 0, 1, 0)
df_testM4 <- df_test %>% 
  filter(controlM4 == 1 | treatmentM4 == 1)
table(df_testM4$treatmentM4)
## 
##  0  1 
## 35 73
table(df_testM4$controlM4)
## 
##  0  1 
## 73 35
# Treatment, control assignment for M5
df_test$treatmentM5 <- ifelse(df_test$distanceM5 < df_test$distance2022, 1, 0)
df_test$controlM5 <- ifelse(df_test$distance2022 < 0.5 & df_test$treatmentM5 == 0, 1, 0)
df_testM5 <- df_test %>% 
  filter(controlM5 == 1 | treatmentM5 == 1)
table(df_testM5$treatmentM5)
## 
##  0  1 
## 35 66
table(df_testM5$controlM5)
## 
##  0  1 
## 66 35

Admittedly, the data for this test analysis are less balanced in terms of group assignment than in the analysis of the M2 extension.

X_M3 <- as.data.frame(cbind(df_testM3$Public, df_testM3$size, df_testM3$Oriented, df_testM3$psych, df_testM3$lat, df_testM3$long)) # covariates for control. 

X_M4 <- as.data.frame(cbind(df_testM4$Public, df_testM4$size, df_testM4$Oriented, df_testM4$psych, df_testM4$lat, df_testM4$long)) # covariates for control. 

X_M5 <- as.data.frame(cbind(df_testM5$Public, df_testM5$size, df_testM5$Oriented, df_testM5$psych, df_testM5$lat, df_testM5$long)) # covariates for control. 

c.predM3 <- predict(c.forest, X_M3, estimate.variance = TRUE)
c.predM4 <- predict(c.forest, X_M4, estimate.variance = TRUE)
c.predM5 <- predict(c.forest, X_M5, estimate.variance = TRUE)

We plot the predicted treatment effects for the three scenarios.

pred345 <- data.frame(value = c(as.numeric(c.predM3$predictions),
                                     as.numeric(c.predM4$predictions),
                                      as.numeric(c.predM5$predictions)))

pred345$label <- c(rep("M3", 100), rep("M4", 108), rep("M5", 101))


ggplot(data = pred345) + 
  geom_histogram(aes(x = value, fill = label), 
                 alpha = 0.5,
                 position = "identity") + 
  theme_minimal() + 
  labs(title = "") +
  xlab("Predictions and Real values") +
  scale_y_continuous(name = "Density") +
  theme(
    text = element_text(family = "Roboto Light"),
    panel.grid.major = element_blank(),
    legend.position = "left", 
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text=element_text(size = 12),
    axis.title=element_text(size = 12),
    axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
    axis.ticks = element_line(size = 0.5, color="grey")
  ) + scale_fill_manual(values = c("#0000FF","#008000","#FFFF00"))

Median predictions

median(c.predM3$predictions)
## [1] 0.4666325
median(c.predM4$predictions)
## [1] 0.5429314
median(c.predM5$predictions)
## [1] 0.2933546

Based on the median predictions of the causal forest model on the test data, we suggest starting the public transport expansion with the M4 line. This is an unexpected result given the average distances to schools. However, M4 would connect Bialoleka to the rapid metro system, which is particularly important given the area's current pace of growth. At the same time, although the planned M4 runs parallel to M1, it passes through Srodmiescie, which is densely surrounded by schools; under our model's methodology, this could translate into a major decrease in travel times.
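Whether the gap between the M4 predictions and the other scenarios is more than noise could be probed with a rank-based comparison of the predicted effects; a minimal sketch using the pred345 data frame built above:

# pairwise rank-sum comparison of the predicted treatment effects across scenarios
pairwise.wilcox.test(pred345$value, pred345$label, p.adjust.method = "holm")

Note that this ignores the dependence between scenario predictions for the same schools, so it should be read as indicative only.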

Limitations of our research

Despite the extent of our analysis (almost 1,300 lines of Markdown), we are aware of its drawbacks. Firstly, we lack data on students' actual travel routes. Students living close to their schools may have chosen them precisely for that reason and may not need public transport at all, while under our definition the extension of public transport may "treat" schools that remain far from the stations. Our analysis is also subject to selection bias, as we cannot assess the reasons behind school choice: some schools are historically stronger than others, which heavily influences the decisions of prospective students. Finally, we should have taken into account the influence of other modes of transport; social partners point out that the M3 extension is not profitable because its endpoint will be faster to reach by tram. From a technical perspective, our analysis suffers from small sample sizes, which could strongly affect the significance of the estimated parameters.

Conclusion

We have applied various statistical, econometric and machine learning techniques to estimate the impact of extending public transport on academic achievement in mathematics. We find that the proposed extensions could improve both accessibility and student achievement in Warsaw. Based on the descriptive patterns, M5 would increase access to schools the most; however, our model suggests that the M4 extension could yield the largest returns in terms of academic achievement.