As of 2023, the Warsaw City Council published a plan for the development of extending the public transport system (see below). According to the City council, 60% of the population living in Warsaw use public transport daily. It should not be surprising that students are a large group in this group. However, not all schools are well connected with public transport, which may become a real problem. Students who attend schools with poor connections to the public system are more likely to miss classes in the event of transportation issues. In addition, areas with no speed public transport, such as the metro, might struggle with more traffic. Marc L. Stein and Jeffrey A. Grigg (2019) from John Hopkins University showed that increased transportation times often lead to higher absenteeism among students in the 9th grade. Moreover, students from disadvantaged backgrounds are more likely to suffer from public transport issues.
Since there seems to be a connection between the public transport system and school outcomes, we turn to this problem in our project. Although student’s achievements are not the critical factor for extending the metro system, it is an important social issue that should be examined. As the M2 line was recently extended, we estimate the influence of these developments on students’ achievements. Based on the analysed models, we predict the impact of the three scenario extensions - M3, M4, and M5. It is even more important as social partners argue that planned M3 line does not well match the city’s needs.
In this document, we report our findings. We divide our analysis into six points.
Download the data from the avaliable online sources
Prepare the data for the analysis
Conduct simple Difference-in-Differences
Analyse the spatiotemporal clustering membership
Run regression and ML models
Predict the benefits of M3, M4 and M5 extension
We conclude with limitations of the analysis and possible further research.
# adding a buffer to the borders to mitigate the uneven line problem
# WAW_box2<-getbb("Warszawa, Polska")
# smallbuffer <- st_buffer(waw.pov.sf, dist = 0.0007)
# query<-opq(WAW_box2) %>% add_osm_feature(key='admin_level', value='9') # get Warsaw districts
# bound_waw.sf<-osmdata_sf(query) #get the warsaw discrict lines
#
# # borders of districts in Warsaw
# wawDistBorders <- st_intersection(bound_waw.sf$osm_lines %>%
# st_transform(crs = "+proj=longlat +datum=NAD83"),
# smallbuffer)Before starting the analysis, we have to install and load the required packages. We put this chunk in the comment to optimise code running, but You can copy and uncomment it in your R script.
# requiredPackages = c( "imola", "readr", "tidyverse", "stringr", "colourpicker",
# "sf", "sp", "spdep", "spatialreg", "RColorBrewer", "ggplot2",
# "ggmap", "tibble", "dplyr", "tidygeocoder", "rvest",
# "spatstat", "GISTools", "raster", "keras", "dbscan", "factoextra",
# "fpc", "osmdata", "OpenStreetMap", "cowplot", "viridis", "leaflet", "geodist",
# "GWmodel", "SpatialML", "fossil", "bayesbio") # Required Packages
#
# for(i in requiredPackages){if(!require(i,character.only = TRUE)) install.packages(i)}
# for(i in requiredPackages){if(!require(i,character.only = TRUE)) library(i,character.only = TRUE) } Firstly we read the raw data from the national register. We use tidygeocoder to find the geolocation of all schools in Warsaw. Because of the time needed and computational complexity we put this chunk in the comment.
# math <- read_csv("math_19_22_data.csv")
#
# math$country <- "Poland"
# df <- math %>%
# tidygeocoder::geocode(street = street, city = city, country = country, method = "osm" )
#
#
# df <- df %>%
# filter(!is.na(lat))
#
# save(df, file = "Math_Results_2019_Warsaw.Rdata")Instead of geocoding the data again, we directly load the data after geocoding. We prepare the data for the public transport analysis.
# load the data instead of working on them again
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Math_Results_2019_Warsaw.Rdata")
# Find data for metro stations
# Webscrapping from Wikipedia
metro_stations_m1 <- c("Kabaty", "Natolin", "Imielin", "Ursynów", "Stokłosy",
"Służew", "Wilanowska", "Wierzbno", "Racławicka", "Pole_Mokotowskie",
"Politechnika", "Centrum", "Świętokrzyska", "Ratusz_Arsenał",
"Dworzec_Gdański", "Plac_Wilsona", "Marymont",
"Słodowiec", "Stare_Bielany", "Wawrzyszew", "Młociny")
metro_stations_m2019 <- c("Rondo_Daszyńskiego", "Rondo_ONZ", "Nowy_Świat-Uniwersytet",
"Centrum_Nauki_Kopernik", "Stadion_Narodowy", "Dworzec_Wileński")
metro_stations_m2020 <- c("Księcia_Janusza",
"Młynów", "Płocka", "Szwedzka",
"Targówek_Mieszkaniowy", "Trocka")
metro_stations_m2022 <- c("Bemowo", "Ulrychów", "Zacisze",
"Kondratowicza", "Bródno")
metro <- as.data.frame( c(metro_stations_m2019,
metro_stations_m2020,
metro_stations_m2022,
metro_stations_m1), dnn = list("Station"), responseName = "freq")
metro <- metro %>%
rename("Station" = "c(metro_stations_m2019, metro_stations_m2020, metro_stations_m2022, metro_stations_m1)" )
head(metro)## Station
## 1 Rondo_Daszyńskiego
## 2 Rondo_ONZ
## 3 Nowy_Świat-Uniwersytet
## 4 Centrum_Nauki_Kopernik
## 5 Stadion_Narodowy
## 6 Dworzec_Wileński
As no available dataset includes data on location of metro stations in Warsaw, we download it directly from Wikipedia.
# Webscraping information on Station location in Warsaw
for (i in 1:length(metro$Station)){
url <- paste("https://pl.wikipedia.org/wiki/", metro[i, 1],"_(stacja_metra)", sep = "", collapse = NULL)
x <- tryCatch({
url %>%
read_html() %>%
html_node(xpath = '//*[@id="coordinates"]') %>%
html_text() %>%
as.character() %>%
str_split(pattern = "/")
}, error = function(e) {
# handle HTTP error 404
if (grepl("HTTP error 404", e$message)) {
message(paste("Error: ", e$message, " - Skipping URL ", url))
return(NULL)
}
else {
# re-throw any other errors
stop(e)
}
})
if (!is.null(x)) {
metro[i, 2] <- as.character(str_extract_all(gsub(",", ".", x[[1]][2]), "[+-]?([0-9]*[.])?[0-9]+")[[1]])[1]
metro[i, 3] <- as.character(str_extract_all(gsub(",", ".", x[[1]][2]), "[+-]?([0-9]*[.])?[0-9]+")[[1]])[2]
}
}
metro <- metro %>%
rename ("Latitude" = "V2",
"Longitude" = "V3")
# Map all metro points
metro_map <- data.frame(name = metro$Station,
lon= as.numeric(metro$Longitude),
lat = as.numeric(metro$Latitude),
label = "metro")
# School maps
school_map <- data.frame(name = df$name,
lon = as.numeric(df$long),
lat = as.numeric(df$lat),
label = "school")
# Map both metro
map_points <- rbind(metro_map, school_map)
head(map_points)## name lon lat label
## 1 Rondo_Daszyńskiego 20.98306 52.23000 metro
## 2 Rondo_ONZ 20.99833 52.23306 metro
## 3 Nowy_Świat-Uniwersytet 21.01639 52.23667 metro
## 4 Centrum_Nauki_Kopernik 21.03139 52.24000 metro
## 5 Stadion_Narodowy 21.04306 52.24639 metro
## 6 Dworzec_Wileński 21.03528 52.25417 metro
## [1] "21 Społeczne Liceum Ogólnokształcące im. Jerzego Grotowskiego"
## [2] "Zespół Szkół Samochodowych i Licealnych nr2"
## [3] "Niepubliczne Liceum Ogólnokształcące nr 81 SGH"
## [4] "Technikum Łączności im. prof dr.inż. Janusza Groszkowskiego w Zespole Szkół Nr 37 im. Agnieszki Osieckiej w Warszawie"
## [5] "Liceum Ogólnokształcące Niepubliczne dla Dorosłych nr 16"
## [6] "Liceum Ogólnokształcące dla Dorosłych \"ELITA\""
We use leaflet to visualize the location of schools and metro stations.
pal <- colorFactor(c("navy", "red"), domain = c("metro", "school"))
leaflet() %>% addTiles() %>%
addCircleMarkers(data = map_points,
lat = ~lat, lng = ~lon,
color = ~pal(label),
radius = 4,
stroke = FALSE,
fillOpacity = 0.5,
label = map_points$label)# Save map with M1 and M2 stations for further analysis
metro$Label <- ifelse(metro$Station %in% metro_stations_m1, "M1", "M2")
metrom1m2 <- metro %>%
rename( "latitude" = "Latitude" ,
"longitude" = "Longitude")
save(metrom1m2, file = "metro_m1m2.Rdata")We estimate the distance between schools and metro stations to analyse the impact of extending the metro system in Warsaw. We estimate three distances between schools and M1 stations and between different parts of the M2 line.
df$distance_2019 <- rep(NA, length(df$name))
df$distance_2020 <- rep(NA, length(df$name))
df$distance_2022 <- rep(NA, length(df$name))
for (i in 1:length(df$name)){
# working matrix
D2019 <- df[i, c("name", "long", "lat")]
# matrix for M1 stations
Dmetro1 <- data.frame(name = metro$Station[metro$Station %in% metro_stations_m1],
lat = metro$Latitude[metro$Station %in% metro_stations_m1],
long = metro$Longitude[metro$Station %in% metro_stations_m1])
# matrix for M1 stations in 2019
Dmetro2_2019 <- data.frame(name = metro$Station[metro$Station %in% metro_stations_m2019],
lat = metro$Latitude[metro$Station %in% metro_stations_m2019],
long = metro$Longitude[metro$Station %in% metro_stations_m2019])
# matrix for M1 stations in 2020
Dmetro2_2020 <- data.frame(name = metro$Station[metro$Station %in% metro_stations_m2020],
lat = metro$Latitude[metro$Station %in% metro_stations_m2020],
long = metro$Longitude[metro$Station %in% metro_stations_m2020])
# matrix for M1 stations in 2022
Dmetro2_2022 <- data.frame(name = metro$Station[metro$Station %in% metro_stations_m2022],
lat = metro$Latitude[metro$Station %in% metro_stations_m2022],
long = metro$Longitude[metro$Station %in% metro_stations_m2022])
dist2019 <- rbind(D2019, Dmetro1, Dmetro2_2019)
dist2020 <- rbind(D2019, Dmetro1, Dmetro2_2019, Dmetro2_2020)
dist2022 <- rbind(D2019, Dmetro1, Dmetro2_2019, Dmetro2_2020, Dmetro2_2022)
distance_matrix2019 <- geodist(dist2019, measure = 'geodesic' )/1000 #converting it to km
distance_matrix2020 <- geodist(dist2020, measure = 'geodesic' )/1000 #converting it to km
distance_matrix2022 <- geodist(dist2022, measure = 'geodesic' )/1000 #converting it to km
df[i, 21] <- min(distance_matrix2019[1,2:28])
df[i, 22] <- min(distance_matrix2020[1,2:34])
df[i, 23] <- min(distance_matrix2022[1,2:39])
}
save(df, file = "loc_math_1922.Rdata")
## Summarize the change in the distance
vis_distance <- data.frame(id = 1:length(df$distance_2019))
vis_distance <- cbind(vis_distance,
"distance2019" = df$distance_2019,
"distance2020" = df$distance_2020,
"distance2022" = df$distance_2022)
vis_distance_l <- reshape(vis_distance, direction = "long", idvar="id",
varying = 2:4, sep = "")
vis_distance_l$short <- ifelse(vis_distance_l$distance < 0.5, 1, 0)
vis_distance_l$long <- ifelse(vis_distance_l$distance > 3, 1, 0)
p <- ggplot(vis_distance_l, aes(x = as.factor(time), y = distance, fill = as.factor(time)))+
geom_bar( stat = "summary", fun.y = "mean") +
scale_fill_manual(values = c("#B6163A", "#999999", "#000000"))+
scale_color_manual(values = c("#B6163A","#999999", "#000000")) +
theme_minimal() +
labs(title = "") +
xlab("Year") +
ylab("Distance between school and metro station") +
theme(
text = element_text(family = "Roboto Light"),
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12),
axis.title=element_text(size = 12),
axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
axis.ticks = element_line(size = 0.5, color="grey")
)
pBased on the data, we observe that the distance between schools and metro stations has decreased from around 2 km to 1.75 km in 2022. These changes are relatively small. However, its impact is heterogeneous, as it influences only some schools.
We plot the relationship between average achievements in mathematics in 2019 and 2022. We observe that schools, in general, score similarly between the years. However, we find higher variation in the lower corner of the scatter plot. In line with the international literature, schools that do poorly are more likely to randomly increase or decrease their position. At the same time, schools in the top spots are robust enough to keep their position, regardless of the exam difficulty.
# Basic scatter plot
ggplot(df, aes(x= math2019, y = math2022)) +
geom_point() +
geom_smooth(method = lm, color="black")+
labs(title = "",
x = "Average score in 2019", y = "Average score in 2022")+
theme(
text = element_text(family = "Roboto Light"),
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12),
axis.title=element_text(size = 12),
axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
axis.ticks = element_line(size = 0.5, color="grey")
)We propose 3 different designs for the analysis of the impact of public transport extension.
| Design | Definition |
|---|---|
| Distance-based | Treatment: Schools for which the distance decreased Control: All schools already close to any metro station |
| Distance-based with pre-treatment exclusion | Treatment: Schools for which the distance decreased Control: Schools located close to the metro, for which the distance did not change |
| Distance-based with post-treatment exclusion | Treatment: Only those schools for which the distance decreased from more than 1 km to less than 0.7 km (walkable distance) Control: Schools already in walkable distance to metro in 2019 |
df$math_diff <- df$math2022 - df$math2019
# 1) Different treatment definitions
df$treat_1 <- ifelse(df$distance_2019 > df$distance_2022, 1, 0)
df$control_1 <- ifelse(df$distance_2019 < 0.5, 1, 0)
df_1 <- df %>%
dplyr:: filter(treat_1 == 1 | control_1 == 1)
# 2) Get rid of those for which the distance in 2019 was already low
df$treat_2 <- ifelse(df$distance_2019 > df$distance_2022, 1, 0)
df$control_2 <- ifelse(df$distance_2019 > 1.5, 1, 0)
df_2 <- df %>%
dplyr:: filter(treat_2 == 1 | control_2 == 1)
# 3) Treated: distance 2022 is lower than 0.7km, although it was higher in 2019
df$treat_3 <- ifelse(df$distance_2022 < 0.7 & df$distance_2019 > 1, 1, 0)
df$control_3 <- ifelse(df$distance_2019 == df$distance_2022 & df$distance_2019 < 0.7, 1, 0)
df_3 <- df %>%
dplyr:: filter(treat_3 == 1 | control_3 == 1)
#DID in regression form (with statistical significance)
reg1 <- lm(math_diff ~ treat_1, data = df_1, weights = df_1$n_students2019)
reg2 <- lm(math_diff ~ treat_2, data = df_2, weights = df_2$n_students2019)
reg3 <- lm(math_diff ~ treat_3, data = df_3, weights = df_3$n_students2019)
stargazer::stargazer(reg1, reg2, reg3, type =
"text")##
## ============================================================================
## Dependent variable:
## --------------------------------------------------------
## math_diff
## (1) (2) (3)
## ----------------------------------------------------------------------------
## treat_1 1.277
## (1.811)
##
## treat_2 0.541
## (1.639)
##
## treat_3 -0.675
## (4.975)
##
## Constant 1.028 1.765* 0.430
## (1.109) (0.907) (1.021)
##
## ----------------------------------------------------------------------------
## Observations 64 79 54
## R2 0.008 0.001 0.0004
## Adjusted R2 -0.008 -0.012 -0.019
## Residual Std. Error 60.695 (df = 62) 57.854 (df = 77) 64.773 (df = 52)
## F Statistic 0.497 (df = 1; 62) 0.109 (df = 1; 77) 0.018 (df = 1; 52)
## ============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We find no significant impact of the extension of the metro stations on the school achievements. Although the 1st and second scenarios show some positive effects, they are not significant in the frequentist sense. However, we point to the spatial character of the relationship between distance and achievements. In a further part, we explore whether spatial analysis is justified. In addition, our difference-in-differences is characterised by evident flaws:
Parallel Trends Assumption: DiD assumes that the treated and untreated groups would follow the same trend over time in the absence of the treatment. If historical trends in academic achievement are similar between treated and untreated groups before the metro extension, this strengthens the validity of the DiD approach. However, we did not have enough periods to analyse the assumption of parallel trends.
Unobserved Confounding Factors: DiD assumes that no unobserved confounding factors influence the treatment and control groups differently. In this case, that could have been extensions of other forms of public transport (i.e., buses and trams). At the same time, from the spatial perspective, some areas of Warsaw are more deprived in terms of socioeconomic outcomes. The pandemic might have hit the disadvantaged groups harder than the affluent students.
In the concluding part, we point to the main limitations of our analysis.
With a few observations, we conducted the spatial analysis on the grided map. We thankfully acknowledge the code provided by Maria Kubara in this part.
# Kubara, M. (2023). Spatiotemporal localisation patterns of technological startups:
# the case for recurrent neural networks in predicting urban startup clusters.
# The Annals of Regional Science, 1-33.
# Link for the Kubara's github repositorium: https://github.com/mariakubara/RNN_startup_clusters
pov.sf <- st_read("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/powiaty.shp")## Reading layer `powiaty' from data source
## `C:\Users\wojci\OneDrive\WSZ\CausalMLinR\Project\Data\powiaty.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 380 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 171677.6 ymin: 133223.7 xmax: 861895.7 ymax: 774923.7
## Projected CRS: ETRS89 / Poland CS92
pov.sf <- st_transform(pov.sf, crs = "+proj=longlat +datum=NAD83")
waw.pov.sf <- pov.sf %>% filter(jpt_nazwa_=='powiat Warszawa')
load('2nd Week/RNN_startup_clusters-main/data/gridPopWaw.RData')
#pop.df.waw - data from grid
#pop.grid.waw - spatial polygons object
#pop.waw Spatial Polygons Data Frame - all in one
pop.grid.waw.sf <- st_as_sf(pop.grid.waw, crs = "+proj=longlat +datum=NAD83")
#Grid ID to order and merge the data
pop.grid.waw.sf$ID_grid <- 1:601
# We use function provided by Dr Earl W. Duncan (2021) to get the centroids of the polygons
get.centroid.bb <- function(x){
N <- length(x) # Number of polygons
# Initialise data.frame
Centroids.bb <- data.frame(matrix(NA, N, 2, dimnames = list(NULL, c("long", "lat"))))
for(i in 1:N){
# Bounding box of polygon
bb <- bbox(x@polygons[[i]])
# Compute centroid
Centroids.bb[i,] <- c(
0.5 * (bb[1,1] + bb[1,2]),
0.5 * (bb[2,1] + bb[2,2]))
}
return(Centroids.bb)
}
pop.grid.waw$lat <- get.centroid.bb(pop.grid.waw)[,1]
pop.grid.waw$long <- get.centroid.bb(pop.grid.waw)[,2]
#################################################################
meansByYearT <- matrix(data=NA, nrow=1, ncol=601)
coordinates(df)<-c("long","lat") # class changed for SpatialPoints
pop.grid.waw$ID<-rownames(pop.df.waw) # identification of units in grid
pop.df.waw$ID<-rownames(pop.df.waw) # identification of units in data.frame
crs(df)<-crs(pop.grid.waw) # equalize the projection between objects
df$ID<-over(df, pop.grid.waw) # assign ID grid to companies
crds<-coordinates(pop.grid.waw) # centroids of grid cellsWe aggregate several variables on the grided level:
Math19 - average result in basic math exam in 2019
Math22 <- average result in basic math exam in 2022
Dist19 <- the average distance to the metro station in 2019
Dist20 <- the average distance to the metro station in 2020
Dist22 <- the average distance to the metro station in 2022
nstudent19 <- number of students attending schools in the grid in 2019
nstudent22 <- the number of students attending schools in the grid in 2022
public <- share of public schools in the grid
psych <- share of schools in the grid with access to psychological help
# point data aggregation by grid ID
df.agg19 <-aggregate(df$math2019, by=list(df$ID$ID), FUN = mean)
df.agg22 <-aggregate(df$math2022, by=list(df$ID$ID), FUN = mean)
df.agg_dist19 <-aggregate(df$distance_2019, by=list(df$ID$ID), FUN = mean)
df.agg_dist20 <-aggregate(df$distance_2020, by=list(df$ID$ID), FUN = mean)
df.agg_dist22 <-aggregate(df$distance_2022, by=list(df$ID$ID), FUN = mean)
df.agg_nstudent19 <-aggregate(df$n_students2019, by=list(df$ID$ID), FUN = sum)
df.agg_nstudent22 <-aggregate(df$n_students22, by=list(df$ID$ID), FUN = sum)
df.public <- aggregate(df$Public, by=list(df$ID$ID), FUN = mean)
df.psych <- aggregate(df$psych, by=list(df$ID$ID), FUN = mean)
# merge datasets
pop.df.waw.m <- merge(pop.df.waw, df.agg19, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[18] <- "math19"
pop.df.waw.m<-merge(pop.df.waw.m, df.agg22, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[19] <- "math22"
pop.df.waw.m<-merge(pop.df.waw.m, df.agg_dist19, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[20] <- "dist19"
pop.df.waw.m<-merge(pop.df.waw.m, df.agg_dist20, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[21] <- "dist20"
pop.df.waw.m<-merge(pop.df.waw.m, df.agg_dist22, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[22] <- "dist22"
pop.df.waw.m<-merge(pop.df.waw.m, df.agg_nstudent19, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[23] <- "nstudent19"
pop.df.waw.m<-merge(pop.df.waw.m, df.agg_nstudent22, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[24] <- "nstudent22"
pop.df.waw.m<-merge(pop.df.waw.m, df.public, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[25] <- "public"
pop.df.waw.m<-merge(pop.df.waw.m, df.psych, by.x="ID", by.y="Group.1", all.x=TRUE)
names(pop.df.waw.m)[26] <- "psych"
# clean the NA values
m1<-which(is.na(pop.df.waw.m$math19))
pop.df.waw.m$math19[m1]<-0
pop.df.waw.m$math22[m1]<-0
pop.df.waw.m$dist19[m1]<-0
pop.df.waw.m$dist20[m1]<-0
pop.df.waw.m$dist22[m1]<-0
pop.df.waw.m$nstudent19[m1]<-0
pop.df.waw.m$nstudent22[m1]<-0
#merging with results matrix
countsByYear_math19<- rbind(meansByYearT,
pop.df.waw.m$math19)
countsByYear_math22<- rbind(meansByYearT,
pop.df.waw.m$math22)
countsByYear_dist19<- rbind(meansByYearT,
pop.df.waw.m$dist19)
countsByYear_dist20<- rbind(meansByYearT,
pop.df.waw.m$dist20)
countsByYear_dist22<- rbind(meansByYearT,
pop.df.waw.m$dist22)
countsByYear_nstudent19 <- rbind(meansByYearT,
pop.df.waw.m$nstudent19)
countsByYear_nstudent22 <- rbind(meansByYearT,
pop.df.waw.m$nstudent22)
countsByYear_public <- rbind(meansByYearT,
pop.df.waw.m$public)
countsByYear_psych <- rbind(meansByYearT,
pop.df.waw.m$psych)
totalCountInGrid1 <- colSums(countsByYear_math19,na.rm=T)
totalCountInGrid2 <- colSums(countsByYear_math22,na.rm=T)
totalCountInGrid3 <- colSums(countsByYear_dist19,na.rm=T)
totalCountInGrid4 <- colSums(countsByYear_dist20,na.rm=T)
totalCountInGrid5 <- colSums(countsByYear_dist22,na.rm=T)
totalCountInGrid6 <- colSums(countsByYear_nstudent19,na.rm=T)
totalCountInGrid7 <- colSums(countsByYear_nstudent22,na.rm=T)
totalCountInGrid8 <- colSums(countsByYear_public,na.rm=T)
totalCountInGrid9 <- colSums(countsByYear_psych,na.rm=T)
#bind pop grid data with firm counts year by year
pop.df.waw.schools<- cbind(pop.df.waw,
totalCountInGrid1,
totalCountInGrid2,
totalCountInGrid3,
totalCountInGrid4,
totalCountInGrid5,
totalCountInGrid6,
totalCountInGrid7,
totalCountInGrid8,
totalCountInGrid9)
head(pop.df.waw.schools)## TOT TOT_0_14 TOT_15_64 TOT_65__ TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## 206436 4841 696 3586 559 2339 2502 365 1739
## 206464 137 19 98 20 61 76 10 46
## 206476 269 38 199 32 124 145 23 89
## 206492 0 0 0 0 0 0 0 0
## 206505 386 63 276 47 180 206 26 137
## 206602 86 11 62 13 41 45 7 28
## MALE_65__ FEM_0_14 FEM_15_64 FEM_65__ FEM_RATIO SHAPE_Leng SHAPE_Area
## 206436 235 331 1847 324 106.9688 3997.988 998991.0
## 206464 5 9 52 15 124.5902 3998.006 999000.1
## 206476 12 15 110 20 116.9355 3997.989 998991.6
## 206492 0 0 0 0 0.0000 3998.008 999000.9
## 206505 17 37 139 30 114.4444 3998.005 998999.3
## 206602 6 4 34 7 109.7561 3998.009 999001.7
## CODE ID totalCountInGrid1
## 206436 CRS3035RES1000mN3286000E5059000 206436 0
## 206464 CRS3035RES1000mN3298000E5059000 206464 0
## 206476 CRS3035RES1000mN3287000E5059000 206476 0
## 206492 CRS3035RES1000mN3299000E5059000 206492 0
## 206505 CRS3035RES1000mN3297000E5059000 206505 0
## 206602 CRS3035RES1000mN3300000E5059000 206602 0
## totalCountInGrid2 totalCountInGrid3 totalCountInGrid4 totalCountInGrid5
## 206436 0 0 0 0
## 206464 0 0 0 0
## 206476 0 0 0 0
## 206492 0 0 0 0
## 206505 0 0 0 0
## 206602 0 0 0 0
## totalCountInGrid6 totalCountInGrid7 totalCountInGrid8 totalCountInGrid9
## 206436 0 0 0 0
## 206464 0 0 0 0
## 206476 0 0 0 0
## 206492 0 0 0 0
## 206505 0 0 0 0
## 206602 0 0 0 0
names(pop.df.waw.schools)[18] <- "math19_avg"
names(pop.df.waw.schools)[19] <- "math22_avg"
names(pop.df.waw.schools)[20] <- "dist_19_avg"
names(pop.df.waw.schools)[21] <- "dist_20_avg"
names(pop.df.waw.schools)[22] <- "dist_22_avg"
names(pop.df.waw.schools)[23] <- "nstudents19"
names(pop.df.waw.schools)[24] <- "nstudents22"
names(pop.df.waw.schools)[25] <- "public"
names(pop.df.waw.schools)[26] <- "psych"
pop.df.waw.schools$long <- pop.grid.waw$long
pop.df.waw.schools$lat <- pop.grid.waw$lat
# Save file for the analysis
head(pop.df.waw.schools)## TOT TOT_0_14 TOT_15_64 TOT_65__ TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## 206436 4841 696 3586 559 2339 2502 365 1739
## 206464 137 19 98 20 61 76 10 46
## 206476 269 38 199 32 124 145 23 89
## 206492 0 0 0 0 0 0 0 0
## 206505 386 63 276 47 180 206 26 137
## 206602 86 11 62 13 41 45 7 28
## MALE_65__ FEM_0_14 FEM_15_64 FEM_65__ FEM_RATIO SHAPE_Leng SHAPE_Area
## 206436 235 331 1847 324 106.9688 3997.988 998991.0
## 206464 5 9 52 15 124.5902 3998.006 999000.1
## 206476 12 15 110 20 116.9355 3997.989 998991.6
## 206492 0 0 0 0 0.0000 3998.008 999000.9
## 206505 17 37 139 30 114.4444 3998.005 998999.3
## 206602 6 4 34 7 109.7561 3998.009 999001.7
## CODE ID math19_avg math22_avg dist_19_avg
## 206436 CRS3035RES1000mN3286000E5059000 206436 0 0 0
## 206464 CRS3035RES1000mN3298000E5059000 206464 0 0 0
## 206476 CRS3035RES1000mN3287000E5059000 206476 0 0 0
## 206492 CRS3035RES1000mN3299000E5059000 206492 0 0 0
## 206505 CRS3035RES1000mN3297000E5059000 206505 0 0 0
## 206602 CRS3035RES1000mN3300000E5059000 206602 0 0 0
## dist_20_avg dist_22_avg nstudents19 nstudents22 public psych long
## 206436 0 0 0 0 0 0 52.19197
## 206464 0 0 0 0 0 0 52.29843
## 206476 0 0 0 0 0 0 52.20085
## 206492 0 0 0 0 0 0 52.30730
## 206505 0 0 0 0 0 0 52.28956
## 206602 0 0 0 0 0 0 52.31617
## lat
## 206436 20.84561
## 206464 20.87183
## 206476 20.84779
## 206492 20.87402
## 206505 20.86964
## 206602 20.87621
save(pop.df.waw.schools, file = 'C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/data_schools_grided.RData')Based on the prepared data we conduct the Moran’s test to determine whether the spatial analysis is justified.
rm(list = ls())
load('C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/data_schools_grided.RData')
set.seed(424178)
setwd("C:/Users/wojci/OneDrive/WSZ/CausalMLinR")
schools.df <- pop.df.waw.schools
schools.df$no_school <- ifelse(schools.df$nstudents19 == 0, 1, 0)
schools.sp <- schools.df
coordinates(schools.sp) <- c("lat", "long")
proj4string(schools.sp)<-as(st_crs("+proj=longlat +ellps=WGS84"), "CRS")
summary(schools.sp) ## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## lat 20.84561 21.27214
## long 52.09673 52.37289
## Is projected: FALSE
## proj4string : [+proj=longlat +ellps=WGS84 +no_defs]
## Number of points: 601
## Data attributes:
## TOT TOT_0_14 TOT_15_64 TOT_65__
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 155 1st Qu.: 24.0 1st Qu.: 109 1st Qu.: 14.0
## Median : 697 Median : 117.0 Median : 499 Median : 77.0
## Mean : 2917 Mean : 385.7 Mean : 2030 Mean : 500.8
## 3rd Qu.: 3734 3rd Qu.: 551.0 3rd Qu.: 2614 3rd Qu.: 489.0
## Max. :21531 Max. :2672.0 Max. :15020 Max. :5297.0
## TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 76 1st Qu.: 78 1st Qu.: 12.0 1st Qu.: 55.0
## Median : 330 Median : 365 Median : 61.0 Median : 242.0
## Mean :1340 Mean : 1576 Mean : 197.4 Mean : 957.1
## 3rd Qu.:1760 3rd Qu.: 2044 3rd Qu.: 294.0 3rd Qu.:1240.0
## Max. :9531 Max. :12000 Max. :1370.0 Max. :7242.0
## MALE_65__ FEM_0_14 FEM_15_64 FEM_65__
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 5.0 1st Qu.: 12.0 1st Qu.: 56 1st Qu.: 8.0
## Median : 31.0 Median : 57.0 Median : 252 Median : 46.0
## Mean : 185.8 Mean : 188.3 Mean :1073 Mean : 314.9
## 3rd Qu.: 192.0 3rd Qu.: 266.0 3rd Qu.:1420 3rd Qu.: 287.0
## Max. :1684.0 Max. :1302.0 Max. :7778 Max. :3625.0
## FEM_RATIO SHAPE_Leng SHAPE_Area CODE
## Min. : 0.0 Min. :3998 Min. :998991 Length:601
## 1st Qu.:101.5 1st Qu.:3998 1st Qu.:999043 Class :character
## Median :111.2 Median :3998 Median :999078 Mode :character
## Mean :102.4 Mean :3998 Mean :999082
## 3rd Qu.:118.5 3rd Qu.:3998 3rd Qu.:999118
## Max. :200.0 Max. :3998 Max. :999193
## ID math19_avg math22_avg dist_19_avg
## Length:601 Min. : 0.000 Min. : 0.00 Min. : 0.0000
## Class :character 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Mode :character Median : 0.000 Median : 0.00 Median : 0.0000
## Mean : 9.076 Mean : 9.23 Mean : 0.3362
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.0000
## Max. :94.471 Max. :95.18 Max. :13.3007
## dist_20_avg dist_22_avg nstudents19 nstudents22
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.0000 Median : 0.0000 Median : 0.00 Median : 0.00
## Mean : 0.2998 Mean : 0.2904 Mean : 21.87 Mean : 25.86
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :13.3007 Max. :13.3007 Max. :545.00 Max. :755.00
## public psych no_school
## Min. :0.0000 Min. :-1.000000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 0.000000 1st Qu.:1.0000
## Median :0.0000 Median : 0.000000 Median :1.0000
## Mean :0.1164 Mean :-0.002496 Mean :0.8519
## 3rd Qu.:0.0000 3rd Qu.: 0.000000 3rd Qu.:1.0000
## Max. :1.0000 Max. : 1.000000 Max. :1.0000
## Simple feature collection with 6 features and 27 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 20.84561 ymin: 52.19197 xmax: 20.87621 ymax: 52.31617
## Geodetic CRS: +proj=longlat +ellps=WGS84 +no_defs
## TOT TOT_0_14 TOT_15_64 TOT_65__ TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## 206436 4841 696 3586 559 2339 2502 365 1739
## 206464 137 19 98 20 61 76 10 46
## 206476 269 38 199 32 124 145 23 89
## 206492 0 0 0 0 0 0 0 0
## 206505 386 63 276 47 180 206 26 137
## 206602 86 11 62 13 41 45 7 28
## MALE_65__ FEM_0_14 FEM_15_64 FEM_65__ FEM_RATIO SHAPE_Leng SHAPE_Area
## 206436 235 331 1847 324 106.9688 3997.988 998991.0
## 206464 5 9 52 15 124.5902 3998.006 999000.1
## 206476 12 15 110 20 116.9355 3997.989 998991.6
## 206492 0 0 0 0 0.0000 3998.008 999000.9
## 206505 17 37 139 30 114.4444 3998.005 998999.3
## 206602 6 4 34 7 109.7561 3998.009 999001.7
## CODE ID math19_avg math22_avg dist_19_avg
## 206436 CRS3035RES1000mN3286000E5059000 206436 0 0 0
## 206464 CRS3035RES1000mN3298000E5059000 206464 0 0 0
## 206476 CRS3035RES1000mN3287000E5059000 206476 0 0 0
## 206492 CRS3035RES1000mN3299000E5059000 206492 0 0 0
## 206505 CRS3035RES1000mN3297000E5059000 206505 0 0 0
## 206602 CRS3035RES1000mN3300000E5059000 206602 0 0 0
## dist_20_avg dist_22_avg nstudents19 nstudents22 public psych no_school
## 206436 0 0 0 0 0 0 1
## 206464 0 0 0 0 0 0 1
## 206476 0 0 0 0 0 0 1
## 206492 0 0 0 0 0 0 1
## 206505 0 0 0 0 0 0 1
## 206602 0 0 0 0 0 0 1
## geometry
## 206436 POINT (20.84561 52.19197)
## 206464 POINT (20.87183 52.29843)
## 206476 POINT (20.84779 52.20085)
## 206492 POINT (20.87402 52.3073)
## 206505 POINT (20.86964 52.28956)
## 206602 POINT (20.87621 52.31617)
# spatial weights matrix – from sf
crds.sf <- st_centroid(st_geometry(schools.sf)) # centroids
knn.schools <- knearneigh(crds.sf, k=5) # 5 nearest points
schools.knn.nb.sf <- knn2nb(knn.schools)
schools.knn.sym.listw <- nb2listw(make.sym.nb(schools.knn.nb.sf))
# Start with Moran test for spatial relationship
moran.test(schools.sf$math22_avg, schools.knn.sym.listw)##
## Moran I test under randomisation
##
## data: schools.sf$math22_avg
## weights: schools.knn.sym.listw
##
## Moran I statistic standard deviate = 15.93, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.371669929 -0.001666667 0.000549262
##
## Moran I test under randomisation
##
## data: schools.sf$math19_avg
## weights: schools.knn.sym.listw
##
## Moran I statistic standard deviate = 15.971, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.3727464078 -0.0016666667 0.0005496188
# Moran's I Statistics is equal to 15.93
# Corresponding p-value < 2.2e-16
# We reject the null hypothesis on the random spatial distribution of the dataNull hypothesis in Moran’s test assumes that points are randomly distributed over the space Based on the Moran’s test we reject the null hypothesis. Conclusion: Spatial analysis is justified.
Before running the models, we also check whether there are some differences in clustering between 2019 and 2022 results.
We start with simple visualization of the data.
We find differences between the 2019 and 2022. Based on results in 2019 and 2022, we use K-means to cluster schools into district based on their location and achievements in mathematics. We use rand index and jaccard similiarity to compare the clustering between the years.
# Select data for clustering
df_clustering19 <- df[,c(11, 19, 20)]
df_clustering22 <- df[,c(14, 19, 20)]
# we set the maximum number of clusters
k.max <- 20
df_clustering19_scaled <- scale(df_clustering19)
df_clustering22_scaled <- scale(df_clustering22)
wss19 <- sapply(1:k.max,
function(k){kmeans(df_clustering19_scaled, k, nstart=50,iter.max = 15 )$tot.withinss})
wss19## [1] 492.00000 349.94171 278.67844 230.15365 190.98734 166.47400 145.66602
## [8] 128.12113 116.89859 107.92195 99.16845 90.32335 84.39915 79.12764
## [15] 74.81848 69.43243 65.44149 62.28323 57.49721 56.76225
wss22 <- sapply(1:k.max,
function(k){kmeans(df_clustering22_scaled, k, nstart=50,iter.max = 15 )$tot.withinss})
wss22## [1] 492.00000 353.51456 276.71519 226.29575 190.27458 168.03346 149.38144
## [8] 132.01353 120.11473 108.52911 99.95816 92.23566 84.90176 81.27866
## [15] 75.58690 71.00520 66.55689 61.90323 60.36939 57.77518
plot(1:k.max, wss19,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")plot(1:k.max, wss22,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")We can’t see any reasonable point when using within sum-of-sqaures measure. However, when we use silhouette measure, we find that both in 2019 and 2022 the optimal number of clusters is 9! It needs to be noted, that such outcome is rarely found, as silhouette measure returns in majority of the analysed cases two clusters as an optimal solution.
clustering19 <- kmeans(df_clustering19_scaled, 9, nstart = 25)
fviz_cluster(clustering19, df_clustering19_scaled, frame = FALSE, geom = "point")df$cluster19 <- clustering19$cluster
clustering22 <- kmeans(df_clustering22_scaled, 9, nstart = 25)
fviz_cluster(clustering22, df_clustering22_scaled, frame = FALSE, geom = "point")df$cluster22 <- clustering22$cluster
df %>%
group_by(cluster19) %>%
summarise("AVG_Math19" = mean(math2019))## # A tibble: 9 × 2
## cluster19 AVG_Math19
## <int> <dbl>
## 1 1 69.1
## 2 2 37.6
## 3 3 78.7
## 4 4 63.4
## 5 5 43.6
## 6 6 57.8
## 7 7 70.8
## 8 8 86.1
## 9 9 49.3
## # A tibble: 9 × 2
## cluster22 AVG_Math22
## <int> <dbl>
## 1 1 63.8
## 2 2 75.0
## 3 3 44.6
## 4 4 48.0
## 5 5 81.4
## 6 6 47.0
## 7 7 70.7
## 8 8 84.0
## 9 9 43.1
We see large differences between clusters. In 2019, students from Cluster 8 scored more than twice as much as those from Cluster 2. We observed a similar pattern in 2022 when students from Cluster 8 scored two times better than their peers from Cluster 9. Although the number of clusters is the same in 2019 and 2022, we are unsure whether these clusters are the same. By applying the Rand index, we check whether schools remained in the same clusters over the years.
## [1] 0.884405
## [1] 1
Rand index is equal to 0.884405, which means that the clusters are quite similar. Jaccard Similarity does not take into account observations which were originally clustered into different clusters in the initial periods. We find a jaccard similarity equal to 1, meaning that temporal composition of clusters did not change. To prove it we map these points.
plot_cluster19 <- ggplot() +
geom_point(data = metrom1m2, aes(x = as.numeric(longitude), y = as.numeric(latitude), fill = factor(Label)), shape = 23, size = 2) +
geom_point(data = df, aes(x = long, y = lat, size = n_students2019, col = factor(cluster19)), shape = 20, alpha = 0.9) +
geom_sf(waw.pov.sf.line, mapping = aes(geometry = geometry)) +
scale_color_manual(values = c("#FF0000",
"#0000FF",
"#008000",
"#FFFF00",
"#800080",
"#FFA500",
"#40E0D0",
"#FFC0CB",
"#A52A2A"))+
theme_bw() +
theme(text = element_text(color = "grey20", size = 12, family = "Roboto"),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "none",
legend.key.width = unit(1.5, 'cm'))
plot_cluster22 <- ggplot() +
geom_point(data = metrom1m2, aes(x = as.numeric(longitude), y = as.numeric(latitude), fill = factor(Label)), shape = 23, size = 2) +
geom_point(data = df, aes(x = long, y = lat, size = n_students22, col = factor(cluster22)), shape = 20, alpha = 0.9) +
geom_sf(waw.pov.sf.line, mapping = aes(geometry = geometry)) +
scale_color_manual(values = c("#FF0000",
"#0000FF",
"#008000",
"#FFFF00",
"#800080",
"#FFA500",
"#40E0D0",
"#FFC0CB",
"#A52A2A"))+
theme_bw() +
theme(text = element_text(color = "grey20", size = 12, family = "Roboto"),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "none",
legend.key.width = unit(1.5, 'cm'))
ggpubr::ggarrange(plot_cluster19, plot_cluster22, widths = 20, heights = 16) Visually, we see that schools were clustered similar. However, if we look closer at the western part of Warsaw, closer to the new stations of M2 (blue diamons), there are some migrating patterns.
## # A tibble: 5 × 2
## cluster22 Freq
## <int> <int>
## 1 2 3
## 2 3 1
## 3 4 16
## 4 5 4
## 5 9 1
Schools in Wola and Bemowo were grouped into clusters 1 and 6 in 2019 and have moved to other clusters. Unfortunately, in terms of the migration from cluster 6 (as of 2019) to clusters 3 and 4 in 2022, we observe an average decrease in the results of this group. Hence, instead of improving students’ results by extending public transport, we observed the opposite effect. It needs to be noticed, however, that our model, due to the data limitations, does not consider socioeconomic factors. Poorer labour market outcomes characterize Wola. Consequently, the impact of COVID-19 might influence more deprived areas on a larger scale, leading to deteriorating student results regardless of extending access to public transport.
As we have already shown, the relationship between spatial dimension and average achievements seems to be present in the data. For that purpose, we propose quantitative models - Geographically Weighted Regression and Causal Forests. Firstly, we cleaned our environment to ensure the loaded data was correct.
rm(list = ls())
load('C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Project/Data/data_schools_grided.RData')
set.seed(424178)
setwd("C:/Users/wojci/OneDrive/WSZ/CausalMLinR")
# We need three datasets
# data frames
# data frame in SP format
# data frame in SF format
schools.df <- pop.df.waw.schools
schools.df$no_school <- ifelse(schools.df$nstudents19 == 0, 1, 0)
schools.sp <- schools.df
coordinates(schools.sp) <- c("lat", "long")
proj4string(schools.sp)<-as(st_crs("+proj=longlat +ellps=WGS84"), "CRS")
summary(schools.sp) ## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## lat 20.84561 21.27214
## long 52.09673 52.37289
## Is projected: FALSE
## proj4string : [+proj=longlat +ellps=WGS84 +no_defs]
## Number of points: 601
## Data attributes:
## TOT TOT_0_14 TOT_15_64 TOT_65__
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 155 1st Qu.: 24.0 1st Qu.: 109 1st Qu.: 14.0
## Median : 697 Median : 117.0 Median : 499 Median : 77.0
## Mean : 2917 Mean : 385.7 Mean : 2030 Mean : 500.8
## 3rd Qu.: 3734 3rd Qu.: 551.0 3rd Qu.: 2614 3rd Qu.: 489.0
## Max. :21531 Max. :2672.0 Max. :15020 Max. :5297.0
## TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 76 1st Qu.: 78 1st Qu.: 12.0 1st Qu.: 55.0
## Median : 330 Median : 365 Median : 61.0 Median : 242.0
## Mean :1340 Mean : 1576 Mean : 197.4 Mean : 957.1
## 3rd Qu.:1760 3rd Qu.: 2044 3rd Qu.: 294.0 3rd Qu.:1240.0
## Max. :9531 Max. :12000 Max. :1370.0 Max. :7242.0
## MALE_65__ FEM_0_14 FEM_15_64 FEM_65__
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 5.0 1st Qu.: 12.0 1st Qu.: 56 1st Qu.: 8.0
## Median : 31.0 Median : 57.0 Median : 252 Median : 46.0
## Mean : 185.8 Mean : 188.3 Mean :1073 Mean : 314.9
## 3rd Qu.: 192.0 3rd Qu.: 266.0 3rd Qu.:1420 3rd Qu.: 287.0
## Max. :1684.0 Max. :1302.0 Max. :7778 Max. :3625.0
## FEM_RATIO SHAPE_Leng SHAPE_Area CODE
## Min. : 0.0 Min. :3998 Min. :998991 Length:601
## 1st Qu.:101.5 1st Qu.:3998 1st Qu.:999043 Class :character
## Median :111.2 Median :3998 Median :999078 Mode :character
## Mean :102.4 Mean :3998 Mean :999082
## 3rd Qu.:118.5 3rd Qu.:3998 3rd Qu.:999118
## Max. :200.0 Max. :3998 Max. :999193
## ID math19_avg math22_avg dist_19_avg
## Length:601 Min. : 0.000 Min. : 0.00 Min. : 0.0000
## Class :character 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Mode :character Median : 0.000 Median : 0.00 Median : 0.0000
## Mean : 9.076 Mean : 9.23 Mean : 0.3362
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.0000
## Max. :94.471 Max. :95.18 Max. :13.3007
## dist_20_avg dist_22_avg nstudents19 nstudents22
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.0000 Median : 0.0000 Median : 0.00 Median : 0.00
## Mean : 0.2998 Mean : 0.2904 Mean : 21.87 Mean : 25.86
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :13.3007 Max. :13.3007 Max. :545.00 Max. :755.00
## public psych no_school
## Min. :0.0000 Min. :-1.000000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 0.000000 1st Qu.:1.0000
## Median :0.0000 Median : 0.000000 Median :1.0000
## Mean :0.1164 Mean :-0.002496 Mean :0.8519
## 3rd Qu.:0.0000 3rd Qu.: 0.000000 3rd Qu.:1.0000
## Max. :1.0000 Max. : 1.000000 Max. :1.0000
## Simple feature collection with 6 features and 27 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 20.84561 ymin: 52.19197 xmax: 20.87621 ymax: 52.31617
## Geodetic CRS: +proj=longlat +ellps=WGS84 +no_defs
## TOT TOT_0_14 TOT_15_64 TOT_65__ TOT_MALE TOT_FEM MALE_0_14 MALE_15_64
## 206436 4841 696 3586 559 2339 2502 365 1739
## 206464 137 19 98 20 61 76 10 46
## 206476 269 38 199 32 124 145 23 89
## 206492 0 0 0 0 0 0 0 0
## 206505 386 63 276 47 180 206 26 137
## 206602 86 11 62 13 41 45 7 28
## MALE_65__ FEM_0_14 FEM_15_64 FEM_65__ FEM_RATIO SHAPE_Leng SHAPE_Area
## 206436 235 331 1847 324 106.9688 3997.988 998991.0
## 206464 5 9 52 15 124.5902 3998.006 999000.1
## 206476 12 15 110 20 116.9355 3997.989 998991.6
## 206492 0 0 0 0 0.0000 3998.008 999000.9
## 206505 17 37 139 30 114.4444 3998.005 998999.3
## 206602 6 4 34 7 109.7561 3998.009 999001.7
## CODE ID math19_avg math22_avg dist_19_avg
## 206436 CRS3035RES1000mN3286000E5059000 206436 0 0 0
## 206464 CRS3035RES1000mN3298000E5059000 206464 0 0 0
## 206476 CRS3035RES1000mN3287000E5059000 206476 0 0 0
## 206492 CRS3035RES1000mN3299000E5059000 206492 0 0 0
## 206505 CRS3035RES1000mN3297000E5059000 206505 0 0 0
## 206602 CRS3035RES1000mN3300000E5059000 206602 0 0 0
## dist_20_avg dist_22_avg nstudents19 nstudents22 public psych no_school
## 206436 0 0 0 0 0 0 1
## 206464 0 0 0 0 0 0 1
## 206476 0 0 0 0 0 0 1
## 206492 0 0 0 0 0 0 1
## 206505 0 0 0 0 0 0 1
## 206602 0 0 0 0 0 0 1
## geometry
## 206436 POINT (20.84561 52.19197)
## 206464 POINT (20.87183 52.29843)
## 206476 POINT (20.84779 52.20085)
## 206492 POINT (20.87402 52.3073)
## 206505 POINT (20.86964 52.28956)
## 206602 POINT (20.87621 52.31617)
We start with simple OLS regression. We differentiate the models with weighthing based on the number of students. We do it because larger schools do not have to be randomly distributed; while planning public transport, extension should consider the cost-benefit analysis, answering how to help the largest number of students with the lowest cost.
# basic equation and OLS models
eq22 <- math22_avg ~ dist_22_avg + public + psych + no_school
eq19 <- math19_avg ~ dist_19_avg + public + psych + no_school
# OLS with Females/Males Ratio and Distance to Metro
mod_ols22 <- lm(eq22, data = schools.df)
mod_ols19 <- lm(eq19, data = schools.df)
# Weighted OLS (by the total population aged 0-14 - students proxy)
mod_ols_weights22 <- lm(eq22, data = schools.df,
weights = schools.df$nstudents22)
mod_ols_weights19 <- lm(eq19, data = schools.df,
weights = schools.df$nstudents19)
# Summarise the results
stargazer::stargazer(mod_ols22,
mod_ols_weights22,
mod_ols19,
mod_ols_weights19, type = "text")##
## ================================================================================================================
## Dependent variable:
## --------------------------------------------------------------------------------------------
## math22_avg math19_avg
## (1) (2) (3) (4)
## ----------------------------------------------------------------------------------------------------------------
## dist_22_avg 0.162 0.818
## (0.279) (0.702)
##
## dist_19_avg 0.180 1.190*
## (0.234) (0.657)
##
## public -1.239 2.422 -0.591 1.440
## (1.958) (4.916) (1.652) (4.306)
##
## psych 5.804** 12.945* 7.543*** 12.943*
## (2.287) (7.220) (1.940) (6.550)
##
## no_school -63.084*** -61.472***
## (1.883) (1.610)
##
## Constant 63.084*** 62.208*** 61.472*** 61.451***
## (1.859) (4.565) (1.590) (4.029)
##
## ----------------------------------------------------------------------------------------------------------------
## Observations 601 601 601 601
## R2 0.916 0.053 0.936 0.074
## Adjusted R2 0.915 0.019 0.936 0.042
## Residual Std. Error 6.740 (df = 596) 198.990 (df = 85) 5.710 (df = 596) 165.818 (df = 85)
## F Statistic 1,623.113*** (df = 4; 596) 1.577 (df = 3; 85) 2,187.997*** (df = 4; 596) 2.273* (df = 3; 85)
## ================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We additionally run Instrumental Variable Estimation. We instrument the average distance. Our instrument is the distance from 2019. The reasoning behind this is that the location of the metro might be endogenous to other important unobservable factors, such as the number of students living in the area. However, as Ichino and Winter-Ebmber (1999) noted, the application of instruments highly depends on the estimation goal. In our case, we want to see how the deviation from the average distance influenced the results. Hence, such an instrument might serve its purpose quite well.
iv_dist22 <- ivreg(math22_avg ~ dist_22_avg + public + psych + no_school | dist_19_avg + public + psych + no_school , data = schools.df)
summary(iv_dist22)##
## Call:
## ivreg(formula = math22_avg ~ dist_22_avg + public + psych + no_school |
## dist_19_avg + public + psych + no_school, data = schools.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.626e+01 -1.421e-13 -1.421e-13 -1.421e-13 3.268e+01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.47630 1.87364 33.879 <2e-16 ***
## dist_22_avg 0.02547 0.29015 0.088 0.9301
## public -1.39627 1.96059 -0.712 0.4766
## psych 5.81782 2.28781 2.543 0.0112 *
## no_school -63.47630 1.89718 -33.458 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.741 on 596 degrees of freedom
## Multiple R-Squared: 0.9159, Adjusted R-squared: 0.9153
## Wald test: 1622 on 4 and 596 DF, p-value: < 2.2e-16
Simple models find, in general, no significant impact of metro extension on school achievements. We find similar results, when using IV estimation. However, the parameter for the the distance seems to be overestimated in the simple regression. Let’s see if the spatial models can solve the problem.
# We run the model only on grids with some data
dMat <- gw.dist(cbind(schools.df$lat, schools.df$long)) # coordinates not from sp
bw19 <- GWmodel::bw.gwr(eq19, data = schools.sp, kernel='gaussian', adaptive = TRUE, dMat = dMat)## Adaptive bandwidth: 379 CV score: 22258.42
## Adaptive bandwidth: 242 CV score: 22203.21
## Adaptive bandwidth: 157 CV score: 22346.28
## Adaptive bandwidth: 294 CV score: 22234.95
## Adaptive bandwidth: 209 CV score: 22206.9
## Adaptive bandwidth: 261 CV score: 22217.53
## Adaptive bandwidth: 228 CV score: 22214.95
## Adaptive bandwidth: 248 CV score: 22205.33
## Adaptive bandwidth: 235 CV score: 22208.69
## Adaptive bandwidth: 243 CV score: 22203.37
## Adaptive bandwidth: 238 CV score: 22197.7
## Adaptive bandwidth: 239 CV score: 22200.09
## Adaptive bandwidth: 241 CV score: 22203.18
## Adaptive bandwidth: 239 CV score: 22200.09
## Adaptive bandwidth: 240 CV score: 22197.71
## Adaptive bandwidth: 239 CV score: 22200.09
## Adaptive bandwidth: 239 CV score: 22200.09
## Adaptive bandwidth: 238 CV score: 22197.7
## Adaptive bandwidth: 379 CV score: 31178.65
## Adaptive bandwidth: 242 CV score: 31154.39
## Adaptive bandwidth: 157 CV score: 31437.63
## Adaptive bandwidth: 294 CV score: 31171.91
## Adaptive bandwidth: 209 CV score: 31170.57
## Adaptive bandwidth: 261 CV score: 31161.1
## Adaptive bandwidth: 228 CV score: 31173.72
## Adaptive bandwidth: 248 CV score: 31152.27
## Adaptive bandwidth: 254 CV score: 31153.66
## Adaptive bandwidth: 246 CV score: 31152.2
## Adaptive bandwidth: 243 CV score: 31154.35
## Adaptive bandwidth: 246 CV score: 31152.2
modelGWR19<-GWmodel::gwr.basic(eq19, data = schools.sp, kernel='gaussian', adaptive=FALSE, bw=bw19, dMat=dMat)
modelGWR22<-GWmodel::gwr.basic(eq22, data = schools.sp, kernel='gaussian', adaptive=FALSE, bw=bw22, dMat=dMat)
modelGWR.df19<-as.data.frame(modelGWR19$SDF) # class data.frame
modelGWR.df22<-as.data.frame(modelGWR22$SDF) # class data.frame
schools.sf$coef_dist19 <- modelGWR.df19$dist_19_avg # variable “dist”
schools.sf$coef_dist22 <- modelGWR.df22$dist_22_avg # variable “dist”## Reading layer `powiaty' from data source
## `C:\Users\wojci\OneDrive\WSZ\CausalMLinR\Project\Data\powiaty.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 380 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 171677.6 ymin: 133223.7 xmax: 861895.7 ymax: 774923.7
## Projected CRS: ETRS89 / Poland CS92
pov.sf <- st_transform(pov.sf, crs = "+proj=longlat +datum=NAD83")
waw.pov.sf <- pov.sf %>% filter(jpt_nazwa_=='powiat Warszawa')
waw.pov.sf.line <- st_cast(waw.pov.sf,"MULTILINESTRING")
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/loc_math_1922.Rdata")We look at the estimated parameters in the map. We observe some spatial patterns over the Warsaw Area.
impact_gwr19 <- ggplot() +
geom_point(data = df, aes(x = long, y = lat, size = n_students2019), shape = 20, alpha = 0.9, col = "#000000") +
geom_sf(schools.sf, mapping = aes(color = coef_dist19), shape = 15, alpha = 0.4, size = 7.5) +
geom_sf(waw.pov.sf.line, mapping = aes(geometry = geometry), inherit.aes = FALSE) +
theme_bw() +
theme(text = element_text(color = "grey20", size = 12,family = "Roboto"),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "right",
legend.key.width = unit(1.5, 'cm')) +
scale_color_gradient(low = wes_palette("Zissou1", 5, type = "continuous")[1],
high = wes_palette("Zissou1", 5, type = "continuous")[4]) +
labs(color = "Dist. Coeff.", size = "School size")
impact_gwr22 <- ggplot() +
geom_point(data = df, aes(x = long, y = lat, size = n_students22), shape = 20, alpha = 0.9, col = "#000000") +
geom_sf(schools.sf, mapping = aes(color = coef_dist22), shape = 15, alpha = 0.4, size = 7.5) +
geom_sf(waw.pov.sf.line, mapping = aes(geometry = geometry), inherit.aes = FALSE) +
theme_bw() +
theme(text = element_text(color = "grey20", size = 12, family = "Roboto"),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "right",
legend.key.width = unit(1.5, 'cm')) +
scale_color_gradient(low = wes_palette("Zissou1", 5, type = "continuous")[1],
high = wes_palette("Zissou1", 5, type = "continuous")[4]) +
labs(color = "Dist. Coeff.", size = "School size")
# plot the graphs together
ggpubr::ggarrange(impact_gwr19, impact_gwr22,
common.legend = TRUE, widths = 20, heights = 16) However, as we can see on the graph, the spatial heterogeneity is negligible. In fact, the difference in coefficients between Ursus and Wesoła is visible only at the 7th decimal point!
We turn our attention to causal forests. As this case corresponds to the analysis with treatment and control groups, we assume that schools for which the distance decreased in 2022 were treated. In control group we will assess only those schools, which were already close to the metro stations.
# load the data once again
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/loc_math_1922.Rdata")
df$treatment <- ifelse(df$distance_2022 < df$distance_2019, 1, 0)
df$control <- ifelse(df$distance_2019 < 0.5, 1, 0)
df <- df %>%
filter(control == 1 | treatment == 1)
table(df$treatment)##
## 0 1
## 34 30
##
## 0 1
## 30 34
We end up with (almost) balanced assignment to control and treatment groups - 30 schools treated, 34 in control. We move to estimating the causal forest.
# Point to the analysed variables
Y <- df$math2022 - df$math2019 # outcome - difference in results
X <- as.data.frame(cbind(df$Public, df$size, df$Oriented, df$psych, df$lat, df$long)) # covariates for control. Note: Srodmiescie for base group
W <- df$treatment # treatment
# E[Y | X] with no test/train split
forest.Y <- regression_forest(X, Y)
Y.hat_train <- predict(forest.Y)$predictions
# Estimate propensity scores
forest.W <- regression_forest(X, W)
W.hat_train <- predict(forest.W)$predictions
#Train the causal forest (weighted by propensity score)
c.forest <- causal_forest(X, Y, W, Y.hat_train, W.hat_train, min.node.size = 3)
tau.hat <- predict(c.forest, X)$predictionsWe plot the predicted values and the outcome to assess the quality of the goodness of fit.
pred <- data.frame(value = append(as.numeric(tau.hat), as.numeric(Y)))
pred$label <- c(rep("Prediction", 64), rep("Real", 64))
ggplot(data = pred) +
geom_histogram(aes(x = value, fill = label),
alpha = 0.5,
position = "identity") +
theme_minimal() +
labs(title = "") +
xlab("Predictions and Real values") +
scale_y_continuous(name = "Density") +
theme(
text = element_text(family = "Roboto Light"),
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12),
axis.title=element_text(size = 12),
axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
axis.ticks = element_line(size = 0.5, color="grey")
) + scale_fill_manual(values = c("#B6163A", "#000000"))We find that our model is not good at predicting difference in scores, which outlay from the average scores.
## estimate std.err
## 0.8509355 1.9379617
## estimate std.err
## 1.735434 2.445492
We find positive treatment effects of the extension of the public transport. However, these estimates are not statistically significant. We further test the model on the simulated data for M3, M4 and M5 extension scenarios.
We start again with working with the data. We map schools’ distances to planned M3, M4 and M5.
# load the data instead of working on them again
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/Math_Results_2019_Warsaw.Rdata")
# Metro Coordinates
metro_cord345 <- read.csv("Project/Data/Metro_Pred345.csv")
metro_stations_m3 <- metro_cord345[metro_cord345$Label == "M3",]
metro_stations_m4 <- metro_cord345[metro_cord345$Label == "M4",]
metro_stations_m5 <- metro_cord345[metro_cord345$Label == "M5",]
# School maps
school_map <- data.frame(Station = df$name,
longitude = as.numeric(df$long),
latitude = as.numeric(df$lat),
Label = "school")
# Map both metro
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/metro_m1m2.Rdata")
map_points <- rbind(metro_cord345, school_map, metrom1m2)
table(map_points$Label)##
## M1 M2 M3 M4 M5 school
## 21 17 13 20 17 165
pal <- colorFactor(c( "green", "yellow", "purple", "brown", "red", "navy"),
domain = c("M3", "M4", "M5", "school", "M1", "M2"))
leaflet() %>% addTiles() %>%
addCircleMarkers(data = map_points,
lat = ~as.numeric(latitude), lng = ~as.numeric(longitude),
color = ~pal(map_points$Label),
radius = 4,
stroke = FALSE,
fillOpacity = 0.5,
label = map_points$Label) %>%
addLegend("bottomright", pal = pal, values = map_points$Label)##########################################################################################################
## Calculate distance to metro stations
df$distanceM3 <- rep(NA, length(df$name))
df$distanceM4 <- rep(NA, length(df$name))
df$distanceM5 <- rep(NA, length(df$name))
for (i in 1:length(df$name)){
# working matrix
D2019 <- df[i, c("name", "long", "lat")]
# matrix for M3 stations
DmetroM3 <- data.frame(name = map_points$Station[map_points$Label %in% c("M1", "M2", "M3")],
lat = as.numeric(map_points$latitude[map_points$Label %in% c("M1", "M2", "M3")]),
long = as.numeric(map_points$longitude[map_points$Label %in% c("M1", "M2", "M3")]))
# matrix for M4 stations
DmetroM4 <- data.frame(name = map_points$Station[map_points$Label %in% c("M1", "M2", "M4")],
lat = as.numeric(map_points$latitude[map_points$Label %in% c("M1", "M2", "M4")]),
long = as.numeric(map_points$longitude[map_points$Label %in% c("M1", "M2", "M4")]))
# matrix for M5 stations in 2019
DmetroM5 <- data.frame(name = map_points$Station[map_points$Label %in% c("M1", "M2", "M5")],
lat = as.numeric(map_points$latitude[map_points$Label %in% c("M1", "M2", "M5")]),
long = as.numeric(map_points$longitude[map_points$Label %in% c("M1", "M2", "M5")]))
distM3 <- rbind(D2019, DmetroM3)
distM4 <- rbind(D2019, DmetroM4)
distM5 <- rbind(D2019, DmetroM5)
distance_matrix_M3 <- geodist(distM3, measure = 'geodesic' )/1000 #converting it to km
distance_matrix_M4 <- geodist(distM4, measure = 'geodesic' )/1000 #converting it to km
distance_matrix_M5 <- geodist(distM5, measure = 'geodesic' )/1000 #converting it to km
df[i, 21] <- min(distance_matrix_M3[1,2:52])
df[i, 22] <- min(distance_matrix_M4[1,2:59])
df[i, 23] <- min(distance_matrix_M5[1,2:56])
}################################## VISUALISATION ##################################
## Summarize the change in the distance
vis_distance <- data.frame(id = 1:length(df$distanceM3))
vis_distance <- cbind(vis_distance, df$distanceM3, df$distanceM4, df$distanceM5)
vis_distance_l <- reshape(vis_distance, direction = "long", idvar="id",
varying = 2:4, sep = "")
vis_distance_l<- vis_distance_l %>%
rename("distance" = `df$distanceM` )
vis_distance_l$short <- ifelse(vis_distance_l$distance < 0.5, 1, 0)
vis_distance_l$long <- ifelse(vis_distance_l$distance > 3, 1, 0)
m345_proj_dist <- ggplot(vis_distance_l, aes(x = as.factor(time), y = distance, fill = as.factor(time)))+
geom_bar( stat = "summary", fun = mean) +
theme_minimal() +
labs(title = "") +
scale_x_discrete(name = "Metro expansion scenario",
labels = c(
"3" = "M3",
"4" = "M4",
"5" = "M5")) +
scale_y_continuous(name = "Average distance to station (in km)") +
theme(
text = element_text(family = "Roboto Light"),
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12),
axis.title=element_text(size = 12),
axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
axis.ticks = element_line(size = 0.5, color="grey")
) + scale_fill_manual(values = c("#B6163A", "#000000", "#E3A2AF"))
# Number of schools with large distance
m345_proj_n <- ggplot(vis_distance_l, aes(x = as.factor(time), y = long, fill = as.factor(time)))+
geom_bar(stat = "summary", fun = mean) +
theme_minimal() +
labs(title = "") +
scale_x_discrete(name = "Metro expansion scenario",
labels = c(
"3" = "M3",
"4" = "M4",
"5" = "M5")) +
scale_y_continuous(name = "Share of schools with no close access to metro") +
theme(
text = element_text(family = "Roboto Light"),
panel.grid.major = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12),
axis.title=element_text(size = 12),
axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
axis.ticks = element_line(size = 0.5, color="grey")
) + scale_fill_manual(values = c("#B6163A", "#000000", "#E3A2AF"))
ggpubr::ggarrange(m345_proj_dist, m345_proj_n)Descriptive analysis shows that M5 expansion scenario would have the highest impact in terms of improving access to schools via Metro. We test it based on the Causal Forest model estimated in the previous part.
df_test <- df
load("C:/Users/wojci/OneDrive/WSZ/CausalMLinR/loc_math_1922.Rdata")
df_test$distance2019 <- df$distance_2019
df_test$distance2022 <- df$distance_2022
# Treatment, control assignment for M3
df_test$treatmentM3 <- ifelse(df_test$distanceM3 < df_test$distance2022, 1, 0)
df_test$controlM3 <- ifelse(df_test$distance2022 < 0.5 & df_test$treatmentM3 == 0, 1, 0)
df_testM3 <- df_test %>%
filter(controlM3 == 1 | treatmentM3 == 1)
table(df_testM3$treatmentM3)##
## 0 1
## 35 65
##
## 0 1
## 65 35
# Treatment, control assignment for M4
df_test$treatmentM4 <- ifelse(df_test$distanceM4 < df_test$distance2022, 1, 0)
df_test$controlM4 <- ifelse(df_test$distance2022 < 0.5 & df_test$treatmentM4 == 0, 1, 0)
df_testM4 <- df_test %>%
filter(controlM4 == 1 | treatmentM4 == 1)
table(df_testM4$treatmentM4)##
## 0 1
## 35 73
##
## 0 1
## 73 35
# Treatment, control assignment for M5
df_test$treatmentM5 <- ifelse(df_test$distanceM5 < df_test$distance2022, 1, 0)
df_test$controlM5 <- ifelse(df_test$distance2022 < 0.5 & df_test$treatmentM5 == 0, 1, 0)
df_testM5 <- df_test %>%
filter(controlM5 == 1 | treatmentM5 == 1)
table(df_testM5$treatmentM5)##
## 0 1
## 35 66
##
## 0 1
## 66 35
For sure the data for the test analysis is less balanced in terms of group assignment when compared to the analysis for M2 extension.
X_M3 <- as.data.frame(cbind(df_testM3$Public, df_testM3$size, df_testM3$Oriented, df_testM3$psych, df_testM3$lat, df_testM3$long)) # covariates for control.
X_M4 <- as.data.frame(cbind(df_testM4$Public, df_testM4$size, df_testM4$Oriented, df_testM4$psych, df_testM4$lat, df_testM4$long)) # covariates for control.
X_M5 <- as.data.frame(cbind(df_testM5$Public, df_testM5$size, df_testM5$Oriented, df_testM5$psych, df_testM5$lat, df_testM5$long)) # covariates for control.
c.predM3 <- predict(c.forest, X_M3, estimate.variance = TRUE)
c.predM4 <- predict(c.forest, X_M4, estimate.variance = TRUE)
c.predM5 <- predict(c.forest, X_M5, estimate.variance = TRUE)Plot the predictions.
pred345 <- data.frame(value = c(as.numeric(c.predM3$predictions),
as.numeric(c.predM4$predictions),
as.numeric(c.predM5$predictions)))
pred345$label <- c(rep("M3", 100), rep("M4", 108), rep("M5", 101))
ggplot(data = pred345) +
geom_histogram(aes(x = value, fill = label),
alpha = 0.5,
position = "identity") +
theme_minimal() +
labs(title = "") +
xlab("Predictions and Real values") +
scale_y_continuous(name = "Density") +
theme(
text = element_text(family = "Roboto Light"),
panel.grid.major = element_blank(),
legend.position = "left",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text=element_text(size = 12),
axis.title=element_text(size = 12),
axis.line = element_line(size = 0.5, colour = "grey", linetype = 1),
axis.ticks = element_line(size = 0.5, color="grey")
) + scale_fill_manual(values = c("#0000FF","#008000","#FFFF00"))Median predictions
## [1] 0.4666325
## [1] 0.5429314
## [1] 0.2933546
Based on the median prediction of the CF model on the test data, we suggest starting the public transport expansion with the M4 line. It is an unexpected result when looking at the average distance to schools. However, M4 will connect Bialoleka to the speed metro system, which is extremely important when considering the area’s current pace of growth. At the same time, although planned M4 lies parallel to M1, it goes through Srodmiescie, which schools densely surround. Consequently, this could lead to a major decrease in travel time due to the model methodology.
Despite our extended analysis (almost 1300 lines in Markdown), we are aware of the drawbacks of the analysis. Firstly, we need access to data on student travel pathways. It might be possible that students living close to schools choose them and do not need public transport access. At the same time, the extension of public transport might treat schools far from the stations. Our analysis is also subject to selection bias, as we cannot assess the reasons behind choosing certain schools. Some schools are historically superior to others, which highly influences the decisions of prospective students. Finally, we should have takenconsider the influence of other transport sources. The analysis of social actors points out that the extension of M3 is not profitable because it will be faster to get to the endpoint by tram. From the technical perspective, our analysis suffers from small sample sizes, which could strongly influence the significance of the estimated parameters.
We have applied various statistical, econometric and machine learning techniques to derive the impact of extending public transport on academic achievements in mathematics. We find that the proposed extension of public transport could increase the accessibility and achievements of students in Warsaw. Based on the descriptive pattern, M5 would increase access to schools the most. However, our model showed that the M4 extension could yield the largest returns in improving academic achievements.