Introduction

This project was made as part of the ‘Causal Machine Learning for Spatial data’ elective course at the University of Warsaw, Poland - by second-year students of the Masters in Data Science and Business Analytics. Methods, materials and sources are a collection of in-class materials provided by professors: Dr. Kevin Credit (Maynooth University), Maria Kubara (University of Warsaw) and Dr. Katarzyna Kopczewska (University of Warsaw). Additional resources were also used and respectively cited along the publication. You can check all external sources in the bibliography at the end of this publication.

We will apply the Exploratory Data Analysis and Machine Learning techniques in spatial data, for which we will try to assess the distribution of educational resources by looking at the average rating of instructional quality for nearby schools and their relation to population density and demographics. Our analysis will involve a detailed analysis of spatial data, regression forests, causal forests, and evaluation of treatment effects using data from Chicago.

Data set description

We will use a data set prepared by Dr. Kevin Credit, which was part of our lecture. The data set describes several demographic and geographical variables of the city of Chicago involves data from different sources as Chicago Open Data Portal, Database of Road Transportation Emissions (DARTE), InfoUSA, US Census and OSMnx, for intersections on the OpenStreetMap driving street.

The table below describes the variables of the data set, which was provided as well in the materials of the lecture (Lec1, slide 43):

Variable Description Year Source
SEBB2010 Spatial Empirical Bayes estimate of new building permit density 2010 Chicago Open Data Portal
SEBB2017 Spatial Empirical Bayes estimate of new building permit density 2017 Chicago Open Data Portal
SEBC2010 Spatial Empirical Bayes estimate of Co2 density 2010 DARTE
SEBC2017 Spatial Empirical Bayes estimate of Co2 density 2017 DARTE
WAS2010 Walkable Accessibility Score 2010 InfoUSA according to Credit et al. (2023)
WAS2017 Walkable Accessibility Score 2017 InfoUSA according to Credit et al. (2023)
POPD10 Population density 2010 US Census
HUD10 Housing unit density 2010 US Census
MEDAGE10 Median age 2010 US Census
BLKP10 % Black non-hispanic population 2010 US Census
HSPP10 % Hispanic population 2010 US Census
ASNP10 % Asian non-hispanic population 2010 US Census
AVGHHS10 Average household size 2010 US Census
FAMCP10 % families with children under 18 2010 US Census
BACHP10 % population with bachelor’s degree or higher 2010 US Census
AUTOP10 % commuting by auto 2010 US Census
BWLKP10 % commuting by bicycle and walking 2010 US Census
UNEMP10 % labor force that is unemployed 2010 US Census
MBSAP10 % employed in management, business, science and arts occupations 2010 US Census
SRVP10 % employed in service occupations 2010 US Census
PTMMP10 % employed in production, transportation, and metiral moving occupations 2010 US Census
MHHIN10 Median household income 2010 US Census
PCIN10 Per capita income 2010 US Census
OWNP10 % owner-occupied housing units 2010 US Census
MYRMOV10 Median year householders moved into unit 2010 US Census
MRENT10 Median gross rent 2010 US Census
MVAL10 Median housing value 2010 US Census
MYRBLT10 Median year structure built 2010 US Census
AVGVEH10 Average number of vehicles available per household 2010 US Census
CRIMER Number of property and violent crimes per person 2010 Chicago Open Data Portal
ADT_mean Average daily traffic for nearby traffic monitors 2006 Chicago Open Data Portal
AvgWEEK_me Average ‘L’ weekday station entries for October 2012 Chicago Open Data Portal
boardings_ Average weekday bus stop boardings for October 2012 Chicago Open Data Portal
VACANTD Density of violations for vacant and abandoned buildings 2011 Chicago Open Data Portal
R_ZONEP % area zoned for residential use 2012 Chicago Open Data Portal
INSTR_mean Average rating of instructional quality for nearby schools 2011-2012 Chicago Open Data Portal
INTDEN Intersection density on the driving network 2022 OSMnx

Exploratory Analysis

For the initial exploration and analysis, the following libraries were pre-loaded: “spatialreg”, “visreg”, “RColorBrewer”, “finalfit”, “sf”, “tidyverse”, “tmap”, “spdep”, “randomForest”, “grf”, “xgboost”, “MLmetrics”, “ggplot2”, “conflicted”.

## Simple feature collection with 6 features and 233 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1017048 ymin: -273278.1 xmax: 1018232 ymax: -262023.3
## Projected CRS: US National Atlas Equal Area
##           GISJOIN STATEFP COUNTYFP TRACTCE BLKGRPCE       GEOID INTPTLAT
## 1 G17003103106003      17       31  310600        3 1.70313e+11 41.85426
## 2 G17003100609003      17       31   60900        3 1.70311e+11 41.95107
## 3 G17003108326003      17       31  832600        3 1.70318e+11 41.91597
## 4 G17003103105001      17       31  310500        1 1.70313e+11 41.85523
## 5 G17003100702002      17       31   70200        2 1.70311e+11 41.92684
## 6 G17003100712001      17       31   71200        1 1.70311e+11 41.92398
##    INTPTLON STUSPS SHAPE_AREA SHAPE_LEN OBJECTID     geo_num   WAS1997
## 1 -87.66000     IL  246655.38 2011.9200    60127 1.70313e+11 114.98125
## 2 -87.64610     IL   46068.08  887.0996    62801 1.70311e+11 111.38628
## 3 -87.65086     IL  195251.17 1776.3103    62062 1.70318e+11 134.09631
## 4 -87.65475     IL  163307.23 1836.0981    60126 1.70313e+11  70.51981
## 5 -87.64674     IL  111163.55 1511.1013    59942 1.70311e+11 136.96843
## 6 -87.64588     IL  122250.99 1612.8085    62056 1.70311e+11 139.45412
##     WAS1998   WAS1999   WAS2000   WAS2001   WAS2002   WAS2003   WAS2004
## 1 112.68706 116.07955 115.46190 116.01869 115.94659 114.19428 106.44323
## 2 107.28641 108.58006 107.08335 110.37375 109.90671 114.31385 111.88058
## 3 131.65831 130.94132 131.19474 134.03595 134.46763 134.52855 136.79366
## 4  68.52643  72.37316  72.39139  78.86017  78.22318  82.53956  78.98625
## 5 136.30583 136.54356 136.83535 135.60628 136.62464 136.15944 135.90513
## 6 137.41612 137.03879 137.69063 137.69739 138.94128 137.72948 137.59178
##     WAS2005   WAS2006   WAS2007   WAS2008   WAS2009   WAS2010   WAS2011
## 1 110.01056 104.54327  99.24179 101.11941 100.36977 100.58168 101.05011
## 2 112.34848 107.77906 111.21434 111.76010 113.68037 111.71166 113.50152
## 3 137.77072 136.87422 136.58207 137.09881 137.27139 136.78424 137.28044
## 4  81.23481  82.08587  76.39534  80.97864  82.14137  84.43072  81.52528
## 5 136.33797 135.95398 136.62708 136.29602 136.42009 136.21051 136.98160
## 6 137.36677 136.56867 137.65550 137.21661 136.82965 136.08127 138.28191
##     WAS2012   WAS2013   WAS2014   WAS2015   WAS2016   WAS2017   WAS2018
## 1 108.12578 114.73515 116.41560 116.59266 111.23720 108.54997 103.84254
## 2 114.93171 118.74014 120.82048 121.85577 115.63330 110.42166 105.77073
## 3 138.05600 138.89841 138.90867 137.84639 136.00810 135.66187 133.66457
## 4  88.77165  95.41195  98.70901  96.06995  91.19823  90.04694  88.35598
## 5 137.06088 136.87513 136.98457 137.18987 136.51432 136.00956 135.80418
## 6 138.38523 139.36465 139.47302 136.66167 136.42708 137.42534 135.88255
##     WAS2019 bg_area_m2 kgco2_1980 kgco2_1981 kgco2_1982 kgco2_1983 kgco2_1984
## 1 110.87772  244810.18 159964.422 166495.032  169713.57  176848.92  180911.65
## 2 108.32063   45681.61   9502.519   9890.463   10081.66   10505.52   10746.87
## 3 135.78517  193601.13  85160.939  88637.667   90351.13   94149.81   96312.70
## 4  95.26068  160554.40 152948.334 159192.509  162269.88  169092.28  172976.81
## 5 136.03996  110226.25 103032.924 107239.283  109312.34  113908.21  116525.01
## 6 136.40674  121219.57 109004.478 113454.628  115647.83  120510.07  123278.54
##   kgco2_1985 kgco2_1986 kgco2_1987 kgco2_1988 kgco2_1989 kgco2_1990 kgco2_1991
## 1  184341.00  186901.62  188881.76  188072.41  184128.58  175848.48  169910.46
## 2   10950.58   11102.69   11220.32   11172.24   10937.96   10446.09   10093.35
## 3   98138.40   99501.61  100555.78  100124.91   98025.31   93617.20   90455.95
## 4  176255.74  178704.06  180597.35  179823.50  176052.64  168135.71  162458.13
## 5  118733.85  120383.15  121658.55  121137.25  118597.03  113263.83  109439.15
## 6  125615.39  127360.28  128709.60  128158.09  125470.65  119828.34  115782.00
##   kgco2_1992 kgco2_1993 kgco2_1994 kgco2_1995 kgco2_1996 kgco2_1997 kgco2_1998
## 1  176166.47  190878.20  203941.44  213316.60  219667.92  227465.89   230870.2
## 2   10464.98   11338.92   12114.93   12671.85   13049.14   13512.37    13714.6
## 3   93786.49  101618.64  108573.17  113564.26  116945.55  121096.98   122909.3
## 4  168439.75  182506.22  194996.50  203960.46  210033.22  217489.17   220744.1
## 5  113468.65  122944.45  131358.48  137397.01  141487.89  146510.55   148703.3
## 6  120045.03  130070.04  138971.71  145360.22  149688.21  155001.97   157321.8
##   kgco2_1999 kgco2_2000 kgco2_2001 kgco2_2002 kgco2_2003 kgco2_2004 kgco2_2005
## 1  237911.32  242080.98  244461.14  251651.18  253766.59  260063.75  262496.85
## 2   14132.87   14380.57   14521.96   14949.07   15074.74   15448.81   15593.35
## 3  126657.86  128877.68  130144.81  133972.61  135098.80  138451.24  139746.56
## 4  227476.46  231463.23  233739.00  240613.69  242636.31  248657.28  250983.66
## 5  153238.44  155924.11  157457.17  162088.27  163450.81  167506.80  169073.96
## 6  162119.79  164961.12  166583.03  171482.54  172924.04  177215.12  178873.10
##   kgco2_2006 kgco2_2007 kgco2_2008 kgco2_2009 kgco2_2010 kgco2_2011 kgco2_2012
## 1  268801.53  279079.90  277820.82  274717.96  281115.46  286525.99  303448.79
## 2   15967.87   16578.45   16503.65   16319.33   16699.37   17020.78   18026.06
## 3  143103.01  148574.95  147904.65  146252.77  149658.63  152539.06  161548.32
## 4  257011.81  266839.37  265635.52  262668.75  268785.65  273958.87  290139.43
## 5  173134.80  179755.08  178944.11  176945.57  181066.19  184551.10  195451.06
## 6  183169.30  190173.28  189315.31  187200.93  191560.37  195247.27  206778.96
##   kgco2_2013 kgco2_2014 kgco2_2015 kgco2_2016 kgco2_2017 co2d_1980 co2d_1981
## 1  323886.24  320908.37  310721.99  284128.57  245374.76 0.6534223 0.6800985
## 2   19240.12   19063.23   18458.11   16878.36   14576.23 0.2080163 0.2165086
## 3  172428.69  170843.35  165420.39  151262.73  130631.20 0.4398783 0.4578365
## 4  309680.49  306833.23  297093.63  271666.60  234612.55 0.9526262 0.9915176
## 5  208614.80  206696.76  200135.72  183006.93  158045.64 0.9347404 0.9729015
## 6  220705.64  218676.44  211735.13  193613.59  167205.60 0.8992317 0.9359431
##   co2d_1982 co2d_1983 co2d_1984 co2d_1985 co2d_1986 co2d_1987 co2d_1988
## 1 0.6932456 0.7223920 0.7389875 0.7529956 0.7634553 0.7715437 0.7682377
## 2 0.2206940 0.2299728 0.2352559 0.2397154 0.2430452 0.2456201 0.2445677
## 3 0.4666870 0.4863082 0.4974801 0.5069103 0.5139516 0.5193967 0.5171711
## 4 1.0106847 1.0531774 1.0773720 1.0977945 1.1130436 1.1248358 1.1200160
## 5 0.9917088 1.0334037 1.0571440 1.0771831 1.0921459 1.1037167 1.0989874
## 6 0.9540360 0.9941470 1.0169854 1.0362633 1.0506577 1.0617890 1.0572393
##   co2d_1989 co2d_1990 co2d_1991 co2d_1992 co2d_1993 co2d_1994 co2d_1995
## 1 0.7521280 0.7183054 0.6940498 0.7196044 0.7796988 0.8330595 0.8713551
## 2 0.2394392 0.2286718 0.2209500 0.2290853 0.2482163 0.2652036 0.2773950
## 3 0.5063261 0.4835571 0.4672284 0.4844315 0.5248866 0.5608086 0.5865888
## 4 1.0965295 1.0472196 1.0118572 1.0491133 1.1367251 1.2145198 1.2703511
## 5 1.0759419 1.0275577 0.9928593 1.0294159 1.1153828 1.1917169 1.2464999
## 6 1.0350692 0.9885231 0.9551428 0.9903106 1.0730118 1.1464462 1.1991481
##   co2d_1996 co2d_1997 co2d_1998 co2d_1999 co2d_2000 co2d_2001 co2d_2002
## 1 0.8972990 0.9291521 0.9430579 0.9718196 0.9888518 0.9985742 1.0279441
## 2 0.2856542 0.2957946 0.3002215 0.3093777 0.3147999 0.3178951 0.3272449
## 3 0.6040540 0.6254973 0.6348586 0.6542207 0.6656866 0.6722317 0.6920032
## 4 1.3081748 1.3546135 1.3748869 1.4168186 1.4416499 1.4558243 1.4986427
## 5 1.2836134 1.3291803 1.3490730 1.3902174 1.4145825 1.4284907 1.4705053
## 6 1.2348518 1.2786877 1.2978247 1.3374061 1.3608456 1.3742255 1.4146440
##   co2d_2003 co2d_2004 co2d_2005 co2d_2006 co2d_2007 co2d_2008 co2d_2009
## 1 1.0365851 1.0623078 1.0722465 1.0979998 1.1399849 1.1348418 1.1221673
## 2 0.3299958 0.3381846 0.3413486 0.3495471 0.3629130 0.3612757 0.3572408
## 3 0.6978203 0.7151366 0.7218272 0.7391641 0.7674281 0.7639658 0.7554334
## 4 1.5112405 1.5487416 1.5632313 1.6007771 1.6619873 1.6544892 1.6360109
## 5 1.4828665 1.5196635 1.5338811 1.5707221 1.6307830 1.6234256 1.6052943
## 6 1.4265357 1.4619348 1.4756124 1.5110538 1.5688331 1.5617553 1.5443127
##   co2d_2010 co2d_2011 co2d_2012 co2d_2013 co2d_2014 co2d_2015 co2d_2016
## 1 1.1482997 1.1704007 1.2395269 1.3230097 1.3108457 1.2692364 1.1606077
## 2 0.3655600 0.3725958 0.3946021 0.4211788 0.4173064 0.4040601 0.3694782
## 3 0.7730256 0.7879038 0.8344389 0.8906389 0.8824502 0.8544392 0.7813112
## 4 1.6741095 1.7063305 1.8071098 1.9288197 1.9110858 1.8504234 1.6920533
## 5 1.6426776 1.6742936 1.7731808 1.8926055 1.8752046 1.8156812 1.6602845
## 6 1.5802759 1.6106909 1.7058215 1.8207096 1.8039697 1.7467075 1.5972139
##   co2d_2017 BP2006 BP2007 BP2008 BP2009 BP2010 BP2011 BP2012 BP2013 BP2014
## 1 1.0023062      1      0      0      1      0      0      1      0      2
## 2 0.3190831      0      0      0      0      0      0      0      0      0
## 3 0.6747440      3     11      4      1      7      1      1      9      4
## 4 1.4612651      1      1      2      1      0      0      0      0      0
## 5 1.4338294      4      1      2      0      0      3      1      2      0
## 6 1.3793614      0      0      1      1      0      0      0      1      2
##   BP2015 BP2016 BP2017 BP2018 BP2019 BP2020 BP2021 BP2022 BPV2006 BPV2007
## 1      1      4      3      2      1      0      3      1  300000       0
## 2      0      0      0      0      0      2      0      0       0       0
## 3      4      2      2      5      4      1      2      4  600000  750000
## 4      1      2      2      2      2      1      1      1    2000  193000
## 5      2      1      3      5      0      3      0      2  725000    8000
## 6      0      0      9      3      6      1      2      0       0       0
##   BPV2008 BPV2009 BPV2010 BPV2011 BPV2012 BPV2013 BPV2014 BPV2015 BPV2016
## 1       0 1200000   0e+00       0   2e+03       0  237000    8000  225000
## 2       0       0   0e+00       0   0e+00       0       0       0       0
## 3  255000  575000   4e+05  725000   6e+05  590000  587500  615000  900000
## 4  170000  400000   0e+00       0   0e+00       0       0  200000  146000
## 5  280000       0   0e+00 1400000   5e+04  460000       0  507500  450000
## 6   35000    8000   0e+00       0   0e+00    1500 7521500       0       0
##   BPV2017 BPV2018 BPV2019 BPV2020 BPV2021 BPV2022 TREAT CONTR_1 CONTR_2
## 1  450000    8500    1000       0  550000    3000     0       0       1
## 2       0       0       0 4062500       0       0     0       0       1
## 3  725000  745000  325000  751000  605000 2158150     0       0       1
## 4  126000  146000  147400  420000    2000    3000     0       0       1
## 5  350000  622350       0  800000       0  730000     0       0       1
## 6 8000000   92000  522500 1373500 1041375       0     0       0       1
##       SEBB2010     SEBB2017     SEBV2010     SEBV2017  SEBC2010  SEBC2017
## 1 1.113566e-06 8.976012e-06 1.730782e-05 1.838150e+00 1.1483197 1.0023252
## 2 2.911093e-06 5.502589e-06 7.083786e-06 3.803694e-06 0.3656292 0.3191523
## 3 2.951217e-05 1.621849e-05 2.066105e+00 3.744813e+00 0.7730287 0.6747474
## 4 1.072735e-06 8.520549e-06 2.200400e-05 7.847806e-01 1.6741340 1.4612887
## 5 3.177479e-06 2.931373e-05 2.721240e-06 3.175289e+00 1.6426546 1.4338065
## 6 2.095912e-06 5.324562e-05 2.061875e-06 6.599593e+01 1.5802548 1.3793403
##   AREALAND POP10    POPD10 HU10     HUD10    WHTP10     BLKP10     HSPP10
## 1   246657  1975  8007.071  752  3048.768 0.1772152 0.04303797 0.82278481
## 2    46069  1347 29238.751  935 20295.643 0.9420935 0.06458797 0.05790646
## 3   195248  1970 10089.732 1070  5480.210 0.9675127 0.01624365 0.03248731
## 4   163307  1700 10409.842  656  4016.974 0.1188235 0.02176471 0.88117647
## 5   111164  1097  9868.303  566  5091.576 0.9635369 0.04102097 0.03646308
## 6   122252   904  7394.562  481  3934.496 0.9446903 0.01106195 0.05530973
##        ASNP10    OWNP10    RNTP10 AVGHHS10 MEDAGE10   AUTOP10   TRNSP10
## 1 0.011139241 0.1413534 0.8586466 2.969925     30.0 0.4940476 0.2663690
## 2 0.057906459 0.4248292 0.5751708 1.534169     40.7 0.3279901 0.5869297
## 3 0.030456853 0.4307536 0.5692464 2.006110     29.6 0.3756378 0.4508929
## 4 0.004117647 0.2307692 0.7692308 3.041145     27.2 0.4983871 0.1725806
## 5 0.064721969 0.4416342 0.5583658 2.134241     28.7 0.2784993 0.5598846
## 6 0.038716814 0.4269663 0.5730337 2.031461     27.4 0.3427621 0.3544093
##      BWLKP10   FAMCP10    BACHP10 MHHIN10 PCIN10    UNEMP10   MBSAP10
## 1 0.23958333 0.6258741 0.09607843   38424  13940 0.16612022 0.1454784
## 2 0.03945746 0.0000000 0.84477893   65250  78693 0.00000000 0.7059524
## 3 0.08035714 0.5022831 0.93454545  115610  82550 0.03150093 0.6996173
## 4 0.27741935 0.5244957 0.11073825   30139  11977 0.17001339 0.1822581
## 5 0.04184704 0.3681319 0.75641026  130833  62202 0.04281768 0.7503608
## 6 0.14975042 0.4333333 0.81402936   79886  61276 0.04603175 0.6289517
##       SRVP10   SALEP10     NCMP10     PTMMP10 MYRBLT10 MYRMOV10   AVGVEH1
## 1 0.33551769 0.0904325 0.11271298 0.315858453     1939     2004 0.7368422
## 2 0.00000000 0.2523810 0.00000000 0.041666667     1941     2004 0.5751784
## 3 0.03125000 0.2340561 0.02869898 0.006377551     1962     2005 1.3513238
## 4 0.29193548 0.1435484 0.15967742 0.222580645     1939     2005 1.9323256
## 5 0.09090909 0.1212121 0.00000000 0.037518038     1962     2004 1.1945525
## 6 0.14808652 0.2229617 0.00000000 0.000000000     1955     2005 0.8667416
##   MRENT10 MOWNC10 MVAL10 ADT_mean VACANT CRIME     CRIMER VACANTD INSTR_mean
## 1     801    1450 287500 14307.36      0    84 0.04253164       0   45.72656
## 2     886    1393 261700 15346.89      0    41 0.03043801       0   43.65320
## 3    1653    2947 677100 16417.97      0   116 0.05888325       0   56.64396
## 4     693     879 266700 14102.34      0    80 0.04705882       0   50.96734
## 5    1633    2219 361100 16310.04      0    77 0.07019143       0   64.78398
## 6    1783    2456 582400 16380.96      0   157 0.17367257       0   65.23341
##   INT      INTDEN BIZ_ZONE  BIZ_ZONEP   RM_ZONE  RM_ZONEP    R_ZONE   R_ZONEP
## 1  18 0.000072975 22348.47 0.09060546  71577.00 0.2901884  71577.00 0.2901884
## 2   6 0.000130239       NA 0.00000000  32807.34 0.7121348  32807.34 0.7121348
## 3   7 0.000035851 19188.48 0.09827745 153400.20 0.7856685 153400.20 0.7856685
## 4  21 0.000128592 58897.68 0.36065616  65924.54 0.4036847  65924.54 0.4036847
## 5   4 0.000035982 19161.83 0.17237445  87986.95 0.7915058  87989.52 0.7915289
## 6   7 0.000057258 31517.27 0.25780579  60495.21 0.4948402  60495.21 0.4948402
##   AvgWEEK_me boardings_ MEDA_LAG   BLKP_LAG   HSPP_LAG   ASNP_LAG  BACH_LAG
## 1          0   47.97143 29.10000 0.02144151 0.83494417 0.01188178 0.2351624
## 2          0  471.30000 33.42500 0.05094803 0.07974083 0.08134141 0.7841093
## 3          0   41.08000 30.17143 0.02797249 0.04911648 0.04368033 0.8731743
## 4          0   51.13333 30.98000 0.04768728 0.78122150 0.01696486 0.2449111
## 5          0   92.40000 27.48571 0.02233651 0.05322567 0.06956968 0.9197526
## 6          0   72.37500 28.50000 0.03286525 0.05376456 0.06617362 0.8692822
##     UNEM_LAG  MBSA_LAG    SRV_LAG    PTMM_LAG  MHHI_LAG PCIN_LAG   OWN_LAG
## 1 0.17708961 0.2429275 0.28188390 0.195462429  33719.00 15064.75 0.2203922
## 2 0.04110074 0.5593857 0.11181872 0.037571302  61016.75 55231.75 0.4436658
## 3 0.04046894 0.6452313 0.06412460 0.011157560 101597.86 83209.00 0.4973330
## 4 0.19767955 0.2866674 0.24113421 0.209981081  34581.20 16386.80 0.2153571
## 5 0.05190341 0.6701621 0.09010968 0.007767554  74346.71 62555.57 0.3165681
## 6 0.05670957 0.6456898 0.06860391 0.012836838  89316.33 67326.17 0.4048973
##   YRMV_LAG YRBT_LAG  ADT_LAG  WEEK_LAG BOARD_LAG  BIZZ_LAG   RMZ_LAG    RZ_LAG
## 1 2004.250 1939.000 17237.26  347.3917 103.15771 0.3182851 0.3037695 0.3037695
## 2 2005.000 1958.500 14149.20    0.0000 178.56595 0.2245706 0.7473168 0.7473168
## 3 2004.429 1954.000 18058.98 1238.8601 183.23622 0.1666474 0.5153935 0.5153935
## 4 2003.800 1939.000 13101.41  277.9133  82.26962 0.2486523 0.3969165 0.3969165
## 5 2005.571 1957.143 16333.59    0.0000 239.78112 0.2393864 0.5813819 0.6199602
## 6 2005.000 1959.333 16323.20    0.0000 218.02262 0.1698625 0.6345898 0.6345937
##       BP10_LAG     BP17_LAG   BV10_LAG  BV17_LAG C210_LAG C217_LAG  WS10_LAG
## 1 6.982992e-07 6.121502e-06 0.16129904  0.234710 5.024619 6.162446  85.24906
## 2 2.300062e-06 1.800641e-05 0.05953222 61.802988 1.307971 1.141676 116.65812
## 3 7.548065e-06 1.582958e-05 7.96840070 26.005930 4.142690 3.142691 132.53921
## 4 7.089616e-07 7.991582e-06 0.12904293  2.652111 4.237748 5.120286  83.86582
## 5 6.878193e-06 2.480580e-05 5.82701497 10.972368 1.264354 1.103607 135.41468
## 6 4.254014e-06 2.480560e-05 6.19766071  4.277070 1.400891 1.222780 135.14118
##    WS17_LAG TREAT_LAG                       geometry
## 1  90.61526         0 MULTIPOLYGON (((1017922 -27...
## 2 115.28542         0 MULTIPOLYGON (((1017336 -26...
## 3 134.06204         0 MULTIPOLYGON (((1017621 -26...
## 4  89.58642         0 MULTIPOLYGON (((1018232 -27...
## 5 135.74890         0 MULTIPOLYGON (((1017755 -26...
## 6 135.07443         0 MULTIPOLYGON (((1017863 -26...

Histograms & Correlation matrix

Average rating of school quality for nearby schools from 2011- 2012 (inverse-distance weighting by school):

We will subset the independent variables of our data and analyze their correlation:

Regarding missing values: we will omit missing values as they will not provide useful insights to our model and will negatively affect the prediction results.

## $Continuous
##                 label var_type    n missing_n missing_percent     mean       sd
## POPD10         POPD10    <dbl> 2151         0             0.0   8178.0   9014.3
## HUD10           HUD10    <dbl> 2151         0             0.0   4083.0   6697.2
## MEDAGE10     MEDAGE10    <dbl> 2143         8             0.4     35.0      8.4
## BLKP10         BLKP10    <dbl> 2151         0             0.0      0.4      0.4
## HSPP10         HSPP10    <dbl> 2151         0             0.0      0.3      0.3
## ASNP10         ASNP10    <dbl> 2151         0             0.0      0.0      0.1
## AVGHHS10     AVGHHS10    <dbl> 2151         0             0.0      2.8      1.7
## FAMCP10       FAMCP10    <dbl> 2151         0             0.0      0.4      0.2
## BACHP10       BACHP10    <dbl> 2151         0             0.0      0.4      0.2
## AUTOP10       AUTOP10    <dbl> 2151         0             0.0      0.6      0.2
## BWLKP10       BWLKP10    <dbl> 2151         0             0.0      0.1      0.1
## UNEMP10       UNEMP10    <dbl> 2151         0             0.0      0.1      0.1
## MBSAP10       MBSAP10    <dbl> 2151         0             0.0      0.3      0.2
## SRVP10         SRVP10    <dbl> 2151         0             0.0      0.2      0.1
## PTMMP10       PTMMP10    <dbl> 2151         0             0.0      0.1      0.1
## MHHIN10       MHHIN10    <dbl> 2142         9             0.4  50115.1  26619.5
## PCIN10         PCIN10    <dbl> 2142         9             0.4  27971.3  21461.0
## OWNP10         OWNP10    <dbl> 2151         0             0.0      0.5      0.2
## MYRMOV10     MYRMOV10    <dbl> 2135        16             0.7   2001.5      5.1
## MRENT10       MRENT10    <dbl> 2015       136             6.3    979.5    291.1
## MVAL10         MVAL10    <dbl> 2093        58             2.7 262755.2 137702.8
## MYRBLT10     MYRBLT10    <dbl> 2143         8             0.4   1948.0     13.5
## AVGVEH1       AVGVEH1    <dbl> 1982       169             7.9      1.2      0.5
## CRIMER         CRIMER    <dbl> 2143         8             0.4      0.1      0.1
## ADT_mean     ADT_mean    <dbl> 2151         0             0.0  20117.8   8338.3
## AvgWEEK_me AvgWEEK_me    <dbl> 2151         0             0.0    212.2   1143.5
## boardings_ boardings_    <dbl> 2151         0             0.0     84.5    104.6
## VACANTD       VACANTD    <dbl> 2151         0             0.0      0.0      0.0
## RM_ZONEP     RM_ZONEP    <dbl> 2151         0             0.0      0.2      0.3
## INTDEN         INTDEN    <dbl> 2151         0             0.0      0.0      0.0
##                min quartile_25   median quartile_75       max
## POPD10         0.0      4099.0   6528.4      9770.3  231973.4
## HUD10          0.0      1657.7   2617.3      4220.3  166034.2
## MEDAGE10      11.9        29.2     33.6        39.7      81.2
## BLKP10         0.0         0.0      0.1         0.9       1.0
## HSPP10         0.0         0.0      0.1         0.4       1.0
## ASNP10         0.0         0.0      0.0         0.1       0.9
## AVGHHS10       0.0         2.2      2.7         3.2      58.0
## FAMCP10        0.0         0.3      0.5         0.6       1.0
## BACHP10        0.0         0.2      0.3         0.5       1.0
## AUTOP10        0.0         0.5      0.6         0.8       1.0
## BWLKP10        0.0         0.0      0.0         0.1       0.7
## UNEMP10        0.0         0.1      0.1         0.2       0.7
## MBSAP10        0.0         0.2      0.3         0.5       1.0
## SRVP10         0.0         0.1      0.2         0.3       1.0
## PTMMP10        0.0         0.0      0.1         0.2       0.7
## MHHIN10     2499.0     31212.5  44492.5     62718.5  191250.0
## PCIN10      4016.0     14545.0  21372.0     33391.5  209753.0
## OWNP10         0.0         0.3      0.4         0.6       1.0
## MYRMOV10    1969.0      2001.0   2003.0      2004.0    2011.0
## MRENT10      195.0       805.0    935.0      1089.5    2001.0
## MVAL10     28400.0    165900.0 233800.0    321700.0 1000001.0
## MYRBLT10    1939.0      1939.0   1939.0      1954.0    2004.0
## AVGVEH1        0.0         0.8      1.2         1.5       3.0
## CRIMER         0.0         0.0      0.1         0.1       2.0
## ADT_mean    3149.2     14483.3  18949.3     24007.5   67048.4
## AvgWEEK_me     0.0         0.0      0.0         0.0   15667.6
## boardings_     0.0        23.9     57.6       106.2    1343.3
## VACANTD        0.0         0.0      0.0         0.0       0.0
## RM_ZONEP       0.0         0.0      0.0         0.3       1.0
## INTDEN         0.0         0.0      0.0         0.0       0.0
## 
## $Categorical
## data frame with 0 columns and 2151 rows
library(corrplot)
## corrplot 0.92 loaded
CTable <- cor(na.omit(Cor))
corrplot(CTable, method = "circle", type = "full", tl.cex = 0.7)

According to the correlation matrix, HUD10 has very high correlation with POPD10, so we could decide to remove HUD10 from the independent variable:

Cor2 <- subset(Cor, select = -HUD10)

Now we will visualize our data in the map, in regards to schooling:

tmap_mode("view")
qualityrate <- tm_shape(C) + tm_fill("INSTR_mean", style= "quantile", n = 7, palette = "-YlGnBu", title = "school quality mean", alpha=.5) + 
  tm_borders(alpha=.3, col="white", lwd = 1)
qualityrate

Unsupervised Learning in Spatial data

Dimension reduction with PCA

We will apply Principal Components Analysis

PCA: excluding Geometric Properties for Initial Analysis: this is because PCA is sensitive to the scale of variables, and including variables with vastly different scales might result in principal components being dominated by variables with larger scales.

The code below standardizes the data, applies PCA and displays the results in the table below:

# Imputing missing values for mean
Cor2[] <- lapply(Cor2, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
# Standardize the data
Cor2_scaled <- scale(Cor2)
# Apply PCA
pca_result <- prcomp(Cor2_scaled, center = TRUE, scale. = TRUE)

# Summary of PCA results
summary(pca_result)
## Importance of components:
##                          PC1    PC2     PC3     PC4     PC5    PC6     PC7
## Standard deviation     2.549 1.9532 1.64806 1.21621 1.18698 1.0274 0.99644
## Proportion of Variance 0.224 0.1316 0.09366 0.05101 0.04858 0.0364 0.03424
## Cumulative Proportion  0.224 0.3556 0.44923 0.50024 0.54882 0.5852 0.61946
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.99231 0.95629 0.93386 0.90452 0.86354 0.83448 0.82655
## Proportion of Variance 0.03395 0.03153 0.03007 0.02821 0.02571 0.02401 0.02356
## Cumulative Proportion  0.65341 0.68495 0.71502 0.74323 0.76894 0.79296 0.81651
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.80933 0.76702 0.74399 0.71351 0.70092 0.66724 0.60225
## Proportion of Variance 0.02259 0.02029 0.01909 0.01755 0.01694 0.01535 0.01251
## Cumulative Proportion  0.83910 0.85939 0.87847 0.89603 0.91297 0.92832 0.94083
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.58663 0.56295 0.54781 0.46880 0.43596 0.37495 0.35054
## Proportion of Variance 0.01187 0.01093 0.01035 0.00758 0.00655 0.00485 0.00424
## Cumulative Proportion  0.95270 0.96362 0.97397 0.98155 0.98810 0.99295 0.99719
##                           PC29
## Standard deviation     0.28551
## Proportion of Variance 0.00281
## Cumulative Proportion  1.00000

Based on the results above, we now plot a Scree plot using the ggplot2 package, which will allow us to visualize the components and their percentages of explained variances:

get_eigenvalue(pca_result)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1  6.49659382       22.4020476                    22.40205
## Dim.2  3.81498134       13.1551081                    35.55716
## Dim.3  2.71609781        9.3658545                    44.92301
## Dim.4  1.47917098        5.1005896                    50.02360
## Dim.5  1.40892790        4.8583721                    54.88197
## Dim.6  1.05558503        3.6399484                    58.52192
## Dim.7  0.99288628        3.4237458                    61.94567
## Dim.8  0.98468018        3.3954489                    65.34111
## Dim.9  0.91449363        3.1534263                    68.49454
## Dim.10 0.87209933        3.0072391                    71.50178
## Dim.11 0.81816382        2.8212545                    74.32303
## Dim.12 0.74569327        2.5713561                    76.89439
## Dim.13 0.69634899        2.4012034                    79.29559
## Dim.14 0.68317735        2.3557840                    81.65138
## Dim.15 0.65501876        2.2586854                    83.91006
## Dim.16 0.58832443        2.0287049                    85.93877
## Dim.17 0.55351667        1.9086782                    87.84745
## Dim.18 0.50909276        1.7554923                    89.60294
## Dim.19 0.49129121        1.6941076                    91.29705
## Dim.20 0.44520298        1.5351827                    92.83223
## Dim.21 0.36269961        1.2506883                    94.08292
## Dim.22 0.34413203        1.1866622                    95.26958
## Dim.23 0.31691721        1.0928180                    96.36240
## Dim.24 0.30009074        1.0347957                    97.39719
## Dim.25 0.21977224        0.7578353                    98.15503
## Dim.26 0.19006169        0.6553851                    98.81041
## Dim.27 0.14058871        0.4847887                    99.29520
## Dim.28 0.12287658        0.4237123                    99.71892
## Dim.29 0.08151465        0.2810850                   100.00000

Based on the Scree plot we can determine that the principal components are the ones corresponding to the first 4 dimensions, which explain 51.6% of the variance of our data. For our purposes, we want to be able to explain a higher percentage of the variance, so we will choose 10 dimensions, which represent 71.5% of the variance of our data.

Now let’s take a look at the contribution of each variable for the 10 selected components. The dashed red line serves as the threshold for determining the level of variance to preserve in the data set. Variables exhibiting loadings surpassing this threshold are deemed significant contributors to explaining variance within the data. Consequently, such variables are recommended for inclusion in subsequent analyses and variables situated below the threshold may carry lesser significance and can be considered for potential exclusion from further analysis.

# Extract variable contributions
var_contrib <- get_pca_var(pca_result)

# Plot
contrib_plot <- fviz_contrib(pca_result, choice = "var", axes = 1:9, top = 30,
                              col.var = "blue", col.ind = "darkred",
                              title = "Variable Contribution Principal Components",
                              labels = list("Var", "Contrib", "Cos2", "Cos2*"),
                              repel = TRUE) +
                labs(x = "Principal Components", y = "Contribution (%)")
contrib_plot

The first 11 variables are the most important, contributing the most to the explanation of the variance:

  • BACHP10, INTDEN, BLK10, HSPP10, PCIN10, MHHIN10, OWNP10, MBSAP10, AVGHHS10, MEDAGE10, FAMCP10, AUTOP10

Clustering with K-means

Spatial Clustering: The clustering of similar colors indicates that there are areas with similar school quality levels. Such clustering could be the result of various factors, including similar neighborhood characteristics, zoning policies, or even school districts that affect resource distribution.

Based on the previous PCA results, we need to merge them into our main spatial data C:

pca_data <- st_drop_geometry(C)  # Drop geometry for combining with PCA result

# Exclude non-numeric columns, including GISJOIN
numeric_columns <- sapply(pca_data, is.numeric)
pca_data <- pca_data[, numeric_columns]

# Impute missing values with mean for each numeric column separately
for (col in colnames(pca_data)) {
  pca_data[, col] <- ifelse(is.na(pca_data[, col]), mean(pca_data[, col], na.rm = TRUE), pca_data[, col])
}
# Combine with PCA result
pca_data <- cbind(pca_data, as.data.frame(pca_result$x))

Firstly, we use Elbow method and Silhoutte Method to decide the numbers of K

library(tidyverse)
library(cluster)
library(factoextra)

Elbow Method

set.seed(123) # for reproducibility
fviz_nbclust(pca_data[, -c(1:3)], kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow Method")

according to the results of elbow method, we should choose 3 clusters. Now let us use Silhouette Method evaluates how similar an object is to its own cluster compared to other clusters.

Silhouette Method

fviz_nbclust(pca_data[, -c(1:3)], kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette Method")

according to the results of Silhoutte Method, we should choose 2 clusters.

We set 3 clusters and run k-means

k <- 3  # Specify the number of clusters
clusters <- kmeans(pca_data[, -c(1:3)], centers = k)

Before plotting the clusters, we should include them in the spatial data to later be able to plot them in the map:

C$cluster <- clusters$cluster

Now we are able to visualize the clusters in the map:

library(tmap)

tm_shape(C) +
  tm_borders() +
  tm_fill("cluster", palette = "Set3", title = "Cluster")

Cluster 1, marked with the lightest color, could signify regions with similar attributes that differentiate them from clusters 2 and 3. These might be areas with higher or lower average school quality ratings or other educational metrics.

Part II : Working on a detailed analysis involving spatial data, regression forests, causal forests, and evaluation of treatment effects using data from Chicago.

Subset by different control areas

C<- pca_data[, -c(1:3)]
C_CA <- as.data.frame(subset(C,(C$TREAT==1|C$CONTR_1==1))) #Neighbourhood-only control area
C_CHI <- as.data.frame(subset(C,(C$TREAT==1|C$CONTR_2==1))) #Entire city control area

Causal forest model

Causal forest variables formulation - SAR

Y <- C_CA$INSTR_mean
X <- as.data.frame(cbind(C_CA$BACHP10, C_CA$INTDEN, C_CA$BLK10, C_CA$HSPP10, C_CA$PCIN10, C_CA$MHHIN10, C_CA$OWNP10, C_CA$MBSAP10, C_CA$AVGHHS10, C_CA$MEDAGE10, C_CA$FAMCP10, C_CA$AUTOP10) )
  
W <- C_CA$TREAT

These are estimates of m(X) = E[Y | X] with no test/train split

forest.Y <- regression_forest(X, Y)
Y.hat_train <- predict(forest.Y)$predictions
forest.W <- regression_forest(X, W)
W.hat_train <- predict(forest.W)$predictions

Train the causal forest (weighted by propensity score)

c.forest <- causal_forest(X, Y, W, Y.hat_train, W.hat_train)
tau.hat <- predict(c.forest, X)$predictions

Find ATE

average_treatment_effect(c.forest, target.sample = "all") #ATE for all observations (treated and untreated)
##  estimate   std.err 
## -1.189603  2.513111

The negative estimate suggests that the treatment might have a negative effect on the outcome variable on average across all observations. However, the relatively large standard error indicates there is uncertainty around this estimate.

Create output shapefile with the predicted values of CATE

output <- C_CA %>%
  select(BACHP10, INTDEN, HSPP10, PCIN10, MHHIN10, OWNP10, MBSAP10, AVGHHS10, MEDAGE10, FAMCP10, AUTOP10,TREAT, INSTR_mean) %>%
  dplyr::mutate(CATE_CF = tau.hat)

What’s the nature of the heterogeneity? What variables are useful for targeting based on treatment effects?

imp <- c.forest %>% 
  variable_importance() %>% 
  as.data.frame() %>% 
  mutate(variable = colnames(c.forest$X.orig)) 
imp[order(imp$V1, decreasing = TRUE),]
##            V1 variable
## 2  0.15788153       V2
## 11 0.11874564      V11
## 10 0.10168641      V10
## 6  0.08897561       V6
## 9  0.07960976       V9
## 7  0.07492683       V7
## 3  0.07191638       V3
## 1  0.07024390       V1
## 4  0.04682927       V4
## 8  0.04381882       V8
## 5  0.02341463       V5

according to the result, V2, V11 have the biggest effect on the result of treatment. They are INTDEN(Intersection density on the driving network), AUTOP10(% commuting by auto).

Plot linear relationships between CATE and the 4 most important variables

p1 <- ggplot(output, aes(x = INTDEN, y = tau.hat, color=as.factor(TREAT))) +
  scale_color_brewer(palette="Paired") +
  geom_point(alpha = 0.4 ) +
  geom_smooth(method = "lm", fullrange=TRUE) +
  ylab("Treatment effect") +
  theme_light()



p2 <- ggplot(output, aes(x = AUTOP10, y = tau.hat, color=as.factor(TREAT))) +
  scale_color_brewer(palette="Paired") +
  geom_point(alpha = 0.4 ) +
  geom_smooth(method = "lm", fullrange=TRUE) +
  ylab("Treatment effect") +
  theme_light()



cowplot::plot_grid(p1, p2,ncol=1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

INTDEN: There appears to be a positive trend, suggesting that as INTDEN increases, the treatment effect becomes less negative. AUTOP10: there’s a slight negative trend, suggesting the treatment effect moreless negative as AUTOP10 increases.

The shaded areas around the regression lines represent confidence intervals, indicating the precision of the estimated effect. The plots suggest that the treatment effect varies with both INTDEN and AUTOP10, and this variation appears to be consistent across both treated and untreated groups. ### Compute a prioritization rule based on estimated treatment effects

priority.cate <- predict(c.forest, X)$predictions

Estimate AUTOC (in this case, no held out data)

cf.eval <- causal_forest(X, Y, W, Y.hat_train, W.hat_train) #Priority rule = units with biggest positive treatment effects (CATE) are highest-priority

We evaluate prioritization rules via rank-weighted average treatment effects (RATEs), which capture the extent to which individuals who are highly ranked by the prioritization rule are more responsive to treatment than the average treatment effect

rate <- rank_average_treatment_effect(cf.eval, priority.cate, debiasing.weights = W.hat_train)
plot(rate)

From the plot, we can see that the treatment effect fluctuates as we vary the proportion of the targeted population. The confidence intervals suggest substantial uncertainty around these estimates, especially as the proportion targeted approaches 1. This could indicate variability in treatment effects across different segments of the population or could reflect the model’s instability in certain areas of the data.

Bibliography

Kubara, M. Spatiotemporal localisation patterns of technological startups: the case for recurrent neural networks in predicting urban startup clusters. Ann Reg Sci (2023). https://doi.org/10.1007/s00168-023-01220-7US Census