This project was made as part of the ‘Causal Machine Learning for Spatial data’ elective course at the University of Warsaw, Poland - by second-year students of the Masters in Data Science and Business Analytics. Methods, materials and sources are a collection of in-class materials provided by professors: Dr. Kevin Credit (Maynooth University), Maria Kubara (University of Warsaw) and Dr. Katarzyna Kopczewska (University of Warsaw). Additional resources were also used and respectively cited along the publication. You can check all external sources in the bibliography at the end of this publication.
We will apply the Exploratory Data Analysis and Machine Learning techniques in spatial data, for which we will try to assess the distribution of educational resources by looking at the average rating of instructional quality for nearby schools and their relation to population density and demographics. Our analysis will involve a detailed analysis of spatial data, regression forests, causal forests, and evaluation of treatment effects using data from Chicago.
We will use a data set prepared by Dr. Kevin Credit, which was part of our lecture. The data set describes several demographic and geographical variables of the city of Chicago involves data from different sources as Chicago Open Data Portal, Database of Road Transportation Emissions (DARTE), InfoUSA, US Census and OSMnx, for intersections on the OpenStreetMap driving street.
The table below describes the variables of the data set, which was provided as well in the materials of the lecture (Lec1, slide 43):
| Variable | Description | Year | Source |
|---|---|---|---|
| SEBB2010 | Spatial Empirical Bayes estimate of new building permit density | 2010 | Chicago Open Data Portal |
| SEBB2017 | Spatial Empirical Bayes estimate of new building permit density | 2017 | Chicago Open Data Portal |
| SEBC2010 | Spatial Empirical Bayes estimate of Co2 density | 2010 | DARTE |
| SEBC2017 | Spatial Empirical Bayes estimate of Co2 density | 2017 | DARTE |
| WAS2010 | Walkable Accessibility Score | 2010 | InfoUSA according to Credit et al. (2023) |
| WAS2017 | Walkable Accessibility Score | 2017 | InfoUSA according to Credit et al. (2023) |
| POPD10 | Population density | 2010 | US Census |
| HUD10 | Housing unit density | 2010 | US Census |
| MEDAGE10 | Median age | 2010 | US Census |
| BLKP10 | % Black non-hispanic population | 2010 | US Census |
| HSPP10 | % Hispanic population | 2010 | US Census |
| ASNP10 | % Asian non-hispanic population | 2010 | US Census |
| AVGHHS10 | Average household size | 2010 | US Census |
| FAMCP10 | % families with children under 18 | 2010 | US Census |
| BACHP10 | % population with bachelor’s degree or higher | 2010 | US Census |
| AUTOP10 | % commuting by auto | 2010 | US Census |
| BWLKP10 | % commuting by bicycle and walking | 2010 | US Census |
| UNEMP10 | % labor force that is unemployed | 2010 | US Census |
| MBSAP10 | % employed in management, business, science and arts occupations | 2010 | US Census |
| SRVP10 | % employed in service occupations | 2010 | US Census |
| PTMMP10 | % employed in production, transportation, and metiral moving occupations | 2010 | US Census |
| MHHIN10 | Median household income | 2010 | US Census |
| PCIN10 | Per capita income | 2010 | US Census |
| OWNP10 | % owner-occupied housing units | 2010 | US Census |
| MYRMOV10 | Median year householders moved into unit | 2010 | US Census |
| MRENT10 | Median gross rent | 2010 | US Census |
| MVAL10 | Median housing value | 2010 | US Census |
| MYRBLT10 | Median year structure built | 2010 | US Census |
| AVGVEH10 | Average number of vehicles available per household | 2010 | US Census |
| CRIMER | Number of property and violent crimes per person | 2010 | Chicago Open Data Portal |
| ADT_mean | Average daily traffic for nearby traffic monitors | 2006 | Chicago Open Data Portal |
| AvgWEEK_me | Average ‘L’ weekday station entries for October | 2012 | Chicago Open Data Portal |
| boardings_ | Average weekday bus stop boardings for October | 2012 | Chicago Open Data Portal |
| VACANTD | Density of violations for vacant and abandoned buildings | 2011 | Chicago Open Data Portal |
| R_ZONEP | % area zoned for residential use | 2012 | Chicago Open Data Portal |
| INSTR_mean | Average rating of instructional quality for nearby schools | 2011-2012 | Chicago Open Data Portal |
| INTDEN | Intersection density on the driving network | 2022 | OSMnx |
For the initial exploration and analysis, the following libraries were pre-loaded: “spatialreg”, “visreg”, “RColorBrewer”, “finalfit”, “sf”, “tidyverse”, “tmap”, “spdep”, “randomForest”, “grf”, “xgboost”, “MLmetrics”, “ggplot2”, “conflicted”.
## Simple feature collection with 6 features and 233 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1017048 ymin: -273278.1 xmax: 1018232 ymax: -262023.3
## Projected CRS: US National Atlas Equal Area
## GISJOIN STATEFP COUNTYFP TRACTCE BLKGRPCE GEOID INTPTLAT
## 1 G17003103106003 17 31 310600 3 1.70313e+11 41.85426
## 2 G17003100609003 17 31 60900 3 1.70311e+11 41.95107
## 3 G17003108326003 17 31 832600 3 1.70318e+11 41.91597
## 4 G17003103105001 17 31 310500 1 1.70313e+11 41.85523
## 5 G17003100702002 17 31 70200 2 1.70311e+11 41.92684
## 6 G17003100712001 17 31 71200 1 1.70311e+11 41.92398
## INTPTLON STUSPS SHAPE_AREA SHAPE_LEN OBJECTID geo_num WAS1997
## 1 -87.66000 IL 246655.38 2011.9200 60127 1.70313e+11 114.98125
## 2 -87.64610 IL 46068.08 887.0996 62801 1.70311e+11 111.38628
## 3 -87.65086 IL 195251.17 1776.3103 62062 1.70318e+11 134.09631
## 4 -87.65475 IL 163307.23 1836.0981 60126 1.70313e+11 70.51981
## 5 -87.64674 IL 111163.55 1511.1013 59942 1.70311e+11 136.96843
## 6 -87.64588 IL 122250.99 1612.8085 62056 1.70311e+11 139.45412
## WAS1998 WAS1999 WAS2000 WAS2001 WAS2002 WAS2003 WAS2004
## 1 112.68706 116.07955 115.46190 116.01869 115.94659 114.19428 106.44323
## 2 107.28641 108.58006 107.08335 110.37375 109.90671 114.31385 111.88058
## 3 131.65831 130.94132 131.19474 134.03595 134.46763 134.52855 136.79366
## 4 68.52643 72.37316 72.39139 78.86017 78.22318 82.53956 78.98625
## 5 136.30583 136.54356 136.83535 135.60628 136.62464 136.15944 135.90513
## 6 137.41612 137.03879 137.69063 137.69739 138.94128 137.72948 137.59178
## WAS2005 WAS2006 WAS2007 WAS2008 WAS2009 WAS2010 WAS2011
## 1 110.01056 104.54327 99.24179 101.11941 100.36977 100.58168 101.05011
## 2 112.34848 107.77906 111.21434 111.76010 113.68037 111.71166 113.50152
## 3 137.77072 136.87422 136.58207 137.09881 137.27139 136.78424 137.28044
## 4 81.23481 82.08587 76.39534 80.97864 82.14137 84.43072 81.52528
## 5 136.33797 135.95398 136.62708 136.29602 136.42009 136.21051 136.98160
## 6 137.36677 136.56867 137.65550 137.21661 136.82965 136.08127 138.28191
## WAS2012 WAS2013 WAS2014 WAS2015 WAS2016 WAS2017 WAS2018
## 1 108.12578 114.73515 116.41560 116.59266 111.23720 108.54997 103.84254
## 2 114.93171 118.74014 120.82048 121.85577 115.63330 110.42166 105.77073
## 3 138.05600 138.89841 138.90867 137.84639 136.00810 135.66187 133.66457
## 4 88.77165 95.41195 98.70901 96.06995 91.19823 90.04694 88.35598
## 5 137.06088 136.87513 136.98457 137.18987 136.51432 136.00956 135.80418
## 6 138.38523 139.36465 139.47302 136.66167 136.42708 137.42534 135.88255
## WAS2019 bg_area_m2 kgco2_1980 kgco2_1981 kgco2_1982 kgco2_1983 kgco2_1984
## 1 110.87772 244810.18 159964.422 166495.032 169713.57 176848.92 180911.65
## 2 108.32063 45681.61 9502.519 9890.463 10081.66 10505.52 10746.87
## 3 135.78517 193601.13 85160.939 88637.667 90351.13 94149.81 96312.70
## 4 95.26068 160554.40 152948.334 159192.509 162269.88 169092.28 172976.81
## 5 136.03996 110226.25 103032.924 107239.283 109312.34 113908.21 116525.01
## 6 136.40674 121219.57 109004.478 113454.628 115647.83 120510.07 123278.54
## kgco2_1985 kgco2_1986 kgco2_1987 kgco2_1988 kgco2_1989 kgco2_1990 kgco2_1991
## 1 184341.00 186901.62 188881.76 188072.41 184128.58 175848.48 169910.46
## 2 10950.58 11102.69 11220.32 11172.24 10937.96 10446.09 10093.35
## 3 98138.40 99501.61 100555.78 100124.91 98025.31 93617.20 90455.95
## 4 176255.74 178704.06 180597.35 179823.50 176052.64 168135.71 162458.13
## 5 118733.85 120383.15 121658.55 121137.25 118597.03 113263.83 109439.15
## 6 125615.39 127360.28 128709.60 128158.09 125470.65 119828.34 115782.00
## kgco2_1992 kgco2_1993 kgco2_1994 kgco2_1995 kgco2_1996 kgco2_1997 kgco2_1998
## 1 176166.47 190878.20 203941.44 213316.60 219667.92 227465.89 230870.2
## 2 10464.98 11338.92 12114.93 12671.85 13049.14 13512.37 13714.6
## 3 93786.49 101618.64 108573.17 113564.26 116945.55 121096.98 122909.3
## 4 168439.75 182506.22 194996.50 203960.46 210033.22 217489.17 220744.1
## 5 113468.65 122944.45 131358.48 137397.01 141487.89 146510.55 148703.3
## 6 120045.03 130070.04 138971.71 145360.22 149688.21 155001.97 157321.8
## kgco2_1999 kgco2_2000 kgco2_2001 kgco2_2002 kgco2_2003 kgco2_2004 kgco2_2005
## 1 237911.32 242080.98 244461.14 251651.18 253766.59 260063.75 262496.85
## 2 14132.87 14380.57 14521.96 14949.07 15074.74 15448.81 15593.35
## 3 126657.86 128877.68 130144.81 133972.61 135098.80 138451.24 139746.56
## 4 227476.46 231463.23 233739.00 240613.69 242636.31 248657.28 250983.66
## 5 153238.44 155924.11 157457.17 162088.27 163450.81 167506.80 169073.96
## 6 162119.79 164961.12 166583.03 171482.54 172924.04 177215.12 178873.10
## kgco2_2006 kgco2_2007 kgco2_2008 kgco2_2009 kgco2_2010 kgco2_2011 kgco2_2012
## 1 268801.53 279079.90 277820.82 274717.96 281115.46 286525.99 303448.79
## 2 15967.87 16578.45 16503.65 16319.33 16699.37 17020.78 18026.06
## 3 143103.01 148574.95 147904.65 146252.77 149658.63 152539.06 161548.32
## 4 257011.81 266839.37 265635.52 262668.75 268785.65 273958.87 290139.43
## 5 173134.80 179755.08 178944.11 176945.57 181066.19 184551.10 195451.06
## 6 183169.30 190173.28 189315.31 187200.93 191560.37 195247.27 206778.96
## kgco2_2013 kgco2_2014 kgco2_2015 kgco2_2016 kgco2_2017 co2d_1980 co2d_1981
## 1 323886.24 320908.37 310721.99 284128.57 245374.76 0.6534223 0.6800985
## 2 19240.12 19063.23 18458.11 16878.36 14576.23 0.2080163 0.2165086
## 3 172428.69 170843.35 165420.39 151262.73 130631.20 0.4398783 0.4578365
## 4 309680.49 306833.23 297093.63 271666.60 234612.55 0.9526262 0.9915176
## 5 208614.80 206696.76 200135.72 183006.93 158045.64 0.9347404 0.9729015
## 6 220705.64 218676.44 211735.13 193613.59 167205.60 0.8992317 0.9359431
## co2d_1982 co2d_1983 co2d_1984 co2d_1985 co2d_1986 co2d_1987 co2d_1988
## 1 0.6932456 0.7223920 0.7389875 0.7529956 0.7634553 0.7715437 0.7682377
## 2 0.2206940 0.2299728 0.2352559 0.2397154 0.2430452 0.2456201 0.2445677
## 3 0.4666870 0.4863082 0.4974801 0.5069103 0.5139516 0.5193967 0.5171711
## 4 1.0106847 1.0531774 1.0773720 1.0977945 1.1130436 1.1248358 1.1200160
## 5 0.9917088 1.0334037 1.0571440 1.0771831 1.0921459 1.1037167 1.0989874
## 6 0.9540360 0.9941470 1.0169854 1.0362633 1.0506577 1.0617890 1.0572393
## co2d_1989 co2d_1990 co2d_1991 co2d_1992 co2d_1993 co2d_1994 co2d_1995
## 1 0.7521280 0.7183054 0.6940498 0.7196044 0.7796988 0.8330595 0.8713551
## 2 0.2394392 0.2286718 0.2209500 0.2290853 0.2482163 0.2652036 0.2773950
## 3 0.5063261 0.4835571 0.4672284 0.4844315 0.5248866 0.5608086 0.5865888
## 4 1.0965295 1.0472196 1.0118572 1.0491133 1.1367251 1.2145198 1.2703511
## 5 1.0759419 1.0275577 0.9928593 1.0294159 1.1153828 1.1917169 1.2464999
## 6 1.0350692 0.9885231 0.9551428 0.9903106 1.0730118 1.1464462 1.1991481
## co2d_1996 co2d_1997 co2d_1998 co2d_1999 co2d_2000 co2d_2001 co2d_2002
## 1 0.8972990 0.9291521 0.9430579 0.9718196 0.9888518 0.9985742 1.0279441
## 2 0.2856542 0.2957946 0.3002215 0.3093777 0.3147999 0.3178951 0.3272449
## 3 0.6040540 0.6254973 0.6348586 0.6542207 0.6656866 0.6722317 0.6920032
## 4 1.3081748 1.3546135 1.3748869 1.4168186 1.4416499 1.4558243 1.4986427
## 5 1.2836134 1.3291803 1.3490730 1.3902174 1.4145825 1.4284907 1.4705053
## 6 1.2348518 1.2786877 1.2978247 1.3374061 1.3608456 1.3742255 1.4146440
## co2d_2003 co2d_2004 co2d_2005 co2d_2006 co2d_2007 co2d_2008 co2d_2009
## 1 1.0365851 1.0623078 1.0722465 1.0979998 1.1399849 1.1348418 1.1221673
## 2 0.3299958 0.3381846 0.3413486 0.3495471 0.3629130 0.3612757 0.3572408
## 3 0.6978203 0.7151366 0.7218272 0.7391641 0.7674281 0.7639658 0.7554334
## 4 1.5112405 1.5487416 1.5632313 1.6007771 1.6619873 1.6544892 1.6360109
## 5 1.4828665 1.5196635 1.5338811 1.5707221 1.6307830 1.6234256 1.6052943
## 6 1.4265357 1.4619348 1.4756124 1.5110538 1.5688331 1.5617553 1.5443127
## co2d_2010 co2d_2011 co2d_2012 co2d_2013 co2d_2014 co2d_2015 co2d_2016
## 1 1.1482997 1.1704007 1.2395269 1.3230097 1.3108457 1.2692364 1.1606077
## 2 0.3655600 0.3725958 0.3946021 0.4211788 0.4173064 0.4040601 0.3694782
## 3 0.7730256 0.7879038 0.8344389 0.8906389 0.8824502 0.8544392 0.7813112
## 4 1.6741095 1.7063305 1.8071098 1.9288197 1.9110858 1.8504234 1.6920533
## 5 1.6426776 1.6742936 1.7731808 1.8926055 1.8752046 1.8156812 1.6602845
## 6 1.5802759 1.6106909 1.7058215 1.8207096 1.8039697 1.7467075 1.5972139
## co2d_2017 BP2006 BP2007 BP2008 BP2009 BP2010 BP2011 BP2012 BP2013 BP2014
## 1 1.0023062 1 0 0 1 0 0 1 0 2
## 2 0.3190831 0 0 0 0 0 0 0 0 0
## 3 0.6747440 3 11 4 1 7 1 1 9 4
## 4 1.4612651 1 1 2 1 0 0 0 0 0
## 5 1.4338294 4 1 2 0 0 3 1 2 0
## 6 1.3793614 0 0 1 1 0 0 0 1 2
## BP2015 BP2016 BP2017 BP2018 BP2019 BP2020 BP2021 BP2022 BPV2006 BPV2007
## 1 1 4 3 2 1 0 3 1 300000 0
## 2 0 0 0 0 0 2 0 0 0 0
## 3 4 2 2 5 4 1 2 4 600000 750000
## 4 1 2 2 2 2 1 1 1 2000 193000
## 5 2 1 3 5 0 3 0 2 725000 8000
## 6 0 0 9 3 6 1 2 0 0 0
## BPV2008 BPV2009 BPV2010 BPV2011 BPV2012 BPV2013 BPV2014 BPV2015 BPV2016
## 1 0 1200000 0e+00 0 2e+03 0 237000 8000 225000
## 2 0 0 0e+00 0 0e+00 0 0 0 0
## 3 255000 575000 4e+05 725000 6e+05 590000 587500 615000 900000
## 4 170000 400000 0e+00 0 0e+00 0 0 200000 146000
## 5 280000 0 0e+00 1400000 5e+04 460000 0 507500 450000
## 6 35000 8000 0e+00 0 0e+00 1500 7521500 0 0
## BPV2017 BPV2018 BPV2019 BPV2020 BPV2021 BPV2022 TREAT CONTR_1 CONTR_2
## 1 450000 8500 1000 0 550000 3000 0 0 1
## 2 0 0 0 4062500 0 0 0 0 1
## 3 725000 745000 325000 751000 605000 2158150 0 0 1
## 4 126000 146000 147400 420000 2000 3000 0 0 1
## 5 350000 622350 0 800000 0 730000 0 0 1
## 6 8000000 92000 522500 1373500 1041375 0 0 0 1
## SEBB2010 SEBB2017 SEBV2010 SEBV2017 SEBC2010 SEBC2017
## 1 1.113566e-06 8.976012e-06 1.730782e-05 1.838150e+00 1.1483197 1.0023252
## 2 2.911093e-06 5.502589e-06 7.083786e-06 3.803694e-06 0.3656292 0.3191523
## 3 2.951217e-05 1.621849e-05 2.066105e+00 3.744813e+00 0.7730287 0.6747474
## 4 1.072735e-06 8.520549e-06 2.200400e-05 7.847806e-01 1.6741340 1.4612887
## 5 3.177479e-06 2.931373e-05 2.721240e-06 3.175289e+00 1.6426546 1.4338065
## 6 2.095912e-06 5.324562e-05 2.061875e-06 6.599593e+01 1.5802548 1.3793403
## AREALAND POP10 POPD10 HU10 HUD10 WHTP10 BLKP10 HSPP10
## 1 246657 1975 8007.071 752 3048.768 0.1772152 0.04303797 0.82278481
## 2 46069 1347 29238.751 935 20295.643 0.9420935 0.06458797 0.05790646
## 3 195248 1970 10089.732 1070 5480.210 0.9675127 0.01624365 0.03248731
## 4 163307 1700 10409.842 656 4016.974 0.1188235 0.02176471 0.88117647
## 5 111164 1097 9868.303 566 5091.576 0.9635369 0.04102097 0.03646308
## 6 122252 904 7394.562 481 3934.496 0.9446903 0.01106195 0.05530973
## ASNP10 OWNP10 RNTP10 AVGHHS10 MEDAGE10 AUTOP10 TRNSP10
## 1 0.011139241 0.1413534 0.8586466 2.969925 30.0 0.4940476 0.2663690
## 2 0.057906459 0.4248292 0.5751708 1.534169 40.7 0.3279901 0.5869297
## 3 0.030456853 0.4307536 0.5692464 2.006110 29.6 0.3756378 0.4508929
## 4 0.004117647 0.2307692 0.7692308 3.041145 27.2 0.4983871 0.1725806
## 5 0.064721969 0.4416342 0.5583658 2.134241 28.7 0.2784993 0.5598846
## 6 0.038716814 0.4269663 0.5730337 2.031461 27.4 0.3427621 0.3544093
## BWLKP10 FAMCP10 BACHP10 MHHIN10 PCIN10 UNEMP10 MBSAP10
## 1 0.23958333 0.6258741 0.09607843 38424 13940 0.16612022 0.1454784
## 2 0.03945746 0.0000000 0.84477893 65250 78693 0.00000000 0.7059524
## 3 0.08035714 0.5022831 0.93454545 115610 82550 0.03150093 0.6996173
## 4 0.27741935 0.5244957 0.11073825 30139 11977 0.17001339 0.1822581
## 5 0.04184704 0.3681319 0.75641026 130833 62202 0.04281768 0.7503608
## 6 0.14975042 0.4333333 0.81402936 79886 61276 0.04603175 0.6289517
## SRVP10 SALEP10 NCMP10 PTMMP10 MYRBLT10 MYRMOV10 AVGVEH1
## 1 0.33551769 0.0904325 0.11271298 0.315858453 1939 2004 0.7368422
## 2 0.00000000 0.2523810 0.00000000 0.041666667 1941 2004 0.5751784
## 3 0.03125000 0.2340561 0.02869898 0.006377551 1962 2005 1.3513238
## 4 0.29193548 0.1435484 0.15967742 0.222580645 1939 2005 1.9323256
## 5 0.09090909 0.1212121 0.00000000 0.037518038 1962 2004 1.1945525
## 6 0.14808652 0.2229617 0.00000000 0.000000000 1955 2005 0.8667416
## MRENT10 MOWNC10 MVAL10 ADT_mean VACANT CRIME CRIMER VACANTD INSTR_mean
## 1 801 1450 287500 14307.36 0 84 0.04253164 0 45.72656
## 2 886 1393 261700 15346.89 0 41 0.03043801 0 43.65320
## 3 1653 2947 677100 16417.97 0 116 0.05888325 0 56.64396
## 4 693 879 266700 14102.34 0 80 0.04705882 0 50.96734
## 5 1633 2219 361100 16310.04 0 77 0.07019143 0 64.78398
## 6 1783 2456 582400 16380.96 0 157 0.17367257 0 65.23341
## INT INTDEN BIZ_ZONE BIZ_ZONEP RM_ZONE RM_ZONEP R_ZONE R_ZONEP
## 1 18 0.000072975 22348.47 0.09060546 71577.00 0.2901884 71577.00 0.2901884
## 2 6 0.000130239 NA 0.00000000 32807.34 0.7121348 32807.34 0.7121348
## 3 7 0.000035851 19188.48 0.09827745 153400.20 0.7856685 153400.20 0.7856685
## 4 21 0.000128592 58897.68 0.36065616 65924.54 0.4036847 65924.54 0.4036847
## 5 4 0.000035982 19161.83 0.17237445 87986.95 0.7915058 87989.52 0.7915289
## 6 7 0.000057258 31517.27 0.25780579 60495.21 0.4948402 60495.21 0.4948402
## AvgWEEK_me boardings_ MEDA_LAG BLKP_LAG HSPP_LAG ASNP_LAG BACH_LAG
## 1 0 47.97143 29.10000 0.02144151 0.83494417 0.01188178 0.2351624
## 2 0 471.30000 33.42500 0.05094803 0.07974083 0.08134141 0.7841093
## 3 0 41.08000 30.17143 0.02797249 0.04911648 0.04368033 0.8731743
## 4 0 51.13333 30.98000 0.04768728 0.78122150 0.01696486 0.2449111
## 5 0 92.40000 27.48571 0.02233651 0.05322567 0.06956968 0.9197526
## 6 0 72.37500 28.50000 0.03286525 0.05376456 0.06617362 0.8692822
## UNEM_LAG MBSA_LAG SRV_LAG PTMM_LAG MHHI_LAG PCIN_LAG OWN_LAG
## 1 0.17708961 0.2429275 0.28188390 0.195462429 33719.00 15064.75 0.2203922
## 2 0.04110074 0.5593857 0.11181872 0.037571302 61016.75 55231.75 0.4436658
## 3 0.04046894 0.6452313 0.06412460 0.011157560 101597.86 83209.00 0.4973330
## 4 0.19767955 0.2866674 0.24113421 0.209981081 34581.20 16386.80 0.2153571
## 5 0.05190341 0.6701621 0.09010968 0.007767554 74346.71 62555.57 0.3165681
## 6 0.05670957 0.6456898 0.06860391 0.012836838 89316.33 67326.17 0.4048973
## YRMV_LAG YRBT_LAG ADT_LAG WEEK_LAG BOARD_LAG BIZZ_LAG RMZ_LAG RZ_LAG
## 1 2004.250 1939.000 17237.26 347.3917 103.15771 0.3182851 0.3037695 0.3037695
## 2 2005.000 1958.500 14149.20 0.0000 178.56595 0.2245706 0.7473168 0.7473168
## 3 2004.429 1954.000 18058.98 1238.8601 183.23622 0.1666474 0.5153935 0.5153935
## 4 2003.800 1939.000 13101.41 277.9133 82.26962 0.2486523 0.3969165 0.3969165
## 5 2005.571 1957.143 16333.59 0.0000 239.78112 0.2393864 0.5813819 0.6199602
## 6 2005.000 1959.333 16323.20 0.0000 218.02262 0.1698625 0.6345898 0.6345937
## BP10_LAG BP17_LAG BV10_LAG BV17_LAG C210_LAG C217_LAG WS10_LAG
## 1 6.982992e-07 6.121502e-06 0.16129904 0.234710 5.024619 6.162446 85.24906
## 2 2.300062e-06 1.800641e-05 0.05953222 61.802988 1.307971 1.141676 116.65812
## 3 7.548065e-06 1.582958e-05 7.96840070 26.005930 4.142690 3.142691 132.53921
## 4 7.089616e-07 7.991582e-06 0.12904293 2.652111 4.237748 5.120286 83.86582
## 5 6.878193e-06 2.480580e-05 5.82701497 10.972368 1.264354 1.103607 135.41468
## 6 4.254014e-06 2.480560e-05 6.19766071 4.277070 1.400891 1.222780 135.14118
## WS17_LAG TREAT_LAG geometry
## 1 90.61526 0 MULTIPOLYGON (((1017922 -27...
## 2 115.28542 0 MULTIPOLYGON (((1017336 -26...
## 3 134.06204 0 MULTIPOLYGON (((1017621 -26...
## 4 89.58642 0 MULTIPOLYGON (((1018232 -27...
## 5 135.74890 0 MULTIPOLYGON (((1017755 -26...
## 6 135.07443 0 MULTIPOLYGON (((1017863 -26...
Average rating of school quality for nearby schools from 2011- 2012 (inverse-distance weighting by school):
We will subset the independent variables of our data and analyze their correlation:
Regarding missing values: we will omit missing values as they will not provide useful insights to our model and will negatively affect the prediction results.
## $Continuous
## label var_type n missing_n missing_percent mean sd
## POPD10 POPD10 <dbl> 2151 0 0.0 8178.0 9014.3
## HUD10 HUD10 <dbl> 2151 0 0.0 4083.0 6697.2
## MEDAGE10 MEDAGE10 <dbl> 2143 8 0.4 35.0 8.4
## BLKP10 BLKP10 <dbl> 2151 0 0.0 0.4 0.4
## HSPP10 HSPP10 <dbl> 2151 0 0.0 0.3 0.3
## ASNP10 ASNP10 <dbl> 2151 0 0.0 0.0 0.1
## AVGHHS10 AVGHHS10 <dbl> 2151 0 0.0 2.8 1.7
## FAMCP10 FAMCP10 <dbl> 2151 0 0.0 0.4 0.2
## BACHP10 BACHP10 <dbl> 2151 0 0.0 0.4 0.2
## AUTOP10 AUTOP10 <dbl> 2151 0 0.0 0.6 0.2
## BWLKP10 BWLKP10 <dbl> 2151 0 0.0 0.1 0.1
## UNEMP10 UNEMP10 <dbl> 2151 0 0.0 0.1 0.1
## MBSAP10 MBSAP10 <dbl> 2151 0 0.0 0.3 0.2
## SRVP10 SRVP10 <dbl> 2151 0 0.0 0.2 0.1
## PTMMP10 PTMMP10 <dbl> 2151 0 0.0 0.1 0.1
## MHHIN10 MHHIN10 <dbl> 2142 9 0.4 50115.1 26619.5
## PCIN10 PCIN10 <dbl> 2142 9 0.4 27971.3 21461.0
## OWNP10 OWNP10 <dbl> 2151 0 0.0 0.5 0.2
## MYRMOV10 MYRMOV10 <dbl> 2135 16 0.7 2001.5 5.1
## MRENT10 MRENT10 <dbl> 2015 136 6.3 979.5 291.1
## MVAL10 MVAL10 <dbl> 2093 58 2.7 262755.2 137702.8
## MYRBLT10 MYRBLT10 <dbl> 2143 8 0.4 1948.0 13.5
## AVGVEH1 AVGVEH1 <dbl> 1982 169 7.9 1.2 0.5
## CRIMER CRIMER <dbl> 2143 8 0.4 0.1 0.1
## ADT_mean ADT_mean <dbl> 2151 0 0.0 20117.8 8338.3
## AvgWEEK_me AvgWEEK_me <dbl> 2151 0 0.0 212.2 1143.5
## boardings_ boardings_ <dbl> 2151 0 0.0 84.5 104.6
## VACANTD VACANTD <dbl> 2151 0 0.0 0.0 0.0
## RM_ZONEP RM_ZONEP <dbl> 2151 0 0.0 0.2 0.3
## INTDEN INTDEN <dbl> 2151 0 0.0 0.0 0.0
## min quartile_25 median quartile_75 max
## POPD10 0.0 4099.0 6528.4 9770.3 231973.4
## HUD10 0.0 1657.7 2617.3 4220.3 166034.2
## MEDAGE10 11.9 29.2 33.6 39.7 81.2
## BLKP10 0.0 0.0 0.1 0.9 1.0
## HSPP10 0.0 0.0 0.1 0.4 1.0
## ASNP10 0.0 0.0 0.0 0.1 0.9
## AVGHHS10 0.0 2.2 2.7 3.2 58.0
## FAMCP10 0.0 0.3 0.5 0.6 1.0
## BACHP10 0.0 0.2 0.3 0.5 1.0
## AUTOP10 0.0 0.5 0.6 0.8 1.0
## BWLKP10 0.0 0.0 0.0 0.1 0.7
## UNEMP10 0.0 0.1 0.1 0.2 0.7
## MBSAP10 0.0 0.2 0.3 0.5 1.0
## SRVP10 0.0 0.1 0.2 0.3 1.0
## PTMMP10 0.0 0.0 0.1 0.2 0.7
## MHHIN10 2499.0 31212.5 44492.5 62718.5 191250.0
## PCIN10 4016.0 14545.0 21372.0 33391.5 209753.0
## OWNP10 0.0 0.3 0.4 0.6 1.0
## MYRMOV10 1969.0 2001.0 2003.0 2004.0 2011.0
## MRENT10 195.0 805.0 935.0 1089.5 2001.0
## MVAL10 28400.0 165900.0 233800.0 321700.0 1000001.0
## MYRBLT10 1939.0 1939.0 1939.0 1954.0 2004.0
## AVGVEH1 0.0 0.8 1.2 1.5 3.0
## CRIMER 0.0 0.0 0.1 0.1 2.0
## ADT_mean 3149.2 14483.3 18949.3 24007.5 67048.4
## AvgWEEK_me 0.0 0.0 0.0 0.0 15667.6
## boardings_ 0.0 23.9 57.6 106.2 1343.3
## VACANTD 0.0 0.0 0.0 0.0 0.0
## RM_ZONEP 0.0 0.0 0.0 0.3 1.0
## INTDEN 0.0 0.0 0.0 0.0 0.0
##
## $Categorical
## data frame with 0 columns and 2151 rows
library(corrplot)
## corrplot 0.92 loaded
CTable <- cor(na.omit(Cor))
corrplot(CTable, method = "circle", type = "full", tl.cex = 0.7)
According to the correlation matrix, HUD10 has very high correlation with POPD10, so we could decide to remove HUD10 from the independent variable:
Cor2 <- subset(Cor, select = -HUD10)
Now we will visualize our data in the map, in regards to schooling:
tmap_mode("view")
qualityrate <- tm_shape(C) + tm_fill("INSTR_mean", style= "quantile", n = 7, palette = "-YlGnBu", title = "school quality mean", alpha=.5) +
tm_borders(alpha=.3, col="white", lwd = 1)
qualityrate
We will apply Principal Components Analysis
PCA: excluding Geometric Properties for Initial Analysis: this is because PCA is sensitive to the scale of variables, and including variables with vastly different scales might result in principal components being dominated by variables with larger scales.
The code below standardizes the data, applies PCA and displays the results in the table below:
# Imputing missing values for mean
Cor2[] <- lapply(Cor2, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
# Standardize the data
Cor2_scaled <- scale(Cor2)
# Apply PCA
pca_result <- prcomp(Cor2_scaled, center = TRUE, scale. = TRUE)
# Summary of PCA results
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.549 1.9532 1.64806 1.21621 1.18698 1.0274 0.99644
## Proportion of Variance 0.224 0.1316 0.09366 0.05101 0.04858 0.0364 0.03424
## Cumulative Proportion 0.224 0.3556 0.44923 0.50024 0.54882 0.5852 0.61946
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.99231 0.95629 0.93386 0.90452 0.86354 0.83448 0.82655
## Proportion of Variance 0.03395 0.03153 0.03007 0.02821 0.02571 0.02401 0.02356
## Cumulative Proportion 0.65341 0.68495 0.71502 0.74323 0.76894 0.79296 0.81651
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.80933 0.76702 0.74399 0.71351 0.70092 0.66724 0.60225
## Proportion of Variance 0.02259 0.02029 0.01909 0.01755 0.01694 0.01535 0.01251
## Cumulative Proportion 0.83910 0.85939 0.87847 0.89603 0.91297 0.92832 0.94083
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.58663 0.56295 0.54781 0.46880 0.43596 0.37495 0.35054
## Proportion of Variance 0.01187 0.01093 0.01035 0.00758 0.00655 0.00485 0.00424
## Cumulative Proportion 0.95270 0.96362 0.97397 0.98155 0.98810 0.99295 0.99719
## PC29
## Standard deviation 0.28551
## Proportion of Variance 0.00281
## Cumulative Proportion 1.00000
Based on the results above, we now plot a Scree plot using the ggplot2 package, which will allow us to visualize the components and their percentages of explained variances:
get_eigenvalue(pca_result)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 6.49659382 22.4020476 22.40205
## Dim.2 3.81498134 13.1551081 35.55716
## Dim.3 2.71609781 9.3658545 44.92301
## Dim.4 1.47917098 5.1005896 50.02360
## Dim.5 1.40892790 4.8583721 54.88197
## Dim.6 1.05558503 3.6399484 58.52192
## Dim.7 0.99288628 3.4237458 61.94567
## Dim.8 0.98468018 3.3954489 65.34111
## Dim.9 0.91449363 3.1534263 68.49454
## Dim.10 0.87209933 3.0072391 71.50178
## Dim.11 0.81816382 2.8212545 74.32303
## Dim.12 0.74569327 2.5713561 76.89439
## Dim.13 0.69634899 2.4012034 79.29559
## Dim.14 0.68317735 2.3557840 81.65138
## Dim.15 0.65501876 2.2586854 83.91006
## Dim.16 0.58832443 2.0287049 85.93877
## Dim.17 0.55351667 1.9086782 87.84745
## Dim.18 0.50909276 1.7554923 89.60294
## Dim.19 0.49129121 1.6941076 91.29705
## Dim.20 0.44520298 1.5351827 92.83223
## Dim.21 0.36269961 1.2506883 94.08292
## Dim.22 0.34413203 1.1866622 95.26958
## Dim.23 0.31691721 1.0928180 96.36240
## Dim.24 0.30009074 1.0347957 97.39719
## Dim.25 0.21977224 0.7578353 98.15503
## Dim.26 0.19006169 0.6553851 98.81041
## Dim.27 0.14058871 0.4847887 99.29520
## Dim.28 0.12287658 0.4237123 99.71892
## Dim.29 0.08151465 0.2810850 100.00000
Based on the Scree plot we can determine that the principal components are the ones corresponding to the first 4 dimensions, which explain 51.6% of the variance of our data. For our purposes, we want to be able to explain a higher percentage of the variance, so we will choose 10 dimensions, which represent 71.5% of the variance of our data.
Now let’s take a look at the contribution of each variable for the 10 selected components. The dashed red line serves as the threshold for determining the level of variance to preserve in the data set. Variables exhibiting loadings surpassing this threshold are deemed significant contributors to explaining variance within the data. Consequently, such variables are recommended for inclusion in subsequent analyses and variables situated below the threshold may carry lesser significance and can be considered for potential exclusion from further analysis.
# Extract variable contributions
var_contrib <- get_pca_var(pca_result)
# Plot
contrib_plot <- fviz_contrib(pca_result, choice = "var", axes = 1:9, top = 30,
col.var = "blue", col.ind = "darkred",
title = "Variable Contribution Principal Components",
labels = list("Var", "Contrib", "Cos2", "Cos2*"),
repel = TRUE) +
labs(x = "Principal Components", y = "Contribution (%)")
contrib_plot
The first 11 variables are the most important, contributing the most to the explanation of the variance:
Spatial Clustering: The clustering of similar colors indicates that there are areas with similar school quality levels. Such clustering could be the result of various factors, including similar neighborhood characteristics, zoning policies, or even school districts that affect resource distribution.
Based on the previous PCA results, we need to merge them into our main spatial data C:
pca_data <- st_drop_geometry(C) # Drop geometry for combining with PCA result
# Exclude non-numeric columns, including GISJOIN
numeric_columns <- sapply(pca_data, is.numeric)
pca_data <- pca_data[, numeric_columns]
# Impute missing values with mean for each numeric column separately
for (col in colnames(pca_data)) {
pca_data[, col] <- ifelse(is.na(pca_data[, col]), mean(pca_data[, col], na.rm = TRUE), pca_data[, col])
}
# Combine with PCA result
pca_data <- cbind(pca_data, as.data.frame(pca_result$x))
Firstly, we use Elbow method and Silhoutte Method to decide the numbers of K
library(tidyverse)
library(cluster)
library(factoextra)
set.seed(123) # for reproducibility
fviz_nbclust(pca_data[, -c(1:3)], kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow Method")
according to the results of elbow method, we should choose 3 clusters.
Now let us use Silhouette Method evaluates how similar an object is to
its own cluster compared to other clusters.
fviz_nbclust(pca_data[, -c(1:3)], kmeans, method = "silhouette") +
labs(subtitle = "Silhouette Method")
according to the results of Silhoutte Method, we should choose 2 clusters.
We set 3 clusters and run k-means
k <- 3 # Specify the number of clusters
clusters <- kmeans(pca_data[, -c(1:3)], centers = k)
Before plotting the clusters, we should include them in the spatial data to later be able to plot them in the map:
C$cluster <- clusters$cluster
Now we are able to visualize the clusters in the map:
library(tmap)
tm_shape(C) +
tm_borders() +
tm_fill("cluster", palette = "Set3", title = "Cluster")
Cluster 1, marked with the lightest color, could signify regions with similar attributes that differentiate them from clusters 2 and 3. These might be areas with higher or lower average school quality ratings or other educational metrics.
C<- pca_data[, -c(1:3)]
C_CA <- as.data.frame(subset(C,(C$TREAT==1|C$CONTR_1==1))) #Neighbourhood-only control area
C_CHI <- as.data.frame(subset(C,(C$TREAT==1|C$CONTR_2==1))) #Entire city control area
Causal forest variables formulation - SAR
Y <- C_CA$INSTR_mean
X <- as.data.frame(cbind(C_CA$BACHP10, C_CA$INTDEN, C_CA$BLK10, C_CA$HSPP10, C_CA$PCIN10, C_CA$MHHIN10, C_CA$OWNP10, C_CA$MBSAP10, C_CA$AVGHHS10, C_CA$MEDAGE10, C_CA$FAMCP10, C_CA$AUTOP10) )
W <- C_CA$TREAT
These are estimates of m(X) = E[Y | X] with no test/train split
forest.Y <- regression_forest(X, Y)
Y.hat_train <- predict(forest.Y)$predictions
forest.W <- regression_forest(X, W)
W.hat_train <- predict(forest.W)$predictions
c.forest <- causal_forest(X, Y, W, Y.hat_train, W.hat_train)
tau.hat <- predict(c.forest, X)$predictions
average_treatment_effect(c.forest, target.sample = "all") #ATE for all observations (treated and untreated)
## estimate std.err
## -1.189603 2.513111
The negative estimate suggests that the treatment might have a negative effect on the outcome variable on average across all observations. However, the relatively large standard error indicates there is uncertainty around this estimate.
output <- C_CA %>%
select(BACHP10, INTDEN, HSPP10, PCIN10, MHHIN10, OWNP10, MBSAP10, AVGHHS10, MEDAGE10, FAMCP10, AUTOP10,TREAT, INSTR_mean) %>%
dplyr::mutate(CATE_CF = tau.hat)
imp <- c.forest %>%
variable_importance() %>%
as.data.frame() %>%
mutate(variable = colnames(c.forest$X.orig))
imp[order(imp$V1, decreasing = TRUE),]
## V1 variable
## 2 0.15788153 V2
## 11 0.11874564 V11
## 10 0.10168641 V10
## 6 0.08897561 V6
## 9 0.07960976 V9
## 7 0.07492683 V7
## 3 0.07191638 V3
## 1 0.07024390 V1
## 4 0.04682927 V4
## 8 0.04381882 V8
## 5 0.02341463 V5
according to the result, V2, V11 have the biggest effect on the result of treatment. They are INTDEN(Intersection density on the driving network), AUTOP10(% commuting by auto).
p1 <- ggplot(output, aes(x = INTDEN, y = tau.hat, color=as.factor(TREAT))) +
scale_color_brewer(palette="Paired") +
geom_point(alpha = 0.4 ) +
geom_smooth(method = "lm", fullrange=TRUE) +
ylab("Treatment effect") +
theme_light()
p2 <- ggplot(output, aes(x = AUTOP10, y = tau.hat, color=as.factor(TREAT))) +
scale_color_brewer(palette="Paired") +
geom_point(alpha = 0.4 ) +
geom_smooth(method = "lm", fullrange=TRUE) +
ylab("Treatment effect") +
theme_light()
cowplot::plot_grid(p1, p2,ncol=1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
INTDEN: There appears to be a positive trend, suggesting that as INTDEN increases, the treatment effect becomes less negative. AUTOP10: there’s a slight negative trend, suggesting the treatment effect moreless negative as AUTOP10 increases.
The shaded areas around the regression lines represent confidence intervals, indicating the precision of the estimated effect. The plots suggest that the treatment effect varies with both INTDEN and AUTOP10, and this variation appears to be consistent across both treated and untreated groups. ### Compute a prioritization rule based on estimated treatment effects
priority.cate <- predict(c.forest, X)$predictions
cf.eval <- causal_forest(X, Y, W, Y.hat_train, W.hat_train) #Priority rule = units with biggest positive treatment effects (CATE) are highest-priority
We evaluate prioritization rules via rank-weighted average treatment effects (RATEs), which capture the extent to which individuals who are highly ranked by the prioritization rule are more responsive to treatment than the average treatment effect
rate <- rank_average_treatment_effect(cf.eval, priority.cate, debiasing.weights = W.hat_train)
plot(rate)
From the plot, we can see that the treatment effect fluctuates as we
vary the proportion of the targeted population. The confidence intervals
suggest substantial uncertainty around these estimates, especially as
the proportion targeted approaches 1. This could indicate variability in
treatment effects across different segments of the population or could
reflect the model’s instability in certain areas of the data.
Kubara, M. Spatiotemporal localisation patterns of technological startups: the case for recurrent neural networks in predicting urban startup clusters. Ann Reg Sci (2023). https://doi.org/10.1007/s00168-023-01220-7US Census