Data Cleaning and Downscaling in R

Introduction: Singapore

View of the “super trees” in Singapore. Photo by Annie Spratt on Unsplash.

Here we are interested in the study of world cities’ diversity, and Singapore is an interesting place to study for a number of reasons. First, it is one of the few remaining city-states in the world, and it is far and away the most influential (unless, perhaps, you count Vatican City’s influence through the Catholic Church). Second, the city-state is defined both by its multi-ethnicity and by a distinctive approach to public housing.

While the majority of the population is of Chinese descent, Singapore is also home to substantial Malay and Indian populations, as well as a smattering of other groups. In addition to its officially recognized residents, Singapore hosts a large number of migrant workers from places like India, Indonesia, and the Philippines, who accounted for almost 30% of the population in 2019.

Singapore also is fairly unique in the extent of its public housing, with 77% of its officially recognized residents living in housing managed by the government’s Housing and Development Board (HDB) as of 2022 (see here). When we consider that non-residents are not eligible for HDB housing, this puts the rate at about 55% of the city-state’s population, similar to the rate in another city that is particularly well known for public housing: Vienna, Austria, where about 60% of the population lives in public housing or cooperatives (see here).

Singapore’s public housing arrangement is surprising not only because the number is so high, but also because Singapore is known much more for having a (neo)liberal capitalist orientation than for being a Southeast Asian socialist enclave. Public housing in Singapore became an important part of government policy in part for security reasons. First, housing was seen as a way to move families into a more middle-class, and, it was hoped, more status-quo orientation, helping bolster Singapore against feared Communist revolt. Second, it was thought that public housing could be a way to integrate ethnic groups, helping prevent the formation of ethnic enclaves the government feared could foment social unrest. So, in stark contrast to Chicago, where both federal and municipal policy choices pushed toward segregation, Singapore’s government took the opposite approach, though, to be clear, not necessarily for anti-racist reasons.

These political motivations notwithstanding, Singapore is also well known for inequitable treatment of its large migrant worker population. An estimated 300,000 of these workers have been feeding the country’s construction boom, part of an attempt to keep sky-high housing prices at bay. Many such workers are housed in crowded worker dormitories. In the initial phases of the COVID-19 pandemic, these dormitories accounted for approximately 90% of the COVID-19 cases in the city, helping to shine a particularly bright spotlight on migrant labor conditions.

We will be using our tools to consider some of these issues in this tutorial.

Objectives

  • Explore tools and strategies for cleaning data in R
  • Get further practice with geoprocessing
  • Get further practice accessing data from OpenStreetMap
  • Learn about the process of downscaling data
  • Learn how to generate population centroids
  • Learn how to use inverse distance weighting to estimate values at specific locations
  • Practice thinking through strategies to deal with real-world geospatial data challenges

The Data

Often, the data that are available to us were produced without our specific uses in mind. A whole lot of GIS data, for example, come from government agencies that collect and use such data as part of their administrative work. That means that sometimes we need to use GIS tools to combine data from different sources to suit our purposes. Here, we’ll continue looking at ways to use spatial analysis to link data from different spatial datasets, but, this time, the data are going to be a bit messier - which makes them a bit more typical of the types of data we often have to work with in the real world.

This time, all our data come from Singapore’s own data portal, which is impressively extensive.

  • HDBTowns2018.csv contains population numbers for all the HDB housing areas as of 2018 (the most recent year I could find)
  • singapore_subzones.gpkg contains boundaries for the HDB housing areas
  • hdb_property_info.csv contains information on the construction date and number of units for all HDB buildings
  • hdb_existing_buildings.gpkg contains the footprints of all HDB buildings
  • SubzoneTot[Group].csv files contain data on the population of residents of different ethnicities at the subzone level, based on the 2015 Singaporean census
  • PZCodes.csv is a crosswalk file that we’ll use to conduct a table join between two of our datasets

The data can be found in this zip file.

The Packages

Just our old friends, today, except for the centr package, which we will use to compute population-weighted centroids, and the phylin package, which we will use to compute inverse-distance-weighted interpolations:

require(tidyverse)
Loading required package: tidyverse
Warning: package 'ggplot2' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require(sf)
Loading required package: sf
Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
require(maptiles)
Loading required package: maptiles
Warning: package 'maptiles' was built under R version 4.3.3
require(tidyterra)
Loading required package: tidyterra
Warning: package 'tidyterra' was built under R version 4.3.3

Attaching package: 'tidyterra'

The following object is masked from 'package:stats':

    filter
require(osmdata)
Loading required package: osmdata
Warning: package 'osmdata' was built under R version 4.3.3
Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
require(ggspatial)
Loading required package: ggspatial
require(centr)
Loading required package: centr
Warning: package 'centr' was built under R version 4.3.3
require(phylin)
Loading required package: phylin
Warning: package 'phylin' was built under R version 4.3.3

Wrangling government data

Now, it’s important to remember that data “in the wild” are rarely as clean and easy to use as what I’ve given you in the previous labs. I’ve been prepping and cleaning most of the data to make them a bit more accessible. In this lab, while I’m still doing a fair bit of prepping and cleaning for you, I want you to start getting a feel for some of the problems that you can run into so that you can start developing some skills in troubleshooting problems that you might encounter working with data in the real world. We’ll see a few examples of some common problems in this section.

As I mentioned above, one of the most distinctive features of Singapore as a city is the prominent role of the government-run housing estates called the new towns. In this section, we’re going to introduce a new tool that we’ll be using to map the population of new town residents across the city-state. In your lab folder, you can find a spreadsheet file (HDBTowns2018.csv) with the most recent (2018) estimates for the population living in HDB housing. Let’s read it in and have a look:

setwd("G:/My Drive/My Classes/World Cities/Labs/Data Cleaning")

hdb_pop <- read.csv("HDBTowns2018.csv", stringsAsFactors = FALSE)

head(hdb_pop)
  financial_year town_or_estate population
1           2018     Ang Mo Kio     141600
2           2018          Bedok     191300
3           2018         Bishan      62100
4           2018    Bukit Batok     115200
5           2018    Bukit Merah     144300
6           2018  Bukit Panjang     120100

Because we don’t have spatial locations for the towns and estates, we need to find a vector layer to which we can join this table.

In your lab folder, you will find a file called singapore_subzones, which I have downloaded from Singapore’s data portal. We’re going to use it to try to map the HDB housing populations. Let’s read it in:

subzones <- st_read("singapore_subzones.gpkg")
Reading layer `masterplan2019subzoneboundarynoseakml__ura_mp19_subzone_no_sea_pl' from data source `G:\My Drive\My Classes\World Cities\Labs\Data Cleaning\singapore_subzones.gpkg' 
  using driver `GPKG'
Simple feature collection with 332 features and 23 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
plot(st_geometry(subzones))

head(subzones)
Simple feature collection with 6 features and 23 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: 103.6537 ymin: 1.216215 xmax: 103.8988 ymax: 1.297462
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
   Name
1 kml_1
2 kml_2
3 kml_3
4 kml_4
5 kml_5
6 kml_6
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    description
1              <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_NO</th>\n<td>1</td>\n</tr><tr bgcolor="">\n<th>SUBZONE_N</th>\n<td>MARINA EAST</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_C</th>\n<td>MESZ01</td>\n</tr><tr bgcolor="">\n<th>CA_IND</th>\n<td>Y</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>PLN_AREA_N</th>\n<td>MARINA EAST</td>\n</tr><tr bgcolor="">\n<th>PLN_AREA_C</th>\n<td>ME</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>REGION_N</th>\n<td>CENTRAL REGION</td>\n</tr><tr bgcolor="">\n<th>REGION_C</th>\n<td>CR</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>4FB7E5B1B9455DE0</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20191223152313</td>\n</tr></table></center>
2        <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_NO</th>\n<td>5</td>\n</tr><tr bgcolor="">\n<th>SUBZONE_N</th>\n<td>INSTITUTION HILL</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_C</th>\n<td>RVSZ05</td>\n</tr><tr bgcolor="">\n<th>CA_IND</th>\n<td>Y</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>PLN_AREA_N</th>\n<td>RIVER VALLEY</td>\n</tr><tr bgcolor="">\n<th>PLN_AREA_C</th>\n<td>RV</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>REGION_N</th>\n<td>CENTRAL REGION</td>\n</tr><tr bgcolor="">\n<th>REGION_C</th>\n<td>CR</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>C3C22D1EE31757BD</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20191223152313</td>\n</tr></table></center>
3       <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_NO</th>\n<td>1</td>\n</tr><tr bgcolor="">\n<th>SUBZONE_N</th>\n<td>ROBERTSON QUAY</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_C</th>\n<td>SRSZ01</td>\n</tr><tr bgcolor="">\n<th>CA_IND</th>\n<td>Y</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>PLN_AREA_N</th>\n<td>SINGAPORE RIVER</td>\n</tr><tr bgcolor="">\n<th>PLN_AREA_C</th>\n<td>SR</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>REGION_N</th>\n<td>CENTRAL REGION</td>\n</tr><tr bgcolor="">\n<th>REGION_C</th>\n<td>CR</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>87306ABAF4B67E2E</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20191223152313</td>\n</tr></table></center>
4 <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_NO</th>\n<td>1</td>\n</tr><tr bgcolor="">\n<th>SUBZONE_N</th>\n<td>JURONG ISLAND AND BUKOM</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_C</th>\n<td>WISZ01</td>\n</tr><tr bgcolor="">\n<th>CA_IND</th>\n<td>N</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>PLN_AREA_N</th>\n<td>WESTERN ISLANDS</td>\n</tr><tr bgcolor="">\n<th>PLN_AREA_C</th>\n<td>WI</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>REGION_N</th>\n<td>WEST REGION</td>\n</tr><tr bgcolor="">\n<th>REGION_C</th>\n<td>WR</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>C87E378D3456FC35</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20191223152313</td>\n</tr></table></center>
5                  <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_NO</th>\n<td>2</td>\n</tr><tr bgcolor="">\n<th>SUBZONE_N</th>\n<td>FORT CANNING</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_C</th>\n<td>MUSZ02</td>\n</tr><tr bgcolor="">\n<th>CA_IND</th>\n<td>Y</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>PLN_AREA_N</th>\n<td>MUSEUM</td>\n</tr><tr bgcolor="">\n<th>PLN_AREA_C</th>\n<td>MU</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>REGION_N</th>\n<td>CENTRAL REGION</td>\n</tr><tr bgcolor="">\n<th>REGION_C</th>\n<td>CR</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>8E8F2616FFA9E019</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20191223152313</td>\n</tr></table></center>
6       <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_NO</th>\n<td>5</td>\n</tr><tr bgcolor="">\n<th>SUBZONE_N</th>\n<td>MARINA EAST (MP)</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>SUBZONE_C</th>\n<td>MPSZ05</td>\n</tr><tr bgcolor="">\n<th>CA_IND</th>\n<td>N</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>PLN_AREA_N</th>\n<td>MARINE PARADE</td>\n</tr><tr bgcolor="">\n<th>PLN_AREA_C</th>\n<td>MP</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>REGION_N</th>\n<td>CENTRAL REGION</td>\n</tr><tr bgcolor="">\n<th>REGION_C</th>\n<td>CR</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>B53EBCDBCF99A0B9</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20191223152313</td>\n</tr></table></center>
  timestamp begin  end altitudeMode tessellate extrude visibility drawOrder
1      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
2      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
3      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
4      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
5      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
6      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
  icon SUBZONE_NO               SUBZONE_N SUBZONE_C CA_IND      PLN_AREA_N
1 <NA>          1             MARINA EAST    MESZ01      Y     MARINA EAST
2 <NA>          5        INSTITUTION HILL    RVSZ05      Y    RIVER VALLEY
3 <NA>          1          ROBERTSON QUAY    SRSZ01      Y SINGAPORE RIVER
4 <NA>          1 JURONG ISLAND AND BUKOM    WISZ01      N WESTERN ISLANDS
5 <NA>          2            FORT CANNING    MUSZ02      Y          MUSEUM
6 <NA>          5        MARINA EAST (MP)    MPSZ05      N   MARINE PARADE
  PLN_AREA_C       REGION_N REGION_C          INC_CRC     FMEL_UPD_D SubTitle
1         ME CENTRAL REGION       CR 4FB7E5B1B9455DE0 20191223152313     <NA>
2         RV CENTRAL REGION       CR C3C22D1EE31757BD 20191223152313     <NA>
3         SR CENTRAL REGION       CR 87306ABAF4B67E2E 20191223152313     <NA>
4         WI    WEST REGION       WR C87E378D3456FC35 20191223152313     <NA>
5         MU CENTRAL REGION       CR 8E8F2616FFA9E019 20191223152313     <NA>
6         MP CENTRAL REGION       CR B53EBCDBCF99A0B9 20191223152313     <NA>
  snippet                           geom
1         MULTIPOLYGON Z (((103.8802 ...
2         MULTIPOLYGON Z (((103.8376 ...
3         MULTIPOLYGON Z (((103.8341 ...
4         MULTIPOLYGON (((103.7125 1....
5         MULTIPOLYGON Z (((103.8472 ...
6         MULTIPOLYGON Z (((103.8987 ...

This might look overwhelming at first, but upon careful inspection, you can see that the town_or_estate variable in the hdb_pop object actually corresponds to the PLN_AREA_N column in subzones. That’s because the HDB-run estates and towns correspond to one of the spatial divisions, Planning Areas, that Singapore uses for its census.

The problem we have here is that our vector layer shows boundaries of census subzones, which are the smallest spatial units Singapore uses for its censuses. They are parts of Planning Areas, but if we were to use a table join to add data from the HDBTowns2018 table to this one, we would have lots of polygons for each town.

Fortunately, we can use dissolving to address this problem. In this case, for example, we can group together all the subzones in a single planning area using the PLN_AREA_N field and then use summarise() to create a new object with the planning area boundaries. Then we can table join the data on the population of HDB residents.
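
Just to remind you of the general pattern before you try it yourself, here’s a minimal dissolve sketch with hypothetical layer and field names (you’ll swap in the real ones in the task below):

#Dissolve: group the polygons by a shared attribute; summarise() then
#unions the geometries within each group into a single (multi)polygon:
dissolved <- some_polygons %>%
  group_by(shared_field) %>%
  summarise()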

Task 1: Dissolve the subzones using the PLN_AREA_N variable and assign the results to an object called planning_zones. Then map the results, which should look something like this (helpful hint: if you get an error that mentions that a loop is not valid, try running sf_use_s2(FALSE) before the rest of your code - this turns off the spherical geometry tools, which can sometimes cause issues at small scales like these):

Spherical geometry (s2) switched off
although coordinates are longitude/latitude, st_union assumes that they are
planar

Now we just have one more problem to tackle. We want to table join the HDBTowns2018 table to the Planning Areas layer we just created, but in order to do a table join, the entries in our join fields need to match exactly. Have a look at the entries in the fields we need to match and see if you can spot the problem:

hdb_pop$town_or_estate
 [1] "Ang Mo Kio"      "Bedok"           "Bishan"          "Bukit Batok"    
 [5] "Bukit Merah"     "Bukit Panjang"   "Bukit Timah"     "Central Area"   
 [9] "Choa Chu Kang"   "Clementi"        "Geylang"         "Hougang"        
[13] "Jurong East"     "Jurong West"     "Kallang/Whampoa" "Marine Parade"  
[17] "Pasir Ris"       "Punggol"         "Queenstown"      "Sembawang"      
[21] "Sengkang"        "Serangoon"       "Tampines"        "Toa Payoh"      
[25] "Woodlands"       "Yishun"         
planning_zones$PLN_AREA_N
 [1] "ANG MO KIO"              "BEDOK"                  
 [3] "BISHAN"                  "BOON LAY"               
 [5] "BUKIT BATOK"             "BUKIT MERAH"            
 [7] "BUKIT PANJANG"           "BUKIT TIMAH"            
 [9] "CENTRAL WATER CATCHMENT" "CHANGI"                 
[11] "CHANGI BAY"              "CHOA CHU KANG"          
[13] "CLEMENTI"                "DOWNTOWN CORE"          
[15] "GEYLANG"                 "HOUGANG"                
[17] "JURONG EAST"             "JURONG WEST"            
[19] "KALLANG"                 "LIM CHU KANG"           
[21] "MANDAI"                  "MARINA EAST"            
[23] "MARINA SOUTH"            "MARINE PARADE"          
[25] "MUSEUM"                  "NEWTON"                 
[27] "NORTH-EASTERN ISLANDS"   "NOVENA"                 
[29] "ORCHARD"                 "OUTRAM"                 
[31] "PASIR RIS"               "PAYA LEBAR"             
[33] "PIONEER"                 "PUNGGOL"                
[35] "QUEENSTOWN"              "RIVER VALLEY"           
[37] "ROCHOR"                  "SELETAR"                
[39] "SEMBAWANG"               "SENGKANG"               
[41] "SERANGOON"               "SIMPANG"                
[43] "SINGAPORE RIVER"         "SOUTHERN ISLANDS"       
[45] "STRAITS VIEW"            "SUNGEI KADUT"           
[47] "TAMPINES"                "TANGLIN"                
[49] "TENGAH"                  "TOA PAYOH"              
[51] "TUAS"                    "WESTERN ISLANDS"        
[53] "WESTERN WATER CATCHMENT" "WOODLANDS"              
[55] "YISHUN"                 

Even though the letters in the two names are the same, one is in Title Case, while the other is in ALL CAPS, so we’ve got a similar problem to what we encountered with the community areas in Chicago.

In addition, one entry in hdb_pop, Kallang/Whampoa, is just called KALLANG in the planning zones object. Since you already know how to fix the text case problem because of our previous tutorial, I’m just going to show you how to fix specific problem entries in a column:

hdb_pop$town_or_estate[hdb_pop$town_or_estate=="Kallang/Whampoa"] <- "Kallang"

So let’s unpack what we’re doing here: the conditional statement in the square brackets is being used to identify the places in the town_or_estate column where the value is exactly "Kallang/Whampoa". Then all we’re doing is using the <- operator to assign those entries a new value, "Kallang".
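
If that pattern is new to you, here’s the same trick on a toy vector (made-up values):

fruit <- c("apple", "bananna", "pear")
#Replace only the elements that exactly match the misspelled value:
fruit[fruit == "bananna"] <- "banana"
fruit
[1] "apple"  "banana" "pear"  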

Task 2: Use the toupper() function to create a new column in hdb_pop that you can use to join the HDB population data to the planning_zones object using a left_join(). Call the resulting object planning_zones_hdb (hint: be sure to join hdb_pop to planning_zones and not the other way around, as doing it the other way will cause you problems later on).

Joining with `by = join_by(PLN_AREA_N)`

Okay, let’s have a look:

plot(planning_zones_hdb[,c("geom", "population")])

See all those blank areas? Those are planning areas where there are no HDB residences. Because hdb_pop only has data on places where HDB properties are found, the other areas are showing up as NA - R’s designation for missing data.

Because we know that all those NA values are actually zero, we can use a trick similar to the one we used to correct Kallang above:

planning_zones_hdb$population[is.na(planning_zones_hdb$population)] <- 0

Task 3: Explain what’s going on in the above code block. What is each part doing? What does the code actually accomplish?

Downscaling vector data

So these areas are pretty aggregated - what if we wanted to have more detailed estimates? This is where we get into techniques called downscaling, which are a bunch of different strategies for taking information available at an aggregate level and then using additional, related information available at a finer scale to estimate finer-grained variations.

We actually have a great tool available for this from Singapore’s data portal, where they have a KML file with the outline of every HDB building. Let’s read that one in:

hdb_buildings <- st_read("hdb_existing_buildings.gpkg")
Reading layer `hdbexistingbuilding__hdb_active_blk_p' from data source 
  `G:\My Drive\My Classes\World Cities\Labs\Data Cleaning\hdb_existing_buildings.gpkg' 
  using driver `GPKG'
Simple feature collection with 12847 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: 103.6848 ymin: 1.270005 xmax: 103.989 ymax: 1.457245
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
plot(st_geometry(hdb_buildings))

head(hdb_buildings)
Simple feature collection with 6 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XYZ
Bounding box:  xmin: 103.7497 ymin: 1.278795 xmax: 103.9585 ymax: 1.449331
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
   Name
1 kml_1
2 kml_2
3 kml_3
4 kml_4
5 kml_5
6 kml_6
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      description
1  <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>BLK_NO</th>\n<td>780C</td>\n</tr><tr bgcolor="">\n<th>ST_COD</th>\n<td>WOC05L</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>ENTITYID</th>\n<td>1991</td>\n</tr><tr bgcolor="">\n<th>POSTAL_COD</th>\n<td>733780</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>B93627FDFF3D6313</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20230328181025</td>\n</tr></table></center>
2   <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>BLK_NO</th>\n<td>373</td>\n</tr><tr bgcolor="">\n<th>ST_COD</th>\n<td>BUS09S</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>ENTITYID</th>\n<td>6782</td>\n</tr><tr bgcolor="">\n<th>POSTAL_COD</th>\n<td>650373</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>49487496978A23EF</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20130426120311</td>\n</tr></table></center>
3   <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>BLK_NO</th>\n<td>328</td>\n</tr><tr bgcolor="">\n<th>ST_COD</th>\n<td>TAS39U</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>ENTITYID</th>\n<td>7578</td>\n</tr><tr bgcolor="">\n<th>POSTAL_COD</th>\n<td>520328</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>67E2089D8D4D20D0</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20130426120323</td>\n</tr></table></center>
4   <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>BLK_NO</th>\n<td>771A</td>\n</tr><tr bgcolor="">\n<th>ST_COD</th>\n<td>CHS27B</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>ENTITYID</th>\n<td>307</td>\n</tr><tr bgcolor="">\n<th>POSTAL_COD</th>\n<td>681771</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>577EE045DEA00FEE</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20130426120328</td>\n</tr></table></center>
5    <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>BLK_NO</th>\n<td>3A</td>\n</tr><tr bgcolor="">\n<th>ST_COD</th>\n<td>TEC01D</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>ENTITYID</th>\n<td>7332</td>\n</tr><tr bgcolor="">\n<th>POSTAL_COD</th>\n<td>091003</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>AF8FF795FF2C87EA</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20130426120321</td>\n</tr></table></center>
6 <center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">\n<th>BLK_NO</th>\n<td>781C</td>\n</tr><tr bgcolor="">\n<th>ST_COD</th>\n<td>WOA05F</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>ENTITYID</th>\n<td>12799</td>\n</tr><tr bgcolor="">\n<th>POSTAL_COD</th>\n<td>733781</td>\n</tr><tr bgcolor="#E3E3F3">\n<th>INC_CRC</th>\n<td>9257EC517D945A34</td>\n</tr><tr bgcolor="">\n<th>FMEL_UPD_D</th>\n<td>20230328181025</td>\n</tr></table></center>
  timestamp begin  end altitudeMode tessellate extrude visibility drawOrder
1      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
2      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
3      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
4      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
5      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
6      <NA>  <NA> <NA>         <NA>         -1       0         -1        NA
  icon BLK_NO ST_COD ENTITYID POSTAL_COD          INC_CRC     FMEL_UPD_D
1 <NA>   780C WOC05L     1991     733780 B93627FDFF3D6313 20230328181025
2 <NA>    373 BUS09S     6782     650373 49487496978A23EF 20130426120311
3 <NA>    328 TAS39U     7578     520328 67E2089D8D4D20D0 20130426120323
4 <NA>   771A CHS27B      307     681771 577EE045DEA00FEE 20130426120328
5 <NA>     3A TEC01D     7332     091003 AF8FF795FF2C87EA 20130426120321
6 <NA>   781C WOA05F    12799     733781 9257EC517D945A34 20230328181025
  snippet                           geom
1         MULTIPOLYGON Z (((103.8017 ...
2         MULTIPOLYGON Z (((103.7502 ...
3         MULTIPOLYGON Z (((103.9581 ...
4         MULTIPOLYGON Z (((103.7497 ...
5         MULTIPOLYGON Z (((103.8173 ...
6         MULTIPOLYGON Z (((103.8025 ...

We also have a separate file from the Singapore Data Portal that contains information on the construction date and the number of units of different sizes in each building. In principle, we could use the number of units in a building as a proxy to downscale the population data. Let’s read that table in, as well:

hdb_building_info <- read.csv("hdb_property_info.csv",
                              stringsAsFactors = FALSE)

head(hdb_building_info)
  blk_no            street max_floor_lvl year_completed residential commercial
1      1          BEACH RD            16           1970           Y          Y
2      1   BEDOK STH AVE 1            14           1975           Y          N
3      1     CANTONMENT RD             2           2010           N          Y
4      1      CHAI CHEE RD            15           1982           Y          N
5      1 CHANGI VILLAGE RD             4           1975           Y          Y
6      1         DELTA AVE            25           1982           Y          N
  market_hawker miscellaneous multistorey_carpark precinct_pavilion
1             N             N                   N                 N
2             N             Y                   N                 N
3             N             N                   N                 N
4             N             N                   N                 N
5             N             N                   N                 N
6             N             N                   N                 N
  bldg_contract_town total_dwelling_units X1room_sold X2room_sold X3room_sold
1                KWN                  142           0           1         138
2                 BD                  206           0           0         204
3                 CT                    0           0           0           0
4                 BD                  102           0           0           0
5                PRC                   55           0           0          54
6                 BM                   96           0           0           0
  X4room_sold X5room_sold exec_sold multigen_sold studio_apartment_sold
1           1           2         0             0                     0
2           0           2         0             0                     0
3           0           0         0             0                     0
4          10          92         0             0                     0
5           0           1         0             0                     0
6           0          96         0             0                     0
  X1room_rental X2room_rental X3room_rental other_room_rental
1             0             0             0                 0
2             0             0             0                 0
3             0             0             0                 0
4             0             0             0                 0
5             0             0             0                 0
6             0             0             0                 0

Now, in practice, downscaling is much more of an art than a science, because you really just have to make do with whatever data you can get your hands on. That means you sometimes just have to be creative with how you select and combine datasets.

Let’s look at the room sizes covered in the hdb_building_info object. We can see the types of flats/apartments available on this HDB page. The _sold and _rental columns have the different types:

names(hdb_building_info)
 [1] "blk_no"                "street"                "max_floor_lvl"        
 [4] "year_completed"        "residential"           "commercial"           
 [7] "market_hawker"         "miscellaneous"         "multistorey_carpark"  
[10] "precinct_pavilion"     "bldg_contract_town"    "total_dwelling_units" 
[13] "X1room_sold"           "X2room_sold"           "X3room_sold"          
[16] "X4room_sold"           "X5room_sold"           "exec_sold"            
[19] "multigen_sold"         "studio_apartment_sold" "X1room_rental"        
[22] "X2room_rental"         "X3room_rental"         "other_room_rental"    

So probably our best tool for downscaling would be the housing capacity in each building, and our best proxy for capacity would be the number of bedrooms. Based on the website above, multigenerational flats have 4 bedrooms; 5-room, 4-room, and executive flats have 3; 3-room flats have 2; and 2-room flats have 1. We could say studios are 0.5, and it looks like the “other” and 1-room categories probably cover senior housing units, all of which have 1 bedroom.

So we can use that information to construct a new variable estimating the total bedrooms in a building and then aggregate that measure up to the block. Because we have population numbers for 2018, we should also filter out any properties constructed after that:

hdb_building_info <- hdb_building_info %>%
  #Keep only buildings completed by 2018, to match our population data:
  filter(year_completed <= 2018) %>%
  #Weight each unit type by the approximate bedroom counts discussed above:
  mutate(bedrooms = X1room_sold +
           X2room_sold +
           (X3room_sold * 2) +
           (X4room_sold * 3) +
           (X5room_sold * 3) +
           (exec_sold * 3) +
           (multigen_sold * 4) +
           (studio_apartment_sold * 0.5) +
           X1room_rental +
           X2room_rental +
           (X3room_rental * 2) +
           other_room_rental)

Now we need to join those data to the building footprints. Unfortunately, we have a pretty big challenge here. On the one hand, the hdb_building_info table has some great information about the HDB properties. On the other, the only common column between it and the building footprints object is the block number, which is really just the first part of the street address. The building footprints have an ST_COD column, which presumably designates streets, but it’s a code, rather than the street name, and I have not been able to find a table that would allow me to match it to actual street names, which is how the information is presented in the building characteristics table.

So after several hours of failed attempts, I think I’ve come up with an okay (certainly not perfect) strategy. While we have different formats for the street name in the two objects, it’s probably the case that most of the time at least the first letter of the street code and the corresponding street name are the same, so we can use that to start building a common key across the two datasets. In the building information table, we also have a column called bldg_contract_town that appears to designate the planning area in which a building is located. Of course, just to make things fun, this is also in a code form, but we can probably figure out which code is which and then create a new variable. Then we could use a spatial join on the buildings to add in the planning area names to that object.

Okay, that was complicated, so here’s a run-down of the steps I’ve come up with:

  • Recode the bldg_contract_town codes into planning area names (because I’m nice, I did that for you; it’s in the PZCodes.csv file)
  • Spatially join the planning area data to the building footprints
  • Construct common codes for table joining using the block number, the first letter of the street name, and the planning area name
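
The string pieces of that plan lean on two helpers, str_sub() and paste0(); here’s a quick toy illustration of each (made-up inputs):

str_sub("BEACH RD", 1, 1)
[1] "B"
paste0("1", "B", "KALLANG")
[1] "1BKALLANG"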

So let’s look at our recoding sheet:

pz_recode <- read.csv("PZCodes.csv", stringsAsFactors = FALSE)

pz_recode
   bldg_contract_town    PLN_AREA_N
1                 AMK    ANG MO KIO
2                  BB   BUKIT BATOK
3                  BD         BEDOK
4                  BH        BISHAN
5                  BM   BUKIT MERAH
6                  BP BUKIT PANJANG
7                  BT   BUKIT TIMAH
8                 CCK CHOA CHU KANG
9                  CL      CLEMENTI
10                 CT      CLEMENTI
11                 GL       GEYLANG
12                 HG       HOUGANG
13                 JE   JURONG EAST
14                 JW   JURONG WEST
15                KWN       KALLANG
16                 MP MARINE PARADE
17                 PG       PUNGGOL
18                PRC     PASIR RIS
19                 QT    QUEENSTOWN
20                 SB     SEMBAWANG
21                SGN     SERANGOON
22                 SK      SENGKANG
23                TAP      TAMPINES
24                 TG        TENGAH
25                 TP     TOA PAYOH
26                 WL     WOODLANDS
27                 YS        YISHUN

I’ve set it up so that we can just use left_join() to add the recoded column to our hdb_building_info table. Then we can create our new key column:

hdb_building_info <- hdb_building_info %>%
  #Get only the residential buildings
  filter(residential == "Y") %>%
  left_join(pz_recode) %>%
  #Using paste0() to make a string that combines everything we're using to
  #make our key for the table join:
  mutate(key = paste0(blk_no,
                      str_sub(street, 1, 1),
                      PLN_AREA_N))
Joining with `by = join_by(bldg_contract_town)`
head(hdb_building_info$key)
[1] "1BKALLANG"     "1BBEDOK"       "1CBEDOK"       "1CPASIR RIS"  
[5] "1DBUKIT MERAH" "1DQUEENSTOWN" 

Now we’ll prep the key in the building footprint object:

hdb_buildings_pop <- hdb_buildings %>%
  st_join(planning_zones_hdb) %>%
  mutate(key = paste0(BLK_NO,
                      str_sub(ST_COD, 1, 1),
                      PLN_AREA_N))
although coordinates are longitude/latitude, st_intersects assumes that they
are planar
head(hdb_buildings_pop$key)
[1] "780CWWOODLANDS"     "373BBUKIT BATOK"    "328TTAMPINES"      
[4] "771ACCHOA CHU KANG" "3ATBUKIT MERAH"     "781CWWOODLANDS"    

Of course, this is still going to be far from perfect, because even combining all these keys, we will have some entries with duplicate keys:

sum(duplicated(hdb_building_info$key))
[1] 88
sum(duplicated(hdb_buildings_pop$key))
[1] 160

This is a problem because left_join() will match all combinations of rows whose keys are duplicated in the two tables, which will result in a bunch of extra entries in the combined dataset.
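
To see why, here’s a tiny toy example (made-up tables): every "A" row on the left matches every "A" row on the right, so the join multiplies them:

left_tbl <- tibble(key = c("A", "A", "B"), x = 1:3)
right_tbl <- tibble(key = c("A", "A", "B"), y = 4:6)

left_join(left_tbl, right_tbl, by = "key")
# A tibble: 5 × 3
  key       x     y
  <chr> <int> <int>
1 A         1     4
2 A         1     5
3 A         2     4
4 A         2     5
5 B         3     6

The two duplicated "A" keys turn two rows into four, so the result has 5 rows instead of 3 (recent versions of dplyr will also warn you about the many-to-many relationship).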

In addition, there are still some buildings about which we have information that we cannot match to any building footprint, and vice-versa:

sum(!hdb_building_info$key %in% hdb_buildings_pop$key)
[1] 531

That means we still have two things to do: 1) take care of the duplicates, and 2) come up with a principled way to match the unmatched building information to something.

The duplicates are pretty easily dealt with. Because they’re at least in the same planning area, they’re relatively close to one another, so we can just aggregate them by the key variable and take the mean number of bedrooms. That will distribute the bedrooms evenly across all the buildings we can’t specifically identify. Not perfect, but a reasonable enough guess.

Then we need to deal with the buildings we can’t match. While not ideal, one conservative approach would be to say that at least we know the planning area for the unmatched buildings, so we could just replace the key with the planning area and aggregate across planning areas like we’re going to do for duplicated keys:

hdb_building_info$key[!hdb_building_info$key %in% hdb_buildings_pop$key] <-
  hdb_building_info$PLN_AREA_N[!hdb_building_info$key %in% hdb_buildings_pop$key]

hdb_building_info_agg <- hdb_building_info %>%
  group_by(key) %>%
  summarise(bedrooms = mean(bedrooms))

#Notice that for all the non-duplicated keys, the mean of bedrooms will just
#be the original value

Task 4: That’s a complicated code block above! Unpack what’s going on and explain what each bit of that line is doing.

And then we’ll need to replace the unmatched building footprint keys with their respective planning zones, as well:

hdb_buildings_pop$key[!hdb_buildings_pop$key %in% hdb_building_info_agg$key] <-
  hdb_buildings_pop$PLN_AREA_N[!hdb_buildings_pop$key %in% hdb_building_info_agg$key]

Now, that should have done the trick. Again, not perfect, and there are likely better methods, but this gives you an idea:

sum(!hdb_building_info_agg$key%in%hdb_buildings_pop$key)
[1] 0

Hooray! Now we can finally do that table join:

hdb_buildings_pop_info <- hdb_buildings_pop %>%
  left_join(hdb_building_info_agg) %>%
  filter(!is.na(bedrooms))
Joining with `by = join_by(key)`

Before we plot this, it looks like the buildings layer has some invalid polygons, and we have a bit of a problem because the layer also stores a height (Z) value in addition to the x and y coordinates. We’ll use st_zm() to drop the height values, and then try our trick of using a 0-width buffer to see if we can fix the invalid geometries:

hdb_buildings_pop_info <- hdb_buildings_pop_info %>%
  #Remove the 3D information:
  st_zm(drop=TRUE) %>%
  #Draw 0-width buffer to deal with loops:
  st_buffer(dist = 0)
Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs, endCapStyle =
endCapStyle, : st_buffer does not correctly buffer longitude/latitude data
dist is assumed to be in decimal degrees (arc_degrees).

So here’s what this looks like:

singapore <- planning_zones %>%
  summarise()
although coordinates are longitude/latitude, st_union assumes that they are
planar
ggplot(data = singapore) +
  geom_sf(data=singapore, fill="white") +
  geom_sf(data=hdb_buildings_pop_info, aes(fill=bedrooms, color=bedrooms)) +
  scale_color_viridis_c("HDB Bedrooms") +
  scale_fill_viridis_c("HDB Bedrooms") +
  coord_sf(crs = st_crs(3414)) +
  annotation_scale(location = "bl") +
  theme_minimal()

Now we’re ready to turn these numbers into estimated residents in each building. We’re basically going to be using a modified version of proportional assignment, where instead of estimating values based on each area’s share of the total area, we’ll be estimating them based on each building’s share of the total bedroom stock in its planning area.

So for this to work, we’ll need to compute the total bedrooms within each planning zone and then calculate each building’s share. We can actually use group_by() to accomplish this. If we follow group_by() with a mutate() call that uses a summary function (like sum()), R will compute that summary within each group. That means that we can just group_by() planning zone and add up the bedrooms in each zone without having to use summarise(). Here’s how:

hdb_buildings_pop_info <- hdb_buildings_pop_info %>%
  group_by(PLN_AREA_N) %>%
  mutate(bedrooms_pl_area = sum(bedrooms)) %>%
  #Then we get rid of the groups for the next calculation
  ungroup() %>%
  mutate(bedroom_prop = bedrooms/bedrooms_pl_area,
          residents = bedroom_prop * population)

hist(hdb_buildings_pop_info$residents)
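
By the way, if the group_by() plus mutate() pattern is new to you, here’s a toy illustration of how it differs from summarise() (made-up values):

tibble(g = c("a", "a", "b", "b"), v = 1:4) %>%
  group_by(g) %>%
  mutate(group_total = sum(v))
# A tibble: 4 × 3
# Groups:   g [2]
  g         v group_total
  <chr> <int>       <int>
1 a         1           3
2 a         2           3
3 b         3           7
4 b         4           7

All four rows are kept, with the group total repeated within each group; summarise() would instead have collapsed the table to one row per group.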

Task 5: Wow! That was a bit of a wild ride. Now, what we just did was a sort of good-enough strategy to get an estimate of something out of a messy dataset, but it’s clearly not ideal. Really, I kinda just went through this to give you a sense of what working through these kinds of messy data actually looks like. I’d like you to take a moment to reflect on what we just did. Why did I make the choices I did? Can you think of some ways that we might improve on our estimates?

Task 6: Now that we have these micro-scale estimates of where about 80% of Singapore’s residents live, let’s try assessing their accessibility to something of interest. Using the osmdata package, download the locations of an urban amenity of interest and estimate the number of HDB residents who live in a building within half a kilometer of at least one of those amenities.
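
To get you started, here’s a minimal sketch of the general pattern, assuming supermarkets as the amenity of interest (the key/value pair, the 500-meter threshold, and the object names are all placeholders for your own choices):

#Query OpenStreetMap for one type of amenity in Singapore:
supermarkets <- opq(bbox = "Singapore") %>%
  add_osm_feature(key = "shop", value = "supermarket") %>%
  osmdata_sf()

#Project both layers so that distances are measured in meters:
shops_stm <- supermarkets$osm_points %>% st_transform(st_crs(3414))
bldgs_stm <- hdb_buildings_pop_info %>% st_transform(st_crs(3414))

#Flag buildings with at least one supermarket within 500m, then sum
#the estimated residents of those buildings:
near <- st_is_within_distance(bldgs_stm, shops_stm, dist = 500)
sum(bldgs_stm$residents[lengths(near) > 0])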

Spatial interpolation using inverse-distance weighting

Now that we have these micro-level population estimates, we can use them to create a very high-resolution picture of residential geography by ethnicity across the city-state. The Singapore Data Portal has data from its 2015 census on the ethnic composition of Singaporean residents at the subzone level, but, because we have estimates of residents at the building level, we can use them to make our estimated ethnic geography of Singapore more precise. To do this, we will use one simple example of spatial interpolation. Spatial interpolation methods use the assumption, quoting Waldo Tobler’s first law of geography, that “everything is related to everything else, but near things are more related than distant things” as a guide for making guesses about the values of variables in locations that we didn’t measure based on their values in locations that we did. In other words, spatial interpolation methods assume that the value of a variable in one location will be more like the values in nearby locations than in more distant ones.

There are dozens of different spatial interpolation techniques, which themselves are just a subset of the broader field of geostatistics, but we’ll stick with a very simple example: inverse distance weighting. Inverse distance weighting is basically just another version of the kinds of weighted means that we’ve been using for proportional assignment. The difference here is that instead of being weighted by the share of a variable, in inverse distance weighting we weight observations based on how far away they are from the location whose value we are trying to estimate.
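
In case a formula helps: the inverse-distance-weighted estimate at an unmeasured location is just a weighted mean of the measured values $z_i$, where each weight shrinks as the distance $d_i$ to that measurement grows (the power $p$ controls how quickly):

$$\hat{z} = \frac{\sum_{i} z_i \, d_i^{-p}}{\sum_{i} d_i^{-p}}$$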

So to use inverse distance weighting to downscale ethnic composition estimates from subzones to buildings, we’ll need to do a few things:

  • First, get ethnic composition data at the subzone level
  • Second, because inverse distance weighting assumes we’re using point data, we need a principled way to assign the ethnic composition measures to a single point within each subzone
  • Third, again because we need to work with point data, we need to construct points identifying the center of each building
  • Fourth, we need to compute the distances between each building and all the subzone centroids
  • Finally, we need to apply inverse distance weighting to estimate the ethnic composition of each building

Now we’ll go through this process step by step:

Preparing subzone-level ethnic composition data

First, we’re going to read in some data that I’ve extracted from Singapore’s 2015 Census of Population and General Household Survey, which contains data on Singapore’s resident population by ethnic group (note that this unfortunately excludes migrant workers, who make up about 30% of the island’s population). Since almost 80% of the resident population lives in HDB flats, the census data should give us a good sense of these housing units’ ethnic composition.

Let’s read in the data and have a look:

subzone_pop <- read.csv("SubzoneTotPop.csv", stringsAsFactors = FALSE)
subzone_chin <- read.csv("SubzoneTotChin.csv", stringsAsFactors = FALSE)
subzone_ind <- read.csv("SubzoneTotInd.csv", stringsAsFactors = FALSE)
subzone_malay <- read.csv("SubzoneTotMalay.csv", stringsAsFactors = FALSE)
subzone_other <- read.csv("SubzoneTotOther.csv", stringsAsFactors = FALSE)

head(subzone_pop)
          Subzone TotPop
1       Admiralty  14410
2  Alexandra Hill  15650
3 Alexandra North   1000
4        Aljunied  41710
5      Anak Bukit  21990
6      Anchorvale  38040

All of these objects are structured the same way, with the name of the subzone and the population for the ethnic group in question (Total, Chinese, Indian, Malay, and Other). That means we can just use left_join() to combine them into a single object:

subzone_census <- subzone_pop %>%
  left_join(subzone_chin) %>%
  left_join(subzone_ind) %>%
  left_join(subzone_malay) %>%
  left_join(subzone_other)
Joining with `by = join_by(Subzone)`
Joining with `by = join_by(Subzone)`
Joining with `by = join_by(Subzone)`
Joining with `by = join_by(Subzone)`
head(subzone_census)
          Subzone TotPop TotChinese TotInd TotMalay TotOther
1       Admiralty  14410      10240   1330     2290      540
2  Alexandra Hill  15650      11330   1700     2300      320
3 Alexandra North   1000        750    140       10      110
4        Aljunied  41710      34510   2680     3030     1490
5      Anak Bukit  21990      20090    840      220      830
6      Anchorvale  38040      29590   3440     3960     1050

Task 7: Use the mutate() function to add columns with the proportion of each ethnic group in each subzone to the subzone_census object. Name these new variables prop_chin, prop_ind, prop_malay, and prop_other. Your result should look like this:

          Subzone TotPop TotChinese TotInd TotMalay TotOther prop_chin
1       Admiralty  14410      10240   1330     2290      540 0.7106176
2  Alexandra Hill  15650      11330   1700     2300      320 0.7239617
3 Alexandra North   1000        750    140       10      110 0.7500000
4        Aljunied  41710      34510   2680     3030     1490 0.8273795
5      Anak Bukit  21990      20090    840      220      830 0.9135971
6      Anchorvale  38040      29590   3440     3960     1050 0.7778654
    prop_ind prop_malay prop_other
1 0.09229702 0.15891742 0.03747398
2 0.10862620 0.14696486 0.02044728
3 0.14000000 0.01000000 0.11000000
4 0.06425318 0.07264445 0.03572285
5 0.03819918 0.01000455 0.03774443
6 0.09043113 0.10410095 0.02760252

Even at the subzone level, we already have some evidence that the population distribution isn’t perfectly even. If it were, we would expect the proportion of each ethnic group to be distributed in a bell curve, but that’s not what we see for all the groups:

subzone_census %>%
  pivot_longer(cols = c("prop_chin", "prop_ind", "prop_malay", "prop_other")) %>%
  ggplot(aes(x=value)) +
  geom_histogram() +
  scale_x_continuous("Proportion of Subzone Population") +
  scale_y_continuous("Number of Subzones") +
  facet_wrap(~name)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 221 rows containing non-finite outside the scale range
(`stat_bin()`).

The Chinese and Indian populations look about like what we would expect, but not the Malay and Other populations. But let’s keep working on using inverse distance weighting to get better estimates.

Our next step will then be to do a left_join() to add the subzone geometries to these data. As is often the case, though, we’ll need to do a bit of text cleaning before we can successfully perform the join:

head(subzones$SUBZONE_N)
[1] "MARINA EAST"             "INSTITUTION HILL"       
[3] "ROBERTSON QUAY"          "JURONG ISLAND AND BUKOM"
[5] "FORT CANNING"            "MARINA EAST (MP)"       
head(subzone_census$Subzone)
[1] "Admiralty"       "Alexandra Hill"  "Alexandra North" "Aljunied"       
[5] "Anak Bukit"      "Anchorvale"     

Okay, this looks like a pretty straightforward case mismatch - let’s start by converting the census data to uppercase:

subzone_census <- subzone_census %>%
  mutate(subzone_upper = toupper(Subzone))

subzone_census$subzone_upper[!subzone_census$subzone_upper %in% subzones$SUBZONE_N]
[1] "LAKESIDE"                "TENGAH"                 
[3] "WESTERN WATER CATCHMENT"

Great! That took care of all but three. Now we just need to try to figure out what’s wrong with those three. Here are all the areas in the subzones object that we haven’t matched:

sort(subzones$SUBZONE_N[!subzones$SUBZONE_N%in%subzone_census$subzone_upper])
 [1] "AIRPORT ROAD"              "BAHAR"                    
 [3] "BAYFRONT SUBZONE"          "BENOI SECTOR"             
 [5] "BRICKLAND"                 "BRICKWORKS"               
 [7] "CHANGI AIRPORT"            "CHANGI BAY"               
 [9] "CHIN BEE"                  "CLEANTECH"                
[11] "CLIFFORD PIER"             "CONEY ISLAND"             
[13] "EAST COAST"                "FOREST HILL"              
[15] "GARDEN"                    "JURONG ISLAND AND BUKOM"  
[17] "LAKESIDE (BUSINESS)"       "LAKESIDE (LEISURE)"       
[19] "LORONG HALUS"              "LORONG HALUS NORTH"       
[21] "MANDAI EAST"               "MANDAI WEST"              
[23] "MARINA EAST"               "MARINA EAST (MP)"         
[25] "MARINA SOUTH"              "MAXWELL"                  
[27] "MURAI"                     "NICOLL"                   
[29] "PARK"                      "PAYA LEBAR WEST"          
[31] "PHILLIP"                   "PIONEER SECTOR"           
[33] "PLAB"                      "PLANTATION"               
[35] "PULAU PUNGGOL BARAT"       "PULAU PUNGGOL TIMOR"      
[37] "PULAU SELETAR"             "PUNGGOL CANAL"            
[39] "SAFTI"                     "SEMAKAU"                  
[41] "SHIPYARD"                  "SIMPANG NORTH"            
[43] "SIMPANG SOUTH"             "SOUTHERN GROUP"           
[45] "STRAITS VIEW"              "SUDONG"                   
[47] "TAMPINES NORTH"            "TANJONG IRAU"             
[49] "TENGAH INDUSTRIAL ESTATE"  "TENGEH"                   
[51] "THE WHARVES"               "TUAS VIEW EXTENSION"      
[53] "WOODLANDS REGIONAL CENTRE" "YIO CHU KANG NORTH"       

Okay, so a few things are immediately clear: LAKESIDE is divided into two subzones in the subzones object, TENGAH is spelled TENGEH there, and there is also a TENGAH INDUSTRIAL ESTATE. In other circumstances we might need to dissolve the two Lakeside subzones, but since we’re just going to be assigning these data to buildings inside those subzones, we don’t need to worry about that. So we can just change those specific values in the subzones object.

Western Water Catchment is a bit trickier. It’s actually a planning area, not a subzone. Singapore’s census doesn’t release data for subzones below a certain population, and the entire Western Water Catchment has only 900 residents. So for that one, we’ll need to reassign the subzone names within that planning area to “WESTERN WATER CATCHMENT” and dissolve them:

subzones$SUBZONE_N[subzones$SUBZONE_N == "TENGEH"] <- "TENGAH"

subzones$SUBZONE_N[subzones$SUBZONE_N == "TENGAH INDUSTRIAL ESTATE"] <- "TENGAH"

#str_detect() returns TRUE if it finds a specific text anywhere in
#a string:
subzones$SUBZONE_N[str_detect(subzones$SUBZONE_N, fixed("LAKESIDE"))] <- "LAKESIDE"

subzones$SUBZONE_N[subzones$PLN_AREA_N=="WESTERN WATER CATCHMENT"] <- 
  "WESTERN WATER CATCHMENT"

subzones_diss <- subzones %>%
  group_by(SUBZONE_N) %>%
  summarise()
although coordinates are longitude/latitude, st_union assumes that they are
planar

Now we just need to conduct the left_join():

subzones_diss <- subzones_diss %>%
  left_join(subzone_census, by = join_by(SUBZONE_N == subzone_upper))

Computing weighted centroids with the centr package

The next step in our procedure is to assign all these data on subzone ethnic compositions to individual points. There are lots of options for how we could do this. One would be to just assign the values to the subzone centroids - the geometric centers of the polygons. The thing is, because we have information on the HDB residential building locations, we can actually do even better by computing the centroid of all the buildings in each subzone, taken as a whole.

But, because we also have information on the number of bedrooms in each building, we can do even better by taking that information into account, computing weighted centroids. Weighted centroids are yet another example of the weighted mean techniques we’ve been using so much. In this case, we’re just computing the average X and Y coordinates, weighted by bedrooms (or whatever variable we want to use).
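
In symbols, if building $i$ sits at coordinates $(x_i, y_i)$ and has weight $w_i$ (bedrooms, in our case), the weighted centroid is simply:

$$\bar{x} = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad \bar{y} = \frac{\sum_i w_i y_i}{\sum_i w_i}$$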

Because all of these operations use physical distances, we want to make sure that the layers we are using are projected into an appropriate local projection. For Singapore, we can use the SVY21 / Singapore Transverse Mercator projection, which is EPSG number 3414. Let’s reproject the subzones and buildings into that projection now:

#Adding some additional processing steps to clean up the polygons; this will
#likely produce some warnings, but you can safely ignore them:
subzone_census_stm <- subzones_diss %>%
  st_zm(drop=TRUE) %>%
  st_cast("MULTIPOLYGON") %>%
  st_transform(st_crs(3414)) %>%
  st_buffer(0)

hdb_buildings_stm <- hdb_buildings_pop_info %>%
  st_zm(drop=TRUE) %>%
  st_cast("MULTIPOLYGON") %>%
  st_transform(st_crs(3414)) %>%
  st_buffer(0)

Now we can get the centroids for the buildings, for which we’ll use the helpfully named st_centroid() function:

buildings_cents <- hdb_buildings_stm %>%
  st_centroid()
Warning: st_centroid assumes attributes are constant over geometries
plot(st_geometry(buildings_cents))

Now we can use st_join() to attach the census data to the buildings:

buildings_cents <- buildings_cents %>%
  st_join(subzone_census_stm)

head(buildings_cents)
Simple feature collection with 6 features and 38 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 18723.77 ymin: 29052.02 xmax: 41907.43 ymax: 47864.69
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 39
  Name  description  timestamp begin  end   
  <chr> <chr>        <dttm>    <dttm> <dttm>
1 kml_1 "<center><t… NA        NA     NA    
2 kml_3 "<center><t… NA        NA     NA    
3 kml_4 "<center><t… NA        NA     NA    
4 kml_5 "<center><t… NA        NA     NA    
5 kml_7 "<center><t… NA        NA     NA    
6 kml_8 "<center><t… NA        NA     NA    
# ℹ 34 more variables: altitudeMode <chr>, tessellate <int>, extrude <int>,
#   visibility <int>, drawOrder <int>, icon <chr>, BLK_NO <chr>, ST_COD <chr>,
#   ENTITYID <chr>, POSTAL_COD <chr>, INC_CRC <chr>, FMEL_UPD_D <chr>,
#   snippet <chr>, PLN_AREA_N <chr>, financial_year <int>,
#   town_or_estate <chr>, population <dbl>, key <chr>, bedrooms <dbl>,
#   bedrooms_pl_area <dbl>, bedroom_prop <dbl>, residents <dbl>,
#   geom <POINT [m]>, SUBZONE_N <chr>, Subzone <chr>, TotPop <int>, …

Now we’re ready to compute the centroid of the HDB resident locations in each subzone. We’ll use the centr package’s mean_center() function, which can compute centroids grouped according to a variable in the input layer and weighted using another variable. Here we go:

subzone_res_cents <- buildings_cents %>%
  mean_center(group = "SUBZONE_N",
              weight = "residents")

plot(st_geometry(subzone_res_cents))

Task 8: Using the projected subzones data, compute the population-weighted centroid for Singapore’s entire resident population (HINT: you’ll need to filter out the subzones that have no values for TotPop). Make sure to plot the point on top of the original subzones, so you can see where it is. The result should look something like this:

Warning: st_centroid assumes attributes are constant over geometries

Just so you can see the difference, here is the actual centroid of all the subzones, ignoring population:

subzone_cents_2 <- subzone_census_stm %>%
  summarise() %>%
  st_centroid()

ggplot(data = subzone_census_stm) +
  geom_sf(data = subzone_census_stm) +
  geom_sf(data = subzone_cents_2, size = 6)

Interestingly, the difference is very slight, suggesting that the HDB population is actually pretty well distributed around the island.

Computing inverse-distance-weighted interpolations with the phylin package

There are at least three R packages that have functions for computing inverse-distance weighted interpolations, but the simplest for our purposes is in the phylin package, which is actually designed as a set of tools for ecological genetics research.

phylin’s function for computing inverse-distance-weighted interpolations is idw(). Its syntax is a bit confusing, because it was built for ecologists, and some of that field’s conventions are a bit different from those in urban geography. First, you have to specify the values that you want to interpolate. In our case, that would be the proportions of each ethnic group from the subzone census data (so we’ll need to run the function once for each group). Then, you need to specify the coordinates (coords) at which each value is located, which would be the X and Y coordinates of the population-weighted centroids. Next, we have to specify the coordinates to which we want to interpolate, which phylin calls a grid, because usually we are interpolating measured values to create a raster (or grid) map. In our case, though, we are interpolating to the building centroids, so that’s our “grid”. Finally, we have to specify p, the power to which we will raise the distance in the interpolation. The higher p is, the lower the weight we attach to more distant points. While there is no optimal way to set this value, let’s use 1.5.

Okay, let’s try this for the Chinese proportion of the resident population:

#First, we need to add the subzone data back to the weighted centroids:
subzone_res_cents <- subzone_res_cents %>%
  st_join(subzone_census_stm)

#Now we can perform the interpolation:
prop_chin_interp <- idw(values = subzone_res_cents$prop_chin,
                        coords = st_coordinates(subzone_res_cents),
                        grid = st_coordinates(buildings_cents),
                        p = 1.5)

head(prop_chin_interp)
          Z
1 0.6756703
2 0.6876953
3 0.7176592
4 0.7826039
5 0.6960012
6 0.7768884

So what you’re seeing here is a dataframe with an estimate of the interpolated value for each building. We can now just add that value to the buildings object (be sure to just extract Z, the column with the interpolations):

buildings_cents <- buildings_cents %>%
  mutate(prop_chin_interp = prop_chin_interp$Z)

Task 9: Compute the inverse-distance-weighted interpolation for prop_ind, prop_malay, and prop_other and add them to the buildings_cents object. Name them prop_ind_interp, prop_malay_interp, and prop_other_interp, respectively, so they match the naming of prop_chin_interp (the code below uses these names).

Finally, we can use these proportions to estimate the residents of each ethnic group by building and make some maps:

buildings_cents <- buildings_cents %>%
  mutate(res_chinese = residents * prop_chin_interp,
         res_indian = residents * prop_ind_interp,
         res_malay = residents * prop_malay_interp,
         res_other = residents * prop_other_interp)

temp <- buildings_cents %>%
  pivot_longer(cols = c("res_chinese", "res_indian", "res_malay", "res_other")) %>%
  mutate(group = name %>% str_remove(fixed("res_")) %>% toupper()) %>%
  #Converting to percent of total group for visualization:
  group_by(group) %>%
  mutate(percentage = value/sum(value) * 100)
  

ggplot(data = subzone_census_stm) +
  geom_sf(data = temp, aes(color = percentage), alpha = 0.5) +
  scale_color_gradient("Estimated Percentage",
                       trans = scales::log_trans(base = 10),
                       low = "white", high = "darkblue") +
  annotation_scale(location = "tr") +
  ggtitle("Estimated Residents in HDB Buildings in Singapore",
          subtitle = "By Percentage of Ethnic Group") +
  facet_wrap(~group, ncol = 2) +
  theme_minimal()

Wow - after all that, it really does look like the HDB’s program has resulted in a remarkably even ethnic geography in its buildings!

Task 10: Repeat your amenities analysis from Task 6, this time comparing estimated shares of HDB residents with access across the four ethnic groups. What do you find?

Congratulations! You’re (finally) done!