Take Home Assignment 1 for IS415 - Geospatial Analytics and Applications - G10

Overview

This exercise examines national development and public service issues using data sets from various government agencies in Singapore. The assignment is meant to demonstrate basic geospatial data wrangling, geospatial analytics and geovisualization in R. For the purpose of this assignment, we will be applying course material to achieve 3 specific objectives.

For the purpose of this study, the passenger volume by bus stop data set from the Land Transport Authority (LTA) has been provided. This data set was extracted using the dynamic API provided at LTA DataMall. The remaining data sets come from the government open data portal.

Part 0: Setup

For the purpose of this assignment, we will be using packages taught in the course. In particular, we will be utilizing rgdal, sf, spdep, tmap and tidyverse.

Loading in required packages

packages = c('rgdal', 'sf', 'spdep', 'tmap', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Part 1: Geospatial Data Wrangling & Linear Regression Model

Step 1: Defining the scope of our task

Loading in the passenger volume dataset

sgpassengervolume <- read_csv("data/aspatial/passenger volume by busstop.csv")
## Parsed with column specification:
## cols(
##   YEAR_MONTH = col_character(),
##   DAY_TYPE = col_character(),
##   TIME_PER_HOUR = col_double(),
##   PT_TYPE = col_character(),
##   PT_CODE = col_character(),
##   TOTAL_TAP_IN_VOLUME = col_double(),
##   TOTAL_TAP_OUT_VOLUME = col_double()
## )

Analysis of dataset

Based on the passenger volume by bus stop data set provided, we are examining passenger flows for the month of January 2020. As required by the assignment, we need to aggregate the data at the planning subzone level in order to fit a simple linear model relating it to the residential population in the same zones.

Aggregating dataset by Bus Stop Number

For our purpose, we do not care about time-of-day or day-type differences in the data set; we only care about volumes at each bus stop location. We therefore need a new data frame summed by bus stop number.

Totalbusstopvol <- sgpassengervolume %>% 
  group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
  summarise(Tap_In = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_out = sum(TOTAL_TAP_OUT_VOLUME))

Step 2: Loading in Geospatial Data for bus locations

We will need a reference data set that lets us map the bus stop locations to the planning subzone level. For that purpose we will be utilizing the Bus Stop Location shapefile from the LTA DataMall. Source: https://www.mytransport.sg/content/mytransport/home/dataMall/static-data.html#Whole%20Island

Loading in Bus Stop Location Geospatial Data

We’ll load it in as a simple feature data frame. Since we know it comes from a Singapore source, the data is already projected in SVY21, so no actual reprojection occurs; we call st_transform() to make sure the CRS is correctly set to EPSG:3414, the code corresponding to Singapore.

busstoplocations_sf <- readOGR(dsn = "data/geospatial/BusStopLocation_Jan2020", layer = "BusStop")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\data\GSA\Take_Home_EX01\data\geospatial\BusStopLocation_Jan2020", layer: "BusStop"
## with 5040 features
## It has 3 fields
busstoplocations_sf <- st_as_sf(busstoplocations_sf)
busstoplocations_sf <- st_transform(busstoplocations_sf, crs= 3414)

#st_crs(busstoplocations_sf)
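As a side note, the same load-and-project step can be done in one pass with sf alone, without going through rgdal (a minimal sketch, assuming the shapefile carries a valid .prj file):

#Alternative one-step load using sf only
busstoplocations_sf <- st_read(dsn = "data/geospatial/BusStopLocation_Jan2020",
                               layer = "BusStop") %>%
  st_transform(crs = 3414)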

Replace NA values in LOC_DESC and BUS_ROOF_N with Unknown

busstoplocations_sf$LOC_DESC[is.na(busstoplocations_sf$LOC_DESC)] <- "UNKNOWN"
busstoplocations_sf$BUS_ROOF_N[is.na(busstoplocations_sf$BUS_ROOF_N)] <- "UNKNOWN"

Joining Commuter’s volume to Bus Stop Location dataset

Next we need to attach our passenger volume data to our simple feature data frame.

busstoplocations_sf <- left_join(busstoplocations_sf,Totalbusstopvol) #we already renamed the variable to match so there is no need to specify the joining factor
## Joining, by = "BUS_STOP_N"

Replace NA with 0

busstoplocations_sf$Tap_In[is.na(busstoplocations_sf$Tap_In)] <- 0
busstoplocations_sf$Tap_out[is.na(busstoplocations_sf$Tap_out)] <- 0
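The same NA handling can also be expressed with tidyr’s replace_na(), loaded as part of the tidyverse (a minimal sketch equivalent to the two lines above):

#Alternative NA replacement using tidyr
busstoplocations_sf <- busstoplocations_sf %>%
  replace_na(list(Tap_In = 0, Tap_out = 0))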

Step 3: Loading in Planning Area Subzone Geospatial Data and Population Data

In order to perform our regression, we need population information at the planning subzone level. We will utilize the “Singapore Residents by Subzone and Type of Dwelling, 2011 - 2019” data set and use only the 2019 figures to get the latest data.

Source: https://data.gov.sg/dataset/Singapore-residents-by-subzone-and-type-of-dwelling-2011-2019

To turn this into geospatial data, we need to match it against the corresponding subzone boundaries. According to the website, this information is still mapped to the Master Plan 2014 subzones, so we will use those boundaries even though an updated 2019 Master Plan exists. For this exercise, we will use the shapefile rather than the KML files also provided.

Source: https://data.gov.sg/dataset/master-plan-2014-subzone-boundary-web

Loading in Population Data

For the purpose of our analysis, we will use only the 2019 data.

NOTE: For the purpose of our analysis, in order to reduce errors in our linear regression model later on, we will remove the youngest age group. This is because children below the age of 7 and under a certain height criterion do not tap in or out on the bus, so including this data would add noise and increase the errors in our linear regression model.

Source: https://www.sbstransit.com.sg/fares-and-concessions

PopData <- read_csv("data/aspatial/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv") %>%
  filter(year==2019) %>% #filter only by 2019 data
  filter(age_group != "0_to_4") #removing age group 0-4, the closest available band to the under-7 group that travels for free
## Parsed with column specification:
## cols(
##   planning_area = col_character(),
##   subzone = col_character(),
##   age_group = col_character(),
##   sex = col_character(),
##   type_of_dwelling = col_character(),
##   resident_count = col_double(),
##   year = col_double()
## )

Aggregating Resident Population by subzone

ResPopbySubzone <- PopData %>% 
  group_by(subzone) %>% #group records by subzone name
  summarise(resident_count= sum(resident_count))

ResPopbySubzone <- mutate_at(ResPopbySubzone, .vars = "subzone", .funs=toupper) #convert names to upper case, be sure to only do so for names.
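Note that mutate_at() has since been superseded in newer dplyr releases; the same conversion can be written with across() (a minimal sketch, assuming dplyr 1.0 or later):

#Equivalent conversion using across()
ResPopbySubzone <- ResPopbySubzone %>%
  mutate(across(subzone, toupper))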

Loading in Planning Area Subzone geospatial data

NOTE: Always check the CRS. In this case st_transform() does not change the polygons because the data set is already projected in SVY21, but it ensures our data is correctly formatted and labelled with EPSG:3414.

PlanningSubzone_sf <- readOGR(dsn = "data/geospatial/MP14_SUBZONE_WEB", layer = "MP14_SUBZONE_WEB_PL")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\data\GSA\Take_Home_EX01\data\geospatial\MP14_SUBZONE_WEB", layer: "MP14_SUBZONE_WEB_PL"
## with 323 features
## It has 15 fields
PlanningSubzone_sf <- st_as_sf(PlanningSubzone_sf)
PlanningSubzone_sf <- st_transform(PlanningSubzone_sf, crs= 3414)

#st_crs(PlanningSubzone_sf)

Joining residential data to the simple feature data frame

PlanningSubzone_sf <- left_join(PlanningSubzone_sf , ResPopbySubzone, by = c("SUBZONE_N" = "subzone"))

Quick mapping to show distribution of resident count in Singapore

tm_shape(PlanningSubzone_sf)+
  tm_fill(col = "resident_count",
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Resident Population Count") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Area Sub-zone boundary MP2014 from Urban Redevelopment Authorithy (URA)\n and Population data from Department of Statistics (DOS)", 
             position = c("left", "bottom"))

The map seems to align with the population distributions we have seen before, so we will accept that our data has been loaded in properly.

Step 4: Aggregation and filtering of relevant data

Aggregating Bus Stop Level data to Subzone Level Data

NOTE: At this point, we will need to consider what to do with subzones that have no Tap-in and Tap-out data or no residents.

For the purpose of our analysis, we will first remove zones where Tap-in, Tap-out and resident count are all equal to 0. These areas are unreachable by the general population, are likely irrelevant for our study, and may cause problems during the geospatial autocorrelation calculations later on.

For zones that meet one of these conditions but not all, we will keep the areas as they are still relevant to our scope of study. We will replace any missing values with 0.

However, it should be noted that, based on the data observed, there are areas with 0 residents but a high volume of Tap-ins and Tap-outs, which may be due to construction projects. The best example of this would be the Woodlands Regional Centre, which has disproportionately large bus volumes but no residential population because it is still under construction.

MasterSubzone_sf <- st_join(PlanningSubzone_sf,busstoplocations_sf, join=st_intersects) #using st_intersects to join bus stops that are either within or touching the polygons.

Replace NA with 0 and Unknown

MasterSubzone_sf$Tap_In[is.na(MasterSubzone_sf$Tap_In)] <- 0
MasterSubzone_sf$Tap_out[is.na(MasterSubzone_sf$Tap_out)] <- 0
MasterSubzone_sf$BUS_ROOF_N[is.na(MasterSubzone_sf$BUS_ROOF_N)] <- "UNKNOWN"
MasterSubzone_sf$BUS_STOP_N[is.na(MasterSubzone_sf$BUS_STOP_N)] <- "UNKNOWN"
MasterSubzone_sf$LOC_DESC[is.na(MasterSubzone_sf$LOC_DESC)] <- "UNKNOWN"

Aggregating Tap-in and Tap-out values to subzone level and keeping only the relevant features

SummarizedMasterSubzone_sf <- MasterSubzone_sf %>%
  filter(Tap_In!= 0 | Tap_out != 0 | resident_count != 0) %>% # we'll now  filter out any subzones with all values zero
  group_by(Subzone = SUBZONE_N) %>%
  summarise(Residential = as.numeric(first(resident_count)), Tap_in = sum(Tap_In), Tap_out = sum(Tap_out))
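As a quick sanity check on the point raised in Step 4 about zones with commuter volume but no residents, we can list such subzones (a minimal sketch; column names as created in the chunk above):

#Subzones with Tap-in volume but no resident population
SummarizedMasterSubzone_sf %>%
  st_drop_geometry() %>% #drop geometry for a plain tabular view
  filter(is.na(Residential) | Residential == 0, Tap_in > 0) %>%
  arrange(desc(Tap_in)) %>%
  head()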

Step 5: Plotting our summarized data

Now we need to inspect what polygons have remained and look at the distribution of our Tap-in and Tap-out data.

Plotting two quick thematic maps

CountTapInMap <- tm_shape(SummarizedMasterSubzone_sf)+
  tm_fill(col = c("Tap_in"),
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


CountTapOutMap <- tm_shape(SummarizedMasterSubzone_sf)+
  tm_fill(col = c("Tap_out"),
              n = 5,
              style="jenks",
              palette = "Reds",
          title = "Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


tmap_arrange(CountTapInMap, CountTapOutMap, asp=1, ncol=2)

We can see that during the aggregation process some of the islands, such as Jurong Island and Pulau Ubin, have disappeared. This is acceptable because these areas are inaccessible by bus, which allows us to concentrate our study on the relevant areas within mainland Singapore.

Step 6: Performing Linear Regression Model

For this exercise, we are interested in the relationship between the Tap-in and Tap-out volumes and the residential population, so we will use the lm() function to fit the linear regression. For the purpose of our study, we will take residential population as our X and Tap-in or Tap-out volume as our Y.

This is based on the fact that the residential population data was gathered in 2019, before the Tap-in and Tap-out data of January 2020, although we may be committing the post hoc ergo propter hoc fallacy.

Performing Linear Regression for Tap-in values

TapInLM.model <- lm(Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapInLM.model)
## 
## Call:
## lm(formula = Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -799481 -127198  -67424   31303 1954431 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.207e+05  2.075e+04   5.816 1.52e-08 ***
## Residential 2.038e+01  9.715e-01  20.977  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 294300 on 305 degrees of freedom
## Multiple R-squared:  0.5906, Adjusted R-squared:  0.5893 
## F-statistic:   440 on 1 and 305 DF,  p-value: < 2.2e-16

Performing Linear Regression for Tap-out values

TapOutLM.model <- lm(Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapOutLM.model)
## 
## Call:
## lm(formula = Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -758965 -120883  -57479   32583 1648011 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.212e+05  2.009e+04   6.036 4.59e-09 ***
## Residential 2.028e+01  9.406e-01  21.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 285000 on 305 degrees of freedom
## Multiple R-squared:  0.6039, Adjusted R-squared:  0.6026 
## F-statistic: 464.9 on 1 and 305 DF,  p-value: < 2.2e-16

We can see that no observations were deleted due to missing values, assuring us that the data has been cleaned properly.
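This can also be verified programmatically by comparing the number of observations used by each model against the number of rows in our data frame (a minimal sketch; both comparisons should return TRUE):

#Verify no observations were dropped by lm()
nobs(TapInLM.model) == nrow(SummarizedMasterSubzone_sf)
nobs(TapOutLM.model) == nrow(SummarizedMasterSubzone_sf)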

Step 7: Interpretation of Linear Regression Model for commuters’ flow to Residential Population by URA 2014 Master Plan Planning Area Subzones

Based on the results shown, both Tap-in and Tap-out volumes are positively correlated with residential population. The results are significant at an alpha value of 0.001, which allows us to reject the null hypothesis at a 99.9% confidence level. The R-squared values of 0.5906 and 0.6039 indicate a fair amount of strength between the model and the response variable.

Plotting Tap-in volumes against Residential Population

TapInPlot <- ggplot(data= TapInLM.model, aes(x = Residential, y= Tap_in)) +
  geom_point(color='blue',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapInLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-in Volume Count")
  

TapInPlot

Plotting Tap-out volumes against Residential Population

TapOutPlot <- ggplot(data= TapOutLM.model, aes(x = Residential, y= Tap_out)) +
  geom_point(color='red',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapOutLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-out Volume Count")

TapOutPlot

Plotting Residuals

par(mfrow=c(1,2))
plot(TapInLM.model$fitted.values, TapInLM.model$residuals, pch = 8, col = "blue", xlab = "Predicted Tap-in Values", ylab = "Residual Values (Tap-in)")
plot(TapOutLM.model$fitted.values, TapOutLM.model$residuals, pch = 8, col = "red",  xlab = "Predicted Tap-out Values", ylab = "Residual Values (Tap-out)")

When observing the residuals, the concentration of data points at low residential counts makes the plots hard to interpret. Nevertheless, we can see some degree of heteroscedasticity, which could suggest a missing variable or a variable that requires transformation, although it is unclear at this time. There also appear to be some outlier values in the distribution, which could be due to the point raised earlier about places such as the Woodlands Regional Centre.
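One quick way to probe the transformation hypothesis is to refit the Tap-in model on log scales and compare the fit (a minimal diagnostic sketch, not part of the main analysis; log1p() is used because some values are zero):

#Log-log refit as a heteroscedasticity diagnostic
TapInLogLM.model <- lm(log1p(Tap_in) ~ log1p(Residential), data = SummarizedMasterSubzone_sf)
summary(TapInLogLM.model)$r.squared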

Part 2: Spatial Autocorrelation Analysis on residuals

In the next part, we will append the residuals back to our simple feature data frame and check whether the randomisation assumption holds. This helps us verify that the relationship established in our linear model holds, or whether there is a geospatial problem in the data-generating process.

Step 1: Adding residuals and predicted values to Simple Feature Dataframe

It is important to ensure that the indices from our linear model are the same as those of our data frame. Usually the safer way would be to create an explicit index for both; for simplicity, however, we will just check manually that the rows from our linear model and our simple feature data frame align.
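A lightweight guard against misalignment is to assert that the row counts match before appending (a minimal sketch; stopifnot() raises an error if any condition is FALSE):

#Guard against index misalignment before appending
stopifnot(nrow(SummarizedMasterSubzone_sf) == length(residuals(TapInLM.model)),
          nrow(SummarizedMasterSubzone_sf) == length(residuals(TapOutLM.model)))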

Appending using tibble functions

ResidualFitted_sf <- SummarizedMasterSubzone_sf

ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapIn = residuals(TapInLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapOut = residuals(TapOutLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapIn = predict(TapInLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapOut = predict(TapOutLM.model), .before = "geometry")

Appending absolute values for residual Tap-in and Tap-out

ResidualFitted_sf <- add_column(ResidualFitted_sf, ABSResidualTapIn = abs(residuals(TapInLM.model)), .before = "geometry")
ResidualFitted_sf <- add_column(ResidualFitted_sf, ABSResidualTapOut = abs(residuals(TapOutLM.model)), .before = "geometry")

Step 2: Mapping out our Residual Values

This provides us a quick look at the spatial distribution of the residual values.

Mapping using tmap

ResidualTapInmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ResidualTapIn", 
              n = 5,
              style="jenks",
              palette = "RdYlGn",
          title = "Residual Tap-in") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

ResidualTapOutmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ResidualTapOut", 
              n = 5,
              style="jenks",
              palette = "RdYlBu",
          title = "Residual Tap-out") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

tmap_arrange(ResidualTapInmap, ResidualTapOutmap, asp=1, ncol=2)

Step 3: Forming Weight Matrices using Contiguity-based Neighbours

Creating (QUEEN) contiguity based neighbours

wm_q <- poly2nb(ResidualFitted_sf, queen=TRUE)
summary(wm_q)
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.975618 
## Average number of links: 6.065147 
## Link number distribution:
## 
##  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  6 10 28 78 78 49 37 14  2  2  1  1  1 
## 6 least connected regions:
## 2 42 133 172 230 277 with 2 links
## 1 most connected region:
## 40 with 17 links

Creating (ROOK) contiguity based neighbours

wm_r <- poly2nb(ResidualFitted_sf, queen=FALSE)
summary(wm_r)
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1614 
## Percentage nonzero weights: 1.712485 
## Average number of links: 5.257329 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 13 14 
##  1  5 23 70 92 62 29 17  4  2  1  1 
## 1 least connected region:
## 172 with 1 link
## 1 most connected region:
## 40 with 14 links

Visualizing (QUEEN) and (ROOK) contiguity based neighbours

centroids <- sf::st_centroid(ResidualFitted_sf$geometry) #finding centroids for our polygons

par(mfrow=c(1,2))
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Queen Contiguity", asp=1)
plot(wm_q, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col= "red" )
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Rook Contiguity", asp =1)
plot(wm_r, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col = "red")

Analysis of contiguity methods

By looking at the plots and summaries of both the Rook and Queen contiguity methods, we can see a large difference between the two methodologies: a total difference of 248 nonzero links (1862 versus 1614). Looking at the map, this is particularly evident in the north and north-east. With reference to bus routes in Singapore, routes often move along the edges and corners of planning area subzones, thus the Queen method of contiguity is more applicable for our study.

Source: https://wiki.smu.edu.sg/1415T2is415/File:IS415_2014-15_Term2_Assign1_ktchan.2011_Web_map_overview.png

Note: Consider plotting the bus routes on subzones
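To quantify the difference between the two neighbour structures, we can count the links present under Queen contiguity but absent under Rook contiguity, i.e. the corner-only contacts (a minimal sketch using the nb objects computed above):

#Corner-only neighbour links (Queen minus Rook)
queen_only <- mapply(setdiff, wm_q, wm_r, SIMPLIFY = FALSE)
sum(lengths(queen_only)) #directed links; should equal 1862 - 1614 = 248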

Step 4: Forming Weight Matrices using Fixed Distance-based Neighbours

Double check dataset

st_crs(ResidualFitted_sf)
## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCS["SVY21 / Singapore TM",
##     GEOGCS["SVY21",
##         DATUM["SVY21",
##             SPHEROID["WGS 84",6378137,298.257223563,
##                 AUTHORITY["EPSG","7030"]],
##             AUTHORITY["EPSG","6757"]],
##         PRIMEM["Greenwich",0,
##             AUTHORITY["EPSG","8901"]],
##         UNIT["degree",0.0174532925199433,
##             AUTHORITY["EPSG","9122"]],
##         AUTHORITY["EPSG","4757"]],
##     PROJECTION["Transverse_Mercator"],
##     PARAMETER["latitude_of_origin",1.366666666666667],
##     PARAMETER["central_meridian",103.8333333333333],
##     PARAMETER["scale_factor",1],
##     PARAMETER["false_easting",28001.642],
##     PARAMETER["false_northing",38744.572],
##     UNIT["metre",1,
##         AUTHORITY["EPSG","9001"]],
##     AUTHORITY["EPSG","3414"]]

Since we know the EPSG code is 3414, our coordinates are in metres (SVY21 is a projected coordinate system, not latitude and longitude), so our distances will also be in metres. It should therefore be no surprise if we see large values compared to data sets measured in degrees, and we should set longlat = FALSE in functions that compute distances from these coordinates.

Determining the appropriate cut-off distance.

coords <- st_coordinates(centroids) #taken from the previous Step 3.

k1 <- knn2nb(knearneigh(coords))

k1dists <- unlist(nbdists(k1, coords, longlat = FALSE))
summary(k1dists)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   182.5   616.3   887.5   937.2  1169.8  5404.1

Unfortunately, it would seem that the nearest-neighbour distances vary wildly between zones.

This may be because the centroids of large polygons, such as those in the Western Water Catchment, Central Water Catchment and Changi Airport, sit much farther from their neighbours than those of small subzones such as Telok Blangah Rise. Thus a fixed distance weight matrix would not be very useful for us.

Step 5: Computing adaptive distance weight matrix

For the purpose of our analysis, we’ll be using a set of 6 nearest neighbors as it is the average number of links indicated during our Queen Contiguity based Neighbors analysis.

Setting an arbitrary number of neighbours

wm_knn6 <- knn2nb(knearneigh(coords, k=6, longlat = FALSE), row.names = as.character(ResidualFitted_sf$Subzone))
wm_knn6
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1842 
## Percentage nonzero weights: 1.954397 
## Average number of links: 6 
## Non-symmetric neighbours list

Step 6: Deciding on a weight matrix

While both the Adaptive Distance Weight Matrix and Rook contiguity-based neighbours seem promising, it is more appropriate to utilize the Queen contiguity-based neighbours as our choice, for the following reasons:

  • The sizes of the polygons vary greatly, causing issues when using fixed distance-based weight matrices

  • The number of neighbours per subzone ranges from 2 to 17 under Queen contiguity, so setting an arbitrary number of neighbours will result in some subzones over-reaching into distant zones and, more importantly, omitting genuine neighbours needed for the autocorrelation analysis.

  • Since we are observing commuter volumes, there is no reason to believe that bus routes only cross boundaries longer than a point; in fact some bus routes appear to pass through the corners of subzones. Residents may also travel out of their subzone to take a bus in a neighbouring subzone, so there is no reason to prefer Rook contiguity unless there is a computational constraint.

Thus for the next phase of analysis, we will only be utilizing the Queen Contiguity-based neighbors in our spatial auto-correlation analysis.

Step 7: Weights based on the Inverse Distance Method using Queen Contiguity-based Neighbours

Deriving a spatial weight matrix based on the inverse distance method using the Queen contiguity neighbours.

dist <- nbdists(wm_q, coords, longlat = FALSE)
ids <- lapply(dist, function(x) 1/(x))
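If we wanted to use these inverse-distance weights directly, they could be supplied to nb2listw() through its glist argument (a minimal sketch for reference; the tests below use the row-standardised matrix instead):

#Binary style with general (inverse-distance) weights
rswm_ids <- nb2listw(wm_q, glist = ids, style = "B", zero.policy = TRUE)
summary(unlist(rswm_ids$weights))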

Row-standardised weights matrix

rswm_q <- nb2listw(wm_q, style="W", zero.policy = TRUE)
rswm_q
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.975618 
## Average number of links: 6.065147 
## 
## Weights style: W 
## Weights constants summary:
##     n    nn  S0       S1       S2
## W 307 94249 307 106.4054 1261.803

Verifying weights have been appropriately applied.

wm_q[[1]]
## [1] 217 218 220 221 222 264
rswm_q$weights[1]
## [[1]]
## [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

Looks like they match up.

Step 8: Performing Moran’s I tests on the Tap-in and Tap-out residuals

Performing Moran’s I test for Tap-in Residual

moran.test(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ResidualTapIn  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = -1.0545, p-value = 0.8542
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.037755733      -0.003267974       0.001069537

Performing Moran’s I test for Tap-out Residual

moran.test(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ResidualTapOut  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = -0.50383, p-value = 0.6928
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.019809059      -0.003267974       0.001077864

Interpretation of Moran’s I test

Based on our results, the p-values from both tests are relatively high, so we are unable to reject the null hypothesis that the residuals are randomly distributed, even at a 90% confidence level. This supports the conclusion that the relationships seen in our linear regression models hold.

Since we cannot reject the null hypothesis, interpreting the magnitude of the Moran’s I coefficient is not meaningful in this case.

Step 9: Robustness Testing

Testing on absolute values of the residuals

moran.test(ResidualFitted_sf$ABSResidualTapIn, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ABSResidualTapIn  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = 0.4131, p-value = 0.3398
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.010105384      -0.003267974       0.001048047
moran.test(ResidualFitted_sf$ABSResidualTapOut, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ABSResidualTapOut  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = 1.8125, p-value = 0.03495
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.055795360      -0.003267974       0.001061855

Using Monte Carlo Moran’s I

For robustness testing, we use Monte Carlo to check if our results hold.

Computing Monte-Carlo Moran’s I for Residual Tap-in

set.seed(1234)
MCMoranTapIn= moran.mc(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapIn
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ResidualFitted_sf$ResidualTapIn 
## weights: rswm_q  
## number of simulations + 1: 1000 
## 
## statistic = -0.037756, observed rank = 157, p-value = 0.843
## alternative hypothesis: greater

Computing Monte-Carlo Moran’s I for Residual Tap-out

set.seed(1234)
MCMoranTapOut= moran.mc(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapOut
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ResidualFitted_sf$ResidualTapOut 
## weights: rswm_q  
## number of simulations + 1: 1000 
## 
## statistic = -0.019809, observed rank = 309, p-value = 0.691
## alternative hypothesis: greater

Step 10: Interpretation of robustness testing

After testing against both the absolute values and the Monte-Carlo Moran’s I, we can say our results are promising: all tests fail to reject the null hypothesis, except for the Moran’s I test on the absolute Tap-out residuals.
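To visualise how the observed statistics sit within the simulated distributions, we can plot histograms of the Monte-Carlo results (a minimal sketch using base graphics; moran.mc() stores the simulated values in $res):

#Histograms of simulated Moran's I with the observed statistic marked
par(mfrow=c(1,2))
hist(MCMoranTapIn$res, breaks=20, main="MC Moran's I (Tap-in residuals)", xlab="Simulated Moran's I")
abline(v=MCMoranTapIn$statistic, col="blue", lwd=2)
hist(MCMoranTapOut$res, breaks=20, main="MC Moran's I (Tap-out residuals)", xlab="Simulated Moran's I")
abline(v=MCMoranTapOut$statistic, col="red", lwd=2)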

Plotting Absolute Values for Tap-out Residual against predicted values

plot(TapOutLM.model$fitted.values, abs(TapOutLM.model$residuals), pch = 8, col = "red", xlab = "Predicted Tap-out Values", ylab = "Absolute Residual (Tap-out)")

There does not appear to be any distinct pattern observed.

Mapping absolute values of Tap-out Residual

AbsoluteResidualTapOutmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ABSResidualTapOut", 
              n = 5,
              style="jenks",
              palette = "RdYlBu",
          title = "Absolute Residual Tap-out Values") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

AbsoluteResidualTapOutmap

The significant Moran’s I statistic and low p-value derived here are most likely due to subzones with similarly low absolute residual values sitting near each other towards the centre of the map.

Conclusion of analysis

Overall, we can be confident in our result that residential population in the URA 2014 planning area subzones is positively correlated with commuter bus volumes. This is supported by the low p-values of the regression models and by the randomisation assumption on the residual values holding in our geospatial autocorrelation tests.

Part 3: Localized Geospatial Statistical Analysis

Step 0: Converting to SpatialPolygonsDataFrame

For the purpose of this analysis, we will convert the simple feature data frame back into a SpatialPolygonsDataFrame.

Utilizing the as_Spatial() function from the sf package

SG_Commuter_vol_df <- as_Spatial(ResidualFitted_sf)

Step 1: Computing local Moran’s I for Cluster and Outlier Analysis

Computing Local Moran’s I for Tap-in volumes

subzone_labels <- order(SG_Commuter_vol_df$Subzone)
localMITapIn <- localmoran(SG_Commuter_vol_df$Tap_in, rswm_q)

#printCoefmat(data.frame(localMITapIn[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)

Computing Local Moran’s I for Tap-out volumes

localMITapOut <- localmoran(SG_Commuter_vol_df$Tap_out, rswm_q)

#printCoefmat(data.frame(localMITapOut[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)

Appending local Moran’s I values and p-values to the SpatialPolygonsDataFrame

SG_Commuter_vol_df.localMITapIn <- cbind(SG_Commuter_vol_df,localMITapIn)
SG_Commuter_vol_df.localMITapOut <- cbind(SG_Commuter_vol_df,localMITapOut)

Step 2: Mapping Local Moran’s I

Visualizing local Moran’s I for Tap-in values

TapInlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "Tap-in Local MI Statistics") +
  tm_borders(alpha = 0.5)

TapInPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "Pr.z...0.", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Blues", 
          title = "Tap-in Local MI p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapInlocalMI.map, TapInPvalue.map, asp=1, ncol=2)
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Visualizing local Moran’s I for Tap-out values

TapOutlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "Tap-out Local MI Statistics") +
  tm_borders(alpha = 0.5)

TapOutPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "Pr.z...0.", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Reds", 
          title = "Tap-out Local MI p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapOutlocalMI.map, TapOutPvalue.map, asp=1, ncol=2)
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Interpretation of Local Moran Statistics

Based on the map results, we can see several clusters for both Tap-in and Tap-out, specifically around the Tampines East, Tampines West and Bedok North subzones. There also appear to be several outliers; however, their p-values are high (not statistically significant), so we will not take them into consideration.
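We can also tally how many subzones are individually significant at the 5% level (a minimal sketch; Pr.z...0. is the p-value column name produced by the cbind above, as used in the p-value maps):

#Count of subzones with significant local Moran's I
sum(SG_Commuter_vol_df.localMITapIn$Pr.z...0. < 0.05)
sum(SG_Commuter_vol_df.localMITapOut$Pr.z...0. < 0.05)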

Step 3: Creating LISA Cluster Map

Creating Moran Scatter Plot with standardized Tap-in Values

SG_Commuter_vol_df.localMITapIn$Z.Tap_In <- scale(SG_Commuter_vol_df.localMITapIn$Tap_in) %>% as.vector 

TapInLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapIn$Z.Tap_In, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapIn$Subzone), xlab="z Tap-in Volumes", ylab="Spatially Lag z Tap-in Volumes")

Creating Moran Scatter Plot with standardized Tap-out Values

SG_Commuter_vol_df.localMITapOut$Z.Tap_out <-scale(SG_Commuter_vol_df.localMITapOut$Tap_out) %>% as.vector 

TapOutLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapOut$Z.Tap_out, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapOut$Subzone), xlab="z Tap-out Volumes", ylab="Spatially Lag z Tap-out Volumes")

Build Quadrants for Local MI Tap-in values

TapInquadrant <- vector(mode="numeric",length=nrow(localMITapIn))
TapInDV <- SG_Commuter_vol_df$Tap_in - mean(SG_Commuter_vol_df$Tap_in)     
TapInC_mI <- localMITapIn[,1] - mean(localMITapIn[,1])    
signif <- 0.05       
TapInquadrant[TapInDV >0 & TapInC_mI>0] <- 4      
TapInquadrant[TapInDV <0 & TapInC_mI<0] <- 1      
TapInquadrant[TapInDV <0 & TapInC_mI>0] <- 2
TapInquadrant[TapInDV >0 & TapInC_mI<0] <- 3
TapInquadrant[localMITapIn[,5]>signif] <- 0

Build Quadrants for Local MI Tap-out values

TapOutquadrant <- vector(mode="numeric",length=nrow(localMITapOut))
TapOutDV <- SG_Commuter_vol_df$Tap_out - mean(SG_Commuter_vol_df$Tap_out)     
TapOutC_mI <- localMITapOut[,1] - mean(localMITapOut[,1])    
signif <- 0.05       
TapOutquadrant[TapOutDV >0 & TapOutC_mI>0] <- 4      
TapOutquadrant[TapOutDV <0 & TapOutC_mI<0] <- 1      
TapOutquadrant[TapOutDV <0 & TapOutC_mI>0] <- 2
TapOutquadrant[TapOutDV >0 & TapOutC_mI<0] <- 3
TapOutquadrant[localMITapOut[,5]>signif] <- 0

Constructing LISA Maps for Both Tap-in and Tap-out Local Moran’s I

colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")
clusters <- c("Insignificant", "Low-Low", "Low-High", "High-Low", "High-High")


SG_Commuter_vol_df.localMITapIn$TapInquadrant <- TapInquadrant

LISATapIn <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "TapInquadrant", style = "cat", title="Tap-in Classification", palette = colors[c(sort(unique(TapInquadrant)))+1], labels = clusters[c(sort(unique(TapInquadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

SG_Commuter_vol_df.localMITapOut$TapOutquadrant <- TapOutquadrant

LISATapOut <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "TapOutquadrant", style = "cat", title="Tap-out Classification", palette = colors[c(sort(unique(TapOutquadrant)))+1], labels = clusters[c(sort(unique(TapOutquadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

tmap_arrange(LISATapIn, LISATapOut, asp=1, ncol=2)

Interpretation of LISA Map

Based on the LISA map, we can see several clusters that primarily align with our initial observations of the local Moran’s I scores in the previous step. However, some subzones in the north-east and west also fall within the High-High quadrant, indicating positive autocorrelation with their neighbouring subzones.

Step 4: Hot Spot and Cold Spot Area Analysis using Gi Statistics

For the purpose of this analysis, we will only be using the adaptive distance weight matrix, for the reasons explained in Part 2 regarding the nature of the polygons we are dealing with. We will reuse the wm_knn6 neighbour list computed in that section.

Computing Gi Statistics

wm_knn6_lw <- nb2listw(wm_knn6, style = 'B') #wm_knn6 taken from previous Part 2

TapIngi.adaptive <- localG(SG_Commuter_vol_df$Tap_in, wm_knn6_lw)
TapInSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapIngi.adaptive))
names(TapInSG.gi)[11] <- "gstat_adaptive"

TapOutgi.adaptive <- localG(SG_Commuter_vol_df$Tap_out, wm_knn6_lw)
TapOutSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapOutgi.adaptive))
names(TapOutSG.gi)[11] <- "gstat_adaptive"
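Before mapping, we can get a quick tally of candidate hot and cold spots by treating |Gi| greater than 1.96 as significant at roughly the 5% level (a minimal sketch for the Tap-in statistics):

#Candidate hot spots and cold spots for Tap-in volumes
sum(TapInSG.gi$gstat_adaptive > 1.96)  #hot spots
sum(TapInSG.gi$gstat_adaptive < -1.96) #cold spots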

Mapping out Gi Statistics

TapIngiMap <- tm_shape(TapInSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "pretty",
          palette = "-RdBu",
          n=5,
          title = "Tap-in local Gi") +
  tm_borders(alpha = 0.5)

TapOutgiMap <- tm_shape(TapOutSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "pretty",
          palette = "-RdBu",
          n=5,
          title = "Tap-out local Gi") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapIngiMap, TapOutgiMap, asp=1, ncol=2)
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Interpretation of Gi Statistic Maps

Based on the results, we see hot spots in areas we had originally identified in our earlier LISA map. There is also a surprising cold spot in the west, towards Tuas. This may be due to the nature of how k-nearest neighbours works: the Tuas View Extension subzone is only really adjacent to 2 subzones, but the adaptive distance weight matrix coerces it into considering more distant zones.

Overall Conclusion

Overall, there appears to be some clustering for both Tap-in and Tap-out around the Tampines East subzone, with the local Moran’s I analysis, LISA mapping and Gi statistics all highlighting the zone or some of its surrounding areas as clusters or hot spots. This aligns with the population map produced in Part 1, which shows the area as relatively more populated than others, although we did not perform any geospatial analytics on the population distribution itself.

Part 4: Study limitations and assumptions

There are several variables not included in the statistical analysis above that must be considered when examining the scope of the study.