Take Home Assignment 1 for IS415 - Geospatial Analytics and Applications - G10

Overview

This exercise is examining national development or public service issues by utilizing data sets from various governmental agencies in Singapore. The assignment is meant to demonstrate basic Geo-spatial Data Wrangling, Geo-spatial Analytics and Geo-visualization in R. For the purpose of this assignment, we will utilizing course material in order to achieve 3 specific objectives:

Calibrating a simple linear regression to reveal the relation between public bus commuters’ flows (i.e. tap-in or tap-out) data and residential population at the planning sub-zone level.
Performing spatial auto-correlation analysis on the residual of the regression model to test if the model conforms to the randomization assumption.
Performing localized geospatial statistics analysis by using commuters’ tap-in and tap-out data to identify geographical clustering.

For the purpose of this study, the passenger volume by bus-stop data set of Land Transport Authority (LTA) has been provided. This data set is extracted using the dynamic API provided at LTA DataMall. The remaining data from the government open data portal.

Part 0: Setup

For the purpose of this assignment, we will be using packages taught in the course. Particularly we will be utilizing rgdal, sf, spdep, tmap and tidyverse.

Loading in required packages

packages = c('rgdal', 'sf', 'spdep', 'tmap', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Part 1: Geospatial Data Wrangling & Linear Regression Model

Step 1: Defining the scope of our task

Loading in the passenger volume dataset

sgpassengervolume <- read_csv("data/aspatial/passenger volume by busstop.csv")

## Parsed with column specification:
## cols(
##   YEAR_MONTH = col_character(),
##   DAY_TYPE = col_character(),
##   TIME_PER_HOUR = col_double(),
##   PT_TYPE = col_character(),
##   PT_CODE = col_character(),
##   TOTAL_TAP_IN_VOLUME = col_double(),
##   TOTAL_TAP_OUT_VOLUME = col_double()
## )

NA/Validity Checks 1

sgpassengervolume[rowSums(is.na(sgpassengervolume))!=0,]

## # A tibble: 0 x 7
## # ... with 7 variables: YEAR_MONTH <chr>, DAY_TYPE <chr>, TIME_PER_HOUR <dbl>,
## #   PT_TYPE <chr>, PT_CODE <chr>, TOTAL_TAP_IN_VOLUME <dbl>,
## #   TOTAL_TAP_OUT_VOLUME <dbl>

Cleared of NA values.

Analysis of dataset

Based on the Dataset provided on passenger volume by bus-stop data, we are examining passenger flows from of the period for the month of January 2020. As requested by the assignment, we need to aggregate the data at the planning sub-zone level in order to perform a simple linear model for its relation to residential population in the same zones.

Aggregating dataset by Bus Stop Number

For our purpose, we don’t care about the time or day differences in our dataset, we only care about the volumes based on the bus stop locations so we will need a new data-frame which is summed by bus stop number

Totalbusstopvol <- sgpassengervolume %>% 
  group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
  summarise(Tap_In = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_out = sum(TOTAL_TAP_OUT_VOLUME))

NA/Validity Checks

Totalbusstopvol[rowSums(is.na(Totalbusstopvol))!=0,]

## # A tibble: 0 x 3
## # ... with 3 variables: BUS_STOP_N <chr>, Tap_In <dbl>, Tap_out <dbl>

Cleared of NA values.

Step 2: Loading in Geospatial Data for bus locations

Thus we will need a relational database that helps us to map the bus-stop locations to the planning sub-zone level. For that purpose we will be utilizing the Bus Stop Location Shapefile from the LTA DataMall Source:https://www.mytransport.sg/content/mytransport/home/dataMall/static-data.html#Whole%20Island

Loading in Bus Stop Location Geospatial Data

We’ll load it in as a Simple Feature DataFrame, in this case since we know its from Singapore database, there is no need to transform. We will simply set the CRS to 3414 in correspondence to SG.

busstoplocations_sf <- readOGR(dsn = "data/geospatial/BusStopLocation_Jan2020", layer = "BusStop")

## OGR data source with driver: ESRI Shapefile 
## Source: "C:\data\GSA\IS415_Take-Home_EX01\data\geospatial\BusStopLocation_Jan2020", layer: "BusStop"
## with 5040 features
## It has 3 fields

busstoplocations_sf <- st_as_sf(busstoplocations_sf)
busstoplocations_sf <- st_transform(busstoplocations_sf, crs= 3414)

#st_crs(busstoplocations_sf)

NA/Validity Check 1

busstoplocations_sf[rowSums(is.na(busstoplocations_sf))!=0,]

## Simple feature collection with 121 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 10792.71 ymin: 28191.17 xmax: 43485.8 ymax: 47931.22
## CRS:            EPSG:3414
## First 10 features:
##     BUS_STOP_N BUS_ROOF_N LOC_DESC                  geometry
## 1        78221        B06     <NA> POINT (42227.96 39563.16)
## 36       30059        B15     <NA> POINT (16128.92 39596.94)
## 63       70291        B03     <NA> POINT (33981.28 35537.19)
## 105      96081        B05     <NA> POINT (41603.76 35413.11)
## 194      31171        B30     <NA> POINT (13634.64 39378.37)
## 221      66511        B02     <NA>    POINT (32266 39747.92)
## 351      67481        B04     <NA>  POINT (32945.05 41494.7)
## 352      83239        B06     <NA>  POINT (37922.7 33240.12)
## 427      64521       B01A     <NA>  POINT (34249.3 39110.15)
## 536      31169        B31     <NA> POINT (13160.11 39376.74)

test <- st_is_valid(busstoplocations_sf)
length(which(test==FALSE))

## [1] 0

Geometric objects are valid. However,there are several Location Descriptions with NA values. We will replace them with “UNKNOWN”

Replace NA with Unknown

busstoplocations_sf$LOC_DESC[is.na(busstoplocations_sf$LOC_DESC)] <- "UNKNOWN"

NA/Validity Checks 2

busstoplocations_sf[rowSums(is.na(busstoplocations_sf))!=0,]

## Simple feature collection with 2 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 21294.95 ymin: 37068.44 xmax: 32659.61 ymax: 38543.97
## CRS:            EPSG:3414
##      BUS_STOP_N BUS_ROOF_N           LOC_DESC                  geometry
## 2897      43961       <NA> AFT DAIRY FARM HTS POINT (21294.95 38543.97)
## 2968      62141       <NA>          OPP BLK 1 POINT (32659.61 37068.44)

Some Bus Roof Numbers are missing as well. This is why we need to repeat the validity check even if we think we have cleared all the NA values. We will replace these with “UNKNOWN”.

Replace NA with Unknown

busstoplocations_sf$BUS_ROOF_N[is.na(busstoplocations_sf$BUS_ROOF_N)] <- "UNKNOWN"

NA/Validity Checks 3

busstoplocations_sf[rowSums(is.na(busstoplocations_sf))!=0,]

## Simple feature collection with 0 features and 3 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
## [1] BUS_STOP_N BUS_ROOF_N LOC_DESC   geometry  
## <0 rows> (or 0-length row.names)

No more NA values

Joining Commuter’s volume to Bus Stop Location dataset

Next we need to attach our passenger data to our Spatial Points DataFrame

busstoplocations_sf <- left_join(busstoplocations_sf,Totalbusstopvol) #we already renamed the variable to match so there is no need to specify the joining factor

## Joining, by = "BUS_STOP_N"

NA/Validity Checks 1

busstoplocations_sf[rowSums(is.na(busstoplocations_sf))!=0,]

## Simple feature collection with 78 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 7702.752 ymin: 27733.06 xmax: 41244.44 ymax: 49261.6
## CRS:            EPSG:3414
## First 10 features:
##     BUS_STOP_N BUS_ROOF_N               LOC_DESC Tap_In Tap_out
## 178      91121        NIL                HSE 189     NA      NA
## 256      44401        NIL            OPP BLK 183     NA      NA
## 344      31121        B40   NIRVANA MEMORIAL GDN     NA      NA
## 461      23381       B01A            PLANT ENGRG     NA      NA
## 567      65601        NIL      SUMANG STN EXIT B     NA      NA
## 574      65609        NIL      SUMANG STN EXIT A     NA      NA
## 592      02031        B01        OPP THE ADELPHI     NA      NA
## 640      91151        NIL           HAWAII TOWER     NA      NA
## 644      23411        B07 AFT JURONG MARINE BASE     NA      NA
## 692      18209        B01      BEF ALL ST CHAPEL     NA      NA
##                      geometry
## 178 POINT (35030.24 31256.69)
## 256 POINT (20144.74 40279.55)
## 344  POINT (11838.8 39157.43)
## 461 POINT (12552.18 32234.11)
## 567 POINT (35281.38 43367.08)
## 574  POINT (35238.99 43359.3)
## 592 POINT (29991.79 30453.16)
## 640 POINT (34364.42 31052.96)
## 644 POINT (11634.46 31782.65)
## 692 POINT (23789.09 30790.59)

When looking at our data set, we can see some NA values have in our Tap_In and Tap_Out columns. This is due to there being more bus stop locations in our geospatial data set than in our commuters’ volume. We will need to replace these NA values with 0 or else we will encounter issues in aggregation.

Replace NA with 0

busstoplocations_sf$Tap_In[is.na(busstoplocations_sf$Tap_In)] <- 0
busstoplocations_sf$Tap_out[is.na(busstoplocations_sf$Tap_out)] <- 0

NA/Validity Checks 2

busstoplocations_sf[rowSums(is.na(busstoplocations_sf))!=0,]

## Simple feature collection with 0 features and 5 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
## [1] BUS_STOP_N BUS_ROOF_N LOC_DESC   Tap_In     Tap_out    geometry  
## <0 rows> (or 0-length row.names)

Cleared of NA values.

Step 3: Loading in Planning Area Subzone Geospatial Data and Population Data

In order for us to perform our regression. We need population information by planning area subzone level. We will utilize the “Singapore Residents by Subzone and Type of Dwelling, 2011 - 2019” and utilize only the 2019 information to get the latest data.

Source: https://data.gov.sg/dataset/Singapore-residents-by-subzone-and-type-of-dwelling-2011-2019

To make it geospatial data, we need to match with the according geospatial subzone data, according to the website, this information is still map to the Master Plan 2014 subzones, as such we will use those although there is an updated 2019 MP. For this exercise, we will use the shapefile instead of the KML files also provided

Source: https://data.gov.sg/dataset/master-plan-2014-subzone-boundary-web

Loading in Population Data

For the purpose of our analysis, we will use only the 2019 data.

NOTE: For the purpose of our analysis, in order to reduce errors in our linear regression model later on, we will remove children in the 0-4 age group. This is because children below the age of 7 and under a certain height criteria do not tap in or out the bus, thus including this data would not be captured in our bus tap-in and tap-out data, producing noise and increase the errors in our linear regression model.

Source: https://www.sbstransit.com.sg/fares-and-concessions

PopData <- read_csv("data/aspatial/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv") %>%
  filter(year==2019) %>% #filter only by 2019 data
  filter(age_group !="0_to_4")#removing age group 0-4 as they are irrelevant for the purpose of our study

## Parsed with column specification:
## cols(
##   planning_area = col_character(),
##   subzone = col_character(),
##   age_group = col_character(),
##   sex = col_character(),
##   type_of_dwelling = col_character(),
##   resident_count = col_double(),
##   year = col_double()
## )

NA/Validity Checks 1

PopData[rowSums(is.na(PopData))!=0,]

## # A tibble: 0 x 7
## # ... with 7 variables: planning_area <chr>, subzone <chr>, age_group <chr>,
## #   sex <chr>, type_of_dwelling <chr>, resident_count <dbl>, year <dbl>

Cleared of NA values

Aggregating Resident Population by subzone

ResPopbySubzone <- PopData %>% 
  group_by(subzone = subzone) %>% #rename as per naming convention by LTA
  summarise(resident_count= sum(resident_count))

ResPopbySubzone <- mutate_at(ResPopbySubzone, .vars = "subzone", .funs=toupper) #convert names to upper case, be sure to only do so for names.

NA/Validity Checks 1

ResPopbySubzone[rowSums(is.na(ResPopbySubzone))!=0,]

## # A tibble: 0 x 2
## # ... with 2 variables: subzone <chr>, resident_count <dbl>

Cleared of NA values

Loading in Planning Area Subzone geospatial data

NOTE: Check WGS and CRS. In this case st_transform does not change the polygons because they are already a Singaporean dataset but it helps us to format and label our dataset.

PlanningSubzone_sf <- readOGR(dsn = "data/geospatial/MP14_SUBZONE_WEB", layer = "MP14_SUBZONE_WEB_PL")

## OGR data source with driver: ESRI Shapefile 
## Source: "C:\data\GSA\IS415_Take-Home_EX01\data\geospatial\MP14_SUBZONE_WEB", layer: "MP14_SUBZONE_WEB_PL"
## with 323 features
## It has 15 fields

PlanningSubzone_sf <- st_as_sf(PlanningSubzone_sf)
PlanningSubzone_sf <- st_transform(PlanningSubzone_sf, crs= 3414)

#st_crs(PlanningSubzone_sf)

NA/Validity Checks 1

PlanningSubzone_sf[rowSums(is.na(PlanningSubzone_sf))!=0,]

## Simple feature collection with 0 features and 15 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
##  [1] OBJECTID   SUBZONE_NO SUBZONE_N  SUBZONE_C  CA_IND     PLN_AREA_N
##  [7] PLN_AREA_C REGION_N   REGION_C   INC_CRC    FMEL_UPD_D X_ADDR    
## [13] Y_ADDR     SHAPE_Leng SHAPE_Area geometry  
## <0 rows> (or 0-length row.names)

test <- st_is_valid(PlanningSubzone_sf)
length(which(test==FALSE))

## [1] 9

Cleared of NA values. However, 9 polygons are invalid according to the st_is_valid test. Luckily the sf package is able to correct this for us.

PlanningSubzone_sf <- st_make_valid(PlanningSubzone_sf)

NA/Validity Checks 2

test <- st_is_valid(PlanningSubzone_sf)
length(which(test==FALSE))

## [1] 0

Polygons are valid.

Joining Residential data to Simple Dataframe

PlanningSubzone_sf <- left_join(PlanningSubzone_sf , ResPopbySubzone, by = c("SUBZONE_N" = "subzone"))

NA/Validity Checks 1

PlanningSubzone_sf[rowSums(is.na(PlanningSubzone_sf))!=0,]

## Simple feature collection with 0 features and 16 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
##  [1] OBJECTID       SUBZONE_NO     SUBZONE_N      SUBZONE_C      CA_IND        
##  [6] PLN_AREA_N     PLN_AREA_C     REGION_N       REGION_C       INC_CRC       
## [11] FMEL_UPD_D     X_ADDR         Y_ADDR         SHAPE_Leng     SHAPE_Area    
## [16] resident_count geometry      
## <0 rows> (or 0-length row.names)

test <- st_is_valid(PlanningSubzone_sf)
length(which(test==FALSE))

## [1] 0

Cleared of NA and invalid Polygons.

Mapping to show distribution of residential population for 2019 (Excluding aged 0-4) in Singapore

tm_shape(PlanningSubzone_sf)+
  tm_fill(col = "resident_count",
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Resident Population Count") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Area Sub-zone boundary MP2014 from Urban Redevelopment Authorithy (URA)\n and Population data from Department of Statistics (DOS)", 
             position = c("left", "bottom"))

Map seems to align with most population distributions we’ve seen before, so we will accept that our data has been loaded in properly

Step 4: Aggregation and filtering of relevant data

Aggregating Bus Stop-Level data to Subzone-Level data

NOTE: At this point in time, we will need to consider what to do with subzones where you have no data for your Tap-in and Tap-out or if there are no residents in the subzone.

For the purpose of our analysis, we will firstly remove zones where all Tap-in, Tap-out and Residents are equal to 0. This is because it is likely irrelevant for our study as these areas are unreachable by the general population and may cause problems during the geospatial auto-correlation calculations later on.

For zones in which one of these conditions but not all are met, we will keep these areas as they are still relevant for our scope of study. We will replace any missing values with 0.

However, it should be noted that based on initial inspection of the data, there are areas in which with 0 residents but a high amount of Tap-in and Tap-out which may be due to construction projects. The best example of this would be the Woodlands Regional Center with a disproportionately large bus volumes but no residential population because it is still under construction.

Performing join using the st_join() function and st_intersects criteria.

MasterSubzone_sf <- st_join(PlanningSubzone_sf,busstoplocations_sf, join=st_intersects) #using st_intersects to join bus stops that are either within or touching the polygons.

NA/Validity Checks 1

MasterSubzone_sf[rowSums(is.na(MasterSubzone_sf))!=0,]

## Simple feature collection with 18 features and 21 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 8012.578 ymin: 15748.72 xmax: 56396.44 ymax: 49122.83
## CRS:            EPSG:3414
## First 10 features:
##     OBJECTID SUBZONE_NO               SUBZONE_N SUBZONE_C CA_IND
## 13        13          1             MARINA EAST    MESZ01      Y
## 16        16          1 JURONG ISLAND AND BUKOM    WISZ01      N
## 17        17          3                  SUDONG    WISZ03      N
## 18        18          2                 SEMAKAU    WISZ02      N
## 19        19          2          SOUTHERN GROUP    SISZ02      N
## 73        73          4               CAIRNHILL    NTSZ04      Y
## 98        98          7             MOUNT EMILY    RCSZ07      Y
## 234      234          1              CHANGI BAY    CBSZ01      N
## 274      274          1      LORONG HALUS NORTH    SESZ01      N
## 280      280          3     PULAU PUNGGOL TIMOR    SLSZ03      N
##           PLN_AREA_N PLN_AREA_C          REGION_N REGION_C          INC_CRC
## 13       MARINA EAST         ME    CENTRAL REGION       CR 782A2FAF53029A34
## 16   WESTERN ISLANDS         WI       WEST REGION       WR 699F7210FBF1AFA8
## 17   WESTERN ISLANDS         WI       WEST REGION       WR F718C723E08FBD51
## 18   WESTERN ISLANDS         WI       WEST REGION       WR E69207D4F76DEEA3
## 19  SOUTHERN ISLANDS         SI    CENTRAL REGION       CR 5809FC547293EA2D
## 73            NEWTON         NT    CENTRAL REGION       CR 8B79CA72836AD7A6
## 98            ROCHOR         RC    CENTRAL REGION       CR D8225E737A6496B4
## 234       CHANGI BAY         CB       EAST REGION       ER ACFD43D8AE3F7B3F
## 274         SENGKANG         SE NORTH-EAST REGION      NER 4A9FCDF7EAE9BEAE
## 280          SELETAR         SL NORTH-EAST REGION      NER DF9C2E922175C69D
##     FMEL_UPD_D   X_ADDR   Y_ADDR SHAPE_Leng SHAPE_Area resident_count
## 13  2014/12/05 32344.05 30103.25   6470.950  1844060.7              0
## 16  2014/12/05 13012.88 27225.87  68083.936 36707720.9              0
## 17  2014/12/05 15931.76 19579.07  24759.066  4207271.1              0
## 18  2014/12/05 21206.33 20465.81  18703.681  4963787.1              0
## 19  2014/12/05 29815.09 23412.59  25626.977  2206319.5              0
## 73  2014/12/05 28482.67 32108.19   3853.900   452471.8           3680
## 98  2014/12/05 29583.26 31639.48   2301.230   193992.3           1400
## 234 2014/12/05 49502.49 34316.54  18731.560  1829821.7              0
## 274 2014/12/05 36619.22 40404.62   3930.097   924866.0              0
## 280 2014/12/05 34428.85 44535.90   4817.804  1259743.1              0
##     BUS_STOP_N BUS_ROOF_N LOC_DESC Tap_In Tap_out
## 13        <NA>       <NA>     <NA>     NA      NA
## 16        <NA>       <NA>     <NA>     NA      NA
## 17        <NA>       <NA>     <NA>     NA      NA
## 18        <NA>       <NA>     <NA>     NA      NA
## 19        <NA>       <NA>     <NA>     NA      NA
## 73        <NA>       <NA>     <NA>     NA      NA
## 98        <NA>       <NA>     <NA>     NA      NA
## 234       <NA>       <NA>     <NA>     NA      NA
## 274       <NA>       <NA>     <NA>     NA      NA
## 280       <NA>       <NA>     <NA>     NA      NA
##                           geometry
## 13  MULTIPOLYGON (((33214.62 29...
## 16  MULTIPOLYGON (((14557.7 304...
## 17  MULTIPOLYGON (((15772.59 21...
## 18  MULTIPOLYGON (((19843.41 21...
## 19  MULTIPOLYGON (((27865.53 22...
## 73  MULTIPOLYGON (((28674.31 32...
## 98  MULTIPOLYGON (((29794.8 319...
## 234 MULTIPOLYGON (((48860.11 34...
## 274 MULTIPOLYGON (((37017.79 40...
## 280 MULTIPOLYGON (((35103.37 44...

test <- st_is_valid(MasterSubzone_sf)
length(which(test==FALSE))

## [1] 0

No polygons are invalid but there are some rows of Tap_in and Tap_out which are NA. We will need to replaces these NA values with 0 in order to aggregate the data. We will also replace NA values for the missing bus stop roof numbers, bus stop numbers and Location descriptions.

Replace NA with 0 and Unknown

MasterSubzone_sf$Tap_In[is.na(MasterSubzone_sf$Tap_In)] <- 0
MasterSubzone_sf$Tap_out[is.na(MasterSubzone_sf$Tap_out)] <- 0
MasterSubzone_sf$BUS_ROOF_N[is.na(MasterSubzone_sf$BUS_ROOF_N)] <- "UNKNOWN"
MasterSubzone_sf$BUS_STOP_N[is.na(MasterSubzone_sf$BUS_STOP_N)] <- "UNKNOWN"
MasterSubzone_sf$LOC_DESC[is.na(MasterSubzone_sf$LOC_DESC)] <- "UNKNOWN"

NA/Validity Checks 2

MasterSubzone_sf[rowSums(is.na(MasterSubzone_sf))!=0,]

## Simple feature collection with 0 features and 21 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
##  [1] OBJECTID       SUBZONE_NO     SUBZONE_N      SUBZONE_C      CA_IND        
##  [6] PLN_AREA_N     PLN_AREA_C     REGION_N       REGION_C       INC_CRC       
## [11] FMEL_UPD_D     X_ADDR         Y_ADDR         SHAPE_Leng     SHAPE_Area    
## [16] resident_count BUS_STOP_N     BUS_ROOF_N     LOC_DESC       Tap_In        
## [21] Tap_out        geometry      
## <0 rows> (or 0-length row.names)

test <- st_is_valid(MasterSubzone_sf)
length(which(test==FALSE))

## [1] 0

All Polygons are valid and the data is cleared of NA.

Aggregating Tap-in and Tap-out Values to subzone and only keeping in relevant features

SummarizedMasterSubzone_sf <- MasterSubzone_sf %>%
  filter(Tap_In!= 0 | Tap_out != 0 | resident_count != 0) %>% # we'll now  filter out any subzones with all values zero
  group_by(Subzone = SUBZONE_N) %>%
  summarise(Residential = as.numeric(first(resident_count)), Tap_in = sum(Tap_In), Tap_out = sum(Tap_out))

NA/Validity Checks 1

SummarizedMasterSubzone_sf[rowSums(is.na(SummarizedMasterSubzone_sf))!=0,]

## Simple feature collection with 0 features and 4 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
## # A tibble: 0 x 5
## # ... with 5 variables: Subzone <chr>, Residential <dbl>, Tap_in <dbl>,
## #   Tap_out <dbl>, geometry <GEOMETRY [m]>

test <- st_is_valid(SummarizedMasterSubzone_sf)
length(which(test==FALSE))

## [1] 0

summary(SummarizedMasterSubzone_sf)

##    Subzone           Residential         Tap_in           Tap_out       
##  Length:307         Min.   :     0   Min.   :      0   Min.   :      0  
##  Class :character   1st Qu.:    80   1st Qu.:  89266   1st Qu.:  87778  
##  Mode  :character   Median :  6140   Median : 227638   Median : 222195  
##                     Mean   : 12534   Mean   : 376106   Mean   : 375453  
##                     3rd Qu.: 18345   3rd Qu.: 483565   3rd Qu.: 466563  
##                     Max.   :127200   Max.   :3754337   Max.   :3660297  
##           geometry  
##  MULTIPOLYGON :307  
##  epsg:3414    :  0  
##  +proj=tmer...:  0  
##                     
##                     
##

Cleared of NA and invalid polygons

Step 5: Plotting our summarized data

Now we need to inspect what polygons have remained and look at the distribution of our Tap-in and Tap-out data.

Plotting two thematic maps of Tap-in and Tap-out volumes

CountTapInMap <- tm_shape(SummarizedMasterSubzone_sf)+
  tm_fill(col = c("Tap_in"),
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


CountTapOutMap <- tm_shape(SummarizedMasterSubzone_sf)+
  tm_fill(col = c("Tap_out"),
              n = 5,
              style="jenks",
              palette = "Reds",
          title = "Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


tmap_arrange(CountTapInMap, CountTapOutMap, asp=1, ncol=2)

We can see that during the aggregation process that some of the islands, such as Jurong Island and Pulau Ubin have disappeared. This is good because these areas are inaccessible by bus transportation which allows us to concentrate our study on the relevant areas within mainland Singapore.

Step 6: Performing Linear Regression Model

For this exercise, as we are interested in the relationship between the tap-in and tap-out volumes to the residential population. We will be using the lm() function to calculate the linear regression. Since for the purpose of our study, we will assume that our X is Residential Population and Y is Tap-in or Tap-out. This is based on the fact that the Residential Population data was gathered in 2019 before the Tap-In and Tap-Out data in January 2020 although we may be committing post hoc ergo propter hoc fallacy.

Performing Linear Regression for Tap-in values

TapInLM.model <- lm(Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapInLM.model)

## 
## Call:
## lm(formula = Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -799481 -127198  -67424   31303 1954431 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.207e+05  2.075e+04   5.816 1.52e-08 ***
## Residential 2.038e+01  9.715e-01  20.977  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 294300 on 305 degrees of freedom
## Multiple R-squared:  0.5906, Adjusted R-squared:  0.5893 
## F-statistic:   440 on 1 and 305 DF,  p-value: < 2.2e-16

Performing Linear Regression for Tap-out values

TapOutLM.model <- lm(Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapOutLM.model)

## 
## Call:
## lm(formula = Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -758965 -120883  -57479   32583 1648011 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.212e+05  2.009e+04   6.036 4.59e-09 ***
## Residential 2.028e+01  9.406e-01  21.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 285000 on 305 degrees of freedom
## Multiple R-squared:  0.6039, Adjusted R-squared:  0.6026 
## F-statistic: 464.9 on 1 and 305 DF,  p-value: < 2.2e-16

We can see there have been no observations deleted due to missing values, assuring us that the data has been cleaned properly

Step 5: Interpretation of Linear Regression Model for commuter’s flow to Residential Population by URA 2014 Master Plan Planning Area Subzones

Based on the results shown, both Tap-on and Tap-out volumes are shown to be positively correlated. The results are significant at an alpha value of 0.001 which allows us to reject the null hypothesis at a 0.999% confidence interval. The R-square value of 0.5906 and 0.6039 shows a fair amount of strength between the model and the response variable.

Plotting Tap_In volumes to Residential Population

TapInPlot <- ggplot(data= TapInLM.model, aes(x = Residential, y= Tap_in)) +
  geom_point(color='blue',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapInLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-in Volume Count")
  

TapInPlot

Plotting Tap_out volumes to Residential Population

TapOutPlot <- ggplot(data= TapOutLM.model, aes(x = Residential, y= Tap_out)) +
  geom_point(color='red',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapOutLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-out Volume Count")

TapOutPlot

Plotting Residuals

par(mfrow=c(1,2))
plot(TapInLM.model$fitted.values, TapInLM.model$residuals, pch = 8, col = "blue", xlab = "Predict Tap-in Values", ylab = "Residual Values (Tap-in)")
plot(TapOutLM.model$fitted.values, TapOutLM.model$residuals, pch = 8, col = "red",  xlab = "Predict Tap-in Values", ylab = "Residual Values (Tap-in)")

When observing the residuals, the concentration of data points in the low residential amounts makes this hard to interpret. However, we can see there is some degree of heteroscedasticity, which could suggest a missing variable or a variable that requires transformation, however it is unclear at this time. There also appear to be some outlier values observed in the distribution. This could be due to the point raised earlier about places such as Woodlands Regional Center.

Part 2: Spatial Autocorrelation Analysis on residuals

In the next part, we will append the residuals back to our Simple Features Dataframe and check if the randomization assumption holds true. This is to help us check that the relationship established in our previous linear model holds true or if there is a geospatial problem in the data generating process.

Step 1: Adding residuals and predicted values to Simple Feature Dataframe

It is important to ensure that the indexes from your linear model are the same as that of our data-frame. Usually the safer way would be to create an index for both, however, for simplicity we will just check manually and see that the data from our linear model and Simple Feature Dataframe align.

Appending using tibble functions

ResidualFitted_sf <- SummarizedMasterSubzone_sf

ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapIn = residuals(TapInLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapOut = residuals(TapOutLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapIn = predict(TapInLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapOut = predict(TapOutLM.model), .before = "geometry")

NA/Validity Checks 1

ResidualFitted_sf[rowSums(is.na(ResidualFitted_sf))!=0,]

## Simple feature collection with 0 features and 8 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
## # A tibble: 0 x 9
## # ... with 9 variables: Subzone <chr>, Residential <dbl>, Tap_in <dbl>,
## #   Tap_out <dbl>, ResidualTapIn <dbl>, ResidualTapOut <dbl>,
## #   PredictedTapIn <dbl>, PredictedTapOut <dbl>, geometry <GEOMETRY [m]>

test <- st_is_valid(ResidualFitted_sf)
length(which(test==FALSE))

## [1] 0

Cleared of NA and invalid polygons.

Step 2: Mapping out our Residual Values

Mapping using tmaps

ResidualTapInmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ResidualTapIn", 
              n = 5,
              style="jenks",
              palette = "RdYlGn",
          title = "Residual Tap-in") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

ResidualTapOutmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ResidualTapOut", 
              n = 5,
              style="jenks",
              palette = "RdYlBu",
          title = "Residual Tap-out") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

tmap_arrange(ResidualTapInmap, ResidualTapOutmap, asp=1, ncol=2)

Step 3: Forming Weight Matrixes using Contiguity-based neighbours

Creating (QUEEN) contiguity based neighbours

wm_q <- poly2nb(ResidualFitted_sf, queen=TRUE)
summary(wm_q)

## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.975618 
## Average number of links: 6.065147 
## Link number distribution:
## 
##  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  6 10 28 78 78 49 37 14  2  2  1  1  1 
## 6 least connected regions:
## 2 42 133 172 230 277 with 2 links
## 1 most connected region:
## 40 with 17 links

Creating (ROOK) contiguity based neighbours

wm_r <- poly2nb(ResidualFitted_sf, queen=FALSE)
summary(wm_r)

## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1614 
## Percentage nonzero weights: 1.712485 
## Average number of links: 5.257329 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 13 14 
##  1  5 23 70 92 62 29 17  4  2  1  1 
## 1 least connected region:
## 172 with 1 link
## 1 most connected region:
## 40 with 14 links

Visualizing (QUEEN) and (ROOK) contiguity based neighbours

centroids <- sf::st_centroid(ResidualFitted_sf$geometry) #finding centroids for our polygons

par(mfrow=c(1,2))
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Queen Contiguity", asp=1)
plot(wm_q, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col= "red" )
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Rook Contiguity", asp =1)
plot(wm_r, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col = "red")

Analysis of contiguity methods

Plotting bus stop locations on Planning Area Subzone bountaries

tm_shape(PlanningSubzone_sf)+
  tm_fill(col="SUBZONE_N", palette = "Blues",
          title = "Subzone Boundaries", legend.show = FALSE) +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 1) +
  tmap_style("white") +
  tm_credits("Source: Planning Area Sub-zone boundary MP2014 from Urban Redevelopment Authorithy (URA)", 
             position = c("left", "bottom"))+
  tm_shape(busstoplocations_sf)+
  tm_dots(size = 0.03)

By looking at the plots and summaries of both the Rooks and Queen Contiguity, we can see that there is a large difference between both methodologies. A total difference of 258 links. By looking at the map, we can see that this is very evident in particular to the north and north-east part of the map.

With reference to bus routes in Singapore and locations in the above plot, it is often they move along the edges and corners of planning area subzones, thus the Queen method for continguity would be more applicable for our study as they cross over point-connections between polygons.

Source: https://wiki.smu.edu.sg/1415T2is415/File:IS415_2014-15_Term2_Assign1_ktchan.2011_Web_map_overview.png

Step 4: Forming Weight Matrixes using Fixed Distance-based Neighbours

Double check dataset

st_crs(ResidualFitted_sf)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCS["SVY21 / Singapore TM",
##     GEOGCS["SVY21",
##         DATUM["SVY21",
##             SPHEROID["WGS 84",6378137,298.257223563,
##                 AUTHORITY["EPSG","7030"]],
##             AUTHORITY["EPSG","6757"]],
##         PRIMEM["Greenwich",0,
##             AUTHORITY["EPSG","8901"]],
##         UNIT["degree",0.0174532925199433,
##             AUTHORITY["EPSG","9122"]],
##         AUTHORITY["EPSG","4757"]],
##     PROJECTION["Transverse_Mercator"],
##     PARAMETER["latitude_of_origin",1.366666666666667],
##     PARAMETER["central_meridian",103.8333333333333],
##     PARAMETER["scale_factor",1],
##     PARAMETER["false_easting",28001.642],
##     PARAMETER["false_northing",38744.572],
##     UNIT["metre",1,
##         AUTHORITY["EPSG","9001"]],
##     AUTHORITY["EPSG","3414"]]

Since we know the EPSG is 3414, our distance will be in metres. Thus it should be no surprise if we have large values compared to other data-sets. Additionally the coordinates are in latitude and longitude which means we will need to be careful when mapping our coordinates or using functions that utilize them.

Determining the appropriate cut-off distance.

coords <- st_coordinates(centroids) #taken from the previous Step 3.

k1 <- knn2nb(knearneigh(coords))

k1dists <- unlist(nbdists(k1, coords, longlat = FALSE))
summary(k1dists)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   182.5   616.3   887.5   937.2  1169.8  5404.1

Unfortunately, it would seem that the distance values vary wildly between zones.

This may be due to centroids for large polygons such as those in the Western Water Catchment Area, Central Water Catchment Area and Changi Airport cover much larger areas compared to places such as Telok Blangah Rise. Thus a Fixed Distance Weight Matrix would be not very useful for us to utilize.

Computing Fixed Distance Matrix (NOT RECOMMENDED WEIGHT MATRIX)

Lets take a look

Creating Fixed-Distance based on max value 5404.1 ~5045

wm_d5405 <- dnearneigh(coords, 0, 5405, longlat = FALSE)

Plotting links created using Fixed-Distance Based Weight Matrix.

par(mfrow=c(1,2))

plot(ResidualFitted_sf$geometry, border="lightgrey", main="1st Nearest Neighbours")
plot(k1, coords, add=TRUE, col="red", length=0.08)
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Fixed-Distance Link")
plot(wm_d5405, coords, add=TRUE, pch = 19, cex = 0.6)

As we can see, the Fixed Distance Weight Matrix methodology would not be very useful for analysis.

Step 5: Computing adaptive distance weight matrix

For the purpose of our analysis, we’ll be using a set of 6 nearest neighbors as it is the average number of links indicated during our Queen Contiguity based Neighbors analysis.

Setting neighbours based on average number of links from Queen Contiguity

wm_knn6 <- knn2nb(knearneigh(coords, k=6, longlat = FALSE), row.names=row.names(ResidualFitted_sf$Subzone))
wm_knn6

## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1842 
## Percentage nonzero weights: 1.954397 
## Average number of links: 6 
## Non-symmetric neighbours list

Plotting out adaptive distance weight matrix links

plot(ResidualFitted_sf$geometry, border="lightgrey", main= "KNN 6 Neighbours")
plot(wm_knn6, coords, pch = 19, cex = 0.6, add = TRUE, col = "red")

Although the links on first glance appear quite good, you will see areas such as Changi Airport reaching over the neighbors near it. We also know that based on the Queen contiguity method that some areas have up to 17 connections which are more than double the connections set here.

Step 6: Deciding on a weight matrix

While both the Adaptive Distance Weight Matrix and Rook Contiguity-based Neighbors seem promising, it would be more appropriate to utilize the Queen Continguity-based Neighbors as our ideal choice. These are due to the following points:

The size of polygons have large variation causing issues in using Fixed-Distance based weight Matrixes
The number of neighbors around the subzones differ from 1-17 if considering the Queen Contiguity-based neighbors, thus setting an arbitrary number of neighbors will result on some subzones over-reaching neighboring zones and more importantly, not taking into account certain neighbors when required to for auto-correlation.
Since we’re observing commuter’s volumes, there is no reason for us to believe that bus routes will only go through sides which are larger than a point, in fact some bus routes seem to pass through the corners of the subzones. Residents may also travel out of subzones to take the bus in another subzone thus there is no reason for Rook Contiguity unless there is a computational constraint.

Thus for the next phase of analysis, we will only be utilizing the Queen Contiguity-based neighbors in our spatial auto-correlation analysis.

Step 7: Weights based on the Inverse Distance Method using Queen Contiguity-based Neighours

Deriving a spatial weight matrix based on Inversed Distance method using the Queen method.

dist <- nbdists(wm_q, coords, longlat = FALSE)
ids <- lapply(dist, function(x) 1/(x))

Row-standardised weights matrix

rswm_q <- nb2listw(wm_q, style="W", zero.policy = TRUE)
rswm_q

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.975618 
## Average number of links: 6.065147 
## 
## Weights style: W 
## Weights constants summary:
##     n    nn  S0       S1       S2
## W 307 94249 307 106.4054 1261.803

Verifying weights have been appropriatly applied.

wm_q[[1]]

## [1] 217 218 220 221 222 264

rswm_q$weights[1]

## [[1]]
## [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

Looks like they match up.

Step 8: Perform Moran’s I test on both Tap-in residual and Tap-out residual dataset

Performing Moran’s I test for Tap-in Residual

moran.test(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)

## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ResidualTapIn  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = -1.0545, p-value = 0.8542
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.037755733      -0.003267974       0.001069537

Performing Moran’s I test for Tap-out Residual

moran.test(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)

## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ResidualTapOut  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = -0.50383, p-value = 0.6928
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.019809059      -0.003267974       0.001077864

Interpretation of Moran’s I test

Based on our results, we can see that the p-values resulting from both tests are relatively high, thus we are unable to reject the null hypothesis that the results generated are random even at a 90% confidence level. This promotes that the relationships seen in our linear regression models are indeed true.

The coefficient for Moran’s I in this case is inapplicable.

Step 9: Robustness Testing

Using Monte Carlo Moran’s I

For robustness testing, we use Monte Carlo to check if our results hold.

Computing Monte-Carlo Moran’s I for Residual Tap-in

set.seed(1234)
MCMoranTapIn= moran.mc(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapIn

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ResidualFitted_sf$ResidualTapIn 
## weights: rswm_q  
## number of simulations + 1: 1000 
## 
## statistic = -0.037756, observed rank = 157, p-value = 0.843
## alternative hypothesis: greater

Computing Monte-Carlo Moran’s I for Residual Tap-out

set.seed(1234)
MCMoranTapOut= moran.mc(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapOut

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ResidualFitted_sf$ResidualTapOut 
## weights: rswm_q  
## number of simulations + 1: 1000 
## 
## statistic = -0.019809, observed rank = 309, p-value = 0.691
## alternative hypothesis: greater

Step 10: Interpretation of robustness testing

After testing using Monte-Carlo Moran’s I, we can safely say our results appear consistent and are promising as all are unable to reject the null hypothesis.

Conclusion of analysis

Overall we can be confident in our results that the relationship between Residential Population in URA 2014 Planning Area is positively correlated with Commuters’ volumes by bus. This is due to low p-values and the randomization assumption of residual values holding true for both linear modelling as well as geospatial auto-correlation.

Part 3: Localized Geospatial Statistical Analysis

Step 0: Converting to SpatialPolygonDataFrame

For the purpose of this Analysis, we will be converting the Simple Feature DataFrame back into a SpatialPolygoneDataFrame

Utilizing the as_Spatial() function from the sp package

SG_Commuter_vol_df <- as_Spatial(ResidualFitted_sf)

Step 1: Computing local Moran’s I for Cluster and Outlier Analysis

Computing Local Moran’s I for Tap-in volumes

subzone_labels <- order(SG_Commuter_vol_df$Subzone)
localMITapIn <- localmoran(SG_Commuter_vol_df$Tap_in, rswm_q)

#printCoefmat(data.frame(localMITapIn[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)

Computing Local Moran’s I for Tap-out volumes

localMITapOut <- localmoran(SG_Commuter_vol_df$Tap_out, rswm_q)

#printCoefmat(data.frame(localMITapOut[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)

Appending local Moran’s I and P-values to Simple Feature Dataframe

SG_Commuter_vol_df.localMITapIn <- cbind(SG_Commuter_vol_df,localMITapIn)
SG_Commuter_vol_df.localMITapOut <- cbind(SG_Commuter_vol_df,localMITapOut)

Step 2: Mapping Local Moran’s I

Visualizing local Moran’s I for Tap-in values

TapInlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "Tap-in Local MI Statistics") +
  tm_borders(alpha = 0.5)

TapInPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "Pr.z...0.", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Blues", 
          title = "Tap-in Local MI p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapInlocalMI.map, TapInPvalue.map, asp=1, ncol=2)

## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Visualizing local Moran’s I for Tap-out values

TapOutlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "Tap-out Local MI Statistics") +
  tm_borders(alpha = 0.5)

TapOutPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "Pr.z...0.", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Reds", 
          title = "Tap-out Local MI p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapOutlocalMI.map, TapOutPvalue.map, asp=1, ncol=2)

## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Intepretation of Local Moran Statistics

Based on the map results, we can see that there are several clusters for both Tap-in and Tap-out specifically around the Tampines East, Tampines West and Bedok North subzones. There also appear to be several outliers, however they have low p-values thus we will not take them into consideration.

Step 3: Creating LISA Cluster Map

Creating Moran Scatter Plot with standardized Tap-in Values

SG_Commuter_vol_df.localMITapIn$Z.Tap_In <- scale(SG_Commuter_vol_df.localMITapIn$Tap_in) %>% as.vector 

TapInLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapIn$Z.Tap_In, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapIn$Subzone), xlab="z Tap-in Volumes", ylab="Spatially Lag z Tap-in Volumes")

Creating Moran Scatter Plot with standardized Tap-out Values

SG_Commuter_vol_df.localMITapOut$Z.Tap_out <-scale(SG_Commuter_vol_df.localMITapOut$Tap_out) %>% as.vector 

TapOutLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapOut$Z.Tap_out, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapOut$Subzone), xlab="z Tap-out Volumes", ylab="Spatially Lag z Tap-out Volumes")

Build Quadrants for Local MI Tap-in values

TapInquadrant <- vector(mode="numeric",length=nrow(localMITapIn))
TapInDV <- SG_Commuter_vol_df$Tap_in - mean(SG_Commuter_vol_df$Tap_in)     
TapInC_mI <- localMITapIn[,1] - mean(localMITapIn[,1])    
signif <- 0.05       
TapInquadrant[TapInDV >0 & TapInC_mI>0] <- 4      
TapInquadrant[TapInDV <0 & TapInC_mI<0] <- 1      
TapInquadrant[TapInDV <0 & TapInC_mI>0] <- 2
TapInquadrant[TapInDV >0 & TapInC_mI<0] <- 3
TapInquadrant[localMITapIn[,5]>signif] <- 0

Build Quadrants for Local MI Tap-out values

TapOutquadrant <- vector(mode="numeric",length=nrow(localMITapOut))
TapOutDV <- SG_Commuter_vol_df$Tap_out - mean(SG_Commuter_vol_df$Tap_out)     
TapOutC_mI <- localMITapOut[,1] - mean(localMITapOut[,1])    
signif <- 0.05       
TapOutquadrant[TapOutDV >0 & TapOutC_mI>0] <- 4      
TapOutquadrant[TapOutDV <0 & TapOutC_mI<0] <- 1      
TapOutquadrant[TapOutDV <0 & TapOutC_mI>0] <- 2
TapOutquadrant[TapOutDV >0 & TapOutC_mI<0] <- 3
TapOutquadrant[localMITapOut[,5]>signif] <- 0

Constructing LISA Maps for Both Tap-in and Tap-out Local Moran’s I

colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")
clusters <- c("Insignificant", "Low-Low", "Low-High", "High-Low", "High-High")


SG_Commuter_vol_df.localMITapIn$TapInquadrant <- TapInquadrant

LISATapIn <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "TapInquadrant", style = "cat", title="Tap-in Classification", palette = colors[c(sort(unique(TapInquadrant)))+1], labels = clusters[c(sort(unique(TapInquadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

SG_Commuter_vol_df.localMITapOut$TapOutquadrant <- TapOutquadrant

LISATapOut <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "TapOutquadrant", style = "cat", title="Tap-out Classification", palette = colors[c(sort(unique(TapOutquadrant)))+1], labels = clusters[c(sort(unique(TapOutquadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

tmap_arrange(LISATapIn, LISATapOut, asp=1, ncol=2)

Intepretation of LISA Map

Based on the LISA map, we can see several clusters which primarily align with our initial observations of the Local Moran’s I scores in the previous Step. However, there appears to be some subzones in the north-east and west that fall within the High-High quadrants indicating positive auto-correlations with neighboring subzones.

Step 4: Hot Spot and Cold Spot Area Analysis using Gi Statistics

For the purpose of this analysis, we will only be using Adaptive Distance Weight Matrix for reasons explained in the earlier Part 2 on the nature of the polygons we are dealing with. We will take the same clusters we set in the previous section.

Computing Gi Statistics

wm_knn6_lw <- nb2listw(wm_knn6, style = 'B') #wm_knn6 taken from previous Part 2

TapIngi.adaptive <- localG(SG_Commuter_vol_df$Tap_in, wm_knn6_lw)
TapInSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapIngi.adaptive))
names(TapInSG.gi)[length(names(TapInSG.gi))] <- "gstat_adaptive"

TapOutgi.adaptive <- localG(SG_Commuter_vol_df$Tap_out, wm_knn6_lw)
TapOutSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapOutgi.adaptive))
names(TapOutSG.gi)[length(names(TapOutSG.gi))] <- "gstat_adaptive"

Mapping out Getis-Ord Gi Statistics

TapIngiMap <- tm_shape(TapInSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "jenks",
          palette = "-RdBu",
          n=5,
          title = "Tap-in local Gi") +
  tm_borders(alpha = 0.5)

TapOutgiMap <- tm_shape(TapOutSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "jenks",
          palette = "-RdBu",
          n=5,
          title = "Tap-out local Gi") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapIngiMap, TapOutgiMap, asp=1, ncol=2)

## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Interpretation of GI Statistic Maps

Based on the results, we see hot-spots in areas which we had originally identified in our previous LISA Map. There is also a surprising cold spot towards the west towards Tuas. This may be due to the nature of how k-nearestneighbours works and specifically how Tuas View Extension subzone is only really adjacent to 2 subzones but was coerced using the Adaptive Distance Weight Matrix to consider more zones.

Overall Conclusion

Overall there appears to be some clustering for both Tap-in and Tap-out around the Tampines East subzone with the Local Moran’s I Analysis, LISA Mapping and Gi Statistical all highlighting the zone or some of it’s surrounding areas as clusters or hot spots. This aligns with the population map done in Part 1 which shows the area as relatively higher populated than others although we had not done any geospatial analytics on population distributions.

Part 4: Study limitations and assumptions

There are several variables which are not included within the statistical analysis above that must be considered when examining the scope of the study.

We had already removed one age group of 0-4 year olds due to the age and height requirements to enter the bus, thus their movements would not be captured in the tap-in and tap-out data. However, the age for no bus-fare requirement is 7 years old and a certain height criteria. As there was no feasible way to segment the residential data in a timely manner, the age group for 5-9 year olds was still included which could have introduced some level of noise in the data.
We did not factor in the variable of bus frequencies or availability of bus stops in each subzone. These could affect the aggregated tap-in and tap-out numbers as more frequently serviced bus stops might create larger commuters’ flow.
Our analysis also did not control for the availability of alternate transportation such as the MRT which greatly affects the use of bus services in an area. These may also contribute to increased volumes at routes that intersect with these MRT stations.

Part 5: BONUS: Analysing time effects of commuters’ volumes

In this section, we’re interested if there is a time-variation factor that could affect the spatial distribution of our commuters’ volumes. Intuition is that during morning peak periods, people are travelling from home in residential zones to work in non-residential zones. The opposite should be true for evening peak periods. We’re looking to see if we segment the data according to these two peak periods if there is a significant difference in terms of geospatial distribution.

Step 1: EDA on bus volumes to find peak periods

HourStats <- sgpassengervolume %>% 
  filter(DAY_TYPE == "WEEKDAY") %>%
  group_by(TIME_PER_HOUR) %>%
  summarise(Tap_InStat = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_outStat = sum(TOTAL_TAP_OUT_VOLUME))

Bar Plotting Tap In and Tap Out

ggplot(data= HourStats,aes(x = TIME_PER_HOUR, y= Tap_InStat)) + geom_bar(stat="identity", position = "dodge")

ggplot(data= HourStats,aes(x = TIME_PER_HOUR, y= Tap_outStat)) + geom_bar(stat="identity", position = "dodge")

Based on the data above, we can see there are 2 distinct peak periods for both Tap-in and Tap-out. We are going to see if there is a geospatial difference in these periods. We will use these peak periods as our two sets of data both within 3 hour spans of 6-8am and 5-7pm.

Assuming these are due to the work hour peak timings, we will further limit our data to “WEEKDAY” which are work days.

Step 2: Creating Morning and Evening DataSet

Morning Data Set

TotalbusstopvolMorning <- sgpassengervolume %>% 
  filter(TIME_PER_HOUR %in% (6:8), DAY_TYPE == "WEEKDAY") %>%
  group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
  summarise(Tap_InMorning = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_outMorning = sum(TOTAL_TAP_OUT_VOLUME)) #be sure to check data

NA/Validity check

TotalbusstopvolMorning[rowSums(is.na(TotalbusstopvolMorning))!=0,]

## # A tibble: 0 x 3
## # ... with 3 variables: BUS_STOP_N <chr>, Tap_InMorning <dbl>,
## #   Tap_outMorning <dbl>

Cleared of NA Values

Evening Data Set

TotalbusstopvolEvening <- sgpassengervolume %>% 
  filter(TIME_PER_HOUR %in% (17:19), DAY_TYPE == "WEEKDAY") %>%
  group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
  summarise(Tap_InEvening = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_outEvening = sum(TOTAL_TAP_OUT_VOLUME)) #check Day type and time

Validity check 1

TotalbusstopvolEvening[rowSums(is.na(TotalbusstopvolEvening))!=0,]

## # A tibble: 0 x 3
## # ... with 3 variables: BUS_STOP_N <chr>, Tap_InEvening <dbl>,
## #   Tap_outEvening <dbl>

Cleared of NA Values

Step 3: Create New Simple Feature DataFrame

TimeVar_sf <- left_join(busstoplocations_sf,TotalbusstopvolMorning) #we already renamed the variable to match so there is no need to specify the joining factor

## Joining, by = "BUS_STOP_N"

TimeVar_sf <- left_join(TimeVar_sf,TotalbusstopvolEvening) #we already renamed the variable to match so there is no need to specify the joining factor

## Joining, by = "BUS_STOP_N"

Validity check 1

TimeVar_sf[rowSums(is.na(TimeVar_sf))!=0,]

## Simple feature collection with 125 features and 9 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 7702.752 ymin: 27733.06 xmax: 41244.44 ymax: 49261.6
## CRS:            EPSG:3414
## First 10 features:
##     BUS_STOP_N BUS_ROOF_N                 LOC_DESC Tap_In Tap_out Tap_InMorning
## 14       49249        B08          DRAGON FISH IND    721     691            NA
## 178      91121        NIL                  HSE 189      0       0            NA
## 256      44401        NIL              OPP BLK 183      0       0            NA
## 275      91101        B03        TG KATONG FLYOVER     24      54            NA
## 276      49209        B01 SG BULOH WETLAND RESERVE   1413    1118            NA
## 344      31121        B40     NIRVANA MEMORIAL GDN      0       0            NA
## 372      45391        NIL                  BLK 657      0       9            NA
## 374      49239        B06      OPP NYEE PHOE GROUP    369     222            NA
## 432      93171        B15 OPP ISLAND RESORT S'PORE    542     206            NA
## 461      23381       B01A              PLANT ENGRG      0       0            NA
##     Tap_outMorning Tap_InEvening Tap_outEvening                  geometry
## 14              NA            NA             NA POINT (16000.18 46531.64)
## 178             NA            NA             NA POINT (35030.24 31256.69)
## 256             NA            NA             NA POINT (20144.74 40279.55)
## 275             NA            NA             NA POINT (35058.96 30937.84)
## 276             NA            NA             NA POINT (16482.29 47421.34)
## 344             NA            NA             NA  POINT (11838.8 39157.43)
## 372             NA            NA             NA  POINT (18697.5 42446.09)
## 374             NA            NA             NA POINT (16274.91 46921.07)
## 432             NA            NA             NA  POINT (38108.3 31937.69)
## 461             NA            NA             NA POINT (12552.18 32234.11)

Replace NA Values with 0

TimeVar_sf$Tap_InMorning[is.na(TimeVar_sf$Tap_InMorning)] <- 0
TimeVar_sf$Tap_outMorning[is.na(TimeVar_sf$Tap_outMorning)] <- 0
TimeVar_sf$Tap_InEvening[is.na(TimeVar_sf$Tap_InEvening)] <- 0
TimeVar_sf$Tap_outEvening[is.na(TimeVar_sf$Tap_outEvening)] <- 0

NA/Validity check 2

TimeVar_sf[rowSums(is.na(TimeVar_sf))!=0,]

## Simple feature collection with 0 features and 9 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
##  [1] BUS_STOP_N     BUS_ROOF_N     LOC_DESC       Tap_In         Tap_out       
##  [6] Tap_InMorning  Tap_outMorning Tap_InEvening  Tap_outEvening geometry      
## <0 rows> (or 0-length row.names)

Cleared of NA values.

Step 4:Joining data to Population and Planning Area subzone polygons

TimeVarSubzone_sf <- st_join(PlanningSubzone_sf,TimeVar_sf, join=st_intersects) #using st_intersects to join bus stops that are either within or touching the polygons.

NA/Validity Check 1

TimeVarSubzone_sf[rowSums(is.na(TimeVarSubzone_sf))!=0,]

## Simple feature collection with 18 features and 25 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 8012.578 ymin: 15748.72 xmax: 56396.44 ymax: 49122.83
## CRS:            EPSG:3414
## First 10 features:
##     OBJECTID SUBZONE_NO               SUBZONE_N SUBZONE_C CA_IND
## 13        13          1             MARINA EAST    MESZ01      Y
## 16        16          1 JURONG ISLAND AND BUKOM    WISZ01      N
## 17        17          3                  SUDONG    WISZ03      N
## 18        18          2                 SEMAKAU    WISZ02      N
## 19        19          2          SOUTHERN GROUP    SISZ02      N
## 73        73          4               CAIRNHILL    NTSZ04      Y
## 98        98          7             MOUNT EMILY    RCSZ07      Y
## 234      234          1              CHANGI BAY    CBSZ01      N
## 274      274          1      LORONG HALUS NORTH    SESZ01      N
## 280      280          3     PULAU PUNGGOL TIMOR    SLSZ03      N
##           PLN_AREA_N PLN_AREA_C          REGION_N REGION_C          INC_CRC
## 13       MARINA EAST         ME    CENTRAL REGION       CR 782A2FAF53029A34
## 16   WESTERN ISLANDS         WI       WEST REGION       WR 699F7210FBF1AFA8
## 17   WESTERN ISLANDS         WI       WEST REGION       WR F718C723E08FBD51
## 18   WESTERN ISLANDS         WI       WEST REGION       WR E69207D4F76DEEA3
## 19  SOUTHERN ISLANDS         SI    CENTRAL REGION       CR 5809FC547293EA2D
## 73            NEWTON         NT    CENTRAL REGION       CR 8B79CA72836AD7A6
## 98            ROCHOR         RC    CENTRAL REGION       CR D8225E737A6496B4
## 234       CHANGI BAY         CB       EAST REGION       ER ACFD43D8AE3F7B3F
## 274         SENGKANG         SE NORTH-EAST REGION      NER 4A9FCDF7EAE9BEAE
## 280          SELETAR         SL NORTH-EAST REGION      NER DF9C2E922175C69D
##     FMEL_UPD_D   X_ADDR   Y_ADDR SHAPE_Leng SHAPE_Area resident_count
## 13  2014/12/05 32344.05 30103.25   6470.950  1844060.7              0
## 16  2014/12/05 13012.88 27225.87  68083.936 36707720.9              0
## 17  2014/12/05 15931.76 19579.07  24759.066  4207271.1              0
## 18  2014/12/05 21206.33 20465.81  18703.681  4963787.1              0
## 19  2014/12/05 29815.09 23412.59  25626.977  2206319.5              0
## 73  2014/12/05 28482.67 32108.19   3853.900   452471.8           3680
## 98  2014/12/05 29583.26 31639.48   2301.230   193992.3           1400
## 234 2014/12/05 49502.49 34316.54  18731.560  1829821.7              0
## 274 2014/12/05 36619.22 40404.62   3930.097   924866.0              0
## 280 2014/12/05 34428.85 44535.90   4817.804  1259743.1              0
##     BUS_STOP_N BUS_ROOF_N LOC_DESC Tap_In Tap_out Tap_InMorning Tap_outMorning
## 13        <NA>       <NA>     <NA>     NA      NA            NA             NA
## 16        <NA>       <NA>     <NA>     NA      NA            NA             NA
## 17        <NA>       <NA>     <NA>     NA      NA            NA             NA
## 18        <NA>       <NA>     <NA>     NA      NA            NA             NA
## 19        <NA>       <NA>     <NA>     NA      NA            NA             NA
## 73        <NA>       <NA>     <NA>     NA      NA            NA             NA
## 98        <NA>       <NA>     <NA>     NA      NA            NA             NA
## 234       <NA>       <NA>     <NA>     NA      NA            NA             NA
## 274       <NA>       <NA>     <NA>     NA      NA            NA             NA
## 280       <NA>       <NA>     <NA>     NA      NA            NA             NA
##     Tap_InEvening Tap_outEvening                       geometry
## 13             NA             NA MULTIPOLYGON (((33214.62 29...
## 16             NA             NA MULTIPOLYGON (((14557.7 304...
## 17             NA             NA MULTIPOLYGON (((15772.59 21...
## 18             NA             NA MULTIPOLYGON (((19843.41 21...
## 19             NA             NA MULTIPOLYGON (((27865.53 22...
## 73             NA             NA MULTIPOLYGON (((28674.31 32...
## 98             NA             NA MULTIPOLYGON (((29794.8 319...
## 234            NA             NA MULTIPOLYGON (((48860.11 34...
## 274            NA             NA MULTIPOLYGON (((37017.79 40...
## 280            NA             NA MULTIPOLYGON (((35103.37 44...

test <- st_is_valid(TimeVarSubzone_sf)
length(which(test==FALSE))

## [1] 0

Replacing NA values

TimeVarSubzone_sf$Tap_In[is.na(TimeVarSubzone_sf$Tap_In)] <- 0
TimeVarSubzone_sf$Tap_out[is.na(TimeVarSubzone_sf$Tap_out)] <- 0
TimeVarSubzone_sf$Tap_InMorning[is.na(TimeVarSubzone_sf$Tap_InMorning)] <- 0
TimeVarSubzone_sf$Tap_outMorning[is.na(TimeVarSubzone_sf$Tap_outMorning)] <- 0
TimeVarSubzone_sf$Tap_InEvening[is.na(TimeVarSubzone_sf$Tap_InEvening)] <- 0
TimeVarSubzone_sf$Tap_outEvening[is.na(TimeVarSubzone_sf$Tap_outEvening)] <- 0
TimeVarSubzone_sf$BUS_ROOF_N[is.na(TimeVarSubzone_sf$BUS_ROOF_N)] <- "UNKNOWN"
TimeVarSubzone_sf$BUS_STOP_N[is.na(TimeVarSubzone_sf$BUS_STOP_N)] <- "UNKNOWN"
TimeVarSubzone_sf$LOC_DESC[is.na(TimeVarSubzone_sf$LOC_DESC)] <- "UNKNOWN"

NA/Validity Check 2

TimeVarSubzone_sf[rowSums(is.na(TimeVarSubzone_sf))!=0,]

## Simple feature collection with 0 features and 25 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
##  [1] OBJECTID       SUBZONE_NO     SUBZONE_N      SUBZONE_C      CA_IND        
##  [6] PLN_AREA_N     PLN_AREA_C     REGION_N       REGION_C       INC_CRC       
## [11] FMEL_UPD_D     X_ADDR         Y_ADDR         SHAPE_Leng     SHAPE_Area    
## [16] resident_count BUS_STOP_N     BUS_ROOF_N     LOC_DESC       Tap_In        
## [21] Tap_out        Tap_InMorning  Tap_outMorning Tap_InEvening  Tap_outEvening
## [26] geometry      
## <0 rows> (or 0-length row.names)

test <- st_is_valid(TimeVarSubzone_sf)
length(which(test==FALSE))

## [1] 0

Cleared of NA values.

Step 5: Aggregation of data

Aggregating Tap-in and Tap-out/ Morning and Evening Values to subzone and only keeping in relevant features.

SummarizedTimeVar_sf <- TimeVarSubzone_sf %>%
  filter(Tap_In!= 0 | Tap_out != 0 | resident_count != 0 ) %>% # Filters do not change in this case. 
  group_by(Subzone = SUBZONE_N) %>%
  summarise(Residential = as.numeric(first(resident_count)), Tap_in = sum(Tap_In), Tap_out = sum(Tap_out), Tap_inMorning = sum(Tap_InMorning), Tap_outMorning = sum(Tap_outMorning), Tap_inEvening = sum(Tap_InEvening), Tap_outEvening = sum(Tap_outEvening) )

NA/Validity Checks 1

SummarizedTimeVar_sf[rowSums(is.na(SummarizedTimeVar_sf))!=0,]

## Simple feature collection with 0 features and 8 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## CRS:            EPSG:3414
## # A tibble: 0 x 9
## # ... with 9 variables: Subzone <chr>, Residential <dbl>, Tap_in <dbl>,
## #   Tap_out <dbl>, Tap_inMorning <dbl>, Tap_outMorning <dbl>,
## #   Tap_inEvening <dbl>, Tap_outEvening <dbl>, geometry <GEOMETRY [m]>

test <- st_is_valid(SummarizedTimeVar_sf)
length(which(test==FALSE))

## [1] 0

Clear of NA values and polygons are valid

Step 6: Plotting Data

Plotting four thematic maps

TapInMorningMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = c("Tap_inMorning"),
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Morning Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


TapOutMorningMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = c("Tap_outMorning"),
              n = 5,
              style="jenks",
              palette = "Reds",
          title = "Morning Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))

TapInEveningMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = c("Tap_inEvening"),
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Evening Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


TapOutEveningMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = c("Tap_outEvening"),
              n = 5,
              style="jenks",
              palette = "Reds",
          title = "Evening Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))

tmap_arrange(TapInMorningMap, TapOutMorningMap, TapInEveningMap, TapOutEveningMap,asp=1, ncol=2, nrow = 2)

Upon initial glance, it is hard to spot any drastic shifts between the four different zones with the exception of Changi Airport having significantly higher Tap-out volumes in the morning. One thing we could examine is the difference in morning versus evening for both and plot another map out for those two.

Calculating Difference between time periods.

Note the calculation we are making here is (Morning - Evening) Values

SummarizedTimeVar_sf <- add_column(SummarizedTimeVar_sf, TapInDifference = SummarizedTimeVar_sf$Tap_inMorning - SummarizedTimeVar_sf$Tap_inEvening, .before = "geometry")

SummarizedTimeVar_sf <- add_column(SummarizedTimeVar_sf, TapOutDifference = SummarizedTimeVar_sf$Tap_outMorning - SummarizedTimeVar_sf$Tap_outEvening, .before = "geometry")

Plotting the mapping between time periods

TapInDifferenceMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = "TapInDifference",
              n = 5,
              style="jenks",
              palette = "RdBu",
          title = "Difference in Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


TapOutDifferenceMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = "TapOutDifference",
              n = 5,
              style="jenks",
              palette = "PRGn",
          title = "Difference in Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))

tmap_arrange(TapInDifferenceMap, TapOutDifferenceMap ,asp=1, ncol=2)

It seems there are some distinct differences based on the region, however it is hard to be sure what this map actually tells us in terms of clustering and its relationship to resident population in those subzones. We will need to plot the data on a graph to give us a better idea on how these values interact.

Performing Simple linear regression for Tap-in

TapInDiffLM.model <- lm(TapInDifference ~ Residential, data = SummarizedTimeVar_sf)
summary(TapInDiffLM.model)

## 
## Call:
## lm(formula = TapInDifference ~ Residential, data = SummarizedTimeVar_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -135341  -10496    3849   13548  104149 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.355e+04  2.296e+03  -5.901 9.62e-09 ***
## Residential  1.170e+00  1.075e-01  10.884  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32570 on 305 degrees of freedom
## Multiple R-squared:  0.2797, Adjusted R-squared:  0.2774 
## F-statistic: 118.5 on 1 and 305 DF,  p-value: < 2.2e-16

Plotting Tap-in differences to Residential Population

TapInDiffPlot <- ggplot(data= TapInDiffLM.model, aes(x = Residential, y= TapInDifference)) +
  geom_point(color='blue',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapInDiffLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-in Difference")

TapInDiffPlot

Interpretation:

Although our linear model seems to indicate a positive relationship between the resident population and the difference between morning and evening peak our Tap-in volumes, it is hard to draw a conclusion as a large amount of data is skewed to the left with the low residential numbers. Thus we must look for a different method

Step 7: Hot Spot and Cold Spot Area Analysis using Gi Statistics

For the purpose of this analysis, we will only be using Adaptive Distance Weight Matrix created in the earlier Part 3.

Converting from Simple Features DataFrame to SpatialPolygonDataFrame

SG_Commuter_Diff_df <- as_Spatial(SummarizedTimeVar_sf)

Computing Gi Statistics

TapInDiffgi.adaptive <- localG(SG_Commuter_Diff_df$TapInDifference, wm_knn6_lw)
TapInDiffSG.gi <- cbind(SG_Commuter_Diff_df, as.matrix(TapInDiffgi.adaptive))
names(TapInDiffSG.gi)[length(names(TapInDiffSG.gi))] <- "gstat_adaptive"

TapOutDiffgi.adaptive <- localG(SG_Commuter_Diff_df$TapOutDifference, wm_knn6_lw)
TapOutDiffSG.gi <- cbind(SG_Commuter_Diff_df, as.matrix(TapOutDiffgi.adaptive))
names(TapOutDiffSG.gi)[length(names(TapOutDiffSG.gi))] <- "gstat_adaptive"

Mapping out Getis-Ord Gi Statistics

TapInDiffgiMap <- tm_shape(TapInDiffSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "jenks",
          palette = "-RdBu",
          n=5,
          title = "Tap-in local Gi") +
  tm_borders(alpha = 0.5)

TapOutDiffgiMap <- tm_shape(TapOutDiffSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "jenks",
          palette = "-RdBu",
          n=5,
          title = "Tap-out local Gi") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapInDiffgiMap, TapOutDiffgiMap, asp=1, ncol=2)

## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Interpretation of Gi statistics

Here we can see something very interesting. When looking at the Tap-in Local Gi map, we can see that there are hot spots primarily in the northern areas of Singapore while cold spots exist primarily towards the south or downtown Singapore. These hot spots indicate more people tapping in in the mornings than in the evening, possibly the working crowd travelling downtown to work. The results for the Tap-out Local Gi also seem to support there results from the Tap-in Local Gi. Interestingly the Tuas and lower Jurong area are hot spots for the tap-out values which lines up with the idea that more people arrive for work and tap-out in the morning than in the evening.

Step 8: Concluding Remarks

Overall, this study shows that time-variations do play a factor in commuter flow within Singapore, particularly on a geospatial level from North-to-South in the morning peak hours and vice-versa in the evening peak hours. This should be factored in when considering utilizing the bus Tap-in and Tap-out volumes for analysis. Further more, this study could help influence URA’s developmental policies on commercial and residential areas around Singapore in observing it’s effectiveness in implementing these zones.

Take-Home_EX01

Lee Yi De (Isaac)

10 May 2020

Take Home Assignment 1 for IS415 - Geospatial Analytics and Applications - G10

Overview

Part 0: Setup

Loading in required packages

Part 1: Geospatial Data Wrangling & Linear Regression Model

Step 1: Defining the scope of our task

Loading in the passenger volume dataset

NA/Validity Checks 1

Analysis of dataset

Aggregating dataset by Bus Stop Number

NA/Validity Checks

Step 2: Loading in Geospatial Data for bus locations

Loading in Bus Stop Location Geospatial Data

NA/Validity Check 1

Replace NA with Unknown

NA/Validity Checks 2

Replace NA with Unknown

NA/Validity Checks 3

Joining Commuter’s volume to Bus Stop Location dataset

NA/Validity Checks 1

Replace NA with 0

NA/Validity Checks 2

Step 3: Loading in Planning Area Subzone Geospatial Data and Population Data

Loading in Population Data

NA/Validity Checks 1

Aggregating Resident Population by subzone

NA/Validity Checks 1

Loading in Planning Area Subzone geospatial data

NOTE: Check WGS and CRS. In this case st_transform does not change the polygons because they are already a Singaporean dataset but it helps us to format and label our dataset.

NA/Validity Checks 1

NA/Validity Checks 2

Joining Residential data to Simple Dataframe

NA/Validity Checks 1

Mapping to show distribution of residential population for 2019 (Excluding aged 0-4) in Singapore

Step 4: Aggregation and filtering of relevant data

Aggregating Bus Stop-Level data to Subzone-Level data

NOTE: At this point in time, we will need to consider what to do with subzones where you have no data for your Tap-in and Tap-out or if there are no residents in the subzone.

Performing join using the st_join() function and st_intersects criteria.

NA/Validity Checks 1

Replace NA with 0 and Unknown

NA/Validity Checks 2

Aggregating Tap-in and Tap-out Values to subzone and only keeping in relevant features

NA/Validity Checks 1

Step 5: Plotting our summarized data

Plotting two thematic maps of Tap-in and Tap-out volumes

Step 6: Performing Linear Regression Model

Performing Linear Regression for Tap-in values

Performing Linear Regression for Tap-out values

Step 5: Interpretation of Linear Regression Model for commuter’s flow to Residential Population by URA 2014 Master Plan Planning Area Subzones

Plotting Tap_In volumes to Residential Population

Plotting Tap_out volumes to Residential Population

Plotting Residuals

Part 2: Spatial Autocorrelation Analysis on residuals

Step 1: Adding residuals and predicted values to Simple Feature Dataframe

Appending using tibble functions

NA/Validity Checks 1

Step 2: Mapping out our Residual Values

Mapping using tmaps

Step 3: Forming Weight Matrixes using Contiguity-based neighbours

Creating (QUEEN) contiguity based neighbours

Creating (ROOK) contiguity based neighbours

Visualizing (QUEEN) and (ROOK) contiguity based neighbours

Analysis of contiguity methods

Plotting bus stop locations on Planning Area Subzone bountaries

Step 4: Forming Weight Matrixes using Fixed Distance-based Neighbours

Double check dataset

Determining the appropriate cut-off distance.

Computing Fixed Distance Matrix (NOT RECOMMENDED WEIGHT MATRIX)

Creating Fixed-Distance based on max value 5404.1 ~5045

Plotting links created using Fixed-Distance Based Weight Matrix.

Step 5: Computing adaptive distance weight matrix

Setting neighbours based on average number of links from Queen Contiguity

Plotting out adaptive distance weight matrix links

Step 6: Deciding on a weight matrix

Step 7: Weights based on the Inverse Distance Method using Queen Contiguity-based Neighours

Deriving a spatial weight matrix based on Inversed Distance method using the Queen method.

Row-standardised weights matrix