Take Home Assignment 1 for IS415 - Geospatial Analytics and Applications - G10

Overview

This exercise is examining national development or public service issues by utilizing data sets from various governmental agencies in Singapore. The assignment is meant to demonstrate basic Geo-spatial Data Wrangling, Geo-spatial Analytics and Geo-visualization in R. For the purpose of this assignment, we will utilizing course material in order to achieve 3 specific objectives:

For the purpose of this study, the passenger volume by bus-stop data set of Land Transport Authority (LTA) has been provided. This data set is extracted using the dynamic API provided at LTA DataMall. The remaining data from the government open data portal.

Part 0: Setup

For the purpose of this assignment, we will be using packages taught in the course. Particularly we will be utilizing rgdal, sf, spdep, tmap and tidyverse.

Loading in required packages

packages = c('rgdal', 'sf', 'spdep', 'tmap', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Part 1: Geospatial Data Wrangling & Linear Regression Model

Step 1: Defining the scope of our task

Loading in the passenger volume dataset

sgpassengervolume <- read_csv("data/aspatial/passenger volume by busstop.csv")
## Parsed with column specification:
## cols(
##   YEAR_MONTH = col_character(),
##   DAY_TYPE = col_character(),
##   TIME_PER_HOUR = col_double(),
##   PT_TYPE = col_character(),
##   PT_CODE = col_character(),
##   TOTAL_TAP_IN_VOLUME = col_double(),
##   TOTAL_TAP_OUT_VOLUME = col_double()
## )

Analysis of dataset

Based on the Dataset provided on passenger volume by bus-stop data, we are examining passenger flows from of the period for the month of January 2020. As requested by the assignment, we need to aggregate the data at the planning sub-zone level in order to perform a simple linear model for its relation to residential population in the same zones.

Aggregating dataset by Bus Stop Number

For our purpose, we don’t care about the time or day differences in our dataset, we only care about the volumes based on the bus stop locations so we will need a new data-frame which is summed by bus stop number

Totalbusstopvol <- sgpassengervolume %>% 
  group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
  summarise(Tap_In = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_out = sum(TOTAL_TAP_OUT_VOLUME))

Step 2: Loading in Geospatial Data for bus locations

Thus we will need a relational database that helps us to map the bus-stop locations to the planning sub-zone level. For that purpose we will be utilizing the Bus Stop Location Shapefile from the LTA DataMall Source:https://www.mytransport.sg/content/mytransport/home/dataMall/static-data.html#Whole%20Island

Loading in Bus Stop Location Geospatial Data

We’ll load it in as a Simple Feature DataFrame, in this case since we know its from Singapore database, there is no need to transform. We will simply set the CRS to 3414 in correspondence to SG.

busstoplocations_sf <- readOGR(dsn = "data/geospatial/BusStopLocation_Jan2020", layer = "BusStop")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\data\GSA\IS415_Take-Home_EX01\data\geospatial\BusStopLocation_Jan2020", layer: "BusStop"
## with 5040 features
## It has 3 fields
busstoplocations_sf <- st_as_sf(busstoplocations_sf)
busstoplocations_sf <- st_transform(busstoplocations_sf, crs= 3414)

#st_crs(busstoplocations_sf)

Replace NA with Unknown

busstoplocations_sf$LOC_DESC[is.na(busstoplocations_sf$LOC_DESC)] <- "UNKNOWN"

Replace NA with Unknown

busstoplocations_sf$BUS_ROOF_N[is.na(busstoplocations_sf$BUS_ROOF_N)] <- "UNKNOWN"

Joining Commuter’s volume to Bus Stop Location dataset

Next we need to attach our passenger data to our Spatial Points DataFrame

busstoplocations_sf <- left_join(busstoplocations_sf,Totalbusstopvol) #we already renamed the variable to match so there is no need to specify the joining factor
## Joining, by = "BUS_STOP_N"

Replace NA with 0

busstoplocations_sf$Tap_In[is.na(busstoplocations_sf$Tap_In)] <- 0
busstoplocations_sf$Tap_out[is.na(busstoplocations_sf$Tap_out)] <- 0

Step 3: Loading in Planning Area Subzone Geospatial Data and Population Data

In order for us to perform our regression. We need population information by planning area subzone level. We will utilize the “Singapore Residents by Subzone and Type of Dwelling, 2011 - 2019” and utilize only the 2019 information to get the latest data.

Source: https://data.gov.sg/dataset/Singapore-residents-by-subzone-and-type-of-dwelling-2011-2019

To make it geospatial data, we need to match with the according geospatial subzone data, according to the website, this information is still map to the Master Plan 2014 subzones, as such we will use those although there is an updated 2019 MP. For this exercise, we will use the shapefile instead of the KML files also provided

Source: https://data.gov.sg/dataset/master-plan-2014-subzone-boundary-web

Loading in Population Data

For the purpose of our analysis, we will use only the 2019 data.

NOTE: For the purpose of our analysis, in order to reduce errors in our linear regression model later on, we will remove children in the 0-4 age group. This is because children below the age of 7 and under a certain height criteria do not tap in or out the bus, thus including this data would not be captured in our bus tap-in and tap-out data, producing noise and increase the errors in our linear regression model.

Source: https://www.sbstransit.com.sg/fares-and-concessions

PopData <- read_csv("data/aspatial/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv") %>%
  filter(year==2019) %>% #filter only by 2019 data
  filter(age_group !="0_to_4")#removing age group 0-4 as they are irrelevant for the purpose of our study
## Parsed with column specification:
## cols(
##   planning_area = col_character(),
##   subzone = col_character(),
##   age_group = col_character(),
##   sex = col_character(),
##   type_of_dwelling = col_character(),
##   resident_count = col_double(),
##   year = col_double()
## )

Aggregating Resident Population by subzone

ResPopbySubzone <- PopData %>% 
  group_by(subzone = subzone) %>% #rename as per naming convention by LTA
  summarise(resident_count= sum(resident_count))

ResPopbySubzone <- mutate_at(ResPopbySubzone, .vars = "subzone", .funs=toupper) #convert names to upper case, be sure to only do so for names.

Loading in Planning Area Subzone geospatial data

NOTE: Check WGS and CRS. In this case st_transform does not change the polygons because they are already a Singaporean dataset but it helps us to format and label our dataset.
PlanningSubzone_sf <- readOGR(dsn = "data/geospatial/MP14_SUBZONE_WEB", layer = "MP14_SUBZONE_WEB_PL")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\data\GSA\IS415_Take-Home_EX01\data\geospatial\MP14_SUBZONE_WEB", layer: "MP14_SUBZONE_WEB_PL"
## with 323 features
## It has 15 fields
PlanningSubzone_sf <- st_as_sf(PlanningSubzone_sf)
PlanningSubzone_sf <- st_transform(PlanningSubzone_sf, crs= 3414)

#st_crs(PlanningSubzone_sf)

Joining Residential data to Simple Dataframe

PlanningSubzone_sf <- left_join(PlanningSubzone_sf , ResPopbySubzone, by = c("SUBZONE_N" = "subzone"))

Mapping to show distribution of residential population for 2019 (Excluding aged 0-4) in Singapore

tm_shape(PlanningSubzone_sf)+
  tm_fill(col = "resident_count",
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Resident Population Count") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Area Sub-zone boundary MP2014 from Urban Redevelopment Authorithy (URA)\n and Population data from Department of Statistics (DOS)", 
             position = c("left", "bottom"))

Map seems to align with most population distributions we’ve seen before, so we will accept that our data has been loaded in properly

Step 4: Aggregation and filtering of relevant data

Aggregating Bus Stop-Level data to Subzone-Level data

NOTE: At this point in time, we will need to consider what to do with subzones where you have no data for your Tap-in and Tap-out or if there are no residents in the subzone.

For the purpose of our analysis, we will firstly remove zones where all Tap-in, Tap-out and Residents are equal to 0. This is because it is likely irrelevant for our study as these areas are unreachable by the general population and may cause problems during the geospatial auto-correlation calculations later on.

For zones in which one of these conditions but not all are met, we will keep these areas as they are still relevant for our scope of study. We will replace any missing values with 0.

However, it should be noted that based on initial inspection of the data, there are areas in which with 0 residents but a high amount of Tap-in and Tap-out which may be due to construction projects. The best example of this would be the Woodlands Regional Center with a disproportionately large bus volumes but no residential population because it is still under construction.

Performing join using the st_join() function and st_intersects criteria.

MasterSubzone_sf <- st_join(PlanningSubzone_sf,busstoplocations_sf, join=st_intersects) #using st_intersects to join bus stops that are either within or touching the polygons.

Replace NA with 0 and Unknown

MasterSubzone_sf$Tap_In[is.na(MasterSubzone_sf$Tap_In)] <- 0
MasterSubzone_sf$Tap_out[is.na(MasterSubzone_sf$Tap_out)] <- 0
MasterSubzone_sf$BUS_ROOF_N[is.na(MasterSubzone_sf$BUS_ROOF_N)] <- "UNKNOWN"
MasterSubzone_sf$BUS_STOP_N[is.na(MasterSubzone_sf$BUS_STOP_N)] <- "UNKNOWN"
MasterSubzone_sf$LOC_DESC[is.na(MasterSubzone_sf$LOC_DESC)] <- "UNKNOWN"

Aggregating Tap-in and Tap-out Values to subzone and only keeping in relevant features

SummarizedMasterSubzone_sf <- MasterSubzone_sf %>%
  filter(Tap_In!= 0 | Tap_out != 0 | resident_count != 0) %>% # we'll now  filter out any subzones with all values zero
  group_by(Subzone = SUBZONE_N) %>%
  summarise(Residential = as.numeric(first(resident_count)), Tap_in = sum(Tap_In), Tap_out = sum(Tap_out))

Step 5: Plotting our summarized data

Now we need to inspect what polygons have remained and look at the distribution of our Tap-in and Tap-out data.

Plotting two thematic maps of Tap-in and Tap-out volumes

CountTapInMap <- tm_shape(SummarizedMasterSubzone_sf)+
  tm_fill(col = c("Tap_in"),
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


CountTapOutMap <- tm_shape(SummarizedMasterSubzone_sf)+
  tm_fill(col = c("Tap_out"),
              n = 5,
              style="jenks",
              palette = "Reds",
          title = "Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


tmap_arrange(CountTapInMap, CountTapOutMap, asp=1, ncol=2)

We can see that during the aggregation process that some of the islands, such as Jurong Island and Pulau Ubin have disappeared. This is good because these areas are inaccessible by bus transportation which allows us to concentrate our study on the relevant areas within mainland Singapore.

Step 6: Performing Linear Regression Model

For this exercise, as we are interested in the relationship between the tap-in and tap-out volumes to the residential population. We will be using the lm() function to calculate the linear regression. Since for the purpose of our study, we will assume that our X is Residential Population and Y is Tap-in or Tap-out. This is based on the fact that the Residential Population data was gathered in 2019 before the Tap-In and Tap-Out data in January 2020 although we may be committing post hoc ergo propter hoc fallacy.

Performing Linear Regression for Tap-in values

TapInLM.model <- lm(Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapInLM.model)
## 
## Call:
## lm(formula = Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -799481 -127198  -67424   31303 1954431 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.207e+05  2.075e+04   5.816 1.52e-08 ***
## Residential 2.038e+01  9.715e-01  20.977  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 294300 on 305 degrees of freedom
## Multiple R-squared:  0.5906, Adjusted R-squared:  0.5893 
## F-statistic:   440 on 1 and 305 DF,  p-value: < 2.2e-16

Performing Linear Regression for Tap-out values

TapOutLM.model <- lm(Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapOutLM.model)
## 
## Call:
## lm(formula = Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -758965 -120883  -57479   32583 1648011 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.212e+05  2.009e+04   6.036 4.59e-09 ***
## Residential 2.028e+01  9.406e-01  21.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 285000 on 305 degrees of freedom
## Multiple R-squared:  0.6039, Adjusted R-squared:  0.6026 
## F-statistic: 464.9 on 1 and 305 DF,  p-value: < 2.2e-16

We can see there have been no observations deleted due to missing values, assuring us that the data has been cleaned properly

Step 5: Interpretation of Linear Regression Model for commuter’s flow to Residential Population by URA 2014 Master Plan Planning Area Subzones

Based on the results shown, both Tap-on and Tap-out volumes are shown to be positively correlated. The results are significant at an alpha value of 0.001 which allows us to reject the null hypothesis at a 0.999% confidence interval. The R-square value of 0.5906 and 0.6039 shows a fair amount of strength between the model and the response variable.

Plotting Tap_In volumes to Residential Population

TapInPlot <- ggplot(data= TapInLM.model, aes(x = Residential, y= Tap_in)) +
  geom_point(color='blue',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapInLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-in Volume Count")
  

TapInPlot

Plotting Tap_out volumes to Residential Population

TapOutPlot <- ggplot(data= TapOutLM.model, aes(x = Residential, y= Tap_out)) +
  geom_point(color='red',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapOutLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-out Volume Count")

TapOutPlot

Plotting Residuals

par(mfrow=c(1,2))
plot(TapInLM.model$fitted.values, TapInLM.model$residuals, pch = 8, col = "blue", xlab = "Predict Tap-in Values", ylab = "Residual Values (Tap-in)")
plot(TapOutLM.model$fitted.values, TapOutLM.model$residuals, pch = 8, col = "red",  xlab = "Predict Tap-in Values", ylab = "Residual Values (Tap-in)")

When observing the residuals, the concentration of data points in the low residential amounts makes this hard to interpret. However, we can see there is some degree of heteroscedasticity, which could suggest a missing variable or a variable that requires transformation, however it is unclear at this time. There also appear to be some outlier values observed in the distribution. This could be due to the point raised earlier about places such as Woodlands Regional Center.

Part 2: Spatial Autocorrelation Analysis on residuals

In the next part, we will append the residuals back to our Simple Features Dataframe and check if the randomization assumption holds true. This is to help us check that the relationship established in our previous linear model holds true or if there is a geospatial problem in the data generating process.

Step 1: Adding residuals and predicted values to Simple Feature Dataframe

It is important to ensure that the indexes from your linear model are the same as that of our data-frame. Usually the safer way would be to create an index for both, however, for simplicity we will just check manually and see that the data from our linear model and Simple Feature Dataframe align.

Appending using tibble functions

ResidualFitted_sf <- SummarizedMasterSubzone_sf

ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapIn = residuals(TapInLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapOut = residuals(TapOutLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapIn = predict(TapInLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapOut = predict(TapOutLM.model), .before = "geometry")

Step 2: Mapping out our Residual Values

Mapping using tmaps

ResidualTapInmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ResidualTapIn", 
              n = 5,
              style="jenks",
              palette = "RdYlGn",
          title = "Residual Tap-in") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

ResidualTapOutmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ResidualTapOut", 
              n = 5,
              style="jenks",
              palette = "RdYlBu",
          title = "Residual Tap-out") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

tmap_arrange(ResidualTapInmap, ResidualTapOutmap, asp=1, ncol=2)

Step 3: Forming Weight Matrixes using Contiguity-based neighbours

Creating (QUEEN) contiguity based neighbours

wm_q <- poly2nb(ResidualFitted_sf, queen=TRUE)
summary(wm_q)
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.975618 
## Average number of links: 6.065147 
## Link number distribution:
## 
##  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  6 10 28 78 78 49 37 14  2  2  1  1  1 
## 6 least connected regions:
## 2 42 133 172 230 277 with 2 links
## 1 most connected region:
## 40 with 17 links

Creating (ROOK) contiguity based neighbours

wm_r <- poly2nb(ResidualFitted_sf, queen=FALSE)
summary(wm_r)
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1614 
## Percentage nonzero weights: 1.712485 
## Average number of links: 5.257329 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 13 14 
##  1  5 23 70 92 62 29 17  4  2  1  1 
## 1 least connected region:
## 172 with 1 link
## 1 most connected region:
## 40 with 14 links

Visualizing (QUEEN) and (ROOK) contiguity based neighbours

centroids <- sf::st_centroid(ResidualFitted_sf$geometry) #finding centroids for our polygons

par(mfrow=c(1,2))
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Queen Contiguity", asp=1)
plot(wm_q, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col= "red" )
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Rook Contiguity", asp =1)
plot(wm_r, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col = "red")

Analysis of contiguity methods

Plotting bus stop locations on Planning Area Subzone bountaries

tm_shape(PlanningSubzone_sf)+
  tm_fill(col="SUBZONE_N", palette = "Blues",
          title = "Subzone Boundaries", legend.show = FALSE) +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 1) +
  tmap_style("white") +
  tm_credits("Source: Planning Area Sub-zone boundary MP2014 from Urban Redevelopment Authorithy (URA)", 
             position = c("left", "bottom"))+
  tm_shape(busstoplocations_sf)+
  tm_dots(size = 0.03)

By looking at the plots and summaries of both the Rooks and Queen Contiguity, we can see that there is a large difference between both methodologies. A total difference of 258 links. By looking at the map, we can see that this is very evident in particular to the north and north-east part of the map.

With reference to bus routes in Singapore and locations in the above plot, it is often they move along the edges and corners of planning area subzones, thus the Queen method for continguity would be more applicable for our study as they cross over point-connections between polygons.

Source: https://wiki.smu.edu.sg/1415T2is415/File:IS415_2014-15_Term2_Assign1_ktchan.2011_Web_map_overview.png

Step 4: Forming Weight Matrixes using Fixed Distance-based Neighbours

Double check dataset

st_crs(ResidualFitted_sf)
## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCS["SVY21 / Singapore TM",
##     GEOGCS["SVY21",
##         DATUM["SVY21",
##             SPHEROID["WGS 84",6378137,298.257223563,
##                 AUTHORITY["EPSG","7030"]],
##             AUTHORITY["EPSG","6757"]],
##         PRIMEM["Greenwich",0,
##             AUTHORITY["EPSG","8901"]],
##         UNIT["degree",0.0174532925199433,
##             AUTHORITY["EPSG","9122"]],
##         AUTHORITY["EPSG","4757"]],
##     PROJECTION["Transverse_Mercator"],
##     PARAMETER["latitude_of_origin",1.366666666666667],
##     PARAMETER["central_meridian",103.8333333333333],
##     PARAMETER["scale_factor",1],
##     PARAMETER["false_easting",28001.642],
##     PARAMETER["false_northing",38744.572],
##     UNIT["metre",1,
##         AUTHORITY["EPSG","9001"]],
##     AUTHORITY["EPSG","3414"]]

Since we know the EPSG is 3414, our distance will be in metres. Thus it should be no surprise if we have large values compared to other data-sets. Additionally the coordinates are in latitude and longitude which means we will need to be careful when mapping our coordinates or using functions that utilize them.

Determining the appropriate cut-off distance.

coords <- st_coordinates(centroids) #taken from the previous Step 3.

k1 <- knn2nb(knearneigh(coords))

k1dists <- unlist(nbdists(k1, coords, longlat = FALSE))
summary(k1dists)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   182.5   616.3   887.5   937.2  1169.8  5404.1

Unfortunately, it would seem that the distance values vary wildly between zones.

This may be due to centroids for large polygons such as those in the Western Water Catchment Area, Central Water Catchment Area and Changi Airport cover much larger areas compared to places such as Telok Blangah Rise. Thus a Fixed Distance Weight Matrix would be not very useful for us to utilize.

Step 5: Computing adaptive distance weight matrix

For the purpose of our analysis, we’ll be using a set of 6 nearest neighbors as it is the average number of links indicated during our Queen Contiguity based Neighbors analysis.

Step 6: Deciding on a weight matrix

While both the Adaptive Distance Weight Matrix and Rook Contiguity-based Neighbors seem promising, it would be more appropriate to utilize the Queen Continguity-based Neighbors as our ideal choice. These are due to the following points:

  • The size of polygons have large variation causing issues in using Fixed-Distance based weight Matrixes

  • The number of neighbors around the subzones differ from 1-17 if considering the Queen Contiguity-based neighbors, thus setting an arbitrary number of neighbors will result on some subzones over-reaching neighboring zones and more importantly, not taking into account certain neighbors when required to for auto-correlation.

  • Since we’re observing commuter’s volumes, there is no reason for us to believe that bus routes will only go through sides which are larger than a point, in fact some bus routes seem to pass through the corners of the subzones. Residents may also travel out of subzones to take the bus in another subzone thus there is no reason for Rook Contiguity unless there is a computational constraint.

Thus for the next phase of analysis, we will only be utilizing the Queen Contiguity-based neighbors in our spatial auto-correlation analysis.

Step 7: Weights based on the Inverse Distance Method using Queen Contiguity-based Neighours

Deriving a spatial weight matrix based on Inversed Distance method using the Queen method.

dist <- nbdists(wm_q, coords, longlat = FALSE)
ids <- lapply(dist, function(x) 1/(x))

Row-standardised weights matrix

rswm_q <- nb2listw(wm_q, style="W", zero.policy = TRUE)
rswm_q
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.975618 
## Average number of links: 6.065147 
## 
## Weights style: W 
## Weights constants summary:
##     n    nn  S0       S1       S2
## W 307 94249 307 106.4054 1261.803

Verifying weights have been appropriatly applied.

wm_q[[1]]
## [1] 217 218 220 221 222 264
rswm_q$weights[1]
## [[1]]
## [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

Looks like they match up.

Step 8: Perform Moran’s I test on both Tap-in residual and Tap-out residual dataset

Performing Moran’s I test for Tap-in Residual

moran.test(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ResidualTapIn  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = -1.0545, p-value = 0.8542
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.037755733      -0.003267974       0.001069537

Performing Moran’s I test for Tap-out Residual

moran.test(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ResidualTapOut  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = -0.50383, p-value = 0.6928
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.019809059      -0.003267974       0.001077864

Interpretation of Moran’s I test

Based on our results, we can see that the p-values resulting from both tests are relatively high, thus we are unable to reject the null hypothesis that the results generated are random even at a 90% confidence level. This promotes that the relationships seen in our linear regression models are indeed true.

The coefficient for Moran’s I in this case is inapplicable.

Step 9: Robustness Testing

Using Monte Carlo Moran’s I

For robustness testing, we use Monte Carlo to check if our results hold.

Computing Monte-Carlo Moran’s I for Residual Tap-in

set.seed(1234)
MCMoranTapIn= moran.mc(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapIn
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ResidualFitted_sf$ResidualTapIn 
## weights: rswm_q  
## number of simulations + 1: 1000 
## 
## statistic = -0.037756, observed rank = 157, p-value = 0.843
## alternative hypothesis: greater

Computing Monte-Carlo Moran’s I for Residual Tap-out

set.seed(1234)
MCMoranTapOut= moran.mc(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapOut
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ResidualFitted_sf$ResidualTapOut 
## weights: rswm_q  
## number of simulations + 1: 1000 
## 
## statistic = -0.019809, observed rank = 309, p-value = 0.691
## alternative hypothesis: greater

Step 10: Interpretation of robustness testing

After testing using Monte-Carlo Moran’s I, we can safely say our results appear consistent and are promising as all are unable to reject the null hypothesis.

Conclusion of analysis

Overall we can be confident in our results that the relationship between Residential Population in URA 2014 Planning Area is positively correlated with Commuters’ volumes by bus. This is due to low p-values and the randomization assumption of residual values holding true for both linear modelling as well as geospatial auto-correlation.

Part 3: Localized Geospatial Statistical Analysis

Step 0: Converting to SpatialPolygonDataFrame

For the purpose of this Analysis, we will be converting the Simple Feature DataFrame back into a SpatialPolygoneDataFrame

Utilizing the as_Spatial() function from the sp package

SG_Commuter_vol_df <- as_Spatial(ResidualFitted_sf)

Step 1: Computing local Moran’s I for Cluster and Outlier Analysis

Computing Local Moran’s I for Tap-in volumes

subzone_labels <- order(SG_Commuter_vol_df$Subzone)
localMITapIn <- localmoran(SG_Commuter_vol_df$Tap_in, rswm_q)

#printCoefmat(data.frame(localMITapIn[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)

Computing Local Moran’s I for Tap-out volumes

localMITapOut <- localmoran(SG_Commuter_vol_df$Tap_out, rswm_q)

#printCoefmat(data.frame(localMITapOut[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)

Appending local Moran’s I and P-values to Simple Feature Dataframe

SG_Commuter_vol_df.localMITapIn <- cbind(SG_Commuter_vol_df,localMITapIn)
SG_Commuter_vol_df.localMITapOut <- cbind(SG_Commuter_vol_df,localMITapOut)

Step 2: Mapping Local Moran’s I

Visualizing local Moran’s I for Tap-in values

TapInlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "Tap-in Local MI Statistics") +
  tm_borders(alpha = 0.5)

TapInPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "Pr.z...0.", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Blues", 
          title = "Tap-in Local MI p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapInlocalMI.map, TapInPvalue.map, asp=1, ncol=2)
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Visualizing local Moran’s I for Tap-out values

TapOutlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "Tap-out Local MI Statistics") +
  tm_borders(alpha = 0.5)

TapOutPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "Pr.z...0.", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Reds", 
          title = "Tap-out Local MI p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapOutlocalMI.map, TapOutPvalue.map, asp=1, ncol=2)
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Intepretation of Local Moran Statistics

Based on the map results, we can see that there are several clusters for both Tap-in and Tap-out specifically around the Tampines East, Tampines West and Bedok North subzones. There also appear to be several outliers, however they have low p-values thus we will not take them into consideration.

Step 3: Creating LISA Cluster Map

Creating Moran Scatter Plot with standardized Tap-in Values

SG_Commuter_vol_df.localMITapIn$Z.Tap_In <- scale(SG_Commuter_vol_df.localMITapIn$Tap_in) %>% as.vector 

TapInLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapIn$Z.Tap_In, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapIn$Subzone), xlab="z Tap-in Volumes", ylab="Spatially Lag z Tap-in Volumes")

Creating Moran Scatter Plot with standardized Tap-out Values

SG_Commuter_vol_df.localMITapOut$Z.Tap_out <-scale(SG_Commuter_vol_df.localMITapOut$Tap_out) %>% as.vector 

TapOutLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapOut$Z.Tap_out, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapOut$Subzone), xlab="z Tap-out Volumes", ylab="Spatially Lag z Tap-out Volumes")

Build Quadrants for Local MI Tap-in values

TapInquadrant <- vector(mode="numeric",length=nrow(localMITapIn))
TapInDV <- SG_Commuter_vol_df$Tap_in - mean(SG_Commuter_vol_df$Tap_in)     
TapInC_mI <- localMITapIn[,1] - mean(localMITapIn[,1])    
signif <- 0.05       
TapInquadrant[TapInDV >0 & TapInC_mI>0] <- 4      
TapInquadrant[TapInDV <0 & TapInC_mI<0] <- 1      
TapInquadrant[TapInDV <0 & TapInC_mI>0] <- 2
TapInquadrant[TapInDV >0 & TapInC_mI<0] <- 3
TapInquadrant[localMITapIn[,5]>signif] <- 0

Build Quadrants for Local MI Tap-out values

TapOutquadrant <- vector(mode="numeric",length=nrow(localMITapOut))
TapOutDV <- SG_Commuter_vol_df$Tap_out - mean(SG_Commuter_vol_df$Tap_out)     
TapOutC_mI <- localMITapOut[,1] - mean(localMITapOut[,1])    
signif <- 0.05       
TapOutquadrant[TapOutDV >0 & TapOutC_mI>0] <- 4      
TapOutquadrant[TapOutDV <0 & TapOutC_mI<0] <- 1      
TapOutquadrant[TapOutDV <0 & TapOutC_mI>0] <- 2
TapOutquadrant[TapOutDV >0 & TapOutC_mI<0] <- 3
TapOutquadrant[localMITapOut[,5]>signif] <- 0

Constructing LISA Maps for Both Tap-in and Tap-out Local Moran’s I

colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")
clusters <- c("Insignificant", "Low-Low", "Low-High", "High-Low", "High-High")


SG_Commuter_vol_df.localMITapIn$TapInquadrant <- TapInquadrant

LISATapIn <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "TapInquadrant", style = "cat", title="Tap-in Classification", palette = colors[c(sort(unique(TapInquadrant)))+1], labels = clusters[c(sort(unique(TapInquadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

SG_Commuter_vol_df.localMITapOut$TapOutquadrant <- TapOutquadrant

LISATapOut <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "TapOutquadrant", style = "cat", title="Tap-out Classification", palette = colors[c(sort(unique(TapOutquadrant)))+1], labels = clusters[c(sort(unique(TapOutquadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

tmap_arrange(LISATapIn, LISATapOut, asp=1, ncol=2)

Intepretation of LISA Map

Based on the LISA map, we can see several clusters which primarily align with our initial observations of the Local Moran’s I scores in the previous Step. However, there appears to be some subzones in the north-east and west that fall within the High-High quadrants indicating positive auto-correlations with neighboring subzones.

Step 4: Hot Spot and Cold Spot Area Analysis using Gi Statistics

For the purpose of this analysis, we will only be using Adaptive Distance Weight Matrix for reasons explained in the earlier Part 2 on the nature of the polygons we are dealing with. We will take the same clusters we set in the previous section.

Computing Gi Statistics

wm_knn6_lw <- nb2listw(wm_knn6, style = 'B') #wm_knn6 taken from previous Part 2

TapIngi.adaptive <- localG(SG_Commuter_vol_df$Tap_in, wm_knn6_lw)
TapInSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapIngi.adaptive))
names(TapInSG.gi)[length(names(TapInSG.gi))] <- "gstat_adaptive"

TapOutgi.adaptive <- localG(SG_Commuter_vol_df$Tap_out, wm_knn6_lw)
TapOutSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapOutgi.adaptive))
names(TapOutSG.gi)[length(names(TapOutSG.gi))] <- "gstat_adaptive"

Mapping out Getis-Ord Gi Statistics

TapIngiMap <- tm_shape(TapInSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "jenks",
          palette = "-RdBu",
          n=5,
          title = "Tap-in local Gi") +
  tm_borders(alpha = 0.5)

TapOutgiMap <- tm_shape(TapOutSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "jenks",
          palette = "-RdBu",
          n=5,
          title = "Tap-out local Gi") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapIngiMap, TapOutgiMap, asp=1, ncol=2)
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Interpretation of GI Statistic Maps

Based on the results, we see hot-spots in areas which we had originally identified in our previous LISA Map. There is also a surprising cold spot towards the west towards Tuas. This may be due to the nature of how k-nearestneighbours works and specifically how Tuas View Extension subzone is only really adjacent to 2 subzones but was coerced using the Adaptive Distance Weight Matrix to consider more zones.

Overall Conclusion

Overall there appears to be some clustering for both Tap-in and Tap-out around the Tampines East subzone with the Local Moran’s I Analysis, LISA Mapping and Gi Statistical all highlighting the zone or some of it’s surrounding areas as clusters or hot spots. This aligns with the population map done in Part 1 which shows the area as relatively higher populated than others although we had not done any geospatial analytics on population distributions.

Part 4: Study limitations and assumptions

There are several variables which are not included within the statistical analysis above that must be considered when examining the scope of the study.

Part 5: BONUS: Analysing time effects of commuters’ volumes

In this section, we’re interested if there is a time-variation factor that could affect the spatial distribution of our commuters’ volumes. Intuition is that during morning peak periods, people are travelling from home in residential zones to work in non-residential zones. The opposite should be true for evening peak periods. We’re looking to see if we segment the data according to these two peak periods if there is a significant difference in terms of geospatial distribution.

Step 1: EDA on bus volumes to find peak periods

HourStats <- sgpassengervolume %>% 
  filter(DAY_TYPE == "WEEKDAY") %>%
  group_by(TIME_PER_HOUR) %>%
  summarise(Tap_InStat = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_outStat = sum(TOTAL_TAP_OUT_VOLUME))

Bar Plotting Tap In and Tap Out

ggplot(data= HourStats,aes(x = TIME_PER_HOUR, y= Tap_InStat)) + geom_bar(stat="identity", position = "dodge")

ggplot(data= HourStats,aes(x = TIME_PER_HOUR, y= Tap_outStat)) + geom_bar(stat="identity", position = "dodge")

Based on the data above, we can see there are 2 distinct peak periods for both Tap-in and Tap-out. We are going to see if there is a geospatial difference in these periods. We will use these peak periods as our two sets of data both within 3 hour spans of 6-8am and 5-7pm.

Assuming these are due to the work hour peak timings, we will further limit our data to “WEEKDAY” which are work days.

Step 2: Creating Morning and Evening DataSet

Morning Data Set

TotalbusstopvolMorning <- sgpassengervolume %>% 
  filter(TIME_PER_HOUR %in% (6:8), DAY_TYPE == "WEEKDAY") %>%
  group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
  summarise(Tap_InMorning = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_outMorning = sum(TOTAL_TAP_OUT_VOLUME)) #be sure to check data

Evening Data Set

TotalbusstopvolEvening <- sgpassengervolume %>% 
  filter(TIME_PER_HOUR %in% (17:19), DAY_TYPE == "WEEKDAY") %>%
  group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
  summarise(Tap_InEvening = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_outEvening = sum(TOTAL_TAP_OUT_VOLUME)) #check Day type and time

Validity check 1

TotalbusstopvolEvening[rowSums(is.na(TotalbusstopvolEvening))!=0,]
## # A tibble: 0 x 3
## # ... with 3 variables: BUS_STOP_N <chr>, Tap_InEvening <dbl>,
## #   Tap_outEvening <dbl>

Cleared of NA Values

Step 3: Create New Simple Feature DataFrame

TimeVar_sf <- left_join(busstoplocations_sf,TotalbusstopvolMorning) #we already renamed the variable to match so there is no need to specify the joining factor
## Joining, by = "BUS_STOP_N"
TimeVar_sf <- left_join(TimeVar_sf,TotalbusstopvolEvening) #we already renamed the variable to match so there is no need to specify the joining factor
## Joining, by = "BUS_STOP_N"

Validity check 1

TimeVar_sf[rowSums(is.na(TimeVar_sf))!=0,]
## Simple feature collection with 125 features and 9 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 7702.752 ymin: 27733.06 xmax: 41244.44 ymax: 49261.6
## CRS:            EPSG:3414
## First 10 features:
##     BUS_STOP_N BUS_ROOF_N                 LOC_DESC Tap_In Tap_out Tap_InMorning
## 14       49249        B08          DRAGON FISH IND    721     691            NA
## 178      91121        NIL                  HSE 189      0       0            NA
## 256      44401        NIL              OPP BLK 183      0       0            NA
## 275      91101        B03        TG KATONG FLYOVER     24      54            NA
## 276      49209        B01 SG BULOH WETLAND RESERVE   1413    1118            NA
## 344      31121        B40     NIRVANA MEMORIAL GDN      0       0            NA
## 372      45391        NIL                  BLK 657      0       9            NA
## 374      49239        B06      OPP NYEE PHOE GROUP    369     222            NA
## 432      93171        B15 OPP ISLAND RESORT S'PORE    542     206            NA
## 461      23381       B01A              PLANT ENGRG      0       0            NA
##     Tap_outMorning Tap_InEvening Tap_outEvening                  geometry
## 14              NA            NA             NA POINT (16000.18 46531.64)
## 178             NA            NA             NA POINT (35030.24 31256.69)
## 256             NA            NA             NA POINT (20144.74 40279.55)
## 275             NA            NA             NA POINT (35058.96 30937.84)
## 276             NA            NA             NA POINT (16482.29 47421.34)
## 344             NA            NA             NA  POINT (11838.8 39157.43)
## 372             NA            NA             NA  POINT (18697.5 42446.09)
## 374             NA            NA             NA POINT (16274.91 46921.07)
## 432             NA            NA             NA  POINT (38108.3 31937.69)
## 461             NA            NA             NA POINT (12552.18 32234.11)

Replace NA Values with 0

TimeVar_sf$Tap_InMorning[is.na(TimeVar_sf$Tap_InMorning)] <- 0
TimeVar_sf$Tap_outMorning[is.na(TimeVar_sf$Tap_outMorning)] <- 0
TimeVar_sf$Tap_InEvening[is.na(TimeVar_sf$Tap_InEvening)] <- 0
TimeVar_sf$Tap_outEvening[is.na(TimeVar_sf$Tap_outEvening)] <- 0

Step 4:Joining data to Population and Planning Area subzone polygons

TimeVarSubzone_sf <- st_join(PlanningSubzone_sf,TimeVar_sf, join=st_intersects) #using st_intersects to join bus stops that are either within or touching the polygons.

Replacing NA values

TimeVarSubzone_sf$Tap_In[is.na(TimeVarSubzone_sf$Tap_In)] <- 0
TimeVarSubzone_sf$Tap_out[is.na(TimeVarSubzone_sf$Tap_out)] <- 0
TimeVarSubzone_sf$Tap_InMorning[is.na(TimeVarSubzone_sf$Tap_InMorning)] <- 0
TimeVarSubzone_sf$Tap_outMorning[is.na(TimeVarSubzone_sf$Tap_outMorning)] <- 0
TimeVarSubzone_sf$Tap_InEvening[is.na(TimeVarSubzone_sf$Tap_InEvening)] <- 0
TimeVarSubzone_sf$Tap_outEvening[is.na(TimeVarSubzone_sf$Tap_outEvening)] <- 0
TimeVarSubzone_sf$BUS_ROOF_N[is.na(TimeVarSubzone_sf$BUS_ROOF_N)] <- "UNKNOWN"
TimeVarSubzone_sf$BUS_STOP_N[is.na(TimeVarSubzone_sf$BUS_STOP_N)] <- "UNKNOWN"
TimeVarSubzone_sf$LOC_DESC[is.na(TimeVarSubzone_sf$LOC_DESC)] <- "UNKNOWN"

Step 5: Aggregation of data

Aggregating Tap-in and Tap-out/ Morning and Evening Values to subzone and only keeping in relevant features.

SummarizedTimeVar_sf <- TimeVarSubzone_sf %>%
  filter(Tap_In!= 0 | Tap_out != 0 | resident_count != 0 ) %>% # Filters do not change in this case. 
  group_by(Subzone = SUBZONE_N) %>%
  summarise(Residential = as.numeric(first(resident_count)), Tap_in = sum(Tap_In), Tap_out = sum(Tap_out), Tap_inMorning = sum(Tap_InMorning), Tap_outMorning = sum(Tap_outMorning), Tap_inEvening = sum(Tap_InEvening), Tap_outEvening = sum(Tap_outEvening) )

Step 6: Plotting Data

Plotting four thematic maps

TapInMorningMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = c("Tap_inMorning"),
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Morning Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


TapOutMorningMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = c("Tap_outMorning"),
              n = 5,
              style="jenks",
              palette = "Reds",
          title = "Morning Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))

TapInEveningMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = c("Tap_inEvening"),
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Evening Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


TapOutEveningMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = c("Tap_outEvening"),
              n = 5,
              style="jenks",
              palette = "Reds",
          title = "Evening Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))

tmap_arrange(TapInMorningMap, TapOutMorningMap, TapInEveningMap, TapOutEveningMap,asp=1, ncol=2, nrow = 2)

Upon initial glance, it is hard to spot any drastic shifts between the four different zones with the exception of Changi Airport having significantly higher Tap-out volumes in the morning. One thing we could examine is the difference in morning versus evening for both and plot another map out for those two.

Calculating Difference between time periods.

Note the calculation we are making here is (Morning - Evening) Values

SummarizedTimeVar_sf <- add_column(SummarizedTimeVar_sf, TapInDifference = SummarizedTimeVar_sf$Tap_inMorning - SummarizedTimeVar_sf$Tap_inEvening, .before = "geometry")

SummarizedTimeVar_sf <- add_column(SummarizedTimeVar_sf, TapOutDifference = SummarizedTimeVar_sf$Tap_outMorning - SummarizedTimeVar_sf$Tap_outEvening, .before = "geometry")

Plotting the mapping between time periods

TapInDifferenceMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = "TapInDifference",
              n = 5,
              style="jenks",
              palette = "RdBu",
          title = "Difference in Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


TapOutDifferenceMap <- tm_shape(SummarizedTimeVar_sf)+
  tm_fill(col = "TapOutDifference",
              n = 5,
              style="jenks",
              palette = "PRGn",
          title = "Difference in Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))

tmap_arrange(TapInDifferenceMap, TapOutDifferenceMap ,asp=1, ncol=2)

It seems there are some distinct differences based on the region, however it is hard to be sure what this map actually tells us in terms of clustering and its relationship to resident population in those subzones. We will need to plot the data on a graph to give us a better idea on how these values interact.

Performing Simple linear regression for Tap-in

TapInDiffLM.model <- lm(TapInDifference ~ Residential, data = SummarizedTimeVar_sf)
summary(TapInDiffLM.model)
## 
## Call:
## lm(formula = TapInDifference ~ Residential, data = SummarizedTimeVar_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -135341  -10496    3849   13548  104149 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.355e+04  2.296e+03  -5.901 9.62e-09 ***
## Residential  1.170e+00  1.075e-01  10.884  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32570 on 305 degrees of freedom
## Multiple R-squared:  0.2797, Adjusted R-squared:  0.2774 
## F-statistic: 118.5 on 1 and 305 DF,  p-value: < 2.2e-16

Plotting Tap-in differences to Residential Population

TapInDiffPlot <- ggplot(data= TapInDiffLM.model, aes(x = Residential, y= TapInDifference)) +
  geom_point(color='blue',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapInDiffLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-in Difference")

TapInDiffPlot

Interpretation:

Although our linear model seems to indicate a positive relationship between the resident population and the difference between morning and evening peak our Tap-in volumes, it is hard to draw a conclusion as a large amount of data is skewed to the left with the low residential numbers. Thus we must look for a different method

Step 7: Hot Spot and Cold Spot Area Analysis using Gi Statistics

For the purpose of this analysis, we will only be using Adaptive Distance Weight Matrix created in the earlier Part 3.

Converting from Simple Features DataFrame to SpatialPolygonDataFrame

SG_Commuter_Diff_df <- as_Spatial(SummarizedTimeVar_sf)

Computing Gi Statistics

TapInDiffgi.adaptive <- localG(SG_Commuter_Diff_df$TapInDifference, wm_knn6_lw)
TapInDiffSG.gi <- cbind(SG_Commuter_Diff_df, as.matrix(TapInDiffgi.adaptive))
names(TapInDiffSG.gi)[length(names(TapInDiffSG.gi))] <- "gstat_adaptive"

TapOutDiffgi.adaptive <- localG(SG_Commuter_Diff_df$TapOutDifference, wm_knn6_lw)
TapOutDiffSG.gi <- cbind(SG_Commuter_Diff_df, as.matrix(TapOutDiffgi.adaptive))
names(TapOutDiffSG.gi)[length(names(TapOutDiffSG.gi))] <- "gstat_adaptive"

Mapping out Getis-Ord Gi Statistics

TapInDiffgiMap <- tm_shape(TapInDiffSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "jenks",
          palette = "-RdBu",
          n=5,
          title = "Tap-in local Gi") +
  tm_borders(alpha = 0.5)

TapOutDiffgiMap <- tm_shape(TapOutDiffSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "jenks",
          palette = "-RdBu",
          n=5,
          title = "Tap-out local Gi") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapInDiffgiMap, TapOutDiffgiMap, asp=1, ncol=2)
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Interpretation of Gi statistics

Here we can see something very interesting. When looking at the Tap-in Local Gi map, we can see that there are hot spots primarily in the northern areas of Singapore while cold spots exist primarily towards the south or downtown Singapore. These hot spots indicate more people tapping in in the mornings than in the evening, possibly the working crowd travelling downtown to work. The results for the Tap-out Local Gi also seem to support there results from the Tap-in Local Gi. Interestingly the Tuas and lower Jurong area are hot spots for the tap-out values which lines up with the idea that more people arrive for work and tap-out in the morning than in the evening.

Step 8: Concluding Remarks

Overall, this study shows that time-variations do play a factor in commuter flow within Singapore, particularly on a geospatial level from North-to-South in the morning peak hours and vice-versa in the evening peak hours. This should be factored in when considering utilizing the bus Tap-in and Tap-out volumes for analysis. Further more, this study could help influence URA’s developmental policies on commercial and residential areas around Singapore in observing it’s effectiveness in implementing these zones.