Take Home Assignment 1 for IS415 - Geospatial Analytics and Applications - G10

Overview

This exercise examines national development and public service issues using data sets from various government agencies in Singapore. The assignment is meant to demonstrate basic geospatial data wrangling, geospatial analytics and geovisualization in R. For the purpose of this assignment, we will be applying course material to achieve 3 specific objectives.

For the purpose of this study, the passenger volume by bus stop data set from the Land Transport Authority (LTA) has been provided. This data set was extracted using the dynamic API provided at LTA DataMall. The remaining data sets come from the government open data portal.

Part 0: Setup

For the purpose of this assignment, we will be using packages taught in the course. In particular, we will be utilizing rgdal, sf, spdep, tmap and tidyverse.

Loading in required packages

packages = c('rgdal', 'sf', 'spdep', 'tmap', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Part 1: Geospatial Data Wrangling & Linear Regression Model

Step 1: Defining the scope of our task

Loading in the passenger volume dataset

sgpassengervolume <- read_csv("data/aspatial/passenger volume by busstop.csv")
## Parsed with column specification:
## cols(
##   YEAR_MONTH = col_character(),
##   DAY_TYPE = col_character(),
##   TIME_PER_HOUR = col_double(),
##   PT_TYPE = col_character(),
##   PT_CODE = col_character(),
##   TOTAL_TAP_IN_VOLUME = col_double(),
##   TOTAL_TAP_OUT_VOLUME = col_double()
## )

Analysis of dataset

Based on the passenger volume by bus stop data set provided, we are examining passenger flows for the month of January 2020. As required by the assignment, we need to aggregate the data at the planning subzone level in order to fit a simple linear model relating it to the residential population in the same zones.

Aggregating dataset by Bus Stop Number

For our purpose, we do not care about time-of-day or day-type differences in the data set; we only care about volumes at each bus stop location. We therefore need a new data frame summed by bus stop number.

Totalbusstopvol <- sgpassengervolume %>% 
  group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
  summarise(Tap_In = sum(TOTAL_TAP_IN_VOLUME), 
            Tap_out = sum(TOTAL_TAP_OUT_VOLUME))

Step 2: Loading in Geospatial Data for bus locations

We will need a reference data set that lets us map the bus stop locations to the planning subzone level. For that purpose we will be utilizing the Bus Stop Location shapefile from the LTA DataMall. Source: https://www.mytransport.sg/content/mytransport/home/dataMall/static-data.html#Whole%20Island

Loading in Bus Stop Location Geospatial Data

We’ll load it in as a simple feature data frame. Since we know it comes from a Singapore source, the data is already projected in SVY21, so no actual reprojection occurs; we call st_transform() to make sure the CRS is correctly set to EPSG:3414, the code corresponding to Singapore.

busstoplocations_sf <- readOGR(dsn = "data/geospatial/BusStopLocation_Jan2020", layer = "BusStop")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\data\GSA\Take_Home_EX01\data\geospatial\BusStopLocation_Jan2020", layer: "BusStop"
## with 5040 features
## It has 3 fields
busstoplocations_sf <- st_as_sf(busstoplocations_sf)
busstoplocations_sf <- st_transform(busstoplocations_sf, crs= 3414)

#st_crs(busstoplocations_sf)
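As a side note, the same load-and-project step can be done in one pass with sf alone, without going through rgdal (a minimal sketch, assuming the shapefile carries a valid .prj file):

#Alternative one-step load using sf only
busstoplocations_sf <- st_read(dsn = "data/geospatial/BusStopLocation_Jan2020",
                               layer = "BusStop") %>%
  st_transform(crs = 3414)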

Replace NA values in LOC_DESC and BUS_ROOF_N with Unknown

busstoplocations_sf$LOC_DESC[is.na(busstoplocations_sf$LOC_DESC)] <- "UNKNOWN"
busstoplocations_sf$BUS_ROOF_N[is.na(busstoplocations_sf$BUS_ROOF_N)] <- "UNKNOWN"

Joining Commuter’s volume to Bus Stop Location dataset

Next we need to attach our passenger volume data to our simple feature data frame.

busstoplocations_sf <- left_join(busstoplocations_sf,Totalbusstopvol) #we already renamed the variable to match so there is no need to specify the joining factor
## Joining, by = "BUS_STOP_N"

Replace NA with 0

busstoplocations_sf$Tap_In[is.na(busstoplocations_sf$Tap_In)] <- 0
busstoplocations_sf$Tap_out[is.na(busstoplocations_sf$Tap_out)] <- 0
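The same NA handling can also be expressed with tidyr’s replace_na(), loaded as part of the tidyverse (a minimal sketch equivalent to the two lines above):

#Alternative NA replacement using tidyr
busstoplocations_sf <- busstoplocations_sf %>%
  replace_na(list(Tap_In = 0, Tap_out = 0))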

Step 3: Loading in Planning Area Subzone Geospatial Data and Population Data

In order to perform our regression, we need population information at the planning subzone level. We will utilize the “Singapore Residents by Subzone and Type of Dwelling, 2011 - 2019” data set and use only the 2019 figures to get the latest data.

Source: https://data.gov.sg/dataset/Singapore-residents-by-subzone-and-type-of-dwelling-2011-2019

To turn this into geospatial data, we need to match it against the corresponding subzone boundaries. According to the website, this information is still mapped to the Master Plan 2014 subzones, so we will use those boundaries even though an updated 2019 Master Plan exists. For this exercise, we will use the shapefile rather than the KML files also provided.

Source: https://data.gov.sg/dataset/master-plan-2014-subzone-boundary-web

Loading in Population Data

For the purpose of our analysis, we will use only the 2019 data.

NOTE: For the purpose of our analysis, in order to reduce errors in our linear regression model later on, we will remove the youngest age group. This is because children below the age of 7 and under a certain height criterion do not tap in or out on the bus, so including this data would add noise and increase the errors in our linear regression model.

Source: https://www.sbstransit.com.sg/fares-and-concessions

PopData <- read_csv("data/aspatial/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv") %>%
  filter(year==2019) %>% #filter only by 2019 data
  filter(age_group != "0_to_4") #removing age group 0-4, the closest available band to the under-7 group that travels for free
## Parsed with column specification:
## cols(
##   planning_area = col_character(),
##   subzone = col_character(),
##   age_group = col_character(),
##   sex = col_character(),
##   type_of_dwelling = col_character(),
##   resident_count = col_double(),
##   year = col_double()
## )

Aggregating Resident Population by subzone

ResPopbySubzone <- PopData %>% 
  group_by(subzone) %>% #group records by subzone name
  summarise(resident_count= sum(resident_count))

ResPopbySubzone <- mutate_at(ResPopbySubzone, .vars = "subzone", .funs=toupper) #convert names to upper case, be sure to only do so for names.
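Note that mutate_at() has since been superseded in newer dplyr releases; the same conversion can be written with across() (a minimal sketch, assuming dplyr 1.0 or later):

#Equivalent conversion using across()
ResPopbySubzone <- ResPopbySubzone %>%
  mutate(across(subzone, toupper))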

Loading in Planning Area Subzone geospatial data

NOTE: Always check the CRS. In this case st_transform() does not change the polygons because the data set is already projected in SVY21, but it ensures our data is correctly formatted and labelled with EPSG:3414.

PlanningSubzone_sf <- readOGR(dsn = "data/geospatial/MP14_SUBZONE_WEB", layer = "MP14_SUBZONE_WEB_PL")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\data\GSA\Take_Home_EX01\data\geospatial\MP14_SUBZONE_WEB", layer: "MP14_SUBZONE_WEB_PL"
## with 323 features
## It has 15 fields
PlanningSubzone_sf <- st_as_sf(PlanningSubzone_sf)
PlanningSubzone_sf <- st_transform(PlanningSubzone_sf, crs= 3414)

#st_crs(PlanningSubzone_sf)

Joining residential data to the simple feature data frame

PlanningSubzone_sf <- left_join(PlanningSubzone_sf , ResPopbySubzone, by = c("SUBZONE_N" = "subzone"))

Quick mapping to show distribution of resident count in Singapore

tm_shape(PlanningSubzone_sf)+
  tm_fill(col = "resident_count",
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Resident Population Count") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Area Sub-zone boundary MP2014 from Urban Redevelopment Authorithy (URA)\n and Population data from Department of Statistics (DOS)", 
             position = c("left", "bottom"))

The map seems to align with the population distributions we have seen before, so we will accept that our data has been loaded in properly.

Step 4: Aggregation and filtering of relevant data

Aggregating Bus Stop Level data to Subzone Level Data

NOTE: At this point, we will need to consider what to do with subzones that have no Tap-in and Tap-out data or no residents.

For the purpose of our analysis, we will first remove zones where Tap-in, Tap-out and resident count are all equal to 0. These areas are unreachable by the general population, are likely irrelevant for our study, and may cause problems during the geospatial autocorrelation calculations later on.

For zones that meet one of these conditions but not all, we will keep the areas as they are still relevant to our scope of study. We will replace any missing values with 0.

However, it should be noted that, based on the data observed, there are areas with 0 residents but a high volume of Tap-ins and Tap-outs, which may be due to construction projects. The best example of this would be the Woodlands Regional Centre, which has disproportionately large bus volumes but no residential population because it is still under construction.

MasterSubzone_sf <- st_join(PlanningSubzone_sf,busstoplocations_sf, join=st_intersects) #using st_intersects to join bus stops that are either within or touching the polygons.

Replace NA with 0 and Unknown

MasterSubzone_sf$Tap_In[is.na(MasterSubzone_sf$Tap_In)] <- 0
MasterSubzone_sf$Tap_out[is.na(MasterSubzone_sf$Tap_out)] <- 0
MasterSubzone_sf$BUS_ROOF_N[is.na(MasterSubzone_sf$BUS_ROOF_N)] <- "UNKNOWN"
MasterSubzone_sf$BUS_STOP_N[is.na(MasterSubzone_sf$BUS_STOP_N)] <- "UNKNOWN"
MasterSubzone_sf$LOC_DESC[is.na(MasterSubzone_sf$LOC_DESC)] <- "UNKNOWN"

Aggregating Tap-in and Tap-out values to subzone level and keeping only the relevant features

SummarizedMasterSubzone_sf <- MasterSubzone_sf %>%
  filter(Tap_In!= 0 | Tap_out != 0 | resident_count != 0) %>% # we'll now  filter out any subzones with all values zero
  group_by(Subzone = SUBZONE_N) %>%
  summarise(Residential = as.numeric(first(resident_count)), Tap_in = sum(Tap_In), Tap_out = sum(Tap_out))
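As a quick sanity check on the point raised in Step 4 about zones with commuter volume but no residents, we can list such subzones (a minimal sketch; column names as created in the chunk above):

#Subzones with Tap-in volume but no resident population
SummarizedMasterSubzone_sf %>%
  st_drop_geometry() %>% #drop geometry for a plain tabular view
  filter(is.na(Residential) | Residential == 0, Tap_in > 0) %>%
  arrange(desc(Tap_in)) %>%
  head()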

Step 5: Plotting our summarized data

Now we need to inspect what polygons have remained and look at the distribution of our Tap-in and Tap-out data.

Plotting two quick thematic maps

CountTapInMap <- tm_shape(SummarizedMasterSubzone_sf)+
  tm_fill(col = c("Tap_in"),
              n = 5,
              style="jenks",
              palette = "Blues",
          title = "Tap-in Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")+
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


CountTapOutMap <- tm_shape(SummarizedMasterSubzone_sf)+
  tm_fill(col = c("Tap_out"),
              n = 5,
              style="jenks",
              palette = "Reds",
          title = "Tap-out Volume") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white") +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)", 
             position = c("left", "bottom"))


tmap_arrange(CountTapInMap, CountTapOutMap, asp=1, ncol=2)

We can see that during the aggregation process some of the islands, such as Jurong Island and Pulau Ubin, have disappeared. This is acceptable because these areas are inaccessible by bus, which allows us to concentrate our study on the relevant areas within mainland Singapore.

Step 6: Performing Linear Regression Model

For this exercise, we are interested in the relationship between the Tap-in and Tap-out volumes and the residential population, so we will use the lm() function to fit the linear regression. For the purpose of our study, we will take residential population as our X and Tap-in or Tap-out volume as our Y.

This is based on the fact that the residential population data was gathered in 2019, before the Tap-in and Tap-out data of January 2020, although we may be committing the post hoc ergo propter hoc fallacy.

Performing Linear Regression for Tap-in values

TapInLM.model <- lm(Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapInLM.model)
## 
## Call:
## lm(formula = Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -799481 -127198  -67424   31303 1954431 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.207e+05  2.075e+04   5.816 1.52e-08 ***
## Residential 2.038e+01  9.715e-01  20.977  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 294300 on 305 degrees of freedom
## Multiple R-squared:  0.5906, Adjusted R-squared:  0.5893 
## F-statistic:   440 on 1 and 305 DF,  p-value: < 2.2e-16

Performing Linear Regression for Tap-out values

TapOutLM.model <- lm(Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapOutLM.model)
## 
## Call:
## lm(formula = Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -758965 -120883  -57479   32583 1648011 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.212e+05  2.009e+04   6.036 4.59e-09 ***
## Residential 2.028e+01  9.406e-01  21.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 285000 on 305 degrees of freedom
## Multiple R-squared:  0.6039, Adjusted R-squared:  0.6026 
## F-statistic: 464.9 on 1 and 305 DF,  p-value: < 2.2e-16

We can see that no observations were deleted due to missing values, assuring us that the data has been cleaned properly.
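This can also be verified programmatically by comparing the number of observations used by each model against the number of rows in our data frame (a minimal sketch; both comparisons should return TRUE):

#Verify no observations were dropped by lm()
nobs(TapInLM.model) == nrow(SummarizedMasterSubzone_sf)
nobs(TapOutLM.model) == nrow(SummarizedMasterSubzone_sf)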

Step 7: Interpretation of Linear Regression Model for commuters’ flow to Residential Population by URA 2014 Master Plan Planning Area Subzones

Based on the results shown, both Tap-in and Tap-out volumes are positively correlated with residential population. The results are significant at an alpha value of 0.001, which allows us to reject the null hypothesis at a 99.9% confidence level. The R-squared values of 0.5906 and 0.6039 indicate a fair amount of strength between the model and the response variable.

Plotting Tap-in volumes against Residential Population

TapInPlot <- ggplot(data= TapInLM.model, aes(x = Residential, y= Tap_in)) +
  geom_point(color='blue',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapInLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-in Volume Count")
  

TapInPlot

Plotting Tap-out volumes against Residential Population

TapOutPlot <- ggplot(data= TapOutLM.model, aes(x = Residential, y= Tap_out)) +
  geom_point(color='red',size=2)+
  geom_segment(aes(xend = Residential, yend = predict(TapOutLM.model))) + 
    geom_smooth(method = "lm", formula = y ~ x, color='black')+
  xlab("Residential Population Count")+
  ylab("Tap-out Volume Count")

TapOutPlot

Plotting Residuals

par(mfrow=c(1,2))
plot(TapInLM.model$fitted.values, TapInLM.model$residuals, pch = 8, col = "blue", xlab = "Predicted Tap-in Values", ylab = "Residual Values (Tap-in)")
plot(TapOutLM.model$fitted.values, TapOutLM.model$residuals, pch = 8, col = "red",  xlab = "Predicted Tap-out Values", ylab = "Residual Values (Tap-out)")

When observing the residuals, the concentration of data points at low residential counts makes the plots hard to interpret. Nevertheless, we can see some degree of heteroscedasticity, which could suggest a missing variable or a variable that requires transformation, although it is unclear at this time. There also appear to be some outlier values in the distribution, which could be due to the point raised earlier about places such as the Woodlands Regional Centre.
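One quick way to probe the transformation hypothesis is to refit the Tap-in model on log scales and compare the fit (a minimal diagnostic sketch, not part of the main analysis; log1p() is used because some values are zero):

#Log-log refit as a heteroscedasticity diagnostic
TapInLogLM.model <- lm(log1p(Tap_in) ~ log1p(Residential), data = SummarizedMasterSubzone_sf)
summary(TapInLogLM.model)$r.squared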

Part 2: Spatial Autocorrelation Analysis on residuals

In the next part, we will append the residuals back to our simple feature data frame and check whether the randomisation assumption holds. This helps us verify that the relationship established in our linear model holds, or whether there is a geospatial problem in the data-generating process.

Step 1: Adding residuals and predicted values to Simple Feature Dataframe

It is important to ensure that the indices from our linear model are the same as those of our data frame. Usually the safer way would be to create an explicit index for both; for simplicity, however, we will just check manually that the rows from our linear model and our simple feature data frame align.
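A lightweight guard against misalignment is to assert that the row counts match before appending (a minimal sketch; stopifnot() raises an error if any condition is FALSE):

#Guard against index misalignment before appending
stopifnot(nrow(SummarizedMasterSubzone_sf) == length(residuals(TapInLM.model)),
          nrow(SummarizedMasterSubzone_sf) == length(residuals(TapOutLM.model)))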

Appending using tibble functions

ResidualFitted_sf <- SummarizedMasterSubzone_sf

ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapIn = residuals(TapInLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapOut = residuals(TapOutLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapIn = predict(TapInLM.model), .before = "geometry")

ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapOut = predict(TapOutLM.model), .before = "geometry")

Appending absolute values for residual Tap-in and Tap-out

ResidualFitted_sf <- add_column(ResidualFitted_sf, ABSResidualTapIn = abs(residuals(TapInLM.model)), .before = "geometry")
ResidualFitted_sf <- add_column(ResidualFitted_sf, ABSResidualTapOut = abs(residuals(TapOutLM.model)), .before = "geometry")

Step 2: Mapping out our Residual Values

This provides us a quick look at the spatial distribution of the residual values.

Mapping using tmap

ResidualTapInmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ResidualTapIn", 
              n = 5,
              style="jenks",
              palette = "RdYlGn",
          title = "Residual Tap-in") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

ResidualTapOutmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ResidualTapOut", 
              n = 5,
              style="jenks",
              palette = "RdYlBu",
          title = "Residual Tap-out") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

tmap_arrange(ResidualTapInmap, ResidualTapOutmap, asp=1, ncol=2)

Step 3: Forming Weight Matrices using Contiguity-based Neighbours

Creating (QUEEN) contiguity based neighbours

wm_q <- poly2nb(ResidualFitted_sf, queen=TRUE)
summary(wm_q)
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.975618 
## Average number of links: 6.065147 
## Link number distribution:
## 
##  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  6 10 28 78 78 49 37 14  2  2  1  1  1 
## 6 least connected regions:
## 2 42 133 172 230 277 with 2 links
## 1 most connected region:
## 40 with 17 links

Creating (ROOK) contiguity based neighbours

wm_r <- poly2nb(ResidualFitted_sf, queen=FALSE)
summary(wm_r)
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1614 
## Percentage nonzero weights: 1.712485 
## Average number of links: 5.257329 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 13 14 
##  1  5 23 70 92 62 29 17  4  2  1  1 
## 1 least connected region:
## 172 with 1 link
## 1 most connected region:
## 40 with 14 links

Visualizing (QUEEN) and (ROOK) contiguity based neighbours

centroids <- sf::st_centroid(ResidualFitted_sf$geometry) #finding centroids for our polygons

par(mfrow=c(1,2))
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Queen Contiguity", asp=1)
plot(wm_q, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col= "red" )
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Rook Contiguity", asp =1)
plot(wm_r, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col = "red")

Analysis of contiguity methods

By looking at the plots and summaries of both the Rook and Queen contiguity methods, we can see a large difference between the two methodologies: a total difference of 248 nonzero links (1862 versus 1614). Looking at the map, this is particularly evident in the north and north-east. With reference to bus routes in Singapore, routes often move along the edges and corners of planning area subzones, thus the Queen method of contiguity is more applicable for our study.

Source: https://wiki.smu.edu.sg/1415T2is415/File:IS415_2014-15_Term2_Assign1_ktchan.2011_Web_map_overview.png

Note: Consider plotting the bus routes on subzones
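To quantify the difference between the two neighbour structures, we can count the links present under Queen contiguity but absent under Rook contiguity, i.e. the corner-only contacts (a minimal sketch using the nb objects computed above):

#Corner-only neighbour links (Queen minus Rook)
queen_only <- mapply(setdiff, wm_q, wm_r, SIMPLIFY = FALSE)
sum(lengths(queen_only)) #directed links; should equal 1862 - 1614 = 248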

Step 4: Forming Weight Matrices using Fixed Distance-based Neighbours

Double check dataset

st_crs(ResidualFitted_sf)
## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCS["SVY21 / Singapore TM",
##     GEOGCS["SVY21",
##         DATUM["SVY21",
##             SPHEROID["WGS 84",6378137,298.257223563,
##                 AUTHORITY["EPSG","7030"]],
##             AUTHORITY["EPSG","6757"]],
##         PRIMEM["Greenwich",0,
##             AUTHORITY["EPSG","8901"]],
##         UNIT["degree",0.0174532925199433,
##             AUTHORITY["EPSG","9122"]],
##         AUTHORITY["EPSG","4757"]],
##     PROJECTION["Transverse_Mercator"],
##     PARAMETER["latitude_of_origin",1.366666666666667],
##     PARAMETER["central_meridian",103.8333333333333],
##     PARAMETER["scale_factor",1],
##     PARAMETER["false_easting",28001.642],
##     PARAMETER["false_northing",38744.572],
##     UNIT["metre",1,
##         AUTHORITY["EPSG","9001"]],
##     AUTHORITY["EPSG","3414"]]

Since we know the EPSG code is 3414, our coordinates are in metres (SVY21 is a projected coordinate system, not latitude and longitude), so our distances will also be in metres. It should therefore be no surprise if we see large values compared to data sets measured in degrees, and we should set longlat = FALSE in functions that compute distances from these coordinates.

Determining the appropriate cut-off distance.

coords <- st_coordinates(centroids) #taken from the previous Step 3.

k1 <- knn2nb(knearneigh(coords))

k1dists <- unlist(nbdists(k1, coords, longlat = FALSE))
summary(k1dists)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   182.5   616.3   887.5   937.2  1169.8  5404.1

Unfortunately, it would seem that the nearest-neighbour distances vary wildly between zones.

This may be because the centroids of large polygons, such as those in the Western Water Catchment, Central Water Catchment and Changi Airport, sit much farther from their neighbours than those of small subzones such as Telok Blangah Rise. Thus a fixed distance weight matrix would not be very useful for us.

Step 5: Computing adaptive distance weight matrix

For the purpose of our analysis, we’ll be using a set of 6 nearest neighbors as it is the average number of links indicated during our Queen Contiguity based Neighbors analysis.

Setting an arbitrary number of neighbours

wm_knn6 <- knn2nb(knearneigh(coords, k=6, longlat = FALSE), row.names = as.character(ResidualFitted_sf$Subzone))
wm_knn6
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1842 
## Percentage nonzero weights: 1.954397 
## Average number of links: 6 
## Non-symmetric neighbours list

Step 6: Deciding on a weight matrix

While both the Adaptive Distance Weight Matrix and Rook contiguity-based neighbours seem promising, it is more appropriate to utilize the Queen contiguity-based neighbours as our choice, for the following reasons:

  • The sizes of the polygons vary greatly, causing issues when using fixed distance-based weight matrices

  • The number of neighbours per subzone ranges from 2 to 17 under Queen contiguity, so setting an arbitrary number of neighbours will result in some subzones over-reaching into distant zones and, more importantly, omitting genuine neighbours needed for the autocorrelation analysis.

  • Since we are observing commuter volumes, there is no reason to believe that bus routes only cross boundaries longer than a point; in fact some bus routes appear to pass through the corners of subzones. Residents may also travel out of their subzone to take a bus in a neighbouring subzone, so there is no reason to prefer Rook contiguity unless there is a computational constraint.

Thus for the next phase of analysis, we will only be utilizing the Queen Contiguity-based neighbors in our spatial auto-correlation analysis.

Step 7: Weights based on the Inverse Distance Method using Queen Contiguity-based Neighbours

Deriving a spatial weight matrix based on the inverse distance method using the Queen contiguity neighbours.

dist <- nbdists(wm_q, coords, longlat = FALSE)
ids <- lapply(dist, function(x) 1/(x))
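If we wanted to use these inverse-distance weights directly, they could be supplied to nb2listw() through its glist argument (a minimal sketch for reference; the tests below use the row-standardised matrix instead):

#Binary style with general (inverse-distance) weights
rswm_ids <- nb2listw(wm_q, glist = ids, style = "B", zero.policy = TRUE)
summary(unlist(rswm_ids$weights))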

Row-standardised weights matrix

rswm_q <- nb2listw(wm_q, style="W", zero.policy = TRUE)
rswm_q
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 307 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.975618 
## Average number of links: 6.065147 
## 
## Weights style: W 
## Weights constants summary:
##     n    nn  S0       S1       S2
## W 307 94249 307 106.4054 1261.803

Verifying weights have been appropriately applied.

wm_q[[1]]
## [1] 217 218 220 221 222 264
rswm_q$weights[1]
## [[1]]
## [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

Looks like they match up.

Step 8: Performing Moran’s I tests on the Tap-in and Tap-out residuals

Performing Moran’s I test for Tap-in Residual

moran.test(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ResidualTapIn  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = -1.0545, p-value = 0.8542
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.037755733      -0.003267974       0.001069537

Performing Moran’s I test for Tap-out Residual

moran.test(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ResidualTapOut  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = -0.50383, p-value = 0.6928
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.019809059      -0.003267974       0.001077864

Interpretation of Moran’s I test

Based on our results, the p-values from both tests are relatively high, so we are unable to reject the null hypothesis that the residuals are randomly distributed, even at a 90% confidence level. This supports the conclusion that the relationships seen in our linear regression models hold.

Since we cannot reject the null hypothesis, interpreting the magnitude of the Moran’s I coefficient is not meaningful in this case.

Step 9: Robustness Testing

Testing on absolute values of the residuals

moran.test(ResidualFitted_sf$ABSResidualTapIn, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ABSResidualTapIn  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = 0.4131, p-value = 0.3398
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.010105384      -0.003267974       0.001048047
moran.test(ResidualFitted_sf$ABSResidualTapOut, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
## 
##  Moran I test under randomisation
## 
## data:  ResidualFitted_sf$ABSResidualTapOut  
## weights: rswm_q    
## 
## Moran I statistic standard deviate = 1.8125, p-value = 0.03495
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.055795360      -0.003267974       0.001061855

Using Monte Carlo Moran’s I

For robustness testing, we use Monte Carlo to check if our results hold.

Computing Monte-Carlo Moran’s I for Residual Tap-in

set.seed(1234)
MCMoranTapIn= moran.mc(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapIn
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ResidualFitted_sf$ResidualTapIn 
## weights: rswm_q  
## number of simulations + 1: 1000 
## 
## statistic = -0.037756, observed rank = 157, p-value = 0.843
## alternative hypothesis: greater

Computing Monte-Carlo Moran’s I for Residual Tap-out

set.seed(1234)
MCMoranTapOut= moran.mc(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapOut
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ResidualFitted_sf$ResidualTapOut 
## weights: rswm_q  
## number of simulations + 1: 1000 
## 
## statistic = -0.019809, observed rank = 309, p-value = 0.691
## alternative hypothesis: greater

Step 10: Interpretation of robustness testing

After testing against both the absolute values and the Monte-Carlo Moran’s I, we can say our results are promising: all tests fail to reject the null hypothesis, except for the Moran’s I test on the absolute Tap-out residuals.
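To visualise how the observed statistics sit within the simulated distributions, we can plot histograms of the Monte-Carlo results (a minimal sketch using base graphics; moran.mc() stores the simulated values in $res):

#Histograms of simulated Moran's I with the observed statistic marked
par(mfrow=c(1,2))
hist(MCMoranTapIn$res, breaks=20, main="MC Moran's I (Tap-in residuals)", xlab="Simulated Moran's I")
abline(v=MCMoranTapIn$statistic, col="blue", lwd=2)
hist(MCMoranTapOut$res, breaks=20, main="MC Moran's I (Tap-out residuals)", xlab="Simulated Moran's I")
abline(v=MCMoranTapOut$statistic, col="red", lwd=2)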

Plotting Absolute Values for Tap-out Residual against predicted values

plot(TapOutLM.model$fitted.values, abs(TapOutLM.model$residuals), pch = 8, col = "red", xlab = "Predicted Tap-out Values", ylab = "Absolute Residual (Tap-out)")

There does not appear to be any distinct pattern observed.

Mapping absolute values of Tap-out Residual

AbsoluteResidualTapOutmap <- tm_shape(ResidualFitted_sf)+ 
  tm_fill("ABSResidualTapOut", 
              n = 5,
              style="jenks",
              palette = "RdYlBu",
          title = "Absolute Residual Tap-out Values") +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

AbsoluteResidualTapOutmap

The significant Moran’s I statistic and low p-value derived here are most likely due to subzones with similarly low absolute residual values sitting near each other towards the centre of the map.

Conclusion of analysis

Overall, we can be confident in our result that residential population in the URA 2014 planning area subzones is positively correlated with commuter bus volumes. This is supported by the low p-values of the regression models and by the randomisation assumption on the residual values holding in our geospatial autocorrelation tests.

Part 3: Localized Geospatial Statistical Analysis

Step 0: Converting to SpatialPolygonsDataFrame

For the purpose of this analysis, we will convert the simple feature data frame back into a SpatialPolygonsDataFrame.

Utilizing the as_Spatial() function from the sf package

SG_Commuter_vol_df <- as_Spatial(ResidualFitted_sf)

Step 1: Computing local Moran’s I for Cluster and Outlier Analysis

Computing Local Moran’s I for Tap-in volumes

subzone_labels <- order(SG_Commuter_vol_df$Subzone)
localMITapIn <- localmoran(SG_Commuter_vol_df$Tap_in, rswm_q)

#printCoefmat(data.frame(localMITapIn[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)

Computing Local Moran’s I for Tap-out volumes

localMITapOut <- localmoran(SG_Commuter_vol_df$Tap_out, rswm_q)

#printCoefmat(data.frame(localMITapOut[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)

Appending local Moran’s I values and p-values to the SpatialPolygonsDataFrame

SG_Commuter_vol_df.localMITapIn <- cbind(SG_Commuter_vol_df,localMITapIn)
SG_Commuter_vol_df.localMITapOut <- cbind(SG_Commuter_vol_df,localMITapOut)

Step 2: Mapping Local Moran’s I

Visualizing local Moran’s I for Tap-in values

TapInlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "Tap-in Local MI Statistics") +
  tm_borders(alpha = 0.5)

TapInPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "Pr.z...0.", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Blues", 
          title = "Tap-in Local MI p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapInlocalMI.map, TapInPvalue.map, asp=1, ncol=2)
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Visualizing local Moran’s I for Tap-out values

TapOutlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "Tap-out Local MI Statistics") +
  tm_borders(alpha = 0.5)

TapOutPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "Pr.z...0.", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Reds", 
          title = "Tap-out Local MI p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapOutlocalMI.map, TapOutPvalue.map, asp=1, ncol=2)
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Interpretation of Local Moran Statistics

Based on the map results, we can see several clusters for both Tap-in and Tap-out, specifically around the Tampines East, Tampines West and Bedok North subzones. There also appear to be several outliers; however, their p-values are high (not statistically significant), so we will not take them into consideration.
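We can also tally how many subzones are individually significant at the 5% level (a minimal sketch; Pr.z...0. is the p-value column name produced by the cbind above, as used in the p-value maps):

#Count of subzones with significant local Moran's I
sum(SG_Commuter_vol_df.localMITapIn$Pr.z...0. < 0.05)
sum(SG_Commuter_vol_df.localMITapOut$Pr.z...0. < 0.05)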

Step 3: Creating LISA Cluster Map

Creating Moran Scatter Plot with standardized Tap-in Values

SG_Commuter_vol_df.localMITapIn$Z.Tap_In <- scale(SG_Commuter_vol_df.localMITapIn$Tap_in) %>% as.vector 

TapInLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapIn$Z.Tap_In, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapIn$Subzone), xlab="z Tap-in Volumes", ylab="Spatially Lag z Tap-in Volumes")

Creating Moran Scatter Plot with standardized Tap-out Values

SG_Commuter_vol_df.localMITapOut$Z.Tap_out <-scale(SG_Commuter_vol_df.localMITapOut$Tap_out) %>% as.vector 

TapOutLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapOut$Z.Tap_out, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapOut$Subzone), xlab="z Tap-out Volumes", ylab="Spatially Lag z Tap-out Volumes")

Build Quadrants for Local MI Tap-in values

TapInquadrant <- vector(mode="numeric",length=nrow(localMITapIn))
TapInDV <- SG_Commuter_vol_df$Tap_in - mean(SG_Commuter_vol_df$Tap_in)     
TapInC_mI <- localMITapIn[,1] - mean(localMITapIn[,1])    
signif <- 0.05       
TapInquadrant[TapInDV >0 & TapInC_mI>0] <- 4      
TapInquadrant[TapInDV <0 & TapInC_mI<0] <- 1      
TapInquadrant[TapInDV <0 & TapInC_mI>0] <- 2
TapInquadrant[TapInDV >0 & TapInC_mI<0] <- 3
TapInquadrant[localMITapIn[,5]>signif] <- 0

Build Quadrants for Local MI Tap-out values

TapOutquadrant <- vector(mode="numeric",length=nrow(localMITapOut))
TapOutDV <- SG_Commuter_vol_df$Tap_out - mean(SG_Commuter_vol_df$Tap_out)     
TapOutC_mI <- localMITapOut[,1] - mean(localMITapOut[,1])    
signif <- 0.05       
TapOutquadrant[TapOutDV >0 & TapOutC_mI>0] <- 4      
TapOutquadrant[TapOutDV <0 & TapOutC_mI<0] <- 1      
TapOutquadrant[TapOutDV <0 & TapOutC_mI>0] <- 2
TapOutquadrant[TapOutDV >0 & TapOutC_mI<0] <- 3
TapOutquadrant[localMITapOut[,5]>signif] <- 0

Constructing LISA Maps for Both Tap-in and Tap-out Local Moran’s I

colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")
clusters <- c("Insignificant", "Low-Low", "Low-High", "High-Low", "High-High")


SG_Commuter_vol_df.localMITapIn$TapInquadrant <- TapInquadrant

LISATapIn <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
  tm_fill(col = "TapInquadrant", style = "cat", title="Tap-in Classification", palette = colors[c(sort(unique(TapInquadrant)))+1], labels = clusters[c(sort(unique(TapInquadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

SG_Commuter_vol_df.localMITapOut$TapOutquadrant <- TapOutquadrant

LISATapOut <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
  tm_fill(col = "TapOutquadrant", style = "cat", title="Tap-out Classification", palette = colors[c(sort(unique(TapOutquadrant)))+1], labels = clusters[c(sort(unique(TapOutquadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

tmap_arrange(LISATapIn, LISATapOut, asp=1, ncol=2)

Interpretation of LISA Map

Based on the LISA map, we can see several clusters that primarily align with our initial observations of the local Moran’s I scores in the previous step. However, some subzones in the north-east and west also fall within the High-High quadrant, indicating positive autocorrelation with their neighbouring subzones.

Step 4: Hot Spot and Cold Spot Area Analysis using Gi Statistics

For the purpose of this analysis, we will only be using the adaptive distance weight matrix, for the reasons explained in Part 2 regarding the nature of the polygons we are dealing with. We will reuse the wm_knn6 neighbour list computed in that section.

Computing Gi Statistics

wm_knn6_lw <- nb2listw(wm_knn6, style = 'B') #wm_knn6 taken from previous Part 2

TapIngi.adaptive <- localG(SG_Commuter_vol_df$Tap_in, wm_knn6_lw)
TapInSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapIngi.adaptive))
names(TapInSG.gi)[11] <- "gstat_adaptive"

TapOutgi.adaptive <- localG(SG_Commuter_vol_df$Tap_out, wm_knn6_lw)
TapOutSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapOutgi.adaptive))
names(TapOutSG.gi)[11] <- "gstat_adaptive"
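Before mapping, we can get a quick tally of candidate hot and cold spots by treating |Gi| greater than 1.96 as significant at roughly the 5% level (a minimal sketch for the Tap-in statistics):

#Candidate hot spots and cold spots for Tap-in volumes
sum(TapInSG.gi$gstat_adaptive > 1.96)  #hot spots
sum(TapInSG.gi$gstat_adaptive < -1.96) #cold spots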

Mapping out Gi Statistics

TapIngiMap <- tm_shape(TapInSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "pretty",
          palette = "-RdBu",
          n=5,
          title = "Tap-in local Gi") +
  tm_borders(alpha = 0.5)

TapOutgiMap <- tm_shape(TapOutSG.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "pretty",
          palette = "-RdBu",
          n=5,
          title = "Tap-out local Gi") +
  tm_borders(alpha = 0.5)

tmap_arrange(TapIngiMap, TapOutgiMap, asp=1, ncol=2)
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Interpretation of Gi Statistic Maps

Based on the results, we see hot spots in areas we had originally identified in our earlier LISA map. There is also a surprising cold spot in the west, towards Tuas. This may be due to the nature of how k-nearest neighbours works: the Tuas View Extension subzone is only really adjacent to 2 subzones, but the adaptive distance weight matrix coerces it into considering more distant zones.

Overall Conclusion

Overall, there appears to be some clustering for both Tap-in and Tap-out around the Tampines East subzone, with the local Moran’s I analysis, LISA mapping and Gi statistics all highlighting the zone or some of its surrounding areas as clusters or hot spots. This aligns with the population map produced in Part 1, which shows the area as relatively more populated than others, although we did not perform any geospatial analytics on the population distribution itself.

Part 4: Study limitations and assumptions

There are several variables not included in the statistical analysis above that must be considered when examining the scope of the study.