This exercise examines national development and public service issues by utilizing data sets from various government agencies in Singapore. The assignment is meant to demonstrate basic Geospatial Data Wrangling, Geospatial Analytics and Geovisualization in R. For the purpose of this assignment, we will utilize course material in order to achieve three specific objectives:
Calibrating a simple linear regression to reveal the relationship between public bus commuter flows (i.e. tap-in and tap-out volumes) and residential population at the planning sub-zone level.
Performing spatial auto-correlation analysis on the residuals of the regression model to test whether the model conforms to the randomization assumption.
Performing localized geospatial statistics analysis by using commuters’ tap-in and tap-out data to identify geographical clustering.
For the purpose of this study, the passenger volume by bus stop data set from the Land Transport Authority (LTA) has been provided. This data set was extracted using the dynamic API provided at LTA DataMall. The remaining data sets come from the government open data portal.
For the purpose of this assignment, we will be using packages taught in the course. In particular, we will be utilizing rgdal, sf, spdep, tmap and tidyverse.
packages = c('rgdal', 'sf', 'spdep', 'tmap', 'tidyverse')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
sgpassengervolume <- read_csv("data/aspatial/passenger volume by busstop.csv")
## Parsed with column specification:
## cols(
## YEAR_MONTH = col_character(),
## DAY_TYPE = col_character(),
## TIME_PER_HOUR = col_double(),
## PT_TYPE = col_character(),
## PT_CODE = col_character(),
## TOTAL_TAP_IN_VOLUME = col_double(),
## TOTAL_TAP_OUT_VOLUME = col_double()
## )
Based on the passenger volume by bus stop data set provided, we are examining passenger flows for the month of January 2020. As requested by the assignment, we need to aggregate the data at the planning sub-zone level in order to fit a simple linear model of its relationship to residential population in the same zones.
For our purpose, we do not care about time-of-day or day-type differences in the data set; we only care about the volumes at each bus stop location, so we will need a new data frame summed by bus stop number.
Totalbusstopvol <- sgpassengervolume %>%
group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
summarise(Tap_In = sum(TOTAL_TAP_IN_VOLUME),
Tap_out = sum(TOTAL_TAP_OUT_VOLUME))
Thus we will need a spatial reference that lets us map the bus-stop locations to the planning sub-zone level. For that purpose, we will be utilizing the Bus Stop Location shapefile from the LTA DataMall. Source: https://www.mytransport.sg/content/mytransport/home/dataMall/static-data.html#Whole%20Island
We’ll load it in as a Simple Features data frame. Since we know it comes from a Singapore government source, it is already in the local SVY21 coordinate system and no re-projection is needed; we simply set the CRS to EPSG:3414, which corresponds to Singapore.
busstoplocations_sf <- readOGR(dsn = "data/geospatial/BusStopLocation_Jan2020", layer = "BusStop")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\data\GSA\IS415_Take-Home_EX01\data\geospatial\BusStopLocation_Jan2020", layer: "BusStop"
## with 5040 features
## It has 3 fields
busstoplocations_sf <- st_as_sf(busstoplocations_sf)
busstoplocations_sf <- st_transform(busstoplocations_sf, crs= 3414)
#st_crs(busstoplocations_sf)
busstoplocations_sf$LOC_DESC[is.na(busstoplocations_sf$LOC_DESC)] <- "UNKNOWN"
busstoplocations_sf$BUS_ROOF_N[is.na(busstoplocations_sf$BUS_ROOF_N)] <- "UNKNOWN"
Next, we need to attach our passenger volume data to the bus-stop Simple Features data frame.
busstoplocations_sf <- left_join(busstoplocations_sf,Totalbusstopvol) #we already renamed the variable to match so there is no need to specify the joining factor
## Joining, by = "BUS_STOP_N"
busstoplocations_sf$Tap_In[is.na(busstoplocations_sf$Tap_In)] <- 0
busstoplocations_sf$Tap_out[is.na(busstoplocations_sf$Tap_out)] <- 0
In order for us to perform our regression, we need population information at the planning sub-zone level. We will utilize the “Singapore Residents by Subzone and Type of Dwelling, 2011 - 2019” data set and use only the 2019 figures to get the latest data.
Source: https://data.gov.sg/dataset/Singapore-residents-by-subzone-and-type-of-dwelling-2011-2019
To make it geospatial data, we need to match it with the corresponding subzone boundaries. According to the website, this information is still mapped to the Master Plan 2014 subzones, so we will use those even though an updated 2019 Master Plan exists. For this exercise, we will use the shapefile instead of the KML files that are also provided.
Source: https://data.gov.sg/dataset/master-plan-2014-subzone-boundary-web
As noted above, we will use only the 2019 data. The bus fare concession rules (source below) also justify excluding the youngest age group, since children who travel free do not generate tap-in or tap-out records.
Source: https://www.sbstransit.com.sg/fares-and-concessions
PopData <- read_csv("data/aspatial/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv") %>%
filter(year==2019) %>% #filter only by 2019 data
filter(age_group !="0_to_4")#removing age group 0-4 as they are irrelevant for the purpose of our study
## Parsed with column specification:
## cols(
## planning_area = col_character(),
## subzone = col_character(),
## age_group = col_character(),
## sex = col_character(),
## type_of_dwelling = col_character(),
## resident_count = col_double(),
## year = col_double()
## )
ResPopbySubzone <- PopData %>%
group_by(subzone) %>% # group records by subzone for aggregation
summarise(resident_count= sum(resident_count))
ResPopbySubzone <- mutate_at(ResPopbySubzone, .vars = "subzone", .funs=toupper) #convert names to upper case, be sure to only do so for names.
PlanningSubzone_sf <- readOGR(dsn = "data/geospatial/MP14_SUBZONE_WEB", layer = "MP14_SUBZONE_WEB_PL")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\data\GSA\IS415_Take-Home_EX01\data\geospatial\MP14_SUBZONE_WEB", layer: "MP14_SUBZONE_WEB_PL"
## with 323 features
## It has 15 fields
PlanningSubzone_sf <- st_as_sf(PlanningSubzone_sf)
PlanningSubzone_sf <- st_transform(PlanningSubzone_sf, crs= 3414)
#st_crs(PlanningSubzone_sf)
PlanningSubzone_sf <- left_join(PlanningSubzone_sf , ResPopbySubzone, by = c("SUBZONE_N" = "subzone"))
tm_shape(PlanningSubzone_sf)+
tm_fill(col = "resident_count",
n = 5,
style="jenks",
palette = "Blues",
title = "Resident Population Count") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white") +
tm_credits("Source: Planning Area Sub-zone boundary MP2014 from Urban Redevelopment Authorithy (URA)\n and Population data from Department of Statistics (DOS)",
position = c("left", "bottom"))
The map seems to align with the population distributions we have seen before, so we will accept that our data has been loaded in properly.
For the purpose of our analysis, we will first remove zones where Tap-in, Tap-out and resident count are all equal to 0. These zones are likely irrelevant for our study, as such areas are unreachable by the general population, and they may cause problems during the geospatial auto-correlation calculations later on.
For zones in which some but not all of these values are zero, we will keep the areas, as they are still relevant to our scope of study. We will replace any missing values with 0.
However, it should be noted that based on an initial inspection of the data, there are areas with 0 residents but a high amount of Tap-in and Tap-out activity, which may be due to construction projects. The best example of this would be the Woodlands Regional Center, which has disproportionately large bus volumes but no residential population because it is still under construction.
MasterSubzone_sf <- st_join(PlanningSubzone_sf,busstoplocations_sf, join=st_intersects) #using st_intersects to join bus stops that are either within or touching the polygons.
MasterSubzone_sf$Tap_In[is.na(MasterSubzone_sf$Tap_In)] <- 0
MasterSubzone_sf$Tap_out[is.na(MasterSubzone_sf$Tap_out)] <- 0
MasterSubzone_sf$BUS_ROOF_N[is.na(MasterSubzone_sf$BUS_ROOF_N)] <- "UNKNOWN"
MasterSubzone_sf$BUS_STOP_N[is.na(MasterSubzone_sf$BUS_STOP_N)] <- "UNKNOWN"
MasterSubzone_sf$LOC_DESC[is.na(MasterSubzone_sf$LOC_DESC)] <- "UNKNOWN"
SummarizedMasterSubzone_sf <- MasterSubzone_sf %>%
filter(Tap_In!= 0 | Tap_out != 0 | resident_count != 0) %>% # we'll now filter out any subzones with all values zero
group_by(Subzone = SUBZONE_N) %>%
summarise(Residential = as.numeric(first(resident_count)), Tap_in = sum(Tap_In), Tap_out = sum(Tap_out))
Now we need to inspect which polygons remain and look at the distribution of our Tap-in and Tap-out data.
CountTapInMap <- tm_shape(SummarizedMasterSubzone_sf)+
tm_fill(col = c("Tap_in"),
n = 5,
style="jenks",
palette = "Blues",
title = "Tap-in Volume") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white")+
tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)",
position = c("left", "bottom"))
CountTapOutMap <- tm_shape(SummarizedMasterSubzone_sf)+
tm_fill(col = c("Tap_out"),
n = 5,
style="jenks",
palette = "Reds",
title = "Tap-out Volume") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white") +
tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)",
position = c("left", "bottom"))
tmap_arrange(CountTapInMap, CountTapOutMap, asp=1, ncol=2)
We can see that during the aggregation process, some of the islands, such as Jurong Island and Pulau Ubin, have disappeared. This is acceptable because these areas are inaccessible by bus transportation, which allows us to concentrate our study on the relevant areas within mainland Singapore.
For this exercise, we are interested in the relationship between the tap-in and tap-out volumes and the residential population, so we will use the lm() function to calculate the linear regression. For the purpose of our study, we will treat Residential Population as X and Tap-in or Tap-out as Y. This is based on the fact that the residential population data was gathered in 2019, before the Tap-in and Tap-out data of January 2020, although we may be committing a post hoc ergo propter hoc fallacy.
TapInLM.model <- lm(Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapInLM.model)
##
## Call:
## lm(formula = Tap_in ~ Residential, data = SummarizedMasterSubzone_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -799481 -127198 -67424 31303 1954431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.207e+05 2.075e+04 5.816 1.52e-08 ***
## Residential 2.038e+01 9.715e-01 20.977 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 294300 on 305 degrees of freedom
## Multiple R-squared: 0.5906, Adjusted R-squared: 0.5893
## F-statistic: 440 on 1 and 305 DF, p-value: < 2.2e-16
TapOutLM.model <- lm(Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
summary(TapOutLM.model)
##
## Call:
## lm(formula = Tap_out ~ Residential, data = SummarizedMasterSubzone_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -758965 -120883 -57479 32583 1648011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.212e+05 2.009e+04 6.036 4.59e-09 ***
## Residential 2.028e+01 9.406e-01 21.562 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 285000 on 305 degrees of freedom
## Multiple R-squared: 0.6039, Adjusted R-squared: 0.6026
## F-statistic: 464.9 on 1 and 305 DF, p-value: < 2.2e-16
We can see that no observations were deleted due to missing values, assuring us that the data has been cleaned properly.
Based on the results shown, both Tap-in and Tap-out volumes are positively correlated with residential population. The coefficients are significant at an alpha of 0.001, which allows us to reject the null hypothesis at the 99.9% confidence level. The R-squared values of 0.5906 and 0.6039 indicate a fair amount of explanatory strength in both models.
TapInPlot <- ggplot(data= TapInLM.model, aes(x = Residential, y= Tap_in)) +
geom_point(color='blue',size=2)+
geom_segment(aes(xend = Residential, yend = predict(TapInLM.model))) +
geom_smooth(method = "lm", formula = y ~ x, color='black')+
xlab("Residential Population Count")+
ylab("Tap-in Volume Count")
TapInPlot
TapOutPlot <- ggplot(data= TapOutLM.model, aes(x = Residential, y= Tap_out)) +
geom_point(color='red',size=2)+
geom_segment(aes(xend = Residential, yend = predict(TapOutLM.model))) +
geom_smooth(method = "lm", formula = y ~ x, color='black')+
xlab("Residential Population Count")+
ylab("Tap-out Volume Count")
TapOutPlot
par(mfrow=c(1,2))
plot(TapInLM.model$fitted.values, TapInLM.model$residuals, pch = 8, col = "blue", xlab = "Predicted Tap-in Values", ylab = "Residual Values (Tap-in)")
plot(TapOutLM.model$fitted.values, TapOutLM.model$residuals, pch = 8, col = "red", xlab = "Predicted Tap-out Values", ylab = "Residual Values (Tap-out)")
When observing the residuals, the concentration of data points at low residential counts makes the plots hard to interpret. However, we can see some degree of heteroscedasticity, which could suggest a missing variable or a variable that requires transformation, although it is unclear at this time. There also appear to be some outliers in the distribution, which could be due to the point raised earlier about places such as the Woodlands Regional Center.
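If we wanted a formal check to complement the visual inspection, one option (not part of the original workflow, and assuming the lmtest package is available) would be a Breusch-Pagan test on each model; a small p-value would indicate non-constant residual variance.
library(lmtest) # assumed extra dependency, not among the course packages loaded above
bptest(TapInLM.model) # Breusch-Pagan test on the Tap-in model
bptest(TapOutLM.model) # Breusch-Pagan test on the Tap-out model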
In the next part, we will append the residuals back to our Simple Features data frame and check whether the randomization assumption holds. This helps us verify that the relationship established in our linear models is genuine and that there is no geospatial problem in the data-generating process.
It is important to ensure that the row order of the linear model output matches that of our data frame. The safer way would usually be to create an explicit index for both; for simplicity, however, we will check manually and see that the data from our linear models and the Simple Features data frame align.
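As a minimal sanity check (a sketch, not part of the original workflow), we can at least confirm that the number of residuals from each model matches the number of subzone rows before appending them:
stopifnot(nrow(SummarizedMasterSubzone_sf) == length(residuals(TapInLM.model))) # both models were fitted on this data frame
stopifnot(nrow(SummarizedMasterSubzone_sf) == length(residuals(TapOutLM.model))) # so the counts should match row for row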
ResidualFitted_sf <- SummarizedMasterSubzone_sf
ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapIn = residuals(TapInLM.model), .before = "geometry")
ResidualFitted_sf <- add_column(ResidualFitted_sf, ResidualTapOut = residuals(TapOutLM.model), .before = "geometry")
ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapIn = predict(TapInLM.model), .before = "geometry")
ResidualFitted_sf <- add_column(ResidualFitted_sf, PredictedTapOut = predict(TapOutLM.model), .before = "geometry")
ResidualTapInmap <- tm_shape(ResidualFitted_sf)+
tm_fill("ResidualTapIn",
n = 5,
style="jenks",
palette = "RdYlGn",
title = "Residual Tap-in") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white")
ResidualTapOutmap <- tm_shape(ResidualFitted_sf)+
tm_fill("ResidualTapOut",
n = 5,
style="jenks",
palette = "RdYlBu",
title = "Residual Tap-out") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white")
tmap_arrange(ResidualTapInmap, ResidualTapOutmap, asp=1, ncol=2)
wm_q <- poly2nb(ResidualFitted_sf, queen=TRUE)
summary(wm_q)
## Neighbour list object:
## Number of regions: 307
## Number of nonzero links: 1862
## Percentage nonzero weights: 1.975618
## Average number of links: 6.065147
## Link number distribution:
##
## 2 3 4 5 6 7 8 9 10 11 12 14 17
## 6 10 28 78 78 49 37 14 2 2 1 1 1
## 6 least connected regions:
## 2 42 133 172 230 277 with 2 links
## 1 most connected region:
## 40 with 17 links
wm_r <- poly2nb(ResidualFitted_sf, queen=FALSE)
summary(wm_r)
## Neighbour list object:
## Number of regions: 307
## Number of nonzero links: 1614
## Percentage nonzero weights: 1.712485
## Average number of links: 5.257329
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 13 14
## 1 5 23 70 92 62 29 17 4 2 1 1
## 1 least connected region:
## 172 with 1 link
## 1 most connected region:
## 40 with 14 links
centroids <- sf::st_centroid(ResidualFitted_sf$geometry) #finding centroids for our polygons
par(mfrow=c(1,2))
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Queen Contiguity", asp=1)
plot(wm_q, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col= "red" )
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Rook Contiguity", asp =1)
plot(wm_r, st_coordinates(centroids), pch = 5, cex = 0.6, add = TRUE, col = "red")
tm_shape(PlanningSubzone_sf)+
tm_fill(col="SUBZONE_N", palette = "Blues",
title = "Subzone Boundaries", legend.show = FALSE) +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 1) +
tmap_style("white") +
tm_credits("Source: Planning Area Sub-zone boundary MP2014 from Urban Redevelopment Authorithy (URA)",
position = c("left", "bottom"))+
tm_shape(busstoplocations_sf)+
tm_dots(size = 0.03)
By looking at the plots and summaries of both the Rook and Queen contiguity schemes, we can see a sizeable difference between the two methodologies: a total difference of 248 links (1862 versus 1614). By looking at the map, we can see that this is particularly evident in the north and north-east parts of the map.
With reference to bus routes in Singapore and the bus-stop locations in the plot above, routes often run along the edges and corners of planning sub-zones, so the Queen contiguity method is more applicable for our study, as it captures point (corner) connections between polygons.
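To see exactly which links the Queen criterion adds on top of the Rook criterion, one option (a sketch, not part of the original analysis) is to take the difference between the two neighbour lists with spdep's diffnb() and plot only those extra links:
wm_extra <- diffnb(wm_q, wm_r) # links present in one scheme but not the other, i.e. the Queen-only corner links
summary(wm_extra) # most regions will have few or no extra links
plot(ResidualFitted_sf$geometry, border = "lightgrey", main = "Queen-only links")
plot(wm_extra, st_coordinates(centroids), add = TRUE, col = "blue")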
st_crs(ResidualFitted_sf)
## Coordinate Reference System:
## User input: EPSG:3414
## wkt:
## PROJCS["SVY21 / Singapore TM",
## GEOGCS["SVY21",
## DATUM["SVY21",
## SPHEROID["WGS 84",6378137,298.257223563,
## AUTHORITY["EPSG","7030"]],
## AUTHORITY["EPSG","6757"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4757"]],
## PROJECTION["Transverse_Mercator"],
## PARAMETER["latitude_of_origin",1.366666666666667],
## PARAMETER["central_meridian",103.8333333333333],
## PARAMETER["scale_factor",1],
## PARAMETER["false_easting",28001.642],
## PARAMETER["false_northing",38744.572],
## UNIT["metre",1,
## AUTHORITY["EPSG","9001"]],
## AUTHORITY["EPSG","3414"]]
Since the CRS is EPSG:3414 (SVY21 / Singapore TM), distances will be expressed in metres, so it should be no surprise if we see large values compared to other data sets. Additionally, the coordinates are projected eastings and northings rather than latitude and longitude, which means we should set longlat = FALSE in functions that compute distances from them.
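A quick way to confirm this (a small check, not in the original write-up) is sf's st_is_longlat(), which returns FALSE for a projected CRS such as EPSG:3414:
st_is_longlat(ResidualFitted_sf) # FALSE: planar coordinates in metres, not latitude/longitude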
coords <- st_coordinates(centroids) #taken from the previous Step 3.
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords, longlat = FALSE))
summary(k1dists)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 182.5 616.3 887.5 937.2 1169.8 5404.1
Unfortunately, it would seem that the distance values vary wildly between zones.
This may be because the centroids of large polygons, such as those in the Western Water Catchment, Central Water Catchment and Changi Airport areas, sit much further from their neighbours than those of small zones such as Telok Blangah Rise, since these polygons cover much larger areas. Thus a Fixed Distance Weight Matrix would not be very useful for us to utilize.
Let's take a look:
wm_d5405 <- dnearneigh(coords, 0, 5405, longlat = FALSE)
par(mfrow=c(1,2))
plot(ResidualFitted_sf$geometry, border="lightgrey", main="1st Nearest Neighbours")
plot(k1, coords, add=TRUE, col="red", length=0.08)
plot(ResidualFitted_sf$geometry, border="lightgrey", main="Fixed-Distance Link")
plot(wm_d5405, coords, add=TRUE, pch = 19, cex = 0.6)
As we can see, the Fixed Distance Weight Matrix methodology would not be very useful for analysis.
For the purpose of our analysis, we’ll be using a set of 6 nearest neighbors as it is the average number of links indicated during our Queen Contiguity based Neighbors analysis.
wm_knn6 <- knn2nb(knearneigh(coords, k=6, longlat = FALSE), row.names=row.names(ResidualFitted_sf$Subzone))
wm_knn6
## Neighbour list object:
## Number of regions: 307
## Number of nonzero links: 1842
## Percentage nonzero weights: 1.954397
## Average number of links: 6
## Non-symmetric neighbours list
plot(ResidualFitted_sf$geometry, border="lightgrey", main= "KNN 6 Neighbours")
plot(wm_knn6, coords, pch = 19, cex = 0.6, add = TRUE, col = "red")
Although the links at first glance appear quite good, you will see areas such as Changi Airport reaching over the neighbours nearest to it. We also know from the Queen contiguity method that some areas have up to 17 connections, which is more than double the number of neighbours set here.
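One way to quantify this over-reaching (a sketch, not part of the original analysis) is to measure the lengths of the 6-nearest-neighbour links with nbdists() and inspect the longest ones:
knn6_dists <- unlist(nbdists(wm_knn6, coords, longlat = FALSE)) # link lengths in metres
summary(knn6_dists)
head(sort(knn6_dists, decreasing = TRUE)) # the longest links are the ones reaching across intervening subzones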
While both the Adaptive Distance Weight Matrix and Rook contiguity-based neighbours seem promising, it would be more appropriate to utilize the Queen contiguity-based neighbours as our ideal choice, for the following reasons:
The sizes of the polygons vary greatly, causing issues when using Fixed Distance-based weight matrices.
The number of neighbours around the subzones ranges from 2 to 17 under Queen contiguity (see the quick check below), so setting an arbitrary number of neighbours will result in some subzones over-reaching into distant zones and, more importantly, in some genuine neighbours not being taken into account during the auto-correlation analysis.
Since we are observing commuters’ volumes, there is no reason to believe that bus routes only cross shared edges rather than corner points; in fact, some bus routes appear to pass through the corners of subzones. Residents may also travel out of their subzone to take the bus in another subzone, so there is no reason to prefer Rook contiguity unless there is a computational constraint.
Thus for the next phase of analysis, we will only be utilizing the Queen Contiguity-based neighbors in our spatial auto-correlation analysis.
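To confirm the spread of neighbour counts mentioned in the list above (a quick check, not in the original write-up), spdep's card() returns the number of neighbours of each region:
table(card(wm_q)) # distribution of neighbour counts under Queen contiguity, ranging from 2 to 17 here
summary(card(wm_q))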
dist <- nbdists(wm_q, coords, longlat = FALSE)
ids <- lapply(dist, function(x) 1/(x)) # inverse-distance weights, computed for reference only; the row-standardised weights below are what we actually use
rswm_q <- nb2listw(wm_q, style="W", zero.policy = TRUE)
rswm_q
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 307
## Number of nonzero links: 1862
## Percentage nonzero weights: 1.975618
## Average number of links: 6.065147
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 307 94249 307 106.4054 1261.803
wm_q[[1]]
## [1] 217 218 220 221 222 264
rswm_q$weights[1]
## [[1]]
## [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
The neighbour indices and their row-standardised weights match up: six neighbours, each with weight 1/6.
moran.test(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
##
## Moran I test under randomisation
##
## data: ResidualFitted_sf$ResidualTapIn
## weights: rswm_q
##
## Moran I statistic standard deviate = -1.0545, p-value = 0.8542
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.037755733 -0.003267974 0.001069537
moran.test(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, zero.policy = TRUE, na.action=na.omit)
##
## Moran I test under randomisation
##
## data: ResidualFitted_sf$ResidualTapOut
## weights: rswm_q
##
## Moran I statistic standard deviate = -0.50383, p-value = 0.6928
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.019809059 -0.003267974 0.001077864
Based on our results, the p-values from both tests are relatively high, so we are unable to reject the null hypothesis that the residuals are randomly distributed, even at a 90% confidence level. This supports the view that the relationships seen in our linear regression models are genuine.
Since we are unable to reject the null hypothesis, interpreting the magnitude of the Moran’s I coefficient itself is not meaningful in this case.
For robustness, we run a Monte Carlo (permutation) version of the Moran’s I test to check whether our results hold.
set.seed(1234)
MCMoranTapIn= moran.mc(ResidualFitted_sf$ResidualTapIn, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapIn
##
## Monte-Carlo simulation of Moran I
##
## data: ResidualFitted_sf$ResidualTapIn
## weights: rswm_q
## number of simulations + 1: 1000
##
## statistic = -0.037756, observed rank = 157, p-value = 0.843
## alternative hypothesis: greater
set.seed(1234)
MCMoranTapOut= moran.mc(ResidualFitted_sf$ResidualTapOut, listw=rswm_q, nsim=999, zero.policy = TRUE, na.action=na.omit)
MCMoranTapOut
##
## Monte-Carlo simulation of Moran I
##
## data: ResidualFitted_sf$ResidualTapOut
## weights: rswm_q
## number of simulations + 1: 1000
##
## statistic = -0.019809, observed rank = 309, p-value = 0.691
## alternative hypothesis: greater
After testing using the Monte Carlo Moran’s I, we can safely say our results appear consistent, as both tests are again unable to reject the null hypothesis.
Overall, we can be confident in our result that residential population in the URA 2014 planning sub-zones is positively correlated with commuters’ volumes by bus. This is supported by the low p-values in the linear models and by the randomization assumption holding for the residuals in the geospatial auto-correlation analysis.
For the purpose of this analysis, we will be converting the Simple Features data frame back into a SpatialPolygonsDataFrame.
SG_Commuter_vol_df <- as_Spatial(ResidualFitted_sf)
subzone_labels <- order(SG_Commuter_vol_df$Subzone)
localMITapIn <- localmoran(SG_Commuter_vol_df$Tap_in, rswm_q)
#printCoefmat(data.frame(localMITapIn[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)
localMITapOut <- localmoran(SG_Commuter_vol_df$Tap_out, rswm_q)
#printCoefmat(data.frame(localMITapOut[subzone_labels,], row.names=SG_Commuter_vol_df$Subzone[subzone_labels]), check.names=FALSE)
SG_Commuter_vol_df.localMITapIn <- cbind(SG_Commuter_vol_df,localMITapIn)
SG_Commuter_vol_df.localMITapOut <- cbind(SG_Commuter_vol_df,localMITapOut)
TapInlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
tm_fill(col = "Ii",
style = "pretty",
title = "Tap-in Local MI Statistics") +
tm_borders(alpha = 0.5)
TapInPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
tm_fill(col = "Pr.z...0.",
breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
palette="-Blues",
title = "Tap-in Local MI p-values") +
tm_borders(alpha = 0.5)
tmap_arrange(TapInlocalMI.map, TapInPvalue.map, asp=1, ncol=2)
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
TapOutlocalMI.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
tm_fill(col = "Ii",
style = "pretty",
title = "Tap-out Local MI Statistics") +
tm_borders(alpha = 0.5)
TapOutPvalue.map <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
tm_fill(col = "Pr.z...0.",
breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
palette="-Reds",
title = "Tap-out Local MI p-values") +
tm_borders(alpha = 0.5)
tmap_arrange(TapOutlocalMI.map, TapOutPvalue.map, asp=1, ncol=2)
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Based on the map results, we can see several clusters for both Tap-in and Tap-out, specifically around the Tampines East, Tampines West and Bedok North subzones. There also appear to be several outliers; however, their p-values are not significant, so we will not take them into consideration.
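To make "several clusters" more concrete (a small check, not in the original write-up), we could count the subzones whose local Moran's I is significant at the 5% level, using the p-value in the fifth column of the localmoran() output, the same column the quadrant classification below relies on:
sum(localMITapIn[, 5] < 0.05) # subzones with a locally significant Moran's I for Tap-in
sum(localMITapOut[, 5] < 0.05) # and for Tap-out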
SG_Commuter_vol_df.localMITapIn$Z.Tap_In <- scale(SG_Commuter_vol_df.localMITapIn$Tap_in) %>% as.vector
TapInLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapIn$Z.Tap_In, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapIn$Subzone), xlab="z Tap-in Volumes", ylab="Spatially Lag z Tap-in Volumes")
SG_Commuter_vol_df.localMITapOut$Z.Tap_out <-scale(SG_Commuter_vol_df.localMITapOut$Tap_out) %>% as.vector
TapOutLMoranPlot <- moran.plot(SG_Commuter_vol_df.localMITapOut$Z.Tap_out, rswm_q, labels=as.character(SG_Commuter_vol_df.localMITapOut$Subzone), xlab="z Tap-out Volumes", ylab="Spatially Lag z Tap-out Volumes")
TapInquadrant <- vector(mode="numeric",length=nrow(localMITapIn))
TapInDV <- SG_Commuter_vol_df$Tap_in - mean(SG_Commuter_vol_df$Tap_in)
TapInC_mI <- localMITapIn[,1] - mean(localMITapIn[,1])
signif <- 0.05
TapInquadrant[TapInDV >0 & TapInC_mI>0] <- 4
TapInquadrant[TapInDV <0 & TapInC_mI<0] <- 1
TapInquadrant[TapInDV <0 & TapInC_mI>0] <- 2
TapInquadrant[TapInDV >0 & TapInC_mI<0] <- 3
TapInquadrant[localMITapIn[,5]>signif] <- 0
TapOutquadrant <- vector(mode="numeric",length=nrow(localMITapOut))
TapOutDV <- SG_Commuter_vol_df$Tap_out - mean(SG_Commuter_vol_df$Tap_out)
TapOutC_mI <- localMITapOut[,1] - mean(localMITapOut[,1])
signif <- 0.05
TapOutquadrant[TapOutDV >0 & TapOutC_mI>0] <- 4
TapOutquadrant[TapOutDV <0 & TapOutC_mI<0] <- 1
TapOutquadrant[TapOutDV <0 & TapOutC_mI>0] <- 2
TapOutquadrant[TapOutDV >0 & TapOutC_mI<0] <- 3
TapOutquadrant[localMITapOut[,5]>signif] <- 0
colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")
clusters <- c("Insignificant", "Low-Low", "Low-High", "High-Low", "High-High")
SG_Commuter_vol_df.localMITapIn$TapInquadrant <- TapInquadrant
LISATapIn <- tm_shape(SG_Commuter_vol_df.localMITapIn) +
tm_fill(col = "TapInquadrant", style = "cat", title="Tap-in Classification", palette = colors[c(sort(unique(TapInquadrant)))+1], labels = clusters[c(sort(unique(TapInquadrant)))+1], popup.vars = c("Postal.Code")) +
tm_view(set.zoom.limits = c(11,17)) +
tm_borders(alpha=0.5)
SG_Commuter_vol_df.localMITapOut$TapOutquadrant <- TapOutquadrant
LISATapOut <- tm_shape(SG_Commuter_vol_df.localMITapOut) +
tm_fill(col = "TapOutquadrant", style = "cat", title="Tap-out Classification", palette = colors[c(sort(unique(TapOutquadrant)))+1], labels = clusters[c(sort(unique(TapOutquadrant)))+1], popup.vars = c("Postal.Code")) +
tm_view(set.zoom.limits = c(11,17)) +
tm_borders(alpha=0.5)
tmap_arrange(LISATapIn, LISATapOut, asp=1, ncol=2)
Based on the LISA map, we can see several clusters which primarily align with our initial observations of the local Moran’s I scores in the previous step. However, there appear to be some subzones in the north-east and west that fall within the High-High quadrant, indicating positive auto-correlation with neighbouring subzones.
For the purpose of this analysis, we will only be using the Adaptive Distance Weight Matrix, for the reasons explained earlier in Part 2 about the nature of the polygons we are dealing with. We will examine the same clusters identified in the previous section.
wm_knn6_lw <- nb2listw(wm_knn6, style = 'B') #wm_knn6 taken from previous Part 2
TapIngi.adaptive <- localG(SG_Commuter_vol_df$Tap_in, wm_knn6_lw)
TapInSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapIngi.adaptive))
names(TapInSG.gi)[length(names(TapInSG.gi))] <- "gstat_adaptive"
TapOutgi.adaptive <- localG(SG_Commuter_vol_df$Tap_out, wm_knn6_lw)
TapOutSG.gi <- cbind(SG_Commuter_vol_df, as.matrix(TapOutgi.adaptive))
names(TapOutSG.gi)[length(names(TapOutSG.gi))] <- "gstat_adaptive"
TapIngiMap <- tm_shape(TapInSG.gi) +
tm_fill(col = "gstat_adaptive",
style = "jenks",
palette = "-RdBu",
n=5,
title = "Tap-in local Gi") +
tm_borders(alpha = 0.5)
TapOutgiMap <- tm_shape(TapOutSG.gi) +
tm_fill(col = "gstat_adaptive",
style = "jenks",
palette = "-RdBu",
n=5,
title = "Tap-out local Gi") +
tm_borders(alpha = 0.5)
tmap_arrange(TapIngiMap, TapOutgiMap, asp=1, ncol=2)
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Based on the results, we see hot spots in areas which we had originally identified in our previous LISA map. There is also a surprising cold spot towards the west, around Tuas. This may be due to the way k-nearest neighbours works: the Tuas View Extension subzone is really only adjacent to 2 subzones, but the Adaptive Distance Weight Matrix forces it to consider more zones.
Overall, there appears to be some clustering for both Tap-in and Tap-out around the Tampines East subzone, with the local Moran’s I analysis, LISA mapping and Gi statistics all highlighting the zone or some of its surrounding areas as clusters or hot spots. This aligns with the population map produced in Part 1, which shows the area as relatively more populated than others, although we did not perform any geospatial analytics on the population distribution itself.
There are several variables which are not included within the statistical analysis above that must be considered when examining the scope of the study.
We had already removed the 0-4 age group due to the age and height requirements for bus fares, since the movements of children who travel free would not be captured in the tap-in and tap-out data. However, the fare exemption applies up to 7 years old together with a height criterion. As there was no feasible way to segment the residential data in a timely manner, the 5-9 age group was still included, which could have introduced some noise into the data.
We did not factor in bus frequencies or the availability of bus stops in each subzone. These could affect the aggregated tap-in and tap-out numbers, as more frequently serviced bus stops might generate larger commuter flows.
Our analysis also did not control for the availability of alternative transport such as the MRT, which greatly affects the use of bus services in an area. This may also contribute to increased volumes on routes that intersect with MRT stations.
In this section, we are interested in whether there is a time-variation factor that could affect the spatial distribution of commuters’ volumes. The intuition is that during the morning peak period, people travel from home in residential zones to work in non-residential zones, and the opposite should be true for the evening peak period. We want to see whether, if we segment the data into these two peak periods, there is a significant difference in the geospatial distribution.
HourStats <- sgpassengervolume %>%
filter(DAY_TYPE == "WEEKDAY") %>%
group_by(TIME_PER_HOUR) %>%
summarise(Tap_InStat = sum(TOTAL_TAP_IN_VOLUME),
Tap_outStat = sum(TOTAL_TAP_OUT_VOLUME))
ggplot(data= HourStats,aes(x = TIME_PER_HOUR, y= Tap_InStat)) + geom_bar(stat="identity", position = "dodge")
ggplot(data= HourStats,aes(x = TIME_PER_HOUR, y= Tap_outStat)) + geom_bar(stat="identity", position = "dodge")
Based on the data above, we can see two distinct peak periods for both Tap-in and Tap-out. We are going to check whether there is a geospatial difference between these periods. We will use the two peak periods as our two data sets, each covering a three-hour span: 6-8 am (hours 6-8) and 5-7 pm (hours 17-19).
Assuming these peaks are due to working-hour timings, we will further limit our data to “WEEKDAY”, i.e. working days.
TotalbusstopvolMorning <- sgpassengervolume %>%
filter(TIME_PER_HOUR %in% (6:8), DAY_TYPE == "WEEKDAY") %>%
group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
summarise(Tap_InMorning = sum(TOTAL_TAP_IN_VOLUME),
Tap_outMorning = sum(TOTAL_TAP_OUT_VOLUME)) #be sure to check data
TotalbusstopvolEvening <- sgpassengervolume %>%
filter(TIME_PER_HOUR %in% (17:19), DAY_TYPE == "WEEKDAY") %>%
group_by(BUS_STOP_N = PT_CODE) %>% #rename as per naming convention by LTA
summarise(Tap_InEvening = sum(TOTAL_TAP_IN_VOLUME),
Tap_outEvening = sum(TOTAL_TAP_OUT_VOLUME)) #check Day type and time
TotalbusstopvolEvening[rowSums(is.na(TotalbusstopvolEvening))!=0,]
## # A tibble: 0 x 3
## # ... with 3 variables: BUS_STOP_N <chr>, Tap_InEvening <dbl>,
## # Tap_outEvening <dbl>
Cleared of NA Values
TimeVar_sf <- left_join(busstoplocations_sf,TotalbusstopvolMorning) #we already renamed the variable to match so there is no need to specify the joining factor
## Joining, by = "BUS_STOP_N"
TimeVar_sf <- left_join(TimeVar_sf,TotalbusstopvolEvening) #we already renamed the variable to match so there is no need to specify the joining factor
## Joining, by = "BUS_STOP_N"
TimeVar_sf[rowSums(is.na(TimeVar_sf))!=0,]
## Simple feature collection with 125 features and 9 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 7702.752 ymin: 27733.06 xmax: 41244.44 ymax: 49261.6
## CRS: EPSG:3414
## First 10 features:
## BUS_STOP_N BUS_ROOF_N LOC_DESC Tap_In Tap_out Tap_InMorning
## 14 49249 B08 DRAGON FISH IND 721 691 NA
## 178 91121 NIL HSE 189 0 0 NA
## 256 44401 NIL OPP BLK 183 0 0 NA
## 275 91101 B03 TG KATONG FLYOVER 24 54 NA
## 276 49209 B01 SG BULOH WETLAND RESERVE 1413 1118 NA
## 344 31121 B40 NIRVANA MEMORIAL GDN 0 0 NA
## 372 45391 NIL BLK 657 0 9 NA
## 374 49239 B06 OPP NYEE PHOE GROUP 369 222 NA
## 432 93171 B15 OPP ISLAND RESORT S'PORE 542 206 NA
## 461 23381 B01A PLANT ENGRG 0 0 NA
## Tap_outMorning Tap_InEvening Tap_outEvening geometry
## 14 NA NA NA POINT (16000.18 46531.64)
## 178 NA NA NA POINT (35030.24 31256.69)
## 256 NA NA NA POINT (20144.74 40279.55)
## 275 NA NA NA POINT (35058.96 30937.84)
## 276 NA NA NA POINT (16482.29 47421.34)
## 344 NA NA NA POINT (11838.8 39157.43)
## 372 NA NA NA POINT (18697.5 42446.09)
## 374 NA NA NA POINT (16274.91 46921.07)
## 432 NA NA NA POINT (38108.3 31937.69)
## 461 NA NA NA POINT (12552.18 32234.11)
TimeVar_sf$Tap_InMorning[is.na(TimeVar_sf$Tap_InMorning)] <- 0
TimeVar_sf$Tap_outMorning[is.na(TimeVar_sf$Tap_outMorning)] <- 0
TimeVar_sf$Tap_InEvening[is.na(TimeVar_sf$Tap_InEvening)] <- 0
TimeVar_sf$Tap_outEvening[is.na(TimeVar_sf$Tap_outEvening)] <- 0
TimeVarSubzone_sf <- st_join(PlanningSubzone_sf,TimeVar_sf, join=st_intersects) #using st_intersects to join bus stops that are either within or touching the polygons.
TimeVarSubzone_sf$Tap_In[is.na(TimeVarSubzone_sf$Tap_In)] <- 0
TimeVarSubzone_sf$Tap_out[is.na(TimeVarSubzone_sf$Tap_out)] <- 0
TimeVarSubzone_sf$Tap_InMorning[is.na(TimeVarSubzone_sf$Tap_InMorning)] <- 0
TimeVarSubzone_sf$Tap_outMorning[is.na(TimeVarSubzone_sf$Tap_outMorning)] <- 0
TimeVarSubzone_sf$Tap_InEvening[is.na(TimeVarSubzone_sf$Tap_InEvening)] <- 0
TimeVarSubzone_sf$Tap_outEvening[is.na(TimeVarSubzone_sf$Tap_outEvening)] <- 0
TimeVarSubzone_sf$BUS_ROOF_N[is.na(TimeVarSubzone_sf$BUS_ROOF_N)] <- "UNKNOWN"
TimeVarSubzone_sf$BUS_STOP_N[is.na(TimeVarSubzone_sf$BUS_STOP_N)] <- "UNKNOWN"
TimeVarSubzone_sf$LOC_DESC[is.na(TimeVarSubzone_sf$LOC_DESC)] <- "UNKNOWN"
SummarizedTimeVar_sf <- TimeVarSubzone_sf %>%
filter(Tap_In!= 0 | Tap_out != 0 | resident_count != 0 ) %>% # Filters do not change in this case.
group_by(Subzone = SUBZONE_N) %>%
summarise(Residential = as.numeric(first(resident_count)), Tap_in = sum(Tap_In), Tap_out = sum(Tap_out), Tap_inMorning = sum(Tap_InMorning), Tap_outMorning = sum(Tap_outMorning), Tap_inEvening = sum(Tap_InEvening), Tap_outEvening = sum(Tap_outEvening) )
TapInMorningMap <- tm_shape(SummarizedTimeVar_sf)+
tm_fill(col = c("Tap_inMorning"),
n = 5,
style="jenks",
palette = "Blues",
title = "Morning Tap-in Volume") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white")+
tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)",
position = c("left", "bottom"))
TapOutMorningMap <- tm_shape(SummarizedTimeVar_sf)+
tm_fill(col = c("Tap_outMorning"),
n = 5,
style="jenks",
palette = "Reds",
title = "Morning Tap-out Volume") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white") +
tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)",
position = c("left", "bottom"))
TapInEveningMap <- tm_shape(SummarizedTimeVar_sf)+
tm_fill(col = c("Tap_inEvening"),
n = 5,
style="jenks",
palette = "Blues",
title = "Evening Tap-in Volume") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white")+
tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)",
position = c("left", "bottom"))
TapOutEveningMap <- tm_shape(SummarizedTimeVar_sf)+
tm_fill(col = c("Tap_outEvening"),
n = 5,
style="jenks",
palette = "Reds",
title = "Evening Tap-out Volume") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white") +
tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)",
position = c("left", "bottom"))
tmap_arrange(TapInMorningMap, TapOutMorningMap, TapInEveningMap, TapOutEveningMap,asp=1, ncol=2, nrow = 2)
At first glance, it is hard to spot any drastic shifts across the four maps, with the exception of Changi Airport having significantly higher Tap-out volumes in the morning. One thing we could examine is the difference between morning and evening volumes for both measures, and plot another pair of maps for those.
Note that the calculation we are making here is (Morning - Evening).
SummarizedTimeVar_sf <- add_column(SummarizedTimeVar_sf, TapInDifference = SummarizedTimeVar_sf$Tap_inMorning - SummarizedTimeVar_sf$Tap_inEvening, .before = "geometry")
SummarizedTimeVar_sf <- add_column(SummarizedTimeVar_sf, TapOutDifference = SummarizedTimeVar_sf$Tap_outMorning - SummarizedTimeVar_sf$Tap_outEvening, .before = "geometry")
TapInDifferenceMap <- tm_shape(SummarizedTimeVar_sf)+
tm_fill(col = "TapInDifference",
n = 5,
style="jenks",
palette = "RdBu",
title = "Difference in Tap-in Volume") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white")+
tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)",
position = c("left", "bottom"))
TapOutDifferenceMap <- tm_shape(SummarizedTimeVar_sf)+
tm_fill(col = "TapOutDifference",
n = 5,
style="jenks",
palette = "PRGn",
title = "Difference in Tap-out Volume") +
tm_layout(legend.position = c("right", "bottom")) +
tm_borders(alpha = 0.5) +
tmap_style("white") +
tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Commuter's Volume data from Land Transport Authority (LTA)",
position = c("left", "bottom"))
tmap_arrange(TapInDifferenceMap, TapOutDifferenceMap ,asp=1, ncol=2)
It seems there are some distinct differences by region; however, it is hard to be sure what this map actually tells us in terms of clustering and its relationship to the resident population in those subzones. We will need to plot the data on a graph to get a better idea of how these values interact.
TapInDiffLM.model <- lm(TapInDifference ~ Residential, data = SummarizedTimeVar_sf)
summary(TapInDiffLM.model)
##
## Call:
## lm(formula = TapInDifference ~ Residential, data = SummarizedTimeVar_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -135341 -10496 3849 13548 104149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.355e+04 2.296e+03 -5.901 9.62e-09 ***
## Residential 1.170e+00 1.075e-01 10.884 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32570 on 305 degrees of freedom
## Multiple R-squared: 0.2797, Adjusted R-squared: 0.2774
## F-statistic: 118.5 on 1 and 305 DF, p-value: < 2.2e-16
TapInDiffPlot <- ggplot(data= TapInDiffLM.model, aes(x = Residential, y= TapInDifference)) +
geom_point(color='blue',size=2)+
geom_segment(aes(xend = Residential, yend = predict(TapInDiffLM.model))) +
geom_smooth(method = "lm", formula = y ~ x, color='black')+
xlab("Residential Population Count")+
ylab("Tap-in Difference")
TapInDiffPlot
Although our linear model seems to indicate a positive relationship between the resident population and the difference between morning and evening peak Tap-in volumes, it is hard to draw a conclusion, as a large amount of the data is skewed to the left at low residential counts. Thus we must look for a different method.
For the purpose of this analysis, we will only be using Adaptive Distance Weight Matrix created in the earlier Part 3.
SG_Commuter_Diff_df <- as_Spatial(SummarizedTimeVar_sf)
TapInDiffgi.adaptive <- localG(SG_Commuter_Diff_df$TapInDifference, wm_knn6_lw)
TapInDiffSG.gi <- cbind(SG_Commuter_Diff_df, as.matrix(TapInDiffgi.adaptive))
names(TapInDiffSG.gi)[length(names(TapInDiffSG.gi))] <- "gstat_adaptive"
TapOutDiffgi.adaptive <- localG(SG_Commuter_Diff_df$TapOutDifference, wm_knn6_lw)
TapOutDiffSG.gi <- cbind(SG_Commuter_Diff_df, as.matrix(TapOutDiffgi.adaptive))
names(TapOutDiffSG.gi)[length(names(TapOutDiffSG.gi))] <- "gstat_adaptive"
TapInDiffgiMap <- tm_shape(TapInDiffSG.gi) +
tm_fill(col = "gstat_adaptive",
style = "jenks",
palette = "-RdBu",
n=5,
title = "Tap-in local Gi") +
tm_borders(alpha = 0.5)
TapOutDiffgiMap <- tm_shape(TapOutDiffSG.gi) +
tm_fill(col = "gstat_adaptive",
style = "jenks",
palette = "-RdBu",
n=5,
title = "Tap-out local Gi") +
tm_borders(alpha = 0.5)
tmap_arrange(TapInDiffgiMap, TapOutDiffgiMap, asp=1, ncol=2)
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Here we can see something very interesting. Looking at the Tap-in local Gi map, there are hot spots primarily in the northern areas of Singapore, while cold spots exist primarily towards the south and downtown Singapore. These hot spots indicate more people tapping in in the mornings than in the evenings, possibly the working crowd travelling downtown to work. The results for the Tap-out local Gi also seem to support the results from the Tap-in local Gi. Interestingly, the Tuas and lower Jurong area are hot spots for the tap-out values, which lines up with the idea that more people arrive for work and tap out there in the morning than in the evening.
Overall, this study shows that time variation does play a factor in commuter flow within Singapore, particularly on a geospatial level from north to south in the morning peak hours and vice versa in the evening peak hours. This should be factored in when utilizing the bus Tap-in and Tap-out volumes for analysis. Furthermore, this study could help inform URA’s development policies on commercial and residential areas around Singapore by providing a way to observe the effectiveness of implementing these zones.