This document is a record of the steps undertaken in the attempt to study and visualise the spatial distribution of Hawker Centres in Singapore, along with their characteristics.
# install.packages(c('data.table','sp','spatstat','ggmap','maptools','ggplot2'))
# Make sure you have the above dependencies before proceeding
# Import the following packages using library
library(data.table)
library(sp)
library(rgdal)
library(spatstat)
library(ggmap)
library(maptools)
library(ggplot2)
library(plotly)
Data for this project is obtained by compiling various lists of all successsful tenders for hawker centres from early 2012 to early 2017 from the website of the National Environment Agency of Singapore here.
The data is then converted from pdf to csv format using the Tabula Tool
tenders <- fread("D:\\Backup\\SUTD\\HASS\\Maps 1\\Labs\\Lab 9\\02221-Lab9\\tabula-tender-bids-from-mar-2012-to-jan-2017.csv" , header = F, fill = T)
# Setting fill = True will fill empty rows or columns if there are weird breaks in the dataset
As the dataset obtained from tabula is not properly formatted and 100% correct, some work to clean up the dataset has to be done.
# The following will set the headers of the csv in the order of the vector given
# This allows for easy reference to the columns
colnames(tenders) <- c("centre","stall","area","trade","bid","month")
# Removes the empty rows that happens after every row with values
tenders <- tenders[centre != "" ,] # Only select rows in tenders with non-empty values
# Remove the separator in the dataset to separate between cooked food and lock-up stalls
# tenders <- tenders[centre != "LOCK-UP STALLS", ] # This is 1 alternative
tenders <- tenders[!1135,] # Removes row 1135 which is where the separator is
# Another class - Market Stall at 1749 after remonving above separator
tenders <- tenders[!1749]
# to fix mis-aligned data using R:
tofix <- tenders[month=="",] # Create a new data table with all the entries with empty months
tofix[,c("month","bid","trade","area"):=list(bid,trade,area,stall),] # Reassigns "month"->"bid" etc...
tofix[,stall:=stringi::stri_extract_last(tofix[,centre,],regex="[0-9]{2}-[0-9]{2,3}"),] # Extracts the stall numbers from centre and stores it under "stall"
tofix[,centre:=stringi::stri_replace_last(tofix[,centre,],replacement = "",regex=" [0-9]{2}-[0-9]{2,3}"),] # Removes the stall numbers from centre
tenders[month=="",] <- tofix # Changes rows in tenders with wrong formatting to right one
#to fix badly formated periods in area:
tenders[,areaNum:=as.numeric(gsub(area,pattern=" |\\.", replacement=""))/100,]
#to fix badly formatted hyphens in stall
tenders[,stall:=stringi::stri_replace_last(tenders[,stall,],replacement = "", regex=" -"),]
tenders[,stall:=stringi::stri_replace_last(tenders[,stall,],replacement = "-", regex=" "),]
# Create a new column to mark stalls separated between cooked food and lock-up stalls
tenders[1:1134,type:="cooked",] # The special operator ":=" assigns the value "cooked" to field type, and creates the new field if it is not present
tenders[1135:1748 , type:="lockup"] # Sets relevant stalls to type lockup
tenders[1749:nrow(tenders),type:="market_stall"] # This sets everything below the separator to type "market_stall" - nrow(tenders) gives the value of the last row of tenders
# Removes The character '$' and ',' from the column "bid" - $1,000.31 -> 1000.31
tenders[,bidNum := as.numeric(gsub(bid, pattern = "\\$|,", replacement = "")),]
# as.numeric converts objects of type "numeric" - Something like casting
# gsub belongs to the grep class - gsub will replace ALL matches
# gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,fixed = FALSE, useBytes = FALSE)
# Creates a new column "date"" to represent the dates of bids
# While saving the data in a date format
tenders[,date:=as.Date(paste0("01-",month), "%d-%b-%Y"),]
# paste0 - used to concatenate vectors after converting to character (Smt like string addition)
# %b - abbreviated month
# Creates a new field price per area - "priceM2"
tenders[,priceM2:=bidNum/as.numeric(areaNum),] # Calculates the price per metre and save result in a new field "priceM2"
head(tenders) # used to show data to see if data is cleaned up properly
## centre stall area trade bid month
## 1: AMOY STREET FOOD CENTRE 01-11 5.65 COOKED FOOD $3,701.11 Oct-2015
## 2: AMOY STREET FOOD CENTRE 01-48 5.65 COOKED FOOD $4,048.00 Mar-2016
## 3: AMOY STREET FOOD CENTRE 01-48 5.65 COOKED FOOD $3,200.00 Jun-2016
## 4: AMOY STREET FOOD CENTRE 01-50 5.65 COOKED FOOD $2,728.00 Jun-2016
## 5: AMOY STREET FOOD CENTRE 01-60 5.65 COOKED FOOD $3,188.00 Jun-2016
## 6: AMOY STREET FOOD CENTRE 01-68 5.65 HALAL COOKED FOOD $1,800.00 Jun-2013
## areaNum type bidNum date priceM2
## 1: 5.65 cooked 3701.11 2015-10-01 655.0637
## 2: 5.65 cooked 4048.00 2016-03-01 716.4602
## 3: 5.65 cooked 3200.00 2016-06-01 566.3717
## 4: 5.65 cooked 2728.00 2016-06-01 482.8319
## 5: 5.65 cooked 3188.00 2016-06-01 564.2478
## 6: 5.65 cooked 1800.00 2013-06-01 318.5841
Test for incomplete or badly formatted data - Code not run for final document
tenders[!complete.cases(tenders)]
tenders[,list(price=mean(priceM2)), by = centre] # This returns a list of hawker centres with their averaage bid per area
# .() can be used as an alias to list()
table(tenders$type) # To see the different type of stall, and the count for each type
tenders[type=="cooked",list(price=mean(priceM2)), by = centre] # Average price per centre for cooked food
tenders[,list(count=length(stall)),by=centre] # Counting the number of bids per centre by the number of entries per centre
ggplot(tenders[,list(count=length(stall)), by = date], aes(date,count)) + geom_point(color="sienna4") + geom_line(color="orangered2") + geom_smooth(color="indianred2") + ggtitle("Changes in Bids over time") + xlab("Time") + ylab("Number of Bids") # Plots line graph to show how number of bids change over time
## `geom_smooth()` using method = 'loess'
From the change in bids over time, it is obvious that the number of hawker centres in singapore has increased over the years. This might be due to the increase in the popoulation of Singapore over the years.
ggplotly(ggplot(tenders, aes(x=areaNum, y = bidNum)) + geom_hex() + scale_fill_distiller( type = "seq", palette = 7, direction = 1, values = NULL, space = "Lab", na.value = "grey50", guide = "colourbar") + ggtitle("Relation between Bid Price and Size of shop") +xlab("Size of shop in square metres") + ylab("Bid price for shop")) # Plots density graph showing relation between bidding price and size of shop
From the density graph, we can see that majority of the successful bids are for shop sizes between 6 to 7 square meteres in size, with relatively low prices. Also, we can see that most shop sizes are between 5 to 10 square metres in size. Thus, we can infer that most of the shops in hawker centres are between 5 to 10 square metres in size, and the average bid for shops are usually quite small, although there is a high variance in the data set. This could be caused by the locations of the hawker centres, where hawker centres in Business Districts would require higher bidding price as compared to hawker centres in neighbourhood areas.
ggplot(tenders[,list(price=mean(priceM2)), by = date], aes(date,price)) + geom_point(color="sienna4") + geom_line(color="orangered2") + geom_smooth(color="indianred2") + ggtitle("Changes in average price of Bids over time") + xlab("Time") + ylab("Average price of Bids") # Plots line graph to show how avg price of bids has changed with time
## `geom_smooth()` using method = 'loess'
ggplotly(ggplot(tenders[,list(count=length(stall)),by=centre], aes(centre,count)) + geom_bar(stat = "identity", fill = "blue4") + ggtitle("Number of Bids per Centre 2012-2017") + xlab("Hawker Centres") + ylab("Number of Bids") + theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())) # Plots bar char showing the differences in number of bids per centre
ggplotly(ggplot(tenders[,list(price=mean(priceM2)),by=centre], aes(centre,price)) + geom_bar(stat = "identity", fill = "coral4") + ggtitle("Average price of Bid per Centre 2012-2017") + xlab("Hawker Centres") + ylab("Average price of Bids") + theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())) # Plots bar chart showing the average price per area per centre
ggplotly(ggplot(tenders[,list(price=mean(priceM2)),by=type], aes(type,price)) + geom_bar(stat = "identity", fill="darkorange2") + ggtitle("Average price of bid by Type") + xlab("Type") + ylab("Average price of Bid") + theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())) # Plots bar chart showing the average price per area by type
ggplotly(ggplot(tenders[,list(price=mean(priceM2)),by=trade], aes(trade,price)) + geom_bar(stat = "identity",fill="firebrick") + ggtitle("Average price of bid by trade") + xlab("Trade") + ylab("Average price of Bid") + theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())) # Plots bar chart showing the average price per area by trade
As we intend to carry out spatial analysis of the hawker centres in singapore, we need to obtain a spatial dataset of all hawker centres in Singpaore. This data is obtained from the Government of Singapore here.
h <- readOGR("D:\\Backup\\SUTD\\HASS\\Maps 1\\Labs\\Lab 9\\hawker-centres-kml\\HAWKERCENTRE.kml") # Importing in the dataset
## OGR data source with driver: KML
## Source: "D:\Backup\SUTD\HASS\Maps 1\Labs\Lab 9\hawker-centres-kml\HAWKERCENTRE.kml", layer: "HAWKERCENTRE"
## with 116 features
## It has 2 fields
plot(h, main="Visualisation of Hawker Centre Data") # Plot the points of the dataset for initial visualisation
h.t <- data.frame(toupper(h$Name),h@coords[,1:2]) # This extracts the name in the kml and converts it to upper case. It extracts the coordinates as well. The result is then stored in h.t
colnames(h.t) <- c("name","lon","lat") # This renames the column names
# Merge carries out a spatial joint
tenders.sp <- merge(tenders, h.t, by.x="centre",by.y="name", all.x=T) # Merges h.t to tenders by matching centre in tenders to name in h.t
This is a simple method of spatially joining our datset together. However, this method does not join all points together properly, thus another better method has to be used.
if(file.exists("geocoded-centres.csv")){ # To prevent having to access Google's service everytime file is rendered
centres <- read.csv("geocoded-centres.csv")
}else{
centres <- tenders[,list(count=.N), by=centre]
centres[,location:=paste0(centre,", Singapore"),]
g <- geocode(centres[,location,], output = "latlon", source = "google", sensor = F)
centres <- cbind(centres,g)
write.csv(centres, "geocoded-centres.csv")
}
tenders.sp <- merge(tenders,centres, by.x = "centre", by.y = "centre", all.x = T)
ggplot(tenders.sp, aes(x=lon, y = lat, size = priceM2, color = type)) + geom_point(alpha = 0.3) + coord_fixed() + ggtitle("Hawker Centres by Type 2012-2017") + xlab("Longditude") + ylab("Latitude")
ggplot(tenders.sp, aes(x=lon, y = lat)) + geom_point() + geom_density_2d() + coord_fixed() + ggtitle("Density Contour of Hawker Centres 2012-2017") + xlab("Longditude") + ylab("Latitude")
ggplot(tenders.sp, aes(x=lon, y = lat)) + geom_hex() +coord_fixed() + ggtitle("Density of Hawker Centres 2012-2017") + xlab("Longditude") + ylab("Latitude") + scale_fill_distiller( type = "seq", palette = 7, direction = 1, values = NULL, space = "Lab", na.value = "grey50", guide = "colourbar")
This section will aim to determine if there is a specific spatial distribution with regards to the number of bids per ‘trade’ and ‘type’
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_point() + facet_wrap(~trade) + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Distribution of Hawker Centres by Trade 2012-2017") + xlab("") + ylab("")
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_point() + geom_density_2d() + coord_fixed() + facet_wrap(~trade) + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Density Contour of Hawker Centres by Trade 2012-2017") + xlab("") + ylab("")
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_hex() +coord_fixed() + facet_wrap(~trade) + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Density of Hawker Centres by Trade 2012-2017") + xlab("") + ylab("") + scale_fill_distiller( type = "seq", palette = 14, direction = 1, values = NULL, space = "Lab", na.value = "grey50", guide = "colourbar")
From the Density contour plot of data from the data obtained, we can conclude that there are some food types that tend to cluster together, for example, Mutton, Halal Beef, Cut Fruits and Drinks. This might be because these shops do not cater to the mainstream demands of the customers, thus, their area of operation is limited as they would only be able to serve a niche market, making it more sensible for this shops to cluster, allowing the various shops to benefit from one another by attracting the niche cliente to the same area. The rest of the shops seems to be relatively spread out over Singapore. However, despite the shops being spread out for each food type, there are still clusters within each food type, as can be seen from the spatial density plot, where there are some red zones where many shops of the same type cluster together.This clustering effect might be caused by certain distrincts being famous for certain food specialities, resulting in more shops clustering in the same area to take advantage of the fame of these areas to increase the size of their customer base.
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_point() + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Distribution of Hawker Centres by Type 2012-2017") + xlab("") + ylab("") + facet_wrap(~type)
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_point() + geom_density_2d() + coord_fixed() + facet_wrap(~type) + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Density Contour of Hawker Centres by Type 2012-2017") + xlab("") + ylab("")
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_hex() +coord_fixed() + facet_wrap(~type) + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Density of Hawker Centres by Type 2012-2017") + xlab("") + ylab("") + scale_fill_distiller( type = "seq", palette = 14, direction = 1, values = NULL, space = "Lab", na.value = "grey50", guide = "colourbar")
#Creating new field year in tenders.sp
tenders.sp[,year:=stringi::stri_extract_first(tenders.sp[,date,],regex="[0-9]{4}"),]
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_point() + facet_wrap(~year) + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Distribution of Hawker Centres over the Years") + xlab("") + ylab("")
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_point() + geom_density_2d() + coord_fixed() + facet_wrap(~year) + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Density contour of Hawker Centres over the Years") + xlab("") + ylab("")
ggplot(tenders.sp , aes(x=lon,y=lat)) + geom_hex() +coord_fixed() + facet_wrap(~year) + theme(axis.text = element_blank(), axis.ticks = element_blank()) + ggtitle("Spatial Density of Hawker Centres over the Years") + xlab("") + ylab("") + scale_fill_distiller( type = "seq", palette = 14, direction = 1, values = NULL, space = "Lab", na.value = "grey50", guide = "colourbar")
Note that data for the year of 2017 should not be interpreted at face value as the data for the year of 2017 is not complete, and it only contains data from early 2017.
This section will look at analysing the patterns that exist within the spatial dataset. In this section, we will be using the spatstat package at the end, which requires our data set to in a .ppp format.
centres.sp <- tenders.sp[lat>0,list(lon=lon[1],lat=lat[1], price=mean(priceM2), count = .N), by = centre] # This creates a table with 1 row for each centre, containing data of the lat,lon,avg price, and number of bids per hawker centre
centres.sp[is.na(price),price:=0,] # Set NA fields to 0
coordinates(centres.sp) <- c('lon','lat')
# Requires Spatstat
centres.ppp <- unmark(as.ppp(centres.sp)) # unmark removes all the unnecessary variables in the dataset
# Requires maptools
plot(centres.sp,main="Location of Hawker Centres without reference 2012-2017")
plot(centres.ppp,main="Location of Hawker Centres with Border 2012-2017")
# readOGR used to read kml file
sg <- readOGR("D:\\Backup\\SUTD\\HASS\\Maps 1\\Labs\\Lab 9\\02221-Lab9\\sg-all.shp")
plot(sg, main="Map of Singapore") # Plot to check that the import is correct
sg.window <- as.owin(sg) # converts owin type object to class window
centres.ppp <- centres.ppp[sg.window]
plot(centres.ppp, main="Hawker Centres 2012-2017") # To show the final result that we obtained
This plot of the hawker centres seems to suggest that hawker centres in Singapore tend to cluster together. However, we cannot conclusively say that hawker centres in Singapore are clusterd based on this ovservation as we cannot tell if the clustering we observe is the result of pure chance.
To obtain assurance, we have to carry out further analysis on the spatial point patterns. To do this, we use the spatstat package as shown below.
Plotting the new centres.ppp using different Point Pattern Analysis methods as compared to a random Poisson distribution
plot(Kest(centres.ppp), main="Visualisation to classify observations")
Using this plot, we can then compare our data points with data obtained by random sampling of a poisson distribution, which should be very close to a truly random situation. Therefore, from this plot, we can see that for all values of r, our data point is always greater than the value of a random distribution, thus, we can conclude that our data set exhibits clustering behaviour. This can also be seen via other visualisations as illustrated below.
plot(density(centres.ppp, 0.02), main="Spatial density of Hawker Centres 2012-2017")
contour(density(centres.ppp,0.02), main="Spatial contour of Hawker Centres 2012-2017")
From the density and contour map, we can see where hawker centres tend to cluster about in Singapore. This clustering around the Central, south, South-East section of Singapore might be due to these areas of singapore being the Central Business District, and a more prosperous part of Singapore. It could also be that the population density of Singapore is higher in those areas.
To test this hypothesis, we have to look at data relating to the Population density of Singapore. We do this in the next section.
pop <- as.im(readGDAL("D:\\Backup\\SUTD\\HASS\\Maps 1\\Labs\\Lab 9\\02221-Lab9\\sg-pop.tif"))
## D:\Backup\SUTD\HASS\Maps 1\Labs\Lab 9\02221-Lab9\sg-pop.tif has GDAL driver GTiff
## and has 37 rows and 58 columns
plot(pop, main="Population Density map of Singapore") # Visualising the population distribution in Singapore
From this graph, we can get a rough census of how the population of Singapore is distributed. We shall now plot a graph of the distribution of hawker centres against the population distribution to observe for any correlation between the 2 variables.
plot(rhohat(centres.ppp, pop), main="Distribution of hawker centres against population distribution", xlab="Population distrbution", ylab="Intensity") # estimate the number of hawker centre in each location
# rhohat computes a smoothing estimate of hawker centres, as a function of population in this case
# From the graph we can see there is a very strong correlation with population to the numer of hawker centres
# But at the very end, we can see that there are 2 hawker centres that have no clustering at all
The distribution of hawker centres can also be weighted by the price of the tender of each hawker centre.
plot(rhohat(centres.ppp,pop,weights=centres.sp$price),main="Distribution of hawker centres(weighted by price) against population distribution", xlab="Population distribution", ylab="Intensity")
From the 2 graphs above, we can see a distinct trend whereby there is a strong correlation between the population density in a particular area, and the number of hawker centres observed, where the number of hawker centre observed increases with the population density of the area. Thus, there is a clustering of hawker centres around areas with high population density as can be seen in the graph where the greatest cluster of hawker centre is also the point with the highest population density.
However, there exists some outliers in the data obtained, as we can see that at an unspecified location, despite population density being relatively high as compared to other areas, there is no clustering of hawker centre at those areas. In fact, only 2 hawker centres are observed there.
To understand this phenonema, we need to find out where the 2 outliers are located.
plot(pop, main="Overlay of Hawker Centres(2012-2017) on population density of Singapore")
plot(centres.ppp, add=T, col = "white")
From this, we can see that the areas with high popoulation density but no clustering of hawker centres to be Jurong West, Woodlands, and Hougang There could be many reasons for the lack of clustering of hawker centres in this areas. Possible reasons include the ease of access to these places, as shop owners will feel less inclined to set-up shops in places like Jurong West, Woodlands, or Hougang if there is no easy way there, or any major source of attraction to attract customers to visit these remote places. This might be caused by the city planning where most of the attractions and commerical activities in Singapore are centered in the central, southern, south-east areas. Other factors could be that this estates are still considered new estates, and have yet to mature, thus resulting in a lack of clustering observed in this areas as new hawker centres have yet to be built to cater to the population in those areas.