First, load in the required libraries:
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1 <U+221A> purrr 0.2.4
## <U+221A> tibble 1.4.2 <U+221A> dplyr 0.7.4
## <U+221A> tidyr 0.8.0 <U+221A> stringr 1.3.0
## <U+221A> readr 1.1.1 <U+221A> forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(sf)
## Warning: package 'sf' was built under R version 3.4.4
## Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3
library(sp)
library(rgdal)
## Warning: package 'rgdal' was built under R version 3.4.4
## rgdal: version: 1.2-18, (SVN revision 718)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.1.3, released 2017/20/01
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgdal/proj
## Linking to sp version: 1.2-7
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## Loading required package: rpart
##
## spatstat 1.55-0 (nickname: 'Stunned Mullet')
## For an introduction to spatstat, type 'beginner'
library(ggmap)
library(maptools)
## Checking rgeos availability: FALSE
## Note: when rgeos is not available, polygon geometry computations in maptools depend on gpclib,
## which has a restricted licence. It is disabled by default;
## to enable gpclib, type gpclibPermit()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Load the data from a csv file and take a look at it:
tenders <- read_csv("tabula-tender-bids-from-mar-2012-to-jan-2017.csv", col_names = c("centre","stall","area","trade","bid","month"))
## Parsed with column specification:
## cols(
## centre = col_character(),
## stall = col_character(),
## area = col_double(),
## trade = col_character(),
## bid = col_character(),
## month = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 23 parsing failures.
## row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 2269 <NA> 6 columns 4 columns 'tabula-tender-bids-from-m~ file 2 3499 <NA> 6 columns 4 columns 'tabula-tender-bids-from-m~ row 3 3945 area no trailing characters " 38 ." 'tabula-tender-bids-from-m~ col 4 3947 area no trailing characters " 27 ." 'tabula-tender-bids-from-m~ expected 5 3999 area no trailing characters " 02 ." 'tabula-tender-bids-from-m~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.
head(tenders)
summary(tenders)
## centre stall area trade
## Length:5563 Length:5563 Min. : 2.080 Length:5563
## Class :character Class :character 1st Qu.: 6.000 Class :character
## Mode :character Mode :character Median : 6.750 Mode :character
## Mean : 7.091
## 3rd Qu.: 8.490
## Max. :22.000
## NA's :2804
## bid month
## Length:5563 Length:5563
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
There are some issues with the data that need to be cleaned: - Every alternate row is empty - The stall type are marked using header rows - Price and month should be parsed as numeric and date formats respectively
# Fix alternate empty rows
tenders <- tenders %>% drop_na(centre)
# Convert stall type to column
tenders <- tenders %>%
filter(row_number() != 1135) %>%
mutate(type = if_else(row_number() < 1135, "cooked", "lockup")) %>%
filter(row_number() != 1749) %>%
mutate(type = if_else(row_number() >= 1749, "market", type))
# Parse price and month as numeric and date
tenders <- tenders %>%
mutate(bidNum = as.numeric(str_replace_all(bid, pattern="\\$|,", replacement=""))) %>%
mutate(date = dmy(paste0("01-",month))) %>%
mutate(priceM2 = bidNum/area)
(At this point, I realised that there are 49 rows (588 to 636) where the data is shifted, because of a missing comma in each row. I edited the csv file to add the missing commas, then redid all the earlier steps.)
Does the average price per m2 change over time?
avg_price_date = tenders %>%
filter(is.na(priceM2) == FALSE) %>%
group_by(date) %>%
summarise(avg_price = mean(priceM2))
ggplot(avg_price_date, aes(x=date, y=avg_price)) + geom_line() + labs(title="Hawker stalls price per m2 over time", x="Month", y="Price per m2")
It looks like the average price increased over the years, but because the price fluctuates significantly, it’s a bit hard to tell for certain. What if we take a running 3 month average of the price instead?
avg_price_date$avg_3mth = 0
n = nrow(avg_price_date)
for (i in 2:(nrow(avg_price_date)-1)) {
avg_price_date$avg_3mth[i] = (avg_price_date$avg_price[i-1]
+avg_price_date$avg_price[i]
+avg_price_date$avg_price[i+1]
)/3
}
avg_price_date$avg_3mth[1] = (avg_price_date$avg_price[1]+avg_price_date$avg_price[2])/2
avg_price_date$avg_3mth[n] = (avg_price_date$avg_price[n]+avg_price_date$avg_price[n-1])/2
ggplot(avg_price_date, aes(x=date, y=avg_3mth)) + geom_line() + labs(title="Hawker stalls price per m2, 3-month average", x="Month", y="Price per m2")
We can see that the price started very high in 2012 but crashed in the later half. It was on an upward trend from 2014-16, but dipped towards the end of 2016.
How many bids are there for each trade?
nbids_trade <- tenders %>%
group_by(trade) %>%
summarise(count = n())
# sort data and plot
nbids_trade <- nbids_trade[order(nbids_trade$count), ]
nbids_trade %>%
ggplot(aes(x=factor(trade, trade), y=count)) + geom_col() + coord_flip() + labs(title="Number of bids for each trade", x="Trade", y="Number of bids")
Cooked food and sundry goods top the list, while more specialized trades such as beef, mutton and halal meat have the least bids.
Next, we use geocoding to get the specific geographic location of each hawker centre and do some analysis.
# geocoding and joining
centres = read.csv('centres-geocoded.csv')
tenders.sp <- left_join(tenders, centres, by = c("centre" = "centre"))
## Warning: Column `centre` joining character vector and factor, coercing into
## character vector
Running a check, we find that there is one row in tenders that has not been joined correctly. This is because the centre name is repeated in the original data, so just clean that up and join again.
tenders.sp %>% filter(is.na(lat))
tenders$centre[597] = "BLK 51 OLD AIRPORT ROAD"
tenders.sp <- left_join(tenders, centres, by = c("centre" = "centre"))
## Warning: Column `centre` joining character vector and factor, coercing into
## character vector
Plotting a few different aspects of the data:
# convenience variables for theming
theme_noaxis = theme(axis.ticks.y=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.text.x=element_blank())
theme_nolabs = labs(x="", y="")
ggplot(tenders.sp, aes(x=lon, y=lat)) + geom_point() + geom_density2d() + coord_fixed() + labs(title="Density of hawker stall bids") + theme_noaxis + theme_nolabs
The highest density of bids is in the south, especially around the city area. This is surprising as we don’t normally associate the city with hawker centres, but more of shopping centres and restaurants. However, since the data is on bids rather than the number of stalls, it could be that hawker stalls in the city are more competitive.
ggplot(tenders.sp, aes(x=lon, y=lat)) + geom_point(alpha=0.3) + coord_fixed() + facet_grid(type ~ year(date)) + labs(title="Distribution of hawker stall bids by year and type", x="", y="") + theme_noaxis
Cooked food and lockup stalls seem to have become more widely distributed over the years.
ggplot(tenders.sp, aes(x=lon, y=lat)) + geom_point(alpha=0.3) + coord_fixed() + facet_wrap(~trade) + labs(title="Distribution of hawker stall bids by trade", x="", y="") + theme_noaxis
There does not seem to be any clear pattern of hawker stall locations based on trade.
The spatstat package allows us to analyse spatial data.
First, we need to convert the dataset to a format that spatstat can work with:
centres.sp <- tenders.sp %>%
filter(lat > 0) %>%
group_by(centre, lon, lat) %>%
summarise(price = mean(priceM2, na.rm = T)) %>%
replace_na(list(price = 0))
coordinates(centres.sp) <- c('lon', 'lat')
centres.ppp <- unmark(as.ppp(centres.sp))
Load in a Singapore map to use as the boundary:
sg <- readOGR(".", "sg-all")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/siyan/Documents/2018/T7 making maps/Lab9/lab9rproj", layer: "sg-all"
## with 1 features
## It has 13 fields
sg.window <- as.owin(sg)
centres.ppp <- centres.ppp[sg.window]
Now, we can plot some data!
plot(centres.ppp)
plot(Kest(centres.ppp))
plot(density(centres.ppp, 0.01))
contour(density(centres.ppp, 0.015))
The smoothing parameter adjusts the “fine-ness” of density calculations. When it is very small, individual hawker centres can be seen. When it is big, the density is aggregated over almost the whole of Singapore.
Does the clustering of hawker centres correspond to population densities?
pop <- as.im(readGDAL("sg-pop.tif"))
## sg-pop.tif has GDAL driver GTiff
## and has 37 rows and 58 columns
plot(rhohat(centres.ppp, pop))
plot(pop)
plot(centres.ppp, add=T)
The locations of hawker centres mostly follows the population density in the south and east of Singapore. Surprisingly, the two areas with highest population density (around Chua Chu Kang and Yio Chu Kang) are not represented in the tenders data. There are also hardly any centres in the north.
A point to note is that our initial data is on bids for hawker centre stalls. Hence, it could be that the hawker centres in these areas did not have any turnover in 2012-2017.