Analysing Mean Resale Price of HDB in 2019

Introduction

1) How does the reslae price of housing vary accross the different regions in Singapore?

2) Does the storey of the housing unit affect the resale price?

1. Major Design and Data Challenges with Solutions

1.1 Data Challenge

1.1.1 Missing Variable

In the dataset containing the mean resale prices, the month and year values are concatenated and stored under a single variable “month” (e.g 2019-01, 2019-02). For my visualization, a challenge I am facing is to filter the data only relevant to 2019. Hence, with the current raw data, I am unable to plot visualizations relevant to 2019 only.

1.1.2 High granularity of data

The current dataset has high granularity as each data is seperated by multiple factors such as the location, housing type, block number and storey range the house belongs to. Visualising all these factors may be an information overload for users, hence I have chosen two factors to visualise: location and storey range. A challenge I am facing is the grouping of data for the different visualisations that I want to create.

1.2 Design Challenge

1.2.1 High granularity of location data

1.2.2 Lack of geospatial data

Currently, the only location details available is the ‘town’ name (e.g Ang Mo Kio). These information is insufficient when we want to plot maps in R. While geospatial dataset of Singapore is widely available, when combining two different datasets, the mapping of location details is a challenge.

1.3 Solutions

1.3.1 To solve the challenge of missing “year” variable, data pre-processing needs to be done to seperate the year and month values and create new columns: Year and Month. This can be achieved by using the seperate() function in R, which turns a single column into multiple columns.

1.3.2 To solve the challenge of high granularity of data, sketching of the visualisation is required prior to coding. This way, I can identify the variables needed and how data can be grouped. The proposed solution sketch has been attached under 1.4. Based on the proposed sketch, the heatmap would require us to group the dataset by ‘town’ then ‘storey_range’ while the map requires us to group the dataset by ‘town’ alone. Variables such as block, month or street name will not be required for my visualisations.

1.3.4 To solve the challenge of missing geospatial data, research had to be done to identify a seperate geospatial dataset that contains a variable that can be mapped to ‘town’ variable in our current dataset. Once identified, the two common columns (‘town’ and ‘PLN_AREA_N’) can be used as the basis for combining the dataset. The left_join() function from dplyr package can be used. This function takes all values from the table we specify as the ‘left’ and match them to records from the table on the ‘right’. If there isn’t a match, it will return NULL for that row.

1.4 Proposed Sketch Design

For my proposed sketch, I explored interactive heatmap as an option for readers to quickly obtain a visual overview and identify the storeys with high mean resale price. As compared to other forms of visualisation such as bar graph, through my research I found heatmap to be the most effective to model the relationship between two variables (storey range and location) in terms of intensity (varying degree of price). Since how high or low the resale price is can be reflected using varying intensity of colors, I found heatmap to be the best to visualise this large amount of information, without overwhelming the viewers.

I also chose to visualise the relationship between region the HDB is located in and the mean resale price using maps. All maps are interactive and users should be able to click on particular parts of the map to view the specific location’s name and mean resale price. I also chose to use varying intensity of colours to reflect how high or low the resale price is. I plan to visualise by regions instead of Singapore as a whole so that users can focus on the area that is relevant to them. For example, if I am residing in the Central area I would be interested to see the range of price I could possibly sell my flat for or if I plan to buy a house in the North, I would prefer to obtain information only relevant to that region.

sketch of heatmap by location and storey sketch of map by region

1.5 Step by step guide

1.5.1 Download the necessary packages prior to starting our visualisation and data pre-processing.

packages = c('sf', 'tmap', 'tidyverse', 'plotly', 'highcharter', 'dplyr')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}
## Loading required package: sf
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## Loading required package: tmap
## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: highcharter
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use

1.5.2 Load all data, “resale” refers to the dataset containing our mean resale prices and “subzone” contains the geospatial information for the different locations in Singapore.

resale <- read.csv('resale price.csv', header = T)
subzone <- st_read("geographical data/MP14_SUBZONE_WEB_PL.shp")
## Reading layer `MP14_SUBZONE_WEB_PL' from data source `C:\Users\ASUS\Desktop\Y3S1\428 visual analytics\assignment 5\geographical data\MP14_SUBZONE_WEB_PL.shp' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS:  SVY21

1.5.3 Use head() function to view the first few rows of the dataset. This will help us understand the dataset and select the right features later.

head(resale) #we can see that the columns month contains year data - we want to extract this
##     month       town flat_type block       street_name storey_range
## 1 2017-01 ANG MO KIO    2 ROOM   406 ANG MO KIO AVE 10     10 TO 12
## 2 2017-01 ANG MO KIO    3 ROOM   108  ANG MO KIO AVE 4     01 TO 03
## 3 2017-01 ANG MO KIO    3 ROOM   602  ANG MO KIO AVE 5     01 TO 03
## 4 2017-01 ANG MO KIO    3 ROOM   465 ANG MO KIO AVE 10     04 TO 06
## 5 2017-01 ANG MO KIO    3 ROOM   601  ANG MO KIO AVE 5     01 TO 03
## 6 2017-01 ANG MO KIO    3 ROOM   150  ANG MO KIO AVE 5     01 TO 03
##   floor_area_sqm     flat_model lease_commence_date    remaining_lease
## 1             44       Improved                1979 61 years 04 months
## 2             67 New Generation                1978 60 years 07 months
## 3             67 New Generation                1980 62 years 05 months
## 4             68 New Generation                1980  62 years 01 month
## 5             67 New Generation                1980 62 years 05 months
## 6             68 New Generation                1981           63 years
##   resale_price
## 1       232000
## 2       250000
## 3       262000
## 4       265000
## 5       265000
## 6       275000

1.5.4 Pre-process dataset: we will seperate the month and year values in the column “month” using the seperate() function. We will then store these new data under two new columns: “Year” and “Month”. This is the new dataset we will be using.

resale_modified <- separate(resale, month, c("Year", "Month"))
head(resale_modified)
##   Year Month       town flat_type block       street_name storey_range
## 1 2017    01 ANG MO KIO    2 ROOM   406 ANG MO KIO AVE 10     10 TO 12
## 2 2017    01 ANG MO KIO    3 ROOM   108  ANG MO KIO AVE 4     01 TO 03
## 3 2017    01 ANG MO KIO    3 ROOM   602  ANG MO KIO AVE 5     01 TO 03
## 4 2017    01 ANG MO KIO    3 ROOM   465 ANG MO KIO AVE 10     04 TO 06
## 5 2017    01 ANG MO KIO    3 ROOM   601  ANG MO KIO AVE 5     01 TO 03
## 6 2017    01 ANG MO KIO    3 ROOM   150  ANG MO KIO AVE 5     01 TO 03
##   floor_area_sqm     flat_model lease_commence_date    remaining_lease
## 1             44       Improved                1979 61 years 04 months
## 2             67 New Generation                1978 60 years 07 months
## 3             67 New Generation                1980 62 years 05 months
## 4             68 New Generation                1980  62 years 01 month
## 5             67 New Generation                1980 62 years 05 months
## 6             68 New Generation                1981           63 years
##   resale_price
## 1       232000
## 2       250000
## 3       262000
## 4       265000
## 5       265000
## 6       275000

1.5.6 Now we will prepare the dataset for plotting of heatmap. We group the dataset by “town” and “Storey_range” using the group_by() function and use summarise() function to compute the mean resale price for each of our new rows. This new dataset will be stored in the variable plot_by_storeys.

plot_by_storeys<- resale19 %>%
  group_by(town, storey_range) %>% 
  summarise(mean_resale_price = mean(resale_price)) 
## `summarise()` regrouping output by 'town' (override with `.groups` argument)

1.5.7 We will plot an interactive heatmap using the following code. We want to produce a heatmap that allows viwers to click on each tile in the heatmap to display the town and mean resale price. Thus, we will use the mutate() function to create the text we want to display. Then we will plot the heatmap using the ggplot package and turn it to interactive using the plotly package.

plot_by_storeys <- plot_by_storeys %>%
  mutate(text = paste0("Location: ", town, "\n", "Mean resale Price:", round(mean_resale_price,2)))

heatmap_by_storeys <- ggplot(data = plot_by_storeys, 
                  mapping = aes(x = town, y = storey_range, fill = mean_resale_price, text=text)) +
            geom_tile() +
  labs(title = "Heatmap of Average HDB Resale Price by Area and Housing Storey (2019)", x = "Town", y = "Storey") +
  scale_fill_gradient(name = "Mean Resale Price",
                      low = "#ffedde",
                      high = "#c24b0c")+
  theme(axis.text.x = element_text(angle = 45))

ggplotly(heatmap_by_storeys, tooltip="text")

1.5.8 Next, we will prepare the dataset to plot the maps. We group the filtered dataset by town and calculate the mean price for each row.

mean_prices <- resale19 %>%
  group_by(town) %>% 
  summarise(mean_resale_price_2 = mean(resale_price))
## `summarise()` ungrouping output (override with `.groups` argument)

1.5.9 Next we use the left_join() function to combine the dataset containing our mean resale price by town and dataset containing the geospatial information. We will set the tmap_mode to view so that all our maps are interactive.

subzone_resale2019 <- left_join(subzone, mean_prices, by = c("PLN_AREA_N" = "town"))
tmap_mode("view")
## tmap mode set to interactive viewing

1.5.10 Use the head()) function to view how the new dataset looks like.

head(subzone_resale2019)
## Simple feature collection with 6 features and 16 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 24468.89 ymin: 28369.47 xmax: 32362.39 ymax: 30542.74
## projected CRS:  SVY21
##   OBJECTID SUBZONE_NO      SUBZONE_N SUBZONE_C CA_IND      PLN_AREA_N
## 1        1          1   MARINA SOUTH    MSSZ01      Y    MARINA SOUTH
## 2        2          1   PEARL'S HILL    OTSZ01      Y          OUTRAM
## 3        3          3      BOAT QUAY    SRSZ03      Y SINGAPORE RIVER
## 4        4          8 HENDERSON HILL    BMSZ08      N     BUKIT MERAH
## 5        5          3        REDHILL    BMSZ03      N     BUKIT MERAH
## 6        6          7 ALEXANDRA HILL    BMSZ07      N     BUKIT MERAH
##   PLN_AREA_C       REGION_N REGION_C          INC_CRC FMEL_UPD_D   X_ADDR
## 1         MS CENTRAL REGION       CR 5ED7EB253F99252E 2014-12-05 31595.84
## 2         OT CENTRAL REGION       CR 8C7149B9EB32EEFC 2014-12-05 28679.06
## 3         SR CENTRAL REGION       CR C35FEFF02B13E0E5 2014-12-05 29654.96
## 4         BM CENTRAL REGION       CR 3775D82C5DDBEFBD 2014-12-05 26782.83
## 5         BM CENTRAL REGION       CR 85D9ABEF0A40678F 2014-12-05 26201.96
## 6         BM CENTRAL REGION       CR 9D286521EF5E3B59 2014-12-05 25358.82
##     Y_ADDR SHAPE_Leng SHAPE_Area mean_resale_price_2
## 1 29220.19   5267.381  1630379.3                  NA
## 2 29782.05   3506.107   559816.2                  NA
## 3 29974.66   1740.926   160807.5                  NA
## 4 29933.77   3313.625   595428.9            559507.2
## 5 30005.70   2825.594   387429.4            559507.2
## 6 29991.38   4428.913  1030378.8            559507.2
##                         geometry
## 1 MULTIPOLYGON (((31495.56 30...
## 2 MULTIPOLYGON (((29092.28 30...
## 3 MULTIPOLYGON (((29932.33 29...
## 4 MULTIPOLYGON (((27131.28 30...
## 5 MULTIPOLYGON (((26451.03 30...
## 6 MULTIPOLYGON (((25899.7 297...

1.5.11 Plot the interactive map of mean resale price for the whole of Singapore. Although this was not part of my proposed sketch, while plotting the various maps, I found that having a map reflecting pricing across the whole of Singapore will be easier for comparison across the different regions. This interactive map allows viewers to hover over areas to view the town’s name or click on them to view the exact mean resale price. The map is made interactive by calling the function tm_polygons to display the pop-ups upon clicking.

complete_map <- tm_shape(subzone_resale2019)+
  tm_fill("mean_resale_price_2",
          n = 6,
          style = "quantile", 
          palette = "Oranges",
          id = 'SUBZONE_N',
          title="Mean HDB Resale Price (2019)") +
  tm_borders(alpha = 0.5)+
  tm_polygons(popup.vars=c(
                    "Mean Resale Price: " = "mean_resale_price_2"))

1.5.12 We will also plot the map seperated by various regions in Singapore. Similarly, these maps are interactive and displays the mean resale price for each location upon clicking on them.

tmap_options(limits=c(facets.plot=32, facets.view=5))

map_by_regions <- tm_shape(subzone_resale2019)+
  tm_fill("mean_resale_price_2",
          n = 6,
          style = "quantile", 
          palette = "Oranges",
          id = 'SUBZONE_N',
          title=c("Central Region", "East Region", "North-East Region", "North Region", "West Region")) +
  tm_borders(alpha = 0.5)+
  tm_facets(by="REGION_N", free.coords=TRUE)+
  tm_layout(legend.show = FALSE) +
  tm_borders(alpha = 0.5)+
  tm_polygons(popup.vars=c(
                    "Mean Resale Price: " = "mean_resale_price_2"))

1.6 Final Visualization

1.6.1 Interactive Heatmap showing mean resale price by area and housing storey in 2019.

Key Insight 1 : HDB flats located at higher storeys have a higher mean HDB resale price.

Based on the heatmap, for each location, as the storey range increases, the heatmap displays a more intense colour, signifying higher mean resale price. An exception is Bukit Timah which has a storey range much lower than other towns. Although, there is a similar trend where houses in storeys “13 to 15” had the highest mean resale prices, interestingly, houses in storeys “01 to 03” were the next highly priced in Bukit Timah. This can be explained by looking at the common housing types in that town. Bukit Timah is known for its highly-priced landed properties which are common 1 to 3 storeys high, explaining the exception in trend.

ggplotly(heatmap_by_storeys, tooltip="text")

1.6.2 Interactive Map of Singapore displaying different towns by mean resale price in 2019

Key Insight 2: Towns of similar mean resale prices are generally clustered together around Singapore.

Based on the map, we can see that clusters of the same colours can be identified. This is interesting because prior to plotting this map, when analysing the dataset by itself, this trend could not be discovered. To further explore this trend, we will be taking a look at the maps seperated by regions.

complete_map
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

1.6.3 Interactive Maps of Singapore seperated by different regions, displaying the mean HDB resale price for each town in that region.

Key Insight 3: Houses in Cetral Region has the highest mean resale price.

From the maps of the various region, we can see that houses in the Central region are the most expensive while houses in the North region are the cheapeast, in terms of mean resale prices. This could be due to the range of ammenities available for houses in the Central region, such as shopping malls, MRT stations and schools. Having such convinience will push the resale price of housing in the Central area up.

map_by_regions
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).