1) Overview

Data Source: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot

This project aims to visualise the trends in the property market of Melbourne, focusing on various variables in the dataset to obtain insights. We will be focusing on property sales in Melbourne for this project.

There are 3 visualisations used in this project:

A boxplot of property prices across different regions in Melbourne.
A geospatial visualisation that shows the frequency of property transactions in different areas of Melbourne.
A facet visualisation containing scatter plots and regression lines that shows the relationship between Price and other variables in the dataset.

2) Challenges identified and proposed solutions

The dataset has many missing/NA values for variables we are interested in. Missing/NA values affects our ability to draw insights from the data. As such, I will be removing records with missing/NA values for the variables we are interested in.
The dataset contains lots of information on properties in Melbourne, including those which were not sold. We will need to remove such records since we are focusing on property sales only.
For the facet visualisation, we need the dataframe to be in a long format rather than the wide format we currently have. I will be using the “gather” function from the tidyr package to transform the data.

3) Sketch of Proposed Design

4) Data Preparation

4.1) Load R Packages

library(plotly)
library(tidyverse)
library(stringr)
library(ggmap)
library(scales)
library(leaflet)

4.2) Read csv file and preview the data

df <- read.csv("melb_data.csv")
head(df,5)

4.3) Data Wrangling

Check for NA values in the dataset.

colSums(is.na(df))

##        Suburb       Address         Rooms          Type         Price 
##             0             0             0             0             0 
##        Method       SellerG          Date      Distance      Postcode 
##             0             0             0             0             0 
##      Bedroom2      Bathroom           Car      Landsize  BuildingArea 
##             0             0            62             0          6450 
##     YearBuilt   CouncilArea     Lattitude    Longtitude    Regionname 
##          5375             0             0             0             0 
## Propertycount 
##             0

Since the Age of the property is a variable we want to examine later on, we will remove rows where “YearBuilt” has an NA value.

df <- df[!is.na(df$YearBuilt),]


#Check NA values again
colSums(is.na(df))

##        Suburb       Address         Rooms          Type         Price 
##             0             0             0             0             0 
##        Method       SellerG          Date      Distance      Postcode 
##             0             0             0             0             0 
##      Bedroom2      Bathroom           Car      Landsize  BuildingArea 
##             0             0            31             0          1347 
##     YearBuilt   CouncilArea     Lattitude    Longtitude    Regionname 
##             0             0             0             0             0 
## Propertycount 
##             0

Compute the age of the property when it was sold.

df$Year_Sold <- str_sub(df$Date,-4,-1)

df$Property_Age_in_Years <- as.numeric(df$Year_Sold) - as.numeric(df$YearBuilt)

head(df,5)

We only want to focus on the properties that were sold.

According to the description of the dataset, the column “Method” has the following values:

S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

“Method” refers to the method of the property transaction.

Therefore, we will only keep rows in which the “Method” contains the character “S” since we are focusing on property sales.

df <- df[grepl("S", df$Method), ]

unique(df$Method)

## [1] "S"  "SP" "SA"

#Check NAs
colSums(is.na(df))

##                Suburb               Address                 Rooms 
##                     0                     0                     0 
##                  Type                 Price                Method 
##                     0                     0                     0 
##               SellerG                  Date              Distance 
##                     0                     0                     0 
##              Postcode              Bedroom2              Bathroom 
##                     0                     0                     0 
##                   Car              Landsize          BuildingArea 
##                    22                     0                  1034 
##             YearBuilt           CouncilArea             Lattitude 
##                     0                     0                     0 
##            Longtitude            Regionname         Propertycount 
##                     0                     0                     0 
##             Year_Sold Property_Age_in_Years 
##                     0                     0

Check the values of Council Area. As seen below, there are empty strings in the Council Area column.

unique(df$CouncilArea)

##  [1] "Yarra"             "Moonee Valley"     "Port Phillip"     
##  [4] "Darebin"           "Hobsons Bay"       "Stonnington"      
##  [7] "Boroondara"        "Monash"            "Glen Eira"        
## [10] "Whitehorse"        "Maribyrnong"       "Bayside"          
## [13] "Moreland"          "Manningham"        "Banyule"          
## [16] "Kingston"          "Brimbank"          "Melbourne"        
## [19] "Hume"              "Knox"              "Melton"           
## [22] "Maroondah"         "Greater Dandenong" "Nillumbik"        
## [25] "Whittlesea"        "Frankston"         "Macedon Ranges"   
## [28] "Yarra Ranges"      "Casey"             "Wyndham"          
## [31] "Cardinia"          ""

Remove rows which has empty Council Areas, since we might be interested in this variable in our analysis.

df <- df[!(df$CouncilArea)=="",]

Remove rows where BuildingArea is NA, since we are interested in this variable in our analysis later on.

df <- df[!is.na(df$BuildingArea),]

#Check NAs
colSums(is.na(df))

##                Suburb               Address                 Rooms 
##                     0                     0                     0 
##                  Type                 Price                Method 
##                     0                     0                     0 
##               SellerG                  Date              Distance 
##                     0                     0                     0 
##              Postcode              Bedroom2              Bathroom 
##                     0                     0                     0 
##                   Car              Landsize          BuildingArea 
##                     0                     0                     0 
##             YearBuilt           CouncilArea             Lattitude 
##                     0                     0                     0 
##            Longtitude            Regionname         Propertycount 
##                     0                     0                     0 
##             Year_Sold Property_Age_in_Years 
##                     0                     0

#Check Dimensions
dim(df)

## [1] 4896   23

Next, we create a dataframe that is in a long format to facilitate the creation of the facet visualisation later on.

#Rename Columns
df1 <- df %>% rename(No_of_Rooms = Rooms) %>% 
  rename(No_of_Bathrooms = Bathroom) %>% 
  rename(Distance_to_CBD_in_KM = Distance)  %>% 
  rename(No_of_Carspots = Car) %>% 
  rename(Building_Area_in_Sq_Metres = BuildingArea)

#Set df to long for facet wrap function

df_long <- df1 %>% 
  select(Price, No_of_Rooms, Property_Age_in_Years, Building_Area_in_Sq_Metres, Distance_to_CBD_in_KM) %>% 
  gather(Variable, value, No_of_Rooms:Distance_to_CBD_in_KM)

(df_long) %>% rmarkdown::paged_table()

Lastly, we create a dataframe which we will use to create the leaflet map.

values <- unique(df$CouncilArea)
#print(values)

#Assign each district a value
assign_value <- function(x) {
  if (x %in% values) {
    index <- match(x,values)
    return(index)
  }
  
  
}

df_map <- df

df_map$Group <- sapply(df$CouncilArea,assign_value)
(df_map) %>% rmarkdown::paged_table()

#Check dimensions of df_map

dim(df_map)

## [1] 4896   24

Now that we have prepared our dataframes, we can now proceed to create the visualisations.

5) Final Visualisations and Insights

5.1) Boxplot of prices across different regions

This boxplot shows the distributions of selling price across different regions in Melbourne.
From the boxplot, we can see that the variations in selling price differs across different regions in Melbourne. For instance, Western Victoria has the smallest variation in selling prices whereas Southern Metropolitan has the largest variation in selling prices, as shown by it’s wider interquartile range.
Also, Southern Metropolitan has the highest median selling price whereas Western Victoria has the lowest median selling price among the different regions in Melbourne.

5.2) Leaflet map displaying property transactions in Melbourne

This geospatial visualisation shows the frequencies of property transactions across different regions in Melbourne.
The user can click into the various circle markers to view the attributes of each individual property, such as the Price and Number of Rooms.
A checkbox filter is also available on the right of the map for the user to select specific Council Areas that they are interested in.
From the map, we can clearly see that property transactions are heavily concentrated in the central areas of Melbourne, especially the areas along the Port Phillip Bay.
We can also see that as we move further away from the bay area, the number of property transactions decreases.
This is likely to be a sign of urban agglomeration in Melbourne, resulting in people shifting towards the central, urban regions. As the urban areas becomes increasingly packed, it may lead to rising property prices in these central regions since demand is very high.

The Property Market of Melbourne

Ng Wei Xuan

3/26/2021