Data Source: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot
This project aims to visualise the trends in the property market of Melbourne, focusing on various variables in the dataset to obtain insights. We will be focusing on property sales in Melbourne for this project.
There are 3 visualisations used in this project:
library(plotly)
library(tidyverse)
library(stringr)
library(ggmap)
library(scales)
library(leaflet)
df <- read.csv("melb_data.csv")
head(df,5)
Check for NA values in the dataset.
colSums(is.na(df))
## Suburb Address Rooms Type Price
## 0 0 0 0 0
## Method SellerG Date Distance Postcode
## 0 0 0 0 0
## Bedroom2 Bathroom Car Landsize BuildingArea
## 0 0 62 0 6450
## YearBuilt CouncilArea Lattitude Longtitude Regionname
## 5375 0 0 0 0
## Propertycount
## 0
Since the Age of the property is a variable we want to examine later on, we will remove rows where “YearBuilt” has an NA value.
df <- df[!is.na(df$YearBuilt),]
#Check NA values again
colSums(is.na(df))
## Suburb Address Rooms Type Price
## 0 0 0 0 0
## Method SellerG Date Distance Postcode
## 0 0 0 0 0
## Bedroom2 Bathroom Car Landsize BuildingArea
## 0 0 31 0 1347
## YearBuilt CouncilArea Lattitude Longtitude Regionname
## 0 0 0 0 0
## Propertycount
## 0
Compute the age of the property when it was sold.
df$Year_Sold <- str_sub(df$Date,-4,-1)
df$Property_Age_in_Years <- as.numeric(df$Year_Sold) - as.numeric(df$YearBuilt)
head(df,5)
We only want to focus on the properties that were sold.
According to the description of the dataset, the column “Method” has the following values:
S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.
“Method” refers to the method of the property transaction.
Therefore, we will only keep rows in which the “Method” contains the character “S” since we are focusing on property sales.
df <- df[grepl("S", df$Method), ]
unique(df$Method)
## [1] "S" "SP" "SA"
#Check NAs
colSums(is.na(df))
## Suburb Address Rooms
## 0 0 0
## Type Price Method
## 0 0 0
## SellerG Date Distance
## 0 0 0
## Postcode Bedroom2 Bathroom
## 0 0 0
## Car Landsize BuildingArea
## 22 0 1034
## YearBuilt CouncilArea Lattitude
## 0 0 0
## Longtitude Regionname Propertycount
## 0 0 0
## Year_Sold Property_Age_in_Years
## 0 0
Check the values of Council Area. As seen below, there are empty strings in the Council Area column.
unique(df$CouncilArea)
## [1] "Yarra" "Moonee Valley" "Port Phillip"
## [4] "Darebin" "Hobsons Bay" "Stonnington"
## [7] "Boroondara" "Monash" "Glen Eira"
## [10] "Whitehorse" "Maribyrnong" "Bayside"
## [13] "Moreland" "Manningham" "Banyule"
## [16] "Kingston" "Brimbank" "Melbourne"
## [19] "Hume" "Knox" "Melton"
## [22] "Maroondah" "Greater Dandenong" "Nillumbik"
## [25] "Whittlesea" "Frankston" "Macedon Ranges"
## [28] "Yarra Ranges" "Casey" "Wyndham"
## [31] "Cardinia" ""
Remove rows which has empty Council Areas, since we might be interested in this variable in our analysis.
df <- df[!(df$CouncilArea)=="",]
Remove rows where BuildingArea is NA, since we are interested in this variable in our analysis later on.
df <- df[!is.na(df$BuildingArea),]
#Check NAs
colSums(is.na(df))
## Suburb Address Rooms
## 0 0 0
## Type Price Method
## 0 0 0
## SellerG Date Distance
## 0 0 0
## Postcode Bedroom2 Bathroom
## 0 0 0
## Car Landsize BuildingArea
## 0 0 0
## YearBuilt CouncilArea Lattitude
## 0 0 0
## Longtitude Regionname Propertycount
## 0 0 0
## Year_Sold Property_Age_in_Years
## 0 0
#Check Dimensions
dim(df)
## [1] 4896 23
Next, we create a dataframe that is in a long format to facilitate the creation of the facet visualisation later on.
#Rename Columns
df1 <- df %>% rename(No_of_Rooms = Rooms) %>%
rename(No_of_Bathrooms = Bathroom) %>%
rename(Distance_to_CBD_in_KM = Distance) %>%
rename(No_of_Carspots = Car) %>%
rename(Building_Area_in_Sq_Metres = BuildingArea)
#Set df to long for facet wrap function
df_long <- df1 %>%
select(Price, No_of_Rooms, Property_Age_in_Years, Building_Area_in_Sq_Metres, Distance_to_CBD_in_KM) %>%
gather(Variable, value, No_of_Rooms:Distance_to_CBD_in_KM)
(df_long) %>% rmarkdown::paged_table()
Lastly, we create a dataframe which we will use to create the leaflet map.
values <- unique(df$CouncilArea)
#print(values)
#Assign each district a value
assign_value <- function(x) {
if (x %in% values) {
index <- match(x,values)
return(index)
}
}
df_map <- df
df_map$Group <- sapply(df$CouncilArea,assign_value)
(df_map) %>% rmarkdown::paged_table()
#Check dimensions of df_map
dim(df_map)
## [1] 4896 24
Now that we have prepared our dataframes, we can now proceed to create the visualisations.