Context
Melbourne real estate is BOOMING. Can you find the insight or predict the next big trend to become a real estate mogul… or even harder, to snap up a reasonably priced 2-bedroom unit?
Content
This is a snapshot of a dataset created by Tony Pino.
It was scraped from publicly available results posted every week from Domain.com.au. He cleaned it well, and now it’s up to you to make data analysis magic. The dataset includes Address, Type of Real estate, Suburb, Method of Selling, Rooms, Price, Real Estate Agent, Date of Sale and distance from C.B.D.
Notes on Specific Variables
Rooms: Number of rooms
Price: Price in dollars
Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.
Type: br - bedroom(s); h - house,cottage,villa, semi,terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.
SellerG: Real Estate Agent
Date: Date sold
Distance: Distance from CBD (in km)
Regionname: General Region (West, North West, North, North east …etc)
Propertycount: Number of properties that exist in the suburb.
Bedroom2 : Scraped # of Bedrooms (from different source)
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land Size
BuildingArea: Building Size
CouncilArea: Governing council for the area
Acknowledgements
This is intended as a static (unchanging) snapshot of https://www.kaggle.com/anthonypino/melbourne-housing-market. It was created in September 2017. Additionally, homes with no Price have been removed.
Use read.csv() to input data
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
options(scipen = 123)
<- read.csv("melb_data.csv")
melb_housing melb_housing
Check head & tail data of dataframe
head(melb_housing,10)
tail(melb_housing,10)
Note
There are ‘NA’ values in the “BuildingArea” and “YearBuilt” columns that need to be checked in more detail for the entire dataframe.
Dataframe Column’s Name
names(melb_housing)
## [1] "Suburb" "Address" "Rooms" "Type"
## [5] "Price" "Method" "SellerG" "Date"
## [9] "Distance" "Postcode" "Bedroom2" "Bathroom"
## [13] "Car" "Landsize" "BuildingArea" "YearBuilt"
## [17] "CouncilArea" "Lattitude" "Longtitude" "Regionname"
## [21] "Propertycount"
Dataframe Dimension
dim(melb_housing)
## [1] 13580 21
Note
Check Dataframe Structure
str(melb_housing)
## 'data.frame': 13580 obs. of 21 variables:
## $ Suburb : chr "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
## $ Address : chr "85 Turner St" "25 Bloomburg St" "5 Charles St" "40 Federation La" ...
## $ Rooms : int 2 2 3 3 4 2 3 2 1 2 ...
## $ Type : chr "h" "h" "h" "h" ...
## $ Price : num 1480000 1035000 1465000 850000 1600000 ...
## $ Method : chr "S" "S" "SP" "PI" ...
## $ SellerG : chr "Biggin" "Biggin" "Biggin" "Biggin" ...
## $ Date : chr "3/12/2016" "4/02/2016" "4/03/2017" "4/03/2017" ...
## $ Distance : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ Postcode : num 3067 3067 3067 3067 3067 ...
## $ Bedroom2 : num 2 2 3 3 3 2 4 2 1 3 ...
## $ Bathroom : num 1 1 2 2 1 1 2 1 1 1 ...
## $ Car : num 1 0 0 1 2 0 0 2 1 2 ...
## $ Landsize : num 202 156 134 94 120 181 245 256 0 220 ...
## $ BuildingArea : num NA 79 150 NA 142 NA 210 107 NA 75 ...
## $ YearBuilt : num NA 1900 1900 NA 2014 ...
## $ CouncilArea : chr "Yarra" "Yarra" "Yarra" "Yarra" ...
## $ Lattitude : num -37.8 -37.8 -37.8 -37.8 -37.8 ...
## $ Longtitude : num 145 145 145 145 145 ...
## $ Regionname : chr "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
## $ Propertycount: num 4019 4019 4019 4019 4019 ...
“Date” column (melb_housing$Date
) data type conversion from ‘char’ to ‘date’
$Date <- dmy(melb_housing$Date) melb_housing
Check the unique values for each column to get the column to convert to the ‘factor’ data type
<- rapply(melb_housing, # Counts of unique values
count_unique function(x) length(unique(x)))
# Print counts count_unique
## Suburb Address Rooms Type Price
## 314 13378 9 3 2204
## Method SellerG Date Distance Postcode
## 5 268 58 202 198
## Bedroom2 Bathroom Car Landsize BuildingArea
## 12 9 12 1448 603
## YearBuilt CouncilArea Lattitude Longtitude Regionname
## 145 34 6503 7063 8
## Propertycount
## 311
Note
Convert multiple “unique” data columns to factor data types for “Suburb”, “Type”, “Method”, “SellerG”, “Postcode”, “CouncilArea” dan “Regionname” columns
<- c("Suburb","Type","Method","SellerG","Postcode","CouncilArea","Regionname")
conv_factor <- lapply(melb_housing[,conv_factor], as.factor) melb_housing[,conv_factor]
Dataframe Summary
summary(melb_housing)
## Suburb Address Rooms Type
## Reservoir : 359 Length:13580 Min. : 1.000 h:9449
## Richmond : 260 Class :character 1st Qu.: 2.000 t:1114
## Bentleigh East: 249 Mode :character Median : 3.000 u:3017
## Preston : 239 Mean : 2.938
## Brunswick : 222 3rd Qu.: 3.000
## Essendon : 220 Max. :10.000
## (Other) :12031
## Price Method SellerG Date
## Min. : 85000 PI:1564 Nelson :1565 Min. :2016-01-28
## 1st Qu.: 650000 S :9022 Jellis :1316 1st Qu.:2016-09-10
## Median : 903000 SA: 92 hockingstuart:1167 Median :2017-03-04
## Mean :1075684 SP:1703 Barry :1011 Mean :2017-01-28
## 3rd Qu.:1330000 VB:1199 Ray : 701 3rd Qu.:2017-06-24
## Max. :9000000 Marshall : 659 Max. :2017-09-23
## (Other) :7161
## Distance Postcode Bedroom2 Bathroom
## Min. : 0.00 3073 : 359 Min. : 0.000 Min. :0.000
## 1st Qu.: 6.10 3020 : 306 1st Qu.: 2.000 1st Qu.:1.000
## Median : 9.20 3121 : 292 Median : 3.000 Median :1.000
## Mean :10.14 3040 : 290 Mean : 2.915 Mean :1.534
## 3rd Qu.:13.00 3046 : 284 3rd Qu.: 3.000 3rd Qu.:2.000
## Max. :48.10 3165 : 249 Max. :20.000 Max. :8.000
## (Other):11800
## Car Landsize BuildingArea YearBuilt
## Min. : 0.00 Min. : 0.0 Min. : 0 Min. :1196
## 1st Qu.: 1.00 1st Qu.: 177.0 1st Qu.: 93 1st Qu.:1940
## Median : 2.00 Median : 440.0 Median : 126 Median :1970
## Mean : 1.61 Mean : 558.4 Mean : 152 Mean :1965
## 3rd Qu.: 2.00 3rd Qu.: 651.0 3rd Qu.: 174 3rd Qu.:1999
## Max. :10.00 Max. :433014.0 Max. :44515 Max. :2018
## NA's :62 NA's :6450 NA's :5375
## CouncilArea Lattitude Longtitude
## :1369 Min. :-38.18 Min. :144.4
## Moreland :1163 1st Qu.:-37.86 1st Qu.:144.9
## Boroondara :1160 Median :-37.80 Median :145.0
## Moonee Valley: 997 Mean :-37.81 Mean :145.0
## Darebin : 934 3rd Qu.:-37.76 3rd Qu.:145.1
## Glen Eira : 848 Max. :-37.41 Max. :145.5
## (Other) :7109
## Regionname Propertycount
## Southern Metropolitan :4695 Min. : 249
## Northern Metropolitan :3890 1st Qu.: 4380
## Western Metropolitan :2948 Median : 6555
## Eastern Metropolitan :1471 Mean : 7454
## South-Eastern Metropolitan: 450 3rd Qu.:10331
## Eastern Victoria : 53 Max. :21650
## (Other) : 73
Note
There are several missing values in the column “Car”, “BuildingArea” dan “Yearbuilt”
round(colSums(is.na(melb_housing[,c("Car","BuildingArea","YearBuilt")]))/13580*100,2) #13580 adalah jumlah row data
## Car BuildingArea YearBuilt
## 0.46 47.50 39.58
Note
As much as 47% of data on building area (“Building Area”) and 39% of data on the year the property was built (“Year Built”) is missing. So, these two columns will not be used in a more detailed analysis.
Analysis of price data obtained from each sales method and type of property
sort(round(prop.table(table(melb_housing$Method)) * 100,2), decreasing = T)
##
## S SP PI VB SA
## 66.44 12.54 11.52 8.83 0.68
round(prop.table(table(melb_housing$Method, melb_housing$Type)) * 100,2)
##
## h t u
## PI 7.87 0.99 2.66
## S 47.92 5.32 13.20
## SA 0.49 0.05 0.14
## SP 7.95 1.05 3.54
## VB 5.36 0.79 2.68
Note
As much as 66% of property prices are obtained from sales using the “S” (Property Sold) method, which is dominated by “h” type houses (houses, cottages, villas, semis, terraces).
Reference Link: Symbol of Sales Method
The Correlation Between Numerical Columns
na.omit()
)<- na.omit(melb_housing)
melb_dropNA melb_dropNA
<- melb_dropNA[,c("Price","Distance","Rooms","Bedroom2","Bathroom","Car","Landsize","BuildingArea","YearBuilt")]
melb_num melb_num
ggcorr()
)ggcorr(melb_num, label = T, label_round = 2, hjust = 0.8, size = 5, col = "grey", layout.exp = 1)
Note
Reference Link: Correlation Strength Level
Top 5 Agent by Property Sales 2016
<- melb_dropNA[melb_dropNA$Date >= "2016-01-01" & melb_dropNA$Date <= "2016-12-31",]
melb_dropNA_16
<- aggregate(Price ~ SellerG + Method,
melb_sales_meth_16 data = melb_dropNA_16,
FUN = sum)
head(melb_sales_meth_16[order(melb_sales_meth_16$Price, decreasing = T),],5)
Top 5 Agent by Property Sales 2017
<- melb_dropNA[melb_dropNA$Date >= "2017-01-01" & melb_dropNA$Date <= "2017-12-31",]
melb_dropNA_17
<- aggregate(Price ~ SellerG + Method,
melb_sales_meth_17 data = melb_dropNA_17,
FUN = sum)
head(melb_sales_meth_17[order(melb_sales_meth_17$Price, decreasing = T),],5)
Nelson, Jellis and Marshall Sales per Quarter 2016 - 2017
$Quarter <- quarter(melb_dropNA$Date, type = "year.quarter")
melb_dropNA
$Quarter <- as.character(melb_dropNA$Quarter)
melb_dropNA
<- melb_dropNA[melb_dropNA$SellerG %in% c("Nelson","Jellis","Marshall"),]
melb_dropNA_Agent
<- aggregate(Price ~ Quarter + SellerG,
melb_sales_Agent data = melb_dropNA_Agent,
FUN = sum)
order(melb_sales_Agent$Price, decreasing = T),] melb_sales_Agent[
Note
Sales agents sold properties with the most value in Q4 2016 compared to other quarters in 2016 - 2017.
Other References
https://www.r-bloggers.com/2021/06/remove-rows-that-contain-all-na-or-certain-columns-in-r/