1 Explanation

Context

Melbourne real estate is BOOMING. Can you find the insight or predict the next big trend to become a real estate mogul… or even harder, to snap up a reasonably priced 2-bedroom unit?

Content

This is a snapshot of a dataset created by Tony Pino.

It was scraped from publicly available results posted every week from Domain.com.au. He cleaned it well, and now it’s up to you to make data analysis magic. The dataset includes Address, Type of Real estate, Suburb, Method of Selling, Rooms, Price, Real Estate Agent, Date of Sale and distance from C.B.D.

Notes on Specific Variables

  • Rooms: Number of rooms

  • Price: Price in dollars

  • Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

  • Type: br - bedroom(s); h - house,cottage,villa, semi,terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

  • SellerG: Real Estate Agent

  • Date: Date sold

  • Distance: Distance from CBD (in km)

  • Regionname: General Region (West, North West, North, North east …etc)

  • Propertycount: Number of properties that exist in the suburb.

  • Bedroom2 : Scraped # of Bedrooms (from different source)

  • Bathroom: Number of Bathrooms

  • Car: Number of carspots

  • Landsize: Land Size

  • BuildingArea: Building Size

  • CouncilArea: Governing council for the area

Acknowledgements

This is intended as a static (unchanging) snapshot of https://www.kaggle.com/anthonypino/melbourne-housing-market. It was created in September 2017. Additionally, homes with no Price have been removed.

Data Source

2 Input Data

Use read.csv() to input data

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
options(scipen = 123)

melb_housing <- read.csv("melb_data.csv")
melb_housing

2.1 Data Inspection

Check head & tail data of dataframe

head(melb_housing,10)
tail(melb_housing,10)

Note

There are ‘NA’ values in the “BuildingArea” and “YearBuilt” columns that need to be checked in more detail for the entire dataframe.

Dataframe Column’s Name

names(melb_housing)
##  [1] "Suburb"        "Address"       "Rooms"         "Type"         
##  [5] "Price"         "Method"        "SellerG"       "Date"         
##  [9] "Distance"      "Postcode"      "Bedroom2"      "Bathroom"     
## [13] "Car"           "Landsize"      "BuildingArea"  "YearBuilt"    
## [17] "CouncilArea"   "Lattitude"     "Longtitude"    "Regionname"   
## [21] "Propertycount"

Dataframe Dimension

dim(melb_housing)
## [1] 13580    21

Note

  • Number of Row = 13580
  • Number of Column = 21

2.2 Data Cleansing & Coertion

Check Dataframe Structure

str(melb_housing)
## 'data.frame':    13580 obs. of  21 variables:
##  $ Suburb       : chr  "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
##  $ Address      : chr  "85 Turner St" "25 Bloomburg St" "5 Charles St" "40 Federation La" ...
##  $ Rooms        : int  2 2 3 3 4 2 3 2 1 2 ...
##  $ Type         : chr  "h" "h" "h" "h" ...
##  $ Price        : num  1480000 1035000 1465000 850000 1600000 ...
##  $ Method       : chr  "S" "S" "SP" "PI" ...
##  $ SellerG      : chr  "Biggin" "Biggin" "Biggin" "Biggin" ...
##  $ Date         : chr  "3/12/2016" "4/02/2016" "4/03/2017" "4/03/2017" ...
##  $ Distance     : num  2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ Postcode     : num  3067 3067 3067 3067 3067 ...
##  $ Bedroom2     : num  2 2 3 3 3 2 4 2 1 3 ...
##  $ Bathroom     : num  1 1 2 2 1 1 2 1 1 1 ...
##  $ Car          : num  1 0 0 1 2 0 0 2 1 2 ...
##  $ Landsize     : num  202 156 134 94 120 181 245 256 0 220 ...
##  $ BuildingArea : num  NA 79 150 NA 142 NA 210 107 NA 75 ...
##  $ YearBuilt    : num  NA 1900 1900 NA 2014 ...
##  $ CouncilArea  : chr  "Yarra" "Yarra" "Yarra" "Yarra" ...
##  $ Lattitude    : num  -37.8 -37.8 -37.8 -37.8 -37.8 ...
##  $ Longtitude   : num  145 145 145 145 145 ...
##  $ Regionname   : chr  "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
##  $ Propertycount: num  4019 4019 4019 4019 4019 ...

“Date” column (melb_housing$Date) data type conversion from ‘char’ to ‘date’

melb_housing$Date <- dmy(melb_housing$Date)

Check the unique values for each column to get the column to convert to the ‘factor’ data type

count_unique <- rapply(melb_housing,    # Counts of unique values
                       function(x) length(unique(x)))
count_unique                           # Print counts
##        Suburb       Address         Rooms          Type         Price 
##           314         13378             9             3          2204 
##        Method       SellerG          Date      Distance      Postcode 
##             5           268            58           202           198 
##      Bedroom2      Bathroom           Car      Landsize  BuildingArea 
##            12             9            12          1448           603 
##     YearBuilt   CouncilArea     Lattitude    Longtitude    Regionname 
##           145            34          6503          7063             8 
## Propertycount 
##           311

Note

Convert multiple “unique” data columns to factor data types for “Suburb”, “Type”, “Method”, “SellerG”, “Postcode”, “CouncilArea” dan “Regionname” columns

conv_factor <- c("Suburb","Type","Method","SellerG","Postcode","CouncilArea","Regionname")
melb_housing[,conv_factor] <- lapply(melb_housing[,conv_factor], as.factor)

3 Data Exploration

Dataframe Summary

summary(melb_housing)
##             Suburb        Address              Rooms        Type    
##  Reservoir     :  359   Length:13580       Min.   : 1.000   h:9449  
##  Richmond      :  260   Class :character   1st Qu.: 2.000   t:1114  
##  Bentleigh East:  249   Mode  :character   Median : 3.000   u:3017  
##  Preston       :  239                      Mean   : 2.938           
##  Brunswick     :  222                      3rd Qu.: 3.000           
##  Essendon      :  220                      Max.   :10.000           
##  (Other)       :12031                                               
##      Price         Method             SellerG          Date           
##  Min.   :  85000   PI:1564   Nelson       :1565   Min.   :2016-01-28  
##  1st Qu.: 650000   S :9022   Jellis       :1316   1st Qu.:2016-09-10  
##  Median : 903000   SA:  92   hockingstuart:1167   Median :2017-03-04  
##  Mean   :1075684   SP:1703   Barry        :1011   Mean   :2017-01-28  
##  3rd Qu.:1330000   VB:1199   Ray          : 701   3rd Qu.:2017-06-24  
##  Max.   :9000000             Marshall     : 659   Max.   :2017-09-23  
##                              (Other)      :7161                       
##     Distance        Postcode        Bedroom2         Bathroom    
##  Min.   : 0.00   3073   :  359   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 6.10   3020   :  306   1st Qu.: 2.000   1st Qu.:1.000  
##  Median : 9.20   3121   :  292   Median : 3.000   Median :1.000  
##  Mean   :10.14   3040   :  290   Mean   : 2.915   Mean   :1.534  
##  3rd Qu.:13.00   3046   :  284   3rd Qu.: 3.000   3rd Qu.:2.000  
##  Max.   :48.10   3165   :  249   Max.   :20.000   Max.   :8.000  
##                  (Other):11800                                   
##       Car           Landsize         BuildingArea     YearBuilt   
##  Min.   : 0.00   Min.   :     0.0   Min.   :    0   Min.   :1196  
##  1st Qu.: 1.00   1st Qu.:   177.0   1st Qu.:   93   1st Qu.:1940  
##  Median : 2.00   Median :   440.0   Median :  126   Median :1970  
##  Mean   : 1.61   Mean   :   558.4   Mean   :  152   Mean   :1965  
##  3rd Qu.: 2.00   3rd Qu.:   651.0   3rd Qu.:  174   3rd Qu.:1999  
##  Max.   :10.00   Max.   :433014.0   Max.   :44515   Max.   :2018  
##  NA's   :62                         NA's   :6450    NA's   :5375  
##         CouncilArea     Lattitude        Longtitude   
##               :1369   Min.   :-38.18   Min.   :144.4  
##  Moreland     :1163   1st Qu.:-37.86   1st Qu.:144.9  
##  Boroondara   :1160   Median :-37.80   Median :145.0  
##  Moonee Valley: 997   Mean   :-37.81   Mean   :145.0  
##  Darebin      : 934   3rd Qu.:-37.76   3rd Qu.:145.1  
##  Glen Eira    : 848   Max.   :-37.41   Max.   :145.5  
##  (Other)      :7109                                   
##                       Regionname   Propertycount  
##  Southern Metropolitan     :4695   Min.   :  249  
##  Northern Metropolitan     :3890   1st Qu.: 4380  
##  Western Metropolitan      :2948   Median : 6555  
##  Eastern Metropolitan      :1471   Mean   : 7454  
##  South-Eastern Metropolitan: 450   3rd Qu.:10331  
##  Eastern Victoria          :  53   Max.   :21650  
##  (Other)                   :  73

Note

There are several missing values in the column “Car”, “BuildingArea” dan “Yearbuilt”

round(colSums(is.na(melb_housing[,c("Car","BuildingArea","YearBuilt")]))/13580*100,2) #13580 adalah jumlah row data
##          Car BuildingArea    YearBuilt 
##         0.46        47.50        39.58

Note

As much as 47% of data on building area (“Building Area”) and 39% of data on the year the property was built (“Year Built”) is missing. So, these two columns will not be used in a more detailed analysis.

Analysis of price data obtained from each sales method and type of property

sort(round(prop.table(table(melb_housing$Method)) * 100,2), decreasing = T)
## 
##     S    SP    PI    VB    SA 
## 66.44 12.54 11.52  8.83  0.68
round(prop.table(table(melb_housing$Method, melb_housing$Type)) * 100,2)
##     
##          h     t     u
##   PI  7.87  0.99  2.66
##   S  47.92  5.32 13.20
##   SA  0.49  0.05  0.14
##   SP  7.95  1.05  3.54
##   VB  5.36  0.79  2.68

Note

As much as 66% of property prices are obtained from sales using the “S” (Property Sold) method, which is dominated by “h” type houses (houses, cottages, villas, semis, terraces).

Reference Link: Symbol of Sales Method

The Correlation Between Numerical Columns

  1. Drop row with missing value using (na.omit())
melb_dropNA <- na.omit(melb_housing)               
melb_dropNA
  1. Creating new dataframe for numerical data columns only
melb_num <- melb_dropNA[,c("Price","Distance","Rooms","Bedroom2","Bathroom","Car","Landsize","BuildingArea","YearBuilt")]
melb_num
  1. Plot the correlation using (ggcorr())
ggcorr(melb_num, label = T, label_round = 2, hjust = 0.8, size = 5, col = "grey", layout.exp = 1)

Note

  • “Rooms”. Bathroom”, “Bedroom2” and “BuildingArea” columns are moderately correlated to property “Price”
  • “BuildingArea” moderately correlated to number of “Rooms”, “Bedroom2” and “Bathroom”
  • The year the property was built (“Yearbuilt”), “Landsize”, “Distance” to CBD have weak correaltion to property “Price”

Reference Link: ggcorr

Reference Link: Correlation Strength Level

Top 5 Agent by Property Sales 2016

melb_dropNA_16 <- melb_dropNA[melb_dropNA$Date >= "2016-01-01" & melb_dropNA$Date <= "2016-12-31",]

melb_sales_meth_16 <- aggregate(Price ~ SellerG + Method,
          data = melb_dropNA_16,
          FUN = sum)

head(melb_sales_meth_16[order(melb_sales_meth_16$Price, decreasing = T),],5)

Top 5 Agent by Property Sales 2017

melb_dropNA_17 <- melb_dropNA[melb_dropNA$Date >= "2017-01-01" & melb_dropNA$Date <= "2017-12-31",]

melb_sales_meth_17 <- aggregate(Price ~ SellerG + Method,
          data = melb_dropNA_17,
          FUN = sum)

head(melb_sales_meth_17[order(melb_sales_meth_17$Price, decreasing = T),],5)

Nelson, Jellis and Marshall Sales per Quarter 2016 - 2017

melb_dropNA$Quarter <- quarter(melb_dropNA$Date, type = "year.quarter")

melb_dropNA$Quarter <- as.character(melb_dropNA$Quarter)

melb_dropNA_Agent <- melb_dropNA[melb_dropNA$SellerG %in% c("Nelson","Jellis","Marshall"),]

melb_sales_Agent <- aggregate(Price ~ Quarter + SellerG,
          data = melb_dropNA_Agent,
          FUN = sum)

melb_sales_Agent[order(melb_sales_Agent$Price, decreasing = T),]

Note

Sales agents sold properties with the most value in Q4 2016 compared to other quarters in 2016 - 2017.

Other References

https://www.r-bloggers.com/2021/06/remove-rows-that-contain-all-na-or-certain-columns-in-r/

https://www.statology.org/r-aggregate-multiple-columns/