Introduction

This dataset is about the selling price of real estate in Melbourne. We can obtain the dataset from https://www.kaggle.com/anthonypino/melbourne-housing-market . The type of real estate contain house, unit, and townhouse. The variables in the dataset can be seen below.

Variable Description:

  • Suburb: Suburb
  • Address: Address
  • Rooms: Number of rooms
  • Type: type of real estate including : br - bedroom(s); h - house,cottage,villa, semi,terrace; u - unit, duplex; t - townhouse;
    dev site - development site; o res - other residential.
  • Price: Price in Australian dollars
  • Method_of_Sales: method for selling real estate including: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.
  • Agent: Real Estate Agent:
  • Date_of_Sold : Date Sold
  • Postcode: Postcode
  • Region: General Region (West, North West, North, North east …etc)
  • Property_Count: Number of properties that exist in the suburb
  • Distance: Distance from Central Business Development (in km)
  • Council_Area: Governing council for the area

1. Input Data

house <- read.csv("MELBOURNE_HOUSE_PRICES_LESS.csv")

2. Data Inspection

head(house)
##         Suburb          Address Rooms Type   Price Method  SellerG      Date
## 1   Abbotsford    49 Lithgow St     3    h 1490000      S   Jellis 1/04/2017
## 2   Abbotsford    59A Turner St     3    h 1220000      S Marshall 1/04/2017
## 3   Abbotsford    119B Yarra St     3    h 1420000      S   Nelson 1/04/2017
## 4   Aberfeldie       68 Vida St     3    h 1515000      S    Barry 1/04/2017
## 5 Airport West 92 Clydesdale Rd     2    h  670000      S   Nelson 1/04/2017
## 6 Airport West     4/32 Earl St     2    t  530000      S   Jellis 1/04/2017
##   Postcode            Regionname Propertycount Distance
## 1     3067 Northern Metropolitan          4019      3.0
## 2     3067 Northern Metropolitan          4019      3.0
## 3     3067 Northern Metropolitan          4019      3.0
## 4     3040  Western Metropolitan          1543      7.5
## 5     3042  Western Metropolitan          3464     10.4
## 6     3042  Western Metropolitan          3464     10.4
##                  CouncilArea
## 1         Yarra City Council
## 2         Yarra City Council
## 3         Yarra City Council
## 4 Moonee Valley City Council
## 5 Moonee Valley City Council
## 6 Moonee Valley City Council
tail(house)
##                 Suburb        Address Rooms Type  Price Method SellerG
## 63018          Preston  229 Murray Rd     3    h 808000      S      RW
## 63019    Roxburgh Park      3 Carr Pl     3    h 566000      S   Raine
## 63020    Roxburgh Park    9 Parker Ct     3    h 500000      S   Raine
## 63021    Roxburgh Park 5 Parkinson Wy     3    h 545000      S   Raine
## 63022       Thomastown 3/1 Travers St     3    u     NA     PI   Barry
## 63023 Williams Landing    1 Diadem Wy     4    h     NA     SP  Aussie
##             Date Postcode            Regionname Propertycount Distance
## 63018 31/03/2018     3072 Northern Metropolitan         14577      8.4
## 63019 31/03/2018     3064 Northern Metropolitan          5833     20.6
## 63020 31/03/2018     3064 Northern Metropolitan          5833     20.6
## 63021 31/03/2018     3064 Northern Metropolitan          5833     20.6
## 63022 31/03/2018     3074 Northern Metropolitan          7955     15.3
## 63023 31/03/2018     3027  Western Metropolitan          1999     17.6
##                   CouncilArea
## 63018    Darebin City Council
## 63019       Hume City Council
## 63020       Hume City Council
## 63021       Hume City Council
## 63022 Whittlesea City Council
## 63023    Wyndham City Council
str(house)
## 'data.frame':    63023 obs. of  13 variables:
##  $ Suburb       : chr  "Abbotsford" "Abbotsford" "Abbotsford" "Aberfeldie" ...
##  $ Address      : chr  "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
##  $ Rooms        : int  3 3 3 3 2 2 2 3 6 3 ...
##  $ Type         : chr  "h" "h" "h" "h" ...
##  $ Price        : int  1490000 1220000 1420000 1515000 670000 530000 540000 715000 NA 1925000 ...
##  $ Method       : chr  "S" "S" "S" "S" ...
##  $ SellerG      : chr  "Jellis" "Marshall" "Nelson" "Barry" ...
##  $ Date         : chr  "1/04/2017" "1/04/2017" "1/04/2017" "1/04/2017" ...
##  $ Postcode     : int  3067 3067 3067 3040 3042 3042 3042 3042 3021 3206 ...
##  $ Regionname   : chr  "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Western Metropolitan" ...
##  $ Propertycount: int  4019 4019 4019 1543 3464 3464 3464 3464 1899 3280 ...
##  $ Distance     : num  3 3 3 7.5 10.4 10.4 10.4 10.4 14 3 ...
##  $ CouncilArea  : chr  "Yarra City Council" "Yarra City Council" "Yarra City Council" "Moonee Valley City Council" ...
dim(house)
## [1] 63023    13
names(house)
##  [1] "Suburb"        "Address"       "Rooms"         "Type"         
##  [5] "Price"         "Method"        "SellerG"       "Date"         
##  [9] "Postcode"      "Regionname"    "Propertycount" "Distance"     
## [13] "CouncilArea"
# Renaming all of column house to be more easily understand
names(house) <- c("Suburb","Address","Rooms","Type",
                  "Price","Method_of_Sales","Agent",
                  "Date_of_Sold","Postcode","Region",
                  "Property_Count","Distance", "Council_Area")

From our inspection we can conclude:

3. Data Wrangling

summary(house)
##     Suburb            Address              Rooms            Type          
##  Length:63023       Length:63023       Min.   : 1.000   Length:63023      
##  Class :character   Class :character   1st Qu.: 3.000   Class :character  
##  Mode  :character   Mode  :character   Median : 3.000   Mode  :character  
##                                        Mean   : 3.111                     
##                                        3rd Qu.: 4.000                     
##                                        Max.   :31.000                     
##                                                                           
##      Price          Method_of_Sales       Agent           Date_of_Sold      
##  Min.   :   85000   Length:63023       Length:63023       Length:63023      
##  1st Qu.:  620000   Class :character   Class :character   Class :character  
##  Median :  830000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  997898                                                           
##  3rd Qu.: 1220000                                                           
##  Max.   :11200000                                                           
##  NA's   :14590                                                              
##     Postcode       Region          Property_Count     Distance    
##  Min.   :3000   Length:63023       Min.   :   39   Min.   : 0.00  
##  1st Qu.:3056   Class :character   1st Qu.: 4380   1st Qu.: 7.00  
##  Median :3107   Mode  :character   Median : 6795   Median :11.40  
##  Mean   :3126                      Mean   : 7618   Mean   :12.68  
##  3rd Qu.:3163                      3rd Qu.:10412   3rd Qu.:16.70  
##  Max.   :3980                      Max.   :21650   Max.   :64.10  
##                                                                   
##  Council_Area      
##  Length:63023      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Check missing value

colSums(is.na(house))
##          Suburb         Address           Rooms            Type           Price 
##               0               0               0               0           14590 
## Method_of_Sales           Agent    Date_of_Sold        Postcode          Region 
##               0               0               0               0               0 
##  Property_Count        Distance    Council_Area 
##               0               0               0

There’re some missing value at column Price. So we need to replace that missing value

Replace missing value Price with 0

house$Price[is.na(house$Price)] <- 0

Re-checking missing value

colSums(is.na(house))
##          Suburb         Address           Rooms            Type           Price 
##               0               0               0               0               0 
## Method_of_Sales           Agent    Date_of_Sold        Postcode          Region 
##               0               0               0               0               0 
##  Property_Count        Distance    Council_Area 
##               0               0               0

Great!! We’re already done with this missing value. Next, we need to change the data type for each column

Change the data type

house$Rooms <- as.factor(house$Rooms)
house$Type <- as.factor(house$Type)
house$Method_of_Sales <- as.factor(house$Method_of_Sales)
house$Region <- as.factor(house$Region)
house$Date_of_Sold <- as.Date(house$Date_of_Sold, format = "%d/%m/%Y")

# renaming levels of factor

levels(house$Type) <- list(House = "h", Townhouse = "t", Unit = "u")

levels(house$Method_of_Sales) <- list(`property sold` = "S",
                                      `property sold prior` = "SP",
                                      `property passed in` = "PI",
                                      `sold prior not disclosed` = "PN",
                                      `sold not disclosed` = "SN",
                                      `no bid` = "NB",
                                      `vendor bid` = "VB",
                                      `withdrawn prior to auction` = "W",
                                      `sold after auction` = "SA",
                                      `sold after auction price not disclosed` = "SS")
str(house)
## 'data.frame':    63023 obs. of  13 variables:
##  $ Suburb         : chr  "Abbotsford" "Abbotsford" "Abbotsford" "Aberfeldie" ...
##  $ Address        : chr  "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
##  $ Rooms          : Factor w/ 14 levels "1","2","3","4",..: 3 3 3 3 2 2 2 3 6 3 ...
##  $ Type           : Factor w/ 3 levels "House","Townhouse",..: 1 1 1 1 1 2 3 1 1 1 ...
##  $ Price          : num  1490000 1220000 1420000 1515000 670000 ...
##  $ Method_of_Sales: Factor w/ 10 levels "property sold",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Agent          : chr  "Jellis" "Marshall" "Nelson" "Barry" ...
##  $ Date_of_Sold   : Date, format: "2017-04-01" "2017-04-01" ...
##  $ Postcode       : int  3067 3067 3067 3040 3042 3042 3042 3042 3021 3206 ...
##  $ Region         : Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 7 7 7 7 7 7 6 ...
##  $ Property_Count : int  4019 4019 4019 1543 3464 3464 3464 3464 1899 3280 ...
##  $ Distance       : num  3 3 3 7.5 10.4 10.4 10.4 10.4 14 3 ...
##  $ Council_Area   : chr  "Yarra City Council" "Yarra City Council" "Yarra City Council" "Moonee Valley City Council" ...
summary(house)
##     Suburb            Address              Rooms              Type      
##  Length:63023       Length:63023       3      :27950   House    :45053  
##  Class :character   Class :character   4      :15747   Townhouse: 6315  
##  Mode  :character   Mode  :character   2      :13248   Unit     :11655  
##                                        5      : 3434                    
##                                        1      : 2111                    
##                                        6      :  424                    
##                                        (Other):  109                    
##      Price                          Method_of_Sales     Agent          
##  Min.   :       0   property sold           :34063   Length:63023      
##  1st Qu.:  363000   property passed in      : 9790   Class :character  
##  Median :  697500   property sold prior     : 8916   Mode  :character  
##  Mean   :  766882   vendor bid              : 5956                     
##  3rd Qu.: 1065000   sold not disclosed      : 2674                     
##  Max.   :11200000   sold prior not disclosed:  651                     
##                     (Other)                 :  973                     
##   Date_of_Sold           Postcode                           Region     
##  Min.   :2016-01-28   Min.   :3000   Southern Metropolitan     :17559  
##  1st Qu.:2016-12-03   1st Qu.:3056   Northern Metropolitan     :16781  
##  Median :2017-09-03   Median :3107   Western Metropolitan      :11717  
##  Mean   :2017-08-04   Mean   :3126   Eastern Metropolitan      :10396  
##  3rd Qu.:2018-03-10   3rd Qu.:3163   South-Eastern Metropolitan: 5212  
##  Max.   :2018-10-13   Max.   :3980   Eastern Victoria          :  564  
##                                      (Other)                   :  794  
##  Property_Count     Distance     Council_Area      
##  Min.   :   39   Min.   : 0.00   Length:63023      
##  1st Qu.: 4380   1st Qu.: 7.00   Class :character  
##  Median : 6795   Median :11.40   Mode  :character  
##  Mean   : 7618   Mean   :12.68                     
##  3rd Qu.:10412   3rd Qu.:16.70                     
##  Max.   :21650   Max.   :64.10                     
## 

Summary:

  1. The first real estate sold occur at 2016-01-28

  2. Type of real estate contain house, unit, and townhouse. House is most for selling among the others

  3. The maximum price about 11200000 AUD

  4. The method of sales commonly used by the agent property is selling the real estate directly to consumer (S)

  5. The most property count is 21650

  6. The most closest distance to Central Business Development is about 0 km

4. EDA (Exploratory Data Analysis)

  1. Which type of real estate is most for selling based on the numbers of rooms?
table(house$Type, house$Rooms)
##            
##                 1     2     3     4     5     6     7     8     9    10    11
##   House       142  4503 22017 14479  3389   419    60    26     4     7     1
##   Townhouse    36  1529  3554  1152    41     0     1     1     1     0     0
##   Unit       1933  7216  2379   116     4     5     0     2     0     0     0
##            
##                12    16    31
##   House         4     1     1
##   Townhouse     0     0     0
##   Unit          0     0     0

Answer: house with 3 numbers of rooms are the most type of house for selling

  1. Which type of real estate has highest Price based on the numbers of rooms?
house_high <- aggregate(Price~Type+Rooms, house, FUN = max)

house_high[order(house_high$Price, decreasing = T),][1,]
##     Type Rooms    Price
## 10 House     4 11200000

Answer : House, Cottage, Villa, Semi, and Terrace with 4 numbers of rooms are type of real estate which is has highest price

  1. Which suburb region has the most property count?
sub_high <- aggregate(Property_Count~Suburb+Region, house, sum)

sub_high[order(sub_high$Property_Count, decreasing = T),][1,]
##        Suburb                Region Property_Count
## 149 Reservoir Northern Metropolitan       26867650

Answer: Northern Metropolitan has the most property count

  1. The highest of total price based suburb and type of real estate ?
total_price <- aggregate(Price~Suburb+Type, house, sum)

total_price[order(total_price$Price, decreasing = T),][1,]
##          Suburb  Type     Price
## 24 Balwyn North House 615319800

Answer: The highest of total price in Balwyn North with type of real estate are House, Cottage, Villa, Semi, and Terrace

  1. How much rate price of unit in 2018?
house$year = as.numeric(format(house$Date_of_Sold, "%Y"))

unit_18 <- aggregate(Price~Type+year, house, mean)

unit_18[(unit_18$Type == "u" & unit_18$year == 2018),]
## [1] Type  year  Price
## <0 rows> (or 0-length row.names)

Answer : Rate price of unit in 2018 is 495973.3 AUD

  1. Which of council area is closest to Central Business Development?
council <- aggregate(Distance~Council_Area, house, min)

council[order(council$Distance, decreasing = F),][1,]
##              Council_Area Distance
## 19 Melbourne City Council        0

Answer : Melbourne City Council is about closest to CDB

  1. Proportions type of real estate?
pie(xtabs(~Type, house))

Answer : Based on pie chart, we could know that type real estate with house is highest percentage among the others. It means type of real estate selling is dominated by house

  1. Total price every year?
year_total <- xtabs(Price~year, house)


options(scipen = 999)
barplot(year_total)

title(main = "Total Price Household Market per Year",
      ylab = "Total Price")

Answer : The total higher price is on 2017

  1. Rate price of townhouse every suburb
townhouse <- house[house$Type == "Townhouse",]

townhouse_2 <- aggregate(Price~Type+Suburb, townhouse, mean)


# Display Top 5 Rate price of Townhouse
townhouse_2[order(townhouse_2$Price, decreasing = T), ][1:5,]
##          Type      Suburb   Price
## 152 Townhouse     Kooyong 1565500
## 37  Townhouse    Brighton 1288916
## 241 Townhouse      Toorak 1283929
## 52  Townhouse  Canterbury 1230250
## 4   Townhouse Albert Park 1226000

Answer: the highest rate price for townhouse at Kooyoung. The rate price is 1565500 AUD

  1. Proportions of method sales?
barplot(prop.table(table(house$Method_of_Sales)),
        ylim = c(0,1))

title(xlab = "Method of Sales", 
      main = "Proportion Method of Sales Household Market",
      ylab = "Proportions")

Answer : Based on pie chart above, we could know that the method for selling is commonly used by selling the property directly

  1. Method of Sales every Type of real estate?
barplot(xtabs(~Type + Method_of_Sales, house),
        ylim = c(0,50000), ylab = "Total", xlab = "Method of Sales")

table(droplevels(house$Method_of_Sales), house$Type)
##                                         
##                                          House Townhouse  Unit
##   property sold                          24865      3212  5986
##   property sold prior                     6036       840  2040
##   property passed in                      6882      1159  1749
##   sold prior not disclosed                 446        85   120
##   sold not disclosed                      2149       192   333
##   vendor bid                              3976       738  1242
##   withdrawn prior to auction               335        43   106
##   sold after auction                       308        37    71
##   sold after auction price not disclosed    56         9     8

Answer: Based from our analysis, we could know that selling the house by directly to consumer is most used by the agent property in Melbourne

  1. Which property agent has the highest rate price for house?
mean_house <- house[(house$Type == "House"),]

mean_high <- aggregate(Price~Agent+Type, mean_house, mean)

mean_high[order(mean_high$Price, decreasing = T),][1,]
##     Agent  Type   Price
## 120   For House 3780000

Answer : The agent property who has highest rate price for house is Agent For

5. Conclusion

From our report we can conclude that house with 3 numbers of rooms are the most selling. Balwyn North has highest total of price for house selling. Suburb region has the most property count is Northern Metropolitan. Melbourne City Council is about the most closest to CDB. It’s about 0 km. The highest total price occur at 2017. The method of sales is commonly used by agent real estate is selling directly to consumer. Usually, this method is used for selling house. The property agent who has highest rate price for house is Agent For