MATH 2349 Assignment 3

Required Packages

library(dplyr)
library(readr)
library(tidyr)
library(outliers)
library(Hmisc)
library(lubridate)

Executive Summary

There are two data sets used in this report. The first contains a wealth of information on properties sold within the last few years in Melbourne. This includes the address and suburb of each property, the selling price, the type of property (house, townhouse, unit), the size of the property, and the council it is located in, as well as several other variables. The second is a data set that has been drawn from the Australian Bureau of Statistics. This consists of SEIFA statistics, which index the socio-economic status of an area by evaluating a variety of factors. There are two particular variables from this data set that are of particular interest: the Index of Relative Socio-economic Advantage and Disadvantage (IRSAD), and the usual resident population. An IRSAD score has a mean of 1000, and a higher score means an area is generally more wealthy while a lower score means it is less wealthy. The goal of this analysis is to take every entry in the housing data set, and give it an IRSAD score and population by matching the common variable Local Government Area (LGA). This will create one data set that can be used for analysis in the future. However, there are many faults with the data, so preprocessing steps will need to take place.

Preprocessing involved applying tidy data principles to the ABS data, then splitting it into two separate data frames that could be merged into the housing data set. This was followed by converting variables into suitable data types and labeling factors. There were plenty of missing values in multiple columns, so they were identified and then filled using imputation. Finally, a new variable (price per square meter) was created, outliers were detected and dealt with, and a log transformation was applied to normalize the data.

Data (Importation, Data Source, and Variable Description)

HousingData<-read.csv("~/Downloads/Melbourne_housing_FULL 2.csv")
Seifa<-read.csv("~/Downloads/ABS_SEIFA2016_LGA-en.csv")
head(HousingData)

##     Suburb          Address Rooms Type   Price Method  SellerG    Date
## 1 Brighton   802 Hampton St     3    h 1550000     VB   Buxton  3/9/16
## 2 Brighton       476 New St     4    h      NA      S Marshall  3/9/16
## 3 Brighton    18 Rooding St     3    h 1635000      S   Buxton  3/9/16
## 4 Brighton     8 Ballara Ct     3    h      NA      S   Buxton 3/12/16
## 5 Brighton  148 Cochrane St     3    h      NA     SP Marshall 3/12/16
## 6 Brighton 152A Cochrane St     4    h 1830000      S   Hodges 3/12/16
##   Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt
## 1     11.2     3186        3        2   2      663           NA      1940
## 2     11.2     3186        3        3   2      683          254      1915
## 3     11.2     3186        3        1   2      366          156      1920
## 4     11.2     3186        3        2   4      688           NA        NA
## 5     11.2     3186        3        1   1      318          136      1910
## 6     11.2     3186        4        2   2      436           NA        NA
##   CouncilArea Lattitude Longtitude            Regionname Propertycount
## 1 Bayside (C)  -37.9159   145.0068 Southern Metropolitan         10579
## 2 Bayside (C)  -37.8931   144.9955 Southern Metropolitan         10579
## 3 Bayside (C)  -37.9038   145.0001 Southern Metropolitan         10579
## 4 Bayside (C)  -37.9030   144.9900 Southern Metropolitan         10579
## 5 Bayside (C)  -37.8931   145.0004 Southern Metropolitan         10579
## 6 Bayside (C)  -37.8927   145.0005 Southern Metropolitan         10579

head(Seifa)

##   LGA_2016 Local.Government.Areas...2016 SEIFAINDEXTYPE
## 1    10050                    Albury (C)            IEO
## 2    10050                    Albury (C)            IEO
## 3    10050                    Albury (C)            IEO
## 4    10050                    Albury (C)            IEO
## 5    10050                    Albury (C)            IEO
## 6    10050                    Albury (C)            IEO
##                          Index.Type SEIFA_MEASURE
## 1 Index of Education and Occupation         SCORE
## 2 Index of Education and Occupation          RWAR
## 3 Index of Education and Occupation          RWAD
## 4 Index of Education and Occupation          RWAP
## 5 Index of Education and Occupation          RWSR
## 6 Index of Education and Occupation          RWSD
##                                   Measure TIME Time Value Flag.Codes Flags
## 1                                   Score 2016 2016   961                 
## 2                   Rank within Australia 2016 2016   291                 
## 3          Rank within Australia - Decile 2016 2016     6                 
## 4      Rank within Australia - Percentile 2016 2016    54                 
## 5          Rank within State or Territory 2016 2016    69                 
## 6 Rank within State or Territory - Decile 2016 2016     6

There are two data sets that have been imported. The first is a Melbourne housing market data set, which has been sourced from Kaggle using the following link: https://www.kaggle.com/anthonypino/melbourne-housing-market. The second is a Census data set from the Australian Bureau of Statistics (ABS) containing Socio-Economic Indexes for Areas (SEIFA) information, found here: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/2033.0.55.0012016?OpenDocument.

There are 21 variables in the Melbourne housing data set: Suburb - Suburb; Address - Address; Rooms - Number of rooms; Price - Price in Australian Dollars; Method - Method of selling property; Type - Type of property; SellerG - Real Estate Agent; Date - Date sold; Distance - Distance from CBD in km; Postcode - Postcode; Regionname - General region (west, north); Propertycount - Number of properties in suburb; Bedroom2 - Number of bedrooms; Bathroom - Number of bathrooms; Car - Number of carspots; Landsize - Land size in meters; YearBuilt - Year the house was built; CouncilArea - Governing council for the area; Lattitude - Lattitude; Longitude - Longitude.

There are 11 variables in the SEIFA data set: LGA_2016 - Local Government Area code; Local Government Areas - 2016 - Local Government Area; SEIFAINDEXTYPE - SEIFA Index acronymn; Index Type - SEIFA Index; SEIFA_MEASURE - SEIFA measure acronym; Measure - SEIFA measure; TIME - Year; Time - Year; Value - Value; Flag Codes - Flag codes; Flags - Flags.

Tidy and Manipulate Data I

Seifa2<-spread(Seifa,key = Measure, value = Value)

head(Seifa2)

##   LGA_2016 Local.Government.Areas...2016 SEIFAINDEXTYPE
## 1    10050                    Albury (C)            IEO
## 2    10050                    Albury (C)            IEO
## 3    10050                    Albury (C)            IEO
## 4    10050                    Albury (C)            IEO
## 5    10050                    Albury (C)            IEO
## 6    10050                    Albury (C)            IEO
##                          Index.Type SEIFA_MEASURE TIME Time Flag.Codes
## 1 Index of Education and Occupation          MAXS 2016 2016           
## 2 Index of Education and Occupation          MINS 2016 2016           
## 3 Index of Education and Occupation          RWAD 2016 2016           
## 4 Index of Education and Occupation          RWAP 2016 2016           
## 5 Index of Education and Occupation          RWAR 2016 2016           
## 6 Index of Education and Occupation          RWSD 2016 2016           
##   Flags Maximum score for SA1s in area Minimum score for SA1s in area
## 1                                 1180                             NA
## 2                                   NA                            754
## 3                                   NA                             NA
## 4                                   NA                             NA
## 5                                   NA                             NA
## 6                                   NA                             NA
##   Rank within Australia Rank within Australia - Decile
## 1                    NA                             NA
## 2                    NA                             NA
## 3                    NA                              6
## 4                    NA                             NA
## 5                   291                             NA
## 6                    NA                             NA
##   Rank within Australia - Percentile Rank within State or Territory
## 1                                 NA                             NA
## 2                                 NA                             NA
## 3                                 NA                             NA
## 4                                 54                             NA
## 5                                 NA                             NA
## 6                                 NA                             NA
##   Rank within State or Territory - Decile
## 1                                      NA
## 2                                      NA
## 3                                      NA
## 4                                      NA
## 5                                      NA
## 6                                       6
##   Rank within State or Territory - Percentile Score
## 1                                          NA    NA
## 2                                          NA    NA
## 3                                          NA    NA
## 4                                          NA    NA
## 5                                          NA    NA
## 6                                          NA    NA
##   Usual resident population
## 1                        NA
## 2                        NA
## 3                        NA
## 4                        NA
## 5                        NA
## 6                        NA

The SEIFA data is in an untidy format, as multiple variables are stored in rows. The spread() function converts this from a long to wide format.

Seifa3<-Seifa2[!(Seifa2$SEIFAINDEXTYPE=="IEO"|
                   Seifa2$SEIFAINDEXTYPE=="IER"|Seifa2$SEIFAINDEXTYPE=="IRSD"),]

names(Seifa3)[17]<-"Percentile_Rank_State"

names(Seifa3)[19]<-"Usual_Resident_Population"

Observations are filtered out so that only “IRSAD” values are in the data set. Then, columns are renamed for preprocessing purposes.

Seifa4<-Seifa3 %>% 
  select(-(TIME:Percentile_Rank_State))

SeifaScore<-Seifa4[!is.na(Seifa4$Score),]

names(SeifaScore)[2]<-"LGA_Name"
  
SeifaScore2<-SeifaScore %>%
  select(LGA_Name,Score)

The data is subset to delete insignificant variables, and cases containing NA in the score column are deleted. A column is renamed for preprocessing purposes, and then the data is subset again.

head(SeifaScore2)

##                  LGA_Name Score
## 29             Albury (C)   956
## 69  Armidale Regional (A)   976
## 109           Ballina (A)   987
## 149         Balranald (A)   927
## 189 Bathurst Regional (A)   973
## 229       Bega Valley (A)   951

str(SeifaScore2)

## 'data.frame':    544 obs. of  2 variables:
##  $ LGA_Name: Factor w/ 545 levels "Adelaide (C)",..: 4 11 16 18 28 32 33 38 41 42 ...
##  $ Score   : int  956 976 987 927 973 951 954 935 993 954 ...

This produces a data frame containing the IRSAD score for each LGA.

SeifaPopulation<-Seifa4[!is.na(Seifa4$Usual_Resident_Population),]

names(SeifaPopulation)[2]<-"LGA_Name"

SeifaPopulation2<-SeifaPopulation %>%
  select(LGA_Name,Usual_Resident_Population)

The same process is done for resident population. Cases containing NA in the resident population column are deleted, the column is renamed, and the data is subset.

head(SeifaPopulation2)

##                  LGA_Name Usual_Resident_Population
## 30             Albury (C)                     51076
## 70  Armidale Regional (A)                     29449
## 110           Ballina (A)                     41790
## 150         Balranald (A)                      2287
## 190 Bathurst Regional (A)                     41300
## 230       Bega Valley (A)                     33253

str(SeifaPopulation2)

## 'data.frame':    544 obs. of  2 variables:
##  $ LGA_Name                 : Factor w/ 545 levels "Adelaide (C)",..: 4 11 16 18 28 32 33 38 41 42 ...
##  $ Usual_Resident_Population: int  51076 29449 41790 2287 41300 33253 12668 8462 336962 5955 ...

Now each LGA has a resident population value stored in a data frame.

Data Continued (Merge)

names(HousingData)[17]<-"LGA_Name"

HousingData$LGA_Name<-as.character(HousingData$LGA_Name)

SeifaScore2$LGA_Name<-as.character(SeifaScore2$LGA_Name)

SeifaPopulation2$LGA_Name<-as.character(SeifaPopulation2$LGA_Name)

Data_Combined<-HousingData %>%
  left_join(SeifaScore2, by="LGA_Name")

Data_Combined2<-Data_Combined %>%
  left_join(SeifaPopulation2, by="LGA_Name")

The Melbourne housing data set is joined with the IRSAD and population data by using the common variable “LGA_Name”. They are converted to characters from factors for ease of joining.

Data_Combined3<-Data_Combined2 %>%
  select(-(YearBuilt))

Data_Combined3<-Data_Combined3 %>%
  select(-(Lattitude:Longtitude))

Data_Combined3<-Data_Combined3 %>%
  select(-(Regionname:Propertycount))

Data_Combined3<-Data_Combined3 %>% 
  select(-(Bedroom2:Car)) 

Data_Combined3<- Data_Combined3[-c(32994,32995,32996), ]

Variables not pertinent to the analysis and rows containing many missing values are removed to create the final assignment data set.

head(Data_Combined3)

##     Suburb          Address Rooms Type   Price Method  SellerG    Date
## 1 Brighton   802 Hampton St     3    h 1550000     VB   Buxton  3/9/16
## 2 Brighton       476 New St     4    h      NA      S Marshall  3/9/16
## 3 Brighton    18 Rooding St     3    h 1635000      S   Buxton  3/9/16
## 4 Brighton     8 Ballara Ct     3    h      NA      S   Buxton 3/12/16
## 5 Brighton  148 Cochrane St     3    h      NA     SP Marshall 3/12/16
## 6 Brighton 152A Cochrane St     4    h 1830000      S   Hodges 3/12/16
##   Distance Postcode Landsize BuildingArea    LGA_Name Score
## 1     11.2     3186      663           NA Bayside (C)  1125
## 2     11.2     3186      683          254 Bayside (C)  1125
## 3     11.2     3186      366          156 Bayside (C)  1125
## 4     11.2     3186      688           NA Bayside (C)  1125
## 5     11.2     3186      318          136 Bayside (C)  1125
## 6     11.2     3186      436           NA Bayside (C)  1125
##   Usual_Resident_Population
## 1                     97087
## 2                     97087
## 3                     97087
## 4                     97087
## 5                     97087
## 6                     97087

This is the final assignment data set to be used.

Understand

dim(Data_Combined3)

## [1] 32993    15

options(max.print=100)
attributes(Data_Combined3)

## $names
##  [1] "Suburb"                    "Address"                  
##  [3] "Rooms"                     "Type"                     
##  [5] "Price"                     "Method"                   
##  [7] "SellerG"                   "Date"                     
##  [9] "Distance"                  "Postcode"                 
## [11] "Landsize"                  "BuildingArea"             
## [13] "LGA_Name"                  "Score"                    
## [15] "Usual_Resident_Population"
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
##  [ reached getOption("max.print") -- omitted 32893 entries ]
## 
## $class
## [1] "data.frame"

str(Data_Combined3)

## 'data.frame':    32993 obs. of  15 variables:
##  $ Suburb                   : Factor w/ 329 levels "Abbotsford","Aberfeldie",..: 43 43 43 43 43 43 43 43 43 43 ...
##  $ Address                  : Factor w/ 32179 levels "1 Abercrombie St",..: 30476 23353 9143 29641 6977 7544 16134 8495 2733 25993 ...
##  $ Rooms                    : int  3 4 3 3 3 4 3 4 5 4 ...
##  $ Type                     : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Price                    : int  1550000 NA 1635000 NA NA 1830000 1300000 3695000 NA NA ...
##  $ Method                   : Factor w/ 9 levels "PI","PN","S",..: 8 3 3 3 6 3 1 3 3 6 ...
##  $ SellerG                  : Factor w/ 380 levels "@Realty","A",..: 50 209 50 50 209 148 247 209 209 247 ...
##  $ Date                     : Factor w/ 78 levels "1/7/17","10/12/16",..: 60 60 60 56 56 56 56 56 56 56 ...
##  $ Distance                 : Factor w/ 213 levels "#N/A","0","0.7",..: 21 21 21 21 21 21 21 21 21 21 ...
##  $ Postcode                 : Factor w/ 204 levels "#N/A","3000",..: 146 146 146 146 146 146 146 146 146 146 ...
##  $ Landsize                 : int  663 683 366 688 318 436 NA 836 845 0 ...
##  $ BuildingArea             : num  NA 254 156 NA 136 NA NA 284 432 258 ...
##  $ LGA_Name                 : chr  "Bayside (C)" "Bayside (C)" "Bayside (C)" "Bayside (C)" ...
##  $ Score                    : int  1125 1125 1125 1125 1125 1125 1125 1125 1125 1125 ...
##  $ Usual_Resident_Population: int  97087 97087 97087 97087 97087 97087 97087 97087 97087 97087 ...

There are now 32,993 observations and 15 variables in the final assignment data. Every house sold now has the suburb it is located in, its address, the number of rooms, the type of property, its selling price, the selling method, the real estate agent, the date sold, the distance from the CBD, its postcode, size of in land its on (in square meters), size of the building (in square meters), its Local Government Area, the IRSAD score of the LGA, and the population of the LGA.

8 variables are factors (Suburb, Address, Type, Method, SellerG, Date, Distance, and Postcode), 1 is numeric (BuildingArea), 1 is a character (LGA_Name), and 5 are integer (Rooms, Price, Landsize, Score, and Usual_Resident_Population).

Data_Combined3$Date<-as.character(Data_Combined3$Date)

Data_Combined3$Date<-dmy(Data_Combined3$Date)

## Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'zone/tz/2018c.1.0/
## zoneinfo/Australia/Melbourne'

Data_Combined3$Date<-as.Date(Data_Combined3$Date)

str(Data_Combined3$Date)

##  Date[1:32993], format: "2016-09-03" "2016-09-03" "2016-09-03" "2016-12-03" ...

Data_Combined3$Suburb<-as.character(Data_Combined3$Suburb)

Data_Combined3$Address<-as.character(Data_Combined3$Address)

Data_Combined3$SellerG<-as.character(Data_Combined3$SellerG)

Data_Combined3$Distance<-as.numeric(Data_Combined3$Distance)

str(Data_Combined3)

## 'data.frame':    32993 obs. of  15 variables:
##  $ Suburb                   : chr  "Brighton" "Brighton" "Brighton" "Brighton" ...
##  $ Address                  : chr  "802 Hampton St" "476 New St" "18 Rooding St" "8 Ballara Ct" ...
##  $ Rooms                    : int  3 4 3 3 3 4 3 4 5 4 ...
##  $ Type                     : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Price                    : int  1550000 NA 1635000 NA NA 1830000 1300000 3695000 NA NA ...
##  $ Method                   : Factor w/ 9 levels "PI","PN","S",..: 8 3 3 3 6 3 1 3 3 6 ...
##  $ SellerG                  : chr  "Buxton" "Marshall" "Buxton" "Buxton" ...
##  $ Date                     : Date, format: "2016-09-03" "2016-09-03" ...
##  $ Distance                 : num  21 21 21 21 21 21 21 21 21 21 ...
##  $ Postcode                 : Factor w/ 204 levels "#N/A","3000",..: 146 146 146 146 146 146 146 146 146 146 ...
##  $ Landsize                 : int  663 683 366 688 318 436 NA 836 845 0 ...
##  $ BuildingArea             : num  NA 254 156 NA 136 NA NA 284 432 258 ...
##  $ LGA_Name                 : chr  "Bayside (C)" "Bayside (C)" "Bayside (C)" "Bayside (C)" ...
##  $ Score                    : int  1125 1125 1125 1125 1125 1125 1125 1125 1125 1125 ...
##  $ Usual_Resident_Population: int  97087 97087 97087 97087 97087 97087 97087 97087 97087 97087 ...

The “Date” variable is converted to a date format. “Suburb”, “Address”, and “SellerG” are converted to characters. Distance is converted to numeric.

levels(Data_Combined3$Type)

## [1] "h" "t" "u"

levels(Data_Combined3$Method)

## [1] "PI" "PN" "S"  "SA" "SN" "SP" "SS" "VB" "W"

Data_Combined3$Type<-factor(Data_Combined3$Type, levels =c("h","t","u"), 
                       labels = c("House", "Townhouse", "Unit"))

Data_Combined3$Method<-factor(Data_Combined3$Method, 
                              levels =c("PI","PN","S","SA","SN","SP","SS","VB","W"), 
                       labels = c("Passed In", "Sold Prior Not Disclosed","Sold",
                                  "Sold After Auction","Sold Not Disclosed",
                                  "Property Sold Prior","Sold After Auction Not Disclosed",
                                  "Vendor Bid","Withdrawn Prior"))

levels(Data_Combined3$Type)

## [1] "House"     "Townhouse" "Unit"

levels(Data_Combined3$Method)

## [1] "Passed In"                        "Sold Prior Not Disclosed"        
## [3] "Sold"                             "Sold After Auction"              
## [5] "Sold Not Disclosed"               "Property Sold Prior"             
## [7] "Sold After Auction Not Disclosed" "Vendor Bid"                      
## [9] "Withdrawn Prior"

Type is a factor with three levels, and is labled to signfiy the type of property. Method is a factor with 9 levels, and is labled to signify the method by which it was sold.

Scan I

colnames(Data_Combined3)[colSums(is.na(Data_Combined3)) > 0]

## [1] "Price"        "Landsize"     "BuildingArea"

sum(is.na(Data_Combined3$Price))

## [1] 7206

sum(is.na(Data_Combined3$Landsize))

## [1] 11087

sum(is.na(Data_Combined3$BuildingArea))

## [1] 19900

The columns containing NAs are identified. There are 7,206 NAs in “Price”, 11,087 in “Landsize”, and 19,900 in “BuildingArea”.

is.special <- function(x){
if (is.numeric(x)) !is.finite(x) else is.na(x)
}


is.special <- function(x){
if (is.numeric(x)) !is.finite(x)
}

sapply(Data_Combined3, is.special)

## $Suburb
## NULL
## 
## $Address
## NULL
## 
## $Rooms
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
##  [ reached getOption("max.print") -- omitted 32893 entries ]
## 
## $Type
## NULL
## 
## $Price
##   [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
##  [12]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
##  [23] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
##  [34] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
##  [45] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [67] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
##  [89] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [100]  TRUE
##  [ reached getOption("max.print") -- omitted 32893 entries ]
## 
## $Method
## NULL
## 
## $SellerG
## NULL
## 
## $Date
## NULL
## 
## $Distance
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
##  [ reached getOption("max.print") -- omitted 32893 entries ]
## 
## $Postcode
## NULL
## 
## $Landsize
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
##  [34] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [78] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [100] FALSE
##  [ reached getOption("max.print") -- omitted 32893 entries ]
## 
## $BuildingArea
##   [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
##  [23] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
##  [34] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
##  [45] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [56]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
##  [67]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
##  [78] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [89]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [100]  TRUE
##  [ reached getOption("max.print") -- omitted 32893 entries ]
## 
## $LGA_Name
## NULL
## 
## $Score
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
##  [ reached getOption("max.print") -- omitted 32893 entries ]
## 
## $Usual_Resident_Population
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
##  [ reached getOption("max.print") -- omitted 32893 entries ]

All variables in the data set are scanned for any special values. This returns the same columns as above.

is.na(Data_Combined3$Landsize) <- !Data_Combined3$Landsize

is.na(Data_Combined3$BuildingArea) <- !Data_Combined3$BuildingArea

Any 0 values in the “Landsize” or “BuildingArea” columns are converted to NAs because they are impossible value (a property and building can’t be 0 square meters).

Data_Combined4<-Data_Combined3 %>%
  group_by(Postcode) %>%
  mutate(Price=impute(Price, fun=mean))

Data_Combined5<-Data_Combined4 %>%
  group_by(Postcode) %>%
  mutate(Landsize=impute(Landsize, fun=mean))

Data_Combined6<-Data_Combined5 %>%
  group_by(Postcode) %>%
  mutate(BuildingArea=impute(BuildingArea, fun=mean))

To deal with the missing values, imputation was performed. For any missing value in “Price”, the average price of the properties in the same postcode was imputed. Similarly, for any observation missing “Landsize”, the average landsize for that postcode was imputed, and all missing values in “BuildingArea” were imputed using the average building size in that postcode.

Tidy and Manipulate Data II

Data_Combined7<-mutate(Data_Combined6,
       Price_Per_Sq_M = Price/BuildingArea)

colnames(Data_Combined7)

##  [1] "Suburb"                    "Address"                  
##  [3] "Rooms"                     "Type"                     
##  [5] "Price"                     "Method"                   
##  [7] "SellerG"                   "Date"                     
##  [9] "Distance"                  "Postcode"                 
## [11] "Landsize"                  "BuildingArea"             
## [13] "LGA_Name"                  "Score"                    
## [15] "Usual_Resident_Population" "Price_Per_Sq_M"

str(Data_Combined7$Price_Per_Sq_M)

##  num [1:32993] 7173 7812 10481 9182 14590 ...

The variables “Price” and “BuildingArea” are mutated to create the new variable “Price_Per_Sq_M”. This takes the price of the property and divides it by the size of of the building to produce the commonly used statistic price per square meter.

Scan II

Data_Combined8<-Data_Combined7[complete.cases(Data_Combined7),]

z.scores <- Data_Combined8$Rooms %>%  scores(type = "z")

z.scores %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.07500 -1.04900 -0.02383  0.00000  1.00200 13.31000

which( abs(z.scores) >3 )

##   [1]    30   143   201   637   698   719  1329  1331  1394  1401  1503
##  [12]  1784  1859  1978  2017  2027  2046  2091  2129  2140  2163  2247
##  [23]  2346  2453  2529  2575  2776  2966  3014  3073  3151  3183  3319
##  [34]  3335  3374  3453  3596  3737  3862  3879  3924  3954  4028  4088
##  [45]  4180  4291  4298  4333  4366  4455  4502  4523  4693  4710  4751
##  [56]  5015  5092  5093  5154  5173  5246  5353  5427  6031  6098  6115
##  [67]  6370  6502  6504  6629  6702  6743  6752  6778  6801  6877  6998
##  [78]  7027  7059  7083  7240  7300  7338  7607  7983  8569  8578  8973
##  [89]  9065 10129 10403 10514 10780 11147 11305 11480 11647 11785 11793
## [100] 11886
##  [ reached getOption("max.print") -- omitted 155 entries ]

Data_Combined8$Rooms[ which( abs(z.scores) >3 )] <- mean(Data_Combined8$Rooms, na.rm = TRUE)

To identify outliers of the “Rooms” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the rooms in all observations and imputed them in place of these outlier values.

z.scores2 <- Data_Combined8$Price %>%  scores(type = "z")

z.scores2 %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6380 -0.6868 -0.2373  0.0000  0.4149 16.5900

which( abs(z.scores2) >3 )

##   [1]    8   13   14   20   47   74   79   97  110  118  142  153  173  182
##  [15]  189  193  195  197  205  285  412  444  463  480  482  500  510  596
##  [29]  600  606  619  701  789  809  831  832  882  885  905  965  968 1006
##  [43] 1015 1068 1132 1253 1257 1258 1260 1263 1268 1318 1333 1381 1385 1394
##  [57] 1396 1407 1440 1469 1473 1512 1540 1586 1592 1632 1659 1660 1716 1741
##  [71] 1784 1810 1857 1859 1867 1886 1899 1904 1907 1910 1914 1946 1948 1962
##  [85] 1978 1986 1990 1996 1998 2005 2013 2019 2039 2062 2072 2099 2107 2128
##  [99] 2213 2250
##  [ reached getOption("max.print") -- omitted 395 entries ]

Data_Combined8$Price[ which( abs(z.scores2) >3 )] <- mean(Data_Combined8$Price, na.rm = TRUE)

To identify outliers of the “Price” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the price in all observations and imputed them in place of these outlier values.

z.scores3 <- Data_Combined8$Landsize %>%  scores(type = "z")

z.scores3 %>% summary()

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  -0.23720  -0.08833  -0.03697   0.00000   0.00277 152.20000

which( abs(z.scores3) >3 )

##  [1]   708  1557  2199  4213  5085  7059  8273 10323 14205 14301 14492
## [12] 14596 14730 16101 16104 16106 16108 16109 16110 16111 16113 16115
## [23] 16117 16119 16120 16121 16122 16123 16125 16126 16130 16131 16134
## [34] 16140 16515 16667 16883 17011 17035 17401 17498 18012 19558 19666
## [45] 19724 19739 19744 19795 19882 20151 20180 20392 20394 20468 20619
## [56] 21402 23328 23672 24021 24670 26149 26336 28027 31821 32242 32399
## [67] 32428 32503

Data_Combined8$Landsize[ which( abs(z.scores3) >3 )] <- mean(Data_Combined8$Landsize, na.rm = TRUE)

To identify outliers of the “Landsize” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the land size in all observations and imputed them in place of these outlier values.

z.scores4 <- Data_Combined8$Distance %>%  scores(type = "z")

z.scores4 %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4940 -0.9199 -0.2173  0.0000  1.0450  1.5320

which( abs(z.scores4) >3 )

## integer(0)

To identify outliers of the “Distance” variable, I found which observations had an absolute z score of greater than 3. There were no outliers present using this method.

z.scores5 <- Data_Combined8$Score %>%  scores(type = "z")

z.scores5 %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.4110 -0.5613  0.1435  0.0000  0.7249  1.3420

which( abs(z.scores5) >3 )

## integer(0)

Data_Combined8$Score[ which( abs(z.scores5) >3 )] <- mean(Data_Combined8$Score, na.rm = TRUE)

To identify outliers of the “Score” variable, I found which observations had an absolute z score of greater than 3. There were no outliers present using this method.

z.scores6 <- Data_Combined8$BuildingArea %>%  scores(type = "z")

z.scores6 %>% summary()

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  -0.54320  -0.13460  -0.04719   0.00000   0.08727 149.70000

which( abs(z.scores6) >3 )

##  [1]  1686  2296  2648 10097 11075 16113 16115 16125 16128 16132 16175
## [12] 21825 24970

Data_Combined8$BuildingArea[ which( abs(z.scores6) >3 )] <- mean(Data_Combined8$BuildingArea, na.rm = TRUE)

To identify outliers of the “BuildingArea” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the building area in all observations and imputed them in place of these outlier values.

z.scores7 <- Data_Combined8$Usual_Resident_Population %>%  scores(type = "z")

z.scores7 %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.8960 -1.0010  0.1277  0.0000  0.6675  4.1430

which( abs(z.scores7) >3 )

##   [1] 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086
##  [15] 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100
##  [29] 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114
##  [43] 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126 7127 7128
##  [57] 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140 7141 7142
##  [71] 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154 7155 7156
##  [85] 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168 7169 7170
##  [99] 7171 7172
##  [ reached getOption("max.print") -- omitted 69 entries ]

Data_Combined8$Usual_Resident_Population[ which( abs(z.scores7) >3 )] <- mean(Data_Combined8$Usual_Resident_Population, na.rm = TRUE)

To identify outliers of the “Usual_Resident_Population” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the population in all observations and imputed them in place of these outlier values.

z.scores8 <- Data_Combined8$Price_Per_Sq_M %>%  scores(type = "z")

z.scores8 %>% summary()

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  -0.01666  -0.01038  -0.00780   0.00000  -0.00484 181.40000

which( abs(z.scores8) >3 )

## [1]  4137 27653

Data_Combined8$Price_Per_Sq_M[ which( abs(z.scores8) >3 )] <- mean(Data_Combined8$Price_Per_Sq_M, na.rm = TRUE)

To identify outliers of the “Price_Per_Sq_M” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the price value in all observations and imputed them in place of these outlier values.

Transform

hist(Data_Combined$Price)

Data_Combined8$Price<- log10(Data_Combined8$Price)

hist(Data_Combined8$Price)

A histogram of the “Price” variable is produced. There is a clear right skew in the data, therefore a log transformation is applied. This transforms the data into a normal distribution, which is useful for further statistical analysis.

head(Data_Combined8)

## # A tibble: 6 x 16
## # Groups:   Postcode [1]
##     Suburb          Address Rooms   Type    Price              Method
##      <chr>            <chr> <dbl> <fctr>    <dbl>              <fctr>
## 1 Brighton   802 Hampton St     3  House 6.190332          Vendor Bid
## 2 Brighton       476 New St     4  House 6.297591                Sold
## 3 Brighton    18 Rooding St     3  House 6.213518                Sold
## 4 Brighton     8 Ballara Ct     3  House 6.297591                Sold
## 5 Brighton  148 Cochrane St     3  House 6.297591 Property Sold Prior
## 6 Brighton 152A Cochrane St     4  House 6.262451                Sold
## # ... with 10 more variables: SellerG <chr>, Date <date>, Distance <dbl>,
## #   Postcode <fctr>, Landsize <dbl>, BuildingArea <dbl>, LGA_Name <chr>,
## #   Score <dbl>, Usual_Resident_Population <dbl>, Price_Per_Sq_M <dbl>