library(dplyr)
library(readr)
library(tidyr)
library(outliers)
library(Hmisc)
library(lubridate)
There are two data sets used in this report. The first contains a wealth of information on properties sold within the last few years in Melbourne. This includes the address and suburb of each property, the selling price, the type of property (house, townhouse, unit), the size of the property, and the council it is located in, as well as several other variables. The second is a data set that has been drawn from the Australian Bureau of Statistics. This consists of SEIFA statistics, which index the socio-economic status of an area by evaluating a variety of factors. There are two particular variables from this data set that are of particular interest: the Index of Relative Socio-economic Advantage and Disadvantage (IRSAD), and the usual resident population. An IRSAD score has a mean of 1000, and a higher score means an area is generally more wealthy while a lower score means it is less wealthy. The goal of this analysis is to take every entry in the housing data set, and give it an IRSAD score and population by matching the common variable Local Government Area (LGA). This will create one data set that can be used for analysis in the future. However, there are many faults with the data, so preprocessing steps will need to take place.
Preprocessing involved applying tidy data principles to the ABS data, then splitting it into two separate data frames that could be merged into the housing data set. This was followed by converting variables into suitable data types and labeling factors. There were plenty of missing values in multiple columns, so they were identified and then filled using imputation. Finally, a new variable (price per square meter) was created, outliers were detected and dealt with, and a log transformation was applied to normalize the data.
HousingData<-read.csv("~/Downloads/Melbourne_housing_FULL 2.csv")
Seifa<-read.csv("~/Downloads/ABS_SEIFA2016_LGA-en.csv")
head(HousingData)
## Suburb Address Rooms Type Price Method SellerG Date
## 1 Brighton 802 Hampton St 3 h 1550000 VB Buxton 3/9/16
## 2 Brighton 476 New St 4 h NA S Marshall 3/9/16
## 3 Brighton 18 Rooding St 3 h 1635000 S Buxton 3/9/16
## 4 Brighton 8 Ballara Ct 3 h NA S Buxton 3/12/16
## 5 Brighton 148 Cochrane St 3 h NA SP Marshall 3/12/16
## 6 Brighton 152A Cochrane St 4 h 1830000 S Hodges 3/12/16
## Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt
## 1 11.2 3186 3 2 2 663 NA 1940
## 2 11.2 3186 3 3 2 683 254 1915
## 3 11.2 3186 3 1 2 366 156 1920
## 4 11.2 3186 3 2 4 688 NA NA
## 5 11.2 3186 3 1 1 318 136 1910
## 6 11.2 3186 4 2 2 436 NA NA
## CouncilArea Lattitude Longtitude Regionname Propertycount
## 1 Bayside (C) -37.9159 145.0068 Southern Metropolitan 10579
## 2 Bayside (C) -37.8931 144.9955 Southern Metropolitan 10579
## 3 Bayside (C) -37.9038 145.0001 Southern Metropolitan 10579
## 4 Bayside (C) -37.9030 144.9900 Southern Metropolitan 10579
## 5 Bayside (C) -37.8931 145.0004 Southern Metropolitan 10579
## 6 Bayside (C) -37.8927 145.0005 Southern Metropolitan 10579
head(Seifa)
## LGA_2016 Local.Government.Areas...2016 SEIFAINDEXTYPE
## 1 10050 Albury (C) IEO
## 2 10050 Albury (C) IEO
## 3 10050 Albury (C) IEO
## 4 10050 Albury (C) IEO
## 5 10050 Albury (C) IEO
## 6 10050 Albury (C) IEO
## Index.Type SEIFA_MEASURE
## 1 Index of Education and Occupation SCORE
## 2 Index of Education and Occupation RWAR
## 3 Index of Education and Occupation RWAD
## 4 Index of Education and Occupation RWAP
## 5 Index of Education and Occupation RWSR
## 6 Index of Education and Occupation RWSD
## Measure TIME Time Value Flag.Codes Flags
## 1 Score 2016 2016 961
## 2 Rank within Australia 2016 2016 291
## 3 Rank within Australia - Decile 2016 2016 6
## 4 Rank within Australia - Percentile 2016 2016 54
## 5 Rank within State or Territory 2016 2016 69
## 6 Rank within State or Territory - Decile 2016 2016 6
There are two data sets that have been imported. The first is a Melbourne housing market data set, which has been sourced from Kaggle using the following link: https://www.kaggle.com/anthonypino/melbourne-housing-market. The second is a Census data set from the Australian Bureau of Statistics (ABS) containing Socio-Economic Indexes for Areas (SEIFA) information, found here: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/2033.0.55.0012016?OpenDocument.
There are 21 variables in the Melbourne housing data set: Suburb - Suburb; Address - Address; Rooms - Number of rooms; Price - Price in Australian Dollars; Method - Method of selling property; Type - Type of property; SellerG - Real Estate Agent; Date - Date sold; Distance - Distance from CBD in km; Postcode - Postcode; Regionname - General region (west, north); Propertycount - Number of properties in suburb; Bedroom2 - Number of bedrooms; Bathroom - Number of bathrooms; Car - Number of carspots; Landsize - Land size in meters; YearBuilt - Year the house was built; CouncilArea - Governing council for the area; Lattitude - Lattitude; Longitude - Longitude.
There are 11 variables in the SEIFA data set: LGA_2016 - Local Government Area code; Local Government Areas - 2016 - Local Government Area; SEIFAINDEXTYPE - SEIFA Index acronymn; Index Type - SEIFA Index; SEIFA_MEASURE - SEIFA measure acronym; Measure - SEIFA measure; TIME - Year; Time - Year; Value - Value; Flag Codes - Flag codes; Flags - Flags.
Seifa2<-spread(Seifa,key = Measure, value = Value)
head(Seifa2)
## LGA_2016 Local.Government.Areas...2016 SEIFAINDEXTYPE
## 1 10050 Albury (C) IEO
## 2 10050 Albury (C) IEO
## 3 10050 Albury (C) IEO
## 4 10050 Albury (C) IEO
## 5 10050 Albury (C) IEO
## 6 10050 Albury (C) IEO
## Index.Type SEIFA_MEASURE TIME Time Flag.Codes
## 1 Index of Education and Occupation MAXS 2016 2016
## 2 Index of Education and Occupation MINS 2016 2016
## 3 Index of Education and Occupation RWAD 2016 2016
## 4 Index of Education and Occupation RWAP 2016 2016
## 5 Index of Education and Occupation RWAR 2016 2016
## 6 Index of Education and Occupation RWSD 2016 2016
## Flags Maximum score for SA1s in area Minimum score for SA1s in area
## 1 1180 NA
## 2 NA 754
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Rank within Australia Rank within Australia - Decile
## 1 NA NA
## 2 NA NA
## 3 NA 6
## 4 NA NA
## 5 291 NA
## 6 NA NA
## Rank within Australia - Percentile Rank within State or Territory
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 54 NA
## 5 NA NA
## 6 NA NA
## Rank within State or Territory - Decile
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 6
## Rank within State or Territory - Percentile Score
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Usual resident population
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
The SEIFA data is in an untidy format, as multiple variables are stored in rows. The spread() function converts this from a long to wide format.
Seifa3<-Seifa2[!(Seifa2$SEIFAINDEXTYPE=="IEO"|
Seifa2$SEIFAINDEXTYPE=="IER"|Seifa2$SEIFAINDEXTYPE=="IRSD"),]
names(Seifa3)[17]<-"Percentile_Rank_State"
names(Seifa3)[19]<-"Usual_Resident_Population"
Observations are filtered out so that only “IRSAD” values are in the data set. Then, columns are renamed for preprocessing purposes.
Seifa4<-Seifa3 %>%
select(-(TIME:Percentile_Rank_State))
SeifaScore<-Seifa4[!is.na(Seifa4$Score),]
names(SeifaScore)[2]<-"LGA_Name"
SeifaScore2<-SeifaScore %>%
select(LGA_Name,Score)
The data is subset to delete insignificant variables, and cases containing NA in the score column are deleted. A column is renamed for preprocessing purposes, and then the data is subset again.
head(SeifaScore2)
## LGA_Name Score
## 29 Albury (C) 956
## 69 Armidale Regional (A) 976
## 109 Ballina (A) 987
## 149 Balranald (A) 927
## 189 Bathurst Regional (A) 973
## 229 Bega Valley (A) 951
str(SeifaScore2)
## 'data.frame': 544 obs. of 2 variables:
## $ LGA_Name: Factor w/ 545 levels "Adelaide (C)",..: 4 11 16 18 28 32 33 38 41 42 ...
## $ Score : int 956 976 987 927 973 951 954 935 993 954 ...
This produces a data frame containing the IRSAD score for each LGA.
SeifaPopulation<-Seifa4[!is.na(Seifa4$Usual_Resident_Population),]
names(SeifaPopulation)[2]<-"LGA_Name"
SeifaPopulation2<-SeifaPopulation %>%
select(LGA_Name,Usual_Resident_Population)
The same process is done for resident population. Cases containing NA in the resident population column are deleted, the column is renamed, and the data is subset.
head(SeifaPopulation2)
## LGA_Name Usual_Resident_Population
## 30 Albury (C) 51076
## 70 Armidale Regional (A) 29449
## 110 Ballina (A) 41790
## 150 Balranald (A) 2287
## 190 Bathurst Regional (A) 41300
## 230 Bega Valley (A) 33253
str(SeifaPopulation2)
## 'data.frame': 544 obs. of 2 variables:
## $ LGA_Name : Factor w/ 545 levels "Adelaide (C)",..: 4 11 16 18 28 32 33 38 41 42 ...
## $ Usual_Resident_Population: int 51076 29449 41790 2287 41300 33253 12668 8462 336962 5955 ...
Now each LGA has a resident population value stored in a data frame.
names(HousingData)[17]<-"LGA_Name"
HousingData$LGA_Name<-as.character(HousingData$LGA_Name)
SeifaScore2$LGA_Name<-as.character(SeifaScore2$LGA_Name)
SeifaPopulation2$LGA_Name<-as.character(SeifaPopulation2$LGA_Name)
Data_Combined<-HousingData %>%
left_join(SeifaScore2, by="LGA_Name")
Data_Combined2<-Data_Combined %>%
left_join(SeifaPopulation2, by="LGA_Name")
The Melbourne housing data set is joined with the IRSAD and population data by using the common variable “LGA_Name”. They are converted to characters from factors for ease of joining.
Data_Combined3<-Data_Combined2 %>%
select(-(YearBuilt))
Data_Combined3<-Data_Combined3 %>%
select(-(Lattitude:Longtitude))
Data_Combined3<-Data_Combined3 %>%
select(-(Regionname:Propertycount))
Data_Combined3<-Data_Combined3 %>%
select(-(Bedroom2:Car))
Data_Combined3<- Data_Combined3[-c(32994,32995,32996), ]
Variables not pertinent to the analysis and rows containing many missing values are removed to create the final assignment data set.
head(Data_Combined3)
## Suburb Address Rooms Type Price Method SellerG Date
## 1 Brighton 802 Hampton St 3 h 1550000 VB Buxton 3/9/16
## 2 Brighton 476 New St 4 h NA S Marshall 3/9/16
## 3 Brighton 18 Rooding St 3 h 1635000 S Buxton 3/9/16
## 4 Brighton 8 Ballara Ct 3 h NA S Buxton 3/12/16
## 5 Brighton 148 Cochrane St 3 h NA SP Marshall 3/12/16
## 6 Brighton 152A Cochrane St 4 h 1830000 S Hodges 3/12/16
## Distance Postcode Landsize BuildingArea LGA_Name Score
## 1 11.2 3186 663 NA Bayside (C) 1125
## 2 11.2 3186 683 254 Bayside (C) 1125
## 3 11.2 3186 366 156 Bayside (C) 1125
## 4 11.2 3186 688 NA Bayside (C) 1125
## 5 11.2 3186 318 136 Bayside (C) 1125
## 6 11.2 3186 436 NA Bayside (C) 1125
## Usual_Resident_Population
## 1 97087
## 2 97087
## 3 97087
## 4 97087
## 5 97087
## 6 97087
This is the final assignment data set to be used.
dim(Data_Combined3)
## [1] 32993 15
options(max.print=100)
attributes(Data_Combined3)
## $names
## [1] "Suburb" "Address"
## [3] "Rooms" "Type"
## [5] "Price" "Method"
## [7] "SellerG" "Date"
## [9] "Distance" "Postcode"
## [11] "Landsize" "BuildingArea"
## [13] "LGA_Name" "Score"
## [15] "Usual_Resident_Population"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## [ reached getOption("max.print") -- omitted 32893 entries ]
##
## $class
## [1] "data.frame"
str(Data_Combined3)
## 'data.frame': 32993 obs. of 15 variables:
## $ Suburb : Factor w/ 329 levels "Abbotsford","Aberfeldie",..: 43 43 43 43 43 43 43 43 43 43 ...
## $ Address : Factor w/ 32179 levels "1 Abercrombie St",..: 30476 23353 9143 29641 6977 7544 16134 8495 2733 25993 ...
## $ Rooms : int 3 4 3 3 3 4 3 4 5 4 ...
## $ Type : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 2 1 1 1 ...
## $ Price : int 1550000 NA 1635000 NA NA 1830000 1300000 3695000 NA NA ...
## $ Method : Factor w/ 9 levels "PI","PN","S",..: 8 3 3 3 6 3 1 3 3 6 ...
## $ SellerG : Factor w/ 380 levels "@Realty","A",..: 50 209 50 50 209 148 247 209 209 247 ...
## $ Date : Factor w/ 78 levels "1/7/17","10/12/16",..: 60 60 60 56 56 56 56 56 56 56 ...
## $ Distance : Factor w/ 213 levels "#N/A","0","0.7",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ Postcode : Factor w/ 204 levels "#N/A","3000",..: 146 146 146 146 146 146 146 146 146 146 ...
## $ Landsize : int 663 683 366 688 318 436 NA 836 845 0 ...
## $ BuildingArea : num NA 254 156 NA 136 NA NA 284 432 258 ...
## $ LGA_Name : chr "Bayside (C)" "Bayside (C)" "Bayside (C)" "Bayside (C)" ...
## $ Score : int 1125 1125 1125 1125 1125 1125 1125 1125 1125 1125 ...
## $ Usual_Resident_Population: int 97087 97087 97087 97087 97087 97087 97087 97087 97087 97087 ...
There are now 32,993 observations and 15 variables in the final assignment data. Every house sold now has the suburb it is located in, its address, the number of rooms, the type of property, its selling price, the selling method, the real estate agent, the date sold, the distance from the CBD, its postcode, size of in land its on (in square meters), size of the building (in square meters), its Local Government Area, the IRSAD score of the LGA, and the population of the LGA.
8 variables are factors (Suburb, Address, Type, Method, SellerG, Date, Distance, and Postcode), 1 is numeric (BuildingArea), 1 is a character (LGA_Name), and 5 are integer (Rooms, Price, Landsize, Score, and Usual_Resident_Population).
Data_Combined3$Date<-as.character(Data_Combined3$Date)
Data_Combined3$Date<-dmy(Data_Combined3$Date)
## Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'zone/tz/2018c.1.0/
## zoneinfo/Australia/Melbourne'
Data_Combined3$Date<-as.Date(Data_Combined3$Date)
str(Data_Combined3$Date)
## Date[1:32993], format: "2016-09-03" "2016-09-03" "2016-09-03" "2016-12-03" ...
Data_Combined3$Suburb<-as.character(Data_Combined3$Suburb)
Data_Combined3$Address<-as.character(Data_Combined3$Address)
Data_Combined3$SellerG<-as.character(Data_Combined3$SellerG)
Data_Combined3$Distance<-as.numeric(Data_Combined3$Distance)
str(Data_Combined3)
## 'data.frame': 32993 obs. of 15 variables:
## $ Suburb : chr "Brighton" "Brighton" "Brighton" "Brighton" ...
## $ Address : chr "802 Hampton St" "476 New St" "18 Rooding St" "8 Ballara Ct" ...
## $ Rooms : int 3 4 3 3 3 4 3 4 5 4 ...
## $ Type : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 2 1 1 1 ...
## $ Price : int 1550000 NA 1635000 NA NA 1830000 1300000 3695000 NA NA ...
## $ Method : Factor w/ 9 levels "PI","PN","S",..: 8 3 3 3 6 3 1 3 3 6 ...
## $ SellerG : chr "Buxton" "Marshall" "Buxton" "Buxton" ...
## $ Date : Date, format: "2016-09-03" "2016-09-03" ...
## $ Distance : num 21 21 21 21 21 21 21 21 21 21 ...
## $ Postcode : Factor w/ 204 levels "#N/A","3000",..: 146 146 146 146 146 146 146 146 146 146 ...
## $ Landsize : int 663 683 366 688 318 436 NA 836 845 0 ...
## $ BuildingArea : num NA 254 156 NA 136 NA NA 284 432 258 ...
## $ LGA_Name : chr "Bayside (C)" "Bayside (C)" "Bayside (C)" "Bayside (C)" ...
## $ Score : int 1125 1125 1125 1125 1125 1125 1125 1125 1125 1125 ...
## $ Usual_Resident_Population: int 97087 97087 97087 97087 97087 97087 97087 97087 97087 97087 ...
The “Date” variable is converted to a date format. “Suburb”, “Address”, and “SellerG” are converted to characters. Distance is converted to numeric.
levels(Data_Combined3$Type)
## [1] "h" "t" "u"
levels(Data_Combined3$Method)
## [1] "PI" "PN" "S" "SA" "SN" "SP" "SS" "VB" "W"
Data_Combined3$Type<-factor(Data_Combined3$Type, levels =c("h","t","u"),
labels = c("House", "Townhouse", "Unit"))
Data_Combined3$Method<-factor(Data_Combined3$Method,
levels =c("PI","PN","S","SA","SN","SP","SS","VB","W"),
labels = c("Passed In", "Sold Prior Not Disclosed","Sold",
"Sold After Auction","Sold Not Disclosed",
"Property Sold Prior","Sold After Auction Not Disclosed",
"Vendor Bid","Withdrawn Prior"))
levels(Data_Combined3$Type)
## [1] "House" "Townhouse" "Unit"
levels(Data_Combined3$Method)
## [1] "Passed In" "Sold Prior Not Disclosed"
## [3] "Sold" "Sold After Auction"
## [5] "Sold Not Disclosed" "Property Sold Prior"
## [7] "Sold After Auction Not Disclosed" "Vendor Bid"
## [9] "Withdrawn Prior"
Type is a factor with three levels, and is labled to signfiy the type of property. Method is a factor with 9 levels, and is labled to signify the method by which it was sold.
colnames(Data_Combined3)[colSums(is.na(Data_Combined3)) > 0]
## [1] "Price" "Landsize" "BuildingArea"
sum(is.na(Data_Combined3$Price))
## [1] 7206
sum(is.na(Data_Combined3$Landsize))
## [1] 11087
sum(is.na(Data_Combined3$BuildingArea))
## [1] 19900
The columns containing NAs are identified. There are 7,206 NAs in “Price”, 11,087 in “Landsize”, and 19,900 in “BuildingArea”.
is.special <- function(x){
if (is.numeric(x)) !is.finite(x) else is.na(x)
}
is.special <- function(x){
if (is.numeric(x)) !is.finite(x)
}
sapply(Data_Combined3, is.special)
## $Suburb
## NULL
##
## $Address
## NULL
##
## $Rooms
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
## [ reached getOption("max.print") -- omitted 32893 entries ]
##
## $Type
## NULL
##
## $Price
## [1] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [12] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
## [23] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [34] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [45] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [67] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [89] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [100] TRUE
## [ reached getOption("max.print") -- omitted 32893 entries ]
##
## $Method
## NULL
##
## $SellerG
## NULL
##
## $Date
## NULL
##
## $Distance
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
## [ reached getOption("max.print") -- omitted 32893 entries ]
##
## $Postcode
## NULL
##
## $Landsize
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [34] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [78] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [100] FALSE
## [ reached getOption("max.print") -- omitted 32893 entries ]
##
## $BuildingArea
## [1] TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [23] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [34] FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## [45] FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## [56] TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## [67] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## [78] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [89] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## [100] TRUE
## [ reached getOption("max.print") -- omitted 32893 entries ]
##
## $LGA_Name
## NULL
##
## $Score
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
## [ reached getOption("max.print") -- omitted 32893 entries ]
##
## $Usual_Resident_Population
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
## [ reached getOption("max.print") -- omitted 32893 entries ]
All variables in the data set are scanned for any special values. This returns the same columns as above.
is.na(Data_Combined3$Landsize) <- !Data_Combined3$Landsize
is.na(Data_Combined3$BuildingArea) <- !Data_Combined3$BuildingArea
Any 0 values in the “Landsize” or “BuildingArea” columns are converted to NAs because they are impossible value (a property and building can’t be 0 square meters).
Data_Combined4<-Data_Combined3 %>%
group_by(Postcode) %>%
mutate(Price=impute(Price, fun=mean))
Data_Combined5<-Data_Combined4 %>%
group_by(Postcode) %>%
mutate(Landsize=impute(Landsize, fun=mean))
Data_Combined6<-Data_Combined5 %>%
group_by(Postcode) %>%
mutate(BuildingArea=impute(BuildingArea, fun=mean))
To deal with the missing values, imputation was performed. For any missing value in “Price”, the average price of the properties in the same postcode was imputed. Similarly, for any observation missing “Landsize”, the average landsize for that postcode was imputed, and all missing values in “BuildingArea” were imputed using the average building size in that postcode.
Data_Combined7<-mutate(Data_Combined6,
Price_Per_Sq_M = Price/BuildingArea)
colnames(Data_Combined7)
## [1] "Suburb" "Address"
## [3] "Rooms" "Type"
## [5] "Price" "Method"
## [7] "SellerG" "Date"
## [9] "Distance" "Postcode"
## [11] "Landsize" "BuildingArea"
## [13] "LGA_Name" "Score"
## [15] "Usual_Resident_Population" "Price_Per_Sq_M"
str(Data_Combined7$Price_Per_Sq_M)
## num [1:32993] 7173 7812 10481 9182 14590 ...
The variables “Price” and “BuildingArea” are mutated to create the new variable “Price_Per_Sq_M”. This takes the price of the property and divides it by the size of of the building to produce the commonly used statistic price per square meter.
Data_Combined8<-Data_Combined7[complete.cases(Data_Combined7),]
z.scores <- Data_Combined8$Rooms %>% scores(type = "z")
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.07500 -1.04900 -0.02383 0.00000 1.00200 13.31000
which( abs(z.scores) >3 )
## [1] 30 143 201 637 698 719 1329 1331 1394 1401 1503
## [12] 1784 1859 1978 2017 2027 2046 2091 2129 2140 2163 2247
## [23] 2346 2453 2529 2575 2776 2966 3014 3073 3151 3183 3319
## [34] 3335 3374 3453 3596 3737 3862 3879 3924 3954 4028 4088
## [45] 4180 4291 4298 4333 4366 4455 4502 4523 4693 4710 4751
## [56] 5015 5092 5093 5154 5173 5246 5353 5427 6031 6098 6115
## [67] 6370 6502 6504 6629 6702 6743 6752 6778 6801 6877 6998
## [78] 7027 7059 7083 7240 7300 7338 7607 7983 8569 8578 8973
## [89] 9065 10129 10403 10514 10780 11147 11305 11480 11647 11785 11793
## [100] 11886
## [ reached getOption("max.print") -- omitted 155 entries ]
Data_Combined8$Rooms[ which( abs(z.scores) >3 )] <- mean(Data_Combined8$Rooms, na.rm = TRUE)
To identify outliers of the “Rooms” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the rooms in all observations and imputed them in place of these outlier values.
z.scores2 <- Data_Combined8$Price %>% scores(type = "z")
z.scores2 %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6380 -0.6868 -0.2373 0.0000 0.4149 16.5900
which( abs(z.scores2) >3 )
## [1] 8 13 14 20 47 74 79 97 110 118 142 153 173 182
## [15] 189 193 195 197 205 285 412 444 463 480 482 500 510 596
## [29] 600 606 619 701 789 809 831 832 882 885 905 965 968 1006
## [43] 1015 1068 1132 1253 1257 1258 1260 1263 1268 1318 1333 1381 1385 1394
## [57] 1396 1407 1440 1469 1473 1512 1540 1586 1592 1632 1659 1660 1716 1741
## [71] 1784 1810 1857 1859 1867 1886 1899 1904 1907 1910 1914 1946 1948 1962
## [85] 1978 1986 1990 1996 1998 2005 2013 2019 2039 2062 2072 2099 2107 2128
## [99] 2213 2250
## [ reached getOption("max.print") -- omitted 395 entries ]
Data_Combined8$Price[ which( abs(z.scores2) >3 )] <- mean(Data_Combined8$Price, na.rm = TRUE)
To identify outliers of the “Price” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the price in all observations and imputed them in place of these outlier values.
z.scores3 <- Data_Combined8$Landsize %>% scores(type = "z")
z.scores3 %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.23720 -0.08833 -0.03697 0.00000 0.00277 152.20000
which( abs(z.scores3) >3 )
## [1] 708 1557 2199 4213 5085 7059 8273 10323 14205 14301 14492
## [12] 14596 14730 16101 16104 16106 16108 16109 16110 16111 16113 16115
## [23] 16117 16119 16120 16121 16122 16123 16125 16126 16130 16131 16134
## [34] 16140 16515 16667 16883 17011 17035 17401 17498 18012 19558 19666
## [45] 19724 19739 19744 19795 19882 20151 20180 20392 20394 20468 20619
## [56] 21402 23328 23672 24021 24670 26149 26336 28027 31821 32242 32399
## [67] 32428 32503
Data_Combined8$Landsize[ which( abs(z.scores3) >3 )] <- mean(Data_Combined8$Landsize, na.rm = TRUE)
To identify outliers of the “Landsize” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the land size in all observations and imputed them in place of these outlier values.
z.scores4 <- Data_Combined8$Distance %>% scores(type = "z")
z.scores4 %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4940 -0.9199 -0.2173 0.0000 1.0450 1.5320
which( abs(z.scores4) >3 )
## integer(0)
To identify outliers of the “Distance” variable, I found which observations had an absolute z score of greater than 3. There were no outliers present using this method.
z.scores5 <- Data_Combined8$Score %>% scores(type = "z")
z.scores5 %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.4110 -0.5613 0.1435 0.0000 0.7249 1.3420
which( abs(z.scores5) >3 )
## integer(0)
Data_Combined8$Score[ which( abs(z.scores5) >3 )] <- mean(Data_Combined8$Score, na.rm = TRUE)
To identify outliers of the “Score” variable, I found which observations had an absolute z score of greater than 3. There were no outliers present using this method.
z.scores6 <- Data_Combined8$BuildingArea %>% scores(type = "z")
z.scores6 %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.54320 -0.13460 -0.04719 0.00000 0.08727 149.70000
which( abs(z.scores6) >3 )
## [1] 1686 2296 2648 10097 11075 16113 16115 16125 16128 16132 16175
## [12] 21825 24970
Data_Combined8$BuildingArea[ which( abs(z.scores6) >3 )] <- mean(Data_Combined8$BuildingArea, na.rm = TRUE)
To identify outliers of the “BuildingArea” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the building area in all observations and imputed them in place of these outlier values.
z.scores7 <- Data_Combined8$Usual_Resident_Population %>% scores(type = "z")
z.scores7 %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.8960 -1.0010 0.1277 0.0000 0.6675 4.1430
which( abs(z.scores7) >3 )
## [1] 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086
## [15] 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100
## [29] 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114
## [43] 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126 7127 7128
## [57] 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140 7141 7142
## [71] 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154 7155 7156
## [85] 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168 7169 7170
## [99] 7171 7172
## [ reached getOption("max.print") -- omitted 69 entries ]
Data_Combined8$Usual_Resident_Population[ which( abs(z.scores7) >3 )] <- mean(Data_Combined8$Usual_Resident_Population, na.rm = TRUE)
To identify outliers of the “Usual_Resident_Population” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the population in all observations and imputed them in place of these outlier values.
z.scores8 <- Data_Combined8$Price_Per_Sq_M %>% scores(type = "z")
z.scores8 %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.01666 -0.01038 -0.00780 0.00000 -0.00484 181.40000
which( abs(z.scores8) >3 )
## [1] 4137 27653
Data_Combined8$Price_Per_Sq_M[ which( abs(z.scores8) >3 )] <- mean(Data_Combined8$Price_Per_Sq_M, na.rm = TRUE)
To identify outliers of the “Price_Per_Sq_M” variable, I found which observations had an absolute z score of greater than 3. To deal with these outliers, I took the mean of the price value in all observations and imputed them in place of these outlier values.
hist(Data_Combined$Price)
Data_Combined8$Price<- log10(Data_Combined8$Price)
hist(Data_Combined8$Price)
A histogram of the “Price” variable is produced. There is a clear right skew in the data, therefore a log transformation is applied. This transforms the data into a normal distribution, which is useful for further statistical analysis.
head(Data_Combined8)
## # A tibble: 6 x 16
## # Groups: Postcode [1]
## Suburb Address Rooms Type Price Method
## <chr> <chr> <dbl> <fctr> <dbl> <fctr>
## 1 Brighton 802 Hampton St 3 House 6.190332 Vendor Bid
## 2 Brighton 476 New St 4 House 6.297591 Sold
## 3 Brighton 18 Rooding St 3 House 6.213518 Sold
## 4 Brighton 8 Ballara Ct 3 House 6.297591 Sold
## 5 Brighton 148 Cochrane St 3 House 6.297591 Property Sold Prior
## 6 Brighton 152A Cochrane St 4 House 6.262451 Sold
## # ... with 10 more variables: SellerG <chr>, Date <date>, Distance <dbl>,
## # Postcode <fctr>, Landsize <dbl>, BuildingArea <dbl>, LGA_Name <chr>,
## # Score <dbl>, Usual_Resident_Population <dbl>, Price_Per_Sq_M <dbl>