# The required packages
library(readr)
library(dplyr)
library(readxl)
library(editrules)
library(forecast)
library(knitr)
In this report, I have taken the following steps for preprocessing of the data:
1. Get: Two datasets are used for the purpose of this assignment. The datasets are downloaded from open data source websites as excel and csv files and then imported into R using the read_csv() and read_xlsx() functions.
2. Understand: The variable types are examined for class types and converted to appropriate formats where necessary. Variables that form categories are coverted to factors, labeled and ordered if required. Using the grepl() function, the housing data is subsetted for ‘Metorpolitan’ region only. As we examining Distance from CBD, the regional areas will be very far away from CBD.
3. Tidy & Manipulate: Both the datasets are checked if it conforms to the tidy principles. New variables are created using the mutate() function. New subsets are created using the filter() and select() functions to exclude variables that are not used in the analysis.
4. Scan: Datasets are scanned for missing values, special characters and obvious errors. If there are any missing values they are dealt with after considering the impact on the data. Each variable is scanned for outliers using the Univariate and/or Multivariate outlier detection methods. Outliers are handled with after considering whether its valuable for an analysis or not.
5. Transform: The distribution of each variable is checked and appropriate transformation is applied to decrease the skewness. Once the right transformation is determined, the variable will be transformed.
First Dataset: Melbourne Housing Market
The first dataset is Melbourne Housing Market sourced from kaggle. This dataset was last updated on 06/08/2018. read_csv() function is used to read this data and first six rows of the data are displayed using the head() function.
This data was scraped by Tony Pino and made publicly available on kaggle. The data is originally obtained from Domain.com.au which publishes prices of house sold with all other relevant information. There are two files available for download. For this assignment MELBOURNE_HOUSE_PRICES_LESS.csv is appropriate as it has variables required for the report.
# Read house price data
houseprice_data <- read_csv("~/R/Projects/Data Wrangling/Assignment 2/MELBOURNE_HOUSE_PRICES_LESS.csv")
## Parsed with column specification:
## cols(
## Suburb = col_character(),
## Address = col_character(),
## Rooms = col_double(),
## Type = col_character(),
## Price = col_double(),
## Method = col_character(),
## SellerG = col_character(),
## Date = col_character(),
## Postcode = col_double(),
## Regionname = col_character(),
## Propertycount = col_double(),
## Distance = col_double(),
## CouncilArea = col_character()
## )
# view top 6 observations
head(houseprice_data,6)
## # A tibble: 6 x 13
## Suburb Address Rooms Type Price Method SellerG Date Postcode Regionname
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Abbot~ 49 Lit~ 3 h 1.49e6 S Jellis 1/04~ 3067 Northern ~
## 2 Abbot~ 59A Tu~ 3 h 1.22e6 S Marsha~ 1/04~ 3067 Northern ~
## 3 Abbot~ 119B Y~ 3 h 1.42e6 S Nelson 1/04~ 3067 Northern ~
## 4 Aberf~ 68 Vid~ 3 h 1.52e6 S Barry 1/04~ 3040 Western M~
## 5 Airpo~ 92 Cly~ 2 h 6.70e5 S Nelson 1/04~ 3042 Western M~
## 6 Airpo~ 4/32 E~ 2 t 5.30e5 S Jellis 1/04~ 3042 Western M~
## # ... with 3 more variables: Propertycount <dbl>, Distance <dbl>,
## # CouncilArea <chr>
Second dataset: Criminal incidents
The second dataset is Criminal incidents in Victoria which provides data on criminal offences recorded in Melbourne. This data is sourced from Crime Statistics Agency. The file selected is for the year ending June 2018 for the data as the housing data is for 2017-18. This file was last published as of September 2018. Sheet named as .
The file is imported into R using using the read_xlsx() function. The file has several tables. Sheet named ‘Table 07’ contains data of crime incidents recorded suburb wise is read.
The data records crime incidents reported suburb-wise for each offences. The data is saved year wise for year year ending in June.
# Read the second dataset
crime_data <- read_xlsx("~/R/Projects/Data Wrangling/Assignment 2/Data_tables_Criminal_Incidents_Visualisation_year_ending_June_2018.xlsx",sheet="Table 07")
# view the top 6 observations
head(crime_data,6)
## # A tibble: 6 x 7
## `Year ending Ju~ Postcode `Suburb/Town Na~ `Offence Divisi~ `Offence Subdiv~
## <dbl> <dbl> <chr> <chr> <chr>
## 1 2018 3000 MELBOURNE A Crimes agains~ A20 Assault and~
## 2 2018 3000 MELBOURNE A Crimes agains~ A20 Assault and~
## 3 2018 3000 MELBOURNE A Crimes agains~ A20 Assault and~
## 4 2018 3000 MELBOURNE A Crimes agains~ A20 Assault and~
## 5 2018 3000 MELBOURNE A Crimes agains~ A20 Assault and~
## 6 2018 3000 MELBOURNE A Crimes agains~ A50 Robbery
## # ... with 2 more variables: `Offence Subgroup` <chr>, `Incidents
## # Recorded` <dbl>
Understand Melbourne Housing Market dataset:
The dimension of the data will be inspected using the dim() function. To check the data type of the variables, glimpse() function is used. This will reveal the data type of each variable.
# check total obs read
dim(houseprice_data)
## [1] 63023 13
# view the glimpse of data structure
glimpse(houseprice_data)
## Rows: 63,023
## Columns: 13
## $ Suburb <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Aberfeldie...
## $ Address <chr> "49 Lithgow St", "59A Turner St", "119B Yarra St", "6...
## $ Rooms <dbl> 3, 3, 3, 3, 2, 2, 2, 3, 6, 3, 3, 4, 2, 4, 2, 4, 3, 2,...
## $ Type <chr> "h", "h", "h", "h", "h", "t", "u", "h", "h", "h", "u"...
## $ Price <dbl> 1490000, 1220000, 1420000, 1515000, 670000, 530000, 5...
## $ Method <chr> "S", "S", "S", "S", "S", "S", "S", "SP", "PI", "S", "...
## $ SellerG <chr> "Jellis", "Marshall", "Nelson", "Barry", "Nelson", "J...
## $ Date <chr> "1/04/2017", "1/04/2017", "1/04/2017", "1/04/2017", "...
## $ Postcode <dbl> 3067, 3067, 3067, 3040, 3042, 3042, 3042, 3042, 3021,...
## $ Regionname <chr> "Northern Metropolitan", "Northern Metropolitan", "No...
## $ Propertycount <dbl> 4019, 4019, 4019, 1543, 3464, 3464, 3464, 3464, 1899,...
## $ Distance <dbl> 3.0, 3.0, 3.0, 7.5, 10.4, 10.4, 10.4, 10.4, 14.0, 3.0...
## $ CouncilArea <chr> "Yarra City Council", "Yarra City Council", "Yarra Ci...
The data contains 62023 observations and 13 variables. The variables used in this assignment are Suburb, Price, Date and Regionname. The data types of the variables in the dataset are incorrect. Suburb & Regionname variables should be factor type and Date should of date type.
Subset will be created with these 4 variables. Conversions will be applied on the subset. The original file is left as it was imported in case any further subset is required.
House prices dataset contains all regions of Victoria. As distance from Melbourne CBD is one of the variables of interest, Regional Victoria is not included in the assignment.
grepl() function of base package is used to filter all observations with the string ‘Metropolitan’ in it.
# subset the data for required variables only and filter for data for 2017 & Metropolitan region
houseprice_sub <- houseprice_data %>% filter(grepl(pattern = 'Metropolitan',x = Regionname, ignore.case = TRUE))
# check if filter applied correctly.
dim(houseprice_sub)
## [1] 61665 13
#check structure of filtered data
glimpse(houseprice_sub)
## Rows: 61,665
## Columns: 13
## $ Suburb <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Aberfeldie...
## $ Address <chr> "49 Lithgow St", "59A Turner St", "119B Yarra St", "6...
## $ Rooms <dbl> 3, 3, 3, 3, 2, 2, 2, 3, 6, 3, 3, 4, 2, 4, 2, 4, 3, 2,...
## $ Type <chr> "h", "h", "h", "h", "h", "t", "u", "h", "h", "h", "u"...
## $ Price <dbl> 1490000, 1220000, 1420000, 1515000, 670000, 530000, 5...
## $ Method <chr> "S", "S", "S", "S", "S", "S", "S", "SP", "PI", "S", "...
## $ SellerG <chr> "Jellis", "Marshall", "Nelson", "Barry", "Nelson", "J...
## $ Date <chr> "1/04/2017", "1/04/2017", "1/04/2017", "1/04/2017", "...
## $ Postcode <dbl> 3067, 3067, 3067, 3040, 3042, 3042, 3042, 3042, 3021,...
## $ Regionname <chr> "Northern Metropolitan", "Northern Metropolitan", "No...
## $ Propertycount <dbl> 4019, 4019, 4019, 1543, 3464, 3464, 3464, 3464, 1899,...
## $ Distance <dbl> 3.0, 3.0, 3.0, 7.5, 10.4, 10.4, 10.4, 10.4, 14.0, 3.0...
## $ CouncilArea <chr> "Yarra City Council", "Yarra City Council", "Yarra Ci...
Regionname is categorized and labeled as “North”, “South”, “West”, “East”, “South-East” using the factor() function.
Suburb is categorized as per the suburb names itself. We do not need to create levels for each of the suburbs as they are too many.
Date variable is changed to Date type.
# factor regionname
houseprice_sub$Regionname <- houseprice_sub$Regionname %>% factor(c("Northern Metropolitan", "Western Metropolitan", "Southern Metropolitan", "Eastern Metropolitan", "South-Eastern Metropolitan"), labels = c("North", "South" , "West", "East", "South-East"),ordered = TRUE )
# Check class of regionname
class(houseprice_sub$Regionname)
## [1] "ordered" "factor"
# Factor Suburb
houseprice_sub$Suburb <- as.factor(houseprice_sub$Suburb)
# Check class of Suburb
class(houseprice_sub$Suburb)
## [1] "factor"
# covert the date variable to date format
houseprice_sub$Date <- as.Date(houseprice_sub$Date, format = "%d/%m/%Y")
# check date class after conversion
class(houseprice_sub$Date)
## [1] "Date"
The crime incidents data for June ending 2018 is from July 1 2017 to June 30 2018. The house data will be subsetted further for this date range.
#filter house data to match crime date date range
houseprice_sub <- houseprice_sub%>% select(Regionname, Suburb, Price, Date, Distance) %>%filter(Date >= '2017-07-01' & Date <= '2018-06-30')
#check if data is subsetted
dim(houseprice_sub)
## [1] 28374 5
#view top 6 observations
head(houseprice_sub,6)
## # A tibble: 6 x 5
## Regionname Suburb Price Date Distance
## <ord> <fct> <dbl> <date> <dbl>
## 1 North Abbotsford 1350000 2017-07-01 3
## 2 South Airport West 858000 2017-07-01 10.4
## 3 South Airport West 593000 2017-07-01 10.4
## 4 South Airport West 640000 2017-07-01 10.4
## 5 South Airport West 961000 2017-07-01 10.4
## 6 West Albert Park 1535000 2017-07-01 3
# glimpse the data structure
glimpse(houseprice_sub)
## Rows: 28,374
## Columns: 5
## $ Regionname <ord> North, South, South, South, South, West, South, North, N...
## $ Suburb <fct> Abbotsford, Airport West, Airport West, Airport West, Ai...
## $ Price <dbl> 1350000, 858000, 593000, 640000, 961000, 1535000, 570000...
## $ Date <date> 2017-07-01, 2017-07-01, 2017-07-01, 2017-07-01, 2017-07...
## $ Distance <dbl> 3.0, 10.4, 10.4, 10.4, 10.4, 3.0, 10.5, 5.7, 5.7, 15.5, ...
The subset data comprises of 28374 observations and the follwing 5 variables:
Suburb[Factor w/ 309 levels] : Name of the suburbPrice[Numeric] : Selling price of the houseDate[Date] : Date the house is soldRegionname[Ord.factor w/ 5 levels] : Name of the regionDistance[Numeric] : Distance of the suburb from Melbourne CBDUnderstand Crime Incidents data:
The dimension of the data will be inspected using the dim() function. To check the data type of the variables, glimpse() function is used. This will reveal the data type of each variable.
# check data dimesions
dim(crime_data)
## [1] 280221 7
# check data structure
glimpse(crime_data)
## Rows: 280,221
## Columns: 7
## $ `Year ending June` <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201...
## $ Postcode <dbl> 3000, 3000, 3000, 3000, 3000, 3000, 3000, 300...
## $ `Suburb/Town Name` <chr> "MELBOURNE", "MELBOURNE", "MELBOURNE", "MELBO...
## $ `Offence Division` <chr> "A Crimes against the person", "A Crimes agai...
## $ `Offence Subdivision` <chr> "A20 Assault and related offences", "A20 Assa...
## $ `Offence Subgroup` <chr> "A211 FV Serious assault", "A212 Non-FV Serio...
## $ `Incidents Recorded` <dbl> 109, 561, 147, 132, 644, 144, 25, 22, 29, 10,...
The data contains 280221 observations and 7 variables. The variables used in this assignment are Suburb/Town Name and Incidents Recorded. The data type of the variable Suburb/Town Name should be factor type, also preferably the column names should not have space.
Check names of the variables:
# check names of columns
names(crime_data)
## [1] "Year ending June" "Postcode" "Suburb/Town Name"
## [4] "Offence Division" "Offence Subdivision" "Offence Subgroup"
## [7] "Incidents Recorded"
Apply change of variable name on required columns using the rename() function as below :
# rename variable names
crime_data <- rename(crime_data, Year = `Year ending June`,Suburb = `Suburb/Town Name`,Offence_Type = `Offence Division`,Incidents = `Incidents Recorded`)
sapply(crime_data,class)
## Year Postcode Suburb Offence_Type
## "numeric" "numeric" "character" "character"
## Offence Subdivision Offence Subgroup Incidents
## "character" "character" "numeric"
Subset will be created with for the 2 variables and for observations for the year 2018 only using filter() and select() functions.
# filter required variables
crime_data_sub <- crime_data%>%filter(Year == 2018) %>% select(Suburb,Incidents)
# check for data dimensions
dim(crime_data_sub)
## [1] 31958 2
Suburb variable will be converted to factor as below:
# convert to factor
crime_data_sub$Suburb <- crime_data_sub$Suburb %>% factor()
#check the class after conversion
class(crime_data_sub$Suburb)
## [1] "factor"
Check the data dimension and structure of the subsetted dataset:
#dimension
dim(crime_data_sub)
## [1] 31958 2
#glimpse of structure
str(crime_data_sub)
## tibble [31,958 x 2] (S3: tbl_df/tbl/data.frame)
## $ Suburb : Factor w/ 2388 levels "ABBOTSFORD","ABERFELDIE",..: 1335 1335 1335 1335 1335 1335 1335 1335 1335 1335 ...
## $ Incidents: num [1:31958] 109 561 147 132 644 144 25 22 29 10 ...
The data contains 280221 observations and 7 variables. The variables used in this assignment are :
Suburb[Factor w/ 2388 levels] : Name of the suburbIncidents[Numeric] : Crime incidents recorded for the suburbBoth the datasets used for this assignment are not messy. The dataset confirms to the Tidy principle (Wickham and Grolemund (2016)) of: Each variable must have its own column, Each observation must have its own row and Each value must have its own cell. Hence, I will not be using any of the tidy techniques for these datasets.
Create new variables Median_Price and Med_Dist2CBD using the mutate() function. The median is computed by grouping the data Suburb-wise using group_by() function. distinct() will retain only distinct observations. This will be saved into a new dataset.
# compute median house price and distance, grouped by suburb
houseprice_median <- houseprice_sub %>% group_by(Suburb) %>% mutate( Median_Price = median(Price,na.rm = TRUE),Med_Dist2CBD = median(Distance)) %>%distinct(Regionname,Suburb,Median_Price,Med_Dist2CBD)
#view the new dataset
head(houseprice_median,6)
## # A tibble: 6 x 4
## # Groups: Suburb [6]
## Regionname Suburb Median_Price Med_Dist2CBD
## <ord> <fct> <dbl> <dbl>
## 1 North Abbotsford 1087500 3
## 2 South Airport West 790000 10.4
## 3 West Albert Park 1950000 3
## 4 South Albion 635000 10.5
## 5 North Alphington 1597500 5.7
## 6 South Altona Meadows 665000 15.5
#view the structure of the new dataset
glimpse(houseprice_median) # correct data structure
## Rows: 307
## Columns: 4
## Groups: Suburb [307]
## $ Regionname <ord> North, South, West, South, North, South, South, South,...
## $ Suburb <fct> Abbotsford, Airport West, Albert Park, Albion, Alphing...
## $ Median_Price <dbl> 1087500, 790000, 1950000, 635000, 1597500, 665000, 805...
## $ Med_Dist2CBD <dbl> 3.0, 10.4, 3.0, 10.5, 5.7, 15.5, 9.4, 12.8, 6.3, 4.3, ...
Similarly compute total incident numbers grouped by suburb and save to a new dataset:
# compute total incidents grouped by suburb
crimedata_sbrb <- crime_data_sub %>% group_by(Suburb) %>% summarise(Incidents_Total = sum(Incidents))
## `summarise()` ungrouping output (override with `.groups` argument)
#view the dataset
head(crimedata_sbrb,6)
## # A tibble: 6 x 2
## Suburb Incidents_Total
## <fct> <dbl>
## 1 ABBOTSFORD 1074
## 2 ABERFELDIE 88
## 3 ACHERON 7
## 4 ADDINGTON 3
## 5 ADELAIDE LEAD 3
## 6 AINTREE 27
#check the glimpse of structure
glimpse (crimedata_sbrb)
## Rows: 2,388
## Columns: 2
## $ Suburb <fct> ABBOTSFORD, ABERFELDIE, ACHERON, ADDINGTON, ADELAID...
## $ Incidents_Total <dbl> 1074, 88, 7, 3, 3, 27, 21, 2, 732, 221, 556, 7, 2, ...
Join the datasets Check for missing values prior to joining both the datasets. This is a SCAN I step done here in order to join dataset without missing values.
Convert the case for Suburb variable in house_data to upper case to match the Suburb variable in crime incidents dataset.
#check missing values in crime data
colSums(is.na(crimedata_sbrb))
## Suburb Incidents_Total
## 0 0
#check missing values in crime data
colSums(is.na(houseprice_median))
## Regionname Suburb Median_Price Med_Dist2CBD
## 0 0 0 0
# Convert case for suburb to match the housing data in
houseprice_median$Suburb <- toupper(houseprice_median$Suburb)
#check the converted data
head(houseprice_median$Suburb)
## [1] "ABBOTSFORD" "AIRPORT WEST" "ALBERT PARK" "ALBION"
## [5] "ALPHINGTON" "ALTONA MEADOWS"
#check class
class(houseprice_median$Suburb)
## [1] "character"
During conversion the Suburb variable’s data type has changed so we change it back to factor.
# factor the suburb variable
houseprice_median$Suburb <- factor(houseprice_median$Suburb)
#check class
class(houseprice_median$Suburb)
## [1] "factor"
Check if any difference between the two datasets:
a <- houseprice_median$Suburb
b <- crimedata_sbrb$Suburb
setdiff(a,b) # WEST FOOTSCRAY suburb not in crime data
## [1] "WEST FOOTSCRAY"
# Check the observation to see if relevant
houseprice_median %>% filter(Suburb %in% "WEST FOOTSCRAY")
## # A tibble: 1 x 4
## # Groups: Suburb [1]
## Regionname Suburb Median_Price Med_Dist2CBD
## <ord> <fct> <dbl> <dbl>
## 1 South WEST FOOTSCRAY 840000 6.4
crimedata_sbrb %>% filter(Suburb %in% "WEST FOOTSCRAY")
## # A tibble: 0 x 2
## # ... with 2 variables: Suburb <fct>, Incidents_Total <dbl>
The difference is ‘West Footscray’ There is no crime incident for this suburb but we do have a sale price and distance to CBD hence we will not exlude this observation.
Finally, to join the data a left join is used as all observations from housing data are required while one from crime data is missing.
# left join to include all observations from the house market data
hp_cd_join <- houseprice_median %>% left_join(crimedata_sbrb, by = "Suburb")
#dimension
dim(hp_cd_join)
## [1] 307 5
#glimpse of structure
glimpse(hp_cd_join)
## Rows: 307
## Columns: 5
## Groups: Suburb [304]
## $ Regionname <ord> North, South, West, South, North, South, South, Sou...
## $ Suburb <fct> ABBOTSFORD, AIRPORT WEST, ALBERT PARK, ALBION, ALPH...
## $ Median_Price <dbl> 1087500, 790000, 1950000, 635000, 1597500, 665000, ...
## $ Med_Dist2CBD <dbl> 3.0, 10.4, 3.0, 10.5, 5.7, 15.5, 9.4, 12.8, 6.3, 4....
## $ Incidents_Total <dbl> 1074, 732, 556, 433, 237, 647, 833, 326, 465, 1057,...
# check for missing
colSums(is.na(hp_cd_join))
## Regionname Suburb Median_Price Med_Dist2CBD Incidents_Total
## 0 0 0 0 1
Scan the dataset hp_cd_join for missing values, special values and obvious errors.
sapply() is used to check for finite, na, infinite and nan or not.
It is acceptable that the variables are finite so that is in the ifcondition. If not it checks for other special values. If any of the variables gives an output greater than 0 then we have to examine which variable has the special value.
editset() is used to create an edit rule to check if any of the values is negative number.
#function to check for special values
sapply(hp_cd_join, function(x) which(is.finite(x)))
## $Regionname
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307
##
## $Suburb
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307
##
## $Median_Price
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307
##
## $Med_Dist2CBD
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307
##
## $Incidents_Total
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 196 197 198 199
## [199] 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
## [217] 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
## [235] 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253
## [253] 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
## [271] 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289
## [289] 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307
sapply(hp_cd_join, function(x) which(is.na(x)))
## $Regionname
## integer(0)
##
## $Suburb
## integer(0)
##
## $Median_Price
## integer(0)
##
## $Med_Dist2CBD
## integer(0)
##
## $Incidents_Total
## [1] 195
sapply(hp_cd_join, function(x) which(is.infinite(x)))
## $Regionname
## integer(0)
##
## $Suburb
## integer(0)
##
## $Median_Price
## integer(0)
##
## $Med_Dist2CBD
## integer(0)
##
## $Incidents_Total
## integer(0)
sapply(hp_cd_join, function(x) which(is.nan(x)))
## $Regionname
## integer(0)
##
## $Suburb
## integer(0)
##
## $Median_Price
## integer(0)
##
## $Med_Dist2CBD
## integer(0)
##
## $Incidents_Total
## integer(0)
#create a rule to check any of the variables contain non negative values
negative_rule <- editset(c("Median_Price >= 0", "Med_Dist2CBD >= 0","Incidents_Total>=0"))
negative_rule
##
## Edit set:
## num1 : 0 <= Median_Price
## num2 : 0 <= Med_Dist2CBD
## num3 : 0 <= Incidents_Total
In this all the variables of interest will be scanned for outliers. Inclusion of an outlier can have an impact on the analysis results. It is critical that the dataset which will be used for analysis is examined and scanned for outliers. Appropriate handling methods of the outliers need to be applied based on whether the outlier can be included or must be excluded.
The numeric variables required for the analysis are : Median House Prices[Median_Price], Median Distance to CBD[Med_Dist2CBD] and Number of Incidents[Incidents_Total]. The category variable is the Regions[RegionName]
First, histogram is plotted to find distribution of the variables.
# This is the R chunk for the Transform Section
#histogram of median price
hp_cd_join$Median_Price %>% hist(main="Distribution of Median Price", ylab = "", xlab = "Median Price") # right skewed
#histogram of median distance
hp_cd_join$Med_Dist2CBD %>% hist(main="Distribution of Median Distance", ylab = "", xlab = "Median Distance") # right skewed
#histogram of total incidents
hp_cd_join$Incidents_Total %>% hist(main="Distribution of Total Incidents", ylab = "", xlab = "Total incidents")
The variables are not normally distributed. Boxplot will help us observe the outliers. The variables are plotted to boxplot below :
# Scanning Median_Price for outliers
boxplot(hp_cd_join$Median_Price ~ hp_cd_join$Regionname , main = "Median house Price", ylab = "Median Price", xlab = "Metropolitan Regions")
# Scanning Distance to Melbourne CBD for outliers
boxplot(hp_cd_join$Med_Dist2CBD ~ hp_cd_join$Regionname , main = "Distance to Melbourne CBD",ylab = "Median distance", xlab = "Metropolitan Regions")
# Scanning incidents for outliers
boxplot(hp_cd_join$Incidents_Total ~ hp_cd_join$Regionname , main = "Total Crime Incidents",ylab = "Total Incidents", xlab = "Metropolitan Regions")
There seems to be suburbs with exceptionally high price.
Outliers
Find the outliers in the variables:
# Scanning outliers for house sales price
prices_out = boxplot(hp_cd_join$Median_Price ~ hp_cd_join$Regionname)$out
# Scanning outliers for Distance to Melbourne CBD
distance_out = boxplot(hp_cd_join$Med_Dist2CBD ~ hp_cd_join$Regionname)$out
# Scanning outliers for incidents_out
incidents_out = boxplot(hp_cd_join$Incidents_Total ~ hp_cd_join$Regionname)$out
Examine the outliers in each variable set and decide on handling method for each:
hp_cd_join %>% filter(Median_Price%in% prices_out)
## # A tibble: 10 x 5
## # Groups: Suburb [10]
## Regionname Suburb Median_Price Med_Dist2CBD Incidents_Total
## <ord> <fct> <dbl> <dbl> <dbl>
## 1 North ALPHINGTON 1597500 5.7 237
## 2 East EAGLEMONT 1748000 8.9 112
## 3 South ESSENDON WEST 1276000 7.5 35
## 4 North PRINCES HILL 1575000 3.5 116
## 5 South ABERFELDIE 1505000 7.5 88
## 6 East BOX HILL 1285000 10.9 1117
## 7 South WILLIAMSTOWN 1285000 6.8 769
## 8 West DEEPDENE 3080000 7.9 37
## 9 North OAKLANDS JUNCTION 1495000 24.3 13
## 10 South-East CRANBOURNE SOUTH 1320000 34.7 49
DEEPDENE suburb in the West seems to be the most high priced suburb at 3080000. It is possibly due to its proximity to CBD. We cannot exclude this data as it seems relevant for the topic of analysis to check if the proximity of CBD effects the house sales price or not. Hence, this observation is not excluded.
hp_cd_join %>% filter(Med_Dist2CBD%in% distance_out)
## # A tibble: 5 x 5
## # Groups: Suburb [5]
## Regionname Suburb Median_Price Med_Dist2CBD Incidents_Total
## <ord> <fct> <dbl> <dbl> <dbl>
## 1 South SUNBURY 520000 31.7 2142
## 2 South WYNDHAM VALE 472000 27.2 890
## 3 South MANOR LAKES 545500 27.2 110
## 4 South DIGGERS REST 485000 27.7 254
## 5 South WILDWOOD 1030000 31.7 18
The observations which are suggested as outliers are all far away from the CBD and that could be the reason for the greater distance.We need to examine all these variables as sales are recorded for these areas as well as crime incidents and they can be signifanct to our analysis. Hence, these observations are not excluded.
hp_cd_join %>% filter(Incidents_Total%in% incidents_out)
## # A tibble: 22 x 5
## # Groups: Suburb [22]
## Regionname Suburb Median_Price Med_Dist2CBD Incidents_Total
## <ord> <fct> <dbl> <dbl> <dbl>
## 1 East BORONIA 767500 25 1703
## 2 South-East DANDENONG 570000 24.7 6019
## 3 South FOOTSCRAY 760000 5.1 2273
## 4 South-East FRANKSTON 640000 38 5925
## 5 South HOPPERS CROSSING 570000 18.4 2378
## 6 South-East NOBLE PARK 640000 22.7 2370
## 7 West PORT MELBOURNE 1282000 3.5 1727
## 8 West PRAHRAN 1100000 4.6 2195
## 9 North PRESTON 880000 8.4 3935
## 10 North RESERVOIR 715000 12 3324
## # ... with 12 more rows
The exceptionally high number of incidents 16602 reported for Melbourne suburb could be as crimes related to justice procedure offences are recorded under Melbourne as per the original dataset. It is safe to exclude the data for Melbourne suburb as it is 0 kms to CBD which will not help in our analysis topic. Hence, this observation should be excluded.
The observation for Melbourne will be excluded from the dataset hp_cd_join and a new dataset will be created.
# exclude the outlier observation and keep all below 15000 as a set target for incidents
hp_cd_join_clean <- hp_cd_join %>% filter(!Suburb %in% "MELBOURNE"| Incidents_Total<15000)
#dimension
dim(hp_cd_join_clean)
## [1] 306 5
#glimpse structure
glimpse(hp_cd_join_clean)
## Rows: 306
## Columns: 5
## Groups: Suburb [303]
## $ Regionname <ord> North, South, West, South, North, South, South, Sou...
## $ Suburb <fct> ABBOTSFORD, AIRPORT WEST, ALBERT PARK, ALBION, ALPH...
## $ Median_Price <dbl> 1087500, 790000, 1950000, 635000, 1597500, 665000, ...
## $ Med_Dist2CBD <dbl> 3.0, 10.4, 3.0, 10.5, 5.7, 15.5, 9.4, 12.8, 6.3, 4....
## $ Incidents_Total <dbl> 1074, 732, 556, 433, 237, 647, 833, 326, 465, 1057,...
Boxplot of Crime incidents after excluding the outlier.
boxplot(hp_cd_join_clean$Incidents_Total ~ hp_cd_join_clean$Regionname , main = "Total Crime Incidents",
ylab = "Total Incidents", xlab = "Metropolitan Regions")
Inspite of excluding the data we can see some outliers. This is possible as these are regions where the crime rate is exceptionally high and possibly might be affecting the house prices as well.
This can be further analysed with Multivariate detection methods.
Multivariate outlier detection:
Median house price(AU$) Vs Distance to Melbourne CBD(kms) Let’s examine the relation bewteen a suburb’s distance to CBD and the house prices. People do prefer to be closer to the city for ease of commuting to work place and the convininces of a city life. However, people also prefer a quite suburb setting.Can the house price data reveal if this is true ? To examine this we will plot the Median House Prices[Median_Price] vs Distance to Melbourne CBD[Med_Dist2CBD] :
#
plot(hp_cd_join_clean$Median_Price ~hp_cd_join_clean$Med_Dist2CBD,
main = "Median house price(AU$) Vs Distance to Melbourne CBD(kms)", ylab = "Price (AU$)",
xlab = "Distance (kms)")
The closer the suburb is to the CBD, the higher the price is observed. Although the suburbs that are almost 20 kms away too have a similar price structure but there may be other factors at play such as number of bedrooms, age of property etc.. We can say that further the house is from CBD the price is much lower. Whether this is a linear relation or not is a separate analysis topic.
Median house price(AU$) Vs Total Crime Incidents(nos)
Now, lets analyse the Median house price vs total crime
plot(hp_cd_join_clean$Median_Price ~hp_cd_join_clean$Incidents_Total,
main = "Median house price(AU$) Vs Total Crime Incidents(nos)", ylab = "Price (AU$)",
xlab = "Incidents (nos)")
What is noteworthy here is that, the suburb with the high sales price and most sales have low crime incidents. This might be an indication that buyers do consider the crime rate of a suburb prior to purchasing a house.
Outliers are observed and lets examine them:
hp_cd_join_clean[hp_cd_join_clean$Incidents_Total > 5000,]
## # A tibble: 3 x 5
## # Groups: Suburb [3]
## Regionname Suburb Median_Price Med_Dist2CBD Incidents_Total
## <ord> <fct> <dbl> <dbl> <dbl>
## 1 South-East DANDENONG 570000 24.7 6019
## 2 South-East FRANKSTON 640000 38 5925
## 3 <NA> <NA> NA NA NA
The suburbs that are showing as outliers have a high crime rate. these might be suburbs were the crime rate is exceptionally high and it is further away from CBD as well. The house price is low too. We will not exclude this data as it is relavnt to the analyse topic to check for effect of crime incidents on housing price. The NA in the output above is for the WEST FOOTSCRAY observation that we allowed.
Transformation is applied to the Median house price[Median_Price], Median distance[Med_Dist2CBD] and Incidents[Incidents_Total] seperately to reduce the skewness:
hp_cd_join_clean$Median_Price %>% hist(main="Original Median Price", ylab = "",
xlab = "Median price") # right skewed
log10(hp_cd_join_clean$Median_Price) %>% hist(main="Log10 Transformation", ylab = "",
xlab = "log10(Median Price)")
# This is the R chunk for the Transform Section
hp_cd_join_clean$Med_Dist2CBD %>% hist(main="Original Median distance", ylab = "",
xlab = "Median Distance") # right skewed
sqrt(hp_cd_join_clean$Med_Dist2CBD) %>% hist(main="Square root Transformation", ylab = "",
xlab = "SQRT(Median Distance)")
# This is the R chunk for the Transform Section
par(mfrow=c(2,2))
hp_cd_join$Incidents_Total %>% hist(main="Original Incidents", ylab = "",
xlab = "Total Incidents" ) # left skewed
log10(hp_cd_join$Incidents_Total) %>% hist(main="Log10 Transformation", ylab = "",
xlab = "log10(Total Incidents)")
(hp_cd_join$Incidents_Total)^-1 %>% hist(main="Reciprocal Transformation", ylab = "",
xlab = "(Total Incidents)^(-1)")
Apply the above transformations to the variables:
# create another dataset
hp_cd_transformed <- hp_cd_join_clean
# glimpse
glimpse(hp_cd_transformed)
## Rows: 306
## Columns: 5
## Groups: Suburb [303]
## $ Regionname <ord> North, South, West, South, North, South, South, Sou...
## $ Suburb <fct> ABBOTSFORD, AIRPORT WEST, ALBERT PARK, ALBION, ALPH...
## $ Median_Price <dbl> 1087500, 790000, 1950000, 635000, 1597500, 665000, ...
## $ Med_Dist2CBD <dbl> 3.0, 10.4, 3.0, 10.5, 5.7, 15.5, 9.4, 12.8, 6.3, 4....
## $ Incidents_Total <dbl> 1074, 732, 556, 433, 237, 647, 833, 326, 465, 1057,...
# Transformed Median price
hp_cd_transformed$Median_Price <- log10(hp_cd_join_clean$Median_Price)
# View original variable
par(mfrow=c(1,2))
hist(hp_cd_join_clean$Median_Price, main="Original Median Price", ylab = "", xlab = "Median price")
# View transformed variable
hist(hp_cd_transformed$Median_Price, main="Transformed Median Price", ylab = "", xlab = "Median price")