library(dplyr)
library(tidyr)
library(readr)
library(rvest)
library(stringr)
library(lubridate)
library(forecast)
For this project, open source data on Airbnb, along with data from a Wikipedia table on Local Government Areas (LGA) were used for the purpose of data pre-processing. The following steps were taken:
The main dataset used for this project is the “Melbourne Airbnb Open Data” as hosted on Kaggle by Tyler Xie (https://www.kaggle.com/tylerx/melbourne-airbnb-open-data). The data was gathered from the Inside Airbnb database and compiled by Tyler Xie into a CSV file. There are 96 variables for the Airbnb dataset, however only 12 variables were chosen for the purpose of this project:
Below, the Airbnb dataset is imported and subset according to the varibles chosen above:
airbnb_orig <- read.csv("listings_dec18.csv", stringsAsFactors = F)
airbnb <- airbnb_orig[,c("id","host_since","street","neighbourhood_cleansed","property_type","accommodates",
"bathrooms","bedrooms","price","guests_included","number_of_reviews", "instant_bookable")]
head(airbnb)
The second dataset is scraped from a table off the Wikipedia article “Local Government Areas of Victoria” (https://en.wikipedia.org/wiki/Local_government_areas_of_Victoria#Greater_Melbourne). Its variables contain information on land size and population for each LGA. Only 3 variables were chosen out of the 11 variables, which were then renamed:
Below, the Wikipedia table is scraped and imported using the ‘rvest’ package, and then subsetted with the chosen variables above. The variables are then renamed, while the LGA observations are edited to remove the strings “City of” or “Shire of” in front of the LGA names. This is so that the LGA variable will match with the ‘neighbourhood_cleansed’ variable in the Airbnb dataset so that the two datasets may be joined in the next step:
LGA_wiki <- read_html("https://en.wikipedia.org/wiki/Local_government_areas_of_Victoria#Greater_Melbourne")
length(html_nodes(LGA_wiki, "table"))
## [1] 10
LGA <- html_table(html_nodes(LGA_wiki, "table") [[1]], fill=T)
head(LGA)
LGA <- LGA[2:32,c(1,5,7)]
colnames(LGA) <- c("LGA","LGA_Area_km2","LGA_Density")
LGA <- LGA %>%
mutate(LGA = ifelse(str_detect(LGA, "Shire of"), str_replace(LGA, pattern="Shire of ", replacement=""),
ifelse(str_detect(LGA, "City of"), str_replace(LGA, pattern="City of ", replacement=""), "Error")))
head(LGA)
The two datasets are joined below, where Airbnb’s ‘neighbourhood_cleansed’ is equal to LGA’s ‘LGA’, to form a dataset with 14 variables:
airbnb_LGA <- left_join(airbnb, LGA, by = c("neighbourhood_cleansed" = "LGA"))
head(airbnb_LGA)
dim(airbnb_LGA)
## [1] 22895 14
str(airbnb_LGA)
## 'data.frame': 22895 obs. of 14 variables:
## $ id : int 9835 10803 12936 15246 16760 38271 41836 43414 43429 44082 ...
## $ host_since : chr "2009-08-21" "2009-09-16" "2009-10-31" "2009-12-03" ...
## $ street : chr "Bulleen, VIC, Australia" "Brunswick East, VIC, Australia" "St Kilda, VIC, Australia" "Thornbury, VIC, Australia" ...
## $ neighbourhood_cleansed: chr "Manningham" "Moreland" "Port Phillip" "Darebin" ...
## $ property_type : chr "House" "Apartment" "Apartment" "House" ...
## $ accommodates : int 2 2 3 1 2 5 2 2 2 2 ...
## $ bathrooms : num 1 1 1 1.5 1 1 1 1 1 1 ...
## $ bedrooms : int 1 1 1 1 1 3 1 1 1 1 ...
## $ price : chr "$60.00" "$35.00" "$159.00" "$50.00" ...
## $ guests_included : int 1 1 2 1 1 1 2 1 1 1 ...
## $ number_of_reviews : int 4 108 15 29 63 112 159 3 168 28 ...
## $ instant_bookable : chr "f" "t" "f" "f" ...
## $ LGA_Area_km2 : chr "113" "51" "21" "53" ...
## $ LGA_Density : chr "1,057" "3,270" "5,101" "2,847" ...
As seen above, the imported data structure is a data frame, however the majority of the variables are not the correct variable type. Before performing data type conversions, some of the variables will be renamed to make more sense:
colnames(airbnb_LGA)[4] <- "LGA"
colnames(airbnb_LGA)[3] <- "location"
head(airbnb_LGA)
Below, the following conversions take place:
For the last 3 variables, readr’s parse_number function is used to remove dollar signs and commas in order to convert the characters into numerics:
airbnb_LGA$id <- as.factor(airbnb_LGA$id)
airbnb_LGA$host_since <- ymd(airbnb_LGA$host_since)
airbnb_LGA$property_type <- factor(airbnb_LGA$property_type)
airbnb_LGA$instant_bookable <- factor(airbnb_LGA$instant_bookable)
airbnb_LGA$LGA <- factor(airbnb_LGA$LGA)
unique(airbnb_LGA$accommodates)
## [1] 2 3 1 5 4 7 10 6 8 16 9 13 11 12 14 15
airbnb_LGA$accommodates <- factor(airbnb_LGA$accommodates,
levels = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15"),
ordered = T)
levels(airbnb_LGA$accommodates)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15"
unique(airbnb_LGA$bathrooms)
## [1] 1.0 1.5 2.5 2.0 NA 3.0 6.0 0.5 0.0 5.0 8.0 4.0 3.5 4.5
## [15] 7.0 14.0 9.0 7.5 5.5 6.5 10.0
airbnb_LGA$bathrooms <- factor(airbnb_LGA$bathrooms,
levels = c("0","0.5","1","1.5","2","2.5","3","3.5","4","4.5","5","5.5",
"6","6.5","7","7.5","8","9","10","14"),
ordered = T)
levels(airbnb_LGA$bathrooms)
## [1] "0" "0.5" "1" "1.5" "2" "2.5" "3" "3.5" "4" "4.5" "5"
## [12] "5.5" "6" "6.5" "7" "7.5" "8" "9" "10" "14"
unique(airbnb_LGA$bedrooms)
## [1] 1 3 2 4 5 10 0 6 7 9 NA 16 8 11
airbnb_LGA$bedrooms <- factor(airbnb_LGA$bedrooms,
levels = c("0","1","2","3","4","5","6","7","8","9","10","11"),
ordered = T)
levels(airbnb_LGA$bedrooms)
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
unique(airbnb_LGA$guests_included)
## [1] 1 2 3 6 4 8 12 5 9 7 16 10 11 15 14
airbnb_LGA$guests_included <- factor(airbnb_LGA$guests_included,
levels = c("1","2","3","4","5","6","7","8","9","10","11","12","14","15","16"),
ordered = T)
levels(airbnb_LGA$guests_included)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "14" "15"
## [15] "16"
airbnb_LGA$price <- parse_number(airbnb_LGA$price)
airbnb_LGA$LGA_Area_km2 <- parse_number(airbnb_LGA$LGA_Area_km2)
airbnb_LGA$LGA_Density <- parse_number(airbnb_LGA$LGA_Density)
The following is the dataset after the variable type conversions:
str(airbnb_LGA)
## 'data.frame': 22895 obs. of 14 variables:
## $ id : Factor w/ 22895 levels "9835","10803",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ host_since : Date, format: "2009-08-21" "2009-09-16" ...
## $ location : chr "Bulleen, VIC, Australia" "Brunswick East, VIC, Australia" "St Kilda, VIC, Australia" "Thornbury, VIC, Australia" ...
## $ LGA : Factor w/ 30 levels "Banyule","Bayside",..: 15 22 24 7 24 6 7 18 20 8 ...
## $ property_type : Factor w/ 35 levels "Aparthotel","Apartment",..: 22 2 2 22 22 2 22 2 2 22 ...
## $ accommodates : Ord.factor w/ 15 levels "1"<"2"<"3"<"4"<..: 2 2 3 1 2 5 2 2 2 2 ...
## $ bathrooms : Ord.factor w/ 20 levels "0"<"0.5"<"1"<..: 3 3 3 4 3 3 3 3 3 3 ...
## $ bedrooms : Ord.factor w/ 12 levels "0"<"1"<"2"<"3"<..: 2 2 2 2 2 4 2 2 2 2 ...
## $ price : num 60 35 159 50 69 98 50 100 98 60 ...
## $ guests_included : Ord.factor w/ 15 levels "1"<"2"<"3"<"4"<..: 1 1 2 1 1 1 2 1 1 1 ...
## $ number_of_reviews: int 4 108 15 29 63 112 159 3 168 28 ...
## $ instant_bookable : Factor w/ 2 levels "f","t": 1 2 1 1 1 1 1 1 2 1 ...
## $ LGA_Area_km2 : num 113 51 21 53 21 410 53 37 82 131 ...
## $ LGA_Density : num 1057 3270 5101 2847 5101 ...
The dataset is in a tidy format since the following apply:
Below, two new variables are created via the mutate function (from dplyr). The variable ‘price_per_guest’ is created by dividing the price (‘price’) by the number of guests included (‘guest_included’). As ‘guests_included’ was converted to a factor earlier, it is temporarily coerced into a numeric in order for the arithmetic to work. The variable ‘LGA_population’ is created by multiplying the LGA area (LGA_Area_km2) with the LGA population density (LGA_Density):
airbnb_LGA <- airbnb_LGA %>%
mutate(price_per_guest = round(price/as.numeric(guests_included), 2),
LGA_population = LGA_Area_km2*LGA_Density)
head(airbnb_LGA)
airbnb_LGA <- airbnb_LGA %>% separate(location, into = c("Suburb", "State", "Country"), sep = ",")
## Warning: Expected 3 pieces. Additional pieces discarded in 49 rows [424,
## 767, 768, 770, 779, 790, 809, 1196, 1290, 1338, 1407, 1434, 1494, 1585,
## 1864, 2034, 2899, 4220, 6779, 6939, ...].
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 61 rows
## [1213, 2540, 5363, 6329, 7410, 15039, 16158, 16662, 16663, 16673, 16674,
## 16840, 16868, 17167, 17604, 17605, 17634, 17637, 17638, 17767, ...].
head(airbnb_LGA)
Below, the sapply function is used to scan all variables in the dataset for NA values:
sapply(airbnb_LGA, function(x) sum(is.na(x)))
## id host_since Suburb State
## 0 3 0 0
## Country LGA property_type accommodates
## 61 0 0 81
## bathrooms bedrooms price guests_included
## 17 7 0 0
## number_of_reviews instant_bookable LGA_Area_km2 LGA_Density
## 0 0 0 0
## price_per_guest LGA_population
## 0 0
sum(is.na(airbnb_LGA))
## [1] 169
As shown above, there are 108 NA values in the entire dataset, which is a very small number compared to the total number of observations, hence these NA observations can be removed:
airbnb_LGA <- na.omit(airbnb_LGA)
sapply(airbnb_LGA, function(x) sum(is.na(x)))
## id host_since Suburb State
## 0 0 0 0
## Country LGA property_type accommodates
## 0 0 0 0
## bathrooms bedrooms price guests_included
## 0 0 0 0
## number_of_reviews instant_bookable LGA_Area_km2 LGA_Density
## 0 0 0 0
## price_per_guest LGA_population
## 0 0
Below is a function to scan all numerical variables for special values such as infinite numbers or NaNs, which is applied to the entire dataset. The results indicate that there are no special values in this dataset:
special <- function(x){if (is.numeric(x)) sum(is.infinite(x) | is.nan(x))}
sapply(airbnb_LGA, special)
## $id
## NULL
##
## $host_since
## NULL
##
## $Suburb
## NULL
##
## $State
## NULL
##
## $Country
## NULL
##
## $LGA
## NULL
##
## $property_type
## NULL
##
## $accommodates
## NULL
##
## $bathrooms
## NULL
##
## $bedrooms
## NULL
##
## $price
## [1] 0
##
## $guests_included
## NULL
##
## $number_of_reviews
## [1] 0
##
## $instant_bookable
## NULL
##
## $LGA_Area_km2
## [1] 0
##
## $LGA_Density
## [1] 0
##
## $price_per_guest
## [1] 0
##
## $LGA_population
## [1] 0
Check inconsistency from separated variable, “Suburb”, “State”, and “Country”
unique (airbnb_LGA$Suburb)
## [1] "Bulleen" "Brunswick East"
## [3] "St Kilda" "Thornbury"
## [5] "St Kilda East" "Berwick"
## [7] "Reservoir" "East Melbourne"
## [9] "Oakleigh East" "Frankston"
## [11] "Richmond" "Melbourne"
## [13] "Brighton East" "Kew"
## [15] "Fitzroy" "Hampton"
## [17] "Clifton Hill" "Mordialloc"
## [19] "Northcote" "Port Melbourne"
## [21] "South Melbourne" "Prahran"
## [23] "Saint Kilda East" "Parkdale"
## [25] "Ringwood East" "Parkville"
## [27] "Yarraville" "Belgrave"
## [29] "Chum Creek" "Ivanhoe"
## [31] "Elwood" "Mount Waverley"
## [33] "Middle Park" "Hawthorn East"
## [35] "Carlton" "Caulfield North"
## [37] "Montmorency" "Melbourne "
## [39] "Hadfield" "North Melbourne"
## [41] "Albert Park" "Collingwood"
## [43] "South Yarra" "Windsor"
## [45] "Blackburn" "Menzies Creek"
## [47] "Derrimut" "Oakleigh South"
## [49] "Abbotsford" "South Kingsville"
## [51] "Armadale" "Southbank"
## [53] "Langwarrin" "Warneet"
## [55] "Truganina" "Balaclava"
## [57] "Heidelberg" "Brunswick"
## [59] "Caulfield" "Elsternwick"
## [61] "Kooyong" "Malvern East"
## [63] "Flemington" "Balwyn North"
## [65] "Point Cook" "Maidstone"
## [67] "Preston" "Brighton"
## [69] "Sandringham" "East St. Kilda"
## [71] "Rosanna" "Caulfield South"
## [73] "Coburg" "Blackburn South"
## [75] "Werribee" "Yarra Glen"
## [77] "Murrumbeena" "St Kilda West"
## [79] "Hawthorn" "Ashburton"
## [81] "Sherbrooke" "Spotswood"
## [83] "Hughesdale" "Dandenong"
## [85] "Cottles Bridge" "Altona"
## [87] "Viewbank" "Aspendale"
## [89] "Seddon" "Alphington"
## [91] "Burnley" "Olinda"
## [93] "Fairfield" "Manningham"
## [95] "Essendon" "Rowville"
## [97] "Healesville" "Ringwood North"
## [99] "Fitzroy North" "Badger Creek"
## [101] "Diamond Creek" "Toorak"
## [103] "Clayton" "Docklands"
## [105] "Glen Iris" "Highett"
## [107] "Greensborough" "Footscray"
## [109] "Beaumaris" "Forest Hill"
## [111] "Ormond" "Ripponlea"
## [113] "Gardenvale" "Doncaster East"
## [115] "Yarra Junction" "Macclesfield"
## [117] "West Melbourne" "St. Kilda"
## [119] "West Footscray" "North Warrandyte"
## [121] "Heidelberg Heights" "Upwey"
## [123] "Saint Kilda" "Croydon North"
## [125] "Lilydale" "Epping"
## [127] "St Kilda / Elwood" "Bayswater"
## [129] "Ashwood" "Sandhurst"
## [131] "Bentleigh" "Warrandyte"
## [133] "McKinnon" "Langwarrin South"
## [135] "Pascoe Vale South" "Newport"
## [137] "Cremorne" "Nunawading"
## [139] "Ripponlea (East St Kilda)" "Brunswick / Melbourne"
## [141] "Cheltenham" "Gembrook"
## [143] "Eltham North" "Bentleigh East"
## [145] "Box Hill" "Box Hill North"
## [147] "Carnegie" "Maribyrnong"
## [149] "Balwyn" "Glen Waverley"
## [151] "Glen Huntly" "Princes Hill"
## [153] "Warburton" "Moonee Ponds"
## [155] "East Warburton" "Launching Place"
## [157] "Chirnside Park" "Narre Warren South"
## [159] "Seaford" "Braybrook"
## [161] "Sassafras" "Brunswick West"
## [163] "Research" "The Basin"
## [165] "Kensington" "Dixons Creek"
## [167] "Kalorama" "Wattle Glen"
## [169] "Williamstown" "Melbourne (Eltham)"
## [171] "Oak Park" "Camberwell"
## [173] "Dandenong Ranges" "芒特韦弗å\210©åŒº"
## [175] "Croydon South" "Wantirna South"
## [177] "Smiths Gully" "Frankston South"
## [179] "Mount Dandenong" "Black Rock"
## [181] "Cockatoo" "Oakleigh"
## [183] "Glenroy" "Kilsyth South"
## [185] "Burwood" "Ascot Vale"
## [187] "Ivanhoe (Melbourne)" "Kew East"
## [189] "Lalor" "Emerald"
## [191] "FITZROY" "Gladstone Park"
## [193] "Malvern" "Westmeadows"
## [195] "Saint Kilda West" "Carrum Downs"
## [197] "Millgrove" "Boronia"
## [199] "Carlton North" "Carlton South"
## [201] "Hampton East" "Fawkner"
## [203] "Tonimbuk" "Huntingdale"
## [205] "brunswick" "Mount Toolebewong"
## [207] "墨尔本" "Victoria"
## [209] "North Fitzroy" "Warrandyte South"
## [211] "Ringwood" "Mt Waverley"
## [213] "Surrey Hills" "Coburg North"
## [215] "Broadmeadows " "Diggers Rest"
## [217] "Yellingbo" "Heidelberg West"
## [219] "Strathmore" "Caulfield East"
## [221] "Patterson Lakes" "Keysborough"
## [223] "Tarneit" "Don Valley"
## [225] "Burwood East" "HEIDELBERG WEST"
## [227] "Brunswick " "Wonga Park"
## [229] "Eaglemont" "Delahey"
## [231] "Cranbourne West" "Canterbury"
## [233] "Kingsville" "Eltham"
## [235] "Middle Park Melbourne" "Moorabbin"
## [237] "St Kilda East " "Belgrave Heights"
## [239] "Seabrook" "Ferny Creek"
## [241] "Lyndhurst" "Narre Warren"
## [243] "Mont Albert North" "Clyde North"
## [245] "Park Orchards" "Mentone"
## [247] "Donvale" "Gladysdale"
## [249] "McMahons Creek" "Gruyere"
## [251] "Wantirna" "Broadmeadows"
## [253] "Caroline Springs" "Ivanhoe East"
## [255] "Williams Landing" "Aberfeldie"
## [257] "Mitcham" "Sunshine West"
## [259] "St Andrews" "Mt Dandenong"
## [261] "Hurstbridge" "Croydon"
## [263] "Travancore" "Taylors Hill"
## [265] "Southbank Melbourne " "Pakenham "
## [267] "Kangaroo ground" "Wandin North"
## [269] "Cranbourne South" "Seville"
## [271] "Craigieburn" "Toolangi"
## [273] "Noble Park" "Beaconsfield"
## [275] "Chelsea" "Yering"
## [277] "Edithvale" "Notting Hill"
## [279] "Springvale South" "Hoppers Crossing"
## [281] "Christmas Hills" "Doncaster"
## [283] "Kangaroo Ground" "Bulla"
## [285] "Bundoora" "Mount Evelyn"
## [287] "Airport West" "Kealba"
## [289] "Blackburn North" "St Kilda West Melbourne"
## [291] "Tullamarine" "Vermont South"
## [293] "Yuroke" "HEIDELBERG"
## [295] "Endeavour Hills" "Vermont"
## [297] "Kallista" "Carlton "
## [299] " South Yarra" "Dandenong South"
## [301] "Cranbourne North" "Guys Hill"
## [303] "The Patch" "Altona Meadows"
## [305] "Wesburn" "Kilsyth"
## [307] "Essendon West" "Werribee South"
## [309] "Heatherton" "Armadale North"
## [311] "Sunshine" "Beaconsfield Upper"
## [313] "Lower Plenty" "StKilda East"
## [315] "Keilor Park" "Niddrie"
## [317] "Macleod" "clayton"
## [319] "Keilor East" "Clayton South"
## [321] "Knoxfield" "Greenvale"
## [323] "Eynesbury" "Doreen"
## [325] "Carrum" "Mulgrave"
## [327] "St. Kilda West" "VIC"
## [329] "Chadstone" "Sunbury"
## [331] "Praharan " "Narre Warren North"
## [333] "Mlebourne" "Wheelers Hill"
## [335] "Officer" "Mooroolbark"
## [337] "Prahran / Toorak " "Mont Albert"
## [339] "Gowanbrae" "Wandin East"
## [341] "Pakenham" "Saint Helena"
## [343] "Chelsea Heights" "BRUNSWICK EAST"
## [345] "Noble Park North" "Box Hill South"
## [347] "Yarra Valley" "Saint Andrews"
## [349] "Altona North" "Keilor Downs"
## [351] "St.Kilda East" "Sydenham"
## [353] "Bonbeach" "STRATHTULLOH"
## [355] "STRTHTULLOCH" "Coburg (Melbourne)"
## [357] "City of Port Phillip" "Lysterfield"
## [359] "Bangholme" "Seaholme"
## [361] "Templestowe Lower" "Keilor"
## [363] "Templestowe" "Woori Yallock"
## [365] "Warranwood" "Cranbourne East"
## [367] "Briar Hill" "Ferntree Gully"
## [369] "eaglemont " "Kurunjang"
## [371] "Springvale" "Saint Albans"
## [373] "Clarinda" "East Doncaster"
## [375] "Steels Creek" "Little River"
## [377] "Hallam" "South Melbourne "
## [379] "North Fitzroy " "Meadow Heights"
## [381] "Tecoma" "Port Melbourne "
## [383] "Chum Creek/Healesville" "Bayswater North"
## [385] "St kilda" "Panton Hill"
## [387] "melbourne" "clayton south"
## [389] "st kilda" "Kingsbury"
## [391] "Thomastown" "Brimbank City"
## [393] "Wyndham Vale" "Pascoe Vale"
## [395] "Moreland City" "Heathmont"
## [397] "Healesville " "Docklands "
## [399] "Seville East" "Montrose"
## [401] "Nar Nar Goon North" "South Morang"
## [403] "St Kilda " "Mill Park"
## [405] "Mcmahons Creek" "Hoddles Creek"
## [407] "Pearcedale" "Brunswick East "
## [409] "Mernda" "Strathewen"
## [411] "Oaklands Junction" "Cairnlea"
## [413] "Wollert" "Roxburgh Park"
## [415] "Deer Park" "Brookfield"
## [417] "Williamstown North" "Plumpton"
## [419] "Bellfield" "Donnybrook"
## [421] "East st kilda " "St. Andrews"
## [423] "Braybrook " "Cranbourne"
## [425] "Harkaway" "Hawthorn east"
## [427] "Melton South" "Keilor Lodge"
## [429] "Pakenham Upper" "Laverton"
## [431] "Essendon " "Silvan"
## [433] "Doveton" "Watsonia"
## [435] "Beveridge" "Essendon North"
## [437] "Selby" "Richmond "
## [439] "Cora Lynn" "Hampton Park"
## [441] "Footscray " "Lynbrook"
## [443] "Melton West" "Whittlesea"
## [445] "Aspendale Gardens" "Scoresby"
## [447] "Monbulk" "Dandenong North"
## [449] "Mickleham" "Avondale Heights"
## [451] "Coldstream" "Bunyip"
## [453] "Iona" "墨ç\210¾æœ¬ Melbourne"
## [455] "Sunshine North" "Dingley Village"
## [457] "Tremont" "Humevale"
## [459] "Powelltown" "Rockbank"
## [461] "Port Phillip City" " Melbourne"
## [463] "Frankston North" "Wyndham City"
## [465] "St kilda " "Mount Eliza"
## [467] "Albert Park. Melbourne " "South Wharf"
## [469] "Upper Ferntree Gully" "Melbourne City"
## [471] "Gilderoy" "West Melbourne - flagstaff "
## [473] "Preston West" "Kinglake West"
## [475] "Albion" "Jacana"
## [477] "Blackburn " "Yarra Ranges Shire"
## [479] "Flemington " "St Kilda east"
## [481] "Cardinia Shire" "Saint Kilda Beach"
## [483] "Narbethong" "Clematis"
## [485] "Wheelers" "Waterways"
## [487] "Toolern Vale" "Tarrawarra"
## [489] "Bayles" "Whitehorse City"
## [491] "Mount Cottrell" "Burnside Heights"
## [493] "Tooradin" "Brooklyn"
## [495] "Hillside" "Eumemmerring"
## [497] "Doncaster,Melbourne" "Melbourne CBD"
## [499] "Big Pats Creek" "Ivanhoe "
## [501] "Burnside" "Eden Park"
## [503] "Melbourne VIC 3000" "Wheelers Hill VIC 3150"
## [505] "Wantirna south" "South Yarra VIC 3141"
## [507] "Doncaster VIC 3108" "Somerton VIC 3062"
## [509] "Kings Park" "Strathtulloh "
## [511] "Brunswick VIC 3056" "Wantirna South VIC 3152"
## [513] "Melbourne VIC 3004" "Melton South ( STRATHTULLOH)"
## [515] "Northcote South" "Watsons Creek"
## [517] "Oaklands junction " "Werribee "
## [519] "East Brunswick" "East Melbourne "
## [521] "Garfield North" "Officer South"
## [523] "Truganina " "Dockland "
## [525] "Botanic Ridge" "Melborne"
## [527] "Holmesglen" "Brunswick west"
## [529] "Abbotsford " "Heidelberg West "
## [531] "Dallas" "Brisbane"
## [533] "Manor Lakes" "Watsonia North"
## [535] "Albanvale"
change all observation to lower case using stringr function
airbnb_LGA$Suburb <- str_to_lower(airbnb_LGA$Suburb)
unique (airbnb_LGA$State)
## [1] " VIC" " Vic"
## [3] " vic" " Melbourne"
## [5] " 维多å\210©äºšå·ž" " Yarra Valley"
## [7] " Melbourne." "10minutes to Warburton"
## [9] " Yarra Glen" " VICTORIA"
## [11] " Victoria" " 维多å\210©äºš"
## [13] " St. Kilda " " St. Kilda"
## [15] " ç¶å¤šå\210©äºž VIC" " victoria"
## [17] " Melbourne VIC 3000" " Wheelers Hill VIC 3150"
## [19] " South Yarra VIC 3141" " NSW"
## [21] " Doncaster VIC 3108" " Somerton VIC 3062"
## [23] " Brunswick VIC 3056" " Wantirna South VIC 3152"
## [25] " Melbourne VIC 3004" " Australia"
## [27] " VI" " VICtoria"
## [29] " VIC 3008" " Victory"
## [31] " QLD" " Mel"
unique (airbnb_LGA$Country)
## [1] " Australia" " VIC" " Healesville" " AU"
## [5] " Victoria"
The actual airbnb location is in Victoria and Australia, so other than that will be changed to “VIC” and “AU”
airbnb_LGA$State <- replace(airbnb_LGA$State, airbnb_LGA$State != "VIC", "VIC")
airbnb_LGA$Country <- replace(airbnb_LGA$Country, airbnb_LGA$Country != "AU", "AU")
Using the editrules package, the following code sets some basic rules to scan for inconsistencies. The rules defined below are to make sure that no numerical value is a negative number. For the price variable however, the rule makes sure that no value is below 5 as for the purpose of this project, any Airbnb accommodation for less than $5 per night is considered a mistake or is not a legitimate listing. There are 22 violations under the rule number 1 (price >= 5), those of which can be seen below:
library(editrules)
rules <- editset(c("price >= 5",
"price_per_guest >= 0",
"number_of_reviews >= 0",
"LGA_Area_km2 >= 0",
"LGA_Density >= 0",
"LGA_population >= 0"))
rules
##
## Edit set:
## num1 : 5 <= price
## num2 : 0 <= price_per_guest
## num3 : 0 <= number_of_reviews
## num4 : 0 <= LGA_Area_km2
## num5 : 0 <= LGA_Density
## num6 : 0 <= LGA_population
summary(violatedEdits(rules,airbnb_LGA))
## Edit violations, 22729 observations, 0 completely missing (0%):
##
## editname freq rel
## num1 22 0.1%
##
## Edit violations per record:
##
## errors freq rel
## 0 22707 99.9%
## 1 22 0.1%
which(violatedEdits(rules,airbnb_LGA))
## [1] 9820 10054 10903 10906 10907 10908 10909 10911 10912 10913 10914
## [12] 11457 11497 11608 11789 12298 12302 12428 12493 12607 14480 17703
The observations with violated rules are removed, resulting in zero violated rules:
airbnb_LGA <- airbnb_LGA[-(which(violatedEdits(rules,airbnb_LGA))),]
summary(violatedEdits(rules,airbnb_LGA))
## No violations detected, 0 checks evaluated to NA
## NULL
Another inconsistency within the dataset is that some values of the guests_included variable are larger than the accommodates variable, which is impossible as ‘accommodates’ is the maximum number of guests allowed to stay in the accommodation. Below, it is shown that 108 observations have guests_included being larger than accommodates.
summary(as.numeric(airbnb_LGA$guests_included) > as.numeric(airbnb_LGA$accommodates))
## Mode FALSE TRUE
## logical 22599 108
which(as.numeric(airbnb_LGA$guests_included) > as.numeric(airbnb_LGA$accommodates))
## [1] 80 227 323 364 592 751 831 1608 1882 1904 2271
## [12] 2273 2725 2898 3064 3355 3396 3636 3751 3838 4083 4220
## [23] 4294 4640 4770 4927 5702 6251 6402 6405 6424 6850 7003
## [34] 7316 7910 8138 8441 8549 8705 9326 9357 10269 10283 10647
## [45] 10677 10722 10831 10920 10970 11393 11460 11598 11699 11934 12033
## [56] 12035 12079 12083 12112 12241 12443 12744 13153 13294 13349 13385
## [67] 13527 13549 13778 14027 14484 15131 15348 15583 16030 16249 16350
## [78] 16650 16778 17620 17839 17883 18019 18095 18156 18281 18394 18427
## [89] 18428 19441 19478 19934 20221 20413 20423 20782 20806 20821 21006
## [100] 21094 21494 21633 21664 21680 21826 21828 21834 22542
To deal with this, an ifelse statement is used to determine the observations with this inconsistency, which are then imputed with the the value from the accommodates variable:
airbnb_LGA <- airbnb_LGA %>%
mutate(guests_included = ifelse(as.numeric(guests_included) <= as.numeric(accommodates), guests_included, accommodates))
summary(as.numeric(airbnb_LGA$guests_included) > as.numeric(airbnb_LGA$accommodates))
## Mode FALSE
## logical 22707
Before scanning for outliers, an upper threshold will be applied to the price variable in order to remove listings with obvious pricing errors. A search on Airbnb listings in Victoria show that the most luxurious mansions list for no more than $3,500 per night, while a wedding reception accommodation is listed for around $5,000. Listings higher than $5,000 per night are usually small dwellings, which are the result of pricing errors or that they’re not legitimate listings. The following shows that the dataset contains prices that go up into the $12,000 range. Any listing with prices above $5,000 per night is removed:
max(airbnb_LGA$price)
## [1] 12624
which(airbnb_LGA$price > 5000)
## [1] 5596 7823 8070 19392
airbnb_LGA <- airbnb_LGA[-(which(airbnb_LGA$price > 5000)),]
max(airbnb_LGA$price)
## [1] 4445
In terms of scanning the numerical variables for outliers, ‘LGA_Area_km2’, ‘LGA_Density’, and ‘LGA_population’ will not be scanned since these values are tied to the factor variable ‘LGA’ which contain 30 factors, hence only giving 30 unique values for each of these variables. Any outliers resulting from these variables will not be removed or imputed as they simply demonstrate that some LGAs may have significantly lower or higher populations relative to each other. The numerical variables that will be scanned are ‘price’, ‘price_per_guest’, and ‘number_of_reviews’.
Below, it is shown that all 3 of the variables are right-skewed due to the fact that Airbnb prices and the number of reviews cannot be negative, but can be extremely large. Because of this, Tukey’s method (boxplot) will be used to detect outliers as it does not require the assumption of normality. It is shown that the number of outliers for each variable is quite large, with all of them making up more than five percent of the total observations. This means that imputation will be used to deal with these outliers as opposed to removal. It is also important to note that the following outliers probably do not represent errors. Under normal circumstances, these outliers will be left alone as they represent real and correct values that describe the distribution of Airbnb prices and number of reviews, however for the purpose of demonstration, they will be dealt with.
Price:
hist(airbnb_LGA$price, breaks = 200, main="Price Per Night", xlab="Price")
price_box <- airbnb_LGA$price %>%
boxplot(main="Price Per Night", ylab="Price", col = "lightgrey")
summary(airbnb_LGA$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12 71 109 145 165 4445
length(price_box$out)
## [1] 1397
length(price_box$out)/nrow(airbnb_LGA)
## [1] 0.06153372
Price Per Guest:
hist(airbnb_LGA$price_per_guest, breaks = 200, main="Price Per Guest", xlab="Price")
price_per_guest_box <- airbnb_LGA$price_per_guest %>%
boxplot(main="Price Per Guest", ylab="Price", col = "lightgrey")
summary(airbnb_LGA$price_per_guest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.86 46.50 68.00 102.86 116.00 4445.00
length(price_per_guest_box$out)
## [1] 1557
length(price_per_guest_box$out)/nrow(airbnb_LGA)
## [1] 0.06858124
Number of Reviews:
hist(airbnb_LGA$number_of_reviews, breaks = 200, main="Number of Reviews", xlab="Number")
number_of_reviews_box <- airbnb_LGA$number_of_reviews %>%
boxplot(main="Number of Reviews", ylab="Number", col = "lightgrey")
summary(airbnb_LGA$number_of_reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 5.00 21.34 24.00 538.00
length(number_of_reviews_box$out)
## [1] 2480
length(number_of_reviews_box$out)/nrow(airbnb_LGA)
## [1] 0.1092367
For this dataset, imputation is done via imputing the outliers with the median since the distributions are extremely skewed. Winsorising was considered, however it would’ve resulted in a large spike in the nearest neighbour of the distribution given the large number of outliers. The median method used also resulted in a large spike in the median of the distribution, so neither solution is ideal (later in the “Transform” section another technique will be explored: the variable ‘price_duplicate’ is created in the code below for later transformation). For each variable, the following code replaces the outliers found above (that were found using the boxplot function) with the median of its variable:
airbnb_LGA <- airbnb_LGA %>%
mutate(price_duplicate = price)
airbnb_LGA$price[which(airbnb_LGA$price < price_box$stats[1] | airbnb_LGA$price > price_box$stats[5])] <-
median(airbnb_LGA$price)
hist(airbnb_LGA$price, breaks = 200, main="Price - Imputed", xlab="Price")
summary(airbnb_LGA$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.0 71.0 109.0 116.7 150.0 306.0
airbnb_LGA$price_per_guest[which(airbnb_LGA$price_per_guest < price_per_guest_box$stats[1] | airbnb_LGA$price_per_guest > price_per_guest_box$stats[5])] <-
median(airbnb_LGA$price_per_guest)
hist(airbnb_LGA$price_per_guest, breaks = 200, main="Price Per Guest - Imputed", xlab="Price")
summary(airbnb_LGA$price_per_guest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.86 46.50 68.00 77.64 99.50 220.00
airbnb_LGA$number_of_reviews[which(airbnb_LGA$number_of_reviews < number_of_reviews_box$stats[1] | airbnb_LGA$number_of_reviews > number_of_reviews_box$stats[5])] <-
median(airbnb_LGA$number_of_reviews)
hist(airbnb_LGA$number_of_reviews, breaks = 200, main="Number of Reviews - Imputed", xlab="Price")
summary(airbnb_LGA$number_of_reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 5.000 9.563 12.000 58.000
In this section, the ‘price_duplicate’ variable is transformed and normalised in order to perform a z-score outlier detection. The following uses the ‘forecast’ package in order to perform a BoxCox transformation. Comparing the tranformed distribution with original distribution, the transformed distribution is much more normal:
hist(airbnb_LGA$price_duplicate, breaks = 100, main="Price - Original", xlab="Price")
price_boxcox <- BoxCox(airbnb_LGA$price_duplicate,lambda = "auto")
hist(price_boxcox, breaks = 100, main="Price - Transformed", xlab="Price")
Below, the tranformed distribution (price_boxcox) can be standardised given its approximately normal distribution. The number of outliers (values that exceed 3 standard deviations) are 88 in this case. Much less than Tukey’s method when applied to the non-transformed price variable (which had 1,399 outliers):
z_scores <- scale(price_boxcox, center = T, scale = T)
summary(z_scores)
## V1
## Min. :-4.49789
## 1st Qu.:-0.60448
## Median : 0.06852
## Mean : 0.00000
## 3rd Qu.: 0.64397
## Max. : 3.40560
length(which(abs(z_scores) > 3))
## [1] 89