Preprocessing Airbnb Melbourne Dataset

Required Packages

library(dplyr)
library(tidyr)
library(readr)
library(rvest)
library(stringr)
library(lubridate)
library(forecast)

Executive Summary

For this project, open source data on Airbnb, along with data from a Wikipedia table on Local Government Areas (LGA) were used for the purpose of data pre-processing. The following steps were taken:

Airbnb data was loaded and subset with the desired variables, while LGA data was scraped via Wikipedia and also subset with the desired variables. These two datasets were then relationally joined to form a new dataset.
Variables in the new dataset were renamed to better reflect their meanings, and were then converted to more appropriate variable types.
The dataset was briefly examined to check if it was in a ‘tidy’ format, in which it was hence no further action was taken regarding format.
New variables were then created from the existing variables and added to the dataset.
The entire dataset was scanned for NA values and other special values (infinite numbers and ‘Not a Number’ values), which were then removed according to appropriateness.
Rules were also defined and applied to check for any inconsistencies. For certain rule violations, observations were removed, while for other rule violations, observations were imputed.
The appropriate variables were scanned for outliers via Tukey’s method based on the distributions of the variables, which were then imputed due to the number of outliers that were found. Other methods were also briefly discussed.
For the last step, one variable was chosen to be transformed and normalised in order to apply a z-score outlier detection method.

Data

The main dataset used for this project is the “Melbourne Airbnb Open Data” as hosted on Kaggle by Tyler Xie (https://www.kaggle.com/tylerx/melbourne-airbnb-open-data). The data was gathered from the Inside Airbnb database and compiled by Tyler Xie into a CSV file. There are 96 variables for the Airbnb dataset, however only 12 variables were chosen for the purpose of this project:

id = Identification number for Airbnb listing.
host_since = The date when the host of the lisitng signed up to Airbnb as a host.
street = Street address.
neighbourhood_cleansed = Local government area.
property_type = The type of building (eg: House, Apartment, etc.)
accommodates = The number of guests the accommodation can accommodate.
bathrooms = The number of bathrooms.
bedrooms = The number of bedrooms.
price = Price per night.
guests_included = The number of guests included in the price per night.
number_of_reviews = The number of reviews the accommodation has.
instant_bookable = Whether the accommodation instantly bookable.

Below, the Airbnb dataset is imported and subset according to the varibles chosen above:

airbnb_orig <- read.csv("listings_dec18.csv", stringsAsFactors = F)
airbnb <- airbnb_orig[,c("id","host_since","street","neighbourhood_cleansed","property_type","accommodates",
                         "bathrooms","bedrooms","price","guests_included","number_of_reviews", "instant_bookable")]
head(airbnb)

The second dataset is scraped from a table off the Wikipedia article “Local Government Areas of Victoria” (https://en.wikipedia.org/wiki/Local_government_areas_of_Victoria#Greater_Melbourne). Its variables contain information on land size and population for each LGA. Only 3 variables were chosen out of the 11 variables, which were then renamed:

LGA = Local government area.
LGA_Area_km2 = The size of the LGA in kilometres squared.
LGA_Density = The population density per kilometre squared.

Below, the Wikipedia table is scraped and imported using the ‘rvest’ package, and then subsetted with the chosen variables above. The variables are then renamed, while the LGA observations are edited to remove the strings “City of” or “Shire of” in front of the LGA names. This is so that the LGA variable will match with the ‘neighbourhood_cleansed’ variable in the Airbnb dataset so that the two datasets may be joined in the next step:

LGA_wiki <- read_html("https://en.wikipedia.org/wiki/Local_government_areas_of_Victoria#Greater_Melbourne")
length(html_nodes(LGA_wiki, "table"))

## [1] 10

LGA <- html_table(html_nodes(LGA_wiki, "table") [[1]], fill=T)
head(LGA)

LGA <- LGA[2:32,c(1,5,7)]
colnames(LGA) <- c("LGA","LGA_Area_km2","LGA_Density")
LGA <- LGA %>% 
  mutate(LGA = ifelse(str_detect(LGA, "Shire of"), str_replace(LGA, pattern="Shire of ", replacement=""), 
               ifelse(str_detect(LGA, "City of"), str_replace(LGA, pattern="City of ", replacement=""), "Error")))
head(LGA)

The two datasets are joined below, where Airbnb’s ‘neighbourhood_cleansed’ is equal to LGA’s ‘LGA’, to form a dataset with 14 variables:

airbnb_LGA <- left_join(airbnb, LGA, by = c("neighbourhood_cleansed" = "LGA"))
head(airbnb_LGA)

dim(airbnb_LGA)

## [1] 22895    14

Understand

str(airbnb_LGA)

## 'data.frame':    22895 obs. of  14 variables:
##  $ id                    : int  9835 10803 12936 15246 16760 38271 41836 43414 43429 44082 ...
##  $ host_since            : chr  "2009-08-21" "2009-09-16" "2009-10-31" "2009-12-03" ...
##  $ street                : chr  "Bulleen, VIC, Australia" "Brunswick East, VIC, Australia" "St Kilda, VIC, Australia" "Thornbury, VIC, Australia" ...
##  $ neighbourhood_cleansed: chr  "Manningham" "Moreland" "Port Phillip" "Darebin" ...
##  $ property_type         : chr  "House" "Apartment" "Apartment" "House" ...
##  $ accommodates          : int  2 2 3 1 2 5 2 2 2 2 ...
##  $ bathrooms             : num  1 1 1 1.5 1 1 1 1 1 1 ...
##  $ bedrooms              : int  1 1 1 1 1 3 1 1 1 1 ...
##  $ price                 : chr  "$60.00" "$35.00" "$159.00" "$50.00" ...
##  $ guests_included       : int  1 1 2 1 1 1 2 1 1 1 ...
##  $ number_of_reviews     : int  4 108 15 29 63 112 159 3 168 28 ...
##  $ instant_bookable      : chr  "f" "t" "f" "f" ...
##  $ LGA_Area_km2          : chr  "113" "51" "21" "53" ...
##  $ LGA_Density           : chr  "1,057" "3,270" "5,101" "2,847" ...

As seen above, the imported data structure is a data frame, however the majority of the variables are not the correct variable type. Before performing data type conversions, some of the variables will be renamed to make more sense:

colnames(airbnb_LGA)[4] <- "LGA"
colnames(airbnb_LGA)[3] <- "location"
head(airbnb_LGA)

Below, the following conversions take place:

id: Numeric to Factor
host_since: Character to Date
property_type: Character to Factor
instant_bookable: Character to Factor
LGA: Character to Factor
accommodates: Character to Ordinal Factor
bathrooms: Character to Ordinal Factor
bedrooms: Character to Ordinal Factor
guests_included: Character to Ordinal Factor
price: Character to Numeric
LGA_Area_km2: Character to Numeric
LGA_Density: Character to Numeric

For the last 3 variables, readr’s parse_number function is used to remove dollar signs and commas in order to convert the characters into numerics:

airbnb_LGA$id <- as.factor(airbnb_LGA$id)
airbnb_LGA$host_since <- ymd(airbnb_LGA$host_since)
airbnb_LGA$property_type <- factor(airbnb_LGA$property_type)
airbnb_LGA$instant_bookable <- factor(airbnb_LGA$instant_bookable)
airbnb_LGA$LGA <- factor(airbnb_LGA$LGA)

unique(airbnb_LGA$accommodates)

##  [1]  2  3  1  5  4  7 10  6  8 16  9 13 11 12 14 15

airbnb_LGA$accommodates <- factor(airbnb_LGA$accommodates,
                                  levels = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15"),
                                  ordered = T)
levels(airbnb_LGA$accommodates)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15"

unique(airbnb_LGA$bathrooms)

##  [1]  1.0  1.5  2.5  2.0   NA  3.0  6.0  0.5  0.0  5.0  8.0  4.0  3.5  4.5
## [15]  7.0 14.0  9.0  7.5  5.5  6.5 10.0

airbnb_LGA$bathrooms <- factor(airbnb_LGA$bathrooms,
                                  levels = c("0","0.5","1","1.5","2","2.5","3","3.5","4","4.5","5","5.5",
                                             "6","6.5","7","7.5","8","9","10","14"),
                                  ordered = T)
levels(airbnb_LGA$bathrooms)

##  [1] "0"   "0.5" "1"   "1.5" "2"   "2.5" "3"   "3.5" "4"   "4.5" "5"  
## [12] "5.5" "6"   "6.5" "7"   "7.5" "8"   "9"   "10"  "14"

unique(airbnb_LGA$bedrooms)

##  [1]  1  3  2  4  5 10  0  6  7  9 NA 16  8 11

airbnb_LGA$bedrooms <- factor(airbnb_LGA$bedrooms,
                               levels = c("0","1","2","3","4","5","6","7","8","9","10","11"),
                               ordered = T)
levels(airbnb_LGA$bedrooms)

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"

unique(airbnb_LGA$guests_included)

##  [1]  1  2  3  6  4  8 12  5  9  7 16 10 11 15 14

airbnb_LGA$guests_included <- factor(airbnb_LGA$guests_included,
                              levels = c("1","2","3","4","5","6","7","8","9","10","11","12","14","15","16"),
                              ordered = T)
levels(airbnb_LGA$guests_included)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "14" "15"
## [15] "16"

airbnb_LGA$price <- parse_number(airbnb_LGA$price)
airbnb_LGA$LGA_Area_km2 <- parse_number(airbnb_LGA$LGA_Area_km2)
airbnb_LGA$LGA_Density <- parse_number(airbnb_LGA$LGA_Density)

The following is the dataset after the variable type conversions:

str(airbnb_LGA)

## 'data.frame':    22895 obs. of  14 variables:
##  $ id               : Factor w/ 22895 levels "9835","10803",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ host_since       : Date, format: "2009-08-21" "2009-09-16" ...
##  $ location         : chr  "Bulleen, VIC, Australia" "Brunswick East, VIC, Australia" "St Kilda, VIC, Australia" "Thornbury, VIC, Australia" ...
##  $ LGA              : Factor w/ 30 levels "Banyule","Bayside",..: 15 22 24 7 24 6 7 18 20 8 ...
##  $ property_type    : Factor w/ 35 levels "Aparthotel","Apartment",..: 22 2 2 22 22 2 22 2 2 22 ...
##  $ accommodates     : Ord.factor w/ 15 levels "1"<"2"<"3"<"4"<..: 2 2 3 1 2 5 2 2 2 2 ...
##  $ bathrooms        : Ord.factor w/ 20 levels "0"<"0.5"<"1"<..: 3 3 3 4 3 3 3 3 3 3 ...
##  $ bedrooms         : Ord.factor w/ 12 levels "0"<"1"<"2"<"3"<..: 2 2 2 2 2 4 2 2 2 2 ...
##  $ price            : num  60 35 159 50 69 98 50 100 98 60 ...
##  $ guests_included  : Ord.factor w/ 15 levels "1"<"2"<"3"<"4"<..: 1 1 2 1 1 1 2 1 1 1 ...
##  $ number_of_reviews: int  4 108 15 29 63 112 159 3 168 28 ...
##  $ instant_bookable : Factor w/ 2 levels "f","t": 1 2 1 1 1 1 1 1 2 1 ...
##  $ LGA_Area_km2     : num  113 51 21 53 21 410 53 37 82 131 ...
##  $ LGA_Density      : num  1057 3270 5101 2847 5101 ...

Tidy & Manipulate Data I

The dataset is in a tidy format since the following apply:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

Tidy & Manipulate Data II

Below, two new variables are created via the mutate function (from dplyr). The variable ‘price_per_guest’ is created by dividing the price (‘price’) by the number of guests included (‘guest_included’). As ‘guests_included’ was converted to a factor earlier, it is temporarily coerced into a numeric in order for the arithmetic to work. The variable ‘LGA_population’ is created by multiplying the LGA area (LGA_Area_km2) with the LGA population density (LGA_Density):

airbnb_LGA <- airbnb_LGA %>%
  mutate(price_per_guest = round(price/as.numeric(guests_included), 2),
         LGA_population = LGA_Area_km2*LGA_Density)
head(airbnb_LGA)

airbnb_LGA <- airbnb_LGA %>% separate(location, into = c("Suburb", "State", "Country"), sep = ",")

## Warning: Expected 3 pieces. Additional pieces discarded in 49 rows [424,
## 767, 768, 770, 779, 790, 809, 1196, 1290, 1338, 1407, 1434, 1494, 1585,
## 1864, 2034, 2899, 4220, 6779, 6939, ...].

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 61 rows
## [1213, 2540, 5363, 6329, 7410, 15039, 16158, 16662, 16663, 16673, 16674,
## 16840, 16868, 17167, 17604, 17605, 17634, 17637, 17638, 17767, ...].

head(airbnb_LGA)

Scan I

Below, the sapply function is used to scan all variables in the dataset for NA values:

sapply(airbnb_LGA, function(x) sum(is.na(x)))

##                id        host_since            Suburb             State 
##                 0                 3                 0                 0 
##           Country               LGA     property_type      accommodates 
##                61                 0                 0                81 
##         bathrooms          bedrooms             price   guests_included 
##                17                 7                 0                 0 
## number_of_reviews  instant_bookable      LGA_Area_km2       LGA_Density 
##                 0                 0                 0                 0 
##   price_per_guest    LGA_population 
##                 0                 0

sum(is.na(airbnb_LGA))

## [1] 169

As shown above, there are 108 NA values in the entire dataset, which is a very small number compared to the total number of observations, hence these NA observations can be removed:

airbnb_LGA <- na.omit(airbnb_LGA)

sapply(airbnb_LGA, function(x) sum(is.na(x)))

##                id        host_since            Suburb             State 
##                 0                 0                 0                 0 
##           Country               LGA     property_type      accommodates 
##                 0                 0                 0                 0 
##         bathrooms          bedrooms             price   guests_included 
##                 0                 0                 0                 0 
## number_of_reviews  instant_bookable      LGA_Area_km2       LGA_Density 
##                 0                 0                 0                 0 
##   price_per_guest    LGA_population 
##                 0                 0

Below is a function to scan all numerical variables for special values such as infinite numbers or NaNs, which is applied to the entire dataset. The results indicate that there are no special values in this dataset:

special <- function(x){if (is.numeric(x)) sum(is.infinite(x) | is.nan(x))}
sapply(airbnb_LGA, special)

## $id
## NULL
## 
## $host_since
## NULL
## 
## $Suburb
## NULL
## 
## $State
## NULL
## 
## $Country
## NULL
## 
## $LGA
## NULL
## 
## $property_type
## NULL
## 
## $accommodates
## NULL
## 
## $bathrooms
## NULL
## 
## $bedrooms
## NULL
## 
## $price
## [1] 0
## 
## $guests_included
## NULL
## 
## $number_of_reviews
## [1] 0
## 
## $instant_bookable
## NULL
## 
## $LGA_Area_km2
## [1] 0
## 
## $LGA_Density
## [1] 0
## 
## $price_per_guest
## [1] 0
## 
## $LGA_population
## [1] 0

Check inconsistency from separated variable, “Suburb”, “State”, and “Country”

unique (airbnb_LGA$Suburb)

##   [1] "Bulleen"                      "Brunswick East"              
##   [3] "St Kilda"                     "Thornbury"                   
##   [5] "St Kilda East"                "Berwick"                     
##   [7] "Reservoir"                    "East Melbourne"              
##   [9] "Oakleigh East"                "Frankston"                   
##  [11] "Richmond"                     "Melbourne"                   
##  [13] "Brighton East"                "Kew"                         
##  [15] "Fitzroy"                      "Hampton"                     
##  [17] "Clifton Hill"                 "Mordialloc"                  
##  [19] "Northcote"                    "Port Melbourne"              
##  [21] "South Melbourne"              "Prahran"                     
##  [23] "Saint Kilda East"             "Parkdale"                    
##  [25] "Ringwood East"                "Parkville"                   
##  [27] "Yarraville"                   "Belgrave"                    
##  [29] "Chum Creek"                   "Ivanhoe"                     
##  [31] "Elwood"                       "Mount Waverley"              
##  [33] "Middle Park"                  "Hawthorn East"               
##  [35] "Carlton"                      "Caulfield North"             
##  [37] "Montmorency"                  "Melbourne "                  
##  [39] "Hadfield"                     "North Melbourne"             
##  [41] "Albert Park"                  "Collingwood"                 
##  [43] "South Yarra"                  "Windsor"                     
##  [45] "Blackburn"                    "Menzies Creek"               
##  [47] "Derrimut"                     "Oakleigh South"              
##  [49] "Abbotsford"                   "South Kingsville"            
##  [51] "Armadale"                     "Southbank"                   
##  [53] "Langwarrin"                   "Warneet"                     
##  [55] "Truganina"                    "Balaclava"                   
##  [57] "Heidelberg"                   "Brunswick"                   
##  [59] "Caulfield"                    "Elsternwick"                 
##  [61] "Kooyong"                      "Malvern East"                
##  [63] "Flemington"                   "Balwyn North"                
##  [65] "Point Cook"                   "Maidstone"                   
##  [67] "Preston"                      "Brighton"                    
##  [69] "Sandringham"                  "East St. Kilda"              
##  [71] "Rosanna"                      "Caulfield South"             
##  [73] "Coburg"                       "Blackburn South"             
##  [75] "Werribee"                     "Yarra Glen"                  
##  [77] "Murrumbeena"                  "St Kilda West"               
##  [79] "Hawthorn"                     "Ashburton"                   
##  [81] "Sherbrooke"                   "Spotswood"                   
##  [83] "Hughesdale"                   "Dandenong"                   
##  [85] "Cottles Bridge"               "Altona"                      
##  [87] "Viewbank"                     "Aspendale"                   
##  [89] "Seddon"                       "Alphington"                  
##  [91] "Burnley"                      "Olinda"                      
##  [93] "Fairfield"                    "Manningham"                  
##  [95] "Essendon"                     "Rowville"                    
##  [97] "Healesville"                  "Ringwood North"              
##  [99] "Fitzroy North"                "Badger Creek"                
## [101] "Diamond Creek"                "Toorak"                      
## [103] "Clayton"                      "Docklands"                   
## [105] "Glen Iris"                    "Highett"                     
## [107] "Greensborough"                "Footscray"                   
## [109] "Beaumaris"                    "Forest Hill"                 
## [111] "Ormond"                       "Ripponlea"                   
## [113] "Gardenvale"                   "Doncaster East"              
## [115] "Yarra Junction"               "Macclesfield"                
## [117] "West Melbourne"               "St. Kilda"                   
## [119] "West Footscray"               "North Warrandyte"            
## [121] "Heidelberg Heights"           "Upwey"                       
## [123] "Saint Kilda"                  "Croydon North"               
## [125] "Lilydale"                     "Epping"                      
## [127] "St Kilda / Elwood"            "Bayswater"                   
## [129] "Ashwood"                      "Sandhurst"                   
## [131] "Bentleigh"                    "Warrandyte"                  
## [133] "McKinnon"                     "Langwarrin South"            
## [135] "Pascoe Vale South"            "Newport"                     
## [137] "Cremorne"                     "Nunawading"                  
## [139] "Ripponlea (East St Kilda)"    "Brunswick / Melbourne"       
## [141] "Cheltenham"                   "Gembrook"                    
## [143] "Eltham North"                 "Bentleigh East"              
## [145] "Box Hill"                     "Box Hill North"              
## [147] "Carnegie"                     "Maribyrnong"                 
## [149] "Balwyn"                       "Glen Waverley"               
## [151] "Glen Huntly"                  "Princes Hill"                
## [153] "Warburton"                    "Moonee Ponds"                
## [155] "East Warburton"               "Launching Place"             
## [157] "Chirnside Park"               "Narre Warren South"          
## [159] "Seaford"                      "Braybrook"                   
## [161] "Sassafras"                    "Brunswick West"              
## [163] "Research"                     "The Basin"                   
## [165] "Kensington"                   "Dixons Creek"                
## [167] "Kalorama"                     "Wattle Glen"                 
## [169] "Williamstown"                 "Melbourne (Eltham)"          
## [171] "Oak Park"                     "Camberwell"                  
## [173] "Dandenong Ranges"             "èŠ’ç‰¹éŸ¦å¼—å\210©åŒº"          
## [175] "Croydon South"                "Wantirna South"              
## [177] "Smiths Gully"                 "Frankston South"             
## [179] "Mount Dandenong"              "Black Rock"                  
## [181] "Cockatoo"                     "Oakleigh"                    
## [183] "Glenroy"                      "Kilsyth South"               
## [185] "Burwood"                      "Ascot Vale"                  
## [187] "Ivanhoe (Melbourne)"          "Kew East"                    
## [189] "Lalor"                        "Emerald"                     
## [191] "FITZROY"                      "Gladstone Park"              
## [193] "Malvern"                      "Westmeadows"                 
## [195] "Saint Kilda West"             "Carrum Downs"                
## [197] "Millgrove"                    "Boronia"                     
## [199] "Carlton North"                "Carlton South"               
## [201] "Hampton East"                 "Fawkner"                     
## [203] "Tonimbuk"                     "Huntingdale"                 
## [205] "brunswick"                    "Mount Toolebewong"           
## [207] "å¢¨å°”æœ¬"                    "Victoria"                    
## [209] "North Fitzroy"                "Warrandyte South"            
## [211] "Ringwood"                     "Mt Waverley"                 
## [213] "Surrey Hills"                 "Coburg North"                
## [215] "Broadmeadows "                "Diggers Rest"                
## [217] "Yellingbo"                    "Heidelberg West"             
## [219] "Strathmore"                   "Caulfield East"              
## [221] "Patterson Lakes"              "Keysborough"                 
## [223] "Tarneit"                      "Don Valley"                  
## [225] "Burwood East"                 "HEIDELBERG WEST"             
## [227] "Brunswick "                   "Wonga Park"                  
## [229] "Eaglemont"                    "Delahey"                     
## [231] "Cranbourne West"              "Canterbury"                  
## [233] "Kingsville"                   "Eltham"                      
## [235] "Middle Park Melbourne"        "Moorabbin"                   
## [237] "St Kilda East "               "Belgrave Heights"            
## [239] "Seabrook"                     "Ferny Creek"                 
## [241] "Lyndhurst"                    "Narre Warren"                
## [243] "Mont Albert North"            "Clyde North"                 
## [245] "Park Orchards"                "Mentone"                     
## [247] "Donvale"                      "Gladysdale"                  
## [249] "McMahons Creek"               "Gruyere"                     
## [251] "Wantirna"                     "Broadmeadows"                
## [253] "Caroline Springs"             "Ivanhoe East"                
## [255] "Williams Landing"             "Aberfeldie"                  
## [257] "Mitcham"                      "Sunshine West"               
## [259] "St Andrews"                   "Mt Dandenong"                
## [261] "Hurstbridge"                  "Croydon"                     
## [263] "Travancore"                   "Taylors Hill"                
## [265] "Southbank Melbourne "         "Pakenham "                   
## [267] "Kangaroo ground"              "Wandin North"                
## [269] "Cranbourne South"             "Seville"                     
## [271] "Craigieburn"                  "Toolangi"                    
## [273] "Noble Park"                   "Beaconsfield"                
## [275] "Chelsea"                      "Yering"                      
## [277] "Edithvale"                    "Notting Hill"                
## [279] "Springvale South"             "Hoppers Crossing"            
## [281] "Christmas Hills"              "Doncaster"                   
## [283] "Kangaroo Ground"              "Bulla"                       
## [285] "Bundoora"                     "Mount Evelyn"                
## [287] "Airport West"                 "Kealba"                      
## [289] "Blackburn North"              "St Kilda West Melbourne"     
## [291] "Tullamarine"                  "Vermont South"               
## [293] "Yuroke"                       "HEIDELBERG"                  
## [295] "Endeavour Hills"              "Vermont"                     
## [297] "Kallista"                     "Carlton "                    
## [299] " South Yarra"                 "Dandenong South"             
## [301] "Cranbourne North"             "Guys Hill"                   
## [303] "The Patch"                    "Altona Meadows"              
## [305] "Wesburn"                      "Kilsyth"                     
## [307] "Essendon West"                "Werribee South"              
## [309] "Heatherton"                   "Armadale North"              
## [311] "Sunshine"                     "Beaconsfield Upper"          
## [313] "Lower Plenty"                 "StKilda East"                
## [315] "Keilor Park"                  "Niddrie"                     
## [317] "Macleod"                      "clayton"                     
## [319] "Keilor East"                  "Clayton South"               
## [321] "Knoxfield"                    "Greenvale"                   
## [323] "Eynesbury"                    "Doreen"                      
## [325] "Carrum"                       "Mulgrave"                    
## [327] "St. Kilda West"               "VIC"                         
## [329] "Chadstone"                    "Sunbury"                     
## [331] "Praharan "                    "Narre Warren North"          
## [333] "Mlebourne"                    "Wheelers Hill"               
## [335] "Officer"                      "Mooroolbark"                 
## [337] "Prahran / Toorak "            "Mont Albert"                 
## [339] "Gowanbrae"                    "Wandin East"                 
## [341] "Pakenham"                     "Saint Helena"                
## [343] "Chelsea Heights"              "BRUNSWICK EAST"              
## [345] "Noble Park North"             "Box Hill South"              
## [347] "Yarra Valley"                 "Saint Andrews"               
## [349] "Altona North"                 "Keilor Downs"                
## [351] "St.Kilda East"                "Sydenham"                    
## [353] "Bonbeach"                     "STRATHTULLOH"                
## [355] "STRTHTULLOCH"                 "Coburg (Melbourne)"          
## [357] "City of Port Phillip"         "Lysterfield"                 
## [359] "Bangholme"                    "Seaholme"                    
## [361] "Templestowe Lower"            "Keilor"                      
## [363] "Templestowe"                  "Woori Yallock"               
## [365] "Warranwood"                   "Cranbourne East"             
## [367] "Briar Hill"                   "Ferntree Gully"              
## [369] "eaglemont "                   "Kurunjang"                   
## [371] "Springvale"                   "Saint Albans"                
## [373] "Clarinda"                     "East Doncaster"              
## [375] "Steels Creek"                 "Little River"                
## [377] "Hallam"                       "South Melbourne "            
## [379] "North Fitzroy "               "Meadow Heights"              
## [381] "Tecoma"                       "Port Melbourne "             
## [383] "Chum Creek/Healesville"       "Bayswater North"             
## [385] "St kilda"                     "Panton Hill"                 
## [387] "melbourne"                    "clayton south"               
## [389] "st kilda"                     "Kingsbury"                   
## [391] "Thomastown"                   "Brimbank City"               
## [393] "Wyndham Vale"                 "Pascoe Vale"                 
## [395] "Moreland City"                "Heathmont"                   
## [397] "Healesville "                 "Docklands "                  
## [399] "Seville East"                 "Montrose"                    
## [401] "Nar Nar Goon North"           "South Morang"                
## [403] "St Kilda "                    "Mill Park"                   
## [405] "Mcmahons Creek"               "Hoddles Creek"               
## [407] "Pearcedale"                   "Brunswick East "             
## [409] "Mernda"                       "Strathewen"                  
## [411] "Oaklands Junction"            "Cairnlea"                    
## [413] "Wollert"                      "Roxburgh Park"               
## [415] "Deer Park"                    "Brookfield"                  
## [417] "Williamstown North"           "Plumpton"                    
## [419] "Bellfield"                    "Donnybrook"                  
## [421] "East st kilda "               "St. Andrews"                 
## [423] "Braybrook "                   "Cranbourne"                  
## [425] "Harkaway"                     "Hawthorn east"               
## [427] "Melton South"                 "Keilor Lodge"                
## [429] "Pakenham Upper"               "Laverton"                    
## [431] "Essendon "                    "Silvan"                      
## [433] "Doveton"                      "Watsonia"                    
## [435] "Beveridge"                    "Essendon North"              
## [437] "Selby"                        "Richmond "                   
## [439] "Cora Lynn"                    "Hampton Park"                
## [441] "Footscray "                   "Lynbrook"                    
## [443] "Melton West"                  "Whittlesea"                  
## [445] "Aspendale Gardens"            "Scoresby"                    
## [447] "Monbulk"                      "Dandenong North"             
## [449] "Mickleham"                    "Avondale Heights"            
## [451] "Coldstream"                   "Bunyip"                      
## [453] "Iona"                         "å¢¨ç\210¾æœ¬ Melbourne"         
## [455] "Sunshine North"               "Dingley Village"             
## [457] "Tremont"                      "Humevale"                    
## [459] "Powelltown"                   "Rockbank"                    
## [461] "Port Phillip City"            " Melbourne"                  
## [463] "Frankston North"              "Wyndham City"                
## [465] "St kilda "                    "Mount Eliza"                 
## [467] "Albert Park. Melbourne "      "South Wharf"                 
## [469] "Upper Ferntree Gully"         "Melbourne City"              
## [471] "Gilderoy"                     "West Melbourne - flagstaff " 
## [473] "Preston West"                 "Kinglake West"               
## [475] "Albion"                       "Jacana"                      
## [477] "Blackburn "                   "Yarra Ranges Shire"          
## [479] "Flemington "                  "St Kilda east"               
## [481] "Cardinia Shire"               "Saint Kilda Beach"           
## [483] "Narbethong"                   "Clematis"                    
## [485] "Wheelers"                     "Waterways"                   
## [487] "Toolern Vale"                 "Tarrawarra"                  
## [489] "Bayles"                       "Whitehorse City"             
## [491] "Mount Cottrell"               "Burnside Heights"            
## [493] "Tooradin"                     "Brooklyn"                    
## [495] "Hillside"                     "Eumemmerring"                
## [497] "Doncasterï¼ŒMelbourne"        "Melbourne CBD"               
## [499] "Big Pats Creek"               "Ivanhoe "                    
## [501] "Burnside"                     "Eden Park"                   
## [503] "Melbourne VIC 3000"           "Wheelers Hill VIC 3150"      
## [505] "Wantirna south"               "South Yarra VIC 3141"        
## [507] "Doncaster VIC 3108"           "Somerton VIC 3062"           
## [509] "Kings Park"                   "Strathtulloh "               
## [511] "Brunswick VIC 3056"           "Wantirna South VIC 3152"     
## [513] "Melbourne VIC 3004"           "Melton South ( STRATHTULLOH)"
## [515] "Northcote South"              "Watsons Creek"               
## [517] "Oaklands junction "           "Werribee "                   
## [519] "East Brunswick"               "East Melbourne "             
## [521] "Garfield North"               "Officer South"               
## [523] "Truganina "                   "Dockland "                   
## [525] "Botanic Ridge"                "Melborne"                    
## [527] "Holmesglen"                   "Brunswick west"              
## [529] "Abbotsford "                  "Heidelberg West "            
## [531] "Dallas"                       "Brisbane"                    
## [533] "Manor Lakes"                  "Watsonia North"              
## [535] "Albanvale"

change all observation to lower case using stringr function

airbnb_LGA$Suburb <- str_to_lower(airbnb_LGA$Suburb)

unique (airbnb_LGA$State)

##  [1] " VIC"                     " Vic"                    
##  [3] " vic"                     " Melbourne"              
##  [5] " ç»´å¤šå\210©äºšå·ž"         " Yarra Valley"           
##  [7] " Melbourne."              "10minutes to Warburton"  
##  [9] " Yarra Glen"              " VICTORIA"               
## [11] " Victoria"                " ç»´å¤šå\210©äºš"           
## [13] " St. Kilda "              " St. Kilda"              
## [15] " ç¶å¤šå\210©äºž VIC"        " victoria"               
## [17] " Melbourne VIC 3000"      " Wheelers Hill VIC 3150" 
## [19] " South Yarra VIC 3141"    " NSW"                    
## [21] " Doncaster VIC 3108"      " Somerton VIC 3062"      
## [23] " Brunswick VIC 3056"      " Wantirna South VIC 3152"
## [25] " Melbourne VIC 3004"      " Australia"              
## [27] " VI"                      " VICtoria"               
## [29] " VIC 3008"                " Victory"                
## [31] " QLD"                     " Mel"

unique (airbnb_LGA$Country)

## [1] " Australia"   " VIC"         " Healesville" " AU"         
## [5] " Victoria"

The actual airbnb location is in Victoria and Australia, so other than that will be changed to “VIC” and “AU”

airbnb_LGA$State <- replace(airbnb_LGA$State, airbnb_LGA$State != "VIC", "VIC")

airbnb_LGA$Country <- replace(airbnb_LGA$Country, airbnb_LGA$Country != "AU", "AU")

Using the editrules package, the following code sets some basic rules to scan for inconsistencies. The rules defined below are to make sure that no numerical value is a negative number. For the price variable however, the rule makes sure that no value is below 5 as for the purpose of this project, any Airbnb accommodation for less than $5 per night is considered a mistake or is not a legitimate listing. There are 22 violations under the rule number 1 (price >= 5), those of which can be seen below:

library(editrules)

rules <- editset(c("price >= 5",
                   "price_per_guest >= 0",
                   "number_of_reviews >= 0",
                   "LGA_Area_km2 >= 0",
                   "LGA_Density >= 0",
                   "LGA_population >= 0"))
rules

## 
## Edit set:
## num1 : 5 <= price
## num2 : 0 <= price_per_guest
## num3 : 0 <= number_of_reviews
## num4 : 0 <= LGA_Area_km2
## num5 : 0 <= LGA_Density
## num6 : 0 <= LGA_population

summary(violatedEdits(rules,airbnb_LGA))

## Edit violations, 22729 observations, 0 completely missing (0%):
## 
##  editname freq  rel
##      num1   22 0.1%
## 
## Edit violations per record:
## 
##  errors  freq   rel
##       0 22707 99.9%
##       1    22  0.1%

which(violatedEdits(rules,airbnb_LGA))

##  [1]  9820 10054 10903 10906 10907 10908 10909 10911 10912 10913 10914
## [12] 11457 11497 11608 11789 12298 12302 12428 12493 12607 14480 17703

The observations with violated rules are removed, resulting in zero violated rules:

airbnb_LGA <- airbnb_LGA[-(which(violatedEdits(rules,airbnb_LGA))),]
summary(violatedEdits(rules,airbnb_LGA))

## No violations detected, 0 checks evaluated to NA

## NULL

Another inconsistency within the dataset is that some values of the guests_included variable are larger than the accommodates variable, which is impossible as ‘accommodates’ is the maximum number of guests allowed to stay in the accommodation. Below, it is shown that 108 observations have guests_included being larger than accommodates.

summary(as.numeric(airbnb_LGA$guests_included) > as.numeric(airbnb_LGA$accommodates))

##    Mode   FALSE    TRUE 
## logical   22599     108

which(as.numeric(airbnb_LGA$guests_included) > as.numeric(airbnb_LGA$accommodates))

##   [1]    80   227   323   364   592   751   831  1608  1882  1904  2271
##  [12]  2273  2725  2898  3064  3355  3396  3636  3751  3838  4083  4220
##  [23]  4294  4640  4770  4927  5702  6251  6402  6405  6424  6850  7003
##  [34]  7316  7910  8138  8441  8549  8705  9326  9357 10269 10283 10647
##  [45] 10677 10722 10831 10920 10970 11393 11460 11598 11699 11934 12033
##  [56] 12035 12079 12083 12112 12241 12443 12744 13153 13294 13349 13385
##  [67] 13527 13549 13778 14027 14484 15131 15348 15583 16030 16249 16350
##  [78] 16650 16778 17620 17839 17883 18019 18095 18156 18281 18394 18427
##  [89] 18428 19441 19478 19934 20221 20413 20423 20782 20806 20821 21006
## [100] 21094 21494 21633 21664 21680 21826 21828 21834 22542

To deal with this, an ifelse statement is used to determine the observations with this inconsistency, which are then imputed with the the value from the accommodates variable:

airbnb_LGA <- airbnb_LGA %>%
  mutate(guests_included = ifelse(as.numeric(guests_included) <= as.numeric(accommodates), guests_included, accommodates))

summary(as.numeric(airbnb_LGA$guests_included) > as.numeric(airbnb_LGA$accommodates))

##    Mode   FALSE 
## logical   22707

Scan II

Before scanning for outliers, an upper threshold will be applied to the price variable in order to remove listings with obvious pricing errors. A search on Airbnb listings in Victoria show that the most luxurious mansions list for no more than $3,500 per night, while a wedding reception accommodation is listed for around $5,000. Listings higher than $5,000 per night are usually small dwellings, which are the result of pricing errors or that they’re not legitimate listings. The following shows that the dataset contains prices that go up into the $12,000 range. Any listing with prices above $5,000 per night is removed:

max(airbnb_LGA$price)

## [1] 12624

which(airbnb_LGA$price > 5000)

## [1]  5596  7823  8070 19392

airbnb_LGA <- airbnb_LGA[-(which(airbnb_LGA$price > 5000)),]
max(airbnb_LGA$price)

## [1] 4445

In terms of scanning the numerical variables for outliers, ‘LGA_Area_km2’, ‘LGA_Density’, and ‘LGA_population’ will not be scanned since these values are tied to the factor variable ‘LGA’ which contain 30 factors, hence only giving 30 unique values for each of these variables. Any outliers resulting from these variables will not be removed or imputed as they simply demonstrate that some LGAs may have significantly lower or higher populations relative to each other. The numerical variables that will be scanned are ‘price’, ‘price_per_guest’, and ‘number_of_reviews’.

Below, it is shown that all 3 of the variables are right-skewed due to the fact that Airbnb prices and the number of reviews cannot be negative, but can be extremely large. Because of this, Tukey’s method (boxplot) will be used to detect outliers as it does not require the assumption of normality. It is shown that the number of outliers for each variable is quite large, with all of them making up more than five percent of the total observations. This means that imputation will be used to deal with these outliers as opposed to removal. It is also important to note that the following outliers probably do not represent errors. Under normal circumstances, these outliers will be left alone as they represent real and correct values that describe the distribution of Airbnb prices and number of reviews, however for the purpose of demonstration, they will be dealt with.

Price:

hist(airbnb_LGA$price, breaks = 200, main="Price Per Night", xlab="Price")

price_box <- airbnb_LGA$price %>%
  boxplot(main="Price Per Night", ylab="Price", col = "lightgrey")

summary(airbnb_LGA$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      12      71     109     145     165    4445

length(price_box$out)

## [1] 1397

length(price_box$out)/nrow(airbnb_LGA)

## [1] 0.06153372

Price Per Guest:

hist(airbnb_LGA$price_per_guest, breaks = 200, main="Price Per Guest", xlab="Price")

price_per_guest_box <- airbnb_LGA$price_per_guest %>%
  boxplot(main="Price Per Guest", ylab="Price", col = "lightgrey")

summary(airbnb_LGA$price_per_guest)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.86   46.50   68.00  102.86  116.00 4445.00

length(price_per_guest_box$out)

## [1] 1557

length(price_per_guest_box$out)/nrow(airbnb_LGA)

## [1] 0.06858124

Number of Reviews:

hist(airbnb_LGA$number_of_reviews, breaks = 200, main="Number of Reviews", xlab="Number")

number_of_reviews_box <- airbnb_LGA$number_of_reviews %>%
  boxplot(main="Number of Reviews", ylab="Number", col = "lightgrey")

summary(airbnb_LGA$number_of_reviews)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    5.00   21.34   24.00  538.00

length(number_of_reviews_box$out)

## [1] 2480

length(number_of_reviews_box$out)/nrow(airbnb_LGA)

## [1] 0.1092367

For this dataset, imputation is done via imputing the outliers with the median since the distributions are extremely skewed. Winsorising was considered, however it would’ve resulted in a large spike in the nearest neighbour of the distribution given the large number of outliers. The median method used also resulted in a large spike in the median of the distribution, so neither solution is ideal (later in the “Transform” section another technique will be explored: the variable ‘price_duplicate’ is created in the code below for later transformation). For each variable, the following code replaces the outliers found above (that were found using the boxplot function) with the median of its variable:

airbnb_LGA <- airbnb_LGA %>%
  mutate(price_duplicate = price)

airbnb_LGA$price[which(airbnb_LGA$price < price_box$stats[1] | airbnb_LGA$price > price_box$stats[5])] <- 
  median(airbnb_LGA$price)
hist(airbnb_LGA$price, breaks = 200, main="Price - Imputed", xlab="Price")

summary(airbnb_LGA$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    12.0    71.0   109.0   116.7   150.0   306.0

airbnb_LGA$price_per_guest[which(airbnb_LGA$price_per_guest < price_per_guest_box$stats[1] | airbnb_LGA$price_per_guest > price_per_guest_box$stats[5])] <- 
  median(airbnb_LGA$price_per_guest)
hist(airbnb_LGA$price_per_guest, breaks = 200, main="Price Per Guest - Imputed", xlab="Price")

summary(airbnb_LGA$price_per_guest)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.86   46.50   68.00   77.64   99.50  220.00

airbnb_LGA$number_of_reviews[which(airbnb_LGA$number_of_reviews < number_of_reviews_box$stats[1] | airbnb_LGA$number_of_reviews > number_of_reviews_box$stats[5])] <- 
  median(airbnb_LGA$number_of_reviews)
hist(airbnb_LGA$number_of_reviews, breaks = 200, main="Number of Reviews - Imputed", xlab="Price")

summary(airbnb_LGA$number_of_reviews)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   5.000   9.563  12.000  58.000

Transform

In this section, the ‘price_duplicate’ variable is transformed and normalised in order to perform a z-score outlier detection. The following uses the ‘forecast’ package in order to perform a BoxCox transformation. Comparing the tranformed distribution with original distribution, the transformed distribution is much more normal:

hist(airbnb_LGA$price_duplicate, breaks = 100, main="Price - Original", xlab="Price")

price_boxcox <- BoxCox(airbnb_LGA$price_duplicate,lambda = "auto")
hist(price_boxcox, breaks = 100, main="Price - Transformed", xlab="Price")

Below, the tranformed distribution (price_boxcox) can be standardised given its approximately normal distribution. The number of outliers (values that exceed 3 standard deviations) are 88 in this case. Much less than Tukey’s method when applied to the non-transformed price variable (which had 1,399 outliers):

z_scores <- scale(price_boxcox, center = T, scale = T)
summary(z_scores)

##        V1          
##  Min.   :-4.49789  
##  1st Qu.:-0.60448  
##  Median : 0.06852  
##  Mean   : 0.00000  
##  3rd Qu.: 0.64397  
##  Max.   : 3.40560

length(which(abs(z_scores) > 3))

## [1] 89