Airbnb, Inc. is an American company, based in San Francisco, California, that operates an online marketplace for vacation rentals. The marketplace is accessible through its website and mobile app. Users can book lodging, primarily homestays, to suit their needs, or list their own properties for rent. Airbnb does not own any of the listed properties; instead, it profits by collecting brokerage and service fees, as a percentage of each booking, from both host and guest.
Our project focuses on Airbnb services in Singapore. Although small in land area, Singapore has been Southeast Asia’s most modern city for over a century. For many travelers, Singapore is their first introduction to Southeast Asia, as the city blends Chinese, Malay, Indian and English cultures and religions. With such strong tourism demand, Airbnb services in Singapore are growing rapidly.
First, we want to predict the future price of Airbnb listings in Singapore using regression, a supervised machine learning technique. The regression model chosen for this project is linear regression. Price predictions can support Airbnb’s decision-making, for example when planning targeted promotions. Travelers can also use accurate price predictions to plan their vacations in advance.
Next, we want to predict the class (room type) to which an Airbnb listing in Singapore belongs using classification, also a supervised machine learning technique. The classification model chosen for this project is random forest. Classifying room type helps guests confirm that the accommodation information is accurate, which matters for their service experience.
The Singapore Airbnb listings data set is available in CSV format from the Inside Airbnb website at http://insideairbnb.com/get-the-data.html. According to the website, the data was collected on 26 October 2020. The data set contains 7907 samples, some of which have missing values. It describes all Airbnb listings in Singapore across 5 regions (Central Region, North Region, North-East Region, East Region and West Region), together with guest review information.
The R packages required for the project are loaded.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(corrplot)
## corrplot 0.84 loaded
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(caTools)
library(cowplot)
library(rpart)
The data set, a CSV file titled Singapore Airbnb Listings, is read and assigned to airbnb.
airbnb <- read.csv("C:/Users/User/Desktop/Singapore Airbnb Listings.csv")
First 20 rows of airbnb data are viewed.
head(airbnb,20) %>%
kable() %>%
kable_styling()
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49091 | COZICOMFORT LONG TERM STAY ROOM 2 | 266763 | Francesca | North Region | Woodlands | 1.44255 | 103.7958 | Private room | 83 | 180 | 1 | 2013-10-21 | 0.01 | 2 | 365 |
| 50646 | Pleasant Room along Bukit Timah | 227796 | Sujatha | Central Region | Bukit Timah | 1.33235 | 103.7852 | Private room | 81 | 90 | 18 | 2014-12-26 | 0.28 | 1 | 365 |
| 56334 | COZICOMFORT | 266763 | Francesca | North Region | Woodlands | 1.44246 | 103.7967 | Private room | 69 | 6 | 20 | 2015-10-01 | 0.20 | 2 | 365 |
| 71609 | Ensuite Room (Room 1 & 2) near EXPO | 367042 | Belinda | East Region | Tampines | 1.34541 | 103.9571 | Private room | 206 | 1 | 14 | 2019-08-11 | 0.15 | 9 | 353 |
| 71896 | B&B Room 1 near Airport & EXPO | 367042 | Belinda | East Region | Tampines | 1.34567 | 103.9596 | Private room | 94 | 1 | 22 | 2019-07-28 | 0.22 | 9 | 355 |
| 71903 | Room 2-near Airport & EXPO | 367042 | Belinda | East Region | Tampines | 1.34702 | 103.9610 | Private room | 104 | 1 | 39 | 2019-08-15 | 0.38 | 9 | 346 |
| 71907 | 3rd level Jumbo room 5 near EXPO | 367042 | Belinda | East Region | Tampines | 1.34348 | 103.9634 | Private room | 208 | 1 | 25 | 2019-07-25 | 0.25 | 9 | 172 |
| 241503 | Long stay at The Breezy East “Leopard” | 1017645 | Bianca | East Region | Bedok | 1.32304 | 103.9136 | Private room | 50 | 90 | 174 | 2019-05-31 | 1.88 | 4 | 59 |
| 241508 | Long stay at The Breezy East “Plumeria” | 1017645 | Bianca | East Region | Bedok | 1.32458 | 103.9116 | Private room | 54 | 90 | 198 | 2019-04-28 | 2.08 | 4 | 133 |
| 241510 | Long stay at The Breezy East “Red Palm” | 1017645 | Bianca | East Region | Bedok | 1.32461 | 103.9119 | Private room | 42 | 90 | 236 | 2019-07-31 | 2.53 | 4 | 147 |
| 275343 | Conveniently located City Room!( (Phone number hidden by Airbnb) ) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28875 | 103.8081 | Private room | 44 | 15 | 18 | 2019-04-21 | 0.23 | 32 | 331 |
| 275344 | 15 mins to Outram MRT Single Room (B) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28837 | 103.8110 | Private room | 40 | 30 | 10 | 2018-09-13 | 0.11 | 32 | 276 |
| 289234 | Booking for 3 bedrooms | 367042 | Belinda | East Region | Tampines | 1.34561 | 103.9598 | Private room | 417 | 2 | 12 | 2019-01-01 | 0.14 | 9 | 239 |
| 294281 | 5 mins walk from Newton subway | 1521514 | Elizabeth | Central Region | Newton | 1.31125 | 103.8382 | Private room | 65 | 2 | 125 | 2019-08-22 | 1.35 | 6 | 336 |
| 324945 | 20 Mins to Sentosa @ Hilltop ! (8) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28976 | 103.8090 | Private room | 44 | 30 | 13 | 2019-02-02 | 0.15 | 32 | 340 |
| 330089 | Accomo@ REDHILL-INSEAD, NTU,NUS -Mu(D) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28677 | 103.8124 | Private room | 40 | 30 | 10 | 2019-04-27 | 0.14 | 32 | 331 |
| 330095 | 10 mins to Redhill MRT @ Mini Orange Room(5) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28537 | 103.8109 | Private room | 31 | 90 | 3 | 2016-08-22 | 0.04 | 32 | 361 |
| 344803 | Budget short stay room near EXPO | 367042 | Belinda | East Region | Tampines | 1.34943 | 103.9595 | Private room | 49 | 2 | 45 | 2019-08-11 | 0.50 | 9 | 357 |
| 355955 | Double room in an Authentic Peranakan Shophouse | 1759905 | Aresha | Central Region | Geylang | 1.31420 | 103.9023 | Private room | 81 | 90 | 0 | | NA | 1 | 173 |
| 369141 | 5mins from Newton Train Station | 1521514 | Elizabeth | Central Region | Newton | 1.31150 | 103.8376 | Private room | 60 | 2 | 84 | 2019-07-10 | 1.17 | 6 | 340 |
The class and underlying storage type of the airbnb object are determined.
class(airbnb)
## [1] "data.frame"
typeof(airbnb)
## [1] "list"
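The two results are consistent: a data frame is stored as a list of equal-length column vectors, with the class `data.frame` layered on top. A minimal sketch on a hypothetical two-column frame (`toy` is an illustrative name, not part of the data set):

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(price = c(83, 81, 69),
                  room_type = c("Private room", "Private room", "Shared room"),
                  stringsAsFactors = FALSE)

class(toy)    # S3 class used for method dispatch
typeof(toy)   # underlying storage type
is.list(toy)  # each column is one list element
length(toy)   # number of list elements, i.e. number of columns
```

This is also why `length(airbnb)` returns the number of columns rather than the number of rows.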
Dimension of airbnb data is determined.
dim(airbnb)
## [1] 7907 16
It is a data frame with 7907 rows and 16 columns.
Content of airbnb data is determined.
glimpse(airbnb)
## Rows: 7,907
## Columns: 16
## $ id <int> 49091, 50646, 56334, 71609, 71896, 7...
## $ name <chr> "COZICOMFORT LONG TERM STAY ROOM 2",...
## $ host_id <int> 266763, 227796, 266763, 367042, 3670...
## $ host_name <chr> "Francesca", "Sujatha", "Francesca",...
## $ neighbourhood_group <chr> "North Region", "Central Region", "N...
## $ neighbourhood <chr> "Woodlands", "Bukit Timah", "Woodlan...
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, ...
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.95...
## $ room_type <chr> "Private room", "Private room", "Pri...
## $ price <int> 83, 81, 69, 206, 94, 104, 208, 50, 5...
## $ minimum_nights <int> 180, 90, 6, 1, 1, 1, 1, 90, 90, 90, ...
## $ number_of_reviews <int> 1, 18, 20, 14, 22, 39, 25, 174, 198,...
## $ last_review <chr> "2013-10-21", "2014-12-26", "2015-10...
## $ reviews_per_month <dbl> 0.01, 0.28, 0.20, 0.15, 0.22, 0.38, ...
## $ calculated_host_listings_count <int> 2, 1, 2, 9, 9, 9, 9, 4, 4, 4, 32, 32...
## $ availability_365 <int> 365, 365, 365, 353, 355, 346, 172, 5...
Structure of airbnb data is determined.
str(airbnb)
## 'data.frame': 7907 obs. of 16 variables:
## $ id : int 49091 50646 56334 71609 71896 71903 71907 241503 241508 241510 ...
## $ name : chr "COZICOMFORT LONG TERM STAY ROOM 2" "Pleasant Room along Bukit Timah" "COZICOMFORT" "Ensuite Room (Room 1 & 2) near EXPO" ...
## $ host_id : int 266763 227796 266763 367042 367042 367042 367042 1017645 1017645 1017645 ...
## $ host_name : chr "Francesca" "Sujatha" "Francesca" "Belinda" ...
## $ neighbourhood_group : chr "North Region" "Central Region" "North Region" "East Region" ...
## $ neighbourhood : chr "Woodlands" "Bukit Timah" "Woodlands" "Tampines" ...
## $ latitude : num 1.44 1.33 1.44 1.35 1.35 ...
## $ longitude : num 104 104 104 104 104 ...
## $ room_type : chr "Private room" "Private room" "Private room" "Private room" ...
## $ price : int 83 81 69 206 94 104 208 50 54 42 ...
## $ minimum_nights : int 180 90 6 1 1 1 1 90 90 90 ...
## $ number_of_reviews : int 1 18 20 14 22 39 25 174 198 236 ...
## $ last_review : chr "2013-10-21" "2014-12-26" "2015-10-01" "2019-08-11" ...
## $ reviews_per_month : num 0.01 0.28 0.2 0.15 0.22 0.38 0.25 1.88 2.08 2.53 ...
## $ calculated_host_listings_count: int 2 1 2 9 9 9 9 4 4 4 ...
## $ availability_365 : int 365 365 365 353 355 346 172 59 133 147 ...
Summary of airbnb data is determined.
summary(airbnb)
## id name host_id host_name
## Min. : 49091 Length:7907 Min. : 23666 Length:7907
## 1st Qu.:15821800 Class :character 1st Qu.: 23058075 Class :character
## Median :24706270 Mode :character Median : 63448912 Mode :character
## Mean :23388625 Mean : 91144807
## 3rd Qu.:32348500 3rd Qu.:155381142
## Max. :38112762 Max. :288567551
##
## neighbourhood_group neighbourhood latitude longitude
## Length:7907 Length:7907 Min. :1.244 Min. :103.6
## Class :character Class :character 1st Qu.:1.296 1st Qu.:103.8
## Mode :character Mode :character Median :1.311 Median :103.8
## Mean :1.314 Mean :103.8
## 3rd Qu.:1.322 3rd Qu.:103.9
## Max. :1.455 Max. :104.0
##
## room_type price minimum_nights number_of_reviews
## Length:7907 Min. : 0.0 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.: 65.0 1st Qu.: 1.00 1st Qu.: 0.00
## Mode :character Median : 124.0 Median : 3.00 Median : 2.00
## Mean : 169.3 Mean : 17.51 Mean : 12.81
## 3rd Qu.: 199.0 3rd Qu.: 10.00 3rd Qu.: 10.00
## Max. :10000.0 Max. :1000.00 Max. :323.00
##
## last_review reviews_per_month calculated_host_listings_count
## Length:7907 Min. : 0.010 Min. : 1.00
## Class :character 1st Qu.: 0.180 1st Qu.: 2.00
## Mode :character Median : 0.550 Median : 9.00
## Mean : 1.044 Mean : 40.61
## 3rd Qu.: 1.370 3rd Qu.: 48.00
## Max. :13.000 Max. :274.00
## NA's :2758
## availability_365
## Min. : 0.0
## 1st Qu.: 54.0
## Median :260.0
## Mean :208.7
## 3rd Qu.:355.0
## Max. :365.0
##
Number of attributes in airbnb data is determined.
length(airbnb)
## [1] 16
A total of 16 attributes are present in airbnb data.
The names of all attributes in airbnb data are listed.
names(airbnb)
## [1] "id" "name"
## [3] "host_id" "host_name"
## [5] "neighbourhood_group" "neighbourhood"
## [7] "latitude" "longitude"
## [9] "room_type" "price"
## [11] "minimum_nights" "number_of_reviews"
## [13] "last_review" "reviews_per_month"
## [15] "calculated_host_listings_count" "availability_365"
The attributes of airbnb data are id, name, host_id, host_name, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, last_review, reviews_per_month, calculated_host_listings_count and availability_365.
The number of missing values in airbnb data is calculated.
sum(is.na(airbnb))
## [1] 2758
A total of 2758 missing values are present in airbnb data.
Attributes of airbnb data that contain missing values are located.
colSums(is.na(airbnb))
## id name
## 0 0
## host_id host_name
## 0 0
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 0 2758
## calculated_host_listings_count availability_365
## 0 0
All 2758 missing values are located under reviews_per_month column.
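Note that `is.na()` only counts true `NA` values. Listings without reviews appear to have a blank `last_review` string rather than `NA`, and blank strings in character columns are not counted as missing. A minimal sketch, on a hypothetical two-column frame, of how blank strings could be counted as well (`toy` and `blank_or_na` are illustrative names):

```r
# Hypothetical toy frame: one blank string and one true NA
toy <- data.frame(last_review = c("2019-08-11", "", "2019-07-28"),
                  reviews_per_month = c(0.15, NA, 0.22),
                  stringsAsFactors = FALSE)

# is.na() sees only the true NA in reviews_per_month
colSums(is.na(toy))

# Treat blank strings in character columns as missing too
blank_or_na <- function(x) is.na(x) | (is.character(x) & x == "")
missing_counts <- sapply(toy, function(col) sum(blank_or_na(col)))
missing_counts
```

Alternatively, passing `na.strings = c("", "NA")` to `read.csv()` converts blank fields to `NA` at load time.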
Attributes of airbnb data are renamed for readability.
Unneeded attributes, namely ID, Host_name and Last_review, are removed.
The remaining attributes are arranged in order of importance.
Listings with a price of 0 are filtered out, since a price of 0 indicates a faulty record that would weaken the predictive models.
All missing values in the Review_per_month column are replaced with 0.
The data is sorted in ascending order by Room_type, Price, Region and Neighbourhood.
airbnb <- airbnb %>%
rename(ID=id,
Name = name,
Host_ID = host_id,
Host_name = host_name,
Region = neighbourhood_group,
Neighbourhood = neighbourhood,
Latitude = latitude,
Longitude = longitude,
Room_type = room_type,
Price = price,
Minimum_night = minimum_nights,
Review_count = number_of_reviews,
Last_review = last_review,
Review_per_month = reviews_per_month,
Host_listing_count = calculated_host_listings_count,
Day_available_per_year = availability_365) %>%
select(Name,
Room_type,
Price,
Region,
Neighbourhood,
Latitude,
Longitude,
Host_ID,
Minimum_night,
Review_count,
Review_per_month,
Host_listing_count,
Day_available_per_year) %>%
filter (Price > 0) %>%
mutate(Review_per_month = replace_na(Review_per_month,0)) %>%
arrange(Room_type,Price,Region,Neighbourhood)
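The same sequence of steps can be traced on a small example. The sketch below uses a hypothetical four-row frame (`raw` and `cleaned` are illustrative names) with one zero-price row and one missing review rate, and applies the same dplyr and tidyr verbs in the same order:

```r
library(dplyr)
library(tidyr)

# Hypothetical raw listings: one zero price and one missing review rate
raw <- data.frame(
  id = 1:4,
  room_type = c("Private room", "Entire home/apt", "Private room", "Shared room"),
  price = c(83, 0, 50, 31),
  reviews_per_month = c(0.28, 1.10, NA, 0.04),
  stringsAsFactors = FALSE
)

cleaned <- raw %>%
  rename(Room_type = room_type,
         Price = price,
         Review_per_month = reviews_per_month) %>%
  select(Room_type, Price, Review_per_month) %>%               # drops id
  filter(Price > 0) %>%                                        # removes the zero-price row
  mutate(Review_per_month = replace_na(Review_per_month, 0)) %>%
  arrange(Room_type, Price)                                    # ascending sort

cleaned
```

Replacing a missing review rate with 0 assumes that a listing without a rate has received no reviews, which matches listings whose review count is 0.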
Room_type, Region and Neighbourhood are converted to factors, because read.csv() reads string columns as character by default.
airbnb[c("Room_type","Region","Neighbourhood")] <- map(airbnb[c("Room_type","Region","Neighbourhood")], as.factor)
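An equivalent way to express this conversion inside a dplyr pipeline is `mutate(across(...))`, available in dplyr 1.0.0 and later. A minimal sketch on a hypothetical frame (`toy` is an illustrative name):

```r
library(dplyr)

# Hypothetical toy frame with character columns
toy <- data.frame(
  Room_type = c("Private room", "Shared room", "Private room"),
  Region = c("East Region", "Central Region", "East Region"),
  Price = c(83, 31, 50),
  stringsAsFactors = FALSE
)

# Convert only the listed columns to factors; Price is left untouched
toy <- toy %>%
  mutate(across(c(Room_type, Region), as.factor))

levels(toy$Room_type)
nlevels(toy$Region)
```

Factors matter later because modelling functions such as random forest classifiers expect a categorical response to be a factor.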
A sanity check for remaining missing values in airbnb data is performed.
sum(is.na(airbnb))
## [1] 0
No missing values remain in airbnb data.
First 20 rows of cleaned airbnb data are viewed.
head(airbnb,20) %>%
kable() %>%
kable_styling()
| Name | Room_type | Price | Region | Neighbourhood | Latitude | Longitude | Host_ID | Minimum_night | Review_count | Review_per_month | Host_listing_count | Day_available_per_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Central 1BR Apt in Foodie Haven Hipster Paradise | Entire home/apt | 14 | Central Region | Geylang | 1.31455 | 103.8832 | 29799617 | 3 | 4 | 4.00 | 1 | 34 |
| Senja cozy | Entire home/apt | 14 | West Region | Bukit Panjang | 1.38400 | 103.7631 | 75175440 | 1 | 1 | 0.45 | 2 | 0 |
| Cozy Private Room in Nice Apartment with Pool | Entire home/apt | 31 | Central Region | Bukit Timah | 1.33861 | 103.7808 | 26246420 | 90 | 14 | 0.43 | 1 | 0 |
| Spacious Basic Studio in a Local Hood! | Entire home/apt | 39 | North Region | Yishun | 1.42968 | 103.8366 | 73254645 | 1 | 3 | 2.31 | 1 | 0 |
| Capsule Pod Single Bed in 4 Female Share Room | Entire home/apt | 42 | Central Region | Outram | 1.28374 | 103.8441 | 87411537 | 1 | 6 | 0.16 | 7 | 346 |
| Studioroom @ Farrer Park MRT, Central. Privacy | Entire home/apt | 43 | Central Region | Kallang | 1.31413 | 103.8574 | 24682062 | 25 | 0 | 0.00 | 1 | 155 |
| Short Stay Apartment (Charisma View) S(598671) | Entire home/apt | 46 | Central Region | Bukit Timah | 1.34477 | 103.7686 | 141452197 | 90 | 1 | 0.04 | 1 | 180 |
| Stylish & Modern Deluxe Condominium in Katong | Entire home/apt | 50 | Central Region | Geylang | 1.31384 | 103.8941 | 198046784 | 3 | 22 | 2.04 | 1 | 3 |
| Unique space pod for 1 pax, near eateries, MRT! | Entire home/apt | 50 | Central Region | Kallang | 1.31250 | 103.8613 | 211434562 | 1 | 8 | 0.75 | 64 | 365 |
| One Bedroom Whole House | Entire home/apt | 50 | West Region | Choa Chu Kang | 1.37869 | 103.7518 | 20682139 | 5 | 14 | 1.07 | 1 | 310 |
| Condo with sea view | Entire home/apt | 50 | West Region | Clementi | 1.29767 | 103.7649 | 547772 | 3 | 2 | 0.10 | 1 | 0 |
| Cozy Lakeside | Entire home/apt | 50 | West Region | Jurong West | 1.33492 | 103.7235 | 75175440 | 1 | 2 | 0.91 | 2 | 0 |
| [Green] Quiet Studio Unit with Reservoir View | Entire home/apt | 54 | North-East Region | Sengkang | 1.39546 | 103.8804 | 52404087 | 90 | 2 | 0.29 | 1 | 164 |
| Unique single capsule bed, Shops, eateries, MRT! | Entire home/apt | 56 | Central Region | Kallang | 1.31218 | 103.8607 | 211434562 | 1 | 5 | 0.47 | 64 | 365 |
| Unique space pod for 1 pax, can cook, shops, MRT! | Entire home/apt | 56 | Central Region | Kallang | 1.31079 | 103.8621 | 211434562 | 1 | 0 | 0.00 | 64 | 365 |
| 2 bed room near town | Entire home/apt | 56 | Central Region | Toa Payoh | 1.33957 | 103.8457 | 24054848 | 1 | 0 | 0.00 | 1 | 0 |
| 2 Bedroom furnished flat 3 min from MRT stop | Entire home/apt | 56 | Central Region | Toa Payoh | 1.33996 | 103.8444 | 11493720 | 1 | 3 | 0.08 | 1 | 0 |
| (WHOLE HOUSE) Breezy Comfy Accomodation. | Entire home/apt | 56 | North Region | Woodlands | 1.44787 | 103.7985 | 128465640 | 62 | 0 | 0.00 | 2 | 365 |
| Beachfront Hammocking at East Coast Park Site D | Entire home/apt | 58 | East Region | Bedok | 1.30437 | 103.9221 | 252092906 | 1 | 0 | 0.00 | 2 | 179 |
| 1BR in new apartment with own bath | Entire home/apt | 60 | Central Region | Kallang | 1.32746 | 103.8644 | 39246294 | 1 | 0 | 0.00 | 2 | 0 |
Dimension of cleaned airbnb data is determined.
dim(airbnb)
## [1] 7906 13
It is now a data frame with 7906 rows and 13 columns.
Content of cleaned airbnb data is determined.
glimpse(airbnb)
## Rows: 7,906
## Columns: 13
## $ Name <chr> "Central 1BR Apt in Foodie Haven Hipster Par...
## $ Room_type <fct> Entire home/apt, Entire home/apt, Entire hom...
## $ Price <int> 14, 14, 31, 39, 42, 43, 46, 50, 50, 50, 50, ...
## $ Region <fct> Central Region, West Region, Central Region,...
## $ Neighbourhood <fct> Geylang, Bukit Panjang, Bukit Timah, Yishun,...
## $ Latitude <dbl> 1.31455, 1.38400, 1.33861, 1.42968, 1.28374,...
## $ Longitude <dbl> 103.8832, 103.7631, 103.7808, 103.8366, 103....
## $ Host_ID <int> 29799617, 75175440, 26246420, 73254645, 8741...
## $ Minimum_night <int> 3, 1, 90, 1, 1, 25, 90, 3, 1, 5, 3, 1, 90, 1...
## $ Review_count <int> 4, 1, 14, 3, 6, 0, 1, 22, 8, 14, 2, 2, 2, 5,...
## $ Review_per_month <dbl> 4.00, 0.45, 0.43, 2.31, 0.16, 0.00, 0.04, 2....
## $ Host_listing_count <int> 1, 2, 1, 1, 7, 1, 1, 1, 64, 1, 1, 2, 1, 64, ...
## $ Day_available_per_year <int> 34, 0, 0, 0, 346, 155, 180, 3, 365, 310, 0, ...
Structure of cleaned airbnb data is determined.
str(airbnb)
## 'data.frame': 7906 obs. of 13 variables:
## $ Name : chr "Central 1BR Apt in Foodie Haven Hipster Paradise" "Senja cozy" "Cozy Private Room in Nice Apartment with Pool" "Spacious Basic Studio in a Local Hood!" ...
## $ Room_type : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Price : int 14 14 31 39 42 43 46 50 50 50 ...
## $ Region : Factor w/ 5 levels "Central Region",..: 1 5 1 4 1 1 1 1 1 5 ...
## $ Neighbourhood : Factor w/ 43 levels "Ang Mo Kio","Bedok",..: 12 6 7 43 25 16 7 12 16 9 ...
## $ Latitude : num 1.31 1.38 1.34 1.43 1.28 ...
## $ Longitude : num 104 104 104 104 104 ...
## $ Host_ID : int 29799617 75175440 26246420 73254645 87411537 24682062 141452197 198046784 211434562 20682139 ...
## $ Minimum_night : int 3 1 90 1 1 25 90 3 1 5 ...
## $ Review_count : int 4 1 14 3 6 0 1 22 8 14 ...
## $ Review_per_month : num 4 0.45 0.43 2.31 0.16 0 0.04 2.04 0.75 1.07 ...
## $ Host_listing_count : int 1 2 1 1 7 1 1 1 64 1 ...
## $ Day_available_per_year: int 34 0 0 0 346 155 180 3 365 310 ...
Summary of cleaned airbnb data is determined.
summary(airbnb)
## Name Room_type Price
## Length:7906 Entire home/apt:4131 Min. : 14.0
## Class :character Private room :3381 1st Qu.: 65.0
## Mode :character Shared room : 394 Median : 124.0
## Mean : 169.4
## 3rd Qu.: 199.0
## Max. :10000.0
##
## Region Neighbourhood Latitude Longitude
## Central Region :6308 Kallang :1043 Min. :1.244 Min. :103.6
## East Region : 508 Geylang : 994 1st Qu.:1.296 1st Qu.:103.8
## North-East Region: 346 Novena : 537 Median :1.311 Median :103.8
## North Region : 204 Rochor : 535 Mean :1.314 Mean :103.8
## West Region : 540 Outram : 477 3rd Qu.:1.322 3rd Qu.:103.9
## Bukit Merah: 470 Max. :1.455 Max. :104.0
## (Other) :3850
## Host_ID Minimum_night Review_count Review_per_month
## Min. : 23666 Min. : 1.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 23055695 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 63448912 Median : 3.00 Median : 2.00 Median : 0.1600
## Mean : 91141831 Mean : 17.51 Mean : 12.81 Mean : 0.6797
## 3rd Qu.:155386432 3rd Qu.: 10.00 3rd Qu.: 10.00 3rd Qu.: 0.8500
## Max. :288567551 Max. :1000.00 Max. :323.00 Max. :13.0000
##
## Host_listing_count Day_available_per_year
## Min. : 1.00 Min. : 0.0
## 1st Qu.: 2.00 1st Qu.: 54.0
## Median : 9.00 Median :260.0
## Mean : 40.61 Mean :208.7
## 3rd Qu.: 48.00 3rd Qu.:355.0
## Max. :274.00 Max. :365.0
##
The distribution of Singapore Airbnb prices is presented as a boxplot.
# Store the graph
box_plot <- ggplot(airbnb, aes(y = Price))
# Add the geometric object box plot
box_plot +
geom_boxplot() +coord_flip()+ggtitle("Overall Price Boxplot")
75% of Singapore Airbnb listings are priced at or below SGD 199.
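The quartiles reported by `summary(airbnb)` also determine where the boxplot whiskers can end. A short worked calculation, assuming the default 1.5 × IQR rule used by `geom_boxplot()`:

```r
# Quartiles of Price taken from the summary of the cleaned data
q1 <- 65
q3 <- 199

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # points above this are drawn as outliers
lower_fence <- q1 - 1.5 * iqr   # negative here, so no low-price outliers

c(iqr = iqr, upper_fence = upper_fence, lower_fence = lower_fence)
```

Under this rule, listings priced above SGD 400 appear as individual outlier points, which is why the maximum price of SGD 10000 stretches the plot so far to the right.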
# Store the graph
box_plot <- ggplot(airbnb, aes(x = Region,y = Price))
# Add the geometric object box plot
box_plot +
geom_boxplot() +coord_flip()+ggtitle("Boxplot of Price by Region")
The highest-priced Singapore Airbnb listings are located in the West Region and Central Region. All listings in the North Region are priced below SGD 1250.
The relationship between room type and price is presented.
freq_room <- airbnb %>%
count(Room_type)
freq_room <- freq_room %>%
arrange(desc(Room_type))
options(repr.plot.width=14, repr.plot.height=6)
plot1 <- ggplot(freq_room, aes(x="", y=n, fill=Room_type)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0)+theme_void() +
geom_text(aes(label = paste(round(n / sum(n) * 100,
1), "%")),colour = 'white',position =
position_stack(vjust = 0.5))+
ggtitle("Pie chart of Room Types")
avg_price_host <- airbnb %>%
group_by(Room_type) %>%
summarise(avg_price= mean(Price),.groups ='drop')
plot2 <-ggplot(avg_price_host, aes(x=reorder(Room_type, -avg_price), y=avg_price, fill="violet"))+
geom_col(aes(fill=avg_price),width = 1)+
ggtitle("Average price by room type")+coord_flip()+
scale_y_continuous(limits=c(0, 300))+
geom_label(mapping = aes(label = round(avg_price,
1)), size = 4, fill = "#F5FFFA", fontface = "bold")
plot_grid(plot1, plot2, ncol=2, nrow=1,rel_widths = c(1, 1))
More than half of the listings in Singapore are entire homes/apartments, and only 5% are shared rooms.
The average price is SGD 65.7 for a shared room, SGD 110.9 for a private room and SGD 227.1 for an entire home/apartment.
The relationship between region and price is presented.
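The room type shares can be checked directly from the counts reported by `summary(airbnb)` on the cleaned data. A minimal sketch (`room_counts` and `room_share` are illustrative names):

```r
# Room type counts taken from the summary of the cleaned data
room_counts <- c("Entire home/apt" = 4131, "Private room" = 3381, "Shared room" = 394)

# Percentage share of each room type, rounded to one decimal place
room_share <- round(prop.table(room_counts) * 100, 1)
room_share
```

The counts sum to 7906, the number of rows in the cleaned data, so the shares cover every listing.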
freq_region <- airbnb %>%
count(Region)
freq_region <- freq_region %>%
arrange(desc(Region))
options(repr.plot.width=14, repr.plot.height=6)
plot2_1 <- ggplot(freq_region, aes(x="", y=n, fill=Region)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0)+theme_void() + geom_text(aes(label = paste(round(n / sum(n) * 100, 1), "%")),colour = 'white',position = position_stack(vjust = 0.5))+ggtitle("Pie chart of Region Types")
avg_price_region <- airbnb %>%
group_by(Region) %>%
summarise(avg_price= mean(Price),.groups ='drop')
plot2_2 <-ggplot(avg_price_region, aes(x=reorder(Region, -avg_price), y=avg_price, fill="violet"))+
geom_col(aes(fill=avg_price),width = 1) +coord_flip()+
geom_label(mapping = aes(label = round(avg_price, 1)), size = 4, fill = "#F5FFFA", fontface = "bold")+ scale_y_continuous(limits=c(0, 300))
plot_grid(plot2_1, plot2_2, ncol=2, nrow=1,rel_widths = c(1, 1))
79.8% (6308 units) of Singapore Airbnb listings are located in the Central Region, while the North Region has the fewest, with only 204 units.
The North-East Region has the lowest average price, SGD 99.8, while the West Region and Central Region have the highest, SGD 176 and SGD 176.7 respectively.
The 10 most expensive and 10 cheapest Singapore Airbnb neighbourhoods are identified.
# Mean price per neighbourhood, keeping Region as a label
top_10_neighbourhood <- aggregate(list(airbnb$Price), list(airbnb$Neighbourhood, airbnb$Region), mean)
colnames(top_10_neighbourhood) <- c("Neighbourhood", "Region","Average_price_per_neighborhood")
# Sort ascending by average price
top_10_neighbourhood <- top_10_neighbourhood[order(top_10_neighbourhood$Average_price_per_neighborhood),]
# Keep the 12 highest averages, then drop the 2 highest of those
top_10_neighbourhood <- tail(top_10_neighbourhood, 12)
top_10_neighbourhood <- head(top_10_neighbourhood, 10)
# Label rows from 10 (lowest shown) to 1 (highest shown)
row.names(top_10_neighbourhood) <- 10:1
top_10_neighbourhood
## Neighbourhood Region Average_price_per_neighborhood
## 10 Newton Central Region 188.7463
## 9 Rochor Central Region 189.1458
## 8 Singapore River Central Region 189.9371
## 7 Tanglin Central Region 201.2762
## 6 Downtown Core Central Region 205.3949
## 5 Bukit Batok West Region 206.1692
## 4 Museum Central Region 236.3175
## 3 Orchard Central Region 291.0294
## 2 Bukit Panjang West Region 365.3529
## 1 Marina South Central Region 419.0000
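For comparison, a grouped summary followed by `slice_max()` and `slice_min()` (dplyr 1.0.0 and later) is one compact way to rank groups by average price. This is a sketch on hypothetical data and does not reproduce the `tail()`/`head()` selection used in this report; `toy`, `avg_price`, `most_expensive` and `cheapest` are illustrative names:

```r
library(dplyr)

# Hypothetical listings, for illustration only
toy <- data.frame(
  Neighbourhood = c("A", "A", "B", "B", "C", "C", "D"),
  Price = c(100, 200, 50, 70, 400, 420, 90),
  stringsAsFactors = FALSE
)

# Average price per neighbourhood
avg_price <- toy %>%
  group_by(Neighbourhood) %>%
  summarise(Average_price = mean(Price), .groups = "drop")

# Two most expensive and two cheapest neighbourhoods by average price
most_expensive <- avg_price %>% slice_max(Average_price, n = 2)
cheapest <- avg_price %>% slice_min(Average_price, n = 2)

most_expensive
cheapest
```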
options(repr.plot.width=15, repr.plot.height=11)
plot3 <- ggplot(data = top_10_neighbourhood, mapping = aes(x = reorder(Neighbourhood, -Average_price_per_neighborhood), y = Average_price_per_neighborhood)) +
geom_bar(stat = "identity", mapping = aes(fill = Region, color = Region), alpha = .8, size = 1.5) +
coord_flip() +
geom_label(mapping = aes(label = round(Average_price_per_neighborhood, 1)), size = 4, fill = "#F5FFFA", fontface = "bold") + ggtitle("Top 10 most expensive Airbnb neighbourhood in Singapore")
plot3
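Most of the expensive neighbourhoods shown are in the Central Region, and Marina South is the only one of the ten with an average price above SGD 400. Note that the code keeps the 12 highest averages and then drops the top 2, so the two neighbourhoods with the highest average prices are not shown.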
top_10_neighbourhood_2 <- aggregate(list(airbnb$Price), list(airbnb$Neighbourhood, airbnb$Region), mean)
colnames(top_10_neighbourhood_2) <- c("Neighbourhood","Region", "Average_price_per_neighborhood")
top_10_neighbourhood_2 <- top_10_neighbourhood_2[order(top_10_neighbourhood_2$Average_price_per_neighborhood),]
top_10_neighbourhood_2 <- head(top_10_neighbourhood_2, 10)
row.names(top_10_neighbourhood_2) <- 10:1
top_10_neighbourhood_2
## Neighbourhood Region Average_price_per_neighborhood
## 10 Western Water Catchment West Region 46.25000
## 9 Mandai North Region 56.66667
## 8 Lim Chu Kang North Region 65.00000
## 7 Sengkang North-East Region 74.85075
## 6 Woodlands North Region 81.49254
## 5 Punggol North-East Region 85.74419
## 4 Sembawang North Region 88.26829
## 3 Jurong West West Region 91.04575
## 2 Serangoon North-East Region 91.17391
## 1 Choa Chu Kang West Region 93.31746
options(repr.plot.width=15, repr.plot.height=11)
plot4 <- ggplot(data = top_10_neighbourhood_2, mapping = aes(x = reorder(Neighbourhood, -Average_price_per_neighborhood), y = Average_price_per_neighborhood)) +
geom_bar(stat = "identity", mapping = aes(fill = Region, color = Region), alpha = .8, size = 1.5) +
geom_label(mapping = aes(label = round(Average_price_per_neighborhood, 1)), size = 4, fill = "#F5FFFA", fontface = "bold") +
coord_flip() +
ggtitle("Top 10 cheapest Airbnb neighbourhood in Singapore")
plot4
Travelers can find cheap Airbnb listings in neighbourhoods across the North-East Region, North Region and West Region, especially in Western Water Catchment, where the average price is only SGD 46.2.
df_map <- aggregate(list(airbnb$Price), list(airbnb$Day_available_per_year), mean)
colnames(df_map) <- c("Availability", "Average_price_per_availability")
ggplot(data = df_map, mapping = aes(y = Average_price_per_availability, x = Availability, color = Average_price_per_availability)) +
theme_minimal() +
scale_fill_identity() +
geom_line(mapping = aes(color = Average_price_per_availability)) +
ggtitle("Relationship between availability and price of Airbnb")
Listings available 82 days a year have the highest average price, SGD 830, while listings available 185 days a year average only SGD 21.
The distribution of Airbnb listings across Singapore is mapped.
ggplot(data = airbnb, mapping = aes(x = Longitude, y = Latitude, color = Region)) +
theme_minimal() +
scale_fill_identity() +
geom_point(mapping = aes(color = Region), size = 3) +
ggtitle("Airbnb in Singapore")
Plotting the longitude and latitude of each listing shows how Airbnb listings are distributed across the regions.
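Maps are conventionally drawn with longitude on the horizontal axis and latitude on the vertical axis. A minimal sketch using four coordinates taken from the listing table, with ggplot2's `coord_quickmap()` added as an optional way to keep an approximately correct aspect ratio (`toy` and `map_plot` are illustrative names):

```r
library(ggplot2)

# Four listings from the first rows of the data, for illustration only
toy <- data.frame(
  Latitude  = c(1.44255, 1.33235, 1.34541, 1.28875),
  Longitude = c(103.7958, 103.7852, 103.9571, 103.8081),
  Region    = c("North Region", "Central Region", "East Region", "Central Region")
)

map_plot <- ggplot(toy, aes(x = Longitude, y = Latitude, color = Region)) +
  geom_point(size = 3) +
  coord_quickmap() +   # approximate projection that preserves the aspect ratio
  theme_minimal() +
  ggtitle("Airbnb in Singapore (sample of four listings)")

map_plot
```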
Location of Airbnb with price < SGD 200 is presented.
df_map_1 <- airbnb %>%
filter(Price <200)
ggplot(data = df_map_1, mapping = aes(x = Longitude, y = Latitude, color = Price)) +
theme_minimal() +
scale_fill_identity() +
geom_point(mapping = aes(color = Price), size = 3) +
ggtitle("Location of Airbnb with price < SGD 200")+scale_color_gradient(low="blue", high="red")
Location of Airbnb with price > SGD 200 is presented.
df_map2 <- airbnb %>%
filter(Price >200)
ggplot(data = df_map2, mapping = aes(x = Longitude, y = Latitude, color = Price)) +
theme_minimal() +
scale_fill_identity() +
geom_point(mapping = aes(color = Price), size = 3) +
ggtitle("Location of Airbnb with price > SGD 200")+scale_color_gradient(low="blue", high="red")
A word cloud of Airbnb listing names is presented.
text <- airbnb$Name
docs <- Corpus(VectorSource(text))
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
docs <- tm_map(docs, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
docs <- tm_map(docs, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
set.seed(1234) # for reproducibility
wordcloud(words = df$word, freq = df$freq, min.freq = 1,max.words=100, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))
Listing names often use ‘near’, ‘mrt’, ‘city’ or ‘min’, words that signal convenient transport. Words such as ‘room’, ‘bedroom’, ‘spacious’, ‘cosy’ and ‘cozy’ are also used to describe the accommodation and catch attention.
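The frequencies behind a word cloud can be approximated with base R alone. The sketch below uses four listing names from the table of the first 20 rows; unlike the tm pipeline it replaces punctuation with spaces and does not remove stop words, so counts may differ slightly (`names_sample`, `tokens` and `word_freq` are illustrative names):

```r
# A few listing names taken from the first 20 rows of the data
names_sample <- c("Pleasant Room along Bukit Timah",
                  "B&B Room 1 near Airport & EXPO",
                  "Room 2-near Airport & EXPO",
                  "Budget short stay room near EXPO")

# Lower-case, then replace every run of non-letters with a single space
clean <- gsub("[^a-z]+", " ", tolower(names_sample))

# Split on spaces to get individual words
tokens <- unlist(strsplit(trimws(clean), " +"))

# Count words and sort by decreasing frequency
word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq, 3)
```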
Distribution of host listing count is presented.
paste('There are' ,length(unique(airbnb$Host_ID)), 'Hosts for Singapore Airbnb.')
## [1] "There are 2705 Hosts for Singapore Airbnb."
airbnb_host <-distinct(airbnb, Host_ID, .keep_all = TRUE)
breaks <- c(0,25,50,75,100,125,150,175,200,225,250,275,300)
# specify interval/bin labels
tags <- c("[0,25)","[25,50)","[50,75)","[75,100)","[100,125)","[125,150)","[150,175)","[175,200)","[200,225)","[225,250)","[250,275)","[275,300]")
# bucketing values into bins
group_tags <- cut(airbnb_host$Host_listing_count,
breaks=breaks,
include.lowest=TRUE,
right=FALSE,
labels=tags)
ggplot(data = as_tibble(group_tags), mapping = aes(x=value)) +
geom_bar(fill="bisque",color="white") +
stat_count(geom="text", aes(label=..count..), vjust=-0.5) +
labs(x='Host Listing Count') +
ggtitle('Host Listing Count')+
theme_minimal()
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
paste('Most number of host listing in Singapore is ' ,getmode(airbnb_host$Host_listing_count),'while Maximum host listing in Singapore is' ,max(airbnb_host$Host_listing_count),'.')
## [1] "Most number of host listing in Singapore is 1 while Maximum host listing in Singapore is 274 ."
98% of hosts manage fewer than 25 listings.
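The share of hosts below a given listing count can be computed directly instead of being read off the bar chart. The sketch below is a minimal illustration; the vector `host_listing_count` holds made-up values standing in for `airbnb_host$Host_listing_count`, so its result is not the figure reported above.

```r
# Hypothetical listing counts per host (stand-in for airbnb_host$Host_listing_count)
host_listing_count <- c(1, 1, 2, 1, 3, 7, 1, 30, 2, 274)

# mean() of a logical vector gives the proportion of TRUE values,
# here the proportion of hosts with fewer than 25 listings
share_below_25 <- mean(host_listing_count < 25)
print(share_below_25)
```

Applying the same expression to the real column would give the exact proportion behind the 98% statement.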
Correlation plot is presented.
#One Hot encoding for room type, region and neighbourhood
df_ohe <- airbnb %>%
select(Room_type, Region,Neighbourhood)
dummy <- dummyVars(" ~ .", data=df_ohe)
newdata <- data.frame(predict(dummy, newdata = df_ohe))
df2 <- merge(airbnb, newdata, by=0)
names(newdata)
## [1] "Room_type.Entire.home.apt"
## [2] "Room_type.Private.room"
## [3] "Room_type.Shared.room"
## [4] "Region.Central.Region"
## [5] "Region.East.Region"
## [6] "Region.North.East.Region"
## [7] "Region.North.Region"
## [8] "Region.West.Region"
## [9] "Neighbourhood.Ang.Mo.Kio"
## [10] "Neighbourhood.Bedok"
## [11] "Neighbourhood.Bishan"
## [12] "Neighbourhood.Bukit.Batok"
## [13] "Neighbourhood.Bukit.Merah"
## [14] "Neighbourhood.Bukit.Panjang"
## [15] "Neighbourhood.Bukit.Timah"
## [16] "Neighbourhood.Central.Water.Catchment"
## [17] "Neighbourhood.Choa.Chu.Kang"
## [18] "Neighbourhood.Clementi"
## [19] "Neighbourhood.Downtown.Core"
## [20] "Neighbourhood.Geylang"
## [21] "Neighbourhood.Hougang"
## [22] "Neighbourhood.Jurong.East"
## [23] "Neighbourhood.Jurong.West"
## [24] "Neighbourhood.Kallang"
## [25] "Neighbourhood.Lim.Chu.Kang"
## [26] "Neighbourhood.Mandai"
## [27] "Neighbourhood.Marina.South"
## [28] "Neighbourhood.Marine.Parade"
## [29] "Neighbourhood.Museum"
## [30] "Neighbourhood.Newton"
## [31] "Neighbourhood.Novena"
## [32] "Neighbourhood.Orchard"
## [33] "Neighbourhood.Outram"
## [34] "Neighbourhood.Pasir.Ris"
## [35] "Neighbourhood.Punggol"
## [36] "Neighbourhood.Queenstown"
## [37] "Neighbourhood.River.Valley"
## [38] "Neighbourhood.Rochor"
## [39] "Neighbourhood.Sembawang"
## [40] "Neighbourhood.Sengkang"
## [41] "Neighbourhood.Serangoon"
## [42] "Neighbourhood.Singapore.River"
## [43] "Neighbourhood.Southern.Islands"
## [44] "Neighbourhood.Sungei.Kadut"
## [45] "Neighbourhood.Tampines"
## [46] "Neighbourhood.Tanglin"
## [47] "Neighbourhood.Toa.Payoh"
## [48] "Neighbourhood.Tuas"
## [49] "Neighbourhood.Western.Water.Catchment"
## [50] "Neighbourhood.Woodlands"
## [51] "Neighbourhood.Yishun"
df_cor <- df2 %>%
select(Price, Latitude, Longitude, Minimum_night, Review_count, Review_per_month, Host_listing_count, Day_available_per_year, Room_type.Entire.home.apt, Room_type.Private.room, Room_type.Shared.room, Region.Central.Region, Region.East.Region, Region.North.East.Region, Region.North.Region, Region.West.Region)
M <-cor(df_cor)
corrplot.mixed(M)
Based on the correlation plot, there is a moderate positive linear relationship (0.36) between host listing count and the entire-home room type. Host listing count has a moderate negative linear relationship (-0.33) with the private-room type.
The Name, Latitude, Longitude and Host_ID attributes are removed for regression, and the reduced data set is assigned to airbnb_lm.
airbnb_lm <- airbnb %>%
select(-c(Name,Latitude,Longitude,Host_ID))
Train-test split is carried out with 70-30 split ratio.
set.seed(1000)
for_splitting <- sample.split(Y = airbnb_lm$Price, SplitRatio = 0.7)
airbnb_train <- subset(airbnb_lm, for_splitting == TRUE)
airbnb_test <- subset(airbnb_lm, for_splitting == FALSE)
A sanity check confirms that the training and test sets together contain every row of airbnb_lm.
nrow(airbnb_train) + nrow(airbnb_test) == nrow(airbnb_lm)
## [1] TRUE
Due to the presence of extreme price outliers, two training sets are created.
The full training set, including outliers, is kept as airbnb_train.
Listings priced between the 10th and 90th percentiles are assigned to airbnb_train_without_outlier.
airbnb_train_without_outlier <- airbnb_train %>%
filter(Price <= quantile(airbnb_train$Price, 0.9) &
Price >= quantile(airbnb_train$Price, 0.1))
The price variances of airbnb_train and airbnb_train_without_outlier are calculated.
var(airbnb_train$Price)
## [1] 114252
var(airbnb_train_without_outlier$Price)
## [1] 4834.207
The training set without outliers has a far lower price variance than the full training set.
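It is worth keeping in mind that a 10th-to-90th percentile filter removes far more than the extreme values: judging by the degrees of freedom in the model summaries further on, about 81% of the training rows remain (4,507 of 5,542). The following sketch illustrates the effect on variance. The simulated `price` vector is an assumption made for illustration and is not the Airbnb data.

```r
set.seed(1)
# Hypothetical right-skewed prices with a few extreme values
# (stand-in for airbnb_train$Price)
price <- c(rlnorm(500, meanlog = 4.8, sdlog = 0.6), 5000, 8000, 10000)

# Keep only prices between the 10th and 90th percentiles
lower <- quantile(price, 0.1)
upper <- quantile(price, 0.9)
price_trimmed <- price[price >= lower & price <= upper]

# Compare the spread and the number of rows before and after trimming
print(c(var_full = var(price), var_trimmed = var(price_trimmed)))
print(c(n_full = length(price), n_trimmed = length(price_trimmed)))
```

A narrower filter, or a rule based on the interquartile range, would discard fewer ordinary listings; which is preferable depends on how the model is meant to be used.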
Due to the presence of extreme price outliers, two test sets are created.
The full test set, including outliers, is kept as airbnb_test.
Listings priced between the 10th and 90th percentiles are assigned to airbnb_test_without_outlier.
airbnb_test_without_outlier <- airbnb_test %>%
filter(Price <= quantile(airbnb_test$Price, 0.9) & Price >= quantile(airbnb_test$Price, 0.1))
The price variances of airbnb_test and airbnb_test_without_outlier are calculated.
var(airbnb_test$Price)
## [1] 119238.2
var(airbnb_test_without_outlier$Price)
## [1] 4321.204
The test set without outliers has a far lower price variance than the full test set.
First linear regression model is modeled.
first_model <- lm(Price ~ .,data = airbnb_train)
#The results are summarized.
summary(first_model)
##
## Call:
## lm(formula = Price ~ ., data = airbnb_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1510.9 -74.8 -32.1 16.9 9864.7
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 259.60313 40.39281 6.427 1.41e-10 ***
## Room_typePrivate room -126.97297 10.10227 -12.569 < 2e-16 ***
## Room_typeShared room -182.69957 21.92893 -8.331 < 2e-16 ***
## RegionEast Region -25.68336 65.79844 -0.390 0.69630
## RegionNorth-East Region -40.05534 67.39826 -0.594 0.55233
## RegionNorth Region -46.01364 65.21698 -0.706 0.48050
## RegionWest Region -123.92895 166.26607 -0.745 0.45608
## NeighbourhoodBedok 23.97847 56.33675 0.426 0.67040
## NeighbourhoodBishan 9.64872 64.50782 0.150 0.88111
## NeighbourhoodBukit Batok 231.67091 168.48860 1.375 0.16919
## NeighbourhoodBukit Merah -46.00465 43.28594 -1.063 0.28792
## NeighbourhoodBukit Panjang 530.53617 176.17879 3.011 0.00261 **
## NeighbourhoodBukit Timah -5.75903 51.55522 -0.112 0.91106
## NeighbourhoodCentral Water Catchment 52.84269 82.97585 0.637 0.52425
## NeighbourhoodChoa Chu Kang 43.22766 169.12628 0.256 0.79827
## NeighbourhoodClementi 117.29512 165.85913 0.707 0.47947
## NeighbourhoodDowntown Core 18.05198 44.03093 0.410 0.68183
## NeighbourhoodGeylang -33.99759 41.43057 -0.821 0.41191
## NeighbourhoodHougang 10.62450 65.06035 0.163 0.87029
## NeighbourhoodJurong East 166.58908 165.42278 1.007 0.31395
## NeighbourhoodJurong West 78.13475 164.62219 0.475 0.63507
## NeighbourhoodKallang 2.96763 41.32580 0.072 0.94276
## NeighbourhoodLim Chu Kang 22.55857 327.42830 0.069 0.94507
## NeighbourhoodMandai -48.94909 233.99969 -0.209 0.83431
## NeighbourhoodMarina South 290.08574 325.11929 0.892 0.37230
## NeighbourhoodMarine Parade -43.46109 49.48241 -0.878 0.37981
## NeighbourhoodMuseum 39.63346 60.26470 0.658 0.51079
## NeighbourhoodNewton 2.22475 51.53557 0.043 0.96557
## NeighbourhoodNovena -25.42157 43.17764 -0.589 0.55604
## NeighbourhoodOrchard 78.55282 51.27233 1.532 0.12556
## NeighbourhoodOutram -19.40417 43.31624 -0.448 0.65420
## NeighbourhoodPasir Ris -29.00293 69.35648 -0.418 0.67584
## NeighbourhoodPunggol -35.18246 81.81577 -0.430 0.66720
## NeighbourhoodQueenstown -41.21073 46.11625 -0.894 0.37156
## NeighbourhoodRiver Valley -28.31951 44.67385 -0.634 0.52616
## NeighbourhoodRochor 26.01482 42.86234 0.607 0.54392
## NeighbourhoodSembawang -23.57312 77.03179 -0.306 0.75960
## NeighbourhoodSengkang -27.22713 72.53592 -0.375 0.70741
## NeighbourhoodSerangoon -9.55875 71.74093 -0.133 0.89401
## NeighbourhoodSingapore River 23.45483 49.58637 0.473 0.63623
## NeighbourhoodSouthern Islands 1509.45683 101.38815 14.888 < 2e-16 ***
## NeighbourhoodSungei Kadut -3.41313 234.26102 -0.015 0.98838
## NeighbourhoodTampines NA NA NA NA
## NeighbourhoodTanglin -13.72596 47.12134 -0.291 0.77084
## NeighbourhoodToa Payoh NA NA NA NA
## NeighbourhoodWestern Water Catchment NA NA NA NA
## NeighbourhoodWoodlands -23.45157 69.00008 -0.340 0.73396
## NeighbourhoodYishun NA NA NA NA
## Minimum_night 0.04359 0.10655 0.409 0.68250
## Review_count -0.22768 0.20509 -1.110 0.26698
## Review_per_month -10.79017 5.39619 -2.000 0.04559 *
## Host_listing_count -0.35131 0.07896 -4.449 8.78e-06 ***
## Day_available_per_year 0.03237 0.03189 1.015 0.31013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 322.7 on 5493 degrees of freedom
## Multiple R-squared: 0.09672, Adjusted R-squared: 0.08882
## F-statistic: 12.25 on 48 and 5493 DF, p-value: < 2.2e-16
The first model fits poorly. The median residual is -32.1 when it should be near 0, and the adjusted R-squared is only 0.08882.
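The note "(4 not defined because of singularities)" in the summary is consistent with Neighbourhood being nested within Region: if every neighbourhood lies in exactly one region, the region dummies are linear combinations of the neighbourhood dummies and some coefficients cannot be estimated. The toy example below reproduces the effect; the data frame `toy` and its values are hypothetical.

```r
# Toy data: each neighbourhood belongs to exactly one region (hypothetical values)
toy <- data.frame(
  region        = factor(rep(c("Central", "East"), each = 6)),
  neighbourhood = factor(rep(c("A", "B", "C", "D"), each = 3)),
  price         = c(210, 190, 205, 150, 160, 155, 120, 110, 115, 90, 95, 100)
)

fit <- lm(price ~ region + neighbourhood, data = toy)

# Coefficients that cannot be estimated are reported as NA
print(coef(fit))
n_aliased <- sum(is.na(coef(fit)))
print(n_aliased)
```

Dropping one of the two nested factors from the formula removes the redundancy without changing the fitted values.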
First linear regression model is plotted.
plot(first_model)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
The first linear regression model does not satisfy the linear model assumptions: the points in the normal Q-Q plot should lie close to a straight line, and they do not.
Because the first model fits poorly, it will not be used to predict new prices.
The second linear regression model is fitted.
A logarithmic transformation of Price is introduced in the second model, and airbnb_train_without_outlier is used so that outliers are excluded.
second_model <- lm(log(Price) ~ ., data = airbnb_train_without_outlier)
#The results are summarized.
summary(second_model)
##
## Call:
## lm(formula = log(Price) ~ ., data = airbnb_train_without_outlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24213 -0.29154 -0.04794 0.26085 1.42413
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.929e+00 4.978e-02 98.998 < 2e-16 ***
## Room_typePrivate room -6.848e-01 1.300e-02 -52.677 < 2e-16 ***
## Room_typeShared room -8.246e-01 4.524e-02 -18.229 < 2e-16 ***
## RegionEast Region 7.450e-02 8.703e-02 0.856 0.392065
## RegionNorth-East Region 1.033e-01 8.285e-02 1.247 0.212318
## RegionNorth Region 1.437e-01 8.798e-02 1.634 0.102352
## RegionWest Region -4.409e-01 2.733e-01 -1.613 0.106831
## NeighbourhoodBedok 1.236e-01 7.646e-02 1.616 0.106081
## NeighbourhoodBishan 7.103e-02 8.454e-02 0.840 0.400841
## NeighbourhoodBukit Batok 5.465e-01 2.758e-01 1.982 0.047576 *
## NeighbourhoodBukit Merah 1.791e-02 5.341e-02 0.335 0.737308
## NeighbourhoodBukit Panjang 3.967e-01 2.890e-01 1.373 0.169844
## NeighbourhoodBukit Timah 1.169e-01 6.586e-02 1.775 0.076006 .
## NeighbourhoodCentral Water Catchment -5.614e-02 1.093e-01 -0.514 0.607420
## NeighbourhoodChoa Chu Kang 2.924e-01 2.775e-01 1.054 0.292122
## NeighbourhoodClementi 4.944e-01 2.742e-01 1.803 0.071397 .
## NeighbourhoodDowntown Core 3.215e-01 5.477e-02 5.871 4.64e-09 ***
## NeighbourhoodGeylang 6.685e-02 5.108e-02 1.309 0.190672
## NeighbourhoodHougang -1.279e-01 8.398e-02 -1.523 0.127761
## NeighbourhoodJurong East 5.497e-01 2.727e-01 2.016 0.043899 *
## NeighbourhoodJurong West 4.044e-01 2.724e-01 1.484 0.137818
## NeighbourhoodKallang 8.707e-02 5.128e-02 1.698 0.089617 .
## NeighbourhoodLim Chu Kang -2.475e-01 3.883e-01 -0.637 0.523967
## NeighbourhoodMandai -5.958e-01 3.860e-01 -1.544 0.122716
## NeighbourhoodMarine Parade 1.428e-01 6.096e-02 2.342 0.019212 *
## NeighbourhoodMuseum 1.739e-01 7.920e-02 2.196 0.028140 *
## NeighbourhoodNewton 1.928e-01 6.496e-02 2.969 0.003008 **
## NeighbourhoodNovena 1.799e-01 5.309e-02 3.388 0.000709 ***
## NeighbourhoodOrchard 3.940e-01 7.078e-02 5.567 2.75e-08 ***
## NeighbourhoodOutram 1.602e-01 5.380e-02 2.978 0.002917 **
## NeighbourhoodPasir Ris 1.731e-02 9.315e-02 0.186 0.852587
## NeighbourhoodPunggol -1.910e-01 1.023e-01 -1.867 0.061922 .
## NeighbourhoodQueenstown 1.266e-01 5.708e-02 2.217 0.026661 *
## NeighbourhoodRiver Valley 7.634e-02 5.520e-02 1.383 0.166757
## NeighbourhoodRochor 1.986e-01 5.326e-02 3.728 0.000195 ***
## NeighbourhoodSembawang -2.576e-01 1.104e-01 -2.333 0.019698 *
## NeighbourhoodSengkang -2.365e-01 9.438e-02 -2.506 0.012249 *
## NeighbourhoodSerangoon -2.386e-02 8.939e-02 -0.267 0.789549
## NeighbourhoodSingapore River 3.332e-01 6.736e-02 4.946 7.86e-07 ***
## NeighbourhoodSouthern Islands 4.113e-01 3.831e-01 1.074 0.283083
## NeighbourhoodSungei Kadut -5.566e-01 2.780e-01 -2.002 0.045327 *
## NeighbourhoodTampines NA NA NA NA
## NeighbourhoodTanglin 1.926e-01 5.835e-02 3.302 0.000969 ***
## NeighbourhoodToa Payoh NA NA NA NA
## NeighbourhoodWestern Water Catchment NA NA NA NA
## NeighbourhoodWoodlands -2.270e-01 9.904e-02 -2.292 0.021964 *
## NeighbourhoodYishun NA NA NA NA
## Minimum_night -1.548e-03 1.497e-04 -10.345 < 2e-16 ***
## Review_count -7.201e-04 2.698e-04 -2.669 0.007633 **
## Review_per_month -1.779e-02 7.182e-03 -2.477 0.013272 *
## Host_listing_count -2.867e-04 9.854e-05 -2.910 0.003635 **
## Day_available_per_year 4.865e-04 4.182e-05 11.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 4459 degrees of freedom
## Multiple R-squared: 0.4937, Adjusted R-squared: 0.4884
## F-statistic: 92.53 on 47 and 4459 DF, p-value: < 2.2e-16
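The second model is an improvement. Its median residual of -0.04794 is in log-price units, so it cannot be compared directly with the -32.1 (in SGD) of the first model, but the residuals are now roughly symmetric around 0 (quartiles -0.29154 and 0.26085) and the adjusted R-squared rises from 0.08882 to 0.4884.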
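Because the response is log(Price), each coefficient describes a multiplicative effect on price. A coefficient b corresponds to a change of (exp(b) - 1) × 100 percent, holding the other predictors fixed. The short calculation below applies this to the two room-type coefficients copied from the summary above; the helper name `pct_change` is introduced here for illustration only.

```r
# Coefficients copied from summary(second_model)
coef_private <- -0.6848  # Room_typePrivate room
coef_shared  <- -0.8246  # Room_typeShared room

# Convert a log-scale coefficient into a percentage change in price
pct_change <- function(beta) (exp(beta) - 1) * 100

print(round(pct_change(c(private = coef_private, shared = coef_shared)), 1))
```

Under this model a private room is priced roughly 50% below an otherwise comparable entire home or apartment (the reference level), and a shared room roughly 56% below.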
Second linear regression model is plotted.
plot(second_model)
## Warning: not plotting observations with leverage one:
## 1771, 2537, 4468
Normal Q-Q plot for second model looks much better than the first model.
Backward stepwise model using first linear regression model is modeled.
backward_first_model <- step(first_model, direction = 'backward')
## Start: AIC=64076.29
## Price ~ Room_type + Region + Neighbourhood + Minimum_night +
## Review_count + Review_per_month + Host_listing_count + Day_available_per_year
##
##
## Step: AIC=64076.29
## Price ~ Room_type + Neighbourhood + Minimum_night + Review_count +
## Review_per_month + Host_listing_count + Day_available_per_year
##
## Df Sum of Sq RSS AIC
## - Minimum_night 1 17421 571858565 64074
## - Day_available_per_year 1 107260 571948404 64075
## - Review_count 1 128300 571969445 64076
## <none> 571841144 64076
## - Review_per_month 1 416244 572257388 64078
## - Host_listing_count 1 2060880 573902025 64094
## - Room_type 2 19424357 591265501 64257
## - Neighbourhood 41 36765785 608606929 64340
##
## Step: AIC=64074.46
## Price ~ Room_type + Neighbourhood + Review_count + Review_per_month +
## Host_listing_count + Day_available_per_year
##
## Df Sum of Sq RSS AIC
## - Review_count 1 125861 571984426 64074
## - Day_available_per_year 1 126548 571985113 64074
## <none> 571858565 64074
## - Review_per_month 1 444011 572302576 64077
## - Host_listing_count 1 2085299 573943864 64093
## - Room_type 2 19455170 591313735 64256
## - Neighbourhood 41 36748742 608607306 64338
##
## Step: AIC=64073.68
## Price ~ Room_type + Neighbourhood + Review_per_month + Host_listing_count +
## Day_available_per_year
##
## Df Sum of Sq RSS AIC
## - Day_available_per_year 1 123708 572108134 64073
## <none> 571984426 64074
## - Review_per_month 1 1494308 573478734 64086
## - Host_listing_count 1 2064242 574048668 64092
## - Room_type 2 19579523 591563949 64256
## - Neighbourhood 41 36850650 608835076 64338
##
## Step: AIC=64072.88
## Price ~ Room_type + Neighbourhood + Review_per_month + Host_listing_count
##
## Df Sum of Sq RSS AIC
## <none> 572108134 64073
## - Review_per_month 1 1550015 573658149 64086
## - Host_listing_count 1 1940915 574049049 64090
## - Room_type 2 19463233 591571367 64254
## - Neighbourhood 41 36935412 609043546 64338
The results are summarized.
summary(backward_first_model)
##
## Call:
## lm(formula = Price ~ Room_type + Neighbourhood + Review_per_month +
## Host_listing_count, data = airbnb_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1498.7 -74.8 -32.6 17.0 9856.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 225.73667 55.13042 4.095 4.29e-05 ***
## Room_typePrivate room -126.50969 10.07651 -12.555 < 2e-16 ***
## Room_typeShared room -181.31937 21.67952 -8.364 < 2e-16 ***
## NeighbourhoodBedok 37.74554 58.24043 0.648 0.516948
## NeighbourhoodBishan 49.31144 74.70493 0.660 0.509228
## NeighbourhoodBukit Batok 147.92957 72.72496 2.034 0.041990 *
## NeighbourhoodBukit Merah -4.91808 57.42986 -0.086 0.931759
## NeighbourhoodBukit Panjang 446.80519 89.07189 5.016 5.44e-07 ***
## NeighbourhoodBukit Timah 33.39223 63.84018 0.523 0.600954
## NeighbourhoodCentral Water Catchment 47.17123 84.78358 0.556 0.577979
## NeighbourhoodChoa Chu Kang -41.25912 74.25672 -0.556 0.578488
## NeighbourhoodClementi 34.56652 66.40561 0.521 0.602711
## NeighbourhoodDowntown Core 58.90036 58.24778 1.011 0.311964
## NeighbourhoodGeylang 5.18932 56.22030 0.092 0.926460
## NeighbourhoodHougang 11.51088 65.04173 0.177 0.859533
## NeighbourhoodJurong East 82.67947 65.24959 1.267 0.205164
## NeighbourhoodJurong West -5.38999 63.14389 -0.085 0.931978
## NeighbourhoodKallang 44.29926 56.03126 0.791 0.429202
## NeighbourhoodLim Chu Kang 20.91386 327.82192 0.064 0.949135
## NeighbourhoodMandai -50.53253 234.58201 -0.215 0.829452
## NeighbourhoodMarina South 323.79501 327.22511 0.990 0.322454
## NeighbourhoodMarine Parade -7.18550 62.26071 -0.115 0.908124
## NeighbourhoodMuseum 82.18940 71.23612 1.154 0.248649
## NeighbourhoodNewton 43.67768 63.83073 0.684 0.493831
## NeighbourhoodNovena 15.03921 57.55556 0.261 0.793872
## NeighbourhoodOrchard 119.85031 63.51614 1.887 0.059223 .
## NeighbourhoodOutram 21.27236 57.46920 0.370 0.711283
## NeighbourhoodPasir Ris -13.39659 70.84907 -0.189 0.850032
## NeighbourhoodPunggol -35.96595 81.81073 -0.440 0.660227
## NeighbourhoodQueenstown -0.42895 59.57713 -0.007 0.994256
## NeighbourhoodRiver Valley 12.02683 58.46561 0.206 0.837027
## NeighbourhoodRochor 65.95791 57.16818 1.154 0.248652
## NeighbourhoodSembawang -26.91649 78.92253 -0.341 0.733079
## NeighbourhoodSengkang -24.23756 72.39202 -0.335 0.737781
## NeighbourhoodSerangoon -10.60934 71.73099 -0.148 0.882423
## NeighbourhoodSingapore River 65.42088 62.26363 1.051 0.293439
## NeighbourhoodSouthern Islands 1551.66535 108.15045 14.347 < 2e-16 ***
## NeighbourhoodSungei Kadut -13.24408 234.72362 -0.056 0.955006
## NeighbourhoodTampines 10.17235 75.66701 0.134 0.893063
## NeighbourhoodTanglin 26.82013 60.40458 0.444 0.657054
## NeighbourhoodToa Payoh 38.21343 67.37688 0.567 0.570629
## NeighbourhoodWestern Water Catchment -79.68544 170.29240 -0.468 0.639851
## NeighbourhoodWoodlands -27.32811 71.11665 -0.384 0.700792
## NeighbourhoodYishun -1.36924 75.15107 -0.018 0.985464
## Review_per_month -15.27120 3.95750 -3.859 0.000115 ***
## Host_listing_count -0.33116 0.07669 -4.318 1.60e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 322.6 on 5496 degrees of freedom
## Multiple R-squared: 0.0963, Adjusted R-squared: 0.0889
## F-statistic: 13.01 on 45 and 5496 DF, p-value: < 2.2e-16
The backward stepwise model built from the first model also fits poorly. The median residual is -32.6 when it should be near 0.
vif(backward_first_model) %>%
kable() %>%
kable_styling()
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| Room_type | 1.394975 | 2 | 1.086780 |
| Neighbourhood | 1.450225 | 41 | 1.004543 |
| Review_per_month | 1.082944 | 1 | 1.040646 |
| Host_listing_count | 1.311297 | 1 | 1.145119 |
RMSE of backward stepwise model using first linear regression model is calculated.
rmse_first_model <- sqrt(mean((residuals(backward_first_model)^2)))
print(rmse_first_model)
## [1] 321.2964
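The RMSE of the backward stepwise model built from the first model is 321.2964 (in SGD). This is barely below the standard deviation of the training prices (about 338, the square root of the variance 114252), whereas a well-fitting model would have an RMSE much closer to 0.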
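The RMSE differs slightly from the residual standard error of 322.6 shown in the summary because the two use different divisors: RMSE divides the residual sum of squares by the number of observations, while the residual standard error divides by the residual degrees of freedom. The arithmetic below checks this using only numbers printed in the output above.

```r
# Values copied from the step() trace and summary(backward_first_model)
rss         <- 572108134  # residual sum of squares of the final model
df_residual <- 5496       # residual degrees of freedom
n_coef      <- 46         # 45 predictors plus the intercept
n_obs       <- df_residual + n_coef

rmse <- sqrt(rss / n_obs)        # divides by the number of observations
rse  <- sqrt(rss / df_residual)  # divides by the residual degrees of freedom
print(round(c(rmse = rmse, rse = rse), 1))
```

Both values agree with the output above to the printed precision.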
Backward stepwise model using second linear regression model is modeled.
backward_second_model <- step(second_model, direction = 'backward')
## Start: AIC=-8702.5
## log(Price) ~ Room_type + Region + Neighbourhood + Minimum_night +
## Review_count + Review_per_month + Host_listing_count + Day_available_per_year
##
##
## Step: AIC=-8702.5
## log(Price) ~ Room_type + Neighbourhood + Minimum_night + Review_count +
## Review_per_month + Host_listing_count + Day_available_per_year
##
## Df Sum of Sq RSS AIC
## <none> 639.83 -8702.5
## - Review_per_month 1 0.88 640.71 -8698.3
## - Review_count 1 1.02 640.85 -8697.3
## - Host_listing_count 1 1.21 641.04 -8696.0
## - Minimum_night 1 15.36 655.19 -8597.6
## - Day_available_per_year 1 19.41 659.24 -8569.8
## - Neighbourhood 40 42.35 682.18 -8493.6
## - Room_type 2 410.70 1050.53 -6471.7
The results are summarized.
summary(backward_second_model)
##
## Call:
## lm(formula = log(Price) ~ Room_type + Neighbourhood + Minimum_night +
## Review_count + Review_per_month + Host_listing_count + Day_available_per_year,
## data = airbnb_train_without_outlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24213 -0.29154 -0.04794 0.26085 1.42413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.032e+00 6.804e-02 73.958 < 2e-16 ***
## Room_typePrivate room -6.848e-01 1.300e-02 -52.677 < 2e-16 ***
## Room_typeShared room -8.246e-01 4.524e-02 -18.229 < 2e-16 ***
## NeighbourhoodBedok 9.474e-02 7.175e-02 1.320 0.18676
## NeighbourhoodBishan -3.231e-02 9.636e-02 -0.335 0.73738
## NeighbourhoodBukit Batok 2.268e-03 9.092e-02 0.025 0.98010
## NeighbourhoodBukit Merah -8.543e-02 7.065e-02 -1.209 0.22663
## NeighbourhoodBukit Panjang -1.475e-01 1.246e-01 -1.183 0.23681
## NeighbourhoodBukit Timah 1.354e-02 8.036e-02 0.168 0.86620
## NeighbourhoodCentral Water Catchment -1.574e-02 1.053e-01 -0.149 0.88119
## NeighbourhoodChoa Chu Kang -2.518e-01 9.550e-02 -2.637 0.00839 **
## NeighbourhoodClementi -4.976e-02 8.620e-02 -0.577 0.56380
## NeighbourhoodDowntown Core 2.182e-01 7.206e-02 3.028 0.00248 **
## NeighbourhoodGeylang -3.650e-02 6.916e-02 -0.528 0.59772
## NeighbourhoodHougang -1.279e-01 8.398e-02 -1.523 0.12776
## NeighbourhoodJurong East 5.459e-03 8.101e-02 0.067 0.94628
## NeighbourhoodJurong West -1.398e-01 7.942e-02 -1.761 0.07834 .
## NeighbourhoodKallang -1.628e-02 6.921e-02 -0.235 0.81400
## NeighbourhoodLim Chu Kang -2.071e-01 3.871e-01 -0.535 0.59275
## NeighbourhoodMandai -5.554e-01 3.849e-01 -1.443 0.14903
## NeighbourhoodMarine Parade 3.943e-02 7.662e-02 0.515 0.60683
## NeighbourhoodMuseum 7.059e-02 9.188e-02 0.768 0.44240
## NeighbourhoodNewton 8.948e-02 7.969e-02 1.123 0.26155
## NeighbourhoodNovena 7.654e-02 7.070e-02 1.083 0.27905
## NeighbourhoodOrchard 2.907e-01 8.442e-02 3.443 0.00058 ***
## NeighbourhoodOutram 5.685e-02 7.100e-02 0.801 0.42333
## NeighbourhoodPasir Ris -1.154e-02 8.959e-02 -0.129 0.89750
## NeighbourhoodPunggol -1.910e-01 1.023e-01 -1.867 0.06192 .
## NeighbourhoodQueenstown 2.321e-02 7.344e-02 0.316 0.75202
## NeighbourhoodRiver Valley -2.701e-02 7.207e-02 -0.375 0.70783
## NeighbourhoodRochor 9.522e-02 7.067e-02 1.347 0.17795
## NeighbourhoodSembawang -2.172e-01 1.066e-01 -2.037 0.04172 *
## NeighbourhoodSengkang -2.365e-01 9.438e-02 -2.506 0.01225 *
## NeighbourhoodSerangoon -2.386e-02 8.939e-02 -0.267 0.78955
## NeighbourhoodSingapore River 2.298e-01 8.171e-02 2.813 0.00494 **
## NeighbourhoodSouthern Islands 3.079e-01 3.859e-01 0.798 0.42500
## NeighbourhoodSungei Kadut -5.162e-01 2.766e-01 -1.866 0.06208 .
## NeighbourhoodTampines -2.885e-02 9.849e-02 -0.293 0.76958
## NeighbourhoodTanglin 8.930e-02 7.448e-02 1.199 0.23059
## NeighbourhoodToa Payoh -1.033e-01 8.285e-02 -1.247 0.21232
## NeighbourhoodWestern Water Catchment -5.442e-01 2.772e-01 -1.963 0.04966 *
## NeighbourhoodWoodlands -1.866e-01 9.475e-02 -1.969 0.04898 *
## NeighbourhoodYishun 4.040e-02 9.914e-02 0.407 0.68370
## Minimum_night -1.548e-03 1.497e-04 -10.345 < 2e-16 ***
## Review_count -7.201e-04 2.698e-04 -2.669 0.00763 **
## Review_per_month -1.779e-02 7.182e-03 -2.477 0.01327 *
## Host_listing_count -2.867e-04 9.854e-05 -2.910 0.00363 **
## Day_available_per_year 4.865e-04 4.182e-05 11.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 4459 degrees of freedom
## Multiple R-squared: 0.4937, Adjusted R-squared: 0.4884
## F-statistic: 92.53 on 47 and 4459 DF, p-value: < 2.2e-16
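The backward stepwise model built from the second model is an improvement over the one built from the first model. Stepwise selection removed only Region, which is redundant given Neighbourhood, so the fit is unchanged from the second model: the median residual is -0.04794 (log scale) and the adjusted R-squared is 0.4884.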
vif(backward_second_model) %>%
kable() %>%
kable_styling()
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| Room_type | 1.364824 | 2 | 1.080859 |
| Neighbourhood | 1.597313 | 40 | 1.005871 |
| Minimum_night | 1.130842 | 1 | 1.063410 |
| Review_count | 2.057790 | 1 | 1.434500 |
| Review_per_month | 2.098675 | 1 | 1.448681 |
| Host_listing_count | 1.416358 | 1 | 1.190109 |
| Day_available_per_year | 1.171249 | 1 | 1.082243 |
RMSE of backward stepwise model using second linear regression model is calculated.
rmse_second_model <- sqrt(mean((residuals(backward_second_model)^2)))
print(rmse_second_model)
## [1] 0.3767803
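The RMSE of the backward stepwise model built from the second model is 0.3767803. This value is in log-price units, whereas the 321.2964 of the first backward stepwise model is in SGD, so the two cannot be compared directly. A like-for-like comparison requires back-transforming the predictions to the price scale.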
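One way to put a raw-price model and a log-price model on the same footing is to exponentiate the log-model predictions before computing the RMSE. The sketch below shows the idea on simulated data; the data frame `toy`, the predictor `nights` and the helper `rmse` are assumptions made for illustration and do not come from the Airbnb data.

```r
set.seed(42)
# Hypothetical data: price depends log-linearly on a single predictor
n      <- 200
nights <- sample(1:30, n, replace = TRUE)
price  <- exp(5 - 0.02 * nights + rnorm(n, sd = 0.3))
toy    <- data.frame(price, nights)

fit_raw <- lm(price ~ nights, data = toy)
fit_log <- lm(log(price) ~ nights, data = toy)

# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_raw        <- rmse(toy$price, fitted(fit_raw))        # price units
rmse_log_scale  <- rmse(log(toy$price), fitted(fit_log))   # log units
rmse_log_in_sgd <- rmse(toy$price, exp(fitted(fit_log)))   # back in price units

print(c(raw = rmse_raw, log_scale = rmse_log_scale, log_back = rmse_log_in_sgd))
```

Only `rmse_raw` and `rmse_log_in_sgd` are in the same units and can be compared with each other.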
Prices for the test set without outliers are predicted and back-transformed from the log scale.
predict_regression <- predict(second_model, newdata = airbnb_test_without_outlier)
## Warning in predict.lm(second_model, newdata = airbnb_test_without_outlier):
## prediction from a rank-deficient fit may be misleading
predict_regression <- exp(predict_regression)
RMSE_regression <- sqrt(mean( (airbnb_test_without_outlier$Price - predict_regression)**2 ))
print(RMSE_regression)
## [1] 50.53446
The sum of squared deviations of actual values from predicted values is calculated.
SSE <- sum((airbnb_test_without_outlier$Price - predict_regression)**2)
print(SSE)
## [1] 4849536
The sum of squared deviations of predicted values from the mean value is calculated.
SSR <- sum((predict_regression - mean(airbnb_test_without_outlier$Price)) ** 2)
print(SSR)
## [1] 3825675
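R-squared, the proportion of variation in the observed prices that is explained by the predictions, is calculated. The total sum of squares is taken here as SSE + SSR, an identity that holds exactly only for in-sample least-squares fits, so the value is approximate for held-out, back-transformed predictions.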
R2 <- 1 - SSE/(SSE + SSR)
print(R2)
## [1] 0.4409893
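For comparison, the conventional out-of-sample R-squared computes the total sum of squares directly from the observed values instead of from SSE + SSR. A minimal sketch follows; the vectors `obs` and `pred` are hypothetical and stand in for `airbnb_test_without_outlier$Price` and `predict_regression`.

```r
# Hypothetical observed and predicted prices for a held-out set
obs  <- c(80, 120, 150, 95, 200, 60)
pred <- c(90, 110, 140, 100, 170, 75)

sse <- sum((obs - pred)^2)       # squared error of the predictions
sst <- sum((obs - mean(obs))^2)  # total variation around the observed mean

r2_test <- 1 - sse / sst
print(r2_test)
```

With this definition the value can fall below 0 when the predictions are worse than simply predicting the mean, which the SSE / (SSE + SSR) form cannot show.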
A scatter plot of observed versus predicted values, faceted by room type, is plotted.
regression_results <- tibble(
obs = airbnb_test_without_outlier$Price,
pred = predict_regression,
diff = pred - obs,
abs_diff = abs(pred - obs),
type = airbnb_test_without_outlier$Room_type)
regression_plot <- regression_results %>%
ggplot(aes(obs, pred)) +
geom_point(alpha = 0.1) +
scale_x_log10() +
scale_y_log10() +
ggtitle("Observed vs predicted",
subtitle = "Linear regression model") +
geom_abline(slope = 1, intercept = 0, color = "blue", linetype = 2) +
facet_wrap(~type)
ggplotly(regression_plot)
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
For classification, the data is split into a training set (67% of rows) and a testing set.
df_cor2 <- df2 %>%
select(Price, Latitude, Longitude, Minimum_night, Review_count, Review_per_month, Host_listing_count, Day_available_per_year, Room_type, Region)
df_cor2[c("Room_type","Region")] <- map(df_cor2[c("Room_type","Region")], as.factor)
# Set the seed before the random partition so that the split is reproducible
set.seed(12345)
intrain <- createDataPartition(y = df_cor2$Room_type, p = 0.67, list = FALSE)
training <- df_cor2[intrain,]
testing <- df_cor2[-intrain,]
# Training with classification tree model
airbnb.rpart <- rpart(Room_type ~ ., data=training, method="class")
print(airbnb.rpart, digits = 3)
## n= 5298
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 5298 2530 Entire home/apt (0.52246 0.42771 0.04983)
## 2) Price>=100 3054 538 Entire home/apt (0.82384 0.16896 0.00720) *
## 3) Price< 100 2244 494 Private room (0.11230 0.77986 0.10784)
## 6) Price>=32.5 2015 362 Private room (0.12407 0.82035 0.05558)
## 12) Price>=81.5 556 213 Private room (0.36691 0.61691 0.01619)
## 24) Host_listing_count>=56 107 25 Entire home/apt (0.76636 0.23364 0.00000) *
## 25) Host_listing_count< 56 449 131 Private room (0.27171 0.70824 0.02004) *
## 13) Price< 81.5 1459 149 Private room (0.03153 0.89788 0.07060) *
## 7) Price< 32.5 229 99 Shared room (0.00873 0.42358 0.56769)
## 14) Host_listing_count< 3.5 108 17 Private room (0.01852 0.84259 0.13889) *
## 15) Host_listing_count>=3.5 121 6 Shared room (0.00000 0.04959 0.95041) *
printcp(airbnb.rpart) # display the results
##
## Classification tree:
## rpart(formula = Room_type ~ ., data = training, method = "class")
##
## Variables actually used in tree construction:
## [1] Host_listing_count Price
##
## Root node error: 2530/5298 = 0.47754
##
## n= 5298
##
## CP nsplit rel error xerror xstd
## 1 0.592095 0 1.00000 1.00000 0.014370
## 2 0.021542 1 0.40791 0.41028 0.011419
## 3 0.011265 3 0.36482 0.36798 0.010949
## 4 0.010000 5 0.34229 0.33399 0.010534
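The cp table is typically used to decide whether the tree should be pruned: the row with the smallest cross-validated error (xerror) gives the complexity parameter to prune back to. In the table above the smallest xerror (0.33399) belongs to the last row, so the full five-split tree would be kept. The sketch below shows the mechanics on the built-in `iris` data, which serves only as a stand-in for a three-class problem.

```r
library(rpart)

set.seed(123)
# iris stands in for the Airbnb training data (three-class target)
tree <- rpart(Species ~ ., data = iris, method = "class")

# Row of the cp table with the smallest cross-validated error
cp_table <- tree$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune back to that complexity; this is a no-op if the full tree is already best
pruned <- prune(tree, cp = best_cp)
print(best_cp)
```

The same three lines applied to `airbnb.rpart$cptable` would select the cp value for the Airbnb tree.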
plotcp(airbnb.rpart) # visualize cross-validation results
summary(airbnb.rpart)
## Call:
## rpart(formula = Room_type ~ ., data = training, method = "class")
## n= 5298
##
## CP nsplit rel error xerror xstd
## 1 0.59209486 0 1.0000000 1.0000000 0.01437033
## 2 0.02154150 1 0.4079051 0.4102767 0.01141897
## 3 0.01126482 3 0.3648221 0.3679842 0.01094939
## 4 0.01000000 5 0.3422925 0.3339921 0.01053363
##
## Variable importance
## Price Host_listing_count Latitude
## 56 12 10
## Region Minimum_night Longitude
## 9 6 4
## Day_available_per_year Review_count
## 2 1
##
## Node number 1: 5298 observations, complexity param=0.5920949
## predicted class=Entire home/apt expected loss=0.4775387 P(node) =1
## class counts: 2768 2266 264
## probabilities: 0.522 0.428 0.050
## left son=2 (3054 obs) right son=3 (2244 obs)
## Primary splits:
## Price < 100.5 to the right, improve=1150.7490, (0 missing)
## Host_listing_count < 14.5 to the right, improve= 324.1613, (0 missing)
## Region splits as LRRRR, improve= 234.1491, (0 missing)
## Latitude < 1.33698 to the left, improve= 218.4871, (0 missing)
## Minimum_night < 1.5 to the right, improve= 192.1937, (0 missing)
## Surrogate splits:
## Latitude < 1.33698 to the left, agree=0.648, adj=0.168, (0 split)
## Region splits as LRRRR, agree=0.644, adj=0.160, (0 split)
## Host_listing_count < 4.5 to the right, agree=0.641, adj=0.153, (0 split)
## Longitude < 103.8035 to the right, agree=0.611, adj=0.081, (0 split)
## Minimum_night < 89.5 to the left, agree=0.611, adj=0.081, (0 split)
##
## Node number 2: 3054 observations
## predicted class=Entire home/apt expected loss=0.1761624 P(node) =0.5764439
## class counts: 2516 516 22
## probabilities: 0.824 0.169 0.007
##
## Node number 3: 2244 observations, complexity param=0.0215415
## predicted class=Private room expected loss=0.2201426 P(node) =0.4235561
## class counts: 252 1750 242
## probabilities: 0.112 0.780 0.108
## left son=6 (2015 obs) right son=7 (229 obs)
## Primary splits:
## Price < 32.5 to the right, improve=89.03290, (0 missing)
## Minimum_night < 1.5 to the right, improve=41.12249, (0 missing)
## Host_listing_count < 6.5 to the left, improve=31.10437, (0 missing)
## Region splits as RLLLL, improve=27.86415, (0 missing)
## Latitude < 1.31664 to the right, improve=27.36096, (0 missing)
##
## Node number 6: 2015 observations, complexity param=0.01126482
## predicted class=Private room expected loss=0.1796526 P(node) =0.3803322
## class counts: 250 1653 112
## probabilities: 0.124 0.820 0.056
## left son=12 (556 obs) right son=13 (1459 obs)
## Primary splits:
## Price < 81.5 to the right, improve=78.25492, (0 missing)
## Host_listing_count < 61.5 to the right, improve=25.73927, (0 missing)
## Region splits as LRRLR, improve=13.52241, (0 missing)
## Longitude < 103.8442 to the right, improve=12.68967, (0 missing)
## Day_available_per_year < 84.5 to the left, improve=12.55265, (0 missing)
## Surrogate splits:
## Host_listing_count < 112.5 to the right, agree=0.729, adj=0.018, (0 split)
## Longitude < 103.9716 to the right, agree=0.726, adj=0.005, (0 split)
## Latitude < 1.448945 to the right, agree=0.725, adj=0.002, (0 split)
## Minimum_night < 300 to the right, agree=0.725, adj=0.002, (0 split)
##
## Node number 7: 229 observations, complexity param=0.0215415
## predicted class=Shared room expected loss=0.4323144 P(node) =0.04322386
## class counts: 2 97 130
## probabilities: 0.009 0.424 0.568
## left son=14 (108 obs) right son=15 (121 obs)
## Primary splits:
## Host_listing_count < 3.5 to the left, improve=73.48741, (0 missing)
## Latitude < 1.31625 to the right, improve=54.17262, (0 missing)
## Minimum_night < 1.5 to the right, improve=53.14828, (0 missing)
## Region splits as RLLLL, improve=33.72940, (0 missing)
## Day_available_per_year < 124 to the left, improve=21.76556, (0 missing)
## Surrogate splits:
## Latitude < 1.314795 to the right, agree=0.825, adj=0.630, (0 split)
## Minimum_night < 1.5 to the right, agree=0.795, adj=0.565, (0 split)
## Day_available_per_year < 40.5 to the left, agree=0.777, adj=0.528, (0 split)
## Region splits as RLLLL, agree=0.738, adj=0.444, (0 split)
## Review_count < 1.5 to the left, agree=0.690, adj=0.343, (0 split)
##
## Node number 12: 556 observations, complexity param=0.01126482
## predicted class=Private room expected loss=0.3830935 P(node) =0.1049453
## class counts: 204 343 9
## probabilities: 0.367 0.617 0.016
## left son=24 (107 obs) right son=25 (449 obs)
## Primary splits:
## Host_listing_count < 56 to the right, improve=40.63883, (0 missing)
## Day_available_per_year < 85 to the left, improve=38.51441, (0 missing)
## Longitude < 103.885 to the right, improve=25.20401, (0 missing)
## Minimum_night < 2.5 to the right, improve=22.96243, (0 missing)
## Latitude < 1.312175 to the right, improve=18.74820, (0 missing)
##
## Node number 13: 1459 observations
## predicted class=Private room expected loss=0.1021247 P(node) =0.2753869
## class counts: 46 1310 103
## probabilities: 0.032 0.898 0.071
##
## Node number 14: 108 observations
## predicted class=Private room expected loss=0.1574074 P(node) =0.02038505
## class counts: 2 91 15
## probabilities: 0.019 0.843 0.139
##
## Node number 15: 121 observations
## predicted class=Shared room expected loss=0.04958678 P(node) =0.02283881
## class counts: 0 6 115
## probabilities: 0.000 0.050 0.950
##
## Node number 24: 107 observations
## predicted class=Entire home/apt expected loss=0.2336449 P(node) =0.0201963
## class counts: 82 25 0
## probabilities: 0.766 0.234 0.000
##
## Node number 25: 449 observations
## predicted class=Private room expected loss=0.2917595 P(node) =0.08474896
## class counts: 122 318 9
## probabilities: 0.272 0.708 0.020
# Predict the testing dataset with the trained model
predictions1 <- predict(airbnb.rpart, testing, type = "class")
# Evaluation: Accuracy and other metrics
confusionMatrix(predictions1, testing$Room_type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Entire home/apt Private room Shared room
## Entire home/apt 1293 255 13
## Private room 70 848 62
## Shared room 0 12 55
##
## Overall Statistics
##
## Accuracy : 0.842
## 95% CI : (0.8275, 0.8558)
## No Information Rate : 0.5226
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6992
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: Entire home/apt Class: Private room
## Sensitivity 0.9486 0.7605
## Specificity 0.7847 0.9116
## Pos Pred Value 0.8283 0.8653
## Neg Pred Value 0.9331 0.8360
## Prevalence 0.5226 0.4275
## Detection Rate 0.4958 0.3252
## Detection Prevalence 0.5985 0.3758
## Balanced Accuracy 0.8667 0.8361
## Class: Shared room
## Sensitivity 0.42308
## Specificity 0.99516
## Pos Pred Value 0.82090
## Neg Pred Value 0.97048
## Prevalence 0.04985
## Detection Rate 0.02109
## Detection Prevalence 0.02569
## Balanced Accuracy 0.70912
The overall accuracy of the classification tree on the testing set is 0.842.
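As a check on the figures reported by `confusionMatrix()`, the accuracy and per-class sensitivity can be recomputed by hand from the confusion matrix above. The sketch below copies the nine counts from that output; the object names `tree_cm`, `tree_accuracy` and `tree_sensitivity` are illustrative and do not appear in the original analysis.

```r
# Test-set confusion matrix of the classification tree, copied from the
# confusionMatrix() output above (rows = prediction, columns = reference).
classes <- c("Entire home/apt", "Private room", "Shared room")
tree_cm <- matrix(
  c(1293, 255, 13,
      70, 848, 62,
       0,  12, 55),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Overall accuracy: correctly classified listings (the diagonal)
# divided by the total number of listings in the testing set.
tree_accuracy <- sum(diag(tree_cm)) / sum(tree_cm)

# Per-class sensitivity: correct predictions for a class divided by
# the number of listings that truly belong to that class (column sums).
tree_sensitivity <- diag(tree_cm) / colSums(tree_cm)

print(round(tree_accuracy, 4))
print(round(tree_sensitivity, 4))
```

The low sensitivity for shared rooms shows that the tree misses more than half of the listings in the smallest class, even though the overall accuracy looks acceptable.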
set.seed(12345)
# Training the data using Random forest model
airbnb.rf <- randomForest(Room_type ~. , data=training, importance = TRUE)
airbnb.rf
##
## Call:
## randomForest(formula = Room_type ~ ., data = training, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 9.32%
## Confusion matrix:
## Entire home/apt Private room Shared room class.error
## Entire home/apt 2577 190 1 0.06900289
## Private room 207 2046 13 0.09708738
## Shared room 6 77 181 0.31439394
# Predict the testing dataset with the trained model
predictions2 <- predict(airbnb.rf, testing, type = "class")
# Evaluation: Accuracy and other metrics
confusionMatrix(predictions2, testing$Room_type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Entire home/apt Private room Shared room
## Entire home/apt 1284 114 5
## Private room 78 997 33
## Shared room 1 4 92
##
## Overall Statistics
##
## Accuracy : 0.9099
## 95% CI : (0.8982, 0.9206)
## No Information Rate : 0.5226
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8317
##
## Mcnemar's Test P-Value : 4.875e-07
##
## Statistics by Class:
##
## Class: Entire home/apt Class: Private room
## Sensitivity 0.9420 0.8942
## Specificity 0.9044 0.9257
## Pos Pred Value 0.9152 0.8998
## Neg Pred Value 0.9344 0.9213
## Prevalence 0.5226 0.4275
## Detection Rate 0.4923 0.3823
## Detection Prevalence 0.5380 0.4248
## Balanced Accuracy 0.9232 0.9099
## Class: Shared room
## Sensitivity 0.70769
## Specificity 0.99798
## Pos Pred Value 0.94845
## Neg Pred Value 0.98487
## Prevalence 0.04985
## Detection Rate 0.03528
## Detection Prevalence 0.03719
## Balanced Accuracy 0.85284
The random forest grows 500 trees and tries 3 variables at each split. Its out-of-bag (OOB) estimate of the error rate is 9.32%.
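The OOB error rate and the `class.error` column printed with the fitted forest follow directly from the OOB confusion matrix. The sketch below copies the counts from the `airbnb.rf` printout, where rows are the true classes and columns are the OOB predictions; the object names `oob_cm`, `oob_error` and `oob_class_error` are illustrative only.

```r
# OOB confusion matrix copied from the printed airbnb.rf object
# (rows = true class, columns = class predicted from out-of-bag trees).
classes <- c("Entire home/apt", "Private room", "Shared room")
oob_cm <- matrix(
  c(2577,  190,   1,
     207, 2046,  13,
       6,   77, 181),
  nrow = 3, byrow = TRUE,
  dimnames = list(Actual = classes, Predicted = classes)
)

# Overall OOB error: share of training listings that are misclassified
# by the trees that did not see them during training.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error: misclassified listings divided by the class size (row sums).
oob_class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

print(round(100 * oob_error, 2))
print(round(oob_class_error, 4))
```

Because the OOB estimate is computed on the training data only, it can be read before touching the testing set, and here it is close to the test-set error of the same model.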
# Extract variable importance (type = 1: mean decrease in accuracy)
important <- importance(airbnb.rf, type = 1)
Important_Features <- data.frame(Feature = row.names(important),
                                 Importance = important[, 1])
# Horizontal bar chart of the features, ordered by importance
plot_imp <- ggplot(Important_Features,
                   aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_light(base_size = 13) +
  xlab("") +
  ylab("Importance") +
  ggtitle("Important Features in Random Forest Model for\n Singapore airbnb data") +
  theme(plot.title = element_text(size = 13))
plot_imp
The accuracy of the random forest model on the testing set is 0.9099.
For the regression of Singapore Airbnb prices, the linear regression model with logarithmic transformation and outlier removal performs better and fits the data well.
For the classification of Singapore Airbnb room type, the random forest model performs better than the classification tree when comparing accuracy on the testing set (0.9099 versus 0.842). The three most important features for predicting room type in the random forest model are price, minimum nights and host listing count. This model can help verify that the room type entered by a host is consistent with the other listing information, so guests can worry less about their accommodation, which improves the customer experience.
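The comparison between the two classifiers can be reproduced side by side from their test-set confusion matrices. The sketch below is a minimal illustration: the helper `cm_summary()` and the objects `tree_cm`, `rf_cm` and `comparison` are hypothetical names introduced here, and the counts are copied from the two `confusionMatrix()` outputs above. Cohen's kappa is computed as (observed agreement - expected agreement) / (1 - expected agreement).

```r
classes <- c("Entire home/apt", "Private room", "Shared room")

# Test-set confusion matrices (rows = prediction, columns = reference),
# copied from the two confusionMatrix() outputs in this report.
tree_cm <- matrix(c(1293, 255, 13,
                      70, 848, 62,
                       0,  12, 55),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Prediction = classes, Reference = classes))
rf_cm <- matrix(c(1284, 114,  5,
                    78, 997, 33,
                     1,   4, 92),
                nrow = 3, byrow = TRUE,
                dimnames = list(Prediction = classes, Reference = classes))

# Hypothetical helper: accuracy, Cohen's kappa and shared-room sensitivity
# computed from a square confusion matrix.
cm_summary <- function(cm) {
  n <- sum(cm)
  observed <- sum(diag(cm)) / n                     # observed agreement (accuracy)
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  c(accuracy = observed,
    kappa = (observed - expected) / (1 - expected),
    shared_room_sensitivity = unname(cm[3, 3] / sum(cm[, 3])))
}

comparison <- rbind(classification_tree = cm_summary(tree_cm),
                    random_forest = cm_summary(rf_cm))
print(round(comparison, 4))
```

Beyond the gain in overall accuracy, the largest improvement is in the shared-room class, which is the rarest room type in the data.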