Regression of Singapore Airbnb Price and Classification of Singapore Airbnb Room Type

Part 1: Introduction

1.1 Introduction of Airbnb

Airbnb, Inc. is an American company, based in San Francisco, California, that operates an online marketplace for vacation rentals. The marketplace is accessible through its website and its mobile app. Users can book lodging, primarily homestays, or list their own properties for rent. Airbnb does not own any of the listed properties; instead, it profits by collecting a percentage-based brokerage and service fee from both the host and the guest on each booking.

1.2 Project Objective

Our project focuses on Airbnb services in Singapore. Although small in land area, Singapore has been Southeast Asia’s most modern city for over a century. For many travelers, Singapore is their first introduction to Southeast Asia, as the city blends Chinese, Malay, Indian and English cultures and religions. With such strong tourism demand, Airbnb services here are growing rapidly.

First, we want to predict the price of Airbnb listings in Singapore using regression, a supervised machine learning technique. The regression model chosen for this project is linear regression. Price predictions can help Airbnb decide where and when to run specific promotions. Travelers can also use accurate predictions of listing prices to plan their vacations in advance.
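
As a minimal sketch of the regression idea, the snippet below fits a linear model with base R's lm() on a small made-up data frame. The data frame toy, its values and the coefficients used to build it are invented for illustration; they are not taken from the Airbnb data, and the model fitted in this project may use different predictors.

```r
# Toy data (hypothetical): price depends linearly on minimum nights
# and on whether the listing is an entire home.
toy <- data.frame(
  Minimum_night = c(1, 2, 3, 5, 7, 10, 14, 30),
  Entire_home   = c(0, 1, 0, 1, 0, 1, 0, 1)
)
toy$Price <- 50 + 2 * toy$Minimum_night + 100 * toy$Entire_home

# Fit an ordinary least squares model.
fit <- lm(Price ~ Minimum_night + Entire_home, data = toy)
coef(fit)

# Predict the price of a listing that was not in the training data.
new_listing <- data.frame(Minimum_night = 4, Entire_home = 1)
predicted <- predict(fit, newdata = new_listing)
predicted
```

Because the toy prices contain no noise, the fitted coefficients recover the values used to build the data. Real listing prices are noisy, so the fit would be approximate.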

Next, we want to predict the room type of Airbnb listings in Singapore using classification, which is also a supervised machine learning technique. The classification model chosen for this project is random forest. Classifying room type can help guests confirm that the accommodation information is accurate, which is important for their service experience.

Part 2: Getting Data

The Singapore Airbnb listings data set is available from the Inside Airbnb website in CSV format at http://insideairbnb.com/get-the-data.html. According to the website, the data was collected on 26 October 2020. The data set contains 7,907 listings, some of which have missing values. It covers all Airbnb listings in Singapore across 5 regions (Central Region, North Region, North-East Region, East Region and West Region), together with their guest review activity.

The R packages required for the project are loaded.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(corrplot)
## corrplot 0.84 loaded
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(caTools)
library(cowplot)
library(rpart)

The data set, a CSV file titled Singapore Airbnb Listings, is read and assigned to airbnb.

airbnb <- read.csv("C:/Users/User/Desktop/Singapore Airbnb Listings.csv")

First 20 rows of airbnb data are viewed.

head(airbnb,20) %>%
  kable() %>%
  kable_styling()
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49091 | COZICOMFORT LONG TERM STAY ROOM 2 | 266763 | Francesca | North Region | Woodlands | 1.44255 | 103.7958 | Private room | 83 | 180 | 1 | 2013-10-21 | 0.01 | 2 | 365 |
| 50646 | Pleasant Room along Bukit Timah | 227796 | Sujatha | Central Region | Bukit Timah | 1.33235 | 103.7852 | Private room | 81 | 90 | 18 | 2014-12-26 | 0.28 | 1 | 365 |
| 56334 | COZICOMFORT | 266763 | Francesca | North Region | Woodlands | 1.44246 | 103.7967 | Private room | 69 | 6 | 20 | 2015-10-01 | 0.20 | 2 | 365 |
| 71609 | Ensuite Room (Room 1 & 2) near EXPO | 367042 | Belinda | East Region | Tampines | 1.34541 | 103.9571 | Private room | 206 | 1 | 14 | 2019-08-11 | 0.15 | 9 | 353 |
| 71896 | B&B Room 1 near Airport & EXPO | 367042 | Belinda | East Region | Tampines | 1.34567 | 103.9596 | Private room | 94 | 1 | 22 | 2019-07-28 | 0.22 | 9 | 355 |
| 71903 | Room 2-near Airport & EXPO | 367042 | Belinda | East Region | Tampines | 1.34702 | 103.9610 | Private room | 104 | 1 | 39 | 2019-08-15 | 0.38 | 9 | 346 |
| 71907 | 3rd level Jumbo room 5 near EXPO | 367042 | Belinda | East Region | Tampines | 1.34348 | 103.9634 | Private room | 208 | 1 | 25 | 2019-07-25 | 0.25 | 9 | 172 |
| 241503 | Long stay at The Breezy East “Leopard” | 1017645 | Bianca | East Region | Bedok | 1.32304 | 103.9136 | Private room | 50 | 90 | 174 | 2019-05-31 | 1.88 | 4 | 59 |
| 241508 | Long stay at The Breezy East “Plumeria” | 1017645 | Bianca | East Region | Bedok | 1.32458 | 103.9116 | Private room | 54 | 90 | 198 | 2019-04-28 | 2.08 | 4 | 133 |
| 241510 | Long stay at The Breezy East “Red Palm” | 1017645 | Bianca | East Region | Bedok | 1.32461 | 103.9119 | Private room | 42 | 90 | 236 | 2019-07-31 | 2.53 | 4 | 147 |
| 275343 | Conveniently located City Room!( (Phone number hidden by Airbnb) ) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28875 | 103.8081 | Private room | 44 | 15 | 18 | 2019-04-21 | 0.23 | 32 | 331 |
| 275344 | 15 mins to Outram MRT Single Room (B) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28837 | 103.8110 | Private room | 40 | 30 | 10 | 2018-09-13 | 0.11 | 32 | 276 |
| 289234 | Booking for 3 bedrooms | 367042 | Belinda | East Region | Tampines | 1.34561 | 103.9598 | Private room | 417 | 2 | 12 | 2019-01-01 | 0.14 | 9 | 239 |
| 294281 | 5 mins walk from Newton subway | 1521514 | Elizabeth | Central Region | Newton | 1.31125 | 103.8382 | Private room | 65 | 2 | 125 | 2019-08-22 | 1.35 | 6 | 336 |
| 324945 | 20 Mins to Sentosa @ Hilltop ! (8) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28976 | 103.8090 | Private room | 44 | 30 | 13 | 2019-02-02 | 0.15 | 32 | 340 |
| 330089 | Accomo@ REDHILL-INSEAD, NTU,NUS -Mu(D) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28677 | 103.8124 | Private room | 40 | 30 | 10 | 2019-04-27 | 0.14 | 32 | 331 |
| 330095 | 10 mins to Redhill MRT @ Mini Orange Room(5) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28537 | 103.8109 | Private room | 31 | 90 | 3 | 2016-08-22 | 0.04 | 32 | 361 |
| 344803 | Budget short stay room near EXPO | 367042 | Belinda | East Region | Tampines | 1.34943 | 103.9595 | Private room | 49 | 2 | 45 | 2019-08-11 | 0.50 | 9 | 357 |
| 355955 | Double room in an Authentic Peranakan Shophouse | 1759905 | Aresha | Central Region | Geylang | 1.31420 | 103.9023 | Private room | 81 | 90 | 0 |  | NA | 1 | 173 |
| 369141 | 5mins from Newton Train Station | 1521514 | Elizabeth | Central Region | Newton | 1.31150 | 103.8376 | Private room | 60 | 2 | 84 | 2019-07-10 | 1.17 | 6 | 340 |

The class and the underlying storage type of the airbnb object are determined.

class(airbnb)
## [1] "data.frame"
typeof(airbnb)
## [1] "list"

Dimension of airbnb data is determined.

dim(airbnb)
## [1] 7907   16

It is a data frame with 7907 rows and 16 columns.

Content of airbnb data is determined.

glimpse(airbnb)
## Rows: 7,907
## Columns: 16
## $ id                             <int> 49091, 50646, 56334, 71609, 71896, 7...
## $ name                           <chr> "COZICOMFORT LONG TERM STAY ROOM 2",...
## $ host_id                        <int> 266763, 227796, 266763, 367042, 3670...
## $ host_name                      <chr> "Francesca", "Sujatha", "Francesca",...
## $ neighbourhood_group            <chr> "North Region", "Central Region", "N...
## $ neighbourhood                  <chr> "Woodlands", "Bukit Timah", "Woodlan...
## $ latitude                       <dbl> 1.44255, 1.33235, 1.44246, 1.34541, ...
## $ longitude                      <dbl> 103.7958, 103.7852, 103.7967, 103.95...
## $ room_type                      <chr> "Private room", "Private room", "Pri...
## $ price                          <int> 83, 81, 69, 206, 94, 104, 208, 50, 5...
## $ minimum_nights                 <int> 180, 90, 6, 1, 1, 1, 1, 90, 90, 90, ...
## $ number_of_reviews              <int> 1, 18, 20, 14, 22, 39, 25, 174, 198,...
## $ last_review                    <chr> "2013-10-21", "2014-12-26", "2015-10...
## $ reviews_per_month              <dbl> 0.01, 0.28, 0.20, 0.15, 0.22, 0.38, ...
## $ calculated_host_listings_count <int> 2, 1, 2, 9, 9, 9, 9, 4, 4, 4, 32, 32...
## $ availability_365               <int> 365, 365, 365, 353, 355, 346, 172, 5...

Structure of airbnb data is determined.

str(airbnb)
## 'data.frame':    7907 obs. of  16 variables:
##  $ id                            : int  49091 50646 56334 71609 71896 71903 71907 241503 241508 241510 ...
##  $ name                          : chr  "COZICOMFORT LONG TERM STAY ROOM 2" "Pleasant Room along Bukit Timah" "COZICOMFORT" "Ensuite Room (Room 1 & 2) near EXPO" ...
##  $ host_id                       : int  266763 227796 266763 367042 367042 367042 367042 1017645 1017645 1017645 ...
##  $ host_name                     : chr  "Francesca" "Sujatha" "Francesca" "Belinda" ...
##  $ neighbourhood_group           : chr  "North Region" "Central Region" "North Region" "East Region" ...
##  $ neighbourhood                 : chr  "Woodlands" "Bukit Timah" "Woodlands" "Tampines" ...
##  $ latitude                      : num  1.44 1.33 1.44 1.35 1.35 ...
##  $ longitude                     : num  104 104 104 104 104 ...
##  $ room_type                     : chr  "Private room" "Private room" "Private room" "Private room" ...
##  $ price                         : int  83 81 69 206 94 104 208 50 54 42 ...
##  $ minimum_nights                : int  180 90 6 1 1 1 1 90 90 90 ...
##  $ number_of_reviews             : int  1 18 20 14 22 39 25 174 198 236 ...
##  $ last_review                   : chr  "2013-10-21" "2014-12-26" "2015-10-01" "2019-08-11" ...
##  $ reviews_per_month             : num  0.01 0.28 0.2 0.15 0.22 0.38 0.25 1.88 2.08 2.53 ...
##  $ calculated_host_listings_count: int  2 1 2 9 9 9 9 4 4 4 ...
##  $ availability_365              : int  365 365 365 353 355 346 172 59 133 147 ...

Summary of airbnb data is determined.

summary(airbnb)
##        id               name              host_id           host_name        
##  Min.   :   49091   Length:7907        Min.   :    23666   Length:7907       
##  1st Qu.:15821800   Class :character   1st Qu.: 23058075   Class :character  
##  Median :24706270   Mode  :character   Median : 63448912   Mode  :character  
##  Mean   :23388625                      Mean   : 91144807                     
##  3rd Qu.:32348500                      3rd Qu.:155381142                     
##  Max.   :38112762                      Max.   :288567551                     
##                                                                              
##  neighbourhood_group neighbourhood         latitude       longitude    
##  Length:7907         Length:7907        Min.   :1.244   Min.   :103.6  
##  Class :character    Class :character   1st Qu.:1.296   1st Qu.:103.8  
##  Mode  :character    Mode  :character   Median :1.311   Median :103.8  
##                                         Mean   :1.314   Mean   :103.8  
##                                         3rd Qu.:1.322   3rd Qu.:103.9  
##                                         Max.   :1.455   Max.   :104.0  
##                                                                        
##   room_type             price         minimum_nights    number_of_reviews
##  Length:7907        Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
##  Class :character   1st Qu.:   65.0   1st Qu.:   1.00   1st Qu.:  0.00   
##  Mode  :character   Median :  124.0   Median :   3.00   Median :  2.00   
##                     Mean   :  169.3   Mean   :  17.51   Mean   : 12.81   
##                     3rd Qu.:  199.0   3rd Qu.:  10.00   3rd Qu.: 10.00   
##                     Max.   :10000.0   Max.   :1000.00   Max.   :323.00   
##                                                                          
##  last_review        reviews_per_month calculated_host_listings_count
##  Length:7907        Min.   : 0.010    Min.   :  1.00                
##  Class :character   1st Qu.: 0.180    1st Qu.:  2.00                
##  Mode  :character   Median : 0.550    Median :  9.00                
##                     Mean   : 1.044    Mean   : 40.61                
##                     3rd Qu.: 1.370    3rd Qu.: 48.00                
##                     Max.   :13.000    Max.   :274.00                
##                     NA's   :2758                                    
##  availability_365
##  Min.   :  0.0   
##  1st Qu.: 54.0   
##  Median :260.0   
##  Mean   :208.7   
##  3rd Qu.:355.0   
##  Max.   :365.0   
## 

Number of attributes in airbnb data is determined.

length(airbnb)
## [1] 16

A total of 16 attributes are present in the airbnb data.

The names of all attributes of the airbnb data are listed.

names(airbnb)
##  [1] "id"                             "name"                          
##  [3] "host_id"                        "host_name"                     
##  [5] "neighbourhood_group"            "neighbourhood"                 
##  [7] "latitude"                       "longitude"                     
##  [9] "room_type"                      "price"                         
## [11] "minimum_nights"                 "number_of_reviews"             
## [13] "last_review"                    "reviews_per_month"             
## [15] "calculated_host_listings_count" "availability_365"

The attributes of the airbnb data are id, name, host_id, host_name, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, last_review, reviews_per_month, calculated_host_listings_count and availability_365.

Part 3: Data Preprocessing

The number of missing values in the airbnb data is calculated.

sum(is.na(airbnb))
## [1] 2758

A total of 2758 missing values are present in the airbnb data.

Attributes of airbnb data that contain missing values are located.

colSums(is.na(airbnb))
##                             id                           name 
##                              0                              0 
##                        host_id                      host_name 
##                              0                              0 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                              0                           2758 
## calculated_host_listings_count               availability_365 
##                              0                              0

All 2758 missing values are in the reviews_per_month column.
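
is.na() only counts values that R stores as NA. Character columns read by read.csv() keep empty fields as empty strings, which may be why last_review shows 0 missing values even though the listing with 0 reviews in the first 20 rows has no review date. The sketch below illustrates the difference on a made-up data frame; toy_reviews and its values are hypothetical.

```r
# Hypothetical three-row extract: the second listing has no reviews.
toy_reviews <- data.frame(
  last_review       = c("2019-08-11", "", "2019-07-28"),
  reviews_per_month = c(0.15, NA, 0.22),
  stringsAsFactors  = FALSE
)

# NA counts per column: the empty string is not counted as missing.
na_counts <- colSums(is.na(toy_reviews))

# Counting NA or empty string treats both as missing.
is_blank <- sapply(toy_reviews, function(col) is.na(col) | col == "")
missing_counts <- colSums(is_blank)

na_counts
missing_counts
```

If empty strings should be treated as missing from the start, read.csv() accepts an na.strings argument, for example na.strings = c("", "NA"). Since Last_review is dropped later in this project, the difference does not affect the models.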

The attributes of the airbnb data are renamed for readability.

Unwanted attributes, namely ID, Host_name and Last_review, are removed.

The remaining attributes are reordered so that the most relevant ones come first.

Listings with a price of 0 are filtered out, since a price of 0 is a faulty record and such records would weaken the predictive models.

All missing values in the Review_per_month column are replaced with 0.

The airbnb data is sorted in ascending order by Room_type, Price, Region and Neighbourhood.

airbnb <- airbnb %>%
  rename(ID=id,
         Name = name,
         Host_ID = host_id,
         Host_name = host_name, 
         Region = neighbourhood_group,
         Neighbourhood = neighbourhood,
         Latitude = latitude,
         Longitude = longitude,
         Room_type = room_type,
         Price = price,
         Minimum_night = minimum_nights,
         Review_count = number_of_reviews,
         Last_review = last_review,
         Review_per_month = reviews_per_month,
         Host_listing_count = calculated_host_listings_count,
         Day_available_per_year = availability_365) %>%
  select(Name, 
         Room_type, 
         Price,
         Region,
         Neighbourhood,
         Latitude,
         Longitude,
         Host_ID,
         Minimum_night,
         Review_count, 
         Review_per_month,
         Host_listing_count,
         Day_available_per_year) %>%
  filter (Price > 0) %>%
  mutate(Review_per_month = replace_na(Review_per_month,0)) %>%
  arrange(Room_type,Price,Region,Neighbourhood)
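
The effect of the filter, replace and sort steps can be checked on a tiny example. The sketch below uses base R on a made-up data frame rather than the dplyr pipeline above; toy_listings and its values are hypothetical.

```r
toy_listings <- data.frame(
  Room_type        = c("Private room", "Entire home/apt", "Private room", "Shared room"),
  Price            = c(83, 0, 40, 25),
  Review_per_month = c(0.01, NA, 0.11, NA),
  stringsAsFactors = FALSE
)

# 1. Drop faulty records with a price of 0.
cleaned <- toy_listings[toy_listings$Price > 0, ]

# 2. Treat a missing review rate as "no reviews".
cleaned$Review_per_month[is.na(cleaned$Review_per_month)] <- 0

# 3. Sort by room type, then by price, both ascending.
cleaned <- cleaned[order(cleaned$Room_type, cleaned$Price), ]
rownames(cleaned) <- NULL

cleaned
```

One row is dropped for its zero price, the remaining NA becomes 0, and the rows are ordered by room type and then by price.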

The Room_type, Region and Neighbourhood attributes are converted to factors, because read.csv() has read them as character, which is the default for string columns since R 4.0.0.

airbnb[c("Room_type","Region","Neighbourhood")] <- map(airbnb[c("Room_type","Region","Neighbourhood")], as.factor)

A sanity check is performed for any missing values remaining in the airbnb data.

sum(is.na(airbnb))
## [1] 0

No missing values remain in the airbnb data.

First 20 rows of cleaned airbnb data are viewed.

head(airbnb,20) %>%
  kable() %>%
  kable_styling()
| Name | Room_type | Price | Region | Neighbourhood | Latitude | Longitude | Host_ID | Minimum_night | Review_count | Review_per_month | Host_listing_count | Day_available_per_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Central 1BR Apt in Foodie Haven Hipster Paradise | Entire home/apt | 14 | Central Region | Geylang | 1.31455 | 103.8832 | 29799617 | 3 | 4 | 4.00 | 1 | 34 |
| Senja cozy | Entire home/apt | 14 | West Region | Bukit Panjang | 1.38400 | 103.7631 | 75175440 | 1 | 1 | 0.45 | 2 | 0 |
| Cozy Private Room in Nice Apartment with Pool | Entire home/apt | 31 | Central Region | Bukit Timah | 1.33861 | 103.7808 | 26246420 | 90 | 14 | 0.43 | 1 | 0 |
| Spacious Basic Studio in a Local Hood! | Entire home/apt | 39 | North Region | Yishun | 1.42968 | 103.8366 | 73254645 | 1 | 3 | 2.31 | 1 | 0 |
| Capsule Pod Single Bed in 4 Female Share Room | Entire home/apt | 42 | Central Region | Outram | 1.28374 | 103.8441 | 87411537 | 1 | 6 | 0.16 | 7 | 346 |
| Studioroom @ Farrer Park MRT, Central. Privacy | Entire home/apt | 43 | Central Region | Kallang | 1.31413 | 103.8574 | 24682062 | 25 | 0 | 0.00 | 1 | 155 |
| Short Stay Apartment (Charisma View) S(598671) | Entire home/apt | 46 | Central Region | Bukit Timah | 1.34477 | 103.7686 | 141452197 | 90 | 1 | 0.04 | 1 | 180 |
| Stylish & Modern Deluxe Condominium in Katong | Entire home/apt | 50 | Central Region | Geylang | 1.31384 | 103.8941 | 198046784 | 3 | 22 | 2.04 | 1 | 3 |
| Unique space pod for 1 pax, near eateries, MRT! | Entire home/apt | 50 | Central Region | Kallang | 1.31250 | 103.8613 | 211434562 | 1 | 8 | 0.75 | 64 | 365 |
| One Bedroom Whole House | Entire home/apt | 50 | West Region | Choa Chu Kang | 1.37869 | 103.7518 | 20682139 | 5 | 14 | 1.07 | 1 | 310 |
| Condo with sea view | Entire home/apt | 50 | West Region | Clementi | 1.29767 | 103.7649 | 547772 | 3 | 2 | 0.10 | 1 | 0 |
| Cozy Lakeside | Entire home/apt | 50 | West Region | Jurong West | 1.33492 | 103.7235 | 75175440 | 1 | 2 | 0.91 | 2 | 0 |
| [Green] Quiet Studio Unit with Reservoir View | Entire home/apt | 54 | North-East Region | Sengkang | 1.39546 | 103.8804 | 52404087 | 90 | 2 | 0.29 | 1 | 164 |
| Unique single capsule bed, Shops, eateries, MRT! | Entire home/apt | 56 | Central Region | Kallang | 1.31218 | 103.8607 | 211434562 | 1 | 5 | 0.47 | 64 | 365 |
| Unique space pod for 1 pax, can cook, shops, MRT! | Entire home/apt | 56 | Central Region | Kallang | 1.31079 | 103.8621 | 211434562 | 1 | 0 | 0.00 | 64 | 365 |
| 2 bed room near town | Entire home/apt | 56 | Central Region | Toa Payoh | 1.33957 | 103.8457 | 24054848 | 1 | 0 | 0.00 | 1 | 0 |
| 2 Bedroom furnished flat 3 min from MRT stop | Entire home/apt | 56 | Central Region | Toa Payoh | 1.33996 | 103.8444 | 11493720 | 1 | 3 | 0.08 | 1 | 0 |
| (WHOLE HOUSE) Breezy Comfy Accomodation. | Entire home/apt | 56 | North Region | Woodlands | 1.44787 | 103.7985 | 128465640 | 62 | 0 | 0.00 | 2 | 365 |
| Beachfront Hammocking at East Coast Park Site D | Entire home/apt | 58 | East Region | Bedok | 1.30437 | 103.9221 | 252092906 | 1 | 0 | 0.00 | 2 | 179 |
| 1BR in new apartment with own bath | Entire home/apt | 60 | Central Region | Kallang | 1.32746 | 103.8644 | 39246294 | 1 | 0 | 0.00 | 2 | 0 |

Dimension of cleaned airbnb data is determined.

dim(airbnb)
## [1] 7906   13

It is now a data frame with 7906 rows and 13 columns.

Content of cleaned airbnb data is determined.

glimpse(airbnb)
## Rows: 7,906
## Columns: 13
## $ Name                   <chr> "Central 1BR Apt in Foodie Haven Hipster Par...
## $ Room_type              <fct> Entire home/apt, Entire home/apt, Entire hom...
## $ Price                  <int> 14, 14, 31, 39, 42, 43, 46, 50, 50, 50, 50, ...
## $ Region                 <fct> Central Region, West Region, Central Region,...
## $ Neighbourhood          <fct> Geylang, Bukit Panjang, Bukit Timah, Yishun,...
## $ Latitude               <dbl> 1.31455, 1.38400, 1.33861, 1.42968, 1.28374,...
## $ Longitude              <dbl> 103.8832, 103.7631, 103.7808, 103.8366, 103....
## $ Host_ID                <int> 29799617, 75175440, 26246420, 73254645, 8741...
## $ Minimum_night          <int> 3, 1, 90, 1, 1, 25, 90, 3, 1, 5, 3, 1, 90, 1...
## $ Review_count           <int> 4, 1, 14, 3, 6, 0, 1, 22, 8, 14, 2, 2, 2, 5,...
## $ Review_per_month       <dbl> 4.00, 0.45, 0.43, 2.31, 0.16, 0.00, 0.04, 2....
## $ Host_listing_count     <int> 1, 2, 1, 1, 7, 1, 1, 1, 64, 1, 1, 2, 1, 64, ...
## $ Day_available_per_year <int> 34, 0, 0, 0, 346, 155, 180, 3, 365, 310, 0, ...

Structure of cleaned airbnb data is determined.

str(airbnb)
## 'data.frame':    7906 obs. of  13 variables:
##  $ Name                  : chr  "Central 1BR Apt in Foodie Haven Hipster Paradise" "Senja cozy" "Cozy Private Room in Nice Apartment with Pool" "Spacious Basic Studio in a Local Hood!" ...
##  $ Room_type             : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Price                 : int  14 14 31 39 42 43 46 50 50 50 ...
##  $ Region                : Factor w/ 5 levels "Central Region",..: 1 5 1 4 1 1 1 1 1 5 ...
##  $ Neighbourhood         : Factor w/ 43 levels "Ang Mo Kio","Bedok",..: 12 6 7 43 25 16 7 12 16 9 ...
##  $ Latitude              : num  1.31 1.38 1.34 1.43 1.28 ...
##  $ Longitude             : num  104 104 104 104 104 ...
##  $ Host_ID               : int  29799617 75175440 26246420 73254645 87411537 24682062 141452197 198046784 211434562 20682139 ...
##  $ Minimum_night         : int  3 1 90 1 1 25 90 3 1 5 ...
##  $ Review_count          : int  4 1 14 3 6 0 1 22 8 14 ...
##  $ Review_per_month      : num  4 0.45 0.43 2.31 0.16 0 0.04 2.04 0.75 1.07 ...
##  $ Host_listing_count    : int  1 2 1 1 7 1 1 1 64 1 ...
##  $ Day_available_per_year: int  34 0 0 0 346 155 180 3 365 310 ...

Summary of cleaned airbnb data is determined.

summary(airbnb)
##      Name                     Room_type        Price        
##  Length:7906        Entire home/apt:4131   Min.   :   14.0  
##  Class :character   Private room   :3381   1st Qu.:   65.0  
##  Mode  :character   Shared room    : 394   Median :  124.0  
##                                            Mean   :  169.4  
##                                            3rd Qu.:  199.0  
##                                            Max.   :10000.0  
##                                                             
##                Region         Neighbourhood     Latitude       Longitude    
##  Central Region   :6308   Kallang    :1043   Min.   :1.244   Min.   :103.6  
##  East Region      : 508   Geylang    : 994   1st Qu.:1.296   1st Qu.:103.8  
##  North-East Region: 346   Novena     : 537   Median :1.311   Median :103.8  
##  North Region     : 204   Rochor     : 535   Mean   :1.314   Mean   :103.8  
##  West Region      : 540   Outram     : 477   3rd Qu.:1.322   3rd Qu.:103.9  
##                           Bukit Merah: 470   Max.   :1.455   Max.   :104.0  
##                           (Other)    :3850                                  
##     Host_ID          Minimum_night      Review_count    Review_per_month 
##  Min.   :    23666   Min.   :   1.00   Min.   :  0.00   Min.   : 0.0000  
##  1st Qu.: 23055695   1st Qu.:   1.00   1st Qu.:  0.00   1st Qu.: 0.0000  
##  Median : 63448912   Median :   3.00   Median :  2.00   Median : 0.1600  
##  Mean   : 91141831   Mean   :  17.51   Mean   : 12.81   Mean   : 0.6797  
##  3rd Qu.:155386432   3rd Qu.:  10.00   3rd Qu.: 10.00   3rd Qu.: 0.8500  
##  Max.   :288567551   Max.   :1000.00   Max.   :323.00   Max.   :13.0000  
##                                                                          
##  Host_listing_count Day_available_per_year
##  Min.   :  1.00     Min.   :  0.0         
##  1st Qu.:  2.00     1st Qu.: 54.0         
##  Median :  9.00     Median :260.0         
##  Mean   : 40.61     Mean   :208.7         
##  3rd Qu.: 48.00     3rd Qu.:355.0         
##  Max.   :274.00     Max.   :365.0         
## 

Part 4: Exploratory Data Analysis / Data Visualization

The distribution of Singapore Airbnb prices is presented as a boxplot.

# Store the graph
box_plot <- ggplot(airbnb, aes(y = Price))
# Add the geometric object box plot
box_plot +
    geom_boxplot() +coord_flip()+ggtitle("Overall Price Boxplot")

About 75% of Singapore Airbnb listings are priced at or below SGD 199, the third quartile of Price.
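
The quartiles reported by summary() also indicate where the boxplot starts to mark points as outliers. By default, geom_boxplot() draws each whisker to the most extreme value within 1.5 times the interquartile range of the box edge, so the fences can be worked out from the first and third quartiles of Price reported earlier (SGD 65 and SGD 199). The sketch below treats those summary quartiles as the box edges, which is an approximation.

```r
q1 <- 65    # first quartile of Price, from summary(airbnb)
q3 <- 199   # third quartile of Price, from summary(airbnb)

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # prices above this are drawn as outlier points
lower_fence <- q1 - 1.5 * iqr   # negative here, so no price can fall below it

c(iqr = iqr, upper_fence = upper_fence, lower_fence = lower_fence)
```

Under this approximation, any listing priced above SGD 400 appears as an individual outlier point, which is why the long tail up to SGD 10000 dominates the plot.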

# Store the graph
box_plot <- ggplot(airbnb, aes(x = Region,y = Price))
# Add the geometric object box plot
box_plot +
    geom_boxplot() +coord_flip()+ggtitle("Boxplot of Price by Region")

The most expensive Singapore Airbnb listings are located in the West Region and the Central Region. All listings in the North Region are priced below SGD 1250.

The relationship between room type and price is presented.

# Count listings per room type for the pie chart
freq_room <- airbnb %>%
  count(Room_type) %>%
  arrange(desc(Room_type))
options(repr.plot.width=14, repr.plot.height=6)
plot1 <- ggplot(freq_room, aes(x="", y=n, fill=Room_type)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) + theme_void() +
  geom_text(aes(label = paste(round(n / sum(n) * 100, 1), "%")),
            colour = 'white', position = position_stack(vjust = 0.5)) +
  ggtitle("Pie chart of Room Types")
# Average price per room type for the bar chart
avg_price_host <- airbnb %>%
  group_by(Room_type) %>%
  summarise(avg_price = mean(Price), .groups = 'drop')
plot2 <- ggplot(avg_price_host, aes(x=reorder(Room_type, -avg_price), y=avg_price)) +
  geom_col(aes(fill=avg_price), width = 1) +
  ggtitle("Average price by room type") + coord_flip() +
  scale_y_continuous(limits=c(0, 300)) +
  geom_label(mapping = aes(label = round(avg_price, 1)), size = 4, fill = "#F5FFFA", fontface = "bold")
plot_grid(plot1, plot2, ncol=2, nrow=1, rel_widths = c(1, 1))

More than half of the listings in Singapore are entire homes/apartments, and only 5% are shared rooms.

The average price is SGD 65.7 for a shared room, SGD 110.9 for a private room and SGD 227.1 for an entire home/apartment.
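
The percentages on the pie chart follow directly from the room-type counts in the summary of the cleaned data. A quick check using those counts, with the same rounding as the chart labels:

```r
# Room-type counts copied from summary(airbnb) after cleaning.
room_counts <- c("Entire home/apt" = 4131, "Private room" = 3381, "Shared room" = 394)

# Share of each room type in percent, rounded to one decimal place.
room_share <- round(room_counts / sum(room_counts) * 100, 1)
room_share
```

The counts add up to the 7906 rows of the cleaned data, and the shares come out at roughly 52%, 43% and 5%.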

The relationship between region and price is presented.

freq_region <- airbnb %>%
  count(Region)
freq_region <- freq_region %>% 
  arrange(desc(Region)) 
options(repr.plot.width=14, repr.plot.height=6)
plot2_1 <- ggplot(freq_region, aes(x="", y=n, fill=Region)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0)+theme_void() + geom_text(aes(label = paste(round(n / sum(n) * 100, 1), "%")),colour = 'white',position = position_stack(vjust = 0.5))+ggtitle("Pie chart of Region Types")
avg_price_region <- airbnb %>%
  group_by(Region) %>%
  summarise(avg_price= mean(Price),.groups ='drop') 
plot2_2 <-ggplot(avg_price_region, aes(x=reorder(Region, -avg_price), y=avg_price, fill="violet"))+
  geom_col(aes(fill=avg_price),width = 1) +coord_flip()+
     geom_label(mapping = aes(label = round(avg_price, 1)), size = 4, fill = "#F5FFFA", fontface = "bold")+ scale_y_continuous(limits=c(0, 300))
plot_grid(plot2_1, plot2_2, ncol=2, nrow=1,rel_widths = c(1, 1))

79.8% (6308 listings) of Singapore Airbnb listings are located in the Central Region, while the North Region has the fewest, with only 204 listings.

The North-East Region has the lowest average price at SGD 99.8, while the West Region and the Central Region have the highest at SGD 176 and SGD 176.7 respectively.

The most expensive and the cheapest Singapore Airbnb neighbourhoods are identified.

# Average price per neighbourhood (Region is kept for colouring the bars)
top_10_neighbourhood <- aggregate(list(airbnb$Price), list(airbnb$Neighbourhood, airbnb$Region), mean)
colnames(top_10_neighbourhood) <- c("Neighbourhood", "Region","Average_price_per_neighborhood")
# Sort ascending by average price
top_10_neighbourhood <- top_10_neighbourhood[order(top_10_neighbourhood$Average_price_per_neighborhood),]
# Keep the 12 highest averages, then drop the top two, leaving ranks 3 to 12
top_10_neighbourhood <- tail(top_10_neighbourhood, 12)
top_10_neighbourhood <- head(top_10_neighbourhood, 10)
# Label rows from 10 (lowest shown) to 1 (highest shown)
row.names(top_10_neighbourhood) <- 10:1
top_10_neighbourhood
##      Neighbourhood         Region Average_price_per_neighborhood
## 10          Newton Central Region                       188.7463
## 9           Rochor Central Region                       189.1458
## 8  Singapore River Central Region                       189.9371
## 7          Tanglin Central Region                       201.2762
## 6    Downtown Core Central Region                       205.3949
## 5      Bukit Batok    West Region                       206.1692
## 4           Museum Central Region                       236.3175
## 3          Orchard Central Region                       291.0294
## 2    Bukit Panjang    West Region                       365.3529
## 1     Marina South Central Region                       419.0000
options(repr.plot.width=15, repr.plot.height=11)
plot3 <- ggplot(data = top_10_neighbourhood, mapping = aes(x = reorder(Neighbourhood, -Average_price_per_neighborhood), y = Average_price_per_neighborhood)) +
     geom_bar(stat = "identity", mapping = aes(fill = Region, color = Region), alpha = .8, size = 1.5) +
  coord_flip() +
     geom_label(mapping = aes(label = round(Average_price_per_neighborhood, 1)), size = 4, fill = "#F5FFFA", fontface = "bold") + ggtitle("Top 10 most expensive Airbnb neighbourhood in Singapore")
plot3

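Most of the expensive neighbourhoods shown are in the Central Region, and Marina South is the only one of them with an average price above SGD 400. Note that the code keeps the 12 highest averages and then drops the top two, so the two neighbourhoods with the highest average prices do not appear in this table or chart.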
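
To make the row selection in the code above concrete, the sketch below applies the same tail(12) then head(10) steps to a made-up sorted vector and compares the result with a plain top 10. The vector avg_prices and its values are hypothetical.

```r
# Hypothetical average prices for 15 neighbourhoods, sorted ascending.
avg_prices <- sort(c(46, 57, 65, 75, 81, 86, 88, 91, 93, 120, 150, 189, 205, 419, 1000))

# Same selection as in the report: keep the 12 largest, then the first 10 of those.
ranks_3_to_12 <- head(tail(avg_prices, 12), 10)

# A plain "top 10" would keep the 10 largest values instead.
top_10 <- tail(avg_prices, 10)

max(ranks_3_to_12)  # third-largest value overall
max(top_10)         # largest value overall
```

Whether excluding the two highest averages is intended, for example to keep extreme values from stretching the chart, is not stated in the report.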

top_10_neighbourhood_2 <- aggregate(list(airbnb$Price), list(airbnb$Neighbourhood, airbnb$Region), mean)
colnames(top_10_neighbourhood_2) <- c("Neighbourhood","Region", "Average_price_per_neighborhood")
top_10_neighbourhood_2 <- top_10_neighbourhood_2[order(top_10_neighbourhood_2$Average_price_per_neighborhood),]
top_10_neighbourhood_2 <- head(top_10_neighbourhood_2, 10)
row.names(top_10_neighbourhood_2) <- 10:1
top_10_neighbourhood_2
##              Neighbourhood            Region Average_price_per_neighborhood
## 10 Western Water Catchment       West Region                       46.25000
## 9                   Mandai      North Region                       56.66667
## 8             Lim Chu Kang      North Region                       65.00000
## 7                 Sengkang North-East Region                       74.85075
## 6                Woodlands      North Region                       81.49254
## 5                  Punggol North-East Region                       85.74419
## 4                Sembawang      North Region                       88.26829
## 3              Jurong West       West Region                       91.04575
## 2                Serangoon North-East Region                       91.17391
## 1            Choa Chu Kang       West Region                       93.31746
options(repr.plot.width=15, repr.plot.height=11)
plot4 <- ggplot(data = top_10_neighbourhood_2, mapping = aes(x = reorder(Neighbourhood, -Average_price_per_neighborhood), y = Average_price_per_neighborhood)) +
     geom_bar(stat = "identity", mapping = aes(fill = Region, color = Region), alpha = .8, size = 1.5) +
     geom_label(mapping = aes(label = round(Average_price_per_neighborhood, 1)), size = 4, fill = "#F5FFFA", fontface = "bold") +
     coord_flip() +
     ggtitle("Top 10 cheapest Airbnb neighbourhood in Singapore")
plot4

Travelers can find cheap Airbnb listings in neighbourhoods across the North-East, North and West Regions, especially in Western Water Catchment, where the average price is only SGD 46.25.
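
The two rankings above follow the same pattern: aggregate the mean price per group, sort, and keep the first ten rows. A minimal sketch of that pattern as a single helper is given here. The function name `rank_by_mean_price` and the toy data frame are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: mean price per group, sorted, first n rows.
rank_by_mean_price <- function(data, group_col, n = 10, decreasing = TRUE) {
  means <- aggregate(data$Price, by = list(data[[group_col]]), FUN = mean)
  colnames(means) <- c(group_col, "Average_price")
  means <- means[order(means$Average_price, decreasing = decreasing), ]
  rownames(means) <- NULL
  head(means, n)
}

# Toy data standing in for the airbnb data frame (values are made up).
toy <- data.frame(
  Neighbourhood = c("A", "A", "B", "B", "C"),
  Price = c(100, 200, 50, 70, 300)
)

most_expensive <- rank_by_mean_price(toy, "Neighbourhood", n = 2)
cheapest <- rank_by_mean_price(toy, "Neighbourhood", n = 2, decreasing = FALSE)
print(most_expensive)
print(cheapest)
```

Using one helper for both directions avoids repeating the aggregate, rename and sort steps for each plot.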

df_map <- aggregate(list(airbnb$Price), list(airbnb$Day_available_per_year), mean)
colnames(df_map) <- c("Availability", "Average_price_per_availability")
ggplot(data = df_map, mapping = aes(y = Average_price_per_availability, x = Availability, color = Average_price_per_availability)) +
    theme_minimal() +
    scale_fill_identity() +
    geom_line(mapping = aes(color = Average_price_per_availability)) +
    ggtitle("Relationship between availability and price of Airbnb")

Listings available 82 days per year have the highest average price, SGD 830, while listings available 185 days per year average only SGD 21.

The distribution of Airbnb listings on the Singapore map is presented.

ggplot(data = airbnb, mapping = aes(x = Longitude, y = Latitude, color = Region)) +
    theme_minimal() +
    scale_fill_identity() +
    geom_point(mapping = aes(color = Region), size = 3) +
    ggtitle("Airbnb in Singapore")

Plotting each listing by its longitude and latitude traces the outline of Singapore and shows how the listings are distributed across the regions.

The locations of Airbnb listings priced below SGD 200 are presented.

df_map_1 <- airbnb %>%
  filter(Price <200)
ggplot(data = df_map_1, mapping = aes(x = Longitude, y = Latitude, color = Price)) +
    theme_minimal() +
    scale_fill_identity() +
    geom_point(mapping = aes(color = Price), size = 3) +
    ggtitle("Location of Airbnb with price < SGD 200")+scale_color_gradient(low="blue", high="red")

The locations of Airbnb listings priced above SGD 200 are presented.

df_map2 <- airbnb %>%
  filter(Price >200)

ggplot(data = df_map2, mapping = aes(x = Longitude, y = Latitude, color = Price)) +
    theme_minimal() +
    scale_fill_identity() +
    geom_point(mapping = aes(color = Price), size = 3) +
    ggtitle("Location of Airbnb with price > SGD 200")+scale_color_gradient(low="blue", high="red")

A word cloud of the Airbnb listing names is presented.

text <- airbnb$Name
docs <- Corpus(VectorSource(text))
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
docs <- tm_map(docs, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
docs <- tm_map(docs, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents
dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)
set.seed(1234) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, min.freq = 1,max.words=100, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))

Airbnb listing names often use ‘near’, ‘mrt’, ‘city’ or ‘min’, as these words signal convenient transportation. ‘room’, ‘bedroom’, ‘spacious’, ‘cosy’ and ‘cozy’ are also used to describe the accommodation in a way that catches attention.

Distribution of host listing count is presented.

paste('There are' ,length(unique(airbnb$Host_ID)), 'Hosts for Singapore Airbnb.')
## [1] "There are 2705 Hosts for Singapore Airbnb."
airbnb_host <-distinct(airbnb, Host_ID, .keep_all = TRUE)

breaks <- c(0,25,50,75,100,125,150,175,200,225,250,275,300)
# specify interval/bin labels (left-closed; the last bin is closed on both sides because include.lowest=TRUE)
tags <- c("[0,25)","[25,50)","[50,75)","[75,100)", "[100,125)", "[125,150)","[150,175)", "[175,200)","[200,225)", "[225,250)","[250,275)", "[275,300]")
# bucketing values into bins
group_tags <- cut(airbnb_host$Host_listing_count, 
                  breaks=breaks, 
                  include.lowest=TRUE, 
                  right=FALSE, 
                  labels=tags)

ggplot(data = as_tibble(group_tags), mapping = aes(x=value)) + 
  geom_bar(fill="bisque",color="white") + 
  stat_count(geom="text", aes(label=..count..), vjust=-0.5) +
  labs(x='Host Listing Count') +
  ggtitle('Host Listing Count')+
  theme_minimal() 

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

paste('Most number of host listing in Singapore is ' ,getmode(airbnb_host$Host_listing_count),'while Maximum host listing in Singapore is' ,max(airbnb_host$Host_listing_count),'.')
## [1] "Most number of host listing in Singapore is  1 while Maximum host listing in Singapore is 274 ."

98% of hosts have fewer than 25 listings.
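
A share like the one above can be computed directly rather than read off the bar chart. The sketch below shows the idea on made-up listing counts. In the analysis itself the input would be `airbnb_host$Host_listing_count`, and the vector name `listing_counts` is hypothetical.

```r
# Made-up listing counts per host, standing in for airbnb_host$Host_listing_count.
listing_counts <- c(1, 1, 2, 3, 1, 30, 1, 5, 274, 2)

# mean() of a logical vector gives the proportion of TRUE values.
share_below_25 <- mean(listing_counts < 25)
cat(sprintf("%.0f%% of hosts have fewer than 25 listings\n", 100 * share_below_25))
```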

Correlation plot is presented.

#One Hot encoding for room type, region and neighbourhood
df_ohe <- airbnb %>%
  select(Room_type, Region,Neighbourhood)

dummy <- dummyVars(" ~ .", data=df_ohe)
newdata <- data.frame(predict(dummy, newdata = df_ohe))
df2 <- merge(airbnb, newdata, by=0)
names(newdata)
##  [1] "Room_type.Entire.home.apt"            
##  [2] "Room_type.Private.room"               
##  [3] "Room_type.Shared.room"                
##  [4] "Region.Central.Region"                
##  [5] "Region.East.Region"                   
##  [6] "Region.North.East.Region"             
##  [7] "Region.North.Region"                  
##  [8] "Region.West.Region"                   
##  [9] "Neighbourhood.Ang.Mo.Kio"             
## [10] "Neighbourhood.Bedok"                  
## [11] "Neighbourhood.Bishan"                 
## [12] "Neighbourhood.Bukit.Batok"            
## [13] "Neighbourhood.Bukit.Merah"            
## [14] "Neighbourhood.Bukit.Panjang"          
## [15] "Neighbourhood.Bukit.Timah"            
## [16] "Neighbourhood.Central.Water.Catchment"
## [17] "Neighbourhood.Choa.Chu.Kang"          
## [18] "Neighbourhood.Clementi"               
## [19] "Neighbourhood.Downtown.Core"          
## [20] "Neighbourhood.Geylang"                
## [21] "Neighbourhood.Hougang"                
## [22] "Neighbourhood.Jurong.East"            
## [23] "Neighbourhood.Jurong.West"            
## [24] "Neighbourhood.Kallang"                
## [25] "Neighbourhood.Lim.Chu.Kang"           
## [26] "Neighbourhood.Mandai"                 
## [27] "Neighbourhood.Marina.South"           
## [28] "Neighbourhood.Marine.Parade"          
## [29] "Neighbourhood.Museum"                 
## [30] "Neighbourhood.Newton"                 
## [31] "Neighbourhood.Novena"                 
## [32] "Neighbourhood.Orchard"                
## [33] "Neighbourhood.Outram"                 
## [34] "Neighbourhood.Pasir.Ris"              
## [35] "Neighbourhood.Punggol"                
## [36] "Neighbourhood.Queenstown"             
## [37] "Neighbourhood.River.Valley"           
## [38] "Neighbourhood.Rochor"                 
## [39] "Neighbourhood.Sembawang"              
## [40] "Neighbourhood.Sengkang"               
## [41] "Neighbourhood.Serangoon"              
## [42] "Neighbourhood.Singapore.River"        
## [43] "Neighbourhood.Southern.Islands"       
## [44] "Neighbourhood.Sungei.Kadut"           
## [45] "Neighbourhood.Tampines"               
## [46] "Neighbourhood.Tanglin"                
## [47] "Neighbourhood.Toa.Payoh"              
## [48] "Neighbourhood.Tuas"                   
## [49] "Neighbourhood.Western.Water.Catchment"
## [50] "Neighbourhood.Woodlands"              
## [51] "Neighbourhood.Yishun"
df_cor <- df2 %>%
select(Price, Latitude, Longitude, Minimum_night, Review_count, Review_per_month, Host_listing_count, Day_available_per_year, Room_type.Entire.home.apt, Room_type.Private.room, Room_type.Shared.room, Region.Central.Region, Region.East.Region, Region.North.East.Region, Region.North.Region, Region.West.Region)

M <-cor(df_cor)
corrplot.mixed(M)

Based on the correlation plot, there is a moderate positive linear relationship (0.36) between host listing count and room type (entire home). Host listing count has a moderate negative linear relationship (-0.33) with room type (private room).
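
Correlating a numeric column with a room type only works because each category has been turned into a 0/1 indicator column. The sketch below illustrates this on made-up data using base R's `model.matrix` instead of `dummyVars`. The data frame `toy` and its values are assumptions for illustration only.

```r
# Toy data (made up) standing in for the airbnb data frame.
toy <- data.frame(
  Host_listing_count = c(1, 2, 1, 40, 35, 50),
  Room_type = factor(c("Private room", "Private room", "Shared room",
                       "Entire home/apt", "Entire home/apt", "Entire home/apt"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column.
dummies <- model.matrix(~ Room_type - 1, data = toy)

# Correlation of the numeric column with each indicator column.
cors <- cor(toy$Host_listing_count, dummies)
print(round(cors, 2))
```

A positive value for an indicator means that hosts with more listings are more often found in that category. The indicator columns of one variable are negatively correlated with each other by construction, because exactly one of them equals 1 in every row.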

Part 5: Machine Learning Model: Regression

5.1 Linear Regression

The Name, Latitude, Longitude and Host_ID attributes are removed for regression, and the resulting data frame is assigned to airbnb_lm.

airbnb_lm <- airbnb %>%
  select(-c(Name,Latitude,Longitude,Host_ID))

Train-test split is carried out with 70-30 split ratio.

set.seed(1000)
for_splitting <- sample.split(Y = airbnb_lm$Price, SplitRatio = 0.7) 
airbnb_train <- subset(airbnb_lm, for_splitting == TRUE)
airbnb_test <- subset(airbnb_lm, for_splitting == FALSE)

Sanity check of correct train-test split is performed.

nrow(airbnb_train) + nrow(airbnb_test) == nrow(airbnb_lm)
## [1] TRUE
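
For readers without the package that provides `sample.split`, a plain random 70-30 split can be sketched in base R as follows. This is a simpler substitute and not the method used above, and the toy data frame is made up.

```r
set.seed(1000)

# Toy data frame (made up) standing in for airbnb_lm.
toy <- data.frame(Price = runif(100, min = 20, max = 500), x = rnorm(100))

# Draw 70% of the row indices without replacement for training.
train_idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# The two parts should be disjoint and together cover every row.
rows_add_up <- nrow(toy_train) + nrow(toy_test) == nrow(toy)
no_overlap <- length(intersect(rownames(toy_train), rownames(toy_test))) == 0
print(c(rows_add_up = rows_add_up, no_overlap = no_overlap))
```

Checking for overlap as well as the row total catches the case where the same row lands in both sets, which a row-count check alone would miss.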

Because of the extreme price outliers, two training sets are created.

airbnb_train contains all price points, including outliers.

airbnb_train_without_outlier contains only the prices between the 10th and 90th percentiles of the training prices.

airbnb_train_without_outlier <- airbnb_train %>% 
  filter(Price <= quantile(airbnb_train$Price, 0.9) & 
           Price >= quantile(airbnb_train$Price, 0.1))

The price variances of airbnb_train and airbnb_train_without_outlier are calculated.

var(airbnb_train$Price)
## [1] 114252
var(airbnb_train_without_outlier$Price)
## [1] 4834.207

The training set without outliers has a far lower price variance (about 4,834) than the training set with outliers (about 114,252).
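
The effect of percentile trimming on variance can be reproduced on simulated data. The sketch below uses made-up, right-skewed prices with a few extreme values. Note that a 10th to 90th percentile filter keeps roughly the middle 80% of the rows, so it removes more than just the extreme points.

```r
set.seed(1)

# Made-up right-skewed prices with a few extreme values appended.
prices <- c(rlnorm(200, meanlog = 4.5, sdlog = 0.5), 5000, 8000, 10000)

lower <- quantile(prices, 0.1)
upper <- quantile(prices, 0.9)
trimmed <- prices[prices >= lower & prices <= upper]

cat("Rows kept:               ", length(trimmed), "of", length(prices), "\n")
cat("Variance, all prices:    ", var(prices), "\n")
cat("Variance, trimmed prices:", var(trimmed), "\n")
```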

Because of the extreme price outliers, two test sets are created.

airbnb_test contains all price points, including outliers.

airbnb_test_without_outlier contains only the prices between the 10th and 90th percentiles of the test prices.

airbnb_test_without_outlier <- airbnb_test %>% 
  filter(Price <= quantile(airbnb_test$Price, 0.9) & Price >= quantile(airbnb_test$Price, 0.1))

Price variance of test set and test set without outlier are calculated.

var(airbnb_test$Price)
## [1] 119238.2
var(airbnb_test_without_outlier$Price)
## [1] 4321.204

The test set without outliers has a far lower price variance (about 4,321) than the test set with outliers (about 119,238).

The first linear regression model is fitted.

first_model <- lm(Price ~ .,data = airbnb_train)
#The results are summarized.
summary(first_model)
## 
## Call:
## lm(formula = Price ~ ., data = airbnb_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1510.9   -74.8   -32.1    16.9  9864.7 
## 
## Coefficients: (4 not defined because of singularities)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           259.60313   40.39281   6.427 1.41e-10 ***
## Room_typePrivate room                -126.97297   10.10227 -12.569  < 2e-16 ***
## Room_typeShared room                 -182.69957   21.92893  -8.331  < 2e-16 ***
## RegionEast Region                     -25.68336   65.79844  -0.390  0.69630    
## RegionNorth-East Region               -40.05534   67.39826  -0.594  0.55233    
## RegionNorth Region                    -46.01364   65.21698  -0.706  0.48050    
## RegionWest Region                    -123.92895  166.26607  -0.745  0.45608    
## NeighbourhoodBedok                     23.97847   56.33675   0.426  0.67040    
## NeighbourhoodBishan                     9.64872   64.50782   0.150  0.88111    
## NeighbourhoodBukit Batok              231.67091  168.48860   1.375  0.16919    
## NeighbourhoodBukit Merah              -46.00465   43.28594  -1.063  0.28792    
## NeighbourhoodBukit Panjang            530.53617  176.17879   3.011  0.00261 ** 
## NeighbourhoodBukit Timah               -5.75903   51.55522  -0.112  0.91106    
## NeighbourhoodCentral Water Catchment   52.84269   82.97585   0.637  0.52425    
## NeighbourhoodChoa Chu Kang             43.22766  169.12628   0.256  0.79827    
## NeighbourhoodClementi                 117.29512  165.85913   0.707  0.47947    
## NeighbourhoodDowntown Core             18.05198   44.03093   0.410  0.68183    
## NeighbourhoodGeylang                  -33.99759   41.43057  -0.821  0.41191    
## NeighbourhoodHougang                   10.62450   65.06035   0.163  0.87029    
## NeighbourhoodJurong East              166.58908  165.42278   1.007  0.31395    
## NeighbourhoodJurong West               78.13475  164.62219   0.475  0.63507    
## NeighbourhoodKallang                    2.96763   41.32580   0.072  0.94276    
## NeighbourhoodLim Chu Kang              22.55857  327.42830   0.069  0.94507    
## NeighbourhoodMandai                   -48.94909  233.99969  -0.209  0.83431    
## NeighbourhoodMarina South             290.08574  325.11929   0.892  0.37230    
## NeighbourhoodMarine Parade            -43.46109   49.48241  -0.878  0.37981    
## NeighbourhoodMuseum                    39.63346   60.26470   0.658  0.51079    
## NeighbourhoodNewton                     2.22475   51.53557   0.043  0.96557    
## NeighbourhoodNovena                   -25.42157   43.17764  -0.589  0.55604    
## NeighbourhoodOrchard                   78.55282   51.27233   1.532  0.12556    
## NeighbourhoodOutram                   -19.40417   43.31624  -0.448  0.65420    
## NeighbourhoodPasir Ris                -29.00293   69.35648  -0.418  0.67584    
## NeighbourhoodPunggol                  -35.18246   81.81577  -0.430  0.66720    
## NeighbourhoodQueenstown               -41.21073   46.11625  -0.894  0.37156    
## NeighbourhoodRiver Valley             -28.31951   44.67385  -0.634  0.52616    
## NeighbourhoodRochor                    26.01482   42.86234   0.607  0.54392    
## NeighbourhoodSembawang                -23.57312   77.03179  -0.306  0.75960    
## NeighbourhoodSengkang                 -27.22713   72.53592  -0.375  0.70741    
## NeighbourhoodSerangoon                 -9.55875   71.74093  -0.133  0.89401    
## NeighbourhoodSingapore River           23.45483   49.58637   0.473  0.63623    
## NeighbourhoodSouthern Islands        1509.45683  101.38815  14.888  < 2e-16 ***
## NeighbourhoodSungei Kadut              -3.41313  234.26102  -0.015  0.98838    
## NeighbourhoodTampines                        NA         NA      NA       NA    
## NeighbourhoodTanglin                  -13.72596   47.12134  -0.291  0.77084    
## NeighbourhoodToa Payoh                       NA         NA      NA       NA    
## NeighbourhoodWestern Water Catchment         NA         NA      NA       NA    
## NeighbourhoodWoodlands                -23.45157   69.00008  -0.340  0.73396    
## NeighbourhoodYishun                          NA         NA      NA       NA    
## Minimum_night                           0.04359    0.10655   0.409  0.68250    
## Review_count                           -0.22768    0.20509  -1.110  0.26698    
## Review_per_month                      -10.79017    5.39619  -2.000  0.04559 *  
## Host_listing_count                     -0.35131    0.07896  -4.449 8.78e-06 ***
## Day_available_per_year                  0.03237    0.03189   1.015  0.31013    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 322.7 on 5493 degrees of freedom
## Multiple R-squared:  0.09672,    Adjusted R-squared:  0.08882 
## F-statistic: 12.25 on 48 and 5493 DF,  p-value: < 2.2e-16

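The first model fits poorly. It explains less than 10% of the variance in price (adjusted R-squared 0.089), the residual standard error is 322.7, and the median residual is -32.1, where a value near 0 would be expected.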
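
The summary also reports "4 not defined because of singularities". A likely reason is that every neighbourhood lies in exactly one region, so the Region indicators are linear combinations of the Neighbourhood indicators. The sketch below reproduces this behaviour on made-up data. The neighbourhood-to-region assignment in `toy` is illustrative only.

```r
# Toy data (made up): each neighbourhood belongs to exactly one region.
toy <- data.frame(
  Region = factor(c("Central", "Central", "Central", "Central",
                    "West", "West", "West", "West")),
  Neighbourhood = factor(c("Orchard", "Orchard", "Outram", "Outram",
                           "Clementi", "Clementi", "Tuas", "Tuas")),
  Price = c(300, 320, 200, 210, 100, 110, 80, 90)
)

fit <- lm(Price ~ Region + Neighbourhood, data = toy)

# Coefficients that cannot be estimated separately are reported as NA.
print(coef(fit))
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)
```

With nested factors like these, one of the two carries no additional information. This is consistent with the stepwise output later in the report, where Region is dropped at no cost in AIC.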

First linear regression model is plotted.

plot(first_model)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

The first linear regression model does not satisfy the linear model assumptions: the points in the normal Q-Q plot depart clearly from the straight line expected under normally distributed residuals.

Since the first model fits poorly, it will not be used to predict new prices.

The second linear regression model is fitted.

A logarithmic transformation of Price is introduced in the second linear regression model, and airbnb_train_without_outlier is used so that outliers are excluded.

second_model <- lm(log(Price) ~ ., data = airbnb_train_without_outlier)
#The results are summarized.
summary(second_model) 
## 
## Call:
## lm(formula = log(Price) ~ ., data = airbnb_train_without_outlier)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24213 -0.29154 -0.04794  0.26085  1.42413 
## 
## Coefficients: (4 not defined because of singularities)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           4.929e+00  4.978e-02  98.998  < 2e-16 ***
## Room_typePrivate room                -6.848e-01  1.300e-02 -52.677  < 2e-16 ***
## Room_typeShared room                 -8.246e-01  4.524e-02 -18.229  < 2e-16 ***
## RegionEast Region                     7.450e-02  8.703e-02   0.856 0.392065    
## RegionNorth-East Region               1.033e-01  8.285e-02   1.247 0.212318    
## RegionNorth Region                    1.437e-01  8.798e-02   1.634 0.102352    
## RegionWest Region                    -4.409e-01  2.733e-01  -1.613 0.106831    
## NeighbourhoodBedok                    1.236e-01  7.646e-02   1.616 0.106081    
## NeighbourhoodBishan                   7.103e-02  8.454e-02   0.840 0.400841    
## NeighbourhoodBukit Batok              5.465e-01  2.758e-01   1.982 0.047576 *  
## NeighbourhoodBukit Merah              1.791e-02  5.341e-02   0.335 0.737308    
## NeighbourhoodBukit Panjang            3.967e-01  2.890e-01   1.373 0.169844    
## NeighbourhoodBukit Timah              1.169e-01  6.586e-02   1.775 0.076006 .  
## NeighbourhoodCentral Water Catchment -5.614e-02  1.093e-01  -0.514 0.607420    
## NeighbourhoodChoa Chu Kang            2.924e-01  2.775e-01   1.054 0.292122    
## NeighbourhoodClementi                 4.944e-01  2.742e-01   1.803 0.071397 .  
## NeighbourhoodDowntown Core            3.215e-01  5.477e-02   5.871 4.64e-09 ***
## NeighbourhoodGeylang                  6.685e-02  5.108e-02   1.309 0.190672    
## NeighbourhoodHougang                 -1.279e-01  8.398e-02  -1.523 0.127761    
## NeighbourhoodJurong East              5.497e-01  2.727e-01   2.016 0.043899 *  
## NeighbourhoodJurong West              4.044e-01  2.724e-01   1.484 0.137818    
## NeighbourhoodKallang                  8.707e-02  5.128e-02   1.698 0.089617 .  
## NeighbourhoodLim Chu Kang            -2.475e-01  3.883e-01  -0.637 0.523967    
## NeighbourhoodMandai                  -5.958e-01  3.860e-01  -1.544 0.122716    
## NeighbourhoodMarine Parade            1.428e-01  6.096e-02   2.342 0.019212 *  
## NeighbourhoodMuseum                   1.739e-01  7.920e-02   2.196 0.028140 *  
## NeighbourhoodNewton                   1.928e-01  6.496e-02   2.969 0.003008 ** 
## NeighbourhoodNovena                   1.799e-01  5.309e-02   3.388 0.000709 ***
## NeighbourhoodOrchard                  3.940e-01  7.078e-02   5.567 2.75e-08 ***
## NeighbourhoodOutram                   1.602e-01  5.380e-02   2.978 0.002917 ** 
## NeighbourhoodPasir Ris                1.731e-02  9.315e-02   0.186 0.852587    
## NeighbourhoodPunggol                 -1.910e-01  1.023e-01  -1.867 0.061922 .  
## NeighbourhoodQueenstown               1.266e-01  5.708e-02   2.217 0.026661 *  
## NeighbourhoodRiver Valley             7.634e-02  5.520e-02   1.383 0.166757    
## NeighbourhoodRochor                   1.986e-01  5.326e-02   3.728 0.000195 ***
## NeighbourhoodSembawang               -2.576e-01  1.104e-01  -2.333 0.019698 *  
## NeighbourhoodSengkang                -2.365e-01  9.438e-02  -2.506 0.012249 *  
## NeighbourhoodSerangoon               -2.386e-02  8.939e-02  -0.267 0.789549    
## NeighbourhoodSingapore River          3.332e-01  6.736e-02   4.946 7.86e-07 ***
## NeighbourhoodSouthern Islands         4.113e-01  3.831e-01   1.074 0.283083    
## NeighbourhoodSungei Kadut            -5.566e-01  2.780e-01  -2.002 0.045327 *  
## NeighbourhoodTampines                        NA         NA      NA       NA    
## NeighbourhoodTanglin                  1.926e-01  5.835e-02   3.302 0.000969 ***
## NeighbourhoodToa Payoh                       NA         NA      NA       NA    
## NeighbourhoodWestern Water Catchment         NA         NA      NA       NA    
## NeighbourhoodWoodlands               -2.270e-01  9.904e-02  -2.292 0.021964 *  
## NeighbourhoodYishun                          NA         NA      NA       NA    
## Minimum_night                        -1.548e-03  1.497e-04 -10.345  < 2e-16 ***
## Review_count                         -7.201e-04  2.698e-04  -2.669 0.007633 ** 
## Review_per_month                     -1.779e-02  7.182e-03  -2.477 0.013272 *  
## Host_listing_count                   -2.867e-04  9.854e-05  -2.910 0.003635 ** 
## Day_available_per_year                4.865e-04  4.182e-05  11.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3788 on 4459 degrees of freedom
## Multiple R-squared:  0.4937, Adjusted R-squared:  0.4884 
## F-statistic: 92.53 on 47 and 4459 DF,  p-value: < 2.2e-16

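The second model is an improvement. Adjusted R-squared rises from 0.089 to 0.488. The median residual is now -0.04794, although it is measured on the log scale and so is not directly comparable with the -32.1 (in SGD) from the first model.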
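
Because the response is log(Price), each coefficient acts multiplicatively on price. The sketch below converts two coefficients copied from the summary above into approximate percentage differences relative to the baseline room type, which appears to be Entire home/apt since it has no coefficient of its own. The helper name `pct_change` is hypothetical.

```r
# Coefficients copied from the second model's summary output.
coef_private <- -0.6848   # Room_typePrivate room
coef_shared  <- -0.8246   # Room_typeShared room

# In a log(Price) model, exp(coef) is a multiplicative effect on Price.
pct_change <- function(b) 100 * (exp(b) - 1)

cat(sprintf("Private room vs baseline: %.1f%%\n", pct_change(coef_private)))
cat(sprintf("Shared room vs baseline:  %.1f%%\n", pct_change(coef_shared)))

# Predictions from such a model are on the log scale; exp() maps them back.
# (Assumes a fitted model `second_model` and a data frame `newdata`.)
# predicted_price <- exp(predict(second_model, newdata = newdata))
```

The same back-transformation is needed before computing any error metric in SGD. Otherwise the error is expressed in log units.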

Second linear regression model is plotted.

plot(second_model)
## Warning: not plotting observations with leverage one:
##   1771, 2537, 4468

Normal Q-Q plot for second model looks much better than the first model.

5.2 Backward Stepwise Model with First Linear Regression Model

A backward stepwise model is fitted, starting from the first linear regression model.

backward_first_model <- step(first_model, direction = 'backward')
## Start:  AIC=64076.29
## Price ~ Room_type + Region + Neighbourhood + Minimum_night + 
##     Review_count + Review_per_month + Host_listing_count + Day_available_per_year
## 
## 
## Step:  AIC=64076.29
## Price ~ Room_type + Neighbourhood + Minimum_night + Review_count + 
##     Review_per_month + Host_listing_count + Day_available_per_year
## 
##                          Df Sum of Sq       RSS   AIC
## - Minimum_night           1     17421 571858565 64074
## - Day_available_per_year  1    107260 571948404 64075
## - Review_count            1    128300 571969445 64076
## <none>                                571841144 64076
## - Review_per_month        1    416244 572257388 64078
## - Host_listing_count      1   2060880 573902025 64094
## - Room_type               2  19424357 591265501 64257
## - Neighbourhood          41  36765785 608606929 64340
## 
## Step:  AIC=64074.46
## Price ~ Room_type + Neighbourhood + Review_count + Review_per_month + 
##     Host_listing_count + Day_available_per_year
## 
##                          Df Sum of Sq       RSS   AIC
## - Review_count            1    125861 571984426 64074
## - Day_available_per_year  1    126548 571985113 64074
## <none>                                571858565 64074
## - Review_per_month        1    444011 572302576 64077
## - Host_listing_count      1   2085299 573943864 64093
## - Room_type               2  19455170 591313735 64256
## - Neighbourhood          41  36748742 608607306 64338
## 
## Step:  AIC=64073.68
## Price ~ Room_type + Neighbourhood + Review_per_month + Host_listing_count + 
##     Day_available_per_year
## 
##                          Df Sum of Sq       RSS   AIC
## - Day_available_per_year  1    123708 572108134 64073
## <none>                                571984426 64074
## - Review_per_month        1   1494308 573478734 64086
## - Host_listing_count      1   2064242 574048668 64092
## - Room_type               2  19579523 591563949 64256
## - Neighbourhood          41  36850650 608835076 64338
## 
## Step:  AIC=64072.88
## Price ~ Room_type + Neighbourhood + Review_per_month + Host_listing_count
## 
##                      Df Sum of Sq       RSS   AIC
## <none>                            572108134 64073
## - Review_per_month    1   1550015 573658149 64086
## - Host_listing_count  1   1940915 574049049 64090
## - Room_type           2  19463233 591571367 64254
## - Neighbourhood      41  36935412 609043546 64338
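
The AIC values in the trace above come from `extractAIC`, which for `lm` fits uses n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients. The sketch below checks this on the built-in `mtcars` data, chosen only for illustration. Plugging in the starting values from the trace (RSS = 571841144, with n = 5542 and p = 49 inferred from the first model's degrees of freedom) gives approximately 64076.3, which agrees with the reported 64076.29.

```r
# Built-in dataset used purely for illustration.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
p <- length(coef(fit))  # number of estimated coefficients, including the intercept

# AIC as used by step() for lm fits (defined up to an additive constant).
aic_manual <- n * log(rss / n) + 2 * p
aic_step <- extractAIC(fit)[2]

print(c(manual = aic_manual, extractAIC = aic_step))

# The same formula applied to the starting model in the trace above.
aic_first_model <- 5542 * log(571841144 / 5542) + 2 * 49
print(aic_first_model)
```

At each step, `step` removes the term whose deletion gives the lowest AIC and stops when `<none>` is at the top of the table.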

The results are summarized.

summary(backward_first_model)
## 
## Call:
## lm(formula = Price ~ Room_type + Neighbourhood + Review_per_month + 
##     Host_listing_count, data = airbnb_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1498.7   -74.8   -32.6    17.0  9856.8 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           225.73667   55.13042   4.095 4.29e-05 ***
## Room_typePrivate room                -126.50969   10.07651 -12.555  < 2e-16 ***
## Room_typeShared room                 -181.31937   21.67952  -8.364  < 2e-16 ***
## NeighbourhoodBedok                     37.74554   58.24043   0.648 0.516948    
## NeighbourhoodBishan                    49.31144   74.70493   0.660 0.509228    
## NeighbourhoodBukit Batok              147.92957   72.72496   2.034 0.041990 *  
## NeighbourhoodBukit Merah               -4.91808   57.42986  -0.086 0.931759    
## NeighbourhoodBukit Panjang            446.80519   89.07189   5.016 5.44e-07 ***
## NeighbourhoodBukit Timah               33.39223   63.84018   0.523 0.600954    
## NeighbourhoodCentral Water Catchment   47.17123   84.78358   0.556 0.577979    
## NeighbourhoodChoa Chu Kang            -41.25912   74.25672  -0.556 0.578488    
## NeighbourhoodClementi                  34.56652   66.40561   0.521 0.602711    
## NeighbourhoodDowntown Core             58.90036   58.24778   1.011 0.311964    
## NeighbourhoodGeylang                    5.18932   56.22030   0.092 0.926460    
## NeighbourhoodHougang                   11.51088   65.04173   0.177 0.859533    
## NeighbourhoodJurong East               82.67947   65.24959   1.267 0.205164    
## NeighbourhoodJurong West               -5.38999   63.14389  -0.085 0.931978    
## NeighbourhoodKallang                   44.29926   56.03126   0.791 0.429202    
## NeighbourhoodLim Chu Kang              20.91386  327.82192   0.064 0.949135    
## NeighbourhoodMandai                   -50.53253  234.58201  -0.215 0.829452    
## NeighbourhoodMarina South             323.79501  327.22511   0.990 0.322454    
## NeighbourhoodMarine Parade             -7.18550   62.26071  -0.115 0.908124    
## NeighbourhoodMuseum                    82.18940   71.23612   1.154 0.248649    
## NeighbourhoodNewton                    43.67768   63.83073   0.684 0.493831    
## NeighbourhoodNovena                    15.03921   57.55556   0.261 0.793872    
## NeighbourhoodOrchard                  119.85031   63.51614   1.887 0.059223 .  
## NeighbourhoodOutram                    21.27236   57.46920   0.370 0.711283    
## NeighbourhoodPasir Ris                -13.39659   70.84907  -0.189 0.850032    
## NeighbourhoodPunggol                  -35.96595   81.81073  -0.440 0.660227    
## NeighbourhoodQueenstown                -0.42895   59.57713  -0.007 0.994256    
## NeighbourhoodRiver Valley              12.02683   58.46561   0.206 0.837027    
## NeighbourhoodRochor                    65.95791   57.16818   1.154 0.248652    
## NeighbourhoodSembawang                -26.91649   78.92253  -0.341 0.733079    
## NeighbourhoodSengkang                 -24.23756   72.39202  -0.335 0.737781    
## NeighbourhoodSerangoon                -10.60934   71.73099  -0.148 0.882423    
## NeighbourhoodSingapore River           65.42088   62.26363   1.051 0.293439    
## NeighbourhoodSouthern Islands        1551.66535  108.15045  14.347  < 2e-16 ***
## NeighbourhoodSungei Kadut             -13.24408  234.72362  -0.056 0.955006    
## NeighbourhoodTampines                  10.17235   75.66701   0.134 0.893063    
## NeighbourhoodTanglin                   26.82013   60.40458   0.444 0.657054    
## NeighbourhoodToa Payoh                 38.21343   67.37688   0.567 0.570629    
## NeighbourhoodWestern Water Catchment  -79.68544  170.29240  -0.468 0.639851    
## NeighbourhoodWoodlands                -27.32811   71.11665  -0.384 0.700792    
## NeighbourhoodYishun                    -1.36924   75.15107  -0.018 0.985464    
## Review_per_month                      -15.27120    3.95750  -3.859 0.000115 ***
## Host_listing_count                     -0.33116    0.07669  -4.318 1.60e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 322.6 on 5496 degrees of freedom
## Multiple R-squared:  0.0963, Adjusted R-squared:  0.0889 
## F-statistic: 13.01 on 45 and 5496 DF,  p-value: < 2.2e-16

The backward stepwise model built from the first model also fits poorly: adjusted R-squared is still 0.089 and the median residual is -32.6, where a value near 0 would be expected.

vif(backward_first_model) %>%
  kable() %>%
  kable_styling()
                       GVIF Df GVIF^(1/(2*Df))
Room_type          1.394975  2        1.086780
Neighbourhood      1.450225 41        1.004543
Review_per_month   1.082944  1        1.040646
Host_listing_count 1.311297  1        1.145119
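
For a term with one degree of freedom, the GVIF reduces to the ordinary variance inflation factor, 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch below computes it by hand on the built-in `mtcars` data, used only for illustration. The name `vif_manual` is hypothetical.

```r
# Built-in dataset used purely for illustration.
predictors <- mtcars[, c("wt", "hp", "qsec")]

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 3))
```

Values close to 1, like those in the table above, indicate little multicollinearity among the retained predictors.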

RMSE of backward stepwise model using first linear regression model is calculated.

rmse_first_model <- sqrt(mean((residuals(backward_first_model)^2))) 
print(rmse_first_model)
## [1] 321.2964

The RMSE of the backward stepwise model built from the first model is 321.2964 (computed from the training residuals), which is also poor; a well-fitting model would have an RMSE much closer to 0.
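
The RMSE above is computed from the residuals of the training data. A held-out estimate uses predictions on rows the model has not seen. The sketch below shows both on made-up data. The helper `rmse` and the data frame `toy` are hypothetical.

```r
set.seed(42)

# Made-up data: price depends on size plus noise.
toy <- data.frame(size = runif(200, 10, 100))
toy$Price <- 50 + 3 * toy$size + rnorm(200, sd = 20)

train_idx <- sample(seq_len(nrow(toy)), size = 140)
fit <- lm(Price ~ size, data = toy[train_idx, ])

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(toy$Price[train_idx], fitted(fit))
rmse_test <- rmse(toy$Price[-train_idx], predict(fit, newdata = toy[-train_idx, ]))

print(c(train = rmse_train, test = rmse_test))
```

For a model fitted on log(Price), the predictions would need to be passed through `exp()` first so that the RMSE is expressed in SGD.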

5.3 Backward Stepwise Model with Second Linear Regression Model

A backward stepwise model is fitted, starting from the second linear regression model.

backward_second_model <- step(second_model, direction = 'backward')
## Start:  AIC=-8702.5
## log(Price) ~ Room_type + Region + Neighbourhood + Minimum_night + 
##     Review_count + Review_per_month + Host_listing_count + Day_available_per_year
## 
## 
## Step:  AIC=-8702.5
## log(Price) ~ Room_type + Neighbourhood + Minimum_night + Review_count + 
##     Review_per_month + Host_listing_count + Day_available_per_year
## 
##                          Df Sum of Sq     RSS     AIC
## <none>                                 639.83 -8702.5
## - Review_per_month        1      0.88  640.71 -8698.3
## - Review_count            1      1.02  640.85 -8697.3
## - Host_listing_count      1      1.21  641.04 -8696.0
## - Minimum_night           1     15.36  655.19 -8597.6
## - Day_available_per_year  1     19.41  659.24 -8569.8
## - Neighbourhood          40     42.35  682.18 -8493.6
## - Room_type               2    410.70 1050.53 -6471.7

The results are summarized.

summary(backward_second_model)
## 
## Call:
## lm(formula = log(Price) ~ Room_type + Neighbourhood + Minimum_night + 
##     Review_count + Review_per_month + Host_listing_count + Day_available_per_year, 
##     data = airbnb_train_without_outlier)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24213 -0.29154 -0.04794  0.26085  1.42413 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           5.032e+00  6.804e-02  73.958  < 2e-16 ***
## Room_typePrivate room                -6.848e-01  1.300e-02 -52.677  < 2e-16 ***
## Room_typeShared room                 -8.246e-01  4.524e-02 -18.229  < 2e-16 ***
## NeighbourhoodBedok                    9.474e-02  7.175e-02   1.320  0.18676    
## NeighbourhoodBishan                  -3.231e-02  9.636e-02  -0.335  0.73738    
## NeighbourhoodBukit Batok              2.268e-03  9.092e-02   0.025  0.98010    
## NeighbourhoodBukit Merah             -8.543e-02  7.065e-02  -1.209  0.22663    
## NeighbourhoodBukit Panjang           -1.475e-01  1.246e-01  -1.183  0.23681    
## NeighbourhoodBukit Timah              1.354e-02  8.036e-02   0.168  0.86620    
## NeighbourhoodCentral Water Catchment -1.574e-02  1.053e-01  -0.149  0.88119    
## NeighbourhoodChoa Chu Kang           -2.518e-01  9.550e-02  -2.637  0.00839 ** 
## NeighbourhoodClementi                -4.976e-02  8.620e-02  -0.577  0.56380    
## NeighbourhoodDowntown Core            2.182e-01  7.206e-02   3.028  0.00248 ** 
## NeighbourhoodGeylang                 -3.650e-02  6.916e-02  -0.528  0.59772    
## NeighbourhoodHougang                 -1.279e-01  8.398e-02  -1.523  0.12776    
## NeighbourhoodJurong East              5.459e-03  8.101e-02   0.067  0.94628    
## NeighbourhoodJurong West             -1.398e-01  7.942e-02  -1.761  0.07834 .  
## NeighbourhoodKallang                 -1.628e-02  6.921e-02  -0.235  0.81400    
## NeighbourhoodLim Chu Kang            -2.071e-01  3.871e-01  -0.535  0.59275    
## NeighbourhoodMandai                  -5.554e-01  3.849e-01  -1.443  0.14903    
## NeighbourhoodMarine Parade            3.943e-02  7.662e-02   0.515  0.60683    
## NeighbourhoodMuseum                   7.059e-02  9.188e-02   0.768  0.44240    
## NeighbourhoodNewton                   8.948e-02  7.969e-02   1.123  0.26155    
## NeighbourhoodNovena                   7.654e-02  7.070e-02   1.083  0.27905    
## NeighbourhoodOrchard                  2.907e-01  8.442e-02   3.443  0.00058 ***
## NeighbourhoodOutram                   5.685e-02  7.100e-02   0.801  0.42333    
## NeighbourhoodPasir Ris               -1.154e-02  8.959e-02  -0.129  0.89750    
## NeighbourhoodPunggol                 -1.910e-01  1.023e-01  -1.867  0.06192 .  
## NeighbourhoodQueenstown               2.321e-02  7.344e-02   0.316  0.75202    
## NeighbourhoodRiver Valley            -2.701e-02  7.207e-02  -0.375  0.70783    
## NeighbourhoodRochor                   9.522e-02  7.067e-02   1.347  0.17795    
## NeighbourhoodSembawang               -2.172e-01  1.066e-01  -2.037  0.04172 *  
## NeighbourhoodSengkang                -2.365e-01  9.438e-02  -2.506  0.01225 *  
## NeighbourhoodSerangoon               -2.386e-02  8.939e-02  -0.267  0.78955    
## NeighbourhoodSingapore River          2.298e-01  8.171e-02   2.813  0.00494 ** 
## NeighbourhoodSouthern Islands         3.079e-01  3.859e-01   0.798  0.42500    
## NeighbourhoodSungei Kadut            -5.162e-01  2.766e-01  -1.866  0.06208 .  
## NeighbourhoodTampines                -2.885e-02  9.849e-02  -0.293  0.76958    
## NeighbourhoodTanglin                  8.930e-02  7.448e-02   1.199  0.23059    
## NeighbourhoodToa Payoh               -1.033e-01  8.285e-02  -1.247  0.21232    
## NeighbourhoodWestern Water Catchment -5.442e-01  2.772e-01  -1.963  0.04966 *  
## NeighbourhoodWoodlands               -1.866e-01  9.475e-02  -1.969  0.04898 *  
## NeighbourhoodYishun                   4.040e-02  9.914e-02   0.407  0.68370    
## Minimum_night                        -1.548e-03  1.497e-04 -10.345  < 2e-16 ***
## Review_count                         -7.201e-04  2.698e-04  -2.669  0.00763 ** 
## Review_per_month                     -1.779e-02  7.182e-03  -2.477  0.01327 *  
## Host_listing_count                   -2.867e-04  9.854e-05  -2.910  0.00363 ** 
## Day_available_per_year                4.865e-04  4.182e-05  11.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3788 on 4459 degrees of freedom
## Multiple R-squared:  0.4937, Adjusted R-squared:  0.4884 
## F-statistic: 92.53 on 47 and 4459 DF,  p-value: < 2.2e-16

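The backward stepwise model built from the second model is an improvement. The median residual is now -0.04794, compared with -32.6 for the first model. The two values are not on the same scale, however: the second model's residuals are in log(Price) units, whereas the first model's are in price units, so the comparison is only indicative.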
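
Because the response is log(Price), the coefficients in the summary are easier to read as percentage changes. The sketch below converts a few of them with exp(beta) - 1. The coefficient values are copied from the output above, and the labels are only descriptive; the baseline neighbourhood is whichever level R dropped and is not named in the output.

```r
# Coefficients copied from the summary output (log(Price) scale)
coefs <- c(
  "Private room vs Entire home/apt"   = -0.6848,
  "Shared room vs Entire home/apt"    = -0.8246,
  "Orchard vs baseline neighbourhood" =  0.2907
)

# For a dummy variable in a log-linear model, exp(beta) - 1 is the
# relative change in price when the dummy switches from 0 to 1
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 1))
```

Read this way, a private room is priced roughly 50% below an otherwise comparable entire home or apartment.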

Multicollinearity among the predictors is checked with the generalized variance inflation factor (GVIF).

vif(backward_second_model) %>%
  kable() %>%
  kable_styling()

Variable                     GVIF   Df   GVIF^(1/(2*Df))
Room_type                1.364824    2          1.080859
Neighbourhood            1.597313   40          1.005871
Minimum_night            1.130842    1          1.063410
Review_count             2.057790    1          1.434500
Review_per_month         2.098675    1          1.448681
Host_listing_count       1.416358    1          1.190109
Day_available_per_year   1.171249    1          1.082243
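
The last column of the table is derived from the first two: GVIF^(1/(2*Df)) rescales the generalized VIF so that terms with several degrees of freedom (such as the 40 Neighbourhood dummies) can be compared with single-coefficient terms. A minimal sketch that recomputes it from the printed GVIF and Df values follows; the data frame vif_tbl and its column names are illustrative only.

```r
# GVIF and Df copied from the table above
vif_tbl <- data.frame(
  term = c("Room_type", "Neighbourhood", "Minimum_night", "Review_count",
           "Review_per_month", "Host_listing_count", "Day_available_per_year"),
  GVIF = c(1.364824, 1.597313, 1.130842, 2.057790, 2.098675, 1.416358, 1.171249),
  Df   = c(2, 40, 1, 1, 1, 1, 1)
)

# Adjusted GVIF, comparable across terms with different Df
vif_tbl$adjusted <- vif_tbl$GVIF^(1 / (2 * vif_tbl$Df))

# Squaring the adjusted value puts it back on the usual VIF scale
vif_tbl$adjusted_sq <- vif_tbl$adjusted^2

print(vif_tbl)
```

All squared adjusted values stay well below the thresholds of about 5 to 10 that are often quoted as a rule of thumb, which suggests multicollinearity is not a serious concern for this model.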

The RMSE of the backward stepwise model built from the second linear regression model is calculated from its training residuals.

rmse_second_model <- sqrt(mean((residuals(backward_second_model)^2))) 
print(rmse_second_model)
## [1] 0.3767803

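The RMSE of the backward stepwise model built from the second model is 0.3767803, against 321.2964 for the backward stepwise model built from the first model. Lower RMSE indicates a better fit, but the two values are on different scales (log(Price) versus Price), so they cannot be compared directly. The RMSE on the price scale is computed on the testing set in Section 5.4.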
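
To illustrate why RMSE values from a log-scale model and a price-scale model differ so much in magnitude, the sketch below fits both kinds of model to simulated data. Everything in it (the variable x, the simulated prices, the model names) is made up for illustration and is unrelated to the Airbnb data.

```r
set.seed(1)

# Simulated listings: price depends on one predictor with multiplicative noise
n     <- 200
x     <- runif(n)
price <- exp(4 + 1 * x + rnorm(n, sd = 0.4))

fit_raw <- lm(price ~ x)        # response in price units
fit_log <- lm(log(price) ~ x)   # response in log(price) units

rmse <- function(e) sqrt(mean(e^2))

rmse_raw      <- rmse(residuals(fit_raw))            # price units
rmse_log      <- rmse(residuals(fit_log))            # log units
rmse_log_back <- rmse(price - exp(fitted(fit_log)))  # back-transformed to price units

print(c(raw = rmse_raw, log = rmse_log, log_back_transformed = rmse_log_back))
```

Only rmse_raw and rmse_log_back are in the same units, so only those two can be compared fairly.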

5.4 Price Prediction

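Prices for the testing set without outliers are predicted with second_model, the model before backward elimination. Because the model is fitted to log(Price), the predictions are back-transformed with exp() before the RMSE is computed in price units.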

predict_regression <- predict(second_model, newdata = airbnb_test_without_outlier)
## Warning in predict.lm(second_model, newdata = airbnb_test_without_outlier):
## prediction from a rank-deficient fit may be misleading
predict_regression <- exp(predict_regression)
RMSE_regression <- sqrt(mean( (airbnb_test_without_outlier$Price - predict_regression)**2 ))
print(RMSE_regression)
## [1] 50.53446

The sum of squared errors (SSE), i.e. the sum of squared deviations of actual values from predicted values, is calculated.

SSE <- sum((airbnb_test_without_outlier$Price - predict_regression)**2)
print(SSE)
## [1] 4849536

The regression sum of squares (SSR), i.e. the sum of squared deviations of predicted values from the mean observed value, is calculated.

SSR <- sum((predict_regression - mean(airbnb_test_without_outlier$Price)) ** 2)
print(SSR)
## [1] 3825675

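R-squared, a statistical measure of how close the data are to the fitted regression line, is calculated as 1 - SSE/(SSE + SSR). This uses SSE + SSR in place of the total sum of squares, an identity that holds exactly only for a least-squares fit evaluated on its own training data, so the value obtained for back-transformed test predictions should be read as an approximation.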

R2 <- 1 - SSE/(SSE + SSR)
print(R2)
## [1] 0.4409893
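
For comparison, the more common out-of-sample definition divides SSE by the total sum of squares SST = sum((obs - mean(obs))^2). The sketch below computes both versions on five made-up observed and predicted prices, purely to show that the two formulas can disagree when SSE + SSR differs from SST. None of these numbers come from the Airbnb data.

```r
# Hypothetical observed and predicted prices (illustrative only)
obs  <- c(100, 150, 80, 200, 120)
pred <- c(110, 140, 90, 170, 130)

sse <- sum((obs - pred)^2)        # squared prediction errors
ssr <- sum((pred - mean(obs))^2)  # predictions around the observed mean
sst <- sum((obs - mean(obs))^2)   # observations around their mean

r2_sse_ssr <- 1 - sse / (sse + ssr)  # formula used in this section
r2_sst     <- 1 - sse / sst          # conventional definition

print(c(sse = sse, ssr = ssr, sst = sst))
print(c(r2_sse_ssr = r2_sse_ssr, r2_sst = r2_sst))
```

Applying the SST version to airbnb_test_without_outlier$Price and predict_regression would be a useful cross-check on the value reported above.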

A scatter plot of observed against predicted prices, faceted by room type, is drawn on log-log axes.

regression_results <- tibble(
  obs = airbnb_test_without_outlier$Price,
  pred = predict_regression,
  diff = pred - obs,
  abs_diff = abs(pred - obs),
  type = airbnb_test_without_outlier$Room_type)
regression_plot <- regression_results %>% 
  ggplot(aes(obs, pred)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("Observed vs predicted",
          subtitle = "Linear regression model") + 
  geom_abline(slope = 1, intercept = 0, color = "blue", linetype = 2)  +
  facet_wrap(~type)
ggplotly(regression_plot)
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Part 6: Machine Learning Model: Classification

6.1 Data Partition

The data are split into a training set (67%) and a testing set (33%), stratified by Room_type.

df_cor2 <- df2 %>%
  select(Price, Latitude, Longitude, Minimum_night, Review_count,
         Review_per_month, Host_listing_count, Day_available_per_year,
         Room_type, Region)
df_cor2[c("Room_type","Region")] <- map(df_cor2[c("Room_type","Region")], as.factor)
intrain <- createDataPartition(y = df_cor2$Room_type, p = 0.67, list = FALSE)
training <- df_cor2[intrain,]
testing <- df_cor2[-intrain,]
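
In the code as shown, set.seed() is only called in Section 6.2, so unless a seed was set earlier the partition may differ between runs. The sketch below shows the idea of a seeded, stratified 67/33 split in base R on made-up class labels. It is a simplified stand-in and not the algorithm used by createDataPartition(); the label counts are hypothetical.

```r
set.seed(12345)  # set before sampling so the partition can be reproduced

# Toy class labels standing in for df_cor2$Room_type (hypothetical counts)
room_type <- factor(rep(c("Entire home/apt", "Private room", "Shared room"),
                        times = c(52, 43, 5)))

# Sample 67% of the row indices within each class
idx_by_class <- split(seq_along(room_type), room_type)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.67 * length(idx)))]
}), use.names = FALSE)

train_labels <- room_type[train_idx]
test_labels  <- room_type[-train_idx]

print(table(train_labels))
print(table(test_labels))
```

Sampling within each class keeps the rare Shared room class represented in both sets.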

6.2 Classification tree with rpart()

set.seed(12345)
# Training with classification tree model
airbnb.rpart <- rpart(Room_type ~ ., data=training, method="class")
print(airbnb.rpart, digits = 3)
## n= 5298 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 5298 2530 Entire home/apt (0.52246 0.42771 0.04983)  
##    2) Price>=100 3054  538 Entire home/apt (0.82384 0.16896 0.00720) *
##    3) Price< 100 2244  494 Private room (0.11230 0.77986 0.10784)  
##      6) Price>=32.5 2015  362 Private room (0.12407 0.82035 0.05558)  
##       12) Price>=81.5 556  213 Private room (0.36691 0.61691 0.01619)  
##         24) Host_listing_count>=56 107   25 Entire home/apt (0.76636 0.23364 0.00000) *
##         25) Host_listing_count< 56 449  131 Private room (0.27171 0.70824 0.02004) *
##       13) Price< 81.5 1459  149 Private room (0.03153 0.89788 0.07060) *
##      7) Price< 32.5 229   99 Shared room (0.00873 0.42358 0.56769)  
##       14) Host_listing_count< 3.5 108   17 Private room (0.01852 0.84259 0.13889) *
##       15) Host_listing_count>=3.5 121    6 Shared room (0.00000 0.04959 0.95041) *
printcp(airbnb.rpart) # display the results
## 
## Classification tree:
## rpart(formula = Room_type ~ ., data = training, method = "class")
## 
## Variables actually used in tree construction:
## [1] Host_listing_count Price             
## 
## Root node error: 2530/5298 = 0.47754
## 
## n= 5298 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.592095      0   1.00000 1.00000 0.014370
## 2 0.021542      1   0.40791 0.41028 0.011419
## 3 0.011265      3   0.36482 0.36798 0.010949
## 4 0.010000      5   0.34229 0.33399 0.010534
plotcp(airbnb.rpart) # visualize cross-validation results
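
The rel error and xerror columns printed by printcp() are expressed relative to the root node error, so multiplying them by 2530/5298 gives absolute misclassification rates. The sketch below does this for the table above and picks the row with the lowest cross-validated error; the data frame cp_table is simply a hand-typed copy of the printed values.

```r
root_error <- 2530 / 5298  # root node error from printcp()

# Values copied from the printcp() output
cp_table <- data.frame(
  CP        = c(0.592095, 0.021542, 0.011265, 0.010000),
  nsplit    = c(0, 1, 3, 5),
  rel_error = c(1.00000, 0.40791, 0.36482, 0.34229),
  xerror    = c(1.00000, 0.41028, 0.36798, 0.33399)
)

# Convert relative errors to absolute misclassification rates
cp_table$train_error <- cp_table$rel_error * root_error
cp_table$cv_error    <- cp_table$xerror * root_error

# Row with the smallest cross-validated error
best <- cp_table[which.min(cp_table$xerror), ]

print(round(cp_table, 4))
print(best)
```

The cross-validated error is still falling at 5 splits, so the default cp of 0.01 does not appear to overfit here; a smaller cp might be worth trying.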

summary(airbnb.rpart) 
## Call:
## rpart(formula = Room_type ~ ., data = training, method = "class")
##   n= 5298 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.59209486      0 1.0000000 1.0000000 0.01437033
## 2 0.02154150      1 0.4079051 0.4102767 0.01141897
## 3 0.01126482      3 0.3648221 0.3679842 0.01094939
## 4 0.01000000      5 0.3422925 0.3339921 0.01053363
## 
## Variable importance
##                  Price     Host_listing_count               Latitude 
##                     56                     12                     10 
##                 Region          Minimum_night              Longitude 
##                      9                      6                      4 
## Day_available_per_year           Review_count 
##                      2                      1 
## 
## Node number 1: 5298 observations,    complexity param=0.5920949
##   predicted class=Entire home/apt  expected loss=0.4775387  P(node) =1
##     class counts:  2768  2266   264
##    probabilities: 0.522 0.428 0.050 
##   left son=2 (3054 obs) right son=3 (2244 obs)
##   Primary splits:
##       Price              < 100.5    to the right, improve=1150.7490, (0 missing)
##       Host_listing_count < 14.5     to the right, improve= 324.1613, (0 missing)
##       Region             splits as  LRRRR,        improve= 234.1491, (0 missing)
##       Latitude           < 1.33698  to the left,  improve= 218.4871, (0 missing)
##       Minimum_night      < 1.5      to the right, improve= 192.1937, (0 missing)
##   Surrogate splits:
##       Latitude           < 1.33698  to the left,  agree=0.648, adj=0.168, (0 split)
##       Region             splits as  LRRRR,        agree=0.644, adj=0.160, (0 split)
##       Host_listing_count < 4.5      to the right, agree=0.641, adj=0.153, (0 split)
##       Longitude          < 103.8035 to the right, agree=0.611, adj=0.081, (0 split)
##       Minimum_night      < 89.5     to the left,  agree=0.611, adj=0.081, (0 split)
## 
## Node number 2: 3054 observations
##   predicted class=Entire home/apt  expected loss=0.1761624  P(node) =0.5764439
##     class counts:  2516   516    22
##    probabilities: 0.824 0.169 0.007 
## 
## Node number 3: 2244 observations,    complexity param=0.0215415
##   predicted class=Private room     expected loss=0.2201426  P(node) =0.4235561
##     class counts:   252  1750   242
##    probabilities: 0.112 0.780 0.108 
##   left son=6 (2015 obs) right son=7 (229 obs)
##   Primary splits:
##       Price              < 32.5     to the right, improve=89.03290, (0 missing)
##       Minimum_night      < 1.5      to the right, improve=41.12249, (0 missing)
##       Host_listing_count < 6.5      to the left,  improve=31.10437, (0 missing)
##       Region             splits as  RLLLL,        improve=27.86415, (0 missing)
##       Latitude           < 1.31664  to the right, improve=27.36096, (0 missing)
## 
## Node number 6: 2015 observations,    complexity param=0.01126482
##   predicted class=Private room     expected loss=0.1796526  P(node) =0.3803322
##     class counts:   250  1653   112
##    probabilities: 0.124 0.820 0.056 
##   left son=12 (556 obs) right son=13 (1459 obs)
##   Primary splits:
##       Price                  < 81.5     to the right, improve=78.25492, (0 missing)
##       Host_listing_count     < 61.5     to the right, improve=25.73927, (0 missing)
##       Region                 splits as  LRRLR,        improve=13.52241, (0 missing)
##       Longitude              < 103.8442 to the right, improve=12.68967, (0 missing)
##       Day_available_per_year < 84.5     to the left,  improve=12.55265, (0 missing)
##   Surrogate splits:
##       Host_listing_count < 112.5    to the right, agree=0.729, adj=0.018, (0 split)
##       Longitude          < 103.9716 to the right, agree=0.726, adj=0.005, (0 split)
##       Latitude           < 1.448945 to the right, agree=0.725, adj=0.002, (0 split)
##       Minimum_night      < 300      to the right, agree=0.725, adj=0.002, (0 split)
## 
## Node number 7: 229 observations,    complexity param=0.0215415
##   predicted class=Shared room      expected loss=0.4323144  P(node) =0.04322386
##     class counts:     2    97   130
##    probabilities: 0.009 0.424 0.568 
##   left son=14 (108 obs) right son=15 (121 obs)
##   Primary splits:
##       Host_listing_count     < 3.5      to the left,  improve=73.48741, (0 missing)
##       Latitude               < 1.31625  to the right, improve=54.17262, (0 missing)
##       Minimum_night          < 1.5      to the right, improve=53.14828, (0 missing)
##       Region                 splits as  RLLLL,        improve=33.72940, (0 missing)
##       Day_available_per_year < 124      to the left,  improve=21.76556, (0 missing)
##   Surrogate splits:
##       Latitude               < 1.314795 to the right, agree=0.825, adj=0.630, (0 split)
##       Minimum_night          < 1.5      to the right, agree=0.795, adj=0.565, (0 split)
##       Day_available_per_year < 40.5     to the left,  agree=0.777, adj=0.528, (0 split)
##       Region                 splits as  RLLLL,        agree=0.738, adj=0.444, (0 split)
##       Review_count           < 1.5      to the left,  agree=0.690, adj=0.343, (0 split)
## 
## Node number 12: 556 observations,    complexity param=0.01126482
##   predicted class=Private room     expected loss=0.3830935  P(node) =0.1049453
##     class counts:   204   343     9
##    probabilities: 0.367 0.617 0.016 
##   left son=24 (107 obs) right son=25 (449 obs)
##   Primary splits:
##       Host_listing_count     < 56       to the right, improve=40.63883, (0 missing)
##       Day_available_per_year < 85       to the left,  improve=38.51441, (0 missing)
##       Longitude              < 103.885  to the right, improve=25.20401, (0 missing)
##       Minimum_night          < 2.5      to the right, improve=22.96243, (0 missing)
##       Latitude               < 1.312175 to the right, improve=18.74820, (0 missing)
## 
## Node number 13: 1459 observations
##   predicted class=Private room     expected loss=0.1021247  P(node) =0.2753869
##     class counts:    46  1310   103
##    probabilities: 0.032 0.898 0.071 
## 
## Node number 14: 108 observations
##   predicted class=Private room     expected loss=0.1574074  P(node) =0.02038505
##     class counts:     2    91    15
##    probabilities: 0.019 0.843 0.139 
## 
## Node number 15: 121 observations
##   predicted class=Shared room      expected loss=0.04958678  P(node) =0.02283881
##     class counts:     0     6   115
##    probabilities: 0.000 0.050 0.950 
## 
## Node number 24: 107 observations
##   predicted class=Entire home/apt  expected loss=0.2336449  P(node) =0.0201963
##     class counts:    82    25     0
##    probabilities: 0.766 0.234 0.000 
## 
## Node number 25: 449 observations
##   predicted class=Private room     expected loss=0.2917595  P(node) =0.08474896
##     class counts:   122   318     9
##    probabilities: 0.272 0.708 0.020
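
The fitted tree uses only Price and Host_listing_count, so its decision logic can be written out as a handful of nested rules. The function below is a hand transcription of the splits in the summary output (with the root threshold 100.5 taken from the summary rather than the rounded print); classify_room is a hypothetical helper for illustration, and predict() on airbnb.rpart remains the authoritative way to obtain predictions.

```r
# Hand-written version of the printed tree (terminal nodes 2, 24, 25, 13, 14, 15)
classify_room <- function(price, host_listing_count) {
  if (price >= 100.5) {
    return("Entire home/apt")                 # node 2
  }
  if (price >= 32.5) {
    if (price >= 81.5 && host_listing_count >= 56) {
      return("Entire home/apt")               # node 24
    }
    return("Private room")                    # nodes 25 and 13
  }
  if (host_listing_count < 3.5) {
    return("Private room")                    # node 14
  }
  "Shared room"                               # node 15
}

print(classify_room(price = 150, host_listing_count = 1))
print(classify_room(price = 90,  host_listing_count = 60))
print(classify_room(price = 20,  host_listing_count = 10))
```

Written this way, it is clear that the tree only predicts Shared room for very cheap listings from hosts with several listings, which is consistent with its low sensitivity for that class.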

6.3 Prediction and Evaluation

# Predict the testing dataset with the trained model 
predictions1 <- predict(airbnb.rpart, testing, type = "class")

# Evaluation: Accuracy and other metrics
confusionMatrix(predictions1, testing$Room_type)
## Confusion Matrix and Statistics
## 
##                  Reference
## Prediction        Entire home/apt Private room Shared room
##   Entire home/apt            1293          255          13
##   Private room                 70          848          62
##   Shared room                   0           12          55
## 
## Overall Statistics
##                                           
##                Accuracy : 0.842           
##                  95% CI : (0.8275, 0.8558)
##     No Information Rate : 0.5226          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6992          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: Entire home/apt Class: Private room
## Sensitivity                          0.9486              0.7605
## Specificity                          0.7847              0.9116
## Pos Pred Value                       0.8283              0.8653
## Neg Pred Value                       0.9331              0.8360
## Prevalence                           0.5226              0.4275
## Detection Rate                       0.4958              0.3252
## Detection Prevalence                 0.5985              0.3758
## Balanced Accuracy                    0.8667              0.8361
##                      Class: Shared room
## Sensitivity                     0.42308
## Specificity                     0.99516
## Pos Pred Value                  0.82090
## Neg Pred Value                  0.97048
## Prevalence                      0.04985
## Detection Rate                  0.02109
## Detection Prevalence            0.02569
## Balanced Accuracy               0.70912

Overall accuracy for this classification tree is 0.842. Sensitivity is high for Entire home/apt (0.9486) but low for Shared room (0.4231).
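
The headline statistics reported by confusionMatrix() can be recomputed directly from the counts, which is a useful sanity check when quoting them. The sketch below rebuilds the table above as a plain matrix and derives accuracy, per-class sensitivity and Cohen's kappa; the object names are illustrative.

```r
classes <- c("Entire home/apt", "Private room", "Shared room")

# Counts copied from the confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(1293, 255, 13,
                 70, 848, 62,
                  0,  12, 55),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n        # share of correct predictions
sensitivity <- diag(cm) / colSums(cm)   # recall per reference class

# Cohen's kappa: agreement beyond what the marginal totals alone would give
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(cohen_kappa, 4))
```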

6.4 Random Forest

set.seed(12345)
# Training the data using Random forest model
airbnb.rf <- randomForest(Room_type ~. , data=training, importance = TRUE)

airbnb.rf
## 
## Call:
##  randomForest(formula = Room_type ~ ., data = training, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 9.32%
## Confusion matrix:
##                 Entire home/apt Private room Shared room class.error
## Entire home/apt            2577          190           1  0.06900289
## Private room                207         2046          13  0.09708738
## Shared room                   6           77         181  0.31439394
# Predict the testing dataset with the trained model
predictions2 <- predict(airbnb.rf, testing, type = "class")

# Evaluation: Accuracy and other metrics
confusionMatrix(predictions2, testing$Room_type)
## Confusion Matrix and Statistics
## 
##                  Reference
## Prediction        Entire home/apt Private room Shared room
##   Entire home/apt            1284          114           5
##   Private room                 78          997          33
##   Shared room                   1            4          92
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9099          
##                  95% CI : (0.8982, 0.9206)
##     No Information Rate : 0.5226          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8317          
##                                           
##  Mcnemar's Test P-Value : 4.875e-07       
## 
## Statistics by Class:
## 
##                      Class: Entire home/apt Class: Private room
## Sensitivity                          0.9420              0.8942
## Specificity                          0.9044              0.9257
## Pos Pred Value                       0.9152              0.8998
## Neg Pred Value                       0.9344              0.9213
## Prevalence                           0.5226              0.4275
## Detection Rate                       0.4923              0.3823
## Detection Prevalence                 0.5380              0.4248
## Balanced Accuracy                    0.9232              0.9099
##                      Class: Shared room
## Sensitivity                     0.70769
## Specificity                     0.99798
## Pos Pred Value                  0.94845
## Neg Pred Value                  0.98487
## Prevalence                      0.04985
## Detection Rate                  0.03528
## Detection Prevalence            0.03719
## Balanced Accuracy               0.85284

The forest consists of 500 trees, with 3 variables tried at each split, and its out-of-bag error estimate is 9.32%.
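
The value of 3 variables per split is not set in the call; it matches the randomForest default for classification, floor(sqrt(p)), with the p = 9 predictors left in df_cor2 after Room_type is taken as the response. A quick check:

```r
# Predictors available to the forest (all columns of df_cor2 except Room_type)
predictors <- c("Price", "Latitude", "Longitude", "Minimum_night", "Review_count",
                "Review_per_month", "Host_listing_count",
                "Day_available_per_year", "Region")

# Default number of variables tried at each split for classification
default_mtry <- floor(sqrt(length(predictors)))
print(default_mtry)
```

Both settings can be made explicit by passing ntree = 500 and mtry = 3 to randomForest(), which makes later tuning experiments easier to compare.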

important <- importance(airbnb.rf, type=1 ) 
Important_Features <- data.frame(Feature = row.names(important), Importance = important[, 1])

plot_imp <- ggplot(Important_Features,
                   aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_light(base_size = 13) +
  xlab("") +
  ylab("Importance") +
  ggtitle("Important Features in Random Forest Model for\n Singapore airbnb data") +
  theme(plot.title = element_text(size = 13))

plot_imp

The accuracy of the random forest model on the testing set is 0.9099.
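
To put the two classifiers side by side, the sketch below rebuilds both test-set confusion matrices from the counts printed in Sections 6.3 and 6.4 and compares accuracy and per-class sensitivity. The helper functions make_cm and metrics are made up for this illustration.

```r
classes <- c("Entire home/apt", "Private room", "Shared room")

# Build a 3x3 confusion matrix (rows = prediction, columns = reference)
make_cm <- function(values) {
  matrix(values, nrow = 3, byrow = TRUE,
         dimnames = list(Prediction = classes, Reference = classes))
}

cm_tree <- make_cm(c(1293, 255, 13,   70, 848, 62,   0, 12, 55))
cm_rf   <- make_cm(c(1284, 114,  5,   78, 997, 33,   1,  4, 92))

# Accuracy plus sensitivity for each reference class
metrics <- function(cm) {
  c(accuracy = sum(diag(cm)) / sum(cm), sensitivity = diag(cm) / colSums(cm))
}

comparison <- rbind(tree = metrics(cm_tree), forest = metrics(cm_rf))
print(round(comparison, 4))
```

The largest gain is in the Shared room class, where sensitivity rises from about 0.42 for the tree to about 0.71 for the forest.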

Part 7: Conclusion

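For the regression of Singapore Airbnb prices, the linear regression model with logarithmic transformation and outlier removal performs better than the model without them. The fit is moderate rather than perfect: the adjusted R-squared is 0.4884 on the training set, and on the testing set the R-squared is 0.441 with an RMSE of about 50.5 in price units.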

For the classification of Singapore Airbnb room type, the random forest model predicts room type more accurately than the classification tree (0.9099 against 0.842 on the testing set). The top 3 most important features for predicting room type in the random forest model are Price, Minimum_night and Host_listing_count. The model can help check that the room type entered by a host is consistent with the rest of the listing information, so that guests can be more confident about their accommodation, which in turn improves customer experience.