Import Libraries

library(tidyverse)
library(ggplot2)
library(caret)
library(corrplot)
library(leaflet)
library(plotly)
library(readr)
library(dplyr)       
library(OneR)
library(e1071)
library(tree)
library(party)
library(rpart)
library(rmarkdown)

Load Data

airbnb <- read.csv("AB_NYC_2019.csv", stringsAsFactors = FALSE, na.strings = c(""))

Data Pre-Processing

The first 6 rows of the data are shown:

head(airbnb)
##     id                                             name host_id   host_name
## 1 2539               Clean & quiet apt home by the park    2787        John
## 2 2595                            Skylit Midtown Castle    2845    Jennifer
## 3 3647              THE VILLAGE OF HARLEM....NEW YORK !    4632   Elisabeth
## 4 3831                  Cozy Entire Floor of Brownstone    4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park    7192       Laura
## 6 5099        Large Cozy 1 BR Apartment In Midtown East    7322       Chris
##   neighbourhood_group neighbourhood latitude longitude       room_type price
## 1            Brooklyn    Kensington 40.64749 -73.97237    Private room   149
## 2           Manhattan       Midtown 40.75362 -73.98377 Entire home/apt   225
## 3           Manhattan        Harlem 40.80902 -73.94190    Private room   150
## 4            Brooklyn  Clinton Hill 40.68514 -73.95976 Entire home/apt    89
## 5           Manhattan   East Harlem 40.79851 -73.94399 Entire home/apt    80
## 6           Manhattan   Murray Hill 40.74767 -73.97500 Entire home/apt   200
##   minimum_nights number_of_reviews last_review reviews_per_month
## 1              1                 9  2018-10-19              0.21
## 2              1                45  2019-05-21              0.38
## 3              3                 0        <NA>                NA
## 4              1               270  2019-07-05              4.64
## 5             10                 9  2018-11-19              0.10
## 6              3                74  2019-06-22              0.59
##   calculated_host_listings_count availability_365
## 1                              6              365
## 2                              2              355
## 3                              1              365
## 4                              1              194
## 5                              1                0
## 6                              1              129

This dataset contains 16 features about Airbnb listings within New York City. Below are the features with their respective descriptions:

Check the number of records in the dataset.

nrow(airbnb)
## [1] 48895

Inspect the data structures. The dataset would need to be cleaned before any conversion of data types can be performed.

str(airbnb)
## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : chr  "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr  "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr  "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr  "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : chr  "2018-10-19" "2019-05-21" NA "2019-07-05" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...

Get the summary of the data.

summary(airbnb)
##        id               name              host_id           host_name        
##  Min.   :    2539   Length:48895       Min.   :     2438   Length:48895      
##  1st Qu.: 9471945   Class :character   1st Qu.:  7822033   Class :character  
##  Median :19677284   Mode  :character   Median : 30793816   Mode  :character  
##  Mean   :19017143                      Mean   : 67620011                     
##  3rd Qu.:29152178                      3rd Qu.:107434423                     
##  Max.   :36487245                      Max.   :274321313                     
##                                                                              
##  neighbourhood_group neighbourhood         latitude       longitude     
##  Length:48895        Length:48895       Min.   :40.50   Min.   :-74.24  
##  Class :character    Class :character   1st Qu.:40.69   1st Qu.:-73.98  
##  Mode  :character    Mode  :character   Median :40.72   Median :-73.96  
##                                         Mean   :40.73   Mean   :-73.95  
##                                         3rd Qu.:40.76   3rd Qu.:-73.94  
##                                         Max.   :40.91   Max.   :-73.71  
##                                                                         
##   room_type             price         minimum_nights    number_of_reviews
##  Length:48895       Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
##  Class :character   1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00   
##  Mode  :character   Median :  106.0   Median :   3.00   Median :  5.00   
##                     Mean   :  152.7   Mean   :   7.03   Mean   : 23.27   
##                     3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00   
##                     Max.   :10000.0   Max.   :1250.00   Max.   :629.00   
##                                                                          
##  last_review        reviews_per_month calculated_host_listings_count
##  Length:48895       Min.   : 0.010    Min.   :  1.000               
##  Class :character   1st Qu.: 0.190    1st Qu.:  1.000               
##  Mode  :character   Median : 0.720    Median :  1.000               
##                     Mean   : 1.373    Mean   :  7.144               
##                     3rd Qu.: 2.020    3rd Qu.:  2.000               
##                     Max.   :58.500    Max.   :327.000               
##                     NA's   :10052                                   
##  availability_365
##  Min.   :  0.0   
##  1st Qu.:  0.0   
##  Median : 45.0   
##  Mean   :112.8   
##  3rd Qu.:227.0   
##  Max.   :365.0   
## 

Describe the data. It is observed that the price has a minimum value of 0 USD which is not possible. We will remove it after wards.

psych::describe(airbnb)
##                                vars     n        mean          sd      median
## id                                1 48895 19017143.24 10983108.39 19677284.00
## name*                             2 48879    23969.07    13814.77    23938.00
## host_id                           3 48895 67620010.65 78610967.03 30793816.00
## host_name*                        4 48874     5460.42     3232.22     5333.00
## neighbourhood_group*              5 48895        2.68        0.74        3.00
## neighbourhood*                    6 48895      108.11       68.74       95.00
## latitude                          7 48895       40.73        0.05       40.72
## longitude                         8 48895      -73.95        0.05      -73.96
## room_type*                        9 48895        1.50        0.55        1.00
## price                            10 48895      152.72      240.15      106.00
## minimum_nights                   11 48895        7.03       20.51        3.00
## number_of_reviews                12 48895       23.27       44.55        5.00
## last_review*                     13 38843     1491.10      400.67     1714.00
## reviews_per_month                14 38843        1.37        1.68        0.72
## calculated_host_listings_count   15 48895        7.14       32.95        1.00
## availability_365                 16 48895      112.78      131.62       45.00
##                                    trimmed         mad     min          max
## id                             19188061.30 14689959.59 2539.00  36487245.00
## name*                             23967.98    17712.62    1.00     47905.00
## host_id                        54170438.55 40836605.41 2438.00 274321313.00
## host_name*                         5433.36     4209.84    1.00     11452.00
## neighbourhood_group*                  2.61        1.48    1.00         5.00
## neighbourhood*                      106.81       88.96    1.00       221.00
## latitude                             40.73        0.05   40.50        40.91
## longitude                           -73.96        0.04  -74.24       -73.71
## room_type*                            1.48        0.00    1.00         3.00
## price                               121.43       68.20    0.00     10000.00
## minimum_nights                        3.58        2.97    1.00      1250.00
## number_of_reviews                    12.45        7.41    0.00       629.00
## last_review*                       1579.38       65.23    1.00      1764.00
## reviews_per_month                     1.06        0.92    0.01        58.50
## calculated_host_listings_count        1.50        0.00    1.00       327.00
## availability_365                     96.50       66.72    0.00       365.00
##                                       range  skew kurtosis        se
## id                              36484706.00 -0.09    -1.23  49669.87
## name*                              47904.00  0.00    -1.20     62.49
## host_id                        274318875.00  1.21     0.17 355509.26
## host_name*                         11451.00  0.05    -1.16     14.62
## neighbourhood_group*                   4.00  0.37    -0.11      0.00
## neighbourhood*                       220.00  0.26    -1.26      0.31
## latitude                               0.41  0.24     0.15      0.00
## longitude                              0.53  1.28     5.02      0.00
## room_type*                             2.00  0.42    -0.97      0.00
## price                              10000.00 19.12   585.59      1.09
## minimum_nights                      1249.00 21.83   853.95      0.09
## number_of_reviews                    629.00  3.69    19.53      0.20
## last_review*                        1763.00 -1.66     1.71      2.03
## reviews_per_month                     58.49  3.13    42.49      0.01
## calculated_host_listings_count       326.00  7.93    67.54      0.15
## availability_365                     365.00  0.76    -1.00      0.60

Since id, host_id does not give any information, these columns are dropped. The last_review is the date for the last review, which does not affect the price. It is also dropped.

airbnb <- subset(airbnb, select = -c(id, host_id, last_review))

The percentage of missing values is examined.

sapply(airbnb, function(x)sum(is.na(x)/nrow(airbnb)))
##                           name                      host_name 
##                   0.0003272318                   0.0004294918 
##            neighbourhood_group                  neighbourhood 
##                   0.0000000000                   0.0000000000 
##                       latitude                      longitude 
##                   0.0000000000                   0.0000000000 
##                      room_type                          price 
##                   0.0000000000                   0.0000000000 
##                 minimum_nights              number_of_reviews 
##                   0.0000000000                   0.0000000000 
##              reviews_per_month calculated_host_listings_count 
##                   0.2055833930                   0.0000000000 
##               availability_365 
##                   0.0000000000

If there is no entry for review_per_month, it can be treated as no review, so the review_per_month is replaced with 0.

airbnb[["reviews_per_month"]][is.na(airbnb[["reviews_per_month"]])] <- 0
sapply(airbnb, function(x)sum(is.na(x)/nrow(airbnb)))
##                           name                      host_name 
##                   0.0003272318                   0.0004294918 
##            neighbourhood_group                  neighbourhood 
##                   0.0000000000                   0.0000000000 
##                       latitude                      longitude 
##                   0.0000000000                   0.0000000000 
##                      room_type                          price 
##                   0.0000000000                   0.0000000000 
##                 minimum_nights              number_of_reviews 
##                   0.0000000000                   0.0000000000 
##              reviews_per_month calculated_host_listings_count 
##                   0.0000000000                   0.0000000000 
##               availability_365 
##                   0.0000000000

Now, it is observed that the NULL values in the name and the host_name is less than 0.0005%. Therefore, it is safe ro directly omit these record without losing important information.

airbnb <- na.omit(airbnb)
sapply(airbnb, function(x)sum(is.na(x)/nrow(airbnb)))
##                           name                      host_name 
##                              0                              0 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##              reviews_per_month calculated_host_listings_count 
##                              0                              0 
##               availability_365 
##                              0

Price cannot be 0 USD so we will omit the data with price of 0 USD. The price of USD 10000 is also considered as outliers as it is too far away from the mean price. Therefore, the data with price within the top 10% and the bottom 10% are considered as outliers and are removed.

airbnb <- airbnb %>% filter(price > 0)
airbnb <- airbnb %>% filter(price < quantile(airbnb$price, 0.9) & price > quantile(airbnb$price, 0.1))
summary(airbnb)
##      name            host_name         neighbourhood_group neighbourhood     
##  Length:38928       Length:38928       Length:38928        Length:38928      
##  Class :character   Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##     latitude       longitude       room_type             price      
##  Min.   :40.50   Min.   :-74.24   Length:38928       Min.   : 50.0  
##  1st Qu.:40.69   1st Qu.:-73.98   Class :character   1st Qu.: 75.0  
##  Median :40.72   Median :-73.96   Mode  :character   Median :107.0  
##  Mean   :40.73   Mean   :-73.95                      Mean   :121.6  
##  3rd Qu.:40.76   3rd Qu.:-73.94                      3rd Qu.:155.0  
##  Max.   :40.91   Max.   :-73.71                      Max.   :268.0  
##  minimum_nights     number_of_reviews reviews_per_month
##  Min.   :   1.000   Min.   :  0.0     Min.   : 0.000   
##  1st Qu.:   1.000   1st Qu.:  1.0     1st Qu.: 0.050   
##  Median :   2.000   Median :  6.0     Median : 0.390   
##  Mean   :   6.753   Mean   : 24.7     Mean   : 1.115   
##  3rd Qu.:   5.000   3rd Qu.: 26.0     3rd Qu.: 1.650   
##  Max.   :1250.000   Max.   :540.0     Max.   :58.500   
##  calculated_host_listings_count availability_365
##  Min.   :  1.000                Min.   :  0.0   
##  1st Qu.:  1.000                1st Qu.:  0.0   
##  Median :  1.000                Median : 37.0   
##  Mean   :  6.132                Mean   :107.3   
##  3rd Qu.:  2.000                3rd Qu.:208.0   
##  Max.   :327.000                Max.   :365.0

The variables name, host_name, neighbourhood_group, neighbourbood and room_type is changed to factor.

nrow(airbnb)
## [1] 38928
names_to_factor <- c("name", "host_name", "neighbourhood_group", "neighbourhood", "room_type")
airbnb[names_to_factor] <- map(airbnb[names_to_factor], as.factor)

Again, the structure of the dataframe is examined.

str(airbnb)
## 'data.frame':    38928 obs. of  13 variables:
##  $ name                          : Factor w/ 38201 levels "'Fan'tastic",..: 10048 30380 36046 12549 15516 20003 6477 20041 12459 14151 ...
##  $ host_name                     : Factor w/ 9945 levels "'Cil","-TheQueensCornerLot",..: 4369 4187 2566 5429 5183 1696 3118 8417 6012 1080 ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ neighbourhood                 : Factor w/ 219 levels "Allerton","Arden Heights",..: 108 127 94 42 62 137 14 95 202 36 ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ reviews_per_month             : num  0.21 0.38 0 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "na.action")= 'omit' Named int [1:37] 361 2701 2855 3704 5746 5776 5976 6076 6270 6568 ...
##   ..- attr(*, "names")= chr [1:37] "361" "2701" "2855" "3704" ...

Exploratory Data Analysis (EDA)

1. Descriptive Analysis and Diagnostic Analysis

Since the price is the output to be predicted, the distribution of the price will be interesting to examine. It is observed that the plot is very skewed.

ggplot(airbnb, aes(x = price)) + 
  geom_histogram(bins = 40, fill = "steelblue4", colour='black') + 
   ggtitle("Distribution of Price")

The price is observed to be distributed in around USD 50 to USD 130.

Next, the factor that affect the price the most is the main focus. The initial guess is that the number of reviews will be the main factor that affect the price. Therefore, the distribution of price according to number of reviews.

ggplot(data = airbnb, aes(x = number_of_reviews, y = price))+ 
  geom_jitter(color="steelblue4") + 
  ggtitle("Distribution of Price According to Review")

From the Jitter Plot above, it is observed that the more the number of review, the price is lower as the homestay might be old as compared to those with lowest number of reviews. This makes the homestay with higher number of review to be cheaper.

Next, the average price according to the neighbourhood_group is calculated and plotted.

bronx <- subset(airbnb, neighbourhood_group == "Bronx")
bronx_avg <- sum(bronx$price)/nrow(bronx)

brooklyn <- subset(airbnb, neighbourhood_group == "Brooklyn")
brooklyn_avg <- sum(brooklyn$price)/nrow(brooklyn)

manhattan <- subset(airbnb, neighbourhood_group == "Manhattan")
manhattan_avg <- sum(manhattan$price)/nrow(manhattan)

queens <- subset(airbnb, neighbourhood_group == "Queens")
queens_avg <- sum(queens$price)/nrow(queens)

staten <- subset(airbnb, neighbourhood_group == "Staten Island")
staten_avg <- sum(staten$price)/nrow(staten)

price_vs_neighbourhooodGroup <- data.frame(
  "Neighbourhood_Group" = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
  "Average_Price" = c(bronx_avg, brooklyn_avg, manhattan_avg, queens_avg, staten_avg)
)

price_vs_neighbourhooodGroup
##   Neighbourhood_Group Average_Price
## 1               Bronx      90.24022
## 2            Brooklyn     111.51331
## 3           Manhattan     138.49750
## 4              Queens      98.29096
## 5       Staten Island      98.21481
ggplot(data = price_vs_neighbourhooodGroup, aes(x = Neighbourhood_Group, y = Average_Price)) +
  geom_bar(stat="identity", fill = "steelblue4", colour = "black") + 
  ggtitle("Distribution of Price According to Neighbour Group")

It is observed that the average price per night is the highest in Manhattan, followed by Brooklyn, Staten Island, Queens and Bronx.

The average price per night according to room type is also calculated and plotted.

entire <- subset(airbnb, room_type == "Entire home/apt")
entire_avg <- sum(entire$price)/nrow(entire)

private <- subset(airbnb, room_type == "Private room")
private_avg <- sum(private$price)/nrow(private)

shared <- subset(airbnb, room_type == "Shared room")
shared_avg <- sum(shared$price)/nrow(shared)

price_vs_roomType <- data.frame(
  "Room_Type" = c("Entire Home/Apt", "Private Room", "Shared Room"),
  "Average_Price" = c(entire_avg, private_avg, shared_avg)
)

price_vs_roomType
##         Room_Type Average_Price
## 1 Entire Home/Apt     151.76954
## 2    Private Room      86.76991
## 3     Shared Room      86.98239
ggplot(data = price_vs_roomType, aes(x = Room_Type, y = Average_Price)) +
  geom_bar(stat="identity", fill = "steelblue4", colour = "black") + 
  ggtitle("Distribution of Price According to Room Type")

The entire home or apartment is observed to be the most expensive one, followed by the private room and shared room. This make sense as renting entire home or apartment means renting bigger space.

Now, the property type in each neighbourhood is also important to be visualized. It is observed that Manhattan has most entire home/apartment to be rented. This is also why the average price per night in Manhattan is high as compared to other neighbourhood.

ggplot(airbnb) + 
  geom_histogram(aes(neighbourhood_group, fill = room_type), stat = "count", position = 'fill') + 
  theme_minimal(base_size = 13)+ xlab("") + ylab("") + 
  ggtitle("The Proportion of Property Type in Each Area")

The distribution of price per night in each neighbourhood group is inspected.

airbnb_nh <- airbnb %>%
  group_by(neighbourhood_group) %>%
  summarise(price = round(mean(price), 2))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(airbnb, aes(price)) +
  geom_histogram(bins = 30, aes(y = ..density..), fill = "steelblue4") + 
  geom_density(alpha = 0.2, fill = "blue") +
  ggtitle("Transformed distribution of price\n by neighbourhood groups",
          subtitle = expression(~'log'[10] ~ "x-axis")) +
  geom_vline(data = airbnb_nh, aes(xintercept = price), size = 2, linetype = 3) +
  geom_text(data = airbnb_nh,y = 1.5, aes(x = price + 1400, label = paste("Mean  = $",price)), color = "red", size = 3) +
  facet_wrap(~neighbourhood_group) +
  scale_x_log10()

The price distribution based on room type and neighbourhood group is shown as below. It is observed that the price per night for entire home/apt is the highest in Manhattan. In Brooklyn, Queens and Bronx, the price per night for private room is higher as compared to other room types.

airbnb %>% arrange(desc(price)) %>% top_n(10) %>% select(- host_name, -name) %>%  
  ggplot(aes(x = price, fill = neighbourhood_group)) +
  geom_histogram(bins = 50) +
  scale_x_log10() + 
  ggtitle("Summary of Price Distributions") +
  facet_wrap(~room_type + neighbourhood_group)
## Selecting by availability_365

Leaflet map

To further visualize the data, a leaflet map is used. It is observed that the nerighbourhood groups are clearly partitioned. Therefore, the data is cleaned.

 pal <- colorFactor(palette = c("red", "green", "blue", "purple", "yellow"), domain = airbnb$neighbourhood_group)
 
 leaflet(data = airbnb) %>% addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%  addCircleMarkers(~longitude, ~latitude, color = ~pal(neighbourhood_group), weight = 1, radius=1, fillOpacity = 0.1, opacity = 0.1,
                                                                                                        label = paste("Name:", airbnb$name)) %>% 
     addLegend("bottomright", pal = pal, values = ~neighbourhood_group,
     title = "Neighbourhood groups",
     opacity = 1
   )

Correlation Heatmap

A correlation heatmap is used to determine the correlation between the variables. 1 indicates strong positive correlation while -1 indicates strong negative correlation. It is observed that only number of review and review per month has a strong positive correlation as these two variables are fundamentally related to reviews.

airbnb_cor <- airbnb[, sapply(airbnb, is.numeric)]
airbnb_cor <- airbnb_cor[complete.cases(airbnb_cor), ]
correlation_matrix <- cor(airbnb_cor)
corrplot(correlation_matrix, method = "color")

## Train Test Split To perform predictive analysis, the data is split into train and test data with a ratio of 80% : 20%.

set.seed(45)
pd <- sample(2, nrow(airbnb), replace = TRUE, prob = c(0.8,0.2))
train <- airbnb[pd == 1,]#, here means all column 
test <- airbnb[pd == 2,]
cat("Number of train data: ", nrow(train), "\n")
## Number of train data:  31162
cat("Number of test data: ", nrow(test), "\n")
## Number of test data:  7766

Predictive Analysis

The main goal of the analysis is to predict the price per night. A basic linear regression model is used for the first model

lr1 <- train(price ~ latitude + longitude + room_type + minimum_nights  + availability_365 + neighbourhood_group, data = train, method = "lm")
summary(lr1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -118.832  -27.300   -7.002   22.244  195.450 
## 
## Coefficients:
##                                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                        -1.465e+04  7.454e+02  -19.660  < 2e-16 ***
## latitude                           -7.892e+01  7.204e+00  -10.955  < 2e-16 ***
## longitude                          -2.435e+02  8.344e+00  -29.185  < 2e-16 ***
## `room_typePrivate room`            -6.146e+01  4.882e-01 -125.896  < 2e-16 ***
## `room_typeShared room`             -6.957e+01  2.097e+00  -33.173  < 2e-16 ***
## minimum_nights                     -1.461e-01  1.272e-02  -11.486  < 2e-16 ***
## availability_365                    4.605e-02  1.892e-03   24.347  < 2e-16 ***
## neighbourhood_groupBrooklyn        -7.529e+00  2.109e+00   -3.570 0.000358 ***
## neighbourhood_groupManhattan        1.721e+01  1.936e+00    8.888  < 2e-16 ***
## neighbourhood_groupQueens           4.408e+00  2.050e+00    2.150 0.031550 *  
## `neighbourhood_groupStaten Island` -7.131e+01  3.941e+00  -18.096  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.87 on 31151 degrees of freedom
## Multiple R-squared:  0.4312, Adjusted R-squared:  0.431 
## F-statistic:  2362 on 10 and 31151 DF,  p-value: < 2.2e-16

The performance of the linear regression model is inspected using the plots. From the analysis of variance (ANOVA) of the linear regression model, it is observed that the \(R^2\) score negative. This shows that the output of the model is not even close to the actual result. is Furthermore, the Normal Q-Q plot of a linear regression model should be a straight line. However, it is not straight in this model. Therefore, this model is not good to be used for predictive analysis.

plot(lr1$finalModel)

pred <- predict(lr1, newdata = test)
pred <- exp(pred)

RMSE <- sqrt(mean( (test$price - pred)**2 ))
SST <- sum((test$price - mean(test$price)) ** 2)
SSE <- sum((test$price - pred)**2)
SSR <- sum((pred - mean(test$price)) ** 2)
R2 <- (SST - SSE) / SST

cat("SST: ", SST, "   SSE: ", SSE, "   SSR: ", SSR, "\nR2: ", R2)
## SST:  23694467    SSE:  4.975242e+166    SSR:  4.975242e+166 
## R2:  -2.099748e+159

To improve the linear regression model, the price is transformed using the logarithmic function so that the output is not that compact. Also, more variables is added for prediction.

lr2 <- lm(log(price) ~ room_type + neighbourhood_group + latitude + longitude 
                        + number_of_reviews + availability_365
                       + reviews_per_month + 
                     calculated_host_listings_count + minimum_nights, data = train)
summary(lr2)
## 
## Call:
## lm(formula = log(price) ~ room_type + neighbourhood_group + latitude + 
##     longitude + number_of_reviews + availability_365 + reviews_per_month + 
##     calculated_host_listings_count + minimum_nights, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19487 -0.23025 -0.01561  0.21244  1.56801 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                      -1.190e+02  5.872e+00  -20.261  < 2e-16 ***
## room_typePrivate room            -5.358e-01  3.837e-03 -139.634  < 2e-16 ***
## room_typeShared room             -6.112e-01  1.644e-02  -37.174  < 2e-16 ***
## neighbourhood_groupBrooklyn      -4.678e-02  1.655e-02   -2.828  0.00469 ** 
## neighbourhood_groupManhattan      1.497e-01  1.516e-02    9.877  < 2e-16 ***
## neighbourhood_groupQueens         4.149e-02  1.606e-02    2.583  0.00979 ** 
## neighbourhood_groupStaten Island -5.693e-01  3.094e-02  -18.402  < 2e-16 ***
## latitude                         -5.806e-01  5.666e-02  -10.247  < 2e-16 ***
## longitude                        -1.994e+00  6.583e-02  -30.297  < 2e-16 ***
## number_of_reviews                -1.224e-04  5.142e-05   -2.381  0.01727 *  
## availability_365                  3.702e-04  1.563e-05   23.685  < 2e-16 ***
## reviews_per_month                -8.493e-03  1.512e-03   -5.616 1.98e-08 ***
## calculated_host_listings_count    3.540e-04  6.520e-05    5.430 5.67e-08 ***
## minimum_nights                   -1.451e-03  1.008e-04  -14.398  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3279 on 31148 degrees of freedom
## Multiple R-squared:  0.4817, Adjusted R-squared:  0.4814 
## F-statistic:  2226 on 13 and 31148 DF,  p-value: < 2.2e-16

In the improved linear regression model. The \(R^2\) score of the model successfully improved to 0.43268, which is way better than the first model. The Normal Q-Q plot is observed to have a straight line. This indicate that this model is far better than the first model.

plot(lr2)

pred <- predict(lr2, newdata = test)
pred <- exp(pred)

RMSE <- sqrt(mean( (test$price - pred)**2 ))
SST <- sum((test$price - mean(test$price)) ** 2)
SSE <- sum((test$price - pred)**2)
SSR <- sum((pred - mean(test$price)) ** 2)
R2 <- (SST - SSE) / SST

cat("SST: ", SST, "   SSE: ", SSE, "   SSR: ", SSR, "\nR2: ", R2)
## SST:  23694467    SSE:  13442344    SSR:  10051848 
## R2:  0.43268

To examine the performance of the model, predictions are made. It is observed that the predicted value is close to the actual value.

actual <- test$price

lr_result <- data.frame(
  "Actual" = actual,
  "Predicted" = pred
)

head(lr_result, 20)
##     Actual Predicted
## 12      85  92.22537
## 17     215 144.22092
## 25      60  81.63663
## 27     150 160.11093
## 34      89  89.28803
## 37      68  93.91987
## 48     151 168.49698
## 57     250 160.25839
## 68     105  84.44066
## 73     150 105.28018
## 74     145  81.31374
## 76     130 134.43550
## 77      94  97.81509
## 78     105 139.00519
## 93     175 154.25297
## 94      65  90.97497
## 97     125 108.98804
## 98      80 129.79904
## 99     100  92.05198
## 100    200 143.39008

From the plot below, it is observed that the datapoints are distributed around the red line. This shows that the linear regression model is able to predict the price per night of Airbnb.

lm_line = lm(Predicted ~ Actual, data = lr_result)
plot(x = lr_result$Actual, y = lr_result$Predicted,
main = "Actual and Predicted Price",
xlab = "Actual Price ($)",
ylab = "Predicted Price ($)")
abline(lm_line, col="red", lwd=3)

## Classification In this project, the predicted price in ranges is also one of the field of the team members’ interest. The price is discretized into categorical data with three levels: low, medium and high.

train$price <- cut(train$price, br = c(0, 50, 100, 500), labels = c("Affordable", "Medium", "Expensive"))
test$price <- cut(test$price, br = c(0, 50, 100, 500), labels = c("Affordable", "Medium", "Expensive"))

The new structure of the data is examined. It is observed that the price per night has turned into factor data.

str(train)
## 'data.frame':    31162 obs. of  13 variables:
##  $ name                          : Factor w/ 38201 levels "'Fan'tastic",..: 10048 30380 36046 12549 15516 20003 6477 20041 12459 14151 ...
##  $ host_name                     : Factor w/ 9945 levels "'Cil","-TheQueensCornerLot",..: 4369 4187 2566 5429 5183 1696 3118 8417 6012 1080 ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ neighbourhood                 : Factor w/ 219 levels "Allerton","Arden Heights",..: 108 127 94 42 62 137 14 95 202 36 ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ price                         : Factor w/ 3 levels "Affordable","Medium",..: 3 3 3 2 2 3 2 2 2 3 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ reviews_per_month             : num  0.21 0.38 0 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "na.action")= 'omit' Named int [1:37] 361 2701 2855 3704 5746 5776 5976 6076 6270 6568 ...
##   ..- attr(*, "names")= chr [1:37] "361" "2701" "2855" "3704" ...

Decision Tree Model

The first classifcation model used is the decision tree model. The decision tree takes all the variables as features excpet for neighbourhood. The dicision tree model is plotted in the figure below:

treeModel <- ctree(price~latitude+longitude+neighbourhood_group+room_type+minimum_nights+number_of_reviews+reviews_per_month+calculated_host_listings_count+availability_365, 
                   controls = ctree_control(mincriterion = 0.999, minsplit = 1000), 
                   data = train)
plot(treeModel)

The decision tree model is tested against the train dataset. It is observed that the accuracy of decision tree model on the train dataset is 76.22%. This is consider as highly accurate considering the number of dataset used.

treePrediction <- predict(treeModel, train)
eval_model(treePrediction, train$price)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   Affordable Expensive Medium   Sum
##   Affordable          0         0      0     0
##   Expensive          56     13753   3845 17654
##   Medium           1183      2326   9999 13508
##   Sum              1239     16079  13844 31162
## 
## Confusion matrix (relative):
##             Actual
## Prediction   Affordable Expensive Medium  Sum
##   Affordable       0.00      0.00   0.00 0.00
##   Expensive        0.00      0.44   0.12 0.57
##   Medium           0.04      0.07   0.32 0.43
##   Sum              0.04      0.52   0.44 1.00
## 
## Accuracy:
## 0.7622 (23752/31162)
## 
## Error rate:
## 0.2378 (7410/31162)
## 
## Error rate reduction (vs. base rate):
## 0.5087 (p-value < 2.2e-16)

The decision tree model is also tested against the test dataset. The accuracy of the decision tree model on the test dataset is 75.68% which is almost similar to that for train dataset.

treePrediction <- predict(treeModel, test)
eval_model(treePrediction, test$price)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   Affordable Expensive Medium  Sum
##   Affordable          0         0      0    0
##   Expensive          15      3344    979 4338
##   Medium            277       618   2533 3428
##   Sum               292      3962   3512 7766
## 
## Confusion matrix (relative):
##             Actual
## Prediction   Affordable Expensive Medium  Sum
##   Affordable       0.00      0.00   0.00 0.00
##   Expensive        0.00      0.43   0.13 0.56
##   Medium           0.04      0.08   0.33 0.44
##   Sum              0.04      0.51   0.45 1.00
## 
## Accuracy:
## 0.7568 (5877/7766)
## 
## Error rate:
## 0.2432 (1889/7766)
## 
## Error rate reduction (vs. base rate):
## 0.5034 (p-value < 2.2e-16)

Naive Bayes Model

In this section, another classification model, Naive Bayes model is used. Before the model is trained, the data needs to be precessed further. The name, host_name, neightbourhood, latitude and logitude is dropped as it could not be used to train the Naive Bayes model.

train <- subset(train, select = -c(name, host_name, neighbourhood, latitude, longitude))
test <- subset(test, select = -c(name, host_name, neighbourhood, latitude, longitude))
summary(train)
##     neighbourhood_group           room_type            price      
##  Bronx        :  595    Entire home/apt:16748   Affordable: 1239  
##  Brooklyn     :12926    Private room   :14004   Medium    :13844  
##  Manhattan    :13977    Shared room    :  410   Expensive :16079  
##  Queens       : 3441                                              
##  Staten Island:  223                                              
##                                                                   
##  minimum_nights    number_of_reviews reviews_per_month
##  Min.   :  1.000   Min.   :  0.00    Min.   : 0.000   
##  1st Qu.:  1.000   1st Qu.:  1.00    1st Qu.: 0.050   
##  Median :  2.000   Median :  6.00    Median : 0.390   
##  Mean   :  6.669   Mean   : 24.72    Mean   : 1.111   
##  3rd Qu.:  5.000   3rd Qu.: 26.00    3rd Qu.: 1.650   
##  Max.   :999.000   Max.   :488.00    Max.   :17.820   
##  calculated_host_listings_count availability_365
##  Min.   :  1.000                Min.   :  0     
##  1st Qu.:  1.000                1st Qu.:  0     
##  Median :  1.000                Median : 36     
##  Mean   :  6.082                Mean   :107     
##  3rd Qu.:  2.000                3rd Qu.:208     
##  Max.   :327.000                Max.   :365
summary(test)
##     neighbourhood_group           room_type           price     
##  Bronx        : 146     Entire home/apt:4115   Affordable: 292  
##  Brooklyn     :3263     Private room   :3550   Medium    :3512  
##  Manhattan    :3424     Shared room    : 101   Expensive :3962  
##  Queens       : 886                                             
##  Staten Island:  47                                             
##                                                                 
##  minimum_nights     number_of_reviews reviews_per_month
##  Min.   :   1.000   Min.   :  0.0     Min.   : 0.000   
##  1st Qu.:   1.000   1st Qu.:  1.0     1st Qu.: 0.050   
##  Median :   2.000   Median :  6.0     Median : 0.400   
##  Mean   :   7.088   Mean   : 24.6     Mean   : 1.133   
##  3rd Qu.:   5.000   3rd Qu.: 25.0     3rd Qu.: 1.660   
##  Max.   :1250.000   Max.   :540.0     Max.   :58.500   
##  calculated_host_listings_count availability_365
##  Min.   :  1.000                Min.   :  0.0   
##  1st Qu.:  1.000                1st Qu.:  0.0   
##  Median :  1.000                Median : 41.0   
##  Mean   :  6.332                Mean   :108.5   
##  3rd Qu.:  2.000                3rd Qu.:209.0   
##  Max.   :327.000                Max.   :365.0

The remaining data is converted into categorical data through discretization.

train$minimum_nights <- cut(train$minimum_nights, br = c(0, 100, 500, 1000, 1500), labels = c("Low", "Medium", "Moderate", "High"))
train$number_of_reviews <- cut(train$number_of_reviews, br = c(0, 100, 350, 700), labels = c("Low", "Medium", "High"))
train$reviews_per_month <- cut(train$reviews_per_month, br = c(0, 20, 40, 60), labels = c("Low", "Medium", "High"))
train$calculated_host_listings_count <- cut(train$calculated_host_listings_count, br = c(0, 50, 100, 400), labels = c("Low", "Medium", "High"))
train$availability_365 <- cut(train$availability_365, br = c(0, 60, 120, 180, 240, 300, 370), labels = c("2Mo", "4Mo", "6Mo", "8Mo", "10Mo", "12Mo"))

test$minimum_nights <- cut(test$minimum_nights, br = c(0, 100, 500, 1000, 1500), labels = c("Low", "Medium", "Moderate", "High"))
test$number_of_reviews <- cut(test$number_of_reviews, br = c(0, 100, 350, 700), labels = c("Low", "Medium", "High"))
test$reviews_per_month <- cut(test$reviews_per_month, br = c(0, 20, 40, 60), labels = c("Low", "Medium", "High"))
test$calculated_host_listings_count <- cut(test$calculated_host_listings_count, br = c(0, 50, 100, 400), labels = c("Low", "Medium", "High"))
test$availability_365 <- cut(test$availability_365, br = c(0, 60, 120, 180, 240, 300, 370), labels = c("2Mo", "4Mo", "6Mo", "8Mo", "10Mo", "12Mo"))

The final structure of the dataset is examined. It is observed that all the data is in categorical form.

str(train)
## 'data.frame':    31162 obs. of  8 variables:
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ price                         : Factor w/ 3 levels "Affordable","Medium",..: 3 3 3 2 2 3 2 2 2 3 ...
##  $ minimum_nights                : Factor w/ 4 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ number_of_reviews             : Factor w/ 3 levels "Low","Medium",..: 1 1 NA 2 1 1 1 3 2 2 ...
##  $ reviews_per_month             : Factor w/ 3 levels "Low","Medium",..: 1 1 NA 1 1 1 1 1 1 1 ...
##  $ calculated_host_listings_count: Factor w/ 3 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ availability_365              : Factor w/ 6 levels "2Mo","4Mo","6Mo",..: 6 6 6 4 NA 3 NA 4 NA 4 ...
str(test)
## 'data.frame':    7766 obs. of  8 variables:
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 2 2 3 2 3 3 3 2 3 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 2 2 1 1 2 2 ...
##  $ price                         : Factor w/ 3 levels "Affordable","Medium",..: 2 3 2 3 2 2 3 3 3 3 ...
##  $ minimum_nights                : Factor w/ 4 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ number_of_reviews             : Factor w/ 3 levels "Low","Medium",..: 2 2 1 1 2 2 1 1 2 2 ...
##  $ reviews_per_month             : Factor w/ 3 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ calculated_host_listings_count: Factor w/ 3 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ availability_365              : Factor w/ 6 levels "2Mo","4Mo","6Mo",..: 1 6 2 2 6 2 6 4 6 5 ...

Naive Bayes model works based on the Bayes’ Theorem. It assumes that all the features used are independent of each other although this might not be true in the real world situation. However, Naive Bayes model is capable of generating output that are suprisingly accurate.

NaiveBayesModel <- naiveBayes(price ~., data = train)
NaiveBayesModel
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Affordable     Medium  Expensive 
## 0.03975996 0.44425903 0.51598100 
## 
## Conditional probabilities:
##             neighbourhood_group
## Y                  Bronx    Brooklyn   Manhattan      Queens Staten Island
##   Affordable 0.033091203 0.527845036 0.217110573 0.209039548   0.012913640
##   Medium     0.028748916 0.471034383 0.339352788 0.150967928   0.009895984
##   Expensive  0.009702096 0.357671497 0.560358231 0.067914671   0.004353505
## 
##             room_type
## Y            Entire home/apt Private room Shared room
##   Affordable     0.043583535  0.926553672 0.029862793
##   Medium         0.253683906  0.725007223 0.021308870
##   Expensive      0.819827104  0.175321848 0.004851048
## 
##             minimum_nights
## Y                     Low       Medium     Moderate         High
##   Affordable 9.959645e-01 4.035513e-03 0.000000e+00 0.000000e+00
##   Medium     9.969662e-01 2.961572e-03 7.223346e-05 0.000000e+00
##   Expensive  9.963306e-01 3.607190e-03 6.219292e-05 0.000000e+00
## 
##             number_of_reviews
## Y                    Low      Medium        High
##   Affordable 0.941968912 0.055958549 0.002072539
##   Medium     0.914842923 0.083127427 0.002029651
##   Expensive  0.916498994 0.082185420 0.001315586
## 
##             reviews_per_month
## Y            Low Medium High
##   Affordable   1      0    0
##   Medium       1      0    0
##   Expensive    1      0    0
## 
##             calculated_host_listings_count
## Y                     Low       Medium         High
##   Affordable 0.9991928975 0.0000000000 0.0008071025
##   Medium     0.9964605605 0.0018058365 0.0017336030
##   Expensive  0.9591392500 0.0209590149 0.0199017352
## 
##             availability_365
## Y                   2Mo        4Mo        6Mo        8Mo       10Mo       12Mo
##   Affordable 0.28405797 0.16376812 0.12028986 0.08550725 0.08840580 0.25797101
##   Medium     0.28195798 0.16950346 0.13252729 0.07547834 0.09167743 0.24885550
##   Expensive  0.27848955 0.14170118 0.11896734 0.09989404 0.12359118 0.23735671

The Naive Bayes model is tested against the train data and it generates output with 75.42% accuracy which is slightly lower than the accuracy of a decision tree. This is because decision tree considers more features as compared to Naive Bayes model.

NBPredictions <- predict(NaiveBayesModel, train)
eval_model(NBPredictions, train$price)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   Affordable Expensive Medium   Sum
##   Affordable          0         1      1     2
##   Expensive          55     13185   3526 16766
##   Medium           1184      2893  10317 14394
##   Sum              1239     16079  13844 31162
## 
## Confusion matrix (relative):
##             Actual
## Prediction   Affordable Expensive Medium  Sum
##   Affordable       0.00      0.00   0.00 0.00
##   Expensive        0.00      0.42   0.11 0.54
##   Medium           0.04      0.09   0.33 0.46
##   Sum              0.04      0.52   0.44 1.00
## 
## Accuracy:
## 0.7542 (23502/31162)
## 
## Error rate:
## 0.2458 (7660/31162)
## 
## Error rate reduction (vs. base rate):
## 0.4921 (p-value < 2.2e-16)

The Naive Bayes model is also tested against the test dataset. The accuracy of Naive Bayes model on the test dataset is observed to be 75.77%. This is also similar to that in the train dataset.

NBPredictions <- predict(NaiveBayesModel, test)
eval_model(NBPredictions, test$price)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   Affordable Expensive Medium  Sum
##   Affordable          0         0      0    0
##   Expensive          21      3232    877 4130
##   Medium            271       730   2635 3636
##   Sum               292      3962   3512 7766
## 
## Confusion matrix (relative):
##             Actual
## Prediction   Affordable Expensive Medium  Sum
##   Affordable       0.00      0.00   0.00 0.00
##   Expensive        0.00      0.42   0.11 0.53
##   Medium           0.03      0.09   0.34 0.47
##   Sum              0.04      0.51   0.45 1.00
## 
## Accuracy:
## 0.7555 (5867/7766)
## 
## Error rate:
## 0.2445 (1899/7766)
## 
## Error rate reduction (vs. base rate):
## 0.5008 (p-value < 2.2e-16)

Conclusion

In conclusion, price prediction can be a regression problem and also can be transformed into a classification problem. In regression, linear regression models is used which in classification, decision tree and Naive Bayes model are used. The linear regression model can be improved by transforming the price into logarithmic scale and increase the number of features used for training. The decision tree model also had a higher accuracy as compared to the Naive Bayes model as more features is used to train the decision tree model as compared to the Naive Bayes model.