library(tidyverse)
library(ggplot2)
library(caret)
library(corrplot)
library(leaflet)
library(plotly)
library(readr)
library(dplyr)
library(OneR)
library(e1071)
library(tree)
library(party)
library(rpart)
library(rmarkdown)
airbnb <- read.csv("AB_NYC_2019.csv", stringsAsFactors = FALSE, na.strings = c(""))
The first 6 rows of the data are shown:
head(airbnb)
## id name host_id host_name
## 1 2539 Clean & quiet apt home by the park 2787 John
## 2 2595 Skylit Midtown Castle 2845 Jennifer
## 3 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth
## 4 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura
## 6 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris
## neighbourhood_group neighbourhood latitude longitude room_type price
## 1 Brooklyn Kensington 40.64749 -73.97237 Private room 149
## 2 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225
## 3 Manhattan Harlem 40.80902 -73.94190 Private room 150
## 4 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89
## 5 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80
## 6 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200
## minimum_nights number_of_reviews last_review reviews_per_month
## 1 1 9 2018-10-19 0.21
## 2 1 45 2019-05-21 0.38
## 3 3 0 <NA> NA
## 4 1 270 2019-07-05 4.64
## 5 10 9 2018-11-19 0.10
## 6 3 74 2019-06-22 0.59
## calculated_host_listings_count availability_365
## 1 6 365
## 2 2 355
## 3 1 365
## 4 1 194
## 5 1 0
## 6 1 129
This dataset contains 16 features about Airbnb listings within New York City. Below are the features with their respective descriptions:
Check the number of records in the dataset.
nrow(airbnb)
## [1] 48895
Inspect the data structures. The dataset would need to be cleaned before any conversion of data types can be performed.
str(airbnb)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "2018-10-19" "2019-05-21" NA "2019-07-05" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
Get the summary of the data.
summary(airbnb)
## id name host_id host_name
## Min. : 2539 Length:48895 Min. : 2438 Length:48895
## 1st Qu.: 9471945 Class :character 1st Qu.: 7822033 Class :character
## Median :19677284 Mode :character Median : 30793816 Mode :character
## Mean :19017143 Mean : 67620011
## 3rd Qu.:29152178 3rd Qu.:107434423
## Max. :36487245 Max. :274321313
##
## neighbourhood_group neighbourhood latitude longitude
## Length:48895 Length:48895 Min. :40.50 Min. :-74.24
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.98
## Mode :character Mode :character Median :40.72 Median :-73.96
## Mean :40.73 Mean :-73.95
## 3rd Qu.:40.76 3rd Qu.:-73.94
## Max. :40.91 Max. :-73.71
##
## room_type price minimum_nights number_of_reviews
## Length:48895 Min. : 0.0 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.: 69.0 1st Qu.: 1.00 1st Qu.: 1.00
## Mode :character Median : 106.0 Median : 3.00 Median : 5.00
## Mean : 152.7 Mean : 7.03 Mean : 23.27
## 3rd Qu.: 175.0 3rd Qu.: 5.00 3rd Qu.: 24.00
## Max. :10000.0 Max. :1250.00 Max. :629.00
##
## last_review reviews_per_month calculated_host_listings_count
## Length:48895 Min. : 0.010 Min. : 1.000
## Class :character 1st Qu.: 0.190 1st Qu.: 1.000
## Mode :character Median : 0.720 Median : 1.000
## Mean : 1.373 Mean : 7.144
## 3rd Qu.: 2.020 3rd Qu.: 2.000
## Max. :58.500 Max. :327.000
## NA's :10052
## availability_365
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 45.0
## Mean :112.8
## 3rd Qu.:227.0
## Max. :365.0
##
Describe the data. It is observed that the price has a minimum value of 0 USD which is not possible. We will remove it after wards.
psych::describe(airbnb)
## vars n mean sd median
## id 1 48895 19017143.24 10983108.39 19677284.00
## name* 2 48879 23969.07 13814.77 23938.00
## host_id 3 48895 67620010.65 78610967.03 30793816.00
## host_name* 4 48874 5460.42 3232.22 5333.00
## neighbourhood_group* 5 48895 2.68 0.74 3.00
## neighbourhood* 6 48895 108.11 68.74 95.00
## latitude 7 48895 40.73 0.05 40.72
## longitude 8 48895 -73.95 0.05 -73.96
## room_type* 9 48895 1.50 0.55 1.00
## price 10 48895 152.72 240.15 106.00
## minimum_nights 11 48895 7.03 20.51 3.00
## number_of_reviews 12 48895 23.27 44.55 5.00
## last_review* 13 38843 1491.10 400.67 1714.00
## reviews_per_month 14 38843 1.37 1.68 0.72
## calculated_host_listings_count 15 48895 7.14 32.95 1.00
## availability_365 16 48895 112.78 131.62 45.00
## trimmed mad min max
## id 19188061.30 14689959.59 2539.00 36487245.00
## name* 23967.98 17712.62 1.00 47905.00
## host_id 54170438.55 40836605.41 2438.00 274321313.00
## host_name* 5433.36 4209.84 1.00 11452.00
## neighbourhood_group* 2.61 1.48 1.00 5.00
## neighbourhood* 106.81 88.96 1.00 221.00
## latitude 40.73 0.05 40.50 40.91
## longitude -73.96 0.04 -74.24 -73.71
## room_type* 1.48 0.00 1.00 3.00
## price 121.43 68.20 0.00 10000.00
## minimum_nights 3.58 2.97 1.00 1250.00
## number_of_reviews 12.45 7.41 0.00 629.00
## last_review* 1579.38 65.23 1.00 1764.00
## reviews_per_month 1.06 0.92 0.01 58.50
## calculated_host_listings_count 1.50 0.00 1.00 327.00
## availability_365 96.50 66.72 0.00 365.00
## range skew kurtosis se
## id 36484706.00 -0.09 -1.23 49669.87
## name* 47904.00 0.00 -1.20 62.49
## host_id 274318875.00 1.21 0.17 355509.26
## host_name* 11451.00 0.05 -1.16 14.62
## neighbourhood_group* 4.00 0.37 -0.11 0.00
## neighbourhood* 220.00 0.26 -1.26 0.31
## latitude 0.41 0.24 0.15 0.00
## longitude 0.53 1.28 5.02 0.00
## room_type* 2.00 0.42 -0.97 0.00
## price 10000.00 19.12 585.59 1.09
## minimum_nights 1249.00 21.83 853.95 0.09
## number_of_reviews 629.00 3.69 19.53 0.20
## last_review* 1763.00 -1.66 1.71 2.03
## reviews_per_month 58.49 3.13 42.49 0.01
## calculated_host_listings_count 326.00 7.93 67.54 0.15
## availability_365 365.00 0.76 -1.00 0.60
Since id, host_id does not give any information, these columns are dropped. The last_review is the date for the last review, which does not affect the price. It is also dropped.
airbnb <- subset(airbnb, select = -c(id, host_id, last_review))
The percentage of missing values is examined.
sapply(airbnb, function(x)sum(is.na(x)/nrow(airbnb)))
## name host_name
## 0.0003272318 0.0004294918
## neighbourhood_group neighbourhood
## 0.0000000000 0.0000000000
## latitude longitude
## 0.0000000000 0.0000000000
## room_type price
## 0.0000000000 0.0000000000
## minimum_nights number_of_reviews
## 0.0000000000 0.0000000000
## reviews_per_month calculated_host_listings_count
## 0.2055833930 0.0000000000
## availability_365
## 0.0000000000
If there is no entry for review_per_month, it can be treated as no review, so the review_per_month is replaced with 0.
airbnb[["reviews_per_month"]][is.na(airbnb[["reviews_per_month"]])] <- 0
sapply(airbnb, function(x)sum(is.na(x)/nrow(airbnb)))
## name host_name
## 0.0003272318 0.0004294918
## neighbourhood_group neighbourhood
## 0.0000000000 0.0000000000
## latitude longitude
## 0.0000000000 0.0000000000
## room_type price
## 0.0000000000 0.0000000000
## minimum_nights number_of_reviews
## 0.0000000000 0.0000000000
## reviews_per_month calculated_host_listings_count
## 0.0000000000 0.0000000000
## availability_365
## 0.0000000000
Now, it is observed that the NULL values in the name and the host_name is less than 0.0005%. Therefore, it is safe ro directly omit these record without losing important information.
airbnb <- na.omit(airbnb)
sapply(airbnb, function(x)sum(is.na(x)/nrow(airbnb)))
## name host_name
## 0 0
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## reviews_per_month calculated_host_listings_count
## 0 0
## availability_365
## 0
Price cannot be 0 USD so we will omit the data with price of 0 USD. The price of USD 10000 is also considered as outliers as it is too far away from the mean price. Therefore, the data with price within the top 10% and the bottom 10% are considered as outliers and are removed.
airbnb <- airbnb %>% filter(price > 0)
airbnb <- airbnb %>% filter(price < quantile(airbnb$price, 0.9) & price > quantile(airbnb$price, 0.1))
summary(airbnb)
## name host_name neighbourhood_group neighbourhood
## Length:38928 Length:38928 Length:38928 Length:38928
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## latitude longitude room_type price
## Min. :40.50 Min. :-74.24 Length:38928 Min. : 50.0
## 1st Qu.:40.69 1st Qu.:-73.98 Class :character 1st Qu.: 75.0
## Median :40.72 Median :-73.96 Mode :character Median :107.0
## Mean :40.73 Mean :-73.95 Mean :121.6
## 3rd Qu.:40.76 3rd Qu.:-73.94 3rd Qu.:155.0
## Max. :40.91 Max. :-73.71 Max. :268.0
## minimum_nights number_of_reviews reviews_per_month
## Min. : 1.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 1.0 1st Qu.: 0.050
## Median : 2.000 Median : 6.0 Median : 0.390
## Mean : 6.753 Mean : 24.7 Mean : 1.115
## 3rd Qu.: 5.000 3rd Qu.: 26.0 3rd Qu.: 1.650
## Max. :1250.000 Max. :540.0 Max. :58.500
## calculated_host_listings_count availability_365
## Min. : 1.000 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 0.0
## Median : 1.000 Median : 37.0
## Mean : 6.132 Mean :107.3
## 3rd Qu.: 2.000 3rd Qu.:208.0
## Max. :327.000 Max. :365.0
The variables name, host_name, neighbourhood_group, neighbourbood and room_type is changed to factor.
nrow(airbnb)
## [1] 38928
names_to_factor <- c("name", "host_name", "neighbourhood_group", "neighbourhood", "room_type")
airbnb[names_to_factor] <- map(airbnb[names_to_factor], as.factor)
Again, the structure of the dataframe is examined.
str(airbnb)
## 'data.frame': 38928 obs. of 13 variables:
## $ name : Factor w/ 38201 levels "'Fan'tastic",..: 10048 30380 36046 12549 15516 20003 6477 20041 12459 14151 ...
## $ host_name : Factor w/ 9945 levels "'Cil","-TheQueensCornerLot",..: 4369 4187 2566 5429 5183 1696 3118 8417 6012 1080 ...
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
## $ neighbourhood : Factor w/ 219 levels "Allerton","Arden Heights",..: 108 127 94 42 62 137 14 95 202 36 ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ reviews_per_month : num 0.21 0.38 0 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
## - attr(*, "na.action")= 'omit' Named int [1:37] 361 2701 2855 3704 5746 5776 5976 6076 6270 6568 ...
## ..- attr(*, "names")= chr [1:37] "361" "2701" "2855" "3704" ...
Since the price is the output to be predicted, the distribution of the price will be interesting to examine. It is observed that the plot is very skewed.
ggplot(airbnb, aes(x = price)) +
geom_histogram(bins = 40, fill = "steelblue4", colour='black') +
ggtitle("Distribution of Price")
The price is observed to be distributed in around USD 50 to USD 130.
The distribution of property at each neighbourhood group is also part of the team’s interest.
barchart(airbnb$neighbourhood_group,
main = "Number of Properties In Each Neighbourhood Group",
xlab = "Number of Properties",
ylab = "Neighbourhood Group",
col = "steelblue4")
Next, the factor that affect the price the most is the main focus. The initial guess is that the number of reviews will be the main factor that affect the price. Therefore, the distribution of price according to number of reviews.
ggplot(data = airbnb, aes(x = number_of_reviews, y = price))+
geom_jitter(color="steelblue4") +
ggtitle("Distribution of Price According to Review")
From the Jitter Plot above, it is observed that the more the number of review, the price is lower as the homestay might be old as compared to those with lowest number of reviews. This makes the homestay with higher number of review to be cheaper.
Next, the average price according to the neighbourhood_group is calculated and plotted.
bronx <- subset(airbnb, neighbourhood_group == "Bronx")
bronx_avg <- sum(bronx$price)/nrow(bronx)
brooklyn <- subset(airbnb, neighbourhood_group == "Brooklyn")
brooklyn_avg <- sum(brooklyn$price)/nrow(brooklyn)
manhattan <- subset(airbnb, neighbourhood_group == "Manhattan")
manhattan_avg <- sum(manhattan$price)/nrow(manhattan)
queens <- subset(airbnb, neighbourhood_group == "Queens")
queens_avg <- sum(queens$price)/nrow(queens)
staten <- subset(airbnb, neighbourhood_group == "Staten Island")
staten_avg <- sum(staten$price)/nrow(staten)
price_vs_neighbourhooodGroup <- data.frame(
"Neighbourhood_Group" = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
"Average_Price" = c(bronx_avg, brooklyn_avg, manhattan_avg, queens_avg, staten_avg)
)
price_vs_neighbourhooodGroup
## Neighbourhood_Group Average_Price
## 1 Bronx 90.24022
## 2 Brooklyn 111.51331
## 3 Manhattan 138.49750
## 4 Queens 98.29096
## 5 Staten Island 98.21481
ggplot(data = price_vs_neighbourhooodGroup, aes(x = Neighbourhood_Group, y = Average_Price)) +
geom_bar(stat="identity", fill = "steelblue4", colour = "black") +
ggtitle("Distribution of Price According to Neighbour Group")
It is observed that the average price per night is the highest in Manhattan, followed by Brooklyn, Staten Island, Queens and Bronx.
The average price per night according to room type is also calculated and plotted.
entire <- subset(airbnb, room_type == "Entire home/apt")
entire_avg <- sum(entire$price)/nrow(entire)
private <- subset(airbnb, room_type == "Private room")
private_avg <- sum(private$price)/nrow(private)
shared <- subset(airbnb, room_type == "Shared room")
shared_avg <- sum(shared$price)/nrow(shared)
price_vs_roomType <- data.frame(
"Room_Type" = c("Entire Home/Apt", "Private Room", "Shared Room"),
"Average_Price" = c(entire_avg, private_avg, shared_avg)
)
price_vs_roomType
## Room_Type Average_Price
## 1 Entire Home/Apt 151.76954
## 2 Private Room 86.76991
## 3 Shared Room 86.98239
ggplot(data = price_vs_roomType, aes(x = Room_Type, y = Average_Price)) +
geom_bar(stat="identity", fill = "steelblue4", colour = "black") +
ggtitle("Distribution of Price According to Room Type")
The entire home or apartment is observed to be the most expensive one, followed by the private room and shared room. This make sense as renting entire home or apartment means renting bigger space.
Now, the property type in each neighbourhood is also important to be visualized. It is observed that Manhattan has most entire home/apartment to be rented. This is also why the average price per night in Manhattan is high as compared to other neighbourhood.
ggplot(airbnb) +
geom_histogram(aes(neighbourhood_group, fill = room_type), stat = "count", position = 'fill') +
theme_minimal(base_size = 13)+ xlab("") + ylab("") +
ggtitle("The Proportion of Property Type in Each Area")
The distribution of price per night in each neighbourhood group is inspected.
airbnb_nh <- airbnb %>%
group_by(neighbourhood_group) %>%
summarise(price = round(mean(price), 2))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(airbnb, aes(price)) +
geom_histogram(bins = 30, aes(y = ..density..), fill = "steelblue4") +
geom_density(alpha = 0.2, fill = "blue") +
ggtitle("Distribution of price by neighbourhood groups") +
geom_vline(data = airbnb_nh, aes(xintercept = price), size = 1, linetype = 3) +
geom_text(data = airbnb_nh, y = 0.018, aes(x = price+50, label = paste("Mean = $", price)), color = "red", size = 4) +
facet_wrap(~neighbourhood_group)
The price distribution based on room type and neighbourhood group is shown as below. It is observed that the price per night for entire home/apt is the highest in Manhattan. In Brooklyn, Queens and Bronx, the price per night for private room is higher as compared to other room types.
airbnb %>% arrange(desc(price)) %>% top_n(10) %>% select(- host_name, -name) %>%
ggplot(aes(x = price, fill = neighbourhood_group)) +
geom_histogram(bins = 50) +
scale_x_log10() +
ggtitle("Summary of Price Distributions") +
facet_wrap(~room_type + neighbourhood_group)
## Selecting by availability_365
To further visualize the data, a leaflet map is used. It is observed that the nerighbourhood groups are clearly partitioned. Therefore, the data is cleaned.
pal <- colorFactor(palette = c("red", "green", "blue", "purple", "yellow"), domain = airbnb$neighbourhood_group)
leaflet(data = airbnb) %>% addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>% addCircleMarkers(~longitude, ~latitude, color = ~pal(neighbourhood_group), weight = 1, radius=1, fillOpacity = 0.1, opacity = 0.1,
label = paste("Name:", airbnb$name)) %>%
addLegend("bottomright", pal = pal, values = ~neighbourhood_group,
title = "Neighbourhood groups",
opacity = 1
)
A correlation heatmap is used to determine the correlation between the variables. 1 indicates strong positive correlation while -1 indicates strong negative correlation. It is observed that only number of review and review per month has a strong positive correlation as these two variables are fundamentally related to reviews.
airbnb_cor <- airbnb[, sapply(airbnb, is.numeric)]
airbnb_cor <- airbnb_cor[complete.cases(airbnb_cor), ]
correlation_matrix <- cor(airbnb_cor)
corrplot(correlation_matrix, method = "color")
## Train Test Split To perform predictive analysis, the data is split into train and test data with a ratio of 80% : 20%.
set.seed(45)
pd <- sample(2, nrow(airbnb), replace = TRUE, prob = c(0.8,0.2))
train <- airbnb[pd == 1,]#, here means all column
test <- airbnb[pd == 2,]
cat("Number of train data: ", nrow(train), "\n")
## Number of train data: 31162
cat("Number of test data: ", nrow(test), "\n")
## Number of test data: 7766
The main goal of the analysis is to predict the price per night. A basic linear regression model is used for the first model
lr1 <- lm(price ~ room_type + neighbourhood_group + number_of_reviews + latitude + longitude, data = train)
summary(lr1)
##
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + number_of_reviews +
## latitude + longitude, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.977 -27.298 -7.427 22.394 201.261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.217e+04 7.477e+02 -16.280 < 2e-16 ***
## room_typePrivate room -6.062e+01 4.909e-01 -123.495 < 2e-16 ***
## room_typeShared room -6.807e+01 2.118e+00 -32.144 < 2e-16 ***
## neighbourhood_groupBrooklyn -1.211e+01 2.122e+00 -5.706 1.17e-08 ***
## neighbourhood_groupManhattan 1.436e+01 1.952e+00 7.356 1.94e-13 ***
## neighbourhood_groupQueens 1.020e+00 2.066e+00 0.494 0.621
## neighbourhood_groupStaten Island -6.850e+01 3.980e+00 -17.210 < 2e-16 ***
## number_of_reviews -2.166e-02 5.251e-03 -4.125 3.71e-05 ***
## latitude -9.609e+01 7.244e+00 -13.265 < 2e-16 ***
## longitude -2.195e+02 8.391e+00 -26.164 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.29 on 31152 degrees of freedom
## Multiple R-squared: 0.4196, Adjusted R-squared: 0.4195
## F-statistic: 2503 on 9 and 31152 DF, p-value: < 2.2e-16
The performance of the linear regression model is inspected using the plots. From the analysis of variance (ANOVA) of the linear regression model, it is observed that the \(R^2\) score is 0.4221489. The Normal Q-Q plot is approcimate a straight line which indicates that the model can be used to predict the price per night.
plot(lr1)
pred <- predict(lr1, newdata = test)
RMSE <- sqrt(mean( (test$price - pred)**2 ))
SSE <- sum((test$price - pred)**2)
SSR <- sum((pred - mean(test$price)) ** 2)
SST <- SSR +SSE
R2 <- (SST - SSE) / SST
cat("SST: ", SST, " SSE: ", SSE, " SSR: ", SSR, "\nR2: ", R2, " RMSE: ", RMSE)
## SST: 23643801 SSE: 13662596 SSR: 9981205
## R2: 0.4221489 RMSE: 41.94381
To improve the linear regression model, the price is transformed using the logarithmic function so that the output is not that compact. Also, more variables is added for prediction.
lr2 <- lm(price ~ room_type + neighbourhood_group + latitude + longitude
+ number_of_reviews + availability_365
+ reviews_per_month +
calculated_host_listings_count + minimum_nights, data = train)
summary(lr2)
##
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + latitude +
## longitude + number_of_reviews + availability_365 + reviews_per_month +
## calculated_host_listings_count + minimum_nights, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.700 -27.115 -6.865 21.799 203.248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.395e+04 7.471e+02 -18.675 < 2e-16 ***
## room_typePrivate room -6.101e+01 4.882e-01 -124.976 < 2e-16 ***
## room_typeShared room -6.925e+01 2.092e+00 -33.108 < 2e-16 ***
## neighbourhood_groupBrooklyn -6.405e+00 2.105e+00 -3.043 0.00235 **
## neighbourhood_groupManhattan 1.731e+01 1.929e+00 8.976 < 2e-16 ***
## neighbourhood_groupQueens 4.796e+00 2.043e+00 2.347 0.01891 *
## neighbourhood_groupStaten Island -6.733e+01 3.936e+00 -17.106 < 2e-16 ***
## latitude -7.427e+01 7.208e+00 -10.304 < 2e-16 ***
## longitude -2.315e+02 8.376e+00 -27.638 < 2e-16 ***
## number_of_reviews -2.922e-02 6.542e-03 -4.465 8.02e-06 ***
## availability_365 4.535e-02 1.988e-03 22.806 < 2e-16 ***
## reviews_per_month -8.939e-01 1.924e-01 -4.646 3.40e-06 ***
## calculated_host_listings_count 9.076e-02 8.295e-03 10.942 < 2e-16 ***
## minimum_nights -1.726e-01 1.282e-02 -13.464 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.71 on 31148 degrees of freedom
## Multiple R-squared: 0.4355, Adjusted R-squared: 0.4353
## F-statistic: 1849 on 13 and 31148 DF, p-value: < 2.2e-16
In the improved linear regression model. The \(R^2\) score of the model successfully improved to 0.4386039, which is slightly better than the first model. The Normal Q-Q plot is observed to have a straight line.
plot(lr2)
pred <- predict(lr2, newdata = test)
RMSE <- sqrt(mean( (test$price - pred)**2 ))
SSE <- sum((test$price - pred)**2)
SSR <- sum((pred - mean(test$price)) ** 2)
SST <- SSR +SSE
R2 <- (SST - SSE) / SST
cat("SST: ", SST, " SSE: ", SSE, " SSR: ", SSR, "\nR2: ", R2, " RMSE: ", RMSE)
## SST: 23700627 SSE: 13305440 SSR: 10395186
## R2: 0.4386039 RMSE: 41.39195
To examine the performance of the model, predictions are made. It is observed that the predicted value is close to the actual value.
actual <- test$price
lr_result <- data.frame(
"Actual" = actual,
"Predicted" = pred
)
head(lr_result, 20)
## Actual Predicted
## 12 85 99.41299
## 17 215 150.24233
## 25 60 87.75879
## 27 150 164.12913
## 34 89 96.48257
## 37 68 101.17062
## 48 151 170.28508
## 57 250 164.55484
## 68 105 90.62591
## 73 150 115.84731
## 74 145 86.66560
## 76 130 144.24181
## 77 94 108.16938
## 78 105 147.60883
## 93 175 157.78832
## 94 65 99.65684
## 97 125 117.54191
## 98 80 139.91626
## 99 100 99.87336
## 100 200 151.79968
From the plot below, it is observed that the datapoints are distributed around the red line. This shows that the linear regression model is able to predict the price per night of Airbnb.
lm_line = lm(Predicted ~ Actual, data = lr_result)
plot(x = lr_result$Actual, y = lr_result$Predicted,
main = "Actual and Predicted Price",
xlab = "Actual Price ($)",
ylab = "Predicted Price ($)")
abline(lm_line, col="red", lwd=3)
## Classification In this project, the predicted price in ranges is also one of the field of the team members’ interest. The price is discretized into categorical data with three levels: low, medium and high.
train$price <- cut(train$price, br = c(0, 50, 100, 500), labels = c("Affordable", "Medium", "Expensive"))
test$price <- cut(test$price, br = c(0, 50, 100, 500), labels = c("Affordable", "Medium", "Expensive"))
The new structure of the data is examined. It is observed that the price per night has turned into factor data.
str(train)
## 'data.frame': 31162 obs. of 13 variables:
## $ name : Factor w/ 38201 levels "'Fan'tastic",..: 10048 30380 36046 12549 15516 20003 6477 20041 12459 14151 ...
## $ host_name : Factor w/ 9945 levels "'Cil","-TheQueensCornerLot",..: 4369 4187 2566 5429 5183 1696 3118 8417 6012 1080 ...
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
## $ neighbourhood : Factor w/ 219 levels "Allerton","Arden Heights",..: 108 127 94 42 62 137 14 95 202 36 ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
## $ price : Factor w/ 3 levels "Affordable","Medium",..: 3 3 3 2 2 3 2 2 2 3 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ reviews_per_month : num 0.21 0.38 0 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
## - attr(*, "na.action")= 'omit' Named int [1:37] 361 2701 2855 3704 5746 5776 5976 6076 6270 6568 ...
## ..- attr(*, "names")= chr [1:37] "361" "2701" "2855" "3704" ...
The first classifcation model used is the decision tree model. The decision tree takes all the variables as features excpet for neighbourhood. The dicision tree model is plotted in the figure below:
treeModel <- ctree(price~latitude+longitude+neighbourhood_group+room_type+minimum_nights+number_of_reviews+reviews_per_month+calculated_host_listings_count+availability_365,
controls = ctree_control(mincriterion = 0.999, minsplit = 1000),
data = train)
plot(treeModel)
The decision tree model is tested against the train dataset. It is observed that the accuracy of decision tree model on the train dataset is 76.22%. This is consider as highly accurate considering the number of dataset used.
treePrediction <- predict(treeModel, train)
eval_model(treePrediction, train$price)
##
## Confusion matrix (absolute):
## Actual
## Prediction Affordable Expensive Medium Sum
## Affordable 0 0 0 0
## Expensive 56 13753 3845 17654
## Medium 1183 2326 9999 13508
## Sum 1239 16079 13844 31162
##
## Confusion matrix (relative):
## Actual
## Prediction Affordable Expensive Medium Sum
## Affordable 0.00 0.00 0.00 0.00
## Expensive 0.00 0.44 0.12 0.57
## Medium 0.04 0.07 0.32 0.43
## Sum 0.04 0.52 0.44 1.00
##
## Accuracy:
## 0.7622 (23752/31162)
##
## Error rate:
## 0.2378 (7410/31162)
##
## Error rate reduction (vs. base rate):
## 0.5087 (p-value < 2.2e-16)
The decision tree model is also tested against the test dataset. The accuracy of the decision tree model on the test dataset is 75.68% which is almost similar to that for train dataset.
treePrediction <- predict(treeModel, test)
eval_model(treePrediction, test$price)
##
## Confusion matrix (absolute):
## Actual
## Prediction Affordable Expensive Medium Sum
## Affordable 0 0 0 0
## Expensive 15 3344 979 4338
## Medium 277 618 2533 3428
## Sum 292 3962 3512 7766
##
## Confusion matrix (relative):
## Actual
## Prediction Affordable Expensive Medium Sum
## Affordable 0.00 0.00 0.00 0.00
## Expensive 0.00 0.43 0.13 0.56
## Medium 0.04 0.08 0.33 0.44
## Sum 0.04 0.51 0.45 1.00
##
## Accuracy:
## 0.7568 (5877/7766)
##
## Error rate:
## 0.2432 (1889/7766)
##
## Error rate reduction (vs. base rate):
## 0.5034 (p-value < 2.2e-16)
In this section, another classification model, Naive Bayes model is used. Before the model is trained, the data needs to be precessed further. The name, host_name, neightbourhood, latitude and logitude is dropped as it could not be used to train the Naive Bayes model.
train <- subset(train, select = -c(name, host_name, neighbourhood, latitude, longitude))
test <- subset(test, select = -c(name, host_name, neighbourhood, latitude, longitude))
summary(train)
## neighbourhood_group room_type price
## Bronx : 595 Entire home/apt:16748 Affordable: 1239
## Brooklyn :12926 Private room :14004 Medium :13844
## Manhattan :13977 Shared room : 410 Expensive :16079
## Queens : 3441
## Staten Island: 223
##
## minimum_nights number_of_reviews reviews_per_month
## Min. : 1.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 1.00 1st Qu.: 0.050
## Median : 2.000 Median : 6.00 Median : 0.390
## Mean : 6.669 Mean : 24.72 Mean : 1.111
## 3rd Qu.: 5.000 3rd Qu.: 26.00 3rd Qu.: 1.650
## Max. :999.000 Max. :488.00 Max. :17.820
## calculated_host_listings_count availability_365
## Min. : 1.000 Min. : 0
## 1st Qu.: 1.000 1st Qu.: 0
## Median : 1.000 Median : 36
## Mean : 6.082 Mean :107
## 3rd Qu.: 2.000 3rd Qu.:208
## Max. :327.000 Max. :365
summary(test)
## neighbourhood_group room_type price
## Bronx : 146 Entire home/apt:4115 Affordable: 292
## Brooklyn :3263 Private room :3550 Medium :3512
## Manhattan :3424 Shared room : 101 Expensive :3962
## Queens : 886
## Staten Island: 47
##
## minimum_nights number_of_reviews reviews_per_month
## Min. : 1.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 1.0 1st Qu.: 0.050
## Median : 2.000 Median : 6.0 Median : 0.400
## Mean : 7.088 Mean : 24.6 Mean : 1.133
## 3rd Qu.: 5.000 3rd Qu.: 25.0 3rd Qu.: 1.660
## Max. :1250.000 Max. :540.0 Max. :58.500
## calculated_host_listings_count availability_365
## Min. : 1.000 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 0.0
## Median : 1.000 Median : 41.0
## Mean : 6.332 Mean :108.5
## 3rd Qu.: 2.000 3rd Qu.:209.0
## Max. :327.000 Max. :365.0
The remaining data is converted into categorical data through discretization.
train$minimum_nights <- cut(train$minimum_nights, br = c(0, 100, 500, 1000, 1500), labels = c("Low", "Medium", "Moderate", "High"))
train$number_of_reviews <- cut(train$number_of_reviews, br = c(0, 100, 350, 700), labels = c("Low", "Medium", "High"))
train$reviews_per_month <- cut(train$reviews_per_month, br = c(0, 20, 40, 60), labels = c("Low", "Medium", "High"))
train$calculated_host_listings_count <- cut(train$calculated_host_listings_count, br = c(0, 50, 100, 400), labels = c("Low", "Medium", "High"))
train$availability_365 <- cut(train$availability_365, br = c(0, 60, 120, 180, 240, 300, 370), labels = c("2Mo", "4Mo", "6Mo", "8Mo", "10Mo", "12Mo"))
test$minimum_nights <- cut(test$minimum_nights, br = c(0, 100, 500, 1000, 1500), labels = c("Low", "Medium", "Moderate", "High"))
test$number_of_reviews <- cut(test$number_of_reviews, br = c(0, 100, 350, 700), labels = c("Low", "Medium", "High"))
test$reviews_per_month <- cut(test$reviews_per_month, br = c(0, 20, 40, 60), labels = c("Low", "Medium", "High"))
test$calculated_host_listings_count <- cut(test$calculated_host_listings_count, br = c(0, 50, 100, 400), labels = c("Low", "Medium", "High"))
test$availability_365 <- cut(test$availability_365, br = c(0, 60, 120, 180, 240, 300, 370), labels = c("2Mo", "4Mo", "6Mo", "8Mo", "10Mo", "12Mo"))
The final structure of the dataset is examined. It is observed that all the data is in categorical form.
str(train)
## 'data.frame': 31162 obs. of 8 variables:
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
## $ price : Factor w/ 3 levels "Affordable","Medium",..: 3 3 3 2 2 3 2 2 2 3 ...
## $ minimum_nights : Factor w/ 4 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ number_of_reviews : Factor w/ 3 levels "Low","Medium",..: 1 1 NA 2 1 1 1 3 2 2 ...
## $ reviews_per_month : Factor w/ 3 levels "Low","Medium",..: 1 1 NA 1 1 1 1 1 1 1 ...
## $ calculated_host_listings_count: Factor w/ 3 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ availability_365 : Factor w/ 6 levels "2Mo","4Mo","6Mo",..: 6 6 6 4 NA 3 NA 4 NA 4 ...
str(test)
## 'data.frame': 7766 obs. of 8 variables:
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 2 2 3 2 3 3 3 2 3 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 2 2 1 1 2 2 ...
## $ price : Factor w/ 3 levels "Affordable","Medium",..: 2 3 2 3 2 2 3 3 3 3 ...
## $ minimum_nights : Factor w/ 4 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ number_of_reviews : Factor w/ 3 levels "Low","Medium",..: 2 2 1 1 2 2 1 1 2 2 ...
## $ reviews_per_month : Factor w/ 3 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ calculated_host_listings_count: Factor w/ 3 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ availability_365 : Factor w/ 6 levels "2Mo","4Mo","6Mo",..: 1 6 2 2 6 2 6 4 6 5 ...
Naive Bayes model works based on the Bayes’ Theorem. It assumes that all the features used are independent of each other although this might not be true in the real world situation. However, Naive Bayes model is capable of generating output that are suprisingly accurate.
NaiveBayesModel <- naiveBayes(price ~., data = train)
NaiveBayesModel
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Affordable Medium Expensive
## 0.03975996 0.44425903 0.51598100
##
## Conditional probabilities:
## neighbourhood_group
## Y Bronx Brooklyn Manhattan Queens Staten Island
## Affordable 0.033091203 0.527845036 0.217110573 0.209039548 0.012913640
## Medium 0.028748916 0.471034383 0.339352788 0.150967928 0.009895984
## Expensive 0.009702096 0.357671497 0.560358231 0.067914671 0.004353505
##
## room_type
## Y Entire home/apt Private room Shared room
## Affordable 0.043583535 0.926553672 0.029862793
## Medium 0.253683906 0.725007223 0.021308870
## Expensive 0.819827104 0.175321848 0.004851048
##
## minimum_nights
## Y Low Medium Moderate High
## Affordable 9.959645e-01 4.035513e-03 0.000000e+00 0.000000e+00
## Medium 9.969662e-01 2.961572e-03 7.223346e-05 0.000000e+00
## Expensive 9.963306e-01 3.607190e-03 6.219292e-05 0.000000e+00
##
## number_of_reviews
## Y Low Medium High
## Affordable 0.941968912 0.055958549 0.002072539
## Medium 0.914842923 0.083127427 0.002029651
## Expensive 0.916498994 0.082185420 0.001315586
##
## reviews_per_month
## Y Low Medium High
## Affordable 1 0 0
## Medium 1 0 0
## Expensive 1 0 0
##
## calculated_host_listings_count
## Y Low Medium High
## Affordable 0.9991928975 0.0000000000 0.0008071025
## Medium 0.9964605605 0.0018058365 0.0017336030
## Expensive 0.9591392500 0.0209590149 0.0199017352
##
## availability_365
## Y 2Mo 4Mo 6Mo 8Mo 10Mo 12Mo
## Affordable 0.28405797 0.16376812 0.12028986 0.08550725 0.08840580 0.25797101
## Medium 0.28195798 0.16950346 0.13252729 0.07547834 0.09167743 0.24885550
## Expensive 0.27848955 0.14170118 0.11896734 0.09989404 0.12359118 0.23735671
The Naive Bayes model is tested against the train data and it generates output with 75.42% accuracy which is slightly lower than the accuracy of a decision tree. This is because decision tree considers more features as compared to Naive Bayes model.
NBPredictions <- predict(NaiveBayesModel, train)
eval_model(NBPredictions, train$price)
##
## Confusion matrix (absolute):
## Actual
## Prediction Affordable Expensive Medium Sum
## Affordable 0 1 1 2
## Expensive 55 13185 3526 16766
## Medium 1184 2893 10317 14394
## Sum 1239 16079 13844 31162
##
## Confusion matrix (relative):
## Actual
## Prediction Affordable Expensive Medium Sum
## Affordable 0.00 0.00 0.00 0.00
## Expensive 0.00 0.42 0.11 0.54
## Medium 0.04 0.09 0.33 0.46
## Sum 0.04 0.52 0.44 1.00
##
## Accuracy:
## 0.7542 (23502/31162)
##
## Error rate:
## 0.2458 (7660/31162)
##
## Error rate reduction (vs. base rate):
## 0.4921 (p-value < 2.2e-16)
The Naive Bayes model is also tested against the test dataset. The accuracy of Naive Bayes model on the test dataset is observed to be 75.77%. This is also similar to that in the train dataset.
NBPredictions <- predict(NaiveBayesModel, test)
eval_model(NBPredictions, test$price)
##
## Confusion matrix (absolute):
## Actual
## Prediction Affordable Expensive Medium Sum
## Affordable 0 0 0 0
## Expensive 21 3232 877 4130
## Medium 271 730 2635 3636
## Sum 292 3962 3512 7766
##
## Confusion matrix (relative):
## Actual
## Prediction Affordable Expensive Medium Sum
## Affordable 0.00 0.00 0.00 0.00
## Expensive 0.00 0.42 0.11 0.53
## Medium 0.03 0.09 0.34 0.47
## Sum 0.04 0.51 0.45 1.00
##
## Accuracy:
## 0.7555 (5867/7766)
##
## Error rate:
## 0.2445 (1899/7766)
##
## Error rate reduction (vs. base rate):
## 0.5008 (p-value < 2.2e-16)
In conclusion, price prediction can be a regression problem and also can be transformed into a classification problem. In regression, linear regression models is used which in classification, decision tree and Naive Bayes model are used. The linear regression model can be improved by transforming the price into logarithmic scale and increase the number of features used for training. The decision tree model also had a higher accuracy as compared to the Naive Bayes model as more features is used to train the decision tree model as compared to the Naive Bayes model.