Abstract:

New York City has long been one of the most popular travel destinations and one of the hottest markets for Airbnb. Airbnb is an online marketplace that connects people looking for accommodation (Airbnb guests) with people looking to rent out their properties (Airbnb hosts) on a short-term or long-term basis. The dataset contains real-world Airbnb data for New York City and describes the listing activity and metrics in NYC, NY for 2019.

Data Source

The dataset originates from Inside Airbnb (http://insideairbnb.com/get-the-data.html) and can be downloaded from Kaggle: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data. The data has 48,895 observations and 16 variables. The analysis was done in R. Source code can be found on my GitHub: https://github.com/saranggupta94/airbnb

Introduction

The dataset includes the listing activity and metrics for NYC in 2019. It contains listing prices across neighborhood groups (boroughs) and the neighborhoods within them. It also contains factors such as room type, reviews, and availability that can affect the price of a listing. The data has 48,895 observations and 16 attributes. Response variable: price per night.

Explanatory variables: id, name, host id, host name, neighbourhood group, neighbourhood, latitude, longitude, room type, minimum nights, number of reviews, last review, reviews per month, calculated host listings count, and availability over 365 days.

Problem Statement

One of the biggest challenges for companies is to maintain a positive customer experience while keeping the business model financially profitable for property owners. Which factors affect the price of an Airbnb listing in NYC? How are Airbnb listings distributed across NYC? Which neighborhood has the better average listing price?

OUR GOAL

Our goal is to build a statistical model that predicts listing prices effectively, so that the company can use it to suggest prices for future listings.

Dataset

Load the dataset

library(readr)
AB_NYC_2019 <- read_csv("C:/Users/dkkan/Desktop/Coding/Projects/AB_NYC_2019.csv")
airbnb<-AB_NYC_2019

Understanding the data

dim(airbnb)
## [1] 48895    16
str(airbnb)
## tibble [48,895 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
library(visdat)  # vis_miss() is provided by the visdat package
vis_miss(airbnb)

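First, we checked the summary statistics of the data and then the missing values. Missing values occur in four variables: name (16), host_name (21), and last_review and reviews_per_month (10,052 each, about 20.6% of the rows). Overall about 2.6% of all cells are missing, and reviews_per_month alone accounts for roughly 1.3% of all cells. We used mean imputation to fill the NA values in reviews_per_month and maintain the sample size.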
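
To make the missing-value check and the mean imputation concrete, here is a minimal sketch on a made-up data frame (the `toy` values are hypothetical and are not taken from the Airbnb data). It also shows one known side effect of mean imputation: the mean is preserved, but the spread of the variable shrinks.

```r
# Toy data frame standing in for the listings table (values are made up).
toy <- data.frame(
  price             = c(100, 80, 150, 60, 200),
  reviews_per_month = c(0.5, NA, 2.0, NA, 1.5)
)

# Share of missing values per column and over all cells.
missing_by_column <- colMeans(is.na(toy))
missing_overall   <- mean(is.na(toy))

# Mean imputation: replace each NA with the mean of the observed values.
observed_mean <- mean(toy$reviews_per_month, na.rm = TRUE)
imputed <- toy$reviews_per_month
imputed[is.na(imputed)] <- observed_mean

# The mean is unchanged, but the standard deviation gets smaller.
sd_before <- sd(toy$reviews_per_month, na.rm = TRUE)
sd_after  <- sd(imputed)
```

Because the imputed rows all receive the same value, any model that uses this variable sees less variation than the real data would have.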

Exploratory Data Analysis

Distribution of Airbnb Price

library(ggplot2)

cleanup =theme(panel.grid.major =element_blank(),
               panel.grid.minor =element_blank(), 
               panel.background =element_blank(), 
               axis.line.x =element_line(color ="black"),
               axis.line.y =element_line(color ="black"),
               legend.key =element_rect(fill ="white"),
               text =element_text(size =15))
#Distribution of price
#(par(mfrow = ...) was dropped: it only affects base graphics, not ggplot2)
ggplot(airbnb) + 
  cleanup+
  geom_histogram(aes(price),fill = 'orange',alpha = 0.85,binwidth = 15) + 
  theme_minimal(base_size = 13) + xlab("Price") + ylab("Frequency") + 
  ggtitle("The Distribution of Price") 

#Transformed distribution of Price
#after_stat(density) replaces the deprecated ..density.. notation
ggplot(airbnb, aes(price)) +
  cleanup+
  geom_histogram(bins = 30, aes(y = after_stat(density)), fill = "orange") + 
  geom_density(alpha = 0.2, fill = "orange") +ggtitle("Transformed distribution of price",
  subtitle = expression("With" ~'log'[10] ~ "transformation of x-axis")) + scale_x_log10()

The original distribution of price is highly skewed, which makes it hard to visualize together with other variables. We applied a logarithmic transformation to get a clearer view of the price distribution. The x-axis uses a base-10 logarithmic scale.
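
As an illustration of why the log scale helps, the sketch below compares the sample skewness of a right-skewed variable before and after a log10 transformation. The data are simulated from a log-normal distribution (an assumption made only for this example, not a claim about the Airbnb prices), and `skewness()` is a small helper defined here rather than a library function.

```r
set.seed(1)

# Simulated right-skewed "prices" (log-normal); not the Airbnb data.
price <- rlnorm(1000, meanlog = 4.7, sdlog = 0.7)

# Sample skewness: the mean of the cubed standardized values.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(price)         # clearly positive: long right tail
skew_log <- skewness(log10(price))  # close to zero: roughly symmetric
```

A skewness near zero after the transformation is what makes the histogram and the later linear model on log(price) easier to work with.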

Correlation Matrix

library(corrplot)

#Keep numeric columns and complete rows, then compute Spearman rank correlations
airbnb_cor <- airbnb[, sapply(airbnb, is.numeric)]
airbnb_cor <- airbnb_cor[complete.cases(airbnb_cor), ]
correlation_matrix <- cor(airbnb_cor, method = "spearman")
corrplot(correlation_matrix, method = "color")

The target variable, price, is positively correlated with the minimum number of nights and with availability over 365 days. Calculated host listings count is negatively correlated with price, and the plot above shows that number of reviews has the weakest correlation with price.
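
The correlation matrix above uses the Spearman method, which works on ranks rather than raw values. A small sketch on simulated data (not the Airbnb data) shows what that means: Spearman's coefficient is the Pearson correlation of the ranks, so a monotone but non-linear relationship still scores 1.

```r
set.seed(42)

# A monotone but strongly non-linear relationship.
x <- rexp(200)
y <- exp(x)

pearson  <- cor(x, y)                        # below 1: not a straight line
spearman <- cor(x, y, method = "spearman")   # 1: the ordering is identical

# Spearman is Pearson applied to the ranks of each variable.
spearman_by_hand <- cor(rank(x), rank(y))
```

This is a reasonable choice for skewed variables such as price, because a few extreme values cannot dominate a rank-based measure.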

Average Price by Room Type

library(forcats)  # fct_infreq()

#Average price by room type
#stat_summary() computes the mean within each bar; mean(price) inside aes()
#would return the overall mean and stack it once per listing
ggplot(airbnb, aes(x = room_type, y = price, fill = room_type))+
  stat_summary(fun = mean, geom = "bar")+theme_minimal()+
  cleanup+
  labs(title = "Average price by Room type",
       x = "Room Type", y = "Average Price") 

#Average Price each Neighbourhood_group
ggplot(airbnb, aes(x = fct_infreq(neighbourhood_group), y = price, fill = neighbourhood_group))+
  stat_summary(fun = mean, geom = "bar")+
  cleanup+
  labs(title = "Average price each Neighborhood Group",
       x = "Neighbourhood Group", y = "Average Price") +
  theme(legend.position = "right")

#Property types in Neighborhood Group 
ggplot(airbnb, aes(x = fct_infreq(neighbourhood_group), fill= room_type))+
  geom_bar()+
  cleanup+
  labs(title = "Property types in Neighbourhood_group ",
       x = "Neighbourhood Group", y = "No. of listings") +
  theme(legend.position = "right")

In the first graph above, entire homes/apartments have the highest average price, while private rooms and shared rooms have lower average prices per night. The second graph shows the average price within each neighborhood group, and the third graph shows how room types are distributed across neighborhood groups. We can see that Manhattan and Brooklyn have higher average prices per night than the other neighborhood groups.
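
An average price per category needs one mean per group rather than a single overall mean. The following sketch computes both on a made-up table (the `toy` rows are hypothetical), using only base R.

```r
# Made-up listings; prices are for illustration only.
toy <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Entire home/apt",
                "Shared room", "Private room", "Shared room"),
  price     = c(200, 80, 240, 40, 100, 60)
)

# One mean per room type.
avg_price <- aggregate(price ~ room_type, data = toy, FUN = mean)

# A single mean over all rows, which hides the group differences.
overall_mean <- mean(toy$price)
```

Here `avg_price` holds 220, 90 and 50 for the three room types, while `overall_mean` is 120 for every listing.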

Relation between price and number of reviews

ggplot(airbnb, aes(number_of_reviews, price)) +
  theme(axis.title = element_text(), axis.title.x = element_text()) +
  geom_point(aes(size = price), alpha = 0.05, color = "red") +
  cleanup+
  xlab("Number of reviews") +
  ylab("Price") +
  ggtitle("Relationship between number of reviews and price",
          subtitle = "The most expensive listings have few reviews (or none)")

The plot above shows that low-priced listings tend to have more reviews, while high-priced listings have few. This suggests a negative relationship between price and number of reviews.

Price vs two independent variables (Neighborhood Group & Room Types)

cleanup <- theme(panel.grid.major = element_blank(),
                 panel.grid.minor = element_blank(), 
                 panel.background = element_blank(), 
                 axis.line.x = element_line(color = 'black'), 
                 axis.line.y = element_line(color = 'black'), 
                 legend.key = element_rect(fill = 'white'), 
                 text = element_text(size = 15)) 


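#mean_cl_normal() below requires the Hmisc package to be installed
bar <- ggplot(airbnb, aes(neighbourhood_group, price, fill = room_type)) 
bar+stat_summary(fun = mean,  #fun.y is deprecated in recent ggplot2 versions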
                 geom = "bar", 
                 position = "dodge") + 
  stat_summary(fun.data = mean_cl_normal, 
               geom = "errorbar", 
               position = position_dodge(width = 0.90), 
               width = .25)+ 
  xlab("Neighbourhood Groups")+ 
  ylab("Rental Price")+ 
  cleanup+scale_fill_manual(name = "Property Types", 
                            labels = c("Entire House", "Private Room", "Shared Room"), 
                            values = c("deeppink3", "hotpink", "pink"))

The graph above shows the average price for each room type within each neighborhood group. Manhattan is the most expensive neighborhood group of all. Shared rooms have the lowest price per night compared to the other room types.
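
The error bars in the chart come from `mean_cl_normal`, which, to our understanding, draws a t-based 95% confidence interval around each group mean. The sketch below reproduces that kind of interval by hand on a few made-up prices and cross-checks it against `t.test()` from base R.

```r
# Made-up nightly prices for one group; for illustration only.
prices <- c(120, 95, 150, 80, 200, 110, 135, 90)

n      <- length(prices)
m      <- mean(prices)
se     <- sd(prices) / sqrt(n)       # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # two-sided 95% critical value

ci <- c(lower = m - t_crit * se, upper = m + t_crit * se)

# Cross-check with the interval reported by a one-sample t-test.
ci_check <- t.test(prices)$conf.int
```

Small groups get wide bars because both the standard error and the critical value grow as n shrinks, which is worth keeping in mind for the sparsely populated room types.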

Top 10 Neighbourhoods

library(dplyr)    # %>%, group_by(), top_n()
library(forcats)  # fct_reorder()
airbnb %>%
  group_by(neighbourhood) %>%
  dplyr::summarize(num_listings = n(), 
            borough = unique(neighbourhood_group)) %>%
  top_n(n = 10, wt = num_listings) %>%
  ggplot(aes(x = fct_reorder(neighbourhood, num_listings), 
             y = num_listings, fill = borough)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "bottom") +
  labs(title = "Top 10 neighborhoods by no. of listings",
       x = "Neighborhood", y = "No. of listings")

Price range within Neighborhood

library(ggmap)

#Bounding box around all listings, padded by 10% on each side
height <- max(airbnb$latitude) - min(airbnb$latitude)
width <- max(airbnb$longitude) - min(airbnb$longitude)
NYC_borders <- c(bottom  = min(airbnb$latitude)  - 0.1 * height, 
                 top     = max(airbnb$latitude)  + 0.1 * height,
                 left    = min(airbnb$longitude) - 0.1 * width,
                 right   = max(airbnb$longitude) + 0.1 * width)

#get_stamenmap() returns an object of class ggmap; newer ggmap releases may
#require get_stadiamap() with a Stadia Maps API key instead
map <- get_stamenmap(NYC_borders, zoom = 10, maptype = "toner-lite")
ggmap(map) +
  geom_point(data = airbnb, mapping = aes(x = longitude, y = latitude, 
                                               col = log(price))) +
  scale_color_distiller(palette = "RdYlGn", direction = 1)

The map above shows pricing across the city. Most areas fall in the low to medium price range (red to yellow). Green marks the most expensive listings, and only a few areas are green, which means that most listings in the city are not in the top price range.

Data Cleaning

table(is.na(airbnb))
## 
##  FALSE   TRUE 
## 762179  20141
apply(airbnb,2,function(x) sum(is.na(x))) 
##                             id                           name 
##                              0                             16 
##                        host_id                      host_name 
##                              0                             21 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                          10052                          10052 
## calculated_host_listings_count               availability_365 
##                              0                              0
#Mean imputation
airbnb$reviews_per_month[is.na(airbnb$reviews_per_month)] <- mean(airbnb$reviews_per_month, na.rm = TRUE)
summary(airbnb$reviews_per_month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.010   0.280   1.220   1.373   1.580  58.500
#Remove unwanted Variables
airbnb$last_review = NULL
airbnb$id = NULL
airbnb$host_id = NULL
airbnb$name = NULL
airbnb$host_name = NULL
airbnb$neighbourhood = NULL

Data Preparation for Modeling

#factor variable
airbnb$neighbourhood_group<- as.factor(airbnb$neighbourhood_group)
airbnb$room_type<- as.factor(airbnb$room_type)

factor_variables <- sapply(airbnb,is.factor)
airbnb_factor <- airbnb[,factor_variables]
factor.names <- names(airbnb_factor)

airbnb_factor <- as.data.frame(airbnb_factor)
airbnb_factor <- ade4::acm.disjonctif(airbnb_factor)  # one-hot encoding; acm.disjonctif() comes from the ade4 package
airbnb <- airbnb[,-which(names(airbnb) %in% factor.names)]

airbnb <- cbind(airbnb,airbnb_factor)

rm(airbnb_factor,factor_variables,factor.names)

nums <- unlist(lapply(airbnb, is.numeric))  
airbnb<-airbnb[,nums]

str(airbnb)
## 'data.frame':    48895 obs. of  16 variables:
##  $ latitude                         : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                        : num  -74 -74 -73.9 -74 -73.9 ...
##  $ price                            : num  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                   : num  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews                : num  9 45 0 270 9 74 49 430 118 160 ...
##  $ reviews_per_month                : num  0.21 0.38 1.37 4.64 0.1 ...
##  $ calculated_host_listings_count   : num  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365                 : num  365 355 365 194 0 129 0 220 0 188 ...
##  $ neighbourhood_group.Bronx        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ neighbourhood_group.Brooklyn     : num  1 0 0 1 0 0 1 0 0 0 ...
##  $ neighbourhood_group.Manhattan    : num  0 1 1 0 1 1 0 1 1 1 ...
##  $ neighbourhood_group.Queens       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ neighbourhood_group.Staten Island: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ room_type.Entire home/apt        : num  0 1 0 1 1 1 0 0 0 1 ...
##  $ room_type.Private room           : num  1 0 1 0 0 0 1 1 1 0 ...
##  $ room_type.Shared room            : num  0 0 0 0 0 0 0 0 0 0 ...
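
The encoding above keeps one indicator column for every level of each factor. Together with an intercept, such a full set is linearly dependent, which is why `lm()` later reports coefficients "not defined because of singularities". The sketch below reproduces the effect on a made-up table using base R's `model.matrix()` in place of `acm.disjonctif()`; the `toy` data are hypothetical.

```r
# Made-up listings; for illustration only.
toy <- data.frame(
  price     = c(200, 80, 240, 40, 100, 60, 210, 55),
  room_type = factor(c("Entire home/apt", "Private room", "Entire home/apt",
                       "Shared room", "Private room", "Shared room",
                       "Entire home/apt", "Shared room"))
)

# Full one-hot encoding: one column per level, no intercept column.
one_hot <- model.matrix(~ room_type - 1, data = toy)

# Every row sums to 1, so the indicator columns add up to the intercept.
row_sums <- rowSums(one_hot)

# With an intercept plus all indicators, one coefficient cannot be estimated.
fit_full    <- lm(toy$price ~ one_hot)
n_undefined <- sum(is.na(coef(fit_full)))

# Treatment coding drops one reference level and avoids the problem.
fit_ref <- lm(price ~ room_type, data = toy)
```

Passing factors directly to `lm()`, as in `fit_ref`, gives the same fitted values while reporting every coefficient relative to a reference level.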

Model Building Process

#Replace zero prices with 1 so that log(price) is defined
airbnb$price[airbnb$price == 0]<- 1
#Split the final data into train and test
set.seed(123)
smp_siz = floor(0.75*nrow(airbnb)) 
train_ind = sample(seq_len(nrow(airbnb)),size = smp_siz)
train =airbnb[train_ind,]
test=airbnb[-train_ind,]

#Model1
model1<- lm(log(price)~ minimum_nights+number_of_reviews+reviews_per_month+availability_365, data = train)
summary(model1)
## 
## Call:
## lm(formula = log(price) ~ minimum_nights + number_of_reviews + 
##     reviews_per_month + availability_365, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -0.5010 -0.0456  0.4457  4.5144 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.694e+00  5.706e-03 822.681  < 2e-16 ***
## minimum_nights     5.017e-04  1.902e-04   2.637  0.00836 ** 
## number_of_reviews -7.735e-04  9.643e-05  -8.021 1.08e-15 ***
## reviews_per_month -1.176e-02  2.910e-03  -4.040 5.36e-05 ***
## availability_365   5.576e-04  2.841e-05  19.625  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6934 on 36666 degrees of freedom
## Multiple R-squared:  0.01362,    Adjusted R-squared:  0.01352 
## F-statistic: 126.6 on 4 and 36666 DF,  p-value: < 2.2e-16
prediction1<-predict(model1, newdata = test)
prediction1<- exp(prediction1)
mse = mean(model1$residuals^2)
AIC(model1)
## [1] 77223.83
BIC(model1)
## [1] 77274.89
#Model2
model2<-lm(log(price)~.,data= train)
summary(model2)
## 
## Call:
## lm(formula = log(price) ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2710 -0.3135 -0.0521  0.2399  5.2809 
## 
## Coefficients: (2 not defined because of singularities)
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -1.968e+02  8.110e+00 -24.268  < 2e-16 ***
## latitude                            -6.482e-01  7.904e-02  -8.200 2.48e-16 ***
## longitude                           -3.058e+00  9.106e-02 -33.580  < 2e-16 ***
## minimum_nights                      -1.997e-03  1.381e-04 -14.456  < 2e-16 ***
## number_of_reviews                   -9.504e-04  6.994e-05 -13.588  < 2e-16 ***
## reviews_per_month                    1.126e-02  2.117e-03   5.322 1.04e-07 ***
## calculated_host_listings_count      -2.816e-04  8.543e-05  -3.296 0.000982 ***
## availability_365                     7.357e-04  2.131e-05  34.527  < 2e-16 ***
## neighbourhood_group.Bronx            8.586e-01  4.240e-02  20.250  < 2e-16 ***
## neighbourhood_group.Brooklyn         8.161e-01  3.365e-02  24.256  < 2e-16 ***
## neighbourhood_group.Manhattan        1.130e+00  3.412e-02  33.133  < 2e-16 ***
## neighbourhood_group.Queens           9.431e-01  3.766e-02  25.042  < 2e-16 ***
## `neighbourhood_group.Staten Island`         NA         NA      NA       NA    
## `room_type.Entire home/apt`          1.172e+00  1.752e-02  66.884  < 2e-16 ***
## `room_type.Private room`             4.215e-01  1.755e-02  24.022  < 2e-16 ***
## `room_type.Shared room`                     NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4989 on 36657 degrees of freedom
## Multiple R-squared:  0.4896, Adjusted R-squared:  0.4894 
## F-statistic:  2704 on 13 and 36657 DF,  p-value: < 2.2e-16
prediction2<-predict(model2, newdata = test)
## Warning in predict.lm(model2, newdata = test): prediction from a rank-deficient
## fit may be misleading
prediction2<- exp(prediction2)
mse = mean(model2$residuals^2)
AIC(model2)
## [1] 53084.04
BIC(model2)
## [1] 53211.69
#Model3
model3<-lm(log(price)~.-calculated_host_listings_count, data = train)
summary(model3)
## 
## Call:
## lm(formula = log(price) ~ . - calculated_host_listings_count, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2684 -0.3137 -0.0522  0.2394  5.2831 
## 
## Coefficients: (2 not defined because of singularities)
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -1.961e+02  8.108e+00 -24.183  < 2e-16 ***
## latitude                            -6.292e-01  7.885e-02  -7.980 1.50e-15 ***
## longitude                           -3.037e+00  9.086e-02 -33.429  < 2e-16 ***
## minimum_nights                      -2.036e-03  1.377e-04 -14.787  < 2e-16 ***
## number_of_reviews                   -9.274e-04  6.960e-05 -13.324  < 2e-16 ***
## reviews_per_month                    1.104e-02  2.116e-03   5.219 1.81e-07 ***
## availability_365                     7.198e-04  2.076e-05  34.678  < 2e-16 ***
## neighbourhood_group.Bronx            8.489e-01  4.231e-02  20.067  < 2e-16 ***
## neighbourhood_group.Brooklyn         8.100e-01  3.360e-02  24.108  < 2e-16 ***
## neighbourhood_group.Manhattan        1.121e+00  3.400e-02  32.968  < 2e-16 ***
## neighbourhood_group.Queens           9.346e-01  3.758e-02  24.871  < 2e-16 ***
## `neighbourhood_group.Staten Island`         NA         NA      NA       NA    
## `room_type.Entire home/apt`          1.170e+00  1.751e-02  66.798  < 2e-16 ***
## `room_type.Private room`             4.205e-01  1.755e-02  23.966  < 2e-16 ***
## `room_type.Shared room`                     NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.499 on 36658 degrees of freedom
## Multiple R-squared:  0.4894, Adjusted R-squared:  0.4892 
## F-statistic:  2928 on 12 and 36658 DF,  p-value: < 2.2e-16
prediction3<-predict(model3, newdata = test)
## Warning in predict.lm(model3, newdata = test): prediction from a rank-deficient
## fit may be misleading
prediction3<- exp(prediction3)
mse = mean(model3$residuals^2)
AIC(model3)
## [1] 53092.91
BIC(model3)
## [1] 53212.05

Interpretation of Models

First, we check whether the variables that stood out in the correlation plot have an impact on price per night. In the first model we regress log(price) on minimum nights, number of reviews, reviews per month, and availability over 365 days, and we get an adjusted R2 of 0.0135. All variables are significant in this model, but it explains very little of the variance, so we kept them and fit another model with additional variables.

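In the second model we included all variables and got a much better result, with an adjusted R2 of 0.49. Every estimated coefficient is significant, although calculated host listings count has the weakest effect, and two indicator columns (Staten Island and Shared room) are not estimated because of singularities. So we fit one more model without calculated host listings count.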

The third model includes all variables except calculated host listings count. The fit is moderate: the model accounts for about 49% of the variance (adjusted R2 of 0.4892). We tried further models, but they fit worse, so we stopped at this one. We also compared AIC and BIC, where a lower value indicates a better trade-off between fit and complexity. Model 3 has a far lower AIC than model 1 (53,093 vs. 77,224) and is practically tied with model 2 (AIC of 53,084, and a BIC of about 53,212 for both) while using one predictor fewer. Based on these results, we selected model 3 for our prediction.
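
For readers who want to see where the AIC and BIC numbers come from, here is a small sketch on simulated data (not the Airbnb models). It rebuilds both criteria from the log-likelihood and also shows why `step()` prints values on a different scale: `step()` relies on `extractAIC()`, which for linear models omits a constant that depends only on the sample size. With n = 36,671 training rows that constant is about 104,070, which matches the gap between AIC(model3) = 53092.91 and the -50976.88 printed by `step()`.

```r
set.seed(7)

# Simulated regression data; for illustration only.
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

ll <- logLik(fit)
k  <- attr(ll, "df")   # regression coefficients plus the error variance

aic_by_hand <- -2 * as.numeric(ll) + 2 * k
bic_by_hand <- -2 * as.numeric(ll) + log(n) * k

# Scale used by step(): n * log(RSS / n) + 2 * (number of coefficients).
aic_step_scale <- extractAIC(fit)[2]

# Constant difference between the two scales for a linear model.
offset <- n * (log(2 * pi) + 1) + 2
```

Since the offset is the same for every model fitted to the same rows, both scales rank the candidate models identically.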

Prediction on test dataset

model3_step<- step(model3)
## Start:  AIC=-50976.88
## log(price) ~ (latitude + longitude + minimum_nights + number_of_reviews + 
##     reviews_per_month + calculated_host_listings_count + availability_365 + 
##     neighbourhood_group.Bronx + neighbourhood_group.Brooklyn + 
##     neighbourhood_group.Manhattan + neighbourhood_group.Queens + 
##     `neighbourhood_group.Staten Island` + `room_type.Entire home/apt` + 
##     `room_type.Private room` + `room_type.Shared room`) - calculated_host_listings_count
## 
## 
## Step:  AIC=-50976.88
## log(price) ~ latitude + longitude + minimum_nights + number_of_reviews + 
##     reviews_per_month + availability_365 + neighbourhood_group.Bronx + 
##     neighbourhood_group.Brooklyn + neighbourhood_group.Manhattan + 
##     neighbourhood_group.Queens + `neighbourhood_group.Staten Island` + 
##     `room_type.Entire home/apt` + `room_type.Private room`
## 
## 
## Step:  AIC=-50976.88
## log(price) ~ latitude + longitude + minimum_nights + number_of_reviews + 
##     reviews_per_month + availability_365 + neighbourhood_group.Bronx + 
##     neighbourhood_group.Brooklyn + neighbourhood_group.Manhattan + 
##     neighbourhood_group.Queens + `room_type.Entire home/apt` + 
##     `room_type.Private room`
## 
##                                 Df Sum of Sq     RSS    AIC
## <none>                                        9126.3 -50977
## - reviews_per_month              1      6.78  9133.1 -50952
## - latitude                       1     15.85  9142.2 -50915
## - number_of_reviews              1     44.20  9170.5 -50802
## - minimum_nights                 1     54.44  9180.8 -50761
## - neighbourhood_group.Bronx      1    100.25  9226.6 -50578
## - `room_type.Private room`       1    143.00  9269.3 -50409
## - neighbourhood_group.Brooklyn   1    144.69  9271.0 -50402
## - neighbourhood_group.Queens     1    154.00  9280.3 -50365
## - neighbourhood_group.Manhattan  1    270.59  9396.9 -49907
## - longitude                      1    278.21  9404.5 -49878
## - availability_365               1    299.38  9425.7 -49795
## - `room_type.Entire home/apt`    1   1110.85 10237.2 -46767
plot(log(test$price), main = "Linear Model", ylab = "log(price), test set", pch = 20)
points(predict(model3_step, newdata = test), col = "red", pch = 20)

We predict log(price) from the attributes of the test dataset and plot the predictions (red) against the true values. The graph shows that the spread of the predicted values is similar to that of the observed response. Still, we cannot rely on this model alone, because the data is limited: it does not contain more information about the Airbnb host, the property description, etc., which could help predict the price per night more accurately.
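
A plot gives a visual impression, and a held-out error measure can back it up with a number. The sketch below computes the root mean squared error on the log scale for two competing models, using simulated data and the same 75/25 split idea as above. The variable names `x1`, `x2` and the helper `rmse_log()` are made up for this example and are not part of the analysis code.

```r
set.seed(123)

# Simulated prices, log-linear in x1 and x2 (not the Airbnb data).
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
price <- exp(4.5 + 0.8 * x1 + 0.2 * x2 + rnorm(n, sd = 0.3))
sim <- data.frame(price, x1, x2)

# 75% of the rows for training, the rest held out for testing.
train_ind <- sample(seq_len(n), size = floor(0.75 * n))
train <- sim[train_ind, ]
test  <- sim[-train_ind, ]

small <- lm(log(price) ~ x2, data = train)
full  <- lm(log(price) ~ x1 + x2, data = train)

# Root mean squared error on the log scale, using held-out rows only.
rmse_log <- function(model, newdata) {
  sqrt(mean((log(newdata$price) - predict(model, newdata = newdata))^2))
}

rmse_small <- rmse_log(small, test)
rmse_full  <- rmse_log(full, test)
```

Comparing `rmse_small` and `rmse_full` on the test rows measures how well each model generalizes, which training-set residuals and information criteria only approximate.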

CONCLUSIONS

This study predicts the price per night for NYC Airbnb listings. We explored the variables affecting listing price in the exploratory analysis. The predictors kept in the final model are all significant, while calculated host listings count and last review were left out, but these variables alone are not enough to explain price fully. We gained the following insights from the data, which answer our research questions:

• Neighborhood group and room type are important factors.

• Manhattan has the most expensive properties compared to the other neighborhood groups.

• Brooklyn is a good choice in terms of average price and number of reviews.

• Expensive properties have fewer reviews than low or average priced properties.

• Higher availability did not guarantee a higher average listing price.

Using multiple linear regression, we see that the model fits the training data moderately well (adjusted R2 of about 0.49), but the predictions of price per night still carry substantial error, which should be addressed in future work.

FUTURE WORK

• There are several limitations in this study. Future studies should explore the influence of other factors like season, property insights, crime rate of neighborhood, and monthly Airbnb host revenue.

• We will try to analyze the names of the listings with text mining techniques, so that we can find the best words to use in a listing.

• We will explore this data further with more advanced predictive modeling techniques such as Decision Tree, Random Forest, and Support Vector Machine to improve predictive power.