1. Synopsis

The Data was collected from Kaggle. In this project we present to you exploratory data analysis, visualizations and modelling of New York Airbnb data. Airbnb, Inc.is an online marketplace for arranging or offering lodging, primarily homestays, or tourism experiences. The company does not own any of the real estate listings, nor does it host events; it acts as a broker, receiving commissions from each booking. Revenue for Airbnb comes from its guests and hosts: hosts are charged 3% of the value of the booking, while guests are charged 6%-12% per the nature of the booking. Airbnb market is quite blooming in New York city (NYC) which had more than 48,000 listings as of August-2019 calendar year.

We focus on New York City’s data as we wish to perform an in-depth analysis on one of the most densely populated cities in the world.

In this project, we also try to predict the factors that affect the pricing of the airbnbs around New York. This includes creating different kind of models, model specification, transformation, variable selection and many more.

We carried out the project in the following steps:

  • Data Cleaning and Preparation
  • Data Visualization
  • Modelling and Model Checking
  • Finalising the Model
  • Prediction using the Final Model.

The original Data Set can be found here —> New York AirBnB Data 2019


2. Packages Required

We begin by loading the packages that will be required throughout the course of our analysis.

#install DAAG from archived source
if(!is.element("DAAG", installed.packages()[,1])){
  packageurl <- "https://cran.r-project.org/src/contrib/Archive/DAAG/DAAG_1.22.tar.gz"
  install.packages("latticeExtra")
  install.packages(packageurl, repos=NULL, type="source")
}


library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(lubridate)
library(readxl)
library(highcharter)
library(lubridate)
library(scales)
library(RColorBrewer)
library(wesanderson)
library(plotly)
library(shiny)
library(readxl)
library(readr)
library(choroplethr)
library(choroplethrMaps)
library(GGally)
library(zoo)
library(scales)
library(ggmap)
library(stringr)
library(gridExtra)
library(caret)
library(treemap)
library(psych)
library(DAAG)
library(leaps)
library(corrplot)
library(glmnet)
library(boot)

The packages used can be found in a tabular format below. The table also describes the usage of each packages that we are using.

Package Description
library(tidyr) For changing the layout of your data sets, to convert data into the tidy format
library(DT) For HTML display of data
library(ggplot2) For customizable graphical representation
library(dplyr) For data manipulation
library(tidyverse) Collection of R packages designed for data science that works harmoniously with other packages
library(kableExtra) To display table in a fancy way
library(lubridate) Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not
library(readxl) The readxl package makes it easy to get data out of Excel and into R
library(highcharter) Highcharter is a R wrapper for Highcharts javascript libray and its modules
library(lubridate) Lubridate is an R package that makes it easier to work with dates and times.
library(scales) The idea of the scales package is to implement scales in a way that is graphics system agnostic
library(RColorBrewer) RColorBrewer is an R package that allows users to create colourful graphs with pre-made color palettes that visualize data in a clear and distinguishable manner
library(wesanderson) A Wes Anderson is color palette for R
library(plotly) Plotly’s R graphing library makes interactive, publication-quality graphs
library(shiny) Shiny is an R package that makes it easy to build interactive web apps straight from R
library(readxl) The readxl package makes it easy to get data out of Excel and into R.
library(readr) The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf).
library(choroplethr) Simplify the Creation of Choropleth Maps in R
library(choroplethrMaps) choroplethrMaps is an R package that contains 3 maps used by the choroplethr package.
library(GGally) ggplot2 is a plotting system based on the grammar of graphics.
library(zoo) S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations)
library(scales) The scales packages provides the internal scaling infrastructure used by ggplot2, and gives you tools to override the default breaks, labels, transformations and palettes.
library(ggmap) A collection of functions to visualize spatial data and models on top of static maps from various online sources (e.g Google Maps and Stamen Maps).
library(stringr) A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package.
library(gridExtra) Provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
library(caret) The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems.
library(treemap) A treemap is a space-filling visualization of hierarchical structures. This package offers great flexibility to draw treemaps.
library(psych) A general purpose toolbox for personality, psychometric theory and experimental psychology.
library(DAAG) Data Analysis and Graphics Data and Functions
library(leaps) Regression subset selection, including exhaustive search.
library(corrplot) The corrplot package is a graphical display of a correlation matrix, confidence interval
library(glmnet) Extremely efficient procedures for fitting the entire lasso or elastic-net regularization path for linear regression, logistic and multinomial regression models, Poisson regression and the Cox model.

3. Data Preparation

After loading the required packages, we move on to the data perparation step which would start by loading the data into our R-studio.

3.1 Loading and Reading the Data

Summary and Glimpse of the Data

summary(airbnb_data)
##        id                                         name      
##  Min.   :    2539   Hillside Hotel                  :   18  
##  1st Qu.: 9471945   Home away from home             :   17  
##  Median :19677284                                   :   16  
##  Mean   :19017143   New york Multi-unit building    :   16  
##  3rd Qu.:29152178   Brooklyn Apartment              :   12  
##  Max.   :36487245   Loft Suite @ The Box House Hotel:   11  
##                     (Other)                         :48805  
##     host_id                 host_name        neighbourhood_group
##  Min.   :     2438   Michael     :  417   Bronx        : 1091   
##  1st Qu.:  7822033   David       :  403   Brooklyn     :20104   
##  Median : 30793816   Sonder (NYC):  327   Manhattan    :21661   
##  Mean   : 67620011   John        :  294   Queens       : 5666   
##  3rd Qu.:107434423   Alex        :  279   Staten Island:  373   
##  Max.   :274321313   Blueground  :  232                         
##                      (Other)     :46943                         
##             neighbourhood      latitude       longitude     
##  Williamsburg      : 3920   Min.   :40.50   Min.   :-74.24  
##  Bedford-Stuyvesant: 3714   1st Qu.:40.69   1st Qu.:-73.98  
##  Harlem            : 2658   Median :40.72   Median :-73.96  
##  Bushwick          : 2465   Mean   :40.73   Mean   :-73.95  
##  Upper West Side   : 1971   3rd Qu.:40.76   3rd Qu.:-73.94  
##  Hell's Kitchen    : 1958   Max.   :40.91   Max.   :-73.71  
##  (Other)           :32209                                   
##            room_type         price         minimum_nights   
##  Entire home/apt:25409   Min.   :    0.0   Min.   :   1.00  
##  Private room   :22326   1st Qu.:   69.0   1st Qu.:   1.00  
##  Shared room    : 1160   Median :  106.0   Median :   3.00  
##                          Mean   :  152.7   Mean   :   7.03  
##                          3rd Qu.:  175.0   3rd Qu.:   5.00  
##                          Max.   :10000.0   Max.   :1250.00  
##                                                             
##  number_of_reviews     last_review    reviews_per_month
##  Min.   :  0.00              :10052   Min.   : 0.010   
##  1st Qu.:  1.00    2019-06-23: 1413   1st Qu.: 0.190   
##  Median :  5.00    2019-07-01: 1359   Median : 0.720   
##  Mean   : 23.27    2019-06-30: 1341   Mean   : 1.373   
##  3rd Qu.: 24.00    2019-06-24:  875   3rd Qu.: 2.020   
##  Max.   :629.00    2019-07-07:  718   Max.   :58.500   
##                    (Other)   :33137   NA's   :10052    
##  calculated_host_listings_count availability_365
##  Min.   :  1.000                Min.   :  0.0   
##  1st Qu.:  1.000                1st Qu.:  0.0   
##  Median :  1.000                Median : 45.0   
##  Mean   :  7.144                Mean   :112.8   
##  3rd Qu.:  2.000                3rd Qu.:227.0   
##  Max.   :327.000                Max.   :365.0   
## 
glimpse(airbnb_data)
## Observations: 48,895
## Variables: 16
## $ id                             <int> 2539, 2595, 3647, 3831, 5022, 5...
## $ name                           <fct> "Clean & quiet apt home by the ...
## $ host_id                        <int> 2787, 2845, 4632, 4869, 7192, 7...
## $ host_name                      <fct> John, Jennifer, Elisabeth, Lisa...
## $ neighbourhood_group            <fct> Brooklyn, Manhattan, Manhattan,...
## $ neighbourhood                  <fct> Kensington, Midtown, Harlem, Cl...
## $ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 4...
## $ longitude                      <dbl> -73.97237, -73.98377, -73.94190...
## $ room_type                      <fct> Private room, Entire home/apt, ...
## $ price                          <int> 149, 225, 150, 89, 80, 200, 60,...
## $ minimum_nights                 <int> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1,...
## $ number_of_reviews              <int> 9, 45, 0, 270, 9, 74, 49, 430, ...
## $ last_review                    <fct> 2018-10-19, 2019-05-21, , 2019-...
## $ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.5...
## $ calculated_host_listings_count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1...
## $ availability_365               <int> 365, 355, 365, 194, 0, 129, 0, ...

Observations

  1. The dataset was imported into R studio and it was found to have 48895 observations and 16 variables.
  2. Data type of all categorical variables was ‘Character’, thus it was changed to categorical.
  3. Summary statistics reveal ‘Entire home or apartment’ is the most common type of Airbnb followed by a ‘private room’. ‘Shared rooms’ are the least common. Amongst Neighbourhood groups, Manhattan has the highest number of Air BnB’s followed by Brooklyn. Staten Island has the least number. It also shows that minimum price is 0, mean price is 152.7 and maximum price is 10,000 which is quite unusal, indicating potential ouliers.

Checking for NA

Next we will check for missing values and deal with them accordingly.

summary(is.na(airbnb_data))
##      id             name          host_id        host_name      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:48895     FALSE:48895     FALSE:48895     FALSE:48895    
##                                                                 
##  neighbourhood_group neighbourhood    latitude       longitude      
##  Mode :logical       Mode :logical   Mode :logical   Mode :logical  
##  FALSE:48895         FALSE:48895     FALSE:48895     FALSE:48895    
##                                                                     
##  room_type         price         minimum_nights  number_of_reviews
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical    
##  FALSE:48895     FALSE:48895     FALSE:48895     FALSE:48895      
##                                                                   
##  last_review     reviews_per_month calculated_host_listings_count
##  Mode :logical   Mode :logical     Mode :logical                 
##  FALSE:48895     FALSE:38843       FALSE:48895                   
##                  TRUE :10052                                     
##  availability_365
##  Mode :logical   
##  FALSE:48895     
## 

We visualise the number of missings in each variable using naniar gg_miss_var

naniar::gg_miss_var(airbnb_data) +
  theme_minimal()+
  labs(y = "Look at all the Missing Values") 

Observations

Analysis for missing values shows that there are no missing values in our dataset except for the variable ‘review per month’ which has 10052 missing values. Since we are planning to keep this variable in our study, we will decide later how we are going to deal with these missing values.


3.2 Data Cleaning

The Original Dataset New York AirBnB Data 2019 we collected from Kaggle did not need much cleaning. The Data set is almost clean with only one coloumn - reviews_per_month having 10052 N/A values. Deleting all these observatio now would not be a good solution as it is a high number and would result in biased results. Apart from that, all the variables were properly named and was already in proper case, hence variable need not be renamed. All the character variables were changed to factor variables.


3.3 Cleaned Dataset

The final cleaned dataset can be found below in an interactive table.

datatable(head(airbnb_data, 20), class = 'cell-border stripe')

3.4 Summary of Variables

#Reading the variable summary excel File
var_sum <- read_excel("Excel Files/variable_summary.xlsx")

kable(var_sum) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T, )
Sr.No. Variable Class Description
1 id int unique listing ID
2 name character name of the listing
3 host_id int host ID
4 host_name character name of the host
5 neighbourhood_group factor location in NYC
6 neighbourhood factor area in NYC
7 latitude int latitude coordinates
8 longitude int longitude coordinates
9 room_type factor listing space type
10 price int price in dollars
11 minimum_nights int amount of minimum nights
12 number_of_reviews int number of reviews
13 last_review date latest review
14 reviews_per_month int number of reviews per month
15 calculated_host_listings_count int amount of listing per host
16 availability_365 int number of days when listing is available for booking

4. Exploratory Data Analysis

Which Type of Listings are there in the Neighbourhood?

We do an analysis to find out the type of listings that are common to a particular neighbourhood.

property_df <-  airbnb_data %>% 
  group_by(neighbourhood_group, room_type) %>% 
  summarize(Freq = n())

# propertydf <- propertydf %>% 
#   filter(property_type %in% c("Apartment","House","Condominium","Townhouse", "Loft"))

total_property <-  airbnb_data %>% 
  filter(room_type %in% c("Private room","Entire home/apt","Entire home/apt")) %>% 
  group_by(neighbourhood_group) %>% 
  summarize(sum = n())

property_ratio <- merge (property_df, total_property, by="neighbourhood_group")

property_ratio <- property_ratio %>% 
  mutate(ratio = Freq/sum)

ggplot(property_ratio, aes(x=neighbourhood_group, y = ratio, fill = room_type)) +
  geom_bar(position = "dodge", stat="identity") + 
  xlab("Borough") + ylab ("Count") +
  scale_fill_discrete(name = "Property Type") + 
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Which types of Listings are there in NYC?",
          subtitle = "Map showing Count of Listing Type by Borough ") +
          theme(plot.title = element_text(face = "bold", size = 14) ) +
          theme(plot.subtitle = element_text(face = "bold", color = "grey35", hjust = 0.5)) +
          theme(plot.caption = element_text(color = "grey68"))+scale_color_gradient(low="#d3cbcb", high="#852eaa")+
          scale_fill_manual("Property Type", values=c("#e06f69","#357b8a", "#7db5b8", "#59c6f3", "#f6c458")) +
          xlab("Neighborhood") + ylab("Percentage")

Observations

  1. Private room is the most common listing type in all neigbourhoods except Manhattan where Entire Home/apartment is the most common type.
  2. Shared room is the least common in all neigbourhoods.

Mean Price Comparison for each Neighbourhood Group

We want to see the average prices of listing in every neighbourhood.

airbnb_data %>% 
  filter(!(is.na(neighbourhood_group))) %>% 
  filter(!(neighbourhood_group == "Unknown")) %>% 
  group_by(neighbourhood_group) %>% 
  summarise(mean_price = mean(price, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(neighbourhood_group, mean_price), y = mean_price, fill = neighbourhood_group)) +
  geom_col(stat ="identity", color = "black", fill="#357b8a") +
  coord_flip() +
  theme_gray() +
  labs(x = "Neighbourhood Group", y = "Price") +
  geom_text(aes(label = round(mean_price,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Mean Price comparison for each Neighbourhood Group", subtitle = "Price vs Neighbourhood Group") + 
  xlab("Neighbourhood Group") + 
  ylab("Mean Price") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Observations

  1. Average price of listings is the highest for Manhattan (196.88 USD) followed by Brookyln (124.28). One possible reason for high average price in Manhattan could be that whole apartments/home are the most common type of listings there.
  2. Bronx has the cheapest listings with an average price of 87.5 USD.

Mean Price Comparison for each Room Type

airbnb_data %>% 
  filter(!(is.na(room_type))) %>% 
  filter(!(room_type == "Unknown")) %>% 
  group_by(room_type) %>% 
  summarise(mean_price = mean(price, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(room_type, mean_price), y = mean_price, fill = room_type)) +
  geom_col(stat ="identity", color = "black", fill="#357b8a") +
  coord_flip() +
  theme_gray() +
  labs(x = "Room Type", y = "Price") +
  geom_text(aes(label = round(mean_price,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Mean Price comparison with all Room Types", subtitle = "Price vs Room Type") + 
  xlab("Room Type") + 
  ylab("Mean Price") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Observations

  1. Average price is the highest for Entire home.apartment followed by private room and shared room which is quite expected.

5. Modelling

5.1 Data Splitting

Prior to modelling, we will split the data into Training set and Testing set so that we can use the testing set to validate our model. As it is a good practice, we are splitting the dataset in to parts in the ratio of 70:30. Training set will be 70% percent of the original data. We will use the test dataset in the future for testing and prediction purposes. Objects with price equal to 0 will be ommited since price can’t be 0 (faulty records). In order to remove the outliers, we are filtering the airbnb data by removing the extreme values of price from both sides (10% from both the end).They would make predictive models significantly weaker.

airbnb_filtered_data <- airbnb_data %>% 
  filter(price < quantile(airbnb_data$price, 0.9) & price > quantile(airbnb_data$price, 0.1)) %>% 
  drop_na()
set.seed(123456)

airbnb_filtered_data <- airbnb_filtered_data %>% mutate(id = row_number())

airbnb_train <- airbnb_filtered_data %>% sample_frac(.7) %>% filter(price > 0)

airbnb_test  <- anti_join(airbnb_filtered_data, airbnb_train, by = 'id') %>% filter(price > 0)
# sanity check
nrow(airbnb_train) + nrow(airbnb_test) == nrow(airbnb_filtered_data %>% filter(price > 0))
## [1] TRUE

Observations

  1. The resulting training dataset has 34,221 observations and testing dataset has 14.663 observations.
  2. Sanity check confirms that that after removing the observations with price 0 and splitting the dataset, the sum of observations in test and train dataset is equal to the total number of observations in the original dataset.

In our model, we won’t be considering the below variables and the reasons for the same are provided in the summary below.

Summary of Variables Excluded:

  • id: Unique Identifier, so not relevant to the study
  • name: Identifier, so not relevant to the study
  • host_id: Unique Identifier, so not relevant to the study
  • host_name: Identifier, so not relevant to the study
  • neighbourhood: Redundant variable as we are already taking neighbourhood_group in our study
  • last_review: categorical variable with a high number of categories, will unnecessarity complicate our model

Hence, we try to predict the price of the airbnbs using the remaining covariates:

  • neighbourhood_group
  • latitude
  • longitude
  • room_type
  • minimum_nights
  • number_of_reviews
  • reviews_per_month
  • calculated_host_listings_count
  • availability_365

5.2 Model Building Process

5.2.1 1st Linear Regression Model

We will build our initial linear model using all the variables that we have selected for the study.

airbnb_model_1 <- lm (price ~ neighbourhood_group + latitude + longitude + room_type + minimum_nights  + number_of_reviews + reviews_per_month + calculated_host_listings_count +
                        availability_365, data = airbnb_train)

summary_model_1 <- summary(airbnb_model_1)
mse_1 <- summary_model_1$sigma^2
r_sq_1 <- summary_model_1$r.squared
adj_r_sq_1 <- summary_model_1$adj.r.squared

summary_model_1

Model 1 Observations:

  • MSE: 1651
  • R-squared: 0.446
  • Adjusted R-squared: 0.446

Plot of the 1st Linear Regression Model

par(mfrow=c(2,2)) 
plot(airbnb_model_1)

Observations

  1. Residuals vs fitted values shows that the dots are not evenly distributed around zero and do not show a constant variance around X. This means that linearity and equal variance assumptions are not satisifed.
  2. QQ plot shows a 45 degree line meaning that Nomrality assumptions are met.

5.2.2 Variable Selection Method

We will use both Subset regression method as well as step-wise regression for variable selection and see how the models differ in from each other.

Best Subset Regression Method
best_fit_model <- regsubsets (price ~ neighbourhood_group + latitude + longitude + room_type + minimum_nights  + number_of_reviews + 
                                 reviews_per_month +calculated_host_listings_count + availability_365, data = airbnb_train,  nbest = 2, nvmax = 9)

summary(best_fit_model)
plot(best_fit_model, scale="bic")

Observations

  1. Using Section method with BIC, we find that the top 4 models have the same and lowest BIC. We will pick our covariates based on the model which has the highest adjusted R-squared and lowest MSPE.
  2. On doing the analysis in our test script, we find that the top most model is one with lowest MSPE and highest adjusted R-Squared.
  3. Based on the results, the covariates that we have to select here are:
  • neighbourhood_group
  • latitude
  • longitude
  • minimum_nights
  • room_type
  • availablity_365
  • calculated_host_listings_count

Next we will create a model using the the above mentioned covariates

airbnb_model_3 <- lm (price ~ room_type + neighbourhood_group  + latitude + longitude  + minimum_nights +
                        availability_365 + calculated_host_listings_count , data = airbnb_train, nbest = 2, nvmax = 9)



summary_model_3 <- summary(airbnb_model_3)
mse_3 <- summary_model_3$sigma^2
r_sq_3 <- summary_model_3$r.squared
adj_r_sq_3 <- summary_model_3$adj.r.squared

summary_model_3

Model 3 Observations:

  • MSE: 1653
  • R-squared: 0.445
  • Adjusted R-squared: 0.445

Stepwise Regression with AIC/BIC
Stepwise selection using AIC (Direction = “both”)
null <- lm(price~1, data = airbnb_train)
full <- lm(price ~ neighbourhood_group + latitude + longitude + room_type + minimum_nights  + number_of_reviews + 
                                 reviews_per_month +calculated_host_listings_count + availability_365, data = airbnb_train)

step(null, scope =list(lower=null, upper= full), direction = "both")
## Start:  AIC=176704.4
## price ~ 1
## 
##                                  Df Sum of Sq      RSS    AIC
## + room_type                       2  22890604 42906354 167264
## + longitude                       1   5096488 60700471 174925
## + neighbourhood_group             4   4998958 60798000 174967
## + calculated_host_listings_count  1   1507974 64288984 176194
## + reviews_per_month               1    262173 65534786 176618
## + number_of_reviews               1     80578 65716380 176679
## + availability_365                1     75390 65721569 176681
## + minimum_nights                  1     45410 65751548 176691
## + latitude                        1     44731 65752227 176691
## <none>                                        65796958 176704
## 
## Step:  AIC=167263.8
## price ~ room_type
## 
##                                  Df Sum of Sq      RSS    AIC
## + neighbourhood_group             4   3918120 38988234 165156
## + longitude                       1   2996630 39909724 165666
## + calculated_host_listings_count  1    745452 42160902 166879
## + availability_365                1    297661 42608694 167112
## + latitude                        1    258346 42648008 167132
## + reviews_per_month               1     48080 42858274 167241
## + minimum_nights                  1     42643 42863712 167244
## + number_of_reviews               1     18543 42887811 167256
## <none>                                        42906354 167264
## - room_type                       2  22890604 65796958 176704
## 
## Step:  AIC=165156.4
## price ~ room_type + neighbourhood_group
## 
##                                  Df Sum of Sq      RSS    AIC
## + longitude                       1   1260466 37727769 164432
## + availability_365                1    545720 38442514 164847
## + latitude                        1    518694 38469540 164863
## + calculated_host_listings_count  1    420175 38568060 164919
## + minimum_nights                  1    117965 38870269 165091
## + number_of_reviews               1      3875 38984359 165156
## <none>                                        38988234 165156
## + reviews_per_month               1       952 38987282 165158
## - neighbourhood_group             4   3918120 42906354 167264
## - room_type                       2  21809766 60798000 174967
## 
## Step:  AIC=164432.5
## price ~ room_type + neighbourhood_group + longitude
## 
##                                  Df Sum of Sq      RSS    AIC
## + availability_365                1    686845 37040923 164029
## + calculated_host_listings_count  1    325486 37402282 164243
## + latitude                        1    238175 37489594 164295
## + minimum_nights                  1    142868 37584901 164351
## + reviews_per_month               1      8272 37719497 164430
## <none>                                        37727769 164432
## + number_of_reviews               1       228 37727540 164434
## - longitude                       1   1260466 38988234 165156
## - neighbourhood_group             4   2181956 39909724 165666
## - room_type                       2  20652266 58380035 174072
## 
## Step:  AIC=164028.6
## price ~ room_type + neighbourhood_group + longitude + availability_365
## 
##                                  Df Sum of Sq      RSS    AIC
## + minimum_nights                  1    230804 36810120 163893
## + latitude                        1    177240 36863683 163925
## + calculated_host_listings_count  1    162446 36878477 163934
## + number_of_reviews               1     34202 37006721 164010
## <none>                                        37040923 164029
## + reviews_per_month               1      2133 37038790 164029
## - availability_365                1    686845 37727769 164432
## - longitude                       1   1401591 38442514 164847
## - neighbourhood_group             4   2293463 39334386 165348
## - room_type                       2  20937925 57978849 173922
## 
## Step:  AIC=163892.5
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     minimum_nights
## 
##                                  Df Sum of Sq      RSS    AIC
## + calculated_host_listings_count  1    177451 36632669 163788
## + latitude                        1    169853 36640267 163792
## + number_of_reviews               1     52650 36757470 163863
## + reviews_per_month               1     12162 36797958 163887
## <none>                                        36810120 163893
## - minimum_nights                  1    230804 37040923 164029
## - availability_365                1    774781 37584901 164351
## - longitude                       1   1445262 38255381 164741
## - neighbourhood_group             4   2367828 39177948 165262
## - room_type                       2  21161888 57972008 173922
## 
## Step:  AIC=163787.8
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     minimum_nights + calculated_host_listings_count
## 
##                                  Df Sum of Sq      RSS    AIC
## + latitude                        1    147040 36485629 163701
## + number_of_reviews               1     36217 36596452 163768
## + reviews_per_month               1      9472 36623198 163784
## <none>                                        36632669 163788
## - calculated_host_listings_count  1    177451 36810120 163893
## - minimum_nights                  1    245808 36878477 163934
## - availability_365                1    598275 37230945 164144
## - longitude                       1   1347564 37980233 164584
## - neighbourhood_group             4   2206610 38839279 165072
## - room_type                       2  20766645 57399315 173704
## 
## Step:  AIC=163700.9
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     minimum_nights + calculated_host_listings_count + latitude
## 
##                                  Df Sum of Sq      RSS    AIC
## + number_of_reviews               1     35080 36450549 163682
## + reviews_per_month               1     13251 36472379 163695
## <none>                                        36485629 163701
## - latitude                        1    147040 36632669 163788
## - calculated_host_listings_count  1    154637 36640267 163792
## - minimum_nights                  1    237680 36723309 163842
## - availability_365                1    555879 37041509 164033
## - longitude                       1   1101353 37586982 164356
## - neighbourhood_group             4   1912183 38397812 164821
## - room_type                       2  20287824 56773454 173464
## 
## Step:  AIC=163681.7
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     minimum_nights + calculated_host_listings_count + latitude + 
##     number_of_reviews
## 
##                                  Df Sum of Sq      RSS    AIC
## <none>                                        36450549 163682
## + reviews_per_month               1       272 36450277 163684
## - number_of_reviews               1     35080 36485629 163701
## - calculated_host_listings_count  1    139762 36590312 163764
## - latitude                        1    145903 36596452 163768
## - minimum_nights                  1    252072 36702622 163832
## - availability_365                1    590741 37041291 164035
## - longitude                       1   1096108 37546657 164334
## - neighbourhood_group             4   1913238 38363788 164804
## - room_type                       2  20288372 56738921 173453
## 
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + longitude + 
##     availability_365 + minimum_nights + calculated_host_listings_count + 
##     latitude + number_of_reviews, data = airbnb_train)
## 
## Coefficients:
##                      (Intercept)             room_typePrivate room  
##                       -1.471e+04                        -6.156e+01  
##             room_typeShared room       neighbourhood_groupBrooklyn  
##                       -7.455e+01                        -7.033e+00  
##     neighbourhood_groupManhattan         neighbourhood_groupQueens  
##                        1.683e+01                         3.661e+00  
## neighbourhood_groupStaten Island                         longitude  
##                       -7.201e+01                        -2.432e+02  
##                 availability_365                    minimum_nights  
##                        4.360e-02                        -1.909e-01  
##   calculated_host_listings_count                          latitude  
##                        1.063e-01                        -7.698e+01  
##                number_of_reviews  
##                       -2.631e-02

The covariates given by this fit are as follows:

  • room_type
  • neighbourhood_group
  • longitude
  • availability_365
  • calculated_host_listings_count
  • minimum_nights
  • latitude
  • number_of_reviews

Hence, Building a Model using this co variates:

airbnb_model_4 <- lm(price ~ room_type + neighbourhood_group + longitude + availability_365 + calculated_host_listings_count + minimum_nights + latitude + number_of_reviews, 
                     data = airbnb_train, nbest = 2, nvmax = 9)


summary_model_4 <- summary(airbnb_model_4)
mse_4 <- summary_model_4$sigma^2
r_sq_4 <- summary_model_4$r.squared
adj_r_sq_4 <- summary_model_4$adj.r.squared

summary_model_4

Model 4 Observations:

  • MSE: 1651
  • R-squared: 0.446
  • Adjusted R-squared: 0.446
stepwise selection using BIC (Direction = “both”)
null <- lm(price~1, data = airbnb_train)
full <- lm(price ~ neighbourhood_group + latitude + longitude + room_type + minimum_nights  + number_of_reviews + 
                                 reviews_per_month +calculated_host_listings_count + availability_365, data = airbnb_train)

n=dim(airbnb_train[1])
step(null, scope =list(lower=null, upper= full), k=log(n), direction = "both")
## Start:  AIC=176712.4
## price ~ 1
## 
##                                  Df Sum of Sq      RSS    AIC
## + room_type                       2  22890604 42906354 167288
## + longitude                       1   5096488 60700471 174921
## + neighbourhood_group             4   4998958 60798000 174957
## + calculated_host_listings_count  1   1507974 64288984 176210
## + reviews_per_month               1    262173 65534786 176614
## + availability_365                1     75390 65721569 176677
## + minimum_nights                  1     45410 65751548 176687
## + number_of_reviews               1     80578 65716380 176695
## + latitude                        1     44731 65752227 176707
## <none>                                        65796958 176712
## 
## Step:  AIC=167287.8
## price ~ room_type
## 
##                                  Df Sum of Sq      RSS    AIC
## + neighbourhood_group             4   3918120 38988234 165142
## + longitude                       1   2996630 39909724 165658
## + calculated_host_listings_count  1    745452 42160902 166871
## + availability_365                1    297661 42608694 167144
## + latitude                        1    258346 42648008 167164
## + number_of_reviews               1     18543 42887811 167248
## + reviews_per_month               1     48080 42858274 167273
## + minimum_nights                  1     42643 42863712 167276
## <none>                                        42906354 167288
## - room_type                       2  22890604 65796958 176702
## 
## Step:  AIC=165212.4
## price ~ room_type + neighbourhood_group
## 
##                                  Df Sum of Sq      RSS    AIC
## + longitude                       1   1260466 37727769 164496
## + availability_365                1    545720 38442514 164831
## + latitude                        1    518694 38469540 164847
## + calculated_host_listings_count  1    420175 38568060 164983
## + minimum_nights                  1    117965 38870269 165075
## + reviews_per_month               1       952 38987282 165142
## <none>                                        38988234 165212
## + number_of_reviews               1      3875 38984359 165220
## - neighbourhood_group             4   3918120 42906354 167288
## - room_type                       2  21809766 60798000 174957
## 
## Step:  AIC=164496.5
## price ~ room_type + neighbourhood_group + longitude
## 
##                                  Df Sum of Sq      RSS    AIC
## + availability_365                1    686845 37040923 164101
## + calculated_host_listings_count  1    325486 37402282 164225
## + latitude                        1    238175 37489594 164277
## + number_of_reviews               1       228 37727540 164416
## + minimum_nights                  1    142868 37584901 164423
## <none>                                        37727769 164496
## + reviews_per_month               1      8272 37719497 164502
## - longitude                       1   1260466 38988234 165142
## - neighbourhood_group             4   2181956 39909724 165698
## - room_type                       2  20652266 58380035 174060
## 
## Step:  AIC=164100.6
## price ~ room_type + neighbourhood_group + longitude + availability_365
## 
##                                  Df Sum of Sq      RSS    AIC
## + latitude                        1    177240 36863683 163905
## + calculated_host_listings_count  1    162446 36878477 163914
## + minimum_nights                  1    230804 36810120 163973
## + number_of_reviews               1     34202 37006721 163990
## <none>                                        37040923 164101
## + reviews_per_month               1      2133 37038790 164109
## - availability_365                1    686845 37727769 164496
## - longitude                       1   1401591 38442514 164831
## - neighbourhood_group             4   2293463 39334386 165388
## - room_type                       2  20937925 57978849 173908
## 
## Step:  AIC=164004.7
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     latitude
## 
##                                  Df Sum of Sq      RSS    AIC
## + minimum_nights                  1    223417 36640267 163770
## + reviews_per_month               1      4297 36859387 163902
## + calculated_host_listings_count  1    140374 36723309 163930
## + number_of_reviews               1     32252 36831432 163995
## <none>                                        36863683 164005
## - latitude                        1    177240 37040923 164011
## - availability_365                1    625910 37489594 164367
## - longitude                       1   1123646 37987329 164568
## - neighbourhood_group             4   2032199 38895883 165150
## - room_type                       2  20396760 57260444 173633
## 
## Step:  AIC=163880.4
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     latitude + minimum_nights
## 
##                                  Df Sum of Sq      RSS    AIC
## + calculated_host_listings_count  1    154637 36485629 163677
## + number_of_reviews               1     49955 36590312 163740
## - latitude                        1    169853 36810120 163873
## <none>                                        36640267 163880
## + reviews_per_month               1     16497 36623769 163880
## - minimum_nights                  1    223417 36863683 164005
## - availability_365                1    709581 37349848 164294
## - longitude                       1   1165230 37805497 164462
## - neighbourhood_group             4   2075217 38715483 165057
## - room_type                       2  20614734 57255001 173631
## 
## Step:  AIC=163797
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     latitude + minimum_nights + calculated_host_listings_count
## 
##                                  Df Sum of Sq      RSS    AIC
## + number_of_reviews               1     35080 36450549 163656
## - latitude                        1    147040 36632669 163766
## - calculated_host_listings_count  1    154637 36640267 163770
## <none>                                        36485629 163797
## + reviews_per_month               1     13251 36472379 163799
## - minimum_nights                  1    237680 36723309 163930
## - availability_365                1    555879 37041509 164121
## - longitude                       1   1101353 37586982 164334
## - neighbourhood_group             4   1912183 38397812 164885
## - room_type                       2  20287824 56773454 173444
## 
## Step:  AIC=163785.7
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     latitude + minimum_nights + calculated_host_listings_count + 
##     number_of_reviews
## 
##                                  Df Sum of Sq      RSS    AIC
## + reviews_per_month               1       272 36450277 163656
## - calculated_host_listings_count  1    139762 36590312 163740
## - latitude                        1    145903 36596452 163744
## <none>                                        36450549 163786
## - number_of_reviews               1     35080 36485629 163797
## - minimum_nights                  1    252072 36702622 163928
## - availability_365                1    590741 37041291 164131
## - longitude                       1   1096108 37546657 164310
## - neighbourhood_group             4   1913238 38363788 164876
## - room_type                       2  20288372 56738921 173431
## 
## Step:  AIC=163795.6
## price ~ room_type + neighbourhood_group + longitude + availability_365 + 
##     latitude + minimum_nights + calculated_host_listings_count + 
##     number_of_reviews + reviews_per_month
## 
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + longitude + 
##     availability_365 + latitude + minimum_nights + calculated_host_listings_count + 
##     number_of_reviews + reviews_per_month, data = airbnb_train)
## 
## Coefficients:
##                      (Intercept)             room_typePrivate room  
##                       -1.467e+04                        -6.156e+01  
##             room_typeShared room       neighbourhood_groupBrooklyn  
##                       -7.453e+01                        -7.077e+00  
##     neighbourhood_groupManhattan         neighbourhood_groupQueens  
##                        1.681e+01                         3.635e+00  
## neighbourhood_groupStaten Island                         longitude  
##                       -7.196e+01                        -2.428e+02  
##                 availability_365                          latitude  
##                        4.367e-02                        -7.718e+01  
##                   minimum_nights    calculated_host_listings_count  
##                       -1.915e-01                         1.063e-01  
##                number_of_reviews                 reviews_per_month  
##                       -2.482e-02                        -8.301e-02

The covariates given by this fit are as follows:

  • room_type
  • neighbourhood_group
  • longitude
  • availability_365
  • minimum_nights
  • calculated_host_listings_count
  • latitude
  • number_of_reviews
  • reviews_per_month

Hence, Building a Model using this co variates:

airbnb_model_5 <- lm(price ~ room_type + neighbourhood_group + longitude + availability_365 + minimum_nights + latitude + calculated_host_listings_count + 
                       number_of_reviews + reviews_per_month, data = airbnb_train, nbest = 2, nvmax = 9)


summary_model_5 <- summary(airbnb_model_5)
mse_5 <- summary_model_5$sigma^2
r_sq_5 <- summary_model_5$r.squared
adj_r_sq_5 <- summary_model_5$adj.r.squared

summary_model_5

Model 5 Observations:

  • MSE: 1651
  • R-squared: 0.446
  • Adjusted R-squared: 0.446

Lasso Regression Method

Next we use the Lasso Regression method for variable selection.

lasso_fit <- glmnet(x = as.matrix(airbnb_train[, -c(which(colnames(airbnb_train) %in% c("room_type", "neighbourhood_group","price", "name", "host_name", 
                                                                                        "neighbourhood", "last_review", "host_id", "id")))]), 
                    y = airbnb_train$price, alpha = 0.5)

coef(lasso_fit,s = lasso_fit$lambda.min)
## 8 x 65 sparse Matrix of class "dgCMatrix"
##                                                                
## (Intercept)                    120.4523 -1629.47633 -3285.14766
## latitude                         .          .           .      
## longitude                        .        -23.66303   -46.05149
## minimum_nights                   .          .           .      
## number_of_reviews                .          .           .      
## reviews_per_month                .          .           .      
## calculated_host_listings_count   .          .           .      
## availability_365                 .          .           .      
##                                                                  
## (Intercept)                    -4847.38927 -6317.69968 -7698.1541
## latitude                           .           .           .     
## longitude                        -67.17657   -87.05852  -105.7254
## minimum_nights                     .           .           .     
## number_of_reviews                  .           .           .     
## reviews_per_month                  .           .           .     
## calculated_host_listings_count     .           .           .     
## availability_365                   .           .           .     
##                                                                    
## (Intercept)                    -8991.3110 -10200.1229 -1.128308e+04
## latitude                           .           .       .           
## longitude                       -123.2119   -139.5578 -1.542010e+02
## minimum_nights                     .           .       .           
## number_of_reviews                  .           .       .           
## reviews_per_month                  .           .       .           
## calculated_host_listings_count     .           .       1.297033e-02
## availability_365                   .           .       .           
##                                                                         
## (Intercept)                    -1.225891e+04 -1.316478e+04 -1.400446e+04
## latitude                        .             .             .           
## longitude                      -1.673950e+02 -1.796432e+02 -1.909964e+02
## minimum_nights                  .             .             .           
## number_of_reviews               .             .             .           
## reviews_per_month               .             .             .           
## calculated_host_listings_count  3.413940e-02  5.378151e-02  7.198001e-02
## availability_365                .             .             .           
##                                                                         
## (Intercept)                    -1.478169e+04 -1.550019e+04 -1.616360e+04
## latitude                        .             .             .           
## longitude                      -2.015052e+02 -2.112200e+02 -2.201900e+02
## minimum_nights                  .             .             .           
## number_of_reviews               .             .             .           
## reviews_per_month               .             .             .           
## calculated_host_listings_count  8.881807e-02  1.043778e-01  1.187394e-01
## availability_365                .             .             .           
##                                                                         
## (Intercept)                    -16775.488962 -1.733927e+04 -1.785824e+04
## latitude                            .         .             .           
## longitude                        -228.463246 -2.360861e+02 -2.431030e+02
## minimum_nights                      .         .             .           
## number_of_reviews                   .         .             .           
## reviews_per_month                   .         .             .           
## calculated_host_listings_count      0.131981  1.441778e-01  1.554021e-01
## availability_365                    .         .             .           
##                                                                         
## (Intercept)                    -1.833555e+04 -1.877421e+04 -1.919758e+04
## latitude                        .             .             .           
## longitude                      -2.495567e+02 -2.554877e+02 -2.612111e+02
## minimum_nights                  .             .             .           
## number_of_reviews               .             .             .           
## reviews_per_month               .             .             .           
## calculated_host_listings_count  1.657226e-01  1.752049e-01  1.831789e-01
## availability_365                .             .             7.257962e-04
##                                                                         
## (Intercept)                    -1.973191e+04 -2.029562e+04 -2.081313e+04
## latitude                        2.480943e+00  6.294373e+00  9.794989e+00
## longitude                      -2.670672e+02 -2.725869e+02 -2.776545e+02
## minimum_nights                  .             .             .           
## number_of_reviews               .             .             .           
## reviews_per_month               .             .             .           
## calculated_host_listings_count  1.894060e-01  1.950260e-01  2.001623e-01
## availability_365                2.370588e-03  3.909134e-03  5.321501e-03
##                                                                         
## (Intercept)                    -2.128786e+04 -2.172311e+04 -2.209790e+04
## latitude                        1.300597e+01  1.594960e+01  1.860508e+01
## longitude                      -2.823033e+02 -2.865656e+02 -2.901703e+02
## minimum_nights                  .             .             .           
## number_of_reviews               .             .             .           
## reviews_per_month               .             .            -6.284334e-02
## calculated_host_listings_count  2.048558e-01  2.091435e-01  2.129058e-01
## availability_365                6.617165e-03  7.805081e-03  9.036752e-03
##                                                                         
## (Intercept)                    -2.243553e+04 -2.274475e+04 -2.302766e+04
## latitude                        2.102348e+01  2.323952e+01  2.526713e+01
## longitude                      -2.934032e+02 -2.963636e+02 -2.990719e+02
## minimum_nights                  .             .             .           
## number_of_reviews               .             .             .           
## reviews_per_month              -1.344146e-01 -1.998507e-01 -2.597120e-01
## calculated_host_listings_count  2.162946e-01  2.193910e-01  2.222159e-01
## availability_365                1.019773e-02  1.126096e-02  1.223440e-02
##                                                                         
## (Intercept)                    -2.328639e+04 -2.352295e+04 -2.373916e+04
## latitude                        2.712167e+01  2.881734e+01  3.036728e+01
## longitude                      -3.015488e+02 -3.038132e+02 -3.058829e+02
## minimum_nights                  .             .             .           
## number_of_reviews               .             .             .           
## reviews_per_month              -3.144546e-01 -3.645004e-01 -4.102389e-01
## calculated_host_listings_count  2.247930e-01  2.271436e-01  2.292875e-01
## availability_365                1.312527e-02  1.394026e-02  1.468557e-02
##                                                                         
## (Intercept)                    -2.393886e+04 -2.412593e+04 -2.429536e+04
## latitude                        3.179072e+01  3.309745e+01  3.429044e+01
## longitude                      -3.077990e+02 -3.096087e+02 -3.112425e+02
## minimum_nights                  .             .             .           
## number_of_reviews              -9.641900e-04 -2.101625e-03 -3.083937e-03
## reviews_per_month              -4.365266e-01 -4.566604e-01 -4.757090e-01
## calculated_host_listings_count  2.311111e-01  2.326949e-01  2.341687e-01
## availability_365                1.541139e-02  1.609126e-02  1.670775e-02
##                                                                         
## (Intercept)                    -2.445006e+04 -2.459131e+04 -2.471792e+04
## latitude                        3.537976e+01  3.637435e+01  3.727417e+01
## longitude                      -3.127342e+02 -3.140963e+02 -3.153126e+02
## minimum_nights                  .             .             .           
## number_of_reviews              -3.979467e-03 -4.797363e-03 -5.473288e-03
## reviews_per_month              -4.931019e-01 -5.089503e-01 -5.245033e-01
## calculated_host_listings_count  2.355118e-01  2.367358e-01  2.378731e-01
## availability_365                1.727086e-02  1.778517e-02  1.824980e-02
##                                                                         
## (Intercept)                    -2.483564e+04 -2.495097e+04 -2.505244e+04
## latitude                        3.810293e+01  3.889598e+01  3.960811e+01
## longitude                      -3.164478e+02 -3.175706e+02 -3.185505e+02
## minimum_nights                  .            -2.495664e-03 -5.212982e-03
## number_of_reviews              -6.153489e-03 -6.878866e-03 -7.396819e-03
## reviews_per_month              -5.377084e-01 -5.515890e-01 -5.672199e-01
## calculated_host_listings_count  2.388885e-01  2.398767e-01  2.408309e-01
## availability_365                1.867860e-02  1.912076e-02  1.952244e-02
##                                                                         
## (Intercept)                    -2.514724e+04 -2.523390e+04 -2.531298e+04
## latitude                        4.026502e+01  4.086485e+01  4.141208e+01
## longitude                      -3.194708e+02 -3.203123e+02 -3.210802e+02
## minimum_nights                 -7.672802e-03 -9.913638e-03 -1.195763e-02
## number_of_reviews              -7.932288e-03 -8.427774e-03 -8.880560e-03
## reviews_per_month              -5.805522e-01 -5.925743e-01 -6.035199e-01
## calculated_host_listings_count  2.416734e-01  2.424404e-01  2.431393e-01
## availability_365                1.989376e-02  2.023290e-02  2.054238e-02
##                                                                         
## (Intercept)                    -2.538510e+04 -2.545089e+04 -2.551089e+04
## latitude                        4.191122e+01  4.236645e+01  4.278160e+01
## longitude                      -3.217807e+02 -3.224196e+02 -3.230022e+02
## minimum_nights                 -1.382218e-02 -1.552291e-02 -1.707406e-02
## number_of_reviews              -9.293679e-03 -9.670505e-03 -1.001419e-02
## reviews_per_month              -6.134967e-01 -6.225913e-01 -6.308816e-01
## calculated_host_listings_count  2.437763e-01  2.443567e-01  2.448856e-01
## availability_365                2.082473e-02  2.108230e-02  2.131724e-02
##                                                                         
## (Intercept)                    -2.556560e+04 -2.561549e+04 -2.566097e+04
## latitude                        4.316016e+01  4.350535e+01  4.382007e+01
## longitude                      -3.235336e+02 -3.240181e+02 -3.244599e+02
## minimum_nights                 -1.848866e-02 -1.977864e-02 -2.095490e-02
## number_of_reviews              -1.032761e-02 -1.061343e-02 -1.087404e-02
## reviews_per_month              -6.384382e-01 -6.453260e-01 -6.516039e-01
## calculated_host_listings_count  2.453677e-01  2.458069e-01  2.462071e-01
## availability_365                2.153151e-02  2.172693e-02  2.190512e-02
##                                                                         
## (Intercept)                    -2.570245e+04 -2.573846e+04 -2.577272e+04
## latitude                        4.410700e+01  4.436415e+01  4.460218e+01
## longitude                      -3.248627e+02 -3.252081e+02 -3.255404e+02
## minimum_nights                 -2.202737e-02 -2.296799e-02 -2.386341e-02
## number_of_reviews              -1.111166e-02 -1.128886e-02 -1.147699e-02
## reviews_per_month              -6.573258e-01 -6.625114e-01 -6.674357e-01
## calculated_host_listings_count  2.465719e-01  2.469592e-01  2.472637e-01
## availability_365                2.206761e-02  2.220894e-02  2.234356e-02
##                                                                         
## (Intercept)                    -2.580421e+04 -2.583299e+04 -2.585925e+04
## latitude                        4.481978e+01  4.501828e+01  4.519925e+01
## longitude                      -3.258464e+02 -3.261262e+02 -3.263816e+02
## minimum_nights                 -2.467998e-02 -2.542375e-02 -2.610148e-02
## number_of_reviews              -1.165707e-02 -1.182407e-02 -1.197725e-02
## reviews_per_month              -6.718217e-01 -6.757751e-01 -6.793626e-01
## calculated_host_listings_count  2.475361e-01  2.477834e-01  2.480083e-01
## availability_365                2.246707e-02  2.257985e-02  2.268271e-02
##                                                                         
## (Intercept)                    -2.588319e+04 -2.590501e+04 -2.592490e+04
## latitude                        4.536422e+01  4.551458e+01  4.565163e+01
## longitude                      -3.266145e+02 -3.268267e+02 -3.270202e+02
## minimum_nights                 -2.671916e-02 -2.728213e-02 -2.779525e-02
## number_of_reviews              -1.211720e-02 -1.224487e-02 -1.236127e-02
## reviews_per_month              -6.826269e-01 -6.855999e-01 -6.883085e-01
## calculated_host_listings_count  2.482132e-01  2.483999e-01  2.485700e-01
## availability_365                2.277650e-02  2.286200e-02  2.293993e-02
##                                                                         
## (Intercept)                    -2.594303e+04 -2.595955e+04 -2.597461e+04
## latitude                        4.577653e+01  4.589037e+01  4.599412e+01
## longitude                      -3.271966e+02 -3.273573e+02 -3.275038e+02
## minimum_nights                 -2.826292e-02 -2.868916e-02 -2.907762e-02
## number_of_reviews              -1.246738e-02 -1.256409e-02 -1.265224e-02
## reviews_per_month              -6.907766e-01 -6.930256e-01 -6.950750e-01
## calculated_host_listings_count  2.487249e-01  2.488661e-01  2.489948e-01
## availability_365                2.301096e-02  2.307570e-02  2.313471e-02
##                                                           
## (Intercept)                    -2.598833e+04 -2.600084e+04
## latitude                        4.608866e+01  4.617483e+01
## longitude                      -3.276373e+02 -3.277590e+02
## minimum_nights                 -2.943166e-02 -2.975431e-02
## number_of_reviews              -1.273257e-02 -1.280578e-02
## reviews_per_month              -6.969424e-01 -6.986441e-01
## calculated_host_listings_count  2.491120e-01  2.492188e-01
## availability_365                2.318849e-02  2.323750e-02

The covariates given out by this fit are:

  • latitude
  • longitude
  • minimum_nights
  • number_of_reviews
  • reviews_per_month
  • calculated_host_listings_count
  • availability_365

Hence, we will build a Model using these covariates

airbnb_model_6 <- lm(price ~ latitude + longitude + minimum_nights + number_of_reviews + reviews_per_month + calculated_host_listings_count + availability_365,
                     data = airbnb_train, nbest = 2, nvmax = 9)


summary_model_6 <- summary(airbnb_model_6)
#mse_6 <- mean(summary_model_6$sigma^2)
mse_6 <- mean(summary_model_6$residuals^2)
r_sq_6 <- summary_model_6$r.squared
adj_r_sq_6 <- summary_model_6$adj.r.squared

summary_model_6
## 
## Call:
## lm(formula = price ~ latitude + longitude + minimum_nights + 
##     number_of_reviews + reviews_per_month + calculated_host_listings_count + 
##     availability_365, data = airbnb_train, nbest = 2, nvmax = 9)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -136.59  -41.18  -10.53   32.33  208.49 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.613e+04  6.616e+02 -39.495  < 2e-16 ***
## latitude                        4.706e+01  6.436e+00   7.312 2.72e-13 ***
## longitude                      -3.290e+02  7.914e+00 -41.572  < 2e-16 ***
## minimum_nights                 -3.307e-02  1.973e-02  -1.676  0.09370 .  
## number_of_reviews              -1.356e-02  8.632e-03  -1.571  0.11627    
## reviews_per_month              -7.161e-01  2.594e-01  -2.761  0.00577 ** 
## calculated_host_listings_count  2.503e-01  1.458e-02  17.170  < 2e-16 ***
## availability_365                2.374e-02  2.918e-03   8.136 4.30e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.84 on 22082 degrees of freedom
## Multiple R-squared:  0.0981, Adjusted R-squared:  0.09782 
## F-statistic: 343.1 on 7 and 22082 DF,  p-value: < 2.2e-16

Model 6 Observations:

  • MSE: 2687
  • R-squared: 0.0981
  • Adjusted R-squared: 0.0978

Cross Validation of Lasso Fit:

cv_lasso_fit = cv.glmnet(x = as.matrix(airbnb_train[, -c(which(colnames(airbnb_train) %in% c("room_type", "neighbourhood_group","price" , "name", "host_name",
                                                                                                          "neighbourhood", "last_review")))]), 
                    y = airbnb_train$price, alpha = 1, nfolds = 5)
plot(cv_lasso_fit)

5.3. Model Selection

In order to pick a model from the ones that we have built, we will do a comparison MSPE and adjusted R-squared of all the models and choose the model with the best combination of both.

Comparison of all the Models:

#Reading the variable summary excel File
model_sum <- read_excel("Excel Files/Model_Comparison.xlsx")

kable(model_sum) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T, )
Model Name MSE R-Squared Adjusted R-Squared
Linear Regression Model 1651 0.4460 0.44600
Best Subset Regression Method 1653 0.4450 0.44500
Stepwise Regression using AIC 1651 0.4460 0.44500
Stepwise Regression using BIC 1651 0.4460 0.44500
Lasso Fit Model 2687 0.0981 0.09782

Observations

  1. The model built using Lasso Regression is the least desirable one as it has the lowest adjusted R-squared and highest MSE value.
  2. The linear model built using stepwise regression using both AIC and BIC have the best combination of MSE and Adjusted R-squared and these values are same for both of them. However, we are selecting the **Model 5** (stepwise regression using BIC) as BIC is considered to be more Conservative and it gives the Lesser number of Variables when compared to AIC, thus making the model more interpretable.

Summary of Model 5

summary_model_5 <- summary(airbnb_model_5)
mse_5 <- summary_model_5$sigma^2
r_sq_5 <- summary_model_5$r.squared
adj_r_sq_5 <- summary_model_5$adj.r.squared

summary_model_5
## 
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + longitude + 
##     availability_365 + minimum_nights + latitude + calculated_host_listings_count + 
##     number_of_reviews + reviews_per_month, data = airbnb_train, 
##     nbest = 2, nvmax = 9)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -116.972  -26.271   -6.395   21.015  234.809 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                      -1.467e+04  8.420e+02  -17.424  < 2e-16
## room_typePrivate room            -6.156e+01  5.635e-01 -109.234  < 2e-16
## room_typeShared room             -7.453e+01  2.523e+00  -29.542  < 2e-16
## neighbourhood_groupBrooklyn      -7.077e+00  2.463e+00   -2.873 0.004066
## neighbourhood_groupManhattan      1.681e+01  2.267e+00    7.416 1.25e-13
## neighbourhood_groupQueens         3.635e+00  2.386e+00    1.524 0.127622
## neighbourhood_groupStaten Island -7.196e+01  4.569e+00  -15.750  < 2e-16
## longitude                        -2.428e+02  9.494e+00  -25.575  < 2e-16
## availability_365                  4.367e-02  2.310e-03   18.900  < 2e-16
## minimum_nights                   -1.915e-01  1.553e-02  -12.336  < 2e-16
## latitude                         -7.718e+01  8.204e+00   -9.408  < 2e-16
## calculated_host_listings_count    1.063e-01  1.155e-02    9.207  < 2e-16
## number_of_reviews                -2.482e-02  6.783e-03   -3.659 0.000254
## reviews_per_month                -8.301e-02  2.044e-01   -0.406 0.684616
##                                     
## (Intercept)                      ***
## room_typePrivate room            ***
## room_typeShared room             ***
## neighbourhood_groupBrooklyn      ** 
## neighbourhood_groupManhattan     ***
## neighbourhood_groupQueens           
## neighbourhood_groupStaten Island ***
## longitude                        ***
## availability_365                 ***
## minimum_nights                   ***
## latitude                         ***
## calculated_host_listings_count   ***
## number_of_reviews                ***
## reviews_per_month                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.63 on 22076 degrees of freedom
## Multiple R-squared:  0.446,  Adjusted R-squared:  0.4457 
## F-statistic:  1367 on 13 and 22076 DF,  p-value: < 2.2e-16

5.4 Validation and Prediction

In-sample model evaluation (train error)

MSE of the Final Selected Model

(summary_model_5$sigma)^2
## [1] 1651.127

R2 of the Final Selected Model

summary_model_5$r.squared
## [1] 0.4460188

Adjusted R2 of the Final Selected Model

  summary_model_5$adj.r.squared
## [1] 0.4456926

Out-of-sample prediction (test error)

To evaluate how the model performs on future data, we use predict() to get the predicted values from the test set.

#pi is a vector that contains predicted values for test set.
pi <- predict(object = airbnb_model_5, newdata = airbnb_test)

Mean Squared Error (MSE) of the final Model:

mean((pi - airbnb_test$price)^2)
## [1] 1632.556

the Mean Absolute Error (MAE) of the Final Model - Test Data.

mean(abs(pi - airbnb_test$price))
## [1] 31.0044

Cross Validation of the Final Selected Model

MSPE of the Full Filtered Data

model_glm_1 = glm(price~neighbourhood_group + latitude + longitude + room_type + minimum_nights  + number_of_reviews + 
                                 reviews_per_month +calculated_host_listings_count + availability_365,  data = airbnb_filtered_data)

cv.glm(data =airbnb_filtered_data, glmfit = model_glm_1, K = 3)$delta[2]
## [1] 1646.619

Comparing the MSE of Test Dataset which is equal to 1633, and the MSPE of the Full Data which is 1646, we can see that the values are almost similar. Hence the variables that we have selected for our model are good estimators of our dependent variable.

6. Conclusion

Summary of the problem:

To Analyze the New York Air Bnb dataset and find the interesting insights related to the factors that affect the prices of listing in New York city. We tried to answer the following questions using our analysis:

  • How are rental properties distributed across the neighborhoods of NYC?
  • How do prices vary with respect to other factors?

Date Used and Methodology: The data is collected from an online data-source Kaggle.

Methodology:

We used multiple regression method to create a model that would help predict prices of listing based on relevant factors. We started with importing and cleaning the data followed by exploratory analysis to gain some interesting insights. For model building, we decided to split the dataset into training and testing datasets in the ratio of 70:30. We created an initial model using all the variables we thought were relevant, followed by building models using 4 different variable selections methods i.e. stepwise-regression methods (using AIC and BIC both), regression subset method and Lasso regression methods. MSE, R^2, Adj R^2 of these 5 models were then compared in order to pick the best model. Next, we performed in-sample validation, out-sample prediction and cross-validation techniques on the final selected model. Upon validation, we found that the MSPE of Test Dataset i.e. 1633 is almost similar to the MSPE of the Full Data i.e. 1646. Hence, we could conclude that the Model selected for analysis, predicts the variable with a good fit.

Interesting insights from Analysis The above analysis helped us understand the New York Air Bnb dataset better. Following insights were drawn from it:

  • Private room is the most common listing type in all neighborhoods except Manhattan where Entire Home/apartment is the most common type.
  • Shared room is the least common type of listings.
  • Average price of listings is the highest for Manhattan followed by Brooklyn.
  • Bronx has the cheapest listings with an average price of 87.5 USD.
  • Average price is the highest for Entire home/apartment followed by private room and shared room.
  • The factors that impact listing prices are:
    • room_type
      • neighbourhood_group
      • longitude
      • availability_365
      • minimum_nights
      • calculated_host_listings_count
      • latitude
      • number_of_reviews
      • reviews_per_month

Implications from the analysis:

This information can be really helpful for multiple categories of the society such as tourists, house owners, real estate agents etc. It can help customers decide which neighborhoods they should look into depending on their needs and house owners can decide on the prices they set on the listings taking all these factors into consideration.

Final Model

The Final Model selected was Model 5. This Model was selected by Stepwise Regression using BIC (direction : both). The Observations of the Model:

  • MSE: 1651
  • R-squared: 0.446
  • Adjusted R-squared: 0.446

The Covariates in the Model

  • room_type
  • neighbourhood_group
  • longitude
  • availability_365
  • minimum_nights
  • calculated_host_listings_count
  • latitude
  • number_of_reviews
  • reviews_per_month