The Data was collected from Kaggle. In this project we present to you exploratory data analysis, visualizations and modelling of New York Airbnb data. Airbnb, Inc.is an online marketplace for arranging or offering lodging, primarily homestays, or tourism experiences. The company does not own any of the real estate listings, nor does it host events; it acts as a broker, receiving commissions from each booking. Revenue for Airbnb comes from its guests and hosts: hosts are charged 3% of the value of the booking, while guests are charged 6%-12% per the nature of the booking. Airbnb market is quite blooming in New York city (NYC) which had more than 48,000 listings as of August-2019 calendar year.
We focus on New York City’s data as we wish to perform an in-depth analysis on one of the most densely populated cities in the world.
In this project, we also try to predict the factors that affect the pricing of the airbnbs around New York. This includes creating different kind of models, model specification, transformation, variable selection and many more.
We carried out the project in the following steps:
The original Data Set can be found here —> New York AirBnB Data 2019
We begin by loading the packages that will be required throughout the course of our analysis.
#install DAAG from archived source
if(!is.element("DAAG", installed.packages()[,1])){
packageurl <- "https://cran.r-project.org/src/contrib/Archive/DAAG/DAAG_1.22.tar.gz"
install.packages("latticeExtra")
install.packages(packageurl, repos=NULL, type="source")
}
library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(lubridate)
library(readxl)
library(highcharter)
library(lubridate)
library(scales)
library(RColorBrewer)
library(wesanderson)
library(plotly)
library(shiny)
library(readxl)
library(readr)
library(choroplethr)
library(choroplethrMaps)
library(GGally)
library(zoo)
library(scales)
library(ggmap)
library(stringr)
library(gridExtra)
library(caret)
library(treemap)
library(psych)
library(DAAG)
library(leaps)
library(corrplot)
library(glmnet)
library(boot)
The packages used can be found in a tabular format below. The table also describes the usage of each packages that we are using.
| Package | Description |
|---|---|
| library(tidyr) | For changing the layout of your data sets, to convert data into the tidy format |
| library(DT) | For HTML display of data |
| library(ggplot2) | For customizable graphical representation |
| library(dplyr) | For data manipulation |
| library(tidyverse) | Collection of R packages designed for data science that works harmoniously with other packages |
| library(kableExtra) | To display table in a fancy way |
| library(lubridate) | Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not |
| library(readxl) | The readxl package makes it easy to get data out of Excel and into R |
| library(highcharter) | Highcharter is a R wrapper for Highcharts javascript libray and its modules |
| library(lubridate) | Lubridate is an R package that makes it easier to work with dates and times. |
| library(scales) | The idea of the scales package is to implement scales in a way that is graphics system agnostic |
| library(RColorBrewer) | RColorBrewer is an R package that allows users to create colourful graphs with pre-made color palettes that visualize data in a clear and distinguishable manner |
| library(wesanderson) | A Wes Anderson is color palette for R |
| library(plotly) | Plotly’s R graphing library makes interactive, publication-quality graphs |
| library(shiny) | Shiny is an R package that makes it easy to build interactive web apps straight from R |
| library(readxl) | The readxl package makes it easy to get data out of Excel and into R. |
| library(readr) | The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). |
| library(choroplethr) | Simplify the Creation of Choropleth Maps in R |
| library(choroplethrMaps) | choroplethrMaps is an R package that contains 3 maps used by the choroplethr package. |
| library(GGally) | ggplot2 is a plotting system based on the grammar of graphics. |
| library(zoo) | S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations) |
| library(scales) | The scales packages provides the internal scaling infrastructure used by ggplot2, and gives you tools to override the default breaks, labels, transformations and palettes. |
| library(ggmap) | A collection of functions to visualize spatial data and models on top of static maps from various online sources (e.g Google Maps and Stamen Maps). |
| library(stringr) | A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. |
| library(gridExtra) | Provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables. |
| library(caret) | The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. |
| library(treemap) | A treemap is a space-filling visualization of hierarchical structures. This package offers great flexibility to draw treemaps. |
| library(psych) | A general purpose toolbox for personality, psychometric theory and experimental psychology. |
| library(DAAG) | Data Analysis and Graphics Data and Functions |
| library(leaps) | Regression subset selection, including exhaustive search. |
| library(corrplot) | The corrplot package is a graphical display of a correlation matrix, confidence interval |
| library(glmnet) | Extremely efficient procedures for fitting the entire lasso or elastic-net regularization path for linear regression, logistic and multinomial regression models, Poisson regression and the Cox model. |
After loading the required packages, we move on to the data perparation step which would start by loading the data into our R-studio.
summary(airbnb_data)
## id name
## Min. : 2539 Hillside Hotel : 18
## 1st Qu.: 9471945 Home away from home : 17
## Median :19677284 : 16
## Mean :19017143 New york Multi-unit building : 16
## 3rd Qu.:29152178 Brooklyn Apartment : 12
## Max. :36487245 Loft Suite @ The Box House Hotel: 11
## (Other) :48805
## host_id host_name neighbourhood_group
## Min. : 2438 Michael : 417 Bronx : 1091
## 1st Qu.: 7822033 David : 403 Brooklyn :20104
## Median : 30793816 Sonder (NYC): 327 Manhattan :21661
## Mean : 67620011 John : 294 Queens : 5666
## 3rd Qu.:107434423 Alex : 279 Staten Island: 373
## Max. :274321313 Blueground : 232
## (Other) :46943
## neighbourhood latitude longitude
## Williamsburg : 3920 Min. :40.50 Min. :-74.24
## Bedford-Stuyvesant: 3714 1st Qu.:40.69 1st Qu.:-73.98
## Harlem : 2658 Median :40.72 Median :-73.96
## Bushwick : 2465 Mean :40.73 Mean :-73.95
## Upper West Side : 1971 3rd Qu.:40.76 3rd Qu.:-73.94
## Hell's Kitchen : 1958 Max. :40.91 Max. :-73.71
## (Other) :32209
## room_type price minimum_nights
## Entire home/apt:25409 Min. : 0.0 Min. : 1.00
## Private room :22326 1st Qu.: 69.0 1st Qu.: 1.00
## Shared room : 1160 Median : 106.0 Median : 3.00
## Mean : 152.7 Mean : 7.03
## 3rd Qu.: 175.0 3rd Qu.: 5.00
## Max. :10000.0 Max. :1250.00
##
## number_of_reviews last_review reviews_per_month
## Min. : 0.00 :10052 Min. : 0.010
## 1st Qu.: 1.00 2019-06-23: 1413 1st Qu.: 0.190
## Median : 5.00 2019-07-01: 1359 Median : 0.720
## Mean : 23.27 2019-06-30: 1341 Mean : 1.373
## 3rd Qu.: 24.00 2019-06-24: 875 3rd Qu.: 2.020
## Max. :629.00 2019-07-07: 718 Max. :58.500
## (Other) :33137 NA's :10052
## calculated_host_listings_count availability_365
## Min. : 1.000 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 0.0
## Median : 1.000 Median : 45.0
## Mean : 7.144 Mean :112.8
## 3rd Qu.: 2.000 3rd Qu.:227.0
## Max. :327.000 Max. :365.0
##
glimpse(airbnb_data)
## Observations: 48,895
## Variables: 16
## $ id <int> 2539, 2595, 3647, 3831, 5022, 5...
## $ name <fct> "Clean & quiet apt home by the ...
## $ host_id <int> 2787, 2845, 4632, 4869, 7192, 7...
## $ host_name <fct> John, Jennifer, Elisabeth, Lisa...
## $ neighbourhood_group <fct> Brooklyn, Manhattan, Manhattan,...
## $ neighbourhood <fct> Kensington, Midtown, Harlem, Cl...
## $ latitude <dbl> 40.64749, 40.75362, 40.80902, 4...
## $ longitude <dbl> -73.97237, -73.98377, -73.94190...
## $ room_type <fct> Private room, Entire home/apt, ...
## $ price <int> 149, 225, 150, 89, 80, 200, 60,...
## $ minimum_nights <int> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1,...
## $ number_of_reviews <int> 9, 45, 0, 270, 9, 74, 49, 430, ...
## $ last_review <fct> 2018-10-19, 2019-05-21, , 2019-...
## $ reviews_per_month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.5...
## $ calculated_host_listings_count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1...
## $ availability_365 <int> 365, 355, 365, 194, 0, 129, 0, ...
Next we will check for missing values and deal with them accordingly.
summary(is.na(airbnb_data))
## id name host_id host_name
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:48895 FALSE:48895 FALSE:48895 FALSE:48895
##
## neighbourhood_group neighbourhood latitude longitude
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:48895 FALSE:48895 FALSE:48895 FALSE:48895
##
## room_type price minimum_nights number_of_reviews
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:48895 FALSE:48895 FALSE:48895 FALSE:48895
##
## last_review reviews_per_month calculated_host_listings_count
## Mode :logical Mode :logical Mode :logical
## FALSE:48895 FALSE:38843 FALSE:48895
## TRUE :10052
## availability_365
## Mode :logical
## FALSE:48895
##
We visualise the number of missings in each variable using naniar gg_miss_var
naniar::gg_miss_var(airbnb_data) +
theme_minimal()+
labs(y = "Look at all the Missing Values")
Analysis for missing values shows that there are no missing values in our dataset except for the variable ‘review per month’ which has 10052 missing values. Since we are planning to keep this variable in our study, we will decide later how we are going to deal with these missing values.
The Original Dataset New York AirBnB Data 2019 we collected from Kaggle did not need much cleaning. The Data set is almost clean with only one coloumn - reviews_per_month having 10052 N/A values. Deleting all these observatio now would not be a good solution as it is a high number and would result in biased results. Apart from that, all the variables were properly named and was already in proper case, hence variable need not be renamed. All the character variables were changed to factor variables.
The final cleaned dataset can be found below in an interactive table.
datatable(head(airbnb_data, 20), class = 'cell-border stripe')
#Reading the variable summary excel File
var_sum <- read_excel("Excel Files/variable_summary.xlsx")
kable(var_sum) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T, )
| Sr.No. | Variable | Class | Description |
|---|---|---|---|
| 1 | id | int | unique listing ID |
| 2 | name | character | name of the listing |
| 3 | host_id | int | host ID |
| 4 | host_name | character | name of the host |
| 5 | neighbourhood_group | factor | location in NYC |
| 6 | neighbourhood | factor | area in NYC |
| 7 | latitude | int | latitude coordinates |
| 8 | longitude | int | longitude coordinates |
| 9 | room_type | factor | listing space type |
| 10 | price | int | price in dollars |
| 11 | minimum_nights | int | amount of minimum nights |
| 12 | number_of_reviews | int | number of reviews |
| 13 | last_review | date | latest review |
| 14 | reviews_per_month | int | number of reviews per month |
| 15 | calculated_host_listings_count | int | amount of listing per host |
| 16 | availability_365 | int | number of days when listing is available for booking |
We do an analysis to find out the type of listings that are common to a particular neighbourhood.
property_df <- airbnb_data %>%
group_by(neighbourhood_group, room_type) %>%
summarize(Freq = n())
# propertydf <- propertydf %>%
# filter(property_type %in% c("Apartment","House","Condominium","Townhouse", "Loft"))
total_property <- airbnb_data %>%
filter(room_type %in% c("Private room","Entire home/apt","Entire home/apt")) %>%
group_by(neighbourhood_group) %>%
summarize(sum = n())
property_ratio <- merge (property_df, total_property, by="neighbourhood_group")
property_ratio <- property_ratio %>%
mutate(ratio = Freq/sum)
ggplot(property_ratio, aes(x=neighbourhood_group, y = ratio, fill = room_type)) +
geom_bar(position = "dodge", stat="identity") +
xlab("Borough") + ylab ("Count") +
scale_fill_discrete(name = "Property Type") +
scale_y_continuous(labels = scales::percent) +
ggtitle("Which types of Listings are there in NYC?",
subtitle = "Map showing Count of Listing Type by Borough ") +
theme(plot.title = element_text(face = "bold", size = 14) ) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35", hjust = 0.5)) +
theme(plot.caption = element_text(color = "grey68"))+scale_color_gradient(low="#d3cbcb", high="#852eaa")+
scale_fill_manual("Property Type", values=c("#e06f69","#357b8a", "#7db5b8", "#59c6f3", "#f6c458")) +
xlab("Neighborhood") + ylab("Percentage")
We want to see the average prices of listing in every neighbourhood.
airbnb_data %>%
filter(!(is.na(neighbourhood_group))) %>%
filter(!(neighbourhood_group == "Unknown")) %>%
group_by(neighbourhood_group) %>%
summarise(mean_price = mean(price, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(neighbourhood_group, mean_price), y = mean_price, fill = neighbourhood_group)) +
geom_col(stat ="identity", color = "black", fill="#357b8a") +
coord_flip() +
theme_gray() +
labs(x = "Neighbourhood Group", y = "Price") +
geom_text(aes(label = round(mean_price,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
ggtitle("Mean Price comparison for each Neighbourhood Group", subtitle = "Price vs Neighbourhood Group") +
xlab("Neighbourhood Group") +
ylab("Mean Price") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
airbnb_data %>%
filter(!(is.na(room_type))) %>%
filter(!(room_type == "Unknown")) %>%
group_by(room_type) %>%
summarise(mean_price = mean(price, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(room_type, mean_price), y = mean_price, fill = room_type)) +
geom_col(stat ="identity", color = "black", fill="#357b8a") +
coord_flip() +
theme_gray() +
labs(x = "Room Type", y = "Price") +
geom_text(aes(label = round(mean_price,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
ggtitle("Mean Price comparison with all Room Types", subtitle = "Price vs Room Type") +
xlab("Room Type") +
ylab("Mean Price") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Prior to modelling, we will split the data into Training set and Testing set so that we can use the testing set to validate our model. As it is a good practice, we are splitting the dataset in to parts in the ratio of 70:30. Training set will be 70% percent of the original data. We will use the test dataset in the future for testing and prediction purposes. Objects with price equal to 0 will be ommited since price can’t be 0 (faulty records). In order to remove the outliers, we are filtering the airbnb data by removing the extreme values of price from both sides (10% from both the end).They would make predictive models significantly weaker.
airbnb_filtered_data <- airbnb_data %>%
filter(price < quantile(airbnb_data$price, 0.9) & price > quantile(airbnb_data$price, 0.1)) %>%
drop_na()
set.seed(123456)
airbnb_filtered_data <- airbnb_filtered_data %>% mutate(id = row_number())
airbnb_train <- airbnb_filtered_data %>% sample_frac(.7) %>% filter(price > 0)
airbnb_test <- anti_join(airbnb_filtered_data, airbnb_train, by = 'id') %>% filter(price > 0)
# sanity check
nrow(airbnb_train) + nrow(airbnb_test) == nrow(airbnb_filtered_data %>% filter(price > 0))
## [1] TRUE
In our model, we won’t be considering the below variables and the reasons for the same are provided in the summary below.
Summary of Variables Excluded:
Hence, we try to predict the price of the airbnbs using the remaining covariates:
We will build our initial linear model using all the variables that we have selected for the study.
airbnb_model_1 <- lm (price ~ neighbourhood_group + latitude + longitude + room_type + minimum_nights + number_of_reviews + reviews_per_month + calculated_host_listings_count +
availability_365, data = airbnb_train)
summary_model_1 <- summary(airbnb_model_1)
mse_1 <- summary_model_1$sigma^2
r_sq_1 <- summary_model_1$r.squared
adj_r_sq_1 <- summary_model_1$adj.r.squared
summary_model_1
par(mfrow=c(2,2))
plot(airbnb_model_1)
We will use both Subset regression method as well as step-wise regression for variable selection and see how the models differ in from each other.
best_fit_model <- regsubsets (price ~ neighbourhood_group + latitude + longitude + room_type + minimum_nights + number_of_reviews +
reviews_per_month +calculated_host_listings_count + availability_365, data = airbnb_train, nbest = 2, nvmax = 9)
summary(best_fit_model)
plot(best_fit_model, scale="bic")
Next we will create a model using the the above mentioned covariates
airbnb_model_3 <- lm (price ~ room_type + neighbourhood_group + latitude + longitude + minimum_nights +
availability_365 + calculated_host_listings_count , data = airbnb_train, nbest = 2, nvmax = 9)
summary_model_3 <- summary(airbnb_model_3)
mse_3 <- summary_model_3$sigma^2
r_sq_3 <- summary_model_3$r.squared
adj_r_sq_3 <- summary_model_3$adj.r.squared
summary_model_3
null <- lm(price~1, data = airbnb_train)
full <- lm(price ~ neighbourhood_group + latitude + longitude + room_type + minimum_nights + number_of_reviews +
reviews_per_month +calculated_host_listings_count + availability_365, data = airbnb_train)
step(null, scope =list(lower=null, upper= full), direction = "both")
## Start: AIC=176704.4
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + room_type 2 22890604 42906354 167264
## + longitude 1 5096488 60700471 174925
## + neighbourhood_group 4 4998958 60798000 174967
## + calculated_host_listings_count 1 1507974 64288984 176194
## + reviews_per_month 1 262173 65534786 176618
## + number_of_reviews 1 80578 65716380 176679
## + availability_365 1 75390 65721569 176681
## + minimum_nights 1 45410 65751548 176691
## + latitude 1 44731 65752227 176691
## <none> 65796958 176704
##
## Step: AIC=167263.8
## price ~ room_type
##
## Df Sum of Sq RSS AIC
## + neighbourhood_group 4 3918120 38988234 165156
## + longitude 1 2996630 39909724 165666
## + calculated_host_listings_count 1 745452 42160902 166879
## + availability_365 1 297661 42608694 167112
## + latitude 1 258346 42648008 167132
## + reviews_per_month 1 48080 42858274 167241
## + minimum_nights 1 42643 42863712 167244
## + number_of_reviews 1 18543 42887811 167256
## <none> 42906354 167264
## - room_type 2 22890604 65796958 176704
##
## Step: AIC=165156.4
## price ~ room_type + neighbourhood_group
##
## Df Sum of Sq RSS AIC
## + longitude 1 1260466 37727769 164432
## + availability_365 1 545720 38442514 164847
## + latitude 1 518694 38469540 164863
## + calculated_host_listings_count 1 420175 38568060 164919
## + minimum_nights 1 117965 38870269 165091
## + number_of_reviews 1 3875 38984359 165156
## <none> 38988234 165156
## + reviews_per_month 1 952 38987282 165158
## - neighbourhood_group 4 3918120 42906354 167264
## - room_type 2 21809766 60798000 174967
##
## Step: AIC=164432.5
## price ~ room_type + neighbourhood_group + longitude
##
## Df Sum of Sq RSS AIC
## + availability_365 1 686845 37040923 164029
## + calculated_host_listings_count 1 325486 37402282 164243
## + latitude 1 238175 37489594 164295
## + minimum_nights 1 142868 37584901 164351
## + reviews_per_month 1 8272 37719497 164430
## <none> 37727769 164432
## + number_of_reviews 1 228 37727540 164434
## - longitude 1 1260466 38988234 165156
## - neighbourhood_group 4 2181956 39909724 165666
## - room_type 2 20652266 58380035 174072
##
## Step: AIC=164028.6
## price ~ room_type + neighbourhood_group + longitude + availability_365
##
## Df Sum of Sq RSS AIC
## + minimum_nights 1 230804 36810120 163893
## + latitude 1 177240 36863683 163925
## + calculated_host_listings_count 1 162446 36878477 163934
## + number_of_reviews 1 34202 37006721 164010
## <none> 37040923 164029
## + reviews_per_month 1 2133 37038790 164029
## - availability_365 1 686845 37727769 164432
## - longitude 1 1401591 38442514 164847
## - neighbourhood_group 4 2293463 39334386 165348
## - room_type 2 20937925 57978849 173922
##
## Step: AIC=163892.5
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## minimum_nights
##
## Df Sum of Sq RSS AIC
## + calculated_host_listings_count 1 177451 36632669 163788
## + latitude 1 169853 36640267 163792
## + number_of_reviews 1 52650 36757470 163863
## + reviews_per_month 1 12162 36797958 163887
## <none> 36810120 163893
## - minimum_nights 1 230804 37040923 164029
## - availability_365 1 774781 37584901 164351
## - longitude 1 1445262 38255381 164741
## - neighbourhood_group 4 2367828 39177948 165262
## - room_type 2 21161888 57972008 173922
##
## Step: AIC=163787.8
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## minimum_nights + calculated_host_listings_count
##
## Df Sum of Sq RSS AIC
## + latitude 1 147040 36485629 163701
## + number_of_reviews 1 36217 36596452 163768
## + reviews_per_month 1 9472 36623198 163784
## <none> 36632669 163788
## - calculated_host_listings_count 1 177451 36810120 163893
## - minimum_nights 1 245808 36878477 163934
## - availability_365 1 598275 37230945 164144
## - longitude 1 1347564 37980233 164584
## - neighbourhood_group 4 2206610 38839279 165072
## - room_type 2 20766645 57399315 173704
##
## Step: AIC=163700.9
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## minimum_nights + calculated_host_listings_count + latitude
##
## Df Sum of Sq RSS AIC
## + number_of_reviews 1 35080 36450549 163682
## + reviews_per_month 1 13251 36472379 163695
## <none> 36485629 163701
## - latitude 1 147040 36632669 163788
## - calculated_host_listings_count 1 154637 36640267 163792
## - minimum_nights 1 237680 36723309 163842
## - availability_365 1 555879 37041509 164033
## - longitude 1 1101353 37586982 164356
## - neighbourhood_group 4 1912183 38397812 164821
## - room_type 2 20287824 56773454 173464
##
## Step: AIC=163681.7
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## minimum_nights + calculated_host_listings_count + latitude +
## number_of_reviews
##
## Df Sum of Sq RSS AIC
## <none> 36450549 163682
## + reviews_per_month 1 272 36450277 163684
## - number_of_reviews 1 35080 36485629 163701
## - calculated_host_listings_count 1 139762 36590312 163764
## - latitude 1 145903 36596452 163768
## - minimum_nights 1 252072 36702622 163832
## - availability_365 1 590741 37041291 164035
## - longitude 1 1096108 37546657 164334
## - neighbourhood_group 4 1913238 38363788 164804
## - room_type 2 20288372 56738921 173453
##
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + longitude +
## availability_365 + minimum_nights + calculated_host_listings_count +
## latitude + number_of_reviews, data = airbnb_train)
##
## Coefficients:
## (Intercept) room_typePrivate room
## -1.471e+04 -6.156e+01
## room_typeShared room neighbourhood_groupBrooklyn
## -7.455e+01 -7.033e+00
## neighbourhood_groupManhattan neighbourhood_groupQueens
## 1.683e+01 3.661e+00
## neighbourhood_groupStaten Island longitude
## -7.201e+01 -2.432e+02
## availability_365 minimum_nights
## 4.360e-02 -1.909e-01
## calculated_host_listings_count latitude
## 1.063e-01 -7.698e+01
## number_of_reviews
## -2.631e-02
The covariates given by this fit are as follows:
Hence, Building a Model using this co variates:
airbnb_model_4 <- lm(price ~ room_type + neighbourhood_group + longitude + availability_365 + calculated_host_listings_count + minimum_nights + latitude + number_of_reviews,
data = airbnb_train, nbest = 2, nvmax = 9)
summary_model_4 <- summary(airbnb_model_4)
mse_4 <- summary_model_4$sigma^2
r_sq_4 <- summary_model_4$r.squared
adj_r_sq_4 <- summary_model_4$adj.r.squared
summary_model_4
null <- lm(price~1, data = airbnb_train)
full <- lm(price ~ neighbourhood_group + latitude + longitude + room_type + minimum_nights + number_of_reviews +
reviews_per_month +calculated_host_listings_count + availability_365, data = airbnb_train)
n=dim(airbnb_train[1])
step(null, scope =list(lower=null, upper= full), k=log(n), direction = "both")
## Start: AIC=176712.4
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + room_type 2 22890604 42906354 167288
## + longitude 1 5096488 60700471 174921
## + neighbourhood_group 4 4998958 60798000 174957
## + calculated_host_listings_count 1 1507974 64288984 176210
## + reviews_per_month 1 262173 65534786 176614
## + availability_365 1 75390 65721569 176677
## + minimum_nights 1 45410 65751548 176687
## + number_of_reviews 1 80578 65716380 176695
## + latitude 1 44731 65752227 176707
## <none> 65796958 176712
##
## Step: AIC=167287.8
## price ~ room_type
##
## Df Sum of Sq RSS AIC
## + neighbourhood_group 4 3918120 38988234 165142
## + longitude 1 2996630 39909724 165658
## + calculated_host_listings_count 1 745452 42160902 166871
## + availability_365 1 297661 42608694 167144
## + latitude 1 258346 42648008 167164
## + number_of_reviews 1 18543 42887811 167248
## + reviews_per_month 1 48080 42858274 167273
## + minimum_nights 1 42643 42863712 167276
## <none> 42906354 167288
## - room_type 2 22890604 65796958 176702
##
## Step: AIC=165212.4
## price ~ room_type + neighbourhood_group
##
## Df Sum of Sq RSS AIC
## + longitude 1 1260466 37727769 164496
## + availability_365 1 545720 38442514 164831
## + latitude 1 518694 38469540 164847
## + calculated_host_listings_count 1 420175 38568060 164983
## + minimum_nights 1 117965 38870269 165075
## + reviews_per_month 1 952 38987282 165142
## <none> 38988234 165212
## + number_of_reviews 1 3875 38984359 165220
## - neighbourhood_group 4 3918120 42906354 167288
## - room_type 2 21809766 60798000 174957
##
## Step: AIC=164496.5
## price ~ room_type + neighbourhood_group + longitude
##
## Df Sum of Sq RSS AIC
## + availability_365 1 686845 37040923 164101
## + calculated_host_listings_count 1 325486 37402282 164225
## + latitude 1 238175 37489594 164277
## + number_of_reviews 1 228 37727540 164416
## + minimum_nights 1 142868 37584901 164423
## <none> 37727769 164496
## + reviews_per_month 1 8272 37719497 164502
## - longitude 1 1260466 38988234 165142
## - neighbourhood_group 4 2181956 39909724 165698
## - room_type 2 20652266 58380035 174060
##
## Step: AIC=164100.6
## price ~ room_type + neighbourhood_group + longitude + availability_365
##
## Df Sum of Sq RSS AIC
## + latitude 1 177240 36863683 163905
## + calculated_host_listings_count 1 162446 36878477 163914
## + minimum_nights 1 230804 36810120 163973
## + number_of_reviews 1 34202 37006721 163990
## <none> 37040923 164101
## + reviews_per_month 1 2133 37038790 164109
## - availability_365 1 686845 37727769 164496
## - longitude 1 1401591 38442514 164831
## - neighbourhood_group 4 2293463 39334386 165388
## - room_type 2 20937925 57978849 173908
##
## Step: AIC=164004.7
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## latitude
##
## Df Sum of Sq RSS AIC
## + minimum_nights 1 223417 36640267 163770
## + reviews_per_month 1 4297 36859387 163902
## + calculated_host_listings_count 1 140374 36723309 163930
## + number_of_reviews 1 32252 36831432 163995
## <none> 36863683 164005
## - latitude 1 177240 37040923 164011
## - availability_365 1 625910 37489594 164367
## - longitude 1 1123646 37987329 164568
## - neighbourhood_group 4 2032199 38895883 165150
## - room_type 2 20396760 57260444 173633
##
## Step: AIC=163880.4
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## latitude + minimum_nights
##
## Df Sum of Sq RSS AIC
## + calculated_host_listings_count 1 154637 36485629 163677
## + number_of_reviews 1 49955 36590312 163740
## - latitude 1 169853 36810120 163873
## <none> 36640267 163880
## + reviews_per_month 1 16497 36623769 163880
## - minimum_nights 1 223417 36863683 164005
## - availability_365 1 709581 37349848 164294
## - longitude 1 1165230 37805497 164462
## - neighbourhood_group 4 2075217 38715483 165057
## - room_type 2 20614734 57255001 173631
##
## Step: AIC=163797
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## latitude + minimum_nights + calculated_host_listings_count
##
## Df Sum of Sq RSS AIC
## + number_of_reviews 1 35080 36450549 163656
## - latitude 1 147040 36632669 163766
## - calculated_host_listings_count 1 154637 36640267 163770
## <none> 36485629 163797
## + reviews_per_month 1 13251 36472379 163799
## - minimum_nights 1 237680 36723309 163930
## - availability_365 1 555879 37041509 164121
## - longitude 1 1101353 37586982 164334
## - neighbourhood_group 4 1912183 38397812 164885
## - room_type 2 20287824 56773454 173444
##
## Step: AIC=163785.7
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## latitude + minimum_nights + calculated_host_listings_count +
## number_of_reviews
##
## Df Sum of Sq RSS AIC
## + reviews_per_month 1 272 36450277 163656
## - calculated_host_listings_count 1 139762 36590312 163740
## - latitude 1 145903 36596452 163744
## <none> 36450549 163786
## - number_of_reviews 1 35080 36485629 163797
## - minimum_nights 1 252072 36702622 163928
## - availability_365 1 590741 37041291 164131
## - longitude 1 1096108 37546657 164310
## - neighbourhood_group 4 1913238 38363788 164876
## - room_type 2 20288372 56738921 173431
##
## Step: AIC=163795.6
## price ~ room_type + neighbourhood_group + longitude + availability_365 +
## latitude + minimum_nights + calculated_host_listings_count +
## number_of_reviews + reviews_per_month
##
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + longitude +
## availability_365 + latitude + minimum_nights + calculated_host_listings_count +
## number_of_reviews + reviews_per_month, data = airbnb_train)
##
## Coefficients:
## (Intercept) room_typePrivate room
## -1.467e+04 -6.156e+01
## room_typeShared room neighbourhood_groupBrooklyn
## -7.453e+01 -7.077e+00
## neighbourhood_groupManhattan neighbourhood_groupQueens
## 1.681e+01 3.635e+00
## neighbourhood_groupStaten Island longitude
## -7.196e+01 -2.428e+02
## availability_365 latitude
## 4.367e-02 -7.718e+01
## minimum_nights calculated_host_listings_count
## -1.915e-01 1.063e-01
## number_of_reviews reviews_per_month
## -2.482e-02 -8.301e-02
The covariates given by this fit are as follows:
Hence, Building a Model using this co variates:
airbnb_model_5 <- lm(price ~ room_type + neighbourhood_group + longitude + availability_365 + minimum_nights + latitude + calculated_host_listings_count +
number_of_reviews + reviews_per_month, data = airbnb_train, nbest = 2, nvmax = 9)
summary_model_5 <- summary(airbnb_model_5)
mse_5 <- summary_model_5$sigma^2
r_sq_5 <- summary_model_5$r.squared
adj_r_sq_5 <- summary_model_5$adj.r.squared
summary_model_5
Next we use the Lasso Regression method for variable selection.
lasso_fit <- glmnet(x = as.matrix(airbnb_train[, -c(which(colnames(airbnb_train) %in% c("room_type", "neighbourhood_group","price", "name", "host_name",
"neighbourhood", "last_review", "host_id", "id")))]),
y = airbnb_train$price, alpha = 0.5)
coef(lasso_fit,s = lasso_fit$lambda.min)
## 8 x 65 sparse Matrix of class "dgCMatrix"
##
## (Intercept) 120.4523 -1629.47633 -3285.14766
## latitude . . .
## longitude . -23.66303 -46.05149
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . .
## calculated_host_listings_count . . .
## availability_365 . . .
##
## (Intercept) -4847.38927 -6317.69968 -7698.1541
## latitude . . .
## longitude -67.17657 -87.05852 -105.7254
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . .
## calculated_host_listings_count . . .
## availability_365 . . .
##
## (Intercept) -8991.3110 -10200.1229 -1.128308e+04
## latitude . . .
## longitude -123.2119 -139.5578 -1.542010e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . .
## calculated_host_listings_count . . 1.297033e-02
## availability_365 . . .
##
## (Intercept) -1.225891e+04 -1.316478e+04 -1.400446e+04
## latitude . . .
## longitude -1.673950e+02 -1.796432e+02 -1.909964e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . .
## calculated_host_listings_count 3.413940e-02 5.378151e-02 7.198001e-02
## availability_365 . . .
##
## (Intercept) -1.478169e+04 -1.550019e+04 -1.616360e+04
## latitude . . .
## longitude -2.015052e+02 -2.112200e+02 -2.201900e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . .
## calculated_host_listings_count 8.881807e-02 1.043778e-01 1.187394e-01
## availability_365 . . .
##
## (Intercept) -16775.488962 -1.733927e+04 -1.785824e+04
## latitude . . .
## longitude -228.463246 -2.360861e+02 -2.431030e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . .
## calculated_host_listings_count 0.131981 1.441778e-01 1.554021e-01
## availability_365 . . .
##
## (Intercept) -1.833555e+04 -1.877421e+04 -1.919758e+04
## latitude . . .
## longitude -2.495567e+02 -2.554877e+02 -2.612111e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . .
## calculated_host_listings_count 1.657226e-01 1.752049e-01 1.831789e-01
## availability_365 . . 7.257962e-04
##
## (Intercept) -1.973191e+04 -2.029562e+04 -2.081313e+04
## latitude 2.480943e+00 6.294373e+00 9.794989e+00
## longitude -2.670672e+02 -2.725869e+02 -2.776545e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . .
## calculated_host_listings_count 1.894060e-01 1.950260e-01 2.001623e-01
## availability_365 2.370588e-03 3.909134e-03 5.321501e-03
##
## (Intercept) -2.128786e+04 -2.172311e+04 -2.209790e+04
## latitude 1.300597e+01 1.594960e+01 1.860508e+01
## longitude -2.823033e+02 -2.865656e+02 -2.901703e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month . . -6.284334e-02
## calculated_host_listings_count 2.048558e-01 2.091435e-01 2.129058e-01
## availability_365 6.617165e-03 7.805081e-03 9.036752e-03
##
## (Intercept) -2.243553e+04 -2.274475e+04 -2.302766e+04
## latitude 2.102348e+01 2.323952e+01 2.526713e+01
## longitude -2.934032e+02 -2.963636e+02 -2.990719e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month -1.344146e-01 -1.998507e-01 -2.597120e-01
## calculated_host_listings_count 2.162946e-01 2.193910e-01 2.222159e-01
## availability_365 1.019773e-02 1.126096e-02 1.223440e-02
##
## (Intercept) -2.328639e+04 -2.352295e+04 -2.373916e+04
## latitude 2.712167e+01 2.881734e+01 3.036728e+01
## longitude -3.015488e+02 -3.038132e+02 -3.058829e+02
## minimum_nights . . .
## number_of_reviews . . .
## reviews_per_month -3.144546e-01 -3.645004e-01 -4.102389e-01
## calculated_host_listings_count 2.247930e-01 2.271436e-01 2.292875e-01
## availability_365 1.312527e-02 1.394026e-02 1.468557e-02
##
## (Intercept) -2.393886e+04 -2.412593e+04 -2.429536e+04
## latitude 3.179072e+01 3.309745e+01 3.429044e+01
## longitude -3.077990e+02 -3.096087e+02 -3.112425e+02
## minimum_nights . . .
## number_of_reviews -9.641900e-04 -2.101625e-03 -3.083937e-03
## reviews_per_month -4.365266e-01 -4.566604e-01 -4.757090e-01
## calculated_host_listings_count 2.311111e-01 2.326949e-01 2.341687e-01
## availability_365 1.541139e-02 1.609126e-02 1.670775e-02
##
## (Intercept) -2.445006e+04 -2.459131e+04 -2.471792e+04
## latitude 3.537976e+01 3.637435e+01 3.727417e+01
## longitude -3.127342e+02 -3.140963e+02 -3.153126e+02
## minimum_nights . . .
## number_of_reviews -3.979467e-03 -4.797363e-03 -5.473288e-03
## reviews_per_month -4.931019e-01 -5.089503e-01 -5.245033e-01
## calculated_host_listings_count 2.355118e-01 2.367358e-01 2.378731e-01
## availability_365 1.727086e-02 1.778517e-02 1.824980e-02
##
## (Intercept) -2.483564e+04 -2.495097e+04 -2.505244e+04
## latitude 3.810293e+01 3.889598e+01 3.960811e+01
## longitude -3.164478e+02 -3.175706e+02 -3.185505e+02
## minimum_nights . -2.495664e-03 -5.212982e-03
## number_of_reviews -6.153489e-03 -6.878866e-03 -7.396819e-03
## reviews_per_month -5.377084e-01 -5.515890e-01 -5.672199e-01
## calculated_host_listings_count 2.388885e-01 2.398767e-01 2.408309e-01
## availability_365 1.867860e-02 1.912076e-02 1.952244e-02
##
## (Intercept) -2.514724e+04 -2.523390e+04 -2.531298e+04
## latitude 4.026502e+01 4.086485e+01 4.141208e+01
## longitude -3.194708e+02 -3.203123e+02 -3.210802e+02
## minimum_nights -7.672802e-03 -9.913638e-03 -1.195763e-02
## number_of_reviews -7.932288e-03 -8.427774e-03 -8.880560e-03
## reviews_per_month -5.805522e-01 -5.925743e-01 -6.035199e-01
## calculated_host_listings_count 2.416734e-01 2.424404e-01 2.431393e-01
## availability_365 1.989376e-02 2.023290e-02 2.054238e-02
##
## (Intercept) -2.538510e+04 -2.545089e+04 -2.551089e+04
## latitude 4.191122e+01 4.236645e+01 4.278160e+01
## longitude -3.217807e+02 -3.224196e+02 -3.230022e+02
## minimum_nights -1.382218e-02 -1.552291e-02 -1.707406e-02
## number_of_reviews -9.293679e-03 -9.670505e-03 -1.001419e-02
## reviews_per_month -6.134967e-01 -6.225913e-01 -6.308816e-01
## calculated_host_listings_count 2.437763e-01 2.443567e-01 2.448856e-01
## availability_365 2.082473e-02 2.108230e-02 2.131724e-02
##
## (Intercept) -2.556560e+04 -2.561549e+04 -2.566097e+04
## latitude 4.316016e+01 4.350535e+01 4.382007e+01
## longitude -3.235336e+02 -3.240181e+02 -3.244599e+02
## minimum_nights -1.848866e-02 -1.977864e-02 -2.095490e-02
## number_of_reviews -1.032761e-02 -1.061343e-02 -1.087404e-02
## reviews_per_month -6.384382e-01 -6.453260e-01 -6.516039e-01
## calculated_host_listings_count 2.453677e-01 2.458069e-01 2.462071e-01
## availability_365 2.153151e-02 2.172693e-02 2.190512e-02
##
## (Intercept) -2.570245e+04 -2.573846e+04 -2.577272e+04
## latitude 4.410700e+01 4.436415e+01 4.460218e+01
## longitude -3.248627e+02 -3.252081e+02 -3.255404e+02
## minimum_nights -2.202737e-02 -2.296799e-02 -2.386341e-02
## number_of_reviews -1.111166e-02 -1.128886e-02 -1.147699e-02
## reviews_per_month -6.573258e-01 -6.625114e-01 -6.674357e-01
## calculated_host_listings_count 2.465719e-01 2.469592e-01 2.472637e-01
## availability_365 2.206761e-02 2.220894e-02 2.234356e-02
##
## (Intercept) -2.580421e+04 -2.583299e+04 -2.585925e+04
## latitude 4.481978e+01 4.501828e+01 4.519925e+01
## longitude -3.258464e+02 -3.261262e+02 -3.263816e+02
## minimum_nights -2.467998e-02 -2.542375e-02 -2.610148e-02
## number_of_reviews -1.165707e-02 -1.182407e-02 -1.197725e-02
## reviews_per_month -6.718217e-01 -6.757751e-01 -6.793626e-01
## calculated_host_listings_count 2.475361e-01 2.477834e-01 2.480083e-01
## availability_365 2.246707e-02 2.257985e-02 2.268271e-02
##
## (Intercept) -2.588319e+04 -2.590501e+04 -2.592490e+04
## latitude 4.536422e+01 4.551458e+01 4.565163e+01
## longitude -3.266145e+02 -3.268267e+02 -3.270202e+02
## minimum_nights -2.671916e-02 -2.728213e-02 -2.779525e-02
## number_of_reviews -1.211720e-02 -1.224487e-02 -1.236127e-02
## reviews_per_month -6.826269e-01 -6.855999e-01 -6.883085e-01
## calculated_host_listings_count 2.482132e-01 2.483999e-01 2.485700e-01
## availability_365 2.277650e-02 2.286200e-02 2.293993e-02
##
## (Intercept) -2.594303e+04 -2.595955e+04 -2.597461e+04
## latitude 4.577653e+01 4.589037e+01 4.599412e+01
## longitude -3.271966e+02 -3.273573e+02 -3.275038e+02
## minimum_nights -2.826292e-02 -2.868916e-02 -2.907762e-02
## number_of_reviews -1.246738e-02 -1.256409e-02 -1.265224e-02
## reviews_per_month -6.907766e-01 -6.930256e-01 -6.950750e-01
## calculated_host_listings_count 2.487249e-01 2.488661e-01 2.489948e-01
## availability_365 2.301096e-02 2.307570e-02 2.313471e-02
##
## (Intercept) -2.598833e+04 -2.600084e+04
## latitude 4.608866e+01 4.617483e+01
## longitude -3.276373e+02 -3.277590e+02
## minimum_nights -2.943166e-02 -2.975431e-02
## number_of_reviews -1.273257e-02 -1.280578e-02
## reviews_per_month -6.969424e-01 -6.986441e-01
## calculated_host_listings_count 2.491120e-01 2.492188e-01
## availability_365 2.318849e-02 2.323750e-02
The covariates given out by this fit are:
Hence, we will build a Model using these covariates
airbnb_model_6 <- lm(price ~ latitude + longitude + minimum_nights + number_of_reviews + reviews_per_month + calculated_host_listings_count + availability_365,
data = airbnb_train, nbest = 2, nvmax = 9)
summary_model_6 <- summary(airbnb_model_6)
#mse_6 <- mean(summary_model_6$sigma^2)
mse_6 <- mean(summary_model_6$residuals^2)
r_sq_6 <- summary_model_6$r.squared
adj_r_sq_6 <- summary_model_6$adj.r.squared
summary_model_6
##
## Call:
## lm(formula = price ~ latitude + longitude + minimum_nights +
## number_of_reviews + reviews_per_month + calculated_host_listings_count +
## availability_365, data = airbnb_train, nbest = 2, nvmax = 9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.59 -41.18 -10.53 32.33 208.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.613e+04 6.616e+02 -39.495 < 2e-16 ***
## latitude 4.706e+01 6.436e+00 7.312 2.72e-13 ***
## longitude -3.290e+02 7.914e+00 -41.572 < 2e-16 ***
## minimum_nights -3.307e-02 1.973e-02 -1.676 0.09370 .
## number_of_reviews -1.356e-02 8.632e-03 -1.571 0.11627
## reviews_per_month -7.161e-01 2.594e-01 -2.761 0.00577 **
## calculated_host_listings_count 2.503e-01 1.458e-02 17.170 < 2e-16 ***
## availability_365 2.374e-02 2.918e-03 8.136 4.30e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.84 on 22082 degrees of freedom
## Multiple R-squared: 0.0981, Adjusted R-squared: 0.09782
## F-statistic: 343.1 on 7 and 22082 DF, p-value: < 2.2e-16
Cross Validation of Lasso Fit:
cv_lasso_fit = cv.glmnet(x = as.matrix(airbnb_train[, -c(which(colnames(airbnb_train) %in% c("room_type", "neighbourhood_group","price" , "name", "host_name",
"neighbourhood", "last_review")))]),
y = airbnb_train$price, alpha = 1, nfolds = 5)
plot(cv_lasso_fit)
In order to pick a model from the ones that we have built, we will do a comparison MSPE and adjusted R-squared of all the models and choose the model with the best combination of both.
#Reading the variable summary excel File
model_sum <- read_excel("Excel Files/Model_Comparison.xlsx")
kable(model_sum) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T, )
| Model Name | MSE | R-Squared | Adjusted R-Squared |
|---|---|---|---|
| Linear Regression Model | 1651 | 0.4460 | 0.44600 |
| Best Subset Regression Method | 1653 | 0.4450 | 0.44500 |
| Stepwise Regression using AIC | 1651 | 0.4460 | 0.44500 |
| Stepwise Regression using BIC | 1651 | 0.4460 | 0.44500 |
| Lasso Fit Model | 2687 | 0.0981 | 0.09782 |
**Model 5** (stepwise regression using BIC) as BIC is considered to be more Conservative and it gives the Lesser number of Variables when compared to AIC, thus making the model more interpretable.summary_model_5 <- summary(airbnb_model_5)
mse_5 <- summary_model_5$sigma^2
r_sq_5 <- summary_model_5$r.squared
adj_r_sq_5 <- summary_model_5$adj.r.squared
summary_model_5
##
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + longitude +
## availability_365 + minimum_nights + latitude + calculated_host_listings_count +
## number_of_reviews + reviews_per_month, data = airbnb_train,
## nbest = 2, nvmax = 9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -116.972 -26.271 -6.395 21.015 234.809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.467e+04 8.420e+02 -17.424 < 2e-16
## room_typePrivate room -6.156e+01 5.635e-01 -109.234 < 2e-16
## room_typeShared room -7.453e+01 2.523e+00 -29.542 < 2e-16
## neighbourhood_groupBrooklyn -7.077e+00 2.463e+00 -2.873 0.004066
## neighbourhood_groupManhattan 1.681e+01 2.267e+00 7.416 1.25e-13
## neighbourhood_groupQueens 3.635e+00 2.386e+00 1.524 0.127622
## neighbourhood_groupStaten Island -7.196e+01 4.569e+00 -15.750 < 2e-16
## longitude -2.428e+02 9.494e+00 -25.575 < 2e-16
## availability_365 4.367e-02 2.310e-03 18.900 < 2e-16
## minimum_nights -1.915e-01 1.553e-02 -12.336 < 2e-16
## latitude -7.718e+01 8.204e+00 -9.408 < 2e-16
## calculated_host_listings_count 1.063e-01 1.155e-02 9.207 < 2e-16
## number_of_reviews -2.482e-02 6.783e-03 -3.659 0.000254
## reviews_per_month -8.301e-02 2.044e-01 -0.406 0.684616
##
## (Intercept) ***
## room_typePrivate room ***
## room_typeShared room ***
## neighbourhood_groupBrooklyn **
## neighbourhood_groupManhattan ***
## neighbourhood_groupQueens
## neighbourhood_groupStaten Island ***
## longitude ***
## availability_365 ***
## minimum_nights ***
## latitude ***
## calculated_host_listings_count ***
## number_of_reviews ***
## reviews_per_month
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.63 on 22076 degrees of freedom
## Multiple R-squared: 0.446, Adjusted R-squared: 0.4457
## F-statistic: 1367 on 13 and 22076 DF, p-value: < 2.2e-16
(summary_model_5$sigma)^2
## [1] 1651.127
summary_model_5$r.squared
## [1] 0.4460188
summary_model_5$adj.r.squared
## [1] 0.4456926
To evaluate how the model performs on future data, we use predict() to get the predicted values from the test set.
#pi is a vector that contains predicted values for test set.
pi <- predict(object = airbnb_model_5, newdata = airbnb_test)
mean((pi - airbnb_test$price)^2)
## [1] 1632.556
mean(abs(pi - airbnb_test$price))
## [1] 31.0044
model_glm_1 = glm(price~neighbourhood_group + latitude + longitude + room_type + minimum_nights + number_of_reviews +
reviews_per_month +calculated_host_listings_count + availability_365, data = airbnb_filtered_data)
cv.glm(data =airbnb_filtered_data, glmfit = model_glm_1, K = 3)$delta[2]
## [1] 1646.619
Comparing the MSE of Test Dataset which is equal to 1633, and the MSPE of the Full Data which is 1646, we can see that the values are almost similar. Hence the variables that we have selected for our model are good estimators of our dependent variable.
To Analyze the New York Air Bnb dataset and find the interesting insights related to the factors that affect the prices of listing in New York city. We tried to answer the following questions using our analysis:
Date Used and Methodology: The data is collected from an online data-source Kaggle.
We used multiple regression method to create a model that would help predict prices of listing based on relevant factors. We started with importing and cleaning the data followed by exploratory analysis to gain some interesting insights. For model building, we decided to split the dataset into training and testing datasets in the ratio of 70:30. We created an initial model using all the variables we thought were relevant, followed by building models using 4 different variable selections methods i.e. stepwise-regression methods (using AIC and BIC both), regression subset method and Lasso regression methods. MSE, R^2, Adj R^2 of these 5 models were then compared in order to pick the best model. Next, we performed in-sample validation, out-sample prediction and cross-validation techniques on the final selected model. Upon validation, we found that the MSPE of Test Dataset i.e. 1633 is almost similar to the MSPE of the Full Data i.e. 1646. Hence, we could conclude that the Model selected for analysis, predicts the variable with a good fit.
Interesting insights from Analysis The above analysis helped us understand the New York Air Bnb dataset better. Following insights were drawn from it:
This information can be really helpful for multiple categories of the society such as tourists, house owners, real estate agents etc. It can help customers decide which neighborhoods they should look into depending on their needs and house owners can decide on the prices they set on the listings taking all these factors into consideration.
The Final Model selected was Model 5. This Model was selected by Stepwise Regression using BIC (direction : both). The Observations of the Model: