Introduction
Airbnb has gained popularity over the years thanks to its short-term rentals, which can be matched to an individual's lifestyle or needs. Airbnb's services have expanded across the globe, so travelers have options other than hotels wherever they go. By looking at quantitative features (those with numerical values) of Airbnb listings in NYC, we can gain insight into which factors are associated with how expensive a listing is.
Project statement
There are hundreds of Airbnb listings in NYC that can be available for a specific location at a specific time. The number of options makes choosing where to stay more challenging. In this analysis, we look at the number of reviews and the availability of Airbnb listings in NYC to determine how they relate to the nightly price.
Method: Multiple Linear Regression
Install and Load Packages
#install.packages("readxl")
#install.packages("dplyr")
#install.packages("ggplot2")
library("readxl")
library("dplyr")
library("ggplot2")
Variables
Data Description: A description of the features used in this analysis is presented in the table below.
Variable | Definition
-------------------- | -------------
1. Number of Reviews | The number of reviews for each Airbnb listing in NYC.
2. Availability | The number of days (out of 365) on which the Airbnb listing is available.
3. Price | The nightly price per Airbnb listing in NYC.
Import the data
airbnb_data <- read_excel(file.choose())
Clean the data
ab_data <- select(airbnb_data, -id, -host_id, -name, -host_name, -neighbourhood_group, -neighbourhood, -latitude, -longitude, -room_type, -minimum_nights, -last_review, -reviews_per_month, -calculated_host_listings_count) # Removes irrelevant data
Descriptive Statistics and Visual Analysis
Descriptive Statistics
summary(ab_data) # Summarizes the data
## price number_of_reviews availability_365
## Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 69.0 1st Qu.: 1.00 1st Qu.: 0.0
## Median : 106.0 Median : 5.00 Median : 45.0
## Mean : 152.7 Mean : 23.27 Mean :112.8
## 3rd Qu.: 175.0 3rd Qu.: 24.00 3rd Qu.:227.0
## Max. :10000.0 Max. :629.00 Max. :365.0
The mean for number of reviews is 23.27.
The mean for availability is 112.80.
The mean for price is 152.70.
Scatter plot and Correlation matrix
pairs(~number_of_reviews + availability_365 + price, data = ab_data) # Scatter plot

corr <- cor(ab_data) # Correlation
corr # Outputs correlation results
## price number_of_reviews availability_365
## price 1.00000000 -0.04795423 0.08182883
## number_of_reviews -0.04795423 1.00000000 0.17202758
## availability_365 0.08182883 0.17202758 1.00000000
Interpretation: Price and number of reviews have a very weak negative correlation of -0.05. Price and availability have a weak positive correlation of 0.08, and number of reviews and availability have a weak positive correlation of 0.17.
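Rounding the correlation matrix to two decimals makes the values easier to read. The sketch below rebuilds the matrix from the values printed above rather than from the data file; the object names `corr_reported` and `corr_rounded` are illustrative.

```r
# Correlation values copied from the printed output above
vars <- c("price", "number_of_reviews", "availability_365")
corr_reported <- matrix(
  c( 1.00000000, -0.04795423, 0.08182883,
    -0.04795423,  1.00000000, 0.17202758,
     0.08182883,  0.17202758, 1.00000000),
  nrow = 3, byrow = TRUE,
  dimnames = list(vars, vars)
)

# Round to two decimals for reporting
corr_rounded <- round(corr_reported, 2)
print(corr_rounded)
```

With the real data, `round(cor(ab_data), 2)` should give the same table directly.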
Assumptions
Level of significance = 0.05.
We assume the relationship between dependent and independent variables will be linear.
Residuals must be normally distributed.
Results
Estimated Regression Equation
ŷ = 141.64 - 0.34*(Number of Reviews) + 0.17*(Availability)
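As a worked example, the sketch below plugs a hypothetical listing with 10 reviews and 100 available days into the equation. The function name `predict_price` and the example listing are assumptions made for illustration; the coefficients are the rounded values from the equation.

```r
# Rounded coefficients from the estimated regression equation
b0 <- 141.64  # intercept
b1 <- -0.34   # change in price per additional review
b2 <- 0.17    # change in price per additional available day

# Hypothetical helper: predicted nightly price for one listing
predict_price <- function(reviews, availability) {
  b0 + b1 * reviews + b2 * availability
}

# Hypothetical listing: 10 reviews, available 100 days out of 365
predicted <- predict_price(reviews = 10, availability = 100)
print(predicted)  # 141.64 - 3.40 + 17.00 = 155.24
```

With the fitted model object, `predict(model, newdata = data.frame(number_of_reviews = 10, availability_365 = 100))` would use the unrounded coefficients and may differ slightly in the last digits.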
Significant Variables
model <- lm(price ~ number_of_reviews + availability_365, data = ab_data)
summary(model)
##
## Call:
## lm(formula = price ~ number_of_reviews + availability_365, data = ab_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -197.3 -81.6 -40.0 26.6 9860.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141.639282 1.480882 95.64 <2e-16 ***
## number_of_reviews -0.344582 0.024616 -14.00 <2e-16 ***
## availability_365 0.169366 0.008332 20.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 238.9 on 48892 degrees of freedom
## Multiple R-squared: 0.01066, Adjusted R-squared: 0.01062
## F-statistic: 263.4 on 2 and 48892 DF, p-value: < 2.2e-16
Interpretation:
Both variables are significant because their p-values (each < 2e-16) are less than 0.05.
The adjusted R-squared is 0.01, which means the model explains only about 1% of the variability in price, so the model is weak.
Based on the F-statistic, the model as a whole is significant because its p-value (< 2.2e-16) is less than alpha (0.05).
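The quantities read from the printed summary can also be pulled out of the summary object in code. The sketch below uses a small simulated data set, not the Airbnb data, so it runs on its own; `toy`, `toy_model`, and the simulated coefficients are all made up for illustration.

```r
# Simulated stand-in data (NOT the Airbnb data), for illustration only
set.seed(42)
n <- 500
toy <- data.frame(
  number_of_reviews = rpois(n, 20),
  availability_365  = sample(0:365, n, replace = TRUE)
)
toy$price <- 140 - 0.3 * toy$number_of_reviews +
  0.2 * toy$availability_365 + rnorm(n, sd = 50)

toy_model <- lm(price ~ number_of_reviews + availability_365, data = toy)
s <- summary(toy_model)

# p-value for each coefficient, compared against alpha = 0.05
alpha <- 0.05
p_values <- coef(s)[, "Pr(>|t|)"]
significant <- p_values < alpha

# Adjusted R-squared and the p-value of the overall F-test
adj_r2 <- s$adj.r.squared
f <- s$fstatistic
f_p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(significant)
print(adj_r2)
print(f_p_value)
```

Replacing `toy_model` with `model` applies the same steps to the fitted Airbnb regression.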
Coefficient Interpretation
Description: A description of the coefficients in the equation.
Coefficient | Interpretation
-------------- | -------------
1. b0 = 141.64 | The expected price for an Airbnb listing in NYC with zero reviews and zero days of availability is $141.64.
2. b1 = -0.34 | Price is expected to decrease by $0.34 for each additional review, holding availability constant.
3. b2 = 0.17 | Price is expected to increase by $0.17 for each additional day available (out of 365), holding number of reviews constant.
Test Assumptions
Linearity of the model
plot(model, 1)

Interpretation: There is a linear relationship when looking at the residuals vs fitted plot.
QQ Plot
plot(model, 2)

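Interpretation: Points that lie along the reference line support the assumption that the residuals are normally distributed. However, the residual summary in the regression output (median -40.0, maximum 9860.1) indicates a long right tail, so the upper end of the plot deserves a closer look.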
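A numeric complement to the Q-Q plot is to compare quantiles of the standardized residuals with the matching standard normal quantiles. The sketch below uses simulated data with normal errors, not the Airbnb data; `toy`, `toy_model`, and the chosen probabilities are assumptions made for illustration.

```r
# Simulated stand-in data (NOT the Airbnb data), for illustration only
set.seed(7)
n <- 300
toy <- data.frame(x = runif(n, 0, 365))
toy$y <- 100 + 0.2 * toy$x + rnorm(n, sd = 40)

toy_model <- lm(y ~ x, data = toy)

# Standardized residuals, the same values plot(model, 2) draws
std_res <- rstandard(toy_model)

# Sample quantiles next to the theoretical normal quantiles
probs <- c(0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99)
comparison <- data.frame(
  prob   = probs,
  sample = as.numeric(quantile(std_res, probs)),
  normal = qnorm(probs)
)
print(comparison)
```

If the residuals are close to normal, the two columns should be similar. Large gaps at the highest probabilities would point to a heavy right tail.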
Conclusion and Recommendation
Conclusion
We can conclude that although the model found statistically significant relationships between price and both number of reviews and availability, the overall fit of the model is low. This means that there are other factors (like neighborhood groups) that affect the price of Airbnb listings.
We observed that the R-squared value is very low, meaning that the number of reviews and availability together do not explain much of the variability in Airbnb prices.
Recommendation
Hosts may consider maximizing their listing availability during high-demand periods, which could potentially command higher prices.
Given that the number of reviews is negatively associated with price, hosts may offer longer stays as a way to lower turnover and reviews while keeping profit high.
Hosts may want to research other factors that can affect price, such as location (neighborhood groups), rather than considering only reviews and availability. For example, they could compare prices of similar properties in the same area to better maximize revenue.