New York City Airbnb

library(dplyr)
library(psych)
library(knitr)
library(ggplot2)
Airbnb NYC Image

Introduction:

Airbnb is an online marketplace that lets homeowners rent out their properties or spare rooms to guests. Airbnb takes a 3% commission on every booking from hosts, and between 6% and 12% from guests. For my final project, I analyze the public Airbnb data for New York City (http://insideairbnb.com/get-the-data.html), which includes information on listings, reviews and neighborhoods. In my analysis, I look to understand why Airbnb listings are priced differently and where a homeowner could get the biggest ROI. As a frequent user of the platform and a New York City native, I am interested to see whether my assumptions about pricing per listing and per neighborhood are accurate.

Hypothesis:

From the exploratory question, the hypotheses are defined as follows:

H0: Neighborhoods, room types, minimum nights and reviews have no influence over Airbnb listing prices in NYC

H1: Neighborhoods, room types, minimum nights and reviews have influence over Airbnb listing prices in NYC

Data Source:

New York City Airbnb Data - Source: http://insideairbnb.com/get-the-data.html

The Airbnb data was collected by Murray Cox. The dataset is meant to be an independent resource that allows users to analyze Airbnb usage throughout the world. The snapshot data is regularly scraped and updated by Murray Cox. For this project, the data was loaded by reading the .csv files. This is considered an observational study.

  • Listing data: information and metrics for Airbnb listings in New York City

  • Reviews data: timestamped review information

  • Neighborhoods data: neighborhood listing information

Load The Data:

#load data
listings <- read.table("C:/Users/burke/OneDrive/Desktop/Data 606-Stats/Project Data 606/Data 606 Data/listings.csv", quote = "", fill=TRUE, header = TRUE, sep = ",", stringsAsFactors=FALSE)

# Data translations - listing data
# Convert all the columns that hold numbers to numeric
# The raw column names in the listings dataframe were descriptive, so they are kept

reformatted.listings <- listings %>%
  # as.numeric() takes no na.rm argument, so it is omitted here; values that
  # cannot be parsed become NA and are dropped by filter(), since NA > 0 is not TRUE
  mutate(price = as.numeric(price),
         id = as.numeric(id),
         host_id = as.numeric(host_id),
         number_of_reviews = as.numeric(number_of_reviews),
         minimum_nights = as.numeric(minimum_nights),
         calculated_host_listings_count = as.numeric(calculated_host_listings_count),
         reviews_per_month = as.numeric(reviews_per_month),
         availability_365 = as.numeric(availability_365)) %>%
  # Keep only listings with a positive price, at least one review and some availability
  filter(price > 0 &
         number_of_reviews > 0 &
         reviews_per_month > 0 &
         calculated_host_listings_count > 0 &
         availability_365 > 0 &
         id > 0) %>%
  # Drop identifiers and columns that are not used in the analysis
  select(-c(latitude, longitude, id, name, host_id, host_name, last_review,
            reviews_per_month, calculated_host_listings_count, availability_365))

Filtered Data:

The table below displays the first 10 rows of the filtered NYC Airbnb listings compiled by Murray Cox.

Airbnb Listing Data-NYC
neighbourhood_group  neighbourhood       room_type        price  minimum_nights  number_of_reviews
Brooklyn             Kensington          Private room     39     1               7
Manhattan            Midtown             Entire home/apt  225    1               29
Brooklyn             Williamsburg        Private room     70     5               27
Brooklyn             Clinton Hill        Entire home/apt  89     1               181
Manhattan            Hell’s Kitchen      Entire home/apt  150    1               26
Manhattan            Hell’s Kitchen      Entire home/apt  150    1               58
Brooklyn             Bedford-Stuyvesant  Private room     60     45              51
Brooklyn             Sunset Park         Entire home/apt  253    4               1
Manhattan            Hell’s Kitchen      Private room     79     2               371
Manhattan            Chinatown           Entire home/apt  120    1               146

Cases:

Each case represents an Airbnb listing in New York City. There are 21,974 observations in the filtered data set.

Dimensions of The Airbnb Dataset
rows     21974
columns  6

Explanatory Variables:

  • Number of reviews, numerical
  • Minimum night requirement, numerical

  • Neighborhood, categorical
  • Room type, categorical

Response Variable:

  • Price, numerical

Exploratory Analysis:

Min, Median, Mean, Max, Standard Deviation:

The table below, as an initial inspection of the data, suggests that median prices differ by neighborhood and room type. The rows shown are sorted by mean price, highest first.

Airbnb NYC Listing- Count, Mean and Standard Deviation
neighbourhood_group  neighbourhood      room_type        Frequency  Min.Value  Median.Value  Mean.Value  Max.Value  SD
Brooklyn             Sea Gate           Entire home/apt  1          1485       1485          1485        1485       NaN
Manhattan            Tribeca            Entire home/apt  39         151        395           490         1500       312
Manhattan            NoHo               Private room     4          85         124           448         1460       675
Queens               Holliswood         Entire home/apt  1          414        414           414         414        NaN
Manhattan            SoHo               Entire home/apt  139        95         249           372         2000       331
Manhattan            Roosevelt Island   Entire home/apt  6          108        150           366         1400       509
Manhattan            Flatiron District  Entire home/apt  47         128        265           359         1050       237
Manhattan            Theater District   Entire home/apt  64         95         234           330         2000       288
Manhattan            NoHo               Entire home/apt  35         125        245           319         975        188
Manhattan            Midtown            Entire home/apt  442        67         239           314         2500       253
Manhattan            Nolita             Entire home/apt  98         99         244           310         5000       490
Manhattan            Civic Center       Entire home/apt  8          102        195           297         800        233
Manhattan            Greenwich Village  Entire home/apt  149        70         209           285         2450       285
Brooklyn             Clinton Hill       Entire home/apt  168        70         152           279         8000       698
Manhattan            Battery Park City  Entire home/apt  11         205        250           279         450        75
Manhattan            West Village       Entire home/apt  322        75         225           276         2590       235
Staten Island        South Beach        Entire home/apt  1          275        275           275         275        NaN
Manhattan            Chelsea            Entire home/apt  369        60         222           274         1731       188
Queens               Neponsit           Entire home/apt  1          274        274           274         274        NaN
Manhattan            Kips Bay           Entire home/apt  144        99         200           273         1950       207
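
A summary of this shape can be sketched with dplyr. This is a minimal sketch, not the original code (which the report does not show): it uses the ten sample rows from the filtered data table as a stand-in for reformatted.listings, and the names sample_listings and price_summary are assumptions made for illustration.

```r
library(dplyr)

# Stand-in data: the ten sample rows shown in the filtered data table
sample_listings <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Brooklyn", "Brooklyn", "Manhattan",
                          "Manhattan", "Brooklyn", "Brooklyn", "Manhattan", "Manhattan"),
  neighbourhood = c("Kensington", "Midtown", "Williamsburg", "Clinton Hill",
                    "Hell's Kitchen", "Hell's Kitchen", "Bedford-Stuyvesant",
                    "Sunset Park", "Hell's Kitchen", "Chinatown"),
  room_type = c("Private room", "Entire home/apt", "Private room", "Entire home/apt",
                "Entire home/apt", "Entire home/apt", "Private room",
                "Entire home/apt", "Private room", "Entire home/apt"),
  price = c(39, 225, 70, 89, 150, 150, 60, 253, 79, 120),
  stringsAsFactors = FALSE
)

# One row per borough / neighbourhood / room type, sorted by mean price
price_summary <- sample_listings %>%
  group_by(neighbourhood_group, neighbourhood, room_type) %>%
  summarise(Frequency    = n(),
            Min.Value    = min(price),
            Median.Value = median(price),
            Mean.Value   = mean(price),
            Max.Value    = max(price),
            SD           = sd(price),   # undefined for groups with one listing
            .groups = "drop") %>%
  arrange(desc(Mean.Value))

print(price_summary)
```

Groups with a single listing have no defined standard deviation, so a ranking by mean price can be led by one-listing groups such as Sea Gate; filtering on Frequency before ranking is one way to guard against that.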

Price Analysis:

From the above table, it is apparent that there are some outliers in the filtered data set (for example, maximum prices of 5,000 and 8,000 in groups whose medians are below 250), while the medians vary with neighborhood and room type. These initial observations suggest that room type and neighborhood could be predictive of listing price, as proposed in the hypothesis.

The histogram below shows a unimodal, roughly bell-shaped price distribution that is skewed to the right, with outliers in the upper tail. For visualization purposes, I chose a density histogram with a limited x-axis.
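
A density histogram with a limited x-axis can be sketched with ggplot2 as follows. This is a minimal sketch under stated assumptions: the prices are simulated from a log-normal distribution purely to get a right-skewed shape (they are not the Airbnb data), and the bin width of 20 and the 0 to 500 axis limit are illustrative choices, not the values used for the original plot.

```r
library(ggplot2)

set.seed(606)  # arbitrary seed so the toy data is reproducible

# Toy right-skewed prices (assumption: log-normal), NOT the Airbnb data
toy_listings <- data.frame(
  price = round(rlnorm(1000, meanlog = 4.7, sdlog = 0.7))
)

price_hist <- ggplot(toy_listings, aes(x = price)) +
  # Plot density instead of counts so the density curve shares the y-axis
  geom_histogram(aes(y = after_stat(density)), binwidth = 20,
                 fill = "grey70", colour = "white") +
  geom_density(colour = "steelblue") +
  # Zoom the x-axis; unlike xlim(), this keeps the outliers in the computation
  coord_cartesian(xlim = c(0, 500)) +
  labs(title = "Price density (toy data)", x = "Price", y = "Density")

print(price_hist)
```

Using coord_cartesian() rather than xlim() is a deliberate choice in this sketch: xlim() would drop the rows beyond the limit before the histogram and density are computed, which hides the outliers from the estimate as well as from view.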

Medians Price Analysis:

The histogram below is built from the calculated medians. Like the price density histogram above, it shows a roughly bell-shaped distribution that is skewed to the right, with outliers. In the Q-Q plot below, the medians follow the qqline for most of its range.
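
The medians histogram and Q-Q plot can be sketched with dplyr and base graphics. This is a minimal sketch on simulated data: the 40 neighbourhood labels and the log-normal prices are made up for illustration, and median_prices is a hypothetical name.

```r
library(dplyr)

set.seed(606)  # arbitrary seed so the toy data is reproducible

# Toy data (NOT the Airbnb data): 2000 listings spread over 40 made-up neighbourhoods
toy_listings <- data.frame(
  neighbourhood = sample(paste0("N", 1:40), 2000, replace = TRUE),
  price = round(rlnorm(2000, meanlog = 4.7, sdlog = 0.7)),
  stringsAsFactors = FALSE
)

# One median price per neighbourhood
median_prices <- toy_listings %>%
  group_by(neighbourhood) %>%
  summarise(median_price = median(price), .groups = "drop")

# Histogram of the medians
hist(median_prices$median_price,
     main = "Median price per neighbourhood (toy data)",
     xlab = "Median price")

# Q-Q plot: points close to the line indicate approximate normality
qq <- qqnorm(median_prices$median_price)
qqline(median_prices$median_price)
```

Points that bend away from the line at the upper end of the Q-Q plot are the usual signature of a right tail that is heavier than a normal distribution would give.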

Factor 1) Room Type:

Room Types
room_type        Count  mean.price
Entire home/apt  11231  210
Private room     10142  81
Shared room      601    57
##   room_type             Count         mean.price   
##  Length:3           Min.   :  601   Min.   : 57.0  
##  Class :character   1st Qu.: 5372   1st Qu.: 69.0  
##  Mode  :character   Median :10142   Median : 81.0  
##                     Mean   : 7325   Mean   :116.0  
##                     3rd Qu.:10686   3rd Qu.:145.5  
##                     Max.   :11231   Max.   :210.0

Room Type Analysis Summary:

Based on the listing counts, owners list entire properties more often than private rooms or shared rooms. Entire property listings are also more expensive, on average, than private rooms or shared rooms. This suggests that room type does affect the price of an NYC Airbnb listing. The impact of room type is explored later in the regression analysis.

Factor 2) Neighborhood:

Airbnb NYC-Neighborhood Data
neighbourhood              Count  mean.price
Williamsburg               1756   149
Bedford-Stuyvesant         1750   105
Harlem                     1375   116
Bushwick                   984    83
Hell’s Kitchen             984    206
East Village               884    191
Upper West Side            827    207
Upper East Side            762    170
Crown Heights              712    104
East Harlem                648    136
Midtown                    597    273
Chelsea                    504    230
Lower East Side            455    174
Greenpoint                 442    134
Astoria                    421    104
West Village               396    253
Washington Heights         387    86
Clinton Hill               283    204
Flatbush                   260    91
Prospect-Lefferts Gardens  250    103
##  neighbourhood          Count          mean.price   
##  Length:207         Min.   :   1.0   Min.   : 17.0  
##  Class :character   1st Qu.:   5.0   1st Qu.: 71.5  
##  Mode  :character   Median :  19.0   Median : 86.0  
##                     Mean   : 106.2   Mean   :111.3  
##                     3rd Qu.:  74.0   3rd Qu.:134.0  
##                     Max.   :1756.0   Max.   :555.0

Neighborhood Analysis Summary:

The mean listing price per neighborhood suggests that there is some association between the two variables. The more expensive neighborhoods, apart from a few outliers, seem to be in Manhattan proper. The true impact of neighborhood is explored later in the regression analysis.

Exploratory Analysis Initial Insights (Room Types & Neighborhoods):

It is shown that most property owners are inclined towards listing their entire property. It can also be seen that most of the pricier listings are in Manhattan. Now I will analyze the relationship between price, room type and neighborhood group on a heat map.

NYC Listings Summary- Room Type & Neighborhood
room_type        neighbourhood_group  Count  mean.price
Entire home/apt  Manhattan            5889   248
Entire home/apt  Brooklyn             4229   175
Entire home/apt  Queens               888    142
Entire home/apt  Staten Island        76     124
Entire home/apt  Bronx                149    108
Private room     Manhattan            3733   102
Private room     Brooklyn             4480   73
Shared room      Manhattan            266    67
Private room     Queens               1495   63
Private room     Bronx                344    56
Shared room      Queens               108    56
Private room     Staten Island        90     55
Shared room      Bronx                20     47
Shared room      Brooklyn             206    45
Shared room      Staten Island        1      40
##   room_type         neighbourhood_group     Count        mean.price   
##  Length:15          Length:15           Min.   :   1   Min.   : 40.0  
##  Class :character   Class :character    1st Qu.:  99   1st Qu.: 55.5  
##  Mode  :character   Mode  :character    Median : 266   Median : 67.0  
##                                         Mean   :1465   Mean   : 93.4  
##                                         3rd Qu.:2614   3rd Qu.:116.0  
##                                         Max.   :5889   Max.   :248.0

Heat Map Summary:

An entire home/apartment in Manhattan commands the highest listing price, on average.
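
A heat map of mean price by room type and borough can be sketched with ggplot2's geom_tile(). This is a minimal sketch: the mean prices are copied from the room type and neighborhood summary table above, while the colour scale, labels and the name room_by_borough are illustrative assumptions rather than the original plot settings.

```r
library(ggplot2)

# Mean prices taken from the room type / neighbourhood_group summary table
room_by_borough <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"), each = 5),
  neighbourhood_group = rep(c("Manhattan", "Brooklyn", "Queens",
                              "Staten Island", "Bronx"), times = 3),
  mean.price = c(248, 175, 142, 124, 108,   # Entire home/apt
                 102,  73,  63,  55,  56,   # Private room
                  67,  45,  56,  40,  47),  # Shared room
  stringsAsFactors = FALSE
)

heat_map <- ggplot(room_by_borough,
                   aes(x = neighbourhood_group, y = room_type, fill = mean.price)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = mean.price)) +   # print the mean price in each cell
  scale_fill_gradient(low = "lightyellow", high = "red") +
  labs(title = "Mean price by room type and borough",
       x = "Borough", y = "Room type", fill = "Mean price")

print(heat_map)

# The most expensive combination, read directly from the data
room_by_borough[which.max(room_by_borough$mean.price), ]
```

The last line returns the Entire home/apt and Manhattan row with a mean price of 248, which matches the darkest cell of the heat map.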

Number of Reviews Distribution:

NYC Listings Summary- Number of Reviews (Top 20)
neighbourhood_group  reviews    Count  mean.price
Staten Island        [0,30]     113    92
Staten Island        (30,60]    32     72
Staten Island        (60,90]    10     93
Staten Island        (90,120]   5      78
Staten Island        (120,150]  3      56
Staten Island        (150,180]  4      56
Queens               [0,30]     1723   94
Queens               (30,60]    380    86
Queens               (60,90]    177    93
Queens               (90,120]   71     76
Queens               (120,150]  59     81
Queens               (150,180]  28     76
Queens               (180,210]  16     56
Queens               (210,240]  16     72
Queens               (240,270]  4      66

Review Analysis Summary:

There does not seem to be a correlation between the number of reviews and the average price of the listings.
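
Review counts can be grouped into intervals such as [0,30] and (30,60] with cut(). The following is a minimal sketch, assuming bins of width 30 as the interval labels suggest; it uses the ten sample rows from the filtered data table, and the names sample_listings and review_summary are hypothetical.

```r
library(dplyr)

# Stand-in data: price and review counts of the ten sample rows shown earlier
sample_listings <- data.frame(
  price             = c(39, 225, 70, 89, 150, 150, 60, 253, 79, 120),
  number_of_reviews = c(7, 29, 27, 181, 26, 58, 51, 1, 371, 146)
)

bin_width <- 30  # assumption, inferred from the interval labels
upper <- ceiling(max(sample_listings$number_of_reviews) / bin_width) * bin_width

review_summary <- sample_listings %>%
  # include.lowest = TRUE closes the first interval on the left: [0,30]
  mutate(reviews = cut(number_of_reviews,
                       breaks = seq(0, upper, by = bin_width),
                       include.lowest = TRUE)) %>%
  group_by(reviews) %>%
  summarise(Count = n(), mean.price = mean(price), .groups = "drop")

print(review_summary)
```

On these ten rows, five listings fall in [0,30] with a mean price of 147.4, and the single listing with 371 reviews falls in (360,390] at a price of 79, which is in line with the observation that more reviews do not go with higher prices.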

Minimum Stay Distribution:

NYC Listings Summary- Minimum Stay
stay         Count  mean.price
(360,370]    3      400
(220,230]    1      315
(590,600]    1      275
(90,100]     3      228
(1240,1250]  1      180
(20,30]      720    172
(30,40]      29     165
(60,70]      2      162
(80,90]      23     155
[0,10]       14255  153
(110,120]    3      147
(150,160]    2      115
(10,20]      355    109
(50,60]      20     101
(70,80]      4      97
(170,180]    6      89
(40,50]      5      81
(180,190]    2      78
(290,300]    1      65
(490,500]    1      50
(190,200]    2      45
(210,220]    1      40
##       stay        Count            mean.price   
##  [0,10] : 1   Min.   :    1.00   Min.   : 40.0  
##  (10,20]: 1   1st Qu.:    1.25   1st Qu.: 83.0  
##  (20,30]: 1   Median :    3.00   Median :131.0  
##  (30,40]: 1   Mean   :  701.82   Mean   :146.5  
##  (40,50]: 1   3rd Qu.:   16.50   3rd Qu.:170.2  
##  (50,60]: 1   Max.   :14255.00   Max.   :400.0  
##  (Other):16

Minimum Stay Analysis Summary:

There does not seem to be a strong correlation between the minimum stay requirement and the average price of the listings. However, unsurprisingly, the bulk of the listings fall within the 1 to 10 night range.
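
The visual impression can be backed by a correlation coefficient, which the report itself does not compute. The sketch below is an illustration only: it uses the ten sample rows from the filtered data table rather than the full data set, so the numbers it prints say nothing about the full data. It also shows how cut() with dig.lab = 4 keeps four-digit interval labels out of scientific notation; the three minimum stays used for that demonstration are made up.

```r
# Stand-in data: price and minimum nights of the ten sample rows shown earlier
sample_listings <- data.frame(
  price          = c(39, 225, 70, 89, 150, 150, 60, 253, 79, 120),
  minimum_nights = c(1, 1, 5, 1, 1, 1, 45, 4, 2, 1)
)

# Pearson measures linear association; Spearman is rank based and is less
# sensitive to extreme minimum stays
pearson  <- cor(sample_listings$price, sample_listings$minimum_nights)
spearman <- cor(sample_listings$price, sample_listings$minimum_nights,
                method = "spearman")
print(c(pearson = pearson, spearman = spearman))

# Binning minimum stays in steps of 10 nights; dig.lab = 4 prints four-digit
# breaks in full instead of scientific notation
example_stays <- c(3, 25, 1250)  # made-up values for illustration
stay_bins <- cut(example_stays, breaks = seq(0, 1250, by = 10),
                 include.lowest = TRUE, dig.lab = 4)
print(as.character(stay_bins))
```

A coefficient close to zero on the full data would support the statement above, while a clearly non-zero Spearman value would suggest a monotone relationship that the binned means hide.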

Inference:

Satisfying conditions for inference:

Conditions:

  • The sample size is greater than 30

  • The data sets follow a unimodal, approximately normal distribution

  • The samples are random

The conditions for inference in the NYC Airbnb data seem to be satisfied.

Multiple linear regression:

## 
## Call:
## lm(formula = price ~ minimum_nights + number_of_reviews + neighbourhood_group + 
##     room_type, data = reformatted.listings)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -205.8  -54.3  -15.7   16.3 9874.1 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       161.24954    7.82315  20.612  < 2e-16 ***
## minimum_nights                     -0.07189    0.08561  -0.840 0.401023    
## number_of_reviews                  -0.17963    0.02710  -6.628 3.48e-11 ***
## neighbourhood_groupBrooklyn        28.29586    7.83107   3.613 0.000303 ***
## neighbourhood_groupManhattan       81.86020    7.83322  10.450  < 2e-16 ***
## neighbourhood_groupQueens          13.33099    8.35163   1.596 0.110455    
## neighbourhood_groupStaten Island   -4.89939   15.34555  -0.319 0.749524    
## room_typePrivate room            -118.69700    2.39627 -49.534  < 2e-16 ***
## room_typeShared room             -148.76270    7.22615 -20.587  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 172.2 on 21965 degrees of freedom
## Multiple R-squared:  0.1477, Adjusted R-squared:  0.1474 
## F-statistic: 475.8 on 8 and 21965 DF,  p-value: < 2.2e-16

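From the above results, the model output indicates significant differences in listing price by room type, by borough (Brooklyn and Manhattan relative to the Bronx, the baseline level) and by number of reviews. The coefficient for minimum nights is not significant once the other predictors are included (p = 0.40), and the model explains only about 15% of the variance in price (adjusted R-squared of 0.1474).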
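
To make the coefficients concrete, the fitted equation can be evaluated by hand. The following is a minimal sketch that copies the estimates from the regression output above. It assumes that the Bronx and Entire home/apt are the baseline levels (they are the levels without a coefficient in the output), and the helper predict_price is a hypothetical name introduced here for illustration.

```r
# Estimates copied from the regression output above
coefs <- c(
  "(Intercept)"     = 161.24954,
  minimum_nights    = -0.07189,
  number_of_reviews = -0.17963,
  Brooklyn          = 28.29586,
  Manhattan         = 81.86020,
  Queens            = 13.33099,
  "Staten Island"   = -4.89939,
  "Private room"    = -118.69700,
  "Shared room"     = -148.76270
)

# Hypothetical helper: predicted price for one listing
predict_price <- function(borough, room_type, minimum_nights, number_of_reviews) {
  # Baseline levels (assumed: Bronx, Entire home/apt) add nothing to the intercept
  borough_effect <- if (borough == "Bronx") 0 else coefs[[borough]]
  room_effect    <- if (room_type == "Entire home/apt") 0 else coefs[[room_type]]
  coefs[["(Intercept)"]] +
    coefs[["minimum_nights"]] * minimum_nights +
    coefs[["number_of_reviews"]] * number_of_reviews +
    borough_effect + room_effect
}

# Entire home in Manhattan, 1 night minimum, 29 reviews
manhattan_entire <- predict_price("Manhattan", "Entire home/apt", 1, 29)
# Shared room in the Bronx with the same stay and review values
bronx_shared <- predict_price("Bronx", "Shared room", 1, 29)

print(round(c(manhattan_entire = manhattan_entire, bronx_shared = bronx_shared), 2))
```

The first prediction is about 237.83, close to the 225 listed for the Midtown entire home with 1 minimum night and 29 reviews in the filtered data table. The second is about 7.21, which is implausibly low and illustrates how much of the price variation this linear model leaves unexplained.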

## Analysis of Variance Table
## 
## Response: price
##                        Df    Sum Sq  Mean Sq  F value    Pr(>F)    
## minimum_nights          1    121068   121068    4.084    0.0433 *  
## number_of_reviews       1    671150   671150   22.640 1.966e-06 ***
## neighbourhood_group     4  34463398  8615850  290.637 < 2.2e-16 ***
## room_type               2  77578114 38789057 1308.463 < 2.2e-16 ***
## Residuals           21965 651146833    29645                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
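
The F values in this table can be reproduced from the sums of squares, which also shows where the model's R-squared comes from. The sketch below only re-does the arithmetic on the numbers printed above. One point worth keeping in mind: anova() on an lm fit reports sequential tests, so each term is tested after the terms listed before it. That is why minimum_nights, entered first, has p = 0.0433 here but p = 0.40 in the coefficient table, where it is adjusted for all other predictors.

```r
# Sums of squares and degrees of freedom copied from the table above
ss <- c(minimum_nights      = 121068,
        number_of_reviews   = 671150,
        neighbourhood_group = 34463398,
        room_type           = 77578114)
df <- c(1, 1, 4, 2)

ss_resid <- 651146833
df_resid <- 21965

# F = (term mean square) / (residual mean square)
ms_resid <- ss_resid / df_resid
f_values <- (ss / df) / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_values <- pf(f_values, df, df_resid, lower.tail = FALSE)

# R-squared = explained sum of squares / total sum of squares
r_squared <- sum(ss) / (sum(ss) + ss_resid)

print(round(f_values, 3))
print(signif(p_values, 3))
print(round(r_squared, 4))
```

The computed R-squared of 0.1477 agrees with the Multiple R-squared in the regression summary, and the F value for minimum_nights comes out at 4.084 as printed in the table.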

Conclusion:

Returning to the initial question (do neighborhoods, room types, minimum nights and reviews have influence over listing prices in NYC?), it can be concluded:

The above plots, modeling, and statistical analysis indicate that neighborhood, room type and number of reviews are associated with the Airbnb listing price, while the evidence for minimum nights is weaker (p = 0.0433 in the analysis of variance, p = 0.40 in the regression coefficients). H0 can be rejected in favor of my alternative hypothesis H1. This conclusion is statistically supported because the regression F-test and the analysis of variance return extremely low p-values (< 2.2e-16), well below 0.05.

Based on the final model, entire property listings in Manhattan are the most expensive.

It can be concluded from the plots that listing prices depend on the following factors:

  • The room type. Entire properties are the most expensive, followed by private rooms and shared rooms

  • The neighborhood. Manhattan is the most expensive (on average), followed by Brooklyn, Queens, Staten Island and the Bronx

  • A higher number of reviews did not guarantee a higher average listing price

  • A longer minimum stay did not guarantee a higher average listing price