library(dplyr)
library(psych)
library(knitr)
library(ggplot2)
Airbnb is an online marketplace that lets homeowners rent out their properties or spare rooms to guests. Airbnb takes a 3% commission on every booking from hosts, and between 6% and 12% from guests. For my final project, I analyze the public Airbnb data for New York City, http://insideairbnb.com/get-the-data.html, which includes information on listings, reviews and neighborhoods. In my analysis, I look to understand why Airbnb listings are priced differently and where a homeowner could get the biggest ROI. As a frequent user of the platform and a New York City native, I am interested to see whether my assumptions about pricing per listing and per neighborhood are accurate.
From the exploration question, the hypotheses are defined as follows:
H0: Neighborhoods, room types, minimum nights and reviews have no influence over Airbnb listing prices in NYC
H1: Neighborhoods, room types, minimum nights and reviews have influence over Airbnb listing prices in NYC
New York City Airbnb Data - **Source: http://insideairbnb.com/get-the-data.html:**
The Airbnb data was collected by Murray Cox. The dataset is meant to be an independent source that allows users to analyze Airbnb usage throughout the world. The snapshot data is regularly scraped and updated by Murray Cox. For this project, the data was loaded by reading the .csv files. This is considered an observational study.
Listings data: information and metrics for Airbnb listings in New York City
Reviews data: timestamped review information
Neighborhoods data: neighborhood listing information
#load data
listings <- read.table("C:/Users/burke/OneDrive/Desktop/Data 606-Stats/Project Data 606/Data 606 Data/listings.csv", quote = "", fill=TRUE, header = TRUE, sep = ",", stringsAsFactors=FALSE)
# Data transformations: listings data
# Convert the columns that hold numbers to numeric; values that cannot be parsed become NA
# The raw column names in the listings data frame are descriptive, so they are kept as-is
reformatted.listings <- listings %>%
  mutate(price = as.numeric(price),
         id = as.numeric(id),
         host_id = as.numeric(host_id),
         number_of_reviews = as.numeric(number_of_reviews),
         minimum_nights = as.numeric(minimum_nights),
         calculated_host_listings_count = as.numeric(calculated_host_listings_count),
         reviews_per_month = as.numeric(reviews_per_month),
         availability_365 = as.numeric(availability_365)) %>%
  # Keep rows with positive values; rows with NA in any of these columns are dropped as well
  filter(price > 0 &
           number_of_reviews > 0 &
           reviews_per_month > 0 &
           calculated_host_listings_count > 0 &
           availability_365 > 0 &
           id > 0) %>%
  # Drop identifier, location and host columns that are not used in the analysis
  select(-c(latitude, longitude, id, name, host_id, host_name, last_review,
            reviews_per_month, calculated_host_listings_count, availability_365))
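One detail of the cleaning step is easy to miss: `as.numeric()` turns values it cannot parse into `NA`, and a comparison filter then drops those rows because `NA > 0` is not `TRUE`. The following is a minimal base R sketch of that behavior on a made-up three-row data frame; the values are illustrative assumptions, not rows from `listings.csv`.

```r
# Toy data: hypothetical values, not rows from listings.csv
toy <- data.frame(
  price = c("39", "225", "abc"),
  number_of_reviews = c("7", "0", "12"),
  stringsAsFactors = FALSE
)

# Unparseable strings become NA (as.numeric warns about this coercion)
toy$price <- suppressWarnings(as.numeric(toy$price))
toy$number_of_reviews <- as.numeric(toy$number_of_reviews)

# subset() keeps only rows where the condition is TRUE, so NA rows are dropped
cleaned <- subset(toy, price > 0 & number_of_reviews > 0)
print(cleaned)
```

Only the first row survives: the second has zero reviews and the third has an unparseable price. `dplyr::filter()` treats `NA` conditions the same way, so it is worth counting how many rows each filter removes.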
The table below displays the first ten rows of the cleaned NYC Airbnb listings compiled by Murray Cox.
| neighbourhood_group | neighbourhood | room_type | price | minimum_nights | number_of_reviews |
|---|---|---|---|---|---|
| Brooklyn | Kensington | Private room | 39 | 1 | 7 |
| Manhattan | Midtown | Entire home/apt | 225 | 1 | 29 |
| Brooklyn | Williamsburg | Private room | 70 | 5 | 27 |
| Brooklyn | Clinton Hill | Entire home/apt | 89 | 1 | 181 |
| Manhattan | Hell’s Kitchen | Entire home/apt | 150 | 1 | 26 |
| Manhattan | Hell’s Kitchen | Entire home/apt | 150 | 1 | 58 |
| Brooklyn | Bedford-Stuyvesant | Private room | 60 | 45 | 51 |
| Brooklyn | Sunset Park | Entire home/apt | 253 | 4 | 1 |
| Manhattan | Hell’s Kitchen | Private room | 79 | 2 | 371 |
| Manhattan | Chinatown | Entire home/apt | 120 | 1 | 146 |
Each case represents an Airbnb listing in New York City. There are 21,974 observations and 6 variables in the cleaned data set.
| Dimension | Count |
|---|---|
| Observations (rows) | 21974 |
| Variables (columns) | 6 |
Minimum night requirement, numerical
Room type, categorical
As an initial inspection of the data, the table below lists the 20 neighborhood and room type combinations with the highest mean price. It suggests that median prices differ by room type and neighborhood.
| neighbourhood_group | neighbourhood | room_type | Frequency | Min.Value | Median.Value | Mean.Value | Max.Value | SD |
|---|---|---|---|---|---|---|---|---|
| Brooklyn | Sea Gate | Entire home/apt | 1 | 1485 | 1485 | 1485 | 1485 | NaN |
| Manhattan | Tribeca | Entire home/apt | 39 | 151 | 395 | 490 | 1500 | 312 |
| Manhattan | NoHo | Private room | 4 | 85 | 124 | 448 | 1460 | 675 |
| Queens | Holliswood | Entire home/apt | 1 | 414 | 414 | 414 | 414 | NaN |
| Manhattan | SoHo | Entire home/apt | 139 | 95 | 249 | 372 | 2000 | 331 |
| Manhattan | Roosevelt Island | Entire home/apt | 6 | 108 | 150 | 366 | 1400 | 509 |
| Manhattan | Flatiron District | Entire home/apt | 47 | 128 | 265 | 359 | 1050 | 237 |
| Manhattan | Theater District | Entire home/apt | 64 | 95 | 234 | 330 | 2000 | 288 |
| Manhattan | NoHo | Entire home/apt | 35 | 125 | 245 | 319 | 975 | 188 |
| Manhattan | Midtown | Entire home/apt | 442 | 67 | 239 | 314 | 2500 | 253 |
| Manhattan | Nolita | Entire home/apt | 98 | 99 | 244 | 310 | 5000 | 490 |
| Manhattan | Civic Center | Entire home/apt | 8 | 102 | 195 | 297 | 800 | 233 |
| Manhattan | Greenwich Village | Entire home/apt | 149 | 70 | 209 | 285 | 2450 | 285 |
| Brooklyn | Clinton Hill | Entire home/apt | 168 | 70 | 152 | 279 | 8000 | 698 |
| Manhattan | Battery Park City | Entire home/apt | 11 | 205 | 250 | 279 | 450 | 75 |
| Manhattan | West Village | Entire home/apt | 322 | 75 | 225 | 276 | 2590 | 235 |
| Staten Island | South Beach | Entire home/apt | 1 | 275 | 275 | 275 | 275 | NaN |
| Manhattan | Chelsea | Entire home/apt | 369 | 60 | 222 | 274 | 1731 | 188 |
| Queens | Neponsit | Entire home/apt | 1 | 274 | 274 | 274 | 274 | NaN |
| Manhattan | Kips Bay | Entire home/apt | 144 | 99 | 200 | 273 | 1950 | 207 |
From the above table, it is apparent that there are some outliers in the filtered data set: several maximum prices are far above the corresponding medians, and the medians vary by neighborhood and room type. These initial observations suggest that room type and neighborhood could be predictive of listing price, as proposed in the hypothesis.
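A grouped summary like the one above can be sketched in base R with `aggregate()`. This is an assumption about how such a table might be produced, not the code used for this report, and the toy data and the helper name `price_stats` are hypothetical.

```r
# Toy data: hypothetical listings, not rows from the Airbnb file
toy <- data.frame(
  neighbourhood = c("A", "A", "A", "B"),
  room_type     = c("Private room", "Private room", "Private room", "Entire home/apt"),
  price         = c(50, 70, 300, 120)
)

# Summary statistics of price within one group
price_stats <- function(p) {
  c(Frequency = length(p), Min = min(p), Median = median(p),
    Mean = mean(p), Max = max(p), SD = sd(p))
}

# aggregate() applies price_stats to each neighbourhood/room_type combination
stats <- aggregate(price ~ neighbourhood + room_type, data = toy, FUN = price_stats)
stats <- do.call(data.frame, stats)  # flatten the matrix column into plain columns

# Sort by mean price, highest first
stats <- stats[order(-stats$price.Mean), ]
print(stats)
```

In the toy data, a single 300 listing pulls the mean of group A to 140 while its median stays at 70, which is the same pattern as the outliers noted above. A group with one listing has no defined standard deviation, which is why single-listing rows show a missing SD.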
The histogram below shows that the price distribution is unimodal and roughly bell-shaped over the bulk of the listings, but skewed to the right, with outliers at high prices. For visualization purposes, I chose a density histogram with a limited x-axis.
The calculated medians give the histogram below. Like the price density histogram above, it shows a unimodal distribution that is skewed to the right, with outliers. In the Q-Q plot below, the median values follow the Q-Q line over most of its range.
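The two diagnostics described above, a density histogram with a limited x-axis and a normal Q-Q plot, can be sketched in base R as follows. The sample is simulated from a log-normal distribution purely for illustration; the seed, the parameters and the cutoff of 500 are assumptions, and the log-scale plot at the end is a suggestion rather than a step taken in this analysis.

```r
set.seed(606)  # arbitrary seed so the simulated sample is reproducible

# Simulated right-skewed "prices" (log-normal); NOT the Airbnb data
prices <- rlnorm(1000, meanlog = log(100), sdlog = 0.7)

# In a right-skewed sample the mean sits above the median
print(c(mean = mean(prices), median = median(prices)))

# Density histogram with a limited x-axis, so the long tail does not flatten the plot
hist(prices[prices <= 500], freq = FALSE, breaks = 50,
     main = "Simulated price density (x-axis limited to 500)", xlab = "Price")

# Q-Q plot against a normal distribution; a skewed sample bends away from the line
qqnorm(prices)
qqline(prices)

# On the log scale this simulated sample is close to normal by construction
qqnorm(log(prices))
qqline(log(prices))
```

If the real prices behave similarly, modeling `log(price)` may satisfy the normality condition better than modeling the raw price.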
| room_type | Count | mean.price |
|---|---|---|
| Entire home/apt | 11231 | 210 |
| Private room | 10142 | 81 |
| Shared room | 601 | 57 |
## room_type Count mean.price
## Length:3 Min. : 601 Min. : 57.0
## Class :character 1st Qu.: 5372 1st Qu.: 69.0
## Mode :character Median :10142 Median : 81.0
## Mean : 7325 Mean :116.0
## 3rd Qu.:10686 3rd Qu.:145.5
## Max. :11231 Max. :210.0
Room Type Analysis Summary:
Based on the listing counts, owners are somewhat more inclined to list entire properties (11,231) than private rooms (10,142), and shared rooms are rare (601). Entire property listings are, on average, more expensive than private rooms or shared rooms. This suggests that room type does impact the price of an NYC Airbnb listing. The impact of room type is explored further in the regression analysis.
| neighbourhood | Count | mean.price |
|---|---|---|
| Williamsburg | 1756 | 149 |
| Bedford-Stuyvesant | 1750 | 105 |
| Harlem | 1375 | 116 |
| Bushwick | 984 | 83 |
| Hell’s Kitchen | 984 | 206 |
| East Village | 884 | 191 |
| Upper West Side | 827 | 207 |
| Upper East Side | 762 | 170 |
| Crown Heights | 712 | 104 |
| East Harlem | 648 | 136 |
| Midtown | 597 | 273 |
| Chelsea | 504 | 230 |
| Lower East Side | 455 | 174 |
| Greenpoint | 442 | 134 |
| Astoria | 421 | 104 |
| West Village | 396 | 253 |
| Washington Heights | 387 | 86 |
| Clinton Hill | 283 | 204 |
| Flatbush | 260 | 91 |
| Prospect-Lefferts Gardens | 250 | 103 |
## neighbourhood Count mean.price
## Length:207 Min. : 1.0 Min. : 17.0
## Class :character 1st Qu.: 5.0 1st Qu.: 71.5
## Mode :character Median : 19.0 Median : 86.0
## Mean : 106.2 Mean :111.3
## 3rd Qu.: 74.0 3rd Qu.:134.0
## Max. :1756.0 Max. :555.0
Neighborhood Analysis Summary:
The mean listing price per neighborhood suggests that there is some association between the two variables. The more expensive neighborhoods, apart from a few outliers, seem to be in Manhattan proper. The true impact of neighborhood is explored further in the regression analysis.
So far, most property owners are inclined towards listing their entire property, and most of the pricier listings are in Manhattan. Next, I analyze how price varies across room types and boroughs on a heat map.
| room_type | neighbourhood_group | Count | mean.price |
|---|---|---|---|
| Entire home/apt | Manhattan | 5889 | 248 |
| Entire home/apt | Brooklyn | 4229 | 175 |
| Entire home/apt | Queens | 888 | 142 |
| Entire home/apt | Staten Island | 76 | 124 |
| Entire home/apt | Bronx | 149 | 108 |
| Private room | Manhattan | 3733 | 102 |
| Private room | Brooklyn | 4480 | 73 |
| Shared room | Manhattan | 266 | 67 |
| Private room | Queens | 1495 | 63 |
| Private room | Bronx | 344 | 56 |
| Shared room | Queens | 108 | 56 |
| Private room | Staten Island | 90 | 55 |
| Shared room | Bronx | 20 | 47 |
| Shared room | Brooklyn | 206 | 45 |
| Shared room | Staten Island | 1 | 40 |
## room_type neighbourhood_group Count mean.price
## Length:15 Length:15 Min. : 1 Min. : 40.0
## Class :character Class :character 1st Qu.: 99 1st Qu.: 55.5
## Mode :character Mode :character Median : 266 Median : 67.0
## Mean :1465 Mean : 93.4
## 3rd Qu.:2614 3rd Qu.:116.0
## Max. :5889 Max. :248.0
Heat Map Summary:
An entire home/apartment in Manhattan commands the most expensive listing price, on average.
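The room type by borough comparison can be rebuilt from the mean prices in the table above. The sketch below uses base R's `tapply()` and `heatmap()` as a stand-in; the plotting function used for the original heat map is not shown in this report, so that choice is an assumption.

```r
# Mean prices copied from the room type by borough table above
means <- data.frame(
  room_type  = rep(c("Entire home/apt", "Private room", "Shared room"), each = 5),
  borough    = rep(c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"), times = 3),
  mean_price = c(108, 175, 248, 142, 124,
                  56,  73, 102,  63,  55,
                  47,  45,  67,  56,  40)
)

# Reshape into a room_type x borough matrix (one value per cell, so mean() is a no-op)
price_matrix <- with(means, tapply(mean_price, list(room_type, borough), mean))
print(price_matrix)

# Most expensive combination
best <- means[which.max(means$mean_price), ]
print(best)

# Heat map without clustering or rescaling, so colors map directly to mean price
heatmap(price_matrix, Rowv = NA, Colv = NA, scale = "none")
```

Reading the matrix row by row also shows that the room type gap (entire home versus private room) is larger than the gap between boroughs within a room type.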
| neighbourhood_group | reviews | Count | mean.price |
|---|---|---|---|
| Staten Island | [0,30] | 113 | 92 |
| Staten Island | (30,60] | 32 | 72 |
| Staten Island | (60,90] | 10 | 93 |
| Staten Island | (90,120] | 5 | 78 |
| Staten Island | (120,150] | 3 | 56 |
| Staten Island | (150,180] | 4 | 56 |
| Queens | [0,30] | 1723 | 94 |
| Queens | (30,60] | 380 | 86 |
| Queens | (60,90] | 177 | 93 |
| Queens | (90,120] | 71 | 76 |
| Queens | (120,150] | 59 | 81 |
| Queens | (150,180] | 28 | 76 |
| Queens | (180,210] | 16 | 56 |
| Queens | (210,240] | 16 | 72 |
| Queens | (240,270] | 4 | 66 |
Review Analysis Summary:
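There does not seem to be a strong correlation between the number of reviews and the average price of the listings; in the boroughs shown, listings with more reviews tend, if anything, to be slightly cheaper.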
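The review bins in the table above, such as `[0,30]` and `(30,60]`, match the labels that base R's `cut()` produces with breaks every 30 reviews. A minimal sketch, assuming that is how the bins were built; the review counts and prices here are made up.

```r
# Hypothetical review counts and prices, chosen to land in different bins
reviews <- c(0, 7, 30, 31, 58, 90)
prices  <- c(90, 100, 80, 70, 75, 60)

# Breaks every 30 reviews; include.lowest = TRUE closes the first bin at 0
bins <- cut(reviews, breaks = seq(0, 90, by = 30), include.lowest = TRUE)

print(levels(bins))
print(table(bins))

# Mean price within each review bin
print(tapply(prices, bins, mean))
```

Each interval is open on the left and closed on the right, so a listing with exactly 30 reviews falls in `[0,30]`, not `(30,60]`.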
| stay | Count | mean.price |
|---|---|---|
| (360,370] | 3 | 400 |
| (220,230] | 1 | 315 |
| (590,600] | 1 | 275 |
| (90,100] | 3 | 228 |
| (1.24e+03,1.25e+03] | 1 | 180 |
| (20,30] | 720 | 172 |
| (30,40] | 29 | 165 |
| (60,70] | 2 | 162 |
| (80,90] | 23 | 155 |
| [0,10] | 14255 | 153 |
| (110,120] | 3 | 147 |
| (150,160] | 2 | 115 |
| (10,20] | 355 | 109 |
| (50,60] | 20 | 101 |
| (70,80] | 4 | 97 |
| (170,180] | 6 | 89 |
| (40,50] | 5 | 81 |
| (180,190] | 2 | 78 |
| (290,300] | 1 | 65 |
| (490,500] | 1 | 50 |
| (190,200] | 2 | 45 |
| (210,220] | 1 | 40 |
## stay Count mean.price
## [0,10] : 1 Min. : 1.00 Min. : 40.0
## (10,20]: 1 1st Qu.: 1.25 1st Qu.: 83.0
## (20,30]: 1 Median : 3.00 Median :131.0
## (30,40]: 1 Mean : 701.82 Mean :146.5
## (40,50]: 1 3rd Qu.: 16.50 3rd Qu.:170.2
## (50,60]: 1 Max. :14255.00 Max. :400.0
## (Other):16
Minimum Stay Analysis Summary:
There does not seem to be a strong correlation between the minimum stay requirement and the average price of the listings. Unsurprisingly, the bulk of the listings (14,255) have a minimum stay of 10 nights or fewer, so the means for the longer-stay bins rest on very few listings.
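The bin label `(1.24e+03,1.25e+03]` in the table above is consistent with `cut()`'s default of three significant digits for break labels. A short sketch, assuming the bins were built with `cut()`; the stay values and the shortened set of breaks are hypothetical.

```r
stay <- c(3, 1245)  # hypothetical minimum-night values

# Default dig.lab = 3: breaks above 999 are printed in scientific notation
default_bins <- cut(stay, breaks = c(0, 10, 1240, 1250), include.lowest = TRUE)
print(levels(default_bins))

# dig.lab = 4 keeps four-digit breaks in plain notation
plain_bins <- cut(stay, breaks = c(0, 10, 1240, 1250), include.lowest = TRUE, dig.lab = 4)
print(levels(plain_bins))
```

Setting `dig.lab = 4` changes only the labels, not which bin each listing falls into.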
Conditions:
The sample size is greater than 30
The data follow a unimodal, approximately normal distribution, apart from the right skew and outliers noted above
The samples are random
The conditions for inference in the NYC Airbnb data seem to be broadly satisfied
##
## Call:
## lm(formula = price ~ minimum_nights + number_of_reviews + neighbourhood_group +
## room_type, data = reformatted.listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -205.8 -54.3 -15.7 16.3 9874.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161.24954 7.82315 20.612 < 2e-16
## minimum_nights -0.07189 0.08561 -0.840 0.401023
## number_of_reviews -0.17963 0.02710 -6.628 3.48e-11
## neighbourhood_groupBrooklyn 28.29586 7.83107 3.613 0.000303
## neighbourhood_groupManhattan 81.86020 7.83322 10.450 < 2e-16
## neighbourhood_groupQueens 13.33099 8.35163 1.596 0.110455
## neighbourhood_groupStaten Island -4.89939 15.34555 -0.319 0.749524
## room_typePrivate room -118.69700 2.39627 -49.534 < 2e-16
## room_typeShared room -148.76270 7.22615 -20.587 < 2e-16
##
## (Intercept) ***
## minimum_nights
## number_of_reviews ***
## neighbourhood_groupBrooklyn ***
## neighbourhood_groupManhattan ***
## neighbourhood_groupQueens
## neighbourhood_groupStaten Island
## room_typePrivate room ***
## room_typeShared room ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 172.2 on 21965 degrees of freedom
## Multiple R-squared: 0.1477, Adjusted R-squared: 0.1474
## F-statistic: 475.8 on 8 and 21965 DF, p-value: < 2.2e-16
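From the above results, the model output indicates evidence of a difference in listing price by borough, room type and number of reviews. The coefficient for minimum_nights is not significant in this model (p = 0.40). The coefficients are measured relative to the baseline levels, the Bronx and Entire home/apt, which do not appear in the table. The model explains about 15% of the variation in price (R-squared = 0.1477), so most of the variation is left unexplained.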
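To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the `lm()` summary above; the helper `predict_price` is a hypothetical name introduced only for this illustration, and the baseline levels (Bronx, Entire home/apt) are inferred from the levels missing from the coefficient table.

```r
# Coefficients copied from the lm() summary above
coefs <- c(
  intercept = 161.24954,
  minimum_nights = -0.07189,
  number_of_reviews = -0.17963,
  Brooklyn = 28.29586, Manhattan = 81.86020, Queens = 13.33099,
  `Staten Island` = -4.89939,
  `Private room` = -118.69700, `Shared room` = -148.76270
)

# Baseline levels (Bronx, Entire home/apt) contribute 0 because they have no coefficient
predict_price <- function(borough, room_type, nights, reviews) {
  borough_effect <- if (borough == "Bronx") 0 else coefs[[borough]]
  room_effect <- if (room_type == "Entire home/apt") 0 else coefs[[room_type]]
  coefs[["intercept"]] + borough_effect + room_effect +
    coefs[["minimum_nights"]] * nights + coefs[["number_of_reviews"]] * reviews
}

manhattan_entire <- predict_price("Manhattan", "Entire home/apt", nights = 1, reviews = 0)
bronx_private    <- predict_price("Bronx", "Private room", nights = 1, reviews = 0)
print(c(manhattan_entire = manhattan_entire, bronx_private = bronx_private))
```

For a one-night minimum and no reviews, the model predicts about 243 for an entire home in Manhattan and about 42 for a private room in the Bronx. With a residual standard error of 172.2, individual listings can sit far from these fitted values.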
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## minimum_nights 1 121068 121068 4.084 0.0433 *
## number_of_reviews 1 671150 671150 22.640 1.966e-06 ***
## neighbourhood_group 4 34463398 8615850 290.637 < 2.2e-16 ***
## room_type 2 77578114 38789057 1308.463 < 2.2e-16 ***
## Residuals 21965 651146833 29645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
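The analysis of variance table ties back to the regression summary: the model's R-squared is the share of the total sum of squares that is not left in the residuals. A short check using the sums of squares copied from the table above:

```r
# Sums of squares copied from the analysis of variance table above
ss <- c(minimum_nights = 121068, number_of_reviews = 671150,
        neighbourhood_group = 34463398, room_type = 77578114)
ss_resid <- 651146833

ss_model <- sum(ss)
ss_total <- ss_model + ss_resid

# R-squared = explained variation / total variation
r_squared <- ss_model / ss_total
print(round(r_squared, 4))

# Share of the total variation attributed to each term
print(round(ss / ss_total, 4))
```

The result, 0.1477, matches the Multiple R-squared reported by `lm()`. `anova()` on an `lm` fit tests terms sequentially, in the order they enter the formula, which helps explain why minimum_nights has p = 0.0433 here but p = 0.40 in the coefficient table, where it is tested after adjusting for all other predictors. Room type accounts for the largest share of the explained variation.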
From the initial question: Do neighborhoods, room types, minimum nights and reviews have influence over listing prices in NYC? It can be concluded:
The plots, modeling and statistical analysis above indicate that neighborhoods, room types, minimum nights and reviews appear to influence the Airbnb listing price. H0 is rejected in favor of the alternative hypothesis H1. This conclusion is supported by the analysis of variance, which returned p-values below 0.05 for every term, with neighbourhood group and room type below 2.2e-16. The evidence for minimum nights is weak: p = 0.0433 in the analysis of variance and p = 0.40 in the regression summary.
Based on the final model, entire property listings in Manhattan are the most expensive.
It can be concluded from the plots that listing prices depend on the following factors:
Room type. Entire properties are the most expensive, followed by private rooms and shared rooms
The neighborhood. Manhattan is the most expensive (on average), followed by Brooklyn, Queens, Staten Island and the Bronx
A higher number of reviews did not guarantee a higher average listing price
A longer minimum stay did not guarantee a higher average listing price