Airbnb Analysis

Author

Griffin Lessinger

Introduction

Airbnb is an important facet of the tourism and hospitality industries, especially in New York City, where nightly accommodations are competitive. In this document, we explore a dataset of Airbnb listings in New York City. The goal is (broadly) to find insights or connections between different attributes of those listings.

Data

The given dataset is a spreadsheet of 25 columns, ranging from lat and long (coordinates) to house_rules:

# A tibble: 6 × 5
        id NAME                     `host id` host_identity_verified `host name`
     <dbl> <chr>                        <dbl> <chr>                  <chr>      
1  3331490 Bayview room               6.19e10 verified               Ruth       
2 56665996 Cozy bedroom in a home-…   9.27e10 unconfirmed            Melkorka   
3 28708018 🌟🌟Large Condo, Pe…       3.72e10 verified               Dinero     
4 41098316 Private Room Steps from…   1.98e10 unconfirmed            Brian      
5 53898421 East Vilage Sleeping Sp…   3.92e10 verified               Deborah    
6 42271401 In the heart of Hunters…   9.83e10 unconfirmed            Lauro

(Above is a sample of the first 5 columns of the first 6 rows.) We are given roughly 400 listings to work with, along with data on each of them; enough to conduct some analyses.

It is important to note that many features of the dataset carry little value for numerical investigation, such as NAME (which is probably unique to each listing), country (as all observations are from NYC), and many others.

Findings

1. Relationship between `neighbourhood group` and `price`

This one is a bit of an obvious pick, but how much does location within NYC affect Airbnb listing prices? Here, neighbourhood group designates the borough in which the listing is located, and price is the nightly price:

[1] "ANOVA test on nightly rates by borough:"

                       Df   Sum Sq Mean Sq F value Pr(>F)
`neighbourhood group`   4   648203  162051   1.487  0.205
Residuals             394 42924472  108945

Interestingly, the ANOVA fails to reject the hypothesis of equal mean nightly rates across boroughs even at the 20% significance level (p = 0.205). This is surprising! I would have expected Manhattan and Brooklyn to have the highest average nightly rates, but this sample gives no evidence of that.
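As a sanity check, the F statistic and p-value can be recomputed from the sums of squares and degrees of freedom printed in the table. This is a minimal sketch in base R that uses only the reported numbers; the variable names are illustrative, and the table itself has the shape of `summary(aov(...))` output, although the original call is not shown here.

```r
# Values copied from the ANOVA table above.
ss_group <- 648203;   df_group <- 4
ss_resid <- 42924472; df_resid <- 394

ms_group <- ss_group / df_group   # mean square between boroughs
ms_resid <- ss_resid / df_resid   # mean square within boroughs
f_stat   <- ms_group / ms_resid   # ratio of the two mean squares

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

# Decision at a few common significance levels (TRUE = reject equal means).
alphas <- c(0.05, 0.10, 0.20)
reject <- setNames(p_value < alphas, alphas)

print(round(c(F = f_stat, p = p_value), 3))
print(reject)
```

Since the p-value sits just above 0.20, none of the three levels leads to rejection, which matches the reading of the table given above.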

2. Relationship between `host_identity_verified` and `review rate number`

Another item that an Airbnb customer may be interested in is the “verified status” of their host. Are “verified” hosts more likely to have strong ratings than “unconfirmed” hosts? We use a Welch two-sample t-test for a difference in group means:


    Welch Two Sample t-test

data:  verifieds and unconfirm
t = 0.24619, df = 393.05, p-value = 0.8057
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2222617  0.2858944
sample estimates:
mean of x mean of y 
 3.230769  3.198953

It looks like whether a host's identity is verified or unconfirmed has essentially no bearing on overall stay rating: with a p-value of 0.8057, there is no evidence of a difference. The group means are nearly identical, with average ratings of 3.231 (out of 5) for verified hosts compared to 3.199 (out of 5) for unconfirmed hosts. Good to know.
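To see where the p-value and confidence interval come from, they can be rebuilt from the t statistic, degrees of freedom, and group means printed above. The helper below is a hypothetical illustration in base R (the name `welch_from_summary` is not from the original analysis); it recovers the standard error from t = difference / SE.

```r
# Rebuild the two-sided p-value and confidence interval of a Welch t-test
# from its printed summary. Assumes t_stat is non-zero.
welch_from_summary <- function(t_stat, df, mean_x, mean_y, conf_level = 0.95) {
  diff    <- mean_x - mean_y
  se      <- abs(diff / t_stat)                      # t = diff / se
  p_value <- 2 * pt(-abs(t_stat), df)                # two-sided p-value
  margin  <- qt(1 - (1 - conf_level) / 2, df) * se   # half-width of the interval
  list(diff = diff, p_value = p_value, ci = c(diff - margin, diff + margin))
}

# Numbers copied from the output for verified vs. unconfirmed hosts.
res <- welch_from_summary(t_stat = 0.24619, df = 393.05,
                          mean_x = 3.230769, mean_y = 3.198953)
print(res)
```

The same arithmetic applies to any Welch output of this form, so it can be reused wherever only the printed summary is available.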

3. What makes buyers happy?

Aside from the verified status of the host, many other aspects of a stay will affect the overall rating of the listing (again, review rate number). It wouldn’t be feasible (in this document) to go through features like instant_bookable, cancellation_policy, etc. one by one and determine which are the most important. Instead, we can model review rate number as a function of the listing features (using a random forest), then use feature importance as a sort of “rule of thumb” for items that a host may want to consider. We only use factors that a host has control over as predictors.

Below is a plot of feature importance, in descending order:

From the importance ranking, price and service_fee are the most important predictors of the overall stay rating. This is expected. But it seems that minimum_nights is quite important as well! One could imagine that being locked into a set minimum number of nights would negatively affect the rating, especially if the guest turns out not to like the listing they chose.

(NOTE: this random forest should NOT be relied upon to show actual trends or formally extrapolate from, just to give some insight on what to look into. It serves as a rule of thumb, not a proper test.)
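For readers who want to try this kind of ranking, here is a rough sketch. It assumes the `randomForest` package, which is a guess, since the package actually used is not stated. The data frame is simulated and is NOT the real dataset, so the printed ranking means nothing; only the workflow matters. The `review_rate_number` column is a syntactic stand-in for review rate number.

```r
set.seed(1)
n <- 200

# Simulated stand-in for host-controlled listing features (not real data).
listings <- data.frame(
  price               = runif(n, 50, 1200),
  service_fee         = runif(n, 10, 240),
  minimum_nights      = sample(1:30, n, replace = TRUE),
  instant_bookable    = factor(sample(c("TRUE", "FALSE"), n, replace = TRUE)),
  cancellation_policy = factor(sample(c("flexible", "moderate", "strict"),
                                      n, replace = TRUE))
)
listings$review_rate_number <- sample(1:5, n, replace = TRUE)

ranked <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  # Treating the 1-5 rating as numeric (regression) is an assumption;
  # randomForest may warn that the response has few unique values.
  fit <- randomForest::randomForest(
    review_rate_number ~ ., data = listings,
    ntree = 200, importance = TRUE
  )
  # type = 1 is permutation importance (%IncMSE), one row per predictor.
  imp    <- randomForest::importance(fit, type = 1)
  ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
  print(ranked)
}
```

Permutation importance is used here because it is less biased toward high-cardinality predictors than node purity, but either measure only suggests where to look next.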

4. Relationship between `reviews per month` and `host_identity_verified`

We can assume that a higher reviews per month value implies a more active listing, so it serves as an indirect measure of listing activity. Given that, is there a notable difference in listing activity between listings with verified hosts and listings with unconfirmed hosts?


    Welch Two Sample t-test

data:  verifieds and unconfirm
t = 1.6124, df = 335.37, p-value = 0.1078
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.06044088  0.61003411
sample estimates:
mean of x mean of y 
 1.435829  1.161032

We find that the difference between verified and unconfirmed hosts is not significant at the 10% significance level (p = 0.1078), though it would be at the 15% level. Looking at the boxplots, listings with a verified host have slightly more reviews per month (a mean of 1.436 versus 1.161). But generally, it seems as though the verification status of the host is inconsequential for listing activity.

5. Relationship between `reviews per month` and minimum cost of stay

We look at more than just price because we want to understand the relationship between listing activity and the minimum cost of a stay. Cost is calculated as:

\[\text{cost} = \text{nights} \times \text{price} + \text{service fee}\]

Setting nights to minimum_nights therefore gives the minimum cost.
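The cost formula translates directly into a vectorized base R helper. The function name `min_cost` and the example listings are hypothetical, chosen only to illustrate the arithmetic.

```r
# Minimum cost of a stay: minimum number of nights times the nightly price,
# plus a one-time service fee (as in the formula above).
min_cost <- function(price, service_fee, minimum_nights) {
  minimum_nights * price + service_fee
}

# Hypothetical listings (not taken from the dataset).
example <- data.frame(
  price          = c(100, 250, 80),
  service_fee    = c(20, 50, 16),
  minimum_nights = c(3, 1, 14)
)
example$min_cost <- with(example, min_cost(price, service_fee, minimum_nights))
print(example)
```

Note how the third listing has the lowest nightly price but the highest minimum cost, which is why price alone can be misleading.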

The pattern here is something to think about. Many of the lower-cost listings have a minimum_nights of (usually) 3 or less. This allows for many more stays, and therefore reviews, per month than, say, a listing that requires staying for at least 2 weeks. This is likely why we see an initially decreasing trend in monthly reviews as minimum cost increases, followed by a “leveling-out” of sorts (perhaps because wealthier guests consistently take the more expensive and longer stays). In other words, minimum_nights imposes a hard limit on reviews per month.
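The hard limit can be made concrete with a little arithmetic. This sketch rests on simplifying assumptions that are not from the dataset: a 30-day month, back-to-back stays of exactly the minimum length, and at most one review per stay. The helper name is hypothetical.

```r
# Upper bound on stays (and so reviews) per month for a given minimum stay.
max_stays_per_month <- function(minimum_nights, days_in_month = 30) {
  floor(days_in_month / minimum_nights)
}

limits <- max_stays_per_month(c(1, 3, 14, 30))
print(limits)
```

Under these assumptions, a 3-night minimum allows up to 10 stays a month, while a 2-week minimum allows only 2.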

Still, the claim that a lower minimum cost leads to more listing activity among listings that require only a few nights seems reasonable, and would be worth exploring. One could expect a strong negative correlation between minimum cost and reviews per month.

(Also, there are a lot of really expensive listings! I had to remove 2 outliers: one had a minimum cost of 52,937 USD, the other 360,818 USD!)
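One way to follow up on this would be a rank correlation test. The sketch below uses Spearman's rho, which is my choice rather than anything from the analysis above, because the trend described is decreasing and then flat rather than linear. The two vectors are made-up illustrative values, not observations from the dataset.

```r
# Made-up values for illustration only (not from the dataset).
min_cost_usd      <- c(120, 250, 400, 800, 1500, 3000, 6000)
reviews_per_month <- c(4.1, 3.0, 2.2, 1.5, 0.9, 1.0, 0.8)

# Spearman's rank correlation tests for a monotone association.
res <- cor.test(min_cost_usd, reviews_per_month, method = "spearman")
rho <- unname(res$estimate)

print(rho)
print(res$p.value)
```

On the real data, ties in either variable would be likely, in which case `cor.test` falls back to an approximate p-value.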

Further Research

In conclusion, we found that Airbnb users are likely most concerned with the cost of the stay (price, service_fee, minimum_nights) rather than other aspects relating to the host or the locations themselves. It is possible that Airbnb users do care about some features of the listings that were not present in the dataset, such as maximum occupancy, vehicle storage, nearby amenities, etc.

Future exploration should focus on ways to maximize listing activity, which would involve more analysis of prices and associated costs of stay. From what was seen in this article, cost is likely the dominant factor in which listing an Airbnb guest chooses.
