This is a dataset of Airbnb Listings in New York: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
The dataset looks pretty tidy if you are looking at the individual listings. If you are looking at the hosts, though, there are multiple rows per host, repeated data, and so on. So, for this assignment, I made a tidy dataset of the hosts for analysis.
There is not much data on the hosts, so I will keep just the ID and name, and add a few aggregate columns for analysis.
I will look to see whether the number of listings a host has affects the number of reviews those listings get.
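library(dplyr)    # count(), group_by(), summarize(), n()
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
listings <- read.csv('AB_NYC_2019.csv', stringsAsFactors = FALSE)
head(listings)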
## id name host_id
## 1 2539 Clean & quiet apt home by the park 2787
## 2 2595 Skylit Midtown Castle 2845
## 3 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632
## 4 3831 Cozy Entire Floor of Brownstone 4869
## 5 5022 Entire Apt: Spacious Studio/Loft by central park 7192
## 6 5099 Large Cozy 1 BR Apartment In Midtown East 7322
## host_name neighbourhood_group neighbourhood latitude longitude
## 1 John Brooklyn Kensington 40.64749 -73.97237
## 2 Jennifer Manhattan Midtown 40.75362 -73.98377
## 3 Elisabeth Manhattan Harlem 40.80902 -73.94190
## 4 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976
## 5 Laura Manhattan East Harlem 40.79851 -73.94399
## 6 Chris Manhattan Murray Hill 40.74767 -73.97500
## room_type price minimum_nights number_of_reviews last_review
## 1 Private room 149 1 9 2018-10-19
## 2 Entire home/apt 225 1 45 2019-05-21
## 3 Private room 150 3 0
## 4 Entire home/apt 89 1 270 2019-07-05
## 5 Entire home/apt 80 10 9 2018-11-19
## 6 Entire home/apt 200 3 74 2019-06-22
## reviews_per_month calculated_host_listings_count availability_365
## 1 0.21 6 365
## 2 0.38 2 355
## 3 NA 1 365
## 4 4.64 1 194
## 5 0.10 1 0
## 6 0.59 1 129
str(listings)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "2018-10-19" "2019-05-21" "" "2019-07-05" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
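Let’s check that the host data is actually the same on every row for a given host ID. If each host ID maps to exactly one name and one listing count, all three counts below will match: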
c(count(unique(listings[c('host_id')])),
count(unique(listings[c('host_id','host_name')])),
count(unique(listings[c('host_id','calculated_host_listings_count')])))
## $n
## [1] 37457
##
## $n
## [1] 37457
##
## $n
## [1] 37457
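If the counts had not matched, the next step would be to find which hosts are inconsistent. Here is a minimal sketch of one way to do that in base R. The `toy` data frame is made up for illustration (host 2 deliberately appears under two names) and is not part of the Airbnb data.

```r
# Hypothetical toy listings: host 2 shows up with two different names
toy <- data.frame(
  host_id   = c(1, 1, 2, 2, 3),
  host_name = c("Ann", "Ann", "Bob", "Robert", "Cy"),
  stringsAsFactors = FALSE
)

# Count the distinct names recorded for each host ID
names_per_host <- tapply(toy$host_name, toy$host_id,
                         function(x) length(unique(x)))

# Host IDs (as character labels) with more than one distinct name
inconsistent_hosts <- names(names_per_host)[names_per_host > 1]
inconsistent_hosts
```

The same idea should carry over to `listings` by swapping in `listings$host_name` and `listings$host_id`.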
That looks good. Let’s create our hosts dataset.
# Fill NA with 0; I believe these can be interpreted as 0 (listings with no reviews)
listings[is.na(listings['reviews_per_month']), 'reviews_per_month'] <- 0
listings <- group_by(listings, host_id, host_name, calculated_host_listings_count)
hosts <- summarize(listings,
                   avg_price = mean(price),
                   count_listings = n(),
                   avg_reviews_per_month = mean(reviews_per_month))
head(hosts)
## # A tibble: 6 x 6
## # Groups: host_id, host_name [6]
## host_id host_name calculated_host~ avg_price count_listings
## <int> <chr> <int> <dbl> <int>
## 1 2438 Tasos 1 95 1
## 2 2571 Teedo 1 182 1
## 3 2787 John 6 101. 6
## 4 2845 Jennifer 2 162 2
## 5 2868 Letha M. 1 60 1
## 6 2881 Loli 2 58.5 2
## # ... with 1 more variable: avg_reviews_per_month <dbl>
The dataset already includes a calculated listing count, so let’s see if our aggregate matches it.
sum(hosts$count_listings != hosts$calculated_host_listings_count)
## [1] 0
They are equivalent, so let’s remove one.
hosts <- subset(hosts, select=-c(calculated_host_listings_count))
Let’s take a look at the listing count. The advantage of having a dataset grouped by host is that this variable makes more sense. You can compare how it looks in each dataset.
listing_counts <- as.data.frame(cbind(table(hosts$count_listings),
table(listings$calculated_host_listings_count)))
names(listing_counts) = c('by_host','by_listing')
listing_counts
## by_host by_listing
## 1 32303 32303
## 2 3329 6658
## 3 951 2853
## 4 360 1440
## 5 169 845
## 6 95 570
## 7 57 399
## 8 52 416
## 9 26 234
## 10 21 210
## 11 10 110
## 12 15 180
## 13 10 130
## 14 5 70
## 15 5 75
## 16 1 16
## 17 4 68
## 18 3 54
## 19 1 19
## 20 2 40
## 21 1 21
## 23 3 69
## 25 2 50
## 26 1 26
## 27 1 27
## 28 2 56
## 29 1 29
## 30 1 30
## 31 2 62
## 32 1 32
## 33 3 99
## 34 2 68
## 37 1 37
## 39 1 39
## 43 1 43
## 47 1 47
## 49 2 98
## 50 1 50
## 52 2 104
## 65 1 65
## 87 1 87
## 91 1 91
## 96 2 192
## 103 1 103
## 121 1 121
## 232 1 232
## 327 1 327
You can see the numbers get stranger as the count goes up. The listings dataset has 327 entries with the value “327”, but that is just one host with 327 listings. Depending on what you’re looking at, this may not really matter. But if you think the host has a large effect on your response variable, having one host show up in 327 rows may skew your results.
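The relationship between the two columns is just multiplication: each host with k listings contributes k rows to the listings dataset. A quick sketch using four rows copied from the table above:

```r
# (listing count, number of hosts) pairs taken from the table above
count_listings <- c(1, 2, 3, 327)
by_host        <- c(32303, 3329, 951, 1)

# Each host with k listings appears k times in the listings dataset
by_listing <- count_listings * by_host
by_listing
```

This reproduces the `by_listing` values 32303, 6658, 2853 and 327 for those rows.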
Let’s check the correlation between the number of listings and the average number of reviews per month. Note that I’m taking an average of averages here, and I have no way to weight it, since the dataset doesn’t say how long each listing has been listed. This is not great, but it will have to do for this analysis.
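To show why the unweighted average is a compromise, here is a minimal sketch with made-up numbers. The `months_listed` vector is hypothetical; nothing like it exists in this dataset, which is exactly why the weighting can’t be done here.

```r
# Hypothetical host with two listings (values invented for illustration)
reviews_per_month <- c(4, 1)   # review rate of each listing
months_listed     <- c(1, 9)   # assumed time on the market; NOT in the real data

# What this analysis does: a plain average of the per-listing rates
unweighted <- mean(reviews_per_month)

# What it would ideally do: weight each rate by how long the listing was active
weighted <- weighted.mean(reviews_per_month, w = months_listed)

c(unweighted = unweighted, weighted = weighted)
```

With these numbers the plain average is 2.5 while the weighted one is 1.3, so a new listing with a burst of early reviews can pull a host’s average up quite a bit.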
ggplot(hosts, aes(x=count_listings, y=avg_reviews_per_month)) + geom_point() + geom_smooth(method="lm")
It’s hard to say exactly from the chart, but it does look like there is a positive correlation.
model = glm(data=hosts,avg_reviews_per_month ~ count_listings)
summary(model)
##
## Call:
## glm(formula = avg_reviews_per_month ~ count_listings, data = hosts)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4370 -0.9861 -0.6761 0.4339 19.9139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.015543 0.008469 119.915 < 2e-16 ***
## count_listings 0.010550 0.002773 3.804 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.19559)
##
## Null deviance: 82268 on 37456 degrees of freedom
## Residual deviance: 82236 on 37455 degrees of freedom
## AIC: 135760
##
## Number of Fisher Scoring iterations: 2
And we see here that every extra listing a host has is associated with about 0.01 more reviews per month per listing, or roughly one extra review every 95 months (1 / 0.01055 ≈ 95). Out of curiosity, let’s create the same model on the original listings dataset.
model2 = glm(data=listings,reviews_per_month ~ calculated_host_listings_count)
summary(model2)
##
## Call:
## glm(formula = reviews_per_month ~ calculated_host_listings_count,
## data = listings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.105 -1.053 -0.705 0.495 57.413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.107293 0.007383 149.98 <2e-16 ***
## calculated_host_listings_count -0.002293 0.000219 -10.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.545653)
##
## Null deviance: 124744 on 48894 degrees of freedom
## Residual deviance: 124465 on 48893 degrees of freedom
## AIC: 184449
##
## Number of Fisher Scoring iterations: 2
And now we have a negative coefficient. Why the discrepancy? I would hypothesize that some of the large values (remember our 327-listing host) could be skewing the results. In our original dataset, that 327 shows up 327 times, so it could carry a lot of weight.
A probably more interesting question is which conclusion is right. I would say they are both right, but they are answering different questions.
If you pull up two listings online and see that one is the host’s only listing, while the other belongs to a “superhost” with many properties, you would want to bet on the former as having more reviews. (Though the coefficient is really small, so I wouldn’t bet your life savings or anything.)
On the other hand, if you look at the hosts of those properties themselves, you’d want to place your bet on the “superhost” as having more reviews on average across all of their properties.
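To see that this kind of sign flip is at least possible, here is a minimal sketch on made-up data. The `toy_hosts` values are invented for illustration and are not drawn from the Airbnb dataset; it uses `lm()`, which fits the same model as the default gaussian `glm()` above.

```r
# Hypothetical hosts: four with 1 listing, four with 3 listings, one with 20
toy_hosts <- data.frame(
  host_id               = 1:9,
  count_listings        = c(1, 1, 1, 1, 3, 3, 3, 3, 20),
  avg_reviews_per_month = c(1, 1, 1, 1, 3, 3, 3, 3, 2)
)

# Expand to one row per listing, so the 20-listing host appears 20 times
row_index    <- rep(seq_len(nrow(toy_hosts)), times = toy_hosts$count_listings)
toy_listings <- toy_hosts[row_index, ]

# Same formula, fitted once per host and once per listing
host_slope <- coef(lm(avg_reviews_per_month ~ count_listings,
                      data = toy_hosts))[["count_listings"]]
listing_slope <- coef(lm(avg_reviews_per_month ~ count_listings,
                         data = toy_listings))[["count_listings"]]

c(host_level = host_slope, listing_level = listing_slope)
```

In this toy example the host-level slope is positive and the listing-level slope is negative, purely because the large host is counted 20 times in the second fit. That doesn’t prove the same mechanism is at work in the real data, but it is consistent with the hypothesis.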