This is a dataset of Airbnb Listings in New York: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

The dataset looks pretty tidy - if you are looking at the individual listings. If you are looking at the hosts though, you have multiple rows per host, repeated data etc. So, for this assignment I made a tidy dataset of the hosts for analysis.
There is not a lot of data on the hosts, so I will just have the ID, name and I’ll add a few aggregate columns for analysis.
I will look to see if the number of listings a host affects the number of reviews those listings get.

# dplyr is used below for count/group_by/summarize, and ggplot2 for the chart
library(dplyr)
library(ggplot2)

listings <- read.csv('AB_NYC_2019.csv', stringsAsFactors = FALSE)
head(listings)
##     id                                             name host_id
## 1 2539               Clean & quiet apt home by the park    2787
## 2 2595                            Skylit Midtown Castle    2845
## 3 3647              THE VILLAGE OF HARLEM....NEW YORK !    4632
## 4 3831                  Cozy Entire Floor of Brownstone    4869
## 5 5022 Entire Apt: Spacious Studio/Loft by central park    7192
## 6 5099        Large Cozy 1 BR Apartment In Midtown East    7322
##     host_name neighbourhood_group neighbourhood latitude longitude
## 1        John            Brooklyn    Kensington 40.64749 -73.97237
## 2    Jennifer           Manhattan       Midtown 40.75362 -73.98377
## 3   Elisabeth           Manhattan        Harlem 40.80902 -73.94190
## 4 LisaRoxanne            Brooklyn  Clinton Hill 40.68514 -73.95976
## 5       Laura           Manhattan   East Harlem 40.79851 -73.94399
## 6       Chris           Manhattan   Murray Hill 40.74767 -73.97500
##         room_type price minimum_nights number_of_reviews last_review
## 1    Private room   149              1                 9  2018-10-19
## 2 Entire home/apt   225              1                45  2019-05-21
## 3    Private room   150              3                 0            
## 4 Entire home/apt    89              1               270  2019-07-05
## 5 Entire home/apt    80             10                 9  2018-11-19
## 6 Entire home/apt   200              3                74  2019-06-22
##   reviews_per_month calculated_host_listings_count availability_365
## 1              0.21                              6              365
## 2              0.38                              2              355
## 3                NA                              1              365
## 4              4.64                              1              194
## 5              0.10                              1                0
## 6              0.59                              1              129
str(listings)
## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : chr  "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr  "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr  "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr  "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : chr  "2018-10-19" "2019-05-21" "" "2019-07-05" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...

Let’s check to make sure the host-level fields really are the same across all rows for a given host ID:

c(count(unique(listings[c('host_id')])),
  count(unique(listings[c('host_id', 'host_name')])),
  count(unique(listings[c('host_id', 'calculated_host_listings_count')])))
## $n
## [1] 37457
## 
## $n
## [1] 37457
## 
## $n
## [1] 37457
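
As an aside, dplyr’s n_distinct() can run the same check a bit more compactly. This is just a sketch, not run here; given the output above it should return 37457 for all three counts.

# One distinct count per question: hosts, (host, name) pairs, (host, listing-count) pairs
summarize(listings,
          hosts          = n_distinct(host_id),
          host_names     = n_distinct(host_id, host_name),
          listing_counts = n_distinct(host_id, calculated_host_listings_count))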

That looks good. Let’s create our hosts dataset.

# Fill NAs with 0; a listing with no reviews can reasonably be interpreted as 0 reviews per month
listings[is.na(listings$reviews_per_month), 'reviews_per_month'] <- 0

listings <- group_by(listings, host_id, host_name, calculated_host_listings_count)

hosts <- summarize(listings,
                   avg_price = mean(price),
                   count_listings = n(),
                   avg_reviews_per_month = mean(reviews_per_month))

head(hosts)
## # A tibble: 6 x 6
## # Groups:   host_id, host_name [6]
##   host_id host_name calculated_host~ avg_price count_listings
##     <int> <chr>                <int>     <dbl>          <int>
## 1    2438 Tasos                    1      95                1
## 2    2571 Teedo                    1     182                1
## 3    2787 John                     6     101.               6
## 4    2845 Jennifer                 2     162                2
## 5    2868 Letha M.                 1      60                1
## 6    2881 Loli                     2      58.5              2
## # ... with 1 more variable: avg_reviews_per_month <dbl>

The dataset already includes a per-host listing count (calculated_host_listings_count), so let’s see if our aggregate matches it.

sum(hosts$count_listings != hosts$calculated_host_listings_count)
## [1] 0

They are equivalent, so let’s remove one.

hosts <- subset(hosts, select=-c(calculated_host_listings_count))

Let’s take a look at listing count. The advantage of having a dataset grouped by host is that this variable is more meaningful there; you can compare how its distribution looks in each dataset.

listing_counts <- as.data.frame(cbind(table(hosts$count_listings),
                                      table(listings$calculated_host_listings_count)))

names(listing_counts) <- c('by_host', 'by_listing')
listing_counts
##     by_host by_listing
## 1     32303      32303
## 2      3329       6658
## 3       951       2853
## 4       360       1440
## 5       169        845
## 6        95        570
## 7        57        399
## 8        52        416
## 9        26        234
## 10       21        210
## 11       10        110
## 12       15        180
## 13       10        130
## 14        5         70
## 15        5         75
## 16        1         16
## 17        4         68
## 18        3         54
## 19        1         19
## 20        2         40
## 21        1         21
## 23        3         69
## 25        2         50
## 26        1         26
## 27        1         27
## 28        2         56
## 29        1         29
## 30        1         30
## 31        2         62
## 32        1         32
## 33        3         99
## 34        2         68
## 37        1         37
## 39        1         39
## 43        1         43
## 47        1         47
## 49        2         98
## 50        1         50
## 52        2        104
## 65        1         65
## 87        1         87
## 91        1         91
## 96        2        192
## 103       1        103
## 121       1        121
## 232       1        232
## 327       1        327

You can see the numbers get stranger as the count goes up. The original listings dataset has 327 rows where this value is 327, but that is really just one host with 327 listings. Depending on what you’re looking at, this may not matter much. But if you think the host has a large effect on your response variable, it may skew your results to have one host show up in 327 rows.
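
To put a rough number on that concern, one could check what share of rows in each dataset comes from hosts with many listings. This is a sketch, not run here, and the 10-listing cutoff is an arbitrary illustration.

# Share of listings whose host has more than 10 listings, vs. share of hosts with more than 10 listings
mean(listings$calculated_host_listings_count > 10)
mean(hosts$count_listings > 10)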

Let’s check out the correlation between the number of listings and the average number of reviews per month. It should be noted that I’m taking an average of an average here, and there is no way for me to weight that average properly, since the dataset doesn’t say how long each listing has been listed. This is not great, but it will have to do for this analysis.
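
As an aside, one alternative aggregate I’m not using below (a sketch, not run here): total reviews per month across each host’s listings, which the dataset does support and which sidesteps the average-of-an-average issue.

# Total reviews per month per host; `listings` is already grouped by host at this
# point, so summarize() collapses it to one row per host (sketch only, not used below)
hosts_totals <- summarize(listings, total_reviews_per_month = sum(reviews_per_month))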

ggplot(hosts, aes(x = count_listings, y = avg_reviews_per_month)) +
  geom_point() +
  geom_smooth(method = "lm")

It’s hard to say exactly from the chart, but there does appear to be a positive correlation.
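
For a quick numeric check to go with the chart, one could also compute the correlation directly (a sketch, not run here):

# Pearson correlation (and a significance test) between host listing count
# and average reviews per month
cor(hosts$count_listings, hosts$avg_reviews_per_month)
cor.test(hosts$count_listings, hosts$avg_reviews_per_month)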

model <- glm(avg_reviews_per_month ~ count_listings, data = hosts)
summary(model)
## 
## Call:
## glm(formula = avg_reviews_per_month ~ count_listings, data = hosts)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4370  -0.9861  -0.6761   0.4339  19.9139  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.015543   0.008469 119.915  < 2e-16 ***
## count_listings 0.010550   0.002773   3.804 0.000143 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.19559)
## 
##     Null deviance: 82268  on 37456  degrees of freedom
## Residual deviance: 82236  on 37455  degrees of freedom
## AIC: 135760
## 
## Number of Fisher Scoring iterations: 2

The coefficient says that each extra listing a host has is associated with only about 0.01 more average reviews per month, i.e. roughly one extra review every eight years, so the relationship is positive but very small. Out of curiosity, let’s fit the same model on the original listings dataset.

model2 <- glm(reviews_per_month ~ calculated_host_listings_count, data = listings)
summary(model2)
## 
## Call:
## glm(formula = reviews_per_month ~ calculated_host_listings_count, 
##     data = listings)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.105  -1.053  -0.705   0.495  57.413  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.107293   0.007383  149.98   <2e-16 ***
## calculated_host_listings_count -0.002293   0.000219  -10.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.545653)
## 
##     Null deviance: 124744  on 48894  degrees of freedom
## Residual deviance: 124465  on 48893  degrees of freedom
## AIC: 184449
## 
## Number of Fisher Scoring iterations: 2

And now we have a negative coefficient. Why the discrepancy? I would hypothesize that some of the large values (remember our host with 327 listings) are skewing the results. In the original dataset, that host shows up in 327 rows, so they carry a lot of weight.
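
One way to probe that hypothesis (a sketch, not run here; the 50-listing cutoff is arbitrary) would be to refit the listing-level model with the largest hosts excluded and see whether the coefficient stays negative:

# Drop listings whose host has an unusually large number of listings,
# then refit the same listing-level model (threshold chosen arbitrarily)
listings_trimmed <- filter(listings, calculated_host_listings_count <= 50)
model3 <- glm(reviews_per_month ~ calculated_host_listings_count, data = listings_trimmed)
summary(model3)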

Probably the more interesting question is: which conclusion is right? I would say they are both right, but they are answering different questions.

If you pull up two listings online and see that one is its host’s only listing while the other belongs to a “superhost” with many properties, you would want to bet on the former having more reviews per month. (Though the coefficient is really small, so I wouldn’t bet your life savings or anything.)

On the other hand, if you compare the hosts of those properties themselves, you’d want to place your bet on the “superhost” having more reviews per month on average across all of their properties.
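
To make the “different questions” point concrete, here is a sketch (not run here) of what each model predicts for an example host with 1 listing versus 50 listings; the values 1 and 50 are arbitrary illustrations.

# Host-level model: predicted average reviews per month across a host's listings
predict(model, newdata = data.frame(count_listings = c(1, 50)))

# Listing-level model: predicted reviews per month for an individual listing
predict(model2, newdata = data.frame(calculated_host_listings_count = c(1, 50)))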