New York City Airbnb Open Data

Part 1 - Introduction

Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to offer a more unique, personalized way of experiencing the world. Compared with hotels, Airbnb often has a price advantage and gives travelers more choice. This dataset describes the listing activity and metrics in NYC, NY for 2019. We will explore the data and look for interesting findings.

####Data Preparation

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
airbnb<-read.csv("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/AB_NYC_2019.csv")

# We are only interested in listings whose last review was in 2019
airbnb<-separate(airbnb,last_review,c("Year","Month","Day"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 10052 rows
## [3, 20, 27, 37, 39, 194, 205, 261, 266, 268, 277, 346, 350, 391, 426, 433,
## 438, 487, 546, 586, ...].
airbnb19<-filter(airbnb,Year=="2019")
head(airbnb19)
##     id                                      name host_id   host_name
## 1 2595                     Skylit Midtown Castle    2845    Jennifer
## 2 3831           Cozy Entire Floor of Brownstone    4869 LisaRoxanne
## 3 5099 Large Cozy 1 BR Apartment In Midtown East    7322       Chris
## 4 5178          Large Furnished Room Near B'way     8967    Shunichi
## 5 5238        Cute & Cozy Lower East Side 1 bdrm    7549         Ben
## 6 5295          Beautiful 1br on Upper West Side    7702        Lena
##   neighbourhood_group   neighbourhood latitude longitude       room_type
## 1           Manhattan         Midtown 40.75362 -73.98377 Entire home/apt
## 2            Brooklyn    Clinton Hill 40.68514 -73.95976 Entire home/apt
## 3           Manhattan     Murray Hill 40.74767 -73.97500 Entire home/apt
## 4           Manhattan  Hell's Kitchen 40.76489 -73.98493    Private room
## 5           Manhattan       Chinatown 40.71344 -73.99037 Entire home/apt
## 6           Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt
##   price minimum_nights number_of_reviews Year Month Day reviews_per_month
## 1   225              1                45 2019    05  21              0.38
## 2    89              1               270 2019    07  05              4.64
## 3   200              3                74 2019    06  22              0.59
## 4    79              2               430 2019    06  24              3.47
## 5   150              1               160 2019    06  09              1.33
## 6   135              5                53 2019    06  22              0.43
##   calculated_host_listings_count availability_365
## 1                              2              355
## 2                              1              194
## 3                              1              129
## 4                              1              220
## 5                              4              188
## 6                              1                6
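
The warning from `separate()` above presumably comes from listings that have no `last_review` value. As a minimal sketch of an alternative, the year can be extracted by parsing the column as a date, which turns unparseable entries into `NA` without a warning. The `toy` data frame and its values are hypothetical stand-ins, not rows from the Airbnb file.

```r
# Toy stand-in for the last_review column (hypothetical values);
# empty strings mimic listings that have never been reviewed.
toy <- data.frame(
  id = 1:4,
  last_review = c("2019-05-21", "", "2018-11-19", "2019-07-05"),
  stringsAsFactors = FALSE
)

# With an explicit format, entries that do not match (such as "")
# become NA instead of producing an error or a warning.
review_date <- as.Date(toy$last_review, format = "%Y-%m-%d")
toy$Year <- format(review_date, "%Y")

# Keep only rows whose last review falls in 2019; NA years are dropped.
toy19 <- toy[!is.na(toy$Year) & toy$Year == "2019", ]
print(toy19)
```

Unlike the `separate()` approach, this keeps `last_review` as a single column, so Month and Day would need their own `format()` calls if they are required later.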

Part 2 - Data

Research question

We want to know which neighbourhood group (borough) was the most popular in 2019, using the number of reviews as a measure of popularity.

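####What are the cases, and how many are there?

Each case is one Airbnb listing (a room or apartment) together with its host information. After filtering to listings last reviewed in 2019, there are 25,209 cases.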

####Describe the method of data collection.

I collected the data from Kaggle, where this Airbnb dataset was published two months ago.

####What type of study is this (observational/experiment)?

This is an observational study.

####Data Source: If you collected the data, state self-collected. If not, provide a citation/link.

Data source: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

####Response: What is the response variable, and what type is it (numerical/categorical)?

The response variable is number_of_reviews and is numerical.

####Explanatory: What is the explanatory variable(s), and what type is it (numerical/categorical)?

The explanatory variable is neighbourhood_group, and it is categorical.

Part 3 - Relevant summary statistics

summary(airbnb19)
##        id                                         name      
##  Min.   :    2595   Home away from home             :   10  
##  1st Qu.:12022728   Loft Suite @ The Box House Hotel:   10  
##  Median :22343909   Beautiful Brooklyn Brownstone   :    5  
##  Mean   :20689219   Brooklyn Apartment              :    5  
##  3rd Qu.:30376690   Harlem Gem                      :    5  
##  Max.   :36455809   Private room                    :    5  
##                     (Other)                         :25169  
##     host_id                 host_name        neighbourhood_group
##  Min.   :     2571   Michael     :  215   Bronx        :  698   
##  1st Qu.:  8159536   Sonder (NYC):  207   Brooklyn     :10466   
##  Median : 40027302   David       :  197   Manhattan    :10322   
##  Mean   : 78561358   John        :  177   Queens       : 3456   
##  3rd Qu.:139238261   Alex        :  153   Staten Island:  267   
##  Max.   :273841667   Maria       :  122                         
##                      (Other)     :24138                         
##             neighbourhood      latitude       longitude     
##  Bedford-Stuyvesant: 2209   Min.   :40.51   Min.   :-74.24  
##  Williamsburg      : 1853   1st Qu.:40.69   1st Qu.:-73.98  
##  Harlem            : 1435   Median :40.72   Median :-73.95  
##  Bushwick          : 1202   Mean   :40.73   Mean   :-73.95  
##  Hell's Kitchen    : 1119   3rd Qu.:40.76   3rd Qu.:-73.93  
##  East Village      :  866   Max.   :40.91   Max.   :-73.71  
##  (Other)           :16525                                   
##            room_type         price        minimum_nights   
##  Entire home/apt:13266   Min.   :   0.0   Min.   :  1.000  
##  Private room   :11356   1st Qu.:  69.0   1st Qu.:  1.000  
##  Shared room    :  587   Median : 105.0   Median :  2.000  
##                          Mean   : 141.8   Mean   :  4.898  
##                          3rd Qu.: 175.0   3rd Qu.:  4.000  
##                          Max.   :7500.0   Max.   :365.000  
##                                                            
##  number_of_reviews     Year              Month          
##  Min.   :  1.00    Length:25209       Length:25209      
##  1st Qu.:  5.00    Class :character   Class :character  
##  Median : 18.00    Mode  :character   Mode  :character  
##  Mean   : 40.21                                         
##  3rd Qu.: 53.00                                         
##  Max.   :629.00                                         
##                                                         
##      Day            reviews_per_month calculated_host_listings_count
##  Length:25209       Min.   : 0.020    Min.   :  1.000               
##  Class :character   1st Qu.: 0.650    1st Qu.:  1.000               
##  Mode  :character   Median : 1.460    Median :  1.000               
##                     Mean   : 1.974    Mean   :  6.148               
##                     3rd Qu.: 2.840    3rd Qu.:  2.000               
##                     Max.   :58.500    Max.   :327.000               
##                                                                     
##  availability_365
##  Min.   :  0.0   
##  1st Qu.: 22.0   
##  Median :116.0   
##  Mean   :146.4   
##  3rd Qu.:269.0   
##  Max.   :365.0   
## 
describe(airbnb19$number_of_reviews)
##    vars     n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 25209 40.21 55.32     18   28.39 22.24   1 629   628 2.73    10.89
##      se
## X1 0.35
airbnb_group<-airbnb19%>% 
  group_by(neighbourhood_group)%>%
  summarize(Mean = mean(number_of_reviews, na.rm=TRUE))

head(airbnb_group)
## # A tibble: 5 x 2
##   neighbourhood_group  Mean
##   <fct>               <dbl>
## 1 Bronx                37.7
## 2 Brooklyn             41.5
## 3 Manhattan            38.5
## 4 Queens               41.9
## 5 Staten Island        41.2
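
The `describe()` output above reports a skew of 2.73, so group means can be pulled upward by a few heavily reviewed listings. A hedged sketch of how the same `group_by()` summary could also report the group size and median is shown here on a hypothetical toy data frame (the groups "A" and "B" and their values are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data, not the Airbnb file: both groups share the
# same median, but group "B" contains one extreme value.
toy <- data.frame(
  neighbourhood_group = c("A", "A", "A", "B", "B", "B"),
  number_of_reviews   = c(10, 20, 30, 10, 20, 600)
)

# Report the count, mean, and median per group side by side.
toy_summary <- toy %>%
  group_by(neighbourhood_group) %>%
  summarize(
    n      = n(),
    Mean   = mean(number_of_reviews),
    Median = median(number_of_reviews)
  )

print(toy_summary)
```

In the toy data the means differ widely while the medians agree, which is the kind of gap a mean-only table would hide.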
ggplot(airbnb19, aes(x=number_of_reviews)) + geom_histogram(binwidth = 20)

ggplot(airbnb_group, aes(x=neighbourhood_group, y=Mean) ) +
    geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle = 10, hjust = 1))

# Keep only listings with more than 400 reviews and store their overall mean
airbnb_group2<-airbnb19%>% 
  filter(number_of_reviews>400)%>%
  mutate(Mean = mean(number_of_reviews))
  

# Rotate the neighbourhood labels so they do not overlap
ggplot(airbnb_group2, aes(x=neighbourhood, y=number_of_reviews) ) +
    geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
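
Because `geom_bar(stat="identity")` stacks every row that shares an x value, each bar in the plot above is the sum of reviews over the selected listings in that neighbourhood. A minimal sketch of computing those totals explicitly, so they can be read as a ranked table, follows. The neighbourhood names "X", "Y", "Z" and the counts are hypothetical.

```r
library(dplyr)

# Hypothetical toy listings; names and counts are made up for illustration.
toy <- data.frame(
  neighbourhood     = c("X", "X", "Y", "Z", "Z"),
  number_of_reviews = c(450, 500, 620, 410, 90)
)

# Same filter as the plot, then sum reviews per neighbourhood and
# sort so the neighbourhood with the most total reviews comes first.
toy_totals <- toy %>%
  filter(number_of_reviews > 400) %>%
  group_by(neighbourhood) %>%
  summarize(
    listings      = n(),
    total_reviews = sum(number_of_reviews)
  ) %>%
  arrange(desc(total_reviews))

print(toy_totals)
```

Keeping the `listings` count next to the total helps distinguish a neighbourhood with one very popular listing from one with several.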

Part 4 - Conclusion

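Although the five boroughs' average review counts are similar, travelers appear most interested in rooms and apartments in Queens, which has the highest mean (41.9). Specifically, among listings with more than 400 reviews, East Elmhurst in Queens has the most total reviews, likely because it is close to LaGuardia Airport.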