New York City Airbnb Open Data

Mohammad Jafari 3733815 - Ali Eslahi 3702858

Last updated: 27 October, 2019

Introduction

In the decade since its launch, the online home rental platform Airbnb has amassed millions of rooms worldwide. Since 2008, guests and hosts have used Airbnb to develop a travelling culture and to offer a more personalized way of experiencing the world. Eleven years on, Airbnb's site lists more than six million rooms, flats and houses in more than 81,000 cities across the globe. On average, two million people rest their heads in an Airbnb property each night, half a billion since 2008. In this assignment, we investigate the prices of Airbnb accommodations across the different neighbourhoods of New York City in 2019. Furthermore, based on the prices in different areas, we can analyze the popularity of the properties, form hypotheses about them and test them.

Introduction Cont.

The data used here are Airbnb listings for New York City, USA, in 2019. Among other fields, the data set records:
• Pricing

• Neighbourhood

• Latitude

• Longitude

• Room type

• Minimum nights

• Last review

• Availability

Problem Statement

People often struggle to find accommodation when they take a short trip. Since most trips to New York are for business purposes, commuting to the central business district is an important factor in choosing suitable accommodation. In addition, we carry out a hypothesis test on the average price of a private room in Manhattan.

Data

library(readxl)  # read_excel()
library(dplyr)   # %>%, filter(), group_by() and summarise() used below
AB_NYC_2019 <- read_excel("~/Desktop/AB_NYC_2019.xlsx")

We have gathered our data from https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/downloads/AB_NYC_2019.csv/3
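The Kaggle download is a CSV file, whereas the code above reads an Excel copy of it. As a minimal sketch, the CSV could be read directly with base R's read.csv(). To keep the example self-contained, it writes a two-row stand-in file to a temporary path. The path and the reduced set of columns are assumptions for illustration only.

```r
# Hypothetical stand-in for the downloaded AB_NYC_2019.csv (two rows, four columns).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,neighbourhood_group,room_type,price",
    "2539,Brooklyn,Private room,149",
    "2595,Manhattan,Entire home/apt,225"),
  csv_path
)

# Read the CSV directly; no conversion to Excel is needed.
AB_NYC_sample <- read.csv(csv_path, stringsAsFactors = FALSE)
str(AB_NYC_sample)
```

With the real file, csv_path would simply point to wherever AB_NYC_2019.csv was saved.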

Data Cont.

For our purpose we have chosen the variables neighbourhood_group and price. Prices are in US dollars per night.

New York City Airbnb 2019 data
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365
2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365
2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355
3647 | THE VILLAGE OF HARLEM….NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NA | NA | 1 | 365
3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194
5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0
5099 | Large Cozy 1 BR Apartment In Midtown East | 7322 | Chris | Manhattan | Murray Hill | 40.74767 | -73.97500 | Entire home/apt | 200 | 3 | 74 | 2019-06-22 | 0.59 | 1 | 129

Descriptive Statistics and Visualisation

First we plot the frequency distribution of prices to see how much it costs to rent accommodation in NYC.

AB_NYC_2019$price %>% hist(xlab="Price",main="frequency of prices", col = "dodgerblue3", breaks = 78)

Because of the outliers shown in the box plot below, it is not very clear how much the prices vary.

boxplot(AB_NYC_2019$price, main = "overal view of property prices")

Hence, by applying log10 to the prices, we can reduce the influence of the outliers and get a quick estimate of the typical price range.

hist(log10(AB_NYC_2019$price),xlab="price", main = "Transformed frequency of the price data Airbnb NYC", col = "dodgerblue3", breaks = 78)

The graph above shows that most of the properties are priced around 10^2 = 100 USD.
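To illustrate why the log10 transform helps, here is a small sketch on hypothetical prices (not taken from the data set). It also shows one detail worth keeping in mind: the summaries in this report have a minimum price of 0, and log10(0) is -Inf, so such listings carry no finite value on the log scale.

```r
# Hypothetical nightly prices in USD, including a zero and one extreme value.
prices <- c(0, 50, 70, 100, 150, 10000)

log_prices <- log10(prices)

# log10(0) is -Inf; keep only the finite values for summaries.
finite_log_prices <- log_prices[is.finite(log_prices)]

range(prices[prices > 0])   # raw scale spans 50 to 10000
range(finite_log_prices)    # log scale spans roughly 1.7 to 4
```

On the raw scale the largest value is 200 times the smallest positive one; on the log scale the same values sit within a range of about 2.3 units, which is why the histogram becomes readable.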

Room type analysis.

We would like to see how much prices vary depending on whether the listing is a private room, a shared room or an entire house or apartment. Filtering the data by room type gives:

Privaterooms <- AB_NYC_2019 %>% filter(room_type == "Private room")
Privaterooms$price %>% summary()
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    50.00    70.00    89.78    95.00 10000.00
Entirehouse <- AB_NYC_2019 %>% filter(room_type == "Entire home/apt")
Entirehouse$price %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   120.0   160.0   211.8   229.0 10000.0
sharedroom<- AB_NYC_2019 %>% filter(room_type == "Shared room")
sharedroom$price %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   33.00   45.00   70.13   75.00 1800.00

Since our data are extremely skewed because of the outliers, we prefer to use the median price as a more reliable measure than the mean.
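The effect of a single extreme listing on the mean can be seen in a small sketch. The prices below are hypothetical and chosen only to mimic the skew seen in the summaries above.

```r
# Hypothetical nightly prices in USD with one extreme listing.
prices <- c(50, 60, 70, 80, 95, 10000)

mean_price   <- mean(prices)    # pulled far upwards by the single 10000 listing
median_price <- median(prices)  # stays close to the typical listing

# Dropping the extreme listing barely moves the median but changes the mean a lot.
typical <- prices[prices < 1000]
c(mean = mean(typical), median = median(typical))
```

Here the median moves from 75 to 70 when the extreme value is removed, while the mean drops from above 1700 to 71.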

Based on the median values and our budget, we can choose the type of accommodation that suits us.

Descriptive Statistics Cont.

The neighbourhood is also an important factor in finding a nice place to stay.

table(AB_NYC_2019$room_type,AB_NYC_2019$neighbourhood_group)
##                  
##                   Bronx Brooklyn Manhattan Queens Staten Island
##   Entire home/apt   379     9559     13199   2096           176
##   Private room      652    10132      7982   3372           188
##   Shared room        60      413       480    198             9

The table above shows the number of listings in each neighbourhood group, broken down by room type.
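Raw counts are hard to compare across boroughs of very different size. As a sketch, the counts from the table can be turned into within-borough shares with prop.table(). The matrix below re-enters the counts by hand; the object names are illustrative.

```r
# Counts copied from the room type by neighbourhood group table.
counts <- matrix(
  c(379,  9559, 13199, 2096, 176,
    652, 10132,  7982, 3372, 188,
     60,   413,   480,  198,   9),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    room_type = c("Entire home/apt", "Private room", "Shared room"),
    neighbourhood_group = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
  )
)

# margin = 2 normalises each column, giving the room type mix within each borough.
shares <- prop.table(counts, margin = 2)
round(100 * shares, 1)
```

For example, about 61% of Manhattan listings are entire homes or apartments, whereas in Brooklyn private rooms are slightly more common than entire homes.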

AB_NYC_2019 %>% boxplot(price ~ neighbourhood_group,data = ., main="Box Plot of price vs neighbourhood", 
                     ylab="neighbourhood", xlab="Price",horizontal=TRUE, col = "skyblue")

The box plot above shows that Manhattan and Brooklyn have the most expensive properties compared to the other neighbourhood groups. However, we cannot read the median price in each area from this plot. So we summarise each room type by area to obtain the median values.

Entirehouse %>% group_by(neighbourhood_group) %>% summarise(Min = min(price,na.rm = TRUE),
                                         Q1 = quantile(price,probs = .25,na.rm = TRUE),
                                         Median = median(price, na.rm = TRUE),
                                         Q3 = quantile(price,probs = .75,na.rm = TRUE),
                                         Max = max(price,na.rm = TRUE),
                                         Mean = mean(price, na.rm = TRUE),
                                         SD = sd(price, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(price)))
Privaterooms %>% group_by(neighbourhood_group) %>% summarise(Min = min(price,na.rm = TRUE),
                                         Q1 = quantile(price,probs = .25,na.rm = TRUE),
                                         Median = median(price, na.rm = TRUE),
                                         Q3 = quantile(price,probs = .75,na.rm = TRUE),
                                         Max = max(price,na.rm = TRUE),
                                         Mean = mean(price, na.rm = TRUE),
                                         SD = sd(price, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(price)))
sharedroom %>% group_by(neighbourhood_group) %>% summarise(Min = min(price,na.rm = TRUE),
                                         Q1 = quantile(price,probs = .25,na.rm = TRUE),
                                         Median = median(price, na.rm = TRUE),
                                         Q3 = quantile(price,probs = .75,na.rm = TRUE),
                                         Max = max(price,na.rm = TRUE),
                                         Mean = mean(price, na.rm = TRUE),
                                         SD = sd(price, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(price)))
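The three summaries above repeat the same code for each room type. One possible alternative, sketched here on a small hypothetical data frame, is to group by room type and neighbourhood group together so that a single call produces all combinations. The data frame and its values are made up for illustration, and only a few of the statistics are shown.

```r
library(dplyr)

# Hypothetical listings; not taken from the data set.
listings <- data.frame(
  room_type = c("Private room", "Private room", "Private room",
                "Entire home/apt", "Entire home/apt", "Entire home/apt"),
  neighbourhood_group = c("Manhattan", "Manhattan", "Brooklyn",
                          "Manhattan", "Brooklyn", "Brooklyn"),
  price = c(100, 150, 60, 225, 89, 120)
)

# One row per (room type, neighbourhood group) combination.
price_summary <- listings %>%
  group_by(room_type, neighbourhood_group) %>%
  summarise(Median = median(price, na.rm = TRUE),
            Mean = mean(price, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(price))) %>%
  ungroup()

price_summary
```

Applied to AB_NYC_2019, the same pattern would avoid the need for the separate Entirehouse, Privaterooms and sharedroom summaries.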

Hypothesis Testing

For a more focused comparison, we narrowed our sample down to private rooms in Manhattan by filtering as follows:

Manhattan <- filter(Privaterooms, neighbourhood_group == "Manhattan")

The result has 7982 observations, well above 30, which is considered a large sample, so we can safely proceed with a one-sample t-test. The statistical hypotheses for the test are as follows:

                          H0: μ = 116
                          HA: μ ≠ 116

m <- filter(AB_NYC_2019, neighbourhood_group == "Manhattan", room_type == "Private room")
t.test(m$price, alternative = "two.sided", mu = 116, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  m$price
## t = 0.36482, df = 7981, p-value = 0.7153
## alternative hypothesis: true mean is not equal to 116
## 95 percent confidence interval:
##  112.6036 120.9496
## sample estimates:
## mean of x 
##  116.7766

The result above shows that the test statistic is t = 0.36482 with 7981 degrees of freedom.
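The test statistic is the distance between the sample mean and the hypothesised mean, measured in standard errors. The sketch below computes it by hand on a small hypothetical sample and checks it against t.test(). The prices are made up for illustration.

```r
# Hypothetical nightly prices in USD; not taken from the data set.
prices <- c(95, 120, 80, 150, 110, 135, 99, 125)
mu0 <- 116  # hypothesised mean, as in H0

n <- length(prices)
standard_error <- sd(prices) / sqrt(n)

# t = (sample mean - hypothesised mean) / standard error
t_manual <- (mean(prices) - mu0) / standard_error

# t.test() reports the same statistic.
t_builtin <- unname(t.test(prices, mu = mu0)$statistic)
all.equal(t_manual, t_builtin)
```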

Hypothesis Testing Cont.

The t-test gives a 95 percent confidence interval for the mean price of 112.60 to 120.95. Since the hypothesised value of 116 lies within this interval, we fail to reject H0.

Also, with a p-value of 0.7153 > 0.05, we fail to reject the null hypothesis.
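As a consistency check, the p-value and confidence interval can be reproduced from the numbers printed in the t.test() output. In this sketch the standard error is not reported directly; it is derived from the sample mean and the t statistic, so the results match the output only up to rounding.

```r
# Values copied from the t.test() output.
x_bar  <- 116.7766  # sample mean
t_stat <- 0.36482   # test statistic
df     <- 7981      # degrees of freedom
mu0    <- 116       # hypothesised mean

# Implied standard error (derived, not printed in the output).
se <- (x_bar - mu0) / t_stat

# Two-sided p-value and 95% confidence interval for the mean.
p_value <- 2 * pt(-abs(t_stat), df)
ci <- x_bar + c(-1, 1) * qt(0.975, df) * se

# The two decision rules agree: p > 0.05 exactly when mu0 lies inside the interval.
reject_h0 <- p_value < 0.05
mu0_in_ci <- mu0 >= ci[1] && mu0 <= ci[2]
c(p_value = p_value, lower = ci[1], upper = ci[2])
```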

Discussion

The one-sample t-test found that the mean price of private rooms in Manhattan was not statistically significantly different from 116 US dollars.

In general, prices in Brooklyn and Manhattan tend to be higher than in the other regions in all room type categories.

In each neighbourhood group there were some outliers, where hosts offered their property at well above the typical price; Manhattan and Brooklyn have more hosts in that category.

From the latitude and longitude recorded in the data, we could also derive the distance between each property and the business district to better inform the choice of accommodation.
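One way such a distance could be computed is the haversine (great-circle) formula, sketched below. The helper name haversine_km is hypothetical, and the reference point is an assumption: the coordinates of the Midtown listing from the sample rows stand in for the business district, since the report does not define one.

```r
# Great-circle distance in kilometres between two points given in degrees.
# haversine_km is an illustrative helper, not part of any package used here.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Assumed reference point: the Midtown listing (id 2595) from the sample rows.
ref_lat <- 40.75362
ref_lon <- -73.98377

# Distance from the Kensington listing (id 2539) to the reference point.
d_kensington <- haversine_km(40.64749, -73.97237, ref_lat, ref_lon)
d_kensington
```

Because the function is vectorised, it could be applied to the latitude and longitude columns of the whole data set in one call.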

In addition, we could see which neighbourhood areas had the most popular properties in 2019 based on their total number of reviews.
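A minimal sketch of such a comparison, using only the six sample rows shown in the Data section, sums number_of_reviews per neighbourhood group with base R's aggregate(). Six rows say nothing about the full data set; the point is only the shape of the calculation.

```r
# number_of_reviews for the six sample rows shown in the Data section.
sample_rows <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan",
                          "Brooklyn", "Manhattan", "Manhattan"),
  number_of_reviews = c(9, 45, 0, 270, 9, 74)
)

# Total reviews per neighbourhood group, largest first.
reviews_by_group <- aggregate(number_of_reviews ~ neighbourhood_group,
                              data = sample_rows, FUN = sum)
reviews_by_group[order(-reviews_by_group$number_of_reviews), ]
```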

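According to the tested hypothesis, the data are consistent with a mean price of 116 US dollars per night for a private room in Manhattan (95% confidence interval 112.60 to 120.95). A budget close to 116 US dollars per night therefore matches the average price of a private room in Manhattan.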

References

https://www.kaggle.com

https://yihui.name/knitr/options/?version=1.2.1335&mode=desktop

https://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset