Introduction

Since its launch in 2008, the online home rental platform Airbnb has amassed millions of rooms worldwide. Guests and hosts have used Airbnb to develop a travelling culture and to offer a more personalized way of experiencing the world. Eleven years on, Airbnb’s site lists more than six million rooms, flats and houses in more than 81,000 cities across the globe. On average, two million people rest their heads in an Airbnb property each night, half a billion since 2008. In this assignment, we investigate the prices of Airbnb accommodations across the different neighbourhoods of New York City in 2019. Furthermore, based on the prices in different areas, we analyze the popularity of properties and formulate and test a hypothesis about them.

Introduction Cont.

The data used here are Airbnb listings for New York City, USA, in 2019. The data were collected and stored with the following fields:
• Pricing

• Neighbourhood

• Latitude

• Longitude

• Room type

• Minimum nights

• Last review

• Availability

Problem Statement

Travellers often struggle to find accommodation for a short trip. Because most trips to New York are for business purposes, commuting distance to the central business district is an important factor in choosing suitable accommodation. In addition, we carry out a hypothesis test on the average price of a private room in Manhattan.

Data

library(readxl)  # read_excel()
library(dplyr)   # %>%, filter(), group_by() and summarise() used below
AB_NYC_2019 <- read_excel("~/Desktop/AB_NYC_2019.xlsx")

We have gathered our data from https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/downloads/AB_NYC_2019.csv/3

Data Cont.

For our purpose we have chosen the variables neighbourhood_group and price. Prices are in US dollars per night.

New York City Airbnb 2019 data
id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365
2539	Clean & quiet apt home by the park	2787	John	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	1	9	2018-10-19	0.21	6	365
2595	Skylit Midtown Castle	2845	Jennifer	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	1	45	2019-05-21	0.38	2	355
3647	THE VILLAGE OF HARLEM….NEW YORK !	4632	Elisabeth	Manhattan	Harlem	40.80902	-73.94190	Private room	150	3	0	NA	NA	1	365
3831	Cozy Entire Floor of Brownstone	4869	LisaRoxanne	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	1	270	2019-07-05	4.64	1	194
5022	Entire Apt: Spacious Studio/Loft by central park	7192	Laura	Manhattan	East Harlem	40.79851	-73.94399	Entire home/apt	80	10	9	2018-11-19	0.10	1	0
5099	Large Cozy 1 BR Apartment In Midtown East	7322	Chris	Manhattan	Murray Hill	40.74767	-73.97500	Entire home/apt	200	3	74	2019-06-22	0.59	1	129

Descriptive Statistics and Visualisation

First we plot the frequency distribution of prices to see how much it costs to rent accommodation in NYC.

AB_NYC_2019$price %>% hist(xlab="Price",main="frequency of prices", col = "dodgerblue3", breaks = 78)

Because of the outliers shown in the figure below, it is not clear how much the prices vary.

boxplot(AB_NYC_2019$price, main = "Overall view of property prices")

Hence, by applying log10 to the prices, we can reduce the influence of the outliers and get a quick estimate of the typical price range.

hist(log10(AB_NYC_2019$price),xlab="price", main = "Transformed frequency of the price data Airbnb NYC", col = "dodgerblue3", breaks = 78)

The graph above shows that most properties are priced at around 10^2 = 100 USD.
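To illustrate why the log transform helps, here is a minimal sketch on made-up prices (the vector toy_prices is hypothetical and not taken from the data set). It shows how log10 compresses the upper tail. It also shows that a price of 0, which does occur in the data, becomes -Inf; hist() drops non-finite values, so such listings do not appear in the transformed histogram.

```r
# Hypothetical prices: mostly moderate, one zero and one extreme outlier
toy_prices <- c(0, 45, 70, 95, 150, 229, 10000)

log_prices <- log10(toy_prices)
print(round(log_prices, 2))

# Keep only the finite values, as hist() does internally
finite_logs <- log_prices[is.finite(log_prices)]

# The raw prices span 10000 units; the finite log values span fewer than 3
print(range(finite_logs))

# Number of listings that would be missing from the log histogram
n_dropped <- sum(!is.finite(log_prices))
print(n_dropped)
```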

Room Type Analysis

We would like to see how much prices vary depending on whether the listing is a private room, a shared room, or an entire house or apartment. Filtering the data by room type, we have:

Privaterooms <- AB_NYC_2019 %>% filter(room_type == "Private room")
Privaterooms$price %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    50.00    70.00    89.78    95.00 10000.00

Entirehouse <- AB_NYC_2019 %>% filter(room_type == "Entire home/apt")
Entirehouse$price %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   120.0   160.0   211.8   229.0 10000.0

sharedroom<- AB_NYC_2019 %>% filter(room_type == "Shared room")
sharedroom$price %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   33.00   45.00   70.13   75.00 1800.00

Since our data is extremely skewed because of the outliers, we prefer to use the median price as a more reliable parameter than the mean.
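The effect of a single extreme listing on the mean can be sketched with a handful of made-up prices (toy_prices is hypothetical; only the 10000 maximum mirrors the summaries above):

```r
# Hypothetical nightly prices with one extreme outlier
toy_prices <- c(50, 60, 70, 80, 95, 10000)

mean_with   <- mean(toy_prices)
median_with <- median(toy_prices)

# Same prices without the outlier
trimmed        <- toy_prices[toy_prices < 10000]
mean_without   <- mean(trimmed)
median_without <- median(trimmed)

# The mean shifts by more than a thousand dollars, the median by only a few
print(c(mean_with = mean_with, mean_without = mean_without))
print(c(median_with = median_with, median_without = median_without))
```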

Based on the median values and our budget, we can choose the type of accommodation that suits us.

Descriptive Statistics Cont.

The neighbourhood area is also an important factor in finding a nice place to stay.

table(AB_NYC_2019$room_type,AB_NYC_2019$neighbourhood_group)

##                  
##                   Bronx Brooklyn Manhattan Queens Staten Island
##   Entire home/apt   379     9559     13199   2096           176
##   Private room      652    10132      7982   3372           188
##   Shared room        60      413       480    198             9

The table above shows the number of listings in each neighbourhood group, broken down by room type.
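Raw counts are hard to compare across areas of very different size, so one option is to convert each column to shares. The sketch below re-enters the counts from the table by hand (the object names counts and shares are ours) and uses prop.table() with margin = 2 to get the room type mix within each neighbourhood group.

```r
# Counts copied from the table above
counts <- matrix(
  c(379,  9559, 13199, 2096, 176,
    652, 10132,  7982, 3372, 188,
     60,   413,   480,  198,   9),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    room_type = c("Entire home/apt", "Private room", "Shared room"),
    neighbourhood_group = c("Bronx", "Brooklyn", "Manhattan",
                            "Queens", "Staten Island")
  )
)

# Share of each room type within each neighbourhood group (columns sum to 1)
shares <- prop.table(counts, margin = 2)
print(round(100 * shares, 1))
```

Read this way, entire homes make up the majority of Manhattan listings, while private rooms are the largest category in the other four neighbourhood groups.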

AB_NYC_2019 %>% boxplot(price ~ neighbourhood_group,data = ., main="Box Plot of price vs neighbourhood", 
                     ylab="neighbourhood", xlab="Price",horizontal=TRUE, col = "skyblue")

The box plot above shows that the most expensive properties are in Manhattan and Brooklyn compared with the other neighbourhood groups. However, we cannot read the median price of each area from this plot. So we summarise each area by room type to find their median values.

Entirehouse %>% group_by(neighbourhood_group) %>% summarise(Min = min(price,na.rm = TRUE),
                                         Q1 = quantile(price,probs = .25,na.rm = TRUE),
                                         Median = median(price, na.rm = TRUE),
                                         Q3 = quantile(price,probs = .75,na.rm = TRUE),
                                         Max = max(price,na.rm = TRUE),
                                         Mean = mean(price, na.rm = TRUE),
                                         SD = sd(price, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(price)))

Privaterooms %>% group_by(neighbourhood_group) %>% summarise(Min = min(price,na.rm = TRUE),
                                         Q1 = quantile(price,probs = .25,na.rm = TRUE),
                                         Median = median(price, na.rm = TRUE),
                                         Q3 = quantile(price,probs = .75,na.rm = TRUE),
                                         Max = max(price,na.rm = TRUE),
                                         Mean = mean(price, na.rm = TRUE),
                                         SD = sd(price, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(price)))

sharedroom %>% group_by(neighbourhood_group) %>% summarise(Min = min(price,na.rm = TRUE),
                                         Q1 = quantile(price,probs = .25,na.rm = TRUE),
                                         Median = median(price, na.rm = TRUE),
                                         Q3 = quantile(price,probs = .75,na.rm = TRUE),
                                         Max = max(price,na.rm = TRUE),
                                         Mean = mean(price, na.rm = TRUE),
                                         SD = sd(price, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(price)))
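The three summaries above repeat the same code for each room type. As a sketch of a more compact alternative, grouping by both room_type and neighbourhood_group produces all combinations in one table. The data frame toy below is made up for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical listings standing in for AB_NYC_2019
toy <- data.frame(
  room_type = c("Private room", "Private room", "Entire home/apt",
                "Entire home/apt", "Shared room", "Private room"),
  neighbourhood_group = c("Manhattan", "Manhattan", "Brooklyn",
                          "Brooklyn", "Queens", "Brooklyn"),
  price = c(100, 140, 200, 180, 40, 60)
)

# One row per room type and neighbourhood group
price_summary <- toy %>%
  group_by(room_type, neighbourhood_group) %>%
  summarise(
    Median  = median(price, na.rm = TRUE),
    Mean    = mean(price, na.rm = TRUE),
    n       = n(),
    Missing = sum(is.na(price)),
    .groups = "drop"
  )

print(price_summary)
```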

Hypothesis Testing

To make a sharper judgement, we narrow our sample down to private rooms in the Manhattan district by filtering as follows:

Manhattan <- filter(Privaterooms, neighbourhood_group == "Manhattan")

As a result, we have 7982 > 30 observations, which is considered a large sample, so we are safe to continue with a one-sample t-test. The statistical hypotheses for the test are as follows:

                          H0: μ = 116
                          HA: μ ≠ 116
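For reference, the statistic behind a one-sample t-test has the standard form below. The symbols are our notation rather than the original slides': \bar{x} is the sample mean price, s the sample standard deviation, n = 7982 the sample size and \mu_0 = 116 the hypothesised mean.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \mathrm{df} = n - 1 = 7981
```

At the 5% significance level, H0 is rejected when |t| exceeds the critical value t_{0.975,\,7981} \approx 1.96.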

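# Private rooms in Manhattan (same rows as the Manhattan data frame above)
m <- filter(AB_NYC_2019, neighbourhood_group == "Manhattan", room_type == "Private room")
# conf.level (not confint) is the t.test() argument that sets the confidence level
t.test(m$price, alternative = "two.sided", mu = 116, conf.level = 0.95)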

## 
##  One Sample t-test
## 
## data:  m$price
## t = 0.36482, df = 7981, p-value = 0.7153
## alternative hypothesis: true mean is not equal to 116
## 95 percent confidence interval:
##  112.6036 120.9496
## sample estimates:
## mean of x 
##  116.7766

The result above shows that the test statistic is t = 0.36482 with 7981 degrees of freedom.
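As a sanity check, the confidence interval and p-value can be approximately rebuilt from the printed mean, t statistic and degrees of freedom alone. The sketch below uses only numbers from the output above, so it is limited by their rounding. The variable names are ours.

```r
# Values copied from the t.test() output
xbar   <- 116.7766
mu0    <- 116
t_stat <- 0.36482
n      <- 7982

# Standard error implied by t = (xbar - mu0) / se
se <- (xbar - mu0) / t_stat

# 95% confidence interval for the mean
crit <- qt(0.975, df = n - 1)
ci   <- xbar + c(-1, 1) * crit * se

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

print(round(ci, 2))
print(round(p_value, 4))
```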

Hypothesis Testing Cont.

The t-test gives a 95 percent confidence interval for the mean price of 112.60 to 120.95. Since the hypothesised value of 116 lies inside this interval, we fail to reject H0.

Also, with a p-value of 0.7153 > 0.05, we fail to reject the null hypothesis.

Discussion

The one-sample t-test found that the mean price of private rooms in Manhattan was not statistically significantly different from 116 US dollars.

In general, prices in Brooklyn and Manhattan tend to be higher than in other regions across all room type categories.

In each neighbourhood there were some outliers, where hosts offered their property at well above the usual price, and Manhattan and Brooklyn have more hosts who fall into that category.

From the data analysis, and using the latitude and longitude in our data, we can also derive the distance between a property and the business district to make a better judgement when choosing accommodation.
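One way such a distance could be derived is the haversine great-circle formula. The sketch below is illustrative only: the function haversine_km is ours, the reference point (roughly Times Square) is an assumed stand-in for the business district, and the two listings use the Midtown and Kensington coordinates from the data preview.

```r
# Great-circle distance in kilometres between two latitude/longitude points
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Assumed reference point for the business district (approx. Times Square)
cbd_lat <- 40.7580
cbd_lon <- -73.9855

# Two listings from the data preview
listings <- data.frame(
  neighbourhood = c("Midtown", "Kensington"),
  latitude      = c(40.75362, 40.64749),
  longitude     = c(-73.98377, -73.97237)
)

listings$dist_km <- haversine_km(listings$latitude, listings$longitude,
                                 cbd_lat, cbd_lon)
print(listings)
```

Straight-line distance ignores actual transit routes, so it is only a rough proxy for commuting effort.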

In addition, we can see which neighbourhood area had the best properties in 2019 based on its total number of reviews.
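A minimal sketch of such a comparison, using only the six rows shown in the data preview (so the totals say nothing about the full data set), sums number_of_reviews per neighbourhood group:

```r
# The six preview rows: neighbourhood group and review count only
sample_rows <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan",
                          "Brooklyn", "Manhattan", "Manhattan"),
  number_of_reviews   = c(9, 45, 0, 270, 9, 74)
)

# Total reviews per neighbourhood group, largest first
total_reviews <- tapply(sample_rows$number_of_reviews,
                        sample_rows$neighbourhood_group, sum)
print(sort(total_reviews, decreasing = TRUE))
```

Review totals measure volume rather than quality, and areas with more listings naturally collect more reviews, so a per-listing average may be a fairer comparison.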

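According to the tested hypothesis, the data are consistent with a mean price of about 116 US dollars per night for a private room in Manhattan, so a budget close to that amount is a reasonable starting point. Note that the test concerns the mean price only; it does not give the probability of finding a suitable room at that price.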

New York City Airbnb Open Data

RPubs link information