# Load tidyverse, ggplot2 and fansi
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(fansi)
# Load other necessary packages
library(import)
## The import package should not be attached.
## Use "colon syntax" instead, e.g. import::from, or import:::from.
library(rio)
library(ggthemes)
library(ggrepel)
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
# Read the csv file that I had saved in my set working directory
airbnb <- "airbnb_ny19.csv"
setwd("C:/Documents - Copy/PERSONAL/Data 110_MC_Class")
airbnb_test <- import(airbnb)
# View the data
view(airbnb_test)
It has a total of 16 columns and 48,895 entries.
The airbnb_ny19 dataset is one of the datasets for our Data 110 Summer-1 2021 class, which I obtained via Blackboard. It lists the daily price in dollars of Airbnb accommodation in New York City by room type: Entire home/apt, Private room and Shared room. Apart from room type, the 16 variables include other categorical variables such as neighbourhood group and neighbourhood, as well as numerical variables such as minimum nights of stay and availability. The data frame has a total of 48,895 entries.
The price of an Airbnb room is among the key questions that a traveler would want answered before booking a room. We should expect the price to vary according to some predictable factor such as the type of room or the neighbourhood. We set out to explore how the price of a room varies by room type and how this pattern varies between neighbourhood groups.
# Use str to examine the structure of the data frame
str(airbnb_test)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "10/19/2018" "5/21/2019" "" "7/5/2019" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
Several of the numeric variables we are interested in, including price, minimum_nights and availability_365, are stored as integers (int) rather than as doubles (num). We convert them to numeric so that they have the same type as the other continuous variables.
# convert price, availability_365 and minimum_nights to numeric variable
airbnb_test$price <- as.numeric(airbnb_test$price)
airbnb_test$availability_365 <- as.numeric(airbnb_test$availability_365)
airbnb_test$minimum_nights <- as.numeric(airbnb_test$minimum_nights)
str(airbnb_test)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : num 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : num 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "10/19/2018" "5/21/2019" "" "7/5/2019" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : num 365 355 365 194 0 129 0 220 0 188 ...
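As a side note, the three conversions above could also be written in one step. The sketch below is only an illustration: `toy` and `toy_num` are hypothetical names for a small stand-in data frame, not the real `airbnb_test` data, and it assumes dplyr 1.0.0 or later for `across()`. It also shows that R already treats integer columns as numeric, so the conversion mainly changes the storage type.

```r
library(dplyr)

# Hypothetical toy data standing in for airbnb_test (assumption, not the real file)
toy <- data.frame(
  price            = c(149L, 225L, 150L),
  minimum_nights   = c(1L, 1L, 3L),
  availability_365 = c(365L, 355L, 365L)
)

# Integer columns already pass is.numeric(); only the storage type differs
is.numeric(toy$price)   # TRUE
typeof(toy$price)       # "integer"

# Convert the three columns to double in a single step
toy_num <- toy %>%
  mutate(across(c(price, minimum_nights, availability_365), as.numeric))

typeof(toy_num$price)   # "double"
```

With the real data, the same `mutate(across(...))` call would be applied to `airbnb_test` in place of `toy`.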
# Do bar charts of neighbourhood_groups by room type
ggplot(data = airbnb_test) +
geom_bar(mapping = aes(x = neighbourhood_group, fill = room_type))+
ggtitle("Airbnb Neighbourhood Groups in NY City by Room Type")
We can see that Manhattan has the largest number of Airbnb listings, followed by Brooklyn, Queens, the Bronx and then Staten Island. In Manhattan the largest group of entries is for Entire home/apt, followed by Private room, and very few entries are for Shared room. In Brooklyn the entries for Entire home/apt are almost equal to those for Private room, with only a few entries for Shared room. In the Bronx and Queens there are more entries for Private room than for Entire home/apt, with Shared room being the least common. In Staten Island the entries for Entire home/apt appear to be about equal to those for Private room.
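The counts behind a stacked bar chart like this one can also be tabulated directly, which makes near-ties easier to judge than reading bar heights. The following is a minimal sketch on a hypothetical toy data frame (`toy` and `room_counts` are illustrative names, and the rows are made up); the real analysis would pass `airbnb_test` instead.

```r
library(dplyr)

# Hypothetical toy listings (assumption); the real input would be airbnb_test
toy <- data.frame(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan",
                          "Brooklyn", "Brooklyn", "Queens"),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room",
                "Private room", "Entire home/apt", "Private room")
)

# One row per neighbourhood group and room type, with the number of listings
room_counts <- toy %>%
  count(neighbourhood_group, room_type, name = "listings") %>%
  arrange(desc(listings))

room_counts
```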
# Plot price against availability and color the points by neighbourhood_group.
plot1 <- airbnb_test %>%
ggplot(aes(price, availability_365, color = neighbourhood_group))+
geom_point()+
ggtitle("Airbnb Price Scatter Plot in NY City by Room Availability")
plot1
It looks like price has many outliers beyond $1,250. We need to filter them out.
# Filter for price of $1,250 or less
price_1250 <- airbnb_test %>%
filter(price <= 1250)
view(price_1250)
Removing the outliers above $1,250 has reduced the data to 48,709 entries from the original 48,895 entries, meaning there were 186 entries priced above $1,250.
# Plot price against availability and color the points by neighbourhood_group.
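The number of rows dropped by a filter like this can be computed rather than read off the viewer. Below is a small sketch under stated assumptions: `toy`, `kept` and `n_removed` are hypothetical names, and the five prices are invented so the example runs without the CSV file.

```r
library(dplyr)

# Hypothetical toy prices (assumption) standing in for airbnb_test
toy <- data.frame(price = c(80, 150, 1250, 1300, 9999))

# Same condition as in the analysis: keep prices of $1,250 or less
kept <- toy %>%
  filter(price <= 1250)

# Rows removed are those priced above $1,250
n_removed <- nrow(toy) - nrow(kept)
n_removed               # 2 in this toy example

# Share of rows removed
n_removed / nrow(toy)   # 0.4 in this toy example

# The same subtraction on the counts reported in the text
48895 - 48709           # 186
```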
plot2 <- price_1250 %>%
ggplot(aes(price, availability_365, color = neighbourhood_group))+
geom_point()+
ggtitle("Airbnb Price of $1,250 or Less in NY City by Room Availability")
plot2
Most of the prices seem to be below $400.
# prepare data
bronx <- price_1250 %>%
filter( neighbourhood_group == "Bronx" ) %>%
arrange(neighbourhood_group)
view(bronx)
The bronx data frame has 1,090 entries.
ggplot(data = bronx) +
stat_count(mapping = aes(x = room_type))+
ggtitle("Bar Chart Airbnb Price Less than $1250 in Bronx by Room Type")
We can see that the most frequent room type is Private room, followed by Entire home/apt, and the least frequent is Shared room.
# Do a box plot of price by room type
boxpl <- bronx %>%
ggplot() +
geom_boxplot(aes(y=price, group=room_type,fill=room_type))+
ggtitle("Box Plot Airbnb Price Less than $1250 in the Bronx by Room Type")
boxpl
There appear to be too many outliers. Let us restrict the data to prices of $500 or less to get a better visualization.
# Filter for price of $500 or less
brprice_500 <- bronx %>%
filter(price <= 500)
view(brprice_500)
There are 1,084 entries priced at $500 or less in the Bronx, meaning we have removed just 6 entries priced above $500.
# Do a box plot of price by room type
boxpl <- brprice_500 %>%
ggplot() +
geom_boxplot(aes(y=price, group=room_type,fill=room_type))+
ggtitle("Box Plot Airbnb Price Less than $500 in the Bronx by Room Type")
boxpl
In the Bronx the median price for Entire home is about $100, for Private room it is about $55 and for Shared room it is about $40. We can see that the interquartile range (IQR) of the price of an Entire home/apt lies well above that of Private room or Shared room and, as expected, Private room has a higher price range than Shared room. However, the IQR of Private room overlaps with that of Shared room. This means that for the upper-end price of a Shared room one could get a Private room, probably in a different location. Let us explore whether the same price pattern exists in other neighbourhood groups.
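The medians and quartiles read off a box plot can also be computed as a table. This is a minimal sketch, not the analysis itself: `toy` and `price_summary` are hypothetical names and the prices are invented; with the real data, `brprice_500` would take the place of `toy`. It assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical toy prices (assumption); the real input would be brprice_500
toy <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"), each = 5),
  price = c(80, 90, 100, 120, 150,   # Entire home/apt
            40, 50, 55, 65, 80,      # Private room
            25, 35, 40, 55, 60)      # Shared room
)

# Lower quartile, median and upper quartile of price for each room type
price_summary <- toy %>%
  group_by(room_type) %>%
  summarise(
    q1     = unname(quantile(price, 0.25)),
    median = median(price),
    q3     = unname(quantile(price, 0.75)),
    .groups = "drop"
  )

price_summary
```

In this toy table the lower quartile of Private room (50) is below the upper quartile of Shared room (55), which is the kind of IQR overlap described above.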
# Filter for brooklyn rooms costing less than $500
brooklyn_500 <-airbnb_test %>%
filter( neighbourhood_group == "Brooklyn" & price<= 500 ) %>%
arrange(neighbourhood_group)
view(brooklyn_500)
The brooklyn_500 data has 19,875 entries.
# Do a box plot of Brooklyn prices less than $500
boxpl <- brooklyn_500 %>%
ggplot() +
geom_boxplot(aes(y=price, group=room_type,fill=room_type))+
ggtitle("Box Plot Airbnb Price Less than $500 in Brooklyn by Room Type")
boxpl
In Brooklyn, the median price for Entire home is about $145, for Private room it is about $70, and for Shared room it is about $40. The price hierarchy remains, with Entire home being the highest, followed by Private room, and Shared room being the lowest. There is no IQR overlap between the room types.
# Filter for Manhattan rooms costing less than $500
man_500 <-airbnb_test %>%
filter( neighbourhood_group == "Manhattan" & price<= 500 ) %>%
arrange(neighbourhood_group)
view(man_500)
The man_500 data frame has 20,888 entries.
# Do a box plot of Manhattan less than $500 data
boxpl <- man_500 %>%
ggplot() +
geom_boxplot(aes(y=price, group=room_type,fill=room_type))+
ggtitle("Box Plot Airbnb Price Less than $500 in Manhattan by Room Type")
boxpl
In Manhattan, the median price for Entire home is about $250, for Private room it is about $90, and for Shared room it is about $75. The hierarchy is maintained, with Entire home being most expensive followed by Private room and then Shared room. However, just as for the Bronx, in Manhattan the IQRs of Private room and Shared room overlap, meaning for the higher prices of a Shared room one could get a Private room. Let us try this out for Queens.
# Filter for Queens rooms costing less than $500
Que_500 <-airbnb_test %>%
filter( neighbourhood_group == "Queens" & price<= 500 ) %>%
arrange(neighbourhood_group)
view(Que_500)
The Queens data set has 5,637 entries.
# Do a box plot for Queens less than $500
boxpl <- Que_500 %>%
ggplot() +
geom_boxplot(aes(y=price, group=room_type,fill=room_type))+
ggtitle("Box plot Airbnb Price Less than $500 in Queens by Room Type")
boxpl
In Queens, the median price for Entire home is about $120, for Private room it is about $60, and for Shared room it is about $40. The pattern in Queens resembles that in Brooklyn. Entire home is the most expensive followed by Private room and then Shared room and there is no overlap between IQRs.
# Filter for Staten rooms costing less than $500
Staten_500 <-airbnb_test %>%
filter( neighbourhood_group == "Staten Island" & price<= 500 ) %>%
arrange(neighbourhood_group)
view(Staten_500)
There are 367 entries for Staten Island priced at $500 or less.
# Do a box plot for Staten less than $500
boxpl <- Staten_500 %>%
ggplot() +
geom_boxplot(aes(y=price, group=room_type,fill=room_type))+
ggtitle("Box plot Airbnb Price Less than $500 on Staten Island by Room Type")
boxpl
On Staten Island, the median price for Entire home is about $100, for Private room it is about $50, and for Shared room it is about $35. The hierarchy is maintained, with Entire home being most expensive followed by Private room and then Shared room. However, just as for the Bronx and Manhattan, the IQRs of Private room and Shared room overlap, meaning for the higher price of a Shared room one could get a Private room.
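The five per-borough box plots above could also be drawn as one faceted figure, which puts the neighbourhood groups side by side on a shared price axis. The sketch below is one possible way to do it, not the method used in this report: `toy`, `toy_500` and `facet_box` are hypothetical names, the prices are invented, and `room_type` is mapped to the x axis so each box is labelled.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (assumption); the real input would be airbnb_test
toy <- data.frame(
  neighbourhood_group = rep(c("Bronx", "Brooklyn"), each = 6),
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"), times = 4),
  price = c(100, 55, 40, 120, 60, 45,
            145, 70, 40, 160, 600, 35)
)

# Keep prices of $500 or less, as in the per-borough filters
toy_500 <- toy %>%
  filter(price <= 500)

# One panel per neighbourhood group, one box per room type
facet_box <- ggplot(toy_500, aes(x = room_type, y = price, fill = room_type)) +
  geom_boxplot() +
  facet_wrap(~ neighbourhood_group) +
  ggtitle("Airbnb Price of $500 or Less by Room Type and Neighbourhood Group")

facet_box
```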
This dataset tells us two things about Airbnb prices in New York City in 2019. First, Manhattan has the highest prices, with a median price of about $250 for Entire home, about $90 for Private room and about $75 for Shared room, while Staten Island seems to have the lowest prices, with a median price of about $100 for Entire home, about $50 for Private room and about $35 for Shared room. Second, while the price hierarchy is maintained, with Entire home being most expensive followed by Private room and then Shared room, there is a spatial difference in the relation between the price of Private room and Shared room. In the Bronx, Manhattan and Staten Island the IQRs of Private room and Shared room overlap, meaning that for the higher price of a Shared room one could get a Private room. However, this overlap is not seen in Brooklyn and Queens. We can therefore conclude that there is no uniform pattern of overlap in price range between Private room and Shared room among the Airbnb entries in this dataset.
Our conclusion seems to agree with the published literature on this topic, which reports that varying factors affect Airbnb prices. Voltes-Dorta and Sánchez-Medina (2020) studied Airbnb prices by room type in Bristol, England, and reported that there are spatial patterns in Airbnb prices and that these patterns differ by room type. Perez-Sanchez et al. (2018) studied factors affecting Airbnb prices in four Spanish coastal cities and concluded that the price increased with decreasing distance from tourist areas or from the coastline.
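The overlap judged by eye from the box plots could be checked numerically. The sketch below is an illustration under stated assumptions: `toy` and `iqr_overlap` are hypothetical names, the prices are invented, and "overlap" is defined here as the lower quartile of Private room being at or below the upper quartile of Shared room, which presumes Private room is the more expensive of the two.

```r
library(dplyr)

# Hypothetical toy prices (assumption); the real input would be airbnb_test
# filtered to prices of $500 or less
toy <- data.frame(
  neighbourhood_group = rep(c("Bronx", "Brooklyn"), each = 10),
  room_type = rep(rep(c("Private room", "Shared room"), each = 5), times = 2),
  price = c(40, 50, 55, 65, 80,   # Bronx, Private room
            25, 35, 40, 55, 60,   # Bronx, Shared room
            60, 65, 70, 80, 95,   # Brooklyn, Private room
            25, 30, 40, 45, 50)   # Brooklyn, Shared room
)

# Flag each neighbourhood group where the two IQRs overlap
iqr_overlap <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    private_q1 = unname(quantile(price[room_type == "Private room"], 0.25)),
    shared_q3  = unname(quantile(price[room_type == "Shared room"], 0.75)),
    .groups = "drop"
  ) %>%
  mutate(overlap = private_q1 <= shared_q3)

iqr_overlap
```

In this toy example the Bronx rows overlap and the Brooklyn rows do not; the real result would depend on the actual prices.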
Augusto Voltes-Dorta and Agustín Sánchez-Medina. 2020. Drivers of Airbnb prices according to property/room type, season and location: A regression approach. Journal of Hospitality and Tourism Management, 45, 266-275. https://doi.org/10.1016/j.jhtm.2020.08.015
V. Raul Perez-Sanchez, Leticia Serrano-Estrada, Pablo Marti and Raul-Tomas Mora-Garcia. 2018. The What, Where, and Why of Airbnb Price Determinants. Sustainability, 10(12), 4596. https://doi.org/10.3390/su10124596