PART A

Research Question

For this project, we will be using the AIRBNB_NYC dataset. Airbnb is an online marketplace that allows travelers to book spaces that are rented out by property owners. Travelers are able to use the online platform to search for spaces tailored to their needs.

This dataset includes thousands of 2019 Airbnb property listings in New York City. It compiles information such as the location of each listing (neighborhood and coordinates), the type of room, the price, information about reviews, and much more. This information allows us to make predictions and draw conclusions about the Airbnb environment in NYC.

In this case, we will use the AIRBNB_NYC dataset to observe the effect that the number of reviews has on the price of Airbnbs.

library(ggplot2)  # plotting
library(dplyr)    # %>%, group_by(), summarize()
Airbnb <- read.csv("Downloads/AIRBnB_NYC.csv")

Raw Response Variable

We were interested in investigating “Price” as a numerical response variable. The price of an Airbnb listing may be sensitive to many factors. Therefore, it will be useful to investigate whether different variables affect price (and if so, in what way). Price is especially important to Airbnb because it affects both property owners’ and travelers’ incentives to use Airbnb as a service.

We observed summary statistics of “Price” in the raw Airbnb dataset.

summary(Airbnb$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    69.0   106.0   152.7   175.0 10000.0
sd(Airbnb$price)
## [1] 240.1542

Plot

ggplot(Airbnb, aes(price)) + geom_histogram(bins=15) + labs(title = "Distribution of Price of Airbnbs in New York City")

From this plot, you can see that the variable “Price” is heavily right skewed: listings are concentrated between $0 and $2,500, and listings priced above $2,500 are very unusual. To better analyze price as a variable overall, we would like to disregard some of these more extreme and unusual Airbnb listings.
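One rough way to back up the visual impression of right skew is to compare the mean with the median and to look at the share of listings above a cutoff. The sketch below uses a small made-up price vector (`toy_price` is hypothetical, not taken from the dataset) purely to illustrate the check; the same calls could be applied to `Airbnb$price`.

```r
# Hypothetical prices standing in for Airbnb$price (invented values, not real data)
toy_price <- c(40, 55, 69, 80, 95, 106, 120, 150, 175, 300, 2500, 10000)

# In a right-skewed distribution the mean is pulled above the median
mean_price <- mean(toy_price)
median_price <- median(toy_price)
is_right_skewed <- mean_price > median_price

# Share of listings strictly above a chosen cutoff (here $2,500, as in the text)
share_above <- mean(toy_price > 2500)
```

A mean well above the median, together with a small share of very expensive listings, is consistent with the long right tail seen in the histogram.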

New Response Variable

In the price column of the Airbnb dataset, we observed that some listings are priced at $0, which is an unlikely value for renting either a room or a whole apartment. Therefore, we excluded all zeros from the price column. In addition, there are listings priced above $4,000, up to $10,000. These prices constitute about 0.05% of all data. We concluded that this price range is highly unlikely for an Airbnb listing and chose to exclude these listings as well by recoding them to NA.

#create new variable which will be the new cleaned version of price
Airbnb$CleanPrice<-Airbnb$price

#re-code all entries of CleanPrice where Price =0 or greater than 4000 to NA
Airbnb$CleanPrice[Airbnb$price==0]<-NA
Airbnb$CleanPrice[Airbnb$price>4000]<-NA
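
A quick way to confirm that the recoding behaves as described is to count how many entries became NA and what share of rows that represents. This is a minimal sketch on a hypothetical data frame `toy` (the values are invented for illustration); on the real data the same expressions would use `Airbnb`.

```r
# Hypothetical listings; prices chosen to hit each rule (invented values, not real data)
toy <- data.frame(price = c(0, 69, 106, 175, 4000, 4001, 10000))

# Same recoding as above: copy the column, then blank out implausible prices
toy$CleanPrice <- toy$price
toy$CleanPrice[toy$price == 0] <- NA
toy$CleanPrice[toy$price > 4000] <- NA

# Number and share of rows recoded to NA
n_removed <- sum(is.na(toy$CleanPrice))
share_removed <- mean(is.na(toy$CleanPrice))

# A price of exactly 4000 is kept because the rule is strictly "> 4000"
kept_4000 <- !is.na(toy$CleanPrice[toy$price == 4000])
```

Applying `mean(is.na(...))` to the real column is one way to check the reported share of excluded prices.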

Raw Explanatory Variable

We are interested in investigating whether a listing’s number of reviews affects the price of the listing. Thus, our main explanatory variable is number of reviews. Reviews are important for Airbnb and property owners because they are public customer feedback. Travelers take these reviews into consideration when deciding whether to book a listing.

Additionally, we can predict that having a high number of reviews may incentivize property owners to raise the price of their listing. Conversely, a high number of reviews may mean that the property was generally accessible and affordable to Airbnb users. We would like to investigate how sensitive the price of a listing is to the number of reviews.

First, we observed the summary statistics of the “Number of Reviews” variable in the dataset.

summary(Airbnb$number_of_reviews)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    5.00   23.27   24.00  629.00
sd(Airbnb$number_of_reviews)
## [1] 44.55058

Plot

ggplot(Airbnb, aes(number_of_reviews)) + geom_histogram(bins=15) + labs(x= "Number of reviews", title = "Distribution of Number of Reviews in Airbnbs in New York City")

From the plot, you can see that the distribution of the “Number of Reviews” variable is right skewed. Few listings have more than 100 reviews, so it is more common for a listing to have a small number of reviews. In particular, an unusually high number of listings have 0 reviews, possibly because some properties were delisted quickly.

New Explanatory Variable

Since we would like to investigate whether the number of reviews affects the price of a listing, we chose to disregard listings that had no reviews. A listing may have zero reviews for reasons unrelated to price, such as being delisted quickly. We decided to leave out the 0s completely because we do not want possible confounding variables to interfere with our analysis. We recognize that this could skew our results, but we believe that the impact of removing the 0s is smaller than the impact of leaving in misleading data points.

#create new variable which will be the new cleaned version of number of reviews
Airbnb$CleanNumReviews<-Airbnb$number_of_reviews

#re-code all entries of CleanNumReviews where number_of_reviews= 0 to NA
Airbnb$CleanNumReviews[Airbnb$number_of_reviews==0]<-NA
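
Before plotting, it can help to count how many rows have a missing value in either cleaned variable, since those rows cannot appear in a scatterplot of the two. The sketch below assumes a hypothetical data frame `toy` with invented values; `complete.cases()` is a base R function.

```r
# Hypothetical listings (invented values, not real data)
toy <- data.frame(
  price = c(0, 80, 120, 5000, 150, 200),
  number_of_reviews = c(3, 0, 10, 7, 0, 45)
)

# Apply both cleaning rules described in the text
toy$CleanPrice <- toy$price
toy$CleanPrice[toy$price == 0 | toy$price > 4000] <- NA
toy$CleanNumReviews <- toy$number_of_reviews
toy$CleanNumReviews[toy$number_of_reviews == 0] <- NA

# A row is usable only if neither cleaned variable is NA
usable <- complete.cases(toy[, c("CleanPrice", "CleanNumReviews")])
n_dropped <- sum(!usable)
n_usable <- sum(usable)
```

On the real data, `n_dropped` computed this way would be expected to line up with the number of rows ggplot2 reports as removed.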

Prediction

We expect that a higher number of reviews is associated with a lower price for Airbnbs in New York City, since travelers may prefer to stay in lower-priced listings. A large number of reviews suggests that many people have chosen to stay at the property, so it is likely rented out frequently. The owner therefore does not need to set a very high price and may even be able to lower it.

Plot

ggplot(Airbnb, aes(CleanNumReviews, CleanPrice)) +
  geom_point() +
  labs(x = "Number of reviews", y = "Price", title = "Effect of Number of Reviews on Price of Airbnbs")
## Warning: Removed 10077 rows containing missing values (geom_point).

The points in this plot roughly follow a decaying exponential shape. Most data points are concentrated in the lower left corner, with fewer data points lying toward the far end of either the x or the y axis.

Looking at data points with prices ranging from $1,500 to $4,000, all of them have fewer than 100 reviews, and most have fewer than 50. This suggests that guests leave fewer reviews on higher-priced Airbnbs, which might also reflect a lower occupancy rate for those listings. Looking at the data points with a higher number of reviews, we observed that almost all Airbnbs with between 200 and 629 reviews have a price below $500.

Following this pattern in the plot, we can say that Airbnbs with more reviews tend to have lower prices, while Airbnbs with higher prices tend to have fewer reviews, which matches our expectation.
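
The visual pattern could also be summarized with a rank correlation, which does not assume a straight-line relationship and is less sensitive to extreme prices than Pearson's correlation. The following is a sketch on invented values (`toy` is hypothetical); whether the real data give a negative coefficient is not something this example establishes.

```r
# Hypothetical cleaned variables (invented values, not real data)
toy <- data.frame(
  CleanNumReviews = c(1, 2, 5, 20, 50, 120, 300, NA),
  CleanPrice      = c(900, 1500, 400, 250, 180, 90, 60, 100)
)

# Spearman rank correlation; rows with an NA in either column are ignored
rho <- cor(toy$CleanNumReviews, toy$CleanPrice,
           method = "spearman", use = "complete.obs")
```

A value of `rho` close to -1 would indicate that listings with more reviews are consistently ranked lower in price; a value near 0 would indicate little monotonic association.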

PART B

Secondary explanatory variable

Our secondary explanatory variable is “neighbourhood_group”.

We chose neighborhood group as our second explanatory variable because we would like to see whether it is a major factor affecting the price of Airbnbs in New York City. New York City is huge and very diverse, not only racially but also economically. Listings in more popular and affluent neighborhoods may have higher prices than listings in neighborhoods that are less desired by travelers. We would like to investigate how sensitive price is to the neighborhood of the Airbnb listing. This sensitivity may also relate to reviews: we predict that areas that are not sought after or booked frequently will not have many reviews.

Summary statistics

Airbnb %>%
  group_by(neighbourhood_group) %>%
  arrange(desc(CleanPrice)) %>%
  summarize(CleanPrice, CleanNumReviews)
## `summarise()` has grouped output by 'neighbourhood_group'. You can override
## using the `.groups` argument.
## # A tibble: 48,895 × 3
## # Groups:   neighbourhood_group [5]
##    neighbourhood_group CleanPrice CleanNumReviews
##    <chr>                    <int>           <int>
##  1 Bronx                     2500              NA
##  2 Bronx                     1000              NA
##  3 Bronx                      800               1
##  4 Bronx                      680              NA
##  5 Bronx                      670               2
##  6 Bronx                      600              NA
##  7 Bronx                      600              NA
##  8 Bronx                      500              NA
##  9 Bronx                      500              19
## 10 Bronx                      475              NA
## # … with 48,885 more rows
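
The output above lists individual rows rather than one line per group. If per-group summary statistics are wanted, one possible approach is to collapse each group to a single row with `summarise()`. The sketch below uses a hypothetical data frame `toy` with invented values and assumes the dplyr package is installed; the column names `n_listings`, `median_price`, and `mean_reviews` are illustrative choices, not part of the original analysis.

```r
library(dplyr)

# Hypothetical listings (invented values, not real data)
toy <- data.frame(
  neighbourhood_group = c("Bronx", "Bronx", "Manhattan", "Manhattan", "Manhattan"),
  CleanPrice = c(60, 80, 150, NA, 250),
  CleanNumReviews = c(10, NA, 4, 30, 8)
)

# One row per neighbourhood group; NAs are skipped inside each statistic
group_stats <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    n_listings = n(),
    median_price = median(CleanPrice, na.rm = TRUE),
    mean_reviews = mean(CleanNumReviews, na.rm = TRUE),
    .groups = "drop"
  )
```

The same pipeline could be grouped by `room_type` instead to compare room types.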

Plot

ggplot(Airbnb, aes(CleanNumReviews, CleanPrice)) + geom_point(na.rm = TRUE) + labs(x = "Number of reviews", y = "Price", title = "Effect of Number of Reviews on Price of Airbnbs, by Neighbourhood Group") + facet_wrap(~neighbourhood_group)

In this plot, we split the data points by neighborhood group. The plot in Part A shows the relationship between number of reviews and price across New York City, while this plot shows how the data points are spread within each neighborhood group. By dividing the data points into five panels, we can better assess whether neighborhood group is a major factor affecting the price of Airbnbs.

From the five panels, we observed that, comparing data points in the same price range, Queens and Manhattan have a larger number of reviews, followed by Brooklyn, Staten Island, and the Bronx. In addition, comparing data points with the same number of reviews, Manhattan has the highest-priced Airbnbs, followed by Brooklyn, Queens, the Bronx, and Staten Island. Combining these two observations, we noticed that the Bronx and Staten Island have comparatively lower prices and also fewer reviews. Manhattan, on the other hand, has the highest prices and among the highest numbers of reviews.

All five panels display a similar relationship between number of reviews and price, following a similar exponential curve. Most Airbnbs lie in the lower left corner, where listings with a moderate number of reviews also tend to have a moderate price. This fits our expectation for most Airbnbs.

In addition, consistent with common knowledge, Manhattan displays a higher price range than any of the other neighbourhood groups, which suggests that neighbourhood group is a large factor affecting the price of Airbnbs in New York.

Improvements and modification to the plot

We changed the color, transparency, and size of the points to make the plot more viewer-friendly and easier to interpret.

Plot after Improvement

# constant alpha, shape, and size are set outside aes() so they are applied as fixed values rather than mapped as data
ggplot(Airbnb, aes(CleanNumReviews, CleanPrice)) + geom_point(aes(color = neighbourhood_group), alpha = 0.2, shape = ".", size = 0.05, na.rm = TRUE) + labs(x = "Number of reviews", y = "Price", title = "Effect of Number of Reviews on Price of Airbnbs, by Neighbourhood Group") + facet_wrap(~neighbourhood_group)

Third Explanatory Variable

Using room type allows for deeper analysis of the original question: price versus number of reviews. It allows us to look at one type of room at a time when making comparisons.

Summary statistics

Airbnb %>%
  group_by(room_type) %>%
  arrange(desc(CleanPrice)) %>%
  summarize(CleanPrice, CleanNumReviews)
## `summarise()` has grouped output by 'room_type'. You can override using the
## `.groups` argument.
## # A tibble: 48,895 × 3
## # Groups:   room_type [3]
##    room_type       CleanPrice CleanNumReviews
##    <chr>                <int>           <int>
##  1 Entire home/apt       4000              NA
##  2 Entire home/apt       4000              NA
##  3 Entire home/apt       4000              NA
##  4 Entire home/apt       3900               7
##  5 Entire home/apt       3800               2
##  6 Entire home/apt       3750              NA
##  7 Entire home/apt       3750              NA
##  8 Entire home/apt       3613               1
##  9 Entire home/apt       3600              NA
## 10 Entire home/apt       3518              NA
## # … with 48,885 more rows

Plot

ggplot(Airbnb, aes(CleanNumReviews, CleanPrice)) +
  geom_point(na.rm = TRUE) + labs(x = "Number of reviews", y = "Price", title = "Effect of Number of Reviews on Price of Airbnbs, by Room Type") + facet_wrap(~room_type)

This plot differs from the previous one because it is broken up by room type rather than neighbourhood group, so there are fewer facets.

Conclusion

In conclusion, Airbnbs listed at a lower price tend to have a higher number of reviews. This trend is seen in all the neighbourhood groups.