This dataset, which I found on Kaggle, contains information on Airbnb listings in New York City. It includes columns such as the host name, the host ID, the type of listing, and the price charged. I analyzed it with the help of dplyr and tidyr to select and filter the data.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
housing <- read.csv("https://raw.githubusercontent.com/AldataSci/Project2-Data607/main/AB_NYC_2019.csv",header=TRUE,sep=",")
## I omitted the NAs (any row with a missing value is dropped)
housing <- na.omit(housing)
head(housing)
## id name host_id host_name
## 1 2539 Clean & quiet apt home by the park 2787 John
## 2 2595 Skylit Midtown Castle 2845 Jennifer
## 4 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura
## 6 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris
## 7 5121 BlissArtsSpace! 7356 Garon
## neighbourhood_group neighbourhood latitude longitude room_type
## 1 Brooklyn Kensington 40.64749 -73.97237 Private room
## 2 Manhattan Midtown 40.75362 -73.98377 Entire home/apt
## 4 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt
## 5 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt
## 6 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt
## 7 Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room
## price minimum_nights number_of_reviews last_review reviews_per_month
## 1 149 1 9 2018-10-19 0.21
## 2 225 1 45 2019-05-21 0.38
## 4 89 1 270 2019-07-05 4.64
## 5 80 10 9 2018-11-19 0.10
## 6 200 3 74 2019-06-22 0.59
## 7 60 45 49 2017-10-05 0.40
## calculated_host_listings_count availability_365
## 1 6 365
## 2 2 355
## 4 1 194
## 5 1 0
## 6 1 129
## 7 1 0
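The row labels in the output above jump from 2 to 4, which suggests that `na.omit()` removed row 3. A minimal sketch of how to check what gets dropped, using a small made-up data frame (`toy` and its values are hypothetical, not taken from the Airbnb file):

```r
# Toy data frame standing in for the Airbnb data (values are invented).
toy <- data.frame(
  id = 1:4,
  price = c(149, 225, 150, 89),
  reviews_per_month = c(0.21, 0.38, NA, 4.64)
)

# Count missing values per column before dropping anything.
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit() removes every row that has an NA in any column.
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

Running the same two checks on `housing` before calling `na.omit()` would show which columns are responsible for the dropped rows and how many listings are lost.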
For my analysis I don't really need the IDs of the hosts or the listings, nor the latitude and longitude. I wanted to learn how much the various room types in New York City cost.
## I used dplyr to select the relevant columns and then graphed the results on a scatterplot to better understand
## what I am seeing
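house <- housing %>%
  select(name, host_name, neighbourhood, room_type, price)
## labs() takes x and y (not xlabs/ylabs); the labels follow the aesthetics,
## so they stay correct after coord_flip()
ggplot(house, aes(x = room_type, y = price)) +
  geom_point(col = "blue") +
  labs(title = "Scatterplot of Room Type and Price", x = "Type of Airbnb", y = "Price") +
  coord_flip()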
This looks interesting: some hosts charge as much for a private room as others do for an entire home or apartment. The data also shows that there are only three kinds of Airbnb listings in New York City: shared rooms, private rooms, and entire homes/apartments. Still, seeing someone charge $10,000 for a private room is crazy to me.
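A scatterplot makes the extreme prices stand out, but it does not show what a typical listing costs. One way to put numbers on the observation is to summarise the price by room type. The sketch below uses an invented `toy` table, not the real listings, and the column names `median_price` and `max_price` are my own choice:

```r
library(dplyr)

# Hypothetical listings; the prices are invented for illustration.
toy <- tibble(
  room_type = c("Private room", "Private room", "Private room",
                "Entire home/apt", "Entire home/apt", "Shared room"),
  price = c(60, 80, 10000, 150, 225, 40)
)

# One row per room type: listing count, typical price, and most extreme price.
price_summary <- toy %>%
  group_by(room_type) %>%
  summarise(
    n = n(),
    median_price = median(price),
    max_price = max(price),
    .groups = "drop"
  )
print(price_summary)
```

In this toy table a single 10,000 dollar listing sets the maximum for private rooms while the median stays at 80, so comparing the two columns shows whether a high price is typical or an outlier.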
Is there a relationship between borough and price? I used dplyr to select the relevant columns (neighbourhood group, price, and room type) and then visualized the data with a bar graph to better understand the relationship.
Nbhd <- housing %>%
  select(neighbourhood_group, price, room_type)
ggplot(Nbhd, aes(x = room_type, y = price, fill = neighbourhood_group)) +
  geom_bar(stat = "identity", position = position_dodge(0.9)) +
  labs(y = "Price", x = "Types of Airbnbs in NYC")
From this bar graph I can see that the most expensive Airbnbs are located in Brooklyn and Manhattan, with prices topping out at 10,000 dollars. This may be because Brooklyn and Manhattan hold most of NYC's tourist attractions and hence are the most expensive. On the other hand, the other boroughs appear to be less popular and hence cheaper than Brooklyn and Manhattan, which would make sense if there is less to attract tourists in those boroughs.
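As far as I can tell, `geom_bar(stat = "identity")` with `position_dodge()` draws one bar per listing and overlays the bars that share a borough and room type, so the visible height is the largest price in each group rather than a typical one. A hedged alternative is to summarise first and plot one bar per group with `geom_col()`. The data below is invented, and `by_group` and `median_price` are names I made up for the sketch:

```r
library(dplyr)
library(ggplot2)

# Hypothetical listings; values are invented for illustration.
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan", "Manhattan",
                          "Queens", "Queens", "Queens"),
  room_type = c("Private room", "Private room", "Private room", "Entire home/apt",
                "Private room", "Private room", "Entire home/apt"),
  price = c(90, 120, 10000, 225, 50, 70, 120)
)

# Collapse to one row per borough and room type before plotting.
by_group <- toy %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(median_price = median(price), max_price = max(price), .groups = "drop")
print(by_group)

# geom_col() plots the values as given, one bar per summarised row.
p <- ggplot(by_group, aes(x = room_type, y = median_price, fill = neighbourhood_group)) +
  geom_col(position = position_dodge(0.9)) +
  labs(x = "Types of Airbnbs in NYC", y = "Median price", fill = "Borough")
```

Swapping `median_price` for `max_price` in the `aes()` call would plot the group maximum, which is the quantity the discussion of 10,000 dollar listings refers to.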
Finally, I wanted to compare the various Airbnb types by their average number of reviews and see what would happen.
review <- housing %>%
  select(name, neighbourhood_group, room_type, price, number_of_reviews) %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(avg_review = mean(number_of_reviews))
## `summarise()` has grouped output by 'neighbourhood_group'. You can override using the `.groups` argument.
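## The y axis shows the average number of reviews, not the price
ggplot(review, aes(x = neighbourhood_group, y = avg_review, fill = room_type)) +
  geom_bar(stat = "identity", position = position_dodge(0.9)) +
  labs(x = "Borough", y = "Average number of reviews")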
It's interesting to find that listings in the Bronx and Staten Island had a higher average number of reviews than those in Manhattan and Brooklyn, even though Manhattan and Brooklyn are popular places to rent an Airbnb. One possibility is that the reviews were mostly negative, since these two boroughs aren't popular places to rent one, but the data only counts reviews and says nothing about their tone, so this is just a guess.
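An average per listing can also be pulled around by how many listings each borough has. A small sketch, again on invented numbers, that reports the listing count and the total next to the mean (`listings`, `avg_reviews`, and `total_reviews` are hypothetical column names):

```r
library(dplyr)

# Hypothetical review counts; values are invented for illustration.
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan", "Manhattan",
                          "Staten Island", "Staten Island"),
  number_of_reviews = c(2, 5, 10, 3, 40, 20)
)

# Report how many listings sit behind each average.
review_summary <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    listings = n(),
    avg_reviews = mean(number_of_reviews),
    total_reviews = sum(number_of_reviews),
    .groups = "drop"
  )
print(review_summary)
```

If the real data looked like this toy table, a borough with few listings could show a high average simply because a handful of heavily reviewed listings dominate it, so it seems worth reading `avg_reviews` together with `listings`.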