The dataset I selected is a list of Airbnb listings in New York City in 2019. I chose it because, around the time I was searching for a dataset, my cousins and I were looking for Airbnbs to stay in, since they had booked flights to visit me in Maryland this upcoming weekend. We went through several sites and the Airbnbs were unbelievably expensive (possibly because of how we filtered our options). It took us roughly 5 days to find a solid Airbnb that looked appealing and wasn’t too expensive. Once I stumbled upon this dataset, I figured that if Maryland Airbnbs were expensive, then New York’s Airbnb prices must surely be through the roof. During my Thanksgiving break, I had also visited New York and stayed in a hotel in Manhattan. During my stay, I was able to visit several areas and get a sense of the environment in each one. With that said, I felt I had decent background knowledge to gauge the conditions of Airbnbs in New York. I found the data on Kaggle, where it was posted by DGOMONOV; the primary source is Inside Airbnb, a site that publishes data drawn from Airbnb listings and lets people explore listings by location. Visit this website to find more information: http://insideairbnb.com/
The dataset has 48,895 observations and 16 variables (10 numeric, 6 character). The variables provide information about each Airbnb's location, price, id, neighborhood area/region, reviews, and room type. Based on my background knowledge, I wanted to explore which neighborhood group has the highest prices. Once I have those results, I will select one neighborhood group and find the neighborhood in that group with the highest prices. Then, within the neighborhood I select, I will examine which room type has the highest prices. To answer those questions, the variables I will use in my data visualizations and analysis are neighbourhood_group, neighbourhood, latitude, longitude, room_type, and price.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.2
library(ggplot2)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(viridisLite)
library(RColorBrewer)
setwd("C:/Users/danyd/OneDrive/Desktop/data 110/week13hw")
airbnbs <- read_csv("airbnbnyc.csv")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): name, host_name, neighbourhood_group, neighbourhood, room_type, la...
## dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_of...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To clean the data, I first checked whether the dataset contained any NAs, since missing values can complicate the visualizations and introduce errors into the analysis. The check showed that 10,074 rows contained at least one NA, almost all of them in last_review and reviews_per_month. Using !is.na, I removed those rows and renamed the variable “neighbourhood_group” to “boroughs”. The areas in that column (Queens, Bronx, Brooklyn, Manhattan, and Staten Island) are the 5 boroughs of New York City, so “neighbourhood group” was not an accurate description. Once that was done, I proceeded to create my first visualization.
has_na <- any(is.na(airbnbs))
print(has_na) #If this code prints TRUE, there are NAs in this dataset. If it prints FALSE there are none.
## [1] TRUE
colSums(is.na(airbnbs))
## id name
## 0 16
## host_id host_name
## 0 21
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 10052 10052
## calculated_host_listings_count availability_365
## 0 0
airbnbsnyc <- airbnbs |>
filter(!is.na(name),
!is.na(host_name),
!is.na(last_review),
!is.na(reviews_per_month)) |>
rename(boroughs = neighbourhood_group)
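The column totals above count missing values, not affected rows, so the two numbers can differ when a row is missing more than one field. A minimal sketch of the difference, using a small made-up data frame (`toy` and its values are hypothetical and not taken from the Airbnb data):

```r
# Hypothetical toy data standing in for the listings table
toy <- data.frame(
  name = c("Loft", NA, "Studio", "Room"),
  last_review = c("2019-05-01", "2019-06-10", NA, NA),
  reviews_per_month = c(1.2, 0.4, NA, NA)
)

# Missing values per column (same idea as colSums(is.na(airbnbs)))
na_per_column <- colSums(is.na(toy))

# Rows that contain at least one NA in any column
rows_with_na <- sum(!complete.cases(toy))

# Rows that remain after dropping every incomplete row
rows_kept <- nrow(toy[complete.cases(toy), ])

print(na_per_column)
print(rows_with_na)
print(rows_kept)
```

Comparing `sum(na_per_column)` with `rows_with_na` shows how many missing values fall in rows that are already incomplete for another reason.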
ggplot(airbnbsnyc, aes(boroughs, price, fill = boroughs)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
labs(title = "Total Prices Per Borough",
x = "Boroughs",
y = "Price",
fill = "Boroughs",
caption = "Source: Inside Airbnb: http://insideairbnb.com/") +
scale_fill_viridis_d()+
theme_bw()
The bar graph above plots Airbnb prices for each borough in New York City. Because every listing is drawn as its own bar at the same position (stat = "identity" with position = "dodge"), the bars within a borough overlap, so the visible height of each bar is the price of the most expensive listing in that borough rather than a sum. Read that way, Staten Island has the lowest top price, while Brooklyn, Queens, and Manhattan appear tied for the highest. Next I will look at which of my selected neighborhoods in Manhattan has the highest prices.
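A per-borough summary is one way to compare boroughs without a few extreme listings dominating the picture. The sketch below is illustrative only: the `toy` data frame and its prices are made up, and the choice of the median as the comparison statistic is an assumption, not something the original analysis used.

```r
library(dplyr)

# Hypothetical toy listings; boroughs and prices are made up
toy <- data.frame(
  boroughs = c("Manhattan", "Manhattan", "Manhattan",
               "Brooklyn", "Brooklyn", "Queens"),
  price = c(150, 200, 10000, 90, 120, 80)
)

# One row per borough: listing count, median price, and highest price
borough_summary <- toy |>
  group_by(boroughs) |>
  summarise(
    listings = n(),
    median_price = median(price),
    max_price = max(price)
  ) |>
  arrange(desc(median_price))

print(borough_summary)
```

Plotting `median_price` with geom_col() would then give one bar per borough whose height is a typical price rather than the single largest one.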
Earlier I mentioned that during my visit to New York, I stayed in a hotel in Manhattan, which is why I am focusing on the Manhattan borough. I also wanted to look into the neighborhoods close to the hotel, since it was a good area and relatively close to Times Square. (The hotel was the Hampton Inn Manhattan Grand Central at 231 E 43rd Street; I mention this only as evidence of location proximity.) According to a Google search, the neighborhoods closest to that area are Midtown, Murray Hill, and Kips Bay, and I wanted to gauge which of the three has the highest prices. After filtering the data for those three neighborhoods, I drew a random sample of 312 observations. I chose 312 because it is the appropriate sample size for a population of 1,664 observations at a 95% confidence level with a 5% margin of error. Once those steps were completed, I proceeded to create my next visualization.
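The figure of 312 can be reproduced with a standard sample-size formula for a proportion with a finite population correction. This is a sketch under the assumption that this formula, with the most conservative proportion p = 0.5, is the one behind the number; the function name `sample_size` is hypothetical.

```r
# Sample size for estimating a proportion, with finite population correction.
# N: population size, conf_level: confidence level, margin: margin of error,
# p: assumed proportion (0.5 is the most conservative choice).
sample_size <- function(N, conf_level = 0.95, margin = 0.05, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)      # two-sided critical value
  n0 <- z^2 * p * (1 - p) / margin^2        # size for an infinite population
  n0 / (1 + (n0 - 1) / N)                   # finite population correction
}

n <- sample_size(1664)
print(round(n))
```

Rounding to the nearest whole number matches the 312 used here; rounding up instead, which some calculators do, would give one more observation.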
manhattanbnbs <- airbnbsnyc |>
  filter(boroughs %in% c("Manhattan"),
         neighbourhood %in% c("Kips Bay", "Midtown", "Murray Hill"))
# Draw a random sample of 312 observations from the filtered data
random_sample <- manhattanbnbs[sample(nrow(manhattanbnbs), 312), ]
ggplot(random_sample, aes(price, neighbourhood, fill = neighbourhood)) +
  geom_boxplot(width = 1) + # Box width along the neighborhood axis
  labs(title = "Airbnb Prices in Manhattan Neighborhoods",
       x = "Price (USD)",
       y = "Neighborhood",
       fill = "Neighborhoods",
       caption = "Source: Inside Airbnb: http://insideairbnb.com/") +
  scale_fill_brewer(palette = "Set1") + # RColorBrewer palette
  theme_bw() +
  # Theme tweaks go after theme_bw(), which would otherwise reset them
  theme(axis.text.y = element_text(angle = 0, hjust = 1),
        legend.position = "top")
I created a boxplot for each of the selected neighborhoods in the Manhattan borough. Based on the visualization above, it is difficult to say which neighborhood has the highest prices. It is tempting to say Midtown, but Midtown also makes up a larger share of the random sample than Kips Bay or Murray Hill. I also noticed an outlier in the Midtown boxplot: one listing had a price of $5,100, which is extremely high compared to the other Airbnbs, none of which exceeded $1,500. That price could be accurate, considering the listing is in an area close to Times Square. Nonetheless, judging by eye, I would say Midtown has the highest prices of the three neighborhoods.
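The way a boxplot flags a point like the $5,100 listing can be sketched in base R. The prices below are made up for illustration (only the 5,100 value echoes the listing mentioned above), and `toy_prices` is a hypothetical name.

```r
# Hypothetical nightly prices; only the 5100 value mirrors the text above
toy_prices <- c(95, 120, 150, 175, 200, 230, 260, 300, 5100)

# boxplot.stats() applies the usual rule: points beyond 1.5 times the
# box length (hinge to hinge) from either hinge are reported as outliers
outliers <- boxplot.stats(toy_prices)$out

# The median barely reacts to the extreme value, while the mean is pulled up
typical_price <- median(toy_prices)
average_price <- mean(toy_prices)

print(outliers)
print(typical_price)
print(average_price)
```

Comparing medians rather than means is one way to keep a single extreme listing from deciding which neighborhood looks most expensive.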
Now that we have compared the three neighborhoods, let’s restrict the data to the Airbnbs in one specific neighborhood, Midtown. Once that is done, we will look at which room types have the highest prices in Midtown.
# Keep only the Midtown listings (group_by() would keep all three neighborhoods)
midtownbnbs <- manhattanbnbs |>
  filter(neighbourhood == "Midtown")
highchart() %>%
hc_chart(type = "bar", inverted = TRUE) %>%
hc_title(text = "Prices for Room Types in Midtown") %>%
hc_xAxis(categories = midtownbnbs$room_type, title = list(text = "Room Type")) %>%
hc_yAxis(title = list(text = "Prices")) %>%
hc_add_series(
data = midtownbnbs,
hcaes(x = room_type, y=price, group = room_type),
type = "bar"
) %>%
hc_legend(
title = list(text = "Room Types"),
layout = "vertical",
align = "right",
verticalAlign = "top"
) %>%
hc_caption(text = "Source: Inside Airbnb: http://insideairbnb.com/")
The visualization above is a highcharter bar graph of prices by room type. Based on the results, it is clear that entire homes have the highest prices. This makes sense, because an entire home offers more amenities, freedom, and space than a private room or shared room, which justifies the higher price.
airbnblocations <- midtownbnbs %>%
leaflet() %>%
addTiles() %>%
addMarkers(
lat = ~latitude,
lng = ~longitude,
popup = ~paste("Name: ", name, "<br>",
"Room Type: ", room_type, "<br>",
"Minimum Nights: ", minimum_nights, "<br>",
"Price: $", price, "<br>",
"Host Name: ", host_name, "<br>",
"Availability: ", availability_365, "<br>",
"Latitude: ", latitude, "<br>",
"Longitude: ", longitude, "<br>")
)
airbnblocations
The map above displays the locations of the Airbnbs in Midtown. The markers are very close to one another because the listings share not only the same country, state, and borough, but also the same neighborhood. Feel free to zoom in and tap on a location to see the details of each Airbnb.
To be honest, before I started this assignment, I did not really know what boroughs were. I just knew I had to change the name of the variable. The following website is where I gathered information on the five boroughs of New York City: https://www.nyc.gov/nyc-resources/about-the-city-of-new-york.page#:~:text=New%20York%20is%20composed%20of,dynamic%20city%20in%20the%20world. I was able to answer my questions, although I am not entirely sure how accurate the results are. I did the best I could with the data I was working with. One aspect of my project that annoyed me a bit was that my boxplots were pretty narrow. I tried to increase the width, but it stayed the same. Overall, I enjoyed making these data visualizations, and I was able to draw many connections to the Airbnbs I searched for in Maryland, where quality, space, and amenities also increased the price of each Airbnb.