The data set I will be working with for this project is airbnb_ny19, which covers Airbnb listings in New York City during 2019. The data is provided by Airbnb. I chose this data set because it was the first one I found that contains variables for longitude and latitude. I also have family who live in New Jersey, which is pretty much New York. To clean the data, I will remove unnecessary variables and NA values. For this project, I will focus on the following variables: neighbourhood_group, latitude, longitude, room_type, price, and reviews_per_month. The variables neighbourhood_group and room_type are categorical, while latitude, longitude, price, and reviews_per_month are quantitative. Using these variables, I will work to better understand the relationship between the price, the number of reviews per month, and the room type of an Airbnb listing.
Load the necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
library(ggplot2)
library(plotly)
Warning: package 'plotly' was built under R version 4.5.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): name, host_name, neighbourhood_group, neighbourhood, room_type, la...
dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_of...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning the data set
# I will be removing the following variables as they have no use in my research.
clean_data <- data |>
  select(-last_review, -id, -name, -host_id, -calculated_host_listings_count, -host_name) |>
  # I will now filter out all NA values in the reviews_per_month variable
  filter(!is.na(reviews_per_month))

# Check to see if cleaning worked.
sum(is.na(clean_data))
[1] 0
The value returned 0, meaning the cleaning worked.
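A single total hides which columns the NA values came from. As a minimal sketch, assuming a small made-up data frame named `toy` (not the real Airbnb data), the count can be broken down per column with base R:

```r
# Hypothetical toy data standing in for the Airbnb table
toy <- data.frame(
  price = c(100, 150, NA),
  reviews_per_month = c(0.5, NA, 2.1),
  room_type = c("Private room", "Shared room", "Entire home/apt")
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# The grand total matches what sum(is.na(toy)) would report
total_na <- sum(na_per_column)
```

Running the same `colSums(is.na(...))` call on the data before and after cleaning would show which variables were responsible for the missing values.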
Now I will use simple plots to determine my focus for the final visualization. Let’s start with the neighbourhood groups.
ggplot(clean_data, aes(x = neighbourhood_group)) + # Analyze the neighbourhood group from the clean data set
  geom_bar() + # Make a bar chart
  theme_minimal() # Remove the grey background
Most of the data comes from Brooklyn and Manhattan. Because Manhattan has slightly more listings, I will use Manhattan in the final visualization.
Now to check the room type.
ggplot(clean_data, aes(x = room_type)) + # Evaluate the room type in the clean data set
  geom_bar() + # Make a bar chart
  theme_minimal() # Remove the grey background
Most data comes from entire home/apt or private room. For the final visualization, I will analyze all three types of rooms.
Now to check the price variable for any potential outliers.
ggplot(clean_data, aes(x = price)) + # Evaluate the price
  geom_histogram(bins = 20) + # Make a histogram with 20 bins
  theme_minimal() # Remove the grey background
Since typical nightly prices are not in the thousands, there appears to be an outlier at the upper end. We will examine that more closely.
summary(clean_data$price) # Summarize the price data and display the statistical analysis
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 69.0 101.0 142.3 170.0 10000.0
The mean is greater than the median, meaning there are one or more large values skewing the data to the right.
We will calculate the upper limit using the 1.5 × IQR rule: 170 + 1.5 * (170 - 69) = 321.5. The upper limit is $321.50, so in the final visualization we will remove any prices greater than 321.5. This will remove the high-price outliers from the final visualization.
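The upper limit can also be computed in code rather than by hand. The sketch below is only an illustration: `upper_fence()` is a hypothetical helper name, and `toy_prices` is made-up data. It first repeats the arithmetic from the summary output, then applies the same 1.5 × IQR rule to a vector.

```r
# Upper fence from the quartiles printed by summary(): Q1 = 69, Q3 = 170
summary_fence <- 170 + 1.5 * (170 - 69)
summary_fence

# Hypothetical helper: the same rule applied directly to a numeric vector
upper_fence <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  q[2] + 1.5 * (q[2] - q[1])
}

# Made-up prices, just to show the call
toy_prices <- c(1, 2, 3, 4, 5)
toy_fence <- upper_fence(toy_prices)
```

Because `summary()` rounds what it prints, calling such a helper on `clean_data$price` might give a value slightly different from the hand calculation if the true quartiles are not whole numbers.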
Now we will check the specific neighbourhoods inside the groups.
ggplot(clean_data, aes(x = neighbourhood)) + # Analyze the neighbourhoods inside the groups
  geom_bar() + # Make a bar chart
  theme_minimal() # Remove the grey background
There are too many neighbourhoods to read on a single axis, so I will use the count function instead of a graph.
clean_data |>
  count(neighbourhood) # Get the number of times a neighbourhood appears in the clean data set
# A tibble: 218 × 2
neighbourhood n
<chr> <int>
1 Allerton 37
2 Arden Heights 4
3 Arrochar 20
4 Arverne 66
5 Astoria 709
6 Bath Beach 15
7 Battery Park City 36
8 Bay Ridge 115
9 Bay Terrace 5
10 Bay Terrace, Staten Island 2
# ℹ 208 more rows
Scrolling through the counts, I can see that certain neighbourhoods, such as Harlem or Greenpoint, have a higher count than others. I will choose Midtown in Manhattan because it has a unique name and meets the criterion of fewer than 800 observations.
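Instead of scrolling, the counts can be sorted and filtered in code. This is a minimal sketch on made-up data: the `toy` tibble and the threshold of 4 are assumptions for illustration, and with the real data the threshold would be 800.

```r
library(dplyr)

# Hypothetical toy data: neighbourhood "A" appears 5 times, "B" 3 times, "C" once
toy <- tibble(neighbourhood = c(rep("A", 5), rep("B", 3), rep("C", 1)))

# sort = TRUE orders the result from the largest count to the smallest
ranked <- toy |>
  count(neighbourhood, sort = TRUE)

# Keep only neighbourhoods under the chosen size limit (4 here, 800 in the project)
candidates <- ranked |>
  filter(n < 4)
```

Sorting puts the largest neighbourhoods at the top of the printed tibble, so the candidates just under the limit are easy to spot.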
Now I will analyze the reviews per month.
ggplot(clean_data, aes(x = reviews_per_month)) + # Evaluate the reviews per month
  geom_histogram() + # Create a histogram
  theme_minimal() # Remove the grey background
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Most Airbnbs get fewer than about 8 reviews per month; however, some receive around 15 a month. This is worth noting in the final visualization.
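The `stat_bin()` message printed with the histogram suggests choosing the bin width explicitly. A minimal sketch follows, assuming made-up values in `toy` and a bin width of 0.5 reviews per month. That width is an arbitrary choice for illustration, not a value taken from the analysis.

```r
library(ggplot2)

# Hypothetical toy values standing in for reviews_per_month
toy <- data.frame(reviews_per_month = c(0.2, 0.5, 1.1, 1.4, 2.8, 3.0, 7.5, 15.0))

p <- ggplot(toy, aes(x = reviews_per_month)) +
  # binwidth sets how wide each bar is; boundary = 0 makes the bins start at zero
  geom_histogram(binwidth = 0.5, boundary = 0) +
  theme_minimal()

p
```

Setting `binwidth` silences the message and makes each bar cover a readable range, such as 0 to 0.5 reviews per month.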
Now we will filter the data for the final visualization.
sorted_data <- clean_data |> # Make a new data set
  filter(neighbourhood_group == "Manhattan") |> # Filter for only neighbourhoods in Manhattan
  filter(price < 321.5) |> # Filter for prices less than the upper limit to drop the high-price outliers
  filter(neighbourhood == "Midtown") # Filter for Midtown only

head(sorted_data) # View the beginning of the new data set
I now have a filtered data set with 749 observations. I will use that to make the final visualization.
ggplot(sorted_data, aes(x = reviews_per_month, y = price, color = room_type)) + # Create a graph and set the x, y, and third variable (colour)
  geom_smooth(se = FALSE) + # Draw a smoothed line and hide the grey confidence band around it
  theme_minimal() + # Remove the grey panel background
  scale_color_brewer(palette = "Accent") + # Set the line colours using a palette
  labs(
    color = "Room Type",
    title = "Price vs. Reviews per Month Across Airbnb Listings in Midtown New York",
    caption = "Source: Airbnb",
    x = "Number of Reviews per Month",
    y = "US Dollar ($) Price per Night",
    subtitle = "749 Total Observations"
  ) # Label every part of the graph
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplotly() # Add interactivity to the graph
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Essay on the visualization
This visualization analyzes the relationship between the price per night of an Airbnb, its reviews per month, and its room type. The price in US dollars per night does not strike me as surprising. It makes complete sense that paying for an entire home or apartment would be more expensive than paying for a single room. It is also reasonable that a private room would cost more than a shared room. The one fact I was surprised to learn from this visualization was that the price for shared rooms increased around the 7-8 mark of reviews per month. My hypothesis was that shared rooms would never experience any increase in price, but would instead show a decline throughout. One thing I did not get to do was include geom_point in this graph. While I think the graph looks better without it, I thought it would be interesting to include nonetheless. I could not include it because it created too much noise in the graph and did not add much overall.
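One common way to keep the points without the noise is to draw them small and mostly transparent underneath the smoothed lines. The sketch below uses made-up data in `toy`. The `alpha` and `size` values are guesses for illustration, and whether this would look good on the 749 Midtown listings is untested.

```r
library(ggplot2)

# Hypothetical toy data with the same three columns used in the final plot
toy <- data.frame(
  reviews_per_month = c(0.5, 1, 2, 3, 4, 5, 6, 7, 0.8, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5),
  price = c(220, 210, 200, 190, 185, 180, 170, 165, 120, 115, 110, 100, 95, 90, 92, 88),
  room_type = rep(c("Entire home/apt", "Private room"), each = 8)
)

p <- ggplot(toy, aes(x = reviews_per_month, y = price, color = room_type)) +
  geom_point(alpha = 0.2, size = 1) + # Faint points, drawn first so the lines sit on top
  geom_smooth(se = FALSE, method = "loess", formula = y ~ x) + # Same smoother type as the final plot
  theme_minimal()
```

Layers are drawn in the order they are added, so placing `geom_point()` before `geom_smooth()` keeps the trend lines readable.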
Now I will make a map visualization
# Build the HTML text shown when a circle is clicked
popper <- paste0(
  "<b>US Dollar ($) Price per Night: </b>", sorted_data$price, "<br>",
  "<b>Number of Reviews: </b>", sorted_data$number_of_reviews, "<br>",
  "<b>Number of Reviews per month: </b>", sorted_data$reviews_per_month, "<br>",
  "<b>Room Type: </b>", sorted_data$room_type, "<br>"
)

leaflet() |>
  setView(lng = -73.9840, lat = 40.7549, zoom = 14) |> # Center the map on Midtown
  addProviderTiles("Esri.WorldStreetMap") |> # Set the base map
  addCircles(
    data = sorted_data,
    radius = sorted_data$reviews_per_month, # Circle size follows reviews per month
    color = "maroon",
    opacity = 0.5,
    popup = popper
  )
Assuming "longitude" and "latitude" are longitude and latitude, respectively
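The message above appears because leaflet had to guess which columns hold the coordinates. As a sketch, assuming a made-up `toy` data frame with the same column names, the columns can be named explicitly with the formula (`~`) interface. The radius is also multiplied by 10 here, an arbitrary factor chosen only because `addCircles()` measures radius in meters and unscaled values would give very small circles.

```r
library(leaflet)

# Hypothetical toy listings near Midtown
toy <- data.frame(
  longitude = c(-73.9840, -73.9855),
  latitude = c(40.7549, 40.7580),
  reviews_per_month = c(1.2, 7.5)
)

m <- leaflet(data = toy) |>
  setView(lng = -73.9840, lat = 40.7549, zoom = 14) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    lng = ~longitude, # Name the coordinate columns so leaflet does not have to guess
    lat = ~latitude,
    radius = ~reviews_per_month * 10, # Radius in meters, scaled by an arbitrary factor
    color = "maroon",
    opacity = 0.5
  )
```

With `lng` and `lat` given explicitly, the "Assuming ..." message should no longer be printed.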
Essay on the map visualization
The map visualization shows the reviews per month of different Airbnb locations across Midtown. One interesting pattern that emerges from the map is the grouping of the Airbnbs. There appear to be three specific hubs that the Airbnbs are centered around. While I do not know the exact reason for this, I believe it comes down to location. These areas probably hold a lot of apartments and housing, which allows more Airbnbs to be offered there. In addition, the areas are probably very close to famous landmarks and tourist attractions, which makes them desirable for those visiting New York. One surprise from the map is that there is an Airbnb listing in the middle of Bryant Park. I was able to include everything I wanted inside the map visualization.