Project 2

Author

Oliver Kronen

Introduction

The data set I will be working with on this project is the airbnb_ny19 data set, which covers Airbnb listings in New York during 2019. The data is provided by Airbnb. I chose this data set because it was the first one I found that contained variables for longitude and latitude. I also have family who live in New Jersey, which is pretty much New York. To clean the data, I will remove unnecessary variables and NA values. For this project, I will focus on the following variables: neighbourhood_group, latitude, longitude, room_type, price, and reviews_per_month. The variables neighbourhood_group and room_type are categorical, while latitude, longitude, price, and reviews_per_month are quantitative. Using these variables, I will work to better understand the relationship between the price, the number of reviews per month, and the room type of an Airbnb listing.

Load the necessary libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyr)
library(leaflet)

Warning: package 'leaflet' was built under R version 4.5.3

library(ggplot2)
library(plotly)

Warning: package 'plotly' was built under R version 4.5.3


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Set the working directory and load the data

setwd("C:/Users/MyPC/Downloads/Data 110")
data <- read_csv("airbnb_ny19.csv")

Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): name, host_name, neighbourhood_group, neighbourhood, room_type, la...
dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_of...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the data set

# Remove the variables that have no use in my research.
clean_data <- data |>
  select(-last_review, -id, -name, -host_id, -calculated_host_listings_count, -host_name) |>
  # Filter out all rows with NA values in the reviews_per_month variable.
  filter(!is.na(reviews_per_month))
# Check that the cleaning worked by counting the remaining NA values.
sum(is.na(clean_data))

[1] 0

The check returned 0, meaning no NA values remain and the cleaning worked.
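A per-column count can show where the missing values sit before any rows are filtered out. The sketch below uses only base R and a small made-up data frame (toy_data is an assumption for illustration, not the Airbnb file).

```r
# Toy stand-in for the raw data; these values are made up for illustration
toy_data <- data.frame(
  price = c(100, 80, 150),
  reviews_per_month = c(0.5, NA, 2.0)
)

# Count the NA values in each column to see where they occur
na_counts <- colSums(is.na(toy_data))
print(na_counts)

# Keep only the rows where reviews_per_month is present
toy_clean <- toy_data[!is.na(toy_data$reviews_per_month), ]

# Total NA values left after cleaning
remaining_na <- sum(is.na(toy_clean))
print(remaining_na)
```

Running colSums(is.na(data)) on the real data set before cleaning would show the same kind of breakdown for all 16 columns.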

Now I will use simple plots to determine my focus for the final visualization. Let’s start with the neighborhoods.

ggplot(clean_data, aes(x = neighbourhood_group)) + # Analyze the neighbourhood group from the clean data set
  geom_bar() + # Make a bar chart
  theme_minimal() # Remove the grey background

Most of the listings are in Brooklyn and Manhattan. Because Manhattan has slightly more listings, I will use Manhattan in the final visualization.

Now to check the room type.

ggplot(clean_data, aes(x = room_type)) + # evaluate the room type in the clean data set
  geom_bar() + # Make a bar chart
  theme_minimal() # Remove the grey background

Most listings are either an entire home/apt or a private room. For the final visualization, I will analyze all three types of rooms.

Now to check the price variable for any potential outliers.

ggplot(clean_data, aes(x = price)) + # Evaluate the price
  geom_histogram(bins = 20) + # Make a histogram with 20 bins
  theme_minimal() # Remove the grey background

Since nightly prices should not run into the thousands, there appears to be at least one high outlier. We will examine that more closely.

summary(clean_data$price) # Summarize the price data and display the statistical analysis

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    69.0   101.0   142.3   170.0 10000.0

The mean (142.3) is greater than the median (101.0), meaning one or more large values are skewing the data to the right.

We will calculate the upper limit using the rule Q3 + 1.5 * IQR: 170 + 1.5 * (170 - 69) = 321.5. The upper limit is $321.50, so in the final visualization we will keep only prices below 321.5. This will remove the high outliers from the final visualization.
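The same arithmetic can be checked in R. This is a minimal sketch: the quartiles are copied by hand from the summary output, and upper_fence is a hypothetical helper name that is not part of the original analysis.

```r
# Quartiles copied from the summary(clean_data$price) output
q1 <- 69
q3 <- 170

iqr <- q3 - q1                  # interquartile range: 101
upper_limit <- q3 + 1.5 * iqr   # upper limit for outliers
print(upper_limit)

# Hypothetical helper that derives the same limit from any numeric vector
upper_fence <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[2] + 1.5 * (q[2] - q[1])
}
```

Calling upper_fence(clean_data$price) should give the same limit without typing the quartiles by hand, assuming the quartiles shown in the summary are not rounded.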

Now we will check the specific neighborhoods inside the groups.

ggplot(clean_data, aes(x = neighbourhood)) + # Analyze the neighborhoods inside the groups
  geom_bar() + # Make a bar chart
  theme_minimal() # Remove the grey background

There are too many neighborhoods (218) to read on a single axis, so I will use the count function instead of a graph.

clean_data |> count(neighbourhood) # Get the number of times each neighbourhood appears in the clean data set

# A tibble: 218 × 2
   neighbourhood                  n
   <chr>                      <int>
 1 Allerton                      37
 2 Arden Heights                  4
 3 Arrochar                      20
 4 Arverne                       66
 5 Astoria                      709
 6 Bath Beach                    15
 7 Battery Park City             36
 8 Bay Ridge                    115
 9 Bay Terrace                    5
10 Bay Terrace, Staten Island     2
# ℹ 208 more rows

Scrolling through the data, certain neighborhoods have a much higher count than others, such as Harlem or Greenpoint. I will choose Midtown in Manhattan, as it has a unique name and meets the criterion of fewer than 800 observations.
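Instead of scrolling, the counts can be ranked directly. The sketch below assumes dplyr is installed and rebuilds a toy data set from the first ten counts printed above, so it only stands in for clean_data; toy_counts, toy_data, and ranked are illustrative names.

```r
library(dplyr)

# First ten neighbourhood counts, copied from the count() output above
toy_counts <- c(
  "Allerton" = 37, "Arden Heights" = 4, "Arrochar" = 20, "Arverne" = 66,
  "Astoria" = 709, "Bath Beach" = 15, "Battery Park City" = 36,
  "Bay Ridge" = 115, "Bay Terrace" = 5, "Bay Terrace, Staten Island" = 2
)

# Expand the counts back into one row per listing (toy stand-in for clean_data)
toy_data <- tibble(neighbourhood = rep(names(toy_counts), times = toy_counts))

# sort = TRUE puts the most common neighbourhoods first
ranked <- toy_data |>
  count(neighbourhood, sort = TRUE)

# Keep only neighbourhoods with fewer than 800 observations
candidates <- ranked |>
  filter(n < 800)

print(candidates)
```

On the real data, clean_data |> count(neighbourhood, sort = TRUE) would list the largest neighbourhoods at the top of the table.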

Now I will analyze the reviews per month.

ggplot(clean_data, aes(x = reviews_per_month)) + # Evaluate the reviews per month
  geom_histogram() + # Create a histogram
  theme_minimal() # Remove the grey background

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Most Airbnb listings get fewer than about 8 reviews per month; however, some receive around 15 a month. This is worth noting in the final visualization.

Now we will filter the data for the final visualization.

sorted_data <- clean_data |> # Make a new data set
  filter(
    neighbourhood_group == "Manhattan", # Keep only listings in Manhattan
    neighbourhood == "Midtown", # Keep only listings in Midtown
    price < 321.5 # Keep only prices below the upper limit to exclude outliers
  )
head(sorted_data) # View the beginning of the new data set

# A tibble: 6 × 10
  neighbourhood_group neighbourhood latitude longitude room_type       price
  <chr>               <chr>            <dbl>     <dbl> <chr>           <dbl>
1 Manhattan           Midtown           40.8     -74.0 Entire home/apt   225
2 Manhattan           Midtown           40.8     -74.0 Entire home/apt   250
3 Manhattan           Midtown           40.8     -74.0 Entire home/apt   110
4 Manhattan           Midtown           40.7     -74.0 Entire home/apt   169
5 Manhattan           Midtown           40.8     -74.0 Entire home/apt   145
6 Manhattan           Midtown           40.7     -74.0 Entire home/apt   125
# ℹ 4 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
#   reviews_per_month <dbl>, availability_365 <dbl>

I now have a filtered data set with 749 observations. I will use that to make the final visualization.

ggplot(sorted_data, aes(x = reviews_per_month, y = price, color = room_type)) + # Create a graph and set the x, y, and third variable (colour)
  geom_smooth(se = FALSE) + # Draw a smoothed line per room type without the grey confidence band
  theme_minimal() + # Remove the grey panel background
  scale_color_brewer(palette = "Accent") + # Set the line colours using a palette
  labs(color = "Room Type", title = "Price vs. Reviews per Month Across Airbnb Listings in Midtown New York", subtitle = "749 Total Observations", caption = "Source: Airbnb",
       x = "Number of Reviews per Month", y = "US Dollar ($) Price per Night") # Label every part of the graph

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplotly() # Add interactivity to the graph

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Essay on the visualization

This visualization analyzes the relationship between an Airbnb's price per night, its reviews per month, and its room type. The prices in US dollars per night did not surprise me. It makes complete sense that an entire home or apartment costs more than a single room, and it is also reasonable that a private room costs more than a shared room. The one fact I was surprised to learn from this visualization was that the price of shared rooms increased around the mark of 7 to 8 reviews per month. My hypothesis was that shared rooms would never experience any increase in price, but would instead show a decline throughout. One thing I did not get to do was include geom_point in this graph. I thought it would be interesting to include, but it created too much noise and did not add much overall, and I think the graph looks better without it.
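One possible way to show the points with less noise is to draw them faintly underneath the trend lines. The sketch below is only an illustration: toy_data is simulated (it is not the Airbnb data), and the alpha and size values are assumptions that would need tuning on the real plot.

```r
library(ggplot2)

set.seed(110) # make the simulated values reproducible

# Simulated stand-in for sorted_data; NOT real Airbnb values
toy_data <- data.frame(
  reviews_per_month = runif(150, min = 0, max = 10),
  price = runif(150, min = 40, max = 320),
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"), each = 50)
)

p <- ggplot(toy_data, aes(x = reviews_per_month, y = price, color = room_type)) +
  geom_point(alpha = 0.25, size = 1) + # faint, small points stay in the background
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) + # trend lines drawn on top
  theme_minimal() +
  scale_color_brewer(palette = "Accent")

print(p)
```

Lowering alpha makes overlapping points fade into the background, so the lines remain the focus of the graph.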

Now I will make a map visualization.

popper <- paste0(
  "<b>US Dollar ($) Price per Night: </b>", sorted_data$price, "<br>",
  "<b>Number of Reviews: </b>", sorted_data$number_of_reviews, "<br>",
  "<b>Number of Reviews per month: </b>", sorted_data$reviews_per_month, "<br>",
  "<b>Room Type: </b>", sorted_data$room_type, "<br>"
)

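leaflet() |>
  setView(lng = -73.9840, lat = 40.7549, zoom = 14) |> # Center the map on Midtown
  addProviderTiles("Esri.WorldStreetMap") |> # Use a street map as the background
  addCircles(
    data = sorted_data, # One circle per Midtown listing
    radius = sorted_data$reviews_per_month, # Circle radius in meters
    color = "maroon",
    opacity = 0.5,
    popup = popper # Show the listing details when a circle is clicked
  )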

Assuming "longitude" and "latitude" are longitude and latitude, respectively
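The message above appears because leaflet guessed which columns hold the coordinates. A minimal sketch of naming them explicitly is shown below. It assumes the leaflet package is installed, uses two made-up points (toy_listings is not the real data), and swaps in addCircleMarkers, whose radius is measured in pixels rather than meters, as one possible way to keep the circles visible at any zoom level.

```r
library(leaflet)

# Two made-up listings near Midtown; NOT rows from the Airbnb data
toy_listings <- data.frame(
  longitude = c(-73.9840, -73.9800),
  latitude = c(40.7549, 40.7600),
  reviews_per_month = c(1.5, 7.2)
)

m <- leaflet(toy_listings) |>
  setView(lng = -73.9840, lat = 40.7549, zoom = 14) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircleMarkers(
    lng = ~longitude, # name the coordinate columns explicitly
    lat = ~latitude,
    radius = ~ 2 + reviews_per_month, # radius in pixels; the offset of 2 is an arbitrary choice
    color = "maroon",
    opacity = 0.5
  )

m
```

Passing lng and lat as formulas tells leaflet exactly which columns to use, so it no longer needs to print its assumption.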

Essay on the map visualization

The map visualization shows the reviews per month of different Airbnb listings across Midtown. One interesting pattern that emerges from the map is the grouping of the listings. There appear to be three specific hubs that the Airbnbs are centered around. While I do not know the exact reason for this, I believe it comes down to location. These areas probably hold a lot of apartments and housing, which allows more Airbnbs to be offered there. In addition, the areas are probably very close to famous landmarks and tourist attractions, which makes them desirable for those visiting New York. One surprise from the map is that there is an Airbnb listing in the middle of Bryant Park. I was able to include everything I wanted in the map visualization.

Citations

Image - Richard Haworth. (n.d.) What is Airbnb and what can it do for you? https://www.richardhaworth.co.uk/news/what-is-airbnb-and-what-can-it-do-for-you

Filtering, specifically section 8.11 titled Standalone - Epidemiologist R Handbook. (n.d.). Cleaning data and core functions. https://www.epirhandbook.com/en/new_pages/cleaning.html#filter-rows

Everything else came from prior lessons and assignments.