Project 1

Author

Xutong Zhang

Introduction:

In this project, I will be exploring two datasets. The first dataset contains the population of various neighborhoods in New York. The second dataset shows Airbnb listings in New York for the year 2016. My main objective is to investigate whether there is a positive correlation between the population of a neighborhood and the price of Airbnb bookings or the net profits generated from these bookings.

Datasets:

New York Population Dataset: This dataset includes variables like neighborhood, population, and age group.
Airbnb New York 2016 Dataset: This dataset includes variables like listing ID, neighborhood, price, and minimum nights.

Data Cleaning

For the New York Population Dataset:

I did not find any missing values or anything else that needs cleaning.

For the Airbnb dataset, some values are missing:


airbnb_df <- read.csv("/Users/xutongzhang/Downloads/airbnb_ny19.csv")
# Figure out which are missing values
missing_values_count <- colSums(is.na(airbnb_df))

if (sum(missing_values_count) > 0) {
  print(missing_values_count[missing_values_count > 0])
}

reviews_per_month 
            10052

So there are some missing values, which can cause errors in later code. It is best to clean them by replacing all the NA values with something, let's say 0.

# Replace missing values in 'reviews_per_month' with 0
airbnb_df$reviews_per_month[is.na(airbnb_df$reviews_per_month)] <- 0
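A quick way to confirm the replacement worked is to count the missing values again afterwards. The sketch below does this on a small toy data frame; the values are made up for illustration and are not taken from the Airbnb CSV, and `toy_df`, `na_before`, and `na_after` are hypothetical names.

```r
# Toy data frame with made-up values (NOT from the Airbnb CSV)
toy_df <- data.frame(
  number_of_reviews = c(12, 0, 3, 0),
  reviews_per_month = c(1.5, NA, 0.4, NA)
)

# Count missing values per column before cleaning
na_before <- colSums(is.na(toy_df))

# Replace missing values in reviews_per_month with 0
toy_df$reviews_per_month[is.na(toy_df$reviews_per_month)] <- 0

# Count again; every column should now report 0 missing values
na_after <- colSums(is.na(toy_df))

print(na_before)
print(na_after)
```

Running the same `colSums(is.na(...))` check on `airbnb_df` after the replacement should, under the same logic, report no remaining missing values in `reviews_per_month`.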

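Now the two datasets need to be aligned. The Airbnb data is grouped by borough (neighbourhood_group), while the population data is listed by county, so I map each borough to its county: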

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# Load data
ny_population_df <- read.csv("/Users/xutongzhang/Downloads/newyorkpopulation.csv")
airbnb_ny19_df <- read.csv("/Users/xutongzhang/Downloads/airbnb_ny19.csv")

# Map neighbourhood_group to corresponding county
# This took a lot of searching online
borough_to_county <- c('Manhattan' = 'New York County, New York',
                        'Brooklyn' = 'Kings County, New York',
                        'Queens' = 'Queens County, New York',
                        'Bronx' = 'Bronx County, New York',
                        'Staten Island' = 'Richmond County, New York')

# Create a new column for county; boroughs not in the lookup become NA
airbnb_ny19_df$county <- airbnb_ny19_df$neighbourhood_group %>%
  factor(levels = names(borough_to_county)) %>%
  as.character() %>%
  recode(!!!borough_to_county)

# Aggregate Airbnb data by county: average price and total number of reviews
aggregated_airbnb <- airbnb_ny19_df %>%
  group_by(county) %>%
  summarise(price = mean(price, na.rm = TRUE),
            number_of_reviews = sum(number_of_reviews, na.rm = TRUE))

# Keep only the population rows for the counties that are boroughs of NYC
# (the 2016 figures are in the X2016 column, which is selected when plotting)
ny_population_filtered <- ny_population_df %>%
  filter(Geography %in% borough_to_county)
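Because a borough label that is missing from the lookup silently turns into NA, it can help to list any labels that failed to map. The following is a minimal base R sketch of that check, assuming a plain named-vector lookup instead of `recode()`; the `boroughs` vector is hypothetical, and "Hoboken" is included on purpose as a label that is not an NYC borough.

```r
# Borough-to-county lookup, same mapping as in the report
borough_to_county <- c("Manhattan" = "New York County, New York",
                       "Brooklyn" = "Kings County, New York",
                       "Queens" = "Queens County, New York",
                       "Bronx" = "Bronx County, New York",
                       "Staten Island" = "Richmond County, New York")

# Hypothetical borough labels; "Hoboken" is deliberately not an NYC borough
boroughs <- c("Manhattan", "Queens", "Hoboken", "Bronx")

# Named-vector lookup: labels without a match come back as NA.
# as.character() guards against the column having been read as a factor.
county <- unname(borough_to_county[as.character(boroughs)])

# List the borough labels that failed to map
unmapped <- unique(boroughs[is.na(county)])
print(unmapped)
```

On the real data, the same idea would be applied to `airbnb_ny19_df$neighbourhood_group`; an empty result would suggest that every listing was assigned a county.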

Visualizing my data

To explore the relationship between the population of various neighborhoods in New York and the Airbnb activity in those areas, I will be creating a scatter plot.

I am going to start by merging the datasets, and then making a scatterplot.

library(dplyr)
library(ggplot2)

# Merge the aggregated Airbnb data with the NY population data
merged_data <- merge(aggregated_airbnb, ny_population_filtered, by.x = 'county', by.y = 'Geography')

# Create the scatter plot
ggplot(merged_data, aes(x = X2016, y = price, color = county)) +
  geom_point(shape = 19, size = 6) +  # Using shape 19 for full circles
  scale_color_brewer(palette = "Dark2") +  # Using a darker palette for strong colors
  labs(title = "Relationship between County Population and Average Airbnb Price",
       x = "Population (2016)",
       y = "Average Airbnb Price ($)") +
  theme_minimal() +
  theme(legend.position = "bottom")
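Two follow-up checks could complement the plot: confirming that the merge did not drop any county, and putting a number on the relationship with a correlation coefficient. The sketch below illustrates both on made-up placeholder tables; none of the values are the real county figures, and `toy_prices`, `toy_population`, `toy_merged`, `dropped`, and `r` are hypothetical names.

```r
# Made-up placeholder values, NOT the real county figures
toy_prices <- data.frame(
  county = c("County A", "County B", "County C", "County D"),
  price = c(200, 120, 100, 90),
  stringsAsFactors = FALSE
)
toy_population <- data.frame(
  Geography = c("County A", "County B", "County C"),
  X2016 = c(1600, 2600, 2300),
  stringsAsFactors = FALSE
)

# By default, merge() keeps only rows whose keys appear in both tables
toy_merged <- merge(toy_prices, toy_population,
                    by.x = "county", by.y = "Geography")

# Counties that were silently dropped because they had no match
dropped <- setdiff(toy_prices$county, toy_merged$county)
print(dropped)

# Pearson correlation between population and average price
r <- cor(toy_merged$X2016, toy_merged$price)
print(r)
```

With the real data, the equivalent call would be `cor(merged_data$X2016, merged_data$price)`, assuming `X2016` was read in as a numeric column. Since there are only five boroughs, the result should be read as a rough description and not as strong evidence either way.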

Findings Essay:

Cleaning recap: First, I replaced the missing (NA) values in the data with 0 to prevent any errors. Even though I did not use this column in my final chart, it is still a good habit. Then I made sure the data was clean by mapping the Airbnb boroughs to counties. One of the initial challenges was getting the two datasets to align, since one is organized by county and the other by neighborhood. Then I aggregated the Airbnb data to calculate two key metrics for each county: the average Airbnb price and the total number of reviews. These metrics help me understand the relationship a bit more. Lastly, I filtered the New York population data down to the corresponding counties, just to clean the data a bit more.

Scatterplot findings: I was very interested to see what I would learn from the scatter plot. Manhattan has the highest average Airbnb price despite having a substantial population, which makes me think Manhattan is more premium or expensive in general. Brooklyn shows a balance between a moderate Airbnb price and a high population. Queens also pairs a moderate average price with a high population, and the Bronx has a lower price point. It was exciting to be able to visualize all of this, and it shows how important data is, especially for Airbnb.

I wish I could have gone into more depth on a few things, such as individual neighborhood comparisons, but this was not possible with the New York population data in the CSV file. I would love to do more research in the future with larger datasets to see the full picture.