library(conflicted)
library(dplyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
library(ggplot2)
library(zoo)
library(RColorBrewer)
data <- read.csv("challenge_datasets/AB_NYC_2019.csv")
head(data)
## id name host_id host_name
## 1 2539 Clean & quiet apt home by the park 2787 John
## 2 2595 Skylit Midtown Castle 2845 Jennifer
## 3 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth
## 4 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura
## 6 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris
## neighbourhood_group neighbourhood latitude longitude room_type price
## 1 Brooklyn Kensington 40.64749 -73.97237 Private room 149
## 2 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225
## 3 Manhattan Harlem 40.80902 -73.94190 Private room 150
## 4 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89
## 5 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80
## 6 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200
## minimum_nights number_of_reviews last_review reviews_per_month
## 1 1 9 2018-10-19 0.21
## 2 1 45 2019-05-21 0.38
## 3 3 0 NA
## 4 1 270 2019-07-05 4.64
## 5 10 9 2018-11-19 0.10
## 6 3 74 2019-06-22 0.59
## calculated_host_listings_count availability_365
## 1 6 365
## 2 2 355
## 3 1 365
## 4 1 194
## 5 1 0
## 6 1 129
The dataset is a collection of Airbnb listings in New York City with 48,895 rows and 16 columns. It covers listing and host identifiers, location (borough, neighbourhood, latitude and longitude), room type, price, minimum nights, review activity, the host's listing count, and availability over the year.
The summary statistics show a wide price range, from a minimum of $0 to a maximum of $10,000. The mean price is about $152.70 against a median of $106, which suggests a right-skewed distribution. About three quarters of listings require 5 or fewer minimum nights, with a median of 3 nights, although the maximum is 1,250. The number of reviews per listing varies widely: some listings have none and one has 629, with a mean of about 23 and a median of 5. Reviews per month range from 0.01 to 58.5 (median 0.72), indicating very different activity levels across listings. Availability ranges from 0 to 365 days per year, with a mean of about 113 days and a median of 45.
summary(data)
## id name host_id host_name
## Min. : 2539 Length:48895 Min. : 2438 Length:48895
## 1st Qu.: 9471945 Class :character 1st Qu.: 7822033 Class :character
## Median :19677284 Mode :character Median : 30793816 Mode :character
## Mean :19017143 Mean : 67620011
## 3rd Qu.:29152178 3rd Qu.:107434423
## Max. :36487245 Max. :274321313
##
## neighbourhood_group neighbourhood latitude longitude
## Length:48895 Length:48895 Min. :40.50 Min. :-74.24
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.98
## Mode :character Mode :character Median :40.72 Median :-73.96
## Mean :40.73 Mean :-73.95
## 3rd Qu.:40.76 3rd Qu.:-73.94
## Max. :40.91 Max. :-73.71
##
## room_type price minimum_nights number_of_reviews
## Length:48895 Min. : 0.0 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.: 69.0 1st Qu.: 1.00 1st Qu.: 1.00
## Mode :character Median : 106.0 Median : 3.00 Median : 5.00
## Mean : 152.7 Mean : 7.03 Mean : 23.27
## 3rd Qu.: 175.0 3rd Qu.: 5.00 3rd Qu.: 24.00
## Max. :10000.0 Max. :1250.00 Max. :629.00
##
## last_review reviews_per_month calculated_host_listings_count
## Length:48895 Min. : 0.010 Min. : 1.000
## Class :character 1st Qu.: 0.190 1st Qu.: 1.000
## Mode :character Median : 0.720 Median : 1.000
## Mean : 1.373 Mean : 7.144
## 3rd Qu.: 2.020 3rd Qu.: 2.000
## Max. :58.500 Max. :327.000
## NA's :10052
## availability_365
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 45.0
## Mean :112.8
## 3rd Qu.:227.0
## Max. :365.0
##
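The gap between mean and median described above can be checked directly. The sketch below uses a small made-up price vector (`toy_prices` is a hypothetical stand-in, not values taken from the dataset) to show how a few extreme listings pull the mean above the median, and how to count $0 prices, which are worth inspecting by hand.

```r
# Hypothetical prices, chosen only to mimic a long right tail
toy_prices <- c(0, 50, 69, 80, 106, 120, 175, 250, 10000)

price_mean   <- mean(toy_prices)
price_median <- median(toy_prices)

# A mean well above the median points to right skew
is_right_skewed <- price_mean > price_median

# Listings priced at $0 deserve a manual look before further analysis
n_zero_price <- sum(toy_prices == 0)

# Quartiles are less sensitive to the extreme values than the mean
price_quartiles <- quantile(toy_prices, probs = c(0.25, 0.5, 0.75))
```

On the real data, the same calls would be applied to `data$price`.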
The data contain missing values: reviews_per_month is NA for 10,052 listings, and last_review holds empty strings (for example, in row 3, which has no reviews). These need to be handled during analysis. Room type is categorical, with values such as private rooms and entire homes/apartments. Each listing has a latitude and longitude, which allows spatial analysis.
str(data)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "2018-10-19" "2019-05-21" "" "2019-07-05" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
This dataset provides a rich source of information for analyzing the Airbnb market in New York City, including the distribution of listings across different neighborhoods, price points, and the activity level of hosts and listings.
# Checking for missing values
missing_values <- sum(is.na(data$reviews_per_month))
missing_values
## [1] 10052
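The check above looks at a single column. As a minimal sketch, assuming a toy data frame (`toy` and its values are hypothetical), the same idea can be extended to count NAs and empty strings in every column, and to test whether a missing reviews_per_month coincides with zero reviews.

```r
# Hypothetical three-row stand-in for the listings data
toy <- data.frame(
  last_review       = c("2018-10-19", "", "2019-07-05"),
  reviews_per_month = c(0.21, NA, 4.64),
  number_of_reviews = c(9, 0, 270)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))

# Number of empty strings in each column (is.na() does not catch these)
empty_per_column <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

# Do all rows with a missing reviews_per_month have zero reviews?
na_means_no_reviews <- all(toy$number_of_reviews[is.na(toy$reviews_per_month)] == 0)
```

Whether the last check also holds for the full dataset has to be confirmed by running it on `data`.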
# Checking for any duplicated rows
duplicates <- sum(duplicated(data))
duplicates
## [1] 0
# Remove rows with NA values (here, the 10,052 rows with missing reviews_per_month)
data <- na.omit(data)
We can create a new factor variable for price ranges and convert some character variables into factors.
# Convert 'neighbourhood_group' and 'room_type' into factors
data$neighbourhood_group <- as.factor(data$neighbourhood_group)
data$room_type <- as.factor(data$room_type)
# Create a price range variable
data$price_range <- cut(data$price, breaks=c(0, 100, 200, 300, 10000), labels=c("0-100", "101-200", "201-300", "300+"), include.lowest = TRUE)
# Remove rows with non-finite values (Inf, -Inf) in the 'price' column
data <- data[is.finite(data$price), ]
str(data)
## 'data.frame': 38843 obs. of 17 variables:
## $ id : int 2539 2595 3831 5022 5099 5121 5178 5203 5238 5295 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "Cozy Entire Floor of Brownstone" "Entire Apt: Spacious Studio/Loft by central park" ...
## $ host_id : int 2787 2845 4869 7192 7322 7356 8967 7490 7549 7702 ...
## $ host_name : chr "John" "Jennifer" "LisaRoxanne" "Laura" ...
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 2 3 3 2 3 3 3 3 ...
## $ neighbourhood : chr "Kensington" "Midtown" "Clinton Hill" "East Harlem" ...
## $ latitude : num 40.6 40.8 40.7 40.8 40.7 ...
## $ longitude : num -74 -74 -74 -73.9 -74 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 1 1 1 2 2 2 1 1 ...
## $ price : int 149 225 89 80 200 60 79 79 150 135 ...
## $ minimum_nights : int 1 1 1 10 3 45 2 2 1 5 ...
## $ number_of_reviews : int 9 45 270 9 74 49 430 118 160 53 ...
## $ last_review : chr "2018-10-19" "2019-05-21" "2019-07-05" "2018-11-19" ...
## $ reviews_per_month : num 0.21 0.38 4.64 0.1 0.59 0.4 3.47 0.99 1.33 0.43 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 4 1 ...
## $ availability_365 : int 365 355 194 0 129 0 220 0 188 6 ...
## $ price_range : Factor w/ 4 levels "0-100","101-200",..: 2 3 1 1 2 1 1 1 2 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:10052] 3 20 27 37 39 194 205 261 266 268 ...
## ..- attr(*, "names")= chr [1:10052] "3" "20" "27" "37" ...
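How `cut()` treats prices that sit exactly on a break is easy to get wrong, so here is a small sketch using the same breaks and labels as above. The vector `edge_prices` is hypothetical and only places values on and around the bin edges.

```r
# Hypothetical prices placed on and around the bin edges
edge_prices <- c(0, 100, 101, 200, 300, 301, 10000)

price_range <- cut(edge_prices,
                   breaks = c(0, 100, 200, 300, 10000),
                   labels = c("0-100", "101-200", "201-300", "300+"),
                   include.lowest = TRUE)

# Intervals are closed on the right, so 100 falls in "0-100" and 300 in "201-300";
# include.lowest = TRUE keeps a price of exactly 0 in the first bin instead of NA
binned <- data.frame(price = edge_prices, price_range = price_range)

# A price above the last break is not covered by any bin and becomes NA
above_last_break <- cut(10001, breaks = c(0, 100, 200, 300, 10000),
                        include.lowest = TRUE)
```

Because the last break is hard-coded at 10000, this binning relies on the maximum price in the data not exceeding that value.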
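This line graph shows the average Airbnb price in New York City for the years 2011 to 2019, with listings grouped by the year of their most recent review. Each point is therefore the mean price of listings last reviewed in that year, which serves as a proxy for how prices evolved over time.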
Choice of Graph Type:
Clarity: A line graph is ideal for showing trends and changes over time. It clearly illustrates how the average price has evolved year by year.
Comparative Analysis: It allows for easy comparison between different points in time.
# Convert the last_review to Date format and extract the year
data$last_review <- as.Date(data$last_review, format="%Y-%m-%d")
data$last_review_year <- year(data$last_review)
time_dependent_data <- data %>%
group_by(last_review_year) %>%
summarise(average_price = mean(price))
# Creating the time-dependent graph
ggplot(time_dependent_data, aes(x=last_review_year, y=average_price)) +
geom_line(color = "#56B4E9", linewidth = 1.75) +
geom_point(color = "#E69F00", size = 3.5) +
scale_x_continuous(breaks = seq(min(time_dependent_data$last_review_year, na.rm = TRUE),
max(time_dependent_data$last_review_year, na.rm = TRUE), by = 1)) +
theme_minimal() +
labs(title = "Evolution of Average Airbnb Prices in NYC (2011-2019)",
x = "Year",
y = "Average Price ($)")
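The aggregation behind the plot can be sketched on a toy data frame. The version below uses base R only, as a dependency-free alternative to the dplyr and lubridate calls above, and also counts the listings per year so that averages resting on very few listings are easy to spot. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical stand-in for the listings data (values are illustrative only)
toy <- data.frame(
  last_review = c("2018-10-19", "2019-05-21", "", "2019-07-05", "2018-11-19"),
  price       = c(149, 225, 150, 89, 80)
)

# With an explicit format, an empty string becomes NA instead of raising an error
toy$last_review <- as.Date(toy$last_review, format = "%Y-%m-%d")
toy$last_review_year <- as.integer(format(toy$last_review, "%Y"))

# tapply() drops rows whose grouping value is NA
average_price <- tapply(toy$price, toy$last_review_year, mean)
n_listings    <- tapply(toy$price, toy$last_review_year, length)

by_year <- data.frame(
  last_review_year = as.integer(names(average_price)),
  average_price    = as.vector(average_price),
  n_listings       = as.vector(n_listings)
)
```

A year with a small `n_listings` gives a less reliable average, which is worth keeping in mind when reading the line graph.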
This pie chart shows how Airbnb listings in Manhattan are split across room types (Entire home/apt, Private room, Shared room). It is based on the cleaned data, from which rows with a missing reviews_per_month were removed.
Choice of Graph Type:
Intuitive Representation: A pie chart is a straightforward way to represent part-whole relationships. It shows how each category (room type) contributes to the whole (total listings in Manhattan).
Percentage Visualization: It provides a clear visual representation of the percentage contribution of each room type.
Immediate Comparison: Viewers can quickly grasp the relative size of each category in comparison to the others.
# Filtering data for Manhattan
manhattan_data <- subset(data, neighbourhood_group == "Manhattan")
# Counting the number of listings for each room type in Manhattan
room_type_counts <- table(manhattan_data$room_type)
color_palette <- brewer.pal(n = length(room_type_counts), name = "Set3")
# Creating the part-whole relationship graph (Pie Chart)
pie(room_type_counts,
col = color_palette,
labels = paste(names(room_type_counts), sprintf("%1.1f%%", 100 * room_type_counts / sum(room_type_counts))),
main = "Distribution of Room Types in Manhattan")
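The percentage labels come from dividing each count by the total. As a minimal sketch on a hypothetical vector of room types (`toy_room_type` is made up for illustration), `prop.table()` gives the same shares without writing the division by hand.

```r
# Hypothetical room types for eight listings
toy_room_type <- c("Entire home/apt", "Private room", "Private room", "Shared room",
                   "Entire home/apt", "Private room", "Entire home/apt", "Entire home/apt")

counts <- table(toy_room_type)

# Share of each room type in percent, rounded to one decimal place
shares <- round(100 * prop.table(counts), 1)

# Labels in the same "name percentage" form used for the pie chart
pie_labels <- sprintf("%s %.1f%%", names(shares), as.numeric(shares))
```

The resulting `pie_labels` vector can be passed to the `labels` argument of `pie()`.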
For this assignment, I reused the AB_NYC dataset from the previous assignment. The initial data tidying and mutation steps use the same scripts as in that assignment.
Some of the methods used in the visualizations were copied from the following sources:
For aligning x-ticks with year of review: https://stackoverflow.com/questions/69803798/using-scale-x-continuous-in-ggplot-with-x-and-y-axis-labels
For using color palette: https://stackoverflow.com/questions/53286792/how-to-select-certain-colours-from-a-colour-palette-in-r-ggplot
How to generate Pie Charts: https://www.statmethods.net/graphs/pie.html
Adding percentage with text labels: https://www.dataanalytics.org.uk/interactive-labels-in-r-pie-charts/
Thank you!