Load required libraries

library(conflicted)
library(dplyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
library(ggplot2)
library(zoo)
library(RColorBrewer)

Read data

data <- read.csv("challenge_datasets/AB_NYC_2019.csv")
head(data)
##     id                                             name host_id   host_name
## 1 2539               Clean & quiet apt home by the park    2787        John
## 2 2595                            Skylit Midtown Castle    2845    Jennifer
## 3 3647              THE VILLAGE OF HARLEM....NEW YORK !    4632   Elisabeth
## 4 3831                  Cozy Entire Floor of Brownstone    4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park    7192       Laura
## 6 5099        Large Cozy 1 BR Apartment In Midtown East    7322       Chris
##   neighbourhood_group neighbourhood latitude longitude       room_type price
## 1            Brooklyn    Kensington 40.64749 -73.97237    Private room   149
## 2           Manhattan       Midtown 40.75362 -73.98377 Entire home/apt   225
## 3           Manhattan        Harlem 40.80902 -73.94190    Private room   150
## 4            Brooklyn  Clinton Hill 40.68514 -73.95976 Entire home/apt    89
## 5           Manhattan   East Harlem 40.79851 -73.94399 Entire home/apt    80
## 6           Manhattan   Murray Hill 40.74767 -73.97500 Entire home/apt   200
##   minimum_nights number_of_reviews last_review reviews_per_month
## 1              1                 9  2018-10-19              0.21
## 2              1                45  2019-05-21              0.38
## 3              3                 0                            NA
## 4              1               270  2019-07-05              4.64
## 5             10                 9  2018-11-19              0.10
## 6              3                74  2019-06-22              0.59
##   calculated_host_listings_count availability_365
## 1                              6              365
## 2                              2              355
## 3                              1              365
## 4                              1              194
## 5                              1                0
## 6                              1              129

Data Description

The dataset contains 48,895 Airbnb listings in New York City, described by 16 columns. These cover listing and host identifiers (id, name, host_id, host_name), location (neighbourhood_group, neighbourhood, latitude, longitude), room type, price, minimum nights, review activity (number_of_reviews, last_review, reviews_per_month), the number of listings per host (calculated_host_listings_count), and availability over the year (availability_365). The first six rows are shown in the head(data) output above.

The summary statistics show a wide price range, from $0 to $10,000. The mean price ($152.70) is well above the median ($106), which indicates a right-skewed distribution. At least three quarters of listings require 5 minimum nights or fewer (median 3), although the maximum is 1,250. The number of reviews per listing ranges from 0 to 629, with a mean of about 23 and a median of 5. Reviews per month range from 0.01 to 58.5 among listings that have a value. Availability ranges from 0 to 365 days per year, with a mean of about 113 days and a median of 45.

summary(data)
##        id               name              host_id           host_name        
##  Min.   :    2539   Length:48895       Min.   :     2438   Length:48895      
##  1st Qu.: 9471945   Class :character   1st Qu.:  7822033   Class :character  
##  Median :19677284   Mode  :character   Median : 30793816   Mode  :character  
##  Mean   :19017143                      Mean   : 67620011                     
##  3rd Qu.:29152178                      3rd Qu.:107434423                     
##  Max.   :36487245                      Max.   :274321313                     
##                                                                              
##  neighbourhood_group neighbourhood         latitude       longitude     
##  Length:48895        Length:48895       Min.   :40.50   Min.   :-74.24  
##  Class :character    Class :character   1st Qu.:40.69   1st Qu.:-73.98  
##  Mode  :character    Mode  :character   Median :40.72   Median :-73.96  
##                                         Mean   :40.73   Mean   :-73.95  
##                                         3rd Qu.:40.76   3rd Qu.:-73.94  
##                                         Max.   :40.91   Max.   :-73.71  
##                                                                         
##   room_type             price         minimum_nights    number_of_reviews
##  Length:48895       Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
##  Class :character   1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00   
##  Mode  :character   Median :  106.0   Median :   3.00   Median :  5.00   
##                     Mean   :  152.7   Mean   :   7.03   Mean   : 23.27   
##                     3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00   
##                     Max.   :10000.0   Max.   :1250.00   Max.   :629.00   
##                                                                          
##  last_review        reviews_per_month calculated_host_listings_count
##  Length:48895       Min.   : 0.010    Min.   :  1.000               
##  Class :character   1st Qu.: 0.190    1st Qu.:  1.000               
##  Mode  :character   Median : 0.720    Median :  1.000               
##                     Mean   : 1.373    Mean   :  7.144               
##                     3rd Qu.: 2.020    3rd Qu.:  2.000               
##                     Max.   :58.500    Max.   :327.000               
##                     NA's   :10052                                   
##  availability_365
##  Min.   :  0.0   
##  1st Qu.:  0.0   
##  Median : 45.0   
##  Mean   :112.8   
##  3rd Qu.:227.0   
##  Max.   :365.0   
## 
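The gap between mean and median described above can be checked directly. The following is a minimal sketch on a small made-up price vector (`toy_price` is hypothetical and not taken from the dataset); with the real data, `data$price` would take its place.

```r
# Hypothetical prices, chosen only to mimic a right-skewed shape
toy_price <- c(0, 60, 69, 80, 89, 106, 149, 150, 175, 200, 225, 10000)

mean_price   <- mean(toy_price)
median_price <- median(toy_price)

# A mean well above the median is a simple sign of right skew
is_right_skewed <- mean_price > median_price

# Upper quantiles show how far the tail stretches beyond the bulk of the data
quantile(toy_price, probs = c(0.5, 0.9, 0.99))

# Zero prices may be data-entry issues, so it is worth counting them
n_zero <- sum(toy_price == 0)
```

A check like this only flags skew; it does not say how to handle it. Capping the plotted range or using a log scale are common options.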

The data include missing values: reviews_per_month is NA for 10,052 listings, and last_review is stored as an empty string rather than NA in rows such as the third one shown above. Both need to be considered during analysis. room_type is categorical, with values such as "Private room" and "Entire home/apt". Latitude and longitude are provided for each listing, which allows spatial analysis.

str(data)
## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : chr  "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr  "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr  "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr  "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : chr  "2018-10-19" "2019-05-21" "" "2019-07-05" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...
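Because summary() reports NA counts only, empty strings such as those in last_review are easy to miss. A minimal sketch of counting both kinds of missingness per column follows; the three-row `toy` frame is hypothetical and only mirrors the pattern visible in the first rows above.

```r
# Hypothetical three-row frame mirroring the pattern in the first rows above
toy <- data.frame(
  number_of_reviews = c(9L, 45L, 0L),
  last_review       = c("2018-10-19", "2019-05-21", ""),
  reviews_per_month = c(0.21, 0.38, NA),
  stringsAsFactors  = FALSE
)

# NA count per column
na_counts <- colSums(is.na(toy))

# Empty strings are not NA, so count them separately for character columns
empty_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})

na_counts
empty_counts
```

Applied to the full data frame, the same two calls would show whether the empty last_review entries line up with the NA values in reviews_per_month.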

This dataset provides a rich source of information for analyzing the Airbnb market in New York City, including the distribution of listings across different neighborhoods, price points, and the activity level of hosts and listings.

Tidy data

# Checking for missing values
missing_values <- sum(is.na(data$reviews_per_month))
missing_values
## [1] 10052
# Checking for any duplicated rows
duplicates <- sum(duplicated(data))
duplicates
## [1] 0
# Remove rows containing any NA (here, the 10052 listings with a missing reviews_per_month)
data <- na.omit(data)
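na.omit() drops every row that has an NA in any column, which here removes all listings without a reviews_per_month value. If those NAs simply mean "no reviews yet" (as in the third row of the head() output; this is an assumption for the remaining rows), an alternative is to keep the rows and fill in 0. The sketch below contrasts the two options on a hypothetical `toy` frame; it is not what this analysis does.

```r
# Hypothetical rows mirroring the first three listings
toy <- data.frame(
  id                = c(2539L, 2595L, 3647L),
  number_of_reviews = c(9L, 45L, 0L),
  reviews_per_month = c(0.21, 0.38, NA)
)

# Option used in this analysis: drop incomplete rows
dropped <- na.omit(toy)

# Alternative (assumption: NA means "no reviews yet"): keep the row, fill with 0
filled <- toy
no_reviews <- is.na(filled$reviews_per_month) & filled$number_of_reviews == 0
filled$reviews_per_month[no_reviews] <- 0

nrow(dropped)  # rows kept after dropping
nrow(filled)   # rows kept after filling
```

Which option is preferable depends on the question: dropping restricts the analysis to listings that have been reviewed, while filling keeps new or inactive listings in the price and availability plots.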

Mutate Variables

We can create a new factor variable for price ranges and convert some character variables into factors.

# Convert 'neighbourhood_group' and 'room_type' into factors
data$neighbourhood_group <- as.factor(data$neighbourhood_group)
data$room_type <- as.factor(data$room_type)

# Create a price range variable
data$price_range <- cut(data$price, breaks=c(0, 100, 200, 300, 10000), labels=c("0-100", "101-200", "201-300", "300+"), include.lowest = TRUE)
# Defensive check: keep only rows with a finite price (a no-op here, since price is an integer column with no NA left)
data <- data[is.finite(data$price), ]
str(data)
## 'data.frame':    38843 obs. of  17 variables:
##  $ id                            : int  2539 2595 3831 5022 5099 5121 5178 5203 5238 5295 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "Cozy Entire Floor of Brownstone" "Entire Apt: Spacious Studio/Loft by central park" ...
##  $ host_id                       : int  2787 2845 4869 7192 7322 7356 8967 7490 7549 7702 ...
##  $ host_name                     : chr  "John" "Jennifer" "LisaRoxanne" "Laura" ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 2 3 3 2 3 3 3 3 ...
##  $ neighbourhood                 : chr  "Kensington" "Midtown" "Clinton Hill" "East Harlem" ...
##  $ latitude                      : num  40.6 40.8 40.7 40.8 40.7 ...
##  $ longitude                     : num  -74 -74 -74 -73.9 -74 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 1 1 1 2 2 2 1 1 ...
##  $ price                         : int  149 225 89 80 200 60 79 79 150 135 ...
##  $ minimum_nights                : int  1 1 1 10 3 45 2 2 1 5 ...
##  $ number_of_reviews             : int  9 45 270 9 74 49 430 118 160 53 ...
##  $ last_review                   : chr  "2018-10-19" "2019-05-21" "2019-07-05" "2018-11-19" ...
##  $ reviews_per_month             : num  0.21 0.38 4.64 0.1 0.59 0.4 3.47 0.99 1.33 0.43 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 4 1 ...
##  $ availability_365              : int  365 355 194 0 129 0 220 0 188 6 ...
##  $ price_range                   : Factor w/ 4 levels "0-100","101-200",..: 2 3 1 1 2 1 1 1 2 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:10052] 3 20 27 37 39 194 205 261 266 268 ...
##   ..- attr(*, "names")= chr [1:10052] "3" "20" "27" "37" ...
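To make the binning rule explicit: cut() uses right-closed intervals by default, so a price of exactly 100 falls in "0-100", and include.lowest = TRUE is what keeps a price of 0 from becoming NA. A minimal sketch on hypothetical boundary values (`toy_price` is made up for illustration):

```r
# Hypothetical prices sitting on and around the bin edges
toy_price <- c(0, 100, 101, 200, 300, 301, 10000)
breaks <- c(0, 100, 200, 300, 10000)
labels <- c("0-100", "101-200", "201-300", "300+")

# Same call as above: the lowest break (0) is included in the first bin
with_lowest <- cut(toy_price, breaks = breaks, labels = labels,
                   include.lowest = TRUE)

# Without include.lowest, a price of exactly 0 falls outside (0, 100]
without_lowest <- cut(toy_price, breaks = breaks, labels = labels)

data.frame(price = toy_price, with_lowest, without_lowest)
```

Any price above the last break (10000) would also become NA; using Inf as the final break is one way to guard against that if the data change.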

Histogram with Density Overlay for Number of Reviews

Choice of Graph Type:

Histograms are excellent for showing the frequency distribution of a single numerical variable. Adding a density overlay provides an additional layer of information, highlighting the overall distribution trend. This combination is particularly effective in visualizing the distribution of the number of reviews, showing both the frequency of specific review counts and the overall trend.

ggplot(data, aes(x = number_of_reviews)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 10, fill = "skyblue", alpha = 0.6) +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Number of Reviews for NYC Airbnb Listings",
       subtitle = "Histogram with density overlay",
       caption = "Data source: Airbnb NYC 2019. Note: Binwidth set at 10.",
       x = "Number of Reviews",
       y = "Density") +
  theme_minimal() +
  theme(
    strip.text = element_text(face = "bold", size = 12),  # Enhancing facet labels
    axis.title = element_text(size = 14, face = "bold"),  # Bold axis titles
    axis.text = element_text(size = 12),  # Larger axis texts
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),  # Bold and centered title
    plot.subtitle = element_text(size = 14, hjust = 0.5),  # Centered subtitle
    plot.caption = element_text(size = 10, hjust = 0),  # Caption aligned to left
    legend.position = "bottom"  # Legend at the bottom
  )
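The histogram uses after_stat(density) so that the bars share a y-axis with the density curve: bar heights are scaled so that the total bar area is 1. The sketch below illustrates that scaling with base R's hist() as a stand-in, on a hypothetical `toy_reviews` vector; the bin edges are an assumption and need not coincide with the ones ggplot2 chooses.

```r
# Hypothetical review counts spanning the observed range (0 to 629)
toy_reviews <- c(0, 1, 5, 9, 24, 45, 74, 270, 629)

# Bins of width 10, as in the plot
bin_breaks <- seq(0, 630, by = 10)
h <- hist(toy_reviews, breaks = bin_breaks, plot = FALSE)

# On the density scale, height times width summed over all bins equals 1
total_area <- sum(h$density * diff(h$breaks))
total_area
```

On the count scale the bars would be far taller than the density curve, which is why the two layers need a common scale.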

Violin Plot for Minimum Nights Across Neighbourhood Groups

Choice of Graph Type:

Violin plots are useful for comparing the distribution of a numerical variable across categories. They combine aspects of box plots and density plots, showing both the range of values and where they concentrate. Here, the plot shows how the minimum-nights requirement varies by neighbourhood group.

ggplot(data, aes(x = neighbourhood_group, y = minimum_nights, fill = neighbourhood_group)) +
  geom_violin(trim = FALSE) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Distribution of Minimum Nights Requirement Across Neighbourhood Groups in NYC",
       subtitle = "Violin plot showing range and density of minimum nights in different neighbourhoods",
       caption = "Data source: Airbnb NYC 2019. Note: Listings with minimum nights above 30 are not shown.",
       x = "Neighbourhood Group",
       y = "Minimum Nights") +
  theme_minimal() +
  theme(
    strip.text = element_text(face = "bold", size = 12),  # Enhancing facet labels
    axis.title = element_text(size = 14, face = "bold"),  # Bold axis titles
    axis.text = element_text(size = 12),  # Larger axis texts
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),  # Bold and centered title
    plot.subtitle = element_text(size = 14, hjust = 0.5),  # Centered subtitle
    plot.caption = element_text(size = 10, hjust = 0),  # Caption aligned to left
    legend.position = "bottom"  # Legend at the bottom
  ) +
  ylim(0, 30)  # Scale limit: listings with minimum_nights > 30 are dropped before the density is estimated
## Warning: Removed 439 rows containing non-finite values (`stat_ydensity()`).
## Warning: Removed 83 rows containing missing values (`geom_violin()`).
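The first warning appears because ylim() sets scale limits: values outside 0 to 30 are treated as missing and removed before the violin densities are computed. coord_cartesian(ylim = ...) zooms the view without discarding data. The sketch below counts the affected rows and builds the zoomed variant on a hypothetical `toy` frame; the ggplot2 part is guarded so the snippet also runs where the package is not installed.

```r
# Hypothetical listings, two of which exceed the 30-night limit
toy <- data.frame(
  neighbourhood_group = rep(c("Brooklyn", "Manhattan"), each = 5),
  minimum_nights      = c(1, 1, 3, 10, 45, 1, 2, 3, 5, 1250)
)

# Rows that a scale limit of 0 to 30 would discard before any statistic is computed
n_dropped <- sum(toy$minimum_nights < 0 | toy$minimum_nights > 30)

# Medians with and without those rows, to show that summaries can shift
median_all     <- median(toy$minimum_nights)
median_trimmed <- median(toy$minimum_nights[toy$minimum_nights <= 30])

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Zoom the view instead of dropping data; densities use all rows
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = neighbourhood_group,
                                         y = minimum_nights)) +
    ggplot2::geom_violin() +
    ggplot2::coord_cartesian(ylim = c(0, 30))
  # print(p) would render the plot
}
```

Neither choice is strictly better. With coord_cartesian() the density is estimated over the full range, so a few very large values can flatten the visible part of each violin; filtering explicitly before plotting makes the exclusion visible in the code and in the caption.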

Faceted Histogram for Price Distribution Across Different Neighbourhood Groups

Choice of Graph Type:

Faceted histograms allow us to compare the distribution of a numerical variable (like price) across different categories (such as neighbourhood groups). This can provide insights into how price distributions vary in different areas of the city.

ggplot(data[data$price <= 1000, ], aes(x = price, fill = neighbourhood_group)) +  # Focus on prices up to $1000
  geom_histogram(bins = 30, alpha = 0.7) +
  facet_wrap(~ neighbourhood_group, scales = "free_y") +
  scale_fill_brewer(palette = "Set3") +  # Using a color palette for different neighbourhood groups
  labs(title = "Price Distribution Across Neighbourhood Groups in NYC",
       subtitle = "Faceted histogram showing price distribution for each neighbourhood group (up to $1000)",
       caption = "Data source: Airbnb NYC 2019",
       x = "Price (USD)",
       y = "Frequency") +
  theme_minimal() +
  theme(
    strip.text = element_text(face = "bold", size = 12),
    axis.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 14, hjust = 0.5),
    plot.caption = element_text(size = 10, hjust = 0),
    legend.position = "bottom"
  )
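A numeric summary per neighbourhood group can complement the faceted histogram. The following is a minimal sketch using base R's aggregate() on a hypothetical `toy` frame; the values are made up, and the same $1000 cap as in the plot is applied first.

```r
# Hypothetical listings; one price exceeds the $1000 cap used in the plot
toy <- data.frame(
  neighbourhood_group = c("Brooklyn", "Brooklyn",
                          "Manhattan", "Manhattan", "Manhattan"),
  price               = c(149, 89, 225, 150, 1200)
)

# Same filter as the plot: keep prices up to $1000
toy_capped <- toy[toy$price <= 1000, ]

# Median price per neighbourhood group
price_by_group <- aggregate(price ~ neighbourhood_group,
                            data = toy_capped, FUN = median)
price_by_group
```

The median is used here rather than the mean because the price distribution is right-skewed.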

Disclaimer

For this assignment, I reused the AB_NYC dataset from the previous assignment, along with the same scripts for the initial tidy-data and mutation steps.

Thank you!