New York City’s Airbnb market is one of the most active and varied short-term rental ecosystems in the world – shaped by neighborhood trends, pricing dynamics, and host behaviors. This data mining project explores that complexity using predictive modeling, classification, clustering, and visualization to uncover patterns that can guide hosts, guests, investors, and policymakers.
The analysis leverages real-world Airbnb data to uncover drivers of rental price variation, reveal booking behavior trends, and compare neighborhoods across the city – especially within Brooklyn. Through a mix of data wrangling, machine learning, and statistical analysis in R, this project delivers actionable insights and highlights opportunities for smarter, data-informed decision-making in the short-term rental space.
# Load required packages
## Data preparation
library(scales)
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats)
## EDA
library(ggplot2)
library(ggcorrplot)
## Regression
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(ggcorrplot)
## Classification
library(caret)
## Loading required package: lattice
library(class)
library(e1071)
library(rpart)
library(rpart.plot)
## Clustering
library(dplyr)
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Read data into local environment
df <- read.csv("train.csv")
Before any predictive modeling, the data is preprocessed to ensure quality, completeness, and contextual richness in the Airbnb rental dataset for New York City.
# Subset data frame to include only those records pertaining to NYC
ny.df <- df[df$city == "NYC", ]
# Convert blank cells to 'NA's
ny.df[ny.df == ""] <- NA
# Calculate number of 'NA' values in data frame
sum(is.na(ny.df))
## [1] 34579
# Get percentage of rows that are "complete cases" (i.e., not missing values)
percent(sum(complete.cases(ny.df))/nrow(ny.df), accuracy = 0.01)
## [1] "55.20%"
Next, we evaluate the count and proportion of ‘NA’ values for each variable to understand the potential impact of dropping or imputing them. These results guide how missing values are handled (i.e., dropped or imputed) for each data mining task (prediction, classification, and clustering).
# Subset columns that contain "NA" values & get count of NA values in each column
num_NAs <- colSums(is.na(ny.df))
# Compute percent of NA values in each column
prop_NAs <- percent(num_NAs/nrow(ny.df), accuracy = 0.01)
# Create df to store missing value counts for each variable
var_NAs.df <- data.frame(num_NAs, prop_NAs)
# Convert to table
var_NAs <- data.table(var_NAs.df, keep.rownames = TRUE)
colnames(var_NAs) <- c("Variable", "Num. of NAs", "% NAs")
var_NAs # print table
## Variable Num. of NAs % NAs
## 1: id 0 0.00%
## 2: log_price 0 0.00%
## 3: property_type 0 0.00%
## 4: room_type 0 0.00%
## 5: amenities 0 0.00%
## 6: accommodates 0 0.00%
## 7: bathrooms 99 0.31%
## 8: bed_type 0 0.00%
## 9: cancellation_policy 0 0.00%
## 10: cleaning_fee 0 0.00%
## 11: city 0 0.00%
## 12: description 0 0.00%
## 13: first_review 6858 21.20%
## 14: host_has_profile_pic 176 0.54%
## 15: host_identity_verified 176 0.54%
## 16: host_response_rate 9960 30.79%
## 17: host_since 176 0.54%
## 18: instant_bookable 0 0.00%
## 19: last_review 6832 21.12%
## 20: latitude 0 0.00%
## 21: longitude 0 0.00%
## 22: name 0 0.00%
## 23: neighbourhood 8 0.02%
## 24: number_of_reviews 0 0.00%
## 25: review_scores_rating 7321 22.63%
## 26: thumbnail_url 2415 7.47%
## 27: zipcode 446 1.38%
## 28: bedrooms 47 0.15%
## 29: beds 65 0.20%
## Variable Num. of NAs % NAs
# Handle missing values for numerical variables by imputing with median
# Convert 'host_response_rate' from character value to numerical
ny.df$host_response_rate <- as.numeric(sub("%", "", ny.df$host_response_rate, fixed = TRUE))
# Impute for missing values with median
ny.df$bathrooms[is.na(ny.df$bathrooms)] <- median(ny.df$bathrooms, na.rm = TRUE)
ny.df$host_response_rate[is.na(ny.df$host_response_rate)] <- median(ny.df$host_response_rate, na.rm = TRUE)
ny.df$review_scores_rating[is.na(ny.df$review_scores_rating)] <- median(ny.df$review_scores_rating, na.rm = TRUE)
ny.df$bedrooms[is.na(ny.df$bedrooms)] <- median(ny.df$bedrooms, na.rm = TRUE)
ny.df$beds[is.na(ny.df$beds)] <- median(ny.df$beds, na.rm = TRUE)
Note: The decision to impute missing values for numerical variables, such as ‘host_response_rate,’ ‘bathrooms,’ ‘review_scores_rating,’ ‘bedrooms,’ and ‘beds,’ with their respective medians serves the purpose of preserving the data’s central tendencies while reducing the potential impact of outliers.
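To illustrate why the median is less sensitive to outliers than the mean, here is a minimal sketch on a made-up vector. The values and the helper name `impute_median` are illustrative assumptions and are not part of the project code.

```r
# Hypothetical helper: replace NAs in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy ratings with one extreme value and two missing entries (made-up data)
ratings <- c(95, 96, 98, NA, 100, 20, NA)

mean(ratings, na.rm = TRUE)    # pulled down by the outlier (20)
median(ratings, na.rm = TRUE)  # stays near the bulk of the data

imputed <- impute_median(ratings)
imputed
```

In this toy case the single low rating drags the mean well below the typical value, while the median (and therefore the imputed value) stays with the majority of observations.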
# Drop NA values for categorical variables that were not imputed
ny.df <- na.omit(ny.df)
# Subset values not equal to 0, as 0 is unrealistic for many numeric variables
ny.df <- ny.df %>% filter(log_price > 0)
ny.df <- ny.df %>% filter(accommodates > 0)
ny.df <- ny.df %>% filter(bathrooms > 0)
ny.df <- ny.df %>% filter(number_of_reviews > 0)
ny.df <- ny.df %>% filter(review_scores_rating > 0)
ny.df <- ny.df %>% filter(bedrooms > 0)
# Convert 'host_response_rate' & 'review_scores_rating' to decimal values that denote percentages
ny.df$host_response_rate <- round((ny.df$host_response_rate)/100, 2)
ny.df$review_scores_rating <- round((ny.df$review_scores_rating)/100, 2)
# Subset of valid 'zipcode' values
# all valid zipcodes include 5 digits
valid_zipcode <- nchar(ny.df$zipcode) == 5
# Filter dataframe to keep only those rows with valid zipcodes
ny.df <- ny.df[valid_zipcode, ]
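A length check alone also accepts five-character strings that are not purely digits. As a hedged alternative, a regular expression can require exactly five digits. The sample values below are made up for illustration and do not come from the dataset.

```r
# Made-up zipcode strings, including malformed ones
zips <- c("10001", "11211", "1000", "10003-1234", "1m071", "11 21")

# Length-only check (as in the filter above)
len_ok <- nchar(zips) == 5

# Stricter check: exactly five digits and nothing else
digits_ok <- grepl("^[0-9]{5}$", zips)

data.frame(zips, len_ok, digits_ok)
```

Whether the stricter rule changes anything on the real data depends on how the raw `zipcode` field is formatted, which this sketch does not assume.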
# Convert 't'/'f' to 'True'/'False'
ny.df$host_identity_verified <- factor(ny.df$host_identity_verified, levels = c("t", "f"), labels = c("True", "False"))
ny.df$instant_bookable <- factor(ny.df$instant_bookable, levels = c("t", "f"), labels = c("True", "False"))
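Summary of the cleaning steps:
- Imputed missing numeric values (e.g., bathrooms, beds, review_scores_rating) using the median.
- Converted percentage strings to numeric values (host_response_rate).
- Grouped sparse categorical levels (bed_type, rare cancellation_policy categories).
***
Simplify categorical variables: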
# Group less common levels of 'bed_type' into a single level called 'Other'
# (keep only the most frequently occurring type of bed)
ny.df$bed_type <- fct_lump(ny.df$bed_type, n = 1)
# Combine levels 'super_strict_30' & 'super_strict_60' from the variable 'cancellation_policy'
# & create new category 'super_strict'
ny.df$cancellation_policy <- fct_other(ny.df$cancellation_policy, keep = c("flexible", "moderate", "strict"),
other_level = "super_strict")
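The effect of the two forcats calls can be seen on small made-up factors. This is only a sketch: the toy values are assumptions, and `fct_lump_n()` is used here as the explicit, newer spelling of lumping by rank.

```r
library(forcats)

# Made-up factor with one dominant level and several rare ones
bed <- factor(c("Real Bed", "Real Bed", "Real Bed", "Futon", "Couch", "Airbed"))

# Keep the single most frequent level; lump the rest into "Other"
bed_lumped <- fct_lump_n(bed, n = 1)
table(bed_lumped)

# Keep the named levels; relabel everything else with a custom level
policy <- factor(c("flexible", "moderate", "strict",
                   "super_strict_30", "super_strict_60"))
policy_grouped <- fct_other(policy, keep = c("flexible", "moderate", "strict"),
                            other_level = "super_strict")
levels(policy_grouped)
```

Both functions place the combined level last, which keeps the original ordering of the retained levels.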
Define new variable ‘borough’:
# Import dataset for mapping zipcodes to boroughs
borough_df <- read.csv("nyc_zip_borough_neighborhoods_pop.csv")
# Inspect dataset
str(borough_df)
## 'data.frame': 177 obs. of 6 variables:
## $ zip : int 10001 10002 10003 10004 10005 10006 10007 10009 10010 10011 ...
## $ borough : chr "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
## $ post_office : chr "New York, NY" "New York, NY" "New York, NY" "New York, NY" ...
## $ neighborhood: chr "Chelsea and Clinton" "Lower East Side" "Lower East Side" "Lower Manhattan" ...
## $ population : int 21102 81410 56024 3089 7135 3011 6988 61347 31834 50984 ...
## $ density : int 33959 92573 97188 5519 97048 32796 42751 99492 81487 77436 ...
Note: The dataset with information about the zipcode to borough mappings was downloaded/imported from the link below … Zipcode to NYC Borough Mappings Dataset
# Subset necessary variables
borough_zip_df <- borough_df[, c("zip", "borough")]
# Convert 'zip' to character type
borough_zip_df$zip <- as.character(borough_zip_df$zip)
# Merge new dataset with Airbnb dataframe based on 'zipcode' variable
ny.df <- merge(ny.df, borough_zip_df, by.x = "zipcode", by.y = "zip", all.x = TRUE)
# Replace missing 'borough' values with "Other"
ny.df$borough[is.na(ny.df$borough)] <- "Other"
# Drop 'zipcode' from dataset
ny.df <- subset(ny.df, select = -c(zipcode))
Note: The ‘zipcode’ variable was dropped to avoid redundancy and potential confusion, as it is numeric in form but categorical in nature. The location of a listing is better represented by variables like ‘borough’ and ‘neighbourhood’.
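The borough lookup is a left join: every listing is kept, and listings whose zipcode has no match receive a placeholder borough. A minimal sketch with made-up listings and a made-up lookup table (not the real mapping file) shows the behavior:

```r
# Made-up listings and a made-up zipcode-to-borough lookup table
listings <- data.frame(id = 1:4,
                       zipcode = c("10001", "11211", "99999", "10001"),
                       stringsAsFactors = FALSE)
lookup <- data.frame(zip = c("10001", "11211"),
                     borough = c("Manhattan", "Brooklyn"),
                     stringsAsFactors = FALSE)

# all.x = TRUE keeps every listing, even when its zipcode has no match
merged <- merge(listings, lookup, by.x = "zipcode", by.y = "zip", all.x = TRUE)

# Unmatched zipcodes come back as NA and are relabeled "Other"
merged$borough[is.na(merged$borough)] <- "Other"
merged[order(merged$id), ]
```

Note that `merge()` sorts its result by the join key by default, so row order after the join generally differs from the input order.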
Define new variables ‘amenities_list’ & ‘amenities_count’:
# Define list of amenities & count of amenities for each listing
ny.df <- ny.df %>%
mutate(amenities_list = strsplit(amenities, ",")) %>%
mutate(amenities_count = lengths(amenities_list))
# Drop now redundant 'amenities_list' variable
ny.df <- subset(ny.df, select = -c(amenities_list))
# Drop NA values for categorical variables that were not imputed
ny.df <- na.omit(ny.df)
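The amenities count relies on splitting a comma-separated string. The sketch below shows what that split returns on made-up strings. The brace-wrapped format is an assumption about how the raw ‘amenities’ field looks, and the brace-stripping variant is an illustrative alternative rather than the method used above.

```r
# Made-up amenities strings; the brace-wrapped format is an assumption
amenities <- c('{TV,"Cable TV",Internet}', '{Kitchen}', '{}')

# Same approach as above: split on commas and count the pieces
raw_count <- lengths(strsplit(amenities, ","))
raw_count

# Variant: strip braces first so that an empty set counts as zero
stripped <- gsub("[{}]", "", amenities)
clean_count <- ifelse(stripped == "", 0L, lengths(strsplit(stripped, ",")))
clean_count
```

With the plain split, an empty set still counts as one item, and any amenity whose name itself contains a comma would be counted more than once.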
Identify numeric & categorical variables:
# Subset numeric columns
num_var <- ny.df[, sapply(ny.df, is.numeric)]
# Subset remaining (categorical) columns
cat_var <- ny.df[, !(names(ny.df) %in% names(num_var))]
# Drop unique identifier columns from subsets
num_var <- subset(num_var, select = -c(id))
cat_var <- subset(cat_var, select = -c(description, name, thumbnail_url))
# Get summary stats for numerical variables
summary(num_var)
## log_price accommodates bathrooms host_response_rate
## Min. :2.303 Min. : 1.00 Min. :0.500 Min. :0.0000
## 1st Qu.:4.174 1st Qu.: 2.00 1st Qu.:1.000 1st Qu.:1.0000
## Median :4.605 Median : 2.00 Median :1.000 Median :1.0000
## Mean :4.676 Mean : 2.96 Mean :1.142 Mean :0.9643
## 3rd Qu.:5.106 3rd Qu.: 4.00 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :7.600 Max. :16.00 Max. :5.500 Max. :1.0000
## latitude longitude number_of_reviews review_scores_rating
## Min. :40.51 Min. :-74.24 Min. : 1.00 Min. :0.2000
## 1st Qu.:40.69 1st Qu.:-73.98 1st Qu.: 3.00 1st Qu.:0.9100
## Median :40.73 Median :-73.95 Median : 9.00 Median :0.9600
## Mean :40.73 Mean :-73.95 Mean : 23.49 Mean :0.9355
## 3rd Qu.:40.77 3rd Qu.:-73.93 3rd Qu.: 29.00 3rd Qu.:1.0000
## Max. :40.90 Max. :-73.72 Max. :465.00 Max. :1.0000
## bedrooms beds amenities_count
## Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.:13.00
## Median : 1.000 Median : 1.000 Median :16.00
## Mean : 1.282 Mean : 1.639 Mean :17.36
## 3rd Qu.: 1.000 3rd Qu.: 2.000 3rd Qu.:21.00
## Max. :10.000 Max. :18.000 Max. :77.00
# Standard deviation of each variable to better understand its overall distribution
options(scipen = 999)
col.sd <- apply(num_var, 2, sd)
col.sd # print 'sd' object (vector of sd values for each column)
## log_price accommodates bathrooms
## 0.65846055 1.91204701 0.40526714
## host_response_rate latitude longitude
## 0.11874684 0.05767564 0.04550813
## number_of_reviews review_scores_rating bedrooms
## 35.36204117 0.08135042 0.65116126
## beds amenities_count
## 1.13269362 7.37838977
# Interquartile range for numerical variables
col.iqr <- apply(num_var, 2, IQR)
col.iqr
## log_price accommodates bathrooms
## 0.93155820 2.00000000 0.00000000
## host_response_rate latitude longitude
## 0.00000000 0.07951180 0.04903957
## number_of_reviews review_scores_rating bedrooms
## 26.00000000 0.09000000 0.00000000
## beds amenities_count
## 1.00000000 8.00000000
# Variance of numerical variables
col.var <- apply(num_var, 2, var)
col.var
## log_price accommodates bathrooms
## 0.433570292 3.655923771 0.164241452
## host_response_rate latitude longitude
## 0.014100813 0.003326480 0.002070990
## number_of_reviews review_scores_rating bedrooms
## 1250.473955575 0.006617891 0.424010993
## beds amenities_count
## 1.282994844 54.440635593
# Create correlation matrix
corr <- round(cor(num_var), 2)
corr
## log_price accommodates bathrooms host_response_rate
## log_price 1.00 0.60 0.23 -0.01
## accommodates 0.60 1.00 0.37 0.01
## bathrooms 0.23 0.37 1.00 0.01
## host_response_rate -0.01 0.01 0.01 1.00
## latitude 0.06 -0.05 -0.07 -0.01
## longitude -0.35 -0.05 -0.01 0.01
## number_of_reviews 0.03 0.10 -0.01 0.04
## review_scores_rating 0.06 -0.04 -0.01 0.05
## bedrooms 0.49 0.74 0.45 0.01
## beds 0.48 0.84 0.39 0.02
## amenities_count 0.22 0.26 0.11 0.06
## latitude longitude number_of_reviews review_scores_rating
## log_price 0.06 -0.35 0.03 0.06
## accommodates -0.05 -0.05 0.10 -0.04
## bathrooms -0.07 -0.01 -0.01 -0.01
## host_response_rate -0.01 0.01 0.04 0.05
## latitude 1.00 0.08 0.01 -0.02
## longitude 0.08 1.00 0.01 -0.03
## number_of_reviews 0.01 0.01 1.00 -0.02
## review_scores_rating -0.02 -0.03 -0.02 1.00
## bedrooms -0.08 -0.03 0.02 -0.02
## beds -0.06 -0.03 0.09 -0.04
## amenities_count 0.00 -0.01 0.14 0.11
## bedrooms beds amenities_count
## log_price 0.49 0.48 0.22
## accommodates 0.74 0.84 0.26
## bathrooms 0.45 0.39 0.11
## host_response_rate 0.01 0.02 0.06
## latitude -0.08 -0.06 0.00
## longitude -0.03 -0.03 -0.01
## number_of_reviews 0.02 0.09 0.14
## review_scores_rating -0.02 -0.04 0.11
## bedrooms 1.00 0.75 0.17
## beds 0.75 1.00 0.23
## amenities_count 0.17 0.23 1.00
# Summarize ordinal categorical variables
# generate the total number of observations belonging to each level
ordinal.cat.sum <- table(cat_var$cancellation_policy)
ordinal.cat.sum
##
## flexible moderate strict super_strict
## 3735 4050 7801 3
# Summarize nominal/binary nominal categorical variables
# generate the total number of observations belonging to each class
nominal.cat.sum <- apply(subset(cat_var, select = - c(cancellation_policy, first_review, last_review, host_since,
amenities)), 2, table)
nominal.cat.sum
## $property_type
##
## Apartment Bed & Breakfast Boat Boutique hotel
## 13022 46 2 5
## Bungalow Cabin Castle Chalet
## 9 1 1 2
## Condominium Dorm Guest suite Guesthouse
## 194 10 18 12
## Hostel House Loft Other
## 4 1561 251 88
## Serviced apartment Timeshare Townhouse Vacation home
## 1 12 337 3
## Villa
## 10
##
## $room_type
##
## Entire home/apt Private room Shared room
## 7370 7807 412
##
## $bed_type
##
## Other Real Bed
## 450 15139
##
## $cleaning_fee
##
## False True
## 3822 11767
##
## $city
##
## NYC
## 15589
##
## $host_has_profile_pic
##
## f t
## 26 15563
##
## $host_identity_verified
##
## False True
## 5009 10580
##
## $instant_bookable
##
## False True
## 11352 4237
##
## $neighbourhood
##
## Allerton Alphabet City
## 8 57
## Annadale Astoria
## 1 593
## Bath Beach Battery Park City
## 6 11
## Bay Ridge Baychester
## 64 21
## Bayside Bedford-Stuyvesant
## 30 1569
## Bedford Park Belmont
## 12 3
## Bensonhurst Bergen Beach
## 25 1
## Boerum Hill Borough Park
## 87 17
## Brighton Beach Bronxdale
## 18 12
## Brooklyn Brooklyn Heights
## 7 63
## Brooklyn Navy Yard Brownsville
## 31 18
## Bushwick Canarsie
## 1096 2
## Carroll Gardens Castle Hill
## 4 4
## Chinatown City Island
## 77 6
## Civic Center Clinton Hill
## 3 113
## College Point Columbia Street Waterfront
## 1 1
## Coney Island Corona
## 2 11
## Country Club Crotona
## 1 4
## Crown Heights Ditmars / Steinway
## 11 85
## Dongan Hills Downtown Brooklyn
## 1 21
## DUMBO East Flatbush
## 11 29
## East Harlem East Village
## 2 89
## Eastchester Edenwald
## 9 3
## Elm Park Elmhurst
## 4 6
## Eltingville Emerson Hill
## 3 1
## Financial District Flatbush
## 158 292
## Flatiron District Flatlands
## 91 28
## Flushing Fordham
## 143 9
## Forest Hills Fort Greene
## 66 168
## Fort Wadsworth Fresh Meadows
## 1 3
## Glendale Gowanus
## 13 78
## Gramercy Park Graniteville
## 120 1
## Gravesend Great Kills
## 23 3
## Greenpoint Greenwich Village
## 469 177
## Greenwood Heights Grymes Hill
## 68 3
## Hamilton Heights Harlem
## 453 915
## Hell's Kitchen Highbridge
## 871 14
## Hillcrest Howard Beach
## 5 4
## Hudson Square Hunts Point
## 31 3
## Inwood Jackson Heights
## 77 74
## Jamaica Kensington
## 186 83
## Kew Garden Hills Kingsbridge
## 19 12
## Kingsbridge Heights Kips Bay
## 11 171
## Lefferts Garden Lighthouse HIll
## 222 1
## Lindenwood Little Italy
## 3 31
## Long Island City Longwood
## 90 6
## Lower East Side Manhattan
## 461 7
## Manhattan Beach Marble Hill
## 6 6
## Mariners Harbor Maspeth
## 2 24
## Meatpacking District Melrose
## 7 3
## Middle Village Midland Beach
## 12 5
## Midtown Midtown East
## 159 223
## Midwood Mill Basin
## 54 1
## Morningside Heights Morris Heights
## 152 5
## Morris Park Morrisania
## 3 3
## Mott Haven Murray Hill
## 29 103
## New Brighton New Springville
## 3 1
## Noho Nolita
## 32 139
## Norwood Oakwood
## 8 1
## Ozone Park Park Slope
## 16 416
## Park Versailles Parkchester
## 11 11
## Pelham Bay Port Morris
## 9 8
## Port Richmond Prospect Heights
## 2 184
## Queens Randall Manor
## 1 3
## Red Hook Rego Park
## 29 42
## Richmond Hill Ridgewood
## 43 165
## Riverdale Roosevelt Island
## 4 40
## Rosebank Rossville
## 3 1
## Sea Gate Sheepshead Bay
## 5 48
## Soho Soundview
## 148 5
## South Beach South Ozone Park
## 9 13
## South Street Seaport Spuyten Duyvil
## 6 4
## St. George Stapleton
## 27 9
## Sunnyside Sunset Park
## 149 97
## The Bronx The Rockaways
## 1 86
## Throgs Neck Times Square/Theatre District
## 4 73
## Tompkinsville Tottenville
## 9 3
## Tremont Tribeca
## 7 60
## Union Square University Heights
## 9 14
## Upper East Side Upper West Side
## 701 842
## Utopia Van Nest
## 4 2
## Vinegar Hill Wakefield
## 6 10
## Washington Heights West Brighton
## 474 16
## West Farms West Village
## 2 345
## Westerleigh Whitestone
## 2 5
## Williamsbridge Williamsburg
## 10 333
## Windsor Terrace Woodhaven
## 79 22
## Woodlawn Woodside
## 2 61
##
## $borough
##
## Bronx Brooklyn Manhattan Other Queens
## 300 5846 7311 41 1976
## Staten Island
## 115
Note: Some variables were left out of this summary. ‘cancellation_policy’ was excluded because it is ordinal and is summarized separately above. ‘first_review’, ‘last_review’, and ‘host_since’ were excluded because they are dates with a large number of unique values. ‘amenities’ was excluded as well.
Conducted statistical profiling for all numeric and categorical
features:
- Examined means, medians, variances, standard deviations, and IQRs to
understand distribution and skew.
- Created a correlation matrix to identify relationships between
features (e.g., a strong positive correlation of 0.84 between
accommodates and beds, and moderate positive correlations of 0.60 and
0.48 between each of them and log_price).
These metrics provide valuable insight into Airbnb’s market structure, guiding modeling choices and stakeholder recommendations.
***
Let’s delve deeper into how this information can be used by both Airbnb users and management:
1. Pricing Insights:
2. Accommodation Preferences:
3. Correlation Insights:
4. Categorical Variables:
5. Decision Support for Airbnb Management:
In short, these statistics go beyond mere data description: they help Airbnb users make informed booking decisions and give management a basis for optimizing property listings and pricing strategies. These insights can ultimately lead to improved guest experiences and increased revenue for hosts and Airbnb itself.
Faceted Bar Chart
ggplot(ny.df, aes(x= room_type, fill = room_type)) + geom_bar(color = "black", alpha = 0.7) +
labs(title = "Airbnb Rental Room Types in NYC by Cleaning Fee Policy", x = NULL, y = "# of Airbnb Rental Listings") +
theme(axis.title = element_text(size = 12), legend.position = "bottom") + scale_x_discrete(labels = NULL) +
facet_wrap(~cleaning_fee,
labeller = labeller(cleaning_fee = c(
"True" = "With Cleaning Fee", "False" = "Without Cleaning Fee"))) +
scale_fill_discrete(name = "Room Type")
Our first visualization is a faceted bar chart that breaks down Airbnb
rental room types in NYC by cleaning fee policy. Segmenting the data
this way is useful to both hosts and guests. For Airbnb management,
this information can aid in setting competitive pricing strategies and
policies for different room types. Guests can use it to make more
informed decisions about accommodation based on their preferences and
budget. The plot shows that cleaning fees are prevalent (11,767 of the
15,589 cleaned listings charge one), especially for ‘entire
home/apartment’ listings. This helps hosts benchmark their own fee
policy and helps guests anticipate the total cost of a booking.
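The share of listings that charge a cleaning fee within each room type can be quantified with a row-wise proportion table. The sketch below uses a tiny made-up data frame, so its numbers are not the project's results; on the real data the same call would take `ny.df$room_type` and `ny.df$cleaning_fee`.

```r
# Made-up listings (stand-in for the real data frame)
toy <- data.frame(
  room_type    = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                   "Private room", "Private room", "Shared room"),
  cleaning_fee = c("True", "True", "False", "True", "False", "False")
)

# Row-wise proportions: share of listings with/without a fee per room type
fee_share <- prop.table(table(toy$room_type, toy$cleaning_fee), margin = 1)
round(fee_share, 2)
```

Using `margin = 1` makes each row sum to 1, so room types with very different listing counts can be compared directly.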
Histogram
ggplot(ny.df, aes(x= accommodates)) + geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Distribution of Accommodation Capacity in NYC Airbnb Rentals",
x = "Number of Accommodated Guests", y= "# of Airbnb Rental Listings") + theme_minimal() +
theme(axis.text = element_text(size = 11), axis.title = element_text(size = 12))
The second visualization, a histogram of accommodation capacity, shows
how the supply of NYC listings is distributed across group sizes. For
hosts, it indicates which capacities are most common and therefore
most competitive. For guests, it shows how much choice is available
for a given group size. Most listings accommodate around 2 guests (the
median is 2 and the mean is 2.96), but properties for larger groups,
up to 16 guests, are also available.
Heat Map/Correlation Matrix
ggcorrplot(corr, lab = TRUE, lab_size = 2,
  title = "Correlation Heatmap of NYC\nAirbnb Rental Property Data") +
  theme(plot.title = element_text(size = 13), axis.text.x = element_text(size = 9),
    axis.text.y = element_text(size = 9))
Our third visualization, the correlation heatmap, shows how the
numerical variables relate to one another. For decision-makers in the
Airbnb ecosystem, this plot offers predictive potential. The moderate
positive correlations of ‘accommodates’ (0.60), ‘bedrooms’ (0.49), and
‘beds’ (0.48) with ‘log_price’ can be valuable for pricing
optimization. Meanwhile, the negative correlation between ‘longitude’
and ‘log_price’ (-0.35) suggests that location influences rental
prices, with listings farther west tending to be priced higher;
‘latitude’ shows almost no linear relationship with price (0.06).
Hosts can set competitive prices, and guests can better assess
property values based on this knowledge.
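Reading a heatmap by eye can be complemented by ranking each variable's correlation with the target. The sketch below does this on simulated data, so the variable values and resulting numbers are made up; on the real data the same indexing would be applied to the `corr` matrix.

```r
# Simulated data standing in for the numeric columns (values are made up)
set.seed(42)
n <- 200
accommodates <- sample(1:8, n, replace = TRUE)
beds <- pmax(1, round(accommodates / 2 + rnorm(n, sd = 0.5)))
noise <- rnorm(n)
log_price <- 4 + 0.15 * accommodates + rnorm(n, sd = 0.4)
toy <- data.frame(log_price, accommodates, beds, noise)

# Correlation of every variable with log_price, strongest first
corr_toy <- cor(toy)
target_corr <- corr_toy[, "log_price"]
target_corr <- target_corr[names(target_corr) != "log_price"]
ranked <- target_corr[order(abs(target_corr), decreasing = TRUE)]
round(ranked, 2)
```

Sorting by absolute value keeps strong negative relationships near the top of the list alongside strong positive ones.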
Proportional Bar Chart
# Define new variable 'property_group' that groups property types
# (Goal = limit num. of levels)
ny.df <- ny.df %>%
  mutate(property_group = case_when(
    property_type %in% c("Guesthouse", "Guest suite", "In-law") ~ "Guest suite/In-law",
    property_type %in% c("Boutique hotel", "Dorm", "Hostel", "Serviced apartment",
                         "Timeshare") ~ "Accommodation",
    property_type %in% c("Boat", "Bungalow", "Cabin", "Castle", "Chalet", "Earth House",
                         "Tent", "Vacation home", "Villa", "Yurt") ~ "Specialty",
    TRUE ~ as.character(property_type)
  ))
# Create Stacked Bar Chart
ggplot(ny.df, aes(x= accommodates, fill = property_group)) +
geom_bar(position = "fill", color = "black", alpha = 0.7) +
labs(title = "Proportion of Airbnb Listings in NYC by Accommodation\nCapacity & Property Type",
x = "Accommodation Capacity", y = "Proportion of Airbnb Listings") +
scale_fill_discrete(name = "Property Type") + theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 11))
The fourth visualization, a stacked bar chart, guides Airbnb management
and users in understanding the distribution of property types and their
capacity. This chart is an invaluable resource for hosts to fine-tune
their listings based on property type and group size. Apartments emerge
as the dominant choice, particularly for smaller parties. Houses, on the
other hand, become more appealing for larger groups. Airbnb users can
capitalize on this knowledge to make well-informed booking
decisions.
Histogram #2
ggplot(ny.df, aes(x= log_price)) +
  geom_histogram(binwidth = 1, fill = "orange", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Log Prices for NYC Airbnb Rentals",
    x = "Log Price ($)", y= "# of Airbnb Rental Listings") +
  theme_minimal() +
  theme(axis.text = element_text(size = 11), axis.title = element_text(size = 12))
Our fifth visualization is another histogram, this time of the
log-transformed prices. The log scale compresses the long right tail
of raw prices, which makes the overall shape of the distribution
easier to read. Note that with a bin width of 1 and log prices ranging
from 2.30 to 7.60, the chart has only about six bars; a narrower bin
width would be needed to reveal finer pricing clusters.
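Log prices are easier to interpret once converted back to dollars. The sketch below back-transforms the quartiles reported in the `summary()` output, under the assumption that ‘log_price’ is the natural logarithm of the nightly price; the dataset documentation is not quoted here, so treat that as an assumption.

```r
# Quartiles of log_price copied from the summary() output above
log_q <- c(min = 2.303, q1 = 4.174, median = 4.605, q3 = 5.106, max = 7.600)

# Back-transform to dollars, ASSUMING log_price is a natural log of price
price_q <- exp(log_q)
round(price_q)
```

Under that assumption, the median log price of 4.605 corresponds to a nightly price of roughly $100.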
Scatterplot
ggplot(ny.df, aes(x= bedrooms, y= log_price, color = room_type, size = accommodates)) +
geom_point(na.rm= TRUE, alpha = 0.7) + xlim(1, 8) +
labs(title= "Number of Bedrooms vs. Log Price for NYC Airbnb Rentals, by Room Type & Accommodation Capacity",
x= "# of Bedrooms", y= "Log Price ($)") + scale_color_discrete(name = "Room Type") +
scale_size_continuous(name = "Accommodation Capacity") + theme_minimal() +
theme(axis.text = element_text(size = 11), axis.title = element_text(size = 12),
legend.text = element_text(size = 10))
Lastly, our sixth plot, a scatterplot, delves into the intricate
relationship between several variables: number of bedrooms,
log-transformed prices, room types, and accommodation capacity. This
visualization empowers both hosts and guests to decipher how these
factors interact and influence rental prices. Notably, it sheds light on
the price variations tied to room type and capacity, offering actionable
insights for optimizing pricing strategies and booking decisions.
Together, these insightful visualizations equip Airbnb stakeholders with a wealth of information, enabling them to make data-driven decisions that enhance the Airbnb experience in New York City. Whether you’re a host seeking to maximize revenue or a guest in pursuit of the perfect stay, these visualizations are your compass in navigating the NYC Airbnb landscape.
The goal of this phase is to develop a Multiple Linear Regression (MLR) model to predict the log-transformed prices of Airbnb listings in New York City. This process involves careful variable selection, model refinement, and performance evaluation to uncover the key factors that influence pricing decisions on the platform.
ny.df_reg <- subset(ny.df, select = -c(city, description, first_review, host_has_profile_pic,
host_since, id, last_review, name, neighbourhood, property_type,
thumbnail_url, beds, amenities))
Note: The reasons for removing these variables from the dataset are described below:
* ‘city’: every observation in this subset has the value “NYC”, so the variable carries no information.
* ‘zipcode’ & ‘neighbourhood’: these variables are redundant in describing the location of a listing, which is already captured by ‘latitude’, ‘longitude’, & ‘borough’ (‘zipcode’ was dropped earlier, during preprocessing).
* ‘id’, ‘name’, & ‘description’: the values for each of these variables are unique to each observation.
* ‘property_type’: this variable is now redundant, as we created a new variable for property types that minimizes the number of categories (i.e., grouped less common types into larger categories).
* ‘host_has_profile_pic’: nearly every host has a profile picture (in the cleaned data, 15,563 = True & 26 = False), so the variable has almost no variation.
* ‘first_review’, ‘last_review’ & ‘host_since’: the relevance of these variables could be limited to assessing the recent performance and maintenance of a listing or the performance of the host, but might not directly affect pricing decisions.
* ‘thumbnail_url’: unlikely to influence the price of an Airbnb listing, as it is typically a web link to an image, and its inclusion in the model would not offer any meaningful insights into pricing.
* ‘beds’: this variable is redundant given the variables ‘bedrooms’ & ‘accommodates’.
* ‘amenities’: the raw amenities text is better quantified in our model through the ‘amenities_count’ variable.
# Partition the data into training (60%) & validation (40%) sets
set.seed(1)
# Sample 60% of the data, which we will assign to the training data set
train.index <- sample(c(1:nrow(ny.df_reg)), nrow(ny.df_reg)*0.6)
# Assign 60% of the data that we just sampled to training set
train.df <- ny.df_reg[train.index, ]
# Assign remaining 40% of the data to validation set
valid.df <- ny.df_reg[-train.index, ]
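A quick sanity check on a partition is that the two sets are disjoint and together cover every row. The sketch below runs that check on a made-up ten-row data frame; the names `toy`, `train.toy`, and `valid.toy` are illustrative stand-ins for the real objects.

```r
# Toy data frame standing in for ny.df_reg (made-up values)
toy <- data.frame(id = 1:10, y = rnorm(10))

set.seed(1)
# floor() makes the integer sample size explicit
train.index <- sample(seq_len(nrow(toy)), floor(nrow(toy) * 0.6))
train.toy <- toy[train.index, ]
valid.toy <- toy[-train.index, ]

# The two sets should be disjoint and together cover every row
nrow(train.toy)
nrow(valid.toy)
length(intersect(train.toy$id, valid.toy$id))
```

Negative indexing with `-train.index` is what guarantees that no row appears in both sets.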
Identify numeric & categorical predictor variables:
# Subset numeric columns from training set while excluding 'log_price'
num_predictors <- train.df[, !(names(train.df) %in% 'log_price') & sapply(train.df, is.numeric)]
# Subset remaining (categorical) columns from training set while excluding 'log_price'
cat_predictors <- train.df[, !(names(train.df) %in% c(names(num_predictors), 'log_price'))]
Check for multicollinearity issues:
# Calculate the correlation matrix with numeric variables
reg_cor_matrix <- cor(train.df[sapply(train.df, is.numeric)])
# Visualize correlation matrix
ggcorrplot(reg_cor_matrix, lab = TRUE, lab_size = 2.5,
  title = "Correlation Heatmap of NYC Airbnb\nRental Property Data") +
  theme(plot.title = element_text(size = 13), axis.text.x = element_text(size = 9),
    axis.text.y = element_text(size = 9))
Multicollinearity occurs when input variables are highly correlated, making it difficult to distinguish the unique contribution of each variable to the model and decreasing the reliability of the coefficient estimates. Two of the input variables, ‘bedrooms’ and ‘accommodates’, are strongly correlated with one another (0.74 in the correlation matrix computed during EDA).
The criterion for selecting variables to drop in the revised model was based on both their correlation with the output variable ‘log_price’ and their correlation with each other. Given that ‘accommodates’ has a stronger correlation with ‘log_price’ than ‘bedrooms’ does, it is reasonable to keep ‘accommodates’ and drop ‘bedrooms’ to avoid multicollinearity issues. This choice retains the more influential variable while removing redundancy in the input features, which should make the coefficient estimates more stable and easier to interpret.
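A common complement to a pairwise correlation matrix is the variance inflation factor (VIF), which also captures collinearity involving more than two predictors. VIFs are not computed in the analysis above; the following is a minimal base-R sketch on simulated predictors, with all variable names and values made up for illustration.

```r
# Simulated predictors; x2 is built to be highly correlated with x1 (made-up data)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

A VIF near 1 indicates little collinearity; frequently cited rules of thumb flag values above roughly 5 or 10, although the appropriate cutoff is a judgment call.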
To build a clean and interpretable model, several features are
removed based on redundancy, irrelevance, or data quality
concerns:
- Dropped variables included unique identifiers (‘id’, ‘name’,
‘description’), rarely informative or highly sparse features
(‘host_has_profile_pic’, ‘thumbnail_url’, ‘first_review’,
‘last_review’), and highly correlated or redundant attributes (‘beds’,
‘bedrooms’, ‘amenities’, ‘neighbourhood’, etc.).
- A new feature, ‘property_group’, was engineered to simplify property
types into broader categories (so ‘property_type’ is removed).
- Multicollinearity was evaluated using a correlation matrix. Highly
correlated predictors such as ‘bedrooms’ and ‘accommodates’ were
analyzed, with ‘accommodates’ retained due to its stronger association
with price.
***
Initial MLR Model
# Run MLR of 'log_price' on all the predictors in the training set
# Note: lm() automatically converts each categorical predictor with m levels into m-1 dummy variables
mlr.model <- lm(log_price ~ ., data = train.df)
options(digits = 3, scipen = 999)
mlr_summary <- summary(mlr.model)
mlr_summary
##
## Call:
## lm(formula = log_price ~ ., data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.776 -0.220 -0.005 0.205 2.498
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -152.995926 11.985552 -12.77
## room_typePrivate room -0.555130 0.009650 -57.52
## room_typeShared room -0.857738 0.025088 -34.19
## accommodates 0.078572 0.003438 22.86
## bathrooms 0.118473 0.010769 11.00
## bed_typeOther -0.030669 0.022912 -1.34
## cancellation_policymoderate 0.009103 0.010964 0.83
## cancellation_policystrict 0.028098 0.010113 2.78
## cancellation_policysuper_strict 0.008489 0.363076 0.02
## cleaning_feeTrue 0.019552 0.009340 2.09
## host_identity_verifiedFalse -0.007571 0.008336 -0.91
## host_response_rate -0.046347 0.032655 -1.42
## instant_bookableFalse 0.024957 0.008644 2.89
## latitude -1.289501 0.110798 -11.64
## longitude -2.837563 0.133066 -21.32
## number_of_reviews -0.000332 0.000114 -2.91
## review_scores_rating 0.313404 0.046890 6.68
## bedrooms 0.103285 0.009073 11.38
## boroughBrooklyn -0.201629 0.033654 -5.99
## boroughManhattan 0.240371 0.030751 7.82
## boroughOther 0.220213 0.076684 2.87
## boroughQueens -0.014867 0.032333 -0.46
## boroughStaten Island -0.945791 0.061140 -15.47
## amenities_count 0.005530 0.000546 10.12
## property_groupApartment -0.266317 0.079356 -3.36
## property_groupBed & Breakfast 0.043433 0.105730 0.41
## property_groupCondominium -0.029238 0.085598 -0.34
## property_groupGuest suite/In-law -0.313871 0.122832 -2.56
## property_groupHouse -0.253748 0.080279 -3.16
## property_groupLoft -0.017377 0.084315 -0.21
## property_groupOther -0.230229 0.093309 -2.47
## property_groupSpecialty -0.119248 0.125325 -0.95
## property_groupTownhouse -0.233237 0.083190 -2.80
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## bed_typeOther 0.18075
## cancellation_policymoderate 0.40639
## cancellation_policystrict 0.00547 **
## cancellation_policysuper_strict 0.98135
## cleaning_feeTrue 0.03634 *
## host_identity_verifiedFalse 0.36376
## host_response_rate 0.15585
## instant_bookableFalse 0.00390 **
## latitude < 0.0000000000000002 ***
## longitude < 0.0000000000000002 ***
## number_of_reviews 0.00361 **
## review_scores_rating 0.000000000024625 ***
## bedrooms < 0.0000000000000002 ***
## boroughBrooklyn 0.000000002159308 ***
## boroughManhattan 0.000000000000006 ***
## boroughOther 0.00409 **
## boroughQueens 0.64566
## boroughStaten Island < 0.0000000000000002 ***
## amenities_count < 0.0000000000000002 ***
## property_groupApartment 0.00079 ***
## property_groupBed & Breakfast 0.68124
## property_groupCondominium 0.73268
## property_groupGuest suite/In-law 0.01063 *
## property_groupHouse 0.00158 **
## property_groupLoft 0.83672
## property_groupOther 0.01363 *
## property_groupSpecialty 0.34137
## property_groupTownhouse 0.00506 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.362 on 9320 degrees of freedom
## Multiple R-squared: 0.697, Adjusted R-squared: 0.696
## F-statistic: 670 on 32 and 9320 DF, p-value: <0.0000000000000002
# Summary of residuals for initial MLR model (training set)
summary(mlr.model$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.78 -0.22 -0.01 0.00 0.20 2.50
# Assess accuracy of initial model against training set
accuracy(mlr.model$fitted.values, train.df$log_price)
## ME RMSE MAE MPE MAPE
## Test set -0.0000000000000000179 0.362 0.271 -0.612 5.91
The dataset was split into 60% training and 40% validation subsets to ensure unbiased model evaluation. The initial MLR model was built on the full set of cleaned predictors. The model achieved the following performance on the training set:
| Metric | Value |
|---|---|
| RMSE | 0.362 |
| MAE | 0.271 |
| MAPE | 5.91% |
| Adj. R² | 0.696 |
This indicates that ~69.6% of the variability in log-transformed Airbnb rental prices is explained by the model.
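Because the response is ‘log_price’, each coefficient acts additively on the log scale and therefore multiplicatively on price. The sketch below converts a few estimates copied from the summary above into implied percentage changes. It assumes the response is the natural log of price, and `pct_change` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: convert a coefficient from a log-price regression
# into the implied percentage change in price (assumes natural log).
pct_change <- function(beta) {
  (exp(beta) - 1) * 100
}

# Estimates copied from the initial MLR summary above
coefs <- c(
  "room_typePrivate room" = -0.555130,
  "accommodates"          =  0.078572,
  "boroughManhattan"      =  0.240371
)

# Implied percentage change in price per one-unit increase
# (or relative to the reference level for dummy variables)
round(pct_change(coefs), 1)
```

Read this way, a dummy coefficient describes the price gap relative to that factor's reference level, holding the other predictors fixed.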
Refined MLR Model
To improve the model's predictive accuracy, we refine it by eliminating predictors whose p-values exceed 0.05, which suggests they have no significant linear relationship with ‘log_price’ after controlling for the other variables. On that basis we drop ‘bed_type’, ‘host_identity_verified’, and ‘host_response_rate’.
We also drop ‘bedrooms’, which is strongly correlated with ‘accommodates’, to evaluate the impact of multicollinearity on predictive accuracy. Categorical variables with at least one significant level (e.g., ‘borough’ and ‘cancellation_policy’) are kept in full, since dropping a single level would merge it into the reference category.
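Multicollinearity between predictors such as ‘bedrooms’ and ‘accommodates’ can also be quantified with variance inflation factors. The following is a minimal base-R sketch; `vif_by_hand` is a hypothetical helper, and the built-in `mtcars` data stands in for `train.df`, which is not reproduced here.

```r
# Variance inflation factor for predictor j: VIF_j = 1 / (1 - R^2_j),
# where R^2_j comes from regressing predictor j on the other predictors.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Illustration on mtcars (stand-in data): 'disp' and 'cyl' are strongly
# correlated, much like 'bedrooms' and 'accommodates' in the listings data
vifs <- vif_by_hand(mtcars, c("disp", "cyl", "hp", "wt"))
round(vifs, 2)
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others; larger values mean its coefficient estimate is less stable.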
# Drop insignificant predictors, plus 'bedrooms' (strongly correlated with 'accommodates')
train.df2 <- subset(train.df, select = -c(bed_type, host_response_rate, host_identity_verified, bedrooms))
# Refined MLR model
mlr.model.2 <- lm(log_price ~ ., data = train.df2)
summary(mlr.model.2)
##
## Call:
## lm(formula = log_price ~ ., data = train.df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.770 -0.225 -0.005 0.207 2.498
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -153.083571 12.064594 -12.69
## room_typePrivate room -0.556466 0.009703 -57.35
## room_typeShared room -0.864763 0.024622 -35.12
## accommodates 0.102788 0.002723 37.75
## bathrooms 0.148792 0.010503 14.17
## cancellation_policymoderate 0.006691 0.011009 0.61
## cancellation_policystrict 0.029588 0.010156 2.91
## cancellation_policysuper_strict 0.008367 0.365579 0.02
## cleaning_feeTrue 0.018147 0.009380 1.93
## instant_bookableFalse 0.030247 0.008607 3.51
## latitude -1.303542 0.111538 -11.69
## longitude -2.845867 0.133944 -21.25
## number_of_reviews -0.000431 0.000114 -3.79
## review_scores_rating 0.315799 0.047161 6.70
## boroughBrooklyn -0.194922 0.033863 -5.76
## boroughManhattan 0.244796 0.030945 7.91
## boroughOther 0.227573 0.077210 2.95
## boroughQueens -0.011433 0.032533 -0.35
## boroughStaten Island -0.941649 0.061515 -15.31
## amenities_count 0.005391 0.000549 9.83
## property_groupApartment -0.248340 0.079853 -3.11
## property_groupBed & Breakfast 0.080626 0.106355 0.76
## property_groupCondominium -0.021143 0.086132 -0.25
## property_groupGuest suite/In-law -0.302503 0.123567 -2.45
## property_groupHouse -0.227057 0.080748 -2.81
## property_groupLoft 0.002708 0.084837 0.03
## property_groupOther -0.211535 0.093899 -2.25
## property_groupSpecialty -0.105026 0.126079 -0.83
## property_groupTownhouse -0.209848 0.083701 -2.51
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## cancellation_policymoderate 0.54333
## cancellation_policystrict 0.00358 **
## cancellation_policysuper_strict 0.98174
## cleaning_feeTrue 0.05308 .
## instant_bookableFalse 0.00044 ***
## latitude < 0.0000000000000002 ***
## longitude < 0.0000000000000002 ***
## number_of_reviews 0.00015 ***
## review_scores_rating 0.0000000000226173 ***
## boroughBrooklyn 0.0000000088734163 ***
## boroughManhattan 0.0000000000000029 ***
## boroughOther 0.00321 **
## boroughQueens 0.72528
## boroughStaten Island < 0.0000000000000002 ***
## amenities_count < 0.0000000000000002 ***
## property_groupApartment 0.00188 **
## property_groupBed & Breakfast 0.44842
## property_groupCondominium 0.80610
## property_groupGuest suite/In-law 0.01438 *
## property_groupHouse 0.00493 **
## property_groupLoft 0.97454
## property_groupOther 0.02430 *
## property_groupSpecialty 0.40486
## property_groupTownhouse 0.01219 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.365 on 9324 degrees of freedom
## Multiple R-squared: 0.693, Adjusted R-squared: 0.692
## F-statistic: 751 on 28 and 9324 DF, p-value: <0.0000000000000002
# Assess accuracy of refined model against training set
accuracy(mlr.model.2$fitted.values, train.df$log_price)
## ME RMSE MAE MPE MAPE
## Test set 0.00000000000000000228 0.364 0.273 -0.618 5.96
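One way to check whether the dropped predictors mattered jointly is a partial F-test between the nested models. The sketch below uses the built-in `mtcars` data as a stand-in. In this analysis the analogous call would presumably be `anova(mlr.model.2, mlr.model)`, assuming both models were fit to the same rows.

```r
# Built-in mtcars data stands in for train.df (assumption for illustration)
full_fit    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: H0 = the dropped predictors (qsec, drat) have zero coefficients
f_test <- anova(reduced_fit, full_fit)
f_test

# A small p-value suggests the dropped predictors carried real signal
p_value <- f_test[["Pr(>F)"]][2]
p_value
```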
Stepwise Regression
# Apply stepwise regression to the initial MLR model
# step() adds or drops one predictor at a time, keeping a change only if it lowers AIC,
# in an effort to determine the best subset of predictor variables
mlr.model.step <- step(mlr.model, direction = "both")
## Start: AIC=-18965
## log_price ~ room_type + accommodates + bathrooms + bed_type +
## cancellation_policy + cleaning_fee + host_identity_verified +
## host_response_rate + instant_bookable + latitude + longitude +
## number_of_reviews + review_scores_rating + bedrooms + borough +
## amenities_count + property_group
##
## Df Sum of Sq RSS AIC
## - host_identity_verified 1 0 1223 -18966
## - bed_type 1 0 1223 -18965
## <none> 1223 -18965
## - host_response_rate 1 0 1223 -18965
## - cleaning_fee 1 1 1223 -18963
## - cancellation_policy 3 1 1224 -18962
## - instant_bookable 1 1 1224 -18959
## - number_of_reviews 1 1 1224 -18959
## - review_scores_rating 1 6 1228 -18922
## - amenities_count 1 13 1236 -18865
## - bathrooms 1 16 1238 -18846
## - bedrooms 1 17 1240 -18838
## - latitude 1 18 1240 -18832
## - property_group 9 21 1243 -18827
## - longitude 1 60 1282 -18522
## - accommodates 1 69 1291 -18457
## - borough 5 200 1422 -17560
## - room_type 2 481 1704 -15864
##
## Step: AIC=-18966
## log_price ~ room_type + accommodates + bathrooms + bed_type +
## cancellation_policy + cleaning_fee + host_response_rate +
## instant_bookable + latitude + longitude + number_of_reviews +
## review_scores_rating + bedrooms + borough + amenities_count +
## property_group
##
## Df Sum of Sq RSS AIC
## - bed_type 1 0 1223 -18967
## - host_response_rate 1 0 1223 -18966
## <none> 1223 -18966
## + host_identity_verified 1 0 1223 -18965
## - cleaning_fee 1 1 1223 -18964
## - cancellation_policy 3 1 1224 -18963
## - number_of_reviews 1 1 1224 -18960
## - instant_bookable 1 1 1224 -18959
## - review_scores_rating 1 6 1229 -18923
## - amenities_count 1 14 1236 -18865
## - bathrooms 1 16 1239 -18848
## - bedrooms 1 17 1240 -18840
## - latitude 1 18 1240 -18834
## - property_group 9 20 1243 -18829
## - longitude 1 60 1282 -18522
## - accommodates 1 68 1291 -18459
## - borough 5 200 1422 -17562
## - room_type 2 482 1704 -15863
##
## Step: AIC=-18967
## log_price ~ room_type + accommodates + bathrooms + cancellation_policy +
## cleaning_fee + host_response_rate + instant_bookable + latitude +
## longitude + number_of_reviews + review_scores_rating + bedrooms +
## borough + amenities_count + property_group
##
## Df Sum of Sq RSS AIC
## - host_response_rate 1 0 1223 -18967
## <none> 1223 -18967
## + bed_type 1 0 1223 -18966
## + host_identity_verified 1 0 1223 -18965
## - cleaning_fee 1 1 1223 -18964
## - cancellation_policy 3 1 1224 -18963
## - number_of_reviews 1 1 1224 -18960
## - instant_bookable 1 1 1224 -18960
## - review_scores_rating 1 6 1229 -18924
## - amenities_count 1 14 1237 -18865
## - bathrooms 1 16 1239 -18848
## - bedrooms 1 17 1240 -18840
## - latitude 1 18 1241 -18834
## - property_group 9 20 1243 -18829
## - longitude 1 60 1283 -18522
## - accommodates 1 69 1292 -18458
## - borough 5 200 1423 -17562
## - room_type 2 490 1713 -15817
##
## Step: AIC=-18967
## log_price ~ room_type + accommodates + bathrooms + cancellation_policy +
## cleaning_fee + instant_bookable + latitude + longitude +
## number_of_reviews + review_scores_rating + bedrooms + borough +
## amenities_count + property_group
##
## Df Sum of Sq RSS AIC
## <none> 1223 -18967
## + host_response_rate 1 0 1223 -18967
## + bed_type 1 0 1223 -18966
## + host_identity_verified 1 0 1223 -18965
## - cleaning_fee 1 1 1224 -18964
## - cancellation_policy 3 1 1224 -18963
## - number_of_reviews 1 1 1224 -18960
## - instant_bookable 1 1 1224 -18959
## - review_scores_rating 1 6 1229 -18924
## - amenities_count 1 13 1237 -18866
## - bathrooms 1 16 1239 -18848
## - bedrooms 1 17 1240 -18840
## - latitude 1 18 1241 -18833
## - property_group 9 20 1244 -18829
## - longitude 1 60 1283 -18522
## - accommodates 1 69 1292 -18457
## - borough 5 200 1423 -17561
## - room_type 2 490 1713 -15819
summary(mlr.model.step)
##
## Call:
## lm(formula = log_price ~ room_type + accommodates + bathrooms +
## cancellation_policy + cleaning_fee + instant_bookable + latitude +
## longitude + number_of_reviews + review_scores_rating + bedrooms +
## borough + amenities_count + property_group, data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.782 -0.220 -0.005 0.205 2.494
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -153.171089 11.982785 -12.78
## room_typePrivate room -0.555680 0.009637 -57.66
## room_typeShared room -0.864362 0.024455 -35.34
## accommodates 0.078718 0.003437 22.90
## bathrooms 0.118450 0.010769 11.00
## cancellation_policymoderate 0.009132 0.010937 0.83
## cancellation_policystrict 0.028828 0.010087 2.86
## cancellation_policysuper_strict 0.010756 0.363100 0.03
## cleaning_feeTrue 0.020145 0.009318 2.16
## instant_bookableFalse 0.026408 0.008555 3.09
## latitude -1.291892 0.110786 -11.66
## longitude -2.840681 0.133036 -21.35
## number_of_reviews -0.000332 0.000113 -2.93
## review_scores_rating 0.311634 0.046842 6.65
## bedrooms 0.102938 0.009072 11.35
## boroughBrooklyn -0.203238 0.033641 -6.04
## boroughManhattan 0.238550 0.030741 7.76
## boroughOther 0.221105 0.076689 2.88
## boroughQueens -0.017016 0.032317 -0.53
## boroughStaten Island -0.949962 0.061103 -15.55
## amenities_count 0.005524 0.000545 10.14
## property_groupApartment -0.270229 0.079335 -3.41
## property_groupBed & Breakfast 0.038132 0.105701 0.36
## property_groupCondominium -0.033596 0.085555 -0.39
## property_groupGuest suite/In-law -0.319502 0.122739 -2.60
## property_groupHouse -0.258196 0.080247 -3.22
## property_groupLoft -0.021575 0.084289 -0.26
## property_groupOther -0.234350 0.093284 -2.51
## property_groupSpecialty -0.125372 0.125237 -1.00
## property_groupTownhouse -0.237667 0.083169 -2.86
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## cancellation_policymoderate 0.40376
## cancellation_policystrict 0.00427 **
## cancellation_policysuper_strict 0.97637
## cleaning_feeTrue 0.03066 *
## instant_bookableFalse 0.00203 **
## latitude < 0.0000000000000002 ***
## longitude < 0.0000000000000002 ***
## number_of_reviews 0.00344 **
## review_scores_rating 0.0000000000303635 ***
## bedrooms < 0.0000000000000002 ***
## boroughBrooklyn 0.0000000015868401 ***
## boroughManhattan 0.0000000000000094 ***
## boroughOther 0.00395 **
## boroughQueens 0.59853
## boroughStaten Island < 0.0000000000000002 ***
## amenities_count < 0.0000000000000002 ***
## property_groupApartment 0.00066 ***
## property_groupBed & Breakfast 0.71829
## property_groupCondominium 0.69456
## property_groupGuest suite/In-law 0.00925 **
## property_groupHouse 0.00130 **
## property_groupLoft 0.79798
## property_groupOther 0.01201 *
## property_groupSpecialty 0.31682
## property_groupTownhouse 0.00428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.362 on 9323 degrees of freedom
## Multiple R-squared: 0.697, Adjusted R-squared: 0.696
## F-statistic: 739 on 29 and 9323 DF, p-value: <0.0000000000000002
After running stepwise regression on the initial model, three variables were dropped: ‘host_identity_verified’, ‘bed_type’, and ‘host_response_rate’. The training fit is unchanged, with an adjusted R² of 0.696.
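The AIC values in the stepwise log come from `extractAIC()`, which for a linear model computes n·log(RSS/n) + 2·edf. The sketch below reproduces that value by hand and then runs a both-direction search, using the built-in `mtcars` data as a stand-in for `train.df`.

```r
# Built-in mtcars data stands in for train.df (assumption for illustration)
full_fit <- lm(mpg ~ wt + hp + qsec + drat + gear, data = mtcars)

# AIC as used by step(): n * log(RSS / n) + 2 * edf
n   <- nrow(mtcars)
rss <- sum(residuals(full_fit)^2)
edf <- length(coef(full_fit))          # number of estimated coefficients
manual_aic <- n * log(rss / n) + 2 * edf

# Should agree with the second element returned by extractAIC()
all.equal(manual_aic, extractAIC(full_fit)[2])

# Both-direction search; trace = 0 suppresses the step-by-step log
step_fit <- step(full_fit, direction = "both", trace = 0)
formula(step_fit)
```

Because selection is driven by AIC rather than by p-values, a predictor with a p-value above 0.05 can still be retained if removing it would raise AIC.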
# Assess accuracy of initial model after applying stepwise regression
accuracy(mlr.model.step$fitted.values, train.df$log_price)
## ME RMSE MAE MPE MAPE
## Test set 0.00000000000000000111 0.362 0.271 -0.613 5.92
To check that streamlining the model does not compromise performance, all three models are evaluated on the validation set:
# Fit each MLR model to the validation data & measure its accuracy
library(forecast) # provides accuracy(); predict() comes from base R's 'stats' package
# Initial model
mlr.pred <- predict(mlr.model, newdata= valid.df)
accuracy(mlr.pred, valid.df$log_price)
## ME RMSE MAE MPE MAPE
## Test set -0.000479 0.362 0.274 -0.629 5.98
# Refined model
mlr.2.pred <- predict(mlr.model.2, newdata= valid.df)
accuracy(mlr.2.pred, valid.df$log_price)
## ME RMSE MAE MPE MAPE
## Test set -0.000222 0.365 0.276 -0.63 6.04
# Initial model + stepwise regression
mlr.step.pred <- predict(mlr.model.step, newdata= valid.df)
accuracy(mlr.step.pred, valid.df$log_price)
## ME RMSE MAE MPE MAPE
## Test set -0.000355 0.362 0.274 -0.627 5.99
all.residuals <- valid.df$log_price - mlr.step.pred
hist(all.residuals, breaks = 25, xlab = "Residuals", main = " ")
The histogram of residuals for the stepwise model on the validation data
shows that most errors fall between -1 and 1 on the log-price scale,
centered near zero. This is consistent with the validation RMSE of 0.362
and suggests a well-behaved residual distribution.
Three model versions were tested on the validation set:
| Model | RMSE | MAE | MAPE |
|---|---|---|---|
| Initial MLR | 0.362 | 0.274 | 5.98% |
| Refined MLR (fewer vars) | 0.365 | 0.276 | 6.04% |
| Stepwise Regression Model | 0.362 | 0.274 | 5.99% |
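The metrics in the table can be reproduced directly from predictions and actual values. The following is a minimal sketch with made-up numbers; `error_metrics` is a hypothetical helper that mirrors three of the columns reported by `forecast::accuracy()`.

```r
# Hypothetical helper reproducing three of the metrics reported by
# forecast::accuracy(); 'pred' and 'actual' are numeric vectors.
error_metrics <- function(pred, actual) {
  err <- actual - pred
  c(
    RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)),
    MAPE = mean(abs(err / actual)) * 100
  )
}

# Toy values on a log-price scale (made up for illustration)
actual <- c(4.0, 5.0, 4.5, 6.0)
pred   <- c(4.2, 4.9, 4.5, 5.5)
round(error_metrics(pred, actual), 3)
```

Because these errors are measured on the log-price scale, an RMSE of 0.362 corresponds to a typical multiplicative error of roughly exp(0.362), or about 1.44 times the price, assuming natural logs.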
This model offers actionable insights for hosts, guests, and Airbnb itself:
- Accommodation capacity (‘accommodates’) and location (‘borough’) are strong predictors of price.
- Property type and cancellation policy are also associated with price, while host response rate was not a significant predictor.
- The model’s generalizability (a consistent RMSE of 0.362 on both the training and validation sets) supports its application to future price-setting or valuation tools.
The multiple linear regression model built in this phase provides a reliable, interpretable, and data-driven approach to understanding Airbnb rental pricing in NYC. By identifying the features that matter most, such as guest capacity, location, and the number of available amenities, this model supports better-informed decisions by hosts, guests, and Airbnb itself.
The model explains a substantial portion of the price variability, but further research could explore additional factors to enhance predictive accuracy (e.g., seasonality, user review sentiment, or calendar availability).
This section explores the application of three supervised classification algorithms to answer three key business questions related to NYC Airbnb rentals:
1. Will a rental include a cleaning fee? (k-Nearest Neighbors)
2. Can we classify Airbnb rentals into price tiers? (Naive Bayes)
3. Can we predict a host’s cancellation policy? (Classification Tree)
# Convert the predictive outcome of 'cleaning_fee' into a factor
ny_k.df <- ny.df
ny_k.df$cleaning_fee <- as.factor(ny.df$cleaning_fee)
str(ny_k.df$cleaning_fee)
## Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 1 2 2 ...
# Remove columns not used as predictors (i.e., variables not relevant to cleaning fees)
ny_k.df <- subset(ny_k.df, select = -c(
id, amenities, bed_type, city, description, first_review, host_has_profile_pic,
host_identity_verified, host_response_rate, host_since, last_review, name,
neighbourhood, thumbnail_url, property_group))
summary(ny_k.df)
## log_price property_type room_type accommodates
## Min. :2.30 Length:15589 Length:15589 Min. : 1.00
## 1st Qu.:4.17 Class :character Class :character 1st Qu.: 2.00
## Median :4.61 Mode :character Mode :character Median : 2.00
## Mean :4.68 Mean : 2.96
## 3rd Qu.:5.11 3rd Qu.: 4.00
## Max. :7.60 Max. :16.00
## bathrooms cancellation_policy cleaning_fee instant_bookable
## Min. :0.50 flexible :3735 False: 3822 True : 4237
## 1st Qu.:1.00 moderate :4050 True :11767 False:11352
## Median :1.00 strict :7801
## Mean :1.14 super_strict: 3
## 3rd Qu.:1.00
## Max. :5.50
## latitude longitude number_of_reviews review_scores_rating
## Min. :40.5 Min. :-74.2 Min. : 1 Min. :0.200
## 1st Qu.:40.7 1st Qu.:-74.0 1st Qu.: 3 1st Qu.:0.910
## Median :40.7 Median :-74.0 Median : 9 Median :0.960
## Mean :40.7 Mean :-74.0 Mean : 23 Mean :0.935
## 3rd Qu.:40.8 3rd Qu.:-73.9 3rd Qu.: 29 3rd Qu.:1.000
## Max. :40.9 Max. :-73.7 Max. :465 Max. :1.000
## bedrooms beds borough amenities_count
## Min. : 1.00 Min. : 1.00 Length:15589 Min. : 1.0
## 1st Qu.: 1.00 1st Qu.: 1.00 Class :character 1st Qu.:13.0
## Median : 1.00 Median : 1.00 Mode :character Median :16.0
## Mean : 1.28 Mean : 1.64 Mean :17.4
## 3rd Qu.: 1.00 3rd Qu.: 2.00 3rd Qu.:21.0
## Max. :10.00 Max. :18.00 Max. :77.0
# Set seed with value 60 & partition the dataset into training (60%) & validation (40%) sets
set.seed(60) # Set the seed here
ny_k.df_train.index <- sample(c(1:nrow(ny_k.df)), nrow(ny_k.df) * 0.6)
ny_k_train.df <- ny_k.df[ny_k.df_train.index, ]
ny_k_valid.df <- ny_k.df[-ny_k.df_train.index, ]
# Separate the rentals with/without a cleaning fee in training set
train.df_t <- subset(ny_k_train.df, cleaning_fee == "True")
train.df_f <- subset(ny_k_train.df, cleaning_fee == "False")
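# Examine the difference in group means (with fee minus without fee), multiplied by 100, for each numeric predictor
# Note: this is an absolute difference scaled by 100, not a relative percentage difference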
(mean(train.df_t$log_price) - mean(train.df_f$log_price)) * 100
## [1] 28.6
(mean(train.df_t$accommodates) - mean(train.df_f$accommodates)) * 100
## [1] 83.2
(mean(train.df_t$bathrooms) - mean(train.df_f$bathrooms)) * 100
## [1] 3.08
(mean(train.df_t$bedrooms) - mean(train.df_f$bedrooms)) * 100
## [1] 16.3
(mean(train.df_t$beds) - mean(train.df_f$beds)) * 100
## [1] 37.2
# Remove categorical variables, and any numeric variable whose scaled mean
# difference between the two groups (as computed above) is below 10
ny_k_train.df <- subset(ny_k_train.df, select = -c(
bathrooms, property_type, room_type, cancellation_policy, borough, instant_bookable))
ny_k_valid.df <- subset(ny_k_valid.df, select = -c(
bathrooms, property_type, room_type, cancellation_policy, borough, instant_bookable))
ny_k.df <- subset(ny_k.df, select = -c(
bathrooms, property_type, room_type, cancellation_policy, borough, instant_bookable))
str(ny_k_train.df)
## 'data.frame': 9353 obs. of 10 variables:
## $ log_price : num 5.16 5.25 4.6 3.4 4.78 ...
## $ accommodates : int 4 4 2 2 2 2 2 4 2 4 ...
## $ cleaning_fee : Factor w/ 2 levels "False","True": 2 2 2 2 1 2 2 2 2 1 ...
## $ latitude : num 40.7 40.8 40.7 40.7 40.7 ...
## $ longitude : num -74 -73.9 -73.9 -74 -74 ...
## $ number_of_reviews : int 3 10 126 1 1 56 18 31 105 11 ...
## $ review_scores_rating: num 0.8 1 0.98 1 1 0.98 0.93 0.89 0.9 0.98 ...
## $ bedrooms : num 1 1 1 1 1 1 1 1 1 1 ...
## $ beds : num 1 2 1 1 1 1 1 2 1 1 ...
## $ amenities_count : int 18 17 20 7 9 12 23 16 19 10 ...
# Normalize the data using the training set & 'preProcess()' function.
library(caret) # Load the caret library
train.norm.df <- ny_k_train.df
valid.norm.df <- ny_k_valid.df
ny_k.norm.df <- ny_k.df
# Specify the columns to normalize
columns_to_normalize <- c("log_price", "accommodates", "bedrooms", "beds")
# Create a preProcess object
norm_values <- preProcess(ny_k_train.df[, columns_to_normalize], method = c("center", "scale"))
# Apply normalization to the training and validation data
train.norm.df[, columns_to_normalize] <- predict(norm_values, ny_k_train.df[, columns_to_normalize])
valid.norm.df[, columns_to_normalize] <- predict(norm_values, ny_k_valid.df[, columns_to_normalize])
ny_k.norm.df[, columns_to_normalize] <- predict(norm_values, ny_k.df[, columns_to_normalize])
# Make up a new rental whose cleaning-fee class will be predicted by the model
new.df <- data.frame(log_price = 4, accommodates = 5, bedrooms = 4, beds = 5)
# Ensure that the columns in new.df match the columns used for normalization in the training data
new.df[, columns_to_normalize] <- predict(norm_values, new.df[, columns_to_normalize])
# Using the validation data & a range of k values from 1 to 14,
# assess the accuracy level for each k value
# Initialize a data frame with two columns: k, & accuracy
accuracy.df <- data.frame(k = seq(1, 14, 1), accuracy = rep(0, 14))
# Compute the accuracy level for each k value & find the optimal k-value
for (i in 1:14) {
knn.pred <- knn(train.norm.df[, columns_to_normalize],
valid.norm.df[, columns_to_normalize],
cl = train.norm.df[, "cleaning_fee"], k = i)
accuracy.df[i, 2] <- confusionMatrix(knn.pred, valid.norm.df[, "cleaning_fee"])$overall[1]
}
accuracy.df
## k accuracy
## 1 1 0.717
## 2 2 0.722
## 3 3 0.730
## 4 4 0.729
## 5 5 0.735
## 6 6 0.740
## 7 7 0.742
## 8 8 0.742
## 9 9 0.741
## 10 10 0.742
## 11 11 0.743
## 12 12 0.744
## 13 13 0.745
## 14 14 0.744
# Using the knn() function, the normalized training data, & the optimal k
# (the k with the highest validation accuracy, k = 13 in the table above),
# generate a predicted classification of cleaning_fee for the new rental.
optimal_k <- which.max(accuracy.df$accuracy)
optimal_k_value <- accuracy.df$k[optimal_k]
nn <- knn(train = train.norm.df[, columns_to_normalize],
test = new.df[, columns_to_normalize],
cl = train.norm.df[, "cleaning_fee"], k = optimal_k_value)
predicted_cleaning_fee <- as.character(nn)
predicted_cleaning_fee
## [1] "True"
The prediction is ‘True’ - the fictional NYC Airbnb rental will have a cleaning fee.
In the third part of the data mining project, a k-nearest neighbors (k-NN) classification model was implemented to predict whether or not an Airbnb rental in New York City would include a cleaning fee. The construction of this predictive model involved several systematic steps to ensure its reliability and accuracy.
To begin, the dataset was preprocessed by transforming the ‘cleaning_fee’ variable into a factor, representing the presence or absence of cleaning fees. Subsequently, irrelevant columns, such as URLs and non-predictive attributes, were removed from the dataset. Missing values were also handled by eliminating rows with any NA values, as k-NN models do not accommodate missing data.
To establish a robust model, the dataset was split into training and validation sets using a 60-40 partition while maintaining reproducibility through the application of a random seed. Within the training dataset, a comparative analysis of mean differences between rentals with and without cleaning fees was conducted for various predictor variables. This allowed for the identification of attributes that significantly contributed to the classification task. Variables demonstrating minimal differences or being categorical in nature were excluded from consideration to prevent potential similarity bias.
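The comparison of group means described above can be expressed as a relative percentage difference. The following is a minimal sketch on made-up data; `rel_mean_diff` and the `toy` data frame are hypothetical and not part of the original analysis.

```r
# Toy data (made up for illustration): a class label plus two numeric predictors
toy <- data.frame(
  cleaning_fee = factor(c("True", "True", "True", "False", "False", "False")),
  accommodates = c(4, 3, 5, 2, 2, 2),
  bathrooms    = c(1, 1, 1.5, 1, 1, 1.5)
)

# Relative difference in group means, as a percentage of the "False" group mean
rel_mean_diff <- function(x, group) {
  m <- tapply(x, group, mean)
  unname((m["True"] - m["False"]) / m["False"] * 100)
}

diffs <- sapply(toy[, c("accommodates", "bathrooms")], rel_mean_diff,
                group = toy$cleaning_fee)
diffs
```

Dividing by a group mean makes the differences comparable across predictors measured on different scales, which an absolute difference does not.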
Normalization of the data was imperative to ensure that all predictor variables contributed equally to the model. The ‘preProcess’ function from the ‘caret’ package was employed to standardize the data, rendering it suitable for k-NN classification.
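The standardization step amounts to learning a mean and standard deviation from the training set and applying them to every other set. The following is a minimal base-R sketch on simulated values; the variable names are illustrative only.

```r
set.seed(60)
# Toy predictor on a log-price-like scale (simulated for illustration)
x <- rnorm(100, mean = 4.7, sd = 0.7)
train_idx <- sample(seq_along(x), size = 60)
x_train <- x[train_idx]
x_valid <- x[-train_idx]

# Learn centering/scaling parameters from the training set only
mu    <- mean(x_train)
sigma <- sd(x_train)

# Apply the same parameters to both sets so that no validation
# information leaks into the preprocessing
z_train <- (x_train - mu) / sigma
z_valid <- (x_valid - mu) / sigma

c(mean = mean(z_train), sd = sd(z_train))
```

The validation set will generally not have a mean of exactly 0 or a standard deviation of exactly 1 after this transformation, which is expected.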
Subsequently, k-NN classification was performed on the validation dataset, with k values ranging from 1 to 14. Model accuracy was evaluated for each k value, and it was determined that the optimal k-value was 13, resulting in a validation accuracy of 74.5%.
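The search over k can be written compactly with `sapply()`. The sketch below uses the built-in `iris` data as a stand-in for the Airbnb listings, so the accuracies it produces say nothing about the cleaning-fee model itself.

```r
library(class)  # provides knn()

set.seed(60)
# Built-in iris data stands in for the Airbnb listings (assumption for illustration)
idx     <- sample(seq_len(nrow(iris)), size = 0.6 * nrow(iris))
train_x <- scale(iris[idx, 1:4])
# Scale the validation set with the training set's center and spread
valid_x <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Validation accuracy for k = 1..14
acc <- sapply(1:14, function(k) {
  pred <- knn(train_x, valid_x, cl = iris$Species[idx], k = k)
  mean(pred == iris$Species[-idx])
})

# which.max() returns the first k that attains the highest accuracy
best_k <- which.max(acc)
best_k
```

When several k values tie for the best accuracy, `which.max()` picks the smallest of them. Accuracy differences in the third decimal place are usually within sampling noise.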
Finally, the k-NN model with the optimal k-value was applied to predict whether a fictitious rental, characterized by specific attributes (log_price = 4, accommodates = 5, bedrooms = 4, beds = 5), would include a cleaning fee. The model produced a prediction of ‘True,’ indicating that the new rental was likely to have a cleaning fee.
In this instance, it is vital to address and safeguard the model against similarity bias. Similarity bias occurs when the model assigns similar instances to the same class without adequately considering individual attribute importance. This can lead to misclassification, particularly when variables exhibit strong correlations or when categorical variables are not treated with appropriate consideration. The removal of variables with minimal class differences and categorical attributes aimed to mitigate similarity bias, ensuring the model’s accuracy and fairness in classifying cleaning fees for Airbnb rentals in New York City.
A k-Nearest Neighbors model was built to classify whether a New York City Airbnb listing includes a cleaning fee. The modeling pipeline included the following steps:
- Converted cleaning_fee into a factor variable and removed irrelevant columns such as ID, name, and host descriptions. Rows with missing values were excluded.
- Retained the numeric variables log_price, accommodates, bedrooms, beds, latitude, longitude, number_of_reviews, review_scores_rating, and amenities_count.
- Normalized the predictors with the caret::preProcess() function to ensure distance-based calculations were meaningful in k-NN.
- k values from 1 to 14 were tested on a 60/40 train/validation split. The model with k = 13 achieved the highest validation accuracy of 74.5%.

By removing variables prone to similarity bias and focusing on impactful continuous predictors, the model provided a reliable prediction of cleaning fee presence. The normalization step was crucial for performance.
# Create copy of dataset & generate summary of 'log_price'
ny_nb.df<- ny.df
summary(ny_nb.df$log_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.30 4.17 4.61 4.68 5.11 7.60
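# Create bins for the 'log_price' variable
# Note: cut() assigns labels in ascending order of the breaks, so as written
# 'Pricey Digs' labels the lowest-price bin and 'Student Budget' the highest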
ny_nb.df$log_price <- cut(ny_nb.df$log_price, breaks=c(0.000, 4.248, 4.654, 5.165, 7.600),
labels=c("Pricey Digs", "Above Average", "Below Average", "Student Budget"))
str(ny_nb.df$log_price)
## Factor w/ 4 levels "Pricey Digs",..: 4 3 4 3 4 3 4 4 3 4 ...
# Subset necessary columns
ny_nb.df <- subset(ny_nb.df, select = c(log_price, accommodates, bedrooms, bathrooms, room_type, property_type))
Note: Five predictor variables were selected for model building: property_type, room_type, accommodates, bathrooms, and bedrooms.
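The price tiers used as the outcome are created with `cut()`, which maps labels to bins from the lowest break to the highest. The following is a minimal sketch with made-up values and neutral, hypothetical labels that shows the mapping explicitly.

```r
# Made-up log-price values for illustration
log_price <- c(2.5, 4.2, 4.4, 4.7, 5.0, 5.3, 6.8)

# Quartile-style breaks computed from the data
breaks <- quantile(log_price, probs = c(0, 0.25, 0.5, 0.75, 1))

# Labels map to bins from lowest to highest price;
# include.lowest = TRUE keeps the minimum value in the first bin
tier <- cut(log_price, breaks = breaks, include.lowest = TRUE,
            labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"))
table(tier)
```

Checking the first and last elements of the result against the raw values is a quick way to confirm that the labels point in the intended direction.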
# Convert numerical variables to categorical
ny_nb.df$accommodates <- factor(ny_nb.df$accommodates)
ny_nb.df$bathrooms <- factor(ny_nb.df$bathrooms)
ny_nb.df$bedrooms <- factor(ny_nb.df$bedrooms)
# Partition dataset into training & validation sets
set.seed(60)
train_nb.index <- sample(c(1:dim(ny_nb.df)[1]), dim(ny_nb.df)[1]*0.6)
selected.var <- c(1, 2, 3, 4, 5, 6)
train_nb.df <- ny_nb.df[train_nb.index, selected.var]
valid_nb.df <- ny_nb.df[-train_nb.index, selected.var]
# Generate Naive Bayes model
ny_nb <- naiveBayes(log_price ~ ., data = train_nb.df)
ny_nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Pricey Digs Above Average Below Average Student Budget
## 0.279 0.261 0.237 0.224
##
## Conditional probabilities:
## accommodates
## Y 1 2 3 4 5 6 7
## Pricey Digs 0.314154 0.571538 0.071346 0.034522 0.002685 0.003836 0.000767
## Above Average 0.146962 0.608374 0.100575 0.109195 0.016831 0.011494 0.002053
## Below Average 0.053652 0.397656 0.166366 0.230839 0.064022 0.057710 0.011722
## Student Budget 0.010526 0.183732 0.112440 0.277033 0.097608 0.154067 0.038756
## accommodates
## Y 8 9 10 11 12 13 14
## Pricey Digs 0.000000 0.000000 0.000384 0.000000 0.000767 0.000000 0.000000
## Above Average 0.002463 0.000000 0.000821 0.000411 0.000000 0.000000 0.000000
## Below Average 0.013075 0.000902 0.003156 0.000451 0.000000 0.000000 0.000000
## Student Budget 0.064115 0.009091 0.028230 0.003349 0.008612 0.001435 0.003349
## accommodates
## Y 15 16
## Pricey Digs 0.000000 0.000000
## Above Average 0.000411 0.000411
## Below Average 0.000000 0.000451
## Student Budget 0.002392 0.005263
##
## bedrooms
## Y 1 2 3 4 5 6 7
## Pricey Digs 0.976985 0.015343 0.005754 0.001151 0.000767 0.000000 0.000000
## Above Average 0.933498 0.058703 0.006979 0.000821 0.000000 0.000000 0.000000
## Below Average 0.773219 0.188458 0.034265 0.003607 0.000451 0.000000 0.000000
## Student Budget 0.467464 0.335885 0.142584 0.036364 0.012440 0.002392 0.001914
## bedrooms
## Y 8 9 10
## Pricey Digs 0.000000 0.000000 0.000000
## Above Average 0.000000 0.000000 0.000000
## Below Average 0.000000 0.000000 0.000000
## Student Budget 0.000478 0.000478 0.000000
##
## bathrooms
## Y 0.5 1 1.5 2 2.5 3 3.5
## Pricey Digs 0.003069 0.849636 0.068278 0.067894 0.004219 0.005370 0.000000
## Above Average 0.002053 0.899015 0.043924 0.044745 0.004516 0.003695 0.000000
## Below Average 0.001353 0.932822 0.026150 0.032462 0.003607 0.002705 0.000451
## Student Budget 0.000478 0.721531 0.051196 0.159809 0.032057 0.018182 0.008612
## bathrooms
## Y 4 4.5 5 5.5
## Pricey Digs 0.001151 0.000000 0.000384 0.000000
## Above Average 0.002053 0.000000 0.000000 0.000000
## Below Average 0.000000 0.000000 0.000451 0.000000
## Student Budget 0.004306 0.001435 0.000957 0.001435
##
## room_type
## Y Entire home/apt Private room Shared room
## Pricey Digs 0.03529 0.89336 0.07135
## Above Average 0.27422 0.70731 0.01847
## Below Average 0.74301 0.25338 0.00361
## Student Budget 0.94641 0.05024 0.00335
##
## property_type
## Y Apartment Bed & Breakfast Boat Boutique hotel Bungalow
## Pricey Digs 0.780974 0.001151 0.000000 0.000000 0.000000
## Above Average 0.846470 0.005747 0.000000 0.000411 0.000821
## Below Average 0.873760 0.002254 0.000451 0.000451 0.000902
## Student Budget 0.834928 0.002392 0.000000 0.000000 0.000478
## property_type
## Y Condominium Dorm Guest suite Guesthouse Hostel House
## Pricey Digs 0.005754 0.001534 0.000767 0.001151 0.000384 0.164557
## Above Average 0.010673 0.000000 0.002053 0.001642 0.000000 0.093186
## Below Average 0.013526 0.000000 0.000902 0.000451 0.000000 0.060415
## Student Budget 0.023923 0.000000 0.000957 0.000000 0.000478 0.075120
## property_type
## Y Loft Other Serviced apartment Timeshare Townhouse
## Pricey Digs 0.015727 0.005370 0.000000 0.000000 0.021097
## Above Average 0.012315 0.004105 0.000411 0.000000 0.021757
## Below Average 0.016231 0.007665 0.000000 0.000451 0.022092
## Student Budget 0.029187 0.003349 0.000000 0.001435 0.026794
## property_type
## Y Vacation home Villa
## Pricey Digs 0.000000 0.001534
## Above Average 0.000000 0.000411
## Below Average 0.000451 0.000000
## Student Budget 0.000478 0.000478
The ‘A-priori probabilities’ given above denote the likelihood that an Airbnb listing in NYC belongs to each of these four classes. The share of each class in the training data is as follows:
- “Pricey Digs”: 0.279
- “Above Average”: 0.261
- “Below Average”: 0.237
- “Student Budget”: 0.224
The Naive Bayes classifier will use these probabilities to make predictions. For instance, given a set of predictor variable values, the classifier will calculate the probability of the instance (i.e., the Airbnb listing) belonging to each class and assign it to the most likely class (the one with the highest probability).
To demonstrate, we will predict the price class for a fictional apartment with the following characteristics:
- property_type = “Apartment”
- room_type = “Entire home/apt”
- accommodates = 4
- bathrooms = 1
- bedrooms = 3
# Predict class probabilities & class membership for the validation set
pred.prob <- predict(ny_nb, newdata = valid_nb.df, type = "raw")
pred.class <- predict(ny_nb, newdata = valid_nb.df)
# Store results under a new name so the raw data in 'df' is not overwritten
nb_pred.df <- data.frame(actual = valid_nb.df$log_price, predicted = pred.class, pred.prob)
# Show validation listings that match the fictional apartment's profile
nb_pred.df[valid_nb.df$property_type == "Apartment" &
             valid_nb.df$room_type == "Entire home/apt" &
             valid_nb.df$accommodates == 4 &
             valid_nb.df$bathrooms == 1 &
             valid_nb.df$bedrooms == 3, ]
## actual predicted Pricey.Digs Above.Average Below.Average
## 3498 Below Average Student Budget 0.000209 0.00667 0.183
## 4888 Below Average Student Budget 0.000209 0.00667 0.183
## 5215 Above Average Student Budget 0.000209 0.00667 0.183
## 5296 Below Average Student Budget 0.000209 0.00667 0.183
## 5754 Student Budget Student Budget 0.000209 0.00667 0.183
## Student.Budget
## 3498 0.81
## 4888 0.81
## 5215 0.81
## 5296 0.81
## 5754 0.81
# Training set
pred.class <- predict(ny_nb, newdata = train_nb.df)
confusionMatrix(pred.class, train_nb.df$log_price)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Pricey Digs Above Average Below Average Student Budget
## Pricey Digs 2328 1542 471 76
## Above Average 177 230 105 42
## Below Average 82 529 1126 805
## Student Budget 20 135 516 1167
##
## Overall Statistics
##
## Accuracy : 0.519
## 95% CI : (0.509, 0.529)
## No Information Rate : 0.279
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.354
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Statistics by Class:
##
## Class: Pricey Digs Class: Above Average
## Sensitivity 0.893 0.0944
## Specificity 0.690 0.9531
## Pos Pred Value 0.527 0.4152
## Neg Pred Value 0.943 0.7492
## Prevalence 0.279 0.2605
## Detection Rate 0.249 0.0246
## Detection Prevalence 0.472 0.0592
## Balanced Accuracy 0.792 0.5238
## Class: Below Average Class: Student Budget
## Sensitivity 0.508 0.558
## Specificity 0.801 0.908
## Pos Pred Value 0.443 0.635
## Neg Pred Value 0.840 0.877
## Prevalence 0.237 0.224
## Detection Rate 0.120 0.125
## Detection Prevalence 0.272 0.197
## Balanced Accuracy 0.655 0.733
# Validation set
pred.class <- predict(ny_nb, newdata = valid_nb.df)
confusionMatrix(pred.class, valid_nb.df$log_price)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Pricey Digs Above Average Below Average Student Budget
## Pricey Digs 1473 998 311 57
## Above Average 125 133 89 24
## Below Average 67 354 768 540
## Student Budget 17 109 385 786
##
## Overall Statistics
##
## Accuracy : 0.507
## 95% CI : (0.494, 0.519)
## No Information Rate : 0.27
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.339
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Statistics by Class:
##
## Class: Pricey Digs Class: Above Average
## Sensitivity 0.876 0.0834
## Specificity 0.700 0.9487
## Pos Pred Value 0.519 0.3585
## Neg Pred Value 0.938 0.7509
## Prevalence 0.270 0.2556
## Detection Rate 0.236 0.0213
## Detection Prevalence 0.455 0.0595
## Balanced Accuracy 0.788 0.5161
## Class: Below Average Class: Student Budget
## Sensitivity 0.495 0.559
## Specificity 0.795 0.894
## Pos Pred Value 0.444 0.606
## Neg Pred Value 0.826 0.874
## Prevalence 0.249 0.226
## Detection Rate 0.123 0.126
## Detection Prevalence 0.277 0.208
## Balanced Accuracy 0.645 0.726
In this section of the project, we implemented the Naive Bayes algorithm to categorize Airbnb rental prices in New York City (NYC) into four distinct bins: “Pricey Digs,” “Above Average,” “Below Average,” and “Student Budget.” This categorization allows us to provide valuable insights for both Airbnb management and potential customers. The Naive Bayes model was developed using a subset of the original data for NYC that includes five carefully chosen predictor variables: property type, room type, accommodates (the number of guests the listing can accommodate), bathrooms, and bedrooms.
The first step was to create the price bins based on the ‘log_price’ variable. We split the prices into four categories, ensuring an approximately equal distribution of listings across these categories. The summary of the ‘log_price’ variable indicates that log-transformed rental prices in NYC range from 2.30 to 7.60, with a median of 4.61 (assuming a natural log, roughly 10 to 2,000 in the original price units, with a median near 100). After binning, we converted the numerical predictor variables (‘accommodates,’ ‘bathrooms,’ and ‘bedrooms’) into categorical variables to prepare them for modeling.
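The binning step described above can be sketched as follows. This is a minimal illustration on simulated values, not the project's actual code: `sim_log_price` and `price_class` are hypothetical names, and the labels are listed from cheapest to most expensive so that they line up with the ascending quartile breaks.

```r
set.seed(1)
# Simulated log prices (hypothetical stand-in for the real log_price column)
sim_log_price <- rnorm(1000, mean = 4.7, sd = 0.6)

# Quartile break points: minimum, Q1, median, Q3, maximum
breaks <- quantile(sim_log_price, probs = seq(0, 1, by = 0.25))

# cut() assigns labels in the same order as the breaks (lowest bin first)
price_class <- cut(sim_log_price,
                   breaks = breaks,
                   labels = c("Student Budget", "Below Average",
                              "Above Average", "Pricey Digs"),
                   include.lowest = TRUE)

# Each bin holds about a quarter of the listings
table(price_class)

# Sanity check: the mean log price should rise from the first label to the last
bin_means <- tapply(sim_log_price, price_class, mean)
bin_means
```

Checking the per-bin means after binning is a cheap way to confirm that each label is attached to the intended price range.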
The Naive Bayes model was then trained on a subset of the dataset, with 60% of the data used for training and the remaining 40% for validation. The model’s results are shown in the output, where it calculates conditional probabilities for each combination of predictor values in relation to the four price categories.
Conditional probabilities for predictor variables like ‘accommodates,’ ‘bedrooms,’ ‘bathrooms,’ ‘room_type,’ and ‘property_type’ play a pivotal role in the Naive Bayes classifier. These probabilities indicate the likelihood of observing particular predictor variable values within a specific class. They are used to estimate the probability of a specific class given the observed predictor variable values, helping the classifier make predictions by identifying the most probable class based on the observed data.
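To make the calculation concrete, the sketch below multiplies the training-set priors by the conditional probabilities printed in the model output for one combination of predictor values, then normalizes the result. It is a hand calculation for illustration only: it omits ‘accommodates’, so the numbers will not match the output of `predict()` exactly, and the vector names are hypothetical.

```r
classes <- c("Pricey Digs", "Above Average", "Below Average", "Student Budget")

# A-priori class probabilities from the training data
prior <- c(0.279, 0.261, 0.237, 0.224)

# Conditional probabilities P(value | class), copied from the model output
p_bedrooms_3  <- c(0.005754, 0.006979, 0.034265, 0.142584)
p_bathrooms_1 <- c(0.849636, 0.899015, 0.932822, 0.721531)
p_entire_home <- c(0.03529, 0.27422, 0.74301, 0.94641)
p_apartment   <- c(0.780974, 0.846470, 0.873760, 0.834928)

# Naive Bayes: prior times the product of the per-variable likelihoods
unnormalized <- prior * p_bedrooms_3 * p_bathrooms_1 * p_entire_home * p_apartment

# Normalize so the four posterior probabilities sum to 1
posterior <- setNames(unnormalized / sum(unnormalized), classes)
round(posterior, 3)

# Predicted class = the class with the largest posterior
names(which.max(posterior))
```

With these four predictors, “Student Budget” receives the largest posterior, in line with the class the model predicts for matching validation listings.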
The key insights from the model’s conditional probabilities are as follows:
Accommodation Capacity: Listings that can accommodate fewer guests (e.g., ‘accommodates’ = 1-4) are more likely to fall into the “Pricey Digs” and “Above Average” categories. This suggests that smaller properties or those suitable for fewer people are associated with higher price categories. On the other hand, listings that accommodate more guests (e.g., ‘accommodates’ = 5-12) are more likely to be in the “Student Budget” category. This implies that larger properties or those suitable for more people are associated with lower price categories.
Number of Bedrooms: One-bedroom listings dominate the “Pricey Digs” (97.7%) and “Above Average” (93.3%) classes but make up only 46.7% of “Student Budget.” Listings with three or more bedrooms are concentrated in “Student Budget” (about 19.7% of that class, versus under 1% of “Pricey Digs”).
Number of Bathrooms: A single bathroom is the norm in every class, from 72.2% of “Student Budget” listings to 93.3% of “Below Average” listings. Listings with two or more bathrooms are most common in “Student Budget” (about 22.7%) and least common in “Below Average” (about 4.0%).
Room Type: Room type separates the classes most sharply. Private rooms account for 89.3% of “Pricey Digs” and 70.7% of “Above Average” listings, while entire homes/apartments account for 74.3% of “Below Average” and 94.6% of “Student Budget” listings. Shared rooms are rare everywhere but most frequent in “Pricey Digs” (7.1%).
Property Type: Apartments are the majority property type in every class (78.1% to 87.4%), so this variable discriminates less. Houses are the next most common type and are most frequent in “Pricey Digs” (16.5%). Rare types such as “Bed & Breakfast” and “Boat” each account for well under 1% of any class, which is too few listings to support conclusions about their price tier. Taken together, these tables associate small, private-room listings with the classes labeled as expensive and large, entire-home listings with the classes labeled as cheap. That is the opposite of what the class names suggest, so the order in which the labels were assigned to the ‘log_price’ bins should be verified before drawing pricing conclusions.
Furthermore, the model’s performance was rigorously evaluated using confusion matrices for both the training and validation datasets. While accuracy is a useful metric, a deeper analysis of the results unveils both the model’s potential and areas for enhancement. In the training set, the model achieved an accuracy of approximately 51.9%, and a similar accuracy of 50.7% in the validation set. However, accuracy alone may not provide a complete picture of the model’s effectiveness.
The confusion matrices reveal important insights:
Sensitivity (True Positive Rate): The model correctly identifies most ‘Pricey Digs’ listings, with a sensitivity of 87.6% in the validation set. It does so partly by over-predicting this class: 45.5% of validation listings are assigned to ‘Pricey Digs’ although only 27.0% belong to it, so its positive predictive value is 51.9%.
Specificity (True Negative Rate): Specificity in the validation set ranges from 70.0% for ‘Pricey Digs’ to 94.9% for ‘Above Average.’ The high value for ‘Above Average’ mainly reflects how rarely the model predicts that class (about 6% of listings).
Challenges in Classification: The model struggles with the two middle classes, with sensitivity values of 8.34% for ‘Above Average’ and 49.5% for ‘Below Average.’ Most ‘Above Average’ listings (998 of 1,594 in the validation set) are predicted as ‘Pricey Digs.’
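The per-class statistics discussed above can be recomputed directly from the validation confusion matrix. The sketch below uses base R only and copies the counts from the output; the object names are hypothetical, and the formulas are the standard one-vs-rest definitions rather than a reproduction of caret's internals.

```r
classes <- c("Pricey Digs", "Above Average", "Below Average", "Student Budget")

# Validation confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(1473, 998, 311,  57,
                125, 133,  89,  24,
                 67, 354, 768, 540,
                 17, 109, 385, 786),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)
tp <- diag(cm)
fn <- colSums(cm) - tp   # actual members of the class that were missed
fp <- rowSums(cm) - tp   # listings wrongly assigned to the class
tn <- n - tp - fn - fp

accuracy    <- sum(tp) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# Cohen's kappa: agreement beyond what the row/column totals imply by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(accuracy, 3)
round(rbind(sensitivity, specificity), 3)
round(cohen_kappa, 3)
```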
While the model exhibits promise, it’s important to acknowledge potential drawbacks and explore reasons for underperformance in certain classes:
Class Imbalance: The four classes are close to balanced by construction (prevalence between 22% and 28%), so imbalance is unlikely to be the main cause. The weak results for ‘Above Average’ more plausibly reflect overlap with ‘Pricey Digs’ on the five predictors used.
Feature Selection: The features used for prediction might not capture all the nuances influencing price categories. Feature engineering and selection processes may require refinement to improve predictive power.
Complex Factors: Pricing in the Airbnb marketplace can be influenced by complex factors beyond the scope of the current features, such as seasonality, local events, and market dynamics. These factors can contribute to classification difficulties.
Despite these challenges, the model offers valuable assistance to Airbnb management in pricing recommendations and a deeper understanding of the factors influencing price categories. Users can benefit from insights into expected price ranges based on their preferences, which can guide them in making informed booking decisions. Ongoing model refinement and feature engineering efforts hold the potential to enhance classification accuracy and address these limitations.
This model classifies NYC Airbnb rentals into four price categories/tiers: “Pricey Digs,” “Above Average,” “Below Average,” and “Student Budget.”
- The log_price variable was segmented into four bins based on quartiles.
- Predictors: property_type, room_type, accommodates, bathrooms, and bedrooms. Numeric variables were converted to categorical types for compatibility with Naive Bayes.

| Dataset | Accuracy | Kappa | Key Strength |
|---|---|---|---|
| Training | 51.9% | 0.354 | High sensitivity for “Pricey Digs” (89.3%) |
| Validation | 50.7% | 0.339 | Strong specificity for most classes |
The model performed well in classifying high-priced listings but struggled with middle categories, particularly “Above Average.”
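- Room type separates the tiers most sharply. In the fitted tables, private rooms make up 89.3% of “Pricey Digs” and entire homes/apartments make up 94.6% of “Student Budget,” the reverse of what the tier names suggest, so the label order should be checked against log_price.
- Apartments are the majority property type in every tier (78% to 87%), while unique options like Boats are too rare to place reliably.

Despite modest accuracy, the model provides interpretable probabilities and valuable business insights for Airbnb managers and users. Further feature engineering could enhance performance.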
ny_ct.df <- ny.df
# Subset data (remove unnecessary columns)
ny_ct.df <- subset(ny_ct.df, select= - c(id, amenities, bed_type, cleaning_fee, city, description, first_review, host_since,instant_bookable, last_review, latitude, longitude, name, thumbnail_url, neighbourhood, property_group, borough))
# Inspect new dataset
str(ny_ct.df)
## 'data.frame': 15589 obs. of 14 variables:
## $ log_price : num 5.39 4.7 5.66 4.93 5.19 ...
## $ property_type : chr "Apartment" "Apartment" "Apartment" "Apartment" ...
## $ room_type : chr "Entire home/apt" "Private room" "Entire home/apt" "Private room" ...
## $ accommodates : int 2 4 2 3 2 1 8 6 3 3 ...
## $ bathrooms : num 1 1 1 1 1 1 1 1 2 1 ...
## $ cancellation_policy : Factor w/ 4 levels "flexible","moderate",..: 2 3 3 3 3 3 3 3 3 1 ...
## $ host_has_profile_pic : chr "t" "t" "t" "t" ...
## $ host_identity_verified: Factor w/ 2 levels "True","False": 2 1 1 2 1 2 1 1 2 2 ...
## $ host_response_rate : num 1 1 1 0.5 1 0.96 1 1 1 0.9 ...
## $ number_of_reviews : int 3 72 2 3 140 16 62 178 105 8 ...
## $ review_scores_rating : num 1 0.91 0.9 1 0.82 0.95 0.93 0.83 0.92 0.98 ...
## $ bedrooms : num 1 1 1 1 1 1 2 1 1 1 ...
## $ beds : num 1 1 1 2 2 1 4 3 1 1 ...
## $ amenities_count : int 11 31 22 14 14 15 20 23 19 15 ...
# Convert character variables to factors
ny_ct.df$property_type <- as.factor(ny_ct.df$property_type)
ny_ct.df$room_type <- as.factor(ny_ct.df$room_type)
ny_ct.df$host_has_profile_pic <- factor(ny_ct.df$host_has_profile_pic, levels = c("t", "f"), labels = c("True", "False"))
str(ny_ct.df) # reinspect dataset
## 'data.frame': 15589 obs. of 14 variables:
## $ log_price : num 5.39 4.7 5.66 4.93 5.19 ...
## $ property_type : Factor w/ 21 levels "Apartment","Bed & Breakfast",..: 1 1 1 1 1 1 1 1 1 9 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 2 1 2 1 2 1 1 2 1 ...
## $ accommodates : int 2 4 2 3 2 1 8 6 3 3 ...
## $ bathrooms : num 1 1 1 1 1 1 1 1 2 1 ...
## $ cancellation_policy : Factor w/ 4 levels "flexible","moderate",..: 2 3 3 3 3 3 3 3 3 1 ...
## $ host_has_profile_pic : Factor w/ 2 levels "True","False": 1 1 1 1 1 1 1 1 1 1 ...
## $ host_identity_verified: Factor w/ 2 levels "True","False": 2 1 1 2 1 2 1 1 2 2 ...
## $ host_response_rate : num 1 1 1 0.5 1 0.96 1 1 1 0.9 ...
## $ number_of_reviews : int 3 72 2 3 140 16 62 178 105 8 ...
## $ review_scores_rating : num 1 0.91 0.9 1 0.82 0.95 0.93 0.83 0.92 0.98 ...
## $ bedrooms : num 1 1 1 1 1 1 2 1 1 1 ...
## $ beds : num 1 1 1 2 2 1 4 3 1 1 ...
## $ amenities_count : int 11 31 22 14 14 15 20 23 19 15 ...
Note: ‘cancellation_policy’ and ‘host_identity_verified’ are already factor variables, so we do not need to modify their levels yet.
# Merge "super_strict" into "strict", leaving three levels: "flexible", "moderate", & "strict"
# (the other levels keep their names, so no reassignment is needed for them)
levels(ny_ct.df$cancellation_policy)[levels(ny_ct.df$cancellation_policy) == "super_strict"] <- "strict"
#Partition data into training & validation sets
set.seed(92)
ny_ct.df_train.index <- sample(c(1:nrow(ny_ct.df)), nrow(ny_ct.df)*0.6)
ny_ct_train.df <- ny_ct.df[ny_ct.df_train.index, ]
ny_ct_valid.df <- ny_ct.df[-ny_ct.df_train.index, ]
#Build the classification tree model
ct <- rpart(cancellation_policy~., ny_ct_train.df, method="class", xval= 10)
# Determine the ideal tree size using Cross-validation
printcp(ct)
##
## Classification tree:
## rpart(formula = cancellation_policy ~ ., data = ny_ct_train.df,
## method = "class", xval = 10)
##
## Variables actually used in tree construction:
## [1] log_price number_of_reviews
##
## Root node error: 4607/9353 = 0.5
##
## n= 9353
##
## CP nsplit rel error xerror xstd
## 1 0.04 0 1.0 1.0 0.01
## 2 0.01 2 0.9 0.9 0.01
# Plot cross-validated error against the complexity parameter (cp)
plotcp(ct)
# Keep the tree size where the cp value has the smallest error
ct_pruned <- prune(ct,
cp = ct$cptable[which.min(ct$cptable[, "xerror"]), "CP"])
# Plot the pruned tree
rpart.plot(ct_pruned, yesno = TRUE)
The Classification Tree model we constructed predicts the cancellation policy of Airbnb listings in New York City, a critical aspect for both hosts and guests to understand. Data preparation included the removal of redundant columns, data type conversions, and handling of missing values. To simplify the classification task, we merged the “super_strict” level of the “cancellation_policy” variable into the broader “strict” category, leaving three classes: flexible, moderate, and strict. We then randomly partitioned the dataset into a training set (60% of the data) and a validation set (40%).
Although all 13 remaining variables were offered as predictors, the fitted tree uses only two of them for its splits: the number of reviews and the log price. The gain over a naive baseline is modest. The root node error is 4,607/9,353 (about 49%), and the two-split tree lowers the cross-validated error to roughly 0.9 of that level. The points below offer possible explanations for why these two variables relate to cancellation policies; they are interpretations, not effects estimated by the model.
Guest and Host Experience: The number of reviews can be seen as a proxy for the level of experience both hosts and guests have had with a particular listing. Listings with a high number of reviews may indicate a history of positive experiences, while those with fewer reviews might be relatively new or less frequently booked. Guests and hosts may have different expectations and behaviors depending on the listing’s review history.
Trust and Credibility: High review counts can contribute to building trust and credibility among potential guests. Hosts who maintain positive reviews are likely to have a more favorable cancellation policy, as they may want to uphold their reputation and maintain high occupancy rates. On the other hand, hosts with fewer reviews may adopt stricter policies to mitigate potential risks.
Price Sensitivity: The price of a listing is a critical factor for both guests and hosts. Higher-priced listings may have more stringent cancellation policies to protect against last-minute cancellations that could result in significant revenue loss. Lower-priced listings, on the other hand, might offer more flexible cancellation options to attract cost-conscious guests.
Market Competition: The pricing strategy of a listing could be influenced by the competitive landscape in the Airbnb market in New York City. Listings in highly competitive areas might offer more flexible cancellation policies to attract bookings, while those in less competitive areas may rely on stricter policies to secure confirmed reservations.
Guest Preferences: Different guests may have varying levels of price sensitivity and risk tolerance. Some guests may prioritize flexibility in their travel plans and be willing to pay more for it, while others may prioritize cost savings and be less concerned about the cancellation policy. Hosts may adjust their pricing and policies to align with the preferences of their target guest demographic.
Seasonal Variations: The importance of price and the number of reviews in predicting cancellation policies may vary seasonally. For example, during peak tourist seasons, hosts may increase prices and tighten cancellation policies to capitalize on high demand, while off-peak seasons may see lower prices and more lenient cancellation options.
The resulting decision tree offers a simple, interpretable view of how price and review history relate to cancellation policies in New York City, although its predictive gain over always guessing the most common policy is modest. Hosts can use it as one input when setting their cancellation policies, and guests can use it to form a rough expectation of the policy they are likely to encounter when booking an Airbnb in New York City.
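One way to check how well a pruned tree generalizes is to score it on the validation set and compare the result with a majority-class baseline. The sketch below is a minimal illustration on simulated data, assuming a made-up relationship between price, reviews, and policy; `sim.df`, `ct.sim`, and the scoring rule are hypothetical and do not reproduce the project's results.

```r
library(rpart)

set.seed(92)
n <- 600
# Simulated stand-in for the listing data (not the real Airbnb records)
sim.df <- data.frame(
  log_price         = rnorm(n, mean = 4.7, sd = 0.6),
  number_of_reviews = rpois(n, lambda = 20)
)
# Hypothetical rule: pricier, more-reviewed listings lean toward "strict"
score <- 1.5 * (sim.df$log_price - 4.7) +
  0.05 * (sim.df$number_of_reviews - 20) + rnorm(n, sd = 0.5)
sim.df$cancellation_policy <- cut(score, breaks = c(-Inf, -0.5, 0.5, Inf),
                                  labels = c("flexible", "moderate", "strict"))

# 60/40 training/validation split
train.index <- sample(seq_len(n), size = floor(0.6 * n))
train.df <- sim.df[train.index, ]
valid.df <- sim.df[-train.index, ]

# Fit, then prune at the cp value with the lowest cross-validated error
ct.sim <- rpart(cancellation_policy ~ ., data = train.df, method = "class", xval = 10)
best.cp <- ct.sim$cptable[which.min(ct.sim$cptable[, "xerror"]), "CP"]
ct.sim.pruned <- prune(ct.sim, cp = best.cp)

# Score the pruned tree on the held-out listings
pred <- predict(ct.sim.pruned, newdata = valid.df, type = "class")
conf.mat <- table(Predicted = pred, Actual = valid.df$cancellation_policy)
valid.acc <- sum(diag(conf.mat)) / sum(conf.mat)

# Baseline: always predict the most common policy in the validation set
baseline.acc <- max(table(valid.df$cancellation_policy)) / nrow(valid.df)

conf.mat
c(validation = valid.acc, baseline = baseline.acc)
```

Applied to the real data, the same pattern would use the pruned tree and validation set built earlier in place of the simulated objects.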
The goal was to classify listings by their cancellation_policy: flexible, moderate, or strict.
- The cancellation_policy variable was re-leveled to merge strict and super_strict into one class.
- The tree was built with the rpart algorithm and 10-fold cross-validation.
- Variables used in the tree: log_price and number_of_reviews.

This model visually reveals the decision-making logic behind cancellation policies, offering strategic value for hosts tailoring their policies by listing price and review history.
This section applies k-Means Clustering to identify distinct groups of Brooklyn listings based on rental and listing characteristics, and then examines how those groups are distributed across Brooklyn neighborhoods.
# Subset Brooklyn listings; 'neighbourhood' is kept for use as a label
ny_cluster.df <- subset(ny.df, borough == "Brooklyn")
# Create new variable: 'review_scores_rating' divided by 'number_of_reviews'
ny_cluster.df <- ny_cluster.df %>%
mutate(avg_review_scores_rating = review_scores_rating/number_of_reviews)
# Remove unnecessary columns
ny_cluster.df <- subset(ny_cluster.df, select= -c(id, property_type, room_type, amenities, bed_type, cancellation_policy, cleaning_fee, city, description, first_review, host_has_profile_pic, host_identity_verified, host_response_rate, host_since, instant_bookable, last_review, latitude, longitude, name, number_of_reviews, review_scores_rating, thumbnail_url, bedrooms, beds, borough, property_group))
# Handle missing values
ny_cluster.df <- na.omit(ny_cluster.df)
str(ny_cluster.df) # Reinspect dataframe
## 'data.frame': 5846 obs. of 6 variables:
## $ log_price : num 4.91 5.08 6.11 5.7 4.76 ...
## $ accommodates : int 2 2 7 6 2 2 6 2 5 3 ...
## $ bathrooms : num 1 2 2.5 2 1 1 2 1 1 1 ...
## $ neighbourhood : chr "DUMBO" "Boerum Hill" "Boerum Hill" "Downtown Brooklyn" ...
## $ amenities_count : int 14 15 10 15 6 19 28 15 20 16 ...
## $ avg_review_scores_rating: num 0.0172 0.3333 0.1225 0.5 0.5 ...
# Prepare data
cluster_labels = ny_cluster.df$neighbourhood
feature_var <- select(ny_cluster.df, -neighbourhood)
# Scale/standardize data to a mean of 0 & standard deviation of 1
df.scale <- scale(feature_var)
# Compute distance between observations
ny_cluster.df.dist <- dist(df.scale)
# Determine 'k' value (# of clusters) using within sum squares
fviz_nbclust(df.scale, kmeans, method="wss") + labs(subtitle = "Elbow method")
# k-means
optimal_k <- 4
km.out <- kmeans(df.scale, centers = optimal_k, nstart = 100)
# Plot in the same standardized space that kmeans() was fitted on
fviz_cluster(km.out, data = df.scale, stand = FALSE,
             geom = "point", ellipse.type = "convex",
             main = "K-Means Clustering of Brooklyn Listings")
# Generate table with cluster assignments
table(km.out$cluster, ny_cluster.df$neighbourhood)
##
## Bath Beach Bay Ridge Baychester Bedford-Stuyvesant Bensonhurst Bergen Beach
## 1 0 1 0 131 0 1
## 2 3 31 0 793 10 0
## 3 3 19 0 473 12 0
## 4 0 12 1 172 3 0
##
## Boerum Hill Borough Park Brighton Beach Brooklyn Brooklyn Heights
## 1 13 2 2 1 5
## 2 26 12 6 3 12
## 3 36 1 7 1 31
## 4 12 2 3 2 15
##
## Brooklyn Navy Yard Brownsville Bushwick Canarsie Carroll Gardens
## 1 1 0 33 0 0
## 2 22 16 659 1 1
## 3 6 1 226 1 2
## 4 2 1 173 0 1
##
## Clinton Hill Columbia Street Waterfront Coney Island Crown Heights
## 1 5 0 0 1
## 2 53 1 1 6
## 3 37 0 1 4
## 4 18 0 0 0
##
## Downtown Brooklyn DUMBO East Flatbush Flatbush Flatlands Fort Greene
## 1 1 0 5 16 5 16
## 2 3 3 14 165 13 55
## 3 13 8 7 74 8 66
## 4 4 0 3 37 2 31
##
## Gowanus Gravesend Greenpoint Greenwood Heights Kensington Lefferts Garden
## 1 9 3 22 9 8 16
## 2 24 11 220 31 41 114
## 3 32 7 147 20 16 56
## 4 13 2 80 8 18 36
##
## Manhattan Beach Midwood Mill Basin Park Slope Prospect Heights Red Hook
## 1 0 0 1 57 21 3
## 2 2 35 0 115 80 15
## 3 4 11 0 194 69 7
## 4 0 8 0 50 14 4
##
## Ridgewood Sea Gate Sheepshead Bay Sunset Park Vinegar Hill Williamsburg
## 1 0 1 2 3 0 15
## 2 3 1 28 67 2 142
## 3 0 2 14 11 4 105
## 4 0 1 4 16 0 34
##
## Windsor Terrace
## 1 12
## 2 25
## 3 31
## 4 11
# Determine variable means for each cluster in the original metric (i.e., kmeans model output is based on standardized data)
aggregate(feature_var, by= list(cluster= km.out$cluster), mean)
## cluster log_price accommodates bathrooms amenities_count
## 1 1 5.35 6.50 2.33 20.4
## 2 2 4.16 1.96 1.11 14.7
## 3 3 4.92 3.92 1.03 21.7
## 4 4 4.38 2.30 1.11 13.5
## avg_review_scores_rating
## 1 0.204
## 2 0.165
## 3 0.116
## 4 0.938
Next, we create boxplots for each clustering variable (i.e., log_price, accommodates, bathrooms, amenities_count, avg_review_scores_rating) by cluster to understand the distribution of data within each cluster:
# Create a data frame with cluster labels
ny_cluster.df$cluster <- as.factor(km.out$cluster)
# Boxplots for log_price by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = log_price)) +
geom_boxplot() +
labs(x = "Cluster", y = "log_price") +
ggtitle("Boxplot of log_price by Cluster")
# Boxplots for accommodates by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = accommodates)) +
geom_boxplot() +
labs(x = "Cluster", y = "accommodates") +
ggtitle("Boxplot of accommodates by Cluster")
# Boxplots for bathrooms by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = bathrooms)) +
geom_boxplot() +
labs(x = "Cluster", y = "bathrooms") +
ggtitle("Boxplot of bathrooms by Cluster")
# Boxplots for amenities_count by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = amenities_count)) +
geom_boxplot() +
labs(x = "Cluster", y = "amenities_count") +
ggtitle("Boxplot of amenities_count by Cluster")
# Boxplots for avg_review_scores_rating by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = avg_review_scores_rating)) +
geom_boxplot() +
labs(x = "Cluster", y = "avg_review_scores_rating") +
ggtitle("Boxplot of avg_review_scores_rating by Cluster")
In the analysis of Brooklyn listings using k-Means clustering, several key steps were undertaken to uncover distinct clusters based on selected features. Initially, the data was pre-processed by restricting it to Brooklyn and removing irrelevant columns and rows with missing values to ensure data quality. Additionally, we created a new feature, ‘avg_review_scores_rating’, which divides ‘review_scores_rating’ by ‘number_of_reviews’. Because ratings lie between 0 and 1 while review counts vary widely, this ratio is driven mostly by the review count: it is high for listings with few reviews and low for listings with many. The features were then standardized to have a mean of 0 and a standard deviation of 1, ensuring equal influence of each variable in the clustering process (i.e., by preventing variables with larger scales from dominating the results).
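Because the derived feature divides a rating by a review count, its value depends strongly on the count. The short sketch below uses made-up listings (the `toy.df` data frame is hypothetical) that share the same rating and differ only in how many reviews they have.

```r
# Made-up listings with the same rating but different review counts
toy.df <- data.frame(
  review_scores_rating = c(0.95, 0.95, 0.95),
  number_of_reviews    = c(1, 10, 100)
)

# Same definition as in the analysis: rating divided by number of reviews
toy.df$avg_review_scores_rating <-
  toy.df$review_scores_rating / toy.df$number_of_reviews

toy.df
```

The ratio falls by a factor of ten each time the review count rises tenfold, even though the rating itself never changes.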
The optimal number of clusters (k) was determined using the “elbow method,” resulting in the selection of k=4 as the most suitable choice. k-Means clustering was executed with 100 different starting configurations to reduce the risk of settling on a poor local optimum. The clusters were visualized with scatterplots, where each listing was represented by a point and convex hulls delineated the clusters. Additionally, a table was generated to show how the listings in each neighborhood are distributed across clusters.
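The elbow method amounts to fitting k-means for a range of k values and recording the total within-cluster sum of squares (WSS) each time. The sketch below does this with base R on simulated, standardized data; the simulated groups and all object names are assumptions made for illustration, not the Brooklyn data.

```r
set.seed(42)
# Simulated listings: two numeric features drawn around four loose centers
centers.sim <- matrix(c(4.2, 2,
                        4.4, 6,
                        5.0, 3,
                        5.4, 7), ncol = 2, byrow = TRUE)
group <- sample(1:4, 400, replace = TRUE)
sim.df <- data.frame(
  log_price    = rnorm(400, mean = centers.sim[group, 1], sd = 0.15),
  accommodates = rnorm(400, mean = centers.sim[group, 2], sd = 0.5)
)

# Standardize so both features contribute on the same scale
sim.scale <- scale(sim.df)

# Total within-cluster sum of squares for k = 1, ..., 8
k.values <- 1:8
wss <- sapply(k.values, function(k) {
  kmeans(sim.scale, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which WSS stops dropping sharply
data.frame(k = k.values, wss = round(wss, 1))
```

Plotting `wss` against `k.values` gives the same kind of curve that `fviz_nbclust()` draws with `method = "wss"`.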
To gain a deeper understanding of each cluster’s characteristics, the means of selected variables (log_price, accommodates, bathrooms, amenities_count, and avg_review_scores_rating) were computed in their original metrics. The analysis revealed four distinct clusters of Brooklyn listings based on the selected features:
Luxury Living (Cluster 1): Listings with the highest prices (mean log_price 5.35), the largest capacity (6.5 guests), the most bathrooms (2.33), and a high amenity count (20.4).
Bare-Bone Bargains (Cluster 2): Listings with the lowest prices (4.16), the smallest capacity (1.96 guests), about one bathroom (1.11), and relatively few amenities (14.7).
Classic Comfort (Cluster 3): Listings with moderately high prices (4.92), a moderate capacity (3.92 guests), about one bathroom (1.03), the most amenities (21.7), and the lowest rating-per-review ratio (0.116), which suggests listings with many reviews.
Your Average Joes (Cluster 4): Listings with moderately low prices (4.38), a small capacity (2.30 guests), about one bathroom (1.11), the fewest amenities (13.5), and by far the highest rating-per-review ratio (0.938), which suggests listings with very few reviews.
In conclusion, the k-Means clustering analysis grouped Brooklyn listings into four segments with common characteristics and showed how those segments are spread across neighborhoods. The segments can serve as a resource for property investors, tourists, or urban planners, facilitating informed decision-making concerning Brooklyn’s various neighborhoods and their unique attributes.
- A new feature, avg_review_scores_rating, was created by dividing review_scores_rating by number_of_reviews to represent normalized review quality.
- Features retained: log_price, accommodates, bathrooms, amenities_count, avg_review_scores_rating, & neighbourhood (used only for labeling).

Using the elbow method on within-cluster sum of squares (WSS), the optimal number of clusters was determined to be k = 4.
The k-Means model was run with 100 random starting configurations to reduce the risk of settling on a poor local optimum. Clusters were visualized with scatterplots in which convex hulls outline each cluster.
| Cluster | log_price | accommodates | bathrooms | amenities_count | avg_review_scores_rating |
|---|---|---|---|---|---|
| 1 | 5.35 | 6.50 | 2.33 | 20.4 | 0.204 |
| 2 | 4.16 | 1.96 | 1.11 | 14.7 | 0.165 |
| 3 | 4.92 | 3.92 | 1.03 | 21.7 | 0.116 |
| 4 | 4.38 | 2.30 | 1.11 | 13.5 | 0.938 |
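A frequency table was generated to display how the listings in each Brooklyn neighborhood were distributed across the four clusters. Most large neighborhoods contain listings from every cluster. Bedford-Stuyvesant, for example, has 131, 793, 473, and 172 listings in clusters 1 through 4, while Park Slope’s largest group is cluster 3 (194 listings). The clusters therefore describe types of listings more than types of neighborhoods.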
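Raw counts are hard to compare across neighborhoods of different sizes, so it can help to convert the frequency table to within-neighborhood shares. The sketch below does this for three neighborhoods whose counts are copied from the table; the `counts` and `shares` objects are hypothetical names.

```r
# Cluster counts for three neighborhoods, copied from the frequency table
counts <- matrix(c(131,  57,  15,
                   793, 115, 142,
                   473, 194, 105,
                   172,  50,  34),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(cluster = 1:4,
                                 neighbourhood = c("Bedford-Stuyvesant",
                                                   "Park Slope",
                                                   "Williamsburg")))

# Share of each neighborhood's listings that falls in each cluster
# (margin = 2 makes every column sum to 1)
shares <- prop.table(counts, margin = 2)
round(shares, 2)

# Most common cluster within each neighborhood
top_cluster <- apply(shares, 2, which.max)
top_cluster
```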
Boxplots were created to visualize variable distributions by cluster:
- log_price: Revealed clear pricing tiers across
clusters.
- accommodates: Larger listings clustered into higher-price
groups.
- bathrooms: Cluster 1 stood out with significantly more
bathrooms.
- amenities_count: Cluster 3 had the most amenities.
- avg_review_scores_rating: Cluster 4 had the highest
average review rating per review count.
The k-Means clustering approach grouped Brooklyn listings into four segments based on rental characteristics. These clusters provide valuable insights for:
- Tourists: To target neighborhoods that fit their budget and preferences.
- Property Managers: To benchmark and align listings with similar offerings.
- Urban Planners: To understand diversity in housing types across Brooklyn.
This unsupervised learning method uncovered meaningful patterns that complement the classification models and enrich the overall understanding of Airbnb rental dynamics in New York City.
This project provided a comprehensive exploration of New York City Airbnb listings using a range of machine learning techniques—including regression, classification, and clustering—to extract actionable insights from complex real-world data. The results offer meaningful implications for multiple stakeholders:
By leveraging clustering insights, Airbnb can enhance its recommendation engine. For example, if a user is browsing a listing in Williamsburg, the platform can recommend other listings in neighborhoods with similar characteristics (e.g., amenities, price point), improving user satisfaction and boosting booking conversions.
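The recommendation idea can be sketched as a lookup of other neighborhoods that share a cluster with the one being browsed. The helper `similar_neighbourhoods()` and the cluster labels are hypothetical and only illustrate the logic; they do not reflect the analysis results or any Airbnb system.

```r
# Made-up cluster labels (illustration only, not the analysis results)
assign.df <- data.frame(
  neighbourhood = c("Williamsburg", "Greenpoint", "Bushwick",
                    "Park Slope", "Flatbush"),
  cluster = c(3, 3, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the other neighborhoods that share the
# cluster of `name`; an unknown name yields an empty character vector
similar_neighbourhoods <- function(name, assignments) {
  target <- assignments$cluster[assignments$neighbourhood == name]
  if (length(target) == 0) {
    return(character(0))
  }
  same <- assignments$cluster == target[1] &
    assignments$neighbourhood != name
  assignments$neighbourhood[same]
}

print(similar_neighbourhoods("Williamsburg", assign.df))
# [1] "Greenpoint"
```

A production recommender would presumably rank listings within those neighborhoods by price or amenities; this sketch stops at the cluster lookup.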
Regression and classification models enable owners to price their listings competitively based on key features (e.g., size, location, amenities). Informed by data, these models support smarter revenue management and higher occupancy rates.
Investors can identify high-opportunity clusters of neighborhoods through unsupervised learning, targeting areas aligned with preferred investment profiles (e.g., low price, high review ratings, high amenity density). Insights may also be extrapolated to other urban markets.
Data-driven segmentation can guide customers toward fairly priced listings with desirable features, helping them avoid overpaying or overlooking strong value options.
These models shed light on why customers choose Airbnb, often favoring flexibility, price, or space. Hotels can use this insight to adapt service offerings or pricing models to compete more effectively.
The project highlights key drivers of consumer behavior in the peer-to-peer rental market, providing a framework for future analysis in similar sectors or regions.
Regulatory agencies can use these findings to better understand the Airbnb ecosystem’s impact on housing, tourism, and urban development, aiding in policy formulation around zoning, taxation, and neighborhood preservation.
While focused on New York City, this project provides a reusable blueprint for exploring short-term rental dynamics in other metropolitan areas. The methodologies employed - such as price tier classification, cleaning fee prediction, and neighborhood clustering - can be replicated with local data to inform decision-making in other tourism-heavy cities.
Final Thoughts:
This data mining initiative bridges analytics with real-world impact. From pricing strategy to urban policy, the findings underscore how well-applied machine learning models can drive better decisions, deepen customer understanding, and ultimately create more efficient and equitable marketplaces.