New York City’s Airbnb market is one of the most active and varied short-term rental ecosystems in the world – shaped by neighborhood trends, pricing dynamics, and host behaviors. This data mining project explores that complexity using predictive modeling, classification, clustering, and visualization to uncover patterns that can guide hosts, guests, investors, and policymakers.

The analysis leverages real-world Airbnb data to uncover drivers of rental price variation, reveal booking behavior trends, and compare neighborhoods across the city – especially within Brooklyn. Through a mix of data wrangling, machine learning, and statistical analysis in R, this project delivers actionable insights and highlights opportunities for smarter, data-informed decision-making in the short-term rental space.

# Load required packages
## Data preparation
library(scales)
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(forcats)

## EDA
library(ggplot2)
library(ggcorrplot)

## Regression
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(ggcorrplot)

## Classification
library(caret)
## Loading required package: lattice
library(class)
library(e1071)
library(rpart)
library(rpart.plot)

## Clustering
library(dplyr)
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Read data into local environment
df <- read.csv("train.csv")

1. Data Preparation

Data Preprocessing

Before any predictive modeling, the data is preprocessed to ensure quality, completeness, and contextual richness in the Airbnb rental dataset for New York City.

# Subset data frame to include only those records pertaining to NYC
ny.df <- df[df$city == "NYC", ]

a. Dealing with Missing Values

# Convert blank cells to 'NA's
ny.df[ny.df == ""] <- NA

# Calculate number of 'NA' values in data frame
sum(is.na(ny.df))  
## [1] 34579
# Get percentage of rows that are "complete cases" (i.e., rows not missing any values) 
percent(sum(complete.cases(ny.df))/nrow(ny.df), accuracy = 0.01)
## [1] "55.20%"

Next, we will evaluate the count and proportion of ‘NA’ values for each variable to understand the potential impact of dropping or imputing these missing values. These insights will guide how we handle NA values (i.e., drop or impute) for each data mining task (e.g., prediction, classification, & clustering).

# Subset columns that contain "NA" values & get count of NA values in each column 
num_NAs <- colSums(is.na(ny.df))  

# Compute percent of NA values in each column
prop_NAs <- percent(num_NAs/nrow(ny.df), accuracy = 0.01)  

# Create df to store missing value counts for each variable
var_NAs.df <- data.frame(num_NAs, prop_NAs)

# Convert to table
var_NAs <- data.table(var_NAs.df, keep.rownames = TRUE) 
colnames(var_NAs) <- c("Variable", "Num. of NAs", "% NAs")

var_NAs  # print table
##                   Variable Num. of NAs  % NAs
##  1:                     id           0  0.00%
##  2:              log_price           0  0.00%
##  3:          property_type           0  0.00%
##  4:              room_type           0  0.00%
##  5:              amenities           0  0.00%
##  6:           accommodates           0  0.00%
##  7:              bathrooms          99  0.31%
##  8:               bed_type           0  0.00%
##  9:    cancellation_policy           0  0.00%
## 10:           cleaning_fee           0  0.00%
## 11:                   city           0  0.00%
## 12:            description           0  0.00%
## 13:           first_review        6858 21.20%
## 14:   host_has_profile_pic         176  0.54%
## 15: host_identity_verified         176  0.54%
## 16:     host_response_rate        9960 30.79%
## 17:             host_since         176  0.54%
## 18:       instant_bookable           0  0.00%
## 19:            last_review        6832 21.12%
## 20:               latitude           0  0.00%
## 21:              longitude           0  0.00%
## 22:                   name           0  0.00%
## 23:          neighbourhood           8  0.02%
## 24:      number_of_reviews           0  0.00%
## 25:   review_scores_rating        7321 22.63%
## 26:          thumbnail_url        2415  7.47%
## 27:                zipcode         446  1.38%
## 28:               bedrooms          47  0.15%
## 29:                   beds          65  0.20%
##                   Variable Num. of NAs  % NAs
# Handle missing values for numerical variables by imputing with median

# Convert 'host_response_rate' from character value to numerical
ny.df$host_response_rate <- as.numeric(sub("%", " ", ny.df$host_response_rate))

# Impute for missing values with median
ny.df$bathrooms[is.na(ny.df$bathrooms)] <- median(ny.df$bathrooms, na.rm = TRUE)
ny.df$host_response_rate[is.na(ny.df$host_response_rate)] <- median(ny.df$host_response_rate, na.rm = TRUE)
ny.df$review_scores_rating[is.na(ny.df$review_scores_rating)] <- median(ny.df$review_scores_rating, na.rm = TRUE)
ny.df$bedrooms[is.na(ny.df$bedrooms)] <- median(ny.df$bedrooms, na.rm = TRUE)
ny.df$beds[is.na(ny.df$beds)] <- median(ny.df$beds, na.rm = TRUE)

Note: The decision to impute missing values for numerical variables, such as ‘host_response_rate,’ ‘bathrooms,’ ‘review_scores_rating,’ ‘bedrooms,’ and ‘beds,’ with their respective medians serves the purpose of preserving the data’s central tendencies while reducing the potential impact of outliers.
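
As a quick sanity check (a minimal sketch, not part of the original pipeline), we can confirm that the imputed numeric columns no longer contain missing values:

# Verify that median imputation removed all NAs from the targeted numeric columns
imputed_cols <- c("bathrooms", "host_response_rate", "review_scores_rating", "bedrooms", "beds")
colSums(is.na(ny.df[, imputed_cols]))  # expect all zeros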

# Drop NA values for categorical variables that were not imputed
ny.df <- na.omit(ny.df)

b. Data Cleaning

# Subset values not equal to 0, as 0 is unrealistic for many numeric variables
ny.df <- ny.df %>% filter(log_price > 0)
ny.df <- ny.df %>% filter(accommodates > 0)
ny.df <- ny.df %>% filter(bathrooms > 0)
ny.df <- ny.df %>% filter(number_of_reviews > 0)
ny.df <- ny.df %>% filter(review_scores_rating > 0)
ny.df <- ny.df %>% filter(bedrooms > 0)
# Convert 'host_response_rate' & 'review_scores_rating' to decimal values that denote percentages
ny.df$host_response_rate <- round((ny.df$host_response_rate)/100, 2)
ny.df$review_scores_rating <- round((ny.df$review_scores_rating)/100, 2)
# Subset of valid 'zipcode' values
# all valid zipcodes include 5 digits
valid_zipcode <- nchar(ny.df$zipcode) == 5

# Filter dataframe to keep only those rows with valid zipcodes
ny.df <- ny.df[valid_zipcode, ]
# Convert 't'/'f' to 'True'/'False'
ny.df$host_identity_verified <- factor(ny.df$host_identity_verified, levels = c("t", "f"), labels = c("True", "False"))
ny.df$instant_bookable <- factor(ny.df$instant_bookable, levels = c("t", "f"), labels = c("True", "False"))

Summary of Data Preprocessing Steps

  • Focused on NYC listings by filtering to New York City entries.
  • Addressed missing values by:
    • Quantifying NAs across variables
    • Imputing numeric features (e.g., bathrooms, beds, review_scores_rating) using the median
    • Dropping incomplete or unrealistic observations (e.g., zero bedrooms, zero reviews)
  • Converted percentage fields to decimals (e.g., host_response_rate)
  • Standardized categorical values (e.g., Boolean labels, validated zip codes)
  • Simplified noisy categorical levels (e.g., bed_type, rare cancellation_policy categories) in the feature engineering step that follows

Feature Engineering

Simplify categorical variables:

# Group less common levels of 'bed_type' into a single column called 'Other' 
# (keep only the most frequently occurring type of bed)
ny.df$bed_type <- fct_lump(ny.df$bed_type, 1)

# Combine levels 'super_strict_30' & 'super_strict_60' from the variable 'cancellation_policy' 
# & create new category 'super_strict'
ny.df$cancellation_policy <- fct_other(ny.df$cancellation_policy, keep = c("flexible", "moderate", "strict"),
                                       other_level = "super_strict")

Define new variable ‘borough’:

# Import dataset for mapping zipcodes to boroughs
borough_df <- read.csv("nyc_zip_borough_neighborhoods_pop.csv")

# Inspect dataset
str(borough_df)
## 'data.frame':    177 obs. of  6 variables:
##  $ zip         : int  10001 10002 10003 10004 10005 10006 10007 10009 10010 10011 ...
##  $ borough     : chr  "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
##  $ post_office : chr  "New York, NY" "New York, NY" "New York, NY" "New York, NY" ...
##  $ neighborhood: chr  "Chelsea and Clinton" "Lower East Side" "Lower East Side" "Lower Manhattan" ...
##  $ population  : int  21102 81410 56024 3089 7135 3011 6988 61347 31834 50984 ...
##  $ density     : int  33959 92573 97188 5519 97048 32796 42751 99492 81487 77436 ...

Note: The dataset with the zipcode-to-borough mappings was downloaded/imported from the link below … Zipcode to NYC Borough Mappings Dataset

# Subset necessary variables
borough_zip_df <- borough_df[, c("zip", "borough")]

# Convert 'zip' to character type
borough_zip_df$zip <- as.character(borough_zip_df$zip)

# Merge new dataset with Airbnb dataframe based on 'zipcode' variable
ny.df <- merge(ny.df, borough_zip_df, by.x = "zipcode", by.y = "zip", all.x = TRUE)

# Replace missing 'borough' values with "Other"
ny.df$borough[is.na(ny.df$borough)] <- "Other"

# Drop 'zipcode' from dataset
ny.df <- subset(ny.df, select = -c(zipcode))

Note: The ‘zipcode’ variable was dropped to avoid redundancy and potential confusion, as it is numeric in form but categorical in nature. The location of a listing is better represented by variables like ‘borough’ and ‘neighbourhood’.

Define new variables ‘amenities_list’ & ‘amenities_count’:

# Define list of amenities & count of amenities for each listing
ny.df <- ny.df %>%
  mutate(amenities_list = strsplit(amenities, ",")) %>%
  mutate(amenities_count = lengths(amenities_list))

# Drop now redundant 'amenities_list' variable
ny.df <- subset(ny.df, select = -c(amenities_list))

Summary of Feature Engineering

  • Simplified categorical variables by lumping rare bed_type and cancellation_policy levels into broader categories.
  • Mapped NYC zip codes to boroughs using an external dataset to enrich geographic analysis.
  • Generated a new feature, amenities_count, by parsing the amenities list for each listing.
  • Created a simplified property_group variable (defined later, just before the proportional bar chart) to reduce dimensionality and improve interpretability in visual and predictive models.

2. Exploratory Data Analysis (EDA)

a. Summary Statistics

# Drop any rows that still contain NA values after merging and feature engineering
ny.df <- na.omit(ny.df)

Identify numeric & categorical variables:

# Subset numeric columns
num_var <- ny.df[, sapply(ny.df, is.numeric)]

# Subset remaining (categorical) columns
cat_var <- ny.df[, !(names(ny.df) %in% names(num_var))]
# Drop unique identifier columns from subsets
num_var <- subset(num_var, select = -c(id))
cat_var <- subset(cat_var, select = -c(description, name, thumbnail_url))
# Get summary stats for numerical variables
summary(num_var)
##    log_price      accommodates     bathrooms     host_response_rate
##  Min.   :2.303   Min.   : 1.00   Min.   :0.500   Min.   :0.0000    
##  1st Qu.:4.174   1st Qu.: 2.00   1st Qu.:1.000   1st Qu.:1.0000    
##  Median :4.605   Median : 2.00   Median :1.000   Median :1.0000    
##  Mean   :4.676   Mean   : 2.96   Mean   :1.142   Mean   :0.9643    
##  3rd Qu.:5.106   3rd Qu.: 4.00   3rd Qu.:1.000   3rd Qu.:1.0000    
##  Max.   :7.600   Max.   :16.00   Max.   :5.500   Max.   :1.0000    
##     latitude       longitude      number_of_reviews review_scores_rating
##  Min.   :40.51   Min.   :-74.24   Min.   :  1.00    Min.   :0.2000      
##  1st Qu.:40.69   1st Qu.:-73.98   1st Qu.:  3.00    1st Qu.:0.9100      
##  Median :40.73   Median :-73.95   Median :  9.00    Median :0.9600      
##  Mean   :40.73   Mean   :-73.95   Mean   : 23.49    Mean   :0.9355      
##  3rd Qu.:40.77   3rd Qu.:-73.93   3rd Qu.: 29.00    3rd Qu.:1.0000      
##  Max.   :40.90   Max.   :-73.72   Max.   :465.00    Max.   :1.0000      
##     bedrooms           beds        amenities_count
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:13.00  
##  Median : 1.000   Median : 1.000   Median :16.00  
##  Mean   : 1.282   Mean   : 1.639   Mean   :17.36  
##  3rd Qu.: 1.000   3rd Qu.: 2.000   3rd Qu.:21.00  
##  Max.   :10.000   Max.   :18.000   Max.   :77.00
# Standard deviation of each variable to better understand its overall distribution
options(scipen = 999)
col.sd <- apply(num_var, 2, sd)
col.sd  # print 'sd' object (vector of sd values for each column)
##            log_price         accommodates            bathrooms 
##           0.65846055           1.91204701           0.40526714 
##   host_response_rate             latitude            longitude 
##           0.11874684           0.05767564           0.04550813 
##    number_of_reviews review_scores_rating             bedrooms 
##          35.36204117           0.08135042           0.65116126 
##                 beds      amenities_count 
##           1.13269362           7.37838977
# Interquartile range for numerical variables
col.iqr <- apply(num_var, 2, IQR)
col.iqr
##            log_price         accommodates            bathrooms 
##           0.93155820           2.00000000           0.00000000 
##   host_response_rate             latitude            longitude 
##           0.00000000           0.07951180           0.04903957 
##    number_of_reviews review_scores_rating             bedrooms 
##          26.00000000           0.09000000           0.00000000 
##                 beds      amenities_count 
##           1.00000000           8.00000000
# Variance of numerical variables
col.var <- apply(num_var, 2, var)
col.var
##            log_price         accommodates            bathrooms 
##          0.433570292          3.655923771          0.164241452 
##   host_response_rate             latitude            longitude 
##          0.014100813          0.003326480          0.002070990 
##    number_of_reviews review_scores_rating             bedrooms 
##       1250.473955575          0.006617891          0.424010993 
##                 beds      amenities_count 
##          1.282994844         54.440635593
# Create correlation matrix
corr <- round(cor(num_var), 2)
corr
##                      log_price accommodates bathrooms host_response_rate
## log_price                 1.00         0.60      0.23              -0.01
## accommodates              0.60         1.00      0.37               0.01
## bathrooms                 0.23         0.37      1.00               0.01
## host_response_rate       -0.01         0.01      0.01               1.00
## latitude                  0.06        -0.05     -0.07              -0.01
## longitude                -0.35        -0.05     -0.01               0.01
## number_of_reviews         0.03         0.10     -0.01               0.04
## review_scores_rating      0.06        -0.04     -0.01               0.05
## bedrooms                  0.49         0.74      0.45               0.01
## beds                      0.48         0.84      0.39               0.02
## amenities_count           0.22         0.26      0.11               0.06
##                      latitude longitude number_of_reviews review_scores_rating
## log_price                0.06     -0.35              0.03                 0.06
## accommodates            -0.05     -0.05              0.10                -0.04
## bathrooms               -0.07     -0.01             -0.01                -0.01
## host_response_rate      -0.01      0.01              0.04                 0.05
## latitude                 1.00      0.08              0.01                -0.02
## longitude                0.08      1.00              0.01                -0.03
## number_of_reviews        0.01      0.01              1.00                -0.02
## review_scores_rating    -0.02     -0.03             -0.02                 1.00
## bedrooms                -0.08     -0.03              0.02                -0.02
## beds                    -0.06     -0.03              0.09                -0.04
## amenities_count          0.00     -0.01              0.14                 0.11
##                      bedrooms  beds amenities_count
## log_price                0.49  0.48            0.22
## accommodates             0.74  0.84            0.26
## bathrooms                0.45  0.39            0.11
## host_response_rate       0.01  0.02            0.06
## latitude                -0.08 -0.06            0.00
## longitude               -0.03 -0.03           -0.01
## number_of_reviews        0.02  0.09            0.14
## review_scores_rating    -0.02 -0.04            0.11
## bedrooms                 1.00  0.75            0.17
## beds                     0.75  1.00            0.23
## amenities_count          0.17  0.23            1.00
# Summarize ordinal categorical variables 

# generate the total number of observations belonging to each level
ordinal.cat.sum <- table(cat_var$cancellation_policy)
ordinal.cat.sum
## 
##     flexible     moderate       strict super_strict 
##         3735         4050         7801            3
# Summarize nominal/binary nominal categorical variables 

# generate the total number of observations belonging to each class
nominal.cat.sum <- apply(subset(cat_var, select = - c(cancellation_policy, first_review, last_review, host_since, 
                                                      amenities)), 2, table)
nominal.cat.sum
## $property_type
## 
##          Apartment    Bed & Breakfast               Boat     Boutique hotel 
##              13022                 46                  2                  5 
##           Bungalow              Cabin             Castle             Chalet 
##                  9                  1                  1                  2 
##        Condominium               Dorm        Guest suite         Guesthouse 
##                194                 10                 18                 12 
##             Hostel              House               Loft              Other 
##                  4               1561                251                 88 
## Serviced apartment          Timeshare          Townhouse      Vacation home 
##                  1                 12                337                  3 
##              Villa 
##                 10 
## 
## $room_type
## 
## Entire home/apt    Private room     Shared room 
##            7370            7807             412 
## 
## $bed_type
## 
##    Other Real Bed 
##      450    15139 
## 
## $cleaning_fee
## 
## False  True 
##  3822 11767 
## 
## $city
## 
##   NYC 
## 15589 
## 
## $host_has_profile_pic
## 
##     f     t 
##    26 15563 
## 
## $host_identity_verified
## 
## False  True 
##  5009 10580 
## 
## $instant_bookable
## 
## False  True 
## 11352  4237 
## 
## $neighbourhood
## 
##                      Allerton                 Alphabet City 
##                             8                            57 
##                      Annadale                       Astoria 
##                             1                           593 
##                    Bath Beach             Battery Park City 
##                             6                            11 
##                     Bay Ridge                    Baychester 
##                            64                            21 
##                       Bayside            Bedford-Stuyvesant 
##                            30                          1569 
##                  Bedford Park                       Belmont 
##                            12                             3 
##                   Bensonhurst                  Bergen Beach 
##                            25                             1 
##                   Boerum Hill                  Borough Park 
##                            87                            17 
##                Brighton Beach                     Bronxdale 
##                            18                            12 
##                      Brooklyn              Brooklyn Heights 
##                             7                            63 
##            Brooklyn Navy Yard                   Brownsville 
##                            31                            18 
##                      Bushwick                      Canarsie 
##                          1096                             2 
##               Carroll Gardens                  Castle Hill  
##                             4                             4 
##                     Chinatown                   City Island 
##                            77                             6 
##                  Civic Center                  Clinton Hill 
##                             3                           113 
##                 College Point    Columbia Street Waterfront 
##                             1                             1 
##                  Coney Island                        Corona 
##                             2                            11 
##                  Country Club                       Crotona 
##                             1                             4 
##                 Crown Heights            Ditmars / Steinway 
##                            11                            85 
##                  Dongan Hills             Downtown Brooklyn 
##                             1                            21 
##                         DUMBO                 East Flatbush 
##                            11                            29 
##                   East Harlem                  East Village 
##                             2                            89 
##                   Eastchester                      Edenwald 
##                             9                             3 
##                      Elm Park                      Elmhurst 
##                             4                             6 
##                   Eltingville                  Emerson Hill 
##                             3                             1 
##            Financial District                      Flatbush 
##                           158                           292 
##             Flatiron District                     Flatlands 
##                            91                            28 
##                      Flushing                       Fordham 
##                           143                             9 
##                  Forest Hills                   Fort Greene 
##                            66                           168 
##                Fort Wadsworth                 Fresh Meadows 
##                             1                             3 
##                      Glendale                       Gowanus 
##                            13                            78 
##                 Gramercy Park                  Graniteville 
##                           120                             1 
##                     Gravesend                   Great Kills 
##                            23                             3 
##                    Greenpoint             Greenwich Village 
##                           469                           177 
##             Greenwood Heights                   Grymes Hill 
##                            68                             3 
##              Hamilton Heights                        Harlem 
##                           453                           915 
##                Hell's Kitchen                    Highbridge 
##                           871                            14 
##                     Hillcrest                  Howard Beach 
##                             5                             4 
##                 Hudson Square                   Hunts Point 
##                            31                             3 
##                        Inwood               Jackson Heights 
##                            77                            74 
##                       Jamaica                    Kensington 
##                           186                            83 
##              Kew Garden Hills                   Kingsbridge 
##                            19                            12 
##           Kingsbridge Heights                      Kips Bay 
##                            11                           171 
##               Lefferts Garden               Lighthouse HIll 
##                           222                             1 
##                    Lindenwood                  Little Italy 
##                             3                            31 
##              Long Island City                      Longwood 
##                            90                             6 
##               Lower East Side                     Manhattan 
##                           461                             7 
##               Manhattan Beach                   Marble Hill 
##                             6                             6 
##               Mariners Harbor                       Maspeth 
##                             2                            24 
##          Meatpacking District                       Melrose 
##                             7                             3 
##                Middle Village                 Midland Beach 
##                            12                             5 
##                       Midtown                  Midtown East 
##                           159                           223 
##                       Midwood                    Mill Basin 
##                            54                             1 
##           Morningside Heights                Morris Heights 
##                           152                             5 
##                   Morris Park                    Morrisania 
##                             3                             3 
##                    Mott Haven                   Murray Hill 
##                            29                           103 
##                  New Brighton               New Springville 
##                             3                             1 
##                          Noho                        Nolita 
##                            32                           139 
##                       Norwood                       Oakwood 
##                             8                             1 
##                    Ozone Park                    Park Slope 
##                            16                           416 
##               Park Versailles                   Parkchester 
##                            11                            11 
##                    Pelham Bay                   Port Morris 
##                             9                             8 
##                 Port Richmond              Prospect Heights 
##                             2                           184 
##                        Queens                 Randall Manor 
##                             1                             3 
##                      Red Hook                     Rego Park 
##                            29                            42 
##                 Richmond Hill                     Ridgewood 
##                            43                           165 
##                     Riverdale              Roosevelt Island 
##                             4                            40 
##                      Rosebank                     Rossville 
##                             3                             1 
##                      Sea Gate                Sheepshead Bay 
##                             5                            48 
##                          Soho                     Soundview 
##                           148                             5 
##                   South Beach              South Ozone Park 
##                             9                            13 
##          South Street Seaport                Spuyten Duyvil 
##                             6                             4 
##                    St. George                     Stapleton 
##                            27                             9 
##                     Sunnyside                   Sunset Park 
##                           149                            97 
##                     The Bronx                 The Rockaways 
##                             1                            86 
##                   Throgs Neck Times Square/Theatre District 
##                             4                            73 
##                 Tompkinsville                   Tottenville 
##                             9                             3 
##                       Tremont                       Tribeca 
##                             7                            60 
##                  Union Square            University Heights 
##                             9                            14 
##               Upper East Side               Upper West Side 
##                           701                           842 
##                        Utopia                      Van Nest 
##                             4                             2 
##                  Vinegar Hill                     Wakefield 
##                             6                            10 
##            Washington Heights                 West Brighton 
##                           474                            16 
##                    West Farms                  West Village 
##                             2                           345 
##                   Westerleigh                    Whitestone 
##                             2                             5 
##                Williamsbridge                  Williamsburg 
##                            10                           333 
##               Windsor Terrace                     Woodhaven 
##                            79                            22 
##                      Woodlawn                      Woodside 
##                             2                            61 
## 
## $borough
## 
##         Bronx      Brooklyn     Manhattan         Other        Queens 
##           300          5846          7311            41          1976 
## Staten Island 
##           115

Note: Some variables were excluded from this nominal summary. For example, ‘cancellation_policy’ was excluded because it is ordinal and is summarized separately above. Other variables, such as ‘first_review’ and ‘host_since’, were removed due to the large number of unique date values they contain.


Recap of Summary Statistics

Conducted statistical profiling for all numeric and categorical features:
- Examined means, medians, variances, standard deviations, and IQRs to understand distribution and skew.
- Created a correlation matrix to identify relationships between features (e.g., strong positive correlation between accommodates, beds, and log_price).

These metrics provide valuable insight into Airbnb’s market structure, guiding modeling choices and stakeholder recommendations.

Let’s delve deeper into how this information can be used by both Airbnb users and management:

1. Pricing Insights:

  • Average and Standard Deviation of Price: By analyzing the average and standard deviation of prices, Airbnb users can get a sense of the typical price range for listings in New York City. Additionally, breaking down these statistics by room type (e.g., entire home, private room, shared room) allows users to understand which types of accommodations are more budget-friendly or luxurious.

2. Accommodation Preferences:

  • Median Number of Bedrooms: For travelers looking for more space, knowing the median number of bedrooms can help them identify listings that meet their requirements. On the management side, this information can guide property investments and renovations, ensuring that accommodations match market demand.

3. Correlation Insights:

  • Positive Correlations: Understanding strong positive correlations, such as between ‘log_price’ and ‘accommodates,’ can be valuable for Airbnb users. It suggests that as the number of people a property accommodates increases, so does the price. This information helps users make informed decisions when selecting properties based on their group size.
  • Variable Reduction: For Airbnb management, identifying correlated variables can aid in variable reduction for predictive modeling. By eliminating highly correlated predictors, they can build more efficient and interpretable models (a short sketch of this idea follows below).
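
As an illustration of that variable-reduction idea, a minimal sketch using caret::findCorrelation on the numeric summary subset defined earlier; the 0.75 cutoff is an arbitrary choice for illustration, not a value used elsewhere in this project:

# Flag numeric predictors whose pairwise correlations exceed the cutoff
predictors_only <- subset(num_var, select = -c(log_price))
high_corr <- caret::findCorrelation(cor(predictors_only), cutoff = 0.75, names = TRUE)
high_corr  # candidates to consider dropping before modeling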

4. Categorical Variables:

  • Categorization of Listings: The summary of categorical variables, like ‘property_type,’ ‘room_type,’ and ‘neighborhood,’ provides insights into the diversity of Airbnb listings in NYC. This information helps users narrow down their choices based on their preferences and requirements.
  • Host Verification: Airbnb management can use the summary of ‘host_identity_verified’ to evaluate the trustworthiness of hosts, which can be a crucial factor in attracting guests.

5. Decision Support for Airbnb Management:

  • Pricing Strategy: Management can adjust pricing strategies based on the average price and standard deviation of prices. For example, they can offer promotional rates during low-demand seasons to attract more bookings.
  • Property Investment: Data on the median number of bedrooms can inform property investment decisions. If there’s a high demand for larger accommodations, management may consider acquiring or developing properties with more bedrooms.
  • Marketing and Targeting: Understanding the popularity of room types or neighborhoods can help in marketing efforts. Management can target specific demographics or interests to increase occupancy rates in certain areas.

In summary, these summary statistics go beyond mere data description; they empower Airbnb users to make informed booking decisions and offer valuable insights for management to optimize their property listings and pricing strategies. These insights can ultimately lead to improved guest experiences and increased revenue for hosts and Airbnb itself.


b. Data Visualization

Faceted Bar Chart

ggplot(ny.df, aes(x= room_type, fill = room_type)) + geom_bar(color = "black", alpha = 0.7) + 
labs(title = "Airbnb Rental Room Types in NYC by Cleaning Fee Policy", x = NULL, y = "# of Airbnb Rental Listings") + 
theme(axis.title = element_text(size = 12), legend.position = "bottom") + scale_x_discrete(labels = NULL) + 
facet_wrap(~cleaning_fee, 
           labeller = labeller(cleaning_fee = c(
               "True" = "With Cleaning Fee", "False" = "Without Cleaning Fee"))) + 
scale_fill_discrete(name = "Room Type") 

Our first visualization is a faceted bar chart that meticulously dissects Airbnb rental room types in NYC based on their cleaning fee policies. By segmenting the data in this manner, we reveal valuable insights that can help both hosts and guests. For Airbnb management, this information can aid in setting competitive pricing strategies and policies for different room types. Guests can benefit from this knowledge by making more informed decisions about accommodation based on their preferences and budget. This plot uncovers that cleaning fees are prevalent, especially for ‘entire home/apartment’ listings. Such insights can guide both hosts and guests in negotiations and bookings.
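
To put a number on that pattern, a small sketch (using the ny.df frame prepared above) cross-tabulating cleaning-fee policy by room type:

# Share of listings that charge a cleaning fee, by room type (rows sum to 1)
round(prop.table(table(ny.df$room_type, ny.df$cleaning_fee), margin = 1), 2)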

Histogram

ggplot(ny.df, aes(x= accommodates)) + geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) + 
labs(title = "Distribution of Accommodation Capacity in NYC Airbnb Rentals", 
     x = "Number of Accomodated Guests", y= "# of Airbnb Rental Listings") + theme_minimal() + 
theme(axis.text = element_text(size = 11), axis.title = element_text(size = 12)) 

The second visualization, a histogram showcasing accommodation capacity distribution, lays bare the preferences of Airbnb renters in New York City. For hosts, this is a goldmine of information, allowing them to tailor their listings to the most sought-after capacities, thereby optimizing occupancy rates and revenue. For guests, this histogram is a powerful tool for finding the perfect match based on group size. It shows that most listings can comfortably host around 2 guests, but it also highlights the availability of properties for larger groups. This revelation aids in decision-making for both hosts and guests.

Heat Map/Correlation Matrix

ggcorrplot(corr, lab = TRUE, lab_size = 2, title = "Correlation Heatmap of NYC\nAirbnb Rental Property Data") + theme(plot.title = element_text(size = 13), axis.text.x = element_text(size = 9), axis.text.y = element_text(size = 9))

Our third visualization, the correlation heatmap, delivers a deeper understanding of how the various numerical variables relate to one another. For decision-makers in the Airbnb ecosystem, this plot offers predictive potential. Strong correlations of ‘beds,’ ‘bedrooms,’ and ‘accommodates’ with ‘log_price’ can be valuable for pricing optimization. Meanwhile, the negative correlation between ‘longitude’ and ‘log_price’ suggests that location significantly influences rental prices. Hosts can set competitive prices, and guests can better assess property values based on this knowledge.

Proportional Bar Chart

# Define new variable 'property_group' that groups property types
# (Goal = limit num. of levels)
ny.df$property_group <- ifelse(
  ny.df$property_type %in% c("Guesthouse", "Guest suite", "In-law"),
  "Guest suite/In-law",
  ifelse(
    ny.df$property_type %in% c("Boutique hotel", "Dorm", "Hostel", "Serviced apartment", "Timeshare"),
    "Accommodation",
    ifelse(
      ny.df$property_type %in% c("Boat", "Bungalow", "Cabin", "Castle", "Chalet",
                                 "Earth House", "Tent", "Vacation home", "Villa", "Yurt"),
      "Specialty",
      as.character(ny.df$property_type)
    )
  )
)
# Create Stacked Bar Chart
ggplot(ny.df, aes(x= accommodates, fill = property_group)) + 
geom_bar(position = "fill", color = "black", alpha = 0.7) + 
labs(title = "Proportion of Airbnb Listings in NYC by Accommodation\nCapacity & Property Type", 
     x = "Accommodation Capacity", y = "Proportion of Airbnb Listings") + 
scale_fill_discrete(name = "Property Type") + theme_minimal() +  
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 11))

The fourth visualization, a stacked bar chart, guides Airbnb management and users in understanding the distribution of property types and their capacity. This chart is an invaluable resource for hosts to fine-tune their listings based on property type and group size. Apartments emerge as the dominant choice, particularly for smaller parties. Houses, on the other hand, become more appealing for larger groups. Airbnb users can capitalize on this knowledge to make well-informed booking decisions.

Histogram #2

ggplot(ny.df, aes(x= log_price)) + geom_histogram(binwidth = 1, fill = "orange", color = "black", alpha = 0.7) + 
labs(title = "Distribution of Log Prices for NYC Airbnb Rentals", 
     x = "Log Price ($)", y= "# of Airbnb Rental Listings") + theme_minimal() + 
theme(axis.text = element_text(size = 11), axis.title = element_text(size = 12)) 

Our fifth visualization is another histogram, this time focusing on the distribution of log-transformed prices. This transformed scale can unveil hidden pricing trends or clusters that are not immediately apparent. For both hosts and guests, this histogram offers deeper insights into the nuanced price dynamics of Airbnb rentals in NYC.
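
Since the prices are on a log scale, a one-line sketch back-transforming the median makes the typical nightly rate easier to interpret (this assumes the natural log was used, as the exp() back-transform implies):

# Back-transform the median log price to dollars for interpretability
exp(median(ny.df$log_price))  # roughly $100 per night, given the median log price of ~4.6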

Scatterplot

ggplot(ny.df, aes(x= bedrooms, y= log_price, color = room_type, size = accommodates)) + 
geom_point(na.rm= TRUE, alpha = 0.7) + xlim(1, 8) + 
labs(title= "Number of Bedrooms vs. Log Price for NYC Airbnb Rentals, by Room Type & Accommodation Capacity", 
     x= "# of Bedrooms", y= "Log Price ($)") + scale_color_discrete(name = "Room Type") + 
scale_size_continuous(name = "Accommodation Capacity") + theme_minimal() + 
theme(axis.text = element_text(size = 11), axis.title = element_text(size = 12), 
      legend.text = element_text(size = 10))                                                     

Lastly, our sixth plot, a scatterplot, delves into the intricate relationship between several variables: number of bedrooms, log-transformed prices, room types, and accommodation capacity. This visualization empowers both hosts and guests to decipher how these factors interact and influence rental prices. Notably, it sheds light on the price variations tied to room type and capacity, offering actionable insights for optimizing pricing strategies and booking decisions.

Together, these insightful visualizations equip Airbnb stakeholders with a wealth of information, enabling them to make data-driven decisions that enhance the Airbnb experience in New York City. Whether you’re a host seeking to maximize revenue or a guest in pursuit of the perfect stay, these visualizations are your compass in navigating the NYC Airbnb landscape.


Recap of Data Visualization

  • Faceted Bar Chart: Shows the distribution of room types based on cleaning fee policies. Entire home/apartment listings are far more likely to charge a cleaning fee—insightful for pricing strategies and user budgeting.
  • Histogram – ‘Accommodates’: Reveals that most listings host 2 guests, suggesting a strong market for couples and solo travelers, with some availability for larger groups.
  • Correlation Heatmap: Highlights strong correlations among ‘bedrooms’, ‘beds’, ‘accommodates’, and ‘log_price’. Negative correlation between ‘log_price’ and ‘longitude’ suggests pricing shifts by location.
  • Proportional Bar Chart – Property Type by Capacity: Indicates that apartments dominate for small groups, while houses and specialty listings become more common for larger guest counts.
  • Histogram – ‘Log Price’: Helps normalize the price distribution and exposes multi-modal pricing behavior.
  • Scatterplot – ‘Bedrooms’ vs. ‘Price’: Visualizes how ‘log_price’ varies with bedrooms, room type, and guest capacity – revealing meaningful price tiering by listing type and size.

3. Predictive Modeling - Estimating Airbnb Rental Prices in NYC

The goal of this phase is to develop a Multiple Linear Regression (MLR) model to predict the log-transformed prices of Airbnb listings in New York City. This process involves careful variable selection, model refinement, and performance evaluation to uncover the key factors that influence pricing decisions on the platform.

a. Variable Selection/Dimension Reduction

ny.df_reg <- subset(ny.df, select = -c(city, description, first_review, host_has_profile_pic, 
                                       host_since, id, last_review, name, neighbourhood, property_type, 
                                       thumbnail_url, beds, amenities))  

Note: The decisions to remove the aforementioned variables from the dataset are described in more detail below:
  • ‘city’: all observations in this subset of the original dataset have “NYC” as the value of the ‘city’ column.
  • ‘neighbourhood’: redundant for describing the location of a listing, which ‘latitude’, ‘longitude’, and ‘borough’ capture more precisely (‘zipcode’ was already dropped during feature engineering).
  • ‘id’, ‘name’, & ‘description’: the values of these variables are unique to each observation.
  • ‘property_type’: now redundant, as the new ‘property_group’ variable consolidates the less common property types into broader categories.
  • ‘host_has_profile_pic’: the categorical summary above shows that nearly all NYC listings have a host profile picture (only 26 of 15,589 do not), so the variable carries little information.
  • ‘first_review’, ‘last_review’, & ‘host_since’: these dates may speak to the recency of activity or host tenure but are unlikely to directly affect pricing decisions.
  • ‘thumbnail_url’: a web link to a listing image is unlikely to influence the price of a listing and offers no meaningful modeling insight.
  • ‘beds’: redundant given the ‘bedrooms’ & ‘accommodates’ variables.
  • ‘amenities’: the raw text field is redundant; ‘amenities_count’ quantifies a listing’s amenities in a form the model can use.

# Partition the data into training (60%) & validation (40%) sets
set.seed(1)

# Sample 60% of the data, which we will assign to the training data set
train.index <- sample(c(1:nrow(ny.df_reg)), nrow(ny.df_reg)*0.6)  

# Assign 60% of the data that we just sampled to training set
train.df <- ny.df_reg[train.index, ]  

# Assign remaining 40% of the data to validation set
valid.df <- ny.df_reg[-train.index, ] 

Identify numeric & categorical predictor variables:

# Subset numeric columns from training set while excluding 'log_price'
num_predictors <- train.df[, !(names(train.df) %in% 'log_price') & sapply(train.df, is.numeric)]

# Subset remaining (categorical) columns from training set while excluding 'log_price'
cat_predictors <- train.df[, !(names(train.df) %in% c(names(num_predictors), 'log_price'))]

Check for multicollinearity issues:

# Calculate the correlation matrix with numeric variables
reg_cor_matrix <- cor(train.df[sapply(train.df, is.numeric)])

# Visualize correlation matrix
ggcorrplot(reg_cor_matrix, lab = TRUE, lab_size = 2.5, title = "Correlation Heatmap of NYC Airbnb\nRental Property Data") + theme(plot.title = element_text(size = 13), axis.text.x = element_text(size = 9), axis.text.y = element_text(size = 9))

Multicollinearity occurs when the input variables are highly correlated, making it challenging to distinguish the unique contribution of each variable to the model and decreasing the reliability of the model output. Note that two of the input variables are strongly correlated with one another: ‘bedrooms’ and ‘accommodates’ (r ≈ 0.74).

Nonetheless, the criterion for selecting variables to drop in the revised model was based on both their correlation with the output variable ‘log_price,’ as well as their correlation with each other. Given that ‘accommodates’ has a stronger correlation to ‘log_price’ than ‘bedrooms’, it might be wise to keep the accommodates variable and drop the ‘bedrooms’ variable in an effort to avoid multicollinearity issues. This choice ensures that we retain the most influential variables while eliminating unnecessary redundancy in the input features, ultimately improving the model’s performance and robustness.
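
For a more formal check, variance inflation factors can be computed on a quick linear fit of the numeric predictors. This is a sketch only: it assumes the ‘car’ package is installed (it is not loaded elsewhere in this project), and a VIF above roughly 5–10 is the usual warning sign.

# Variance inflation factors for the numeric predictors (exploratory sketch)
library(car)
vif_fit <- lm(log_price ~ accommodates + bathrooms + bedrooms + amenities_count +
                review_scores_rating + number_of_reviews + host_response_rate, data = train.df)
vif(vif_fit)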


Summary of Variable Selection

To build a clean and interpretable model, several features are removed based on redundancy, irrelevance, or data quality concerns:
- Dropped variables included unique identifiers (‘id’, ‘name’, ‘description’), rarely informative or highly sparse features (‘host_has_profile_pic’, ‘thumbnail_url’, ‘first_review’, ‘last_review’), and redundant attributes (‘beds’, ‘amenities’, ‘neighbourhood’, etc.).
- A new feature, ‘property_group’, had already been engineered to simplify property types into broader categories, so ‘property_type’ was removed.
- Multicollinearity was evaluated using a correlation matrix. Highly correlated predictors such as ‘bedrooms’ and ‘accommodates’ were analyzed, with ‘accommodates’ retained (and ‘bedrooms’ dropped in the refined model) due to its stronger association with price.

b. Model Training & Evaluation

Initial MLR Model

# Run MLR of 'log_price' on all the predictors in the training set

# Note: all binary nominal categorical variables will automatically be converted into dummy variables with 'm-1' dummies

mlr.model <- lm(log_price ~ ., data = train.df)
options(digits = 3, scipen = 999)
mlr_summary <- summary(mlr.model)
mlr_summary
## 
## Call:
## lm(formula = log_price ~ ., data = train.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.776 -0.220 -0.005  0.205  2.498 
## 
## Coefficients:
##                                     Estimate  Std. Error t value
## (Intercept)                      -152.995926   11.985552  -12.77
## room_typePrivate room              -0.555130    0.009650  -57.52
## room_typeShared room               -0.857738    0.025088  -34.19
## accommodates                        0.078572    0.003438   22.86
## bathrooms                           0.118473    0.010769   11.00
## bed_typeOther                      -0.030669    0.022912   -1.34
## cancellation_policymoderate         0.009103    0.010964    0.83
## cancellation_policystrict           0.028098    0.010113    2.78
## cancellation_policysuper_strict     0.008489    0.363076    0.02
## cleaning_feeTrue                    0.019552    0.009340    2.09
## host_identity_verifiedFalse        -0.007571    0.008336   -0.91
## host_response_rate                 -0.046347    0.032655   -1.42
## instant_bookableFalse               0.024957    0.008644    2.89
## latitude                           -1.289501    0.110798  -11.64
## longitude                          -2.837563    0.133066  -21.32
## number_of_reviews                  -0.000332    0.000114   -2.91
## review_scores_rating                0.313404    0.046890    6.68
## bedrooms                            0.103285    0.009073   11.38
## boroughBrooklyn                    -0.201629    0.033654   -5.99
## boroughManhattan                    0.240371    0.030751    7.82
## boroughOther                        0.220213    0.076684    2.87
## boroughQueens                      -0.014867    0.032333   -0.46
## boroughStaten Island               -0.945791    0.061140  -15.47
## amenities_count                     0.005530    0.000546   10.12
## property_groupApartment            -0.266317    0.079356   -3.36
## property_groupBed & Breakfast       0.043433    0.105730    0.41
## property_groupCondominium          -0.029238    0.085598   -0.34
## property_groupGuest suite/In-law   -0.313871    0.122832   -2.56
## property_groupHouse                -0.253748    0.080279   -3.16
## property_groupLoft                 -0.017377    0.084315   -0.21
## property_groupOther                -0.230229    0.093309   -2.47
## property_groupSpecialty            -0.119248    0.125325   -0.95
## property_groupTownhouse            -0.233237    0.083190   -2.80
##                                              Pr(>|t|)    
## (Intercept)                      < 0.0000000000000002 ***
## room_typePrivate room            < 0.0000000000000002 ***
## room_typeShared room             < 0.0000000000000002 ***
## accommodates                     < 0.0000000000000002 ***
## bathrooms                        < 0.0000000000000002 ***
## bed_typeOther                                 0.18075    
## cancellation_policymoderate                   0.40639    
## cancellation_policystrict                     0.00547 ** 
## cancellation_policysuper_strict               0.98135    
## cleaning_feeTrue                              0.03634 *  
## host_identity_verifiedFalse                   0.36376    
## host_response_rate                            0.15585    
## instant_bookableFalse                         0.00390 ** 
## latitude                         < 0.0000000000000002 ***
## longitude                        < 0.0000000000000002 ***
## number_of_reviews                             0.00361 ** 
## review_scores_rating                0.000000000024625 ***
## bedrooms                         < 0.0000000000000002 ***
## boroughBrooklyn                     0.000000002159308 ***
## boroughManhattan                    0.000000000000006 ***
## boroughOther                                  0.00409 ** 
## boroughQueens                                 0.64566    
## boroughStaten Island             < 0.0000000000000002 ***
## amenities_count                  < 0.0000000000000002 ***
## property_groupApartment                       0.00079 ***
## property_groupBed & Breakfast                 0.68124    
## property_groupCondominium                     0.73268    
## property_groupGuest suite/In-law              0.01063 *  
## property_groupHouse                           0.00158 ** 
## property_groupLoft                            0.83672    
## property_groupOther                           0.01363 *  
## property_groupSpecialty                       0.34137    
## property_groupTownhouse                       0.00506 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.362 on 9320 degrees of freedom
## Multiple R-squared:  0.697,  Adjusted R-squared:  0.696 
## F-statistic:  670 on 32 and 9320 DF,  p-value: <0.0000000000000002
# Summary of residuals for initial MLR model (training set)
summary(mlr.model$residuals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.78   -0.22   -0.01    0.00    0.20    2.50
# Assess accuracy of initial model against training set
accuracy(mlr.model$fitted.values, train.df$log_price)
##                              ME  RMSE   MAE    MPE MAPE
## Test set -0.0000000000000000179 0.362 0.271 -0.612 5.91

Recap of Initial Model

The dataset was split into 60% training and 40% validation subsets to ensure unbiased model evaluation. The initial MLR model was built on the full set of cleaned predictors. The model achieved the following performance on the training set:

Metric     Value
---------  ------
RMSE       0.362
MAE        0.271
MAPE       5.91%
Adj. R²    0.696

This indicates that ~69.6% of the variability in log-transformed Airbnb rental prices is explained by the model.
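
For an out-of-sample view (not shown in the original output), the same accuracy() call can be applied to predictions on the 40% validation set; a minimal sketch:

# Score the initial MLR model on the held-out validation set
# (assumes every categorical level in valid.df also appears in train.df)
mlr.pred <- predict(mlr.model, newdata = valid.df)
accuracy(mlr.pred, valid.df$log_price)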


Refined MLR Model

In an effort to improve the predictive accuracy of the model, we will refine it by eliminating predictor variables whose p-values are greater than 0.05 – suggesting those predictors are not linearly related to the output variable ‘log_price’ when controlling for the other variables. On this basis, we will drop predictors such as ‘bed_type’, ‘host_identity_verified’, and ‘host_response_rate’ from the model.
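
One way to surface those candidates programmatically (a minimal sketch; the original analysis simply reads them off the summary table) is to pull the p-values out of the fitted model object:

# Coefficient terms from the initial model with p-values above 0.05
# (these are individual terms, including dummy levels, rather than whole variables)
pvals <- coef(summary(mlr.model))[, "Pr(>|t|)"]
names(pvals[pvals > 0.05])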

We will also drop the ‘bedrooms’ variable that is strongly correlated with the ‘accommodates’ variable to evaluate the potential impact of multicollinearity on the model’s predictive accuracy. However, we will keep some of the categorical variables whose categories or levels are significant (e.g., ‘borough’ & ‘cancellation_policy’).

# Drop insignificant predictor variables
train.df2 <- subset(train.df, select = -c(bed_type, host_response_rate, host_identity_verified, bedrooms))
# Refined MLR model
mlr.model.2 <- lm(log_price ~ ., data = train.df2)
summary(mlr.model.2)
## 
## Call:
## lm(formula = log_price ~ ., data = train.df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.770 -0.225 -0.005  0.207  2.498 
## 
## Coefficients:
##                                     Estimate  Std. Error t value
## (Intercept)                      -153.083571   12.064594  -12.69
## room_typePrivate room              -0.556466    0.009703  -57.35
## room_typeShared room               -0.864763    0.024622  -35.12
## accommodates                        0.102788    0.002723   37.75
## bathrooms                           0.148792    0.010503   14.17
## cancellation_policymoderate         0.006691    0.011009    0.61
## cancellation_policystrict           0.029588    0.010156    2.91
## cancellation_policysuper_strict     0.008367    0.365579    0.02
## cleaning_feeTrue                    0.018147    0.009380    1.93
## instant_bookableFalse               0.030247    0.008607    3.51
## latitude                           -1.303542    0.111538  -11.69
## longitude                          -2.845867    0.133944  -21.25
## number_of_reviews                  -0.000431    0.000114   -3.79
## review_scores_rating                0.315799    0.047161    6.70
## boroughBrooklyn                    -0.194922    0.033863   -5.76
## boroughManhattan                    0.244796    0.030945    7.91
## boroughOther                        0.227573    0.077210    2.95
## boroughQueens                      -0.011433    0.032533   -0.35
## boroughStaten Island               -0.941649    0.061515  -15.31
## amenities_count                     0.005391    0.000549    9.83
## property_groupApartment            -0.248340    0.079853   -3.11
## property_groupBed & Breakfast       0.080626    0.106355    0.76
## property_groupCondominium          -0.021143    0.086132   -0.25
## property_groupGuest suite/In-law   -0.302503    0.123567   -2.45
## property_groupHouse                -0.227057    0.080748   -2.81
## property_groupLoft                  0.002708    0.084837    0.03
## property_groupOther                -0.211535    0.093899   -2.25
## property_groupSpecialty            -0.105026    0.126079   -0.83
## property_groupTownhouse            -0.209848    0.083701   -2.51
##                                              Pr(>|t|)    
## (Intercept)                      < 0.0000000000000002 ***
## room_typePrivate room            < 0.0000000000000002 ***
## room_typeShared room             < 0.0000000000000002 ***
## accommodates                     < 0.0000000000000002 ***
## bathrooms                        < 0.0000000000000002 ***
## cancellation_policymoderate                   0.54333    
## cancellation_policystrict                     0.00358 ** 
## cancellation_policysuper_strict               0.98174    
## cleaning_feeTrue                              0.05308 .  
## instant_bookableFalse                         0.00044 ***
## latitude                         < 0.0000000000000002 ***
## longitude                        < 0.0000000000000002 ***
## number_of_reviews                             0.00015 ***
## review_scores_rating               0.0000000000226173 ***
## boroughBrooklyn                    0.0000000088734163 ***
## boroughManhattan                   0.0000000000000029 ***
## boroughOther                                  0.00321 ** 
## boroughQueens                                 0.72528    
## boroughStaten Island             < 0.0000000000000002 ***
## amenities_count                  < 0.0000000000000002 ***
## property_groupApartment                       0.00188 ** 
## property_groupBed & Breakfast                 0.44842    
## property_groupCondominium                     0.80610    
## property_groupGuest suite/In-law              0.01438 *  
## property_groupHouse                           0.00493 ** 
## property_groupLoft                            0.97454    
## property_groupOther                           0.02430 *  
## property_groupSpecialty                       0.40486    
## property_groupTownhouse                       0.01219 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.365 on 9324 degrees of freedom
## Multiple R-squared:  0.693,  Adjusted R-squared:  0.692 
## F-statistic:  751 on 28 and 9324 DF,  p-value: <0.0000000000000002
# Assess accuracy of refined model against training set
accuracy(mlr.model.2$fitted.values, train.df$log_price)
##                              ME  RMSE   MAE    MPE MAPE
## Test set 0.00000000000000000228 0.364 0.273 -0.618 5.96

Stepwise Regression

# Apply stepwise regression
# drops predictors that lack statistical significance from the initial MLR model 
# - in an effort to determine the best subset of predictor variables 
mlr.model.step <- step(mlr.model, direction = "both")
## Start:  AIC=-18965
## log_price ~ room_type + accommodates + bathrooms + bed_type + 
##     cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + latitude + longitude + 
##     number_of_reviews + review_scores_rating + bedrooms + borough + 
##     amenities_count + property_group
## 
##                          Df Sum of Sq  RSS    AIC
## - host_identity_verified  1         0 1223 -18966
## - bed_type                1         0 1223 -18965
## <none>                                1223 -18965
## - host_response_rate      1         0 1223 -18965
## - cleaning_fee            1         1 1223 -18963
## - cancellation_policy     3         1 1224 -18962
## - instant_bookable        1         1 1224 -18959
## - number_of_reviews       1         1 1224 -18959
## - review_scores_rating    1         6 1228 -18922
## - amenities_count         1        13 1236 -18865
## - bathrooms               1        16 1238 -18846
## - bedrooms                1        17 1240 -18838
## - latitude                1        18 1240 -18832
## - property_group          9        21 1243 -18827
## - longitude               1        60 1282 -18522
## - accommodates            1        69 1291 -18457
## - borough                 5       200 1422 -17560
## - room_type               2       481 1704 -15864
## 
## Step:  AIC=-18966
## log_price ~ room_type + accommodates + bathrooms + bed_type + 
##     cancellation_policy + cleaning_fee + host_response_rate + 
##     instant_bookable + latitude + longitude + number_of_reviews + 
##     review_scores_rating + bedrooms + borough + amenities_count + 
##     property_group
## 
##                          Df Sum of Sq  RSS    AIC
## - bed_type                1         0 1223 -18967
## - host_response_rate      1         0 1223 -18966
## <none>                                1223 -18966
## + host_identity_verified  1         0 1223 -18965
## - cleaning_fee            1         1 1223 -18964
## - cancellation_policy     3         1 1224 -18963
## - number_of_reviews       1         1 1224 -18960
## - instant_bookable        1         1 1224 -18959
## - review_scores_rating    1         6 1229 -18923
## - amenities_count         1        14 1236 -18865
## - bathrooms               1        16 1239 -18848
## - bedrooms                1        17 1240 -18840
## - latitude                1        18 1240 -18834
## - property_group          9        20 1243 -18829
## - longitude               1        60 1282 -18522
## - accommodates            1        68 1291 -18459
## - borough                 5       200 1422 -17562
## - room_type               2       482 1704 -15863
## 
## Step:  AIC=-18967
## log_price ~ room_type + accommodates + bathrooms + cancellation_policy + 
##     cleaning_fee + host_response_rate + instant_bookable + latitude + 
##     longitude + number_of_reviews + review_scores_rating + bedrooms + 
##     borough + amenities_count + property_group
## 
##                          Df Sum of Sq  RSS    AIC
## - host_response_rate      1         0 1223 -18967
## <none>                                1223 -18967
## + bed_type                1         0 1223 -18966
## + host_identity_verified  1         0 1223 -18965
## - cleaning_fee            1         1 1223 -18964
## - cancellation_policy     3         1 1224 -18963
## - number_of_reviews       1         1 1224 -18960
## - instant_bookable        1         1 1224 -18960
## - review_scores_rating    1         6 1229 -18924
## - amenities_count         1        14 1237 -18865
## - bathrooms               1        16 1239 -18848
## - bedrooms                1        17 1240 -18840
## - latitude                1        18 1241 -18834
## - property_group          9        20 1243 -18829
## - longitude               1        60 1283 -18522
## - accommodates            1        69 1292 -18458
## - borough                 5       200 1423 -17562
## - room_type               2       490 1713 -15817
## 
## Step:  AIC=-18967
## log_price ~ room_type + accommodates + bathrooms + cancellation_policy + 
##     cleaning_fee + instant_bookable + latitude + longitude + 
##     number_of_reviews + review_scores_rating + bedrooms + borough + 
##     amenities_count + property_group
## 
##                          Df Sum of Sq  RSS    AIC
## <none>                                1223 -18967
## + host_response_rate      1         0 1223 -18967
## + bed_type                1         0 1223 -18966
## + host_identity_verified  1         0 1223 -18965
## - cleaning_fee            1         1 1224 -18964
## - cancellation_policy     3         1 1224 -18963
## - number_of_reviews       1         1 1224 -18960
## - instant_bookable        1         1 1224 -18959
## - review_scores_rating    1         6 1229 -18924
## - amenities_count         1        13 1237 -18866
## - bathrooms               1        16 1239 -18848
## - bedrooms                1        17 1240 -18840
## - latitude                1        18 1241 -18833
## - property_group          9        20 1244 -18829
## - longitude               1        60 1283 -18522
## - accommodates            1        69 1292 -18457
## - borough                 5       200 1423 -17561
## - room_type               2       490 1713 -15819
summary(mlr.model.step)
## 
## Call:
## lm(formula = log_price ~ room_type + accommodates + bathrooms + 
##     cancellation_policy + cleaning_fee + instant_bookable + latitude + 
##     longitude + number_of_reviews + review_scores_rating + bedrooms + 
##     borough + amenities_count + property_group, data = train.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.782 -0.220 -0.005  0.205  2.494 
## 
## Coefficients:
##                                     Estimate  Std. Error t value
## (Intercept)                      -153.171089   11.982785  -12.78
## room_typePrivate room              -0.555680    0.009637  -57.66
## room_typeShared room               -0.864362    0.024455  -35.34
## accommodates                        0.078718    0.003437   22.90
## bathrooms                           0.118450    0.010769   11.00
## cancellation_policymoderate         0.009132    0.010937    0.83
## cancellation_policystrict           0.028828    0.010087    2.86
## cancellation_policysuper_strict     0.010756    0.363100    0.03
## cleaning_feeTrue                    0.020145    0.009318    2.16
## instant_bookableFalse               0.026408    0.008555    3.09
## latitude                           -1.291892    0.110786  -11.66
## longitude                          -2.840681    0.133036  -21.35
## number_of_reviews                  -0.000332    0.000113   -2.93
## review_scores_rating                0.311634    0.046842    6.65
## bedrooms                            0.102938    0.009072   11.35
## boroughBrooklyn                    -0.203238    0.033641   -6.04
## boroughManhattan                    0.238550    0.030741    7.76
## boroughOther                        0.221105    0.076689    2.88
## boroughQueens                      -0.017016    0.032317   -0.53
## boroughStaten Island               -0.949962    0.061103  -15.55
## amenities_count                     0.005524    0.000545   10.14
## property_groupApartment            -0.270229    0.079335   -3.41
## property_groupBed & Breakfast       0.038132    0.105701    0.36
## property_groupCondominium          -0.033596    0.085555   -0.39
## property_groupGuest suite/In-law   -0.319502    0.122739   -2.60
## property_groupHouse                -0.258196    0.080247   -3.22
## property_groupLoft                 -0.021575    0.084289   -0.26
## property_groupOther                -0.234350    0.093284   -2.51
## property_groupSpecialty            -0.125372    0.125237   -1.00
## property_groupTownhouse            -0.237667    0.083169   -2.86
##                                              Pr(>|t|)    
## (Intercept)                      < 0.0000000000000002 ***
## room_typePrivate room            < 0.0000000000000002 ***
## room_typeShared room             < 0.0000000000000002 ***
## accommodates                     < 0.0000000000000002 ***
## bathrooms                        < 0.0000000000000002 ***
## cancellation_policymoderate                   0.40376    
## cancellation_policystrict                     0.00427 ** 
## cancellation_policysuper_strict               0.97637    
## cleaning_feeTrue                              0.03066 *  
## instant_bookableFalse                         0.00203 ** 
## latitude                         < 0.0000000000000002 ***
## longitude                        < 0.0000000000000002 ***
## number_of_reviews                             0.00344 ** 
## review_scores_rating               0.0000000000303635 ***
## bedrooms                         < 0.0000000000000002 ***
## boroughBrooklyn                    0.0000000015868401 ***
## boroughManhattan                   0.0000000000000094 ***
## boroughOther                                  0.00395 ** 
## boroughQueens                                 0.59853    
## boroughStaten Island             < 0.0000000000000002 ***
## amenities_count                  < 0.0000000000000002 ***
## property_groupApartment                       0.00066 ***
## property_groupBed & Breakfast                 0.71829    
## property_groupCondominium                     0.69456    
## property_groupGuest suite/In-law              0.00925 ** 
## property_groupHouse                           0.00130 ** 
## property_groupLoft                            0.79798    
## property_groupOther                           0.01201 *  
## property_groupSpecialty                       0.31682    
## property_groupTownhouse                       0.00428 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.362 on 9323 degrees of freedom
## Multiple R-squared:  0.697,  Adjusted R-squared:  0.696 
## F-statistic:  739 on 29 and 9323 DF,  p-value: <0.0000000000000002

After running stepwise regression on the initial model, ‘host_identity_verified’, ‘bed_type’, and ‘host_response_rate’ were dropped from the model; all other predictors, including ‘bedrooms’, were retained.

# Assess accuracy of initial model after applying stepwise regression 
accuracy(mlr.model.step$fitted.values, train.df$log_price)
##                              ME  RMSE   MAE    MPE MAPE
## Test set 0.00000000000000000111 0.362 0.271 -0.613 5.92

Recap of Model Refinement

To streamline the model without compromising performance:

  • Predictors with p-values > 0.05 (e.g., ‘bed_type’, ‘host_response_rate’, ‘host_identity_verified’) were dropped, along with ‘bedrooms’ to address its strong correlation with ‘accommodates’.
  • A refined model was then built, improving interpretability while retaining most predictive power.
  • A third iteration using stepwise regression was also conducted, automatically selecting the most statistically significant subset of features.

Model Evaluation & Comparison

# Fitting MLR model to validation data & measuring model accuracy
library(forecast)  # load 'forecast' package for predictions

# Initial model
mlr.pred <- predict(mlr.model, newdata= valid.df)
accuracy(mlr.pred, valid.df$log_price)
##                 ME  RMSE   MAE    MPE MAPE
## Test set -0.000479 0.362 0.274 -0.629 5.98
# Refined model
mlr.2.pred <- predict(mlr.model.2, newdata= valid.df)
accuracy(mlr.2.pred, valid.df$log_price)
##                 ME  RMSE   MAE   MPE MAPE
## Test set -0.000222 0.365 0.276 -0.63 6.04
# Initial model + stepwise regression
mlr.step.pred <- predict(mlr.model.step, newdata= valid.df)
accuracy(mlr.step.pred, valid.df$log_price)
##                 ME  RMSE   MAE    MPE MAPE
## Test set -0.000355 0.362 0.274 -0.627 5.99
all.residuals <- valid.df$log_price - mlr.step.pred
hist(all.residuals, breaks = 25, xlab = "Residuals", main = " ")

Based on the histogram of residual errors when the stepwise model is fit to the validation data, most residuals fall between -1 and 1 in magnitude and are roughly symmetric around zero. This indicates low error variance and a well-behaved residual distribution.
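To quantify that visual impression, we can compute the share of validation residuals whose magnitude is at most 1 – a small sketch using the ‘all.residuals’ vector created above:

# Proportion of validation residuals within +/- 1 (on the log-price scale)
mean(abs(all.residuals) <= 1)

# Five-number summary of the residuals for reference
summary(all.residuals)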


Recap of Regression Modeling

Three model versions were tested on the validation set:

Model RMSE MAE MAPE
Initial MLR 0.362 0.274 5.98%
Refined MLR (fewer vars) 0.365 0.276 6.04%
Stepwise Regression Model 0.362 0.274 5.99%
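As a cross-check, the same validation metrics can be assembled programmatically from the accuracy() calls – a minimal sketch reusing the prediction objects ‘mlr.pred’, ‘mlr.2.pred’, and ‘mlr.step.pred’ created above:

# Collect validation-set error metrics for the three models into one table
val.metrics <- rbind(
  "Initial MLR"  = accuracy(mlr.pred,      valid.df$log_price)[, c("RMSE", "MAE", "MAPE")],
  "Refined MLR"  = accuracy(mlr.2.pred,    valid.df$log_price)[, c("RMSE", "MAE", "MAPE")],
  "Stepwise MLR" = accuracy(mlr.step.pred, valid.df$log_price)[, c("RMSE", "MAE", "MAPE")]
)
val.metrics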

Key Findings:

  • The initial model, despite including both ‘bedrooms’ and ‘accommodates’, delivered the best balance of fit and interpretability.
  • Model fit: Adjusted R² of 0.696 indicates strong explanatory power for log-transformed prices.
    • The initial model had a slightly higher adjusted R-squared value, indicating that it explains a bit more of the variance in ‘log_price’ than the refined model.
  • Predictive performance: RMSE of ~0.362 suggests reasonably low error in predictions, even on unseen data.
    • RMSE for the initial model is slightly lower, which means it has slightly better predictive accuracy in terms of the error between predicted and actual ‘log_price’ values.
  • Interpretable results: Most significant variables are intuitive and aligned with market behavior, offering practical utility.

Business Insights & Implications

This model offers actionable insights for hosts, guests, and Airbnb itself:

  • Accommodation capacity (‘accommodates’) and location (‘borough’) are strong predictors of price.
  • Property type, cancellation policy, and the number of amenities also shape pricing strategies.
  • The model’s generalizability (consistent RMSE across the training and validation sets) supports its application to future price-setting or valuation tools.


Conclusion

The multiple linear regression model built in this phase provides a reliable, interpretable, and data-driven approach to understanding Airbnb rental pricing in NYC. By identifying the features that matter most (such as guest capacity, location, and the number of available amenities), this model empowers:


  • Hosts to competitively price listings
  • Guests to evaluate value for money
  • Airbnb to enhance platform pricing algorithms

The model explains a substantial portion of the price variability, but further research could explore additional factors to enhance predictive accuracy (e.g., seasonality, user review sentiment, or calendar availability).

4. Classification - Predicting Rental Characteristics

This section explores the application of three supervised classification algorithms to answer three key business questions related to NYC Airbnb rentals:

  1. Will a rental include a cleaning fee? (k-Nearest Neighbors)
  2. Can we classify Airbnb rentals into price tiers? (Naive Bayes)
  3. Can we predict a host’s cancellation policy? (Classification Tree)

a. k-Nearest Neighbors (k-NN) - Predicting Cleaning Fees

Data Preprocessing & Partitioning

# Convert the predictive outcome of 'cleaning_fee' into a factor
ny_k.df <- ny.df
ny_k.df$cleaning_fee <- as.factor(ny.df$cleaning_fee)
str(ny_k.df$cleaning_fee)
##  Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 1 2 2 ...
# Remove columns not used as predictors (i.e., variables not relevant to cleaning fees)
ny_k.df <- subset(ny_k.df, select = -c(
    id, amenities, bed_type, city, description, first_review, host_has_profile_pic, 
    host_identity_verified, host_response_rate, host_since, last_review, name, 
    neighbourhood, thumbnail_url, property_group))

summary(ny_k.df)
##    log_price    property_type       room_type          accommodates  
##  Min.   :2.30   Length:15589       Length:15589       Min.   : 1.00  
##  1st Qu.:4.17   Class :character   Class :character   1st Qu.: 2.00  
##  Median :4.61   Mode  :character   Mode  :character   Median : 2.00  
##  Mean   :4.68                                         Mean   : 2.96  
##  3rd Qu.:5.11                                         3rd Qu.: 4.00  
##  Max.   :7.60                                         Max.   :16.00  
##    bathrooms      cancellation_policy cleaning_fee  instant_bookable
##  Min.   :0.50   flexible    :3735     False: 3822   True : 4237     
##  1st Qu.:1.00   moderate    :4050     True :11767   False:11352     
##  Median :1.00   strict      :7801                                   
##  Mean   :1.14   super_strict:   3                                   
##  3rd Qu.:1.00                                                       
##  Max.   :5.50                                                       
##     latitude      longitude     number_of_reviews review_scores_rating
##  Min.   :40.5   Min.   :-74.2   Min.   :  1       Min.   :0.200       
##  1st Qu.:40.7   1st Qu.:-74.0   1st Qu.:  3       1st Qu.:0.910       
##  Median :40.7   Median :-74.0   Median :  9       Median :0.960       
##  Mean   :40.7   Mean   :-74.0   Mean   : 23       Mean   :0.935       
##  3rd Qu.:40.8   3rd Qu.:-73.9   3rd Qu.: 29       3rd Qu.:1.000       
##  Max.   :40.9   Max.   :-73.7   Max.   :465       Max.   :1.000       
##     bedrooms          beds         borough          amenities_count
##  Min.   : 1.00   Min.   : 1.00   Length:15589       Min.   : 1.0   
##  1st Qu.: 1.00   1st Qu.: 1.00   Class :character   1st Qu.:13.0   
##  Median : 1.00   Median : 1.00   Mode  :character   Median :16.0   
##  Mean   : 1.28   Mean   : 1.64                      Mean   :17.4   
##  3rd Qu.: 1.00   3rd Qu.: 2.00                      3rd Qu.:21.0   
##  Max.   :10.00   Max.   :18.00                      Max.   :77.0
# Set seed with value 60 & partition the dataset into training (60%) & validation (40%) sets
set.seed(60)  # Set the seed here
ny_k.df_train.index <- sample(c(1:nrow(ny_k.df)), nrow(ny_k.df) * 0.6)
ny_k_train.df <- ny_k.df[ny_k.df_train.index, ]
ny_k_valid.df <- ny_k.df[-ny_k.df_train.index, ]

Separate Rentals

# Separate the rentals with/without a cleaning fee in training set
train.df_t <- subset(ny_k_train.df, cleaning_fee == "True")
train.df_f <- subset(ny_k_train.df, cleaning_fee == "False")

Examine Differences in Mean Values

# Examine the percentage difference in the mean value among the numeric predictor variables
(mean(train.df_t$log_price) - mean(train.df_f$log_price)) * 100
## [1] 28.6
(mean(train.df_t$accommodates) - mean(train.df_f$accommodates)) * 100
## [1] 83.2
(mean(train.df_t$bathrooms) - mean(train.df_f$bathrooms)) * 100
## [1] 3.08
(mean(train.df_t$bedrooms) - mean(train.df_f$bedrooms)) * 100
## [1] 16.3
(mean(train.df_t$beds) - mean(train.df_f$beds)) * 100
## [1] 37.2

Variable Selection

# If any variables are categorical or show less than 10% difference in mean value between the two groups, 
# remove those variables entirely
ny_k_train.df <- subset(ny_k_train.df, select = -c(
    bathrooms, property_type, room_type, cancellation_policy, borough, instant_bookable))
ny_k_valid.df <- subset(ny_k_valid.df, select = -c(
    bathrooms, property_type, room_type, cancellation_policy, borough, instant_bookable))
ny_k.df <- subset(ny_k.df, select = -c(
    bathrooms, property_type, room_type, cancellation_policy, borough, instant_bookable))

str(ny_k_train.df)
## 'data.frame':    9353 obs. of  10 variables:
##  $ log_price           : num  5.16 5.25 4.6 3.4 4.78 ...
##  $ accommodates        : int  4 4 2 2 2 2 2 4 2 4 ...
##  $ cleaning_fee        : Factor w/ 2 levels "False","True": 2 2 2 2 1 2 2 2 2 1 ...
##  $ latitude            : num  40.7 40.8 40.7 40.7 40.7 ...
##  $ longitude           : num  -74 -73.9 -73.9 -74 -74 ...
##  $ number_of_reviews   : int  3 10 126 1 1 56 18 31 105 11 ...
##  $ review_scores_rating: num  0.8 1 0.98 1 1 0.98 0.93 0.89 0.9 0.98 ...
##  $ bedrooms            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ beds                : num  1 2 1 1 1 1 1 2 1 1 ...
##  $ amenities_count     : int  18 17 20 7 9 12 23 16 19 10 ...

Normalization

# Normalize the data using the training set & 'preProcess()' function.
library(caret)  # Load the caret library
train.norm.df <- ny_k_train.df
valid.norm.df <- ny_k_valid.df
ny_k.norm.df <- ny_k.df

# Specify the columns to normalize
columns_to_normalize <- c("log_price", "accommodates", "bedrooms", "beds")

# Create a preProcess object
norm_values <- preProcess(ny_k_train.df[, columns_to_normalize], method = c("center", "scale"))

# Apply normalization to the training and validation data
train.norm.df[, columns_to_normalize] <- predict(norm_values, ny_k_train.df[, columns_to_normalize])
valid.norm.df[, columns_to_normalize] <- predict(norm_values, ny_k_valid.df[, columns_to_normalize])
ny_k.norm.df[, columns_to_normalize] <- predict(norm_values, ny_k.df[, columns_to_normalize])
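As a quick sanity check (not shown in the original output), the centered-and-scaled training columns should now have a mean of approximately 0 and a standard deviation of approximately 1:

# Verify that normalization behaved as expected on the training set
round(colMeans(train.norm.df[, columns_to_normalize]), 3)
round(apply(train.norm.df[, columns_to_normalize], 2, sd), 3)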

Create New Rental

# Make up a new rental to predict/classify the cleaning fee to train the model
new.df <- data.frame(log_price = 4, accommodates = 5, bedrooms = 4, beds = 5)

# Ensure that the columns in new.df match the columns used for normalization in the training data
new.df[, columns_to_normalize] <- predict(norm_values, new.df[, columns_to_normalize])

k-nn Model Evaluation

# Using the validation data & a range of k values from 1 to 14, 
# assess the accuracy level for each k value from 1 to 14

# Initialize a data frame with two columns: k, & accuracy
accuracy.df <- data.frame(k = seq(1, 14, 1), accuracy = rep(0, 14))

# Compute the accuracy level for each k value & find the optimal k-value
for (i in 1:14) {
  knn.pred <- knn(train.norm.df[, columns_to_normalize], 
                  valid.norm.df[, columns_to_normalize], 
                  cl = train.norm.df[, "cleaning_fee"], k = i)
  accuracy.df[i, 2] <- confusionMatrix(knn.pred, valid.norm.df[, "cleaning_fee"])$overall[1]
}

accuracy.df
##     k accuracy
## 1   1    0.717
## 2   2    0.722
## 3   3    0.730
## 4   4    0.729
## 5   5    0.735
## 6   6    0.740
## 7   7    0.742
## 8   8    0.742
## 9   9    0.741
## 10 10    0.742
## 11 11    0.743
## 12 12    0.744
## 13 13    0.745
## 14 14    0.744
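Plotting accuracy against k makes the flat region around the optimum easier to see – a minimal sketch using the ‘accuracy.df’ table above:

# Visualize validation accuracy across the candidate k values
plot(accuracy.df$k, accuracy.df$accuracy, type = "b",
     xlab = "k", ylab = "Validation accuracy")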

k-nn Model Prediction

# Using the knn() function, the normalized training data, & the optimal k identified above (k = 13), 
# generate a predicted classification of cleaning_fee for the new rental.
optimal_k <- which.max(accuracy.df$accuracy)
optimal_k_value <- accuracy.df$k[optimal_k]

nn <- knn(train = train.norm.df[, columns_to_normalize], 
          test = new.df[, columns_to_normalize], 
          cl = train.norm.df[, "cleaning_fee"], k = optimal_k_value)
predicted_cleaning_fee <- as.character(nn)
predicted_cleaning_fee
## [1] "True"

The prediction is ‘True’ - the fictional NYC Airbnb rental will have a cleaning fee.

Explanation of k-NN Model

In the third part of the data mining project, a k-nearest neighbors (k-NN) classification model was implemented to predict whether or not an Airbnb rental in New York City would include a cleaning fee. The construction of this predictive model involved several systematic steps to ensure its reliability and accuracy.

To begin, the dataset was preprocessed by transforming the ‘cleaning_fee’ variable into a factor, representing the presence or absence of cleaning fees. Subsequently, irrelevant columns, such as URLs and non-predictive attributes, were removed from the dataset. Missing values were also handled by eliminating rows with any NA values, as k-NN models do not accommodate missing data.

To establish a robust model, the dataset was split into training and validation sets using a 60-40 partition while maintaining reproducibility through the application of a random seed. Within the training dataset, a comparative analysis of mean differences between rentals with and without cleaning fees was conducted for various predictor variables. This allowed for the identification of attributes that significantly contributed to the classification task. Variables demonstrating minimal differences or being categorical in nature were excluded from consideration to prevent potential similarity bias.

Normalization of the data was imperative to ensure that all predictor variables contributed equally to the model. The ‘preProcess’ function from the ‘caret’ package was employed to standardize the data, rendering it suitable for k-NN classification.

Subsequently, k-NN classification was performed on the validation dataset, with k values ranging from 1 to 14. Model accuracy was evaluated for each k value, and the optimal k-value was determined to be 13, yielding a validation accuracy of 74.5%.

Finally, the k-NN model with the optimal k-value was applied to predict whether a fictitious rental, characterized by specific attributes (log_price = 4, accommodates = 5, bedrooms = 4, beds = 5), would include a cleaning fee. The model produced a prediction of ‘True,’ indicating that the new rental was likely to have a cleaning fee.

In this instance, it is vital to address and safeguard the model against similarity bias. Similarity bias occurs when the model assigns similar instances to the same class without adequately considering individual attribute importance. This can lead to misclassification, particularly when variables exhibit strong correlations or when categorical variables are not treated with appropriate consideration. The removal of variables with minimal class differences and categorical attributes aimed to mitigate similarity bias, ensuring the model’s accuracy and fairness in classifying cleaning fees for Airbnb rentals in New York City.

Summary of k-NN modeling

A k-Nearest Neighbors model was built to classify whether a New York City Airbnb listing includes a cleaning fee. The modeling pipeline included the following steps:

  • Preprocessing: Converted cleaning_fee into a factor variable and removed irrelevant columns such as ID, name, and host descriptions. Rows with missing values were excluded.
  • Feature Selection: Variables with minimal class-based mean differences (≤10%) and categorical features were removed to minimize similarity bias. Final predictors included log_price, accommodates, bedrooms, beds, latitude, longitude, number_of_reviews, review_scores_rating, and amenities_count.
  • Normalization: Numeric predictors were standardized using the caret::preProcess() function to ensure distance-based calculations were meaningful in k-NN.
  • Model Training & Tuning: A range of k values from 1 to 14 was tested on a 60/40 train/validation split. The model with k = 13 achieved the highest validation accuracy of 74.5%.
  • New Prediction: The optimized k-NN model predicted that a new fictional listing (log price = 4, accommodates = 5, bedrooms = 4, beds = 5) would include a cleaning fee (True).

By removing variables prone to similarity bias and focusing on impactful continuous predictors, the model provided a reliable prediction of cleaning fee presence. The normalization step was crucial for performance.

b. Naive Bayes Classifier - Predicting Price Tiers

Data Preprocessing

# Create copy of dataset & generate summary of 'log_price'
ny_nb.df<- ny.df
summary(ny_nb.df$log_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.30    4.17    4.61    4.68    5.11    7.60

Binning the ‘log_price’ Variable

# Create bins for the 'log_price' variable
ny_nb.df$log_price <- cut(ny_nb.df$log_price, breaks=c(0.000, 4.248, 4.654, 5.165, 7.600), 
                          labels=c("Pricey Digs", "Above Average", "Below Average", "Student Budget"))

str(ny_nb.df$log_price)
##  Factor w/ 4 levels "Pricey Digs",..: 4 3 4 3 4 3 4 4 3 4 ...
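Because the cut points roughly correspond to the quartiles of ‘log_price’, each bin should hold about a quarter of the listings. A quick tabulation (not shown in the original output) can confirm this:

# Check how evenly listings are distributed across the four price bins
table(ny_nb.df$log_price)
prop.table(table(ny_nb.df$log_price))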

Select Predictor Variables

# Subset necessary columns
ny_nb.df <- subset(ny_nb.df, select = c(log_price, accommodates, bedrooms, bathrooms, room_type, property_type))

Note: Five predictor variables were selected for model building: property_type, room_type, accommodates, bathrooms, and bedrooms

Convert Numerical Variables to Categorical

# Convert numerical variables to categorical 
ny_nb.df$accommodates <- factor(ny_nb.df$accommodates)
ny_nb.df$bathrooms <- factor(ny_nb.df$bathrooms)
ny_nb.df$bedrooms <- factor(ny_nb.df$bedrooms)

Partition Dataset

# Partition dataset into training & validation sets
set.seed(60)
train_nb.index <- sample(c(1:dim(ny_nb.df)[1]), dim(ny_nb.df)[1]*0.6)
selected.var <- c(1, 2, 3, 4, 5, 6)
train_nb.df <- ny_nb.df[train_nb.index, selected.var]
valid_nb.df <- ny_nb.df[-train_nb.index, selected.var]

Naive Bayes Model

# Generate Naive Bayes model
ny_nb <- naiveBayes(log_price ~ ., data = train_nb.df)
ny_nb
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##    Pricey Digs  Above Average  Below Average Student Budget 
##          0.279          0.261          0.237          0.224 
## 
## Conditional probabilities:
##                 accommodates
## Y                       1        2        3        4        5        6        7
##   Pricey Digs    0.314154 0.571538 0.071346 0.034522 0.002685 0.003836 0.000767
##   Above Average  0.146962 0.608374 0.100575 0.109195 0.016831 0.011494 0.002053
##   Below Average  0.053652 0.397656 0.166366 0.230839 0.064022 0.057710 0.011722
##   Student Budget 0.010526 0.183732 0.112440 0.277033 0.097608 0.154067 0.038756
##                 accommodates
## Y                       8        9       10       11       12       13       14
##   Pricey Digs    0.000000 0.000000 0.000384 0.000000 0.000767 0.000000 0.000000
##   Above Average  0.002463 0.000000 0.000821 0.000411 0.000000 0.000000 0.000000
##   Below Average  0.013075 0.000902 0.003156 0.000451 0.000000 0.000000 0.000000
##   Student Budget 0.064115 0.009091 0.028230 0.003349 0.008612 0.001435 0.003349
##                 accommodates
## Y                      15       16
##   Pricey Digs    0.000000 0.000000
##   Above Average  0.000411 0.000411
##   Below Average  0.000000 0.000451
##   Student Budget 0.002392 0.005263
## 
##                 bedrooms
## Y                       1        2        3        4        5        6        7
##   Pricey Digs    0.976985 0.015343 0.005754 0.001151 0.000767 0.000000 0.000000
##   Above Average  0.933498 0.058703 0.006979 0.000821 0.000000 0.000000 0.000000
##   Below Average  0.773219 0.188458 0.034265 0.003607 0.000451 0.000000 0.000000
##   Student Budget 0.467464 0.335885 0.142584 0.036364 0.012440 0.002392 0.001914
##                 bedrooms
## Y                       8        9       10
##   Pricey Digs    0.000000 0.000000 0.000000
##   Above Average  0.000000 0.000000 0.000000
##   Below Average  0.000000 0.000000 0.000000
##   Student Budget 0.000478 0.000478 0.000000
## 
##                 bathrooms
## Y                     0.5        1      1.5        2      2.5        3      3.5
##   Pricey Digs    0.003069 0.849636 0.068278 0.067894 0.004219 0.005370 0.000000
##   Above Average  0.002053 0.899015 0.043924 0.044745 0.004516 0.003695 0.000000
##   Below Average  0.001353 0.932822 0.026150 0.032462 0.003607 0.002705 0.000451
##   Student Budget 0.000478 0.721531 0.051196 0.159809 0.032057 0.018182 0.008612
##                 bathrooms
## Y                       4      4.5        5      5.5
##   Pricey Digs    0.001151 0.000000 0.000384 0.000000
##   Above Average  0.002053 0.000000 0.000000 0.000000
##   Below Average  0.000000 0.000000 0.000451 0.000000
##   Student Budget 0.004306 0.001435 0.000957 0.001435
## 
##                 room_type
## Y                Entire home/apt Private room Shared room
##   Pricey Digs            0.03529      0.89336     0.07135
##   Above Average          0.27422      0.70731     0.01847
##   Below Average          0.74301      0.25338     0.00361
##   Student Budget         0.94641      0.05024     0.00335
## 
##                 property_type
## Y                Apartment Bed & Breakfast     Boat Boutique hotel Bungalow
##   Pricey Digs     0.780974        0.001151 0.000000       0.000000 0.000000
##   Above Average   0.846470        0.005747 0.000000       0.000411 0.000821
##   Below Average   0.873760        0.002254 0.000451       0.000451 0.000902
##   Student Budget  0.834928        0.002392 0.000000       0.000000 0.000478
##                 property_type
## Y                Condominium     Dorm Guest suite Guesthouse   Hostel    House
##   Pricey Digs       0.005754 0.001534    0.000767   0.001151 0.000384 0.164557
##   Above Average     0.010673 0.000000    0.002053   0.001642 0.000000 0.093186
##   Below Average     0.013526 0.000000    0.000902   0.000451 0.000000 0.060415
##   Student Budget    0.023923 0.000000    0.000957   0.000000 0.000478 0.075120
##                 property_type
## Y                    Loft    Other Serviced apartment Timeshare Townhouse
##   Pricey Digs    0.015727 0.005370           0.000000  0.000000  0.021097
##   Above Average  0.012315 0.004105           0.000411  0.000000  0.021757
##   Below Average  0.016231 0.007665           0.000000  0.000451  0.022092
##   Student Budget 0.029187 0.003349           0.000000  0.001435  0.026794
##                 property_type
## Y                Vacation home    Villa
##   Pricey Digs         0.000000 0.001534
##   Above Average       0.000000 0.000411
##   Below Average       0.000451 0.000000
##   Student Budget      0.000478 0.000478

The ‘A-priori probabilities’ given above denote the likelihood that an Airbnb listing in NYC belongs to each of these four classes. The prior probability of each class in the training data is as follows:

  • “Pricey Digs”: 0.279
  • “Above Average”: 0.261
  • “Below Average”: 0.237
  • “Student Budget”: 0.224

The Naive Bayes classifier will use these probabilities to make predictions. For instance, given a set of predictor variable values, the classifier will calculate the probability of the instance (i.e., the Airbnb listing) belonging to each class and assign it to the most likely class (the one with the highest probability).

Predict Price Class for an Example Listing

To demonstrate, we will examine the predicted price class for validation-set listings with the following characteristics:

  • property_type = “Apartment”
  • room_type = “Entire home/apt”
  • accommodates = 4
  • bathrooms = 1
  • bedrooms = 3

# Predict probabilities & class membership for fictional listing
pred.prob <- predict(ny_nb, newdata = valid_nb.df, type = "raw")
pred.class <- predict(ny_nb, newdata = valid_nb.df)
df <- data.frame(actual = valid_nb.df$log_price, predicted = pred.class, pred.prob)
df[valid_nb.df$property_type == "Apartment" & 
   valid_nb.df$room_type == "Entire home/apt" & 
   valid_nb.df$accommodates == 4 & 
   valid_nb.df$bathrooms == 1 & 
   valid_nb.df$bedrooms == 3,]
##              actual      predicted Pricey.Digs Above.Average Below.Average
## 3498  Below Average Student Budget    0.000209       0.00667         0.183
## 4888  Below Average Student Budget    0.000209       0.00667         0.183
## 5215  Above Average Student Budget    0.000209       0.00667         0.183
## 5296  Below Average Student Budget    0.000209       0.00667         0.183
## 5754 Student Budget Student Budget    0.000209       0.00667         0.183
##      Student.Budget
## 3498           0.81
## 4888           0.81
## 5215           0.81
## 5296           0.81
## 5754           0.81
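To see where the 0.81 posterior comes from, the calculation can be reproduced by hand from the printed a-priori and conditional probabilities. The following minimal sketch copies those values from the model output above, so the result matches the table only up to rounding:

# Posterior is proportional to: prior x product of conditional probabilities
# Class order: Pricey Digs, Above Average, Below Average, Student Budget
priors <- c(0.279, 0.261, 0.237, 0.224)
lik    <- c(0.034522, 0.109195, 0.230839, 0.277033) *  # accommodates = 4
          c(0.005754, 0.006979, 0.034265, 0.142584) *  # bedrooms = 3
          c(0.849636, 0.899015, 0.932822, 0.721531) *  # bathrooms = 1
          c(0.035290, 0.274220, 0.743010, 0.946410) *  # room_type = Entire home/apt
          c(0.780974, 0.846470, 0.873760, 0.834928)    # property_type = Apartment
round(priors * lik / sum(priors * lik), 3)  # approx. 0.000 0.007 0.182 0.811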

Confusion Matrix

# Training set
pred.class <- predict(ny_nb, newdata = train_nb.df)
confusionMatrix(pred.class, train_nb.df$log_price)
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Pricey Digs Above Average Below Average Student Budget
##   Pricey Digs           2328          1542           471             76
##   Above Average          177           230           105             42
##   Below Average           82           529          1126            805
##   Student Budget          20           135           516           1167
## 
## Overall Statistics
##                                              
##                Accuracy : 0.519              
##                  95% CI : (0.509, 0.529)     
##     No Information Rate : 0.279              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.354              
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
## 
## Statistics by Class:
## 
##                      Class: Pricey Digs Class: Above Average
## Sensitivity                       0.893               0.0944
## Specificity                       0.690               0.9531
## Pos Pred Value                    0.527               0.4152
## Neg Pred Value                    0.943               0.7492
## Prevalence                        0.279               0.2605
## Detection Rate                    0.249               0.0246
## Detection Prevalence              0.472               0.0592
## Balanced Accuracy                 0.792               0.5238
##                      Class: Below Average Class: Student Budget
## Sensitivity                         0.508                 0.558
## Specificity                         0.801                 0.908
## Pos Pred Value                      0.443                 0.635
## Neg Pred Value                      0.840                 0.877
## Prevalence                          0.237                 0.224
## Detection Rate                      0.120                 0.125
## Detection Prevalence                0.272                 0.197
## Balanced Accuracy                   0.655                 0.733
# Validation set
pred.class <- predict(ny_nb, newdata = valid_nb.df)
confusionMatrix(pred.class, valid_nb.df$log_price)
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Pricey Digs Above Average Below Average Student Budget
##   Pricey Digs           1473           998           311             57
##   Above Average          125           133            89             24
##   Below Average           67           354           768            540
##   Student Budget          17           109           385            786
## 
## Overall Statistics
##                                              
##                Accuracy : 0.507              
##                  95% CI : (0.494, 0.519)     
##     No Information Rate : 0.27               
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.339              
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
## 
## Statistics by Class:
## 
##                      Class: Pricey Digs Class: Above Average
## Sensitivity                       0.876               0.0834
## Specificity                       0.700               0.9487
## Pos Pred Value                    0.519               0.3585
## Neg Pred Value                    0.938               0.7509
## Prevalence                        0.270               0.2556
## Detection Rate                    0.236               0.0213
## Detection Prevalence              0.455               0.0595
## Balanced Accuracy                 0.788               0.5161
##                      Class: Below Average Class: Student Budget
## Sensitivity                         0.495                 0.559
## Specificity                         0.795                 0.894
## Pos Pred Value                      0.444                 0.606
## Neg Pred Value                      0.826                 0.874
## Prevalence                          0.249                 0.226
## Detection Rate                      0.123                 0.126
## Detection Prevalence                0.277                 0.208
## Balanced Accuracy                   0.645                 0.726

Explanation of Naive Bayes Classifier:

In this section of the project, we implemented the Naive Bayes algorithm to categorize Airbnb rental prices in New York City (NYC) into four distinct bins: “Pricey Digs,” “Above Average,” “Below Average,” and “Student Budget.” This categorization allows us to provide valuable insights for both Airbnb management and potential customers. The Naive Bayes model was developed using a subset of the original data for NYC that includes five carefully chosen predictor variables: property type, room type, accommodates (the number of guests the listing can accommodate), bathrooms, and bedrooms.

The first step was to create the price bins based on the ‘log_price’ variable. We split the prices into four categories, ensuring an approximately equal distribution of listings across these categories. The summary of the ‘log_price’ variable indicates that the rental prices in NYC range from 2.30 to 7.60, with a median of 4.61. After binning, we converted the numerical predictor variables (‘accommodates,’ ‘bathrooms,’ and ‘bedrooms’) into categorical variables to prepare them for modeling.

The Naive Bayes model was then trained on a subset of the dataset, with 60% of the data used for training and the remaining 40% for validation. The model’s results are shown in the output, where it calculates conditional probabilities for each combination of predictor values in relation to the four price categories.

Key Insights:

Conditional probabilities for predictor variables like ‘accommodates,’ ‘bedrooms,’ ‘bathrooms,’ ‘room_type,’ and ‘property_type’ play a pivotal role in the Naive Bayes classifier. These probabilities indicate the likelihood of observing particular predictor variable values within a specific class. They are used to estimate the probability of a specific class given the observed predictor variable values, helping the classifier make predictions by identifying the most probable class based on the observed data.

The key insights from the model’s conditional probabilities are as follows:

  1. Accommodation Capacity: Listings that can accommodate fewer guests (e.g., ‘accommodates’ = 1-4) are more likely to fall into the “Pricey Digs” and “Above Average” categories. This suggests that smaller properties or those suitable for fewer people are associated with higher price categories. On the other hand, listings that accommodate more guests (e.g., ‘accommodates’ = 5-12) are more likely to be in the “Student Budget” category. This implies that larger properties or those suitable for more people are associated with lower price categories.

  2. Number of Bedrooms: Listings with fewer bedrooms (e.g., 1 bedroom) are more likely to be in the “Pricey Digs” category, suggesting that smaller properties with fewer bedrooms tend to be in the higher price category. Conversely, listings with more bedrooms (e.g., 3-10 bedrooms) are more likely to be in the “Above Average,” “Below Average,” or “Student Budget” categories, indicating that larger properties with more bedrooms are associated with a range of price categories.

  3. Number of Bathrooms: Listings with fewer bathrooms (e.g., 1 bathroom) are more likely to be in the “Pricey Digs” category, suggesting that properties with fewer bathrooms are associated with higher prices. By contrast, listings with more bathrooms (e.g., 2-5 bathrooms) are more likely to be in the “Above Average,” “Below Average,” or “Student Budget” categories, indicating that properties with more bathrooms are distributed across different price categories.

  4. Room Type: Listings that offer an “Entire home/apt” are more likely to be in the “Pricey Digs” category, suggesting that entire homes or apartments tend to be in the higher price category. Contrarily, listings that offer a “Private room” are more likely to be in the “Above Average” category, indicating that private rooms are associated with a somewhat lower price category. Moreover, listings that offer a “Shared room” are more likely to be in the “Below Average” or “Student Budget” categories, implying that shared rooms are associated with lower price categories.

  5. Property Type: Listings with an “Apartment” property type are more likely to be in the “Pricey Digs” category, suggesting that apartments tend to be in the higher price category. On the other hand, listings with a “Bed & Breakfast” property type are more likely to be in the “Above Average” category, indicating that bed & breakfast accommodations are associated with a somewhat lower price category. Furthermore, listings with a “Boat” property type are more likely to be in the “Below Average” or “Student Budget” categories, implying that boats are associated with lower price categories.

Furthermore, the model’s performance was rigorously evaluated using confusion matrices for both the training and validation datasets. While accuracy is a useful metric, a deeper analysis of the results unveils both the model’s potential and areas for enhancement. In the training set, the model achieved an accuracy of approximately 51.9%, and a similar accuracy of 50.7% in the validation set. However, accuracy alone may not provide a complete picture of the model’s effectiveness.

The confusion matrices reveal important insights:

  • Sensitivity (True Positive Rate): The model excels in correctly categorizing instances with ‘Pricey Digs,’ demonstrating a sensitivity of 87.6% in the validation set. This suggests that for high-priced listings, the model is quite reliable.

  • Specificity (True Negative Rate): The model’s specificity of 94.9% in the validation set for ‘Above Average’ listings indicates its ability to correctly identify cases where listings are not in this category.

  • Challenges in Classification: The model faces difficulties in distinguishing between ‘Above Average’ and ‘Below Average’ listings, with sensitivity values of 8.34% and 49.5% respectively. This indicates that further improvements are needed in these areas.

While the model exhibits promise, it’s important to acknowledge potential drawbacks and explore reasons for underperformance in certain classes:

  • Class Imbalance: The dataset may have an uneven distribution of listings across price categories, leading to challenges in accurately predicting less-represented classes like ‘Above Average.’

  • Feature Selection: The features used for prediction might not capture all the nuances influencing price categories. Feature engineering and selection processes may require refinement to improve predictive power.

  • Complex Factors: Pricing in the Airbnb marketplace can be influenced by complex factors beyond the scope of the current features, such as seasonality, local events, and market dynamics. These factors can contribute to classification difficulties.

Despite these challenges, the model offers valuable assistance to Airbnb management in pricing recommendations and a deeper understanding of the factors influencing price categories. Users can benefit from insights into expected price ranges based on their preferences, which can guide them in making informed booking decisions. Ongoing model refinement and feature engineering efforts hold the potential to enhance classification accuracy and address these limitations.

Recap: Naive Bayes Classifier

This model classifies NYC Airbnb rentals into four price categories/tiers:

  • Pricey Digs
  • Above Average
  • Below Average
  • Student Budget

Modeling Steps:

  • Binning: The continuous log_price variable was segmented into four bins based on quartiles.
  • Predictors: Five variables were used: property_type, room_type, accommodates, bathrooms, and bedrooms. Numeric variables were converted to categorical types for compatibility with Naive Bayes.
  • Data Partitioning: 60% of the dataset was used for training, and 40% for validation.
  • Model Training: The Naive Bayes classifier calculated conditional probabilities and class priors based on the training data.

Performance Evaluation:

Dataset Accuracy Kappa Key Strength
Training 51.9% 0.354 High sensitivity for “Pricey Digs” (89.3%)
Validation 50.7% 0.339 Strong specificity for most classes

The model performed well in classifying high-priced listings but struggled with middle categories, particularly “Above Average.”

Confusion Matrix Insights:

  • Sensitivity: Strong for “Pricey Digs” (~87.6%), poor for “Above Average” (~8.3%)
  • Specificity: Generally high across all classes (>70%)
  • Balanced Accuracy: Highest for “Pricey Digs” and “Student Budget”

Conditional Probability Highlights:

  • Accommodates: Small listings (1–4 people) tend toward “Pricey Digs,” while larger ones (5–12) lean “Student Budget.”
  • Bedrooms/Bathrooms: Fewer rooms correlate with higher prices.
  • Room Type: Entire home/apt listings dominate higher price tiers.
  • Property Type: Apartments cluster in expensive categories, whereas unique options like Boats fall into lower price tiers.

Despite modest accuracy, the model provides interpretable probabilities and valuable business insights for Airbnb managers and users. Further feature engineering and addressing class imbalance could enhance performance.

c. Classification Tree – Predicting Cancellation Policy

Data Preprocessing

ny_ct.df <- ny.df
# Subset data (remove unnecessary columns)
ny_ct.df <- subset(ny_ct.df, select= - c(id, amenities, bed_type, cleaning_fee, city, description, first_review, host_since,instant_bookable, last_review, latitude, longitude, name, thumbnail_url, neighbourhood, property_group, borough))
# Inspect new dataset
str(ny_ct.df)
## 'data.frame':    15589 obs. of  14 variables:
##  $ log_price             : num  5.39 4.7 5.66 4.93 5.19 ...
##  $ property_type         : chr  "Apartment" "Apartment" "Apartment" "Apartment" ...
##  $ room_type             : chr  "Entire home/apt" "Private room" "Entire home/apt" "Private room" ...
##  $ accommodates          : int  2 4 2 3 2 1 8 6 3 3 ...
##  $ bathrooms             : num  1 1 1 1 1 1 1 1 2 1 ...
##  $ cancellation_policy   : Factor w/ 4 levels "flexible","moderate",..: 2 3 3 3 3 3 3 3 3 1 ...
##  $ host_has_profile_pic  : chr  "t" "t" "t" "t" ...
##  $ host_identity_verified: Factor w/ 2 levels "True","False": 2 1 1 2 1 2 1 1 2 2 ...
##  $ host_response_rate    : num  1 1 1 0.5 1 0.96 1 1 1 0.9 ...
##  $ number_of_reviews     : int  3 72 2 3 140 16 62 178 105 8 ...
##  $ review_scores_rating  : num  1 0.91 0.9 1 0.82 0.95 0.93 0.83 0.92 0.98 ...
##  $ bedrooms              : num  1 1 1 1 1 1 2 1 1 1 ...
##  $ beds                  : num  1 1 1 2 2 1 4 3 1 1 ...
##  $ amenities_count       : int  11 31 22 14 14 15 20 23 19 15 ...
# Convert character variables to factors
ny_ct.df$property_type <- as.factor(ny_ct.df$property_type)
ny_ct.df$room_type <- as.factor(ny_ct.df$room_type)
ny_ct.df$host_has_profile_pic <- factor(ny_ct.df$host_has_profile_pic, levels = c("t", "f"), labels = c("True", "False"))

str(ny_ct.df) # reinspect dataset
## 'data.frame':    15589 obs. of  14 variables:
##  $ log_price             : num  5.39 4.7 5.66 4.93 5.19 ...
##  $ property_type         : Factor w/ 21 levels "Apartment","Bed & Breakfast",..: 1 1 1 1 1 1 1 1 1 9 ...
##  $ room_type             : Factor w/ 3 levels "Entire home/apt",..: 1 2 1 2 1 2 1 1 2 1 ...
##  $ accommodates          : int  2 4 2 3 2 1 8 6 3 3 ...
##  $ bathrooms             : num  1 1 1 1 1 1 1 1 2 1 ...
##  $ cancellation_policy   : Factor w/ 4 levels "flexible","moderate",..: 2 3 3 3 3 3 3 3 3 1 ...
##  $ host_has_profile_pic  : Factor w/ 2 levels "True","False": 1 1 1 1 1 1 1 1 1 1 ...
##  $ host_identity_verified: Factor w/ 2 levels "True","False": 2 1 1 2 1 2 1 1 2 2 ...
##  $ host_response_rate    : num  1 1 1 0.5 1 0.96 1 1 1 0.9 ...
##  $ number_of_reviews     : int  3 72 2 3 140 16 62 178 105 8 ...
##  $ review_scores_rating  : num  1 0.91 0.9 1 0.82 0.95 0.93 0.83 0.92 0.98 ...
##  $ bedrooms              : num  1 1 1 1 1 1 2 1 1 1 ...
##  $ beds                  : num  1 1 1 2 2 1 4 3 1 1 ...
##  $ amenities_count       : int  11 31 22 14 14 15 20 23 19 15 ...

Note: ‘cancellation_policy’ & ‘host_identity_verified’ are already factor variables & we do not need to modify any levels yet.

# Merge "super_strict" into "strict" so the outcome has three levels: "flexible", "moderate", & "strict"
levels(ny_ct.df$cancellation_policy)[levels(ny_ct.df$cancellation_policy) == "super_strict"] <- "strict"
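# (Added check, not in the original output) Confirm the merge: the outcome
# should now have three classes - "flexible", "moderate", & "strict"
table(ny_ct.df$cancellation_policy)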
#Partition data into training & validation sets
set.seed(92)
ny_ct.df_train.index <- sample(c(1:nrow(ny_ct.df)), nrow(ny_ct.df)*0.6)
ny_ct_train.df <- ny_ct.df[ny_ct.df_train.index, ]
ny_ct_valid.df <- ny_ct.df[-ny_ct.df_train.index, ]
#Build the classification tree model
ct <- rpart(cancellation_policy~., ny_ct_train.df, method="class", xval= 10)
# Determine the ideal tree size using Cross-validation
printcp(ct)
## 
## Classification tree:
## rpart(formula = cancellation_policy ~ ., data = ny_ct_train.df, 
##     method = "class", xval = 10)
## 
## Variables actually used in tree construction:
## [1] log_price         number_of_reviews
## 
## Root node error: 4607/9353 = 0.5
## 
## n= 9353 
## 
##     CP nsplit rel error xerror xstd
## 1 0.04      0       1.0    1.0 0.01
## 2 0.01      2       0.9    0.9 0.01
# Determine the ideal tree size using Cross-validation
plotcp(ct)

# Keep the tree size where the cp value has the smallest error
ct_pruned <- prune(ct, 
                  cp = ct$cptable[which.min(ct$cptable[, "xerror"]), "CP"])
# Plot the pruned tree
rpart.plot(ct_pruned, yesno = TRUE)
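As a quick follow-up, the pruned tree can be scored against the validation set to gauge how well it generalizes – a minimal sketch, not part of the original output:

# Predict cancellation policy for the validation set using the pruned tree
ct.valid.pred <- predict(ct_pruned, ny_ct_valid.df, type = "class")
confusionMatrix(ct.valid.pred, ny_ct_valid.df$cancellation_policy)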

Explanation of Classification Tree:

The Classification Tree model we constructed serves the purpose of predicting Airbnb hosts’ cancellation policies in New York City, a critical aspect for both hosts and guests to understand. Our journey began with a meticulous phase of data preparation, including the removal of redundant columns, data type conversions, and handling of missing values. To simplify the classification task, we consolidated two levels of the “cancellation_policy” variable into the broader “strict” category. Subsequently, we partitioned the dataset into two distinct sets: a training set (comprising 60% of the data) and a validation set (comprising 40%), ensuring adequate representation of all cancellation policy classes.

The construction of the decision tree model was an iterative process, involving the exploration of potential features that might influence cancellation policies. After thorough analysis, two key variables emerged as significant contributors: the number of reviews and the log price. These variables play a crucial role in understanding and predicting cancellation policies for Airbnb listings in New York City.

  • Guest and Host Experience: The number of reviews can be seen as a proxy for the level of experience both hosts and guests have had with a particular listing. Listings with a high number of reviews may indicate a history of positive experiences, while those with fewer reviews might be relatively new or less frequently booked. Guests and hosts may have different expectations and behaviors depending on the listing’s review history.

  • Trust and Credibility: High review counts can contribute to building trust and credibility among potential guests. Hosts who maintain positive reviews are likely to have a more favorable cancellation policy, as they may want to uphold their reputation and maintain high occupancy rates. On the other hand, hosts with fewer reviews may adopt stricter policies to mitigate potential risks.

  • Price Sensitivity: The price of a listing is a critical factor for both guests and hosts. Higher-priced listings may have more stringent cancellation policies to protect against last-minute cancellations that could result in significant revenue loss. Lower-priced listings, on the other hand, might offer more flexible cancellation options to attract cost-conscious guests.

  • Market Competition: The pricing strategy of a listing could be influenced by the competitive landscape in the Airbnb market in New York City. Listings in highly competitive areas might offer more flexible cancellation policies to attract bookings, while those in less competitive areas may rely on stricter policies to secure confirmed reservations.

  • Guest Preferences: Different guests may have varying levels of price sensitivity and risk tolerance. Some guests may prioritize flexibility in their travel plans and be willing to pay more for it, while others may prioritize cost savings and be less concerned about the cancellation policy. Hosts may adjust their pricing and policies to align with the preferences of their target guest demographic.

  • Seasonal Variations: The importance of price and the number of reviews in predicting cancellation policies may vary seasonally. For example, during peak tourist seasons, hosts may increase prices and tighten cancellation policies to capitalize on high demand, while off-peak seasons may see lower prices and more lenient cancellation options.

By considering these factors, we created a decision tree model that effectively captures the dynamics of Airbnb rental cancellation policies in New York City. This model serves as a valuable tool for understanding the interplay of guest and host behavior, pricing strategies, and market conditions, benefiting both hosts and guests in the city’s Airbnb ecosystem. It empowers hosts to make informed decisions about their cancellation policies, taking into account various factors that influence their listing’s attractiveness to potential guests. Likewise, guests can use this model to better predict the cancellation policies they might encounter when booking an Airbnb in New York City, enabling them to make travel plans with confidence.

Summary of Classification Tree

The goal was to classify listings by their cancellation_policy: flexible, moderate, or strict.

Modeling Process:

  • Data Preprocessing: Unnecessary variables were removed. The cancellation_policy variable was re-leveled to merge strict and super_strict into one class.
  • Partitioning: The data was split into 60% training and 40% validation sets.
  • Modeling: A classification tree was constructed using the rpart algorithm with 10-fold cross-validation.
  • Pruning: The optimal tree was selected by minimizing cross-validated error (CP value), improving generalization.

Key Predictors:

  • log_price
  • number_of_reviews

Key Findings/Business Implications:

  • Number of Reviews: High review counts signal trustworthy, high-traffic listings, which often lean toward more lenient cancellation policies.
  • Price Sensitivity: Expensive listings may enforce stricter cancellation policies to reduce the impact of last-minute cancellations.
  • Market Competition & Guest Demographics: Competitive areas might encourage flexible policies; hosts targeting cost-conscious guests may offer leniency.
  • Seasonality: Hosts may modify cancellation terms based on demand patterns.

This model visually reveals the decision-making logic behind cancellation policies, offering strategic value for hosts tailoring their policies by listing price, review history, and guest profile.
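Although the tree above was assessed only through cross-validation, its out-of-sample behavior can also be checked against the 40% holdout partition. The following is a minimal sketch using the validation set created earlier; the confusion-matrix call relies on caret, which is already loaded:

# Evaluate the pruned tree on the validation set
ct_valid_pred <- predict(ct_pruned, ny_ct_valid.df, type = "class")
confusionMatrix(ct_valid_pred, ny_ct_valid.df$cancellation_policy)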

5. Clustering - Clustering Brooklyn Neighborhoods

This section applies k-Means Clustering to identify distinct groups of Brooklyn neighborhoods in New York City based on rental and listing characteristics.

Data Preprocessing

# Subset to Brooklyn listings ('neighbourhood' will later be used as cluster labels)
ny_cluster.df <- subset(ny.df, borough == "Brooklyn")
# Create new variable to combine 'number_of_reviews' & 'review_scores_rating'
ny_cluster.df <- ny_cluster.df %>%
  mutate(avg_review_scores_rating = review_scores_rating/number_of_reviews) 
# Remove unnecessary columns
ny_cluster.df <- subset(ny_cluster.df, select= -c(id, property_type, room_type, amenities, bed_type, cancellation_policy, cleaning_fee, city, description, first_review, host_has_profile_pic, host_identity_verified, host_response_rate, host_since, instant_bookable, last_review, latitude, longitude, name, number_of_reviews, review_scores_rating, thumbnail_url, bedrooms, beds, borough, property_group))
# Handle missing values
ny_cluster.df <- na.omit(ny_cluster.df)
str(ny_cluster.df)  # Reinspect dataframe
## 'data.frame':    5846 obs. of  6 variables:
##  $ log_price               : num  4.91 5.08 6.11 5.7 4.76 ...
##  $ accommodates            : int  2 2 7 6 2 2 6 2 5 3 ...
##  $ bathrooms               : num  1 2 2.5 2 1 1 2 1 1 1 ...
##  $ neighbourhood           : chr  "DUMBO" "Boerum Hill" "Boerum Hill" "Downtown Brooklyn" ...
##  $ amenities_count         : int  14 15 10 15 6 19 28 15 20 16 ...
##  $ avg_review_scores_rating: num  0.0172 0.3333 0.1225 0.5 0.5 ...
# Prepare data
cluster_labels = ny_cluster.df$neighbourhood
feature_var <- select(ny_cluster.df, -neighbourhood)

# Scale/standardize data to a mean of 0 & standard deviation of 1
df.scale <- scale(feature_var)

Determine optimal number of clusters (k)

# Compute distance between observations
ny_cluster.df.dist <- dist(df.scale)

# Determine 'k' value (# of clusters) using within sum squares
fviz_nbclust(df.scale, kmeans, method="wss") + labs(subtitle = "Elbow method")
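As a quick cross-check on the elbow plot, the average silhouette width can be examined over the same range of k values. This is an optional sketch using the same factoextra helper:

# Optional cross-check: average silhouette width by number of clusters
fviz_nbclust(df.scale, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method")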

k-Means Clustering

# k-means
optimal_k <- 4
km.out <- kmeans(df.scale, centers = optimal_k, nstart = 100)

Cluster Visualization/Interpretation

fviz_cluster(km.out, data = feature_var, stand = FALSE,
             geom = "point", ellipse.type = "convex", 
             main = "K-Means Clustering of Brooklyn Neighborhoods")

# Generate table with cluster assignments
table(km.out$cluster, ny_cluster.df$neighbourhood)
##    
##     Bath Beach Bay Ridge Baychester Bedford-Stuyvesant Bensonhurst Bergen Beach
##   1          0         1          0                131           0            1
##   2          3        31          0                793          10            0
##   3          3        19          0                473          12            0
##   4          0        12          1                172           3            0
##    
##     Boerum Hill Borough Park Brighton Beach Brooklyn Brooklyn Heights
##   1          13            2              2        1                5
##   2          26           12              6        3               12
##   3          36            1              7        1               31
##   4          12            2              3        2               15
##    
##     Brooklyn Navy Yard Brownsville Bushwick Canarsie Carroll Gardens
##   1                  1           0       33        0               0
##   2                 22          16      659        1               1
##   3                  6           1      226        1               2
##   4                  2           1      173        0               1
##    
##     Clinton Hill Columbia Street Waterfront Coney Island Crown Heights
##   1            5                          0            0             1
##   2           53                          1            1             6
##   3           37                          0            1             4
##   4           18                          0            0             0
##    
##     Downtown Brooklyn DUMBO East Flatbush Flatbush Flatlands Fort Greene
##   1                 1     0             5       16         5          16
##   2                 3     3            14      165        13          55
##   3                13     8             7       74         8          66
##   4                 4     0             3       37         2          31
##    
##     Gowanus Gravesend Greenpoint Greenwood Heights Kensington Lefferts Garden
##   1       9         3         22                 9          8              16
##   2      24        11        220                31         41             114
##   3      32         7        147                20         16              56
##   4      13         2         80                 8         18              36
##    
##     Manhattan Beach Midwood Mill Basin Park Slope Prospect Heights Red Hook
##   1               0       0          1         57               21        3
##   2               2      35          0        115               80       15
##   3               4      11          0        194               69        7
##   4               0       8          0         50               14        4
##    
##     Ridgewood Sea Gate Sheepshead Bay Sunset Park Vinegar Hill Williamsburg
##   1         0        1              2           3            0           15
##   2         3        1             28          67            2          142
##   3         0        2             14          11            4          105
##   4         0        1              4          16            0           34
##    
##     Windsor Terrace
##   1              12
##   2              25
##   3              31
##   4              11
# Determine variable means for each cluster in the original metric (i.e., kmeans model output is based on standardized data)
aggregate(feature_var, by= list(cluster= km.out$cluster), mean)
##   cluster log_price accommodates bathrooms amenities_count
## 1       1      5.35         6.50      2.33            20.4
## 2       2      4.16         1.96      1.11            14.7
## 3       3      4.92         3.92      1.03            21.7
## 4       4      4.38         2.30      1.11            13.5
##   avg_review_scores_rating
## 1                    0.204
## 2                    0.165
## 3                    0.116
## 4                    0.938

Next, we create boxplots for each clustering variable (i.e., log_price, accommodates, bathrooms, amenities_count, & avg_review_scores_rating) by cluster to understand the distribution of data within each cluster:

# Append cluster assignments to the data frame as a factor
ny_cluster.df$cluster <- as.factor(km.out$cluster)
# Boxplots for log_price by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = log_price)) +
  geom_boxplot() +
  labs(x = "Cluster", y = "log_price") +
  ggtitle("Boxplot of log_price by Cluster")

# Boxplots for accommodates by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = accommodates)) +
  geom_boxplot() +
  labs(x = "Cluster", y = "accommodates") +
  ggtitle("Boxplot of accommodates by Cluster")

# Boxplots for bathrooms by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = bathrooms)) +
  geom_boxplot() +
  labs(x = "Cluster", y = "bathrooms") +
  ggtitle("Boxplot of bathrooms by Cluster")

# Boxplots for amenities_count by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = amenities_count)) +
  geom_boxplot() +
  labs(x = "Cluster", y = "amenities_count") +
  ggtitle("Boxplot of amenities_count by Cluster")

# Boxplots for avg_review_scores_rating by cluster
ggplot(ny_cluster.df, aes(x = cluster, y = avg_review_scores_rating)) +
  geom_boxplot() +
  labs(x = "Cluster", y = "avg_review_scores_rating") +
  ggtitle("Boxplot of avg_review_scores_rating by Cluster")

Explanation of k-Means Clustering Model:

In the analysis of Brooklyn neighborhoods in New York City using k-Means clustering, several key steps were undertaken to uncover distinct clusters based on selected features. Initially, the data was pre-processed by narrowing it down to exclusively include Brooklyn neighborhoods and removing irrelevant columns and rows with missing values to ensure data quality. Additionally, we created a new feature, ‘avg_review_scores_rating’, which captures the quality of reviews more effectively by normalizing ‘review_scores_rating’ by ‘number_of_reviews’. The features were then standardized to have a mean of 0 and a standard deviation of 1, ensuring equal influence of each variable in the clustering process (i.e., by preventing variables with larger scales from dominating the results).

The optimal number of clusters (k) was determined using the “elbow method,” resulting in the selection of k=4 as the most suitable choice. k-Means clustering was executed with 100 different starting configurations to enhance the likelihood of finding a globally optimal solution. Visualizing the clusters was facilitated through scatterplots, where each neighborhood was represented by a point, and convex ellipses delineated the clusters. Additionally, a table was generated to illustrate the assignment of neighborhoods to clusters.

To gain a deeper understanding of each cluster’s characteristics, the means of selected variables (log_price, accommodates, bathrooms, amenities_count, and avg_review_scores_rating) were computed in their original metrics. The analysis revealed four distinct clusters of Brooklyn neighborhoods based on the selected features:

  • Luxury Living (Cluster 1): This cluster represents neighborhoods with the highest prices, the largest accommodation capacities, the most bathrooms, a large number of amenities, and relatively high normalized review scores.

  • Bare-Bone Bargains (Cluster 2): Neighborhoods in this cluster are distinguished by the lowest prices, the smallest accommodation capacities, fewer bathrooms, relatively low normalized review scores, and limited amenities.

  • Classic Comfort (Cluster 3): This cluster comprises neighborhoods with moderately high prices, moderate accommodation capacities, the fewest bathrooms on average, the highest amenities counts, and the lowest normalized review scores.

  • Your Average Joes (Cluster 4): Neighborhoods in this cluster feature moderately low prices, smaller accommodation capacities, fewer bathrooms, the fewest amenities, and by far the highest normalized review scores.

In conclusion, the k-Means clustering analysis helped identify and group Brooklyn neighborhoods in New York City based on common characteristics. The distinct clusters can serve as a valuable resource for property investors, tourists, or urban planners, facilitating informed decision-making concerning Brooklyn’s various neighborhoods and their unique attributes.

Recap on Clustering

a. Data Preprocessing

  • Subset Data: The data was filtered to include only Brooklyn neighborhoods.
  • Feature Engineering: A new variable, avg_review_scores_rating, was created by dividing review_scores_rating by number_of_reviews to represent normalized review quality.
  • Cleaning: Irrelevant columns (e.g., IDs, descriptions, host metadata, geographic coordinates) and rows with missing values were removed.
  • Final Variables: Six variables were retained; five of them (log_price, accommodates, bathrooms, amenities_count, & avg_review_scores_rating) were used as clustering features, while neighbourhood was kept only for labeling.
  • Standardization: The numeric features were standardized to have a mean of 0 and standard deviation of 1 to ensure fair clustering.

b. Determining Optimal Clusters

Using the elbow method on within-cluster sum of squares (WSS), the optimal number of clusters was determined to be k = 4.

c. k-Means Clustering Results

The k-Means model was trained with 100 random starting configurations to ensure convergence to a stable solution. Cluster visualization was done using scatterplots with convex ellipses.

Cluster Profiles (Unstandardized Means):

Cluster  log_price  accommodates  bathrooms  amenities_count  avg_review_scores_rating
1        5.35       6.50          2.33       20.4             0.204
2        4.16       1.96          1.11       14.7             0.165
3        4.92       3.92          1.03       21.7             0.116
4        4.38       2.30          1.11       13.5             0.938

Cluster Interpretations:

  • Cluster 1 – Luxury Living: High price, large accommodations, more bathrooms, high amenities, and good review quality.
  • Cluster 2 – Bare-Bone Bargains: Low price, small units, fewer bathrooms, and fewer amenities.
  • Cluster 3 – Classic Comfort: Moderately high prices and average space, with the most amenities but the lowest normalized review scores.
  • Cluster 4 – Your Average Joes: Low-to-mid prices, fewer amenities, but strong normalized review scores.

d. Neighborhood Distribution by Cluster

A frequency table was generated to display how Brooklyn neighborhoods were distributed across the four clusters. This provides insights into how similar or distinct various locations are in terms of rental attributes.
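For example, the raw counts can be converted into row proportions to see which cluster dominates each neighborhood. A brief sketch (the helper objects cluster_mix and dominant_cluster are introduced here purely for illustration):

# Share of each neighborhood's listings assigned to each cluster
cluster_mix <- prop.table(table(ny_cluster.df$neighbourhood, km.out$cluster), margin = 1)

# Dominant cluster per neighborhood
dominant_cluster <- apply(cluster_mix, 1, which.max)
head(dominant_cluster, 10)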

e. Boxplot Visualizations

Boxplots were created to visualize variable distributions by cluster:

  • log_price: Revealed clear pricing tiers across clusters.
  • accommodates: Larger listings clustered into higher-price groups.
  • bathrooms: Cluster 1 stood out with significantly more bathrooms.
  • amenities_count: Cluster 3 had the most amenities.
  • avg_review_scores_rating: Cluster 4 had the highest average review rating per review count.

Conclusion

The k-Means clustering approach effectively grouped Brooklyn neighborhoods into four segments based on rental characteristics. These clusters provide valuable insights for:

  • Tourists: To target neighborhoods that fit their budget and preferences.
  • Property Managers: To benchmark and align listings with similar offerings.
  • Urban Planners: To understand diversity in housing types across Brooklyn.

This unsupervised learning method uncovered meaningful patterns that complement the classification models and enrich the overall understanding of Airbnb rental dynamics in New York City.

6. Conclusions + Implications of Analysis

This project provided a comprehensive exploration of New York City Airbnb listings using a range of machine learning techniques—including regression, classification, and clustering—to extract actionable insights from complex real-world data. The results offer meaningful implications for multiple stakeholders:

Strategic Applications by Stakeholder

Airbnb (Platform Owner):

By leveraging clustering insights, Airbnb can enhance its recommendation engine. For example, if a user is browsing a listing in Williamsburg, the platform can recommend other listings in neighborhoods with similar characteristics (e.g., amenities, price point), improving user satisfaction and boosting booking conversions.
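As a rough illustration, the dominant-cluster mapping sketched in the clustering section could be reused to surface neighborhoods similar to the one a user is browsing. This relies on the hypothetical dominant_cluster helper introduced earlier, not on anything built in the original analysis:

# Hypothetical: neighborhoods whose dominant cluster matches Williamsburg's
target <- dominant_cluster["Williamsburg"]
similar_neighbourhoods <- names(dominant_cluster[dominant_cluster == target])
setdiff(similar_neighbourhoods, "Williamsburg")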

Property Owners & Managers:

Regression and classification models enable owners to price their listings competitively based on key features (e.g., size, location, amenities). Informed by data, these models support smarter revenue management and higher occupancy rates.

Real Estate Investors:

Investors can identify high-opportunity clusters of neighborhoods through unsupervised learning, targeting areas aligned with preferred investment profiles (e.g., low price, high review ratings, high amenity density). Insights may also be extrapolated to other urban markets.

Travelers & Airbnb Customers:

Data-driven segmentation can guide customers toward fairly priced listings with desirable features, helping them avoid overpaying or overlooking strong value options.

Hotels & Hospitality Competitors:

These models shed light on why customers choose Airbnb, often favoring flexibility, price, or space. Hotels can use this insight to adapt service offerings or pricing models to compete more effectively.

Market Researchers:

The project highlights key drivers of consumer behavior in the peer-to-peer rental market, providing a framework for future analysis in similar sectors or regions.

Policy Makers & Urban Planners:

Regulatory agencies can use these findings to better understand the Airbnb ecosystem’s impact on housing, tourism, and urban development, aiding in policy formulation around zoning, taxation, and neighborhood preservation.

Broader Implications

While focused on New York City, this project provides a reusable blueprint for exploring short-term rental dynamics in other metropolitan areas. The methodologies employed - such as price tier classification, cleaning fee prediction, and neighborhood clustering - can be replicated with local data to inform decision-making in other tourism-heavy cities.

Final Thoughts:
This data mining initiative bridges analytics with real-world impact. From pricing strategy to urban policy, the findings underscore how well-applied machine learning models can drive better decisions, deepen customer understanding, and ultimately create more efficient and equitable marketplaces.