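#' nice_table
#'
#' Render a data frame as a styled kable table.
#'
#' @param df Data frame to display.
#' @param cap Optional table caption.
#' @param cols Optional column names; defaults to the column names of df.
#' @param dig Number of digits passed to kable.
#' @param fw Whether the table should span the full width.
nice_table <- function(df, cap = NULL, cols = NULL, dig = 3, fw = FALSE) {
  # Fall back to the data frame's own column names when none are supplied
  col_names <- if (is.null(cols)) colnames(df) else cols
  df %>%
    kable(caption = cap, col.names = col_names, digits = dig) %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      html_font = "monospace",
      full_width = fw
    )
}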
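#' coeff2dt
#'
#' Extract the non-zero coefficients of a fitted model as a data frame.
#'
#' @param fitobject Fitted model whose coef() method returns a sparse column matrix.
#' @param s Second argument passed to coef(), e.g. the penalty value for a Lasso fit.
#'
#' @return Data frame of names and coefficients, sorted in decreasing order.
#' @export
coeff2dt <- function(fitobject, s) {
  coeffs <- coef(fitobject, s)
  # Sparse storage: @i holds zero-based row indices of the non-zero entries
  coeffs.dt <- data.frame(name = coeffs@Dimnames[[1]][coeffs@i + 1], coefficient = coeffs@x)
  # Sort by coefficient, largest first
  coeffs.dt[order(coeffs.dt$coefficient, decreasing = TRUE), ]
}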
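As a rough illustration of what coeff2dt reads from its input, the sketch below builds a small sparse column with the Matrix package and extracts the non-zero entries in the same way. It assumes the coefficient object is a column-compressed sparse matrix, as coef() returns for a glmnet fit; the names and values are made up for illustration.

```r
library(Matrix)

# Hypothetical 4 x 1 sparse coefficient column (made-up names and values)
coeffs <- Matrix(c(1.5, 0, -0.3, 0), ncol = 1, sparse = TRUE,
                 dimnames = list(c("(Intercept)", "x1", "x2", "x3"), "s1"))

# @i holds zero-based row indices of the non-zero entries, @x their values
nonzero <- data.frame(name = coeffs@Dimnames[[1]][coeffs@i + 1],
                      coefficient = coeffs@x)

# Sort by coefficient, largest first, as coeff2dt does
nonzero <- nonzero[order(nonzero$coefficient, decreasing = TRUE), ]
print(nonzero)
```

Only the rows with non-zero coefficients survive, which is why the helper is convenient for inspecting which features a Lasso fit kept.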
The real estate market is a dynamic and intricate setting where a variety of factors affect the price of real estate. Listings on Airbnb, a big part of this market, are no exception. Numerous factors, such as location, amenities, and local demand, influence these prices. Understanding and forecasting the cost of Airbnb listings is essential both for hosts looking to set competitive rates and for visitors looking for good value. The conventional hotel sector has also been affected by Airbnb’s growth, drawing attention from economists and market analysts. The goal of this project was to forecast the price of Airbnb listings. Our objectives were to gain insight into the key factors that affect Airbnb listing prices in New York City, to develop predictive models that estimate listing prices from relevant factors, and to evaluate the performance of those models and identify opportunities for improvement.
We started by carefully reviewing the dataset to find problems like outliers, missing values, and possible predictor multicollinearity. This procedure, which was essential for guaranteeing the accuracy of our analysis, resulted in preprocessing and data cleaning, where we took care of these problems. We used Lasso Regression and Linear Regression models for our analysis after the data were prepared. These models were selected because they are well-suited to comprehending how different features affect the dependent variable, which in this case is the cost of Airbnb listings. Given its popularity for being easily understood and straightforward, Linear Regression gave us a starting point model. However, by penalizing less significant features, Lasso Regression’s feature selection capability allowed us to better understand the most significant predictors.
# Load data from Github
data <- read.csv("https://raw.githubusercontent.com/ex-pr/DATA_622/main/HW%204/nyc_listings_2023.csv")
The dataset contained 38,792 observations of 18 variables. The data covered listings in New York City from October 2023, and each record described a single rental property, including its type, location, cost, and review-related details. The price column was the target; it specified the price of an Airbnb listing per night. Specifically:
id: Unique identifier for the listing.
name: The listing’s name or description.
host_id: Unique identifier for the host.
host_name: The host’s name.
neighbourhood_group: The borough in which the listing was situated.
neighbourhood: The specific neighbourhood in which the listing was situated.
latitude, longitude: Location coordinates for the listing.
room_type: The kind of room being provided (private room, entire apartment, shared room or hotel room).
price: Cost per night for the listed property.
minimum_nights: The minimum number of nights needed to make a reservation.
number_of_reviews: Total number of reviews that the listing had received.
last_review: Date of the last review.
reviews_per_month: The average monthly number of reviews.
calculated_host_listings_count: Total number of listings that the host had.
availability_365: The number of days a year that the listing is bookable.
number_of_reviews_ltm: Number of reviews in the last twelve months.
license: Details about the listing’s license.
Regression techniques were the focus of the algorithm selection process because the price variable in our dataset was continuous. The data preparation steps were tailored to suit these algorithms, with an emphasis on managing missing values, scaling and normalizing the features, and encoding categorical variables where necessary.
Linear Regression: This algorithm was a useful place to start. The relationship between the independent variables and the target variable was assumed to be linear. It was an excellent baseline model because it was straightforward, comprehensible, and didn’t require complicated parameter tuning.
Lasso Regression: A kind of linear regression with an added penalty term (Least Absolute Shrinkage and Selection Operator). The penalty was proportional to the sum of the absolute values of the coefficients. Because this kind of regression performed feature selection by shrinking the coefficients of less significant features to zero, it was helpful when we had a large number of features.
K-Fold Cross Validation: We applied it to evaluate our models’ performance. With this method, the dataset was divided into k subsets; in each round, one subset served as the test set and the remaining subsets formed the training set. The process was repeated k times so that every subset served as the test set exactly once. Because each observation was used for both training and testing, the resulting error estimate depended less on any single split and helped reveal overfitting or underfitting.
The target variable’s nature as a quantitative measure, price, influenced the choice of these algorithms. For modeling the relationship between the predictors and a continuous outcome, linear and Lasso Regression were preferred due to their simplicity and effectiveness. However, multicollinearity and irrelevant features could affect the performance of lasso and linear regression; this was where feature selection and regularization techniques came into play. Reliable validation of model performance across subsets of the data was aided by k-fold cross validation.
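The k-fold procedure described above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the report's actual modeling code; the data frame sim, the choice of k = 5, and the RMSE metric are all assumptions made for the example.

```r
# Minimal k-fold cross-validation for a linear model (base R only).
# The data are simulated; nothing here comes from the Airbnb dataset.
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 + 2 * sim$x1 - 1 * sim$x2 + rnorm(n, sd = 0.5)

k <- 5
# Assign each row to one of k equally sized folds at random
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold serves as the test set once; the rest form the training set
rmse_per_fold <- sapply(seq_len(k), function(fold) {
  train <- sim[fold_id != fold, ]
  test  <- sim[fold_id == fold, ]
  fit   <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$y - pred)^2))
})

# Average the per-fold errors into a single cross-validated estimate
cv_rmse <- mean(rmse_per_fold)
print(round(rmse_per_fold, 3))
print(round(cv_rmse, 3))
```

A large spread between the per-fold errors would suggest that the estimate depends heavily on how the data happen to be split.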
The data source: http://insideairbnb.com/get-the-data
# Check first rows of data
DT::datatable(
data[1:25,],
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE)
The table below provided summary statistics for the New York City listings in 2023, highlighting variation in listing prices, booking requirements, and guest interaction.
There were a small number of missing values (0.01%) in the host_name and name columns, a sizable share (26.7%) in the last_review and reviews_per_month columns, and a very high percentage (92.42%) in the license column. These missing values had to be addressed before any further analysis.
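Missing-value percentages like those quoted above can be computed along the following lines. This is a sketch on a made-up data frame; it assumes that missing text fields are stored as empty strings, which is an assumption about the CSV rather than something the report states.

```r
# Toy data frame standing in for the listings data (values are made up)
toy <- data.frame(
  name              = c("Loft", "", "Studio", "Room"),
  last_review       = c("2023-09-01", "", "2023-08-15", ""),
  reviews_per_month = c(1.2, NA, 0.4, NA),
  stringsAsFactors  = FALSE
)

# Treat empty strings in character columns as missing (assumption)
toy[] <- lapply(toy, function(col) {
  if (is.character(col)) col[col == ""] <- NA
  col
})

# Percentage of missing values per column
missing_pct <- round(colMeans(is.na(toy)) * 100, 2)
print(missing_pct)
```

Without the empty-string step, is.na() would count only the numeric NA values, and the character columns would appear complete.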
Average price, minimum night requirement, distribution of room types, and average availability are given in this summary. The prices of the listings ranged from 0 to 30,000 dollars, with an average of 215.95 dollars. The minimum stay ranged from 1 to 1,250 nights, with an average of 30.64 nights. The dataset included the following room types: hotel room, shared room, private room, and entire home/apt. Entire home/apt accounted for 54.96% of listings. Listings were available for 148.75 days a year on average, ranging from no availability to availability all year round. The number of reviews per listing ranged from none to many, which suggested different levels of guest interaction or different lengths of time on the platform. The average number of reviews per month was 1.08, although this varied substantially between listings.
# Check summary statistics of the data
print(dfSummary(data, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 400, footnote = NA, col.width=5, method="render")
No | Variable | Freqs (% of Valid) | Missing
---|---|---|---
1 | id [numeric] | 38792 distinct values | 0 (0.0%)
2 | name [character] | | 0 (0.0%)
3 | host_id [integer] | 23811 distinct values | 0 (0.0%)
4 | host_name [character] | | 0 (0.0%)
5 | neighbourhood_group [character] | | 0 (0.0%)
6 | neighbourhood [character] | | 0 (0.0%)
7 | latitude [numeric] | 23359 distinct values | 0 (0.0%)
8 | longitude [numeric] | 21014 distinct values | 0 (0.0%)
9 | room_type [character] | | 0 (0.0%)
10 | price [integer] | 1184 distinct values | 0 (0.0%)
11 | minimum_nights [integer] | 114 distinct values | 0 (0.0%)
12 | number_of_reviews [integer] | 465 distinct values | 0 (0.0%)
13 | last_review [character] | | 0 (0.0%)
14 | reviews_per_month [numeric] | 822 distinct values | 10352 (26.7%)
15 | calculated_host_listings_count [integer] | 70 distinct values | 0 (0.0%)
16 | availability_365 [integer] | 366 distinct values | 0 (0.0%)
17 | number_of_reviews_ltm [integer] | 144 distinct values | 0 (0.0%)
18 | license [character] | | 0 (0.0%)
Columns neighbourhood_group, neighbourhood, and room_type were converted to categorical variables (factors) to better reflect their inherently categorical nature. The column last_review was converted to date format because it contained the date of the last review.
# Copy original data
imputed_df <- data
# Convert categorical variables to factors
cols <- c("neighbourhood_group", "neighbourhood", "room_type")
imputed_df[cols] <- lapply(imputed_df[cols], factor)
# Convert 'last_review' to datetime objects
imputed_df$last_review <- as.Date(imputed_df$last_review)
Managing missing values was one of the critical problems to fix before building the models.
The missing values for the continuous variable reviews_per_month (10,352 rows) were imputed with 0 because they corresponded to listings with 0 number_of_reviews (10,352 rows with 0 reviews). The missing values in the date column last_review were imputed with the minimum date from the column. The NA values in the last_review and reviews_per_month columns all occurred for listings that had never received a review.
Instead of keeping the date format, we converted last_review to the number of days since the earliest review. The ordinal value of the earliest date was subtracted from every date in the column, so that the earliest date in the dataset corresponded to 0 and every other date was represented as the number of days since that date.
The rest of the columns with missing values were not useful for the analysis and predictions. The host_name and name columns contained the names of the host and the listing. The license column showed whether a listing had a license (by law, specific short-term rentals require one); more than 90% of its values were missing, so it was dropped. The id and host_id columns were dropped as well: although they had no missing values, they contained unique identification numbers for each listing and host, which were not useful for further analysis.
# check NAs in number_of_reviews and last_review
dim(imputed_df[imputed_df$number_of_reviews == 0,])
## [1] 10352 18
sum(is.na(imputed_df$last_review))
## [1] 10352
# Set seed for constant results
set.seed(42)
# Impute NAs in 'reviews_per_month' with 0
imputed_df <- imputed_df %>%
mutate(across(c('reviews_per_month'), ~replace_na(., 0)))
# remove columns id, host_id, host_name, name, license
imputed_df <- imputed_df %>%
dplyr::select(-c(id, host_id, host_name, name, license))
# Find the earliest date
earliest <- min(imputed_df$last_review, na.rm = TRUE)
# Replace NA values with the earliest date
imputed_df$last_review <- replace_na(imputed_df$last_review, earliest)
# Convert dates to ordinal and subtract the ordinal value of the earliest date
imputed_df$last_review <- as.integer(imputed_df$last_review - as.Date(earliest))
One-hot encoding ensured that our dataset was suitable for a broader range of algorithms that require numerical input, which could improve the accuracy and effectiveness of our later analysis.
neighbourhood_group and room_type were identified as categorical columns that would benefit from one-hot encoding.
Three new columns were created for room_type: room_type_Entire home/apt, room_type_Hotel room, and room_type_Private room. Each of these columns took a binary value indicating whether the corresponding room type was present (1) or absent (0).
Four new columns were created for neighbourhood_group: neighbourhood_group_Bronx, neighbourhood_group_Brooklyn, neighbourhood_group_Manhattan, and neighbourhood_group_Queens. Each of these columns took a binary value indicating whether the corresponding neighbourhood group was present (1) or absent (0).
The last category of each original categorical variable (neighbourhood_group_Staten Island, room_type_Shared room) was dropped during encoding to avoid multicollinearity and reduce redundancy.
The original columns were dropped after exploratory analysis.
The column names were cleaned to remove spaces and convert all letters to lowercase using the clean_names() function from the janitor library.
The neighbourhood column was not encoded or used in the models because it was highly granular, with numerous distinct categories, and one-hot encoding it would have produced an extremely high-dimensional dataset. We assumed it was not crucial for prediction, given that geographical coordinates and neighbourhood_group were already included.
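The reason for dropping one category per variable can be checked numerically. The sketch below uses base R's model.matrix on a toy factor rather than the dummy_cols call used in this report: with an intercept in the model, a full set of indicator columns is linearly dependent, and removing one level restores full column rank.

```r
# Toy factor with the four room types named in the report
room_type <- factor(c("Entire home/apt", "Hotel room", "Private room",
                      "Shared room", "Private room", "Entire home/apt"))

# Full one-hot encoding: one indicator column per level
full_dummies <- model.matrix(~ room_type - 1)

# Adding an intercept makes the columns linearly dependent,
# because the four indicators always sum to 1
with_intercept <- cbind(intercept = 1, full_dummies)
rank_full <- qr(with_intercept)$rank      # 4, although there are 5 columns

# Dropping one level (here the last one, Shared room) restores full column rank
reduced <- with_intercept[, -ncol(with_intercept)]
rank_reduced <- qr(reduced)$rank          # 4, with 4 columns

print(c(columns = ncol(with_intercept), rank = rank_full))
print(c(columns = ncol(reduced), rank = rank_reduced))
```

The dropped level becomes the reference category, so each remaining coefficient is read as a difference from it.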
# Copy data without NAs
encoded_df <- imputed_df
# One hot encoding for 'neighbourhood_group' and 'room_type'
encoded_df <- dummy_cols(encoded_df, select_columns = c('neighbourhood_group', 'room_type'))
# Remove last category of 'neighbourhood_group' and 'room_type'
encoded_df <- encoded_df %>%
dplyr::select(-c('neighbourhood_group_Staten Island', 'room_type_Shared room')) %>%
clean_names()
#encoded_df <- dummy_cols(encoded_df, select_columns = c('neighbourhood'))
#encoded_df <- encoded_df %>%
#dplyr::select(-c('neighbourhood_Fort Wadsworth')) %>%
# clean_names()
A new feature, availability_ratio, was created to ease interpretation: it expressed the share of the year during which the listing was available.
# create feature availability_ratio
encoded_df$availability_ratio <- encoded_df$availability_365 / 365
We used boxplots to detect outliers.
There were notable anomalies in price, with certain listings charging astronomically high prices compared with the majority. These outliers were removed (price > $9,500 per night, just 26 observations), and the BoxCox transformation was applied to the train and test data.
The minimum_nights column also displayed anomalies, with certain listings having extraordinarily high minimum-night requirements. As the plot showed, just 6 listings out of 38,792 had a minimum stay greater than 500 nights. Hosts could set such high minimums when they did not want to accept new guests, so these values did not reflect real requirements. We removed them because they could skew the results.
number_of_reviews, reviews_per_month: A small percentage of listings had an unusually high number of total reviews, and some listings received many reviews each month. Such listings could, for example, represent a hotel offering many rooms under a single listing, so that one listing collected the reviews for all of its rooms. Only 9 listings had more than 700 reviews and only 3 had more than 40 reviews per month; they were not representative of typical listings and were likely to skew the modeling results, so they were removed. Removing outliers is a common approach when they constitute a very small percentage of the data and are not central to the analysis. Because linear regression and Lasso Regression are sensitive to the scale and distribution of the input features, outliers can significantly affect the performance of these models.
calculated_host_listings_count: Compared with the average host, some hosts had many more listings, which could happen when a host owned multiple properties.
availability_ratio: Listings with extremely high availability all year round could be properties specifically intended for rental use.
number_of_reviews_ltm: Some listings had received a large number of reviews in just the past year, comparable to their overall number of reviews. As mentioned above, these could be hotels. Listings with more than 200 reviews in the last 12 months were removed.
If left unaddressed, these outliers, which may represent special cases or data entry mistakes, could skew the analysis. In total, 50 rows were removed as outliers.
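To make the distinction between boxplot-flagged points and the hand-picked cutoffs concrete, the sketch below applies the usual 1.5 × IQR boxplot rule and a fixed threshold to simulated prices. The data are simulated and only the 9,500 cutoff is taken from the text above; the point is that the boxplot rule typically flags far more rows than a conservative domain cutoff removes.

```r
# Simulated right-skewed "prices" with three extreme values appended
set.seed(42)
price <- c(round(rlnorm(500, meanlog = 5, sdlog = 0.6)), 12000, 15000, 30000)

# Boxplot (Tukey) rule: points above Q3 + 1.5 * IQR are drawn as outliers
q <- quantile(price, c(0.25, 0.75))
iqr <- q[2] - q[1]
upper_fence <- q[2] + 1.5 * iqr
flagged <- price[price > upper_fence]

# Fixed domain cutoff, mirroring the price > 9500 filter in the report
removed <- price[price > 9500]

print(length(flagged))
print(length(removed))
```

In a heavily skewed variable, removing everything beyond the whiskers would discard many legitimate listings, which is one argument for fixed thresholds chosen after inspecting the plots.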
# Numeric columns to check for outliers
continuous_vars <- c("price", "minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "number_of_reviews_ltm", "availability_ratio")
# List to store plots
plots <- list()
# Generate boxplots for each variable
for(i in 1:length(continuous_vars)) {
p <- ggplot(encoded_df, aes_string(y = continuous_vars[i])) +
geom_boxplot() +
theme_minimal()
plots[[i]] <- p
}
# Arrange the plots in a grid
grid.arrange(grobs = plots, ncol = 3)
dim(encoded_df[encoded_df$price > 9500,])
## [1] 26 21
dim(encoded_df[encoded_df$minimum_nights > 500,])
## [1] 6 21
dim(encoded_df[encoded_df$number_of_reviews > 700,])
## [1] 9 21
dim(encoded_df[encoded_df$reviews_per_month > 40,])
## [1] 3 21
dim(encoded_df[encoded_df$number_of_reviews_ltm > 200,])
## [1] 15 21
encoded_df <- encoded_df %>%
filter(price <=9500 & minimum_nights <= 500 & number_of_reviews <= 700 & reviews_per_month <= 40 & number_of_reviews_ltm <= 200)
After the data transformation, no missing values remained.
New columns were added (neighbourhood_group_bronx, neighbourhood_group_brooklyn, neighbourhood_group_manhattan, neighbourhood_group_queens, room_type_entire_home_apt, room_type_hotel_room, room_type_private_room, availability_ratio), while others were removed (id, name, host_id, host_name, license).
Overall, the main changes brought about by removing the outliers were seen in the maximum values of minimum_nights, number_of_reviews, reviews_per_month, number_of_reviews_ltm, and price. As a result, the means and standard deviations of these variables decreased somewhat, which made the data more compact and probably more appropriate for linear and Lasso Regression analysis.
DT::datatable(
encoded_df[1:25,],
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE)
print(dfSummary(encoded_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 500, footnote = NA, col.width=50, method="render")
No | Variable | Freqs (% of Valid) | Missing
---|---|---|---
1 | neighbourhood_group [factor] | | 0 (0.0%)
2 | neighbourhood [factor] | | 0 (0.0%)
3 | latitude [numeric] | 23347 distinct values | 0 (0.0%)
4 | longitude [numeric] | 21002 distinct values | 0 (0.0%)
5 | room_type [factor] | | 0 (0.0%)
6 | price [integer] | 1174 distinct values | 0 (0.0%)
7 | minimum_nights [integer] | 110 distinct values | 0 (0.0%)
8 | number_of_reviews [integer] | 450 distinct values | 0 (0.0%)
9 | last_review [integer] | 2924 distinct values | 0 (0.0%)
10 | reviews_per_month [numeric] | 805 distinct values | 0 (0.0%)
11 | calculated_host_listings_count [integer] | 70 distinct values | 0 (0.0%)
12 | availability_365 [integer] | 366 distinct values | 0 (0.0%)
13 | number_of_reviews_ltm [integer] | 129 distinct values | 0 (0.0%)
14 | neighbourhood_group_bronx [integer] | | 0 (0.0%)
15 | neighbourhood_group_brooklyn [integer] | | 0 (0.0%)
16 | neighbourhood_group_manhattan [integer] | | 0 (0.0%)
17 | neighbourhood_group_queens [integer] | | 0 (0.0%)
18 | room_type_entire_home_apt [integer] | | 0 (0.0%)
19 | room_type_hotel_room [integer] | | 0 (0.0%)
20 | room_type_private_room [integer] | | 0 (0.0%)
21 | availability_ratio [numeric] | 366 distinct values | 0 (0.0%)
availability_ratio displayed a more variable distribution. While some listings were available all year long, most were not. Latitude and longitude were roughly normally distributed, with most listings concentrated in a specific area. All other distributions were right-skewed. Severe skewness in the predictors or the target variable could be problematic in regression analysis because it could violate the normality assumption, particularly in linear regression models. In addition to causing non-linearity and heteroscedasticity (non-constant variance), skewed distributions can also make the model more susceptible to outliers. We had to apply transformations such as BoxCox and log+1 in the data preparation stage to normalize these variables. This was addressed after splitting the data into training and test sets.
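The effect of a log+1 transformation on a right-skewed variable can be illustrated with a short base R sketch. The skewness function below is written out by hand (the third standardized moment), and the counts are simulated stand-ins for a variable such as number_of_reviews; neither comes from the report's code.

```r
# Sample skewness (third standardized moment), written out in base R
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
# Simulated right-skewed counts with many zeros (hypothetical data)
reviews <- rnbinom(2000, size = 0.5, mu = 25)

skew_raw <- skewness(reviews)
skew_log <- skewness(log1p(reviews))   # log(1 + x) is defined at zero

print(round(c(raw = skew_raw, log1p = skew_log), 2))
```

log1p is used rather than log because variables such as review counts contain zeros, for which the plain logarithm is undefined.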
# Choose numeric variables
numeric_vars <-c("price", "minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "availability_ratio", "number_of_reviews_ltm", "latitude", "longitude")
# List to store plots
plots <- list()
# Generate histograms for each variable
for (i in seq_along(numeric_vars)) {
  p <- ggplot(encoded_df, aes_string(x = numeric_vars[i])) +
    geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black", alpha = 0.7) +
    geom_density(alpha = 0.2, fill = "#FF6666") +
    ggtitle(paste0('Distribution of ', numeric_vars[i])) +
    theme_minimal()
  plots[[i]] <- p
}
# Plot in a grid with 2 columns
grid.arrange(grobs = plots, ncol = 2)
The distribution of Airbnb listings among New York City’s boroughs showed noticeably more listings in some boroughs than in others, which could indicate more demand or a larger selection of lodging in those areas. Most properties were found in Manhattan, followed by Brooklyn, probably because Manhattan was a popular destination for both leisure and business travelers.
The room types available in Airbnb listings showed that some were more common than others, which could reflect hosts’ preferences or trends in visitor demand. Entire home/apt, followed by Private room, was the most common room type across all neighbourhood groups, indicating that hosts in NYC were more likely to offer entire apartments than shared spaces.
Understanding these distributions was crucial to grasping the dynamics of New York City’s Airbnb listings. For instance, a borough with many listings might be a commercial center or a popular vacation spot. Similarly, the popularity of a specific room type can reveal the kind of lodging that visitors to NYC usually look for.
# Choose factor variables
factor_vars <- c("neighbourhood_group", "room_type")
# List to store plots
plots <- list()
# Generate barplots for each variable
for (i in 1:length(factor_vars)) {
p <- ggplot(encoded_df, aes_string(x = factor_vars[i])) +
geom_bar(fill = "lightgreen", color = "black", alpha = 0.7) +
ggtitle(paste0('Distribution of ', factor_vars[i])) +
theme_minimal()
plots[[i]] <- p
}
# Plot in a grid with 1 column
grid.arrange(grobs = plots, ncol = 1)
The grouped bar chart below showed the distribution of Airbnb properties by room type across the neighbourhood groups of New York City.
Manhattan had the most Airbnb listings, followed by Brooklyn. Entire home/apt was the most common listing type, followed by Private room. This suggested that Airbnb stays were also common in Brooklyn.
Compared to Manhattan and Brooklyn, there were noticeably fewer listings in Queens, the Bronx, and Staten Island. Of these, Queens had a higher number of listings than Staten Island and the Bronx.
In all boroughs, private room listings were the second most prevalent, with a notable concentration in Brooklyn and Manhattan.
There weren’t many listings for shared rooms, which could be because hosts preferred not to provide shared spaces or because there wasn’t as much demand for this kind of lodging.
There were variations in the distribution of room types among the boroughs, which could be due to factors such as the local housing market, zoning laws, or the travel and demographic patterns of each community.
# Group by borough, count room type in each borough
group_room <- encoded_df %>%
group_by(neighbourhood_group) %>%
count(room_type)
# Bar plot borough vs number of listings by room type
ggplot(group_room, aes(x = neighbourhood_group, y = n, fill = room_type)) +
geom_bar(position = "dodge", stat = "identity") +
theme_minimal() +
labs(title = "Borough vs Properties, NYC", x = "Borough", y = "Number of Properties") +
scale_fill_discrete(name = "Room Type")
The mix of neighborhoods in the top 10 reflected the diverse appeal of New York City’s numerous neighborhoods by showcasing a range of areas dispersed across different parts of the city.
The strong representation of Williamsburg and Bedford-Stuyvesant underscored Brooklyn’s growing appeal as a destination for travelers to New York City.
The list included Manhattan neighborhoods like Harlem, Hell’s Kitchen, Upper West Side, and Upper East Side, highlighting the city’s ongoing appeal because of its convenient location and wealth of attractions.
Popular neighborhoods that draw a varied range of visitors included those with cultural, historical, or entertainment significance, such as Hell’s Kitchen and Harlem, which are well-known for their restaurants and close proximity to Broadway.
# Group by neighbourhood, count
group_neigh <- encoded_df %>%
group_by(neighbourhood) %>%
count() %>%
arrange(desc(n))
ggplot(group_neigh[1:10,], aes(x = reorder(neighbourhood, -n), y = n, fill = neighbourhood)) +
geom_bar(position = "dodge", stat = "identity") +
theme_minimal() +
labs(title = "Neighbourhood vs Properties, NYC", x = "Neighbourhood", y = "Number of Properties") +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")
Price vs Numerical Variables: Price did not seem to be strongly correlated with either latitude or longitude. Nonetheless, there were pockets of higher pricing in specific locations, most likely associated with upscale or better-known neighborhoods. Most listings had a low minimum-night requirement, and only a few had a very high one. The minimum number of nights and the price did not exhibit a clear linear relationship. The majority of expensive listings had low minimum-night requirements.
In general, listings with a large number of reviews were less expensive. This could suggest that lower-priced listings were booked and reviewed more frequently. The prices of listings with fewer reviews varied greatly. Listings with more reviews per month typically had lower prices, as was the case with the total number of reviews. This plot also implied that listings that were reviewed, and probably booked, more frequently were more reasonably priced.
The price and the number of listings a host had were not clearly correlated. Prices varied widely for hosts with few or many listings. Prices also varied greatly across the different levels of yearly availability. The lack of a discernible pattern suggested that price was not strongly correlated with availability. As with the total number of reviews, listings with a higher number of recent reviews typically had lower prices. The trend suggested that lower-priced listings could be booked and reviewed more frequently.
numeric_vars <-c("minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "availability_ratio", "number_of_reviews_ltm", "latitude", "longitude")
# Plot target vs numeric columns
for (i in numeric_vars) {
p <- ggplot(encoded_df, aes_string(x = i, y = "price")) +
geom_point() +
theme_bw() +
labs(title = paste('Price vs', i), x = i, y = 'Price')
print(p)
}
The price column had a highly skewed distribution. This could cause problems for machine learning algorithms such as linear regression. The BoxCox transformation and removal of outliers (after ) made the distribution look much closer to normal.
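# Profile the Box-Cox lambda on an intercept-only model (price + 1 keeps zero prices positive)
aa_boxcox <- boxcox(lm((encoded_df$price+1) ~ 1))
# Pick the lambda with the highest profile log-likelihood
aa_lambda <- aa_boxcox$x[which.max(aa_boxcox$y)]
# Apply the transformation with the selected lambda
aa_trans <- BoxCox(encoded_df$price+1, aa_lambda)
encoded_df$box_price <- aa_trans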
# Create the distribution plot for `price`
p1 <- ggplot(encoded_df, aes(x=price)) +
geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") +
ggtitle("Distribution of Price")
# Create the distribution plot for BoxCox(price)
p2 <- ggplot(encoded_df, aes(x=box_price)) +
geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") +
ggtitle("Distribution of transformed Price") +
xlab("BoxCox(price)")
# Create a Q-Q plot for BoxCox(price)
p3 <- ggplot() +
stat_qq(aes(sample = box_price), data = encoded_df) +
stat_qq_line(aes(sample = box_price), data = encoded_df) +
ggtitle("Q-Q Plot of BoxCox(price)")
# Arrange the three plots in a two-column grid
grid.arrange(p1, p2, p3, ncol = 2)
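For reference, the transformation applied to price above can be written out explicitly. This is the standard one-parameter Box-Cox definition for a strictly positive value y (here y is price + 1); the symbol names are ours rather than the report's.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
```

The value of lambda is the one that maximizes the profile log-likelihood over a grid, which is what `aa_boxcox$x[which.max(aa_boxcox$y)]` extracts.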
Price vs Categorical Variables:
The median prices of the neighborhood groups differed noticeably from one another. The common belief that Manhattan was a more expensive area was supported by the fact that the borough's median prices tended to be higher than those of the other boroughs. Compared to private rooms and hotel rooms, entire homes and apartments were typically more expensive. The larger price variance for entire homes or apartments pointed to significant variation in the kind and quality of these listings.
# Plot target vs categorical
for (i in factor_vars) {
p <- ggplot(encoded_df, aes_string(x = i, y = "price")) +
geom_boxplot() +
theme_bw() +
labs(title = paste('Price vs', i), x = i, y = 'Price')
print(p)
}
It appeared that a listing's location (shown by latitude and longitude) affected its price, but not in a straightforward linear way. It was likely that neighborhood-specific factors were important. A few factors that clearly affected price were the type of room and the number of reviews. Listings with lower prices appeared to draw more reviews. Many of the numerical variables didn't clearly demonstrate a linear relationship with price, suggesting that the intricacies of the data could be beyond the scope of a basic linear model. Given the absence of obvious linear trends, non-linear modeling or the addition of interaction terms could be required to depict the relationship between these variables and price more accurately.
There was a positive correlation between higher prices and listings in Manhattan (0.21). This implied that Manhattan listings were more expensive, probably as a result of the borough's popularity and strategic location. In contrast, price was negatively correlated with listings in Brooklyn (-0.11) and Queens (-0.11), suggesting that these boroughs typically had lower prices than Manhattan.
Price and entire homes/apartments had a positive correlation (0.17), indicating that the price of these listings was usually higher than that of shared accommodations. A negative correlation (-0.17) indicated that private rooms were typically less expensive than whole houses or apartments.
Price and longitude had a negative correlation (-0.16), which could suggest that listings in Brooklyn and Queens, which are further east, were typically less expensive. The positive correlation between latitude and price was less strong (0.04), indicating a potential trend toward slightly higher listing prices for listings further north.
The neighbourhood group encodings exhibited significant negative correlations with each other. For example, the correlation between neighbourhood group Manhattan and neighbourhood group Brooklyn was -0.67, and the correlation between neighbourhood group Manhattan and neighbourhood group Queens was -0.37. Given that these were mutually exclusive categories, this was expected.
Correlations among the room type encodings were also strongly negative. For example, there was a negative correlation (-0.97) between room_type_Private room and room_type_Entire home/apt. As with the neighbourhood groups, this was expected because the room types are mutually exclusive categories.
There was a weak negative correlation (-0.14) between number_of_reviews and calculated_host_listings_count. This could indicate that hosts with fewer properties tended to have listings that had received more reviews, either because they had been active on the platform longer or because they concentrated more on a single listing.
# Check correlation
rcore <- rcorr(as.matrix(encoded_df %>% dplyr::select(where(is.numeric))))
# Take correlation coeff
coeff <- rcore$r
# Build corr plot
corrplot(coeff, tl.cex = 0.5, tl.col="black", method = 'color', addCoef.col = "black",
type="upper", order="hclust", number.cex=0.5,
diag=FALSE)
tst <- encoded_df %>% dplyr::select(where(is.numeric))
kable(cor(drop_na(tst))[,3], "html", escape = F, col.names = c('Coefficient')) %>%
kable_styling("striped", full_width = F) %>%
column_spec(1, bold = T)
Variable | Coefficient |
---|---|
latitude | 0.0425704 |
longitude | -0.1643373 |
price | 1.0000000 |
minimum_nights | -0.0941773 |
number_of_reviews | -0.0328301 |
last_review | -0.0863907 |
reviews_per_month | -0.0185994 |
calculated_host_listings_count | 0.0393215 |
availability_365 | 0.0894067 |
number_of_reviews_ltm | -0.0168238 |
neighbourhood_group_bronx | -0.0527385 |
neighbourhood_group_brooklyn | -0.1143825 |
neighbourhood_group_manhattan | 0.2121102 |
neighbourhood_group_queens | -0.1055112 |
room_type_entire_home_apt | 0.1709745 |
room_type_hotel_room | 0.0643294 |
room_type_private_room | -0.1727415 |
availability_ratio | 0.0894067 |
box_price | 0.6354842 |
Finally, we split our data into train (75%) and test (25%) datasets to evaluate model performance before we proceeded to prediction. The train data contained 29180 records, test data 9562.
# random seed
set.seed(42)
# 75/25 split of the data set
sample <- sample.split(encoded_df$price, SplitRatio = 0.75)
train_data <- subset(encoded_df, sample == TRUE)
test_data <- subset(encoded_df, sample == FALSE)
# Check dimensions of train and test data
dim(train_data)
## [1] 29180 22
dim(test_data)
## [1] 9562 22
As the outliers showed, there were extremely high prices per night ($30,000) in contrast to the mean ($216). As a result, the price distribution was right-skewed, that is, its skewness was positive. To lessen the skewness, the Box-Cox transformation was applied after splitting the data into training and test sets, with lambda estimated on the training data only, to avoid data leakage. A log(x + 1) transformation was used for minimum_nights; adding 1 guards against taking the logarithm of zero, which is undefined. We also applied the Box-Cox transformation to number_of_reviews, reviews_per_month, calculated_host_listings_count, and number_of_reviews_ltm because of their right-skewness, to stabilize their variance and bring them closer to normality. We added 1 inside boxcox() (except for calculated_host_listings_count) to ensure all values were positive, which is necessary for the Box-Cox transformation. Linear modeling methods like Lasso Regression and Linear Regression tend to perform better when their assumptions are approximately met, so we normalized the distributions of these important variables to increase the precision and dependability of our models.
train_transformed <- train_data
test_transformed <- test_data
# Log transformation for minimum_nights
cols_transform <- c("minimum_nights")
for (i in cols_transform) {
train_transformed[[i]]<- log(train_transformed[[i]]+1)
test_transformed[[i]]<- log(test_transformed[[i]]+1)
}
# Boxcox transformation for "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "number_of_reviews_ltm", "price"
b_boxcox <- boxcox(lm((train_transformed$number_of_reviews+1) ~ 1))
b_lambda <- b_boxcox$x[which.max(b_boxcox$y)]
b_trans <- BoxCox(train_transformed$number_of_reviews+1, b_lambda)
train_transformed$number_of_reviews <- b_trans
b_trans <- BoxCox(test_transformed$number_of_reviews+1, b_lambda)
test_transformed$number_of_reviews <- b_trans
c_boxcox <- boxcox(lm((train_transformed$reviews_per_month+1) ~ 1))
c_lambda <- c_boxcox$x[which.max(c_boxcox$y)]
c_trans <- BoxCox(train_transformed$reviews_per_month+1, c_lambda)
train_transformed$reviews_per_month <- c_trans
c_trans <- BoxCox(test_transformed$reviews_per_month+1, c_lambda)
test_transformed$reviews_per_month <- c_trans
d_boxcox <- boxcox(lm((train_transformed$calculated_host_listings_count) ~ 1))
d_lambda <- d_boxcox$x[which.max(d_boxcox$y)]
d_trans <- BoxCox(train_transformed$calculated_host_listings_count, d_lambda)
train_transformed$calculated_host_listings_count <- d_trans
d_trans <- BoxCox(test_transformed$calculated_host_listings_count, d_lambda)
test_transformed$calculated_host_listings_count <- d_trans
e_boxcox <- boxcox(lm((train_transformed$number_of_reviews_ltm+1) ~ 1))
e_lambda <- e_boxcox$x[which.max(e_boxcox$y)]
e_trans <- BoxCox(train_transformed$number_of_reviews_ltm+1, e_lambda)
train_transformed$number_of_reviews_ltm <- e_trans
e_trans <- BoxCox(test_transformed$number_of_reviews_ltm+1, e_lambda)
test_transformed$number_of_reviews_ltm <- e_trans
aa_boxcox <- boxcox(lm((train_transformed$price+1) ~ 1))
aa_lambda <- aa_boxcox$x[which.max(aa_boxcox$y)]
aa_trans <- BoxCox(train_transformed$price+1, aa_lambda)
train_transformed$price <- aa_trans
aa_trans <- BoxCox(test_transformed$price+1, aa_lambda)
test_transformed$price <- aa_trans
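The pattern repeated above (estimate lambda on the training column, then reuse it for the test column) can be wrapped in a small helper. The following is a minimal base-R sketch under our own assumptions: `box_cox()`, `box_cox_loglik()` and `fit_lambda()` are hypothetical names, the data are simulated, and lambda is found with `optimize()` instead of the grid search that MASS::boxcox() performs, so the values will not match the report's exactly.

```r
# Box-Cox transform for strictly positive x (lambda = 0 is the log case)
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for an intercept-only model
box_cox_loglik <- function(lambda, x) {
  z <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(sum((z - mean(z))^2) / n) + (lambda - 1) * sum(log(x))
}

# Estimate lambda from the training values only (hypothetical helper)
fit_lambda <- function(x_train, interval = c(-2, 2)) {
  optimize(box_cox_loglik, interval = interval, x = x_train, maximum = TRUE)$maximum
}

set.seed(42)
toy_train <- rlnorm(5000, meanlog = 5, sdlog = 1)  # simulated right-skewed values
toy_test  <- rlnorm(1000, meanlog = 5, sdlog = 1)

lambda_hat <- fit_lambda(toy_train)
train_bc <- box_cox(toy_train, lambda_hat)
test_bc  <- box_cox(toy_test, lambda_hat)  # reuse the training lambda: no leakage
```

Because the simulated values are log-normal, the estimated lambda should land close to zero, which corresponds to a plain log transformation.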
For linear regression models, these transformations aided in stabilizing the variance and improving the symmetry of the distributions. The data was better suited for predictive modeling after transformations.
# Choose numeric variables
numeric_vars <-c("price", "minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "number_of_reviews_ltm", "availability_ratio")
# List to store plots
plots <- list()
# Generate histograms for each variable
for (i in 1:length(numeric_vars)) {
p <- ggplot(train_transformed, aes_string(x = numeric_vars[i])) +
geom_histogram(aes(y=..density..), bins = 30, fill = "lightgreen", color = "black", alpha = 0.7) +
geom_density(alpha = 0.2, fill = "#FF6666") +
ggtitle(paste0('Distribution of ', numeric_vars[i])) +
theme_minimal()
plots[[i]] <- p
}
# Plot in a grid with 2 columns
grid.arrange(grobs = plots, ncol = 2)
# Create the distribution plot for `price`
p1 <- ggplot(test_transformed, aes(x=price)) +
geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") +
ggtitle("Distribution of Price")
# Create a Q-Q plot for price
p3 <- ggplot() +
stat_qq(aes(sample = price), data = test_transformed) +
stat_qq_line(aes(sample = price), data = test_transformed) +
ggtitle("Q-Q Plot of Price")
# Arrange the two plots side by side
grid.arrange(p1, p3, ncol = 2)
For the first model, Linear Regression, we dropped:
- neighbourhood_group and room_type, as we encoded them and used the encoded columns instead of these originals.
- availability_365 and box_price, as we created the new feature availability_ratio and transformed price with Box-Cox after splitting the data into train and test sets, so we didn't need the previous column box_price.
- number_of_reviews and reviews_per_month, as they were highly correlated; we kept only the number of reviews for the last 12 months.
- neighbourhood, in order to avoid high dimensionality.
# remove some features from the model
train_model_1 <- train_transformed %>%
dplyr::select(-c(neighbourhood, neighbourhood_group, room_type, availability_365, number_of_reviews, reviews_per_month, box_price))
test_model_1 <- test_transformed %>%
dplyr::select(-c(neighbourhood, neighbourhood_group, room_type, availability_365, number_of_reviews, reviews_per_month, box_price))
We first configured five-fold cross-validation. The training data was divided into five parts, and the model was trained and validated five times, each time using a different part as the validation set and the remaining parts as the training set. This gave a more reliable estimate of the model's performance. Afterwards, the Linear Regression model was trained.
set.seed(42)
# setup cross-validation
ctrl <- trainControl(method = "cv", number = 5)
# fit a regression model and use k-fold CV to evaluate performance
lm_model <- train(price ~ ., data = train_model_1, method = "lm", trControl = ctrl)
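To make the resampling scheme concrete, the five-fold procedure can be sketched by hand in base R. This is only an illustration under our own assumptions: it uses the built-in mtcars data and an arbitrary formula as stand-ins for the listings, and it is not what caret does internally line for line.

```r
set.seed(42)
k <- 5
toy <- mtcars  # built-in data standing in for the training set

# Assign every row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(toy)))

fold_rmse <- sapply(seq_len(k), function(f) {
  fit  <- lm(mpg ~ wt + hp, data = toy[folds != f, ])  # train on the other k - 1 folds
  pred <- predict(fit, newdata = toy[folds == f, ])    # predict the held-out fold
  sqrt(mean((toy$mpg[folds == f] - pred)^2))           # RMSE on the held-out fold
})

cv_rmse <- mean(fold_rmse)  # average of the per-fold RMSE values
```

Each observation is used for validation exactly once, and the reported cross-validated RMSE is the average over the five held-out folds.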
How well the model fit the data was shown in the summary of residuals. The residuals were between -3.3851 and 1.1955. A somewhat symmetric residual distribution around zero was suggested by a median near -0.0187, which is generally a positive indicator.
The F-statistic was 1,966, the adjusted R-squared was 0.485, and all 14 predictors had statistically significant p-values. The model as a whole was statistically significant, as indicated by the F-statistic and its p-value of less than 2.2e-16. The adjusted R-squared indicated that only about 48.5% of the variance in the response variable could be explained by the predictor variables. Even though the model explained a sizable percentage of the variance in price, a sizable portion remained unaccounted for by the model.
The residual standard error was 0.1879. It provided an estimate of the residuals' standard deviation and, consequently, of the typical error in the model's price predictions on the transformed scale.
The RMSE on the test data was 0.181, indicating the typical error magnitude. A lower RMSE is preferable.
A positive coefficient indicated that the outcome variable rose along with the predictor. For example, room_type_entire_home_apt had a positive coefficient of about 0.298. This suggested that, holding all other variables constant, the transformed price of entire homes or apartments was approximately 0.298 units higher than that of the omitted room type (shared rooms). Listings in the Manhattan neighbourhood group had the largest neighbourhood coefficient (0.345) and were more expensive than those in other areas.
A negative coefficient indicated a decrease in the outcome variable with an increase in the predictor. For instance, both longitude and latitude had sizable negative coefficients, showing that price was significantly influenced by geography. A listing's precise location was important, since certain areas (usually central or popular neighborhoods) commanded higher prices.
MAPE (Mean Absolute Percentage Error) was 4.70. This showed that the model's predictions were, on average, 4.70% off from the actual values on the transformed price scale. SMAPE (Symmetric Mean Absolute Percentage Error) was 4.68. SMAPE modifies MAPE by normalizing each error by the average of the predicted and actual values. The model's SMAPE of 4.68% indicated that it was fairly accurate. MASE (Mean Absolute Scaled Error) was 0.539. In other words, the model's mean absolute error was 46.1% lower than that of a naive forecast. MPE (Mean Percentage Error), which showed how biased the predictions were, was -0.594. Because yardstick computes the percentage error as (truth - estimate) / truth, a negative value indicated that predictions were slightly above the actual values on average, that is, a slight over-forecasting tendency in the model.
These error metrics were rather low, indicating a respectable level of predictive accuracy.
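The four percentage-style metrics can be written out in a few lines of base R, which makes their sign conventions and scaling easier to see. This is a sketch with hypothetical function names and made-up toy numbers; the definitions follow the ones we understand yardstick to use with its defaults (in particular, MASE scales by the naive one-step difference of the truth values).

```r
# Hypothetical base-R versions of the metrics discussed above
mape_sketch  <- function(truth, estimate) mean(abs((truth - estimate) / truth)) * 100
smape_sketch <- function(truth, estimate) {
  mean(abs(estimate - truth) / ((abs(truth) + abs(estimate)) / 2)) * 100
}
mpe_sketch   <- function(truth, estimate) mean((truth - estimate) / truth) * 100
mase_sketch  <- function(truth, estimate) {
  mean(abs(truth - estimate)) / mean(abs(diff(truth)))  # scaled by the naive forecast error
}

truth    <- c(100, 200, 400)  # toy values, not from the Airbnb data
estimate <- c(110, 190, 400)

mape_sketch(truth, estimate)   # 5
mpe_sketch(truth, estimate)    # about -1.67: the over-prediction outweighs the under-prediction
smape_sketch(truth, estimate)
mase_sketch(truth, estimate)
```

With this sign convention, MPE is negative when the predictions are, on average, above the actual values in relative terms.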
print(lm_model)
## Linear Regression
##
## 29180 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 23344, 23344, 23344, 23344, 23344
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1879817 0.484945 0.1422375
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(lm_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3851 -0.1227 -0.0187 0.1029 1.1955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.953e+01 2.960e+00 -13.356 < 2e-16 ***
## latitude -3.733e-01 3.250e-02 -11.486 < 2e-16 ***
## longitude -7.783e-01 3.388e-02 -22.973 < 2e-16 ***
## minimum_nights -9.868e-02 1.593e-03 -61.933 < 2e-16 ***
## last_review -1.516e-05 8.902e-07 -17.030 < 2e-16 ***
## calculated_host_listings_count -3.885e-02 2.005e-03 -19.373 < 2e-16 ***
## number_of_reviews_ltm 5.637e-02 3.003e-03 18.774 < 2e-16 ***
## neighbourhood_group_bronx 2.755e-01 1.648e-02 16.722 < 2e-16 ***
## neighbourhood_group_brooklyn 2.294e-01 1.291e-02 17.772 < 2e-16 ***
## neighbourhood_group_manhattan 3.449e-01 1.327e-02 25.987 < 2e-16 ***
## neighbourhood_group_queens 2.687e-01 1.476e-02 18.205 < 2e-16 ***
## room_type_entire_home_apt 2.983e-01 9.995e-03 29.841 < 2e-16 ***
## room_type_hotel_room 2.642e-01 2.240e-02 11.796 < 2e-16 ***
## room_type_private_room 5.287e-02 1.002e-02 5.277 1.32e-07 ***
## availability_ratio 1.117e-01 3.292e-03 33.947 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1879 on 29165 degrees of freedom
## Multiple R-squared: 0.4855, Adjusted R-squared: 0.4853
## F-statistic: 1966 on 14 and 29165 DF, p-value: < 2.2e-16
# validate and calculate RMSE
lm_model.valid <- predict(lm_model, newdata = test_model_1)
lm_model.eval <- bind_cols(target = test_model_1$price, predicted=unname(lm_model.valid))
lm_model.rmse <- sqrt(mean((lm_model.eval$target - lm_model.eval$predicted)^2))
# plot targets vs predicted
lm_model.eval %>%
ggplot(aes(x = target, y = predicted)) +
geom_point(alpha = .3) +
geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
labs(title=paste('RMSE:',round(lm_model.rmse,3)))
# Calculate metrics mape, smape, mase, mpe, rmse
multi_metric <- metric_set(mape, smape, mase, mpe, yardstick::rmse)
model1_df <- lm_model.eval %>% multi_metric(truth=target, estimate=predicted)
b <- summary(lm_model)
model1_df
## # A tibble: 5 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mape standard 4.70
## 2 smape standard 4.68
## 3 mase standard 0.539
## 4 mpe standard -0.594
## 5 rmse standard 0.181
# Add results to table
results_lm_tbl <- tibble(
Model = character(),
mape = numeric(),
smape = numeric(),
mase = numeric(),
mpe = numeric(),
"RMSE" = numeric(),
"Adjusted R2" = numeric()
)
results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
Model = "Model 1: Linear Regression, cv",
mape = model1_df[[1,3]],
smape = model1_df[[2,3]],
mase = model1_df[[3,3]],
mpe = model1_df[[4,3]],
"RMSE" = model1_df[[5,3]],
"Adjusted R2" = b$adj.r.squared
))
The Residuals vs. Fitted plot tests for homoscedasticity and nonlinearity. To indicate homoscedasticity, the points should ideally be distributed randomly around the horizontal line. The residuals in the plot were distributed almost without any recognizable pattern.
The Normal Q-Q plot checks whether the residuals are normally distributed: points that follow the dashed line indicate normality. There was some deviation at the tails of the plot, suggesting possible problems with normality.
In the Scale-Location plot, the points should be spread evenly across the range of fitted values. The plot indicated potential problems with equal variance, as the residuals' spread widened as fitted values increased.
The Residuals vs. Leverage plot made it easier to spot outliers that unduly affected the model. Influential points are those located beyond the dashed Cook's distance lines or far to the right of the plot. A few of the points had high leverage and/or high Cook's distance, which made them potentially influential.
The diagnostic plots implied that there could be some violations of the homoscedasticity and residual normality assumptions of linear regression. Considering further variable transformations or using models more resilient to such problems may be beneficial. The influential points indicated by the Residuals vs. Leverage plot may call for additional investigation to determine whether they should be removed or whether there is a meaningful explanation for their status as outliers.
# Plot model's assumptions
lm_model_final <- lm_model$finalModel
par(mfrow = c(2, 2))
plot(lm_model_final)
Given the significant results of the Jarque-Bera (JB) test, it is possible that the residuals were not normally distributed. This could have an impact on the dependability of some statistical tests. According to the Breusch-Pagan test, the residuals could also exhibit heteroscedasticity, meaning that their variance was not constant across all levels of the independent variables.
# Test for residuals
JarqueBera.test(lm_model_final$residuals)
##
## Jarque Bera Test
##
## data: lm_model_final$residuals
## X-squared = 40599, df = 2, p-value < 2.2e-16
##
##
## Skewness
##
## data: lm_model_final$residuals
## statistic = 0.39057, p-value < 2.2e-16
##
##
## Kurtosis
##
## data: lm_model_final$residuals
## statistic = 8.7255, p-value < 2.2e-16
bptest(lm_model_final)
##
## studentized Breusch-Pagan test
##
## data: lm_model_final
## BP = 542.08, df = 14, p-value < 2.2e-16
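As a sanity check on the output above, the Jarque-Bera statistic can be recomputed from the reported skewness and kurtosis using the usual formula JB = n/6 * (S^2 + (K - 3)^2 / 4). The sketch below simply plugs in the rounded figures printed above, so it can only be expected to agree with the reported X-squared up to rounding.

```r
n    <- 29180    # number of training residuals
skew <- 0.39057  # skewness reported by JarqueBera.test()
kurt <- 8.7255   # kurtosis reported by JarqueBera.test()

# Jarque-Bera statistic from sample skewness and (non-excess) kurtosis
jb <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
jb  # close to the reported X-squared of 40599

# p-value from the chi-squared distribution with 2 degrees of freedom
p_value <- pchisq(jb, df = 2, lower.tail = FALSE)
```

Almost all of the statistic comes from the kurtosis term, which indicates that the departure from normality is driven mainly by heavy tails rather than by asymmetry.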
The variance inflation factor (VIF) quantified the extent to which multicollinearity with other model predictors inflated the variance of a regression coefficient. As a general rule, high multicollinearity was indicated by a VIF greater than 5 or 10.
High VIF values indicated that variables such as availability_ratio, latitude, longitude, and neighbourhood_group_manhattan could be collinear with other variables in the model. Because it could inflate the standard errors of the coefficients and reduce the reliability of the model estimates, this collinearity could be problematic. A high VIF only implied that a variable was not offering unique information when other variables were present; it did not, however, imply that the variable was unimportant.
# Plot variance inflation factors (VIF); the dashed line marks VIF = 5
vif_values <- vif(lm_model_final)
vif_values <- rownames_to_column(as.data.frame(vif_values), var = "var")
vif_values %>%
ggplot(aes(y=vif_values, x=var)) +
coord_flip() +
geom_hline(yintercept=5, linetype="dashed", color = "red") +
geom_bar(stat = 'identity', width=0.3 ,position=position_dodge())
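To show what the VIF measures, it can be computed by hand: each predictor is regressed on all the others, and the VIF is 1 / (1 - R^2) of that auxiliary regression. The sketch below uses the built-in mtcars data and an arbitrary set of predictors as stand-ins, so the numbers are unrelated to the Airbnb model.

```r
toy <- mtcars[, c("mpg", "wt", "hp", "disp")]  # stand-in data, not the listings
predictors <- c("wt", "hp", "disp")

vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # Auxiliary regression of one predictor on the remaining predictors
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

vif_manual
```

A predictor that is well explained by the others has an auxiliary R^2 close to 1 and therefore a large VIF, which is why highly collinear variables such as coordinates and borough indicators tend to stand out.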
To perform Lasso Regression, we used functions from the glmnet package. This package required the response variable to be a vector and the set of predictor variables to be of the class data.matrix. We used the transformed train and test data (with log and boxcox transformations) without choosing particular variables.
# Transform to matrix
set.seed(42)
t0 <- train_transformed %>% dplyr::select(-c(availability_365, neighbourhood_group, room_type, neighbourhood, box_price))
X <- model.matrix(price ~ ., data=t0)[,-1]
Y <- t0$price
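Next, we used the glmnet() function to fit the Lasso Regression model, specifying alpha=1. To determine what value to use for lambda, we performed k-fold cross-validation with cv.glmnet() and identified the lambda value that produced the lowest cross-validated mean squared error (MSE).
The following attribute settings were selected for the model:
- family = "gaussian" for a continuous response
- type.measure = "mse" as the cross-validation loss
- standardize = TRUE to standardize the predictors
- 10 cross-validation folds
- alpha = 1 for the Lasso penalty
lambda.min is the value of lambda that minimized the mean cross-validated error, and the coefficients were extracted at this value.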
# Fit lasso model
set.seed(42)
lasso_cv <- cv.glmnet(
x=X,y=Y, # Y is the Box-Cox transformed price
family = "gaussian",
type.measure="mse",
standardize = TRUE, # standardize
nfolds = 10,
alpha=1) # alpha=1 is lasso
# Find optimal lambda value that minimizes cross-validated MSE
best_lambda <- lasso_cv$lambda.min
best_lambda
## [1] 6.654594e-05
# produce plot of cross-validated MSE by lambda value
plot(lasso_cv)
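For context, the quantity that glmnet minimizes for a Gaussian response with alpha = 1 can be written as follows. The notation is ours: n is the number of training rows, p the number of predictors, and the predictors are standardized before the penalty is applied.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^2
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

With a lambda as small as the one selected here (about 6.65e-05), the penalty term is tiny, so the Lasso fit stays very close to the ordinary least squares solution.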
Afterwards, we analyzed the final model produced by the optimal lambda value.
#find coefficients of best model
lasso_model <- glmnet(X, Y, alpha = 1, lambda = best_lambda, standardize = TRUE)
#coef(lasso_model)
# Show table with coeff of Lasso model
as.data.frame(as.matrix(coef(lasso_model, s = "lambda.min"))) %>%
arrange(desc(s1)) %>%
nice_table(cap='Model Coefficients', cols='Est')
Variable | Est |
---|---|
neighbourhood_group_manhattan | 0.325 |
room_type_entire_home_apt | 0.294 |
room_type_hotel_room | 0.270 |
neighbourhood_group_bronx | 0.248 |
neighbourhood_group_queens | 0.246 |
neighbourhood_group_brooklyn | 0.212 |
reviews_per_month | 0.187 |
availability_ratio | 0.106 |
room_type_private_room | 0.048 |
number_of_reviews_ltm | 0.028 |
last_review | 0.000 |
number_of_reviews | -0.025 |
calculated_host_listings_count | -0.040 |
minimum_nights | -0.097 |
latitude | -0.338 |
longitude | -0.779 |
(Intercept) | -40.983 |
The analysis of residuals and other performance metrics demonstrated how well the Lasso Regression model fit the data. The residuals showed a somewhat symmetric distribution around zero, with a range of -3.395 to 1.194. A median near -0.0186 supported this symmetry, which is typically a positive indicator that the model did not consistently overestimate or underestimate the prices.
The adjusted R-squared value of 0.489 indicated that the predictor variables in the model could account for roughly 48.9% of the price variance. Although significant, this value also suggested that the model was unable to account for a sizable portion of the variability in Airbnb prices.
The RMSE value of 0.181 provided a measure of the typical deviation of the predicted values from the actual prices on the transformed scale, that is, the average magnitude of the prediction errors.
A positive relationship between the predictor and the outcome variable was indicated by positive coefficients: all else being equal, the price increased in tandem with the predictor's value. Similar to the Linear model, neighbourhood_group_manhattan, room_type_entire_home_apt, and room_type_hotel_room were key predictors with large positive coefficients.
On the other hand, a negative coefficient denoted an inverse relationship. For instance, the negative coefficient of latitude (-0.338) suggested that, holding all other variables constant, moving northward, or increasing latitude, was linked to a drop in price.
MAPE was 4.68. This showed that the model's predictions were, on average, 4.68% off from the actual values on the transformed price scale. SMAPE was 4.66. SMAPE modifies MAPE by normalizing each error by the average of the predicted and actual values. The model's SMAPE of 4.66% indicated that it was fairly accurate. MASE was 0.537. MPE was -0.589. Because yardstick computes the percentage error as (truth - estimate) / truth, a negative value indicated a slight over-forecasting tendency in the model.
Overall, even though the Lasso model offered insightful information and a high level of predictive accuracy, there was still opportunity for improvement, perhaps through the use of more sophisticated modeling techniques or the investigation of new or different predictor variables.
t1 <- test_transformed %>% dplyr::select(-c(availability_365, neighbourhood_group, room_type, neighbourhood, box_price))
X_test <- model.matrix(price ~ ., data=t1)[,-1]
y_test <- t1$price
# validate and calculate RMSE
lasso_model.valid <- predict(lasso_model, newx = X_test, s = best_lambda)
lasso_model.eval <- bind_cols(target = y_test, predicted=unname(lasso_model.valid))
lasso_model.eval$predicted <- as.numeric(lasso_model.eval$predicted)
lasso_model.rmse <- sqrt(mean((lasso_model.eval$target - lasso_model.eval$predicted)^2))
# plot targets vs predicted
lasso_model.eval %>%
ggplot(aes(x = target, y = predicted)) +
geom_point(alpha = .3) +
geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
labs(title=paste('RMSE:',round(lasso_model.rmse,3)))
# Calculate metrics mape, smape, mase, mpe, rmse
lasso_model_df <- lasso_model.eval %>% multi_metric(truth=target, estimate=predicted)
lasso_model_df
## # A tibble: 5 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mape standard 4.68
## 2 smape standard 4.66
## 3 mase standard 0.537
## 4 mpe standard -0.589
## 5 rmse standard 0.181
# R-squared and Adjusted R-squared
predictions_train <- predict(lasso_model, newx = X, s = best_lambda)
SSE <- sum((predictions_train - Y)^2)
SST <- sum((Y - mean(Y))^2)
r_squared_lasso <- 1 - SSE / SST
n <- length(Y)
p <- ncol(X)
adj_r_squared_lasso <- 1 - (1 - r_squared_lasso) * (n - 1) / (n - p - 1)
adj_r_squared_lasso
## [1] 0.4886678
# Predictions and residuals
residuals_lasso <- Y - predictions_train
# Calculate statistics for residuals
min(residuals_lasso)
## [1] -3.394797
max(residuals_lasso)
## [1] 1.193608
median(residuals_lasso)
## [1] -0.01862024
mean(residuals_lasso)
## [1] -6.953138e-14
sd(residuals_lasso)
## [1] 0.1872375
# Add results to table
results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
Model = "Model 2: Lasso",
mape = lasso_model_df[[1,3]],
smape = lasso_model_df[[2,3]],
mase = lasso_model_df[[3,3]],
mpe = lasso_model_df[[4,3]],
"RMSE" = lasso_model_df[[5,3]],
"Adjusted R2" = adj_r_squared_lasso
))
Several diagnostic plots were used to evaluate the Lasso Regression model in order to identify potential problems and check important assumptions. The Residual vs. Fitted plot is essential for checking linearity and homoscedasticity: the residuals should ideally be dispersed randomly around a horizontal line. The residuals of the Lasso model showed no discernible pattern and seemed to be randomly scattered, indicating a reasonable level of homoscedasticity. The lack of a discernible pattern also suggested that nonlinearity might not be a major problem.
The majority of the residuals in the Lasso model’s Q-Q plot fell along a straight line, indicating that the residuals were roughly normally distributed. A few deviations exist, especially in the tails, but these are typical of real-world data.
Possible problems with equal variance were highlighted by the Scale-Location plot, which aids in assessing the distribution of residuals across the range of fitted values. Heteroscedasticity, or the inconsistency of residual variance across the range of predicted values, may be indicated by a discernible spread in the residuals as the fitted values increased.
Taken together, the diagnostic plots suggested possible violations of the residual normality and homoscedasticity assumptions. In light of these results, investigating further variable transformations or considering models more robust to these kinds of problems might be helpful. Influential points also call for additional investigation to determine whether they should be removed or whether there is a compelling reason why they are outliers.
The reliability of some statistical inferences made from the model may be impacted by the residuals’ potential non-normal distribution, according to the significant findings of the Jarque-Bera (JB) test.
# Plot for Linearity and Homoscedasticity check
plot(predictions_train, residuals_lasso, xlab = "Predicted", ylab = "Residuals", main = "Residual vs. Fitted - Lasso")
abline(h = 0, col = "red")
par(mfrow = c(2, 2))
qqnorm(residuals_lasso, main = 'Normal Q-Q')
qqline(residuals_lasso) # reference line for normally distributed residuals
plot(predictions_train, sqrt(abs(residuals_lasso)), xlab = 'Predicted', ylab = 'Sqrt(|Residuals|)', main = 'Scale-Location')
Given the significant results of the Jarque-Bera (JB) test, it is possible that the residuals were not normally distributed. This could have an impact on the dependability of some statistical tests. As with the Linear Regression model, whose Breusch-Pagan test was significant, the residuals could also exhibit heteroscedasticity, meaning that their variance was not constant across all levels of the independent variables.
JarqueBera.test(residuals_lasso)
##
## Jarque Bera Test
##
## data: residuals_lasso
## X-squared = 42159, df = 2, p-value < 2.2e-16
##
##
## Skewness
##
## data: residuals_lasso
## statistic = 0.38017, p-value < 2.2e-16
##
##
## Kurtosis
##
## data: residuals_lasso
## statistic = 8.8392, p-value < 2.2e-16
Multicollinearity between the model's predictors was assessed using the Variance Inflation Factor (VIF). Significant multicollinearity is indicated by high VIF values, usually greater than 5 or 10. High VIF values were found for variables like availability_ratio, latitude, longitude, and neighbourhood_group_manhattan, indicating possible collinearity with other predictors. This collinearity could inflate the standard errors of the coefficients and so jeopardize the reliability of the model's estimates. It's crucial to remember that a high VIF does not imply that a variable is unimportant; rather, it suggests that the variable does not offer unique information when combined with the other variables.
# Variable importance for the Lasso fit at lambda.min (top 20 features)
vip(lasso_model, num_features = 20, geom = "col", include_type = TRUE, lambda = "lambda.min")
# Non-zero Lasso coefficients at lambda.min, sorted by value
coeffs.table <- coeff2dt(fitobject = lasso_model, s = "lambda.min")
coeffs.table %>%
  mutate(name = fct_reorder(name, desc(coefficient))) %>%
  ggplot() +
  geom_col(aes(y = name, x = coefficient, fill = coefficient > 0)) +
  xlab(label = "") +
  ggtitle(expression(paste("Lasso Coefficients with ", lambda, " = 0.0275"))) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), legend.position = "none")
In order to concentrate on other factors for price prediction, both the Linear Regression and Lasso models disregarded availability_365, neighbourhood_group, room_type, and neighbourhood characteristics.
The residuals of both the Linear Regression and Lasso models were distributed around zero, indicating no significant bias in predictions. Multicollinearity in the Linear Regression model may inflate the variance of its coefficient estimates and make them less trustworthy to interpret, whereas Lasso Regression helps to address multicollinearity by penalizing the coefficients of correlated predictors.
Both models had an RMSE of 0.181, indicating comparable levels of price prediction accuracy. MAPE and SMAPE were also nearly identical (MAPE of 4.70 for Linear Regression and 4.68 for Lasso Regression), with Lasso Regression slightly ahead.
The error distribution and bias of both models exhibited similar tendencies, with very close MASE and MPE values (0.539 and -0.594 for Linear Regression, 0.537 and -0.589 for Lasso Regression), again marginally favoring Lasso Regression. Both models’ slightly negative MPEs suggest a propensity to underpredict prices.
The remarkably close Adjusted R-squared values (48.9% for Lasso and 48.5% for Linear Regression) indicated that the proportion of variance in the data explained by each model was similar. The similar distribution of the residuals for both models suggested that they were equally successful in terms of prediction bias and error variance.
The Lasso Regression model performed somewhat better than the Linear Regression model and was selected as the final model. This choice was based on its slightly superior performance metrics and its ability to handle a large number of features through automatic feature selection. The violation of key model assumptions, however, raised concerns about the reliability of the predictions and interpretations. Further refinement or exploration of different modeling techniques might be needed for better results.
Future research could investigate more flexible models such as neural networks, Random Forest, or Gradient Boosting Machines to enhance prediction accuracy and better capture non-linearity. Time-series analysis and a closer look at how prices evolve over time may shed light on seasonal patterns and price swings. Incorporating more precise geospatial data may make it possible to identify regional trends and provide more neighborhood-specific insights.
results_lm_tbl %>%
nice_table(cap='Model Comparison') %>%
scroll_box(width='100%')
| Model | MAPE | SMAPE | MASE | MPE | RMSE | Adjusted R2 |
|---|---|---|---|---|---|---|
| Model 1: Linear Regression, cv | 4.699 | 4.681 | 0.539 | -0.594 | 0.181 | 0.485 |
| Model 2: Lasso | 4.678 | 4.660 | 0.537 | -0.589 | 0.181 | 0.489 |
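For readers unfamiliar with the error metrics in the comparison table, the sketch below writes them out in base R. The formulas follow common conventions and are an assumption about how the reported values were computed; the function names and toy numbers are hypothetical. The sign of MPE depends on the direction of subtraction: the sketch uses actual minus predicted, under which a negative value means predictions sit above the actual values on average, and the opposite reading applies if an implementation subtracts the other way.

```r
# Error metrics written out in base R. These formulas follow common
# conventions and are an assumption about how the reported values were computed.
rmse  <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape  <- function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100
smape <- function(actual, predicted) {
  mean(abs(actual - predicted) / ((abs(actual) + abs(predicted)) / 2)) * 100
}
# MPE keeps the sign: with (actual - predicted), a negative value means
# that predictions exceed the actual values on average
mpe   <- function(actual, predicted) mean((actual - predicted) / actual) * 100
# MASE: MAE relative to the MAE of a naive lag-1 forecast of the actual series
mase  <- function(actual, predicted) {
  mean(abs(actual - predicted)) / mean(abs(diff(actual)))
}

# Toy values, purely illustrative
actual    <- c(100, 120, 80, 150)
predicted <- c(110, 115, 90, 140)

metrics <- c(
  rmse  = rmse(actual, predicted),
  mape  = mape(actual, predicted),
  smape = smape(actual, predicted),
  mase  = mase(actual, predicted),
  mpe   = mpe(actual, predicted)
)
print(round(metrics, 3))
```

A MASE below 1 means the model's mean absolute error is smaller than that of the naive benchmark, which is how the reported values of about 0.54 can be read.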
Although both models performed reasonably well, choosing between them depended on specific requirements and data characteristics. Linear Regression is appropriate when interpretability is important and the dataset is well understood and has little multicollinearity. Lasso Regression is better suited to datasets with a high number of features, particularly if some of those features may not be very important, because it selects features automatically and helps prevent overfitting. We therefore chose the Lasso Regression model for its feature selection capabilities; it proved to be an effective tool in handling multicollinearity and in streamlining the model by eliminating irrelevant predictors.
The solution provides a solid starting point for understanding and predicting Airbnb prices in New York City. While the models demonstrate moderate predictive accuracy, there is room for improvement, particularly in addressing the assumptions of linear models.
With a better understanding of the major variables influencing rental prices, hosts can use the model’s insights to optimize their pricing strategy and possibly boost income and occupancy rates. Renters can improve their decision-making by using the model’s insights to identify competitive pricing and learn what factors might result in higher rental costs. Airbnb can use these insights for market analysis, to pinpoint areas with high demand, and to provide hosts with customized pricing recommendations. The results can help investors understand the dynamics of the rental market in various neighborhoods and pinpoint profitable areas to invest in. Policymakers can use them to better understand how Airbnb rentals affect the local housing market, which will help with regulation and urban planning.
Nwanganga, F., & Chapple, M. (2020). Practical Machine Learning in R. https://doi.org/10.1002/9781119591542
Get the Data. (n.d.). http://insideairbnb.com/get-the-data
Faraway, J. J. (2014). Linear Models with R, Second Edition. CRC Press.
Sheather, S. (2009). A Modern Approach to Regression with R. Springer Science & Business Media.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning: With Applications in R. Springer.
Faraway, J. J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press.
This essay outlines the journey from data preparation to the selection of predictive models for Airbnb listings in New York City, based on a dataset with diverse variables influencing rental prices. Source: http://insideairbnb.com/get-the-data
The initial dataset included 38,792 records, each representing an Airbnb listing with 18 different attributes. Key variables included geographical coordinates, room type, and various features related to the host and the property. The target variable for our analysis was price, representing the cost per night of a listing.
The first step in our analysis involved extensive data cleaning and preparation. We transformed categorical variables like neighbourhood_group, neighbourhood, and room_type into factor variables and converted last_review dates into a numerical format by calculating the number of days from the earliest review. Missing values in reviews_per_month were imputed with zeros, and columns not critical to our analysis, such as id, host_name, and license, were dropped. To limit dimensionality and multicollinearity, we applied one-hot encoding to neighbourhood_group and room_type and eliminated the redundant categories. We also created a new feature, availability_ratio, to provide a clearer understanding of each listing’s availability throughout the year.
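The preparation steps described above can be sketched in base R as follows. This is a toy illustration, not the report's actual code: the data values are made up, the helper column `last_review_days` is a hypothetical name, and the definition of availability_ratio as availability_365 / 365 is an assumption.

```r
# Toy stand-in for the listings data; column names follow the text,
# values are made up for illustration
listings <- data.frame(
  id                  = 1:4,
  neighbourhood_group = c("Manhattan", "Brooklyn", "Queens", "Manhattan"),
  room_type           = c("Entire home/apt", "Private room", "Private room", "Hotel room"),
  last_review         = as.Date(c("2023-01-10", "2022-06-01", NA, "2023-03-15")),
  reviews_per_month   = c(1.2, 0.4, NA, 3.1),
  availability_365    = c(365, 120, 0, 73),
  stringsAsFactors    = FALSE
)

# Categorical columns become factors
listings$neighbourhood_group <- factor(listings$neighbourhood_group)
listings$room_type           <- factor(listings$room_type)

# Days elapsed since the earliest review in the data
listings$last_review_days <- as.numeric(
  listings$last_review - min(listings$last_review, na.rm = TRUE)
)

# Listings without reviews get zero reviews per month
listings$reviews_per_month[is.na(listings$reviews_per_month)] <- 0

# Assumed definition: share of the year the listing is available
listings$availability_ratio <- listings$availability_365 / 365

# One-hot encode; model.matrix drops one reference level per factor,
# which avoids perfectly collinear dummy columns
dummies <- model.matrix(~ neighbourhood_group + room_type, data = listings)[, -1]

prepared <- cbind(
  listings[, c("last_review_days", "reviews_per_month", "availability_ratio")],
  dummies
)
print(prepared)
```

Dropping one reference level per factor is what removes the redundant category: with all levels kept, the dummy columns of a factor would sum to one and be perfectly collinear with the intercept.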
Through exploratory data analysis, we observed significant variations in listing prices and minimum night requirements. These insights were crucial in understanding the market dynamics and in guiding our modeling decisions. Analysis of latitude and longitude revealed clustering of listings, indicating popular areas within the city. This spatial distribution could be crucial in understanding price variations. Different room types and their prevalence across various neighborhoods were examined. The distribution highlighted trends in preferred lodging types, potentially influencing pricing strategies. Exploration of the price variable showed a wide range of values, with certain outliers indicating extremely high or low prices. This necessitated further scrutiny and potential transformation for more accurate modeling. The newly created availability_ratio indicated varied patterns of listing availability, which could correlate with pricing strategies employed by hosts. A thorough examination of correlations between variables helped in identifying multicollinearity issues, especially between geographical coordinates and certain neighborhood groups.
We split the data into training (75%) and testing (25%) sets, ensuring a fair evaluation of our models. The training set included 29,180 records, while the test set comprised 9,562 records with 22 columns in each. Given the skewed distribution of several variables like price, minimum_nights, and reviews_per_month, we applied BoxCox transformations to stabilize variance and improve normality. The minimum_nights column underwent a log(x + 1) transformation so that values at or near zero remain well defined. These transformations were needed for linear modeling methods like Lasso Regression and Linear Regression to work well. By bringing the distributions of these important variables closer to normal, we aimed to increase the precision and dependability of our models and to help them meet their assumptions.
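The split and transformations can be illustrated with the following base-R sketch on simulated data. It is a sketch under stated assumptions rather than the report's pipeline: the helpers `box_cox` and `boxcox_loglik` are hypothetical, and choosing lambda by maximising the profile log-likelihood over a grid is only one common way to pick the Box-Cox parameter.

```r
set.seed(123)
# Simulated right-skewed prices and minimum-night values (illustrative only)
n <- 1000
toy <- data.frame(
  price          = rlnorm(n, meanlog = 5, sdlog = 0.7),
  minimum_nights = rpois(n, lambda = 3)
)

# 75/25 random split into training and test rows
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Box-Cox transform for a given lambda (y must be strictly positive)
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Choose lambda on the training data only by maximising the Box-Cox
# profile log-likelihood over a grid, then reuse it on the test set
boxcox_loglik <- function(lambda, y) {
  z <- box_cox(y, lambda)
  -length(y) / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}
grid       <- seq(-2, 2, by = 0.05)
lambda_hat <- grid[which.max(sapply(grid, boxcox_loglik, y = train$price))]

train$price_bc <- box_cox(train$price, lambda_hat)
test$price_bc  <- box_cox(test$price,  lambda_hat)

# log1p(x) = log(1 + x) stays finite when minimum_nights is zero
train$min_nights_log <- log1p(train$minimum_nights)
test$min_nights_log  <- log1p(test$minimum_nights)

print(lambda_hat)
```

Estimating lambda on the training rows and reusing it on the test rows keeps information from the test set out of the preprocessing step.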
Linear Regression: Our first model was a Linear Regression, tailored to capture the linear relationships between the predictors and the target variable. We excluded features that were highly correlated or caused high dimensionality (neighbourhood_group, room_type, availability_365, number_of_reviews, reviews_per_month). The model underwent 5-fold cross-validation to assess its generalizability. The final model had an adjusted R-squared of 48.5%, indicating a moderate fit to the data, and was statistically significant, with an F-statistic of 1,966 and a p-value of less than 2.2e-16. RMSE was 0.181, which measures the typical error magnitude on the transformed price scale. The features that most increased price were room_type_entire_home_apt and the Manhattan neighbourhood group; those that most decreased it were longitude and latitude. MAPE (Mean Absolute Percentage Error) was 4.70, meaning that the model’s predictions were, on average, about 4.7% off from the actual values. SMAPE (Symmetric Mean Absolute Percentage Error) was 4.68; SMAPE modifies MAPE by scaling each error by the average of the predicted and actual values. MASE (Mean Absolute Scaled Error) was 0.539; in other words, the model’s mean absolute error was 46.1% lower than that of a naive forecast. MPE (Mean Percentage Error), which shows how biased the predictions were, was -0.594; the negative value indicated a slight under-forecasting tendency in the model. These error metrics were fairly low, indicating a respectable level of predictive accuracy. However, diagnostic plots (Q-Q plot, residuals vs. fitted plot) and tests (Jarque-Bera, Breusch-Pagan) suggested possible violations of the homoscedasticity and normality assumptions, pointing to the need for further refinement. The variance inflation factor (VIF) was used to quantify the extent to which multicollinearity with other model predictors inflated the variance of each regression coefficient.
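The 5-fold cross-validation mentioned above can be sketched by hand in base R. The example uses simulated data and hypothetical names (`df`, `folds`, `fold_rmse`); it illustrates the general procedure, not the exact resampling setup used for the reported model.

```r
set.seed(2024)
# Simulated regression data standing in for the prepared training set
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 0.8 * df$x1 - 0.5 * df$x2 + rnorm(n, sd = 0.5)

k     <- 5
# Randomly assign each row to one of k folds of equal size
folds <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x1 + x2, data = df[folds != i, ])   # train on k - 1 folds
  pred <- predict(fit, newdata = df[folds == i, ])    # predict the held-out fold
  sqrt(mean((df$y[folds == i] - pred)^2))
})

cv_rmse <- mean(fold_rmse)
print(round(fold_rmse, 3))
print(round(cv_rmse, 3))
```

Each row is held out exactly once, so the averaged RMSE estimates how the model would perform on data it was not fitted to.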
Lasso Regression: Next, we implemented a Lasso Regression model, which is particularly effective in feature selection and in handling multicollinearity. The model was tuned using 10-fold cross-validation to identify the optimal lambda value for regularization. The Lasso model had an adjusted R-squared of 48.9%, slightly outperforming the Linear Regression model. Its RMSE of 0.181 measures the typical deviation of the predicted values from the actual prices. The most important features were similar to those of the Linear Regression model: neighbourhood_group_manhattan, room_type_entire_home_apt, and room_type_hotel_room were key predictors with large positive coefficients. The latitude variable had a negative coefficient, which suggests that, with all other variables held constant, moving northward (increasing latitude) is associated with a drop in price. MAPE was 4.68, meaning that the model’s predictions were, on average, about 4.7% off from the actual values. SMAPE, which scales each error by the average of the predicted and actual values, was 4.66. MASE was 0.537. MPE was -0.589; the negative value indicated a slight under-forecasting tendency in the model. Like the Linear Regression, however, the Lasso model displayed potential issues with residual assumptions, underscoring the need for careful interpretation of results.
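To illustrate why the L1 penalty performs feature selection, here is a minimal coordinate-descent sketch in base R on simulated data. It is not the report's fitting code: the function names `soft_threshold` and `lasso_cd`, the simulated predictors, and the objective scaling, (1/(2n)) · RSS + lambda · sum(|beta|), are assumptions made for illustration.

```r
# Soft-thresholding: shrink towards zero and set small values exactly to zero
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Minimise (1 / (2n)) * RSS + lambda * sum(|beta|) by cyclic coordinate descent.
# X is assumed centred and scaled; y is assumed centred (no intercept).
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n    <- nrow(X)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Residual with predictor j's current contribution left out
      partial_resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      rho     <- sum(X[, j] * partial_resid) / n
      beta[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # correlated with x1
x3 <- rnorm(n)
X  <- scale(cbind(x1, x2, x3))
y  <- 2 * x1 + 0.5 * x3 + rnorm(n)
y  <- y - mean(y)

# Penalty level at which all coefficients are shrunk to zero
lambda_max <- max(abs(crossprod(X, y))) / n

# Coefficients along a small grid of penalties, from no penalty upwards
for (lambda in c(0, 0.1, 0.5, 1) * lambda_max) {
  print(round(lasso_cd(X, y, lambda), 3))
}
```

With no penalty the procedure reproduces the least-squares coefficients, and as lambda grows more coefficients are set exactly to zero. Cross-validation is then a way of choosing a point along this path.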