Homework 4

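#' nice_table
#'
#' Render a data frame as a styled kable table.
#'
#' @param df   Data frame to display.
#' @param cap  Optional table caption.
#' @param cols Optional column names; defaults to colnames(df).
#' @param dig  Number of digits to round to.
#' @param fw   If TRUE, the table spans the full page width.
nice_table <- function(df, cap = NULL, cols = NULL, dig = 3, fw = FALSE) {
  # Use a descriptive name rather than `c`, which masks base::c
  col_names <- if (is.null(cols)) colnames(df) else cols
  df %>%
    kable(caption = cap, col.names = col_names, digits = dig) %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      html_font = "monospace",
      full_width = fw)
}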

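#' coeff2dt
#'
#' Extract the non-zero coefficients of a fitted model as a data frame.
#'
#' @param fitobject Fitted model whose coef() method returns a sparse
#'   column matrix (for example, a glmnet fit).
#' @param s Penalty value (lambda) at which to extract the coefficients.
#'
#' @return Data frame with columns name and coefficient, sorted by
#'   coefficient in decreasing order.
coeff2dt <- function(fitobject, s) {
  coeffs <- coef(fitobject, s)
  # coeffs@i holds the zero-based row indices of the non-zero entries
  coeffs.dt <- data.frame(name = coeffs@Dimnames[[1]][coeffs@i + 1], coefficient = coeffs@x)

  # Reorder the variables by coefficient value, largest first
  coeffs.dt[order(coeffs.dt$coefficient, decreasing = TRUE), ]
}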

Overview

The real estate market is a dynamic and intricate setting in which many factors affect property prices, and Airbnb listings, a large part of this market, are no exception. Factors such as location, amenities, and local demand influence their prices. Understanding and forecasting the price of Airbnb listings is essential both for hosts looking to set competitive rates and for visitors looking for good value. Airbnb’s growth has also affected the conventional hotel sector, drawing attention from economists and market analysts. The goal of this project was to forecast the price of Airbnb listings in New York City. Specifically, we aimed to gain insight into the key factors that affect listing prices, to develop predictive models that estimate prices from those factors, and to evaluate the models’ performance and identify opportunities for improvement.

We started by carefully reviewing the dataset to find problems like outliers, missing values, and possible predictor multicollinearity. This procedure, which was essential for guaranteeing the accuracy of our analysis, resulted in preprocessing and data cleaning, where we took care of these problems. We used Lasso Regression and Linear Regression models for our analysis after the data were prepared. These models were selected because they are well-suited to comprehending how different features affect the dependent variable, which in this case is the cost of Airbnb listings. Given its popularity for being easily understood and straightforward, Linear Regression gave us a starting point model. However, by penalizing less significant features, Lasso Regression’s feature selection capability allowed us to better understand the most significant predictors.

1. Data Preparation

# Load data from Github
data <- read.csv("https://raw.githubusercontent.com/ex-pr/DATA_622/main/HW%204/nyc_listings_2023.csv")

1.1 Summary Statistics

The dataset contained 38,792 observations of 18 variables.

The data consisted of New York City listings from October 2023. Each record described a single rental property, including its type, location, price, and review-related details. The target variable was price, the nightly price of an Airbnb listing. Specifically:

id: Unique identifier for the listing.
name: The listing’s name or description.
host_id: Unique identifier for the host.
host_name: The host’s name.
neighbourhood_group: The borough in which the listing was situated.
neighbourhood: The specific neighbourhood in which the listing was situated.
latitude, longitude: Location coordinates for the listing.
room_type: The kind of room being provided (private room, entire apartment, shared room or hotel room).
price: Cost per night for the listed property.
minimum_nights: A minimum number of nights needed to make a reservation.
number_of_reviews: Total reviews that this listing had gotten.
last_review: Date of the last review.
reviews_per_month: The average monthly number of reviews.
calculated_host_listings_count: Total number of listings that the host had.
availability_365: The number of days a year that the listing is bookable.
number_of_reviews_ltm: Number of Reviews for the last twelve months.
license: Details about the listing’s license.

Regression techniques were the focus of the algorithm selection process because the price variable in our dataset was continuous. The data preparation steps had been tailored to suit these algorithms, with an emphasis on managing missing values, scaling and normalizing the features, and potentially encoding categorical variables when necessary.

Linear Regression: This algorithm was a useful place to start. The relationship between the independent variables and the target variable was assumed to be linear. It was an excellent baseline model because it was straightforward, comprehensible, and didn’t require complicated parameter tuning.

Lasso Regression (Least Absolute Shrinkage and Selection Operator): A kind of linear regression with a penalty term. The penalty was proportional to the sum of the absolute values of the coefficients. Because this penalty shrinks the coefficients of less significant features, some of them exactly to zero, the method performs feature selection, which was helpful when we had a large number of features.

K-Fold Cross Validation: We applied it to evaluate our models’ performance. Using this method, the dataset was divided into k subsets (folds); in each of k repetitions, one fold served as the test set and the remaining folds as the training set. Every observation was therefore used for testing exactly once and for training k-1 times, which gave a more reliable estimate of out-of-sample error and helped us detect overfitting and underfitting.
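The k-fold procedure described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper name kfold_rmse, the built-in mtcars data, and the formula mpg ~ wt + hp are stand-ins and are not part of the original analysis.

```r
# Minimal k-fold cross-validation sketch in base R.
# Assumption: mtcars and mpg ~ wt + hp stand in for the listings data;
# kfold_rmse is a hypothetical helper, not a function from the report.
kfold_rmse <- function(formula, data, k = 5, seed = 42) {
  set.seed(seed)
  # Randomly assign each row to one of k folds of near-equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  rmse <- numeric(k)
  for (fold in seq_len(k)) {
    train <- data[folds != fold, , drop = FALSE]  # k-1 folds for fitting
    test  <- data[folds == fold, , drop = FALSE]  # held-out fold
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    rmse[fold] <- sqrt(mean((test[[response]] - pred)^2))
  }
  rmse
}

cv_rmse <- kfold_rmse(mpg ~ wt + hp, mtcars, k = 5)
mean(cv_rmse)  # average held-out RMSE across the 5 folds
```

Each row lands in the test set exactly once, so the mean of the per-fold errors uses every observation for evaluation without ever scoring a model on data it was fitted to.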

The nature of the target variable, price, as a quantitative measure influenced the choice of these algorithms. For modeling the relationship between the predictors and a continuous outcome, Linear and Lasso Regression were preferred for their simplicity and effectiveness. However, multicollinearity and irrelevant features could affect the performance of both models; this was where feature selection and regularization techniques came into play. K-fold cross validation helped validate model performance reliably across subsets of the dataset.
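For reference, the Lasso penalty mentioned above can be written out explicitly. The form below uses the 1/(2n) scaling of the squared-error term that the glmnet package uses; that the report's models were fitted with glmnet is an assumption based on the coeff2dt helper, and the symbols are introduced here for illustration only.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}
    \left\{
      \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
      + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
    \right\}
```

Here y_i is the (possibly transformed) price of listing i, x_i its vector of p predictors, and λ ≥ 0 the penalty strength. With λ = 0 the problem reduces to ordinary least squares; as λ grows, more coefficients are shrunk exactly to zero.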

The data source: http://insideairbnb.com/get-the-data

# Check first rows of data
DT::datatable(
      data[1:25,],
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE)

The table below provided summary statistics for the New York City listings market in 2023, highlighting variations in listing prices, booking requirements, and guest interaction.

There was a small number of missing values (0.01%) in the host_name and name columns, a sizable share (26.7%) in the last_review and reviews_per_month columns, and a very high percentage (92.42%) in the license column. These missing values had to be addressed before further analysis.

The summary gave the average price, minimum-night requirement, distribution of room types, and average availability. Listing prices ranged from 0 to 30,000 dollars, with an average of 215.95 dollars. The minimum stay ranged from 1 to 1,250 nights, with an average of 30.64 nights. The dataset included the following room types: hotel room, shared room, private room, and entire home/apt. Entire home/apt accounted for 54.96% of listings. Listings were available for 148.75 days a year on average, ranging from no availability to availability all year round. The number of reviews per listing ranged from none to many, which suggested different levels of guest interaction or different lengths of time on the platform. The average number of reviews per month was 1.08, although this varied substantially between listings.

# Check summary statistics of the data
print(dfSummary(data, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 400, footnote = NA, col.width=5, method="render")

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| id [numeric] | Mean (sd): 2.820216e+17 (3.85498e+17); min ≤ med ≤ max: 2595 ≤ 45421225 ≤ 9.927295e+17; IQR (CV): 7.208503e+17 (1.4) | 38792 distinct values | 0 (0.0%) |
| name [character] | 1. Rental unit in New York; 2. Rental unit in Brooklyn; 3. Rental unit in Brooklyn; 4. Rental unit in New York; 5. Rental unit in New York; 6. Rental unit in New York; 7. Rental unit in New York; 8. Rental unit in Brooklyn; 9. Rental unit in New York; 10. Townhouse in Queens; [ 12040 others ] | 2016 (5.2%); 1052 (2.7%); 658 (1.7%); 621 (1.6%); 539 (1.4%); 434 (1.1%); 418 (1.1%); 348 (0.9%); 327 (0.8%); 274 (0.7%); 32105 (82.8%) | 0 (0.0%) |
| host_id [integer] | Mean (sd): 155835024 (167648522); min ≤ med ≤ max: 1678 ≤ 76166434 ≤ 539598477; IQR (CV): 258965439 (1.1) | 23811 distinct values | 0 (0.0%) |
| host_name [character] | 1. Blueground; 2. Eugene; 3. RoomPicks; 4. June; 5. Michael; 6. David; 7. Urban Furnished; 8. Hiroki; 9. Shogo; 10. Momoyo; [ 8819 others ] | 602 (1.6%); 535 (1.4%); 533 (1.4%); 437 (1.1%); 325 (0.8%); 297 (0.8%); 270 (0.7%); 256 (0.7%); 238 (0.6%); 223 (0.6%); 35076 (90.4%) | 0 (0.0%) |
| neighbourhood_group [character] | 1. Bronx; 2. Brooklyn; 3. Manhattan; 4. Queens; 5. Staten Island | 1374 (3.5%); 14192 (36.6%); 16905 (43.6%); 5949 (15.3%); 372 (1.0%) | 0 (0.0%) |
| neighbourhood [character] | 1. Bedford-Stuyvesant; 2. Williamsburg; 3. Midtown; 4. Harlem; 5. Bushwick; 6. Hell's Kitchen; 7. Upper West Side; 8. Upper East Side; 9. Crown Heights; 10. East Village; [ 213 others ] | 2740 (7.1%); 2262 (5.8%); 2043 (5.3%); 1820 (4.7%); 1636 (4.2%); 1580 (4.1%); 1508 (3.9%); 1485 (3.8%); 1238 (3.2%); 1074 (2.8%); 21406 (55.2%) | 0 (0.0%) |
| latitude [numeric] | Mean (sd): 40.7 (0.1); min ≤ med ≤ max: 40.5 ≤ 40.7 ≤ 40.9; IQR (CV): 0.1 (0) | 23359 distinct values | 0 (0.0%) |
| longitude [numeric] | Mean (sd): -73.9 (0.1); min ≤ med ≤ max: -74.3 ≤ -74 ≤ -73.7; IQR (CV): 0.1 (0) | 21014 distinct values | 0 (0.0%) |
| room_type [character] | 1. Entire home/apt; 2. Hotel room; 3. Private room; 4. Shared room | 21319 (55.0%); 132 (0.3%); 16849 (43.4%); 492 (1.3%) | 0 (0.0%) |
| price [integer] | Mean (sd): 215.9 (496); min ≤ med ≤ max: 0 ≤ 135 ≤ 30000; IQR (CV): 146 (2.3) | 1184 distinct values | 0 (0.0%) |
| minimum_nights [integer] | Mean (sd): 30.6 (26.6); min ≤ med ≤ max: 1 ≤ 30 ≤ 1250; IQR (CV): 0 (0.9) | 114 distinct values | 0 (0.0%) |
| number_of_reviews [integer] | Mean (sd): 25.4 (55.9); min ≤ med ≤ max: 0 ≤ 4 ≤ 1843; IQR (CV): 24 (2.2) | 465 distinct values | 0 (0.0%) |
| last_review [character] | 1. (Empty string); 2. 2023-09-04; 3. 2023-09-17; 4. 2023-09-24; 5. 2023-09-05; 6. 2023-09-10; 7. 2023-09-16; 8. 2023-09-11; 9. 2023-08-31; 10. 2023-09-25; [ 2915 others ] | 10352 (26.7%); 654 (1.7%); 626 (1.6%); 459 (1.2%); 445 (1.1%); 403 (1.0%); 389 (1.0%); 351 (0.9%); 350 (0.9%); 339 (0.9%); 24424 (63.0%) | 0 (0.0%) |
| reviews_per_month [numeric] | Mean (sd): 1.1 (1.7); min ≤ med ≤ max: 0 ≤ 0.4 ≤ 62.8; IQR (CV): 1.4 (1.5) | 822 distinct values | 10352 (26.7%) |
| calculated_host_listings_count [integer] | Mean (sd): 38.6 (113.3); min ≤ med ≤ max: 1 ≤ 1 ≤ 602; IQR (CV): 5 (2.9) | 70 distinct values | 0 (0.0%) |
| availability_365 [integer] | Mean (sd): 148.8 (142.3); min ≤ med ≤ max: 0 ≤ 120 ≤ 365; IQR (CV): 300 (1) | 366 distinct values | 0 (0.0%) |
| number_of_reviews_ltm [integer] | Mean (sd): 6.9 (16.6); min ≤ med ≤ max: 0 ≤ 0 ≤ 814; IQR (CV): 5 (2.4) | 144 distinct values | 0 (0.0%) |
| license [character] | 1. (Empty string); 2. Exempt; 3. OSE-STRREG-0000068; 4. OSE-STRREG-0000437; 5. OSE-STRREG-0008664; 6. OSE-STRREG-0000207; 7. OSE-STRREG-0000003; 8. OSE-STRREG-0000110; 9. OSE-STRREG-0000155; 10. OSE-STRREG-0000244; [ 269 others ] | 35853 (92.4%); 2505 (6.5%); 107 (0.3%); 7 (0.0%); 6 (0.0%); 4 (0.0%); 3 (0.0%); 3 (0.0%); 3 (0.0%); 3 (0.0%); 298 (0.8%) | 0 (0.0%) |

1.2 Column change

Columns neighbourhood_group, neighbourhood, room_type were transformed to categorical variables. This change was performed to better depict the variables’ inherent categorical character. The column last_review was transformed to date format as it contained a date of the last review.

# Copy original data
imputed_df <- data

# Transform categorical variables to factors
cols <- c("neighbourhood_group", "neighbourhood", "room_type")
imputed_df[cols] <- lapply(imputed_df[cols], factor) 

# Convert 'last_review' to Date objects (empty strings become NA)
imputed_df$last_review <- as.Date(imputed_df$last_review)

1.3 Missing values, drop columns

Managing missing values was one of the critical problems to fix before building the models.

The missing values of the continuous variable reviews_per_month (10,352 rows) were imputed with 0, because they corresponded exactly to the 10,352 rows with 0 number_of_reviews. The missing values of the date column last_review occurred in the same rows, where no reviews had been given in the first place, and were imputed with the minimum date in the column.

Instead of keeping the date format, we represented last_review as the number of days since the earliest review. To achieve this, the earliest date was subtracted from every date in the column, so that the earliest date in the dataset corresponded to 0 and every other date was represented as the number of days elapsed since it.

The rest of the columns with missing values were not useful for the analysis and predictions. The host_name and name columns contained the names of the host and the listing. The license column showed whether a listing had a license (by law, specific short-term rentals require one); more than 90% of its values were missing, so it was dropped from the data. The id and host_id columns were dropped as well: although they had no missing values, they contained unique identification numbers for each listing and host, which were not useful for further analysis.

# check NAs in number_of_reviews and last_review
dim(imputed_df[imputed_df$number_of_reviews == 0,])

## [1] 10352    18

sum(is.na(imputed_df$last_review))

## [1] 10352

# Set seed for constant results
set.seed(42)

# Impute NAs in 'reviews_per_month' with 0
imputed_df <- imputed_df %>% 
              mutate(across(c('reviews_per_month'), ~replace_na(., 0)))

# remove columns id, host_id, host_name, name, license
imputed_df <- imputed_df %>% 
              dplyr::select(-c(id, host_id, host_name, name, license))

# Find the earliest date
earliest <- min(imputed_df$last_review, na.rm = TRUE)

# Replace NA values with the earliest date
imputed_df$last_review <- replace_na(imputed_df$last_review, earliest)

# Convert dates to ordinal and subtract the ordinal value of the earliest date
imputed_df$last_review <- as.integer(imputed_df$last_review - as.Date(earliest))

1.4 Encoding, rename columns

We ensured that our dataset was suitable for a broader range of algorithms that require numerical input by using one-hot encoding, which could improve the accuracy and effectiveness of our later analysis.

neighbourhood_group and room_type were recognized as categorical columns that would benefit from one-hot encoding.

Three new columns were created for room_type: Entire home/apt, Hotel room, and Private room. Each of these columns accepted a binary value, indicating whether the corresponding room type was present (1) or absent (0).

Four new columns were created for neighbourhood_group: neighbourhood_group_Bronx, neighbourhood_group_Brooklyn, neighbourhood_group_Manhattan, neighbourhood_group_Queens. Each of these columns accepted a binary value, indicating whether the corresponding neighbourhood group was present (1) or absent (0).

The last category of each original categorical variable (neighbourhood_group_Staten Island, room_type_Shared room) was eliminated during the encoding procedure to avoid multicollinearity and reduce redundancy. The original columns were dropped after exploratory analysis.

The column names were cleaned to remove spaces and convert all letters to lowercase using the clean_names() function from the janitor library.

Column neighbourhood wasn’t encoded or used in the models as it was highly granular, with numerous distinct categories, and one-hot encoding could result in a dataset that was extremely high dimensional. We assumed it was not crucial for the prediction, given that we already had geographical coordinates and neighbourhood_group included.
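The reason for dropping one category per variable can be seen on a toy example. The sketch below uses base R's model.matrix() rather than the dummy_cols() call used in this report, and the four-row data frame is made up for illustration.

```r
# Toy data: the column name mirrors the report, the rows are made up
toy <- data.frame(room_type = factor(c("Entire home/apt", "Private room",
                                       "Shared room", "Private room")))

# Full one-hot encoding: one indicator column per level, no intercept
full <- model.matrix(~ room_type - 1, data = toy)

# Every row sums to 1, so the indicators add up to the intercept column
row_sums <- rowSums(full)

# Dropping the last level removes that exact linear dependence
reduced <- full[, -ncol(full), drop = FALSE]

rank_full    <- qr(cbind(1, full))$rank     # intercept + all 3 indicators
rank_reduced <- qr(cbind(1, reduced))$rank  # intercept + 2 indicators
```

With all three indicators plus an intercept, the design matrix has four columns but only rank 3, so a linear model cannot estimate a separate coefficient for every column. After one level is dropped, the matrix has full column rank and the dropped level becomes the reference category.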

# Copy data without NAs
encoded_df <- imputed_df

# One hot encoding for 'neighbourhood_group' and 'room_type'
encoded_df <- dummy_cols(encoded_df, select_columns = c('neighbourhood_group', 'room_type')) 


# Remove last category of 'neighbourhood_group' and 'room_type'
encoded_df <- encoded_df %>%
           dplyr::select(-c('neighbourhood_group_Staten Island', 'room_type_Shared room')) %>% 
           clean_names()

#encoded_df <- dummy_cols(encoded_df, select_columns = c('neighbourhood'))

#encoded_df <- encoded_df %>%
           #dplyr::select(-c('neighbourhood_Fort Wadsworth')) %>% 
          # clean_names()

A new feature, availability_ratio, was created to facilitate interpretation: it gave the fraction of the year during which the listing was available.

# create feature availability_ratio
encoded_df$availability_ratio <- encoded_df$availability_365 / 365

1.5 Outliers

We used boxplots to detect outliers.

price: There were notable anomalies, with certain listings charging extremely high prices compared with the majority. Listings priced above $9,500 per night (just 26 observations) were removed, and the BoxCox transformation was applied to the train and test data.
minimum_nights: This column also displayed anomalies, with certain listings having extraordinarily high minimum-night requirements. As the boxplot showed, just 6 of the 38,792 listings had a minimum stay longer than 500 nights. Hosts may set such high values when they do not want to accept new guests, so these values did not reflect real requirements. We removed them because they could skew the results.
number_of_reviews, reviews_per_month: A small percentage of listings had an unusually high total number of reviews, and some received a large number of reviews each month. Such listings could, for example, represent hotels that rent many rooms under one listing, so that a single listing collects the reviews for all of its rooms. Only 9 listings had more than 700 reviews and only 3 had more than 40 reviews per month; they were not representative of typical listings and were likely to skew the modeling results, so they were removed. Removing outliers is a common approach when they make up a very small percentage of the data and are not central to the analysis. Because Linear and Lasso Regression models are sensitive to the scale and distribution of the input features, outliers can significantly affect their performance.
calculated_host_listings_count: Some hosts had far more listings than the average host, which could happen when a host owned multiple properties.
availability_ratio: Listings with extremely high availability all year round could be properties that were specifically intended for rental use.
number_of_reviews_ltm: Some listings had received a large number of reviews in just the past year, comparable to their overall number of reviews. As mentioned above, these could be hotels. Listings with more than 200 reviews in the last 12 months were removed.

If not properly addressed, these outliers, which may represent special cases or data-entry mistakes, could skew the analysis. In total, 50 rows were removed as outliers.
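As a rough illustration of why fixed cut-offs can be more practical here than the boxplot whisker rule, the sketch below compares the two on simulated right-skewed prices. The simulated values, the helper name count_boxplot_outliers, and the comparison itself are assumptions for illustration; they are not taken from the Airbnb data.

```r
# Count points beyond the boxplot whiskers (the 1.5 * IQR rule)
# count_boxplot_outliers is a hypothetical helper, not part of the report
count_boxplot_outliers <- function(x, coef = 1.5) {
  length(boxplot.stats(x, coef = coef)$out)
}

# Simulated right-skewed prices plus three extreme values (not real data)
set.seed(42)
toy_price <- c(rlnorm(1000, meanlog = 5, sdlog = 0.6), 9800, 12000, 30000)

n_flagged   <- count_boxplot_outliers(toy_price)  # whisker rule
n_over_9500 <- sum(toy_price > 9500)              # fixed cut-off
```

On skewed data the whisker rule tends to flag a sizable share of ordinary high values, whereas a fixed threshold removes only the few extreme points, which is closer to the handful of rows removed in this section.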

# Numeric columns to check for outliers
continuous_vars <- c("price", "minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "number_of_reviews_ltm", "availability_ratio")

# List to store plots
plots <- list()

# Generate boxplots for each variable
for(i in 1:length(continuous_vars)) {
  p <- ggplot(encoded_df, aes_string(y = continuous_vars[i])) + 
    geom_boxplot() +
    theme_minimal()
  plots[[i]] <- p
}

# Arrange the plots in a grid
grid.arrange(grobs = plots, ncol = 3)

dim(encoded_df[encoded_df$price  > 9500,])

## [1] 26 21

dim(encoded_df[encoded_df$minimum_nights  > 500,])

## [1]  6 21

dim(encoded_df[encoded_df$number_of_reviews  > 700,])

## [1]  9 21

dim(encoded_df[encoded_df$reviews_per_month  > 40,])

## [1]  3 21

dim(encoded_df[encoded_df$number_of_reviews_ltm  > 200,])

## [1] 15 21

encoded_df <- encoded_df %>%
         filter(price <=9500 & minimum_nights <= 500 & number_of_reviews  <= 700 & reviews_per_month  <= 40 & number_of_reviews_ltm  <= 200)

1.6 Summary Statistics for transformed data

After the data transformation, no missing values were detected.

New columns were added (neighbourhood_group_bronx, neighbourhood_group_brooklyn, neighbourhood_group_manhattan, neighbourhood_group_queens, room_type_entire_home_apt, room_type_hotel_room, room_type_private_room, availability_ratio) while others were removed (id, name, host_id, host_name, license).

Overall, the main changes brought about by filtering out the outliers were seen in the maximum values of minimum_nights, number_of_reviews, reviews_per_month, number_of_reviews_ltm, and price. As a result, the means and standard deviations of these variables decreased somewhat, which made the data more compact and probably more appropriate for Linear and Lasso Regression analysis.

DT::datatable(
      encoded_df[1:25,],
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE)

print(dfSummary(encoded_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 500, footnote = NA, col.width=50, method="render")

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| neighbourhood_group [factor] | 1. Bronx; 2. Brooklyn; 3. Manhattan; 4. Queens; 5. Staten Island | 1373 (3.5%); 14192 (36.6%); 16859 (43.5%); 5946 (15.3%); 372 (1.0%) | 0 (0.0%) |
| neighbourhood [factor] | 1. Allerton; 2. Arden Heights; 3. Arrochar; 4. Arverne; 5. Astoria; 6. Bath Beach; 7. Battery Park City; 8. Bay Ridge; 9. Bay Terrace; 10. Baychester; [ 213 others ] | 46 (0.1%); 6 (0.0%); 13 (0.0%); 96 (0.2%); 626 (1.6%); 27 (0.1%); 94 (0.2%); 124 (0.3%); 5 (0.0%); 32 (0.1%); 37673 (97.2%) | 0 (0.0%) |
| latitude [numeric] | Mean (sd): 40.7 (0.1); min ≤ med ≤ max: 40.5 ≤ 40.7 ≤ 40.9; IQR (CV): 0.1 (0) | 23347 distinct values | 0 (0.0%) |
| longitude [numeric] | Mean (sd): -73.9 (0.1); min ≤ med ≤ max: -74.3 ≤ -74 ≤ -73.7; IQR (CV): 0.1 (0) | 21002 distinct values | 0 (0.0%) |
| room_type [factor] | 1. Entire home/apt; 2. Hotel room; 3. Private room; 4. Shared room | 21301 (55.0%); 131 (0.3%); 16820 (43.4%); 490 (1.3%) | 0 (0.0%) |
| price [integer] | Mean (sd): 206.8 (319.4); min ≤ med ≤ max: 0 ≤ 135 ≤ 9313; IQR (CV): 146 (1.5) | 1174 distinct values | 0 (0.0%) |
| minimum_nights [integer] | Mean (sd): 30.5 (23.3); min ≤ med ≤ max: 1 ≤ 30 ≤ 500; IQR (CV): 0 (0.8) | 110 distinct values | 0 (0.0%) |
| number_of_reviews [integer] | Mean (sd): 25 (52.6); min ≤ med ≤ max: 0 ≤ 4 ≤ 695; IQR (CV): 24 (2.1) | 450 distinct values | 0 (0.0%) |
| last_review [integer] | Mean (sd): 2830 (1874.8); min ≤ med ≤ max: 0 ≤ 3820 ≤ 4525; IQR (CV): 4488 (0.7) | 2924 distinct values | 0 (0.0%) |
| reviews_per_month [numeric] | Mean (sd): 0.8 (1.3); min ≤ med ≤ max: 0 ≤ 0.2 ≤ 16.1; IQR (CV): 1 (1.7) | 805 distinct values | 0 (0.0%) |
| calculated_host_listings_count [integer] | Mean (sd): 38.6 (113.4); min ≤ med ≤ max: 1 ≤ 1 ≤ 602; IQR (CV): 5 (2.9) | 70 distinct values | 0 (0.0%) |
| availability_365 [integer] | Mean (sd): 148.7 (142.2); min ≤ med ≤ max: 0 ≤ 120 ≤ 365; IQR (CV): 300 (1) | 366 distinct values | 0 (0.0%) |
| number_of_reviews_ltm [integer] | Mean (sd): 6.8 (14.1); min ≤ med ≤ max: 0 ≤ 0 ≤ 196; IQR (CV): 5 (2.1) | 129 distinct values | 0 (0.0%) |
| neighbourhood_group_bronx [integer] | Min: 0; Mean: 0; Max: 1 | 0: 37369 (96.5%); 1: 1373 (3.5%) | 0 (0.0%) |
| neighbourhood_group_brooklyn [integer] | Min: 0; Mean: 0.4; Max: 1 | 0: 24550 (63.4%); 1: 14192 (36.6%) | 0 (0.0%) |
| neighbourhood_group_manhattan [integer] | Min: 0; Mean: 0.4; Max: 1 | 0: 21883 (56.5%); 1: 16859 (43.5%) | 0 (0.0%) |
| neighbourhood_group_queens [integer] | Min: 0; Mean: 0.2; Max: 1 | 0: 32796 (84.7%); 1: 5946 (15.3%) | 0 (0.0%) |
| room_type_entire_home_apt [integer] | Min: 0; Mean: 0.5; Max: 1 | 0: 17441 (45.0%); 1: 21301 (55.0%) | 0 (0.0%) |
| room_type_hotel_room [integer] | Min: 0; Mean: 0; Max: 1 | 0: 38611 (99.7%); 1: 131 (0.3%) | 0 (0.0%) |
| room_type_private_room [integer] | Min: 0; Mean: 0.4; Max: 1 | 0: 21922 (56.6%); 1: 16820 (43.4%) | 0 (0.0%) |
| availability_ratio [numeric] | Mean (sd): 0.4 (0.4); min ≤ med ≤ max: 0 ≤ 0.3 ≤ 1; IQR (CV): 0.8 (1) | 366 distinct values | 0 (0.0%) |

2. Data Exploration

2.1 Continuous Variables

availability_ratio displayed a more variable distribution: while some listings were available all year long, most were not. Latitude and longitude were approximately normally distributed, with most listings concentrated in a specific area. All other distributions were right-skewed. Severe skewness in the predictors or the target variable can be problematic in regression analysis because it can violate the normality assumption, particularly in linear regression models. Skewed distributions can also cause non-linearity and heteroscedasticity (non-constant variance) and make the model more susceptible to outliers. To normalize these variables, we had to apply transformations such as the BoxCox and log+1 transformations as part of data preparation; this was done after splitting the data into training and test sets.
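The effect of the log+1 transformation on a right-skewed count can be checked numerically. The sketch below is only an illustration: the review counts are simulated from a negative binomial distribution, and the skewness helper is a hypothetical function written here because base R does not provide one.

```r
# Sample skewness (third standardized moment); hypothetical helper
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated review counts: many zeros and a long right tail (not real data)
set.seed(42)
toy_reviews <- rnbinom(5000, size = 0.3, mu = 25)

skew_raw <- skewness(toy_reviews)         # strongly positive
skew_log <- skewness(log1p(toy_reviews))  # log(1 + x) handles the zeros
c(raw = skew_raw, log1p = skew_log)
```

log1p() is used instead of log() because variables such as number_of_reviews contain zeros, for which the plain logarithm is undefined.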

# Choose numeric variables
numeric_vars <-c("price", "minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "availability_ratio", "number_of_reviews_ltm", "latitude", "longitude")

# List to store plots
plots <- list()

# Generate histograms for each variable
for (i in 1:length(numeric_vars)) {
  p <- ggplot(encoded_df, aes_string(x = numeric_vars[i])) + 
    geom_histogram(aes(y=..density..), bins = 30, fill = "lightgreen", color = "black", alpha = 0.7) +
    geom_density(alpha = 0.2, fill = "#FF6666") +
    ggtitle(paste0('Distribution of ', numeric_vars[i])) +
    theme_minimal()
  plots[[i]] <- p
}

# Plot in grid with 2 columns
grid.arrange(grobs = plots,  ncol = 2)

2.2 Categorical Variables

The distribution of Airbnb listings among New York City’s boroughs showed noticeably more listings in some boroughs than in others, which could indicate more demand or a larger selection of lodging in those areas. Manhattan had the most properties, followed by Brooklyn, probably because Manhattan is a popular destination for both leisure and business travelers.

The various kinds of rooms available in Airbnb listings showed that some room types were more common than others, which could reflect hosts’ preferences or trends in visitor demand. Entire home/apt, followed by Private room, was the most common room type across all neighborhood groups, indicating that hosts in NYC were more likely to provide entire apartments than shared spaces.

It was crucial to comprehend these distributions in order to comprehend the dynamics of New York City’s Airbnb listings. For instance, a borough with a lot of listings might be a commercial center or a well-liked vacation spot. In a similar vein, the popularity of a specific room type can reveal the kind of lodging that visitors to NYC usually look for.

# Choose factor variables
factor_vars <- c("neighbourhood_group", "room_type")

# List to store plots
plots <- list()

# Generate barplots for each variable
for (i in 1:length(factor_vars)) {
  p <- ggplot(encoded_df, aes_string(x = factor_vars[i])) + 
    geom_bar(fill = "lightgreen", color = "black", alpha = 0.7) +
    ggtitle(paste0('Distribution of ', factor_vars[i])) +
    theme_minimal()
  plots[[i]] <- p
}

# Plot in grid with 1 column
grid.arrange(grobs = plots, ncol = 1)

The grouped bar chart below showed the distribution of Airbnb properties by room type across the neighborhood groups of New York City.

Manhattan was followed by Brooklyn in terms of the quantity of Airbnb listings. Entire home/apt was the most popular listing, followed by Private room. This suggests that Airbnb stays were also common in Brooklyn.

Compared to Manhattan and Brooklyn, there were noticeably fewer listings in Queens, the Bronx, and Staten Island. Of these, Queens had a higher number of listings than Staten Island and the Bronx.

In all boroughs, private room listings were the second most prevalent, with a notable concentration in Brooklyn and Manhattan.

There weren’t many listings for shared rooms, which could be because hosts prefer not to provide shared spaces or because there wasn’t as much demand for this kind of lodging.

There were variations in the distribution of room types among the boroughs, which could be due to factors such as the local housing market, zoning laws, or the travel and demographic patterns of each community.

# Group by borough, count room type in each borough
group_room <- encoded_df %>%
  group_by(neighbourhood_group) %>%
  count(room_type)

# Bar plot borough vs number of listings by room type
ggplot(group_room, aes(x = neighbourhood_group, y = n, fill = room_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  labs(title = "Borough vs Properties, NYC", x = "Borough", y = "Number of Properties") +
  scale_fill_discrete(name = "Room Type")

The mix of neighborhoods in the top 10 reflected the diverse appeal of New York City’s numerous neighborhoods by showcasing a range of areas dispersed across different parts of the city.

The strong representation of Williamsburg and Bedford-Stuyvesant underscored Brooklyn’s growing appeal as a destination for travelers to New York City.

The list included Manhattan neighborhoods like Harlem, Hell’s Kitchen, Upper West Side, and Upper East Side, highlighting the city’s ongoing appeal because of its convenient location and wealth of attractions.

Popular neighborhoods that draw a varied range of visitors included those with cultural, historical, or entertainment significance, such as Hell’s Kitchen and Harlem, which are well-known for their restaurants and close proximity to Broadway.

# Group by neighbourhood, count
group_neigh <- encoded_df %>% 
  group_by(neighbourhood) %>% 
  count() %>% 
  arrange(desc(n))

ggplot(group_neigh[1:10,], aes(x = reorder(neighbourhood, -n), y = n, fill = neighbourhood)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_minimal() +
  labs(title = "Neighbourhood vs Properties, NYC", x = "Neighbourhood", y = "Number of Properties") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

2.3 Target variable

Price vs Numerical Variables: Price didn’t seem to be strongly correlated with either latitude or longitude. Nonetheless, there were pockets of higher pricing in specific locations, most likely associated with upscale or more well-known neighborhoods. Few listings had a very high minimum night requirement, while the majority of listings had fewer minimum nights. The quantity of required minimum nights and the cost did not exhibit a definite linear relationship. The majority of expensive listings did not require a minimum of one night.

In general, listings with a large number of reviews were less expensive. This could suggest that lower-priced listings were booked and reviewed more frequently. The prices of listings with fewer reviews varied greatly. Listings with more reviews per month also typically had lower prices, mirroring the pattern for the total number of reviews. This plot likewise implied that listings that were reviewed, and probably booked, more frequently were more reasonably priced.

The price and the number of listings a host had were not clearly correlated. Prices varied widely for hosts with few or many listings. Prices also varied greatly across yearly availability levels, and the lack of a discernible pattern suggested that price was not strongly correlated with availability. Similar to the total number of reviews, listings with a higher number of reviews in the last 12 months typically had lower prices. The trend suggested that listings with lower prices could be booked and reviewed more frequently.

numeric_vars <-c("minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "availability_ratio", "number_of_reviews_ltm", "latitude", "longitude")

# Plot target vs numeric columns
for (i in numeric_vars) {
  p <- ggplot(encoded_df, aes_string(x = i, y = "price")) +
    geom_point() +
    theme_bw() +
    labs(title = paste('Price vs', i), x = i, y = 'Price')
  
  print(p)
}

The price column had a highly skewed distribution. This could cause problems for machine learning algorithms such as linear regression. The BoxCox transformation and removal of outliers (after ) made the distribution look much closer to normal.

aa_boxcox <- boxcox(lm((encoded_df$price+1) ~ 1))
aa_lambda <- aa_boxcox$x[which.max(aa_boxcox$y)]
aa_trans <- BoxCox(encoded_df$price+1, aa_lambda)
encoded_df$box_price <- aa_trans

# Create the distribution plot for `price`
p1 <- ggplot(encoded_df, aes(x=price)) +
      geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") +
      geom_density(alpha=.2, fill="#FF6666") +
      ggtitle("Distribution of Price")

# Create the distribution plot for BoxCox(price)
p2 <- ggplot(encoded_df, aes(x=box_price)) +
      geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") +
      geom_density(alpha=.2, fill="#FF6666") +
      ggtitle("Distribution of transformed Price") +
      xlab("BoxCox(price)")

# Create a Q-Q plot for BoxCox(price)
p3 <- ggplot() +
      stat_qq(aes(sample = box_price), data = encoded_df) +
      stat_qq_line(aes(sample = box_price), data = encoded_df) +
      ggtitle("Q-Q Plot of BoxCox(price)")

# Arrange the three plots in a two-column grid
grid.arrange(p1, p2, p3, ncol = 2)
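For reference, the one-parameter Box-Cox transform applied above can be written out in a few lines of base R. This is a minimal sketch on simulated right-skewed data, not the BoxCox() function used in this report; box_cox and skew are hypothetical helper names introduced only for illustration.

```r
# Box-Cox transform: (y^lambda - 1) / lambda, or log(y) when lambda is 0.
# It requires strictly positive y, which is why 1 is added to price above.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Simple moment-based sample skewness (hypothetical helper)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(42)
y <- rlnorm(1000, meanlog = 5, sdlog = 1)  # simulated right-skewed "prices"

skew_raw <- skew(y)
skew_log <- skew(box_cox(y, 0))            # lambda = 0 is the log transform
c(raw = skew_raw, transformed = skew_log)
```

On data like this the skewness of the transformed values should sit much closer to zero than that of the raw values, which is the effect the histograms and Q-Q plot above are meant to show.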

Price vs Categorical Variables:

The median prices of the neighborhood groups differed noticeably from one another. The common belief that Manhattan was a more expensive area was supported by the fact that the borough’s median price tended to be higher than those of the other boroughs. Compared to private rooms and hotel rooms, entire homes and apartments were typically more expensive. The larger price spread for entire homes and apartments pointed to significant variation in the kind and caliber of these listings.

# Plot target vs categorical
for (i in factor_vars) {
  p <- ggplot(encoded_df, aes_string(x = i, y = "price")) +
    geom_boxplot() +
    theme_bw() +
    labs(title = paste('Price vs', i), x = i, y = 'Price')
  print(p)
}

It appeared that a listing’s location (shown by latitude and longitude) affected its price, but not in a straightforward linear way. It was likely that neighborhood-specific factors were important. The type of room and the number of reviews were among the factors that most clearly affected price, and listings with lower prices appeared to draw more reviews. Many of the numerical variables didn’t clearly demonstrate a linear relationship with price, suggesting that the intricacies of the data could be beyond the scope of a basic linear model. Given the absence of obvious linear trends, non-linear modeling or the addition of interaction terms could be required to depict the relationship between these variables and the price more accurately.
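As an illustration of the non-linear and interaction effects mentioned above, the sketch below compares a main-effects linear model with one that adds a quadratic term and an interaction. The data are simulated and the variable names (x, g, y) are hypothetical; this is not a model fitted in this report.

```r
set.seed(42)
n <- 500
x <- runif(n, -2, 2)          # hypothetical numeric predictor
g <- rbinom(n, 1, 0.5)        # hypothetical 0/1 indicator (e.g. a room type)
y <- 1 + x^2 + 2 * g * x + rnorm(n, sd = 0.5)
sim <- data.frame(y, x, g)

# Main effects only: misses the curvature and the interaction
fit_linear <- lm(y ~ x + g, data = sim)

# Quadratic term in x, plus its interaction with g
fit_flexible <- lm(y ~ poly(x, 2) * g, data = sim)

adj_r2 <- c(linear = summary(fit_linear)$adj.r.squared,
            flexible = summary(fit_flexible)$adj.r.squared)
adj_r2
```

When the true relationship is curved or differs between groups, the flexible specification should explain noticeably more variance than the purely linear one.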

2.4 Correlation

There was a positive correlation between price and listings in Manhattan (0.21). This implied that Manhattan listings were more expensive, probably as a result of the borough’s popularity and central location. In contrast, price was negatively correlated with listings in Brooklyn (-0.11) and Queens (-0.11), suggesting that these boroughs typically had lower prices than Manhattan.

Price and entire homes/apartments had a positive correlation (0.17), indicating that the price of these listings was usually higher than that of shared accommodations. A negative correlation (-0.17) indicated that private rooms were typically less expensive than whole houses or apartments.

Price and longitude had a negative correlation (-0.16), which could suggest that listings in Brooklyn and Queens, which are further east, were typically less expensive. The positive correlation between latitude and price was less strong (0.04), indicating a potential trend toward slightly higher listing prices for listings further north.

The neighbourhood group encodings exhibited significant negative correlations with each other. For example, the correlation between neighbourhood group Manhattan and neighbourhood group Brooklyn was -0.67, and the correlation between neighbourhood group Manhattan and neighbourhood group Queens was -0.37. Given that these were mutually exclusive categories, this was expected.

Correlations among the room type encodings were also strongly negative. For example, there was a negative correlation (-0.97) between room_type_private_room and room_type_entire_home_apt. As with the neighbourhood groups, this reflected the fact that room types are mutually exclusive categories.

There was a weak negative correlation (-0.14) between number_of_reviews and calculated_host_listings_count. This could indicate that hosts with fewer properties tended to have listings that had received more reviews, either because they had been active on the platform longer or because they concentrated more on a single listing.

# Check correlation
rcore <- rcorr(as.matrix(encoded_df %>% dplyr::select(where(is.numeric))))
# Take correlation coeff
coeff <- rcore$r
# Build corr plot
corrplot(coeff, tl.cex = 0.5, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust", number.cex=0.5,
         diag=FALSE)

tst <- encoded_df %>% dplyr::select(where(is.numeric))
kable(cor(drop_na(tst))[,3], "html", escape = F, col.names = c('Coefficient')) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)

	Coefficient
latitude	0.0425704
longitude	-0.1643373
price	1.0000000
minimum_nights	-0.0941773
number_of_reviews	-0.0328301
last_review	-0.0863907
reviews_per_month	-0.0185994
calculated_host_listings_count	0.0393215
availability_365	0.0894067
number_of_reviews_ltm	-0.0168238
neighbourhood_group_bronx	-0.0527385
neighbourhood_group_brooklyn	-0.1143825
neighbourhood_group_manhattan	0.2121102
neighbourhood_group_queens	-0.1055112
room_type_entire_home_apt	0.1709745
room_type_hotel_room	0.0643294
room_type_private_room	-0.1727415
availability_ratio	0.0894067
box_price	0.6354842
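The negative correlations among the encoded neighbourhood group and room type columns are a mechanical consequence of one-hot encoding: a row that has a 1 in one column must have a 0 in the others. A small sketch with simulated labels (the shares below are made up, not taken from the dataset) reproduces the pattern:

```r
set.seed(42)
# Hypothetical borough labels with unequal shares
borough <- sample(c("manhattan", "brooklyn", "queens"), size = 5000,
                  replace = TRUE, prob = c(0.45, 0.40, 0.15))

# One indicator column per level, no intercept column
dummies <- model.matrix(~ borough - 1)
colnames(dummies) <- sub("^borough", "", colnames(dummies))

cm <- cor(dummies)
round(cm, 2)
```

Every off-diagonal entry is negative, and the two most common levels show the strongest negative correlation, which mirrors the Manhattan versus Brooklyn value reported above.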

3. Split data

Finally, we split our data into train (75%) and test (25%) datasets to evaluate model performance before we proceeded to prediction. The train data contained 29180 records, test data 9562.

# random seed
set.seed(42)

# 75/25 split of the data set
sample <- sample.split(encoded_df$price, SplitRatio = 0.75)
train_data  <- subset(encoded_df, sample == TRUE)
test_data   <- subset(encoded_df, sample == FALSE)

# Check dimensions of train and test data
dim(train_data)

## [1] 29180    22

dim(test_data)

## [1] 9562   22
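For readers without the package that provides sample.split(), a plain random 75/25 split can be sketched in base R as follows. This is an illustrative alternative on a simulated data frame, not the split used in this report; the column names are hypothetical.

```r
set.seed(42)
n <- 1000
df <- data.frame(id = seq_len(n), price = rlnorm(n, meanlog = 5, sdlog = 1))

# Draw 75% of the row indices at random for training
train_idx <- sample(n, size = floor(0.75 * n))

train <- df[train_idx, ]
test  <- df[-train_idx, ]   # the remaining 25%

c(train = nrow(train), test = nrow(test))
```

Fixing the seed before sampling keeps the split reproducible, as in the chunk above.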

The outliers showed extremely high prices per night ($30,000) in contrast to the mean ($216), so the price distribution was strongly right-skewed (positive skewness). To lessen the skewness, the Box-Cox transformation was applied after splitting the data into training and test sets: each lambda was estimated on the training data only and then reused for the test data to avoid data leakage. For minimum_nights we used log(x + 1), because the logarithm of zero is undefined. We also applied the Box-Cox transformation to number_of_reviews, reviews_per_month, calculated_host_listings_count, and number_of_reviews_ltm to reduce their right skewness, stabilize variance, and bring the distributions closer to normal. For all of these variables except calculated_host_listings_count, we added 1 before calling boxcox() to ensure all values were positive, which the Box-Cox transformation requires. Linear modeling methods such as Linear Regression and Lasso Regression tend to behave better when their assumptions are approximately met, so normalizing the distributions of these variables was intended to improve the precision and reliability of our models.

train_transformed <- train_data
test_transformed <- test_data

# Log transformation for minimum_nights
cols_transform <- c("minimum_nights")

for (i in cols_transform) {
  train_transformed[[i]]<- log(train_transformed[[i]]+1)
  test_transformed[[i]]<- log(test_transformed[[i]]+1)
}


# Boxcox transformation for "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "number_of_reviews_ltm", "price"
b_boxcox <- boxcox(lm((train_transformed$number_of_reviews+1) ~ 1))
b_lambda <- b_boxcox$x[which.max(b_boxcox$y)]
b_trans <- BoxCox(train_transformed$number_of_reviews+1, b_lambda)
train_transformed$number_of_reviews <- b_trans 
b_trans <- BoxCox(test_transformed$number_of_reviews+1, b_lambda)
test_transformed$number_of_reviews <- b_trans 


c_boxcox <- boxcox(lm((train_transformed$reviews_per_month+1) ~ 1))
c_lambda <- c_boxcox$x[which.max(c_boxcox$y)]
c_trans <- BoxCox(train_transformed$reviews_per_month+1, c_lambda)
train_transformed$reviews_per_month <- c_trans 
c_trans <- BoxCox(test_transformed$reviews_per_month+1, c_lambda)
test_transformed$reviews_per_month <- c_trans 

d_boxcox <- boxcox(lm((train_transformed$calculated_host_listings_count) ~ 1))
d_lambda <- d_boxcox$x[which.max(d_boxcox$y)]
d_trans <- BoxCox(train_transformed$calculated_host_listings_count, d_lambda)
train_transformed$calculated_host_listings_count <- d_trans 
d_trans <- BoxCox(test_transformed$calculated_host_listings_count, d_lambda)
test_transformed$calculated_host_listings_count <- d_trans 

e_boxcox <- boxcox(lm((train_transformed$number_of_reviews_ltm+1) ~ 1))
e_lambda <- e_boxcox$x[which.max(e_boxcox$y)]
e_trans <- BoxCox(train_transformed$number_of_reviews_ltm+1, e_lambda)
train_transformed$number_of_reviews_ltm <- e_trans 
e_trans <- BoxCox(test_transformed$number_of_reviews_ltm+1, e_lambda)
test_transformed$number_of_reviews_ltm <- e_trans 

aa_boxcox <- boxcox(lm((train_transformed$price+1) ~ 1))
aa_lambda <- aa_boxcox$x[which.max(aa_boxcox$y)]
aa_trans <- BoxCox(train_transformed$price+1, aa_lambda)
train_transformed$price <- aa_trans
aa_trans <- BoxCox(test_transformed$price+1, aa_lambda)
test_transformed$price <- aa_trans

For linear regression models, these transformations aided in stabilizing the variance and improving the symmetry of the distributions. The data was better suited for predictive modeling after transformations.
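The leakage-safe pattern used above (estimate lambda on the training data, then apply the same lambda to the test data) can be sketched with base R only. The helper names box_cox, bc_loglik, and fit_lambda are hypothetical, the data are simulated, and the search interval of -2 to 2 is an assumption chosen to resemble a typical default grid; the report itself relies on boxcox() and BoxCox().

```r
# Box-Cox transform for strictly positive y
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for an intercept-only model
# (up to an additive constant)
bc_loglik <- function(lambda, y) {
  z <- box_cox(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

# Estimate lambda from the training values only
fit_lambda <- function(y_train, interval = c(-2, 2)) {
  optimize(bc_loglik, interval = interval, y = y_train, maximum = TRUE)$maximum
}

set.seed(42)
train_x <- rlnorm(750, meanlog = 1, sdlog = 1) + 1  # simulated, shifted to be > 0
test_x  <- rlnorm(250, meanlog = 1, sdlog = 1) + 1

lambda_hat <- fit_lambda(train_x)         # fitted on train only
train_t <- box_cox(train_x, lambda_hat)
test_t  <- box_cox(test_x, lambda_hat)    # same lambda reused for test
```

Wrapping the steps in a function like this would also remove the repetition in the chunk above, since each variable follows the same three steps.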

# Choose numeric variables
numeric_vars <-c("price", "minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "number_of_reviews_ltm", "availability_ratio")

# List to store plots
plots <- list()

# Generate histograms for each variable
for (i in 1:length(numeric_vars)) {
  p <- ggplot(train_transformed, aes_string(x = numeric_vars[i])) + 
    geom_histogram(aes(y=..density..), bins = 30, fill = "lightgreen", color = "black", alpha = 0.7) +
    geom_density(alpha = 0.2, fill = "#FF6666") +
    ggtitle(paste0('Distribution of ', numeric_vars[i])) +
    theme_minimal()
  plots[[i]] <- p
}

# Plot in grid with 2 columns
grid.arrange(grobs = plots, ncol = 2)

# Create the distribution plot for `price`
p1 <- ggplot(test_transformed, aes(x=price)) +
      geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") +
      geom_density(alpha=.2, fill="#FF6666") +
      ggtitle("Distribution of Price")

# Create a Q-Q plot for price
p3 <- ggplot() +
      stat_qq(aes(sample = price), data = test_transformed) +
      stat_qq_line(aes(sample = price), data = test_transformed) +
      ggtitle("Q-Q Plot of Price")

# Arrange the two plots side by side
grid.arrange(p1, p3, ncol = 2)

4. Models

4.1 Model 1 - Linear Regression

For the first model, Linear Regression, we dropped:

neighbourhood_group, room_type - we encoded these features and used the encoded columns instead of the originals.
availability_365, box_price - we created the new feature availability_ratio and transformed price with Box-Cox after splitting the data into train and test sets, so the earlier box_price column was no longer needed.
number_of_reviews, reviews_per_month - they were highly correlated, so we kept only the number of reviews for the last 12 months (number_of_reviews_ltm).
neighbourhood - to avoid high dimensionality.

# remove some features from the model
train_model_1 <- train_transformed %>% 
                dplyr::select(-c(neighbourhood, neighbourhood_group, room_type, availability_365, number_of_reviews, reviews_per_month, box_price)) 

test_model_1 <- test_transformed %>% 
                dplyr::select(-c(neighbourhood, neighbourhood_group, room_type, availability_365, number_of_reviews, reviews_per_month, box_price))

We first configured five-fold cross-validation. The training data was divided into five parts, and the model was trained and validated five times, each time holding out a different part as the validation set and training on the remaining four. This gave a more reliable estimate of the model’s out-of-sample performance than a single fit. Afterwards, the Linear Regression model was trained.

set.seed(42)

# setup cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# fit a regression model and use k-fold CV to evaluate performance
lm_model <- train(price ~ ., data = train_model_1, method = "lm", trControl = ctrl)
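To make the resampling procedure concrete, the following base R sketch performs five-fold cross-validation by hand on simulated data. It is a simplified illustration of what the train() call above does for method = "lm", not a reimplementation of it; the data frame sim and its columns are hypothetical.

```r
set.seed(42)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n, sd = 0.2)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

cv_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ ., data = sim[fold != i, ])        # train on the other k - 1 folds
  pred <- predict(fit, newdata = sim[fold == i, ])  # predict the held-out fold
  sqrt(mean((sim$y[fold == i] - pred)^2))
})

mean(cv_rmse)  # average of the per-fold RMSEs
```

The averaged value plays the same role as the RMSE shown in the resampling results of the fitted model.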

Model Performance

How well the model fit the data was shown in the summary of residuals. The residuals were between -3.3851 and 1.1955. A somewhat symmetric residual distribution around zero was suggested by a median near -0.0187, which is generally a positive indicator.

The F-statistic was 1,966, the adjusted R-squared was 0.485, and all 14 predictors had statistically significant p-values. The model as a whole was statistically significant, as indicated by the F-statistic’s p-value of less than 2.2e-16. The adjusted R-squared indicated that about 48.5% of the variance in the response variable could be explained by the predictor variables. Even though the model explained a sizable percentage of the variance in price, a similarly sizable portion remained unaccounted for.

Residual standard error was 0.1879. It provided an estimate of the residuals’ standard deviation and, consequently, the typical error in price prediction made by the model.

RMSE on the test data was 0.181, indicating the typical magnitude of the prediction error on the transformed price scale. A lower RMSE is preferable.

A positive coefficient indicated that the predictor and the outcome variable were positively related: the outcome rose along with the predictor. For example, room_type_entire_home_apt had a positive coefficient of about 0.298. This suggested that, holding all other variables constant, the transformed price of an entire home or apartment was approximately 0.298 units higher than that of the room type left out of the encoding (shared rooms). Similarly, neighbourhood_group_manhattan had the largest neighbourhood group coefficient (0.345), so Manhattan listings were more expensive than those in other areas.

A negative coefficient indicated a negative relationship between the predictor and the outcome variable: the outcome decreased as the predictor increased. For instance, both longitude and latitude had sizable negative coefficients, showing that price was significantly influenced by geography. A listing’s precise location was important, since certain areas (usually central or well-liked neighborhoods) command higher prices.

MAPE (Mean Absolute Percentage Error) was 4.70. This showed that the model’s predictions were, on average, 4.70% off from the actual values on the transformed price scale. SMAPE (Symmetric Mean Absolute Percentage Error) was 4.68. SMAPE modifies MAPE by normalizing each error by the average of the predicted and actual values; the low SMAPE of 4.68% indicated that the model was fairly accurate. MASE (Mean Absolute Scaled Error) was 0.539. In other words, the model’s mean absolute error was 46.1% lower than that of the naive forecast. MPE (Mean Percentage Error), which shows how biased the predictions are, was -0.594. Assuming the error is computed as actual minus predicted, this negative value indicated a slight tendency to over-predict.

These model performance metrics were rather low, indicating a respectable level of predictive accuracy.
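The four error metrics discussed above can be written out directly. The definitions below are the common textbook forms and, as an assumption, use the (actual minus predicted) sign convention for MPE and a one-step naive forecast for MASE; they are a sketch for checking intuition on toy numbers, not the package functions used in this report.

```r
mape  <- function(truth, est) mean(abs((truth - est) / truth)) * 100
smape <- function(truth, est) {
  mean(abs(est - truth) / ((abs(truth) + abs(est)) / 2)) * 100
}
# MPE keeps the sign: negative when predictions exceed the truth on average
mpe   <- function(truth, est) mean((truth - est) / truth) * 100
# MASE: model MAE relative to the MAE of a naive forecast that predicts
# each value with the previous one (so row order matters for the baseline)
mase  <- function(truth, est) {
  mean(abs(truth - est)) / mean(abs(diff(truth)))
}

truth <- c(2.0, 2.5, 3.0, 4.0)   # toy actual values
est   <- c(2.2, 2.4, 3.3, 3.8)   # toy predictions

c(mape = mape(truth, est), smape = smape(truth, est),
  mase = mase(truth, est), mpe = mpe(truth, est))
```

Because the naive baseline in MASE depends on the order of the rows, the value is easiest to interpret for ordered data; for a cross-sectional dataset like this one it should be read with some caution.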

print(lm_model)

## Linear Regression 
## 
## 29180 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 23344, 23344, 23344, 23344, 23344 
## Resampling results:
## 
##   RMSE       Rsquared  MAE      
##   0.1879817  0.484945  0.1422375
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

summary(lm_model)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3851 -0.1227 -0.0187  0.1029  1.1955 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -3.953e+01  2.960e+00 -13.356  < 2e-16 ***
## latitude                       -3.733e-01  3.250e-02 -11.486  < 2e-16 ***
## longitude                      -7.783e-01  3.388e-02 -22.973  < 2e-16 ***
## minimum_nights                 -9.868e-02  1.593e-03 -61.933  < 2e-16 ***
## last_review                    -1.516e-05  8.902e-07 -17.030  < 2e-16 ***
## calculated_host_listings_count -3.885e-02  2.005e-03 -19.373  < 2e-16 ***
## number_of_reviews_ltm           5.637e-02  3.003e-03  18.774  < 2e-16 ***
## neighbourhood_group_bronx       2.755e-01  1.648e-02  16.722  < 2e-16 ***
## neighbourhood_group_brooklyn    2.294e-01  1.291e-02  17.772  < 2e-16 ***
## neighbourhood_group_manhattan   3.449e-01  1.327e-02  25.987  < 2e-16 ***
## neighbourhood_group_queens      2.687e-01  1.476e-02  18.205  < 2e-16 ***
## room_type_entire_home_apt       2.983e-01  9.995e-03  29.841  < 2e-16 ***
## room_type_hotel_room            2.642e-01  2.240e-02  11.796  < 2e-16 ***
## room_type_private_room          5.287e-02  1.002e-02   5.277 1.32e-07 ***
## availability_ratio              1.117e-01  3.292e-03  33.947  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1879 on 29165 degrees of freedom
## Multiple R-squared:  0.4855, Adjusted R-squared:  0.4853 
## F-statistic:  1966 on 14 and 29165 DF,  p-value: < 2.2e-16

# validate and calculate RMSE
lm_model.valid <- predict(lm_model, newdata = test_model_1)
lm_model.eval <- bind_cols(target = test_model_1$price, predicted=unname(lm_model.valid))
lm_model.rmse <- sqrt(mean((lm_model.eval$target - lm_model.eval$predicted)^2)) 

# plot targets vs predicted
lm_model.eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
  labs(title=paste('RMSE:',round(lm_model.rmse,3)))

# Calculate metrics mape, smape, mase, mpe, rmse
multi_metric <- metric_set(mape, smape, mase, mpe, yardstick::rmse)
model1_df <- lm_model.eval %>% multi_metric(truth=target, estimate=predicted)
b <- summary(lm_model)
model1_df

## # A tibble: 5 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mape    standard       4.70 
## 2 smape   standard       4.68 
## 3 mase    standard       0.539
## 4 mpe     standard      -0.594
## 5 rmse    standard       0.181

# Add results to table
results_lm_tbl <- tibble(
                      Model = character(),
                      mape = numeric(), 
                      smape = numeric(), 
                      mase = numeric(), 
                      mpe = numeric(), 
                      "RMSE" = numeric(),
                      "Adjusted R2" = numeric()
                )

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Model 1: Linear Regression, cv",
                      mape = model1_df[[1,3]],
                      smape = model1_df[[2,3]],
                      mase = model1_df[[3,3]],
                      mpe = model1_df[[4,3]],
                      "RMSE" = model1_df[[5,3]],
                      "Adjusted R2" = b$adj.r.squared
                     ))
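Because price was Box-Cox transformed before modeling, the RMSE and percentage errors above are expressed on the transformed scale rather than in dollars. If dollar-scale errors are of interest, predictions can be back-transformed first. The sketch below shows the inverse transform under the assumption that 1 was added before transforming, as in this report; the lambda value and the prices are hypothetical, and inv_box_cox is an illustrative helper name.

```r
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Inverse of the Box-Cox transform (valid while lambda * z + 1 stays positive)
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

lambda <- -0.3                       # hypothetical value, not the fitted lambda
price  <- c(50, 120, 216, 1000)      # illustrative nightly prices in dollars

z    <- box_cox(price + 1, lambda)   # forward transform with the +1 shift
back <- inv_box_cox(z, lambda) - 1   # back-transform and undo the shift

all.equal(back, price)
```

Applying the same inverse to model predictions and to the transformed test prices would allow RMSE or MAPE to be recomputed in the original units.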

Model Assumptions

The Residuals vs Fitted plot checks for nonlinearity and heteroscedasticity. To indicate homoscedasticity, the points should ideally be distributed randomly around the horizontal line. The residuals in the plot were distributed with almost no recognizable pattern.

The Normal Q-Q plot checks whether the residuals are normally distributed; points that follow the dashed line indicate normality. There was some deviation at the tails of the plot, suggesting possible problems with normality.

In the Scale-Location plot, the points should be spread evenly across the range of fitted values. The plot indicated potential problems with equal variance, as the residuals’ spread widened as fitted values increased.

The Residuals vs Leverage plot made it easier to spot outliers that unduly affected the model. Influential points are those that lie beyond the dashed Cook’s distance lines or far to the right of the plot. A few of the points had high leverage and/or high Cook’s distance, which made them potentially influential.

The diagnostic plots implied that there could be some violations of the homoscedasticity and residual normality assumptions of linear regression. Considering further variable transformations or using models more robust to such problems may be beneficial. The influential points flagged in the Residuals vs Leverage plot may call for additional research to determine whether they should be removed or whether there is a meaningful explanation for their status as outliers.

# Plot model's assumptions
lm_model_final <- lm_model$finalModel

par(mfrow = c(2, 2))
plot(lm_model_final)

Given the significant result of the Jarque-Bera (JB) test, the residuals were unlikely to be normally distributed. This could have an impact on the reliability of some statistical tests. According to the Breusch-Pagan test, the residuals also exhibited heteroscedasticity, meaning that the variance of the residuals was not constant across all levels of the independent variables.

# Test for residuals
JarqueBera.test(lm_model_final$residuals)

## 
##  Jarque Bera Test
## 
## data:  lm_model_final$residuals
## X-squared = 40599, df = 2, p-value < 2.2e-16
## 
## 
##  Skewness
## 
## data:  lm_model_final$residuals
## statistic = 0.39057, p-value < 2.2e-16
## 
## 
##  Kurtosis
## 
## data:  lm_model_final$residuals
## statistic = 8.7255, p-value < 2.2e-16

bptest(lm_model_final)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm_model_final
## BP = 542.08, df = 14, p-value < 2.2e-16
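The Jarque-Bera statistic can be checked by hand from the reported skewness and kurtosis using the standard formula JB = n/6 * (S^2 + (K - 3)^2 / 4). The short calculation below plugs in the rounded values from the output above, so small differences from the reported statistic are expected.

```r
n        <- 29180     # training rows used to fit the model
skewness <- 0.39057   # reported skewness of the residuals
kurtosis <- 8.7255    # reported kurtosis (a normal distribution has 3)

# Jarque-Bera statistic: combines squared skewness and excess kurtosis
jb <- n / 6 * (skewness^2 + (kurtosis - 3)^2 / 4)

# Under normality JB is approximately chi-squared with 2 degrees of freedom
p_value <- pchisq(jb, df = 2, lower.tail = FALSE)

c(jb = jb, p_value = p_value)
```

The result lands very close to the reported X-squared of 40599. Nearly all of it comes from the kurtosis term, which points to heavy tails rather than asymmetry as the main departure from normality.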

The variance inflation factor (VIF) quantifies the extent to which multicollinearity with other model predictors inflates the variance of a regression coefficient. As a general rule, high multicollinearity is indicated by a VIF greater than 5 or 10.

High VIF values indicated that variables such as availability_ratio, latitude, longitude, and neighbourhood_group_manhattan could be collinear with other variables in the model. This collinearity could be problematic because it could inflate the standard errors of the coefficients and reduce the reliability of the model estimates. A high VIF only implied that a variable was not offering unique information when the other variables were present; it did not imply that the variable was unimportant.

# Variance inflation factor (VIF) plot
vif_values <- vif(lm_model_final)
vif_values <- rownames_to_column(as.data.frame(vif_values), var = "var")

vif_values %>%
  ggplot(aes(y=vif_values, x=var)) +
  coord_flip() + 
  geom_hline(yintercept=5, linetype="dashed", color = "red") +
  geom_bar(stat = 'identity', width=0.3 ,position=position_dodge())
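The VIF of a predictor can also be computed from first principles as 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below does this in base R on simulated data with one deliberately collinear pair; the variable names are hypothetical and the example is independent of the model above.

```r
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # deliberately collinear with x1
x3 <- rnorm(n)                  # independent predictor
X  <- data.frame(x1, x2, x3)

vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  # Regress predictor v on the remaining predictors
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

The collinear pair should show VIF values well above the threshold of 5, while the independent predictor stays close to 1.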

4.2 Model 2 - Lasso

To perform Lasso Regression, we used functions from the glmnet package. This package required the response variable to be a vector and the set of predictor variables to be of the class data.matrix. We used the transformed train and test data (with log and boxcox transformations) without choosing particular variables.

# Transform to matrix
set.seed(42)
t0 <- train_transformed %>% dplyr::select(-c(availability_365, neighbourhood_group, room_type, neighbourhood, box_price))
X <- model.matrix(price ~ ., data=t0)[,-1]
Y <- t0$price

Next, we fit the Lasso Regression model with the glmnet package, specifying alpha = 1. To determine what value to use for lambda, we performed k-fold cross-validation with cv.glmnet() and identified the lambda value that produced the lowest cross-validated mean squared error (MSE).

The following attribute settings were selected for the model:

type.measure = "mse" - The cross-validation loss is set to the Mean Squared Error.
nfolds = 10 - Given the size of the dataset, we used 10-fold cross-validation, which is also the default.
family = "gaussian" - For Linear Regression.
alpha = 1 - The alpha value of 1 sets the variable shrinkage method to lasso.
standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this normalizes the predictor variables to a mean of zero and a standard deviation of one before modeling.

The coefficients extracted using lambda.min minimized the mean cross-validated error.

# Fit lasso model
set.seed(42)
lasso_cv <- cv.glmnet(
  x=X,y=Y, # Y is the Box-Cox transformed price
  family = "gaussian",
  type.measure="mse",
  standardize = TRUE, # standardize
  nfolds = 10, # the cv.glmnet argument is named nfolds
  alpha=1) # alpha=1 is lasso

# Find optimal lambda value that minimizes cross-validated MSE
best_lambda <- lasso_cv$lambda.min
best_lambda

## [1] 6.654594e-05

#produce plot of test MSE by lambda value
plot(lasso_cv)
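To illustrate why the lasso penalty can set coefficients exactly to zero, the sketch below applies the soft-thresholding operator, which gives the lasso solution in the idealized case of orthonormal predictors. The input values are made up for illustration and are not the coefficients of the model in this report; soft_threshold is a hypothetical helper name.

```r
# Soft-thresholding: shrink towards zero by gamma, and set to exactly zero
# anything whose absolute value is below gamma
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

ols <- c(0.30, 0.04, -0.02, -0.80)   # illustrative least-squares estimates

shrunk_small <- soft_threshold(ols, 0.0001)  # tiny penalty: almost unchanged
shrunk_large <- soft_threshold(ols, 0.05)    # larger penalty: small terms dropped

rbind(ols, shrunk_small, shrunk_large)
```

With a penalty as small as the best_lambda found above, very little shrinkage takes place, which is consistent with the Lasso coefficients ending up close to those of the Linear Regression model.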

Afterwards, we analyzed the final model produced by the optimal lambda value.

#find coefficients of best model
lasso_model <- glmnet(X, Y, alpha = 1, lambda = best_lambda, standardize = TRUE)
#coef(lasso_model)


# Show table with coeff of Lasso model
as.data.frame(as.matrix(coef(lasso_model, s = "lambda.min"))) %>%
  arrange(desc(s1)) %>%
  nice_table(cap='Model Coefficients', cols='Est')

Model Coefficients
	Est
neighbourhood_group_manhattan	0.325
room_type_entire_home_apt	0.294
room_type_hotel_room	0.270
neighbourhood_group_bronx	0.248
neighbourhood_group_queens	0.246
neighbourhood_group_brooklyn	0.212
reviews_per_month	0.187
availability_ratio	0.106
room_type_private_room	0.048
number_of_reviews_ltm	0.028
last_review	0.000
number_of_reviews	-0.025
calculated_host_listings_count	-0.040
minimum_nights	-0.097
latitude	-0.338
longitude	-0.779
(Intercept)	-40.983

Model Performance

The analysis of residuals and other performance metrics demonstrated how well the Lasso Regression model fit the data. The residuals showed a somewhat symmetric distribution around zero, with a range of -3.395 to 1.194. A median near -0.0187 highlighted this symmetry, which is typically a positive indicator that the model did not consistently overestimate or underestimate the prices.

The adjusted R-squared value of 0.489 indicated that the predictor variables in the model could account for roughly 48.9% of the price variance. Although meaningful, this value also suggested that the model was unable to account for a sizable portion of the variability in Airbnb prices.

The RMSE value of 0.181 measured the typical deviation of the predicted values from the actual prices on the transformed scale, that is, the average magnitude of the prediction errors.

A positive correlation between the predictor and the outcome variable was indicated by positive coefficients. When all else was equal, a variable with a positive coefficient, for example, would imply that the price increased in tandem with the predictor’s value. Similar to the Linear model, neighbourhood_group_manhattan, room_type_entire_home_apt, and room_type_hotel_room were key predictors with significant positive coefficients.

On the other hand, a negative coefficient denoted an inverse relationship. For instance, the negative coefficient for latitude (-0.338) suggested that, with all other variables held constant, moving northward (increasing latitude) was associated with a drop in price.

MAPE was 4.68. This showed that the model’s predictions were, on average, 4.68% off from the actual values on the transformed price scale. SMAPE was 4.66. SMAPE modifies MAPE by normalizing each error by the average of the predicted and actual values; the low SMAPE of 4.66% indicated that the model was fairly accurate. MASE was 0.537, so the model’s mean absolute error was about 46% lower than that of the naive forecast. MPE was -0.589. Assuming the error is computed as actual minus predicted, this negative value indicated a slight tendency to over-predict.

Overall, even though the Lasso model offered insightful information and a high level of predictive accuracy, there was still opportunity for improvement, perhaps through the use of more sophisticated modeling techniques or the investigation of new or different predictor variables.

t1 <- test_transformed %>% dplyr::select(-c(availability_365, neighbourhood_group, room_type, neighbourhood, box_price))
X_test <- model.matrix(price ~ ., data=t1)[,-1]
y_test <- t1$price

# validate and calculate RMSE
lasso_model.valid <- predict(lasso_model, newx = X_test, s = best_lambda)
lasso_model.eval <- bind_cols(target = y_test, predicted=unname(lasso_model.valid))
lasso_model.eval$predicted <- as.numeric(lasso_model.eval$predicted)
lasso_model.rmse <- sqrt(mean((lasso_model.eval$target - lasso_model.eval$predicted)^2)) 

# plot targets vs predicted
lasso_model.eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
  labs(title=paste('RMSE:',round(lasso_model.rmse,3)))

# Calculate metrics mape, smape, mase, mpe, rmse
lasso_model_df <- lasso_model.eval %>% multi_metric(truth=target, estimate=predicted)
lasso_model_df

## # A tibble: 5 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mape    standard       4.68 
## 2 smape   standard       4.66 
## 3 mase    standard       0.537
## 4 mpe     standard      -0.589
## 5 rmse    standard       0.181

# R-squared and Adjusted R-squared
predictions_train <- predict(lasso_model, newx = X, s = best_lambda)

SSE <- sum((predictions_train - Y)^2)
SST <- sum((Y - mean(Y))^2)
r_squared_lasso <- 1 - SSE / SST

n <- length(Y)
p <- ncol(X)
adj_r_squared_lasso <- 1 - (1 - r_squared_lasso) * (n - 1) / (n - p - 1)
adj_r_squared_lasso

## [1] 0.4886678

# Predictions and residuals
residuals_lasso <- Y - predictions_train

# Calculate statistics for residuals
min(residuals_lasso)

## [1] -3.394797

max(residuals_lasso)

## [1] 1.193608

median(residuals_lasso)

## [1] -0.01862024

mean(residuals_lasso)

## [1] -6.953138e-14

sd(residuals_lasso)

## [1] 0.1872375

# Add results to table
results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Model 2: Lasso",
                      mape = lasso_model_df[[1,3]],
                      smape = lasso_model_df[[2,3]],
                      mase = lasso_model_df[[3,3]],
                      mpe = lasso_model_df[[4,3]],
                      "RMSE" = lasso_model_df[[5,3]],
                      "Adjusted R2" = adj_r_squared_lasso
                     ))

Model Assumptions

Several diagnostic plots were used to check the assumptions of the Lasso Regression model and to identify potential problems. The Residual vs. Fitted plot is used to check linearity and homoscedasticity: ideally, the residuals are scattered randomly around a horizontal line at zero. The residuals of the Lasso model showed no discernible pattern in this plot, which suggests that nonlinearity is not a major problem and indicates a reasonable level of homoscedasticity.

The majority of the residuals in the Lasso model’s Q-Q plot fell along a straight line, indicating that the residuals were roughly normally distributed. A few deviations exist, especially in the tails, but these are typical of real-world data.

The Scale-Location plot, which shows how the spread of the residuals changes across the range of fitted values, highlighted possible problems with equal variance. The spread of the residuals appeared to change as the fitted values increased, which may indicate heteroscedasticity, that is, residual variance that is not constant across the range of predicted values.
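The visual impression from a Scale-Location plot can be complemented with a formal check. The following is a minimal base-R sketch of a studentized Breusch-Pagan-style test on simulated data. The objects `x`, `y`, `fit`, and `aux` are hypothetical and are not objects from this report; the idea is only to show that regressing squared residuals on the fitted values yields a test statistic of n times R-squared.

```r
# Simulated data whose noise grows with x (heteroscedastic by construction)
set.seed(7)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.2 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the fitted values
sq_resid <- resid(fit)^2
fitted_vals <- fitted(fit)
aux <- lm(sq_resid ~ fitted_vals)

# Studentized (Koenker) statistic: n * R^2, compared with chi-squared(1)
bp_stat <- n * summary(aux)$r.squared
bp_p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

bp_stat
bp_p_value
```

A small p-value here is evidence against constant residual variance. For a Lasso fit, the same auxiliary regression could be run on `residuals_lasso` and `predictions_train`, with the caveat that the usual test theory is derived for ordinary least squares.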

Taken together, the diagnostic plots suggested possible violations of the residual normality and homoscedasticity assumptions. In light of these results, it might be helpful to investigate variable transformations or to consider models that are more robust to these problems. The Residuals vs. Leverage plot flagged influential points that call for further investigation to determine whether there is a compelling reason to treat them as outliers and whether they should be removed.

The significant result of the Jarque-Bera (JB) test indicates that the residuals are probably not normally distributed, which may affect the reliability of some statistical inferences made from the model.
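To make the JB test less of a black box, the sketch below computes the statistic by hand from the sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4), compared against a chi-squared distribution with 2 degrees of freedom. The helper `jb_stat` and the simulated sample are hypothetical illustrations, not the implementation behind `JarqueBera.test`. The last lines plug in the skewness and kurtosis reported in this section, assuming n equals the 29,180 training records mentioned in the essay.

```r
# Hand-rolled Jarque-Bera statistic (illustrative helper, not a package function)
jb_stat <- function(x) {
  n <- length(x)
  z <- x - mean(x)
  m2 <- mean(z^2)
  s <- mean(z^3) / m2^1.5   # sample skewness
  k <- mean(z^4) / m2^2     # sample kurtosis (3 for a normal distribution)
  stat <- n / 6 * (s^2 + (k - 3)^2 / 4)
  c(skewness = s, kurtosis = k, statistic = stat,
    p_value = pchisq(stat, df = 2, lower.tail = FALSE))
}

# Heavy-tailed sample: should be flagged as non-normal
set.seed(42)
heavy <- rt(5000, df = 3)
res_heavy <- jb_stat(heavy)

# Tiny symmetric sample that can be checked by hand
res_small <- jb_stat(c(-1, 0, 1))

# Plugging in the values reported above, assuming n = 29180
jb_reported <- 29180 / 6 * (0.38017^2 + (8.8392 - 3)^2 / 4)
jb_reported
```

Under that assumption about n, the hand computation lands very close to the reported X-squared of 42159, which suggests that the large statistic is driven mainly by the heavy tails (kurtosis well above 3) rather than by the modest skewness.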

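# Residuals vs. fitted: linearity and homoscedasticity check
plot(predictions_train, residuals_lasso, xlab = "Predicted", ylab = "Residuals", main = "Residual vs. Fitted - Lasso")
abline(h = 0, col = "red")

par(mfrow = c(2, 2))

# Normal Q-Q plot of the residuals, with a reference line
qqnorm(residuals_lasso, main = 'Normal Q-Q')
qqline(residuals_lasso)

# Scale-Location: spread of the residuals across the fitted values
plot(predictions_train, sqrt(abs(residuals_lasso)), xlab = 'Predicted', ylab = 'Sqrt(|Residuals|)', main = 'Scale-Location')

# Jarque-Bera test for normality of the residuals
JarqueBera.test(residuals_lasso)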

## 
##  Jarque Bera Test
## 
## data:  residuals_lasso
## X-squared = 42159, df = 2, p-value < 2.2e-16
## 
## 
##  Skewness
## 
## data:  residuals_lasso
## statistic = 0.38017, p-value < 2.2e-16
## 
## 
##  Kurtosis
## 
## data:  residuals_lasso
## statistic = 8.8392, p-value < 2.2e-16

We used the Variance Inflation Factor (VIF) to measure the level of multicollinearity among the model’s predictors. VIF values greater than 5 or 10 are usually taken to indicate significant multicollinearity. High VIF values were found in this model for variables such as availability_ratio, latitude, longitude, and neighborhood_group_manhattan, indicating possible collinearity with other predictors. This collinearity could inflate the standard errors of the coefficients and so reduce the reliability of the model’s estimates. A high VIF does not imply that a variable is unimportant; it indicates that much of the information the variable carries is already provided by the other predictors.
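As a rough illustration of what the VIF measures, the base-R sketch below regresses each predictor on all the others and computes VIF_j = 1 / (1 - R_j^2). The data frame `X_demo` and the helper `vif_manual` are hypothetical and use simulated predictors, not the Airbnb features.

```r
# Simulated predictors: x2 is almost a copy of x1, x3 is independent
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X_demo <- data.frame(x1, x2, x3)

# VIF for each column: regress it on the remaining columns
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(X_demo)
vifs
```

In this toy example the two nearly collinear predictors receive very large VIFs while the independent predictor stays close to 1, which mirrors how correlated location variables such as latitude, longitude, and borough indicators can inflate one another.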

# Variable importance for the 20 most important features at lambda.min
vip(lasso_model, num_features = 20, geom = "col", include_type = TRUE, lambda = "lambda.min")

# Extract the non-zero Lasso coefficients at lambda.min
coeffs.table <- coeff2dt(fitobject = lasso_model, s = "lambda.min")

# Bar chart of the coefficients, colored by sign
coeffs.table %>%
  mutate(name = fct_reorder(name, desc(coefficient))) %>%
  ggplot() +
  geom_col(aes(y = name, x = coefficient, fill = coefficient > 0)) +
  xlab(label = "") +
  ggtitle(expression(paste("Lasso Coefficients with ", lambda, " = 0.0275"))) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), legend.position = "none")

5. Model selection

To concentrate on other factors for price prediction, the Linear Regression and Lasso models disregarded availability_365, neighborhood_group, room_type, and neighborhood characteristics.

The residuals of the Linear Regression model were centered around zero, indicating no significant bias in its predictions. The Lasso model’s residuals were likewise balanced around zero, suggesting an unbiased prediction model. Multicollinearity in the Linear Regression model may inflate the variance of the coefficient estimates and make their interpretation less trustworthy, whereas Lasso Regression mitigates multicollinearity by penalizing the coefficients of correlated predictors.

The error distribution and bias of both models exhibited similar tendencies, with very close MASE and MPE values (0.539/-0.594 for Linear Regression, 0.537/-0.589 for Lasso Regression). Lasso Regression showed slightly better results. With the percentage error computed as actual minus predicted, both models’ marginally negative MPEs suggest that predictions were, on average, slightly above the actual values, that is, a slight tendency to overpredict prices.
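The metrics discussed here can be written out in a few lines of base R. The sketch below assumes yardstick-style definitions: percentage errors use actual minus predicted in the numerator, and MASE scales the mean absolute error by the error of an in-sample one-step naive forecast. The function names and the toy vectors are hypothetical, and the exact definitions behind `multi_metric` may differ in detail.

```r
# Toy actual and predicted values (hypothetical, roughly on a transformed scale)
truth <- c(4.0, 5.0, 4.5, 5.5)
est   <- c(4.2, 5.1, 4.4, 5.8)

rmse_fn  <- function(truth, est) sqrt(mean((truth - est)^2))
mape_fn  <- function(truth, est) mean(abs((truth - est) / truth)) * 100
smape_fn <- function(truth, est) {
  mean(abs(est - truth) / ((abs(truth) + abs(est)) / 2)) * 100
}
# Signed percentage error: negative when predictions exceed actual values
mpe_fn   <- function(truth, est) mean((truth - est) / truth) * 100
# MAE divided by the MAE of a naive "previous value" forecast
mase_fn  <- function(truth, est) mean(abs(truth - est)) / mean(abs(diff(truth)))

metrics <- c(
  rmse  = rmse_fn(truth, est),
  mape  = mape_fn(truth, est),
  smape = smape_fn(truth, est),
  mpe   = mpe_fn(truth, est),
  mase  = mase_fn(truth, est)
)
round(metrics, 3)
```

In this toy case most predictions sit above the actual values, so the MPE comes out negative under the assumed sign convention, while a MASE below 1 means the predictions beat the naive forecast.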

The remarkably close Adjusted R-squared values (48.9% for Lasso and 48.5% for Linear Regression) indicated that the proportion of variance in the data explained by each model was similar. The similar distribution of the residuals for both models suggested that they were equally successful in terms of prediction bias and error variance.
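For reference, the adjusted R-squared formula used above, 1 - (1 - R^2)(n - 1)/(n - p - 1), can be checked against R's built-in value for an ordinary linear model. The sketch below uses the built-in `mtcars` data purely as a stand-in; `adj_r2` is a hypothetical helper.

```r
# Adjusted R-squared from actual values, fitted values, and predictor count
adj_r2 <- function(y, y_hat, p) {
  n <- length(y)
  sse <- sum((y - y_hat)^2)
  sst <- sum((y - mean(y))^2)
  r2 <- 1 - sse / sst
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Stand-in model with two predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)
manual <- adj_r2(mtcars$mpg, fitted(fit), p = 2)
reference <- summary(fit)$adj.r.squared

c(manual = manual, reference = reference)
```

For a Lasso fit the choice of p is a judgment call: counting every column of the design matrix gives a more conservative value than counting only the predictors with non-zero coefficients.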

The performance of the Lasso Regression model was somewhat better than that of the Linear Regression model, so we selected the Lasso model. This choice was based on its slightly superior performance metrics and its ability to handle a large number of features through automatic feature selection. However, the violations of important model assumptions raised concerns and could affect the reliability of the predictions and interpretations. Further refinement or exploration of different modeling techniques might be needed for better results.

In future work, we could investigate more complex models such as neural networks, Random Forest, or Gradient Boosting Machines to improve prediction accuracy and better handle non-linearity. Time-series analysis and a closer look at how prices evolve over time may shed light on seasonal patterns and price swings. Incorporating more precise geospatial data may make it possible to identify regional trends and provide more neighborhood-specific insights.

results_lm_tbl %>% 
  nice_table(cap='Model Comparison') %>% 
  scroll_box(width='100%')

Model Comparison
Model	mape	smape	mase	mpe	RMSE	Adjusted R2
Model 1: Linear Regression, cv	4.699	4.681	0.539	-0.594	0.181	0.485
Model 2: Lasso	4.678	4.660	0.537	-0.589	0.181	0.489

6. Conclusion

Both models performed comparably, so choosing between them depends on specific requirements and data characteristics. The linear regression model is appropriate when interpretability is important and the dataset is well understood and has little multicollinearity. Conversely, the Lasso model is better suited to datasets with a large number of features, particularly if some of those features may not be very important, because it performs feature selection automatically and helps prevent overfitting.

We therefore chose the Lasso Regression model for its feature selection capabilities. It proved to be an effective tool for handling multicollinearity and for streamlining the model by eliminating irrelevant predictors. The solution provides a solid starting point for understanding and predicting Airbnb prices in New York City. While the models demonstrate moderate predictive accuracy, there is room for improvement, particularly in addressing the assumptions of linear models.

With a better understanding of the major variables influencing rental prices, hosts can use the model’s insights to optimize their pricing strategy and possibly boost income and occupancy rates. Renters can improve their decision-making by using the model’s insights to identify competitive pricing and learn what factors might result in higher rental costs. Airbnb can use these insights for market analysis, to pinpoint areas with high demand, and to provide hosts with customized pricing recommendations. The results of the model can help investors understand the dynamics of the rental market in various neighborhoods and pinpoint profitable areas to invest in. Policymakers can use this information to better understand how Airbnb rentals affect the local housing market, which will help with regulation and urban planning.

References

Nwanganga, F., & Chapple, M. (2020). Practical Machine Learning in R. https://doi.org/10.1002/9781119591542
Get the Data. (n.d.). http://insideairbnb.com/get-the-data
Faraway, J. J. (2014). Linear Models with R, Second Edition. CRC Press.
Sheather, S. (2009). A Modern Approach to Regression with R. Springer Science & Business Media.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning: With Applications in R. Springer.
Faraway, J. J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press.

Essay: Predictive Modeling of Airbnb Prices in New York City

This essay outlines the journey from data preparation to the selection of predictive models for Airbnb listings in New York City, based on a dataset with diverse variables influencing rental prices. Source: http://insideairbnb.com/get-the-data

The initial dataset included 38,792 records, each representing an Airbnb listing with 18 different attributes. Key variables included geographical coordinates, room type, and various features related to the host and the property. The target variable for our analysis was price, representing the cost per night of a listing.

The first step in our analysis involved extensive data cleaning and preparation. We transformed categorical variables like neighbourhood_group, neighbourhood, and room_type into factor variables and converted last_review dates into a numerical format by calculating the number of days from the earliest review. Missing values in reviews_per_month were imputed with zeros, and columns not critical to our analysis, such as id, host_name, and license, were dropped. We applied one-hot encoding to neighbourhood_group and room_type and eliminated the redundant categories to prevent multicollinearity among the resulting indicator variables. We also created a new feature, availability_ratio, to provide a clearer understanding of each listing’s availability throughout the year.
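The preparation steps described above can be sketched in base R on a toy data frame. Everything here is illustrative: the `listings` data frame and its values are made up, the new column name `days_since_first_review` is hypothetical, and defining availability_ratio as availability_365 divided by 365 is an assumption rather than a confirmed detail of the analysis.

```r
# Toy stand-in for the listings data (hypothetical values)
listings <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Hotel room", "Private room"),
  last_review = as.Date(c("2023-01-10", NA, "2023-03-01", "2022-12-25")),
  reviews_per_month = c(1.2, NA, 0.4, 2.0),
  availability_365 = c(365, 0, 180, 73),
  stringsAsFactors = FALSE
)

# Categorical column to factor
listings$room_type <- factor(listings$room_type)

# Review date as number of days since the earliest review
first_review <- min(listings$last_review, na.rm = TRUE)
listings$days_since_first_review <- as.numeric(listings$last_review - first_review)

# Missing review rates imputed with zero
listings$reviews_per_month[is.na(listings$reviews_per_month)] <- 0

# Assumed definition: share of the year the listing is available
listings$availability_ratio <- listings$availability_365 / 365

# One-hot encoding, dropping the intercept so one reference level is omitted
dummies <- model.matrix(~ room_type, data = listings)[, -1, drop = FALSE]
dummies
```

Dropping one level per factor is what prevents the indicator columns from summing to a constant, which would otherwise make them perfectly collinear with the intercept.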

Through exploratory data analysis, we observed significant variations in listing prices and minimum night requirements. These insights were crucial in understanding the market dynamics and in guiding our modeling decisions. Analysis of latitude and longitude revealed clustering of listings, indicating popular areas within the city. This spatial distribution could be crucial in understanding price variations. Different room types and their prevalence across various neighborhoods were examined. The distribution highlighted trends in preferred lodging types, potentially influencing pricing strategies. Exploration of the price variable showed a wide range of values, with certain outliers indicating extremely high or low prices. This necessitated further scrutiny and potential transformation for more accurate modeling. The newly created availability_ratio indicated varied patterns of listing availability, which could correlate with pricing strategies employed by hosts. A thorough examination of correlations between variables helped in identifying multicollinearity issues, especially between geographical coordinates and certain neighborhood groups.

We split the data into training (75%) and testing (25%) sets, ensuring a fair evaluation of our models. The training set included 29,180 records, while the test set comprised 9,562 records, with 22 columns in each. Given the skewed distribution of several variables like price, minimum_nights, and reviews_per_month, we applied Box-Cox transformations to stabilize variance and improve normality. The minimum_nights column underwent a log(x + 1) transformation to avoid undefined values at zero. These transformations were needed because linear modeling methods like Linear Regression and Lasso Regression rely on assumptions about the distribution of the errors. Bringing these important variables closer to a normal distribution helps the models meet those assumptions, which in turn improves their precision and reliability.
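A minimal base-R sketch of the split and transformation steps follows. The simulated `prices` vector, the seed, and the fixed `lambda_hat` are all hypothetical; in practice the Box-Cox parameter would be estimated from the training data rather than set by hand.

```r
# Box-Cox transform: (y^lambda - 1) / lambda, with log(y) as the lambda = 0 limit
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Simple moment-based skewness, used only to compare before and after
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Right-skewed toy prices (hypothetical)
set.seed(123)
prices <- rlnorm(1000, meanlog = 5, sdlog = 0.7)

# 75/25 train/test split by random row indices
train_idx <- sample(seq_along(prices), size = floor(0.75 * length(prices)))
train <- prices[train_idx]
test <- prices[-train_idx]

# Assumed lambda for illustration; apply the same value to train and test
lambda_hat <- 0
train_bc <- box_cox(train, lambda_hat)
test_bc <- box_cox(test, lambda_hat)

c(raw = skewness(train), transformed = skewness(train_bc))
```

Choosing the transformation parameter on the training set and reusing it on the test set keeps information from the test data out of the modeling pipeline.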

Linear Regression: Our first model was a Linear Regression, tailored to capture the linear relationships between the predictors and the target variable. We excluded features that were highly correlated or caused high dimensionality (neighbourhood_group, room_type, availability_365, number_of_reviews, reviews_per_month). The model underwent 5-fold cross-validation to assess its generalizability. The final model had an adjusted R-squared of 48.5%, indicating a moderate fit to the data, and was statistically significant, with an F-statistic of 1,966 and a p-value of less than 2.2e-16. The features that most increased price were room_type_entire_home_apt and neighbourhood_group_manhattan; those that most decreased it were longitude and latitude.

The error metrics were computed on the transformed price scale. RMSE was 0.181, indicating the typical error magnitude. MAPE (Mean Absolute Percentage Error) was 4.69, meaning that the model’s predictions were, on average, 4.69% off from the actual values. SMAPE (Symmetric Mean Absolute Percentage Error), which modifies MAPE by normalizing each error by the average of the actual and predicted values, was 4.68, again indicating that the model was fairly accurate. MASE (Mean Absolute Scaled Error) was 0.539; in other words, the model’s mean absolute error was 46.1% lower than that of the naive forecast. MPE (Mean Percentage Error), which measures the bias of the predictions, was -0.594. With the percentage error computed as actual minus predicted, this negative value indicates that predictions were slightly above the actual values on average. These error metrics were low, indicating a respectable level of predictive accuracy.

However, the diagnostics (Q-Q plot, residuals vs. fitted plot, Jarque-Bera and Breusch-Pagan tests) suggested possible violations of the homoscedasticity and normality assumptions, pointing to the need for further refinement. The variance inflation factor (VIF) was used to quantify the extent to which multicollinearity with other predictors inflated the variance of each regression coefficient.

Lasso Regression: Next, we implemented a Lasso Regression model, which is particularly effective in feature selection and in handling multicollinearity. The model was tuned using 10-fold cross-validation to identify the optimal lambda value for regularization. The Lasso model showed an adjusted R-squared of 48.9%, slightly outperforming the Linear Regression model. The RMSE of 0.181 measures the typical deviation of the predicted values from the actual values on the transformed price scale. As in the Linear Regression model, neighbourhood_group_manhattan, room_type_entire_home_apt, and room_type_hotel_room were key predictors with positive coefficients. The latitude variable had a negative coefficient, which suggests that, with all other variables held constant, moving northward (increasing latitude) is associated with a drop in price.

MAPE was 4.68, meaning that the model’s predictions were, on average, about 4.7% off from the actual values. SMAPE, which normalizes each error by the average of the actual and predicted values, was 4.66, indicating that the model was fairly accurate. MASE was 0.537 and MPE was -0.589. With the percentage error computed as actual minus predicted, the negative MPE indicates that predictions were slightly above the actual values on average. However, like the Linear Regression model, the Lasso model displayed potential issues with the residual assumptions, underscoring the need for careful interpretation of the results.
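To illustrate how the Lasso penalty performs feature selection, the sketch below implements a bare-bones coordinate descent with soft-thresholding in base R. This is not the fitting routine used in the analysis; `soft_threshold`, `lasso_cd`, and the simulated data are hypothetical, and the objective assumed here is (1/2n) times the residual sum of squares plus lambda times the L1 norm of the coefficients, on standardized predictors.

```r
# Soft-thresholding operator: shrinks toward zero and clips small values to 0
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Coordinate descent for the Lasso on standardized predictors
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  X <- scale(X)
  y <- y - mean(y)
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Partial residual with predictor j left out
      r_j <- y - drop(X[, -j, drop = FALSE] %*% beta[-j])
      rho <- sum(X[, j] * r_j) / n
      beta[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

# Simulated data: only the first predictor matters
set.seed(2024)
n <- 500
X <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
y <- 3 * X[, 1] + rnorm(n)

beta_mid <- lasso_cd(X, y, lambda = 0.5)   # moderate penalty
beta_big <- lasso_cd(X, y, lambda = 100)   # penalty large enough to zero everything
beta_mid
beta_big
```

With a moderate penalty the irrelevant predictors are set exactly to zero while the relevant one is kept but shrunk, which is the behavior that cross-validation over lambda is balancing.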

To concentrate on other factors for price prediction, the Linear Regression and Lasso models disregarded availability_365, neighborhood_group, room_type, and neighborhood characteristics.

The RMSE values (0.181) for both models were nearly the same, indicating comparable levels of price prediction accuracy. In terms of average prediction error, the MAPE and SMAPE values were also almost equal (MAPE of 4.70 for Linear Regression and 4.68 for Lasso Regression), with Lasso Regression slightly ahead. The error distribution and bias of both models exhibited similar tendencies, with very close MASE and MPE values (0.539/-0.594 for Linear Regression, 0.537/-0.589 for Lasso Regression), again slightly in favor of Lasso Regression. With the percentage error computed as actual minus predicted, both models’ marginally negative MPEs suggest that predictions were, on average, slightly above the actual values. The remarkably close Adjusted R-squared values (48.9% for Lasso and 48.5% for Linear Regression) indicated that the proportion of variance in the data explained by each model was similar. The similar distribution of the residuals for both models suggested that they were equally successful in terms of prediction bias and error variance.

Homework 4

Daria Dubovskaia
