library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(ggplot2)
# Load the dataset
data <- read_delim("./AB_NYC_2019.csv", delim = ",")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows
head(data)
## # A tibble: 6 × 16
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <dbl> <chr>         <dbl> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
## 4  3831 Cozy Entir…    4869 LisaRoxa… Brooklyn            Clinton Hill      40.7
## 5  5022 Entire Apt…    7192 Laura     Manhattan           East Harlem       40.8
## 6  5099 Large Cozy…    7322 Chris     Manhattan           Murray Hill       40.7
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>
str(data)
## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Summary of Price

summary(data$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    69.0   106.0   152.7   175.0 10000.0

Minimum Price (Min): Some listings have a price of $0, likely indicating mistakes, promotional offers, or placeholder values.

First Quartile (1st Qu.): 25% of listings have a price of $69 or less. This indicates the lower range of typical Airbnb pricing in the dataset.

Median: The median price is $106, meaning half the listings are priced below this amount, and half are priced above. This is a reliable measure of central tendency, less influenced by outliers.

Mean: The average price is $152.70. The mean is higher than the median, which indicates that a relatively small number of high-priced listings skew the average upward.

Third Quartile (3rd Qu.): 75% of listings are priced at $175 or less. This represents the upper range of more typically priced listings.

Maximum Price (Max): The maximum price is $10,000, which is exceptionally high. Such listings are likely luxury properties or potentially erroneous entries.

Insights:

Boxplot of price to check the outliers

boxplot(data$price, main = "Price Distribution", ylab = "Price", col = "lightblue")

Q1_price <- quantile(data$price, 0.25, na.rm = TRUE)
Q3_price <- quantile(data$price, 0.75, na.rm = TRUE)
IQR_price <- Q3_price - Q1_price


lower_bound_price <- Q1_price - 1.5 * IQR_price
upper_bound_price <- Q3_price + 1.5 * IQR_price


outliers_price <- data %>%
  filter(price < lower_bound_price | price > upper_bound_price)

# Percentage of outliers
percentage_outliers <- (nrow(outliers_price) / nrow(data)) * 100
cat("Percentage of price outliers:", round(percentage_outliers, 2), "%\n")
## Percentage of price outliers: 6.08 %
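
As a quick check on the 1.5 x IQR rule used above, the fences can be worked out by hand from the quartiles reported in the price summary (Q1 = 69, Q3 = 175). This is only a sketch of the arithmetic; the variable names are illustrative and are not part of the analysis code.

```r
# Quartiles copied from summary(data$price) above
q1 <- 69
q3 <- 175
iqr <- q3 - q1                  # 106

lower_fence <- q1 - 1.5 * iqr   # -90
upper_fence <- q3 + 1.5 * iqr   # 334

c(lower = lower_fence, upper = upper_fence)
```

Because the lower fence is negative and prices cannot be, every listing counted in the 6.08% lies on the high side, above $334.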

Summary of Minimum Nights

summary(data$minimum_nights)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00    7.03    5.00 1250.00

Minimum (Min): 1.00. The shortest required stay is 1 night, which is typical for flexible reservations.

First Quartile (1st Qu.): 1.00. 25% of listings require just 1 night as a minimum stay, showing that short-term bookings are common.

Median: 3.00. The median minimum stay is 3 nights, indicating that half of the listings require 3 nights or fewer.

Mean: 7.03. The average minimum stay is approximately 7 nights. The mean exceeds the median because a few listings with very high minimum-night requirements pull the average up.

Third Quartile (3rd Qu.): 5.00. 75% of listings require 5 nights or fewer as the minimum stay.

Maximum (Max): 1250.00. The highest required stay is 1,250 nights, which is exceptionally high and likely a data-entry error or a listing aimed at long-term rentals.

Insights:

Summary of Number of Reviews

summary(data$number_of_reviews)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    5.00   23.27   24.00  629.00

Minimum (Min.): The minimum number of reviews is 0, meaning that some listings have no reviews yet.

First Quartile (1st Qu.): The 25th percentile value is 1, indicating that 25% of listings have 1 or fewer reviews.

Median: The median number of reviews is 5, indicating that half of the listings have 5 or fewer reviews. Many listings therefore receive only a small number of reviews.

Mean: The mean number of reviews is 23.27, which exceeds the median. A few listings with a very high number of reviews raise the average.

Third Quartile (3rd Qu.): The 75th percentile value is 24, meaning that 75% of listings have 24 or fewer reviews.

Maximum (Max.): The maximum number of reviews is 629, an extreme figure compared to the other percentiles. This points to significant outliers (listings with an exceptionally high number of reviews).

Insights:

Exploratory Data Analysis (EDA)

# Check for missing values
colSums(is.na(data))
##                             id                           name 
##                              0                             16 
##                        host_id                      host_name 
##                              0                             21 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                          10052                          10052 
## calculated_host_listings_count               availability_365 
##                              0                              0
missing_data <- data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing") %>%
  mutate(Percent = (Missing / nrow(data)) * 100)


ggplot(missing_data, aes(x = reorder(Column, -Percent), y = Percent)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Proportion of Missing Data",
    x = "Columns",
    y = "Percentage of Missing Values"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Key Observations:

Most columns, including room_type, price, number_of_reviews, and neighbourhood_group, show no missing data.

The last_review and reviews_per_month columns are each missing 10,052 values (about 20.6% of rows), a significant gap for these attributes. The identical counts are consistent with listings that have never been reviewed.

The name and host_name columns have very few missing values (16 and 21, respectively).

Interpretation:

Fill missing values

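data_unnull <- data %>%
  mutate(
    # Numeric: impute with the median of the observed values
    reviews_per_month = ifelse(is.na(reviews_per_month), median(reviews_per_month, na.rm = TRUE), reviews_per_month),
    # Categorical: label missing host names explicitly
    host_name = ifelse(is.na(host_name), "Unknown", host_name),
    # Date: coalesce() keeps the Date class, which ifelse() would drop
    last_review = coalesce(last_review, as.Date("2000-01-01"))
  )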

# Verify if missing values are handled
missing_data <- data_unnull %>%
  summarise(across(everything(), ~sum(is.na(.))))
missing_data
## # A tibble: 1 × 16
##      id  name host_id host_name neighbourhood_group neighbourhood latitude
##   <int> <int>   <int>     <int>               <int>         <int>    <int>
## 1     0    16       0         0                   0             0        0
## # ℹ 9 more variables: longitude <int>, room_type <int>, price <int>,
## #   minimum_nights <int>, number_of_reviews <int>, last_review <int>,
## #   reviews_per_month <int>, calculated_host_listings_count <int>,
## #   availability_365 <int>

Missing values in the dataset were addressed as follows:

Numeric (reviews_per_month): Missing values were replaced with the median value to preserve distribution consistency.

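Categorical (host_name): Missing values were replaced with “Unknown” to prevent row loss.

Date (last_review): Missing values were filled with a placeholder date (January 1, 2000) to ensure completeness. This placeholder is not a real review date and should be treated as such in any date-based analysis.

These steps remove the missing values from the three filled columns. The name column was not filled and still has 16 missing values, as the check above shows.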
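
One point worth checking when filling Date columns is that base R's `ifelse()` does not preserve the Date class: it returns a plain numeric vector of days since 1970-01-01. The sketch below uses a hypothetical two-element vector rather than the dataset to show the difference against a class-preserving fill.

```r
# Hypothetical example vector, not taken from the dataset
dates <- as.Date(c("2019-05-21", NA))
placeholder <- as.Date("2000-01-01")

# ifelse() strips attributes, so the result is a bare numeric vector
via_ifelse <- ifelse(is.na(dates), placeholder, dates)
class(via_ifelse)

# Assigning into a copy of the vector keeps the Date class
via_assign <- dates
via_assign[is.na(via_assign)] <- placeholder
class(via_assign)
```

`dplyr::coalesce()` and `dplyr::if_else()` are class-preserving alternatives that fit naturally inside `mutate()`.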

Comparison of Missing Values Before and After Filling

# Before filling
missing_before <- data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing")

# After filling
missing_after <- data_unnull %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing")

# Combine data
missing_combined <- rbind(
  missing_before %>% mutate(Stage = "Before Filling"),
  missing_after %>% mutate(Stage = "After Filling")
)

# Plot
ggplot(missing_combined, aes(x = reorder(Column, -Missing), y = Missing, fill = Stage)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Comparison of Missing Values Before and After Filling",
    x = "Columns",
    y = "Count of Missing Values",
    fill = "Stage"
  )

Interpretation:

  1. Before filling (represented in blue), the columns reviews_per_month and last_review had a significant number of missing values.

  2. After filling (represented in red), the missing counts for reviews_per_month, last_review, and host_name drop to zero. The name column was not filled and keeps its 16 missing values.

Price Distribution

ggplot(data_unnull, aes(x = price)) +
  # binwidth is in dollars here; 0.1 would spread prices over ~100,000 near-empty bins
  geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Price Distribution of Airbnb Listings in NYC",
       x = "Price", y = "Count")

This plot shows the price distribution of Airbnb listings in New York City. The distribution is right-skewed, with a long tail toward high prices: most listings are lower-priced, and comparatively few are expensive.

To address the skewness and potentially achieve a more normal distribution, it would be appropriate to use a log transformation of the price data. This involves taking the natural logarithm of each price value, which can help to normalize the distribution and reduce the impact of the long tail.
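
The effect of the transformation can be illustrated on simulated data. The sketch below draws hypothetical right-skewed "prices" from a log-normal distribution (the parameters are arbitrary assumptions, not estimates from the dataset) and compares the gap between mean and median before and after taking logs.

```r
set.seed(42)

# Simulated right-skewed prices, for illustration only
sim_price <- rlnorm(1000, meanlog = 4.7, sdlog = 0.7)

# Raw scale: the long right tail pulls the mean above the median
raw_gap <- mean(sim_price) - median(sim_price)

# Log scale: mean and median nearly coincide
log_gap <- mean(log(sim_price)) - median(log(sim_price))

c(raw_gap = raw_gap, log_gap = log_gap)
```

Note that `log(0)` is `-Inf`, which is why listings with a price of 0 have to be removed before the transformation.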

Log-Transformed Price

# Add log-transformed price to the dataset
data_unnull <- data_unnull %>%
  filter(!is.na(price) & price > 0) %>%
  mutate(log_price = log(price))
ggplot(data_unnull, aes(x = log_price)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Log(Price) Distribution of Airbnb Listings in NYC",
       x = "Log(Price)", y = "Count")

Distribution of Room Types in NYC

ggplot(data, aes(x = room_type)) + 
  geom_bar(fill = "skyblue") + 
  theme_minimal() + 
  labs(title = "Distribution of Room Types in NYC", x = "Room Type", y = "Count")

Interpretation:

The distribution indicates that the majority of Airbnb listings in NYC are for entire homes/apartments, which are probably more appealing for larger groups or travelers looking for privacy.

Private rooms seem to be the second most common choice, frequently selected for more budget-friendly stays where travelers share a space with the host.

Shared rooms are the least common, likely due to privacy issues or lower demand for shared accommodations.

Geographical Distribution of Listings

library(ggplot2)

ggplot(data_unnull, aes(x = longitude, y = latitude, color = price)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Geographical Distribution of Listings", x = "Longitude", y = "Latitude")

This map plots each Airbnb listing in NYC at its longitude and latitude, with price shown on a color gradient from blue (low) to red (high).

Insights for Stakeholders:

Correlation Heatmap of Key Variables

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(data_unnull %>% select(log_price, reviews_per_month, minimum_nights, number_of_reviews), 
       label = TRUE, label_round = 2, 
       palette = "RdYlBu", name = "Correlation") +
  labs(title = "Correlation Heatmap of Key Variables")

The rows represent the variables number_of_reviews, minimum_nights, reviews_per_month, and log_price.

The columns display the same variables, and each intersection shows the correlation coefficient for that pair. The coefficients range from -1 to 1, where blue signifies a negative correlation, white shows no correlation, and red represents a positive correlation. The intensity of the color reflects the strength of the correlation.

Some key observations:

This heatmap offers a quick visual overview of how the various variables are interconnected.
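
The numbers printed on a heatmap like this are ordinary correlation coefficients, so they can be cross-checked with base R's `cor()`. The sketch below uses a small made-up data frame (the values are hypothetical, not from the dataset) and assumes Pearson correlation computed on pairwise-complete observations.

```r
# Toy data with the same kind of columns; values are invented
toy <- data.frame(
  log_price         = c(4.2, 5.0, 5.4, 4.6, 5.1),
  reviews_per_month = c(2.1, 0.4, 0.3, 1.5, NA),
  minimum_nights    = c(1, 3, 30, 2, 5)
)

# Pearson correlations, using every row available for each pair
cor_matrix <- cor(toy, method = "pearson", use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, which is why the heatmap only needs to show one triangle.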

Log(Price) vs Reviews per Month by Room Type

ggplot(data_unnull, aes(x = reviews_per_month, y = log_price, color = room_type)) +
  geom_point(alpha = 0.5) +
  #geom_smooth(method="lm", se= FALSE, aes(color=room_type))+
  labs(title = "Log(Price) vs Avg Reviews per Month by Room Type", 
       x = "Avg Reviews per Month", y = "Log(Price)", color = "Room Type") +
  theme_minimal()

This scatter plot shows the relationship between log(price) and the average number of reviews per month, colored by room type, for NYC Airbnb listings.

This suggests that both room type and popularity (as indicated by reviews) are associated with price.

Log(Price) vs Minimum nights

ggplot(data_unnull, aes(x = minimum_nights, y = log_price, color = room_type)) +
  geom_point(alpha = 0.5) +
  labs(title = "Log(Price) vs Minimum nights",
       x = "Minimum nights",
       y = "Log(Price)", color = "Room Type") +
  theme_minimal()

This plot illustrates the connection between the logarithm of the price and the minimum number of nights needed for Airbnb listings in New York City, categorized by room type.

The key observations are:

  1. The plot may suggest a slight positive relationship between price and minimum nights, but it is weak: once room type is accounted for in the regression model below, the main effect of minimum_nights is not statistically significant (p = 0.182).

  2. The “Entire home/apt” listings display a broader range in both price and minimum nights when compared to the other room types.

  3. The “Private room” and “Shared room” listings tend to group more closely, with lower prices and fewer minimum nights on average than entire homes/apartments.

Overall, this plot offers insights into how the pricing and minimum night requirements vary among the different Airbnb room types in the NYC market.

Hypothesis 1 Price Distribution by Neighborhood Group

anova_price_neighborhood <- aov(log_price ~ neighbourhood_group, data = data_unnull)
anova_price_neighborhood_sum<-summary(anova_price_neighborhood)
print(anova_price_neighborhood_sum)
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## neighbourhood_group     4   3144   786.0    1857 <2e-16 ***
## Residuals           48879  20687     0.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(data_unnull, aes(x = neighbourhood_group, y = log_price)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Price Distribution by Neighborhood Group", x = "Neighborhood Group", y = "Log(Price)")

Insights:

Manhattan generally exhibits a higher price range than neighbourhood groups such as Brooklyn or Queens, as shown by the elevated median and upper range of its boxplot.

Price variability differs between neighbourhood groups: some contain many costly listings, while others offer mostly affordable options.

A taller box indicates greater price variability within a neighbourhood group, suggesting a wider array of listings at varying price points.

f_value <- anova_price_neighborhood_sum[[1]]$`F value`[1]
p_value <- anova_price_neighborhood_sum[[1]]$`Pr(>F)`[1]

# Print results
cat("Hypothesis Test Results:\n")
## Hypothesis Test Results:
cat("F-value:", f_value, "\n")
## F-value: 1857.211
cat("p-value:", p_value, "\n")
## p-value: 0
if (p_value < 0.05) {
  cat("Conclusion: Reject the null hypothesis. Neighborhood group significantly affects price.\n")
} else {
  cat("Conclusion: Fail to reject the null hypothesis. Neighborhood group does not significantly affect price.\n")
}
## Conclusion: Reject the null hypothesis. Neighborhood group significantly affects price.

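Hypothesis:

Null hypothesis (H0): mean log price is the same across all neighbourhood groups. Alternative hypothesis (H1): at least one neighbourhood group has a different mean log price.

Test Statistics:

F = 1857.21 on 4 and 48,879 degrees of freedom, p < 2e-16.

Given that the p-value is far below 0.05, we reject the null hypothesis: mean log price differs significantly across neighbourhood groups. The printed p-value of 0 reflects a value too small to represent in floating point, not an exact zero.

The very low p-value is strong evidence against the null hypothesis, supporting the alternative hypothesis that neighbourhood group plays a significant role in the price of listings.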
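
Because the p-value prints as 0, it can be useful to compute it on the log scale instead. The sketch below re-derives the upper-tail probability from the F statistic and degrees of freedom shown in the ANOVA table, using `pf()` with `log.p = TRUE`; the variable names are illustrative.

```r
# Values copied from the ANOVA output above
f_stat     <- 1857.211
df_between <- 4
df_within  <- 48879

# Upper-tail probability on the natural-log scale, which avoids underflow
log_p <- pf(f_stat, df_between, df_within, lower.tail = FALSE, log.p = TRUE)

# Base-10 exponent, so that p is roughly 10^log10_p
log10_p <- log_p / log(10)
log10_p
```

Reporting the exponent, or simply `p < 2e-16` as in the ANOVA table, is more accurate than reporting a p-value of exactly zero.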

Hypothesis 2 Price Distribution by Room Type

anova_result <- aov(log_price ~ room_type, data = data_unnull)

anova_summary <- summary(anova_result)
print(anova_summary)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## room_type       2   9192    4596   15347 <2e-16 ***
## Residuals   48881  14639       0                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(data_unnull, aes(x = room_type, y = log_price)) +
  geom_violin(fill = "lightblue", alpha = 0.7) +
  labs(
    title = "Price Distribution by Room Type",
    x = "Room Type",
    y = "Log(Price)"
  ) +
  theme_minimal()

Key Insights:

f_value <- anova_summary[[1]]$`F value`[1]
p_value <- anova_summary[[1]]$`Pr(>F)`[1]

cat("Hypothesis Test Results:\n")
## Hypothesis Test Results:
cat("F-value:", f_value, "\n")
## F-value: 15347.09
cat("p-value:", p_value, "\n")
## p-value: 0
# Interpretation
if (p_value < 0.05) {
  cat("Conclusion: Reject the null hypothesis. Room type significantly affects price.\n")
} else {
  cat("Conclusion: Fail to reject the null hypothesis. Room type does not significantly affect price.\n")
}
## Conclusion: Reject the null hypothesis. Room type significantly affects price.

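Hypothesis:

Null hypothesis (H0): mean log price is the same for all room types. Alternative hypothesis (H1): at least one room type has a different mean log price.

Test Statistics:

F = 15347.09 on 2 and 48,881 degrees of freedom, p < 2e-16.

Given that the p-value is far below 0.05, we reject the null hypothesis. This suggests that room type significantly affects the pricing of listings on Airbnb.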
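
With nearly 49,000 listings, even small differences become statistically significant, so an effect size helps put the two ANOVA results in context. The sketch below computes eta squared (the share of total variation explained) from the rounded sums of squares printed in the two ANOVA tables; it is an approximation based on those printed values, and the data frame is illustrative.

```r
# Sums of squares copied from the two ANOVA tables above (rounded as printed)
ss <- data.frame(
  term     = c("neighbourhood_group", "room_type"),
  ss_model = c(3144, 9192),
  ss_resid = c(20687, 14639)
)

# Eta squared = SS_model / (SS_model + SS_residual)
ss$eta_sq <- ss$ss_model / (ss$ss_model + ss$ss_resid)
ss
```

On these figures, room type accounts for roughly 39% of the variation in log price, compared with roughly 13% for neighbourhood group.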

Modelling

model<- lm(log_price ~ room_type * (reviews_per_month + minimum_nights), data = data_unnull)
summary(model)
## 
## Call:
## lm(formula = log_price ~ room_type * (reviews_per_month + minimum_nights), 
##     data = data_unnull)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8552 -0.3689 -0.0565  0.2872  5.0285 
## 
## Coefficients:
##                                           Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                              5.1660477  0.0047765 1081.551  < 2e-16
## room_typePrivate room                   -0.8543037  0.0068382 -124.931  < 2e-16
## room_typeShared room                    -1.1644404  0.0225627  -51.609  < 2e-16
## reviews_per_month                       -0.0196839  0.0024495   -8.036 9.51e-16
## minimum_nights                          -0.0002011  0.0001506   -1.335    0.182
## room_typePrivate room:reviews_per_month  0.0129351  0.0033203    3.896 9.80e-05
## room_typeShared room:reviews_per_month  -0.0149977  0.0116532   -1.287    0.198
## room_typePrivate room:minimum_nights    -0.0010949  0.0002713   -4.036 5.44e-05
## room_typeShared room:minimum_nights     -0.0005420  0.0005341   -1.015    0.310
##                                            
## (Intercept)                             ***
## room_typePrivate room                   ***
## room_typeShared room                    ***
## reviews_per_month                       ***
## minimum_nights                             
## room_typePrivate room:reviews_per_month ***
## room_typeShared room:reviews_per_month     
## room_typePrivate room:minimum_nights    ***
## room_typeShared room:minimum_nights        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5466 on 48875 degrees of freedom
## Multiple R-squared:  0.3872, Adjusted R-squared:  0.3871 
## F-statistic:  3860 on 8 and 48875 DF,  p-value: < 2.2e-16

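Based on the linear regression model summary, here are the main insights:

  1. Room Type: Relative to entire homes/apartments (the reference level), private rooms (estimate -0.854) and shared rooms (-1.164) have significantly lower log prices, which corresponds to prices roughly 57% and 69% lower when reviews per month and minimum nights are held at zero.
  2. Reviews per Month: For entire homes/apartments, each additional review per month is associated with about a 2% lower price (-0.0197, p < 0.001). The negative slope is weaker for private rooms (interaction +0.0129, p < 0.001); the shared-room interaction is not significant (p = 0.198).
  3. Minimum Nights: The main effect is not significant (-0.0002, p = 0.182). Private rooms show a small but significant additional negative slope (-0.0011, p < 0.001); the shared-room interaction is not significant (p = 0.310).
  4. Model Fit: The model explains about 39% of the variance in log price (R-squared 0.3872), with a residual standard error of 0.5466 and an overall F-statistic of 3860 (p < 2.2e-16).

Overall, the model indicates that room type has by far the strongest relationship with Airbnb listing prices in New York City. Reviews per month has a small negative association with price, and minimum nights matters only for private rooms, with the size of these relationships varying among room types.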
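
Since the response is log(price), coefficients are easier to read after converting them to percentage changes with exp(b) - 1. The sketch below does this for the room-type estimates copied from the coefficient table, and also shows how an interaction term combines with a main effect to give the slope for a specific room type. The vector names are illustrative.

```r
# Estimates copied from the coefficient table above
coefs <- c(
  private_room      = -0.8543037,
  shared_room       = -1.1644404,
  reviews_per_month = -0.0196839,
  private_x_reviews =  0.0129351
)

# In a log-price model, exp(b) - 1 is the proportional change in price
pct_change <- (exp(coefs[c("private_room", "shared_room")]) - 1) * 100
round(pct_change, 1)

# Slope of reviews_per_month for private rooms:
# main effect plus the private-room interaction
slope_private <- unname(coefs["reviews_per_month"] + coefs["private_x_reviews"])
slope_private
```

Because the model includes interactions, the room-type percentages apply at the baseline where reviews per month and minimum nights are both zero.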

Model Diagnostics

Residuals and fitted values

# Add residuals and fitted values to data
data_unnull$model_resid <- residuals(model)
data_unnull$model_fitted <- fitted(model)

# Residuals vs Fitted with ggplot
ggplot(data_unnull, aes(x = model_fitted, y = model_resid, color = room_type)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Residuals vs Fitted by Room Type", 
       x = "Fitted Values", 
       y = "Residuals", 
       color = "Room Type") +
  theme_minimal()

This plot shows the residuals compared to the fitted values for various room types (Entire home/apt, Private room, Shared room) in the regression analysis.

1.  Residuals for entire home/apt (red) display a greater spread at higher fitted values, suggesting that the error variance is not constant across price predictions.
2.  Residuals for shared room (blue) and private room (green) are more tightly clustered around the zero line, indicating less variance.

Histogram of Residuals

par(mfrow = c(1, 1))

hist_residuals <- resid(model)
hist(hist_residuals, breaks = 30, col = "lightblue", border = "black",
     main = "Histogram of Residuals", xlab = "Residuals")

This histogram shows the distribution of the residuals, which are the differences between the actual and predicted values in a statistical model.

Key Observations:

Q-Q plot for residuals

# Q-Q plot for residuals
qqnorm(residuals(model), main = "Q-Q Plot of Residuals")
qqline(residuals(model), col = "red", lwd = 2)

This Q-Q plot assesses the normality of residuals from the regression model fitted to the Airbnb NYC dataset.

Overall, the residuals are roughly normal, which supports the normality assumption of the regression model, although the residual range reported in the model summary (-2.86 to 5.03) points to a heavier right tail.

Residuals vs Leverage Plot

library(ggplot2)

# Calculate diagnostics
leverage <- hatvalues(model)
std_residuals <- rstandard(model)
cooks_distance <- cooks.distance(model)

# Create a data frame for plotting
diag_data <- data.frame(
  Leverage = leverage,
  StandardizedResiduals = std_residuals,
  RoomType = data_unnull$room_type,
  CookDistance = cooks_distance
)

# Plot with interaction by room_type
ggplot(diag_data, aes(x = Leverage, y = StandardizedResiduals, color = RoomType)) +
  geom_point(aes(size = CookDistance), alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "Residuals vs Leverage Plot by Room Type",
    x = "Leverage",
    y = "Standardized Residuals",
    size = "Cook's Distance",
    color = "Room Type"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("red", "blue", "green"))

This Residuals vs. Leverage Plot by Room Type offers perspectives on possible influential points within the Airbnb NYC dataset:

  1. Clusters of Residuals:
  2. High-Leverage Outlier:
  3. Cook’s Distance:

Recommendation: This high-leverage outlier warrants further examination to assess its influence on the model and determine if it signifies valid data or should be excluded.
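
One way to follow up on influential points is to list the observations whose Cook's distance exceeds a cutoff. The sketch below uses the common 4/n rule of thumb, which is an assumption here rather than a threshold used in the analysis above, and it runs on simulated data with one deliberately extreme point instead of the Airbnb model.

```r
set.seed(1)

# Simulated data: 50 ordinary points plus one extreme point (row 51)
x <- c(rnorm(50), 10)
y <- 1 + 2 * x + rnorm(51)
y[51] <- -20                      # far from the trend at a high-leverage x

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)

# Rule-of-thumb cutoff (assumed): flag points with Cook's distance > 4 / n
cutoff  <- 4 / length(cd)
flagged <- which(cd > cutoff)
flagged
```

Refitting the model without the flagged rows and comparing coefficients is a simple way to see how much those points drive the results.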

Conclusion:

The Airbnb NYC dataset highlights important trends and factors associated with listing prices and demand. Higher prices are linked to entire homes/apartments and to certain neighbourhood groups such as Manhattan. Minimum stay requirements show little independent relationship with price: the main effect in the model is not significant, with a small negative effect for private rooms. Private and shared rooms present more budget-friendly options. Listings with more reviews per month tend to be priced slightly lower, consistent with sought-after listings being priced competitively.

The model explains about 39% of the variance in log price, and residual analysis identifies some outliers and influential points that require further examination. Although the price distribution mostly conforms to expectations, some variations reveal opportunities for model enhancement. Recommendations for hosts include setting competitive prices, optimizing minimum stays, and ensuring steady reviews. For guests, well-reviewed listings and shared accommodations tend to offer better value. Policymakers and analysts might look more closely at pricing inconsistencies and less utilized areas to achieve a more balanced market.

Future Work: Incorporate additional predictors, analyze outliers, and examine spatial and temporal patterns to enhance predictions.