Project

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(tsibble)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(ggplot2)

# Load the dataset
data <- read_delim("./AB_NYC_2019.csv", delim = ",")

## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows
head(data)

## # A tibble: 6 × 16
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <dbl> <chr>         <dbl> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
## 4  3831 Cozy Entir…    4869 LisaRoxa… Brooklyn            Clinton Hill      40.7
## 5  5022 Entire Apt…    7192 Laura     Manhattan           East Harlem       40.8
## 6  5099 Large Cozy…    7322 Chris     Manhattan           Murray Hill       40.7
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>

str(data)

## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Summary of Price

summary(data$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    69.0   106.0   152.7   175.0 10000.0

Minimum Price (Min): Some listings have a price of $0, likely indicating mistakes, promotional offers, or placeholder values.

First Quartile (1st Qu.): 25% of listings have a price of $69 or less. This indicates the lower range of typical Airbnb pricing in the dataset.

Median: The median price is $106, meaning half the listings are priced below this amount, and half are priced above. This is a reliable measure of central tendency, less influenced by outliers.

Mean: The average price is $152.7. The mean is higher than the median, suggesting the existence of high-priced listings, which skew the average upward.

Third Quartile (3rd Qu.): 75% of listings are priced at or below $175. This represents the upper range of more typically priced listings.

Maximum Price (Max): The maximum price is $10,000, which is exceptionally high. Listings at this level are likely luxury properties or potentially erroneous entries.

Insights:

The dataset presents a variety of prices, ranging from no cost or very low-cost options to very expensive ones.
The skewness (mean > median) suggests a right-skewed distribution with outliers in the higher range.

Boxplot of price to check the outliers

boxplot(data$price, main = "Price Distribution", ylab = "Price", col = "lightblue")

Q1_price <- quantile(data$price, 0.25, na.rm = TRUE)
Q3_price <- quantile(data$price, 0.75, na.rm = TRUE)
IQR_price <- Q3_price - Q1_price


lower_bound_price <- Q1_price - 1.5 * IQR_price
upper_bound_price <- Q3_price + 1.5 * IQR_price


outliers_price <- data %>%
  filter(price < lower_bound_price | price > upper_bound_price)

# Percentage of outliers
percentage_outliers <- (nrow(outliers_price) / nrow(data)) * 100
cat("Percentage of price outliers:", round(percentage_outliers, 2), "%\n")

## Percentage of price outliers: 6.08 %
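The fences behind this percentage can also be worked out by hand from the quartiles in the price summary (Q1 = 69, Q3 = 175), which makes it easier to see which prices get flagged. A minimal sketch; the variable names are illustrative:

```r
# Quartiles taken from the summary(data$price) output
q1 <- 69
q3 <- 175
iqr <- q3 - q1                  # 106

# Tukey's 1.5 * IQR fences
lower_fence <- q1 - 1.5 * iqr   # -90
upper_fence <- q3 + 1.5 * iqr   # 334

cat("Lower fence:", lower_fence, "\n")
cat("Upper fence:", upper_fence, "\n")
```

Because the minimum price is 0, nothing falls below the lower fence of -90, so every flagged outlier is a listing priced above $334.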

Summary of Minimum Nights

summary(data$minimum_nights)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00    7.03    5.00 1250.00

Minimum (Min): The shortest required stay is 1 night, which is typical for flexible reservations.

First Quartile (1st Qu.): 25% of listings require just 1 night as a minimum stay, showing that short-term bookings are common.

Median: The median minimum stay is 3 nights, indicating that half of the listings require 3 nights or fewer.

Mean: The average minimum stay is approximately 7 nights (7.03). The mean exceeds the median, suggesting that some listings with very high minimum night requirements are pulling the average up.

Third Quartile (3rd Qu.): 75% of listings require 5 nights or fewer as the minimum stay.

Maximum (Max): The highest required stay is 1,250 nights, which is exceptionally high and likely a data-entry error or a listing aimed at long-term rentals.

Insights:

Short-term rentals dominate the dataset, with 75% of listings requiring 5 nights or fewer.
The elevated mean in relation to the median indicates the existence of outliers with significantly longer minimum stay requirements.
The maximum figure of 1250 nights is probably unrealistic or designed for specific use cases.

Summary of Number of Reviews

summary(data$number_of_reviews)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    5.00   23.27   24.00  629.00

Minimum (Min.): The minimum number of reviews is 0, signifying that some listings have no reviews at all.

First Quartile (1st Qu.): The 25th percentile value is 1, indicating that 25% of listings have 1 or fewer reviews.

Median: The median number of reviews is 5, indicating that half of the listings have 5 or fewer reviews. This implies that many listings receive only a small number of reviews.

Mean: The mean (average) number of reviews is 23.27, which exceeds the median. This indicates the existence of a few listings with a very high number of reviews, raising the average.

Third Quartile (3rd Qu.): The 75th percentile value is 24, meaning that 75% of listings have 24 or fewer reviews.

Maximum (Max.): The maximum number of reviews is 629, an extreme figure compared to the other percentiles. This points to the existence of significant outliers (listings with an exceptionally high number of reviews).

Insights:

Skewed Distribution: The mean is considerably higher than the median, suggesting a right-skewed distribution.
Most listings have relatively few reviews, but a small number of listings have exceptionally high review counts.

Exploratory Data Analysis(EDA)

# Check for missing values
colSums(is.na(data))

##                             id                           name 
##                              0                             16 
##                        host_id                      host_name 
##                              0                             21 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                          10052                          10052 
## calculated_host_listings_count               availability_365 
##                              0                              0

missing_data <- data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing") %>%
  mutate(Percent = (Missing / nrow(data)) * 100)


ggplot(missing_data, aes(x = reorder(Column, -Percent), y = Percent)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Proportion of Missing Data",
    x = "Columns",
    y = "Percentage of Missing Values"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Key Observations:

Most columns, including room_type, price, number_of_reviews, and neighbourhood_group, show no missing data.

The last_review and reviews_per_month columns are each missing about 20% of their values (10,052 of 48,895 rows), a significant gap for these attributes.

Columns like name and host_name have very few missing values (16 and 21, respectively).

Interpretation:

The majority of the dataset seems complete, but the attributes last_review and reviews_per_month require focus.
These missing values could stem from properties that have not been reviewed yet, as these fields are typically filled based on user activity.
When missing values in a dataset exceed a certain threshold, it can lead to significant problems in data analysis, modeling, and decision-making.
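One way to test the idea that the missing review fields belong to listings that have never been reviewed is to cross-tabulate missingness against a zero review count. The sketch below uses a small made-up data frame (`toy`, hypothetical values) rather than the real file; on the actual data the same `table()` call could be pointed at `data`.

```r
# Hypothetical toy data mimicking the relevant columns
toy <- data.frame(
  number_of_reviews = c(9, 0, 45, 0, 3),
  reviews_per_month = c(0.21, NA, 0.38, NA, 0.10),
  last_review       = as.Date(c("2018-10-19", NA, "2019-05-21", NA, "2019-01-02"))
)

# Cross-tabulate "has zero reviews" against "reviews_per_month is missing"
check <- table(
  zero_reviews = toy$number_of_reviews == 0,
  missing_rpm  = is.na(toy$reviews_per_month)
)
print(check)

# TRUE if missingness and zero review counts coincide row by row
all_match <- all((toy$number_of_reviews == 0) == is.na(toy$reviews_per_month))
print(all_match)
```

If the off-diagonal cells are also zero on the real data, the missing values are structural rather than random, which may be worth keeping in mind when choosing a fill value.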

Fill missing values

data_unnull <- data %>%
  mutate(
    # Numeric: fill with the column median
    reviews_per_month = coalesce(reviews_per_month, median(reviews_per_month, na.rm = TRUE)),
    # Categorical: fill with a placeholder label
    host_name = coalesce(host_name, "Unknown"),
    # Date: coalesce() keeps the Date class, whereas ifelse() would return bare numbers
    last_review = coalesce(last_review, as.Date("2000-01-01"))
  )

# Verify if missing values are handled
missing_data <- data_unnull %>%
  summarise(across(everything(), ~sum(is.na(.))))
missing_data

## # A tibble: 1 × 16
##      id  name host_id host_name neighbourhood_group neighbourhood latitude
##   <int> <int>   <int>     <int>               <int>         <int>    <int>
## 1     0    16       0         0                   0             0        0
## # ℹ 9 more variables: longitude <int>, room_type <int>, price <int>,
## #   minimum_nights <int>, number_of_reviews <int>, last_review <int>,
## #   reviews_per_month <int>, calculated_host_listings_count <int>,
## #   availability_365 <int>

Missing values in the dataset were addressed as follows:

Numeric (reviews_per_month): Missing values were replaced with the median value to preserve distribution consistency.

Categorical (host_name): Missing values were completed with “Unknown” to prevent row loss.

Date (last_review): Missing values were filled with a default date (January 1, 2000) to ensure completeness.

After these steps, the only missing values left are the 16 missing entries in name, which was not filled. The dataset is otherwise complete and ready for analysis and modeling.
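When filling a date column, it is worth confirming that the result is still a Date. Base R's `ifelse()` drops attributes, so it returns plain day counts, whereas assigning into the missing positions keeps the class. A small illustration with a made-up vector `d`:

```r
# Hypothetical date vector with one missing value
d <- as.Date(c("2018-10-19", NA, "2019-05-21"))
default_date <- as.Date("2000-01-01")

# ifelse() strips the Date class and returns days since 1970-01-01
filled_ifelse <- ifelse(is.na(d), default_date, d)
print(class(filled_ifelse))   # "numeric"

# Assigning into the NA positions preserves the Date class
filled_assign <- d
filled_assign[is.na(filled_assign)] <- default_date
print(class(filled_assign))   # "Date"
```

Inside `mutate()`, dplyr's `coalesce()` and `if_else()` are type-stable alternatives that behave like the second variant.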

Comparison of Missing Values Before and After Filling

# Before filling
missing_before <- data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing")

# After filling
missing_after <- data_unnull %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing")

# Combine data
missing_combined <- rbind(
  missing_before %>% mutate(Stage = "Before Filling"),
  missing_after %>% mutate(Stage = "After Filling")
)

# Plot
ggplot(missing_combined, aes(x = reorder(Column, -Missing), y = Missing, fill = Stage)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Comparison of Missing Values Before and After Filling",
    x = "Columns",
    y = "Count of Missing Values",
    fill = "Stage"
  )

Interpretation:

Before filling (represented in blue), the columns reviews_per_month and last_review had a significant number of missing values.
After filling (represented in red), the counts for reviews_per_month, last_review, and host_name drop to zero; name keeps its 16 missing values because it was not filled.

Price Distribution

ggplot(data_unnull, aes(x = price)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +  # $10 bins; a width of 0.1 only suits the log scale
  labs(title = "Price Distribution of Airbnb Listings in NYC",
       x = "Price", y = "Count")

This plot shows the price distribution of Airbnb listings in New York City. It appears to have a skewed distribution, with a long tail on the right side, indicating that there are a larger number of lower-priced listings compared to higher-priced ones.

To address the skewness and potentially achieve a more normal distribution, it would be appropriate to use a log transformation of the price data. This involves taking the natural logarithm of each price value, which can help to normalize the distribution and reduce the impact of the long tail.

log-transformed price

# Add log-transformed price to the dataset
data_unnull <- data_unnull %>%
  filter(!is.na(price) & price > 0) %>%
  mutate(log_price = log(price))

ggplot(data_unnull, aes(x = log_price)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Log(Price) Distribution of Airbnb Listings in NYC",
       x = "Log(Price)", y = "Count")

Original price distribution is highly skewed with extreme outliers. Log transformation normalizes the data for better analysis and visualization.
Log price transformation helped identify significant predictors (e.g., room type, minimum nights) while improving the linear regression model’s performance.
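To see why the log scale tames the right tail, and why zero-priced listings have to be dropped first, the transformation can be applied to the five summary values reported for price. This is only an illustration on those five numbers, not a re-analysis:

```r
# Min, Q1, median, Q3 and max of price from the summary output
price_summary <- c(min = 0, q1 = 69, median = 106, q3 = 175, max = 10000)

# log(0) is -Inf, which is why rows with price == 0 are filtered out
print(log(price_summary[["min"]]))

# Compare how far the maximum sits from the median on each scale
positive  <- price_summary[price_summary > 0]
raw_ratio <- positive[["max"]] / positive[["median"]]
log_ratio <- log(positive[["max"]]) / log(positive[["median"]])
print(round(c(raw = raw_ratio, log = log_ratio), 2))
```

On the raw scale the maximum is roughly 94 times the median; on the log scale it is only about twice as large, so extreme listings no longer dominate the histogram.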

Distribution of Room Types in NYC

ggplot(data, aes(x = room_type)) + 
  geom_bar(fill = "skyblue") + 
  theme_minimal() + 
  labs(title = "Distribution of Room Types in NYC", x = "Room Type", y = "Count")

Interpretation:

The distribution indicates that the majority of Airbnb listings in NYC are for entire homes/apartments, which are probably more appealing for larger groups or travelers looking for privacy.

Private rooms seem to be the second most common choice, frequently selected for more budget-friendly stays where travelers share a space with the host.

Shared rooms are the least common, likely due to privacy issues or lower demand for shared accommodations.

Geographical Distribution of Listings

library(ggplot2)

ggplot(data_unnull, aes(x = longitude, y = latitude, color = price)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Geographical Distribution of Listings", x = "Longitude", y = "Latitude")

This map visualizes the geographical distribution of Airbnb listings in NYC, with price represented by color intensity:

Spatial Distribution: The listings are concentrated in urban areas, aligning with boroughs such as Manhattan, Brooklyn, and parts of Queens. Sparse points represent listings in less urban or peripheral areas.
Price Gradient: Higher-priced listings (in red) are clustered in central and desirable areas, such as Manhattan. These areas are likely hotspots for tourism and business, explaining the elevated prices. Conversely, lower-priced listings (in blue) are more dispersed and found in outer boroughs.

Insights for Stakeholders:

For hosts: Listings in high-demand areas might command premium prices. Hosts in less-central areas could focus on competitive pricing or emphasize unique features to attract guests.
For guests: This map helps identify budget-friendly areas or luxurious options depending on proximity and price preferences.

Correlation Heatmap of Key Variables

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggcorr(data_unnull %>% select(log_price, reviews_per_month, minimum_nights, number_of_reviews), 
       label = TRUE, label_round = 2, 
       palette = "RdYlBu", name = "Correlation") +
  labs(title = "Correlation Heatmap of Key Variables")

The rows represent the variables “number_of_reviews”, “minimum_nights”, “reviews_per_month”, and “log_price”.

The columns also display these same variables, with the intersections presenting the correlation coefficients between each pair of variables. The correlation coefficients vary from -1 to 1, where blue signifies a negative correlation, white shows no correlation, and red represents a positive correlation. The color’s intensity reflects the correlation’s strength.

Some key observations:

“number_of_reviews” exhibits a fairly strong positive correlation (0.57) with “reviews_per_month”.
“minimum_nights” displays a weak negative correlation (-0.08) with “number_of_reviews”.
“log_price” shows very weak correlations with the other variables, with coefficients near 0.

This heatmap offers a quick visual overview of how the various variables are interconnected.
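The labels on such a heatmap are ordinary pairwise Pearson correlations (the `ggcorr()` default), so the same numbers can be obtained as a plain matrix with base R's `cor()`. The sketch below runs on a tiny made-up data frame (`toy`, hypothetical values); on the real data the same call would take the four selected columns of `data_unnull`.

```r
# Hypothetical toy data with the same column names as the heatmap
toy <- data.frame(
  log_price         = c(5.0, 5.4, 4.2, 4.5, 6.1, 3.9),
  reviews_per_month = c(0.2, 0.4, 4.6, 0.1, 0.6, 3.5),
  minimum_nights    = c(1, 1, 3, 10, 3, 2),
  number_of_reviews = c(9, 45, 270, 9, 74, 430)
)

# Pairwise Pearson correlations, rounded like the heatmap labels
corr_matrix <- round(cor(toy, use = "pairwise.complete.obs", method = "pearson"), 2)
print(corr_matrix)
```

Having the matrix as numbers makes it straightforward to quote exact coefficients in the text rather than reading them off the plot.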

Log(Price) vs Reviews per Month by Room Type

ggplot(data_unnull, aes(x = reviews_per_month, y = log_price, color = room_type)) +
  geom_point(alpha = 0.5) +
  #geom_smooth(method="lm", se= FALSE, aes(color=room_type))+
  labs(title = "Log(Price) vs Avg Reviews per Month by Room Type", 
       x = "Avg Reviews per Month", y = "Log(Price)", color = "Room Type") +
  theme_minimal()

This scatter plot illustrates the relationship between the logarithm of price (Log(Price)) and the average number of reviews per month for the different room types in NYC Airbnb listings.

Entire home/apartment listings (red points) typically have higher prices than private rooms (green) and shared rooms (blue).
Listings that receive a greater average number of reviews each month often feature lower prices, irrespective of room type.
Shared rooms (blue) group at lower price points, while entire homes/apartments lead in the higher price range.

This suggests that both room type and popularity (as indicated by reviews) affect the pricing.

Log(Price) vs Minimum nights

ggplot(data_unnull, aes(x = minimum_nights, y = log_price, color = room_type)) +
  geom_point(alpha = 0.5) +
  labs(title = "Log(Price) vs Minimum nights",
       x = "Minimum nights",
       y = "Log(Price)", color = "Room Type") +
  theme_minimal()

This plot illustrates the connection between the logarithm of the price and the minimum number of nights needed for Airbnb listings in New York City, categorized by room type.

The key observations are:

There is generally a positive correlation between price and minimum nights, suggesting that higher-priced listings often require more minimum nights.
The “Entire home/apt” listings display a broader range in both price and minimum nights when compared to the other room types.
The “Private room” and “Shared room” listings tend to group more closely, with lower prices and fewer minimum nights on average than entire homes/apartments.

Overall, this plot offers insights into how the pricing and minimum night requirements vary among the different Airbnb room types in the NYC market.

Hypothesis 1 Price Distribution by Neighborhood Group

anova_price_neighborhood <- aov(log_price ~ neighbourhood_group, data = data_unnull)
anova_price_neighborhood_sum<-summary(anova_price_neighborhood)
print(anova_price_neighborhood_sum)

##                        Df Sum Sq Mean Sq F value Pr(>F)    
## neighbourhood_group     4   3144   786.0    1857 <2e-16 ***
## Residuals           48879  20687     0.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(data_unnull, aes(x = neighbourhood_group, y = log_price)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Price Distribution by Neighborhood Group", x = "Neighborhood Group", y = "Log(Price)")

Insights:

Manhattan generally exhibits higher prices than boroughs such as Brooklyn or Queens, as shown by the elevated median and upper range of its boxplot.

Price variability also differs across neighborhood groups: a taller box indicates a wider spread of listing prices within that group, while a shorter box indicates more uniform pricing.

f_value <- anova_price_neighborhood_sum[[1]]$`F value`[1]
p_value <- anova_price_neighborhood_sum[[1]]$`Pr(>F)`[1]

# Print results
cat("Hypothesis Test Results:\n")

## Hypothesis Test Results:

cat("F-value:", f_value, "\n")

## F-value: 1857.211

cat("p-value:", p_value, "\n")

## p-value: 0

if (p_value < 0.05) {
  cat("Conclusion: Reject the null hypothesis. Neighborhood group significantly affects price.\n")
} else {
  cat("Conclusion: Fail to reject the null hypothesis. Neighborhood group does not significantly affect price.\n")
}

## Conclusion: Reject the null hypothesis. Neighborhood group significantly affects price.

Hypothesis:

Null Hypothesis (H0): Neighborhood group does not affect price.
Alternative Hypothesis (H1): Neighborhood group affects price.

Test Statistics:

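F-value: 1857.211
p-value: < 2e-16 (printed as 0 above because the value is too small to represent in double precision)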

Given that the p-value is extremely small (less than 0.05), we reject the null hypothesis. This means that the neighborhood group significantly affects price.

The very low p-value indicates strong evidence against the null hypothesis, supporting the alternative hypothesis that the neighborhood group plays a significant role in determining the price of listings.
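A p-value that prints as 0 is a floating-point artifact rather than a literal zero: the tail probability is too small to be represented as a double. A sketch of how it could be reported more carefully, using the F statistic and degrees of freedom from the ANOVA table for neighbourhood_group (`pf()` and `format.pval()` are base R; the variable names are illustrative):

```r
# Values read from the ANOVA table for neighbourhood_group
f_stat <- 1857.211
df_num <- 4
df_den <- 48879

# Upper-tail probability of the F distribution
p_raw <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

# format.pval() reports values below machine precision as "<eps"
p_label <- format.pval(p_raw)
cat("p-value:", p_label, "\n")

# Asking for the log of the tail probability sidesteps the underflow
log_p <- pf(f_stat, df_num, df_den, lower.tail = FALSE, log.p = TRUE)
cat("log(p-value):", log_p, "\n")
```

Reporting the result as "p < 2.2e-16" is usually preferable to "p = 0", since no finite sample can produce a probability of exactly zero.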

Hypothesis 2 Price Distribution by Room Type

anova_result <- aov(log_price ~ room_type, data = data_unnull)

anova_summary <- summary(anova_result)
print(anova_summary)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## room_type       2   9192    4596   15347 <2e-16 ***
## Residuals   48881  14639       0                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(data_unnull, aes(x = room_type, y = log_price)) +
  geom_violin(fill = "lightblue", alpha = 0.7) +
  labs(
    title = "Price Distribution by Room Type",
    x = "Room Type",
    y = "Log(Price)"
  ) +
  theme_minimal()

Key Insights:

Entire homes/apartments occupy the higher price ranges, likely because of their spaciousness and privacy benefits.
Private rooms provide a moderate pricing option, combining cost-effectiveness with privacy.
Shared rooms are economical but have less variation, targeting travelers mindful of expenses.

f_value <- anova_summary[[1]]$`F value`[1]
p_value <- anova_summary[[1]]$`Pr(>F)`[1]

cat("Hypothesis Test Results:\n")

## Hypothesis Test Results:

cat("F-value:", f_value, "\n")

## F-value: 15347.09

cat("p-value:", p_value, "\n")

## p-value: 0

# Interpretation
if (p_value < 0.05) {
  cat("Conclusion: Reject the null hypothesis. Room type significantly affects price.\n")
} else {
  cat("Conclusion: Fail to reject the null hypothesis. Room type does not significantly affect price.\n")
}

## Conclusion: Reject the null hypothesis. Room type significantly affects price.

Hypothesis:

Null Hypothesis (H0): Room type does not affect price.
Alternative Hypothesis (H1): Room type affects price.

Test Statistics:

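F-value: 15347.09
p-value: < 2e-16 (printed as 0 above because the value is too small to represent in double precision)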

Given that the p-value is extremely small (less than 0.05), we reject the null hypothesis. This suggests that room type significantly affects the pricing of listings on Airbnb.

Modelling

model<- lm(log_price ~ room_type * (reviews_per_month + minimum_nights), data = data_unnull)
summary(model)

## 
## Call:
## lm(formula = log_price ~ room_type * (reviews_per_month + minimum_nights), 
##     data = data_unnull)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8552 -0.3689 -0.0565  0.2872  5.0285 
## 
## Coefficients:
##                                           Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                              5.1660477  0.0047765 1081.551  < 2e-16
## room_typePrivate room                   -0.8543037  0.0068382 -124.931  < 2e-16
## room_typeShared room                    -1.1644404  0.0225627  -51.609  < 2e-16
## reviews_per_month                       -0.0196839  0.0024495   -8.036 9.51e-16
## minimum_nights                          -0.0002011  0.0001506   -1.335    0.182
## room_typePrivate room:reviews_per_month  0.0129351  0.0033203    3.896 9.80e-05
## room_typeShared room:reviews_per_month  -0.0149977  0.0116532   -1.287    0.198
## room_typePrivate room:minimum_nights    -0.0010949  0.0002713   -4.036 5.44e-05
## room_typeShared room:minimum_nights     -0.0005420  0.0005341   -1.015    0.310
##                                            
## (Intercept)                             ***
## room_typePrivate room                   ***
## room_typeShared room                    ***
## reviews_per_month                       ***
## minimum_nights                             
## room_typePrivate room:reviews_per_month ***
## room_typeShared room:reviews_per_month     
## room_typePrivate room:minimum_nights    ***
## room_typeShared room:minimum_nights        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5466 on 48875 degrees of freedom
## Multiple R-squared:  0.3872, Adjusted R-squared:  0.3871 
## F-statistic:  3860 on 8 and 48875 DF,  p-value: < 2.2e-16

Based on the given linear regression model summary, here are the main insights:

Room Type:

In comparison to “Entire home/apt” listings, “Private room” listings exhibit a lower log price by 0.854, while “Shared room” listings display a lower log price by 1.164.

Reviews per Month:

For “Entire home/apt” listings, an increase of one review per month is linked to a decrease in log price of 0.0197.
For “Private room” listings, an increase of one review per month corresponds to a smaller decrease in log price of 0.0067 (0.0197 - 0.0129, computed from the unrounded estimates).
For “Shared room” listings, an increase of one review per month is related to a decrease in log price of 0.0347 (0.0197 + 0.0150), although this interaction term is not statistically significant (p = 0.198).

Minimum Nights:

For “Entire home/apt” listings, an increase of one minimum night is related to a decrease in log price of 0.0002, which is not statistically significant (p = 0.182).
For “Private room” listings, an increase of one minimum night corresponds to a larger decrease in log price of 0.0013 (0.0002 + 0.0011).
For “Shared room” listings, an increase of one minimum night corresponds to a decrease in log price of 0.0007 (0.0002 + 0.0005), although this interaction term is not statistically significant (p = 0.310).

Model Fit:

The adjusted R-squared value of 0.3871 shows that the model accounts for approximately 38.71% of the variation in log price.

Overall, the model indicates that room type has by far the strongest association with Airbnb listing prices in New York City. Reviews per month has a small but significant negative association, while the effect of minimum nights is negligible for entire homes and only slightly negative for private rooms.
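Because the response is log(price), a coefficient translates into an approximate percentage change via exp(coef) - 1, and the slope for each room type is the main effect plus the relevant interaction term. The sketch below redoes that arithmetic from the estimates printed in the model summary; the object names are illustrative.

```r
# Estimates copied from the model summary
b <- c(
  private              = -0.8543037,
  shared               = -1.1644404,
  reviews              = -0.0196839,
  min_nights           = -0.0002011,
  private_x_reviews    =  0.0129351,
  shared_x_reviews     = -0.0149977,
  private_x_min_nights = -0.0010949,
  shared_x_min_nights  = -0.0005420
)

# Percentage price difference versus "Entire home/apt"
# (evaluated at zero reviews per month and zero minimum nights)
pct_vs_entire <- 100 * (exp(b[c("private", "shared")]) - 1)
print(round(pct_vs_entire, 1))

# Slope of log(price) per room type = main effect + interaction
slopes <- rbind(
  reviews_per_month = c(
    entire  = b[["reviews"]],
    private = b[["reviews"]] + b[["private_x_reviews"]],
    shared  = b[["reviews"]] + b[["shared_x_reviews"]]
  ),
  minimum_nights = c(
    entire  = b[["min_nights"]],
    private = b[["min_nights"]] + b[["private_x_min_nights"]],
    shared  = b[["min_nights"]] + b[["shared_x_min_nights"]]
  )
)
print(round(slopes, 4))
```

Read this way, a private room is priced roughly 57% below an entire home and a shared room roughly 69% below, with the other predictors held at zero.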

Model Diagnostics

Residuals and fitted values

# Add residuals and fitted values to data
data_unnull$model_resid <- residuals(model)
data_unnull$model_fitted <- fitted(model)

# Residuals vs Fitted with ggplot
ggplot(data_unnull, aes(x = model_fitted, y = model_resid, color = room_type)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Residuals vs Fitted by Room Type", 
       x = "Fitted Values", 
       y = "Residuals", 
       color = "Room Type") +
  theme_minimal()

This plot shows the residuals compared to the fitted values for various room types (Entire home/apt, Private room, Shared room) in the regression analysis.

Residuals represent the difference between actual values and predicted values, showing how well the model fits the data.
Each room type is represented by a distinct color: red for entire home/apt, green for private room, and blue for shared room.
The residuals are distributed around the horizontal line at 0, indicating the model generally fits the data. However:

1.  Residuals for entire home/apt (red) display a greater spread at higher fitted values, indicating possible variability in price predictions.
2.  Residuals for shared room (blue) and private room (green) are more clustered around the fitted line, indicating less variance.

No obvious pattern emerges in the residuals, suggesting that the assumptions of homoscedasticity (constant variance) and linearity are reasonably satisfied for the model.

Histogram of Residuals

par(mfrow = c(1, 1))

hist_residuals <- resid(model)
hist(hist_residuals, breaks = 30, col = "lightblue", border = "black",
     main = "Histogram of Residuals", xlab = "Residuals")

This histogram shows the distribution of the residuals, which are the differences between the actual and predicted values in a statistical model.

Key Observations:

Centering: The residuals are centered around 0, signifying that the model’s predictions are, on average, unbiased.
Symmetry: The distribution seems roughly symmetric, which is consistent with the normality assumption for residuals in a regression model.
Spread: Most residuals are focused within a narrow range (approximately between -2 and 2), indicating that the bulk of the predictions are near the observed values.
Normality: Although the shape resembles a normal distribution, additional checks (e.g., a Q-Q plot) may be necessary to determine whether the residuals strictly conform to a normal distribution.

Q-Q plot for residuals

# Q-Q plot for residuals
qqnorm(residuals(model), main = "Q-Q Plot of Residuals")
qqline(residuals(model), col = "red", lwd = 2)

This Q-Q plot assesses the normality of residuals from the regression model used on the Airbnb NYC dataset:

Alignment with the Line: The majority of residuals follow the red diagonal line, indicating that the residuals generally conform to a normal distribution.
Deviations at Tails: Minor deviations at the ends (both lower and upper tails) indicate possible outliers or slight departures from normality, likely due to extreme pricing in certain Airbnb listings.

Overall, the residuals are roughly normal, which supports the assumption of normality for the regression model applied in examining Airbnb NYC pricing data.

Residuals vs Leverage Plot

library(ggplot2)

# Calculate diagnostics
leverage <- hatvalues(model)
std_residuals <- rstandard(model)
cooks_distance <- cooks.distance(model)

# Create a data frame for plotting
diag_data <- data.frame(
  Leverage = leverage,
  StandardizedResiduals = std_residuals,
  RoomType = data_unnull$room_type,
  CookDistance = cooks_distance
)

# Plot with interaction by room_type
ggplot(diag_data, aes(x = Leverage, y = StandardizedResiduals, color = RoomType)) +
  geom_point(aes(size = CookDistance), alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "Residuals vs Leverage Plot by Room Type",
    x = "Leverage",
    y = "Standardized Residuals",
    size = "Cook's Distance",
    color = "Room Type"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("red", "blue", "green"))

This Residuals vs. Leverage Plot by Room Type offers perspectives on possible influential points within the Airbnb NYC dataset:

Clusters of Residuals:

The majority of data points exhibit low leverage (close to 0) and minor residuals, signifying that they do not significantly impact the regression model.
Points are categorized by room type: red for “Entire home/apt,” blue for “Private room,” and green for “Shared room.”

High-Leverage Outlier:

One green point (shared room) shows exceptionally high leverage and a substantial Cook’s distance, suggesting it is a significant observation. This point may relate to an Airbnb listing with unusual pricing and reviews.

Cook’s Distance:

The dimensions of the points represent their Cook’s distance, which assesses influence. The larger green point indicates that this observation could heavily sway the regression outcomes.

Recommendation: This high-leverage outlier warrants further examination to assess its influence on the model and determine if it signifies valid data or should be excluded.
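One common way to follow up on such a point is to flag observations whose Cook's distance exceeds a rule-of-thumb cutoff and refit the model without them to see how far the coefficients move. The 4/n cutoff used here is an assumption, not a value from this analysis, and the sketch runs on simulated data with one planted outlier; `sim`, `fit`, and `fit_trimmed` are hypothetical stand-ins for the real data and model.

```r
set.seed(42)

# Simulated data: 50 well-behaved points plus one planted high-leverage outlier
x <- c(rnorm(50), 10)
y <- c(2 * x[1:50] + rnorm(50), -20)
sim <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = sim)

# Flag points above the 4/n rule-of-thumb cutoff for Cook's distance
cd <- cooks.distance(fit)
cutoff <- 4 / nrow(sim)
flagged <- which(cd > cutoff)
print(flagged)

# Refit without the flagged points and compare coefficients
fit_trimmed <- lm(y ~ x, data = sim[cd <= cutoff, ])
print(rbind(full = coef(fit), trimmed = coef(fit_trimmed)))
```

A large shift between the two rows would suggest the flagged listing is driving the estimates; a small shift would suggest it can safely stay in the model.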

Conclusion:

The Airbnb NYC dataset highlights important trends and factors that affect listing prices and demand. Higher prices are linked to entire homes/apartments and to certain neighborhood groups such as Manhattan. The scatter plot suggested that pricier listings tend to require longer minimum stays, but the regression estimates for minimum nights are small and negative, so that relationship is weak at best. Private and shared rooms present more budget-friendly options. Monthly reviews indicate demand, with frequently reviewed listings typically priced lower.

Residual analysis indicates the model performs reasonably well but identifies some outliers and influential data points that require further examination. Although the price distribution mostly conforms to expectations, some deviations reveal opportunities for model enhancement. Recommendations for hosts include setting competitive prices, optimizing minimum stays, and maintaining a steady flow of reviews. For guests, well-reviewed listings and shared accommodations tend to offer better value. Policymakers and analysts might delve deeper into pricing inconsistencies and less utilized areas to achieve a more balanced market.

Future Work: Incorporate additional predictors, analyze outliers, and examine spatial and temporal patterns to enhance predictions.

Project

Mounya

2024-12-04
