Loading Data

library(tidyverse)  # provides read_delim() (readr) and mutate()/case_when() (dplyr)
airbnb <- read_delim("./airbnb_austin.csv", delim = ",")
## Rows: 15244 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): name, host_name, room_type
## dbl  (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl   (2): neighbourhood_group, license
## date  (1): last_review
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Expanded Linear Regression Model for Airbnb Pricing

I’ll enhance the previous model (price ~ stay_length) by adding two new predictors:

  1. room_type (categorical) – Expected to significantly impact price.

  2. availability_365 (continuous) – Days per year the listing is available; could signal demand, potentially affecting price.

I’ll also test an interaction term (stay_length * room_type) to see if the effect of stay length differs by room type.

Model V2: Using 3 Predictors

Why Include These Variables?

  1. room_type:

    • ANOVA showed strong evidence that room type affects price.

    • Expect entire homes to cost more than private rooms.

    • Not expected to be strongly collinear with stay_length, since the two measure different things (this is an expectation, not a formal check).

  2. availability_365:

    • Listings available more days per year may have lower prices (hosts lowering prices to attract bookings).

    • High-availability listings could signal lower demand or less desirable properties.

airbnb_group <- airbnb |>
  mutate(
    stay_length = case_when(
      minimum_nights >= 1 & minimum_nights <= 3 ~ "Short Stay",
      minimum_nights >= 4 & minimum_nights <= 7 ~ "Medium Stay",
      minimum_nights >= 8 & minimum_nights <= 30 ~ "Long Stay",
      minimum_nights >= 31 ~ "Extended Stay",
      TRUE ~ NA_character_
    )
  )
airbnb_data <- airbnb_group |>
  mutate(stay_length = factor(stay_length, levels = c("Short Stay", "Medium Stay", "Long Stay", "Extended Stay")))
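
The binning above can be cross-checked with base R's cut(). This is only an illustrative sketch: bin_stay_length() is a hypothetical helper name, and it assumes minimum_nights holds whole numbers, so that the right-closed intervals (0, 3], (3, 7], (7, 30], (30, Inf] match the case_when() rules.

```r
# Hypothetical helper (not part of the original analysis): bin minimum_nights
# with cut() instead of case_when(). Assumes whole-number nights.
bin_stay_length <- function(minimum_nights) {
  cut(
    minimum_nights,
    breaks = c(0, 3, 7, 30, Inf),  # right-closed: (0,3], (3,7], (7,30], (30,Inf]
    labels = c("Short Stay", "Medium Stay", "Long Stay", "Extended Stay")
  )
}

# Boundary values fall into the same groups as the case_when() rules;
# 0 and NA lie outside every interval and come back as NA.
bin_stay_length(c(1, 3, 4, 7, 8, 30, 31, 0, NA))
```

cut() returns a factor whose levels are already in the order given by labels, so the separate factor() step would not be needed with this variant.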

model_v2 <- lm(price ~ stay_length + room_type + availability_365,
               data = airbnb_data)
summary(model_v2)
## 
## Call:
## lm(formula = price ~ stay_length + room_type + availability_365, 
##     data = airbnb_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -401   -192   -108      3  37754 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               186.0677    16.7215  11.127  < 2e-16 ***
## stay_lengthMedium Stay     24.4672    40.4668   0.605 0.545442    
## stay_lengthLong Stay     -172.4597    23.5977  -7.308 2.89e-13 ***
## stay_lengthExtended Stay -169.6141    48.6802  -3.484 0.000495 ***
## room_typeHotel room        80.7334    85.8393   0.941 0.346972    
## room_typePrivate room     -99.1842    23.5330  -4.215 2.52e-05 ***
## room_typeShared room     -206.0941   109.0684  -1.890 0.058839 .  
## availability_365            0.6549     0.0673   9.731  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 834.5 on 11175 degrees of freedom
##   (4061 observations deleted due to missingness)
## Multiple R-squared:  0.01624,    Adjusted R-squared:  0.01563 
## F-statistic: 26.36 on 7 and 11175 DF,  p-value: < 2.2e-16

Key Findings

  • The model explains only 1.6% of price variation (R² = 0.016), meaning it lacks important price drivers.

  • Despite being statistically significant (F-statistic, p < 2e-16), its practical usefulness is very limited due to the low R².

  • Each additional day of availability is associated with a $0.65 higher price (p < 0.001), holding stay length and room type fixed, which runs against typical demand-based pricing logic.

    • Possible explanations:

      • Premium listings may remain available longer due to higher pricing.

      • Popular hosts might keep calendars open longer to attract bookings.

      • Potential data issues, as new listings may default to 365-day availability.

  • Long and Extended Stays are significantly cheaper than Short Stays (about $172 and $170 less, respectively; both p < 0.001).

  • Medium Stays show no significant price difference from Short Stays (p = 0.55).

  • Private Rooms: $99 cheaper than Entire Homes (p < 0.001).

  • Shared Rooms: $206 cheaper, but only marginally significant (p = 0.059).

  • Hotel Rooms: No significant price difference from Entire Homes (p = 0.35).
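
To make the coefficients concrete, here is a small worked prediction. The coefficient values are copied from the summary(model_v2) output above; the listing profile (Private room, Long Stay, 200 available days) is a made-up example, not a row from the data.

```r
# Coefficients copied from the summary(model_v2) output above.
b <- c(
  intercept        = 186.0677,
  long_stay        = -172.4597,
  private_room     = -99.1842,
  availability_365 = 0.6549
)

# Hypothetical listing: Private room, Long Stay, available 200 days per year.
# The reference levels (Short Stay, entire home) contribute nothing extra.
pred <- b[["intercept"]] + b[["long_stay"]] + b[["private_room"]] +
  b[["availability_365"]] * 200
round(pred, 2)
```

The fitted value comes out at roughly $45. With a residual standard error of about $834.5, a point prediction like this is very imprecise. On the fitted object itself, predict(model_v2, newdata = ...) should give the same number up to rounding of the printed coefficients.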

Interaction Model

Why Test an Interaction?

  • Hypothesis: The price discount for long stays might differ by room type (e.g., entire homes might drop more sharply than private rooms).

model_interaction <- lm(price ~ stay_length * room_type, data = airbnb_data)
summary(model_interaction)
## 
## Call:
## lm(formula = price ~ stay_length * room_type, data = airbnb_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -310   -196   -123    -18  37822 
## 
## Coefficients: (5 not defined because of singularities)
##                                                Estimate Std. Error t value
## (Intercept)                                     321.163      9.499  33.809
## stay_lengthMedium Stay                           13.381     45.065   0.297
## stay_lengthLong Stay                           -161.674     26.539  -6.092
## stay_lengthExtended Stay                       -140.615     52.261  -2.691
## room_typeHotel room                             131.732     86.069   1.531
## room_typePrivate room                          -104.389     27.874  -3.745
## room_typeShared room                           -275.726    209.754  -1.315
## stay_lengthMedium Stay:room_typeHotel room           NA         NA      NA
## stay_lengthLong Stay:room_typeHotel room             NA         NA      NA
## stay_lengthExtended Stay:room_typeHotel room         NA         NA      NA
## stay_lengthMedium Stay:room_typePrivate room    -47.454    103.886  -0.457
## stay_lengthLong Stay:room_typePrivate room        2.707     59.829   0.045
## stay_lengthExtended Stay:room_typePrivate room  -22.903    146.393  -0.156
## stay_lengthMedium Stay:room_typeShared room          NA         NA      NA
## stay_lengthLong Stay:room_typeShared room       136.714    246.123   0.555
## stay_lengthExtended Stay:room_typeShared room        NA         NA      NA
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## stay_lengthMedium Stay                         0.766531    
## stay_lengthLong Stay                           1.15e-09 ***
## stay_lengthExtended Stay                       0.007143 ** 
## room_typeHotel room                            0.125912    
## room_typePrivate room                          0.000181 ***
## room_typeShared room                           0.188697    
## stay_lengthMedium Stay:room_typeHotel room           NA    
## stay_lengthLong Stay:room_typeHotel room             NA    
## stay_lengthExtended Stay:room_typeHotel room         NA    
## stay_lengthMedium Stay:room_typePrivate room   0.647832    
## stay_lengthLong Stay:room_typePrivate room     0.963918    
## stay_lengthExtended Stay:room_typePrivate room 0.875680    
## stay_lengthMedium Stay:room_typeShared room          NA    
## stay_lengthLong Stay:room_typeShared room      0.578585    
## stay_lengthExtended Stay:room_typeShared room        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 838.2 on 11172 degrees of freedom
##   (4061 observations deleted due to missingness)
## Multiple R-squared:  0.007956,   Adjusted R-squared:  0.007068 
## F-statistic:  8.96 on 10 and 11172 DF,  p-value: 7.473e-15

Key Findings

  • Base Price: $321 for Short Stay in Entire homes

  • Long/Extended Stays: Cheaper by $162 (p < 0.001) and $141 (p = 0.007), respectively

  • Private Rooms: $104 cheaper than Entire homes (p<0.001)

  • The estimable interactions (e.g., Medium Stay:Private Room) showed no significant price differences (all p > 0.05)

  • No interaction approached significance; even the largest estimate (Long Stay:Shared Room, +$137) has p = 0.58

  • Very low R² (0.8%) - explains almost none of price variation

  • Significant F-statistic but poor practical utility
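
The "5 not defined because of singularities" note in the output most likely reflects stay_length by room_type combinations with no listings, since an interaction coefficient cannot be estimated for an empty cell. A minimal sketch for checking this follows; empty_cells() is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: list stay_length x room_type combinations that have
# no listings among the rows lm() would actually use.
empty_cells <- function(data) {
  # Keep only rows with no missing values in the model variables,
  # mirroring lm()'s default handling of missingness.
  used <- data[complete.cases(data[, c("price", "stay_length", "room_type")]), ]
  counts <- as.data.frame(table(stay_length = used$stay_length,
                                room_type   = used$room_type))
  counts[counts$Freq == 0, c("stay_length", "room_type")]
}

# Only runs if the data frame from the analysis above is in the session.
if (exists("airbnb_data")) print(empty_cells(airbnb_data))
```

If the empty combinations line up with the NA rows in the coefficient table (all three Hotel room interactions, plus Medium Stay and Extended Stay for Shared room), the singularities are a data-coverage issue rather than a modelling error.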

Business Implications

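  • Room type and stay length both shift price as main effects (Private Rooms about $104 lower, Long Stays about $162 lower), but the model explains under 1% of price variation, so neither is a strong basis for pricing on its own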

  • No evidence that stay-length discounts should vary by room type
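
The individual interaction p-values can be complemented by a single nested F-test that compares the additive model with the interaction model. This is a sketch under the assumption that both models are fit to the same rows, which holds here because they use the same three variables; compare_interaction() is a hypothetical helper name.

```r
# Hypothetical helper: nested F-test for the stay_length:room_type interaction.
compare_interaction <- function(data) {
  additive    <- lm(price ~ stay_length + room_type, data = data)
  interaction <- lm(price ~ stay_length * room_type, data = data)
  # anova() on two nested lm fits tests whether the extra interaction
  # terms reduce the residual sum of squares more than chance would.
  anova(additive, interaction)
}

# Only runs if the data frame from the analysis above is in the session.
if (exists("airbnb_data")) print(compare_interaction(airbnb_data))
```

A large p-value in the second row of the resulting table would be consistent with the conclusion that stay-length discounts do not differ by room type.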

Model Diagnostics

plot(model_v2)

1. Residuals vs Fitted Plot

  • The flat red line suggests no severe non-linearity

  • Slight fanning at higher prices indicates mild heteroscedasticity

2. Normal Q-Q Plot

  • Heavy tails deviate from line → non-normal residuals

3. Scale-Location Plot

  • Slight upward trend → higher variance at higher prices

  • Confirms mild heteroscedasticity

4. Residuals vs Leverage

  • Several far-right points → high-leverage observations

5. Cook’s Distance

  • All bars < 0.5 → no problematic influence
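
The visual readings above can be backed with numbers from base R's cooks.distance() and hatvalues(). The sketch below is illustrative: influence_summary() is a hypothetical helper, the 0.5 cutoff mirrors the one used above, and the 2p/n leverage threshold is a common rule of thumb rather than something taken from the original analysis.

```r
# Hypothetical helper: numeric summary of influence and leverage for an lm fit.
influence_summary <- function(model, cook_cutoff = 0.5) {
  cd  <- cooks.distance(model)
  lev <- hatvalues(model)
  p   <- model$rank      # number of estimable coefficients
  n   <- length(lev)     # number of observations used in the fit
  list(
    max_cook     = max(cd, na.rm = TRUE),
    n_cook_above = sum(cd > cook_cutoff, na.rm = TRUE),
    n_high_lev   = sum(lev > 2 * p / n)  # rule-of-thumb threshold (assumption)
  )
}

# Only runs if the fitted model from the analysis above is in the session.
if (exists("model_v2")) print(influence_summary(model_v2))
```

Note that plot() on an lm object draws panels 1, 2, 3 and 5 by default; the Cook's distance bar chart is panel 4 and can be requested with plot(model_v2, which = 4).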