Loading Data

library(tidyverse)  # provides read_delim(), filter(), mutate() and fct_lump()
airbnb <- read_delim("./airbnb_austin.csv", delim = ",")
## Rows: 15244 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): name, host_name, room_type
## dbl  (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl   (2): neighbourhood_group, license
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Model 1

Response Variable: price (continuous)

Predictors:

  1. room_type (categorical)

  2. minimum_nights (continuous)

  3. availability_365 (continuous)

  4. neighbourhood (categorical, top 5 neighbourhoods by frequency)
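To make predictor 4 concrete, here is a minimal sketch of how lumping a factor to its most frequent levels behaves. It uses made-up zip codes rather than the Airbnb data and assumes the forcats package is installed; `zips` and `lumped` are illustrative names only.

```r
library(forcats)

# Hypothetical zip codes, not taken from the Airbnb data
zips <- c(78701, 78701, 78701, 78702, 78702, 78704, 78741)

# Keep the 2 most frequent levels; everything else becomes "Other"
lumped <- fct_lump(as.factor(zips), n = 2)

levels(lumped)  # kept levels in sorted order, then "Other"
table(lumped)   # counts per level after lumping
```

The first level is the one R treats as the reference category in a model formula, so with n = 5 the reference is the lowest-numbered of the five most frequent zip codes.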

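airbnb_model <- airbnb |>
  filter(!is.na(price)) |>  # drop listings with no recorded price
  mutate(
    # keep the 5 most frequent zip codes and lump the rest into "Other"
    neighbourhood = fct_lump(as.factor(neighbourhood), n = 5)
  )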

# Gaussian family with identity link: same fit as ordinary least squares via lm()
model_lm <- glm(price ~ room_type + minimum_nights + availability_365 + neighbourhood,
                data = airbnb_model,
                family = gaussian(link = "identity"))

summary(model_lm)
## 
## Call:
## glm(formula = price ~ room_type + minimum_nights + availability_365 + 
##     neighbourhood, family = gaussian(link = "identity"), data = airbnb_model)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            288.18811   31.88156   9.039  < 2e-16 ***
## room_typeHotel room     62.46749   86.24670   0.724 0.468904    
## room_typePrivate room -112.19368   23.63949  -4.746 2.10e-06 ***
## room_typeShared room  -280.87334  108.65357  -2.585 0.009749 ** 
## minimum_nights          -2.05384    0.44808  -4.584 4.62e-06 ***
## availability_365         0.62833    0.06751   9.307  < 2e-16 ***
## neighbourhood78702    -124.62240   36.68021  -3.398 0.000682 ***
## neighbourhood78704    -122.79667   35.27290  -3.481 0.000501 ***
## neighbourhood78741    -130.64521   43.60979  -2.996 0.002743 ** 
## neighbourhood78745    -191.71629   44.00071  -4.357 1.33e-05 ***
## neighbourhoodOther    -102.12786   30.35368  -3.365 0.000769 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 697860.3)
## 
##     Null deviance: 7911282172  on 11182  degrees of freedom
## Residual deviance: 7796495326  on 11172  degrees of freedom
## AIC: 182225
## 
## Number of Fisher Scoring iterations: 2

Summary

Room Type Effects (reference: Entire home/apt)

  • Private rooms: about $112 cheaper than entire homes, holding the other predictors fixed

  • Shared rooms: about $281 cheaper

  • Hotel rooms: not significantly different (p = 0.47)

Minimum Nights

  • Each additional required night is associated with a $2.05 price decrease

  • Contrary to expectations that longer stays might command premiums

Availability

  • Each additional available day is linked to a $0.63 price increase

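  • One possible explanation is that higher-priced listings are booked less often and therefore show more open days; the coefficient is an association, not a causal effect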

Neighbourhood Effects (reference: the lowest-numbered of the five most frequent zip codes, likely downtown)

  • All listed neighbourhoods are significantly cheaper than the reference

  • 78745 shows the largest discount ($192 less than the reference)

  • Neighbourhood discounts range from $102 to $192 below the reference

Model Diagnostics & Issues

library(car)  # provides vif()
vif(model_lm)
##                      GVIF Df GVIF^(1/(2*Df))
## room_type        1.042339  3        1.006935
## minimum_nights   1.017435  1        1.008680
## availability_365 1.019470  1        1.009688
## neighbourhood    1.039429  5        1.003875
  • The model shows very low multicollinearity: every adjusted GVIF is close to 1, so the predictors are nearly uncorrelated with one another and can be included in the model together

  • The model explains very little of the variation in price: the residual deviance (7.80 billion) is only about 1.5% below the null deviance (7.91 billion), which leaves substantial room for improvement

  • The high residual variance is likely due to extreme outliers or untransformed, skewed prices.

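  • AIC is 182,225. AIC has no absolute scale and is only meaningful when compared with other models fitted to the same response, so this value alone does not show whether the specification is good or bad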

  • The wide, asymmetric range of residuals (-398 to 37,766) points to a heavily right-skewed error distribution and possibly unequal variance

  • The extreme maximum residual indicates a few listings priced far above what the model predicts

  • The availability effect contradicts typical market behaviour (higher availability usually indicates lower demand)

  • The negative relationship between minimum nights and price needs further investigation
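Two of the diagnostic figures above can be reproduced by hand from the printed output. The sketch below recomputes the adjusted GVIF column and converts the dispersion parameter into a residual standard deviation in dollars. All numbers are copied from the `vif()` and `summary()` output; the variable names are illustrative.

```r
# Values copied from the vif() output
gvif <- c(room_type = 1.042339, minimum_nights = 1.017435,
          availability_365 = 1.019470, neighbourhood = 1.039429)
dof  <- c(room_type = 3, minimum_nights = 1,
          availability_365 = 1, neighbourhood = 5)

# Adjusted GVIF: puts multi-level factors on the same footing as
# single-coefficient predictors
adj_gvif <- gvif^(1 / (2 * dof))
round(adj_gvif, 6)

# Residual standard deviation in dollars: square root of the
# Gaussian dispersion parameter reported by summary(model_lm)
resid_sd <- sqrt(697860.3)
round(resid_sd, 1)
```

A residual standard deviation of roughly $835 is large compared with the fitted coefficients, which is consistent with the concern about outliers and skewed prices.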

Model 2

model_log <- glm(log(price) ~ room_type + minimum_nights + availability_365 + neighbourhood, 
                data = airbnb_model)

summary(model_log)
## 
## Call:
## glm(formula = log(price) ~ room_type + minimum_nights + availability_365 + 
##     neighbourhood, data = airbnb_model)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.468e+00  2.974e-02 183.873  < 2e-16 ***
## room_typeHotel room    5.387e-01  8.045e-02   6.696 2.24e-11 ***
## room_typePrivate room -7.608e-01  2.205e-02 -34.502  < 2e-16 ***
## room_typeShared room  -1.901e+00  1.014e-01 -18.754  < 2e-16 ***
## minimum_nights        -5.262e-03  4.180e-04 -12.590  < 2e-16 ***
## availability_365       8.159e-04  6.297e-05  12.957  < 2e-16 ***
## neighbourhood78702    -3.103e-01  3.422e-02  -9.070  < 2e-16 ***
## neighbourhood78704    -3.490e-01  3.290e-02 -10.608  < 2e-16 ***
## neighbourhood78741    -6.827e-01  4.068e-02 -16.783  < 2e-16 ***
## neighbourhood78745    -5.373e-01  4.104e-02 -13.092  < 2e-16 ***
## neighbourhoodOther    -4.837e-01  2.831e-02 -17.084  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.6072235)
## 
##     Null deviance: 8341.2  on 11182  degrees of freedom
## Residual deviance: 6783.9  on 11172  degrees of freedom
## AIC: 26170
## 
## Number of Fisher Scoring iterations: 2

Key Improvements Over Previous Model

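  • Residuals are now roughly centred (median -0.143) and span a much narrower range on the log scale (-2.9 to 5.3, compared with -398 to 37,766 dollars in Model 1)

  • This suggests the log transform addressed much of the severe skewness in the price data

  • Estimates are more precise relative to their size: the intercept's t value rises from 9.0 to 183.9. The raw standard errors (0.02974 vs. 31.88) are not directly comparable because the response is now on the log scale

  • All neighbourhood effects now have p < 2e-16, and the hotel room effect, which was not significant in Model 1, is now significant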
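One scale-free way to compare the two fits is the share of null deviance each model removes, which for a Gaussian model with identity link equals R-squared on that model's own response scale. The sketch below uses only the deviances printed in the two summaries; `deviance_explained` is a hypothetical helper, not a function from any package.

```r
# Share of null deviance removed by the predictors
deviance_explained <- function(null_dev, resid_dev) 1 - resid_dev / null_dev

# Deviances copied from summary(model_lm) and summary(model_log)
model1 <- deviance_explained(7911282172, 7796495326)
model2 <- deviance_explained(8341.2, 6783.9)

round(c(price = model1, log_price = model2), 4)
```

The two shares describe different responses (price and log(price)), so they indicate how well each model tracks its own target rather than serving as a formal test between the models.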

Interpreting Key Coefficients

Room Type Effects (reference: Entire home/apt)

| Room Type | Coefficient | Interpretation (Multiplicative Effect) |
|-----------|-------------|----------------------------------------|
| Hotel     | +0.5387     | Costs 71% more (e^0.5387 ≈ 1.71)       |
| Private   | -0.7608     | Costs 53% less (e^-0.7608 ≈ 0.47)      |
| Shared    | -1.901      | Costs 85% less (e^-1.901 ≈ 0.15)       |

Example: an entire home at $200 would compare to:

  • Hotel room: 200 × 1.71 = $342

  • Private room: 200 × 0.47 = $94

  • Shared room: 200 × 0.15 = $30
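The conversion from log-scale coefficients to price multipliers can be sketched in a few lines. The coefficients are copied from `summary(model_log)`; `room_coefs`, `multipliers` and `base_price` are illustrative names, and the $200 base price is the same hypothetical figure used in the example.

```r
# Room type coefficients copied from summary(model_log)
room_coefs <- c(hotel = 0.5387, private = -0.7608, shared = -1.901)

# exp() turns a log-scale coefficient into a price multiplier
multipliers <- exp(room_coefs)
round(multipliers, 2)

# Apply the multipliers to a hypothetical entire home priced at $200
base_price <- 200
round(base_price * multipliers)
```

Because this uses the unrounded multipliers, the dollar figures can differ by about a dollar from the hand calculation with two-decimal multipliers.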

Minimum Nights (-0.00526)

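  • Each additional required night is associated with a price about 0.525% lower (e^-0.005262 ≈ 0.99475)

  • Over 30 additional required nights the effect compounds rather than adds: e^(30 × -0.005262) ≈ 0.854, about a 14.6% discount (not 30 × 0.525% ≈ 15.7%)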

Availability (+0.000816)

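  • Each additional available day is associated with a price about 0.0816% higher

  • Over a full 365 days of availability the effect compounds rather than adds: e^(365 × 0.000816) ≈ 1.347, about 34.7% higher (not 365 × 0.0816% ≈ 30%)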
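In a model with log(price) as the response, the effect of several units of a predictor is found by exponentiating the summed coefficient, not by adding up per-unit percentages. A minimal sketch, with coefficients copied from `summary(model_log)` and `pct_change` as a hypothetical helper:

```r
# Coefficients copied from summary(model_log)
b_min_nights   <- -5.262e-03
b_availability <-  8.159e-04

# Percent change in price implied by a change of `delta` units in a
# predictor, for a model with log(price) as the response
pct_change <- function(beta, delta) 100 * (exp(beta * delta) - 1)

pct_change(b_min_nights, 1)      # one extra required night
pct_change(b_min_nights, 30)     # thirty extra required nights
pct_change(b_availability, 1)    # one extra available day
pct_change(b_availability, 365)  # a full year of availability
```

For a single unit the exact and additive figures almost coincide; the gap only becomes noticeable over many units, as in the 30-night and 365-day cases.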

Neighborhood Effects

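| Zip Code | Coefficient | Price Multiplier | Equivalent Discount |
|----------|-------------|------------------|---------------------|
| 78741    | -0.6827     | 0.51×            | 49%                 |
| Other    | -0.4837     | 0.62×            | 38%                 |
| 78704    | -0.3490     | 0.71×            | 29%                 |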
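The same conversion can be applied to every neighbourhood level in the model, including the two zip codes (78702 and 78745) that the table leaves out. This is a sketch using coefficients copied from `summary(model_log)`; the data frame `nbhd` and its column names are illustrative.

```r
# Neighbourhood coefficients copied from summary(model_log)
nbhd <- data.frame(
  zip  = c("78702", "78704", "78741", "78745", "Other"),
  coef = c(-0.3103, -0.3490, -0.6827, -0.5373, -0.4837)
)

# Price multiplier and percentage discount relative to the reference zip code
nbhd$multiplier   <- round(exp(nbhd$coef), 2)
nbhd$discount_pct <- round(100 * (1 - exp(nbhd$coef)))

# Largest discount first
nbhd[order(nbhd$coef), ]
```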

Model Diagnostics

  • Maximum residual (5.32) still suggests some overpriced outliers

  • Consider winsorizing extreme log(price) values

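  • The dispersion parameter (0.607) is the estimated residual variance on the log scale, which corresponds to a residual standard deviation of about 0.78. For a Gaussian model it has no target value of 1; that benchmark applies to Poisson and binomial models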

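  • AIC (26,170) cannot be compared directly with the first model’s AIC (182,225) because the response changed from price to log(price). A fair comparison puts both likelihoods on the same scale, for example by adding 2 × Σ log(price) to the log model’s AIC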
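AIC values are only comparable between models fitted to the same response. One common adjustment, sketched here on simulated data rather than the Airbnb listings, adds the Jacobian term 2 × Σ log(y) to the AIC of the log-response model so that both values refer to the likelihood of y itself. The data, `fit_raw`, `fit_log` and the other names are hypothetical.

```r
set.seed(1)

# Simulated, hypothetical data: a right-skewed positive response
n <- 500
x <- runif(n)
y <- exp(4 + x + rnorm(n, sd = 0.8))

fit_raw <- glm(y ~ x)       # Gaussian model for y
fit_log <- glm(log(y) ~ x)  # Gaussian model for log(y)

# AIC(fit_log) is on the log(y) scale. Adding the Jacobian term
# 2 * sum(log(y)) expresses it on the y scale, comparable to AIC(fit_raw)
aic_raw     <- AIC(fit_raw)
aic_log_adj <- AIC(fit_log) + 2 * sum(log(y))

c(raw = aic_raw, log_adjusted = aic_log_adj)
```

Applied to this report, the analogous calculation would presumably be `AIC(model_log) + 2 * sum(log(airbnb_model$price))`, computed over the same rows used in the fit, and compared against `AIC(model_lm)`.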