library(tidyverse)  # provides read_delim(), filter(), mutate(), and fct_lump()
airbnb <- read_delim("./airbnb_austin.csv", delim = ",")
## Rows: 15244 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, host_name, room_type
## dbl (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl (2): neighbourhood_group, license
## date (1): last_review
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Response: price (continuous)
Predictors:
room_type (categorical)
minimum_nights (continuous)
availability_365 (continuous)
neighbourhood (categorical, top 5 neighbourhoods by frequency)
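# Drop listings without a price and keep the five most frequent zip codes,
# lumping all remaining zip codes into "Other"
airbnb_model <- airbnb |>
  filter(!is.na(price)) |>
  mutate(
    neighbourhood = fct_lump(as.factor(neighbourhood), n = 5)
  )

# Gaussian GLM with identity link, equivalent to ordinary least squares
model_lm <- glm(price ~ room_type + minimum_nights + availability_365 + neighbourhood,
                data = airbnb_model,
                family = gaussian(link = "identity"))
summary(model_lm)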
##
## Call:
## glm(formula = price ~ room_type + minimum_nights + availability_365 +
## neighbourhood, family = gaussian(link = "identity"), data = airbnb_model)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 288.18811 31.88156 9.039 < 2e-16 ***
## room_typeHotel room 62.46749 86.24670 0.724 0.468904
## room_typePrivate room -112.19368 23.63949 -4.746 2.10e-06 ***
## room_typeShared room -280.87334 108.65357 -2.585 0.009749 **
## minimum_nights -2.05384 0.44808 -4.584 4.62e-06 ***
## availability_365 0.62833 0.06751 9.307 < 2e-16 ***
## neighbourhood78702 -124.62240 36.68021 -3.398 0.000682 ***
## neighbourhood78704 -122.79667 35.27290 -3.481 0.000501 ***
## neighbourhood78741 -130.64521 43.60979 -2.996 0.002743 **
## neighbourhood78745 -191.71629 44.00071 -4.357 1.33e-05 ***
## neighbourhoodOther -102.12786 30.35368 -3.365 0.000769 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 697860.3)
##
## Null deviance: 7911282172 on 11182 degrees of freedom
## Residual deviance: 7796495326 on 11172 degrees of freedom
## AIC: 182225
##
## Number of Fisher Scoring iterations: 2
Room Type Effects (reference: Entire home/apt)
Private rooms: $112 cheaper than entire homes
Shared rooms: $281 cheaper
Hotel rooms: Not significantly different
Minimum Nights
Each additional required night is associated with a $2.05 price decrease
This runs contrary to the expectation that longer stays might command premiums
Availability
Each additional available day is associated with a $0.63 price increase
This may suggest that more desirable listings stay available longer, though the model shows only an association
Neighbourhood Effects (reference: the top-5 zip code absent from the coefficient table, likely downtown)
All listed neighbourhoods are significantly cheaper than the reference
78745 shows the largest discount ($192 less than the reference)
Discounts relative to the reference range from $102 (Other) to $192 (78745)
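The reference neighbourhood is simply the first level of the factor, so it can be checked rather than guessed. The sketch below uses a small made-up vector of zip codes (hypothetical data, not the Airbnb file) to show how `as.factor()` followed by `fct_lump()` orders the levels, and how base R's `relevel()` could be used to pick a different baseline.

```r
library(forcats)

# Hypothetical zip codes, stored as numbers like the `neighbourhood` column
zips <- c(rep(78704, 4), rep(78701, 3), rep(78702, 2), 78745, 78758)

# Keep the 3 most frequent zip codes and lump the rest into "Other"
zip_factor <- fct_lump(as.factor(zips), n = 3)

# Levels are sorted, with "Other" placed last
levels(zip_factor)

# glm() treats the first level as the reference category
reference <- levels(zip_factor)[1]
reference

# Optional: choose a different baseline explicitly
zip_releveled <- relevel(zip_factor, ref = "78704")
levels(zip_releveled)[1]
```

On the real data, `levels(airbnb_model$neighbourhood)[1]` would name the zip code that the neighbourhood coefficients are measured against.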
vif(model_lm)
## GVIF Df GVIF^(1/(2*Df))
## room_type 1.042339 3 1.006935
## minimum_nights 1.017435 1 1.008680
## availability_365 1.019470 1 1.009688
## neighbourhood 1.039429 5 1.003875
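For predictors with more than one degree of freedom, `vif()` reports a generalized VIF, and the last column rescales it as GVIF^(1/(2*Df)) so that factors and single-coefficient predictors can be compared. As a rough check, the sketch below reproduces that column from the GVIF and Df values printed above; squaring the adjusted value puts it on the scale of the usual VIF rules of thumb.

```r
# GVIF and Df values copied from the vif() output above
gvif <- c(room_type = 1.042339, minimum_nights = 1.017435,
          availability_365 = 1.019470, neighbourhood = 1.039429)
df <- c(room_type = 3, minimum_nights = 1,
        availability_365 = 1, neighbourhood = 5)

# Adjusted value shown in the last column of the output
adjusted <- gvif^(1 / (2 * df))
round(adjusted, 6)

# Squared adjusted values are comparable to an ordinary VIF
vif_scale <- adjusted^2
round(vif_scale, 4)
```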
The model shows very low multicollinearity: every adjusted GVIF is close to 1, so the predictors are independent enough to be included in the model together
The model explains little of the variation in price: the residual deviance (7,796,495,326) is only about 1.5% below the null deviance (7,911,282,172), so predictive power is limited
The high residual variance is likely due to extreme outliers or untransformed, skewed prices.
The AIC of 182,225 carries no meaning on its own; it is only useful for comparing models fitted to the same response
The wide gap between the minimum and maximum residuals (-398 to 37,766) points to a heavily right-skewed error distribution and possibly unequal variance
The extreme maximum residual indicates some listings priced far above what the model predicts
The availability effect contradicts typical market behaviour (higher availability usually indicates lower demand)
The negative relationship between minimum nights and price needs investigation
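How much of the variation the model accounts for can be read off the two deviances in the summary. The sketch below is plain arithmetic on the printed values; with the fitted object available, the same quantity would be `1 - model_lm$deviance / model_lm$null.deviance`.

```r
# Deviances and dispersion copied from summary(model_lm)
null_dev   <- 7911282172
resid_dev  <- 7796495326
dispersion <- 697860.3

# Share of the null deviance removed by the predictors
# (for a Gaussian model with identity link this equals R-squared)
explained <- 1 - resid_dev / null_dev
round(100 * explained, 2)

# Residual standard deviation in dollars
resid_sd <- sqrt(dispersion)
round(resid_sd, 1)
```

A residual standard deviation of several hundred dollars, next to coefficients in the tens to low hundreds, is another way of seeing how noisy the untransformed model is.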
model_log <- glm(log(price) ~ room_type + minimum_nights + availability_365 + neighbourhood,
                 data = airbnb_model)
summary(model_log)
##
## Call:
## glm(formula = log(price) ~ room_type + minimum_nights + availability_365 +
## neighbourhood, data = airbnb_model)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.468e+00 2.974e-02 183.873 < 2e-16 ***
## room_typeHotel room 5.387e-01 8.045e-02 6.696 2.24e-11 ***
## room_typePrivate room -7.608e-01 2.205e-02 -34.502 < 2e-16 ***
## room_typeShared room -1.901e+00 1.014e-01 -18.754 < 2e-16 ***
## minimum_nights -5.262e-03 4.180e-04 -12.590 < 2e-16 ***
## availability_365 8.159e-04 6.297e-05 12.957 < 2e-16 ***
## neighbourhood78702 -3.103e-01 3.422e-02 -9.070 < 2e-16 ***
## neighbourhood78704 -3.490e-01 3.290e-02 -10.608 < 2e-16 ***
## neighbourhood78741 -6.827e-01 4.068e-02 -16.783 < 2e-16 ***
## neighbourhood78745 -5.373e-01 4.104e-02 -13.092 < 2e-16 ***
## neighbourhoodOther -4.837e-01 2.831e-02 -17.084 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.6072235)
##
## Null deviance: 8341.2 on 11182 degrees of freedom
## Residual deviance: 6783.9 on 11172 degrees of freedom
## AIC: 26170
##
## Number of Fisher Scoring iterations: 2
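The median residual is near zero (-0.143) and the residual range is far narrower (-2.9 to 5.3 on the log scale vs. -398 to 37,766 in dollars)
This suggests the log transform addressed the severe skewness in the price data
Standard errors are much smaller (e.g., 0.02974 for the intercept vs. 31.88 previously), but they are now on the log scale, so the two sets are not directly comparable
All neighbourhood effects are now significant at p < 2e-16, and the hotel room effect, previously non-significant, is significant as well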
Room Type | Coefficient | Interpretation (Multiplicative Effect) |
---|---|---|
Hotel | +0.5387 | Costs 71% more (e^0.5387 ≈ 1.71) |
Private | -0.7608 | Costs 53% less (e^-0.7608 ≈ 0.47) |
Shared | -1.901 | Costs 85% less (e^-1.901 ≈ 0.15) |
Example: An entire home at $200 would compare to:
Hotel room: $200 × 1.71 = **$342**
Private room: $200 × 0.47 = **$94**
Shared room: $200 × 0.15 = **$30**
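The multipliers in the table come from exponentiating the log-scale coefficients. A minimal sketch of that conversion, using the room type estimates copied from the summary (the vector names are only illustrative labels):

```r
# Room type coefficients from summary(model_log), on the log(price) scale
log_coefs <- c(hotel_room = 0.5387, private_room = -0.7608, shared_room = -1.901)

# exp() turns an additive effect on log(price) into a price multiplier
multiplier <- exp(log_coefs)
round(multiplier, 2)

# Percent change relative to the reference (entire home/apt)
pct_change <- 100 * (multiplier - 1)
round(pct_change)

# Implied prices when an otherwise identical entire home costs $200
implied_price <- 200 * round(multiplier, 2)
implied_price
```

With the fitted model, `exp(coef(model_log))` would give the multipliers for every term at once.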
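Each additional required night reduces price by about 0.525%
For a 30-night minimum the effect compounds: e^(30 × -0.005262) ≈ 0.854, about a 14.6% discount (the linear approximation 30 × 0.525% ≈ 15.7% slightly overstates it)
Each additional available day increases price by about 0.0816%
For 365-day availability the effect compounds: e^(365 × 0.0008159) ≈ 1.35, about a 35% higher price than a listing with zero availability (the linear approximation 365 × 0.0816% ≈ 30% understates it)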
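Because the model is linear in log(price), effects over many units compound multiplicatively rather than adding up. The sketch below compares the exact effect with the simple "units × percent" shortcut, using the two coefficients from the summary; `pct_effect()` is a helper defined here for illustration only.

```r
# Coefficients from summary(model_log)
b_min_nights   <- -5.262e-03
b_availability <-  8.159e-04

# Exact percent change in price for a change of `units` in the predictor
pct_effect <- function(coef, units) 100 * (exp(coef * units) - 1)

# 30-night minimum: exact effect vs. linear shortcut
exact_nights  <- pct_effect(b_min_nights, 30)
linear_nights <- 100 * b_min_nights * 30
round(c(exact = exact_nights, linear = linear_nights), 1)

# 365 days of availability: exact effect vs. linear shortcut
exact_avail  <- pct_effect(b_availability, 365)
linear_avail <- 100 * b_availability * 365
round(c(exact = exact_avail, linear = linear_avail), 1)
```

The shortcut is accurate for a single unit but drifts as the number of units grows, overstating a cumulative decrease and understating a cumulative increase.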
Zip Code | Coefficient | Price Multiplier | Equivalent Discount |
---|---|---|---|
78741 | -0.6827 | 0.51× | 49% |
Other | -0.4837 | 0.62× | 38% |
78704 | -0.3490 | 0.71× | 29% |
Maximum residual (5.32) still suggests some overpriced outliers
Consider winsorizing extreme log(price) values
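One way to winsorize is to clamp log(price) to chosen quantiles before refitting. The sketch below runs on simulated prices with two planted outliers (made-up data); the 1% and 99% cut-offs are an assumption for illustration, not a recommendation drawn from the analysis above.

```r
set.seed(1)

# Simulated prices: log-normal bulk plus two extreme listings
price <- c(exp(rnorm(500, mean = 5, sd = 0.7)), 25000, 40000)
log_price <- log(price)

# Lower and upper cut-offs at the 1st and 99th percentiles
bounds <- unname(quantile(log_price, probs = c(0.01, 0.99)))

# Clamp values outside the cut-offs to the nearest bound
log_price_w <- pmin(pmax(log_price, bounds[1]), bounds[2])

# Range before and after winsorizing
range(log_price)
range(log_price_w)
```

Unlike dropping rows, this keeps every listing in the data and only limits how far the extremes can pull the fit.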
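The dispersion parameter (0.607) is the estimated residual variance on the log scale (residual SD ≈ 0.78); unlike Poisson or binomial models, a Gaussian model has no benchmark value of 1, so this is not in itself a check of the variance structure
The AIC (26,170) is far lower than the first model’s (182,225), but the two are not directly comparable because the response changed from price to log(price); a fair comparison adds 2 × Σ log(price) to the log model’s AIC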
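AIC values computed on price and on log(price) are on different likelihood scales. A common adjustment expresses the log model's likelihood on the original price scale by adding the Jacobian term, 2 × Σ log(price), to its AIC. The sketch below demonstrates this on simulated data (hypothetical, with a single made-up predictor `x`), not on the Airbnb listings.

```r
set.seed(42)

# Simulated right-skewed prices driven by one predictor
n <- 500
x <- runif(n)
price <- exp(5 + 0.8 * x + rnorm(n, sd = 0.6))

fit_raw <- glm(price ~ x)        # Gaussian model on price
fit_log <- glm(log(price) ~ x)   # Gaussian model on log(price)

aic_raw       <- AIC(fit_raw)
aic_log_naive <- AIC(fit_log)

# Jacobian adjustment: put the log model's AIC on the price scale
aic_log_adjusted <- AIC(fit_log) + 2 * sum(log(price))

round(c(raw = aic_raw, log_naive = aic_log_naive, log_adjusted = aic_log_adjusted))
```

For the models above, the equivalent would be `AIC(model_log) + 2 * sum(log(airbnb_model$price))` compared against `AIC(model_lm)`, assuming both were fitted to the same rows.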