GLM

Loading Data

library(tidyverse)  # provides read_delim() (readr) and mutate() (dplyr)
airbnb <- read_delim("./airbnb_austin.csv", delim = ",")

## Rows: 15244 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): name, host_name, room_type
## dbl  (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl   (2): neighbourhood_group, license
## date  (1): last_review
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Binary Variable

I’ll define a listing as “Frequently Booked” if its number_of_reviews is above the median (used here as a proxy for consistent bookings):

# Convert to binary (1 = frequently booked, 0 = not)
median_reviews <- median(airbnb$number_of_reviews, na.rm = TRUE)
airbnb_data <- airbnb |>
  mutate(frequently_booked = as.numeric(number_of_reviews > median_reviews))
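Because the rule uses a strict `>`, listings with exactly the median number of reviews land in the 0 group, so the two classes need not be equal in size. A minimal sketch of that effect, using a made-up vector `reviews` (hypothetical values, not taken from the data set):

```r
# Hypothetical review counts, for illustration only (not from airbnb_austin.csv)
reviews <- c(0, 0, 1, 3, 3, 3, 8, 20, NA)

# Same rule as above: strictly greater than the median, ignoring NAs
median_reviews_toy <- median(reviews, na.rm = TRUE)
frequently_booked_toy <- as.numeric(reviews > median_reviews_toy)

# Class counts, including any NAs carried over from the input
table(frequently_booked_toy, useNA = "ifany")
```

In this toy vector the median is 3, so only the two listings with 8 and 20 reviews are labelled 1. Running the same `table()` call on `airbnb_data$frequently_booked` would show how balanced the real classes are.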

2. Logistic Regression Model

Predictors:

room_type (categorical): Entire home vs. others
price (continuous): Standardized
availability_365 (continuous): Days available/year

model_logit <- glm(frequently_booked ~ room_type + scale(price) + 
                   availability_365,
                   data = airbnb_data, family = "binomial")
summary(model_logit)

## 
## Call:
## glm(formula = frequently_booked ~ room_type + scale(price) + 
##     availability_365, family = "binomial", data = airbnb_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5752  -1.3056   0.8798   1.0139   2.8624  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.7904910  0.0419073  18.863  < 2e-16 ***
## room_typeHotel room   -3.1837214  0.4603770  -6.915 4.66e-12 ***
## room_typePrivate room -0.9882397  0.0584118 -16.918  < 2e-16 ***
## room_typeShared room  -0.4852431  0.2603035  -1.864   0.0623 .  
## scale(price)          -0.3717990  0.0467163  -7.959 1.74e-15 ***
## availability_365      -0.0015754  0.0001686  -9.346  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15228  on 11182  degrees of freedom
## Residual deviance: 14620  on 11177  degrees of freedom
##   (4061 observations deleted due to missingness)
## AIC: 14632
## 
## Number of Fisher Scoring iterations: 5

3. Interpreting Coefficients

Entire homes (reference category): Most likely to be frequently booked, holding price and availability constant
Hotel rooms: 95.9% lower odds vs entire homes
Private rooms: 62.8% lower odds vs entire homes
Shared rooms: Not significantly different from entire homes at the 0.05 level (p = 0.062)
For every 1 SD increase in price, odds of frequent bookings drop by 31%
Each additional available day reduces odds by 0.16%. This may indicate that less desirable listings stay available longer, while popular listings get booked early, which reduces their availability
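The percentages above come from exponentiating the log-odds coefficients. A short sketch of that arithmetic, with the estimates typed in by hand from the summary output (the names `coefs`, `odds_ratios`, and `pct_change` are introduced here for illustration):

```r
# Coefficient estimates copied from the summary(model_logit) output above
coefs <- c(
  "room_typeHotel room"   = -3.1837214,
  "room_typePrivate room" = -0.9882397,
  "room_typeShared room"  = -0.4852431,
  "scale(price)"          = -0.3717990,
  "availability_365"      = -0.0015754
)

# Odds ratio = exp(coefficient); percent change in odds = (OR - 1) * 100
odds_ratios <- exp(coefs)
pct_change  <- (odds_ratios - 1) * 100

round(cbind(odds_ratio = odds_ratios, pct_change = pct_change), 2)
```

With the fitted object available, `exp(coef(model_logit))` gives the same odds ratios directly.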

4. Confidence Interval for `price`

confint(model_logit, "scale(price)", level = 0.95)

## Waiting for profiling to be done...

##      2.5 %     97.5 % 
## -0.4672775 -0.2845991

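We’re 95% confident the true coefficient for price lies between -0.47 and -0.28 on the log-odds scale. Exponentiating the endpoints gives odds ratios of about 0.63 and 0.75, which means that a 1 SD price increase reduces the odds of frequent bookings by roughly 25% to 37%.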
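The interval returned by `confint()` is on the log-odds scale, so it has to be exponentiated before it can be read as a change in odds. A minimal sketch, with the endpoints copied by hand from the output above (variable names are illustrative):

```r
# Profile-likelihood interval for scale(price), copied from the confint() output above
ci_log_odds <- c(lower = -0.4672775, upper = -0.2845991)

# Exponentiate to move from log-odds to the odds-ratio scale
ci_odds_ratio <- exp(ci_log_odds)

# Percent reduction in odds per 1 SD increase in price
pct_reduction <- (1 - ci_odds_ratio) * 100
round(pct_reduction, 1)
```

On the fitted model, `exp(confint(model_logit, "scale(price)"))` should produce the same odds-ratio interval in one step.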

5. Diagnostic Insights:

Summary

All predictors except the shared-room indicator (p = 0.062) are significant at the 0.001 level
Residual deviance (14,620) < Null deviance (15,228), a drop of 608 on 5 degrees of freedom → Model fits better than intercept-only
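The deviance comparison can be made formal with a likelihood-ratio test. The sketch below uses only the deviances and degrees of freedom printed in the summary; the proportional reduction in deviance is added as a rough gauge of how much the predictors improve on the intercept-only fit, and is not something the original analysis reports.

```r
# Deviances and degrees of freedom copied from the summary(model_logit) output above
null_dev  <- 15228; null_df  <- 11182
resid_dev <- 14620; resid_df <- 11177

# Likelihood-ratio statistic: drop in deviance, with df = number of added terms
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

# Upper-tail chi-squared p-value for the test against the intercept-only model
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Share of the null deviance removed by the predictors
dev_reduction <- 1 - resid_dev / null_dev

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value, dev_reduction = dev_reduction)
```

The p-value is effectively zero, but the deviance falls by only about 4%, so the model is clearly better than intercept-only while still leaving most of the variation in booking frequency unexplained.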

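Median deviance residual (0.88) is well above zero → Residuals are not symmetric, which is common with a binary outcome
Max residual (2.86) suggests some under-predicted cases (frequently booked listings given a low predicted probability)
Largest effect: Hotel rooms (about 3x the private-room effect on the log-odds scale)
Price matters, but a 1 SD change in price shifts the odds less than room type does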
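To see the room-type gap on the probability scale, the log-odds can be passed through the inverse logit. This is a rough sketch under an assumed scenario chosen purely for illustration: mean price (so `scale(price)` is 0) and `availability_365` set to 0.

```r
# Estimates copied from the summary(model_logit) output above
intercept <- 0.7904910
b_hotel   <- -3.1837214
b_private <- -0.9882397

# plogis() is the inverse logit: it maps log-odds to a probability.
# Assumed scenario: mean price (scale(price) = 0) and availability_365 = 0.
p_entire  <- plogis(intercept)
p_private <- plogis(intercept + b_private)
p_hotel   <- plogis(intercept + b_hotel)

round(c(entire_home = p_entire, private_room = p_private, hotel_room = p_hotel), 3)
```

With the fitted model, `predict(model_logit, newdata = ..., type = "response")` would give probabilities for any other combination of predictor values.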

Recommendations for Hosts:

Convert private rooms to entire homes if possible
Avoid overpricing (a 1 SD price increase is associated with about 31% lower odds of frequent booking)
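Consider limiting calendar availability to signal exclusivity, with the caveat that high availability may be a result of low demand rather than a cause of it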

P.Aina