Loading Data

library(tidyverse)  # provides read_delim(), mutate(), case_when(), and ggplot()
airbnb <- read_delim("./airbnb_austin.csv", delim = ",")
## Rows: 15244 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): name, host_name, room_type
## dbl  (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl   (2): neighbourhood_group, license
## date  (1): last_review
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Focus:

  1. Response Variable: price (continuous) – This is likely the most valuable column for hosts and guests, as it directly impacts booking decisions and revenue.

  2. Explanatory Variable: room_type (categorical) – I expect the room type to influence the price.

  3. Numeric Predictor: minimum_nights (integer count) – The minimum stay a host requires may be related to price, since hosts who target longer stays may follow different pricing strategies.

ANOVA Test

Null Hypothesis (\(H_0\)): The mean listing price is the same across all room types.

anova_result <- aov(price ~ room_type, data = airbnb)

summary(anova_result)
##                Df    Sum Sq Mean Sq F value  Pr(>F)    
## room_type       3 2.588e+07 8627045   12.23 5.5e-08 ***
## Residuals   11179 7.885e+09  705376                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 4061 observations deleted due to missingness

The p-value (0.000000055) is much smaller than the significance level (α = 0.05), so we have strong evidence to reject the null hypothesis. We conclude that at least one room type has a mean price that differs from the others. The F value of 12.23 means that the variability between room-type means is about 12 times what we would expect from the variability within each type alone. The effect is small in absolute terms, however: room type accounts for only about 0.3% of the total variation in price (2.588e+07 of a total sum of squares of roughly 7.911e+09), so room type matters but leaves most of the price variation unexplained.
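The figures quoted above can be recomputed from the ANOVA table itself. The sketch below is only a check on the arithmetic: the sums of squares are the rounded values printed in the table, so results agree with the output only up to rounding, and the variable names are illustrative.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 2.588e7   # Sum Sq for room_type
ss_within  <- 7.885e9   # Sum Sq for Residuals
df_between <- 3
df_within  <- 11179

# Mean squares and the F statistic (ratio of between- to within-group variance)
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_value    <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation in price attributable to room type
eta_sq <- ss_between / (ss_between + ss_within)

round(c(F = f_value, eta_sq = eta_sq), 4)
p_value
```

A large F with a tiny eta squared is typical of big samples: with more than 11,000 listings, even modest differences in group means are detected as statistically significant.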

ggplot(airbnb, aes(x = room_type, y = price, fill = room_type)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(10, 1000)) + # zoom the y-axis; extreme prices stay in the data but are not shown
  labs(title = "Distribution of Prices by Room Type",
       x = "Room Type",
       y = "Price") +
  theme_minimal()
## Warning: Removed 4061 rows containing non-finite values (stat_boxplot).
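The ANOVA only says that at least one room type differs; it does not say which ones. One common follow-up is Tukey's HSD, which compares every pair of groups. The sketch below runs it on simulated data, because the listing file is not reproduced here: the room-type labels, group means, and sample sizes are all made up for illustration and are not results from the Austin data.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in for the listings; labels and means are assumptions
room_levels <- c("Entire home/apt", "Hotel room", "Private room", "Shared room")
toy <- data.frame(
  room_type = factor(rep(room_levels, each = 50), levels = room_levels),
  price     = c(rnorm(50, 300, 80), rnorm(50, 250, 120),
                rnorm(50, 120, 40), rnorm(50, 60, 20))
)

toy_anova <- aov(price ~ room_type, data = toy)

# Tukey's HSD: all pairwise differences with p-values adjusted for multiple testing
tukey <- TukeyHSD(toy_anova)
tukey$room_type
```

On the real data the same call would be TukeyHSD(anova_result); each row of the result gives the estimated difference between two room types, a 95% interval, and an adjusted p-value.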

Recommendations

This difference can justify a host listing an entire home/apartment at a higher price, given its appeal to families and groups. A host listing a private or shared room, on the other hand, might focus on competitive pricing to attract budget-conscious travellers.

A guest looking for a budget-friendly option might consider a private or shared room, while a guest looking for space and privacy might consider an entire home/apartment, which is likely to cost more.

Hotel room prices vary widely. Some hotels target budget travellers while others cater to luxury clients, which results in large price differences within this category.

Linear Regression

I will explore the relationship between price and minimum_nights. Longer minimum stay requirements might correlate with higher or lower prices, depending on the host’s strategy. Rather than using minimum_nights directly, I group it into four stay-length categories:

  • If the minimum nights are between 1 and 3, it is classified as a “Short Stay.”
  • If the minimum nights are between 4 and 7, it is classified as a “Medium Stay.”
  • If the minimum nights are between 8 and 30, it is classified as a “Long Stay.”
  • If the minimum nights are 31 or more, it is classified as an “Extended Stay.”
airbnb_group <- airbnb |>
  mutate(
    stay_length = case_when(
      minimum_nights >= 1 & minimum_nights <= 3 ~ "Short Stay",
      minimum_nights >= 4 & minimum_nights <= 7 ~ "Medium Stay",
      minimum_nights >= 8 & minimum_nights <= 30 ~ "Long Stay",
      minimum_nights >= 31 ~ "Extended Stay",
      TRUE ~ NA_character_
    )
  )
airbnb_data <- airbnb_group |>
  mutate(stay_length = factor(stay_length, levels = c("Short Stay", "Medium Stay", "Long Stay", "Extended Stay")))
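To see how the category boundaries behave, the same grouping can be sketched with base R's cut() on a handful of test values. This is an alternative formulation for checking the edges, not the code used in the analysis, and it assumes minimum_nights holds whole numbers; the nights vector is made up for illustration.

```r
# Boundary values for each category, plus 0 and NA as edge cases
nights <- c(1, 3, 4, 7, 8, 30, 31, 365, 0, NA)

stay_labels <- c("Short Stay", "Medium Stay", "Long Stay", "Extended Stay")

# Intervals are right-closed by default: (0,3], (3,7], (7,30], (30,Inf]
stay_length <- cut(nights, breaks = c(0, 3, 7, 30, Inf), labels = stay_labels)

data.frame(nights, stay_length)
```

cut() returns a factor whose levels follow the order of the labels, so Short Stay is the first level and becomes the reference category in a regression, just as in the factor() call above. Values of 0 and missing values fall outside every interval and come back as NA.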

lm_model <- lm(price ~ stay_length, data = airbnb_data)

summary(lm_model)
## 
## Call:
## lm(formula = price ~ stay_length, data = airbnb_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -295   -203   -121    -17  37833 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               310.115      8.882  34.915  < 2e-16 ***
## stay_lengthMedium Stay     -4.992     40.573  -0.123  0.90208    
## stay_lengthLong Stay     -176.309     23.350  -7.551 4.67e-14 ***
## stay_lengthExtended Stay -145.842     48.848  -2.986  0.00284 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 838.9 on 11179 degrees of freedom
##   (4061 observations deleted due to missingness)
## Multiple R-squared:  0.005645,   Adjusted R-squared:  0.005378 
## F-statistic: 21.15 on 3 and 11179 DF,  p-value: 1.167e-13
ggplot(airbnb_data, aes(x = stay_length, y = price, color = stay_length)) +
  geom_point(alpha = 0.5, position = position_jitter(width = 0.2)) + 
  coord_cartesian(ylim = c(10, 1000)) + # zoom the y-axis; extreme prices stay in the data but are not shown
  geom_smooth(method = "lm", se = FALSE, color = "black", aes(group = 1)) +
  labs(title = "Price Distribution by Length of Stay",
       x = "Length of Stay",
       y = "Price (in dollars)",
       color = "Stay Length") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4061 rows containing non-finite values (stat_smooth).
## Warning: Removed 4061 rows containing missing values (geom_point).

Interpretation of Coefficients

1. Intercept (310.115):

  • This is the average price for the reference category, which is Short Stay (1–3 nights).

  • Interpretation: When stay_length is Short Stay, the average price is $310.12.

2. Medium Stay (-4.992):

  • This coefficient represents the difference in average price between Medium Stay (4–7 nights) and the reference category (Short Stay).

  • Interpretation: The average price for Medium Stay is $4.99 lower than for Short Stay.

  • However, the p-value is 0.902, which is much greater than 0.05. This means the difference is not statistically significant. In other words, there is no strong evidence that Medium Stay listings are priced differently from Short Stay listings.

3. Long Stay (-176.309):

  • This coefficient represents the difference in average price between Long Stay (8–30 nights) and the reference category (Short Stay).

  • Interpretation: The average price for Long Stay is $176.31 lower than for Short Stay.

  • The p-value is 4.67e-14, which is much less than 0.05. This means the difference is statistically significant. In other words, Long Stay listings are significantly cheaper than Short Stay listings.

4. Extended Stay (-145.842):

  • This coefficient represents the difference in average price between Extended Stay (31+ nights) and the reference category (Short Stay).

  • Interpretation: The average price for Extended Stay is $145.84 lower than for Short Stay.

  • The p-value is 0.00284, which is less than 0.05. This means the difference is statistically significant. In other words, Extended Stay listings are significantly cheaper than Short Stay listings.
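Because every coefficient is a difference from the Short Stay mean, the average price for each category can be recovered by adding the coefficient to the intercept. The sketch below does that arithmetic and also builds approximate 95% intervals for each difference from the printed standard errors. The numbers are copied from the model summary; the data frame and column names are illustrative.

```r
# Estimates and standard errors copied from the model summary above
coefs <- data.frame(
  term     = c("Medium Stay", "Long Stay", "Extended Stay"),
  estimate = c(-4.992, -176.309, -145.842),
  std_err  = c(40.573, 23.350, 48.848)
)
intercept <- 310.115   # mean price for Short Stay, the reference level
df_resid  <- 11179     # residual degrees of freedom

# Implied mean price per category = intercept + coefficient
coefs$mean_price <- intercept + coefs$estimate

# Approximate 95% interval for each difference from Short Stay
t_crit <- qt(0.975, df_resid)
coefs$lower <- coefs$estimate - t_crit * coefs$std_err
coefs$upper <- coefs$estimate + t_crit * coefs$std_err

coefs
```

The interval for Medium Stay spans zero, which matches its non-significant p-value, while the intervals for Long Stay and Extended Stay lie entirely below zero. On the fitted model, confint(lm_model) would give the same intervals without the rounding.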

Key Findings

Listings with short minimum stays are priced higher on average, possibly because their hosts can attract more bookings and maximize revenue. Both Long Stay and Extended Stay listings are significantly cheaper than Short Stay listings. This could be because hosts offer discounts to attract guests who are willing to commit to a longer booking.

The price difference between Medium Stay and Short Stay is negligible and not statistically significant. This suggests that hosts do not price listings with a 4–7 night minimum noticeably differently from those with a 1–3 night minimum.

Recommendations

Since Short Stay listings command higher prices, hosts who want to offer short stays should focus on attracting short-term guests, for example by offering amenities or services that appeal to them (flexible check-in/check-out times, parking, etc.). Hosts who want to attract guests for longer stays should offer competitive pricing.

Guests planning a longer stay should look for Long Stay or Extended Stay listings, as they tend to be cheaper, and can try to negotiate additional discounts with hosts for longer bookings. Guests who need a place for just a few nights should be prepared to pay a premium for Short Stay listings.

Conclusion

The linear regression model shows that Short Stay and Medium Stay listings are the most expensive and are not statistically distinguishable from each other, while Long Stay and Extended Stay listings are significantly cheaper. Hosts can use this information to optimize their pricing strategies, while guests can use it to find budget-friendly options for longer stays. However, stay length alone explains less than 1% of the variation in price (R-squared = 0.0056), so other factors like neighbourhood and room type are also important in determining price, and further analysis is needed to account for these variables.
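As a rough check on how much of the price variation the stay-length model accounts for, the R-squared can be recovered from the F-statistic and degrees of freedom printed in the model summary. The sketch below uses the standard identity between the overall F-test and R-squared; the inputs are the rounded values from the summary, so the result matches the reported 0.005645 only up to rounding.

```r
# Values copied from the last line of the model summary
f_stat   <- 21.15
df_model <- 3       # number of stay_length coefficients
df_resid <- 11179   # residual degrees of freedom

# Identity: F = (R^2 / df_model) / ((1 - R^2) / df_resid), solved for R^2
r_squared <- (f_stat * df_model) / (f_stat * df_model + df_resid)

round(r_squared, 6)
```

The model is highly significant and yet explains well under 1% of the variation in price, which is why adding predictors such as neighbourhood and room type is the natural next step.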