Loading Data

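library(tidyverse)  # readr::read_delim(), dplyr verbs, ggplot2
library(pwrss)      # pwrss.t.2means() for the power analysis

airbnb <- read_delim("./airbnb_austin.csv", delim = ",")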
## Rows: 15244 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): name, host_name, room_type
## dbl  (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl   (2): neighbourhood_group, license
## date  (1): last_review
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Define Parameters:

Alpha Level (α): I choose α = 0.05, the conventional threshold. It caps the Type I error rate (false positives) at 5% without making the test so strict that statistical power suffers.

Power Level (1−β): I choose a power of 0.85, meaning an 85% chance of detecting a true effect at least as large as the minimum effect size, and therefore a Type II error rate of β = 0.15.

Minimum Effect Size (Δ): I define a meaningful difference in average prices as $50. This is a practical choice: a gap of this size is large enough to plausibly influence a customer’s booking decision.
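
As a rough guide to how these three parameters drive the required sample size, the usual normal-approximation formula for a two-sided, two-sample comparison with equal group sizes is given below. Here \(\sigma\) stands for a common standard deviation of prices; it is an assumption at this point, since it is only estimated from the data later, and the approximation slightly understates the exact t-based answer.

\[
n_{\text{per group}} \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\Delta^2}
= \frac{2\,(1.960 + 1.036)^2\,\sigma^2}{50^2}
\approx 18\left(\frac{\sigma}{50}\right)^2
\]

So halving Δ, or doubling \(\sigma\), roughly quadruples the number of listings needed in each group.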

neighbourhood_count <- airbnb |>
  group_by(neighbourhood) |>
  summarise(listing_count = n()) |>
  arrange(desc(listing_count))

top_neighbourhood <- neighbourhood_count$neighbourhood[1]

print(top_neighbourhood)
## [1] 78704
top_neighbourhood_data <- airbnb |>
  filter(neighbourhood == top_neighbourhood)

head(top_neighbourhood_data)
## # A tibble: 6 x 18
##       id name       host_id host_name neighbourhood_group neighbourhood latitude
##    <dbl> <chr>        <dbl> <chr>     <lgl>                       <dbl>    <dbl>
## 1   6413 Gem of a ~   13879 Todd      NA                          78704     30.2
## 2   6448 Secluded ~   14156 Amy       NA                          78704     30.3
## 3 353261 4/3.5 SoC~ 1789494 Lara      NA                          78704     30.2
## 4 354263 Best Litt~ 1752493 Gigi      NA                          78704     30.2
## 5 355232 Great SXS~ 1798084 Jeffrey   NA                          78704     30.2
## 6 355328 SXSW - mu~ 1798834 Joan      NA                          78704     30.2
## # i 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <lgl>

I decided to select the neighborhood with the highest number of listings to obtain ample data. By focusing on a single neighborhood, I can also mitigate price discrepancies caused by varying levels of luxury when determining the mean.

Hypothesis 1: Neyman-Pearson Testing

Null Hypothesis (\(H_0\)):

There is no difference in the mean price of listings between Hotel rooms and Private rooms in the same neighbourhood.

Alternative Hypothesis (\(H_a\)):

There is a difference in the mean price of listings between Hotel rooms and Private rooms in the same neighbourhood.

sd_prices <- top_neighbourhood_data |>
  filter(room_type %in% c("Hotel room", "Private room")) |>
  group_by(room_type) |>
  summarize(sd_price = sd(price, na.rm = TRUE))
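# Not a true pooled SD: using the larger of the two group SDs is a conservative
# choice that can only increase the required sample size.
sd_pooled <- max(sd_prices$sd_price)

test <- pwrss.t.2means(
  mu1 = 50,  # mu2 defaults to 0, so mu1 is the minimum difference of $50
  sd1 = sd_pooled,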
  kappa = 1,
  power = 0.85,
  alpha = 0.05,
  alternative = "not equal"
)
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.85 
##   n1 = 84 
##   n2 = 84 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 166 
##  Non-centrality parameter = 3.016 
##  Type I error rate = 0.05 
##  Type II error rate = 0.15
plot(test)
## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'

print(test)
## $parms
## $parms$mu1
## [1] 50
## 
## $parms$mu2
## [1] 0
## 
## $parms$sd1
## [1] 107.4533
## 
## $parms$sd2
## [1] 107.4533
## 
## $parms$kappa
## [1] 1
## 
## $parms$welch.df
## [1] FALSE
## 
## $parms$paired
## [1] FALSE
## 
## $parms$paired.r
## [1] 0.5
## 
## $parms$alpha
## [1] 0.05
## 
## $parms$margin
## [1] 0
## 
## $parms$alternative
## [1] "not equal"
## 
## $parms$verbose
## [1] TRUE
## 
## 
## $test
## [1] "t"
## 
## $df
## [1] 166
## 
## $ncp
## [1] 3.015607
## 
## $power
## [1] 0.85
## 
## $n
## n1 n2 
## 84 84 
## 
## attr(,"class")
## [1] "pwrss"  "t"      "2means"
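
The required sample size can be cross-checked with base R. The sketch below uses `power.t.test()` from the stats package with the same inputs; the standard deviation is copied by hand from the printed `$parms$sd1` value (an assumption made so the snippet runs without the data file), and `check` and `n_per_group` are illustrative names. The result should land at or very near the 84 per group reported above.

```r
# Cross-check of the pwrss result with base R (stats package).
# All inputs are copied from the parameters and output shown above.
check <- power.t.test(
  delta = 50,                 # minimum effect size in dollars
  sd = 107.4533,              # larger of the two group SDs ($parms$sd1)
  sig.level = 0.05,           # alpha
  power = 0.85,               # 1 - beta
  type = "two.sample",
  alternative = "two.sided"
)

# power.t.test() returns a fractional n; round up to whole listings
n_per_group <- ceiling(check$n)
print(n_per_group)
```
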
filtered_data <- top_neighbourhood_data |>
  filter(room_type %in% c("Hotel room", "Private room"), !is.na(price))

t_test_result <- t.test(price ~ room_type, data = filtered_data, var.equal = FALSE)

print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  price by room_type
## t = 7.046, df = 44.436, p-value = 9.297e-09
## alternative hypothesis: true difference in means between group Hotel room and group Private room is not equal to 0
## 95 percent confidence interval:
##   90.65438 163.26133
## sample estimates:
##   mean in group Hotel room mean in group Private room 
##                   263.2222                   136.2644

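The Welch t-test gives t = 7.05 with p = 9.3e-09, far below α = 0.05, so we reject the null hypothesis: in this neighbourhood, hotel rooms have a higher mean price than private rooms ($263.22 vs. $136.26, a difference of about $127). The 95% confidence interval for the difference ($90.65 to $163.26) lies entirely above the $50 minimum effect size, so the difference is practically meaningful as well as statistically significant. One caveat: the Welch degrees of freedom (about 44) imply that the smaller group has at most 45 listings, fewer than the 84 per group that the power analysis called for.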
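
Two quick checks tie this result back to the planning parameters: whether the whole confidence interval clears the $50 minimum effect size, and whether the groups can be as large as the power analysis requires. The sketch below is a minimal illustration that hard-codes the numbers printed above (an assumption, so that it runs without the data file); all variable names are illustrative.

```r
# Values copied from the printed outputs above (hard-coded for illustration)
ci_lower   <- 90.65438   # lower bound of the 95% CI for the mean difference
delta_min  <- 50         # minimum effect size chosen earlier
welch_df   <- 44.436     # Welch degrees of freedom from t.test()
n_required <- 84         # per-group n from the power analysis

# Is even the low end of the interval above the minimum meaningful difference?
exceeds_delta <- ci_lower > delta_min

# The Welch df is never below min(n1, n2) - 1, so the smaller group
# can have at most floor(welch_df) + 1 observations.
max_smaller_group <- floor(welch_df) + 1
below_planned_n <- max_smaller_group < n_required

print(c(exceeds_delta = exceeds_delta, below_planned_n = below_planned_n))
```

With the data loaded, `filtered_data |> count(room_type)` gives the exact group sizes to compare against the planned 84.
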

ggplot(filtered_data, aes(x = room_type, y = price, fill = room_type)) +
  geom_boxplot() +
  labs(title = "Distribution of Prices by Room Type",
       x = "Room Type",
       y = "Price (in dollars)") +
  theme_minimal()

Hypothesis 2: Fisher’s Significance Testing

Null Hypothesis (\(H_0\)):

There is no association between room type (Entire Apartment vs. Private/Shared Room) and the minimum number of nights required for booking. Knowing the room type tells us nothing about whether a listing requires a short or a long minimum stay.

This analysis can help a host see whether minimum-stay requirements differ between entire apartments and private or shared rooms.

fil_airbnb <- airbnb |>
  mutate(room_category = ifelse(room_type == "Entire home/apt", "Entire Apartment", "Private/Shared Room"))

table(fil_airbnb$room_category)
## 
##    Entire Apartment Private/Shared Room 
##               12429                2815
f_airbnb <- fil_airbnb |>
  mutate(min_night_category = ifelse(minimum_nights <= 7, "Short-term", "Long-term"))

room_min_table <- table(f_airbnb$room_category, f_airbnb$min_night_category)

print(room_min_table)
##                      
##                       Long-term Short-term
##   Entire Apartment         1751      10678
##   Private/Shared Room       518       2297
fisher_test_result <- fisher.test(room_min_table)

print(fisher_test_result)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  room_min_table
## p-value = 1.443e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.6521769 0.8116727
## sample estimates:
## odds ratio 
##  0.7271723

The test gives strong evidence against the null hypothesis. A p-value of 1.443e-08 indicates an association between room type (Entire Apartment vs. Private/Shared Room) and the minimum-nights category.

The odds ratio of 0.727 gives the direction and magnitude of the association. Because Entire Apartment is the first row of the table and Long-term the first column, it says that the odds of a long-term minimum (more than 7 nights) for Entire Apartments are about 0.73 times the odds for Private/Shared Rooms. The counts agree: 1751 of 12429 Entire Apartments (14.1%) fall in the long-term category, compared with 518 of 2815 Private/Shared Rooms (18.4%). Private/Shared Rooms are therefore more likely, not less, to require a longer minimum stay, even though Entire Apartments account for most long-term listings in absolute terms.

The 95% confidence interval, 0.652 to 0.812, excludes 1, which is consistent with the small p-value. The result shows an association that is unlikely to be due to chance alone. It does not show that room type causes longer or shorter stays, and minimum nights is a requirement set by the host, not the length of stay that guests actually book.
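
The direction of an odds ratio is easy to misread, so it helps to recompute it from the table together with the row proportions. The sketch below rebuilds the 2x2 table from the printed counts (hard-coded here as an assumption, so that it runs without the data file); `tab_counts`, `row_shares`, and `sample_or` are illustrative names. Note that `fisher.test()` reports a conditional maximum-likelihood estimate, so its odds ratio can differ slightly from this simple cross-product version.

```r
# 2x2 table copied from the printed output above
tab_counts <- matrix(
  c(1751, 10678,
     518,  2297),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    room_category      = c("Entire Apartment", "Private/Shared Room"),
    min_night_category = c("Long-term", "Short-term")
  )
)

# Share of long-term and short-term listings within each room category
row_shares <- prop.table(tab_counts, margin = 1)
print(round(row_shares, 3))

# Cross-product odds ratio: odds of Long-term in row 1 (Entire Apartment)
# divided by odds of Long-term in row 2 (Private/Shared Room)
sample_or <- (tab_counts[1, 1] * tab_counts[2, 2]) /
  (tab_counts[1, 2] * tab_counts[2, 1])
print(round(sample_or, 3))
```

A value below 1 here means the first row (Entire Apartment) has the lower odds of falling in the first column (Long-term).
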

ggplot(f_airbnb, aes(x = room_category, fill = min_night_category)) +
  geom_bar(position = "dodge") +
  labs(title = "Count of Short-term and Long-term Rentals by Room Category",
       x = "Room Category",
       y = "Count",
       fill = "Minimum Nights Category") +
  theme_minimal()

This plot helps hosts, platform managers, and analysts understand the supply side of Airbnb listings. It shows that Entire Apartments are by far the most common listing type and that short-term minimums dominate in both room categories. Because the bars show counts, Entire Apartments also have more long-term listings in absolute terms (1751 vs. 518), but as a share of each category, long-term minimums are more common among Private/Shared Rooms (18.4% vs. 14.1%).
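
Because the two room categories differ so much in size, a proportional bar chart can make the comparison easier to read than raw counts. The sketch below is one possible variant, not part of the original analysis: it rebuilds the counts from the printed table (hard-coded as an assumption) and uses `position = "fill"` so that each bar sums to 100%. The names `counts` and `p` are illustrative.

```r
library(ggplot2)

# Counts copied from the printed contingency table above
counts <- data.frame(
  room_category      = rep(c("Entire Apartment", "Private/Shared Room"), each = 2),
  min_night_category = rep(c("Long-term", "Short-term"), times = 2),
  n                  = c(1751, 10678, 518, 2297)
)

# position = "fill" rescales each bar to 1, so the bars show within-category shares
p <- ggplot(counts, aes(x = room_category, y = n, fill = min_night_category)) +
  geom_col(position = "fill") +
  labs(title = "Share of Short-term and Long-term Rentals by Room Category",
       x = "Room Category",
       y = "Share of listings",
       fill = "Minimum Nights Category") +
  theme_minimal()

print(p)
```
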