Sampling and Drawing Conclusions

Loading Data

library(tidyverse)  # provides read_delim(), sample_n(), summarise(), count(), bind_rows()
airbnb <- read_delim("./airbnb_austin.csv", delim = ",")

## Rows: 15244 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): name, host_name, room_type
## dbl  (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl   (2): neighbourhood_group, license
## date  (1): last_review
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Sampling: 5 Random Samples

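n_samples <- 5
sample_size <- round(nrow(airbnb) * 0.5)  # half of the full dataset

# Draw each subsample with replacement, so a listing can appear more than once
subsamples <- lapply(1:n_samples, function(i) {
  airbnb |> sample_n(sample_size, replace = TRUE)
})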


df_1 <- subsamples[[1]]
df_2 <- subsamples[[2]]
df_3 <- subsamples[[3]]
df_4 <- subsamples[[4]]
df_5 <- subsamples[[5]]

summarize_subsample <- function(df) {
  df |>
    summarise(
      avg_price = mean(price, na.rm = TRUE),
      avg_reviews = mean(number_of_reviews, na.rm = TRUE),
      n_listings = n()
    )
}

summary_stats <- lapply(subsamples, summarize_subsample)
summary_stats <- bind_rows(summary_stats, .id = "subsample")
print(summary_stats)

## # A tibble: 5 x 4
##   subsample avg_price avg_reviews n_listings
##   <chr>         <dbl>       <dbl>      <int>
## 1 1              283.        41.1       7622
## 2 2              298.        41.9       7622
## 3 3              296.        42.5       7622
## 4 4              282.        42.8       7622
## 5 5              300.        42.0       7622

subsample: Each row represents one of the 5 random subsamples.
avg_price: The average listing price in each subsample ranges from about 282 to 300.
avg_reviews: The average number of reviews per listing in each subsample ranges from 41.1 to 42.8.
n_listings: The number of listings in each subsample. All subsamples have 7,622 listings, 50% of the 15,244 rows in the dataset.
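
Ranges like the ones quoted above are easy to get wrong when typed by hand, so it can help to compute them from the results. The following is a minimal sketch in base R. It assumes a synthetic price vector (`toy_prices`, a hypothetical stand-in for the real `price` column), so the numbers it prints will not match the Airbnb output.

```r
# Synthetic stand-in for the listing prices (assumption: not the real Airbnb data).
set.seed(42)
toy_prices <- rlnorm(2000, meanlog = 5, sdlog = 0.8)

# Draw 5 subsamples of half the data, with replacement, mirroring the setup above.
n_samples <- 5
sample_size <- round(length(toy_prices) * 0.5)
subsample_means <- vapply(
  seq_len(n_samples),
  function(i) mean(sample(toy_prices, sample_size, replace = TRUE)),
  numeric(1)
)

# Report the range straight from the computed values instead of typing it in.
price_range <- range(subsample_means)
cat(sprintf("avg_price ranges from %.2f to %.2f\n", price_range[1], price_range[2]))
```

The same idea applies to the real `summary_stats` table: calling `range()` on its `avg_price` column keeps the written interpretation in sync with whatever the random draw produced.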

Identify Anomalies

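# Flag subsamples whose average price lies outside the interquartile range
# of the five subsample means
anomalies <- summary_stats |>
  filter(avg_price > quantile(avg_price, 0.75) | avg_price < quantile(avg_price, 0.25))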

print(anomalies)

## # A tibble: 2 x 4
##   subsample avg_price avg_reviews n_listings
##   <chr>         <dbl>       <dbl>      <int>
## 1 4              282.        42.8       7622
## 2 5              300.        42.0       7622

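The filter above flags any subsample whose average price falls outside the interquartile range of the five subsample means. With only five distinct values, that rule always flags the smallest and the largest, so subsamples 4 and 5 appear here by construction rather than because they are unusual. Their average prices (about 282 and 300) differ by roughly 6%, and the average review counts stay between 41.1 and 42.8. Differences of this size are consistent with random sampling variability, so the subsamples are relatively consistent, with no major outliers.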
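
One way to make "significantly higher or lower" concrete is to compare each subsample mean with the full-data mean in units of the standard error. This is a rough sketch of that idea, not the method used above. It assumes synthetic prices (`toy_prices`, hypothetical) and a cutoff of two standard errors, which is a common convention rather than something taken from this analysis.

```r
# Synthetic stand-in for the listing prices (assumption: not the real Airbnb data).
set.seed(1)
toy_prices <- rlnorm(2000, meanlog = 5, sdlog = 0.8)
sample_size <- 1000

# Five subsample means, each drawn with replacement.
subsample_means <- replicate(
  5,
  mean(sample(toy_prices, sample_size, replace = TRUE))
)

# Approximate standard error of the mean of one subsample of this size.
full_mean <- mean(toy_prices)
se <- sd(toy_prices) / sqrt(sample_size)

# Distance of each subsample mean from the full-data mean, in standard errors.
z_scores <- (subsample_means - full_mean) / se
flagged <- abs(z_scores) > 2  # assumed cutoff: more than 2 standard errors away

print(data.frame(
  subsample = 1:5,
  avg_price = subsample_means,
  z = z_scores,
  flagged = flagged
))
```

Unlike the quartile filter, this rule can return zero flagged subsamples, which is the expected outcome when all of them come from the same data.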

Check Consistency Across Subsamples

common_room_type <- lapply(subsamples, function(df) {
  df |>
    count(room_type) |>
    arrange(desc(n)) |>
    slice(1)
})

common_room_type <- bind_rows(common_room_type, .id = "subsample")
print(common_room_type)

## # A tibble: 5 x 3
##   subsample room_type           n
##   <chr>     <chr>           <int>
## 1 1         Entire home/apt  6193
## 2 2         Entire home/apt  6211
## 3 3         Entire home/apt  6237
## 4 4         Entire home/apt  6186
## 5 5         Entire home/apt  6195

The most common room type is "Entire home/apt" in all five subsamples. It accounts for between 81% and 82% of the 7,622 listings in each one, which suggests a stable pattern in the data.
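
Counts of the top category alone hide how the other room types behave, so comparing shares for every category can be a useful follow-up. The sketch below uses base R and an invented `toy_room_type` vector. Its category names and probabilities are assumptions chosen for illustration, not values from the Airbnb file.

```r
# Synthetic room types (assumption: labels and probabilities are made up).
set.seed(7)
room_levels <- c("Entire home/apt", "Private room", "Shared room")
toy_room_type <- sample(room_levels, 2000, replace = TRUE,
                        prob = c(0.80, 0.18, 0.02))

# For each of 5 subsamples, compute the share of every room type.
shares <- vapply(
  1:5,
  function(i) {
    draw <- sample(toy_room_type, 1000, replace = TRUE)
    tabulate(match(draw, room_levels), nbins = length(room_levels)) / length(draw)
  },
  numeric(length(room_levels))
)
dimnames(shares) <- list(room_levels, paste0("subsample_", 1:5))

# Rows are room types, columns are subsamples; each column sums to 1.
print(round(shares, 3))
```

If the rows look similar across columns, the composition of the subsamples is stable, not just the identity of the most common category.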

Monte Carlo Simulation

n_simulations <- 1000  

mc_results <- replicate(n_simulations, {  
  subsample <- airbnb |> sample_n(sample_size, replace = TRUE)  
  mean(subsample$price, na.rm = TRUE)  
})  

hist(  
  mc_results,  
  breaks = 30,  
  col = "skyblue",  
  border = "white",  
  main = "Monte Carlo Simulation: Distribution of Average Price",  
  xlab = "Average Price",  
  ylab = "Frequency"  
)

The histogram shows the distribution of the average price across 1,000 resampled subsamples. The peak is around $280, which serves as an estimate of the average price of an Airbnb listing. The distribution is narrow and centred on a single peak, meaning the average changes little from one subsample to the next, so it is a stable estimate for this dataset.
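
A histogram shows the shape, but "narrow" is easier to judge with numbers. A minimal sketch, assuming synthetic prices (`toy_prices`, hypothetical) in place of the real data, summarises the simulated means with their standard deviation and a 95% percentile interval.

```r
# Synthetic stand-in for the listing prices (assumption: not the real Airbnb data).
set.seed(123)
toy_prices <- rlnorm(2000, meanlog = 5, sdlog = 0.8)
sample_size <- 1000
n_simulations <- 1000

# Repeat the resampling many times and keep the mean of each draw.
mc_means <- replicate(
  n_simulations,
  mean(sample(toy_prices, sample_size, replace = TRUE))
)

# Numeric summary of the simulated distribution.
mc_center <- mean(mc_means)                                  # centre of the distribution
mc_sd <- sd(mc_means)                                        # spread of the simulated means
mc_interval <- quantile(mc_means, probs = c(0.025, 0.975))   # middle 95% of the means

cat(sprintf("centre: %.2f, sd: %.2f\n", mc_center, mc_sd))
print(mc_interval)
```

Applied to `mc_results`, the same three lines would put a number on how much the average price moves between subsamples.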

Conclusion

This investigation highlights the importance of considering variability, checking whether apparent anomalies are more than sampling noise, and understanding data limitations. By applying these practices, I can draw more reliable and nuanced conclusions in future analyses. Tools like Monte Carlo simulation and subsampling are handy for exploring data and quantifying uncertainty.