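library(readr)  # read_delim()
library(dplyr)  # sample_n(), summarise(), count(), filter(), bind_rows()
airbnb <- read_delim("./airbnb_austin.csv", delim = ",")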
## Rows: 15244 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): name, host_name, room_type
## dbl (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl (2): neighbourhood_group, license
## date (1): last_review
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
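n_samples <- 5
sample_size <- round(nrow(airbnb) * 0.5)  # each subsample is half the dataset
# Draw each subsample with replacement, so a listing can appear more than once.
subsamples <- lapply(1:n_samples, function(i) {
  airbnb |> sample_n(sample_size, replace = TRUE)
})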
df_1 <- subsamples[[1]]
df_2 <- subsamples[[2]]
df_3 <- subsamples[[3]]
df_4 <- subsamples[[4]]
df_5 <- subsamples[[5]]
summarize_subsample <- function(df) {
  df |>
    summarise(
      avg_price = mean(price, na.rm = TRUE),
      avg_reviews = mean(number_of_reviews, na.rm = TRUE),
      n_listings = n()
    )
}
summary_stats <- lapply(subsamples, summarize_subsample)
summary_stats <- bind_rows(summary_stats, .id = "subsample")
print(summary_stats)
## # A tibble: 5 x 4
##   subsample avg_price avg_reviews n_listings
##   <chr>         <dbl>       <dbl>      <int>
## 1 1              283.        41.1       7622
## 2 2              298.        41.9       7622
## 3 3              296.        42.5       7622
## 4 4              282.        42.8       7622
## 5 5              300.        42.0       7622
subsample: Each row represents one of the 5 random subsamples.
avg_price: The average listing price in each subsample ranges from about 282 to 300.
avg_reviews: The average number of reviews per listing in each subsample ranges from 41.1 to 42.8.
n_listings: The number of listings in each subsample. All subsamples have 7,622 listings, 50% of the 15,244-row dataset.
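# Flag subsamples whose average price falls outside the interquartile range
# of the five subsample averages.
anomalies <- summary_stats |>
  filter(avg_price > quantile(avg_price, 0.75) | avg_price < quantile(avg_price, 0.25))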
print(anomalies)
## # A tibble: 2 x 4
##   subsample avg_price avg_reviews n_listings
##   <chr>         <dbl>       <dbl>      <int>
## 1 4              282.        42.8       7622
## 2 5              300.        42.0       7622
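Anomalies can differ across subsamples due to random variability, sample size, or differences in data distribution. The filter flags subsamples 4 and 5, but only because they have the lowest (282) and highest (300) of the five average prices: with five distinct values, a rule based on the 25th and 75th percentiles always flags the minimum and the maximum. The gap between them is about 18, roughly 6% of the typical average price, and the average number of reviews stays between 41.1 and 42.8. None of the subsamples therefore stands out as meaningfully higher or lower than the others; they are relatively consistent, with no major outliers.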
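To check why a quartile-based rule behaves this way, here is a minimal sketch in base R. It uses the rounded average prices copied from the summary table rather than the real `summary_stats` object, and the 10% tolerance in the second rule is a hypothetical threshold I chose for illustration, not something taken from the analysis.

```r
# Rounded average prices copied from the printed summary table.
avg_price <- c(283, 298, 296, 282, 300)

# Rule used in the analysis: flag values outside the interquartile range.
q <- quantile(avg_price, c(0.25, 0.75))
flagged_quartile <- avg_price[avg_price < q[1] | avg_price > q[2]]

# Hypothetical alternative: flag values more than 10% away from the median.
tol <- 0.10  # assumed tolerance, chosen only for illustration
med <- median(avg_price)
flagged_relative <- avg_price[abs(avg_price - med) / med > tol]

print(flagged_quartile)
print(flagged_relative)
```

The quartile rule returns the smallest and largest of the five values regardless of how close together they are, while the relative rule returns nothing here, which fits the reading that these subsamples differ only by sampling noise.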
common_room_type <- lapply(subsamples, function(df) {
  df |>
    count(room_type) |>
    arrange(desc(n)) |>
    slice(1)
})
common_room_type <- bind_rows(common_room_type, .id = "subsample")
print(common_room_type)
## # A tibble: 5 x 3
##   subsample room_type           n
##   <chr>     <chr>           <int>
## 1 1         Entire home/apt  6193
## 2 2         Entire home/apt  6211
## 3 3         Entire home/apt  6237
## 4 4         Entire home/apt  6186
## 5 5         Entire home/apt  6195
The most common room type is Entire home/apt in all five subsamples, where it accounts for about 81 to 82% of the 7,622 listings each time. This consistency suggests a stable pattern in the data.
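Counts are easier to compare as shares of the subsample. The sketch below does that arithmetic in base R, using the counts and the subsample size copied from the printed output; the variable names are my own.

```r
# Counts of "Entire home/apt" copied from the printed table, one per subsample.
n_entire_home <- c(6193, 6211, 6237, 6186, 6195)
subsample_size <- 7622  # listings per subsample, from the summary table

# Share of each subsample made up of entire homes/apartments.
share <- n_entire_home / subsample_size
share_pct <- round(100 * share, 1)

# Difference between the largest and smallest share, in percentage points.
share_spread <- diff(range(share)) * 100

print(share_pct)
print(share_spread)
```

A spread of well under one percentage point across the five subsamples supports the claim that the room-type mix is stable.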
n_simulations <- 1000
# Repeat the subsampling 1,000 times and keep the average price from each draw.
mc_results <- replicate(n_simulations, {
  subsample <- airbnb |> sample_n(sample_size, replace = TRUE)
  mean(subsample$price, na.rm = TRUE)
})
# Plot the distribution of the simulated average prices.
hist(
  mc_results,
  breaks = 30,
  col = "skyblue",
  border = "white",
  main = "Monte Carlo Simulation: Distribution of Average Price",
  xlab = "Average Price",
  ylab = "Frequency"
)
The histogram shows the distribution of the average price across the 1,000 simulated subsamples. The peak is around $280, which suggests a typical average price for the Airbnb listings. The distribution is narrow and centred on a single peak, which means the average price changes little from one subsample to the next and is therefore a stable estimate for this dataset.
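"Narrow" can be given a number. The sketch below shows one way to do that in base R: the standard deviation of the simulated means and the interval holding the central 95% of them. It is self-contained, so it uses synthetic right-skewed prices (`rlnorm`) as a stand-in for `airbnb$price`; the sample of 2,000 values, the log-normal parameters, and the seed are all assumptions made for illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for airbnb$price: synthetic right-skewed prices.
prices <- rlnorm(2000, meanlog = 5, sdlog = 0.8)
half_n <- round(length(prices) * 0.5)

# Resample half the data with replacement and record the mean each time.
mc_means <- replicate(1000, mean(sample(prices, half_n, replace = TRUE)))

# Spread of the simulated means and the central 95% interval.
mc_sd <- sd(mc_means)
mc_interval <- quantile(mc_means, c(0.025, 0.975))

# Textbook standard error for a mean of half_n draws, for comparison.
theory_se <- sd(prices) / sqrt(half_n)

print(mc_sd)
print(mc_interval)
print(theory_se)
```

Because each draw uses only half of the rows, the spread reflects a sample of that size. The standard error of a mean shrinks with the square root of the sample size, so a mean computed from the full dataset would vary less, by a factor of about 1.4.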
This investigation highlights the importance of considering variability, checking whether flagged anomalies are real, and understanding data limitations. By applying these practices, I can draw more reliable and nuanced conclusions in future analyses. Tools like Monte Carlo simulation and subsampling are handy for exploring data and quantifying uncertainty.