This week I’m thinking about what could go wrong when drawing conclusions from data. The big question is: if I had collected a different sample of homes, would I reach the same conclusions? This is important because in real research, we usually only have one sample from a larger population. By creating multiple random samples from my dataset (treating it as the “population”), I can see how much my conclusions might change based on which homes happened to be in my sample.
If different samples give me wildly different answers, that’s a problem—it means my conclusions aren’t reliable. But if all samples point to the same general patterns, I can be more confident in what I’m seeing.
# Load the packages used throughout: dplyr/tidyr/ggplot2/tibble come in via tidyverse,
# kable() comes from knitr, and dollar()/comma formatting from scales
library(tidyverse)
library(knitr)
library(scales)

# Load the dataset
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)
cat("Full dataset:", nrow(ames), "homes\n")
## Full dataset: 2930 homes
I’m going to create multiple random samples and compare them. I’ll start with smaller samples (25% of the data) and then increase the size to see how that affects consistency.
set.seed(42) # For reproducibility
sample_frac <- 0.25
n_samples <- 5
df_samples_25 <- tibble()
for (sample_i in 1:n_samples) {
df_i <- ames |>
sample_n(size = sample_frac * nrow(ames), replace = TRUE) |>
mutate(sample_num = sample_i)
df_samples_25 <- bind_rows(df_samples_25, df_i)
}
cat("Created", n_samples, "samples of", sample_frac * 100, "% each\n")
## Created 5 samples of 25 % each
cat("Each sample has approximately", round(sample_frac * nrow(ames)), "homes\n")
## Each sample has approximately 732 homes
set.seed(42)
sample_frac <- 0.50
df_samples_50 <- tibble()
for (sample_i in 1:n_samples) {
df_i <- ames |>
sample_n(size = sample_frac * nrow(ames), replace = TRUE) |>
mutate(sample_num = sample_i)
df_samples_50 <- bind_rows(df_samples_50, df_i)
}
cat("Created", n_samples, "samples of", sample_frac * 100, "% each\n")
## Created 5 samples of 50 % each
cat("Each sample has approximately", round(sample_frac * nrow(ames)), "homes\n")
## Each sample has approximately 1465 homes
set.seed(42)
sample_frac <- 0.75
df_samples_75 <- tibble()
for (sample_i in 1:n_samples) {
df_i <- ames |>
sample_n(size = sample_frac * nrow(ames), replace = TRUE) |>
mutate(sample_num = sample_i)
df_samples_75 <- bind_rows(df_samples_75, df_i)
}
cat("Created", n_samples, "samples of", sample_frac * 100, "% each\n")
## Created 5 samples of 75 % each
cat("Each sample has approximately", round(sample_frac * nrow(ames)), "homes\n")
## Each sample has approximately 2198 homes
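The three blocks above differ only in sample_frac, so the repetition could be collapsed into a small helper. This is just a refactor sketch (the exact draws won't match the ones above, since the random numbers get consumed differently):
# Sketch: one helper instead of three near-identical loops
make_samples <- function(data, sample_frac, n_samples = 5, seed = 42) {
  set.seed(seed)
  purrr::map_dfr(seq_len(n_samples), function(sample_i) {
    data |>
      sample_n(size = round(sample_frac * nrow(data)), replace = TRUE) |>
      mutate(sample_num = sample_i)
  })
}
# Usage: df_samples_25 <- make_samples(ames, 0.25)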
Let me compare some basic statistics across samples to see how different they are.
# Comparing sale prices across 25% samples
price_stats_25 <- df_samples_25 |>
group_by(sample_num) |>
summarise(
Mean_Price = mean(SalePrice),
Median_Price = median(SalePrice),
SD_Price = sd(SalePrice),
Min_Price = min(SalePrice),
Max_Price = max(SalePrice)
)
# Adding full population for comparison
full_pop_stats <- tibble(
sample_num = "Full Data",
Mean_Price = mean(ames$SalePrice),
Median_Price = median(ames$SalePrice),
SD_Price = sd(ames$SalePrice),
Min_Price = min(ames$SalePrice),
Max_Price = max(ames$SalePrice)
)
price_comparison_25 <- bind_rows(
price_stats_25 |> mutate(sample_num = as.character(sample_num)),
full_pop_stats
)
kable(price_comparison_25,
col.names = c("Sample", "Mean Price", "Median Price", "Std Dev",
"Min Price", "Max Price"),
caption = "Sale Price Statistics Across 25% Samples",
format.args = list(big.mark = ","),
digits = 0)
| Sample | Mean Price | Median Price | Std Dev | Min Price | Max Price |
|---|---|---|---|---|---|
| 1 | 176,879 | 157,000 | 73,404 | 34,900 | 625,000 |
| 2 | 184,596 | 160,000 | 86,759 | 34,900 | 745,000 |
| 3 | 180,183 | 161,000 | 81,145 | 46,500 | 615,000 |
| 4 | 182,512 | 168,588 | 72,300 | 12,789 | 500,000 |
| 5 | 175,808 | 158,000 | 75,640 | 35,311 | 615,000 |
| Full Data | 180,796 | 160,000 | 79,887 | 12,789 | 755,000 |
# Calculating variation among samples
cat("\n=== VARIABILITY ANALYSIS (25% samples) ===\n")
##
## === VARIABILITY ANALYSIS (25% samples) ===
cat("Mean price range across samples:",
dollar(max(price_stats_25$Mean_Price) - min(price_stats_25$Mean_Price)), "\n")
## Mean price range across samples: $8,787.34
cat("That's a difference of",
round((max(price_stats_25$Mean_Price) - min(price_stats_25$Mean_Price)) /
mean(price_stats_25$Mean_Price) * 100, 2), "%\n")
## That's a difference of 4.88 %
Insight: Even with 25% samples (about 732 homes each), I’m seeing real variation in the average prices. The mean price can shift by several thousand dollars depending on which homes ended up in the sample. The median is less sensitive to a few extreme luxury homes than the mean, although in these particular samples the medians actually spread out slightly more than the means.
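A quick way to check that, reusing the price_stats_25 table built above:
# Spread (max minus min) of the five sample means vs. the five sample medians
diff(range(price_stats_25$Mean_Price))
diff(range(price_stats_25$Median_Price))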
# Visualize price distributions across samples
ggplot(df_samples_25, aes(x = SalePrice, fill = factor(sample_num))) +
geom_density(alpha = 0.4) +
geom_vline(xintercept = mean(ames$SalePrice), linetype = "dashed",
color = "black", linewidth = 1) +
scale_x_continuous(labels = dollar_format()) +
scale_fill_brewer(palette = "Set2", name = "Sample") +
labs(title = "Sale Price Distribution Across 25% Samples",
subtitle = "Dashed line shows true population mean",
x = "Sale Price",
y = "Density") +
theme_minimal()
What I see: The distributions overlap a lot but aren’t identical. Some samples are slightly shifted left or right, which explains why the means differ. The dashed line (true population mean) sometimes falls in the middle, sometimes off to one side.
# Checking how neighborhood representation varies across samples
neighborhood_counts_25 <- df_samples_25 |>
group_by(sample_num, Neighborhood) |>
summarise(Count = n(), .groups = "drop") |>
group_by(sample_num) |>
mutate(Percentage = Count / sum(Count) * 100)
# Focusing on top 5 neighborhoods in full data
top_neighborhoods <- ames |>
count(Neighborhood, sort = TRUE) |>
head(5) |>
pull(Neighborhood)
neighborhood_comparison <- neighborhood_counts_25 |>
filter(Neighborhood %in% top_neighborhoods) |>
select(sample_num, Neighborhood, Percentage) |>
pivot_wider(names_from = Neighborhood, values_from = Percentage, values_fill = 0)
# Adding full population
full_pop_neighborhoods <- ames |>
count(Neighborhood) |>
mutate(Percentage = n / sum(n) * 100) |>
filter(Neighborhood %in% top_neighborhoods) |>
select(Neighborhood, Percentage) |>
pivot_wider(names_from = Neighborhood, values_from = Percentage) |>
mutate(sample_num = "Full Data")
neighborhood_final <- bind_rows(
neighborhood_comparison |> mutate(sample_num = as.character(sample_num)),
full_pop_neighborhoods
)
kable(neighborhood_final,
caption = "Percentage of Homes in Top 5 Neighborhoods (25% samples)",
digits = 1)
| sample_num | CollgCr | Edwards | NAmes | OldTown | Somerst |
|---|---|---|---|---|---|
| 1 | 8.5 | 5.9 | 18.0 | 8.2 | 7.7 |
| 2 | 7.4 | 7.1 | 14.3 | 9.3 | 6.7 |
| 3 | 8.5 | 5.5 | 14.6 | 7.5 | 5.3 |
| 4 | 11.1 | 5.6 | 13.9 | 7.4 | 5.9 |
| 5 | 8.1 | 6.7 | 16.0 | 8.1 | 6.1 |
| Full Data | 9.1 | 6.6 | 15.1 | 8.2 | 6.2 |
Insight: Neighborhood representation varies quite a bit across samples! Some samples have noticeably more NAmes homes, others have fewer, and that could affect my conclusions. For example, if I were studying neighborhood effects on price and happened to draw a sample with too many expensive neighborhoods, I’d think the overall market is pricier than it really is.
This matters because it shows an unrepresentative sample can arise purely by chance, even when the sampling itself is random.
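One standard guard against this is stratified sampling: drawing the same fraction from every neighborhood so each sample keeps the population’s mix. That’s not what I did above; this is just an alternative sketch:
# Sketch: a 25% sample stratified by neighborhood, so every neighborhood keeps its share
set.seed(42)
stratified_25 <- ames |>
  group_by(Neighborhood) |>
  slice_sample(prop = 0.25) |>
  ungroup()
# By construction, the neighborhood percentages now track the full data closely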
# Look at quality ratings across samples
quality_dist_25 <- df_samples_25 |>
group_by(sample_num, Overall.Qual) |>
summarise(Count = n(), .groups = "drop") |>
group_by(sample_num) |>
mutate(Percentage = Count / sum(Count) * 100)
# Checking for very high quality homes (9-10)
high_quality_25 <- quality_dist_25 |>
filter(Overall.Qual >= 9) |>
group_by(sample_num) |>
summarise(High_Quality_Pct = sum(Percentage))
# Adding full population
full_high_quality <- ames |>
summarise(High_Quality_Pct = sum(Overall.Qual >= 9) / n() * 100) |>
mutate(sample_num = "Full Data")
high_quality_comparison <- bind_rows(
high_quality_25 |> mutate(sample_num = as.character(sample_num)),
full_high_quality
)
kable(high_quality_comparison,
col.names = c("Sample", "% High Quality (9-10)"),
caption = "Percentage of High-Quality Homes Across Samples",
digits = 2)
| Sample | % High Quality (9-10) |
|---|---|
| 1 | 3.14 |
| 2 | 5.87 |
| 3 | 4.10 |
| 4 | 3.96 |
| 5 | 4.23 |
| Full Data | 4.71 |
cat("\n=== ANOMALY DETECTION ===\n")
##
## === ANOMALY DETECTION ===
cat("In the full data, high-quality homes (9-10) represent:",
round(full_high_quality$High_Quality_Pct, 2), "%\n")
## In the full data, high-quality homes (9-10) represent: 4.71 %
cat("But in individual samples, this ranges from",
round(min(high_quality_25$High_Quality_Pct), 2), "% to",
round(max(high_quality_25$High_Quality_Pct), 2), "%\n")
## But in individual samples, this ranges from 3.14 % to 5.87 %
What would I call an anomaly? In Sample 1, only about 3% of homes are high-quality, so I might think “excellent homes are super rare in Ames!” But in Sample 2, almost 6% are high-quality, so I’d think they’re fairly common. The true answer (from the full data) is about 4.7%.
If I only had one sample, I might incorrectly identify high-quality homes as anomalies when they’re actually not that unusual, or I might underestimate how rare they are.
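The spread above is roughly what plain sampling noise predicts for a rare share. A rough back-of-the-envelope check, assuming the full-data rate of about 4.7% and a sample size of about 732 homes:
# Rough binomial check: how much should a ~4.7% share bounce around in samples of ~732 homes?
p <- 0.047
n <- 732
se <- sqrt(p * (1 - p) / n)
round(c(lower = p - 2 * se, upper = p + 2 * se) * 100, 1)  # roughly 3.1% to 6.3%
That window of roughly 3% to 6% lines up with the range I actually observed across the five samples.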
Now let me see what happens when I use bigger samples.
# Comparing statistics across different sample sizes
# 50% samples
price_stats_50 <- df_samples_50 |>
group_by(sample_num) |>
summarise(
Mean_Price = mean(SalePrice),
Median_Price = median(SalePrice),
SD_Price = sd(SalePrice)
)
# 75% samples
price_stats_75 <- df_samples_75 |>
group_by(sample_num) |>
summarise(
Mean_Price = mean(SalePrice),
Median_Price = median(SalePrice),
SD_Price = sd(SalePrice)
)
# Calculating variability for each sample size
variability_comparison <- tibble(
Sample_Size = c("25%", "50%", "75%"),
Mean_Range = c(
max(price_stats_25$Mean_Price) - min(price_stats_25$Mean_Price),
max(price_stats_50$Mean_Price) - min(price_stats_50$Mean_Price),
max(price_stats_75$Mean_Price) - min(price_stats_75$Mean_Price)
),
Median_Range = c(
max(price_stats_25$Median_Price) - min(price_stats_25$Median_Price),
max(price_stats_50$Median_Price) - min(price_stats_50$Median_Price),
max(price_stats_75$Median_Price) - min(price_stats_75$Median_Price)
)
)
kable(variability_comparison,
col.names = c("Sample Size", "Mean Price Range", "Median Price Range"),
caption = "How Sample Size Affects Variability",
format.args = list(big.mark = ","),
digits = 0)
| Sample Size | Mean Price Range | Median Price Range |
|---|---|---|
| 25% | 8,787 | 11,588 |
| 50% | 4,897 | 6,500 |
| 75% | 2,565 | 2,000 |
cat("\n=== KEY FINDING ===\n")
##
## === KEY FINDING ===
cat("As sample size increases, variability DECREASES:\n")
## As sample size increases, variability DECREASES:
cat("- At 25%:", dollar(variability_comparison$Mean_Range[1]), "spread in means\n")
## - At 25%: $8,787.34 spread in means
cat("- At 50%:", dollar(variability_comparison$Mean_Range[2]), "spread in means\n")
## - At 50%: $4,896.84 spread in means
cat("- At 75%:", dollar(variability_comparison$Mean_Range[3]), "spread in means\n")
## - At 75%: $2,565.01 spread in means
Insight: As my sample size gets bigger, the statistics become more consistent across samples. With 75% samples, all my samples give pretty similar mean prices because they’re all close to representing the full population.
This tells me that bigger samples = more reliable conclusions. If I only have a small sample, I need to be way more cautious about making strong claims.
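This matches the textbook behavior of the standard error of the mean, which shrinks like SD / sqrt(n) as the sample grows. A quick sketch of the predicted precision at each sample size:
# Predicted standard error of the mean sale price at each sample size: SD / sqrt(n)
pop_sd <- sd(ames$SalePrice)
n_homes <- round(c(`25%` = 0.25, `50%` = 0.50, `75%` = 0.75) * nrow(ames))
round(pop_sd / sqrt(n_homes))  # smaller samples give noisier estimates of the mean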
# Visualizing the effect of sample size
all_means <- bind_rows(
price_stats_25 |> mutate(Size = "25%"),
price_stats_50 |> mutate(Size = "50%"),
price_stats_75 |> mutate(Size = "75%")
)
ggplot(all_means, aes(x = Size, y = Mean_Price, group = Size)) +
geom_boxplot(fill = "lightblue", alpha = 0.6) +
geom_point(size = 3, alpha = 0.7) +
geom_hline(yintercept = mean(ames$SalePrice), linetype = "dashed",
color = "red", linewidth = 1) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Sample Size Effect on Mean Price Estimates",
subtitle = "Red line = true population mean; larger samples cluster closer to truth",
x = "Sample Size (% of full data)",
y = "Sample Mean Price") +
theme_minimal()
Even though samples differ in details, some patterns should be consistent if they’re real.
# Test 1: Is the price-area relationship consistent?
correlation_by_sample_25 <- df_samples_25 |>
group_by(sample_num) |>
summarise(
Price_Area_Correlation = cor(SalePrice, Gr.Liv.Area, use = "complete.obs"),
Price_Quality_Correlation = cor(SalePrice, Overall.Qual, use = "complete.obs")
)
# Adding full population
full_correlations <- tibble(
sample_num = "Full Data",
Price_Area_Correlation = cor(ames$SalePrice, ames$Gr.Liv.Area),
Price_Quality_Correlation = cor(ames$SalePrice, ames$Overall.Qual)
)
correlation_comparison <- bind_rows(
correlation_by_sample_25 |> mutate(sample_num = as.character(sample_num)),
full_correlations
)
kable(correlation_comparison,
col.names = c("Sample", "Price-Area Correlation", "Price-Quality Correlation"),
caption = "Correlation Consistency Across 25% Samples",
digits = 3)
| Sample | Price-Area Correlation | Price-Quality Correlation |
|---|---|---|
| 1 | 0.680 | 0.785 |
| 2 | 0.695 | 0.797 |
| 3 | 0.703 | 0.793 |
| 4 | 0.666 | 0.799 |
| 5 | 0.711 | 0.819 |
| Full Data | 0.707 | 0.799 |
cat("\n=== CONSISTENCY CHECK ===\n")
##
## === CONSISTENCY CHECK ===
cat("Price-Area correlation ranges from",
round(min(correlation_by_sample_25$Price_Area_Correlation), 3), "to",
round(max(correlation_by_sample_25$Price_Area_Correlation), 3), "\n")
## Price-Area correlation ranges from 0.666 to 0.711
cat("Price-Quality correlation ranges from",
round(min(correlation_by_sample_25$Price_Quality_Correlation), 3), "to",
round(max(correlation_by_sample_25$Price_Quality_Correlation), 3), "\n")
## Price-Quality correlation ranges from 0.785 to 0.819
cat("\nBoth are consistently strong and positive across all samples!\n")
##
## Both are consistently strong and positive across all samples!
Insight: Even though individual statistics (like mean price) vary across samples, the relationships between variables stay consistent. Every sample shows that bigger homes cost more (correlation ~0.70) and higher quality homes cost more (correlation ~0.79).
This tells me that if I’m studying relationships rather than exact values, I’m on more solid ground even with a smaller sample.
# Show that the relationship is consistent even if the samples differ
ggplot(df_samples_25, aes(x = Gr.Liv.Area, y = SalePrice)) +
geom_point(alpha = 0.1, size = 1) +
geom_smooth(aes(color = factor(sample_num)), method = "lm", se = FALSE) +
scale_y_continuous(labels = dollar_format()) +
scale_x_continuous(labels = comma_format()) +
scale_color_brewer(palette = "Set2", name = "Sample") +
labs(title = "Price vs. Living Area: Consistent Relationship Across Samples",
subtitle = "Each line is a different sample's trend - all very similar slopes",
x = "Living Area (sq ft)",
y = "Sale Price") +
theme_minimal()
What I see: All the trend lines are basically parallel. They may be shifted up or down slightly depending on the sample, but the slope (how much price rises with living area) is nearly identical across samples. This means the fundamental pattern is real and robust.
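To put a number on “nearly identical slopes,” one could fit a simple regression within each sample and compare the coefficients. A quick sketch:
# Price-per-square-foot slope from a separate lm fit within each 25% sample
slopes_25 <- df_samples_25 |>
  group_by(sample_num) |>
  group_modify(~ tibble(Slope_per_sqft = coef(lm(SalePrice ~ Gr.Liv.Area, data = .x))[["Gr.Liv.Area"]])) |>
  ungroup()
slopes_25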
Let me check if conclusions about rare combinations (from Week 3) would change across samples.
# Checking building type rarity across samples
building_rarity_25 <- df_samples_25 |>
group_by(sample_num, Bldg.Type) |>
summarise(Count = n(), .groups = "drop") |>
group_by(sample_num) |>
mutate(Percentage = Count / sum(Count) * 100) |>
filter(Bldg.Type %in% c("Twnhs", "2fmCon")) |> # Rare types from Week 3
select(sample_num, Bldg.Type, Percentage)
# Pivot wider for easier comparison
building_comparison <- building_rarity_25 |>
pivot_wider(names_from = Bldg.Type, values_from = Percentage, values_fill = 0)
# Adding full data
full_building <- ames |>
count(Bldg.Type) |>
filter(Bldg.Type %in% c("Twnhs", "2fmCon")) |>
mutate(Percentage = n / nrow(ames) * 100) |>
select(Bldg.Type, Percentage) |>
pivot_wider(names_from = Bldg.Type, values_from = Percentage, values_fill = 0) |>
mutate(sample_num = "Full Data")
building_final <- bind_rows(
building_comparison |> mutate(sample_num = as.character(sample_num)),
full_building
)
kable(building_final,
caption = "Rare Building Types Across Samples (% of homes)",
digits = 2)
| sample_num | 2fmCon | Twnhs |
|---|---|---|
| 1 | 1.78 | 4.78 |
| 2 | 2.73 | 3.83 |
| 3 | 2.46 | 3.55 |
| 4 | 1.91 | 3.14 |
| 5 | 1.78 | 3.01 |
| Full Data | 2.12 | 3.45 |
cat("\n=== ANOMALY CONSISTENCY ===\n")
##
## === ANOMALY CONSISTENCY ===
cat("Townhouses in full data:", round(full_building$Twnhs, 2), "%\n")
## Townhouses in full data: 3.45 %
cat("Across 25% samples, ranges from",
round(min(building_comparison$Twnhs, na.rm = TRUE), 2), "% to",
round(max(building_comparison$Twnhs, na.rm = TRUE), 2), "%\n")
## Across 25% samples, ranges from 3.01 % to 4.78 %
Insight: The specific percentages change, but townhouses are consistently rare across all samples (always <5%). So my Week 3 conclusion that “townhouses are rare in Ames” would hold up regardless of which sample I had collected. That’s a robust finding.
However, the exact percentage still shifts noticeably from sample to sample (for townhouses, roughly 3% in one sample versus almost 5% in another). So if I were making very specific claims about exact scarcity, I’d need to be careful.
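One way to be careful is to attach an interval to the sample share instead of quoting a single number. A sketch with base R’s prop.test, using an illustrative count of 22 townhouses out of 732 homes (roughly Sample 5’s 3.01%):
# Approximate 95% confidence interval for a townhouse share of 22 out of 732 homes
prop.test(x = 22, n = 732)$conf.int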
This sampling exercise taught me some important lessons about being careful with my conclusions:

- With only 25% of the data, my estimate of “average home price” could be off by several thousand dollars. I need to report uncertainty ranges, not just point estimates (a minimal bootstrap sketch of this follows the list).
- The correlation between size and price was consistent across all samples (~0.70), even when the exact average price varied. If I’m studying “what affects what,” I’m on safer ground than if I’m trying to pin down exact numbers.
- As the sample size increased from 25% to 75%, the variability dropped dramatically. This is why sample size matters so much in real research.
- High-quality homes ranged from about 3% to almost 6% across the small samples, even though the true value is 4.7%. If something is rare, I need a big sample to estimate its frequency accurately.
- Townhouses were rare in every single sample, and the price-size relationship was strong in every sample. These are findings I can trust more because they replicate consistently.
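As a concrete version of the first lesson, here is a minimal bootstrap sketch for turning one 25% sample into a range for the mean price (the sample drawn here is new and purely illustrative):
# Bootstrap a single 25% sample to get an uncertainty range for the mean sale price
set.seed(1)
one_sample <- sample_n(ames, size = round(0.25 * nrow(ames)))
boot_means <- replicate(1000, mean(sample(one_sample$SalePrice, replace = TRUE)))
dollar(quantile(boot_means, c(0.025, 0.975)))  # approximate 95% interval for the mean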
The biggest takeaway: Every sample is just one possible version of reality. If I only have one dataset, I should always wonder: “Would I reach the same conclusion if I had collected different data?” This exercise showed me that sometimes the answer is yes (relationships), and sometimes it’s “maybe not” (exact values).