Introduction

This week I’m thinking about what could go wrong when drawing conclusions from data. The big question is: if I had collected a different sample of homes, would I reach the same conclusions? This is important because in real research, we usually only have one sample from a larger population. By creating multiple random samples from my dataset (treating it as the “population”), I can see how much my conclusions might change based on which homes happened to be in my sample.

If different samples give me wildly different answers, that’s a problem: it means my conclusions aren’t reliable. But if all samples point to the same general patterns, I can be more confident in what I’m seeing.

Data Loading

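# Packages used below: tidyverse (dplyr, tidyr, ggplot2, tibble),
# knitr (kable), scales (dollar, dollar_format, comma_format)
library(tidyverse)
library(knitr)
library(scales)

# Load the dataset
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)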

cat("Full dataset:", nrow(ames), "homes\n")
## Full dataset: 2930 homes

Part 1: Creating Random Samples

I’m going to create multiple random samples, drawn with replacement, and compare them. I’ll start with smaller samples (25% of the data) and then increase the size to see how that affects consistency.

Sample Set 1: 25% Samples (Small)

set.seed(42)  # For reproducibility

sample_frac <- 0.25
n_samples <- 5

df_samples_25 <- tibble()

for (sample_i in 1:n_samples) {
  df_i <- ames |>
    sample_n(size = floor(sample_frac * nrow(ames)), replace = TRUE) |>  # whole number of rows
    mutate(sample_num = sample_i)
  
  df_samples_25 <- bind_rows(df_samples_25, df_i)
}

cat("Created", n_samples, "samples of", sample_frac * 100, "% each\n")
## Created 5 samples of 25 % each
cat("Each sample has approximately", round(sample_frac * nrow(ames)), "homes\n")
## Each sample has approximately 732 homes

Sample Set 2: 50% Samples (Medium)

set.seed(42)

sample_frac <- 0.50
df_samples_50 <- tibble()

for (sample_i in 1:n_samples) {
  df_i <- ames |>
    sample_n(size = floor(sample_frac * nrow(ames)), replace = TRUE) |>  # whole number of rows
    mutate(sample_num = sample_i)
  
  df_samples_50 <- bind_rows(df_samples_50, df_i)
}

cat("Created", n_samples, "samples of", sample_frac * 100, "% each\n")
## Created 5 samples of 50 % each
cat("Each sample has approximately", round(sample_frac * nrow(ames)), "homes\n")
## Each sample has approximately 1465 homes

Sample Set 3: 75% Samples (Large)

set.seed(42)

sample_frac <- 0.75
df_samples_75 <- tibble()

for (sample_i in 1:n_samples) {
  df_i <- ames |>
    sample_n(size = floor(sample_frac * nrow(ames)), replace = TRUE) |>  # whole number of rows
    mutate(sample_num = sample_i)
  
  df_samples_75 <- bind_rows(df_samples_75, df_i)
}

cat("Created", n_samples, "samples of", sample_frac * 100, "% each\n")
## Created 5 samples of 75 % each
cat("Each sample has approximately", round(sample_frac * nrow(ames)), "homes\n")
## Each sample has approximately 2198 homes
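
The three loops above differ only in the sampling fraction, so they could be folded into one helper. Here is a minimal base-R sketch of that idea. The function name `draw_samples` and the `toy` data frame are my own hypothetical stand-ins (not part of the analysis above), used so the example runs without ames.csv.

```r
# Hypothetical helper: draw `k` samples of floor(frac * nrow(df)) rows each,
# with replacement, and tag every row with its sample number.
draw_samples <- function(df, frac, k, seed = 42) {
  set.seed(seed)
  n <- floor(frac * nrow(df))  # whole number of rows per sample
  samples <- lapply(seq_len(k), function(i) {
    rows <- sample.int(nrow(df), size = n, replace = TRUE)
    out <- df[rows, , drop = FALSE]
    out$sample_num <- i
    out
  })
  do.call(rbind, samples)
}

# Made-up stand-in data (200 rows) so the sketch is self-contained
toy <- data.frame(id = 1:200, SalePrice = seq(50000, 249000, by = 1000))
toy_25 <- draw_samples(toy, frac = 0.25, k = 5)

# Each of the 5 samples should contain 50 rows
table(toy_25$sample_num)
```

With the real data, the same call pattern would be something like `draw_samples(ames, 0.25, 5)`, though I have not run it against the actual file.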

Part 2: Scrutinizing the Samples - Basic Statistics

Let me compare some basic statistics across samples to see how different they are.

Comparison 1: Sale Price Variability

# Comparing sale prices across 25% samples
price_stats_25 <- df_samples_25 |>
  group_by(sample_num) |>
  summarise(
    Mean_Price = mean(SalePrice),
    Median_Price = median(SalePrice),
    SD_Price = sd(SalePrice),
    Min_Price = min(SalePrice),
    Max_Price = max(SalePrice)
  )

# Adding full population for comparison
full_pop_stats <- tibble(
  sample_num = "Full Data",
  Mean_Price = mean(ames$SalePrice),
  Median_Price = median(ames$SalePrice),
  SD_Price = sd(ames$SalePrice),
  Min_Price = min(ames$SalePrice),
  Max_Price = max(ames$SalePrice)
)

price_comparison_25 <- bind_rows(
  price_stats_25 |> mutate(sample_num = as.character(sample_num)),
  full_pop_stats
)

kable(price_comparison_25,
      col.names = c("Sample", "Mean Price", "Median Price", "Std Dev", 
                    "Min Price", "Max Price"),
      caption = "Sale Price Statistics Across 25% Samples",
      format.args = list(big.mark = ","),
      digits = 0)
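
Sale Price Statistics Across 25% Samples

| Sample    | Mean Price | Median Price | Std Dev | Min Price | Max Price |
|:----------|-----------:|-------------:|--------:|----------:|----------:|
| 1         |    176,879 |      157,000 |  73,404 |    34,900 |   625,000 |
| 2         |    184,596 |      160,000 |  86,759 |    34,900 |   745,000 |
| 3         |    180,183 |      161,000 |  81,145 |    46,500 |   615,000 |
| 4         |    182,512 |      168,588 |  72,300 |    12,789 |   500,000 |
| 5         |    175,808 |      158,000 |  75,640 |    35,311 |   615,000 |
| Full Data |    180,796 |      160,000 |  79,887 |    12,789 |   755,000 |
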
# Calculating variation among samples
cat("\n=== VARIABILITY ANALYSIS (25% samples) ===\n")
## 
## === VARIABILITY ANALYSIS (25% samples) ===
cat("Mean price range across samples:", 
    dollar(max(price_stats_25$Mean_Price) - min(price_stats_25$Mean_Price)), "\n")
## Mean price range across samples: $8,787.34
cat("That's a difference of", 
    round((max(price_stats_25$Mean_Price) - min(price_stats_25$Mean_Price)) / 
          mean(price_stats_25$Mean_Price) * 100, 2), "%\n")
## That's a difference of 4.88 %

Insight: Even with 25% samples (about 732 homes each), I’m seeing some variation in the average prices. The mean price can vary by several thousand dollars depending on which homes ended up in the sample. I expected the median to be more stable than the mean, since medians aren’t as affected by a few extreme luxury homes, but in these five samples the medians actually spread wider ($157,000 to $168,588) than the means ($175,808 to $184,596).
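
To put a number on how much the mean should move around, here is a small sketch using the standard error of the mean, sd / sqrt(n). The two inputs are copied from the Full Data row of the table and from the sample size; treating the full-data standard deviation as the population value is my assumption.

```r
# Values copied from the results above
pop_sd <- 79887   # standard deviation of SalePrice in the full data
n      <- 732     # homes per 25% sample

# Standard error of the mean when sampling with replacement
se_mean <- pop_sd / sqrt(n)
cat("Standard error of the mean:", round(se_mean), "\n")

# About 95% of sample means should land within roughly 1.96 SE of the true mean
cat("Approximate 95% band: +/-", round(1.96 * se_mean), "\n")
```

A standard error near $3,000 fits what the table shows: all five sample means sit within about $5,000 of the full-data mean.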

# Visualize price distributions across samples
ggplot(df_samples_25, aes(x = SalePrice, fill = factor(sample_num))) +
  geom_density(alpha = 0.4) +
  geom_vline(xintercept = mean(ames$SalePrice), linetype = "dashed", 
             color = "black", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  scale_fill_brewer(palette = "Set2", name = "Sample") +
  labs(title = "Sale Price Distribution Across 25% Samples",
       subtitle = "Dashed line shows true population mean",
       x = "Sale Price",
       y = "Density") +
  theme_minimal()

What I see: The distributions overlap a lot but aren’t identical. Some samples are slightly shifted left or right, which explains why the means differ. The dashed line (true population mean) sometimes falls in the middle, sometimes off to one side.


Comparison 2: Neighborhood Distribution

# Checking how neighborhood representation varies across samples
neighborhood_counts_25 <- df_samples_25 |>
  group_by(sample_num, Neighborhood) |>
  summarise(Count = n(), .groups = "drop") |>
  group_by(sample_num) |>
  mutate(Percentage = Count / sum(Count) * 100)

# Focusing on top 5 neighborhoods in full data
top_neighborhoods <- ames |>
  count(Neighborhood, sort = TRUE) |>
  head(5) |>
  pull(Neighborhood)

neighborhood_comparison <- neighborhood_counts_25 |>
  filter(Neighborhood %in% top_neighborhoods) |>
  select(sample_num, Neighborhood, Percentage) |>
  pivot_wider(names_from = Neighborhood, values_from = Percentage, values_fill = 0)

# Adding full population
full_pop_neighborhoods <- ames |>
  count(Neighborhood) |>
  mutate(Percentage = n / sum(n) * 100) |>
  filter(Neighborhood %in% top_neighborhoods) |>
  select(Neighborhood, Percentage) |>
  pivot_wider(names_from = Neighborhood, values_from = Percentage) |>
  mutate(sample_num = "Full Data")

neighborhood_final <- bind_rows(
  neighborhood_comparison |> mutate(sample_num = as.character(sample_num)),
  full_pop_neighborhoods
)

kable(neighborhood_final,
      caption = "Percentage of Homes in Top 5 Neighborhoods (25% samples)",
      digits = 1)
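
Percentage of Homes in Top 5 Neighborhoods (25% samples)

| sample_num | CollgCr | Edwards | NAmes | OldTown | Somerst |
|:-----------|--------:|--------:|------:|--------:|--------:|
| 1          |     8.5 |     5.9 |  18.0 |     8.2 |     7.7 |
| 2          |     7.4 |     7.1 |  14.3 |     9.3 |     6.7 |
| 3          |     8.5 |     5.5 |  14.6 |     7.5 |     5.3 |
| 4          |    11.1 |     5.6 |  13.9 |     7.4 |     5.9 |
| 5          |     8.1 |     6.7 |  16.0 |     8.1 |     6.1 |
| Full Data  |     9.1 |     6.6 |  15.1 |     8.2 |     6.2 |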

Insight: The neighborhood representation varies quite a bit across samples! Some samples have way more NAmes homes, others have fewer. This could really affect my conclusions. For example, if I was studying neighborhood effects on price and happened to get a sample with too many expensive neighborhoods, I’d think the overall market is pricier than it really is.

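This is actually important because it shows that a sample can misrepresent the population just by chance, even with random sampling. Strictly speaking this is sampling error rather than sampling bias, since it isn’t systematic.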
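
How big a swing in a neighborhood’s share is plausible from chance alone? A rough sketch using the binomial standard error of a proportion, sqrt(p(1 - p) / n), with the NAmes figures from the table. Treating each sampled home as an independent draw is an assumption, though it matches sampling with replacement.

```r
p <- 0.151  # NAmes share in the full data (15.1%)
n <- 732    # homes per 25% sample

# Standard error of a sample proportion
se_prop <- sqrt(p * (1 - p) / n)
cat("SE of the NAmes share:", round(100 * se_prop, 2), "percentage points\n")

# How unusual is Sample 1, where NAmes made up 18.0% of homes?
z <- (0.180 - p) / se_prop
cat("Sample 1 is", round(z, 1), "standard errors above the full-data share\n")
```

So Sample 1’s NAmes share is a little over two standard errors high: unusual, but not shocking when five samples and five neighborhoods are being compared at once.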


Comparison 3: Quality Rating Distribution - Spotting Anomalies

# Look at quality ratings across samples
quality_dist_25 <- df_samples_25 |>
  group_by(sample_num, Overall.Qual) |>
  summarise(Count = n(), .groups = "drop") |>
  group_by(sample_num) |>
  mutate(Percentage = Count / sum(Count) * 100)

# Checking for very high quality homes (9-10)
high_quality_25 <- quality_dist_25 |>
  filter(Overall.Qual >= 9) |>
  group_by(sample_num) |>
  summarise(High_Quality_Pct = sum(Percentage))

# Adding full population
full_high_quality <- ames |>
  summarise(High_Quality_Pct = sum(Overall.Qual >= 9) / n() * 100) |>
  mutate(sample_num = "Full Data")

high_quality_comparison <- bind_rows(
  high_quality_25 |> mutate(sample_num = as.character(sample_num)),
  full_high_quality
)

kable(high_quality_comparison,
      col.names = c("Sample", "% High Quality (9-10)"),
      caption = "Percentage of High-Quality Homes Across Samples",
      digits = 2)

Percentage of High-Quality Homes Across Samples

| Sample    | % High Quality (9-10) |
|:----------|----------------------:|
| 1         |                  3.14 |
| 2         |                  5.87 |
| 3         |                  4.10 |
| 4         |                  3.96 |
| 5         |                  4.23 |
| Full Data |                  4.71 |

cat("\n=== ANOMALY DETECTION ===\n")
## 
## === ANOMALY DETECTION ===
cat("In the full data, high-quality homes (9-10) represent:", 
    round(full_high_quality$High_Quality_Pct, 2), "%\n")
## In the full data, high-quality homes (9-10) represent: 4.71 %
cat("But in individual samples, this ranges from",
    round(min(high_quality_25$High_Quality_Pct), 2), "% to",
    round(max(high_quality_25$High_Quality_Pct), 2), "%\n")
## But in individual samples, this ranges from 3.14 % to 5.87 %

What would I call an anomaly? In Sample 1, only 3.14% of homes are high-quality, so I might think “excellent homes are super rare in Ames!” But in Sample 2, 5.87% are high-quality, so I’d think they’re more common. The true answer (from full data) is around 4.7%.

If I only had one sample, I might incorrectly identify high-quality homes as anomalies when they’re actually not that unusual, or I might underestimate how rare they are.
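
Five samples only give a glimpse of the possible range. As a sketch, I can simulate many more samples with `rbinom`, assuming the true share is the full-data 4.71% and each home is an independent draw. The seed and the number of simulations are arbitrary choices of mine.

```r
set.seed(1)        # arbitrary seed, only for repeatability
p_true <- 0.0471   # share of homes rated 9-10 in the full data
n      <- 732      # homes per 25% sample
n_sims <- 10000    # hypothetical number of repeated samples

# Each draw is the count of high-quality homes in one sample of n homes,
# converted to a percentage
sim_pct <- 100 * rbinom(n_sims, size = n, prob = p_true) / n

# Middle 95% of the simulated percentages
round(quantile(sim_pct, c(0.025, 0.975)), 2)
```

The simulated range should come out at roughly 3% to 6%, which would mean the 3.14% and 5.87% I saw are both ordinary outcomes rather than signs that something went wrong.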


Part 3: Increasing Sample Size Effects

Now let me see what happens when I use bigger samples.

# Comparing statistics across different sample sizes

# 50% samples
price_stats_50 <- df_samples_50 |>
  group_by(sample_num) |>
  summarise(
    Mean_Price = mean(SalePrice),
    Median_Price = median(SalePrice),
    SD_Price = sd(SalePrice)
  )

# 75% samples  
price_stats_75 <- df_samples_75 |>
  group_by(sample_num) |>
  summarise(
    Mean_Price = mean(SalePrice),
    Median_Price = median(SalePrice),
    SD_Price = sd(SalePrice)
  )

# Calculating variability for each sample size
variability_comparison <- tibble(
  Sample_Size = c("25%", "50%", "75%"),
  Mean_Range = c(
    max(price_stats_25$Mean_Price) - min(price_stats_25$Mean_Price),
    max(price_stats_50$Mean_Price) - min(price_stats_50$Mean_Price),
    max(price_stats_75$Mean_Price) - min(price_stats_75$Mean_Price)
  ),
  Median_Range = c(
    max(price_stats_25$Median_Price) - min(price_stats_25$Median_Price),
    max(price_stats_50$Median_Price) - min(price_stats_50$Median_Price),
    max(price_stats_75$Median_Price) - min(price_stats_75$Median_Price)
  )
)

kable(variability_comparison,
      col.names = c("Sample Size", "Mean Price Range", "Median Price Range"),
      caption = "How Sample Size Affects Variability",
      format.args = list(big.mark = ","),
      digits = 0)

How Sample Size Affects Variability

| Sample Size | Mean Price Range | Median Price Range |
|:------------|-----------------:|-------------------:|
| 25%         |            8,787 |             11,588 |
| 50%         |            4,897 |              6,500 |
| 75%         |            2,565 |              2,000 |

cat("\n=== KEY FINDING ===\n")
## 
## === KEY FINDING ===
cat("As sample size increases, variability DECREASES:\n")
## As sample size increases, variability DECREASES:
cat("- At 25%:", dollar(variability_comparison$Mean_Range[1]), "spread in means\n")
## - At 25%: $8,787.34 spread in means
cat("- At 50%:", dollar(variability_comparison$Mean_Range[2]), "spread in means\n")
## - At 50%: $4,896.84 spread in means
cat("- At 75%:", dollar(variability_comparison$Mean_Range[3]), "spread in means\n")
## - At 75%: $2,565.01 spread in means

Insight: As my sample size gets bigger, the statistics become more consistent across samples. With 75% samples, all my samples give pretty similar mean prices because they’re all close to representing the full population.

This tells me that bigger samples = more reliable conclusions. If I only have a small sample, I need to be way more cautious about making strong claims.
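
A range computed from only five samples is itself noisy, so here is a sketch that repeats the experiment 1,000 times per sample size and compares the spread of the means to the sd / sqrt(n) rule. The population here is synthetic (a made-up lognormal stand-in for SalePrice), so the dollar values are illustrative only.

```r
set.seed(42)
# Synthetic right-skewed "population" standing in for SalePrice (made-up values)
pop <- rlnorm(2930, meanlog = 12, sdlog = 0.4)

sizes  <- c(732, 1465, 2197)  # 25%, 50%, 75% of 2930, rounded down
n_reps <- 1000                # hypothetical number of repeated samples

# For each size: draw many samples with replacement, take each mean,
# then measure how much those means vary
sd_of_means <- sapply(sizes, function(n) {
  means <- replicate(n_reps, mean(sample(pop, n, replace = TRUE)))
  sd(means)
})

data.frame(n = sizes,
           simulated = round(sd_of_means),
           theory    = round(sd(pop) / sqrt(sizes)))
```

The simulated and theoretical columns should track each other closely. Tripling the sample size cuts the spread by a factor of about sqrt(3), or roughly 1.7, not by a factor of 3.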

# Visualizing the effect of sample size
all_means <- bind_rows(
  price_stats_25 |> mutate(Size = "25%"),
  price_stats_50 |> mutate(Size = "50%"),
  price_stats_75 |> mutate(Size = "75%")
)

ggplot(all_means, aes(x = Size, y = Mean_Price, group = Size)) +
  geom_boxplot(fill = "lightblue", alpha = 0.6) +
  geom_point(size = 3, alpha = 0.7) +
  geom_hline(yintercept = mean(ames$SalePrice), linetype = "dashed", 
             color = "red", linewidth = 1) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Sample Size Effect on Mean Price Estimates",
       subtitle = "Red line = true population mean; larger samples cluster closer to truth",
       x = "Sample Size (% of full data)",
       y = "Sample Mean Price") +
  theme_minimal()


Part 4: Consistent Patterns Across All Samples

Even though samples differ in details, some patterns should be consistent if they’re real.

# Test 1: Is the price-area relationship consistent?
correlation_by_sample_25 <- df_samples_25 |>
  group_by(sample_num) |>
  summarise(
    Price_Area_Correlation = cor(SalePrice, Gr.Liv.Area, use = "complete.obs"),
    Price_Quality_Correlation = cor(SalePrice, Overall.Qual, use = "complete.obs")
  )

# Adding full population
full_correlations <- tibble(
  sample_num = "Full Data",
  Price_Area_Correlation = cor(ames$SalePrice, ames$Gr.Liv.Area),
  Price_Quality_Correlation = cor(ames$SalePrice, ames$Overall.Qual)
)

correlation_comparison <- bind_rows(
  correlation_by_sample_25 |> mutate(sample_num = as.character(sample_num)),
  full_correlations
)

kable(correlation_comparison,
      col.names = c("Sample", "Price-Area Correlation", "Price-Quality Correlation"),
      caption = "Correlation Consistency Across 25% Samples",
      digits = 3)
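
Correlation Consistency Across 25% Samples

| Sample    | Price-Area Correlation | Price-Quality Correlation |
|:----------|-----------------------:|--------------------------:|
| 1         |                  0.680 |                     0.785 |
| 2         |                  0.695 |                     0.797 |
| 3         |                  0.703 |                     0.793 |
| 4         |                  0.666 |                     0.799 |
| 5         |                  0.711 |                     0.819 |
| Full Data |                  0.707 |                     0.799 |
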
cat("\n=== CONSISTENCY CHECK ===\n")
## 
## === CONSISTENCY CHECK ===
cat("Price-Area correlation ranges from", 
    round(min(correlation_by_sample_25$Price_Area_Correlation), 3), "to",
    round(max(correlation_by_sample_25$Price_Area_Correlation), 3), "\n")
## Price-Area correlation ranges from 0.666 to 0.711
cat("Price-Quality correlation ranges from",
    round(min(correlation_by_sample_25$Price_Quality_Correlation), 3), "to",
    round(max(correlation_by_sample_25$Price_Quality_Correlation), 3), "\n")
## Price-Quality correlation ranges from 0.785 to 0.819
cat("\nBoth are consistently strong and positive across all samples!\n")
## 
## Both are consistently strong and positive across all samples!

Insight: Even though individual statistics (like mean price) vary across samples, the relationships between variables stay consistent. Every sample shows that bigger homes cost more (correlation ~0.70) and higher quality homes cost more (correlation ~0.79).

This tells me that if I’m studying relationships rather than exact values, I’m on more solid ground even with a smaller sample.
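
To check whether a spread of 0.666 to 0.711 is about what chance would produce, here is a sketch using the Fisher z-transform, which treats atanh(r) as roughly normal with standard error 1 / sqrt(n - 3). That approximation assumes roughly bivariate-normal data, which skewed house prices only loosely satisfy, so I’d read the result as a ballpark.

```r
r <- 0.707  # full-data price-area correlation
n <- 732    # homes per 25% sample

z  <- atanh(r)          # Fisher z-transform of the correlation
se <- 1 / sqrt(n - 3)   # approximate standard error on the z scale

# Range where about 95% of sample correlations should fall, back on the r scale
band <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(band, 3)
```

The band works out to roughly 0.67 to 0.74, so four of my five sample correlations fall inside it and the fifth (0.666) sits just below.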

# Show that the relationship is consistent even if the samples differ
ggplot(df_samples_25, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point(alpha = 0.1, size = 1) +
  geom_smooth(aes(color = factor(sample_num)), method = "lm", se = FALSE) +
  scale_y_continuous(labels = dollar_format()) +
  scale_x_continuous(labels = comma_format()) +
  scale_color_brewer(palette = "Set2", name = "Sample") +
  labs(title = "Price vs. Living Area: Consistent Relationship Across Samples",
       subtitle = "Each line is a different sample's trend - all very similar slopes",
       x = "Living Area (sq ft)",
       y = "Sale Price") +
  theme_minimal()

What I see: All the trend lines are basically parallel. They might be shifted up or down slightly depending on the sample, but the slope (the extra dollars per additional square foot) is nearly identical. This means the fundamental pattern is real and robust.
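
Eyeballing parallel lines is a bit subjective, so the slopes could also be compared as numbers. A minimal sketch of fitting one regression per sample follows. The `toy` data, its column names, and the true slope of 110 are made up for illustration. With the real data the columns would be `Gr.Liv.Area` and `SalePrice`.

```r
set.seed(42)
# Made-up stand-in data: 5 samples of 200 homes each
toy <- data.frame(
  sample_num = rep(1:5, each = 200),
  area       = runif(1000, min = 600, max = 3000)
)
toy$price <- 20000 + 110 * toy$area + rnorm(1000, sd = 40000)

# Fit price ~ area separately within each sample and keep only the slope
slopes <- sapply(split(toy, toy$sample_num), function(d) {
  coef(lm(price ~ area, data = d))[["area"]]
})

# One slope per sample, in dollars per square foot
round(slopes, 1)
```

If the slopes differ by only a few dollars per square foot, that backs up what the plot suggests.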


Part 5: Building vs. Neighborhood Effects - A Deeper Look

Let me check if conclusions about rare combinations (from Week 3) would change across samples.

# Checking building type rarity across samples
building_rarity_25 <- df_samples_25 |>
  group_by(sample_num, Bldg.Type) |>
  summarise(Count = n(), .groups = "drop") |>
  group_by(sample_num) |>
  mutate(Percentage = Count / sum(Count) * 100) |>
  filter(Bldg.Type %in% c("Twnhs", "2fmCon")) |>  # Rare types from Week 3
  select(sample_num, Bldg.Type, Percentage)

# Pivot wider for easier comparison
building_comparison <- building_rarity_25 |>
  pivot_wider(names_from = Bldg.Type, values_from = Percentage, values_fill = 0)

# Adding full data
full_building <- ames |>
  count(Bldg.Type) |>
  filter(Bldg.Type %in% c("Twnhs", "2fmCon")) |>
  mutate(Percentage = n / nrow(ames) * 100) |>
  select(Bldg.Type, Percentage) |>
  pivot_wider(names_from = Bldg.Type, values_from = Percentage, values_fill = 0) |>
  mutate(sample_num = "Full Data")

building_final <- bind_rows(
  building_comparison |> mutate(sample_num = as.character(sample_num)),
  full_building
)

kable(building_final,
      caption = "Rare Building Types Across Samples (% of homes)",
      digits = 2)

Rare Building Types Across Samples (% of homes)

| sample_num | 2fmCon | Twnhs |
|:-----------|-------:|------:|
| 1          |   1.78 |  4.78 |
| 2          |   2.73 |  3.83 |
| 3          |   2.46 |  3.55 |
| 4          |   1.91 |  3.14 |
| 5          |   1.78 |  3.01 |
| Full Data  |   2.12 |  3.45 |

cat("\n=== ANOMALY CONSISTENCY ===\n")
## 
## === ANOMALY CONSISTENCY ===
cat("Townhouses in full data:", round(full_building$Twnhs, 2), "%\n")
## Townhouses in full data: 3.45 %
cat("Across 25% samples, ranges from",
    round(min(building_comparison$Twnhs, na.rm = TRUE), 2), "% to",
    round(max(building_comparison$Twnhs, na.rm = TRUE), 2), "%\n")
## Across 25% samples, ranges from 3.01 % to 4.78 %

Insight: The specific percentages change, but townhouses are consistently rare across all samples (always <5%). So my Week 3 conclusion that “townhouses are rare in Ames” would hold up regardless of which sample I had collected. That’s a robust finding.

However, the exact percentage varied by a factor of about 1.6 depending on my sample (3.01% in Sample 5, 4.78% in Sample 1). So if I was making very specific claims about exact scarcity, I’d need to be careful.
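
How large would a sample need to be to pin down a rare share precisely? A rough sketch using the usual normal-approximation formula n = z² p(1 - p) / m², where m is the margin of error. The target margin of half a percentage point is a hypothetical choice of mine.

```r
p      <- 0.0345        # townhouse share in the full data (3.45%)
margin <- 0.005         # hypothetical target: +/- 0.5 percentage points
z      <- qnorm(0.975)  # multiplier for a 95% interval

# Sample size needed for that margin (normal approximation)
n_needed <- ceiling(z^2 * p * (1 - p) / margin^2)
n_needed
```

That comes to a little over 5,000 homes, which is more than the 2,930 in the whole dataset. So for a share this small, a 25% sample can say “rare” with confidence but can’t say “exactly 3.45%.”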


Conclusion: What This Means for Drawing Conclusions

This sampling exercise taught me some important lessons about being careful with my conclusions:

1. Exact numbers are shaky with small samples

With only 25% of the data, my estimate of “average home price” was off by as much as about $5,000 from the full-data mean. I need to report uncertainty ranges, not just point estimates.

2. Relationships are more reliable than specific values

The correlation between size and price was consistent across all samples (~0.70), even when the exact average price varied. If I’m studying “what affects what,” I’m on safer ground than if I’m trying to pin down exact numbers.

3. Bigger samples = more confidence

As sample size increased from 25% to 75%, the spread in sample means dropped from about $8,800 to about $2,600. This is why sample size matters so much in real research.

4. Rare events are tricky

High-quality homes ranged from 3% to 6% across small samples, even though the true value is 4.7%. If something is rare, I need a big sample to estimate its frequency accurately.

5. Some patterns are robust

Townhouses were rare in every single sample. The price-size relationship was strong in every sample. These are findings I can trust more because they replicate consistently.

How this changes my future approach:

  • Always consider sample size: Before making strong claims, I need to ask “is my sample big enough?”
  • Report uncertainty: Instead of saying “the average price is $181,000,” I should say “the average price is approximately $181,000, though my 25% samples put it anywhere from about $176,000 to $185,000”
  • Focus on robust patterns: Prioritize findings that would likely replicate across different samples
  • Be extra careful with rare events: If I’m studying something that only appears in 5% of homes, I need a really large sample to say anything confident about it
  • Cross-validate when possible: If I can split my data and check if patterns hold in both halves, that’s a good sanity check
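
The split-half check from the last bullet can be sketched in a few lines of base R. The data here is made up so the example runs on its own. With the real file, `area` and `price` would be `ames$Gr.Liv.Area` and `ames$SalePrice`.

```r
set.seed(42)
# Made-up stand-in data for living area and sale price
n     <- 1000
area  <- runif(n, min = 600, max = 3000)
price <- 20000 + 110 * area + rnorm(n, sd = 40000)

# Randomly assign each home to half "A" or half "B" (500 each)
half <- sample(rep(c("A", "B"), length.out = n))

# Compute the same statistic independently in each half
r_A <- cor(area[half == "A"], price[half == "A"])
r_B <- cor(area[half == "B"], price[half == "B"])
cat("Half A:", round(r_A, 3), " Half B:", round(r_B, 3), "\n")
```

If the two halves give similar correlations, the pattern is probably not an accident of which rows I happened to look at. If they disagree noticeably, that’s a cue to be more cautious.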

The biggest takeaway: Every sample is just one possible version of reality. If I only have one dataset, I should always wonder: “Would I reach the same conclusion if I had collected different data?” This exercise showed me that sometimes the answer is yes (relationships), and sometimes it’s “maybe not” (exact values).