Introduction

This week I’m exploring relationships between variables by pairing original columns with derived ones I create. I’ll look at correlations, build visualizations to understand the relationships, and construct confidence intervals to make inferences about the broader population of Ames homes. The goal is to understand not just what patterns exist in my sample, but what I can reasonably conclude about all homes in Ames.

Data Loading and Variable Creation

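# Load packages used throughout the analysis
library(dplyr)    # mutate(), filter(), group_by(), summarise()
library(ggplot2)  # plots
library(scales)   # dollar(), dollar_format()
library(knitr)    # kable()

# Load the dataset
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)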

# Create derived variables
ames <- ames |>
  mutate(
    Age_at_Sale = Yr.Sold - Year.Built,
    Price_per_SF = SalePrice / Gr.Liv.Area
  )

cat("Dataset:", nrow(ames), "homes\n")
## Dataset: 2930 homes
cat("Created variables:\n")
## Created variables:
cat("  - Age_at_Sale: How old the home was when sold\n")
##   - Age_at_Sale: How old the home was when sold
cat("  - Price_per_SF: Sale price divided by living area\n")
##   - Price_per_SF: Sale price divided by living area
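
Before relying on the derived columns, it can be worth checking them for impossible values. The sketch below uses a small made-up data frame (the toy object and its values are my own illustration, not rows from the Ames file) to flag negative ages and undefined prices per square foot.

```r
# Toy data standing in for a few Ames rows (values are made up for illustration)
toy <- data.frame(
  Yr.Sold     = c(2008, 2007, 2010),
  Year.Built  = c(1960, 2008, 2005),
  SalePrice   = c(150000, 220000, 180000),
  Gr.Liv.Area = c(1200, 1800, 1500)
)

# Same derivations as in the analysis
toy$Age_at_Sale  <- toy$Yr.Sold - toy$Year.Built
toy$Price_per_SF <- toy$SalePrice / toy$Gr.Liv.Area

# Flag rows where the derived values look implausible
toy$age_negative <- toy$Age_at_Sale < 0           # sold before the recorded build year
toy$ppsf_invalid <- !is.finite(toy$Price_per_SF)  # zero or missing living area

print(toy[, c("Age_at_Sale", "Price_per_SF", "age_negative", "ppsf_invalid")])
```

The same two checks could be run on the real ames data frame after the mutate() step.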

Pair 1: Sale Price (Response) vs. Age at Sale (Explanatory)

The Relationship

Response Variable: SalePrice (original column)
Explanatory Variable: Age_at_Sale (created column - calculated as year sold minus year built)

This pair asks: “Is home age associated with sale price?” I treat age as the explanatory variable because a home’s age can plausibly influence its price, while price cannot influence age. Because the data are observational, the analysis can show an association but cannot by itself prove cause and effect.

Visualization

# Scatterplot of price vs age
ggplot(ames, aes(x = Age_at_Sale, y = SalePrice)) +
  geom_point(alpha = 0.4, size = 2, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.5) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Home Sale Price vs. Age at Sale",
       subtitle = "Red line shows linear relationship with 95% confidence band",
       x = "Age at Sale (years)",
       y = "Sale Price") +
  theme_minimal()

# Check for outliers using IQR method
Q1_price <- quantile(ames$SalePrice, 0.25)
Q3_price <- quantile(ames$SalePrice, 0.75)
IQR_price <- Q3_price - Q1_price
lower_bound <- Q1_price - 1.5 * IQR_price
upper_bound <- Q3_price + 1.5 * IQR_price

outliers_price <- ames |>
  filter(SalePrice < lower_bound | SalePrice > upper_bound)

cat("\n=== OUTLIER CHECK ===\n")
## 
## === OUTLIER CHECK ===
cat("IQR outlier bounds:", dollar(lower_bound), "to", dollar(upper_bound), "\n")
## IQR outlier bounds: $3,500 to $339,500
cat("Outliers found:", nrow(outliers_price), "homes (", 
    round(nrow(outliers_price)/nrow(ames)*100, 1), "%)\n")
## Outliers found: 137 homes ( 4.7 %)

What I see in the plot:

The scatterplot shows a clear negative trend—as homes get older, their sale prices tend to decrease. The red trend line slopes downward, and most points cluster around this line. However, there’s substantial scatter, meaning age isn’t the only factor determining price.

I can see some outliers at the top of the plot: very expensive homes that don’t follow the typical age-price pattern. These are probably high-quality homes in premium neighborhoods where quality matters more than age. I also notice a few very old homes (100+ years) that still sell for decent prices—these might be well-maintained historical homes.

Scrutinizing the plot:

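  • Outliers: About 4.7% of homes (137 of 2,930) are price outliers using the IQR method. Because the lower bound ($3,500) is far below any realistic sale price, essentially all of these are high-price homes. Note that this check uses price alone, so it flags homes that are expensive overall rather than homes that are expensive for their age.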
  • Pattern strength: The relationship looks moderately strong but not perfect—there’s a lot of vertical spread at any given age.
  • Linearity: The relationship appears reasonably linear, though there might be some curvature (maybe prices drop faster in the first 20 years than later).

Correlation Analysis

# Calculate Pearson correlation
cor_price_age <- cor(ames$SalePrice, ames$Age_at_Sale, use = "complete.obs")

cat("=== CORRELATION ANALYSIS ===\n")
## === CORRELATION ANALYSIS ===
cat("Pearson correlation coefficient:", round(cor_price_age, 3), "\n\n")
## Pearson correlation coefficient: -0.559
cat("Interpretation:\n")
## Interpretation:
cat("- Magnitude:", round(abs(cor_price_age), 3), "indicates a", 
    if(abs(cor_price_age) > 0.7) "strong" else if(abs(cor_price_age) > 0.4) "moderate" else "weak",
    "relationship\n")
## - Magnitude: 0.559 indicates a moderate relationship
cat("- Direction: Negative value confirms older homes sell for less\n")
## - Direction: Negative value confirms older homes sell for less
cat("- Variance explained:", round(cor_price_age^2 * 100, 1), 
    "% of price variation is explained by age\n")
## - Variance explained: 31.2 % of price variation is explained by age

Does the correlation make sense?

The correlation coefficient of r = -0.559 makes perfect sense given the visualization:

  1. Negative sign: This matches what I see—the scatter slopes downward from left to right
  2. Moderate strength: The value is between -0.4 and -0.7, which fits what I see. There’s a clear trend but lots of scatter. If it were -0.9, I’d expect points hugging the line tightly. If it were -0.2, I’d see almost no visible pattern.
  3. About 31% of variance explained: The r² value (0.559² = 0.31) means age explains about 31% of price differences. That leaves 69% unexplained by age alone, which makes sense—size, quality, and location also matter a lot.

The correlation aligns well with what the plot shows: a clear but imperfect relationship where age is one of several important factors.
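
The “variance explained” figure rests on the fact that, for a simple linear regression with one predictor, the squared Pearson correlation equals the model’s R-squared. A minimal check on simulated data (x and y are hypothetical stand-ins for age and price, not the Ames columns):

```r
set.seed(42)

x <- runif(100, 0, 100)                          # hypothetical ages
y <- 200000 - 1000 * x + rnorm(100, sd = 30000)  # hypothetical prices

r            <- cor(x, y)
r_squared_lm <- summary(lm(y ~ x))$r.squared

cat("r^2 from cor():", round(r^2, 4), "\n")
cat("R^2 from lm(): ", round(r_squared_lm, 4), "\n")
```

The two printed values agree, which is why squaring r is a fair shortcut here. With more than one predictor the shortcut no longer applies.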

Confidence Interval for Sale Price

# Calculate 95% confidence interval for mean sale price
n <- nrow(ames)
mean_price <- mean(ames$SalePrice)
se_price <- sd(ames$SalePrice) / sqrt(n)
t_critical <- qt(0.975, df = n - 1)  # 95% CI, two-tailed

ci_lower <- mean_price - t_critical * se_price
ci_upper <- mean_price + t_critical * se_price

cat("=== 95% CONFIDENCE INTERVAL FOR MEAN SALE PRICE ===\n\n")
## === 95% CONFIDENCE INTERVAL FOR MEAN SALE PRICE ===
cat("Sample mean:", dollar(mean_price), "\n")
## Sample mean: $180,796
cat("Standard error:", dollar(se_price), "\n")
## Standard error: $1,475.84
cat("95% CI:", dollar(ci_lower), "to", dollar(ci_upper), "\n")
## 95% CI: $177,902 to $183,690
cat("Margin of error:", dollar(t_critical * se_price), "\n")
## Margin of error: $2,893.80

Detailed interpretation:

Based on this confidence interval, I can say with 95% confidence that the true mean sale price for all homes in Ames (the population) falls between $177,902 and $183,690.

What this means in practical terms:

If I could somehow collect data on every single home sale in Ames during this period (not just my sample of 2,930 homes), I would expect the average price to fall within this range, because intervals built this way capture the true mean about 95% of the time. My margin of error is only about $2,900 (less than 2% of the mean), which shows that the sample is large enough to give a precise estimate.
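
To see what “95% confidence” means in practice, here is a small simulation sketch. It draws many samples from a population whose mean I know and counts how often the interval contains it. The population values (true_mean, true_sd) and the normal shape are assumptions for illustration; real sale prices are right-skewed.

```r
set.seed(123)

true_mean <- 180000  # hypothetical population mean
true_sd   <- 80000   # hypothetical population SD
n_sim     <- 2000    # number of repeated samples
n_obs     <- 500     # homes per sample

# For each simulated sample, build a 95% t-interval and check coverage
covered <- replicate(n_sim, {
  s  <- rnorm(n_obs, mean = true_mean, sd = true_sd)
  ci <- t.test(s, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

cat("Share of intervals containing the true mean:", mean(covered), "\n")
```

The share should land close to 0.95, which is the sense in which any single interval earns the label “95% confident”.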

Why this matters:

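  • For real estate agents: You can tell clients that the average Ames sale price is about $180,000, and that this average is pinned down to within roughly $3,000. Individual homes vary far more than that, so the interval describes the mean, not the range of typical prices.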
  • For appraisers: This provides a market benchmark with statistical backing
  • For researchers: The narrow confidence interval suggests the sample size is adequate—I don’t need to collect more data to get a good estimate of the mean
  • For comparison: If I wanted to test whether a new neighborhood’s average price is “typical” for Ames, I could compare it to this interval
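
The comparison idea in the last bullet could be sketched as a one-sample t-test. The new_nbhd prices below are made up, and treating the Ames mean as a fixed number ignores its own (small) sampling error, so this is a simplification rather than a full two-sample comparison.

```r
# Made-up sale prices for a hypothetical new neighborhood
new_nbhd <- c(212000, 198500, 240000, 225000, 189000, 231500, 205000, 219000)

ames_mean <- 180796  # sample mean reported in the output above

# H0: the neighborhood's mean price equals the Ames-wide mean
res <- t.test(new_nbhd, mu = ames_mean)

cat("Neighborhood mean:", mean(new_nbhd), "\n")
cat("p-value:", signif(res$p.value, 3), "\n")
```

A small p-value would suggest the neighborhood’s average differs from the city-wide benchmark.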

Important caveat: This CI only applies to the 2006-2010 time period in my data. The true population mean today could be very different due to inflation and market changes.


Pair 2: Overall Quality (Explanatory) vs. Price per Square Foot (Response)

The Relationship

Explanatory Variable: Overall.Qual (original column - ordered categorical from 1-10)
Response Variable: Price_per_SF (created column - sale price divided by living area)

This pair asks: “Is overall quality rating related to the price people pay per square foot?” I’m treating quality as explanatory because construction quality plausibly influences what buyers will pay per square foot, not the other way around. Quality is ordered (10 is better than 5), which makes a rank-based measure such as Spearman correlation appropriate.

Visualization

# Boxplot showing price per SF for each quality level
ggplot(ames, aes(x = factor(Overall.Qual), y = Price_per_SF)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red", outlier.alpha = 0.5) +
  geom_smooth(aes(x = Overall.Qual, y = Price_per_SF), 
              method = "lm", color = "darkblue", se = TRUE, linewidth = 1.5) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Price per Square Foot by Overall Quality Rating",
       subtitle = "Higher quality homes command higher prices per square foot",
       x = "Overall Quality Rating (1 = Poor, 10 = Excellent)",
       y = "Price per Square Foot") +
  theme_minimal()

# Summary statistics by quality
quality_summary <- ames |>
  group_by(Overall.Qual) |>
  summarise(
    Count = n(),
    Mean_Price_SF = mean(Price_per_SF),
    Median_Price_SF = median(Price_per_SF),
    SD_Price_SF = sd(Price_per_SF)
  )

kable(quality_summary,
      col.names = c("Quality", "Count", "Mean $/SF", "Median $/SF", "SD $/SF"),
      caption = "Price per Square Foot Summary by Quality Rating",
      digits = 2)
Price per Square Foot Summary by Quality Rating

Quality   Count   Mean $/SF   Median $/SF   SD $/SF
-------   -----   ---------   -----------   -------
      1       4       63.49         59.21     41.59
      2      13       85.27         75.00     35.55
      3      40       83.07         81.94     23.48
      4     226       97.04         95.28     27.43
      5     825      112.79        115.42     27.33
      6     732      115.02        113.94     23.29
      7     602      125.42        125.00     23.65
      8     350      147.03        146.07     27.57
      9     107      179.56        183.15     30.26
     10      31      173.54        182.68     56.10

What I see in the plot:

The boxplots show a clear upward trend: as quality rating increases from 1 to 10, the median price per square foot generally increases, with small dips from quality 5 to 6 and from quality 9 to 10 (see the table). The blue trend line confirms the overall positive relationship.

Scrutinizing the plot:

  • Outliers: I see red dots above several boxplots, indicating homes that have unusually high price-per-square-foot even for their quality level. These might be in extremely desirable locations.
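  • Spread: Variability is largest at the extremes rather than rising steadily with quality. In the table, the standard deviations for quality 1, 2, and 10 run from about $36 to $56/SF, and these are the groups with very few homes. Quality levels 3 through 9 all have standard deviations between about $23 and $30/SF, so a quality-5 home and a quality-8 home show similar spread around their own averages.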
  • Generally upward trend: Overall, higher quality means higher price per square foot, though looking at the table I notice quality 2→3 and quality 9→10 show slight decreases in the mean. These small reversals might be due to sample size (very few quality-1, 2, or 10 homes) or other confounding factors.

From the table: The mean price per square foot rises from about $63/SF for quality-1 homes to about $174/SF for quality-10 homes, more than double (quality-9 homes average slightly higher, at about $180/SF). While there are a couple of small reversals at the extremes, the overall pattern clearly shows people pay premiums for quality construction and finishes.
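
To judge whether the two reversals are more than noise, I can compute a standard error for each group mean (SD divided by the square root of the count) directly from the table. The sketch below copies the table’s values; the qual_tbl name and the Diff and Diff_SE columns are my own additions.

```r
# Values copied from the quality summary table
qual_tbl <- data.frame(
  Quality = 1:10,
  Count   = c(4, 13, 40, 226, 825, 732, 602, 350, 107, 31),
  Mean    = c(63.49, 85.27, 83.07, 97.04, 112.79, 115.02, 125.42, 147.03, 179.56, 173.54),
  SD      = c(41.59, 35.55, 23.48, 27.43, 27.33, 23.29, 23.65, 27.57, 30.26, 56.10)
)

# Standard error of each group mean: SD / sqrt(n)
qual_tbl$SE <- qual_tbl$SD / sqrt(qual_tbl$Count)

# Change in mean from the previous quality level, and the SE of that change
# (the groups are separate homes, so the variances add)
qual_tbl$Diff    <- c(NA, diff(qual_tbl$Mean))
qual_tbl$Diff_SE <- c(NA, sqrt(head(qual_tbl$SE, -1)^2 + tail(qual_tbl$SE, -1)^2))

print(round(qual_tbl[, c("Quality", "Count", "Mean", "SE", "Diff", "Diff_SE")], 2))
```

For both the 2→3 and 9→10 steps, the drop in the mean is smaller than the standard error of the difference, which supports reading them as small-sample noise.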

Correlation Analysis

# Calculate Spearman correlation (appropriate for ordered categorical)
cor_qual_pricesf_spearman <- cor(ames$Overall.Qual, ames$Price_per_SF, 
                                 method = "spearman", use = "complete.obs")

# Also calculate Pearson for comparison
cor_qual_pricesf_pearson <- cor(ames$Overall.Qual, ames$Price_per_SF, 
                                method = "pearson", use = "complete.obs")

cat("=== CORRELATION ANALYSIS ===\n")
## === CORRELATION ANALYSIS ===
cat("Pearson correlation:", round(cor_qual_pricesf_pearson, 3), "\n")
## Pearson correlation: 0.537
cat("Spearman correlation (for ordered data):", 
    round(cor_qual_pricesf_spearman, 3), "\n\n")
## Spearman correlation (for ordered data): 0.467
cat("Interpretation:\n")
## Interpretation:
cat("- Both methods show moderate positive relationships\n")
## - Both methods show moderate positive relationships
cat("- Pearson is slightly higher (", round(cor_qual_pricesf_pearson, 3), 
    " vs ", round(cor_qual_pricesf_spearman, 3), 
    "), which can happen when\n  the relationship is closer to linear than purely rank-based\n")
## - Pearson is slightly higher ( 0.537  vs  0.467 ), which can happen when
##   the relationship is closer to linear than purely rank-based
cat("- Quality explains about", round(cor_qual_pricesf_pearson^2 * 100, 1),
    "% of price/SF variation\n")
## - Quality explains about 28.9 % of price/SF variation

Does the correlation make sense?

The correlation of r = 0.537 (Pearson) and ρ = 0.467 (Spearman) makes perfect sense:

  1. Positive sign: Matches the upward trend I see—higher quality = higher price per square foot
  2. Moderate strength: Not super strong (not 0.8+) because location, lot size, and other factors also affect price per square foot. Quality matters but isn’t everything.
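  3. Pearson higher than Spearman: Pearson responds to how far apart the values are, while Spearman uses only their ranks. A plausible explanation here is that quality 8 through 10 homes sit far above the rest in price per square foot, which strengthens the linear fit, while the crowded middle levels overlap heavily (quality 5 and 6 have nearly identical means), which weakens the rank ordering. Quality also has only 10 distinct values, so there are many ties. The reversals in group means at quality 2→3 and 9→10 involve few homes and probably contribute little.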

The correlations align with the visualization: there’s a clear positive trend overall, but it’s not perfectly consistent at every step. Quality generally increases price per square foot, but the relationship has some noise.
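
One way to see what Spearman actually measures: it is the Pearson correlation computed on ranks, with tied values given their average rank. The sketch below checks this on simulated data with a 1 to 10 rating; the quality and price_sf vectors are hypothetical, not the Ames columns.

```r
set.seed(7)

# Hypothetical stand-ins: a 1-10 rating with many ties and a noisy response
quality  <- sample(1:10, 300, replace = TRUE)
price_sf <- 60 + 12 * quality + rnorm(300, sd = 25)

spearman_builtin <- cor(quality, price_sf, method = "spearman")
spearman_by_hand <- cor(rank(quality), rank(price_sf))  # rank() averages ties by default
pearson          <- cor(quality, price_sf)

cat("Spearman (built in):", round(spearman_builtin, 3), "\n")
cat("Spearman (by hand): ", round(spearman_by_hand, 3), "\n")
cat("Pearson:            ", round(pearson, 3), "\n")
```

The first two values match. How Pearson compares with them depends on the shape of the data, so the gap in my Ames results reflects that dataset rather than a general rule.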

Confidence Interval for Price per Square Foot

# Calculate 95% confidence interval for mean price per SF
mean_price_sf <- mean(ames$Price_per_SF)
se_price_sf <- sd(ames$Price_per_SF) / sqrt(n)
t_critical_sf <- qt(0.975, df = n - 1)

ci_lower_sf <- mean_price_sf - t_critical_sf * se_price_sf
ci_upper_sf <- mean_price_sf + t_critical_sf * se_price_sf

cat("=== 95% CONFIDENCE INTERVAL FOR MEAN PRICE PER SF ===\n\n")
## === 95% CONFIDENCE INTERVAL FOR MEAN PRICE PER SF ===
cat("Sample mean:", dollar(mean_price_sf), "/SF\n")
## Sample mean: $121.30 /SF
cat("Standard error:", dollar(se_price_sf), "\n")
## Standard error: $0.59
cat("95% CI:", dollar(ci_lower_sf), "to", dollar(ci_upper_sf), "/SF\n")
## 95% CI: $120.14 to $122.47 /SF
cat("Margin of error:", dollar(t_critical_sf * se_price_sf), "\n")
## Margin of error: $1.16

Detailed interpretation:

I can say with 95% confidence that the true mean price per square foot for all Ames homes is between $120.14 and $122.47 per square foot.
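
As a cross-check on the hand calculation, R’s built-in t.test() returns the same t-based interval. The sketch uses simulated values (ppsf is a stand-in for the price-per-square-foot column, not the real data).

```r
set.seed(2024)

# Simulated price-per-SF values (stand-in for ames$Price_per_SF)
ppsf <- rnorm(400, mean = 120, sd = 30)

# Manual interval: mean +/- t critical value * standard error
n_ppsf    <- length(ppsf)
se_ppsf   <- sd(ppsf) / sqrt(n_ppsf)
manual_ci <- mean(ppsf) + c(-1, 1) * qt(0.975, df = n_ppsf - 1) * se_ppsf

# Built-in interval from a one-sample t-test
builtin_ci <- as.numeric(t.test(ppsf, conf.level = 0.95)$conf.int)

print(round(manual_ci, 2))
print(round(builtin_ci, 2))
```

Both lines print the same pair of bounds, so either approach works for this kind of interval.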

What this means in practical terms:

If I took every home sale in Ames during 2006-2010 and calculated the average price per square foot, I would expect it to fall in this range, since intervals built this way capture the true mean about 95% of the time. My margin of error is only about $1.16/SF, which is quite small: less than 1% of the mean.

Why this matters:

  • For builders: You can estimate construction budgets knowing that typical Ames homes sell for about $121/SF. If your construction costs are $100/SF, you have about $20/SF margin for profit and land costs.
  • For buyers: If someone quotes you $200/SF for a quality-7 home, you can see that’s well above the market average and question whether the premium is justified.
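  • For comparative analysis: The market-wide mean gives a starting benchmark, but the confidence interval describes the mean, not individual homes. The standard deviation implied by the output is roughly $32/SF ($0.59 × √2,930), so a home selling for $80/SF is about 1.3 standard deviations below average: lower than most, but not unusual enough on its own to signal a bargain or a hidden problem.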
  • For appraisers: This provides a market-wide reference point, though you’d still adjust for specific quality, location, and features.

Relationship to quality: Looking back at my quality table, the group means for quality 6 and below fall below this confidence interval, while the means for quality 7 and above fall above it. This means:

  • Quality 1 through 6 homes average below the market-wide mean price per SF
  • Quality 7 through 10 homes average above it
  • No quality level’s mean lands inside the interval itself; quality 6 ($115.02) and quality 7 ($125.42) are the closest on either side

This makes intuitive sense and is consistent with quality being a major driver of price per square foot.
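
A quick way to check which quality levels sit below, inside, or above the interval is to classify each group mean against the two bounds. The values are copied from the table and the CI output; the position variable is my own label.

```r
# Group means copied from the quality summary table
quality   <- 1:10
mean_ppsf <- c(63.49, 85.27, 83.07, 97.04, 112.79, 115.02, 125.42, 147.03, 179.56, 173.54)

# 95% CI bounds for the overall mean, from the output above
ci_lo <- 120.14
ci_hi <- 122.47

# Label each group mean relative to the interval
position <- ifelse(mean_ppsf < ci_lo, "below",
            ifelse(mean_ppsf > ci_hi, "above", "inside"))

print(data.frame(Quality = quality, Mean = mean_ppsf, Position = position))
```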


Additional Insights and Further Questions

Key Findings Summary

  1. Age has a moderate negative association with price (r = -0.56): Older homes tend to sell for less, but quality and location can offset the age penalty

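  2. Quality tracks price per square foot (r = 0.54): Mean price per square foot rises from about $63 at quality 1 to about $174 at quality 10, more than doubling; from quality 5 to quality 10 it rises by roughly half (about $113 to $174)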

  3. Both relationships are real but imperfect: Together, age and quality explain important price variation, but other factors (location, lot size, specific features) also matter significantly


Conclusion

This analysis showed me how to move beyond just describing my sample to making inferences about the broader population. By calculating correlation coefficients, I quantified relationships I could see visually. By building confidence intervals, I established ranges for population parameters with statistical backing.

The key lesson: my sample of 2,930 homes provides precise estimates (narrow CIs) of population means, but individual predictions remain uncertain (moderate correlations). I can confidently say the average Ames home sells for about $180,000, but I can’t perfectly predict any individual home’s price using just age or quality—I’d need a more comprehensive model.
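
The gap between a precise mean and an uncertain individual prediction can be illustrated by comparing a confidence interval with a prediction interval from the same linear model. This is a sketch on simulated data; the age and price vectors and the 30-year-old example home are hypothetical.

```r
set.seed(99)

# Simulated stand-in for the age/price relationship (not the Ames data)
age   <- runif(500, 0, 100)
price <- 240000 - 1400 * age + rnorm(500, sd = 60000)
fit   <- lm(price ~ age)

new_home <- data.frame(age = 30)

# Interval for the mean price of all 30-year-old homes
conf_int <- predict(fit, new_home, interval = "confidence", level = 0.95)
# Interval for the price of one particular 30-year-old home
pred_int <- predict(fit, new_home, interval = "prediction", level = 0.95)

ci_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pi_width <- pred_int[, "upr"] - pred_int[, "lwr"]

cat("Confidence interval width:", round(ci_width), "\n")
cat("Prediction interval width:", round(pi_width), "\n")
```

The prediction interval comes out many times wider, which matches the lesson above: a large sample pins down averages well but does much less for any single home.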