This week I’m exploring relationships between variables by pairing original columns with derived ones I create. I’ll look at correlations, build visualizations to understand the relationships, and construct confidence intervals to make inferences about the broader population of Ames homes. The goal is to understand not just what patterns exist in my sample, but what I can reasonably conclude about all homes in Ames.
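# Load packages used in this analysis
library(dplyr)    # mutate(), filter(), group_by(), summarise()
library(ggplot2)  # plotting
library(scales)   # dollar(), dollar_format()
library(knitr)    # kable()
# Load the dataset
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)
# Create derived variables
ames <- ames |>
  mutate(
    Age_at_Sale = Yr.Sold - Year.Built,
    Price_per_SF = SalePrice / Gr.Liv.Area
  )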
cat("Dataset:", nrow(ames), "homes\n")
## Dataset: 2930 homes
cat("Created variables:\n")
## Created variables:
cat(" - Age_at_Sale: How old the home was when sold\n")
## - Age_at_Sale: How old the home was when sold
cat(" - Price_per_SF: Sale price divided by living area\n")
## - Price_per_SF: Sale price divided by living area
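Before relying on derived columns like these, it can help to check them for impossible values. The sketch below is only illustrative: the `toy` data frame and its values are made up to stand in for a few Ames rows, and the two checks (negative age, non-finite ratio) are my assumption about what could go wrong, not something the dataset is known to contain.

```r
# Toy data standing in for a few Ames rows (illustrative values, not from the dataset)
toy <- data.frame(
  Yr.Sold     = c(2008, 2007, 2010),
  Year.Built  = c(1960, 2008, 2005),
  SalePrice   = c(150000, 185000, 210000),
  Gr.Liv.Area = c(1200, 1500, 0)
)

# Same derivations as in the analysis
toy$Age_at_Sale  <- toy$Yr.Sold - toy$Year.Built
toy$Price_per_SF <- toy$SalePrice / toy$Gr.Liv.Area

# Flag rows where the derived values are not usable
bad_age   <- which(toy$Age_at_Sale < 0)           # sold before the recorded build year
bad_ratio <- which(!is.finite(toy$Price_per_SF))  # division by zero or missing inputs

bad_age
bad_ratio
```

Rows flagged this way could be inspected by hand before deciding whether to keep them.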
Response Variable: SalePrice (original column)
Explanatory Variable: Age_at_Sale (created column, calculated as year sold minus year built)
This pair frames a directional question: “Does home age affect sale price?” Age is the explanatory variable because a home’s age can plausibly influence its price, but price cannot influence age. Since the data are observational, what I can measure here is the association between the two, not a proven causal effect.
# Scatterplot of price vs age
ggplot(ames, aes(x = Age_at_Sale, y = SalePrice)) +
  geom_point(alpha = 0.4, size = 2, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.5) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Home Sale Price vs. Age at Sale",
       subtitle = "Red line shows linear relationship with 95% confidence band",
       x = "Age at Sale (years)",
       y = "Sale Price") +
  theme_minimal()
# Check for outliers using IQR method
Q1_price <- quantile(ames$SalePrice, 0.25)
Q3_price <- quantile(ames$SalePrice, 0.75)
IQR_price <- Q3_price - Q1_price
lower_bound <- Q1_price - 1.5 * IQR_price
upper_bound <- Q3_price + 1.5 * IQR_price
outliers_price <- ames |>
filter(SalePrice < lower_bound | SalePrice > upper_bound)
cat("\n=== OUTLIER CHECK ===\n")
##
## === OUTLIER CHECK ===
cat("IQR outlier bounds:", dollar(lower_bound), "to", dollar(upper_bound), "\n")
## IQR outlier bounds: $3,500 to $339,500
cat("Outliers found:", nrow(outliers_price), "homes (",
round(nrow(outliers_price)/nrow(ames)*100, 1), "%)\n")
## Outliers found: 137 homes ( 4.7 %)
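The fence arithmetic can be checked by hand. In the sketch below, `iqr_bounds()` is a hypothetical helper I wrote for illustration, the quartiles `q1` and `q3` are back-solved from the printed bounds ($3,500 and $339,500) rather than read from the data, and the `prices` vector is a made-up toy example.

```r
# Hypothetical helper: returns the k * IQR fences for a numeric vector
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Quartiles back-solved from the printed fences (an assumption, not a new result)
q1 <- 129500
q3 <- 213500
fences <- c(lower = q1 - 1.5 * (q3 - q1), upper = q3 + 1.5 * (q3 - q1))
fences

# Toy vector (illustrative values) with one obvious high outlier
prices <- c(100, 120, 130, 140, 150, 160, 500)
b <- iqr_bounds(prices)
flagged <- prices[prices < b[["lower"]] | prices > b[["upper"]]]
flagged
```

Note that the lower fence of $3,500 sits far below any realistic sale price, so in practice every flagged home is on the expensive side.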
What I see in the plot:
The scatterplot shows a clear negative trend—as homes get older, their sale prices tend to decrease. The red trend line slopes downward, and most points cluster around this line. However, there’s substantial scatter, meaning age isn’t the only factor determining price.
I can see some outliers at the top of the plot: very expensive homes that don’t follow the typical age-price pattern. These are probably high-quality homes in premium neighborhoods where quality matters more than age. I also notice a few very old homes (100+ years) that still sell for decent prices—these might be well-maintained historical homes.
Scrutinizing the plot:
# Calculate Pearson correlation
cor_price_age <- cor(ames$SalePrice, ames$Age_at_Sale, use = "complete.obs")
cat("=== CORRELATION ANALYSIS ===\n")
## === CORRELATION ANALYSIS ===
cat("Pearson correlation coefficient:", round(cor_price_age, 3), "\n\n")
## Pearson correlation coefficient: -0.559
cat("Interpretation:\n")
## Interpretation:
cat("- Magnitude:", abs(cor_price_age), "indicates a",
if(abs(cor_price_age) > 0.7) "strong" else if(abs(cor_price_age) > 0.4) "moderate" else "weak",
"relationship\n")
## - Magnitude: 0.5589068 indicates a moderate relationship
cat("- Direction: Negative value confirms older homes sell for less\n")
## - Direction: Negative value confirms older homes sell for less
cat("- Variance explained:", round(cor_price_age^2 * 100, 1),
"% of price variation is explained by age\n")
## - Variance explained: 31.2 % of price variation is explained by age
Does the correlation make sense?
The correlation coefficient of r = -0.559 makes perfect sense given the visualization:
The correlation aligns well with what the plot shows: a clear but imperfect relationship where age is one of several important factors.
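To make the two numbers above concrete, here is a minimal sketch of how Pearson's r and the "variance explained" figure relate. The `age` and `price` vectors are made-up toy values, not Ames data; only the last line uses a number from my output.

```r
# Toy data (illustrative): age in years and price in $1,000s
age   <- c(2, 10, 25, 40, 60, 85)
price <- c(260, 230, 190, 170, 120, 130)

# Pearson r is the covariance scaled by both standard deviations
r_manual  <- cov(age, price) / (sd(age) * sd(price))
r_builtin <- cor(age, price)

# r^2 is the share of variance a straight-line fit would account for
r_squared <- r_builtin^2

c(manual = r_manual, builtin = r_builtin, r_squared = r_squared)

# Same arithmetic for the reported correlation: (-0.559)^2 is about 0.312
reported_r2 <- round((-0.559)^2, 3)
reported_r2
```

Squaring removes the sign, which is why the direction has to be read from r itself and not from the percentage.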
# Calculate 95% confidence interval for mean sale price
n <- nrow(ames)
mean_price <- mean(ames$SalePrice)
se_price <- sd(ames$SalePrice) / sqrt(n)
t_critical <- qt(0.975, df = n - 1) # 95% CI, two-tailed
ci_lower <- mean_price - t_critical * se_price
ci_upper <- mean_price + t_critical * se_price
cat("=== 95% CONFIDENCE INTERVAL FOR MEAN SALE PRICE ===\n\n")
## === 95% CONFIDENCE INTERVAL FOR MEAN SALE PRICE ===
cat("Sample mean:", dollar(mean_price), "\n")
## Sample mean: $180,796
cat("Standard error:", dollar(se_price), "\n")
## Standard error: $1,475.84
cat("95% CI:", dollar(ci_lower), "to", dollar(ci_upper), "\n")
## 95% CI: $177,902 to $183,690
cat("Margin of error:", dollar(t_critical * se_price), "\n")
## Margin of error: $2,893.80
Detailed interpretation:
Based on this confidence interval, I can say with 95% confidence that the true mean sale price for all homes in Ames (the population) falls between $177,902 and $183,690.
What this means in practical terms:
If I could somehow collect data on every single home sale in Ames during this period (not just my sample of 2,930 homes), I would expect the average price to fall within this range: intervals built this way capture the true mean in about 95% of repeated samples. The fact that my margin of error is only about $2,900 (about 1.6% of the mean) shows that my sample is large enough to give me a fairly precise estimate.
Why this matters:
Important caveat: This CI only applies to the 2006-2010 time period in my data. The true population mean today could be very different due to inflation and market changes.
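The interval calculation can be wrapped up and cross-checked against R's built-in `t.test()`. This is a sketch under assumptions: `mean_ci()` is a hypothetical helper name, and the vector `x` holds made-up values. The last step reuses the standard error and sample size printed above to reproduce the margin of error.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# Toy sample (illustrative values, in $1,000s)
x <- c(142, 155, 168, 171, 180, 195, 210, 250)
ci_manual <- mean_ci(x)
ci_ttest  <- t.test(x, conf.level = 0.95)$conf.int

ci_manual
ci_ttest

# Margin of error from the printed summary values (n = 2930, SE = $1,475.84)
moe <- qt(0.975, df = 2930 - 1) * 1475.84
moe
```

With n this large the t critical value is very close to 1.96, so a normal-based interval would be nearly identical.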
Explanatory Variable: Overall.Qual (original column, ordered categorical from 1-10)
Response Variable: Price_per_SF (created column, sale price divided by living area)
This pair asks: “Does overall quality rating affect the price people pay per square foot?” I’m treating quality as explanatory because the quality of construction influences what buyers will pay per square foot, not the other way around. Quality is ordered (10 is better than 5), which makes a rank-based (Spearman) correlation a natural choice, with Pearson shown for comparison.
# Boxplot showing price per SF for each quality level
# The trend line is fit on the numeric rating; it lines up with the boxes
# because all ten rating levels (1-10) are present in the data
ggplot(ames, aes(x = factor(Overall.Qual), y = Price_per_SF)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red", outlier.alpha = 0.5) +
  geom_smooth(aes(x = Overall.Qual, y = Price_per_SF),
              method = "lm", color = "darkblue", se = TRUE, linewidth = 1.5) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Price per Square Foot by Overall Quality Rating",
       subtitle = "Higher quality homes command higher prices per square foot",
       x = "Overall Quality Rating (1 = Poor, 10 = Excellent)",
       y = "Price per Square Foot") +
  theme_minimal()
# Summary statistics by quality
quality_summary <- ames |>
  group_by(Overall.Qual) |>
  summarise(
    Count = n(),
    Mean_Price_SF = mean(Price_per_SF),
    Median_Price_SF = median(Price_per_SF),
    SD_Price_SF = sd(Price_per_SF)
  )
kable(quality_summary,
      col.names = c("Quality", "Count", "Mean $/SF", "Median $/SF", "SD $/SF"),
      caption = "Price per Square Foot Summary by Quality Rating",
      digits = 2)
| Quality | Count | Mean $/SF | Median $/SF | SD $/SF |
|---|---|---|---|---|
| 1 | 4 | 63.49 | 59.21 | 41.59 |
| 2 | 13 | 85.27 | 75.00 | 35.55 |
| 3 | 40 | 83.07 | 81.94 | 23.48 |
| 4 | 226 | 97.04 | 95.28 | 27.43 |
| 5 | 825 | 112.79 | 115.42 | 27.33 |
| 6 | 732 | 115.02 | 113.94 | 23.29 |
| 7 | 602 | 125.42 | 125.00 | 23.65 |
| 8 | 350 | 147.03 | 146.07 | 27.57 |
| 9 | 107 | 179.56 | 183.15 | 30.26 |
| 10 | 31 | 173.54 | 182.68 | 56.10 |
What I see in the plot:
The boxplots show a clear upward trend: as quality rating increases from 1 to 10, the median price per square foot generally increases, with only small dips from quality-5 to quality-6 and from quality-9 to quality-10. The blue trend line confirms this positive relationship.
Scrutinizing the plot:
From the table: The mean price per square foot generally increases from about $63/SF for quality-1 homes to over $173/SF for quality-10 homes, which is more than double. There are two small reversals (quality-2 to quality-3 and quality-9 to quality-10), both in groups with few homes, but the overall pattern clearly shows people pay premiums for quality construction and finishes.
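The figures in the table can be checked directly. This sketch copies the "Mean $/SF" column by hand (so any typo in copying is mine), then computes the top-to-bottom ratio and locates the steps where the mean drops instead of rising.

```r
# Mean $/SF by quality rating 1..10, copied from the summary table
mean_sf <- c(63.49, 85.27, 83.07, 97.04, 112.79,
             115.02, 125.42, 147.03, 179.56, 173.54)

# Ratio of the highest rating's mean to the lowest rating's mean
ratio <- mean_sf[10] / mean_sf[1]
round(ratio, 2)

# Steps where the mean falls when quality goes up by one level
from <- which(diff(mean_sf) < 0)
reversals <- paste(from, "->", from + 1)
reversals
```

Both reversals involve rating levels with small counts in the table, so they may reflect noise more than a real pattern.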
# Calculate Spearman correlation (appropriate for ordered categorical)
cor_qual_pricesf_spearman <- cor(ames$Overall.Qual, ames$Price_per_SF,
method = "spearman", use = "complete.obs")
# Also calculate Pearson for comparison
cor_qual_pricesf_pearson <- cor(ames$Overall.Qual, ames$Price_per_SF,
method = "pearson", use = "complete.obs")
cat("=== CORRELATION ANALYSIS ===\n")
## === CORRELATION ANALYSIS ===
cat("Pearson correlation:", round(cor_qual_pricesf_pearson, 3), "\n")
## Pearson correlation: 0.537
cat("Spearman correlation (for ordered data):",
round(cor_qual_pricesf_spearman, 3), "\n\n")
## Spearman correlation (for ordered data): 0.467
cat("Interpretation:\n")
## Interpretation:
cat("- Both methods show moderate positive relationships\n")
## - Both methods show moderate positive relationships
cat("- Pearson is slightly higher (", round(cor_qual_pricesf_pearson, 3),
" vs ", round(cor_qual_pricesf_spearman, 3),
"), which can happen when\n the relationship is closer to linear than purely rank-based\n")
## - Pearson is slightly higher ( 0.537 vs 0.467 ), which can happen when
## the relationship is closer to linear than purely rank-based
cat("- Quality explains about", round(cor_qual_pricesf_pearson^2 * 100, 1),
"% of price/SF variation\n")
## - Quality explains about 28.9 % of price/SF variation
Does the correlation make sense?
The correlation of r = 0.537 (Pearson) and ρ = 0.467 (Spearman) makes perfect sense:
The correlations align with the visualization: there’s a clear positive trend overall, but it’s not perfectly consistent at every step. Quality generally increases price per square foot, but the relationship has some noise.
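One way to see why the two coefficients differ is that Spearman's ρ is simply Pearson's r computed on ranks, so it ignores how far apart the values are. A minimal sketch, assuming made-up `quality` and `price_sf` values (not Ames data) that include tied ratings:

```r
# Toy data (illustrative): quality ratings with ties, and price per SF
quality  <- c(3, 5, 5, 6, 6, 7, 8, 8, 9, 10)
price_sf <- c(80, 110, 118, 105, 121, 127, 150, 141, 185, 160)

# Built-in Spearman correlation
rho_builtin <- cor(quality, price_sf, method = "spearman")

# Same value by hand: Pearson correlation of the average ranks
rho_manual <- cor(rank(quality), rank(price_sf))

# Pearson on the raw values, for comparison
pearson <- cor(quality, price_sf)

c(spearman = rho_builtin, spearman_manual = rho_manual, pearson = pearson)
```

Because `Overall.Qual` has only ten levels, the real data contain many ties, and tied values share an average rank exactly as in this toy example.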
# Calculate 95% confidence interval for mean price per SF
mean_price_sf <- mean(ames$Price_per_SF)
se_price_sf <- sd(ames$Price_per_SF) / sqrt(n)
t_critical_sf <- qt(0.975, df = n - 1)
ci_lower_sf <- mean_price_sf - t_critical_sf * se_price_sf
ci_upper_sf <- mean_price_sf + t_critical_sf * se_price_sf
cat("=== 95% CONFIDENCE INTERVAL FOR MEAN PRICE PER SF ===\n\n")
## === 95% CONFIDENCE INTERVAL FOR MEAN PRICE PER SF ===
cat("Sample mean:", dollar(mean_price_sf), "/SF\n")
## Sample mean: $121.30 /SF
cat("Standard error:", dollar(se_price_sf), "\n")
## Standard error: $0.59
cat("95% CI:", dollar(ci_lower_sf), "to", dollar(ci_upper_sf), "/SF\n")
## 95% CI: $120.14 to $122.47 /SF
cat("Margin of error:", dollar(t_critical_sf * se_price_sf), "\n")
## Margin of error: $1.16
Detailed interpretation:
I can say with 95% confidence that the true mean price per square foot for all Ames homes is between $120.14 and $122.47 per square foot.
What this means in practical terms:
If I took every home sale in Ames during 2006-2010 and calculated the average price per square foot, I would expect it to fall in this range. My margin of error is only about $1.16/SF, which is quite small: less than 1% of the mean.
Why this matters:
Relationship to quality: Looking back at my quality table, the mean price per SF for quality-6 and below falls below this confidence interval, while the mean for quality-7 and above falls above it. This means:
- Below-average quality homes typically sell below the market average price per SF
- Above-average quality homes typically sell above the market average price per SF
- Quality-6 and 7 homes sit closest to the market average, one just below it ($115.02/SF) and one just above it ($125.42/SF)
This makes intuitive sense and validates that quality is a major driver of price per square foot.
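The comparison between the quality table and the interval can be done mechanically. In this sketch the group means are copied by hand from the table and the interval bounds from the output above. Using `cut()` to label each group is just one convenient way to do it.

```r
# Mean $/SF by quality rating, copied from the summary table
mean_sf <- c(63.49, 85.27, 83.07, 97.04, 112.79,
             115.02, 125.42, 147.03, 179.56, 173.54)
names(mean_sf) <- 1:10

# 95% CI for the overall mean price per SF, copied from the output
ci <- c(lower = 120.14, upper = 122.47)

# Label each quality level by where its mean sits relative to the interval
position <- cut(mean_sf,
                breaks = c(-Inf, ci[["lower"]], ci[["upper"]], Inf),
                labels = c("below", "within", "above"))
groups <- split(names(mean_sf), position)
groups
```

This compares group means with an interval for the overall mean, so it is a descriptive check and not a formal test of differences between quality levels.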
Age has a moderate negative association with price (r = -0.56): Older homes tend to sell for less, but quality and location can offset the age penalty
Quality tracks price per square foot (r = 0.54): Moving from quality-1 to quality-10 more than doubles the mean price per square foot, and moving from quality-5 to quality-10 raises it by about 54%
Both relationships are real but imperfect: Together, age and quality explain important price variation, but other factors (location, lot size, specific features) also matter significantly
This analysis showed me how to move beyond just describing my sample to making inferences about the broader population. By calculating correlation coefficients, I quantified relationships I could see visually. By building confidence intervals, I established ranges for population parameters with statistical backing.
The key lesson: my sample of 2,930 homes provides precise estimates (narrow CIs) of population means, but individual predictions remain uncertain (moderate correlations). I can confidently say the average Ames home sold for about $180,000 during 2006-2010, but I can’t reliably predict any individual home’s price using just age or quality. For that I’d need a more comprehensive model.