This report presents a statistical analysis and predictive modeling of the Ames Housing dataset to understand the factors that may influence residential property sale prices in Ames, Iowa. Our objective is to conduct Exploratory Data Analysis (EDA), perform three distinct hypothesis tests, and construct a Multiple Linear Regression model. The analysis focuses on how key features, such as living area, overall quality, and central air conditioning, affect the final Sale Price.
The dataset was imported and column names were standardised to ensure compatibility with R. An important pre-processing step involved identifying and removing extreme outliers in the main variable of the analysis, \(\text{SalePrice}\). Outliers can distort statistical models such as the linear regression model built later, which is fitted by minimising the sum of squared errors. The Interquartile Range (IQR) rule was applied to remove these extreme cases, using \(3 \times IQR\) rather than the standard \(1.5 \times IQR\) (as explained later).
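The filtering code itself is not shown in this report, so the following is only a minimal sketch of how an IQR rule with an adjustable multiplier could be written. The helper name `remove_iqr_outliers()` and the toy `prices` vector are hypothetical and are not taken from the original analysis.

```r
# Hypothetical helper: flag values inside [Q1 - k*IQR, Q3 + k*IQR]
remove_iqr_outliers <- function(x, k = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower <- q[[1]] - k * iqr
  upper <- q[[2]] + k * iqr
  x >= lower & x <= upper   # logical mask: TRUE = keep
}

# Toy data (not from the Ames dataset): nine typical prices and one extreme value
prices <- c(100000, 120000, 130000, 140000, 150000,
            160000, 170000, 180000, 200000, 900000)

keep <- remove_iqr_outliers(prices, k = 3)
cleaned <- prices[keep]
cat("Rows removed:", sum(!keep), "\n")
```

Applied to a data frame, the same mask could be used as a row filter, for example `ames_data[remove_iqr_outliers(ames_data$SalePrice), ]`, assuming the column contains no missing values.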
# Display the dimensions and structure
cat("Original Dataset Dimensions (Rows, Columns): ", original_rows, ", ", ncol(ames_data), "\n")
## Original Dataset Dimensions (Rows, Columns): 2930 , 83
cat("Rows Removed (SalePrice Outliers): ", rows_removed, "\n")
## Rows Removed (SalePrice Outliers): 26
cat("Final Dataset Dimensions (Rows, Columns): ", nrow(ames_data), ", ", ncol(ames_data), "\n")
## Final Dataset Dimensions (Rows, Columns): 2904 , 83
# We will check for missing values in key variables needed for the assessment
key_features <- c("SalePrice", "GrLivArea", "OverallQual", "Neighborhood", "HouseStyle", "CentralAir", "GarageCars", "LotArea")
missing_data <- colSums(is.na(ames_data[key_features]))
if(length(missing_data[missing_data > 0]) == 0){
cat("No missing values found in the key features selected for analysis.\n")
} else {
print(missing_data[missing_data > 0])
}
## GarageCars
## 1
A histogram of the cleaned \(\text{SalePrice}\) was plotted to show the basic spread of the data.
# Visualizing the distribution of the CLEANED SalePrice
ggplot(ames_data, aes(x = SalePrice)) +
geom_histogram(binwidth = 10000, fill = "darkblue", color = "white", alpha = 0.7) +
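# Scale the density curve to counts: after_stat(count) is density * n, multiplied here by the bin width
geom_density(aes(y = after_stat(count * 10000)), color = "red", linewidth = 1) +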
scale_x_continuous(labels = scales::dollar) +
labs(title = "Distribution of Cleaned Sale Price",
x = "Sale Price ($)",
y = "Count") +
theme_minimal()
The initial investigation confirmed a dataset size of 2930 observations. A critical finding was the highly right-skewed distribution of the target variable, \(\text{SalePrice}\). To mitigate the undue influence of the most extreme values, the IQR rule was applied, resulting in a cleaner dataset of 2904 rows, with 26 outliers removed.
The histogram above shows that the distribution remains right-skewed, which is typical for real estate data: most homes sit in the lower and middle price ranges, with a few high-priced outliers at the top. Removing the extreme values has shortened the long right tail slightly. The missing-value check found the key features to be complete apart from a single missing value in \(\text{GarageCars}\); that observation is dropped automatically when the regression model is fitted. The data is therefore suitable for the subsequent regression modeling.
This section investigates three distinct hypotheses concerning the drivers of Sale Price using Correlation, ANOVA, and T-tests.
\(H_0\): There is no significant linear relationship (correlation) between the Above Ground Living Area (\(\text{GrLivArea}\)) and the Sale Price (\(\text{SalePrice}\)) (\(\rho = 0\)).
\(H_A\): There is a significant linear relationship (correlation) between \(\text{GrLivArea}\) and \(\text{SalePrice}\) (\(\rho \neq 0\)).
# Pearson Correlation Test
correlation_test <- cor.test(ames_data$GrLivArea, ames_data$SalePrice, method = "pearson")
print(correlation_test)
##
## Pearson's product-moment correlation
##
## data: ames_data$GrLivArea and ames_data$SalePrice
## t = 50.741, df = 2902, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6658791 0.7044507
## sample estimates:
## cor
## 0.6856459
# Perform a simple linear regression to check linearity/homoscedasticity assumptions
lm_grliv <- lm(SalePrice ~ GrLivArea, data = ames_data)
par(mfrow=c(1, 2))
plot(lm_grliv, which = 1) # Residuals vs Fitted (Homoscedasticity)
plot(lm_grliv, which = 2) # Normal Q-Q (Normality)
par(mfrow=c(1, 1)) # Reset plot layout
The Residuals vs. Fitted plot shows residuals generally clustered around zero, supporting the assumption of linearity, although a fan-shaped spread is visible, which suggests some heteroscedasticity. The Normal Q-Q plot indicates minor deviations from normality in the tails, but given the large sample size (\(N > 2000\)), Pearson’s test is reasonably robust to this.
The test yields a Pearson correlation coefficient of \(r = 0.686\) (95% confidence interval 0.666 to 0.704) with a highly significant \(p < 0.001\). We reject the null hypothesis (\(H_0\)). There is a strong, statistically significant positive linear relationship: as the living area increases, the sale price increases.
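As a quick consistency check, the t statistic reported by the test can be recovered from the correlation coefficient and the degrees of freedom using \(t = r\sqrt{n-2}/\sqrt{1-r^2}\). The sketch below only reuses numbers copied from the printed output; it does not touch the dataset.

```r
# Values copied from the cor.test() output above
r  <- 0.6856459
df <- 2902            # n - 2

# t statistic for testing rho = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
cat("t =", round(t_stat, 2), "\n")

# Share of SalePrice variance linearly associated with GrLivArea
cat("r^2 =", round(r^2, 3), "\n")
```

The squared correlation suggests that living area alone is linearly associated with a little under half of the variance in sale price.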
ggplot(ames_data, aes(x = GrLivArea, y = SalePrice)) +
geom_point(alpha = 0.6, color = "#1F77B4") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sale Price vs. Above Grade Living Area (GrLivArea)",
x = "Above Ground Living Area (Square Feet)",
y = "Sale Price ($)") +
theme_minimal()
To illustrate the finding, I have created a scatter plot of above ground living area against sale price. Intuitively we would expect sale price to rise with living area, and the graph shows this clearly: the fitted line slopes upward and the points follow a strong positive trend. This directly supports the conclusion from the Pearson test.
\(H_0\): The mean Sale Price (\(\mu\)) is the same across all 10 levels of Overall Quality (\(\text{OverallQual}\)) (\(\mu_1 = \mu_2 = \dots = \mu_{10}\)).
\(H_A\): At least one level of Overall Quality has a significantly different mean Sale Price.
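# Run the ANOVA test (assumes OverallQual_Factor, the factor version of OverallQual, already exists)
anova_summary <- summary(aov(SalePrice ~ OverallQual_Factor, data = ames_data))
print(anova_summary)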
## Df Sum Sq Mean Sq F value Pr(>F)
## OverallQual_Factor 9 1.028e+13 1.142e+12 707.7 <2e-16 ***
## Residuals 2894 4.671e+12 1.614e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Assumption Check: Homogeneity of Variances (Levene's Test)
leveneTest(SalePrice ~ OverallQual_Factor, data = ames_data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 9 43.055 < 2.2e-16 ***
## 2894
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene’s Test yields a significant p-value (\(p < 0.05\)), indicating that the variances of Sale Price are not equal across all quality groups (heterogeneity of variance). While ANOVA is generally robust, this suggests caution, though the differences in means are likely large enough to remain significant.
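Because the equal-variance assumption is violated, one option is to cross-check the result with Welch's one-way ANOVA, which does not assume equal variances. This is a different test from the one used in this report, and the sketch below runs it on simulated data rather than the Ames data; the `toy` data frame and its columns are hypothetical.

```r
set.seed(1)
# Simulated data (not the Ames data): three groups with different means and spreads
toy <- data.frame(
  group = factor(rep(c("low", "mid", "high"), each = 50)),
  price = c(rnorm(50, mean = 100, sd = 5),
            rnorm(50, mean = 150, sd = 20),
            rnorm(50, mean = 220, sd = 60))
)

# Welch's one-way ANOVA: does not assume equal variances across groups
welch_result <- oneway.test(price ~ group, data = toy, var.equal = FALSE)
print(welch_result)
```

On the real data the equivalent call would presumably be `oneway.test(SalePrice ~ OverallQual_Factor, data = ames_data, var.equal = FALSE)`.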
The ANOVA test yields a highly significant F-statistic of 707.7 with a \(p < 0.001\). We reject the null hypothesis (\(H_0\)). There is a statistically significant difference in the mean sale price across the different levels of Overall Quality.
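The ANOVA table also allows a rough effect size to be computed. The sketch below rebuilds the F-statistic from the rounded sums of squares printed above and derives eta squared, the share of total variation in sale price accounted for by the quality grouping. Because the inputs are rounded table values, the results are approximate.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_between <- 1.028e13
ss_within  <- 4.671e12
df_between <- 9
df_within  <- 2894

# F = mean square between / mean square within
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Eta squared: share of total variation explained by the grouping
eta_sq <- ss_between / (ss_between + ss_within)

cat("F (from rounded table values):", round(f_stat, 1), "\n")
cat("Eta squared:", round(eta_sq, 3), "\n")
```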
ggplot(ames_data, aes(x = OverallQual_Factor, y = SalePrice, fill = OverallQual_Factor)) +
geom_boxplot(outlier.alpha = 0.3) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sale Price Distribution by Overall Quality Rating",
x = "Overall Quality Rating (1 = Very Poor, 10 = Very Excellent)",
y = "Sale Price ($)",
fill = "Quality") +
theme_light() +
theme(legend.position = "none")
Using a boxplot we can visualise the distribution of a continuous variable across multiple categories. The plot shows a step-by-step increase in the median sale price for every increment of the \(\text{OverallQual}\) rating, so the visual evidence strongly supports the conclusion that quality has a large impact on price. The increasing interquartile ranges at higher quality levels are also consistent with the unequal variances found by Levene’s test.
Interestingly, at the standard outlier removal threshold of \(1.5 \times IQR\) I found that the highest mean house value belonged to the group with an overall quality rating of 9, while group 10 showed the largest spread and a lower median house value. This is likely because too many high-priced homes were removed as outliers, which lowered the average price of the top group. Because the sale price distribution is naturally right-skewed, removing too many upper-range values would not give a true indication of quality versus sale price, so I changed the threshold to \(3 \times IQR\).
There is still a large spread in group 10, which could be due to the variety of house types, ages and sizes that can be classed as a “Very Excellent” home, unlike the average property in groups 5-7, which could potentially be a standard “new-build” type with low variability. Sample size could also be a factor, as there are fewer homes in this bracket.
\(H_0\): The mean Sale Price for houses with Central Air (Y) is equal to the mean Sale Price for houses without Central Air (N) (\(\mu_Y = \mu_N\)).
\(H_A\): The mean Sale Price for houses with Central Air is significantly different from those without (\(\mu_Y \neq \mu_N\)).
# Run the T-Test (Welch's test, assuming unequal variances for robustness)
t_test_result <- t.test(SalePrice ~ CentralAir, data = ames_data, var.equal = FALSE)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: SalePrice by CentralAir
## t = -26.947, df = 305.65, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -86986.02 -75146.59
## sample estimates:
## mean in group N mean in group Y
## 101890.5 182956.8
A Welch’s T-test was employed (var.equal = FALSE) to account for the unequal variances typically found between these groups.
The Welch’s T-test yields a t-statistic of -26.947 with a \(p < 0.001\). We reject the null hypothesis (\(H_0\)). Houses with Central Air have a statistically significantly higher mean sale price (\(\approx \$183k\)) compared to those without (\(\approx \$102k\)).
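The size of the gap can be made more concrete by recovering the standard error and confidence interval from the printed output. The sketch below uses only the numbers shown above (group means, t-statistic and Welch degrees of freedom), so small rounding differences from the reported interval are expected.

```r
# Values copied from the Welch t-test output above
mean_n <- 101890.5
mean_y <- 182956.8
t_stat <- -26.947
df     <- 305.65

diff_means <- mean_n - mean_y          # group N minus group Y
std_error  <- diff_means / t_stat      # since t = difference / standard error

# 95% confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qt(0.975, df) * std_error
cat("Difference:", round(diff_means), " SE:", round(std_error), "\n")
cat("95% CI:", round(ci), "\n")
```

The interval sits well away from zero, which is what the very small p-value reflects: homes without central air sell for roughly $75,000 to $87,000 less on average in this sample.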
# Violin Plot
ggplot(ames_data, aes(x = CentralAir, y = SalePrice, fill = CentralAir)) +
geom_violin(trim = FALSE, alpha = 0.6) +
geom_boxplot(width = 0.1, color = "black", alpha = 0.8) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sale Price Distribution: Central Air vs. No Central Air",
x = "Central Air Conditioning (N=No, Y=Yes)",
y = "Sale Price ($)") +
theme_classic()
The violin plot displays the estimated probability density of \(\text{SalePrice}\) for homes with and without central air conditioning. The plot visually confirms the results of the t-test, with the densest region for the Central Air = Yes group sitting noticeably higher on the y-axis than for the “No” group. The median lines are also clearly separated, again supporting the hypothesis that central air is associated with a higher sale price. The highest absolute prices all belong to homes with central air, which suggests that more expensive houses are more likely to have it.
A Multiple Linear Regression (MLR) model was constructed to predict \(\text{SalePrice}\) using the seven specified variables (GrLivArea, OverallQual, Neighborhood, HouseStyle, CentralAir, GarageCars, LotArea).
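regression_model <- lm(SalePrice ~ GrLivArea + OverallQual + Neighborhood + HouseStyle + CentralAir + GarageCars + LotArea, data = ames_data)
print(summary(regression_model))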
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + Neighborhood +
## HouseStyle + CentralAir + GarageCars + LotArea, data = ames_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -321717 -13865 195 13539 147643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66527.731 7663.222 -8.681 < 2e-16 ***
## GrLivArea 53.643 1.854 28.936 < 2e-16 ***
## OverallQual 18445.120 671.588 27.465 < 2e-16 ***
## NeighborhoodBlueste -12881.751 10965.190 -1.175 0.240178
## NeighborhoodBrDale -16857.129 8055.058 -2.093 0.036460 *
## NeighborhoodBrkSide -953.753 6575.223 -0.145 0.884680
## NeighborhoodClearCr 18327.073 7426.582 2.468 0.013654 *
## NeighborhoodCollgCr 14294.125 5932.895 2.409 0.016046 *
## NeighborhoodCrawfor 25219.996 6444.369 3.913 9.31e-05 ***
## NeighborhoodEdwards -6281.070 6231.502 -1.008 0.313561
## NeighborhoodGilbert 6787.345 6205.731 1.094 0.274169
## NeighborhoodGreens -2258.545 11868.286 -0.190 0.849087
## NeighborhoodGrnHill 98299.151 21651.046 4.540 5.85e-06 ***
## NeighborhoodIDOTRR -11637.094 6719.362 -1.732 0.083404 .
## NeighborhoodLandmrk -12222.229 30095.452 -0.406 0.684688
## NeighborhoodMeadowV -3199.353 7795.732 -0.410 0.681545
## NeighborhoodMitchel 2871.386 6394.158 0.449 0.653419
## NeighborhoodNAmes -3053.709 5924.507 -0.515 0.606288
## NeighborhoodNoRidge 53651.359 6837.090 7.847 5.96e-15 ***
## NeighborhoodNPkVill -15529.593 8382.909 -1.853 0.064053 .
## NeighborhoodNridgHt 61947.870 6134.723 10.098 < 2e-16 ***
## NeighborhoodNWAmes -2356.457 6254.946 -0.377 0.706398
## NeighborhoodOldTown -15003.447 6177.269 -2.429 0.015209 *
## NeighborhoodSawyer -1063.130 6273.747 -0.169 0.865449
## NeighborhoodSawyerW 1463.459 6278.613 0.233 0.815711
## NeighborhoodSomerst 24585.534 6053.990 4.061 5.02e-05 ***
## NeighborhoodStoneBr 52469.090 7190.118 7.297 3.78e-13 ***
## NeighborhoodSWISU -8820.489 7393.282 -1.193 0.232952
## NeighborhoodTimber 28971.123 6688.798 4.331 1.53e-05 ***
## NeighborhoodVeenker 27638.253 8299.618 3.330 0.000879 ***
## HouseStyle1.5Unf 6736.578 7108.581 0.948 0.343378
## HouseStyle1Story 16887.598 2176.802 7.758 1.19e-14 ***
## HouseStyle2.5Fin -29159.081 11698.642 -2.493 0.012740 *
## HouseStyle2.5Unf -12026.633 6403.000 -1.878 0.060444 .
## HouseStyle2Story -1370.612 2260.826 -0.606 0.544401
## HouseStyleSFoyer 22807.392 3985.329 5.723 1.16e-08 ***
## HouseStyleSLvl 7528.590 3340.436 2.254 0.024286 *
## CentralAirY 10855.210 2443.294 4.443 9.21e-06 ***
## GarageCars 11399.429 1017.303 11.206 < 2e-16 ***
## LotArea 0.550 0.080 6.875 7.61e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29520 on 2863 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8332, Adjusted R-squared: 0.8309
## F-statistic: 366.6 on 39 and 2863 DF, p-value: < 2.2e-16
The MLR model fits the data well, with an Adjusted \(R^2 = 0.8309\), meaning the model explains approximately 83.1% of the variance in \(\text{SalePrice}\) after adjusting for the number of predictors. The residual standard error of 29,520 indicates the typical size of a prediction error in dollars.
The numeric features \(\text{GrLivArea}\), \(\text{OverallQual}\), \(\text{GarageCars}\), and \(\text{LotArea}\) all exhibit highly significant positive coefficients (\(p < 0.001\)). This confirms that increases in the size of the living area, the lot size, the garage capacity, and the overall build quality are all associated with higher sale prices, as we’d normally expect from lived experience. The categorical variable \(\text{Neighborhood}\) shows the largest range of effects relative to its baseline level, from \(\approx -\$17,000\) (BrDale) to \(\approx +\$98,000\) (GrnHill), indicating that location also plays a substantial role in determining price: more desirable neighbourhoods command higher sale prices and less desirable ones lower prices. Location is harder to quantify on its own, however, because personal preferences and requirements (e.g. schooling, public transport) may weigh heavily, which may explain the low statistical significance of some neighbourhoods.
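The adjusted figure can be reproduced from the other numbers in the summary using \(1 - (1 - R^2)\frac{n-1}{n-p-1}\). The sketch below takes \(R^2\), the residual degrees of freedom and the number of model terms from the printed output; the variable names are only illustrative.

```r
# Values copied from the regression summary above
r_squared   <- 0.8332
n_obs       <- 2903   # 2863 residual df + 40 estimated coefficients
n_predictor <- 39     # model df reported with the F-statistic

adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictor - 1)
cat("Adjusted R-squared:", round(adj_r_squared, 4), "\n")
```

The penalty for using 39 terms is small here because the sample is large, which is why the adjusted and unadjusted values differ only in the third decimal place.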
To identify unimportant features, we examine the p-value (column \(\text{Pr(>|t|)}\)) for each coefficient. If the p-value is greater than the standard significance level (\(\alpha = 0.05\)), we fail to reject the null hypothesis that the coefficient is zero, meaning that predictor (or, for a categorical variable, that level relative to the baseline) is not statistically significant.
# Extract coefficient table
coef_table <- summary(regression_model)$coefficients
# Filter for features with a p-value > 0.05 (typically considered not statistically significant)
# Note: column 4 corresponds to Pr(>|t|); drop = FALSE keeps a matrix even if only one row matches
unimportant_features <- coef_table[coef_table[, 4] > 0.05, , drop = FALSE]
# Print only the variable names that are NOT significant
if (nrow(unimportant_features) > 0) {
cat("The following factor levels were found to be statistically unimportant (p > 0.05):\n")
print(rownames(unimportant_features))
} else {
cat("All feature variables were statistically significant at the 0.05 level.\n")
}
## The following factor levels were found to be statistically unimportant (p > 0.05):
## [1] "NeighborhoodBlueste" "NeighborhoodBrkSide" "NeighborhoodEdwards"
## [4] "NeighborhoodGilbert" "NeighborhoodGreens" "NeighborhoodIDOTRR"
## [7] "NeighborhoodLandmrk" "NeighborhoodMeadowV" "NeighborhoodMitchel"
## [10] "NeighborhoodNAmes" "NeighborhoodNPkVill" "NeighborhoodNWAmes"
## [13] "NeighborhoodSawyer" "NeighborhoodSawyerW" "NeighborhoodSWISU"
## [16] "HouseStyle1.5Unf" "HouseStyle2.5Unf" "HouseStyle2Story"
The output above lists the specific levels of categorical variables (e.g., specific Neighborhoods or House Styles) whose price does not differ significantly from their baseline reference group. This does not make the variable as a whole unimportant, since other levels of both factors are highly significant. The numeric predictors (\(\text{GrLivArea}\), \(\text{OverallQual}\), \(\text{GarageCars}\), \(\text{LotArea}\)) are all significant.
One simple way to compare the relative importance of the predictors is to analyse the magnitude of the t-statistic, which measures how many standard errors the coefficient is from zero. The variable with the largest absolute t-statistic is the most precisely estimated non-zero effect in the model and is treated here as the most influential and statistically reliable predictor.
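Since each factor level is only compared with its baseline, a per-term F-test is one way to judge a categorical variable as a whole. The sketch below illustrates this with `drop1()` on simulated data; the `toy` data frame, its columns and the effect sizes are invented for illustration and are not the Ames data.

```r
set.seed(42)
# Simulated data (not the Ames data): a numeric predictor and a 4-level factor
n <- 200
toy <- data.frame(
  area = runif(n, 800, 3000),
  zone = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
zone_effect <- c(A = 0, B = 2000, C = 30000, D = -1000)
toy$price <- 50000 + 60 * toy$area +
  unname(zone_effect[as.character(toy$zone)]) + rnorm(n, sd = 10000)

toy_model <- lm(price ~ area + zone, data = toy)

# Per-level t-tests compare each level with the baseline level "A"
print(round(summary(toy_model)$coefficients, 4))

# One F-test per term: is the factor as a whole worth keeping?
factor_tests <- drop1(toy_model, test = "F")
print(factor_tests)
```

In this toy setup some individual levels are built to sit close to the baseline while the factor as a whole still matters, which mirrors the pattern seen for Neighborhood and HouseStyle. On the fitted model the analogous call would presumably be `drop1(regression_model, test = "F")`.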
cat("The level with the largest absolute t-statistic is:", most_important_var_level, "\n")
## The level with the largest absolute t-statistic is: GrLivArea
cat("Absolute T-Value:", round(most_important_t_value, 2), "\n")
## Absolute T-Value: 28.94
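The code that produced `most_important_var_level` and `most_important_t_value` is not shown in the report. A minimal sketch of one way to obtain them is given below, using a handful of t-values copied from the regression summary rather than the fitted model; the `coef_subset` data frame is a hypothetical stand-in for the full coefficient table.

```r
# A few rows of the coefficient table, copied from the regression summary above
coef_subset <- data.frame(
  term    = c("GrLivArea", "OverallQual", "NeighborhoodNridgHt",
              "GarageCars", "LotArea", "CentralAirY"),
  t_value = c(28.936, 27.465, 10.098, 11.206, 6.875, 4.443)
)

# Rank terms by absolute t-statistic, largest first
ranked <- coef_subset[order(-abs(coef_subset$t_value)), ]
print(ranked)

most_important_var_level <- ranked$term[1]
most_important_t_value   <- abs(ranked$t_value[1])
```

With the fitted model available, the same ranking could be taken from `summary(regression_model)$coefficients[-1, "t value"]`, where `-1` drops the intercept row.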
The variable that contributed the most to the predictive power of the model is \(\text{GrLivArea}\), which had the largest absolute t-statistic (\(|t| = 28.94\)), with \(\text{OverallQual}\) a very close second (\(|t| = 27.47\)). These large t-values suggest that above ground living area and the subjective rating of Overall Quality are the most important and reliable drivers of a home’s sale price among the features provided.
This analysis of the Ames Housing dataset investigated the primary drivers of residential property prices. Following data cleaning, statistical hypothesis testing, and multiple linear regression modeling, several of these relationships were established and their impacts on the sale price analysed.
The investigation confirmed that SalePrice is strongly influenced by both the physical characteristics of the property and its quality ratings:
Living Area: A strong, positive linear correlation (Pearson’s \(r \approx 0.69\)) exists between GrLivArea and SalePrice, showing that larger homes tend to sell for higher prices.
Overall Quality: ANOVA testing showed that OverallQual is an important price determinant (\(p < 0.001\)), with mean sale price differing significantly across quality ratings and the boxplot showing the median rising at each step up in quality.
Central Air: Homes with Central Air Conditioning (which likely also feeds into the quality rating) sell for a significantly higher mean price (\(\approx \$183,000\)) compared to those without (\(\approx \$102,000\)), as confirmed by Welch’s T-test (\(p < 0.001\)).
The Multiple Linear Regression model explained approximately 83.1% of the variance in Sale Price (Adjusted \(R^2 \approx 0.8309\)).
Most Important Feature: Above Ground Living Area was identified as the single most important predictor, possessing the highest absolute t-statistic (28.94), with the Overall Quality rating coming in at a very close second (27.47). This indicates that both size (GrLivArea) and the quality of the build are highly reliable statistical predictors of value in this market.
Unimportant Features: While location (Neighborhood) is significant overall, several specific neighbourhoods and house styles were found to be statistically indistinguishable from their baseline level, suggesting that for some areas, location carries less of a premium than the physical attributes of the house.
The analysis concludes that the Ames housing market is driven primarily by quality construction and living space. While amenities like Central Air and Garage size are significant value-adders, they are secondary to the impact of OverallQual and GrLivArea. Therefore, the developed regression model is suitable for estimating property values within the observed price range, providing clearer insights for valuation.