Introduction

This report presents a statistical analysis and predictive model built on the Ames Housing dataset, with the aim of understanding the factors that may influence residential property sale prices in Ames, Iowa. Our objective is to conduct Exploratory Data Analysis (EDA), perform three distinct hypothesis tests, and construct a Multiple Linear Regression model. The analysis focuses on how key features, such as living area, overall quality, and central air conditioning, impact the final Sale Price.

1. Pre-Processing & EDA

1.1. Reading, Manipulation, and Outlier Removal

The dataset was imported, and column names were standardised to ensure compatibility with R. An important pre-processing step was identifying and removing extreme outliers in $\text{SalePrice}$, the main variable in this analysis. Outliers can distort statistical models such as the linear regression model built later, which is fitted by minimising the sum of squared errors. The Interquartile Range (IQR) rule was applied to remove these extreme cases, using a multiplier of 3 rather than the standard 1.5 (the reason is explained in Section 2.2).
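The filtering code itself is not shown in this report, so the following is only a minimal sketch of an IQR rule with an adjustable multiplier. The helper name `iqr_bounds` and the toy price vector are assumptions made for illustration; they are not taken from the original analysis.

```r
# Hypothetical helper: lower and upper fences for the IQR rule.
# k = 1.5 is the conventional multiplier; k = 3 flags only extreme values.
iqr_bounds <- function(x, k = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Toy prices (illustrative values only, not taken from the Ames data)
prices <- c(100000, 120000, 130000, 150000, 160000, 180000, 200000, 750000)

bounds  <- iqr_bounds(prices, k = 3)
keep    <- prices >= bounds["lower"] & prices <= bounds["upper"]
cleaned <- prices[keep]

cat("Fences:", bounds, "\n")
cat("Values removed:", sum(!keep), "\n")
```

On a data frame, the same logical vector could be used to subset rows, for example `ames_data[keep, ]` with `keep` computed from `ames_data$SalePrice`.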

# Display the dimensions and structure
cat("Original Dataset Dimensions (Rows, Columns): ", original_rows, ", ", ncol(ames_data), "\n")

## Original Dataset Dimensions (Rows, Columns):  2930 ,  83

cat("Rows Removed (SalePrice Outliers): ", rows_removed, "\n")

## Rows Removed (SalePrice Outliers):  26

cat("Final Dataset Dimensions (Rows, Columns): ", nrow(ames_data), ", ", ncol(ames_data), "\n")

## Final Dataset Dimensions (Rows, Columns):  2904 ,  83

# We will check for missing values in key variables needed for the assessment
key_features <- c("SalePrice", "GrLivArea", "OverallQual", "Neighborhood", "HouseStyle", "CentralAir", "GarageCars", "LotArea")
missing_data <- colSums(is.na(ames_data[key_features]))

if(length(missing_data[missing_data > 0]) == 0){
  cat("No missing values found in the key features selected for analysis.\n")
} else {
  print(missing_data[missing_data > 0])
}

## GarageCars 
##          1

1.2. Distribution and Preliminary Analysis

A histogram of the cleaned $\text{SalePrice}$ was plotted to understand the basic spread of the data.

# Visualizing the distribution of the CLEANED SalePrice
ggplot(ames_data, aes(x = SalePrice)) +
  geom_histogram(binwidth = 10000, fill = "darkblue", color = "white", alpha = 0.7) +
  # Scale the density curve to the count axis: count = density * n, times the bin width
  geom_density(aes(y = after_stat(count * 10000)), color = "red", linewidth = 1) +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Distribution of Cleaned Sale Price",
       x = "Sale Price ($)",
       y = "Count") +
  theme_minimal()

Discussion of Preliminary Analysis and Findings:

The initial investigation confirmed a dataset size of 2930 observations. A critical finding was the strongly right-skewed distribution of the target variable, $\text{SalePrice}$. To limit the influence of the most extreme values, the IQR rule was applied, removing 26 outliers and leaving a cleaner dataset of 2904 rows.

The histogram above shows that the distribution remains right-skewed, which is typical for real estate data: most homes sit towards the middle and lower end of the price range, with a few high-priced outliers at the top. Removing the extreme values has shortened the long right tail slightly. The missing-value check found one missing value in $\text{GarageCars}$; the other seven key variables (including $\text{SalePrice}$) are complete. Because lm() drops incomplete rows by default, the regression in Section 4.1 is fitted on one fewer observation, as its output reports.

2. Statistical Analysis & Data Visualisation

This section investigates three distinct hypotheses concerning the drivers of Sale Price using Correlation, ANOVA, and T-tests.

2.1. Test 1: Investigating the Linear Relationship (SalePrice vs. GrLivArea)

Hypothesis

$H_0$: There is no linear relationship (correlation) between the Above Ground Living Area ($\text{GrLivArea}$) and the Sale Price ($\text{SalePrice}$) ($\rho = 0$).
$H_A$: There is a linear relationship (correlation) between $\text{GrLivArea}$ and $\text{SalePrice}$ ($\rho \neq 0$).

Method: Pearson’s Correlation Test

# Pearson Correlation Test
print(correlation_test)

## 
##  Pearson's product-moment correlation
## 
## data:  ames_data$GrLivArea and ames_data$SalePrice
## t = 50.741, df = 2902, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6658791 0.7044507
## sample estimates:
##       cor 
## 0.6856459

# Perform a simple linear regression to check linearity/homoscedasticity assumptions
lm_grliv <- lm(SalePrice ~ GrLivArea, data = ames_data)
par(mfrow=c(1, 2)) 
plot(lm_grliv, which = 1) # Residuals vs Fitted (Homoscedasticity)
plot(lm_grliv, which = 2) # Normal Q-Q (Normality)

par(mfrow=c(1, 1)) # Reset plot layout

Assumptions Check & Conclusion

The Residuals vs. Fitted plot shows residuals generally centred on zero, supporting the assumption of linearity, although a fan shape is visible: the spread of the residuals grows with the fitted value, which suggests some heteroscedasticity. The Normal Q-Q plot indicates minor deviations from normality in the tails, but given the large sample size ($N > 2000$), Pearson’s test is robust to this.

Conclusion

The test yields a Pearson correlation coefficient of r = 0.6856 (95% CI: 0.666 to 0.704) with a highly significant $p< 0.001$. We reject the null hypothesis ($H_0$). There is a strong, statistically significant positive linear relationship: as the living area increases, the sale price tends to increase.
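As a quick consistency check, the t-statistic reported by the test can be recovered from the correlation coefficient and the degrees of freedom using $t = r\sqrt{n-2}/\sqrt{1-r^2}$. The sketch below uses only the numbers printed in the output above; the variable names are illustrative.

```r
# Values copied from the cor.test() output above
r  <- 0.6856459
df <- 2902                      # n - 2

# Test statistic for H0: rho = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of SalePrice variance linearly associated with GrLivArea
r_squared <- r^2

cat("t =", round(t_stat, 3), " r^2 =", round(r_squared, 3), "\n")
```

The squared correlation suggests that living area alone is linearly associated with a little under half of the variance in sale price.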

Scatter Plot

ggplot(ames_data, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.6, color = "#1F77B4") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sale Price vs. Above Grade Living Area (GrLivArea)",
       x = "Above Ground Living Area (Square Feet)",
       y = "Sale Price ($)") +
  theme_minimal()

To illustrate the findings, I created a scatter plot of above ground living area against sale price. Intuitively, we would expect sale price to increase with living area, and the graph shows exactly this: a clear positive relationship between the two, which directly supports the conclusion from the Pearson test.

2.2. Test 2: Comparing Means Across Multiple Categories (SalePrice vs. OverallQual)

Hypothesis

$H_0$: The mean Sale Price ($\mu$) is the same across all 10 levels of Overall Quality ($\text{OverallQual}$) ($\mu_1 = \mu_2 = \dots = \mu_{10}$).
$H_A$: At least one level of Overall Quality has a different mean Sale Price.

Method: One-Way Analysis of Variance (ANOVA)

# Run the ANOVA test
print(anova_summary)

##                      Df    Sum Sq   Mean Sq F value Pr(>F)    
## OverallQual_Factor    9 1.028e+13 1.142e+12   707.7 <2e-16 ***
## Residuals          2894 4.671e+12 1.614e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Assumption Check: Homogeneity of Variances (Levene's Test)

leveneTest(SalePrice ~ OverallQual_Factor, data = ames_data)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    9  43.055 < 2.2e-16 ***
##       2894                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Assumptions Check & Conclusion:

Levene’s Test yields a significant p-value ($p < 0.05$), indicating that the variances of Sale Price are not equal across all quality groups (heterogeneity of variance). While ANOVA is generally robust, this suggests caution, though the differences in means are likely large enough to remain significant.
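One common way to act on this caution is Welch's one-way test, which does not assume equal variances across groups. The report does not run this test, so the sketch below uses simulated data; the data frame `toy` and its columns are assumptions made for illustration, not part of the Ames analysis.

```r
set.seed(42)  # reproducible toy data

# Simulated stand-in for SalePrice by quality group: both the means and the
# spreads grow with the group, mimicking the pattern flagged by Levene's test
toy <- data.frame(
  quality = factor(rep(c("5", "7", "9"), each = 50)),
  price   = c(rnorm(50, mean = 130000, sd = 15000),
              rnorm(50, mean = 200000, sd = 30000),
              rnorm(50, mean = 350000, sd = 70000))
)

# Welch's one-way test (base R): var.equal = FALSE relaxes the
# equal-variance assumption of the classical ANOVA F-test
welch_result <- oneway.test(price ~ quality, data = toy, var.equal = FALSE)
print(welch_result)
```

Applied to the real data, the equivalent call would presumably be `oneway.test(SalePrice ~ OverallQual_Factor, data = ames_data, var.equal = FALSE)`.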

Conclusion

The ANOVA test yields a highly significant F-statistic of 707.7 with $p< 0.001$. We reject the null hypothesis ($H_0$). There is a statistically significant difference in the mean sale price across the different levels of Overall Quality.
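The F-statistic can be reconstructed from the sums of squares in the ANOVA table, and the same numbers give an effect size (eta squared, the share of total variance explained by the grouping). This is a small sketch using the rounded values printed above, so the results agree with the table only up to rounding; eta squared is not reported in the original output.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 1.028e13; df_between <- 9
ss_within  <- 4.671e12; df_within  <- 2894

# F = mean square between groups / mean square within groups
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Eta squared: proportion of SalePrice variance explained by OverallQual
eta_squared <- ss_between / (ss_between + ss_within)

cat("F =", round(f_stat, 1), " eta^2 =", round(eta_squared, 3), "\n")
```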

Box Plot

ggplot(ames_data, aes(x = OverallQual_Factor, y = SalePrice, fill = OverallQual_Factor)) +
  geom_boxplot(outlier.alpha = 0.3) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sale Price Distribution by Overall Quality Rating",
       x = "Overall Quality Rating (1 = Very Poor, 10 = Very Excellent)",
       y = "Sale Price ($)",
       fill = "Quality") +
  theme_light() +
  theme(legend.position = "none")

A box plot visualises the distribution of a continuous variable across multiple categories. The plot shows a step-by-step increase in the median sale price for every increment of the $\text{OverallQual}$ rating, so the visual evidence strongly supports the conclusion that quality has a large impact on price. The widening interquartile ranges at higher quality levels are also consistent with the unequal variances found by Levene’s test.

Interestingly, at the standard outlier-removal threshold of $1.5 \times IQR$, I found that the highest mean house value belonged to the group with an overall quality rating of 9, while group 10 showed the largest spread and a lower median house value. This is likely because too many high-priced homes were removed as outliers, which lowered the average price of the top group. Because the sale price distribution is naturally right-skewed, removing too many upper-range values would not give a true picture of quality versus sale price, so I changed the threshold to $3 \times IQR$.

Group 10 still shows a large spread, which could be due to the variety of house types, ages and sizes that can be classed as a “Very Excellent” home, unlike the average property in groups 5-7, which could be of a standard “new-build” type with low variability. Sample size could also be a factor, as there are fewer homes in this bracket.

2.3. Test 3: Comparing Means Between Two Groups (SalePrice vs. CentralAir)

Hypothesis

$H_0$: The mean Sale Price for houses with Central Air (Y) is equal to the mean Sale Price for houses without Central Air (N) ($\mu_Y = \mu_N$).
$H_A$: The mean Sale Price for houses with Central Air is different from the mean for those without ($\mu_Y \neq \mu_N$).

Method: Independent Samples T-Test

# Run the T-Test (Welch's test, assuming unequal variances for robustness)
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  SalePrice by CentralAir
## t = -26.947, df = 305.65, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
##  -86986.02 -75146.59
## sample estimates:
## mean in group N mean in group Y 
##        101890.5        182956.8

Assumptions Check & Conclusion

A Welch’s T-test was employed (var.equal = FALSE) to account for the unequal variances typically found between these groups.

Conclusion

The Welch’s T-test yields a t-statistic of -26.947 with a $p\text{-value} < 0.001$. We reject the null hypothesis ($H_0$). Houses with Central Air have a statistically significantly higher mean sale price ($\approx \$183k$) compared to those without ($\approx \$102k$).
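The pieces of the t-test output fit together in a simple way: the difference in group means is the centre of the confidence interval, and dividing it by the t-statistic gives the implied standard error. The sketch below checks this using only the values printed above; the variable names are illustrative.

```r
# Values copied from the Welch t-test output above
mean_n <- 101890.5; mean_y <- 182956.8
t_stat <- -26.947;  df <- 305.65
ci     <- c(-86986.02, -75146.59)

diff_means <- mean_n - mean_y            # group N minus group Y
std_error  <- diff_means / t_stat        # implied standard error of the difference
half_width <- qt(0.975, df) * std_error  # 95% margin of error

cat("Difference in means:", diff_means, "\n")
cat("Reconstructed 95% CI:", diff_means - half_width, diff_means + half_width, "\n")
```

The negative sign of the t-statistic only reflects the order of subtraction (N minus Y); it corresponds to homes with central air selling for roughly \$81,000 more on average.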

Violin Plot

# Violin Plot
ggplot(ames_data, aes(x = CentralAir, y = SalePrice, fill = CentralAir)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", alpha = 0.8) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sale Price Distribution: Central Air vs. No Central Air",
       x = "Central Air Conditioning (N=No, Y=Yes)",
       y = "Sale Price ($)") +
  theme_classic()

The violin plot displays the probability density of SalePrice for homes with and without central air conditioning. The plot visually confirms the results of the t-test: the region of highest density for the Central Air = Yes group sits noticeably higher on the y-axis than for the “No” group. The median lines are also clearly separated, again supporting the conclusion that central air is associated with a higher sale price. The highest prices all belong to homes with central air, which suggests that more expensive houses are more likely to have it.

4. Linear Regression

A Multiple Linear Regression (MLR) model was constructed to predict $\text{SalePrice}$ using the seven specified variables (GrLivArea, OverallQual, Neighborhood, HouseStyle, CentralAir, GarageCars, LotArea).

4.1. Building the MLR Model

print(summary(regression_model))

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + Neighborhood + 
##     HouseStyle + CentralAir + GarageCars + LotArea, data = ames_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -321717  -13865     195   13539  147643 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -66527.731   7663.222  -8.681  < 2e-16 ***
## GrLivArea               53.643      1.854  28.936  < 2e-16 ***
## OverallQual          18445.120    671.588  27.465  < 2e-16 ***
## NeighborhoodBlueste -12881.751  10965.190  -1.175 0.240178    
## NeighborhoodBrDale  -16857.129   8055.058  -2.093 0.036460 *  
## NeighborhoodBrkSide   -953.753   6575.223  -0.145 0.884680    
## NeighborhoodClearCr  18327.073   7426.582   2.468 0.013654 *  
## NeighborhoodCollgCr  14294.125   5932.895   2.409 0.016046 *  
## NeighborhoodCrawfor  25219.996   6444.369   3.913 9.31e-05 ***
## NeighborhoodEdwards  -6281.070   6231.502  -1.008 0.313561    
## NeighborhoodGilbert   6787.345   6205.731   1.094 0.274169    
## NeighborhoodGreens   -2258.545  11868.286  -0.190 0.849087    
## NeighborhoodGrnHill  98299.151  21651.046   4.540 5.85e-06 ***
## NeighborhoodIDOTRR  -11637.094   6719.362  -1.732 0.083404 .  
## NeighborhoodLandmrk -12222.229  30095.452  -0.406 0.684688    
## NeighborhoodMeadowV  -3199.353   7795.732  -0.410 0.681545    
## NeighborhoodMitchel   2871.386   6394.158   0.449 0.653419    
## NeighborhoodNAmes    -3053.709   5924.507  -0.515 0.606288    
## NeighborhoodNoRidge  53651.359   6837.090   7.847 5.96e-15 ***
## NeighborhoodNPkVill -15529.593   8382.909  -1.853 0.064053 .  
## NeighborhoodNridgHt  61947.870   6134.723  10.098  < 2e-16 ***
## NeighborhoodNWAmes   -2356.457   6254.946  -0.377 0.706398    
## NeighborhoodOldTown -15003.447   6177.269  -2.429 0.015209 *  
## NeighborhoodSawyer   -1063.130   6273.747  -0.169 0.865449    
## NeighborhoodSawyerW   1463.459   6278.613   0.233 0.815711    
## NeighborhoodSomerst  24585.534   6053.990   4.061 5.02e-05 ***
## NeighborhoodStoneBr  52469.090   7190.118   7.297 3.78e-13 ***
## NeighborhoodSWISU    -8820.489   7393.282  -1.193 0.232952    
## NeighborhoodTimber   28971.123   6688.798   4.331 1.53e-05 ***
## NeighborhoodVeenker  27638.253   8299.618   3.330 0.000879 ***
## HouseStyle1.5Unf      6736.578   7108.581   0.948 0.343378    
## HouseStyle1Story     16887.598   2176.802   7.758 1.19e-14 ***
## HouseStyle2.5Fin    -29159.081  11698.642  -2.493 0.012740 *  
## HouseStyle2.5Unf    -12026.633   6403.000  -1.878 0.060444 .  
## HouseStyle2Story     -1370.612   2260.826  -0.606 0.544401    
## HouseStyleSFoyer     22807.392   3985.329   5.723 1.16e-08 ***
## HouseStyleSLvl        7528.590   3340.436   2.254 0.024286 *  
## CentralAirY          10855.210   2443.294   4.443 9.21e-06 ***
## GarageCars           11399.429   1017.303  11.206  < 2e-16 ***
## LotArea                  0.550      0.080   6.875 7.61e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29520 on 2863 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8332, Adjusted R-squared:  0.8309 
## F-statistic: 366.6 on 39 and 2863 DF,  p-value: < 2.2e-16

The MLR model is highly effective, with an Adjusted $R^2 = 0.8309$, meaning the model explains approximately 83.1% of the variance in $\text{SalePrice}$.
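The adjusted $R^2$ can be reproduced from the multiple $R^2$, the number of observations used, and the number of estimated slopes via $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$. The sketch below uses values from the summary above: $n = 2903$ is the 2904 cleaned rows minus the one dropped for missingness, and $p = 39$ is the numerator degrees of freedom of the F-statistic. It also shows how a slope coefficient reads in practical terms.

```r
# Values copied from summary(regression_model) above
r2 <- 0.8332
n  <- 2903   # 2904 cleaned rows minus 1 observation deleted due to missingness
p  <- 39     # number of slope coefficients (numerator df of the F-statistic)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Reading a slope: expected price change for 100 extra square feet of
# living area, holding the other predictors fixed
extra_100_sqft <- 53.643 * 100

cat("Adjusted R^2 =", round(adj_r2, 4), "\n")
cat("Price change per extra 100 sq ft = $", extra_100_sqft, "\n")
```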

The numeric features $\text{GrLivArea}$, $\text{OverallQual}$, $\text{GarageCars}$, and $\text{LotArea}$ all have highly significant positive coefficients ($p < 0.001$). This confirms that increases in living area, lot size, garage capacity, and overall build quality are all associated with higher sale prices, as we would expect from lived experience. The categorical variable $\text{Neighborhood}$ shows the largest range of effects, from $\approx -\$17k$ (BrDale) to $\approx +\$98k$ (GrnHill) relative to the reference neighbourhood. This indicates that location also plays a substantial role: more desirable neighbourhoods command higher sale prices and less desirable ones lower prices. Location is harder to quantify on its own, because personal preferences and requirements (e.g. schooling, public transport) weigh heavily, which may explain the low statistical significance of some neighbourhoods.

4.2. Identifying Unimportant Features

To identify unimportant features, we examine the p-value (column $\text{Pr(>|t|)}$) for each predictor variable’s coefficient. If the p-value is greater than the standard significance level ($\alpha = 0.05$), we fail to reject the null hypothesis that the coefficient is zero. For a level of a categorical variable, this means its price is not statistically distinguishable from that of the reference level once the other predictors are accounted for.

# Extract coefficient table
coef_table <- summary(regression_model)$coefficients
# Keep rows with p > 0.05 (not statistically significant at the 5% level).
# The p-value column is selected by name, and drop = FALSE keeps the result a
# matrix even when only one row matches, so nrow() below always works.
unimportant_features <- coef_table[coef_table[, "Pr(>|t|)"] > 0.05, , drop = FALSE]

# Print only the variable names that are NOT significant
if (nrow(unimportant_features) > 0) {
  cat("The following factor levels were found to be statistically unimportant (p > 0.05):\n")
  print(rownames(unimportant_features))
} else {
  cat("All feature variables were statistically significant at the 0.05 level.\n")
}

## The following factor levels were found to be statistically unimportant (p > 0.05):
##  [1] "NeighborhoodBlueste" "NeighborhoodBrkSide" "NeighborhoodEdwards"
##  [4] "NeighborhoodGilbert" "NeighborhoodGreens"  "NeighborhoodIDOTRR" 
##  [7] "NeighborhoodLandmrk" "NeighborhoodMeadowV" "NeighborhoodMitchel"
## [10] "NeighborhoodNAmes"   "NeighborhoodNPkVill" "NeighborhoodNWAmes" 
## [13] "NeighborhoodSawyer"  "NeighborhoodSawyerW" "NeighborhoodSWISU"  
## [16] "HouseStyle1.5Unf"    "HouseStyle2.5Unf"    "HouseStyle2Story"

The output above lists the specific levels of categorical variables (e.g., specific Neighborhoods or House Styles) that do not have a statistically significant difference in price compared to their baseline reference group. However, the numeric predictors ($\text{GrLivArea}$, $\text{OverallQual}$, $\text{GarageCars}$, $\text{LotArea}$) and $\text{CentralAir}$ are all significant.
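Level-wise p-values only compare each level with the baseline, so a factor can matter as a whole even when many of its individual levels are not significant. One way to test a whole term at once is an F-test that drops the term from the model, for example with base R's `drop1()`. The report does not run this step, so the sketch below uses simulated data; `toy`, `area`, `nbhd` and `price` are hypothetical names.

```r
set.seed(1)  # reproducible toy data
n <- 300

# Simulated houses: four neighbourhoods, only "D" carries a price premium
toy <- data.frame(
  area = runif(n, 800, 3000),
  nbhd = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
toy$price <- 50000 + 60 * toy$area + 40000 * (toy$nbhd == "D") +
  rnorm(n, sd = 15000)

fit <- lm(price ~ area + nbhd, data = toy)

# Level-wise t-tests: each neighbourhood is compared with baseline "A" only
print(summary(fit)$coefficients)

# Term-wise F-tests: is the model worse without the whole factor?
term_tests <- drop1(fit, test = "F")
print(term_tests)
```

In this toy example, levels B and C would typically look unimportant on their own, while the factor as a whole is clearly significant because of level D.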

4.3. Identifying the Most Important Feature

One simple way to compare the relative importance of the predictors is to look at the magnitude of the t-statistic, which measures how many standard errors the coefficient lies from zero. By this criterion, the term with the largest absolute t-statistic is the most precisely estimated and is treated here as the most influential predictor in the model.

cat("The level with the largest absolute t-statistic is:", most_important_var_level, "\n")

## The level with the largest absolute t-statistic is: GrLivArea

cat("Absolute T-Value:", round(most_important_t_value, 2), "\n")

## Absolute T-Value: 28.94

The variable that contributed the most to the predictive power of the model by this measure is $\text{GrLivArea}$, which had the largest absolute t-statistic ($|t| = 28.94$). $\text{OverallQual}$ is a very close second ($|t| = 27.47$). Together, these values suggest that living area and the subjective rating of Overall Quality are the most important and reliable drivers of a home’s sale price among the features provided.
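The code that produced `most_important_var_level` is not shown, so the following is only a sketch of how such a ranking could be obtained. It uses a subset of the t-values printed in Section 4.1; the object names `t_values`, `ranked` and `most_important` are assumptions made for illustration.

```r
# t-values copied from summary(regression_model) in Section 4.1 (subset of rows)
t_values <- c(GrLivArea = 28.936, OverallQual = 27.465, GarageCars = 11.206,
              NeighborhoodNridgHt = 10.098, LotArea = 6.875, CentralAirY = 4.443)

# Rank terms by absolute t-statistic, largest first
ranked <- sort(abs(t_values), decreasing = TRUE)
most_important <- names(ranked)[1]

cat("Largest absolute t-statistic:", most_important, "\n")
print(head(ranked, 3))
```

With the fitted model available, the same ranking could presumably be taken from `summary(regression_model)$coefficients[-1, "t value"]`, where `-1` drops the intercept row.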

Summary

This analysis of the Ames Housing dataset investigated the primary drivers of residential property prices. Following data cleaning, statistical hypothesis testing, and multiple linear regression modeling, several of these relationships were established and their impacts on the sale price analysed.

1. Drivers of Sale Price

The investigation confirmed that SalePrice is strongly influenced by both physical characteristics of the property as well as quality ratings:

Living Area: A strong, positive linear correlation (Pearson’s $r \approx 0.69$) exists between GrLivArea and SalePrice, showing that larger homes tend to sell for higher prices.
Overall Quality: ANOVA testing showed that OverallQual is an important price determinant ($p < 0.001$), and the box plot shows the median sale price rising with each step up in quality rating.
Central Air: Homes with Central Air Conditioning (which likely also feeds into the quality rating) sell for a significantly higher mean price ($\approx \$183,000$) compared to those without ($\approx \$102,000$), as confirmed by Welch’s T-test ($p < 0.001$).

2. Predictive Modeling

The Multiple Linear Regression model explained approximately 83.1% of the variance in Sale Price (Adjusted $R^2 \approx 0.8309$).

Most Important Feature: Above Ground Living Area (GrLivArea) had the highest absolute t-statistic ($|t| = 28.94$), with the Overall Quality rating a very close second ($|t| = 27.47$). This indicates that size and build quality are the two most reliable statistical predictors of value in this market, with little to separate them.
Unimportant Features: While location (Neighborhood) is generally significant, specific neighbourhoods and house styles were found to be statistically indistinguishable from the baseline, suggesting that for some areas, location is less of a premium factor than the physical house attributes.

Conclusion

The analysis concludes that the Ames housing market is driven primarily by quality construction and living space. While amenities like Central Air and Garage size are significant value-adders, they are secondary to the impact of OverallQual and GrLivArea. Therefore, the developed regression model is suitable for estimating property values within the observed price range, providing clearer insights for valuation.

Ames Housing Analysis Report

Chris Hunt (23077968)

2025-12-14
