This report prepares the Ames housing data for regression modeling
without fitting a final predictive model. The emphasis is on disciplined
data preparation: profiling, cleaning, missingness assessment,
validation checks, outlier handling for regression, transformation
planning, scaling, normality checks, and feature engineering. A
particular focus is placed on the 30 strongest numeric features
correlated with SalePrice, along with binning neighborhoods
into a regression-ready grouping.
Regression quality depends heavily on the quality and structure of the data going into the model. If data are not profiled, cleaned, validated, and transformed appropriately, coefficient estimates and predictions can become unstable, biased, or misleading. This report therefore treats data preparation as a first-class modeling step.
# Install these if needed:
# install.packages(c(
# "readxl","dplyr","ggplot2","tidyr","stringr","purrr","skimr",
# "DataExplorer","naniar","corrplot","moments","forcats","scales","knitr"
# ))
library(readxl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
library(purrr)
library(skimr)
library(DataExplorer)
library(naniar)
library(corrplot)
library(moments)
library(forcats)
library(scales)
library(knitr)
This section imports the Ames data and standardizes column names so that later code is easier to write and less error-prone.
# Update the path only if needed
ames <- read_excel("ames_housing_data.xlsx")
# Standardize names for cleaner coding
names(ames) <- make.names(names(ames))
# Basic structure checks
dim(ames)
## [1] 2930 82
names(ames)[1:15]
## [1] "SID" "PID" "SubClass" "Zoning" "LotFrontage"
## [6] "LotArea" "Street" "Alley" "LotShape" "LandContour"
## [11] "Utilities" "LotConfig" "LandSlope" "Neighborhood" "Condition1"
head(ames, 3)
## # A tibble: 3 × 82
## SID PID SubClass Zoning LotFrontage LotArea Street Alley LotShape
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 526301100 20 RL 141 31770 Pave NA IR1
## 2 2 526350040 20 RH 80 11622 Pave NA Reg
## 3 3 526351010 20 RL 81 14267 Pave NA IR1
## # ℹ 73 more variables: LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodel <dbl>, RoofStyle <chr>, RoofMat <chr>,
## # Exterior1 <chr>, Exterior2 <chr>, MasVnrType <chr>, MasVnrArea <dbl>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <chr>,
## # BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
cat(
paste0(
"- The raw dataset contains **", nrow(ames), "** rows and **", ncol(ames), "** columns.\n",
"- Standardizing the column names early helps avoid later coding problems due to spaces or punctuation.\n",
"- The data include a mix of numeric, ordinal, and nominal fields, so regression preparation must treat variables differently depending on type.\n"
)
)
Profiling gives a broad overview of structure, variable types, ranges, and missingness patterns before any filtering or feature engineering.
# Compact profiling summary
skim(ames)
| Name | ames |
| Number of rows | 2930 |
| Number of columns | 82 |
| _______________________ | |
| Column type frequency: | |
| character | 43 |
| numeric | 39 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Zoning | 0 | 1.00 | 2 | 7 | 0 | 7 | 0 |
| Street | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| Alley | 0 | 1.00 | 2 | 4 | 0 | 3 | 0 |
| LotShape | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
| LandContour | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
| Utilities | 0 | 1.00 | 6 | 6 | 0 | 3 | 0 |
| LotConfig | 0 | 1.00 | 3 | 7 | 0 | 5 | 0 |
| LandSlope | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
| Neighborhood | 0 | 1.00 | 5 | 7 | 0 | 28 | 0 |
| Condition1 | 0 | 1.00 | 4 | 6 | 0 | 9 | 0 |
| Condition2 | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
| BldgType | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
| HouseStyle | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
| RoofStyle | 0 | 1.00 | 3 | 7 | 0 | 6 | 0 |
| RoofMat | 0 | 1.00 | 4 | 7 | 0 | 8 | 0 |
| Exterior1 | 0 | 1.00 | 5 | 7 | 0 | 16 | 0 |
| Exterior2 | 0 | 1.00 | 5 | 7 | 0 | 17 | 0 |
| MasVnrType | 23 | 0.99 | 4 | 7 | 0 | 5 | 0 |
| ExterQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
| ExterCond | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| Foundation | 0 | 1.00 | 4 | 6 | 0 | 6 | 0 |
| BsmtQual | 1 | 1.00 | 2 | 2 | 0 | 6 | 0 |
| BsmtCond | 1 | 1.00 | 2 | 2 | 0 | 6 | 0 |
| BsmtExposure | 4 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| BsmtFinType1 | 1 | 1.00 | 2 | 3 | 0 | 7 | 0 |
| BsmtFinType2 | 2 | 1.00 | 2 | 3 | 0 | 7 | 0 |
| Heating | 0 | 1.00 | 4 | 5 | 0 | 6 | 0 |
| HeatingQC | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| CentralAir | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| Electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| KitchenQual | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| Functional | 0 | 1.00 | 3 | 4 | 0 | 8 | 0 |
| FireplaceQu | 0 | 1.00 | 2 | 2 | 0 | 6 | 0 |
| GarageType | 0 | 1.00 | 2 | 7 | 0 | 7 | 0 |
| GarageFinish | 2 | 1.00 | 2 | 3 | 0 | 4 | 0 |
| GarageQual | 1 | 1.00 | 2 | 2 | 0 | 6 | 0 |
| GarageCond | 1 | 1.00 | 2 | 2 | 0 | 6 | 0 |
| PavedDrive | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
| PoolQC | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| Fence | 0 | 1.00 | 2 | 5 | 0 | 5 | 0 |
| MiscFeature | 0 | 1.00 | 2 | 4 | 0 | 6 | 0 |
| SaleType | 0 | 1.00 | 2 | 5 | 0 | 10 | 0 |
| SaleCondition | 0 | 1.00 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| SID | 0 | 1.00 | 1465.50 | 845.96 | 1 | 733.25 | 1465.5 | 2197.75 | 2930 | ▇▇▇▇▇ |
| PID | 0 | 1.00 | 714464496.99 | 188730844.65 | 526301100 | 528477022.50 | 535453620.0 | 907181097.50 | 1007100110 | ▇▁▁▆▂ |
| SubClass | 0 | 1.00 | 57.39 | 42.64 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
| LotFrontage | 490 | 0.83 | 69.22 | 23.37 | 21 | 58.00 | 68.0 | 80.00 | 313 | ▇▃▁▁▁ |
| LotArea | 0 | 1.00 | 10147.92 | 7880.02 | 1300 | 7440.25 | 9436.5 | 11555.25 | 215245 | ▇▁▁▁▁ |
| OverallQual | 0 | 1.00 | 6.09 | 1.41 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
| OverallCond | 0 | 1.00 | 5.56 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
| YearBuilt | 0 | 1.00 | 1971.36 | 30.25 | 1872 | 1954.00 | 1973.0 | 2001.00 | 2010 | ▁▂▃▆▇ |
| YearRemodel | 0 | 1.00 | 1984.27 | 20.86 | 1950 | 1965.00 | 1993.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
| MasVnrArea | 23 | 0.99 | 101.90 | 179.11 | 0 | 0.00 | 0.0 | 164.00 | 1600 | ▇▁▁▁▁ |
| BsmtFinSF1 | 1 | 1.00 | 442.63 | 455.59 | 0 | 0.00 | 370.0 | 734.00 | 5644 | ▇▁▁▁▁ |
| BsmtFinSF2 | 1 | 1.00 | 49.72 | 169.17 | 0 | 0.00 | 0.0 | 0.00 | 1526 | ▇▁▁▁▁ |
| BsmtUnfSF | 1 | 1.00 | 559.26 | 439.49 | 0 | 219.00 | 466.0 | 802.00 | 2336 | ▇▅▂▁▁ |
| TotalBsmtSF | 1 | 1.00 | 1051.61 | 440.62 | 0 | 793.00 | 990.0 | 1302.00 | 6110 | ▇▃▁▁▁ |
| FirstFlrSF | 0 | 1.00 | 1159.56 | 391.89 | 334 | 876.25 | 1084.0 | 1384.00 | 5095 | ▇▃▁▁▁ |
| SecondFlrSF | 0 | 1.00 | 335.46 | 428.40 | 0 | 0.00 | 0.0 | 703.75 | 2065 | ▇▃▂▁▁ |
| LowQualFinSF | 0 | 1.00 | 4.68 | 46.31 | 0 | 0.00 | 0.0 | 0.00 | 1064 | ▇▁▁▁▁ |
| GrLivArea | 0 | 1.00 | 1499.69 | 505.51 | 334 | 1126.00 | 1442.0 | 1742.75 | 5642 | ▇▇▁▁▁ |
| BsmtFullBath | 2 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
| BsmtHalfBath | 2 | 1.00 | 0.06 | 0.25 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
| FullBath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▇▇▁▁ |
| HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
| BedroomAbvGr | 0 | 1.00 | 2.85 | 0.83 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
| KitchenAbvGr | 0 | 1.00 | 1.04 | 0.21 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
| TotRmsAbvGrd | 0 | 1.00 | 6.44 | 1.57 | 2 | 5.00 | 6.0 | 7.00 | 15 | ▁▇▂▁▁ |
| Fireplaces | 0 | 1.00 | 0.60 | 0.65 | 0 | 0.00 | 1.0 | 1.00 | 4 | ▇▇▁▁▁ |
| GarageYrBlt | 159 | 0.95 | 1978.13 | 25.53 | 1895 | 1960.00 | 1979.0 | 2002.00 | 2207 | ▂▇▁▁▁ |
| GarageCars | 1 | 1.00 | 1.77 | 0.76 | 0 | 1.00 | 2.0 | 2.00 | 5 | ▅▇▂▁▁ |
| GarageArea | 1 | 1.00 | 472.82 | 215.05 | 0 | 320.00 | 480.0 | 576.00 | 1488 | ▃▇▃▁▁ |
| WoodDeckSF | 0 | 1.00 | 93.75 | 126.36 | 0 | 0.00 | 0.0 | 168.00 | 1424 | ▇▁▁▁▁ |
| OpenPorchSF | 0 | 1.00 | 47.53 | 67.48 | 0 | 0.00 | 27.0 | 70.00 | 742 | ▇▁▁▁▁ |
| EnclosedPorch | 0 | 1.00 | 23.01 | 64.14 | 0 | 0.00 | 0.0 | 0.00 | 1012 | ▇▁▁▁▁ |
| ThreeSsnPorch | 0 | 1.00 | 2.59 | 25.14 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
| ScreenPorch | 0 | 1.00 | 16.00 | 56.09 | 0 | 0.00 | 0.0 | 0.00 | 576 | ▇▁▁▁▁ |
| PoolArea | 0 | 1.00 | 2.24 | 35.60 | 0 | 0.00 | 0.0 | 0.00 | 800 | ▇▁▁▁▁ |
| MiscVal | 0 | 1.00 | 50.64 | 566.34 | 0 | 0.00 | 0.0 | 0.00 | 17000 | ▇▁▁▁▁ |
| MoSold | 0 | 1.00 | 6.22 | 2.71 | 1 | 4.00 | 6.0 | 8.00 | 12 | ▅▆▇▃▃ |
| YrSold | 0 | 1.00 | 2007.79 | 1.32 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▃ |
| SalePrice | 0 | 1.00 | 180796.06 | 79886.69 | 12789 | 129500.00 | 160000.0 | 213500.00 | 755000 | ▇▇▁▁▁ |
# Global profiling visuals
plot_intro(ames)
plot_missing(ames)
num_cols <- sum(sapply(ames, is.numeric))
cat_cols <- ncol(ames) - num_cols
cat(
paste0(
"- The dataset contains **", num_cols, "** numeric columns and **", cat_cols, "** non-numeric columns.\n",
"- Profiling is important because regression preparation depends on identifying which variables are continuous, discrete, ordinal, or nominal.\n",
"- Missingness is not random in this dataset for some housing attributes, so it should be studied before imputation or row removal.\n"
)
)
Before filtering, create a few derived variables that are useful for data understanding and later regression preparation.
ames <- ames %>%
mutate(
TotalFloorSF = FirstFlrSF + SecondFlrSF,
HouseAge = YrSold - YearBuilt,
RemodAge = YrSold - YearRemodel,
TotalBath = FullBath + HalfBath * 0.5 + BsmtFullBath + BsmtHalfBath * 0.5,
QualityIndex = OverallQual * OverallCond,
GarageAge = ifelse(is.na(GarageYrBlt), NA, YrSold - GarageYrBlt),
TotalPorchSF = OpenPorchSF + EnclosedPorch + ScreenPorch,
TotalOutdoorSF = WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch,
logSalePrice = log(SalePrice)
)
summary(select(ames, SalePrice, logSalePrice, TotalFloorSF, HouseAge, RemodAge, TotalBath, QualityIndex))
## SalePrice logSalePrice TotalFloorSF HouseAge
## Min. : 12789 Min. : 9.456 Min. : 334 Min. : -1.00
## 1st Qu.:129500 1st Qu.:11.771 1st Qu.:1120 1st Qu.: 7.00
## Median :160000 Median :11.983 Median :1440 Median : 34.00
## Mean :180796 Mean :12.021 Mean :1495 Mean : 36.43
## 3rd Qu.:213500 3rd Qu.:12.271 3rd Qu.:1740 3rd Qu.: 54.00
## Max. :755000 Max. :13.534 Max. :5642 Max. :136.00
##
## RemodAge TotalBath QualityIndex
## Min. :-2.00 Min. :1.000 Min. : 1.00
## 1st Qu.: 4.00 1st Qu.:1.500 1st Qu.:30.00
## Median :15.00 Median :2.000 Median :35.00
## Mean :23.52 Mean :2.218 Mean :33.76
## 3rd Qu.:42.75 3rd Qu.:2.500 3rd Qu.:40.00
## Max. :60.00 Max. :7.000 Max. :90.00
## NA's :2
cat(
"- Derived variables often capture housing value more directly than raw variables alone.\n",
"- `TotalFloorSF`, `HouseAge`, `TotalBath`, and `QualityIndex` are especially useful because they summarize physical size, age, utility, and construction quality.\n",
"- `logSalePrice` is created now because home prices are commonly right-skewed, and log-transformation is often useful in regression preparation.\n"
)
TotalFloorSF, HouseAge,
TotalBath, and QualityIndex are especially
useful because they summarize physical size, age, utility, and
construction quality.logSalePrice is created now because home prices are
commonly right-skewed, and log-transformation is often useful in
regression preparation.The goal here is not to keep every sale, but to define a cleaner sample for modeling typical residential transactions.
n0 <- nrow(ames)
sample_df <- ames
# Keep single-family homes only
drop1 <- sample_df %>% filter(BldgType != "1Fam")
sample_df <- sample_df %>% filter(BldgType == "1Fam")
after1 <- nrow(sample_df)
# Remove very large homes as recommended by the Ames documentation
drop2 <- sample_df %>% filter(GrLivArea > 4000)
sample_df <- sample_df %>% filter(GrLivArea <= 4000)
after2 <- nrow(sample_df)
# Remove atypical sale conditions
drop3 <- sample_df %>% filter(SaleCondition %in% c("Partial", "Family", "Abnorml", "Alloca", "AdjLand"))
sample_df <- sample_df %>% filter(!SaleCondition %in% c("Partial", "Family", "Abnorml", "Alloca", "AdjLand"))
after3 <- nrow(sample_df)
# Remove missing or nonpositive sale price
drop4 <- sample_df %>% filter(is.na(SalePrice) | SalePrice <= 0)
sample_df <- sample_df %>% filter(!is.na(SalePrice), SalePrice > 0)
after4 <- nrow(sample_df)
waterfall <- tibble(
Step = c(
"Original data",
"Keep only BldgType = 1Fam",
"Remove GrLivArea > 4000",
"Remove atypical sale conditions",
"Remove missing/nonpositive SalePrice",
"Final sample"
),
Count = c(n0, after1, after2, after3, after4, nrow(sample_df))
)
kable(waterfall)
| Step | Count |
|---|---|
| Original data | 2930 |
| Keep only BldgType = 1Fam | 2425 |
| Remove GrLivArea > 4000 | 2420 |
| Remove atypical sale conditions | 2001 |
| Remove missing/nonpositive SalePrice | 2001 |
| Final sample | 2001 |
ggplot(waterfall, aes(x = reorder(Step, -seq_along(Step)), y = Count)) +
geom_col(fill = "steelblue") +
geom_text(aes(label = comma(Count)), vjust = -0.3, size = 3.5) +
labs(
title = "Sample Definition Waterfall",
x = NULL,
y = "Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
cat(
paste0(
"- The filtered regression sample contains **", nrow(sample_df), "** observations.\n",
"- These drop conditions improve comparability by focusing on typical single-family residential sales.\n",
"- Outlier removal at this stage is not arbitrary; it is tied to the practical modeling objective of valuing typical homes rather than every possible transaction type.\n"
)
)
Missingness should be examined before imputation decisions are made. Some missing values in Ames mean “not present” rather than “unknown.”
missing_tbl <- tibble(
Variable = names(sample_df),
MissingN = sapply(sample_df, function(x) sum(is.na(x))),
MissingPct = sapply(sample_df, function(x) mean(is.na(x)) * 100)
) %>%
arrange(desc(MissingPct))
kable(head(missing_tbl, 20), digits = 2)
| Variable | MissingN | MissingPct |
|---|---|---|
| LotFrontage | 386 | 19.29 |
| GarageYrBlt | 74 | 3.70 |
| GarageAge | 74 | 3.70 |
| MasVnrType | 11 | 0.55 |
| MasVnrArea | 11 | 0.55 |
| BsmtExposure | 2 | 0.10 |
| BsmtFinType2 | 1 | 0.05 |
| Electrical | 1 | 0.05 |
| BsmtFullBath | 1 | 0.05 |
| BsmtHalfBath | 1 | 0.05 |
| TotalBath | 1 | 0.05 |
| SID | 0 | 0.00 |
| PID | 0 | 0.00 |
| SubClass | 0 | 0.00 |
| Zoning | 0 | 0.00 |
| LotArea | 0 | 0.00 |
| Street | 0 | 0.00 |
| Alley | 0 | 0.00 |
| LotShape | 0 | 0.00 |
| LandContour | 0 | 0.00 |
vis_miss(sample_df, sort_miss = TRUE)
# Distinguish likely "structural" NA fields from true missingness
structural_na_vars <- c(
"Alley", "MasVnrType", "BsmtQual", "BsmtCond", "BsmtExposure",
"BsmtFinType1", "BsmtFinType2", "FireplaceQu", "GarageType",
"GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence",
"MiscFeature"
)
# Create a working copy for cleaning
clean_df <- sample_df
# Replace structural missing categories with "None"
for (v in structural_na_vars[structural_na_vars %in% names(clean_df)]) {
clean_df[[v]] <- as.character(clean_df[[v]])
clean_df[[v]][is.na(clean_df[[v]])] <- "None"
clean_df[[v]] <- as.factor(clean_df[[v]])
}
# Replace selected numeric structural missing values with zero
zero_if_absent <- c(
"MasVnrArea", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF",
"BsmtFullBath", "BsmtHalfBath", "GarageCars", "GarageArea"
)
for (v in zero_if_absent[zero_if_absent %in% names(clean_df)]) {
clean_df[[v]][is.na(clean_df[[v]])] <- 0
}
# Median imputation for remaining numeric missing values
num_missing_before <- sum(is.na(clean_df[sapply(clean_df, is.numeric)]))
clean_df <- clean_df %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
num_missing_after <- sum(is.na(clean_df[sapply(clean_df, is.numeric)]))
c(num_missing_before = num_missing_before, num_missing_after = num_missing_after)
## num_missing_before num_missing_after
## 535 0
top_missing <- missing_tbl %>% filter(MissingPct > 0) %>% slice(1:5)
cat("- Not all missingness should be treated the same way.\n")
cat("- In this dataset, several housing attributes are missing because the feature does not exist, which should typically be encoded as `None` or `0` rather than dropped.\n")
None or 0 rather than dropped.cat("- Remaining numeric missingness is reduced with median imputation in the cleaned working dataset.\n")
if (nrow(top_missing) > 0) {
cat("- The most affected variables by missingness in the modeling sample are:\n")
for (i in seq_len(nrow(top_missing))) {
cat(paste0(" - `", top_missing$Variable[i], "`: ", round(top_missing$MissingPct[i], 2), "% missing\n"))
}
}
LotFrontage: 19.29% missingGarageYrBlt: 3.7% missingGarageAge: 3.7% missingMasVnrType: 0.55% missingMasVnrArea: 0.55% missingValidation ensures the cleaned data are still consistent with known housing logic and measurement rules.
rule_checks <- tibble(
Rule = c(
"SalePrice > 0",
"GrLivArea > 0",
"OverallQual between 1 and 10",
"OverallCond between 1 and 10",
"GarageCars >= 0",
"YearBuilt <= YrSold",
"HouseAge >= 0",
"TotalBath >= 0",
"TotalFloorSF >= GrLivArea not required but should be nonnegative",
"LotArea > 0"
),
Violations = c(
sum(clean_df$SalePrice <= 0 | is.na(clean_df$SalePrice)),
sum(clean_df$GrLivArea <= 0 | is.na(clean_df$GrLivArea)),
sum(!(clean_df$OverallQual %in% 1:10) | is.na(clean_df$OverallQual)),
sum(!(clean_df$OverallCond %in% 1:10) | is.na(clean_df$OverallCond)),
sum(clean_df$GarageCars < 0, na.rm = TRUE),
sum(clean_df$YearBuilt > clean_df$YrSold, na.rm = TRUE),
sum(clean_df$HouseAge < 0, na.rm = TRUE),
sum(clean_df$TotalBath < 0, na.rm = TRUE),
sum(clean_df$TotalFloorSF < 0, na.rm = TRUE),
sum(clean_df$LotArea <= 0 | is.na(clean_df$LotArea))
)
)
kable(rule_checks)
| Rule | Violations |
|---|---|
| SalePrice > 0 | 0 |
| GrLivArea > 0 | 0 |
| OverallQual between 1 and 10 | 0 |
| OverallCond between 1 and 10 | 0 |
| GarageCars >= 0 | 0 |
| YearBuilt <= YrSold | 0 |
| HouseAge >= 0 | 0 |
| TotalBath >= 0 | 0 |
| TotalFloorSF >= GrLivArea not required but should be nonnegative | 0 |
| LotArea > 0 | 0 |
viol_n <- sum(rule_checks$Violations)
cat(
paste0(
"- Validation checks are used to confirm that the data still make sense after cleaning.\n",
"- Total validation violations found across the listed rules: **", viol_n, "**.\n",
"- These checks are especially important before regression because impossible values can distort coefficients and residual patterns.\n"
)
)
This section looks at the correlation structure among the top predictors to identify possible redundancy.
top_corr_vars <- top30_saleprice$Feature[1:min(15, nrow(top30_saleprice))]
corr_subset <- clean_df %>% select(all_of(c("SalePrice", top_corr_vars))) %>% select(where(is.numeric))
corr_matrix <- cor(corr_subset, use = "pairwise.complete.obs")
round(corr_matrix, 3)
## SalePrice logSalePrice OverallQual TotalFloorSF GrLivArea
## SalePrice 1.000 0.958 0.806 0.778 0.771
## logSalePrice 0.958 1.000 0.828 0.780 0.773
## OverallQual 0.806 0.828 1.000 0.631 0.622
## TotalFloorSF 0.778 0.780 0.631 1.000 0.995
## GrLivArea 0.771 0.773 0.622 0.995 1.000
## TotalBath 0.700 0.731 0.583 0.634 0.627
## GarageCars 0.666 0.690 0.604 0.526 0.519
## TotalBsmtSF 0.651 0.638 0.526 0.397 0.393
## GarageArea 0.642 0.656 0.550 0.490 0.485
## FirstFlrSF 0.640 0.619 0.455 0.529 0.526
## FullBath 0.611 0.646 0.572 0.669 0.668
## TotRmsAbvGrd 0.601 0.618 0.511 0.821 0.829
## YearBuilt 0.571 0.626 0.558 0.301 0.286
## HouseAge -0.570 -0.625 -0.557 -0.301 -0.286
## MasVnrArea 0.544 0.491 0.429 0.429 0.422
## QualityIndex 0.525 0.566 0.694 0.407 0.403
## TotalBath GarageCars TotalBsmtSF GarageArea FirstFlrSF FullBath
## SalePrice 0.700 0.666 0.651 0.642 0.640 0.611
## logSalePrice 0.731 0.690 0.638 0.656 0.619 0.646
## OverallQual 0.583 0.604 0.526 0.550 0.455 0.572
## TotalFloorSF 0.634 0.526 0.397 0.490 0.529 0.669
## GrLivArea 0.627 0.519 0.393 0.485 0.526 0.668
## TotalBath 1.000 0.532 0.436 0.486 0.395 0.734
## GarageCars 0.532 1.000 0.440 0.880 0.439 0.515
## TotalBsmtSF 0.436 0.440 1.000 0.455 0.764 0.340
## GarageArea 0.486 0.880 0.455 1.000 0.463 0.450
## FirstFlrSF 0.395 0.439 0.764 0.463 1.000 0.373
## FullBath 0.734 0.515 0.340 0.450 0.373 1.000
## TotRmsAbvGrd 0.516 0.425 0.261 0.366 0.358 0.585
## YearBuilt 0.584 0.536 0.425 0.487 0.316 0.499
## HouseAge -0.583 -0.536 -0.424 -0.486 -0.315 -0.498
## MasVnrArea 0.353 0.386 0.411 0.386 0.426 0.292
## QualityIndex 0.293 0.307 0.251 0.283 0.236 0.286
## TotRmsAbvGrd YearBuilt HouseAge MasVnrArea QualityIndex
## SalePrice 0.601 0.571 -0.570 0.544 0.525
## logSalePrice 0.618 0.626 -0.625 0.491 0.566
## OverallQual 0.511 0.558 -0.557 0.429 0.694
## TotalFloorSF 0.821 0.301 -0.301 0.429 0.407
## GrLivArea 0.829 0.286 -0.286 0.422 0.403
## TotalBath 0.516 0.584 -0.583 0.353 0.293
## GarageCars 0.425 0.536 -0.536 0.386 0.307
## TotalBsmtSF 0.261 0.425 -0.424 0.411 0.251
## GarageArea 0.366 0.487 -0.486 0.386 0.283
## FirstFlrSF 0.358 0.316 -0.315 0.426 0.236
## FullBath 0.585 0.499 -0.498 0.292 0.286
## TotRmsAbvGrd 1.000 0.233 -0.233 0.322 0.322
## YearBuilt 0.233 1.000 -0.999 0.314 0.105
## HouseAge -0.233 -0.999 1.000 -0.315 -0.104
## MasVnrArea 0.322 0.314 -0.315 1.000 0.217
## QualityIndex 0.322 0.105 -0.104 0.217 1.000
corrplot(
corr_matrix,
method = "color",
type = "upper",
tl.col = "black",
tl.cex = 0.55,
number.cex = 0.45
)
cat("- Correlation clustering among size-related predictors is expected in housing data.\n")
cat("- Strong blocks of correlation suggest that some variables may carry overlapping information.\n")
cat("- This is helpful to know now, because regression modeling later may require choosing among similar predictors or combining them through feature engineering.\n")
This section gives compact summaries for the strongest candidate variables for regression.
quality_numeric <- function(x) {
tibble(
n = length(x),
missing_n = sum(is.na(x)),
missing_pct = mean(is.na(x)) * 100,
unique_n = n_distinct(x, na.rm = TRUE),
min = suppressWarnings(min(x, na.rm = TRUE)),
q1 = suppressWarnings(quantile(x, 0.25, na.rm = TRUE)),
median = suppressWarnings(median(x, na.rm = TRUE)),
mean = suppressWarnings(mean(x, na.rm = TRUE)),
q3 = suppressWarnings(quantile(x, 0.75, na.rm = TRUE)),
max = suppressWarnings(max(x, na.rm = TRUE)),
sd = suppressWarnings(sd(x, na.rm = TRUE)),
skewness = suppressWarnings(moments::skewness(x, na.rm = TRUE))
)
}
quality_results <- map_dfr(top30_saleprice$Feature[1:15], function(v) {
quality_numeric(clean_df[[v]]) %>% mutate(variable = v, .before = 1)
})
kable(quality_results, digits = 2)
| variable | n | missing_n | missing_pct | unique_n | min | q1 | median | mean | q3 | max | sd | skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| logSalePrice | 2001 | 0 | 0 | 721 | 10.46 | 11.78 | 11.99 | 12.02 | 12.27 | 13.35 | 0.37 | 0.11 |
| OverallQual | 2001 | 0 | 0 | 10 | 1.00 | 5.00 | 6.00 | 5.99 | 7.00 | 10.00 | 1.33 | 0.17 |
| TotalFloorSF | 2001 | 0 | 0 | 1083 | 334.00 | 1107.00 | 1442.00 | 1488.07 | 1755.00 | 3820.00 | 489.19 | 0.81 |
| GrLivArea | 2001 | 0 | 0 | 1084 | 334.00 | 1111.00 | 1445.00 | 1492.91 | 1760.00 | 3820.00 | 491.96 | 0.84 |
| TotalBath | 2001 | 0 | 0 | 8 | 1.00 | 1.50 | 2.00 | 2.15 | 2.50 | 4.50 | 0.79 | 0.22 |
| GarageCars | 2001 | 0 | 0 | 6 | 0.00 | 1.00 | 2.00 | 1.74 | 2.00 | 5.00 | 0.72 | -0.07 |
| TotalBsmtSF | 2001 | 0 | 0 | 859 | 0.00 | 801.00 | 973.00 | 1030.62 | 1228.00 | 3206.00 | 400.27 | 0.60 |
| GarageArea | 2001 | 0 | 0 | 537 | 0.00 | 312.00 | 472.00 | 467.84 | 576.00 | 1488.00 | 202.10 | 0.28 |
| FirstFlrSF | 2001 | 0 | 0 | 898 | 334.00 | 882.00 | 1062.00 | 1144.38 | 1344.00 | 3820.00 | 360.73 | 1.24 |
| FullBath | 2001 | 0 | 0 | 4 | 0.00 | 1.00 | 1.00 | 1.51 | 2.00 | 3.00 | 0.54 | 0.25 |
| TotRmsAbvGrd | 2001 | 0 | 0 | 11 | 2.00 | 5.00 | 6.00 | 6.43 | 7.00 | 12.00 | 1.42 | 0.60 |
| YearBuilt | 2001 | 0 | 0 | 113 | 1872.00 | 1950.00 | 1968.00 | 1967.50 | 1996.00 | 2010.00 | 29.79 | -0.47 |
| HouseAge | 2001 | 0 | 0 | 125 | 0.00 | 11.00 | 40.00 | 40.37 | 58.00 | 136.00 | 29.76 | 0.47 |
| MasVnrArea | 2001 | 0 | 0 | 359 | 0.00 | 0.00 | 0.00 | 93.02 | 143.00 | 1600.00 | 170.67 | 2.81 |
| QualityIndex | 2001 | 0 | 0 | 34 | 1.00 | 30.00 | 35.00 | 34.22 | 40.00 | 90.00 | 9.29 | 0.43 |
cat("- Numeric quality summaries help identify skewed, sparse, or low-variance predictors before modeling.\n")
cat("- Skewness is especially useful here because strongly non-normal predictors may need transformation or winsorization before regression.\n")
cat("- Variables with extreme maxima relative to quartiles should be reviewed for outliers or long-tailed distributions.\n")
Outliers can distort least squares regression. At the preparation stage, the goal is not to erase all extremes, but to identify them and decide whether to trim, cap, or transform.
count_outliers_iqr <- function(x) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lo <- q1 - 1.5 * iqr
hi <- q3 + 1.5 * iqr
sum(x < lo | x > hi, na.rm = TRUE)
}
outlier_tbl <- tibble(
Feature = top30_saleprice$Feature[1:15],
OutlierCount = sapply(clean_df[top30_saleprice$Feature[1:15]], count_outliers_iqr)
) %>%
arrange(desc(OutlierCount))
kable(outlier_tbl)
| Feature | OutlierCount |
|---|---|
| MasVnrArea | 148 |
| TotalBsmtSF | 101 |
| QualityIndex | 86 |
| FirstFlrSF | 41 |
| logSalePrice | 32 |
| GrLivArea | 32 |
| TotalFloorSF | 31 |
| TotRmsAbvGrd | 20 |
| GarageArea | 18 |
| TotalBath | 9 |
| GarageCars | 8 |
| YearBuilt | 7 |
| HouseAge | 4 |
| OverallQual | 3 |
| FullBath | 0 |
top_continuous <- top30_saleprice$Feature[1:9]
plot_df <- clean_df %>%
select(all_of(top_continuous)) %>%
pivot_longer(everything(), names_to = "Feature", values_to = "Value")
ggplot(plot_df, aes(x = Feature, y = Value)) +
geom_boxplot(fill = "lightblue") +
coord_flip() +
labs(
title = "Boxplots of Top Continuous Predictors",
x = NULL,
y = "Value"
) +
theme_minimal()
winsorize_iqr <- function(x) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lo <- q1 - 1.5 * iqr
hi <- q3 + 1.5 * iqr
pmin(pmax(x, lo), hi)
}
# Example prepared versions for likely long-tailed numeric fields
clean_df <- clean_df %>%
mutate(
LotArea_w = winsorize_iqr(LotArea),
TotalBsmtSF_w = winsorize_iqr(TotalBsmtSF),
GarageArea_w = winsorize_iqr(GarageArea),
TotalFloorSF_w = winsorize_iqr(TotalFloorSF)
)
cat("- Outlier handling in regression should be deliberate, not automatic.\n")
cat("- Some extreme observations are informative and should be transformed rather than removed.\n")
cat("- Winsorized versions of long-tailed predictors are created here as optional regression-ready alternatives.\n")
Regression does not require every predictor to be normal, but strong skewness and nonlinear scale effects often justify transformations.
dist_vars <- c("SalePrice", "logSalePrice", "LotArea", "GrLivArea", "TotalBsmtSF", "GarageArea", "HouseAge", "TotalFloorSF")
dist_df <- clean_df %>%
select(all_of(dist_vars)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Value")
ggplot(dist_df, aes(x = Value)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
facet_wrap(~ Variable, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Distribution Review for Key Variables")
ggplot(clean_df, aes(sample = SalePrice)) +
stat_qq() +
stat_qq_line(color = "blue") +
labs(title = "Q-Q Plot for SalePrice") +
theme_minimal()
ggplot(clean_df, aes(sample = logSalePrice)) +
stat_qq() +
stat_qq_line(color = "blue") +
labs(title = "Q-Q Plot for logSalePrice") +
theme_minimal()
clean_df <- clean_df %>%
mutate(
logLotArea = log1p(LotArea),
logGrLivArea = log1p(GrLivArea),
logTotalBsmtSF = log1p(TotalBsmtSF),
logGarageArea = log1p(GarageArea),
sqrtHouseAge = sqrt(pmax(HouseAge, 0)),
logTotalFloorSF = log1p(TotalFloorSF)
)
summary(select(clean_df, logLotArea, logGrLivArea, logTotalBsmtSF, logGarageArea, sqrtHouseAge, logTotalFloorSF))
## logLotArea logGrLivArea logTotalBsmtSF logGarageArea
## Min. : 7.824 Min. :5.814 Min. :0.000 Min. :0.000
## 1st Qu.: 9.003 1st Qu.:7.014 1st Qu.:6.687 1st Qu.:5.746
## Median : 9.185 Median :7.277 Median :6.881 Median :6.159
## Mean : 9.204 Mean :7.256 Mean :6.742 Mean :5.889
## 3rd Qu.: 9.375 3rd Qu.:7.474 3rd Qu.:7.114 3rd Qu.:6.358
## Max. :12.280 Max. :8.248 Max. :8.073 Max. :7.306
## sqrtHouseAge logTotalFloorSF
## Min. : 0.000 Min. :5.814
## 1st Qu.: 3.317 1st Qu.:7.010
## Median : 6.325 Median :7.274
## Mean : 5.785 Mean :7.253
## 3rd Qu.: 7.616 3rd Qu.:7.471
## Max. :11.662 Max. :8.248
cat("- `SalePrice` is usually less normal than `logSalePrice`, so both are reviewed during preparation.\n")
SalePrice is usually less normal than
logSalePrice, so both are reviewed during preparation.cat("- Log transforms are especially useful for right-skewed size-related variables.\n")
cat("- The square-root transform for age is a softer transformation that can help when the effect of age is nonlinear but not extremely skewed.\n")
Scaling does not change correlation structure, but it becomes important when regression methods are sensitive to measurement units or penalties.
# Z-score scaling for selected engineered numeric predictors
scale_vars <- c(
"GrLivArea", "LotArea", "TotalBsmtSF", "GarageArea",
"TotalFloorSF", "HouseAge", "TotalBath", "QualityIndex",
"logLotArea", "logGrLivArea", "logTotalBsmtSF", "logGarageArea", "logTotalFloorSF"
)
existing_scale_vars <- scale_vars[scale_vars %in% names(clean_df)]
scaled_matrix <- scale(clean_df[, existing_scale_vars])
scaled_df <- as.data.frame(scaled_matrix)
names(scaled_df) <- paste0(existing_scale_vars, "_z")
clean_df <- bind_cols(clean_df, scaled_df)
summary(select(clean_df, ends_with("_z")))[, 1:6]
## GrLivArea_z LotArea_z TotalBsmtSF_z GarageArea_z
## Min. :-2.35570 Min. :-1.0625 Min. :-2.5748 Min. :-2.31486
## 1st Qu.:-0.77630 1st Qu.:-0.3444 1st Qu.:-0.5737 1st Qu.:-0.77109
## Median :-0.09739 Median :-0.1373 Median :-0.1440 Median : 0.02058
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.54291 3rd Qu.: 0.1232 3rd Qu.: 0.4931 3rd Qu.: 0.53517
## Max. : 4.73024 Max. :26.0844 Max. : 5.4348 Max. : 5.04773
## TotalFloorSF_z HouseAge_z
## Min. :-2.35916 Min. :-1.35655
## 1st Qu.:-0.77899 1st Qu.:-0.98696
## Median :-0.09418 Median :-0.01259
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.54565 3rd Qu.: 0.59219
## Max. : 4.76693 Max. : 3.21290
min_max_norm <- function(x) {
rng <- range(x, na.rm = TRUE)
if (rng[1] == rng[2]) return(rep(0, length(x)))
(x - rng[1]) / (rng[2] - rng[1])
}
norm_vars <- c("SalePrice", "GrLivArea", "LotArea", "TotalFloorSF", "GarageArea")
existing_norm_vars <- norm_vars[norm_vars %in% names(clean_df)]
for (v in existing_norm_vars) {
clean_df[[paste0(v, "_mm")]] <- min_max_norm(clean_df[[v]])
}
summary(select(clean_df, ends_with("_mm")))
## SalePrice_mm GrLivArea_mm LotArea_mm TotalFloorSF_mm
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1610 1st Qu.:0.2229 1st Qu.:0.02645 1st Qu.:0.2217
## Median :0.2148 Median :0.3187 Median :0.03408 Median :0.3178
## Mean :0.2439 Mean :0.3324 Mean :0.03914 Mean :0.3311
## 3rd Qu.:0.3005 3rd Qu.:0.4091 3rd Qu.:0.04368 3rd Qu.:0.4076
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## GarageArea_mm
## Min. :0.0000
## 1st Qu.:0.2097
## Median :0.3172
## Mean :0.3144
## 3rd Qu.:0.3871
## Max. :1.0000
cat("- Standardization produces mean-centered predictors on a common scale and is especially useful for penalized regression methods.\n")
cat("- Min-max normalization is optional and mainly useful for algorithms that expect bounded inputs.\n")
cat("- For ordinary least squares, scaling is not mandatory for fit quality, but it still helps with interpretability and later method comparisons.\n")
Neighborhood is a strong housing-price signal, but using many sparse levels can make regression unstable. A practical strategy is to bin neighborhoods using price behavior.
nbhd_summary <- clean_df %>%
group_by(Neighborhood) %>%
summarise(
n = n(),
MeanSalePrice = mean(SalePrice, na.rm = TRUE),
MedianSalePrice = median(SalePrice, na.rm = TRUE),
MeanPricePerSF = mean(SalePrice / pmax(GrLivArea, 1), na.rm = TRUE),
.groups = "drop"
) %>%
arrange(MeanPricePerSF)
kable(nbhd_summary, digits = 2)
| Neighborhood | n | MeanSalePrice | MedianSalePrice | MeanPricePerSF |
|---|---|---|---|---|
| SWISU | 34 | 132983.8 | 135750 | 93.10 |
| IDOTRR | 62 | 113263.4 | 118700 | 94.38 |
| OldTown | 177 | 128156.1 | 122000 | 95.32 |
| BrkSide | 96 | 126740.4 | 127750 | 103.94 |
| Edwards | 129 | 132956.2 | 125000 | 104.54 |
| NWAmes | 113 | 194384.1 | 185000 | 116.73 |
| Gilbert | 128 | 189209.6 | 184050 | 117.77 |
| Crawfor | 78 | 199021.4 | 196500 | 118.20 |
| NAmes | 360 | 146903.7 | 142000 | 119.78 |
| SawyerW | 89 | 190508.2 | 184900 | 119.88 |
| Sawyer | 121 | 137326.1 | 135000 | 125.23 |
| Blmngtn | 1 | 159895.0 | 159895 | 126.30 |
| ClearCr | 37 | 218400.9 | 225000 | 128.37 |
| NoRidge | 66 | 319616.0 | 301750 | 131.90 |
| CollgCr | 213 | 199779.2 | 200500 | 134.44 |
| Mitchel | 84 | 165514.9 | 156225 | 135.02 |
| Veenker | 17 | 252491.2 | 255000 | 140.69 |
| Timber | 49 | 241995.2 | 214900 | 143.03 |
| Somerst | 67 | 248517.9 | 245000 | 144.30 |
| StoneBr | 13 | 348963.1 | 349265 | 165.07 |
| NridgHt | 67 | 345267.9 | 326000 | 166.91 |
ggplot(nbhd_summary, aes(x = reorder(Neighborhood, MeanPricePerSF), y = MeanPricePerSF)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Neighborhood Mean Price per Square Foot",
x = "Neighborhood",
y = "Mean Price per SF"
) +
theme_minimal()
# Create quartile-style bins using mean price per square foot
nbhd_summary <- nbhd_summary %>%
mutate(
NbhdBin = ntile(MeanPricePerSF, 4),
NbhdBin = paste0("grp", NbhdBin)
)
clean_df <- clean_df %>%
left_join(nbhd_summary %>% select(Neighborhood, NbhdBin), by = "Neighborhood")
clean_df$NbhdBin <- factor(clean_df$NbhdBin, levels = c("grp1", "grp2", "grp3", "grp4"))
table(clean_df$NbhdBin)
##
## grp1 grp2 grp3 grp4
## 611 776 401 213
ggplot(clean_df, aes(x = NbhdBin, y = SalePrice)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "SalePrice by Neighborhood Bin",
x = "Neighborhood Bin",
y = "SalePrice"
) +
theme_minimal()
clean_df <- clean_df %>%
mutate(
NbhdBin_grp1 = ifelse(NbhdBin == "grp1", 1, 0),
NbhdBin_grp2 = ifelse(NbhdBin == "grp2", 1, 0),
NbhdBin_grp3 = ifelse(NbhdBin == "grp3", 1, 0)
)
head(select(clean_df, Neighborhood, NbhdBin, NbhdBin_grp1, NbhdBin_grp2, NbhdBin_grp3))
## # A tibble: 6 × 5
## Neighborhood NbhdBin NbhdBin_grp1 NbhdBin_grp2 NbhdBin_grp3
## <chr> <fct> <dbl> <dbl> <dbl>
## 1 NAmes grp2 0 1 0
## 2 NAmes grp2 0 1 0
## 3 NAmes grp2 0 1 0
## 4 NAmes grp2 0 1 0
## 5 Gilbert grp2 0 1 0
## 6 Gilbert grp2 0 1 0
cat("- Neighborhood is an important pricing signal, but using all raw categories can create many regression coefficients with uneven group sizes.\n")
cat("- Binning neighborhoods by average price-per-square-foot preserves much of the market signal while reducing dimensionality.\n")
cat("- The grouped neighborhood feature is now ready for later regression modeling without fitting that model here.\n")
This section assembles a curated set of variables that are prepared for later modeling work. No model is fit in this report.
prepared_vars <- c(
"SalePrice", "logSalePrice",
"OverallQual", "OverallCond",
"GrLivArea", "LotArea", "TotalBsmtSF", "GarageArea", "GarageCars",
"TotalFloorSF", "HouseAge", "RemodAge", "TotalBath", "QualityIndex",
"logLotArea", "logGrLivArea", "logTotalBsmtSF", "logGarageArea", "logTotalFloorSF",
"LotArea_w", "TotalBsmtSF_w", "GarageArea_w", "TotalFloorSF_w",
"Neighborhood", "NbhdBin", "NbhdBin_grp1", "NbhdBin_grp2", "NbhdBin_grp3"
)
prepared_vars <- prepared_vars[prepared_vars %in% names(clean_df)]
prepared_df <- clean_df %>% select(all_of(prepared_vars))
dim(prepared_df)
## [1] 2001 28
head(prepared_df, 5)
## # A tibble: 5 × 28
## SalePrice logSalePrice OverallQual OverallCond GrLivArea LotArea TotalBsmtSF
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 215000 12.3 6 5 1656 31770 1080
## 2 105000 11.6 5 6 896 11622 882
## 3 172000 12.1 6 6 1329 14267 1329
## 4 244000 12.4 7 5 2110 11160 2110
## 5 189900 12.2 5 5 1629 13830 928
## # ℹ 21 more variables: GarageArea <dbl>, GarageCars <dbl>, TotalFloorSF <dbl>,
## # HouseAge <dbl>, RemodAge <dbl>, TotalBath <dbl>, QualityIndex <dbl>,
## # logLotArea <dbl>, logGrLivArea <dbl>, logTotalBsmtSF <dbl>,
## # logGarageArea <dbl>, logTotalFloorSF <dbl>, LotArea_w <dbl>,
## # TotalBsmtSF_w <dbl>, GarageArea_w <dbl>, TotalFloorSF_w <dbl>,
## # Neighborhood <chr>, NbhdBin <fct>, NbhdBin_grp1 <dbl>, NbhdBin_grp2 <dbl>,
## # NbhdBin_grp3 <dbl>
# Optional export for later regression modeling
write.csv(prepared_df, "ames_prepared_for_regression.csv", row.names = FALSE)
cat(
paste0(
"- The final prepared dataset contains **", nrow(prepared_df), "** rows and **", ncol(prepared_df), "** selected columns.\n",
"- This report intentionally stops after preparation and feature engineering.\n",
"- The next step, if needed, would be regression modeling using the prepared dataset and a carefully chosen subset of predictors.\n"
)
)
This report focused on preparing Ames housing data for regression
rather than fitting a final model. The process included profiling,
sample definition, missingness treatment, rule-based validation, outlier
review, transformation planning, scaling, and neighborhood binning. The
top 30 correlations with SalePrice were used as a screening
device to identify strong candidate predictors, while additional feature
engineering created more regression-ready versions of size, age,
quality, and neighborhood effects.
A key takeaway is that regression preparation should not be reduced to a single cleaning step. Different issues require different responses: some missing values reflect absent home features, some extreme values should be transformed instead of deleted, and some categorical predictors should be grouped before modeling. By separating these tasks clearly, the resulting dataset is better aligned with the assumptions and practical needs of future regression work.
PCA is a good screening tool here, PCA does not tell you which variables are best for predicting SalePrice directly. It tells you whether your prepared variables are capturing the main structure and variation among the predictors.
Correlation with SalePrice finds variables individually associated with the target PCA finds combinations of predictors that explain shared variation among the predictors
So PCA may reveal:
############################################################
# PCA on prepared predictors
############################################################
# Choose numeric predictors only
pca_df <- prepared_df %>%
select(where(is.numeric)) %>%
select(-SalePrice, -logSalePrice) # exclude response variables
pca_df <- na.omit(pca_df)
pca_result <- prcomp(pca_df, center = TRUE, scale. = TRUE)
pca_var <- pca_result$sdev^2
pca_var_exp <- pca_var / sum(pca_var)
plot(pca_var_exp,
type = "b",
pch = 19,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
main = "Scree Plot")
cumsum(pca_var_exp)
## [1] 0.4073825 0.5148708 0.6053127 0.6774651 0.7451492 0.8088657 0.8652442
## [8] 0.8942601 0.9149812 0.9329241 0.9495006 0.9638727 0.9771641 0.9852185
## [15] 0.9911587 0.9951439 0.9970641 0.9979346 0.9985777 0.9991116 0.9996339
## [22] 0.9998360 0.9999903 1.0000000
loadings <- as.data.frame(pca_result$rotation)
round(loadings, 3)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## OverallQual -0.253 0.041 0.215 0.058 -0.038 -0.088 0.090 -0.161 0.507
## OverallCond 0.062 0.251 0.195 0.145 -0.582 -0.056 0.122 0.190 -0.186
## GrLivArea -0.271 0.286 -0.080 -0.024 0.124 -0.052 -0.059 0.100 -0.018
## LotArea -0.109 -0.068 -0.455 0.140 -0.221 0.094 0.176 -0.098 -0.083
## TotalBsmtSF -0.207 -0.190 0.103 0.430 0.073 0.156 -0.150 -0.023 0.123
## GarageArea -0.245 -0.175 0.094 -0.274 -0.151 0.196 -0.144 -0.014 0.013
## GarageCars -0.251 -0.151 0.092 -0.262 -0.118 0.114 -0.120 -0.079 0.017
## TotalFloorSF -0.273 0.281 -0.077 -0.026 0.123 -0.058 -0.055 0.097 -0.015
## HouseAge 0.190 0.300 -0.126 0.084 -0.192 0.209 -0.215 0.241 -0.065
## RemodAge 0.163 0.012 -0.232 0.024 0.071 0.209 -0.396 0.413 0.394
## TotalBath -0.241 -0.002 0.032 -0.004 0.140 -0.132 0.104 -0.202 -0.373
## QualityIndex -0.148 0.230 0.285 0.146 -0.447 -0.105 0.147 0.014 0.294
## logLotArea -0.176 -0.116 -0.483 0.101 -0.225 0.039 0.124 -0.064 0.003
## logGrLivArea -0.274 0.272 -0.070 -0.019 0.124 -0.066 -0.072 0.088 -0.031
## logTotalBsmtSF -0.120 -0.130 0.212 0.481 0.037 0.078 -0.174 0.175 -0.449
## logGarageArea -0.174 -0.173 0.075 -0.301 -0.253 0.103 -0.248 0.232 -0.238
## logTotalFloorSF -0.275 0.267 -0.067 -0.020 0.123 -0.073 -0.067 0.087 -0.029
## LotArea_w -0.179 -0.101 -0.430 0.100 -0.203 0.028 0.088 -0.069 0.056
## TotalBsmtSF_w -0.208 -0.202 0.094 0.418 0.073 0.159 -0.141 -0.024 0.136
## GarageArea_w -0.247 -0.176 0.096 -0.272 -0.150 0.190 -0.139 -0.013 0.013
## TotalFloorSF_w -0.274 0.277 -0.072 -0.026 0.131 -0.061 -0.056 0.093 -0.016
## NbhdBin_grp1 0.119 0.349 0.016 0.001 -0.005 0.511 -0.079 -0.328 -0.129
## NbhdBin_grp2 0.052 -0.152 -0.113 0.028 -0.135 -0.637 -0.372 0.101 -0.037
## NbhdBin_grp3 -0.097 -0.167 0.050 -0.061 0.161 0.148 0.579 0.630 -0.001
## PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18
## OverallQual -0.288 0.104 -0.307 0.011 0.264 0.078 0.069 0.004 0.000
## OverallCond 0.143 0.051 0.288 0.005 -0.334 -0.055 -0.066 -0.047 -0.012
## GrLivArea 0.045 -0.078 0.030 -0.018 -0.057 0.018 -0.018 0.397 0.089
## LotArea -0.633 -0.385 0.170 -0.028 -0.049 0.006 0.026 0.018 -0.261
## TotalBsmtSF 0.144 -0.111 0.273 0.179 0.056 -0.040 0.024 0.101 -0.083
## GarageArea 0.078 -0.075 0.114 -0.332 -0.032 0.331 0.079 -0.035 -0.018
## GarageCars -0.015 -0.038 -0.022 -0.205 -0.008 -0.860 -0.067 0.008 0.006
## TotalFloorSF 0.040 -0.076 0.029 -0.011 -0.070 0.017 -0.020 0.414 0.056
## HouseAge 0.074 -0.120 0.053 -0.192 0.763 -0.036 -0.123 -0.031 0.019
## RemodAge -0.285 0.445 0.136 -0.092 -0.279 -0.029 -0.049 -0.020 -0.005
## TotalBath -0.230 0.629 0.432 -0.064 0.269 0.016 -0.031 -0.016 0.006
## QualityIndex -0.127 0.142 -0.039 -0.016 0.034 0.014 0.053 0.033 0.022
## logLotArea 0.115 0.111 -0.138 0.003 -0.014 0.013 0.026 -0.057 0.757
## logGrLivArea 0.020 -0.070 -0.039 0.051 -0.053 -0.010 0.023 -0.545 -0.032
## logTotalBsmtSF -0.189 0.070 -0.515 -0.316 -0.111 0.033 -0.047 0.024 0.007
## logGarageArea -0.164 0.063 -0.179 0.712 0.117 0.076 -0.054 0.070 -0.008
## logTotalFloorSF 0.016 -0.069 -0.040 0.055 -0.062 -0.008 0.022 -0.523 -0.064
## LotArea_w 0.445 0.330 -0.254 0.004 0.034 0.002 -0.030 0.052 -0.566
## TotalBsmtSF_w 0.165 -0.112 0.296 0.202 0.049 -0.045 0.072 -0.103 0.092
## GarageArea_w 0.074 -0.073 0.110 -0.306 -0.027 0.335 0.071 -0.035 -0.006
## TotalFloorSF_w 0.035 -0.071 0.005 -0.003 -0.068 0.008 -0.042 0.237 -0.060
## NbhdBin_grp1 -0.014 0.122 -0.090 0.102 -0.081 -0.065 0.652 0.045 -0.007
## NbhdBin_grp2 0.010 -0.052 0.052 -0.077 0.088 -0.061 0.601 0.042 -0.027
## NbhdBin_grp3 0.026 0.019 0.028 -0.033 0.118 -0.085 0.390 0.007 -0.023
## PC19 PC20 PC21 PC22 PC23 PC24
## OverallQual -0.174 0.080 0.527 0.001 0.009 -0.002
## OverallCond -0.157 0.063 0.437 -0.004 0.003 -0.003
## GrLivArea 0.096 0.605 -0.039 -0.078 0.012 -0.494
## LotArea 0.041 0.019 0.003 0.007 -0.003 0.000
## TotalBsmtSF -0.662 0.028 -0.236 0.009 0.010 -0.003
## GarageArea 0.004 -0.015 -0.004 0.033 0.699 -0.001
## GarageCars 0.000 0.004 0.003 0.005 -0.007 -0.001
## TotalFloorSF 0.041 -0.135 0.056 0.584 -0.030 0.504
## HouseAge -0.008 -0.013 0.041 0.004 -0.003 0.002
## RemodAge -0.004 0.003 0.011 0.001 -0.004 -0.001
## TotalBath -0.003 0.000 0.010 0.000 0.002 0.000
## QualityIndex 0.222 -0.092 -0.634 -0.003 -0.005 0.004
## logLotArea -0.116 -0.062 -0.009 -0.036 0.009 -0.001
## logGrLivArea -0.032 0.408 -0.095 -0.272 0.021 0.500
## logTotalBsmtSF 0.032 0.001 0.000 0.004 -0.003 0.001
## logGarageArea 0.011 0.003 -0.009 -0.002 0.014 -0.001
## logTotalFloorSF -0.075 -0.315 -0.009 0.407 -0.024 -0.501
## LotArea_w 0.074 0.048 0.020 0.027 -0.006 0.001
## TotalBsmtSF_w 0.645 -0.037 0.222 -0.015 -0.005 0.002
## GarageArea_w -0.016 0.007 0.001 -0.036 -0.713 0.001
## TotalFloorSF_w -0.020 -0.566 0.084 -0.639 0.022 -0.010
## NbhdBin_grp1 -0.020 -0.010 0.019 -0.011 -0.004 -0.001
## NbhdBin_grp2 -0.017 -0.004 0.022 -0.012 -0.003 0.000
## NbhdBin_grp3 -0.019 -0.003 0.008 -0.006 -0.001 0.000
top_pc1 <- sort(abs(pca_result$rotation[, 1]), decreasing = TRUE)
top_pc2 <- sort(abs(pca_result$rotation[, 2]), decreasing = TRUE)
top_pc3 <- sort(abs(pca_result$rotation[, 3]), decreasing = TRUE)
head(top_pc1, 10)
## logTotalFloorSF TotalFloorSF_w logGrLivArea TotalFloorSF GrLivArea
## 0.2753515 0.2738067 0.2737812 0.2728143 0.2707104
## OverallQual GarageCars GarageArea_w GarageArea TotalBath
## 0.2531222 0.2510773 0.2469418 0.2445229 0.2407781
head(top_pc2, 10)
## NbhdBin_grp1 HouseAge GrLivArea TotalFloorSF TotalFloorSF_w
## 0.3493084 0.3004936 0.2857682 0.2805649 0.2772399
## logGrLivArea logTotalFloorSF OverallCond QualityIndex TotalBsmtSF_w
## 0.2719289 0.2673730 0.2507143 0.2301698 0.2016655
head(top_pc3, 10)
## logLotArea LotArea LotArea_w QualityIndex RemodAge
## 0.4826232 0.4545469 0.4299858 0.2851036 0.2321680
## OverallQual logTotalBsmtSF OverallCond HouseAge NbhdBin_grp2
## 0.2153130 0.2120779 0.1953933 0.1256409 0.1129467
pca_scores <- as.data.frame(pca_result$x)
head(pca_scores)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## 1 -1.8286055 -1.01887502 -4.4608476 0.7135814 -1.3906646 -0.2180562 -0.7614296
## 2 2.1595599 -2.45819030 -0.6257555 -0.6299154 -1.8219632 0.1574683 -1.2567231
## 3 0.7000763 -0.67238656 -1.6125461 2.1145901 -0.8097067 -0.7309959 -0.8641991
## 4 -3.7266429 -0.04073173 -0.2477599 2.1591671 1.1786634 -0.6269818 -2.2496657
## 5 -1.0117876 -0.84249666 -1.6847881 -0.4726617 0.3142807 -1.3751885 -0.3919521
## 6 -0.7888145 -0.11115333 0.1868705 -0.3805976 -0.1814624 -1.7489175 -0.3069621
## PC8 PC9 PC10 PC11 PC12 PC13
## 1 0.1436286 0.3007578 -0.55189188 0.01720992 -0.57847739 -0.38415446
## 2 0.5654555 0.3929074 0.32668953 -0.12077880 -0.18283853 -0.69894053
## 3 0.6570661 0.9307228 0.32401777 0.37993891 -0.28772980 0.48714609
## 4 0.4012078 0.3410760 0.01260334 0.46306744 1.39948365 0.40371765
## 5 -0.5187309 -1.0197127 0.57595712 -0.02309341 -0.35041140 0.01773298
## 6 -0.3010986 -0.4731845 -0.03589646 -0.02833866 -0.03281889 0.04302094
## PC14 PC15 PC16 PC17 PC18 PC19
## 1 -0.05693487 -0.03635597 0.13627418 -0.15959511 0.35460959 -0.086891369
## 2 -0.32039767 1.68245966 0.22128263 0.13869436 -0.01994984 -0.025982919
## 3 -0.07937385 0.24479200 0.06834309 -0.09521178 -0.11913305 0.047160479
## 4 0.58821136 -0.15160441 0.26978550 0.09640607 0.02411940 -0.294151123
## 5 -0.43874117 -0.17001512 0.13351956 -0.07880150 -0.15600722 0.004024422
## 6 -0.50400763 -0.20029964 0.19023273 -0.09975696 0.02666305 -0.009482895
## PC20 PC21 PC22 PC23 PC24
## 1 -0.044990106 0.062240364 -0.02200035 -0.0118288707 0.00031718878
## 2 -0.012471994 0.030976454 -0.01981499 -0.0402253576 -0.00094399065
## 3 0.007112304 0.028136981 0.01528668 0.0152036631 0.00065817054
## 4 -0.048957885 -0.009776146 -0.04322388 0.0080840052 0.00129115202
## 5 0.007312706 -0.065939971 0.01185234 -0.0065133420 0.00007505453
## 6 0.001548306 -0.064137981 0.00164218 0.0003139828 -0.00067596038
pca_scores$SalePrice <- clean_df$SalePrice[as.numeric(rownames(pca_scores))]
pc_saleprice_corr <- sapply(pca_scores %>% select(starts_with("PC")), function(x) {
cor(x, pca_scores$SalePrice, use = "complete.obs")
})
pc_saleprice_corr
## PC1 PC2 PC3 PC4 PC5 PC6
## -0.901773338 0.035176400 0.075964893 0.098840837 -0.007259678 -0.044617026
## PC7 PC8 PC9 PC10 PC11 PC12
## 0.067839931 -0.085764555 0.131801016 -0.004481050 0.008407384 0.079305041
## PC13 PC14 PC15 PC16 PC17 PC18
## -0.015986502 -0.008443691 0.013413584 -0.086041337 0.148539144 -0.003550060
## PC19 PC20 PC21 PC22 PC23 PC24
## -0.068917002 0.031852027 0.011041453 0.033528984 -0.003483513 0.011362201
sort(abs(pc_saleprice_corr), decreasing = TRUE)
## PC1 PC17 PC9 PC4 PC16 PC8
## 0.901773338 0.148539144 0.131801016 0.098840837 0.086041337 0.085764555
## PC12 PC3 PC19 PC7 PC6 PC2
## 0.079305041 0.075964893 0.068917002 0.067839931 0.044617026 0.035176400
## PC22 PC20 PC13 PC15 PC24 PC21
## 0.033528984 0.031852027 0.015986502 0.013413584 0.011362201 0.011041453
## PC14 PC11 PC5 PC10 PC18 PC23
## 0.008443691 0.008407384 0.007259678 0.004481050 0.003550060 0.003483513
round(pca_result$rotation[, 1], 3) # if PC1 is important
## OverallQual OverallCond GrLivArea LotArea TotalBsmtSF
## -0.253 0.062 -0.271 -0.109 -0.207
## GarageArea GarageCars TotalFloorSF HouseAge RemodAge
## -0.245 -0.251 -0.273 0.190 0.163
## TotalBath QualityIndex logLotArea logGrLivArea logTotalBsmtSF
## -0.241 -0.148 -0.176 -0.274 -0.120
## logGarageArea logTotalFloorSF LotArea_w TotalBsmtSF_w GarageArea_w
## -0.174 -0.275 -0.179 -0.208 -0.247
## TotalFloorSF_w NbhdBin_grp1 NbhdBin_grp2 NbhdBin_grp3
## -0.274 0.119 0.052 -0.097
round(pca_result$rotation[, 2], 3) # if PC2 is important
## OverallQual OverallCond GrLivArea LotArea TotalBsmtSF
## 0.041 0.251 0.286 -0.068 -0.190
## GarageArea GarageCars TotalFloorSF HouseAge RemodAge
## -0.175 -0.151 0.281 0.300 0.012
## TotalBath QualityIndex logLotArea logGrLivArea logTotalBsmtSF
## -0.002 0.230 -0.116 0.272 -0.130
## logGarageArea logTotalFloorSF LotArea_w TotalBsmtSF_w GarageArea_w
## -0.173 0.267 -0.101 -0.202 -0.176
## TotalFloorSF_w NbhdBin_grp1 NbhdBin_grp2 NbhdBin_grp3
## 0.277 0.349 -0.152 -0.167
By looking at the Scree plot the third bend is on 12th component.
############################################################
# Cumulative variance explained (first 12 PCs)
############################################################
# Variance explained by each component
pca_var <- pca_result$sdev^2
# Proportion of variance explained
pca_var_exp <- pca_var / sum(pca_var)
# Cumulative proportion
pca_cum_var <- cumsum(pca_var_exp)
# Create a table for first 12 components
pca_summary <- data.frame(
PC = paste0("PC", 1:length(pca_var_exp)),
Variance_Explained = round(pca_var_exp, 4),
Cumulative_Variance = round(pca_cum_var, 4)
)
# Show first 12
pca_summary[1:12, ]
## PC Variance_Explained Cumulative_Variance
## 1 PC1 0.4074 0.4074
## 2 PC2 0.1075 0.5149
## 3 PC3 0.0904 0.6053
## 4 PC4 0.0722 0.6775
## 5 PC5 0.0677 0.7451
## 6 PC6 0.0637 0.8089
## 7 PC7 0.0564 0.8652
## 8 PC8 0.0290 0.8943
## 9 PC9 0.0207 0.9150
## 10 PC10 0.0179 0.9329
## 11 PC11 0.0166 0.9495
## 12 PC12 0.0144 0.9639
pca_summary[1:12, ] %>%
mutate(
Variance_Explained = scales::percent(Variance_Explained),
Cumulative_Variance = scales::percent(Cumulative_Variance)
)
## PC Variance_Explained Cumulative_Variance
## 1 PC1 40.74% 40.7%
## 2 PC2 10.75% 51.5%
## 3 PC3 9.04% 60.5%
## 4 PC4 7.22% 67.8%
## 5 PC5 6.77% 74.5%
## 6 PC6 6.37% 80.9%
## 7 PC7 5.64% 86.5%
## 8 PC8 2.90% 89.4%
## 9 PC9 2.07% 91.5%
## 10 PC10 1.79% 93.3%
## 11 PC11 1.66% 95.0%
## 12 PC12 1.44% 96.4%
############################################################
# PCA-based feature importance using first 12 components
############################################################
# Number of components to use
k <- 12
# PCA loadings
loadings <- pca_result$rotation[, 1:k]
# Variance explained by each PC
eigvals <- pca_result$sdev^2
var_exp <- eigvals / sum(eigvals)
# Keep first 12 variance proportions
var_exp_k <- var_exp[1:k]
# Weighted importance:
# absolute loading × variance explained by that PC
feature_importance <- data.frame(
Feature = rownames(loadings),
Importance = rowSums(abs(loadings) %*% diag(var_exp_k))
)
# Sort from most important to least important
feature_importance <- feature_importance %>%
arrange(desc(Importance))
feature_importance
## Feature Importance
## GarageArea_w GarageArea_w 0.1827789
## GarageArea GarageArea 0.1825329
## TotalBsmtSF_w TotalBsmtSF_w 0.1804486
## TotalBsmtSF TotalBsmtSF 0.1801672
## QualityIndex QualityIndex 0.1782060
## HouseAge HouseAge 0.1780632
## logGarageArea logGarageArea 0.1737494
## GarageCars GarageCars 0.1717215
## OverallQual OverallQual 0.1709620
## TotalFloorSF TotalFloorSF 0.1708581
## GrLivArea GrLivArea 0.1708542
## LotArea_w LotArea_w 0.1706935
## TotalFloorSF_w TotalFloorSF_w 0.1705316
## logGrLivArea logGrLivArea 0.1704284
## logTotalFloorSF logTotalFloorSF 0.1703138
## logLotArea logLotArea 0.1677544
## RemodAge RemodAge 0.1655771
## logTotalBsmtSF logTotalBsmtSF 0.1602580
## TotalBath TotalBath 0.1596354
## LotArea LotArea 0.1587646
## OverallCond OverallCond 0.1471269
## NbhdBin_grp1 NbhdBin_grp1 0.1407924
## NbhdBin_grp3 NbhdBin_grp3 0.1387341
## NbhdBin_grp2 NbhdBin_grp2 0.1257626
top20_pca_importance <- feature_importance %>%
slice(1:20)
top20_pca_importance
## Feature Importance
## GarageArea_w GarageArea_w 0.1827789
## GarageArea GarageArea 0.1825329
## TotalBsmtSF_w TotalBsmtSF_w 0.1804486
## TotalBsmtSF TotalBsmtSF 0.1801672
## QualityIndex QualityIndex 0.1782060
## HouseAge HouseAge 0.1780632
## logGarageArea logGarageArea 0.1737494
## GarageCars GarageCars 0.1717215
## OverallQual OverallQual 0.1709620
## TotalFloorSF TotalFloorSF 0.1708581
## GrLivArea GrLivArea 0.1708542
## LotArea_w LotArea_w 0.1706935
## TotalFloorSF_w TotalFloorSF_w 0.1705316
## logGrLivArea logGrLivArea 0.1704284
## logTotalFloorSF logTotalFloorSF 0.1703138
## logLotArea logLotArea 0.1677544
## RemodAge RemodAge 0.1655771
## logTotalBsmtSF logTotalBsmtSF 0.1602580
## TotalBath TotalBath 0.1596354
## LotArea LotArea 0.1587646
library(ggplot2)
feature_importance %>%
slice(1:20) %>%
ggplot(aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 20 PCA-Based Feature Importances (PC1-PC12)",
x = "Feature",
y = "Importance"
) +
theme_minimal()
feature_importance <- data.frame(
Feature = rownames(loadings),
Importance = rowSums(sweep(abs(loadings), 2, var_exp_k, FUN = "*"))
) %>%
mutate(RelativeImportance = Importance / sum(Importance)) %>%
arrange(desc(RelativeImportance))
feature_importance
## Feature Importance RelativeImportance
## GarageArea_w GarageArea_w 0.1827789 0.04584700
## GarageArea GarageArea 0.1825329 0.04578528
## TotalBsmtSF_w TotalBsmtSF_w 0.1804486 0.04526247
## TotalBsmtSF TotalBsmtSF 0.1801672 0.04519190
## QualityIndex QualityIndex 0.1782060 0.04469996
## HouseAge HouseAge 0.1780632 0.04466415
## logGarageArea logGarageArea 0.1737494 0.04358210
## GarageCars GarageCars 0.1717215 0.04307344
## OverallQual OverallQual 0.1709620 0.04288292
## TotalFloorSF TotalFloorSF 0.1708581 0.04285687
## GrLivArea GrLivArea 0.1708542 0.04285589
## LotArea_w LotArea_w 0.1706935 0.04281559
## TotalFloorSF_w TotalFloorSF_w 0.1705316 0.04277496
## logGrLivArea logGrLivArea 0.1704284 0.04274907
## logTotalFloorSF logTotalFloorSF 0.1703138 0.04272034
## logLotArea logLotArea 0.1677544 0.04207835
## RemodAge RemodAge 0.1655771 0.04153222
## logTotalBsmtSF logTotalBsmtSF 0.1602580 0.04019800
## TotalBath TotalBath 0.1596354 0.04004185
## LotArea LotArea 0.1587646 0.03982342
## OverallCond OverallCond 0.1471269 0.03690428
## NbhdBin_grp1 NbhdBin_grp1 0.1407924 0.03531540
## NbhdBin_grp3 NbhdBin_grp3 0.1387341 0.03479911
## NbhdBin_grp2 NbhdBin_grp2 0.1257626 0.03154542
feature_importance %>%
mutate(RelativeImportance = scales::percent(RelativeImportance))
## Feature Importance RelativeImportance
## GarageArea_w GarageArea_w 0.1827789 4.584700%
## GarageArea GarageArea 0.1825329 4.578528%
## TotalBsmtSF_w TotalBsmtSF_w 0.1804486 4.526247%
## TotalBsmtSF TotalBsmtSF 0.1801672 4.519190%
## QualityIndex QualityIndex 0.1782060 4.469996%
## HouseAge HouseAge 0.1780632 4.466415%
## logGarageArea logGarageArea 0.1737494 4.358210%
## GarageCars GarageCars 0.1717215 4.307344%
## OverallQual OverallQual 0.1709620 4.288292%
## TotalFloorSF TotalFloorSF 0.1708581 4.285687%
## GrLivArea GrLivArea 0.1708542 4.285589%
## LotArea_w LotArea_w 0.1706935 4.281559%
## TotalFloorSF_w TotalFloorSF_w 0.1705316 4.277496%
## logGrLivArea logGrLivArea 0.1704284 4.274907%
## logTotalFloorSF logTotalFloorSF 0.1703138 4.272034%
## logLotArea logLotArea 0.1677544 4.207835%
## RemodAge RemodAge 0.1655771 4.153222%
## logTotalBsmtSF logTotalBsmtSF 0.1602580 4.019800%
## TotalBath TotalBath 0.1596354 4.004185%
## LotArea LotArea 0.1587646 3.982342%
## OverallCond OverallCond 0.1471269 3.690428%
## NbhdBin_grp1 NbhdBin_grp1 0.1407924 3.531540%
## NbhdBin_grp3 NbhdBin_grp3 0.1387341 3.479911%
## NbhdBin_grp2 NbhdBin_grp2 0.1257626 3.154542%
The cumulative variance explained by the first 12 principal components was examined to assess how much of the total variation in the predictors is captured by a reduced set of components. This helps determine whether the feature space can be effectively summarized or whether important variation is spread across many variables.
PCA-based feature importance was calculated using the absolute variable loadings across the first 12 principal components, weighted by the proportion of variance explained by each component. This provides a measure of how strongly each feature contributes to the main structure of the predictor space. Unlike simple correlation with SalePrice, this approach identifies variables that are influential in explaining the overall variation among predictors, even if their individual association with the response is more modest.
If PCA and top-30 correlation agree:
If PCA highlights variables not in the top 30:
For example: one variable may be weak alone but strongly define a component like house size, home quality, or garage/basement capacity. That means the top-30 correlation screen may have missed some structural contributors.
############################################################
# PCA feature importance from first 12 components
############################################################
k <- 12
loadings <- pca_result$rotation[, 1:k]
eigvals <- pca_result$sdev^2
var_exp <- eigvals / sum(eigvals)
var_exp_k <- var_exp[1:k]
pca_importance <- data.frame(
Feature = rownames(loadings),
PCA_Importance = rowSums(sweep(abs(loadings), 2, var_exp_k, FUN = "*"))
) %>%
arrange(desc(PCA_Importance))
############################################################
# PCA feature importance from first 12 components
############################################################
k <- 12
loadings <- pca_result$rotation[, 1:k]
eigvals <- pca_result$sdev^2
var_exp <- eigvals / sum(eigvals)
var_exp_k <- var_exp[1:k]
pca_importance <- data.frame(
Feature = rownames(loadings),
PCA_Importance = rowSums(sweep(abs(loadings), 2, var_exp_k, FUN = "*"))
) %>%
arrange(desc(PCA_Importance))
############################################################
# Correlation table from earlier work
############################################################
corr_importance <- top30_saleprice %>%
select(Feature, Correlation, abs_corr)
corr_importance <- top30_saleprice %>%
mutate(abs_corr = abs(Correlation)) %>%
select(Feature, Correlation, abs_corr)
############################################################
# Merge PCA importance with correlation importance
############################################################
comparison_tbl <- pca_importance %>%
left_join(corr_importance, by = "Feature") %>%
mutate(
In_Top30_Correlation = ifelse(!is.na(abs_corr), "Yes", "No"),
PCA_Rank = rank(-PCA_Importance, ties.method = "first"),
Corr_Rank = ifelse(is.na(abs_corr), NA, rank(-abs_corr, ties.method = "first"))
) %>%
arrange(desc(PCA_Importance))
comparison_tbl
## Feature PCA_Importance Correlation abs_corr In_Top30_Correlation
## 1 GarageArea_w 0.1827789 NA NA No
## 2 GarageArea 0.1825329 0.6420118 0.6420118 Yes
## 3 TotalBsmtSF_w 0.1804486 NA NA No
## 4 TotalBsmtSF 0.1801672 0.6512665 0.6512665 Yes
## 5 QualityIndex 0.1782060 0.5247977 0.5247977 Yes
## 6 HouseAge 0.1780632 -0.5699966 0.5699966 Yes
## 7 logGarageArea 0.1737494 NA NA No
## 8 GarageCars 0.1717215 0.6664744 0.6664744 Yes
## 9 OverallQual 0.1709620 0.8058106 0.8058106 Yes
## 10 TotalFloorSF 0.1708581 0.7780935 0.7780935 Yes
## 11 GrLivArea 0.1708542 0.7709897 0.7709897 Yes
## 12 LotArea_w 0.1706935 NA NA No
## 13 TotalFloorSF_w 0.1705316 NA NA No
## 14 logGrLivArea 0.1704284 NA NA No
## 15 logTotalFloorSF 0.1703138 NA NA No
## 16 logLotArea 0.1677544 NA NA No
## 17 RemodAge 0.1655771 -0.5068690 0.5068690 Yes
## 18 logTotalBsmtSF 0.1602580 NA NA No
## 19 TotalBath 0.1596354 0.7000760 0.7000760 Yes
## 20 LotArea 0.1587646 0.2901018 0.2901018 Yes
## 21 OverallCond 0.1471269 NA NA No
## 22 NbhdBin_grp1 0.1407924 NA NA No
## 23 NbhdBin_grp3 0.1387341 NA NA No
## 24 NbhdBin_grp2 0.1257626 NA NA No
## PCA_Rank Corr_Rank
## 1 1 NA
## 2 2 7
## 3 3 NA
## 4 4 6
## 5 5 9
## 6 6 8
## 7 7 NA
## 8 8 5
## 9 9 1
## 10 10 2
## 11 11 3
## 12 12 NA
## 13 13 NA
## 14 14 NA
## 15 15 NA
## 16 16 NA
## 17 17 10
## 18 18 NA
## 19 19 4
## 20 20 11
## 21 21 NA
## 22 22 NA
## 23 23 NA
## 24 24 NA
A comparison was made between the variables identified through the top-30 correlation screen and those identified as important by PCA using the first 12 principal components. Variables appearing in both lists are especially strong candidates for later regression work, since they are both individually associated with SalePrice and important to the overall structure of the predictor space. Variables highlighted by PCA but not by correlation may represent features that were underemphasized by simple univariate screening, suggesting that the original correlation-based selection may not fully capture all structurally relevant information.
top30_pca <- pca_importance %>%
slice(1:30)
compare_top30 <- top30_pca %>%
left_join(corr_importance, by = "Feature") %>%
mutate(In_Correlation_Top30 = ifelse(is.na(abs_corr), "No", "Yes"))
compare_top30
## Feature PCA_Importance Correlation abs_corr In_Correlation_Top30
## 1 GarageArea_w 0.1827789 NA NA No
## 2 GarageArea 0.1825329 0.6420118 0.6420118 Yes
## 3 TotalBsmtSF_w 0.1804486 NA NA No
## 4 TotalBsmtSF 0.1801672 0.6512665 0.6512665 Yes
## 5 QualityIndex 0.1782060 0.5247977 0.5247977 Yes
## 6 HouseAge 0.1780632 -0.5699966 0.5699966 Yes
## 7 logGarageArea 0.1737494 NA NA No
## 8 GarageCars 0.1717215 0.6664744 0.6664744 Yes
## 9 OverallQual 0.1709620 0.8058106 0.8058106 Yes
## 10 TotalFloorSF 0.1708581 0.7780935 0.7780935 Yes
## 11 GrLivArea 0.1708542 0.7709897 0.7709897 Yes
## 12 LotArea_w 0.1706935 NA NA No
## 13 TotalFloorSF_w 0.1705316 NA NA No
## 14 logGrLivArea 0.1704284 NA NA No
## 15 logTotalFloorSF 0.1703138 NA NA No
## 16 logLotArea 0.1677544 NA NA No
## 17 RemodAge 0.1655771 -0.5068690 0.5068690 Yes
## 18 logTotalBsmtSF 0.1602580 NA NA No
## 19 TotalBath 0.1596354 0.7000760 0.7000760 Yes
## 20 LotArea 0.1587646 0.2901018 0.2901018 Yes
## 21 OverallCond 0.1471269 NA NA No
## 22 NbhdBin_grp1 0.1407924 NA NA No
## 23 NbhdBin_grp3 0.1387341 NA NA No
## 24 NbhdBin_grp2 0.1257626 NA NA No
corr_only <- corr_importance %>%
filter(!Feature %in% top30_pca$Feature)
corr_only
## # A tibble: 19 × 3
## Feature Correlation abs_corr
## <chr> <dbl> <dbl>
## 1 logSalePrice 0.958 0.958
## 2 FirstFlrSF 0.640 0.640
## 3 FullBath 0.611 0.611
## 4 TotRmsAbvGrd 0.601 0.601
## 5 YearBuilt 0.571 0.571
## 6 MasVnrArea 0.544 0.544
## 7 GarageYrBlt 0.522 0.522
## 8 GarageAge -0.521 0.521
## 9 YearRemodel 0.507 0.507
## 10 Fireplaces 0.484 0.484
## 11 BsmtFinSF1 0.454 0.454
## 12 TotalOutdoorSF 0.405 0.405
## 13 LotFrontage 0.387 0.387
## 14 WoodDeckSF 0.350 0.350
## 15 SecondFlrSF 0.350 0.350
## 16 HalfBath 0.349 0.349
## 17 OpenPorchSF 0.342 0.342
## 18 BsmtFullBath 0.294 0.294
## 19 BedroomAbvGr 0.273 0.273
The NA values in the comparison table do not indicate an error in the PCA. They occur because the PCA was run on an expanded set of engineered predictors, including winsorized, transformed, and neighborhood-bin variables, while the earlier correlation table was based on the original top-30 correlation screen. As a result, some PCA-important variables did not yet have corresponding correlation values in the original comparison table. To make the comparison valid, correlations with SalePrice should be recalculated for the same variables used in the PCA. Even so, the current output already suggests that the original correlation screen identified many strong variables, while PCA additionally highlighted the importance of engineered features such as log-transformed, winsorized, and neighborhood-group variables.
The scree plot was used to evaluate how much variance each principal component explains and to determine whether a relatively small number of components captures most of the information in the predictor space.
############################################################
# Recalculate correlation for all PCA features
############################################################
# Correlation of every PCA variable with SalePrice
corr_all_pca_features <- data.frame(
Feature = colnames(pca_df),
Correlation = sapply(pca_df, function(x) {
cor(x, clean_df$SalePrice[as.numeric(rownames(pca_df))], use = "pairwise.complete.obs")
})
) %>%
mutate(abs_corr = abs(Correlation))
corr_all_pca_features############################################################
## Feature Correlation abs_corr
## OverallQual OverallQual 0.8058106 0.8058106
## OverallCond OverallCond -0.1162303 0.1162303
## GrLivArea GrLivArea 0.7709897 0.7709897
## LotArea LotArea 0.2901018 0.2901018
## TotalBsmtSF TotalBsmtSF 0.6512665 0.6512665
## GarageArea GarageArea 0.6420118 0.6420118
## GarageCars GarageCars 0.6664744 0.6664744
## TotalFloorSF TotalFloorSF 0.7780935 0.7780935
## HouseAge HouseAge -0.5699966 0.5699966
## RemodAge RemodAge -0.5068690 0.5068690
## TotalBath TotalBath 0.7000760 0.7000760
## QualityIndex QualityIndex 0.5247977 0.5247977
## logLotArea logLotArea 0.4564209 0.4564209
## logGrLivArea logGrLivArea 0.7452188 0.7452188
## logTotalBsmtSF logTotalBsmtSF 0.3265755 0.3265755
## logGarageArea logGarageArea 0.3775568 0.3775568
## logTotalFloorSF logTotalFloorSF 0.7506574 0.7506574
## LotArea_w LotArea_w 0.4723528 0.4723528
## TotalBsmtSF_w TotalBsmtSF_w 0.6321503 0.6321503
## GarageArea_w GarageArea_w 0.6501319 0.6501319
## TotalFloorSF_w TotalFloorSF_w 0.7714276 0.7714276
## NbhdBin_grp1 NbhdBin_grp1 -0.3576047 0.3576047
## NbhdBin_grp2 NbhdBin_grp2 -0.1793344 0.1793344
## NbhdBin_grp3 NbhdBin_grp3 0.2430058 0.2430058
# Recalculate correlation for all PCA features
############################################################
# Correlation of every PCA variable with SalePrice
corr_all_pca_features <- data.frame(
Feature = colnames(pca_df),
Correlation = sapply(pca_df, function(x) {
cor(x, clean_df$SalePrice[as.numeric(rownames(pca_df))], use = "pairwise.complete.obs")
})
) %>%
mutate(abs_corr = abs(Correlation))
corr_all_pca_features
## Feature Correlation abs_corr
## OverallQual OverallQual 0.8058106 0.8058106
## OverallCond OverallCond -0.1162303 0.1162303
## GrLivArea GrLivArea 0.7709897 0.7709897
## LotArea LotArea 0.2901018 0.2901018
## TotalBsmtSF TotalBsmtSF 0.6512665 0.6512665
## GarageArea GarageArea 0.6420118 0.6420118
## GarageCars GarageCars 0.6664744 0.6664744
## TotalFloorSF TotalFloorSF 0.7780935 0.7780935
## HouseAge HouseAge -0.5699966 0.5699966
## RemodAge RemodAge -0.5068690 0.5068690
## TotalBath TotalBath 0.7000760 0.7000760
## QualityIndex QualityIndex 0.5247977 0.5247977
## logLotArea logLotArea 0.4564209 0.4564209
## logGrLivArea logGrLivArea 0.7452188 0.7452188
## logTotalBsmtSF logTotalBsmtSF 0.3265755 0.3265755
## logGarageArea logGarageArea 0.3775568 0.3775568
## logTotalFloorSF logTotalFloorSF 0.7506574 0.7506574
## LotArea_w LotArea_w 0.4723528 0.4723528
## TotalBsmtSF_w TotalBsmtSF_w 0.6321503 0.6321503
## GarageArea_w GarageArea_w 0.6501319 0.6501319
## TotalFloorSF_w TotalFloorSF_w 0.7714276 0.7714276
## NbhdBin_grp1 NbhdBin_grp1 -0.3576047 0.3576047
## NbhdBin_grp2 NbhdBin_grp2 -0.1793344 0.1793344
## NbhdBin_grp3 NbhdBin_grp3 0.2430058 0.2430058
comparison_tbl <- pca_importance %>%
left_join(corr_all_pca_features, by = "Feature") %>%
arrange(desc(PCA_Importance))
comparison_tbl
## Feature PCA_Importance Correlation abs_corr
## 1 GarageArea_w 0.1827789 0.6501319 0.6501319
## 2 GarageArea 0.1825329 0.6420118 0.6420118
## 3 TotalBsmtSF_w 0.1804486 0.6321503 0.6321503
## 4 TotalBsmtSF 0.1801672 0.6512665 0.6512665
## 5 QualityIndex 0.1782060 0.5247977 0.5247977
## 6 HouseAge 0.1780632 -0.5699966 0.5699966
## 7 logGarageArea 0.1737494 0.3775568 0.3775568
## 8 GarageCars 0.1717215 0.6664744 0.6664744
## 9 OverallQual 0.1709620 0.8058106 0.8058106
## 10 TotalFloorSF 0.1708581 0.7780935 0.7780935
## 11 GrLivArea 0.1708542 0.7709897 0.7709897
## 12 LotArea_w 0.1706935 0.4723528 0.4723528
## 13 TotalFloorSF_w 0.1705316 0.7714276 0.7714276
## 14 logGrLivArea 0.1704284 0.7452188 0.7452188
## 15 logTotalFloorSF 0.1703138 0.7506574 0.7506574
## 16 logLotArea 0.1677544 0.4564209 0.4564209
## 17 RemodAge 0.1655771 -0.5068690 0.5068690
## 18 logTotalBsmtSF 0.1602580 0.3265755 0.3265755
## 19 TotalBath 0.1596354 0.7000760 0.7000760
## 20 LotArea 0.1587646 0.2901018 0.2901018
## 21 OverallCond 0.1471269 -0.1162303 0.1162303
## 22 NbhdBin_grp1 0.1407924 -0.3576047 0.3576047
## 23 NbhdBin_grp3 0.1387341 0.2430058 0.2430058
## 24 NbhdBin_grp2 0.1257626 -0.1793344 0.1793344
############################################################
# Scree plot
############################################################
# Variance explained by each principal component
pca_var <- pca_result$sdev^2
# Proportion of variance explained
pca_var_exp <- pca_var / sum(pca_var)
# Scree plot
scree_df <- data.frame(
PC = 1:length(pca_var_exp),
VarianceExplained = pca_var_exp,
CumulativeVariance = cumsum(pca_var_exp)
)
ggplot(scree_df, aes(x = PC, y = VarianceExplained)) +
geom_line() +
geom_point() +
labs(
title = "Scree Plot",
x = "Principal Component",
y = "Proportion of Variance Explained"
) +
theme_minimal()
ggplot(scree_df, aes(x = PC, y = CumulativeVariance)) +
geom_line() +
geom_point() +
labs(
title = "Cumulative Scree Plot",
x = "Principal Component",
y = "Cumulative Proportion of Variance Explained"
) +
theme_minimal()