suppressPackageStartupMessages({
  library(tidyverse)
  library(knitr)
  library(Hmisc)
  library(caret)
  library(psych)
})

1 Import and Format Data

First, we import the Home Sales dataset, which contains various home features and the target variable SalesPrice. The purpose of this exercise is to apply multiple linear regression to predict a home’s final sale price from its features. After importing, we also correct several formatting issues: a byte-order mark in the file encoding, an empty row at the bottom of the file, a meaningless row-index feature, and commas inside numeric cells.

sales.df <- read.csv("https://s3.us-east-2.amazonaws.com/artificium.us/datasets/HomeSalesUFFIData.csv",
         header = TRUE,
         stringsAsFactors = FALSE, # all columns are numeric
         fileEncoding="UTF-8-BOM", # strips the BOM symbol from the first header name
         nrows = 99) %>% # row 100 is empty
  mutate(SalesPrice = parse_number(SalesPrice)) %>% # strip commas so the column parses as numeric
  select(-Observation) # drop the meaningless row-index feature
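As a quick sanity check (structure output not shown in the original), all columns should now be numeric and the dataframe should contain 99 rows:

str(sales.df) # confirm 99 rows, all columns numeric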

2 Identify Outliers

Next, we create a function to identify outliers in the continuous features using z-scores: for each value, z = |x − x̄| / s, and values with z > 2.8 are flagged. We find outliers in SalesPrice, LotSizeSqFt, and LivingAreaSqFt, remove the affected rows (4 observations in total), and save the outlier-free version of the data in a separate dataframe.

# Create function to flag z-score outliers; returns a logical vector aligned
# with x so it can index the original rows (NAs are never flagged)
index_outliers <- function(x) {
  z_scores <- abs(x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
  return(!is.na(z_scores) & z_scores > 2.8)
}
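As a quick illustration on made-up values, the function flags only points more than 2.8 standard deviations from the mean:

which(index_outliers(c(rep(10, 50), 100))) # flags index 51, the extreme value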
# Logical mask of the non-binary (continuous) columns
numeric_var_ind <- sapply(sales.df, function(x) length(unique(x)) > 2)
# List outliers for all numeric variables
lapply(sales.df[numeric_var_ind], function(x) {x[index_outliers(x)]})
## $YearSold
## integer(0)
## 
## $SalesPrice
## [1] 450000 450000 624600
## 
## $Finished.Bsmnt
## integer(0)
## 
## $LotSizeSqFt
## [1] 11650
## 
## $NumEncParkSpaces
## integer(0)
## 
## $LivingAreaSqFt
## [1] 2338.66 2042.95
# Create a mask that indicates whether any feature was an outlier for that row
mask <- lapply(sales.df[numeric_var_ind], function(x) {index_outliers(x)}) %>%
  bind_cols() %>%
  apply(MARGIN=1, any)
# Apply mask to create new df with outliers excluded
sales.no.df <- sales.df[!mask,]
# Visually check the rows that were excluded
sales.df[mask,]
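Counting the flagged rows is a quick sanity check; per the outlier lists above, four rows contain at least one outlying value:

sum(mask) # number of rows flagged (4, matching the text above)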

3 Check for Normal Distribution

We then apply the Shapiro-Wilk test to the outlier-removed dataset to check for normality, both on the original values and after transforming the data in various ways. Note that a p-value greater than 0.05 means we fail to reject the null hypothesis of normality, i.e., the data are consistent with a normal distribution.

# Visualize original distributions
hist.data.frame(sales.no.df[numeric_var_ind])

# Calculate Shapiro-Wilk p-values for the original and various transformed distributions
# (the +.001 offset guards against log(0) and division by zero)
original <- apply(sales.no.df[numeric_var_ind], MARGIN=2, function(x) shapiro.test(x)$p.value)
log10 <- apply(sales.no.df[numeric_var_ind], MARGIN=2, function(x) shapiro.test(log10(x+.001))$p.value)
inverse <- apply(sales.no.df[numeric_var_ind], MARGIN=2, function(x) shapiro.test(1/(x+.001))$p.value)
sqrt <- apply(sales.no.df[numeric_var_ind], MARGIN=2, function(x) shapiro.test(sqrt(x))$p.value)
rbind(original, log10, inverse, sqrt) %>% as.data.frame()

4 Transform Non-Normally Distributed Data

Taking the optimal transform from above (whichever gives the highest p-value), we transform the continuous predictors. Note that we do not transform the target variable, so predictions remain on the original dollar scale. Specifically, LotSizeSqFt is square-root-transformed and LivingAreaSqFt is inverse-transformed. We then visualize the new distributions and confirm there are no NAs as a final check. While these may still not be considered normal per the Shapiro-Wilk test, this does not invalidate multiple linear regression, which assumes normality of the residuals rather than of the predictors.

# Apply transforms
sales.tx <- sales.no.df %>%
  mutate(LotSizeSqFt = sqrt(LotSizeSqFt), # all values are positive, so sqrt is safe
         LivingAreaSqFt = 1/LivingAreaSqFt) # all values are positive, so no division by zero
# Re-check post-transform
hist.data.frame(sales.tx[numeric_var_ind])

# Confirm no NA
apply(sales.tx, MARGIN=2, function(x) sum(is.na(x)))
##         YearSold       SalesPrice     UFFI.Present      HasBrickExt 
##                0                0                0                0 
##        Gt45YrOld   Finished.Bsmnt      LotSizeSqFt NumEncParkSpaces 
##                0                0                0                0 
##   LivingAreaSqFt            HasAC          HasPool 
##                0                0                0

5 Check for Collinearity

Next, we explore collinearity in the data with scatterplot matrices for each of the three datasets.

pairs.panels(sales.df[numeric_var_ind]) # original

pairs.panels(sales.no.df[numeric_var_ind]) # outliers removed

pairs.panels(sales.tx[numeric_var_ind]) # outliers removed and transformed

We note the following correlations with the target variable that exceed 0.6:

  • LivingAreaSqFt in all three datasets
  • YearSold in the two outlier-omitted datasets

There are no comparably strong correlations among the predictors themselves, which suggests that each feature contributes non-redundant information; a numeric check follows below.
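As a numeric companion to the panels above (not part of the original output), we can also print the pairwise correlation matrix of the continuous predictors, excluding the target:

# Correlations among continuous predictors in the transformed dataset
feature_names <- setdiff(names(sales.tx)[numeric_var_ind], "SalesPrice")
round(cor(sales.tx[feature_names]), 2)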

6 Split Data for Training and Testing

Now, we can partition the data into 85% for training and 15% for testing. caret’s createDataPartition samples randomly without replacement, stratifying on the outcome. Note that we deliberately reuse the same randomly generated indices for the No Outliers dataset and the No Outliers + Data Transformed dataset: the two contain the same rows, which allows a more direct comparison between their models.

set.seed(33452)

training_indices <- createDataPartition(sales.df$SalesPrice, times=1, p=0.85)[[1]]
sales.training <- sales.df[training_indices,]
sales.testing <- sales.df[-training_indices,]

training_indices <- createDataPartition(sales.no.df$SalesPrice, times=1, p=0.85)[[1]]
sales.no.training <- sales.no.df[training_indices,]
sales.no.testing <- sales.no.df[-training_indices,]

# Reuse the same indices: sales.tx contains the same rows as sales.no.df
sales.tx.training <- sales.tx[training_indices,]
sales.tx.testing <- sales.tx[-training_indices,]
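As an optional check (counts not shown in the original output), we can confirm the split sizes are roughly 85/15:

c(train = nrow(sales.no.training), test = nrow(sales.no.testing))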

7 Build Three Multiple Regression Models

Next, we build three multiple regression models from the three training datasets, using backward elimination based on p-values to predict SalesPrice. This means the feature with the highest p-value is eliminated at each iteration until every remaining feature contributes significantly (p ≤ 0.05). The three resulting models are shown below.

models <- map(list(sales.training, sales.no.training, sales.tx.training), function(x) {
  form_str <- "SalesPrice ~ ."
  m <- lm(as.formula(form_str), data=x)
  p_vals <- summary(m)$coefficients[,4]
  p_vals <- p_vals[names(p_vals) != "(Intercept)"] # never eliminate the intercept
  remaining <- names(x)[names(x) != "SalesPrice"] # initialize vector of all feature names
  while(any(p_vals > 0.05)) {
    elim <- names(p_vals)[which.max(p_vals)] # feature with the highest p-value
    remaining <- remaining[remaining != elim] # remove it from the remaining features
    form_str <- paste("SalesPrice ~", paste(remaining, collapse=" + "))
    m <- lm(as.formula(form_str), data=x)
    p_vals <- summary(m)$coefficients[,4]
    p_vals <- p_vals[names(p_vals) != "(Intercept)"]
  }
  return(m)
})

lapply(models, summary)
## [[1]]
## 
## Call:
## lm(formula = as.formula(form_str), data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112603  -18955    -415   17014  112603 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.113e+07  3.627e+06  -5.824 1.08e-07 ***
## YearSold          1.051e+04  1.798e+03   5.848 9.77e-08 ***
## NumEncParkSpaces  1.751e+04  6.404e+03   2.735  0.00765 ** 
## LivingAreaSqFt    1.184e+02  1.429e+01   8.281 1.93e-12 ***
## HasPool           1.848e+05  2.539e+04   7.277 1.85e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35420 on 82 degrees of freedom
## Multiple R-squared:  0.7608, Adjusted R-squared:  0.7491 
## F-statistic:  65.2 on 4 and 82 DF,  p-value: < 2.2e-16
## 
## 
## [[2]]
## 
## Call:
## lm(formula = as.formula(form_str), data = x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78520 -17644   1072  17120  99921 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.035e+07  2.928e+06  -6.951 9.77e-10 ***
## YearSold          1.014e+04  1.452e+03   6.985 8.40e-10 ***
## Finished.Bsmnt    2.974e+01  1.290e+01   2.305   0.0238 *  
## NumEncParkSpaces  1.200e+04  4.884e+03   2.456   0.0163 *  
## LivingAreaSqFt    8.014e+01  1.316e+01   6.089 4.00e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27150 on 78 degrees of freedom
## Multiple R-squared:  0.6769, Adjusted R-squared:  0.6603 
## F-statistic: 40.85 on 4 and 78 DF,  p-value: < 2.2e-16
## 
## 
## [[3]]
## 
## Call:
## lm(formula = as.formula(form_str), data = x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81338 -17608   -416  19583  97456 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.087e+07  3.037e+06  -6.873 1.37e-09 ***
## YearSold          1.046e+04  1.503e+03   6.962 9.29e-10 ***
## Finished.Bsmnt    3.413e+01  1.353e+01   2.522  0.01369 *  
## NumEncParkSpaces  1.517e+04  5.022e+03   3.020  0.00341 ** 
## LivingAreaSqFt   -4.802e+07  8.965e+06  -5.356 8.33e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28200 on 78 degrees of freedom
## Multiple R-squared:  0.6515, Adjusted R-squared:  0.6336 
## F-statistic: 36.45 on 4 and 78 DF,  p-value: < 2.2e-16

Interestingly, we can see that all three models retain 4 features, but the model built from the original dataset uses HasPool where the other two use Finished.Bsmnt.
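To compare the retained features side by side (a convenience check, not part of the original output), we can pull each model’s term labels:

lapply(models, function(m) attr(terms(m), "term.labels"))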

8 Assess the Three Multiple Regression Models

Next, we create a function to assess each model’s Adjusted R-squared (from its training fit) and its RMSE on the corresponding testing dataset.

AssessModel <- function(m, testing_df, class_vec) {
  # testing_df: dataframe of testing cases, without the target variable
  # class_vec: vector of actual SalesPrice values for testing_df, in the same order
  p <- predict(m, testing_df)
  results <- data.frame(Actual = class_vec,
                        Predicted = p) %>%
    mutate(SqErr = (Predicted - Actual)^2)
  return(list(AdjRsq = summary(m)$adj.r.squared, # adjusted R-squared of the training fit
              RMSE = sqrt(mean(results$SqErr)) # RMSE of the testing predictions
              ))
}

testing_dfs <- list(select(sales.testing, -SalesPrice),
                    select(sales.no.testing, -SalesPrice),
                    select(sales.tx.testing, -SalesPrice))
class_vecs <- list(sales.testing$SalesPrice, sales.no.testing$SalesPrice, sales.tx.testing$SalesPrice)
data_names <- list("Original", "Outliers Removed", "Outliers Removed and Values Transformed")

# Map the function to each of the 3 model-testing-classification pairs
Map(AssessModel, models, testing_dfs, class_vecs) %>%
  bind_rows() %>%
  mutate(Dataset = data_names, .before="AdjRsq") %>%
  kable()
|Dataset                                 |    AdjRsq|     RMSE|
|:---------------------------------------|---------:|--------:|
|Original                                | 0.7491255| 63140.56|
|Outliers Removed                        | 0.6603189| 20271.34|
|Outliers Removed and Values Transformed | 0.6336161| 28207.79|

Based on the table above, the model created from the Original dataset has the best fit to its training set but the greatest error when making predictions on the testing set. Thus, we can argue that the model created from the Outliers Removed dataset performed best in terms of prediction error: it has the next-highest Adjusted R-squared and the lowest RMSE of the three models.

This poses an interesting question, however. Since outliers were removed from the latter two datasets, how well-equipped are those models to handle unseen data, which may contain outliers? Removing outliers may produce the best performance metrics on the data currently available, but not necessarily the best model for all future cases. On that basis, we could also argue that the model from the Original dataset is best, as it explains the most variation, accepting that any model will inevitably have higher error when encountering outliers.

It should be noted that, of the 4 outlier rows detected, 3 landed in the training set and 1 in the testing set (by random splitting).

9 95% Prediction Intervals

Lastly, we calculate the 95% prediction interval for SalesPrice for each data point in the three testing datasets; each interval is expected to contain the true sale price of a comparable new observation 95% of the time.

# Calculate a df of actual, fit, lwr/upr prediction intervals for each dataset
p_intervals <- Map(function(m, testing_df, class_vec, data_name) {
  predict(m, testing_df, interval="prediction", level=0.95) %>%
    as.data.frame() %>%
    mutate(actual = class_vec,
           dataset = data_name, .before="actual")
}, models, testing_dfs, class_vecs, data_names)

# Check
p_intervals
## [[1]]
##         fit       lwr      upr actual  dataset
## 21 572711.0 483494.05 661927.9 360000 Original
## 34 179683.7 105135.46 254231.9 198000 Original
## 40 452578.1 371951.00 533205.1 450000 Original
## 44 229854.3 158911.32 300797.3 229500 Original
## 46 162587.9  91129.67 234046.2 157500 Original
## 50 198856.2 127218.73 270493.7 175500 Original
## 54 174323.0  99501.77 249144.1 194220 Original
## 72 272910.8 201469.66 344351.9 288000 Original
## 78 158899.2  86509.70 231288.6 163800 Original
## 81 215980.0 144153.09 287807.0 230400 Original
## 84 177385.3 105992.06 248778.6 203400 Original
## 89 198219.6 124449.51 271989.7 210420 Original
## 
## [[2]]
##         fit      lwr      upr actual          dataset
## 13 231973.3 176361.0 287585.6 219600 Outliers Removed
## 19 325979.8 266667.3 385292.4 337500 Outliers Removed
## 21 343040.9 283279.6 402802.1 360000 Outliers Removed
## 22 297045.5 240767.9 353323.0 356400 Outliers Removed
## 32 158000.2 102419.2 213581.3 144000 Outliers Removed
## 33 200575.3 143711.1 257439.5 198000 Outliers Removed
## 55 171520.8 116112.2 226929.3 166500 Outliers Removed
## 59 186381.3 131183.0 241579.5 173700 Outliers Removed
## 82 223334.4 166817.0 279851.8 205200 Outliers Removed
## 84 199423.7 143727.6 255119.9 203400 Outliers Removed
## 85 197484.6 141513.7 253455.6 207000 Outliers Removed
## 89 207122.8 150270.6 263974.9 210420 Outliers Removed
## 
## [[3]]
##         fit       lwr      upr actual                                 dataset
## 13 226667.3 169080.56 284254.0 219600 Outliers Removed and Values Transformed
## 19 307766.5 247364.34 368168.7 337500 Outliers Removed and Values Transformed
## 21 307016.5 247923.22 366109.8 360000 Outliers Removed and Values Transformed
## 22 284743.6 226862.68 342624.4 356400 Outliers Removed and Values Transformed
## 32 148574.9  90228.56 206921.3 144000 Outliers Removed and Values Transformed
## 33 195224.9 135566.92 254882.8 198000 Outliers Removed and Values Transformed
## 55 167093.8 109186.41 225001.2 166500 Outliers Removed and Values Transformed
## 59 186564.5 129164.09 243964.9 173700 Outliers Removed and Values Transformed
## 82 224112.4 165381.84 282843.0 205200 Outliers Removed and Values Transformed
## 84 202391.1 144566.36 260215.8 203400 Outliers Removed and Values Transformed
## 85 196888.7 138667.59 255109.8 207000 Outliers Removed and Values Transformed
## 89 214520.6 155623.34 273417.8 210420 Outliers Removed and Values Transformed
# Visualize
map(p_intervals, ~ {
  data_name <- .x$dataset[1]
  .x %>%
    ggplot(aes(x=actual, y=fit)) +
    geom_point() +
    geom_errorbar(aes(
      ymin = lwr,
      ymax = upr
    )) +
    geom_abline(slope=1, intercept=0, color="red", linetype="dashed") +
    theme_bw() +
    labs(title = "Multiple Linear Regression Predictions from Validation Set",
         subtitle = data_name,
         x = "Actual Home Sale Price",
         y = "Predicted Home Sale Price",
         caption = "Red dashed line is the line of identity\nVertical bars indicate 95% prediction intervals") +
    scale_x_continuous(labels = scales::dollar, limits=c(75000, 475000)) +
    scale_y_continuous(labels = scales::dollar, limits=c(75000, 675000))
})
(Three scatterplots follow, one per dataset: predicted vs. actual home sale price, with 95% prediction-interval error bars and the dashed line of identity.)

The scatterplot for the “Original” dataset confirms our earlier observation. Upon manual inspection, the point that deviates farthest from the line of identity is sale #21, whose actual price of $360,000 is overpredicted at roughly $572,700; this single large error accounts for most of the high RMSE we saw in the previous section for the “Original” dataset. Note also that sale #40, which was an outlier for both SalesPrice ($450,000) and LivingAreaSqFt (2,338.66 sq. ft.), was by chance assigned to this validation subset. That datapoint is not present in the “Outliers Removed” and “Outliers Removed and Values Transformed” datasets, whose predictions had lower RMSE. We can also observe that the prediction intervals are tighter for these two datasets than for the “Original,” understandably because there is less variation in the data once the outliers are removed.
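As a final hedged check (not in the original output), we can compute the empirical coverage of the 95% prediction intervals, i.e., the fraction of actual prices that fall inside their intervals:

sapply(p_intervals, function(df) mean(df$actual >= df$lwr & df$actual <= df$upr))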
