suppressPackageStartupMessages({
library(tidyverse)
library(knitr)
library(Hmisc)
library(caret)
library(psych)
})
First, we import the Home Sales dataset. This contains various home features and the target variable SalesPrice. The purpose of this exercise is to apply Multiple Linear Regression to predict a home’s final sale price based on its features. After importing, we also manually correct some formatting issues with the file encoding, empty row at the bottom, meaningless row index feature, and comma in numerical cells.
sales.df <- read.csv("https://s3.us-east-2.amazonaws.com/artificium.us/datasets/HomeSalesUFFIData.csv",
header = TRUE,
stringsAsFactors = FALSE, # all numeric
fileEncoding="UTF-8-BOM", # removes BOM symbol from first header name
nrows = 99) %>% # row 100 is empty
mutate(SalesPrice = parse_number(SalesPrice)) %>% # get rid of comma to make numerical
select(-Observation) # get rid of meaningless feature
Next, we create a function to identify outliers in continuous features. We find a few outliers in SalesPrice and LivingAreaSqFt, remove them (a total of 4 observations), then save the version of the data with outliers removed in a different dataframe.
# Create function to identify indices of Z-score outliers
index_outliers <- function(x) {
x <- x[!is.na(x)]
x_mean <- mean(x)
x_sd <- sd(x)
z_scores <- abs(x - x_mean) / x_sd
# list("outliers" = x[z_scores > 2.8],
# "remaining" = x[z_scores <= 2.8],
# "indices" = z_scores > 2.8)
return(z_scores > 2.8)
}
# Store indices of non-binary
numeric_var_ind <- apply(sales.df, MARGIN=2, function(x) length(unique(x)) > 2)
# List outliers for all numeric variables
lapply(sales.df[numeric_var_ind], function(x) {x[index_outliers(x)]})
## $YearSold
## integer(0)
##
## $SalesPrice
## [1] 450000 450000 624600
##
## $Finished.Bsmnt
## integer(0)
##
## $LotSizeSqFt
## [1] 11650
##
## $NumEncParkSpaces
## integer(0)
##
## $LivingAreaSqFt
## [1] 2338.66 2042.95
# Create a mask that indicates whether any feature was an outlier for that row
mask <- lapply(sales.df[numeric_var_ind], function(x) {index_outliers(x)}) %>%
bind_cols() %>%
apply(MARGIN=1, any)
# Apply mask to create new df with outliers excluded
sales.no.df <- sales.df[!mask,]
# Visually check the rows that were excluded
sales.df[mask,]
We then apply the Shapiro-Wilk test in the outliers-removed dataset to check for normality. We also check for normality after transforming the data in various ways. Note that a p-value greater than 0.05 would indicate a normal distribution.
# Visualize original distributions
hist.data.frame(sales.no.df[numeric_var_ind])
# Calculate Shapiro Wilk p value from original and various transformed distributions
original <- apply(sales.no.df[numeric_var_ind], MARGIN=2, function(x) shapiro.test(x)$p.value)
log10 <- apply(sales.no.df[numeric_var_ind], MARGIN=2, function(x) shapiro.test(log10(x+.001))$p.value)
inverse <- apply(sales.no.df[numeric_var_ind], MARGIN=2, function(x) shapiro.test(1/(x+.001))$p.value)
sqrt <- apply(sales.no.df[numeric_var_ind], MARGIN=2, function(x) shapiro.test(sqrt(x))$p.value)
rbind(original, log10, inverse, sqrt) %>% as.data.frame()
Taking the optimal transform from above (whichever gives us the highest p-value), we transform the continuous variables. Note that we do not transform the target variable. Specifically, LotSizeSqFt is root-transformed, and LivingAreaSqFt is inverse-transformed. We then visualize the new distributions and confirm no NAs as a final check. While these may not be considered normal per the Shapiro-Wilk test, it should be noted that this does not invalidate the use of multiple linear regression.
# Apply transforms
sales.tx <- sales.no.df %>%
mutate(LotSizeSqFt = sqrt(LotSizeSqFt), # all values are positive integers
LivingAreaSqFt = 1/LivingAreaSqFt) # all values are positive integers
# Re-check post-transform
hist.data.frame(sales.tx[numeric_var_ind])
# Confirm no NA
apply(sales.tx, MARGIN=2, function(x) sum(is.na(x)))
## YearSold SalesPrice UFFI.Present HasBrickExt
## 0 0 0 0
## Gt45YrOld Finished.Bsmnt LotSizeSqFt NumEncParkSpaces
## 0 0 0 0
## LivingAreaSqFt HasAC HasPool
## 0 0 0
Next, we explore colinearity in the data.
pairs.panels(sales.df[numeric_var_ind])
pairs.panels(sales.no.df[numeric_var_ind])
pairs.panels(sales.tx[numeric_var_ind])
We notice the following notable correlations that are above 0.6 to the target variable:
There are no notable correlations between features, which suggests that each feature is non-redundant.
Now, we can partition the data into 85% for training and 15% for testing. We use random sampling without replacement. Note that we choose to use the same randomly generated indices for the No Outliers dataset and the No Outliers + Data Transformed dataset, since they contain the same number of rows and this will allow for more direct comparison.
set.seed(33452)
training_indices <- createDataPartition(sales.df$SalesPrice, times=1, p=0.85)[[1]]
sales.training <- sales.df[training_indices,]
sales.testing <- sales.df[-training_indices,]
training_indices <- createDataPartition(sales.no.df$SalesPrice, times=1, p=0.85)[[1]]
sales.no.training <- sales.no.df[training_indices,]
sales.no.testing <- sales.no.df[-training_indices,]
sales.tx.training <- sales.tx[training_indices,]
sales.tx.testing <- sales.tx[-training_indices,]
Next, we build three multiple regression models from the three training datasets using a backward elimination based on p-value to predict SalesPrice. This means that the feature with the highest p-value is eliminated in each iteration until all features are significantly contributing. The three resulting models are shown below.
models <- map(list(sales.training, sales.no.training, sales.tx.training), function(x) {
form_str <- "SalesPrice ~ ."
m <- lm(as.formula(form_str), data=x)
p_vals <- summary(m)$coefficients[,4]
remaining <- names(x)[names(x) != "SalesPrice"] # initialize vector to store all feature names
while(any(p_vals>0.05)) {
elim <- which(p_vals == max(p_vals)) # obtain feature with highest p value
remaining <- remaining[remaining != names(elim)] # remove the feature from the remaining list
form_str <- paste("SalesPrice ~", paste(remaining, collapse=" + "))
m <- lm(as.formula(form_str), data=x)
p_vals <- summary(m)$coefficients[,4]
}
return(m)
})
lapply(models, summary)
## [[1]]
##
## Call:
## lm(formula = as.formula(form_str), data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112603 -18955 -415 17014 112603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.113e+07 3.627e+06 -5.824 1.08e-07 ***
## YearSold 1.051e+04 1.798e+03 5.848 9.77e-08 ***
## NumEncParkSpaces 1.751e+04 6.404e+03 2.735 0.00765 **
## LivingAreaSqFt 1.184e+02 1.429e+01 8.281 1.93e-12 ***
## HasPool 1.848e+05 2.539e+04 7.277 1.85e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35420 on 82 degrees of freedom
## Multiple R-squared: 0.7608, Adjusted R-squared: 0.7491
## F-statistic: 65.2 on 4 and 82 DF, p-value: < 2.2e-16
##
##
## [[2]]
##
## Call:
## lm(formula = as.formula(form_str), data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78520 -17644 1072 17120 99921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.035e+07 2.928e+06 -6.951 9.77e-10 ***
## YearSold 1.014e+04 1.452e+03 6.985 8.40e-10 ***
## Finished.Bsmnt 2.974e+01 1.290e+01 2.305 0.0238 *
## NumEncParkSpaces 1.200e+04 4.884e+03 2.456 0.0163 *
## LivingAreaSqFt 8.014e+01 1.316e+01 6.089 4.00e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27150 on 78 degrees of freedom
## Multiple R-squared: 0.6769, Adjusted R-squared: 0.6603
## F-statistic: 40.85 on 4 and 78 DF, p-value: < 2.2e-16
##
##
## [[3]]
##
## Call:
## lm(formula = as.formula(form_str), data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81338 -17608 -416 19583 97456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.087e+07 3.037e+06 -6.873 1.37e-09 ***
## YearSold 1.046e+04 1.503e+03 6.962 9.29e-10 ***
## Finished.Bsmnt 3.413e+01 1.353e+01 2.522 0.01369 *
## NumEncParkSpaces 1.517e+04 5.022e+03 3.020 0.00341 **
## LivingAreaSqFt -4.802e+07 8.965e+06 -5.356 8.33e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28200 on 78 degrees of freedom
## Multiple R-squared: 0.6515, Adjusted R-squared: 0.6336
## F-statistic: 36.45 on 4 and 78 DF, p-value: < 2.2e-16
Interestingly, we can see that all 3 models utilize 4 features, but the model built from the original dataset uses HasPool instead of Finished.Bsment like the other two.
Next, we create a function to assess the Adjusted R-Squared and RMSE of each model on their respective testing datasets.
AssessModel <- function(m, testing_df, class_vec) {
# testing_df: dataframe of testing cases, doesn't contain target variable
# class_vec: vector of classifications for testing_df in the same order
p <- predict(m, testing_df)
results <- data.frame(Actual = class_vec,
Predicted = p) %>%
mutate(SqErr = (Predicted - Actual)^2)
return(list(AdjRsq = summary(m)$adj.r.squared,
RMSE = sqrt(mean(results$SqErr))
))
}
testing_dfs <- list(select(sales.testing, -SalesPrice),
select(sales.no.testing, -SalesPrice),
select(sales.tx.testing, -SalesPrice))
class_vecs <- list(sales.testing$SalesPrice, sales.no.testing$SalesPrice, sales.tx.testing$SalesPrice)
data_names <- list("Original", "Outliers Removed", "Outliers Removed and Values Transformed")
# Map the function to each of the 3 model-testing-classification pairs
Map(AssessModel, models, testing_dfs, class_vecs) %>%
bind_rows() %>%
mutate(Dataset = data_names, .before="AdjRsq") %>%
kable()
| Dataset | AdjRsq | RMSE |
|---|---|---|
| Original | 0.7491255 | 63140.56 |
| Outliers Removed | 0.6603189 | 20271.34 |
| Outliers Removed and Values Transformed | 0.6336161 | 28207.79 |
Based on the table above, the model created from the Original dataset has the best fit to its training set but has the greatest error when making predictions from the testing set. Thus, we can argue that the model created from the Outliers Removed dataset performed best in terms of error in its predictions: it has the next highest Adjusted R-squared and the lowest RMSE of all 3 models.
This poses an interesting question, however. Since outliers were removed from the latter 2 models, how well-equipped is it in handling unseen data in the future, which may contain outliers? Removing outliers, while it may produce the best performance metrics in predicting a dataset that is currently available, may not produce the best model to handle all future cases. Based on this, we could also argue that the model from the Original dataset is best, as it explains the most variation; it will inevitably have higher error when encountering outliers.
It should be noted that, of the 4 outliers detected, 3 are present in the training set, and 1 are in the testing set (by random splitting).
Lastly, we calculate the 95% prediction interval for SalesPrice for each data point in the 3 validation datasets.
# Calculate a df of actual, fit, lwr/upr prediction intervals for each dataset
p_intervals <- Map(function(m, testing_df, class_vec, data_name) {
predict(m, testing_df, interval="prediction", level=0.95) %>%
as.data.frame() %>%
mutate(actual = class_vec,
dataset = data_name, .before="actual")
}, models, testing_dfs, class_vecs, data_names)
# Check
p_intervals
## [[1]]
## fit lwr upr actual dataset
## 21 572711.0 483494.05 661927.9 360000 Original
## 34 179683.7 105135.46 254231.9 198000 Original
## 40 452578.1 371951.00 533205.1 450000 Original
## 44 229854.3 158911.32 300797.3 229500 Original
## 46 162587.9 91129.67 234046.2 157500 Original
## 50 198856.2 127218.73 270493.7 175500 Original
## 54 174323.0 99501.77 249144.1 194220 Original
## 72 272910.8 201469.66 344351.9 288000 Original
## 78 158899.2 86509.70 231288.6 163800 Original
## 81 215980.0 144153.09 287807.0 230400 Original
## 84 177385.3 105992.06 248778.6 203400 Original
## 89 198219.6 124449.51 271989.7 210420 Original
##
## [[2]]
## fit lwr upr actual dataset
## 13 231973.3 176361.0 287585.6 219600 Outliers Removed
## 19 325979.8 266667.3 385292.4 337500 Outliers Removed
## 21 343040.9 283279.6 402802.1 360000 Outliers Removed
## 22 297045.5 240767.9 353323.0 356400 Outliers Removed
## 32 158000.2 102419.2 213581.3 144000 Outliers Removed
## 33 200575.3 143711.1 257439.5 198000 Outliers Removed
## 55 171520.8 116112.2 226929.3 166500 Outliers Removed
## 59 186381.3 131183.0 241579.5 173700 Outliers Removed
## 82 223334.4 166817.0 279851.8 205200 Outliers Removed
## 84 199423.7 143727.6 255119.9 203400 Outliers Removed
## 85 197484.6 141513.7 253455.6 207000 Outliers Removed
## 89 207122.8 150270.6 263974.9 210420 Outliers Removed
##
## [[3]]
## fit lwr upr actual dataset
## 13 226667.3 169080.56 284254.0 219600 Outliers Removed and Values Transformed
## 19 307766.5 247364.34 368168.7 337500 Outliers Removed and Values Transformed
## 21 307016.5 247923.22 366109.8 360000 Outliers Removed and Values Transformed
## 22 284743.6 226862.68 342624.4 356400 Outliers Removed and Values Transformed
## 32 148574.9 90228.56 206921.3 144000 Outliers Removed and Values Transformed
## 33 195224.9 135566.92 254882.8 198000 Outliers Removed and Values Transformed
## 55 167093.8 109186.41 225001.2 166500 Outliers Removed and Values Transformed
## 59 186564.5 129164.09 243964.9 173700 Outliers Removed and Values Transformed
## 82 224112.4 165381.84 282843.0 205200 Outliers Removed and Values Transformed
## 84 202391.1 144566.36 260215.8 203400 Outliers Removed and Values Transformed
## 85 196888.7 138667.59 255109.8 207000 Outliers Removed and Values Transformed
## 89 214520.6 155623.34 273417.8 210420 Outliers Removed and Values Transformed
# Visualize
map(p_intervals, ~ {
data_name <- .x$dataset[1]
.x %>%
ggplot(aes(x=actual, y=fit)) +
geom_point() +
geom_errorbar(aes(
ymin = lwr,
ymax = upr
)) +
geom_abline(slope=1, intercept=0, color="red", linetype="dashed") +
theme_bw() +
labs(title = "Multiple Linear Regression Predictions from Validation Set",
subtitle = data_name,
x = "Actual Home Sale Price",
y = "Predicted Home Sale Price",
caption = "Red dashed line is the line of identity
Vertical bars indicate 95% prediction intervals") +
scale_x_continuous(labels = scales::dollar, limits=c(75000, 475000)) +
scale_y_continuous(labels = scales::dollar, limits=c(75000, 675000))
})
## [[1]]
##
## [[2]]
##
## [[3]]
The scatterplot for the “Original” dataset confirms our earlier observation: upon manual inspection, we can see that the point that deviates far from the line of identity is sale #40 in the original dataset, which was an outlier for SalesPrice ($4.5^{5}) and for LivingAreaSqFt (2338.66). By chance, this outlier was selected for the validation subset and may explain the highest RMSE we saw in the previous section for the “Original” dataset. This datapoint was not present in the “Outliers Removed” and “Outliers Removed and Values Transformed” datasets, whose predictions had a lower RMSE. We can also observe that the prediction intervals are tighter for these two datasets compared to the “Original,” understandably because there is less variation in the data with outliers removed.
Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.