Setup/Imports

The section below shows all the imports needed to run the .Rmd file:

library(gridExtra)
library(GGally)
library(psychometric)
library(tidyverse)
library(matrixcalc)
library(MASS)
library(car)
library(fastDummies)
library(lmtest)

Introduction

The work presented here uses data from Kaggle’s House Prices: Advanced Regression Techniques competition. The training data from that page is loaded in the cell below:

df <- read.csv("Data/train.csv")
dim(df)
## [1] 1460   81

Each observation in the dataset represents a home sale in Ames, Iowa. We see from the above output that it contains data from 1,460 home sales, each with information on 79 different explanatory variables. The dataset also includes the target variable, sale price. To help limit the scope of the data, the following cell selects a subset of nine explanatory fields (along with the target variable):

df <- df %>% dplyr::select(
  Utilities, BldgType, Neighborhood, OverallCond, OverallQual, 
  TotalBsmtSF, GrLivArea, YearBuilt, YrSold, SalePrice
)

The following table provides a description of each field that remains in df:

Field Name Variable Type Data Type Description
Utilities Categorical String Type of utilities available. One of AllPub (all public utilities), NoSewr (electricity, gas, and water), NoSeWa (electricity and gas only), or ELO (electricity only).
BldgType Categorical String Type of dwelling. One of 1Fam (single-family detached), 2FmCon (two-family conversion; originally built as a one-family dwelling), Duplx (duplex), TwnhsE (townhouse end unit), or TwnhsI (townhouse inside unit).
Neighborhood Categorical String Physical locations within Ames city limits. One of 25 possible neighborhoods.
OverallCond Finite Numeric Integer Rates the overall condition of the house (1-10).
OverallQual Finite Numeric Integer Rates the overall material and finish of the house (1-10).
TotalBsmtSF Continuous Numeric Integer Total square feet of basement area.
GrLivArea Continuous Numeric Integer Above grade (ground) living area square feet.
YearBuilt Continuous Numeric Integer Original construction date (YYYY).
YrSold Continuous Numeric Integer Year Sold (YYYY).
SalePrice Continuous Numeric Integer The property’s sale price in dollars. Included as this is the target variable we are trying to predict.

The fields above were chosen because they provide a decent summary of the different considerations one might weigh when buying a home (e.g., size, type, quality).
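As an optional quick check (not part of the original workflow), dplyr’s glimpse() can be used to confirm that the remaining columns match the types described in the table above:

# optional: confirm the remaining column names and data types
glimpse(df)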

Descriptive and Inferential Statistics

The R cell below provides some useful descriptive statistics for the numeric fields in our dataset:

summary(df %>% dplyr::select(-Utilities, -BldgType))
##  Neighborhood        OverallCond     OverallQual      TotalBsmtSF    
##  Length:1460        Min.   :1.000   Min.   : 1.000   Min.   :   0.0  
##  Class :character   1st Qu.:5.000   1st Qu.: 5.000   1st Qu.: 795.8  
##  Mode  :character   Median :5.000   Median : 6.000   Median : 991.5  
##                     Mean   :5.575   Mean   : 6.099   Mean   :1057.4  
##                     3rd Qu.:6.000   3rd Qu.: 7.000   3rd Qu.:1298.2  
##                     Max.   :9.000   Max.   :10.000   Max.   :6110.0  
##    GrLivArea      YearBuilt        YrSold       SalePrice     
##  Min.   : 334   Min.   :1872   Min.   :2006   Min.   : 34900  
##  1st Qu.:1130   1st Qu.:1954   1st Qu.:2007   1st Qu.:129975  
##  Median :1464   Median :1973   Median :2008   Median :163000  
##  Mean   :1515   Mean   :1971   Mean   :2008   Mean   :180921  
##  3rd Qu.:1777   3rd Qu.:2000   3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :5642   Max.   :2010   Max.   :2010   Max.   :755000

Already, we can pull out some interesting information:

  • The average SalePrice across all homes is around $181,000.
  • The construction dates of the homes span all the way from 1872 to 2010.
  • The smallest home sold is just 334 sq. ft. (still pretty good by NYC standards!)

The cells below provide tables of counts for our categorical variables:

print(table(df$Utilities))
## 
## AllPub NoSeWa 
##   1459      1
print(table(df$BldgType))
## 
##   1Fam 2fmCon Duplex  Twnhs TwnhsE 
##   1220     31     52     43    114
print(table(df$Neighborhood))
## 
## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert  IDOTRR 
##      17       2      16      58      28     150      51     100      79      37 
## MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes OldTown  Sawyer SawyerW 
##      17      49     225      41       9      77      73     113      74      59 
## Somerst StoneBr   SWISU  Timber Veenker 
##      86      25      25      38      11

We see right away from the table above that the variable Utilities is the same for all houses except one, and as such will end up having little to no predictive power. Thus, it is removed in the cell below:

df <- df %>% 
    dplyr::select(-Utilities)

From the table of BldgType, we see that most of the homes included in this analysis are single-family homes, and from the Neighborhood table that they come from a wide range of neighborhoods.

Next, we can look at the distribution of each of our numeric variables:

numeric_fields <- df %>%
  dplyr::select(-BldgType, -Neighborhood)

numeric_fields %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +  
  geom_histogram(aes(y = ..density..)) + 
  geom_density() + 
  theme(
    axis.text.x=element_blank(),
    axis.ticks.x=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see that most of the distributions maintain a certain level of normality, with the exceptions of YrSold and YearBuilt. In the YearBuilt distribution, we can clearly see the different housing booms experienced in the United States.

The cell below creates scatterplots of our numeric fields with the target variable, SalePrice:

sp1 <- ggplot(data=df, aes(OverallCond, SalePrice)) + geom_point() +
  geom_smooth(method='lm')
sp2 <- ggplot(data=df, aes(OverallQual, SalePrice)) + geom_point() +
  geom_smooth(method='lm')
sp3 <- ggplot(data=df, aes(TotalBsmtSF, SalePrice)) + geom_point() +
  geom_smooth(method='lm')
sp4 <- ggplot(data=df, aes(GrLivArea, SalePrice)) + geom_point() +
  geom_smooth(method='lm')
sp5 <- ggplot(data=df, aes(YearBuilt, SalePrice)) + geom_point() +
  geom_smooth(method='lm')
sp6 <- ggplot(data=df, aes(YrSold, SalePrice)) + geom_point() +
  geom_smooth(method='lm')
scatterplots <- list(sp1, sp2, sp3, sp4, sp5, sp6)

grid.arrange(grobs=scatterplots, ncol=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Overall, we can see a certain level of (mostly positive) correlation between SalePrice and each of the chosen explanatory variables, with the exception of YrSold, indicating that most of these variables may prove useful when actually building our model.

We can check the correlations between the independent variables by producing a pairs plot with correlation coefficients:

ggpairs(dplyr::select(numeric_fields, -SalePrice), axisLabels = "none")
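As a compact numeric companion to the pairs plot (an optional supplementary view), the plain correlation matrix can also be printed:

# pairwise correlation coefficients, rounded for readability
round(cor(dplyr::select(numeric_fields, -SalePrice)), 2)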

We see indeed that many of the independent variables (with the exception of YrSold) have non-zero, statistically significant correlations with one another. We can compute the \(p\)-values and 80% confidence intervals of these correlation coefficients below:

numeric_fields <- numeric_fields %>%
  dplyr::select(-SalePrice)

cor_df <- data.frame(matrix(ncol = 7))
colnames(cor_df) <- c("key", "x", "y", "p_value", "r", "CI_lower", "CI_upper")

for (i in 1:length(colnames(numeric_fields))){
  for (j in 1:length(colnames(numeric_fields))){
    key <- paste(colnames(numeric_fields)[i], 
                 colnames(numeric_fields)[j], sep='')
    rev_key <- paste(colnames(numeric_fields)[j], 
                 colnames(numeric_fields)[i], sep='')
    p_value <- cor.test(numeric_fields[,i], numeric_fields[,j])$p.value
    r_value <- cor(numeric_fields[,i], numeric_fields[,j])
    CI <- CIr(r_value, n=dim(numeric_fields)[1], level=0.8)
    df_row <- list(
      key, colnames(numeric_fields)[i], colnames(numeric_fields)[j],
      p_value, r_value, CI[1], CI[2])
    # skip the diagonal (a variable paired with itself gives a p-value of
    # exactly 0) and skip pairs already recorded in the reverse order
    if (p_value > 0 & !(rev_key %in% cor_df[,1])) {
      cor_df <- rbind(cor_df, df_row)
    }
  }
}

cor_df <- tail(cor_df, -1)
cor_df <- cor_df %>%
  dplyr::select(-key)
cor_df
##              x           y       p_value           r    CI_lower     CI_upper
## 2  OverallCond OverallQual  4.361720e-04 -0.09193234 -0.12510797 -0.058551357
## 3  OverallCond TotalBsmtSF  4.685261e-11 -0.17109751 -0.20349065 -0.138330210
## 4  OverallCond   GrLivArea  2.311138e-03 -0.07968587 -0.11294544 -0.046247911
## 5  OverallCond   YearBuilt  3.090962e-50 -0.37598320 -0.40444135 -0.346797661
## 6  OverallCond      YrSold  9.321243e-02  0.04394975  0.01040345  0.077397222
## 7  OverallQual TotalBsmtSF 3.130271e-110  0.53780850  0.51351568  0.561239907
## 8  OverallQual   GrLivArea 2.247511e-139  0.59300743  0.57080615  0.614342245
## 9  OverallQual   YearBuilt 8.400355e-128  0.57232277  0.54931237  0.594465855
## 10 OverallQual      YrSold  2.963852e-01 -0.02734671 -0.06085250  0.006220641
## 11 TotalBsmtSF   GrLivArea  1.857870e-75  0.45486820  0.42783799  0.481085531
## 12 TotalBsmtSF   YearBuilt  1.161332e-54  0.39145200  0.36265484  0.419502316
## 13 TotalBsmtSF      YrSold  5.676652e-01 -0.01496865 -0.04850592  0.018602337
## 14   GrLivArea   YearBuilt  1.660315e-14  0.19900971  0.16656055  0.231028294
## 15   GrLivArea      YrSold  1.630401e-01 -0.03652582 -0.07000165 -0.002967818
## 16   YearBuilt      YrSold  6.031268e-01 -0.01361768 -0.04715777  0.019953079

Because the above output is the result of conducting a series of hypothesis tests of the same kind, family-wise error should be a concern. The family-wise error rate can be calculated via the expression \(1-(1-\alpha)^n\), where \(n\) is the number of tests and \(\alpha\) is the chosen significance level. Thus, as the number of hypothesis tests conducted increases, the probability of making at least one Type I error (false positive) increases. One way to account for family-wise error is to apply what is known as a Bonferroni correction, in which the desired significance level is divided by the number of tests: \(\alpha_{\text{new}}= \alpha_{\text{old}}/n\). With the 15 unique pairwise tests above, the new level of significance is \(0.05 / 15 \approx 0.0033\). Of these 15 correlation tests, 10 had \(p\)-values below the original 0.05 significance level, and the same 10 remain below the corrected level:

a <- 0.05 / 15
length(which((cor_df$p_value < a) == TRUE))
## [1] 10
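Equivalently (as a sketch of an alternative check), base R’s p.adjust() applies the Bonferroni correction to the vector of \(p\)-values directly, so counting adjusted \(p\)-values below 0.05 should reproduce the result above:

# Bonferroni-adjusted p-values compared against the original 0.05 level
sum(p.adjust(cor_df$p_value, method = "bonferroni") < 0.05)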

Evaluate Collinearity Using Precision Matrix

We can get the precision matrix by inverting the correlation matrix:

cor_mat <- unname(cor(numeric_fields))
precision_mat <- solve(cor_mat)

As expected, multiplying the precision matrix by the correlation matrix (in either order) returns the identity matrix, since \(AA^{-1} = A^{-1}A = I\).

round(cor_mat %*% precision_mat)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    1    0    0    0    0
## [3,]    0    0    1    0    0    0
## [4,]    0    0    0    1    0    0
## [5,]    0    0    0    0    1    0
## [6,]    0    0    0    0    0    1
round(precision_mat %*% cor_mat)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    1    0    0    0    0
## [3,]    0    0    1    0    0    0
## [4,]    0    0    0    1    0    0
## [5,]    0    0    0    0    1    0
## [6,]    0    0    0    0    0    1

We can also generate the LU decomposition of the precision matrix:

decomp <- lu.decomposition(precision_mat)
L <- decomp$L
U <- decomp$U

Given a matrix \(M\), the process of LU decomposition returns a lower triangular matrix \(L\) and upper triangular matrix \(U\) such that \(M = LU\). We can confirm this below by showing that the product \(LU\) returns the precision matrix:

round(L %*% U, 5) == round(precision_mat, 5)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE

The diagonal of the precision matrix holds the variance inflation factors (VIFs), which provide us with another way to analyze the multicollinearity between the explanatory variables:

  • VIF equal to 1 = variables are not correlated.
  • VIF between 1 and 5 = variables are moderately correlated.
  • VIF greater than 5 = variables are highly correlated.
rbind(colnames(numeric_fields), diag(precision_mat))
##      [,1]               [,2]               [,3]              
## [1,] "OverallCond"      "OverallQual"      "TotalBsmtSF"     
## [2,] "1.22635762343363" "2.55660044976677" "1.51945746577235"
##      [,4]               [,5]               [,6]              
## [1,] "GrLivArea"        "YearBuilt"        "YrSold"          
## [2,] "1.72890104025723" "1.90655912441781" "1.00337484972296"

Using this methodology, we can see that there is some level of collinearity among all of our explanatory variables (once again, with the exception of YrSold). This is expected, given the level of correlation that the other variables had with one another. However, VIF is typically a better indicator of collinearity than pairwise correlation, and it is generally accepted that predictor variables with VIF values less than 5 are acceptable to use in a linear model.
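As an optional cross-check (not part of the original analysis), the same values should be returned by car’s vif() applied to a regression of SalePrice on these numeric predictors, since each predictor’s VIF depends only on the other predictors:

# VIFs from a fitted model should match the diagonal of the precision matrix
vif(lm(df$SalePrice ~ ., data = numeric_fields))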

Fitting the Target Variable

A histogram and accompanying density plot estimation of the target variable, SalePrice, is shown below:

ggplot(data=df, aes(x=SalePrice)) +
  geom_histogram(aes(y = ..density..)) + 
  geom_density() 

Visually, the data appears slightly right-skewed, which we can confirm below by showing that the mean is greater than the median:

print(median(df$SalePrice))
## [1] 163000
print(mean(df$SalePrice))
## [1] 180921.2
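We can also quantify the asymmetry with a simple moment-based skewness estimate (computed by hand here rather than with an additional package); a positive value indicates a right skew:

# third standardized moment of SalePrice
x <- df$SalePrice
sum((x - mean(x))^3) / (length(x) * sd(x)^3)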

As such, it could be appropriate to attempt to fit the distribution of SalePrice with an exponential probability density function (PDF). The exponential distribution, given by the formula below,

\[ \begin{equation} f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } 0 \leq x < \infty\\ 0 & \text{otherwise } \end{cases} \end{equation} \]

has only one parameter, \(\lambda\), that is required for the fit. The best-fit value of \(\lambda\) is determined below using the fitdistr function from the MASS package:

l_fit <-unname(fitdistr(df$SalePrice, densfun = 'exponential')$estimate)
l_fit
## [1] 5.527268e-06
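As a sanity check, the maximum-likelihood estimate of \(\lambda\) for an exponential distribution is the reciprocal of the sample mean, so the estimate above should equal 1 divided by the average sale price:

# MLE of the exponential rate parameter is 1 / sample mean
1 / mean(df$SalePrice)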

We can then use this value of \(\lambda\) to simulate the distribution of SalePrice, which is done below by drawing the same number of samples as there are observations in the data (1,460):

set.seed(42)
sim_dist <- rexp(dim(df)[1], rate=l_fit)
fit_df <- as.data.frame(cbind(sim_dist, df$SalePrice))
colnames(fit_df) <- c('simulated', 'actual')
fit_df <- pivot_longer(fit_df, cols=c('simulated', 'actual'), 
                       names_to = "data_type", values_to = 'value')

ggplot(data=fit_df, aes(x=value, fill=data_type)) +
  geom_density()

We can see in the above that the simulation in this case is not a great approximation of the actual data: the actual data has a much higher peak and a shorter tail than the simulated exponential data, which also places substantial mass near zero where no actual sales occur.

We can get the theoretical 5th and 95th percentiles of the data by integrating the exponential PDF to find the cumulative distribution function (CDF), \(F_X(x)\). This is done below:

\[ \begin{align} F_X(x) &= \int_0^{x} \lambda e^{-\lambda t} \ dt \\ &= -e^{-\lambda t}\ \Big|_0^{x} \\ &= 1 - e^{-\lambda x} \end{align} \]

We can solve for the value \(x\) corresponding to a percentile \(p\) by setting \(F_X(x) = p\), since by definition \(F_X(x) = P(X \leq x)\):

\[ \begin{align} p &= 1 - e^{-\lambda x} \\ 1 - p &= e^{-\lambda x} \\ \ln(1-p) &= \ln(e^{-\lambda x}) \\ x &= -\frac{\ln(1-p)}{\lambda} \end{align} \]

Using the above equation, we solve for the 5th and 95th percentiles:

pct5 <- -log(1-0.05) / l_fit
pct95 <- -log(1-0.95) / l_fit
pcts <- c(pct5, pct95)
rbind(c(5, 95), pcts)
##          [,1]     [,2]
##         5.000     95.0
## pcts 9280.044 541991.5
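The same theoretical percentiles can also be obtained directly from R’s built-in exponential quantile function, which implements the same inverse-CDF formula derived above:

qexp(c(0.05, 0.95), rate = l_fit)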

We can compare the above to the empirical 5th and 95th percentiles generated via our simulated distribution:

set.seed(42)
quantile(sim_dist, probs=c(0.05, 0.95))
##         5%        95% 
##   9994.215 559104.207

Comparing the theoretical and empirical percentiles yields solid agreement, which would only improve given a larger number of simulated samples.

Finally, we can use this empirical distribution to estimate the mean, and provide a 95% confidence interval:

n <- length(sim_dist)
xbar <- mean(sim_dist)
s <- sd(sim_dist)
margin <- qt(0.975,df=n-1)*s/sqrt(n)
paste(round(xbar, 2), "+/-", round(margin, 2))
## [1] "188069.2 +/- 9656.07"

We see that despite the fact that the simulated distribution of SalePrice wasn’t a perfect approximation, it did provide a reasonable estimate of the mean of the actual distribution of SalePrice:

mean(df$SalePrice)
## [1] 180921.2

Modelling Sale Price

Building Model

Before building a linear regression model, the cell below converts the two remaining categorical variables, Neighborhood and BldgType, into dummy variables. This is done in the hope that unhelpful categories from each can be removed from the model if they turn out not to be effective predictors.

df <- dummy_cols(df, select_columns = c('BldgType', 'Neighborhood'),
           remove_first_dummy = TRUE, remove_selected_columns =  TRUE)
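Assuming the encoding behaves as expected, df should now contain 1,460 rows and 35 columns: the six numeric predictors, SalePrice, 4 BldgType dummies, and 24 Neighborhood dummies, consistent with the 34 predictors reported in the model summary below. A quick check:

# expect 1460 rows and 35 columns after dummy encoding
dim(df)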

Now that the dummy variables have been created, the cell below fits a linear regression model to predict SalePrice as a function of BldgType (dummy variables), Neighborhood (dummy variables), OverallCond, OverallQual, TotalBsmtSF, GrLivArea, YearBuilt, and YrSold:

lm <- lm(SalePrice ~ ., data=df)
lm_summary <- summary(lm)
lm_summary
## 
## Call:
## lm(formula = SalePrice ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415832  -14153   -1333   12050  267662 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6.982e+05  1.360e+06  -0.513 0.607856    
## OverallCond           6.916e+03  9.187e+02   7.528 9.11e-14 ***
## OverallQual           1.436e+04  1.178e+03  12.193  < 2e-16 ***
## TotalBsmtSF           2.319e+01  2.681e+00   8.649  < 2e-16 ***
## GrLivArea             5.281e+01  2.458e+00  21.485  < 2e-16 ***
## YearBuilt             5.293e+02  6.815e+01   7.766 1.54e-14 ***
## YrSold               -1.901e+02  6.737e+02  -0.282 0.777861    
## BldgType_2fmCon      -2.551e+03  6.400e+03  -0.399 0.690239    
## BldgType_Duplex      -1.860e+04  5.053e+03  -3.681 0.000241 ***
## BldgType_Twnhs       -5.218e+04  6.786e+03  -7.690 2.73e-14 ***
## BldgType_TwnhsE      -3.234e+04  4.297e+03  -7.526 9.23e-14 ***
## Neighborhood_Blueste -4.235e+03  2.556e+04  -0.166 0.868440    
## Neighborhood_BrDale  -5.092e+03  1.301e+04  -0.391 0.695553    
## Neighborhood_BrkSide -1.380e+04  1.102e+04  -1.252 0.210620    
## Neighborhood_ClearCr  4.779e+03  1.133e+04   0.422 0.673273    
## Neighborhood_CollgCr -1.409e+04  9.427e+03  -1.495 0.135173    
## Neighborhood_Crawfor  1.124e+04  1.079e+04   1.041 0.297844    
## Neighborhood_Edwards -2.492e+04  1.008e+04  -2.471 0.013583 *  
## Neighborhood_Gilbert -2.139e+04  9.980e+03  -2.144 0.032236 *  
## Neighborhood_IDOTRR  -2.409e+04  1.162e+04  -2.073 0.038344 *  
## Neighborhood_MeadowV  1.419e+03  1.227e+04   0.116 0.907919    
## Neighborhood_Mitchel -1.781e+04  1.035e+04  -1.721 0.085457 .  
## Neighborhood_NAmes   -1.999e+04  9.743e+03  -2.052 0.040375 *  
## Neighborhood_NoRidge  4.194e+04  1.062e+04   3.950 8.18e-05 ***
## Neighborhood_NPkVill  3.174e+03  1.425e+04   0.223 0.823814    
## Neighborhood_NridgHt  5.406e+04  9.581e+03   5.642 2.03e-08 ***
## Neighborhood_NWAmes  -2.625e+04  1.004e+04  -2.613 0.009058 ** 
## Neighborhood_OldTown -2.713e+04  1.075e+04  -2.524 0.011723 *  
## Neighborhood_Sawyer  -2.011e+04  1.028e+04  -1.956 0.050608 .  
## Neighborhood_SawyerW -1.509e+04  1.000e+04  -1.508 0.131680    
## Neighborhood_Somerst  5.465e+03  9.475e+03   0.577 0.564202    
## Neighborhood_StoneBr  6.182e+04  1.084e+04   5.704 1.42e-08 ***
## Neighborhood_SWISU   -3.001e+04  1.225e+04  -2.449 0.014434 *  
## Neighborhood_Timber   5.318e+03  1.062e+04   0.501 0.616566    
## Neighborhood_Veenker  2.613e+04  1.345e+04   1.943 0.052208 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33780 on 1425 degrees of freedom
## Multiple R-squared:  0.8234, Adjusted R-squared:  0.8192 
## F-statistic: 195.4 on 34 and 1425 DF,  p-value: < 2.2e-16

Already we can see that our model produces a more than reasonable adjusted \(R^2\) of about 0.82, helped by the fact that all of the numeric fields, with the exception of YrSold, are statistically significant predictors (this makes sense, given that YrSold was the only numeric field with no significant correlation with the target variable). In terms of our categorical variables, a number of building types and neighborhoods also appear to be effective predictors of sale price, but not all of them. The insignificant predictors are removed in the cell below:

mod_summary_sign <- lm_summary$coefficients[ , 4] # get p-values
# names of all terms with p-values above 0.05; the first element is the
# intercept, which is kept by excluding it via remove_list[-1]
remove_list <- names(which(mod_summary_sign > 0.05))
df <- df %>%
  dplyr::select(-remove_list[-1])

Next, we can rebuild the model using only the statistically significant predictors:

lm <- lm(SalePrice ~ ., data=df)
lm_summary <- summary(lm)
lm_summary
## 
## Call:
## lm(formula = SalePrice ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -431487  -15703   -1516   12654  260351 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.024e+06  1.025e+05  -9.989  < 2e-16 ***
## OverallCond           7.137e+03  9.184e+02   7.771 1.47e-14 ***
## OverallQual           1.587e+04  1.130e+03  14.049  < 2e-16 ***
## TotalBsmtSF           2.354e+01  2.680e+00   8.781  < 2e-16 ***
## GrLivArea             5.490e+01  2.444e+00  22.461  < 2e-16 ***
## YearBuilt             4.900e+02  5.222e+01   9.384  < 2e-16 ***
## BldgType_Duplex      -2.039e+04  5.061e+03  -4.029 5.89e-05 ***
## BldgType_Twnhs       -4.467e+04  5.545e+03  -8.056 1.63e-15 ***
## BldgType_TwnhsE      -2.582e+04  3.688e+03  -7.000 3.91e-12 ***
## Neighborhood_Edwards -1.633e+04  3.878e+03  -4.210 2.71e-05 ***
## Neighborhood_Gilbert -1.344e+04  4.346e+03  -3.092  0.00203 ** 
## Neighborhood_IDOTRR  -1.527e+04  6.270e+03  -2.436  0.01497 *  
## Neighborhood_NAmes   -1.120e+04  2.855e+03  -3.922 9.18e-05 ***
## Neighborhood_NoRidge  4.564e+04  5.922e+03   7.708 2.37e-14 ***
## Neighborhood_NridgHt  5.696e+04  4.547e+03  12.526  < 2e-16 ***
## Neighborhood_NWAmes  -1.927e+04  4.309e+03  -4.471 8.39e-06 ***
## Neighborhood_OldTown -2.057e+04  4.268e+03  -4.819 1.60e-06 ***
## Neighborhood_StoneBr  6.350e+04  7.311e+03   8.685  < 2e-16 ***
## Neighborhood_SWISU   -2.412e+04  7.434e+03  -3.245  0.00120 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34380 on 1441 degrees of freedom
## Multiple R-squared:  0.8151, Adjusted R-squared:  0.8128 
## F-statistic: 352.9 on 18 and 1441 DF,  p-value: < 2.2e-16

The \(R^2\) value remains nearly unchanged (0.815 versus 0.823) despite having reduced the number of predictors by 16, giving further evidence that the removed features were not helpful to the linear regression model.

Testing Assumptions

Before we can use our model to make predictions, we must check that the assumptions of linear regression hold; if not, our model could be unreliable or even misleading. First, the cell below checks the VIF values for each of our predictor variables again, given that they were previously evaluated using only the numeric fields:

vif(lm)
##          OverallCond          OverallQual          TotalBsmtSF 
##             1.289539             3.014608             1.707139 
##            GrLivArea            YearBuilt      BldgType_Duplex 
##             2.036937             3.070727             1.087206 
##       BldgType_Twnhs      BldgType_TwnhsE Neighborhood_Edwards 
##             1.085790             1.209769             1.185376 
## Neighborhood_Gilbert  Neighborhood_IDOTRR   Neighborhood_NAmes 
##             1.194253             1.199735             1.312622 
## Neighborhood_NoRidge Neighborhood_NridgHt  Neighborhood_NWAmes 
##             1.182529             1.276378             1.089868 
## Neighborhood_OldTown Neighborhood_StoneBr   Neighborhood_SWISU 
##             1.606999             1.111577             1.149105

We see once again that none of the VIF values exceed 5, and thus the levels of collinearity are acceptable for use in our model.

The four assumptions of linear regression are the following:

  1. Linear relationship: There exists a linear relationship between the independent variable, \(x\), and the dependent variable, \(y\).
  2. Normality: The residuals of the model are normally distributed.
  3. Homoscedasticity: The residuals have constant variance at every level of \(x\).
  4. Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.

Linear Relationship

The scatterplots created earlier show that all of the numeric fields appear to exhibit a linear relationship with SalePrice, with the exception of YrSold, which has now been removed from the model. We can assume that the linearity condition is met for our remaining dummy variables, since each contains only two values (0 or 1) and a straight line can be fit perfectly through any two points.

Normality

To check if the residuals are normally distributed, we can make a histogram:

ggplot(lm, aes(x = .resid)) +
  geom_histogram(binwidth = 7500) 

While the histogram does indeed appear approximately normal, we can analyze the residuals further using a Q-Q (quantile-quantile) plot:

ggplot(data = lm, aes(sample = .resid)) +
  stat_qq()

The Q-Q plot reveals that while the data appears normal for the majority of the distribution, there are some outliers at either end, resulting in deviations from an otherwise seemingly straight line. However, given that the vast majority of the residuals do indeed appear to follow a normal distribution, we can consider this assumption satisfied.
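To make any deviation easier to judge, a theoretical reference line can be overlaid using ggplot2’s stat_qq_line() (an optional variant of the plot above):

ggplot(data = lm, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()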

Homoscedasticity

To check if the homoscedasticity condition is met, we can create a plot of the residuals as a function of the fitted values:

ggplot(lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

While the distribution of residuals looks fairly even across the fitted values, there does appear to be a wider spread towards the right side of the plot. This is, once again, due to a number of outlying observations that deviate from an otherwise uninteresting residual plot. However, since this deviation only occurs towards the far right side of the plot, we can consider the homoscedasticity assumption satisfied.
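For a more formal check (included here only as an optional sketch; the conclusion above rests on the visual diagnostic), the Breusch-Pagan test from the already-loaded lmtest package tests the null hypothesis of constant residual variance:

# Breusch-Pagan test: a small p-value would suggest heteroscedasticity
bptest(lm)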

Independence

Based on the residual plot above, there is no obvious pattern relating the residuals to the fitted values. Note that in ordinary least squares the residuals are uncorrelated with the fitted values by construction, so the near-zero correlation coefficient below is expected rather than additional evidence of independence; more importantly, the observations are individual home sales rather than a time series, so there is no natural ordering along which consecutive residuals could be correlated:

cor(lm$residuals, lm$fitted.values)
## [1] 1.882933e-17
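Although these data are not a time series, the Durbin-Watson test from lmtest provides a standard formal check for serial correlation among residuals (again, only an optional sketch):

# Durbin-Watson statistic near 2 indicates little autocorrelation
dwtest(lm)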

As such, we consider the final linear regression assumption to be met.

Make Predictions

Now that we have confirmed that our model meets all the assumptions of linear regression, we can use it to evaluate performance on our testing set:

test_df <- read.csv("Data/test.csv")
id_field <- test_df['Id']
test_df <- test_df %>% dplyr::select(
  BldgType, Neighborhood, OverallCond, OverallQual, 
  TotalBsmtSF, GrLivArea, YearBuilt, YrSold
)


test_df <- dummy_cols(test_df, select_columns = c('BldgType', 'Neighborhood'),
                      remove_first_dummy = TRUE, 
                      remove_selected_columns =  TRUE)

test_df <- test_df %>%
  dplyr::select(-remove_list[-1])

preds <- predict(lm, test_df)

The above sets up the testing data in the same way as the training data and then uses the model to create predictions, which are stored in preds. The cell below then creates a data frame of these predictions so that a .csv file can be produced and uploaded to Kaggle.

preds <- cbind(id_field, preds)
preds <- data.frame(preds)
colnames(preds) <- c('Id', 'SalePrice')

# one prediction comes out NA due to a missing TotalBsmtSF value in the test
# dataset. The code below replaces this prediction with the mean predicted value.
preds$SalePrice[is.na(preds$SalePrice)] <- mean(preds$SalePrice, na.rm = TRUE)

# write predictions to file
write_csv(preds, file="prediction.csv")

The predictions created above result in a Kaggle score of 0.22216.

Extra - Rebuilding the Model with No Outliers

Though we have no tangible reason to exclude them, it is worth noting that there were a number of outliers in our dataset that were clearly visible when checking the linear regression assumptions. We can visualize these outliers using a metric known as Cook’s distance, which measures how much influence a single data point has on the model by quantifying how much the model’s predicted values would change if that observation were removed. As such, it can be used to identify and remove problematic outliers. A commonly used threshold flags observations whose Cook’s distance is greater than \(\frac{4}{n}\), where \(n\) is the number of observations in our dataset:

cooksD <- cooks.distance(lm)
cooks <- as.data.frame(cooksD)
colnames(cooks) <- c('d_c')
cooks$row_num <- seq.int(nrow(cooks)) 

ggplot(data = cooks, aes(x = row_num, y = d_c)) + 
  geom_point(size =  1) +
    geom_hline(yintercept = 4 / nrow(df), color = "red") +
      labs(
        x = "Index",
        y = "Cook's Distance"
      ) + 
        ylim(0, 0.01)

The red line in the plot above represents the \(\frac{4}{n}\) threshold, and as such it is clear that there are indeed a number of outliers in our dataset. The cell below rebuilds our linear model after first removing influential observations, here using a related rule of thumb that drops any observation whose Cook’s distance exceeds three times the mean Cook’s distance:

cooksD <- cooks.distance(lm)
influential <- cooksD[(cooksD > (3 * mean(cooksD, na.rm = TRUE)))]
df <- df[-(as.integer(names(influential))),]

lm <- lm(SalePrice ~ ., data=df)
lm_summary <- summary(lm)
lm_summary$r.squared
## [1] 0.8705526

As expected, we see that the \(R^2\) value has increased, as the model now likely fits the remaining, non-outlier observations better.

Recreating the residual plots also shows that there is now no question of our linear regression assumptions being met:

ggplot(lm, aes(x = .resid)) +
  geom_histogram(binwidth = 7500) 

ggplot(data = lm, aes(sample = .resid)) +
  stat_qq()

ggplot(lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

However, when we recreate the predictions on the testing set using this outlier-free model:

preds <- predict(lm, test_df)
preds <- cbind(id_field, preds)
preds <- data.frame(preds)
colnames(preds) <- c('Id', 'SalePrice')
preds$SalePrice[is.na(preds$SalePrice)] <- mean(preds$SalePrice, na.rm = TRUE)
write_csv(preds, file="prediction.csv")

it results in a much worse Kaggle score of 0.45855. Thus, despite the higher \(R^2\) and cleaner-looking residual plots, excluding the outliers produced a model that was worse at making predictions on new data.

Conclusion

The work presented here shows that a linear regression model can predict home sale prices in Ames, Iowa reasonably well. Overall, the final model achieves an \(R^2\) of over 0.81 and provides reasonable predictions on a testing dataset. The quality of the model could likely still be improved with a more in-depth feature analysis, the inclusion of more observations, or a switch to a different model class entirely (e.g., a random forest regressor).