Regression Diagnostics with R - Ames Housing Data

Name: Karolina Grodzinska
Class: ALY6015: Intermediate Analytics

INTRODUCTION

This assignment is based on data about the housing market. I will start with an exploratory data analysis, presenting some variables of interest on graphs and exploring their distributions. In the second part of the assignment, I will perform correlation and regression analysis, building a multiple regression model and correcting any issues that are present in it. My goal is to find the best linear model to predict house prices.

library(tidyverse)
library(RColorBrewer)
library(corrplot)
library(psych)
library(dplyr)
library(ggplot2)
library(gtools)
library(ggfortify)
library(GGally)
library(readr)
library(readxl)
library(knitr)
library(modelr)
library(scales)

df  = read.csv("AmesHousing.csv")

The dataset includes 2930 observations of 82 variables representing different features of houses in Ames, Iowa. According to the data documentation, there are 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (the remaining two columns, Order and PID, are observation identifiers).
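
As a quick sanity check on these counts, the dimensions and storage classes can be inspected directly. This is a small convenience snippet; R's storage classes (character or factor, integer, numeric) will not map one-to-one onto the documentation's four categories.

# Quick structural check: dataset dimensions and how R stored each column
dim(df)
table(sapply(df, class))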

ANALYSIS PART 1 - Exploratory Data Analysis

# Disabling scientific notation, so my graphs and outputs will be more readable:
options(scipen = 100)

# Firstly, I'd like to check the missing values
na_count <-sapply(df, function(df) sum(length(which(is.na(df)))))
na_count
##        ï..Order             PID     MS.SubClass       MS.Zoning    Lot.Frontage 
##               0               0               0               0             490 
##        Lot.Area          Street           Alley       Lot.Shape    Land.Contour 
##               0               0            2732               0               0 
##       Utilities      Lot.Config      Land.Slope    Neighborhood     Condition.1 
##               0               0               0               0               0 
##     Condition.2       Bldg.Type     House.Style    Overall.Qual    Overall.Cond 
##               0               0               0               0               0 
##      Year.Built  Year.Remod.Add      Roof.Style       Roof.Matl    Exterior.1st 
##               0               0               0               0               0 
##    Exterior.2nd    Mas.Vnr.Type    Mas.Vnr.Area      Exter.Qual      Exter.Cond 
##               0               0              23               0               0 
##      Foundation       Bsmt.Qual       Bsmt.Cond   Bsmt.Exposure  BsmtFin.Type.1 
##               0              79              79              79              79 
##    BsmtFin.SF.1  BsmtFin.Type.2    BsmtFin.SF.2     Bsmt.Unf.SF   Total.Bsmt.SF 
##               1              79               1               1               1 
##         Heating      Heating.QC     Central.Air      Electrical     X1st.Flr.SF 
##               0               0               0               0               0 
##     X2nd.Flr.SF Low.Qual.Fin.SF     Gr.Liv.Area  Bsmt.Full.Bath  Bsmt.Half.Bath 
##               0               0               0               2               2 
##       Full.Bath       Half.Bath   Bedroom.AbvGr   Kitchen.AbvGr    Kitchen.Qual 
##               0               0               0               0               0 
##   TotRms.AbvGrd      Functional      Fireplaces    Fireplace.Qu     Garage.Type 
##               0               0               0            1422             157 
##   Garage.Yr.Blt   Garage.Finish     Garage.Cars     Garage.Area     Garage.Qual 
##             159             157               1               1             158 
##     Garage.Cond     Paved.Drive    Wood.Deck.SF   Open.Porch.SF  Enclosed.Porch 
##             158               0               0               0               0 
##     X3Ssn.Porch    Screen.Porch       Pool.Area         Pool.QC           Fence 
##               0               0               0            2917            2358 
##    Misc.Feature        Misc.Val         Mo.Sold         Yr.Sold       Sale.Type 
##            2824               0               0               0               0 
##  Sale.Condition       SalePrice 
##               0               0
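
For easier reading, the worst offenders can be pulled out directly - a small convenience check based on the na_count vector computed above:

# Listing the variables with the most missing values
head(sort(na_count, decreasing = TRUE), 6)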

I can see that a few variables have many missing values, most notably Pool.QC, Alley, Misc.Feature, Fence, and Fireplace.Qu. I will also rename the first column, whose name was garbled during the CSV import:

df = rename(df, Order = ï..Order)
summary(df)

I used the summary function but I decided not to include the long output as a part of this report.

It seems that the lot area varies between 1,300 and 215,245 sqft, the houses were built between 1872 and 2010, and the living area ranges from 334 sqft (which seems like a small apartment) to 5,642 sqft, with a mean of about 1,500 sqft. On average, houses have 1.5 bathrooms, 1 kitchen, 2.8 bedrooms, and 6.4 total rooms above ground. I would like to look at some of these features on graphs, as a graphical representation of data is usually easier to interpret than a large table of raw numbers.
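
For reference, here is one way to pull those figures directly with dplyr instead of scanning the full summary() output (a small sketch; the columns used here have no missing values according to the NA counts above):

# Pulling the figures quoted above directly
df %>%
  summarise(lot_range = paste(min(Lot.Area), "-", max(Lot.Area)),
            built_range = paste(min(Year.Built), "-", max(Year.Built)),
            living_area_mean = mean(Gr.Liv.Area),
            full_bath_mean = mean(Full.Bath),
            bedrooms_mean = mean(Bedroom.AbvGr),
            rooms_mean = mean(TotRms.AbvGrd))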

It seems that the best way to conduct this analysis would be to use house price as the predicted variable, as it should depend on other features. My assumption would be that the neighborhood, size of the house, its condition, and additional features should affect the final price.

Thus, I will start the exploratory data analysis by looking at the distribution of house prices.

# Histogram of prices
ggplot(df, aes(x = SalePrice)) +
  geom_histogram(color = "black", fill = "lightslateblue", bins = 50) + 
  scale_x_continuous(labels = comma) +
  labs(title = "Distribution of house prices", x = "Price", y = "Frequency") +
  theme_minimal()

It is important to recognize the shapes of the distribution of data during the analysis process. Histograms are helpful to see the distribution represented visually. Histograms display frequencies of data points for a specified number of bins (Bluman, 2018).

The distribution of house prices is right-skewed, with most houses priced under $200,000. From the descriptive statistics, I know that prices range from $12,789 to $755,000, with a mean of $180,796 and a median of $160,000.

As the next step of the exploratory data analysis, I will look at the age of houses and their quality.

barplot(table(df$Year.Built), 
        main = "When were the most houses built?", 
        xlab = "Year",
        ylab = "Number of houses",
        col = brewer.pal(9, "Blues"))

It seems that there was a housing boom during the early 2000s. When I looked at the frequency table for the exact values, I saw that the most houses were built in 2005 (142 houses), followed by 138 houses in 2006, after which the counts start dropping. Between 2007 and 2008 the number of newly built houses halved, which was most likely caused by the financial crisis.
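
The frequency table mentioned above can be inspected directly. This is a small check (indexing by year assumes houses were built in every year from 2005 to 2010, which the bar plot suggests):

# The exact counts behind the bar plot
year_counts <- table(df$Year.Built)
tail(sort(year_counts), 2)                 # the two most common construction years
year_counts[as.character(2005:2010)]       # counts around the 2008 financial crisis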

barplot(table(df$Overall.Cond), 
        main = "In what condition are the most houses on the market?", 
        xlab = "Overall condition rating",
        ylab = "Number of houses",
        col = brewer.pal(10, "RdYlBu"))

This ordinal variable rates the overall condition of the house in the following way:

10 Very Excellent

9 Excellent

8 Very Good

7 Good

6 Above Average

5 Average

4 Below Average

3 Fair

2 Poor

1 Very Poor

It seems that most houses on the market are in average condition, and that there are more well-maintained houses than houses in the below-average categories.

# Histogram of living area
ggplot(df, aes(x = Gr.Liv.Area)) +
  geom_histogram(color = "black", fill = "pink1", bins = 50) + 
  scale_x_continuous(labels = comma) +
  labs(title = "Distribution of house sizes", x = "Living area (sqft)", y = "Frequency") +
  theme_minimal()

Most houses are under 2,000 square feet; the descriptive statistics show a mean of about 1,500 sqft and a median of 1,442 sqft.

Now, I would like to take a look at prices per neighborhood. My assumption is that there should be differences in prices between the areas of Ames, Iowa.

# Let's see median prices per neighborhood
neighbourhoods = tapply(df$SalePrice, df$Neighborhood, median)
neighbourhoods = sort(neighbourhoods, decreasing = TRUE)

dotchart(neighbourhoods, pch = 21, bg = "purple1",
         cex = 0.85,
         xlab = "Median price of a house",
         main = "Which neighborhood is the most expensive to buy a house in?")
## Warning in dotchart(neighbourhoods, pch = 21, bg = "purple1", cex = 0.85, : 'x'
## is neither a vector nor a matrix: using as.numeric(x)

I have used the median because it is less influenced by outliers than the mean (if there were one house worth over a million dollars, the mean would increase but the middle value would stay almost the same).
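
To illustrate this point with a toy example (the prices below are made up, not taken from the dataset):

# One very expensive house pulls the mean up sharply while the median barely moves
prices <- c(140000, 155000, 160000, 172000, 185000)
c(mean = mean(prices), median = median(prices))            # mean 162400, median 160000

prices_with_outlier <- c(prices, 1000000)
c(mean = mean(prices_with_outlier), median = median(prices_with_outlier))  # mean 302000, median 166000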

From the graph above, it is possible to see that the neighborhood influences the prices of houses - the most expensive parts of the city have prices three times higher than the cheapest areas.

ANALYSIS PART 2 - Correlation and Regression

Pearson’s correlation coefficient indicates the strength of a linear relationship between two variables. Its values vary between -1 and 1. A coefficient with a value close to -1 indicates a strong negative relationship, a coefficient close to 0 suggests the lack of linear relationship, and a coefficient with a value close to 1 suggests a strong positive relationship (Bluman, 2018).
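
For a single pair of variables, the coefficient and a significance test of the linear relationship can be obtained like this (a small example using two columns with no missing values; the same coefficient reappears later in the analysis):

# Pearson's r between sale price and above-ground living area, with a significance test
cor(df$SalePrice, df$Gr.Liv.Area)
cor.test(df$SalePrice, df$Gr.Liv.Area)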

I’d like to see the correlation between variables in this dataset:

# Correlation matrix
numeric = df %>% select(where(is.numeric)) # firstly, I need to pick only numerical variables

# Graphical representation
ggcorr(numeric)

The darker shades of red indicate a stronger positive correlation and the darker shades of blue indicate a stronger negative correlation. On the other hand, lighter shades show values closer to zero (weaker linear relation).

I can see that some variables are strongly correlated with each other, which could cause multicollinearity issues if both were included in the final model. For example, the order number (essentially an ID assigned to each record) has a strong negative correlation with the year sold, which simply reflects how the records are ordered in the file rather than any property of the houses themselves.

cor(df$Order, df$Yr.Sold)
## [1] -0.9759929

The full correlation matrix is very large and difficult to read, but I can see that many variables have weak correlations (indicated by a lighter color), so I will only keep variables that have either a strong negative or strong positive relationship with my predicted variable.
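
As a guide for that selection, the correlations of all numeric variables with SalePrice can be ranked directly - a small sketch using the numeric data frame created above:

# Ranking numeric variables by their correlation with SalePrice
# (pairwise complete observations are needed because several columns contain NAs)
price_cor <- sort(cor(numeric, use = "pairwise.complete.obs")[, "SalePrice"], decreasing = TRUE)
head(price_cor[names(price_cor) != "SalePrice"], 10)   # strongest positive correlations
tail(price_cor, 5)                                     # strongest negative correlations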

#Selecting variables
new_df = numeric %>% select(SalePrice, Open.Porch.SF, Wood.Deck.SF, Garage.Area, Garage.Cars, Garage.Yr.Blt, Fireplaces, TotRms.AbvGrd, Half.Bath, Full.Bath, Bsmt.Full.Bath, Gr.Liv.Area, X2nd.Flr.SF, X1st.Flr.SF, BsmtFin.SF.1, Mas.Vnr.Area, Year.Built, Year.Remod.Add, Overall.Qual, Lot.Area, Lot.Frontage, Enclosed.Porch, PID)

# Let's see the correlation matrix
ggcorr(new_df, size = 3)

Now, it is easier to see the names of variables and I’ll be able to determine which ones have the strongest positive and negative correlation with the dependent variable (Sale Price).

# Firstly, the negative correlation coefficients:
cor(df$Enclosed.Porch, df$SalePrice)
## [1] -0.1287874
cor(df$SalePrice, df$PID)
## [1] -0.2465212

It seems that the parcel identification number (PID) has the strongest negative correlation with the price of houses, meaning that as the identification number increases, the price decreases. However, it doesn't make much sense to analyze identification numbers, so I will instead create a scatter plot for a genuinely continuous variable, the enclosed porch area in square feet.

plot(df$SalePrice, df$Enclosed.Porch, 
    pch = 21,  cex=1.2,
    xlab = "Price",
    ylab = "Porch area (sqft)",
    main = "Relationship between the house price and porch area",
    bg = "cornflowerblue")

The correlation coefficient between the two variables was close to 0, so the visual representation doesn't show a clear negative relationship (the points would have to be scattered along a downward-sloping line). In fact, for houses that do have a porch, the scatter plot suggests that prices increase as the porch gets larger. For houses with no porch (the points plotted along the x-axis), prices still span the whole range, so the price must depend on other variables.

I will take a look at the continuous variables with the highest positive correlation with the price:

# Now, trying to find the highest positive correlation coefficients:
cor(df$SalePrice, df$Overall.Qual)
## [1] 0.7992618
cor(df$SalePrice, df$X2nd.Flr.SF)
## [1] 0.2693734
cor(df$SalePrice, df$Gr.Liv.Area)
## [1] 0.7067799

So it seems that the overall quality of the house has the highest positive correlation with its price, which is logical, as the better condition the house is in, the more buyers are willing to pay for it.

However, it would not make sense to create a scatter plot for an ordinal categorical variable, so I am going to plot the relationship between the sale price and the above-ground living area in square feet.

ggplot(df, aes(SalePrice, Gr.Liv.Area)) + 
  geom_point(size = 2, color = "gray35", alpha = 0.6) + 
  theme_minimal() +
  geom_smooth(method = lm, color = "blue", size = 2) + 
  scale_x_continuous(labels = comma)
## `geom_smooth()` using formula 'y ~ x'

Scatter plots show ordered pairs of numerical variables and help reveal the relationship (or lack thereof) between the independent and dependent variables. Their purpose is to show the nature of a relationship between the variables of interest (Bluman, 2018).

Now it can be clearly seen that as the living area increases, the price increases as well. The line shows a linear model of the relationship between the living area and the house price. It can also be seen that there are some unusual observations present in the dataset.

I will create another scatter plot, where points will be colored based on the house quality:

#Let's see a scatter plot:
plot(df$SalePrice, df$Gr.Liv.Area, 
    pch = 21,  cex=1.2,
    xlab = "Price",
    ylab = "Living area (sqft)",
    main = "Positive linear relationship between the house price and house quality",
    bg = c(rev(heat.colors(10)))[unclass(df$Overall.Qual)]) 

I have used a sequential color scale, as I am not interested in distinguishing every separate category: as the color becomes darker, the quality rating increases.

Now, I'll check which variable has a correlation coefficient closest to 0.5 with the sale price. From the color scale, I selected variables that appear to be close to that value and eliminated some discrete or categorical ones.

#Selecting variables
df1 = numeric %>% select(SalePrice, Garage.Area, X1st.Flr.SF, BsmtFin.SF.1, Mas.Vnr.Area, Lot.Frontage)

# Let's see the correlation matrix
ggcorr(df1, size = 3, label = TRUE, label_size = 4, label_round = 2, label_alpha = TRUE)

It seems that the variable representing the masonry veneer area in square feet has a correlation coefficient of 0.51 with the sale price.

# The scatter plot:
plot(df$SalePrice, df$Mas.Vnr.Area, 
    pch = 21,  cex=1.2,
    xlab = "Price",
    ylab = "Masonry veneer area (sqft)",
    main = "Relationship between the house price and masonry veneer area",
    bg = c("red"))

Except for the values plotted along the y = 0 line, it is difficult to see a clear pattern in the data. The points are scattered around the plot - there is an indication that the price increases when the veneer area increases, but it is not as clear as in the previous scatter plot showing the above-ground living area.

With the help of code found on Towards Data Science (Williams, 2020), I have been able to create a correlation matrix that only shows variables with correlation coefficients higher than 0.5.

One important note - the colors here are reversed compared to the ggcorr plots above: in this corrplot a strong positive correlation is shown in blue, whereas before blue indicated a negative correlation and red a positive one.

corr_simple <- function(data = df, sig = 0.5){
  # convert data to numeric in order to run correlations
  # convert to factor first to keep the integrity of the data - each value will become a number rather than turn into NA
  df_cor <- data %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  # run a correlation and drop the insignificant ones
  corr <- cor(df_cor)
  # prepare to drop duplicates and correlations of 1
  corr[lower.tri(corr, diag = TRUE)] <- NA
  # drop perfect correlations
  corr[corr == 1] <- NA
  # turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  # remove the NA values from above
  corr <- na.omit(corr)
  # select significant values
  corr <- subset(corr, abs(Freq) > sig)
  # sort by highest correlation
  corr <- corr[order(-abs(corr$Freq)), ]
  # print table
  print(corr)
  # turn corr back into a matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1 ~ Var2, value.var = "Freq")
  # plot correlations visually
  corrplot(mtx_corr, is.corr = FALSE, tl.col = "black", na.label = " ")
}

corr_simple()
##                Var1           Var2       Freq
## 6397          Order        Yr.Sold -0.9759929
## 2075   Exterior.1st   Exterior.2nd  0.8654165
## 4558    Gr.Liv.Area  TotRms.AbvGrd  0.8077721
## 6661   Overall.Qual      SalePrice  0.7992618
## 1315    MS.SubClass      Bldg.Type  0.7435622
## 6690    Gr.Liv.Area      SalePrice  0.7067799
## 4563  Bedroom.AbvGr  TotRms.AbvGrd  0.6726472
## 3900    X2nd.Flr.SF    Gr.Liv.Area  0.6552512
## 4457     Exter.Qual   Kitchen.Qual  0.6535235
## 6671     Exter.Qual      SalePrice -0.6476163
## 2481     Year.Built     Foundation  0.6366324
## 2315   Overall.Qual     Exter.Qual -0.6331484
## 4148    Gr.Liv.Area      Full.Bath  0.6303208
## 6687    X1st.Flr.SF      SalePrice  0.6216761
## 6697   Kitchen.Qual      SalePrice -0.6136888
## 1743     Year.Built Year.Remod.Add  0.6120953
## 4228    X2nd.Flr.SF      Half.Bath  0.6116337
## 1659   Overall.Qual     Year.Built  0.5970272
## 4447   Overall.Qual   Kitchen.Qual -0.5896450
## 4556    X2nd.Flr.SF  TotRms.AbvGrd  0.5852137
## 3873   Overall.Qual    Gr.Liv.Area  0.5705559
## 1741   Overall.Qual Year.Remod.Add  0.5696088
## 3899    X1st.Flr.SF    Gr.Liv.Area  0.5621658
## 6663     Year.Built      SalePrice  0.5584261
## 6693      Full.Bath      SalePrice  0.5456039
## 4450 Year.Remod.Add   Kitchen.Qual -0.5387844
## 6664 Year.Remod.Add      SalePrice  0.5329738
## 4561      Full.Bath  TotRms.AbvGrd  0.5285992
## 4119   Overall.Qual      Full.Bath  0.5222626
## 4312    Gr.Liv.Area  Bedroom.AbvGr  0.5168075
## 4310    X2nd.Flr.SF  Bedroom.AbvGr  0.5046506
## 3384 Year.Remod.Add     Heating.QC -0.5036757

Now, it should be easier to pick variables for a linear model. I'll start by plotting the relationships between my variables of interest:

#Selecting variables
df2 = df %>% select(SalePrice, Garage.Area, X1st.Flr.SF, Gr.Liv.Area)

# Plot
pairs(df2, 
      main = "Relations between continuous variables in Ames housing data", 
      pch = 21, 
      bg = c("royalblue2"), 
      labels = c("Price","Garage Area","1st Floor Area","Living Area"),
      lower.panel = NULL, 
      font.labels = 2, 
      cex.labels = 2) 

From the plots, it seems that there might be a linear relationship between the 1st-floor area and the above-ground living area (which, I believe, is the sum of the 1st- and 2nd-floor areas for houses with more than one floor), so I will have to check for multicollinearity.

After drawing scatter plots and computing the values for correlation coefficients to test the significance of a relationship, the next step would be to determine the equation of the regression line (Bluman, 2018).

model1 = lm(SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = df)
summary(model1)
## 
## Call:
## lm(formula = SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + 
##     Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -515739  -17594   -1446   15550  289941 
## 
## Coefficients:
##                 Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)  -706025.507   60944.740 -11.585 < 0.0000000000000002 ***
## Overall.Qual   17818.862     794.992  22.414 < 0.0000000000000002 ***
## Gr.Liv.Area       51.323       2.052  25.013 < 0.0000000000000002 ***
## X1st.Flr.SF       25.805       2.162  11.936 < 0.0000000000000002 ***
## Garage.Area       42.642       4.002  10.655 < 0.0000000000000002 ***
## Year.Built       371.503      31.418  11.824 < 0.0000000000000002 ***
## Exter.QualFa  -79323.135    7769.398 -10.210 < 0.0000000000000002 ***
## Exter.QualGd  -64967.008    3870.518 -16.785 < 0.0000000000000002 ***
## Exter.QualTA  -77760.702    4374.322 -17.777 < 0.0000000000000002 ***
## Full.Bath      -6753.736    1678.556  -4.024            0.0000588 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35050 on 2919 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8081, Adjusted R-squared:  0.8075 
## F-statistic:  1366 on 9 and 2919 DF,  p-value: < 0.00000000000000022

The equation representing my multiple linear regression is as follows:

y = -706025.5 + 17818.86 * Overall.Qual + 51.32 * Gr.Liv.Area + 25.8 * X1st.Flr.SF + 42.64 * Garage.Area + 371.5 * Year.Built -79323.14 * Exter.QualFa -64967 * Exter.QualGd -77760.7 * Exter.QualTA -6753.74 * Full.Bath

I included 7 variables in the model, but the output shows more coefficients because one of them is categorical. Last quarter, I learned that R automatically recodes categorical variables into dummy variables, creating one for each category except the reference level.
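
As an illustration of that coding (a small sketch; it assumes Exter.Qual takes the four values Ex, Fa, Gd, and TA, which is what the regression output suggests):

# The first level alphabetically ("Ex") becomes the reference category and each
# remaining level gets its own indicator column in the design matrix
exter <- factor(df$Exter.Qual)
levels(exter)
head(model.matrix(~ exter))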

The intercept is negative, which means that if all the independent variables were 0, the predicted price would be negative. Of course, that cannot happen in reality - if the living area were 0, there would be no house to sell. Most coefficients are positive (for example, when the house quality rating increases by 1, the price is expected to increase by about $17,819, holding the other variables constant). So there are no surprises here: better-quality, bigger, and newer houses are worth more.

The coefficients corresponding to the quality of the exterior are negative because the reference category is "Excellent" (the one missing from the output). Thus, moving from an excellent exterior to any lower category reduces the predicted price.

It is odd to see that, holding everything else constant, one additional full bathroom is associated with a price that is about 6,754 dollars lower. So when the house becomes bigger the price increases, but when more of a given area is devoted to bathrooms, the price decreases; this counterintuitive sign is likely related to the correlation between the number of bathrooms and the size variables already in the model.

Now, I will take a look at other measures present in the output. The R2 (coefficient of multiple determination) represents the amount of variation explained by the regression model (Bluman, 2018). In my case, the value is around 81%, meaning that the chosen independent variables explain 81% of the variation in house prices.
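
To make the definition concrete, R2 can be recomputed by hand from the fitted model (a quick sanity check, assuming model1 from above):

# R^2 = 1 - SS_residual / SS_total, using only the rows the model actually used
# (one row was dropped due to missingness)
y_used <- model.frame(model1)$SalePrice
ss_res <- sum(residuals(model1)^2)
ss_tot <- sum((y_used - mean(y_used))^2)
1 - ss_res / ss_tot   # should match the Multiple R-squared of about 0.808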

The F-statistic is used as an indicator of whether there is a relationship between the dependent and independent variables in the model. The bigger the value of the F-statistic, the better (Rego, 2015). In my case, the value of the F-statistic is 1366.

A small p-value indicates that it would be unlikely to observe such a relationship between the predictor and the response variable by chance. It is common to use a p-value of 5% or less as a cut-off point, and three asterisks represent highly significant p-values (Rego, 2015). In my model, the p-values are all very close to zero.

ANALYSIS PART 2 (continued) - Diagnosing and Improving the Model

Now, I’d like to do some diagnostics to see how good my model really is:

# let's visualize the model's fit
autoplot(model1,
         which = 1:3,
         nrow = 3,
         ncol = 1)

There are several plots that can help assess the performance of a model; three of them are shown above. The first one shows residuals versus fitted values (the distance between the actual and predicted values). To check the assumption that the residuals have a mean of zero across the range of fitted values, I need to look at the trend line: it should closely follow the y = 0 line on the plot (Cotton). In this case, that is mostly true. A few points stray further from the trend line, but in general the points are plotted close to y = 0.

The purpose of the Q-Q plot is to show whether the residuals follow a normal distribution. Quantiles from the normal distribution are represented on the x-axis, and the y-axis represents the standardized residuals (residuals divided by their standard deviation). If the plotted points follow the diagonal line, the residuals are approximately normally distributed (Cotton). Most points in my plot indeed follow that line; however, there are a few (the same ones that were problematic on the residuals vs. fitted plot) that clearly depart from it.

The third plot shows the square root of the standardized residuals versus the fitted values. It is called a scale-location plot, and it helps to check for heteroscedasticity by showing whether the size of the residuals grows or shrinks with the fitted values. There are two important things to check: whether the line is roughly horizontal across the plot, which would mean the spread of the residuals is roughly equal for all fitted values and the homoscedasticity assumption is likely met, and whether the residuals are randomly scattered with no clear patterns (Cotton; Zach, 2020).
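
For reference, the same three diagnostics can also be drawn with base R instead of autoplot (an equivalent alternative, not the code used for the figures above):

# which = 1 is residuals vs fitted, 2 is the normal Q-Q plot, 3 is scale-location
par(mfrow = c(3, 1))
plot(model1, which = 1:3)
par(mfrow = c(1, 1))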

There is also a statistical test to check for heteroscedasticity:

library(lmtest)
bptest(model1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 604.77, df = 9, p-value < 0.00000000000000022

The hypotheses in the Breusch-Pagan test are as follows:

H0 = the residuals are homoscedastic (constant variance)

H1 = the residuals are heteroscedastic

From my output, with a p-value close to 0, I need to reject the null hypothesis and conclude that heteroscedasticity is present in my model.

library(car)
durbinWatsonTest(model1)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.2471018      1.505323       0
##  Alternative hypothesis: rho != 0

The Durbin-Watson test is a measure of autocorrelation in the residuals from a regression analysis. Autocorrelation can cause many problems, including standard errors that are too small and an exaggerated goodness of fit; it may also make regression coefficients appear statistically significant when, in fact, they are not.

The Hypotheses for the Durbin Watson test are:

H0 = no first-order autocorrelation

H1 = first-order correlation exists

The statistic varies between 0 and 4: a value of 2 means there is no autocorrelation, values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation (Statistics How To, 2016).

In my case, with the p-value close to 0, I have to reject the null hypothesis and state that positive autocorrelation is present in my model.

"Multicollinearity in regression analysis occurs when two or more predictor variables are highly correlated to each other" (Zach, 2019). This means that they do not provide unique or independent information in the regression model and can cause problems with interpreting it. The most common way to detect multicollinearity is by using the variance inflation factor (VIF). Its values start at 1 and there is no upper limit.

A general rule of thumb for interpreting VIFs is: a value of 1 indicates no correlation between independent variables; values between 1 and 5 show moderate correlation but are not severe enough to require attention; values greater than 5 indicate a potentially severe correlation between predictors. With VIF higher than 5, the coefficient estimates and p-values in the regression output are likely to be unreliable (Zach, 2019).

vif(model1)
##                  GVIF Df GVIF^(1/(2*Df))
## Overall.Qual 2.999148  1        1.731805
## Gr.Liv.Area  2.564268  1        1.601333
## X1st.Flr.SF  1.711119  1        1.308098
## Garage.Area  1.765036  1        1.328547
## Year.Built   2.150770  1        1.466550
## Exter.Qual   2.680213  3        1.178587
## Full.Bath    2.053187  1        1.432895

The highest VIF value is about 3 (for Overall.Qual), so multicollinearity should not cause any problems, and the no-multicollinearity assumption is not violated. If two of my predictors (for instance, the above-ground living area and the 1st-floor area) were strongly correlated, I would inspect why and probably remove one of them, especially if they bring similar information into the model.
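
For completeness, that particular pairwise correlation can be checked directly (it was about 0.56 in the filtered correlation table above, i.e. moderate):

# A direct check on the pair I was most concerned about
cor(df$Gr.Liv.Area, df$X1st.Flr.SF)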

Now I can check for the unusual observations:

library(olsrr)
ols_plot_resid_lev(model1)

There seem to be many leverage points and some outliers in my dataset, which influence the regression line; they could already be seen on the Q-Q plot. I will remove 5 observations that lie particularly far from the others (3 of them were already labeled on the Q-Q plot). I don't want to remove more, so as not to reduce my sample size significantly.
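
An alternative to reading the row numbers off the plot is to rank observations by an influence measure such as Cook's distance (a sketch; the rows it flags may not coincide exactly with the five I remove below):

# Rank observations in model1 by Cook's distance and show the five most influential
head(sort(cooks.distance(model1), decreasing = TRUE), 5)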

# Removing observations
houses = df[-c(2182, 2181, 1499, 1761, 1768),]

Now, I'd like to address the heteroscedasticity present in my original model. Heteroscedasticity has two main consequences: the least squares estimator is no longer the estimator with the smallest variance, and the standard errors computed for the least squares estimates are incorrect, which affects confidence intervals and hypothesis testing (Yobero, 2016).

When heteroscedasticity is present, the best linear unbiased estimator to use would be the generalized least squares estimator (Yobero, 2016). I will try it and compare it to my original model.

model2 <- lm(SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = houses)
summary(model2)
## 
## Call:
## lm(formula = SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + 
##     Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = houses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -195760  -17285    -931   15791  221602 
## 
## Coefficients:
##                 Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)  -776584.300   53709.248 -14.459 < 0.0000000000000002 ***
## Overall.Qual   16383.334     701.507  23.354 < 0.0000000000000002 ***
## Gr.Liv.Area       58.127       1.853  31.365 < 0.0000000000000002 ***
## X1st.Flr.SF       33.723       1.929  17.482 < 0.0000000000000002 ***
## Garage.Area       40.382       3.526  11.452 < 0.0000000000000002 ***
## Year.Built       409.972      27.689  14.806 < 0.0000000000000002 ***
## Exter.QualFa  -87069.279    6848.045 -12.714 < 0.0000000000000002 ***
## Exter.QualGd  -72862.883    3434.967 -21.212 < 0.0000000000000002 ***
## Exter.QualTA  -85167.722    3870.859 -22.002 < 0.0000000000000002 ***
## Full.Bath     -11347.640    1490.488  -7.613   0.0000000000000358 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30830 on 2914 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8465, Adjusted R-squared:  0.846 
## F-statistic:  1785 on 9 and 2914 DF,  p-value: < 0.00000000000000022

Even with just the outliers removed, the R2 is already higher. Let's see how the summary would look if the model were fitted using the generalized least squares estimator:

library(nlme)
model3 <- gls(SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath,
         data = houses, 
         correlation = corAR1(0.25, form = ~ 1 | Order), # AR(1) correlation structure with a starting value of 0.25
         na.action = na.exclude) # how to handle NAs; without this, an error is returned if any are present
summary(model3)
## Generalized least squares fit by REML
##   Model: SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area +      Year.Built + Exter.Qual + Full.Bath 
##   Data: houses 
##       AIC      BIC    logLik
##   68641.7 68713.43 -34308.85
## 
## Correlation Structure: AR(1)
##  Formula: ~1 | Order 
##  Parameter estimate(s):
##  Phi 
## 0.25 
## 
## Coefficients:
##                  Value Std.Error    t-value p-value
## (Intercept)  -776584.3  53709.25 -14.459042       0
## Overall.Qual   16383.3    701.51  23.354470       0
## Gr.Liv.Area       58.1      1.85  31.364845       0
## X1st.Flr.SF       33.7      1.93  17.481733       0
## Garage.Area       40.4      3.53  11.452215       0
## Year.Built       410.0     27.69  14.806327       0
## Exter.QualFa  -87069.3   6848.04 -12.714473       0
## Exter.QualGd  -72862.9   3434.97 -21.212107       0
## Exter.QualTA  -85167.7   3870.86 -22.002278       0
## Full.Bath     -11347.6   1490.49  -7.613374       0
## 
##  Correlation: 
##              (Intr) Ovrl.Q Gr.L.A X1.F.S Grg.Ar Yr.Blt Ext.QF Ext.QG Ex.QTA
## Overall.Qual  0.187                                                        
## Gr.Liv.Area  -0.337 -0.290                                                 
## X1st.Flr.SF   0.052 -0.064 -0.296                                          
## Garage.Area   0.211 -0.124 -0.175 -0.206                                   
## Year.Built   -0.993 -0.264  0.346 -0.076 -0.217                            
## Exter.QualFa -0.162  0.331 -0.036  0.067  0.069  0.091                     
## Exter.QualGd -0.033  0.248 -0.050  0.169  0.055 -0.052  0.546              
## Exter.QualTA -0.194  0.396 -0.032  0.131  0.070  0.090  0.626  0.892       
## Full.Bath     0.292  0.004 -0.515  0.030  0.031 -0.306 -0.010 -0.036  0.006
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -6.3493815 -0.5606350 -0.0301872  0.5121697  7.1875707 
## 
## Residual standard error: 30831.34 
## Degrees of freedom: 2924 total; 2914 residual

The output from this model differs from the linear regression output. It shows the intercept and the coefficients, together with a matrix of correlations between the estimated coefficients, and reports AIC, BIC, and the log-likelihood.

The equation for this model is as follows:

y = -776584.3 + 16383.3 * Overall.Qual + 58.1 * Gr.Liv.Area + 33.7 * X1st.Flr.SF + 40.4 * Garage.Area + 410 * Year.Built - 87069.3 * Exter.QualFa - 72862.9 * Exter.QualGd - 85167.7 * Exter.QualTA - 11347.6 * Full.Bath

The coefficients and standard errors are essentially identical to those of the linear model fitted to the same cleaned data (model2), and the p-values are again very close to zero, indicating that the relationships between the predictors and the dependent variable are unlikely to be due to chance.

As stated in our materials for the first module, there are different approaches to compare different models:

Akaike Information Criterion (AIC) estimates the in-sample prediction error and indicates the relative quality of statistical models for a given dataset (it is only useful to compare models based on the same data). Bayesian Information Criterion (BIC) is a penalized-likelihood criterion derived from Bayesian probability. It is closely related to AIC. Generally, lower values of BIC and AIC are preferred.

GLS reports AIC and BIC in its output, so I will compare its performance to my linear model using those measures:

# Create vector with values
a = c(AIC(model2), BIC(model2), AIC(model3), BIC(model3))

# Create vector with names
b = c("AIC lm", "BIC lm", "AIC gls", "BIC gls")

# Link the values with names
names(a) = b
print(a)
##   AIC lm   BIC lm  AIC gls  BIC gls 
## 68756.54 68822.33 68641.70 68713.43

Based on the values in the table above, it seems that the model fitted with generalized least squares performed slightly better (it has the lower AIC and BIC).

With many independent variables, sometimes it is challenging to choose the best model by trying different combinations. Thus, I will use the best subsets regression. It is a model selection approach conducted by testing all possible combinations of the independent variables and then selecting the best model according to specified statistical criteria (STHDA, 2018).

# All Subsets Regression
library(leaps)
models <- regsubsets(SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = houses, nvmax = 9)

# plot a table of models showing variables in each model ordered by the selection statistic (adjusted R2)
plot(models, scale = "adjr2")

It seems that the model with all 9 predictors (counting the dummy variables) was the best. I can also check this against several criteria: adjusted R2, Mallow's Cp, and the Bayesian information criterion:

res.sum <- summary(models)
data.frame(
  Adj.R2 = which.max(res.sum$adjr2),
  CP = which.min(res.sum$cp),
  BIC = which.min(res.sum$bic)
)
##   Adj.R2 CP BIC
## 1      9  9   9

The output is clear: the model with all 9 predictors performs best on all of these metrics. It is quite possible that if I had included all 82 variables, the best subsets method would have suggested a model with an even larger number of predictors.

Thus, my conclusion is that the best model based on these predictors is the one fitted with generalized least squares (which accounts for correlation between the residuals).

CONCLUSION

Through this assignment, I have learned how to diagnose and improve linear models. A lot of real-life data does not follow all the assumptions of multiple linear regression, and issues such as multicollinearity or heteroscedasticity need to be investigated and often corrected. There can be multiple remedies for these issues, so it is important to try different methods and see which one works for a particular dataset.


BIBLIOGRAPHY

Bluman, A. G. (2018). Elementary statistics: A Step by Step Approach: A Brief Version. New York: McGraw-Hill Higher Education; 10th edition.

Cotton, R. Introduction to Regression in R. DataCamp. Retrieved from: https://app.datacamp.com/learn/courses/introduction-to-regression-in-r

Rego, F. (2015). Quick Guide: Interpreting Simple Linear Model Output in R. Retrieved from: https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R

Statistics How To. (2016) Durbin Watson Test & Test Statistic. Retrieved from: https://www.statisticshowto.com/durbin-watson-test-coefficient/

STHDA (2018). Best Subsets Regression Essentials in R. Retrieved from: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/155-best-subsets-regression-essentials-in-r/

Williams, C. (2020). How to Create a Correlation Matrix with Too Many Variables in R. Towards Data science. Retrieved from: https://towardsdatascience.com/how-to-create-a-correlation-matrix-with-too-many-variables-309cc0c0a57

Yobero, C. (2016). Methods for Detecting and Resolving Heteroskedasticity. Retrieved from: https://rpubs.com/cyobero/187387

Zach. (2019). How to Calculate Variance Inflation Factor (VIF) in R. Statology. Retrieved from: https://www.statology.org/variance-inflation-factor-r/

Zach. (2020). How to Interpret a Scale-Location Plot (With Examples). Statology. Retrieved from: https://www.statology.org/scale-location-plot/