Name: Karolina Grodzinska
Class: ALY6015: Intermediate Analytics
INTRODUCTION
This assignment is based on data about the housing market. I will start with an exploratory data analysis, presenting some of the variables of interest graphically and exploring their distributions. In the second part of the assignment, I will perform correlation and regression analysis, attempting to build a multiple regression model and correcting any issues that are present in it. My goal is to find the best linear model for predicting house prices.
library(tidyverse)
library(RColorBrewer)
library(corrplot)
library(psych)
library(dplyr)
library(ggplot2)
library(gtools)
library(ggfortify)
library(GGally)
library(readr)
library(readxl)
library(knitr)
library(modelr)
library(scales)
df = read.csv("AmesHousing.csv")
The dataset includes 2930 observations of 82 variables representing different features of houses in Ames, Iowa. From the data documentation, it appears that there are: 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables.
ANALYSIS PART 1 - Exploratory Data Analysis
# Disabling scientific notation, so my graphs and outputs will be more readable:
options(scipen = 100)
# Firstly, I'd like to check the missing values
na_count <- sapply(df, function(x) sum(is.na(x)))
na_count
## ï..Order PID MS.SubClass MS.Zoning Lot.Frontage
## 0 0 0 0 490
## Lot.Area Street Alley Lot.Shape Land.Contour
## 0 0 2732 0 0
## Utilities Lot.Config Land.Slope Neighborhood Condition.1
## 0 0 0 0 0
## Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 0 0 0 0 0
## Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st
## 0 0 0 0 0
## Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond
## 0 0 23 0 0
## Foundation Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 0 79 79 79 79
## BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF
## 1 79 1 1 1
## Heating Heating.QC Central.Air Electrical X1st.Flr.SF
## 0 0 0 0 0
## X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area Bsmt.Full.Bath Bsmt.Half.Bath
## 0 0 0 2 2
## Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
## 0 0 0 0 0
## TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type
## 0 0 0 1422 157
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area Garage.Qual
## 159 157 1 1 158
## Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF Enclosed.Porch
## 158 0 0 0 0
## X3Ssn.Porch Screen.Porch Pool.Area Pool.QC Fence
## 0 0 0 2917 2358
## Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type
## 2824 0 0 0 0
## Sale.Condition SalePrice
## 0 0
I can see that there are a few variables with many missing values: Pool.QC (2917), Misc.Feature (2824), Alley (2732), Fence (2358), and Fireplace.Qu (1422). I might just drop those variables, as they do not add much value.
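If I did decide to drop them, a minimal sketch like the following would do it (the 40% missingness cut-off is an arbitrary choice of mine):
# Share of missing values per column
na_share <- colMeans(is.na(df))
# Columns above the cut-off (given the counts above: Alley, Fireplace.Qu, Pool.QC, Fence, Misc.Feature)
high_na_cols <- names(na_share[na_share > 0.40])
df_reduced <- df[, !(names(df) %in% high_na_cols)]
But first, I'd like to see some descriptive statistics about the dataset: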
df = rename(df, Order = ï..Order)
summary(df)
I used the summary function but I decided not to include the long output as a part of this report.
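A more compact alternative (my own choice of columns, limited to the variables I comment on below) would be:
# Summary of a handful of variables instead of the full 82-column output
df %>%
  select(Lot.Area, Year.Built, Gr.Liv.Area, Full.Bath, Bedroom.AbvGr, TotRms.AbvGrd, SalePrice) %>%
  summary()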
It seems that the lot areas vary between 1,300 and 215,245 sqft, the houses were built between 1872 and 2010, and the living area ranges from 334 sqft (about the size of a small apartment) to 5,642 sqft, with a mean of roughly 1,500 sqft. On average, houses have 1.5 bathrooms, 1 kitchen, 2.8 bedrooms, and 6.4 total rooms above ground. I would like to look at some of those features on graphs, as a graphical representation is usually easier to interpret than a large table of raw numbers.
It seems that the best way to conduct this analysis would be to use house price as the predicted variable, as it should depend on other features. My assumption would be that the neighborhood, size of the house, its condition, and additional features should affect the final price.
Thus, I will start the exploratory data analysis by looking at the distribution of house prices.
# Histogram of prices
ggplot(df, aes(x = SalePrice)) +
geom_histogram(color = "black", fill = "lightslateblue", bins = 50) +
scale_x_continuous(labels = comma) +
labs(title = "Distribution of house prices", x = "Price", y = "Frequency") +
theme_minimal()
It is important to recognize the shape of the data's distribution during the analysis process. Histograms help by representing the distribution visually: they display frequencies of data points across a specified number of bins (Bluman, 2018).
The distribution of house prices is right-skewed, with most houses priced under $200,000. From the descriptive statistics, I know that prices range between $12,789 and $755,000, with a mean of $180,796 and a median of $160,000.
As the next step of the exploratory data analysis, I will look at the age of houses and their quality.
barplot(table(df$Year.Built),
main = "When were the most houses built?",
xlab = "Year",
ylab = "Number of houses",
col = brewer.pal(9, "Blues"))
It seems that there was a housing boom during the early 2000s. When I looked at the frequency table for the exact values, I saw that the most houses were built in 2005 (142 houses), followed by 138 houses in 2006, after which the counts start dropping. Between 2007 and 2008 the number of houses built halved, which was most likely caused by the financial crisis.
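For reference, the frequency check described above can be reproduced with a short line of code (a sketch, not part of the original output):
# The years with the most construction activity
head(sort(table(df$Year.Built), decreasing = TRUE))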
barplot(table(df$Overall.Cond),
main = "In what condition are the most houses on the market?",
xlab = "Year",
ylab = "Number of houses",
col = brewer.pal(10, "RdYlBu"))
This ordinal variable rates the overall condition of the house in the following way:
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
It seems that most houses on the market are in average condition, and that there are more well-maintained houses than houses in the below-average categories.
# Histogram of living area
ggplot(df, aes(x = Gr.Liv.Area)) +
geom_histogram(color = "black", fill = "pink1", bins = 50) +
scale_x_continuous(labels = comma) +
labs(title = "Distribution of house sizes", x = "Living area (sqft)", y = "Frequency") +
theme_minimal()
Most houses are under 2,000 square feet; the descriptive statistics show a mean of about 1,500 sqft and a median of 1,442 sqft.
Now, I would like to take a look at prices per neighborhood. My assumption is that there should be differences in prices between the areas of Ames, Iowa.
# Let's see median prices per neighborhood
neighbourhoods = tapply(df$SalePrice, df$Neighborhood, median)
neighbourhoods = sort(neighbourhoods, decreasing = TRUE)
dotchart(neighbourhoods, pch = 21, bg = "purple1",
cex = 0.85,
xlab="Average price of a house",
main = "Which neighborhood is the most expensive to buy a house in?")
## Warning in dotchart(neighbourhoods, pch = 21, bg = "purple1", cex = 0.85, : 'x'
## is neither a vector nor a matrix: using as.numeric(x)
I used the median because it is less influenced by outliers than the mean (if there were one house worth over a million dollars, the mean would increase noticeably, but the middle value would stay roughly the same).
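A quick toy illustration of that point (made-up numbers, not taken from the dataset):
# One extreme sale pulls the mean up, while the median barely moves
toy_prices <- c(150000, 160000, 170000, 180000, 190000)
mean(toy_prices); median(toy_prices)   # both are 170,000
mean(c(toy_prices, 1500000))           # mean jumps to about 391,667
median(c(toy_prices, 1500000))         # median only moves to 175,000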
From the graph above, it is possible to see that the neighborhood influences the prices of houses - the most expensive parts of the city have prices three times higher than the cheapest areas.
ANALYSIS PART 2 - Correlation and Regression
Pearson’s correlation coefficient indicates the strength of a linear relationship between two variables. Its values vary between -1 and 1. A coefficient with a value close to -1 indicates a strong negative relationship, a coefficient close to 0 suggests the lack of linear relationship, and a coefficient with a value close to 1 suggests a strong positive relationship (Bluman, 2018).
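As a quick sanity check of the definition (my own illustration), the coefficient can be computed directly from the formula and compared with R's built-in cor() function, using two variables that appear later in this analysis:
# Pearson's r computed from its definition, using living area and sale price
x <- df$Gr.Liv.Area
y <- df$SalePrice
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual
cor(x, y) # should match (about 0.71, as shown later in this report)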
I’d like to see the correlation between variables in this dataset:
# Correlation matrix
numeric = df %>% select(where(is.numeric)) # firstly, I need to pick only numerical variables
# Graphical representation
ggcorr(numeric)
The darker shades of red indicate a stronger positive correlation and the darker shades of blue indicate a stronger negative correlation. On the other hand, lighter shades show values closer to zero (weaker linear relation).
I notice that some variables are strongly correlated with each other, which could cause multicollinearity issues if both were included in the final model. For example, the order number (essentially an ID given to a house) has a strong negative correlation with the year sold, which simply reflects how the order numbers were assigned relative to the sale year.
cor(df$Order, df$Yr.Sold)
## [1] -0.9759929
The full correlation matrix is very large and difficult to read, but I can see that many variables have weak correlations (indicated by a lighter color), so I will only include variables that have either a strong negative or a strong positive relationship with my predicted variable.
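The manual selection below follows this idea; as a sketch, the shortlist could also be built programmatically (the 0.3 cut-off is my own arbitrary choice):
# Correlation of every numeric column with SalePrice, keeping only the stronger relationships
cors_with_price <- sapply(numeric, function(x) cor(x, numeric$SalePrice, use = "pairwise.complete.obs"))
sort(cors_with_price[abs(cors_with_price) > 0.3], decreasing = TRUE)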
#Selecting variables
new_df = numeric %>% select(SalePrice, Open.Porch.SF, Wood.Deck.SF, Garage.Area, Garage.Cars, Garage.Yr.Blt, Fireplaces, TotRms.AbvGrd, Half.Bath, Full.Bath, Bsmt.Full.Bath, Gr.Liv.Area, X2nd.Flr.SF, X1st.Flr.SF, BsmtFin.SF.1, Mas.Vnr.Area, Year.Built, Year.Remod.Add, Overall.Qual, Lot.Area, Lot.Frontage, Enclosed.Porch, PID)
# Let's see the correlation matrix
ggcorr(new_df, size = 3)
Now, it is easier to see the names of variables and I’ll be able to determine which ones have the strongest positive and negative correlation with the dependent variable (Sale Price).
# Firstly, the negative correlation coefficients:
cor(df$Enclosed.Porch, df$SalePrice)
## [1] -0.1287874
cor(df$SalePrice, df$PID)
## [1] -0.2465212
It seems that the parcel identification number (PID) has the strongest negative correlation with the house price, meaning that as the identification number increases, the price tends to decrease. However, it doesn't make sense to analyze identification numbers, so instead I will create a scatter plot for a meaningful continuous variable: the enclosed porch area in square feet.
plot(df$SalePrice, df$Enclosed.Porch,
pch = 21, cex=1.2,
xlab = "Price",
ylab = "Porch area (sqft)",
main = "Relationship between the house price and porch area",
bg = "cornflowerblue")
The correlation coefficient between the two variables was close to 0, so the visual representation does not show a clear negative relationship (the points would need to be scattered along a downward-sloping line). In fact, among houses that do have a porch, prices tend to increase as the porch gets larger; for houses with no porch (the points plotted along the x-axis), prices still vary widely, so they must depend on other variables.
I will take a look at the continuous variables with the highest positive correlation with the price:
# Now, trying to find the highest positive correlation coefficients:
cor(df$SalePrice, df$Overall.Qual)
## [1] 0.7992618
cor(df$SalePrice, df$X2nd.Flr.SF)
## [1] 0.2693734
cor(df$SalePrice, df$Gr.Liv.Area)
## [1] 0.7067799
So it seems that the overall quality of the house has the highest positive correlation with its price, which is logical: the better the condition of the house, the more buyers are willing to pay for it.
However, it would not make much sense to build a scatter plot around an ordinal categorical variable, so I am going to plot the relationship between the sale price and the above-ground living area in square feet instead.
ggplot(df, aes(SalePrice, Gr.Liv.Area)) +
geom_point(size = 2, color = "gray35", alpha = 0.6) +
theme_minimal() +
geom_smooth(method = lm, color = "blue", size = 2) +
scale_x_continuous(labels = comma)
## `geom_smooth()` using formula 'y ~ x'
Scatter plots show ordered pairs of numerical variables; they help reveal the relationship (or the lack of one) between the independent and dependent variables and the nature of that relationship (Bluman, 2018).
Now it can be clearly seen that the living area tends to increase with the price. The line shows a linear model of the relationship between the living area and the house price. It can also be seen that some unusual observations are present in the dataset.
I will create another scatter plot, where points will be colored based on the house quality:
#Let's see a scatter plot:
plot(df$SalePrice, df$Gr.Liv.Area,
pch = 21, cex=1.2,
xlab = "Price",
ylab = "Living area (sqft)",
main = "Positive linear relationship between the house price and house quality",
bg = c(rev(heat.colors(10)))[unclass(df$Overall.Qual)])
I used a sequential color scale, as I am not interested in distinguishing every separate category; as the color becomes darker, the quality value increases.
Now, I'll check which variable has a correlation coefficient closest to 0.5. I selected the variables whose colors in the matrix suggest values near 0.5 and eliminated some discrete or categorical variables.
#Selecting variables
df1 = numeric %>% select(SalePrice, Garage.Area, X1st.Flr.SF, BsmtFin.SF.1, Mas.Vnr.Area, Lot.Frontage)
# Let's see the correlation matrix
ggcorr(df1, size = 3, label = TRUE, label_size = 4, label_round = 2, label_alpha = TRUE)
It seems that the variable representing masonry veneer area in square feet has a correlation coefficient of 0.51 with the price.
# The scatter plot:
plot(df$SalePrice, df$Mas.Vnr.Area,
pch = 21, cex=1.2,
xlab = "Price",
ylab = "Masonry veneer area (sqft)",
main = "Relationship between the house price and masonry veneer area",
bg = c("red"))
Apart from the many houses with no masonry veneer (the points along the y = 0 line), it is difficult to see a clear pattern in the data. The points are scattered across the plot: there is some indication that the price increases with the veneer area, but it is not as clear as in the previous scatter plot of above-ground living area.
With the help of code found on Towards Data Science (Williams, 2020), I have been able to create a correlation matrix that only shows variables with correlation coefficients higher than 0.5.
One important note: the colors are reversed compared to the ggcorr() plots above. Here a strong positive correlation is shown in blue, whereas before blue indicated a negative correlation and red a positive one.
corr_simple <- function(data = df, sig = 0.5){
  # convert data to numeric in order to run correlations
  # convert to factor first to keep the integrity of the data - each value will become a number rather than turn into NA
  df_cor <- data %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  # run the correlations
  corr <- cor(df_cor)
  # prepare to drop duplicates and correlations of 1
  corr[lower.tri(corr, diag = TRUE)] <- NA
  # drop perfect correlations
  corr[corr == 1] <- NA
  # turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  # remove the NA values from above
  corr <- na.omit(corr)
  # keep only correlations above the chosen threshold
  corr <- subset(corr, abs(Freq) > sig)
  # sort by highest absolute correlation and print the table
  corr <- corr[order(-abs(corr$Freq)), ]
  print(corr)
  # turn corr back into a matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1 ~ Var2, value.var = "Freq")
  # plot correlations visually
  corrplot(mtx_corr, is.corr = FALSE, tl.col = "black", na.label = " ")
}
corr_simple()
## Var1 Var2 Freq
## 6397 Order Yr.Sold -0.9759929
## 2075 Exterior.1st Exterior.2nd 0.8654165
## 4558 Gr.Liv.Area TotRms.AbvGrd 0.8077721
## 6661 Overall.Qual SalePrice 0.7992618
## 1315 MS.SubClass Bldg.Type 0.7435622
## 6690 Gr.Liv.Area SalePrice 0.7067799
## 4563 Bedroom.AbvGr TotRms.AbvGrd 0.6726472
## 3900 X2nd.Flr.SF Gr.Liv.Area 0.6552512
## 4457 Exter.Qual Kitchen.Qual 0.6535235
## 6671 Exter.Qual SalePrice -0.6476163
## 2481 Year.Built Foundation 0.6366324
## 2315 Overall.Qual Exter.Qual -0.6331484
## 4148 Gr.Liv.Area Full.Bath 0.6303208
## 6687 X1st.Flr.SF SalePrice 0.6216761
## 6697 Kitchen.Qual SalePrice -0.6136888
## 1743 Year.Built Year.Remod.Add 0.6120953
## 4228 X2nd.Flr.SF Half.Bath 0.6116337
## 1659 Overall.Qual Year.Built 0.5970272
## 4447 Overall.Qual Kitchen.Qual -0.5896450
## 4556 X2nd.Flr.SF TotRms.AbvGrd 0.5852137
## 3873 Overall.Qual Gr.Liv.Area 0.5705559
## 1741 Overall.Qual Year.Remod.Add 0.5696088
## 3899 X1st.Flr.SF Gr.Liv.Area 0.5621658
## 6663 Year.Built SalePrice 0.5584261
## 6693 Full.Bath SalePrice 0.5456039
## 4450 Year.Remod.Add Kitchen.Qual -0.5387844
## 6664 Year.Remod.Add SalePrice 0.5329738
## 4561 Full.Bath TotRms.AbvGrd 0.5285992
## 4119 Overall.Qual Full.Bath 0.5222626
## 4312 Gr.Liv.Area Bedroom.AbvGr 0.5168075
## 4310 X2nd.Flr.SF Bedroom.AbvGr 0.5046506
## 3384 Year.Remod.Add Heating.QC -0.5036757
Now, it should be easier to pick variables for a linear model. I'll start by plotting the relationships between my variables of interest:
#Selecting variables
df2 = df %>% select(SalePrice, Garage.Area, X1st.Flr.SF, Gr.Liv.Area)
# Plot
pairs(df2,
main = "Relations between continuous variables in Ames housing data",
pch = 21,
bg = c("royalblue2"),
labels = c("Price","Garage Area","1st Floor Area","Living Area"),
lower.panel = NULL,
font.labels = 2,
cex.labels = 2)
From the plots, it seems that there might be a linear relationship between the 1st-floor area and the above-ground living area (which I believe is the sum of the first- and second-floor areas for houses with more than one floor), so I will have to check for multicollinearity.
After drawing scatter plots and computing the values for correlation coefficients to test the significance of a relationship, the next step would be to determine the equation of the regression line (Bluman, 2018).
model1 = lm(SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = df)
summary(model1)
##
## Call:
## lm(formula = SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF +
## Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -515739 -17594 -1446 15550 289941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -706025.507 60944.740 -11.585 < 0.0000000000000002 ***
## Overall.Qual 17818.862 794.992 22.414 < 0.0000000000000002 ***
## Gr.Liv.Area 51.323 2.052 25.013 < 0.0000000000000002 ***
## X1st.Flr.SF 25.805 2.162 11.936 < 0.0000000000000002 ***
## Garage.Area 42.642 4.002 10.655 < 0.0000000000000002 ***
## Year.Built 371.503 31.418 11.824 < 0.0000000000000002 ***
## Exter.QualFa -79323.135 7769.398 -10.210 < 0.0000000000000002 ***
## Exter.QualGd -64967.008 3870.518 -16.785 < 0.0000000000000002 ***
## Exter.QualTA -77760.702 4374.322 -17.777 < 0.0000000000000002 ***
## Full.Bath -6753.736 1678.556 -4.024 0.0000588 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35050 on 2919 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8081, Adjusted R-squared: 0.8075
## F-statistic: 1366 on 9 and 2919 DF, p-value: < 0.00000000000000022
The equation representing my multiple linear regression is as follows:
y = -706025.5 + 17818.86 * Overall.Qual + 51.32 * Gr.Liv.Area + 25.8 * X1st.Flr.SF + 42.64 * Garage.Area + 371.5 * Year.Built - 79323.14 * Exter.QualFa - 64967 * Exter.QualGd - 77760.7 * Exter.QualTA - 6753.74 * Full.Bath
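To illustrate how this equation turns feature values into a predicted price, here is a hedged example using predict() on a made-up house (the feature values are invented purely for illustration):
# Hypothetical house: quality 7, 1,500 sqft living area, 1,000 sqft first floor,
# a 500 sqft garage, built in 2000, "Gd" (good) exterior, 2 full bathrooms
new_house <- data.frame(Overall.Qual = 7, Gr.Liv.Area = 1500, X1st.Flr.SF = 1000,
                        Garage.Area = 500, Year.Built = 2000, Exter.Qual = "Gd",
                        Full.Bath = 2)
predict(model1, newdata = new_house)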
I included 7 variables in the model, but the output shows more coefficients because one of them is categorical. Last quarter, I learned that R automatically recodes categorical variables into dummy variables, creating one for each category except the reference level.
The intercept is negative, which means that if all the independent variables were equal to 0, the predicted price would be negative. Of course, that cannot happen in reality; if the living area were 0, there would be no house to sell. Most coefficients are positive (for example, when the overall quality rating increases by 1, the price is expected to increase by about $17,819, holding the other variables constant). Thus, there are no surprises: better-quality, bigger, and newer houses are worth more.
The coefficients for exterior quality are negative because the reference category appears to be "Excellent" (it is the level missing from the output). Thus, relative to an excellent exterior, the other quality levels are associated with lower prices.
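A small check of that coding (a sketch, not part of the original analysis):
# The first factor level becomes the reference category; for Exter.Qual the levels sort alphabetically,
# so "Ex" (Excellent) is the baseline that Fa, Gd, and TA are compared against
levels(factor(df$Exter.Qual))
# If a different baseline were preferred, the variable could be releveled before refitting,
# e.g. relevel(factor(df$Exter.Qual), ref = "TA") - not done here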
It is odd to see that, holding the other variables constant, each additional full bathroom is associated with a price decrease of about $6,754. So a bigger house commands a higher price, but devoting more of a given area to bathrooms appears to lower it.
Now, I will take a look at other measures present in the output. The R2 (coefficient of multiple determination) represents the amount of variation explained by the regression model (Bluman, 2018). In my case, the value is around 81%, meaning that the chosen independent variables explain 81% of the variation in house prices.
The F-statistic indicates whether there is a relationship between the dependent and independent variables in the model; generally, the larger the F-statistic, the stronger the evidence of such a relationship (Rego, 2015). In my case, the F-statistic is 1366.
A small p-value indicates that it would be unlikely to observe such a relationship between the predictor and the response purely by chance. It is common to use a p-value of 5% or less as a cut-off point, and three asterisks mark highly significant p-values (Rego, 2015). In my model, the p-values are all very close to zero.
ANALYSIS PART 2 (CONTINUED) - Diagnosing and improving the model
Now, I’d like to do some diagnostics to see how good my model really is:
# let's visualize the model's fit
autoplot(model1,
which = 1:3,
nrow = 3,
ncol = 1)
There are several plots that can help assess how well a model performs. Above, three diagnostic plots are shown for my model. The first one shows residuals versus fitted values (the residual being the distance between the actual and predicted value). To check the assumption that the residuals are centered around zero, I need to look at the trend line: it should closely follow the y = 0 line on the plot (Cotton). In this case, that is mostly true; a few points stray further away, but in general the points lie close to y = 0.
The purpose of the Q-Q plot is to show whether the residuals follow a normal distribution. Quantiles from the normal distribution are represented on the x-axis, and the y-axis represents the standardized residuals (residuals divided by their standard deviation). If the plotted points follow the diagonal line, the residuals are approximately normally distributed (Cotton). Most points in my plot do follow that line; however, there are a few (the same ones that were problematic on the residuals vs. fitted plot) that do not.
The third plot shows the square root of the standardized residuals versus the fitted values. It is called a scale-location plot and it helps to check for heteroscedasticity, i.e., whether the size of the residuals grows or shrinks with the fitted values. There are two important things to check: first, whether the line is roughly horizontal across the plot, which would mean that the spread of the residuals is roughly equal for all fitted values and the homoscedasticity assumption is likely met; and second, whether the residuals are randomly scattered with no clear patterns (Cotton; Zach, 2020).
There is also a statistical test to check for heteroscedasticity:
library(lmtest)
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 604.77, df = 9, p-value < 0.00000000000000022
The hypotheses of the Breusch-Pagan test are as follows: the null hypothesis states that the residuals are homoscedastic (constant variance), while the alternative hypothesis states that heteroscedasticity is present (Zach, 2020).
From my output, with a p-value close to 0, I must reject the null hypothesis and conclude that heteroscedasticity is present in my model.
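One common remedy (an alternative to the approach I take later) is to keep the least squares coefficients but report heteroscedasticity-consistent standard errors; a sketch, assuming the sandwich package is installed:
library(sandwich)
# Robust (HC1) standard errors for the same model; coeftest() comes from lmtest, loaded above
coeftest(model1, vcov = vcovHC(model1, type = "HC1"))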
library(car)
durbinWatsonTest(model1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.2471018 1.505323 0
## Alternative hypothesis: rho != 0
The Durbin-Watson test measures autocorrelation in the residuals of a regression analysis. Autocorrelation can cause many problems, including standard errors that are too small and an exaggerated goodness of fit; it may also make regression coefficients appear statistically significant when in fact they are not.
The Hypotheses for the Durbin Watson test are:
H0 = no first-order autocorrelation
H1 = first-order correlation exists
The test statistic ranges from 0 to 4: a value of 2 means there is no autocorrelation, values below 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation (Statistics How To, 2016).
In my case, with the p-value close to 0, I have to reject the null hypothesis and state that positive autocorrelation is present in my model.
“Multicollinearity in regression analysis occurs when two or more predictor variables are highly correlated to each other” (Zach, 2019). This means that they do not provide unique or independent information in the regression model, which can cause problems with its interpretation. The most common way to detect multicollinearity is the variance inflation factor (VIF); its values start at 1 and have no upper limit.
A general rule of thumb for interpreting VIFs is: a value of 1 indicates no correlation between independent variables; values between 1 and 5 show moderate correlation but are not severe enough to require attention; values greater than 5 indicate a potentially severe correlation between predictors. With VIF higher than 5, the coefficient estimates and p-values in the regression output are likely to be unreliable (Zach, 2019).
vif(model1)
## GVIF Df GVIF^(1/(2*Df))
## Overall.Qual 2.999148 1 1.731805
## Gr.Liv.Area 2.564268 1 1.601333
## X1st.Flr.SF 1.711119 1 1.308098
## Garage.Area 1.765036 1 1.328547
## Year.Built 2.150770 1 1.466550
## Exter.Qual 2.680213 3 1.178587
## Full.Bath 2.053187 1 1.432895
The highest VIF value is 2.99, so multicollinearity should not cause any problems or violate the no-multicollinearity assumption. If two of my predictors (for instance, above-ground living area and 1st-floor area) were strongly correlated, I would inspect why and probably remove one of them, especially if they contribute similar information to the model.
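For completeness, this is the kind of check I would run in that situation (a sketch only, since the VIFs above do not call for it):
# Refit without X1st.Flr.SF, then see how much explanatory power is lost and whether the remaining VIFs drop
model_no_1st <- update(model1, . ~ . - X1st.Flr.SF)
summary(model_no_1st)$adj.r.squared
vif(model_no_1st)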
Now I can check for the unusual observations:
library(olsrr)
ols_plot_resid_lev(model1)
There seem to be many leverage points and some outliers present in my dataset, which influence the regression line; they could already be spotted on the Q-Q plot. I will remove 5 observations that lie particularly far from the others (3 of them were already labeled on the Q-Q plot). I don't want to remove more, so as not to reduce my sample size significantly.
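The row numbers can also be located programmatically rather than read off the plot; a sketch using Cook's distance (my own check, with an arbitrary choice of the top five):
# Observations with the largest influence on the fitted coefficients
cooks <- cooks.distance(model1)
head(sort(cooks, decreasing = TRUE), 5)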
# Removing observations
houses = df[-c(2182, 2181, 1499, 1761, 1768),]
Now, I'd like to address the heteroscedasticity present in my original model. Heteroscedasticity has two main consequences: there exists an estimator other than the least squares estimator with a smaller variance, and the standard errors computed for the least squares estimator are incorrect, which affects confidence intervals and hypothesis tests (Yobero, 2016).
When heteroscedasticity is present, the best linear unbiased estimator is the generalized least squares (GLS) estimator (Yobero, 2016). I will try it and compare the results to my original model.
model2 <- lm(SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = houses)
summary(model2)
##
## Call:
## lm(formula = SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF +
## Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -195760 -17285 -931 15791 221602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -776584.300 53709.248 -14.459 < 0.0000000000000002 ***
## Overall.Qual 16383.334 701.507 23.354 < 0.0000000000000002 ***
## Gr.Liv.Area 58.127 1.853 31.365 < 0.0000000000000002 ***
## X1st.Flr.SF 33.723 1.929 17.482 < 0.0000000000000002 ***
## Garage.Area 40.382 3.526 11.452 < 0.0000000000000002 ***
## Year.Built 409.972 27.689 14.806 < 0.0000000000000002 ***
## Exter.QualFa -87069.279 6848.045 -12.714 < 0.0000000000000002 ***
## Exter.QualGd -72862.883 3434.967 -21.212 < 0.0000000000000002 ***
## Exter.QualTA -85167.722 3870.859 -22.002 < 0.0000000000000002 ***
## Full.Bath -11347.640 1490.488 -7.613 0.0000000000000358 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30830 on 2914 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8465, Adjusted R-squared: 0.846
## F-statistic: 1785 on 9 and 2914 DF, p-value: < 0.00000000000000022
Even just from removing the outliers, the R2 is already higher. Let's see how the summary would look if the model were fitted using the generalized least squares estimator:
library(nlme)
model3 <- gls(SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath, # model formula
              data = houses,
              correlation = corAR1(0.25, form = ~ 1 | Order), # AR(1) correlation structure with a specified autocorrelation value
              na.action = na.exclude) # how to handle NAs, if any are present; otherwise an error is returned
summary(model3)
## Generalized least squares fit by REML
## Model: SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath
## Data: houses
## AIC BIC logLik
## 68641.7 68713.43 -34308.85
##
## Correlation Structure: AR(1)
## Formula: ~1 | Order
## Parameter estimate(s):
## Phi
## 0.25
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -776584.3 53709.25 -14.459042 0
## Overall.Qual 16383.3 701.51 23.354470 0
## Gr.Liv.Area 58.1 1.85 31.364845 0
## X1st.Flr.SF 33.7 1.93 17.481733 0
## Garage.Area 40.4 3.53 11.452215 0
## Year.Built 410.0 27.69 14.806327 0
## Exter.QualFa -87069.3 6848.04 -12.714473 0
## Exter.QualGd -72862.9 3434.97 -21.212107 0
## Exter.QualTA -85167.7 3870.86 -22.002278 0
## Full.Bath -11347.6 1490.49 -7.613374 0
##
## Correlation:
## (Intr) Ovrl.Q Gr.L.A X1.F.S Grg.Ar Yr.Blt Ext.QF Ext.QG Ex.QTA
## Overall.Qual 0.187
## Gr.Liv.Area -0.337 -0.290
## X1st.Flr.SF 0.052 -0.064 -0.296
## Garage.Area 0.211 -0.124 -0.175 -0.206
## Year.Built -0.993 -0.264 0.346 -0.076 -0.217
## Exter.QualFa -0.162 0.331 -0.036 0.067 0.069 0.091
## Exter.QualGd -0.033 0.248 -0.050 0.169 0.055 -0.052 0.546
## Exter.QualTA -0.194 0.396 -0.032 0.131 0.070 0.090 0.626 0.892
## Full.Bath 0.292 0.004 -0.515 0.030 0.031 -0.306 -0.010 -0.036 0.006
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -6.3493815 -0.5606350 -0.0301872 0.5121697 7.1875707
##
## Residual standard error: 30831.34
## Degrees of freedom: 2924 total; 2914 residual
The output from this model differs from the linear regression output: it shows the intercept and coefficient estimates, and additionally the correlations between the estimated coefficients of the variables included in the model.
The equation for this model is as follows:
y = -776584.3 + 16383.3 * Overall.Qual + 58.1 * Gr.Liv.Area + 33.7 * X1st.Flr.SF + 40.4 * Garage.Area + 410 * Year.Built - 87069.3 * Exter.QualFa - 72862.9 * Exter.QualGd - 85167.7 * Exter.QualTA - 11347.6 * Full.Bath
The coefficient estimates and standard errors are essentially the same as in the linear model fitted on the same data. The p-values are also very close to zero, indicating that the relationships between the predictors and the dependent variable are unlikely to be due to chance.
As stated in our materials for the first module, there are different approaches to comparing models:
Akaike Information Criterion (AIC) estimates the in-sample prediction error and indicates the relative quality of statistical models for a given dataset (it is only useful to compare models based on the same data). Bayesian Information Criterion (BIC) is a penalized-likelihood criterion derived from Bayesian probability. It is closely related to AIC. Generally, lower values of BIC and AIC are preferred.
GLS reports AIC and BIC as part of its output, so I will compare its performance to my linear model using those measures:
# Create vector with values
a = c(AIC(model2), BIC(model2), AIC(model3), BIC(model3))
# Create vector with names
b = c("AIC lm", "BIC lm", "AIC gls", "BIC gls")
# Link the values with names
names(a) = b
print(a)
## AIC lm BIC lm AIC gls BIC gls
## 68756.54 68822.33 68641.70 68713.43
Based on the values in the table above, it seems that the model fitted with generalized least squares performed slightly better.
With many independent variables, sometimes it is challenging to choose the best model by trying different combinations. Thus, I will use the best subsets regression. It is a model selection approach conducted by testing all possible combinations of the independent variables and then selecting the best model according to specified statistical criteria (STHDA, 2018).
# All Subsets Regression
library(leaps)
models <- regsubsets(SalePrice ~ Overall.Qual + Gr.Liv.Area + X1st.Flr.SF + Garage.Area + Year.Built + Exter.Qual + Full.Bath, data = houses, nvmax = 9)
# plot a table of models showing variables in each model ordered by the selection statistic (adjusted R2)
plot(models, scale = "adjr2")
It seems that the model with all 9 predictors was the best. I can also check this using several criteria: adjusted R2, Mallows' Cp, and the Bayesian information criterion:
res.sum <- summary(models)
data.frame(
Adj.R2 = which.max(res.sum$adjr2),
CP = which.min(res.sum$cp),
BIC = which.min(res.sum$bic)
)
## Adj.R2 CP BIC
## 1 9 9 9
The output is clear: the model with all 9 predictors performs best on all of these metrics. If I had included all 82 variables, best subsets regression might well have suggested a linear model with an even larger number of predictors.
Thus, my conclusion is that the best model uses these 9 predictors and is fitted with generalized least squares (to account for correlation between the residuals).
CONCLUSION
Through this assignment, I have learned how to diagnose and improve linear models. A lot of real-life data does not follow all the assumptions of multiple linear regression, and issues such as multicollinearity or heteroscedasticity need to be investigated and often corrected. There can be multiple remedies for those issues, so it is important to try different methods and see which one works for a particular dataset.
BIBLIOGRAPHY
Bluman, A. G. (2018). Elementary statistics: A Step by Step Approach: A Brief Version. New York: McGraw-Hill Higher Education; 10th edition.
Cotton, R. Introduction to Regression in R. Datacamp, Retrieved from: https://app.datacamp.com/learn/courses/introduction-to-regression-in-r
Rego, F. (2015). Quick Guide: Interpreting Simple Linear Model Output in R. Retrieved from: https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
Statistics How To. (2016) Durbin Watson Test & Test Statistic. Retrieved from: https://www.statisticshowto.com/durbin-watson-test-coefficient/
STHDA (2018). Best Subsets Regression Essentials in R. Retrieved from: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/155-best-subsets-regression-essentials-in-r/
Williams, C. (2020). How to Create a Correlation Matrix with Too Many Variables in R. Towards Data science. Retrieved from: https://towardsdatascience.com/how-to-create-a-correlation-matrix-with-too-many-variables-309cc0c0a57
Yobero, C. (2016). Methods for Detecting and Resolving Heteroskedasticity. Retrieved from: https://rpubs.com/cyobero/187387
Zach. (2019). How to Calculate Variance Inflation Factor (VIF) in R. Statology. Retrieved from: https://www.statology.org/variance-inflation-factor-r/
Zach. (2020). How to Interpret a Scale-Location Plot (With Examples). Statology. Retrieved from: https://www.statology.org/scale-location-plot/