Regression Dianostics with R

Ondina Castillo 3/2/2022

1. Analyzing a housing dataset (Ames Housing Dataset)

2. EDA

#a. Visualizing summary of the data set
#summary(ames.housing)
#b. creating data subset without character variables
data.only.numeric<-ames.housing[, !sapply(ames.housing, is.character)]
#c. Looking for NA values
#summary(data.only.numeric)
dim(data.only.numeric)
## [1] 2930   39
only.numeric.noNA<- na.omit(data.only.numeric)

3. Creating Correlation Matrix

correlation.matrix<- cor(only.numeric.noNA, method ="pearson")
corrplot::corrplot(cor(correlation.matrix), tl.cex = 0.4)

Interpretation:

The variables with a higher correlation with SalePrice are: Overall.Qual(0.8), Gr.Liv.Area(0.71), Garage.Cars(0.66), Total.Bsmt.SF(0.65), Garage area(0.65), Full Bath(0.56), Year.Built(0.56), garage.Yr.Blt(0.54), Year.Remod.Add(0.54), Mas.Vnr.Area (0.52), and TotRms.AbvGrd(0.52)

4. Creating Graphs

library("car")
scatterplot(Yr.Sold~SalePrice, data = ames.housing)

scatterplot(Lot.Area~SalePrice, data=ames.housing)

5. Plotting Scatterplots with variable with highest and lowest

correlation with SalePrice

library(ggplot2)
#a. Variable with highest correlation with SalePrice
only.numeric.noNA$Overall.Qual<- as.numeric(only.numeric.noNA$Overall.Qual)
only.numeric.noNA$SalePrice<- as.numeric(only.numeric.noNA$SalePrice)
ggplot(data =only.numeric.noNA) + aes(x = Overall.Qual, y = SalePrice, log="y") + theme_light()+ 
  geom_jitter(col="salmon")+ geom_smooth(method='lm', formula= y~x) + ylab("Sale Price") + xlab("Overall Quality") + ggtitle("Linear Regression Overall Quality and Sale Price") 

6. Variable with lowest correlation with SalePrice

ggplot(data =only.numeric.noNA) + aes(x = Misc.Val, y = SalePrice, log="y") + theme_light()+ 
  geom_jitter(col="lightblue")+ geom_smooth(method='lm', formula= y~x) + ylab("Sale Price") + xlab("Miscellaneous feature not covered in other categories") + ggtitle("Linear Regression Miscellaneous feature and Sale Price") 

7. Variable with a correlation closest to 0.5

ggplot(data =only.numeric.noNA) + aes(x = TotRms.AbvGrd, y = SalePrice, log="y") + theme_light()+ 
  geom_jitter(col="gray")+ geom_smooth(method='lm', formula= y~x) + ylab("Sale Price") + xlab("Total rooms above grade") + ggtitle("Linear Regression Total rooms above grade and Sale Price") 

Interpretation:

The higher the correlation between the variable SalePrice and the independent variable , the better the fit of the linear regression. For example, in Figure 4 the correlation is very close to 0 because the data points are scattered and clustered and don’t appear to follow a linear trajectory.

On the contrary, the graphs from Figure 3 and Figure 5 showcase a distinguishable linear relationship between the independent variable and the dependent variable.

Finally, in Figure 5 the linear relationship is not ideal since the point become scattered at the end.

Furthermore, after obtaining the regression summary in R, we can now go ahead and create a regression model to analyze the data. The continuous variables with the highest correlation coefficients with SalePrice were used. Thus, resulting equation from our regression model was done utilizing the variables SalePrice, Garage.Area, Gr.Liv.Area, and Total.Bsmt.SF.

8. Creating regression model with 3 continuous variables

#a. Creating regression model
library(sjPlot)
## #refugeeswelcome
attach(only.numeric.noNA)
Table_regression <- lm(SalePrice ~ Garage.Area + Gr.Liv.Area + Total.Bsmt.SF)
tab_model(Table_regression)
  SalePrice
Predictors Estimates CI p
(Intercept) -40818.97 -47543.94 – -34094.00 \<0.001
Garage Area 113.69 101.03 – 126.35 \<0.001
Gr Liv Area 72.24 67.51 – 76.96 \<0.001
Total Bsmt SF 56.20 50.88 – 61.52 \<0.001
Observations 2274
R2 / R2 adjusted 0.677 / 0.677
summary(Table_regression)
## 
## Call:
## lm(formula = SalePrice ~ Garage.Area + Gr.Liv.Area + Total.Bsmt.SF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -711329  -19612     689   19623  257538 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -40818.970   3429.340  -11.90   <2e-16 ***
## Garage.Area      113.694      6.456   17.61   <2e-16 ***
## Gr.Liv.Area       72.236      2.408   30.00   <2e-16 ***
## Total.Bsmt.SF     56.199      2.713   20.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47400 on 2270 degrees of freedom
## Multiple R-squared:  0.6769, Adjusted R-squared:  0.6765 
## F-statistic:  1585 on 3 and 2270 DF,  p-value: < 2.2e-16

Interpretation:

As an be seen from the model, all 3 variables’ coefficients (Garage.Area, Gr.Liv.Area, and Total.Bsmt.SF) have a p-value that is less than 0.05 which means that they are all significant and should not be removed.

Moreover, the adjusted R2 is quite high.
#b. Plotting regression model
plot(Table_regression)

par(mfrow = c (2,2))

#6. Checking the model for multicollinearity
vif(Table_regression)
##   Garage.Area   Gr.Liv.Area Total.Bsmt.SF 
##      1.584434      1.476988      1.492049

Interpretation:

The Homoscedasticity assumption has been violated by our model since in the Scale-Location graph the points aren't scattered. Moreover, in the Normal Q-Q Plot the points don’t align perfectly to the line, nonetheless, the disparity isn’t extreme either. Finally, in both the residuals vs fitted plot and the residuals vs leverage plot there are a few outliers or possible unusual observations.

9. Looking for unusual observations or outliers

outlierTest(model=Table_regression)
##        rstudent unadjusted p-value Bonferroni p
## 1161 -16.406359         2.9881e-57   6.7949e-54
## 1690 -12.515509         8.4211e-35   1.9150e-31
## 1691  -8.426969         6.2079e-17   1.4117e-13
## 39     5.478963         4.7515e-08   1.0805e-04
## 1365   5.393770         7.6155e-08   1.7318e-04
## 831    5.217333         1.9799e-07   4.5023e-04
## 1360   5.037173         5.0971e-07   1.1591e-03
## 337    4.746893         2.1943e-06   4.9899e-03
## 338    4.383753         1.2197e-05   2.7737e-02
## 2000  -4.383613         1.2205e-05   2.7754e-02
hat.plot <- function (fit) {
  p <- length(coefficients(Table_regression))
  n <- length (fitted(Table_regression))
  plot(hatvalues(Table_regression), main = "Index Plot of hat Values")
  abline(h=c(2,3)*p/n, col= "red", lty=2)
  identify(1:n, hatvalues(Table_regression), names(hatvalues(Table_regression)))
}
#b. Looking at Cook's D Chart
library(olsrr)
## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers
ols_plot_cooksd_chart(Table_regression)

hat.plot(Table_regression)

## integer(0)

Interpretation:

As can be seen  there are quite a few outliers.
Also, in the graph below it becomes apparent once again that there are a few outliers, which probably should be removed since they go beyond the red line.

10. Eliminating unusual observations to improve model

cooksd <- cooks.distance(Table_regression)
sample_size <- nrow(data.only.numeric)
influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])
only.numeric.no.outliers <- only.numeric.noNA[-influential, ]


#a. Looking at model now
attach(only.numeric.no.outliers)
## The following objects are masked from only.numeric.noNA:
## 
##     Bedroom.AbvGr, Bsmt.Full.Bath, Bsmt.Half.Bath, Bsmt.Unf.SF,
##     BsmtFin.SF.1, BsmtFin.SF.2, Enclosed.Porch, Fireplaces, Full.Bath,
##     Garage.Area, Garage.Cars, Garage.Yr.Blt, Gr.Liv.Area, Half.Bath,
##     Kitchen.AbvGr, Lot.Area, Lot.Frontage, Low.Qual.Fin.SF,
##     Mas.Vnr.Area, Misc.Val, Mo.Sold, MS.SubClass, Open.Porch.SF, Order,
##     Overall.Cond, Overall.Qual, PID, Pool.Area, SalePrice,
##     Screen.Porch, Total.Bsmt.SF, TotRms.AbvGrd, Wood.Deck.SF,
##     X1st.Flr.SF, X2nd.Flr.SF, X3Ssn.Porch, Year.Built, Year.Remod.Add,
##     Yr.Sold
Table_regression2 <- lm(SalePrice ~ Garage.Area + Gr.Liv.Area + Total.Bsmt.SF)
plot(Table_regression2)

par(mfrow = c (2,2))
summary(Table_regression2)
## 
## Call:
## lm(formula = SalePrice ~ Garage.Area + Gr.Liv.Area + Total.Bsmt.SF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -116684  -17337     707   17787   92349 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -38606.069   2511.015  -15.38   <2e-16 ***
## Garage.Area      107.961      4.514   23.92   <2e-16 ***
## Gr.Liv.Area       73.110      1.744   41.92   <2e-16 ***
## Total.Bsmt.SF     54.763      1.934   28.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29000 on 2081 degrees of freedom
## Multiple R-squared:  0.7872, Adjusted R-squared:  0.7869 
## F-statistic:  2567 on 3 and 2081 DF,  p-value: < 2.2e-16

Interpretation:

After obtaining these findings, eliminating all these influential observations to improve the model became necessary. After doing that procedure, the model improved noticeably as can be seen on the graph.
 
This time around, the Q-Q plot line is almost perfect, and the points appear to be scattered in the Scale-Location graph. In short, it can be said that the main issues of the models were eliminated by removing the outliers in the data.

11 Examining normality assumption

hist(data.only.numeric$SalePrice)

hist(only.numeric.no.outliers$SalePrice)

Interpretation:

Qhen looking at the histogram of the SalePrice, it can be observed that the data went from being right-skewed to having a normal distribution.

12 Using subsets method to identify the best model

library(leaps)
regfit_full = regsubsets(SalePrice~., data = only.numeric.noNA)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 2 linear dependencies found

## Reordering variables and trying again:
#a. Looking at the model selected by subsets method
model2 <- lm(SalePrice ~ Overall.Qual + BsmtFin.SF.1 + Gr.Liv.Area)
summary(model2)
## 
## Call:
## lm(formula = SalePrice ~ Overall.Qual + BsmtFin.SF.1 + Gr.Liv.Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126507  -16960     397   17331  125132 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -84757.295   2883.213  -29.40   <2e-16 ***
## Overall.Qual  24911.885    559.360   44.54   <2e-16 ***
## BsmtFin.SF.1     33.792      1.484   22.77   <2e-16 ***
## Gr.Liv.Area      64.922      1.717   37.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27070 on 2081 degrees of freedom
## Multiple R-squared:  0.8146, Adjusted R-squared:  0.8143 
## F-statistic:  3048 on 3 and 2081 DF,  p-value: < 2.2e-16
plot(model2)

par(mfrow = c (2,2))
Interpretation

The best model that was selected by the programming language contained the following variables: Overall.Qual, BsmtFin.SF.1, and Gr.Liv.Area.

As can be seen inthe model has a higher adjusted R2 which indicates that it is probably more adequate that the one fabricated in this study. Nonetheless, before making a conclusion the regression model was plotted to see how its graphs looked like; to determine if they comply with the necessary assumptions. And, as can be seen , the model contains less outliers/influential observations than the one that was first graphed in this paper (before all outliers were eliminated to improve the model).

13 Comparing both models

library(performance)
compare_performance(Table_regression2, model2, rank = TRUE)
## # Comparison of Model Performance Indices
## 
## Name              | Model |    R2 | R2 (adj.) |      RMSE |     Sigma | AIC weights | BIC weights | Performance-Score
## ---------------------------------------------------------------------------------------------------------------------
## model2            |    lm | 0.815 |     0.814 | 27042.326 | 27068.304 |        1.00 |        1.00 |           100.00%
## Table_regression2 |    lm | 0.787 |     0.787 | 28969.787 | 28997.616 |     < 0.001 |     < 0.001 |             0.00%
#a Visualization of model performance
library(see)
plot(compare_performance(Table_regression2,model2, rank = TRUE))

Interpretation:

To compare the best model chosen by the subset method and the one that was elaborated in this paper an R package was used to analyze and see the performance indicators of both models.

The results indicate taht the model that performs the best is the one selected by the subsets method.

Moreover, a plot was drawn to see visually the performance of each of the model, and once again it is very clear that the one that should be chosen is model2 (the subsets method model)

Conclusion

Interpretation:


The first model was built taking into account only continuous quantitative variables, namely garage area, above grade living area (square feet) and total square feet of basement area. The second one, on the other hand, was created by using the subsets method and the variables that were selected for creating the best 3 predictors model were overall quality, above grade living area (square feet), and rating of basement finished area (Type 1 finished square feet). The difference between the two models is that the second one contained discrete variables, while the first one was constrained in that aspect. Nonetheless, both agree on incorporating the variable above grade living area.
When it came to selecting the most appropriate model, after comparing both it become clear that the second one was better because it had a better R2 and also a better overall score.
To conclude, after this analysis it can be stated that a house that does better regarding overall quality, above grade living area (square feet), and rating of basement finished area (Type 1 finished square feet), will have a higher sale price. This information could be very beneficial for potential buyers or realtors.