Ondina Castillo 3/2/2022
#a. Visualizing summary of the data set
#summary(ames.housing)
#b. creating data subset without character variables
data.only.numeric<-ames.housing[, !sapply(ames.housing, is.character)]
#c. Looking for NA values
#summary(data.only.numeric)
dim(data.only.numeric)## [1] 2930 39
only.numeric.noNA<- na.omit(data.only.numeric)correlation.matrix<- cor(only.numeric.noNA, method ="pearson")
corrplot::corrplot(cor(correlation.matrix), tl.cex = 0.4)Interpretation:
The variables with a higher correlation with SalePrice are: Overall.Qual(0.8), Gr.Liv.Area(0.71), Garage.Cars(0.66), Total.Bsmt.SF(0.65), Garage area(0.65), Full Bath(0.56), Year.Built(0.56), garage.Yr.Blt(0.54), Year.Remod.Add(0.54), Mas.Vnr.Area (0.52), and TotRms.AbvGrd(0.52)
library("car")
scatterplot(Yr.Sold~SalePrice, data = ames.housing)scatterplot(Lot.Area~SalePrice, data=ames.housing)correlation with SalePrice
library(ggplot2)
#a. Variable with highest correlation with SalePrice
only.numeric.noNA$Overall.Qual<- as.numeric(only.numeric.noNA$Overall.Qual)
only.numeric.noNA$SalePrice<- as.numeric(only.numeric.noNA$SalePrice)
ggplot(data =only.numeric.noNA) + aes(x = Overall.Qual, y = SalePrice, log="y") + theme_light()+
geom_jitter(col="salmon")+ geom_smooth(method='lm', formula= y~x) + ylab("Sale Price") + xlab("Overall Quality") + ggtitle("Linear Regression Overall Quality and Sale Price") ggplot(data =only.numeric.noNA) + aes(x = Misc.Val, y = SalePrice, log="y") + theme_light()+
geom_jitter(col="lightblue")+ geom_smooth(method='lm', formula= y~x) + ylab("Sale Price") + xlab("Miscellaneous feature not covered in other categories") + ggtitle("Linear Regression Miscellaneous feature and Sale Price") ggplot(data =only.numeric.noNA) + aes(x = TotRms.AbvGrd, y = SalePrice, log="y") + theme_light()+
geom_jitter(col="gray")+ geom_smooth(method='lm', formula= y~x) + ylab("Sale Price") + xlab("Total rooms above grade") + ggtitle("Linear Regression Total rooms above grade and Sale Price") Interpretation:
The higher the correlation between the variable SalePrice and the independent variable , the better the fit of the linear regression. For example, in Figure 4 the correlation is very close to 0 because the data points are scattered and clustered and don’t appear to follow a linear trajectory.
On the contrary, the graphs from Figure 3 and Figure 5 showcase a distinguishable linear relationship between the independent variable and the dependent variable.
Finally, in Figure 5 the linear relationship is not ideal since the point become scattered at the end.
Furthermore, after obtaining the regression summary in R, we can now go ahead and create a regression model to analyze the data. The continuous variables with the highest correlation coefficients with SalePrice were used. Thus, resulting equation from our regression model was done utilizing the variables SalePrice, Garage.Area, Gr.Liv.Area, and Total.Bsmt.SF.
#a. Creating regression model
library(sjPlot)## #refugeeswelcome
attach(only.numeric.noNA)
Table_regression <- lm(SalePrice ~ Garage.Area + Gr.Liv.Area + Total.Bsmt.SF)
tab_model(Table_regression)| Â | SalePrice | ||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -40818.97 | -47543.94 – -34094.00 | \<0.001 |
| Garage Area | 113.69 | 101.03 – 126.35 | \<0.001 |
| Gr Liv Area | 72.24 | 67.51 – 76.96 | \<0.001 |
| Total Bsmt SF | 56.20 | 50.88 – 61.52 | \<0.001 |
| Observations | 2274 | ||
| R2 / R2 adjusted | 0.677 / 0.677 | ||
summary(Table_regression)##
## Call:
## lm(formula = SalePrice ~ Garage.Area + Gr.Liv.Area + Total.Bsmt.SF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -711329 -19612 689 19623 257538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40818.970 3429.340 -11.90 <2e-16 ***
## Garage.Area 113.694 6.456 17.61 <2e-16 ***
## Gr.Liv.Area 72.236 2.408 30.00 <2e-16 ***
## Total.Bsmt.SF 56.199 2.713 20.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47400 on 2270 degrees of freedom
## Multiple R-squared: 0.6769, Adjusted R-squared: 0.6765
## F-statistic: 1585 on 3 and 2270 DF, p-value: < 2.2e-16
Interpretation:
As an be seen from the model, all 3 variables’ coefficients (Garage.Area, Gr.Liv.Area, and Total.Bsmt.SF) have a p-value that is less than 0.05 which means that they are all significant and should not be removed.
Moreover, the adjusted R2 is quite high.
#b. Plotting regression model
plot(Table_regression)par(mfrow = c (2,2))
#6. Checking the model for multicollinearity
vif(Table_regression)## Garage.Area Gr.Liv.Area Total.Bsmt.SF
## 1.584434 1.476988 1.492049
Interpretation:
The Homoscedasticity assumption has been violated by our model since in the Scale-Location graph the points aren't scattered. Moreover, in the Normal Q-Q Plot the points don’t align perfectly to the line, nonetheless, the disparity isn’t extreme either. Finally, in both the residuals vs fitted plot and the residuals vs leverage plot there are a few outliers or possible unusual observations.
outlierTest(model=Table_regression)## rstudent unadjusted p-value Bonferroni p
## 1161 -16.406359 2.9881e-57 6.7949e-54
## 1690 -12.515509 8.4211e-35 1.9150e-31
## 1691 -8.426969 6.2079e-17 1.4117e-13
## 39 5.478963 4.7515e-08 1.0805e-04
## 1365 5.393770 7.6155e-08 1.7318e-04
## 831 5.217333 1.9799e-07 4.5023e-04
## 1360 5.037173 5.0971e-07 1.1591e-03
## 337 4.746893 2.1943e-06 4.9899e-03
## 338 4.383753 1.2197e-05 2.7737e-02
## 2000 -4.383613 1.2205e-05 2.7754e-02
hat.plot <- function (fit) {
p <- length(coefficients(Table_regression))
n <- length (fitted(Table_regression))
plot(hatvalues(Table_regression), main = "Index Plot of hat Values")
abline(h=c(2,3)*p/n, col= "red", lty=2)
identify(1:n, hatvalues(Table_regression), names(hatvalues(Table_regression)))
}#b. Looking at Cook's D Chart
library(olsrr)##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_plot_cooksd_chart(Table_regression)hat.plot(Table_regression)## integer(0)
Interpretation:
As can be seen there are quite a few outliers.
Also, in the graph below it becomes apparent once again that there are a few outliers, which probably should be removed since they go beyond the red line.
cooksd <- cooks.distance(Table_regression)
sample_size <- nrow(data.only.numeric)
influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])
only.numeric.no.outliers <- only.numeric.noNA[-influential, ]
#a. Looking at model now
attach(only.numeric.no.outliers)## The following objects are masked from only.numeric.noNA:
##
## Bedroom.AbvGr, Bsmt.Full.Bath, Bsmt.Half.Bath, Bsmt.Unf.SF,
## BsmtFin.SF.1, BsmtFin.SF.2, Enclosed.Porch, Fireplaces, Full.Bath,
## Garage.Area, Garage.Cars, Garage.Yr.Blt, Gr.Liv.Area, Half.Bath,
## Kitchen.AbvGr, Lot.Area, Lot.Frontage, Low.Qual.Fin.SF,
## Mas.Vnr.Area, Misc.Val, Mo.Sold, MS.SubClass, Open.Porch.SF, Order,
## Overall.Cond, Overall.Qual, PID, Pool.Area, SalePrice,
## Screen.Porch, Total.Bsmt.SF, TotRms.AbvGrd, Wood.Deck.SF,
## X1st.Flr.SF, X2nd.Flr.SF, X3Ssn.Porch, Year.Built, Year.Remod.Add,
## Yr.Sold
Table_regression2 <- lm(SalePrice ~ Garage.Area + Gr.Liv.Area + Total.Bsmt.SF)
plot(Table_regression2)par(mfrow = c (2,2))
summary(Table_regression2)##
## Call:
## lm(formula = SalePrice ~ Garage.Area + Gr.Liv.Area + Total.Bsmt.SF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -116684 -17337 707 17787 92349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -38606.069 2511.015 -15.38 <2e-16 ***
## Garage.Area 107.961 4.514 23.92 <2e-16 ***
## Gr.Liv.Area 73.110 1.744 41.92 <2e-16 ***
## Total.Bsmt.SF 54.763 1.934 28.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29000 on 2081 degrees of freedom
## Multiple R-squared: 0.7872, Adjusted R-squared: 0.7869
## F-statistic: 2567 on 3 and 2081 DF, p-value: < 2.2e-16
Interpretation:
After obtaining these findings, eliminating all these influential observations to improve the model became necessary. After doing that procedure, the model improved noticeably as can be seen on the graph.
This time around, the Q-Q plot line is almost perfect, and the points appear to be scattered in the Scale-Location graph. In short, it can be said that the main issues of the models were eliminated by removing the outliers in the data.
hist(data.only.numeric$SalePrice)hist(only.numeric.no.outliers$SalePrice)Interpretation:
Qhen looking at the histogram of the SalePrice, it can be observed that the data went from being right-skewed to having a normal distribution.
library(leaps)
regfit_full = regsubsets(SalePrice~., data = only.numeric.noNA)## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 2 linear dependencies found
## Reordering variables and trying again:
#a. Looking at the model selected by subsets method
model2 <- lm(SalePrice ~ Overall.Qual + BsmtFin.SF.1 + Gr.Liv.Area)
summary(model2)##
## Call:
## lm(formula = SalePrice ~ Overall.Qual + BsmtFin.SF.1 + Gr.Liv.Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126507 -16960 397 17331 125132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84757.295 2883.213 -29.40 <2e-16 ***
## Overall.Qual 24911.885 559.360 44.54 <2e-16 ***
## BsmtFin.SF.1 33.792 1.484 22.77 <2e-16 ***
## Gr.Liv.Area 64.922 1.717 37.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27070 on 2081 degrees of freedom
## Multiple R-squared: 0.8146, Adjusted R-squared: 0.8143
## F-statistic: 3048 on 3 and 2081 DF, p-value: < 2.2e-16
plot(model2)par(mfrow = c (2,2))Interpretation
The best model that was selected by the programming language contained the following variables: Overall.Qual, BsmtFin.SF.1, and Gr.Liv.Area.
As can be seen inthe model has a higher adjusted R2 which indicates that it is probably more adequate that the one fabricated in this study. Nonetheless, before making a conclusion the regression model was plotted to see how its graphs looked like; to determine if they comply with the necessary assumptions. And, as can be seen , the model contains less outliers/influential observations than the one that was first graphed in this paper (before all outliers were eliminated to improve the model).
library(performance)
compare_performance(Table_regression2, model2, rank = TRUE)## # Comparison of Model Performance Indices
##
## Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | BIC weights | Performance-Score
## ---------------------------------------------------------------------------------------------------------------------
## model2 | lm | 0.815 | 0.814 | 27042.326 | 27068.304 | 1.00 | 1.00 | 100.00%
## Table_regression2 | lm | 0.787 | 0.787 | 28969.787 | 28997.616 | < 0.001 | < 0.001 | 0.00%
#a Visualization of model performance
library(see)
plot(compare_performance(Table_regression2,model2, rank = TRUE))Interpretation:
To compare the best model chosen by the subset method and the one that was elaborated in this paper an R package was used to analyze and see the performance indicators of both models.
The results indicate taht the model that performs the best is the one selected by the subsets method.
Moreover, a plot was drawn to see visually the performance of each of the model, and once again it is very clear that the one that should be chosen is model2 (the subsets method model)
Interpretation:
The first model was built taking into account only continuous quantitative variables, namely garage area, above grade living area (square feet) and total square feet of basement area. The second one, on the other hand, was created by using the subsets method and the variables that were selected for creating the best 3 predictors model were overall quality, above grade living area (square feet), and rating of basement finished area (Type 1 finished square feet). The difference between the two models is that the second one contained discrete variables, while the first one was constrained in that aspect. Nonetheless, both agree on incorporating the variable above grade living area.
When it came to selecting the most appropriate model, after comparing both it become clear that the second one was better because it had a better R2 and also a better overall score.
To conclude, after this analysis it can be stated that a house that does better regarding overall quality, above grade living area (square feet), and rating of basement finished area (Type 1 finished square feet), will have a higher sale price. This information could be very beneficial for potential buyers or realtors.