Zillow Multiple Regression Analysis

2024-10-20

Zillow Dataset

This dataset includes the variables zestimate, living area (sqft), bedrooms, and bathrooms of 5526 houses along the Light Rail Corridor in the greater Phoenix area. We will be conducting a multiple regression analysis and testing for violation of the homoskedasticity assumption.

##   zestimate livingArea bedrooms bathrooms
## 1    390400       1435        3         2
## 2    488800       2800        5         3
## 3    288500       1028        2         2
## 4    463000       1416        3         2
## 5    330600       1170        3         2
## 6    388700       1600        3         2

Estimated Regression Model

\(\small\hat{zestimate}=\hat\beta_0+\hat\beta_1livingArea+\hat\beta_2bedrooms+\hat\beta_3bathrooms\)

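where the coefficients are chosen jointly to minimize the sum of squared residuals. With \(X\) the matrix of regressors (including a column of ones for the intercept) and \(y\) the vector of zestimate values,

\(\small\hat{\beta}=(X^{\top}X)^{-1}X^{\top}y\)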

Estimated Model is:

\(\small\hat{zestimate}=89851.29+247.49livingArea-35169.16bedrooms\newline\small+40571.54bathrooms\)
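
In a multiple regression, the OLS coefficients are obtained together by solving the normal equations. The sketch below illustrates this on simulated data and compares the result with lm(). The data frame `sim` and the numbers used to generate it are made up for illustration and are not the Zillow data.

```r
# Simulated data; the data frame `sim` and its coefficients are illustrative only
set.seed(1)
n <- 200
sim <- data.frame(
  livingArea = runif(n, 800, 3000),
  bedrooms   = sample(2:5, n, replace = TRUE),
  bathrooms  = sample(1:3, n, replace = TRUE)
)
sim$zestimate <- 90000 + 250 * sim$livingArea - 35000 * sim$bedrooms +
  40000 * sim$bathrooms + rnorm(n, sd = 50000)

# Design matrix with an intercept column, and the response vector
X <- model.matrix(~ livingArea + bedrooms + bathrooms, data = sim)
y <- sim$zestimate

# Solve the normal equations (X'X) b = X'y for b
betaHat <- solve(crossprod(X), crossprod(X, y))

# lm() should return the same estimates up to numerical error
fit <- lm(zestimate ~ livingArea + bedrooms + bathrooms, data = sim)
print(cbind(normalEquations = drop(betaHat), lm = coef(fit)))
```

In practice lm() uses a QR decomposition rather than forming the inverse explicitly, which is numerically more stable. The closed form is shown here only to make the estimator concrete.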

Regression Output

## 
## Call:
## lm(formula = zestimate ~ livingArea + bedrooms + bathrooms, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -455317  -61339   -7828   49178  731849 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  89851.286   6488.289   13.85   <2e-16 ***
## livingArea     247.487      3.528   70.16   <2e-16 ***
## bedrooms    -35169.164   2371.946  -14.83   <2e-16 ***
## bathrooms    40571.539   3292.141   12.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104200 on 5522 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6509 
## F-statistic:  3435 on 3 and 5522 DF,  p-value: < 2.2e-16

Homoskedasticity

One of the primary assumptions of OLS is homoskedasticity, which means that the variance of the error, conditional on the regressors, is constant, or, mathematically: \(Var(\mu_{i}|x_{i})=\sigma^2\). We will now check for this visually by plotting the residuals against the fitted values of our estimated model in several ways.
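
As a rough illustration of what such a plot looks like, the sketch below builds a residuals vs. fitted values scatter plot with ggplot2. It uses simulated data whose error standard deviation grows with the regressor. The names `x`, `y`, `fit`, and `diagnostics` are hypothetical, and this is not the code behind the plots in these slides.

```r
library(ggplot2)

# Simulated data: the error standard deviation is proportional to x,
# so the errors are heteroskedastic by construction
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Collect fitted values and residuals for plotting
diagnostics <- data.frame(fitted = fitted(fit), residuals = resid(fit))

# Under homoskedasticity the points form a band of roughly constant width
# around zero; here the band widens as the fitted value grows
p <- ggplot(diagnostics, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Residuals vs. Fitted Values (simulated data)") +
  xlab("Fitted Value") +
  ylab("Residual") +
  theme_minimal()
print(p)
```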

Residuals vs. Fitted Values 2D Plot

3D Plot of Residuals

Grouped Fitted Values vs. Residuals Column Plot

Code for Previous Slide

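library(dplyr)
library(ggplot2)

# Assumes `data` already has `fitted` and `residuals` columns from the lm() fit.
# Note: cut(breaks = 10) makes ten equal-width bins of the fitted values,
# not equal-count deciles.
dataDeciles <- data %>%
  group_by(decile = cut(fitted, breaks = 10)) %>%
  summarize(meanResidual = mean(residuals))

# Column plot of the mean residual within each bin
ggplot(dataDeciles, aes(x = decile, y = meanResidual)) +
  geom_col(fill = "blue") +
  ggtitle("Mean Residuals by Decile") +
  xlab("Fitted Value of Group") +
  ylab("Mean Residual") +
  theme_minimal()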
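
Group means describe the centre of the residuals. A companion summary of their spread speaks more directly to the constant-variance assumption. The sketch below computes the standard deviation of residuals within ten equal-count groups of fitted values, using base R and simulated data. The names `x`, `y`, `fit`, `decile`, and `sdByDecile` are hypothetical.

```r
# Simulated data with error standard deviation proportional to x
set.seed(42)
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Ten equal-count groups of fitted values, using the sample deciles as breaks
decile <- cut(fitted(fit),
              breaks = quantile(fitted(fit), probs = seq(0, 1, 0.1)),
              include.lowest = TRUE, labels = FALSE)

# Standard deviation of the residuals within each group;
# roughly equal values would be consistent with homoskedasticity
sdByDecile <- tapply(resid(fit), decile, sd)
print(round(sdByDecile, 2))
```

With these simulated errors the standard deviation rises steadily from the first group to the last, which is the numerical counterpart of a cone shape in the scatter plot.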

Analysis

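The model appears to exhibit heteroskedasticity, so the assumption is violated. The 2D and 3D residual plots show a cone shape, meaning that the spread of the residuals increases with the fitted values. The column plot needs a more careful reading. Homoskedasticity concerns the variance of the errors, not their mean, so it does not by itself force the column heights to zero. Group means far from zero point instead to a problem with the conditional mean, such as a misspecified functional form. Here the mean residual varies widely from group to group and is skewed to the left, which suggests the linear model fits some ranges of fitted values poorly. This complements, but does not replace, the cone shape as evidence of heteroskedasticity.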
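
A visual check can be backed up with a formal test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R on simulated data. This test is a swapped-in illustration and was not run in the original analysis. The names `x`, `y`, `fit`, `aux`, `lmStat`, and `pValue` are hypothetical.

```r
# Simulated data with error standard deviation proportional to x
set.seed(7)
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the original regressors
aux <- lm(resid(fit)^2 ~ x)

# LM statistic = n * R^2 of the auxiliary regression. Under the null of
# homoskedasticity it is approximately chi-squared with degrees of freedom
# equal to the number of regressors, excluding the intercept (here 1)
lmStat <- n * summary(aux)$r.squared
pValue <- pchisq(lmStat, df = 1, lower.tail = FALSE)
cat("LM statistic:", lmStat, " p-value:", pValue, "\n")
```

A small p-value rejects the null hypothesis of constant error variance. For the Zillow model the auxiliary regression would include all three regressors, giving 3 degrees of freedom. The bptest() function in the lmtest package offers a packaged version of the same test, assuming that package is installed.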