Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
For this week’s discussion, I picked the California Housing Prices dataset, with the goal of predicting the median house value given a number of numeric and 1 (derived) dichotomous feature.
library(readr)
housing.df <- read_csv('https://raw.githubusercontent.com/amberferger/DATA605_Discussions/master/AFerger_Wk13_Disc_data.csv')
housing.df$ocean_proximity <- with(housing.df, ifelse(ocean_proximity=='NEAR OCEAN', 1,0))
table(housing.df$ocean_proximity)
##
## 0 1
## 17982 2658
We’ll use the pairs command to plot all variables. We can confirm some things that we’d expect:
pairs(housing.df, gap=0.5)
In addition to using each explanatory variable as a predictor, we will also define the following predictors: \(total\_bedrooms^2\) (quadratic), \(ocean\_proximity\) (converted to a dichotomous variable) \(ocean\_proximity * households\) (dichotomous vs quantitative interaction term).
housing.df$total_bedrooms_sq <- housing.df$total_bedrooms^2
housing.df$interaxn_term <- housing.df$ocean_proximity*housing.df$households
We will now fit a multivariate linear model to the data.
housing.lm <- lm(median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income + ocean_proximity + total_bedrooms_sq + interaxn_term, data = housing.df)
Let’s take a look at the summary information of our model:
summary(housing.lm)
##
## Call:
## lm(formula = median_house_value ~ longitude + latitude + housing_median_age +
## total_rooms + total_bedrooms + population + households +
## median_income + ocean_proximity + total_bedrooms_sq + interaxn_term,
## data = housing.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -558445 -43540 -11524 30153 859598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.510e+06 6.514e+04 -53.884 < 2e-16 ***
## longitude -4.169e+04 7.486e+02 -55.689 < 2e-16 ***
## latitude -4.140e+04 7.153e+02 -57.874 < 2e-16 ***
## housing_median_age 1.220e+03 4.378e+01 27.862 < 2e-16 ***
## total_rooms -7.544e+00 8.006e-01 -9.423 < 2e-16 ***
## total_bedrooms 1.310e+02 7.211e+00 18.161 < 2e-16 ***
## population -3.739e+01 1.089e+00 -34.341 < 2e-16 ***
## households 4.272e+01 7.587e+00 5.630 1.82e-08 ***
## median_income 4.037e+04 3.378e+02 119.498 < 2e-16 ***
## ocean_proximity 7.320e+03 2.596e+03 2.820 0.0048 **
## total_bedrooms_sq -6.965e-03 8.086e-04 -8.613 < 2e-16 ***
## interaxn_term -2.258e+00 4.156e+00 -0.543 0.5870
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69420 on 20421 degrees of freedom
## (207 observations deleted due to missingness)
## Multiple R-squared: 0.6385, Adjusted R-squared: 0.6383
## F-statistic: 3279 on 11 and 20421 DF, p-value: < 2.2e-16
COEFFICIENTS: In a general sense, the coefficients show us the mathematical relationship between the explanatory variables and the response variable. A positive value indicates that as the value of the explanatory variable increases, the mean of the response variable also increases. Conversely, a negative value indicates that as the value of the explanatory variable increases, the mean of the response variable decreases. The strength of the relationship is measured by the magnitude of the coefficient. Given a one-unit increase/decrease in the explanatory variable (leaving all other variables constant), the response variable will shift the magnitude of the coefficient. The p-values associated with each variable tells us whether the relationships are statistically significant.
RESIDUALS: Just from looking at the residual summary, we can tell that our model is not very good. The residuals should be balanced and close to the mean of \(0\). The median should have a value near \(0\), min/max values should be about the same magnitude, and the first and third quartile should also be about the same magnitude. Ultimately, this would mean that the actual observations aren’t too far from the predicted values and there isn’t much variance in the predictive error.
Our median is largely negative, our min and max values are not of the same magnitude, and the quartile values are not the same magnitude. We’ll take a closer look at the residual plots.
plot(fitted(housing.lm),resid(housing.lm))
A good model will have a residual plot where (1) there’s no clear pattern and (2) the points hover both above and below 0. The pattern in the plot of our residuals – as we move towards the right, we can see that the residuals decrease in value – indicates that we don’t have a great model.
Additionally, in a good model we expect the residuals to be normally distributed. By graphing the data in a Q-Q plot, we can see how well the observations follow the line. If they fit the line well, we know that the data is normally distributed.
qqnorm(resid(housing.lm))
qqline(resid(housing.lm))
There is clear deviation from the line in both ends of the plot, so this further shows that our model does not fit the data well.
From this analysis, we can see that our model does not produce a very good output for a few reasons: