Chapters 4 - 6: Linear Regression Using R

Question

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Solution

For this week’s discussion, I picked the California Housing Prices dataset, with the goal of predicting the median house value given a number of numeric and 1 (derived) dichotomous feature.

Data Specs

  • The data is publicly available on kaggle at https://www.kaggle.com/camnugent/california-housing-prices#housing.csv.
  • It contains \(20,640\) rows and \(10\) columns.
  • There are \(9\) explanatory variables (longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income and ocean_proximity) and \(1\) response variable (median_house_value) in the data set.
  • Becuase this dataset does not include a dichotomous feature, I will create one out of the ocean_proximity column. A value of \(1\) will be assigned to any rows with the value NEAR OCEAN and a \(0\) otherwise.

Load the Data

library(readr)

housing.df <- read_csv('https://raw.githubusercontent.com/amberferger/DATA605_Discussions/master/AFerger_Wk13_Disc_data.csv')

Dichotomize the OCEAN_PROXIMITY variable

housing.df$ocean_proximity <- with(housing.df, ifelse(ocean_proximity=='NEAR OCEAN', 1,0))
table(housing.df$ocean_proximity)
## 
##     0     1 
## 17982  2658

Visualize the Data

We’ll use the pairs command to plot all variables. We can confirm some things that we’d expect:

  • Total rooms, total bedrooms, population, and households are all directly related. In other words, as one incrases, the others do as well.
  • As the median income increases, so does the median house value.
  • As the latitude increases (we go further north), the median house value decreases.
pairs(housing.df, gap=0.5)

Define Predictors

In addition to using each explanatory variable as a predictor, we will also define the following predictors: \(total\_bedrooms^2\) (quadratic), \(ocean\_proximity\) (converted to a dichotomous variable) \(ocean\_proximity * households\) (dichotomous vs quantitative interaction term).

housing.df$total_bedrooms_sq <- housing.df$total_bedrooms^2
housing.df$interaxn_term <- housing.df$ocean_proximity*housing.df$households

Multivariate Linear Model

We will now fit a multivariate linear model to the data.

housing.lm <- lm(median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income + ocean_proximity + total_bedrooms_sq + interaxn_term, data = housing.df)

Evaluating the Model

Let’s take a look at the summary information of our model:

summary(housing.lm)
## 
## Call:
## lm(formula = median_house_value ~ longitude + latitude + housing_median_age + 
##     total_rooms + total_bedrooms + population + households + 
##     median_income + ocean_proximity + total_bedrooms_sq + interaxn_term, 
##     data = housing.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -558445  -43540  -11524   30153  859598 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3.510e+06  6.514e+04 -53.884  < 2e-16 ***
## longitude          -4.169e+04  7.486e+02 -55.689  < 2e-16 ***
## latitude           -4.140e+04  7.153e+02 -57.874  < 2e-16 ***
## housing_median_age  1.220e+03  4.378e+01  27.862  < 2e-16 ***
## total_rooms        -7.544e+00  8.006e-01  -9.423  < 2e-16 ***
## total_bedrooms      1.310e+02  7.211e+00  18.161  < 2e-16 ***
## population         -3.739e+01  1.089e+00 -34.341  < 2e-16 ***
## households          4.272e+01  7.587e+00   5.630 1.82e-08 ***
## median_income       4.037e+04  3.378e+02 119.498  < 2e-16 ***
## ocean_proximity     7.320e+03  2.596e+03   2.820   0.0048 ** 
## total_bedrooms_sq  -6.965e-03  8.086e-04  -8.613  < 2e-16 ***
## interaxn_term      -2.258e+00  4.156e+00  -0.543   0.5870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69420 on 20421 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.6385, Adjusted R-squared:  0.6383 
## F-statistic:  3279 on 11 and 20421 DF,  p-value: < 2.2e-16

COEFFICIENTS: In a general sense, the coefficients show us the mathematical relationship between the explanatory variables and the response variable. A positive value indicates that as the value of the explanatory variable increases, the mean of the response variable also increases. Conversely, a negative value indicates that as the value of the explanatory variable increases, the mean of the response variable decreases. The strength of the relationship is measured by the magnitude of the coefficient. Given a one-unit increase/decrease in the explanatory variable (leaving all other variables constant), the response variable will shift the magnitude of the coefficient. The p-values associated with each variable tells us whether the relationships are statistically significant.

  • Intercept - value is very close to 0, so this just tells us that the minimum median value of housing is \(\$0\).
  • Longitude and Latitude - both coefficients are largely negative. This means that as the latitude/longitude increase, the value of the house decreases. (This aligns with our intuition)
  • Housing Median Age, Total Bedrooms, Ocean Proximity, and Median Income - all of these coefficients are largely positive and have a very significant p-value. This means that as the values of these variables increase, so does the value of the house.
  • Total Rooms and Total Bedrooms Squared - these values are very close to 0, so this tells us that there isn’t much of a change in the value of the house.
  • Households - this value is also pretty small, but positive. This means that there’s a slight increase in the value of the house with an increase in household size.
  • Population - small but negative value, which means that there’s a slight decrease in the value of the house with an increase in the population.
  • Interaction Term - This term is very close to 0 and has a p-value greater than 0.05, so it is not significant.

RESIDUALS: Just from looking at the residual summary, we can tell that our model is not very good. The residuals should be balanced and close to the mean of \(0\). The median should have a value near \(0\), min/max values should be about the same magnitude, and the first and third quartile should also be about the same magnitude. Ultimately, this would mean that the actual observations aren’t too far from the predicted values and there isn’t much variance in the predictive error.

Our median is largely negative, our min and max values are not of the same magnitude, and the quartile values are not the same magnitude. We’ll take a closer look at the residual plots.

Plotting the residuals

plot(fitted(housing.lm),resid(housing.lm))

A good model will have a residual plot where (1) there’s no clear pattern and (2) the points hover both above and below 0. The pattern in the plot of our residuals – as we move towards the right, we can see that the residuals decrease in value – indicates that we don’t have a great model.

Additionally, in a good model we expect the residuals to be normally distributed. By graphing the data in a Q-Q plot, we can see how well the observations follow the line. If they fit the line well, we know that the data is normally distributed.

qqnorm(resid(housing.lm))
qqline(resid(housing.lm))

There is clear deviation from the line in both ends of the plot, so this further shows that our model does not fit the data well.

Conclusions

From this analysis, we can see that our model does not produce a very good output for a few reasons:

  • There is a clear pattern in the residual plot.
  • The residuals are not normally distributed.
  • The median residual value is far from the expected value of 0.