Chapter 4 - Linear Regression:

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Solution: California Housing Prices Dataset

Data Specification:

The data is publicly available on Kaggle at https://www.kaggle.com/camnugent/california-housing-prices#housing.csv.

It contains 20,640 rows and 10 columns.

There are 9 explanatory variables (longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income and ocean_proximity) and 1 response variable (median_house_value) in the data set.

Since this dataset does not include a dichotomous feature, we will create one out of the ocean_proximity column.

A value of 1 will be assigned to any rows with the value NEAR OCEAN and a 0 otherwise.

Loading the Data from GitHub:

library(readr)

housing_df <- read_csv('https://raw.githubusercontent.com/uzmabb182/Data605_Assignment/main/housing.csv')
## Rows: 20640 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ocean_proximity
## dbl (9): longitude, latitude, housing_median_age, total_rooms, total_bedroom...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(housing_df)
## # A tibble: 6 × 10
##   longitude latitude housing_median_age total_rooms total_bedrooms population
##       <dbl>    <dbl>              <dbl>       <dbl>          <dbl>      <dbl>
## 1     -122.     37.9                 41         880            129        322
## 2     -122.     37.9                 21        7099           1106       2401
## 3     -122.     37.8                 52        1467            190        496
## 4     -122.     37.8                 52        1274            235        558
## 5     -122.     37.8                 52        1627            280        565
## 6     -122.     37.8                 52         919            213        413
## # ℹ 4 more variables: households <dbl>, median_income <dbl>,
## #   median_house_value <dbl>, ocean_proximity <chr>

Dichotomize the OCEAN_PROXIMITY variable:

housing_df$ocean_proximity <- with(housing_df, ifelse(ocean_proximity=='NEAR OCEAN', 1,0))
table(housing_df$ocean_proximity)
## 
##     0     1 
## 17982  2658

Visualizing the Data:

pairs(housing_df, gap=0.5)

Total rooms, total bedrooms, population, and households are all positively related to one another: as one increases, the others do as well. As the median income increases, so does the median house value. As the latitude increases (we go further north), the median house value decreases.

Setting Predictors:

Along with each explanatory variable as a predictor, we will also define the following predictors: total_bedrooms^2 (quadratic term), ocean_proximity (converted to a dichotomous variable), and ocean_proximity * households (dichotomous vs. quantitative interaction term).

housing_df$total_bedrooms_sq <- housing_df$total_bedrooms^2
housing_df$interaxn_term <- housing_df$ocean_proximity*housing_df$households

Building the Multiple Linear Regression Model:

housing.lm <- lm(median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income + ocean_proximity + total_bedrooms_sq + interaxn_term, data = housing_df)
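As an aside, R's formula interface can build the quadratic and interaction terms without precomputing extra columns. The sketch below uses a small simulated data frame (`toy`, with made-up values and only a subset of the column names) to check that the two approaches give the same fit. It is an illustration under those assumptions, not a re-fit of the housing model.

```r
# Simulated stand-in data: column names mirror the housing data, values are made up.
set.seed(1)
n <- 200
toy <- data.frame(
  total_bedrooms  = runif(n, 1, 1000),
  households      = runif(n, 1, 800),
  ocean_proximity = rbinom(n, 1, 0.3)
)
toy$median_house_value <- 1e5 + 130 * toy$total_bedrooms -
  0.05 * toy$total_bedrooms^2 + 40 * toy$households +
  5000 * toy$ocean_proximity + rnorm(n, sd = 1e4)

# Approach used in this section: precompute the extra columns.
toy$total_bedrooms_sq <- toy$total_bedrooms^2
toy$interaxn_term     <- toy$ocean_proximity * toy$households
fit_manual <- lm(median_house_value ~ total_bedrooms + households +
                   ocean_proximity + total_bedrooms_sq + interaxn_term,
                 data = toy)

# Formula approach: I() protects the arithmetic, ":" builds the interaction.
fit_formula <- lm(median_house_value ~ total_bedrooms + households +
                    ocean_proximity + I(total_bedrooms^2) +
                    households:ocean_proximity,
                  data = toy)

# Same estimates, only the coefficient labels differ.
same_fit <- all.equal(unname(coef(fit_manual)), unname(coef(fit_formula)))
print(same_fit)
```

The formula version keeps the data frame free of derived columns, while the precomputed version makes the extra terms explicit in the summary table; either is a reasonable choice.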

Evaluating the Model:

summary(housing.lm)
## 
## Call:
## lm(formula = median_house_value ~ longitude + latitude + housing_median_age + 
##     total_rooms + total_bedrooms + population + households + 
##     median_income + ocean_proximity + total_bedrooms_sq + interaxn_term, 
##     data = housing_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -558445  -43540  -11524   30153  859598 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3.510e+06  6.514e+04 -53.884  < 2e-16 ***
## longitude          -4.169e+04  7.486e+02 -55.689  < 2e-16 ***
## latitude           -4.140e+04  7.153e+02 -57.874  < 2e-16 ***
## housing_median_age  1.220e+03  4.378e+01  27.862  < 2e-16 ***
## total_rooms        -7.544e+00  8.006e-01  -9.423  < 2e-16 ***
## total_bedrooms      1.310e+02  7.211e+00  18.161  < 2e-16 ***
## population         -3.739e+01  1.089e+00 -34.341  < 2e-16 ***
## households          4.272e+01  7.587e+00   5.630 1.82e-08 ***
## median_income       4.037e+04  3.378e+02 119.498  < 2e-16 ***
## ocean_proximity     7.320e+03  2.596e+03   2.820   0.0048 ** 
## total_bedrooms_sq  -6.965e-03  8.086e-04  -8.613  < 2e-16 ***
## interaxn_term      -2.258e+00  4.156e+00  -0.543   0.5870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69420 on 20421 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.6385, Adjusted R-squared:  0.6383 
## F-statistic:  3279 on 11 and 20421 DF,  p-value: < 2.2e-16
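The summary notes that 207 observations were deleted due to missingness: by default, lm() drops any row with a missing value in a model variable. A minimal sketch of how one might check this, using a tiny made-up data frame (`toy` and its columns are hypothetical):

```r
# Made-up data with two incomplete rows.
toy <- data.frame(y  = c(1, 2, 3, 4, 5, 6),
                  x1 = c(1, NA, 3, 4, 5, 7),
                  x2 = c(2, 1, NA, 3, 5, 4))

na_per_column <- colSums(is.na(toy))      # missing values in each column
fit           <- lm(y ~ x1 + x2, data = toy)
rows_used     <- nobs(fit)                # rows actually used in the fit
rows_dropped  <- sum(!complete.cases(toy))

print(na_per_column)
print(c(used = rows_used, dropped = rows_dropped))
```

Running colSums(is.na(housing_df)) on the housing data would show which columns account for the dropped rows.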

Analysis:

COEFFICIENTS: In a general sense, the coefficients show us the mathematical relationship between the explanatory variables and the response variable.

A positive value indicates that as the value of the explanatory variable increases, the mean of the response variable also increases.

A negative value indicates that as the value of the explanatory variable increases, the mean of the response variable decreases. The magnitude of a coefficient depends on the units of its variable, so coefficients measured on different scales should not be compared directly as measures of strength.

Given a one-unit increase in the explanatory variable (holding all other variables constant), the predicted response changes by the value of the coefficient.

The p-value associated with each variable tells us whether the relationship is statistically significant.

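Intercept - the estimate is about -3.51 million, which is far from 0. It is the predicted median house value when every predictor equals 0, a point well outside the range of the data (for example, a longitude and latitude of 0), so it has no practical interpretation on its own.

Longitude and Latitude - both coefficients are large and negative (about -41,690 and -41,400 per degree) and highly significant.

Holding the other variables constant, the median house value decreases as the longitude increases (we go further east) or the latitude increases (we go further north). (This aligns with our intuition)

Housing Median Age, Ocean Proximity, and Median Income - all of these coefficients are positive and statistically significant (p = 0.0048 for ocean proximity, p < 2e-16 for the other two).

Holding the other variables constant, the median house value increases by about $1,220 per additional year of housing median age and by about $40,370 per one-unit increase in median income. Rows marked NEAR OCEAN are predicted to be about $7,320 higher than otherwise identical rows, before the (non-significant) interaction adjustment.

Total Bedrooms and Total Bedrooms Squared - the linear coefficient is positive (131.0) and the squared coefficient is negative (-0.006965), and both are highly significant. Together they describe a concave relationship: the median house value rises with total bedrooms, but at a decreasing rate. The squared coefficient looks close to 0 only because total_bedrooms^2 is measured on a much larger scale.

Total Rooms - the coefficient is small but negative (-7.544) and highly significant. Holding the other variables constant, each additional room is associated with a decrease of about $7.54 in the median house value. The small magnitude reflects the units of the variable, not a weak relationship.

Households - the coefficient is small but positive (42.72). For rows that are not NEAR OCEAN, each additional household is associated with an increase of about $42.72 in the median house value. Note that this variable counts households; it does not measure household size.

Population - the coefficient is small but negative (-37.39), which means that each additional person is associated with a decrease of about $37.39 in the median house value.

Interaction Term - the estimate (-2.258) is small relative to its standard error and has a p-value of 0.587, so it is not significant. There is no evidence that the effect of households differs between NEAR OCEAN rows and the rest.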
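Because total_bedrooms enters the model twice and households appears in the interaction, their effects are combinations of coefficients rather than single numbers. The sketch below works through that arithmetic using the estimates copied from the summary output; the helper names (`bedroom_slope`, `ocean_shift`, and so on) are introduced here only for illustration.

```r
# Coefficients copied from the summary(housing.lm) output.
b_bedrooms    <- 1.310e+02
b_bedrooms_sq <- -6.965e-03
b_households  <- 4.272e+01
b_ocean       <- 7.320e+03
b_interaction <- -2.258e+00

# Marginal effect of one more bedroom at a given total: b1 + 2 * b2 * x.
bedroom_slope <- function(x) b_bedrooms + 2 * b_bedrooms_sq * x
print(bedroom_slope(c(100, 500, 2000)))   # about 129.6, 124.0, 103.1

# Total-bedroom count at which the fitted quadratic peaks: -b1 / (2 * b2).
turning_point <- -b_bedrooms / (2 * b_bedrooms_sq)
print(turning_point)                      # roughly 9,400 bedrooms

# Slope of households in each group.
slope_other      <- b_households                  # ocean_proximity == 0
slope_near_ocean <- b_households + b_interaction  # ocean_proximity == 1
print(c(other = slope_other, near_ocean = slope_near_ocean))

# Predicted NEAR OCEAN shift for a row with h households.
ocean_shift <- function(h) b_ocean + b_interaction * h
print(ocean_shift(500))                   # 7320 - 2.258 * 500 = 6191
```

The two household slopes (42.72 vs. about 40.46) are nearly identical, which is consistent with the interaction term being non-significant.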

RESIDUALS: The residual summary alone already suggests that our model does not fit well. For a well-behaved model, the residuals should be balanced around 0:

The median should have a value near 0.

The min/max values should be about the same magnitude, and the first and third quartiles should also be about the same magnitude.

This means that the actual observations aren’t too far from the predicted values and that the prediction errors are roughly symmetric.

Our median is clearly negative (-11,524), our min and max values are not of the same magnitude (-558,445 vs. 859,598), and neither are the quartiles (-43,540 vs. 30,153).
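One point worth keeping in mind when reading a residual summary: for a least-squares fit with an intercept, the residuals average to 0 by construction, so it is the median and quartiles that reveal skew. The sketch below illustrates this with simulated data whose errors are deliberately right-skewed (all names and values are made up).

```r
set.seed(42)
n <- 1000
x <- runif(n, 0, 10)
# Right-skewed errors: a shifted exponential with mean 0 (purely illustrative).
y <- 2 + 3 * x + (rexp(n, rate = 1) - 1)

fit <- lm(y ~ x)
r   <- resid(fit)

print(mean(r))     # zero up to floating-point error
print(median(r))   # negative, because the long tail is on the positive side
print(summary(r))
```

A negative median paired with a maximum that is larger in magnitude than the minimum, as in the housing model, is the same signature of right-skewed residuals.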

Plotting residual plots:

plot(fitted(housing.lm),resid(housing.lm))

Analysis:

For a good model, the residual plot should satisfy two conditions:

  1. there’s no clear pattern, and

  2. the points hover both above and below 0.

Our residual plot does show a pattern: as we move towards the right, the residuals decrease in value. This indicates that we don’t have a great model.
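A zero reference line and a smoothed trend make patterns in a residual plot easier to judge. The following is a minimal sketch on simulated data with a curved mean that a straight-line fit misses; the data and the object names are assumptions for illustration only.

```r
set.seed(7)
n <- 500
x <- runif(n, 0, 10)
# Made-up response with curvature that a straight line cannot capture.
y <- 1 + 0.5 * x^2 + rnorm(n)
fit <- lm(y ~ x)

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                      # reference line at zero
smooth <- lowess(fitted(fit), resid(fit))   # locally weighted smooth of the residuals
lines(smooth, col = "red")
```

If the smoothed line stays close to the dashed line, the residuals show no systematic trend; a curved or sloped line points to structure the model has missed. The same calls could be applied to fitted(housing.lm) and resid(housing.lm).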

Also, for a good model, we expect the residuals to be normally distributed.

By graphing the data in a Q-Q plot, we can see how well the observations follow the line.

If they fit the line well, we know that the residuals are approximately normally distributed.

Normal Q-Q Plot:

qqnorm(resid(housing.lm))
qqline(resid(housing.lm))
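A formal test can complement the Q-Q plot. Base R's shapiro.test() accepts at most 5,000 values, so with 20,433 residuals a random subsample is needed. The sketch below uses simulated right-skewed values as a stand-in for the model residuals (an assumption made for illustration).

```r
set.seed(123)
# Stand-in residuals: right-skewed, same count as the model's residuals.
r <- rexp(20433) - 1

# shapiro.test() requires between 3 and 5000 values, so test a random subsample.
r_sub <- sample(r, 5000)
sw    <- shapiro.test(r_sub)
print(sw$p.value)   # a very small p-value rejects normality
```

With samples this large, the test flags even minor departures from normality, so the Q-Q plot remains the better guide to how serious the departure is.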

Conclusion:

The Q-Q plot shows clear deviation from the line at both ends.

This further shows that our model does not fit the data well.

There is a clear pattern in the residual plot.

The residuals are not normally distributed.

The median residual value is far from the expected value of 0.