Question:

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Load the Libraries:

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)

Fetching the California Housing Data CSV from GitHub:

urlfile<- "https://raw.githubusercontent.com/uzmabb182/Data605_Assignment/main/housing.csv"

housing_data<-read_csv(url(urlfile))
## Rows: 20640 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ocean_proximity
## dbl (9): longitude, latitude, housing_median_age, total_rooms, total_bedroom...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(housing_data)
## # A tibble: 6 × 10
##   longitude latitude housing_median_age total_rooms total_bedrooms population
##       <dbl>    <dbl>              <dbl>       <dbl>          <dbl>      <dbl>
## 1     -122.     37.9                 41         880            129        322
## 2     -122.     37.9                 21        7099           1106       2401
## 3     -122.     37.8                 52        1467            190        496
## 4     -122.     37.8                 52        1274            235        558
## 5     -122.     37.8                 52        1627            280        565
## 6     -122.     37.8                 52         919            213        413
## # ℹ 4 more variables: households <dbl>, median_income <dbl>,
## #   median_house_value <dbl>, ocean_proximity <chr>

Data Specification:

The data set is publicly available on Kaggle at https://www.kaggle.com/camnugent/california-housing-prices#housing.csv. It contains 20,640 rows and 10 columns.

There are 9 explanatory variables (longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income and ocean_proximity) and 1 response variable (median_house_value) in the data set.

Dichotomize the Variable:

To dichotomize a variable is to recode it so that it takes only two values.

The dichotomize, undichotomize, and is.dichotomized functions serve a different purpose: dichotomize marks which categories are "selected" and converts a Categorical Array to a Multiple Response, undichotomize reverses that, and is.dichotomized reports whether any categories are selected.

They are not needed here; a plain ifelse() recode is enough.

Because this data set does not include a dichotomous feature, we will create one from the ocean_proximity column.

A value of 1 will be assigned to any rows with the value NEAR OCEAN and a 0 otherwise.

# Convert Tibble to Data Frame
housing_df <- data.frame(housing_data)
head(housing_df)
##   longitude latitude housing_median_age total_rooms total_bedrooms population
## 1   -122.23    37.88                 41         880            129        322
## 2   -122.22    37.86                 21        7099           1106       2401
## 3   -122.24    37.85                 52        1467            190        496
## 4   -122.25    37.85                 52        1274            235        558
## 5   -122.25    37.85                 52        1627            280        565
## 6   -122.25    37.85                 52         919            213        413
##   households median_income median_house_value ocean_proximity
## 1        126        8.3252             452600        NEAR BAY
## 2       1138        8.3014             358500        NEAR BAY
## 3        177        7.2574             352100        NEAR BAY
## 4        219        5.6431             341300        NEAR BAY
## 5        259        3.8462             342200        NEAR BAY
## 6        193        4.0368             269700        NEAR BAY

Dichotomize the ocean_proximity Variable:

housing_df$ocean_proximity <- with(housing_df, ifelse(ocean_proximity=='NEAR OCEAN', 1,0))
table(housing_df$ocean_proximity)
## 
##     0     1 
## 17982  2658
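The recode above overwrites the original text labels. A minimal sketch of an alternative, assuming one would rather keep the labels and store the indicator in a separate column, is shown below. The data frame toy and the column name near_ocean are hypothetical and are not part of the original analysis.

```r
# Toy data frame standing in for housing_df (values are illustrative only)
toy <- data.frame(
  ocean_proximity = c("NEAR BAY", "NEAR OCEAN", "INLAND", "NEAR OCEAN"),
  stringsAsFactors = FALSE
)

# Keep the original labels and store the 0/1 indicator in a new column
toy$near_ocean <- as.integer(toy$ocean_proximity == "NEAR OCEAN")

# Cross-tabulate to confirm that only NEAR OCEAN maps to 1
table(toy$ocean_proximity, toy$near_ocean)
```

Keeping both columns makes it possible to check the recode against the original categories at any later step.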

Visualizing and Analyzing the Data:

pairs(housing_df, gap=0.5)

Total rooms, total bedrooms, population, and households are all positively related; i.e., as one increases, the others do as well.

As the median income increases, so does the median house value.

As the latitude increases (we go further north), the median house value decreases.

Defining the Predictors:

In addition to using each explanatory variable as a predictor, we will define the following terms:

total_bedrooms^2 (quadratic term)

ocean_proximity (converted to a dichotomous variable)

ocean_proximity * households (dichotomous vs. quantitative interaction term)

housing_df$total_bedrooms_sq <- housing_df$total_bedrooms^2
housing_df$interaxn_term <- housing_df$ocean_proximity*housing_df$households

Building the Model Using Each Explanatory Variable as a Predictor:

housing.lm <- lm(median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income + ocean_proximity + total_bedrooms_sq + interaxn_term, data = housing_df)
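As a side note, the quadratic and interaction terms can also be written directly in the model formula instead of being precomputed as columns. The sketch below illustrates this on simulated data (not the housing data); the names sim, x, d, and y are hypothetical, and the point is only that both spellings give the same estimates.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; x is quantitative, d is dichotomous (0/1)
sim <- data.frame(
  x = runif(n, 0, 10),
  d = rbinom(n, 1, 0.3)
)
sim$y <- 2 + 1.5 * sim$x - 0.1 * sim$x^2 + 3 * sim$d +
  0.5 * sim$d * sim$x + rnorm(n)

# Approach used in this analysis: precompute the extra columns
sim$x_sq <- sim$x^2
sim$d_x  <- sim$d * sim$x
fit_cols <- lm(y ~ x + d + x_sq + d_x, data = sim)

# Equivalent formula syntax: I() protects ^, and d:x is the interaction
fit_formula <- lm(y ~ x + d + I(x^2) + d:x, data = sim)

# Same estimates; only the coefficient names differ
all.equal(unname(coef(fit_cols)), unname(coef(fit_formula)))
```

The formula version has the practical advantage that predict() rebuilds the derived terms automatically from the raw columns.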

Evaluating the Model:

summary(housing.lm)
## 
## Call:
## lm(formula = median_house_value ~ longitude + latitude + housing_median_age + 
##     total_rooms + total_bedrooms + population + households + 
##     median_income + ocean_proximity + total_bedrooms_sq + interaxn_term, 
##     data = housing_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -558445  -43540  -11524   30153  859598 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3.510e+06  6.514e+04 -53.884  < 2e-16 ***
## longitude          -4.169e+04  7.486e+02 -55.689  < 2e-16 ***
## latitude           -4.140e+04  7.153e+02 -57.874  < 2e-16 ***
## housing_median_age  1.220e+03  4.378e+01  27.862  < 2e-16 ***
## total_rooms        -7.544e+00  8.006e-01  -9.423  < 2e-16 ***
## total_bedrooms      1.310e+02  7.211e+00  18.161  < 2e-16 ***
## population         -3.739e+01  1.089e+00 -34.341  < 2e-16 ***
## households          4.272e+01  7.587e+00   5.630 1.82e-08 ***
## median_income       4.037e+04  3.378e+02 119.498  < 2e-16 ***
## ocean_proximity     7.320e+03  2.596e+03   2.820   0.0048 ** 
## total_bedrooms_sq  -6.965e-03  8.086e-04  -8.613  < 2e-16 ***
## interaxn_term      -2.258e+00  4.156e+00  -0.543   0.5870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69420 on 20421 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.6385, Adjusted R-squared:  0.6383 
## F-statistic:  3279 on 11 and 20421 DF,  p-value: < 2.2e-16

Interpreting the Results:

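Coefficients:

Intercept ((Intercept)): The estimated intercept is approximately -3.51e+06 (about -$3,510,000).

This is the predicted median house value when every predictor equals zero. That point (longitude 0, latitude 0, zero income) lies far outside the range of the data, so the intercept has no practical interpretation on its own.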

Longitude (longitude): For every one-degree increase in longitude (moving east), the estimated median house value decreases by approximately $41,690, holding all other predictors constant. The same "all else equal" condition applies to every coefficient below.

Latitude (latitude): For every one-degree increase in latitude (moving north), the estimated median house value decreases by approximately $41,400.

Housing Median Age (housing_median_age): For every one-year increase in housing median age, the estimated median house value increases by approximately $1,220.

Total Rooms (total_rooms): For every additional room, the estimated median house value decreases by approximately $7.54.

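Total Bedrooms (total_bedrooms): Because the model also contains total_bedrooms_sq, this coefficient is not a constant per-unit effect. The estimated change in median house value per additional bedroom is 131 - 2(0.006965)(total_bedrooms) dollars, which is about $131 when total_bedrooms is near 0 and shrinks as total_bedrooms grows.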

Population (population): For every one unit increase in population, there is an estimated decrease in median house value of approximately $37.39.

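Households (households): Because households also appears in the interaction term, this coefficient applies to rows with ocean_proximity = 0. For those rows, each additional household is associated with an estimated increase in median house value of approximately $42.72.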

Median Income (median_income): For every one-unit increase in median income (recorded in this data set in tens of thousands of dollars), there is an estimated increase in median house value of approximately $40,370.

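Ocean Proximity (ocean_proximity): This is the coefficient of the dichotomous variable created from ocean_proximity (1 = NEAR OCEAN, 0 = any other category).

Because the model includes the interaction with households, $7,320 is the estimated NEAR OCEAN difference in median house value at households = 0; at other values the estimated difference is 7,320 - 2.258(households). NEAR OCEAN rows are therefore estimated to have a somewhat higher median house value than otherwise comparable rows in the other categories.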

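Total Bedrooms Squared (total_bedrooms_sq): The negative and statistically significant coefficient (-0.006965) indicates a concave, non-linear relationship between total bedrooms and median house value.

It should not be read as a per-unit effect on its own, because total_bedrooms_sq cannot change while total_bedrooms stays fixed. Taken together with the total_bedrooms coefficient, it implies that the positive association weakens as total bedrooms grows and would reverse only at about 9,400 bedrooms (131 / (2 * 0.006965)).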
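The combined effect of total_bedrooms and total_bedrooms_sq can be worked out from the two estimates in the summary output. The sketch below is only arithmetic on those reported coefficients; the bedroom counts passed to marginal_effect() are hypothetical values picked for illustration.

```r
# Coefficients copied from the summary(housing.lm) output
b_bedrooms    <- 1.310e+02    # total_bedrooms
b_bedrooms_sq <- -6.965e-03   # total_bedrooms_sq

# Derivative of b1 * x + b2 * x^2 with respect to x:
# estimated change in median house value per additional bedroom at x
marginal_effect <- function(x) b_bedrooms + 2 * b_bedrooms_sq * x

# Hypothetical bedroom counts chosen for illustration
marginal_effect(c(100, 500, 2000))

# Bedroom count at which the estimated effect changes sign
turning_point <- -b_bedrooms / (2 * b_bedrooms_sq)
turning_point
```

The turning point comes out at roughly 9,400 bedrooms, so within most of the data the estimated effect of an extra bedroom stays positive but diminishing.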

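Interaction Term (interaxn_term): The coefficient (-2.258) estimates how much the households slope differs for NEAR OCEAN rows: about 42.72 - 2.258 = 40.46 dollars per household, versus 42.72 for all other rows. Its p-value (0.587) is well above the typical significance level of 0.05, so there is no evidence that the effect of households differs by ocean proximity.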
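To make the interaction concrete, the reported estimates can be combined into one households slope per group and into a NEAR OCEAN difference that depends on households. This is a sketch based only on the coefficients printed in the summary; the household counts used at the end are hypothetical.

```r
# Coefficients copied from the summary(housing.lm) output
b_households  <- 4.272e+01    # households
b_ocean       <- 7.320e+03    # ocean_proximity
b_interaction <- -2.258e+00   # interaxn_term

# Households slope in each group
slope_other      <- b_households
slope_near_ocean <- b_households + b_interaction
c(other = slope_other, near_ocean = slope_near_ocean)

# Estimated NEAR OCEAN difference at a given number of households
ocean_gap <- function(households) b_ocean + b_interaction * households
ocean_gap(c(0, 500, 1000))   # household counts are illustrative
```

Since the interaction estimate is not statistically significant, these group differences should be read as descriptive rather than as established effects.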

Model Performance:

Residuals: These are the differences between the observed and predicted values of the dependent variable.

Their spread, and any pattern in them, indicates how well the model fits the data.

Multiple R-squared: This is a measure of how well the independent variables explain the variability of the dependent variable.

In this case, approximately 63.85% of the variability in median house value is explained by the independent variables.

Adjusted R-squared: This is the R-squared value adjusted for the number of predictors in the model.

At 0.6383 it is only slightly lower than the multiple R-squared (0.6385), because the penalty for 11 predictors is negligible with more than 20,000 observations.

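Residual standard error: This is an estimate of the standard deviation of the residuals.

At 69,420 on 20,421 degrees of freedom, it means the model's predictions typically miss the observed median house value by roughly $69,000. Note that 207 rows with missing values were dropped before fitting.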
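The three fit statistics above can be reproduced by hand from the residuals, which makes their definitions explicit. The sketch below uses the built-in mtcars data as a stand-in so that it runs without downloading anything; the model mpg ~ wt + hp is an arbitrary example and not part of the housing analysis.

```r
# mtcars ships with R; it stands in for the housing data here
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(resid(fit)^2)                        # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
n   <- nrow(mtcars)
p   <- length(coef(fit)) - 1                    # predictors, excluding intercept

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
rse    <- sqrt(rss / (n - p - 1))

# Compare with the values reported by summary()
s <- summary(fit)
all.equal(c(r2, adj_r2, rse), c(s$r.squared, s$adj.r.squared, s$sigma))
```

The same steps applied to housing.lm should reproduce the values in its summary, provided the rows dropped for missingness are excluded from the total sum of squares.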

Summary: The model has moderate explanatory power (an R-squared of about 0.64), but there is room for improvement, especially in capturing non-linear relationships and unmodeled interactions.

Graphing the residuals against the fitted values

plot(fitted(housing.lm), resid(housing.lm))
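A residuals-versus-fitted plot is easier to read with a horizontal reference line at zero and a smoothed trend. The following is a minimal sketch of that idea on the built-in mtcars data (an arbitrary stand-in model, not the housing model); lowess() is used as the smoother by assumption, and other smoothers would work as well.

```r
# Stand-in model on built-in data so the block runs offline
fit <- lm(mpg ~ wt + hp, data = mtcars)
f <- fitted(fit)
r <- resid(fit)

plot(f, r, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)        # residuals should scatter evenly around 0

# A smoothed trend that bends away from 0 suggests a missed non-linearity
smooth <- lowess(f, r)
lines(smooth, col = "red")
```

For an lm object, plot(fit, which = 1) draws a similar diagnostic in one call.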

Graphing the residuals in a Q-Q plot

If the model's assumptions hold, the residuals are approximately normally distributed and the points fall along the reference line.

qqnorm(resid(housing.lm))
qqline(resid(housing.lm))

The points deviate from the line at both ends of the plot, so the residuals have heavier tails than a normal distribution and the normality assumption does not hold.
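A rough numeric companion to the Q-Q plot is to compare the share of standardized residuals beyond 2 and 3 standard deviations with what a normal distribution would give. The sketch below again uses mtcars as a stand-in; the cutoffs of 2 and 3 are a conventional choice and are an assumption of this example.

```r
# Stand-in model on built-in data so the block runs offline
fit <- lm(mpg ~ wt + hp, data = mtcars)
z <- rstandard(fit)   # standardized residuals

# Observed share of residuals in the tails
tail_share <- c(beyond_2 = mean(abs(z) > 2), beyond_3 = mean(abs(z) > 3))

# Share expected under a normal distribution
normal_share <- c(beyond_2 = 2 * pnorm(-2), beyond_3 = 2 * pnorm(-3))

rbind(observed = tail_share, normal = normal_share)
```

Observed shares well above the normal ones point to heavy tails, which is what departures at both ends of a Q-Q plot indicate.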

Conclusions:

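From this analysis, the linear model is not fully appropriate for this data, for a few reasons:

There is a clear pattern in the residual plot.

The residuals are not normally distributed.

The median residual (-11,524) is well below 0 and the largest positive residual (859,598) is much larger in magnitude than the largest negative one (-558,445), so the residuals are right-skewed.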