Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
urlfile<- "https://raw.githubusercontent.com/uzmabb182/Data605_Assignment/main/housing.csv"
housing_data<-read_csv(url(urlfile))
## Rows: 20640 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ocean_proximity
## dbl (9): longitude, latitude, housing_median_age, total_rooms, total_bedroom...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(housing_data)
## # A tibble: 6 × 10
## longitude latitude housing_median_age total_rooms total_bedrooms population
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -122. 37.9 41 880 129 322
## 2 -122. 37.9 21 7099 1106 2401
## 3 -122. 37.8 52 1467 190 496
## 4 -122. 37.8 52 1274 235 558
## 5 -122. 37.8 52 1627 280 565
## 6 -122. 37.8 52 919 213 413
## # ℹ 4 more variables: households <dbl>, median_income <dbl>,
## # median_house_value <dbl>, ocean_proximity <chr>
The data is publicly available on Kaggle at https://www.kaggle.com/camnugent/california-housing-prices#housing.csv. It contains 20,640 rows and 10 columns.
There are 9 explanatory variables (longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income and ocean_proximity) and 1 response variable (median_house_value) in the data set.
A dichotomous variable takes exactly two values, coded here as 0 and 1.
The similarly named dichotomize, undichotomize, and is.dichotomized functions address a different task: they convert between Categorical Array and Multiple Response variables and report whether categories have any selected values.
They are not used in this analysis.
Because this dataset does not include a dichotomous feature, we will create one from the ocean_proximity column.
A value of 1 will be assigned to any rows with the value NEAR OCEAN and a 0 otherwise.
# Convert Tibble to Data Frame
housing_df <- data.frame(housing_data)
head(housing_df)
## longitude latitude housing_median_age total_rooms total_bedrooms population
## 1 -122.23 37.88 41 880 129 322
## 2 -122.22 37.86 21 7099 1106 2401
## 3 -122.24 37.85 52 1467 190 496
## 4 -122.25 37.85 52 1274 235 558
## 5 -122.25 37.85 52 1627 280 565
## 6 -122.25 37.85 52 919 213 413
## households median_income median_house_value ocean_proximity
## 1 126 8.3252 452600 NEAR BAY
## 2 1138 8.3014 358500 NEAR BAY
## 3 177 7.2574 352100 NEAR BAY
## 4 219 5.6431 341300 NEAR BAY
## 5 259 3.8462 342200 NEAR BAY
## 6 193 4.0368 269700 NEAR BAY
Dichotomize the OCEAN_PROXIMITY variable
housing_df$ocean_proximity <- with(housing_df, ifelse(ocean_proximity=='NEAR OCEAN', 1,0))
table(housing_df$ocean_proximity)
##
## 0 1
## 17982 2658
Visualizing and Analyzing the Data:
pairs(housing_df, gap=0.5)
Total rooms, total bedrooms, population, and households are all positively related; i.e., as one increases, the others do as well.
As the median income increases, so does the median house value.
As the latitude increases (we go further north), the median house value decreases.
We will define the following predictors:
each original explanatory variable, used as a predictor
total_bedrooms^2 (quadratic term)
ocean_proximity (converted to a dichotomous variable)
ocean_proximity * households (dichotomous vs. quantitative interaction term)
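The quadratic and interaction terms can also be written directly in the model formula rather than as precomputed columns. The sketch below is only an illustration: it uses a small simulated data frame named toy (the column names mirror the housing data, but the values are invented) and checks that the two specifications give the same fit.

```r
# Simulated stand-in data (assumption: invented values, not the housing data)
set.seed(1)
n <- 200
toy <- data.frame(
  total_bedrooms  = runif(n, 50, 2000),
  households      = runif(n, 50, 1500),
  ocean_proximity = rbinom(n, 1, 0.3)   # 0/1 dummy variable
)
toy$median_house_value <- 1e5 + 100 * toy$total_bedrooms -
  0.02 * toy$total_bedrooms^2 + 40 * toy$households +
  8000 * toy$ocean_proximity + rnorm(n, sd = 5000)

# Approach used in this document: precompute the extra columns
toy$total_bedrooms_sq <- toy$total_bedrooms^2
toy$interaxn_term     <- toy$ocean_proximity * toy$households
fit_cols <- lm(median_house_value ~ total_bedrooms + total_bedrooms_sq +
                 households + ocean_proximity + interaxn_term, data = toy)

# Equivalent formula notation: I() for the square, ":" for the interaction
fit_formula <- lm(median_house_value ~ total_bedrooms + I(total_bedrooms^2) +
                    households + ocean_proximity + ocean_proximity:households,
                  data = toy)

# Compare the fitted values from the two specifications
all.equal(unname(fitted(fit_cols)), unname(fitted(fit_formula)))
```

Writing the terms in the formula keeps the data frame unchanged and lets predict() rebuild them automatically for new data.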
housing_df$total_bedrooms_sq <- housing_df$total_bedrooms^2
housing_df$interaxn_term <- housing_df$ocean_proximity*housing_df$households
housing.lm <- lm(median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income + ocean_proximity + total_bedrooms_sq + interaxn_term, data = housing_df)
summary(housing.lm)
##
## Call:
## lm(formula = median_house_value ~ longitude + latitude + housing_median_age +
## total_rooms + total_bedrooms + population + households +
## median_income + ocean_proximity + total_bedrooms_sq + interaxn_term,
## data = housing_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -558445 -43540 -11524 30153 859598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.510e+06 6.514e+04 -53.884 < 2e-16 ***
## longitude -4.169e+04 7.486e+02 -55.689 < 2e-16 ***
## latitude -4.140e+04 7.153e+02 -57.874 < 2e-16 ***
## housing_median_age 1.220e+03 4.378e+01 27.862 < 2e-16 ***
## total_rooms -7.544e+00 8.006e-01 -9.423 < 2e-16 ***
## total_bedrooms 1.310e+02 7.211e+00 18.161 < 2e-16 ***
## population -3.739e+01 1.089e+00 -34.341 < 2e-16 ***
## households 4.272e+01 7.587e+00 5.630 1.82e-08 ***
## median_income 4.037e+04 3.378e+02 119.498 < 2e-16 ***
## ocean_proximity 7.320e+03 2.596e+03 2.820 0.0048 **
## total_bedrooms_sq -6.965e-03 8.086e-04 -8.613 < 2e-16 ***
## interaxn_term -2.258e+00 4.156e+00 -0.543 0.5870
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69420 on 20421 degrees of freedom
## (207 observations deleted due to missingness)
## Multiple R-squared: 0.6385, Adjusted R-squared: 0.6383
## F-statistic: 3279 on 11 and 20421 DF, p-value: < 2.2e-16
Coefficients: Each slope below is the estimated change in median house value for a one-unit increase in that predictor, holding the other predictors constant.
Intercept ((Intercept)): The estimated intercept is approximately -3.51e+06.
This is the predicted median house value when all predictors are zero. It has no practical meaning here, because a longitude and latitude of zero lie far outside the range of the data.
Longitude (longitude): For every one-degree increase in longitude (moving east), there is an estimated decrease in median house value of approximately $41,690.
Latitude (latitude): For every one-degree increase in latitude (moving north), there is an estimated decrease in median house value of approximately $41,400.
Housing Median Age (housing_median_age): For every one-year increase in housing median age, there is an estimated increase in median house value of approximately $1,220.
Total Rooms (total_rooms): For every one-unit increase in total rooms, there is an estimated decrease in median house value of approximately $7.54.
Total Bedrooms (total_bedrooms): Because the model also includes total_bedrooms_sq, this coefficient is not the whole effect. The estimated slope is 131 - 2(0.006965)(total_bedrooms), so an additional bedroom is associated with an increase of approximately $131 when total bedrooms is near zero, and with progressively less as total bedrooms grows.
Population (population): For every one-unit increase in population, there is an estimated decrease in median house value of approximately $37.39.
Households (households): Because of the interaction term, this coefficient is the slope for observations that are not NEAR OCEAN: each additional household is associated with an estimated increase of approximately $42.72. For NEAR OCEAN observations the estimated slope is 42.72 - 2.26 = $40.46.
Median Income (median_income): For every one-unit increase in median income (recorded in tens of thousands of dollars in this dataset), there is an estimated increase in median house value of approximately $40,370.
Ocean Proximity (ocean_proximity): This coefficient is the estimated difference in median house value between NEAR OCEAN observations (coded 1) and all other observations (coded 0) when households is zero.
NEAR OCEAN observations are estimated to be approximately $7,320 higher in value; because of the interaction term, the estimated gap at a given number of households is 7,320 - 2.26(households).
Total Bedrooms Squared (total_bedrooms_sq): The negative coefficient (-0.006965) indicates a concave, non-linear relationship between total bedrooms and median house value.
The positive effect of additional bedrooms diminishes as total bedrooms increases, and the fitted curve peaks at roughly 9,400 total bedrooms.
Interaction Term (interaxn_term): The estimate (-2.26) is the difference in the households slope between NEAR OCEAN and other observations. It is not statistically significant (p = 0.587, above the typical significance level of 0.05), so there is no evidence that the slope differs between the two groups.
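Because the quadratic and interaction terms change how the total_bedrooms, households, and ocean_proximity coefficients are read, it can help to work out the implied slopes by hand. The following is a minimal sketch using the rounded estimates printed in the summary output; the helper names bedroom_slope and ocean_gap, and the evaluation points, are chosen only for illustration.

```r
# Coefficient estimates copied from the summary(housing.lm) output
b_bedrooms    <- 1.310e+02
b_bedrooms_sq <- -6.965e-03
b_households  <- 4.272e+01
b_ocean       <- 7.320e+03
b_interaction <- -2.258e+00

# Slope of total_bedrooms at level x: derivative of b1*x + b2*x^2
bedroom_slope <- function(x) b_bedrooms + 2 * b_bedrooms_sq * x
bedroom_slope(c(100, 500, 1000))   # illustrative evaluation points

# Level of total_bedrooms at which the fitted quadratic peaks
bedroom_peak <- -b_bedrooms / (2 * b_bedrooms_sq)
bedroom_peak

# Slope of households for other observations vs. NEAR OCEAN observations
c(other = b_households, near_ocean = b_households + b_interaction)

# Estimated NEAR OCEAN gap at a given number of households h
ocean_gap <- function(h) b_ocean + b_interaction * h
ocean_gap(c(0, 500))
```

With the fitted model in memory, the same quantities could be computed from coef(housing.lm) instead of the rounded values typed in here.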
Model Performance:
Residuals: These are the differences between the observed and predicted values of the dependent variable.
They indicate how well the model fits the data. Here they range from -558,445 to 859,598, with a median of -11,524.
Multiple R-squared: This is a measure of how well the independent variables explain the variability of the dependent variable.
In this case, approximately 63.85% of the variability in median house value is explained by the independent variables.
Adjusted R-squared: This is the R-squared value adjusted for the number of predictors in the model.
At 0.6383 it is only slightly lower than the multiple R-squared, because the sample is very large relative to the 11 predictors.
Residual standard error: This is an estimate of the standard deviation of the residuals.
Here it is $69,420, which is the typical size of the gap between the model's predictions and the actual values.
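The relationship between the two R-squared values can be verified from the numbers in the summary output. This is a small sketch of that arithmetic, assuming the standard adjusted R-squared formula; the variable names are chosen for illustration.

```r
# Values copied from the summary(housing.lm) output
r2        <- 0.6385
df_res    <- 20421   # residual degrees of freedom
p         <- 11      # number of predictors, excluding the intercept
n_missing <- 207     # observations deleted due to missingness

# Observations actually used in the fit
n <- df_res + p + 1
n + n_missing        # should match the number of rows in the data set

# Adjusted R-squared: penalise R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res
round(adj_r2, 4)
```

The penalty factor (n - 1) / df_res is very close to 1 here, which is why the adjustment barely changes the value.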
Summary: The model explains about 64% of the variance in median house value, which is moderate explanatory power, but there is room for improvement, especially in capturing non-linear relationships or unmodeled interactions.
plot(fitted(housing.lm),resid(housing.lm))
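If the model's assumptions hold, the residuals should scatter randomly around zero with constant spread in the plot above, and they should be approximately normally distributed, which the Q-Q plot below checks.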
qqnorm(resid(housing.lm))
qqline(resid(housing.lm))
The points deviate from the reference line at both ends of the plot, which indicates that the residuals have heavier tails than a normal distribution and that the normality assumption is violated.
From this analysis, we conclude that the linear model is not fully appropriate for these data, for the following reasons:
There is a clear pattern in the residual plot.
The residuals are not normally distributed.
The median residual (-11,524) is well below the expected value of 0, which indicates that the residuals are right-skewed.
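These observations can also be checked numerically rather than only by eye. The sketch below defines a hypothetical helper, residual_checks() (not part of the original analysis), and runs it on a simulated model with right-skewed errors so that it is self-contained. Calling residual_checks(housing.lm) would report the same quantities for the housing fit.

```r
# Hypothetical helper: a few numeric residual diagnostics for an lm fit
residual_checks <- function(model) {
  r <- resid(model)
  z <- as.numeric(scale(r))                     # standardised residuals
  c(
    mean_resid   = mean(r),                     # ~0 for OLS with an intercept
    median_resid = median(r),                   # far from 0 suggests skew
    skewness     = mean(z^3),                   # ~0 for a symmetric distribution
    cor_abs_fit  = cor(abs(r), fitted(model))   # far from 0 hints at non-constant spread
  )
}

# Simulated example (assumption: invented data with right-skewed, mean-zero errors)
set.seed(42)
x <- runif(500, 0, 10)
y <- 2 + 3 * x + (rexp(500) - 1) * 4
toy_fit <- lm(y ~ x)

checks <- residual_checks(toy_fit)
round(checks, 3)
```

In the simulated case the mean residual is essentially zero while the median is negative and the skewness is positive, which is the same signature of right skew seen in the housing model's residual summary.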