Homework_2 Linear Regression Analysis

Part 1: Research Question

In January 2025, California experienced a major wildfire that destroyed about 7,000 homes. As a result, many people will need to purchase new homes. So, where is the best place to buy? I want to investigate the factors affecting housing prices in California.
We find a dataset includes information on house values, local population, local income, house age, number of households, and more. Next, we will explore the relationships between house values and other factors.

Part 2: Data Exploration and Preparation

California House Price dataset contains 20,640 observations with 10 variables, is sourced from Kaggle and was provided by Shibu Mohapatra, the primary source is the U.S. Census Bureau. California House Price dataset has 10 types of metrics such as the population, median income, median housing price, and so on for each block group in California.

Variables in the Dataset:

longitude: Longitude coordinate of the location.
latitude: atitude coordinate of the location.
housing_median_age: Median age of houses in the area of California..
total_rooms: Total number of rooms in a block of California..
total_bedrooms: Total number of bedrooms of California..
population: Population count in the area of California..
households: Number of households in the area of California..
median_income: Median income of households in the area.
ocean_proximity: Categorical variable indicating the region’s distance to the ocean.
median_house_value: Median house value in U.S. dollars in a block of California.

Data source:https://www.kaggle.com/datasets/shibumohapatra/house-price/data.

statistical summaries of the variables

price_house <- read_csv("D:/Mysoftware/DefaultWD-R/dataset/California_house_price.csv")

str(price_house)

## spc_tbl_ [20,640 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ longitude         : num [1:20640] -122 -122 -122 -122 -122 ...
##  $ latitude          : num [1:20640] 37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num [1:20640] 41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num [1:20640] 880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num [1:20640] 129 1106 190 235 280 ...
##  $ population        : num [1:20640] 322 2401 496 558 565 ...
##  $ households        : num [1:20640] 126 1138 177 219 259 ...
##  $ median_income     : num [1:20640] 8.33 8.3 7.26 5.64 3.85 ...
##  $ ocean_proximity   : chr [1:20640] "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
##  $ median_house_value: num [1:20640] 452600 358500 352100 341300 342200 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   longitude = col_double(),
##   ..   latitude = col_double(),
##   ..   housing_median_age = col_double(),
##   ..   total_rooms = col_double(),
##   ..   total_bedrooms = col_double(),
##   ..   population = col_double(),
##   ..   households = col_double(),
##   ..   median_income = col_double(),
##   ..   ocean_proximity = col_character(),
##   ..   median_house_value = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(price_house)

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  ocean_proximity    median_house_value
##  Length:20640       Min.   : 14999    
##  Class :character   1st Qu.:119600    
##  Mode  :character   Median :179700    
##                     Mean   :206856    
##                     3rd Qu.:264725    
##                     Max.   :500001    
##

colSums(is.na(price_house))

##          longitude           latitude housing_median_age        total_rooms 
##                  0                  0                  0                  0 
##     total_bedrooms         population         households      median_income 
##                207                  0                  0                  0 
##    ocean_proximity median_house_value 
##                  0                  0

price_house_num <- price_house %>%
  select_if(is.numeric) %>%
  drop_na() 

cor(price_house_num)

##                      longitude    latitude housing_median_age total_rooms
## longitude           1.00000000 -0.92461611        -0.10935655  0.04548017
## latitude           -0.92461611  1.00000000         0.01189907 -0.03666681
## housing_median_age -0.10935655  0.01189907         1.00000000 -0.36062830
## total_rooms         0.04548017 -0.03666681        -0.36062830  1.00000000
## total_bedrooms      0.06960802 -0.06698283        -0.32045104  0.93037950
## population          0.10027030 -0.10899734        -0.29578730  0.85728125
## households          0.05651277 -0.07177419        -0.30276797  0.91899153
## median_income      -0.01555015 -0.07962632        -0.11827772  0.19788152
## median_house_value -0.04539822 -0.14463821         0.10643205  0.13329413
##                    total_bedrooms   population  households median_income
## longitude              0.06960802  0.100270301  0.05651277  -0.015550150
## latitude              -0.06698283 -0.108997344 -0.07177419  -0.079626319
## housing_median_age    -0.32045104 -0.295787297 -0.30276797  -0.118277723
## total_rooms            0.93037950  0.857281251  0.91899153   0.197881519
## total_bedrooms         1.00000000  0.877746743  0.97972827  -0.007722850
## population             0.87774674  1.000000000  0.90718590   0.005086624
## households             0.97972827  0.907185900  1.00000000   0.013433892
## median_income         -0.00772285  0.005086624  0.01343389   1.000000000
## median_house_value     0.04968618 -0.025299732  0.06489355   0.688355475
##                    median_house_value
## longitude                 -0.04539822
## latitude                  -0.14463821
## housing_median_age         0.10643205
## total_rooms                0.13329413
## total_bedrooms             0.04968618
## population                -0.02529973
## households                 0.06489355
## median_income              0.68835548
## median_house_value         1.00000000

ggpairs(price_house, alpha = 0.5) +
  theme_minimal()

Part 3: Linear Regression Analysis

We want to get a full model first

unique(price_house$ocean_proximity)

## [1] "NEAR BAY"   "<1H OCEAN"  "INLAND"     "NEAR OCEAN" "ISLAND"

price_house$ocean_proximity <- as.factor(price_house$ocean_proximity)

model_full <- lm(median_house_value ~ median_income + housing_median_age + 
                   total_rooms + population + households + ocean_proximity,
                 data = price_house)
summary(model_full)

## 
## Call:
## lm(formula = median_house_value ~ median_income + housing_median_age + 
##     total_rooms + population + households + ocean_proximity, 
##     data = price_house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -547300  -43585  -10882   29580  787296 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.333e+04  2.385e+03  13.973  < 2e-16 ***
## median_income              3.897e+04  3.155e+02 123.493  < 2e-16 ***
## housing_median_age         1.149e+03  4.423e+01  25.967  < 2e-16 ***
## total_rooms               -2.757e+00  6.889e-01  -4.001 6.32e-05 ***
## population                -4.042e+01  1.051e+00 -38.468  < 2e-16 ***
## households                 1.486e+02  4.356e+00  34.127  < 2e-16 ***
## ocean_proximityINLAND     -6.870e+04  1.254e+03 -54.794  < 2e-16 ***
## ocean_proximityISLAND      1.817e+05  3.133e+04   5.800 6.73e-09 ***
## ocean_proximityNEAR BAY    3.912e+03  1.691e+03   2.314   0.0207 *  
## ocean_proximityNEAR OCEAN  1.364e+04  1.552e+03   8.792  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70000 on 20630 degrees of freedom
## Multiple R-squared:  0.6321, Adjusted R-squared:  0.632 
## F-statistic:  3939 on 9 and 20630 DF,  p-value: < 2.2e-16

From summary above, we can see that ocean_proximityNEAR BAY is not very significant, moreover, too many variables make the model more complex and harder to interpret, so I try to remove ocean_proximity and use ANOVA method for comparison.

model_reduced <- lm(median_house_value ~ median_income + housing_median_age +
                      total_rooms + population + households,
                    data = price_house)

anova(model_reduced, model_full)

From the result above, we can see the p-value is very low, means ocean_proximity can significantly improve model’s interpretability. So I need keep this item.

Then, I want to use stepwise method to reduce some variables of full model

model_step <- step(model_full, direction = "both", trace = FALSE)
summary(model_step)

## 
## Call:
## lm(formula = median_house_value ~ median_income + housing_median_age + 
##     total_rooms + population + households + ocean_proximity, 
##     data = price_house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -547300  -43585  -10882   29580  787296 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.333e+04  2.385e+03  13.973  < 2e-16 ***
## median_income              3.897e+04  3.155e+02 123.493  < 2e-16 ***
## housing_median_age         1.149e+03  4.423e+01  25.967  < 2e-16 ***
## total_rooms               -2.757e+00  6.889e-01  -4.001 6.32e-05 ***
## population                -4.042e+01  1.051e+00 -38.468  < 2e-16 ***
## households                 1.486e+02  4.356e+00  34.127  < 2e-16 ***
## ocean_proximityINLAND     -6.870e+04  1.254e+03 -54.794  < 2e-16 ***
## ocean_proximityISLAND      1.817e+05  3.133e+04   5.800 6.73e-09 ***
## ocean_proximityNEAR BAY    3.912e+03  1.691e+03   2.314   0.0207 *  
## ocean_proximityNEAR OCEAN  1.364e+04  1.552e+03   8.792  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70000 on 20630 degrees of freedom
## Multiple R-squared:  0.6321, Adjusted R-squared:  0.632 
## F-statistic:  3939 on 9 and 20630 DF,  p-value: < 2.2e-16

From the result above, I keep all variables.

Then, we use plot(model_step) to cheack model’s linearity, homoscedasticity, and if there any outliers.

par(mfrow = c(2, 2))
plot(model_step)

From Q-Q plot and residuals vs leverage plot, we can clearly see that outliers exist. In scale -location plots, we can see when median_house_value increases, residuals also upward, so I want to apply a log transformation to price_house to solve this problem and remove outliers.

library(MASS)
# 1. just log translation
log_house_price <- log(price_house$median_house_value)
model_log <- lm(log_house_price ~ median_income + housing_median_age + 
                  total_rooms + population + households + ocean_proximity,
                data = price_house)

summary(model_log)

## 
## Call:
## lm(formula = log_house_price ~ median_income + housing_median_age + 
##     total_rooms + population + households + ocean_proximity, 
##     data = price_house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3805 -0.2137 -0.0118  0.1960  3.2641 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.143e+01  1.154e-02 990.880  < 2e-16 ***
## median_income              1.713e-01  1.526e-03 112.291  < 2e-16 ***
## housing_median_age         3.075e-03  2.139e-04  14.377  < 2e-16 ***
## total_rooms               -6.585e-06  3.331e-06  -1.977  0.04810 *  
## population                -1.772e-04  5.081e-06 -34.872  < 2e-16 ***
## households                 6.538e-04  2.106e-05  31.043  < 2e-16 ***
## ocean_proximityINLAND     -4.912e-01  6.063e-03 -81.017  < 2e-16 ***
## ocean_proximityISLAND      7.473e-01  1.515e-01   4.933 8.15e-07 ***
## ocean_proximityNEAR BAY   -8.658e-04  8.176e-03  -0.106  0.91566    
## ocean_proximityNEAR OCEAN  2.191e-02  7.504e-03   2.919  0.00351 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3385 on 20630 degrees of freedom
## Multiple R-squared:  0.6464, Adjusted R-squared:  0.6462 
## F-statistic:  4190 on 9 and 20630 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_log)

#3. log translation and remove outliers
stud_res <- rstudent(model_log)

outliers_res <- which(abs(stud_res) > 3)

cd <- cooks.distance(model_log)

n <- nrow(price_house)

cook_thr <- 4/n

outliers_cook <- which(cd > cook_thr)

outliers_all <- unique(c(outliers_res, outliers_cook))

length(outliers_all)

## [1] 1008

price_house_no_outliers <- price_house[-outliers_all, ]

#2. just remove outliers
model_no_out <- lm(median_house_value ~ median_income + housing_median_age + 
                         total_rooms + population + households + ocean_proximity,
                       data = price_house_no_outliers)

summary(model_no_out)

## 
## Call:
## lm(formula = median_house_value ~ median_income + housing_median_age + 
##     total_rooms + population + households + ocean_proximity, 
##     data = price_house_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -182040  -39444   -6682   30795  296837 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                10228.747   2248.926   4.548 5.44e-06 ***
## median_income              43329.631    315.856 137.182  < 2e-16 ***
## housing_median_age          1254.600     38.900  32.252  < 2e-16 ***
## total_rooms                   -5.580      0.717  -7.783 7.46e-15 ***
## population                   -57.258      1.132 -50.575  < 2e-16 ***
## households                   214.637      4.558  47.096  < 2e-16 ***
## ocean_proximityINLAND     -66734.088   1096.875 -60.840  < 2e-16 ***
## ocean_proximityNEAR BAY    -5621.495   1486.008  -3.783 0.000155 ***
## ocean_proximityNEAR OCEAN  12555.709   1357.481   9.249  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58720 on 19623 degrees of freedom
## Multiple R-squared:  0.7136, Adjusted R-squared:  0.7135 
## F-statistic:  6112 on 8 and 19623 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_no_out)

model_log_no_out <- lm(log(median_house_value) ~ median_income + housing_median_age + 
                         total_rooms + population + households + ocean_proximity,
                       data = price_house_no_outliers)

summary(model_log_no_out)

## 
## Call:
## lm(formula = log(median_house_value) ~ median_income + housing_median_age + 
##     total_rooms + population + households + ocean_proximity, 
##     data = price_house_no_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94975 -0.20009 -0.01082  0.18308  1.01728 
## 
## Coefficients:
##                             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                1.131e+01  1.086e-02 1042.109  < 2e-16 ***
## median_income              1.945e-01  1.525e-03  127.555  < 2e-16 ***
## housing_median_age         3.676e-03  1.878e-04   19.576  < 2e-16 ***
## total_rooms               -2.472e-05  3.461e-06   -7.140 9.63e-13 ***
## population                -2.603e-04  5.465e-06  -47.632  < 2e-16 ***
## households                 1.005e-03  2.200e-05   45.664  < 2e-16 ***
## ocean_proximityINLAND     -4.840e-01  5.295e-03  -91.409  < 2e-16 ***
## ocean_proximityNEAR BAY   -3.258e-02  7.173e-03   -4.542 5.62e-06 ***
## ocean_proximityNEAR OCEAN  3.073e-02  6.553e-03    4.689 2.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2834 on 19623 degrees of freedom
## Multiple R-squared:  0.7337, Adjusted R-squared:  0.7336 
## F-statistic:  6760 on 8 and 19623 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_log_no_out)

We use studentized residuals > 3 and Cook’s distance > 4/n to remove outliers, and use log translation with house price, and then we can clearly see the curve is now smoother, without the slight u-shape seen before from residuals vs fitted plot. Also, Q-Q plot more likely along with the diagonal, indicating that the residuals follow a more normal distribution. And we can see heteroscedasticity has improved from scale location plot, and the impact of outliers has been reduced from residuals vs leverage plot.

Then, we use vif method to deal with collinearity.

library(car)
vif(model_log_no_out)

##                         GVIF Df GVIF^(1/(2*Df))
## median_income       1.720994  1        1.311867
## housing_median_age  1.328448  1        1.152583
## total_rooms        10.664481  1        3.265652
## population          6.921513  1        2.630877
## households         13.187029  1        3.631395
## ocean_proximity     1.441117  3        1.062796

Form the output above, we can see total_rooms and households have stong collinearity, so I remove households.

model_log_no_out_vif <- lm(log(median_house_value) ~ median_income + housing_median_age + 
                         total_rooms + population + ocean_proximity,
                       data = price_house_no_outliers)

summary(model_log_no_out_vif)

## 
## Call:
## lm(formula = log(median_house_value) ~ median_income + housing_median_age + 
##     total_rooms + population + ocean_proximity, data = price_house_no_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02067 -0.21250 -0.01621  0.19272  1.01666 
## 
## Coefficients:
##                             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                1.146e+01  1.089e-02 1052.485  < 2e-16 ***
## median_income              1.666e-01  1.470e-03  113.376  < 2e-16 ***
## housing_median_age         3.545e-03  1.975e-04   17.951  < 2e-16 ***
## total_rooms                8.586e-05  2.601e-06   33.005  < 2e-16 ***
## population                -1.306e-04  4.911e-06  -26.601  < 2e-16 ***
## ocean_proximityINLAND     -5.379e-01  5.429e-03  -99.092  < 2e-16 ***
## ocean_proximityNEAR BAY   -1.212e-02  7.530e-03   -1.610    0.107    
## ocean_proximityNEAR OCEAN  4.104e-02  6.888e-03    5.959 2.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2981 on 19624 degrees of freedom
## Multiple R-squared:  0.7055, Adjusted R-squared:  0.7054 
## F-statistic:  6714 on 7 and 19624 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_log_no_out_vif)

Then, we want to to see whether it is necessary to include interaction terms or higher-order (quadratic, cubic) terms.

library(gridExtra)

price_house_no_outliers$resid_new <- resid(model_log_no_out_vif)

p1 <- ggplot(price_house_no_outliers, aes(x = median_income, y = resid_new)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", col = "red") +
  ggtitle("Residuals vs. Median Income")

p2 <- ggplot(price_house_no_outliers, aes(x = housing_median_age, y = resid_new)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", col = "red") +
  ggtitle("Residuals vs. Housing Median Age")

p3 <- ggplot(price_house_no_outliers, aes(x = total_rooms, y = resid_new)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", col = "red") +
  ggtitle("Residuals vs. Total Rooms")

p4 <- ggplot(price_house_no_outliers, aes(x = population, y = resid_new)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", col = "red") +
  ggtitle("Residuals vs. Population")


grid.arrange(p1, p2, p3, p4,  ncol = 2)

From the first plot Residuals vs. Median Income,we can see that the curve has some curvature, so we tried using a quadratic term to deal with this problem

model_log_no_out_vif_ord <- lm(log(median_house_value) ~ I(median_income^2) + housing_median_age + 
                         total_rooms + population + ocean_proximity,
                       data = price_house_no_outliers)

summary(model_log_no_out_vif_ord)

## 
## Call:
## lm(formula = log(median_house_value) ~ I(median_income^2) + housing_median_age + 
##     total_rooms + population + ocean_proximity, data = price_house_no_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30599 -0.21082 -0.00647  0.20835  1.07071 
## 
## Coefficients:
##                             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                1.189e+01  9.698e-03 1225.996  < 2e-16 ***
## I(median_income^2)         1.413e-02  1.509e-04   93.633  < 2e-16 ***
## housing_median_age         2.675e-03  2.105e-04   12.708  < 2e-16 ***
## total_rooms                1.169e-04  2.712e-06   43.099  < 2e-16 ***
## population                -1.823e-04  5.149e-06  -35.410  < 2e-16 ***
## ocean_proximityINLAND     -5.958e-01  5.692e-03 -104.676  < 2e-16 ***
## ocean_proximityNEAR BAY   -1.882e-02  8.053e-03   -2.337  0.01943 *  
## ocean_proximityNEAR OCEAN  2.385e-02  7.359e-03    3.241  0.00119 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3188 on 19624 degrees of freedom
## Multiple R-squared:  0.6631, Adjusted R-squared:  0.6629 
## F-statistic:  5517 on 7 and 19624 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_log_no_out_vif_ord)

anova(model_log_no_out_vif_ord, model_log_no_out_vif)

However, the results shows that after introducing quadratic term, the RSS was increased, so we decide not to include quadratic term.Then, we wan to check whether it is necessary to include interaction terms. Now let’s see whether the residual is dependent of median_income * housing_median_age here.

model_log_no_out_vif_int <- lm(log(median_house_value) ~ median_income * housing_median_age + 
                         total_rooms + population + ocean_proximity,
                       data = price_house_no_outliers)

anova(model_log_no_out_vif_int, model_log_no_out_vif)

summary(model_log_no_out_vif_int)

## 
## Call:
## lm(formula = log(median_house_value) ~ median_income * housing_median_age + 
##     total_rooms + population + ocean_proximity, data = price_house_no_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02008 -0.21000 -0.01681  0.19092  1.02747 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                       1.164e+01  1.515e-02  768.272  < 2e-16 ***
## median_income                     1.228e-01  3.051e-03   40.252  < 2e-16 ***
## housing_median_age               -2.529e-03  4.201e-04   -6.022 1.76e-09 ***
## total_rooms                       8.714e-05  2.585e-06   33.710  < 2e-16 ***
## population                       -1.304e-04  4.878e-06  -26.737  < 2e-16 ***
## ocean_proximityINLAND            -5.420e-01  5.398e-03 -100.404  < 2e-16 ***
## ocean_proximityNEAR BAY          -1.689e-02  7.485e-03   -2.257    0.024 *  
## ocean_proximityNEAR OCEAN         3.621e-02  6.848e-03    5.288 1.25e-07 ***
## median_income:housing_median_age  1.571e-03  9.606e-05   16.353  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2961 on 19623 degrees of freedom
## Multiple R-squared:  0.7094, Adjusted R-squared:  0.7093 
## F-statistic:  5988 on 8 and 19623 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_log_no_out_vif_int)

From the result above, we can see that when we incldue interaction term, p-value is very small, and R-squared also increased, means interaction term is necessary. So, model_log_no_out_vif_int is our final model.

Interpretation of Coefficients:

Our model uses log(median_house_value) as the response. So, each coefficient represents an approximate percentage change in house value for a one-unit change in the predictor, when we holding other variables constant.

Intercept = 11.64 : If all predictors were at 0, the log of mean house value would be 11.64 in this area.
median_income = 0.1228: Each additional unit of median_income increases the log of mean house value by 0.1228 in this area, but we also need to consider interaction term.
housing_median_age = -0.002529: Each additional year of house age alone decreases the log of mean house value by about 0.0025.
total_rooms = 8.714e-05: Each additional unit of median_income increases the log of mean house value by 0000.8714 in this area.
population = -1.304e-04: Each additional unit of population decreases the log of mean house value by -000.1304 in this area.
ocean_proximityINLAND = -0.542: Being INLAND reduces the log house value by about 0.54% than 1H OCEAN
ocean_proximityNEAR BAY = -0.0169: NEAR BAY is 1.69% lower than <1H OCEAN.
ocean_proximityNEAR OCEAN = 0.03621: NEAR OCEAN is about 3.6% higher than the baseline <1H OCEAN.

Part 4: Discussion and Model Improvement

In part 3, we find collinearity between total_rooms and households, Then we removed households, maybe we can creat a new variable: average_rooms = total_rooms / households to keep more information.
Maybe we can include more interaction terms.
Our model can only tell us which factors are related to median house prices in different areas of California. Our conclusions help us find a suitable area, not a specific house. In the future, we can do this.

Homework_2 Linear Regression Analysis

Siyu Zhang

2025-03-15

Part 1: Research Question

Part 2: Data Exploration and Preparation

Part 3: Linear Regression Analysis

Part 4: Discussion and Model Improvement