DA220 FINAL PROJECT - California Housing Prices

summary(house)

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
##

Data transformation

The original data set has over 20,000 observations, so after dropping rows with missing values we randomly sample 2,000 of them. We then split the sample into a training set and a test set in a 7:3 ratio.

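# Packages used in this report
library(dplyr)     # %>%, sample_n(), mutate(), filter(), select()
library(ggplot2)   # ggplot(), geom_histogram(), geom_point(), geom_smooth()
library(leaps)     # regsubsets()
library(car)       # ncvTest()
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

# Drop rows with missing values (the 207 NA's in total_bedrooms)
house <- na.omit(house)
set.seed(123)

# Randomly select 2,000 rows.
house <- house %>% sample_n(2000)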

# Create training and testing sets using the indices
train_indices <- sample(seq_len(nrow(house)), size = floor(0.7 * nrow(house)))

train <- house[train_indices, ]
test <- house[-train_indices, ]

summary(train)

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.2   Min.   :32.57   Min.   : 2.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1424  
##  Median :-118.4   Median :34.23   Median :30.00      Median : 2045  
##  Mean   :-119.5   Mean   :35.58   Mean   :29.11      Mean   : 2591  
##  3rd Qu.:-118.0   3rd Qu.:37.68   3rd Qu.:38.00      3rd Qu.: 3062  
##  Max.   :-114.5   Max.   :41.88   Max.   :52.00      Max.   :22128  
##  total_bedrooms     population      households     median_income    
##  Min.   :   2.0   Min.   :    6   Min.   :   2.0   Min.   : 0.4999  
##  1st Qu.: 292.0   1st Qu.:  777   1st Qu.: 279.8   1st Qu.: 2.5777  
##  Median : 419.5   Median : 1132   Median : 392.5   Median : 3.5878  
##  Mean   : 529.1   Mean   : 1398   Mean   : 491.5   Mean   : 3.8525  
##  3rd Qu.: 625.5   3rd Qu.: 1663   3rd Qu.: 581.0   3rd Qu.: 4.6935  
##  Max.   :4457.0   Max.   :11139   Max.   :4204.0   Max.   :15.0001  
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:1400       
##  1st Qu.:115800     Class :character  
##  Median :181950     Mode  :character  
##  Mean   :205992                       
##  3rd Qu.:263800                       
##  Max.   :500001
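The sampling and 70/30 split can be checked on a stand-in data frame. This is a minimal base R sketch: the `toy` data frame and its columns are synthetic assumptions, not the housing data, and only the index logic mirrors the code above.

```r
# Stand-in data frame (synthetic; NOT the housing data)
set.seed(123)
toy <- data.frame(id = 1:20640, x = rnorm(20640))

# Keep 2,000 random rows, then split 70/30 by row index
toy <- toy[sample(nrow(toy), 2000), ]
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two sets should be disjoint and together cover the sampled rows
c(train = nrow(toy_train), test = nrow(toy_test))
length(intersect(toy_train$id, toy_test$id))
```

Negative indexing (`-train_idx`) is what guarantees that no row appears in both sets.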

Let’s take a look at the histograms of our data and decide which variables to drop and transform.

I decided not to include the coordinate variables (longitude and latitude) in the model because they are unlikely to have a linear relationship with our dependent variable, the median house value.

Since our y variable is only moderately skewed, we apply a sqrt() transformation to bring its distribution closer to normal.

train %>% ggplot(aes(sqrt(median_house_value)))+
  geom_histogram(bins = 30, fill = "turquoise", color = "black")
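One way to put a number on "how skewed" is the sample skewness before and after the transformation. The sketch below uses synthetic right-skewed values as a stand-in for median_house_value (an assumption), and `skewness()` is a small helper defined here, not a function from the report.

```r
# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Synthetic right-skewed values standing in for house prices (an assumption)
set.seed(1)
y <- rlnorm(1400, meanlog = 12, sdlog = 0.5)

skew_raw  <- skewness(y)
skew_sqrt <- skewness(sqrt(y))
c(raw = skew_raw, sqrt = skew_sqrt)
```

A value closer to 0 after taking the square root would support the choice of transformation; on the real data the same helper could be applied to `train$median_house_value`.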

For the other variables, here are the plots of each transformed variable against our transformed y variable.

train %>% ggplot(aes(sqrt(total_rooms), sqrt(median_house_value)))+
  geom_point()+geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

train %>% ggplot(aes(sqrt(total_bedrooms), sqrt(median_house_value)))+
  geom_point()+geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

train %>% ggplot(aes(sqrt(population), sqrt(median_house_value)))+
  geom_point()+geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

train %>% ggplot(aes(sqrt(households), sqrt(median_house_value)))+
  geom_point()+geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

train %>% ggplot(aes(log(median_income), sqrt(median_house_value)))+
  geom_point()+geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

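We can see from these scatter plots that the transformed variables still maintain a roughly linear relationship with our y variable.

We also suspected an interaction between income and total rooms, so we split the training data into three income bands and plotted house value against total rooms within each band.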

income1 <- train %>% filter(median_income < 4)
income2 <- train %>% filter(median_income >= 4 & median_income < 8)
income3 <- train %>% filter(median_income >= 8)

income1 %>% ggplot(aes(total_rooms,median_house_value)) + 
  geom_point() +geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

income2 %>% ggplot(aes(total_rooms,median_house_value)) + 
  geom_point() +geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

income3 %>% ggplot(aes(total_rooms,median_house_value)) + 
  geom_point() +geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

For the first interval (income < 4), the best-fit line has a noticeable upward slope, which suggests that the median house value is affected not simply by total rooms but also by income. In the two remaining intervals an upward slope can also be seen, but it is less pronounced.
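Comparing slopes across income bands is a visual check. A formal alternative is to add the interaction term to a linear model and test it. The sketch below does this on synthetic data with a built-in interaction; the variable names and all numbers are assumptions for illustration, not results from the housing data.

```r
# Synthetic data with a built-in interaction (assumption, illustration only)
set.seed(42)
n <- 500
rooms  <- runif(n, 500, 5000)
income <- runif(n, 1, 10)
value  <- 50000 + 5 * rooms + 20000 * income + 3 * rooms * income +
  rnorm(n, sd = 20000)
d <- data.frame(rooms, income, value)

m_main <- lm(value ~ rooms + income, data = d)   # main effects only
m_int  <- lm(value ~ rooms * income, data = d)   # adds rooms:income

# F-test: does the interaction term improve the fit?
cmp <- anova(m_main, m_int)
cmp
```

In an R formula, `rooms * income` expands to both main effects plus the `rooms:income` product term, so the two models differ by exactly one coefficient.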

# Outliers removal
exclude_rows_with_outliers <- function(df, columns, k = 1.5) {
  outlier_indices <- logical(nrow(df))
  
  for (col in columns) {
    Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
    Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
    IQR <- Q3 - Q1
    lower_bound <- Q1 - k * IQR
    upper_bound <- Q3 + k * IQR
    outlier_indices <- outlier_indices | (df[[col]] < lower_bound | df[[col]] > upper_bound)
  }
  
  return(df[!outlier_indices, ])
}

columns_with_outliers <- c("total_rooms", "total_bedrooms", "population", "households", "median_income", "median_house_value")
H1 <- exclude_rows_with_outliers(train, columns_with_outliers, k = 1.5)

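# Transform data
# NOTE: this starts from `train`, not from the outlier-filtered H1 above, so
# the filtered H1 is overwritten and the outlier removal has no effect on the
# models below (all fits use the full 1,400 training rows).
H1 <- train %>%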
  mutate(
 SRoom = sqrt(total_rooms),
 SBed = sqrt(total_bedrooms),
 SPop = sqrt(population),
 SHouse = sqrt(households),
 LIncome = log(median_income),
 SValue = sqrt(median_house_value),
  )

H2 <- H1 %>%
  dplyr::select(SValue, longitude, latitude, housing_median_age,SRoom,   SBed, SPop, SHouse ,LIncome   )

# Update test set
T1 <- test %>%
  mutate(
 SRoom = sqrt(total_rooms),
 SBed = sqrt(total_bedrooms),
 SPop = sqrt(population),
 SHouse = sqrt(households),
 LIncome = log(median_income),
 SValue = sqrt(median_house_value),
  )

T2 <- T1 %>%
  dplyr::select(SValue, longitude, latitude, housing_median_age,SRoom,   SBed, SPop, SHouse ,LIncome   )
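The 1.5 × IQR rule used for outlier removal earlier in this section can be checked on a toy data frame. `iqr_keep()` is a hypothetical compact variant written for this sketch, not the report's function; unlike the original it treats missing values as non-outliers, and the `toy` data are made up.

```r
# Keep rows where every listed column lies in [Q1 - k*IQR, Q3 + k*IQR]
iqr_keep <- function(df, columns, k = 1.5) {
  drop <- rep(FALSE, nrow(df))
  for (col in columns) {
    q <- quantile(df[[col]], c(0.25, 0.75), na.rm = TRUE)
    fence <- k * (q[2] - q[1])
    out <- df[[col]] < q[1] - fence | df[[col]] > q[2] + fence
    drop <- drop | (!is.na(out) & out)   # treat NA as "not an outlier"
  }
  df[!drop, , drop = FALSE]
}

# Toy data (synthetic): the last row has an extreme value in `a`
toy <- data.frame(a = c(1, 2, 3, 4, 5, 100), b = c(10, 11, 12, 13, 14, 15))
kept <- iqr_keep(toy, c("a", "b"))
kept
```

A row is dropped as soon as any one of the listed columns falls outside its fences, so checking many columns at once can remove a large share of the data.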

We then do best subsets and stepwise regression to find the best variables for our model.

best<-regsubsets(SValue ~ housing_median_age + SRoom +  SBed+ SPop +SHouse +LIncome, H2)
summary(best)

## Subset selection object
## Call: regsubsets.formula(SValue ~ housing_median_age + SRoom + SBed + 
##     SPop + SHouse + LIncome, H2)
## 6 Variables  (and intercept)
##                    Forced in Forced out
## housing_median_age     FALSE      FALSE
## SRoom                  FALSE      FALSE
## SBed                   FALSE      FALSE
## SPop                   FALSE      FALSE
## SHouse                 FALSE      FALSE
## LIncome                FALSE      FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: exhaustive
##          housing_median_age SRoom SBed SPop SHouse LIncome
## 1  ( 1 ) " "                " "   " "  " "  " "    "*"    
## 2  ( 1 ) "*"                " "   " "  " "  " "    "*"    
## 3  ( 1 ) " "                "*"   "*"  " "  " "    "*"    
## 4  ( 1 ) "*"                " "   " "  "*"  "*"    "*"    
## 5  ( 1 ) "*"                "*"   " "  "*"  "*"    "*"    
## 6  ( 1 ) "*"                "*"   "*"  "*"  "*"    "*"

stepwise <- step(lm(SValue ~ housing_median_age + SRoom +  SBed+ SPop +SHouse +LIncome, H2))

## Start:  AIC=12318.39
## SValue ~ housing_median_age + SRoom + SBed + SPop + SHouse + 
##     LIncome
## 
##                      Df Sum of Sq      RSS   AIC
## <none>                             9184968 12318
## - SBed                1    126535  9311504 12336
## - SHouse              1    193002  9377971 12346
## - SRoom               1    439999  9624967 12382
## - housing_median_age  1    442803  9627772 12382
## - SPop                1    518330  9703299 12393
## - LIncome             1   7517068 16702037 13154

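Stepwise selection keeps all six predictors (removing any one of them raises the AIC), which matches the six-variable model from best subsets, so our starting model includes housing_median_age + SRoom + SBed + SPop + SHouse + LIncome.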
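The AIC column in the step() output can be reproduced by hand. For a linear model, step() ranks candidates with extractAIC(), which computes n * log(RSS / n) + 2 * p, where p counts the estimated coefficients. The numbers below are copied from the output above.

```r
# Values copied from the step() output above
n   <- 1400      # training rows
rss <- 9184968   # RSS of the full model (the <none> row)
p   <- 7         # six slopes plus the intercept

# extractAIC() for lm: n * log(RSS / n) + 2 * p
aic_full <- n * log(rss / n) + 2 * p
round(aic_full, 2)

# Same formula for the model without SBed (one fewer coefficient)
aic_no_sbed <- n * log(9311504 / n) + 2 * (p - 1)
round(aic_no_sbed)
```

These should match the "Start: AIC" value and the "- SBed" row of the table. Because every single-variable removal has a higher AIC than the `<none>` row, step() stops without dropping anything.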

We then compute the correlation matrix to see how these variables are related.

cor(H2)

##                         SValue   longitude     latitude housing_median_age
## SValue              1.00000000 -0.04990532 -0.169918377        0.087492767
## longitude          -0.04990532  1.00000000 -0.922857820       -0.104619587
## latitude           -0.16991838 -0.92285782  1.000000000        0.003312031
## housing_median_age  0.08749277 -0.10461959  0.003312031        1.000000000
## SRoom               0.19542198  0.03877581 -0.026579782       -0.409897025
## SBed                0.08981654  0.07665027 -0.063943521       -0.361738410
## SPop                0.02502464  0.10923136 -0.112296442       -0.344859591
## SHouse              0.12149813  0.06780751 -0.079055226       -0.333302955
## LIncome             0.69112079 -0.05079379 -0.069642869       -0.107500239
##                          SRoom        SBed        SPop      SHouse     LIncome
## SValue              0.19542198  0.08981654  0.02502464  0.12149813  0.69112079
## longitude           0.03877581  0.07665027  0.10923136  0.06780751 -0.05079379
## latitude           -0.02657978 -0.06394352 -0.11229644 -0.07905523 -0.06964287
## housing_median_age -0.40989702 -0.36173841 -0.34485959 -0.33330296 -0.10750024
## SRoom               1.00000000  0.93761712  0.86797360  0.92859091  0.26636941
## SBed                0.93761712  1.00000000  0.89641159  0.97977673  0.02891163
## SPop                0.86797360  0.89641159  1.00000000  0.92404805  0.05007196
## SHouse              0.92859091  0.97977673  0.92404805  1.00000000  0.06705520
## LIncome             0.26636941  0.02891163  0.05007196  0.06705520  1.00000000

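We can see that the four variables SRoom, SBed, SPop and SHouse are highly correlated with one another (pairwise correlations between 0.87 and 0.98), so we will drop one of them. We fit four models, each omitting one of these variables, and compare them.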
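Pairwise correlations can be summarized per variable with variance inflation factors (VIFs), which equal the diagonal of the inverse of the predictors' correlation matrix. The sketch below rebuilds that matrix from the cor(H2) output above (the six predictors only) and is a base R illustration rather than a step the original analysis ran.

```r
vars <- c("housing_median_age", "SRoom", "SBed", "SPop", "SHouse", "LIncome")
R <- diag(6)
dimnames(R) <- list(vars, vars)

# Upper triangle, column by column, copied from the cor(H2) output
R[upper.tri(R)] <- c(
  -0.40989702,
  -0.36173841, 0.93761712,
  -0.34485959, 0.86797360, 0.89641159,
  -0.33330296, 0.92859091, 0.97977673, 0.92404805,
  -0.10750024, 0.26636941, 0.02891163, 0.05007196, 0.06705520
)
R[lower.tri(R)] <- t(R)[lower.tri(R)]   # mirror to make R symmetric

# VIF_j = j-th diagonal element of the inverse correlation matrix
vif <- diag(solve(R))
round(vif, 1)
```

A common rule of thumb treats a VIF above 10 as a sign of problematic multicollinearity; with the fitted models available, `car::vif()` would give the same quantities directly.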

reg1 <- lm(SValue ~ housing_median_age + SBed + SPop + SHouse + LIncome, H2)
summary(reg1)

## 
## Call:
## lm(formula = SValue ~ housing_median_age + SBed + SPop + SHouse + 
##     LIncome, data = H2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -382.32  -58.36   -4.89   51.05  464.82 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        115.1335    12.7244   9.048  < 2e-16 ***
## housing_median_age   1.8878     0.1959   9.635  < 2e-16 ***
## SBed                -0.6015     1.5475  -0.389    0.698    
## SPop                -5.0754     0.4861 -10.441  < 2e-16 ***
## SHouse              10.9851     1.8771   5.852 6.04e-09 ***
## LIncome            183.9799     4.8616  37.844  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.09 on 1394 degrees of freedom
## Multiple R-squared:  0.5578, Adjusted R-squared:  0.5562 
## F-statistic: 351.7 on 5 and 1394 DF,  p-value: < 2.2e-16

reg2 <- lm(SValue ~ housing_median_age + SRoom + SPop + SHouse + LIncome, H2)
summary(reg2)

## 
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SPop + SHouse + 
##     LIncome, data = H2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -399.68  -56.60   -6.37   49.81  441.13 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        109.4260    12.1076   9.038  < 2e-16 ***
## housing_median_age   1.5329     0.1955   7.841 8.84e-15 ***
## SRoom               -2.9825     0.4346  -6.862 1.02e-11 ***
## SPop                -4.7935     0.4741 -10.111  < 2e-16 ***
## SHouse              16.1889     1.1701  13.836  < 2e-16 ***
## LIncome            205.6736     5.6080  36.675  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.73 on 1394 degrees of freedom
## Multiple R-squared:  0.5722, Adjusted R-squared:  0.5706 
## F-statistic: 372.9 on 5 and 1394 DF,  p-value: < 2.2e-16

reg3 <- lm(SValue ~ housing_median_age + SRoom  + SBed + SHouse + LIncome, H2)
summary(reg3)

## 
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SBed + SHouse + 
##     LIncome, data = H2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -429.32  -59.84   -7.85   51.45  325.21 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         62.0159    12.7419   4.867 1.26e-06 ***
## housing_median_age   1.7593     0.1992   8.831  < 2e-16 ***
## SRoom               -5.2383     0.5322  -9.843  < 2e-16 ***
## SBed                11.9504     1.8479   6.467 1.38e-10 ***
## SHouse               1.3146     1.6026   0.820    0.412    
## LIncome            231.4206     6.5962  35.084  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.43 on 1394 degrees of freedom
## Multiple R-squared:  0.5542, Adjusted R-squared:  0.5526 
## F-statistic: 346.6 on 5 and 1394 DF,  p-value: < 2.2e-16

reg4 <- lm(SValue ~ housing_median_age + SRoom  + SBed + SPop + LIncome, H2)
summary(reg4)

## 
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SBed + SPop + 
##     LIncome, data = H2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -431.48  -55.95   -5.25   50.61  406.25 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         76.6959    12.6830   6.047 1.89e-09 ***
## housing_median_age   1.7756     0.1939   9.159  < 2e-16 ***
## SRoom               -4.5130     0.5324  -8.477  < 2e-16 ***
## SBed                15.8598     1.1815  13.424  < 2e-16 ***
## SPop                -2.9077     0.4151  -7.004 3.86e-12 ***
## LIncome            227.7889     6.4634  35.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.02 on 1394 degrees of freedom
## Multiple R-squared:  0.5691, Adjusted R-squared:  0.5676 
## F-statistic: 368.3 on 5 and 1394 DF,  p-value: < 2.2e-16

We can see that dropping SBed has the least effect on our R-squared value (reg2 keeps the highest R-squared, 0.5722), so we will proceed with that model.

reg <- lm(SValue ~ housing_median_age + SRoom + SPop + SHouse + LIncome, H2)
summary(reg)

## 
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SPop + SHouse + 
##     LIncome, data = H2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -399.68  -56.60   -6.37   49.81  441.13 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        109.4260    12.1076   9.038  < 2e-16 ***
## housing_median_age   1.5329     0.1955   7.841 8.84e-15 ***
## SRoom               -2.9825     0.4346  -6.862 1.02e-11 ***
## SPop                -4.7935     0.4741 -10.111  < 2e-16 ***
## SHouse              16.1889     1.1701  13.836  < 2e-16 ***
## LIncome            205.6736     5.6080  36.675  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.73 on 1394 degrees of freedom
## Multiple R-squared:  0.5722, Adjusted R-squared:  0.5706 
## F-statistic: 372.9 on 5 and 1394 DF,  p-value: < 2.2e-16

Mathematical assumptions

We will take a look at 2 main assumptions: Normality and Homoscedasticity.

For normality, we will look at the histogram of residuals and conduct a Shapiro-Wilk test.

H3 <- H2 %>% mutate(res=residuals(reg),fit=fitted.values(reg))
H3 %>% ggplot(aes(res))+ 
  geom_histogram(bins = 30, fill = "turquoise", color = "black") + labs(title = "Histogram of residuals")

shapiro.test(reg$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  reg$residuals
## W = 0.98519, p-value = 9.033e-11

Even though the histogram looks roughly bell-shaped, the Shapiro-Wilk test gives a p-value lower than 0.05, so the null hypothesis of normality is rejected. However, with a sample as large as ours (1,400 observations) the test detects even small departures from normality, so we treat normality as approximately satisfied based on the visual check.
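A normal Q-Q plot is a common companion to the histogram for this kind of visual check. The sketch below uses simulated residuals as a stand-in for residuals(reg) (an assumption), and also computes the correlation between the ordered residuals and the theoretical normal quantiles as a rough numeric summary.

```r
# Simulated residuals standing in for residuals(reg) (an assumption)
set.seed(7)
res <- rnorm(1400, sd = 80)

# Draw the Q-Q plot only in an interactive session
if (interactive()) {
  qqnorm(res)
  qqline(res)
}

# Correlation between sample quantiles and theoretical normal quantiles;
# values very close to 1 mean the points lie near a straight line
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
qq_cor
```

On the real model, replacing `res` with `residuals(reg)` would show whether the departure flagged by the Shapiro-Wilk test sits mainly in the tails.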

We will now look at a scatter plot for residuals and ncvTest to check for homoscedasticity.

H3 %>% ggplot(aes(fit,res))+
  geom_point(color = "turquoise") + labs(title = "Scatterplot of residuals")

ncvTest(reg)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 5.479215, Df = 1, p = 0.019244

The residuals in the scatter plot appear to be scattered randomly around the 0 line, which suggests homoscedasticity. However, ncvTest gives a p-value lower than 0.05, so the null hypothesis of homoscedasticity is rejected and the model shows evidence of heteroskedasticity. We will now run coeftest with heteroskedasticity-consistent (HC4) standard errors to check whether the coefficients remain significant.

coeftest(reg, vcov = vcovHC(reg, type = "HC4"))

## 
## t test of coefficients:
## 
##                     Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)        109.42604   13.19987  8.2899 2.635e-16 ***
## housing_median_age   1.53286    0.20924  7.3257 4.006e-13 ***
## SRoom               -2.98247    0.52721 -5.6570 1.866e-08 ***
## SPop                -4.79349    0.78874 -6.0774 1.574e-09 ***
## SHouse              16.18892    1.65877  9.7596 < 2.2e-16 ***
## LIncome            205.67356    6.18572 33.2497 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
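Robust standard errors leave the coefficient estimates unchanged and only replace the variance estimate. The sketch below computes the simplest variant, HC0, by hand in base R on synthetic heteroskedastic data. Note that this is not the HC4 estimator used above, which additionally rescales the squared residuals by leverage, and that the data are an assumption.

```r
# Synthetic data whose error spread grows with x (an assumption)
set.seed(10)
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# HC0 sandwich: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
X      <- model.matrix(fit)
e      <- residuals(fit)
bread  <- solve(crossprod(X))
meat   <- crossprod(X * e)        # equals X' diag(e^2) X
vc_hc0 <- bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))
se_robust  <- sqrt(diag(vc_hc0))
cbind(estimate = coef(fit), se_classic, se_robust)
```

The estimate column is identical under both approaches; only the standard errors, and therefore the t values and p-values, differ.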

All coefficients remain significant with the robust standard errors. To address the heteroskedasticity itself, we tried removing each variable in turn and found that removing SPop resolves it. If we run ncvTest again, the p-value is now above 0.05, which means that we fail to reject homoscedasticity. Hence we have our final model.

reg <- lm(SValue ~ housing_median_age + SRoom  + SHouse + LIncome, H2)
ncvTest(reg)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.188404, Df = 1, p = 0.27565

Final model

summary(reg)

## 
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SHouse + LIncome, 
##     data = H2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -401.53  -61.31   -7.98   52.35  360.23 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         86.8748    12.3247   7.049 2.83e-12 ***
## housing_median_age   1.6901     0.2018   8.374  < 2e-16 ***
## SRoom               -3.3249     0.4488  -7.409 2.20e-13 ***
## SHouse               9.4917     0.9989   9.502  < 2e-16 ***
## LIncome            210.0397     5.7907  36.272  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84.64 on 1395 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5395 
## F-statistic: 410.7 on 4 and 1395 DF,  p-value: < 2.2e-16

Coefficients

Intercept (86.8748): When all predictors are zero, the expected value of the square-rooted median house value (SValue) is approximately 86.8748.
housing_median_age (1.6901): For each one-unit increase in the median housing age of the area, SValue is expected to increase by approximately 1.6901 units, holding the other variables constant.
SRoom (-3.3249): For each one-unit increase in square-rooted total rooms, SValue is expected to decrease by approximately 3.3249 units, holding the other variables constant.
SHouse (9.4917): For each one-unit increase in the square-rooted number of households in the area, SValue is expected to increase by approximately 9.4917 units, holding the other variables constant.
LIncome (210.0397): Each one-unit increase in the log-transformed median income is associated with an increase of approximately 210.0397 units in SValue, holding the other variables constant.
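Because LIncome is the natural log of median_income, a "one-unit increase" corresponds to multiplying income by about 2.72, which is hard to picture. A short calculation, using the coefficient reported above, restates the effect for more familiar changes in income.

```r
b_income <- 210.0397   # LIncome coefficient from summary(reg)

# Multiplying income by a factor c adds log(c) to LIncome,
# and therefore b_income * log(c) to SValue
effect_double <- b_income * log(2)      # income doubles
effect_10pct  <- b_income * log(1.10)   # income rises by 10%
round(c(double = effect_double, ten_percent = effect_10pct), 2)
```

These effects are on the square-root scale of SValue; the corresponding change in dollars depends on the starting house value.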

Model fit

Residual Standard Error (RSE): The RSE of 84.64 suggests that the actual SValue typically deviates from the predicted SValue by approximately 84.64 units.
Multiple R-squared: With a value of 0.5408, the model explains 54.08% of the variability in SValue, which we consider an acceptable, moderate fit.
Adjusted R-squared: Adjusted for the number of predictors, the value of 0.5395 is almost identical, so the fit is not inflated by the number of predictors.
F-Statistic: The high F-statistic of 410.7 and the very low p-value indicate that the model as a whole is statistically significant.

Confint and Prediction

We will now take a look at the 95% confidence interval of our coefficients.

confint(reg)

##                         2.5 %     97.5 %
## (Intercept)         62.697843 111.051683
## housing_median_age   1.294200   2.086035
## SRoom               -4.205203  -2.444502
## SHouse               7.532154  11.451200
## LIncome            198.680244 221.399075

We can see that none of these intervals contains 0, which is consistent with every predictor being significant at the 5% level.

We will now make a prediction on a new set of predictive values from our model.

new <- data.frame(
  housing_median_age = c(29),
  SRoom  = c(47),
  SHouse = c(8),
  LIncome = c(1.2)
)

predictions <- predict(reg, newdata = new, interval = "prediction", level = 0.95)
print(predictions)

##        fit      lwr      upr
## 1 307.6011 139.6507 475.5515

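We can see that the prediction interval is very wide, which reflects the uncertainty in our model. Part of this width is inherent to predicting a single observation: a prediction interval includes the residual variance as well as the uncertainty in the coefficients, so it is not directly comparable to the confint output above.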
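The fitted value can be reproduced from the coefficients, and the interval can be put back on the dollar scale by squaring. The sketch below uses the rounded coefficients and the prediction output reported above, so the hand-computed fit agrees only up to rounding.

```r
# Coefficients from summary(reg) and the new observation used above
b <- c(intercept = 86.8748, age = 1.6901, SRoom = -3.3249,
       SHouse = 9.4917, LIncome = 210.0397)
x <- c(1, 29, 47, 8, 1.2)

fit_sqrt <- sum(b * x)   # prediction on the square-root scale
fit_sqrt

# Squaring returns to dollars. The interval endpoints map over directly
# because both are positive; the squared fit is not exactly the expected price.
pred    <- c(fit = 307.6011, lwr = 139.6507, upr = 475.5515)
dollars <- pred^2
round(dollars)
```

On the dollar scale the interval is no longer symmetric around the fitted value, which is a side effect of modelling the square root of the price.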

Test set

regTest <- lm(SValue ~ housing_median_age + SRoom  + SHouse + LIncome , T2)
summary(regTest)

## 
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SHouse + LIncome, 
##     data = T2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215.40  -57.28  -10.59   46.43  662.66 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         85.1427    19.6257   4.338 1.69e-05 ***
## housing_median_age   2.0775     0.3044   6.826 2.16e-11 ***
## SRoom               -2.2938     0.6737  -3.405 0.000706 ***
## SHouse               6.6075     1.5578   4.242 2.57e-05 ***
## LIncome            210.8892     9.2086  22.901  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86 on 595 degrees of freedom
## Multiple R-squared:  0.5319, Adjusted R-squared:  0.5287 
## F-statistic:   169 on 4 and 595 DF,  p-value: < 2.2e-16

We can observe that there have been some slight changes to our coefficients as well as the model’s fit.

Coefficients

Intercept (85.1427): When all predictors are zero, the expected value of the square-rooted median house value (SValue) is approximately 85.1427.
housing_median_age (2.0775): For each one-unit increase in the median housing age of the area, SValue is expected to increase by approximately 2.0775 units, holding the other variables constant.
SRoom (-2.2938): For each one-unit increase in square-rooted total rooms, SValue is expected to decrease by approximately 2.2938 units, holding the other variables constant.
SHouse (6.6075): For each one-unit increase in the square-rooted number of households in the area, SValue is expected to increase by approximately 6.6075 units, holding the other variables constant.
LIncome (210.8892): Each one-unit increase in the log-transformed median income is associated with an increase of approximately 210.8892 units in SValue, holding the other variables constant.

Model fit

Residual Standard Error (RSE): The RSE has increased slightly to 86, which suggests that the actual SValue typically deviates from the predicted SValue by approximately 86 units.
Adjusted R-squared: The adjusted R-squared has dropped slightly to 0.5287, which could be due to mild overfitting or simply to the smaller sample (600 observations).
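The comparison above refits the model on the test rows. A common alternative is to keep the model fitted on the training rows and score its predictions on the held-out rows, which measures out-of-sample error directly. The sketch below shows the pattern on synthetic data; the data frame, variable names, and coefficients are assumptions. With the report's objects, the equivalent call would be `predict(reg, newdata = T2)`.

```r
# Synthetic stand-in for the train/test split (an assumption)
set.seed(99)
n <- 2000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 100 + 5 * d$x1 - 3 * d$x2 + rnorm(n, sd = 10)
idx <- sample(seq_len(n), size = floor(0.7 * n))
tr <- d[idx, ]
te <- d[-idx, ]

# Fit on the training rows only, then predict the held-out rows
m    <- lm(y ~ x1 + x2, data = tr)
pred <- predict(m, newdata = te)

# Out-of-sample error and R-squared from the test-set predictions
rmse_test <- sqrt(mean((te$y - pred)^2))
r2_test   <- 1 - sum((te$y - pred)^2) / sum((te$y - mean(te$y))^2)
c(rmse = rmse_test, r2 = r2_test)
```

A test-set RMSE close to the training residual standard error would suggest little overfitting, while a much larger value would point the other way.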

Conclusion and Future problems

Conclusion

Through the analysis, we have developed the final ‘best’ model for predicting the square-rooted median house value in California. This model uses four predictors: housing_median_age, SRoom (square-rooted total rooms), SHouse (square-rooted number of households), and LIncome (log-transformed median income). Overall, the model shows a moderate fit, with an R-squared of about 0.54 on the training set. Regarding the mathematical assumptions, the residuals look approximately normal (although the Shapiro-Wilk test rejects exact normality), and after removing SPop the ncvTest no longer rejects homoscedasticity, which supports the reliability of our estimates and the validity of statistical inferences made from the model.

Future problems

Overfitting due to multicollinearity: several variables were highly correlated, which can cause multicollinearity and may in turn lead to overfitting. Overfitting can make the model perform poorly on unseen data and therefore reduce the accuracy of prediction. To fix this, we can drop variables that are highly correlated, but because this data set has so few variables, other methods need to be considered.
Small sample size: to match what we’re used to in class, I reduced the sample size from over 20,000 to 2,000, which can affect the model’s accuracy.
Outdated data set: this data set was collected in 1990, so before applying this model we would first need to update the data to reflect current prices.

Andrew Nguyen

2024-04-27
