In January 2025, California experienced a major wildfire that destroyed about 7,000 homes. As a result, many people will need to purchase new homes. So, where is the best place to buy? I want to investigate the factors affecting housing prices in California.
We find a dataset includes information on house values, local population, local income, house age, number of households, and more. Next, we will explore the relationships between house values and other factors.
Variables in the Dataset:
statistical summaries of the variables
price_house <- read_csv("D:/Mysoftware/DefaultWD-R/dataset/California_house_price.csv")
str(price_house)
## spc_tbl_ [20,640 Ă— 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ longitude : num [1:20640] -122 -122 -122 -122 -122 ...
## $ latitude : num [1:20640] 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num [1:20640] 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num [1:20640] 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num [1:20640] 129 1106 190 235 280 ...
## $ population : num [1:20640] 322 2401 496 558 565 ...
## $ households : num [1:20640] 126 1138 177 219 259 ...
## $ median_income : num [1:20640] 8.33 8.3 7.26 5.64 3.85 ...
## $ ocean_proximity : chr [1:20640] "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
## $ median_house_value: num [1:20640] 452600 358500 352100 341300 342200 ...
## - attr(*, "spec")=
## .. cols(
## .. longitude = col_double(),
## .. latitude = col_double(),
## .. housing_median_age = col_double(),
## .. total_rooms = col_double(),
## .. total_bedrooms = col_double(),
## .. population = col_double(),
## .. households = col_double(),
## .. median_income = col_double(),
## .. ocean_proximity = col_character(),
## .. median_house_value = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(price_house)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## ocean_proximity median_house_value
## Length:20640 Min. : 14999
## Class :character 1st Qu.:119600
## Mode :character Median :179700
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
colSums(is.na(price_house))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 207 0 0 0
## ocean_proximity median_house_value
## 0 0
price_house_num <- price_house %>%
select_if(is.numeric) %>%
drop_na()
cor(price_house_num)
## longitude latitude housing_median_age total_rooms
## longitude 1.00000000 -0.92461611 -0.10935655 0.04548017
## latitude -0.92461611 1.00000000 0.01189907 -0.03666681
## housing_median_age -0.10935655 0.01189907 1.00000000 -0.36062830
## total_rooms 0.04548017 -0.03666681 -0.36062830 1.00000000
## total_bedrooms 0.06960802 -0.06698283 -0.32045104 0.93037950
## population 0.10027030 -0.10899734 -0.29578730 0.85728125
## households 0.05651277 -0.07177419 -0.30276797 0.91899153
## median_income -0.01555015 -0.07962632 -0.11827772 0.19788152
## median_house_value -0.04539822 -0.14463821 0.10643205 0.13329413
## total_bedrooms population households median_income
## longitude 0.06960802 0.100270301 0.05651277 -0.015550150
## latitude -0.06698283 -0.108997344 -0.07177419 -0.079626319
## housing_median_age -0.32045104 -0.295787297 -0.30276797 -0.118277723
## total_rooms 0.93037950 0.857281251 0.91899153 0.197881519
## total_bedrooms 1.00000000 0.877746743 0.97972827 -0.007722850
## population 0.87774674 1.000000000 0.90718590 0.005086624
## households 0.97972827 0.907185900 1.00000000 0.013433892
## median_income -0.00772285 0.005086624 0.01343389 1.000000000
## median_house_value 0.04968618 -0.025299732 0.06489355 0.688355475
## median_house_value
## longitude -0.04539822
## latitude -0.14463821
## housing_median_age 0.10643205
## total_rooms 0.13329413
## total_bedrooms 0.04968618
## population -0.02529973
## households 0.06489355
## median_income 0.68835548
## median_house_value 1.00000000
ggpairs(price_house, alpha = 0.5) +
theme_minimal()
unique(price_house$ocean_proximity)
## [1] "NEAR BAY" "<1H OCEAN" "INLAND" "NEAR OCEAN" "ISLAND"
price_house$ocean_proximity <- as.factor(price_house$ocean_proximity)
model_full <- lm(median_house_value ~ median_income + housing_median_age +
total_rooms + population + households + ocean_proximity,
data = price_house)
summary(model_full)
##
## Call:
## lm(formula = median_house_value ~ median_income + housing_median_age +
## total_rooms + population + households + ocean_proximity,
## data = price_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -547300 -43585 -10882 29580 787296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.333e+04 2.385e+03 13.973 < 2e-16 ***
## median_income 3.897e+04 3.155e+02 123.493 < 2e-16 ***
## housing_median_age 1.149e+03 4.423e+01 25.967 < 2e-16 ***
## total_rooms -2.757e+00 6.889e-01 -4.001 6.32e-05 ***
## population -4.042e+01 1.051e+00 -38.468 < 2e-16 ***
## households 1.486e+02 4.356e+00 34.127 < 2e-16 ***
## ocean_proximityINLAND -6.870e+04 1.254e+03 -54.794 < 2e-16 ***
## ocean_proximityISLAND 1.817e+05 3.133e+04 5.800 6.73e-09 ***
## ocean_proximityNEAR BAY 3.912e+03 1.691e+03 2.314 0.0207 *
## ocean_proximityNEAR OCEAN 1.364e+04 1.552e+03 8.792 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70000 on 20630 degrees of freedom
## Multiple R-squared: 0.6321, Adjusted R-squared: 0.632
## F-statistic: 3939 on 9 and 20630 DF, p-value: < 2.2e-16
model_reduced <- lm(median_house_value ~ median_income + housing_median_age +
total_rooms + population + households,
data = price_house)
anova(model_reduced, model_full)
From the result above, we can see the p-value is very low, means ocean_proximity can significantly improve model’s interpretability. So I need keep this item.
model_step <- step(model_full, direction = "both", trace = FALSE)
summary(model_step)
##
## Call:
## lm(formula = median_house_value ~ median_income + housing_median_age +
## total_rooms + population + households + ocean_proximity,
## data = price_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -547300 -43585 -10882 29580 787296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.333e+04 2.385e+03 13.973 < 2e-16 ***
## median_income 3.897e+04 3.155e+02 123.493 < 2e-16 ***
## housing_median_age 1.149e+03 4.423e+01 25.967 < 2e-16 ***
## total_rooms -2.757e+00 6.889e-01 -4.001 6.32e-05 ***
## population -4.042e+01 1.051e+00 -38.468 < 2e-16 ***
## households 1.486e+02 4.356e+00 34.127 < 2e-16 ***
## ocean_proximityINLAND -6.870e+04 1.254e+03 -54.794 < 2e-16 ***
## ocean_proximityISLAND 1.817e+05 3.133e+04 5.800 6.73e-09 ***
## ocean_proximityNEAR BAY 3.912e+03 1.691e+03 2.314 0.0207 *
## ocean_proximityNEAR OCEAN 1.364e+04 1.552e+03 8.792 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70000 on 20630 degrees of freedom
## Multiple R-squared: 0.6321, Adjusted R-squared: 0.632
## F-statistic: 3939 on 9 and 20630 DF, p-value: < 2.2e-16
From the result above, I keep all variables.
par(mfrow = c(2, 2))
plot(model_step)
From Q-Q plot and residuals vs leverage plot, we can clearly see that
outliers exist. In scale -location plots, we can see when
median_house_value increases, residuals also upward, so I want to apply
a log transformation to price_house to solve this problem and remove
outliers.
library(MASS)
# 1. just log translation
log_house_price <- log(price_house$median_house_value)
model_log <- lm(log_house_price ~ median_income + housing_median_age +
total_rooms + population + households + ocean_proximity,
data = price_house)
summary(model_log)
##
## Call:
## lm(formula = log_house_price ~ median_income + housing_median_age +
## total_rooms + population + households + ocean_proximity,
## data = price_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3805 -0.2137 -0.0118 0.1960 3.2641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.143e+01 1.154e-02 990.880 < 2e-16 ***
## median_income 1.713e-01 1.526e-03 112.291 < 2e-16 ***
## housing_median_age 3.075e-03 2.139e-04 14.377 < 2e-16 ***
## total_rooms -6.585e-06 3.331e-06 -1.977 0.04810 *
## population -1.772e-04 5.081e-06 -34.872 < 2e-16 ***
## households 6.538e-04 2.106e-05 31.043 < 2e-16 ***
## ocean_proximityINLAND -4.912e-01 6.063e-03 -81.017 < 2e-16 ***
## ocean_proximityISLAND 7.473e-01 1.515e-01 4.933 8.15e-07 ***
## ocean_proximityNEAR BAY -8.658e-04 8.176e-03 -0.106 0.91566
## ocean_proximityNEAR OCEAN 2.191e-02 7.504e-03 2.919 0.00351 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3385 on 20630 degrees of freedom
## Multiple R-squared: 0.6464, Adjusted R-squared: 0.6462
## F-statistic: 4190 on 9 and 20630 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_log)
#3. log translation and remove outliers
stud_res <- rstudent(model_log)
outliers_res <- which(abs(stud_res) > 3)
cd <- cooks.distance(model_log)
n <- nrow(price_house)
cook_thr <- 4/n
outliers_cook <- which(cd > cook_thr)
outliers_all <- unique(c(outliers_res, outliers_cook))
length(outliers_all)
## [1] 1008
price_house_no_outliers <- price_house[-outliers_all, ]
#2. just remove outliers
model_no_out <- lm(median_house_value ~ median_income + housing_median_age +
total_rooms + population + households + ocean_proximity,
data = price_house_no_outliers)
summary(model_no_out)
##
## Call:
## lm(formula = median_house_value ~ median_income + housing_median_age +
## total_rooms + population + households + ocean_proximity,
## data = price_house_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -182040 -39444 -6682 30795 296837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10228.747 2248.926 4.548 5.44e-06 ***
## median_income 43329.631 315.856 137.182 < 2e-16 ***
## housing_median_age 1254.600 38.900 32.252 < 2e-16 ***
## total_rooms -5.580 0.717 -7.783 7.46e-15 ***
## population -57.258 1.132 -50.575 < 2e-16 ***
## households 214.637 4.558 47.096 < 2e-16 ***
## ocean_proximityINLAND -66734.088 1096.875 -60.840 < 2e-16 ***
## ocean_proximityNEAR BAY -5621.495 1486.008 -3.783 0.000155 ***
## ocean_proximityNEAR OCEAN 12555.709 1357.481 9.249 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58720 on 19623 degrees of freedom
## Multiple R-squared: 0.7136, Adjusted R-squared: 0.7135
## F-statistic: 6112 on 8 and 19623 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_no_out)
model_log_no_out <- lm(log(median_house_value) ~ median_income + housing_median_age +
total_rooms + population + households + ocean_proximity,
data = price_house_no_outliers)
summary(model_log_no_out)
##
## Call:
## lm(formula = log(median_house_value) ~ median_income + housing_median_age +
## total_rooms + population + households + ocean_proximity,
## data = price_house_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94975 -0.20009 -0.01082 0.18308 1.01728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.131e+01 1.086e-02 1042.109 < 2e-16 ***
## median_income 1.945e-01 1.525e-03 127.555 < 2e-16 ***
## housing_median_age 3.676e-03 1.878e-04 19.576 < 2e-16 ***
## total_rooms -2.472e-05 3.461e-06 -7.140 9.63e-13 ***
## population -2.603e-04 5.465e-06 -47.632 < 2e-16 ***
## households 1.005e-03 2.200e-05 45.664 < 2e-16 ***
## ocean_proximityINLAND -4.840e-01 5.295e-03 -91.409 < 2e-16 ***
## ocean_proximityNEAR BAY -3.258e-02 7.173e-03 -4.542 5.62e-06 ***
## ocean_proximityNEAR OCEAN 3.073e-02 6.553e-03 4.689 2.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2834 on 19623 degrees of freedom
## Multiple R-squared: 0.7337, Adjusted R-squared: 0.7336
## F-statistic: 6760 on 8 and 19623 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_log_no_out)
We use studentized residuals > 3 and Cook’s distance > 4/n to
remove outliers, and use log translation with house price, and then we
can clearly see the curve is now smoother, without the slight u-shape
seen before from residuals vs fitted plot. Also, Q-Q plot more likely
along with the diagonal, indicating that the residuals follow a more
normal distribution. And we can see heteroscedasticity has improved from
scale location plot, and the impact of outliers has been reduced from
residuals vs leverage plot.
library(car)
vif(model_log_no_out)
## GVIF Df GVIF^(1/(2*Df))
## median_income 1.720994 1 1.311867
## housing_median_age 1.328448 1 1.152583
## total_rooms 10.664481 1 3.265652
## population 6.921513 1 2.630877
## households 13.187029 1 3.631395
## ocean_proximity 1.441117 3 1.062796
Form the output above, we can see total_rooms and households have stong collinearity, so I remove households.
model_log_no_out_vif <- lm(log(median_house_value) ~ median_income + housing_median_age +
total_rooms + population + ocean_proximity,
data = price_house_no_outliers)
summary(model_log_no_out_vif)
##
## Call:
## lm(formula = log(median_house_value) ~ median_income + housing_median_age +
## total_rooms + population + ocean_proximity, data = price_house_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02067 -0.21250 -0.01621 0.19272 1.01666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.146e+01 1.089e-02 1052.485 < 2e-16 ***
## median_income 1.666e-01 1.470e-03 113.376 < 2e-16 ***
## housing_median_age 3.545e-03 1.975e-04 17.951 < 2e-16 ***
## total_rooms 8.586e-05 2.601e-06 33.005 < 2e-16 ***
## population -1.306e-04 4.911e-06 -26.601 < 2e-16 ***
## ocean_proximityINLAND -5.379e-01 5.429e-03 -99.092 < 2e-16 ***
## ocean_proximityNEAR BAY -1.212e-02 7.530e-03 -1.610 0.107
## ocean_proximityNEAR OCEAN 4.104e-02 6.888e-03 5.959 2.59e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2981 on 19624 degrees of freedom
## Multiple R-squared: 0.7055, Adjusted R-squared: 0.7054
## F-statistic: 6714 on 7 and 19624 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_log_no_out_vif)
library(gridExtra)
price_house_no_outliers$resid_new <- resid(model_log_no_out_vif)
p1 <- ggplot(price_house_no_outliers, aes(x = median_income, y = resid_new)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", col = "red") +
ggtitle("Residuals vs. Median Income")
p2 <- ggplot(price_house_no_outliers, aes(x = housing_median_age, y = resid_new)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", col = "red") +
ggtitle("Residuals vs. Housing Median Age")
p3 <- ggplot(price_house_no_outliers, aes(x = total_rooms, y = resid_new)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", col = "red") +
ggtitle("Residuals vs. Total Rooms")
p4 <- ggplot(price_house_no_outliers, aes(x = population, y = resid_new)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", col = "red") +
ggtitle("Residuals vs. Population")
grid.arrange(p1, p2, p3, p4, ncol = 2)
From the first plot Residuals vs. Median Income,we can see that the
curve has some curvature, so we tried using a quadratic term to deal
with this problem
model_log_no_out_vif_ord <- lm(log(median_house_value) ~ I(median_income^2) + housing_median_age +
total_rooms + population + ocean_proximity,
data = price_house_no_outliers)
summary(model_log_no_out_vif_ord)
##
## Call:
## lm(formula = log(median_house_value) ~ I(median_income^2) + housing_median_age +
## total_rooms + population + ocean_proximity, data = price_house_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.30599 -0.21082 -0.00647 0.20835 1.07071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.189e+01 9.698e-03 1225.996 < 2e-16 ***
## I(median_income^2) 1.413e-02 1.509e-04 93.633 < 2e-16 ***
## housing_median_age 2.675e-03 2.105e-04 12.708 < 2e-16 ***
## total_rooms 1.169e-04 2.712e-06 43.099 < 2e-16 ***
## population -1.823e-04 5.149e-06 -35.410 < 2e-16 ***
## ocean_proximityINLAND -5.958e-01 5.692e-03 -104.676 < 2e-16 ***
## ocean_proximityNEAR BAY -1.882e-02 8.053e-03 -2.337 0.01943 *
## ocean_proximityNEAR OCEAN 2.385e-02 7.359e-03 3.241 0.00119 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3188 on 19624 degrees of freedom
## Multiple R-squared: 0.6631, Adjusted R-squared: 0.6629
## F-statistic: 5517 on 7 and 19624 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_log_no_out_vif_ord)
anova(model_log_no_out_vif_ord, model_log_no_out_vif)
However, the results shows that after introducing quadratic term, the RSS was increased, so we decide not to include quadratic term.Then, we wan to check whether it is necessary to include interaction terms. Now let’s see whether the residual is dependent of median_income * housing_median_age here.
model_log_no_out_vif_int <- lm(log(median_house_value) ~ median_income * housing_median_age +
total_rooms + population + ocean_proximity,
data = price_house_no_outliers)
anova(model_log_no_out_vif_int, model_log_no_out_vif)
summary(model_log_no_out_vif_int)
##
## Call:
## lm(formula = log(median_house_value) ~ median_income * housing_median_age +
## total_rooms + population + ocean_proximity, data = price_house_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02008 -0.21000 -0.01681 0.19092 1.02747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.164e+01 1.515e-02 768.272 < 2e-16 ***
## median_income 1.228e-01 3.051e-03 40.252 < 2e-16 ***
## housing_median_age -2.529e-03 4.201e-04 -6.022 1.76e-09 ***
## total_rooms 8.714e-05 2.585e-06 33.710 < 2e-16 ***
## population -1.304e-04 4.878e-06 -26.737 < 2e-16 ***
## ocean_proximityINLAND -5.420e-01 5.398e-03 -100.404 < 2e-16 ***
## ocean_proximityNEAR BAY -1.689e-02 7.485e-03 -2.257 0.024 *
## ocean_proximityNEAR OCEAN 3.621e-02 6.848e-03 5.288 1.25e-07 ***
## median_income:housing_median_age 1.571e-03 9.606e-05 16.353 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2961 on 19623 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7093
## F-statistic: 5988 on 8 and 19623 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_log_no_out_vif_int)
From the result above, we can see that when we incldue interaction term,
p-value is very small, and R-squared also increased, means interaction
term is necessary. So, model_log_no_out_vif_int is our final model.
Interpretation of Coefficients:
Our model uses log(median_house_value) as the response. So, each coefficient represents an approximate percentage change in house value for a one-unit change in the predictor, when we holding other variables constant.
Intercept = 11.64 : If all predictors were at 0, the log of mean house value would be 11.64 in this area.
median_income = 0.1228: Each additional unit of median_income increases the log of mean house value by 0.1228 in this area, but we also need to consider interaction term.
housing_median_age = -0.002529: Each additional year of house age alone decreases the log of mean house value by about 0.0025.
total_rooms = 8.714e-05: Each additional unit of median_income increases the log of mean house value by 0000.8714 in this area.
population = -1.304e-04: Each additional unit of population decreases the log of mean house value by -000.1304 in this area.
ocean_proximityINLAND = -0.542: Being INLAND reduces the log house value by about 0.54% than 1H OCEAN
ocean_proximityNEAR BAY = -0.0169: NEAR BAY is 1.69% lower than <1H OCEAN.
ocean_proximityNEAR OCEAN = 0.03621: NEAR OCEAN is about 3.6% higher than the baseline <1H OCEAN.
In part 3, we find collinearity between total_rooms and households, Then we removed households, maybe we can creat a new variable: average_rooms = total_rooms / households to keep more information.
Maybe we can include more interaction terms.
Our model can only tell us which factors are related to median house prices in different areas of California. Our conclusions help us find a suitable area, not a specific house. In the future, we can do this.