Imagine yourself lying on a California beach, your children playing nearby and a chicken roasting in the kitchen behind you. What a beautiful life. Could you say no to a house in California? I guess you couldn't. Neither could I! But before buying or investing in a house, we should look at how selling prices have trended in recent years and how they differ across California counties.
The plots below include, but are not limited to: a ranking of prices across counties, the long-run price trend for California, regression diagnostics, and a comparison of the most promising counties.
library(rlang)     # Tidy evaluation tools used by the tidyverse
library(tidyverse) # Data manipulation and visualization
library(tseries)   # Time series / regression modeling functions
library(quantmod)  # Quantitative financial modelling functions
library(modelr)    # Easy pipeline modeling functions
library(broom)     # Tidies up model outputs
library(DT)        # Renders nicer HTML tables
library(readxl)    # Reads Excel files
library(leaps)     # Model selection functions
library(plotly)    # Interactive visualization
The data comes from “Zillow”. The dataset I chose includes housing price data for the US. (Data Source)
I will look at the housing price history of California as a whole from 1996 to 2017 for pattern analysis and forecasting.
I will also look at county-level data for California: the June 2017 snapshot to rank prices across counties, and the full price history to compare how much prices vary among counties.
# original data
county <- read_csv("1.csv")
state <- read_csv("2.csv")
## county: real estate price in different counties of CA
as_tibble(county)
## # A tibble: 1,223 x 262
## RegionID RegionName State Metro StateCodeFIPS
## <int> <chr> <chr> <chr> <chr>
## 1 3101 Los Angeles CA Los Angeles-Long Beach-Anaheim 06
## 2 139 Cook IL Chicago 17
## 3 2402 Maricopa AZ Phoenix 04
## 4 2841 San Diego CA San Diego 06
## 5 1286 Orange CA Los Angeles-Long Beach-Anaheim 06
## 6 581 Kings NY New York 36
## 7 2964 Miami-Dade FL Miami-Fort Lauderdale 12
## 8 978 Dallas TX Dallas-Fort Worth 48
## 9 1347 Queens NY New York 36
## 10 2832 Riverside CA Riverside 06
## # ... with 1,213 more rows, and 257 more variables:
## # MunicipalCodeFIPS <chr>, SizeRank <int>, `1996-04` <int>,
## # `1996-05` <int>, `1996-06` <int>, `1996-07` <int>, `1996-08` <int>,
## # `1996-09` <int>, `1996-10` <int>, `1996-11` <int>, `1996-12` <int>,
## # `1997-01` <int>, `1997-02` <int>, `1997-03` <int>, `1997-04` <int>,
## # `1997-05` <int>, `1997-06` <int>, `1997-07` <int>, `1997-08` <int>,
## # `1997-09` <int>, `1997-10` <int>, `1997-11` <int>, `1997-12` <int>,
## # `1998-01` <int>, `1998-02` <int>, `1998-03` <int>, `1998-04` <int>,
## # `1998-05` <int>, `1998-06` <int>, `1998-07` <int>, `1998-08` <int>,
## # `1998-09` <int>, `1998-10` <int>, `1998-11` <int>, `1998-12` <int>,
## # `1999-01` <int>, `1999-02` <int>, `1999-03` <int>, `1999-04` <int>,
## # `1999-05` <int>, `1999-06` <int>, `1999-07` <int>, `1999-08` <int>,
## # `1999-09` <int>, `1999-10` <int>, `1999-11` <int>, `1999-12` <int>,
## # `2000-01` <int>, `2000-02` <int>, `2000-03` <int>, `2000-04` <int>,
## # `2000-05` <int>, `2000-06` <int>, `2000-07` <int>, `2000-08` <int>,
## # `2000-09` <int>, `2000-10` <int>, `2000-11` <int>, `2000-12` <int>,
## # `2001-01` <int>, `2001-02` <int>, `2001-03` <int>, `2001-04` <int>,
## # `2001-05` <int>, `2001-06` <int>, `2001-07` <int>, `2001-08` <int>,
## # `2001-09` <int>, `2001-10` <int>, `2001-11` <int>, `2001-12` <int>,
## # `2002-01` <int>, `2002-02` <int>, `2002-03` <int>, `2002-04` <int>,
## # `2002-05` <int>, `2002-06` <int>, `2002-07` <int>, `2002-08` <int>,
## # `2002-09` <int>, `2002-10` <int>, `2002-11` <int>, `2002-12` <int>,
## # `2003-01` <int>, `2003-02` <int>, `2003-03` <int>, `2003-04` <int>,
## # `2003-05` <int>, `2003-06` <int>, `2003-07` <int>, `2003-08` <int>,
## # `2003-09` <int>, `2003-10` <int>, `2003-11` <int>, `2003-12` <int>,
## # `2004-01` <int>, `2004-02` <int>, `2004-03` <int>, `2004-04` <int>,
## # `2004-05` <int>, ...
## state: real estate price in different states of the U.S.
as_tibble(state)
## # A tibble: 49 x 258
## RegionID RegionName SizeRank `1996-04` `1996-05` `1996-06`
## <int> <chr> <int> <int> <int> <int>
## 1 9 California 1 157900 157800 157500
## 2 54 Texas 2 NA NA NA
## 3 43 New York 3 NA NA NA
## 4 14 Florida 4 86300 86600 86700
## 5 21 Illinois 5 113900 114400 114500
## 6 47 Pennsylvania 6 83300 83700 83800
## 7 44 Ohio 7 88100 88300 88500
## 8 30 Michigan 8 87600 87100 87300
## 9 16 Georgia 9 92000 92400 92600
## 10 36 North Carolina 10 91700 91900 92000
## # ... with 39 more rows, and 252 more variables: `1996-07` <int>,
## # `1996-08` <int>, `1996-09` <int>, `1996-10` <int>, `1996-11` <int>,
## # `1996-12` <int>, `1997-01` <int>, `1997-02` <int>, `1997-03` <int>,
## # `1997-04` <int>, `1997-05` <int>, `1997-06` <int>, `1997-07` <int>,
## # `1997-08` <int>, `1997-09` <int>, `1997-10` <int>, `1997-11` <int>,
## # `1997-12` <int>, `1998-01` <int>, `1998-02` <int>, `1998-03` <int>,
## # `1998-04` <int>, `1998-05` <int>, `1998-06` <int>, `1998-07` <int>,
## # `1998-08` <int>, `1998-09` <int>, `1998-10` <int>, `1998-11` <int>,
## # `1998-12` <int>, `1999-01` <int>, `1999-02` <int>, `1999-03` <int>,
## # `1999-04` <int>, `1999-05` <int>, `1999-06` <int>, `1999-07` <int>,
## # `1999-08` <int>, `1999-09` <int>, `1999-10` <int>, `1999-11` <int>,
## # `1999-12` <int>, `2000-01` <int>, `2000-02` <int>, `2000-03` <int>,
## # `2000-04` <int>, `2000-05` <int>, `2000-06` <int>, `2000-07` <int>,
## # `2000-08` <int>, `2000-09` <int>, `2000-10` <int>, `2000-11` <int>,
## # `2000-12` <int>, `2001-01` <int>, `2001-02` <int>, `2001-03` <int>,
## # `2001-04` <int>, `2001-05` <int>, `2001-06` <int>, `2001-07` <int>,
## # `2001-08` <int>, `2001-09` <int>, `2001-10` <int>, `2001-11` <int>,
## # `2001-12` <int>, `2002-01` <int>, `2002-02` <int>, `2002-03` <int>,
## # `2002-04` <int>, `2002-05` <int>, `2002-06` <int>, `2002-07` <int>,
## # `2002-08` <int>, `2002-09` <int>, `2002-10` <int>, `2002-11` <int>,
## # `2002-12` <int>, `2003-01` <int>, `2003-02` <int>, `2003-03` <int>,
## # `2003-04` <int>, `2003-05` <int>, `2003-06` <int>, `2003-07` <int>,
## # `2003-08` <int>, `2003-09` <int>, `2003-10` <int>, `2003-11` <int>,
## # `2003-12` <int>, `2004-01` <int>, `2004-02` <int>, `2004-03` <int>,
## # `2004-04` <int>, `2004-05` <int>, `2004-06` <int>, `2004-07` <int>,
## # `2004-08` <int>, `2004-09` <int>, `2004-10` <int>, ...
CAcounty <- filter(county, State == "CA") %>%
select(RegionName, `2017-06`) %>%
gather(Date, Price, -RegionName)
datatable(CAcounty)
CAtotal <- filter(state, RegionName == "California") %>%
select(RegionName, `1996-04`:`2017-06`) %>%
gather(Date, Price, -RegionName) %>%
separate(Date,c("Year","Month"),sep="-") %>%
group_by(RegionName,Year) %>%
summarise(AVG_Price=mean(Price))
datatable(CAtotal)
## data dictionary
variable_description <- c("The location of the real estate", "Year-Month when the price was collected", "Housing price in different counties of CA", "Average housing price in CA per year")
variable_name <- c("RegionName", "Date", "Price", "AVG_Price")
variable_type <- c("Character", "DateType", "Integer", "Numeric")
dict <- data.frame(variable_name, variable_description, variable_type)
datatable(dict, options = list(dom = 't'))
Welcome, welcome!!!
To wake you up, let's start with an overview of the latest housing price trend in California and a ranking of its counties. You may be surprised at how fast prices are actually going up. Although there was a drop from 2008 to 2012, due as we all know to the economic crisis, NOW IS A GOOD TIME to re-invest in the housing market!!!
# county data in 2017-06
ggplot(data = CAcounty, aes(x =reorder(RegionName,-Price), y = Price)) +
geom_bar(stat="identity",fill="#FF9999", colour="black") +
xlab("County") +
labs(title="Real Estate Price in Diffrent Counties of CA", subtitle="June 2007") +
theme(axis.text.x =
element_text(size = 10,
angle = 75,
hjust = 1,
vjust = 1))
# CA price trend, 1996-2017
ggplotly(ggplot(data = CAtotal, aes(x = Year, y = AVG_Price)) +
geom_jitter(color = "purple", size = 4, shape = 17, alpha = .5) +
xlab("Year") +
labs(title="Real Estate Price Trend in CA", subtitle="1996 - 2017") +
theme(axis.text.x =
element_text(size = 7,
angle = 75,
hjust = 1,
vjust = 1)))
As a potential buyer or investor, one of the most valuable things to know is where housing prices in California are headed.
In this section, my goal is to predict the average real estate price in CA.
I start with six candidate variables that may influence the price (1996/04-2017/06): the mortgage rate, income, the property crime rate, CPI, the unemployment rate, and the 3-month Treasury bill rate.
## independent variables
mortgage <- read_xls("15yr.xls")
incomeraw <- read_xls("income.xls")
crime <- read_csv("crime.csv")
# income
income <- incomeraw %>%
gather(Year,Income,-State) %>%
select(Year,Income)
# use monthly data (rather than yearly averages) for a more accurate model
CAtotalnew <- state %>%
filter(RegionName == "California") %>%
select(RegionName, `1996-04`:`2017-06`) %>%
gather(Date, Price, -RegionName)
# CPI (FRED: CPIAUCSL)
q <- getSymbols("CPIAUCSL", src = 'FRED')
CPI1 <- as.data.frame(CPIAUCSL)
CPI2 <- setNames(cbind(rownames(CPI1), CPI1, row.names = NULL), c("Date", "CPI"))
CPI <- CPI2[592:846, ]   # rows covering April 1996 - June 2017
CPI <- CPI %>%
separate(Date,c("Year","Month","Day")) %>%
select(Year,Month,CPI) %>%
unite(Date,Year,Month,sep="-")
# unemployment rate (FRED: UNRATE)
w <- getSymbols('UNRATE', src = 'FRED')
UNEM1 <- as.data.frame(UNRATE)
UNEM2 <- setNames(cbind(rownames(UNEM1), UNEM1, row.names = NULL), c("Date", "UNEM"))
UNEM <- UNEM2[580:834, ]   # rows covering April 1996 - June 2017
UNEM <- UNEM %>%
separate(Date,c("Year","Month","Day")) %>%
select(Year,Month,UNEM) %>%
unite(Date,Year,Month,sep="-")
# 3-month Treasury bill rate (FRED: TB3MS)
e <- getSymbols('TB3MS', src = 'FRED')
TB1 <- as.data.frame(TB3MS)
TB2 <- setNames(cbind(rownames(TB1), TB1, row.names = NULL), c("Date", "TB"))
TB <- TB2[748:1002, ]   # rows covering April 1996 - June 2017
TB <- TB %>%
separate(Date,c("Year","Month","Day")) %>%
select(Year,Month,TB) %>%
unite(Date,Year,Month,sep="-")
data <- cbind(CAtotalnew$Price,mortgage$MortgageRate, CPI$CPI, UNEM$UNEM, TB$TB, crime$`Property crime rate`,income$Income)
colnames(data) <- c("HousingPrice","MortgageRate","CPI","UNEM","TB","Crime","Income")
data <- data.frame(data)
datatable(data)
Let's first look at the following summary statistics for the regression based on ALL SIX variables. We can see that the p-values of four variables are well above 0.05. This tells us that this model is not quite optimal.
set.seed(123)
sample <- sample(c(TRUE,FALSE),nrow(data),replace=T, prob =c(0.6,0.4))
train <- data[sample,]
test <- data[!sample,]
a <- lm(HousingPrice~MortgageRate+CPI+UNEM+TB+Crime+Income, data=train)
summary(a)
##
## Call:
## lm(formula = HousingPrice ~ MortgageRate + CPI + UNEM + TB +
## Crime + Income, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93477 -43406 -2634 29517 144270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1317908.6 203015.2 -6.492 1.19e-09 ***
## MortgageRate 4269.8 11019.8 0.387 0.699
## CPI -552.4 1076.1 -0.513 0.608
## UNEM 3219.1 4919.0 0.654 0.514
## TB 8630.0 6332.5 1.363 0.175
## Crime 1691.0 254.3 6.650 5.21e-10 ***
## Income 293.3 36.1 8.126 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56600 on 149 degrees of freedom
## Multiple R-squared: 0.7571, Adjusted R-squared: 0.7473
## F-statistic: 77.39 on 6 and 149 DF, p-value: < 2.2e-16
To find our dream model, we need to select variables carefully, and cross-validation is a good tool for picking the predictors that actually help.
After running the cross-validation, we can see in the following graph that the five-variable model has the lowest mean error.
# cross validation
# predict() method for regsubsets objects: rebuild the design matrix for the
# new data and multiply by the coefficients of the model of the requested size
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(data), replace = TRUE)
cv_errors <- matrix(NA, k, 6, dimnames = list(NULL, paste(1:6)))
for(j in 1:k) {
# perform best subset on rows not equal to j
best_subset <- regsubsets(HousingPrice ~ ., data[folds != j, ], nvmax = 6)
# perform cross-validation
for( i in 1:6) {
pred_x <- predict.regsubsets(best_subset, data[folds == j, ], id = i)
cv_errors[j, i] <- mean((data$HousingPrice[folds == j] - pred_x)^2)
}
}
mean_cv_errors <- colMeans(cv_errors)
plot(mean_cv_errors, type = "b")
Therefore, five of the six variables are selected for our model. BRAVO!!!!!
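As a quick sanity check, one can also ask directly which model size minimizes the mean cross-validation error; this small snippet (not shown in the original output) should agree with the graph above.
# model size with the smallest mean CV error
which.min(mean_cv_errors)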
final_best <- regsubsets(HousingPrice ~ ., data = data , nvmax = 6)
coef(final_best, 5)
## (Intercept)          CPI         UNEM           TB        Crime
## -1249498.1538 -1373.6297 6728.1754 14245.3264 1524.9223
## Income
## 323.0906
Next, let's look at our new regression model with five variables. To better fit a linear model, we take the log of both Y and the X variables. You can see the new dataset below.
data2 <- data %>%
  mutate(
    logHousingPrice = log(HousingPrice),
    logTB = log(TB),
    logCrime = log(Crime),
    logIncome = log(Income),
    logCPI = log(CPI),
    logUNEM = log(UNEM)) %>%
  select(logHousingPrice, logCrime, logIncome, logCPI, logUNEM, logTB)
datatable(data2)
An important step in predictive modeling is to split the data into a training set (first table) and a testing set (second table), so that we can later run the test set through the model to examine its accuracy.
set.seed(123)
sample2 <- sample(c(TRUE,FALSE),nrow(data),replace=T, prob =c(0.6,0.4))
train2 <- data2[sample2,]
test2 <- data2[!sample2,]
datatable(train2)
datatable(test2)
Now, let's look at some summary statistics and graphs for assessing our model.
GOOD START: we have a very high R-squared (0.8639) and a low overall p-value here, which indicates that our model has good predictive power.
A <- lm(logHousingPrice~logTB+logIncome+logCrime+logUNEM+logCPI,data=train2)
summary(A)
##
## Call:
## lm(formula = logHousingPrice ~ logTB + logIncome + logCrime +
## logUNEM + logCPI, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.258342 -0.112457 0.000452 0.097814 0.300927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.02997 2.18554 -12.368 < 2e-16 ***
## logTB 0.06719 0.01352 4.971 1.80e-06 ***
## logIncome 3.32349 0.36154 9.193 2.99e-16 ***
## logCrime 1.42550 0.17762 8.025 2.70e-13 ***
## logUNEM 0.18331 0.06669 2.749 0.00672 **
## logCPI 0.70028 0.44306 1.581 0.11608
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1372 on 150 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8594
## F-statistic: 190.5 on 5 and 150 DF, p-value: < 2.2e-16
model1_results <- augment(A, train2) %>%
  mutate(Model = "Model 1")
SOME CONCERNS, TOO: the RESIDUALS VS FITTED plot curves to some extent; perhaps the relationship is not as linear as we assume.
However, in the SCALE-LOCATION plot the residuals seem randomly spread, which is a good sign. In the NORMAL Q-Q plot the dots lie very close to the line, which suggests that the residuals are approximately normally distributed.
p1 <- ggplot(model1_results, aes(.fitted, .std.resid)) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = TRUE) +
ggtitle("Residuals vs Fitted")
p2 <- ggplot(model1_results, aes(.fitted, sqrt(.std.resid))) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = FALSE) +
ggtitle("Scale-Location")
p1
p2
qq_plot <- qqnorm(model1_results$.resid)
qq_plot <- qqline(model1_results$.resid)
Overall, although the regression is not perfectly linear, the diagnostics provide fair evidence that it can be a good predictive tool.
YOU THINK THIS IS THE END? NO!!!!
We still need to check for collinearity, in case there are strong correlations between the independent variables. WHOOPS!!! It seems that CPI and Income are strongly related to each other. This makes sense, because when overall market prices are high, people also tend to earn more from sales.
#check collinearity; CPI and Income stand out
car::vif(A)
##     logTB logIncome  logCrime   logUNEM    logCPI
## 5.777137 21.793754 5.575374 2.725852 30.826293
To address the collinearity, we can drop one of the variables. (A common rule of thumb is that a VIF above 10 signals problematic collinearity, and both logIncome at 21.8 and logCPI at 30.8 exceed it.) I choose to drop CPI, because housing prices are part of the calculation of CPI itself, so CPI may not be a good factor for predicting housing prices.
Let's redo the model and hope everything goes well this time. Hmmm... Now we no longer have strong collinearity, and the R-squared and p-values still look great.
# remodel
final <- lm(logHousingPrice~logTB+logCrime+logUNEM+logIncome,data=train2)
car::vif(final)
##     logTB  logCrime   logUNEM logIncome
## 4.689666 5.480425 2.472607 4.285171
summary(final)
##
## Call:
## lm(formula = logHousingPrice ~ logTB + logCrime + logUNEM + logIncome,
## data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24799 -0.10844 -0.01003 0.10234 0.31542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.46458 2.17891 -12.605 < 2e-16 ***
## logTB 0.05792 0.01224 4.733 5.05e-06 ***
## logCrime 1.38887 0.17698 7.848 7.21e-13 ***
## logUNEM 0.21543 0.06383 3.375 0.000938 ***
## logIncome 3.83568 0.16111 23.808 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1379 on 151 degrees of freedom
## Multiple R-squared: 0.8617, Adjusted R-squared: 0.858
## F-statistic: 235.1 on 4 and 151 DF, p-value: < 2.2e-16
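Since both the response and the predictors are log-transformed, the coefficients can be read approximately as elasticities. As a rough, illustrative calculation based on the logIncome estimate above (not additional model output):
# approximate effect of a 1% increase in income on housing price,
# holding the other predictors fixed (illustrative)
b_income <- 3.83568
exp(b_income * log(1.01)) - 1   # ~0.039, i.e. roughly a 3.9% higher price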
The following graphs compare our final model with the previous one. There are not many differences; the main thing I notice is that the final model gets rid of some outliers.
model2_results <- augment(final, train2)%>%
mutate(Model = "Model 2") %>%
rbind(model1_results)
fitted <- ggplot(model2_results, aes(.fitted, .std.resid)) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = TRUE) +
facet_wrap(~ Model) +
ggtitle("Residuals vs Fitted")
fitted
scale <- ggplot(model2_results, aes(.fitted, sqrt(.std.resid))) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = FALSE) +
facet_wrap(~ Model) +
ggtitle("Scale-Location")
scale
# Left: model 1
qqnorm(model1_results$.resid); qqline(model1_results$.resid)
# Right: model 2
qqnorm(model2_results$.resid); qqline(model2_results$.resid)
Finally, let's look at the MSE of the testing and training sets. Both look pretty decent!!!!
# test MSE
test2 %>%
add_predictions(final) %>%
summarise(MSE = mean((logHousingPrice - pred)^2))
##          MSE
## 1 0.01814974
# training MSE
train2 %>%
add_predictions(final) %>%
summarise(MSE = mean((logHousingPrice - pred)^2))
##          MSE
## 1 0.01839574
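Because the model is fit on log prices, these MSE values are in squared log units. One rough way to interpret them (an illustrative sketch, not part of the original analysis) is to take the square root and exponentiate, giving an approximate typical relative error:
# approximate typical relative error implied by the test MSE (illustrative)
rmse_log <- sqrt(0.0181)    # about 0.135 in log-price units
exp(rmse_log) - 1           # about 0.14, i.e. predictions typically within ~14% of actual prices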
Now that we have a predictive model, it points to a healthy upward trend in future prices, because all of the predictor variables are themselves trending upward. A quick sketch of how the fitted model could produce such a forecast follows below.
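As a minimal, illustrative sketch of such a forecast (not part of the original analysis), one could feed assumed future values of the predictors into the final model. Here the most recent observed predictor values are reused as stand-ins; a real forecast would replace them with projected values of income, crime, unemployment, and the T-bill rate.
# illustrative only: predict log price from assumed predictor values
latest_x <- tail(data2, 1)[, c("logTB", "logCrime", "logUNEM", "logIncome")]
predicted_log_price <- predict(final, newdata = latest_x)
exp(predicted_log_price)    # back-transform from the log scale to dollars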
You may then ask: which specific locations in California should I buy or invest in?
To see which counties may be the best choice for investment, we can first calculate the variability of prices in each county over the full 1996-2017 period. Since we know that prices in CA are currently rising overall, a county with larger price variability has more upside potential for its real estate (though it is also riskier).
#check the price variability (standard deviation) of different county data
CAvar <- filter(county, State == "CA") %>%
select(RegionName, `1996-04`:`2017-06`) %>%
gather(Date, Price, -RegionName) %>%
group_by(RegionName)
stddev <- summarise(CAvar, std.dev=sd(Price)) %>%
arrange(desc(std.dev))
stddev
## # A tibble: 49 x 2
## RegionName std.dev
## <chr> <dbl>
## 1 San Francisco 241231.31
## 2 San Mateo 215458.79
## 3 Santa Clara 193109.92
## 4 Alameda 157422.81
## 5 Ventura 133254.85
## 6 Los Angeles 132823.62
## 7 Contra Costa 126434.70
## 8 Sonoma 122352.12
## 9 San Diego 121674.02
## 10 Solano 99087.32
## # ... with 39 more rows
Next, I pick the top five most promising counties and visualize their trends from 1996 to 2017. In the graph, we can see that housing prices in San Francisco, San Mateo, and Santa Clara are growing the fastest.
CAvar3 <- filter (CAvar, RegionName %in% c("San Francisco","San Mateo","Santa Clara","Alameda","Ventura")) %>%
separate(Date,c("Year","Month"),sep="-") %>%
group_by(RegionName, Year) %>%
summarise(AVG_Price=mean(Price))
ggplotly(ggplot(CAvar3, aes(Year, AVG_Price)) +
geom_jitter(aes(color=RegionName), width = .25, alpha = .5) +
xlab("Year") +
labs(title="Comparison of price trend in top 5 potential counties", subtitle="1996 - 2017") +
theme(axis.text.x =
element_text(size = 8,
angle = 75,
hjust = 1,
vjust = 1)))
Finally, we can visualize the final model's coefficient estimates and their confidence intervals:
coef <- tidy(final, conf.int = TRUE)
ggplot(coef, aes(term, estimate))+
geom_point()+
geom_pointrange(aes(ymin = conf.low, ymax = conf.high))+
labs(title = "Coefficients of multiple linear regression model")