Imagine yourself lying on a California beach, your children playing nearby and a chicken roasting in the kitchen behind you. What a beautiful life. Could you say no to a house in California? I guess you couldn't. Neither could I! But before buying or investing in a house, we should look at how selling prices have trended in recent years and how they differ across California counties.
The plots below include, but are not limited to: a ranking of prices across counties, the long-run price trend for California, regression diagnostics, and a comparison of the most promising counties.
library(rlang)     # Tidy evaluation tools used by the tidyverse
library(tidyverse) # Data manipulation and visualization
library(tseries)   # Time series / regression modeling functions
library(quantmod)  # Quantitative financial modelling functions
library(modelr)    # Easy pipeline modeling functions
library(broom)     # Tidies up model outputs
library(DT)        # Renders nicer HTML tables
library(readxl)    # Reads Excel files
library(leaps)     # Model selection functions
library(plotly)    # Interactive visualization
The data comes from “Zillow”. The dataset I chose includes housing price data for the US. (Data Source)
I will look at the housing price history of California as a whole from 1996 to 2017 for pattern analysis and forecasting.
I will also look at county-level data for California: the June 2017 snapshot to rank prices across counties, and the full price history to compare how much prices vary among counties.
# original data
county <- read_csv("1.csv")
state <- read_csv("2.csv")
## county: real estate price in different counties of CA
as_tibble(county)
## # A tibble: 1,223 x 262
## RegionID RegionName State Metro StateCodeFIPS
## <int> <chr> <chr> <chr> <chr>
## 1 3101 Los Angeles CA Los Angeles-Long Beach-Anaheim 06
## 2 139 Cook IL Chicago 17
## 3 2402 Maricopa AZ Phoenix 04
## 4 2841 San Diego CA San Diego 06
## 5 1286 Orange CA Los Angeles-Long Beach-Anaheim 06
## 6 581 Kings NY New York 36
## 7 2964 Miami-Dade FL Miami-Fort Lauderdale 12
## 8 978 Dallas TX Dallas-Fort Worth 48
## 9 1347 Queens NY New York 36
## 10 2832 Riverside CA Riverside 06
## # ... with 1,213 more rows, and 257 more variables:
## # MunicipalCodeFIPS <chr>, SizeRank <int>, `1996-04` <int>,
## # `1996-05` <int>, `1996-06` <int>, `1996-07` <int>, `1996-08` <int>,
## # `1996-09` <int>, `1996-10` <int>, `1996-11` <int>, `1996-12` <int>,
## # `1997-01` <int>, `1997-02` <int>, `1997-03` <int>, `1997-04` <int>,
## # `1997-05` <int>, `1997-06` <int>, `1997-07` <int>, `1997-08` <int>,
## # `1997-09` <int>, `1997-10` <int>, `1997-11` <int>, `1997-12` <int>,
## # `1998-01` <int>, `1998-02` <int>, `1998-03` <int>, `1998-04` <int>,
## # `1998-05` <int>, `1998-06` <int>, `1998-07` <int>, `1998-08` <int>,
## # `1998-09` <int>, `1998-10` <int>, `1998-11` <int>, `1998-12` <int>,
## # `1999-01` <int>, `1999-02` <int>, `1999-03` <int>, `1999-04` <int>,
## # `1999-05` <int>, `1999-06` <int>, `1999-07` <int>, `1999-08` <int>,
## # `1999-09` <int>, `1999-10` <int>, `1999-11` <int>, `1999-12` <int>,
## # `2000-01` <int>, `2000-02` <int>, `2000-03` <int>, `2000-04` <int>,
## # `2000-05` <int>, `2000-06` <int>, `2000-07` <int>, `2000-08` <int>,
## # `2000-09` <int>, `2000-10` <int>, `2000-11` <int>, `2000-12` <int>,
## # `2001-01` <int>, `2001-02` <int>, `2001-03` <int>, `2001-04` <int>,
## # `2001-05` <int>, `2001-06` <int>, `2001-07` <int>, `2001-08` <int>,
## # `2001-09` <int>, `2001-10` <int>, `2001-11` <int>, `2001-12` <int>,
## # `2002-01` <int>, `2002-02` <int>, `2002-03` <int>, `2002-04` <int>,
## # `2002-05` <int>, `2002-06` <int>, `2002-07` <int>, `2002-08` <int>,
## # `2002-09` <int>, `2002-10` <int>, `2002-11` <int>, `2002-12` <int>,
## # `2003-01` <int>, `2003-02` <int>, `2003-03` <int>, `2003-04` <int>,
## # `2003-05` <int>, `2003-06` <int>, `2003-07` <int>, `2003-08` <int>,
## # `2003-09` <int>, `2003-10` <int>, `2003-11` <int>, `2003-12` <int>,
## # `2004-01` <int>, `2004-02` <int>, `2004-03` <int>, `2004-04` <int>,
## # `2004-05` <int>, ...
## state: real estate price in different states of the U.S.
as_tibble(state)
## # A tibble: 49 x 258
## RegionID RegionName SizeRank `1996-04` `1996-05` `1996-06`
## <int> <chr> <int> <int> <int> <int>
## 1 9 California 1 157900 157800 157500
## 2 54 Texas 2 NA NA NA
## 3 43 New York 3 NA NA NA
## 4 14 Florida 4 86300 86600 86700
## 5 21 Illinois 5 113900 114400 114500
## 6 47 Pennsylvania 6 83300 83700 83800
## 7 44 Ohio 7 88100 88300 88500
## 8 30 Michigan 8 87600 87100 87300
## 9 16 Georgia 9 92000 92400 92600
## 10 36 North Carolina 10 91700 91900 92000
## # ... with 39 more rows, and 252 more variables: `1996-07` <int>,
## # `1996-08` <int>, `1996-09` <int>, `1996-10` <int>, `1996-11` <int>,
## # `1996-12` <int>, `1997-01` <int>, `1997-02` <int>, `1997-03` <int>,
## # `1997-04` <int>, `1997-05` <int>, `1997-06` <int>, `1997-07` <int>,
## # `1997-08` <int>, `1997-09` <int>, `1997-10` <int>, `1997-11` <int>,
## # `1997-12` <int>, `1998-01` <int>, `1998-02` <int>, `1998-03` <int>,
## # `1998-04` <int>, `1998-05` <int>, `1998-06` <int>, `1998-07` <int>,
## # `1998-08` <int>, `1998-09` <int>, `1998-10` <int>, `1998-11` <int>,
## # `1998-12` <int>, `1999-01` <int>, `1999-02` <int>, `1999-03` <int>,
## # `1999-04` <int>, `1999-05` <int>, `1999-06` <int>, `1999-07` <int>,
## # `1999-08` <int>, `1999-09` <int>, `1999-10` <int>, `1999-11` <int>,
## # `1999-12` <int>, `2000-01` <int>, `2000-02` <int>, `2000-03` <int>,
## # `2000-04` <int>, `2000-05` <int>, `2000-06` <int>, `2000-07` <int>,
## # `2000-08` <int>, `2000-09` <int>, `2000-10` <int>, `2000-11` <int>,
## # `2000-12` <int>, `2001-01` <int>, `2001-02` <int>, `2001-03` <int>,
## # `2001-04` <int>, `2001-05` <int>, `2001-06` <int>, `2001-07` <int>,
## # `2001-08` <int>, `2001-09` <int>, `2001-10` <int>, `2001-11` <int>,
## # `2001-12` <int>, `2002-01` <int>, `2002-02` <int>, `2002-03` <int>,
## # `2002-04` <int>, `2002-05` <int>, `2002-06` <int>, `2002-07` <int>,
## # `2002-08` <int>, `2002-09` <int>, `2002-10` <int>, `2002-11` <int>,
## # `2002-12` <int>, `2003-01` <int>, `2003-02` <int>, `2003-03` <int>,
## # `2003-04` <int>, `2003-05` <int>, `2003-06` <int>, `2003-07` <int>,
## # `2003-08` <int>, `2003-09` <int>, `2003-10` <int>, `2003-11` <int>,
## # `2003-12` <int>, `2004-01` <int>, `2004-02` <int>, `2004-03` <int>,
## # `2004-04` <int>, `2004-05` <int>, `2004-06` <int>, `2004-07` <int>,
## # `2004-08` <int>, `2004-09` <int>, `2004-10` <int>, ...
CAcounty <- filter(county, State == "CA") %>%
select(RegionName, `2017-06`) %>%
gather(Date, Price, -RegionName)
datatable(CAcounty)
CAtotal <- filter(state, RegionName == "California") %>%
select(RegionName, `1996-04`:`2017-06`) %>%
gather(Date, Price, -RegionName) %>%
separate(Date,c("Year","Month"),sep="-") %>%
group_by(RegionName,Year) %>%
summarise(AVG_Price=mean(Price))
datatable(CAtotal)
## data dictionary
variable_description <- c("The location of the real estate", "Year-Month when the price was collected", "Housing price in different counties of CA", "Average housing price in CA per year")
variable_name <- c("RegionName", "Date", "Price", "AVG_Price")
variable_type <- c("Character", "DateType", "Integer", "Numeric")
dict <- data.frame(variable_name, variable_description, variable_type)
datatable(dict, options = list(dom = 't'))
Welcome, welcome!!!
To wake you up, let's start with an overview of the latest housing price trend in California and a ranking of its counties. You may be surprised at how fast prices are actually going up. Although there was a drop from 2008 to 2012, due as we all know to the economic crisis, NOW IS A GOOD TIME to re-invest in the housing market!!!
# county data in 2017-06
ggplot(data = CAcounty, aes(x =reorder(RegionName,-Price), y = Price)) +
geom_bar(stat="identity",fill="#FF9999", colour="black") +
xlab("County") +
labs(title="Real Estate Price in Diffrent Counties of CA", subtitle="June 2007") +
theme(axis.text.x =
element_text(size = 10,
angle = 75,
hjust = 1,
vjust = 1))
# CA price trend, 1996-2017
ggplotly(ggplot(data = CAtotal, aes(x = Year, y = AVG_Price)) +
geom_jitter(color = "purple", size = 4, shape = 17, alpha = .5) +
xlab("Year") +
labs(title="Real Estate Price Trend in CA", subtitle="1996 - 2017") +
theme(axis.text.x =
element_text(size = 7,
angle = 75,
hjust = 1,
vjust = 1)))
As a potential buyer or investor, one of the most valuable things to know is where housing prices in California are headed.
In this section, my goal is to predict the average real estate price in CA.
I start with six candidate variables that may influence the price (1996/04-2017/06): the mortgage rate, income, the property crime rate, CPI, the unemployment rate, and the 3-month Treasury bill rate.
## independent variables
mortgage <- read_xls("15yr.xls")
incomeraw <- read_xls("income.xls")
crime <- read_csv("crime.csv")
# income
income <- incomeraw %>%
gather(Year,Income,-State) %>%
select(Year,Income)
# use monthly data (rather than yearly averages) for a more accurate model
CAtotalnew <- state %>%
filter(RegionName == "California") %>%
select(RegionName, `1996-04`:`2017-06`) %>%
gather(Date, Price, -RegionName)
# CPI (FRED: CPIAUCSL)
q <- getSymbols("CPIAUCSL", src = 'FRED')
CPI1 <- as.data.frame(CPIAUCSL)
CPI2 <- setNames(cbind(rownames(CPI1), CPI1, row.names = NULL), c("Date", "CPI"))
CPI <- CPI2[592:846, ]   # rows covering April 1996 - June 2017
CPI <- CPI %>%
separate(Date,c("Year","Month","Day")) %>%
select(Year,Month,CPI) %>%
unite(Date,Year,Month,sep="-")
# unemployment rate (FRED: UNRATE)
w <- getSymbols('UNRATE', src = 'FRED')
UNEM1 <- as.data.frame(UNRATE)
UNEM2 <- setNames(cbind(rownames(UNEM1), UNEM1, row.names = NULL), c("Date", "UNEM"))
UNEM <- UNEM2[580:834, ]   # rows covering April 1996 - June 2017
UNEM <- UNEM %>%
separate(Date,c("Year","Month","Day")) %>%
select(Year,Month,UNEM) %>%
unite(Date,Year,Month,sep="-")
# 3-month Treasury bill rate (FRED: TB3MS)
e <- getSymbols('TB3MS', src = 'FRED')
TB1 <- as.data.frame(TB3MS)
TB2 <- setNames(cbind(rownames(TB1), TB1, row.names = NULL), c("Date", "TB"))
TB <- TB2[748:1002, ]   # rows covering April 1996 - June 2017
TB <- TB %>%
separate(Date,c("Year","Month","Day")) %>%
select(Year,Month,TB) %>%
unite(Date,Year,Month,sep="-")
data <- cbind(CAtotalnew$Price,mortgage$MortgageRate, CPI$CPI, UNEM$UNEM, TB$TB, crime$`Property crime rate`,income$Income)
colnames(data) <- c("HousingPrice","MortgageRate","CPI","UNEM","TB","Crime","Income")
data <- data.frame(data)
datatable(data)
Let's first look at the following summary statistics for the regression based on ALL SIX variables. We can see that the p-values of four variables are well above 0.05. This tells us that this model is not quite optimal.
set.seed(123)
sample <- sample(c(TRUE,FALSE),nrow(data),replace=T, prob =c(0.6,0.4))
train <- data[sample,]
test <- data[!sample,]
a <- lm(HousingPrice~MortgageRate+CPI+UNEM+TB+Crime+Income, data=train)
summary(a)
##
## Call:
## lm(formula = HousingPrice ~ MortgageRate + CPI + UNEM + TB +
## Crime + Income, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93477 -43406 -2634 29517 144270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1317908.6 203015.2 -6.492 1.19e-09 ***
## MortgageRate 4269.8 11019.8 0.387 0.699
## CPI -552.4 1076.1 -0.513 0.608
## UNEM 3219.1 4919.0 0.654 0.514
## TB 8630.0 6332.5 1.363 0.175
## Crime 1691.0 254.3 6.650 5.21e-10 ***
## Income 293.3 36.1 8.126 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56600 on 149 degrees of freedom
## Multiple R-squared: 0.7571, Adjusted R-squared: 0.7473
## F-statistic: 77.39 on 6 and 149 DF, p-value: < 2.2e-16
To find our dream model, we need to select variables carefully, and cross-validation is a good tool for picking the predictors that actually help.
After running the cross-validation, we can see in the following graph that the five-variable model has the lowest mean error.
# cross validation
# predict() method for regsubsets objects: rebuild the design matrix for the
# new data and multiply by the coefficients of the model of the requested size
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(data), replace = TRUE)
cv_errors <- matrix(NA, k, 6, dimnames = list(NULL, paste(1:6)))
for(j in 1:k) {
# perform best subset on rows not equal to j
best_subset <- regsubsets(HousingPrice ~ ., data[folds != j, ], nvmax = 6)
# perform cross-validation
for( i in 1:6) {
pred_x <- predict.regsubsets(best_subset, data[folds == j, ], id = i)
cv_errors[j, i] <- mean((data$HousingPrice[folds == j] - pred_x)^2)
}
}
mean_cv_errors <- colMeans(cv_errors)
plot(mean_cv_errors, type = "b")
Therefore, five of the six variables are selected for our model. BRAVO!!!!!
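As a quick sanity check, one can also ask directly which model size minimizes the mean cross-validation error; this small snippet (not shown in the original output) should agree with the graph above.
# model size with the smallest mean CV error
which.min(mean_cv_errors)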
final_best <- regsubsets(HousingPrice ~ ., data = data , nvmax = 6)
coef(final_best, 5)
## (Intercept)          CPI         UNEM           TB        Crime
## -1249498.1538 -1373.6297 6728.1754 14245.3264 1524.9223
## Income
## 323.0906
Next, let's look at our new regression model with five variables. To better fit a linear model, we take the log of both Y and the X variables. You can see the new dataset below.
data2 <- data %>%
  mutate(
    logHousingPrice = log(HousingPrice),
    logTB = log(TB),
    logCrime = log(Crime),
    logIncome = log(Income),
    logCPI = log(CPI),
    logUNEM = log(UNEM)) %>%
  select(logHousingPrice, logCrime, logIncome, logCPI, logUNEM, logTB)
datatable(data2)
An important step in predictive modeling is to split the data into a training set (first table) and a testing set (second table), so that we can later run the test set through the model to examine its accuracy.
set.seed(123)
sample2 <- sample(c(TRUE,FALSE),nrow(data),replace=T, prob =c(0.6,0.4))
train2 <- data2[sample2,]
test2 <- data2[!sample2,]
datatable(train2)
datatable(test2)
Now, let's look at some summary statistics and graphs for assessing our model.
GOOD START: we have a very high R-squared (0.8639) and a low overall p-value here, which indicates that our model has good predictive power.
A <- lm(logHousingPrice~logTB+logIncome+logCrime+logUNEM+logCPI,data=train2)
summary(A)
##
## Call:
## lm(formula = logHousingPrice ~ logTB + logIncome + logCrime +
## logUNEM + logCPI, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.258342 -0.112457 0.000452 0.097814 0.300927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.02997 2.18554 -12.368 < 2e-16 ***
## logTB 0.06719 0.01352 4.971 1.80e-06 ***
## logIncome 3.32349 0.36154 9.193 2.99e-16 ***
## logCrime 1.42550 0.17762 8.025 2.70e-13 ***
## logUNEM 0.18331 0.06669 2.749 0.00672 **
## logCPI 0.70028 0.44306 1.581 0.11608
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1372 on 150 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8594
## F-statistic: 190.5 on 5 and 150 DF, p-value: < 2.2e-16
model1_results <- augment(A, train2) %>%
  mutate(Model = "Model 1")
SOME CONCERNS, TOO: the RESIDUALS VS FITTED plot curves to some extent; perhaps the relationship is not as linear as we assume.
However, in the SCALE-LOCATION plot the residuals seem randomly spread, which is a good sign. In the NORMAL Q-Q plot the dots lie very close to the line, which suggests that the residuals are approximately normally distributed.
p1 <- ggplot(model1_results, aes(.fitted, .std.resid)) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = TRUE) +
ggtitle("Residuals vs Fitted")
p2 <- ggplot(model1_results, aes(.fitted, sqrt(.std.resid))) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = FALSE) +
ggtitle("Scale-Location")
p1
p2
qq_plot <- qqnorm(model1_results$.resid)
qq_plot <- qqline(model1_results$.resid)
Overall, although the regression is not perfectly linear, the diagnostics provide fair evidence that it can be a good predictive tool.
YOU THINK THIS IS THE END? NO!!!!
We still need to check for collinearity, in case there are strong correlations between the independent variables. WHOOPS!!! It seems that CPI and Income are strongly related to each other. This makes sense, because when overall market prices are high, people also tend to earn more from sales.
#check collinearity; CPI and Income stand out
car::vif(A)
##     logTB logIncome  logCrime   logUNEM    logCPI
## 5.777137 21.793754 5.575374 2.725852 30.826293
To address the collinearity, we can drop one of the variables. (A common rule of thumb is that a VIF above 10 signals problematic collinearity, and both logIncome at 21.8 and logCPI at 30.8 exceed it.) I choose to drop CPI, because housing prices are part of the calculation of CPI itself, so CPI may not be a good factor for predicting housing prices.
Let's redo the model and hope everything goes well this time. Hmmm... Now we no longer have strong collinearity, and the R-squared and p-values still look great.
# remodel
final <- lm(logHousingPrice~logTB+logCrime+logUNEM+logIncome,data=train2)
car::vif(final)
##     logTB  logCrime   logUNEM logIncome
## 4.689666 5.480425 2.472607 4.285171
summary(final)
##
## Call:
## lm(formula = logHousingPrice ~ logTB + logCrime + logUNEM + logIncome,
## data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24799 -0.10844 -0.01003 0.10234 0.31542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.46458 2.17891 -12.605 < 2e-16 ***
## logTB 0.05792 0.01224 4.733 5.05e-06 ***
## logCrime 1.38887 0.17698 7.848 7.21e-13 ***
## logUNEM 0.21543 0.06383 3.375 0.000938 ***
## logIncome 3.83568 0.16111 23.808 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1379 on 151 degrees of freedom
## Multiple R-squared: 0.8617, Adjusted R-squared: 0.858
## F-statistic: 235.1 on 4 and 151 DF, p-value: < 2.2e-16
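Since both the response and the predictors are log-transformed, the coefficients can be read approximately as elasticities. As a rough, illustrative calculation based on the logIncome estimate above (not additional model output):
# approximate effect of a 1% increase in income on housing price,
# holding the other predictors fixed (illustrative)
b_income <- 3.83568
exp(b_income * log(1.01)) - 1   # ~0.039, i.e. roughly a 3.9% higher price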
The following graphs compare our final model with the previous one. There are not many differences; the main thing I notice is that the final model gets rid of some outliers.
model2_results <- augment(final, train2)%>%
mutate(Model = "Model 2") %>%
rbind(model1_results)
fitted <- ggplot(model2_results, aes(.fitted, .std.resid)) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = TRUE) +
facet_wrap(~ Model) +
ggtitle("Residuals vs Fitted")
fitted
scale <- ggplot(model2_results, aes(.fitted, sqrt(.std.resid))) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = FALSE) +
facet_wrap(~ Model) +
ggtitle("Scale-Location")
scale
# Left: model 1
qqnorm(model1_results$.resid); qqline(model1_results$.resid)
# Right: model 2
qqnorm(model2_results$.resid); qqline(model2_results$.resid)
Finally, let's look at the MSE of the testing and training sets. Both look pretty decent!!!!
# test MSE
test2 %>%
add_predictions(final) %>%
summarise(MSE = mean((logHousingPrice - pred)^2))
##          MSE
## 1 0.01814974
# training MSE
train2 %>%
add_predictions(final) %>%
summarise(MSE = mean((logHousingPrice - pred)^2))
##          MSE
## 1 0.01839574
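Because the model is fit on log prices, these MSE values are in squared log units. One rough way to interpret them (an illustrative sketch, not part of the original analysis) is to take the square root and exponentiate, giving an approximate typical relative error:
# approximate typical relative error implied by the test MSE (illustrative)
rmse_log <- sqrt(0.0181)    # about 0.135 in log-price units
exp(rmse_log) - 1           # about 0.14, i.e. predictions typically within ~14% of actual prices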
Now that we have a predictive model, it points to a healthy upward trend in future prices, because all of the predictor variables are themselves trending upward. A quick sketch of how the fitted model could produce such a forecast follows below.
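As a minimal, illustrative sketch of such a forecast (not part of the original analysis), one could feed assumed future values of the predictors into the final model. Here the most recent observed predictor values are reused as stand-ins; a real forecast would replace them with projected values of income, crime, unemployment, and the T-bill rate.
# illustrative only: predict log price from assumed predictor values
latest_x <- tail(data2, 1)[, c("logTB", "logCrime", "logUNEM", "logIncome")]
predicted_log_price <- predict(final, newdata = latest_x)
exp(predicted_log_price)    # back-transform from the log scale to dollars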
You may then ask: which specific locations in California should I buy or invest in?
To see which counties may be the best choice for investment, we can first calculate the variability of prices in each county over the full 1996-2017 period. Since we know that prices in CA are currently rising overall, a county with larger price variability has more upside potential for its real estate (though it is also riskier).
#check the price variability (standard deviation) of different county data
CAvar <- filter(county, State == "CA") %>%
select(RegionName, `1996-04`:`2017-06`) %>%
gather(Date, Price, -RegionName) %>%
group_by(RegionName)
stddev <- summarise(CAvar, std.dev=sd(Price)) %>%
arrange(desc(std.dev))
stddev
## # A tibble: 49 x 2
## RegionName std.dev
## <chr> <dbl>
## 1 San Francisco 241231.31
## 2 San Mateo 215458.79
## 3 Santa Clara 193109.92
## 4 Alameda 157422.81
## 5 Ventura 133254.85
## 6 Los Angeles 132823.62
## 7 Contra Costa 126434.70
## 8 Sonoma 122352.12
## 9 San Diego 121674.02
## 10 Solano 99087.32
## # ... with 39 more rows
Next, I pick the top five most promising counties and visualize their trends from 1996 to 2017. In the graph, we can see that housing prices in San Francisco, San Mateo, and Santa Clara are growing the fastest.
CAvar3 <- filter (CAvar, RegionName %in% c("San Francisco","San Mateo","Santa Clara","Alameda","Ventura")) %>%
separate(Date,c("Year","Month"),sep="-") %>%
group_by(RegionName, Year) %>%
summarise(AVG_Price=mean(Price))
ggplotly(ggplot(CAvar3, aes(Year, AVG_Price)) +
geom_jitter(aes(color=RegionName), width = .25, alpha = .5) +
xlab("Year") +
labs(title="Comparison of price trend in top 5 potential counties", subtitle="1996 - 2017") +
theme(axis.text.x =
element_text(size = 8,
angle = 75,
hjust = 1,
vjust = 1)))
Finally, we can visualize the final model's coefficient estimates and their confidence intervals:
coef <- tidy(final, conf.int = TRUE)
ggplot(coef, aes(term, estimate))+
geom_point()+
geom_pointrange(aes(ymin = conf.low, ymax = conf.high))+
labs(title = "Coefficients of multiple linear regression model")