Deliverable3

Packages

library(rlang)
library(tidyverse)  # Data manipulation and visualization
library(tseries)    # Provides time series/regression modeling functions
library(quantmod)   # Quantitative Financial Modelling functions
library(modelr)     # provides easy pipeline modeling functions
library(broom)      # Helps to tidy up model outputs
library(DT)         # Helps beutify html

Data Preparation

Load Data

# original data
county <- read_csv("1.csv")
state <- read_csv("2.csv")

## county: real estate price in diffrent counties of CA
datatable(county)

## state: real estate price in diffrent states of U.S
datatable(state)

Tiday Data

# data only in CA
CAcounty <- filter(county,State=="CA")
CAtotal <- filter(state,RegionName=="California")
# select late 9-year data in CA
CAtotal2 <- select(CAtotal, RegionName, contains("2009"),contains("2010"),contains("2011"),contains("2012"),contains("2013"),contains("2014"),contains("2015"),contains("2016"),contains("2017"))
# select county price in 2017-06
CAcounty2 <- select(CAcounty, RegionName, "2017-06") 
# tidy data
CAcounty3 <- CAcounty2 %>% 
  gather(Date,Price,-RegionName)

CAtotal3 <- CAtotal2 %>% 
  gather(Date,Price,-RegionName)

datatable(CAcounty3)

datatable(CAtotal3)

data dictionary

##  variable_name
##  RegionName   
##  Date         
##  Price        
##  variable_discription                                          
##  The location of the real estate                               
##  Year-Month of the price collected                             
##  Average Price calculated for the whole residentail real estate
##  variable_type
##  Character    
##  DateType     
##  Integer

Data Analysis

Ranks and Trends

Invensting in CA

To see which counties may be a best choice to invest besides living, firstly calculate the variance of the prices through the latest nine years in different counties. Since we know that the total price in CA is increasing currently, the Larger variance in a county, more potential the price of real estate (risky though).

## # A tibble: 49 x 2
##       RegionName   std.dev
##            <chr>     <dbl>
##  1 San Francisco 188183.37
##  2     San Mateo 178454.61
##  3   Santa Clara 157825.84
##  4       Alameda 125354.85
##  5         Marin 120428.02
##  6    Santa Cruz  89551.91
##  7  Contra Costa  84474.95
##  8          Napa  83575.61
##  9        Sonoma  81314.65
## 10        Orange  75376.12
## # ... with 39 more rows

Next, I choose top five potential counties and visualize their trends in the latest nine years.

Forecasting in CA

In this section, I try to predict the average of the real estate price in CA. I firstly assume some economic variables that may influence the price:

CPI
Unemployment Rate
Tbill
more to be assumed

By running the statistical summary and fitting function, we can see that the p-value of the exsiting varaibles are all below 0.05, which indicate that they all have correlation to the housing price.

summary(A)

## 
## Call:
## lm(formula = CAtotal3$Price ~ TB + UNEM + CPI)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43143  -8084   2233  10550  31398 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1241365.4   133681.4   9.286 4.29e-15 ***
## TB            62247.6    10940.4   5.690 1.32e-07 ***
## UNEM         -40106.8     2623.4 -15.288  < 2e-16 ***
## CPI           -2506.8      502.9  -4.985 2.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17990 on 98 degrees of freedom
## Multiple R-squared:  0.9182, Adjusted R-squared:  0.9157 
## F-statistic: 366.7 on 3 and 98 DF,  p-value: < 2.2e-16

fit <- step(A)

## Start:  AIC=2002.68
## CAtotal3$Price ~ TB + UNEM + CPI
## 
##        Df  Sum of Sq        RSS    AIC
## <none>               3.1733e+10 2002.7
## - CPI   1 8.0470e+09 3.9780e+10 2023.7
## - TB    1 1.0482e+10 4.2215e+10 2029.8
## - UNEM  1 7.5681e+10 1.0741e+11 2125.1

However, the acf plots show some very interesting patterns that need further analysis here.

acf(fit$residuals)

pacf(fit$residuals)

Summary

TBC