library(rlang)
library(tidyverse) # Data manipulation and visualization
library(tseries) # Provides time series/regression modeling functions
library(quantmod) # Quantitative Financial Modelling functions
library(modelr) # provides easy pipeline modeling functions
library(broom) # Helps to tidy up model outputs
library(DT) # Helps beutify html
# original data
county <- read_csv("1.csv")
state <- read_csv("2.csv")
## county: real estate price in diffrent counties of CA
datatable(county)
## state: real estate price in diffrent states of U.S
datatable(state)
# data only in CA
CAcounty <- filter(county,State=="CA")
CAtotal <- filter(state,RegionName=="California")
# select late 9-year data in CA
CAtotal2 <- select(CAtotal, RegionName, contains("2009"),contains("2010"),contains("2011"),contains("2012"),contains("2013"),contains("2014"),contains("2015"),contains("2016"),contains("2017"))
# select county price in 2017-06
CAcounty2 <- select(CAcounty, RegionName, "2017-06")
# tidy data
CAcounty3 <- CAcounty2 %>%
gather(Date,Price,-RegionName)
CAtotal3 <- CAtotal2 %>%
gather(Date,Price,-RegionName)
datatable(CAcounty3)
datatable(CAtotal3)
## variable_name
## RegionName
## Date
## Price
## variable_discription
## The location of the real estate
## Year-Month of the price collected
## Average Price calculated for the whole residentail real estate
## variable_type
## Character
## DateType
## Integer
To see which counties may be a best choice to invest besides living, firstly calculate the variance of the prices through the latest nine years in different counties. Since we know that the total price in CA is increasing currently, the Larger variance in a county, more potential the price of real estate (risky though).
## # A tibble: 49 x 2
## RegionName std.dev
## <chr> <dbl>
## 1 San Francisco 188183.37
## 2 San Mateo 178454.61
## 3 Santa Clara 157825.84
## 4 Alameda 125354.85
## 5 Marin 120428.02
## 6 Santa Cruz 89551.91
## 7 Contra Costa 84474.95
## 8 Napa 83575.61
## 9 Sonoma 81314.65
## 10 Orange 75376.12
## # ... with 39 more rows
Next, I choose top five potential counties and visualize their trends in the latest nine years.
In this section, I try to predict the average of the real estate price in CA. I firstly assume some economic variables that may influence the price:
By running the statistical summary and fitting function, we can see that the p-value of the exsiting varaibles are all below 0.05, which indicate that they all have correlation to the housing price.
summary(A)
##
## Call:
## lm(formula = CAtotal3$Price ~ TB + UNEM + CPI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43143 -8084 2233 10550 31398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1241365.4 133681.4 9.286 4.29e-15 ***
## TB 62247.6 10940.4 5.690 1.32e-07 ***
## UNEM -40106.8 2623.4 -15.288 < 2e-16 ***
## CPI -2506.8 502.9 -4.985 2.67e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17990 on 98 degrees of freedom
## Multiple R-squared: 0.9182, Adjusted R-squared: 0.9157
## F-statistic: 366.7 on 3 and 98 DF, p-value: < 2.2e-16
fit <- step(A)
## Start: AIC=2002.68
## CAtotal3$Price ~ TB + UNEM + CPI
##
## Df Sum of Sq RSS AIC
## <none> 3.1733e+10 2002.7
## - CPI 1 8.0470e+09 3.9780e+10 2023.7
## - TB 1 1.0482e+10 4.2215e+10 2029.8
## - UNEM 1 7.5681e+10 1.0741e+11 2125.1
However, the acf plots show some very interesting patterns that need further analysis here.
acf(fit$residuals)
pacf(fit$residuals)
TBC