In this project, we intend to cover the following topics as part of our analysis. The house price index transitions across all states are generally correlated; since we have house price data for multiple years across states, we can use it to identify trends in the house price index over time and then identify states that are highly correlated.
Whenever there is a drop in that correlation, it can give rise to investment opportunities.
Once trends and investment opportunities have been identified for past values, a similar analysis can be done on the forecasted values, giving us advance notice of suitable investment opportunities.
Let's load a few required libraries for data processing and fitting machine learning models.
library(tibble)
library(knitr)
library(dplyr)
library(tidyverse)
library(reshape)
library(lubridate)
library(zoo)
library(forecast)
library(plotly)
library(ggfortify)
library(timeSeries)
library(RColorBrewer)
library(corrplot)
library(caret)
library(gbm)
library(h2o)
h2o.init(nthreads = -1)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /var/folders/0y/khcr0pd55bx60q1zzqy4d0wh0000gn/T//RtmpcuCoG5/h2o_lalithab_started_from_r.out
## /var/folders/0y/khcr0pd55bx60q1zzqy4d0wh0000gn/T//RtmpcuCoG5/h2o_lalithab_started_from_r.err
##
##
## Starting H2O JVM and connecting: .. Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 360 milliseconds
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.22.1.1
## H2O cluster version age: 4 months and 3 days !!!
## H2O cluster name: H2O_started_from_R_lalithab_lzn763
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.78 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.2 (2018-12-20)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (4 months and 3 days)!
## Please download and install the latest version from http://h2o.ai/download/
### Reading the input data
zill <- as_tibble(read.csv("State_MedianValuePerSqft_AllHomes.csv", stringsAsFactors = FALSE))
df <- read.csv('./ZillowApi_MasterData.csv')
#### Dimension of the dataset for time series analysis
dim(zill)
## [1] 50 279
#### Dimension of the dataset for house price predictions
dim(df)
## [1] 38021 21
#### Basic data manipulation and null validations
We'll follow tidy data principles to clean and reshape the data so that it better fits the algorithms.
names(zill)[4:279]<-gsub("X","",colnames(zill[,4:279]))
sum(is.na(zill))
## [1] 107
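Since there are 107 missing values, it can be useful to see which states they belong to. A minimal sketch (not part of the original analysis), using base R on the renamed columns:
# Count missing values per state to see which regions have an incomplete price history
na_by_state <- rowSums(is.na(zill[, 4:279]))
zill$RegionName[na_by_state > 0]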
We'll create a data frame with the required columns: date and median.
zill_df <- zill %>% gather(date, median, 4:279, na.rm = TRUE)
zill_df <- arrange(zill_df, date)
Converting the entire dataset to a time series object will allow us to perform time series analysis on the data. We'll leverage the zoo library to convert the date column to a Date datatype.
zill_df$date <- gsub("[:.:]", "-", zill_df$date)
zill_df$date <- as.Date(as.yearmon(zill_df$date))
head(zill_df)
## # A tibble: 6 x 5
## RegionID RegionName SizeRank date median
## <int> <chr> <int> <date> <int>
## 1 9 California 1 1996-04-01 103
## 2 54 Texas 2 1996-04-01 53
## 3 43 New York 3 1996-04-01 72
## 4 14 Florida 4 1996-04-01 55
## 5 21 Illinois 5 1996-04-01 86
## 6 47 Pennsylvania 6 1996-04-01 57
Filtering the data to create the time series object and checking for nulls that might have been introduced along the way.
zill_ts_df <- zill_df %>% select(RegionName,date, median)
sum(is.na(zill_ts_df))
## [1] 0
Let's get into the exploratory data analysis.
The plot below showcases the transition of the median value per square foot across all the states in the USA. From it we can see that house prices per square foot across the states track each other and are highly correlated. There are strong trends that coincide with known major events that impact housing prices in general. Past trends can be used to forecast future values by incorporating the trend and seasonality of the data, and thereby to identify suitable investment opportunities.
pg <- plot_ly(zill_df, x = ~date, y = ~median, color = ~RegionName, type = 'scatter', mode = 'lines')
pg
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Next, we'll visualize a correlation matrix. This will help us identify how correlated prices are across the states.
zill_allstate <- spread(zill_ts_df, RegionName, median)
#check for nulls
sum(is.na(zill_allstate))
## [1] 107
#log transformation of all 50 state columns (columns 2 through 51)
zill_allstate[,2:51] <- log(zill_allstate[,2:51]+1)
head(zill_allstate)
## # A tibble: 6 x 51
## date Alabama Alaska Arizona Arkansas California Colorado
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1996-04-01 3.81 4.28 4.13 3.69 4.64 4.47
## 2 1996-05-01 3.81 4.29 4.13 3.69 4.64 4.47
## 3 1996-06-01 3.81 4.29 4.13 3.69 4.64 4.47
## 4 1996-07-01 3.81 4.29 4.13 3.69 4.64 4.47
## 5 1996-08-01 3.81 4.29 4.13 3.69 4.64 4.47
## 6 1996-09-01 3.83 4.30 4.14 3.66 4.64 4.47
## # … with 44 more variables: Connecticut <dbl>, Delaware <dbl>, `District
## # of Columbia` <dbl>, Florida <dbl>, Georgia <dbl>, Hawaii <dbl>,
## # Idaho <dbl>, Illinois <dbl>, Indiana <dbl>, Iowa <dbl>, Kansas <dbl>,
## # Kentucky <dbl>, Louisiana <dbl>, Maine <dbl>, Maryland <dbl>,
## # Massachusetts <dbl>, Michigan <dbl>, Minnesota <dbl>,
## # Mississippi <dbl>, Missouri <dbl>, Montana <dbl>, Nebraska <dbl>,
## # Nevada <dbl>, `New Hampshire` <dbl>, `New Jersey` <dbl>, `New
## # Mexico` <dbl>, `New York` <dbl>, `North Carolina` <dbl>, `North
## # Dakota` <dbl>, Ohio <dbl>, Oklahoma <dbl>, Oregon <dbl>,
## # Pennsylvania <dbl>, `Rhode Island` <dbl>, `South Carolina` <dbl>,
## # `South Dakota` <dbl>, Tennessee <dbl>, Texas <dbl>, Utah <dbl>,
## # Virginia <dbl>, Washington <dbl>, `West Virginia` <dbl>,
## # Wisconsin <dbl>, Wyoming <int>
#drop nulls
zill_allstate <- zill_allstate %>% drop_na()
sum(is.na(zill_allstate))
## [1] 0
cor <- cor(zill_allstate[,-1])
corrplot(cor, method="shade",
tl.col = "black", tl.srt = 45, tl.cex = 0.75,cl.cex = 0.75,
sig.level = 0.01, insig = "pch", pch.cex = 0.8
)
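Since one of the stated goals is to identify highly correlated states, the matrix can also be summarised numerically. A minimal sketch (not part of the original analysis) that melts the correlation matrix and lists the strongest pairs; each pair appears twice because the matrix is symmetric:
# Melt the correlation matrix to long form and list the top correlated state pairs
cor_pairs <- as.data.frame(as.table(cor))
cor_pairs <- cor_pairs[cor_pairs$Var1 != cor_pairs$Var2, ]
head(cor_pairs[order(-cor_pairs$Freq), ], 10)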
Filtering the data further to extract the time series transition for Georgia
zillow_ga <- zill_ts_df[zill_ts_df$RegionName == 'Georgia',]
zillow_ga$median <- log10(zillow_ga$median)
# head(zillow_ga)
# nrow(zillow_ga)
Filtering the data further to extract the time series transition for Indiana
zillow_in <- zill_ts_df[zill_ts_df$RegionName == 'Indiana',]
zillow_in$median <- log10(zillow_in$median)
# head(zillow_in)
Basic Time Series Analysis for Georgia to Date
ga_ts <- ts(zillow_ga$median, start = as.yearmon(zillow_ga$date[1]), freq = 12)
head(ga_ts)
## Apr May Jun Jul Aug Sep
## 1996 1.707570 1.707570 1.707570 1.707570 1.707570 1.716003
class(ga_ts)
## [1] "ts"
start(ga_ts)
## [1] 1996 4
end(ga_ts)
## [1] 2019 3
frequency(ga_ts)
## [1] 12
summary(ga_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.708 1.806 1.875 1.866 1.924 2.009
Basic Time Series Analysis for Indiana to Date
in_ts <- ts(zillow_in$median, start = as.yearmon(zillow_in$date[1]), freq = 12)
head(in_ts)
## Apr May Jun Jul Aug Sep
## 1996 1.785330 1.785330 1.778151 1.770852 1.770852 1.763428
class(in_ts)
## [1] "ts"
start(in_ts)
## [1] 1996 4
end(in_ts)
## [1] 2019 3
frequency(in_ts)
## [1] 12
summary(in_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.763 1.833 1.845 1.850 1.882 1.978
sum(is.na(ga_ts))
## [1] 0
sum(is.na(in_ts))
## [1] 0
Time Series Plot of the Data to Date for Georgia
p1 <- plot(ga_ts)
Time Series Plot of the Data to Date for Indiana
p2 <- plot(in_ts)
Calculating and plotting the autocorrelation function (ACF) and partial autocorrelation function (PACF) for Georgia. These help us determine the parameters required to fit the time series data to an ARIMA (AutoRegressive Integrated Moving Average) model.
acf(ga_ts)
acf(diff(diff(log(ga_ts)))) #q=2, d=2
pacf(diff(diff(log(ga_ts)))) #p=1
acf(in_ts)
acf(diff(diff(log(in_ts)))) #q=1, d=2
pacf(diff(diff(log(in_ts)))) #p=1,
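As a cross-check on the differencing orders read off the plots above, the forecast package can estimate them directly from unit-root tests. A minimal sketch (not part of the original analysis):
# Estimate the regular and seasonal differencing orders for each series
ndiffs(ga_ts)
nsdiffs(ga_ts)
ndiffs(in_ts)
nsdiffs(in_ts)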
Let's look at the monthly price variation across the dataset for Georgia and Indiana.
boxplot(ga_ts ~ cycle(ga_ts), xlab = "Month", ylab = "Median", main = "Monthly median values for GA from 1996 to 2018")
boxplot(in_ts ~ cycle(in_ts), xlab = "Month", ylab = "Median", main = "Monthly median values for IN from 1996 to 2018")
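The trend and seasonality noted earlier can also be inspected directly with a seasonal decomposition. A minimal sketch (not part of the original analysis) using stl():
# Decompose each series into trend, seasonal and remainder components
plot(stl(ga_ts, s.window = "periodic"))
plot(stl(in_ts, s.window = "periodic"))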
Keeping trace on for auto.arima lets us see the candidate models it evaluates and justify the hyperparameter selection (a sketch follows below).
### What is ARIMA?
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of models that captures a suite of standard temporal structures in time series data.
When we fit the model we use the auto.arima() function. Auto ARIMA takes into account the AIC and BIC values generated to determine the best combination of parameters. AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are estimators used to compare models; the lower these values, the better the model.
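A minimal sketch of enabling trace (its console output is omitted here); ic = "aicc" is the default information criterion and is spelled out only for clarity:
# Print every candidate model and its AICc during the stepwise search
auto.arima(ga_ts, trace = TRUE, ic = "aicc")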
set.seed(41)
gamodel <- auto.arima(ga_ts)
inmodel <- auto.arima(in_ts)
summary(gamodel)
## Series: ga_ts
## ARIMA(0,2,2)(1,0,0)[12]
##
## Coefficients:
## ma1 ma2 sar1
## -1.2933 0.4755 0.0521
## s.e. 0.0602 0.0595 0.0626
##
## sigma^2 estimated as 9.316e-06: log likelihood=1199.01
## AIC=-2390.03 AICc=-2389.88 BIC=-2375.58
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 4.052283e-05 0.003024377 0.002531881 0.002843619 0.1362667
## MASE ACF1
## Training set 0.08739804 0.02491236
summary(inmodel)
## Series: in_ts
## ARIMA(2,2,3)(2,0,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 sar1 sar2 sma1
## 0.2162 0.6027 -1.4073 0.1232 0.2996 -0.0633 0.0615 0.0815
## s.e. 0.4537 0.3231 0.4593 0.3517 0.2090 0.9779 0.0669 0.9778
##
## sigma^2 estimated as 1.183e-05: log likelihood=1168.74
## AIC=-2319.47 AICc=-2318.79 BIC=-2286.96
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 0.0002333671 0.003375994 0.002653448 0.01290224 0.1437155
## MASE ACF1
## Training set 0.1836091 -0.005631734
For both states, the AIC and BIC values are quite low, which suggests the models fit well.
We'll fit an ARIMA model for Georgia with the p, d, q values suggested by auto.arima:
p = 0; d = 2; q = 2.
fit_ga = arima(ga_ts, c(0,2,2), seasonal = list(order = c(0,2,2), period = 12))
fit_ga
##
## Call:
## arima(x = ga_ts, order = c(0, 2, 2), seasonal = list(order = c(0, 2, 2), period = 12))
##
## Coefficients:
## ma1 ma2 sma1 sma2
## -1.3109 0.4975 -1.9362 0.9630
## s.e. 0.0729 0.0720 0.1300 0.1301
##
## sigma^2 estimated as 9.922e-06: log likelihood = 1036.73, aic = -2063.46
We'll fit an ARIMA model for Indiana with the p, d, q values suggested by auto.arima:
p = 2; d = 2; q = 3.
fit_in = arima(in_ts, c(2,2,3), seasonal = list(order = c(2,2,3), period = 12))
fit_in
##
## Call:
## arima(x = in_ts, order = c(2, 2, 3), seasonal = list(order = c(2, 2, 3), period = 12))
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ma1 ma2 ma3 sar1 sar2 sma1
## -1.8119 -0.8583 0.6798 -0.7913 -0.5795 -0.4977 0.0076 -1.1989
## s.e. 0.0703 0.0681 0.1027 0.0602 0.1013 NaN 0.0833 NaN
## sma2 sma3
## -0.1519 0.3775
## s.e. NaN NaN
##
## sigma^2 estimated as 1.397e-05: log likelihood = 1006.76, aic = -1991.53
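Before forecasting from the manually specified fits, a Ljung-Box test on the residuals can confirm that they resemble white noise. A minimal sketch (not part of the original analysis):
# Ljung-Box test: a large p-value suggests no remaining autocorrelation in the residuals
Box.test(residuals(fit_ga), lag = 24, type = "Ljung-Box")
Box.test(residuals(fit_in), lag = 24, type = "Ljung-Box")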
Now, we'll use the fitted ARIMA model to forecast the next 10 years for Georgia using the n.ahead parameter.
pred_ga = predict(fit_ga, n.ahead = 10*12)
pred_ga$pred
## Jan Feb Mar Apr May Jun Jul
## 2019 2.011295 2.015141 2.017369 2.021534
## 2020 2.039541 2.040290 2.045311 2.047479 2.052107 2.054802 2.059637
## 2021 2.081615 2.082898 2.088545 2.091400 2.096883 2.100115 2.105693
## 2022 2.132071 2.133959 2.140304 2.143916 2.150326 2.154168 2.160559
## 2023 2.191765 2.194331 2.201444 2.205887 2.213294 2.217817 2.225094
## 2024 2.261558 2.264872 2.272826 2.278170 2.286646 2.291922 2.300155
## 2025 2.342306 2.346441 2.355306 2.361623 2.371239 2.377339 2.386601
## 2026 2.434867 2.439894 2.449743 2.457104 2.467933 2.474929 2.485290
## 2027 2.540101 2.546091 2.556995 2.565472 2.577584 2.585547 2.597079
## 2028 2.658865 2.665890 2.677920 2.687584 2.701052 2.710054 2.722828
## 2029 2.792017 2.800148 2.813377
## Aug Sep Oct Nov Dec
## 2019 2.023286 2.026247 2.029999 2.031994 2.033514
## 2020 2.061983 2.065767 2.070190 2.072749 2.074838
## 2021 2.108703 2.113382 2.118548 2.121743 2.124471
## 2022 2.164305 2.169951 2.175931 2.179833 2.183273
## 2023 2.229647 2.236331 2.243197 2.247877 2.252100
## 2024 2.305587 2.313381 2.321204 2.326734 2.331812
## 2025 2.392984 2.401958 2.410810 2.417262 2.423266
## 2026 2.492694 2.502922 2.512874 2.520319 2.527321
## 2027 2.605578 2.617129 2.628253 2.636763 2.644834
## 2028 2.732492 2.745439 2.757806 2.767452 2.776664
## 2029
Now, we'll use the fitted ARIMA model to forecast the next 10 years for Indiana using the n.ahead parameter.
pred_in = predict(fit_in, n.ahead = 10*12)
pred_in$pred
## Jan Feb Mar Apr May Jun Jul
## 2019 1.981140 1.986097 1.989811 1.994894
## 2020 2.019090 2.023464 2.025875 2.031306 2.035750 2.040759 2.046119
## 2021 2.074169 2.078365 2.081861 2.087318 2.092934 2.098080 2.104539
## 2022 2.135892 2.140759 2.144473 2.150859 2.156822 2.162860 2.169799
## 2023 2.204880 2.210294 2.214448 2.221529 2.228139 2.234801 2.242519
## 2024 2.281334 2.287263 2.291886 2.299698 2.306923 2.314292 2.322724
## 2025 2.365381 2.371898 2.376965 2.385554 2.393396 2.401493 2.410664
## 2026 2.457276 2.464367 2.469921 2.479281 2.487775 2.496602 2.506538
## 2027 2.557196 2.564890 2.570936 2.581095 2.590248 2.599831 2.610540
## 2028 2.665350 2.673661 2.680220 2.691191 2.701024 2.711375 2.722879
## 2029 2.781939 2.790884 2.797972
## Aug Sep Oct Nov Dec
## 2019 1.996840 2.002128 2.008421 2.010642 2.012837
## 2020 2.048613 2.054697 2.061296 2.064480 2.066417
## 2021 2.107171 2.114231 2.121407 2.125113 2.127494
## 2022 2.173100 2.180746 2.188771 2.192874 2.195738
## 2023 2.246240 2.254740 2.263427 2.268114 2.271353
## 2024 2.326973 2.336243 2.345702 2.350900 2.354609
## 2025 2.415441 2.425524 2.435743 2.441501 2.445666
## 2026 2.511856 2.522768 2.533767 2.540094 2.544740
## 2027 2.616422 2.628176 2.639974 2.646888 2.652031
## 2028 2.729337 2.741952 2.754565 2.762083 2.767739
## 2029
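predict() only returns point forecasts. A minimal sketch (not part of the original analysis) using forecast() from the forecast package, which adds 80%/95% prediction intervals around the same 10-year outlook:
# Forecast with prediction intervals and plot the Georgia outlook
fc_ga <- forecast(fit_ga, h = 10 * 12)
fc_in <- forecast(fit_in, h = 10 * 12)
autoplot(fc_ga)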
Converting the logarithmic predictions back into absolute values by taking the inverse of log10.
pred1_ga = round(10^pred_ga$pred,0)
pred2_in = round(10^pred_in$pred,0)
fit1_ga = round(10^ga_ts)
fit2_in = round(10^in_ts)
Let's plot the current and forecasted values from our models.
ts.plot(fit1_ga, pred1_ga, fit2_in,pred2_in,
col = c(2:1,4:1), lwd = 2)
Let's visualize the same graph interactively using the plotly library.
df_fit1_ga <- fortify(timeSeries::as.timeSeries(fit1_ga))
df_pred1_ga <- fortify(timeSeries::as.timeSeries(pred1_ga))
df_fit2_in <- fortify(timeSeries::as.timeSeries(fit2_in))
df_pred2_in <- fortify(timeSeries::as.timeSeries(pred2_in))
dfc <- cbind(df_fit1_ga,df_fit2_in)
colnames(dfc) <- c("index_ga", "data_ga", "index_in", "data_in")
dfc <- dfc %>% select(index_ga,data_ga, data_in)
dfc1 <- cbind(df_pred1_ga,df_pred2_in)
colnames(dfc1) <- c("index_ga", "data_ga", "index_in", "data_in")
dfc1 <- dfc1 %>% select(index_ga,data_ga, data_in)
df_main <- rbind(dfc, dfc1)
plot1 <- plot_ly(dfc,x = ~index_ga, y = ~data_ga, name = 'Current Median GA', type = 'scatter', mode = 'lines',
line = list(color = 'rgb(205, 12, 24)', width = 4) ) %>%
add_trace(y = ~data_in, name = 'Median IN', line = list(color = 'rgb(22, 96, 167)', width = 4))
plot2 <- plot_ly(dfc1,x = ~index_ga, y = ~data_ga, name = 'Forecast Median GA', type = 'scatter', mode = 'lines',
line = list(color = 'rgb(205, 12, 24)', width = 4) ) %>%
add_trace(y = ~data_in, name = 'Forecast Median IN', line = list(color = 'rgb(22, 96, 167)', width = 4))
plot_main <- subplot(plot1, plot2)
plot_main
Now, let's apply machine learning models to predict house prices.
summary(df)
## address.street address.zipcode address.city
## 2828 Routh St STE 100 : 92 75201 : 98 Brooklyn : 691
## 1330 Campus Pkwy : 31 11215 : 82 Chicago : 543
## 134 Creek Shoals Dr : 30 8701 : 54 New York : 541
## 36 Mill Plain Rd STE 310: 30 30005 : 46 Los Angeles: 391
## Byron Ave # NA : 30 11222 : 45 Houston : 378
## 123 N Main St : 25 (Other):37692 (Other) :35463
## (Other) :37783 NA's : 4 NA's : 14
## useCode taxAssessmentYear taxAssessment
## SingleFamily :27817 Min. :1999 Min. :1.000e+00
## Condominium : 2460 1st Qu.:2017 1st Qu.:1.418e+05
## MultiFamily2To4 : 1560 Median :2018 Median :2.615e+05
## Apartment : 1183 Mean :2018 Mean :1.607e+06
## VacantResidentialLand: 1070 3rd Qu.:2018 3rd Qu.:4.757e+05
## Unknown : 891 Max. :2020 Max. :1.460e+09
## (Other) : 3040 NA's :5402 NA's :5768
## yearBuilt lotSizeSqFt finishedSqFt bathrooms
## Min. :1111 Min. :1.000e+00 Min. : 1 Min. : 0.010
## 1st Qu.:1958 1st Qu.:6.990e+03 1st Qu.: 1336 1st Qu.: 1.750
## Median :1983 Median :1.307e+04 Median : 1930 Median : 2.000
## Mean :1975 Mean :7.835e+05 Mean : 5940 Mean : 2.385
## 3rd Qu.:2000 3rd Qu.:4.356e+04 3rd Qu.: 2856 3rd Qu.: 3.000
## Max. :2019 Max. :2.147e+09 Max. :3634314 Max. :150.000
## NA's :7222 NA's :7512 NA's :6294 NA's :9469
## bedrooms zestimate.amount.text localRealEstate.name
## Min. : 0.000 Min. : 7016 Downtown : 318
## 1st Qu.: 2.000 1st Qu.: 207133 Oak Lawn : 132
## Median : 3.000 Median : 341025 Austin : 101
## Mean : 2.848 Mean : 686347 Lakewood : 95
## 3rd Qu.: 4.000 3rd Qu.: 575498 Park Slope: 86
## Max. :231.000 Max. :407419190 (Other) :36087
## NA's :7602 NA's :7390 NA's : 1202
## localRealEstate.type lastSoldDate lastSoldPrice.text
## city :24541 6/27/2012: 35 Min. : 1
## neighborhood:12278 4/26/1996: 30 1st Qu.: 137000
## NA's : 1202 1/20/2004: 27 Median : 246500
## 7/13/2018: 26 Mean : 756839
## 9/2/2009 : 26 3rd Qu.: 442250
## (Other) :22400 Max. :150000000
## NA's :15477 NA's :15758
## valuationRange.low valuationRange.high zestimate.amount.text.1
## Mode:logical Mode:logical Min. : 7016
## NA's:38021 NA's:38021 1st Qu.: 207133
## Median : 341025
## Mean : 686347
## 3rd Qu.: 575498
## Max. :407419190
## NA's :7390
## localRealEstate.name.1 localRealEstate.type.1
## Downtown : 318 city :24541
## Oak Lawn : 132 neighborhood:12278
## Austin : 101 NA's : 1202
## Lakewood : 95
## Park Slope: 86
## (Other) :36087
## NA's : 1202
dim(df)
## [1] 38021 21
df.clean <- df %>%
  select(useCode, taxAssessment, yearBuilt, lotSizeSqFt, finishedSqFt, bathrooms, bedrooms, lastSoldPrice.text) %>%
  drop_na() %>%
  # derive the property age as of 2019
  mutate(yearOld = 2019 - yearBuilt) %>%
  select(useCode, taxAssessment, yearOld, lotSizeSqFt, finishedSqFt, bathrooms, bedrooms, lastSoldPrice.text)
summary(df.clean$lastSoldPrice.text)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 142000 240000 402642 409400 97500000
df.clean <- df.clean[ df.clean$lastSoldPrice.text < quantile(df.clean$lastSoldPrice.text , 0.75 ) , ]
summary(df.clean$yearOld)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 20.00 35.00 43.83 61.00 319.00
df.clean <- df.clean[ df.clean$yearOld < quantile(df.clean$yearOld , 0.75 ) , ]
ggplot(df.clean, aes(x =yearOld, y=lastSoldPrice.text)) + geom_point()
ggplot(df.clean, aes(x =useCode, y=lastSoldPrice.text)) + geom_boxplot()
cor <- cor(df.clean[,-1])
#correlation Matrix
corrplot(cor, method="shade",
tl.col = "black", tl.srt = 45, tl.cex = 0.75,cl.cex = 0.75,
sig.level = 0.01, insig = "pch", pch.cex = 0.8)
df.cleanDum <- dummyVars("~.", data = df.clean)
trnf <- data.frame(predict(df.cleanDum, newdata = df.clean))
Let's set a seed and split the data into training and testing sets.
set.seed(123)
ind = createDataPartition(trnf$lastSoldPrice.text, p =0.8, list = FALSE)
trnf.train <- trnf[ind,]
trnf.test <- trnf[-ind,]
anyNA(trnf)
## [1] FALSE
lmMod <- lm(lastSoldPrice.text ~., data=trnf.train)
summary(lmMod)
##
## Call:
## lm(formula = lastSoldPrice.text ~ ., data = trnf.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -530782 -54314 -381 63834 284666
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.567e+05 4.133e+04 3.790 0.000152 ***
## useCode.Apartment -5.649e+04 4.476e+04 -1.262 0.206901
## useCode.Condominium -3.097e+04 4.150e+04 -0.746 0.455612
## useCode.Cooperative NA NA NA NA
## useCode.Duplex -9.093e+04 4.576e+04 -1.987 0.046966 *
## useCode.Miscellaneous -5.973e+04 4.707e+04 -1.269 0.204529
## useCode.Mobile -1.263e+05 4.228e+04 -2.986 0.002833 **
## useCode.MultiFamily2To4 -7.307e+04 4.246e+04 -1.721 0.085271 .
## useCode.MultiFamily5Plus -5.740e+04 5.810e+04 -0.988 0.323199
## useCode.Quadruplex -1.502e+04 7.695e+04 -0.195 0.845240
## useCode.SingleFamily -5.710e+04 4.113e+04 -1.388 0.165061
## useCode.Townhouse -4.543e+04 4.164e+04 -1.091 0.275325
## useCode.Triplex -1.405e+05 7.692e+04 -1.827 0.067752 .
## useCode.Unknown NA NA NA NA
## useCode.VacantResidentialLand NA NA NA NA
## taxAssessment 1.486e-01 7.297e-03 20.357 < 2e-16 ***
## yearOld -4.720e+02 7.695e+01 -6.134 9.02e-10 ***
## lotSizeSqFt -6.641e-05 1.039e-04 -0.639 0.522712
## finishedSqFt -4.827e+00 1.347e+00 -3.583 0.000342 ***
## bathrooms 2.191e+04 1.872e+03 11.704 < 2e-16 ***
## bedrooms 1.044e+04 1.547e+03 6.745 1.65e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91760 on 6918 degrees of freedom
## Multiple R-squared: 0.1781, Adjusted R-squared: 0.1761
## F-statistic: 88.21 on 17 and 6918 DF, p-value: < 2.2e-16
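To compare the linear model with the other models on the same footing, it can be scored on the held-out test set. A minimal sketch (not part of the original analysis); note that predict() will warn about the rank-deficient fit caused by the NA coefficients above:
# RMSE of the linear model on the held-out test set (using caret's RMSE helper)
lm_pred <- predict(lmMod, newdata = trnf.test)
RMSE(lm_pred, trnf.test$lastSoldPrice.text)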
gbm.fit <- gbm(
formula = lastSoldPrice.text ~ .,
distribution = "gaussian",
data = trnf.train,
n.trees = 10000,
interaction.depth = 1,
shrinkage = 0.001,
cv.folds = 5,
n.cores = NULL, #
verbose = FALSE
)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution =
## distribution, : variable 3: useCode.Cooperative has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution =
## distribution, : variable 13: useCode.Unknown has no variation.
min_MSE <- which.min(gbm.fit$cv.error)
sqrt(gbm.fit$cv.error[min_MSE])
## [1] 80594.81
Plot the loss function as a function of the number of trees added to the ensemble.
gbm.perf(gbm.fit, method = "cv")
## [1] 10000
pred <- predict(gbm.fit, n.trees = 10000, trnf.test)
RMSE(pred, trnf.test$lastSoldPrice.text)
## [1] 81252
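To understand which predictors drive the GBM, the relative influence can be inspected. A minimal sketch (not part of the original analysis):
# Relative influence of each predictor in the fitted GBM
head(summary(gbm.fit, n.trees = 10000, plotit = FALSE), 10)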
# Training data: Separate into x and y tibbles
x_train_tbl <- trnf.train %>% select(-lastSoldPrice.text)
y_train_tbl <- trnf.train %>% select(lastSoldPrice.text)
# Testing data: held-out features for prediction
x_test_tbl <- trnf.test %>% select(-lastSoldPrice.text)
y_test_tbl <- trnf.test %>% select(lastSoldPrice.text)
data_h2o <- as.h2o(
bind_cols(y_train_tbl, x_train_tbl),
destination_frame= "train.hex")
new_data_h2o <- as.h2o(
x_test_tbl,
destination_frame= "test.hex")
splits <- h2o.splitFrame(data = data_h2o,
ratios = c(0.7, 0.15), # 70/15/15 split
seed = 1234)
train_h2o <- splits[[1]] # from training data
valid_h2o <- splits[[2]] # from training data
test_h2o <- splits[[3]] # from training data
y <- "lastSoldPrice.text"
x <- setdiff(names(train_h2o),y)
grid <- h2o.glm(
x = x,
y = y,
training_frame = train_h2o,
validation_frame = valid_h2o,
nfolds = 5,
alpha = 0.01,
lambda = 0.0001,
early_stopping = TRUE
)
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [useCode.Cooperative, useCode.Triplex, useCode.Unknown].
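To compare the GLM with the GBM on RMSE, it can be scored on the internal hold-out split created above. A minimal sketch (not part of the original analysis):
# RMSE of the H2O GLM on the internal hold-out frame
perf_glm <- h2o.performance(grid, newdata = test_h2o)
h2o.rmse(perf_glm)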
There is a lot of scope for optimization in this work. The data is biased because of its visible skewness, and it is also quite sparse. The current models predict house prices with fairly high RMSE values, which could be reduced with further tuning.
Of all the models above, GBM performed best, giving the lowest RMSE. GBM is an excellent model for a Gaussian-distributed response, and the distribution family (Gaussian, Poisson, binomial, etc.) can always be specified explicitly when applying the GBM model.
Shut down H2O
h2o.shutdown(prompt = F)
## [1] TRUE