This notebook is a quick introduction to prediction and variable selection using several methods. We will explore both forward and backward variable selection via linear regression, take a quick tour of data exploration, look at a more advanced approach to model checking using Bayesian methods, and fit a gradient boosting model (GBM) for comparison.
The data come from a city assessor interested in predicting home sale prices as a function of various characteristics of the home and the surrounding property. The data consist of sale price, finished square feet, number of bedrooms, number of bathrooms, AC, garage size (number of cars it fits), pool, year built, quality (1 = high, 2 = medium, 3 = low), style, lot size, and adjacency to a highway (1 = yes, 0 = no).
library(caret)      # createDataPartition(), train()
library(tidyverse)  # dplyr verbs and ggplot2 used throughout
library(reshape2)   # melt() for the correlation heatmap

# reading in data
rl_stt_data <- read.table("RealEstateSales.txt", header = TRUE)

# creating index for a 70:30 train/test split
index <- createDataPartition(rl_stt_data$price, p = 0.7, list = FALSE)
training_set <- rl_stt_data[index, ]
testing_set  <- rl_stt_data[-index, ]
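A quick sanity check of the data and the split helps catch import problems early; a minimal sketch, assuming the file read in cleanly:
# peek at the variables and confirm the 70:30 split sizes
str(rl_stt_data)
nrow(training_set)
nrow(testing_set)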
I will use a correlation matrix to explore the relationships between the variables. In the heatmap below, the lighter the blue, the more strongly correlated the pair of variables.
# correlation matrix of the predictors (price excluded), reshaped for ggplot
cor_matrix <- rl_stt_data %>%
  dplyr::select(-price) %>%
  cor() %>%
  melt()

ggplot(data = cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  labs(title = "Correlation Matrix", x = "", y = "", fill = "Correlation")
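The heatmap gives a visual overview; to read off the strongest relationships numerically, one option (a sketch using the same melted matrix) is:
# strongest pairwise correlations among the predictors, ignoring the diagonal
cor_matrix %>%
  filter(Var1 != Var2) %>%
  arrange(desc(abs(value))) %>%
  head(10)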
Here I will explore traditional variable selection methods from statistics, namely stepwise selection in both the forward and backward directions. I will use linear regression with AIC as the variable selection criterion.
lm_fit_forward <- train(price ~ ., data = training_set, method = "lmStepAIC",
                        scope = list(lower = ~ ., upper = ~ .^2),
                        direction = "forward")
summary(lm_fit_forward$finalModel)
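For reference, caret's "lmStepAIC" method is a wrapper around MASS::stepAIC; a rough base-R equivalent of the search above (a sketch, fit once on the training set without caret's resampling) would be:
library(MASS)

# start from the main-effects model and let AIC decide which two-way
# interactions to add
base_lm <- lm(price ~ ., data = training_set)
step_fwd <- stepAIC(base_lm,
                    scope = list(lower = ~ ., upper = ~ .^2),
                    direction = "forward", trace = FALSE)
summary(step_fwd)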
training_set %>%
dplyr::select(price) %>%
bind_cols(residuals = resid(lm_fit_forward)) %>%
ggplot(aes(x = price, y = residuals)) +
geom_point() +
geom_abline(intercept = 0, slope = 0 , col = "red")+
xlab("Fitted values of Price") +
ylab("Model Residuals") +
ggtitle("Model Diagnostic") +
theme_bw()
The final model produces an R^2 of 0.864, which will become important later when comparing models.
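That value can also be pulled out programmatically rather than read off the summary; a small sketch, assuming the final model stored by caret is the usual lm object:
# training R^2 of the forward-selected model
summary(lm_fit_forward$finalModel)$r.squared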
frwrd_pred <- predict(lm_fit_forward, newdata = testing_set[,-1])
testing_set %>%
dplyr::select(price) %>%
bind_cols(prediction = frwrd_pred) %>%
ggplot(aes(x = price, y = prediction)) +
geom_point() +
geom_abline(intercept = 0, col = "red")+
xlab("Actual values of Price") +
ylab("Predicted values of Price") +
ggtitle("Prediction Diagnostics") +
theme_bw()
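Beyond the plot, caret's postResample() reports the hold-out RMSE, R^2 and MAE in one call; a minimal sketch on the forward model's predictions:
# hold-out performance of the forward-selected model
postResample(pred = frwrd_pred, obs = testing_set$price)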
lm_fit_backward <- train(price ~ ., data = training_set, method = "lmStepAIC",
                         scope = list(lower = ~ ., upper = ~ .^2),
                         direction = "backward")
Start:  AIC=8101.08
.outcome ~ area + bed + bath + ac + grsize + pool + age + qu +
    style + lotsize + hgway

(stepAIC prints the same starting model, with a slightly different AIC, for each resample; the repeated output is omitted here.)
summary(lm_fit_backward$finalModel)
Call:
lm(formula = .outcome ~ area + bed + bath + ac + grsize + pool +
age + qu + style + lotsize + hgway, data = dat)
Residuals:
Min 1Q Median 3Q Max
-175224 -36960 -4092 33683 296286
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.640e+06 5.167e+05 -5.108 5.32e-07 ***
area 1.303e+02 9.384e+00 13.882 < 2e-16 ***
bed -1.213e+04 4.216e+03 -2.876 0.00426 **
bath 1.296e+03 5.208e+03 0.249 0.80357
ac -1.061e+04 1.014e+04 -1.046 0.29607
grsize 6.939e+03 6.197e+03 1.120 0.26359
pool 7.617e+03 1.421e+04 0.536 0.59231
age 1.408e+03 2.601e+02 5.411 1.15e-07 ***
qu -5.108e+04 8.781e+03 -5.817 1.34e-08 ***
style -9.715e+03 1.674e+03 -5.802 1.45e-08 ***
lotsize 1.301e+00 2.993e-01 4.346 1.82e-05 ***
hgway -3.857e+04 2.067e+04 -1.866 0.06285 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 63130 on 355 degrees of freedom
Multiple R-squared: 0.7886, Adjusted R-squared: 0.782
F-statistic: 120.4 on 11 and 355 DF, p-value: < 2.2e-16
training_set %>%
dplyr::select(price) %>%
bind_cols(residuals = resid(lm_fit_backward)) %>%
ggplot(aes(x = price, y = residuals)) +
geom_point() +
geom_abline(intercept = 0, slope = 0 , col = "red")+
xlab("Fitted values of Price") +
ylab("Model Residuals") +
ggtitle("Model Diagnostic") +
theme_bw()
The final model produces an adjusted R^2 of 0.782, which will become important later when comparing models.
backward_pred <- predict(lm_fit_backward, newdata = testing_set[,-1])
testing_set %>%
dplyr::select(price) %>%
bind_cols(prediction = backward_pred) %>%
ggplot(aes(x = price, y = prediction)) +
geom_point() +
geom_abline(intercept = 0, col = "red")+
xlab("Actual values of Price") +
ylab("Predicted values of Price") +
ggtitle("Prediction Diagnostics") +
theme_bw()
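The same hold-out check can be run for the backward model so the two selections are compared on equal footing:
# hold-out performance of the backward-selected model
postResample(pred = backward_pred, obs = testing_set$price)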
# 10-fold cross-validation of the full linear model on the complete data set
ctrl <- trainControl(method = "cv", number = 10)
lm_cv_fit <- train(price ~ ., data = rl_stt_data, method = "lm",
                   trControl = ctrl, metric = "Rsquared")
summary(lm_cv_fit)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-187722 -38049 -2232 31830 293382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.394e+06 4.343e+05 -5.513 5.62e-08 ***
area 1.303e+02 7.677e+00 16.979 < 2e-16 ***
bed -8.986e+03 3.517e+03 -2.555 0.0109 *
bath 3.211e+03 4.602e+03 0.698 0.4856
ac -1.348e+04 8.602e+03 -1.567 0.1177
grsize 1.394e+04 5.463e+03 2.552 0.0110 *
pool 9.242e+03 1.126e+04 0.821 0.4121
age 1.266e+03 2.188e+02 5.785 1.27e-08 ***
qu -4.822e+04 7.408e+03 -6.509 1.81e-10 ***
style -9.450e+03 1.433e+03 -6.593 1.08e-10 ***
lotsize 1.190e+00 2.576e-01 4.619 4.90e-06 ***
hgway -3.830e+04 1.964e+04 -1.950 0.0517 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 63540 on 510 degrees of freedom
Multiple R-squared: 0.7922, Adjusted R-squared: 0.7877
F-statistic: 176.8 on 11 and 510 DF, p-value: < 2.2e-16
rl_stt_data %>%
dplyr::select(price) %>%
bind_cols(residuals = resid(lm_cv_fit)) %>%
ggplot(aes(x = price, y = residuals)) +
geom_point() +
geom_abline(intercept = 0, slope = 0 , col = "red")+
xlab("Fitted values of Price") +
ylab("Model Residuals") +
ggtitle("Model Diagnostic") +
theme_bw()
The final model produces an adjusted R^2 of 0.7877, which will become important later when comparing models.
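The cross-validated (rather than in-sample) performance is stored in the train object itself; a quick sketch to inspect it:
# resampled performance across the 10 folds
lm_cv_fit$results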
library(MCMCpack)  # MCMCregress()

# Bayesian linear regression via MCMC on the model selected earlier
# (main effects plus the chosen interaction terms)
bayes_model_check <- MCMCregress(price ~ area + bed + bath +
                                   ac + grsize + pool +
                                   age + qu + style +
                                   lotsize + hgway + age:qu +
                                   qu:lotsize + area:bed +
                                   area:qu + qu:style + grsize:lotsize +
                                   ac:lotsize + ac:age + age:lotsize +
                                   bed:style + bed:ac + bed:qu +
                                   age:style + area:age + bed:bath +
                                   ac:pool + pool:lotsize + area:style,
                                 data = training_set)
# Raftery-Lewis convergence diagnostic for the chains
raftery.diag(bayes_model_check)
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
Burn-in Total Lower bound Dependence
(M) (N) (Nmin) factor (I)
(Intercept) 2 3897 3746 1.040
area 2 3757 3746 1.000
bed 2 3741 3746 0.999
bath 2 3771 3746 1.010
ac 2 3867 3746 1.030
grsize 2 3650 3746 0.974
pool 2 3865 3746 1.030
age 2 3802 3746 1.010
qu 2 3834 3746 1.020
style 2 3680 3746 0.982
lotsize 2 3710 3746 0.990
hgway 2 3741 3746 0.999
age:qu 2 3741 3746 0.999
qu:lotsize 2 3771 3746 1.010
area:bed 2 3802 3746 1.010
area:qu 2 3865 3746 1.030
qu:style 2 3771 3746 1.010
grsize:lotsize 1 3726 3746 0.995
ac:lotsize 2 3834 3746 1.020
ac:age 2 3771 3746 1.010
age:lotsize 2 3771 3746 1.010
bed:style 2 3788 3746 1.010
bed:ac 2 3771 3746 1.010
bed:qu 2 3741 3746 0.999
age:style 2 3710 3746 0.990
area:age 2 3650 3746 0.974
bed:bath 2 3897 3746 1.040
ac:pool 2 3650 3746 0.974
pool:lotsize 2 3680 3746 0.982
area:style 2 3802 3746 1.010
sigma2 2 3802 3746 1.010
I ran the model through a Monte Carlo simulation with 10,000 iterations (the MCMCregress default). The Raftery-Lewis diagnostic above shows dependence factors close to 1 for every parameter, and the summary output of the simulation confirmed the selected model: all coefficient estimates were within one standard deviation of their means, so the model is stable and ready for use within the framework of its requirements.
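If further convergence checking were wanted, the coda package (which also supplies raftery.diag) offers complementary diagnostics; a sketch:
library(coda)

# Geweke z-scores and effective sample sizes for the chains
geweke.diag(bayes_model_check)
effectiveSize(bayes_model_check)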
metric <- "Rsquared"
trainControl <- trainControl(method = "cv", number = 10)
caretGrid <- expand.grid(interaction.depth=c(1, 3, 5), n.trees = (0:50)*50,
shrinkage=c(0.01, 0.001),
n.minobsinnode=10)
set.seed(99)
gbm_fit <- train(price ~ .
, data=training_set
, distribution="gaussian"
, method="gbm"
, trControl=trainControl
, verbose=FALSE
, tuneGrid=caretGrid
, metric=metric
, bag.fraction=0.75
)
Warning: There were missing values in resampled performance measures; missing values found in aggregated results.
print(gbm.caret)
Stochastic Gradient Boosting
367 samples
11 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 330, 330, 331, 330, 331, 331, ...
Resampling results across tuning parameters:
interaction.depth n.trees RMSE Rsquared MAE
1 50 64006.90 0.7846619 43320.21
1 100 61211.61 0.7980283 41132.79
1 150 60792.85 0.7994098 41025.22
2 50 59148.94 0.8108342 39284.60
2 100 57568.74 0.8175579 38464.69
2 150 57790.64 0.8175932 38618.94
3 50 58170.40 0.8152909 39126.05
3 100 57017.72 0.8221833 38509.90
3 150 57041.21 0.8224388 38692.48
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 100, interaction.depth = 3,
shrinkage = 0.1 and n.minobsinnode = 10.
The best R^2 value from the GBM is 0.82.
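The winning hyper-parameters and the corresponding resampled R^2 can be pulled straight from the train object; a small sketch:
# best tuning combination and the best cross-validated R^2
gbm.caret$bestTune
max(gbm.caret$results$Rsquared, na.rm = TRUE)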
caret.predict <- predict(gbm.caret, newdata = testing_set, type = "raw")
# rmse() is the root-mean-squared-error helper (e.g. from the Metrics package)
rmse.caret <- rmse(testing_set$price, caret.predict)
print(paste0("RMSE = ", rmse.caret))
[1] "RMSE = 48464.5529647971"
testing_set %>%
dplyr::select(price) %>%
bind_cols(prediction = caret.predict) %>%
ggplot(aes(x = price, y = prediction)) +
geom_point() +
geom_abline(intercept = 0, col = "red")+
xlab("Actual values of Price") +
ylab("Predicted values of Price") +
ggtitle("Prediction Diagnostics") +
theme_bw()
Even though I used one of the most successful algorithms, GBM, on the data, the linear regression model with forward variable selection came out on top with a better R^2, and that model was further confirmed by the Bayesian method. The reason this model won is that the forward variable selection also tested all the two-way interaction terms and, based on AIC, selected the best model that included those as well.
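To close, the R^2 values quoted throughout this notebook can be lined up side by side (a sketch; the numbers are simply the ones reported above, not recomputed):
data.frame(model     = c("Forward stepwise lm", "Backward stepwise lm",
                         "10-fold CV lm", "GBM"),
           r_squared = c(0.864, 0.782, 0.7877, 0.82))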