\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\prob}[1]{\operatorname{P}\left(#1\right)}\) \(\newcommand{\Var}[1]{\operatorname{Var}\left(#1\right)}\) \(\newcommand{\E}[1]{\operatorname{E}\left(#1\right)}\) \(\newcommand{\Cov}[1]{\operatorname{Cov}\left(#1\right)}\) \(\newcommand{\Cor}[1]{\operatorname{Corr}\left(#1\right)}\) \(\newcommand{\sd}[1]{\operatorname{sd}\left(#1\right)}\) \(\newcommand{\prob}[1]{\operatorname{P}\left(#1\right)}\) \(\DeclareMathOperator*{\minf}{minimize \quad}\) \(\newcommand{\mininlineeq}[3]{\mbox{minimize}_{#1}\quad#2\qquad\mbox{subject to }#3}\) \(\newcommand{\mininline}[2]{\mbox{minimize}_{#1}\quad#2}\)

We will demonstrate several more advanced predictive models using Boston housing dataset. This price dataset contains the factors that contribute to the median value of owner-occupied homes in $1000’s.

By constructing a regression model we can quantify the effects of the following measure factors (excluding MEDV as the response):

variable description
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (1 if tract bounds river; else 0)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PTRATIO pupil-teacher ratio by town
B \(1000(\mbox{Bk} - 0.63)^2\) where Bk is the proportion of blacks by town
LSTAT % lower status of the population
MEDV Median value of owner-occupied homes in $1000’s
knitr::kable(Boston[1:5,1:7])
crim zn indus chas nox rm age
0.00632 18 2.31 0 0.538 6.575 65.2
0.02731 0 7.07 0 0.469 6.421 78.9
0.02729 0 7.07 0 0.469 7.185 61.1
0.03237 0 2.18 0 0.458 6.998 45.8
0.06905 0 2.18 0 0.458 7.147 54.2
dim(Boston)
## [1] 506  14

First, we split data into training and test subsets

test_index = sample(1:nrow(Boston), size = as.integer(0.2*nrow(Boston)))
testing = Boston[test_index,]
training = Boston[-test_index,]
nrow(training)
## [1] 405

Let’s look at the correlation of each independent variable with the dependent variable and near zero variance

knitr::kable(cor(Boston,Boston$medv))
crim -0.3883046
zn 0.3604453
indus -0.4837252
chas 0.1752602
nox -0.4273208
rm 0.6953599
age -0.3769546
dis 0.2499287
rad -0.3816262
tax -0.4685359
ptratio -0.5077867
black 0.3334608
lstat -0.7376627
medv 1.0000000
knitr::kable(t(summarise_if(Boston,is.numeric, sd)/summarise_if(Boston,is.numeric, mean)))
crim 2.3803761
zn 2.0523759
indus 0.6160087
chas 3.6720281
nox 0.2089034
rm 0.1117992
age 0.4104834
dis 0.5548581
rad 0.9118115
tax 0.4128412
ptratio 0.1173060
black 0.2559616
lstat 0.5643741
medv 0.4081651
par(pch=21)
plot(Boston[1:100,c("medv","lstat","rm","age", "crim")],  cex=0.5, bg="light blue")

Linear Regression

We first try to regress lstat on medv.

fit.lm <- lm(medv~lstat,data = training)
#predict on test set
pred.lm <- predict(fit.lm, newdata = testing)
plot(testing$medv,pred.lm, pch=16); abline(a=0,b=1)

We will use RMSE and MAPE to measure predictive perofrmance of our models

metrics = function(yhat,y) {
  n = length(y)
  rmse  =  sqrt(sum((yhat - y)^2)/n)
  mape  = mean(abs((yhat - y)/y))
  r2 = 1 - sum((y-yhat)^2)/sum((y-mean(y))^2)
  return(c(rmse=rmse,mape=mape,R2=r2))
}

Let’s calculate those metrics for the linear model

fit.lm.metrics = metrics(pred.lm,testing$medv)
knitr::kable(t(fit.lm.metrics), digits = 2)
rmse mape R2
7.08 0.23 0.51

Now, let’s look at the relations when medv is on log scale and when both medv and lstat are on the log scale

par(mfrow=c(3,1), mar=c(4,4,0,0), bty='n')
plot(medv~lstat, data=Boston, pch=16)
plot(log(medv)~lstat, data=Boston, pch=16)
plot(log(medv)~log(lstat), data=Boston, pch=16)

Seems like log tranforms might help. Let’s try it in the model

fit.lm.log <- lm(log(medv)~log(lstat),data = training)

#predict on test set
pred.lm <- exp(predict(fit.lm.log, newdata = testing))
par(mfrow=c(1,1))
plot(testing$medv,pred.lm, pch=16); abline(a=0,b=1)

fit.lm.log.metrics = metrics(pred.lm,testing$medv)
knitr::kable(t(fit.lm.log.metrics), digits = 2)
rmse mape R2
6.19 0.21 0.62

What if we only pu medv on the log-scale

#Try linear model using all features
fit.lm.log1 <- lm(log(medv)~lstat,data = training) 

#predict on test set
pred.lm <- exp(predict(fit.lm.log1 , newdata = testing))
plot(testing$medv,pred.lm, pch=16); abline(a=0,b=1)

fit.lm.log1.metrics = metrics(pred.lm,testing$medv)
knitr::kable(t(fit.lm.log1.metrics), digits = 2)
rmse mape R2
6.77 0.2 0.55

Multiple Linear Regression

Try simple linear model using selected features

fit.lm2 <- lm(formula = log(medv) ~ crim + chas + nox + rm + dis + ptratio + 
                rad + black + log(lstat), data = training)

pred.lm2 <- exp(predict(fit.lm2, newdata = testing))
# Plot of predicted price vs actual price
plot(pred.lm2,testing$medv, xlab = "Predicted Price", ylab = "Actual Price",pch=21,bg="grey")
abline(a=0,b=1)

fit.lm2.metrics = metrics(pred.lm2,testing$medv)
knitr::kable(t(fit.lm2.metrics), digits = 2)
rmse mape R2
5.34 0.17 0.72
# Diagnostics plots
layout(matrix(c(1,2,3,4),2,2))
plot(fit.lm2)

Ridge Regression

library(glmnet)  
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Create grid of lambda values
log_lambda_grid <- seq(2.5, -2, length=100) # In reverse
lambda_grid <- 10^log_lambda_grid

# Create 100 ridge models
ridge_models <- glmnet(testing[,-14], log(testing$medv), alpha=0, lambda=lambda_grid)

# Get coefficients of all 100 models
ridge_coef <- coef(ridge_models)

# Display coefficients for 6 models. 
# Ridge Models are displayed in decreasing order of lambda.
knitr::kable(round(as.matrix(ridge_coef[, c(1:3, 98:100)], 6)))
s0 s1 s2 s97 s98 s99
(Intercept) 3 3 3 4 4 4
crim 0 0 0 0 0 0
zn 0 0 0 0 0 0
indus 0 0 0 0 0 0
chas 0 0 0 0 0 0
nox 0 0 0 -1 -1 -1
rm 0 0 0 0 0 0
age 0 0 0 0 0 0
dis 0 0 0 0 0 0
rad 0 0 0 0 0 0
tax 0 0 0 0 0 0
ptratio 0 0 0 0 0 0
black 0 0 0 0 0 0
lstat 0 0 0 0 0 0

Now we plot the relations between the \(\lambda\) and the \(\beta\)s

library(ggplot2)
m = as.matrix(ridge_coef[-1,])
sa <- stack(as.data.frame(t(m)))
sa$log_lambda <- log_lambda_grid
qplot(log_lambda, values, data = sa, group = ind, colour = ind, geom = "line")

We now use cross-valudation to pick the best model, the one that has the lowest out-of-sample RMSE

test_pred_ridge <- predict(ridge_models, as.matrix(testing[,-14]))
RMSE <- colMeans((exp(test_pred_ridge) - testing$medv)^2)
par(mar=c(4,4,0,0),bty='n')
plot(log_lambda_grid, RMSE, xlab="ln(lambda)", ylab="RMSE", type='l', lwd=2)

fit.ridge.metrics = metrics(exp(test_pred_ridge[,which.min(RMSE)]),testing$medv)

Lasso Regression

log_lambda_grid <- seq(1.5, -2, length=100) # In reverse
lambda_grid <- 10^log_lambda_grid
lasso_models <- glmnet(testing[,-14], log(testing$medv), alpha=1, lambda=lambda_grid)

# Get coefficients of all 100 models
lasso_coef <- coef(lasso_models)

# Display coefficients for 6 models. 
# lasso Models are displayed in decreasing order of lambda.
knitr::kable(round(as.matrix(lasso_coef[, c(1:3, 98:100)], 6)))
s0 s1 s2 s97 s98 s99
(Intercept) 3 3 3 4 4 4
crim 0 0 0 0 0 0
zn 0 0 0 0 0 0
indus 0 0 0 0 0 0
chas 0 0 0 0 0 0
nox 0 0 0 0 0 0
rm 0 0 0 0 0 0
age 0 0 0 0 0 0
dis 0 0 0 0 0 0
rad 0 0 0 0 0 0
tax 0 0 0 0 0 0
ptratio 0 0 0 0 0 0
black 0 0 0 0 0 0
lstat 0 0 0 0 0 0

Now we plot the relations between the \(\lambda\) and the \(\beta\)s

library(ggplot2)
m = as.matrix(lasso_coef[-1,])
sa <- stack(as.data.frame(t(m)))
sa$log_lambda <- log_lambda_grid
qplot(log_lambda, values, data = sa, group = ind, colour = ind, geom = "line")

We now use cross-valudation to pick the best model, the one that has the lowest out-of-sample RMSE

test_pred_lasso <- predict(lasso_models, as.matrix(testing[,-14]))
RMSE <- colMeans((exp(test_pred_lasso) - testing$medv)^2)
par(mar=c(4,4,0,0),bty='n')
plot(log_lambda_grid, RMSE, xlab="ln(lambda)", ylab="RMSE", type='l', lwd=2)

fit.lasso.metrics = metrics(exp(test_pred_lasso[,which.min(RMSE)]),testing$medv)

KNN Model

library(kknn) ## knn library
attach(Boston)
## The following objects are masked from Boston (pos = 6):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
ind = order(testing$lstat)
RMSE = NULL

kk = c(2,10,20,50,100,300)
par(mfrow=c(3,2), mar=c(3,3,2,0), bty='n')
for(i in kk){
  near = kknn(medv~lstat,training,testing,k=i,kernel = "rectangular")
  aux = sqrt(mean((testing$medv-near$fitted)^2))
  RMSE = c(RMSE,aux)
  plot(testing$lstat,testing$medv,main=paste("k=",i),pch=21,bg="grey",cex=0.8,col="darkgray")
  lines(testing$lstat[ind],near$fitted[ind],col=2,lwd=2)
}

Now let’s plot the relation between the number of neighbors and model performance (out-of-sample)

par(mfrow=c(1,1), mar=c(3,3,2,0), bty='n')
plot(log(1/kk),RMSE,type="b",xlab="Complexity(log(1/k))",col="blue",ylab="RMSE",lwd=2,cex.lab=1.2,pch=21,bg="grey")

Tree-Based Models

First big tree size

big.tree = tree(medv~lstat,data=training,mindev=.0001)
print(length(unique(big.tree$where)))
## [1] 60
par(mfrow=c(1,1))
plot(big.tree,type="uniform")
text(big.tree,col="blue",label=c("yval"),cex=.8)

Let’s see how well we did on the out-of-sample data

boston.fit = predict(big.tree,testing) #get training fitted values
tree.metrics = metrics(boston.fit,testing$medv)
knitr::kable(t(tree.metrics), digits = 2)
rmse mape R2
6.89 0.24 0.53

Not so bad, comparable to LM. Do we overfit? Let’s see if a smaller tree will do any better.

boston.tree=prune.tree(big.tree,best=7)
cat('pruned tree size: \n')
## pruned tree size:
print(length(unique(boston.tree$where)))
## [1] 7
#plot the tree
plot(boston.tree,type="uniform")
text(boston.tree,col="blue",label=c("yval"),cex=.8)

#plot the tree
plot(boston.tree,type="uniform")
text(boston.tree,col="blue",label=c("yval"),cex=.8)

boston.fit = predict(boston.tree,testing) #get training fitted values
tree.metrics1 = metrics(boston.fit,testing$medv)
knitr::kable(t(tree.metrics1), digits = 2)
rmse mape R2
6.32 0.22 0.61

We have a simpler model (less leafes) that predicts better. The larger model did overfit! Let’s plit our predictions

plot(testing$lstat,testing$medv,cex=.5,pch=16) #plot data
oo=order(testing$lstat)
lines(testing$lstat[oo],boston.fit[oo],col='red',lwd=3) #step function fit

cvals=c(9.725,4.65,3.325,5.495,16.085,19.9) #cutpoints from tree
for(i in 1:length(cvals)) abline(v=cvals[i],col='magenta',lty=2) #cutpoints

Let’s Fit a regression tree to mev~dis+lstat. Now our we have a two dimensional input and the partitions will be rectangles (possible open) The tree is plotted as well as the corresponding partition of the two-dimensional. Again, let’s start with a big tree

big.tree = tree(medv~lstat+dis,training,mindev=.0001)
cat('first big tree size: \n')
## first big tree size:
print(length(unique(big.tree$where)))
## [1] 63

Then prune it down to one with 7 leaves

boston.tree=prune.tree(big.tree,best=7)
cat('pruned tree size: \n')
## pruned tree size:
print(length(unique(boston.tree$where)))
## [1] 7
qplot(lstat,dis,data=Boston, colour = medv) +
  scale_color_gradient(low="blue", high="red")

Plot tree and partition in x.

par(mfrow=c(1,2))
plot(boston.tree,type="u")
text(boston.tree,col="blue",label=c("yval"),cex=.8)
partition.tree(boston.tree)
lines(training$lstat,training$dis, type='p', pch=16, cex=0.5, col=as.integer(training$medv))

pv=seq(from=.01,to=.99,by=.05)
x1q = quantile(testing$lstat,probs=pv)
x2q = quantile(testing$dis,probs=pv)
xx = expand.grid(x1q,x2q) #matrix with two columns using all combinations of x1q and x2q
dfpred = data.frame(dis=xx[,2],lstat=xx[,1])
lmedpred = predict(boston.tree,dfpred)

#make perspective plot

par(mfrow=c(1,1))
persp(x1q,x2q,matrix(lmedpred,ncol=length(x2q),byrow=T),
      theta=150,xlab='dis',ylab='lstat',zlab='medv',
      zlim=c(min(training$medv),1.1*max(training$medv)))

tree.boston=tree(medv~.,training,mindev=.01)
cv.boston=cv.tree(tree.boston, ); 
plot(cv.boston$size,cv.boston$dev, type='l')

prune.boston=prune.tree(tree.boston,best=6)
#summary(prune.boston)
#cv.tree(tree.boston,,prune.tree)$dev
plot(prune.boston)
text(prune.boston,pretty=0)

boston.fit = predict(prune.boston,testing) #get training fitted values
par(mfrow=c(1,1))
plot(testing$medv,boston.fit, pch=16)

tree1.metrics = metrics(boston.fit,testing$medv)
knitr::kable(t(tree1.metrics), digits = 2)
rmse mape R2
6.57 0.21 0.58

The best model thus far!

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
set.seed(1)
bag.boston=randomForest(medv~.,data=training,mtry=13,importance=TRUE)
bag.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = training, mtry = 13,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.84956
##                     % Var explained: 85.19
yhat.bag = predict(bag.boston,newdata=testing)

par(mfrow=c(1,1))
plot(yhat.bag, testing$medv)
abline(0,1)

sqrt(mean((yhat.bag-testing$medv)^2))
## [1] 4.193147
bag.boston=randomForest(medv~.,data=training,mtry=13,ntree=25)
yhat.bag = predict(bag.boston,newdata=testing)
sqrt(mean((yhat.bag-testing$medv)^2))
## [1] 4.400957
rf.boston=randomForest(medv~.,data=training,mtry=6,importance=TRUE)
yhat.rf = predict(rf.boston,newdata=testing)
rf.metrics = metrics(yhat.rf,testing$medv)
knitr::kable(t(rf.metrics), digits = 2)
rmse mape R2
4.05 0.13 0.84
importance(rf.boston)
##           %IncMSE IncNodePurity
## crim    15.523032    1605.74043
## zn       2.910491      99.40062
## indus   11.228386    1999.87201
## chas     3.390930     120.06602
## nox     15.413441    1434.09818
## rm      40.547099   11570.61633
## age     11.716186     738.76790
## dis     16.588296    1413.14748
## rad      5.199374     133.21436
## tax     12.343244     828.34377
## ptratio 13.203003    1298.15690
## black   10.704904     444.85923
## lstat   33.013070   10102.85634
par(mfrow=c(1,1))
varImpPlot(rf.boston,pch=21,bg="grey")

Boosting

#-------------------------------
library(gbm)
## Loaded gbm 2.1.8.1
set.seed(1)
boost.boston=gbm(medv~.,data=training,distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston)

##             var    rel.inf
## rm           rm 40.9286195
## lstat     lstat 28.2010295
## nox         nox  7.0059821
## dis         dis  6.8971384
## crim       crim  4.4402223
## ptratio ptratio  3.1289338
## black     black  3.0313266
## age         age  2.7663871
## tax         tax  1.4889469
## chas       chas  0.9745309
## indus     indus  0.6928422
## rad         rad  0.2995614
## zn           zn  0.1444793
par(mfrow=c(1,2))
plot(boost.boston,i="rm")

plot(boost.boston,i="lstat")

yhat.boost=predict(boost.boston,newdata=testing,n.trees=5000)
gbm.metrics = metrics(yhat.boost,testing$medv)
knitr::kable(t(gbm.metrics), digits = 2)
rmse mape R2
3.39 0.13 0.83
boost.boston=gbm(medv~.,data=training,distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost=predict(boost.boston,newdata=testing,n.trees=5000)
gbm1.metrics = metrics(yhat.boost,testing$medv)
knitr::kable(t(gbm1.metrics), digits = 2)
rmse mape R2
3.43 0.13 0.82

SVM

dat=Boston
dat$medv=1*(dat$medv>25)+0*(dat$medv<=25)
dat$medv=as.factor(dat$medv)

#--------------------------------
library(e1071)
mod_svm=svm(medv~.,dat)
summary(mod_svm)
## 
## Call:
## svm(formula = medv ~ ., data = dat)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  159
## 
##  ( 86 73 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
par(mfrow=c(1,1))
plot(mod_svm,data=dat,rm~lstat,pch=21,bg="grey")

Logistic regression

mod_logit=glm(medv~.,data=dat,family=binomial())
summary(mod_logit)
## 
## Call:
## glm(formula = medv ~ ., family = binomial(), data = dat)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.312511   4.876070   1.090 0.275930    
## crim        -0.011101   0.045322  -0.245 0.806503    
## zn           0.010917   0.010834   1.008 0.313626    
## indus       -0.110452   0.058740  -1.880 0.060060 .  
## chas         0.966337   0.808960   1.195 0.232266    
## nox         -6.844521   4.483514  -1.527 0.126861    
## rm           1.886872   0.452692   4.168 3.07e-05 ***
## age          0.003491   0.011133   0.314 0.753853    
## dis         -0.589016   0.164013  -3.591 0.000329 ***
## rad          0.318042   0.082623   3.849 0.000118 ***
## tax         -0.010826   0.004036  -2.682 0.007314 ** 
## ptratio     -0.353017   0.122259  -2.887 0.003884 ** 
## black       -0.002264   0.003826  -0.592 0.554105    
## lstat       -0.367355   0.073020  -5.031 4.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 209.11  on 492  degrees of freedom
## AIC: 237.11
## 
## Number of Fisher Scoring iterations: 7
# diagnostic plot
par(mfrow=c(2,2), mar=c(3,3,0,0), bty='n')
plot(mod_logit,pch=21,bg="grey")

Multinomial

dat2=Boston
dat2$medv=0*(dat2$medv<=17)+1*(dat2$medv<=25 & dat2$medv>17)+2*(dat2$medv>25)
dat2$medv=as.factor(dat2$medv)
library(nnet)
mod_multinom=multinom(medv~.,data=dat2)
## # weights:  45 (28 variable)
## initial  value 555.897818 
## iter  10 value 338.289401
## iter  20 value 309.716007
## iter  30 value 211.736278
## iter  40 value 203.333103
## iter  50 value 202.890685
## iter  60 value 202.858211
## final  value 202.858175 
## converged
summary(mod_multinom)
## Call:
## multinom(formula = medv ~ ., data = dat2)
## 
## Coefficients:
##   (Intercept)       crim         zn       indus     chas       nox         rm
## 1    24.19217 -0.2231462 0.04670231  0.10160200 0.343424 -12.91234 -0.1602033
## 2    29.51905 -0.0643685 0.05704767 -0.01761023 1.311033 -18.77936  1.7631714
##           age        dis       rad          tax    ptratio       black
## 1 -0.03926022 -0.6648017 0.1826890 -0.006564069 -0.3770017 0.005904329
## 2 -0.03695548 -1.2169084 0.4612001 -0.017781470 -0.7175312 0.001405510
##        lstat
## 1 -0.2409420
## 2 -0.5899765
## 
## Std. Errors:
##   (Intercept)       crim         zn      indus      chas        nox        rm
## 1  0.09595305 0.07342447 0.04186604 0.05732693 0.7108412 0.17673798 0.3344770
## 2  0.02845394 0.06391464 0.04316645 0.08044716 1.0467007 0.01608296 0.4633493
##          age       dis        rad         tax    ptratio       black      lstat
## 1 0.01544144 0.2200741 0.05386458 0.003231373 0.09660977 0.002003785 0.04387560
## 2 0.01846981 0.2649990 0.08985403 0.005039059 0.13750043 0.004055685 0.08066911
## 
## Residual Deviance: 405.7164 
## AIC: 461.7164
par(mfrow=c(1,1))
pred=predict(mod_multinom,dat2)
plot(dat2$medv,pred,xlab="true class",ylab="predicted class")

Neural Network with Keras library

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = 13) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1) 
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  dense_3 (Dense)                    (None, 64)                      896         
##  dense_2 (Dense)                    (None, 64)                      4160        
##  dense_1 (Dense)                    (None, 32)                      2080        
##  dense (Dense)                      (None, 1)                       33          
## ================================================================================
## Total params: 7,169
## Trainable params: 7,169
## Non-trainable params: 0
## ________________________________________________________________________________
model %>% compile(loss = "mse",optimizer = "RMSprop")

Now let’s train the model

history = model %>% fit(x = as.matrix(scale(training[,1:13])), y = training$medv,verbose = 2, epochs=20, validation_data = list(as.matrix(scale(testing[,1:13])),testing$medv))
yhat = model %>% predict( as.matrix(scale(testing[,1:13])))
plot(history)

plot(testing$medv,yhat, pch=16, cex=0.5)

nn.metrics = metrics(yhat,testing$medv)
knitr::kable(t(nn.metrics), digits = 2)
rmse mape R2
5.3 0.2 0.57
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
library(randomForest)
N = nrow(Boston)
# train PLS
plsr_fit <-plsr(medv~., data=Boston, scale=T, validation="CV")
summary(plsr_fit)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.206    6.534    5.074    4.989    4.922    4.880    4.852
## adjCV        9.206    6.532    5.069    4.981    4.912    4.869    4.844
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       4.843    4.842    4.841     4.841     4.842     4.842     4.842
## adjCV    4.835    4.833    4.832     4.832     4.833     4.833     4.833
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       45.86    56.78    63.91    69.70    75.47    78.72    81.95    84.62
## medv    49.93    70.64    72.30    73.29    73.77    73.92    74.01    74.06
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       90.16     92.40     96.22     98.16    100.00
## medv    74.06     74.06     74.06     74.06     74.06
# use 7 components
plsr_7 <-plsr(formula = medv ~ ., ncomp = 7, data=Boston, scale = T)
PLS_Score <- scores(plsr_7)
# random forest
rf <- randomForest(x=PLS_Score, y=Boston$medv, importance=TRUE, proximity=TRUE)

# goodness of fit
pls_yhat = predict(plsr_7,Boston,ncomp=7)
rf_yhat = rf$predicted

v = mean((Boston$medv - mean(Boston$medv))^2)
pls.metrics = metrics(pls_yhat,Boston$medv); knitr::kable(t(pls.metrics), digits = 2)
rmse mape R2
4.68 0.16 0.74
plsrf.metrics = metrics(rf_yhat,Boston$medv); knitr::kable(t(plsrf.metrics), digits = 2)
rmse mape R2
4.05 0.14 0.81
par(mfrow=c(1,2))
plot(Boston$medv, pls_yhat, 
     xlab='y', ylab = 'yhat', 
     main=paste("PLS Fit, ", "R-squared =", round(pls.metrics[3],2)),
     pch=21,bg="lightblue",
     xlim=c(-5,50),ylim=c(-5,50))
abline(0,1,col="red",lwd=2)

plot(Boston$medv, rf_yhat, 
     xlab='y', ylab = 'yhat',
     main=paste("PLS+RF Fit, ", "R-squared =", round(plsrf.metrics[3],2)),
     pch=21,bg="lightblue",
     xlim=c(-5,50),ylim=c(-5,50))
abline(0,1,col="red",lwd=2)

Summary Table

rmse mape R2
simple lm 7.08 0.23 0.51
simple log-log lm 6.19 0.21 0.62
simple log lm 6.77 0.20 0.55
multiple lm 5.34 0.17 0.72
ridge 5.60 0.16 0.69
lasso 5.72 0.16 0.68
tree model 6.89 0.24 0.53
simpler tree model 6.32 0.22 0.61
random forest 4.05 0.13 0.84
Boosted tree 3.39 0.13 0.83
simplee Boosted tree 3.43 0.13 0.82
neural net 5.30 0.20 0.57
PLS 4.68 0.16 0.74
PLS RF 4.05 0.14 0.81