DATA 624 - Non-Linear Regression

Zach Herold, Anthony Pagan, Betsy Rosalen

April 21, 2020

Linear Regression Review

Linear Regression model equations can be written either directly or indirectly in the form:

\[y_i = b_0 + b_1x_{i1} + b_2x_{i2} + ... + b_Px_{iP} + e_i\]

Where \(y_i\) is the outcome for sample \(i\), \(b_0\) is the intercept, \(b_1, ..., b_P\) are the regression coefficients, \(x_{i1}, ..., x_{iP}\) are the predictor values, and \(e_i\) is random error not explained by the model.

Linear Regression models

Goal: to minimize the sum of squared errors (SSE) or a function of the sum of squared errors
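For reference, SSE is the sum over all samples of the squared differences between the observed and predicted responses:

\[SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2\]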

Minimize SSE

OLS - Ordinary Least Squares

PLS - Partial Least Squares

Minimize a function of the SSE

Penalized Models

Linear Regression Pros and Cons

Advantages:

Disadvantages:



What to do when you suspect there is a non-linear relationship but don’t know the nature of the non-linearity?

Non-Linear Regression

Non-linear regression equations take the form:

\[y = f(x,\beta) + \varepsilon\]

Where \(f\) is a non-linear function of the predictors \(x\) and the parameters \(\beta\), and \(\varepsilon\) is random error.



Any model equation that cannot be written in the linear form \(y_i = b_0 + b_1x_{i1} + b_2x_{i2} + ... + b_Px_{iP} + e_i\) is non-linear!
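For instance, an exponential growth model such as \(y = b_1 e^{b_2 x} + \varepsilon\) cannot be rearranged into that linear form. A minimal sketch of fitting such a model with nls() (the simulated data and starting values below are illustrative assumptions, not part of the original slides):

# Illustrative only: simulate data from an exponential model and fit it with nls()
set.seed(123)
x <- seq(1, 10, length.out = 50)
y <- 2 * exp(0.3 * x) + rnorm(50, sd = 1)
fit <- nls(y ~ b1 * exp(b2 * x), start = list(b1 = 1, b2 = 0.1))
summary(fit)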

Non-Linear Regression Pros and Cons

Advantages:

Disadvantages:

Non-Linear Regression Models

Goal: to find the (non-linear) curve that comes closest to your data

Neural Networks (NN)

Inspired by theories about how the brain works and the interconnectedness of the neurons in the brain

Multivariate Adaptive Regression Splines (MARS)

Splits each predictor into two groups using a “hinge” or “hockey-stick” function and models linear relationships to the outcome for each group separately

K-Nearest Neighbors (KNN)

Predicts new samples by finding the closest samples in the training set predictor space and taking the mean of their response values. K represents the number of closest samples to consider.

Support Vector Machines (SVM)

Robust regression that aims to minimize the effect of outliers with an approach that uses a subset of the training data points called “support vectors” to predict new values, and counter-intuitively excludes data points closest to the regression line from the prediction equation.

Tree-Based Models

Subject of the next presentation, so stay tuned!

Neural Networks

Description

Ref: https://www.youtube.com/watch?v=oYbVFhK_olY

Neural Networks

Calculation
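The calculation details from this slide are not reproduced here; for a single-hidden-layer network such as the one fit on the next slide, the standard formulation is:

\[h_k(x) = g\left(\beta_{0k} + \sum_{j=1}^{P} \beta_{jk}\,x_j\right), \qquad g(u) = \frac{1}{1+e^{-u}}\]

\[\hat{y} = \gamma_0 + \sum_{k=1}^{H} \gamma_k\, h_k(x)\]

Each hidden unit \(h_k\) is a linear combination of the predictors passed through the sigmoid \(g\), and the prediction is a combination of the hidden units. With linear.output = FALSE, as in the code that follows, the output layer is also passed through \(g\); the same sigmoid appears in the manual computation two slides ahead.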

Neural Networks

Computing examples

# Create the model: one hidden node, logistic output, trained 3 times (rep = 3)
library(neuralnet)
nn <- neuralnet(q03_symptoms ~ ., data = training, hidden = 1, linear.output = FALSE, rep = 3)

Ref: https://www.youtube.com/watch?v=-Vs9Vae2KI0

Neural Networks

Computing examples

# Predict: obtain fitted values from the network, excluding column 3 of the training data
output <- compute(nn, training[-3])
head(output$net.result)
##        [,1]       [,2]
## 1 0.9781103 0.01241033
## 2 0.9787063 0.01201374
## 3 0.9787063 0.01201374
## 4 0.9787063 0.01201374
## 6 0.9815269 0.01016525
## 7 0.9787063 0.01201374
# Manually trace the logistic activations using the network's weights (per the referenced video)
in1 <- nn$weights[[2]][[1]][1] + sum(nn$weights[[2]][[1]] * nn$startweights[[2]][[1]])
out <- 1 / (1 + exp(-in1))    # sigmoid activation of the hidden node
in2 <- nn$weights[[2]][[2]][1] + sum(nn$weights[[2]][[2]][2] * out)
out2 <- 1 / (1 + exp(-in2))   # sigmoid activation of the output node

Neural Networks

Pros and Cons

Multivariate Adaptive Regression Splines

Description

Multivariate Adaptive Regression Splines

Description
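The description bullets are not reproduced in this extract; as a minimal hedged sketch of the hinge-function idea described earlier, a MARS model can be fit with the earth package (the choice of the Wage data here is an illustrative assumption):

# Hedged sketch (not from the original slides): fit a MARS model with the earth package
library(earth)
library(ISLR)                         # Wage data, also used in the spline examples
marsFit <- earth(wage ~ age + year, data = Wage)
summary(marsFit)                      # lists the selected hinge terms h(x - knot)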

Splines

Description

Ref: https://www.youtube.com/watch?v=UDDXkffB-aE&t=329s

Splines

Computing examples Splines

SPLINES

## Call:
## smooth.spline(x = age, y = wage, cv = TRUE)
## 
## Smoothing Parameter  spar= 0.6988943  lambda= 0.02792303 (12 iterations)
## Equivalent Degrees of Freedom (Df): 6.794596
## Penalized Criterion (RSS): 75215.9
## PRESS(l.o.o. CV): 1593.383

Ref: https://www.youtube.com/watch?v=u-rVXhsFyxo&t=450s
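The code producing the smoothing-spline output above is not shown in this extract; judging from the printed call, it was presumably along these lines (loading the Wage data from ISLR is an assumption):

library(ISLR)
attach(Wage)                                         # exposes age and wage directly
fit <- smooth.spline(x = age, y = wage, cv = TRUE)   # leave-one-out CV selects lambda
fit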

Splines

Computing examples GAM

GAM
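The code for this slide is not reproduced; a plausible sketch of a smoothing-spline GAM on the Wage data using the gam package (an assumption, modeled on the ANOVA comparison two slides later):

library(gam)
library(ISLR)
gam1 <- gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage)
summary(gam1)
par(mfrow = c(1, 3))
plot(gam1, se = TRUE)    # partial effect of each term with standard-error bands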

Splines

Computing examples GAM

GAM with Natural Splines
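The natural-spline code is likewise not shown; one hedged sketch uses ns() from the splines package, since natural splines are just basis expansions that an ordinary linear model can fit directly:

library(splines)
library(ISLR)
gam.ns <- lm(wage ~ ns(age, df = 4) + ns(year, df = 4) + education, data = Wage)
summary(gam.ns)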

Splines

Computing examples GAM

GAM compare with ANOVA

gam2 <- gam(I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage, family = binomial)  # GAM for logistic regression
gam2a <- gam(I(wage > 250) ~ s(age, df = 4) + year + education, data = Wage, family = binomial)            # year entered linearly
anova(gam2a, gam2)  # compare the two GAMs with ANOVA
## Analysis of Deviance Table
## 
## Model 1: I(wage > 250) ~ s(age, df = 4) + year + education
## Model 2: I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      2990     603.78                     
## 2      2987     602.87  3  0.90498   0.8242

K-Nearest Neighbors

K-Nearest Neighbors

Coded Examples - Iris Dataset

# Split the iris data into 75% training and 25% test sets
data(iris)
index <- 1:nrow(iris)
test.data.index <- sample(index, trunc(length(index) / 4))
test.data <- iris[test.data.index, ]
train.data <- iris[-test.data.index, ]

# Predictor matrices and integer class labels
training <- as.matrix(train.data[, 1:4])
rownames(training) <- NULL
train.labels <- as.integer(train.data[, 5])

test <- as.matrix(test.data[, 1:4])
rownames(test) <- NULL
test.labels <- as.integer(test.data[, 5])

K-Nearest Neighbors

Overview

K-Nearest Neighbors

Number of Neighbors

If k = 1, the new instance is simply assigned the class of its single nearest neighbor.

A small k may lead to over-fitting, while a large k may lead to under-fitting. A suitable value of k can be chosen with cross-validation or bootstrapping, as sketched below.
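As one hedged example (not from the original slides), caret's train() can tune k over a grid by cross-validation using the iris training data built earlier:

library(caret)
set.seed(100)
knnTune <- train(x = training, y = factor(train.labels),
                 method = "knn",
                 tuneGrid = data.frame(k = 1:15),
                 trControl = trainControl(method = "cv", number = 10))
knnTune$bestTune     # the k with the best cross-validated accuracy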

K-Nearest Neighbors

Similarity/ Distance Metrics

K-Nearest Neighbors

knnVCN: k-Nearest Neighbor Classification with a versatile choice of distance; method can be "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski"

knnMCN: Mahalanobis Distance

library(knnGarden)
library(caret)
knnMod <- knnVCN(TrnX=training,
                 OrigTrnG=train.labels,
                 TstX=test,
                 ShowObs=TRUE,
                 K=5,
                 method="minkowski",
                 p = 3)

K-Nearest Neighbors

Iris Dataset - Perfect Accuracy

# Convert the predicted class memberships to integer labels so they match test.labels
test.predicted <- knnMod$TstXIBelong
test.predicted <- as.integer(ifelse(test.predicted == 1, 1,
                                    ifelse(test.predicted == 2, 2, 3)))
xtab <- table(test.predicted, test.labels)
confusionMatrix(xtab)
## Confusion Matrix and Statistics
## 
##               test.labels
## test.predicted  1  2  3
##              1 14  0  0
##              2  0  8  1
##              3  0  1 13
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9459          
##                  95% CI : (0.8181, 0.9934)
##     No Information Rate : 0.3784          
##     P-Value [Acc > NIR] : 4.494e-13       
##                                           
##                   Kappa : 0.9174          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   0.8889   0.9286
## Specificity            1.0000   0.9643   0.9565
## Pos Pred Value         1.0000   0.8889   0.9286
## Neg Pred Value         1.0000   0.9643   0.9565
## Prevalence             0.3784   0.2432   0.3784
## Detection Rate         0.3784   0.2162   0.3514
## Detection Prevalence   0.3784   0.2432   0.3784
## Balanced Accuracy      1.0000   0.9266   0.9425

K-Nearest Neighbors

Advantages & Disadvantages

K-Nearest Neighbors

Other Packages

Support Vector Machines

Support Vector Machines

Overview

Support Vector Machines

SVM Applications

Support Vector Machines

Origins

Support Vector Machines

Soft vs. Hard Margins

Allow some misclassification by introducing a slack penalty variable (\(\xi\)).

Support Vector Machines

Cost Penalty

The slack variable is regulated by the cost hyperparameter \(C\).
- When \(C\) is small (near 0), misclassification is cheap and the decision boundary is less complex.
- When \(C\) approaches infinity, the algorithm cannot afford to misclassify a single data point, so the boundary becomes more complex (overfitting).
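For reference, the standard soft-margin objective makes the roles of \(\xi\) and \(C\) explicit:

\[\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0\]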

Support Vector Machines

SVM with Low Cost Parameter

To create a soft margin and allow some misclassification, we use an SVM model with small cost (C = 1)

library(e1071)
iris.subset <- iris[iris$Species %in% c("setosa","virginica"),][c("Sepal.Length", "Sepal.Width", "Species")]
svm.model = svm(Species ~ ., data=iris.subset, kernel='linear', cost=1, scale=FALSE)
plot(x=iris.subset$Sepal.Length,y=iris.subset$Sepal.Width, col=iris.subset$Species, pch=19)
points(iris.subset[svm.model$index,c(1,2)],col="blue",cex=2) #The index of the resulting support vectors in the data matrix.
w = t(svm.model$coefs) %*% svm.model$SV
b = -svm.model$rho
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="red", lty=5)

Support Vector Machines

Support vectors circled with separation line (Low Cost Parameter)

To create a soft margin and allow some misclassification, we use an SVM model with small cost (C = 1)

Support Vector Machines

SVM with High Cost Parameter

svm.model = svm(Species ~ ., data=iris.subset, type='C-classification', kernel='linear', cost=10000, scale=FALSE)
plot(x=iris.subset$Sepal.Length,y=iris.subset$Sepal.Width, col=iris.subset$Species, pch=19)
points(iris.subset[svm.model$index,c(1,2)],col="blue",cex=2)
w = t(svm.model$coefs) %*% svm.model$SV
b = -svm.model$rho #The negative intercept.
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="red", lty=5)

Support Vector Machines

Support vectors circled with separation line (High Cost Parameter)

Support Vector Machines

SVM Classification Plot

Within the scatter plot, the X symbols show the support vectors and the O symbols represent the other data points; these symbols can be changed through the svSymbol and dataSymbol options. Both the support vectors and the true classes are highlighted and colored by label (green refers to virginica, red to versicolor, and black to setosa). The last argument, slice, is set when there are more than two predictors: here the remaining variables, Sepal.Width and Sepal.Length, are held constant at 3 and 4, respectively.
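The plotting call that this description refers to is not shown in the extract; it was presumably along these lines (fitting on the full iris data is an assumption):

# Assumed call corresponding to the description above
model <- svm(Species ~ ., data = iris)
plot(model, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4),
     svSymbol = "x", dataSymbol = "o")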

Support Vector Machines

Non-Linear Cases – Theory

Cover’s Theorem (Thomas M. Cover, 1965): given any random set of finite points, with high probability these points can be made linearly separable by mapping them to a higher-dimensional space.
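A small illustration of the idea (not from the original slides): points that are not linearly separable in one dimension become separable after adding a quadratic feature:

# Toy example: labels that depend on |x| are not separable by one threshold in x,
# but adding the feature x^2 makes them linearly separable (x^2 > 6 identifies class "a")
x <- c(-3, -2, -1, 1, 2, 3)
class <- factor(c("a", "b", "b", "b", "b", "a"))
data.frame(x = x, x2 = x^2, class = class)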

Support Vector Machines

Kernel Trick

All kernel functions take two feature vectors as arguments and return their scalar (inner) product in some transformed feature space. A valid kernel is symmetric and positive semi-definite.

By performing convex quadratic optimization, the algorithm can be rewritten so that it depends only on these inner products and is independent of the transforming function \(\phi\).
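A tiny numeric check (illustrative, not from the slides): the degree-2 polynomial kernel \(k(x,z) = (x \cdot z)^2\) equals the dot product after the explicit feature map \(\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\,x_1x_2)\), so \(\phi\) never needs to be computed:

x <- c(1, 2); z <- c(3, 4)
sum(x * z)^2                                                  # kernel evaluation: 121
phi <- function(v) c(v[1]^2, v[2]^2, sqrt(2) * v[1] * v[2])   # explicit feature map
sum(phi(x) * phi(z))                                          # dot product in feature space: also 121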

Support Vector Machines

Choice of Kernel in kernlab (I)

Support Vector Machines

Choice of Kernel in kernlab (II)
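The kernel tables on these slides are not reproduced; as a hedged sketch, kernlab's ksvm() selects the kernel by name (e.g. rbfdot, polydot, vanilladot, tanhdot, laplacedot), with kernel parameters passed through kpar:

library(kernlab)
# Gaussian radial basis kernel, sigma estimated automatically
svmRBF  <- ksvm(Species ~ ., data = iris, kernel = "rbfdot", kpar = "automatic", C = 1)
# Quadratic polynomial kernel
svmPoly <- ksvm(Species ~ ., data = iris, kernel = "polydot", kpar = list(degree = 2), C = 1)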

Support Vector Machines

Model Tuning (Cost & Gamma)

# Grid search over gamma and cost (10-fold cross-validation by default)
tuned = tune.svm(Species~., data = iris.subset, gamma = 10^(-6:-1), cost = 10^(0:2))
# Refit the SVM using the best parameter combination found
model.tuned = svm(Species~., data = iris.subset, gamma = tuned$best.parameters$gamma, cost = tuned$best.parameters$cost)

Support Vector Machines

Advantages & Disadvantages