Newton's Method and Support Vector Machines

Starting from x0 = 2, the Newton iterations end up at x = 10.0188 in the first run below. Note that the actual root of the function x*e^(-x) is x = 0; because the starting value lies to the right of the critical point at x = 1 (where the derivative vanishes), each Newton step moves further from the root, and x = 10.0188 is simply where this run stops.

library(animation)
ani.options(interval=0.5)
newton.method(FUN = function(x) exp(-x)*x, init = 2, rg = c(0, 10))

We have the following:

f(x) = x*e^(-x)
f'(x) = -(x-1)e^(-x)
x(n+1) = x(n) - f(x(n))/f'(x(n))

For this particular f, the ratio simplifies to f(x)/f'(x) = -x/(x-1), so each step is x(n+1) = x(n) + x(n)/(x(n)-1), which is why the iterates keep increasing once x > 1.

First iteration: x1 = 2 - f(2)/f'(2) = 4
Second iteration: x2 = 4 - f(4)/f'(4) = 5.333
Third iteration: x3 = 5.333 - f(5.333)/f'(5.333) = 6.564
Fourth iteration: x4 = 6.564 - f(6.564)/f'(6.564) = 7.74
Fifth iteration: x5 = 7.74 - f(7.74)/f'(7.74) = 8.89
Sixth iteration: x6 = 8.89 - f(8.89)/f'(8.89) = 10.018
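
As a quick check on these hand calculations, here is a minimal sketch of the same update rule in base R (the names f and fp are introduced here for illustration and are not part of the animation package):

f <- function(x) x * exp(-x)           # the function whose root we seek
fp <- function(x) (1 - x) * exp(-x)    # its derivative, -(x - 1)e^(-x)
x <- 2                                 # same starting value as above
for (i in 1:6) {
  x <- x - f(x) / fp(x)                # Newton update
  cat(sprintf("Iteration %d: x = %.4f\n", i, x))
}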

f0=function(u) u*exp(-u)
plot(f0,0,14)
newton.method(FUN = function(x) exp(-x)*x, init = 10.018, rg = c(0, 14))

Here we remove the observations for which the salary information is missing and log-transform the remaining salaries.

library(ISLR)
Hitters = Hitters[-which(is.na(Hitters$Salary)), ]
Hitters$Salary = log(Hitters$Salary)
attach(Hitters)

We create a training and testing set.

train = 1:200
training_data = Hitters[train, ]
testing_data = Hitters[-train, ]

Here we perform boosting on the dataset over a grid of shrinkage values. As the plot below shows, the training MSE decreases roughly exponentially as the shrinkage parameter increases.

library(gbm)
## Loaded gbm 2.1.5
set.seed(100)
powers <-seq(-10, -.3, by = .1)
lambdas <-10^powers
length_lambdas <-  length(lambdas)
train_errors <-  rep(NA, length_lambdas)
test_errors <-  rep(NA, length_lambdas)

for(i in 1:length_lambdas){
  # Fit a boosting model with 1000 trees at the i-th shrinkage value
  boost_hit = gbm(Salary ~ ., data = training_data, distribution = 'gaussian', n.trees = 1000, shrinkage = lambdas[i])
  train.pred = predict(boost_hit, training_data, n.trees = 1000)
  test.pred = predict(boost_hit, testing_data, n.trees = 1000)
  # Record the training and test MSE for this shrinkage value
  train_errors[i] = mean((training_data$Salary - train.pred)^2)
  test_errors[i] = mean((testing_data$Salary - test.pred)^2)
}

plot(lambdas, train_errors, type = 'b', xlab = 'Shrinkage', ylab = 'Train MSE')

Here we plot the shrinkage values against the corresponding test MSEs. The test MSE is minimized at a fairly small shrinkage value, approximately 0.02 (computed below), rather than at the smallest value in the grid.

plot(lambdas, test_errors, type = 'b', xlab = 'Shrinkage', ylab = 'Test MSE')

lambdas[which.min(test_errors)]
## [1] 0.01995262

Here we see that linear regression and the lasso both have higher test MSEs than boosting: about 0.49 and 0.47 from the output below (and ridge regression, fit further down, comes in around 0.46), compared with a minimum test MSE of about 0.26 for boosting.

lm.fit <-  lm(Salary ~ . , data=training_data)
lm.pred <-  predict(lm.fit, testing_data)
mean((testing_data$Salary - lm.pred)^2)
## [1] 0.4917959
library(glmnet)
## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-16
library(randomForest)
## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.
set.seed(134)

x <-  model.matrix(Salary ~ . , data=training_data)
y <-  training_data$Salary
x.test <-  model.matrix(Salary ~ . , data=testing_data)
lasso.fit <-  glmnet(x, y, alpha=1)
lasso.pred <-  predict(lasso.fit, s=0.01, newx=x.test)
mean((testing_data$Salary - lasso.pred)^2)
## [1] 0.4700537

Below we also compute the ridge regression test MSE for comparison, and then refit the boosted model at the best shrinkage value. From the relative-influence summary (and accompanying plot) of that model, the most important variables are CAtBat, CHits, and CWalks.

library(glmnet)
x <-  model.matrix(Salary ~ . , data=training_data)
y <-  training_data$Salary
x.test = model.matrix(Salary~., data = testing_data)

reg.fit = glmnet(x,y, alpha = 0)


reg.pred = predict(reg.fit, s=.05, newx = x.test)

reg.error = mean((testing_data$Salary - reg.pred)^2)
reg.error
## [1] 0.4570283
min(test_errors)
## [1] 0.2587138
best_variable = gbm(Salary ~ ., data = training_data, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[which.min(test_errors)])
summary(best_variable)

##                 var    rel.inf
## CAtBat       CAtBat 25.3505859
## CHits         CHits 10.3304518
## CWalks       CWalks  9.7221593
## CRBI           CRBI  7.2352757
## Years         Years  6.8592164
## CHmRun       CHmRun  6.5930127
## CRuns         CRuns  5.7979424
## Walks         Walks  5.0174458
## PutOuts     PutOuts  4.0458804
## RBI             RBI  3.9456244
## Hits           Hits  3.2526054
## HmRun         HmRun  2.3675782
## AtBat         AtBat  2.2021122
## Errors       Errors  2.1832225
## Assists     Assists  2.0341115
## Runs           Runs  1.6467310
## Division   Division  0.7417301
## NewLeague NewLeague  0.4467493
## League       League  0.2275649

The test MSE for bagging is about 0.23 (below), which is lower than the minimum test MSE of about 0.26 for boosting, so bagging performs slightly better than boosting on this dataset.

bag.hitters = randomForest(Salary ~ ., data = training_data, ntree = 1000, mtry = 19)  # mtry = 19 uses all 19 predictors, i.e. bagging
bag.pred = predict(bag.hitters, testing_data)
mean((bag.pred - testing_data$Salary)^2)
## [1] 0.2287353

We now work with the Auto dataset. We will use support vector machines to see whether we can separate cars by gas mileage. Here we create a binary variable: 1 for cars with mileage above the median and 0 for those at or below it.

library(e1071)
bin = ifelse(Auto$mpg >  median(Auto$mpg), 1, 0)
Auto$level = as.factor(bin)
set.seed(100)

From the summary below, a cost of 1 gives the lowest cross-validation error and the smallest dispersion, while a cost of 1e-03 gives the largest of both.

support.model <- tune(svm, level ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(0.001, 0.01, .1, 1, 5, 10,100)))
summary(support.model)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.007628205 
## 
## - Detailed performance results:
##    cost       error dispersion
## 1 1e-03 0.091923077 0.05438628
## 2 1e-02 0.074038462 0.03915012
## 3 1e-01 0.043333333 0.02708741
## 4 1e+00 0.007628205 0.01228382
## 5 5e+00 0.017820513 0.02105932
## 6 1e+01 0.020384615 0.02353300
## 7 1e+02 0.033205128 0.02720447

We notice that for the polynomial kernel, the cross-validation error is minimized at a cost of 100. (Note that svm's tuning argument for the polynomial degree is named degree, not degrees, so the degrees values in the grid below do not actually change the kernel degree, which is why the error column is identical across them.) For the radial kernel, a cost of 100 and a gamma of 0.01 work best. Between the two nonlinear kernels, the radial basis kernel clearly works better, since its best error is far lower than the polynomial kernel's (0.0128 < 0.3314); the linear kernel tuned above still has the lowest cross-validation error overall (0.0076).

poly.model <- tune(svm, level ~ ., data = Auto, kernel = "polynomial", ranges = list(cost = c(0.001, 0.01, .1, 1, 5, 10,100), degrees = c(2,3,4)))
summary(poly.model)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degrees
##   100       2
## 
## - best performance: 0.3314103 
## 
## - Detailed performance results:
##     cost degrees     error dispersion
## 1  1e-03       2 0.5637179 0.03112361
## 2  1e-02       2 0.5637179 0.03112361
## 3  1e-01       2 0.5637179 0.03112361
## 4  1e+00       2 0.5637179 0.03112361
## 5  5e+00       2 0.5637179 0.03112361
## 6  1e+01       2 0.5637179 0.03112361
## 7  1e+02       2 0.3314103 0.08813630
## 8  1e-03       3 0.5637179 0.03112361
## 9  1e-02       3 0.5637179 0.03112361
## 10 1e-01       3 0.5637179 0.03112361
## 11 1e+00       3 0.5637179 0.03112361
## 12 5e+00       3 0.5637179 0.03112361
## 13 1e+01       3 0.5637179 0.03112361
## 14 1e+02       3 0.3314103 0.08813630
## 15 1e-03       4 0.5637179 0.03112361
## 16 1e-02       4 0.5637179 0.03112361
## 17 1e-01       4 0.5637179 0.03112361
## 18 1e+00       4 0.5637179 0.03112361
## 19 5e+00       4 0.5637179 0.03112361
## 20 1e+01       4 0.5637179 0.03112361
## 21 1e+02       4 0.3314103 0.08813630
rad.model = tune(svm, level ~ ., data = Auto, kernel = "radial", ranges = list(cost = c(0.001, 0.01, .1, 1, 5, 10,100), gamma = c(0.001, 0.01, .1, 1, 5, 10,100)))
summary(rad.model)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##   100  0.01
## 
## - best performance: 0.01275641 
## 
## - Detailed performance results:
##     cost gamma      error dispersion
## 1  1e-03 1e-03 0.56615385 0.03770481
## 2  1e-02 1e-03 0.56615385 0.03770481
## 3  1e-01 1e-03 0.53294872 0.06263870
## 4  1e+00 1e-03 0.09205128 0.05586632
## 5  5e+00 1e-03 0.07410256 0.05343350
## 6  1e+01 1e-03 0.07410256 0.05343350
## 7  1e+02 1e-03 0.02038462 0.01617396
## 8  1e-03 1e-02 0.56615385 0.03770481
## 9  1e-02 1e-02 0.56615385 0.03770481
## 10 1e-01 1e-02 0.08692308 0.05176694
## 11 1e+00 1e-02 0.07410256 0.05343350
## 12 5e+00 1e-02 0.05621795 0.04810877
## 13 1e+01 1e-02 0.02544872 0.01688453
## 14 1e+02 1e-02 0.01275641 0.01344780
## 15 1e-03 1e-01 0.56615385 0.03770481
## 16 1e-02 1e-01 0.20128205 0.08396321
## 17 1e-01 1e-01 0.07666667 0.05552530
## 18 1e+00 1e-01 0.05628205 0.04334690
## 19 5e+00 1e-01 0.02801282 0.01875978
## 20 1e+01 1e-01 0.01782051 0.02088734
## 21 1e+02 1e-01 0.03057692 0.02877571
## 22 1e-03 1e+00 0.56615385 0.03770481
## 23 1e-02 1e+00 0.56615385 0.03770481
## 24 1e-01 1e+00 0.56615385 0.03770481
## 25 1e+00 1e+00 0.06128205 0.04220634
## 26 5e+00 1e+00 0.06641026 0.04393632
## 27 1e+01 1e+00 0.06641026 0.04393632
## 28 1e+02 1e+00 0.06641026 0.04393632
## 29 1e-03 5e+00 0.56615385 0.03770481
## 30 1e-02 5e+00 0.56615385 0.03770481
## 31 1e-01 5e+00 0.56615385 0.03770481
## 32 1e+00 5e+00 0.51269231 0.03921950
## 33 5e+00 5e+00 0.50756410 0.04066436
## 34 1e+01 5e+00 0.50756410 0.04066436
## 35 1e+02 5e+00 0.50756410 0.04066436
## 36 1e-03 1e+01 0.56615385 0.03770481
## 37 1e-02 1e+01 0.56615385 0.03770481
## 38 1e-01 1e+01 0.56615385 0.03770481
## 39 1e+00 1e+01 0.54076923 0.04310545
## 40 5e+00 1e+01 0.53307692 0.04687257
## 41 1e+01 1e+01 0.53307692 0.04687257
## 42 1e+02 1e+01 0.53307692 0.04687257
## 43 1e-03 1e+02 0.56615385 0.03770481
## 44 1e-02 1e+02 0.56615385 0.03770481
## 45 1e-01 1e+02 0.56615385 0.03770481
## 46 1e+00 1e+02 0.56615385 0.03770481
## 47 5e+00 1e+02 0.56615385 0.03770481
## 48 1e+01 1e+02 0.56615385 0.03770481
## 49 1e+02 1e+02 0.56615385 0.03770481
linear <- svm(level ~ ., data = Auto, kernel = "linear", cost = 1)
polynomial.svm <- svm(level ~ ., data = Auto, kernel = "polynomial", cost = 100, degree = 2)
radial.svm <- svm(level ~ ., data = Auto, kernel = "radial", cost = 100, gamma = 0.01)
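
As a rough comparison of the three fitted models, here is a minimal sketch that computes their in-sample accuracy on the full Auto data (train_acc is a helper name introduced here for illustration; because this is training accuracy, it will be optimistic relative to the cross-validation errors above):

# Hypothetical helper: fraction of training observations classified correctly
train_acc <- function(fit) mean(predict(fit, Auto) == Auto$level)
sapply(list(linear = linear, polynomial = polynomial.svm, radial = radial.svm), train_acc)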

# For each predictor other than mpg, level, and name, plot the fitted SVM's
# classification regions in the (mpg, predictor) plane
plotpairs = function(fit) {
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "level", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}
plotpairs(linear)
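
The same helper can be applied to the other two fits to compare their decision boundaries across predictors:

plotpairs(polynomial.svm)
plotpairs(radial.svm)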