library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.3
x <- Hitters
x <- x[!is.na(x$Salary), ]
x$Salary <- log(x$Salary)
train <- 1:200
test <- setdiff(1:nrow(x), train)
x.train <- x[train,]
x.test <- x[test,]
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(290923)
lambdas <- 10 ^ seq(-3, 0, by = 0.1)
fits <- lapply(lambdas, function(lam) {
gbm(Salary ~ ., data = x.train, distribution = "gaussian",
n.trees = 1000, shrinkage = lam)
})
errs <- sapply(fits, function(fit) {
p <- predict(fit, x.train, n.trees = 1000)
mean((p - x.train$Salary)^2)
})
plot(lambdas, errs, type = "b", xlab = "Shrinkage values",
ylab = "Training MSE", log = "xy")
The plot shows that as the shrinkage value (λ) increases, the model’s
error on the training set (MSE) decreases. This means that with higher
λ, the model fits the training data more closely. However, using too
high a λ might lead to overfitting, so it’s best to find a balance by
testing on validation data.
errs <- sapply(fits, function(fit) {
p <- predict(fit, x.test, n.trees = 1000)
mean((p - x.test$Salary)^2)
})
plot(lambdas, errs, type = "b", xlab = "Shrinkage values",
ylab = "Training MSE", log = "xy")
abline(v = lambdas[which.min(errs)], lty = 2, col = "darkred")
min(errs)
## [1] 0.2478676
This plot shows that the training MSE decreases as the shrinkage value (λ) increases up to around 0.1. After that, the MSE starts to increase again. The red line at 0.1 suggests this is the optimal shrinkage value, as it minimizes the training MSE before overfitting likely begins at higher values.
#Linear Regression
fit1 <- lm(Salary ~ ., data = x.train)
mean((predict(fit1, x.test) - x.test$Salary)^2)
## [1] 0.4917959
#Ridge Regression
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.2
## Loaded glmnet 4.1-8
a <- model.matrix(Salary ~ ., data = x.train)
a.test <- model.matrix(Salary ~ ., data = x.test)
b <- x[train, "Salary"]
fit2 <- glmnet(a, b, alpha = 1)
mean((predict(fit2, s = 0.1, newx = a.test) - x[test, "Salary"])^2)
## [1] 0.4389054
The test MSE for Linear Regression is 0.4918, and for Ridge Regression, it’s 0.4389. This value is outpeformed by boosting (shrinkage) at MSE 0.2478676. Lower values indicate better performance.
summary(fits[[which.min(errs)]])
In the boosted model, the most important predictor is CAtBat, with a relative influence of 20.13, meaning it has the strongest impact on the model’s predictions. This is followed by PutOuts, with an influence of 8.75, and CWalks, with an influence of 8.60. These three variables are the top factors influencing the model, making them key predictors in this analysis.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
set.seed(290923)
bagged <- randomForest(Salary ~ ., data = x.train, mtry = 19, ntree = 1000)
mean((predict(bagged, newdata = x[test,]) - x[test,"Salary"])^2)
## [1] 0.2308843
Applying bagging to the training set results in a test set MSE of 0.2309, which is lower than the MSE from the boosting (or shrinkage) approach, which is 0.2479. This means that bagging performs better on this test set, as it produces more accurate predictions with less error compared to boosting. In simple terms, bagging reduces the prediction error more effectively in this case.
We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features. ## a.Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:
set.seed(290923)
train <- data.frame(
x1 = runif(500) - 0.5,
x2 = runif(500) - 0.5
)
train$y <- factor(as.numeric((train$x1^2 - train$x2^2 > 0)))
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
p <- ggplot(train, aes(x = x1, y = x2, color = y)) +
geom_point(size = 2)
p
fit1 <- glm(y ~ ., data = train, family = "binomial")
plot_model <- function(fit) {
if (inherits(fit, "svm")) {
train$p <- predict(fit)
} else {
train$p <- factor(as.numeric(predict(fit) > 0))
}
ggplot(train, aes(x = x1, y = x2, color = p)) +
geom_point(size = 2)
}
plot_model(fit1)
fit2 <- glm(y ~ poly(x1, 2) + poly(x2, 2), data = train, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plot_model(fit2)
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
fit3 <- svm(y ~ x1 + x2, data = train, kernel = "linear")
plot_model(fit3)
fit4 <- svm(y ~ x1 + x2, data = train, kernel = "polynomial", degree = 2)
plot_model(fit4)
## i. Comment on your results.
cat("When simulating data with a quadratic decision boundary, both a logistic regression model with quadratic transformations of the variables and an SVM model with a quadratic kernel provide significantly better and similar fits compared to standard linear methods.")
## When simulating data with a quadratic decision boundary, both a logistic regression model with quadratic transformations of the variables and an SVM model with a quadratic kernel provide significantly better and similar fits compared to standard linear methods.