PROBLEM 1 (TREE BASED METHOD)

a. Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.3
x <- Hitters
x <- x[!is.na(x$Salary), ]
x$Salary <- log(x$Salary)

b. Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.

train <- 1:200
test <- setdiff(1:nrow(x), train)
x.train <- x[train,]
x.test <- x[test,]

c. Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(290923)
lambdas <- 10 ^ seq(-3, 0, by = 0.1)
fits <- lapply(lambdas, function(lam) {
  gbm(Salary ~ ., data = x.train, distribution = "gaussian", 
    n.trees = 1000, shrinkage = lam)
})
errs <- sapply(fits, function(fit) {
  p <- predict(fit, x.train, n.trees = 1000)
  mean((p - x.train$Salary)^2)
})
plot(lambdas, errs, type = "b", xlab = "Shrinkage values", 
  ylab = "Training MSE", log = "xy")

The plot shows that as the shrinkage value (λ) increases, the model’s error on the training set (MSE) decreases. This means that with higher λ, the model fits the training data more closely. However, using too high a λ might lead to overfitting, so it’s best to find a balance by testing on validation data.

d. Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

errs <- sapply(fits, function(fit) {
  p <- predict(fit, x.test, n.trees = 1000)
  mean((p - x.test$Salary)^2)
})
plot(lambdas, errs, type = "b", xlab = "Shrinkage values", 
  ylab = "Training MSE", log = "xy")
abline(v = lambdas[which.min(errs)], lty = 2, col = "darkred")

min(errs)
## [1] 0.2478676

This plot shows that the training MSE decreases as the shrinkage value (λ) increases up to around 0.1. After that, the MSE starts to increase again. The red line at 0.1 suggests this is the optimal shrinkage value, as it minimizes the training MSE before overfitting likely begins at higher values.

e. Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

#Linear Regression
fit1 <- lm(Salary ~ ., data = x.train)
mean((predict(fit1, x.test) - x.test$Salary)^2)
## [1] 0.4917959
#Ridge Regression
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.2
## Loaded glmnet 4.1-8
a <- model.matrix(Salary ~ ., data = x.train)
a.test <- model.matrix(Salary ~ ., data = x.test)
b <- x[train, "Salary"]
fit2 <- glmnet(a, b, alpha = 1)
mean((predict(fit2, s = 0.1, newx = a.test) - x[test, "Salary"])^2)
## [1] 0.4389054

The test MSE for Linear Regression is 0.4918, and for Ridge Regression, it’s 0.4389. This value is outpeformed by boosting (shrinkage) at MSE 0.2478676. Lower values indicate better performance.

f. Which variables appear to be the most important predictors in the boosted model?

summary(fits[[which.min(errs)]])

In the boosted model, the most important predictor is CAtBat, with a relative influence of 20.13, meaning it has the strongest impact on the model’s predictions. This is followed by PutOuts, with an influence of 8.75, and CWalks, with an influence of 8.60. These three variables are the top factors influencing the model, making them key predictors in this analysis.

g. Now apply bagging to the training set. What is the test set MSE for this approach?

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
set.seed(290923)
bagged <- randomForest(Salary ~ ., data = x.train, mtry = 19, ntree = 1000)
mean((predict(bagged, newdata = x[test,]) - x[test,"Salary"])^2)
## [1] 0.2308843

Applying bagging to the training set results in a test set MSE of 0.2309, which is lower than the MSE from the boosting (or shrinkage) approach, which is 0.2479. This means that bagging performs better on this test set, as it produces more accurate predictions with less error compared to boosting. In simple terms, bagging reduces the prediction error more effectively in this case.

PROBLEM 2 (SUPPORT VECTOR MACHINE)

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features. ## a.Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:

set.seed(290923)
train <- data.frame(
  x1 = runif(500) - 0.5,
  x2 = runif(500) - 0.5
)
train$y <- factor(as.numeric((train$x1^2 - train$x2^2 > 0)))

b. Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
p <- ggplot(train, aes(x = x1, y = x2, color = y)) + 
  geom_point(size = 2)
p

c. Fit a logistic regression model to the data, using X1 and X2 as predictors.

fit1 <- glm(y ~ ., data = train, family = "binomial")

d. Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.

plot_model <- function(fit) {
  if (inherits(fit, "svm")) {
    train$p <- predict(fit)
  } else {
    train$p <- factor(as.numeric(predict(fit) > 0))
  }
  ggplot(train, aes(x = x1, y = x2, color = p)) + 
    geom_point(size = 2)
}

plot_model(fit1)

e. Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors.

fit2 <- glm(y ~ poly(x1, 2) + poly(x2, 2), data = train, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

f. Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

plot_model(fit2)

g. it a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
fit3 <- svm(y ~ x1 + x2, data = train, kernel = "linear")
plot_model(fit3)

h. Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

fit4 <- svm(y ~ x1 + x2, data = train, kernel = "polynomial", degree = 2)
plot_model(fit4)

## i. Comment on your results.

cat("When simulating data with a quadratic decision boundary, both a logistic regression model with quadratic transformations of the variables and an SVM model with a quadratic kernel provide significantly better and similar fits compared to standard linear methods.")
## When simulating data with a quadratic decision boundary, both a logistic regression model with quadratic transformations of the variables and an SVM model with a quadratic kernel provide significantly better and similar fits compared to standard linear methods.