#Exercise 1
library(ISLR2)
dat <- Hitters
dat <- dat[!is.na(dat$Salary), ]
dat$Salary <- log(dat$Salary)
train <- 1:200
test <- setdiff(1:nrow(dat), train)
library(gbm)
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(29)
lambdas <- 10 ^ seq(-3, 0, by = 0.1)
fits <- lapply(lambdas, function(lam) {
gbm(Salary ~ ., data = dat[train, ], distribution = "gaussian",
n.trees = 1000, shrinkage = lam)
})
errs <- sapply(fits, function(fit) {
p <- predict(fit, dat[train, ], n.trees = 1000)
mean((p - dat[train, ]$Salary)^2)
})
plot(lambdas, errs, type = "b", xlab = "Shrinkage values",
ylab = "Training MSE", log = "xy")
This plot shows that the training MSE decreases monotonically as the
shrinkage parameter λ increases. A larger shrinkage value lets each
individual tree contribute more to the ensemble, so with the number of
trees fixed at 1,000 the model fits the training data more and more closely.
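A minimal sketch of the interplay between shrinkage and the number of trees
(the object name fit_small and the tree counts are illustrative additions,
not part of the analysis above): at the smallest shrinkage value, adding
trees beyond 1,000 should keep lowering the training MSE, which is why the
left end of the curve sits so high when the ensemble is capped at 1,000 trees.
fit_small <- gbm(Salary ~ ., data = dat[train, ], distribution = "gaussian",
                 n.trees = 10000, shrinkage = 0.001)
mean((predict(fit_small, dat[train, ], n.trees = 1000) - dat[train, "Salary"])^2)
mean((predict(fit_small, dat[train, ], n.trees = 10000) - dat[train, "Salary"])^2)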
errs <- sapply(fits, function(fit) {
p <- predict(fit, dat[test, ], n.trees = 1000)
mean((p - dat[test, ]$Salary)^2)
})
plot(lambdas, errs, type = "b", xlab = "Shrinkage values",
ylab = "Test MSE", log = "xy")
min(errs)
## [1] 0.24784
abline(v = lambdas[which.min(errs)], lty = 2, col = "purple")
The plot shows a U-shaped curve: the test MSE initially decreases as the
shrinkage parameter λ increases, reaches its minimum at a shrinkage value
of approximately 0.05, and then rises again as λ continues to grow. The
vertical dashed purple line marks the optimal λ, corresponding to the
lowest test MSE (0.24784).
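As an aside, the number of trees could also be tuned jointly with the
shrinkage instead of being fixed at 1,000. A sketch using gbm's built-in
cross-validation (the object names fit_cv and best_trees, the 5 folds, and
the shrinkage of 0.05 are illustrative choices, not part of the analysis above):
set.seed(29)
fit_cv <- gbm(Salary ~ ., data = dat[train, ], distribution = "gaussian",
              n.trees = 5000, shrinkage = 0.05, cv.folds = 5)
best_trees <- gbm.perf(fit_cv, method = "cv")  # CV-optimal number of trees
mean((predict(fit_cv, dat[test, ], n.trees = best_trees) - dat[test, "Salary"])^2)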
#Linear regression
fit1 <- lm(Salary ~ ., data = dat[train, ])
mean((predict(fit1, dat[test, ]) - dat[test, "Salary"])^2)
## [1] 0.4917959
#Lasso regression
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
x <- model.matrix(Salary ~ ., data = dat[train, ])
x.test <- model.matrix(Salary ~ ., data = dat[test, ])
y <- dat[train, "Salary"]
fit2 <- glmnet(x, y, alpha = 1)  # alpha = 1 fits the lasso; alpha = 0 would give ridge
mean((predict(fit2, s = 0.1, newx = x.test) - dat[test, "Salary"])^2)
## [1] 0.4389054
Boosting outperforms both linear regression and the lasso, achieving the lowest test MSE (0.2478) and showing its strength in capturing complex patterns in the data. The lasso has a moderate test MSE (0.4389), better than linear regression but not as good as boosting. Linear regression has the highest test MSE (0.4918), suggesting that a purely linear fit is too simple for this dataset.
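The penalty value s = 0.1 above was fixed by hand. A sketch of choosing it
by cross-validation instead, keeping alpha = 1 for the lasso (the object
name cvfit is illustrative, and the selected penalty will depend on the seed):
set.seed(29)
cvfit <- cv.glmnet(x, y, alpha = 1)
mean((predict(cvfit, s = "lambda.min", newx = x.test) - dat[test, "Salary"])^2)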
summary(fits[[which.min(errs)]])
## var rel.inf
## CAtBat CAtBat 18.0052088
## CWalks CWalks 10.6105789
## PutOuts PutOuts 8.8459762
## Walks Walks 7.5640207
## CRBI CRBI 7.4330152
## CRuns CRuns 6.8383985
## CHmRun CHmRun 6.2971195
## Years Years 5.7960458
## Assists Assists 4.8827027
## Hits Hits 4.3062590
## RBI RBI 3.9166927
## HmRun HmRun 3.5227552
## Runs Runs 3.3611588
## AtBat AtBat 3.1765507
## Errors Errors 2.3880466
## CHits CHits 2.0523113
## NewLeague NewLeague 0.4732075
## Division Division 0.3374449
## League League 0.1925070
With the highest relative influence (18.01), CAtBat (career at-bats) is the most important predictor in the model: the cumulative number of times a player has been at bat over his career is a key factor in determining (log) salary.
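To see the direction of that effect, a partial-dependence plot can be drawn
from the selected boosted model (the object name best is illustrative;
plot.gbm marginalises over the remaining predictors):
best <- fits[[which.min(errs)]]
plot(best, i.var = "CAtBat", n.trees = 1000)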
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
set.seed(29)
bagged <- randomForest(Salary ~ ., data = dat[train, ], mtry = 19, ntree = 1000)
mean((predict(bagged, newdata = dat[test, ]) - dat[test, "Salary"])^2)
## [1] 0.2283885
The bagging model achieves a test MSE of 0.2284, which is slightly better than the boosting model’s test MSE of 0.2478. This suggests that bagging performs marginally better for this dataset, possibly due to its ability to reduce variance by averaging multiple tree models.
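For completeness, a random forest with a smaller mtry would also decorrelate
the trees; the sketch below uses mtry = 6 (roughly one third of the 19
predictors, the usual regression default) and the illustrative object name rf,
neither of which is part of the analysis above.
set.seed(29)
rf <- randomForest(Salary ~ ., data = dat[train, ], mtry = 6, ntree = 1000)
mean((predict(rf, newdata = dat[test, ]) - dat[test, "Salary"])^2)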
#Exercise 2
set.seed(29)
train <- data.frame(
x1 = runif(500) - 0.5,
x2 = runif(500) - 0.5
)
train$y <- factor(as.numeric((train$x1^2 - train$x2^2 > 0)))
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
p <- ggplot(train, aes(x = x1, y = x2, color = y)) +
geom_point(size = 2)
p
The dataset exhibits a non-linear separation between the two classes,
with class 0 primarily located in regions where x1^2 < x2^2, and class 1
where x1^2 > x2^2. The quadratic nature of the decision boundary is
evident from the symmetric pattern of the two classes along both axes.
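The true boundary x1^2 = x2^2 consists of the two lines x2 = x1 and
x2 = -x1; overlaying them on the scatterplot (an illustrative addition,
reusing the ggplot object p from above) makes the wedge-shaped regions explicit:
p + geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_abline(slope = -1, intercept = 0, linetype = "dashed")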
fit1 <- glm(y ~ ., data = train, family = "binomial")
plot_model <- function(fit) {
  # SVMs return predicted class labels directly; for glm fits, predict()
  # returns the linear predictor, so threshold it at 0 (i.e. probability 0.5).
  if (inherits(fit, "svm")) {
    train$p <- predict(fit)
  } else {
    train$p <- factor(as.numeric(predict(fit) > 0))
  }
  ggplot(train, aes(x = x1, y = x2, color = p)) +
    geom_point(size = 2)
}
plot_model(fit1)
Logistic regression inherently models a linear decision boundary in the
feature space of the predictors (x1 and x2). Without additional
transformations (e.g., including polynomial terms such as x1^2 and x2^2),
the model cannot capture the quadratic structure of the actual decision
boundary in the data.
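One way to quantify this is the training error rate of the linear logistic
fit (the name pred1 is an illustrative addition):
pred1 <- as.numeric(predict(fit1) > 0)  # predicted class labels
mean(pred1 != (train$y == "1"))         # training error rate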
fit2 <- glm(y ~ poly(x1, 2) + poly(x2, 2), data = train, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plot_model(fit2)
By incorporating polynomial terms, the logistic regression model expands
its hypothesis space to allow for non-linear boundaries. The model can
now approximate the true quadratic boundary x1^2 - x2^2 = 0 more
effectively.
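The convergence warnings above are consistent with this: the polynomial
terms separate the classes almost perfectly, which drives some fitted
probabilities to 0 or 1. The training error rate should therefore be close
to zero (the name pred2 is an illustrative addition):
pred2 <- as.numeric(predict(fit2) > 0)
mean(pred2 != (train$y == "1"))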
library(e1071)
fit3 <- svm(y ~ x1 + x2, data = train, kernel = "linear")
plot_model(fit3)
The SVM with a linear kernel tries to fit a straight line through the
feature space (x1 and x2). Given the quadratic relationship in the data
(x1^2 - x2^2), this linear model results in poor accuracy and many
misclassified points, especially near the regions where the classes meet
along the true non-linear boundary.
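Its training error rate, computed directly from the fitted object, confirms
the misfit:
mean(predict(fit3) != train$y)  # training error rate of the linear-kernel SVM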
fit4 <- svm(y ~ x1 + x2, data = train, kernel = "polynomial", degree = 2)
plot_model(fit4)
The plot indicates a clear separation between the two classes (y = 0 and
y = 1) using a non-linear decision boundary. The SVM with a polynomial
kernel of degree 2 has captured the quadratic structure of the data,
effectively aligning with the underlying decision boundary of the form
x1^2 - x2^2 > 0. As a result, the model has accurately classified the
observations into two distinct regions, with fewer misclassified points
compared to the linear SVM.
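A radial (RBF) kernel is another non-linear option that should recover a
similar boundary; the sketch below (with the illustrative name fit5 and the
default cost and gamma) fits it and reports its training error rate.
fit5 <- svm(y ~ x1 + x2, data = train, kernel = "radial")
plot_model(fit5)
mean(predict(fit5) != train$y)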