# Exercise 1

  1. Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
library(ISLR2)
dat <- Hitters
dat <- dat[!is.na(dat$Salary), ]
dat$Salary <- log(dat$Salary)
  2. Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
train <- 1:200
test <- setdiff(1:nrow(dat), train)
  3. Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
library(gbm)
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(29)
lambdas <- 10 ^ seq(-3, 0, by = 0.1)
fits <- lapply(lambdas, function(lam) {
  gbm(Salary ~ ., data = dat[train, ], distribution = "gaussian", 
    n.trees = 1000, shrinkage = lam)
})
errs <- sapply(fits, function(fit) {
  p <- predict(fit, dat[train, ], n.trees = 1000)
  mean((p - dat[train, ]$Salary)^2)
})
plot(lambdas, errs, type = "b", xlab = "Shrinkage values", 
  ylab = "Training MSE", log = "xy")

This plot shows that as the shrinkage parameter λ increases, the training MSE decreases. With the number of trees fixed at 1,000, a larger shrinkage value lets each tree contribute more to the fit, so the ensemble fits the training data more and more closely.
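
As a quick cross-check (a minimal sketch, assuming the fitted gbm objects store their per-iteration training error in the train.error component), the same trend can be read directly from the fits without re-predicting:
# Final-iteration training error (squared-error loss) for each shrinkage value
sapply(fits, function(fit) tail(fit$train.error, 1))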

  4. Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
errs <- sapply(fits, function(fit) {
  p <- predict(fit, dat[test, ], n.trees = 1000)
  mean((p - dat[test, ]$Salary)^2)
})
plot(lambdas, errs, type = "b", xlab = "Shrinkage values", 
  ylab = "Test MSE", log = "xy")
min(errs)
## [1] 0.24784
abline(v = lambdas[which.min(errs)], lty = 2, col = "purple")

The plot shows a U-shaped curve, where the test set MSE initially decreases as the shrinkage parameter λ increases, reaching a minimum around a shrinkage value of approximately 0.05. After this point, the test MSE begins to increase as λ continues to increase. The vertical dashed purple line represents the optimal λ value, corresponding to the lowest test MSE (0.24784).
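
For reference, the shrinkage value marked by the dashed purple line can be read off directly:
lambdas[which.min(errs)]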

  5. Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
# Linear regression
fit1 <- lm(Salary ~ ., data = dat[train, ])
mean((predict(fit1, dat[test, ]) - dat[test, "Salary"])^2)
## [1] 0.4917959
# Lasso regression (glmnet with alpha = 1 fits the lasso)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
x <- model.matrix(Salary ~ ., data = dat[train, ])
x.test <- model.matrix(Salary ~ ., data = dat[test, ])
y <- dat[train, "Salary"]
fit2 <- glmnet(x, y, alpha = 1)
mean((predict(fit2, s = 0.1, newx = x.test) - dat[test, "Salary"])^2)
## [1] 0.4389054

Boosting outperforms both linear regression and the lasso, achieving the lowest test MSE (0.2478) and showing its strength in capturing complex patterns in the data. The lasso has a moderate test MSE (0.4389), better than linear regression but not as effective as boosting. Linear regression has the highest test MSE (0.4918), indicating it may be too simplistic for this dataset.
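
As a small extension (a sketch; cvfit is a name introduced here), one could let cross-validation choose the lasso penalty instead of fixing s = 0.1:
set.seed(29)
cvfit <- cv.glmnet(x, y, alpha = 1)  # 10-fold CV over a grid of lambda values
mean((predict(cvfit, s = "lambda.min", newx = x.test) - dat[test, "Salary"])^2)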

  6. Which variables appear to be the most important predictors in the boosted model?
summary(fits[[which.min(errs)]])

##                 var    rel.inf
## CAtBat       CAtBat 18.0052088
## CWalks       CWalks 10.6105789
## PutOuts     PutOuts  8.8459762
## Walks         Walks  7.5640207
## CRBI           CRBI  7.4330152
## CRuns         CRuns  6.8383985
## CHmRun       CHmRun  6.2971195
## Years         Years  5.7960458
## Assists     Assists  4.8827027
## Hits           Hits  4.3062590
## RBI             RBI  3.9166927
## HmRun         HmRun  3.5227552
## Runs           Runs  3.3611588
## AtBat         AtBat  3.1765507
## Errors       Errors  2.3880466
## CHits         CHits  2.0523113
## NewLeague NewLeague  0.4732075
## Division   Division  0.3374449
## League       League  0.1925070

With the highest relative influence (18.01), CAtBat (career at-bats) is the most important predictor in the boosted model, indicating that the cumulative number of times a player has been at bat is a key factor in determining (log) salary.
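
To see how the top predictor acts, a partial dependence plot can be drawn with plot.gbm and its i.var argument (a minimal sketch):
# Partial dependence of predicted log-salary on career at-bats
plot(fits[[which.min(errs)]], i.var = "CAtBat", n.trees = 1000)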

  7. Now apply bagging to the training set. What is the test set MSE for this approach?
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
set.seed(29)
bagged <- randomForest(Salary ~ ., data = dat[train, ], mtry = 19, ntree = 1000)  # mtry = 19 (all predictors) gives bagging
mean((predict(bagged, newdata = dat[test, ]) - dat[test, "Salary"])^2)
## [1] 0.2283885

The bagging model achieves a test MSE of 0.2284, which is slightly better than the boosting model’s test MSE of 0.2478. This suggests that bagging performs marginally better for this dataset, possibly due to its ability to reduce variance by averaging multiple tree models.
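
For comparison (a sketch; rf is a name introduced here), a random forest with the default mtry (about p/3 for regression) decorrelates the trees and can be evaluated the same way:
set.seed(29)
rf <- randomForest(Salary ~ ., data = dat[train, ], ntree = 1000)
mean((predict(rf, newdata = dat[test, ]) - dat[test, "Salary"])^2)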

# Exercise 2

  1. Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.
set.seed(29)
train <- data.frame(
  x1 = runif(500) - 0.5,
  x2 = runif(500) - 0.5
)
train$y <- factor(as.numeric((train$x1^2 - train$x2^2 > 0)))
  2. Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
p <- ggplot(train, aes(x = x1, y = x2, color = y)) + 
  geom_point(size = 2)
p

The dataset exhibits a nonlinear separation between the two classes, with class 0 primarily located in regions where x1² < x2², and class 1 where x1² > x2². The quadratic nature of the decision boundary is evident from the symmetric pattern of the two classes along both axes.
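
Since x1² − x2² = 0 is equivalent to x2 = ±x1, the true boundary can be overlaid on the stored ggplot object p (a minimal sketch):
p + geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_abline(slope = -1, intercept = 0, linetype = "dashed")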

  3. Fit a logistic regression model to the data, using X1 and X2 as predictors.
fit1 <- glm(y ~ ., data = train, family = "binomial")
  4. Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
plot_model <- function(fit) {
  if (inherits(fit, "svm")) {
    train$p <- predict(fit)
  } else {
    train$p <- factor(as.numeric(predict(fit) > 0))
  }
  ggplot(train, aes(x = x1, y = x2, color = p)) + 
    geom_point(size = 2)
}

plot_model(fit1)

Logistic regression inherently models a linear decision boundary in the feature space of the predictors (x1 and x2). Without additional transformations (e.g. including polynomial terms such as x1² and x2²), the model cannot capture the quadratic structure of the actual decision boundary in the data.
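
The implied boundary of fit1 can also be drawn explicitly: it is the line where the linear predictor β0 + β1·x1 + β2·x2 equals zero (a sketch; because neither predictor is informative on its own, the fitted line may sit near the edge of, or outside, the plotted region):
b <- coef(fit1)
plot_model(fit1) +
  geom_abline(intercept = -b[1] / b[3], slope = -b[2] / b[3], linetype = "dashed")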

  5. Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1², X1 × X2, and so forth).
fit2 <- glm(y ~ poly(x1, 2) + poly(x2, 2), data = train, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
  6. Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
plot_model(fit2)

By incorporating polynomial terms, the logistic regression model expands its hypothesis space to allow for nonlinear boundaries. The model can now approximate the true quadratic boundary x1² − x2² = 0 much more effectively.
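
A quick tabulation of the predicted labels against the true labels (using the same 0.5 cutoff on the link scale as in plot_model) should confirm an essentially perfect fit on the training data, consistent with the perfect-separation warnings above; this is only a sketch of the check:
table(predicted = as.numeric(predict(fit2) > 0), truth = train$y)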

  7. Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
library(e1071)
fit3 <- svm(y ~ x1 + x2, data = train, kernel = "linear")
plot_model(fit3)

The SVM with a linear kernel can only place a straight line through the feature space (x1 and x2). Given the quadratic relationship in the data (x1² − x2²), this linear model yields poor accuracy and many misclassified points, especially in regions where the classes overlap relative to their true non-linear separation.
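
Its training accuracy can be computed directly (a sketch; a value near 0.5 would confirm that no linear boundary separates these classes):
mean(predict(fit3) == train$y)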

  8. Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
fit4 <- svm(y ~ x1 + x2, data = train, kernel = "polynomial", degree = 2)
plot_model(fit4)

The plot indicates a clear separation between the two classes (y = 0 and y = 1) using a non-linear decision boundary. The SVM with a polynomial kernel of degree 2 has captured the quadratic structure of the data, effectively aligning with the underlying decision boundary of the form x1² − x2² > 0. As a result, the model accurately classifies the observations into two distinct regions, with far fewer misclassified points than the linear SVM.
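
As an aside (a sketch; fit5 is a name introduced here), a radial kernel is another common non-linear choice that can recover this boundary without specifying the polynomial degree:
fit5 <- svm(y ~ x1 + x2, data = train, kernel = "radial")
plot_model(fit5)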

  9. Comment on your results.

When dealing with data that has a quadratic decision boundary, standard linear models (such as linear logistic regression or an SVM with a linear kernel) are generally insufficient. These models assume a linear decision boundary in the feature space, leading to poor performance when the classes are separated by a non-linear boundary. It is beneficial to use models that either include quadratic transformations (like logistic regression with polynomial terms) or that can inherently model non-linear boundaries (like an SVM with a polynomial kernel). These models leverage the non-linear structure of the data, providing significantly improved classification performance over standard linear methods.
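
To summarise, the training accuracy of each fit can be compared side by side (a minimal sketch; acc is a helper introduced here, using the same prediction rule as plot_model):
acc <- function(fit) {
  if (inherits(fit, "svm")) {
    pred <- predict(fit)
  } else {
    pred <- factor(as.numeric(predict(fit) > 0))
  }
  mean(pred == train$y)
}
sapply(list(logistic = fit1, logistic_poly = fit2,
  svm_linear = fit3, svm_poly = fit4), acc)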