Problem 5
#part a
set.seed(42)
n <- 500
x1 <- runif(n) - 0.5
x2 <- runif(n) - 0.5
y <- factor( (x1^2 - x2^2 > 0) * 1 ) # “1” vs “0”
df <- data.frame(x1, x2, y)
#part b
library(ggplot2)
ggplot(df, aes(x = x1, y = x2, color = y)) +
geom_point(alpha = 0.7) +
scale_color_manual(values = c("blue","red")) +
labs(title="Quadratic boundary data",
x="X1", y="X2", color="Class") +
theme_minimal()
The scatter plot shows the two classes arranged in an hourglass pattern: class 1 fills the
left and right wedges where |x1| > |x2|, class 0 the top and bottom wedges, and the two are
separated by the lines x2 = ±x1 (equivalently x1^2 - x2^2 = 0), confirming a nonlinear
decision boundary.
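As a quick visual check (a sketch, not part of the original output), the true boundary lines x2 = ±x1 can be overlaid on the same scatter with geom_abline:
# Sketch: overlay the true boundary x2 = +/- x1 (dashed) on the scatter
ggplot(df, aes(x = x1, y = x2, color = y)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_abline(slope = -1, intercept = 0, linetype = "dashed") +
  scale_color_manual(values = c("blue", "red")) +
  labs(title = "Data with the true boundary x1^2 - x2^2 = 0",
       x = "X1", y = "X2", color = "Class") +
  theme_minimal()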
#part c
#Fit logistic regression on (x1, x2)
model <- glm(y ~ x1 + x2, data = df, family = binomial)
summary(model)
##
## Call:
## glm(formula = y ~ x1 + x2, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.09978 0.08976 1.112 0.266
## x1 -0.17659 0.30658 -0.576 0.565
## x2 -0.20067 0.30978 -0.648 0.517
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 691.79 on 499 degrees of freedom
## Residual deviance: 691.08 on 497 degrees of freedom
## AIC: 697.08
##
## Number of Fisher Scoring iterations: 3
grid <- expand.grid(
x1 = seq(min(x1), max(x1), length = 200),
x2 = seq(min(x2), max(x2), length = 200)
)
grid$prob <- predict(model, newdata = grid, type = "response")
ggplot(df, aes(x = x1, y = x2)) +
geom_point(aes(color = y), alpha = 0.7) +
geom_contour(data = grid,
aes(z = prob, x = x1, y = x2),
breaks = 0.5, color = "black") +
scale_color_manual(values = c("blue","red")) +
labs(title="Logistic regression boundary (linear)",
x="X1", y="X2") +
theme_minimal()
Overlaying the logistic regression's decision boundary (the 0.5 probability contour, a
straight line) reveals many misclassified points; consistent with this, none of the
coefficients in the summary above is statistically significant, so a linear model cannot
capture the true class separation.
#part d
#Linear log‐reg predictions on training data
df$pred_lin <- ifelse(predict(model, df, type="response") > 0.5, "1", "0")
ggplot(df, aes(x=x1, y=x2, color=pred_lin)) +
geom_point(alpha=0.7) +
labs(title="Linear log-reg predictions", color="Predicted") +
theme_minimal()
The linear logistic model collapses to predicting nearly every point as a single class:
with all the coefficients close to zero, the fitted probabilities cluster just above 0.5
(driven by the small positive intercept), so the single straight boundary misses the true
wedge-shaped separation and fails to distinguish the two groups.
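To quantify this collapse (a small check that is not in the original output; it reuses the pred_lin column created above), one can tabulate the predicted classes against the truth and compute the training error rate:
# Sketch: prediction counts and training error of the linear logistic fit
table(Predicted = df$pred_lin, Actual = df$y)
mean(df$pred_lin != df$y)   # expected to be near 0.5 given the near-zero coefficients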
#part (e) Add nonlinear features & fit log-reg
df$x1_sq <- df$x1^2
df$x2_sq <- df$x2^2
df$x1x2 <- df$x1 * df$x2
model_nl <- glm(y ~ x1_sq + x2_sq + x1x2, data=df, family=binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_nl)
##
## Call:
## glm(formula = y ~ x1_sq + x2_sq + x1x2, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.68 109.84 0.088 0.930
## x1_sq 106829.41 1055562.06 0.101 0.919
## x2_sq -105379.08 1012602.20 -0.104 0.917
## x1x2 -1395.65 48224.72 -0.029 0.977
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9179e+02 on 499 degrees of freedom
## Residual deviance: 9.5699e-05 on 496 degrees of freedom
## AIC: 8.0001
##
## Number of Fisher Scoring iterations: 25
#part (f) Nonlinear log-reg predictions & plot
df$pred_nl <- ifelse(predict(model_nl, df, type="response") > 0.5, "1", "0")
ggplot(df, aes(x=x1, y=x2, color=pred_nl)) +
geom_point(alpha=0.7) +
labs(title="Log-reg w/ quadratic features", color="Predicted") +
theme_minimal()
By including x1^2, x2^2, and x1x2 terms, the logistic model now draws a
curved boundary that closely follows the true x1^2 - x2^2 = 0
separation, correctly classifying most points.
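As a companion check (a sketch reusing the pred_nl column from above), the training error of the quadratic-feature fit should be essentially zero, consistent with the near-zero residual deviance reported by summary():
# Sketch: training error of the logistic model with quadratic features
mean(df$pred_nl != df$y)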
#part (g) Linear SVM on (x1,x2)
library(e1071)
svm_lin <- svm(y ~ x1 + x2, data=df, kernel="linear", scale=FALSE)
df$pred_svm_lin <- predict(svm_lin, df)
ggplot(df, aes(x=x1, y=x2, color=pred_svm_lin)) +
geom_point(alpha=0.7) +
labs(title="Linear SVM predictions", color="Predicted") +
theme_minimal()
The linear SVM likewise collapses to predicting almost every point as class 1, because no straight hyperplane on (X₁,X₂) can capture the true quadratic separation—so it fails just like the linear logistic model.
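The same kind of check (a sketch using pred_svm_lin from above) makes this collapse explicit:
# Sketch: predicted class counts and training error for the linear SVM
table(df$pred_svm_lin)
mean(df$pred_svm_lin != df$y)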
#part (h) RBF-kernel SVM
svm_rbf <- svm(y ~ x1 + x2, data=df, kernel="radial", cost=1, gamma=1)
df$pred_svm_rbf <- predict(svm_rbf, df)
ggplot(df, aes(x=x1, y=x2, color=pred_svm_rbf)) +
geom_point(alpha=0.7) +
labs(title="RBF-SVM predictions", color="Predicted") +
theme_minimal()
The RBF-kernel SVM successfully captures the curved boundary, nearly matching the true x1^2 - x2^2 = 0 separation and correctly classifying most points.
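The corresponding check for the RBF fit (a sketch using pred_svm_rbf from above) should show a low training error:
# Sketch: training error of the RBF-kernel SVM
mean(df$pred_svm_rbf != df$y)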
Comment:
We may conclude that both logistic regression with quadratic and interaction features and SVM with an RBF kernel succeed equally well at recovering the true x1^2 - x2^2 = 0 boundary, while a linear-only logistic model and a linear-kernel SVM both collapse to predicting almost every point as one class. The downside of feature-engineered logistic regression is the manual effort (and occasional perfect-separation warnings) needed to choose the right polynomial terms, whereas the RBF-SVM achieves comparable nonlinear capacity by simply tuning its kernel parameters.
Problem 7
# part (a) Create binary “high” variable and drop mpg
library(ISLR)
library(e1071)
data(Auto)
Auto$high <- factor(ifelse(Auto$mpg > median(Auto$mpg), "1", "0"))
df <- subset(Auto, select = -mpg) # keep high + all other predictors
# (b) Linear SVM: tune over a grid of cost values
set.seed(42)
costs <- c(0.001, 0.01, 0.1, 1, 10, 100)
tune.lin <- tune(svm,
high ~ .,
data = df,
kernel = "linear",
ranges = list(cost = costs))
summary(tune.lin)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.08916667
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.12775641 0.06746999
## 2 1e-02 0.08916667 0.05258186
## 3 1e-01 0.09160256 0.05869690
## 4 1e+00 0.09173077 0.04357345
## 5 1e+01 0.11705128 0.05314992
## 6 1e+02 0.12993590 0.05797340
# Plot CV error vs cost
plot(tune.lin, main="CV Error: Linear SVM")
best.lin <- tune.lin$best.model
Comment:
The cross-validation error curve is roughly U-shaped when plotted against cost on a log scale. At the smallest cost (C = 0.001) the error is highest, about 12.8%, because the margin is very soft and the model underfits. The error drops to its minimum of about 8.9% at C = 0.01, stays near 9.2% for C = 0.1 and C = 1, and then rises again to roughly 11.7% at C = 10 and 13.0% at C = 100 as the harder margin starts to fit noise in each fold. The fairly flat valley between C = 0.01 and C = 1 suggests performance is robust across that range, so one could narrow the grid around those values (see the sketch below) or prefer the smaller C for a more regularized solution.
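A follow-up search on a finer grid around the best value could look like the sketch below (the object name tune.lin.fine and the specific grid are illustrative, not part of the original analysis):
# Sketch: re-tune the linear SVM on a narrower cost grid around 0.01-1
set.seed(42)
tune.lin.fine <- tune(svm, high ~ ., data = df, kernel = "linear",
                      ranges = list(cost = c(0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 1)))
summary(tune.lin.fine)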
# (c) Nonlinear SVMs
# Radial basis
set.seed(42)
gammas <- c(0.5, 1, 2, 3, 4)
tune.rad <- tune(svm,
high ~ .,
data = df,
kernel = "radial",
ranges = list(cost = costs,
gamma = gammas))
summary(tune.rad)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 1
##
## - best performance: 0.07891026
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-03 0.5 0.59679487 0.05312225
## 2 1e-02 0.5 0.59679487 0.05312225
## 3 1e-01 0.5 0.09166667 0.05222075
## 4 1e+00 0.5 0.08660256 0.04333178
## 5 1e+01 0.5 0.08397436 0.04088294
## 6 1e+02 0.5 0.08910256 0.03762626
## 7 1e-03 1.0 0.59679487 0.05312225
## 8 1e-02 1.0 0.59679487 0.05312225
## 9 1e-01 1.0 0.59679487 0.05312225
## 10 1e+00 1.0 0.07891026 0.03633038
## 11 1e+01 1.0 0.08910256 0.04132724
## 12 1e+02 1.0 0.08910256 0.04132724
## 13 1e-03 2.0 0.59679487 0.05312225
## 14 1e-02 2.0 0.59679487 0.05312225
## 15 1e-01 2.0 0.59679487 0.05312225
## 16 1e+00 2.0 0.16544872 0.07914205
## 17 1e+01 2.0 0.15525641 0.06985108
## 18 1e+02 2.0 0.15525641 0.06985108
## 19 1e-03 3.0 0.59679487 0.05312225
## 20 1e-02 3.0 0.59679487 0.05312225
## 21 1e-01 3.0 0.59679487 0.05312225
## 22 1e+00 3.0 0.48185897 0.10667426
## 23 1e+01 3.0 0.44858974 0.13601088
## 24 1e+02 3.0 0.44858974 0.13601088
## 25 1e-03 4.0 0.59679487 0.05312225
## 26 1e-02 4.0 0.59679487 0.05312225
## 27 1e-01 4.0 0.59679487 0.05312225
## 28 1e+00 4.0 0.52532051 0.06806637
## 29 1e+01 4.0 0.51762821 0.07071149
## 30 1e+02 4.0 0.51762821 0.07071149
plot(tune.rad, main="CV Error: RBF SVM")
best.rad <- tune.rad$best.model
#Polynomial
set.seed(42)
degrees <- 2:5
tune.poly <- tune(svm,
high ~ .,
data = df,
kernel = "polynomial",
ranges = list(cost = costs,
degree = degrees))
summary(tune.poly)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 100 2
##
## - best performance: 0.3169872
##
## - Detailed performance results:
## cost degree error dispersion
## 1 1e-03 2 0.5967949 0.05312225
## 2 1e-02 2 0.5967949 0.05312225
## 3 1e-01 2 0.5967949 0.05312225
## 4 1e+00 2 0.5967949 0.05312225
## 5 1e+01 2 0.5839744 0.05397488
## 6 1e+02 2 0.3169872 0.09209637
## 7 1e-03 3 0.5967949 0.05312225
## 8 1e-02 3 0.5967949 0.05312225
## 9 1e-01 3 0.5967949 0.05312225
## 10 1e+00 3 0.5967949 0.05312225
## 11 1e+01 3 0.5967949 0.05312225
## 12 1e+02 3 0.4492949 0.12931374
## 13 1e-03 4 0.5967949 0.05312225
## 14 1e-02 4 0.5967949 0.05312225
## 15 1e-01 4 0.5967949 0.05312225
## 16 1e+00 4 0.5967949 0.05312225
## 17 1e+01 4 0.5967949 0.05312225
## 18 1e+02 4 0.5967949 0.05312225
## 19 1e-03 5 0.5967949 0.05312225
## 20 1e-02 5 0.5967949 0.05312225
## 21 1e-01 5 0.5967949 0.05312225
## 22 1e+00 5 0.5967949 0.05312225
## 23 1e+01 5 0.5967949 0.05312225
## 24 1e+02 5 0.5967949 0.05312225
plot(tune.poly, main="CV Error: Poly SVM")
best.poly <- tune.poly$best.model
Comment:
The radial-basis SVM tuning yields a clear minimum CV error of about 7.9% at cost = 1 and γ = 1, improving on the linear SVM's 8.9%. The error surface shows severe underfitting when C is very small or γ is large (errors near 60%) and only a slight rise as C grows beyond the optimum. In contrast, the polynomial-kernel SVM never gets below about 31.7% error, even at its best setting (C = 100, degree = 2), and sits near 60% across most of the grid. This stark gap indicates that the RBF kernel captures the relationship between the car features and the high/low-mpg split far more effectively than a low-degree polynomial mapping in this case.
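A compact way to place the three tuned kernels side by side (a sketch using the best.performance component stored by tune()) is:
# Sketch: best cross-validated error for each kernel
data.frame(kernel  = c("linear", "radial", "polynomial"),
           best_cv = c(tune.lin$best.performance,
                       tune.rad$best.performance,
                       tune.poly$best.performance))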
# (d) Example plots on two variables to illustrate decision boundaries
# (we pick weight vs horsepower here)
# Linear SVM boundary:
plot(best.lin, df, weight ~ horsepower,
main="Linear SVM: weight vs horsepower")
# RBF SVM boundary:
plot(best.rad, df, weight ~ horsepower,
main="RBF SVM: weight vs horsepower")
# Poly SVM boundary:
plot(best.poly, df, weight ~ horsepower,
main="Poly SVM: weight vs horsepower")
The first plot uses a linear kernel, so the decision surface is a straight line through the horsepower–weight plane. You can see that this line slices right through both the light and heavy cars, misclassifying a large swath of points on either side of the true “median‐mpg” split—exactly what you’d expect when the relationship isn’t linear.
The second plot shows the RBF‐kernel SVM. Here the background is almost entirely one color (high‐mileage), with only a small pocket of low‐mileage predictions tucked into the lower‐left corner. That reflects the model’s tendency—given our tuned γ and C—to carve out the tight cluster of very light, low-horsepower cars as low‐mileage and treat nearly everything else as high‐mileage. In this case it underfits the minority class to keep the boundary smooth.
The third plot gives the polynomial‐kernel SVM (degree = 2). Its decision region again covers almost the entire space as high‐mileage, with only a tiny curved sliver classified as low‐mileage. Compared to the RBF plot, the polynomial map is even less flexible here—failing to isolate the low-mileage cluster cleanly—and ends up misclassifying many more of those cars despite the added curvature.
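To complement the two-variable plots, the in-sample (training) error of each tuned model can be computed directly (a sketch; these figures are optimistic because they reuse the training data):
# Sketch: training error of each best model on the full predictor set
sapply(list(linear = best.lin, radial = best.rad, poly = best.poly),
       function(m) mean(predict(m, df) != df$high))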
Problem 8
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
library(e1071)
set.seed(2025)
# part (a) split into train (800) and test
n <- nrow(OJ)
train_idx <- sample(seq_len(n), 800)
train <- OJ[train_idx, ]
test <- OJ[-train_idx, ]
# (b) linear SVM with cost = 0.01
svm_lin01 <- svm(Purchase ~ ., data=train, kernel="linear", cost=0.01)
summary(svm_lin01)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 433
##
## ( 217 216 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
# training + test error
pred_train_lin01 <- predict(svm_lin01, train)
pred_test_lin01 <- predict(svm_lin01, test)
err_train_lin01 <- mean(pred_train_lin01 != train$Purchase)
err_test_lin01 <- mean(pred_test_lin01 != test$Purchase)
At a very low cost of 0.01 the linear SVM keeps 433 of the 800 training observations as support vectors (217 from one class, 216 from the other). In practical terms the margin is extremely soft (more than half of the points lie on or within it), so the hyperplane is only loosely constrained and the fit is heavily regularized. As reported in part (c), this gives a training error of about 16.5% and a test error of about 18.5%, so a larger C may be needed to tighten the margin and reduce misclassification.
# part(c) print errors
cat("Linear SVM (cost=0.01) Train err:", round(err_train_lin01,3),
" Test err:", round(err_test_lin01,3), "\n")
## Linear SVM (cost=0.01) Train err: 0.165 Test err: 0.185
Interpretation:
With C set to 0.01, the linear SVM misclassifies about 16.5% of the training observations and about 18.5% of the test cases. Both rates are fairly high, and the very soft margin (the cost is too small) allows many points inside the margin, so the model leans toward underfitting; part (d) tunes the cost to see whether a firmer margin separates the "CH" and "MM" purchases more cleanly.
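A test-set confusion matrix (a sketch reusing pred_test_lin01 from above) shows where those errors fall:
# Sketch: confusion matrix for the cost = 0.01 linear SVM on the test set
table(Predicted = pred_test_lin01, Actual = test$Purchase)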
# part (d) tune cost over [0.01,0.1,1,5,10]
tune_lin <- tune(svm,
Purchase ~ .,
data = train,
kernel = "linear",
ranges = list(cost = c(0.01,0.1,1,5,10)))
summary(tune_lin)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.165
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.16875 0.04458528
## 2 0.10 0.16500 0.04116363
## 3 1.00 0.16750 0.04297932
## 4 5.00 0.16875 0.04723243
## 5 10.00 0.17000 0.05109903
best_cost_lin <- tune_lin$best.parameters$cost
Interpretation:
The 10-fold CV error starts at about 16.9% when C = 0.01, falls to its minimum of 16.5% at C = 0.1, and then creeps back up to roughly 16.8–17.0% for larger values of C. A very soft margin (small C) underfits slightly and a very hard margin (large C) overfits slightly, while C = 0.1 gives the best bias–variance trade-off. The differences across the grid are small, so performance is fairly robust in this range, but C = 0.1 is the natural choice.
# part (e) refit with best cost
svm_lin_best <- svm(Purchase ~ ., data=train,
kernel="linear", cost=best_cost_lin)
err_train_lin_best <- mean(predict(svm_lin_best, train) != train$Purchase)
err_test_lin_best <- mean(predict(svm_lin_best, test) != test$Purchase)
cat("Linear SVM (best cost) cost=",best_cost_lin,
"Train err:",round(err_train_lin_best,3),
"Test err:",round(err_test_lin_best,3),"\n")
## Linear SVM (best cost) cost= 0.1 Train err: 0.161 Test err: 0.174
Interpretation:
With cost tuned to 0.1, the linear SVM misclassifies about 16.1% of the training set and about 17.4% of the test set, a modest improvement over the cost = 0.01 fit on both counts (16.5% and 18.5%). Tuning the cost therefore helps, but the remaining error suggests a purely linear boundary may still be somewhat limited on the OJ data, which motivates trying the nonlinear kernels in parts (f) and (g).
# part (f) radial‐basis SVM: tune cost only (default gamma)
tune_rad <- tune(svm,
Purchase ~ .,
data = train,
kernel = "radial",
ranges = list(cost = c(0.01,0.1,1,5,10)))
summary(tune_rad)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.17375
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.37875 0.04788949
## 2 0.10 0.18250 0.03446012
## 3 1.00 0.17375 0.03928617
## 4 5.00 0.18625 0.04348132
## 5 10.00 0.19625 0.05369991
best_cost_rad <- tune_rad$best.parameters$cost
svm_rad_best <- svm(Purchase ~ ., data=train,
kernel="radial", cost=best_cost_rad)
err_train_rad <- mean(predict(svm_rad_best, train) != train$Purchase)
err_test_rad <- mean(predict(svm_rad_best, test) != test$Purchase)
cat("RBF SVM (best cost) cost=",best_cost_rad,
"Train err:",round(err_train_rad,3),
"Test err:",round(err_test_rad,3),"\n")
## RBF SVM (best cost) cost= 1 Train err: 0.144 Test err: 0.185
Interpretation:
The RBF-kernel SVM underfits badly at very low C (CV error of about 37.9% at C = 0.01), then improves sharply as C increases, reaching its lowest CV error of about 17.4% at C = 1 before creeping back up for larger penalties. Refit with C = 1, it misclassifies 14.4% of the training set but 18.5% of the test set. In other words, the nonlinear kernel fits the training data more closely than the tuned linear SVM (14.4% vs 16.1% training error), but its out-of-sample error (18.5%) is slightly worse than the tuned linear SVM's (17.4%). For the OJ data with the default γ, the RBF kernel does not provide a clear generalization advantage over the linear boundary.
#part (g) polynomial SVM (degree=2), tune cost
tune_poly <- tune(svm,
Purchase ~ .,
data = train,
kernel = "polynomial",
ranges = list(cost = c(0.01,0.1,1,5,10)),
degree = 2)
summary(tune_poly)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.1725
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.37500 0.05368374
## 2 0.10 0.31625 0.05684103
## 3 1.00 0.19000 0.02687419
## 4 5.00 0.17500 0.03227486
## 5 10.00 0.17250 0.02687419
best_cost_poly <- tune_poly$best.parameters$cost
svm_poly_best <- svm(Purchase ~ ., data=train,
kernel="polynomial", degree=2,
cost=best_cost_poly)
err_train_poly <- mean(predict(svm_poly_best, train) != train$Purchase)
err_test_poly <- mean(predict(svm_poly_best, test) != test$Purchase)
cat("Poly SVM (deg=2) cost=",best_cost_poly,
"Train err:",round(err_train_poly,3),
"Test err:",round(err_test_poly,3),"\n")
## Poly SVM (deg=2) cost= 10 Train err: 0.144 Test err: 0.196
Interpretation:
For the degree-2 polynomial kernel, the CV error again follows a rough U-shape in cost: at very low C (0.01) the model underfits (CV error ≈ 37.5%), improves to about 19.0% at C = 1 and 17.5% at C = 5, and reaches its minimum of about 17.3% at C = 10. Refit at C = 10, the SVM misclassifies 14.4% of the training points and 19.6% of the test cases. Adding the quadratic mapping therefore fits the training data about as well as the RBF kernel, but its test error is a little higher than both the tuned linear and RBF fits.
part (h):
Across all of the fits, the tuned linear SVM gives the lowest out-of-sample error (≈17.4%), ahead of the RBF kernel (≈18.5%) and the degree-2 polynomial kernel (≈19.6%). Both nonlinear kernels fit the training data more closely (≈14.4% training error each) but generalize slightly worse, suggesting mild overfitting at the tuned settings, whereas the linear boundary already describes the CH-versus-MM separation in the OJ data quite well. On this split, then, the linear SVM with cost = 0.1 offers the best balance of flexibility and stability; a summary table is sketched below.
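A compact summary of the errors computed above (a sketch; the table simply collects existing objects) is:
# Sketch: training and test errors of the three tuned SVMs
data.frame(kernel      = c(paste0("linear (cost=", best_cost_lin, ")"),
                           paste0("radial (cost=", best_cost_rad, ")"),
                           paste0("poly deg 2 (cost=", best_cost_poly, ")")),
           train_error = c(err_train_lin_best, err_train_rad, err_train_poly),
           test_error  = c(err_test_lin_best, err_test_rad, err_test_poly))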