hw_8_data_mining

Question 5

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.

(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:

set.seed(421)
x1 <- runif (500) -0.5
x2 <- runif (500) -0.5
y <- 1*(x1^2 - x2^2 > 0)

(b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.

plot(x1[y==0], x2[y==0], 
     col = "purple",
     xlab = "X1",
     ylab = "X2",
     pch = "+")
points(x1[y==1], x2[y==1],
     col = "blue",
     pch = 4)

(c) Fit a logistic regression model to the data, using X1 and X2 as predictors.

svm_model <- glm(y ~ x1 + x2, family = binomial)
summary(svm_model)

## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.278  -1.227   1.089   1.135   1.175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.11999    0.08971   1.338    0.181
## x1          -0.16881    0.30854  -0.547    0.584
## x2          -0.08198    0.31476  -0.260    0.795
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 691.35  on 499  degrees of freedom
## Residual deviance: 690.99  on 497  degrees of freedom
## AIC: 696.99
## 
## Number of Fisher Scoring iterations: 3

(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.

svm_data <- data.frame(x1 = x1, x2 = x2, y = y)
svm_prob <-  predict(svm_model, svm_data, type = "response")
svm_pred <-  ifelse(svm_prob > 0.52, 1, 0)
data_pos <-  svm_data[svm_pred == 1, ]
data_neg <-  svm_data[svm_pred == 0, ]
plot(data_pos$x1, data_pos$x2, col = "green", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg$x1, data_neg$x2, col = "orange", pch = 4)

As we can see from the resulting graph, the decision boundary is indeed linear in nature.
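The 0.52 cutoff used in part (d) only makes sense relative to the range of the fitted probabilities. Given the small coefficient estimates in the summary above, those probabilities are expected to sit in a narrow band around 0.5. A minimal sketch for inspecting that band before choosing a cutoff, assuming the simulated data from part (a) (the names linear_fit and fitted_prob are introduced here for illustration):

```r
# Recreate the simulated data from part (a)
set.seed(421)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)

# Logistic regression on the untransformed features, as in part (c)
linear_fit <- glm(y ~ x1 + x2, family = binomial)
fitted_prob <- predict(linear_fit, type = "response")

# Range of the fitted probabilities, and how many observations
# fall on each side of two candidate cutoffs
print(range(fitted_prob))
print(table(above_0.50 = fitted_prob > 0.50))
print(table(above_0.52 = fitted_prob > 0.52))
```

If nearly every fitted probability lies on one side of 0.5, a 0.5 cutoff puts almost all points in one class, and a cutoff inside the observed range is needed to make the linear boundary visible in the plot.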

(e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1^2, X2^2, X1 × X2, log(X2), and so forth).

lm_fit = glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), data = svm_data, family = binomial)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

lm_prob <- predict(lm_fit, svm_data, type = "response")
lm_pred <- ifelse(lm_prob > 0.5, 1, 0)
data_pos_lm <- svm_data[lm_pred == 1, ]
data_neg_lm <- svm_data[lm_pred == 0, ]
plot(data_pos_lm$x1, data_pos_lm$x2, col = "blue", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg_lm$x1, data_neg_lm$x2, col = "red", pch = 4)

As the resulting graph shows, the predicted class labels are separated by a clearly non-linear boundary.
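The warnings in part (e) are what one would expect when the classes are perfectly separable in the transformed feature space: y is defined by the sign of x1^2 - x2^2, which is a linear combination of the quadratic terms. A sketch that checks the training error of a quadratic logistic fit, assuming the data from part (a) (sim_data, quad_fit, quad_pred and train_error are illustrative names, and the formula is a simplified version of the one used above):

```r
# Recreate the simulated data from part (a)
set.seed(421)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
sim_data <- data.frame(x1 = x1, x2 = x2, y = y)

# Logistic regression on the squared features only.
# suppressWarnings() hides the separation warnings that glm() emits.
quad_fit <- suppressWarnings(
  glm(y ~ I(x1^2) + I(x2^2), data = sim_data, family = binomial)
)

# Predicted class labels at a 0.5 cutoff, and the training error rate
quad_pred <- as.integer(predict(quad_fit, sim_data, type = "response") > 0.5)
train_error <- mean(quad_pred != sim_data$y)
print(train_error)
```

A training error at or near zero here would confirm that the quadratic terms alone are enough to recover the boundary.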

(g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

library(e1071)

## Warning: package 'e1071' was built under R version 4.1.3

svm_fit = svm(as.factor(y) ~ x1 + x2, svm_data, kernel = "linear", cost = 0.1)
svm_pred_2 = predict(svm_fit, svm_data)
data_pos_2 = svm_data[svm_pred_2 == 1, ]
data_neg_2 = svm_data[svm_pred_2 == 0, ]
plot(data_pos_2$x1, data_pos_2$x2, col = "blue", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg_2$x1, data_neg_2$x2, col = "red", pch = 4)

(h) Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

svm_fit_gamma = svm(as.factor(y) ~ x1 + x2, svm_data, gamma = 1)
svm_pred_gamma = predict(svm_fit_gamma, svm_data)
data_pos_gamma = svm_data[svm_pred_gamma == 1, ]
data_neg_gamma = svm_data[svm_pred_gamma == 0, ]
plot(data_pos_gamma$x1, data_pos_gamma$x2, col = "pink", xlab = "X1", ylab = "X2", pch = "+")
points(data_neg_gamma$x1, data_neg_gamma$x2, col = "chartreuse", pch = 4)

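(i) Comment on your results.

The logistic regression on X1 and X2 alone (d) and the support vector classifier with a linear kernel (g) can only produce linear decision boundaries, so neither recovers the quadratic boundary in the data. Logistic regression with non-linear transformations of the features (f) and the SVM with a non-linear kernel (h) both produce non-linear boundaries. Of the plots above, the one that most closely resembles the original data set (if it is not identical) is the SVM fitted with the default radial kernel and gamma = 1.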

Question 7

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.

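library(ISLR)   # assumed source of the Auto data set (OJ is taken from ISLR below)
library(dplyr)  # provides %>%, mutate() and case_when()

Auto <- Auto %>%
        mutate(gas_mileage_high = case_when(mpg > median(mpg) ~ 1,
                                           mpg < median(mpg) ~ 0))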
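One thing to keep in mind for the fits that follow: e1071's svm() picks its default type from the response, using C-classification for a factor and eps-regression for a numeric vector. A minimal sketch of storing the binary variable as a factor, using a small made-up data frame (toy_auto and its values are hypothetical and only stand in for Auto):

```r
# Hypothetical stand-in for the Auto data; only mpg matters here
toy_auto <- data.frame(
  mpg = c(18, 15, 36, 26, 31, 14, 24, 29),
  horsepower = c(130, 165, 58, 90, 65, 150, 95, 70)
)

# 1 = above the median, 0 = at or below the median, stored as a factor
toy_auto$gas_mileage_high <- factor(
  as.integer(toy_auto$mpg > median(toy_auto$mpg)),
  levels = c(0, 1)
)

# Drop mpg so the response is not predicted from the variable that defines it
toy_predictors <- toy_auto[, setdiff(names(toy_auto), "mpg")]

print(table(toy_predictors$gas_mileage_high))
print(is.factor(toy_predictors$gas_mileage_high))
```

Using `>` alone also avoids the missing values that case_when() would produce for any car whose mpg equals the median exactly.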

(b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.

set.seed(3255)
tune_out = tune(svm, gas_mileage_high ~ ., data = Auto, kernel = "linear", 
                ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune_out)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.07439708 
## 
## - Detailed performance results:
##    cost      error dispersion
## 1 1e-02 0.08424758 0.04058643
## 2 1e-01 0.07878261 0.04679699
## 3 1e+00 0.07439708 0.03465824
## 4 5e+00 0.08500812 0.03801837
## 5 1e+01 0.09330097 0.03735789
## 6 1e+02 0.11838197 0.03633917

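For the linear support vector classifier, the lowest cross-validation error (0.0744) is obtained with cost = 1. The error rises for both smaller and larger costs, reaching 0.1184 at cost = 100. Because gas_mileage_high is stored as a numeric 0/1 variable rather than a factor, svm() fits a regression here, so these errors are mean squared errors rather than misclassification rates.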

(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.

set.seed(21)
tune_out_polynomial = tune(svm, gas_mileage_high ~ ., data = Auto, kernel = "polynomial",
                           ranges = list(cost = c(0.1, 1, 5, 10), degree = c(2, 3, 4)))
summary(tune_out_polynomial)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##    10      2
## 
## - best performance: 0.3583384 
## 
## - Detailed performance results:
##    cost degree     error dispersion
## 1   0.1      2 0.5034631 0.04096130
## 2   1.0      2 0.4865948 0.04302246
## 3   5.0      2 0.4206657 0.05981167
## 4  10.0      2 0.3583384 0.07646762
## 5   0.1      3 0.5045246 0.04086358
## 6   1.0      3 0.4971469 0.04098347
## 7   5.0      3 0.4654381 0.04210971
## 8  10.0      3 0.4282267 0.04432682
## 9   0.1      4 0.5053294 0.04085232
## 10  1.0      4 0.5052428 0.04084990
## 11  5.0      4 0.5048581 0.04084058
## 12 10.0      4 0.5043469 0.04082957

set.seed(463)
tune_out_radial = tune(svm, gas_mileage_high ~ ., data = Auto, kernel = "radial",
                       ranges = list(cost = c(0.1, 1, 5, 10), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune_out_radial)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     5   0.1
## 
## - best performance: 0.04038842 
## 
## - Detailed performance results:
##    cost gamma      error  dispersion
## 1   0.1 1e-02 0.08370828 0.017128313
## 2   1.0 1e-02 0.07088978 0.017343124
## 3   5.0 1e-02 0.06588120 0.019353773
## 4  10.0 1e-02 0.06077282 0.018179031
## 5   0.1 1e-01 0.05697763 0.018790471
## 6   1.0 1e-01 0.04565992 0.017573464
## 7   5.0 1e-01 0.04038842 0.016455302
## 8  10.0 1e-01 0.04111047 0.016123173
## 9   0.1 1e+00 0.31811359 0.043237185
## 10  1.0 1e+00 0.09880664 0.014927894
## 11  5.0 1e+00 0.09936914 0.014853763
## 12 10.0 1e+00 0.09936914 0.014853763
## 13  0.1 5e+00 0.46797388 0.046923784
## 14  1.0 5e+00 0.23936970 0.005760054
## 15  5.0 5e+00 0.23936236 0.005758518
## 16 10.0 5e+00 0.23936236 0.005758518
## 17  0.1 1e+01 0.47138871 0.045048202
## 18  1.0 1e+01 0.24532133 0.003621833
## 19  5.0 1e+01 0.24532155 0.003621849
## 20 10.0 1e+01 0.24532155 0.003621849
## 21  0.1 1e+02 0.47287958 0.045224125
## 22  1.0 1e+02 0.25179202 0.002248509
## 23  5.0 1e+02 0.25179202 0.002248509
## 24 10.0 1e+02 0.25179202 0.002248509

For the polynomial kernel, the best result is obtained with degree 2 and a cost of 10 (error 0.3583), and the error is still falling at the largest cost in the grid. For the radial kernel, the best result is obtained with gamma = 0.1 and a cost of 5 (error 0.0404). The radial kernel therefore gives a lower cross-validation error than both the polynomial kernel and the linear classifier in (b) (0.0744).
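The row order of the two performance tables follows from how tune() walks the grid: every combination of the values in ranges is evaluated, with the first parameter varying fastest. A short sketch that reproduces the two grids with base R's expand.grid() (radial_grid and poly_grid are illustrative names, not objects from the analysis above):

```r
# Same candidate values as in the two tune() calls above
radial_grid <- expand.grid(cost = c(0.1, 1, 5, 10),
                           gamma = c(0.01, 0.1, 1, 5, 10, 100))
poly_grid <- expand.grid(cost = c(0.1, 1, 5, 10),
                         degree = c(2, 3, 4))

# Number of parameter combinations evaluated for each kernel
print(nrow(radial_grid))
print(nrow(poly_grid))

# First rows: cost varies fastest, matching the order of the tables above
print(head(radial_grid, 4))
```

With 10-fold cross-validation, each combination is fitted ten times, so the radial grid alone involves 240 model fits.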

(d) Make some plots to back up your assertions in (b) and (c). Hint: In the lab, we used the plot() function for svm objects only in cases with p = 2. When p > 2, you can use the plot() function to create plots displaying pairs of variables at a time. Essentially, instead of typing plot(svmfit , dat) where svmfit contains your fitted model and dat is a data frame containing your data, you can type plot(svmfit , dat , x1∼x4) in order to plot just the first and fourth variables. However, you must replace x1 and x4 with the correct variable names.

svm.linear = svm(gas_mileage_high ~ ., data = Auto, kernel = "linear", cost = 1)
svm.poly = svm(gas_mileage_high ~ ., data = Auto, kernel = "polynomial", cost = 10, degree = 2)
svm.radial = svm(gas_mileage_high ~ ., data = Auto, kernel = "radial", cost = 5, gamma = 0.1)
plotpairs = function(fit) {
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "gas_mileage_high", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}
plotpairs(svm.linear)
plotpairs(svm.poly)
plotpairs(svm.radial)

Question 8

This problem involves the OJ data set which is part of the ISLR package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

# 75% of the 1070 rows gives 802 training observations (the exercise asks for 800)
train_division <- sample(1:nrow(ISLR::OJ), nrow(ISLR::OJ)*0.75)
training_oj <- ISLR::OJ[train_division, ]
test_oj <- ISLR::OJ[-train_division, ]
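The split above is not seeded, so it changes on every run, and it yields 802 rather than 800 training rows. A minimal sketch of a reproducible split with an explicit training size, using a hypothetical data frame with the same number of rows as OJ (toy_oj and the seed value are assumptions made for illustration):

```r
# Hypothetical stand-in with the same number of rows as OJ (1070)
toy_oj <- data.frame(id = seq_len(1070))

# Any fixed seed makes the split reproducible; 1 is an arbitrary choice
set.seed(1)
train_rows <- sample(nrow(toy_oj), 800)

# drop = FALSE keeps the one-column subsets as data frames
toy_train <- toy_oj[train_rows, , drop = FALSE]
toy_test <- toy_oj[-train_rows, , drop = FALSE]

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Applied to ISLR::OJ, the same pattern would give 800 training and 270 test observations.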

(b) Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

oj_svm <- svm(Purchase ~ ., data = training_oj, kernel = "linear", cost = 0.01)
summary(oj_svm)

## 
## Call:
## svm(formula = Purchase ~ ., data = training_oj, kernel = "linear", 
##     cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  435
## 
##  ( 216 219 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
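##  CH MM

With cost = 0.01, the classifier uses 435 of the 802 training observations as support vectors, split almost evenly between the two classes (216 and 219). Such a large share of support vectors is consistent with the wide margin that a small cost produces.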

(c) What are the training and test error rates?

train_pred_oj = predict(oj_svm, training_oj)
table(training_oj$Purchase, train_pred_oj)

##     train_pred_oj
##       CH  MM
##   CH 433  53
##   MM  79 237

test_pred_oj = predict(oj_svm, test_oj)
table(test_oj$Purchase, test_pred_oj)

##     test_pred_oj
##       CH  MM
##   CH 145  22
##   MM  23  78

The training error rate is (79 + 53) / 802 = 132 / 802 ≈ 0.1646, or about 16.5%. The test error rate is (23 + 22) / 268 = 45 / 268 ≈ 0.1679, or about 16.8%.
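The same arithmetic can be done in code by taking one minus the share of counts on the diagonal of the confusion matrix. A small sketch with the counts copied by hand from the two tables above (error_rate, train_confusion and test_confusion are names introduced here for illustration):

```r
# Error rate = 1 - (correct predictions / all predictions)
error_rate <- function(confusion) {
  1 - sum(diag(confusion)) / sum(confusion)
}

# Rows are the actual class, columns the predicted class.
# matrix() fills column by column, so each pair below is one column.
labels <- list(actual = c("CH", "MM"), predicted = c("CH", "MM"))
train_confusion <- matrix(c(433, 79, 53, 237), nrow = 2, dimnames = labels)
test_confusion <- matrix(c(145, 23, 22, 78), nrow = 2, dimnames = labels)

print(round(error_rate(train_confusion), 4))
print(round(error_rate(test_confusion), 4))
```

With the real predictions, mean(predicted != actual) gives the same number without building the table first.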

(d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

tune_oj <- tune(svm, Purchase ~ ., data = training_oj, kernel = "linear", 
                ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune_oj)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##        cost
##  0.05623413
## 
## - best performance: 0.1681327 
## 
## - Detailed performance results:
##           cost     error dispersion
## 1   0.01000000 0.1756019 0.06053390
## 2   0.01778279 0.1706327 0.05700726
## 3   0.03162278 0.1681481 0.05593488
## 4   0.05623413 0.1681327 0.06058060
## 5   0.10000000 0.1731173 0.05853050
## 6   0.17782794 0.1718673 0.05891465
## 7   0.31622777 0.1706173 0.05897338
## 8   0.56234133 0.1693673 0.05870765
## 9   1.00000000 0.1706173 0.05808350
## 10  1.77827941 0.1731173 0.05882637
## 11  3.16227766 0.1743673 0.05841191
## 12  5.62341325 0.1731327 0.05947483
## 13 10.00000000 0.1743673 0.06131210

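In the output above, tune() selects cost = 0.05623413, which has the smallest cross-validation error (0.1681327). No seed was set before the call, so the cross-validation folds, and with them the selected cost, can change from run to run. The value used in part (e), cost = 0.31622777 (cross-validation error 0.1633025), does not match the table above and appears to come from a different run.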
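The selected cost can be read off the performance table by taking the row with the smallest error. A small sketch, with the values copied by hand from the summary printed above into a data frame (cv_results and best_row are names introduced here for illustration):

```r
# Costs and cross-validation errors from the summary(tune_oj) output above
cv_results <- data.frame(
  cost = 10^seq(-2, 1, by = 0.25),
  error = c(0.1756019, 0.1706327, 0.1681481, 0.1681327, 0.1731173,
            0.1718673, 0.1706173, 0.1693673, 0.1706173, 0.1731173,
            0.1743673, 0.1731327, 0.1743673)
)

# Row with the smallest cross-validation error
best_row <- cv_results[which.min(cv_results$error), ]
print(best_row)
```

With a fitted tune object, the same table is available as its performances component (for example tune_oj$performances), and the selected value as best.parameters.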

(e) Compute the training and test error rates using this new value for cost.

svm_linear_oj = svm(Purchase ~ ., kernel = "linear", data = training_oj, cost = 0.31622777)

train_pred_oj_linear = predict(svm_linear_oj, training_oj)
table(training_oj$Purchase, train_pred_oj_linear)

##     train_pred_oj_linear
##       CH  MM
##   CH 430  56
##   MM  72 244

test_pred_oj_linear = predict(svm_linear_oj, test_oj)
table(test_oj$Purchase, test_pred_oj_linear)

##     test_pred_oj_linear
##       CH  MM
##   CH 142  25
##   MM  24  77

With cost = 0.31622777, the training error rate is (72 + 56) / 802 = 128 / 802 ≈ 0.1596, or about 16.0%. The test error rate is (24 + 25) / 268 = 49 / 268 ≈ 0.1828, or about 18.3%.

(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

oj_svm_radial <- svm(Purchase ~ ., data = training_oj, kernel = "radial")

train_pred_oj_radial = predict(oj_svm_radial, training_oj)
table(training_oj$Purchase, train_pred_oj_radial)

##     train_pred_oj_radial
##       CH  MM
##   CH 444  42
##   MM  76 240

test_pred_oj_radial = predict(oj_svm_radial, test_oj)
table(test_oj$Purchase, test_pred_oj_radial)

##     test_pred_oj_radial
##       CH  MM
##   CH 152  15
##   MM  29  72

With the default cost and gamma, the error rates are:

- Training: (76 + 42) / 802 = 118 / 802 ≈ 0.1471, or about 14.7%
- Test: (29 + 15) / 268 = 44 / 268 ≈ 0.1642, or about 16.4%

tune_oj_radial <- tune(svm, Purchase ~ ., data = training_oj, kernel = "radial", 
                ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune_oj_radial)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       cost
##  0.5623413
## 
## - best performance: 0.1732716 
## 
## - Detailed performance results:
##           cost     error dispersion
## 1   0.01000000 0.3940123 0.04965464
## 2   0.01778279 0.3940123 0.04965464
## 3   0.03162278 0.3591358 0.07082075
## 4   0.05623413 0.1931944 0.03826915
## 5   0.10000000 0.1907099 0.03636310
## 6   0.17782794 0.1906944 0.04203566
## 7   0.31622777 0.1807253 0.03689856
## 8   0.56234133 0.1732716 0.03110288
## 9   1.00000000 0.1807562 0.03981907
## 10  1.77827941 0.1820062 0.03042157
## 11  3.16227766 0.1770062 0.03585365
## 12  5.62341325 0.1807562 0.04022095
## 13 10.00000000 0.1919444 0.04713522

The best cost value is 0.56234133, as it has the smallest cross-validation error (0.1732716).

svm_radial_oj = svm(Purchase ~ ., kernel = "radial", data = training_oj, cost = 0.56234133)

train_pred_oj_radial = predict(svm_radial_oj, training_oj)
table(training_oj$Purchase, train_pred_oj_radial)

##     train_pred_oj_radial
##       CH  MM
##   CH 441  45
##   MM  72 244

test_pred_oj_radial = predict(svm_radial_oj, test_oj)
table(test_oj$Purchase, test_pred_oj_radial)

##     test_pred_oj_radial
##       CH  MM
##   CH 150  17
##   MM  26  75

With the tuned cost (0.56234133) and the default gamma, the error rates are:

- Training: (72 + 45) / 802 = 117 / 802 ≈ 0.1459, or about 14.6%
- Test: (26 + 17) / 268 = 43 / 268 ≈ 0.1604, or about 16.0%

(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.

oj_svm_poly <- svm(Purchase ~ ., data = training_oj, kernel = "polynomial", degree = 2)

train_pred_oj_poly = predict(oj_svm_poly, training_oj)
table(training_oj$Purchase, train_pred_oj_poly)

##     train_pred_oj_poly
##       CH  MM
##   CH 441  45
##   MM 104 212

test_pred_oj_poly = predict(oj_svm_poly, test_oj)
table(test_oj$Purchase, test_pred_oj_poly)

##     test_pred_oj_poly
##       CH  MM
##   CH 157  10
##   MM  38  63

With the default cost, the error rates are:

- Training: (104 + 45) / 802 = 149 / 802 ≈ 0.1858, or about 18.6%
- Test: (38 + 10) / 268 = 48 / 268 ≈ 0.1791, or about 17.9%

tune_oj_poly <- tune(svm, Purchase ~ ., data = training_oj, kernel = "polynomial", degree = 2,
                ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune_oj_poly)

# Refit with the polynomial kernel at the cost selected by tune()
svm_poly_oj = svm(Purchase ~ ., kernel = "polynomial", degree = 2, data = training_oj,
                  cost = tune_oj_poly$best.parameters$cost)

train_pred_oj_poly_tuned = predict(svm_poly_oj, training_oj)
table(training_oj$Purchase, train_pred_oj_poly_tuned)

test_pred_oj_poly_tuned = predict(svm_poly_oj, test_oj)
table(test_oj$Purchase, test_pred_oj_poly_tuned)

Tuned error rates for the polynomial kernel are not reported here: the recorded run printed summary(tune_oj_radial) and refit the radial model, so its output repeated the radial results from part (f) (training error about 14.6%, test error about 16.0%).

(h) Overall, which approach seems to give the best results on this data?

The best approach for this data appears to be the radial kernel with the tuned cost (0.56234133): it gives the lowest training error (about 14.6%) and test error (about 16.0%) of the fits reported above. Its cross-validation error (0.1733) is slightly higher than that of the best linear fit (0.1681), so this conclusion rests on the training and test error rates rather than on the cross-validation error.
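To make the comparison in (h) easier to check, the misclassification counts from the confusion tables above can be collected in one data frame. A sketch with the counts copied by hand (the results data frame and its labels are introduced here for illustration; 802 and 268 are the training and test set sizes implied by the tables):

```r
# Misclassified observations per model, read from the confusion tables above
results <- data.frame(
  model = c("linear, cost = 0.01",
            "linear, cost = 0.31622777",
            "radial, default cost",
            "radial, cost = 0.56234133",
            "polynomial (degree 2), default cost"),
  train_errors = c(132, 128, 118, 117, 149),
  test_errors = c(45, 49, 44, 43, 48)
)

# Convert counts to error rates
results$train_rate <- round(results$train_errors / 802, 4)
results$test_rate <- round(results$test_errors / 268, 4)

# Sort by test error, smallest first
print(results[order(results$test_rate), ])
```

The differences between the models amount to a handful of test observations, so the ranking may well change with a different random split.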

Eileen Ramirez del Rio

4/15/2022
