Using one or two high-dimensional datasets of your choice, estimate a shrinkage and an SVM model and test them out-of-sample. A large number of high-dimensional datasets can be found here: http://archive.ics.uci.edu/ml/datasets.html. Be sure to choose those that make your life easier, rather than something that takes a lot of manipulation to get into shape. But feel free to use other data or a dataset you have already used, as long as they have at least 10 independent variables and a continuous dependent variable (for lasso/ridge) and/or a binary dependent variable (for SVM). You can also convert a continuous dependent variable to a binary one for the SVM stage, as we did in the lesson.


Data:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira, Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008, EUROSIS, ISBN 978-9077381-39-7.
# Codebook for the student-performance data (one row per variable)
key <- tibble::tribble(
    ~id, ~name, ~var, ~type, ~desc,
    1L, "school", "student's school", "binary", "'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira",
    2L, "sex", "student's sex", "binary", "'F' - female or 'M' - male",
    3L, "age", "student's age", "numeric", "from 15 to 22",
    4L, "address", "student's home address type", "binary", "'U' - urban or 'R' - rural",
    5L, "famsize", "family size", "binary", "'LE3' - less or equal to 3 or 'GT3' - greater than 3",
    6L, "Pstatus", "parent's cohabitation status", "binary", "'T' - living together or 'A' - apart",
    7L, "Medu", "mother's education", "numeric", "0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education",
    8L, "Fedu", "father's education", "numeric", "0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education",
    9L, "Mjob", "mother's job", "nominal", "'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other'",
    10L, "Fjob", "father's job", "nominal", "'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other'",
    11L, "reason", "reason to choose this school", "nominal", "close to 'home', school 'reputation', 'course' preference or 'other'",
    12L, "guardian", "student's guardian", "nominal", "'mother', 'father' or 'other'",
    13L, "traveltime", "home to school travel time", "numeric", "1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour",
    14L, "studytime", "weekly study time", "numeric", "1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours",
    15L, "failures", "number of past class failures", "numeric", "n if 1<=n<3, else 4",
    16L, "schoolsup", "extra educational support", "binary", "yes or no",
    17L, "famsup", "family educational support", "binary", "yes or no",
    18L, "paid", "extra paid classes within the course subject (Math or Portuguese)", "binary", "yes or no",
    19L, "activities", "extra-curricular activities", "binary", "yes or no",
    20L, "nursery", "attended nursery school", "binary", "yes or no",
    21L, "higher", "wants to take higher education", "binary", "yes or no",
    22L, "internet", "Internet access at home", "binary", "yes or no",
    23L, "romantic", "with a romantic relationship", "binary", "yes or no",
    24L, "famrel", "quality of family relationships", "numeric", "from 1 - very bad to 5 - excellent",
    25L, "freetime", "free time after school", "numeric", "from 1 - very low to 5 - very high",
    26L, "goout", "going out with friends", "numeric", "from 1 - very low to 5 - very high",
    27L, "Dalc", "workday alcohol consumption", "numeric", "from 1 - very low to 5 - very high",
    28L, "Walc", "weekend alcohol consumption", "numeric", "from 1 - very low to 5 - very high",
    29L, "health", "current health status", "numeric", "from 1 - very bad to 5 - very good",
    30L, "absences", "number of school absences", "numeric", "from 0 to 93",
    31L, "G1", "first period grade", "numeric", "from 0 to 20",
    32L, "G2", "second period grade", "numeric", "from 0 to 20",
    33L, "G3", "final grade", "numeric", "from 0 to 20, output target")
smat <- read.csv("~/Northeastern/Git/ppua5301/Homework 12/student/student-mat.csv", 
    sep = ";")
# smat <- read.csv('~/student/student-mat.csv', sep=';')

1

Use your dataset with a continuous dependent variable:

a

Divide your data into two equal-sized samples, the in-sample and the out-sample. Estimate the elastic net model using at least three levels of alpha (i.e., three positions between full ridge and full lasso; e.g., alpha = 0, 0.5, and 1), using cv.glmnet to find the best lambda level for each run. (Remember that glmnet prefers that data be in a numeric matrix format rather than a data frame.)
library(dplyr)
library(glmnet)
set.seed(1)  # make the random in/out-sample split reproducible
# Keep only the numeric variables listed in the codebook
nvars <- key %>% filter(type == "numeric") %>% select(id)
nv <- key %>% filter(type == "numeric") %>% select(name)
dfNumVar <- smat[, nvars$id]
# Split into equal-sized in-sample and out-sample halves
train <- sample(1:nrow(dfNumVar), nrow(dfNumVar)/2)
test <- (-train)
is <- as.matrix(dfNumVar[train, ])
os <- as.matrix(dfNumVar[test, ])
# The dependent variable is the final grade G3; select it by name rather than
# by position so the remaining columns are the predictors
y <- is[, "G3"]
yos <- os[, "G3"]
is <- is[, colnames(is) != "G3"]
os <- os[, colnames(os) != "G3"]
# Cross-validate each alpha over a wide lambda grid
ll <- 10^seq(7, -7, length = 100)
ridge <- cv.glmnet(is, y, alpha = 0, lambda = ll)
enet <- cv.glmnet(is, y, alpha = 0.5, lambda = ll)
lasso <- cv.glmnet(is, y, alpha = 1, lambda = ll)
plot(lasso, main = "lasso")

plot(enet, main = "enet")

plot(ridge, main = "ridge")

predict(ridge, type = "coefficients", s = ridge$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.136755e-06
## age          9.307277e-08
## Medu         9.054650e-09
## Fedu         9.941532e-09
## traveltime   7.353218e-08
## studytime    1.875399e-08
## failures    -2.019217e-07
## famrel       2.493070e-08
## freetime    -8.588415e-09
## goout        7.592931e-08
## Dalc        -1.561572e-08
## Walc        -5.797767e-08
## health       9.824794e-09
## absences    -1.390077e-09
## G1           9.999998e-01
## G2           1.823923e-07
predict(enet, type = "coefficients", s = enet$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  4.044558e-07
## age          .           
## Medu         .           
## Fedu         .           
## traveltime   .           
## studytime    .           
## failures    -1.270524e-08
## famrel       .           
## freetime     .           
## goout        .           
## Dalc         .           
## Walc         .           
## health       .           
## absences     .           
## G1           9.999999e-01
## G2           5.592905e-08
predict(lasso, type = "coefficients", s = lasso$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 3.166066e-07
## age         .           
## Medu        .           
## Fedu        .           
## traveltime  .           
## studytime   .           
## failures    .           
## famrel      .           
## freetime    .           
## goout       .           
## Dalc        .           
## Walc        .           
## health      .           
## absences    .           
## G1          1.000000e+00
## G2          .

At the minimum-lambda point, the cross-validated error was virtually identical across the three models (this can be checked directly, as sketched below). The elastic net retains three variables: the two previous-period grades (G1 and G2) and the number of past failures. The elastic net at its minimum lambda is used for the analyses that follow.
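As a check on that claim, the minimum cross-validated error of each fit can be compared directly; cv.glmnet stores the error curve in cvm, and its minimum is the error at lambda.min (a minimal sketch using the fits above):

# Minimum cross-validated MSE for each alpha
c(ridge = min(ridge$cvm), enet = min(enet$cvm), lasso = min(lasso$cvm))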

b

Choose the alpha (and corresponding lambda) with the best results (lowest error), and then test that model out-of-sample using the out-sample data.
glmEnet <- glmnet(is, y, alpha = 0.5, lambda = ll)
plot(glmEnet, xvar = "lambda")

# Predict out-of-sample at the cross-validated best lambda
ospred <- predict(glmEnet, newx = os, s = enet$lambda.min)
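Part c makes the formal comparison, but as a quick sanity check the out-of-sample error of these predictions can be computed directly (a minimal sketch using the objects above):

# Out-of-sample mean squared error of the elastic net predictions
mean((yos - ospred)^2)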

c

Compare your out-of-sample results to regular multiple regression: fit the regression model in-sample, predict yhat out-of-sample, and estimate the error. Which works better?
# Fit OLS in-sample, then predict out-of-sample by hand from the coefficients
lmis <- lm(y ~ is)
yhat.lm <- cbind(1, os) %*% lmis$coefficients
(mse.lm <- sum((yos - yhat.lm)^2)/nrow(os))
## [1] 0
yhat.en <- predict(enet$glmnet.fit, s = enet$lambda.min, newx = os)
(mse.en <- sum((yos - yhat.en)^2)/nrow(os))
## [1] 2.431973e-14

As measured by out-of-sample MSE, the elastic net performs identically to the linear regression on this dataset: both errors are zero to within machine precision. When both errors are effectively zero, comparing sums of squared errors cannot distinguish the models.

d

Which coefficients are different between the multiple regression and the elastic net model? What, if anything, does this tell you substantively about the effects of your independent variables on your dependent variable?
# Check agreement between the two prediction vectors: the mean ratio, in percent
mean((yhat.lm/yhat.en) * 100)
## [1] 100
# R-squared of the elastic net predictions against the actual out-of-sample values
tss <- sum((yos - mean(yos))^2)
sse.en <- sum((yos - yhat.en)^2)
(r2.en <- (tss - sse.en)/tss)
## [1] 1

The coefficients are effectively identical between the linear model and the elastic net model. The variables selected by the elastic net predict as well as the full linear model here, and both come very close to reproducing the actual final grades, as is evident from the \(R^2\) of 1 between the actual values and the elastic net predictions. This outcome suggests that, because the grades from the previous periods are included as predictors, the final grade follows almost deterministically from them; i.e., G1 and G2 are highly multicollinear with each other and with the final grade. It also indicates that students who find math challenging rarely improve drastically enough, or in large enough numbers, to disturb this linearity. This might not hold if there were an undisclosed weighted grading scheme that limited possible grade improvements over the duration of the course. Conversely, students who do well in math appear to continue doing so in the same linear fashion. Had the previous-period grades been excluded, the models might have yielded more interesting variable selections, though likely with a much poorer fit.
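One way to probe that speculation is to drop G1 and G2 and rerun the elastic net to see which of the remaining variables survive. This is a hedged sketch, not part of the assignment, reusing the in-sample matrix from part a:

# Hypothetical refit: elastic net without the previous-period grades
is2 <- is[, !colnames(is) %in% c("G1", "G2")]
enet2 <- cv.glmnet(is2, y, alpha = 0.5)
predict(enet2, type = "coefficients", s = enet2$lambda.min)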

2

Repeat the same process using your dataset with a binary dependent variable:
bin <- key %>% filter(type == "binary")
nom <- key %>% filter(type == "nominal")
nv$name
nom$name
bin$name
library(ggplot2)
# Plot each numeric variable against failures, colored by a binary variable, to
# scan visually for combinations that separate the two classes
plots <- list()
for (i in seq_along(nv$name)) {
    nvar <- nv$name[i]
    print(nvar)
    plots[[i]] <- ggplot(data = smat, aes(x = failures, y = .data[[nvar]], color = address)) +
        geom_count(aes(size = after_stat(prop)), alpha = 0.5) +
        scale_size_area(max_size = 10)
}
for (i in seq_along(plots)) {
    print(plots[[i]])
}
# studytime v G1 color=Sex
After evaluating numerous combinations of variables, studytime vs. G1 with sex as the binary class looks suitable for a decent SVM analysis; that single plot is reproduced below.
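For reference, here is the chosen combination drawn on its own, using the same geom_count display as the loop above:

# The chosen combination: studytime vs. G1, colored by sex
ggplot(data = smat, aes(x = studytime, y = G1, color = sex)) +
    geom_count(aes(size = after_stat(prop)), alpha = 0.5) +
    scale_size_area(max_size = 10)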

a

Divide your data into an in-sample and out-sample as before, and estimate an SVM using at least two different kernels and tune to find the best cost level for each.
set.seed(2)
library("forcats")
smat$sex <- fct_relevel(smat$sex, "M", "F")  # put 'M' first; a binomial glm will then model P(sex = 'F')
dfNumVar <- smat[, nvars$id]
train <- sample(1:nrow(dfNumVar), nrow(dfNumVar)/2)
test <- (-train)
is <- dfNumVar[train, ]
os <- dfNumVar[test, ]
y <- smat[train, ] %>% select(sex)
yos <- smat[test, ] %>% select(sex)
library("e1071")
kern <- c(rep("linear", 4), rep("polynomial", 4), rep("radial", 4), rep("sigmoid", 
    4))
svms <- list()
cv <- 10^seq(-3, 2, 1)  # cost grid: 0.001 to 100
dfIs <- cbind(y, is)
dfOs <- cbind(yos, os)

dfTune <- data.frame(Kernel = rep(NA, length(kern)), Performance = rep(NA, length(kern)), 
    Cost = rep(NA, length(kern)))
for (i in seq_along(kern)) {
    k <- kern[i]
    print(kern[i])
    # tune() cross-validates over the cost grid; skip a kernel entirely if it errors
    tSvm <- tryCatch({
        tune(svm, sex ~ G1 + studytime, data = dfIs, ranges = list(cost = cv), kernel = k)
    }, error = function(err) {
        NA
    })
    if (inherits(tSvm, "tune")) {
        svms[[i]] <- tSvm
        dfTune$Kernel[i] <- k
        dfTune$Performance[i] <- tSvm$best.performance
        dfTune$Cost[i] <- tSvm$best.parameters$cost
    }
}
## [1] "linear"
## [1] "linear"
## [1] "linear"
## [1] "linear"
## [1] "polynomial"
## [1] "polynomial"
## [1] "polynomial"
## [1] "polynomial"
## [1] "radial"
## [1] "radial"
## [1] "radial"
## [1] "radial"
## [1] "sigmoid"
## [1] "sigmoid"
## [1] "sigmoid"
## [1] "sigmoid"
dfTune %>% arrange(Performance)
##        Kernel Performance  Cost
## 1     sigmoid   0.3757895  0.10
## 2     sigmoid   0.3855263  0.10
## 3      linear   0.3863158 10.00
## 4      linear   0.3915789 10.00
## 5      linear   0.3952632  1.00
## 6     sigmoid   0.4002632  0.10
## 7     sigmoid   0.4007895  0.10
## 8      linear   0.4010526  0.10
## 9      radial   0.4173684  0.10
## 10     radial   0.4197368  0.10
## 11     radial   0.4271053  0.10
## 12 polynomial   0.4281579  1.00
## 13     radial   0.4365789  0.10
## 14 polynomial   0.4418421 10.00
## 15 polynomial   0.4531579 10.00
## 16 polynomial   0.4621053  0.01
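Since each kernel was tuned four times, averaging the repeated runs gives a steadier ranking than any single tuning; a small dplyr summary of the table above:

# Mean cross-validated error per kernel across the four repeated tunings
dfTune %>% group_by(Kernel) %>%
    summarize(meanPerf = mean(Performance)) %>%
    arrange(meanPerf)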

b

Choose the kernel and cost with the best results, and then test that model out-of-sample using the out-sample data.
# Refit using the best combination found above: sigmoid kernel, cost = 0.1
bSvm <- svm(sex ~ G1 + studytime, data = dfIs, cost = 0.1, kernel = "sigmoid")
yhat <- predict(bSvm, newdata = dfOs)
table(predicted = yhat, truth = dfOs$sex)
##          truth
## predicted  M  F
##         M 44 12
##         F 50 92
sum(yhat == dfOs$sex)/length(dfOs$sex)  # out-of-sample accuracy
## [1] 0.6868687
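Because the model uses only two predictors, e1071's plot method for svm objects can draw the fitted decision regions, which makes the shape of the sigmoid boundary visible (a minimal sketch on the in-sample data):

# Decision regions of the tuned SVM; the formula names the two axes
plot(bSvm, dfIs, G1 ~ studytime)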

c

Compare your results to a logistic regression: fit the logit in-sample, predict yhat out-of-sample, and estimate the accuracy. Which works better?
logOs <- glm(sex ~ G1 + studytime, data = dfIs, family = "binomial")
# predict() on a glm returns log-odds by default, so ask for probabilities.
# With factor levels (M, F), the model estimates P(sex = "F"): p > 0.5 means "F"
logp <- predict(logOs, newdata = dfOs, type = "response")
logyhat <- ifelse(logp > 0.5, "F", "M")
sum(logyhat == dfOs$sex)/length(dfOs$sex)
## [1] 0.4141414

In the run shown here, the SVM was roughly 27 percentage points more accurate than the logistic regression (0.69 vs. 0.41); an earlier run of the same script showed only about a 6-point gap. Because the train/test split is random, repeated trials can produce very different comparisons between the two models; one way to quantify that spread is sketched below.
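The run-to-run spread can be estimated by repeating the split, fit, and scoring several times. This is a hedged sketch that reuses the formula and tuning choices from above rather than re-tuning on each split:

# Hypothetical stability check: out-of-sample accuracy over 10 random splits
accs <- replicate(10, {
    tr <- sample(1:nrow(smat), floor(nrow(smat)/2))
    isD <- smat[tr, ]
    osD <- smat[-tr, ]
    s <- svm(sex ~ G1 + studytime, data = isD, cost = 0.1, kernel = "sigmoid")
    g <- glm(sex ~ G1 + studytime, data = isD, family = "binomial")
    p <- predict(g, newdata = osD, type = "response")
    c(svm = mean(predict(s, newdata = osD) == osD$sex),
      logit = mean(ifelse(p > 0.5, "F", "M") == osD$sex))
})
rowMeans(accs)  # average accuracy for each model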

d

Can you make any guesses as to why the SVM works better (if it does)? Feel free to speculate, or to research a bit more into the output of svm, the meaning of the support vectors, or anything else you can discover about SVMs (no points off for erroneous speculations!).

SVM kernels provide a flexible way to fit classification boundaries, separating classes related both linearly and non-linearly with a reasonable degree of accuracy. Logistic regression, by contrast, draws a single linear decision boundary in the feature space; the S-shaped logistic curve only maps that linear score to a probability. Interestingly, the dataset used in Q2 was best modeled by an SVM with a sigmoid kernel, whose shape resembles the logistic curve. Perhaps the geometric, margin-maximizing fit of the SVM gives it greater accuracy here than the likelihood-based fit of the logistic regression. The simplicity of this particular problem, with two predictors chosen after graphical screening, made the classes comparatively easy to separate and may have given the SVM the advantage; with many more features and greater complexity, the logistic regression approach might have yielded greater accuracy.
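On the support-vector point: the fitted svm object reports how many training observations ended up as support vectors, and a large share of the training set suggests heavy overlap between the classes (a minimal sketch on the model from part b):

summary(bSvm)  # reports the total number of support vectors and the per-class split
bSvm$tot.nSV/nrow(dfIs)  # share of training points serving as support vectors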