Using one or two high-dimensional datasets of your choice, estimate a shrinkage and an SVM model and test them out-of-sample. A large number of high-dimensional datasets can be found here: http://archive.ics.uci.edu/ml/datasets.html . Be sure to choose those that make your life easier, rather than something that takes a lot of manipulation to get into shape. But feel free to use other data or a dataset you have already used, as long as they have at least 10 independent variables and a continuous dependent variable (for lasso/ridge) and/or a binary dependent variable (for SVM). You can also convert a continous dependent variable to a binary for the SVM stage, as we did in the lesson.
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. Web Link
key <- tibble::tribble(~id, ~name, ~var, ~type, ~desc, 1L, "school", "student's school",
"binary", "'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)", 2L, "sex",
"student's sex", "binary", "'F' - female or 'M' - male", 3L, "age", "student's age",
"numeric", "from 15 to 22", 4L, "address", "student's home address type", "binary",
"'U' - urban or 'R' - rural", 5L, "famsize", "family size", "binary", "'LE3' - less or equal to 3 or 'GT3' - greater than 3",
6L, "Pstatus", "parent's cohabitation status", "binary", "'T' - living together or 'A' - apart",
7L, "Medu", "mother's education", "numeric", "0 - none, 1 - primary education (4th grade), 2 5th to 9th grade, 3 secondary education or 4 higher education",
8L, "Fedu", "father's education", "numeric", "0 - none, 1 - primary education (4th grade), 2 5th to 9th grade, 3 secondary education or 4 – higher education",
9L, "Mjob", "mother's job", "nominal", "'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other'",
10L, "Fjob", "father's job", "nominal", "'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other'",
11L, "reason", "reason to choose this school", "nominal", "close to 'home', school 'reputation', 'course' preference or 'other'",
12L, "guardian", "student's guardian", "nominal", "'mother', 'father' or 'other'",
13L, "traveltime", "home to school travel time", "numeric", "1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour",
14L, "studytime", "weekly study time", "numeric", "1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours",
15L, "failures", "number of past class failures", "numeric", "n if 1<=n<3, else 4",
16L, "schoolsup", "extra educational support", "binary", "yes or no", 17L, "famsup",
"family educational support", "binary", "yes or no", 18L, "paid", "extra paid classes within the course subject (Math or Portuguese)",
"binary", "yes or no", 19L, "activities", "extra-curricular activities", "binary",
"yes or no", 20L, "nursery", "attended nursery school", "binary", "yes or no",
21L, "higher", "wants to take higher education", "binary", "yes or no", 22L,
"internet", "Internet access at home", "binary", "yes or no", 23L, "romantic",
"with a romantic relationship", "binary", "yes or no", 24L, "famrel", "quality of family relationships",
"numeric", "from 1 - very bad to 5 - excellent", 25L, "freetime", "free time after school",
"numeric", "from 1 - very low to 5 - very high", 26L, "goout", "going out with friends",
"numeric", "from 1 - very low to 5 - very high", 27L, "Dalc", "workday alcohol consumption",
"numeric", "from 1 - very low to 5 - very high", 28L, "Walc", "weekend alcohol consumption",
"numeric", "from 1 - very low to 5 - very high", 29L, "health", "current health status",
"numeric", "from 1 - very bad to 5 - very good", 30L, "absences", "number of school absences",
"numeric", "from 0 to 93", 31L, "G1", "first period grade", "numeric", "from 0 to 20",
31L, "G2", "second period grade", "numeric", "from 0 to 20", 32L, "G3", "final grade",
"numeric", "from 0 to 20, output target")
smat <- read.csv("~/Northeastern/Git/ppua5301/Homework 12/student/student-mat.csv",
sep = ";")
# smat <- read.csv('~/student/student-mat.csv', sep=';')
nvars <- key %>% filter(type == "numeric") %>% select(id)
nv <- key %>% filter(type == "numeric") %>% select(name)
dfNumVar <- smat[, nvars$id]
train <- sample(1:nrow(dfNumVar), nrow(dfNumVar)/2)
test <- (-train)
is <- as.matrix(dfNumVar[train, ])
os <- as.matrix(dfNumVar[test, ])
y <- is[, 15]
yos <- os[, 15]
is <- is[, -15]
os <- os[, -15]
ll <- 10^seq(7, -7, length = 100)
ridge <- cv.glmnet(is, y, alpha = 0, lambda = ll)
enet <- cv.glmnet(is, y, alpha = 0.5, lambda = ll)
lasso <- cv.glmnet(is, y, alpha = 1, lambda = ll)
plot(lasso, xvar = "lambda", main = "lasso")
plot(enet, xvar = "lambda", main = "enet")
plot(ridge, xvar = "lambda", main = "ridge")
predict(ridge, type = "coefficients", s = ridge$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.136755e-06
## age 9.307277e-08
## Medu 9.054650e-09
## Fedu 9.941532e-09
## traveltime 7.353218e-08
## studytime 1.875399e-08
## failures -2.019217e-07
## famrel 2.493070e-08
## freetime -8.588415e-09
## goout 7.592931e-08
## Dalc -1.561572e-08
## Walc -5.797767e-08
## health 9.824794e-09
## absences -1.390077e-09
## G1 9.999998e-01
## G2 1.823923e-07
predict(enet, type = "coefficients", s = enet$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 4.044558e-07
## age .
## Medu .
## Fedu .
## traveltime .
## studytime .
## failures -1.270524e-08
## famrel .
## freetime .
## goout .
## Dalc .
## Walc .
## health .
## absences .
## G1 9.999999e-01
## G2 5.592905e-08
predict(lasso, type = "coefficients", s = lasso$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.166066e-07
## age .
## Medu .
## Fedu .
## traveltime .
## studytime .
## failures .
## famrel .
## freetime .
## goout .
## Dalc .
## Walc .
## health .
## absences .
## G1 1.000000e+00
## G2 .
The error was virtually consistent across all models with the minimum lambda points. The elastic net suggests retaining 3 variables, the previous period grades, and the previous failures. The elastic net and minimum lambda for this model will be used for the following analyses.
glmEnet <- glmnet(is, y, alpha = 0.5, lambda = ll)
plot(glmEnet, xvar = "lambda")
ospred <- predict.glmnet(glmEnet, os, s = 0.5, lambda = enet$lambda.min)
lmis <- lm(y ~ is)
yhat.lm <- cbind(1, os) %*% lmis$coefficients
(mse.lm <- sum((yos - yhat.lm)^2)/nrow(os))
## [1] 0
yhat.en <- predict(enet$glmnet.fit, s = enet$lambda.min, newx = os)
(mse.en <- sum((yos - yhat.en)^2)/nrow(os))
## [1] 2.431973e-14
The elastic net performs exactly as well as the linear regression model as shown by the MSE with this data set. A sum of squared errors is irrelevant in this instance.
mean((yhat.lm/yhat.en) * 100)
## [1] 100
tss <- sum((yos - mean(yos))^2)
sse.en <- sum((yos - yhat.en)^2)
(r2.en <- (tss - sse.en)/tss)
## [1] 1
The coefficients are identical between the linear model and the elastic net model. The variables selected with the elastic net model are as good as a linear model in this case, and come very close to predicting the actual final grades as is evident from the value of 1 for the \(R^2\) between the actual values and the elastic net predicted values. This outcome suggests that because the grades from the previous periods are included, the final grades proceed linearly from these grades, ie these are multicolinear with one another and the final grade. It also indicates that students who find math challenging are unlikely to improve drastically enough and in large enough numbers to skew this linearity. However, this may not be true if there was a weighted grading schema that was not disclosed in the dataset key that might have minimized possible grade improvements over the duration of the course. Conversely, it could also mean that students who do well in math, continue to do so in a linear fashion. If the previous period grades were not included, the models might have yielded more interesting results, though the models might not have yielded any significant level of fit.
bin <- key %>% filter(type == "binary")
nom <- key %>% filter(type == "nominal")
nv$name
nom$name
bin$name
plots <- list()
for (i in seq_along(nv$name)) {
nvar <- nv$name[i]
print(nvar)
plots[[i]] <- ggplot(data = smat, aes_string(x = "failures", y = nvar, color = "address")) +
geom_count(aes(size = ..prop..), alpha = 0.5) + scale_size_area(max_size = 10)
}
for (i in 1:16) {
print(plots[[i]])
}
# studytime v G1 color=Sex
set.seed(2)
library("forcats")
smat$sex <- fct_relevel(smat$sex, "M", "F")
dfNumVar <- smat[, nvars$id]
train <- sample(1:nrow(dfNumVar), nrow(dfNumVar)/2)
test <- (-train)
is <- dfNumVar[train, ]
os <- dfNumVar[test, ]
y <- smat[train, ] %>% select(sex)
yos <- smat[test, ] %>% select(sex)
library("e1071")
kern <- c(rep("linear", 4), rep("polynomial", 4), rep("radial", 4), rep("sigmoid",
4))
svms <- list()
cv <- 10^seq(-3, 2, 1)
dfIs <- cbind(y, is)
dfOs <- cbind(yos, os)
dfTune <- data.frame(Kernel = rep(NA, length(kern)), Performance = rep(NA, length(kern)),
Cost = rep(NA, length(kern)))
for (i in seq_along(kern)) {
k <- kern[i]
print(kern[i])
tSvm <- NULL
tSvm <- tryCatch({
tune(svm, sex ~ G1 + studytime, data = dfIs, ranges = list(cost = cv), kernel = k)
}, error = function(err) {
NA
})
if (any(!is.na(tSvm)) == T) {
svms[[i]] <- tSvm
dfTune$Kernel[i] <- k
dfTune$Performance[i] <- tSvm$best.performance
dfTune$Cost[i] <- tSvm$best.parameters$cost
}
}
## [1] "linear"
## [1] "linear"
## [1] "linear"
## [1] "linear"
## [1] "polynomial"
## [1] "polynomial"
## [1] "polynomial"
## [1] "polynomial"
## [1] "radial"
## [1] "radial"
## [1] "radial"
## [1] "radial"
## [1] "sigmoid"
## [1] "sigmoid"
## [1] "sigmoid"
## [1] "sigmoid"
dfTune %>% arrange(Performance)
## Kernel Performance Cost
## 1 sigmoid 0.3757895 0.10
## 2 sigmoid 0.3855263 0.10
## 3 linear 0.3863158 10.00
## 4 linear 0.3915789 10.00
## 5 linear 0.3952632 1.00
## 6 sigmoid 0.4002632 0.10
## 7 sigmoid 0.4007895 0.10
## 8 linear 0.4010526 0.10
## 9 radial 0.4173684 0.10
## 10 radial 0.4197368 0.10
## 11 radial 0.4271053 0.10
## 12 polynomial 0.4281579 1.00
## 13 radial 0.4365789 0.10
## 14 polynomial 0.4418421 10.00
## 15 polynomial 0.4531579 10.00
## 16 polynomial 0.4621053 0.01
bSvm <- svm(sex ~ G1 + studytime, data = dfIs, cost = 0.1, kernel = "sigmoid")
yhat <- predict(bSvm, newdata = dfOs)
table(predicted = yhat, truth = dfOs$sex)
## truth
## predicted M F
## M 44 12
## F 50 92
sum(yhat == dfOs$sex)/length(dfOs$sex)
## [1] 0.6868687
logOs <- glm(sex ~ G1 + studytime, data = dfIs, family = "binomial")
logyhat <- predict(logOs, newdata = dfOs)
logyhat <- ifelse(logyhat > 0.5, "M", "F")
sum(logyhat == dfOs$sex)/length(dfOs$sex)
## [1] 0.4141414
The SVM had approximately 6% greater accuracy (less error) than did the logistic regression. Note: I reran this script and SVM now has 27% greater accuracy. This would indicate that different trials can have very different outcomes with logistic regression.
SVM Kernels allow an iterative approach to adapting a classification model that can classify data differentiated by linear and non-linear relationships into classes with a reasonable degree of accuracy. The logistic model is ideally suited for data that is differentiable via a line defined by the log function. Surprisingly, the dataset used in Q2 was best modeled with an SVM using a sigmoid hyperplane, which looks similar to the logistic regression line. Perhaps the geometric approach to fitting the model employed by the SVM gives it greater accuracy than the statistical equation model used in the LR with this specific data. The simplicity of this particular data (due to it being sorted graphically preliminarily) was key to its ability to be sorted into classes, and might have given SVM the advantage, whereas in data with a greater number of features and complexity, the LR approach would have yielded greater accuracy.