Due Tuesday, April 25, by the start of class (2:00 pm).
Use of R Markdown is optional for this assignment.
This homework uses the same dataset as HW #6, contained in the file `MH-CLD-small.csv`.
Use a support vector machine (SVM) to predict whether a person will meet the criteria for a substance abuse problem (SAP) based on the other five variables (age, gender, veteran status, prior diagnosis of schizophrenia, and prior diagnosis of major depressive disorder).
As before, split the data in half. Train the model on one half (training data) and evaluate its performance on the other (validation data).
For this problem, use a linear kernel:
install.packages("e1071")
library(e1071)
library(tidyverse)
# Read the data and split it 50/50 into training and validation sets
dataset <- read_csv("MH-CLD-small.csv")
n <- nrow(dataset)
set.seed(42)
train_idx <- sample(1:n, floor(n/2))
validation_idx <- (1:n)[-train_idx]
# Columns 1-5 are the predictors; column 6 is the substance abuse outcome
train_x <- as.matrix(dataset[train_idx, 1:5])
train_y <- as.matrix(dataset[train_idx, 6])
validation_x <- as.matrix(dataset[validation_idx, 1:5])
validation_y <- as.matrix(dataset[validation_idx, 6])
# Fit a linear-kernel SVM classifier
model <- svm(train_x, train_y,
             type = "C-classification",
             kernel = "linear")
# Validation accuracy: predict() returns a factor, so convert through
# as.character() to recover the 0/1 labels (as.numeric() on a factor
# would return the level codes 1/2 instead)
predicted_y <- predict(model, validation_x)
predicted_binary <- as.numeric(as.character(predicted_y))
accuracy <- mean(predicted_binary == validation_y)
#cat("Validation Accuracy:", accuracy*100, "%")
# Training accuracy
train_pred <- predict(model, train_x)
train_accuracy <- mean(train_pred == train_y)
#cat("Training Accuracy:", train_accuracy*100, "%")
What is its accuracy on the training data?
cat("Training Accuracy:", train_accuracy*100, "%") # 55.82 %
What is its accuracy on the validation data?
cat("Validation Accuracy:", accuracy*100, "%") # 37.94 %
Repeat your analysis from problem 1, this time using a nonlinear SVM, in particular a radial basis function (RBF) kernel:
library(e1071)
library(tidyverse)
# Same data preparation as in problem 1: read the data and split it 50/50
dataset <- read_csv("MH-CLD-small.csv")
n <- nrow(dataset)
set.seed(42)
train_idx <- sample(1:n, floor(n/2))
validation_idx <- (1:n)[-train_idx]
train_x <- as.matrix(dataset[train_idx, 1:5])
train_y <- as.matrix(dataset[train_idx, 6])
validation_x <- as.matrix(dataset[validation_idx, 1:5])
validation_y <- as.matrix(dataset[validation_idx, 6])
# Fit an SVM with a radial basis function (RBF) kernel
model <- svm(train_x, train_y,
             type = "C-classification",
             kernel = "radial")
# Validation accuracy (converting the factor predictions back to 0/1 labels)
predicted_y <- predict(model, validation_x)
predicted_binary <- as.numeric(as.character(predicted_y))
accuracy <- mean(predicted_binary == validation_y)
#cat("Validation Accuracy:", accuracy*100, "%")
# Training accuracy
train_pred <- predict(model, train_x)
train_accuracy <- mean(train_pred == train_y)
#cat("Training Accuracy:", train_accuracy*100, "%")
What is its accuracy on the training data?
cat("Training Accuracy:", train_accuracy*100, "%") # 58.96 %
What is its accuracy on the validation data?
cat("Validation Accuracy:", accuracy*100, "%") # 27.76 %
For the model you trained in the previous problem, what is its false alarm rate? That is, the rate at which the model reports SAP = 1 when the actual value is 0 in the validation data.
# False alarms: the model predicts SAP = 1 but the actual value is 0
predicted_y <- predict(model, validation_x)
false_alarms <- which(predicted_y == 1 & validation_y == 0)

# Rate computed over all validation cases
false_alarm_rate <- length(false_alarms) / length(validation_y)
cat("False alarm rate:", false_alarm_rate) # 0.1416