Question 2.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
The coronavirus is devastating to certain subsets of individuals. Whether an individual succumbs to the virus (and is therefore classified as a mortality) can be predicted from preexisting comorbidities. The questionnaire used to screen individuals for comorbidities asks about the following conditions:
Have you ever been told by a medical professional that you have or are diagnosed with:
Chronic kidney disease? yes/no
Chronic pulmonary disease? yes/no
Chronic heart disease? yes/no
Chronic liver disease? yes/no
Diabetes? yes/no
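As a rough illustration (not required by the question), a classifier for this screening data could be fit in R once the answers are coded as 0/1 indicators. The data frame patients and its column names below are hypothetical, and logistic regression is used here only as one example of a classification model:
# Hypothetical data: each row is a patient, each predictor is a 0/1 answer to a
# screening question, and mortality is the 0/1 response to classify.
# Logistic regression is shown purely as an example of a classifier.
model <- glm(mortality ~ kidney_disease + pulmonary_disease + heart_disease +
               liver_disease + diabetes,
             data = patients, family = binomial)
predicted_class <- as.integer(predict(model, type = "response") > 0.5)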
Question 2.2
The files credit_card_data.txt (without headers) and credit_card_data-headers.txt (with headers) contain a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It has anonymized credit card applications with a binary response variable (last column) indicating if the application was positive or negative. The dataset is the “Credit Approval Data Set” from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) without the categorical variables and without data points that have missing values.
install.packages("kernlab",repos = "http://cran.us.r-project.org")
library(kernlab)
# Read in the credit card data (tab-delimited, no header row)
cc_data <- read.delim("C:\\Users\\sethm\\OneDrive\\Documents\\credit_card_data.txt", header=FALSE)
# Fit a linear soft-margin SVM (vanilladot kernel) with C = 100 on the scaled data
cc_model <- ksvm(V11~., data=cc_data, type = "C-svc", kernel = "vanilladot", C = 100, scaled=TRUE)
## Setting default kernel parameters
# Recover the coefficients a1..a10 of the linear classifier
a <- colSums(cc_model@xmatrix[[1]] * cc_model@coef[[1]])
a
## V1 V2 V3 V4 V5
## -0.0010065348 -0.0011729048 -0.0016261967 0.0030064203 1.0049405641
## V6 V7 V8 V9 V10
## -0.0028259432 0.0002600295 -0.0005349551 -0.0012283758 0.1063633995
# Recover the intercept a0
a0 <- -cc_model@b
a0
## [1] 0.08158492
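As an optional sanity check (a sketch, not part of the assignment output), the recovered coefficients can be applied directly to the data to reproduce the classifier's decisions. Because the model was fit with scaled = TRUE, the predictors must be scaled the same way first, and the 0/1 labels may come out inverted depending on how ksvm ordered the two classes:
# Sketch: apply the recovered linear classifier a*x + a0 to the scaled data.
# If the labels come out inverted (a quirk of ksvm's internal class ordering),
# compare (1 - manual_pred) instead.
scaled_X <- scale(cc_data[, 1:10])
manual_pred <- as.integer(scaled_X %*% a + a0 > 0)
sum(manual_pred == cc_data[, 11]) / nrow(cc_data)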
# Predict every point in the data set with the fitted model
pred <- predict(cc_model, cc_data[,1:10])
pred
## [1] 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [297] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [556] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# Fraction of points the model classifies correctly, as a percentage
sum(pred == cc_data[,11]) / nrow(cc_data) * 100
## [1] 86.39144
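The value C = 100 was used above; as a quick optional check that the result is not overly sensitive to that choice, the same fit-and-score step can be repeated over a few arbitrarily chosen values of C:
# Sketch: refit the linear SVM for several arbitrary values of C and score
# each model on the full data set, exactly as above.
for (C_try in c(0.01, 1, 100, 10000)) {
  m <- ksvm(V11 ~ ., data = cc_data, type = "C-svc",
            kernel = "vanilladot", C = C_try, scaled = TRUE)
  p <- predict(m, cc_data[, 1:10])
  cat("C =", C_try, " accuracy =", sum(p == cc_data[, 11]) / nrow(cc_data), "\n")
}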
Question 2.3
Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don’t forget to scale the data (scale=TRUE in kknn).
library(kknn)
acc_chk = function(Z){
  pred <- rep(0, nrow(cc_data))
  for (i in 1:nrow(cc_data)){
    # Leave-one-out: fit on every point except i, then predict point i
    knn_model = kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, cc_data[-i,], cc_data[i,], k=Z, scale = T)
    pred[i] <- as.integer(fitted(knn_model) + 0.5) # round to 0 or 1
  }
  acc = sum(pred == cc_data[,11]) / nrow(cc_data)
  return(acc)
}
test_vec <- rep(0,30)
for (Z in 1:30){
  test_vec[Z] = acc_chk(Z)
}
knn_accuracy <- as.matrix(test_vec * 100)
knn_accuracy
## [,1]
## [1,] 81.49847
## [2,] 81.49847
## [3,] 81.49847
## [4,] 81.49847
## [5,] 85.16820
## [6,] 84.55657
## [7,] 84.70948
## [8,] 84.86239
## [9,] 84.70948
## [10,] 85.01529
## [11,] 85.16820
## [12,] 85.32110
## [13,] 85.16820
## [14,] 85.16820
## [15,] 85.32110
## [16,] 85.16820
## [17,] 85.16820
## [18,] 85.16820
## [19,] 85.01529
## [20,] 85.01529
## [21,] 84.86239
## [22,] 84.70948
## [23,] 84.40367
## [24,] 84.55657
## [25,] 84.55657
## [26,] 84.40367
## [27,] 84.09786
## [28,] 83.79205
## [29,] 83.94495
## [30,] 84.09786
knn_value <- 1:30
plot(knn_value,knn_accuracy)
max(knn_accuracy)
## [1] 85.3211
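Rather than reading the best k off the plot, it can also be recovered programmatically; which.max returns the first maximizing index in case of ties:
which.max(knn_accuracy)  # k with the highest accuracy (first one if tied)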
The value of k that best classifies the data points is 12 (tied with k = 15), with an accuracy of 85.32%.
Question 3.1
Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier: (a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and (b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional).
# Leave-one-out cross-validation over k = 1..30 via train.kknn
Main_Model = train.kknn(V11~., cc_data, kmax=30, scale=TRUE)
Main_Model_accry = rep(0,30)
for (k in 1:30) {
  # Round the cross-validated fitted values for each k and count correct classifications
  Main_Model_predicted <- as.integer(fitted(Main_Model)[[k]][1:nrow(cc_data)] + 0.5)
  Main_Model_accry[k] <- sum(Main_Model_predicted == cc_data$V11)
}
Main_Model_accry
## [1] 533 533 533 533 557 553 554 555 554 557 557 558 557 557 558 558 558 557 556
## [20] 556 555 554 552 553 553 552 550 548 549 550
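For comparison with the loop in Question 2.3, these correct-classification counts can be converted to accuracies; the best counts (558 of 654, about 85.3%, at k = 12 and k = 15 through 17) agree closely with the leave-one-out results above. A small sketch:
# Sketch: convert counts of correct classifications to accuracy percentages
# and pick out the best-performing k (the first one in case of ties).
Main_Model_accry / nrow(cc_data) * 100
which.max(Main_Model_accry)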
# Split the data 70/15/15 into training, validation, and test sets
samp = sample(nrow(cc_data), round(nrow(cc_data)*.7))
training = cc_data[samp, ]
temp = cc_data[-samp, ]
temp2 = sample(nrow(temp), round(nrow(temp)*.5))
validation = temp[temp2, ]
test = temp[-temp2, ]
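Because the split is produced by sample(), the partitions (and the accuracies reported below) change from run to run. A minimal sketch of making the split reproducible; the seed value is arbitrary and not part of the original analysis:
# Sketch: fixing the random seed before sampling makes the 70/15/15 split,
# and therefore the accuracies below, reproducible. The seed is arbitrary.
set.seed(1)
samp <- sample(nrow(cc_data), round(nrow(cc_data) * 0.7))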
train.model = function(neighbors)
{
  # Fit kknn with the given k, predict each validation row, and return a
  # data frame of actual vs. predicted labels
  cc_model = kknn(V11 ~., test, validation, k = neighbors, scale = TRUE)
  cc_model_fitted = as.matrix(cc_model$fitted.values)
  cc_model_fitted_rounded = as.matrix(lapply(cc_model_fitted[, 1], round))
  cc_model_results = data.frame(validation[, 'V11'], cc_model_fitted_rounded)
  colnames(cc_model_results) = c("actual", "predicted")
  return(cc_model_results)
}
eval.model = function(cc_model)
{
  # Compare actual and predicted labels row by row and return the fraction correct
  results <- vector("list", nrow(cc_model))
  for(i in 1:nrow(cc_model)){
    results[[i]] <- as.integer(cc_model[i, 'actual'] == cc_model[i, 'predicted'])
  }
  correct = sum(data.frame(results))
  accuracy = correct / nrow(cc_model)
  return(accuracy)
}
cc_model_k4 = train.model(4)
cc_model_k12 = train.model(12)
cc_model_k16 = train.model(16)
eval.model(cc_model_k4)
## [1] 0.7653061
eval.model(cc_model_k12)
## [1] 0.7755102
eval.model(cc_model_k16)
## [1] 0.8061224
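Of the values tried, k = 16 gives the highest validation accuracy (about 80.6%) on this particular split. To finish the train/validation/test procedure, the chosen k would then be scored once on the untouched test partition. A sketch of that final step, here fitting on the training partition; the exact number will vary with the random split:
# Sketch: fit the chosen k on the training partition and report accuracy on
# the held-out test partition. The result depends on the random split above.
final_model <- kknn(V11 ~ ., training, test, k = 16, scale = TRUE)
final_pred <- as.integer(fitted(final_model) + 0.5)
sum(final_pred == test$V11) / nrow(test)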