In the first project, both Support Vector Machines (SVM) and Logistic Regression were used to predict, from a number of risk factors, whether a given patient has heart disease. The data set Heart.csv contains records for 303 patients: 13 risk factors and a zero-one response indicating whether the patient has heart disease. In that project, the logistic regression model performed better than SVM.
In this project we pre-process the input variables using PCA, retaining as many principal components as necessary to account for at least 98% of the variance of the original input variables. We then compare the results to those obtained in Project 1.
# Load the library
library(e1071)
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
library(cvTools)
## Loading required package: lattice
## Loading required package: robustbase
## Warning: package 'robustbase' was built under R version 3.2.5
Read the data
#Read the data
heart<- read.csv("/Users/neha/Documents/DS-630-ML/Project/Heart.csv")
heart <- na.omit(heart)
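na.omit() keeps only complete cases; as the str() output below shows, 297 of the 303 patients remain. A quick check, if desired:
nrow(heart)  # 297 complete cases out of the original 303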
For PCA, keep only the 13 predictor columns, dropping the row-index column and the response variable AHD:
heart_pca <- heart[,2:14]
Check for categorical variables in the data:
str(heart_pca)
## 'data.frame': 297 obs. of 13 variables:
## $ Age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
## $ RestBP : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fbs : int 1 0 0 0 0 0 0 0 0 1 ...
## $ RestECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ MaxHR : int 150 108 129 187 172 178 160 163 147 155 ...
## $ ExAng : int 0 1 1 0 0 0 0 1 0 1 ...
## $ Oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Slope : int 3 2 2 3 1 1 3 1 2 3 ...
## $ Ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ Thal : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
"ChestPain" and "Thal" are categorical (factor) variables, so we expand them into 0/1 dummy columns:
new_heart_pca <- dummy.data.frame(heart_pca, names = c("ChestPain","Thal"))
str(new_heart_pca)
## 'data.frame': 297 obs. of 18 variables:
## $ Age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ ChestPainasymptomatic: int 0 1 1 0 0 0 1 1 1 1 ...
## $ ChestPainnonanginal : int 0 0 0 1 0 0 0 0 0 0 ...
## $ ChestPainnontypical : int 0 0 0 0 1 1 0 0 0 0 ...
## $ ChestPaintypical : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RestBP : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fbs : int 1 0 0 0 0 0 0 0 0 1 ...
## $ RestECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ MaxHR : int 150 108 129 187 172 178 160 163 147 155 ...
## $ ExAng : int 0 1 1 0 0 0 0 1 0 1 ...
## $ Oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Slope : int 3 2 2 3 1 1 3 1 2 3 ...
## $ Ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ Thalfixed : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Thalnormal : int 0 1 0 1 1 1 1 1 0 0 ...
## $ Thalreversable : int 0 0 1 0 0 0 0 0 1 1 ...
## - attr(*, "dummies")=List of 2
## ..$ ChestPain: int 3 4 5 6
## ..$ Thal : int 16 17 18
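Note that the dummies package has since been archived on CRAN. If it is unavailable, base R's model.matrix() builds the same indicator columns; a rough equivalent sketch (alt_heart_pca is a name introduced here, and column names will differ slightly):
# Base-R sketch replacing dummy.data.frame: one 0/1 column per factor level
cp <- model.matrix(~ ChestPain - 1, data = heart_pca)  # "- 1" keeps all levels
th <- model.matrix(~ Thal - 1, data = heart_pca)
alt_heart_pca <- cbind(heart_pca[, !(names(heart_pca) %in% c("ChestPain", "Thal"))],
                       cp, th)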
Perform PCA:
pca_on_heart <- princomp(new_heart_pca)
summary(pca_on_heart)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 52.0044246 23.2921488 17.63224572 7.61501572
## Proportion of Variance 0.7469583 0.1498425 0.08586794 0.01601612
## Cumulative Proportion 0.7469583 0.8968008 0.98266874 0.99868486
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 1.1937424661 0.960879673 0.8420419543 0.6671064363
## Proportion of Variance 0.0003935837 0.000255008 0.0001958318 0.0001229153
## Cumulative Proportion 0.9990784460 0.999333454 0.9995292858 0.9996522011
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 5.416043e-01 4.634986e-01 4.430064e-01 3.943334e-01
## Proportion of Variance 8.101772e-05 5.933522e-05 5.420455e-05 4.294799e-05
## Cumulative Proportion 9.997332e-01 9.997926e-01 9.998468e-01 9.998897e-01
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 3.640134e-01 3.305688e-01 2.899636e-01 2.710558e-01
## Proportion of Variance 3.659742e-05 3.018142e-05 2.322216e-05 2.029239e-05
## Cumulative Proportion 9.999263e-01 9.999565e-01 9.999797e-01 1.000000e+00
## Comp.17 Comp.18
## Standard deviation 4.837684e-08 1.847055e-08
## Proportion of Variance 6.463844e-19 9.422702e-20
## Cumulative Proportion 1.000000e+00 1.000000e+00
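Two things are worth noting here. The last two components have essentially zero variance because each set of dummy columns sums to one, making them linearly dependent on the rest. More importantly, princomp() was applied to the raw covariance matrix, so high-variance predictors such as Chol and RestBP dominate the leading components, which is part of why so few components capture nearly all the variance. A variant worth comparing (not used here) is PCA on the correlation matrix, which standardizes every variable first:
# Sketch: PCA on standardized variables (correlation matrix) for comparison
pca_scaled <- princomp(new_heart_pca, cor = TRUE)
summary(pca_scaled)  # variance spreads over many more components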
As seen in the summary above, only 3 components are needed to explain at least 98% of the variation. Let's calculate the standard deviation and proportion of variance explained by each principal component:
std_dev <- pca_on_heart$sdev       # standard deviation of each component
pr_var <- std_dev^2                # variance of each component
prop_varex <- pr_var/sum(pr_var)   # proportion of total variance explained
prop_varex[1:3]
## Comp.1 Comp.2 Comp.3
## 0.74695834 0.14984246 0.08586794
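The cutoff can also be found programmatically instead of reading it off the summary table; using the prop_varex vector computed above:
# Smallest number of components whose cumulative variance reaches 98%
which(cumsum(prop_varex) >= 0.98)[1]  # returns 3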
plot(cumsum(prop_varex[1:3]), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
#Scree plot
screeplot(pca_on_heart, type="lines")
#biplot
biplot(pca_on_heart)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Reduced dimension data from PCA
#reduced dimension data
red_data<-data.frame(AHD = heart[, "AHD"], pca_on_heart$scores)
#keep AHD and the first 3 components
red_data <- red_data[,1:4]
#View(red_data)
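red_data should now have 297 rows and 4 columns (AHD plus three component scores); a quick check:
dim(red_data)  # expect: 297 4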
Now we have our reduced data ready to be used for the models:
listModel_1 <- list()  # confusion matrices
listModel_2 <- list()  # model summaries
listModel_3 <- list()  # accuracies
Sampling the data
hrt_smp_size <- floor(0.80 * nrow(red_data))  # 80% of the data for training
# Repeated random 80/20 train/test splits (Monte Carlo cross-validation)
k <- 50
#folds <- cvFolds(NROW(red_data), K=12)  # cvFolds-based k-fold alternative (unused)
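# Note: the splits below are random, so accuracies vary between runs;
# uncommenting the next line (seed value arbitrary) would make them reproducible
# set.seed(1)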
for(i in 1:k){
  train_ind <- sample(seq_len(nrow(red_data)), size = hrt_smp_size)
  train <- red_data[train_ind,]   # training set (80%)
  test  <- red_data[-train_ind,]  # test set (20%)
  #train <- red_data[folds$subsets[folds$which != i], ] # cvFolds-based alternative (unused)
  #test  <- red_data[folds$subsets[folds$which == i], ]
  log_model <- glm(AHD ~ ., data=train, family="binomial") # fit the model
  # type="response" gives probabilities; the default is the log-odds scale,
  # where a 0.5 cutoff would not correspond to p > 0.5
  pred <- predict(log_model, newdata=test, type="response") # predictions for the test set
  pred <- ifelse(pred > 0.5, "Yes", "No")
  confusionmatrix <- table(test$AHD, pred)
  accuracy <- sum(diag(confusionmatrix))/sum(confusionmatrix) # accuracy on this split
  listModel_1[[i]] <- confusionmatrix
  listModel_2[[i]] <- summary(log_model)
  listModel_3[[i]] <- accuracy
}
a <- which.max(unlist(listModel_3))  # index of the best split
max(unlist(listModel_3)) #accuracy
## [1] 0.8
listModel_2[a] # summary
## [[1]]
##
## Call:
## glm(formula = AHD ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9551 -0.9966 -0.4310 1.0388 1.9532
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.036549 0.143712 -0.254 0.7992
## Comp.1 0.003740 0.002613 1.431 0.1523
## Comp.2 -0.040027 0.007151 -5.597 2.18e-08 ***
## Comp.3 -0.018629 0.008377 -2.224 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 328.55 on 236 degrees of freedom
## Residual deviance: 283.67 on 233 degrees of freedom
## AIC: 291.67
##
## Number of Fisher Scoring iterations: 3
listModel_1[[a]]#confusion matrix
## pred
## No Yes
## No 37 4
## Yes 8 11
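Accuracy alone hides the error balance. In the confusion matrix above, rows are the true AHD labels and columns the predictions, so sensitivity and specificity can be read off directly; a short sketch:
cm <- listModel_1[[a]]
cm["Yes","Yes"]/sum(cm["Yes",])  # sensitivity: 11/19, about 0.58
cm["No","No"]/sum(cm["No",])     # specificity: 37/41, about 0.90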
ctable <- as.table(matrix(unlist(listModel_1[[a]]), nrow = 2, byrow = TRUE))
fourfoldplot(ctable, color = c("#CC6666", "#99CC49"),
conf.level = 0, margin = 1, main = "Confusion Matrix Logistic Regression")
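Note that the accuracy reported above is the maximum over 50 random splits, which is optimistic; the mean across all splits is a steadier summary (the exact value varies from run to run):
mean(unlist(listModel_3))  # average logistic-regression accuracy over the 50 splits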
Similar to the logistic regression procedure, we repeat the 50 random splits for SVM:
k <- 50
# Repeated random 80/20 splits, as for logistic regression
for(i in 1:k){
  #train_svm <- red_data[folds$subsets[folds$which != i], ] # cvFolds-based alternative (unused)
  #test_svm  <- red_data[folds$subsets[folds$which == i], ]
  train_ind <- sample(seq_len(nrow(red_data)), size = hrt_smp_size)
  train_svm <- red_data[train_ind,]   # training set (80%)
  test_svm  <- red_data[-train_ind,]  # test set (20%)
  svm_model <- svm(AHD ~ ., data=train_svm, kernel='radial', gamma=0.5, cost=1)
  pred_svm <- predict(svm_model, newdata=test_svm)
  confusionmatrix <- table(test_svm$AHD, pred_svm)
  accuracy <- sum(diag(confusionmatrix))/sum(confusionmatrix)
  listModel_1[[i]] <- confusionmatrix
  listModel_2[[i]] <- summary(svm_model)
  listModel_3[[i]] <- accuracy
}
Get the best SVM model:
b <- which.max(unlist(listModel_3))  # index of the best SVM split
b
## [1] 31
max(unlist(listModel_3))
## [1] 0.8166667
listModel_2[b]
## [[1]]
##
## Call:
## svm(formula = AHD ~ ., data = train_svm, kernel = "radial", gamma = 0.5,
## cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 190
##
## ( 94 96 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
listModel_1[[b]]
## pred_svm
## No Yes
## No 31 4
## Yes 7 18
ctable <- as.table(matrix(unlist(listModel_1[[b]]), nrow = 2, byrow = TRUE))
fourfoldplot(ctable, color = c("#CC6666", "#99CC49"),
conf.level = 0, margin = 1, main = "Confusion Matrix SVM")
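The SVM hyper-parameters (gamma = 0.5, cost = 1) were fixed by hand; e1071's tune.svm() can grid-search them with built-in cross-validation. A sketch, with an illustrative grid rather than the values used above:
# Illustrative grid search over gamma and cost (10-fold CV by default)
tuned <- tune.svm(AHD ~ ., data = red_data, kernel = "radial",
                  gamma = 2^(-3:1), cost = 2^(-1:3))
tuned$best.parameters  # best (gamma, cost) combination found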
Conclusion: On the PCA-reduced data, SVM performs better than logistic regression. This is the opposite of Project 1, where logistic regression performed better.
However, the accuracies achieved in the first project were higher overall; this may be because we are working with a reduced-dimensional data set, so some information has been lost.