Pincipal Component Analysis

In first project both Support Vector Machines and Logistic Regression were employed to study and attempt to predict if, based on a number of risk factors, a given patient will have heart disease. The data set Heart.csv which contained data on 303 patients consisting of 13 risk factors and a zero-one valued variable which indicates if the patient has heart disease. We found logistic regression model performed better than SVM.

In this project we will pre-process the input variables using PCA.We will include as many principal components as necessary to account for at least 98% of the variance of the original input variables.We will compare results to those obtained in Project1. In project 1, logistic performed better than SVM

# Load the library
library(e1071)
library(dummies)

## dummies-1.5.6 provided by Decision Patterns

library(cvTools)

## Loading required package: lattice

## Loading required package: robustbase

## Warning: package 'robustbase' was built under R version 3.2.5

Read the data

#Read the data 
heart<- read.csv("/Users/neha/Documents/DS-630-ML/Project/Heart.csv")
heart <- na.omit(heart)

For PCA, remove the response variable #AHD

heart_pca <-heart[,2:14]

Check out for categorical data if any

str(heart_pca)

## 'data.frame':    297 obs. of  13 variables:
##  $ Age      : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ Sex      : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
##  $ RestBP   : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ Chol     : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ Fbs      : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ RestECG  : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ MaxHR    : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ ExAng    : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ Oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ Slope    : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ Ca       : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ Thal     : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...

“Chest pain”" and “Thal” are categorical variables

new_heart_pca <- dummy.data.frame(heart_pca, names = c("ChestPain","Thal"))
str(new_heart_pca)

## 'data.frame':    297 obs. of  18 variables:
##  $ Age                  : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ Sex                  : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ ChestPainasymptomatic: int  0 1 1 0 0 0 1 1 1 1 ...
##  $ ChestPainnonanginal  : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ ChestPainnontypical  : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ ChestPaintypical     : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ RestBP               : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ Chol                 : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ Fbs                  : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ RestECG              : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ MaxHR                : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ ExAng                : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ Oldpeak              : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ Slope                : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ Ca                   : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ Thalfixed            : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Thalnormal           : int  0 1 0 1 1 1 1 1 0 0 ...
##  $ Thalreversable       : int  0 0 1 0 0 0 0 0 1 1 ...
##  - attr(*, "dummies")=List of 2
##   ..$ ChestPain: int  3 4 5 6
##   ..$ Thal     : int  16 17 18

Perform pca

pca_on_heart <-princomp(new_heart_pca)
summary(pca_on_heart)

## Importance of components:
##                            Comp.1     Comp.2      Comp.3     Comp.4
## Standard deviation     52.0044246 23.2921488 17.63224572 7.61501572
## Proportion of Variance  0.7469583  0.1498425  0.08586794 0.01601612
## Cumulative Proportion   0.7469583  0.8968008  0.98266874 0.99868486
##                              Comp.5      Comp.6       Comp.7       Comp.8
## Standard deviation     1.1937424661 0.960879673 0.8420419543 0.6671064363
## Proportion of Variance 0.0003935837 0.000255008 0.0001958318 0.0001229153
## Cumulative Proportion  0.9990784460 0.999333454 0.9995292858 0.9996522011
##                              Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     5.416043e-01 4.634986e-01 4.430064e-01 3.943334e-01
## Proportion of Variance 8.101772e-05 5.933522e-05 5.420455e-05 4.294799e-05
## Cumulative Proportion  9.997332e-01 9.997926e-01 9.998468e-01 9.998897e-01
##                             Comp.13      Comp.14      Comp.15      Comp.16
## Standard deviation     3.640134e-01 3.305688e-01 2.899636e-01 2.710558e-01
## Proportion of Variance 3.659742e-05 3.018142e-05 2.322216e-05 2.029239e-05
## Cumulative Proportion  9.999263e-01 9.999565e-01 9.999797e-01 1.000000e+00
##                             Comp.17      Comp.18
## Standard deviation     4.837684e-08 1.847055e-08
## Proportion of Variance 6.463844e-19 9.422702e-20
## Cumulative Proportion  1.000000e+00 1.000000e+00

As seen it needs only 3 components to explain 98% of the variation Lets calculate standard deviation and proportion of variance of each PCA

std_dev <- pca_on_heart[1:3]$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:3]

##     Comp.1     Comp.2     Comp.3 
## 0.74695834 0.14984246 0.08586794

plot(cumsum(prop_varex[1:3]), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b")

#Screen plot
screeplot(pca_on_heart, type="lines")

#biplot
biplot(pca_on_heart)

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Reduced dimension data from PCA

#reduced dimension data
red_data<-data.frame(AHD = heart[, "AHD"], pca_on_heart$scores)

#interested in first 3 componrnts
red_data<- red_data[,1:4]
#View(red_data)

Now we have our reduced data ready to be used for the models:

listModel_1 <- list() # confusionmatrix
listModel_2 <- list() #best model
listModel_3 <- list() # accuracy

Sampling the data

hrt_smp_size <-floor(0.80* nrow(red_data))
# K-fold cross-validation
k<-50
#folds <- cvFolds(NROW(red_data), K=12)

1. Logistic Model

for(i in 1:k){
  train_ind <-sample(seq_len(nrow(red_data)), size = hrt_smp_size)
  train<- red_data[train_ind,]#set the training set
  test<-red_data[-train_ind,]# set the testing set
  #train <- red_data[folds$subsets[folds$which != i], ] #Set the training set
  #test <- red_data[folds$subsets[folds$which == i], ] #Set the test set
  
  log_model <-glm(AHD ~ .,data=train, family="binomial") #Get your model
  
  pred <- predict(log_model,newdata=test) #Get the predicitons for the test set 
  pred <- ifelse(pred>0.5,"Yes", "No")
  
  confusionmatrix <- table(test$AHD, pred) 
  accuracy<- sum(diag(confusionmatrix))/sum(confusionmatrix) #calculate accuracy
 
  listModel_1[[i]] <- confusionmatrix
  listModel_2[[i]] <- summary(log_model)
  listModel_3[[i]] <- accuracy

}

Accuracy for best logistic model:

a <-which.max( listModel_3[] ) 
max(unlist(listModel_3)) #accuracy

## [1] 0.8

listModel_2[a] # summary

## [[1]]
## 
## Call:
## glm(formula = AHD ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9551  -0.9966  -0.4310   1.0388   1.9532  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.036549   0.143712  -0.254   0.7992    
## Comp.1       0.003740   0.002613   1.431   0.1523    
## Comp.2      -0.040027   0.007151  -5.597 2.18e-08 ***
## Comp.3      -0.018629   0.008377  -2.224   0.0262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 328.55  on 236  degrees of freedom
## Residual deviance: 283.67  on 233  degrees of freedom
## AIC: 291.67
## 
## Number of Fisher Scoring iterations: 3

listModel_1[[a]]#confusion matrix

##      pred
##       No Yes
##   No  37   4
##   Yes  8  11

ctable <- as.table(matrix(unlist(listModel_1[[a]]), nrow = 2, byrow = TRUE))
fourfoldplot(ctable, color = c("#CC6666", "#99CC49"),
             conf.level = 0, margin = 1, main = "Confusion Matrix Logistic Regression")

2. Support Vector Machine

Similar to logistic we will test for SVM.

k = 50
# K-fold cross validation
for(i in 1:50){
 # train_svm <- red_data[folds$subsets[folds$which != i], ] #Set the training set
  #test_svm <- red_data[folds$subsets[folds$which == i], ] #Set the test set
  train_ind <-sample(seq_len(nrow(red_data)), size = hrt_smp_size)
  train_svm<- red_data[train_ind,]#set the training set
  test_svm<-red_data[-train_ind,]# set the testing set
  
       svm_model <-svm(AHD ~ .,data=train_svm, kernel='radial',gamma=0.5
                       ,cost=1)
       pred_svm <- predict(svm_model,newdata=test_svm)
       confusionmatrix <- table(test_svm$AHD, pred_svm)
       accuracy<- sum(diag(confusionmatrix))/sum(confusionmatrix)
       listModel_1[[i]] <- confusionmatrix
       listModel_2[[i]] <- summary(svm_model)
       listModel_3[[i]] <- accuracy
      
     
   
   
}

Get the best SVM model:

b <-which.max( listModel_3[] )
b

## [1] 31

Accuracy of the best SVM model

max(unlist(listModel_3))

## [1] 0.8166667

listModel_2[b]

## [[1]]
## 
## Call:
## svm(formula = AHD ~ ., data = train_svm, kernel = "radial", gamma = 0.5, 
##     cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  190
## 
##  ( 94 96 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  No Yes

listModel_1[[b]]

##      pred_svm
##       No Yes
##   No  31   4
##   Yes  7  18

ctable <- as.table(matrix(unlist(listModel_1[[b]]), nrow = 2, byrow = TRUE))
fourfoldplot(ctable, color = c("#CC6666", "#99CC49"),
             conf.level = 0, margin = 1, main = "Confusion Matrix SVM")

Conclusion: SVM performs better than logistic.This is different from project1 where logistic performed better.

However, the accuracy results achieved in the first project were higher, this might be due to the fact we are working with the reduced dimensional data set so there is some loss of information.

Pincipal Component Analysis

Neha

12/10/2016

1. Logistic Model

Accuracy for best logistic model:

2. Support Vector Machine

Accuracy of the best SVM model