Heart Disease prediction

Analysing Prediction power of different Machine learning algorithms on Heart disease data

About the data The dataset used in this article is the Cleveland Heart Disease dataset taken from the UCI repository(http://archive.ics.uci.edu/ml/datasets/Heart+Disease). It originally contains 76 variables, but all published experiments refer to using 14 of them. The dataset consists of 303 observations. There are 14 columns in the dataset, which are described as follows:

  • Age: Age of the individual.

  • Sex: Gender of the individual: 1 = male, 0 = female

  • Chest-pain type (cp): 1 = asymptotic, 2 = atypical angina, 3 = non anginal pain

  • Resting Blood Pressure (trestbps): The resting blood pressure value in mmHg (unit)

  • Serum Cholesterol (chol): The serum cholesterol in mg/dl (unit)

  • Fasting Blood Sugar (fbs): The fasting blood sugar value of an individual, if > 120 -> 1, else 0

  • Resting ECG (restecg): displays resting electrocardiographic results

  • Max heart rate (thalach): The max heart rate achieved by an individual

  • Exercise induced angina (exang): 1 = yes, 0 = no

  • Oldpeak: ST depression induced by exercise relative to rest

  • Slope: The slope of peak exercise ST segment

  • Ca: Number of major vessels (0-3) colored by fluoroscopy

  • Thal: Thalassemia = 0,1,2,3

  • Heart_disease: Whether the individual is suffering from heart disease or not 0 = absence, 1= present.

1.0 Loading Libraries

1.1 Reading the data

data_orig <- read.csv('heartdisease1.csv')

str(data_orig)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...
head(data_orig)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1

Sample of 5 rows shown above

2.0 Data Preperation

2.1 Data scrubbing

2.1.1 Scrubbing Function

Function to convert the class of variables:

convert_class <- function(obj,types){
    for (i in 1:length(obj)){
        FUN <- switch(types[i],character = as.character, 
                                   numeric = as.numeric, 
                                   factor = as.factor)
        obj[,i] <- FUN(obj[,i])
    }
    obj
}
data_orig<-rename(data_orig,heart_disease='target',age="age")

data_orig <- data_orig %>% rename() %>%
  mutate(sex = if_else(sex == 1, "male", "female"),
         fbs = if_else(fbs == 1, ">120", "<=120"),
         exang = if_else(exang == 1, "yes" ,"no"),
         cp = if_else(cp == 1, "atypical angina",
                      if_else(cp == 2, "non-anginal pain", "asymptotic")),
         restecg = if_else(restecg == 0, "normal",
                           if_else(restecg == 1, "abnormal", "definite")))

change_class <-c("numeric","factor","factor","numeric","numeric","factor","factor","numeric","factor","numeric","factor","factor","factor","numeric")

data_orig <- convert_class(data_orig,change_class)

label <- data.frame(
  age = "age in years",
  sex = "sex of the patient",
  cp = "chest pain type",
  trestbps = "resting blood pressure (in mm Hg on admission to the hospital)",
  chol = "serum cholestoral in mg/dl",
  fbs = "fasting blood sugar > 120 mg/dl",
  restecg = "resting electrocardiographic results",
  thalach = "maximum heart rate achieved",
  exang = "exercise induced angina",
  oldpeak = "ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms",
  slope = "the slope of the peak exercise ST segment",
  ca = "number of major vessels (0-3) colored by flourosopy",
  thal = "3 = normal; 6 = fixed defect; 7 = reversable defect"
)


vtable::vtable(data_orig, labels = label, factor.limit = 0)
Name Class Label Values
age numeric age in years Num: 29 to 77
sex factor sex of the patient ‘female’ ‘male’
cp factor chest pain type ‘asymptotic’ ‘atypical angina’ ‘non-anginal pain’
trestbps numeric resting blood pressure (in mm Hg on admission to the hospital) Num: 94 to 200
chol numeric serum cholestoral in mg/dl Num: 126 to 564
fbs factor fasting blood sugar > 120 mg/dl ‘<=120’ ‘>120’
restecg factor resting electrocardiographic results ‘abnormal’ ‘definite’ ‘normal’
thalach numeric maximum heart rate achieved Num: 71 to 202
exang factor exercise induced angina ‘no’ ‘yes’
oldpeak numeric ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms Num: 0 to 6.2
slope factor the slope of the peak exercise ST segment ‘0’ ‘1’ ‘2’
ca factor number of major vessels (0-3) colored by flourosopy ‘0’ ‘1’ ‘2’ ‘3’ ‘4’
thal factor 3 = normal; 6 = fixed defect; 7 = reversable defect ‘0’ ‘1’ ‘2’ ‘3’
heart_disease numeric NA Num: 0 to 1

3.0 Data quality Check

metadata<-t(introduce(data_orig))
colnames(metadata)<-"Values"
metadata
##                      Values
## rows                    303
## columns                  14
## discrete_columns          8
## continuous_columns        6
## all_missing_columns       0
## total_missing_values      0
## complete_rows           303
## total_observations     4242
## memory_usage          32592
plot_intro(data_orig)

Plot the missing values

plot_missing(data_orig)

* There are no missing values

4.0 Train test split

Splitting the data in 70:30 ratio

set.seed(13263635)

train_percent <-0.70
index <- sample(nrow(data_orig),nrow(data_orig)*train_percent)
data <- data_orig[index,]
data_test <- data_orig[-index,]

Summarizing the Training data

summary(data)
##       age            sex                     cp         trestbps    
##  Min.   :29.00   female: 64   asymptotic      :114   Min.   : 94.0  
##  1st Qu.:47.00   male  :148   atypical angina : 36   1st Qu.:120.0  
##  Median :56.00                non-anginal pain: 62   Median :130.0  
##  Mean   :54.33                                       Mean   :131.2  
##  3rd Qu.:61.00                                       3rd Qu.:140.0  
##  Max.   :74.00                                       Max.   :200.0  
##       chol          fbs          restecg       thalach      exang    
##  Min.   :126.0   <=120:180   abnormal:106   Min.   : 71.0   no :150  
##  1st Qu.:208.0   >120 : 32   definite:  2   1st Qu.:138.0   yes: 62  
##  Median :239.0               normal  :104   Median :154.0            
##  Mean   :242.6                              Mean   :150.8            
##  3rd Qu.:270.0                              3rd Qu.:168.0            
##  Max.   :407.0                              Max.   :202.0            
##     oldpeak       slope   ca      thal    heart_disease   
##  Min.   :0.0000   0: 11   0:131   0:  2   Min.   :0.0000  
##  1st Qu.:0.0000   1: 94   1: 37   1: 13   1st Qu.:0.0000  
##  Median :0.6000   2:107   2: 29   2:119   Median :1.0000  
##  Mean   :0.9892           3: 12   3: 78   Mean   :0.5943  
##  3rd Qu.:1.6000           4:  3           3rd Qu.:1.0000  
##  Max.   :4.4000                           Max.   :1.0000

5.0 Exploratory Data Analysis

Chi-Square Test of Independence for categorical columns

#fetch numeric columns
catcols<-select_if(data, is.factor)
numcols<-select_if(data, is.numeric)

predictors<-catcols[,1:8]
response<-numcols$heart_disease

n<-ncol(predictors)
values<- vector(mode="numeric", length=n)
pvalues<- vector(mode="numeric", length=n)
name.pred<-vector(mode="character", length=n)


for(i in 1:ncol(predictors))
{
table(predictors[,i],response)
c<-chisq.test(predictors[,i],response)

pvalues[i]<-c$p.value
values[i]<-ifelse(round(pvalues[i],3)<0.05,1,0)
}

result<-data.frame(cbind(colnames(predictors),ifelse(round(pvalues,4)<0.001,"<0.001",round(pvalues,4)),values))
colnames(result)<-c("predictors","pvalues","Relationship")

result
##   predictors pvalues Relationship
## 1        sex  <0.001            1
## 2         cp  <0.001            1
## 3        fbs  0.8393            0
## 4    restecg  0.0032            1
## 5      exang  <0.001            1
## 6      slope  <0.001            1
## 7         ca  <0.001            1
## 8       thal  <0.001            1

Interpretation Except fasting blood sugar all predictors(categorical) variables seems to be related to Response.

Correlation plot of numeric columns

#fetch numeric columns
numcols<-select_if(data_orig, is.numeric)

cor_heart <- cor(numcols)
cor_heart
##                      age    trestbps         chol      thalach     oldpeak
## age            1.0000000  0.27935091  0.213677957 -0.398521938  0.21001257
## trestbps       0.2793509  1.00000000  0.123174207 -0.046697728  0.19321647
## chol           0.2136780  0.12317421  1.000000000 -0.009939839  0.05395192
## thalach       -0.3985219 -0.04669773 -0.009939839  1.000000000 -0.34418695
## oldpeak        0.2100126  0.19321647  0.053951920 -0.344186948  1.00000000
## heart_disease -0.2254387 -0.14493113 -0.085239105  0.421740934 -0.43069600
##               heart_disease
## age             -0.22543872
## trestbps        -0.14493113
## chol            -0.08523911
## thalach          0.42174093
## oldpeak         -0.43069600
## heart_disease    1.00000000
#corr plot it
corrplot(cor_heart, method = "ellipse", type="upper",)

#another
ggcorrplot(cor_heart,lab = T)

Interpretation

thalach: maximum heart rate achieved oldpeak: ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms

seems to be correlated with response variable… However , thalach & oldpeak also show case substantial correlation amongst themselves.

chol & age are also somewhat correlated trestbps(resting blood pressure) & age also have substantial corelation.

All these correlation amongst predictors are somewhat smaller in magnitude. Therefore as of now we need not remove them from analysis.

present <- round((sum(as.numeric(data$heart_disease))*100)/nrow(data),2)
absent<- round((100 - present),2)

pie(table(data$heart_disease), col=c('green','red'),labels = c(paste(as.character(absent),'heart disease absent %'),paste(as.character(present), 'heart disease present %')),main = 'Distribution of heart disease status as percentage')

levels(data$heart_disease) = c("No disease","Disease")
levels(data$sex) = c("female","male","")
mosaicplot(data$sex ~ data$heart_disease,
           main="Disease by Gender", shade=FALSE,color=TRUE,
           xlab="Gender", ylab="Heart disease")

boxplot(data$age ~ data$heart_disease,
        main="Disease by Age",
         ylab="Age",xlab="Heart disease")

kable(head(data))  %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "250px")
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal heart_disease
100 53 male non-anginal pain 130 246 >120 normal 173 no 0.0 2 3 2 1
75 43 female non-anginal pain 122 213 <=120 abnormal 165 no 0.2 1 0 2 1
23 42 male asymptotic 140 226 <=120 abnormal 178 no 0.0 2 0 2 1
204 68 male non-anginal pain 180 274 >120 normal 150 yes 1.6 1 0 3 0
192 58 male asymptotic 128 216 <=120 normal 131 yes 2.2 1 3 3 0
257 58 male asymptotic 128 259 <=120 normal 130 yes 3.0 1 2 3 0
num_features <- select_if(data, is.numeric) %>%  names()

data %>% 
  dplyr::select(heart_disease, one_of(num_features))  %>%
  mutate(heart_disease=as.factor(heart_disease))%>%
  gather(key = key, value = value,-heart_disease)%>%
  ggplot(aes(x = value, fill = as.factor(heart_disease)))+
  geom_histogram(alpha = 0.7)+
  theme_stata()+
  facet_wrap(~ key, scales = 'free',ncol = 2,nrow = 4)+
  scale_fill_tableau()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data, aes(cp, fill = as.factor(heart_disease)))+
  geom_bar(position = "fill")+
  ggtitle("cp")

data %>%
  ggplot(aes(x=sex,y=trestbps))+
  geom_boxplot(fill="darkorange")+
  xlab("Sex")+
  ylab("BP")+
  facet_grid(~cp)

6.0 Unsupervised Learning

6.1 K-means

data$heart_disease<-as.factor(data$heart_disease)

numcols<-select_if(data, is.numeric)
uns_df <- scale(numcols)
distance <- get_dist(uns_df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

k2 <- kmeans(uns_df, 
             center = 2,
             nstart = 25  )
#table(k2$cluster, dnn=("Clusters"))
k2
## K-means clustering with 2 clusters of sizes 113, 99
## 
## Cluster means:
##          age   trestbps       chol    thalach    oldpeak
## 1 -0.6511378 -0.2758471 -0.3386202  0.5695421 -0.4314943
## 2  0.7432179  0.3148558  0.3865059 -0.6500835  0.4925136
## 
## Clustering vector:
## 100  75  23 204 192 257 189 156  79  18 151  11 253  96  77 169 220 267 
##   1   1   1   2   2   2   1   2   1   2   2   1   2   2   1   2   1   2 
## 251 124   7  74 163 224 107  69   3 105 176 259  27  38 272 248 200 113 
##   2   1   2   1   1   2   2   1   1   1   1   2   2   1   2   2   2   2 
##  35 271 255 282 116  54 150  71 111  43 266 144 268 158  95 294  72 289 
##   2   1   2   1   1   1   1   1   2   1   2   2   1   1   1   2   1   2 
## 196  52 165 243  63 190 120 121  25 179 177 264  14   8 221 214 142 274 
##   2   2   1   2   1   1   1   2   1   1   1   2   2   1   2   2   1   1 
## 149  37  65  48  98 236  61 201 198 276 137 293 141  44 238 250  70 216 
##   1   1   1   1   1   1   2   1   2   1   2   2   1   1   2   2   1   2 
## 160  17 154 146  49 147  92 210 180 300 117  97 178 301 273 231  41  19 
##   1   1   2   2   2   1   1   1   2   1   1   2   2   2   2   1   2   1 
## 102  66 157 229 203 129  33   1 139 138  31   9 148 153 183 202 209 135 
##   2   1   1   2   2   1   1   2   2   2   1   1   2   2   2   2   1   1 
## 185 152  80 162  26 242  59  94  62  47 184  30 127 246 278 270 109 167 
##   2   2   1   2   2   2   1   1   1   1   2   1   1   1   2   2   1   2 
## 199 287 286 175 235 225 296  45 227 277  83  73 119 110 155 101 284  28 
##   2   1   2   2   2   2   2   1   2   2   1   1   1   1   1   1   1   1 
## 298   6 140  76 118 207   2  67 164 254 125 212  56   4 173  89 230 108 
##   2   1   2   1   1   2   1   1   1   2   1   2   1   1   2   1   2   1 
## 131 279 219 292  88  58 208 275 234  42 123 112  78 213  93  24  21  16 
##   1   2   2   2   1   1   2   2   2   1   1   1   1   1   1   2   1   1 
## 249 130 237 211  87 103 170 280  32   5  81  39  64 122 
##   1   2   1   1   2   1   2   2   2   2   1   2   1   1 
## 
## Within cluster sum of squares by cluster:
## [1] 331.2674 451.4324
##  (between_SS / total_SS =  25.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Cluster means for each attribute is depicted in summary

Visualization of Clusters

Plotiing Cluster using fviz_cluster

fviz_cluster(k2, data = uns_df)

Varying Visualizations by size

# k2
k3 <- kmeans(uns_df, centers = 3, nstart = 25)
k4 <- kmeans(uns_df, centers = 4, nstart = 25)
k5 <- kmeans(uns_df, centers = 5, nstart = 25)

p1 <- fviz_cluster(k2, geom = "point", data = uns_df)+
  ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = uns_df)+
  ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = uns_df)+
  ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = uns_df)+
  ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1,p2,p3,p4, nrow = 2)

Although this visual assessment tells us where true dilineations occur (or do not occur such as clusters 2 & 4 in the k = 5 graph) between clusters, it does not tell us what the optimal number of clusters is.

Determining Optimal Clusters

As you may recall the analyst specifies the number of clusters to use; preferably the analyst would like to use the optimal number of clusters. To aid the analyst, the following explains the three most popular methods for determining the optimal clusters, which includes:

Elbow method Silhouette method *Gap statistic

(A) Elbow Method
#finding optimal number of clusters
set.seed(13263635)
fviz_nbclust(uns_df, kmeans, method = "wss")

##### (B) Silhouette Method

#finding optimal number of clusters
set.seed(13263635)
fviz_nbclust(uns_df, kmeans, method = "silhouette")

##### (C) Compute gap statistic

# compute gap statistic
set.seed(13263635)
gap_stat <- clusGap(uns_df, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = uns_df, FUNcluster = kmeans, K.max = 10, B = 50,     nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 2
##           logW   E.logW       gap     SE.sim
##  [1,] 5.056204 5.511244 0.4550395 0.01762836
##  [2,] 4.902471 5.371684 0.4692130 0.01506240
##  [3,] 4.823181 5.274316 0.4511350 0.01747658
##  [4,] 4.757474 5.202080 0.4446056 0.01737994
##  [5,] 4.703252 5.145987 0.4427348 0.01643548
##  [6,] 4.659157 5.094961 0.4358043 0.01618537
##  [7,] 4.621819 5.050962 0.4291425 0.01613949
##  [8,] 4.573595 5.010673 0.4370777 0.01611355
##  [9,] 4.536792 4.975277 0.4384854 0.01602263
## [10,] 4.504008 4.943096 0.4390881 0.01617199
#visualize
fviz_gap_stat(gap_stat)

We choose k=2 As most algorithms suggested the same, Hence now we are assured to use 2 clusters.

k2
## K-means clustering with 2 clusters of sizes 113, 99
## 
## Cluster means:
##          age   trestbps       chol    thalach    oldpeak
## 1 -0.6511378 -0.2758471 -0.3386202  0.5695421 -0.4314943
## 2  0.7432179  0.3148558  0.3865059 -0.6500835  0.4925136
## 
## Clustering vector:
## 100  75  23 204 192 257 189 156  79  18 151  11 253  96  77 169 220 267 
##   1   1   1   2   2   2   1   2   1   2   2   1   2   2   1   2   1   2 
## 251 124   7  74 163 224 107  69   3 105 176 259  27  38 272 248 200 113 
##   2   1   2   1   1   2   2   1   1   1   1   2   2   1   2   2   2   2 
##  35 271 255 282 116  54 150  71 111  43 266 144 268 158  95 294  72 289 
##   2   1   2   1   1   1   1   1   2   1   2   2   1   1   1   2   1   2 
## 196  52 165 243  63 190 120 121  25 179 177 264  14   8 221 214 142 274 
##   2   2   1   2   1   1   1   2   1   1   1   2   2   1   2   2   1   1 
## 149  37  65  48  98 236  61 201 198 276 137 293 141  44 238 250  70 216 
##   1   1   1   1   1   1   2   1   2   1   2   2   1   1   2   2   1   2 
## 160  17 154 146  49 147  92 210 180 300 117  97 178 301 273 231  41  19 
##   1   1   2   2   2   1   1   1   2   1   1   2   2   2   2   1   2   1 
## 102  66 157 229 203 129  33   1 139 138  31   9 148 153 183 202 209 135 
##   2   1   1   2   2   1   1   2   2   2   1   1   2   2   2   2   1   1 
## 185 152  80 162  26 242  59  94  62  47 184  30 127 246 278 270 109 167 
##   2   2   1   2   2   2   1   1   1   1   2   1   1   1   2   2   1   2 
## 199 287 286 175 235 225 296  45 227 277  83  73 119 110 155 101 284  28 
##   2   1   2   2   2   2   2   1   2   2   1   1   1   1   1   1   1   1 
## 298   6 140  76 118 207   2  67 164 254 125 212  56   4 173  89 230 108 
##   2   1   2   1   1   2   1   1   1   2   1   2   1   1   2   1   2   1 
## 131 279 219 292  88  58 208 275 234  42 123 112  78 213  93  24  21  16 
##   1   2   2   2   1   1   2   2   2   1   1   1   1   1   1   2   1   1 
## 249 130 237 211  87 103 170 280  32   5  81  39  64 122 
##   1   2   1   1   2   1   2   2   2   2   1   2   1   1 
## 
## Within cluster sum of squares by cluster:
## [1] 331.2674 451.4324
##  (between_SS / total_SS =  25.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
#changed
datakmClustered<-data %>% 
  mutate(Cluster = k2$cluster-1)
#2 is 1 and 1 is 0

#table(datakmClustered$heart_disease,datakmClustered$Cluster, dnn = c("ture","clustered"))
mcr<-mean(datakmClustered$heart_disease!=(datakmClustered$Cluster))
outkmeans<-numcols %>% 
  mutate(Cluster = k2$cluster) %>% 
  group_by(Cluster) %>% 
  summarise_all("mean")

kable(outkmeans)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
Cluster age trestbps chol thalach oldpeak
1 48.42478 126.0619 226.4425 163.5487 0.5035398
2 61.08081 137.1010 260.9697 136.1414 1.5434343

Interpretation The mean of all predictors(continious) appear to be different for both the clusters created.

6.2 Hierachical Clustering

Here we have implementd hierachical clustering for categorical & continious variables

The clustering process itself contains 3 distinctive steps:

Calculating dissimilarity matrix - is arguably the most important decision in clustering, and all your further steps are going to be based on the dissimilarity matrix you’ve made. Choosing the clustering method *Assessing clusters

Data types differences are important as dissimilarity matrix is based on distances between individual data points. While it is quite easy to imagine distances between numerical data points (Eucledian distances), categorical data (factors in R) does not seem as obvious. In order to calculate a dissimilarity matrix in this case, you would go for something called Gower distance

tree <- hclustvar(numcols, catcols)
plot(tree, cex = 0.6)

#rect.hclust(tree, k = 4, border = 2:4)

Stablity check using bootstraping

Evaluates the stability of partitions obtained from a hierarchy of p variables. This hierarchy is performed with hclustvar and the stability of the partitions of 2 to p-1 clusters is evaluated with a bootstrap approach. The boostrap approch is the following: hclustvar is applied to B boostrap samples of the n rows. The partitions of 2 to p-1 clusters obtained from the B bootstrap hierarchies are compared with the partitions from the initial hierarchy . The mean of the corrected Rand indices is plotted according to the number of clusters. This graphical representation helps in the determination of a suitable numbers of clusters

stab <- stability(tree, B=50) # "B=50" refers to the number of bootstrap samples to use in the estimation.

Intrepretation This proves that 2 given us the best number of clusters

Using Dissimilarity matrix

“Gower’s distance” is chosen by metric “gower” or automatically if some columns of x are not numeric. Also known as Gower’s coefficient (1971), expressed as a dissimilarity, this implies that a particular standardisation will be applied to each variable, and the “distance” between two units is the sum of all the variable-specific distances

d <- daisy(data[,-14], metric="gower")
fit <- hclust(d=d, method="ward.D")    # Also try: method="ward.D"
# Cut tree into 4 groups
sub_grp <- cutree(fit, k = 2)
# Number of members in each cluster
table(sub_grp)
## sub_grp
##   1   2 
## 153  59
plot(fit, cex = 0.6)
rect.hclust(fit, k = 2, border = 2:4)

Adding the clusters to our traing data

HCdata<-data %>%
  mutate(HCcluster = sub_grp)

Misclassification calculation

HCdata$HCcluster[HCdata$HCcluster==2] <- 0
#table(HCdata$heart_disease,HCdata$HCcluster, dnn = c("ture","clustered"))
mcr<-mean(HCdata$heart_disease!=HCdata$HCcluster)

6.3 PCA

Model the numeric columns to PCA model, It is always performed on a symmetric correlation or covariance matrix. This means the matrix should be numeric and have standardized data.

datapca<-data[,1:13]
dt<-dummy.data.frame(datapca)
colnames(dt)
##  [1] "age"                "sexfemale"          "sexmale"           
##  [4] "cpasymptotic"       "cpatypical angina"  "cpnon-anginal pain"
##  [7] "trestbps"           "chol"               "fbs<=120"          
## [10] "fbs>120"            "restecgabnormal"    "restecgdefinite"   
## [13] "restecgnormal"      "thalach"            "exangno"           
## [16] "exangyes"           "oldpeak"            "slope0"            
## [19] "slope1"             "slope2"             "ca0"               
## [22] "ca1"                "ca2"                "ca3"               
## [25] "ca4"                "thal0"              "thal1"             
## [28] "thal2"              "thal3"
pca <- prcomp(dt, scale = TRUE) 
pca
## Standard deviations (1, .., p=29):
##  [1] 2.246490e+00 1.580613e+00 1.515313e+00 1.388388e+00 1.321554e+00
##  [6] 1.308160e+00 1.272534e+00 1.189638e+00 1.144859e+00 1.114827e+00
## [11] 1.017576e+00 1.006372e+00 9.945358e-01 9.501946e-01 9.079434e-01
## [16] 8.639616e-01 8.510374e-01 7.905021e-01 7.822965e-01 6.214353e-01
## [21] 5.717560e-01 9.976420e-16 6.426039e-16 5.797642e-16 5.454954e-16
## [26] 4.571771e-16 4.554836e-16 3.202758e-16 1.116381e-16
## 
## Rotation (n x k) = (29 x 29):
##                              PC1           PC2          PC3         PC4
## age                 0.2040227737 -1.833293e-01  0.094779873 -0.13731378
## sexfemale          -0.1413086011 -5.394177e-01 -0.039906990 -0.10547773
## sexmale             0.1413086011  5.394177e-01  0.039906990  0.10547773
## cpasymptotic        0.2942003135 -2.461245e-02 -0.124389538  0.04018123
## cpatypical angina  -0.1600597333 -5.511576e-06  0.032650698  0.05734463
## cpnon-anginal pain -0.1903397143  2.698063e-02  0.109385163 -0.09137243
## trestbps            0.0935479157 -1.204104e-01  0.222487435  0.12005801
## chol                0.0849067274 -2.803531e-01 -0.029034968  0.08955058
## fbs<=120           -0.0554027651  2.904020e-02 -0.591801533  0.18297854
## fbs>120             0.0554027651 -2.904020e-02  0.591801533 -0.18297854
## restecgabnormal    -0.1736418581  7.301015e-02 -0.134795878 -0.54120573
## restecgdefinite     0.0625879247 -1.361349e-01 -0.006726700  0.07124934
## restecgnormal       0.1615699524 -4.669831e-02  0.136120641  0.52752438
## thalach            -0.2642768993  1.226385e-01  0.027373082  0.15232430
## exangno            -0.2662813236  5.344106e-02  0.103829439  0.13787861
## exangyes            0.2662813236 -5.344106e-02 -0.103829439 -0.13787861
## oldpeak             0.3018129056 -1.014582e-01 -0.013299747  0.02003847
## slope0              0.1129508182 -6.226086e-02  0.230618976  0.14454194
## slope1              0.2404813692 -1.326143e-01 -0.214508548 -0.02652619
## slope2             -0.2890528978  1.593875e-01  0.110832619 -0.03776438
## ca0                -0.1863277619 -2.539686e-02 -0.017259545  0.31810714
## ca1                 0.0801730543  5.412041e-02 -0.100692516 -0.18736497
## ca2                 0.1440907534 -5.350135e-02  0.095050946 -0.17389422
## ca3                 0.0880689230 -1.975906e-02  0.062237127 -0.05200381
## ca4                -0.0826325459  1.248690e-01 -0.003714119 -0.09886796
## thal0              -0.0003292663 -1.790613e-02  0.071960874 -0.04474856
## thal1               0.1071281011  3.217524e-03  0.087323525  0.08306735
## thal2              -0.2887953143 -2.791867e-01 -0.032234467  0.04175985
## thal3               0.2439394038  2.892717e-01 -0.024699924 -0.07532716
##                            PC5          PC6           PC7          PC8
## age                 0.10549430 -0.213452492  0.1616819119  0.023604430
## sexfemale           0.06696998  0.011344399 -0.0132664051  0.046115109
## sexmale            -0.06696998 -0.011344399  0.0132664051 -0.046115109
## cpasymptotic       -0.01260643  0.251575980  0.2184059522  0.067338655
## cpatypical angina   0.17794918  0.018666681  0.2587982109  0.080642858
## cpnon-anginal pain -0.13306274 -0.291143340 -0.4529931814 -0.140368316
## trestbps           -0.05039087  0.052792830  0.1368246765 -0.088519611
## chol                0.10490839 -0.049599567  0.1036958509 -0.249146156
## fbs<=120           -0.01650000 -0.068522160  0.1133872283 -0.113186317
## fbs>120             0.01650000  0.068522160 -0.1133872283  0.113186317
## restecgabnormal    -0.24450421  0.174292551  0.0922261543  0.011106655
## restecgdefinite    -0.29057685  0.022878704  0.1374788674 -0.408638140
## restecgnormal       0.30073749 -0.178747710 -0.1188272919  0.067910999
## thalach             0.06430739  0.087127349  0.0874879334 -0.015641141
## exangno            -0.23790773 -0.361710871  0.2362339223  0.163201029
## exangyes            0.23790773  0.361710871 -0.2362339223 -0.163201029
## oldpeak            -0.23877351  0.001229302 -0.0009125418 -0.162923697
## slope0             -0.21883738  0.152638210 -0.0091306367 -0.371342645
## slope1             -0.20305279 -0.201743027 -0.2071383509  0.402781053
## slope2              0.29883633  0.132742164  0.2098663948 -0.235475750
## ca0                -0.22856684  0.410141531 -0.2145056094  0.160399335
## ca1                 0.34911401 -0.343597404 -0.1291405671 -0.216397262
## ca2                 0.13349020  0.006067372  0.4103179475  0.227637787
## ca3                -0.25357146 -0.264070192  0.0683355585 -0.215395509
## ca4                -0.07387878 -0.084082911 -0.0300039738 -0.205304036
## thal0              -0.02271613  0.112158906 -0.2015978151  0.170508454
## thal1              -0.24274831 -0.036225835  0.2630957285  0.114121348
## thal2               0.05647509  0.011523649 -0.0538623012 -0.080493188
## thal3               0.06720959 -0.016318483 -0.0350554293 -0.008128518
##                             PC9         PC10          PC11        PC12
## age                 0.087715997 -0.047030462 -0.0514849619  0.25344279
## sexfemale           0.110497288  0.052789321 -0.0333275463  0.03966658
## sexmale            -0.110497288 -0.052789321  0.0333275463 -0.03966658
## cpasymptotic       -0.070216542  0.239407715 -0.0918142199  0.11585403
## cpatypical angina   0.002363250 -0.575654625  0.1913643070 -0.17791385
## cpnon-anginal pain  0.075009077  0.212748155 -0.0573211994  0.01987070
## trestbps            0.198823957  0.170243940  0.0809428399  0.17893968
## chol                0.378954437 -0.034805948 -0.1344075528 -0.11996991
## fbs<=120            0.022985625  0.067737669 -0.0632291693  0.10386450
## fbs>120            -0.022985625 -0.067737669  0.0632291693 -0.10386450
## restecgabnormal     0.108859690 -0.058035125  0.0892992035  0.07682543
## restecgdefinite    -0.113189843 -0.151685506 -0.1707762492 -0.02107557
## restecgnormal      -0.086991199  0.087377357 -0.0562915673 -0.07276366
## thalach             0.139934444  0.164750061  0.0637699306 -0.19026526
## exangno             0.157374818  0.056636761  0.0179033287  0.17592834
## exangyes           -0.157374818 -0.056636761 -0.0179033287 -0.17592834
## oldpeak             0.080997686  0.050272836  0.1745509851 -0.06613846
## slope0              0.005725705 -0.003227085  0.3009325805  0.27827997
## slope1             -0.018519962 -0.046593168  0.1323699612 -0.16649098
## slope2              0.015861704  0.047727286 -0.2650235573  0.04197840
## ca0                 0.179411132 -0.115343125  0.0266367570  0.08414051
## ca1                -0.116144689 -0.196811267  0.1784967480  0.38220875
## ca2                -0.040740958  0.367790016 -0.0002590246 -0.16793850
## ca3                -0.004954165 -0.202539387 -0.3768853491 -0.48613501
## ca4                -0.236604972  0.433208186  0.0549279697 -0.13469650
## thal0              -0.064865516 -0.047711035 -0.6682429521  0.31460817
## thal1              -0.418293722 -0.140089959 -0.0747633927  0.20786286
## thal2              -0.306678462  0.080124643  0.1747356815 -0.13999203
## thal3               0.536678636 -0.003188517 -0.0086525893 -0.02242725
##                            PC13        PC14         PC15         PC16
## age                 0.071981397  0.38373324 -0.214537681  0.307043040
## sexfemale           0.009422092 -0.15754580  0.132702943 -0.139797586
## sexmale            -0.009422092  0.15754580 -0.132702943  0.139797586
## cpasymptotic       -0.289722514  0.11751257 -0.145408546 -0.387590877
## cpatypical angina   0.295940515 -0.28920729 -0.175698766  0.145496306
## cpnon-anginal pain  0.073275472  0.10991485  0.304395008  0.304719637
## trestbps            0.558762233  0.16658326 -0.133346757 -0.209913718
## chol                0.099533027  0.17877867 -0.014999030  0.088998760
## fbs<=120            0.074834503 -0.01981791 -0.011788835  0.175395522
## fbs>120            -0.074834503  0.01981791  0.011788835 -0.175395522
## restecgabnormal     0.022307302  0.07800898 -0.058828058  0.015393105
## restecgdefinite     0.019284283 -0.15737992  0.297537103 -0.160146237
## restecgnormal      -0.026040336 -0.04758982  0.001302855  0.015572132
## thalach             0.053317476 -0.18721599  0.268279182 -0.191364311
## exangno            -0.159850407 -0.06848620 -0.108328418 -0.098331517
## exangyes            0.159850407  0.06848620  0.108328418  0.098331517
## oldpeak            -0.118376039 -0.28116180 -0.070437461  0.130630351
## slope0             -0.183468090 -0.24924191 -0.120746443  0.257749536
## slope1              0.146523898 -0.02438587 -0.004764372 -0.115899019
## slope2             -0.064198930  0.13479804  0.058299086  0.000817087
## ca0                 0.035901178  0.21531961 -0.019033132  0.040677403
## ca1                -0.061019476 -0.10245025  0.065584851 -0.307156634
## ca2                -0.062979887 -0.21493774  0.190248341  0.400493663
## ca3                -0.120805688  0.12340610 -0.181065955 -0.107273784
## ca4                 0.467979249 -0.17267244 -0.331705364 -0.135558575
## thal0               0.096985482 -0.42619396 -0.258192820  0.109837958
## thal1               0.261684675  0.15940947  0.454059167  0.020303257
## thal2              -0.183500247  0.14317523 -0.276835144  0.055709181
## thal3               0.039190364 -0.14120018  0.110723026 -0.089443704
##                            PC17        PC18         PC19         PC20
## age                 0.111102346  0.10755715 -0.119481022  0.634592582
## sexfemale          -0.262204268  0.11173744 -0.071863869  0.056457644
## sexmale             0.262204268 -0.11173744  0.071863869 -0.056457644
## cpasymptotic        0.004321684  0.03778667  0.081334668  0.038321037
## cpatypical angina  -0.007070978  0.04754370 -0.050109424  0.003776925
## cpnon-anginal pain  0.001099695 -0.08065819 -0.047785079 -0.045118642
## trestbps           -0.024729046 -0.51745690 -0.228395658 -0.207888802
## chol                0.311178051  0.17199235  0.614251536 -0.293268601
## fbs<=120           -0.107830779 -0.08751724 -0.050378871  0.008164833
## fbs>120             0.107830779  0.08751724  0.050378871 -0.008164833
## restecgabnormal    -0.003828996 -0.09130157  0.123980608 -0.025264759
## restecgdefinite     0.526368999  0.16138195 -0.389205946  0.081912822
## restecgnormal      -0.097955937  0.06011090 -0.048740716  0.009429520
## thalach             0.132133530 -0.34932228  0.350327215  0.613049818
## exangno             0.054993699  0.07387301 -0.001615882 -0.074831716
## exangyes           -0.054993699 -0.07387301  0.001615882  0.074831716
## oldpeak            -0.017275377 -0.19641737 -0.044658121  0.014852663
## slope0             -0.236779204  0.01835022  0.205780747  0.076968959
## slope1              0.198654712 -0.03992670  0.056407624  0.034717513
## slope2             -0.092347221  0.03153132 -0.147335259 -0.068640573
## ca0                 0.033486402  0.15915730 -0.067902495  0.067031797
## ca1                 0.082632208 -0.14084179  0.119478608 -0.037244599
## ca2                 0.129741079 -0.08790932 -0.081575166 -0.148382938
## ca3                -0.376830513 -0.21119126  0.022227162  0.086051127
## ca4                -0.043502495  0.46680415  0.089227879  0.107268300
## thal0               0.142040654 -0.17122845  0.143891587  0.023977683
## thal1              -0.250400560  0.06807129  0.298951277 -0.021055632
## thal2               0.217044473 -0.19542102 -0.023151988 -0.039387046
## thal3              -0.127236152  0.20154651 -0.153751053  0.046198032
##                            PC21          PC22          PC23          PC24
## age                -0.055835074 -5.900762e-16  0.000000e+00  0.000000e+00
## sexfemale          -0.009716911 -4.236811e-01  9.456792e-02 -1.325497e-01
## sexmale             0.009716911 -4.236811e-01  9.456792e-02 -1.325497e-01
## cpasymptotic        0.024069859  4.846712e-01 -2.574912e-02 -2.320548e-01
## cpatypical angina  -0.020759097  3.649970e-01 -1.939119e-02 -1.747562e-01
## cpnon-anginal pain -0.009246758  4.422046e-01 -2.349300e-02 -2.117223e-01
## trestbps            0.091112831  6.938894e-17 -1.665335e-16  4.857226e-17
## chol               -0.044978707  2.359224e-16 -7.285839e-17 -3.469447e-17
## fbs<=120            0.055246780 -1.260502e-01 -6.621016e-01 -1.250737e-01
## fbs>120            -0.055246780 -1.260502e-01 -6.621016e-01 -1.250737e-01
## restecgabnormal    -0.002668114 -8.042875e-02  1.747689e-01 -1.411757e-01
## restecgdefinite     0.131111145 -1.554999e-02  3.378960e-02 -2.729472e-02
## restecgnormal      -0.022684782 -8.041444e-02  1.747378e-01 -1.411505e-01
## thalach             0.004156898  2.220446e-16 -1.387779e-17  6.938894e-17
## exangno            -0.039936533  3.809297e-02 -6.463858e-02  5.090787e-01
## exangyes            0.039936533  3.809297e-02 -6.463858e-02  5.090787e-01
## oldpeak            -0.783813396  4.024558e-16 -3.608225e-16 -3.469447e-18
## slope0              0.400630885  3.722987e-02 -1.153131e-02  4.991376e-02
## slope1              0.100211495  8.338771e-02 -2.582792e-02  1.117972e-01
## slope2             -0.277298232  8.392351e-02 -2.599387e-02  1.125155e-01
## ca0                -0.122531418 -5.815027e-02  1.126057e-01 -5.812955e-02
## ca1                -0.043260285 -4.542485e-02  8.796343e-02 -4.540866e-02
## ca2                 0.158659708 -4.112426e-02  7.963552e-02 -4.110961e-02
## ca3                 0.133158564 -2.765535e-02  5.355351e-02 -2.764550e-02
## ca4                -0.079020446 -1.413538e-02  2.737260e-02 -1.413034e-02
## thal0              -0.001634511  1.011522e-02 -4.786710e-03  5.992430e-02
## thal1              -0.158154115  2.510435e-02 -1.187984e-02  1.487225e-01
## thal2               0.005608392  5.192370e-02 -2.457126e-02  3.076049e-01
## thal3               0.073239274  5.046037e-02 -2.387878e-02  2.989359e-01
##                             PC25          PC26          PC27          PC28
## age                 0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
## sexfemale           1.108848e-01 -1.177609e-01  3.959338e-02  4.135986e-01
## sexmale             1.108848e-01 -1.177609e-01  3.959338e-02  4.135986e-01
## cpasymptotic        1.138542e-01 -1.834704e-01  6.473346e-02  2.486277e-01
## cpatypical angina   8.574149e-02 -1.381682e-01  4.874959e-02  1.872370e-01
## cpnon-anginal pain  1.038783e-01 -1.673948e-01  5.906155e-02  2.268431e-01
## trestbps           -1.283695e-16 -2.081668e-17  3.469447e-18  5.551115e-17
## chol               -3.469447e-18  0.000000e+00  7.979728e-17  5.551115e-17
## fbs<=120            2.373989e-02 -1.191573e-01  1.067599e-01 -6.213498e-02
## fbs>120             2.373989e-02 -1.191573e-01  1.067599e-01 -6.213498e-02
## restecgabnormal    -2.402886e-01 -5.316582e-01  1.220949e-01 -2.798720e-01
## restecgdefinite    -4.645708e-02 -1.027901e-01  2.360568e-02 -5.411009e-02
## restecgnormal      -2.402458e-01 -5.315636e-01  1.220732e-01 -2.798222e-01
## thalach             1.387779e-16 -7.632783e-17  1.075529e-16 -1.665335e-16
## exangno             2.722711e-01 -3.529037e-01 -1.840042e-01  4.928564e-02
## exangyes            2.722711e-01 -3.529037e-01 -1.840042e-01  4.928564e-02
## oldpeak             3.469447e-17 -4.163336e-17  3.087808e-16 -1.665335e-16
## slope0              7.985738e-03  5.722833e-02  4.069696e-02 -1.241536e-01
## slope1              1.788651e-02  1.281804e-01  9.115332e-02 -2.780801e-01
## slope2              1.800144e-02  1.290040e-01  9.173902e-02 -2.798669e-01
## ca0                 4.606204e-01  1.950834e-02  3.775810e-01 -1.600399e-01
## ca1                 3.598197e-01  1.523920e-02  2.949523e-01 -1.250173e-01
## ca2                 3.257538e-01  1.379643e-02  2.670278e-01 -1.131813e-01
## ca3                 2.190638e-01  9.277859e-03  1.795716e-01 -7.611245e-02
## ca4                 1.119693e-01  4.742157e-03  9.178373e-02 -3.890307e-02
## thal0              -5.368634e-02  4.071078e-03  9.308769e-02  4.239949e-02
## thal1              -1.332408e-01  1.010376e-02  2.310286e-01  1.052287e-01
## thal2              -2.755841e-01  2.089776e-02  4.778401e-01  2.176461e-01
## thal3              -2.678175e-01  2.030881e-02  4.643734e-01  2.115123e-01
##                             PC29
## age                 0.000000e+00
## sexfemale           3.085245e-01
## sexmale             3.085245e-01
## cpasymptotic        1.212119e-01
## cpatypical angina   9.128250e-02
## cpnon-anginal pain  1.105914e-01
## trestbps            4.163336e-17
## chol               -1.387779e-17
## fbs<=120           -8.305632e-03
## fbs>120            -8.305632e-03
## restecgabnormal     1.828014e-02
## restecgdefinite     3.534258e-03
## restecgnormal       1.827689e-02
## thalach             5.551115e-17
## exangno             1.582416e-02
## exangyes            1.582416e-02
## oldpeak             1.387779e-17
## slope0              2.562918e-01
## slope1              5.740442e-01
## slope2              5.777327e-01
## ca0                -1.313575e-01
## ca1                -1.026117e-01
## ca2                -9.289694e-02
## ca3                -6.247158e-02
## ca4                -3.193086e-02
## thal0              -6.833711e-03
## thal1              -1.696017e-02
## thal2              -3.507898e-02
## thal3              -3.409037e-02
pcaCharts <- function(x) {
    x.var <- x$sdev ^ 2
    x.pvar <- x.var/sum(x.var)
    print("proportions of variance:")
    print(x.pvar)
    
    par(mfrow=c(2,2))
    plot(x.pvar,xlab="Principal component", ylab="Proportion of variance explained", ylim=c(0,1), type='b')
    plot(cumsum(x.pvar),xlab="Principal component", ylab="Cumulative Proportion of variance explained", ylim=c(0,1), type='b')
    screeplot(x)
    screeplot(x,type="l")
    par(mfrow=c(1,1))
}
summ<-summary(pca)
summ
## Importance of components:
##                          PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.246 1.58061 1.51531 1.38839 1.32155 1.30816
## Proportion of Variance 0.174 0.08615 0.07918 0.06647 0.06022 0.05901
## Cumulative Proportion  0.174 0.26017 0.33935 0.40582 0.46605 0.52506
##                            PC7    PC8    PC9    PC10    PC11    PC12
## Standard deviation     1.27253 1.1896 1.1449 1.11483 1.01758 1.00637
## Proportion of Variance 0.05584 0.0488 0.0452 0.04286 0.03571 0.03492
## Cumulative Proportion  0.58090 0.6297 0.6749 0.71775 0.75346 0.78838
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.99454 0.95019 0.90794 0.86396 0.85104 0.79050
## Proportion of Variance 0.03411 0.03113 0.02843 0.02574 0.02497 0.02155
## Cumulative Proportion  0.82249 0.85362 0.88205 0.90779 0.93276 0.95431
##                          PC19    PC20    PC21      PC22      PC23
## Standard deviation     0.7823 0.62144 0.57176 9.976e-16 6.426e-16
## Proportion of Variance 0.0211 0.01332 0.01127 0.000e+00 0.000e+00
## Cumulative Proportion  0.9754 0.98873 1.00000 1.000e+00 1.000e+00
##                             PC24      PC25      PC26      PC27      PC28
## Standard deviation     5.798e-16 5.455e-16 4.572e-16 4.555e-16 3.203e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC29
## Standard deviation     1.116e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
fviz_screeplot(pca)

#pcaCharts(pca)

Inference: Top 20 predictors can explain more than 98% of variance

pcavar<-pca$sdev^2
pcasd<-pca$sdev

check variance of first 20 components as per sumary

#pcavar[1:20]
prop_varex <- pcavar/sum(pcavar)
prop_varex[1:20]
##  [1] 0.17402475 0.08614958 0.07917839 0.06646966 0.06022429 0.05900978
##  [7] 0.05583944 0.04880134 0.04519660 0.04285651 0.03570555 0.03492360
## [13] 0.03410694 0.03113344 0.02842625 0.02573895 0.02497464 0.02154805
## [19] 0.02110303 0.01331661
#scree plot
plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

Hence , we can se that out of 31 predictors(hot encoded)predictors 21 predictors contribute to 98% of variance proportion of variance explained

fviz_pca(pca)

#biplot(pca,scale=0, cex=.7)
pca.out <- pca
pca.out$rotation <- -pca.out$rotation
pca.out$x <- -pca.out$x
fviz_pca(pca.out)

impPredictors<-rownames(data.frame(pca$rotation))[1:20]
notsoimp<-rownames(data.frame(pca$rotation))[21:29]
#[1:21]
s_df <- apply(dt, 2, scale)
head(s_df)
##             age  sexfemale    sexmale cpasymptotic cpatypical angina
## [1,] -0.1470709 -0.6560432  0.6560432   -1.0760010        -0.4511991
## [2,] -1.2488032  1.5170999 -1.5170999   -1.0760010        -0.4511991
## [3,] -1.3589764 -0.6560432  0.6560432    0.9249833        -0.4511991
## [4,]  1.5055276 -0.6560432  0.6560432   -1.0760010        -0.4511991
## [5,]  0.4037953 -0.6560432  0.6560432    0.9249833        -0.4511991
## [6,]  0.4037953 -0.6560432  0.6560432    0.9249833        -0.4511991
##      cpnon-anginal pain    trestbps        chol   fbs<=120    fbs>120
## [1,]           1.551755 -0.06512095  0.07211863 -2.3661080  2.3661080
## [2,]           1.551755 -0.49320287 -0.62093350  0.4206414 -0.4206414
## [3,]          -0.641392  0.46998144 -0.34791296  0.4206414 -0.4206414
## [4,]           1.551755  2.61039102  0.66016286 -2.3661080  2.3661080
## [5,]          -0.641392 -0.17214143 -0.55792876  0.4206414 -0.4206414
## [6,]          -0.641392 -0.17214143  0.34513917  0.4206414 -0.4206414
##      restecgabnormal restecgdefinite restecgnormal     thalach   exangno
## [1,]      -0.9976387     -0.09735957     1.0166431  0.99012710  0.641392
## [2,]       0.9976387     -0.09735957    -0.9789896  0.63412635  0.641392
## [3,]       0.9976387     -0.09735957    -0.9789896  1.21262758  0.641392
## [4,]      -0.9976387     -0.09735957     1.0166431 -0.03337507 -1.551755
## [5,]      -0.9976387     -0.09735957     1.0166431 -0.87887687 -1.551755
## [6,]      -0.9976387     -0.09735957     1.0166431 -0.92337696 -1.551755
##       exangyes    oldpeak     slope0     slope1    slope2        ca0
## [1,] -0.641392 -0.8789192 -0.2333843 -0.8904227  0.988271 -1.2687219
## [2,] -0.641392 -0.7012074 -0.2333843  1.1177647 -1.007095  0.7844769
## [3,] -0.641392 -0.8789192 -0.2333843 -0.8904227  0.988271  0.7844769
## [4,]  1.551755  0.5427756 -0.2333843  1.1177647 -1.007095  0.7844769
## [5,]  1.551755  1.0759111 -0.2333843  1.1177647 -1.007095 -1.2687219
## [6,]  1.551755  1.7867585 -0.2333843  1.1177647 -1.007095 -1.2687219
##             ca1        ca2        ca3        ca4       thal0      thal1
## [1,] -0.4587279 -0.3971429  4.0728430 -0.1195256 -0.09735957 -0.2549872
## [2,] -0.4587279 -0.3971429 -0.2443706 -0.1195256 -0.09735957 -0.2549872
## [3,] -0.4587279 -0.3971429 -0.2443706 -0.1195256 -0.09735957 -0.2549872
## [4,] -0.4587279 -0.3971429 -0.2443706 -0.1195256 -0.09735957 -0.2549872
## [5,] -0.4587279 -0.3971429  4.0728430 -0.1195256 -0.09735957 -0.2549872
## [6,] -0.4587279  2.5061083 -0.2443706 -0.1195256 -0.09735957 -0.2549872
##           thal2      thal3
## [1,]  0.8819446 -0.7611464
## [2,]  0.8819446 -0.7611464
## [3,]  0.8819446 -0.7611464
## [4,] -1.1285097  1.3076105
## [5,] -1.1285097  1.3076105
## [6,] -1.1285097  1.3076105
cov_df <- cov(s_df) #covariance metric 
cov_df
##                            age    sexfemale      sexmale cpasymptotic
## age                 1.00000000  0.044895676 -0.044895676   0.21713321
## sexfemale           0.04489568  1.000000000 -1.000000000  -0.15281509
## sexmale            -0.04489568 -1.000000000  1.000000000   0.15281509
## cpasymptotic        0.21713321 -0.152815087  0.152815087   1.00000000
## cpatypical angina  -0.10551936  0.085711609 -0.085711609  -0.48779159
## cpnon-anginal pain -0.15088935  0.096743886 -0.096743886  -0.69340921
## trestbps            0.22722685  0.044151643 -0.044151643   0.10974448
## chol                0.27676712  0.231176419 -0.231176419   0.09868066
## fbs<=120           -0.14010101  0.018953785 -0.018953785   0.05834037
## fbs>120             0.14010101 -0.018953785  0.018953785  -0.05834037
## restecgabnormal    -0.13074969  0.102749367 -0.102749367  -0.17029702
## restecgdefinite     0.02333377  0.042114708 -0.042114708   0.09048279
## restecgnormal       0.12626084 -0.110911514  0.110911514   0.15283040
## thalach            -0.44668768  0.042165244 -0.042165244  -0.31587407
## exangno            -0.14455863  0.083958351 -0.083958351  -0.36730674
## exangyes            0.14455863 -0.083958351  0.083958351   0.36730674
## oldpeak             0.22217640 -0.101633444  0.101633444   0.39467060
## slope0              0.06180576 -0.061184697  0.061184697   0.13158849
## slope1              0.20711257 -0.007804844  0.007804844   0.25620000
## slope2             -0.23320834  0.034897564 -0.034897564  -0.31293916
## ca0                -0.33649460  0.094161419 -0.094161419  -0.16440288
## ca1                 0.17924800 -0.058737722  0.058737722  -0.02233923
## ca2                 0.25814376  0.007334288 -0.007334288   0.28649398
## ca3                 0.10816025 -0.072149276  0.072149276   0.02240201
## ca4                -0.15438162 -0.078785558  0.078785558  -0.12921915
## thal0              -0.01977525  0.042114708 -0.042114708  -0.00738635
## thal1               0.12299163 -0.125248528  0.125248528   0.15810851
## thal2              -0.18880235  0.415686404 -0.415686404  -0.32393904
## thal3               0.13705258 -0.373871529  0.373871529   0.25615387
##                    cpatypical angina cpnon-anginal pain     trestbps
## age                     -0.105519358        -0.15088935  0.227226847
## sexfemale                0.085711609         0.09674389  0.044151643
## sexmale                 -0.085711609        -0.09674389 -0.044151643
## cpasymptotic            -0.487791585        -0.69340921  0.109744476
## cpatypical angina        1.000000000        -0.29076701 -0.042324705
## cpnon-anginal pain      -0.290767011         1.00000000 -0.085348723
## trestbps                -0.042324705        -0.08534872  1.000000000
## chol                    -0.014909937        -0.09585064  0.173933079
## fbs<=120                 0.015228916        -0.07651301 -0.176015299
## fbs>120                 -0.015228916         0.07651301  0.176015299
## restecgabnormal          0.075377836         0.12443420 -0.161922690
## restecgdefinite         -0.044136741        -0.06274160  0.082615022
## restecgnormal           -0.066856398        -0.11232382  0.145975995
## thalach                  0.203416734         0.17830788 -0.032349481
## exangno                  0.180296390         0.25376344 -0.041462290
## exangyes                -0.180296390        -0.25376344  0.041462290
## oldpeak                 -0.260817808        -0.21729253  0.171664805
## slope0                  -0.105801842        -0.05689628  0.186632633
## slope1                  -0.201353845        -0.11460579  0.004891033
## slope2                   0.247003776         0.13911424 -0.087653149
## ca0                      0.097079454         0.10006144  0.010192898
## ca1                     -0.009367502         0.03221652 -0.103336727
## ca2                     -0.070360357        -0.25593146  0.083719831
## ca3                     -0.056416933         0.02201336  0.001528670
## ca4                     -0.054185415         0.18635338  0.043588719
## thal0                   -0.044136741         0.04452629 -0.016839045
## thal1                   -0.010867931        -0.16432186  0.121461771
## thal2                    0.171964956         0.21310776 -0.130941937
## thal3                   -0.162697433        -0.14646236  0.077686762
##                            chol     fbs<=120      fbs>120 restecgabnormal
## age                 0.276767122 -0.140101006  0.140101006   -1.307497e-01
## sexfemale           0.231176419  0.018953785 -0.018953785    1.027494e-01
## sexmale            -0.231176419 -0.018953785  0.018953785   -1.027494e-01
## cpasymptotic        0.098680662  0.058340366 -0.058340366   -1.702970e-01
## cpatypical angina  -0.014909937  0.015228916 -0.015228916    7.537784e-02
## cpnon-anginal pain -0.095850643 -0.076513014  0.076513014    1.244342e-01
## trestbps            0.173933079 -0.176015299  0.176015299   -1.619227e-01
## chol                1.000000000  0.054951665 -0.054951665   -1.678145e-01
## fbs<=120            0.054951665  1.000000000 -1.000000000    5.270463e-02
## fbs>120            -0.054951665 -1.000000000  1.000000000   -5.270463e-02
## restecgabnormal    -0.167814510  0.052704628 -0.052704628    1.000000e+00
## restecgdefinite     0.164215958  0.041147560 -0.041147560   -9.759001e-02
## restecgnormal       0.136089437 -0.060670843  0.060670843   -9.813068e-01
## thalach            -0.060020801  0.079343283 -0.079343283    1.100407e-01
## exangno            -0.116765457  0.047547373 -0.047547373    1.244342e-01
## exangyes            0.116765457 -0.047547373  0.047547373   -1.244342e-01
## oldpeak             0.160539105 -0.032239431  0.032239431   -1.903161e-01
## slope0              0.076902538 -0.198393594  0.198393594   -1.488688e-01
## slope1              0.062521588  0.084572764 -0.084572764   -1.329300e-01
## slope2             -0.096237670  0.003977885 -0.003977885    1.981220e-01
## ca0                -0.060522050  0.075212137 -0.075212137    6.795476e-02
## ca1                 0.007601576  0.055018424 -0.055018424   -6.213698e-02
## ca2                 0.046693337 -0.138907417  0.138907417   -4.118098e-02
## ca3                 0.111813163 -0.067777209  0.067777209   -4.082483e-02
## ca4                -0.130055856  0.050515682 -0.050515682    1.198085e-01
## thal0              -0.066903515 -0.095153732  0.095153732    1.477290e-21
## thal1              -0.086650491 -0.056991912  0.056991912   -9.830414e-02
## thal2              -0.017481103  0.078656288 -0.078656288    1.235744e-01
## thal3               0.074508601 -0.033509072  0.033509072   -7.825107e-02
##                    restecgdefinite restecgnormal     thalach      exangno
## age                     0.02333377   0.126260842 -0.44668768 -0.144558631
## sexfemale               0.04211471  -0.110911514  0.04216524  0.083958351
## sexmale                -0.04211471   0.110911514 -0.04216524 -0.083958351
## cpasymptotic            0.09048279   0.152830400 -0.31587407 -0.367306744
## cpatypical angina      -0.04413674  -0.066856398  0.20341673  0.180296390
## cpnon-anginal pain     -0.06274160  -0.112323820  0.17830788  0.253763441
## trestbps                0.08261502   0.145975995 -0.03234948 -0.041462290
## chol                    0.16421596   0.136089437 -0.06002080 -0.116765457
## fbs<=120                0.04114756  -0.060670843  0.07934328  0.047547373
## fbs>120                -0.04114756   0.060670843 -0.07934328 -0.047547373
## restecgabnormal        -0.09759001  -0.981306763  0.11004073  0.124434203
## restecgdefinite         1.00000000  -0.095765734 -0.09685521 -0.044526294
## restecgnormal          -0.09576573   1.000000000 -0.09133113 -0.115846170
## thalach                -0.09685521  -0.091331128  1.00000000  0.339733179
## exangno                -0.04452629  -0.115846170  0.33973318  1.000000000
## exangyes                0.04452629   0.115846170 -0.33973318 -1.000000000
## oldpeak                 0.25301026   0.141424563 -0.36116742 -0.311915282
## slope0                  0.19716716   0.110768471 -0.13589012 -0.083359664
## slope1                  0.01111941   0.130803504 -0.37037573 -0.219365730
## slope2                 -0.09851505  -0.179107137  0.42829422  0.254944941
## ca0                    -0.02368466  -0.063386882  0.25450840  0.156031422
## ca1                    -0.04487322   0.070825313 -0.17281213 -0.141494963
## ca2                    -0.03884891   0.048700650 -0.08740559 -0.106187357
## ca3                     0.18725248   0.004622502 -0.17845885  0.022860023
## ca4                    -0.01169211  -0.117568852  0.11178090  0.077026063
## thal0                  -0.00952381   0.001841649 -0.06638391 -0.044526294
## thal1                   0.17843912   0.063816309 -0.15588310 -0.008562678
## thal2                  -0.01205962  -0.121264363  0.25684204  0.246621467
## thal3                  -0.07445609   0.092662810 -0.17343034 -0.240587750
##                        exangyes     oldpeak      slope0       slope1
## age                 0.144558631  0.22217640  0.06180576  0.207112568
## sexfemale          -0.083958351 -0.10163344 -0.06118470 -0.007804844
## sexmale             0.083958351  0.10163344  0.06118470  0.007804844
## cpasymptotic        0.367306744  0.39467060  0.13158849  0.256200001
## cpatypical angina  -0.180296390 -0.26081781 -0.10580184 -0.201353845
## cpnon-anginal pain -0.253763441 -0.21729253 -0.05689628 -0.114605794
## trestbps            0.041462290  0.17166480  0.18663263  0.004891033
## chol                0.116765457  0.16053910  0.07690254  0.062521588
## fbs<=120           -0.047547373 -0.03223943 -0.19839359  0.084572764
## fbs>120             0.047547373  0.03223943  0.19839359 -0.084572764
## restecgabnormal    -0.124434203 -0.19031606 -0.14886879 -0.132930031
## restecgdefinite     0.044526294  0.25301026  0.19716716  0.011119408
## restecgnormal       0.115846170  0.14142456  0.11076847  0.130803504
## thalach            -0.339733179 -0.36116742 -0.13589012 -0.370375730
## exangno            -1.000000000 -0.31191528 -0.08335966 -0.219365730
## exangyes            1.000000000  0.31191528  0.08335966  0.219365730
## oldpeak             0.311915282  1.00000000  0.40950683  0.380725388
## slope0              0.083359664  0.40950683  1.00000000 -0.208795549
## slope1              0.219365730  0.38072539 -0.20879555  1.000000000
## slope2             -0.254944941 -0.55995872 -0.23615413 -0.900990398
## ca0                -0.156031422 -0.23413370  0.05264628 -0.138448051
## ca1                 0.141494963  0.01329800 -0.05153714  0.064898980
## ca2                 0.106187357  0.20435494 -0.03123668  0.086805003
## ca3                -0.022860023  0.22053374  0.03472882  0.110087443
## ca4                -0.077026063 -0.10555119 -0.02802759 -0.106932669
## thal0               0.044526294 -0.04251688 -0.02282988  0.011119408
## thal1               0.008562678  0.08827441  0.11749347  0.128062204
## thal2              -0.246621467 -0.33111163 -0.09319478 -0.244235004
## thal3               0.240587750  0.30531958  0.04202012  0.185377024
##                           slope2         ca0          ca1          ca2
## age                -0.2332083362 -0.33649460  0.179248001  0.258143755
## sexfemale           0.0348975644  0.09416142 -0.058737722  0.007334288
## sexmale            -0.0348975644 -0.09416142  0.058737722 -0.007334288
## cpasymptotic       -0.3129391581 -0.16440288 -0.022339233  0.286493980
## cpatypical angina   0.2470037757  0.09707945 -0.009367502 -0.070360357
## cpnon-anginal pain  0.1391142388  0.10006144  0.032216522 -0.255931457
## trestbps           -0.0876531491  0.01019290 -0.103336727  0.083719831
## chol               -0.0962376698 -0.06052205  0.007601576  0.046693337
## fbs<=120            0.0039778848  0.07521214  0.055018424 -0.138907417
## fbs>120            -0.0039778848 -0.07521214 -0.055018424  0.138907417
## restecgabnormal     0.1981220241  0.06795476 -0.062136977 -0.041180983
## restecgdefinite    -0.0985150517 -0.02368466 -0.044873215 -0.038848907
## restecgnormal      -0.1791071365 -0.06338688  0.070825313  0.048700650
## thalach             0.4282942163  0.25450840 -0.172812126 -0.087405590
## exangno             0.2549449412  0.15603142 -0.141494963 -0.106187357
## exangyes           -0.2549449412 -0.15603142  0.141494963  0.106187357
## oldpeak            -0.5599587231 -0.23413370  0.013297999  0.204354935
## slope0             -0.2361541344  0.05264628 -0.051537145 -0.031236680
## slope1             -0.9009903976 -0.13844805  0.064898980  0.086805003
## slope2              1.0000000000  0.11420938 -0.041621903 -0.072393692
## ca0                 0.1142093758  1.00000000 -0.584756390 -0.506251814
## ca1                -0.0416219026 -0.58475639  1.000000000 -0.183043913
## ca2                -0.0723936920 -0.50625181 -0.183043913  1.000000000
## ca3                -0.1247908798 -0.31150768 -0.112630876 -0.097509983
## ca4                 0.1186834726 -0.15236339 -0.055089562 -0.047693692
## thal0              -0.0009207014  0.07673831 -0.044873215 -0.038848907
## thal1              -0.1793666601 -0.08226231 -0.065725542  0.127115394
## thal2               0.2840184611  0.22433232 -0.094386947 -0.173675500
## thal3              -0.2028343241 -0.20529474  0.138818235  0.123258942
##                             ca3         ca4         thal0        thal1
## age                 0.108160246 -0.15438162 -1.977525e-02  0.122991632
## sexfemale          -0.072149276 -0.07878556  4.211471e-02 -0.125248528
## sexmale             0.072149276  0.07878556 -4.211471e-02  0.125248528
## cpasymptotic        0.022402006 -0.12921915 -7.386350e-03  0.158108511
## cpatypical angina  -0.056416933 -0.05418541 -4.413674e-02 -0.010867931
## cpnon-anginal pain  0.022013355  0.18635338  4.452629e-02 -0.164321864
## trestbps            0.001528670  0.04358872 -1.683905e-02  0.121461771
## chol                0.111813163 -0.13005586 -6.690352e-02 -0.086650491
## fbs<=120           -0.067777209  0.05051568 -9.515373e-02 -0.056991912
## fbs>120             0.067777209 -0.05051568  9.515373e-02  0.056991912
## restecgabnormal    -0.040824829  0.11980846  1.477290e-21 -0.098304136
## restecgdefinite     0.187252482 -0.01169211 -9.523810e-03  0.178439125
## restecgnormal       0.004622502 -0.11756885  1.841649e-03  0.063816309
## thalach            -0.178458849  0.11178090 -6.638391e-02 -0.155883099
## exangno             0.022860023  0.07702606 -4.452629e-02 -0.008562678
## exangyes           -0.022860023 -0.07702606  4.452629e-02  0.008562678
## oldpeak             0.220533735 -0.10555119 -4.251688e-02  0.088274412
## slope0              0.034728817 -0.02802759 -2.282988e-02  0.117493471
## slope1              0.110087443 -0.10693267  1.111941e-02  0.128062204
## slope2             -0.124790880  0.11868347 -9.207014e-04 -0.179366660
## ca0                -0.311507684 -0.15236339  7.673831e-02 -0.082262310
## ca1                -0.112630876 -0.05508956 -4.487322e-02 -0.065725542
## ca2                -0.097509983 -0.04769369 -3.884891e-02  0.127115394
## ca3                 1.000000000 -0.02934696 -2.390457e-02  0.107555088
## ca4                -0.029346959  1.00000000 -1.169211e-02 -0.030621934
## thal0              -0.023904572 -0.01169211  1.000000e+00 -0.024943103
## thal1               0.107555088 -0.03062193 -2.494310e-02  1.000000000
## thal2              -0.112540131  0.10591451 -1.103919e-01 -0.289119340
## thal3               0.067086318 -0.09140762 -7.445609e-02 -0.195002438
##                          thal2       thal3
## age                -0.18880235  0.13705258
## sexfemale           0.41568640 -0.37387153
## sexmale            -0.41568640  0.37387153
## cpasymptotic       -0.32393904  0.25615387
## cpatypical angina   0.17196496 -0.16269743
## cpnon-anginal pain  0.21310776 -0.14646236
## trestbps           -0.13094194  0.07768676
## chol               -0.01748110  0.07450860
## fbs<=120            0.07865629 -0.03350907
## fbs>120            -0.07865629  0.03350907
## restecgabnormal     0.12357437 -0.07825107
## restecgdefinite    -0.01205962 -0.07445609
## restecgnormal      -0.12126436  0.09266281
## thalach             0.25684204 -0.17343034
## exangno             0.24662147 -0.24058775
## exangyes           -0.24662147  0.24058775
## oldpeak            -0.33111163  0.30531958
## slope0             -0.09319478  0.04202012
## slope1             -0.24423500  0.18537702
## slope2              0.28401846 -0.20283432
## ca0                 0.22433232 -0.20529474
## ca1                -0.09438695  0.13881823
## ca2                -0.17367550  0.12325894
## ca3                -0.11254013  0.06708632
## ca4                 0.10591451 -0.09140762
## thal0              -0.11039194 -0.07445609
## thal1              -0.28911934 -0.19500244
## thal2               1.00000000 -0.86303202
## thal3              -0.86303202  1.00000000
ei_df <- eigen(cov_df)
ei_df
## eigen() decomposition
## $values
##  [1]  5.046718e+00  2.498338e+00  2.296173e+00  1.927620e+00  1.746504e+00
##  [6]  1.711284e+00  1.619344e+00  1.415239e+00  1.310701e+00  1.242839e+00
## [11]  1.035461e+00  1.012784e+00  9.891014e-01  9.028698e-01  8.243613e-01
## [16]  7.464296e-01  7.242646e-01  6.248935e-01  6.119878e-01  3.861818e-01
## [21]  3.269050e-01  2.091888e-16  2.024479e-17  1.676868e-17 -2.500649e-17
## [26] -3.572164e-17 -1.549294e-16 -2.823622e-16 -3.844234e-16
## 
## $vectors
##                [,1]          [,2]         [,3]        [,4]        [,5]
##  [1,] -0.2040227737  1.833293e-01 -0.094779873 -0.13731378  0.10549430
##  [2,]  0.1413086011  5.394177e-01  0.039906990 -0.10547773  0.06696998
##  [3,] -0.1413086011 -5.394177e-01 -0.039906990  0.10547773 -0.06696998
##  [4,] -0.2942003135  2.461245e-02  0.124389538  0.04018123 -0.01260643
##  [5,]  0.1600597333  5.511576e-06 -0.032650698  0.05734463  0.17794918
##  [6,]  0.1903397143 -2.698063e-02 -0.109385163 -0.09137243 -0.13306274
##  [7,] -0.0935479157  1.204104e-01 -0.222487435  0.12005801 -0.05039087
##  [8,] -0.0849067274  2.803531e-01  0.029034968  0.08955058  0.10490839
##  [9,]  0.0554027651 -2.904020e-02  0.591801533  0.18297854 -0.01650000
## [10,] -0.0554027651  2.904020e-02 -0.591801533 -0.18297854  0.01650000
## [11,]  0.1736418581 -7.301015e-02  0.134795878 -0.54120573 -0.24450421
## [12,] -0.0625879247  1.361349e-01  0.006726700  0.07124934 -0.29057685
## [13,] -0.1615699524  4.669831e-02 -0.136120641  0.52752438  0.30073749
## [14,]  0.2642768993 -1.226385e-01 -0.027373082  0.15232430  0.06430739
## [15,]  0.2662813236 -5.344106e-02 -0.103829439  0.13787861 -0.23790773
## [16,] -0.2662813236  5.344106e-02  0.103829439 -0.13787861  0.23790773
## [17,] -0.3018129056  1.014582e-01  0.013299747  0.02003847 -0.23877351
## [18,] -0.1129508182  6.226086e-02 -0.230618976  0.14454194 -0.21883738
## [19,] -0.2404813692  1.326143e-01  0.214508548 -0.02652619 -0.20305279
## [20,]  0.2890528978 -1.593875e-01 -0.110832619 -0.03776438  0.29883633
## [21,]  0.1863277619  2.539686e-02  0.017259545  0.31810714 -0.22856684
## [22,] -0.0801730543 -5.412041e-02  0.100692516 -0.18736497  0.34911401
## [23,] -0.1440907534  5.350135e-02 -0.095050946 -0.17389422  0.13349020
## [24,] -0.0880689230  1.975906e-02 -0.062237127 -0.05200381 -0.25357146
## [25,]  0.0826325459 -1.248690e-01  0.003714119 -0.09886796 -0.07387878
## [26,]  0.0003292663  1.790613e-02 -0.071960874 -0.04474856 -0.02271613
## [27,] -0.1071281011 -3.217524e-03 -0.087323525  0.08306735 -0.24274831
## [28,]  0.2887953143  2.791867e-01  0.032234467  0.04175985  0.05647509
## [29,] -0.2439394038 -2.892717e-01  0.024699924 -0.07532716  0.06720959
##               [,6]          [,7]         [,8]         [,9]        [,10]
##  [1,] -0.213452492 -0.1616819119  0.023604430 -0.087715997  0.047030462
##  [2,]  0.011344399  0.0132664051  0.046115109 -0.110497288 -0.052789321
##  [3,] -0.011344399 -0.0132664051 -0.046115109  0.110497288  0.052789321
##  [4,]  0.251575980 -0.2184059522  0.067338655  0.070216542 -0.239407715
##  [5,]  0.018666681 -0.2587982109  0.080642858 -0.002363250  0.575654625
##  [6,] -0.291143340  0.4529931814 -0.140368316 -0.075009077 -0.212748155
##  [7,]  0.052792830 -0.1368246765 -0.088519611 -0.198823957 -0.170243940
##  [8,] -0.049599567 -0.1036958509 -0.249146156 -0.378954437  0.034805948
##  [9,] -0.068522160 -0.1133872283 -0.113186317 -0.022985625 -0.067737669
## [10,]  0.068522160  0.1133872283  0.113186317  0.022985625  0.067737669
## [11,]  0.174292551 -0.0922261543  0.011106655 -0.108859690  0.058035125
## [12,]  0.022878704 -0.1374788674 -0.408638140  0.113189843  0.151685506
## [13,] -0.178747710  0.1188272919  0.067910999  0.086991199 -0.087377357
## [14,]  0.087127349 -0.0874879334 -0.015641141 -0.139934444 -0.164750061
## [15,] -0.361710871 -0.2362339223  0.163201029 -0.157374818 -0.056636761
## [16,]  0.361710871  0.2362339223 -0.163201029  0.157374818  0.056636761
## [17,]  0.001229302  0.0009125418 -0.162923697 -0.080997686 -0.050272836
## [18,]  0.152638210  0.0091306367 -0.371342645 -0.005725705  0.003227085
## [19,] -0.201743027  0.2071383509  0.402781053  0.018519962  0.046593168
## [20,]  0.132742164 -0.2098663948 -0.235475750 -0.015861704 -0.047727286
## [21,]  0.410141531  0.2145056094  0.160399335 -0.179411132  0.115343125
## [22,] -0.343597404  0.1291405671 -0.216397262  0.116144689  0.196811267
## [23,]  0.006067372 -0.4103179475  0.227637787  0.040740958 -0.367790016
## [24,] -0.264070192 -0.0683355585 -0.215395509  0.004954165  0.202539387
## [25,] -0.084082911  0.0300039738 -0.205304036  0.236604972 -0.433208186
## [26,]  0.112158906  0.2015978151  0.170508454  0.064865516  0.047711035
## [27,] -0.036225835 -0.2630957285  0.114121348  0.418293722  0.140089959
## [28,]  0.011523649  0.0538623012 -0.080493188  0.306678462 -0.080124643
## [29,] -0.016318483  0.0350554293 -0.008128518 -0.536678636  0.003188517
##               [,11]       [,12]        [,13]       [,14]        [,15]
##  [1,] -0.0514849619  0.25344279 -0.071981397 -0.38373324  0.214537681
##  [2,] -0.0333275463  0.03966658 -0.009422092  0.15754580 -0.132702943
##  [3,]  0.0333275463 -0.03966658  0.009422092 -0.15754580  0.132702943
##  [4,] -0.0918142199  0.11585403  0.289722514 -0.11751257  0.145408546
##  [5,]  0.1913643070 -0.17791385 -0.295940515  0.28920729  0.175698766
##  [6,] -0.0573211994  0.01987070 -0.073275472 -0.10991485 -0.304395008
##  [7,]  0.0809428399  0.17893968 -0.558762233 -0.16658326  0.133346757
##  [8,] -0.1344075528 -0.11996991 -0.099533027 -0.17877867  0.014999030
##  [9,] -0.0632291693  0.10386450 -0.074834503  0.01981791  0.011788835
## [10,]  0.0632291693 -0.10386450  0.074834503 -0.01981791 -0.011788835
## [11,]  0.0892992035  0.07682543 -0.022307302 -0.07800898  0.058828058
## [12,] -0.1707762492 -0.02107557 -0.019284283  0.15737992 -0.297537103
## [13,] -0.0562915673 -0.07276366  0.026040336  0.04758982 -0.001302855
## [14,]  0.0637699306 -0.19026526 -0.053317476  0.18721599 -0.268279182
## [15,]  0.0179033287  0.17592834  0.159850407  0.06848620  0.108328418
## [16,] -0.0179033287 -0.17592834 -0.159850407 -0.06848620 -0.108328418
## [17,]  0.1745509851 -0.06613846  0.118376039  0.28116180  0.070437461
## [18,]  0.3009325805  0.27827997  0.183468090  0.24924191  0.120746443
## [19,]  0.1323699612 -0.16649098 -0.146523898  0.02438587  0.004764372
## [20,] -0.2650235573  0.04197840  0.064198930 -0.13479804 -0.058299086
## [21,]  0.0266367570  0.08414051 -0.035901178 -0.21531961  0.019033132
## [22,]  0.1784967480  0.38220875  0.061019476  0.10245025 -0.065584851
## [23,] -0.0002590246 -0.16793850  0.062979887  0.21493774 -0.190248341
## [24,] -0.3768853491 -0.48613501  0.120805688 -0.12340610  0.181065955
## [25,]  0.0549279697 -0.13469650 -0.467979249  0.17267244  0.331705364
## [26,] -0.6682429521  0.31460817 -0.096985482  0.42619396  0.258192820
## [27,] -0.0747633927  0.20786286 -0.261684675 -0.15940947 -0.454059167
## [28,]  0.1747356815 -0.13999203  0.183500247 -0.14317523  0.276835144
## [29,] -0.0086525893 -0.02242725 -0.039190364  0.14120018 -0.110723026
##              [,16]        [,17]       [,18]        [,19]        [,20]
##  [1,] -0.307043040  0.111102346  0.10755715 -0.119481022 -0.634592582
##  [2,]  0.139797586 -0.262204268  0.11173744 -0.071863869 -0.056457644
##  [3,] -0.139797586  0.262204268 -0.11173744  0.071863869  0.056457644
##  [4,]  0.387590877  0.004321684  0.03778667  0.081334668 -0.038321037
##  [5,] -0.145496306 -0.007070978  0.04754370 -0.050109424 -0.003776925
##  [6,] -0.304719637  0.001099695 -0.08065819 -0.047785079  0.045118642
##  [7,]  0.209913718 -0.024729046 -0.51745690 -0.228395658  0.207888802
##  [8,] -0.088998760  0.311178051  0.17199235  0.614251536  0.293268601
##  [9,] -0.175395522 -0.107830779 -0.08751724 -0.050378871 -0.008164833
## [10,]  0.175395522  0.107830779  0.08751724  0.050378871  0.008164833
## [11,] -0.015393105 -0.003828996 -0.09130157  0.123980608  0.025264759
## [12,]  0.160146237  0.526368999  0.16138195 -0.389205946 -0.081912822
## [13,] -0.015572132 -0.097955937  0.06011090 -0.048740716 -0.009429520
## [14,]  0.191364311  0.132133530 -0.34932228  0.350327215 -0.613049818
## [15,]  0.098331517  0.054993699  0.07387301 -0.001615882  0.074831716
## [16,] -0.098331517 -0.054993699 -0.07387301  0.001615882 -0.074831716
## [17,] -0.130630351 -0.017275377 -0.19641737 -0.044658121 -0.014852663
## [18,] -0.257749536 -0.236779204  0.01835022  0.205780747 -0.076968959
## [19,]  0.115899019  0.198654712 -0.03992670  0.056407624 -0.034717513
## [20,] -0.000817087 -0.092347221  0.03153132 -0.147335259  0.068640573
## [21,] -0.040677403  0.033486402  0.15915730 -0.067902495 -0.067031797
## [22,]  0.307156634  0.082632208 -0.14084179  0.119478608  0.037244599
## [23,] -0.400493663  0.129741079 -0.08790932 -0.081575166  0.148382938
## [24,]  0.107273784 -0.376830513 -0.21119126  0.022227162 -0.086051127
## [25,]  0.135558575 -0.043502495  0.46680415  0.089227879 -0.107268300
## [26,] -0.109837958  0.142040654 -0.17122845  0.143891587 -0.023977683
## [27,] -0.020303257 -0.250400560  0.06807129  0.298951277  0.021055632
## [28,] -0.055709181  0.217044473 -0.19542102 -0.023151988  0.039387046
## [29,]  0.089443704 -0.127236152  0.20154651 -0.153751053 -0.046198032
##              [,21]         [,22]         [,23]         [,24]         [,25]
##  [1,] -0.055835074  0.000000e+00  0.000000e+00  1.225090e-16  0.000000e+00
##  [2,] -0.009716911  4.304169e-01  4.908485e-01  2.427171e-02  1.598752e-01
##  [3,]  0.009716911  4.304169e-01  4.908485e-01  2.427171e-02  1.598752e-01
##  [4,]  0.024069859 -1.624602e-01 -9.349496e-02 -1.462863e-01  2.297954e-01
##  [5,] -0.020759097 -1.223458e-01 -7.040935e-02 -1.101655e-01  1.730547e-01
##  [6,] -0.009246758 -1.482255e-01 -8.530298e-02 -1.334688e-01  2.096609e-01
##  [7,]  0.091112831 -1.249001e-16 -1.387779e-16 -2.628106e-16 -2.706169e-16
##  [8,] -0.044978707  4.163336e-17 -1.665335e-16  1.387779e-16 -8.326673e-17
##  [9,]  0.055246780  1.238864e-01 -2.159140e-01  3.779570e-01  2.646552e-01
## [10,] -0.055246780  1.238864e-01 -2.159140e-01  3.779570e-01  2.646552e-01
## [11,] -0.002668114  2.618657e-02  1.049778e-01  1.936265e-01 -4.934911e-01
## [12,]  0.131111145  5.062877e-03  2.029627e-02  3.743549e-02 -9.541093e-02
## [13,] -0.022684782  2.618191e-02  1.049591e-01  1.935920e-01 -4.934033e-01
## [14,]  0.004156898 -1.387779e-16 -5.551115e-17  1.179612e-16  5.551115e-17
## [15,] -0.039936533  9.881214e-02 -9.271353e-02  3.979858e-01 -4.350554e-02
## [16,]  0.039936533  9.881214e-02 -9.271353e-02  3.979858e-01 -4.350554e-02
## [17,] -0.783813396  8.326673e-17 -9.714451e-17  3.096481e-16  4.163336e-17
## [18,]  0.400630885 -1.655520e-01  1.348923e-01  2.214400e-02 -3.849901e-02
## [19,]  0.100211495 -3.708045e-01  3.021327e-01  4.959827e-02 -8.623034e-02
## [20,] -0.277298232 -3.731871e-01  3.040740e-01  4.991696e-02 -8.678440e-02
## [21,] -0.122531418 -1.066503e-01  2.241673e-01  1.351855e-03  2.229382e-01
## [22,] -0.043260285 -8.331128e-02  1.751112e-01  1.056019e-03  1.741511e-01
## [23,]  0.158659708 -7.542381e-02  1.585326e-01  9.560407e-04  1.576634e-01
## [24,]  0.133158564 -5.072120e-02  1.066104e-01  6.429208e-04  1.060259e-01
## [25,] -0.079020446 -2.592493e-02  5.449138e-02  3.286137e-04  5.419262e-02
## [26,] -0.001634511  5.599721e-02 -2.680530e-02 -6.713152e-02 -2.827909e-02
## [27,] -0.158154115  1.389760e-01 -6.652642e-02 -1.666096e-01 -7.018414e-02
## [28,]  0.005608392  2.874463e-01 -1.375976e-01 -3.446012e-01 -1.451629e-01
## [29,]  0.073239274  2.793453e-01 -1.337198e-01 -3.348895e-01 -1.410719e-01
##               [,26]         [,27]         [,28]         [,29]
##  [1,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##  [2,]  1.264714e-01 -4.810722e-02  1.599167e-01  6.145848e-02
##  [3,]  1.264714e-01 -4.810722e-02  1.599167e-01  6.145848e-02
##  [4,]  5.617425e-02  1.431376e-01  3.669298e-01  3.861601e-01
##  [5,]  4.230380e-02  1.077943e-01  2.763281e-01  2.908102e-01
##  [6,]  5.125229e-02  1.305960e-01  3.347796e-01  3.523250e-01
##  [7,]  3.330669e-16  3.469447e-17 -3.053113e-16 -5.377643e-17
##  [8,] -3.400058e-16 -1.023487e-16  9.714451e-17 -4.510281e-17
##  [9,] -3.174912e-01 -1.784900e-01  2.632447e-01 -1.522574e-01
## [10,] -3.174912e-01 -1.784900e-01  2.632447e-01 -1.522574e-01
## [11,] -2.606901e-01 -4.914851e-02  1.355194e-01  3.308188e-01
## [12,] -5.040148e-02 -9.502308e-03  2.620115e-02  6.396007e-02
## [13,] -2.606437e-01 -4.913976e-02  1.354953e-01  3.307599e-01
## [14,]  1.387779e-17 -1.040834e-16  5.551115e-17  1.283695e-16
## [15,]  2.321523e-01  5.165691e-01 -2.389006e-02 -6.767816e-03
## [16,]  2.321523e-01  5.165691e-01 -2.389006e-02 -6.767816e-03
## [17,] -1.838807e-16 -1.110223e-16 -1.387779e-17 -1.595946e-16
## [18,]  1.860791e-04  3.982187e-02  1.360638e-01 -1.497667e-01
## [19,]  4.167812e-04  8.919330e-02  3.047566e-01 -3.354484e-01
## [20,]  4.194592e-04  8.976640e-02  3.067148e-01 -3.376038e-01
## [21,] -4.081274e-01  2.509126e-01 -2.610841e-01  9.169874e-02
## [22,] -3.188141e-01  1.960037e-01 -2.039493e-01  7.163168e-02
## [23,] -2.886305e-01  1.774471e-01 -1.846404e-01  6.484997e-02
## [24,] -1.940990e-01  1.193301e-01 -1.241675e-01  4.361048e-02
## [25,] -9.920907e-02  6.099273e-02 -6.346526e-02  2.229046e-02
## [26,] -4.410188e-02  5.481539e-02  3.698347e-02 -4.057889e-02
## [27,] -1.094538e-01  1.360430e-01  9.178700e-02 -1.007103e-01
## [28,] -2.263849e-01  2.813797e-01  1.898445e-01 -2.083006e-01
## [29,] -2.200048e-01  2.734498e-01  1.844942e-01 -2.024302e-01
str(ei_df)
## List of 2
##  $ values : num [1:29] 5.05 2.5 2.3 1.93 1.75 ...
##  $ vectors: num [1:29, 1:29] -0.204 0.141 -0.141 -0.294 0.16 ...
##  - attr(*, "class")= chr "eigen"
ei_df$vectors[,1:21]
##                [,1]          [,2]         [,3]        [,4]        [,5]
##  [1,] -0.2040227737  1.833293e-01 -0.094779873 -0.13731378  0.10549430
##  [2,]  0.1413086011  5.394177e-01  0.039906990 -0.10547773  0.06696998
##  [3,] -0.1413086011 -5.394177e-01 -0.039906990  0.10547773 -0.06696998
##  [4,] -0.2942003135  2.461245e-02  0.124389538  0.04018123 -0.01260643
##  [5,]  0.1600597333  5.511576e-06 -0.032650698  0.05734463  0.17794918
##  [6,]  0.1903397143 -2.698063e-02 -0.109385163 -0.09137243 -0.13306274
##  [7,] -0.0935479157  1.204104e-01 -0.222487435  0.12005801 -0.05039087
##  [8,] -0.0849067274  2.803531e-01  0.029034968  0.08955058  0.10490839
##  [9,]  0.0554027651 -2.904020e-02  0.591801533  0.18297854 -0.01650000
## [10,] -0.0554027651  2.904020e-02 -0.591801533 -0.18297854  0.01650000
## [11,]  0.1736418581 -7.301015e-02  0.134795878 -0.54120573 -0.24450421
## [12,] -0.0625879247  1.361349e-01  0.006726700  0.07124934 -0.29057685
## [13,] -0.1615699524  4.669831e-02 -0.136120641  0.52752438  0.30073749
## [14,]  0.2642768993 -1.226385e-01 -0.027373082  0.15232430  0.06430739
## [15,]  0.2662813236 -5.344106e-02 -0.103829439  0.13787861 -0.23790773
## [16,] -0.2662813236  5.344106e-02  0.103829439 -0.13787861  0.23790773
## [17,] -0.3018129056  1.014582e-01  0.013299747  0.02003847 -0.23877351
## [18,] -0.1129508182  6.226086e-02 -0.230618976  0.14454194 -0.21883738
## [19,] -0.2404813692  1.326143e-01  0.214508548 -0.02652619 -0.20305279
## [20,]  0.2890528978 -1.593875e-01 -0.110832619 -0.03776438  0.29883633
## [21,]  0.1863277619  2.539686e-02  0.017259545  0.31810714 -0.22856684
## [22,] -0.0801730543 -5.412041e-02  0.100692516 -0.18736497  0.34911401
## [23,] -0.1440907534  5.350135e-02 -0.095050946 -0.17389422  0.13349020
## [24,] -0.0880689230  1.975906e-02 -0.062237127 -0.05200381 -0.25357146
## [25,]  0.0826325459 -1.248690e-01  0.003714119 -0.09886796 -0.07387878
## [26,]  0.0003292663  1.790613e-02 -0.071960874 -0.04474856 -0.02271613
## [27,] -0.1071281011 -3.217524e-03 -0.087323525  0.08306735 -0.24274831
## [28,]  0.2887953143  2.791867e-01  0.032234467  0.04175985  0.05647509
## [29,] -0.2439394038 -2.892717e-01  0.024699924 -0.07532716  0.06720959
##               [,6]          [,7]         [,8]         [,9]        [,10]
##  [1,] -0.213452492 -0.1616819119  0.023604430 -0.087715997  0.047030462
##  [2,]  0.011344399  0.0132664051  0.046115109 -0.110497288 -0.052789321
##  [3,] -0.011344399 -0.0132664051 -0.046115109  0.110497288  0.052789321
##  [4,]  0.251575980 -0.2184059522  0.067338655  0.070216542 -0.239407715
##  [5,]  0.018666681 -0.2587982109  0.080642858 -0.002363250  0.575654625
##  [6,] -0.291143340  0.4529931814 -0.140368316 -0.075009077 -0.212748155
##  [7,]  0.052792830 -0.1368246765 -0.088519611 -0.198823957 -0.170243940
##  [8,] -0.049599567 -0.1036958509 -0.249146156 -0.378954437  0.034805948
##  [9,] -0.068522160 -0.1133872283 -0.113186317 -0.022985625 -0.067737669
## [10,]  0.068522160  0.1133872283  0.113186317  0.022985625  0.067737669
## [11,]  0.174292551 -0.0922261543  0.011106655 -0.108859690  0.058035125
## [12,]  0.022878704 -0.1374788674 -0.408638140  0.113189843  0.151685506
## [13,] -0.178747710  0.1188272919  0.067910999  0.086991199 -0.087377357
## [14,]  0.087127349 -0.0874879334 -0.015641141 -0.139934444 -0.164750061
## [15,] -0.361710871 -0.2362339223  0.163201029 -0.157374818 -0.056636761
## [16,]  0.361710871  0.2362339223 -0.163201029  0.157374818  0.056636761
## [17,]  0.001229302  0.0009125418 -0.162923697 -0.080997686 -0.050272836
## [18,]  0.152638210  0.0091306367 -0.371342645 -0.005725705  0.003227085
## [19,] -0.201743027  0.2071383509  0.402781053  0.018519962  0.046593168
## [20,]  0.132742164 -0.2098663948 -0.235475750 -0.015861704 -0.047727286
## [21,]  0.410141531  0.2145056094  0.160399335 -0.179411132  0.115343125
## [22,] -0.343597404  0.1291405671 -0.216397262  0.116144689  0.196811267
## [23,]  0.006067372 -0.4103179475  0.227637787  0.040740958 -0.367790016
## [24,] -0.264070192 -0.0683355585 -0.215395509  0.004954165  0.202539387
## [25,] -0.084082911  0.0300039738 -0.205304036  0.236604972 -0.433208186
## [26,]  0.112158906  0.2015978151  0.170508454  0.064865516  0.047711035
## [27,] -0.036225835 -0.2630957285  0.114121348  0.418293722  0.140089959
## [28,]  0.011523649  0.0538623012 -0.080493188  0.306678462 -0.080124643
## [29,] -0.016318483  0.0350554293 -0.008128518 -0.536678636  0.003188517
##               [,11]       [,12]        [,13]       [,14]        [,15]
##  [1,] -0.0514849619  0.25344279 -0.071981397 -0.38373324  0.214537681
##  [2,] -0.0333275463  0.03966658 -0.009422092  0.15754580 -0.132702943
##  [3,]  0.0333275463 -0.03966658  0.009422092 -0.15754580  0.132702943
##  [4,] -0.0918142199  0.11585403  0.289722514 -0.11751257  0.145408546
##  [5,]  0.1913643070 -0.17791385 -0.295940515  0.28920729  0.175698766
##  [6,] -0.0573211994  0.01987070 -0.073275472 -0.10991485 -0.304395008
##  [7,]  0.0809428399  0.17893968 -0.558762233 -0.16658326  0.133346757
##  [8,] -0.1344075528 -0.11996991 -0.099533027 -0.17877867  0.014999030
##  [9,] -0.0632291693  0.10386450 -0.074834503  0.01981791  0.011788835
## [10,]  0.0632291693 -0.10386450  0.074834503 -0.01981791 -0.011788835
## [11,]  0.0892992035  0.07682543 -0.022307302 -0.07800898  0.058828058
## [12,] -0.1707762492 -0.02107557 -0.019284283  0.15737992 -0.297537103
## [13,] -0.0562915673 -0.07276366  0.026040336  0.04758982 -0.001302855
## [14,]  0.0637699306 -0.19026526 -0.053317476  0.18721599 -0.268279182
## [15,]  0.0179033287  0.17592834  0.159850407  0.06848620  0.108328418
## [16,] -0.0179033287 -0.17592834 -0.159850407 -0.06848620 -0.108328418
## [17,]  0.1745509851 -0.06613846  0.118376039  0.28116180  0.070437461
## [18,]  0.3009325805  0.27827997  0.183468090  0.24924191  0.120746443
## [19,]  0.1323699612 -0.16649098 -0.146523898  0.02438587  0.004764372
## [20,] -0.2650235573  0.04197840  0.064198930 -0.13479804 -0.058299086
## [21,]  0.0266367570  0.08414051 -0.035901178 -0.21531961  0.019033132
## [22,]  0.1784967480  0.38220875  0.061019476  0.10245025 -0.065584851
## [23,] -0.0002590246 -0.16793850  0.062979887  0.21493774 -0.190248341
## [24,] -0.3768853491 -0.48613501  0.120805688 -0.12340610  0.181065955
## [25,]  0.0549279697 -0.13469650 -0.467979249  0.17267244  0.331705364
## [26,] -0.6682429521  0.31460817 -0.096985482  0.42619396  0.258192820
## [27,] -0.0747633927  0.20786286 -0.261684675 -0.15940947 -0.454059167
## [28,]  0.1747356815 -0.13999203  0.183500247 -0.14317523  0.276835144
## [29,] -0.0086525893 -0.02242725 -0.039190364  0.14120018 -0.110723026
##              [,16]        [,17]       [,18]        [,19]        [,20]
##  [1,] -0.307043040  0.111102346  0.10755715 -0.119481022 -0.634592582
##  [2,]  0.139797586 -0.262204268  0.11173744 -0.071863869 -0.056457644
##  [3,] -0.139797586  0.262204268 -0.11173744  0.071863869  0.056457644
##  [4,]  0.387590877  0.004321684  0.03778667  0.081334668 -0.038321037
##  [5,] -0.145496306 -0.007070978  0.04754370 -0.050109424 -0.003776925
##  [6,] -0.304719637  0.001099695 -0.08065819 -0.047785079  0.045118642
##  [7,]  0.209913718 -0.024729046 -0.51745690 -0.228395658  0.207888802
##  [8,] -0.088998760  0.311178051  0.17199235  0.614251536  0.293268601
##  [9,] -0.175395522 -0.107830779 -0.08751724 -0.050378871 -0.008164833
## [10,]  0.175395522  0.107830779  0.08751724  0.050378871  0.008164833
## [11,] -0.015393105 -0.003828996 -0.09130157  0.123980608  0.025264759
## [12,]  0.160146237  0.526368999  0.16138195 -0.389205946 -0.081912822
## [13,] -0.015572132 -0.097955937  0.06011090 -0.048740716 -0.009429520
## [14,]  0.191364311  0.132133530 -0.34932228  0.350327215 -0.613049818
## [15,]  0.098331517  0.054993699  0.07387301 -0.001615882  0.074831716
## [16,] -0.098331517 -0.054993699 -0.07387301  0.001615882 -0.074831716
## [17,] -0.130630351 -0.017275377 -0.19641737 -0.044658121 -0.014852663
## [18,] -0.257749536 -0.236779204  0.01835022  0.205780747 -0.076968959
## [19,]  0.115899019  0.198654712 -0.03992670  0.056407624 -0.034717513
## [20,] -0.000817087 -0.092347221  0.03153132 -0.147335259  0.068640573
## [21,] -0.040677403  0.033486402  0.15915730 -0.067902495 -0.067031797
## [22,]  0.307156634  0.082632208 -0.14084179  0.119478608  0.037244599
## [23,] -0.400493663  0.129741079 -0.08790932 -0.081575166  0.148382938
## [24,]  0.107273784 -0.376830513 -0.21119126  0.022227162 -0.086051127
## [25,]  0.135558575 -0.043502495  0.46680415  0.089227879 -0.107268300
## [26,] -0.109837958  0.142040654 -0.17122845  0.143891587 -0.023977683
## [27,] -0.020303257 -0.250400560  0.06807129  0.298951277  0.021055632
## [28,] -0.055709181  0.217044473 -0.19542102 -0.023151988  0.039387046
## [29,]  0.089443704 -0.127236152  0.20154651 -0.153751053 -0.046198032
##              [,21]
##  [1,] -0.055835074
##  [2,] -0.009716911
##  [3,]  0.009716911
##  [4,]  0.024069859
##  [5,] -0.020759097
##  [6,] -0.009246758
##  [7,]  0.091112831
##  [8,] -0.044978707
##  [9,]  0.055246780
## [10,] -0.055246780
## [11,] -0.002668114
## [12,]  0.131111145
## [13,] -0.022684782
## [14,]  0.004156898
## [15,] -0.039936533
## [16,]  0.039936533
## [17,] -0.783813396
## [18,]  0.400630885
## [19,]  0.100211495
## [20,] -0.277298232
## [21,] -0.122531418
## [22,] -0.043260285
## [23,]  0.158659708
## [24,]  0.133158564
## [25,] -0.079020446
## [26,] -0.001634511
## [27,] -0.158154115
## [28,]  0.005608392
## [29,]  0.073239274
p <- ei_df$vectors[,1:21]
p <- -p
p
##                [,1]          [,2]         [,3]        [,4]        [,5]
##  [1,]  0.2040227737 -1.833293e-01  0.094779873  0.13731378 -0.10549430
##  [2,] -0.1413086011 -5.394177e-01 -0.039906990  0.10547773 -0.06696998
##  [3,]  0.1413086011  5.394177e-01  0.039906990 -0.10547773  0.06696998
##  [4,]  0.2942003135 -2.461245e-02 -0.124389538 -0.04018123  0.01260643
##  [5,] -0.1600597333 -5.511576e-06  0.032650698 -0.05734463 -0.17794918
##  [6,] -0.1903397143  2.698063e-02  0.109385163  0.09137243  0.13306274
##  [7,]  0.0935479157 -1.204104e-01  0.222487435 -0.12005801  0.05039087
##  [8,]  0.0849067274 -2.803531e-01 -0.029034968 -0.08955058 -0.10490839
##  [9,] -0.0554027651  2.904020e-02 -0.591801533 -0.18297854  0.01650000
## [10,]  0.0554027651 -2.904020e-02  0.591801533  0.18297854 -0.01650000
## [11,] -0.1736418581  7.301015e-02 -0.134795878  0.54120573  0.24450421
## [12,]  0.0625879247 -1.361349e-01 -0.006726700 -0.07124934  0.29057685
## [13,]  0.1615699524 -4.669831e-02  0.136120641 -0.52752438 -0.30073749
## [14,] -0.2642768993  1.226385e-01  0.027373082 -0.15232430 -0.06430739
## [15,] -0.2662813236  5.344106e-02  0.103829439 -0.13787861  0.23790773
## [16,]  0.2662813236 -5.344106e-02 -0.103829439  0.13787861 -0.23790773
## [17,]  0.3018129056 -1.014582e-01 -0.013299747 -0.02003847  0.23877351
## [18,]  0.1129508182 -6.226086e-02  0.230618976 -0.14454194  0.21883738
## [19,]  0.2404813692 -1.326143e-01 -0.214508548  0.02652619  0.20305279
## [20,] -0.2890528978  1.593875e-01  0.110832619  0.03776438 -0.29883633
## [21,] -0.1863277619 -2.539686e-02 -0.017259545 -0.31810714  0.22856684
## [22,]  0.0801730543  5.412041e-02 -0.100692516  0.18736497 -0.34911401
## [23,]  0.1440907534 -5.350135e-02  0.095050946  0.17389422 -0.13349020
## [24,]  0.0880689230 -1.975906e-02  0.062237127  0.05200381  0.25357146
## [25,] -0.0826325459  1.248690e-01 -0.003714119  0.09886796  0.07387878
## [26,] -0.0003292663 -1.790613e-02  0.071960874  0.04474856  0.02271613
## [27,]  0.1071281011  3.217524e-03  0.087323525 -0.08306735  0.24274831
## [28,] -0.2887953143 -2.791867e-01 -0.032234467 -0.04175985 -0.05647509
## [29,]  0.2439394038  2.892717e-01 -0.024699924  0.07532716 -0.06720959
##               [,6]          [,7]         [,8]         [,9]        [,10]
##  [1,]  0.213452492  0.1616819119 -0.023604430  0.087715997 -0.047030462
##  [2,] -0.011344399 -0.0132664051 -0.046115109  0.110497288  0.052789321
##  [3,]  0.011344399  0.0132664051  0.046115109 -0.110497288 -0.052789321
##  [4,] -0.251575980  0.2184059522 -0.067338655 -0.070216542  0.239407715
##  [5,] -0.018666681  0.2587982109 -0.080642858  0.002363250 -0.575654625
##  [6,]  0.291143340 -0.4529931814  0.140368316  0.075009077  0.212748155
##  [7,] -0.052792830  0.1368246765  0.088519611  0.198823957  0.170243940
##  [8,]  0.049599567  0.1036958509  0.249146156  0.378954437 -0.034805948
##  [9,]  0.068522160  0.1133872283  0.113186317  0.022985625  0.067737669
## [10,] -0.068522160 -0.1133872283 -0.113186317 -0.022985625 -0.067737669
## [11,] -0.174292551  0.0922261543 -0.011106655  0.108859690 -0.058035125
## [12,] -0.022878704  0.1374788674  0.408638140 -0.113189843 -0.151685506
## [13,]  0.178747710 -0.1188272919 -0.067910999 -0.086991199  0.087377357
## [14,] -0.087127349  0.0874879334  0.015641141  0.139934444  0.164750061
## [15,]  0.361710871  0.2362339223 -0.163201029  0.157374818  0.056636761
## [16,] -0.361710871 -0.2362339223  0.163201029 -0.157374818 -0.056636761
## [17,] -0.001229302 -0.0009125418  0.162923697  0.080997686  0.050272836
## [18,] -0.152638210 -0.0091306367  0.371342645  0.005725705 -0.003227085
## [19,]  0.201743027 -0.2071383509 -0.402781053 -0.018519962 -0.046593168
## [20,] -0.132742164  0.2098663948  0.235475750  0.015861704  0.047727286
## [21,] -0.410141531 -0.2145056094 -0.160399335  0.179411132 -0.115343125
## [22,]  0.343597404 -0.1291405671  0.216397262 -0.116144689 -0.196811267
## [23,] -0.006067372  0.4103179475 -0.227637787 -0.040740958  0.367790016
## [24,]  0.264070192  0.0683355585  0.215395509 -0.004954165 -0.202539387
## [25,]  0.084082911 -0.0300039738  0.205304036 -0.236604972  0.433208186
## [26,] -0.112158906 -0.2015978151 -0.170508454 -0.064865516 -0.047711035
## [27,]  0.036225835  0.2630957285 -0.114121348 -0.418293722 -0.140089959
## [28,] -0.011523649 -0.0538623012  0.080493188 -0.306678462  0.080124643
## [29,]  0.016318483 -0.0350554293  0.008128518  0.536678636 -0.003188517
##               [,11]       [,12]        [,13]       [,14]        [,15]
##  [1,]  0.0514849619 -0.25344279  0.071981397  0.38373324 -0.214537681
##  [2,]  0.0333275463 -0.03966658  0.009422092 -0.15754580  0.132702943
##  [3,] -0.0333275463  0.03966658 -0.009422092  0.15754580 -0.132702943
##  [4,]  0.0918142199 -0.11585403 -0.289722514  0.11751257 -0.145408546
##  [5,] -0.1913643070  0.17791385  0.295940515 -0.28920729 -0.175698766
##  [6,]  0.0573211994 -0.01987070  0.073275472  0.10991485  0.304395008
##  [7,] -0.0809428399 -0.17893968  0.558762233  0.16658326 -0.133346757
##  [8,]  0.1344075528  0.11996991  0.099533027  0.17877867 -0.014999030
##  [9,]  0.0632291693 -0.10386450  0.074834503 -0.01981791 -0.011788835
## [10,] -0.0632291693  0.10386450 -0.074834503  0.01981791  0.011788835
## [11,] -0.0892992035 -0.07682543  0.022307302  0.07800898 -0.058828058
## [12,]  0.1707762492  0.02107557  0.019284283 -0.15737992  0.297537103
## [13,]  0.0562915673  0.07276366 -0.026040336 -0.04758982  0.001302855
## [14,] -0.0637699306  0.19026526  0.053317476 -0.18721599  0.268279182
## [15,] -0.0179033287 -0.17592834 -0.159850407 -0.06848620 -0.108328418
## [16,]  0.0179033287  0.17592834  0.159850407  0.06848620  0.108328418
## [17,] -0.1745509851  0.06613846 -0.118376039 -0.28116180 -0.070437461
## [18,] -0.3009325805 -0.27827997 -0.183468090 -0.24924191 -0.120746443
## [19,] -0.1323699612  0.16649098  0.146523898 -0.02438587 -0.004764372
## [20,]  0.2650235573 -0.04197840 -0.064198930  0.13479804  0.058299086
## [21,] -0.0266367570 -0.08414051  0.035901178  0.21531961 -0.019033132
## [22,] -0.1784967480 -0.38220875 -0.061019476 -0.10245025  0.065584851
## [23,]  0.0002590246  0.16793850 -0.062979887 -0.21493774  0.190248341
## [24,]  0.3768853491  0.48613501 -0.120805688  0.12340610 -0.181065955
## [25,] -0.0549279697  0.13469650  0.467979249 -0.17267244 -0.331705364
## [26,]  0.6682429521 -0.31460817  0.096985482 -0.42619396 -0.258192820
## [27,]  0.0747633927 -0.20786286  0.261684675  0.15940947  0.454059167
## [28,] -0.1747356815  0.13999203 -0.183500247  0.14317523 -0.276835144
## [29,]  0.0086525893  0.02242725  0.039190364 -0.14120018  0.110723026
##              [,16]        [,17]       [,18]        [,19]        [,20]
##  [1,]  0.307043040 -0.111102346 -0.10755715  0.119481022  0.634592582
##  [2,] -0.139797586  0.262204268 -0.11173744  0.071863869  0.056457644
##  [3,]  0.139797586 -0.262204268  0.11173744 -0.071863869 -0.056457644
##  [4,] -0.387590877 -0.004321684 -0.03778667 -0.081334668  0.038321037
##  [5,]  0.145496306  0.007070978 -0.04754370  0.050109424  0.003776925
##  [6,]  0.304719637 -0.001099695  0.08065819  0.047785079 -0.045118642
##  [7,] -0.209913718  0.024729046  0.51745690  0.228395658 -0.207888802
##  [8,]  0.088998760 -0.311178051 -0.17199235 -0.614251536 -0.293268601
##  [9,]  0.175395522  0.107830779  0.08751724  0.050378871  0.008164833
## [10,] -0.175395522 -0.107830779 -0.08751724 -0.050378871 -0.008164833
## [11,]  0.015393105  0.003828996  0.09130157 -0.123980608 -0.025264759
## [12,] -0.160146237 -0.526368999 -0.16138195  0.389205946  0.081912822
## [13,]  0.015572132  0.097955937 -0.06011090  0.048740716  0.009429520
## [14,] -0.191364311 -0.132133530  0.34932228 -0.350327215  0.613049818
## [15,] -0.098331517 -0.054993699 -0.07387301  0.001615882 -0.074831716
## [16,]  0.098331517  0.054993699  0.07387301 -0.001615882  0.074831716
## [17,]  0.130630351  0.017275377  0.19641737  0.044658121  0.014852663
## [18,]  0.257749536  0.236779204 -0.01835022 -0.205780747  0.076968959
## [19,] -0.115899019 -0.198654712  0.03992670 -0.056407624  0.034717513
## [20,]  0.000817087  0.092347221 -0.03153132  0.147335259 -0.068640573
## [21,]  0.040677403 -0.033486402 -0.15915730  0.067902495  0.067031797
## [22,] -0.307156634 -0.082632208  0.14084179 -0.119478608 -0.037244599
## [23,]  0.400493663 -0.129741079  0.08790932  0.081575166 -0.148382938
## [24,] -0.107273784  0.376830513  0.21119126 -0.022227162  0.086051127
## [25,] -0.135558575  0.043502495 -0.46680415 -0.089227879  0.107268300
## [26,]  0.109837958 -0.142040654  0.17122845 -0.143891587  0.023977683
## [27,]  0.020303257  0.250400560 -0.06807129 -0.298951277 -0.021055632
## [28,]  0.055709181 -0.217044473  0.19542102  0.023151988 -0.039387046
## [29,] -0.089443704  0.127236152 -0.20154651  0.153751053  0.046198032
##              [,21]
##  [1,]  0.055835074
##  [2,]  0.009716911
##  [3,] -0.009716911
##  [4,] -0.024069859
##  [5,]  0.020759097
##  [6,]  0.009246758
##  [7,] -0.091112831
##  [8,]  0.044978707
##  [9,] -0.055246780
## [10,]  0.055246780
## [11,]  0.002668114
## [12,] -0.131111145
## [13,]  0.022684782
## [14,] -0.004156898
## [15,]  0.039936533
## [16,] -0.039936533
## [17,]  0.783813396
## [18,] -0.400630885
## [19,] -0.100211495
## [20,]  0.277298232
## [21,]  0.122531418
## [22,]  0.043260285
## [23,] -0.158659708
## [24,] -0.133158564
## [25,]  0.079020446
## [26,]  0.001634511
## [27,]  0.158154115
## [28,] -0.005608392
## [29,] -0.073239274
# explainable variance ratio
avo <- ei_df$values / sum(ei_df$values)
round(avo, 4)
##  [1] 0.1740 0.0861 0.0792 0.0665 0.0602 0.0590 0.0558 0.0488 0.0452 0.0429
## [11] 0.0357 0.0349 0.0341 0.0311 0.0284 0.0257 0.0250 0.0215 0.0211 0.0133
## [21] 0.0113 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
sum(avo[1:31])
## [1] NA

7.0 Supervised Learning

7.1 Logistic Regression

# Creating full(including all varibales) and null(intercept only) models for d_1
logit_full = glm(heart_disease~., data = data, family = binomial)
logit_null = glm(heart_disease~1, data = data, family = binomial)


# Forward selection procedure
forward = step(logit_null, scope = list(lower = logit_null, upper = logit_full), direction = "forward")
## Start:  AIC=288.3
## heart_disease ~ 1
## 
##            Df Deviance    AIC
## + thal      3   227.45 235.45
## + ca        4   231.45 241.45
## + oldpeak   1   243.02 247.02
## + cp        2   243.35 249.35
## + thalach   1   251.58 255.58
## + slope     2   255.91 261.91
## + exang     1   262.64 266.64
## + sex       1   269.75 273.75
## + age       1   271.15 275.15
## + restecg   2   274.01 280.01
## + chol      1   278.52 282.52
## + trestbps  1   283.38 287.38
## <none>          286.30 288.30
## + fbs       1   286.14 290.14
## 
## Step:  AIC=235.45
## heart_disease ~ thal
## 
##            Df Deviance    AIC
## + ca        4   186.85 202.85
## + oldpeak   1   204.19 214.19
## + thalach   1   204.24 214.24
## + cp        2   202.45 214.45
## + slope     2   209.78 221.78
## + exang     1   215.21 225.21
## + age       1   218.07 228.07
## + chol      1   218.62 228.62
## + restecg   2   216.67 228.67
## <none>          227.45 235.45
## + sex       1   226.23 236.23
## + trestbps  1   226.51 236.51
## + fbs       1   227.44 237.44
## 
## Step:  AIC=202.85
## heart_disease ~ thal + ca
## 
##            Df Deviance    AIC
## + cp        2   167.33 187.33
## + thalach   1   171.44 189.44
## + oldpeak   1   171.65 189.65
## + slope     2   170.20 190.20
## + exang     1   175.59 193.59
## + restecg   2   175.68 195.68
## + chol      1   180.80 198.80
## + sex       1   183.78 201.78
## <none>          186.85 202.85
## + trestbps  1   185.31 203.31
## + fbs       1   185.96 203.96
## + age       1   186.11 204.11
## 
## Step:  AIC=187.33
## heart_disease ~ thal + ca + cp
## 
##            Df Deviance    AIC
## + thalach   1   159.02 181.02
## + slope     2   157.43 181.43
## + oldpeak   1   160.43 182.43
## + chol      1   160.98 182.98
## + restecg   2   159.78 183.78
## + exang     1   163.06 185.06
## <none>          167.33 187.33
## + trestbps  1   165.95 187.95
## + sex       1   166.01 188.01
## + age       1   166.64 188.64
## + fbs       1   167.11 189.11
## 
## Step:  AIC=181.02
## heart_disease ~ thal + ca + cp + thalach
## 
##            Df Deviance    AIC
## + chol      1   153.44 177.44
## + restecg   2   153.23 179.23
## + oldpeak   1   155.23 179.23
## + slope     2   153.34 179.34
## + exang     1   156.69 180.69
## + sex       1   156.97 180.97
## <none>          159.02 181.02
## + trestbps  1   157.12 181.12
## + fbs       1   158.74 182.74
## + age       1   158.97 182.97
## 
## Step:  AIC=177.44
## heart_disease ~ thal + ca + cp + thalach + chol
## 
##            Df Deviance    AIC
## + sex       1   147.64 173.64
## + oldpeak   1   150.28 176.28
## + slope     2   148.72 176.72
## <none>          153.44 177.44
## + restecg   2   149.60 177.60
## + exang     1   151.80 177.80
## + trestbps  1   152.56 178.56
## + age       1   152.69 178.69
## + fbs       1   153.26 179.26
## 
## Step:  AIC=173.64
## heart_disease ~ thal + ca + cp + thalach + chol + sex
## 
##            Df Deviance    AIC
## + oldpeak   1   144.55 172.55
## + slope     2   142.61 172.61
## + exang     1   145.36 173.36
## <none>          147.64 173.64
## + restecg   2   143.90 173.90
## + trestbps  1   146.31 174.31
## + age       1   147.07 175.07
## + fbs       1   147.53 175.53
## 
## Step:  AIC=172.55
## heart_disease ~ thal + ca + cp + thalach + chol + sex + oldpeak
## 
##            Df Deviance    AIC
## <none>          144.55 172.55
## + slope     2   141.03 173.03
## + exang     1   143.15 173.15
## + trestbps  1   143.63 173.63
## + age       1   144.16 174.16
## + restecg   2   142.16 174.16
## + fbs       1   144.27 174.27
# Backward selection procedure
backward = step(logit_full, scope = list(lower = logit_null, upper = logit_full), direction = "backward")
## Start:  AIC=179.49
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + restecg + 
##     thalach + exang + oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - restecg   2   137.03 177.03
## - oldpeak   1   135.76 177.76
## - fbs       1   136.03 178.03
## - age       1   136.26 178.26
## - trestbps  1   136.85 178.85
## - slope     2   138.93 178.93
## - exang     1   137.09 179.09
## - cp        2   139.42 179.42
## <none>          135.49 179.49
## - thalach   1   139.61 181.61
## - chol      1   140.91 182.91
## - sex       1   141.86 183.86
## - thal      3   150.43 188.43
## - ca        4   162.40 198.40
## 
## Step:  AIC=177.03
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + thalach + 
##     exang + oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - oldpeak   1   137.58 175.58
## - fbs       1   137.61 175.61
## - age       1   137.96 175.96
## - exang     1   138.60 176.60
## - slope     2   140.61 176.61
## - cp        2   140.93 176.93
## <none>          137.03 177.03
## - trestbps  1   139.06 177.06
## - thalach   1   141.62 179.62
## - chol      1   143.73 181.73
## - sex       1   143.85 181.85
## - thal      3   151.11 185.11
## - ca        4   164.23 196.23
## 
## Step:  AIC=175.58
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + thalach + 
##     exang + slope + ca + thal
## 
##            Df Deviance    AIC
## - fbs       1   138.28 174.28
## - age       1   138.76 174.76
## <none>          137.58 175.58
## - exang     1   139.62 175.62
## - cp        2   141.72 175.72
## - trestbps  1   139.85 175.85
## - slope     2   142.28 176.28
## - thalach   1   142.63 178.63
## - chol      1   144.68 180.68
## - sex       1   144.91 180.91
## - thal      3   151.98 183.98
## - ca        4   169.03 199.03
## 
## Step:  AIC=174.28
## heart_disease ~ age + sex + cp + trestbps + chol + thalach + 
##     exang + slope + ca + thal
## 
##            Df Deviance    AIC
## - age       1   139.43 173.43
## - exang     1   140.00 174.00
## - trestbps  1   140.15 174.15
## <none>          138.28 174.28
## - slope     2   142.93 174.93
## - cp        2   143.14 175.14
## - thalach   1   143.33 177.33
## - sex       1   145.41 179.41
## - chol      1   145.69 179.69
## - thal      3   152.38 182.38
## - ca        4   169.15 197.15
## 
## Step:  AIC=173.43
## heart_disease ~ sex + cp + trestbps + chol + thalach + exang + 
##     slope + ca + thal
## 
##            Df Deviance    AIC
## - trestbps  1   140.80 172.80
## - exang     1   141.22 173.22
## <none>          139.43 173.43
## - slope     2   144.00 174.00
## - cp        2   144.77 174.77
## - thalach   1   143.45 175.45
## - chol      1   146.02 178.02
## - sex       1   146.65 178.65
## - thal      3   154.14 182.14
## - ca        4   169.43 195.43
## 
## Step:  AIC=172.8
## heart_disease ~ sex + cp + chol + thalach + exang + slope + ca + 
##     thal
## 
##           Df Deviance    AIC
## - exang    1   142.61 172.61
## <none>         140.80 172.80
## - slope    2   145.36 173.36
## - cp       2   146.37 174.37
## - thalach  1   144.53 174.53
## - sex      1   147.37 177.37
## - chol     1   148.42 178.42
## - thal     3   156.11 182.11
## - ca       4   170.18 194.18
## 
## Step:  AIC=172.61
## heart_disease ~ sex + cp + chol + thalach + slope + ca + thal
## 
##           Df Deviance    AIC
## <none>         142.61 172.61
## - slope    2   147.64 173.64
## - thalach  1   147.53 175.53
## - cp       2   150.23 176.23
## - sex      1   148.72 176.72
## - chol     1   150.41 178.41
## - thal     3   159.14 183.14
## - ca       4   171.37 193.37
# Both 
both = step(logit_full, scope = list(lower = logit_null, upper = logit_full), direction = "both")
## Start:  AIC=179.49
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + restecg + 
##     thalach + exang + oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - restecg   2   137.03 177.03
## - oldpeak   1   135.76 177.76
## - fbs       1   136.03 178.03
## - age       1   136.26 178.26
## - trestbps  1   136.85 178.85
## - slope     2   138.93 178.93
## - exang     1   137.09 179.09
## - cp        2   139.42 179.42
## <none>          135.49 179.49
## - thalach   1   139.61 181.61
## - chol      1   140.91 182.91
## - sex       1   141.86 183.86
## - thal      3   150.43 188.43
## - ca        4   162.40 198.40
## 
## Step:  AIC=177.03
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + thalach + 
##     exang + oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - oldpeak   1   137.58 175.58
## - fbs       1   137.61 175.61
## - age       1   137.96 175.96
## - exang     1   138.60 176.60
## - slope     2   140.61 176.61
## - cp        2   140.93 176.93
## <none>          137.03 177.03
## - trestbps  1   139.06 177.06
## + restecg   2   135.49 179.49
## - thalach   1   141.62 179.62
## - chol      1   143.73 181.73
## - sex       1   143.85 181.85
## - thal      3   151.11 185.11
## - ca        4   164.23 196.23
## 
## Step:  AIC=175.58
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + thalach + 
##     exang + slope + ca + thal
## 
##            Df Deviance    AIC
## - fbs       1   138.28 174.28
## - age       1   138.76 174.76
## <none>          137.58 175.58
## - exang     1   139.62 175.62
## - cp        2   141.72 175.72
## - trestbps  1   139.85 175.85
## - slope     2   142.28 176.28
## + oldpeak   1   137.03 177.03
## + restecg   2   135.76 177.76
## - thalach   1   142.63 178.63
## - chol      1   144.68 180.68
## - sex       1   144.91 180.91
## - thal      3   151.98 183.98
## - ca        4   169.03 199.03
## 
## Step:  AIC=174.28
## heart_disease ~ age + sex + cp + trestbps + chol + thalach + 
##     exang + slope + ca + thal
## 
##            Df Deviance    AIC
## - age       1   139.43 173.43
## - exang     1   140.00 174.00
## - trestbps  1   140.15 174.15
## <none>          138.28 174.28
## - slope     2   142.93 174.93
## - cp        2   143.14 175.14
## + fbs       1   137.58 175.58
## + oldpeak   1   137.61 175.61
## + restecg   2   136.34 176.34
## - thalach   1   143.33 177.33
## - sex       1   145.41 179.41
## - chol      1   145.69 179.69
## - thal      3   152.38 182.38
## - ca        4   169.15 197.15
## 
## Step:  AIC=173.43
## heart_disease ~ sex + cp + trestbps + chol + thalach + exang + 
##     slope + ca + thal
## 
##            Df Deviance    AIC
## - trestbps  1   140.80 172.80
## - exang     1   141.22 173.22
## <none>          139.43 173.43
## - slope     2   144.00 174.00
## + age       1   138.28 174.28
## + oldpeak   1   138.52 174.52
## + fbs       1   138.76 174.76
## - cp        2   144.77 174.77
## + restecg   2   137.20 175.20
## - thalach   1   143.45 175.45
## - chol      1   146.02 178.02
## - sex       1   146.65 178.65
## - thal      3   154.14 182.14
## - ca        4   169.43 195.43
## 
## Step:  AIC=172.8
## heart_disease ~ sex + cp + chol + thalach + exang + slope + ca + 
##     thal
## 
##            Df Deviance    AIC
## - exang     1   142.61 172.61
## <none>          140.80 172.80
## - slope     2   145.36 173.36
## + trestbps  1   139.43 173.43
## + oldpeak   1   139.76 173.76
## + restecg   2   137.98 173.98
## + age       1   140.15 174.15
## - cp        2   146.37 174.37
## + fbs       1   140.46 174.46
## - thalach   1   144.53 174.53
## - sex       1   147.37 177.37
## - chol      1   148.42 178.42
## - thal      3   156.11 182.11
## - ca        4   170.18 194.18
## 
## Step:  AIC=172.61
## heart_disease ~ sex + cp + chol + thalach + slope + ca + thal
## 
##            Df Deviance    AIC
## <none>          142.61 172.61
## + exang     1   140.80 172.80
## + oldpeak   1   141.03 173.03
## + trestbps  1   141.22 173.22
## + restecg   2   139.57 173.57
## - slope     2   147.64 173.64
## + age       1   141.91 173.91
## + fbs       1   142.47 174.47
## - thalach   1   147.53 175.53
## - cp        2   150.23 176.23
## - sex       1   148.72 176.72
## - chol      1   150.41 178.41
## - thal      3   159.14 183.14
## - ca        4   171.37 193.37
forward$formula
## heart_disease ~ thal + ca + cp + thalach + chol + sex + oldpeak

Logistic model

logistic_model = glm(forward$formula, data=data, family = binomial)
summary(logistic_model)
## 
## Call:
## glm(formula = forward$formula, family = binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3951  -0.4321   0.1993   0.4786   2.5288  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.828e-02  2.429e+00  -0.008 0.993996    
## thal1               2.788e+00  1.843e+00   1.513 0.130229    
## thal2               2.772e+00  1.677e+00   1.653 0.098311 .  
## thal3               1.006e+00  1.685e+00   0.597 0.550386    
## ca1                -1.375e+00  5.520e-01  -2.491 0.012729 *  
## ca2                -2.773e+00  7.354e-01  -3.771 0.000163 ***
## ca3                -2.617e+00  1.090e+00  -2.401 0.016329 *  
## ca4                 1.256e+01  1.370e+03   0.009 0.992683    
## cpatypical angina   1.061e+00  6.670e-01   1.591 0.111692    
## cpnon-anginal pain  1.282e+00  5.515e-01   2.325 0.020073 *  
## thalach             2.490e-02  1.076e-02   2.314 0.020651 *  
## chol               -1.457e-02  5.227e-03  -2.788 0.005300 ** 
## sexmale            -1.470e+00  6.368e-01  -2.308 0.020996 *  
## oldpeak            -4.030e-01  2.325e-01  -1.733 0.083100 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 286.30  on 211  degrees of freedom
## Residual deviance: 144.55  on 198  degrees of freedom
## AIC: 172.55
## 
## Number of Fisher Scoring iterations: 15
backward$formula
## heart_disease ~ sex + cp + chol + thalach + slope + ca + thal
both$formula
## heart_disease ~ sex + cp + chol + thalach + slope + ca + thal

Symmetric Cost Logistic Regression

# predicting on train set
predict_logit_train <- predict(logistic_model, type="response")


costfunc = function(obs, pred.p,pcut){
  weight1 = 1   # define the weight for "true=1 but pred=0" (FN)
  weight0 = 1    # define the weight for "true=0 but pred=1" (FP)
  c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
  c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
  cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
  return(cost) # you have to return to a value when you write R functions
} 

# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01) 


cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
  cost[i] = costfunc(obs = data$heart_disease, pred.p = predict_logit_train, pcut = p.seq[i])  
} # end of the loop

optimal.pcut.sym = p.seq[which(cost==min(cost))][1]

optimal.pcut.sym  ##0.42
## [1] 0.42
plot(p.seq, cost)

#predict the response
pred_train <- predict(logistic_model, type = "response")
#convert to binary based on optimal pcut for sym
pred.train.sym <- as.numeric(pred_train > optimal.pcut.sym)

confusion_matrix_train_sym <- table(data$heart_disease, pred.train.sym,dnn =c('true','predicted'))
confusion_matrix_train_sym
##     predicted
## true   0   1
##    0  67  19
##    1   8 118
misclassification_rate_train_sym <- round((confusion_matrix_train_sym[2]+confusion_matrix_train_sym[3])/sum(confusion_matrix_train_sym), 2)
cat("train misclassfication rate:", misclassification_rate_train_sym)
## train misclassfication rate: 0.13
#Symmetric cost evaluation function 
cost0 <- function(r, pi){
  weight1 = 1
  weight0 = 1
  c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

#Aymmetric cost evaluation function 
cost1 <- function(r, pi){
  weight1 = 5
  weight0 = 1
  c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

In sample- symmetric

#LOSS
LOSS=(insamplecost0<-cost0(data$heart_disease,pred.train.sym))
#Misclassification rate
MCR=(insampleMCR0<-mean(data$heart_disease!=pred.train.sym))

#Roc curve | AUC 
pred <- prediction(as.numeric(pred_train),as.numeric(data$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))

rslt_sym_in<-c(LOSS,MCR,AUC)
rslt_sym_in
## [1] 0.1273585 0.1273585 0.9262643

Optimal pcut - Assymetric Cost Logistic Regression

costfuncasym = function(obs, pred.p,pcut){
  weight1 = 5   # define the weight for "true=1 but pred=0" (FN)
  weight0 = 1    # define the weight for "true=0 but pred=1" (FP)
  c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
  c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
  cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
  return(cost) # you have to return to a value when you write R functions
} # end of the function

p.seq = seq(0.01, 1, 0.01) 
cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
  cost[i] = costfuncasym(obs = data$heart_disease, pred.p = predict_logit_train, pcut = p.seq[i])  
} # end of the loop

optimal.pcut.asym = p.seq[which(cost==min(cost))][1]

optimal.pcut.asym #0.2
## [1] 0.2
plot(p.seq, cost)

pred_train <- predict(logistic_model, type = "response")
pred.train.asym <- as.numeric(pred_train > optimal.pcut.asym)


confusion_matrix_train_asym <- table(data$heart_disease, pred.train.asym, dnn =c('true','predicted'))
confusion_matrix_train_asym
##     predicted
## true   0   1
##    0  47  39
##    1   4 122
misclassification_rate_train_asym <- round((confusion_matrix_train_asym[2]+confusion_matrix_train_asym[3])/sum(confusion_matrix_train_asym), 2)
cat("train Asymmetric misclassfication rate:", misclassification_rate_train_asym)
## train Asymmetric misclassfication rate: 0.2

In sample - asymmetric

#LOSS
LOSS=(insamplecost1<-cost1(data$heart_disease,pred.train.asym))
#Misclassification rate
MCR=(insampleMCR1<-mean(data$heart_disease!=pred.train.asym))

#Roc curve | AUC 
pred <- prediction(as.numeric(pred_train),as.numeric(data$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))

rslt_asym_in<-c(LOSS,MCR,AUC)
rslt_asym_in
## [1] 0.2783019 0.2028302 0.9262643

Out sample - symmetric

#predict the response
pred_test <- predict(logistic_model,newdata=data_test, type = "response")
#convert to binary based on optimal pcut for sym
pred.test.sym <- as.numeric(pred_test > optimal.pcut.sym)
#LOSS
LOSS=(outsamplecost0<-cost0(data_test$heart_disease,pred.test.sym))
#Misclassification rate
MCR=(outsampleMCR0<-mean(data_test$heart_disease!=pred.test.sym))

#Roc curve | AUC 
pred <- prediction(as.numeric(pred_test),as.numeric(data_test$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))

rslt_sym_out<-c(LOSS,MCR,AUC)
rslt_sym_out
## [1] 0.1758242 0.1758242 0.8801775

Out sample - asymmetric

#convert to binary based on optimal pcut for sym
pred.test.asym <- as.numeric(pred_test > optimal.pcut.asym)
#LOSS
LOSS=(outsamplecost1<-cost1(data_test$heart_disease,pred.test.asym))
#Misclassification rate
MCR=(outsampleMCR1<-mean(data_test$heart_disease!=pred.test.asym))

#Roc curve | AUC 
pred <- prediction(as.numeric(pred_test),as.numeric(data_test$heart_disease))
perf.l<- performance(pred, "tpr", "fpr")
plot(perf.l, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))

rslt_asym_out<-c(LOSS,MCR,AUC)
rslt_asym_out
## [1] 0.4175824 0.3296703 0.8801775
eval<-data.frame(rbind(rslt_sym_in,rslt_sym_out,rslt_asym_in,rslt_asym_out),
                 row.names=c('Logistic - Symmetric cost : in_sample',
                             'Logistic - Symmetric cost : out_sample',
                             'Logistic - Asymmetric cost: in_sample',
                             'Logistic - Asymmetric cost : out_sample'))
colnames(eval) <- c("COST", "MCR","AUC")



kable(eval)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
COST MCR AUC
Logistic - Symmetric cost : in_sample 0.1273585 0.1273585 0.9262643
Logistic - Symmetric cost : out_sample 0.1758242 0.1758242 0.8801775
Logistic - Asymmetric cost: in_sample 0.2783019 0.2028302 0.9262643
Logistic - Asymmetric cost : out_sample 0.4175824 0.3296703 0.8801775

7.2 Decision Trees

CART stands for classification and regression tree.

From the training Model we have learnt through grid search that optimum pcut value for * Symmetric misclassfication rate: 0.42 * Asymmetric misclassfication rate: 0.20

We will be using the same for further analysis

Here we make the assumption that false negative cost 5 times of false positive. In real life the cost structure should be carefully researched.

#data$heart_disease<-as.factor(data$heart_disease)
#symmetric cost
hrtDS.rpart0 <- rpart(formula = heart_disease ~ ., data = data, method = "class")
#asymmetric cost
hrtDS.rpart1 <- rpart(formula = heart_disease ~ . , data = data, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))

Symmetric Cost

pred0<- predict(hrtDS.rpart0, type="class")
table(data$heart_disease, pred0, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0  69  17
##    1  19 107

False negatives are heavy costing so lets, add more penality of 5:1

Asymmetric Cost insample

pred1<- predict(hrtDS.rpart1, type="class")
table(data$heart_disease, pred1, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0  45  41
##    1   1 125

As expected our False negatives have reduced to a substantial number but False positives have increased.

Lets evaluate the performance of both these models, keeping in mid that FN are far more harzardous than FP

Evaluation Function

#Symmetric cost evaluation function 
cost0 <- function(r, pi){
  weight1 = 1
  weight0 = 1
  c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

#Aymmetric cost evaluation function 
cost1 <- function(r, pi){
  weight1 = 5
  weight0 = 1
  c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

Symmetric Evaluation Here we try to calculate the criterians Missclassification rate | LOSS | AUC for Symmetric cost

data$heart_disease<-as.factor(data$heart_disease)

#LOSS
LOSS=(insamplecost0<-cost1(data$heart_disease,pred0))
#Misclassification rate
MCR=(insampleMCR0<-mean(data$heart_disease!=pred0))

#Roc curve | AUC 
pred0_roc<- predict(hrtDS.rpart0, type="prob")[,2]
pred <- prediction(pred0_roc,data[,14])
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))

rslt_sym<-c(LOSS,MCR,AUC)
rslt_sym
## [1] 0.5283019 0.1698113 0.8554356

Asymmetric Evaluation

Here we try tocalculate the criterians Missclassification rate | LOSS | AUC for Asymmetric cost

#LOSS
LOSS=(insamplecost.df<-cost1(data$heart_disease,pred1))
#Misclassification rate
MCR=(insampleMCR.df<-mean(data$heart_disease!=pred1))

#roc curve | AUC
pred1_roc<- predict(hrtDS.rpart1, type="prob")[,2]
pred <- prediction(pred1_roc,as.numeric(data$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))

rslt_Asym<-c(LOSS,MCR,AUC)
rslt_Asym
## [1] 0.2169811 0.1981132 0.8633721

Summary table

eval<-data.frame(rbind(rslt_sym,rslt_Asym),
                 row.names=c('Symmetric cost','Asymmetric cost'))
colnames(eval) <- c("COST", "MCR","AUC")



kable(eval)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
COST MCR AUC
Symmetric cost 0.5283019 0.1698113 0.8554356
Asymmetric cost 0.2169811 0.1981132 0.8633721

Conclusion: ASY wins

7.2.1 Classification Tree

The ideal way of creating a binary tree is to construct a large tree and then prune it to an optimum level

hrtDS.rpart_prune <- rpart(formula = heart_disease ~ . , 
                           data = data, method = "class", 
                           parms = list(loss=matrix(c(0,5,1,0), nrow = 2)),
                           cp = 0.001)

We have a bushy tree, Now we can try to prune up to a optimum level using plotcp function

prp(hrtDS.rpart_prune)

plotcp(hrtDS.rpart_prune)

printcp(hrtDS.rpart_prune)
## 
## Classification tree:
## rpart(formula = heart_disease ~ ., data = data, method = "class", 
##     parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)), cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] ca      oldpeak thal    thalach
## 
## Root node error: 86/212 = 0.40566
## 
## n= 212 
## 
##         CP nsplit rel error xerror    xstd
## 1 0.122093      0   1.00000 5.0000 0.41566
## 2 0.073643      2   0.75581 3.9767 0.39093
## 3 0.001000      5   0.53488 3.5233 0.37678

xerror gives you the cross-validation (default is 10-fold) error. You can see that the rel error (in-sample error) is always decreasing as model is more complex, while the cross-validation error (measure of performance on future observations) is not. That is why we prune the tree to avoid overfitting the training data.

hrtDS.rpart_prune1<-prune(hrtDS.rpart_prune, cp = 0.0086)

7.2.2 Model Assessment

Insample Prediction

hrtDS.rpart_prune1.pred.in<- predict(hrtDS.rpart_prune1, type="class")
table(data$heart_disease, hrtDS.rpart_prune1.pred.in, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0  45  41
##     1   1 125

Lets check out the criterians for model evaluation…

Missclassification rate | LOSS | AUC

#LOSS
LOSS=(dt.insamplecost<-cost1(data$heart_disease,hrtDS.rpart_prune1.pred.in))
#Misclassification rate
MCR=(dt.insampleMCR<-mean(data$heart_disease!=hrtDS.rpart_prune1.pred.in))
#roc curve | AUC
hrtDS.rpart_prune1.pred.in_roc<- predict(hrtDS.rpart_prune1, type="prob")[,2]
pred <- prediction(hrtDS.rpart_prune1.pred.in_roc,data$heart_disease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
#vectorized output
dt.rslt.in<-c(LOSS,MCR,AUC)
dt.rslt.in
## [1] 0.2169811 0.1981132 0.8633721

Out of sample Prediction

hrtDS.rpart_prune1.pred.out<- predict(hrtDS.rpart_prune1,data_test, type="class")
table(data_test$heart_disease, hrtDS.rpart_prune1.pred.out, dnn=c("Truth","Predicted"))
##      Predicted
## Truth  0  1
##     0 27 25
##     1  2 37

Missclassification rate | LOSS | AUC

#LOSS
LOSS=(dt.outsamplecost<-cost1(data_test$heart_disease,hrtDS.rpart_prune1.pred.out))
#Misclassification rate
MCR=(dt.outsampleMCR<-mean(data_test$heart_disease!=hrtDS.rpart_prune1.pred.out))

#roc curve | AUC
hrtDS.rpart_prune1.pred.out_roc<- predict(hrtDS.rpart_prune1,data_test, type="prob")[,2]
pred <- prediction(hrtDS.rpart_prune1.pred.out_roc,as.numeric(data_test$heart_disease))
perf.dt1 <- performance(pred, "tpr", "fpr")
plot(perf.dt1, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
#vectorized output
dt.rslt.out<-c(LOSS,MCR,AUC)
dt.rslt.out
## [1] 0.3846154 0.2967033 0.7924063
eval<-data.frame(rbind(dt.rslt.in,dt.rslt.out),
                 row.names=c('DT: in','DT: out'))
colnames(eval) <- c("COST", "MCR","AUC")

kable(eval)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
COST MCR AUC
DT: in 0.2169811 0.1981132 0.8633721
DT: out 0.3846154 0.2967033 0.7924063

7.3 GAM:

7.3.1 Modelling

#fetch numeric columns
numcols<-select_if(data_orig, is.numeric)
#fetch character columns
charcols<-select_if(data_orig, is.factor)

#continious columns names
(n<-colnames(numcols)[1:5])
## [1] "age"      "trestbps" "chol"     "thalach"  "oldpeak"
#character columns names
colnames(charcols)
## [1] "sex"     "cp"      "fbs"     "restecg" "exang"   "slope"   "ca"     
## [8] "thal"
#continious columns names
cont<-colnames(numcols)
#character columns names
n <- names(data)

formula.gam <- as.formula(paste("heart_disease ~s(age)+s(trestbps)+s(chol) +s(thalach) +s(oldpeak) +", 
                      paste(n[!n %in% cont], collapse = " + ")))
formula.gam
## heart_disease ~ s(age) + s(trestbps) + s(chol) + s(thalach) + 
##     s(oldpeak) + sex + cp + fbs + restecg + exang + slope + ca + 
##     thal
hrtDS.gam <- gam(formula = formula.gam, family=binomial,data=data);
summary(hrtDS.gam)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## heart_disease ~ s(age) + s(trestbps) + s(chol) + s(thalach) + 
##     s(oldpeak) + sex + cp + fbs + restecg + exang + slope + ca + 
##     thal
## 
## Parametric coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.959e+00  2.835e+00  -1.044  0.29663    
## sexmale            -2.353e+00  9.793e-01  -2.403  0.01625 *  
## cpatypical angina   2.483e-01  8.929e-01   0.278  0.78097    
## cpnon-anginal pain  1.186e+00  7.301e-01   1.625  0.10424    
## fbs>120             1.446e+00  9.946e-01   1.454  0.14593    
## restecgdefinite    -5.351e+01  4.745e+07   0.000  1.00000    
## restecgnormal      -7.645e-01  5.813e-01  -1.315  0.18847    
## exangyes           -7.125e-01  6.413e-01  -1.111  0.26659    
## slope1              7.726e-01  1.584e+00   0.488  0.62571    
## slope2              1.478e+00  1.751e+00   0.844  0.39878    
## ca1                -2.131e+00  7.968e-01  -2.674  0.00750 ** 
## ca2                -4.755e+00  1.194e+00  -3.982 6.84e-05 ***
## ca3                -4.941e+00  1.707e+00  -2.894  0.00380 ** 
## ca4                 4.847e+01  3.875e+07   0.000  1.00000    
## thal1               7.125e+00  2.618e+00   2.722  0.00649 ** 
## thal2               7.174e+00  2.377e+00   3.018  0.00254 ** 
## thal3               4.523e+00  2.237e+00   2.021  0.04323 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df Chi.sq p-value  
## s(age)      8.226  8.583 15.469  0.0680 .
## s(trestbps) 1.000  1.000  3.236  0.0721 .
## s(chol)     2.627  3.329  8.159  0.0627 .
## s(thalach)  3.176  3.983  5.228  0.2498  
## s(oldpeak)  5.268  6.271 11.667  0.0778 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.646   Deviance explained = 66.2%
## UBRE = -0.19181  Scale est. = 1         n = 212
plot(hrtDS.gam, shade=TRUE,seWithMean=TRUE,scale=0, pages = 1)

AIC(hrtDS.gam)
## [1] 171.3359
BIC(hrtDS.gam)
## [1] 296.5283
hrtDS.gam$deviance
## [1] 96.74079

We can notice that even the continious variables in the data do not have a non-linear distribution. This is backed up by the above plots where we can see a straight line

7.3.2 Model Assessment

Insample Performance

pcut.gam <- .20
prob.gam.in<-predict(hrtDS.gam,type="response")
pred.gam.in.class<-(prob.gam.in>=pcut.gam)*1
table(data$heart_disease,pred.gam.in.class,dnn=c("Observed","Predicted"))
##         Predicted
## Observed   0   1
##        0  60  26
##        1   2 124

Lets check out the criterians for model evaluation…

Missclassification rate | LOSS | AUC

#LOSS
LOSS=(gam.insamplecost<-cost1(data$heart_disease,pred.gam.in.class))
#Misclassification rate
MCR=(dt.insampleMCR<-mean(data$heart_disease!=pred.gam.in.class))

#roc curve | AUC
pred.gam.class.in_roc<- predict(hrtDS.gam, type="prob")[,2]
pred <- prediction(as.numeric(pred.gam.class.in_roc),as.numeric(data$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
#vectorized output
gam.rslt.in<-c(LOSS,MCR,AUC)
gam.rslt.in
## [1] 0.1698113 0.1320755 0.7214839

Out of sample Prediction

prob.gam.out<-predict(hrtDS.gam,data_test,type="response")
pred.gam.out.class<-(prob.gam.out>=pcut.gam)*1
table(data_test$heart_disease,pred.gam.out.class,dnn=c("Observed","Predicted"))
##         Predicted
## Observed  0  1
##        0 22 30
##        1  2 37

Missclassification rate | LOSS | AUC

#LOSS
LOSS=(dt.outsamplecost<-cost1(data_test$heart_disease,pred.gam.out.class))
#Misclassification rate
MCR=(dt.outsampleMCR<-mean(data_test$heart_disease!=pred.gam.out.class))

#roc curve | AUC
hrtDS.gam.out_roc<- predict(hrtDS.rpart_prune1,data_test, type="prob")[,2]
pred <- prediction(as.numeric(hrtDS.gam.out_roc),as.numeric(data_test$heart_disease))
perf.gm <- performance(pred, "tpr", "fpr")
plot(perf.gm, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
#vectorized output
gam.rslt.out<-c(LOSS,MCR,AUC)
gam.rslt.out
## [1] 0.4395604 0.3516484 0.7924063
eval<-data.frame(rbind(gam.rslt.in,gam.rslt.out),
                 row.names=c('GAM: in','GAM: out'))
colnames(eval) <- c("COST", "MCR","AUC")

kable(eval)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
COST MCR AUC
GAM: in 0.1698113 0.1320755 0.7214839
GAM: out 0.4395604 0.3516484 0.7924063

7.4 Ensemble Methods

Advanced Tree Models Bagging, Random Forests, and Boosting

7.4.1 Bagging

Bagging stands for Bootstrap and Aggregating. It employs the idea of bootstrap but the purpose is not to study bias and standard errors of estimates. Instead, the goal of Bagging is to improve prediction accuracy. It fits a tree for each bootsrap sample, and then aggregate the predicted values from all these different trees. For more details, you may look at Wikepedia, or you can find the original paper Leo Breiman (1996).

To my best knowledge, it seems that bagging() won’t take an argument for asymmetric loss. Therefore, the classification results might not be appropriate.Lets check it out…

Modelling

heart.bag<- randomForest(as.factor(heart_disease)~.,mtry=ncol(data)-1, data = data, nbagg=100)

Insample

heart.bag.pred<- predict(heart.bag, type="prob")[,2]
pred.bag = prediction(heart.bag.pred, data$heart_disease)
perf = performance(pred.bag, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred.bag, "auc"), "y.values"))
## [1] 0.84067

Insample Analysis

heart.bag.pred<- predict(heart.bag, newdata = data, type="class")
table(data$heart_disease, heart.bag.pred, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0  86   0
##    1   0 126
#LOSS
LOSS<-cost0(data$heart_disease,heart.bag.pred)
#Misclassification rate
MCR<-mean(data$heart_disease!=heart.bag.pred)
#Get the AUC
AUC=unlist(slot(performance(pred.bag, "auc"), "y.values"))

rslt.bag.in<-c(LOSS,MCR,AUC)
rslt.bag.in
## [1] 0.00000 0.00000 0.84067

Out of sample

levels(data_test$thal) <- levels(data$thal)
levels(data_test$ca) <- levels(data$ca)
levels(data_test$slope) <- levels(data$slope)
levels(data_test$exang) <- levels(data$exang)
levels(data_test$restecg) <- levels(data$restecg)
levels(data_test$fbs) <- levels(data$fbs)
levels(data_test$cp) <- levels(data$cp)
levels(data_test$sex) <- levels(data$sex)

data_test$heart_disease <-as.factor(data_test$heart_disease)
heart.bag.pred.test<- predict(heart.bag, newdata = data_test, type="prob")[,2]
pred.bag.out = prediction(heart.bag.pred.test, data_test$heart_disease)
perf.bg = performance(pred.bag.out, "tpr", "fpr")
plot(perf.bg, colorize=TRUE)

unlist(slot(performance(pred.bag.out, "auc"), "y.values"))
## [1] 0.9114892
heart.bag.pred.test<- predict(heart.bag, newdata = data_test, type="class")
table(data_test$heart_disease, heart.bag.pred.test, dnn = c("True", "Pred"))
##     Pred
## True  0  1
##    0 40 12
##    1  3 36
#LOSS
LOSS<-cost0(data_test$heart_disease,heart.bag.pred.test)
#Misclassification rate
MCR<-mean(data_test$heart_disease!=heart.bag.pred.test)
#Get the AUC
AUC=unlist(slot(performance(pred.bag.out, "auc"), "y.values"))

rslt.bag.out<-c(LOSS,MCR,AUC)
rslt.bag.out
## [1] 0.1648352 0.1648352 0.9114892
eval<-data.frame(rbind(rslt.bag.in,rslt.bag.out),
                 row.names=c('BAG: in','BAG: out'))
colnames(eval) <- c("COST", "MCR","AUC")

kable(eval)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
COST MCR AUC
BAG: in 0.0000000 0.0000000 0.8406700
BAG: out 0.1648352 0.1648352 0.9114892

7.4.1 Random Forest

Modelling

heart.rf<- randomForest(as.factor(heart_disease)~., mtry=sqrt(ncol(data)-1),ntree=100 ,data = data, importance=TRUE)
heart.rf
## 
## Call:
##  randomForest(formula = as.factor(heart_disease) ~ ., data = data,      mtry = sqrt(ncol(data) - 1), ntree = 100, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.87%
## Confusion matrix:
##    0   1 class.error
## 0 61  25   0.2906977
## 1 15 111   0.1190476

Variable Importance

heart.rf$importance
##                      0             1 MeanDecreaseAccuracy MeanDecreaseGini
## age       0.0066894627  0.0090019272         0.0080007447        9.7652322
## sex       0.0100411368  0.0089640079         0.0093973826        2.4283983
## cp        0.0236545789  0.0049268001         0.0124314370        7.5084754
## trestbps  0.0008585000  0.0010609511         0.0008679932        7.5703744
## chol      0.0079340813  0.0083961522         0.0081707502       10.6459891
## fbs      -0.0004175059 -0.0017677013        -0.0011811390        0.8207861
## restecg  -0.0027415097  0.0006863955        -0.0007813800        1.6776843
## thalach   0.0066289697  0.0025489093         0.0038556271       10.9100373
## exang     0.0028168382  0.0026346946         0.0029136281        2.5030901
## oldpeak   0.0314505792  0.0065132308         0.0166571509       14.2058831
## slope     0.0135692685 -0.0011422479         0.0052397966        4.3544639
## ca        0.0554900045  0.0328762794         0.0416077239       12.8668417
## thal      0.0586233858  0.0391646094         0.0464907871       16.0030334
plot(heart.rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))

Optimal cutoff - Symmetric cost

heart.rf.pred<- predict(heart.rf, type = "prob")[,2]

p.seq = seq(0.01, 1, 0.01)
cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    cost[i] = costfunc(obs = data$heart_disease, pred.p = heart.rf.pred, pcut = p.seq[i])  
}
plot(p.seq, cost)

optimal.pcut.sym= p.seq[which(cost==min(cost))][1]

optimal.pcut.sym#0.62
## [1] 0.49

Insample Analysis

ROC CURVE

pred.rf.in <- prediction(heart.rf.pred, data$heart_disease)
perf <- performance(pred.rf.in, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred.rf.in, "auc"), "y.values"))
## [1] 0.8551587

Confusion Matrix

heart.rf.pred.c<- predict(heart.bag, newdata = data, type="class")
table(data$heart_disease, heart.rf.pred.c, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0  86   0
##    1   0 126

Out-of-Sample - Symmetric

## out-of-sample
heart.rf.pred.test<- predict(heart.rf,newdata = data_test,type="prob")[,2]
heart.rf.class.test<- (heart.rf.pred.test>optimal.pcut.sym)*1
table(data_test$heart_disease, heart.rf.class.test, dnn = c("True", "Pred"))
##     Pred
## True  0  1
##    0 42 10
##    1  5 34
pred.rf.out <- prediction(heart.rf.pred.test, data_test$heart_disease)
perf <- performance(pred.rf.out, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred.rf.out, "auc"), "y.values"))
## [1] 0.9146943
#LOSS
LOSS<-cost0(data_test$heart_disease,heart.rf.class.test)
#Misclassification rate
MCR<-mean(data_test$heart_disease!=heart.rf.class.test)
#Get the AUC
AUC=unlist(slot(performance(pred.rf.out, "auc"), "y.values"))

rslt.rf.out.sym<-c(LOSS,MCR,AUC)
rslt.rf.out.sym
## [1] 0.1648352 0.1648352 0.9146943

Optimal cutoff - Assymetric cost

p.seq = seq(0.01, 1, 0.01)
cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    cost[i] = costfuncasym(obs = data$heart_disease, pred.p = heart.rf.pred, pcut = p.seq[i])  
}
plot(p.seq, cost)

optimal.pcut.asym= p.seq[which(cost==min(cost))][1]

optimal.pcut.asym#0.07
## [1] 0.2

Out-of-Sample Asymmetric

## out-of-sample
heart.rf.pred.test<- predict(heart.rf,newdata = data_test,type="prob")[,2]
heart.rf.class.test<- (heart.rf.pred.test>optimal.pcut.asym)*1
table(data_test$heart_disease, heart.rf.class.test, dnn = c("True", "Pred"))
##     Pred
## True  0  1
##    0 19 33
##    1  1 38
pred.rf.out <- prediction(heart.rf.pred.test, data_test$heart_disease)
perf.rf <- performance(pred.rf.out, "tpr", "fpr")
plot(perf.rf, colorize=TRUE)

unlist(slot(performance(pred.rf.out, "auc"), "y.values"))
## [1] 0.9146943
#LOSS
LOSS<-cost1(data_test$heart_disease,heart.rf.class.test)
#Misclassification rate
MCR<-mean(data_test$heart_disease!=heart.rf.class.test)
#Get the AUC
AUC=unlist(slot(performance(pred.rf.out, "auc"), "y.values"))

rslt.rf.out.asym<-c(LOSS,MCR,AUC)
rslt.rf.out.asym
## [1] 0.4175824 0.3736264 0.9146943
eval<-data.frame(rbind(rslt.rf.out.sym,rslt.rf.out.asym),
                 row.names=c('Random Forest: Symmetric : Out_sample',
                             'Random Forest: Asymmetric : Out_sample'))
colnames(eval) <- c("COST", "MCR","AUC")

kable(eval)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
COST MCR AUC
Random Forest: Symmetric : Out_sample 0.1648352 0.1648352 0.9146943
Random Forest: Asymmetric : Out_sample 0.4175824 0.3736264 0.9146943

7.4.2 Boosting

Boosting builds a number of small trees, and each time, the response is the residual from last tree. It is a sequential procedure. We use gbm package to build boosted trees.

data$heart_disease= as.factor(data$heart_disease)

heart.boost= boosting(heart_disease~., data = data, boos = T)
# Training AUC
pred.heart.boost= predict(heart.boost, newdata = data)
pred <- prediction(pred.heart.boost$prob[,2], data$heart_disease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 1

Boosting in sample

pred.heart.boost.train= predict(heart.boost, newdata = data)
#LOSS
LOSS<-cost0(data$heart_disease,as.numeric(pred.heart.boost.train$class))
#Misclassification rate
MCR<-mean(data$heart_disease!=as.numeric(pred.heart.boost.train$class))


# Testing AUC
pred <- prediction(pred.heart.boost.train$prob[,2], data$heart_disease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))

boost_in<-c(LOSS,MCR,AUC)
boost_in
## [1] 0 0 1

Boosting out sample

pred.heart.boost.test= predict(heart.boost, newdata = data_test)
#LOSS
LOSS<-cost0(data_test$heart_disease,as.numeric(pred.heart.boost.test$class))
#Misclassification rate
MCR<-mean(data_test$heart_disease!=as.numeric(pred.heart.boost.test$class))


# Testing AUC
pred <- prediction(pred.heart.boost.test$prob[,2], data_test$heart_disease)
perf.bo <- performance(pred, "tpr", "fpr")
plot(perf.bo, colorize=TRUE)

#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))

boost_out<-c(LOSS,MCR,AUC)
boost_out
## [1] 0.1868132 0.1868132 0.9048323
eval<-data.frame(rbind(boost_in,boost_out),
                 row.names=c('Boosting: In_sample',
                             'Boosting: Out_sample'))
colnames(eval) <- c("COST", "MCR","AUC")

kable(eval)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
COST MCR AUC
Boosting: In_sample 0.0000000 0.0000000 1.0000000
Boosting: Out_sample 0.1868132 0.1868132 0.9048323

—– NOT sure if we can calculate pcut here ——- ### 7.5 Neural Networks

We ran a neural network model using “heart disease” as the output variable, and the remaining 13 variables as input variables. We trained the model using NeuralNetTools package, in which the size of the hidden layers are calculated internally.For classification problems with nnet you need to code the response to factor first

library(e1071)
library(nnet)
library(NeuralNetTools)
data_orig$heart_disease <- as.factor(data_orig$heart_disease)

Sampling Data into 70% and 30% Test

set.seed(13263635)
index <- sample(nrow(data_orig),nrow(data_orig)*0.7)
hd_train <- data_orig[index,]
hd_test <- data_orig[-index,]

Here, we’re using another library called NeuralNetTools in which the size of the hidden layers are calculated internally

Evaluating the results:

print(hd.nnet)
## Neural Network 
## 
## 212 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 212, 212, 212, 212, 212, 212, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa     
##   1     0e+00  0.6401923  0.09431636
##   1     1e-04  0.6110185  0.00000000
##   1     1e-01  0.7589285  0.48341994
##   3     0e+00  0.6201874  0.05705648
##   3     1e-04  0.6376189  0.13904563
##   3     1e-01  0.7721512  0.51435973
##   5     0e+00  0.6282547  0.09356149
##   5     1e-04  0.6572260  0.17507520
##   5     1e-01  0.7809113  0.53395146
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(hd.nnet)
PNN<-plot(garson(hd.nnet))

PNN+coord_flip()

plotnet(hd.nnet$finalModel, y_names = "heart_disease") + title("Graphical Representation of Neural Network of Heart Disease ")

## integer(0)
pred.glm.gtrain.nn <- predict(hd.nnet, type = "prob")[,2]
pred.glm.gtest.nn <- predict(hd.nnet, newdata=hd_test,type = "prob")[,2]
cost_sym <- function(obs, pred.p, pcut){
  weight1 = 1
  weight0 = 1
  c1 = (obs==1)&(pred.p<pcut)
  c0 = (obs==0)&(pred.p>=pcut)
  cost = (mean(weight1*c1+weight0*c0))
  return(cost)
}
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 0.5, 0.01)

# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut

cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    cost[i] = cost_sym(obs = hd_train$heart_disease, pred.p = pred.glm.gtrain.nn, pcut = p.seq[i])
}

#optimal pcut to be used further
plot(p.seq, cost)

optimal.pcut.sym = p.seq[which(cost == min(cost))]
optimal.pcut.sym #0.5
## [1] 0.42 0.43 0.44 0.45 0.46

Asymmetric : To find Optimal Pcut

#Cost Function
# define a cost function with input "obs" being observed response 
# and "pi" being predicted probability, and "pcut" being the threshold.
cost_asym <- function(obs, pred.p, pcut){
  weight1 = 5
  weight0 = 1
  c1 = (obs==1)&(pred.p<pcut)  # count for "true=1 but pred=0"   (FN)
  c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1"   (FP)
  cost = (mean(weight1*c1+weight0*c0)) # misclassification with weight
  return(cost)
}
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 0.5, 0.01)

# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq))
  {cost[i] = cost_asym(obs = hd_train$heart_disease, 
                      pred.p = pred.glm.gtrain.nn, 
                      pcut = p.seq[i])}

#optimal pcut to be used further
plot(p.seq, cost)

optimal.pcut = p.seq[which(cost==min(cost))]
optimal.pcut = 0.36#0.36

Since the Asymmetrical Pcut is lower than symmetric cost, predictions are done using asymmetric cost

7.5.1 Asymmetrical

Train & Test Predictions - Asymm

confusion_matrix_train <- table(hd_train$heart_disease, pred.glm.gtrain.nn)
confusion_matrix_test <- table(hd_test$heart_disease, pred.glm.gtest.nn)

misclassification_rate_train <- round((confusion_matrix_train[2]+confusion_matrix_train[3])/sum(confusion_matrix_train), 2)
misclassification_rate_test <- round((confusion_matrix_test[2]+confusion_matrix_test[3])/sum(confusion_matrix_test), 2)

cat("train misclassfication rate:", misclassification_rate_train, "| test misclassfication rate:", misclassification_rate_test)
## train misclassfication rate: 0 | test misclassfication rate: 0.01
# Train MCR = 0.13
# Test MCR = 0.33
confusion_matrix_train
##    pred.glm.gtrain.nn
##     0.00140547408860982 0.0016187281280542 0.00548143049348793
##   0                   1                  1                   1
##   1                   0                  0                   0
##    pred.glm.gtrain.nn
##     0.00598973145653925 0.0098843168471668 0.0115061735875303
##   0                   1                  1                  1
##   1                   0                  0                  0
##    pred.glm.gtrain.nn
##     0.0128223272381664 0.014151875164994 0.0143605610394608
##   0                  1                 1                  1
##   1                  0                 0                  0
##    pred.glm.gtrain.nn
##     0.0166278571491479 0.0201716782807516 0.021629754181104
##   0                  1                  1                 1
##   1                  0                  0                 0
##    pred.glm.gtrain.nn
##     0.0240365162374382 0.024280831755576 0.0313937449450673
##   0                  1                 1                  1
##   1                  0                 0                  0
##    pred.glm.gtrain.nn
##     0.0340562345750676 0.0344144558800905 0.0386452069088804
##   0                  1                  1                  1
##   1                  0                  0                  0
##    pred.glm.gtrain.nn
##     0.039209620504837 0.0394708106825859 0.0404816342423995
##   0                 1                  1                  1
##   1                 0                  0                  0
##    pred.glm.gtrain.nn
##     0.0414290585909004 0.0424395591295265 0.043839467700985
##   0                  1                  1                 1
##   1                  0                  0                 0
##    pred.glm.gtrain.nn
##     0.0507122263948863 0.0512751805092015 0.0518461466290865
##   0                  1                  1                  1
##   1                  0                  0                  0
##    pred.glm.gtrain.nn
##     0.053711248247861 0.0587230786846722 0.0594830506213191
##   0                 1                  1                  1
##   1                 0                  0                  0
##    pred.glm.gtrain.nn
##     0.0637553739839628 0.0653590595984303 0.065848726482815
##   0                  1                  1                 1
##   1                  0                  0                 0
##    pred.glm.gtrain.nn
##     0.0679356449947991 0.0693918071875803 0.0732580830124811
##   0                  1                  1                  1
##   1                  0                  0                  0
##    pred.glm.gtrain.nn
##     0.0734319443912572 0.076928640458764 0.0774338468075271
##   0                  1                 0                  1
##   1                  0                 1                  0
##    pred.glm.gtrain.nn
##     0.0778547996452845 0.0778939091754068 0.0798186217080686
##   0                  1                  1                  0
##   1                  0                  0                  1
##    pred.glm.gtrain.nn
##     0.0875299764408242 0.0876413211220728 0.0881558003733085
##   0                  1                  1                  1
##   1                  0                  0                  0
##    pred.glm.gtrain.nn
##     0.106003878429398 0.11593951606708 0.119849246953881 0.131986608049957
##   0                 1                1                 1                 1
##   1                 0                0                 0                 0
##    pred.glm.gtrain.nn
##     0.135568186622098 0.136005039247625 0.148820866959496
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtrain.nn
##     0.153461519765809 0.158649770765087 0.163031632660873 0.1659928170043
##   0                 1                 1                 1               1
##   1                 0                 0                 0               0
##    pred.glm.gtrain.nn
##     0.166849897759784 0.171708456427972 0.182285167408373
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtrain.nn
##     0.183500209824067 0.188712008954861 0.194475248412997
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtrain.nn
##     0.202882378942267 0.209223779154213 0.220113186444233
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtrain.nn
##     0.223985816750779 0.225239524459252 0.24616956896554 0.251543867046982
##   0                 1                 1                1                 1
##   1                 0                 0                0                 0
##    pred.glm.gtrain.nn
##     0.304474968881152 0.307842299441321 0.341311292365706 0.366865839177
##   0                 1                 1                 0              0
##   1                 0                 0                 1              1
##    pred.glm.gtrain.nn
##     0.396812950422799 0.411473626187303 0.41705850901617 0.449438508470835
##   0                 1                 1                1                 1
##   1                 0                 0                0                 0
##    pred.glm.gtrain.nn
##     0.449937027480382 0.46521163447619 0.46768608573504 0.500797755226483
##   0                 0                0                0                 0
##   1                 1                1                1                 1
##    pred.glm.gtrain.nn
##     0.502219407500019 0.542846027055318 0.547709965419821
##   0                 1                 0                 1
##   1                 0                 1                 0
##    pred.glm.gtrain.nn
##     0.576603258260078 0.591897180122641 0.60891511266251 0.61875382285694
##   0                 0                 0                0                0
##   1                 1                 1                1                1
##    pred.glm.gtrain.nn
##     0.620098921718515 0.62421386566248 0.628376331374646 0.634150779254919
##   0                 1                0                 1                 0
##   1                 0                1                 0                 1
##    pred.glm.gtrain.nn
##     0.642613562531629 0.645519654093024 0.652532812191151
##   0                 1                 0                 0
##   1                 0                 1                 1
##    pred.glm.gtrain.nn
##     0.655558631460355 0.661513565698282 0.663742899084925
##   0                 0                 0                 1
##   1                 1                 1                 0
##    pred.glm.gtrain.nn
##     0.678752028678773 0.72006356347281 0.748281977475679 0.754904239047469
##   0                 0                0                 1                 0
##   1                 1                1                 0                 1
##    pred.glm.gtrain.nn
##     0.755129725479019 0.757354113849914 0.763024934721684
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.785082223361549 0.795617177903729 0.797230706335981
##   0                 0                 1                 0
##   1                 1                 0                 1
##    pred.glm.gtrain.nn
##     0.800351450300016 0.814373415846112 0.819113223711186 0.8268274275114
##   0                 0                 1                 0               0
##   1                 1                 0                 1               1
##    pred.glm.gtrain.nn
##     0.831685408314223 0.834303542156161 0.842153990709367
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.842713912269737 0.842847274616396 0.844540024278214 0.84519765141143
##   0                 0                 0                 0                0
##   1                 1                 1                 1                1
##    pred.glm.gtrain.nn
##     0.847043483200458 0.856178630035966 0.857013907093588
##   0                 1                 0                 1
##   1                 0                 1                 0
##    pred.glm.gtrain.nn
##     0.859470968605237 0.860395824280875 0.863063972419414
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.874156417447392 0.875055271084176 0.876551128274807
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.880329849320248 0.888949769312837 0.889799106003999
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.890626674386715 0.892190877093108 0.894239708404398
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.895865331696358 0.900705638009325 0.901272995056137 0.91043123832262
##   0                 0                 0                 0                0
##   1                 1                 1                 1                1
##    pred.glm.gtrain.nn
##     0.913765423924941 0.914055914408337 0.917189472345613
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.917498553098726 0.917610797320407 0.918516514286963
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.918547099481092 0.918827456389806 0.919297942646124
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.919493322131194 0.919704090071472 0.919993466648231 0.92397689026441
##   0                 0                 0                 0                1
##   1                 1                 1                 1                0
##    pred.glm.gtrain.nn
##     0.925681456485117 0.926197727921199 0.926797683988465
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.926798472334727 0.926964749325013 0.9274664950847 0.927709815528317
##   0                 0                 0               0                 0
##   1                 1                 1               1                 1
##    pred.glm.gtrain.nn
##     0.929476771325266 0.930288730710336 0.930355238721127
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.931095345716087 0.931288040669097 0.934645990184759
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.939697272287572 0.943030704631844 0.943319931554613 0.94383793476792
##   0                 0                 0                 0                1
##   1                 1                 1                 1                0
##    pred.glm.gtrain.nn
##     0.945235951101573 0.945629338457765 0.945832841916546 0.94690282170387
##   0                 0                 0                 0                0
##   1                 1                 1                 1                1
##    pred.glm.gtrain.nn
##     0.94708108575275 0.951713062694856 0.955364864776263 0.956303365742809
##   0                0                 0                 0                 0
##   1                1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.9612646781818 0.968063874710117 0.979339627213609 0.98328409265131
##   0               0                 0                 0                0
##   1               1                 1                 1                1
##    pred.glm.gtrain.nn
##     0.984635956086537 0.985943944855065 0.98811629909488 0.988387531711702
##   0                 0                 0                0                 0
##   1                 1                 1                1                 1
##    pred.glm.gtrain.nn
##     0.988815678847412 0.988825710705713 0.989146089095493
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.990447001464028 0.990494126745047 0.991128001136929
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.991352042092903 0.991579581459918 0.992869108681771
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.993538781805464 0.993563716139713 0.994012087396936
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.994017520175521 0.994454065928518 0.9947299338309 0.994815028766075
##   0                 0                 0               0                 0
##   1                 1                 1               1                 1
##    pred.glm.gtrain.nn
##     0.994926476152365 0.995046701249083 0.99506419109178 0.995393140837817
##   0                 0                 0                0                 0
##   1                 1                 1                1                 1
##    pred.glm.gtrain.nn
##     0.995407514715461 0.995532172523794 0.995579047337148
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtrain.nn
##     0.995714298234534 0.99583399780544 0.997620926201334 0.998349603967806
##   0                 0                0                 0                 0
##   1                 1                1                 1                 2
confusion_matrix_test
##    pred.glm.gtest.nn
##     0.0027714463112969 0.00322479563718421 0.00423389851902834
##   0                  1                   1                   1
##   1                  0                   0                   0
##    pred.glm.gtest.nn
##     0.00464319926601919 0.00496328291664434 0.00634827029527815
##   0                   1                   1                   1
##   1                   0                   0                   0
##    pred.glm.gtest.nn
##     0.0126946177515764 0.0154087014214331 0.0261545788035418
##   0                  1                  1                  1
##   1                  0                  0                  0
##    pred.glm.gtest.nn
##     0.0299817831501402 0.0415105904949099 0.0419567661641778
##   0                  1                  1                  1
##   1                  0                  0                  0
##    pred.glm.gtest.nn
##     0.051200792457784 0.0565585415444473 0.0607873310413389
##   0                 1                  1                  1
##   1                 0                  0                  0
##    pred.glm.gtest.nn
##     0.084590587016722 0.101767513496881 0.106903587430528
##   0                 1                 0                 1
##   1                 0                 1                 0
##    pred.glm.gtest.nn
##     0.111039564740205 0.114172326187984 0.117386210677135
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtest.nn
##     0.119035879126298 0.123739396414317 0.124522450793027
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtest.nn
##     0.133164552998665 0.139745284301599 0.155499764954562
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtest.nn
##     0.163392256383793 0.170576496925379 0.212716510709362
##   0                 1                 1                 0
##   1                 0                 0                 1
##    pred.glm.gtest.nn
##     0.222593593752233 0.223649200234162 0.253149885531279
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtest.nn
##     0.254651140681851 0.28010499592357 0.286727303673468 0.304676617024148
##   0                 1                1                 1                 1
##   1                 0                0                 0                 0
##    pred.glm.gtest.nn
##     0.349530588438819 0.373327198142616 0.451751561762436
##   0                 1                 0                 1
##   1                 0                 1                 0
##    pred.glm.gtest.nn
##     0.499086752242712 0.5224198992142 0.539137095132152 0.561357271436771
##   0                 1               1                 0                 1
##   1                 0               0                 1                 0
##    pred.glm.gtest.nn
##     0.585223380662428 0.598566052679182 0.65601834430533 0.657826100237102
##   0                 1                 1                0                 0
##   1                 0                 0                1                 1
##    pred.glm.gtest.nn
##     0.660514634914973 0.661724887315089 0.669418952111043
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtest.nn
##     0.674766995264368 0.696914969065705 0.781739998257185
##   0                 1                 0                 0
##   1                 0                 1                 1
##    pred.glm.gtest.nn
##     0.782511943959304 0.806378841539237 0.829005402359611
##   0                 1                 1                 1
##   1                 0                 0                 0
##    pred.glm.gtest.nn
##     0.829941252916241 0.832934464488041 0.867603808719308 0.8793976015413
##   0                 0                 0                 0               0
##   1                 1                 1                 1               1
##    pred.glm.gtest.nn
##     0.884983227802461 0.885215750473307 0.901858537117129
##   0                 0                 0                 1
##   1                 1                 1                 0
##    pred.glm.gtest.nn
##     0.908085577554559 0.912529030625575 0.913996903472009
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtest.nn
##     0.914602577858671 0.919945122551635 0.920125789646024
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtest.nn
##     0.926094517343043 0.926783401584116 0.927176808676671
##   0                 0                 1                 1
##   1                 1                 0                 0
##    pred.glm.gtest.nn
##     0.929397059899547 0.929724760310095 0.930080023679534
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtest.nn
##     0.931409246460362 0.935413005466966 0.935634102429502 0.95563599779744
##   0                 0                 0                 0                1
##   1                 1                 1                 1                0
##    pred.glm.gtest.nn
##     0.962077887112079 0.971996621089428 0.982931226210635
##   0                 1                 1                 0
##   1                 0                 0                 1
##    pred.glm.gtest.nn
##     0.983224699871222 0.99364023261421 0.993678182831536 0.993847919392054
##   0                 0                0                 0                 0
##   1                 1                1                 1                 1
##    pred.glm.gtest.nn
##     0.993916973257418 0.994536425970503 0.999072514005146
##   0                 0                 0                 0
##   1                 1                 1                 1
##    pred.glm.gtest.nn
##     0.999113985101415
##   0                 0
##   1                 1

ROC Curve for Train set

#In sample roc
prednn_roc.train<- predict(hd.nnet, type="prob")[,2]
pred <- prediction(prednn_roc.train,hd_train$heart_disease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

auc.train=unlist(slot(performance(pred, "auc"), "y.values"))
auc.train
## [1] 0.9581949

ROC Curve for Test set

#Out of sample roc
prednn_roc.test <- predict(hd.nnet, newdata=hd_test,type = "prob")[,2]
pred.test.nn <- prediction(prednn_roc.test,hd_test$heart_disease)
perf.nn <- performance(pred.test.nn, "tpr", "fpr")
plot(perf.nn, colorize=TRUE)

auc.test=unlist(slot(performance(pred.test.nn, "auc"), "y.values"))
auc.test
## [1] 0.8786982

Evaluation Function for asymmetric cost

#Aymmetric cost evaluation function 
cost_asym <- function(r, pi){
  weight1 = 5
  weight0 = 1
  c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

Asymmetric Cost : Train and Test

class.pred.train.nn <- (pred.glm.gtrain.nn>optimal.pcut)*1
cost.train <- round(cost_asym(r = hd_train$heart_disease, pi = class.pred.train.nn),2)

class.pred.test.nn<- (pred.glm.gtest.nn>optimal.pcut)*1
cost.test <- round(cost_asym(r = hd_test$heart_disease, pi = class.pred.test.nn),2)

cat("total train cost:", cost.train, "| total test cost:", cost.test)
## total train cost: 0.15 | total test cost: 0.29
#Train Cost = 0.17
# Test Cost = 0.29
nn.train.results <- c(cost.train,misclassification_rate_train,auc.train)
nn.test.results <- c(cost.test,misclassification_rate_test,auc.test)

7.5.2 Neural Net Result

eval_nn <- data.frame(rbind(nn.train.results,nn.test.results),
                 row.names=c('Neural Network : in','Neural Network : out'))
colnames(eval_nn) <- c("COST", "MCR", "AUC")

kable(eval_nn)  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
                font_size = 12, position = "left", full_width = FALSE)
COST MCR AUC
Neural Network : in 0.15 0.00 0.9581949
Neural Network : out 0.29 0.01 0.8786982

8.0 Conclusion:

Composite Roc curve of all models

pred <- prediction(pred.heart.boost.test$prob[,2], data_test$heart_disease)
perf.bo <- performance(pred, "tpr", "fpr")
plot(perf.bo, col = "red")
plot(perf.bg,add = TRUE, col = "pink")
plot(perf.rf, add = TRUE,col = "cyan")
plot(perf.l, add = TRUE, col = "black")
plot(perf.dt1, add = TRUE, col = "blue")
plot(perf.gm, add = TRUE,col = "green")
plot(perf.nn, add = TRUE,col = "grey")

## Add Legend
legend("bottomright", c("Boosting", "Bagging","RandomFrst","Logistic","Dtrees","Gam","NN"), lty=1, 
    col = c("red", "pink","cyan","black","blue","green","grey"), bty="n")

Considering Miss Classification as most important metric we select Boosting model as it has least Miss classification rate and cost, AUC is also higher than most algorithms although Random forest and Bagging are slightly better than Boosting in this regard

When it comes to diagnosing heart disease, we want to make sure we don’t have too many false-positives (you don’t have heart disease, but told you do and get treatment) or false-negatives (you have heart disease, but told you don’t and don’t get treatment). Therefore, model with highest accuracy and lowest miss classification rate is chosen as the best of all. *Bagging in this case also performs well on the in-sample data, although the misclassification rate is slightly lower than Boosting.

*Random Forest on the other hand performs more moderately in in-sample but doesn’t perform similarly on out of sample. Increasing the cost (Asymmetric loss).

*One important thing to notice here is that Ensemble methods do not have a way to prescribe a Asymmetric cost function while creating the Tree, as in the case of CART(rpart) hence they might not perform well is the data as a asymmetric cost associated to it.

*Decision Trees here take a humble approach, performing well in in-sample and out of sample. Their cost and Misclassification rate is lower than GAM and classic Logistic model.

*The neural network (with 3 hidden layers) has the best in-sample mcr amongst all the models considered but they have a highest loss on Out of sample data.

Evaluating all these factors we select Boosting as best performing model among all other Supervised Learning approaches that we examined in this dataset.

9.0 Appendix

6.6Support Vector Machine (SVM)

library(e1071)
heart.svm = svm(heart_disease ~ ., data = data, cost = 1, gamma = 1/length(data), probability= TRUE)
prob.svm = predict(heart.svm, data_test, probability = TRUE)
prob.svm = attr(prob.svm, 'probabilities')[,2] #This is needed because prob.svm gives a matrix
pred.svm = as.numeric((prob.svm >= 0.2))
table(data_test$heart_disease,pred.svm,dnn=c("Obs","Pred"))
##    Pred
## Obs  0  1
##   0  4 48
##   1 24 15
mean(ifelse(data_test$heart_disease != pred.svm, 1, 0))
## [1] 0.7912088
cost1(data_test$heart_disease, pred.svm)
## [1] 1.846154