Analysing Prediction power of different Machine learning algorithms on Heart disease data
About the data The dataset used in this article is the Cleveland Heart Disease dataset taken from the UCI repository(http://archive.ics.uci.edu/ml/datasets/Heart+Disease). It originally contains 76 variables, but all published experiments refer to using 14 of them. The dataset consists of 303 observations. There are 14 columns in the dataset, which are described as follows:
Age: Age of the individual.
Sex: Gender of the individual: 1 = male, 0 = female
Chest-pain type (cp): 1 = asymptotic, 2 = atypical angina, 3 = non anginal pain
Resting Blood Pressure (trestbps): The resting blood pressure value in mmHg (unit)
Serum Cholesterol (chol): The serum cholesterol in mg/dl (unit)
Fasting Blood Sugar (fbs): The fasting blood sugar value of an individual, if > 120 -> 1, else 0
Resting ECG (restecg): displays resting electrocardiographic results
Max heart rate (thalach): The max heart rate achieved by an individual
Exercise induced angina (exang): 1 = yes, 0 = no
Oldpeak: ST depression induced by exercise relative to rest
Slope: The slope of peak exercise ST segment
Ca: Number of major vessels (0-3) colored by fluoroscopy
Thal: Thalassemia = 0,1,2,3
Heart_disease: Whether the individual is suffering from heart disease or not 0 = absence, 1= present.
data_orig <- read.csv('heartdisease1.csv')
str(data_orig)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
head(data_orig)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
## target
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
Sample of 5 rows shown above
Function to convert the class of variables:
convert_class <- function(obj,types){
for (i in 1:length(obj)){
FUN <- switch(types[i],character = as.character,
numeric = as.numeric,
factor = as.factor)
obj[,i] <- FUN(obj[,i])
}
obj
}
data_orig<-rename(data_orig,heart_disease='target',age="age")
data_orig <- data_orig %>% rename() %>%
mutate(sex = if_else(sex == 1, "male", "female"),
fbs = if_else(fbs == 1, ">120", "<=120"),
exang = if_else(exang == 1, "yes" ,"no"),
cp = if_else(cp == 1, "atypical angina",
if_else(cp == 2, "non-anginal pain", "asymptotic")),
restecg = if_else(restecg == 0, "normal",
if_else(restecg == 1, "abnormal", "definite")))
change_class <-c("numeric","factor","factor","numeric","numeric","factor","factor","numeric","factor","numeric","factor","factor","factor","numeric")
data_orig <- convert_class(data_orig,change_class)
label <- data.frame(
age = "age in years",
sex = "sex of the patient",
cp = "chest pain type",
trestbps = "resting blood pressure (in mm Hg on admission to the hospital)",
chol = "serum cholestoral in mg/dl",
fbs = "fasting blood sugar > 120 mg/dl",
restecg = "resting electrocardiographic results",
thalach = "maximum heart rate achieved",
exang = "exercise induced angina",
oldpeak = "ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms",
slope = "the slope of the peak exercise ST segment",
ca = "number of major vessels (0-3) colored by flourosopy",
thal = "3 = normal; 6 = fixed defect; 7 = reversable defect"
)
vtable::vtable(data_orig, labels = label, factor.limit = 0)
| Name | Class | Label | Values |
|---|---|---|---|
| age | numeric | age in years | Num: 29 to 77 |
| sex | factor | sex of the patient | ‘female’ ‘male’ |
| cp | factor | chest pain type | ‘asymptotic’ ‘atypical angina’ ‘non-anginal pain’ |
| trestbps | numeric | resting blood pressure (in mm Hg on admission to the hospital) | Num: 94 to 200 |
| chol | numeric | serum cholestoral in mg/dl | Num: 126 to 564 |
| fbs | factor | fasting blood sugar > 120 mg/dl | ‘<=120’ ‘>120’ |
| restecg | factor | resting electrocardiographic results | ‘abnormal’ ‘definite’ ‘normal’ |
| thalach | numeric | maximum heart rate achieved | Num: 71 to 202 |
| exang | factor | exercise induced angina | ‘no’ ‘yes’ |
| oldpeak | numeric | ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms | Num: 0 to 6.2 |
| slope | factor | the slope of the peak exercise ST segment | ‘0’ ‘1’ ‘2’ |
| ca | factor | number of major vessels (0-3) colored by flourosopy | ‘0’ ‘1’ ‘2’ ‘3’ ‘4’ |
| thal | factor | 3 = normal; 6 = fixed defect; 7 = reversable defect | ‘0’ ‘1’ ‘2’ ‘3’ |
| heart_disease | numeric | NA | Num: 0 to 1 |
metadata<-t(introduce(data_orig))
colnames(metadata)<-"Values"
metadata
## Values
## rows 303
## columns 14
## discrete_columns 8
## continuous_columns 6
## all_missing_columns 0
## total_missing_values 0
## complete_rows 303
## total_observations 4242
## memory_usage 32592
plot_intro(data_orig)
Plot the missing values
plot_missing(data_orig)
* There are no missing values
Splitting the data in 70:30 ratio
set.seed(13263635)
train_percent <-0.70
index <- sample(nrow(data_orig),nrow(data_orig)*train_percent)
data <- data_orig[index,]
data_test <- data_orig[-index,]
Summarizing the Training data
summary(data)
## age sex cp trestbps
## Min. :29.00 female: 64 asymptotic :114 Min. : 94.0
## 1st Qu.:47.00 male :148 atypical angina : 36 1st Qu.:120.0
## Median :56.00 non-anginal pain: 62 Median :130.0
## Mean :54.33 Mean :131.2
## 3rd Qu.:61.00 3rd Qu.:140.0
## Max. :74.00 Max. :200.0
## chol fbs restecg thalach exang
## Min. :126.0 <=120:180 abnormal:106 Min. : 71.0 no :150
## 1st Qu.:208.0 >120 : 32 definite: 2 1st Qu.:138.0 yes: 62
## Median :239.0 normal :104 Median :154.0
## Mean :242.6 Mean :150.8
## 3rd Qu.:270.0 3rd Qu.:168.0
## Max. :407.0 Max. :202.0
## oldpeak slope ca thal heart_disease
## Min. :0.0000 0: 11 0:131 0: 2 Min. :0.0000
## 1st Qu.:0.0000 1: 94 1: 37 1: 13 1st Qu.:0.0000
## Median :0.6000 2:107 2: 29 2:119 Median :1.0000
## Mean :0.9892 3: 12 3: 78 Mean :0.5943
## 3rd Qu.:1.6000 4: 3 3rd Qu.:1.0000
## Max. :4.4000 Max. :1.0000
Chi-Square Test of Independence for categorical columns
#fetch numeric columns
catcols<-select_if(data, is.factor)
numcols<-select_if(data, is.numeric)
predictors<-catcols[,1:8]
response<-numcols$heart_disease
n<-ncol(predictors)
values<- vector(mode="numeric", length=n)
pvalues<- vector(mode="numeric", length=n)
name.pred<-vector(mode="character", length=n)
for(i in 1:ncol(predictors))
{
table(predictors[,i],response)
c<-chisq.test(predictors[,i],response)
pvalues[i]<-c$p.value
values[i]<-ifelse(round(pvalues[i],3)<0.05,1,0)
}
result<-data.frame(cbind(colnames(predictors),ifelse(round(pvalues,4)<0.001,"<0.001",round(pvalues,4)),values))
colnames(result)<-c("predictors","pvalues","Relationship")
result
## predictors pvalues Relationship
## 1 sex <0.001 1
## 2 cp <0.001 1
## 3 fbs 0.8393 0
## 4 restecg 0.0032 1
## 5 exang <0.001 1
## 6 slope <0.001 1
## 7 ca <0.001 1
## 8 thal <0.001 1
Interpretation Except fasting blood sugar all predictors(categorical) variables seems to be related to Response.
Correlation plot of numeric columns
#fetch numeric columns
numcols<-select_if(data_orig, is.numeric)
cor_heart <- cor(numcols)
cor_heart
## age trestbps chol thalach oldpeak
## age 1.0000000 0.27935091 0.213677957 -0.398521938 0.21001257
## trestbps 0.2793509 1.00000000 0.123174207 -0.046697728 0.19321647
## chol 0.2136780 0.12317421 1.000000000 -0.009939839 0.05395192
## thalach -0.3985219 -0.04669773 -0.009939839 1.000000000 -0.34418695
## oldpeak 0.2100126 0.19321647 0.053951920 -0.344186948 1.00000000
## heart_disease -0.2254387 -0.14493113 -0.085239105 0.421740934 -0.43069600
## heart_disease
## age -0.22543872
## trestbps -0.14493113
## chol -0.08523911
## thalach 0.42174093
## oldpeak -0.43069600
## heart_disease 1.00000000
#corr plot it
corrplot(cor_heart, method = "ellipse", type="upper",)
#another
ggcorrplot(cor_heart,lab = T)
Interpretation
thalach: maximum heart rate achieved oldpeak: ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
seems to be correlated with response variable… However , thalach & oldpeak also show case substantial correlation amongst themselves.
chol & age are also somewhat correlated trestbps(resting blood pressure) & age also have substantial corelation.
All these correlation amongst predictors are somewhat smaller in magnitude. Therefore as of now we need not remove them from analysis.
present <- round((sum(as.numeric(data$heart_disease))*100)/nrow(data),2)
absent<- round((100 - present),2)
pie(table(data$heart_disease), col=c('green','red'),labels = c(paste(as.character(absent),'heart disease absent %'),paste(as.character(present), 'heart disease present %')),main = 'Distribution of heart disease status as percentage')
levels(data$heart_disease) = c("No disease","Disease")
levels(data$sex) = c("female","male","")
mosaicplot(data$sex ~ data$heart_disease,
main="Disease by Gender", shade=FALSE,color=TRUE,
xlab="Gender", ylab="Heart disease")
boxplot(data$age ~ data$heart_disease,
main="Disease by Age",
ylab="Age",xlab="Heart disease")
kable(head(data)) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "250px")
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | heart_disease | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 53 | male | non-anginal pain | 130 | 246 | >120 | normal | 173 | no | 0.0 | 2 | 3 | 2 | 1 |
| 75 | 43 | female | non-anginal pain | 122 | 213 | <=120 | abnormal | 165 | no | 0.2 | 1 | 0 | 2 | 1 |
| 23 | 42 | male | asymptotic | 140 | 226 | <=120 | abnormal | 178 | no | 0.0 | 2 | 0 | 2 | 1 |
| 204 | 68 | male | non-anginal pain | 180 | 274 | >120 | normal | 150 | yes | 1.6 | 1 | 0 | 3 | 0 |
| 192 | 58 | male | asymptotic | 128 | 216 | <=120 | normal | 131 | yes | 2.2 | 1 | 3 | 3 | 0 |
| 257 | 58 | male | asymptotic | 128 | 259 | <=120 | normal | 130 | yes | 3.0 | 1 | 2 | 3 | 0 |
num_features <- select_if(data, is.numeric) %>% names()
data %>%
dplyr::select(heart_disease, one_of(num_features)) %>%
mutate(heart_disease=as.factor(heart_disease))%>%
gather(key = key, value = value,-heart_disease)%>%
ggplot(aes(x = value, fill = as.factor(heart_disease)))+
geom_histogram(alpha = 0.7)+
theme_stata()+
facet_wrap(~ key, scales = 'free',ncol = 2,nrow = 4)+
scale_fill_tableau()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(cp, fill = as.factor(heart_disease)))+
geom_bar(position = "fill")+
ggtitle("cp")
data %>%
ggplot(aes(x=sex,y=trestbps))+
geom_boxplot(fill="darkorange")+
xlab("Sex")+
ylab("BP")+
facet_grid(~cp)
data$heart_disease<-as.factor(data$heart_disease)
numcols<-select_if(data, is.numeric)
uns_df <- scale(numcols)
distance <- get_dist(uns_df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
k2 <- kmeans(uns_df,
center = 2,
nstart = 25 )
#table(k2$cluster, dnn=("Clusters"))
k2
## K-means clustering with 2 clusters of sizes 113, 99
##
## Cluster means:
## age trestbps chol thalach oldpeak
## 1 -0.6511378 -0.2758471 -0.3386202 0.5695421 -0.4314943
## 2 0.7432179 0.3148558 0.3865059 -0.6500835 0.4925136
##
## Clustering vector:
## 100 75 23 204 192 257 189 156 79 18 151 11 253 96 77 169 220 267
## 1 1 1 2 2 2 1 2 1 2 2 1 2 2 1 2 1 2
## 251 124 7 74 163 224 107 69 3 105 176 259 27 38 272 248 200 113
## 2 1 2 1 1 2 2 1 1 1 1 2 2 1 2 2 2 2
## 35 271 255 282 116 54 150 71 111 43 266 144 268 158 95 294 72 289
## 2 1 2 1 1 1 1 1 2 1 2 2 1 1 1 2 1 2
## 196 52 165 243 63 190 120 121 25 179 177 264 14 8 221 214 142 274
## 2 2 1 2 1 1 1 2 1 1 1 2 2 1 2 2 1 1
## 149 37 65 48 98 236 61 201 198 276 137 293 141 44 238 250 70 216
## 1 1 1 1 1 1 2 1 2 1 2 2 1 1 2 2 1 2
## 160 17 154 146 49 147 92 210 180 300 117 97 178 301 273 231 41 19
## 1 1 2 2 2 1 1 1 2 1 1 2 2 2 2 1 2 1
## 102 66 157 229 203 129 33 1 139 138 31 9 148 153 183 202 209 135
## 2 1 1 2 2 1 1 2 2 2 1 1 2 2 2 2 1 1
## 185 152 80 162 26 242 59 94 62 47 184 30 127 246 278 270 109 167
## 2 2 1 2 2 2 1 1 1 1 2 1 1 1 2 2 1 2
## 199 287 286 175 235 225 296 45 227 277 83 73 119 110 155 101 284 28
## 2 1 2 2 2 2 2 1 2 2 1 1 1 1 1 1 1 1
## 298 6 140 76 118 207 2 67 164 254 125 212 56 4 173 89 230 108
## 2 1 2 1 1 2 1 1 1 2 1 2 1 1 2 1 2 1
## 131 279 219 292 88 58 208 275 234 42 123 112 78 213 93 24 21 16
## 1 2 2 2 1 1 2 2 2 1 1 1 1 1 1 2 1 1
## 249 130 237 211 87 103 170 280 32 5 81 39 64 122
## 1 2 1 1 2 1 2 2 2 2 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 331.2674 451.4324
## (between_SS / total_SS = 25.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Cluster means for each attribute is depicted in summary
Visualization of Clusters
Plotiing Cluster using fviz_cluster
fviz_cluster(k2, data = uns_df)
Varying Visualizations by size
# k2
k3 <- kmeans(uns_df, centers = 3, nstart = 25)
k4 <- kmeans(uns_df, centers = 4, nstart = 25)
k5 <- kmeans(uns_df, centers = 5, nstart = 25)
p1 <- fviz_cluster(k2, geom = "point", data = uns_df)+
ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = uns_df)+
ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = uns_df)+
ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = uns_df)+
ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1,p2,p3,p4, nrow = 2)
Although this visual assessment tells us where true dilineations occur (or do not occur such as clusters 2 & 4 in the k = 5 graph) between clusters, it does not tell us what the optimal number of clusters is.
Determining Optimal Clusters
As you may recall the analyst specifies the number of clusters to use; preferably the analyst would like to use the optimal number of clusters. To aid the analyst, the following explains the three most popular methods for determining the optimal clusters, which includes:
Elbow method Silhouette method *Gap statistic
#finding optimal number of clusters
set.seed(13263635)
fviz_nbclust(uns_df, kmeans, method = "wss")
##### (B) Silhouette Method
#finding optimal number of clusters
set.seed(13263635)
fviz_nbclust(uns_df, kmeans, method = "silhouette")
##### (C) Compute gap statistic
# compute gap statistic
set.seed(13263635)
gap_stat <- clusGap(uns_df, FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = uns_df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 2
## logW E.logW gap SE.sim
## [1,] 5.056204 5.511244 0.4550395 0.01762836
## [2,] 4.902471 5.371684 0.4692130 0.01506240
## [3,] 4.823181 5.274316 0.4511350 0.01747658
## [4,] 4.757474 5.202080 0.4446056 0.01737994
## [5,] 4.703252 5.145987 0.4427348 0.01643548
## [6,] 4.659157 5.094961 0.4358043 0.01618537
## [7,] 4.621819 5.050962 0.4291425 0.01613949
## [8,] 4.573595 5.010673 0.4370777 0.01611355
## [9,] 4.536792 4.975277 0.4384854 0.01602263
## [10,] 4.504008 4.943096 0.4390881 0.01617199
#visualize
fviz_gap_stat(gap_stat)
We choose k=2 As most algorithms suggested the same, Hence now we are assured to use 2 clusters.
k2
## K-means clustering with 2 clusters of sizes 113, 99
##
## Cluster means:
## age trestbps chol thalach oldpeak
## 1 -0.6511378 -0.2758471 -0.3386202 0.5695421 -0.4314943
## 2 0.7432179 0.3148558 0.3865059 -0.6500835 0.4925136
##
## Clustering vector:
## 100 75 23 204 192 257 189 156 79 18 151 11 253 96 77 169 220 267
## 1 1 1 2 2 2 1 2 1 2 2 1 2 2 1 2 1 2
## 251 124 7 74 163 224 107 69 3 105 176 259 27 38 272 248 200 113
## 2 1 2 1 1 2 2 1 1 1 1 2 2 1 2 2 2 2
## 35 271 255 282 116 54 150 71 111 43 266 144 268 158 95 294 72 289
## 2 1 2 1 1 1 1 1 2 1 2 2 1 1 1 2 1 2
## 196 52 165 243 63 190 120 121 25 179 177 264 14 8 221 214 142 274
## 2 2 1 2 1 1 1 2 1 1 1 2 2 1 2 2 1 1
## 149 37 65 48 98 236 61 201 198 276 137 293 141 44 238 250 70 216
## 1 1 1 1 1 1 2 1 2 1 2 2 1 1 2 2 1 2
## 160 17 154 146 49 147 92 210 180 300 117 97 178 301 273 231 41 19
## 1 1 2 2 2 1 1 1 2 1 1 2 2 2 2 1 2 1
## 102 66 157 229 203 129 33 1 139 138 31 9 148 153 183 202 209 135
## 2 1 1 2 2 1 1 2 2 2 1 1 2 2 2 2 1 1
## 185 152 80 162 26 242 59 94 62 47 184 30 127 246 278 270 109 167
## 2 2 1 2 2 2 1 1 1 1 2 1 1 1 2 2 1 2
## 199 287 286 175 235 225 296 45 227 277 83 73 119 110 155 101 284 28
## 2 1 2 2 2 2 2 1 2 2 1 1 1 1 1 1 1 1
## 298 6 140 76 118 207 2 67 164 254 125 212 56 4 173 89 230 108
## 2 1 2 1 1 2 1 1 1 2 1 2 1 1 2 1 2 1
## 131 279 219 292 88 58 208 275 234 42 123 112 78 213 93 24 21 16
## 1 2 2 2 1 1 2 2 2 1 1 1 1 1 1 2 1 1
## 249 130 237 211 87 103 170 280 32 5 81 39 64 122
## 1 2 1 1 2 1 2 2 2 2 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 331.2674 451.4324
## (between_SS / total_SS = 25.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
#changed
datakmClustered<-data %>%
mutate(Cluster = k2$cluster-1)
#2 is 1 and 1 is 0
#table(datakmClustered$heart_disease,datakmClustered$Cluster, dnn = c("ture","clustered"))
mcr<-mean(datakmClustered$heart_disease!=(datakmClustered$Cluster))
outkmeans<-numcols %>%
mutate(Cluster = k2$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
kable(outkmeans) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| Cluster | age | trestbps | chol | thalach | oldpeak |
|---|---|---|---|---|---|
| 1 | 48.42478 | 126.0619 | 226.4425 | 163.5487 | 0.5035398 |
| 2 | 61.08081 | 137.1010 | 260.9697 | 136.1414 | 1.5434343 |
Interpretation The mean of all predictors(continious) appear to be different for both the clusters created.
Here we have implementd hierachical clustering for categorical & continious variables
The clustering process itself contains 3 distinctive steps:
Calculating dissimilarity matrix - is arguably the most important decision in clustering, and all your further steps are going to be based on the dissimilarity matrix you’ve made. Choosing the clustering method *Assessing clusters
Data types differences are important as dissimilarity matrix is based on distances between individual data points. While it is quite easy to imagine distances between numerical data points (Eucledian distances), categorical data (factors in R) does not seem as obvious. In order to calculate a dissimilarity matrix in this case, you would go for something called Gower distance
tree <- hclustvar(numcols, catcols)
plot(tree, cex = 0.6)
#rect.hclust(tree, k = 4, border = 2:4)
Stablity check using bootstraping
Evaluates the stability of partitions obtained from a hierarchy of p variables. This hierarchy is performed with hclustvar and the stability of the partitions of 2 to p-1 clusters is evaluated with a bootstrap approach. The boostrap approch is the following: hclustvar is applied to B boostrap samples of the n rows. The partitions of 2 to p-1 clusters obtained from the B bootstrap hierarchies are compared with the partitions from the initial hierarchy . The mean of the corrected Rand indices is plotted according to the number of clusters. This graphical representation helps in the determination of a suitable numbers of clusters
stab <- stability(tree, B=50) # "B=50" refers to the number of bootstrap samples to use in the estimation.
Intrepretation This proves that 2 given us the best number of clusters
Using Dissimilarity matrix
“Gower’s distance” is chosen by metric “gower” or automatically if some columns of x are not numeric. Also known as Gower’s coefficient (1971), expressed as a dissimilarity, this implies that a particular standardisation will be applied to each variable, and the “distance” between two units is the sum of all the variable-specific distances
d <- daisy(data[,-14], metric="gower")
fit <- hclust(d=d, method="ward.D") # Also try: method="ward.D"
# Cut tree into 4 groups
sub_grp <- cutree(fit, k = 2)
# Number of members in each cluster
table(sub_grp)
## sub_grp
## 1 2
## 153 59
plot(fit, cex = 0.6)
rect.hclust(fit, k = 2, border = 2:4)
Adding the clusters to our traing data
HCdata<-data %>%
mutate(HCcluster = sub_grp)
Misclassification calculation
HCdata$HCcluster[HCdata$HCcluster==2] <- 0
#table(HCdata$heart_disease,HCdata$HCcluster, dnn = c("ture","clustered"))
mcr<-mean(HCdata$heart_disease!=HCdata$HCcluster)
Model the numeric columns to PCA model, It is always performed on a symmetric correlation or covariance matrix. This means the matrix should be numeric and have standardized data.
datapca<-data[,1:13]
dt<-dummy.data.frame(datapca)
colnames(dt)
## [1] "age" "sexfemale" "sexmale"
## [4] "cpasymptotic" "cpatypical angina" "cpnon-anginal pain"
## [7] "trestbps" "chol" "fbs<=120"
## [10] "fbs>120" "restecgabnormal" "restecgdefinite"
## [13] "restecgnormal" "thalach" "exangno"
## [16] "exangyes" "oldpeak" "slope0"
## [19] "slope1" "slope2" "ca0"
## [22] "ca1" "ca2" "ca3"
## [25] "ca4" "thal0" "thal1"
## [28] "thal2" "thal3"
pca <- prcomp(dt, scale = TRUE)
pca
## Standard deviations (1, .., p=29):
## [1] 2.246490e+00 1.580613e+00 1.515313e+00 1.388388e+00 1.321554e+00
## [6] 1.308160e+00 1.272534e+00 1.189638e+00 1.144859e+00 1.114827e+00
## [11] 1.017576e+00 1.006372e+00 9.945358e-01 9.501946e-01 9.079434e-01
## [16] 8.639616e-01 8.510374e-01 7.905021e-01 7.822965e-01 6.214353e-01
## [21] 5.717560e-01 9.976420e-16 6.426039e-16 5.797642e-16 5.454954e-16
## [26] 4.571771e-16 4.554836e-16 3.202758e-16 1.116381e-16
##
## Rotation (n x k) = (29 x 29):
## PC1 PC2 PC3 PC4
## age 0.2040227737 -1.833293e-01 0.094779873 -0.13731378
## sexfemale -0.1413086011 -5.394177e-01 -0.039906990 -0.10547773
## sexmale 0.1413086011 5.394177e-01 0.039906990 0.10547773
## cpasymptotic 0.2942003135 -2.461245e-02 -0.124389538 0.04018123
## cpatypical angina -0.1600597333 -5.511576e-06 0.032650698 0.05734463
## cpnon-anginal pain -0.1903397143 2.698063e-02 0.109385163 -0.09137243
## trestbps 0.0935479157 -1.204104e-01 0.222487435 0.12005801
## chol 0.0849067274 -2.803531e-01 -0.029034968 0.08955058
## fbs<=120 -0.0554027651 2.904020e-02 -0.591801533 0.18297854
## fbs>120 0.0554027651 -2.904020e-02 0.591801533 -0.18297854
## restecgabnormal -0.1736418581 7.301015e-02 -0.134795878 -0.54120573
## restecgdefinite 0.0625879247 -1.361349e-01 -0.006726700 0.07124934
## restecgnormal 0.1615699524 -4.669831e-02 0.136120641 0.52752438
## thalach -0.2642768993 1.226385e-01 0.027373082 0.15232430
## exangno -0.2662813236 5.344106e-02 0.103829439 0.13787861
## exangyes 0.2662813236 -5.344106e-02 -0.103829439 -0.13787861
## oldpeak 0.3018129056 -1.014582e-01 -0.013299747 0.02003847
## slope0 0.1129508182 -6.226086e-02 0.230618976 0.14454194
## slope1 0.2404813692 -1.326143e-01 -0.214508548 -0.02652619
## slope2 -0.2890528978 1.593875e-01 0.110832619 -0.03776438
## ca0 -0.1863277619 -2.539686e-02 -0.017259545 0.31810714
## ca1 0.0801730543 5.412041e-02 -0.100692516 -0.18736497
## ca2 0.1440907534 -5.350135e-02 0.095050946 -0.17389422
## ca3 0.0880689230 -1.975906e-02 0.062237127 -0.05200381
## ca4 -0.0826325459 1.248690e-01 -0.003714119 -0.09886796
## thal0 -0.0003292663 -1.790613e-02 0.071960874 -0.04474856
## thal1 0.1071281011 3.217524e-03 0.087323525 0.08306735
## thal2 -0.2887953143 -2.791867e-01 -0.032234467 0.04175985
## thal3 0.2439394038 2.892717e-01 -0.024699924 -0.07532716
## PC5 PC6 PC7 PC8
## age 0.10549430 -0.213452492 0.1616819119 0.023604430
## sexfemale 0.06696998 0.011344399 -0.0132664051 0.046115109
## sexmale -0.06696998 -0.011344399 0.0132664051 -0.046115109
## cpasymptotic -0.01260643 0.251575980 0.2184059522 0.067338655
## cpatypical angina 0.17794918 0.018666681 0.2587982109 0.080642858
## cpnon-anginal pain -0.13306274 -0.291143340 -0.4529931814 -0.140368316
## trestbps -0.05039087 0.052792830 0.1368246765 -0.088519611
## chol 0.10490839 -0.049599567 0.1036958509 -0.249146156
## fbs<=120 -0.01650000 -0.068522160 0.1133872283 -0.113186317
## fbs>120 0.01650000 0.068522160 -0.1133872283 0.113186317
## restecgabnormal -0.24450421 0.174292551 0.0922261543 0.011106655
## restecgdefinite -0.29057685 0.022878704 0.1374788674 -0.408638140
## restecgnormal 0.30073749 -0.178747710 -0.1188272919 0.067910999
## thalach 0.06430739 0.087127349 0.0874879334 -0.015641141
## exangno -0.23790773 -0.361710871 0.2362339223 0.163201029
## exangyes 0.23790773 0.361710871 -0.2362339223 -0.163201029
## oldpeak -0.23877351 0.001229302 -0.0009125418 -0.162923697
## slope0 -0.21883738 0.152638210 -0.0091306367 -0.371342645
## slope1 -0.20305279 -0.201743027 -0.2071383509 0.402781053
## slope2 0.29883633 0.132742164 0.2098663948 -0.235475750
## ca0 -0.22856684 0.410141531 -0.2145056094 0.160399335
## ca1 0.34911401 -0.343597404 -0.1291405671 -0.216397262
## ca2 0.13349020 0.006067372 0.4103179475 0.227637787
## ca3 -0.25357146 -0.264070192 0.0683355585 -0.215395509
## ca4 -0.07387878 -0.084082911 -0.0300039738 -0.205304036
## thal0 -0.02271613 0.112158906 -0.2015978151 0.170508454
## thal1 -0.24274831 -0.036225835 0.2630957285 0.114121348
## thal2 0.05647509 0.011523649 -0.0538623012 -0.080493188
## thal3 0.06720959 -0.016318483 -0.0350554293 -0.008128518
## PC9 PC10 PC11 PC12
## age 0.087715997 -0.047030462 -0.0514849619 0.25344279
## sexfemale 0.110497288 0.052789321 -0.0333275463 0.03966658
## sexmale -0.110497288 -0.052789321 0.0333275463 -0.03966658
## cpasymptotic -0.070216542 0.239407715 -0.0918142199 0.11585403
## cpatypical angina 0.002363250 -0.575654625 0.1913643070 -0.17791385
## cpnon-anginal pain 0.075009077 0.212748155 -0.0573211994 0.01987070
## trestbps 0.198823957 0.170243940 0.0809428399 0.17893968
## chol 0.378954437 -0.034805948 -0.1344075528 -0.11996991
## fbs<=120 0.022985625 0.067737669 -0.0632291693 0.10386450
## fbs>120 -0.022985625 -0.067737669 0.0632291693 -0.10386450
## restecgabnormal 0.108859690 -0.058035125 0.0892992035 0.07682543
## restecgdefinite -0.113189843 -0.151685506 -0.1707762492 -0.02107557
## restecgnormal -0.086991199 0.087377357 -0.0562915673 -0.07276366
## thalach 0.139934444 0.164750061 0.0637699306 -0.19026526
## exangno 0.157374818 0.056636761 0.0179033287 0.17592834
## exangyes -0.157374818 -0.056636761 -0.0179033287 -0.17592834
## oldpeak 0.080997686 0.050272836 0.1745509851 -0.06613846
## slope0 0.005725705 -0.003227085 0.3009325805 0.27827997
## slope1 -0.018519962 -0.046593168 0.1323699612 -0.16649098
## slope2 0.015861704 0.047727286 -0.2650235573 0.04197840
## ca0 0.179411132 -0.115343125 0.0266367570 0.08414051
## ca1 -0.116144689 -0.196811267 0.1784967480 0.38220875
## ca2 -0.040740958 0.367790016 -0.0002590246 -0.16793850
## ca3 -0.004954165 -0.202539387 -0.3768853491 -0.48613501
## ca4 -0.236604972 0.433208186 0.0549279697 -0.13469650
## thal0 -0.064865516 -0.047711035 -0.6682429521 0.31460817
## thal1 -0.418293722 -0.140089959 -0.0747633927 0.20786286
## thal2 -0.306678462 0.080124643 0.1747356815 -0.13999203
## thal3 0.536678636 -0.003188517 -0.0086525893 -0.02242725
## PC13 PC14 PC15 PC16
## age 0.071981397 0.38373324 -0.214537681 0.307043040
## sexfemale 0.009422092 -0.15754580 0.132702943 -0.139797586
## sexmale -0.009422092 0.15754580 -0.132702943 0.139797586
## cpasymptotic -0.289722514 0.11751257 -0.145408546 -0.387590877
## cpatypical angina 0.295940515 -0.28920729 -0.175698766 0.145496306
## cpnon-anginal pain 0.073275472 0.10991485 0.304395008 0.304719637
## trestbps 0.558762233 0.16658326 -0.133346757 -0.209913718
## chol 0.099533027 0.17877867 -0.014999030 0.088998760
## fbs<=120 0.074834503 -0.01981791 -0.011788835 0.175395522
## fbs>120 -0.074834503 0.01981791 0.011788835 -0.175395522
## restecgabnormal 0.022307302 0.07800898 -0.058828058 0.015393105
## restecgdefinite 0.019284283 -0.15737992 0.297537103 -0.160146237
## restecgnormal -0.026040336 -0.04758982 0.001302855 0.015572132
## thalach 0.053317476 -0.18721599 0.268279182 -0.191364311
## exangno -0.159850407 -0.06848620 -0.108328418 -0.098331517
## exangyes 0.159850407 0.06848620 0.108328418 0.098331517
## oldpeak -0.118376039 -0.28116180 -0.070437461 0.130630351
## slope0 -0.183468090 -0.24924191 -0.120746443 0.257749536
## slope1 0.146523898 -0.02438587 -0.004764372 -0.115899019
## slope2 -0.064198930 0.13479804 0.058299086 0.000817087
## ca0 0.035901178 0.21531961 -0.019033132 0.040677403
## ca1 -0.061019476 -0.10245025 0.065584851 -0.307156634
## ca2 -0.062979887 -0.21493774 0.190248341 0.400493663
## ca3 -0.120805688 0.12340610 -0.181065955 -0.107273784
## ca4 0.467979249 -0.17267244 -0.331705364 -0.135558575
## thal0 0.096985482 -0.42619396 -0.258192820 0.109837958
## thal1 0.261684675 0.15940947 0.454059167 0.020303257
## thal2 -0.183500247 0.14317523 -0.276835144 0.055709181
## thal3 0.039190364 -0.14120018 0.110723026 -0.089443704
## PC17 PC18 PC19 PC20
## age 0.111102346 0.10755715 -0.119481022 0.634592582
## sexfemale -0.262204268 0.11173744 -0.071863869 0.056457644
## sexmale 0.262204268 -0.11173744 0.071863869 -0.056457644
## cpasymptotic 0.004321684 0.03778667 0.081334668 0.038321037
## cpatypical angina -0.007070978 0.04754370 -0.050109424 0.003776925
## cpnon-anginal pain 0.001099695 -0.08065819 -0.047785079 -0.045118642
## trestbps -0.024729046 -0.51745690 -0.228395658 -0.207888802
## chol 0.311178051 0.17199235 0.614251536 -0.293268601
## fbs<=120 -0.107830779 -0.08751724 -0.050378871 0.008164833
## fbs>120 0.107830779 0.08751724 0.050378871 -0.008164833
## restecgabnormal -0.003828996 -0.09130157 0.123980608 -0.025264759
## restecgdefinite 0.526368999 0.16138195 -0.389205946 0.081912822
## restecgnormal -0.097955937 0.06011090 -0.048740716 0.009429520
## thalach 0.132133530 -0.34932228 0.350327215 0.613049818
## exangno 0.054993699 0.07387301 -0.001615882 -0.074831716
## exangyes -0.054993699 -0.07387301 0.001615882 0.074831716
## oldpeak -0.017275377 -0.19641737 -0.044658121 0.014852663
## slope0 -0.236779204 0.01835022 0.205780747 0.076968959
## slope1 0.198654712 -0.03992670 0.056407624 0.034717513
## slope2 -0.092347221 0.03153132 -0.147335259 -0.068640573
## ca0 0.033486402 0.15915730 -0.067902495 0.067031797
## ca1 0.082632208 -0.14084179 0.119478608 -0.037244599
## ca2 0.129741079 -0.08790932 -0.081575166 -0.148382938
## ca3 -0.376830513 -0.21119126 0.022227162 0.086051127
## ca4 -0.043502495 0.46680415 0.089227879 0.107268300
## thal0 0.142040654 -0.17122845 0.143891587 0.023977683
## thal1 -0.250400560 0.06807129 0.298951277 -0.021055632
## thal2 0.217044473 -0.19542102 -0.023151988 -0.039387046
## thal3 -0.127236152 0.20154651 -0.153751053 0.046198032
## PC21 PC22 PC23 PC24
## age -0.055835074 -5.900762e-16 0.000000e+00 0.000000e+00
## sexfemale -0.009716911 -4.236811e-01 9.456792e-02 -1.325497e-01
## sexmale 0.009716911 -4.236811e-01 9.456792e-02 -1.325497e-01
## cpasymptotic 0.024069859 4.846712e-01 -2.574912e-02 -2.320548e-01
## cpatypical angina -0.020759097 3.649970e-01 -1.939119e-02 -1.747562e-01
## cpnon-anginal pain -0.009246758 4.422046e-01 -2.349300e-02 -2.117223e-01
## trestbps 0.091112831 6.938894e-17 -1.665335e-16 4.857226e-17
## chol -0.044978707 2.359224e-16 -7.285839e-17 -3.469447e-17
## fbs<=120 0.055246780 -1.260502e-01 -6.621016e-01 -1.250737e-01
## fbs>120 -0.055246780 -1.260502e-01 -6.621016e-01 -1.250737e-01
## restecgabnormal -0.002668114 -8.042875e-02 1.747689e-01 -1.411757e-01
## restecgdefinite 0.131111145 -1.554999e-02 3.378960e-02 -2.729472e-02
## restecgnormal -0.022684782 -8.041444e-02 1.747378e-01 -1.411505e-01
## thalach 0.004156898 2.220446e-16 -1.387779e-17 6.938894e-17
## exangno -0.039936533 3.809297e-02 -6.463858e-02 5.090787e-01
## exangyes 0.039936533 3.809297e-02 -6.463858e-02 5.090787e-01
## oldpeak -0.783813396 4.024558e-16 -3.608225e-16 -3.469447e-18
## slope0 0.400630885 3.722987e-02 -1.153131e-02 4.991376e-02
## slope1 0.100211495 8.338771e-02 -2.582792e-02 1.117972e-01
## slope2 -0.277298232 8.392351e-02 -2.599387e-02 1.125155e-01
## ca0 -0.122531418 -5.815027e-02 1.126057e-01 -5.812955e-02
## ca1 -0.043260285 -4.542485e-02 8.796343e-02 -4.540866e-02
## ca2 0.158659708 -4.112426e-02 7.963552e-02 -4.110961e-02
## ca3 0.133158564 -2.765535e-02 5.355351e-02 -2.764550e-02
## ca4 -0.079020446 -1.413538e-02 2.737260e-02 -1.413034e-02
## thal0 -0.001634511 1.011522e-02 -4.786710e-03 5.992430e-02
## thal1 -0.158154115 2.510435e-02 -1.187984e-02 1.487225e-01
## thal2 0.005608392 5.192370e-02 -2.457126e-02 3.076049e-01
## thal3 0.073239274 5.046037e-02 -2.387878e-02 2.989359e-01
## PC25 PC26 PC27 PC28
## age 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## sexfemale 1.108848e-01 -1.177609e-01 3.959338e-02 4.135986e-01
## sexmale 1.108848e-01 -1.177609e-01 3.959338e-02 4.135986e-01
## cpasymptotic 1.138542e-01 -1.834704e-01 6.473346e-02 2.486277e-01
## cpatypical angina 8.574149e-02 -1.381682e-01 4.874959e-02 1.872370e-01
## cpnon-anginal pain 1.038783e-01 -1.673948e-01 5.906155e-02 2.268431e-01
## trestbps -1.283695e-16 -2.081668e-17 3.469447e-18 5.551115e-17
## chol -3.469447e-18 0.000000e+00 7.979728e-17 5.551115e-17
## fbs<=120 2.373989e-02 -1.191573e-01 1.067599e-01 -6.213498e-02
## fbs>120 2.373989e-02 -1.191573e-01 1.067599e-01 -6.213498e-02
## restecgabnormal -2.402886e-01 -5.316582e-01 1.220949e-01 -2.798720e-01
## restecgdefinite -4.645708e-02 -1.027901e-01 2.360568e-02 -5.411009e-02
## restecgnormal -2.402458e-01 -5.315636e-01 1.220732e-01 -2.798222e-01
## thalach 1.387779e-16 -7.632783e-17 1.075529e-16 -1.665335e-16
## exangno 2.722711e-01 -3.529037e-01 -1.840042e-01 4.928564e-02
## exangyes 2.722711e-01 -3.529037e-01 -1.840042e-01 4.928564e-02
## oldpeak 3.469447e-17 -4.163336e-17 3.087808e-16 -1.665335e-16
## slope0 7.985738e-03 5.722833e-02 4.069696e-02 -1.241536e-01
## slope1 1.788651e-02 1.281804e-01 9.115332e-02 -2.780801e-01
## slope2 1.800144e-02 1.290040e-01 9.173902e-02 -2.798669e-01
## ca0 4.606204e-01 1.950834e-02 3.775810e-01 -1.600399e-01
## ca1 3.598197e-01 1.523920e-02 2.949523e-01 -1.250173e-01
## ca2 3.257538e-01 1.379643e-02 2.670278e-01 -1.131813e-01
## ca3 2.190638e-01 9.277859e-03 1.795716e-01 -7.611245e-02
## ca4 1.119693e-01 4.742157e-03 9.178373e-02 -3.890307e-02
## thal0 -5.368634e-02 4.071078e-03 9.308769e-02 4.239949e-02
## thal1 -1.332408e-01 1.010376e-02 2.310286e-01 1.052287e-01
## thal2 -2.755841e-01 2.089776e-02 4.778401e-01 2.176461e-01
## thal3 -2.678175e-01 2.030881e-02 4.643734e-01 2.115123e-01
## PC29
## age 0.000000e+00
## sexfemale 3.085245e-01
## sexmale 3.085245e-01
## cpasymptotic 1.212119e-01
## cpatypical angina 9.128250e-02
## cpnon-anginal pain 1.105914e-01
## trestbps 4.163336e-17
## chol -1.387779e-17
## fbs<=120 -8.305632e-03
## fbs>120 -8.305632e-03
## restecgabnormal 1.828014e-02
## restecgdefinite 3.534258e-03
## restecgnormal 1.827689e-02
## thalach 5.551115e-17
## exangno 1.582416e-02
## exangyes 1.582416e-02
## oldpeak 1.387779e-17
## slope0 2.562918e-01
## slope1 5.740442e-01
## slope2 5.777327e-01
## ca0 -1.313575e-01
## ca1 -1.026117e-01
## ca2 -9.289694e-02
## ca3 -6.247158e-02
## ca4 -3.193086e-02
## thal0 -6.833711e-03
## thal1 -1.696017e-02
## thal2 -3.507898e-02
## thal3 -3.409037e-02
pcaCharts <- function(x) {
x.var <- x$sdev ^ 2
x.pvar <- x.var/sum(x.var)
print("proportions of variance:")
print(x.pvar)
par(mfrow=c(2,2))
plot(x.pvar,xlab="Principal component", ylab="Proportion of variance explained", ylim=c(0,1), type='b')
plot(cumsum(x.pvar),xlab="Principal component", ylab="Cumulative Proportion of variance explained", ylim=c(0,1), type='b')
screeplot(x)
screeplot(x,type="l")
par(mfrow=c(1,1))
}
summ<-summary(pca)
summ
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.246 1.58061 1.51531 1.38839 1.32155 1.30816
## Proportion of Variance 0.174 0.08615 0.07918 0.06647 0.06022 0.05901
## Cumulative Proportion 0.174 0.26017 0.33935 0.40582 0.46605 0.52506
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.27253 1.1896 1.1449 1.11483 1.01758 1.00637
## Proportion of Variance 0.05584 0.0488 0.0452 0.04286 0.03571 0.03492
## Cumulative Proportion 0.58090 0.6297 0.6749 0.71775 0.75346 0.78838
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.99454 0.95019 0.90794 0.86396 0.85104 0.79050
## Proportion of Variance 0.03411 0.03113 0.02843 0.02574 0.02497 0.02155
## Cumulative Proportion 0.82249 0.85362 0.88205 0.90779 0.93276 0.95431
## PC19 PC20 PC21 PC22 PC23
## Standard deviation 0.7823 0.62144 0.57176 9.976e-16 6.426e-16
## Proportion of Variance 0.0211 0.01332 0.01127 0.000e+00 0.000e+00
## Cumulative Proportion 0.9754 0.98873 1.00000 1.000e+00 1.000e+00
## PC24 PC25 PC26 PC27 PC28
## Standard deviation 5.798e-16 5.455e-16 4.572e-16 4.555e-16 3.203e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC29
## Standard deviation 1.116e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
fviz_screeplot(pca)
#pcaCharts(pca)
Inference: Top 20 predictors can explain more than 98% of variance
pcavar<-pca$sdev^2
pcasd<-pca$sdev
check variance of first 20 components as per sumary
#pcavar[1:20]
prop_varex <- pcavar/sum(pcavar)
prop_varex[1:20]
## [1] 0.17402475 0.08614958 0.07917839 0.06646966 0.06022429 0.05900978
## [7] 0.05583944 0.04880134 0.04519660 0.04285651 0.03570555 0.03492360
## [13] 0.03410694 0.03113344 0.02842625 0.02573895 0.02497464 0.02154805
## [19] 0.02110303 0.01331661
#scree plot
plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
Hence , we can se that out of 31 predictors(hot encoded)predictors 21 predictors contribute to 98% of variance proportion of variance explained
fviz_pca(pca)
#biplot(pca,scale=0, cex=.7)
pca.out <- pca
pca.out$rotation <- -pca.out$rotation
pca.out$x <- -pca.out$x
fviz_pca(pca.out)
impPredictors<-rownames(data.frame(pca$rotation))[1:20]
notsoimp<-rownames(data.frame(pca$rotation))[21:29]
#[1:21]
s_df <- apply(dt, 2, scale)
head(s_df)
## age sexfemale sexmale cpasymptotic cpatypical angina
## [1,] -0.1470709 -0.6560432 0.6560432 -1.0760010 -0.4511991
## [2,] -1.2488032 1.5170999 -1.5170999 -1.0760010 -0.4511991
## [3,] -1.3589764 -0.6560432 0.6560432 0.9249833 -0.4511991
## [4,] 1.5055276 -0.6560432 0.6560432 -1.0760010 -0.4511991
## [5,] 0.4037953 -0.6560432 0.6560432 0.9249833 -0.4511991
## [6,] 0.4037953 -0.6560432 0.6560432 0.9249833 -0.4511991
## cpnon-anginal pain trestbps chol fbs<=120 fbs>120
## [1,] 1.551755 -0.06512095 0.07211863 -2.3661080 2.3661080
## [2,] 1.551755 -0.49320287 -0.62093350 0.4206414 -0.4206414
## [3,] -0.641392 0.46998144 -0.34791296 0.4206414 -0.4206414
## [4,] 1.551755 2.61039102 0.66016286 -2.3661080 2.3661080
## [5,] -0.641392 -0.17214143 -0.55792876 0.4206414 -0.4206414
## [6,] -0.641392 -0.17214143 0.34513917 0.4206414 -0.4206414
## restecgabnormal restecgdefinite restecgnormal thalach exangno
## [1,] -0.9976387 -0.09735957 1.0166431 0.99012710 0.641392
## [2,] 0.9976387 -0.09735957 -0.9789896 0.63412635 0.641392
## [3,] 0.9976387 -0.09735957 -0.9789896 1.21262758 0.641392
## [4,] -0.9976387 -0.09735957 1.0166431 -0.03337507 -1.551755
## [5,] -0.9976387 -0.09735957 1.0166431 -0.87887687 -1.551755
## [6,] -0.9976387 -0.09735957 1.0166431 -0.92337696 -1.551755
## exangyes oldpeak slope0 slope1 slope2 ca0
## [1,] -0.641392 -0.8789192 -0.2333843 -0.8904227 0.988271 -1.2687219
## [2,] -0.641392 -0.7012074 -0.2333843 1.1177647 -1.007095 0.7844769
## [3,] -0.641392 -0.8789192 -0.2333843 -0.8904227 0.988271 0.7844769
## [4,] 1.551755 0.5427756 -0.2333843 1.1177647 -1.007095 0.7844769
## [5,] 1.551755 1.0759111 -0.2333843 1.1177647 -1.007095 -1.2687219
## [6,] 1.551755 1.7867585 -0.2333843 1.1177647 -1.007095 -1.2687219
## ca1 ca2 ca3 ca4 thal0 thal1
## [1,] -0.4587279 -0.3971429 4.0728430 -0.1195256 -0.09735957 -0.2549872
## [2,] -0.4587279 -0.3971429 -0.2443706 -0.1195256 -0.09735957 -0.2549872
## [3,] -0.4587279 -0.3971429 -0.2443706 -0.1195256 -0.09735957 -0.2549872
## [4,] -0.4587279 -0.3971429 -0.2443706 -0.1195256 -0.09735957 -0.2549872
## [5,] -0.4587279 -0.3971429 4.0728430 -0.1195256 -0.09735957 -0.2549872
## [6,] -0.4587279 2.5061083 -0.2443706 -0.1195256 -0.09735957 -0.2549872
## thal2 thal3
## [1,] 0.8819446 -0.7611464
## [2,] 0.8819446 -0.7611464
## [3,] 0.8819446 -0.7611464
## [4,] -1.1285097 1.3076105
## [5,] -1.1285097 1.3076105
## [6,] -1.1285097 1.3076105
cov_df <- cov(s_df) #covariance metric
cov_df
## age sexfemale sexmale cpasymptotic
## age 1.00000000 0.044895676 -0.044895676 0.21713321
## sexfemale 0.04489568 1.000000000 -1.000000000 -0.15281509
## sexmale -0.04489568 -1.000000000 1.000000000 0.15281509
## cpasymptotic 0.21713321 -0.152815087 0.152815087 1.00000000
## cpatypical angina -0.10551936 0.085711609 -0.085711609 -0.48779159
## cpnon-anginal pain -0.15088935 0.096743886 -0.096743886 -0.69340921
## trestbps 0.22722685 0.044151643 -0.044151643 0.10974448
## chol 0.27676712 0.231176419 -0.231176419 0.09868066
## fbs<=120 -0.14010101 0.018953785 -0.018953785 0.05834037
## fbs>120 0.14010101 -0.018953785 0.018953785 -0.05834037
## restecgabnormal -0.13074969 0.102749367 -0.102749367 -0.17029702
## restecgdefinite 0.02333377 0.042114708 -0.042114708 0.09048279
## restecgnormal 0.12626084 -0.110911514 0.110911514 0.15283040
## thalach -0.44668768 0.042165244 -0.042165244 -0.31587407
## exangno -0.14455863 0.083958351 -0.083958351 -0.36730674
## exangyes 0.14455863 -0.083958351 0.083958351 0.36730674
## oldpeak 0.22217640 -0.101633444 0.101633444 0.39467060
## slope0 0.06180576 -0.061184697 0.061184697 0.13158849
## slope1 0.20711257 -0.007804844 0.007804844 0.25620000
## slope2 -0.23320834 0.034897564 -0.034897564 -0.31293916
## ca0 -0.33649460 0.094161419 -0.094161419 -0.16440288
## ca1 0.17924800 -0.058737722 0.058737722 -0.02233923
## ca2 0.25814376 0.007334288 -0.007334288 0.28649398
## ca3 0.10816025 -0.072149276 0.072149276 0.02240201
## ca4 -0.15438162 -0.078785558 0.078785558 -0.12921915
## thal0 -0.01977525 0.042114708 -0.042114708 -0.00738635
## thal1 0.12299163 -0.125248528 0.125248528 0.15810851
## thal2 -0.18880235 0.415686404 -0.415686404 -0.32393904
## thal3 0.13705258 -0.373871529 0.373871529 0.25615387
## cpatypical angina cpnon-anginal pain trestbps
## age -0.105519358 -0.15088935 0.227226847
## sexfemale 0.085711609 0.09674389 0.044151643
## sexmale -0.085711609 -0.09674389 -0.044151643
## cpasymptotic -0.487791585 -0.69340921 0.109744476
## cpatypical angina 1.000000000 -0.29076701 -0.042324705
## cpnon-anginal pain -0.290767011 1.00000000 -0.085348723
## trestbps -0.042324705 -0.08534872 1.000000000
## chol -0.014909937 -0.09585064 0.173933079
## fbs<=120 0.015228916 -0.07651301 -0.176015299
## fbs>120 -0.015228916 0.07651301 0.176015299
## restecgabnormal 0.075377836 0.12443420 -0.161922690
## restecgdefinite -0.044136741 -0.06274160 0.082615022
## restecgnormal -0.066856398 -0.11232382 0.145975995
## thalach 0.203416734 0.17830788 -0.032349481
## exangno 0.180296390 0.25376344 -0.041462290
## exangyes -0.180296390 -0.25376344 0.041462290
## oldpeak -0.260817808 -0.21729253 0.171664805
## slope0 -0.105801842 -0.05689628 0.186632633
## slope1 -0.201353845 -0.11460579 0.004891033
## slope2 0.247003776 0.13911424 -0.087653149
## ca0 0.097079454 0.10006144 0.010192898
## ca1 -0.009367502 0.03221652 -0.103336727
## ca2 -0.070360357 -0.25593146 0.083719831
## ca3 -0.056416933 0.02201336 0.001528670
## ca4 -0.054185415 0.18635338 0.043588719
## thal0 -0.044136741 0.04452629 -0.016839045
## thal1 -0.010867931 -0.16432186 0.121461771
## thal2 0.171964956 0.21310776 -0.130941937
## thal3 -0.162697433 -0.14646236 0.077686762
## chol fbs<=120 fbs>120 restecgabnormal
## age 0.276767122 -0.140101006 0.140101006 -1.307497e-01
## sexfemale 0.231176419 0.018953785 -0.018953785 1.027494e-01
## sexmale -0.231176419 -0.018953785 0.018953785 -1.027494e-01
## cpasymptotic 0.098680662 0.058340366 -0.058340366 -1.702970e-01
## cpatypical angina -0.014909937 0.015228916 -0.015228916 7.537784e-02
## cpnon-anginal pain -0.095850643 -0.076513014 0.076513014 1.244342e-01
## trestbps 0.173933079 -0.176015299 0.176015299 -1.619227e-01
## chol 1.000000000 0.054951665 -0.054951665 -1.678145e-01
## fbs<=120 0.054951665 1.000000000 -1.000000000 5.270463e-02
## fbs>120 -0.054951665 -1.000000000 1.000000000 -5.270463e-02
## restecgabnormal -0.167814510 0.052704628 -0.052704628 1.000000e+00
## restecgdefinite 0.164215958 0.041147560 -0.041147560 -9.759001e-02
## restecgnormal 0.136089437 -0.060670843 0.060670843 -9.813068e-01
## thalach -0.060020801 0.079343283 -0.079343283 1.100407e-01
## exangno -0.116765457 0.047547373 -0.047547373 1.244342e-01
## exangyes 0.116765457 -0.047547373 0.047547373 -1.244342e-01
## oldpeak 0.160539105 -0.032239431 0.032239431 -1.903161e-01
## slope0 0.076902538 -0.198393594 0.198393594 -1.488688e-01
## slope1 0.062521588 0.084572764 -0.084572764 -1.329300e-01
## slope2 -0.096237670 0.003977885 -0.003977885 1.981220e-01
## ca0 -0.060522050 0.075212137 -0.075212137 6.795476e-02
## ca1 0.007601576 0.055018424 -0.055018424 -6.213698e-02
## ca2 0.046693337 -0.138907417 0.138907417 -4.118098e-02
## ca3 0.111813163 -0.067777209 0.067777209 -4.082483e-02
## ca4 -0.130055856 0.050515682 -0.050515682 1.198085e-01
## thal0 -0.066903515 -0.095153732 0.095153732 1.477290e-21
## thal1 -0.086650491 -0.056991912 0.056991912 -9.830414e-02
## thal2 -0.017481103 0.078656288 -0.078656288 1.235744e-01
## thal3 0.074508601 -0.033509072 0.033509072 -7.825107e-02
## restecgdefinite restecgnormal thalach exangno
## age 0.02333377 0.126260842 -0.44668768 -0.144558631
## sexfemale 0.04211471 -0.110911514 0.04216524 0.083958351
## sexmale -0.04211471 0.110911514 -0.04216524 -0.083958351
## cpasymptotic 0.09048279 0.152830400 -0.31587407 -0.367306744
## cpatypical angina -0.04413674 -0.066856398 0.20341673 0.180296390
## cpnon-anginal pain -0.06274160 -0.112323820 0.17830788 0.253763441
## trestbps 0.08261502 0.145975995 -0.03234948 -0.041462290
## chol 0.16421596 0.136089437 -0.06002080 -0.116765457
## fbs<=120 0.04114756 -0.060670843 0.07934328 0.047547373
## fbs>120 -0.04114756 0.060670843 -0.07934328 -0.047547373
## restecgabnormal -0.09759001 -0.981306763 0.11004073 0.124434203
## restecgdefinite 1.00000000 -0.095765734 -0.09685521 -0.044526294
## restecgnormal -0.09576573 1.000000000 -0.09133113 -0.115846170
## thalach -0.09685521 -0.091331128 1.00000000 0.339733179
## exangno -0.04452629 -0.115846170 0.33973318 1.000000000
## exangyes 0.04452629 0.115846170 -0.33973318 -1.000000000
## oldpeak 0.25301026 0.141424563 -0.36116742 -0.311915282
## slope0 0.19716716 0.110768471 -0.13589012 -0.083359664
## slope1 0.01111941 0.130803504 -0.37037573 -0.219365730
## slope2 -0.09851505 -0.179107137 0.42829422 0.254944941
## ca0 -0.02368466 -0.063386882 0.25450840 0.156031422
## ca1 -0.04487322 0.070825313 -0.17281213 -0.141494963
## ca2 -0.03884891 0.048700650 -0.08740559 -0.106187357
## ca3 0.18725248 0.004622502 -0.17845885 0.022860023
## ca4 -0.01169211 -0.117568852 0.11178090 0.077026063
## thal0 -0.00952381 0.001841649 -0.06638391 -0.044526294
## thal1 0.17843912 0.063816309 -0.15588310 -0.008562678
## thal2 -0.01205962 -0.121264363 0.25684204 0.246621467
## thal3 -0.07445609 0.092662810 -0.17343034 -0.240587750
## exangyes oldpeak slope0 slope1
## age 0.144558631 0.22217640 0.06180576 0.207112568
## sexfemale -0.083958351 -0.10163344 -0.06118470 -0.007804844
## sexmale 0.083958351 0.10163344 0.06118470 0.007804844
## cpasymptotic 0.367306744 0.39467060 0.13158849 0.256200001
## cpatypical angina -0.180296390 -0.26081781 -0.10580184 -0.201353845
## cpnon-anginal pain -0.253763441 -0.21729253 -0.05689628 -0.114605794
## trestbps 0.041462290 0.17166480 0.18663263 0.004891033
## chol 0.116765457 0.16053910 0.07690254 0.062521588
## fbs<=120 -0.047547373 -0.03223943 -0.19839359 0.084572764
## fbs>120 0.047547373 0.03223943 0.19839359 -0.084572764
## restecgabnormal -0.124434203 -0.19031606 -0.14886879 -0.132930031
## restecgdefinite 0.044526294 0.25301026 0.19716716 0.011119408
## restecgnormal 0.115846170 0.14142456 0.11076847 0.130803504
## thalach -0.339733179 -0.36116742 -0.13589012 -0.370375730
## exangno -1.000000000 -0.31191528 -0.08335966 -0.219365730
## exangyes 1.000000000 0.31191528 0.08335966 0.219365730
## oldpeak 0.311915282 1.00000000 0.40950683 0.380725388
## slope0 0.083359664 0.40950683 1.00000000 -0.208795549
## slope1 0.219365730 0.38072539 -0.20879555 1.000000000
## slope2 -0.254944941 -0.55995872 -0.23615413 -0.900990398
## ca0 -0.156031422 -0.23413370 0.05264628 -0.138448051
## ca1 0.141494963 0.01329800 -0.05153714 0.064898980
## ca2 0.106187357 0.20435494 -0.03123668 0.086805003
## ca3 -0.022860023 0.22053374 0.03472882 0.110087443
## ca4 -0.077026063 -0.10555119 -0.02802759 -0.106932669
## thal0 0.044526294 -0.04251688 -0.02282988 0.011119408
## thal1 0.008562678 0.08827441 0.11749347 0.128062204
## thal2 -0.246621467 -0.33111163 -0.09319478 -0.244235004
## thal3 0.240587750 0.30531958 0.04202012 0.185377024
## slope2 ca0 ca1 ca2
## age -0.2332083362 -0.33649460 0.179248001 0.258143755
## sexfemale 0.0348975644 0.09416142 -0.058737722 0.007334288
## sexmale -0.0348975644 -0.09416142 0.058737722 -0.007334288
## cpasymptotic -0.3129391581 -0.16440288 -0.022339233 0.286493980
## cpatypical angina 0.2470037757 0.09707945 -0.009367502 -0.070360357
## cpnon-anginal pain 0.1391142388 0.10006144 0.032216522 -0.255931457
## trestbps -0.0876531491 0.01019290 -0.103336727 0.083719831
## chol -0.0962376698 -0.06052205 0.007601576 0.046693337
## fbs<=120 0.0039778848 0.07521214 0.055018424 -0.138907417
## fbs>120 -0.0039778848 -0.07521214 -0.055018424 0.138907417
## restecgabnormal 0.1981220241 0.06795476 -0.062136977 -0.041180983
## restecgdefinite -0.0985150517 -0.02368466 -0.044873215 -0.038848907
## restecgnormal -0.1791071365 -0.06338688 0.070825313 0.048700650
## thalach 0.4282942163 0.25450840 -0.172812126 -0.087405590
## exangno 0.2549449412 0.15603142 -0.141494963 -0.106187357
## exangyes -0.2549449412 -0.15603142 0.141494963 0.106187357
## oldpeak -0.5599587231 -0.23413370 0.013297999 0.204354935
## slope0 -0.2361541344 0.05264628 -0.051537145 -0.031236680
## slope1 -0.9009903976 -0.13844805 0.064898980 0.086805003
## slope2 1.0000000000 0.11420938 -0.041621903 -0.072393692
## ca0 0.1142093758 1.00000000 -0.584756390 -0.506251814
## ca1 -0.0416219026 -0.58475639 1.000000000 -0.183043913
## ca2 -0.0723936920 -0.50625181 -0.183043913 1.000000000
## ca3 -0.1247908798 -0.31150768 -0.112630876 -0.097509983
## ca4 0.1186834726 -0.15236339 -0.055089562 -0.047693692
## thal0 -0.0009207014 0.07673831 -0.044873215 -0.038848907
## thal1 -0.1793666601 -0.08226231 -0.065725542 0.127115394
## thal2 0.2840184611 0.22433232 -0.094386947 -0.173675500
## thal3 -0.2028343241 -0.20529474 0.138818235 0.123258942
## ca3 ca4 thal0 thal1
## age 0.108160246 -0.15438162 -1.977525e-02 0.122991632
## sexfemale -0.072149276 -0.07878556 4.211471e-02 -0.125248528
## sexmale 0.072149276 0.07878556 -4.211471e-02 0.125248528
## cpasymptotic 0.022402006 -0.12921915 -7.386350e-03 0.158108511
## cpatypical angina -0.056416933 -0.05418541 -4.413674e-02 -0.010867931
## cpnon-anginal pain 0.022013355 0.18635338 4.452629e-02 -0.164321864
## trestbps 0.001528670 0.04358872 -1.683905e-02 0.121461771
## chol 0.111813163 -0.13005586 -6.690352e-02 -0.086650491
## fbs<=120 -0.067777209 0.05051568 -9.515373e-02 -0.056991912
## fbs>120 0.067777209 -0.05051568 9.515373e-02 0.056991912
## restecgabnormal -0.040824829 0.11980846 1.477290e-21 -0.098304136
## restecgdefinite 0.187252482 -0.01169211 -9.523810e-03 0.178439125
## restecgnormal 0.004622502 -0.11756885 1.841649e-03 0.063816309
## thalach -0.178458849 0.11178090 -6.638391e-02 -0.155883099
## exangno 0.022860023 0.07702606 -4.452629e-02 -0.008562678
## exangyes -0.022860023 -0.07702606 4.452629e-02 0.008562678
## oldpeak 0.220533735 -0.10555119 -4.251688e-02 0.088274412
## slope0 0.034728817 -0.02802759 -2.282988e-02 0.117493471
## slope1 0.110087443 -0.10693267 1.111941e-02 0.128062204
## slope2 -0.124790880 0.11868347 -9.207014e-04 -0.179366660
## ca0 -0.311507684 -0.15236339 7.673831e-02 -0.082262310
## ca1 -0.112630876 -0.05508956 -4.487322e-02 -0.065725542
## ca2 -0.097509983 -0.04769369 -3.884891e-02 0.127115394
## ca3 1.000000000 -0.02934696 -2.390457e-02 0.107555088
## ca4 -0.029346959 1.00000000 -1.169211e-02 -0.030621934
## thal0 -0.023904572 -0.01169211 1.000000e+00 -0.024943103
## thal1 0.107555088 -0.03062193 -2.494310e-02 1.000000000
## thal2 -0.112540131 0.10591451 -1.103919e-01 -0.289119340
## thal3 0.067086318 -0.09140762 -7.445609e-02 -0.195002438
## thal2 thal3
## age -0.18880235 0.13705258
## sexfemale 0.41568640 -0.37387153
## sexmale -0.41568640 0.37387153
## cpasymptotic -0.32393904 0.25615387
## cpatypical angina 0.17196496 -0.16269743
## cpnon-anginal pain 0.21310776 -0.14646236
## trestbps -0.13094194 0.07768676
## chol -0.01748110 0.07450860
## fbs<=120 0.07865629 -0.03350907
## fbs>120 -0.07865629 0.03350907
## restecgabnormal 0.12357437 -0.07825107
## restecgdefinite -0.01205962 -0.07445609
## restecgnormal -0.12126436 0.09266281
## thalach 0.25684204 -0.17343034
## exangno 0.24662147 -0.24058775
## exangyes -0.24662147 0.24058775
## oldpeak -0.33111163 0.30531958
## slope0 -0.09319478 0.04202012
## slope1 -0.24423500 0.18537702
## slope2 0.28401846 -0.20283432
## ca0 0.22433232 -0.20529474
## ca1 -0.09438695 0.13881823
## ca2 -0.17367550 0.12325894
## ca3 -0.11254013 0.06708632
## ca4 0.10591451 -0.09140762
## thal0 -0.11039194 -0.07445609
## thal1 -0.28911934 -0.19500244
## thal2 1.00000000 -0.86303202
## thal3 -0.86303202 1.00000000
ei_df <- eigen(cov_df)
ei_df
## eigen() decomposition
## $values
## [1] 5.046718e+00 2.498338e+00 2.296173e+00 1.927620e+00 1.746504e+00
## [6] 1.711284e+00 1.619344e+00 1.415239e+00 1.310701e+00 1.242839e+00
## [11] 1.035461e+00 1.012784e+00 9.891014e-01 9.028698e-01 8.243613e-01
## [16] 7.464296e-01 7.242646e-01 6.248935e-01 6.119878e-01 3.861818e-01
## [21] 3.269050e-01 2.091888e-16 2.024479e-17 1.676868e-17 -2.500649e-17
## [26] -3.572164e-17 -1.549294e-16 -2.823622e-16 -3.844234e-16
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.2040227737 1.833293e-01 -0.094779873 -0.13731378 0.10549430
## [2,] 0.1413086011 5.394177e-01 0.039906990 -0.10547773 0.06696998
## [3,] -0.1413086011 -5.394177e-01 -0.039906990 0.10547773 -0.06696998
## [4,] -0.2942003135 2.461245e-02 0.124389538 0.04018123 -0.01260643
## [5,] 0.1600597333 5.511576e-06 -0.032650698 0.05734463 0.17794918
## [6,] 0.1903397143 -2.698063e-02 -0.109385163 -0.09137243 -0.13306274
## [7,] -0.0935479157 1.204104e-01 -0.222487435 0.12005801 -0.05039087
## [8,] -0.0849067274 2.803531e-01 0.029034968 0.08955058 0.10490839
## [9,] 0.0554027651 -2.904020e-02 0.591801533 0.18297854 -0.01650000
## [10,] -0.0554027651 2.904020e-02 -0.591801533 -0.18297854 0.01650000
## [11,] 0.1736418581 -7.301015e-02 0.134795878 -0.54120573 -0.24450421
## [12,] -0.0625879247 1.361349e-01 0.006726700 0.07124934 -0.29057685
## [13,] -0.1615699524 4.669831e-02 -0.136120641 0.52752438 0.30073749
## [14,] 0.2642768993 -1.226385e-01 -0.027373082 0.15232430 0.06430739
## [15,] 0.2662813236 -5.344106e-02 -0.103829439 0.13787861 -0.23790773
## [16,] -0.2662813236 5.344106e-02 0.103829439 -0.13787861 0.23790773
## [17,] -0.3018129056 1.014582e-01 0.013299747 0.02003847 -0.23877351
## [18,] -0.1129508182 6.226086e-02 -0.230618976 0.14454194 -0.21883738
## [19,] -0.2404813692 1.326143e-01 0.214508548 -0.02652619 -0.20305279
## [20,] 0.2890528978 -1.593875e-01 -0.110832619 -0.03776438 0.29883633
## [21,] 0.1863277619 2.539686e-02 0.017259545 0.31810714 -0.22856684
## [22,] -0.0801730543 -5.412041e-02 0.100692516 -0.18736497 0.34911401
## [23,] -0.1440907534 5.350135e-02 -0.095050946 -0.17389422 0.13349020
## [24,] -0.0880689230 1.975906e-02 -0.062237127 -0.05200381 -0.25357146
## [25,] 0.0826325459 -1.248690e-01 0.003714119 -0.09886796 -0.07387878
## [26,] 0.0003292663 1.790613e-02 -0.071960874 -0.04474856 -0.02271613
## [27,] -0.1071281011 -3.217524e-03 -0.087323525 0.08306735 -0.24274831
## [28,] 0.2887953143 2.791867e-01 0.032234467 0.04175985 0.05647509
## [29,] -0.2439394038 -2.892717e-01 0.024699924 -0.07532716 0.06720959
## [,6] [,7] [,8] [,9] [,10]
## [1,] -0.213452492 -0.1616819119 0.023604430 -0.087715997 0.047030462
## [2,] 0.011344399 0.0132664051 0.046115109 -0.110497288 -0.052789321
## [3,] -0.011344399 -0.0132664051 -0.046115109 0.110497288 0.052789321
## [4,] 0.251575980 -0.2184059522 0.067338655 0.070216542 -0.239407715
## [5,] 0.018666681 -0.2587982109 0.080642858 -0.002363250 0.575654625
## [6,] -0.291143340 0.4529931814 -0.140368316 -0.075009077 -0.212748155
## [7,] 0.052792830 -0.1368246765 -0.088519611 -0.198823957 -0.170243940
## [8,] -0.049599567 -0.1036958509 -0.249146156 -0.378954437 0.034805948
## [9,] -0.068522160 -0.1133872283 -0.113186317 -0.022985625 -0.067737669
## [10,] 0.068522160 0.1133872283 0.113186317 0.022985625 0.067737669
## [11,] 0.174292551 -0.0922261543 0.011106655 -0.108859690 0.058035125
## [12,] 0.022878704 -0.1374788674 -0.408638140 0.113189843 0.151685506
## [13,] -0.178747710 0.1188272919 0.067910999 0.086991199 -0.087377357
## [14,] 0.087127349 -0.0874879334 -0.015641141 -0.139934444 -0.164750061
## [15,] -0.361710871 -0.2362339223 0.163201029 -0.157374818 -0.056636761
## [16,] 0.361710871 0.2362339223 -0.163201029 0.157374818 0.056636761
## [17,] 0.001229302 0.0009125418 -0.162923697 -0.080997686 -0.050272836
## [18,] 0.152638210 0.0091306367 -0.371342645 -0.005725705 0.003227085
## [19,] -0.201743027 0.2071383509 0.402781053 0.018519962 0.046593168
## [20,] 0.132742164 -0.2098663948 -0.235475750 -0.015861704 -0.047727286
## [21,] 0.410141531 0.2145056094 0.160399335 -0.179411132 0.115343125
## [22,] -0.343597404 0.1291405671 -0.216397262 0.116144689 0.196811267
## [23,] 0.006067372 -0.4103179475 0.227637787 0.040740958 -0.367790016
## [24,] -0.264070192 -0.0683355585 -0.215395509 0.004954165 0.202539387
## [25,] -0.084082911 0.0300039738 -0.205304036 0.236604972 -0.433208186
## [26,] 0.112158906 0.2015978151 0.170508454 0.064865516 0.047711035
## [27,] -0.036225835 -0.2630957285 0.114121348 0.418293722 0.140089959
## [28,] 0.011523649 0.0538623012 -0.080493188 0.306678462 -0.080124643
## [29,] -0.016318483 0.0350554293 -0.008128518 -0.536678636 0.003188517
## [,11] [,12] [,13] [,14] [,15]
## [1,] -0.0514849619 0.25344279 -0.071981397 -0.38373324 0.214537681
## [2,] -0.0333275463 0.03966658 -0.009422092 0.15754580 -0.132702943
## [3,] 0.0333275463 -0.03966658 0.009422092 -0.15754580 0.132702943
## [4,] -0.0918142199 0.11585403 0.289722514 -0.11751257 0.145408546
## [5,] 0.1913643070 -0.17791385 -0.295940515 0.28920729 0.175698766
## [6,] -0.0573211994 0.01987070 -0.073275472 -0.10991485 -0.304395008
## [7,] 0.0809428399 0.17893968 -0.558762233 -0.16658326 0.133346757
## [8,] -0.1344075528 -0.11996991 -0.099533027 -0.17877867 0.014999030
## [9,] -0.0632291693 0.10386450 -0.074834503 0.01981791 0.011788835
## [10,] 0.0632291693 -0.10386450 0.074834503 -0.01981791 -0.011788835
## [11,] 0.0892992035 0.07682543 -0.022307302 -0.07800898 0.058828058
## [12,] -0.1707762492 -0.02107557 -0.019284283 0.15737992 -0.297537103
## [13,] -0.0562915673 -0.07276366 0.026040336 0.04758982 -0.001302855
## [14,] 0.0637699306 -0.19026526 -0.053317476 0.18721599 -0.268279182
## [15,] 0.0179033287 0.17592834 0.159850407 0.06848620 0.108328418
## [16,] -0.0179033287 -0.17592834 -0.159850407 -0.06848620 -0.108328418
## [17,] 0.1745509851 -0.06613846 0.118376039 0.28116180 0.070437461
## [18,] 0.3009325805 0.27827997 0.183468090 0.24924191 0.120746443
## [19,] 0.1323699612 -0.16649098 -0.146523898 0.02438587 0.004764372
## [20,] -0.2650235573 0.04197840 0.064198930 -0.13479804 -0.058299086
## [21,] 0.0266367570 0.08414051 -0.035901178 -0.21531961 0.019033132
## [22,] 0.1784967480 0.38220875 0.061019476 0.10245025 -0.065584851
## [23,] -0.0002590246 -0.16793850 0.062979887 0.21493774 -0.190248341
## [24,] -0.3768853491 -0.48613501 0.120805688 -0.12340610 0.181065955
## [25,] 0.0549279697 -0.13469650 -0.467979249 0.17267244 0.331705364
## [26,] -0.6682429521 0.31460817 -0.096985482 0.42619396 0.258192820
## [27,] -0.0747633927 0.20786286 -0.261684675 -0.15940947 -0.454059167
## [28,] 0.1747356815 -0.13999203 0.183500247 -0.14317523 0.276835144
## [29,] -0.0086525893 -0.02242725 -0.039190364 0.14120018 -0.110723026
## [,16] [,17] [,18] [,19] [,20]
## [1,] -0.307043040 0.111102346 0.10755715 -0.119481022 -0.634592582
## [2,] 0.139797586 -0.262204268 0.11173744 -0.071863869 -0.056457644
## [3,] -0.139797586 0.262204268 -0.11173744 0.071863869 0.056457644
## [4,] 0.387590877 0.004321684 0.03778667 0.081334668 -0.038321037
## [5,] -0.145496306 -0.007070978 0.04754370 -0.050109424 -0.003776925
## [6,] -0.304719637 0.001099695 -0.08065819 -0.047785079 0.045118642
## [7,] 0.209913718 -0.024729046 -0.51745690 -0.228395658 0.207888802
## [8,] -0.088998760 0.311178051 0.17199235 0.614251536 0.293268601
## [9,] -0.175395522 -0.107830779 -0.08751724 -0.050378871 -0.008164833
## [10,] 0.175395522 0.107830779 0.08751724 0.050378871 0.008164833
## [11,] -0.015393105 -0.003828996 -0.09130157 0.123980608 0.025264759
## [12,] 0.160146237 0.526368999 0.16138195 -0.389205946 -0.081912822
## [13,] -0.015572132 -0.097955937 0.06011090 -0.048740716 -0.009429520
## [14,] 0.191364311 0.132133530 -0.34932228 0.350327215 -0.613049818
## [15,] 0.098331517 0.054993699 0.07387301 -0.001615882 0.074831716
## [16,] -0.098331517 -0.054993699 -0.07387301 0.001615882 -0.074831716
## [17,] -0.130630351 -0.017275377 -0.19641737 -0.044658121 -0.014852663
## [18,] -0.257749536 -0.236779204 0.01835022 0.205780747 -0.076968959
## [19,] 0.115899019 0.198654712 -0.03992670 0.056407624 -0.034717513
## [20,] -0.000817087 -0.092347221 0.03153132 -0.147335259 0.068640573
## [21,] -0.040677403 0.033486402 0.15915730 -0.067902495 -0.067031797
## [22,] 0.307156634 0.082632208 -0.14084179 0.119478608 0.037244599
## [23,] -0.400493663 0.129741079 -0.08790932 -0.081575166 0.148382938
## [24,] 0.107273784 -0.376830513 -0.21119126 0.022227162 -0.086051127
## [25,] 0.135558575 -0.043502495 0.46680415 0.089227879 -0.107268300
## [26,] -0.109837958 0.142040654 -0.17122845 0.143891587 -0.023977683
## [27,] -0.020303257 -0.250400560 0.06807129 0.298951277 0.021055632
## [28,] -0.055709181 0.217044473 -0.19542102 -0.023151988 0.039387046
## [29,] 0.089443704 -0.127236152 0.20154651 -0.153751053 -0.046198032
## [,21] [,22] [,23] [,24] [,25]
## [1,] -0.055835074 0.000000e+00 0.000000e+00 1.225090e-16 0.000000e+00
## [2,] -0.009716911 4.304169e-01 4.908485e-01 2.427171e-02 1.598752e-01
## [3,] 0.009716911 4.304169e-01 4.908485e-01 2.427171e-02 1.598752e-01
## [4,] 0.024069859 -1.624602e-01 -9.349496e-02 -1.462863e-01 2.297954e-01
## [5,] -0.020759097 -1.223458e-01 -7.040935e-02 -1.101655e-01 1.730547e-01
## [6,] -0.009246758 -1.482255e-01 -8.530298e-02 -1.334688e-01 2.096609e-01
## [7,] 0.091112831 -1.249001e-16 -1.387779e-16 -2.628106e-16 -2.706169e-16
## [8,] -0.044978707 4.163336e-17 -1.665335e-16 1.387779e-16 -8.326673e-17
## [9,] 0.055246780 1.238864e-01 -2.159140e-01 3.779570e-01 2.646552e-01
## [10,] -0.055246780 1.238864e-01 -2.159140e-01 3.779570e-01 2.646552e-01
## [11,] -0.002668114 2.618657e-02 1.049778e-01 1.936265e-01 -4.934911e-01
## [12,] 0.131111145 5.062877e-03 2.029627e-02 3.743549e-02 -9.541093e-02
## [13,] -0.022684782 2.618191e-02 1.049591e-01 1.935920e-01 -4.934033e-01
## [14,] 0.004156898 -1.387779e-16 -5.551115e-17 1.179612e-16 5.551115e-17
## [15,] -0.039936533 9.881214e-02 -9.271353e-02 3.979858e-01 -4.350554e-02
## [16,] 0.039936533 9.881214e-02 -9.271353e-02 3.979858e-01 -4.350554e-02
## [17,] -0.783813396 8.326673e-17 -9.714451e-17 3.096481e-16 4.163336e-17
## [18,] 0.400630885 -1.655520e-01 1.348923e-01 2.214400e-02 -3.849901e-02
## [19,] 0.100211495 -3.708045e-01 3.021327e-01 4.959827e-02 -8.623034e-02
## [20,] -0.277298232 -3.731871e-01 3.040740e-01 4.991696e-02 -8.678440e-02
## [21,] -0.122531418 -1.066503e-01 2.241673e-01 1.351855e-03 2.229382e-01
## [22,] -0.043260285 -8.331128e-02 1.751112e-01 1.056019e-03 1.741511e-01
## [23,] 0.158659708 -7.542381e-02 1.585326e-01 9.560407e-04 1.576634e-01
## [24,] 0.133158564 -5.072120e-02 1.066104e-01 6.429208e-04 1.060259e-01
## [25,] -0.079020446 -2.592493e-02 5.449138e-02 3.286137e-04 5.419262e-02
## [26,] -0.001634511 5.599721e-02 -2.680530e-02 -6.713152e-02 -2.827909e-02
## [27,] -0.158154115 1.389760e-01 -6.652642e-02 -1.666096e-01 -7.018414e-02
## [28,] 0.005608392 2.874463e-01 -1.375976e-01 -3.446012e-01 -1.451629e-01
## [29,] 0.073239274 2.793453e-01 -1.337198e-01 -3.348895e-01 -1.410719e-01
## [,26] [,27] [,28] [,29]
## [1,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [2,] 1.264714e-01 -4.810722e-02 1.599167e-01 6.145848e-02
## [3,] 1.264714e-01 -4.810722e-02 1.599167e-01 6.145848e-02
## [4,] 5.617425e-02 1.431376e-01 3.669298e-01 3.861601e-01
## [5,] 4.230380e-02 1.077943e-01 2.763281e-01 2.908102e-01
## [6,] 5.125229e-02 1.305960e-01 3.347796e-01 3.523250e-01
## [7,] 3.330669e-16 3.469447e-17 -3.053113e-16 -5.377643e-17
## [8,] -3.400058e-16 -1.023487e-16 9.714451e-17 -4.510281e-17
## [9,] -3.174912e-01 -1.784900e-01 2.632447e-01 -1.522574e-01
## [10,] -3.174912e-01 -1.784900e-01 2.632447e-01 -1.522574e-01
## [11,] -2.606901e-01 -4.914851e-02 1.355194e-01 3.308188e-01
## [12,] -5.040148e-02 -9.502308e-03 2.620115e-02 6.396007e-02
## [13,] -2.606437e-01 -4.913976e-02 1.354953e-01 3.307599e-01
## [14,] 1.387779e-17 -1.040834e-16 5.551115e-17 1.283695e-16
## [15,] 2.321523e-01 5.165691e-01 -2.389006e-02 -6.767816e-03
## [16,] 2.321523e-01 5.165691e-01 -2.389006e-02 -6.767816e-03
## [17,] -1.838807e-16 -1.110223e-16 -1.387779e-17 -1.595946e-16
## [18,] 1.860791e-04 3.982187e-02 1.360638e-01 -1.497667e-01
## [19,] 4.167812e-04 8.919330e-02 3.047566e-01 -3.354484e-01
## [20,] 4.194592e-04 8.976640e-02 3.067148e-01 -3.376038e-01
## [21,] -4.081274e-01 2.509126e-01 -2.610841e-01 9.169874e-02
## [22,] -3.188141e-01 1.960037e-01 -2.039493e-01 7.163168e-02
## [23,] -2.886305e-01 1.774471e-01 -1.846404e-01 6.484997e-02
## [24,] -1.940990e-01 1.193301e-01 -1.241675e-01 4.361048e-02
## [25,] -9.920907e-02 6.099273e-02 -6.346526e-02 2.229046e-02
## [26,] -4.410188e-02 5.481539e-02 3.698347e-02 -4.057889e-02
## [27,] -1.094538e-01 1.360430e-01 9.178700e-02 -1.007103e-01
## [28,] -2.263849e-01 2.813797e-01 1.898445e-01 -2.083006e-01
## [29,] -2.200048e-01 2.734498e-01 1.844942e-01 -2.024302e-01
str(ei_df)
## List of 2
## $ values : num [1:29] 5.05 2.5 2.3 1.93 1.75 ...
## $ vectors: num [1:29, 1:29] -0.204 0.141 -0.141 -0.294 0.16 ...
## - attr(*, "class")= chr "eigen"
ei_df$vectors[,1:21]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.2040227737 1.833293e-01 -0.094779873 -0.13731378 0.10549430
## [2,] 0.1413086011 5.394177e-01 0.039906990 -0.10547773 0.06696998
## [3,] -0.1413086011 -5.394177e-01 -0.039906990 0.10547773 -0.06696998
## [4,] -0.2942003135 2.461245e-02 0.124389538 0.04018123 -0.01260643
## [5,] 0.1600597333 5.511576e-06 -0.032650698 0.05734463 0.17794918
## [6,] 0.1903397143 -2.698063e-02 -0.109385163 -0.09137243 -0.13306274
## [7,] -0.0935479157 1.204104e-01 -0.222487435 0.12005801 -0.05039087
## [8,] -0.0849067274 2.803531e-01 0.029034968 0.08955058 0.10490839
## [9,] 0.0554027651 -2.904020e-02 0.591801533 0.18297854 -0.01650000
## [10,] -0.0554027651 2.904020e-02 -0.591801533 -0.18297854 0.01650000
## [11,] 0.1736418581 -7.301015e-02 0.134795878 -0.54120573 -0.24450421
## [12,] -0.0625879247 1.361349e-01 0.006726700 0.07124934 -0.29057685
## [13,] -0.1615699524 4.669831e-02 -0.136120641 0.52752438 0.30073749
## [14,] 0.2642768993 -1.226385e-01 -0.027373082 0.15232430 0.06430739
## [15,] 0.2662813236 -5.344106e-02 -0.103829439 0.13787861 -0.23790773
## [16,] -0.2662813236 5.344106e-02 0.103829439 -0.13787861 0.23790773
## [17,] -0.3018129056 1.014582e-01 0.013299747 0.02003847 -0.23877351
## [18,] -0.1129508182 6.226086e-02 -0.230618976 0.14454194 -0.21883738
## [19,] -0.2404813692 1.326143e-01 0.214508548 -0.02652619 -0.20305279
## [20,] 0.2890528978 -1.593875e-01 -0.110832619 -0.03776438 0.29883633
## [21,] 0.1863277619 2.539686e-02 0.017259545 0.31810714 -0.22856684
## [22,] -0.0801730543 -5.412041e-02 0.100692516 -0.18736497 0.34911401
## [23,] -0.1440907534 5.350135e-02 -0.095050946 -0.17389422 0.13349020
## [24,] -0.0880689230 1.975906e-02 -0.062237127 -0.05200381 -0.25357146
## [25,] 0.0826325459 -1.248690e-01 0.003714119 -0.09886796 -0.07387878
## [26,] 0.0003292663 1.790613e-02 -0.071960874 -0.04474856 -0.02271613
## [27,] -0.1071281011 -3.217524e-03 -0.087323525 0.08306735 -0.24274831
## [28,] 0.2887953143 2.791867e-01 0.032234467 0.04175985 0.05647509
## [29,] -0.2439394038 -2.892717e-01 0.024699924 -0.07532716 0.06720959
## [,6] [,7] [,8] [,9] [,10]
## [1,] -0.213452492 -0.1616819119 0.023604430 -0.087715997 0.047030462
## [2,] 0.011344399 0.0132664051 0.046115109 -0.110497288 -0.052789321
## [3,] -0.011344399 -0.0132664051 -0.046115109 0.110497288 0.052789321
## [4,] 0.251575980 -0.2184059522 0.067338655 0.070216542 -0.239407715
## [5,] 0.018666681 -0.2587982109 0.080642858 -0.002363250 0.575654625
## [6,] -0.291143340 0.4529931814 -0.140368316 -0.075009077 -0.212748155
## [7,] 0.052792830 -0.1368246765 -0.088519611 -0.198823957 -0.170243940
## [8,] -0.049599567 -0.1036958509 -0.249146156 -0.378954437 0.034805948
## [9,] -0.068522160 -0.1133872283 -0.113186317 -0.022985625 -0.067737669
## [10,] 0.068522160 0.1133872283 0.113186317 0.022985625 0.067737669
## [11,] 0.174292551 -0.0922261543 0.011106655 -0.108859690 0.058035125
## [12,] 0.022878704 -0.1374788674 -0.408638140 0.113189843 0.151685506
## [13,] -0.178747710 0.1188272919 0.067910999 0.086991199 -0.087377357
## [14,] 0.087127349 -0.0874879334 -0.015641141 -0.139934444 -0.164750061
## [15,] -0.361710871 -0.2362339223 0.163201029 -0.157374818 -0.056636761
## [16,] 0.361710871 0.2362339223 -0.163201029 0.157374818 0.056636761
## [17,] 0.001229302 0.0009125418 -0.162923697 -0.080997686 -0.050272836
## [18,] 0.152638210 0.0091306367 -0.371342645 -0.005725705 0.003227085
## [19,] -0.201743027 0.2071383509 0.402781053 0.018519962 0.046593168
## [20,] 0.132742164 -0.2098663948 -0.235475750 -0.015861704 -0.047727286
## [21,] 0.410141531 0.2145056094 0.160399335 -0.179411132 0.115343125
## [22,] -0.343597404 0.1291405671 -0.216397262 0.116144689 0.196811267
## [23,] 0.006067372 -0.4103179475 0.227637787 0.040740958 -0.367790016
## [24,] -0.264070192 -0.0683355585 -0.215395509 0.004954165 0.202539387
## [25,] -0.084082911 0.0300039738 -0.205304036 0.236604972 -0.433208186
## [26,] 0.112158906 0.2015978151 0.170508454 0.064865516 0.047711035
## [27,] -0.036225835 -0.2630957285 0.114121348 0.418293722 0.140089959
## [28,] 0.011523649 0.0538623012 -0.080493188 0.306678462 -0.080124643
## [29,] -0.016318483 0.0350554293 -0.008128518 -0.536678636 0.003188517
## [,11] [,12] [,13] [,14] [,15]
## [1,] -0.0514849619 0.25344279 -0.071981397 -0.38373324 0.214537681
## [2,] -0.0333275463 0.03966658 -0.009422092 0.15754580 -0.132702943
## [3,] 0.0333275463 -0.03966658 0.009422092 -0.15754580 0.132702943
## [4,] -0.0918142199 0.11585403 0.289722514 -0.11751257 0.145408546
## [5,] 0.1913643070 -0.17791385 -0.295940515 0.28920729 0.175698766
## [6,] -0.0573211994 0.01987070 -0.073275472 -0.10991485 -0.304395008
## [7,] 0.0809428399 0.17893968 -0.558762233 -0.16658326 0.133346757
## [8,] -0.1344075528 -0.11996991 -0.099533027 -0.17877867 0.014999030
## [9,] -0.0632291693 0.10386450 -0.074834503 0.01981791 0.011788835
## [10,] 0.0632291693 -0.10386450 0.074834503 -0.01981791 -0.011788835
## [11,] 0.0892992035 0.07682543 -0.022307302 -0.07800898 0.058828058
## [12,] -0.1707762492 -0.02107557 -0.019284283 0.15737992 -0.297537103
## [13,] -0.0562915673 -0.07276366 0.026040336 0.04758982 -0.001302855
## [14,] 0.0637699306 -0.19026526 -0.053317476 0.18721599 -0.268279182
## [15,] 0.0179033287 0.17592834 0.159850407 0.06848620 0.108328418
## [16,] -0.0179033287 -0.17592834 -0.159850407 -0.06848620 -0.108328418
## [17,] 0.1745509851 -0.06613846 0.118376039 0.28116180 0.070437461
## [18,] 0.3009325805 0.27827997 0.183468090 0.24924191 0.120746443
## [19,] 0.1323699612 -0.16649098 -0.146523898 0.02438587 0.004764372
## [20,] -0.2650235573 0.04197840 0.064198930 -0.13479804 -0.058299086
## [21,] 0.0266367570 0.08414051 -0.035901178 -0.21531961 0.019033132
## [22,] 0.1784967480 0.38220875 0.061019476 0.10245025 -0.065584851
## [23,] -0.0002590246 -0.16793850 0.062979887 0.21493774 -0.190248341
## [24,] -0.3768853491 -0.48613501 0.120805688 -0.12340610 0.181065955
## [25,] 0.0549279697 -0.13469650 -0.467979249 0.17267244 0.331705364
## [26,] -0.6682429521 0.31460817 -0.096985482 0.42619396 0.258192820
## [27,] -0.0747633927 0.20786286 -0.261684675 -0.15940947 -0.454059167
## [28,] 0.1747356815 -0.13999203 0.183500247 -0.14317523 0.276835144
## [29,] -0.0086525893 -0.02242725 -0.039190364 0.14120018 -0.110723026
## [,16] [,17] [,18] [,19] [,20]
## [1,] -0.307043040 0.111102346 0.10755715 -0.119481022 -0.634592582
## [2,] 0.139797586 -0.262204268 0.11173744 -0.071863869 -0.056457644
## [3,] -0.139797586 0.262204268 -0.11173744 0.071863869 0.056457644
## [4,] 0.387590877 0.004321684 0.03778667 0.081334668 -0.038321037
## [5,] -0.145496306 -0.007070978 0.04754370 -0.050109424 -0.003776925
## [6,] -0.304719637 0.001099695 -0.08065819 -0.047785079 0.045118642
## [7,] 0.209913718 -0.024729046 -0.51745690 -0.228395658 0.207888802
## [8,] -0.088998760 0.311178051 0.17199235 0.614251536 0.293268601
## [9,] -0.175395522 -0.107830779 -0.08751724 -0.050378871 -0.008164833
## [10,] 0.175395522 0.107830779 0.08751724 0.050378871 0.008164833
## [11,] -0.015393105 -0.003828996 -0.09130157 0.123980608 0.025264759
## [12,] 0.160146237 0.526368999 0.16138195 -0.389205946 -0.081912822
## [13,] -0.015572132 -0.097955937 0.06011090 -0.048740716 -0.009429520
## [14,] 0.191364311 0.132133530 -0.34932228 0.350327215 -0.613049818
## [15,] 0.098331517 0.054993699 0.07387301 -0.001615882 0.074831716
## [16,] -0.098331517 -0.054993699 -0.07387301 0.001615882 -0.074831716
## [17,] -0.130630351 -0.017275377 -0.19641737 -0.044658121 -0.014852663
## [18,] -0.257749536 -0.236779204 0.01835022 0.205780747 -0.076968959
## [19,] 0.115899019 0.198654712 -0.03992670 0.056407624 -0.034717513
## [20,] -0.000817087 -0.092347221 0.03153132 -0.147335259 0.068640573
## [21,] -0.040677403 0.033486402 0.15915730 -0.067902495 -0.067031797
## [22,] 0.307156634 0.082632208 -0.14084179 0.119478608 0.037244599
## [23,] -0.400493663 0.129741079 -0.08790932 -0.081575166 0.148382938
## [24,] 0.107273784 -0.376830513 -0.21119126 0.022227162 -0.086051127
## [25,] 0.135558575 -0.043502495 0.46680415 0.089227879 -0.107268300
## [26,] -0.109837958 0.142040654 -0.17122845 0.143891587 -0.023977683
## [27,] -0.020303257 -0.250400560 0.06807129 0.298951277 0.021055632
## [28,] -0.055709181 0.217044473 -0.19542102 -0.023151988 0.039387046
## [29,] 0.089443704 -0.127236152 0.20154651 -0.153751053 -0.046198032
## [,21]
## [1,] -0.055835074
## [2,] -0.009716911
## [3,] 0.009716911
## [4,] 0.024069859
## [5,] -0.020759097
## [6,] -0.009246758
## [7,] 0.091112831
## [8,] -0.044978707
## [9,] 0.055246780
## [10,] -0.055246780
## [11,] -0.002668114
## [12,] 0.131111145
## [13,] -0.022684782
## [14,] 0.004156898
## [15,] -0.039936533
## [16,] 0.039936533
## [17,] -0.783813396
## [18,] 0.400630885
## [19,] 0.100211495
## [20,] -0.277298232
## [21,] -0.122531418
## [22,] -0.043260285
## [23,] 0.158659708
## [24,] 0.133158564
## [25,] -0.079020446
## [26,] -0.001634511
## [27,] -0.158154115
## [28,] 0.005608392
## [29,] 0.073239274
p <- ei_df$vectors[,1:21]
p <- -p
p
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2040227737 -1.833293e-01 0.094779873 0.13731378 -0.10549430
## [2,] -0.1413086011 -5.394177e-01 -0.039906990 0.10547773 -0.06696998
## [3,] 0.1413086011 5.394177e-01 0.039906990 -0.10547773 0.06696998
## [4,] 0.2942003135 -2.461245e-02 -0.124389538 -0.04018123 0.01260643
## [5,] -0.1600597333 -5.511576e-06 0.032650698 -0.05734463 -0.17794918
## [6,] -0.1903397143 2.698063e-02 0.109385163 0.09137243 0.13306274
## [7,] 0.0935479157 -1.204104e-01 0.222487435 -0.12005801 0.05039087
## [8,] 0.0849067274 -2.803531e-01 -0.029034968 -0.08955058 -0.10490839
## [9,] -0.0554027651 2.904020e-02 -0.591801533 -0.18297854 0.01650000
## [10,] 0.0554027651 -2.904020e-02 0.591801533 0.18297854 -0.01650000
## [11,] -0.1736418581 7.301015e-02 -0.134795878 0.54120573 0.24450421
## [12,] 0.0625879247 -1.361349e-01 -0.006726700 -0.07124934 0.29057685
## [13,] 0.1615699524 -4.669831e-02 0.136120641 -0.52752438 -0.30073749
## [14,] -0.2642768993 1.226385e-01 0.027373082 -0.15232430 -0.06430739
## [15,] -0.2662813236 5.344106e-02 0.103829439 -0.13787861 0.23790773
## [16,] 0.2662813236 -5.344106e-02 -0.103829439 0.13787861 -0.23790773
## [17,] 0.3018129056 -1.014582e-01 -0.013299747 -0.02003847 0.23877351
## [18,] 0.1129508182 -6.226086e-02 0.230618976 -0.14454194 0.21883738
## [19,] 0.2404813692 -1.326143e-01 -0.214508548 0.02652619 0.20305279
## [20,] -0.2890528978 1.593875e-01 0.110832619 0.03776438 -0.29883633
## [21,] -0.1863277619 -2.539686e-02 -0.017259545 -0.31810714 0.22856684
## [22,] 0.0801730543 5.412041e-02 -0.100692516 0.18736497 -0.34911401
## [23,] 0.1440907534 -5.350135e-02 0.095050946 0.17389422 -0.13349020
## [24,] 0.0880689230 -1.975906e-02 0.062237127 0.05200381 0.25357146
## [25,] -0.0826325459 1.248690e-01 -0.003714119 0.09886796 0.07387878
## [26,] -0.0003292663 -1.790613e-02 0.071960874 0.04474856 0.02271613
## [27,] 0.1071281011 3.217524e-03 0.087323525 -0.08306735 0.24274831
## [28,] -0.2887953143 -2.791867e-01 -0.032234467 -0.04175985 -0.05647509
## [29,] 0.2439394038 2.892717e-01 -0.024699924 0.07532716 -0.06720959
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.213452492 0.1616819119 -0.023604430 0.087715997 -0.047030462
## [2,] -0.011344399 -0.0132664051 -0.046115109 0.110497288 0.052789321
## [3,] 0.011344399 0.0132664051 0.046115109 -0.110497288 -0.052789321
## [4,] -0.251575980 0.2184059522 -0.067338655 -0.070216542 0.239407715
## [5,] -0.018666681 0.2587982109 -0.080642858 0.002363250 -0.575654625
## [6,] 0.291143340 -0.4529931814 0.140368316 0.075009077 0.212748155
## [7,] -0.052792830 0.1368246765 0.088519611 0.198823957 0.170243940
## [8,] 0.049599567 0.1036958509 0.249146156 0.378954437 -0.034805948
## [9,] 0.068522160 0.1133872283 0.113186317 0.022985625 0.067737669
## [10,] -0.068522160 -0.1133872283 -0.113186317 -0.022985625 -0.067737669
## [11,] -0.174292551 0.0922261543 -0.011106655 0.108859690 -0.058035125
## [12,] -0.022878704 0.1374788674 0.408638140 -0.113189843 -0.151685506
## [13,] 0.178747710 -0.1188272919 -0.067910999 -0.086991199 0.087377357
## [14,] -0.087127349 0.0874879334 0.015641141 0.139934444 0.164750061
## [15,] 0.361710871 0.2362339223 -0.163201029 0.157374818 0.056636761
## [16,] -0.361710871 -0.2362339223 0.163201029 -0.157374818 -0.056636761
## [17,] -0.001229302 -0.0009125418 0.162923697 0.080997686 0.050272836
## [18,] -0.152638210 -0.0091306367 0.371342645 0.005725705 -0.003227085
## [19,] 0.201743027 -0.2071383509 -0.402781053 -0.018519962 -0.046593168
## [20,] -0.132742164 0.2098663948 0.235475750 0.015861704 0.047727286
## [21,] -0.410141531 -0.2145056094 -0.160399335 0.179411132 -0.115343125
## [22,] 0.343597404 -0.1291405671 0.216397262 -0.116144689 -0.196811267
## [23,] -0.006067372 0.4103179475 -0.227637787 -0.040740958 0.367790016
## [24,] 0.264070192 0.0683355585 0.215395509 -0.004954165 -0.202539387
## [25,] 0.084082911 -0.0300039738 0.205304036 -0.236604972 0.433208186
## [26,] -0.112158906 -0.2015978151 -0.170508454 -0.064865516 -0.047711035
## [27,] 0.036225835 0.2630957285 -0.114121348 -0.418293722 -0.140089959
## [28,] -0.011523649 -0.0538623012 0.080493188 -0.306678462 0.080124643
## [29,] 0.016318483 -0.0350554293 0.008128518 0.536678636 -0.003188517
## [,11] [,12] [,13] [,14] [,15]
## [1,] 0.0514849619 -0.25344279 0.071981397 0.38373324 -0.214537681
## [2,] 0.0333275463 -0.03966658 0.009422092 -0.15754580 0.132702943
## [3,] -0.0333275463 0.03966658 -0.009422092 0.15754580 -0.132702943
## [4,] 0.0918142199 -0.11585403 -0.289722514 0.11751257 -0.145408546
## [5,] -0.1913643070 0.17791385 0.295940515 -0.28920729 -0.175698766
## [6,] 0.0573211994 -0.01987070 0.073275472 0.10991485 0.304395008
## [7,] -0.0809428399 -0.17893968 0.558762233 0.16658326 -0.133346757
## [8,] 0.1344075528 0.11996991 0.099533027 0.17877867 -0.014999030
## [9,] 0.0632291693 -0.10386450 0.074834503 -0.01981791 -0.011788835
## [10,] -0.0632291693 0.10386450 -0.074834503 0.01981791 0.011788835
## [11,] -0.0892992035 -0.07682543 0.022307302 0.07800898 -0.058828058
## [12,] 0.1707762492 0.02107557 0.019284283 -0.15737992 0.297537103
## [13,] 0.0562915673 0.07276366 -0.026040336 -0.04758982 0.001302855
## [14,] -0.0637699306 0.19026526 0.053317476 -0.18721599 0.268279182
## [15,] -0.0179033287 -0.17592834 -0.159850407 -0.06848620 -0.108328418
## [16,] 0.0179033287 0.17592834 0.159850407 0.06848620 0.108328418
## [17,] -0.1745509851 0.06613846 -0.118376039 -0.28116180 -0.070437461
## [18,] -0.3009325805 -0.27827997 -0.183468090 -0.24924191 -0.120746443
## [19,] -0.1323699612 0.16649098 0.146523898 -0.02438587 -0.004764372
## [20,] 0.2650235573 -0.04197840 -0.064198930 0.13479804 0.058299086
## [21,] -0.0266367570 -0.08414051 0.035901178 0.21531961 -0.019033132
## [22,] -0.1784967480 -0.38220875 -0.061019476 -0.10245025 0.065584851
## [23,] 0.0002590246 0.16793850 -0.062979887 -0.21493774 0.190248341
## [24,] 0.3768853491 0.48613501 -0.120805688 0.12340610 -0.181065955
## [25,] -0.0549279697 0.13469650 0.467979249 -0.17267244 -0.331705364
## [26,] 0.6682429521 -0.31460817 0.096985482 -0.42619396 -0.258192820
## [27,] 0.0747633927 -0.20786286 0.261684675 0.15940947 0.454059167
## [28,] -0.1747356815 0.13999203 -0.183500247 0.14317523 -0.276835144
## [29,] 0.0086525893 0.02242725 0.039190364 -0.14120018 0.110723026
## [,16] [,17] [,18] [,19] [,20]
## [1,] 0.307043040 -0.111102346 -0.10755715 0.119481022 0.634592582
## [2,] -0.139797586 0.262204268 -0.11173744 0.071863869 0.056457644
## [3,] 0.139797586 -0.262204268 0.11173744 -0.071863869 -0.056457644
## [4,] -0.387590877 -0.004321684 -0.03778667 -0.081334668 0.038321037
## [5,] 0.145496306 0.007070978 -0.04754370 0.050109424 0.003776925
## [6,] 0.304719637 -0.001099695 0.08065819 0.047785079 -0.045118642
## [7,] -0.209913718 0.024729046 0.51745690 0.228395658 -0.207888802
## [8,] 0.088998760 -0.311178051 -0.17199235 -0.614251536 -0.293268601
## [9,] 0.175395522 0.107830779 0.08751724 0.050378871 0.008164833
## [10,] -0.175395522 -0.107830779 -0.08751724 -0.050378871 -0.008164833
## [11,] 0.015393105 0.003828996 0.09130157 -0.123980608 -0.025264759
## [12,] -0.160146237 -0.526368999 -0.16138195 0.389205946 0.081912822
## [13,] 0.015572132 0.097955937 -0.06011090 0.048740716 0.009429520
## [14,] -0.191364311 -0.132133530 0.34932228 -0.350327215 0.613049818
## [15,] -0.098331517 -0.054993699 -0.07387301 0.001615882 -0.074831716
## [16,] 0.098331517 0.054993699 0.07387301 -0.001615882 0.074831716
## [17,] 0.130630351 0.017275377 0.19641737 0.044658121 0.014852663
## [18,] 0.257749536 0.236779204 -0.01835022 -0.205780747 0.076968959
## [19,] -0.115899019 -0.198654712 0.03992670 -0.056407624 0.034717513
## [20,] 0.000817087 0.092347221 -0.03153132 0.147335259 -0.068640573
## [21,] 0.040677403 -0.033486402 -0.15915730 0.067902495 0.067031797
## [22,] -0.307156634 -0.082632208 0.14084179 -0.119478608 -0.037244599
## [23,] 0.400493663 -0.129741079 0.08790932 0.081575166 -0.148382938
## [24,] -0.107273784 0.376830513 0.21119126 -0.022227162 0.086051127
## [25,] -0.135558575 0.043502495 -0.46680415 -0.089227879 0.107268300
## [26,] 0.109837958 -0.142040654 0.17122845 -0.143891587 0.023977683
## [27,] 0.020303257 0.250400560 -0.06807129 -0.298951277 -0.021055632
## [28,] 0.055709181 -0.217044473 0.19542102 0.023151988 -0.039387046
## [29,] -0.089443704 0.127236152 -0.20154651 0.153751053 0.046198032
## [,21]
## [1,] 0.055835074
## [2,] 0.009716911
## [3,] -0.009716911
## [4,] -0.024069859
## [5,] 0.020759097
## [6,] 0.009246758
## [7,] -0.091112831
## [8,] 0.044978707
## [9,] -0.055246780
## [10,] 0.055246780
## [11,] 0.002668114
## [12,] -0.131111145
## [13,] 0.022684782
## [14,] -0.004156898
## [15,] 0.039936533
## [16,] -0.039936533
## [17,] 0.783813396
## [18,] -0.400630885
## [19,] -0.100211495
## [20,] 0.277298232
## [21,] 0.122531418
## [22,] 0.043260285
## [23,] -0.158659708
## [24,] -0.133158564
## [25,] 0.079020446
## [26,] 0.001634511
## [27,] 0.158154115
## [28,] -0.005608392
## [29,] -0.073239274
# explainable variance ratio
avo <- ei_df$values / sum(ei_df$values)
round(avo, 4)
## [1] 0.1740 0.0861 0.0792 0.0665 0.0602 0.0590 0.0558 0.0488 0.0452 0.0429
## [11] 0.0357 0.0349 0.0341 0.0311 0.0284 0.0257 0.0250 0.0215 0.0211 0.0133
## [21] 0.0113 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
sum(avo[1:31])
## [1] NA
# Creating full(including all varibales) and null(intercept only) models for d_1
logit_full = glm(heart_disease~., data = data, family = binomial)
logit_null = glm(heart_disease~1, data = data, family = binomial)
# Forward selection procedure
forward = step(logit_null, scope = list(lower = logit_null, upper = logit_full), direction = "forward")
## Start: AIC=288.3
## heart_disease ~ 1
##
## Df Deviance AIC
## + thal 3 227.45 235.45
## + ca 4 231.45 241.45
## + oldpeak 1 243.02 247.02
## + cp 2 243.35 249.35
## + thalach 1 251.58 255.58
## + slope 2 255.91 261.91
## + exang 1 262.64 266.64
## + sex 1 269.75 273.75
## + age 1 271.15 275.15
## + restecg 2 274.01 280.01
## + chol 1 278.52 282.52
## + trestbps 1 283.38 287.38
## <none> 286.30 288.30
## + fbs 1 286.14 290.14
##
## Step: AIC=235.45
## heart_disease ~ thal
##
## Df Deviance AIC
## + ca 4 186.85 202.85
## + oldpeak 1 204.19 214.19
## + thalach 1 204.24 214.24
## + cp 2 202.45 214.45
## + slope 2 209.78 221.78
## + exang 1 215.21 225.21
## + age 1 218.07 228.07
## + chol 1 218.62 228.62
## + restecg 2 216.67 228.67
## <none> 227.45 235.45
## + sex 1 226.23 236.23
## + trestbps 1 226.51 236.51
## + fbs 1 227.44 237.44
##
## Step: AIC=202.85
## heart_disease ~ thal + ca
##
## Df Deviance AIC
## + cp 2 167.33 187.33
## + thalach 1 171.44 189.44
## + oldpeak 1 171.65 189.65
## + slope 2 170.20 190.20
## + exang 1 175.59 193.59
## + restecg 2 175.68 195.68
## + chol 1 180.80 198.80
## + sex 1 183.78 201.78
## <none> 186.85 202.85
## + trestbps 1 185.31 203.31
## + fbs 1 185.96 203.96
## + age 1 186.11 204.11
##
## Step: AIC=187.33
## heart_disease ~ thal + ca + cp
##
## Df Deviance AIC
## + thalach 1 159.02 181.02
## + slope 2 157.43 181.43
## + oldpeak 1 160.43 182.43
## + chol 1 160.98 182.98
## + restecg 2 159.78 183.78
## + exang 1 163.06 185.06
## <none> 167.33 187.33
## + trestbps 1 165.95 187.95
## + sex 1 166.01 188.01
## + age 1 166.64 188.64
## + fbs 1 167.11 189.11
##
## Step: AIC=181.02
## heart_disease ~ thal + ca + cp + thalach
##
## Df Deviance AIC
## + chol 1 153.44 177.44
## + restecg 2 153.23 179.23
## + oldpeak 1 155.23 179.23
## + slope 2 153.34 179.34
## + exang 1 156.69 180.69
## + sex 1 156.97 180.97
## <none> 159.02 181.02
## + trestbps 1 157.12 181.12
## + fbs 1 158.74 182.74
## + age 1 158.97 182.97
##
## Step: AIC=177.44
## heart_disease ~ thal + ca + cp + thalach + chol
##
## Df Deviance AIC
## + sex 1 147.64 173.64
## + oldpeak 1 150.28 176.28
## + slope 2 148.72 176.72
## <none> 153.44 177.44
## + restecg 2 149.60 177.60
## + exang 1 151.80 177.80
## + trestbps 1 152.56 178.56
## + age 1 152.69 178.69
## + fbs 1 153.26 179.26
##
## Step: AIC=173.64
## heart_disease ~ thal + ca + cp + thalach + chol + sex
##
## Df Deviance AIC
## + oldpeak 1 144.55 172.55
## + slope 2 142.61 172.61
## + exang 1 145.36 173.36
## <none> 147.64 173.64
## + restecg 2 143.90 173.90
## + trestbps 1 146.31 174.31
## + age 1 147.07 175.07
## + fbs 1 147.53 175.53
##
## Step: AIC=172.55
## heart_disease ~ thal + ca + cp + thalach + chol + sex + oldpeak
##
## Df Deviance AIC
## <none> 144.55 172.55
## + slope 2 141.03 173.03
## + exang 1 143.15 173.15
## + trestbps 1 143.63 173.63
## + age 1 144.16 174.16
## + restecg 2 142.16 174.16
## + fbs 1 144.27 174.27
# Backward selection procedure
backward = step(logit_full, scope = list(lower = logit_null, upper = logit_full), direction = "backward")
## Start: AIC=179.49
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + restecg +
## thalach + exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - restecg 2 137.03 177.03
## - oldpeak 1 135.76 177.76
## - fbs 1 136.03 178.03
## - age 1 136.26 178.26
## - trestbps 1 136.85 178.85
## - slope 2 138.93 178.93
## - exang 1 137.09 179.09
## - cp 2 139.42 179.42
## <none> 135.49 179.49
## - thalach 1 139.61 181.61
## - chol 1 140.91 182.91
## - sex 1 141.86 183.86
## - thal 3 150.43 188.43
## - ca 4 162.40 198.40
##
## Step: AIC=177.03
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + thalach +
## exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - oldpeak 1 137.58 175.58
## - fbs 1 137.61 175.61
## - age 1 137.96 175.96
## - exang 1 138.60 176.60
## - slope 2 140.61 176.61
## - cp 2 140.93 176.93
## <none> 137.03 177.03
## - trestbps 1 139.06 177.06
## - thalach 1 141.62 179.62
## - chol 1 143.73 181.73
## - sex 1 143.85 181.85
## - thal 3 151.11 185.11
## - ca 4 164.23 196.23
##
## Step: AIC=175.58
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + thalach +
## exang + slope + ca + thal
##
## Df Deviance AIC
## - fbs 1 138.28 174.28
## - age 1 138.76 174.76
## <none> 137.58 175.58
## - exang 1 139.62 175.62
## - cp 2 141.72 175.72
## - trestbps 1 139.85 175.85
## - slope 2 142.28 176.28
## - thalach 1 142.63 178.63
## - chol 1 144.68 180.68
## - sex 1 144.91 180.91
## - thal 3 151.98 183.98
## - ca 4 169.03 199.03
##
## Step: AIC=174.28
## heart_disease ~ age + sex + cp + trestbps + chol + thalach +
## exang + slope + ca + thal
##
## Df Deviance AIC
## - age 1 139.43 173.43
## - exang 1 140.00 174.00
## - trestbps 1 140.15 174.15
## <none> 138.28 174.28
## - slope 2 142.93 174.93
## - cp 2 143.14 175.14
## - thalach 1 143.33 177.33
## - sex 1 145.41 179.41
## - chol 1 145.69 179.69
## - thal 3 152.38 182.38
## - ca 4 169.15 197.15
##
## Step: AIC=173.43
## heart_disease ~ sex + cp + trestbps + chol + thalach + exang +
## slope + ca + thal
##
## Df Deviance AIC
## - trestbps 1 140.80 172.80
## - exang 1 141.22 173.22
## <none> 139.43 173.43
## - slope 2 144.00 174.00
## - cp 2 144.77 174.77
## - thalach 1 143.45 175.45
## - chol 1 146.02 178.02
## - sex 1 146.65 178.65
## - thal 3 154.14 182.14
## - ca 4 169.43 195.43
##
## Step: AIC=172.8
## heart_disease ~ sex + cp + chol + thalach + exang + slope + ca +
## thal
##
## Df Deviance AIC
## - exang 1 142.61 172.61
## <none> 140.80 172.80
## - slope 2 145.36 173.36
## - cp 2 146.37 174.37
## - thalach 1 144.53 174.53
## - sex 1 147.37 177.37
## - chol 1 148.42 178.42
## - thal 3 156.11 182.11
## - ca 4 170.18 194.18
##
## Step: AIC=172.61
## heart_disease ~ sex + cp + chol + thalach + slope + ca + thal
##
## Df Deviance AIC
## <none> 142.61 172.61
## - slope 2 147.64 173.64
## - thalach 1 147.53 175.53
## - cp 2 150.23 176.23
## - sex 1 148.72 176.72
## - chol 1 150.41 178.41
## - thal 3 159.14 183.14
## - ca 4 171.37 193.37
# Both
both = step(logit_full, scope = list(lower = logit_null, upper = logit_full), direction = "both")
## Start: AIC=179.49
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + restecg +
## thalach + exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - restecg 2 137.03 177.03
## - oldpeak 1 135.76 177.76
## - fbs 1 136.03 178.03
## - age 1 136.26 178.26
## - trestbps 1 136.85 178.85
## - slope 2 138.93 178.93
## - exang 1 137.09 179.09
## - cp 2 139.42 179.42
## <none> 135.49 179.49
## - thalach 1 139.61 181.61
## - chol 1 140.91 182.91
## - sex 1 141.86 183.86
## - thal 3 150.43 188.43
## - ca 4 162.40 198.40
##
## Step: AIC=177.03
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + thalach +
## exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - oldpeak 1 137.58 175.58
## - fbs 1 137.61 175.61
## - age 1 137.96 175.96
## - exang 1 138.60 176.60
## - slope 2 140.61 176.61
## - cp 2 140.93 176.93
## <none> 137.03 177.03
## - trestbps 1 139.06 177.06
## + restecg 2 135.49 179.49
## - thalach 1 141.62 179.62
## - chol 1 143.73 181.73
## - sex 1 143.85 181.85
## - thal 3 151.11 185.11
## - ca 4 164.23 196.23
##
## Step: AIC=175.58
## heart_disease ~ age + sex + cp + trestbps + chol + fbs + thalach +
## exang + slope + ca + thal
##
## Df Deviance AIC
## - fbs 1 138.28 174.28
## - age 1 138.76 174.76
## <none> 137.58 175.58
## - exang 1 139.62 175.62
## - cp 2 141.72 175.72
## - trestbps 1 139.85 175.85
## - slope 2 142.28 176.28
## + oldpeak 1 137.03 177.03
## + restecg 2 135.76 177.76
## - thalach 1 142.63 178.63
## - chol 1 144.68 180.68
## - sex 1 144.91 180.91
## - thal 3 151.98 183.98
## - ca 4 169.03 199.03
##
## Step: AIC=174.28
## heart_disease ~ age + sex + cp + trestbps + chol + thalach +
## exang + slope + ca + thal
##
## Df Deviance AIC
## - age 1 139.43 173.43
## - exang 1 140.00 174.00
## - trestbps 1 140.15 174.15
## <none> 138.28 174.28
## - slope 2 142.93 174.93
## - cp 2 143.14 175.14
## + fbs 1 137.58 175.58
## + oldpeak 1 137.61 175.61
## + restecg 2 136.34 176.34
## - thalach 1 143.33 177.33
## - sex 1 145.41 179.41
## - chol 1 145.69 179.69
## - thal 3 152.38 182.38
## - ca 4 169.15 197.15
##
## Step: AIC=173.43
## heart_disease ~ sex + cp + trestbps + chol + thalach + exang +
## slope + ca + thal
##
## Df Deviance AIC
## - trestbps 1 140.80 172.80
## - exang 1 141.22 173.22
## <none> 139.43 173.43
## - slope 2 144.00 174.00
## + age 1 138.28 174.28
## + oldpeak 1 138.52 174.52
## + fbs 1 138.76 174.76
## - cp 2 144.77 174.77
## + restecg 2 137.20 175.20
## - thalach 1 143.45 175.45
## - chol 1 146.02 178.02
## - sex 1 146.65 178.65
## - thal 3 154.14 182.14
## - ca 4 169.43 195.43
##
## Step: AIC=172.8
## heart_disease ~ sex + cp + chol + thalach + exang + slope + ca +
## thal
##
## Df Deviance AIC
## - exang 1 142.61 172.61
## <none> 140.80 172.80
## - slope 2 145.36 173.36
## + trestbps 1 139.43 173.43
## + oldpeak 1 139.76 173.76
## + restecg 2 137.98 173.98
## + age 1 140.15 174.15
## - cp 2 146.37 174.37
## + fbs 1 140.46 174.46
## - thalach 1 144.53 174.53
## - sex 1 147.37 177.37
## - chol 1 148.42 178.42
## - thal 3 156.11 182.11
## - ca 4 170.18 194.18
##
## Step: AIC=172.61
## heart_disease ~ sex + cp + chol + thalach + slope + ca + thal
##
## Df Deviance AIC
## <none> 142.61 172.61
## + exang 1 140.80 172.80
## + oldpeak 1 141.03 173.03
## + trestbps 1 141.22 173.22
## + restecg 2 139.57 173.57
## - slope 2 147.64 173.64
## + age 1 141.91 173.91
## + fbs 1 142.47 174.47
## - thalach 1 147.53 175.53
## - cp 2 150.23 176.23
## - sex 1 148.72 176.72
## - chol 1 150.41 178.41
## - thal 3 159.14 183.14
## - ca 4 171.37 193.37
forward$formula
## heart_disease ~ thal + ca + cp + thalach + chol + sex + oldpeak
Logistic model
logistic_model = glm(forward$formula, data=data, family = binomial)
summary(logistic_model)
##
## Call:
## glm(formula = forward$formula, family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3951 -0.4321 0.1993 0.4786 2.5288
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.828e-02 2.429e+00 -0.008 0.993996
## thal1 2.788e+00 1.843e+00 1.513 0.130229
## thal2 2.772e+00 1.677e+00 1.653 0.098311 .
## thal3 1.006e+00 1.685e+00 0.597 0.550386
## ca1 -1.375e+00 5.520e-01 -2.491 0.012729 *
## ca2 -2.773e+00 7.354e-01 -3.771 0.000163 ***
## ca3 -2.617e+00 1.090e+00 -2.401 0.016329 *
## ca4 1.256e+01 1.370e+03 0.009 0.992683
## cpatypical angina 1.061e+00 6.670e-01 1.591 0.111692
## cpnon-anginal pain 1.282e+00 5.515e-01 2.325 0.020073 *
## thalach 2.490e-02 1.076e-02 2.314 0.020651 *
## chol -1.457e-02 5.227e-03 -2.788 0.005300 **
## sexmale -1.470e+00 6.368e-01 -2.308 0.020996 *
## oldpeak -4.030e-01 2.325e-01 -1.733 0.083100 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 286.30 on 211 degrees of freedom
## Residual deviance: 144.55 on 198 degrees of freedom
## AIC: 172.55
##
## Number of Fisher Scoring iterations: 15
backward$formula
## heart_disease ~ sex + cp + chol + thalach + slope + ca + thal
both$formula
## heart_disease ~ sex + cp + chol + thalach + slope + ca + thal
Symmetric Cost Logistic Regression
# predicting on train set
predict_logit_train <- predict(logistic_model, type="response")
costfunc = function(obs, pred.p,pcut){
weight1 = 1 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return to a value when you write R functions
}
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = data$heart_disease, pred.p = predict_logit_train, pcut = p.seq[i])
} # end of the loop
optimal.pcut.sym = p.seq[which(cost==min(cost))][1]
optimal.pcut.sym ##0.42
## [1] 0.42
plot(p.seq, cost)
#predict the response
pred_train <- predict(logistic_model, type = "response")
#convert to binary based on optimal pcut for sym
pred.train.sym <- as.numeric(pred_train > optimal.pcut.sym)
confusion_matrix_train_sym <- table(data$heart_disease, pred.train.sym,dnn =c('true','predicted'))
confusion_matrix_train_sym
## predicted
## true 0 1
## 0 67 19
## 1 8 118
misclassification_rate_train_sym <- round((confusion_matrix_train_sym[2]+confusion_matrix_train_sym[3])/sum(confusion_matrix_train_sym), 2)
cat("train misclassfication rate:", misclassification_rate_train_sym)
## train misclassfication rate: 0.13
#Symmetric cost evaluation function
cost0 <- function(r, pi){
weight1 = 1
weight0 = 1
c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
#Aymmetric cost evaluation function
cost1 <- function(r, pi){
weight1 = 5
weight0 = 1
c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
In sample- symmetric
#LOSS
LOSS=(insamplecost0<-cost0(data$heart_disease,pred.train.sym))
#Misclassification rate
MCR=(insampleMCR0<-mean(data$heart_disease!=pred.train.sym))
#Roc curve | AUC
pred <- prediction(as.numeric(pred_train),as.numeric(data$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
rslt_sym_in<-c(LOSS,MCR,AUC)
rslt_sym_in
## [1] 0.1273585 0.1273585 0.9262643
Optimal pcut - Assymetric Cost Logistic Regression
costfuncasym = function(obs, pred.p,pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return to a value when you write R functions
} # end of the function
p.seq = seq(0.01, 1, 0.01)
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfuncasym(obs = data$heart_disease, pred.p = predict_logit_train, pcut = p.seq[i])
} # end of the loop
optimal.pcut.asym = p.seq[which(cost==min(cost))][1]
optimal.pcut.asym #0.2
## [1] 0.2
plot(p.seq, cost)
pred_train <- predict(logistic_model, type = "response")
pred.train.asym <- as.numeric(pred_train > optimal.pcut.asym)
confusion_matrix_train_asym <- table(data$heart_disease, pred.train.asym, dnn =c('true','predicted'))
confusion_matrix_train_asym
## predicted
## true 0 1
## 0 47 39
## 1 4 122
misclassification_rate_train_asym <- round((confusion_matrix_train_asym[2]+confusion_matrix_train_asym[3])/sum(confusion_matrix_train_asym), 2)
cat("train Asymmetric misclassfication rate:", misclassification_rate_train_asym)
## train Asymmetric misclassfication rate: 0.2
In sample - asymmetric
#LOSS
LOSS=(insamplecost1<-cost1(data$heart_disease,pred.train.asym))
#Misclassification rate
MCR=(insampleMCR1<-mean(data$heart_disease!=pred.train.asym))
#Roc curve | AUC
pred <- prediction(as.numeric(pred_train),as.numeric(data$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
rslt_asym_in<-c(LOSS,MCR,AUC)
rslt_asym_in
## [1] 0.2783019 0.2028302 0.9262643
Out sample - symmetric
#predict the response
pred_test <- predict(logistic_model,newdata=data_test, type = "response")
#convert to binary based on optimal pcut for sym
pred.test.sym <- as.numeric(pred_test > optimal.pcut.sym)
#LOSS
LOSS=(outsamplecost0<-cost0(data_test$heart_disease,pred.test.sym))
#Misclassification rate
MCR=(outsampleMCR0<-mean(data_test$heart_disease!=pred.test.sym))
#Roc curve | AUC
pred <- prediction(as.numeric(pred_test),as.numeric(data_test$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
rslt_sym_out<-c(LOSS,MCR,AUC)
rslt_sym_out
## [1] 0.1758242 0.1758242 0.8801775
Out sample - asymmetric
#convert to binary based on optimal pcut for sym
pred.test.asym <- as.numeric(pred_test > optimal.pcut.asym)
#LOSS
LOSS=(outsamplecost1<-cost1(data_test$heart_disease,pred.test.asym))
#Misclassification rate
MCR=(outsampleMCR1<-mean(data_test$heart_disease!=pred.test.asym))
#Roc curve | AUC
pred <- prediction(as.numeric(pred_test),as.numeric(data_test$heart_disease))
perf.l<- performance(pred, "tpr", "fpr")
plot(perf.l, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
rslt_asym_out<-c(LOSS,MCR,AUC)
rslt_asym_out
## [1] 0.4175824 0.3296703 0.8801775
eval<-data.frame(rbind(rslt_sym_in,rslt_sym_out,rslt_asym_in,rslt_asym_out),
row.names=c('Logistic - Symmetric cost : in_sample',
'Logistic - Symmetric cost : out_sample',
'Logistic - Asymmetric cost: in_sample',
'Logistic - Asymmetric cost : out_sample'))
colnames(eval) <- c("COST", "MCR","AUC")
kable(eval) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| COST | MCR | AUC | |
|---|---|---|---|
| Logistic - Symmetric cost : in_sample | 0.1273585 | 0.1273585 | 0.9262643 |
| Logistic - Symmetric cost : out_sample | 0.1758242 | 0.1758242 | 0.8801775 |
| Logistic - Asymmetric cost: in_sample | 0.2783019 | 0.2028302 | 0.9262643 |
| Logistic - Asymmetric cost : out_sample | 0.4175824 | 0.3296703 | 0.8801775 |
CART stands for classification and regression tree.
From the training Model we have learnt through grid search that optimum pcut value for * Symmetric misclassfication rate: 0.42 * Asymmetric misclassfication rate: 0.20
We will be using the same for further analysis
Here we make the assumption that false negative cost 5 times of false positive. In real life the cost structure should be carefully researched.
#data$heart_disease<-as.factor(data$heart_disease)
#symmetric cost
hrtDS.rpart0 <- rpart(formula = heart_disease ~ ., data = data, method = "class")
#asymmetric cost
hrtDS.rpart1 <- rpart(formula = heart_disease ~ . , data = data, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
Symmetric Cost
pred0<- predict(hrtDS.rpart0, type="class")
table(data$heart_disease, pred0, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 69 17
## 1 19 107
False negatives are heavy costing so lets, add more penality of 5:1
Asymmetric Cost insample
pred1<- predict(hrtDS.rpart1, type="class")
table(data$heart_disease, pred1, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 45 41
## 1 1 125
As expected our False negatives have reduced to a substantial number but False positives have increased.
Lets evaluate the performance of both these models, keeping in mid that FN are far more harzardous than FP
Evaluation Function
#Symmetric cost evaluation function
cost0 <- function(r, pi){
weight1 = 1
weight0 = 1
c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
#Aymmetric cost evaluation function
cost1 <- function(r, pi){
weight1 = 5
weight0 = 1
c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
Symmetric Evaluation Here we try to calculate the criterians Missclassification rate | LOSS | AUC for Symmetric cost
data$heart_disease<-as.factor(data$heart_disease)
#LOSS
LOSS=(insamplecost0<-cost1(data$heart_disease,pred0))
#Misclassification rate
MCR=(insampleMCR0<-mean(data$heart_disease!=pred0))
#Roc curve | AUC
pred0_roc<- predict(hrtDS.rpart0, type="prob")[,2]
pred <- prediction(pred0_roc,data[,14])
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
rslt_sym<-c(LOSS,MCR,AUC)
rslt_sym
## [1] 0.5283019 0.1698113 0.8554356
Asymmetric Evaluation
Here we try tocalculate the criterians Missclassification rate | LOSS | AUC for Asymmetric cost
#LOSS
LOSS=(insamplecost.df<-cost1(data$heart_disease,pred1))
#Misclassification rate
MCR=(insampleMCR.df<-mean(data$heart_disease!=pred1))
#roc curve | AUC
pred1_roc<- predict(hrtDS.rpart1, type="prob")[,2]
pred <- prediction(pred1_roc,as.numeric(data$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
rslt_Asym<-c(LOSS,MCR,AUC)
rslt_Asym
## [1] 0.2169811 0.1981132 0.8633721
Summary table
eval<-data.frame(rbind(rslt_sym,rslt_Asym),
row.names=c('Symmetric cost','Asymmetric cost'))
colnames(eval) <- c("COST", "MCR","AUC")
kable(eval) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| COST | MCR | AUC | |
|---|---|---|---|
| Symmetric cost | 0.5283019 | 0.1698113 | 0.8554356 |
| Asymmetric cost | 0.2169811 | 0.1981132 | 0.8633721 |
Conclusion: ASY wins
The ideal way of creating a binary tree is to construct a large tree and then prune it to an optimum level
hrtDS.rpart_prune <- rpart(formula = heart_disease ~ . ,
data = data, method = "class",
parms = list(loss=matrix(c(0,5,1,0), nrow = 2)),
cp = 0.001)
We have a bushy tree, Now we can try to prune up to a optimum level using plotcp function
prp(hrtDS.rpart_prune)
plotcp(hrtDS.rpart_prune)
printcp(hrtDS.rpart_prune)
##
## Classification tree:
## rpart(formula = heart_disease ~ ., data = data, method = "class",
## parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)), cp = 0.001)
##
## Variables actually used in tree construction:
## [1] ca oldpeak thal thalach
##
## Root node error: 86/212 = 0.40566
##
## n= 212
##
## CP nsplit rel error xerror xstd
## 1 0.122093 0 1.00000 5.0000 0.41566
## 2 0.073643 2 0.75581 3.9767 0.39093
## 3 0.001000 5 0.53488 3.5233 0.37678
xerror gives you the cross-validation (default is 10-fold) error. You can see that the rel error (in-sample error) is always decreasing as model is more complex, while the cross-validation error (measure of performance on future observations) is not. That is why we prune the tree to avoid overfitting the training data.
hrtDS.rpart_prune1<-prune(hrtDS.rpart_prune, cp = 0.0086)
Insample Prediction
hrtDS.rpart_prune1.pred.in<- predict(hrtDS.rpart_prune1, type="class")
table(data$heart_disease, hrtDS.rpart_prune1.pred.in, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 45 41
## 1 1 125
Lets check out the criterians for model evaluation…
Missclassification rate | LOSS | AUC
#LOSS
LOSS=(dt.insamplecost<-cost1(data$heart_disease,hrtDS.rpart_prune1.pred.in))
#Misclassification rate
MCR=(dt.insampleMCR<-mean(data$heart_disease!=hrtDS.rpart_prune1.pred.in))
#roc curve | AUC
hrtDS.rpart_prune1.pred.in_roc<- predict(hrtDS.rpart_prune1, type="prob")[,2]
pred <- prediction(hrtDS.rpart_prune1.pred.in_roc,data$heart_disease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
#vectorized output
dt.rslt.in<-c(LOSS,MCR,AUC)
dt.rslt.in
## [1] 0.2169811 0.1981132 0.8633721
Out of sample Prediction
hrtDS.rpart_prune1.pred.out<- predict(hrtDS.rpart_prune1,data_test, type="class")
table(data_test$heart_disease, hrtDS.rpart_prune1.pred.out, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 27 25
## 1 2 37
Missclassification rate | LOSS | AUC
#LOSS
LOSS=(dt.outsamplecost<-cost1(data_test$heart_disease,hrtDS.rpart_prune1.pred.out))
#Misclassification rate
MCR=(dt.outsampleMCR<-mean(data_test$heart_disease!=hrtDS.rpart_prune1.pred.out))
#roc curve | AUC
hrtDS.rpart_prune1.pred.out_roc<- predict(hrtDS.rpart_prune1,data_test, type="prob")[,2]
pred <- prediction(hrtDS.rpart_prune1.pred.out_roc,as.numeric(data_test$heart_disease))
perf.dt1 <- performance(pred, "tpr", "fpr")
plot(perf.dt1, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
#vectorized output
dt.rslt.out<-c(LOSS,MCR,AUC)
dt.rslt.out
## [1] 0.3846154 0.2967033 0.7924063
eval<-data.frame(rbind(dt.rslt.in,dt.rslt.out),
row.names=c('DT: in','DT: out'))
colnames(eval) <- c("COST", "MCR","AUC")
kable(eval) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| COST | MCR | AUC | |
|---|---|---|---|
| DT: in | 0.2169811 | 0.1981132 | 0.8633721 |
| DT: out | 0.3846154 | 0.2967033 | 0.7924063 |
#fetch numeric columns
numcols<-select_if(data_orig, is.numeric)
#fetch character columns
charcols<-select_if(data_orig, is.factor)
#continious columns names
(n<-colnames(numcols)[1:5])
## [1] "age" "trestbps" "chol" "thalach" "oldpeak"
#character columns names
colnames(charcols)
## [1] "sex" "cp" "fbs" "restecg" "exang" "slope" "ca"
## [8] "thal"
#continious columns names
cont<-colnames(numcols)
#character columns names
n <- names(data)
formula.gam <- as.formula(paste("heart_disease ~s(age)+s(trestbps)+s(chol) +s(thalach) +s(oldpeak) +",
paste(n[!n %in% cont], collapse = " + ")))
formula.gam
## heart_disease ~ s(age) + s(trestbps) + s(chol) + s(thalach) +
## s(oldpeak) + sex + cp + fbs + restecg + exang + slope + ca +
## thal
hrtDS.gam <- gam(formula = formula.gam, family=binomial,data=data);
summary(hrtDS.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## heart_disease ~ s(age) + s(trestbps) + s(chol) + s(thalach) +
## s(oldpeak) + sex + cp + fbs + restecg + exang + slope + ca +
## thal
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.959e+00 2.835e+00 -1.044 0.29663
## sexmale -2.353e+00 9.793e-01 -2.403 0.01625 *
## cpatypical angina 2.483e-01 8.929e-01 0.278 0.78097
## cpnon-anginal pain 1.186e+00 7.301e-01 1.625 0.10424
## fbs>120 1.446e+00 9.946e-01 1.454 0.14593
## restecgdefinite -5.351e+01 4.745e+07 0.000 1.00000
## restecgnormal -7.645e-01 5.813e-01 -1.315 0.18847
## exangyes -7.125e-01 6.413e-01 -1.111 0.26659
## slope1 7.726e-01 1.584e+00 0.488 0.62571
## slope2 1.478e+00 1.751e+00 0.844 0.39878
## ca1 -2.131e+00 7.968e-01 -2.674 0.00750 **
## ca2 -4.755e+00 1.194e+00 -3.982 6.84e-05 ***
## ca3 -4.941e+00 1.707e+00 -2.894 0.00380 **
## ca4 4.847e+01 3.875e+07 0.000 1.00000
## thal1 7.125e+00 2.618e+00 2.722 0.00649 **
## thal2 7.174e+00 2.377e+00 3.018 0.00254 **
## thal3 4.523e+00 2.237e+00 2.021 0.04323 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 8.226 8.583 15.469 0.0680 .
## s(trestbps) 1.000 1.000 3.236 0.0721 .
## s(chol) 2.627 3.329 8.159 0.0627 .
## s(thalach) 3.176 3.983 5.228 0.2498
## s(oldpeak) 5.268 6.271 11.667 0.0778 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.646 Deviance explained = 66.2%
## UBRE = -0.19181 Scale est. = 1 n = 212
plot(hrtDS.gam, shade=TRUE,seWithMean=TRUE,scale=0, pages = 1)
AIC(hrtDS.gam)
## [1] 171.3359
BIC(hrtDS.gam)
## [1] 296.5283
hrtDS.gam$deviance
## [1] 96.74079
We can notice that even the continious variables in the data do not have a non-linear distribution. This is backed up by the above plots where we can see a straight line
Insample Performance
pcut.gam <- .20
prob.gam.in<-predict(hrtDS.gam,type="response")
pred.gam.in.class<-(prob.gam.in>=pcut.gam)*1
table(data$heart_disease,pred.gam.in.class,dnn=c("Observed","Predicted"))
## Predicted
## Observed 0 1
## 0 60 26
## 1 2 124
Lets check out the criterians for model evaluation…
Missclassification rate | LOSS | AUC
#LOSS
LOSS=(gam.insamplecost<-cost1(data$heart_disease,pred.gam.in.class))
#Misclassification rate
MCR=(dt.insampleMCR<-mean(data$heart_disease!=pred.gam.in.class))
#roc curve | AUC
pred.gam.class.in_roc<- predict(hrtDS.gam, type="prob")[,2]
pred <- prediction(as.numeric(pred.gam.class.in_roc),as.numeric(data$heart_disease))
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
#vectorized output
gam.rslt.in<-c(LOSS,MCR,AUC)
gam.rslt.in
## [1] 0.1698113 0.1320755 0.7214839
Out of sample Prediction
prob.gam.out<-predict(hrtDS.gam,data_test,type="response")
pred.gam.out.class<-(prob.gam.out>=pcut.gam)*1
table(data_test$heart_disease,pred.gam.out.class,dnn=c("Observed","Predicted"))
## Predicted
## Observed 0 1
## 0 22 30
## 1 2 37
Missclassification rate | LOSS | AUC
#LOSS
LOSS=(dt.outsamplecost<-cost1(data_test$heart_disease,pred.gam.out.class))
#Misclassification rate
MCR=(dt.outsampleMCR<-mean(data_test$heart_disease!=pred.gam.out.class))
#roc curve | AUC
hrtDS.gam.out_roc<- predict(hrtDS.rpart_prune1,data_test, type="prob")[,2]
pred <- prediction(as.numeric(hrtDS.gam.out_roc),as.numeric(data_test$heart_disease))
perf.gm <- performance(pred, "tpr", "fpr")
plot(perf.gm, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
#vectorized output
gam.rslt.out<-c(LOSS,MCR,AUC)
gam.rslt.out
## [1] 0.4395604 0.3516484 0.7924063
eval<-data.frame(rbind(gam.rslt.in,gam.rslt.out),
row.names=c('GAM: in','GAM: out'))
colnames(eval) <- c("COST", "MCR","AUC")
kable(eval) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| COST | MCR | AUC | |
|---|---|---|---|
| GAM: in | 0.1698113 | 0.1320755 | 0.7214839 |
| GAM: out | 0.4395604 | 0.3516484 | 0.7924063 |
Advanced Tree Models Bagging, Random Forests, and Boosting
Bagging stands for Bootstrap and Aggregating. It employs the idea of bootstrap but the purpose is not to study bias and standard errors of estimates. Instead, the goal of Bagging is to improve prediction accuracy. It fits a tree for each bootsrap sample, and then aggregate the predicted values from all these different trees. For more details, you may look at Wikepedia, or you can find the original paper Leo Breiman (1996).
To my best knowledge, it seems that bagging() won’t take an argument for asymmetric loss. Therefore, the classification results might not be appropriate.Lets check it out…
Modelling
heart.bag<- randomForest(as.factor(heart_disease)~.,mtry=ncol(data)-1, data = data, nbagg=100)
Insample
heart.bag.pred<- predict(heart.bag, type="prob")[,2]
pred.bag = prediction(heart.bag.pred, data$heart_disease)
perf = performance(pred.bag, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred.bag, "auc"), "y.values"))
## [1] 0.84067
Insample Analysis
heart.bag.pred<- predict(heart.bag, newdata = data, type="class")
table(data$heart_disease, heart.bag.pred, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 86 0
## 1 0 126
#LOSS
LOSS<-cost0(data$heart_disease,heart.bag.pred)
#Misclassification rate
MCR<-mean(data$heart_disease!=heart.bag.pred)
#Get the AUC
AUC=unlist(slot(performance(pred.bag, "auc"), "y.values"))
rslt.bag.in<-c(LOSS,MCR,AUC)
rslt.bag.in
## [1] 0.00000 0.00000 0.84067
Out of sample
levels(data_test$thal) <- levels(data$thal)
levels(data_test$ca) <- levels(data$ca)
levels(data_test$slope) <- levels(data$slope)
levels(data_test$exang) <- levels(data$exang)
levels(data_test$restecg) <- levels(data$restecg)
levels(data_test$fbs) <- levels(data$fbs)
levels(data_test$cp) <- levels(data$cp)
levels(data_test$sex) <- levels(data$sex)
data_test$heart_disease <-as.factor(data_test$heart_disease)
heart.bag.pred.test<- predict(heart.bag, newdata = data_test, type="prob")[,2]
pred.bag.out = prediction(heart.bag.pred.test, data_test$heart_disease)
perf.bg = performance(pred.bag.out, "tpr", "fpr")
plot(perf.bg, colorize=TRUE)
unlist(slot(performance(pred.bag.out, "auc"), "y.values"))
## [1] 0.9114892
heart.bag.pred.test<- predict(heart.bag, newdata = data_test, type="class")
table(data_test$heart_disease, heart.bag.pred.test, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 40 12
## 1 3 36
#LOSS
LOSS<-cost0(data_test$heart_disease,heart.bag.pred.test)
#Misclassification rate
MCR<-mean(data_test$heart_disease!=heart.bag.pred.test)
#Get the AUC
AUC=unlist(slot(performance(pred.bag.out, "auc"), "y.values"))
rslt.bag.out<-c(LOSS,MCR,AUC)
rslt.bag.out
## [1] 0.1648352 0.1648352 0.9114892
eval<-data.frame(rbind(rslt.bag.in,rslt.bag.out),
row.names=c('BAG: in','BAG: out'))
colnames(eval) <- c("COST", "MCR","AUC")
kable(eval) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| COST | MCR | AUC | |
|---|---|---|---|
| BAG: in | 0.0000000 | 0.0000000 | 0.8406700 |
| BAG: out | 0.1648352 | 0.1648352 | 0.9114892 |
Modelling
heart.rf<- randomForest(as.factor(heart_disease)~., mtry=sqrt(ncol(data)-1),ntree=100 ,data = data, importance=TRUE)
heart.rf
##
## Call:
## randomForest(formula = as.factor(heart_disease) ~ ., data = data, mtry = sqrt(ncol(data) - 1), ntree = 100, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 18.87%
## Confusion matrix:
## 0 1 class.error
## 0 61 25 0.2906977
## 1 15 111 0.1190476
Variable Importance
heart.rf$importance
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## age 0.0066894627 0.0090019272 0.0080007447 9.7652322
## sex 0.0100411368 0.0089640079 0.0093973826 2.4283983
## cp 0.0236545789 0.0049268001 0.0124314370 7.5084754
## trestbps 0.0008585000 0.0010609511 0.0008679932 7.5703744
## chol 0.0079340813 0.0083961522 0.0081707502 10.6459891
## fbs -0.0004175059 -0.0017677013 -0.0011811390 0.8207861
## restecg -0.0027415097 0.0006863955 -0.0007813800 1.6776843
## thalach 0.0066289697 0.0025489093 0.0038556271 10.9100373
## exang 0.0028168382 0.0026346946 0.0029136281 2.5030901
## oldpeak 0.0314505792 0.0065132308 0.0166571509 14.2058831
## slope 0.0135692685 -0.0011422479 0.0052397966 4.3544639
## ca 0.0554900045 0.0328762794 0.0416077239 12.8668417
## thal 0.0586233858 0.0391646094 0.0464907871 16.0030334
plot(heart.rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))
Optimal cutoff - Symmetric cost
heart.rf.pred<- predict(heart.rf, type = "prob")[,2]
p.seq = seq(0.01, 1, 0.01)
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = data$heart_disease, pred.p = heart.rf.pred, pcut = p.seq[i])
}
plot(p.seq, cost)
optimal.pcut.sym= p.seq[which(cost==min(cost))][1]
optimal.pcut.sym#0.62
## [1] 0.49
Insample Analysis
ROC CURVE
pred.rf.in <- prediction(heart.rf.pred, data$heart_disease)
perf <- performance(pred.rf.in, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred.rf.in, "auc"), "y.values"))
## [1] 0.8551587
Confusion Matrix
heart.rf.pred.c<- predict(heart.bag, newdata = data, type="class")
table(data$heart_disease, heart.rf.pred.c, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 86 0
## 1 0 126
Out-of-Sample - Symmetric
## out-of-sample
heart.rf.pred.test<- predict(heart.rf,newdata = data_test,type="prob")[,2]
heart.rf.class.test<- (heart.rf.pred.test>optimal.pcut.sym)*1
table(data_test$heart_disease, heart.rf.class.test, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 42 10
## 1 5 34
pred.rf.out <- prediction(heart.rf.pred.test, data_test$heart_disease)
perf <- performance(pred.rf.out, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred.rf.out, "auc"), "y.values"))
## [1] 0.9146943
#LOSS
LOSS<-cost0(data_test$heart_disease,heart.rf.class.test)
#Misclassification rate
MCR<-mean(data_test$heart_disease!=heart.rf.class.test)
#Get the AUC
AUC=unlist(slot(performance(pred.rf.out, "auc"), "y.values"))
rslt.rf.out.sym<-c(LOSS,MCR,AUC)
rslt.rf.out.sym
## [1] 0.1648352 0.1648352 0.9146943
Optimal cutoff - Assymetric cost
p.seq = seq(0.01, 1, 0.01)
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfuncasym(obs = data$heart_disease, pred.p = heart.rf.pred, pcut = p.seq[i])
}
plot(p.seq, cost)
optimal.pcut.asym= p.seq[which(cost==min(cost))][1]
optimal.pcut.asym#0.07
## [1] 0.2
Out-of-Sample Asymmetric
## out-of-sample
heart.rf.pred.test<- predict(heart.rf,newdata = data_test,type="prob")[,2]
heart.rf.class.test<- (heart.rf.pred.test>optimal.pcut.asym)*1
table(data_test$heart_disease, heart.rf.class.test, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 19 33
## 1 1 38
pred.rf.out <- prediction(heart.rf.pred.test, data_test$heart_disease)
perf.rf <- performance(pred.rf.out, "tpr", "fpr")
plot(perf.rf, colorize=TRUE)
unlist(slot(performance(pred.rf.out, "auc"), "y.values"))
## [1] 0.9146943
#LOSS
LOSS<-cost1(data_test$heart_disease,heart.rf.class.test)
#Misclassification rate
MCR<-mean(data_test$heart_disease!=heart.rf.class.test)
#Get the AUC
AUC=unlist(slot(performance(pred.rf.out, "auc"), "y.values"))
rslt.rf.out.asym<-c(LOSS,MCR,AUC)
rslt.rf.out.asym
## [1] 0.4175824 0.3736264 0.9146943
eval<-data.frame(rbind(rslt.rf.out.sym,rslt.rf.out.asym),
row.names=c('Random Forest: Symmetric : Out_sample',
'Random Forest: Asymmetric : Out_sample'))
colnames(eval) <- c("COST", "MCR","AUC")
kable(eval) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| COST | MCR | AUC | |
|---|---|---|---|
| Random Forest: Symmetric : Out_sample | 0.1648352 | 0.1648352 | 0.9146943 |
| Random Forest: Asymmetric : Out_sample | 0.4175824 | 0.3736264 | 0.9146943 |
Boosting builds a number of small trees, and each time, the response is the residual from last tree. It is a sequential procedure. We use gbm package to build boosted trees.
data$heart_disease= as.factor(data$heart_disease)
heart.boost= boosting(heart_disease~., data = data, boos = T)
# Training AUC
pred.heart.boost= predict(heart.boost, newdata = data)
pred <- prediction(pred.heart.boost$prob[,2], data$heart_disease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 1
Boosting in sample
pred.heart.boost.train= predict(heart.boost, newdata = data)
#LOSS
LOSS<-cost0(data$heart_disease,as.numeric(pred.heart.boost.train$class))
#Misclassification rate
MCR<-mean(data$heart_disease!=as.numeric(pred.heart.boost.train$class))
# Testing AUC
pred <- prediction(pred.heart.boost.train$prob[,2], data$heart_disease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
boost_in<-c(LOSS,MCR,AUC)
boost_in
## [1] 0 0 1
Boosting out sample
pred.heart.boost.test= predict(heart.boost, newdata = data_test)
#LOSS
LOSS<-cost0(data_test$heart_disease,as.numeric(pred.heart.boost.test$class))
#Misclassification rate
MCR<-mean(data_test$heart_disease!=as.numeric(pred.heart.boost.test$class))
# Testing AUC
pred <- prediction(pred.heart.boost.test$prob[,2], data_test$heart_disease)
perf.bo <- performance(pred, "tpr", "fpr")
plot(perf.bo, colorize=TRUE)
#Get the AUC
AUC=unlist(slot(performance(pred, "auc"), "y.values"))
boost_out<-c(LOSS,MCR,AUC)
boost_out
## [1] 0.1868132 0.1868132 0.9048323
eval<-data.frame(rbind(boost_in,boost_out),
row.names=c('Boosting: In_sample',
'Boosting: Out_sample'))
colnames(eval) <- c("COST", "MCR","AUC")
kable(eval) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| COST | MCR | AUC | |
|---|---|---|---|
| Boosting: In_sample | 0.0000000 | 0.0000000 | 1.0000000 |
| Boosting: Out_sample | 0.1868132 | 0.1868132 | 0.9048323 |
—– NOT sure if we can calculate pcut here ——- ### 7.5 Neural Networks
We ran a neural network model using “heart disease” as the output variable, and the remaining 13 variables as input variables. We trained the model using NeuralNetTools package, in which the size of the hidden layers are calculated internally.For classification problems with nnet you need to code the response to factor first
library(e1071)
library(nnet)
library(NeuralNetTools)
data_orig$heart_disease <- as.factor(data_orig$heart_disease)
Sampling Data into 70% and 30% Test
set.seed(13263635)
index <- sample(nrow(data_orig),nrow(data_orig)*0.7)
hd_train <- data_orig[index,]
hd_test <- data_orig[-index,]
Here, we’re using another library called NeuralNetTools in which the size of the hidden layers are calculated internally
Evaluating the results:
print(hd.nnet)
## Neural Network
##
## 212 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 212, 212, 212, 212, 212, 212, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.6401923 0.09431636
## 1 1e-04 0.6110185 0.00000000
## 1 1e-01 0.7589285 0.48341994
## 3 0e+00 0.6201874 0.05705648
## 3 1e-04 0.6376189 0.13904563
## 3 1e-01 0.7721512 0.51435973
## 5 0e+00 0.6282547 0.09356149
## 5 1e-04 0.6572260 0.17507520
## 5 1e-01 0.7809113 0.53395146
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(hd.nnet)
PNN<-plot(garson(hd.nnet))
PNN+coord_flip()
plotnet(hd.nnet$finalModel, y_names = "heart_disease") + title("Graphical Representation of Neural Network of Heart Disease ")
## integer(0)
pred.glm.gtrain.nn <- predict(hd.nnet, type = "prob")[,2]
pred.glm.gtest.nn <- predict(hd.nnet, newdata=hd_test,type = "prob")[,2]
cost_sym <- function(obs, pred.p, pcut){
weight1 = 1
weight0 = 1
c1 = (obs==1)&(pred.p<pcut)
c0 = (obs==0)&(pred.p>=pcut)
cost = (mean(weight1*c1+weight0*c0))
return(cost)
}
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 0.5, 0.01)
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = cost_sym(obs = hd_train$heart_disease, pred.p = pred.glm.gtrain.nn, pcut = p.seq[i])
}
#optimal pcut to be used further
plot(p.seq, cost)
optimal.pcut.sym = p.seq[which(cost == min(cost))]
optimal.pcut.sym #0.5
## [1] 0.42 0.43 0.44 0.45 0.46
Asymmetric : To find Optimal Pcut
#Cost Function
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
cost_asym <- function(obs, pred.p, pcut){
weight1 = 5
weight0 = 1
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = (mean(weight1*c1+weight0*c0)) # misclassification with weight
return(cost)
}
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 0.5, 0.01)
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq))
{cost[i] = cost_asym(obs = hd_train$heart_disease,
pred.p = pred.glm.gtrain.nn,
pcut = p.seq[i])}
#optimal pcut to be used further
plot(p.seq, cost)
optimal.pcut = p.seq[which(cost==min(cost))]
optimal.pcut = 0.36#0.36
Since the Asymmetrical Pcut is lower than symmetric cost, predictions are done using asymmetric cost
Train & Test Predictions - Asymm
confusion_matrix_train <- table(hd_train$heart_disease, pred.glm.gtrain.nn)
confusion_matrix_test <- table(hd_test$heart_disease, pred.glm.gtest.nn)
misclassification_rate_train <- round((confusion_matrix_train[2]+confusion_matrix_train[3])/sum(confusion_matrix_train), 2)
misclassification_rate_test <- round((confusion_matrix_test[2]+confusion_matrix_test[3])/sum(confusion_matrix_test), 2)
cat("train misclassfication rate:", misclassification_rate_train, "| test misclassfication rate:", misclassification_rate_test)
## train misclassfication rate: 0 | test misclassfication rate: 0.01
# Train MCR = 0.13
# Test MCR = 0.33
confusion_matrix_train
## pred.glm.gtrain.nn
## 0.00140547408860982 0.0016187281280542 0.00548143049348793
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.00598973145653925 0.0098843168471668 0.0115061735875303
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0128223272381664 0.014151875164994 0.0143605610394608
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0166278571491479 0.0201716782807516 0.021629754181104
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0240365162374382 0.024280831755576 0.0313937449450673
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0340562345750676 0.0344144558800905 0.0386452069088804
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.039209620504837 0.0394708106825859 0.0404816342423995
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0414290585909004 0.0424395591295265 0.043839467700985
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0507122263948863 0.0512751805092015 0.0518461466290865
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.053711248247861 0.0587230786846722 0.0594830506213191
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0637553739839628 0.0653590595984303 0.065848726482815
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0679356449947991 0.0693918071875803 0.0732580830124811
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.0734319443912572 0.076928640458764 0.0774338468075271
## 0 1 0 1
## 1 0 1 0
## pred.glm.gtrain.nn
## 0.0778547996452845 0.0778939091754068 0.0798186217080686
## 0 1 1 0
## 1 0 0 1
## pred.glm.gtrain.nn
## 0.0875299764408242 0.0876413211220728 0.0881558003733085
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.106003878429398 0.11593951606708 0.119849246953881 0.131986608049957
## 0 1 1 1 1
## 1 0 0 0 0
## pred.glm.gtrain.nn
## 0.135568186622098 0.136005039247625 0.148820866959496
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.153461519765809 0.158649770765087 0.163031632660873 0.1659928170043
## 0 1 1 1 1
## 1 0 0 0 0
## pred.glm.gtrain.nn
## 0.166849897759784 0.171708456427972 0.182285167408373
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.183500209824067 0.188712008954861 0.194475248412997
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.202882378942267 0.209223779154213 0.220113186444233
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtrain.nn
## 0.223985816750779 0.225239524459252 0.24616956896554 0.251543867046982
## 0 1 1 1 1
## 1 0 0 0 0
## pred.glm.gtrain.nn
## 0.304474968881152 0.307842299441321 0.341311292365706 0.366865839177
## 0 1 1 0 0
## 1 0 0 1 1
## pred.glm.gtrain.nn
## 0.396812950422799 0.411473626187303 0.41705850901617 0.449438508470835
## 0 1 1 1 1
## 1 0 0 0 0
## pred.glm.gtrain.nn
## 0.449937027480382 0.46521163447619 0.46768608573504 0.500797755226483
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.502219407500019 0.542846027055318 0.547709965419821
## 0 1 0 1
## 1 0 1 0
## pred.glm.gtrain.nn
## 0.576603258260078 0.591897180122641 0.60891511266251 0.61875382285694
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.620098921718515 0.62421386566248 0.628376331374646 0.634150779254919
## 0 1 0 1 0
## 1 0 1 0 1
## pred.glm.gtrain.nn
## 0.642613562531629 0.645519654093024 0.652532812191151
## 0 1 0 0
## 1 0 1 1
## pred.glm.gtrain.nn
## 0.655558631460355 0.661513565698282 0.663742899084925
## 0 0 0 1
## 1 1 1 0
## pred.glm.gtrain.nn
## 0.678752028678773 0.72006356347281 0.748281977475679 0.754904239047469
## 0 0 0 1 0
## 1 1 1 0 1
## pred.glm.gtrain.nn
## 0.755129725479019 0.757354113849914 0.763024934721684
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.785082223361549 0.795617177903729 0.797230706335981
## 0 0 1 0
## 1 1 0 1
## pred.glm.gtrain.nn
## 0.800351450300016 0.814373415846112 0.819113223711186 0.8268274275114
## 0 0 1 0 0
## 1 1 0 1 1
## pred.glm.gtrain.nn
## 0.831685408314223 0.834303542156161 0.842153990709367
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.842713912269737 0.842847274616396 0.844540024278214 0.84519765141143
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.847043483200458 0.856178630035966 0.857013907093588
## 0 1 0 1
## 1 0 1 0
## pred.glm.gtrain.nn
## 0.859470968605237 0.860395824280875 0.863063972419414
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.874156417447392 0.875055271084176 0.876551128274807
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.880329849320248 0.888949769312837 0.889799106003999
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.890626674386715 0.892190877093108 0.894239708404398
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.895865331696358 0.900705638009325 0.901272995056137 0.91043123832262
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.913765423924941 0.914055914408337 0.917189472345613
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.917498553098726 0.917610797320407 0.918516514286963
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.918547099481092 0.918827456389806 0.919297942646124
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.919493322131194 0.919704090071472 0.919993466648231 0.92397689026441
## 0 0 0 0 1
## 1 1 1 1 0
## pred.glm.gtrain.nn
## 0.925681456485117 0.926197727921199 0.926797683988465
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.926798472334727 0.926964749325013 0.9274664950847 0.927709815528317
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.929476771325266 0.930288730710336 0.930355238721127
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.931095345716087 0.931288040669097 0.934645990184759
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.939697272287572 0.943030704631844 0.943319931554613 0.94383793476792
## 0 0 0 0 1
## 1 1 1 1 0
## pred.glm.gtrain.nn
## 0.945235951101573 0.945629338457765 0.945832841916546 0.94690282170387
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.94708108575275 0.951713062694856 0.955364864776263 0.956303365742809
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.9612646781818 0.968063874710117 0.979339627213609 0.98328409265131
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.984635956086537 0.985943944855065 0.98811629909488 0.988387531711702
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.988815678847412 0.988825710705713 0.989146089095493
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.990447001464028 0.990494126745047 0.991128001136929
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.991352042092903 0.991579581459918 0.992869108681771
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.993538781805464 0.993563716139713 0.994012087396936
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.994017520175521 0.994454065928518 0.9947299338309 0.994815028766075
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.994926476152365 0.995046701249083 0.99506419109178 0.995393140837817
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtrain.nn
## 0.995407514715461 0.995532172523794 0.995579047337148
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtrain.nn
## 0.995714298234534 0.99583399780544 0.997620926201334 0.998349603967806
## 0 0 0 0 0
## 1 1 1 1 2
confusion_matrix_test
## pred.glm.gtest.nn
## 0.0027714463112969 0.00322479563718421 0.00423389851902834
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.00464319926601919 0.00496328291664434 0.00634827029527815
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.0126946177515764 0.0154087014214331 0.0261545788035418
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.0299817831501402 0.0415105904949099 0.0419567661641778
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.051200792457784 0.0565585415444473 0.0607873310413389
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.084590587016722 0.101767513496881 0.106903587430528
## 0 1 0 1
## 1 0 1 0
## pred.glm.gtest.nn
## 0.111039564740205 0.114172326187984 0.117386210677135
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.119035879126298 0.123739396414317 0.124522450793027
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.133164552998665 0.139745284301599 0.155499764954562
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.163392256383793 0.170576496925379 0.212716510709362
## 0 1 1 0
## 1 0 0 1
## pred.glm.gtest.nn
## 0.222593593752233 0.223649200234162 0.253149885531279
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.254651140681851 0.28010499592357 0.286727303673468 0.304676617024148
## 0 1 1 1 1
## 1 0 0 0 0
## pred.glm.gtest.nn
## 0.349530588438819 0.373327198142616 0.451751561762436
## 0 1 0 1
## 1 0 1 0
## pred.glm.gtest.nn
## 0.499086752242712 0.5224198992142 0.539137095132152 0.561357271436771
## 0 1 1 0 1
## 1 0 0 1 0
## pred.glm.gtest.nn
## 0.585223380662428 0.598566052679182 0.65601834430533 0.657826100237102
## 0 1 1 0 0
## 1 0 0 1 1
## pred.glm.gtest.nn
## 0.660514634914973 0.661724887315089 0.669418952111043
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtest.nn
## 0.674766995264368 0.696914969065705 0.781739998257185
## 0 1 0 0
## 1 0 1 1
## pred.glm.gtest.nn
## 0.782511943959304 0.806378841539237 0.829005402359611
## 0 1 1 1
## 1 0 0 0
## pred.glm.gtest.nn
## 0.829941252916241 0.832934464488041 0.867603808719308 0.8793976015413
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtest.nn
## 0.884983227802461 0.885215750473307 0.901858537117129
## 0 0 0 1
## 1 1 1 0
## pred.glm.gtest.nn
## 0.908085577554559 0.912529030625575 0.913996903472009
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtest.nn
## 0.914602577858671 0.919945122551635 0.920125789646024
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtest.nn
## 0.926094517343043 0.926783401584116 0.927176808676671
## 0 0 1 1
## 1 1 0 0
## pred.glm.gtest.nn
## 0.929397059899547 0.929724760310095 0.930080023679534
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtest.nn
## 0.931409246460362 0.935413005466966 0.935634102429502 0.95563599779744
## 0 0 0 0 1
## 1 1 1 1 0
## pred.glm.gtest.nn
## 0.962077887112079 0.971996621089428 0.982931226210635
## 0 1 1 0
## 1 0 0 1
## pred.glm.gtest.nn
## 0.983224699871222 0.99364023261421 0.993678182831536 0.993847919392054
## 0 0 0 0 0
## 1 1 1 1 1
## pred.glm.gtest.nn
## 0.993916973257418 0.994536425970503 0.999072514005146
## 0 0 0 0
## 1 1 1 1
## pred.glm.gtest.nn
## 0.999113985101415
## 0 0
## 1 1
ROC Curve for Train set
#In sample roc
prednn_roc.train<- predict(hd.nnet, type="prob")[,2]
pred <- prediction(prednn_roc.train,hd_train$heart_disease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
auc.train=unlist(slot(performance(pred, "auc"), "y.values"))
auc.train
## [1] 0.9581949
ROC Curve for Test set
#Out of sample roc
prednn_roc.test <- predict(hd.nnet, newdata=hd_test,type = "prob")[,2]
pred.test.nn <- prediction(prednn_roc.test,hd_test$heart_disease)
perf.nn <- performance(pred.test.nn, "tpr", "fpr")
plot(perf.nn, colorize=TRUE)
auc.test=unlist(slot(performance(pred.test.nn, "auc"), "y.values"))
auc.test
## [1] 0.8786982
Evaluation Function for asymmetric cost
#Aymmetric cost evaluation function
cost_asym <- function(r, pi){
weight1 = 5
weight0 = 1
c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
Asymmetric Cost : Train and Test
class.pred.train.nn <- (pred.glm.gtrain.nn>optimal.pcut)*1
cost.train <- round(cost_asym(r = hd_train$heart_disease, pi = class.pred.train.nn),2)
class.pred.test.nn<- (pred.glm.gtest.nn>optimal.pcut)*1
cost.test <- round(cost_asym(r = hd_test$heart_disease, pi = class.pred.test.nn),2)
cat("total train cost:", cost.train, "| total test cost:", cost.test)
## total train cost: 0.15 | total test cost: 0.29
#Train Cost = 0.17
# Test Cost = 0.29
nn.train.results <- c(cost.train,misclassification_rate_train,auc.train)
nn.test.results <- c(cost.test,misclassification_rate_test,auc.test)
eval_nn <- data.frame(rbind(nn.train.results,nn.test.results),
row.names=c('Neural Network : in','Neural Network : out'))
colnames(eval_nn) <- c("COST", "MCR", "AUC")
kable(eval_nn) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
font_size = 12, position = "left", full_width = FALSE)
| COST | MCR | AUC | |
|---|---|---|---|
| Neural Network : in | 0.15 | 0.00 | 0.9581949 |
| Neural Network : out | 0.29 | 0.01 | 0.8786982 |
Composite Roc curve of all models
pred <- prediction(pred.heart.boost.test$prob[,2], data_test$heart_disease)
perf.bo <- performance(pred, "tpr", "fpr")
plot(perf.bo, col = "red")
plot(perf.bg,add = TRUE, col = "pink")
plot(perf.rf, add = TRUE,col = "cyan")
plot(perf.l, add = TRUE, col = "black")
plot(perf.dt1, add = TRUE, col = "blue")
plot(perf.gm, add = TRUE,col = "green")
plot(perf.nn, add = TRUE,col = "grey")
## Add Legend
legend("bottomright", c("Boosting", "Bagging","RandomFrst","Logistic","Dtrees","Gam","NN"), lty=1,
col = c("red", "pink","cyan","black","blue","green","grey"), bty="n")
Considering Miss Classification as most important metric we select Boosting model as it has least Miss classification rate and cost, AUC is also higher than most algorithms although Random forest and Bagging are slightly better than Boosting in this regard
When it comes to diagnosing heart disease, we want to make sure we don’t have too many false-positives (you don’t have heart disease, but told you do and get treatment) or false-negatives (you have heart disease, but told you don’t and don’t get treatment). Therefore, model with highest accuracy and lowest miss classification rate is chosen as the best of all. *Bagging in this case also performs well on the in-sample data, although the misclassification rate is slightly lower than Boosting.
*Random Forest on the other hand performs more moderately in in-sample but doesn’t perform similarly on out of sample. Increasing the cost (Asymmetric loss).
*One important thing to notice here is that Ensemble methods do not have a way to prescribe a Asymmetric cost function while creating the Tree, as in the case of CART(rpart) hence they might not perform well is the data as a asymmetric cost associated to it.
*Decision Trees here take a humble approach, performing well in in-sample and out of sample. Their cost and Misclassification rate is lower than GAM and classic Logistic model.
*The neural network (with 3 hidden layers) has the best in-sample mcr amongst all the models considered but they have a highest loss on Out of sample data.
Evaluating all these factors we select Boosting as best performing model among all other Supervised Learning approaches that we examined in this dataset.
6.6Support Vector Machine (SVM)
library(e1071)
heart.svm = svm(heart_disease ~ ., data = data, cost = 1, gamma = 1/length(data), probability= TRUE)
prob.svm = predict(heart.svm, data_test, probability = TRUE)
prob.svm = attr(prob.svm, 'probabilities')[,2] #This is needed because prob.svm gives a matrix
pred.svm = as.numeric((prob.svm >= 0.2))
table(data_test$heart_disease,pred.svm,dnn=c("Obs","Pred"))
## Pred
## Obs 0 1
## 0 4 48
## 1 24 15
mean(ifelse(data_test$heart_disease != pred.svm, 1, 0))
## [1] 0.7912088
cost1(data_test$heart_disease, pred.svm)
## [1] 1.846154