Data: This data set is a cardiotocography (fetal health) data set available on Kaggle.com. It has 2126 observations of 22 variables.
2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians, and a consensus classification label was assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P). Therefore, the dataset can be used for either 10-class or 3-class experiments.
Attribute Information:
LB - FHR baseline (beats per minute)
AC - # of accelerations per second
FM - # of fetal movements per second
UC - # of uterine contractions per second
DL - # of light decelerations per second
DS - # of severe decelerations per second
DP - # of prolonged decelerations per second
ASTV - percentage of time with abnormal short term variability
MSTV - mean value of short term variability
ALTV - percentage of time with abnormal long term variability
MLTV - mean value of long term variability
Width - width of FHR histogram
Min - minimum of FHR histogram
Max - maximum of FHR histogram
Nmax - # of histogram peaks
Nzeros - # of histogram zeros
Mode - histogram mode
Mean - histogram mean
Median - histogram median
Variance - histogram variance
Tendency - histogram tendency
CLASS - FHR pattern class code (1 to 10)
NSP - fetal state class code (N=normal; S=suspect; P=pathologic)
Objective:
Create a multinomial logistic regression model for predicting the NSP status of patients. NSP has 3 levels: 1. Normal, 2. Suspect, and 3. Pathologic.
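For reference, multinom() from the nnet package (used below) treats the first factor level (Normal) as the baseline and fits one coefficient vector for each remaining class, so the model being estimated is

$$\log\frac{P(\mathrm{NSP}=k \mid x)}{P(\mathrm{NSP}=1 \mid x)} = \beta_{k0} + \beta_k^{\top} x, \qquad k \in \{2, 3\},$$

with the class probabilities recovered by the softmax transform $P(\mathrm{NSP}=k \mid x) = e^{\beta_{k0}+\beta_k^{\top}x} \big/ \big(1 + \sum_{j=2}^{3} e^{\beta_{j0}+\beta_j^{\top}x}\big)$.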
Data Preparation
#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(caret) # confusion matrix
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(nnet)
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(mlbench)
library(corrplot)
## corrplot 0.90 loaded
# PREPARING WORKSPACE
# Clear the workspace:
rm(list = ls())
# Load data
df <- read.csv("Cardiotocographic.csv", header = TRUE)
head(df)
## LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV
## 1 120 0.000000000 0 0.000000000 0.000000000 0 0.000000000 73 0.5 43 2.4
## 2 132 0.006379585 0 0.006379585 0.003189793 0 0.000000000 17 2.1 0 10.4
## 3 133 0.003322259 0 0.008305648 0.003322259 0 0.000000000 16 2.1 0 13.4
## 4 134 0.002560819 0 0.007682458 0.002560819 0 0.000000000 16 2.4 0 23.0
## 5 132 0.006514658 0 0.008143322 0.000000000 0 0.000000000 16 2.4 0 19.9
## 6 134 0.001049318 0 0.010493179 0.009443861 0 0.002098636 26 5.9 0 0.0
## Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency NSP
## 1 64 62 126 2 0 120 137 121 73 1 2
## 2 130 68 198 6 1 141 136 140 12 0 1
## 3 130 68 198 5 1 141 135 138 13 0 1
## 4 117 53 170 11 0 137 134 137 13 1 1
## 5 117 53 170 9 0 137 136 138 11 1 1
## 6 150 50 200 5 3 76 107 107 170 0 3
nrow(df)
## [1] 2126
We can obtain summary statistics through the summary() function.
#Summary Statistics
summary(df)
## LB AC FM UC
## Min. :106.0 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:126.0 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.001876
## Median :133.0 Median :0.001630 Median :0.000000 Median :0.004482
## Mean :133.3 Mean :0.003170 Mean :0.009474 Mean :0.004357
## 3rd Qu.:140.0 3rd Qu.:0.005631 3rd Qu.:0.002512 3rd Qu.:0.006525
## Max. :160.0 Max. :0.019284 Max. :0.480634 Max. :0.014925
## DL DS DP ASTV
## Min. :0.000000 Min. :0.000e+00 Min. :0.0000000 Min. :12.00
## 1st Qu.:0.000000 1st Qu.:0.000e+00 1st Qu.:0.0000000 1st Qu.:32.00
## Median :0.000000 Median :0.000e+00 Median :0.0000000 Median :49.00
## Mean :0.001885 Mean :3.585e-06 Mean :0.0001566 Mean :46.99
## 3rd Qu.:0.003264 3rd Qu.:0.000e+00 3rd Qu.:0.0000000 3rd Qu.:61.00
## Max. :0.015385 Max. :1.353e-03 Max. :0.0053476 Max. :87.00
## MSTV ALTV MLTV Width
## Min. :0.200 Min. : 0.000 Min. : 0.000 Min. : 3.00
## 1st Qu.:0.700 1st Qu.: 0.000 1st Qu.: 4.600 1st Qu.: 37.00
## Median :1.200 Median : 0.000 Median : 7.400 Median : 67.50
## Mean :1.333 Mean : 9.847 Mean : 8.188 Mean : 70.45
## 3rd Qu.:1.700 3rd Qu.:11.000 3rd Qu.:10.800 3rd Qu.:100.00
## Max. :7.000 Max. :91.000 Max. :50.700 Max. :180.00
## Min Max Nmax Nzeros
## Min. : 50.00 Min. :122 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 67.00 1st Qu.:152 1st Qu.: 2.000 1st Qu.: 0.0000
## Median : 93.00 Median :162 Median : 3.000 Median : 0.0000
## Mean : 93.58 Mean :164 Mean : 4.068 Mean : 0.3236
## 3rd Qu.:120.00 3rd Qu.:174 3rd Qu.: 6.000 3rd Qu.: 0.0000
## Max. :159.00 Max. :238 Max. :18.000 Max. :10.0000
## Mode Mean Median Variance
## Min. : 60.0 Min. : 73.0 Min. : 77.0 Min. : 0.00
## 1st Qu.:129.0 1st Qu.:125.0 1st Qu.:129.0 1st Qu.: 2.00
## Median :139.0 Median :136.0 Median :139.0 Median : 7.00
## Mean :137.5 Mean :134.6 Mean :138.1 Mean : 18.81
## 3rd Qu.:148.0 3rd Qu.:145.0 3rd Qu.:148.0 3rd Qu.: 24.00
## Max. :187.0 Max. :182.0 Max. :186.0 Max. :269.00
## Tendency NSP
## Min. :-1.0000 Min. :1.000
## 1st Qu.: 0.0000 1st Qu.:1.000
## Median : 0.0000 Median :1.000
## Mean : 0.3203 Mean :1.304
## 3rd Qu.: 1.0000 3rd Qu.:1.000
## Max. : 1.0000 Max. :3.000
#Data Structure
str(df)
## 'data.frame': 2126 obs. of 22 variables:
## $ LB : int 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ FM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : int 64 130 130 117 117 150 150 68 68 68 ...
## $ Min : int 62 68 68 53 53 50 50 62 62 62 ...
## $ Max : int 126 198 198 170 170 200 200 130 130 130 ...
## $ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
## $ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
## $ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
## $ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
## $ Median : int 121 140 138 137 138 107 106 123 123 123 ...
## $ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
## $ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
## $ NSP : int 2 1 1 1 1 3 3 3 3 3 ...
We need to convert the response variable (NSP) into a categorical variable (factor).
#Convert to Categorical Variable
newdf <- df %>%
  mutate(NSP = factor(NSP, levels = 1:3))
str(newdf)
## 'data.frame': 2126 obs. of 22 variables:
## $ LB : int 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ FM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : int 64 130 130 117 117 150 150 68 68 68 ...
## $ Min : int 62 68 68 53 53 50 50 62 62 62 ...
## $ Max : int 126 198 198 170 170 200 200 130 130 130 ...
## $ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
## $ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
## $ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
## $ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
## $ Median : int 121 140 138 137 138 107 106 123 123 123 ...
## $ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
## $ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
## $ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
Feature Selection
For feature selection, I used the R packages caret and mlbench.
First, I am going to remove highly correlated variables from the model. Generally, variables with an absolute pairwise correlation of 0.75 or higher are removed; for this example, I chose a cutoff of 0.8.
# Calculate correlation matrix
CM <- cor(newdf[,1:21])
# Plot Correlation Matrix
corrplot(CM, method="circle")
We can see that Min and Width are highly correlated, as are Mode, Mean, and Median.
# Find variables that are highly correlated (cutoff = 0.8)
HC <- findCorrelation(CM, cutoff=0.8)
# Print indexes of highly correlated attributes
print(HC)
## [1] 13 18 19
# Print names of the highly correlated variables
colnames(newdf[,c(HC)])
## [1] "Min" "Mean" "Median"
# Remove the highly correlated variables
ndf<- newdf[,-c(HC)]
head(ndf,2)
## LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV Width
## 1 120 0.000000000 0 0.000000000 0.000000000 0 0 73 0.5 43 2.4 64
## 2 132 0.006379585 0 0.006379585 0.003189793 0 0 17 2.1 0 10.4 130
## Max Nmax Nzeros Mode Variance Tendency NSP
## 1 126 2 0 120 73 1 2
## 2 198 6 1 141 12 0 1
After removing the highly correlated variables Min, Mean, and Median, we can use ROC curve analysis to rank the remaining attributes by importance. (For models without a built-in importance measure, such as LVQ, caret's varImp() falls back to a ROC-based filter, which is what the output below shows.)
# Prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# Train the model
model <- train(NSP~., data=ndf, method="lvq", preProcess="scale", trControl=control)
# Estimate variable importance
importance <- varImp(model, scale=FALSE)
# Summarize importance
print(importance)
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## X1 X2 X3
## ALTV 0.8741 0.8741 0.6669
## MSTV 0.8734 0.8734 0.6719
## Mode 0.7687 0.8729 0.8729
## ASTV 0.8606 0.8400 0.8606
## MLTV 0.7872 0.8139 0.8139
## AC 0.8136 0.8136 0.8007
## LB 0.7884 0.7897 0.7897
## Variance 0.7773 0.7773 0.6954
## DP 0.7552 0.7478 0.7552
## DL 0.6412 0.7378 0.7378
## UC 0.7366 0.7366 0.6042
## Width 0.7095 0.7095 0.6412
## Tendency 0.6581 0.6916 0.6916
## Nmax 0.6122 0.6122 0.6060
## FM 0.5767 0.5643 0.5767
## Nzeros 0.5467 0.5481 0.5481
## Max 0.5470 0.5431 0.5470
## DS 0.5168 0.5170 0.5170
# Plot importance
plot(importance)
I am going to remove the attributes whose maximum importance is less than 0.7: Tendency, Nmax, FM, Nzeros, Max, and DS.
# Remove the low-importance variables
nndf = subset(ndf, select = -c(Tendency, Nmax, FM, Nzeros, Max, DS) )
head(nndf,2)
## LB AC UC DL DP ASTV MSTV ALTV MLTV Width Mode
## 1 120 0.000000000 0.000000000 0.000000000 0 73 0.5 43 2.4 64 120
## 2 132 0.006379585 0.006379585 0.003189793 0 17 2.1 0 10.4 130 141
## Variance NSP
## 1 73 2
## 2 12 1
After removing the unimportant variables, we can finally fit our model.
#Data Partition
ind <- sample(2, nrow(nndf), replace=TRUE, prob=c(0.8,0.2))
train <- nndf[ind==1,]
test <- nndf[ind==2,]
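Note that sample() assigns each row independently, so the split above is only approximately 80/20, and since no seed is set it is not reproducible; it also ignores the class imbalance in NSP. A stratified alternative with caret's createDataPartition() is sketched below (not used for the results that follow; the seed value and the train_s/test_s names are arbitrary).
# Stratified 80/20 split that preserves the NSP class proportions
set.seed(2022)                                   # arbitrary seed, for reproducibility
idx     <- createDataPartition(nndf$NSP, p = 0.8, list = FALSE)
train_s <- nndf[idx, ]                           # ~80% of rows, drawn per class
test_s  <- nndf[-idx, ]                          # remaining ~20%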
#Structure of the new data set
str(nndf)
## 'data.frame': 2126 obs. of 13 variables:
## $ LB : int 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : int 64 130 130 117 117 150 150 68 68 68 ...
## $ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
## $ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
## $ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
MULTINOMIAL MODEL
I would like to plot NSP against the most important predictor variables.
ggpairs(nndf, columns = c("ALTV","MSTV","Mode", "ASTV", "MLTV", "AC"),
ggplot2::aes(colour=NSP))
length(which(nndf$NSP == "1"))
## [1] 1655
length(which(nndf$NSP == "2"))
## [1] 295
length(which(nndf$NSP == "3"))
## [1] 176
You can notice a couple of things here: the classes are heavily imbalanced, with 1655 Normal, 295 Suspect, and only 176 Pathologic observations, which is worth keeping in mind when interpreting overall accuracy later.
model_fit <- multinom(NSP~., data=nndf )
## # weights: 42 (26 variable)
## initial value 2335.649726
## iter 10 value 1081.681027
## iter 20 value 722.457786
## iter 30 value 680.255077
## iter 40 value 612.944482
## iter 50 value 598.581568
## iter 60 value 588.574046
## iter 70 value 586.869067
## iter 80 value 585.493302
## iter 90 value 574.186666
## iter 100 value 553.241036
## final value 553.241036
## stopped after 100 iterations
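Note that the trace ends with "stopped after 100 iterations": multinom() caps the optimizer at maxit = 100 by default, so the fit may not have fully converged. Raising the cap lets the log-likelihood settle (a sketch; the object name model_fit2 is arbitrary, and a converged refit may yield slightly different coefficients than the summary below).
# Refit with a higher iteration cap so the optimizer can run to convergence
model_fit2 <- multinom(NSP ~ ., data = nndf, maxit = 500)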
summary(model_fit)
## Call:
## multinom(formula = NSP ~ ., data = nndf)
##
## Coefficients:
## (Intercept) LB AC UC DL DP ASTV
## 2 -13.60253 0.01317323 -905.7489 -239.0915 -511.3077 218.0199 0.05681254
## 3 -11.33307 0.14714698 -739.3105 -331.2532 -106.1311 661.6405 0.15113035
## MSTV ALTV MLTV Width Mode Variance
## 2 -0.28516050 0.01568452 -0.05956471 0.011428796 0.06234549 0.04240634
## 3 -0.07161099 0.04992633 -0.02748014 0.006441676 -0.15322409 0.06851576
##
## Std. Errors:
## (Intercept) LB AC UC DL DP
## 2 1.616936 0.02216677 0.003807494 0.02459405 0.003604098 0.001194961
## 3 2.635281 0.03051712 0.004020687 0.02799005 0.004434058 0.003340625
## ASTV MSTV ALTV MLTV Width Mode
## 2 0.008985746 0.2378977 0.005437792 0.02774036 0.003703704 0.01757267
## 3 0.018048906 0.3062349 0.008225476 0.05046120 0.005829754 0.02197030
## Variance
## 2 0.006470396
## 3 0.008138784
##
## Residual Deviance: 1106.482
## AIC: 1158.482
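summary() reports coefficients and standard errors, but multinom() provides no significance tests. Wald z-statistics and two-tailed p-values can be computed by hand, and exponentiating the coefficients gives relative risk ratios (a minimal sketch using the fitted object above):
# Wald z-statistics and two-tailed p-values (not reported by multinom itself)
z <- summary(model_fit)$coefficients / summary(model_fit)$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
round(p, 4)
# Relative risk ratios: the multiplicative change in the odds of class 2 or 3
# (versus the Normal baseline) per one-unit increase in each predictor
exp(coef(model_fit))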
Confusion Matrix (Train Data)
#Predict Table
pred <- predict(model_fit, train)
#Create a table
table1 <- table(pred, train$NSP)
table1
##
## pred 1 2 3
## 1 1255 67 12
## 2 46 152 26
## 3 10 16 107
#Accuracy Rate
sum(diag(table1))/sum(table1)
## [1] 0.8953282
My model predicts the training data with around 89.5% accuracy.
Confusion Matrix (Test Data)
#Predict Table
pred_test <- predict(model_fit, test)
#Create a table
table2 <- table(pred_test, test$NSP)
table2
##
## pred_test 1 2 3
## 1 330 24 2
## 2 12 33 2
## 3 2 3 27
#Accuracy Rate
sum(diag(table2))/sum(table2)
## [1] 0.8965517
My model predicts unseen data with around 89.7% accuracy. (One caveat: model_fit was fit on the full data set nndf rather than on train alone, so the test rows were seen during fitting and this is not a strict out-of-sample estimate; refitting with data = train would give one.)
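Because the classes are imbalanced (roughly 78% of observations are Normal), overall accuracy can hide weak performance on the Suspect and Pathologic classes. caret's confusionMatrix() reports per-class sensitivity and specificity alongside accuracy (a sketch using the objects created above):
# Per-class performance on the test set; both arguments are factors
# with the same levels, so no recoding is needed
confusionMatrix(pred_test, test$NSP)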