Data: This data set is a cardiotocography (fetal health) data set available on Kaggle.com. It has 2126 observations of 22 variables.
2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians, and a consensus classification label was assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P). Therefore, the dataset can be used for either 10-class or 3-class experiments.
Attribute Information:
LB - FHR baseline (beats per minute)
AC - # of accelerations per second
FM - # of fetal movements per second
UC - # of uterine contractions per second
DL - # of light decelerations per second
DS - # of severe decelerations per second
DP - # of prolonged decelerations per second
ASTV - percentage of time with abnormal short term variability
MSTV - mean value of short term variability
ALTV - percentage of time with abnormal long term variability
MLTV - mean value of long term variability
Width - width of FHR histogram
Min - minimum of FHR histogram
Max - maximum of FHR histogram
Nmax - # of histogram peaks
Nzeros - # of histogram zeros
Mode - histogram mode
Mean - histogram mean
Median - histogram median
Variance - histogram variance
Tendency - histogram tendency
CLASS - FHR pattern class code (1 to 10)
NSP - fetal state class code (N=normal; S=suspect; P=pathologic)
Objective:
Create a multinomial logistic regression model for predicting the NSP status of patients. NSP has 3 levels: 1. Normal, 2. Suspect, and 3. Pathologic.
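For reference, multinom() from the nnet package (used below) treats the first factor level (Normal) as the baseline and fits one coefficient vector for each remaining class, so the model being estimated is

$$\log\frac{P(\mathrm{NSP}=k \mid x)}{P(\mathrm{NSP}=1 \mid x)} = \beta_{k0} + \beta_k^{\top} x, \qquad k \in \{2, 3\},$$

with the class probabilities recovered by the softmax transform $P(\mathrm{NSP}=k \mid x) = e^{\beta_{k0}+\beta_k^{\top}x} \big/ \big(1 + \sum_{j=2}^{3} e^{\beta_{j0}+\beta_j^{\top}x}\big)$.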
Data Preparation
#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(caret) # confusion matrix
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(nnet)
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(mlbench)
library(corrplot)
## corrplot 0.90 loaded
# PREPARING WORKSPACE
# Clear the workspace:
rm(list = ls())
# Load data
df <- read.csv("Cardiotocographic.csv", header = TRUE)
head(df)
## LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV
## 1 120 0.000000000 0 0.000000000 0.000000000 0 0.000000000 73 0.5 43 2.4
## 2 132 0.006379585 0 0.006379585 0.003189793 0 0.000000000 17 2.1 0 10.4
## 3 133 0.003322259 0 0.008305648 0.003322259 0 0.000000000 16 2.1 0 13.4
## 4 134 0.002560819 0 0.007682458 0.002560819 0 0.000000000 16 2.4 0 23.0
## 5 132 0.006514658 0 0.008143322 0.000000000 0 0.000000000 16 2.4 0 19.9
## 6 134 0.001049318 0 0.010493179 0.009443861 0 0.002098636 26 5.9 0 0.0
## Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency NSP
## 1 64 62 126 2 0 120 137 121 73 1 2
## 2 130 68 198 6 1 141 136 140 12 0 1
## 3 130 68 198 5 1 141 135 138 13 0 1
## 4 117 53 170 11 0 137 134 137 13 1 1
## 5 117 53 170 9 0 137 136 138 11 1 1
## 6 150 50 200 5 3 76 107 107 170 0 3
nrow(df)
## [1] 2126
We can obtain summary statistics through the summary() function.
#Summary Statistics
summary(df)
## LB AC FM UC
## Min. :106.0 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:126.0 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.001876
## Median :133.0 Median :0.001630 Median :0.000000 Median :0.004482
## Mean :133.3 Mean :0.003170 Mean :0.009474 Mean :0.004357
## 3rd Qu.:140.0 3rd Qu.:0.005631 3rd Qu.:0.002512 3rd Qu.:0.006525
## Max. :160.0 Max. :0.019284 Max. :0.480634 Max. :0.014925
## DL DS DP ASTV
## Min. :0.000000 Min. :0.000e+00 Min. :0.0000000 Min. :12.00
## 1st Qu.:0.000000 1st Qu.:0.000e+00 1st Qu.:0.0000000 1st Qu.:32.00
## Median :0.000000 Median :0.000e+00 Median :0.0000000 Median :49.00
## Mean :0.001885 Mean :3.585e-06 Mean :0.0001566 Mean :46.99
## 3rd Qu.:0.003264 3rd Qu.:0.000e+00 3rd Qu.:0.0000000 3rd Qu.:61.00
## Max. :0.015385 Max. :1.353e-03 Max. :0.0053476 Max. :87.00
## MSTV ALTV MLTV Width
## Min. :0.200 Min. : 0.000 Min. : 0.000 Min. : 3.00
## 1st Qu.:0.700 1st Qu.: 0.000 1st Qu.: 4.600 1st Qu.: 37.00
## Median :1.200 Median : 0.000 Median : 7.400 Median : 67.50
## Mean :1.333 Mean : 9.847 Mean : 8.188 Mean : 70.45
## 3rd Qu.:1.700 3rd Qu.:11.000 3rd Qu.:10.800 3rd Qu.:100.00
## Max. :7.000 Max. :91.000 Max. :50.700 Max. :180.00
## Min Max Nmax Nzeros
## Min. : 50.00 Min. :122 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 67.00 1st Qu.:152 1st Qu.: 2.000 1st Qu.: 0.0000
## Median : 93.00 Median :162 Median : 3.000 Median : 0.0000
## Mean : 93.58 Mean :164 Mean : 4.068 Mean : 0.3236
## 3rd Qu.:120.00 3rd Qu.:174 3rd Qu.: 6.000 3rd Qu.: 0.0000
## Max. :159.00 Max. :238 Max. :18.000 Max. :10.0000
## Mode Mean Median Variance
## Min. : 60.0 Min. : 73.0 Min. : 77.0 Min. : 0.00
## 1st Qu.:129.0 1st Qu.:125.0 1st Qu.:129.0 1st Qu.: 2.00
## Median :139.0 Median :136.0 Median :139.0 Median : 7.00
## Mean :137.5 Mean :134.6 Mean :138.1 Mean : 18.81
## 3rd Qu.:148.0 3rd Qu.:145.0 3rd Qu.:148.0 3rd Qu.: 24.00
## Max. :187.0 Max. :182.0 Max. :186.0 Max. :269.00
## Tendency NSP
## Min. :-1.0000 Min. :1.000
## 1st Qu.: 0.0000 1st Qu.:1.000
## Median : 0.0000 Median :1.000
## Mean : 0.3203 Mean :1.304
## 3rd Qu.: 1.0000 3rd Qu.:1.000
## Max. : 1.0000 Max. :3.000
#Data Structure
str(df)
## 'data.frame': 2126 obs. of 22 variables:
## $ LB : int 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ FM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : int 64 130 130 117 117 150 150 68 68 68 ...
## $ Min : int 62 68 68 53 53 50 50 62 62 62 ...
## $ Max : int 126 198 198 170 170 200 200 130 130 130 ...
## $ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
## $ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
## $ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
## $ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
## $ Median : int 121 140 138 137 138 107 106 123 123 123 ...
## $ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
## $ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
## $ NSP : int 2 1 1 1 1 3 3 3 3 3 ...
We need to convert the response variable (NSP) into a categorical variable (factor).
#Convert to Categorical Variable
newdf <- df %>%
  mutate(NSP = factor(NSP, levels = 1:3))
str(newdf)
## 'data.frame': 2126 obs. of 22 variables:
## $ LB : int 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ FM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : int 64 130 130 117 117 150 150 68 68 68 ...
## $ Min : int 62 68 68 53 53 50 50 62 62 62 ...
## $ Max : int 126 198 198 170 170 200 200 130 130 130 ...
## $ Nmax : int 2 6 5 11 9 5 6 0 0 1 ...
## $ Nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
## $ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
## $ Mean : int 137 136 135 134 136 107 107 122 122 122 ...
## $ Median : int 121 140 138 137 138 107 106 123 123 123 ...
## $ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
## $ Tendency: int 1 0 0 1 1 0 0 1 1 1 ...
## $ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
Feature Selection
For feature selection, I used the R packages caret and mlbench.
First, I am going to remove highly correlated variables from the model. Generally, variables with an absolute pairwise correlation of 0.75 or higher are removed; for this example, I chose a cutoff of 0.8.
# Calculate correlation matrix
CM <- cor(newdf[,1:21])
# Plot Correlation Matrix
corrplot(CM, method="circle")
We can see that Min and Width are highly correlated, as are Mode, Mean, and Median.
# Find variables that are highly correlated (cutoff = 0.8)
HC <- findCorrelation(CM, cutoff=0.8)
# Print indexes of highly correlated attributes
print(HC)
## [1] 13 18 19
# Print names of the highly correlated variables
colnames(newdf[,c(HC)])
## [1] "Min" "Mean" "Median"
# Remove the highly correlated variables
ndf<- newdf[,-c(HC)]
head(ndf,2)
## LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV Width
## 1 120 0.000000000 0 0.000000000 0.000000000 0 0 73 0.5 43 2.4 64
## 2 132 0.006379585 0 0.006379585 0.003189793 0 0 17 2.1 0 10.4 130
## Max Nmax Nzeros Mode Variance Tendency NSP
## 1 126 2 0 120 73 1 2
## 2 198 6 1 141 12 0 1
After removing the highly correlated variables Min, Mean, and Median, we can use ROC curve analysis to rank the remaining attributes by importance. (For models without a built-in importance measure, such as LVQ, caret's varImp() falls back to a ROC-based filter, which is what the output below shows.)
# Prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# Train the model
model <- train(NSP~., data=ndf, method="lvq", preProcess="scale", trControl=control)
# Estimate variable importance
importance <- varImp(model, scale=FALSE)
# Summarize importance
print(importance)
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## X1 X2 X3
## ALTV 0.8741 0.8741 0.6669
## MSTV 0.8734 0.8734 0.6719
## Mode 0.7687 0.8729 0.8729
## ASTV 0.8606 0.8400 0.8606
## MLTV 0.7872 0.8139 0.8139
## AC 0.8136 0.8136 0.8007
## LB 0.7884 0.7897 0.7897
## Variance 0.7773 0.7773 0.6954
## DP 0.7552 0.7478 0.7552
## DL 0.6412 0.7378 0.7378
## UC 0.7366 0.7366 0.6042
## Width 0.7095 0.7095 0.6412
## Tendency 0.6581 0.6916 0.6916
## Nmax 0.6122 0.6122 0.6060
## FM 0.5767 0.5643 0.5767
## Nzeros 0.5467 0.5481 0.5481
## Max 0.5470 0.5431 0.5470
## DS 0.5168 0.5170 0.5170
# Plot importance
plot(importance)
I am going to remove the attributes whose maximum importance is less than 0.7: Tendency, Nmax, FM, Nzeros, Max, and DS.
# Remove the low-importance variables
nndf = subset(ndf, select = -c(Tendency, Nmax, FM, Nzeros, Max, DS) )
head(nndf,2)
## LB AC UC DL DP ASTV MSTV ALTV MLTV Width Mode
## 1 120 0.000000000 0.000000000 0.000000000 0 73 0.5 43 2.4 64 120
## 2 132 0.006379585 0.006379585 0.003189793 0 17 2.1 0 10.4 130 141
## Variance NSP
## 1 73 2
## 2 12 1
After removing the unimportant variables, we can finally fit our model.
#Data Partition
ind <- sample(2, nrow(nndf), replace=TRUE, prob=c(0.8,0.2))
train <- nndf[ind==1,]
test <- nndf[ind==2,]
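Note that sample() assigns each row independently, so the split above is only approximately 80/20, and since no seed is set it is not reproducible; it also ignores the class imbalance in NSP. A stratified alternative with caret's createDataPartition() is sketched below (not used for the results that follow; the seed value and the train_s/test_s names are arbitrary).
# Stratified 80/20 split that preserves the NSP class proportions
set.seed(2022)                                   # arbitrary seed, for reproducibility
idx     <- createDataPartition(nndf$NSP, p = 0.8, list = FALSE)
train_s <- nndf[idx, ]                           # ~80% of rows, drawn per class
test_s  <- nndf[-idx, ]                          # remaining ~20%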
#Structure of the new data set
str(nndf)
## 'data.frame': 2126 obs. of 13 variables:
## $ LB : int 120 132 133 134 132 134 134 122 122 122 ...
## $ AC : num 0 0.00638 0.00332 0.00256 0.00651 ...
## $ UC : num 0 0.00638 0.00831 0.00768 0.00814 ...
## $ DL : num 0 0.00319 0.00332 0.00256 0 ...
## $ DP : num 0 0 0 0 0 ...
## $ ASTV : int 73 17 16 16 16 26 29 83 84 86 ...
## $ MSTV : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
## $ ALTV : int 43 0 0 0 0 0 0 6 5 6 ...
## $ MLTV : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
## $ Width : int 64 130 130 117 117 150 150 68 68 68 ...
## $ Mode : int 120 141 141 137 137 76 71 122 122 122 ...
## $ Variance: int 73 12 13 13 11 170 215 3 3 1 ...
## $ NSP : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 3 3 3 3 3 ...
MULTINOMIAL MODEL
I would like to plot NSP against the most important predictor variables.
ggpairs(nndf, columns = c("ALTV","MSTV","Mode", "ASTV", "MLTV", "AC"),
ggplot2::aes(colour=NSP))
length(which(nndf$NSP == "1"))
## [1] 1655
length(which(nndf$NSP == "2"))
## [1] 295
length(which(nndf$NSP == "3"))
## [1] 176
You can notice a couple of things here: the classes are heavily imbalanced, with 1655 Normal, 295 Suspect, and only 176 Pathologic observations, which is worth keeping in mind when interpreting overall accuracy later.
model_fit <- multinom(NSP~., data=nndf )
## # weights: 42 (26 variable)
## initial value 2335.649726
## iter 10 value 1081.681027
## iter 20 value 722.457786
## iter 30 value 680.255077
## iter 40 value 612.944482
## iter 50 value 598.581568
## iter 60 value 588.574046
## iter 70 value 586.869067
## iter 80 value 585.493302
## iter 90 value 574.186666
## iter 100 value 553.241036
## final value 553.241036
## stopped after 100 iterations
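Note that the trace ends with "stopped after 100 iterations": multinom() caps the optimizer at maxit = 100 by default, so the fit may not have fully converged. Raising the cap lets the log-likelihood settle (a sketch; the object name model_fit2 is arbitrary, and a converged refit may yield slightly different coefficients than the summary below).
# Refit with a higher iteration cap so the optimizer can run to convergence
model_fit2 <- multinom(NSP ~ ., data = nndf, maxit = 500)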
summary(model_fit)
## Call:
## multinom(formula = NSP ~ ., data = nndf)
##
## Coefficients:
## (Intercept) LB AC UC DL DP ASTV
## 2 -13.60253 0.01317323 -905.7489 -239.0915 -511.3077 218.0199 0.05681254
## 3 -11.33307 0.14714698 -739.3105 -331.2532 -106.1311 661.6405 0.15113035
## MSTV ALTV MLTV Width Mode Variance
## 2 -0.28516050 0.01568452 -0.05956471 0.011428796 0.06234549 0.04240634
## 3 -0.07161099 0.04992633 -0.02748014 0.006441676 -0.15322409 0.06851576
##
## Std. Errors:
## (Intercept) LB AC UC DL DP
## 2 1.616936 0.02216677 0.003807494 0.02459405 0.003604098 0.001194961
## 3 2.635281 0.03051712 0.004020687 0.02799005 0.004434058 0.003340625
## ASTV MSTV ALTV MLTV Width Mode
## 2 0.008985746 0.2378977 0.005437792 0.02774036 0.003703704 0.01757267
## 3 0.018048906 0.3062349 0.008225476 0.05046120 0.005829754 0.02197030
## Variance
## 2 0.006470396
## 3 0.008138784
##
## Residual Deviance: 1106.482
## AIC: 1158.482
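summary() reports coefficients and standard errors, but multinom() provides no significance tests. Wald z-statistics and two-tailed p-values can be computed by hand, and exponentiating the coefficients gives relative risk ratios (a minimal sketch using the fitted object above):
# Wald z-statistics and two-tailed p-values (not reported by multinom itself)
z <- summary(model_fit)$coefficients / summary(model_fit)$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
round(p, 4)
# Relative risk ratios: the multiplicative change in the odds of class 2 or 3
# (versus the Normal baseline) per one-unit increase in each predictor
exp(coef(model_fit))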
Confusion Matrix (Train Data)
#Predict Table
pred <- predict(model_fit, train)
#Create a table
table1 <- table(pred, train$NSP)
table1
##
## pred 1 2 3
## 1 1255 67 12
## 2 46 152 26
## 3 10 16 107
#Accuracy Rate
sum(diag(table1))/sum(table1)
## [1] 0.8953282
My model predicts the training data with around 89.5% accuracy.
Confusion Matrix (Test Data)
#Predict Table
pred_test <- predict(model_fit, test)
#Create a table
table2 <- table(pred_test, test$NSP)
table2
##
## pred_test 1 2 3
## 1 330 24 2
## 2 12 33 2
## 3 2 3 27
#Accuracy Rate
sum(diag(table2))/sum(table2)
## [1] 0.8965517
My model predicts unseen data with around 89.7% accuracy. (One caveat: model_fit was fit on the full data set nndf rather than on train alone, so the test rows were seen during fitting and this is not a strict out-of-sample estimate; refitting with data = train would give one.)
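Because the classes are imbalanced (roughly 78% of observations are Normal), overall accuracy can hide weak performance on the Suspect and Pathologic classes. caret's confusionMatrix() reports per-class sensitivity and specificity alongside accuracy (a sketch using the objects created above):
# Per-class performance on the test set; both arguments are factors
# with the same levels, so no recoding is needed
confusionMatrix(pred_test, test$NSP)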