Loading Data
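
The chunk below is a minimal setup sketch: the file name CTG.csv is an assumption (the original does not show the loading step), as are the packages used throughout this analysis.

library(tidyverse)   # dplyr pipelines and ggplot2 plots
library(corrplot)    # correlation-matrix plot
library(rpart)       # decision trees
library(rpart.plot)  # tree diagrams
library(rsample)     # initial_split() for train/test splitting

data <- read.csv("CTG.csv")  # hypothetical path to the exported CTG dataset
head(data)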

LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency NSP
120 0.000000000 0 0.000000000 0.000000000 0 0.000000000 73 0.5 43 2.4 64 62 126 2 0 120 137 121 73 1 2
132 0.006379585 0 0.006379585 0.003189793 0 0.000000000 17 2.1 0 10.4 130 68 198 6 1 141 136 140 12 0 1
133 0.003322259 0 0.008305648 0.003322259 0 0.000000000 16 2.1 0 13.4 130 68 198 5 1 141 135 138 13 0 1
134 0.002560819 0 0.007682458 0.002560819 0 0.000000000 16 2.4 0 23.0 117 53 170 11 0 137 134 137 13 1 1
132 0.006514658 0 0.008143322 0.000000000 0 0.000000000 16 2.4 0 19.9 117 53 170 9 0 137 136 138 11 1 1
134 0.001049318 0 0.010493179 0.009443861 0 0.002098636 26 5.9 0 0.0 150 50 200 5 3 76 107 107 170 0 3

Attribute Information about the CTG dataset

Correlation Matrix

correlation_matrix <- cor(data)
corrplot(correlation_matrix, method="circle")

As observed in the correlation matrix, there is a somewhat strong correlation between the width of the FHR histogram (Width) and DL, so let's examine the relationship.
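
First, a quick numeric check (a sketch using the same two columns):

cor(data$Width, data$DL)  # Pearson correlation between Width and DL

And visually: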

data %>% ggplot(aes(x=Width,
                    y=DL))+
  geom_point(alpha=.5)

The points cluster messily, so we need to apply a transformation:

# note: DL contains many zeros, so log10(DL) produces -Inf values that ggplot drops
data %>% ggplot(aes(x=log10(Width),
                    y=log10(DL)))+
  geom_point(alpha=.5)

Not a good candidate for linear modeling, even after the log transform.

Now let's check the relationship between Width and MSTV:

data %>% ggplot(aes(x=Width,
                y=MSTV))+
  geom_point(alpha=.5)+
  geom_smooth(method="lm")

This looks like a good candidate for linear regression.
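
As a sketch, a simple linear fit quantifies the trend drawn by geom_smooth above (the model name is ours):

fit_width_mstv <- lm(MSTV ~ Width, data = data)  # hypothetical name
summary(fit_width_mstv)$r.squared  # share of MSTV variance explained by Width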

Studying the risk factors that most affect the NSP class

rpart(NSP ~ ., data = data)
## n= 2126 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 2126 802.100200 1.304327  
##    2) Mean>=107.5 2021 520.754100 1.227115  
##      4) MSTV>=0.45 1770 208.454200 1.105085  
##        8) DP< 0.002064161 1757 177.126900 1.095048  
##         16) MSTV>=0.55 1636 110.515300 1.062958  
##           32) AC>=0.001845025 976   5.963115 1.006148 *
##           33) AC< 0.001845025 660  96.743940 1.146970  
##             66) Median< 150.5 572  55.138110 1.082168 *
##             67) Median>=150.5 88  23.590910 1.568182 *
##         17) MSTV< 0.55 121  42.148760 1.528926  
##           34) ASTV< 59.5 66  11.030300 1.212121 *
##           35) ASTV>=59.5 55  16.545450 1.909091 *
##        9) DP>=0.002064161 13   7.230769 2.461538 *
##      5) MSTV< 0.45 251 100.071700 2.087649  
##       10) ALTV< 68.5 210  59.623810 1.919048  
##         20) ASTV< 79.5 188  33.207450 1.824468  
##           40) ALTV< 7.5 21   1.809524 1.095238 *
##           41) ALTV>=7.5 167  18.826350 1.916168 *
##         21) ASTV>=79.5 22  10.363640 2.727273 *
##       11) ALTV>=68.5 41   3.902439 2.951220 *
##    3) Mean< 107.5 105  37.390480 2.790476  
##      6) Max>=220.5 9   3.555556 1.222222 *
##      7) Max< 220.5 96   9.625000 2.937500 *
model <- rpart(NSP ~ ., data = data)
rpart.plot(model)

The tree singles out MSTV, ALTV, and ASTV as the main risk factors, so we will build our logistic regression model on these predictors.
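
Before committing to those three predictors, the tree's own importance scores offer a cross-check (a sketch using the model fitted above):

# variables ranked by how much they improve the tree's splits
sort(model$variable.importance, decreasing = TRUE)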

We create a binary column P (Pathologic), equal to 1 when the NSP class is Pathologic (NSP = 3) and 0 otherwise.

data <- data %>% mutate(P = as.integer(NSP == 3))  # 1 = pathologic, 0 = otherwise

We have now successfully derived our new target variable.
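
A quick sanity check of the new column:

table(data$P)  # counts of non-pathologic (0) vs pathologic (1) cases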

Splitting data for validation

set.seed(2)
split <- initial_split(data,
                       prop=.80,
                       strata=P)
p_train <- training(split) 
p_test <- testing(split)
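
To confirm the stratified split preserved the pathology rate in both sets (a quick check):

mean(p_train$P)  # share of pathologic cases in the training set
mean(p_test$P)   # should be close to the training share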

Logistic Regression Classification Model

model2<- glm(P ~ MSTV+ALTV+ASTV,     
             data=p_train,
             family="binomial")
summary(model2)
## 
## Call:
## glm(formula = P ~ MSTV + ALTV + ASTV, family = "binomial", data = p_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.855005   0.745989 -14.551  < 2e-16 ***
## MSTV          1.231484   0.114867  10.721  < 2e-16 ***
## ALTV          0.026689   0.004759   5.608 2.05e-08 ***
## ASTV          0.111116   0.010182  10.913  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 976.84  on 1699  degrees of freedom
## Residual deviance: 663.81  on 1696  degrees of freedom
## AIC: 671.81
## 
## Number of Fisher Scoring iterations: 7

All three predictors are highly significant for our model (p-values well below 0.001).
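
Exponentiating the coefficients turns log-odds into odds ratios; for example, from the summary above, each unit increase in MSTV multiplies the odds of pathology by exp(1.23) ≈ 3.4. A sketch:

exp(coef(model2))  # odds ratios for the intercept and each predictor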

Testing the model

p_test <- p_test %>% 
  mutate(P_prob = predict(model2,p_test,type= "response"),
         P_pred = ifelse(P_prob > .5, 1, 0))
t <- table(p_test$P,
           p_test$P_pred) 
accuracy<- sum(diag(t))/sum(t) 
print(accuracy)
## [1] 0.9295775

That is roughly 93% accuracy for our logistic classification model.
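
Because pathologic cases are a minority, accuracy alone can be flattering; per-class rates from the same confusion table give a fuller picture (a sketch; assumes both classes appear among the predictions):

sensitivity <- t["1", "1"] / sum(t["1", ])  # pathologic cases correctly flagged
specificity <- t["0", "0"] / sum(t["0", ])  # non-pathologic correctly cleared
c(sensitivity = sensitivity, specificity = specificity)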

Decision Tree Model for NSP Class

Model 1

As ASTV, MSTV, and ALTV correlate strongly with the NSP class, we will capture these relationships in a tree model:

tree_model1 <- rpart(NSP ~ ASTV+ALTV+MSTV,data=p_train, method = "class")
rpart.plot(tree_model1, main = "Decision Tree for NSP Classification")

RMSE of the model

# note: for a classification tree, predict() defaults to a matrix of class
# probabilities, so this "RMSE" mixes labels with probabilities and should be
# read as a rough yardstick only
test1 <- predict(tree_model1, p_test)
RMSE1 <- sqrt(mean((p_test$NSP - test1)^2))
print("RMSE of tree model1:")
print(RMSE1)
## [1] "RMSE of tree model1:"
## [1] 1.213492

A low value, suggesting the model's predictions are close to the actual values.
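
Since tree_model1 is a classification tree, class-level accuracy is arguably a more natural test metric than RMSE. A sketch:

pred_class1 <- predict(tree_model1, p_test, type = "class")      # predicted NSP labels
mean(as.character(pred_class1) == as.character(p_test$NSP))      # classification accuracy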

Model 2

Referring back to the correlation matrix, AC and UC also show strong negative correlations with the NSP class. Visually:

tree_model2 <- rpart(NSP ~ AC+UC,data=p_train, method = "class")
rpart.plot(tree_model2, main = "Decision Tree for NSP Classification")

test2 <- predict(tree_model2, p_test)  # class-probability matrix, as above
RMSE2 <- sqrt(mean((p_test$NSP - test2)^2))
print("RMSE of tree model2:")
print(RMSE2)
## [1] "RMSE of tree model2:"
## [1] 1.197712

The RMSE is only slightly smaller than model 1's, so not a meaningfully better fit.

Model 3

Adding all five predictors to a single model:

tree_model3 <- rpart(NSP ~ ASTV+MSTV+ALTV+AC+UC,data=p_train, method = "class")
rpart.plot(tree_model3, main = "Decision Tree for NSP Classification")

test3 <- predict(tree_model3, p_test)  # class-probability matrix, as above
RMSE3 <- sqrt(mean((p_test$NSP - test3)^2))
print("RMSE of tree model3:")
print(RMSE3)
## [1] "RMSE of tree model3:"
## [1] 1.20678

A somewhat higher RMSE than model 2, but it is still worthwhile to include all the correlated variables when making our final decision.
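
To compare the three trees on the same classification footing, we can repeat the accuracy check from model 1 across all of them (a sketch; the list names are ours):

sapply(list(m1 = tree_model1, m2 = tree_model2, m3 = tree_model3),
       function(m) mean(as.character(predict(m, p_test, type = "class"))
                        == as.character(p_test$NSP)))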