Loading Data
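
The chunk below is a minimal setup sketch: the file name CTG.csv is an assumption (the original does not show the loading step), as are the packages used throughout this analysis.

library(tidyverse)   # dplyr pipelines and ggplot2 plots
library(corrplot)    # correlation-matrix plot
library(rpart)       # decision trees
library(rpart.plot)  # tree diagrams
library(rsample)     # initial_split() for train/test splitting

data <- read.csv("CTG.csv")  # hypothetical path to the exported CTG dataset
head(data)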

LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency NSP
120 0.000000000 0 0.000000000 0.000000000 0 0.000000000 73 0.5 43 2.4 64 62 126 2 0 120 137 121 73 1 2
132 0.006379585 0 0.006379585 0.003189793 0 0.000000000 17 2.1 0 10.4 130 68 198 6 1 141 136 140 12 0 1
133 0.003322259 0 0.008305648 0.003322259 0 0.000000000 16 2.1 0 13.4 130 68 198 5 1 141 135 138 13 0 1
134 0.002560819 0 0.007682458 0.002560819 0 0.000000000 16 2.4 0 23.0 117 53 170 11 0 137 134 137 13 1 1
132 0.006514658 0 0.008143322 0.000000000 0 0.000000000 16 2.4 0 19.9 117 53 170 9 0 137 136 138 11 1 1
134 0.001049318 0 0.010493179 0.009443861 0 0.002098636 26 5.9 0 0.0 150 50 200 5 3 76 107 107 170 0 3

Attribute Information about the CTG dataset

Correlation Matrix

correlation_matrix <- cor(data)
corrplot(correlation_matrix, method="circle")

As observed in the correlation matrix, there is a somewhat strong correlation between the width of the FHR histogram (Width) and DL, so let's examine the relationship.
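
First, a quick numeric check (a sketch using the same two columns):

cor(data$Width, data$DL)  # Pearson correlation between Width and DL

And visually: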

data %>% ggplot(aes(x=Width,
                    y=DL))+
  geom_point(alpha=.5)

The points cluster messily, so we need to apply a transformation:

# note: DL contains many zeros, so log10(DL) produces -Inf values that ggplot drops
data %>% ggplot(aes(x=log10(Width),
                    y=log10(DL)))+
  geom_point(alpha=.5)

Not a good candidate for linear modeling, even after the log transform.

Now let's check the relationship between Width and MSTV:

data %>% ggplot(aes(x=Width,
                y=MSTV))+
  geom_point(alpha=.5)+
  geom_smooth(method="lm")

This looks like a good candidate for linear regression.
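
As a sketch, a simple linear fit quantifies the trend drawn by geom_smooth above (the model name is ours):

fit_width_mstv <- lm(MSTV ~ Width, data = data)  # hypothetical name
summary(fit_width_mstv)$r.squared  # share of MSTV variance explained by Width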

Studying the risk factors that most affect the NSP class

rpart(NSP ~ ., data = data)
## n= 2126 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 2126 802.100200 1.304327  
##    2) Mean>=107.5 2021 520.754100 1.227115  
##      4) MSTV>=0.45 1770 208.454200 1.105085  
##        8) DP< 0.002064161 1757 177.126900 1.095048  
##         16) MSTV>=0.55 1636 110.515300 1.062958  
##           32) AC>=0.001845025 976   5.963115 1.006148 *
##           33) AC< 0.001845025 660  96.743940 1.146970  
##             66) Median< 150.5 572  55.138110 1.082168 *
##             67) Median>=150.5 88  23.590910 1.568182 *
##         17) MSTV< 0.55 121  42.148760 1.528926  
##           34) ASTV< 59.5 66  11.030300 1.212121 *
##           35) ASTV>=59.5 55  16.545450 1.909091 *
##        9) DP>=0.002064161 13   7.230769 2.461538 *
##      5) MSTV< 0.45 251 100.071700 2.087649  
##       10) ALTV< 68.5 210  59.623810 1.919048  
##         20) ASTV< 79.5 188  33.207450 1.824468  
##           40) ALTV< 7.5 21   1.809524 1.095238 *
##           41) ALTV>=7.5 167  18.826350 1.916168 *
##         21) ASTV>=79.5 22  10.363640 2.727273 *
##       11) ALTV>=68.5 41   3.902439 2.951220 *
##    3) Mean< 107.5 105  37.390480 2.790476  
##      6) Max>=220.5 9   3.555556 1.222222 *
##      7) Max< 220.5 96   9.625000 2.937500 *
model <- rpart(NSP ~ ., data = data)
rpart.plot(model)

The tree singles out MSTV, ALTV, and ASTV as the main risk factors, so we will build our logistic regression model on these predictors.
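
Before committing to those three predictors, the tree's own importance scores offer a cross-check (a sketch using the model fitted above):

# variables ranked by how much they improve the tree's splits
sort(model$variable.importance, decreasing = TRUE)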

We create a binary column P (Pathologic), equal to 1 when the NSP class is Pathologic (NSP = 3) and 0 otherwise.

data <- data %>% mutate(P = as.integer(NSP == 3))  # 1 = pathologic, 0 = otherwise

We have now successfully derived our new target variable.
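
A quick sanity check of the new column:

table(data$P)  # counts of non-pathologic (0) vs pathologic (1) cases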

Splitting data for validation

set.seed(2)
split <- initial_split(data,
                       prop=.80,
                       strata=P)
p_train <- training(split) 
p_test <- testing(split)
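
To confirm the stratified split preserved the pathology rate in both sets (a quick check):

mean(p_train$P)  # share of pathologic cases in the training set
mean(p_test$P)   # should be close to the training share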

Logistic Regression Classification Model

model2<- glm(P ~ MSTV+ALTV+ASTV,     
             data=p_train,
             family="binomial")
summary(model2)
## 
## Call:
## glm(formula = P ~ MSTV + ALTV + ASTV, family = "binomial", data = p_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.855005   0.745989 -14.551  < 2e-16 ***
## MSTV          1.231484   0.114867  10.721  < 2e-16 ***
## ALTV          0.026689   0.004759   5.608 2.05e-08 ***
## ASTV          0.111116   0.010182  10.913  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 976.84  on 1699  degrees of freedom
## Residual deviance: 663.81  on 1696  degrees of freedom
## AIC: 671.81
## 
## Number of Fisher Scoring iterations: 7

All three predictors are highly significant for our model (p-values well below 0.001).
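
Exponentiating the coefficients turns log-odds into odds ratios; for example, from the summary above, each unit increase in MSTV multiplies the odds of pathology by exp(1.23) ≈ 3.4. A sketch:

exp(coef(model2))  # odds ratios for the intercept and each predictor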

Testing the model

p_test <- p_test %>% 
  mutate(P_prob = predict(model2,p_test,type= "response"),
         P_pred = ifelse(P_prob > .5, 1, 0))
t <- table(p_test$P,
           p_test$P_pred) 
accuracy<- sum(diag(t))/sum(t) 
print(accuracy)
## [1] 0.9295775

That is roughly 93% accuracy for our logistic classification model.
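
Because pathologic cases are a minority, accuracy alone can be flattering; per-class rates from the same confusion table give a fuller picture (a sketch; assumes both classes appear among the predictions):

sensitivity <- t["1", "1"] / sum(t["1", ])  # pathologic cases correctly flagged
specificity <- t["0", "0"] / sum(t["0", ])  # non-pathologic correctly cleared
c(sensitivity = sensitivity, specificity = specificity)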

Decision Tree Model for NSP Class

Model 1

As ASTV, MSTV, and ALTV correlate strongly with the NSP class, we will capture these relationships in a tree model:

tree_model1 <- rpart(NSP ~ ASTV+ALTV+MSTV,data=p_train, method = "class")
rpart.plot(tree_model1, main = "Decision Tree for NSP Classification")

RMSE of the model

# note: for a classification tree, predict() defaults to a matrix of class
# probabilities, so this "RMSE" mixes labels with probabilities and should be
# read as a rough yardstick only
test1 <- predict(tree_model1, p_test)
RMSE1 <- sqrt(mean((p_test$NSP - test1)^2))
print("RMSE of tree model1:")
print(RMSE1)
## [1] "RMSE of tree model1:"
## [1] 1.213492

A low value, suggesting the model's predictions are close to the actual values.
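
Since tree_model1 is a classification tree, class-level accuracy is arguably a more natural test metric than RMSE. A sketch:

pred_class1 <- predict(tree_model1, p_test, type = "class")      # predicted NSP labels
mean(as.character(pred_class1) == as.character(p_test$NSP))      # classification accuracy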

Model 2

Referring back to the correlation matrix, AC and UC also show strong negative correlations with the NSP class. Visually:

tree_model2 <- rpart(NSP ~ AC+UC,data=p_train, method = "class")
rpart.plot(tree_model2, main = "Decision Tree for NSP Classification")

test2 <- predict(tree_model2, p_test)  # class-probability matrix, as above
RMSE2 <- sqrt(mean((p_test$NSP - test2)^2))
print("RMSE of tree model2:")
print(RMSE2)
## [1] "RMSE of tree model2:"
## [1] 1.197712

The RMSE is only slightly smaller than model 1's, so not a meaningfully better fit.

Model 3

Adding all five predictors to a single model:

tree_model3 <- rpart(NSP ~ ASTV+MSTV+ALTV+AC+UC,data=p_train, method = "class")
rpart.plot(tree_model3, main = "Decision Tree for NSP Classification")

test3 <- predict(tree_model3, p_test)  # class-probability matrix, as above
RMSE3 <- sqrt(mean((p_test$NSP - test3)^2))
print("RMSE of tree model3:")
print(RMSE3)
## [1] "RMSE of tree model3:"
## [1] 1.20678

A somewhat higher RMSE than model 2, but it is still worthwhile to include all the correlated variables when making our final decision.
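
To compare the three trees on the same classification footing, we can repeat the accuracy check from model 1 across all of them (a sketch; the list names are ours):

sapply(list(m1 = tree_model1, m2 = tree_model2, m3 = tree_model3),
       function(m) mean(as.character(predict(m, p_test, type = "class"))
                        == as.character(p_test$NSP)))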