Data description: The Maternal Health Risk dataset was published on August 14, 2023, by Marzia Ahmed and is available through the UCI Machine Learning Repository. This dataset provides valuable information for analyzing factors influencing maternal health outcomes and assessing associated risk levels. Source: UCI Machine Learning Repository – Maternal Health Risk Dataset
Age: Age of the patient in years
SystolicBP: Upper value of blood pressure, in mmHg
DiastolicBP: Lower value of blood pressure, in mmHg
BS: Blood glucose level, expressed as a molar concentration (mmol/L)
BodyTemp: Body temperature, in °F
HeartRate: Resting heart rate, in beats per minute
RiskLevel: Predicted risk intensity level during pregnancy (low, mid, or high risk), based on the attributes above
# import dataset
maternal <- read.csv("MaternalHealthRiskDataSet.csv")
head(maternal)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1 25 130 80 15.00 98 86 high risk
## 2 35 140 90 13.00 98 70 high risk
## 3 29 90 70 8.00 100 80 high risk
## 4 30 140 85 7.00 98 70 high risk
## 5 35 120 60 6.10 98 76 low risk
## 6 23 140 80 7.01 98 70 high risk
# data preparation
colSums(is.na(maternal))
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate
## 0 0 0 0 0 0
## RiskLevel
## 0
str(maternal)
## 'data.frame': 1014 obs. of 7 variables:
## $ Age : int 25 35 29 30 35 23 23 35 32 42 ...
## $ SystolicBP : int 130 140 90 140 120 140 130 85 120 130 ...
## $ DiastolicBP: int 80 90 70 85 60 80 70 60 90 80 ...
## $ BS : num 15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
## $ BodyTemp : num 98 98 100 98 98 98 98 102 98 98 ...
## $ HeartRate : int 86 70 80 70 76 70 78 86 70 70 ...
## $ RiskLevel : chr "high risk" "high risk" "high risk" "high risk" ...
** Regression Choice: This dataset evaluates RiskLevel (high risk, mid risk, low risk) based on several features, so the target variable is categorical. Logistic regression is therefore the appropriate choice for this analysis, as it models the relationship between one or more continuous predictors and a categorical outcome. Because RiskLevel has three levels, the multinomial form of logistic regression is used.
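To make the link between predictors and a three-level outcome concrete, the sketch below shows how a multinomial logit turns linear predictors into class probabilities. The values in `eta` are hypothetical and are not coefficients from this analysis; "high risk" is treated as the reference class (linear predictor fixed at 0), which matches the layout of the model summaries later in this report.

```r
# Hypothetical linear predictors (log-odds relative to the reference class)
# for a single patient. The reference class "high risk" is fixed at 0.
eta <- c("high risk" = 0, "low risk" = 1.2, "mid risk" = 0.4)

# Softmax: exponentiate and normalise so the probabilities sum to 1
probs <- exp(eta) / sum(exp(eta))

# The predicted class is the one with the largest probability
predicted <- names(probs)[which.max(probs)]

print(round(probs, 3))
print(predicted)
```

Under these made-up values the largest linear predictor belongs to "low risk", so that class is predicted.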
** Data Assessment
# check outliers by boxplot
boxplot(maternal$Age, main = "Age")
boxplot(maternal$SystolicBP, main = "SystolicBP")
boxplot(maternal$DiastolicBP, main = "DiastolicBP")
boxplot(maternal$BS, main = "Blood Sugar")
boxplot(maternal$BodyTemp, main = "BodyTemp")
boxplot(maternal$HeartRate, main = "HeartRate")
# check the outliers of heart rate
summary(maternal$HeartRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 70.0 76.0 74.3 80.0 90.0
# remove the implausible heart rate of 7 by keeping only HeartRate >= 60
maternal_1 <- subset(maternal, HeartRate >= 60)
summary(maternal_1$HeartRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.00 70.00 76.00 74.43 80.00 90.00
# check the outliers of BodyTemp
summary(maternal_1$BodyTemp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 98.00 98.00 98.00 98.67 98.00 103.00
subset(maternal_1, BodyTemp >101)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 8 35 85 60 11.0 102 86 high risk
## 36 12 95 60 6.1 102 60 low risk
## 67 17 85 60 9.0 102 86 mid risk
## 106 34 85 60 11.0 102 86 high risk
## 136 22 90 60 7.5 102 60 high risk
## 140 18 120 80 6.9 102 76 mid risk
## 145 17 120 80 6.7 102 76 mid risk
## 172 12 90 60 7.9 102 66 high risk
## 181 12 95 60 6.1 102 60 low risk
## 192 17 90 65 6.1 103 67 high risk
## 200 17 85 60 9.0 102 86 high risk
## 222 17 85 60 9.0 102 86 mid risk
## 236 28 120 80 9.0 102 76 high risk
## 241 17 120 80 7.0 102 76 high risk
## 268 12 90 60 8.0 102 66 high risk
## 277 12 90 60 11.0 102 60 high risk
## 288 17 90 65 7.7 103 67 high risk
## 296 17 85 60 6.3 102 86 high risk
## 338 45 120 80 6.9 103 70 low risk
## 339 70 85 60 6.9 102 70 low risk
## 340 65 120 90 6.9 103 76 low risk
## 341 55 120 80 6.9 102 80 low risk
## 343 22 120 80 6.9 103 76 low risk
## 372 12 90 60 7.8 102 60 high risk
## 383 17 90 65 7.8 103 67 high risk
## 391 17 85 69 7.8 102 86 high risk
## 414 50 130 80 16.0 102 76 mid risk
## 415 27 120 90 6.8 102 68 mid risk
## 420 17 140 100 6.8 103 80 high risk
## 423 36 140 100 6.8 102 76 high risk
## 429 36 140 100 6.8 102 76 high risk
## 443 35 85 60 11.0 102 86 high risk
## 459 34 85 60 11.0 102 86 high risk
## 473 18 120 80 6.8 102 76 low risk
## 494 17 85 60 7.9 102 86 low risk
## 508 18 120 80 7.9 102 76 mid risk
## 513 17 120 80 7.5 102 76 low risk
## 524 17 85 60 7.5 102 86 low risk
## 544 12 90 60 7.5 102 66 low risk
## 553 12 90 60 7.5 102 60 low risk
## 564 17 90 65 7.5 103 67 low risk
## 572 17 85 60 7.5 102 86 low risk
## 589 12 90 60 7.5 102 66 mid risk
## 598 22 90 60 7.5 102 60 high risk
## 613 17 90 65 7.5 103 67 mid risk
## 649 17 90 60 9.0 102 86 mid risk
## 680 35 85 60 11.0 102 86 high risk
## 713 18 120 80 6.9 102 76 mid risk
## 717 17 120 80 6.7 102 76 mid risk
## 727 17 85 60 9.0 102 86 mid risk
## 734 18 120 80 6.9 102 76 mid risk
## 738 17 120 80 6.7 102 76 mid risk
## 748 17 85 60 9.0 102 86 mid risk
## 788 50 130 80 16.0 102 76 mid risk
## 789 27 120 90 6.8 102 68 mid risk
## 812 18 120 80 7.9 102 76 mid risk
## 828 12 90 60 7.5 102 66 mid risk
## 835 17 90 65 7.5 103 67 mid risk
## 844 17 90 60 9.0 102 86 mid risk
## 859 18 120 80 6.9 102 76 mid risk
## 863 17 120 80 6.7 102 76 mid risk
## 873 17 85 60 9.0 102 86 mid risk
## 892 18 120 80 6.8 102 76 low risk
## 903 17 85 60 7.9 102 86 low risk
## 915 17 120 80 7.5 102 76 low risk
## 923 17 85 60 7.5 102 86 low risk
## 935 12 90 60 7.5 102 66 low risk
## 941 12 90 60 7.5 102 60 low risk
## 949 17 90 65 7.5 103 67 low risk
## 952 17 85 60 7.5 102 86 low risk
## 964 12 90 60 7.9 102 66 high risk
## 971 17 90 65 6.1 103 67 high risk
## 974 17 85 60 9.0 102 86 high risk
## 985 28 120 80 9.0 102 76 high risk
## 990 17 120 80 7.0 102 76 high risk
## 997 12 90 60 8.0 102 66 high risk
## 1001 12 90 60 11.0 102 60 high risk
## 1006 17 90 65 7.7 103 67 high risk
## 1007 17 85 60 6.3 102 86 high risk
# remove rows where BodyTemp > 101 and RiskLevel is 'low risk'
maternal_2 <- subset(maternal_1, !(BodyTemp >101 & RiskLevel == "low risk"))
subset(maternal_2, BodyTemp >101)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 8 35 85 60 11.0 102 86 high risk
## 67 17 85 60 9.0 102 86 mid risk
## 106 34 85 60 11.0 102 86 high risk
## 136 22 90 60 7.5 102 60 high risk
## 140 18 120 80 6.9 102 76 mid risk
## 145 17 120 80 6.7 102 76 mid risk
## 172 12 90 60 7.9 102 66 high risk
## 192 17 90 65 6.1 103 67 high risk
## 200 17 85 60 9.0 102 86 high risk
## 222 17 85 60 9.0 102 86 mid risk
## 236 28 120 80 9.0 102 76 high risk
## 241 17 120 80 7.0 102 76 high risk
## 268 12 90 60 8.0 102 66 high risk
## 277 12 90 60 11.0 102 60 high risk
## 288 17 90 65 7.7 103 67 high risk
## 296 17 85 60 6.3 102 86 high risk
## 372 12 90 60 7.8 102 60 high risk
## 383 17 90 65 7.8 103 67 high risk
## 391 17 85 69 7.8 102 86 high risk
## 414 50 130 80 16.0 102 76 mid risk
## 415 27 120 90 6.8 102 68 mid risk
## 420 17 140 100 6.8 103 80 high risk
## 423 36 140 100 6.8 102 76 high risk
## 429 36 140 100 6.8 102 76 high risk
## 443 35 85 60 11.0 102 86 high risk
## 459 34 85 60 11.0 102 86 high risk
## 508 18 120 80 7.9 102 76 mid risk
## 589 12 90 60 7.5 102 66 mid risk
## 598 22 90 60 7.5 102 60 high risk
## 613 17 90 65 7.5 103 67 mid risk
## 649 17 90 60 9.0 102 86 mid risk
## 680 35 85 60 11.0 102 86 high risk
## 713 18 120 80 6.9 102 76 mid risk
## 717 17 120 80 6.7 102 76 mid risk
## 727 17 85 60 9.0 102 86 mid risk
## 734 18 120 80 6.9 102 76 mid risk
## 738 17 120 80 6.7 102 76 mid risk
## 748 17 85 60 9.0 102 86 mid risk
## 788 50 130 80 16.0 102 76 mid risk
## 789 27 120 90 6.8 102 68 mid risk
## 812 18 120 80 7.9 102 76 mid risk
## 828 12 90 60 7.5 102 66 mid risk
## 835 17 90 65 7.5 103 67 mid risk
## 844 17 90 60 9.0 102 86 mid risk
## 859 18 120 80 6.9 102 76 mid risk
## 863 17 120 80 6.7 102 76 mid risk
## 873 17 85 60 9.0 102 86 mid risk
## 964 12 90 60 7.9 102 66 high risk
## 971 17 90 65 6.1 103 67 high risk
## 974 17 85 60 9.0 102 86 high risk
## 985 28 120 80 9.0 102 76 high risk
## 990 17 120 80 7.0 102 76 high risk
## 997 12 90 60 8.0 102 66 high risk
## 1001 12 90 60 11.0 102 60 high risk
## 1006 17 90 65 7.7 103 67 high risk
## 1007 17 85 60 6.3 102 86 high risk
# check the outlier of blood sugar
summary(maternal_2$BS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.000 6.900 7.500 8.763 8.000 19.000
# remove rows where blood sugar >= 10 and RiskLevel is "low risk"
maternal_3 <- subset(maternal_2, !(BS >=10 & RiskLevel == "low risk"))
subset(maternal_3, BS >=10)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1 25 130 80 15 98.0 86 high risk
## 2 35 140 90 13 98.0 70 high risk
## 8 35 85 60 11 102.0 86 high risk
## 10 42 130 80 18 98.0 70 high risk
## 15 48 120 80 11 98.0 88 mid risk
## 17 50 140 90 15 98.0 90 high risk
## 21 40 140 100 18 98.0 90 high risk
## 74 54 130 70 12 98.0 67 mid risk
## 75 44 120 90 16 98.0 80 mid risk
## 78 55 120 90 12 98.0 70 mid risk
## 92 60 120 85 15 98.0 60 mid risk
## 103 48 140 90 15 98.0 90 high risk
## 106 34 85 60 11 102.0 86 high risk
## 107 50 140 90 15 98.0 90 high risk
## 109 42 140 100 18 98.0 90 high risk
## 111 50 140 95 17 98.0 60 high risk
## 114 30 140 100 15 98.0 70 high risk
## 115 63 140 90 15 98.0 90 high risk
## 118 55 140 100 18 98.0 90 high risk
## 120 30 140 100 15 98.0 70 high risk
## 121 48 120 80 11 98.0 88 high risk
## 122 49 140 90 15 98.0 90 high risk
## 124 40 160 100 19 98.0 77 high risk
## 125 32 140 90 18 98.0 88 high risk
## 127 54 140 100 15 98.0 66 high risk
## 128 55 140 95 19 98.0 77 high risk
## 130 48 120 80 11 98.0 88 high risk
## 131 40 160 100 19 98.0 77 high risk
## 132 32 140 90 18 98.0 88 high risk
## 134 54 140 100 15 98.0 66 high risk
## 135 40 120 95 11 98.0 80 high risk
## 137 40 120 85 15 98.0 60 high risk
## 138 55 140 95 19 98.0 77 high risk
## 139 50 130 100 16 98.0 75 high risk
## 150 37 120 90 11 98.0 88 high risk
## 153 17 110 75 12 101.0 76 high risk
## 158 40 120 90 12 98.0 80 high risk
## 167 40 160 100 19 98.0 77 high risk
## 168 32 140 90 18 98.0 88 high risk
## 178 54 140 100 15 98.0 66 high risk
## 179 40 120 95 11 98.0 80 high risk
## 182 60 120 85 15 98.0 60 high risk
## 183 55 140 95 19 98.0 77 high risk
## 184 50 130 100 16 98.0 75 high risk
## 194 50 120 80 15 98.0 70 high risk
## 206 33 120 75 10 98.0 70 high risk
## 207 48 120 80 11 98.0 88 high risk
## 211 50 140 95 17 98.0 60 high risk
## 218 30 140 100 15 98.0 70 high risk
## 229 48 120 80 11 98.0 88 high risk
## 231 50 140 90 15 98.0 77 high risk
## 235 40 140 100 18 98.0 77 high risk
## 238 17 90 60 11 101.0 78 high risk
## 240 25 120 90 12 101.0 80 high risk
## 242 19 90 65 11 101.0 70 high risk
## 246 37 120 90 11 98.0 88 high risk
## 249 17 110 75 13 101.0 76 high risk
## 250 25 120 90 15 98.0 80 high risk
## 263 40 160 100 19 98.0 77 high risk
## 264 32 140 90 18 98.0 88 high risk
## 274 54 140 100 15 98.0 66 high risk
## 275 40 120 95 11 98.0 80 high risk
## 277 12 90 60 11 102.0 60 high risk
## 278 60 120 85 15 98.0 60 high risk
## 279 55 140 95 19 98.0 77 high risk
## 280 50 130 100 16 98.0 76 high risk
## 303 48 120 80 11 98.0 88 high risk
## 317 22 120 60 15 98.0 80 high risk
## 318 55 120 90 18 98.0 60 high risk
## 319 54 130 70 12 98.0 67 mid risk
## 320 35 85 60 19 98.0 86 high risk
## 321 43 120 90 18 98.0 70 high risk
## 328 56 120 80 13 98.0 70 high risk
## 330 43 120 80 15 98.0 76 high risk
## 332 44 120 90 16 98.0 80 mid risk
## 335 55 120 90 12 98.0 70 mid risk
## 342 45 90 60 18 101.0 70 high risk
## 346 37 120 90 11 98.0 88 high risk
## 363 40 160 100 19 98.0 77 high risk
## 364 32 140 90 18 98.0 88 high risk
## 369 54 140 100 15 98.0 66 high risk
## 370 40 120 95 11 98.0 80 high risk
## 373 60 120 85 15 98.0 60 mid risk
## 374 55 140 95 19 98.0 77 high risk
## 375 50 130 100 16 98.0 75 high risk
## 398 48 120 80 11 98.0 88 high risk
## 414 50 130 80 16 102.0 76 mid risk
## 416 60 140 90 12 98.0 77 high risk
## 418 60 140 80 16 98.0 66 high risk
## 426 35 100 60 15 98.0 80 high risk
## 427 40 140 100 13 101.0 66 high risk
## 432 35 100 60 15 98.0 80 high risk
## 433 40 140 100 13 101.0 66 high risk
## 436 65 130 80 15 98.0 86 high risk
## 437 35 140 80 13 98.0 70 high risk
## 438 29 90 70 10 98.0 80 high risk
## 443 35 85 60 11 102.0 86 high risk
## 445 43 130 80 18 98.0 70 mid risk
## 450 48 120 80 11 98.0 88 high risk
## 452 48 140 90 15 98.0 90 high risk
## 459 34 85 60 11 102.0 86 high risk
## 461 42 130 80 18 98.0 70 mid risk
## 468 50 140 90 15 98.0 90 high risk
## 472 42 140 100 18 98.0 90 high risk
## 483 50 140 95 17 98.0 60 high risk
## 490 30 140 100 15 98.0 70 high risk
## 501 48 120 80 11 98.0 88 mid risk
## 503 63 140 90 15 98.0 90 high risk
## 507 55 140 100 18 98.0 90 high risk
## 520 30 140 100 15 98.0 70 high risk
## 531 48 120 80 11 98.0 88 high risk
## 533 49 140 90 15 98.0 90 high risk
## 539 40 160 100 19 98.0 77 high risk
## 540 32 140 90 18 98.0 88 high risk
## 550 54 140 100 15 98.0 66 high risk
## 551 40 120 95 11 98.0 80 mid risk
## 554 60 120 85 15 98.0 60 mid risk
## 555 55 140 95 19 98.0 77 high risk
## 556 50 130 100 16 98.0 75 mid risk
## 579 48 120 80 11 98.0 88 high risk
## 584 40 160 100 19 98.0 77 high risk
## 585 32 140 90 18 98.0 88 high risk
## 595 54 140 100 15 98.0 66 high risk
## 596 40 120 95 11 98.0 80 high risk
## 599 40 120 85 15 98.0 60 high risk
## 600 55 140 95 19 98.0 77 high risk
## 601 50 130 100 16 98.0 75 high risk
## 603 40 120 85 15 98.0 60 high risk
## 604 55 140 95 19 98.0 77 high risk
## 605 50 130 100 16 98.0 75 mid risk
## 615 50 120 80 15 98.0 70 high risk
## 628 48 120 80 11 98.0 88 high risk
## 631 22 100 65 12 98.0 80 high risk
## 632 50 140 95 17 98.0 60 high risk
## 633 35 100 70 11 98.0 60 high risk
## 637 50 130 80 15 98.0 86 high risk
## 638 35 140 90 13 98.0 70 high risk
## 639 29 90 70 11 100.0 80 high risk
## 641 46 140 100 12 99.0 90 high risk
## 642 28 95 60 10 101.0 86 high risk
## 645 25 140 100 15 98.6 70 high risk
## 656 48 120 80 11 98.0 88 high risk
## 658 27 140 90 15 98.0 90 high risk
## 659 25 140 100 12 99.0 80 high risk
## 676 35 140 90 13 98.0 70 high risk
## 680 35 85 60 11 102.0 86 high risk
## 681 42 130 80 18 98.0 70 high risk
## 682 50 140 90 15 98.0 90 high risk
## 684 40 140 100 18 98.0 90 high risk
## 687 37 120 90 11 98.0 88 high risk
## 688 17 110 75 12 101.0 76 high risk
## 689 40 120 90 12 98.0 80 high risk
## 690 40 160 100 19 98.0 77 high risk
## 711 48 120 80 11 98.0 88 mid risk
## 732 48 120 80 11 98.0 88 mid risk
## 755 54 130 70 12 98.0 67 mid risk
## 756 44 120 90 16 98.0 80 mid risk
## 759 55 120 90 12 98.0 70 mid risk
## 773 60 120 85 15 98.0 60 mid risk
## 788 50 130 80 16 102.0 76 mid risk
## 798 43 130 80 18 98.0 70 mid risk
## 803 42 130 80 18 98.0 70 mid risk
## 811 48 120 80 11 98.0 88 mid risk
## 818 40 120 95 11 98.0 80 mid risk
## 819 60 120 85 15 98.0 60 mid risk
## 820 50 130 100 16 98.0 75 mid risk
## 834 50 130 100 16 98.0 75 mid risk
## 857 48 120 80 11 98.0 88 mid risk
## 956 40 140 100 18 98.0 90 high risk
## 959 37 120 90 11 98.0 88 high risk
## 960 17 110 75 12 101.0 76 high risk
## 961 40 120 90 12 98.0 80 high risk
## 962 40 160 100 19 98.0 77 high risk
## 963 32 140 90 18 98.0 88 high risk
## 966 54 140 100 15 98.0 66 high risk
## 967 40 120 95 11 98.0 80 high risk
## 968 60 120 85 15 98.0 60 high risk
## 969 55 140 95 19 98.0 77 high risk
## 970 50 130 100 16 98.0 75 high risk
## 973 50 120 80 15 98.0 70 high risk
## 975 33 120 75 10 98.0 70 high risk
## 976 48 120 80 11 98.0 88 high risk
## 977 50 140 95 17 98.0 60 high risk
## 978 30 140 100 15 98.0 70 high risk
## 980 48 120 80 11 98.0 88 high risk
## 981 50 140 90 15 98.0 77 high risk
## 984 40 140 100 18 98.0 77 high risk
## 987 17 90 60 11 101.0 78 high risk
## 989 25 120 90 12 101.0 80 high risk
## 991 19 90 65 11 101.0 70 high risk
## 992 37 120 90 11 98.0 88 high risk
## 993 17 110 75 13 101.0 76 high risk
## 994 25 120 90 15 98.0 80 high risk
## 995 40 160 100 19 98.0 77 high risk
## 996 32 140 90 18 98.0 88 high risk
## 999 54 140 100 15 98.0 66 high risk
## 1000 40 120 95 11 98.0 80 high risk
## 1001 12 90 60 11 102.0 60 high risk
## 1002 60 120 85 15 98.0 60 high risk
## 1003 55 140 95 19 98.0 77 high risk
## 1004 50 130 100 16 98.0 76 high risk
## 1009 48 120 80 11 98.0 88 high risk
## 1010 22 120 60 15 98.0 80 high risk
## 1011 55 120 90 18 98.0 60 high risk
## 1012 35 85 60 19 98.0 86 high risk
## 1013 43 120 90 18 98.0 70 high risk
maternal_clean <- maternal_3
summary(maternal_clean)
## Age SystolicBP DiastolicBP BS
## Min. :10.00 Min. : 70.0 Min. : 49.00 Min. : 6.000
## 1st Qu.:19.00 1st Qu.:100.0 1st Qu.: 65.00 1st Qu.: 6.900
## Median :27.00 Median :120.0 Median : 80.00 Median : 7.500
## Mean :29.98 Mean :113.5 Mean : 76.65 Mean : 8.754
## 3rd Qu.:39.00 3rd Qu.:120.0 3rd Qu.: 90.00 3rd Qu.: 8.000
## Max. :66.00 Max. :160.0 Max. :100.00 Max. :19.000
## BodyTemp HeartRate RiskLevel
## Min. : 98.00 Min. :60.00 Length:985
## 1st Qu.: 98.00 1st Qu.:70.00 Class :character
## Median : 98.00 Median :76.00 Mode :character
## Mean : 98.59 Mean :74.39
## 3rd Qu.: 98.00 3rd Qu.:80.00
## Max. :103.00 Max. :90.00
** Outliers analysis
HeartRate: The detected outliers have a heart rate of 7 bpm, which is physiologically impossible. These records are considered data entry errors and were therefore removed from the dataset.
BodyTemp: Body temperature values above 101°F indicate fever. Such readings are typically associated with high-risk or mid-risk maternal health levels. To maintain logical consistency, all records with BodyTemp > 101°F and RiskLevel = “low risk” were removed.
BS: A blood sugar level of 10 mmol/L or more suggests potential hyperglycemia and is usually linked with high or mid maternal risk. Consequently, records with BS >= 10 and RiskLevel = “low risk” were excluded to ensure dataset reliability and medical plausibility.
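The three exclusion rules can also be written as logical masks, which makes it easy to see how many rows each rule flags. This is a minimal sketch on a hypothetical toy data frame (the values are made up; only the column names follow the dataset), and it gives the same result as applying the three `subset()` calls one after another.

```r
# Hypothetical toy rows; values are invented for illustration only
toy <- data.frame(
  HeartRate = c(7, 70, 76, 80, 66),
  BodyTemp  = c(98, 102, 102, 98, 98),
  BS        = c(7.5, 6.9, 9.0, 11.0, 15.0),
  RiskLevel = c("low risk", "low risk", "mid risk", "low risk", "high risk")
)

# One logical mask per rule described above
bad_hr   <- toy$HeartRate < 60
bad_temp <- toy$BodyTemp > 101 & toy$RiskLevel == "low risk"
bad_bs   <- toy$BS >= 10 & toy$RiskLevel == "low risk"

# Rows flagged by each rule (a row can be flagged by more than one rule)
rule_counts <- c(HeartRate = sum(bad_hr), BodyTemp = sum(bad_temp), BS = sum(bad_bs))
print(rule_counts)

# Keep only rows that violate none of the rules
toy_clean <- toy[!(bad_hr | bad_temp | bad_bs), ]
print(toy_clean)
```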
# check the collinearity
correlation_matrix <- cor(maternal_clean[, c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")])
correlation_matrix
## Age SystolicBP DiastolicBP BS BodyTemp
## Age 1.00000000 0.4152118 0.3943027 0.47909474 -0.25415057
## SystolicBP 0.41521179 1.0000000 0.7826082 0.42543551 -0.26803474
## DiastolicBP 0.39430271 0.7826082 1.0000000 0.42325102 -0.24461544
## BS 0.47909474 0.4254355 0.4232510 1.00000000 -0.08207481
## BodyTemp -0.25415057 -0.2680347 -0.2446154 -0.08207481 1.00000000
## HeartRate 0.05606688 -0.0229889 -0.0589288 0.14590214 0.11943522
## HeartRate
## Age 0.05606688
## SystolicBP -0.02298890
## DiastolicBP -0.05892880
## BS 0.14590214
## BodyTemp 0.11943522
## HeartRate 1.00000000
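** Collinearity analysis: SystolicBP and DiastolicBP are strongly correlated (r = 0.78), so the two variables carry largely overlapping information. All other pairs of predictors have |r| < 0.5.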
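Reading a 6 x 6 correlation matrix by eye is error-prone, so the sketch below flags strongly correlated pairs programmatically. The matrix is re-entered from the output above, rounded to three decimals; the cutoff of 0.7 is an assumed rule of thumb, not a value taken from this analysis.

```r
vars <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")

# Correlations copied from the printed matrix, rounded to three decimals
r <- matrix(c(
   1.000,  0.415,  0.394,  0.479, -0.254,  0.056,
   0.415,  1.000,  0.783,  0.425, -0.268, -0.023,
   0.394,  0.783,  1.000,  0.423, -0.245, -0.059,
   0.479,  0.425,  0.423,  1.000, -0.082,  0.146,
  -0.254, -0.268, -0.245, -0.082,  1.000,  0.119,
   0.056, -0.023, -0.059,  0.146,  0.119,  1.000
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

threshold <- 0.7  # assumed cutoff for "strong" correlation

# Look only at the upper triangle so each pair is reported once
high <- which(abs(r) > threshold & upper.tri(r), arr.ind = TRUE)

flagged <- data.frame(
  var1 = vars[unname(high[, "row"])],
  var2 = vars[unname(high[, "col"])],
  r    = r[high],
  stringsAsFactors = FALSE
)
print(flagged)
```

With this cutoff the only flagged pair is SystolicBP and DiastolicBP.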
# check the normality
shapiro.test(maternal_clean$Age)
##
## Shapiro-Wilk normality test
##
## data: maternal_clean$Age
## W = 0.91737, p-value < 2.2e-16
shapiro.test(maternal_clean$SystolicBP)
##
## Shapiro-Wilk normality test
##
## data: maternal_clean$SystolicBP
## W = 0.90575, p-value < 2.2e-16
shapiro.test(maternal_clean$DiastolicBP)
##
## Shapiro-Wilk normality test
##
## data: maternal_clean$DiastolicBP
## W = 0.94756, p-value < 2.2e-16
shapiro.test(maternal_clean$BS)
##
## Shapiro-Wilk normality test
##
## data: maternal_clean$BS
## W = 0.67569, p-value < 2.2e-16
shapiro.test(maternal_clean$BodyTemp)
##
## Shapiro-Wilk normality test
##
## data: maternal_clean$BodyTemp
## W = 0.50361, p-value < 2.2e-16
shapiro.test(maternal_clean$HeartRate)
##
## Shapiro-Wilk normality test
##
## data: maternal_clean$HeartRate
## W = 0.95236, p-value < 2.2e-16
hist(maternal_clean$Age, probability = TRUE)
lines(density(maternal_clean$Age), col = 'red')
hist(maternal_clean$SystolicBP, probability = TRUE)
lines(density(maternal_clean$SystolicBP), col = 'red')
hist(maternal_clean$DiastolicBP, probability = TRUE)
lines(density(maternal_clean$DiastolicBP), col = 'red')
hist(maternal_clean$BS, probability = TRUE)
lines(density(maternal_clean$BS), col = 'red')
hist(maternal_clean$BodyTemp, probability = TRUE)
lines(density(maternal_clean$BodyTemp), col = 'red')
hist(maternal_clean$HeartRate, probability = TRUE)
lines(density(maternal_clean$HeartRate), col = 'red')
** Normality analysis: The Shapiro–Wilk tests (p < 2.2e-16 for every variable), supported by the histograms, show that all six predictors deviate significantly from a normal distribution. The assumption of normality is therefore not satisfied for this dataset. This does not rule out the chosen model, because logistic regression does not require normally distributed predictors.
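The six separate `shapiro.test()` calls can be condensed with `sapply()`. The sketch below runs on simulated columns rather than the maternal data; the column names `symmetric` and `skewed` and their distributions are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated columns: one roughly normal, one clearly right-skewed
toy <- data.frame(
  symmetric = rnorm(300, mean = 75, sd = 8),
  skewed    = rexp(300, rate = 1 / 8)
)

# Apply the Shapiro-Wilk test to every column and keep only the p-value
shapiro_p <- sapply(toy, function(x) shapiro.test(x)$p.value)
print(shapiro_p)
```

The same pattern would work on the numeric columns of a cleaned data frame, as long as each column has between 3 and 5000 values, which is the range `shapiro.test()` accepts.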
** Test for homoscedasticity: Since the dependent variable RiskLevel includes three categorical levels (low, mid, and high), a multinomial logistic regression model was employed. Homoscedasticity is not an assumption for logistic regression, as the variance of the response variable depends on the predicted probabilities.
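A short numerical illustration of why constant variance cannot hold: for a class indicator with probability p, the variance is p(1 - p), so it changes whenever the predicted probability changes. The probabilities below are arbitrary example values.

```r
# Arbitrary example probabilities for one class
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# Variance of a 0/1 indicator with success probability p
variance <- p * (1 - p)

print(data.frame(p = p, variance = variance))
```

The variance peaks at p = 0.5 and shrinks toward 0 as p approaches 0 or 1.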
** Data Split
# split the data 70/30 into training and test sets
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(123)
train_index <- createDataPartition(maternal_clean$RiskLevel, p = 0.7, list = FALSE)
train_data <- maternal_clean[train_index,]
test_data <- maternal_clean[-train_index,]
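The idea of a stratified 70/30 split, where each risk level keeps roughly the same share in both sets, can be sketched in base R. This only illustrates sampling within each class and is not caret's implementation; the toy data frame, its `id` column, and the class sizes are made up.

```r
set.seed(123)

# Hypothetical toy data: 100 rows with unequal class sizes
toy <- data.frame(
  id = 1:100,
  RiskLevel = rep(c("high risk", "low risk", "mid risk"), times = c(20, 50, 30))
)

# Row numbers grouped by class
rows_by_class <- split(seq_len(nrow(toy)), toy$RiskLevel)

# Draw 70% of the rows within each class
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(table(toy_train$RiskLevel))
print(table(toy_test$RiskLevel))
```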
# multinomial logistic regression
library(nnet)
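# convert the outcome to a factor in both splits
# (the split above was made while RiskLevel was still a character column)
train_data$RiskLevel <- as.factor(train_data$RiskLevel)
test_data$RiskLevel <- as.factor(test_data$RiskLevel)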
model_multi1 <- multinom(RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, data = train_data)
## # weights: 24 (14 variable)
## initial value 759.141091
## iter 10 value 629.597777
## iter 20 value 536.210329
## iter 30 value 505.767145
## iter 40 value 504.147202
## iter 50 value 503.571551
## final value 503.502345
## converged
summary(model_multi1)
## Call:
## multinom(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP +
## BS + BodyTemp + HeartRate, data = train_data)
##
## Coefficients:
## (Intercept) Age SystolicBP DiastolicBP BS BodyTemp
## low risk 153.49418 0.01493923 -0.072686297 -0.01760594 -0.7775492 -1.3356866
## mid risk 48.03771 0.01683282 -0.007092011 -0.05883441 -0.3729959 -0.3677746
## HeartRate
## low risk -0.07850442
## mid risk -0.03919853
##
## Std. Errors:
## (Intercept) Age SystolicBP DiastolicBP BS BodyTemp
## low risk 0.0001948333 0.01367527 0.01307712 0.01631953 0.11129976 0.02100432
## mid risk 0.0001440045 0.01271893 0.01126380 0.01457269 0.05196203 0.01639377
## HeartRate
## low risk 0.01995628
## mid risk 0.01640244
##
## Residual Deviance: 1007.005
## AIC: 1035.005
# remove DiastolicBP
model_multi2 <- multinom(RiskLevel ~ Age + SystolicBP + BS + BodyTemp + HeartRate, data = train_data)
## # weights: 21 (12 variable)
## initial value 759.141091
## iter 10 value 627.114178
## iter 20 value 542.717280
## iter 30 value 520.104526
## iter 40 value 517.804853
## iter 50 value 515.428104
## final value 515.425312
## converged
summary(model_multi2)
## Call:
## multinom(formula = RiskLevel ~ Age + SystolicBP + BS + BodyTemp +
## HeartRate, data = train_data)
##
## Coefficients:
## (Intercept) Age SystolicBP BS BodyTemp HeartRate
## low risk 150.59484 0.01555393 -0.08094925 -0.7396030 -1.3119601 -0.08122199
## mid risk 42.36125 0.01443048 -0.03663022 -0.4015462 -0.3227446 -0.03442227
##
## Std. Errors:
## (Intercept) Age SystolicBP BS BodyTemp HeartRate
## low risk 0.0001979874 0.01372230 0.009126859 0.10224931 0.02109619 0.02023666
## mid risk 0.0001445513 0.01275148 0.007708714 0.05281594 0.01629270 0.01644001
##
## Residual Deviance: 1030.851
## AIC: 1054.851
# remove Age, DiastolicBP, HeartRate
model_multi3 <- multinom(RiskLevel ~ SystolicBP + BS + BodyTemp, data = train_data)
## # weights: 15 (8 variable)
## initial value 759.141091
## iter 10 value 593.870051
## iter 20 value 548.190182
## iter 30 value 532.634520
## iter 40 value 527.247598
## iter 50 value 524.654320
## iter 60 value 524.484236
## final value 524.437259
## converged
summary(model_multi3)
## Call:
## multinom(formula = RiskLevel ~ SystolicBP + BS + BodyTemp, data = train_data)
##
## Coefficients:
## (Intercept) SystolicBP BS BodyTemp
## low risk 148.3870 -0.07259164 -0.7237608 -1.3569666
## mid risk 42.5065 -0.03168957 -0.3790738 -0.3538934
##
## Std. Errors:
## (Intercept) SystolicBP BS BodyTemp
## low risk 0.0001019664 0.008321894 0.09567523 0.011937788
## mid risk 0.0001016986 0.006924722 0.04378866 0.008421014
##
## Residual Deviance: 1048.875
## AIC: 1064.875
AIC(model_multi1,model_multi2,model_multi3)
## df AIC
## model_multi1 14 1035.005
## model_multi2 12 1054.851
## model_multi3 8 1064.875
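The AIC values can be checked by hand from the model summaries, since AIC = residual deviance + 2k, where k is the number of estimated parameters. The sketch below uses only the deviances and parameter counts printed above; the `delta_AIC` column is an added convenience, not part of the original output.

```r
# Residual deviance and number of parameters, copied from the summaries above
models <- data.frame(
  model    = c("model_multi1", "model_multi2", "model_multi3"),
  deviance = c(1007.005, 1030.851, 1048.875),
  k        = c(14, 12, 8)
)

# AIC = deviance + 2 * (number of parameters)
models$AIC <- models$deviance + 2 * models$k

# Difference from the best (lowest) AIC
models$delta_AIC <- models$AIC - min(models$AIC)

print(models)
```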
# confusion matrix
pred1 <- predict(model_multi1, newdata = test_data)
confusionMatrix(data=as.factor(pred1), reference = as.factor(test_data$RiskLevel))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high risk low risk mid risk
## high risk 63 1 15
## low risk 1 91 50
## mid risk 17 21 35
##
## Overall Statistics
##
## Accuracy : 0.6429
## 95% CI : (0.5852, 0.6976)
## No Information Rate : 0.3844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4555
##
## Mcnemar's Test P-Value : 0.007486
##
## Statistics by Class:
##
## Class: high risk Class: low risk Class: mid risk
## Sensitivity 0.7778 0.8053 0.3500
## Specificity 0.9249 0.7182 0.8041
## Pos Pred Value 0.7975 0.6408 0.4795
## Neg Pred Value 0.9163 0.8553 0.7059
## Prevalence 0.2755 0.3844 0.3401
## Detection Rate 0.2143 0.3095 0.1190
## Detection Prevalence 0.2687 0.4830 0.2483
## Balanced Accuracy 0.8513 0.7618 0.5771
pred2 <- predict(model_multi2, newdata = test_data)
confusionMatrix(data=as.factor(pred2), reference = as.factor(test_data$RiskLevel))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high risk low risk mid risk
## high risk 60 1 17
## low risk 2 94 50
## mid risk 19 18 33
##
## Overall Statistics
##
## Accuracy : 0.6361
## 95% CI : (0.5782, 0.6911)
## No Information Rate : 0.3844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4444
##
## Mcnemar's Test P-Value : 0.001433
##
## Statistics by Class:
##
## Class: high risk Class: low risk Class: mid risk
## Sensitivity 0.7407 0.8319 0.3300
## Specificity 0.9155 0.7127 0.8093
## Pos Pred Value 0.7692 0.6438 0.4714
## Neg Pred Value 0.9028 0.8716 0.7009
## Prevalence 0.2755 0.3844 0.3401
## Detection Rate 0.2041 0.3197 0.1122
## Detection Prevalence 0.2653 0.4966 0.2381
## Balanced Accuracy 0.8281 0.7723 0.5696
pred3 <- predict(model_multi3, newdata = test_data)
confusionMatrix(data=as.factor(pred3), reference = as.factor(test_data$RiskLevel))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high risk low risk mid risk
## high risk 59 0 13
## low risk 2 99 53
## mid risk 20 14 34
##
## Overall Statistics
##
## Accuracy : 0.6531
## 95% CI : (0.5956, 0.7074)
## No Information Rate : 0.3844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4683
##
## Mcnemar's Test P-Value : 8.718e-06
##
## Statistics by Class:
##
## Class: high risk Class: low risk Class: mid risk
## Sensitivity 0.7284 0.8761 0.3400
## Specificity 0.9390 0.6961 0.8247
## Pos Pred Value 0.8194 0.6429 0.5000
## Neg Pred Value 0.9009 0.9000 0.7080
## Prevalence 0.2755 0.3844 0.3401
## Detection Rate 0.2007 0.3367 0.1156
## Detection Prevalence 0.2449 0.5238 0.2313
## Balanced Accuracy 0.8337 0.7861 0.5824
# check accuracy
acc1 <- mean(pred1 == test_data$RiskLevel)
acc2 <- mean(pred2 == test_data$RiskLevel)
acc3 <- mean(pred3 == test_data$RiskLevel)
acc1
## [1] 0.6428571
acc2
## [1] 0.6360544
acc3
## [1] 0.6530612
# evaluate three models
model_eval <- data.frame(model = c("model1", "model2", "model3"), AIC = c(AIC(model_multi1), AIC(model_multi2),AIC(model_multi3)), Accuracy = c(acc1,acc2, acc3))
model_eval
## model AIC Accuracy
## 1 model1 1035.005 0.6428571
## 2 model2 1054.851 0.6360544
## 3 model3 1064.875 0.6530612
** Model Report: Three multinomial logistic regression models were compared on two metrics: the Akaike Information Criterion (AIC), computed on the training data, and classification accuracy on the test set. As shown in the table above, Model 1, which includes all six predictors (Age, SystolicBP, DiastolicBP, BS, BodyTemp, and HeartRate), has the lowest AIC (1035.005), indicating the best trade-off between fit and complexity among the three. Model 3 has a slightly higher test accuracy (0.653) than Model 1 (0.643), but the 95% confidence intervals of the two accuracies overlap almost entirely, so this difference cannot be distinguished from sampling noise. The AIC of Model 3 (1064.875) is about 30 points higher even though it has fewer parameters, which means that dropping Age, DiastolicBP, and HeartRate costs more in fit than it saves in complexity. Model 1 was therefore selected as the most suitable regression model for predicting maternal health risk levels. The three confusion matrices show that every model identifies high-risk and low-risk cases reasonably well (sensitivity of roughly 0.73 to 0.88) but performs poorly on mid-risk cases (sensitivity of 0.33 to 0.35). About half of the true mid-risk cases are predicted as low risk, which suggests that the predictors do not separate these two levels well.
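The per-class figures quoted from the confusion matrices can be recomputed directly from the counts. The sketch below re-enters the Model 1 confusion matrix from the output above (rows are predictions, columns are reference classes) and derives accuracy, sensitivity, and positive predictive value in base R.

```r
classes <- c("high risk", "low risk", "mid risk")

# Model 1 confusion matrix, copied from the output above
cm <- matrix(c(63,  1, 15,
                1, 91, 50,
               17, 21, 35),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Overall accuracy: correct predictions over all test cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions over the true cases of each class (columns)
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over all predictions of each class (rows)
ppv <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(ppv, 4))
```

The mid-risk column makes the weakness visible: of 100 true mid-risk cases, 50 are predicted as low risk and only 35 are predicted correctly.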
** Decision Tree Analysis
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.1
library(caret)
maternal_clean$RiskLevel <- as.factor(maternal_clean$RiskLevel)
tree_model <- rpart(RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, data = train_data, method = "class", parms = list(split="information"))
print(tree_model)
## n= 691
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 691 427 low risk (0.27641100 0.38205499 0.34153401)
## 2) BS>=7.95 184 45 high risk (0.75543478 0.01086957 0.23369565) *
## 3) BS< 7.95 507 245 low risk (0.10256410 0.51676529 0.38067061)
## 6) SystolicBP>=132.5 30 3 high risk (0.90000000 0.00000000 0.10000000) *
## 7) SystolicBP< 132.5 477 215 low risk (0.05241090 0.54926625 0.39832285)
## 14) BodyTemp< 99.5 400 146 low risk (0.01750000 0.63500000 0.34750000)
## 28) SystolicBP< 129.5 376 122 low risk (0.01861702 0.67553191 0.30585106) *
## 29) SystolicBP>=129.5 24 0 mid risk (0.00000000 0.00000000 1.00000000) *
## 15) BodyTemp>=99.5 77 26 mid risk (0.23376623 0.10389610 0.66233766) *
summary(tree_model)
## Call:
## rpart(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP +
## BS + BodyTemp + HeartRate, data = train_data, method = "class",
## parms = list(split = "information"))
## n= 691
##
## CP nsplit rel error xerror xstd
## 1 0.32084309 0 1.0000000 1.0000000 0.02991224
## 2 0.08196721 1 0.6791569 0.6791569 0.03038116
## 3 0.05620609 3 0.5152225 0.5152225 0.02867840
## 4 0.01000000 4 0.4590164 0.4613583 0.02779263
##
## Variable importance
## BS SystolicBP DiastolicBP BodyTemp Age HeartRate
## 36 26 13 11 8 5
##
## Node number 1: 691 observations, complexity param=0.3208431
## predicted class=low risk expected loss=0.617945 P(node) =1
## class counts: 191 264 236
## probabilities: 0.276 0.382 0.342
## left son=2 (184 obs) right son=3 (507 obs)
## Primary splits:
## BS < 7.95 to the right, improve=164.83220, (0 missing)
## SystolicBP < 132.5 to the right, improve=124.80490, (0 missing)
## DiastolicBP < 92.5 to the right, improve= 67.50824, (0 missing)
## Age < 31.5 to the right, improve= 43.04697, (0 missing)
## BodyTemp < 99.5 to the left, improve= 38.87529, (0 missing)
## Surrogate splits:
## Age < 36.5 to the right, agree=0.790, adj=0.212, (0 split)
## SystolicBP < 137.5 to the right, agree=0.784, adj=0.190, (0 split)
## DiastolicBP < 92.5 to the right, agree=0.781, adj=0.179, (0 split)
## HeartRate < 84 to the right, agree=0.771, adj=0.141, (0 split)
## BodyTemp < 101.5 to the right, agree=0.737, adj=0.011, (0 split)
##
## Node number 2: 184 observations
## predicted class=high risk expected loss=0.2445652 P(node) =0.2662808
## class counts: 139 2 43
## probabilities: 0.755 0.011 0.234
##
## Node number 3: 507 observations, complexity param=0.08196721
## predicted class=low risk expected loss=0.4832347 P(node) =0.7337192
## class counts: 52 262 193
## probabilities: 0.103 0.517 0.381
## left son=6 (30 obs) right son=7 (477 obs)
## Primary splits:
## SystolicBP < 132.5 to the right, improve=62.43981, (0 missing)
## BodyTemp < 99.5 to the left, improve=38.73359, (0 missing)
## DiastolicBP < 97.5 to the right, improve=35.34743, (0 missing)
## BS < 7.055 to the right, improve=22.31142, (0 missing)
## HeartRate < 77.5 to the left, improve=11.54812, (0 missing)
## Surrogate splits:
## DiastolicBP < 97.5 to the right, agree=0.97, adj=0.5, (0 split)
##
## Node number 6: 30 observations
## predicted class=high risk expected loss=0.1 P(node) =0.04341534
## class counts: 27 0 3
## probabilities: 0.900 0.000 0.100
##
## Node number 7: 477 observations, complexity param=0.08196721
## predicted class=low risk expected loss=0.4507338 P(node) =0.6903039
## class counts: 25 262 190
## probabilities: 0.052 0.549 0.398
## left son=14 (400 obs) right son=15 (77 obs)
## Primary splits:
## BodyTemp < 99.5 to the left, improve=49.715280, (0 missing)
## SystolicBP < 129.5 to the left, improve=23.061630, (0 missing)
## BS < 7.055 to the right, improve=21.106480, (0 missing)
## DiastolicBP < 49.5 to the left, improve=11.706950, (0 missing)
## Age < 17.5 to the left, improve= 7.781778, (0 missing)
##
## Node number 14: 400 observations, complexity param=0.05620609
## predicted class=low risk expected loss=0.365 P(node) =0.5788712
## class counts: 7 254 139
## probabilities: 0.018 0.635 0.347
## left son=28 (376 obs) right son=29 (24 obs)
## Primary splits:
## SystolicBP < 129.5 to the left, improve=26.835620, (0 missing)
## BS < 7.005 to the right, improve=18.140290, (0 missing)
## Age < 33.5 to the right, improve= 9.753374, (0 missing)
## DiastolicBP < 49.5 to the left, improve= 8.898949, (0 missing)
## HeartRate < 77.5 to the left, improve= 6.773620, (0 missing)
##
## Node number 15: 77 observations
## predicted class=mid risk expected loss=0.3376623 P(node) =0.1114327
## class counts: 18 8 51
## probabilities: 0.234 0.104 0.662
##
## Node number 28: 376 observations
## predicted class=low risk expected loss=0.3244681 P(node) =0.5441389
## class counts: 7 254 115
## probabilities: 0.019 0.676 0.306
##
## Node number 29: 24 observations
## predicted class=mid risk expected loss=0 P(node) =0.03473227
## class counts: 0 0 24
## probabilities: 0.000 0.000 1.000
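The terminal-node counts in the summary above are enough to recover the tree's resubstitution (training) error by hand. The following is a minimal sketch, not part of the original analysis: the counts are copied from nodes 2, 6, 15, 28, and 29, and the variable names (`leaves`, `resub_error`) are illustrative. It assumes the class columns are ordered high, low, mid risk, as the node predictions imply.

```r
# Class counts per terminal node, copied from the summary output above
# (columns assumed to be: high risk, low risk, mid risk).
leaves <- rbind(
  node2  = c(139,   2,  43),
  node6  = c( 27,   0,   3),
  node15 = c( 18,   8,  51),
  node28 = c(  7, 254, 115),
  node29 = c(  0,   0,  24)
)
colnames(leaves) <- c("high risk", "low risk", "mid risk")

# Each leaf predicts its majority class, so the correctly classified
# observations in a leaf are the row-wise maximum.
correct <- sum(apply(leaves, 1, max))
n_total <- sum(leaves)

resub_accuracy <- correct / n_total
resub_error    <- 1 - resub_accuracy

# Cross-check: root expected loss (0.617945) times the relative error
# at 4 splits (0.4590164) should give the same misclassification rate.
check_error <- 0.617945 * 0.4590164

print(round(c(accuracy = resub_accuracy, error = resub_error,
              check = check_error), 4))
```

This is an in-sample figure, measured on the same 691 observations the tree was fit to, so it is likely optimistic compared with the cross-validated error reported in the complexity table.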
rpart.plot(tree_model, type = 3, fallen.leaves = TRUE, main = "Decision Tree for Maternal Health Risk Prediction", extra = 104)
tree_model$variable.importance
## BS SystolicBP DiastolicBP BodyTemp Age HeartRate
## 164.83218 120.62938 60.78220 51.50694 34.93726 23.29150
** Decision Tree Analysis: A decision tree was fit using six physiological predictors (Age, SystolicBP, DiastolicBP, BS, BodyTemp, and HeartRate) to classify maternal health risk levels. BS (blood sugar) and SystolicBP were the most influential predictors, accounting for about 36% and 26% of the scaled variable importance, followed by DiastolicBP (13%) and BodyTemp (11%). The root split alone (BS >= 7.95) isolates 184 observations, of which 75.5% are high risk. These findings indicate that blood sugar and blood pressure play the most critical role in determining maternal health risk in this tree.
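The raw scores from `tree_model$variable.importance` and the percentages in the "Variable importance" line of the summary are the same quantities on different scales. A short sketch of the conversion, with the scores hard-coded from the output above so that it runs on its own (the names `importance_pct` and `bp_share` are illustrative):

```r
# Raw importance scores copied from tree_model$variable.importance above.
importance <- c(BS = 164.83218, SystolicBP = 120.62938,
                DiastolicBP = 60.78220, BodyTemp = 51.50694,
                Age = 34.93726, HeartRate = 23.29150)

# Rescale so the scores sum to 100, then round to whole percentages.
importance_pct <- round(100 * importance / sum(importance))
print(importance_pct)

# Combined share of the two blood pressure measurements.
bp_share <- sum(importance_pct[c("SystolicBP", "DiastolicBP")])
print(bp_share)
```

On this scale the two blood pressure variables together carry slightly more importance than BS alone. They are also closely related measurements, so their individual shares should be read with some caution.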
** PCA analysis
pca_data <- maternal_clean[, c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")]
# scaling
pca_scaled <- scale(pca_data)
# pca_scaled is already standardized, so center/scale. are redundant here (but harmless)
pca_result <- prcomp(pca_scaled, center = TRUE, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.6096 1.0780 0.9161 0.8446 0.69199 0.46447
## Proportion of Variance 0.4318 0.1937 0.1399 0.1189 0.07981 0.03596
## Cumulative Proportion 0.4318 0.6255 0.7653 0.8842 0.96404 1.00000
pca_result$rotation
## PC1 PC2 PC3 PC4 PC5
## Age 0.439174114 -0.13876309 0.29082103 0.5476434 0.63469447
## SystolicBP 0.530588341 0.08523217 -0.26175603 -0.3562446 0.10491813
## DiastolicBP 0.523623413 0.10704551 -0.32207805 -0.3359659 0.07277007
## BS 0.427721334 -0.36798950 -0.06033367 0.4182756 -0.70905351
## BodyTemp -0.261554957 -0.47136417 -0.77470715 0.1930367 0.26721633
## HeartRate 0.008006029 -0.77744524 0.37331129 -0.4980820 0.08184240
## PC6
## Age 0.02331586
## SystolicBP -0.71047677
## DiastolicBP 0.70175274
## BS -0.01693074
## BodyTemp -0.02378907
## HeartRate 0.03700874
# scree plot
pr.var <- pca_result$sdev^2
pve <- pr.var / sum(pr.var)
par(mfrow = c(1,1), mar = c(4.5,4.5,2,1))
# ylim spans 0 to 1 so the cumulative curve (red) is not clipped
plot(pve, type = "b", pch = 19, xaxt = "n",
     xlab = "Principal Component", ylab = "Proportion of Variance Explained",
     ylim = c(0, 1))
axis(1, at = seq_along(pve), labels = paste0("PC", seq_along(pve)))
lines(cumsum(pve), type = "b", pch = 17, col = "red")
** PCA analysis: A Principal Component Analysis (PCA) was conducted on six standardized physiological indicators (Age, SystolicBP, DiastolicBP, BS, BodyTemp, and HeartRate). The loading matrix indicates that the first principal component (PC1, 43.2% of the variance) is dominated by SystolicBP (0.53), DiastolicBP (0.52), Age (0.44), and BS (0.43), representing a cardiometabolic risk factor associated with blood pressure and glucose regulation. The second component (PC2, 19.4%) has its largest-magnitude loadings on HeartRate (-0.78), BodyTemp (-0.47), and BS (-0.37), reflecting a physiological stress factor, possibly related to acute metabolic or thermal responses; the sign of a component is arbitrary, so only the magnitudes are interpreted here. Together, the first two components explain 62.6% of the total variance, suggesting that the variation in these maternal health indicators can be largely characterized by two underlying dimensions: long-term cardiovascular–metabolic status and short-term physiological stress response.
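The variance figures quoted above follow directly from the component standard deviations reported by `summary(pca_result)`. The sketch below recomputes them from the printed values so that it runs without the fitted object. The eigenvalue-greater-than-one (Kaiser) rule is included only as a common heuristic for choosing how many components to keep; it is an assumption added for illustration, not a criterion used in the original analysis.

```r
# Component standard deviations copied from summary(pca_result) above.
sdev <- c(PC1 = 1.6096, PC2 = 1.0780, PC3 = 0.9161,
          PC4 = 0.8446, PC5 = 0.69199, PC6 = 0.46447)

# Eigenvalues of the correlation matrix are the squared standard deviations;
# with six standardized variables they sum to roughly 6.
eigenvalues <- sdev^2

# Proportion of variance explained and its running total.
pve     <- eigenvalues / sum(eigenvalues)
cum_pve <- cumsum(pve)
print(round(rbind(pve, cum_pve), 4))

# Kaiser heuristic (illustrative assumption): keep components whose
# eigenvalue exceeds 1, i.e. that explain more than one original variable.
kaiser_keep <- names(eigenvalues)[eigenvalues > 1]
print(kaiser_keep)
```

Under that heuristic only PC1 and PC2 are retained, which is consistent with interpreting the first two components. The remaining four components still account for more than a third of the variance, so a two-dimensional summary leaves out a substantial part of the structure.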