YanQiHomework2

Data Selection and Preparation

Data description: The Maternal Health Risk dataset was published on August 14, 2023, by Marzia Ahmed and is available through the UCI Machine Learning Repository. This dataset provides valuable information for analyzing factors influencing maternal health outcomes and assessing associated risk levels. Source: UCI Machine Learning Repository – Maternal Health Risk Dataset

Age: age of the woman in years
SystolicBP: upper value of blood pressure, in mmHg
DiastolicBP: lower value of blood pressure, in mmHg
BS: blood glucose level, expressed as a molar concentration (mmol/L)
BodyTemp: body temperature, in °F
HeartRate: resting heart rate, in beats per minute
RiskLevel: predicted risk intensity level during pregnancy, based on the attributes above

# import dataset
maternal <- read.csv("MaternalHealthRiskDataSet.csv")
head(maternal)

##   Age SystolicBP DiastolicBP    BS BodyTemp HeartRate RiskLevel
## 1  25        130          80 15.00       98        86 high risk
## 2  35        140          90 13.00       98        70 high risk
## 3  29         90          70  8.00      100        80 high risk
## 4  30        140          85  7.00       98        70 high risk
## 5  35        120          60  6.10       98        76  low risk
## 6  23        140          80  7.01       98        70 high risk

# data preparation
colSums(is.na(maternal))

##         Age  SystolicBP DiastolicBP          BS    BodyTemp   HeartRate 
##           0           0           0           0           0           0 
##   RiskLevel 
##           0

str(maternal)

## 'data.frame':    1014 obs. of  7 variables:
##  $ Age        : int  25 35 29 30 35 23 23 35 32 42 ...
##  $ SystolicBP : int  130 140 90 140 120 140 130 85 120 130 ...
##  $ DiastolicBP: int  80 90 70 85 60 80 70 60 90 80 ...
##  $ BS         : num  15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
##  $ BodyTemp   : num  98 98 100 98 98 98 98 102 98 98 ...
##  $ HeartRate  : int  86 70 80 70 76 70 78 86 70 70 ...
##  $ RiskLevel  : chr  "high risk" "high risk" "high risk" "high risk" ...
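The str() output shows RiskLevel stored as a character column, and several of the rows printed later in this report are identical to one another. A minimal sketch of two quick checks that could follow is given here: giving the label an explicit level order and counting exact duplicate rows. The data frame `toy` and its values are made up for illustration and are not the real file.

```r
# Toy stand-in for the maternal data frame (values are illustrative only)
toy <- data.frame(
  Age = c(25, 35, 35, 23),
  BS = c(15, 13, 13, 7.01),
  RiskLevel = c("high risk", "low risk", "low risk", "mid risk"),
  stringsAsFactors = FALSE
)

# Explicit level order so tables read low -> mid -> high
toy$RiskLevel <- factor(toy$RiskLevel,
                        levels = c("low risk", "mid risk", "high risk"))

class_counts <- table(toy$RiskLevel)   # class balance
n_duplicates <- sum(duplicated(toy))   # rows identical to an earlier row

print(class_counts)
print(n_duplicates)
```

Note that with this level order the first level (low risk) would become the reference category in multinom(), whereas the model output in this report uses high risk as the reference.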

Parametric Statistics

** Regression Choice: The target variable RiskLevel is categorical with three levels (low risk, mid risk, high risk), so ordinary linear regression is not suitable. Multinomial logistic regression is the appropriate choice for this analysis, as it models the relationship between one or more predictors and a categorical outcome with more than two classes.
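For reference, the model can be written in the standard baseline-category form below. This is a sketch rather than a derivation: x denotes the vector of predictors, and high risk is taken as the baseline because it comes first in R's default alphabetical factor ordering, which is consistent with the coefficient rows (low risk, mid risk) reported later in this report.

```latex
\log\frac{P(Y = k \mid \mathbf{x})}{P(Y = \text{high risk} \mid \mathbf{x})}
  = \beta_{k0} + \boldsymbol{\beta}_k^{\top}\mathbf{x},
\qquad k \in \{\text{low risk},\ \text{mid risk}\}
```

Each coefficient is therefore a change in the log-odds of class k relative to high risk for a one-unit increase in the corresponding predictor.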

** Data Assessment

# check outliers by boxplot

boxplot(maternal$Age, main = "Age")

boxplot(maternal$SystolicBP, main = "SystolicBP")

boxplot(maternal$DiastolicBP, main = "DiastolicBP")

boxplot(maternal$BS, main = "Blood Sugar")

boxplot(maternal$BodyTemp, main = "BodyTemp")

boxplot(maternal$HeartRate, main = "HeartRate")
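The points a boxplot draws individually can also be listed programmatically. The sketch below uses base R's boxplot.stats(), which applies the same 1.5 times hinge-spread rule as boxplot(); the helper name `boxplot_outliers` and the toy vector are hypothetical and only illustrate the idea.

```r
# Return the values that boxplot() would draw as individual points
# (more than coef times the hinge spread away from the box)
boxplot_outliers <- function(x, coef = 1.5) {
  boxplot.stats(x, coef = coef)$out
}

# Toy heart-rate vector (illustrative values, not the real column)
heart_rate <- c(70, 70, 76, 76, 80, 80, 90, 7)
out <- boxplot_outliers(heart_rate)
print(out)
```

Applied to each numeric column, this gives a list of candidate values to inspect before deciding whether they are errors or genuine extremes.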

# check the outliers of heart rate
summary(maternal$HeartRate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    70.0    76.0    74.3    80.0    90.0

# remove the implausible heart rate of 7 bpm by keeping only HeartRate >= 60
maternal_1 <- subset(maternal, HeartRate >= 60)
summary(maternal_1$HeartRate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60.00   70.00   76.00   74.43   80.00   90.00

# check the outliers of BodyTemp
summary(maternal_1$BodyTemp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   98.00   98.00   98.00   98.67   98.00  103.00

subset(maternal_1, BodyTemp >101)

##      Age SystolicBP DiastolicBP   BS BodyTemp HeartRate RiskLevel
## 8     35         85          60 11.0      102        86 high risk
## 36    12         95          60  6.1      102        60  low risk
## 67    17         85          60  9.0      102        86  mid risk
## 106   34         85          60 11.0      102        86 high risk
## 136   22         90          60  7.5      102        60 high risk
## 140   18        120          80  6.9      102        76  mid risk
## 145   17        120          80  6.7      102        76  mid risk
## 172   12         90          60  7.9      102        66 high risk
## 181   12         95          60  6.1      102        60  low risk
## 192   17         90          65  6.1      103        67 high risk
## 200   17         85          60  9.0      102        86 high risk
## 222   17         85          60  9.0      102        86  mid risk
## 236   28        120          80  9.0      102        76 high risk
## 241   17        120          80  7.0      102        76 high risk
## 268   12         90          60  8.0      102        66 high risk
## 277   12         90          60 11.0      102        60 high risk
## 288   17         90          65  7.7      103        67 high risk
## 296   17         85          60  6.3      102        86 high risk
## 338   45        120          80  6.9      103        70  low risk
## 339   70         85          60  6.9      102        70  low risk
## 340   65        120          90  6.9      103        76  low risk
## 341   55        120          80  6.9      102        80  low risk
## 343   22        120          80  6.9      103        76  low risk
## 372   12         90          60  7.8      102        60 high risk
## 383   17         90          65  7.8      103        67 high risk
## 391   17         85          69  7.8      102        86 high risk
## 414   50        130          80 16.0      102        76  mid risk
## 415   27        120          90  6.8      102        68  mid risk
## 420   17        140         100  6.8      103        80 high risk
## 423   36        140         100  6.8      102        76 high risk
## 429   36        140         100  6.8      102        76 high risk
## 443   35         85          60 11.0      102        86 high risk
## 459   34         85          60 11.0      102        86 high risk
## 473   18        120          80  6.8      102        76  low risk
## 494   17         85          60  7.9      102        86  low risk
## 508   18        120          80  7.9      102        76  mid risk
## 513   17        120          80  7.5      102        76  low risk
## 524   17         85          60  7.5      102        86  low risk
## 544   12         90          60  7.5      102        66  low risk
## 553   12         90          60  7.5      102        60  low risk
## 564   17         90          65  7.5      103        67  low risk
## 572   17         85          60  7.5      102        86  low risk
## 589   12         90          60  7.5      102        66  mid risk
## 598   22         90          60  7.5      102        60 high risk
## 613   17         90          65  7.5      103        67  mid risk
## 649   17         90          60  9.0      102        86  mid risk
## 680   35         85          60 11.0      102        86 high risk
## 713   18        120          80  6.9      102        76  mid risk
## 717   17        120          80  6.7      102        76  mid risk
## 727   17         85          60  9.0      102        86  mid risk
## 734   18        120          80  6.9      102        76  mid risk
## 738   17        120          80  6.7      102        76  mid risk
## 748   17         85          60  9.0      102        86  mid risk
## 788   50        130          80 16.0      102        76  mid risk
## 789   27        120          90  6.8      102        68  mid risk
## 812   18        120          80  7.9      102        76  mid risk
## 828   12         90          60  7.5      102        66  mid risk
## 835   17         90          65  7.5      103        67  mid risk
## 844   17         90          60  9.0      102        86  mid risk
## 859   18        120          80  6.9      102        76  mid risk
## 863   17        120          80  6.7      102        76  mid risk
## 873   17         85          60  9.0      102        86  mid risk
## 892   18        120          80  6.8      102        76  low risk
## 903   17         85          60  7.9      102        86  low risk
## 915   17        120          80  7.5      102        76  low risk
## 923   17         85          60  7.5      102        86  low risk
## 935   12         90          60  7.5      102        66  low risk
## 941   12         90          60  7.5      102        60  low risk
## 949   17         90          65  7.5      103        67  low risk
## 952   17         85          60  7.5      102        86  low risk
## 964   12         90          60  7.9      102        66 high risk
## 971   17         90          65  6.1      103        67 high risk
## 974   17         85          60  9.0      102        86 high risk
## 985   28        120          80  9.0      102        76 high risk
## 990   17        120          80  7.0      102        76 high risk
## 997   12         90          60  8.0      102        66 high risk
## 1001  12         90          60 11.0      102        60 high risk
## 1006  17         90          65  7.7      103        67 high risk
## 1007  17         85          60  6.3      102        86 high risk

# remove records where BodyTemp > 101 and RiskLevel is 'low risk'
maternal_2 <- subset(maternal_1, !(BodyTemp >101 & RiskLevel == "low risk"))
subset(maternal_2, BodyTemp >101)

##      Age SystolicBP DiastolicBP   BS BodyTemp HeartRate RiskLevel
## 8     35         85          60 11.0      102        86 high risk
## 67    17         85          60  9.0      102        86  mid risk
## 106   34         85          60 11.0      102        86 high risk
## 136   22         90          60  7.5      102        60 high risk
## 140   18        120          80  6.9      102        76  mid risk
## 145   17        120          80  6.7      102        76  mid risk
## 172   12         90          60  7.9      102        66 high risk
## 192   17         90          65  6.1      103        67 high risk
## 200   17         85          60  9.0      102        86 high risk
## 222   17         85          60  9.0      102        86  mid risk
## 236   28        120          80  9.0      102        76 high risk
## 241   17        120          80  7.0      102        76 high risk
## 268   12         90          60  8.0      102        66 high risk
## 277   12         90          60 11.0      102        60 high risk
## 288   17         90          65  7.7      103        67 high risk
## 296   17         85          60  6.3      102        86 high risk
## 372   12         90          60  7.8      102        60 high risk
## 383   17         90          65  7.8      103        67 high risk
## 391   17         85          69  7.8      102        86 high risk
## 414   50        130          80 16.0      102        76  mid risk
## 415   27        120          90  6.8      102        68  mid risk
## 420   17        140         100  6.8      103        80 high risk
## 423   36        140         100  6.8      102        76 high risk
## 429   36        140         100  6.8      102        76 high risk
## 443   35         85          60 11.0      102        86 high risk
## 459   34         85          60 11.0      102        86 high risk
## 508   18        120          80  7.9      102        76  mid risk
## 589   12         90          60  7.5      102        66  mid risk
## 598   22         90          60  7.5      102        60 high risk
## 613   17         90          65  7.5      103        67  mid risk
## 649   17         90          60  9.0      102        86  mid risk
## 680   35         85          60 11.0      102        86 high risk
## 713   18        120          80  6.9      102        76  mid risk
## 717   17        120          80  6.7      102        76  mid risk
## 727   17         85          60  9.0      102        86  mid risk
## 734   18        120          80  6.9      102        76  mid risk
## 738   17        120          80  6.7      102        76  mid risk
## 748   17         85          60  9.0      102        86  mid risk
## 788   50        130          80 16.0      102        76  mid risk
## 789   27        120          90  6.8      102        68  mid risk
## 812   18        120          80  7.9      102        76  mid risk
## 828   12         90          60  7.5      102        66  mid risk
## 835   17         90          65  7.5      103        67  mid risk
## 844   17         90          60  9.0      102        86  mid risk
## 859   18        120          80  6.9      102        76  mid risk
## 863   17        120          80  6.7      102        76  mid risk
## 873   17         85          60  9.0      102        86  mid risk
## 964   12         90          60  7.9      102        66 high risk
## 971   17         90          65  6.1      103        67 high risk
## 974   17         85          60  9.0      102        86 high risk
## 985   28        120          80  9.0      102        76 high risk
## 990   17        120          80  7.0      102        76 high risk
## 997   12         90          60  8.0      102        66 high risk
## 1001  12         90          60 11.0      102        60 high risk
## 1006  17         90          65  7.7      103        67 high risk
## 1007  17         85          60  6.3      102        86 high risk

# check the outlier of blood sugar
summary(maternal_2$BS)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.000   6.900   7.500   8.763   8.000  19.000

# remove records where blood sugar >= 10 and RiskLevel is "low risk"
maternal_3 <- subset(maternal_2, !(BS >=10 & RiskLevel == "low risk"))
subset(maternal_3, BS >=10)

##      Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1     25        130          80 15     98.0        86 high risk
## 2     35        140          90 13     98.0        70 high risk
## 8     35         85          60 11    102.0        86 high risk
## 10    42        130          80 18     98.0        70 high risk
## 15    48        120          80 11     98.0        88  mid risk
## 17    50        140          90 15     98.0        90 high risk
## 21    40        140         100 18     98.0        90 high risk
## 74    54        130          70 12     98.0        67  mid risk
## 75    44        120          90 16     98.0        80  mid risk
## 78    55        120          90 12     98.0        70  mid risk
## 92    60        120          85 15     98.0        60  mid risk
## 103   48        140          90 15     98.0        90 high risk
## 106   34         85          60 11    102.0        86 high risk
## 107   50        140          90 15     98.0        90 high risk
## 109   42        140         100 18     98.0        90 high risk
## 111   50        140          95 17     98.0        60 high risk
## 114   30        140         100 15     98.0        70 high risk
## 115   63        140          90 15     98.0        90 high risk
## 118   55        140         100 18     98.0        90 high risk
## 120   30        140         100 15     98.0        70 high risk
## 121   48        120          80 11     98.0        88 high risk
## 122   49        140          90 15     98.0        90 high risk
## 124   40        160         100 19     98.0        77 high risk
## 125   32        140          90 18     98.0        88 high risk
## 127   54        140         100 15     98.0        66 high risk
## 128   55        140          95 19     98.0        77 high risk
## 130   48        120          80 11     98.0        88 high risk
## 131   40        160         100 19     98.0        77 high risk
## 132   32        140          90 18     98.0        88 high risk
## 134   54        140         100 15     98.0        66 high risk
## 135   40        120          95 11     98.0        80 high risk
## 137   40        120          85 15     98.0        60 high risk
## 138   55        140          95 19     98.0        77 high risk
## 139   50        130         100 16     98.0        75 high risk
## 150   37        120          90 11     98.0        88 high risk
## 153   17        110          75 12    101.0        76 high risk
## 158   40        120          90 12     98.0        80 high risk
## 167   40        160         100 19     98.0        77 high risk
## 168   32        140          90 18     98.0        88 high risk
## 178   54        140         100 15     98.0        66 high risk
## 179   40        120          95 11     98.0        80 high risk
## 182   60        120          85 15     98.0        60 high risk
## 183   55        140          95 19     98.0        77 high risk
## 184   50        130         100 16     98.0        75 high risk
## 194   50        120          80 15     98.0        70 high risk
## 206   33        120          75 10     98.0        70 high risk
## 207   48        120          80 11     98.0        88 high risk
## 211   50        140          95 17     98.0        60 high risk
## 218   30        140         100 15     98.0        70 high risk
## 229   48        120          80 11     98.0        88 high risk
## 231   50        140          90 15     98.0        77 high risk
## 235   40        140         100 18     98.0        77 high risk
## 238   17         90          60 11    101.0        78 high risk
## 240   25        120          90 12    101.0        80 high risk
## 242   19         90          65 11    101.0        70 high risk
## 246   37        120          90 11     98.0        88 high risk
## 249   17        110          75 13    101.0        76 high risk
## 250   25        120          90 15     98.0        80 high risk
## 263   40        160         100 19     98.0        77 high risk
## 264   32        140          90 18     98.0        88 high risk
## 274   54        140         100 15     98.0        66 high risk
## 275   40        120          95 11     98.0        80 high risk
## 277   12         90          60 11    102.0        60 high risk
## 278   60        120          85 15     98.0        60 high risk
## 279   55        140          95 19     98.0        77 high risk
## 280   50        130         100 16     98.0        76 high risk
## 303   48        120          80 11     98.0        88 high risk
## 317   22        120          60 15     98.0        80 high risk
## 318   55        120          90 18     98.0        60 high risk
## 319   54        130          70 12     98.0        67  mid risk
## 320   35         85          60 19     98.0        86 high risk
## 321   43        120          90 18     98.0        70 high risk
## 328   56        120          80 13     98.0        70 high risk
## 330   43        120          80 15     98.0        76 high risk
## 332   44        120          90 16     98.0        80  mid risk
## 335   55        120          90 12     98.0        70  mid risk
## 342   45         90          60 18    101.0        70 high risk
## 346   37        120          90 11     98.0        88 high risk
## 363   40        160         100 19     98.0        77 high risk
## 364   32        140          90 18     98.0        88 high risk
## 369   54        140         100 15     98.0        66 high risk
## 370   40        120          95 11     98.0        80 high risk
## 373   60        120          85 15     98.0        60  mid risk
## 374   55        140          95 19     98.0        77 high risk
## 375   50        130         100 16     98.0        75 high risk
## 398   48        120          80 11     98.0        88 high risk
## 414   50        130          80 16    102.0        76  mid risk
## 416   60        140          90 12     98.0        77 high risk
## 418   60        140          80 16     98.0        66 high risk
## 426   35        100          60 15     98.0        80 high risk
## 427   40        140         100 13    101.0        66 high risk
## 432   35        100          60 15     98.0        80 high risk
## 433   40        140         100 13    101.0        66 high risk
## 436   65        130          80 15     98.0        86 high risk
## 437   35        140          80 13     98.0        70 high risk
## 438   29         90          70 10     98.0        80 high risk
## 443   35         85          60 11    102.0        86 high risk
## 445   43        130          80 18     98.0        70  mid risk
## 450   48        120          80 11     98.0        88 high risk
## 452   48        140          90 15     98.0        90 high risk
## 459   34         85          60 11    102.0        86 high risk
## 461   42        130          80 18     98.0        70  mid risk
## 468   50        140          90 15     98.0        90 high risk
## 472   42        140         100 18     98.0        90 high risk
## 483   50        140          95 17     98.0        60 high risk
## 490   30        140         100 15     98.0        70 high risk
## 501   48        120          80 11     98.0        88  mid risk
## 503   63        140          90 15     98.0        90 high risk
## 507   55        140         100 18     98.0        90 high risk
## 520   30        140         100 15     98.0        70 high risk
## 531   48        120          80 11     98.0        88 high risk
## 533   49        140          90 15     98.0        90 high risk
## 539   40        160         100 19     98.0        77 high risk
## 540   32        140          90 18     98.0        88 high risk
## 550   54        140         100 15     98.0        66 high risk
## 551   40        120          95 11     98.0        80  mid risk
## 554   60        120          85 15     98.0        60  mid risk
## 555   55        140          95 19     98.0        77 high risk
## 556   50        130         100 16     98.0        75  mid risk
## 579   48        120          80 11     98.0        88 high risk
## 584   40        160         100 19     98.0        77 high risk
## 585   32        140          90 18     98.0        88 high risk
## 595   54        140         100 15     98.0        66 high risk
## 596   40        120          95 11     98.0        80 high risk
## 599   40        120          85 15     98.0        60 high risk
## 600   55        140          95 19     98.0        77 high risk
## 601   50        130         100 16     98.0        75 high risk
## 603   40        120          85 15     98.0        60 high risk
## 604   55        140          95 19     98.0        77 high risk
## 605   50        130         100 16     98.0        75  mid risk
## 615   50        120          80 15     98.0        70 high risk
## 628   48        120          80 11     98.0        88 high risk
## 631   22        100          65 12     98.0        80 high risk
## 632   50        140          95 17     98.0        60 high risk
## 633   35        100          70 11     98.0        60 high risk
## 637   50        130          80 15     98.0        86 high risk
## 638   35        140          90 13     98.0        70 high risk
## 639   29         90          70 11    100.0        80 high risk
## 641   46        140         100 12     99.0        90 high risk
## 642   28         95          60 10    101.0        86 high risk
## 645   25        140         100 15     98.6        70 high risk
## 656   48        120          80 11     98.0        88 high risk
## 658   27        140          90 15     98.0        90 high risk
## 659   25        140         100 12     99.0        80 high risk
## 676   35        140          90 13     98.0        70 high risk
## 680   35         85          60 11    102.0        86 high risk
## 681   42        130          80 18     98.0        70 high risk
## 682   50        140          90 15     98.0        90 high risk
## 684   40        140         100 18     98.0        90 high risk
## 687   37        120          90 11     98.0        88 high risk
## 688   17        110          75 12    101.0        76 high risk
## 689   40        120          90 12     98.0        80 high risk
## 690   40        160         100 19     98.0        77 high risk
## 711   48        120          80 11     98.0        88  mid risk
## 732   48        120          80 11     98.0        88  mid risk
## 755   54        130          70 12     98.0        67  mid risk
## 756   44        120          90 16     98.0        80  mid risk
## 759   55        120          90 12     98.0        70  mid risk
## 773   60        120          85 15     98.0        60  mid risk
## 788   50        130          80 16    102.0        76  mid risk
## 798   43        130          80 18     98.0        70  mid risk
## 803   42        130          80 18     98.0        70  mid risk
## 811   48        120          80 11     98.0        88  mid risk
## 818   40        120          95 11     98.0        80  mid risk
## 819   60        120          85 15     98.0        60  mid risk
## 820   50        130         100 16     98.0        75  mid risk
## 834   50        130         100 16     98.0        75  mid risk
## 857   48        120          80 11     98.0        88  mid risk
## 956   40        140         100 18     98.0        90 high risk
## 959   37        120          90 11     98.0        88 high risk
## 960   17        110          75 12    101.0        76 high risk
## 961   40        120          90 12     98.0        80 high risk
## 962   40        160         100 19     98.0        77 high risk
## 963   32        140          90 18     98.0        88 high risk
## 966   54        140         100 15     98.0        66 high risk
## 967   40        120          95 11     98.0        80 high risk
## 968   60        120          85 15     98.0        60 high risk
## 969   55        140          95 19     98.0        77 high risk
## 970   50        130         100 16     98.0        75 high risk
## 973   50        120          80 15     98.0        70 high risk
## 975   33        120          75 10     98.0        70 high risk
## 976   48        120          80 11     98.0        88 high risk
## 977   50        140          95 17     98.0        60 high risk
## 978   30        140         100 15     98.0        70 high risk
## 980   48        120          80 11     98.0        88 high risk
## 981   50        140          90 15     98.0        77 high risk
## 984   40        140         100 18     98.0        77 high risk
## 987   17         90          60 11    101.0        78 high risk
## 989   25        120          90 12    101.0        80 high risk
## 991   19         90          65 11    101.0        70 high risk
## 992   37        120          90 11     98.0        88 high risk
## 993   17        110          75 13    101.0        76 high risk
## 994   25        120          90 15     98.0        80 high risk
## 995   40        160         100 19     98.0        77 high risk
## 996   32        140          90 18     98.0        88 high risk
## 999   54        140         100 15     98.0        66 high risk
## 1000  40        120          95 11     98.0        80 high risk
## 1001  12         90          60 11    102.0        60 high risk
## 1002  60        120          85 15     98.0        60 high risk
## 1003  55        140          95 19     98.0        77 high risk
## 1004  50        130         100 16     98.0        76 high risk
## 1009  48        120          80 11     98.0        88 high risk
## 1010  22        120          60 15     98.0        80 high risk
## 1011  55        120          90 18     98.0        60 high risk
## 1012  35         85          60 19     98.0        86 high risk
## 1013  43        120          90 18     98.0        70 high risk

maternal_clean <- maternal_3
summary(maternal_clean)

##       Age          SystolicBP     DiastolicBP           BS        
##  Min.   :10.00   Min.   : 70.0   Min.   : 49.00   Min.   : 6.000  
##  1st Qu.:19.00   1st Qu.:100.0   1st Qu.: 65.00   1st Qu.: 6.900  
##  Median :27.00   Median :120.0   Median : 80.00   Median : 7.500  
##  Mean   :29.98   Mean   :113.5   Mean   : 76.65   Mean   : 8.754  
##  3rd Qu.:39.00   3rd Qu.:120.0   3rd Qu.: 90.00   3rd Qu.: 8.000  
##  Max.   :66.00   Max.   :160.0   Max.   :100.00   Max.   :19.000  
##     BodyTemp        HeartRate      RiskLevel        
##  Min.   : 98.00   Min.   :60.00   Length:985        
##  1st Qu.: 98.00   1st Qu.:70.00   Class :character  
##  Median : 98.00   Median :76.00   Mode  :character  
##  Mean   : 98.59   Mean   :74.39                     
##  3rd Qu.: 98.00   3rd Qu.:80.00                     
##  Max.   :103.00   Max.   :90.00

** Outlier analysis

HeartRate: The detected outliers have a heart rate of 7 bpm, which is physiologically impossible. These records are treated as data entry errors and were removed by keeping only rows with HeartRate ≥ 60.
BodyTemp: Body temperature values above 101°F indicate fever. Such readings are typically associated with high-risk or mid-risk maternal health levels. To maintain logical consistency, all records with BodyTemp > 101°F and RiskLevel = “low risk” were removed.
BS: A blood sugar level of 10 mmol/L or more suggests potential hyperglycemia and is usually linked with high or mid maternal risk. Consequently, records with BS ≥ 10 and RiskLevel = “low risk” were excluded to keep the dataset medically plausible.
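The three removal rules above can be expressed as a single function that also reports how many rows each rule flags. This is a minimal sketch: the function name `clean_maternal` and the toy data are hypothetical, while the thresholds (HeartRate < 60, BodyTemp > 101, BS >= 10) are copied from the code in this report.

```r
clean_maternal <- function(df) {
  # Flag rows under each rule; all flags are computed on the unfiltered data
  bad_hr   <- df$HeartRate < 60
  bad_temp <- df$BodyTemp > 101 & df$RiskLevel == "low risk"
  bad_bs   <- df$BS >= 10 & df$RiskLevel == "low risk"
  dropped <- c(HeartRate = sum(bad_hr), BodyTemp = sum(bad_temp), BS = sum(bad_bs))
  # Keep rows that violate none of the rules
  list(data = df[!(bad_hr | bad_temp | bad_bs), , drop = FALSE], dropped = dropped)
}

# Toy rows (illustrative values only)
toy <- data.frame(
  BS = c(7.5, 6.9, 11, 15, 7.0),
  BodyTemp = c(98, 102, 98, 98, 98),
  HeartRate = c(7, 70, 76, 80, 66),
  RiskLevel = c("low risk", "low risk", "low risk", "high risk", "mid risk"),
  stringsAsFactors = FALSE
)
result <- clean_maternal(toy)
print(result$dropped)
print(nrow(result$data))
```

Because the flags are computed on the unfiltered data, a row that breaks two rules is counted under both, so the per-rule counts can differ from those of the step-by-step subsetting even though the retained rows are the same.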

# check the collinearity
correlation_matrix <- cor(maternal_clean[, c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")])
correlation_matrix

##                     Age SystolicBP DiastolicBP          BS    BodyTemp
## Age          1.00000000  0.4152118   0.3943027  0.47909474 -0.25415057
## SystolicBP   0.41521179  1.0000000   0.7826082  0.42543551 -0.26803474
## DiastolicBP  0.39430271  0.7826082   1.0000000  0.42325102 -0.24461544
## BS           0.47909474  0.4254355   0.4232510  1.00000000 -0.08207481
## BodyTemp    -0.25415057 -0.2680347  -0.2446154 -0.08207481  1.00000000
## HeartRate    0.05606688 -0.0229889  -0.0589288  0.14590214  0.11943522
##               HeartRate
## Age          0.05606688
## SystolicBP  -0.02298890
## DiastolicBP -0.05892880
## BS           0.14590214
## BodyTemp     0.11943522
## HeartRate    1.00000000

** Collinearity analysis

The correlation matrix shows that SystolicBP and DiastolicBP are highly correlated (r = 0.78), which implies potential collinearity. All other pairwise correlations are weak to moderate (|r| < 0.5), suggesting no severe multicollinearity elsewhere. To mitigate the effect, only one of the two blood pressure variables (typically SystolicBP) may be retained for the subsequent regression analysis.
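Pairwise correlations can miss collinearity that involves more than two predictors, so a variance inflation factor (VIF) is a common complement. The sketch below relies on the identity that the VIF of predictor j equals the j-th diagonal element of the inverse correlation matrix; the helper name `vif_from_cor` and the simulated data are assumptions for illustration only.

```r
# VIF_j is the j-th diagonal element of the inverse correlation matrix
vif_from_cor <- function(X) {
  diag(solve(cor(X)))
}

set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.5),  # strongly correlated with x1
  x3 = rnorm(200)                  # unrelated to the others
)
vif <- vif_from_cor(toy)
print(round(vif, 2))
```

On the real data the same call could be made on the six numeric predictor columns. Cut-offs such as 5 or 10 are often quoted as rules of thumb rather than strict limits.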

# check the normality
shapiro.test(maternal_clean$Age)

## 
##  Shapiro-Wilk normality test
## 
## data:  maternal_clean$Age
## W = 0.91737, p-value < 2.2e-16

shapiro.test(maternal_clean$SystolicBP)

## 
##  Shapiro-Wilk normality test
## 
## data:  maternal_clean$SystolicBP
## W = 0.90575, p-value < 2.2e-16

shapiro.test(maternal_clean$DiastolicBP)

## 
##  Shapiro-Wilk normality test
## 
## data:  maternal_clean$DiastolicBP
## W = 0.94756, p-value < 2.2e-16

shapiro.test(maternal_clean$BS)

## 
##  Shapiro-Wilk normality test
## 
## data:  maternal_clean$BS
## W = 0.67569, p-value < 2.2e-16

shapiro.test(maternal_clean$BodyTemp)

## 
##  Shapiro-Wilk normality test
## 
## data:  maternal_clean$BodyTemp
## W = 0.50361, p-value < 2.2e-16

shapiro.test(maternal_clean$HeartRate)

## 
##  Shapiro-Wilk normality test
## 
## data:  maternal_clean$HeartRate
## W = 0.95236, p-value < 2.2e-16

hist(maternal_clean$Age, probability = TRUE)
lines(density(maternal_clean$Age), col = 'red')

hist(maternal_clean$SystolicBP, probability = TRUE)
lines(density(maternal_clean$SystolicBP), col = 'red')

hist(maternal_clean$DiastolicBP, probability = TRUE)
lines(density(maternal_clean$DiastolicBP), col = 'red')

hist(maternal_clean$BS, probability = TRUE)
lines(density(maternal_clean$BS), col = 'red')

hist(maternal_clean$BodyTemp, probability = TRUE)
lines(density(maternal_clean$BodyTemp), col = 'red')

hist(maternal_clean$HeartRate, probability = TRUE)
lines(density(maternal_clean$HeartRate), col = 'red')

** Normality analysis: The results of the Shapiro–Wilk normality test, supported by histogram visualizations, demonstrate that all variables significantly deviate from a normal distribution. Therefore, the assumption of normality is not satisfied for this dataset.
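The six separate Shapiro-Wilk calls can be condensed into one summary table. The following is a sketch on simulated data; the helper name `normality_table` is hypothetical, and it assumes every column passed in is numeric with between 3 and 5000 values, which is the range shapiro.test() accepts.

```r
# Run shapiro.test() on every column and collect W and the p-value
normality_table <- function(df) {
  res <- lapply(df, shapiro.test)
  data.frame(
    variable = names(res),
    W = vapply(res, function(r) unname(r$statistic), numeric(1)),
    p_value = vapply(res, function(r) r$p.value, numeric(1)),
    row.names = NULL
  )
}

set.seed(42)
toy <- data.frame(
  skewed = rexp(500),       # right-skewed, expected to be flagged
  symmetric = rnorm(500)    # drawn from a normal distribution
)
tab <- normality_table(toy)
print(tab)
```

With roughly a thousand observations the test tends to flag even modest departures from normality, so the histograms remain a useful companion. Logistic regression itself does not require normally distributed predictors.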

** Test for homoscedasticity: Since the dependent variable RiskLevel has three categorical levels (low, mid, and high), a multinomial logistic regression model was employed. Homoscedasticity is not an assumption of logistic regression, because the variance of a categorical response is determined by the predicted probabilities rather than being constant.

** Data Split

# split the data into 70% training and 30% test, stratified by RiskLevel
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

set.seed(123)
train_index <- createDataPartition(maternal_clean$RiskLevel, p = 0.7, list = FALSE)

train_data <- maternal_clean[train_index,]
test_data <- maternal_clean[-train_index,]
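Stratified sampling is what keeps the class proportions similar in both partitions. The sketch below is a rough base-R illustration of that idea on simulated labels. It is not caret's actual implementation, and the vector `labels` and its class probabilities are hypothetical.

```r
# Hypothetical class labels standing in for maternal_clean$RiskLevel
set.seed(123)
labels <- factor(sample(c("high risk", "low risk", "mid risk"),
                        size = 1000, replace = TRUE,
                        prob = c(0.27, 0.40, 0.33)))

# Draw 70% of the row indices separately within each class
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

# Class proportions in the full data and in the training partition
full_prop  <- prop.table(table(labels))
train_prop <- prop.table(table(labels[train_idx]))
round(rbind(full = full_prop, train = train_prop), 3)
```

Comparing `prop.table(table(train_data$RiskLevel))` with the same table for `test_data` would be a quick check that the real split is balanced in the same way.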

# multinomial logistic regression
library(nnet)

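# convert the response to a factor; the split was made above, so convert the partitions as well
maternal_clean$RiskLevel <- as.factor(maternal_clean$RiskLevel)
train_data$RiskLevel <- as.factor(train_data$RiskLevel)
test_data$RiskLevel <- as.factor(test_data$RiskLevel)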

model_multi1 <- multinom(RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, data = train_data)

## # weights:  24 (14 variable)
## initial  value 759.141091 
## iter  10 value 629.597777
## iter  20 value 536.210329
## iter  30 value 505.767145
## iter  40 value 504.147202
## iter  50 value 503.571551
## final  value 503.502345 
## converged

summary(model_multi1)

## Call:
## multinom(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP + 
##     BS + BodyTemp + HeartRate, data = train_data)
## 
## Coefficients:
##          (Intercept)        Age   SystolicBP DiastolicBP         BS   BodyTemp
## low risk   153.49418 0.01493923 -0.072686297 -0.01760594 -0.7775492 -1.3356866
## mid risk    48.03771 0.01683282 -0.007092011 -0.05883441 -0.3729959 -0.3677746
##            HeartRate
## low risk -0.07850442
## mid risk -0.03919853
## 
## Std. Errors:
##           (Intercept)        Age SystolicBP DiastolicBP         BS   BodyTemp
## low risk 0.0001948333 0.01367527 0.01307712  0.01631953 0.11129976 0.02100432
## mid risk 0.0001440045 0.01271893 0.01126380  0.01457269 0.05196203 0.01639377
##           HeartRate
## low risk 0.01995628
## mid risk 0.01640244
## 
## Residual Deviance: 1007.005 
## AIC: 1035.005
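The `summary()` output for a `multinom` fit lists coefficients and standard errors but no test statistics. A common follow-up is to compute Wald z statistics and two-sided p-values by hand. The sketch below does this with the values copied from the output above (intercepts omitted); the object names `coefs`, `ses`, `z`, `p`, and `odds_ratio` are introduced here for illustration only.

```r
# Coefficients and standard errors copied from summary(model_multi1) above.
# Rows are the non-reference classes; high risk is the reference.
vars <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")
coefs <- rbind(
  "low risk" = c(0.01493923, -0.072686297, -0.01760594, -0.7775492, -1.3356866, -0.07850442),
  "mid risk" = c(0.01683282, -0.007092011, -0.05883441, -0.3729959, -0.3677746, -0.03919853)
)
ses <- rbind(
  "low risk" = c(0.01367527, 0.01307712, 0.01631953, 0.11129976, 0.02100432, 0.01995628),
  "mid risk" = c(0.01271893, 0.01126380, 0.01457269, 0.05196203, 0.01639377, 0.01640244)
)
colnames(coefs) <- colnames(ses) <- vars

z <- coefs / ses                             # Wald z statistics
p <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-values
odds_ratio <- exp(coefs)                     # odds vs. high risk per one-unit increase

round(z, 2)
signif(p, 3)
round(odds_ratio, 3)
```

On the fitted object, the same ratio would be `summary(model_multi1)$coefficients / summary(model_multi1)$standard.errors`. The very small intercept standard errors in the output may be a sign of numerical scaling problems, so these Wald tests are best read as approximate.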

# remove DiastolicBP
model_multi2 <- multinom(RiskLevel ~ Age + SystolicBP + BS + BodyTemp + HeartRate, data = train_data)

## # weights:  21 (12 variable)
## initial  value 759.141091 
## iter  10 value 627.114178
## iter  20 value 542.717280
## iter  30 value 520.104526
## iter  40 value 517.804853
## iter  50 value 515.428104
## final  value 515.425312 
## converged

summary(model_multi2)

## Call:
## multinom(formula = RiskLevel ~ Age + SystolicBP + BS + BodyTemp + 
##     HeartRate, data = train_data)
## 
## Coefficients:
##          (Intercept)        Age  SystolicBP         BS   BodyTemp   HeartRate
## low risk   150.59484 0.01555393 -0.08094925 -0.7396030 -1.3119601 -0.08122199
## mid risk    42.36125 0.01443048 -0.03663022 -0.4015462 -0.3227446 -0.03442227
## 
## Std. Errors:
##           (Intercept)        Age  SystolicBP         BS   BodyTemp  HeartRate
## low risk 0.0001979874 0.01372230 0.009126859 0.10224931 0.02109619 0.02023666
## mid risk 0.0001445513 0.01275148 0.007708714 0.05281594 0.01629270 0.01644001
## 
## Residual Deviance: 1030.851 
## AIC: 1054.851

# remove Age, DiastolicBP, HeartRate
model_multi3 <- multinom(RiskLevel ~ SystolicBP + BS + BodyTemp, data = train_data)

## # weights:  15 (8 variable)
## initial  value 759.141091 
## iter  10 value 593.870051
## iter  20 value 548.190182
## iter  30 value 532.634520
## iter  40 value 527.247598
## iter  50 value 524.654320
## iter  60 value 524.484236
## final  value 524.437259 
## converged

summary(model_multi3)

## Call:
## multinom(formula = RiskLevel ~ SystolicBP + BS + BodyTemp, data = train_data)
## 
## Coefficients:
##          (Intercept)  SystolicBP         BS   BodyTemp
## low risk    148.3870 -0.07259164 -0.7237608 -1.3569666
## mid risk     42.5065 -0.03168957 -0.3790738 -0.3538934
## 
## Std. Errors:
##           (Intercept)  SystolicBP         BS    BodyTemp
## low risk 0.0001019664 0.008321894 0.09567523 0.011937788
## mid risk 0.0001016986 0.006924722 0.04378866 0.008421014
## 
## Residual Deviance: 1048.875 
## AIC: 1064.875

AIC(model_multi1,model_multi2,model_multi3)

##              df      AIC
## model_multi1 14 1035.005
## model_multi2 12 1054.851
## model_multi3  8 1064.875
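The AIC values can be reproduced from the residual deviances, and because Models 2 and 3 are nested inside Model 1, a likelihood-ratio test gives a second view of the same comparison. The sketch below uses only the numbers printed above; the likelihood-ratio test is an addition for illustration and was not part of the original analysis.

```r
# Residual deviance and number of estimated parameters, copied from the output above
dev   <- c(model1 = 1007.005, model2 = 1030.851, model3 = 1048.875)
n_par <- c(model1 = 14, model2 = 12, model3 = 8)

# AIC = deviance + 2 * (number of parameters)
aic <- dev + 2 * n_par
aic

# Likelihood-ratio test of each reduced model against the full model
reduced <- c("model2", "model3")
lr_stat <- unname(dev[reduced] - dev["model1"])      # increase in deviance
lr_df   <- unname(n_par["model1"] - n_par[reduced])  # parameters removed
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
data.frame(reduced, lr_stat, lr_df, lr_p)
```

Both reduced models fit significantly worse than the full model under this test, which agrees with the AIC ranking.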

# confusion matrix
pred1 <- predict(model_multi1, newdata = test_data)
confusionMatrix(data=as.factor(pred1), reference = as.factor(test_data$RiskLevel))

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  high risk low risk mid risk
##   high risk        63        1       15
##   low risk          1       91       50
##   mid risk         17       21       35
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6429          
##                  95% CI : (0.5852, 0.6976)
##     No Information Rate : 0.3844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4555          
##                                           
##  Mcnemar's Test P-Value : 0.007486        
## 
## Statistics by Class:
## 
##                      Class: high risk Class: low risk Class: mid risk
## Sensitivity                    0.7778          0.8053          0.3500
## Specificity                    0.9249          0.7182          0.8041
## Pos Pred Value                 0.7975          0.6408          0.4795
## Neg Pred Value                 0.9163          0.8553          0.7059
## Prevalence                     0.2755          0.3844          0.3401
## Detection Rate                 0.2143          0.3095          0.1190
## Detection Prevalence           0.2687          0.4830          0.2483
## Balanced Accuracy              0.8513          0.7618          0.5771
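The headline statistics in this output follow directly from the 3 x 3 table. As a sketch, the code below recomputes accuracy, Cohen's kappa, sensitivity, and positive predictive value for Model 1 from the counts printed above; the object names are introduced here for illustration.

```r
# Confusion matrix for Model 1, copied from the output above
# (rows = predicted class, columns = reference class)
classes <- c("high risk", "low risk", "mid risk")
cm <- matrix(c(63,  1, 15,
                1, 91, 50,
               17, 21, 35),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- diag(cm) / colSums(cm)   # share of each true class that was found
ppv         <- diag(cm) / rowSums(cm)   # share of each predicted class that was right

# Cohen's kappa: observed agreement corrected for chance agreement
p_expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - p_expected) / (1 - p_expected)

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
round(rbind(sensitivity, ppv), 4)
```

The mid risk column shows where the model struggles: 50 of the 100 true mid-risk cases are predicted as low risk.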

pred2 <- predict(model_multi2, newdata = test_data)
confusionMatrix(data=as.factor(pred2), reference = as.factor(test_data$RiskLevel))

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  high risk low risk mid risk
##   high risk        60        1       17
##   low risk          2       94       50
##   mid risk         19       18       33
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6361          
##                  95% CI : (0.5782, 0.6911)
##     No Information Rate : 0.3844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4444          
##                                           
##  Mcnemar's Test P-Value : 0.001433        
## 
## Statistics by Class:
## 
##                      Class: high risk Class: low risk Class: mid risk
## Sensitivity                    0.7407          0.8319          0.3300
## Specificity                    0.9155          0.7127          0.8093
## Pos Pred Value                 0.7692          0.6438          0.4714
## Neg Pred Value                 0.9028          0.8716          0.7009
## Prevalence                     0.2755          0.3844          0.3401
## Detection Rate                 0.2041          0.3197          0.1122
## Detection Prevalence           0.2653          0.4966          0.2381
## Balanced Accuracy              0.8281          0.7723          0.5696

pred3 <- predict(model_multi3, newdata = test_data)
confusionMatrix(data=as.factor(pred3), reference = as.factor(test_data$RiskLevel))

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  high risk low risk mid risk
##   high risk        59        0       13
##   low risk          2       99       53
##   mid risk         20       14       34
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6531          
##                  95% CI : (0.5956, 0.7074)
##     No Information Rate : 0.3844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4683          
##                                           
##  Mcnemar's Test P-Value : 8.718e-06       
## 
## Statistics by Class:
## 
##                      Class: high risk Class: low risk Class: mid risk
## Sensitivity                    0.7284          0.8761          0.3400
## Specificity                    0.9390          0.6961          0.8247
## Pos Pred Value                 0.8194          0.6429          0.5000
## Neg Pred Value                 0.9009          0.9000          0.7080
## Prevalence                     0.2755          0.3844          0.3401
## Detection Rate                 0.2007          0.3367          0.1156
## Detection Prevalence           0.2449          0.5238          0.2313
## Balanced Accuracy              0.8337          0.7861          0.5824

# check accuracy
acc1 <- mean(pred1 == test_data$RiskLevel)
acc2 <- mean(pred2 == test_data$RiskLevel)
acc3 <- mean(pred3 == test_data$RiskLevel)

acc1

## [1] 0.6428571

acc2

## [1] 0.6360544

acc3

## [1] 0.6530612

# evaluate three models

model_eval <- data.frame(model = c("model1", "model2", "model3"), AIC = c(AIC(model_multi1), AIC(model_multi2),AIC(model_multi3)), Accuracy = c(acc1,acc2, acc3))

model_eval

##    model      AIC  Accuracy
## 1 model1 1035.005 0.6428571
## 2 model2 1054.851 0.6360544
## 3 model3 1064.875 0.6530612

** Model Report: Three multinomial logistic regression models were compared on two metrics: the Akaike Information Criterion (AIC) and classification accuracy on the test set. As the table above shows, Model 1, which includes all six predictors (Age, SystolicBP, DiastolicBP, BS, BodyTemp, and HeartRate), has the lowest AIC (1035.005), indicating the best trade-off between fit and complexity. Model 3 has a slightly higher test accuracy (0.653 versus 0.643), but the difference is small relative to the reported 95% confidence intervals, which overlap almost entirely, and its AIC (1064.875) is about 30 points higher. Because Model 3 is the smaller model, its higher AIC points to a loss of fit from dropping Age, DiastolicBP, and HeartRate rather than to overfitting. Model 1 was therefore selected as the most suitable model for predicting maternal health risk levels. The three confusion matrices show that every model identifies high-risk and low-risk cases reasonably well (sensitivity 0.73 to 0.88) but detects mid-risk cases poorly (sensitivity 0.33 to 0.35): about half of the true mid-risk cases are predicted as low risk, which suggests that the predictors do not separate these two levels well.
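Whether the accuracy gap between Model 1 and Model 3 is meaningful can be judged against its sampling uncertainty. The sketch below computes exact binomial confidence intervals from the counts of correct predictions in the confusion matrices above; `n_test`, `correct`, and `ci` are names introduced here for illustration.

```r
# Correct predictions = diagonal of each confusion matrix above; 294 test cases
n_test  <- 294
correct <- c(model1 = 63 + 91 + 35, model3 = 59 + 99 + 34)

# Exact (Clopper-Pearson) 95% confidence interval for each accuracy
ci <- t(sapply(correct, function(k) binom.test(k, n_test)$conf.int))
colnames(ci) <- c("lower", "upper")
round(cbind(accuracy = correct / n_test, ci), 4)
```

Each model's accuracy falls inside the other's interval, so the test set does not give clear evidence that either model predicts better.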

Non-Parametric Statistics

** Decision Tree Analysis

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.5.1

library(caret)

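# the tree is fitted on train_data, so make sure its response is a factor too
maternal_clean$RiskLevel <- as.factor(maternal_clean$RiskLevel)
train_data$RiskLevel <- as.factor(train_data$RiskLevel)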

tree_model <- rpart(RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, data = train_data, method = "class", parms = list(split="information"))

print(tree_model)

## n= 691 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 691 427 low risk (0.27641100 0.38205499 0.34153401)  
##    2) BS>=7.95 184  45 high risk (0.75543478 0.01086957 0.23369565) *
##    3) BS< 7.95 507 245 low risk (0.10256410 0.51676529 0.38067061)  
##      6) SystolicBP>=132.5 30   3 high risk (0.90000000 0.00000000 0.10000000) *
##      7) SystolicBP< 132.5 477 215 low risk (0.05241090 0.54926625 0.39832285)  
##       14) BodyTemp< 99.5 400 146 low risk (0.01750000 0.63500000 0.34750000)  
##         28) SystolicBP< 129.5 376 122 low risk (0.01861702 0.67553191 0.30585106) *
##         29) SystolicBP>=129.5 24   0 mid risk (0.00000000 0.00000000 1.00000000) *
##       15) BodyTemp>=99.5 77  26 mid risk (0.23376623 0.10389610 0.66233766) *

summary(tree_model)

## Call:
## rpart(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP + 
##     BS + BodyTemp + HeartRate, data = train_data, method = "class", 
##     parms = list(split = "information"))
##   n= 691 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.32084309      0 1.0000000 1.0000000 0.02991224
## 2 0.08196721      1 0.6791569 0.6791569 0.03038116
## 3 0.05620609      3 0.5152225 0.5152225 0.02867840
## 4 0.01000000      4 0.4590164 0.4613583 0.02779263
## 
## Variable importance
##          BS  SystolicBP DiastolicBP    BodyTemp         Age   HeartRate 
##          36          26          13          11           8           5 
## 
## Node number 1: 691 observations,    complexity param=0.3208431
##   predicted class=low risk   expected loss=0.617945  P(node) =1
##     class counts:   191   264   236
##    probabilities: 0.276 0.382 0.342 
##   left son=2 (184 obs) right son=3 (507 obs)
##   Primary splits:
##       BS          < 7.95  to the right, improve=164.83220, (0 missing)
##       SystolicBP  < 132.5 to the right, improve=124.80490, (0 missing)
##       DiastolicBP < 92.5  to the right, improve= 67.50824, (0 missing)
##       Age         < 31.5  to the right, improve= 43.04697, (0 missing)
##       BodyTemp    < 99.5  to the left,  improve= 38.87529, (0 missing)
##   Surrogate splits:
##       Age         < 36.5  to the right, agree=0.790, adj=0.212, (0 split)
##       SystolicBP  < 137.5 to the right, agree=0.784, adj=0.190, (0 split)
##       DiastolicBP < 92.5  to the right, agree=0.781, adj=0.179, (0 split)
##       HeartRate   < 84    to the right, agree=0.771, adj=0.141, (0 split)
##       BodyTemp    < 101.5 to the right, agree=0.737, adj=0.011, (0 split)
## 
## Node number 2: 184 observations
##   predicted class=high risk  expected loss=0.2445652  P(node) =0.2662808
##     class counts:   139     2    43
##    probabilities: 0.755 0.011 0.234 
## 
## Node number 3: 507 observations,    complexity param=0.08196721
##   predicted class=low risk   expected loss=0.4832347  P(node) =0.7337192
##     class counts:    52   262   193
##    probabilities: 0.103 0.517 0.381 
##   left son=6 (30 obs) right son=7 (477 obs)
##   Primary splits:
##       SystolicBP  < 132.5 to the right, improve=62.43981, (0 missing)
##       BodyTemp    < 99.5  to the left,  improve=38.73359, (0 missing)
##       DiastolicBP < 97.5  to the right, improve=35.34743, (0 missing)
##       BS          < 7.055 to the right, improve=22.31142, (0 missing)
##       HeartRate   < 77.5  to the left,  improve=11.54812, (0 missing)
##   Surrogate splits:
##       DiastolicBP < 97.5  to the right, agree=0.97, adj=0.5, (0 split)
## 
## Node number 6: 30 observations
##   predicted class=high risk  expected loss=0.1  P(node) =0.04341534
##     class counts:    27     0     3
##    probabilities: 0.900 0.000 0.100 
## 
## Node number 7: 477 observations,    complexity param=0.08196721
##   predicted class=low risk   expected loss=0.4507338  P(node) =0.6903039
##     class counts:    25   262   190
##    probabilities: 0.052 0.549 0.398 
##   left son=14 (400 obs) right son=15 (77 obs)
##   Primary splits:
##       BodyTemp    < 99.5  to the left,  improve=49.715280, (0 missing)
##       SystolicBP  < 129.5 to the left,  improve=23.061630, (0 missing)
##       BS          < 7.055 to the right, improve=21.106480, (0 missing)
##       DiastolicBP < 49.5  to the left,  improve=11.706950, (0 missing)
##       Age         < 17.5  to the left,  improve= 7.781778, (0 missing)
## 
## Node number 14: 400 observations,    complexity param=0.05620609
##   predicted class=low risk   expected loss=0.365  P(node) =0.5788712
##     class counts:     7   254   139
##    probabilities: 0.018 0.635 0.347 
##   left son=28 (376 obs) right son=29 (24 obs)
##   Primary splits:
##       SystolicBP  < 129.5 to the left,  improve=26.835620, (0 missing)
##       BS          < 7.005 to the right, improve=18.140290, (0 missing)
##       Age         < 33.5  to the right, improve= 9.753374, (0 missing)
##       DiastolicBP < 49.5  to the left,  improve= 8.898949, (0 missing)
##       HeartRate   < 77.5  to the left,  improve= 6.773620, (0 missing)
## 
## Node number 15: 77 observations
##   predicted class=mid risk   expected loss=0.3376623  P(node) =0.1114327
##     class counts:    18     8    51
##    probabilities: 0.234 0.104 0.662 
## 
## Node number 28: 376 observations
##   predicted class=low risk   expected loss=0.3244681  P(node) =0.5441389
##     class counts:     7   254   115
##    probabilities: 0.019 0.676 0.306 
## 
## Node number 29: 24 observations
##   predicted class=mid risk   expected loss=0  P(node) =0.03473227
##     class counts:     0     0    24
##    probabilities: 0.000 0.000 1.000

rpart.plot(tree_model, type = 3, fallen.leaves = TRUE, main = "Decision Tree for Maternal Health Risk Prediction", extra = 104)

tree_model$variable.importance

##          BS  SystolicBP DiastolicBP    BodyTemp         Age   HeartRate 
##   164.83218   120.62938    60.78220    51.50694    34.93726    23.29150
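The raw importance scores are easier to compare as shares of the total. The short sketch below rescales the values printed above to percentages, which reproduces the "Variable importance" line in the `summary(tree_model)` output.

```r
# Importance scores copied from tree_model$variable.importance above
imp <- c(BS = 164.83218, SystolicBP = 120.62938, DiastolicBP = 60.78220,
         BodyTemp = 51.50694, Age = 34.93726, HeartRate = 23.29150)

# Share of total importance, rounded to whole percent
imp_pct <- round(100 * imp / sum(imp))
imp_pct
```

BS and SystolicBP together account for a little over 60% of the total importance.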

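** Decision Tree Analysis: A decision tree was fitted with six physiological predictors (Age, SystolicBP, DiastolicBP, BS, BodyTemp, and HeartRate) to classify maternal health risk levels. The fitted tree splits on only three of them: BS (blood sugar) at the root, then SystolicBP and BodyTemp. BS and SystolicBP have the highest variable importance (36% and 26%), followed by DiastolicBP (13%) and BodyTemp (11%). DiastolicBP never appears as a primary split, so its importance comes from surrogate splits. These findings indicate that blood sugar and blood pressure play the most critical role in determining maternal health risk in this model. The tree was not evaluated on the test set here, so its predictive accuracy cannot be compared directly with the regression models.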

** PCA analysis

pca_data <- maternal_clean[, c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")]

# scaling (prcomp() below also centers and scales, so this step is redundant but harmless)
pca_scaled <- scale(pca_data)

pca_result <- prcomp(pca_scaled, center = TRUE, scale. = TRUE)
summary(pca_result)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.6096 1.0780 0.9161 0.8446 0.69199 0.46447
## Proportion of Variance 0.4318 0.1937 0.1399 0.1189 0.07981 0.03596
## Cumulative Proportion  0.4318 0.6255 0.7653 0.8842 0.96404 1.00000
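The proportions in this table are the squared standard deviations (eigenvalues) divided by their sum. The sketch below recomputes them from the printed values and applies two common rules of thumb for choosing the number of components: eigenvalue greater than 1, and a cumulative share of at least 80%. Both thresholds are conventions assumed here for illustration, not choices made in the original analysis.

```r
# Standard deviations copied from summary(pca_result) above
sdev <- c(PC1 = 1.6096, PC2 = 1.0780, PC3 = 0.9161,
          PC4 = 0.8446, PC5 = 0.69199, PC6 = 0.46447)

eigenvalues <- sdev^2                       # variance carried by each component
pve <- eigenvalues / sum(eigenvalues)       # proportion of variance explained
round(rbind(eigenvalue = eigenvalues, pve = pve, cumulative = cumsum(pve)), 4)

kaiser_keep <- names(which(eigenvalues > 1))       # components with eigenvalue > 1
n_for_80    <- unname(which(cumsum(pve) >= 0.80)[1])  # components needed for 80%
kaiser_keep
n_for_80
```

The eigenvalue rule keeps PC1 and PC2, while the 80% rule needs four components, so the two-component summary trades some coverage for simplicity.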

pca_result$rotation

##                      PC1         PC2         PC3        PC4         PC5
## Age          0.439174114 -0.13876309  0.29082103  0.5476434  0.63469447
## SystolicBP   0.530588341  0.08523217 -0.26175603 -0.3562446  0.10491813
## DiastolicBP  0.523623413  0.10704551 -0.32207805 -0.3359659  0.07277007
## BS           0.427721334 -0.36798950 -0.06033367  0.4182756 -0.70905351
## BodyTemp    -0.261554957 -0.47136417 -0.77470715  0.1930367  0.26721633
## HeartRate    0.008006029 -0.77744524  0.37331129 -0.4980820  0.08184240
##                     PC6
## Age          0.02331586
## SystolicBP  -0.71047677
## DiastolicBP  0.70175274
## BS          -0.01693074
## BodyTemp    -0.02378907
## HeartRate    0.03700874
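Each column of the rotation matrix is a unit-length direction, and different columns are orthogonal to each other. As a sketch, the code below checks this for the PC1 and PC2 loadings printed above and ranks the variables by absolute loading; the matrix `loadings` is just a copy of those two columns.

```r
# PC1 and PC2 loadings copied from pca_result$rotation above
vars <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")
loadings <- cbind(
  PC1 = c(0.439174114, 0.530588341, 0.523623413, 0.427721334, -0.261554957, 0.008006029),
  PC2 = c(-0.13876309, 0.08523217, 0.10704551, -0.36798950, -0.47136417, -0.77744524)
)
rownames(loadings) <- vars

# Cross-product should be close to the 2 x 2 identity matrix
gram <- crossprod(loadings)
round(gram, 4)

# Variables ordered by absolute loading on each component
ranked <- apply(abs(loadings), 2, function(l) names(sort(l, decreasing = TRUE)))
ranked
```

The sign of a whole component is arbitrary, so only the relative signs and sizes of the loadings within a column carry meaning.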

# scree plot
pr.var <- pca_result$sdev^2
pve <- pr.var / sum(pr.var)

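par(mfrow = c(1,1), mar = c(4.5,4.5,2,1))
# y-axis runs to 1 so the cumulative curve is not clipped by the plot region
plot(pve, type = "b", pch = 19, xaxt = "n",
     xlab = "Principal Component", ylab = "Proportion of Variance Explained",
     ylim = c(0, 1))
axis(1, at = 1:length(pve), labels = paste0("PC", 1:length(pve)))
lines(cumsum(pve), type = "b", pch = 17, col = "red")
legend("right", legend = c("Individual", "Cumulative"),
       pch = c(19, 17), col = c("black", "red"), lty = 1)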

** PCA analysis: A Principal Component Analysis (PCA) was conducted on six standardized physiological indicators (Age, SystolicBP, DiastolicBP, BS, BodyTemp, and HeartRate). The loading matrix shows that the first principal component (PC1, 43.18% of the variance) is dominated by SystolicBP, DiastolicBP, Age, and BS, and can be read as a cardiometabolic factor related to blood pressure and glucose regulation. The second component (PC2, 19.37%) has its largest loadings on HeartRate, BodyTemp, and BS, and may reflect a physiological stress factor, possibly related to acute metabolic or thermal responses. Together, the first two components explain 62.55% of the total variance, and four components are needed to exceed 80%. This suggests that much of the variation in these indicators can be summarized by two underlying dimensions: long-term cardiovascular and metabolic status, and short-term physiological stress response. Because PCA does not use RiskLevel, these components describe structure in the predictors rather than risk itself.

YanQiHomework2

Yan Qi

2025-11-01
