This analysis is conducted in R. The main R packages used for data wrangling, modeling, and visualization are listed below.
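The following library calls cover the functions used in the technical appendix; they are reconstructed from that code, since the original setup chunk is not shown:

library(tidyverse)    # readr, dplyr, tidyr, ggplot2: wrangling and plots
library(skimr)        # summary statistics
library(corrplot)     # correlation heat maps
library(e1071)        # skewness()
library(MASS)         # polr() for ordinal logistic regression (masks dplyr::select,
                      # which is why the appendix qualifies dplyr::select explicitly)
library(caret)        # trainControl(), train(), confusionMatrix()
library(kernlab)      # backend for caret method 'svmLinear'
library(gbm)          # backend for caret method 'gbm'
library(randomForest) # backend for caret method 'rf'
library(vip)          # variable importance plots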
This analysis employs a variety of algorithms to predict fetal health status using a cardiotocogram (CTG) dataset. Predicting fetal health, particularly the suspect and pathological statuses, is key to helping doctors identify appropriate interventions and intervention timelines. The analysis begins with exploratory data analysis (EDA) and preprocessing, followed by a series of predictive models, and ends with an evaluation of model performance.
All references and a technical appendix of all R code are available at the end of this report.
EDA uses summary statistics and univariate and bivariate visualizations to summarize the CTG dataset, its response, and the initial set of possible features. This information will inform preprocessing of the dataset prior to modeling.
Sourced from a 2000 research project and deidentified, the dataset consists of samples describing selected health characteristics of individual fetuses. It contains 2,126 samples and 22 total variables, including a target fetal_health with three classes: ‘Normal’, ‘Suspect’, and ‘Pathological’. The available features for modeling range from the fetal heart rate (FHR) baseline (baseline_value) to the number of prolonged decelerations per second (prolonged_decelerations) to varied characteristics of FHR distribution (histogram_*). All variables in the dataset are listed and described below.
Dataset Descriptions
| Variable | Description |
|---|---|
| baseline_value | Fetal heart rate (FHR) baseline (beats per minute) |
| accelerations | # of accelerations per second |
| fetal_movement | # of fetal movements per second |
| uterine_contractions | # of uterine contractions per second |
| light_decelerations | # of light decelerations per second |
| severe_decelerations | # of severe decelerations per second |
| prolonged_decelerations | # of prolonged decelerations per second |
| abnormal_short_var | % of time with abnormal short term variability |
| mean_short_var | Mean value of short term variability |
| perc_time_long_var | % of time with abnormal long term variability |
| mean_long_var | Mean value of long term variability |
| histogram_width | Width of FHR histogram |
| histogram_min | Minimum (low frequency) of FHR histogram |
| histogram_max | Maximum (high frequency) of FHR histogram |
| histogram_peaks | # of histogram peaks |
| histogram_zeroes | # of histogram zeroes |
| histogram_mode | Histogram mode |
| histogram_mean | Histogram mean |
| histogram_median | Histogram median |
| histogram_variance | Histogram variance |
| histogram_tendency | Histogram tendency |
| fetal_health | 1 (Normal), 2 (Suspect), 3 (Pathological) - TARGET |
Basic summary statistics

| Data summary | |
|---|---|
| Name | df |
| Number of rows | 2126 |
| Number of columns | 22 |
| Column type frequency: factor | 1 |
| Column type frequency: numeric | 21 |
| Group variables | None |
Variable type: factor
| skim_variable | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|
| fetal_health | 1 | TRUE | 3 | 1: 1655, 2: 295, 3: 176 |
Variable type: numeric
| skim_variable | complete_rate | mean | sd | p0 | p50 | p100 |
|---|---|---|---|---|---|---|
| baseline_value | 1 | 133.30 | 9.84 | 106.0 | 133.0 | 160.00 |
| accelerations | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 0.02 |
| fetal_movement | 1 | 0.01 | 0.05 | 0.0 | 0.0 | 0.48 |
| uterine_contractions | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 0.01 |
| light_decelerations | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 0.01 |
| severe_decelerations | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 0.00 |
| prolonged_decelerations | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 0.00 |
| abnormal_short_var | 1 | 46.99 | 17.19 | 12.0 | 49.0 | 87.00 |
| mean_short_var | 1 | 1.33 | 0.88 | 0.2 | 1.2 | 7.00 |
| perc_time_long_var | 1 | 9.85 | 18.40 | 0.0 | 0.0 | 91.00 |
| mean_long_var | 1 | 8.19 | 5.63 | 0.0 | 7.4 | 50.70 |
| histogram_width | 1 | 70.45 | 38.96 | 3.0 | 67.5 | 180.00 |
| histogram_min | 1 | 93.58 | 29.56 | 50.0 | 93.0 | 159.00 |
| histogram_max | 1 | 164.03 | 17.94 | 122.0 | 162.0 | 238.00 |
| histogram_peaks | 1 | 4.07 | 2.95 | 0.0 | 3.0 | 18.00 |
| histogram_zeroes | 1 | 0.32 | 0.71 | 0.0 | 0.0 | 10.00 |
| histogram_mode | 1 | 137.45 | 16.38 | 60.0 | 139.0 | 187.00 |
| histogram_mean | 1 | 134.61 | 15.59 | 73.0 | 136.0 | 182.00 |
| histogram_median | 1 | 138.09 | 14.47 | 77.0 | 139.0 | 186.00 |
| histogram_variance | 1 | 18.81 | 28.98 | 0.0 | 7.0 | 269.00 |
| histogram_tendency | 1 | 0.32 | 0.61 | -1.0 | 0.0 | 1.00 |
Summary statistics reveal useful information about the dataset. First, the dataset contains zero missing values. This completeness renders imputation, which can rest on tenuous assumptions, unnecessary. Second, beyond the categorical target fetal_health, all of the features are numeric, though their scales vary substantially; centering and scaling could therefore facilitate modeling. Third, many features display apparent rightward skew, which suggests additional transformation may be necessary as well. Fourth, the classes of fetal_health are highly imbalanced: the largest class, “Normal” (1), consists of 1,655 samples, compared with 295 samples for “Suspect” (2) and 176 for “Pathological” (3).
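The completeness and imbalance checks can be reproduced in a few lines (a minimal sketch, assuming the data frame is named df as in the appendix):

sum(is.na(df))                                 # 0: no missing values anywhere
table(df$fetal_health)                         # 1655 / 295 / 176
round(prop.table(table(df$fetal_health)), 3)   # ~0.778 / 0.139 / 0.083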
Next comes visualization of the dataset features and their relations to the target fetal_health. The following two figures depict sets of feature-specific boxplots by class of fetal_health. Broadly, the boxplots reveal clear class-related differences in the distributions of the features.
Feature boxplots by fetal health
The first set of plots focuses on the eleven non-histogram features: baseline_value, accelerations, fetal_movement, uterine_contractions, light_decelerations, severe_decelerations, prolonged_decelerations, abnormal_short_var, mean_short_var, perc_time_long_var, and mean_long_var. There are clear differences across classes of fetal_health. Notably, the IQRs of abnormal_short_var and accelerations for the “Normal” class show minimal overlap with their counterparts for the other two classes. light_decelerations and baseline_value show similar separation for the “Suspect” class, as does prolonged_decelerations for the “Pathological” class. Skewness is common across features, particularly for fetal_movement and severe_decelerations, though the degree of skewness tends to vary by class within a feature (e.g., “Normal” for perc_time_long_var).
Feature boxplots by fetal health, continued
The second set of plots focuses on the ten histogram features: histogram_width, histogram_min, histogram_max, histogram_peaks, histogram_zeroes, histogram_mode, histogram_mean, histogram_median, histogram_variance, and histogram_tendency. Several observations jump out. First, the three central-tendency features (histogram_mode, histogram_mean, and histogram_median) show different distributions across classes of fetal_health, but those distributions are similar regardless of measure. Second, there is less skewness among this set of features, though it is still present and is substantial for histogram_peaks, histogram_variance, and histogram_zeroes; class differences are especially pronounced for histogram_variance. And third, for histogram_tendency, the distribution for the “Pathological” class is predominantly negative, versus the predominantly positive distributions for the “Normal” and “Suspect” classes.
The feature names and descriptions suggest that groups of features are correlated, which could pose problems of collinearity during modeling. Considering the marked differences in feature distributions by class of fetal_health, it is appropriate to calculate any correlations separately for each class. Further, across features and classes, distributions are typically skewed, which calls for the non-parametric Spearman’s rho. Below are correlation heat maps using Spearman’s rho for all features and by class of fetal_health.
Spearman’s correlation heat map of features, Fetal health: Normal (1)
Several patterns stand out for the “Normal” class. Understandably, the central-tendency features are highly positively correlated with one another and with baseline_value. Likewise, the histogram shape features show notably strong correlations, including negative correlations between histogram_min and each of histogram_width, histogram_peaks, and histogram_variance, and positive correlations between histogram_width and each of histogram_max, histogram_peaks, and histogram_variance. light_decelerations and mean_short_var also show moderately strong correlations with several other features.
Spearman’s correlation heat map of features, Fetal health: Suspect (2)
The heat map for the “Suspect” class repeats the general “Normal” patterns but with higher correlation values. There are stronger relationships between the histogram shape features and the non-histogram features. Additionally, the question marks for severe_decelerations indicate that the feature is constant (all zeros) for this class, so its correlations are undefined.
Spearman’s correlation heat map of features, Fetal health: Pathological (3)
The third heat map, for the “Pathological” class, continues the pattern of stronger correlations. Nearly all previously observed patterns are amplified, in both the positive and negative directions, for this class. A small set of features continues to show relatively weaker correlations, including baseline_value (aside from the central-tendency features), accelerations, fetal_movement, severe_decelerations, mean_long_var, histogram_zeroes, and histogram_tendency.
Some modeling methodologies require datasets that have no missing values and are free of multicollinearity, outliers, and highly influential leverage points. Since the dataset has no null values, imputation is not necessary. The remaining issues are tackled in the following preprocessing steps.
Examining the skewness of each feature allows for a better understanding of the dataset at hand and can help to more easily identify features affected by outliers. Below are skewness statistics for each feature, with negative values reflecting left skewness and positive values reflecting right skewness. Larger values are associated with greater levels of skewness.
Two features show large skew greater than \(\pm\) 5: severe_decelerations at approximately 17.33; and fetal_movement at approximately 7.80. Three additional features show skew greater than \(\pm\) 3: prolonged_decelerations at approximately 4.32; histogram_zeroes at approximately 3.91; and histogram_variance at approximately 3.22. All positive, these values reflect the distributions depicted during EDA. They could respond favorably to transformation prior to modeling, though the Box-Cox transformation (results not shown) only negligibly affects the skewness.
| Skewness | |
|---|---|
| baseline_value | 0.0202835 |
| accelerations | 1.2026931 |
| fetal_movement | 7.8004579 |
| uterine_contractions | 0.1590898 |
| light_decelerations | 1.7160127 |
| severe_decelerations | 17.3289771 |
| prolonged_decelerations | 4.3178655 |
| abnormal_short_var | -0.0118119 |
| mean_short_var | 1.6550013 |
| perc_time_long_var | 2.1919788 |
| mean_long_var | 1.3301189 |
| histogram_width | 0.3137915 |
| histogram_min | 0.1156207 |
| histogram_max | 0.5770473 |
| histogram_peaks | 0.8916264 |
| histogram_zeroes | 3.9147572 |
| histogram_mode | -0.9937740 |
| histogram_mean | -0.6501009 |
| histogram_median | -0.4777393 |
| histogram_variance | 3.2154316 |
| histogram_tendency | -0.3111925 |
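These statistics can be reproduced with, for example, e1071's skewness() function (a sketch; exact values may differ slightly depending on the skewness estimator used):

skew <- sapply(dplyr::select(df, -fetal_health), e1071::skewness)
sort(skew, decreasing = TRUE)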
Outliers are defined as observations lying more than 1.5 × IQR below the 25th percentile or above the 75th percentile, where the IQR (interquartile range) is the difference between the 75th and 25th percentiles of the feature values. Outliers appear as dots on box-and-whisker plots.
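For a single feature, the fences can be computed directly; a sketch using histogram_variance as an illustrative example:

q <- quantile(df$histogram_variance, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
sum(df$histogram_variance < lower | df$histogram_variance > upper)  # outlier count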
A closer examination of the severe_decelerations variable shows that it takes only two values: 0 and 0.001. Only 7 records in the dataset have a non-zero value, but 6 of these are classified as Pathological fetal health. To retain this information, the feature is excluded from outlier capping.
Caps are placed on observations at their 5th and 95th percentiles for the remaining variables. Any data points that fall outside of this range are replaced with the capped values.
EDA found strong correlations within each class of fetal_health. Spearman's rank correlation is used to check correlations once more across all samples, regardless of class. This new plot mimics the one for the “Normal” class on account of the notable imbalance in sample counts. Here again, the central-tendency features are highly positively correlated with one another and with baseline_value, and there are strong correlations, both positive and negative, among the histogram shape features. light_decelerations and mean_short_var again show moderately strong correlations with several other features.
Additionally, three pairs of features have absolute pairwise correlations above the 0.90 threshold. For each pair, the feature with the higher mean absolute correlation is recommended for removal from the dataset; these three features are histogram_min, histogram_mean, and histogram_median. They are removed prior to modeling.
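The rule described (drop the feature of each highly correlated pair with the larger mean absolute correlation) matches caret's findCorrelation(); a sketch of the check:

cor_mat <- cor(dplyr::select(df, -fetal_health), method = 'spearman')
caret::findCorrelation(cor_mat, cutoff = 0.90, names = TRUE)
# expected: 'histogram_min', 'histogram_mean', 'histogram_median'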
Normalization ensures that all variables are on the same scale, which is particularly important for understanding relationships between the predictors and the response. Here, min-max normalization rescales each feature to the interval [0, 1]: \(x' = \frac{x - \min(x)}{\max(x) - \min(x)}\).
Lastly, the dataset is split approximately 70/30 into a training set and a test set; the latter is held out for validation of the final models. The resulting training and test sets have 1,475 and 651 rows, respectively.
Ordinal Logistic Regression (OLR) is a classification method for problems whose outcome has more than two ordered levels. Since the outcome variable in this project, fetal_health, is ordered in severity, this method was chosen over a standard multinomial logistic regression, which assumes no ordering among the classes.
It is important that the fetal_health outcome variable be explicitly defined as ordinal; skipping this step can render the analysis meaningless. The fetal_health factor in this dataset is ordered 1 (Normal) < 2 (Suspect) < 3 (Pathological).
A baseline model that includes all features is developed. Specifying Hess=TRUE provides the observed information matrix from optimization. Similar to a standard logistic regression, the output of OLR includes the coefficients, standard errors, intercepts, and metrics (Residual Deviance and AIC). It also includes t-values, which need to be converted to p-values to identify features for model inclusion.
A t-value is a model coefficient divided by its standard error (an estimate of the coefficient's sampling variability). The larger the absolute t-value, the greater the evidence against the null hypothesis that the coefficient is zero, and ultimately the more likely the feature should be included in the model. Because p-values are used to identify features for inclusion, the t-values are converted to p-values. A summary of the initial model features, including coefficients and p-values, is shown in the table below.
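The conversion uses the standard normal distribution, \(p = 2\,\Phi(-\lvert t \rvert)\); in R (mirroring the appendix code):

ctable <- coef(summary(olr_model))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
ctable <- cbind(ctable, "p value" = p)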
## Value Std. Error t value p value
## baseline_value 1.618710479 0.7786186 2.078951791 3.762178e-02
## accelerations -8.649541992 1.2046913 -7.179882398 6.977141e-13
## fetal_movement -0.732117329 0.4437418 -1.649872326 9.896905e-02
## uterine_contractions -2.229580195 0.3613449 -6.170226919 6.819206e-10
## light_decelerations -1.529016490 0.5987776 -2.553563206 1.066269e-02
## prolonged_decelerations 5.394271702 0.5278990 10.218378501 1.640622e-24
## abnormal_short_var 4.329746225 0.5475717 7.907177143 2.632908e-15
## mean_short_var -1.371684261 0.7715743 -1.777773287 7.544110e-02
## perc_time_long_var 1.057329838 0.3870188 2.731985790 6.295386e-03
## mean_long_var -1.064813891 0.7708739 -1.381307556 1.671844e-01
## histogram_width -1.561222257 0.8626423 -1.809814146 7.032461e-02
## histogram_max 3.468279738 0.8463747 4.097806260 4.170841e-05
## histogram_peaks 0.005320765 0.5683204 0.009362264 9.925301e-01
## histogram_zeroes -0.223109933 0.4012522 -0.556034196 5.781875e-01
## histogram_mode -2.624937244 0.9641524 -2.722533617 6.478344e-03
## histogram_variance 2.664573332 0.6439059 4.138140994 3.501312e-05
## histogram_tendency 0.715553212 0.4434007 1.613784418 1.065742e-01
## severe_decelerations 4.055847726 1.2708276 3.191501233 1.415355e-03
## 1|2 2.689643797 0.7691247 3.497018935 4.704884e-04
## 2|3 5.530234882 0.7973262 6.935975401 4.034278e-12
The last two rows in the output are the intercepts (or cutpoints) of the OLR. They represent where the latent outcome is cut to create the three groups of fetal_health. The p-values make it apparent that many features (like histogram_peaks and histogram_zeroes) are not significant enough to warrant inclusion in the final model.
Backwards elimination using a p-value threshold of 0.05 results in a final model that excludes the following features: histogram_peaks, histogram_zeroes, mean_long_var, mean_short_var, and histogram_tendency. Confidence intervals and proportional odds ratios are computed for the final OLR model.
## OR 2.5 % 97.5 %
## baseline_value 4.610438e+00 1.029905e+00 20.45818246
## accelerations 2.533684e-04 2.255512e-05 0.00210013
## fetal_movement 4.096692e-01 1.754595e-01 0.93349482
## uterine_contractions 9.836939e-02 4.889797e-02 0.19439477
## light_decelerations 2.234842e-01 8.258704e-02 0.57999024
## prolonged_decelerations 2.189972e+02 8.378651e+01 610.45753520
## abnormal_short_var 1.160374e+02 4.565958e+01 312.07724146
## perc_time_long_var 4.298857e+00 2.303978e+00 8.07435726
## histogram_max 1.457852e+01 4.246393e+00 51.15954738
## histogram_mode 1.656304e-01 3.012074e-02 0.92963881
## histogram_variance 1.512956e+01 4.469471e+00 52.02717777
## severe_decelerations 3.933170e+01 4.532012e+00 879.48539675
## histogram_width 2.104377e-01 6.949437e-02 0.62313623
The proportional odds ratios can be interpreted similarly to odds ratios from binary logistic regression: for a one-unit increase in variable XXX, the odds of fetal health moving from Normal toward Suspect or Pathological are multiplied by YYY, holding the other variables in the model constant.
It is important to note that the progression of the fetal_health variable from 1 (Normal) to 2 (Suspect) to 3 (Pathological) represents an overall decrease in the well-being of the child. This distinction is vital for proper interpretation of the proportional odds ratios.
Some key things to note:

- prolonged_decelerations and accelerations have proportional odds ratios of approximately \(218.9972\) and \(0.00025\), respectively. For a one-unit increase in prolonged_decelerations, the odds of moving from Normal toward Suspect or Pathological are roughly 219 times greater; for a one-unit increase in accelerations, those odds are multiplied by 0.00025. In other words, more prolonged decelerations signal worse fetal health, while more accelerations signal better fetal health.
- Other features with odds ratios above 1 (associated with worse fetal health) are abnormal_short_var, histogram_max, histogram_variance, severe_decelerations, baseline_value, and perc_time_long_var.
- Features with odds ratios below 1 (associated with better fetal health) are fetal_movement, uterine_contractions, light_decelerations, histogram_mode, and histogram_width.

As a gut check, these results make intuitive sense. For example, one would expect more fetal movement to indicate a healthier baby and more severe decelerations to indicate an unhealthier one.
Predictions are made on the test set to evaluate model performance. Final performance is included in the summary below.
## Confusion Matrix and Statistics
##
##
## olr_preds 1 2 3
## 1 482 29 4
## 2 25 48 23
## 3 1 11 28
##
## Overall Statistics
##
## Accuracy : 0.8571
## 95% CI : (0.8279, 0.8831)
## No Information Rate : 0.7803
## P-Value [Acc > NIR] : 4.52e-07
##
## Kappa : 0.6005
##
## Mcnemar's Test P-Value : 0.09655
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9488 0.54545 0.50909
## Specificity 0.7692 0.91474 0.97987
## Pos Pred Value 0.9359 0.50000 0.70000
## Neg Pred Value 0.8088 0.92793 0.95581
## Prevalence 0.7803 0.13518 0.08449
## Detection Rate 0.7404 0.07373 0.04301
## Detection Prevalence 0.7911 0.14747 0.06144
## Balanced Accuracy 0.8590 0.73010 0.74448
The overall accuracy of the model is 0.8571429.
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for classification or regression. A kernel function transforms the data, and based on these transformations an optimal boundary is found between the possible outputs; the algorithm then separates the data according to the defined labels or outputs.
SVM can be run using either the e1071 package or the caret package. In e1071, the available kernels are linear, radial, polynomial, and sigmoid, specified via the kernel argument of svm(); in caret, the corresponding models are svmLinear (linear), svmRadial (non-linear), and svmPoly (non-linear), specified via the method argument of train(). For the purpose of this project, svmLinear is used.
Before training the model, a trainControl() object is defined. It governs the computational details of resampling for caret's train() function, which trains the data on the chosen algorithm.
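A sketch of the control object, assuming the same repeated ten-fold cross-validation described for the other models (the exact arguments are a reconstruction):

trctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)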
A base SVM model is created for comparison purposes. Features are centered and scaled. The tuneLength parameter tells caret how many candidate values of the main tuning parameter to evaluate. Predictions are made on the test set to evaluate model performance.
Overall accuracy of the base model on the test set is 0.8817204.
To determine the ideal configuration for an optimized model, the tuneGrid parameter is used with a series of pre-defined Cost values between 0 and 5. Results are visualized in a plot of performance for each of the selected grid values.
The final value used for the improved model is C = 0.25. Final predictions are made on the test set. Results are displayed in the table below.
## Confusion Matrix and Statistics
##
##
## svm_final_pred 1 2 3
## 1 482 22 7
## 2 23 62 17
## 3 3 4 31
##
## Overall Statistics
##
## Accuracy : 0.8833
## 95% CI : (0.8561, 0.9069)
## No Information Rate : 0.7803
## P-Value [Acc > NIR] : 7.642e-12
##
## Kappa : 0.6769
##
## Mcnemar's Test P-Value : 0.02159
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9488 0.70455 0.56364
## Specificity 0.7972 0.92895 0.98826
## Pos Pred Value 0.9432 0.60784 0.81579
## Neg Pred Value 0.8143 0.95264 0.96085
## Prevalence 0.7803 0.13518 0.08449
## Detection Rate 0.7404 0.09524 0.04762
## Detection Prevalence 0.7849 0.15668 0.05837
## Balanced Accuracy 0.8730 0.81675 0.77595
Overall accuracy of the improved SVM model on the test set is 0.8832565.
svmLinear is chosen for this project because a linear decision boundary suits this dataset. A grid search identified the ideal cost value for the function, improving model performance slightly from 0.8817204 to 0.8832565.
The gradient boosting machine (GBM) algorithm is a tree-based ensemble method. It builds an ensemble of weak learners (typically shallow decision trees) in which each tree learns from and improves on the trees before it. This sequential dependence across trees differs from the independently grown trees of the random forest algorithm. GBMs have shown that a large number of weak learners can, in aggregate, produce high predictive performance.
GBMs offer many hyperparameters for tuning as well as a choice of weak learner. This analysis uses decision trees and, for hyperparameters, focuses on the number of trees (the total count of trees fit by the algorithm), tree interaction depth (the number of splits in each tree), and shrinkage (the rate at which the algorithm descends the gradient). Per cursory research, these three appear to be the most common and the most influential on model performance; a fourth, the minimum number of observations per terminal node, is left at its default of ten.
An initial GBM model serves as a baseline. It uses the default value for shrinkage (0.1) and compares tree counts of 50, 100, and 150 against interaction depths of one, two, and three. Preprocessing consists of centering and scaling, and ten-fold cross-validation is repeated three times during training. Predictions are made on the test set to evaluate model performance.
Per the largest value of accuracy, the optimal model features 100 trees, a depth of two, shrinkage of 0.1, and a minimum of ten samples per terminal node of each tree. Its accuracy and Kappa on the training set are approximately 0.94 and 0.84, respectively. Regarding feature importance, abnormal_short_var leads with a scaled importance of 100, followed by perc_time_long_var (approximately 71.18) and mean_short_var (approximately 51.78).
Overall accuracy of the base model on the test set is 0.93702.
GBM tuning starts from a grid of hyperparameter values. The candidate numbers of trees (100, 500, 800, and 1000) are raised relative to the baseline to accommodate the lower candidate shrinkage rates (0.01 and 0.05), and tree depth is expanded to a maximum of seven. There are 24 possible combinations across the three hyperparameters, as sketched below.
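A sketch of that grid (2 shrinkage rates × 4 tree counts × 3 depths = 24 combinations; the depth values of 2, 5, and 7 follow the grid in the appendix, and the minimum node size is held at 10):

gbm_grid <- expand.grid(n.trees = c(100, 500, 800, 1000),
                        interaction.depth = c(2, 5, 7),
                        shrinkage = c(0.01, 0.05),
                        n.minobsinnode = 10)
nrow(gbm_grid)  # 24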
A second GBM model adopts the tuning grid and maintains the same preprocessing (centering and scaling) and cross-validation (ten-fold, repeated three times) approaches. Training time is substantial.
Following grid search, and per accuracy, the optimal model features 800 trees, a depth of seven, and shrinkage of 0.05. Its training set accuracy and Kappa are approximately 0.95 and 0.85, respectively, which are both slight improvements over the baseline model. Feature importances are examined for the optimized model.
Regarding feature importance, abnormal_short_var again tops the list with a scaled importance of 100; mean_short_var follows at roughly 70.
It is interesting to note that the most important features in the model do not completely align with the findings from the ordinal logistic regression.
Particularly interesting is the severe_decelerations feature. In the OLR model, its proportional odds ratio was approximately 39, indicating that for a one-unit increase, the odds of moving from Normal toward Suspect or Pathological are 39 times greater. However, in the optimized GBM model, this feature does not even appear in the top ten!
Final predictions are made on the optimized GBM model. Results are included in the table below.
## Confusion Matrix and Statistics
##
##
## gbm2_pred 1 2 3
## 1 499 16 2
## 2 6 67 9
## 3 3 5 44
##
## Overall Statistics
##
## Accuracy : 0.937
## 95% CI : (0.9155, 0.9544)
## No Information Rate : 0.7803
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8233
##
## Mcnemar's Test P-Value : 0.1172
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9823 0.7614 0.80000
## Specificity 0.8741 0.9734 0.98658
## Pos Pred Value 0.9652 0.8171 0.84615
## Neg Pred Value 0.9328 0.9631 0.98164
## Prevalence 0.7803 0.1352 0.08449
## Detection Rate 0.7665 0.1029 0.06759
## Detection Prevalence 0.7942 0.1260 0.07988
## Balanced Accuracy 0.9282 0.8674 0.89329
The optimal model predicts well on the test set, with an accuracy of approximately 0.937 and a Kappa of approximately 0.82. Unsurprisingly, the model accuracy exceeds the no-information rate of approximately 0.78, the proportion of samples representing the “Normal” class of fetal_health. The balanced accuracy values of the three classes range from approximately 0.87 for “Suspect” to approximately 0.89 for “Pathological” to approximately 0.93 for “Normal”.
Overall, GBM performs well in predicting fetal_health for samples in this dataset, with a test set accuracy of approximately 0.94. Tuning hyperparameters, including the number of trees, tree depth, and shrinkage, results in a slight performance improvement over the baseline model. The optimal hyperparameter values are 800 trees, a depth of seven, and shrinkage of 0.05.
Random forest is an ensemble of individual decision trees. Each tree casts a class prediction, and the class with the most votes becomes the prediction of the forest as a whole.
“The wisdom of crowds” is the fundamental concept behind random forest: a group of trees classifies the data together, absorbing individual trees' errors to arrive at the best accuracy. It is important that the trees be relatively uncorrelated so that the ensemble is not dominated by any one tree's mistakes.
Random forest is used to predict whether fetal_health is 1 (Normal), 2 (Suspect), or 3 (Pathological). Because highly correlated features reduce the diversity among trees, the most correlated features were already removed as part of data preparation.
First, the model is run with default values. It uses the same trainControl as the previous models (cross-validation with 10 resampling iterations and 3 repeats) together with caret's default grid search, and accuracy as the optimization metric, since achieving high accuracy is the primary goal.
The base model gives an accuracy of 0.9262673. The balanced accuracies across the classes are approximately 0.9097 for Normal, 0.8503 for Suspect, and 0.9041 for Pathological.
Next, the model is tuned to see whether a higher accuracy can be obtained. importance = TRUE is set so that variable importance can be extracted. Three parameters matter most when tuning an RF model (a sketch of the maxnodes search follows this list):

- mtry: the number of features considered at each split. A vector of values from 2 to 10 is created and evaluated via tuneGrid.
- maxnodes: the maximum number of terminal nodes (leaves). A loop evaluates values of maxnodes from 5 to 25.
- ntree: the number of trees. The candidate values are 250, 300, 350, 400, 450, 500, 550, 600, 800, and 1000.
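A sketch of the maxnodes search described above (a reconstruction; the original loop is not preserved in the appendix, and \(m_{try}\) is fixed at 4 per the tuning result below):

stores <- list()
for (mn in seq(5, 25, by = 5)) {
  set.seed(525)
  rf_mn <- train(fetal_health ~ ., data = train_set,
                 method = 'rf', metric = 'Accuracy',
                 tuneGrid = expand.grid(.mtry = 4),
                 trControl = trControl,
                 importance = TRUE,
                 maxnodes = mn, ntree = 500)
  stores[[as.character(mn)]] <- max(rf_mn$results$Accuracy)
}
stores  # cross-validated accuracy at each maxnodes value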
After obtaining the optimal value for each parameter, the tuned model consists of an \(m_{try}\) of 4, maxnodes of 25, and ntree of 800. (The default model had an \(m_{try}\) of 10.)
Evaluation showed that this tuned model reached an accuracy of only approximately 90%, whereas the base model reached approximately 92.63%. It was therefore decided to remove the node parameters while keeping the tuneGrid for \(m_{try}\), with ntree at its default of 500 (800 was also tried, with the same outcome as the base model), to see whether accuracy would increase.
For the optimized RF model, then, the node parameters are removed, the tuneGrid is retained, and ntree is left at its default of 500.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 500 22 3
## 2 6 62 8
## 3 2 4 44
##
## Overall Statistics
##
## Accuracy : 0.9309
## 95% CI : (0.9086, 0.9491)
## No Information Rate : 0.7803
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8016
##
## Mcnemar's Test P-Value : 0.01361
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9843 0.70455 0.80000
## Specificity 0.8252 0.97513 0.98993
## Pos Pred Value 0.9524 0.81579 0.88000
## Neg Pred Value 0.9365 0.95478 0.98170
## Prevalence 0.7803 0.13518 0.08449
## Detection Rate 0.7680 0.09524 0.06759
## Detection Prevalence 0.8065 0.11674 0.07680
## Balanced Accuracy 0.9047 0.83984 0.89497
This model gives an accuracy of 0.9308756, a small but real increase over the base model. The balanced accuracies are approximately 90.5% for Normal, 84.0% for Suspect, and 89.5% for Pathological; the model predicts Suspect more poorly than the other two classes, which it predicts about equally well. The optimal \(m_{try}\) for this model is 5, meaning that five features are considered at each split.
varImp() is used to inspect the important features of the random forest and compare them with the ordinal logistic regression model. Looking at the importance levels, one can see that prolonged_decelerations is an important feature for the Normal and Pathological classes but not for the Suspect class, where abnormal_short_var dominates. It is interesting that accelerations is not considered important, in contrast with the OLR model.
Removing the node parameters produced the best-performing configuration, though the gain over the base model is slight (from 0.9262673 to 0.9308756). Although one may chase the most optimized model by searching over every parameter, that search does not always yield the best model for a given dataset. The best configuration for this model was the \(m_{try}\) grid with the default ntree of 500. Given the reasonably high accuracy, random forest is a viable method for predicting fetal_health.
Predicting fetal health is important in determining the appropriate course of action for a doctor to take. Successful prediction of suspect and pathological fetal health status can aid in the development of intervention timelines.
Although model accuracy is often used to compare models, we are primarily concerned with properly identifying less healthy babies, so the selected metrics must gauge performance at the class level. This is particularly important in this project because the classes are drastically imbalanced.
In addition to a raw accuracy comparison, the following class-level metrics are used to determine the best model: prevalence, detection rate, and sensitivity.
A comparison of the model accuracy across the 4 methods is shown below:
GBM performs the best, followed by Random Forest, then SVM, and finally OLR.
All models have the same class prevalence because the same training and test sets were used. We know from the EDA that the dataset is imbalanced. Visualizing the prevalence gives us a better idea of the proportion of each class in the test set:
The prevalence can be used in combination with the detection rate to understand what portion of each class the models correctly identified. In the following figure, predictions are broken out by prevalence at the class level per model. The blue area represents the portion of the class identified correctly; the pink area represents the portion missed.
Out of the four models, OLR performs the worst on predicting Suspect and Pathological fetal health statuses.
A comparison of model sensitivity for each of the three classes is shown below. It is important that the sensitivity for classes 2 (Suspect) and 3 (Pathological) be high, as these are the statuses that are more important to predict correctly.
GBM performs best across the board, while OLR performs the worst.
Overall, the GBM model has the best performance, with 94% accuracy and recall of 98%, 76%, and 80% for the Normal, Suspect, and Pathological statuses, respectively. Regarding future work, over- or under-sampling would create a more balanced dataset, which would likely improve the overall metrics, especially for the ordinal logistic regression model. Additionally, engineered combination features could yield interesting insights into the data and potentially increase overall recall. Finally, this work could be expanded to include PCA for feature consolidation and reduction.
Fetal Health Classification dataset. Kaggle. https://www.kaggle.com/andrewmvd/fetal-health-classification
Ayres de Campos, D., et al. (2000). SisPorto 2.0: A program for automated analysis of cardiotocograms. J Matern Fetal Med 9(5):311–318.
Boehmke, B. (2018). Gradient boosting machines. UC Business Analytics R Programming Guide. Accessed May 15, 2021 from http://uc-r.github.io/gbm_regression.
Kumar Das, S. (2020). Random Forest Classification explained in detail and developed in R. TechTarget, Inc. Accessed May 2021 from https://www.datasciencecentral.com/profiles/blogs/random-forest-classification-explained-in-detail-and-developed-in.
Kuhn, M. (2019). The caret package. Accessed May 15, 2021, from https://topepo.github.io/caret/.
R Random Forest Tutorial with Example. Guru99. Accessed May 2021 from https://www.guru99.com/r-random-forest-tutorial.html.
Rashkiany, H. (2020). Support Vector Machine In R: Using SVM To Predict Heart Diseases. Edureka. Accessed May 2021 from www.edureka.co/blog/support-vector-machine-in-r/.
The code chunks below reproduce, in order, the R code called during the analysis. They are included in this appendix for review and comment.
# load the dataset and shorten the longer variable names
df <- read_csv('https://raw.githubusercontent.com/chrosemo/DATA622-Group3-Final/main/fetal_health.csv')
df <- df %>% rename(baseline_value = 'baseline value',
prolonged_decelerations = 'prolongued_decelerations',
abnormal_short_var = 'abnormal_short_term_variability',
mean_short_var = 'mean_value_of_short_term_variability',
perc_time_long_var = 'percentage_of_time_with_abnormal_long_term_variability',
mean_long_var = 'mean_value_of_long_term_variability',
histogram_peaks = 'histogram_number_of_peaks',
histogram_zeroes = 'histogram_number_of_zeroes') %>%
mutate(fetal_health = as.factor(fetal_health))
# order the dependent variable
df$fetal_health <- factor(df$fetal_health, levels = c("1", "2", "3"), ordered = TRUE)

# boxplots of the non-histogram features by class of fetal_health
df %>%
  dplyr::select(baseline_value:mean_long_var, fetal_health) %>%
  gather(key = 'variable', value = 'value', -fetal_health) %>%
  ggplot(aes(x = '', y = value, fill = fetal_health)) +
  facet_wrap(~ variable, scales = 'free') +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = NULL) +
  theme(axis.text.x = element_text(angle = 45)) +
  scale_fill_manual(name = 'Fetal health',
                    guide = guide_legend(reverse = TRUE),
                    labels = c('Normal (1)', 'Suspect (2)', 'Pathological (3)'),
                    values = c('forestgreen', 'gold', 'red'))

# boxplots of the histogram features by class of fetal_health
df %>%
  dplyr::select(histogram_width:histogram_tendency, fetal_health) %>%
  gather(key = 'variable', value = 'value', -fetal_health) %>%
  ggplot(aes(x = '', y = value, fill = fetal_health)) +
  facet_wrap(~ variable, scales = 'free') +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = NULL) +
  scale_fill_manual(name = 'Fetal health',
                    guide = guide_legend(reverse = TRUE),
                    labels = c('Normal (1)', 'Suspect (2)', 'Pathological (3)'),
                    values = c('forestgreen', 'gold', 'red'))

# Spearman correlation heat maps, one per class of fetal_health
corrplot(cor(df %>%
               filter(fetal_health == 1) %>%
               dplyr::select(-fetal_health),
             method = 'spearman'))
corrplot(cor(df %>%
               filter(fetal_health == 2) %>%
               dplyr::select(-fetal_health),
             method = 'spearman'))
corrplot(cor(df %>%
               filter(fetal_health == 3) %>%
               dplyr::select(-fetal_health),
             method = 'spearman'))

# outlier capping: identify 5th and 95th percentiles per column
# (reconstruction: per the text, the target and severe_decelerations
#  are excluded from capping)
no_target <- df %>% dplyr::select(-fetal_health, -severe_decelerations)
low5 <- apply(no_target, 2, quantile, prob = 0.05)
up95 <- apply(no_target, 2, quantile, prob = 0.95)
# for each column, replace outliers with upper 95% value or lower 5% value
for (i in seq_along(no_target)) {
no_target[,i][no_target[,i] < low5[i]] <- low5[i]
no_target[,i][no_target[,i] > up95[i]] <- up95[i]
}
# adding back in severe decelerations and fetal health
df <- cbind(no_target,
            severe_decelerations = df$severe_decelerations,
            fetal_health = df$fetal_health)

# drop the three highly correlated histogram features flagged earlier
df_clean <- as.data.frame(df %>%
  dplyr::select(-c('histogram_min', 'histogram_mean', 'histogram_median')))

# min/max normalization
min_max_norm <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
# normalized data
df_normalized <- as.data.frame(lapply(df_clean %>% dplyr::select(-fetal_health), min_max_norm))
df_clean <- cbind(df_normalized, fetal_health = df_clean$fetal_health)

# 70/30 train/test split
set.seed(525)
which_train <- sample(x = c(TRUE, FALSE),
size = nrow(df_clean),
replace = TRUE,
prob = c(0.7, 0.3))
train_set <- df_clean[which_train, ]
test_set <- df_clean[!which_train, ]

# ordinal logistic regression on all features (MASS::polr)
train_set2 <- train_set
test_set2 <- test_set
olr_model <- polr(fetal_health ~ ., data = train_set2, Hess = TRUE)

# convert t-values to two-sided p-values
ctable <- coef(summary(olr_model))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
ctable <- cbind(ctable, "p value" = p)
ctable

# final OLR model after backwards elimination (p < 0.05)
olr_model_final <- polr(fetal_health ~ baseline_value + accelerations + fetal_movement +
uterine_contractions + light_decelerations + prolonged_decelerations +
abnormal_short_var + perc_time_long_var +
histogram_max + histogram_mode + histogram_variance +
severe_decelerations + histogram_width ,
data = train_set2, Hess=TRUE)
olr_ci <- confint(olr_model_final)
olr_or <- coef(olr_model_final)
# proportional odds ratios with 95% confidence intervals
exp(cbind(OR = olr_or, olr_ci))

# OLR predictions and evaluation on the test set
olr_preds <- predict(olr_model_final, newdata = test_set2)
olr_conf <- confusionMatrix(table(olr_preds, test_set2$fetal_health))
olr_accuracy <- olr_conf$overall['Accuracy']
olr_final_metrics <- as.data.frame(olr_conf$byClass[,c(1,8,9)]) %>%
mutate(Model = 'OLR',
Class = row_number())
olr_conf

# shared resampling scheme: repeated ten-fold cross-validation
# (reconstruction: the original chunk defining trctrl was not preserved)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# base SVM model with a linear kernel
set.seed(525)
svm_Linear <- train(fetal_health ~., data = train_set,
method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
svm_base_pred <- predict(svm_Linear, newdata = test_set)
svm_conf1 <- confusionMatrix(table(svm_base_pred, test_set$fetal_health))
svm_acc1 <- svm_conf1$overall['Accuracy']

# SVM grid search over cost values
set.seed(525)
grid <- expand.grid(C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5))
svm_Linear_Grid <- train(fetal_health ~., data = train_set, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneGrid = grid,
tuneLength = 10)
plot(svm_Linear_Grid,
main = "Accuracy by Cost Value",
     xlab = 'Cost Value')

# final SVM predictions on the test set
set.seed(525)
svm_final_pred <- predict(svm_Linear_Grid, newdata = test_set)
svm_conf2 <- confusionMatrix(table(svm_final_pred, test_set$fetal_health))
svm_acc2 <- svm_conf2$overall['Accuracy']
svm_final_metrics <- as.data.frame(svm_conf2$byClass[,c(1,8,9)]) %>%
mutate(Model = 'SVM',
Class = row_number())
svm_conf2

# baseline GBM model with caret defaults
set.seed(525)
gbm1 <- train(x = train_set[,-19],
y = train_set[,19],
method = "gbm",
preProcess = c('center', 'scale'),
trControl = trctrl,
verbose = FALSE)
gbm1_pred <- predict(gbm1, newdata = test_set)
gbm_conf1 <- confusionMatrix(table(gbm1_pred, test_set$fetal_health))
gbm_acc1 <- gbm_conf1$overall['Accuracy']

# GBM grid search (commented out due to substantial training time;
# the optimal values it produced are used directly below)
set.seed(525)
#grid <- expand.grid(.interaction.depth = c(2, 5, 7),
# .n.trees = c(100, 500, 800, 1000),
# .shrinkage = c(0.01, 0.05),
# .n.minobsinnode = 10)
#gbm2 <- train(x = train_set[,-19],
# y = train_set[,19],
# method = "gbm",
# tuneGrid = grid,
# preProcess = c('center', 'scale'),
# trControl = trctrl,
# verbose = FALSE)

# refit GBM with the optimal hyperparameters from the grid search
set.seed(525)
grid_opt <- expand.grid(.interaction.depth = 7,
.n.trees = 800,
.shrinkage = 0.05,
.n.minobsinnode = 10)
gbm_opt <- train(x = train_set[,-19],
y = train_set[,19],
method = "gbm",
tuneGrid = grid_opt,
preProcess = c('center', 'scale'),
trControl = trctrl,
verbose = FALSE)
vip(gbm_opt)

# GBM predictions and evaluation on the test set
gbm2_pred <- predict(gbm_opt, newdata = test_set)
gbm_conf2 <- confusionMatrix(table(gbm2_pred, test_set$fetal_health))
gbm_acc2 <- gbm_conf2$overall['Accuracy']
gbm_final_metrics <- as.data.frame(gbm_conf2$byClass[,c(1,8,9)]) %>%
mutate(Model = 'GBM',
Class = row_number())
gbm_conf2

# base random forest model
set.seed(525)
# Define the control
trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# Run the model
rf_default <- train(fetal_health~.,
data = train_set,
method = "rf",
metric = "Accuracy",
trControl = trControl)
## print(rf_default)

prediction_rf1 <- predict(rf_default, test_set)
rf_conf1 <- confusionMatrix(prediction_rf1, test_set$fetal_health)
rf_acc1 <- rf_conf1$overall['Accuracy']
## rf_conf1

# tuned random forest: mtry grid with default ntree of 500
set.seed(525)
tuneGrid = expand.grid(.mtry=c(2:15))
op_rf <- train(fetal_health~.,
train_set,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
ntree = 500)
## print(op_rf)

prediction_rf2 <- predict(op_rf, test_set)
rf_conf2 <- confusionMatrix(prediction_rf2, test_set$fetal_health)
rf_acc2 <- rf_conf2$overall['Accuracy']
rf_final_metrics <- as.data.frame(rf_conf2$byClass[,c(1,8,9)]) %>%
  mutate(Model = 'RF',
         Class = row_number())

# combine per-class metrics from all four models (needed by the plots below)
all_metrics <- rbind(olr_final_metrics, svm_final_metrics, gbm_final_metrics, rf_final_metrics)

# class prevalence in the test set
prevalence <- all_metrics %>%
dplyr::select(Prevalence, Class) %>%
distinct()
ggplot(prevalence, aes(x = Class, y = Prevalence, fill = Class)) +
geom_bar(stat="identity") +
ggtitle('Class Prevalence in the Test Set') +
geom_text(aes(label = round(Prevalence,2)) ,
colour = "white", size = 4,
vjust = 1.5, position = position_dodge(.9))

# detection rate vs. portion of each class missed, per model
portion_missed <- all_metrics %>%
mutate(`Portion Missed` = Prevalence - `Detection Rate`) %>%
dplyr::select(-Sensitivity, -Prevalence)
portion_missed <- gather(portion_missed, type, rate, c(`Detection Rate`, `Portion Missed`))
ggplot(portion_missed, aes(x = Class,
y = rate,
fill = factor(type,
levels=c("Portion Missed", "Detection Rate")))) +
geom_bar(position = "stack", stat = "identity") +
labs(fill = 'Type',
y = 'Percent Occurence',
title = 'Percent of Class Captured by each Model') +
facet_wrap(~ Model)

# overall accuracy comparison
all_accuracies <- as.data.frame(rbind(olr = olr_accuracy, svm = svm_acc2, gbm = gbm_acc2, rf=rf_acc2 ))
all_accuracies$model <- rownames(all_accuracies)
ggplot(all_accuracies, aes(x = model, y = Accuracy, fill = model)) +
geom_bar(stat="identity") +
ggtitle('Comparison of Overall Model Accuracy') +
geom_text(aes(label = round(Accuracy,2)) ,
colour = "white", size = 4,
vjust = 1.5, position = position_dodge(.9))

# sensitivity by class across models
ggplot(all_metrics %>% dplyr::select(-Prevalence, -`Detection Rate`),
aes(x = Class, y = Sensitivity, fill = Model)) +
geom_col(position = "dodge") +
ggtitle('Comparison of Sensitivity Across Classes') +
geom_text(aes(label = round(Sensitivity,2)) ,
colour = "white", size = 4,
vjust = 1.5, position = position_dodge(.9))