Introduction
- Multidimensional Scaling (MDS)
- Principal Component Analysis (PCA)
- Logistic Regression
Dataset and preprocessing
- Data Description
- Data cleaning
- Variables analysis
Multidimensional Scaling (MDS)
- Classical MDS
- Advanced MDS
- The optimal dimensions
- Draw variables in 3D
Principal Component Analysis (PCA)
- How to choose number of components
- Analysis of components
- Measuring the quality of PCA
- Visualisation of PCA
Logistic Regression
- Model data preparation: Training and Testing
- Compare Accuracy Score

1 Introduction

Why use MDS and PCA on fetal health data? The fetal health dataset contains 2126 records of features extracted from cardiotocogram (CTG) exams, each classified by three expert obstetricians into one of 3 classes: Normal, Suspect, or Pathological. The goal is to classify the health status of the fetus (fetal_health) from these feature values and to build a predictive model. First, however, I want to reduce the dimensionality of the data, because it has 22 columns (21 features plus the fetal_health label). I use MDS to visualize the relative positions of the variables, and then PCA to reduce the dimensions and obtain a simpler version of the dataset.

Why use Logistic Regression? Logistic regression is a generalized linear model that is often used in data mining, automatic disease diagnosis, economic forecasting and other fields. For example, it can be used to explore the risk factors that cause a disease and to predict the probability of the disease occurring given those risk factors.

I chose this topic out of interest in dimension reduction and prediction models. Classifying fetal health data can also help prevent fetal and maternal mortality, which makes the task meaningful in itself.

The fetal health data comes from Kaggle. The dataset contains 2126 rows and 22 columns: 21 features extracted from cardiotocogram exams, plus the class label (normal, suspect or pathological) assigned by three expert obstetricians. Dataset authors: Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318

1.1 Multidimensional Scaling (MDS)

Wikipedia defines MDS as follows: “Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise ‘distances’ among a set of n objects or individuals" into a configuration of n points mapped into an abstract Cartesian space.”

So MDS can be used to represent the variables in a small number of dimensions and obtain a 2D/3D visualization.

1.2 Principal Component Analysis (PCA)

Wikipedia defines PCA as follows: “PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.”

So PCA can capture the majority of the variance of the original dataset and produce a simpler version of the fetal health data.

1.3 Logistic Regression

Wikipedia defines logistic regression as follows: “the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables.”

The prediction formula of the model is as follows:
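A form consistent with the description below, assuming feature values $x_1, \dots, x_p$, weights $w_1, \dots, w_p$ and an intercept $b$ (notation introduced here for illustration, not taken from the original), would be:

$$
\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_p x_p + b,
\qquad
\text{class} =
\begin{cases}
+1 & \text{if } \hat{y} > 0 \\
-1 & \text{if } \hat{y} < 0
\end{cases}
$$

In logistic regression the weighted sum is passed through the logistic function, $P(\text{class} = +1 \mid x) = 1 / (1 + e^{-\hat{y}})$, so thresholding $\hat{y}$ at 0 is the same as thresholding the predicted probability at 0.5.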

Although this formula looks similar to linear regression, logistic regression is a classification algorithm, not a regression algorithm. Instead of returning the weighted sum of the features, we threshold the prediction at 0: if the function value is less than 0, we predict the class -1, and if it is greater than 0, we predict the class +1.

2 Dataset and preprocessing

2.1 Data Description

Import the data. The dataset is extracted from cardiotocogram (CTG) examinations, and each record is assigned to one of 3 categories in fetal_health (1-Normal, 2-Suspect, 3-Pathological).

setwd('D:/unsupervised_learning/dimension reduction')
data=read.csv('fetal_health.csv',sep=',',dec='.')
colnames(data)

##  [1] "baseline.value"                                        
##  [2] "accelerations"                                         
##  [3] "fetal_movement"                                        
##  [4] "uterine_contractions"                                  
##  [5] "light_decelerations"                                   
##  [6] "severe_decelerations"                                  
##  [7] "prolongued_decelerations"                              
##  [8] "abnormal_short_term_variability"                       
##  [9] "mean_value_of_short_term_variability"                  
## [10] "percentage_of_time_with_abnormal_long_term_variability"
## [11] "mean_value_of_long_term_variability"                   
## [12] "histogram_width"                                       
## [13] "histogram_min"                                         
## [14] "histogram_max"                                         
## [15] "histogram_number_of_peaks"                             
## [16] "histogram_number_of_zeroes"                            
## [17] "histogram_mode"                                        
## [18] "histogram_mean"                                        
## [19] "histogram_median"                                      
## [20] "histogram_variance"                                    
## [21] "histogram_tendency"                                    
## [22] "fetal_health"

head(data)

##   baseline.value accelerations fetal_movement uterine_contractions
## 1            120         0.000              0                0.000
## 2            132         0.006              0                0.006
## 3            133         0.003              0                0.008
## 4            134         0.003              0                0.008
## 5            132         0.007              0                0.008
## 6            134         0.001              0                0.010
##   light_decelerations severe_decelerations prolongued_decelerations
## 1               0.000                    0                    0.000
## 2               0.003                    0                    0.000
## 3               0.003                    0                    0.000
## 4               0.003                    0                    0.000
## 5               0.000                    0                    0.000
## 6               0.009                    0                    0.002
##   abnormal_short_term_variability mean_value_of_short_term_variability
## 1                              73                                  0.5
## 2                              17                                  2.1
## 3                              16                                  2.1
## 4                              16                                  2.4
## 5                              16                                  2.4
## 6                              26                                  5.9
##   percentage_of_time_with_abnormal_long_term_variability
## 1                                                     43
## 2                                                      0
## 3                                                      0
## 4                                                      0
## 5                                                      0
## 6                                                      0
##   mean_value_of_long_term_variability histogram_width histogram_min
## 1                                 2.4              64            62
## 2                                10.4             130            68
## 3                                13.4             130            68
## 4                                23.0             117            53
## 5                                19.9             117            53
## 6                                 0.0             150            50
##   histogram_max histogram_number_of_peaks histogram_number_of_zeroes
## 1           126                         2                          0
## 2           198                         6                          1
## 3           198                         5                          1
## 4           170                        11                          0
## 5           170                         9                          0
## 6           200                         5                          3
##   histogram_mode histogram_mean histogram_median histogram_variance
## 1            120            137              121                 73
## 2            141            136              140                 12
## 3            141            135              138                 13
## 4            137            134              137                 13
## 5            137            136              138                 11
## 6             76            107              107                170
##   histogram_tendency fetal_health
## 1                  1            2
## 2                  0            1
## 3                  0            1
## 4                  1            1
## 5                  1            1
## 6                  0            3

str(data)

## 'data.frame':    2126 obs. of  22 variables:
##  $ baseline.value                                        : num  120 132 133 134 132 134 134 122 122 122 ...
##  $ accelerations                                         : num  0 0.006 0.003 0.003 0.007 0.001 0.001 0 0 0 ...
##  $ fetal_movement                                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ uterine_contractions                                  : num  0 0.006 0.008 0.008 0.008 0.01 0.013 0 0.002 0.003 ...
##  $ light_decelerations                                   : num  0 0.003 0.003 0.003 0 0.009 0.008 0 0 0 ...
##  $ severe_decelerations                                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ prolongued_decelerations                              : num  0 0 0 0 0 0.002 0.003 0 0 0 ...
##  $ abnormal_short_term_variability                       : num  73 17 16 16 16 26 29 83 84 86 ...
##  $ mean_value_of_short_term_variability                  : num  0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
##  $ percentage_of_time_with_abnormal_long_term_variability: num  43 0 0 0 0 0 0 6 5 6 ...
##  $ mean_value_of_long_term_variability                   : num  2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
##  $ histogram_width                                       : num  64 130 130 117 117 150 150 68 68 68 ...
##  $ histogram_min                                         : num  62 68 68 53 53 50 50 62 62 62 ...
##  $ histogram_max                                         : num  126 198 198 170 170 200 200 130 130 130 ...
##  $ histogram_number_of_peaks                             : num  2 6 5 11 9 5 6 0 0 1 ...
##  $ histogram_number_of_zeroes                            : num  0 1 1 0 0 3 3 0 0 0 ...
##  $ histogram_mode                                        : num  120 141 141 137 137 76 71 122 122 122 ...
##  $ histogram_mean                                        : num  137 136 135 134 136 107 107 122 122 122 ...
##  $ histogram_median                                      : num  121 140 138 137 138 107 106 123 123 123 ...
##  $ histogram_variance                                    : num  73 12 13 13 11 170 215 3 3 1 ...
##  $ histogram_tendency                                    : num  1 0 0 1 1 0 0 1 1 1 ...
##  $ fetal_health                                          : num  2 1 1 1 1 3 3 3 3 3 ...

2.2 Data Cleaning

Look for rows with null or missing values:

data[!complete.cases(data),]

##  [1] baseline.value                                        
##  [2] accelerations                                         
##  [3] fetal_movement                                        
##  [4] uterine_contractions                                  
##  [5] light_decelerations                                   
##  [6] severe_decelerations                                  
##  [7] prolongued_decelerations                              
##  [8] abnormal_short_term_variability                       
##  [9] mean_value_of_short_term_variability                  
## [10] percentage_of_time_with_abnormal_long_term_variability
## [11] mean_value_of_long_term_variability                   
## [12] histogram_width                                       
## [13] histogram_min                                         
## [14] histogram_max                                         
## [15] histogram_number_of_peaks                             
## [16] histogram_number_of_zeroes                            
## [17] histogram_mode                                        
## [18] histogram_mean                                        
## [19] histogram_median                                      
## [20] histogram_variance                                    
## [21] histogram_tendency                                    
## [22] fetal_health                                          
## <0 rows> (or 0-length row.names)

There is no missing data.
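The complete.cases() check covers missing values only. Duplicated rows can be counted in a similar way with duplicated(). The sketch below uses a small made-up data frame (the `toy` object and its values are illustrative, not taken from the fetal data):

```r
# Toy data frame standing in for the fetal data (values are illustrative)
toy <- data.frame(
  baseline      = c(120, 132, 132, 134),
  accelerations = c(0.000, 0.006, 0.006, 0.003)
)

# duplicated() flags every row that repeats an earlier row
n_dup <- sum(duplicated(toy))
print(n_dup)

# Keep only the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]
print(nrow(toy_unique))
```

Whether duplicated CTG records should be dropped is a modelling decision; the sketch only shows how they could be detected.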

Next, we remove the fetal_health column, since the class label is not needed for MDS and PCA.

data_new=data[,-22]

2.3 Variables analysis

To get a better understanding of the variables, I use corrplot() to look at the correlations in the data.

From the figures, we can see that “histogram_mean”, “histogram_mode” and “histogram_median” are highly correlated with each other, and “baseline.value” is highly correlated with each of them. In addition, “histogram_width” is highly correlated with both “histogram_max” and “histogram_number_of_peaks”, and so on.
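Highly correlated pairs can also be listed numerically instead of being read off a plot. This is a minimal base-R sketch, assuming the built-in mtcars data as a stand-in for data_new and an arbitrary cutoff of 0.8:

```r
# Built-in mtcars data stands in for data_new (assumption for illustration)
r <- cor(mtcars)

# Look at each pair once: blank out the diagonal and the lower triangle
r[lower.tri(r, diag = TRUE)] <- NA

# Positions of pairs whose absolute correlation exceeds the arbitrary 0.8 cutoff
idx <- which(abs(r) > 0.8, arr.ind = TRUE)
high <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

# Strongest pairs first
print(high[order(-abs(high$r)), ])
```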

Next, I draw a density plot for each variable on the full dataset.

The density plots show quite clearly that the variables have very different distributions. The relationships between the variables are still not clear, however, so the next step is to visualize the variables through MDS.

2.4 Data Standardization

Feature standardization puts all variables on a common scale, which improves the performance of PCA and MDS. For example, in our dataset histogram_max takes values in the hundreds, while severe_decelerations is around 0.00025. So I use the scale() function to standardize the data.

data_new=scale(data_new)
head(data_new,2)

##      baseline.value accelerations fetal_movement uterine_contractions
## [1,]     -1.3519020    -0.8221949     -0.2031618           -1.4821159
## [2,]     -0.1324944     0.7299611     -0.2031618            0.5544962
##      light_decelerations severe_decelerations prolongued_decelerations
## [1,]          -0.6382874          -0.05746208               -0.2686911
## [2,]           0.3751547          -0.05746208               -0.2686911
##      abnormal_short_term_variability mean_value_of_short_term_variability
## [1,]                        1.512834                           -0.9428732
## [2,]                       -1.744341                            0.8686362
##      percentage_of_time_with_abnormal_long_term_variability
## [1,]                                              1.8021175
## [2,]                                             -0.5352354
##      mean_value_of_long_term_variability histogram_width histogram_min
## [1,]                          -1.0283184      -0.1654677    -1.0683107
## [2,]                           0.3930835       1.5287648    -0.8653352
##      histogram_max histogram_number_of_peaks histogram_number_of_zeroes
## [1,]     -2.119093                -0.7012319                 -0.4583360
## [2,]      1.893349                 0.6549828                  0.9579755
##      histogram_mode histogram_mean histogram_median histogram_variance
## [1,]     -1.0653632     0.15323366       -1.1813642          1.8701287
## [2,]      0.2165872     0.08910477        0.1320069         -0.2349429
##      histogram_tendency
## [1,]          1.1127182
## [2,]         -0.5244022
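What scale() does to each column can be checked by hand: it subtracts the column mean and divides by the sample standard deviation. A small sketch using the six histogram_max values shown by head(data); the resulting z-scores differ from the output above because only six rows are used here:

```r
# First six histogram_max values, copied from head(data)
x <- c(126, 198, 198, 170, 170, 200)

z_manual <- (x - mean(x)) / sd(x)   # subtract the mean, divide by the sample sd
z_scale  <- as.numeric(scale(x))    # scale() does the same, column by column

print(round(z_manual, 3))
print(all.equal(z_manual, z_scale))
```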

3 Multidimensional Scaling (MDS)

3.1 Classical MDS

First, I apply classical multidimensional scaling to the data matrix to see the distances between variables in a 2-dimensional plot.

dist.data=dist(t(data_new))
mds1=cmdscale(dist.data,k=2)
plot(mds1,xlim=c(-60,60))

plot(mds1,xlim=c(-60,60),cex=1,pch=21,bg='red')
text(mds1,labels=colnames(data_new),cex=0.8,adj=0.5)
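The same steps can be tried on any numeric table. The sketch below assumes the built-in mtcars data as a stand-in and adds one thing that the code above does not compute: the share of the eigenvalue mass captured by the two plotted dimensions, which gives a rough idea of how faithful the 2D picture is.

```r
# Built-in mtcars data stands in for the fetal data (assumption for illustration)
X <- scale(mtcars)

# dist() works on rows, so transpose to get distances between variables
d_vars <- dist(t(X))

# Classical (metric) MDS into 2 dimensions; eig = TRUE also returns the eigenvalues
fit <- cmdscale(d_vars, k = 2, eig = TRUE)
coords <- fit$points

# Share of the positive eigenvalue mass captured by the 2 plotted dimensions
pos_eig <- fit$eig[fit$eig > 0]
share_2d <- sum(pos_eig[1:2]) / sum(pos_eig)

print(round(coords, 2))
print(share_2d)
```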

3.2 Advanced MDS

There is also a more advanced MDS approach to visualizing the level of similarity between the fetal variables. First I generate the correlation matrix, then I convert the similarities to dissimilarities, which I later use in the MDS.
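The conversion step can be sketched as follows. The exact conversion behind the `dis` object used later is not shown in this report, so the rule 1 - r is an assumption, as are the mtcars stand-in data and the name `dis_sketch`. The smacof call is guarded so that the sketch also runs where that package is not installed.

```r
# Built-in mtcars data stands in for data_new (assumption for illustration)
r <- cor(scale(mtcars))

# Assumed conversion: dissimilarity = 1 - correlation, so r = 1 maps to 0
dis_sketch <- as.dist(1 - r)

# Optional: ordinal MDS on the dissimilarities if smacof is installed
if (requireNamespace("smacof", quietly = TRUE)) {
  fit <- smacof::mds(dis_sketch, ndim = 2, type = "ordinal")
  print(fit$stress)
}
```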

Stress per point (in %):

round(mds2$spp,  2)

##                                         baseline.value 
##                                                   1.78 
##                                          accelerations 
##                                                   6.33 
##                                         fetal_movement 
##                                                  10.35 
##                                   uterine_contractions 
##                                                  11.46 
##                                    light_decelerations 
##                                                   1.75 
##                                   severe_decelerations 
##                                                   6.21 
##                               prolongued_decelerations 
##                                                   6.16 
##                        abnormal_short_term_variability 
##                                                   2.02 
##                   mean_value_of_short_term_variability 
##                                                   2.03 
## percentage_of_time_with_abnormal_long_term_variability 
##                                                   2.26 
##                    mean_value_of_long_term_variability 
##                                                   8.52 
##                                        histogram_width 
##                                                   2.24 
##                                          histogram_min 
##                                                   5.30 
##                                          histogram_max 
##                                                   8.84 
##                              histogram_number_of_peaks 
##                                                   2.28 
##                             histogram_number_of_zeroes 
##                                                   7.91 
##                                         histogram_mode 
##                                                   1.94 
##                                         histogram_mean 
##                                                   2.32 
##                                       histogram_median 
##                                                   1.64 
##                                     histogram_variance 
##                                                   1.37 
##                                     histogram_tendency 
##                                                   7.29

mds2$stress

## [1] 0.1736905

The total stress is 0.1737, which lies between fair and poor. A stress of 0 is the best possible value, and Kruskal (1964) gave the following stress benchmarks for ordinal MDS: 0.20 = poor, 0.10 = fair, 0.05 = good, 0.025 = excellent, 0.00 = perfect.

This means that 2 dimensions are not enough to visualize the data, so I plot stress against the number of dimensions to find the optimal number of dimensions.
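For reference, Kruskal's benchmarks refer to the normalized Stress-1 measure. Assuming $d_{ij}$ denotes the distance between points $i$ and $j$ in the fitted configuration and $\hat{d}_{ij}$ the corresponding disparity, that is, the monotonically transformed dissimilarity (both symbols are introduced here), it can be written as:

$$
\sigma_1 = \sqrt{\frac{\sum_{i<j} \left(\hat{d}_{ij} - d_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}}
$$

A value of 0 means that the configuration reproduces the rank order of the dissimilarities exactly.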

3.3 The optimal dimensions

# Stress of the ordinal MDS solution for 2 to 10 dimensions
stress=matrix(0,9,1)
for (x in 2:10) {
  mds.data=mds(dis, ndim=x, type='ordinal')
  stress[x-1,1]=mds.data$stress
}
stress.data=as.data.frame(stress)
stress.data$num=2:10
colnames(stress.data) <- c("Stress.value", "Number.of.dimensions")
# Dashed lines mark Kruskal's "poor" (0.2) and "fair" (0.1) benchmarks;
# linewidth replaces the size argument deprecated in ggplot2 3.4.0
ggplot(stress.data, aes(x=Number.of.dimensions, y=Stress.value)) + 
  geom_point() + geom_line() +
  geom_hline(yintercept=c(0.2, 0.1), linetype="dashed", color = "red", linewidth=1) +
  scale_x_continuous(breaks = seq(1,10))

The “elbow” is at around 3 dimensions, where the total stress is about 10.8%, which is better than with 2 dimensions. I will therefore plot the MDS results in 3 dimensions, starting by computing the 3-dimensional configuration.

3.4 Draw variables in 3D

The next step is to use k-means to cluster similar variables:

- Find the optimal number of clusters.
- Run k-means clustering.
- Draw the clustering results in 3 dimensions.

mds3d <- mds(dis, ndim = 3, type = "ordinal") # MDS for ordinal data
mds3d.data=as.data.frame(mds3d$conf)
fviz_nbclust(mds3d.data, FUNcluster = kmeans,method = 'silhouette')

According to the silhouette method, the optimal number of clusters is 7, so I cluster the variables into 7 clusters.

km=eclust(mds3d.data,'kmeans', hc_metric = 'euclidean',k=7)
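The silhouette criterion itself can be sketched with base R's kmeans() and the cluster package: for each candidate k, cluster the points and average the silhouette widths, then pick the k with the largest average. The example assumes the built-in iris measurements as a stand-in for mds3d.data and a candidate range of 2 to 8, both chosen for illustration only.

```r
library(cluster)  # silhouette(); a recommended package shipped with R

set.seed(123)
X <- scale(iris[, 1:4])   # built-in data standing in for mds3d.data (assumption)
d <- dist(X)

# Average silhouette width for each candidate number of clusters
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  km_k <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km_k$cluster, d)[, "sil_width"])
})

# The k with the largest average silhouette width is the suggested choice
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))
print(best_k)
```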

Then I add the clustering result to mds3d.data.

Finally, I draw the variable points in 3 dimensions, colored by cluster.

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

mds3d.data$cluster=km$cluster

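# Font for the point labels (named label_font to avoid masking base::t)
label_font <- list(
  family = "sans serif",
  size = 8,
  color = "grey50")

# 3D scatter of the variables, colored by cluster and labeled by variable name
p <- plotly::plot_ly(mds3d.data, x = ~D1, y = ~D2, z = ~D3, color = ~cluster, text = rownames(mds3d.data)) %>%
  plotly::add_markers() %>%
  plotly::add_text(textfont = label_font, textposition = "top")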
p

## Warning: textfont.color doesn't (yet) support data arrays

## Warning: textfont.color doesn't (yet) support data arrays

The 3D graph visualizes the variables of the data and shows which variables are close to each other and which are far apart. For example, mean_value_of_short_term_variability, histogram_width, histogram_number_of_peaks and light_decelerations are close to each other.

4 Principal Component Analysis (PCA)

4.1 Choosing the optimal number of components

pca.data=prcomp(data_new,center = TRUE,scale. = TRUE)
eigen(cor(data_new))$values

##  [1]  6.058351e+00  3.507779e+00  1.824323e+00  1.496719e+00  1.218359e+00
##  [6]  1.020172e+00  9.817713e-01  9.265325e-01  7.624281e-01  6.402046e-01
## [11]  5.772840e-01  4.973146e-01  3.885029e-01  3.270316e-01  2.654308e-01
## [16]  1.808173e-01  1.331473e-01  1.175281e-01  4.951429e-02  2.678914e-02
## [21] -3.295975e-17

fviz_eig(pca.data,choice = 'eigenvalue')

fviz_eig(pca.data)

From this figure, 7 components should be chosen: the first six eigenvalues are higher than 1 and the seventh (0.98) is only marginally below 1. The scree plot of explained variance gives the same result.
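The last eigenvalue in the output above is numerically zero, which points to an exact linear dependency among the variables. One likely candidate, judging only from the rows shown by head(data), is that histogram_width equals histogram_max minus histogram_min; whether this holds for every row is an assumption. The sketch below checks the six visible rows and shows that such a dependency produces a zero eigenvalue:

```r
# First six rows of three columns, copied from head(data) above
width <- c(64, 130, 130, 117, 117, 150)
hmin  <- c(62, 68, 68, 53, 53, 50)
hmax  <- c(126, 198, 198, 170, 170, 200)

# In these rows the width is exactly max - min
print(all(hmax - hmin == width))

# An exact linear dependency makes the correlation matrix singular,
# so its smallest eigenvalue is zero up to rounding error
ev3 <- eigen(cor(cbind(width, hmin, hmax)))$values
print(ev3)
```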

summary(pca.data)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4614 1.8729 1.35067 1.22340 1.10379 1.01004 0.99084
## Proportion of Variance 0.2885 0.1670 0.08687 0.07127 0.05802 0.04858 0.04675
## Cumulative Proportion  0.2885 0.4555 0.54240 0.61367 0.67169 0.72027 0.76702
##                            PC8     PC9    PC10    PC11    PC12   PC13    PC14
## Standard deviation     0.96257 0.87317 0.80013 0.75979 0.70521 0.6233 0.57187
## Proportion of Variance 0.04412 0.03631 0.03049 0.02749 0.02368 0.0185 0.01557
## Cumulative Proportion  0.81114 0.84745 0.87794 0.90542 0.92911 0.9476 0.96318
##                           PC15    PC16    PC17   PC18    PC19    PC20      PC21
## Standard deviation     0.51520 0.42523 0.36489 0.3428 0.22252 0.16367 1.509e-15
## Proportion of Variance 0.01264 0.00861 0.00634 0.0056 0.00236 0.00128 0.000e+00
## Cumulative Proportion  0.97582 0.98443 0.99077 0.9964 0.99872 1.00000 1.000e+00

Unfortunately, according to the importance of components, 7 components explain only 76.7% of the variance, while 11 components explain about 90%. I still choose 7 components, because I want a simpler version of the data.
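Both selection rules can be recomputed directly from the eigenvalues. The sketch below copies the eigenvalues printed earlier in this section (the last one, about -3e-17, is entered as 0) and applies the Kaiser rule and a 90% cumulative-variance rule; the 90% threshold is the one discussed above, and the variable names are illustrative.

```r
# Eigenvalues copied from the eigen(cor(data_new))$values output
ev <- c(6.058351, 3.507779, 1.824323, 1.496719, 1.218359, 1.020172,
        0.9817713, 0.9265325, 0.7624281, 0.6402046, 0.5772840, 0.4973146,
        0.3885029, 0.3270316, 0.2654308, 0.1808173, 0.1331473, 0.1175281,
        0.04951429, 0.02678914, 0)

# Kaiser rule: count the eigenvalues above 1
n_kaiser <- sum(ev > 1)

# Cumulative proportion of variance and the smallest k reaching 90%
cum_prop <- cumsum(ev) / sum(ev)
k_90 <- which(cum_prop >= 0.90)[1]

print(n_kaiser)
print(round(cum_prop[c(7, 11)], 4))
print(k_90)
```

With these values the strict Kaiser count is 6 (the seventh eigenvalue is 0.98), and 11 components are needed to reach 90% of the variance.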

4.2 Analysis of components

fviz_pca_var(pca.data,col.var = 'mediumpurple')

Based on the plot, the contributions of the variables to each dimension are clear. RC1 consists of “histogram_width”, “histogram_min”, etc. I will also draw the variable contribution plot to visualize this.

4.3 Quality measures of PCA

I use the complexity and uniqueness plots to assess the quality of the PCA; for both measures, lower is better.

“Complexity-the higher this value, the more factor loads take values greater than zero. If the loading of only one factor is relatively large and the remaining ones are close to zero, the complexity is close to 1. In other words, how many variables constitute single factor. High complexity is an undesirable feature because it involves a more difficult interpretation of factors.”

“Uniquenesses is the proportion of variance that is not shared with other variables. In PCA we want it be low, because then it is easier to reduce the space to a smaller number of dimensions. This means that the variable does not carry additional information in relation to other variables in the model.” (Both definitions are from the professor’s lecture notes.)

pca_rotated$complexity

##                                         baseline.value 
##                                               1.344785 
##                                          accelerations 
##                                               1.321418 
##                                         fetal_movement 
##                                               2.405200 
##                                   uterine_contractions 
##                                               1.165828 
##                                    light_decelerations 
##                                               2.747854 
##                                   severe_decelerations 
##                                               1.521857 
##                               prolongued_decelerations 
##                                               5.376386 
##                        abnormal_short_term_variability 
##                                               3.062475 
##                   mean_value_of_short_term_variability 
##                                               2.090381 
## percentage_of_time_with_abnormal_long_term_variability 
##                                               3.230700 
##                    mean_value_of_long_term_variability 
##                                               1.068962 
##                                        histogram_width 
##                                               1.150702 
##                                          histogram_min 
##                                               1.839195 
##                                          histogram_max 
##                                               2.781923 
##                              histogram_number_of_peaks 
##                                               1.115990 
##                             histogram_number_of_zeroes 
##                                               3.467759 
##                                         histogram_mode 
##                                               1.198087 
##                                         histogram_mean 
##                                               1.323111 
##                                       histogram_median 
##                                               1.108557 
##                                     histogram_variance 
##                                               1.570531 
##                                     histogram_tendency 
##                                               1.357554

As the output shows, some variables, including “fetal_movement”, “percentage_of_time_with_abnormal_long_term_variability” and “histogram_number_of_zeroes”, have a complexity above 1.8, and some variables have a uniqueness above 0.38 but below 0.78. This indicates that some variables of the fetal data are not well suited to PCA.
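Both measures can be computed from a loading matrix. The sketch below assumes that complexity means Hofmann's index (the definition used by the psych package) and that uniqueness is one minus the communality; the loading matrix is made up for illustration and is not taken from the fetal data.

```r
# Hypothetical loading matrix: 3 variables (rows) on 2 rotated components (columns)
loadings <- rbind(
  clean = c(0.9, 0.0),   # loads on one component only
  split = c(0.6, 0.6),   # loads equally on both components
  weak  = c(0.3, 0.1)    # small loadings everywhere
)

# Hofmann's complexity index: (sum of squared loadings)^2 / sum of loadings^4
complexity <- rowSums(loadings^2)^2 / rowSums(loadings^4)

# Uniqueness: 1 - communality (communality = sum of squared loadings)
uniqueness <- 1 - rowSums(loadings^2)

print(round(cbind(complexity, uniqueness), 3))
```

A variable that loads on a single component has complexity 1, and one that loads equally on two components has complexity 2, which matches the interpretation quoted above.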

4.4 Visualizations

4.4.1 Visualizations for individuals or observations

Because 7 dimensions are difficult to represent, I plot the labeled observations in 2 dimensions, which are easy to visualize, although 2 dimensions lose some information about the data.

Draw observations and variables together in 2 dimensions.

4.4.2 Visualizations for group value (fetal_health)

This time I not only visualize the observations and variables but also distinguish the fetal_health groups (1-Normal, 2-Suspect, 3-Pathological). I show 2 methods, using the pca2d() and ggbiplot() functions.

Comparing these group plots, we can see that:

- Group 3 (pathological) has higher values of “severe_decelerations” and “prolongued_decelerations”, which are important indicators of fetal health problems.
- Group 2 (suspect) has higher values of “abnormal_short_term_variability”, “percentage_of_time_with_abnormal_long_term_variability” and “histogram_min”, so these are indicators of suspected health problems.
- Group 1 (normal) has higher values of the remaining variables.
- The observations of group 3 (pathological) are more spread out, with a larger variance than in the other groups.

4.4.3 Visualizations 3D

For observations/individuals

pca3d(pca.data,components = 1:3, group=data$fetal_health) # to make an interactive plot

## [1] 0.20365360 0.13008624 0.09626841
## Creating new device

5 Logistic Regression

The last part is to build a logistic regression model to predict whether the fetus will have health problems. I will also compare models based on the entire dataset, on 7 components and on 11 components.

Why choose 7 and 11 components? As the PCA analysis showed, 7 components explain 76.7% of the variance, which achieves the goal of dimensionality reduction, while 11 components explain about 90%, which retains more of the information in the dataset.

The model used is a logistic regression model. A typical use of this model is to predict y given a set of predictor variables x. Predictors can be continuous, categorical, or a mixture of both. In this case, the response y has 3 categories (1=normal, 2=suspect, 3=pathological) and the predictors x are a set of continuous and discrete variables.
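One caveat when fitting such a model in R: glm() called without a family argument fits a Gaussian linear model, and family = binomial covers only two classes. For a three-level response, one option is multinomial logistic regression. The sketch below is not the model fitted in this report; it uses nnet::multinom() on the built-in iris data (three classes) purely as an illustration, with a 70/30 split chosen to mirror the one used here.

```r
library(nnet)  # multinom(); a recommended package shipped with R

set.seed(123)
# Built-in iris data (3 classes) stands in for the fetal data (assumption)
train_idx <- sample(seq_len(nrow(iris)), size = floor(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Multinomial logistic regression: the response must be a factor
fit <- multinom(Species ~ ., data = train, trace = FALSE)

# Predicted class labels and accuracy on the held-out rows
pred <- predict(fit, newdata = test, type = "class")
accuracy <- mean(pred == test$Species)
print(table(predicted = pred, actual = test$Species))
print(accuracy)
```

For the fetal data the analogous call would need fetal_health converted to a factor first, for example with factor(fetal_health) in the formula; that adaptation is hypothetical and untested here.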

5.1 Model data preparation

First, I split the whole dataset into 2 chunks: training and testing data. The training data is used to fit the model, which is then evaluated on the testing data.

sample_size=floor(0.7*nrow(data))
set.seed(123)
data0=as.data.frame(scale(data[,-22]))
data0$fetal_health=data$fetal_health
train_ind=sample(seq_len(nrow(data0)),size=sample_size)
data_train=data0[train_ind,]
data_test=data0[-train_ind,]

Second, I take the dimension-reduced data (7 components) and divide it into the same 2 chunks: training and testing.

library(factoextra)
data1=as.data.frame(get_pca_ind(pca.data)$coord[,1:7])
data1$fetal_health=data$fetal_health
data_train_7comp=data1[train_ind,]
data_test_7comp=data1[-train_ind,]

Finally, I prepare the 11-component data.

data2=as.data.frame(get_pca_ind(pca.data)$coord[,1:11])
data2$fetal_health=data$fetal_health
data_train_11comp=data2[train_ind,]
data_test_11comp=data2[-train_ind,]
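The component coordinates used as model inputs are the standardized data projected onto the first k principal axes. prcomp() also stores these scores in its `x` element, so they can be obtained and checked in base R. The sketch assumes the built-in mtcars data and k = 3 for illustration:

```r
# Built-in mtcars data stands in for data_new (assumption for illustration)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

k <- 3
scores_k <- pca$x[, 1:k]                            # stored component scores
manual_k <- scale(mtcars) %*% pca$rotation[, 1:k]   # standardized data times loadings

print(dim(scores_k))
print(all.equal(unname(scores_k), unname(manual_k)))
```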

5.2 Compare Accuracy Score

5.2.1 Modeling and testing for the whole data

library(stats)
model <- glm(fetal_health ~.,data=data_train)
summary(model)

## 
## Call:
## glm(formula = fetal_health ~ ., data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3972  -0.2099  -0.0269   0.1425   1.5662  
## 
## Coefficients: (1 not defined because of singularities)
##                                                          Estimate Std. Error
## (Intercept)                                             1.3020314  0.0099933
## baseline.value                                          0.1398317  0.0259742
## accelerations                                          -0.0021230  0.0167536
## fetal_movement                                         -0.0010452  0.0109532
## uterine_contractions                                   -0.0789846  0.0113088
## light_decelerations                                    -0.0044686  0.0188197
## severe_decelerations                                    0.0409530  0.0104449
## prolongued_decelerations                                0.2405177  0.0168434
## abnormal_short_term_variability                         0.1493815  0.0139908
## mean_value_of_short_term_variability                    0.0142623  0.0181013
## percentage_of_time_with_abnormal_long_term_variability  0.2063705  0.0140222
## mean_value_of_long_term_variability                     0.0252177  0.0142345
## histogram_width                                         0.1403057  0.0413934
## histogram_min                                           0.1811938  0.0435806
## histogram_max                                                  NA         NA
## histogram_number_of_peaks                              -0.0209750  0.0153185
## histogram_number_of_zeroes                             -0.0008726  0.0111756
## histogram_mode                                         -0.0459822  0.0302064
## histogram_mean                                         -0.0384418  0.0467268
## histogram_median                                       -0.1625018  0.0526350
## histogram_variance                                      0.0694509  0.0156950
## histogram_tendency                                      0.0484699  0.0170999
##                                                        t value Pr(>|t|)    
## (Intercept)                                            130.291  < 2e-16 ***
## baseline.value                                           5.383 8.50e-08 ***
## accelerations                                           -0.127 0.899179    
## fetal_movement                                          -0.095 0.923989    
## uterine_contractions                                    -6.984 4.32e-12 ***
## light_decelerations                                     -0.237 0.812345    
## severe_decelerations                                     3.921 9.23e-05 ***
## prolongued_decelerations                                14.280  < 2e-16 ***
## abnormal_short_term_variability                         10.677  < 2e-16 ***
## mean_value_of_short_term_variability                     0.788 0.430873    
## percentage_of_time_with_abnormal_long_term_variability  14.717  < 2e-16 ***
## mean_value_of_long_term_variability                      1.772 0.076670 .  
## histogram_width                                          3.390 0.000719 ***
## histogram_min                                            4.158 3.40e-05 ***
## histogram_max                                               NA       NA    
## histogram_number_of_peaks                               -1.369 0.171126    
## histogram_number_of_zeroes                              -0.078 0.937775    
## histogram_mode                                          -1.522 0.128158    
## histogram_mean                                          -0.823 0.410817    
## histogram_median                                        -3.087 0.002057 ** 
## histogram_variance                                       4.425 1.04e-05 ***
## histogram_tendency                                       2.835 0.004653 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1479799)
## 
##     Null deviance: 541.00  on 1487  degrees of freedom
## Residual deviance: 217.09  on 1467  degrees of freedom
## AIC: 1402.5
## 
## Number of Fisher Scoring iterations: 2
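The note "1 not defined because of singularities" and the NA row for histogram_max mean that this column is an exact linear combination of other predictors, so its coefficient cannot be estimated separately. A plausible cause (an assumption, not checked here) is that histogram_width equals histogram_max minus histogram_min. The following sketch reproduces the effect on hypothetical toy columns:

```r
set.seed(1)
# Hypothetical columns mimicking min, max and width = max - min
toy <- data.frame(hist_min = rnorm(50, 90, 10))
toy$hist_max   <- toy$hist_min + runif(50, 20, 80)
toy$hist_width <- toy$hist_max - toy$hist_min
toy$y <- 1 + 0.01 * toy$hist_width + rnorm(50, sd = 0.1)

# Same term order as in the summary above: width, min, max
fit <- glm(y ~ hist_width + hist_min + hist_max, data = toy)

coef(fit)   # the last of the collinear terms is reported as NA
fit$rank    # 3 estimable coefficients instead of 4
```

Dropping one of the three columns before fitting would remove the NA without changing the fitted values.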

library(randomForest)

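# Predict on the test set and threshold the predictions at 0.5 (gives 0 or 1)
fitted.results <- predict(model,newdata=data_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
# Misclassification rate: share of test rows where the thresholded prediction differs from the label
misClasificError <- mean(fitted.results != data_test$fetal_health)
print(paste('Accuracy',1-misClasificError))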

## [1] "Accuracy 0.755485893416928"

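The test accuracy for the full data is about 0.755. Note that glm() is called here without a family argument, so the summary reports a Gaussian family rather than a binomial one, and the 0.5 threshold produces the labels 0 and 1 while fetal_health takes the values 1, 2 and 3. The score therefore counts only the test rows that are predicted as 1 and are actually Normal (class 1).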
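For comparison, here is a minimal sketch of a binomial (logistic) fit on a two-class version of the target, Normal versus not Normal. Passing `family = binomial` to `glm()` is what makes `predict(..., type = "response")` return probabilities, so that a 0.5 cut-off is meaningful. The toy data, the coefficients used to simulate it and the `is_normal` column are assumptions made for illustration and are not part of the fetal health dataset.

```r
set.seed(123)
# Hypothetical toy data: two standardised predictors and a 1/2/3 class label
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
score <- 1.5 * x1 - 1.0 * x2 + rnorm(n)
fetal_health <- cut(score, breaks = c(-Inf, 0.5, 1.5, Inf), labels = FALSE)
toy <- data.frame(x1, x2, fetal_health)

# Binary target: 1 if Normal (class 1), 0 otherwise
toy$is_normal <- as.integer(toy$fetal_health == 1)

train_ind <- sample(seq_len(n), size = floor(0.7 * n))
toy_train <- toy[train_ind, ]
toy_test  <- toy[-train_ind, ]

# family = binomial makes glm() fit a logistic regression
fit <- glm(is_normal ~ x1 + x2, data = toy_train, family = binomial)

# type = "response" returns P(is_normal = 1) for each test row
p_hat <- predict(fit, newdata = toy_test, type = "response")
pred  <- ifelse(p_hat > 0.5, 1, 0)

accuracy <- mean(pred == toy_test$is_normal)
print(paste("Accuracy", accuracy))
table(predicted = pred, actual = toy_test$is_normal)
```

The confusion table at the end shows how the errors are distributed between the two classes, which a single accuracy figure hides.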

5.2.2 Modeling and testing for the 7 components data.

model_7comp <- glm(fetal_health ~.,data=data_train_7comp)
summary(model_7comp)

## 
## Call:
## glm(formula = fetal_health ~ ., data = data_train_7comp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0611  -0.2565  -0.0275   0.1950   1.5159  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.299062   0.010722 121.161  < 2e-16 ***
## Dim.1       -0.007351   0.004437  -1.657 0.097791 .  
## Dim.2        0.101498   0.005756  17.634  < 2e-16 ***
## Dim.3        0.281372   0.007957  35.363  < 2e-16 ***
## Dim.4        0.066364   0.008684   7.642 3.82e-14 ***
## Dim.5        0.080389   0.009938   8.089 1.24e-15 ***
## Dim.6       -0.041251   0.010633  -3.879 0.000109 ***
## Dim.7       -0.058265   0.011002  -5.296 1.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1707939)
## 
##     Null deviance: 541.00  on 1487  degrees of freedom
## Residual deviance: 252.77  on 1480  degrees of freedom
## AIC: 1603
## 
## Number of Fisher Scoring iterations: 2

fitted7comp.results <- predict(model_7comp,newdata=data_test_7comp,type='response')
fitted7comp.results <- ifelse(fitted7comp.results > 0.5,1,0)
misClasificError01 <- mean(fitted7comp.results != data_test_7comp$fetal_health)
print(paste('Accuracy',1-misClasificError01))

## [1] "Accuracy 0.746081504702194"

The test accuracy for the 7-component data is about 0.746, close to the 0.755 obtained with the full data. In this case, the model fitted on the PCA-reduced data predicts fetal health almost as accurately as the model fitted on all features.

5.2.3 Modeling and testing results for the 11 components data.

To check whether 11 components perform better than 7 components, I repeat the same steps.

model_11comp <- glm(fetal_health ~.,data=data_train_11comp)
summary(model_11comp)

## 
## Call:
## glm(formula = fetal_health ~ ., data = data_train_11comp)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.27455  -0.26918  -0.03072   0.20579   1.54286  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.299161   0.010460 124.209  < 2e-16 ***
## Dim.1       -0.008559   0.004331  -1.976 0.048285 *  
## Dim.2        0.102323   0.005616  18.220  < 2e-16 ***
## Dim.3        0.283043   0.007766  36.446  < 2e-16 ***
## Dim.4        0.062985   0.008481   7.426 1.88e-13 ***
## Dim.5        0.082947   0.009703   8.549  < 2e-16 ***
## Dim.6       -0.039672   0.010373  -3.824 0.000137 ***
## Dim.7       -0.061643   0.010745  -5.737 1.17e-08 ***
## Dim.8        0.004707   0.010840   0.434 0.664174    
## Dim.9       -0.033951   0.012058  -2.816 0.004935 ** 
## Dim.10       0.112788   0.013565   8.314  < 2e-16 ***
## Dim.11      -0.028981   0.013911  -2.083 0.037400 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1624313)
## 
##     Null deviance: 541.00  on 1487  degrees of freedom
## Residual deviance: 239.75  on 1476  degrees of freedom
## AIC: 1532.3
## 
## Number of Fisher Scoring iterations: 2

fitted11comp.results <- predict(model_11comp,newdata=data_test_11comp,type='response')
fitted11comp.results <- ifelse(fitted11comp.results > 0.5,1,0)
misClasificError02<- mean(fitted11comp.results != data_test_11comp$fetal_health)
print(paste('Accuracy',1-misClasificError02))

## [1] "Accuracy 0.746081504702194"

The test accuracy for the 11-component data is about 0.746, the same as for the 7-component data.
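Since fetal_health has three classes, one way to obtain class predictions from logistic models is a one-vs-rest scheme: fit one binomial `glm()` per class and pick the class with the highest predicted probability. This is a swapped-in technique shown as a rough sketch on hypothetical toy data, not the method used in the sections above.

```r
set.seed(123)
# Hypothetical toy data with a 1/2/3 label
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
score <- 1.5 * x1 - x2 + rnorm(n)
toy <- data.frame(x1, x2,
                  fetal_health = cut(score, c(-Inf, 0.5, 1.5, Inf), labels = FALSE))

train_ind <- sample(seq_len(n), size = floor(0.7 * n))
toy_train <- toy[train_ind, ]
toy_test  <- toy[-train_ind, ]

classes <- sort(unique(toy_train$fetal_health))

# One-vs-rest: one binomial glm per class, each giving P(class k vs. the rest)
probs <- sapply(classes, function(k) {
  d <- toy_train
  d$y <- as.integer(d$fetal_health == k)
  fit_k <- glm(y ~ x1 + x2, data = d, family = binomial)
  predict(fit_k, newdata = toy_test, type = "response")
})

# Predicted class = the class with the highest probability in each row
pred_class <- classes[max.col(probs, ties.method = "first")]

mean(pred_class == toy_test$fetal_health)                       # overall accuracy
table(predicted = pred_class, actual = toy_test$fetal_health)   # per-class view
```

Running the same comparison on the full, 7-component and 11-component data would show whether the models differ on the Suspect and Pathological classes even when overall accuracy is equal.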

MDS/PCA on Fetal Health Data (Using Logistic Regression Prediction)

Ting_Wei