[1] ~ Introduction

Context: Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.

Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would provide an effective screening step prior to an appointment with a clinician.

Project Task: Develop an appropriate model that will effectively screen patients for Parkinson’s Disease using various measurements of their speech.

Data and Attributes Information can be found here: Parkinson’s Disease Dataset

Original Paper and Researchers: BioMedical Engineering OnLine

[2] ~ Loading Required R Packages

library(boot)
library(GGally)
library(car)
library(glmnet)
library(caret)
library(dplyr)
library(broom)
library(MASS)
library(class)
library(klaR)
library(tree)
library(randomForest)
library(gbm)
library(pls)

[3] ~ Importing and Exploring the Dataset

set.seed(2645)
# Data was downloaded via the link above
data <- read.csv("Parkinson_Data.csv")

## Check for NA Values:
## The Dataset doesn't seem to contain any NA Values
which(is.na(data)==TRUE)
## integer(0)
## Checking a sample of the Dataset
head(data)
##             name MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
## 1 phon_R01_S01_1     119.992      157.302       74.997        0.00784
## 2 phon_R01_S01_2     122.400      148.650      113.819        0.00968
## 3 phon_R01_S01_3     116.682      131.111      111.555        0.01050
## 4 phon_R01_S01_4     116.676      137.871      111.366        0.00997
## 5 phon_R01_S01_5     116.014      141.781      110.655        0.01284
## 6 phon_R01_S01_6     120.552      131.162      113.787        0.00968
##   MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB.
## 1          0.00007  0.00370  0.00554    0.01109      0.04374            0.426
## 2          0.00008  0.00465  0.00696    0.01394      0.06134            0.626
## 3          0.00009  0.00544  0.00781    0.01633      0.05233            0.482
## 4          0.00009  0.00502  0.00698    0.01505      0.05492            0.517
## 5          0.00011  0.00655  0.00908    0.01966      0.06425            0.584
## 6          0.00008  0.00463  0.00750    0.01388      0.04701            0.456
##   Shimmer.APQ3 Shimmer.APQ5 MDVP.APQ Shimmer.DDA     NHR    HNR status     RPDE
## 1      0.02182      0.03130  0.02971     0.06545 0.02211 21.033      1 0.414783
## 2      0.03134      0.04518  0.04368     0.09403 0.01929 19.085      1 0.458359
## 3      0.02757      0.03858  0.03590     0.08270 0.01309 20.651      1 0.429895
## 4      0.02924      0.04005  0.03772     0.08771 0.01353 20.644      1 0.434969
## 5      0.03490      0.04825  0.04465     0.10470 0.01767 19.649      1 0.417356
## 6      0.02328      0.03526  0.03243     0.06985 0.01222 21.378      1 0.415564
##        DFA   spread1  spread2       D2      PPE
## 1 0.815285 -4.813031 0.266482 2.301442 0.284654
## 2 0.819521 -4.075192 0.335590 2.486855 0.368674
## 3 0.825288 -4.443179 0.311173 2.342259 0.332634
## 4 0.819235 -4.117501 0.334147 2.405554 0.368975
## 5 0.823484 -3.747787 0.234513 2.332180 0.410335
## 6 0.825069 -4.242867 0.299111 2.187560 0.357775
## Removing the sample names since they don't contribute to the model
data <- data[,-1]

# Validating the Data Types
str(head(data, n=0))
## 'data.frame':    0 obs. of  23 variables:
##  $ MDVP.Fo.Hz.     : num 
##  $ MDVP.Fhi.Hz.    : num 
##  $ MDVP.Flo.Hz.    : num 
##  $ MDVP.Jitter...  : num 
##  $ MDVP.Jitter.Abs.: num 
##  $ MDVP.RAP        : num 
##  $ MDVP.PPQ        : num 
##  $ Jitter.DDP      : num 
##  $ MDVP.Shimmer    : num 
##  $ MDVP.Shimmer.dB.: num 
##  $ Shimmer.APQ3    : num 
##  $ Shimmer.APQ5    : num 
##  $ MDVP.APQ        : num 
##  $ Shimmer.DDA     : num 
##  $ NHR             : num 
##  $ HNR             : num 
##  $ status          : int 
##  $ RPDE            : num 
##  $ DFA             : num 
##  $ spread1         : num 
##  $ spread2         : num 
##  $ D2              : num 
##  $ PPE             : num
# Converting Status into true factor: 1 for Disease/ 0 for Healthy
data$status <- as.factor(data$status)
head(data$status)
## [1] 1 1 1 1 1 1
## Levels: 0 1
## There are 23 variables and 195 recordings (each subject contributed ~6 recordings)
## Note that, to start, I treat all records as independent
## Table to show counts of Parkinson's Disease Patients (PDP) vs Control
tbl <- as.vector(c(with(data, table(status)), as.numeric(nrow(data))))
bp <- barplot(tbl, beside = TRUE, names.arg = c("Healthy", "PDP", "Total"),
        ylab="Frequency", xlab="Status", ylim= c(0,max(tbl)+50),
        main="Response-Target Distribution" ,col=c("tomato","mediumturquoise","grey"))
text(x = bp, y = tbl, label = tbl,
     pos = 3, cex = 0.8, col = c("tomato","mediumturquoise","black"))

Conclusion from Bar Plot: Note that the healthy samples number only 48 out of 195, so the response 'status' is imbalanced.
That must be taken into consideration when building the different models.
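Because of this imbalance, raw classification accuracy can be misleading: a trivial model that always predicts PD is already right for 147 of the 195 records. A minimal baseline sketch (added here for context, not part of the original analysis):

## Class proportions and the accuracy of always predicting the majority class ("1" = PD)
prop.table(table(data$status))
majority.acc <- max(prop.table(table(data$status)))
majority.acc  # 147/195 ~ 0.754; later models should clearly beat this floor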

[3.1] ~ Visualizing the Dataset

set.seed(2645)
## The number of variables (23) makes it hard to visualize all the data in a single pairs plot
## For this reason, I generate several plots using ggpairs() from the "GGally" Package
## Note: this was run under R 4.0.3; GGally may need updating to its latest version for these plots to run
## Each plot covers a block of five variables, colored as a function of status (0 vs 1):
for (x in seq(from=1, to=ncol(data[,-17])-4, by=4)){
  ggp = ggpairs(data[,-17][,c(seq(from=x, to=x+4))], 
                aes(color=data$status), progress=ggmatrix_progress())
  print(ggp)
}

Indications from the Generalized Pairs Plots:
1. Some of the predictors are roughly normally distributed, such as D2 & spread2.
2. There is a positive correlation between many of the variables, such as Jitter.DDP & MDVP.RAP.
3. There is a clear indication of outliers in most variables, but all the values are within their acceptable biological range.
=> For this reason, the original data is kept untouched; a new set is created to test later whether filtering the outliers makes any difference. (Check the next section in this tab for the outlier filtering.)
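To put numbers on the correlations suggested by the pairs plots, a quick check can be run on the 22 numeric predictors. This is a sketch added for illustration; the 0.9 cutoff passed to caret's findCorrelation() is an arbitrary choice, not a tuned value:

## Pairwise correlations among the predictors (column 17 is the response, status)
cor.mat <- cor(data[,-17])
## e.g. the Jitter.DDP / MDVP.RAP pair noted above:
cor.mat["Jitter.DDP","MDVP.RAP"]
## Predictors flagged as highly correlated (|r| > 0.9)
findCorrelation(cor.mat, cutoff=0.9, names=TRUE)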

[3.2] ~ Replacing the Outliers (> Q3 + 1.5*IQR or < Q1 - 1.5*IQR) with the Median

  • The outliers are replaced by the median rather than removed, due to the small sample size
    set.seed(2645)
    ## Observing the summary of the Dataset to give
    ## us an idea about the 1st and 3rd Quantiles
    data.fil.by.iqr <- data
    summary(data.fil.by.iqr)
    ##   MDVP.Fo.Hz.      MDVP.Fhi.Hz.    MDVP.Flo.Hz.    MDVP.Jitter...    
    ##  Min.   : 88.33   Min.   :102.1   Min.   : 65.48   Min.   :0.001680  
    ##  1st Qu.:117.57   1st Qu.:134.9   1st Qu.: 84.29   1st Qu.:0.003460  
    ##  Median :148.79   Median :175.8   Median :104.31   Median :0.004940  
    ##  Mean   :154.23   Mean   :197.1   Mean   :116.32   Mean   :0.006220  
    ##  3rd Qu.:182.77   3rd Qu.:224.2   3rd Qu.:140.02   3rd Qu.:0.007365  
    ##  Max.   :260.11   Max.   :592.0   Max.   :239.17   Max.   :0.033160  
    ##  MDVP.Jitter.Abs.       MDVP.RAP           MDVP.PPQ          Jitter.DDP      
    ##  Min.   :7.000e-06   Min.   :0.000680   Min.   :0.000920   Min.   :0.002040  
    ##  1st Qu.:2.000e-05   1st Qu.:0.001660   1st Qu.:0.001860   1st Qu.:0.004985  
    ##  Median :3.000e-05   Median :0.002500   Median :0.002690   Median :0.007490  
    ##  Mean   :4.396e-05   Mean   :0.003306   Mean   :0.003446   Mean   :0.009920  
    ##  3rd Qu.:6.000e-05   3rd Qu.:0.003835   3rd Qu.:0.003955   3rd Qu.:0.011505  
    ##  Max.   :2.600e-04   Max.   :0.021440   Max.   :0.019580   Max.   :0.064330  
    ##   MDVP.Shimmer     MDVP.Shimmer.dB.  Shimmer.APQ3       Shimmer.APQ5    
    ##  Min.   :0.00954   Min.   :0.0850   Min.   :0.004550   Min.   :0.00570  
    ##  1st Qu.:0.01650   1st Qu.:0.1485   1st Qu.:0.008245   1st Qu.:0.00958  
    ##  Median :0.02297   Median :0.2210   Median :0.012790   Median :0.01347  
    ##  Mean   :0.02971   Mean   :0.2823   Mean   :0.015664   Mean   :0.01788  
    ##  3rd Qu.:0.03789   3rd Qu.:0.3500   3rd Qu.:0.020265   3rd Qu.:0.02238  
    ##  Max.   :0.11908   Max.   :1.3020   Max.   :0.056470   Max.   :0.07940  
    ##     MDVP.APQ        Shimmer.DDA           NHR                HNR        
    ##  Min.   :0.00719   Min.   :0.01364   Min.   :0.000650   Min.   : 8.441  
    ##  1st Qu.:0.01308   1st Qu.:0.02474   1st Qu.:0.005925   1st Qu.:19.198  
    ##  Median :0.01826   Median :0.03836   Median :0.011660   Median :22.085  
    ##  Mean   :0.02408   Mean   :0.04699   Mean   :0.024847   Mean   :21.886  
    ##  3rd Qu.:0.02940   3rd Qu.:0.06080   3rd Qu.:0.025640   3rd Qu.:25.076  
    ##  Max.   :0.13778   Max.   :0.16942   Max.   :0.314820   Max.   :33.047  
    ##  status       RPDE             DFA            spread1          spread2        
    ##  0: 48   Min.   :0.2566   Min.   :0.5743   Min.   :-7.965   Min.   :0.006274  
    ##  1:147   1st Qu.:0.4213   1st Qu.:0.6748   1st Qu.:-6.450   1st Qu.:0.174350  
    ##          Median :0.4960   Median :0.7223   Median :-5.721   Median :0.218885  
    ##          Mean   :0.4985   Mean   :0.7181   Mean   :-5.684   Mean   :0.226510  
    ##          3rd Qu.:0.5876   3rd Qu.:0.7619   3rd Qu.:-5.046   3rd Qu.:0.279234  
    ##          Max.   :0.6852   Max.   :0.8253   Max.   :-2.434   Max.   :0.450493  
    ##        D2             PPE         
    ##  Min.   :1.423   Min.   :0.04454  
    ##  1st Qu.:2.099   1st Qu.:0.13745  
    ##  Median :2.362   Median :0.19405  
    ##  Mean   :2.382   Mean   :0.20655  
    ##  3rd Qu.:2.636   3rd Qu.:0.25298  
    ##  Max.   :3.671   Max.   :0.52737
    ## Note that quantile(x) returns 0% - 25% - 50% - 75% - 100% quantiles
    for (i in seq(from=1, to=ncol(data.fil.by.iqr))){
      if(i != 17){  # skip the response, status
        quant <- quantile(data.fil.by.iqr[,i])   # quant[2] = Q1, quant[4] = Q3
        iqr <- quant[4] - quant[2]
        data.fil.by.iqr[which(data.fil.by.iqr[,i] > (quant[4] + 1.5*iqr)), i] <- median(data[,i])
        data.fil.by.iqr[which(data.fil.by.iqr[,i] < (quant[2] - 1.5*iqr)), i] <- median(data[,i])
      }
    }
    
    ## Observing the SD of the Data
    ## Calculating how many SD are the max and min measurements far from the Mean
    ## 1. Before correcting for outliers
    tb_org <- apply(data[,-17], 2, function(x){return(c(sd(x),(mean(x)-min(x))/sd(x),(max(x)-mean(x))/sd(x)))})
    rownames(tb_org) <- c("SD","kSD(min,mean)","kSD(max,mean)")
    tb_org
    ##               MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
    ## SD              41.390065    91.491548    43.521413    0.004848134
    ## kSD(min,mean)    1.592064     1.037909     1.168359    0.936538022
    ## kSD(max,mean)    2.558014     4.316520     2.822642    5.556682255
    ##               MDVP.Jitter.Abs.    MDVP.RAP    MDVP.PPQ  Jitter.DDP MDVP.Shimmer
    ## SD                3.482191e-05 0.002967774 0.002758977 0.008903344   0.01885693
    ## kSD(min,mean)     1.061371e+00 0.884976379 0.915686973 0.885054919   1.06958695
    ## kSD(max,mean)     6.204170e+00 6.110164453 5.847690318 6.111192503   4.73941744
    ##               MDVP.Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5   MDVP.APQ Shimmer.DDA
    ## SD                   0.1948773   0.01015316   0.01202371 0.01694674  0.03045912
    ## kSD(min,mean)        1.0121820   1.09464956   1.01285385 0.99673984  1.09499605
    ## kSD(max,mean)        5.2327735   4.01902853   5.11670411 6.70916873  4.01940000
    ##                      NHR      HNR      RPDE        DFA  spread1    spread2
    ## SD            0.04041845 4.425764 0.1039417 0.05533583 1.090208 0.08340576
    ## kSD(min,mean) 0.59866417 3.037888 2.3278964 2.59898596 2.091883 2.64054115
    ## kSD(max,mean) 7.17427149 2.521830 1.7953856 1.93706235 2.981419 2.68545775
    ##                     D2        PPE
    ## SD            0.382799 0.09011932
    ## kSD(min,mean) 2.504027 1.79775698
    ## kSD(max,mean) 3.368161 3.55989537
    ## 2. After correcting for outliers
    tb <- apply(data.fil.by.iqr[,-17], 2, function(x){return(c(sd(x),(mean(x)-min(x))/sd(x),(max(x)-mean(x))/sd(x)))})
    rownames(tb) <- c("SD","kSD(min,mean)","kSD(max,mean)")
    tb
    ##               MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
    ## SD              41.390065    49.247531    43.521413    0.002675295
    ## kSD(min,mean)    1.592064     1.567499     1.168359    1.377657827
    ## kSD(max,mean)    2.558014     4.418913     2.822642    3.855410027
    ##               MDVP.Jitter.Abs.    MDVP.RAP    MDVP.PPQ  Jitter.DDP MDVP.Shimmer
    ## SD                2.374803e-05 0.001418343 0.001408827 0.004253768   0.01550102
    ## kSD(min,mean)     1.361948e+00 1.430524825 1.402112226 1.430960643   1.18255544
    ## kSD(max,mean)     3.396341e+00 4.075902021 3.758209686 4.074746510   3.33650079
    ##               MDVP.Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5   MDVP.APQ Shimmer.DDA
    ## SD                    0.139609  0.008614769  0.009457879 0.01232011  0.02584362
    ## kSD(min,mean)         1.207806  1.195710387  1.129976288 1.20625404  1.19612498
    ## kSD(max,mean)         3.354938  3.248982014  3.109874760 3.29045997  3.24868494
    ##                      NHR      HNR      RPDE        DFA  spread1    spread2
    ## SD            0.01197518 4.425764 0.1039417 0.05533583 1.090208 0.08340576
    ## kSD(min,mean) 1.15683689 3.037888 2.3278964 2.59898596 2.091883 2.64054115
    ## kSD(max,mean) 3.84684453 2.521830 1.7953856 1.93706235 2.981419 2.68545775
    ##                     D2        PPE
    ## SD            0.382799 0.09011932
    ## kSD(min,mean) 2.504027 1.79775698
    ## kSD(max,mean) 3.368161 3.55989537

    Conclusion about Outliers:
    > 1. By replacing outliers with medians, the SD of several variables decreased, which shows the measurements are now closer to the mean.
    > 2. The right skewness decreased: shown by an increase in kSD(min,mean) and a decrease in kSD(max,mean).
    Major Concern: The replaced values were within the normal biological range, so replacing them may lead to a decrease in variance but at the cost of an increase in bias.
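    As a quick sanity check on how invasive the median replacement actually was, one can count how many values were changed in each column (a small added sketch; status, column 17, was left untouched by the filter):

    ## Number of values replaced per variable by the IQR filter above
    replaced <- colSums(data[,-17] != data.fil.by.iqr[,-17])
    sort(replaced, decreasing=TRUE)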

    [4] ~ Building Models

    In this Project we attempt the following approaches:
    1. Classification using:
    a. Multiple Logistic Regression
    b. Linear Discriminant Analysis (LDA)
    c. Quadratic Discriminant Analysis (QDA)
    d. K-Nearest Neighbors (KNN)

    2. Tree Based Approaches:
    a. Basic Decision Trees
    b. Bagging, Random Forests and Boosting

    Note: These approaches are all accompanied, where applicable, by cross validation, model selection and regularization.

    Finally: a comparison between all the models, eventually reaching the best model for effectively predicting PD in patients.

    Split the dataset into train and test sets using createDataPartition(), which performs stratified sampling by default.
    Stratified sampling is important here because the response class is imbalanced: healthy subjects are under-represented.

    # Set seed to get reproducible output. R version 4.0.3
    set.seed(2645)
    # Taking 70% training, 30% test, since the sample isn't particularly large.
    training_indices <- createDataPartition(data$status, p=0.7, list=FALSE)
    
    x <- model.matrix(status~.,data)[,-1]
    y <- data$status
    
    x.train <- data[training_indices,]
    y.train <- data$status[training_indices]
    
    x.test  <- data[-training_indices,]
    y.test  <- data$status[-training_indices]

    [4.1.a] ~ Multiple Logistic Regression

    set.seed(2645)
    # Attach the DataSet:
    attach(data)
    # Performing Multiple Logistic Regression, Family = Binomial to indicate classification
    glm.fit <- glm(status~., data=data, subset=training_indices, family="binomial")
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    summary(glm.fit)
    ## 
    ## Call:
    ## glm(formula = status ~ ., family = "binomial", data = data, subset = training_indices)
    ## 
    ## Deviance Residuals: 
    ##      Min        1Q    Median        3Q       Max  
    ## -2.04111   0.00000   0.03459   0.21264   2.00135  
    ## 
    ## Coefficients:
    ##                    Estimate Std. Error z value Pr(>|z|)  
    ## (Intercept)      -1.161e+01  2.506e+01  -0.463   0.6432  
    ## MDVP.Fo.Hz.      -1.089e-02  3.332e-02  -0.327   0.7437  
    ## MDVP.Fhi.Hz.     -2.726e-03  7.235e-03  -0.377   0.7063  
    ## MDVP.Flo.Hz.      2.912e-03  1.758e-02   0.166   0.8684  
    ## MDVP.Jitter...   -1.912e+03  1.875e+03  -1.019   0.3080  
    ## MDVP.Jitter.Abs. -1.044e+05  1.332e+05  -0.784   0.4330  
    ## MDVP.RAP          1.317e+05  2.113e+05   0.624   0.5329  
    ## MDVP.PPQ         -2.087e+03  2.872e+03  -0.727   0.4673  
    ## Jitter.DDP       -4.197e+04  7.032e+04  -0.597   0.5507  
    ## MDVP.Shimmer      7.870e+02  1.671e+03   0.471   0.6376  
    ## MDVP.Shimmer.dB.  3.017e+01  5.467e+01   0.552   0.5810  
    ## Shimmer.APQ3     -1.173e+05  1.735e+05  -0.676   0.4991  
    ## Shimmer.APQ5     -4.208e+02  5.839e+02  -0.721   0.4711  
    ## MDVP.APQ          1.790e+02  6.171e+02   0.290   0.7718  
    ## Shimmer.DDA       3.860e+04  5.782e+04   0.668   0.5044  
    ## NHR              -2.293e+01  9.494e+01  -0.242   0.8092  
    ## HNR               9.805e-03  2.733e-01   0.036   0.9714  
    ## RPDE             -6.925e+00  6.404e+00  -1.081   0.2795  
    ## DFA               1.855e+01  1.263e+01   1.468   0.1420  
    ## spread1           1.498e+00  2.776e+00   0.540   0.5894  
    ## spread2           1.939e+01  9.895e+00   1.960   0.0501 .
    ## D2                2.768e+00  2.338e+00   1.184   0.2364  
    ## PPE               2.422e+01  4.561e+01   0.531   0.5954  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 153.528  on 136  degrees of freedom
    ## Residual deviance:  49.892  on 114  degrees of freedom
    ## AIC: 95.892
    ## 
    ## Number of Fisher Scoring iterations: 9
    ## Before discussing the warning, it's important to show that there
    ## don't appear to be any significant p-values.
    summary(glm.fit)$coefficients
    ##                       Estimate   Std. Error     z value   Pr(>|z|)
    ## (Intercept)      -1.160690e+01 2.505849e+01 -0.46319220 0.64322661
    ## MDVP.Fo.Hz.      -1.089426e-02 3.331969e-02 -0.32696158 0.74369694
    ## MDVP.Fhi.Hz.     -2.726264e-03 7.235185e-03 -0.37680643 0.70631746
    ## MDVP.Flo.Hz.      2.912171e-03 1.757490e-02  0.16570055 0.86839262
    ## MDVP.Jitter...   -1.911796e+03 1.875390e+03 -1.01941238 0.30800723
    ## MDVP.Jitter.Abs. -1.044023e+05 1.331651e+05 -0.78400605 0.43303656
    ## MDVP.RAP          1.317328e+05 2.112747e+05  0.62351431 0.53294660
    ## MDVP.PPQ         -2.087442e+03 2.871501e+03 -0.72695135 0.46725576
    ## Jitter.DDP       -4.196701e+04 7.032359e+04 -0.59677010 0.55066089
    ## MDVP.Shimmer      7.869702e+02 1.670670e+03  0.47105063 0.63760458
    ## MDVP.Shimmer.dB.  3.017484e+01 5.467423e+01  0.55190242 0.58101521
    ## Shimmer.APQ3     -1.172686e+05 1.735091e+05 -0.67586467 0.49912657
    ## Shimmer.APQ5     -4.207959e+02 5.839347e+02 -0.72062152 0.47114241
    ## MDVP.APQ          1.789917e+02 6.171042e+02  0.29005094 0.77177727
    ## Shimmer.DDA       3.860137e+04 5.782218e+04  0.66758765 0.50439684
    ## NHR              -2.292902e+01 9.493736e+01 -0.24151736 0.80915416
    ## HNR               9.805451e-03 2.732710e-01  0.03588178 0.97137662
    ## RPDE             -6.924955e+00 6.403709e+00 -1.08139746 0.27952035
    ## DFA               1.855250e+01 1.263489e+01  1.46835454 0.14200794
    ## spread1           1.497874e+00 2.775528e+00  0.53967174 0.58942343
    ## spread2           1.938950e+01 9.895062e+00  1.95951287 0.05005275
    ## D2                2.768159e+00 2.337841e+00  1.18406642 0.23638677
    ## PPE               2.421996e+01 4.560993e+01  0.53102389 0.59540223
    ## Validation Set Approach to get Test Classification Error:
    glm.prediction <- predict(glm.fit, newdata = x.test, type="response")
    table_mat <- table(data$status[-training_indices], glm.prediction > 0.5)
    colnames(table_mat) <- c("0","1")
    table_mat
    ##    
    ##      0  1
    ##   0 10  4
    ##   1  8 36
    predic <- ifelse(glm.prediction>0.5,1,0)
    vsa.test.error.glm <- mean(predic!=data$status[-training_indices])
    vsa.test.error.glm
    ## [1] 0.2068966
    ## Cross Validation Error:
    set.seed(2645)
    cv.error <- cv.glm(x.train, glm.fit, K=5)$delta
    ## Warning: glm.fit: algorithm did not converge
    
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    ## Warning: glm.fit: algorithm did not converge
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    ## Warning: glm.fit: algorithm did not converge
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    cv.error
    ## [1] 0.2748844 0.2018427
    ## Recalling the Assumptions of Multiple Logistic Regression, and the previous results of viewing the data, this might be due to high correlation between predictors (multicollinearity).
    vif(glm.fit)
    ##      MDVP.Fo.Hz.     MDVP.Fhi.Hz.     MDVP.Flo.Hz.   MDVP.Jitter... 
    ##     1.304402e+01     1.838318e+00     3.522143e+00     1.167191e+02 
    ## MDVP.Jitter.Abs.         MDVP.RAP         MDVP.PPQ       Jitter.DDP 
    ##     4.893332e+01     4.598358e+05     1.154114e+02     4.584093e+05 
    ##     MDVP.Shimmer MDVP.Shimmer.dB.     Shimmer.APQ3     Shimmer.APQ5 
    ##     2.006434e+03     1.811858e+02     7.800041e+06     1.076339e+02 
    ##         MDVP.APQ      Shimmer.DDA              NHR              HNR 
    ##     1.224602e+02     7.796154e+06     5.985399e+00     6.938585e+00 
    ##             RPDE              DFA          spread1          spread2 
    ##     3.627937e+00     3.372692e+00     4.063907e+01     2.524288e+00 
    ##               D2              PPE 
    ##     2.280740e+00     7.391852e+01

    We notice that this is indeed the case: the variables are strongly correlated with each other, as indicated by high VIF values, typically taken to be concerning above 10. Here VIF[MDVP.PPQ] is about 115, and that of Jitter.DDP is about 4.6x10^5, which is extremely high.

    Adding interaction terms to glm.fit might allow us to predict more accurately; however, many of the variables are correlated, and adding interaction terms for each would result in very high complexity, which we want to avoid. Other models deal with this issue more effectively.

    ## Back to the warning: this warning usually appears due to complete separation.
    ## It also arises whenever the fitted probability of an observation is effectively indistinguishable from 1 or 0, which happens when some observations have very extreme values of one or more predictors.
    eps <- 10 * .Machine$double.eps
    glm.resids <- augment(glm.fit) %>%
      mutate(p = 1 / (1 + exp(-.fitted)),
             warning = p > 1-eps)
    arrange(glm.resids, desc(.fitted)) %>%  
      dplyr::select(c(1:4), p, warning) %>% 
      slice(1:10)
    ## # A tibble: 10 x 6
    ##    .rownames status MDVP.Fo.Hz. MDVP.Fhi.Hz.     p warning
    ##    <chr>     <fct>        <dbl>        <dbl> <dbl> <lgl>  
    ##  1 153       1             198.         233.  1.00 FALSE  
    ##  2 149       1             189.         216.  1.00 FALSE  
    ##  3 148       1             184.         217.  1.00 FALSE  
    ##  4 151       1             187.         212.  1.00 FALSE  
    ##  5 90        1             180.         226.  1.00 FALSE  
    ##  6 161       1             114.         124.  1.00 FALSE  
    ##  7 101       1             126.         141.  1.00 FALSE  
    ##  8 20        1             156.         189.  1.00 FALSE  
    ##  9 158       1             118.         134.  1.00 FALSE  
    ## 10 21        1             154.         166.  1.00 FALSE

    Notice that many other observations also give fitted probabilities of essentially 1, so this warning might be a false positive. We temporarily remove the observations flagged above and check for any change in the p-values.

    x.train.glm <- x.train[!glm.resids$warning,]
    glm.fit1 <- glm(status~., data=x.train.glm, family="binomial")
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    summary(glm.fit1)
    ## 
    ## Call:
    ## glm(formula = x.train.glm$status ~ ., family = "binomial", data = x.train.glm)
    ## 
    ## Deviance Residuals: 
    ##      Min        1Q    Median        3Q       Max  
    ## -2.04111   0.00000   0.03459   0.21264   2.00135  
    ## 
    ## Coefficients:
    ##                    Estimate Std. Error z value Pr(>|z|)  
    ## (Intercept)      -1.161e+01  2.506e+01  -0.463   0.6432  
    ## MDVP.Fo.Hz.      -1.089e-02  3.332e-02  -0.327   0.7437  
    ## MDVP.Fhi.Hz.     -2.726e-03  7.235e-03  -0.377   0.7063  
    ## MDVP.Flo.Hz.      2.912e-03  1.758e-02   0.166   0.8684  
    ## MDVP.Jitter...   -1.912e+03  1.875e+03  -1.019   0.3080  
    ## MDVP.Jitter.Abs. -1.044e+05  1.332e+05  -0.784   0.4330  
    ## MDVP.RAP          1.317e+05  2.113e+05   0.624   0.5329  
    ## MDVP.PPQ         -2.087e+03  2.872e+03  -0.727   0.4673  
    ## Jitter.DDP       -4.197e+04  7.032e+04  -0.597   0.5507  
    ## MDVP.Shimmer      7.870e+02  1.671e+03   0.471   0.6376  
    ## MDVP.Shimmer.dB.  3.017e+01  5.467e+01   0.552   0.5810  
    ## Shimmer.APQ3     -1.173e+05  1.735e+05  -0.676   0.4991  
    ## Shimmer.APQ5     -4.208e+02  5.839e+02  -0.721   0.4711  
    ## MDVP.APQ          1.790e+02  6.171e+02   0.290   0.7718  
    ## Shimmer.DDA       3.860e+04  5.782e+04   0.668   0.5044  
    ## NHR              -2.293e+01  9.494e+01  -0.242   0.8092  
    ## HNR               9.805e-03  2.733e-01   0.036   0.9714  
    ## RPDE             -6.925e+00  6.404e+00  -1.081   0.2795  
    ## DFA               1.855e+01  1.263e+01   1.468   0.1420  
    ## spread1           1.498e+00  2.776e+00   0.540   0.5894  
    ## spread2           1.939e+01  9.895e+00   1.960   0.0501 .
    ## D2                2.768e+00  2.338e+00   1.184   0.2364  
    ## PPE               2.422e+01  4.561e+01   0.531   0.5954  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 153.528  on 136  degrees of freedom
    ## Residual deviance:  49.892  on 114  degrees of freedom
    ## AIC: 95.892
    ## 
    ## Number of Fisher Scoring iterations: 9

    We notice the same warning appearing again, so it is indeed due to complete separation.

    This happens when the response separates one or more of the predictors completely: for all values x > c of a predictor X, the response is always 1 (or 0). This can happen either because the predictor is genuinely highly associated with the response, or because of the particular split of the dataset.
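    A quick way to check whether any single predictor separates the training responses in this way is to compare the per-class ranges of each variable (an added sketch; note that separation can also arise from a linear combination of predictors, which this one-variable check will not catch):

    ## Check, predictor by predictor, whether the class ranges fail to overlap on the training rows
    train.df <- data[training_indices,]
    sep.check <- sapply(names(train.df)[names(train.df) != "status"], function(v) {
      r0 <- range(train.df[train.df$status == "0", v])
      r1 <- range(train.df[train.df$status == "1", v])
      r0[2] < r1[1] || r1[2] < r0[1]   # TRUE if the two class ranges do not overlap
    })
    any(sep.check)                     # FALSE would point to separation by a combination of predictors
    names(sep.check)[sep.check]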

    This can be addressed by applying a penalty to the regression model, which we explore in the following sections => Ridge Regression or the Lasso

    Conclusion on Multiple Logistic Regression: It seems that glm.fit has about 80% accuracy in predicting the response. However, the interpretability of the model is very low, as there do not appear to be any significant predictors.

    Additionally, since the predictors are highly correlated, and since there exists complete separation, it's best to consider Ridge Regression or the Lasso, or a compromise between the two: the Elastic Net approach.

    ~~ Ridge, Lasso && Elastic Net Approaches seed(2645)

    set.seed(2645)
    ## Ridge Regression: alpha = 0
    ## Lasso: alpha = 1
    ## Elastic Net: 0 <= alpha <= 1
    ## ~~ Create a list that will fit all the fits from alpha = 0 (ridge) to alpha = 1 (lasso)
    list.of.fits <- list()
    for (i in 0:10){
        alpha = i/10
        fit.name <- paste0("alpha",alpha)
        # Now for each alpha, get the best model fit using CV 
        list.of.fits[[fit.name]] <- 
                  cv.glmnet(x[training_indices,], y[training_indices], type.measure="class", alpha=alpha, family="binomial", nfolds=5, standardize=TRUE)
    }
    ## Warning: from glmnet Fortran code (error code -99); Convergence for 99th lambda
    ## value not reached after maxit=100000 iterations; solutions for larger lambdas
    ## returned
    ## ~~ Find the Classification Error for each of the fits with their best lambda given by CV
    results <- data.frame()
    for (i in 0:10){
      alpha = i/10
      fit.name <- paste0("alpha",alpha)
      # Get the Fit from the list:
      fit.alpha <- list.of.fits[[fit.name]]
      # Calculate the Classification Error for the best fit.
      ##     In this case, I set 's' to "lambda.1se", which is
      ##     the value for lambda that results in the simplest model
      ##     such that the cross validation error is within one 
      ##     standard error of the minimum.
      ##     This offers a negligible increase in bias while decreasing variance.
      fit.prediction <- predict(fit.alpha, s=fit.alpha$lambda.1se, newx=x[-training_indices,], type="response")
      fit.pred <- ifelse(fit.prediction>0.5, 1, 0)
      c.e = mean(fit.pred != y.test)
      # Add to the results
      temp <- data.frame(alpha=alpha, ce=c.e, lambda1se=fit.alpha$lambda.1se, fit.name=fit.name)
      results <- rbind(results, temp)
    }
    results
    ##    alpha        ce  lambda1se fit.name
    ## 1    0.0 0.1551724 0.93976210   alpha0
    ## 2    0.1 0.1551724 0.46772181 alpha0.1
    ## 3    0.2 0.1551724 0.30915035 alpha0.2
    ## 4    0.3 0.1551724 0.22619475 alpha0.3
    ## 5    0.4 0.1551724 0.15457518 alpha0.4
    ## 6    0.5 0.1551724 0.13571685 alpha0.5
    ## 7    0.6 0.1551724 0.13622619 alpha0.6
    ## 8    0.7 0.1724138 0.05547298 alpha0.7
    ## 9    0.8 0.1551724 0.10216964 alpha0.8
    ## 10   0.9 0.1724138 0.09967205 alpha0.9
    ## 11   1.0 0.1724138 0.07447454   alpha1
    ## Extract the best Model with the least Classification Error:
    ## We notice that different fits with different alpha have identical CE
    ## Picking the first one: (lowest alpha => Ridge Regression)
    best.fit <- results[which.min(results[,2]),]
    best.fit
    ##   alpha        ce lambda1se fit.name
    ## 1     0 0.1551724 0.9397621   alpha0
    alpha.best <- as.double(best.fit[1])
    alpha.best
    ## [1] 0
    ce.best <- as.double(best.fit[2])
    ce.best
    ## [1] 0.1551724
    lambda.best <- as.double(best.fit[3])
    lambda.best
    ## [1] 0.9397621
    # Get the final fit of the best model:
    best.full <- glmnet(x,y,alpha=alpha.best, family="binomial", standardize=TRUE)
    plot(best.full, xvar="lambda")

    best.final <- predict(best.full, s=lambda.best, type="coefficients")[1:ncol(data),]
    best.final <- best.final[best.final!=0]
    # This is the best fit we can get with varying values of alpha and lambda simultaneously
    best.final
    ##      (Intercept)      MDVP.Fo.Hz.     MDVP.Fhi.Hz.     MDVP.Flo.Hz. 
    ##     1.500340e-01    -2.267567e-03    -5.405438e-04    -2.009510e-03 
    ##   MDVP.Jitter... MDVP.Jitter.Abs.         MDVP.RAP         MDVP.PPQ 
    ##     6.644355e+00     1.419603e+03     1.139578e+01     1.296685e+01 
    ##       Jitter.DDP     MDVP.Shimmer MDVP.Shimmer.dB.     Shimmer.APQ3 
    ##     3.798194e+00     3.070914e+00     2.725448e-01     5.113419e+00 
    ##     Shimmer.APQ5         MDVP.APQ      Shimmer.DDA              NHR 
    ##     4.512558e+00     3.597775e+00     1.704137e+00     2.855772e-01 
    ##              HNR             RPDE              DFA          spread1 
    ##    -1.188915e-02     4.251241e-01     1.077079e+00     1.170001e-01 
    ##          spread2               D2              PPE 
    ##     1.305050e+00     2.146965e-01     1.277064e+00
    ## Picking a higher alpha (= 0.4): shrinks the coefficients more => decreases variance at the cost of bias
    ## + theoretically increases interpretability (but in this case there's too much collinearity)
    ## In any case, our main goal is prediction, not interpretability
    higher.fit <- results[5,]
    higher.fit
    ##   alpha        ce lambda1se fit.name
    ## 5   0.4 0.1551724 0.1545752 alpha0.4
    alpha.higher<- as.double(higher.fit[1])
    alpha.higher
    ## [1] 0.4
    ce.best.higher <- as.double(higher.fit[2])
    ce.best.higher
    ## [1] 0.1551724
    lambda.best.higher <- as.double(higher.fit[3])
    lambda.best.higher
    ## [1] 0.1545752
    # Get the final fit of the model:
    best.full.higher <- glmnet(x,y,alpha=alpha.higher, family="binomial", standardize = TRUE)
    plot(best.full.higher, xvar="lambda")

    best.final.higher <- predict(best.full.higher, s=lambda.best.higher, type="coefficients")[1:ncol(data),]
    best.final.higher <- best.final.higher[best.final.higher!=0]
    # This is one of the best fit we can get with varying values of alpha and lambda simultaneously
    best.final.higher
    ##  (Intercept)  MDVP.Fo.Hz. MDVP.Flo.Hz. MDVP.Shimmer Shimmer.APQ3     MDVP.APQ 
    ##  2.144934353 -0.002716660 -0.001694658  2.157631564  0.539297646  1.291374129 
    ##  Shimmer.DDA      spread1      spread2           D2          PPE 
    ##  0.177211729  0.356297312  2.306403641  0.220409887  3.283032701
    ## Picking the highest alpha that keeps the minimum CE (= 0.8): shrinks the coefficients more => decreases variance at the cost of bias
    ## + theoretically increases interpretability (but in this case there's too much collinearity)
    ## In any case, our main goal is prediction, not interpretability
    highest.fit <- results[9,]
    highest.fit
    ##   alpha        ce lambda1se fit.name
    ## 9   0.8 0.1551724 0.1021696 alpha0.8
    alpha.high <- as.double(highest.fit[1])
    alpha.high
    ## [1] 0.8
    ce.best.high <- as.double(highest.fit[2])
    ce.best.high
    ## [1] 0.1551724
    lambda.best.high <- as.double(highest.fit[3])
    lambda.best.high
    ## [1] 0.1021696
    # Get the final fit of the model:
    best.full.high <- glmnet(x,y,alpha=alpha.high, family="binomial", standardize = TRUE)
    plot(best.full.high, xvar="lambda")

    best.final.high <- predict(best.full.high, s=lambda.best.high, type="coefficients")[1:ncol(data),]
    best.final.high <- best.final.high[best.final.high!=0]
    # This is one of the best fit we can get with varying values of alpha and lambda simultaneously
    best.final.high
    ##   (Intercept)   MDVP.Fo.Hz.  MDVP.Flo.Hz.       spread1       spread2 
    ##  3.8970344984 -0.0004394087 -0.0001357522  0.6058558631  1.7421158249 
    ##           PPE 
    ##  2.8726253684

    Conclusion: We see that there is a huge coefficient in best.final, which represents alpha=0 (Ridge Regression); it corresponds to MDVP.Jitter.Abs.
    <!> This is most probably because of collinearity.

    We notice that when we choose the highest alpha that keeps the classification error at its minimum, we get fewer predictors and smaller coefficients.

    In theory, on this data set, both derived models, the Ridge Regression based Model (alpha=0) and the Elastic Net based Model (alpha=0.8) have the same predictive power as they both have CE=0.1551724. This means both have better predictive power than GLM which had CE~0.2
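    Since several fits tie on raw classification error and the classes are imbalanced, it may also be worth comparing sensitivity and specificity on the held-out set. A sketch using caret's confusionMatrix() on two of the fits built above (treating status = 1, PD, as the positive class):

    ## Confusion matrices for the alpha = 0 and alpha = 0.8 fits on the held-out set
    for (a in c("alpha0", "alpha0.8")){
      fit <- list.of.fits[[a]]
      pred.class <- predict(fit, s=fit$lambda.1se, newx=x[-training_indices,], type="class")
      print(a)
      print(confusionMatrix(factor(pred.class, levels=levels(y.test)), y.test, positive="1"))
    }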

    Repeating the same process using a different seed in the next tab.

    ~~ Ridge, Lasso && Elastic Net Approaches seed(20)

    set.seed(20)
    training_indices.diff.seed <- createDataPartition(data$status, p=0.7, list=FALSE)
    
    x.train.diff.seed <- data[training_indices.diff.seed,]
    y.train.diff.seed <- data$status[training_indices.diff.seed]
    
    x.test.diff.seed  <- data[-training_indices.diff.seed,]
    y.test.diff.seed  <- data$status[-training_indices.diff.seed]
    ## Ridge Regression: alpha = 0
    ## Lasso: alpha = 1
    ## Elastic: 0 =< alpha =< 1
    ## ~~ Create a list that will fit all the fits from alpha = 0 (ridge) to alpha = 1 (lasso)
    list.of.fits.diff.seed <- list()
    for (i in 0:10){
        alpha = i/10
        fit.name <- paste0("alpha",alpha)
        # Now for each alpha, get the best model fit using CV 
        list.of.fits.diff.seed[[fit.name]] <- 
                  cv.glmnet(x[training_indices.diff.seed,], y[training_indices.diff.seed], type.measure="class", alpha=alpha, family="binomial", nfolds=5, standardize=TRUE)
    }
    
    ## ~~ Find the Classification Error for each of the fits with their best lambda given by CV
    results.diff.seed <- data.frame()
    for (i in 0:10){
      alpha = i/10
      fit.name <- paste0("alpha",alpha)
      # Get the Fit from the list:
      fit.alpha <- list.of.fits.diff.seed[[fit.name]]
      # Calculate the Classification Error for the best fit.
      ##     In this case, I set 's' to "lambda.1se", which is
      ##     the value for lambda that results in the simplest model
      ##     such that the cross validation error is within one 
      ##     standard error of the minimum.
      ##     This offers a negligible increase in bias while decreasing variance.
      fit.prediction <- predict(fit.alpha, s=fit.alpha$lambda.1se, newx=x[-training_indices.diff.seed,], type="response")
      fit.pred <- ifelse(fit.prediction>0.5, 1, 0)
      c.e = mean(fit.pred != y.test.diff.seed)
      # Add to the results
      temp <- data.frame(alpha=alpha, ce=c.e, lambda1se=fit.alpha$lambda.1se, fit.name=fit.name)
      results.diff.seed <- rbind(results.diff.seed, temp)
    }
    results.diff.seed
    ##    alpha        ce   lambda1se fit.name
    ## 1    0.0 0.1724138 0.483747683   alpha0
    ## 2    0.1 0.1551724 0.289999116 alpha0.1
    ## 3    0.2 0.1724138 0.278096140 alpha0.2
    ## 4    0.3 0.1724138 0.223311857 alpha0.3
    ## 5    0.4 0.1551724 0.152605085 alpha0.4
    ## 6    0.5 0.1724138 0.161387953 alpha0.5
    ## 7    0.6 0.1551724 0.111655928 alpha0.6
    ## 8    0.7 0.1724138 0.079456040 alpha0.7
    ## 9    0.8 0.1034483 0.001056704 alpha0.8
    ## 10   0.9 0.1724138 0.089659974 alpha0.9
    ## 11   1.0 0.1724138 0.088561540   alpha1
    ## Extract the best Model with the least Classification Error:
    best.fit.diff.seed <- results.diff.seed[which.min(results.diff.seed[,2]),]
    best.fit.diff.seed
    ##   alpha        ce   lambda1se fit.name
    ## 9   0.8 0.1034483 0.001056704 alpha0.8
    alpha.best.diff.seed <- as.double(best.fit.diff.seed[1])
    alpha.best.diff.seed
    ## [1] 0.8
    ce.best.diff.seed <- as.double(best.fit.diff.seed[2])
    ce.best.diff.seed
    ## [1] 0.1034483
    lambda.best.diff.seed <- as.double(best.fit.diff.seed[3])
    lambda.best.diff.seed
    ## [1] 0.001056704
    # Get the final fit of the best model:
    best.full.diff.seed <- glmnet(x,y,alpha=alpha.best.diff.seed, family="binomial", standardize=TRUE)
    plot(best.full.diff.seed, xvar="lambda")

    best.final.diff.seed <- predict(best.full.diff.seed, s=lambda.best.diff.seed, type="coefficients")[1:ncol(data),]
    best.final.diff.seed <- best.final.diff.seed[best.final.diff.seed!=0]
    # This is the best fit we can get with varying values of alpha and lambda simultaneously for the new seed
    best.final.diff.seed
    ##      (Intercept)      MDVP.Fo.Hz.     MDVP.Fhi.Hz.     MDVP.Flo.Hz. 
    ##    -1.577741e+00    -1.398541e-02    -2.103912e-03    -1.137205e-04 
    ##   MDVP.Jitter... MDVP.Jitter.Abs.         MDVP.RAP         MDVP.PPQ 
    ##    -5.029867e+02    -4.228313e+04     8.993404e+02    -7.103324e+02 
    ##       Jitter.DDP     Shimmer.APQ3         MDVP.APQ      Shimmer.DDA 
    ##     2.806904e+02    -6.283288e+01     1.892247e+02    -1.748911e+01 
    ##              HNR             RPDE              DFA          spread1 
    ##     6.973688e-02    -3.752634e+00     2.923742e+00     8.390816e-01 
    ##          spread2               D2              PPE 
    ##     8.096619e+00     1.626175e+00     1.599997e+01
    # Compare to best.final(alpha=0, ce=0.15)
    best.final
    ##      (Intercept)      MDVP.Fo.Hz.     MDVP.Fhi.Hz.     MDVP.Flo.Hz. 
    ##     1.500340e-01    -2.267567e-03    -5.405438e-04    -2.009510e-03 
    ##   MDVP.Jitter... MDVP.Jitter.Abs.         MDVP.RAP         MDVP.PPQ 
    ##     6.644355e+00     1.419603e+03     1.139578e+01     1.296685e+01 
    ##       Jitter.DDP     MDVP.Shimmer MDVP.Shimmer.dB.     Shimmer.APQ3 
    ##     3.798194e+00     3.070914e+00     2.725448e-01     5.113419e+00 
    ##     Shimmer.APQ5         MDVP.APQ      Shimmer.DDA              NHR 
    ##     4.512558e+00     3.597775e+00     1.704137e+00     2.855772e-01 
    ##              HNR             RPDE              DFA          spread1 
    ##    -1.188915e-02     4.251241e-01     1.077079e+00     1.170001e-01 
    ##          spread2               D2              PPE 
    ##     1.305050e+00     2.146965e-01     1.277064e+00
    ce.best
    ## [1] 0.1551724
    # Compare to best.final.high(alpha=0.8, ce=0.15)
    best.final.high
    ##   (Intercept)   MDVP.Fo.Hz.  MDVP.Flo.Hz.       spread1       spread2 
    ##  3.8970344984 -0.0004394087 -0.0001357522  0.6058558631  1.7421158249 
    ##           PPE 
    ##  2.8726253684
    ce.best.high
    ## [1] 0.1551724

    This difference indicates that there is high variation in the dataset. With the new seed we were able to train and test in a way that gave a smaller classification error (0.10344 compared to 0.15517), which would suggest greater predictive power. But this only applies to the particular train/test split that was taken.

    With the new seed, MDVP.Jitter.Abs. still has an extremely large coefficient, but it has flipped to negative! This suggests that the coefficient is extreme not because the predictor is especially important for the response, but rather (or at least partly) because of its collinearity with other predictors.

    ~~ Final Conclusion about the previous Models

    Although these models offer very low interpretability, they offer decent prediction accuracy (85% on the first attempt, and 90% on the second).
    However, having very large coefficients in the model will probably lead to failures when predicting on future patients. When one predictor takes all the credit from other predictors due to collinearity, predictions become more prone to failure.

    One of the models, ‘best.final.high’ with alpha=0.8, was able to shrink the coefficients while keeping 85% prediction accuracy, although it eliminated many predictors. Using an in-between alpha (=0.4) resulted in a better-balanced model, with the same CE but more predictors with plausible coefficients.
    Although the same alpha=0.8 on a different seed resulted in a CE of 0.10344 in ‘best.final.diff.seed’, that model has large coefficient values and is more prone to failure on different datasets.

    In general, fitting Ridge or Lasso Regression or the Elastic Net faced too much variability due to the sample size and the substantial collinearity. We consider different approaches, such as LDA and QDA plus stepwise selection, that may be better able to deal with these issues.

    [4.1.b] ~ Linear Discriminant Analysis (LDA)

    set.seed(2645)
    d.train <- data[training_indices,]
    ## 5-fold CV on the training split
    folds <- createFolds(d.train$status, k=5, returnTrain = TRUE)
    ce_vector <- vector()
    for (fold in folds){
        train = fold
        # Column 5 (MDVP.Jitter.Abs.) is dropped; see the note below
        lda.fit <- lda(status~., data=d.train[,-5], subset=train)
        lda.prediction <- predict(lda.fit, newdata=d.train[-train,-5])
        c.e = mean(lda.prediction$class != d.train$status[-train])
        ce_vector <- c(ce_vector,c.e)
      }
    mce <- mean(ce_vector)
    mce
    ## [1] 0.1724138
    ## Error in lda.default(x, grouping, ...) : 
    ## variable 5 appears to be constant within groups
    ## This Variable Refers to MDVP.Jitter.Abs. It appears to be constant due to scale.
    ## Since there are other variables that represent MDVP.Jitter.Abs,
    ##    it was safe to remove the variable. (tested, check Update #2)
    
    ## Mean classification error is 0.1237 < prior classification errors for Ridge and Lasso (~0.15)
    
    ## We choose to subset the data manually by biological similarities between predictors.
    sub.data <- data[,-c(2,3,5,6,7,8,10,11,12,13,14,15,16,20,21)]
    d.train.sub <- sub.data[training_indices,]
    folds <- createFolds(d.train.sub$status, k=5, returnTrain = TRUE)
    ce_vector <- vector()
    for (fold in folds){
        train = fold
        lda.fit <- lda(status~., data=d.train.sub, subset=train)
        lda.prediction <- predict(lda.fit, newdata=d.train.sub[-train,])
        c.e = mean(lda.prediction$class != d.train.sub$status[-train])
        ce_vector <- c(ce_vector,c.e)
      }
    mce <- mean(ce_vector)
    mce
    ## [1] 0.1724138
    ## We observe the same classification error; subsetting the data manually doesn't result in any progress
    
    ## It also suggests that stepwise subset selection will not result in any change, but we do it for the sake of demonstration
    lda.fit <- lda(status~., data=data[,-5], subset=training_indices)
    
    lda.sub <- stepclass(data[,-5], status, direction="both", method="lda", k=5, trace=-1)
    ##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
    ## 195 observations of 22 variables in 2 classes; direction: both
    ## stop criterion: improvement less than 5%.
    ## Warning in cv.rate(vars = c(model, tryvar), data = data, grouping = grouping, :
    ## error(s) in modeling/prediction step
    ## correctness rate: 0.85105;  in: "spread1";  variables (1): spread1
    ## Warning in cv.rate(vars = c(model, tryvar), data = data, grouping = grouping, :
    ## error(s) in modeling/prediction step
    ## 
    ##  hr.elapsed min.elapsed sec.elapsed 
    ##        0.00        0.00        0.71
    ## Refit LDA using only the selected variable, spread1
    lda.fit <- lda(status~spread1, data=data, subset=training_indices)
    lda.prediction <- predict(lda.fit, newdata=data[-training_indices,])
    c.e = mean(lda.prediction$class != data$status[-training_indices])
    c.e 
    ## [1] 0.1724138
    ## We see only one variable selected: spread1, but it is clear that other variables have an effect on the response status.
    ## Also the CE stays the same.

    Taking a quick look back at Logistic Regression with manual subset selection

    set.seed(2645)
    # Logistic Regression
    glm.fit.sub <- glm(status~., data=sub.data, subset=training_indices, family="binomial")
    summary(glm.fit.sub)
    ## 
    ## Call:
    ## glm(formula = status ~ ., family = "binomial", data = sub.data, 
    ##     subset = training_indices)
    ## 
    ## Deviance Residuals: 
    ##      Min        1Q    Median        3Q       Max  
    ## -2.33000   0.00306   0.11762   0.38242   1.61681  
    ## 
    ## Coefficients:
    ##                  Estimate Std. Error z value Pr(>|z|)   
    ## (Intercept)     -16.66009    8.42515  -1.977  0.04799 * 
    ## MDVP.Fo.Hz.      -0.01443    0.01090  -1.324  0.18537   
    ## MDVP.Jitter... -196.58493  163.12848  -1.205  0.22817   
    ## MDVP.Shimmer     76.45980   49.88588   1.533  0.12535   
    ## RPDE             -3.85919    4.28421  -0.901  0.36770   
    ## DFA              11.20148    7.73733   1.448  0.14770   
    ## D2                4.10210    1.49296   2.748  0.00600 **
    ## PPE              23.24357    8.81603   2.637  0.00838 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 153.528  on 136  degrees of freedom
    ## Residual deviance:  74.362  on 129  degrees of freedom
    ## AIC: 90.362
    ## 
    ## Number of Fisher Scoring iterations: 7
    ## VIF: Check if any are still collinear
    vif(glm.fit.sub)
    ##    MDVP.Fo.Hz. MDVP.Jitter...   MDVP.Shimmer           RPDE            DFA 
    ##       2.212201       1.641817       1.619375       2.253123       1.551336 
    ##             D2            PPE 
    ##       1.675813       2.118685
    ## Validation Set Approach to get Test Classification Error:
    glm.prediction.sub <- predict(glm.fit.sub, newdata = sub.data[-training_indices,], type="response")
    table_mat <- table(data$status[-training_indices], glm.prediction.sub > 0.5)
    colnames(table_mat) <- c("0","1")
    table_mat
    ##    
    ##      0  1
    ##   0  8  6
    ##   1  4 40
    predic <- ifelse(glm.prediction.sub>0.5,1,0)
    vsa.test.error.glm <- mean(predic!=data$status[-training_indices])
    vsa.test.error.glm
    ## [1] 0.1724138
    ## Cross Validation Error:
    cv.error <- cv.glm(sub.data[training_indices,], glm.fit.sub, K=5)$delta
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    ## Warning in y - yhat: longer object length is not a multiple of shorter object
    ## length
    
    ## Warning in y - yhat: longer object length is not a multiple of shorter object
    ## length
    
    ## Warning in y - yhat: longer object length is not a multiple of shorter object
    ## length
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    ## Warning in y - yhat: longer object length is not a multiple of shorter object
    ## length
    
    ## Warning in y - yhat: longer object length is not a multiple of shorter object
    ## length
    cv.error
    ## [1] 0.4031160 0.1729946
    ## We observe the same problem concerning complete separation.
    ## However now we have better corrected cv.error, closer to Elastic Net
    ## We also see significant predictors, but surely, the ones excluded might be significant as well.
    ## The cv.error that is not bias corrected is high, but the bias corrected cv.error is best so far.
    ## However, this approach is VERY high in bias due to the manual subset selection.

    Unfortunately, but as expected, due to collinearity LDA did not offer any major improvement, even with subset selection, whether manual or stepwise.
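    One way to hand LDA uncorrelated inputs, sketched below, is to project the predictors onto their principal components first and fit the discriminant on the leading components. This is an added illustration, not part of the original analysis, and keeping 8 components is an arbitrary choice rather than a tuned one:

    ## Sketch: PCA followed by LDA, using the same train/test split as above
    pr <- prcomp(data[training_indices,-17], center=TRUE, scale.=TRUE)
    n.comp <- 8                                   # arbitrary number of components, for illustration
    train.pc <- data.frame(status=y.train, pr$x[,1:n.comp])
    test.pc  <- data.frame(predict(pr, newdata=data[-training_indices,-17])[,1:n.comp])
    lda.pc   <- lda(status~., data=train.pc)
    pred.pc  <- predict(lda.pc, newdata=test.pc)$class
    mean(pred.pc != y.test)                       # held-out classification error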

    [4.1.c] ~ Quadratic Discriminant Analysis (QDA)

    set.seed(2645)
    qda.fit <- qda(status~., data=data, subset=training_indices)
    qda.prediction <- predict(qda.fit, newdata=data[-training_indices,])
    c.e = mean(qda.prediction$class != data$status[-training_indices])
    c.e
    ## [1] 0.1724138
    ## No change in CE. QDA doesn't offer any advantage over LDA in this case.

    As expected, since both QDA and LDA fail to deal with collinear predictors, we couldn’t improve our prediction accuracy way beyond previous models.

    So far, Elastic Net Approach with alpha=0.4 and a cross validated Lambda.1se was able to reach prediction accuracy close to 85% while maintaining a good number of predictors (fair bias-variance trade off)

    So far, it is clear that essentially all the predictor groups carry useful information, and that the predictors within a group are largely interchangeable: a single predictor from each biological “category” can be used in a model and we would still get roughly the same results.

    Ex: Instead of using NHR and HNR both, we could use only one of them.

    [4.1.d] ~ K-Nearest Neighbors (KNN)

    set.seed(2645)
    
    ce_vector <- vector()
    for (K in seq(from=2, to=30, by=1)){
      knn.pred <- knn(data[training_indices, -17], data[-training_indices, -17], data$status[training_indices], k=K)
      c.e <- mean(knn.pred!=data$status[-training_indices])
      ce_vector <- c(ce_vector, c.e)
    }
    ce_vector
    ##  [1] 0.1896552 0.1034483 0.1551724 0.1551724 0.1724138 0.1206897 0.1724138
    ##  [8] 0.1551724 0.2068966 0.1896552 0.2068966 0.1896552 0.2068966 0.2068966
    ## [15] 0.2241379 0.2068966 0.2068966 0.2068966 0.2068966 0.2068966 0.2241379
    ## [22] 0.2068966 0.2068966 0.2068966 0.2241379 0.2068966 0.2068966 0.2068966
    ## [29] 0.2241379
    ## We see the best KNN prediction is at K=3, with a CE of about 0.103, and performance then deteriorates as K increases.

    From the first section, the visualization of the data, we can already tell that the class boundary is closer to linear than to something highly non-linear, which is why KNN (beyond small K) fails to give better predictions than LDA and QDA.
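    KNN is also sensitive to the very different scales of these predictors (frequencies in the hundreds of Hz next to jitter values near zero), so a variant worth trying, sketched below, standardizes the predictors before repeating the same sweep over K (an added check, not part of the original run):

    ## Sketch: the same K sweep on standardized predictors (column 17 is status)
    x.scaled <- scale(data[,-17])
    ce_vector.scaled <- vector()
    for (K in seq(from=2, to=30, by=1)){
      knn.pred <- knn(x.scaled[training_indices,], x.scaled[-training_indices,],
                      data$status[training_indices], k=K)
      ce_vector.scaled <- c(ce_vector.scaled, mean(knn.pred != data$status[-training_indices]))
    }
    ce_vector.scaled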

    [4.2.a] ~ Decision Trees & Pruning

    # 1. Construct the Tree
    tree.simple <- tree(status~., data=data, subset=training_indices)
    
    # 2. Predict using the test set
    tree.pred = predict(tree.simple, data[-training_indices,],type="class")
    
    # 3. Construct the Table:
    table(tree.pred, data$status[-training_indices])
    ##          
    ## tree.pred  0  1
    ##         0  8  2
    ##         1  6 42
    # 4. Get the Classificiation Error:
    simple.ce <- mean(tree.pred!=data$status[-training_indices])
    simple.ce
    ## [1] 0.137931
    ## Pruning with Cross Validation
    # 5. Prune the Simple Tree
    cv.pr.tree = cv.tree(tree.simple, FUN=prune.misclass)
    
    # 6. Observe the results
    cv.pr.tree
    ## $size
    ## [1] 7 5 4 2 1
    ## 
    ## $dev
    ## [1] 17 19 23 21 35
    ## 
    ## $k
    ## [1] -Inf    0    2    3   18
    ## 
    ## $method
    ## [1] "misclass"
    ## 
    ## attr(,"class")
    ## [1] "prune"         "tree.sequence"
    par(mfrow=c(1,2))
    plot(cv.pr.tree$size, cv.pr.tree$dev,type="b")
    plot(cv.pr.tree$k, cv.pr.tree$dev,type="b")

    # 7. Get the tree size that results in the minimum CV error (dev)
    best_size <- cv.pr.tree$size[which.min(cv.pr.tree$dev)]
    
    # 8. Prune the simple tree again with size = best_size
    prune.full = prune.misclass(tree.simple, best=best_size)
    
    # 9. View the plots:
    plot(prune.full)
    text(prune.full, pretty=0)
    
    ## Retest the pruned tree:
    tree.pred=predict(prune.full, newdata=data[-training_indices,] , type="class")
    table(tree.pred, data$status[-training_indices])
    ##          
    ## tree.pred  0  1
    ##         0  8  0
    ##         1  6 44
    prune.ce <- mean(tree.pred!=data$status[-training_indices])
    prune.ce
    ## [1] 0.1034483
    #prune.full

    Clearly, the pruned tree has higher prediction accuracy than the simple tree: 89.7% > 86.2%.
    Also, note that this prediction accuracy is very comparable to the previous best models: the elastic net fits (alpha=0.4 with CE~0.155, or alpha=0.8 with CE~0.10).

    However, the pruned tree selects only two variables: MDVP.Fhi.Hz. and spread2.
    This might cause bad predictions on a different dataset, where the variables might not be as highly correlated as they are here.
    We already know from previous sections that MDVP.Fhi.Hz. is highly correlated with MDVP.Fo.Hz. and MDVP.Flo.Hz., as they are biologically related.

    [4.2.b] ~ Bagging, Random Forests and Boosting

    set.seed(2645)
    ## Perform Bagging: mtry = number of predictors
    ## Perform Random Forests: mtry = sqrt(p) for classification
    oob.values <- matrix(data=NA, nrow=ncol(data)-1, ncol=4)
    mce.values <- matrix(data=NA, nrow=ncol(data)-1, ncol=4)
    for(i in 1:(ncol(data)-1)) {
      for(n in c(250,500,750,1000)){
        temp.model <- randomForest(status ~ ., data=data,
                                 importance=TRUE, subset=training_indices, mtry=i, ntree=n)
      # Get the test misclassification error (mce)
      mce.values[i,n/250] <- mean(data$status[-training_indices]!=predict(temp.model,newdata=data[-training_indices,]))
      # Get the out-of-bag (oob) error
      oob.values[i,n/250] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
      }
    }
    
    oob.values
    ##            [,1]      [,2]      [,3]      [,4]
    ##  [1,] 0.1167883 0.1094891 0.1386861 0.1167883
    ##  [2,] 0.1386861 0.1240876 0.1386861 0.1313869
    ##  [3,] 0.1240876 0.1240876 0.1240876 0.1167883
    ##  [4,] 0.1313869 0.1240876 0.1313869 0.1386861
    ##  [5,] 0.1094891 0.1167883 0.1167883 0.1386861
    ##  [6,] 0.1386861 0.1240876 0.1313869 0.1313869
    ##  [7,] 0.1313869 0.1386861 0.1313869 0.1313869
    ##  [8,] 0.1313869 0.1240876 0.1313869 0.1313869
    ##  [9,] 0.1386861 0.1313869 0.1313869 0.1313869
    ## [10,] 0.1313869 0.1313869 0.1386861 0.1386861
    ## [11,] 0.1240876 0.1386861 0.1167883 0.1313869
    ## [12,] 0.1386861 0.1386861 0.1386861 0.1386861
    ## [13,] 0.1167883 0.1240876 0.1313869 0.1313869
    ## [14,] 0.1240876 0.1240876 0.1386861 0.1240876
    ## [15,] 0.1240876 0.1167883 0.1313869 0.1167883
    ## [16,] 0.1313869 0.1167883 0.1313869 0.1094891
    ## [17,] 0.1167883 0.1240876 0.1167883 0.1240876
    ## [18,] 0.1386861 0.1313869 0.1313869 0.1167883
    ## [19,] 0.1240876 0.1094891 0.1313869 0.1240876
    ## [20,] 0.1459854 0.1240876 0.1240876 0.1167883
    ## [21,] 0.1313869 0.1240876 0.1240876 0.1167883
    ## [22,] 0.1313869 0.1240876 0.1240876 0.1240876
    mce.values
    ##             [,1]       [,2]      [,3]       [,4]
    ##  [1,] 0.12068966 0.10344828 0.1034483 0.12068966
    ##  [2,] 0.06896552 0.08620690 0.1206897 0.08620690
    ##  [3,] 0.08620690 0.06896552 0.0862069 0.06896552
    ##  [4,] 0.08620690 0.06896552 0.0862069 0.06896552
    ##  [5,] 0.10344828 0.08620690 0.0862069 0.08620690
    ##  [6,] 0.08620690 0.08620690 0.0862069 0.08620690
    ##  [7,] 0.08620690 0.08620690 0.0862069 0.08620690
    ##  [8,] 0.08620690 0.08620690 0.0862069 0.08620690
    ##  [9,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [10,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [11,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [12,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [13,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [14,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [15,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [16,] 0.10344828 0.08620690 0.0862069 0.08620690
    ## [17,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [18,] 0.08620690 0.10344828 0.0862069 0.08620690
    ## [19,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [20,] 0.08620690 0.08620690 0.0862069 0.08620690
    ## [21,] 0.10344828 0.08620690 0.0862069 0.08620690
    ## [22,] 0.08620690 0.10344828 0.0862069 0.08620690
    ## Bagging: mtry = 22 (all predictors), ntree = 250
    oob.values[22, 1]
    ## [1] 0.1313869
    mce.values[22, 1]
    ## [1] 0.0862069
    ## Random Forest: mtry = 4 ~ sqrt(p), ntree = 500
    oob.values[4,2]
    ## [1] 0.1240876
    mce.values[4,2]
    ## [1] 0.06896552
    mce.values.try1 <- vector(length=length(seq(from=600, to=900, by=25)))
    for(n in seq(from=600, to=900, by=25)){
        temp.model <- randomForest(status ~ ., data=data,
                                 importance=TRUE, subset=training_indices, mtry=4, ntree=n)
      # Get the test misclassification error (mce)
      mce.values.try1[(n/25)-23] <- mean(data$status[-training_indices]!=predict(temp.model,
                                                                          newdata=data[-training_indices,]))
    }
    mce.values.try1
    ##  [1] 0.06896552 0.08620690 0.08620690 0.08620690 0.08620690 0.08620690
    ##  [7] 0.06896552 0.10344828 0.06896552 0.08620690 0.08620690 0.06896552
    ## [13] 0.06896552

    Clearly, and not surprisingly, the best mce value is obtained for mtry=4. Around 600 trees were enough to reach that optimal prediction accuracy.

    Why mtry=4? Because the predictors are highly correlated; by sampling only a few predictors (mtry ~ sqrt(p)) at each split, random forests can deal with collinear predictors.

    Then, we explored the ntree values that would further optimize the prediction accuracy, and it appears that performance starts to degrade after ntree~750.

    Our best fit between bagging and random forests is a random forest with mtry=4, resulting in mce=0.06896, which is the lowest prediction error so far.

    => The random forest is able to reach about 93.1% prediction accuracy while handling collinearity.
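    Since the forests above were grown with importance=TRUE, it is also worth looking at which voice measures drive the predictions. A sketch refitting the chosen configuration (mtry=4, ntree=600); the exact ranking will vary slightly between runs:

    ## Variable importance for the selected random forest
    set.seed(2645)
    rf.best <- randomForest(status ~ ., data=data, subset=training_indices,
                            mtry=4, ntree=600, importance=TRUE)
    importance(rf.best)[, "MeanDecreaseGini"]
    varImpPlot(rf.best, main="Random Forest Variable Importance")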

    ## Boosting:
    #boost.data=gbm(data$status~.,data=data[training_indices ,],
                     #distribution="bernoulli", n.trees=750, interaction.depth=10, cv=5)
    
    ## Error in model.frame.default(formula = data$status ~ ., data = data[training_indices, : variable lengths differ (found for 'MDVP.Fo.Hz.')
    
    #summary(boost.data)
    
    ## Error in h(simpleError(msg, call)) : error in evaluating the argument 'object' in selecting a method for function 'summary': object 'boost.data' not found
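    The gbm error above comes from mixing the data$status ~ . formula with the subsetted data frame, and from passing a factor response where distribution="bernoulli" expects a numeric 0/1. A hedged sketch of how the boosted model could be fit instead (n.trees and interaction.depth mirror the commented-out attempt and are not tuned; cv.folds replaces the invalid cv argument):

    ## Sketch: boosting on the training split with a numeric 0/1 response
    boost.train <- data[training_indices,]
    boost.train$status <- as.numeric(as.character(boost.train$status))
    set.seed(2645)
    boost.data <- gbm(status ~ ., data=boost.train, distribution="bernoulli",
                      n.trees=750, interaction.depth=10, cv.folds=5)
    best.iter <- gbm.perf(boost.data, method="cv")        # CV-selected number of trees
    boost.prob <- predict(boost.data, newdata=data[-training_indices,],
                          n.trees=best.iter, type="response")
    mean(ifelse(boost.prob > 0.5, 1, 0) != data$status[-training_indices])   # test classification error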

    Models Summary



    | Model | Classification Error | Prediction Accuracy (%) | Pros | Cons |
    |---|---|---|---|---|
    | Multiple Logistic Regression: glm, 5-fold CV | 0.2018427 | 79.82 | Simple to implement | Low prediction accuracy, high variance |
    | MLR with manual subsetting of data: glm, 5-fold CV | 0.1724138 | 82.76 | Less variance and higher prediction accuracy | Possible increase in bias due to manual predictor selection |
    | Ridge Regression: glmnet, alpha=0 | 0.1551724 | 84.48 | Fair prediction accuracy | High variance: no predictor selection performed |
    | Lasso: glmnet, alpha=1 | NA | NA | NA | Not selected due to lower prediction accuracy than the Elastic Net |
    | Elastic Net: glmnet, alpha=0.8 | 0.1034483 | 89.66 | High prediction accuracy | Possible increase in bias, but compensated for by the multicollinearity present |
    | Elastic Net: glmnet, alpha=0.4 | 0.1551724 | 84.48 | Balance between bias and variance | Pros come at the cost of prediction accuracy on this dataset |
    | Linear Discriminant Analysis: lda, 5-fold CV | 0.1724138 | 82.76 | Simple to implement | Low prediction accuracy, identical to glm with manual data subsetting |
    | Quadratic Discriminant Analysis: qda, 5-fold CV | 0.1724138 | 82.76 | Simple to implement | Low prediction accuracy |
    | K-Nearest Neighbors: knn, K=3 | 0.1034483 | 89.66 | High prediction accuracy, simple to implement | At small K, gives results very similar to a linear model |
    | Simple Decision Tree: tree | 0.1379310 | 86.21 | Fair prediction accuracy | Could be improved by pruning and other methods |
    | Pruned Tree: cv.tree, prune.misclass | 0.1034483 | 89.66 | High prediction accuracy | Pruning is highly affected by multicollinearity |
    | Bagging: randomForest, mtry=p, ntree=250-1000 | 0.0862069 | 91.38 | Offers a middle ground between random forests and pruned trees | Resamples the data but uses all predictors at each split, so multicollinearity remains an issue |
    | Random Forest: randomForest, mtry=sqrt(p), ntree=250-1000 | 0.06896552 | 93.10 | Highest prediction accuracy | Higher computational cost |
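    As a closing usage sketch, the selected random forest can be wrapped into a simple screening step for a new voice recording. Here new.voice is just the first held-out recording, standing in for a hypothetical new patient with the same 22 acoustic measurements; in a real screening setting the 0.5 probability threshold could be lowered deliberately to favour sensitivity:

    ## Sketch: screening a new (hypothetical) voice sample with the selected random forest
    set.seed(2645)
    rf.final <- randomForest(status ~ ., data=data[training_indices,], mtry=4, ntree=600)
    new.voice <- data[-training_indices, -17][1, ]        # stand-in for a new patient's measurements
    predict(rf.final, newdata=new.voice, type="prob")     # class probabilities (healthy vs PD)
    predict(rf.final, newdata=new.voice, type="response") # 0/1 screening decision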