Context: Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.
Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnosis PD, this would be an effective screening step prior to an appointment with a clinician.
Project Task: Develop an appropriate model that will effectively screen patients for Parkinson’s Disease using their different measurements of speech.
Data and Attributes Information can be found here: Parkinson’s Disease Dataset
Original Paper and Researchers: BioMedical Engineering OnLine
library(boot)
library(GGally)
library(car)
library(glmnet)
library(caret)
library(dplyr)
library(broom)
library(MASS)
library(class)
library(klaR)
library(tree)
library(randomForest)
library(gbm)
library(pls)
set.seed(2645)
# Data is downloaded using the Link access above
data <- read.csv("Parkinson_Data.csv")
## Check for NA Values:
## The Dataset doesn't seem to contain any NA Values
which(is.na(data)==TRUE)
## integer(0)
## Checking a sample of the Dataset
head(data)
## name MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
## 1 phon_R01_S01_1 119.992 157.302 74.997 0.00784
## 2 phon_R01_S01_2 122.400 148.650 113.819 0.00968
## 3 phon_R01_S01_3 116.682 131.111 111.555 0.01050
## 4 phon_R01_S01_4 116.676 137.871 111.366 0.00997
## 5 phon_R01_S01_5 116.014 141.781 110.655 0.01284
## 6 phon_R01_S01_6 120.552 131.162 113.787 0.00968
## MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB.
## 1 0.00007 0.00370 0.00554 0.01109 0.04374 0.426
## 2 0.00008 0.00465 0.00696 0.01394 0.06134 0.626
## 3 0.00009 0.00544 0.00781 0.01633 0.05233 0.482
## 4 0.00009 0.00502 0.00698 0.01505 0.05492 0.517
## 5 0.00011 0.00655 0.00908 0.01966 0.06425 0.584
## 6 0.00008 0.00463 0.00750 0.01388 0.04701 0.456
## Shimmer.APQ3 Shimmer.APQ5 MDVP.APQ Shimmer.DDA NHR HNR status RPDE
## 1 0.02182 0.03130 0.02971 0.06545 0.02211 21.033 1 0.414783
## 2 0.03134 0.04518 0.04368 0.09403 0.01929 19.085 1 0.458359
## 3 0.02757 0.03858 0.03590 0.08270 0.01309 20.651 1 0.429895
## 4 0.02924 0.04005 0.03772 0.08771 0.01353 20.644 1 0.434969
## 5 0.03490 0.04825 0.04465 0.10470 0.01767 19.649 1 0.417356
## 6 0.02328 0.03526 0.03243 0.06985 0.01222 21.378 1 0.415564
## DFA spread1 spread2 D2 PPE
## 1 0.815285 -4.813031 0.266482 2.301442 0.284654
## 2 0.819521 -4.075192 0.335590 2.486855 0.368674
## 3 0.825288 -4.443179 0.311173 2.342259 0.332634
## 4 0.819235 -4.117501 0.334147 2.405554 0.368975
## 5 0.823484 -3.747787 0.234513 2.332180 0.410335
## 6 0.825069 -4.242867 0.299111 2.187560 0.357775
## Removing the Samples names since they dont contribe to the Model
data <- data[,-1]
# Validating the Data Types
str(head(data, n=0))
## 'data.frame': 0 obs. of 23 variables:
## $ MDVP.Fo.Hz. : num
## $ MDVP.Fhi.Hz. : num
## $ MDVP.Flo.Hz. : num
## $ MDVP.Jitter... : num
## $ MDVP.Jitter.Abs.: num
## $ MDVP.RAP : num
## $ MDVP.PPQ : num
## $ Jitter.DDP : num
## $ MDVP.Shimmer : num
## $ MDVP.Shimmer.dB.: num
## $ Shimmer.APQ3 : num
## $ Shimmer.APQ5 : num
## $ MDVP.APQ : num
## $ Shimmer.DDA : num
## $ NHR : num
## $ HNR : num
## $ status : int
## $ RPDE : num
## $ DFA : num
## $ spread1 : num
## $ spread2 : num
## $ D2 : num
## $ PPE : num
# Converting Status into true factor: 1 for Disease/ 0 for Healthy
data$status <- as.factor(data$status)
head(data$status)
## [1] 1 1 1 1 1 1
## Levels: 0 1
## There are 23 variables and 195 samples (each sample has ~ 6 records)
## Note that first I consider all records to be independent
## Table to show count of Parkinson's Disease Patietns (PDP) vs Control
tbl <- as.vector(c(with(data, table(status)), as.numeric(nrow(data))))
bp <- barplot(tbl, beside = TRUE, names.arg = c("Healthy", "PDP", "Total"),
ylab="Frequency", xlab="Status", ylim= c(0,max(tbl)+50),
main="Response-Target Distribution" ,col=c("tomato","mediumturquoise","grey"))
text(x = bp, y = tbl, label = tbl,
pos = 3, cex = 0.8, col = c("tomato","mediumturquoise","black"))
Conclusion from Bar Plot: Note that the Healthy Samples are only 48 out of 195, and therefore the class ‘status’ is skewed.
That must be taken into consideration when building the different models.
set.seed(2645)
## The number of Predictors (23) make it hard to visualize all the Data using pairs plot
## For this reason, I generate several plots using ggpairs() from "GGally" Package
## GGally package uses version 4.0.3, please update to the latest version to run
## These plots take a set 4 variables each run, colored as a function of status(0vs1):
for (x in seq(from=1, to=ncol(data[,-17])-4, by=4)){
ggp = ggpairs(data[,-17][,c(seq(from=x, to=x+4))],
aes(color=data$status), progress=ggmatrix_progress())
print(ggp)
}
Indications from the Generalized Pairs Plot:
1. Some of the Predictors are semi-Normally distributed such as D2 & spread2
2. There is a positive correlation between many of the variables such as Jitter.DDP & MDVP.RAP 3. There is a clear indication of some outliers in most variables, but all the values are in their acceptable biological range.
=> For this reason, the original data is kept untouched, a new set is created to test later on if this set makes any difference. (Check the next section in this tab for outliers filtering)
set.seed(2645)
## Observing the summary of the Dataset to give
## us an idea about the 1st and 3rd Quantiles
data.fil.by.iqr <- data
summary(data.fil.by.iqr)
## MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
## Min. : 88.33 Min. :102.1 Min. : 65.48 Min. :0.001680
## 1st Qu.:117.57 1st Qu.:134.9 1st Qu.: 84.29 1st Qu.:0.003460
## Median :148.79 Median :175.8 Median :104.31 Median :0.004940
## Mean :154.23 Mean :197.1 Mean :116.32 Mean :0.006220
## 3rd Qu.:182.77 3rd Qu.:224.2 3rd Qu.:140.02 3rd Qu.:0.007365
## Max. :260.11 Max. :592.0 Max. :239.17 Max. :0.033160
## MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ Jitter.DDP
## Min. :7.000e-06 Min. :0.000680 Min. :0.000920 Min. :0.002040
## 1st Qu.:2.000e-05 1st Qu.:0.001660 1st Qu.:0.001860 1st Qu.:0.004985
## Median :3.000e-05 Median :0.002500 Median :0.002690 Median :0.007490
## Mean :4.396e-05 Mean :0.003306 Mean :0.003446 Mean :0.009920
## 3rd Qu.:6.000e-05 3rd Qu.:0.003835 3rd Qu.:0.003955 3rd Qu.:0.011505
## Max. :2.600e-04 Max. :0.021440 Max. :0.019580 Max. :0.064330
## MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5
## Min. :0.00954 Min. :0.0850 Min. :0.004550 Min. :0.00570
## 1st Qu.:0.01650 1st Qu.:0.1485 1st Qu.:0.008245 1st Qu.:0.00958
## Median :0.02297 Median :0.2210 Median :0.012790 Median :0.01347
## Mean :0.02971 Mean :0.2823 Mean :0.015664 Mean :0.01788
## 3rd Qu.:0.03789 3rd Qu.:0.3500 3rd Qu.:0.020265 3rd Qu.:0.02238
## Max. :0.11908 Max. :1.3020 Max. :0.056470 Max. :0.07940
## MDVP.APQ Shimmer.DDA NHR HNR
## Min. :0.00719 Min. :0.01364 Min. :0.000650 Min. : 8.441
## 1st Qu.:0.01308 1st Qu.:0.02474 1st Qu.:0.005925 1st Qu.:19.198
## Median :0.01826 Median :0.03836 Median :0.011660 Median :22.085
## Mean :0.02408 Mean :0.04699 Mean :0.024847 Mean :21.886
## 3rd Qu.:0.02940 3rd Qu.:0.06080 3rd Qu.:0.025640 3rd Qu.:25.076
## Max. :0.13778 Max. :0.16942 Max. :0.314820 Max. :33.047
## status RPDE DFA spread1 spread2
## 0: 48 Min. :0.2566 Min. :0.5743 Min. :-7.965 Min. :0.006274
## 1:147 1st Qu.:0.4213 1st Qu.:0.6748 1st Qu.:-6.450 1st Qu.:0.174350
## Median :0.4960 Median :0.7223 Median :-5.721 Median :0.218885
## Mean :0.4985 Mean :0.7181 Mean :-5.684 Mean :0.226510
## 3rd Qu.:0.5876 3rd Qu.:0.7619 3rd Qu.:-5.046 3rd Qu.:0.279234
## Max. :0.6852 Max. :0.8253 Max. :-2.434 Max. :0.450493
## D2 PPE
## Min. :1.423 Min. :0.04454
## 1st Qu.:2.099 1st Qu.:0.13745
## Median :2.362 Median :0.19405
## Mean :2.382 Mean :0.20655
## 3rd Qu.:2.636 3rd Qu.:0.25298
## Max. :3.671 Max. :0.52737
## Note that quantile(x) returns 0% - 25% - 50% - 75% - 100% quantiles
for (i in seq(from=1, to=ncol(data.fil.by.iqr))){
if(i != 17){
quant <- quantile(data.fil.by.iqr[,i])
data.fil.by.iqr[which(data.fil.by.iqr[,i]>(1.5*(quant[4]-quant[1])+quant[4])),i] <- median(data[,i])
data.fil.by.iqr[which(data.fil.by.iqr[,i]<(quant[1]-1.5*(quant[4]-quant[1]))),i] <- median(data[,i])
}
}
## Observing the SD of the Data
## Calculating how many SD are the max and min measurements far from the Mean
## 1. Before correcting for outliers
tb_org <- apply(data[,-17], 2, function(x){return(c(sd(x),(mean(x)-min(x))/sd(x),(max(x)-mean(x))/sd(x)))})
rownames(tb_org) <- c("SD","kSD(min,mean)","kSD(max,mean)")
tb_org
## MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
## SD 41.390065 91.491548 43.521413 0.004848134
## kSD(min,mean) 1.592064 1.037909 1.168359 0.936538022
## kSD(max,mean) 2.558014 4.316520 2.822642 5.556682255
## MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer
## SD 3.482191e-05 0.002967774 0.002758977 0.008903344 0.01885693
## kSD(min,mean) 1.061371e+00 0.884976379 0.915686973 0.885054919 1.06958695
## kSD(max,mean) 6.204170e+00 6.110164453 5.847690318 6.111192503 4.73941744
## MDVP.Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5 MDVP.APQ Shimmer.DDA
## SD 0.1948773 0.01015316 0.01202371 0.01694674 0.03045912
## kSD(min,mean) 1.0121820 1.09464956 1.01285385 0.99673984 1.09499605
## kSD(max,mean) 5.2327735 4.01902853 5.11670411 6.70916873 4.01940000
## NHR HNR RPDE DFA spread1 spread2
## SD 0.04041845 4.425764 0.1039417 0.05533583 1.090208 0.08340576
## kSD(min,mean) 0.59866417 3.037888 2.3278964 2.59898596 2.091883 2.64054115
## kSD(max,mean) 7.17427149 2.521830 1.7953856 1.93706235 2.981419 2.68545775
## D2 PPE
## SD 0.382799 0.09011932
## kSD(min,mean) 2.504027 1.79775698
## kSD(max,mean) 3.368161 3.55989537
## 2. After correcting for outliers
tb <- apply(data.fil.by.iqr[,-17], 2, function(x){return(c(sd(x),(mean(x)-min(x))/sd(x),(max(x)-mean(x))/sd(x)))})
rownames(tb) <- c("SD","kSD(min,mean)","kSD(max,mean)")
tb
## MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
## SD 41.390065 49.247531 43.521413 0.002675295
## kSD(min,mean) 1.592064 1.567499 1.168359 1.377657827
## kSD(max,mean) 2.558014 4.418913 2.822642 3.855410027
## MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer
## SD 2.374803e-05 0.001418343 0.001408827 0.004253768 0.01550102
## kSD(min,mean) 1.361948e+00 1.430524825 1.402112226 1.430960643 1.18255544
## kSD(max,mean) 3.396341e+00 4.075902021 3.758209686 4.074746510 3.33650079
## MDVP.Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5 MDVP.APQ Shimmer.DDA
## SD 0.139609 0.008614769 0.009457879 0.01232011 0.02584362
## kSD(min,mean) 1.207806 1.195710387 1.129976288 1.20625404 1.19612498
## kSD(max,mean) 3.354938 3.248982014 3.109874760 3.29045997 3.24868494
## NHR HNR RPDE DFA spread1 spread2
## SD 0.01197518 4.425764 0.1039417 0.05533583 1.090208 0.08340576
## kSD(min,mean) 1.15683689 3.037888 2.3278964 2.59898596 2.091883 2.64054115
## kSD(max,mean) 3.84684453 2.521830 1.7953856 1.93706235 2.981419 2.68545775
## D2 PPE
## SD 0.382799 0.09011932
## kSD(min,mean) 2.504027 1.79775698
## kSD(max,mean) 3.368161 3.55989537
Conclusion about Outliers:
> 1. By replacing outliers by Medians, the SD of some of the Variables decreased which shows measurements are more close to the mean.
> 2. The Right Skeweness decreased: shown in an increase in kSDmin,mean and decrease in kSDmax,mean
Major Concern: The replaced values are within the normal biological range, and therefore replacing them might lead to decrease in variation but at the cost of an increase in bias.
In this Project we attempt the following approaches:
1. Classification using:
a. Multiple Logistic Regression
b. Logistic Discriminant Analysis (LDA)
c. Quadratic Discriminant Analysis (QDA)
d. K-th Nearest Neigbors (KNN)2. Tree Based Approaches:
a. Basic Decision Trees
b. Bagging, Random Forests and BoostingNote: These approaches are all accompanied with Cross Validation, Boosting & Model Selection and Regularization
Finally: A comparison between all the Models, eventually reaching the best Model that will effectively predict PD in patients.
Split the Dataset into Train and Test using createDataPartition() that does Stratified Sampling by Default
Stratified Sampling in this case is important due to the response class being Skewed. Healthy People were under-represented.
# Set seed to get reproducible output. RStudio Version 4.0.3
set.seed(2645)
# Taking 70% Training, 30% Test Since the Sample isn't significantly large.
training_indices <- createDataPartition(data$status, p=0.7, list=FALSE)
x <- model.matrix(status~.,data)[,-1]
y <- data$status
x.train <- data[training_indices,]
y.train <- data$status[training_indices]
x.test <- data[-training_indices,]
y.test <- data$status[-training_indices]
set.seed(2645)
# Attach the DataSet:
attach(data)
# Performing Multiple Logistic Regression, Family = Binomial to indicate classification
glm.fit <- glm(status~., data=data, subset=training_indices, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
##
## Call:
## glm(formula = status ~ ., family = "binomial", data = data, subset = training_indices)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.04111 0.00000 0.03459 0.21264 2.00135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.161e+01 2.506e+01 -0.463 0.6432
## MDVP.Fo.Hz. -1.089e-02 3.332e-02 -0.327 0.7437
## MDVP.Fhi.Hz. -2.726e-03 7.235e-03 -0.377 0.7063
## MDVP.Flo.Hz. 2.912e-03 1.758e-02 0.166 0.8684
## MDVP.Jitter... -1.912e+03 1.875e+03 -1.019 0.3080
## MDVP.Jitter.Abs. -1.044e+05 1.332e+05 -0.784 0.4330
## MDVP.RAP 1.317e+05 2.113e+05 0.624 0.5329
## MDVP.PPQ -2.087e+03 2.872e+03 -0.727 0.4673
## Jitter.DDP -4.197e+04 7.032e+04 -0.597 0.5507
## MDVP.Shimmer 7.870e+02 1.671e+03 0.471 0.6376
## MDVP.Shimmer.dB. 3.017e+01 5.467e+01 0.552 0.5810
## Shimmer.APQ3 -1.173e+05 1.735e+05 -0.676 0.4991
## Shimmer.APQ5 -4.208e+02 5.839e+02 -0.721 0.4711
## MDVP.APQ 1.790e+02 6.171e+02 0.290 0.7718
## Shimmer.DDA 3.860e+04 5.782e+04 0.668 0.5044
## NHR -2.293e+01 9.494e+01 -0.242 0.8092
## HNR 9.805e-03 2.733e-01 0.036 0.9714
## RPDE -6.925e+00 6.404e+00 -1.081 0.2795
## DFA 1.855e+01 1.263e+01 1.468 0.1420
## spread1 1.498e+00 2.776e+00 0.540 0.5894
## spread2 1.939e+01 9.895e+00 1.960 0.0501 .
## D2 2.768e+00 2.338e+00 1.184 0.2364
## PPE 2.422e+01 4.561e+01 0.531 0.5954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 153.528 on 136 degrees of freedom
## Residual deviance: 49.892 on 114 degrees of freedom
## AIC: 95.892
##
## Number of Fisher Scoring iterations: 9
## Before Discussing the warning, its important to show that there
## doesn't appear to be any significant p-values.
summary(glm.fit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.160690e+01 2.505849e+01 -0.46319220 0.64322661
## MDVP.Fo.Hz. -1.089426e-02 3.331969e-02 -0.32696158 0.74369694
## MDVP.Fhi.Hz. -2.726264e-03 7.235185e-03 -0.37680643 0.70631746
## MDVP.Flo.Hz. 2.912171e-03 1.757490e-02 0.16570055 0.86839262
## MDVP.Jitter... -1.911796e+03 1.875390e+03 -1.01941238 0.30800723
## MDVP.Jitter.Abs. -1.044023e+05 1.331651e+05 -0.78400605 0.43303656
## MDVP.RAP 1.317328e+05 2.112747e+05 0.62351431 0.53294660
## MDVP.PPQ -2.087442e+03 2.871501e+03 -0.72695135 0.46725576
## Jitter.DDP -4.196701e+04 7.032359e+04 -0.59677010 0.55066089
## MDVP.Shimmer 7.869702e+02 1.670670e+03 0.47105063 0.63760458
## MDVP.Shimmer.dB. 3.017484e+01 5.467423e+01 0.55190242 0.58101521
## Shimmer.APQ3 -1.172686e+05 1.735091e+05 -0.67586467 0.49912657
## Shimmer.APQ5 -4.207959e+02 5.839347e+02 -0.72062152 0.47114241
## MDVP.APQ 1.789917e+02 6.171042e+02 0.29005094 0.77177727
## Shimmer.DDA 3.860137e+04 5.782218e+04 0.66758765 0.50439684
## NHR -2.292902e+01 9.493736e+01 -0.24151736 0.80915416
## HNR 9.805451e-03 2.732710e-01 0.03588178 0.97137662
## RPDE -6.924955e+00 6.403709e+00 -1.08139746 0.27952035
## DFA 1.855250e+01 1.263489e+01 1.46835454 0.14200794
## spread1 1.497874e+00 2.775528e+00 0.53967174 0.58942343
## spread2 1.938950e+01 9.895062e+00 1.95951287 0.05005275
## D2 2.768159e+00 2.337841e+00 1.18406642 0.23638677
## PPE 2.421996e+01 4.560993e+01 0.53102389 0.59540223
## Validation Set Approach to get Test Classification Error:
glm.prediction <- predict(glm.fit, newdata = x.test, type="response")
table_mat <- table(data$status[-training_indices], glm.prediction > 0.5)
colnames(table_mat) <- c("0","1")
table_mat
##
## 0 1
## 0 10 4
## 1 8 36
predic <- ifelse(glm.prediction>0.5,1,0)
vsa.test.error.glm <- mean(predic!=data$status[-training_indices])
vsa.test.error.glm
## [1] 0.2068966
## Cross Validation Error:
set.seed(2645)
cv.error <- cv.glm(x.train, glm.fit, K=5)$delta
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cv.error
## [1] 0.2748844 0.2018427
## Recalling the Assumptions of Multiple Logistic Regression, and the previous results of viewing the data, this might be due to high correlation between predictors (multicollinearity).
vif(glm.fit)
## MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
## 1.304402e+01 1.838318e+00 3.522143e+00 1.167191e+02
## MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ Jitter.DDP
## 4.893332e+01 4.598358e+05 1.154114e+02 4.584093e+05
## MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5
## 2.006434e+03 1.811858e+02 7.800041e+06 1.076339e+02
## MDVP.APQ Shimmer.DDA NHR HNR
## 1.224602e+02 7.796154e+06 5.985399e+00 6.938585e+00
## RPDE DFA spread1 spread2
## 3.627937e+00 3.372692e+00 4.063907e+01 2.524288e+00
## D2 PPE
## 2.280740e+00 7.391852e+01
We notice that ths is true, and the variables are truly correlated with each other. That is indicated by a high VIF value, typically larger than 10. Here VIF[MVP.PPQ] is 115, and that of [Jitter.DDP] is 4x10^5 which is extremely high.
Adding Interaction terms to our glm.fit would probably allow us to predict more accurately, however many of the variables are correlated and adding interaction terms to each will result in very high complexity which we want to avoid. Other Models will deal with this issue more effectively.
## Back to the warning: This Occuring warning usually shows due to complete seperation.
## It also arises whenever a probability of an observation is effectively indistinguishable from 1 or 0 which happens when some observations have very extreme values of one or more predictors.
eps <- 10 * .Machine$double.eps
glm.resids <- augment(glm.fit) %>%
mutate(p = 1 / (1 + exp(-.fitted)),
warning = p > 1-eps)
arrange(glm.resids, desc(.fitted)) %>%
dplyr::select(c(1:4), p, warning) %>%
slice(1:10)
## Registered S3 method overwritten by 'cli':
## method from
## print.tree tree
## # A tibble: 10 x 6
## .rownames status MDVP.Fo.Hz. MDVP.Fhi.Hz. p warning
## <chr> <fct> <dbl> <dbl> <dbl> <lgl>
## 1 153 1 198. 233. 1.00 FALSE
## 2 149 1 189. 216. 1.00 FALSE
## 3 148 1 184. 217. 1.00 FALSE
## 4 151 1 187. 212. 1.00 FALSE
## 5 90 1 180. 226. 1.00 FALSE
## 6 161 1 114. 124. 1.00 FALSE
## 7 101 1 126. 141. 1.00 FALSE
## 8 20 1 156. 189. 1.00 FALSE
## 9 158 1 118. 134. 1.00 FALSE
## 10 21 1 154. 166. 1.00 FALSE
Notice that many other observations also result in P=1. This warning might be a false positive. We remove the observation, temporarily, that is causing the warning and we check for any change in the P-Values.
x.train.glm <- x.train[!glm.resids$warning,]
glm.fit1 <- glm(x.train.glm$status~., data=x.train.glm, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit1)
##
## Call:
## glm(formula = x.train.glm$status ~ ., family = "binomial", data = x.train.glm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.04111 0.00000 0.03459 0.21264 2.00135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.161e+01 2.506e+01 -0.463 0.6432
## MDVP.Fo.Hz. -1.089e-02 3.332e-02 -0.327 0.7437
## MDVP.Fhi.Hz. -2.726e-03 7.235e-03 -0.377 0.7063
## MDVP.Flo.Hz. 2.912e-03 1.758e-02 0.166 0.8684
## MDVP.Jitter... -1.912e+03 1.875e+03 -1.019 0.3080
## MDVP.Jitter.Abs. -1.044e+05 1.332e+05 -0.784 0.4330
## MDVP.RAP 1.317e+05 2.113e+05 0.624 0.5329
## MDVP.PPQ -2.087e+03 2.872e+03 -0.727 0.4673
## Jitter.DDP -4.197e+04 7.032e+04 -0.597 0.5507
## MDVP.Shimmer 7.870e+02 1.671e+03 0.471 0.6376
## MDVP.Shimmer.dB. 3.017e+01 5.467e+01 0.552 0.5810
## Shimmer.APQ3 -1.173e+05 1.735e+05 -0.676 0.4991
## Shimmer.APQ5 -4.208e+02 5.839e+02 -0.721 0.4711
## MDVP.APQ 1.790e+02 6.171e+02 0.290 0.7718
## Shimmer.DDA 3.860e+04 5.782e+04 0.668 0.5044
## NHR -2.293e+01 9.494e+01 -0.242 0.8092
## HNR 9.805e-03 2.733e-01 0.036 0.9714
## RPDE -6.925e+00 6.404e+00 -1.081 0.2795
## DFA 1.855e+01 1.263e+01 1.468 0.1420
## spread1 1.498e+00 2.776e+00 0.540 0.5894
## spread2 1.939e+01 9.895e+00 1.960 0.0501 .
## D2 2.768e+00 2.338e+00 1.184 0.2364
## PPE 2.422e+01 4.561e+01 0.531 0.5954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 153.528 on 136 degrees of freedom
## Residual deviance: 49.892 on 114 degrees of freedom
## AIC: 95.892
##
## Number of Fisher Scoring iterations: 9
We notice the same warning appearing again, then it is due to complete seperation.
This happens when when the response seperates one or more of the predictors completely: for all values x > c in predictor X, the response is always 1(or 0). This happens when either the predictor is truly highly correlated with the response, or due to the split in the data set.
This can be fixed by applying a penalty to the regression model which we explore in the following Sections => Ridge Regression or Lasso
Conclusion on Multiple Logistic Regression: It seems like the glm.fit has about 80% accuracy in predicting the response. However, the interepretability of the model is very low as there doesn’t appear to be any significant predictors.
Additionally, since the Predictors are highly correlated, and since there exists complete seperation, it’s best to consider Ridge Regression or Lasso, or in between both: Elastic Net Approach.
set.seed(2645)
## Ridge Regression: alpha = 0
## Lasso: alpha = 1
## Elastic: 0 =< alpha =< 1
## ~~ Create a list that will fit all the fits from alpha = 0 (ridge) to alpha = 1 (lasso)
list.of.fits <- list()
for (i in 0:10){
alpha = i/10
fit.name <- paste0("alpha",alpha)
# Now for each alpha, get the best model fit using CV
list.of.fits[[fit.name]] <-
cv.glmnet(x[training_indices,], y[training_indices], type.measure="class", alpha=alpha, family="binomial", nfolds=5, standardize=TRUE)
}
## Warning: from glmnet Fortran code (error code -99); Convergence for 99th lambda
## value not reached after maxit=100000 iterations; solutions for larger lambdas
## returned
## ~~ Find the Classification Error for each of the fits with their best lambda given by CV
results <- data.frame()
for (i in 0:10){
alpha = i/10
fit.name <- paste0("alpha",alpha)
# Get the Fit from the list:
fit.alpha <- list.of.fits[[fit.name]]
# Calculate the Classification Error for the best fit.
## In this case, I set 's' to "lambda.1se", which is
## the value for lambda that results in the simplest model
## such that the cross validation error is within one
## standard error of the minimum.
## This offers a negligible increase in bias while decreasing variance.
fit.prediction <- predict(fit.alpha, s=fit.alpha$lambda.1se, newx=x[-training_indices,], type="response")
fit.pred <- ifelse(fit.prediction>0.5, 1, 0)
c.e = mean(fit.pred != y.test)
# Add to the results
temp <- data.frame(alpha=alpha, ce=c.e, lambda1se=fit.alpha$lambda.1se, fit.name=fit.name)
results <- rbind(results, temp)
}
results
## alpha ce lambda1se fit.name
## 1 0.0 0.1551724 0.93976210 alpha0
## 2 0.1 0.1551724 0.46772181 alpha0.1
## 3 0.2 0.1551724 0.30915035 alpha0.2
## 4 0.3 0.1551724 0.22619475 alpha0.3
## 5 0.4 0.1551724 0.15457518 alpha0.4
## 6 0.5 0.1551724 0.13571685 alpha0.5
## 7 0.6 0.1551724 0.13622619 alpha0.6
## 8 0.7 0.1724138 0.05547298 alpha0.7
## 9 0.8 0.1551724 0.10216964 alpha0.8
## 10 0.9 0.1724138 0.09967205 alpha0.9
## 11 1.0 0.1724138 0.07447454 alpha1
## Extract the best Model with the least Classification Error:
## We notice that different fits with different alpha have identical CE
## Picking the first one: (lowest alpha => Ridge Regression)
best.fit <- results[which.min(results[,2]),]
best.fit
## alpha ce lambda1se fit.name
## 1 0 0.1551724 0.9397621 alpha0
alpha.best <- as.double(best.fit[1])
alpha.best
## [1] 0
ce.best <- as.double(best.fit[2])
ce.best
## [1] 0.1551724
lambda.best <- as.double(best.fit[3])
lambda.best
## [1] 0.9397621
# Get the final fit of the best model:
best.full <- glmnet(x,y,alpha=alpha.best, family="binomial", standardize=TRUE)
plot(best.full, xvar="lambda")
best.final <- predict(best.full, s=lambda.best, type="coefficients")[1:ncol(data),]
best.final <- best.final[best.final!=0]
# This is the best fit we can get with varying values of alpha and lambda simultaneously
best.final
## (Intercept) MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz.
## 1.500340e-01 -2.267567e-03 -5.405438e-04 -2.009510e-03
## MDVP.Jitter... MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ
## 6.644355e+00 1.419603e+03 1.139578e+01 1.296685e+01
## Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3
## 3.798194e+00 3.070914e+00 2.725448e-01 5.113419e+00
## Shimmer.APQ5 MDVP.APQ Shimmer.DDA NHR
## 4.512558e+00 3.597775e+00 1.704137e+00 2.855772e-01
## HNR RPDE DFA spread1
## -1.188915e-02 4.251241e-01 1.077079e+00 1.170001e-01
## spread2 D2 PPE
## 1.305050e+00 2.146965e-01 1.277064e+00
## Picking a higher one: alpha = 0.4, Shrinks predictors more => Decreases Variance at the cost of Bias
## + Theoratically increases Interpretability (but in this case theres too much collinearity)
## Anyway, our main goal is prediction and note interpretability
higher.fit <- results[5,]
higher.fit
## alpha ce lambda1se fit.name
## 5 0.4 0.1551724 0.1545752 alpha0.4
alpha.higher<- as.double(higher.fit[1])
alpha.higher
## [1] 0.4
ce.best.higher <- as.double(higher.fit[2])
ce.best.higher
## [1] 0.1551724
lambda.best.higher <- as.double(higher.fit[3])
lambda.best.higher
## [1] 0.1545752
# Get the final fit of the model:
best.full.higher <- glmnet(x,y,alpha=alpha.higher, family="binomial", standardize = TRUE)
plot(best.full.higher, xvar="lambda")
best.final.higher <- predict(best.full.higher, s=lambda.best.higher, type="coefficients")[1:ncol(data),]
best.final.higher <- best.final.higher[best.final.higher!=0]
# This is one of the best fit we can get with varying values of alpha and lambda simultaneously
best.final.higher
## (Intercept) MDVP.Fo.Hz. MDVP.Flo.Hz. MDVP.Shimmer Shimmer.APQ3 MDVP.APQ
## 2.144934353 -0.002716660 -0.001694658 2.157631564 0.539297646 1.291374129
## Shimmer.DDA spread1 spread2 D2 PPE
## 0.177211729 0.356297312 2.306403641 0.220409887 3.283032701
## Picking the highest one: alpha = 0.8, Shrinks predictors more => Decreases Variance at the cost of Bias
## + Theoratically increases Interpretability (but in this case theres too much collinearity)
## Anyway, our main goal is prediction and note interpretability
highest.fit <- results[9,]
highest.fit
## alpha ce lambda1se fit.name
## 9 0.8 0.1551724 0.1021696 alpha0.8
alpha.high <- as.double(highest.fit[1])
alpha.high
## [1] 0.8
ce.best.high <- as.double(highest.fit[2])
ce.best.high
## [1] 0.1551724
lambda.best.high <- as.double(highest.fit[3])
lambda.best.high
## [1] 0.1021696
# Get the final fit of the model:
best.full.high <- glmnet(x,y,alpha=alpha.high, family="binomial", standardize = TRUE)
plot(best.full.high, xvar="lambda")
best.final.high <- predict(best.full.high, s=lambda.best.high, type="coefficients")[1:ncol(data),]
best.final.high <- best.final.high[best.final.high!=0]
# This is one of the best fit we can get with varying values of alpha and lambda simultaneously
best.final.high
## (Intercept) MDVP.Fo.Hz. MDVP.Flo.Hz. spread1 spread2
## 3.8970344984 -0.0004394087 -0.0001357522 0.6058558631 1.7421158249
## PPE
## 2.8726253684
Conclusion: We see that there is a huge Coefficient in best.final which represents alpha=0 (Ridge Regression), that corresponds to MDVP.Jitter.Abs.
<!> This is most probably because of Collinearity.
We notice when we choose the highest alpha, while keeping the classification error minimum, we get smaller coefficients, and less predictors with smaller coefficients.
In theory, on this data set, both derived models, the Ridge Regression based Model (alpha=0) and the Elastic Net based Model (alpha=0.8) have the same predictive power as they both have CE=0.1551724. This means both have better predictive power than GLM which had CE~0.2
Repeating the Same Proccess Using a different seed() in the Next Tab
set.seed(20)
training_indices.diff.seed <- createDataPartition(data$status, p=0.7, list=FALSE)
x.train.diff.seed <- data[training_indices.diff.seed,]
y.train.diff.seed <- data$status[training_indices.diff.seed]
x.test.diff.seed <- data[-training_indices.diff.seed,]
y.test.diff.seed <- data$status[-training_indices.diff.seed]
## Ridge Regression: alpha = 0
## Lasso: alpha = 1
## Elastic: 0 =< alpha =< 1
## ~~ Create a list that will fit all the fits from alpha = 0 (ridge) to alpha = 1 (lasso)
list.of.fits.diff.seed <- list()
for (i in 0:10){
alpha = i/10
fit.name <- paste0("alpha",alpha)
# Now for each alpha, get the best model fit using CV
list.of.fits[[fit.name]] <-
cv.glmnet(x[training_indices.diff.seed,], y[training_indices.diff.seed], type.measure="class", alpha=alpha, family="binomial", nfolds=5, standardize=TRUE)
}
## ~~ Find the Classification Error for each of the fits with their best lambda given by CV
results.diff.seed <- data.frame()
for (i in 0:10){
alpha = i/10
fit.name <- paste0("alpha",alpha)
# Get the Fit from the list:
fit.alpha <- list.of.fits[[fit.name]]
# Calculate the Classification Error for the best fit.
## In this case, I set 's' to "lambda.1se", which is
## the value for lambda that results in the simplest model
## such that the cross validation error is within one
## standard error of the minimum.
## This offers a negligible increase in bias while decreasing variance.
fit.prediction <- predict(fit.alpha, s=fit.alpha$lambda.1se, newx=x[-training_indices,], type="response")
fit.pred <- ifelse(fit.prediction>0.5, 1, 0)
c.e = mean(fit.pred != y.test)
# Add to the results
temp <- data.frame(alpha=alpha, ce=c.e, lambda1se=fit.alpha$lambda.1se, fit.name=fit.name)
results.diff.seed <- rbind(results.diff.seed, temp)
}
results.diff.seed
## alpha ce lambda1se fit.name
## 1 0.0 0.1724138 0.483747683 alpha0
## 2 0.1 0.1551724 0.289999116 alpha0.1
## 3 0.2 0.1724138 0.278096140 alpha0.2
## 4 0.3 0.1724138 0.223311857 alpha0.3
## 5 0.4 0.1551724 0.152605085 alpha0.4
## 6 0.5 0.1724138 0.161387953 alpha0.5
## 7 0.6 0.1551724 0.111655928 alpha0.6
## 8 0.7 0.1724138 0.079456040 alpha0.7
## 9 0.8 0.1034483 0.001056704 alpha0.8
## 10 0.9 0.1724138 0.089659974 alpha0.9
## 11 1.0 0.1724138 0.088561540 alpha1
## Extract the best Model with the least Classification Error:
best.fit.diff.seed <- results.diff.seed[which.min(results.diff.seed[,2]),]
best.fit.diff.seed
## alpha ce lambda1se fit.name
## 9 0.8 0.1034483 0.001056704 alpha0.8
alpha.best.diff.seed <- as.double(best.fit.diff.seed[1])
alpha.best.diff.seed
## [1] 0.8
ce.best.diff.seed <- as.double(best.fit.diff.seed[2])
ce.best.diff.seed
## [1] 0.1034483
lambda.best.diff.seed <- as.double(best.fit.diff.seed[3])
lambda.best.diff.seed
## [1] 0.001056704
# Get the final fit of the best model:
best.full.diff.seed <- glmnet(x,y,alpha=alpha.best.diff.seed, family="binomial", standardize=TRUE)
plot(best.full.diff.seed, xvar="lambda")
best.final.diff.seed <- predict(best.full.diff.seed, s=lambda.best.diff.seed, type="coefficients")[1:ncol(data),]
best.final.diff.seed <- best.final.diff.seed[best.final.diff.seed!=0]
# This is the best fit we can get with varying values of alpha and lambda simultaneously for the new seed
best.final.diff.seed
## (Intercept) MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz.
## -1.577741e+00 -1.398541e-02 -2.103912e-03 -1.137205e-04
## MDVP.Jitter... MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ
## -5.029867e+02 -4.228313e+04 8.993404e+02 -7.103324e+02
## Jitter.DDP Shimmer.APQ3 MDVP.APQ Shimmer.DDA
## 2.806904e+02 -6.283288e+01 1.892247e+02 -1.748911e+01
## HNR RPDE DFA spread1
## 6.973688e-02 -3.752634e+00 2.923742e+00 8.390816e-01
## spread2 D2 PPE
## 8.096619e+00 1.626175e+00 1.599997e+01
# Compare to best.final(alpha=0, ce=0.15)
best.final
## (Intercept) MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz.
## 1.500340e-01 -2.267567e-03 -5.405438e-04 -2.009510e-03
## MDVP.Jitter... MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ
## 6.644355e+00 1.419603e+03 1.139578e+01 1.296685e+01
## Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3
## 3.798194e+00 3.070914e+00 2.725448e-01 5.113419e+00
## Shimmer.APQ5 MDVP.APQ Shimmer.DDA NHR
## 4.512558e+00 3.597775e+00 1.704137e+00 2.855772e-01
## HNR RPDE DFA spread1
## -1.188915e-02 4.251241e-01 1.077079e+00 1.170001e-01
## spread2 D2 PPE
## 1.305050e+00 2.146965e-01 1.277064e+00
ce.best
## [1] 0.1551724
# Compare to best.final.high(alpha=0.8, ce=0.15)
best.final.high
## (Intercept) MDVP.Fo.Hz. MDVP.Flo.Hz. spread1 spread2
## 3.8970344984 -0.0004394087 -0.0001357522 0.6058558631 1.7421158249
## PPE
## 2.8726253684
ce.best.high
## [1] 0.1551724
This difference indicates that there is high variation in the DataSet. With the new seed we were able to train and test in a way where we got a smaller classification error (0.10344) compared to (0.15517) which would show greater prediction power. But this definitely only applies on this data set cut taken.
With the new Seed, MDVP.Jitter.Abs. still has an extremely high coefficient but has flipped to negative! This shows that the reason the coefficient is extremely high is not really because of being a very important predictor to our response but rather/or also due to collinearity with other predictors.
Although the models offer very low interepretability, they offer a decent prediction accuracy (85% accuracy with the first try, and 90% with the second try)
However, having very high coeffients in the model will probably lead to failure in predictions on future patients. When one predictor takes all the credit from other predictors due to collinearity, it makes predictions more potent to failure.
One of the Models ‘best.final.high’ with alpha=0.8 was able to minimize the coefficients while keeping an 85% prediction accuracy however eleminating many predictors. Using an inbetween alpha (=0.4) resulted in a better model, same CE, but more predictors with logical coefficients.
Although the same alpha=0.8 on a different seed resulted in 0.10344 in ‘best.final.diff.seed’, the model has high coefficients values and is more prone to failure on different Datasets.
In General, fitting Ridge or Lasso Regression or Elastic Net faced too much variablity due to sample size and significant collinearity. We Consider different approaches such as the LDA and QDA + step wise selection that might be able to deal with these issues.
set.seed(2645)
folds <- createFolds(data$status, k=5, returnTrain = TRUE)
d.train <- data[training_indices,]
ce_vector <- vector()
for (fold in folds){
train = fold
lda.fit <- lda(d.train$status~., data=d.train[,-5], subset=train, family=binomial)
lda.prediction <- predict(lda.fit, newdata=d.train[-train,-5])
fit.pred <- ifelse(fit.prediction>0.5, 1, 0)
c.e = mean(fit.pred != y.test)
ce_vector <- c(ce_vector,c.e)
}
mce <- mean(ce_vector)
mce
## [1] 0.1724138
## Error in lda.default(x, grouping, ...) :
## variable 5 appears to be constant within groups
## This Variable Refers to MDVP.Jitter.Abs. It appears to be constant due to scale.
## Since there are other variables that represent MDVP.Jitter.Abs,
## it was safe to remove the variable. (tested, check Update #2)
## Mean Classification error is 0.1237 < Prior Classifications errors for Ridge and Lasso (~0.15)
## We choose to subset the data manually by biological similarities between predictors.
sub.data <- data[,-c(2,3,5,6,7,8,10,11,12,13,14,15,16,20,21)]
folds <- createFolds(sub.data$status, k=5, returnTrain = TRUE)
d.train.sub <- sub.data[training_indices,]
ce_vector <- vector()
for (fold in folds){
train = fold
lda.fit <- lda(d.train.sub$status~., data=d.train.sub, subset=train, family=binomial)
lda.prediction <- predict(lda.fit, newdata=d.train.sub[-train,])
fit.pred <- ifelse(fit.prediction>0.5, 1, 0)
c.e = mean(fit.pred != y.test)
ce_vector <- c(ce_vector,c.e)
}
mce <- mean(ce_vector)
mce
## [1] 0.1724138
## We observe the same classification error, subsetting the data manually doesnt result in any progress
## It also indicates that also using stepwise subsetting will not result in any change, but we do it for the sake of demonstration
lda.fit <- lda(data$status~., data=data[,-5], subset=training_indices, family=binomial)
lda.sub <- stepclass(data[,-5], status, direction="both", method="lda", k=5, trace=-1)
## `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 195 observations of 22 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## Warning in cv.rate(vars = c(model, tryvar), data = data, grouping = grouping, :
## error(s) in modeling/prediction step
## correctness rate: 0.85105; in: "spread1"; variables (1): spread1
## Warning in cv.rate(vars = c(model, tryvar), data = data, grouping = grouping, :
## error(s) in modeling/prediction step
##
## hr.elapsed min.elapsed sec.elapsed
## 0.00 0.00 0.71
lda.fit <- lda(data$status~., data=data[,c(17,21)], subset=training_indices, family=binomial)
lda.prediction <- predict(lda.fit, newdata=data[-training_indices,])
fit.pred <- ifelse(fit.prediction>0.5, 1, 0)
c.e = mean(fit.pred != y.test)
c.e
## [1] 0.1724138
## We see only one variable selected: spread1, but it is clear that other variables have an effect on the response status.
## Also the CE stays the same.
Taking a Quick look back on Logistic Regression with Manual Subset Selection
set.seed(2645)
# Logistic Regression
glm.fit.sub <- glm(status~., data=sub.data, subset=training_indices, family="binomial")
summary(glm.fit.sub)
##
## Call:
## glm(formula = status ~ ., family = "binomial", data = sub.data,
## subset = training_indices)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.33000 0.00306 0.11762 0.38242 1.61681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.66009 8.42515 -1.977 0.04799 *
## MDVP.Fo.Hz. -0.01443 0.01090 -1.324 0.18537
## MDVP.Jitter... -196.58493 163.12848 -1.205 0.22817
## MDVP.Shimmer 76.45980 49.88588 1.533 0.12535
## RPDE -3.85919 4.28421 -0.901 0.36770
## DFA 11.20148 7.73733 1.448 0.14770
## D2 4.10210 1.49296 2.748 0.00600 **
## PPE 23.24357 8.81603 2.637 0.00838 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 153.528 on 136 degrees of freedom
## Residual deviance: 74.362 on 129 degrees of freedom
## AIC: 90.362
##
## Number of Fisher Scoring iterations: 7
## VIF: Check if any are still collinear
vif(glm.fit.sub)
## MDVP.Fo.Hz. MDVP.Jitter... MDVP.Shimmer RPDE DFA
## 2.212201 1.641817 1.619375 2.253123 1.551336
## D2 PPE
## 1.675813 2.118685
## Validation Set Approach to get Test Classification Error:
glm.prediction.sub <- predict(glm.fit.sub, newdata = sub.data[-training_indices,], type="response")
table_mat <- table(data$status[-training_indices], glm.prediction.sub > 0.5)
colnames(table_mat) <- c("0","1")
table_mat
##
## 0 1
## 0 8 6
## 1 4 40
predic <- ifelse(glm.prediction.sub>0.5,1,0)
vsa.test.error.glm <- mean(predic!=data$status[-training_indices])
vsa.test.error.glm
## [1] 0.1724138
## Cross Validation Error:
cv.error <- cv.glm(sub.data[-training_indices,], glm.fit.sub, K=5)$delta
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
cv.error
## [1] 0.4031160 0.1729946
## We observe the same problem concerning complete seperation.
## However now we have better corrected cv.error, closer to Elastic Net
## We also see significant predictors, but surely, the ones excluded might be significant as well.
## The cv.error that is not bias corrected is high, but the bias corrected cv.error is best so far.
## However, this approach is VERY high in bias due to the manual subset selection.
Unfortunately, but as expected, due to collinearity, LDA did not offer any Major improvement, even with subset selection, manual and using step wise selection.
set.seed(2645)
qda.fit <- qda(data$status~., data=data, subset=training_indices, family="binomial", cv=TRUE)
qda.prediction <- predict(qda.fit, newdata=d.train[-training_indices,])
fit.pred <- ifelse(fit.prediction>0.5, 1, 0)
c.e = mean(fit.pred != y.test)
c.e
## [1] 0.1724138
## No change in CE. QDA doesnt offer any advantage over LDA in this case.
As expected, since both QDA and LDA fail to deal with collinear predictors, we couldn’t improve our prediction accuracy way beyond previous models.
So far, Elastic Net Approach with alpha=0.4 and a cross validated Lambda.1se was able to reach prediction accuracy close to 85% while maintaining a good number of predictors (fair bias-variance trade off)
So far, it is clear that approximately all the predictors are important: each predictor from each biological “category” can be used independently in a model and we would still get the same results.
Ex: Instead of using NHR and HNR both, we could use only one of them.
set.seed(2645)
ce_vector <- vector()
for (K in seq(from=2, to=30, by=1)){
knn.pred <- knn(data[training_indices, ], data[-training_indices, ], data[training_indices,17],k=K)
c.e <- mean(knn.pred!=data$status[-training_indices])
ce_vector <- c(ce_vector, c.e)
}
ce_vector
## [1] 0.1896552 0.1034483 0.1551724 0.1551724 0.1724138 0.1206897 0.1724138
## [8] 0.1551724 0.2068966 0.1896552 0.2068966 0.1896552 0.2068966 0.2068966
## [15] 0.2241379 0.2068966 0.2068966 0.2068966 0.2068966 0.2068966 0.2241379
## [22] 0.2068966 0.2068966 0.2068966 0.2241379 0.2068966 0.2068966 0.2068966
## [29] 0.2241379
## We see the best KNN Prediction is with K=2 with a CE = 0.134 and then deteriorates significantly as K increases.
From the first section, visualization of the data, we can already tell that the distribution is more close to linear rather than complex, and that’s why KNN fails (at K > 2) to give better predictions than LDA and QDA.
# 1. Construct the Tree
tree.simple <- tree(data$status~., data=data, subset=training_indices)
# 2. Predict using the test set
tree.pred = predict(tree.simple, data[-training_indices,],type="class")
# 3. Construct the Table:
table(tree.pred, data$status[-training_indices])
##
## tree.pred 0 1
## 0 8 2
## 1 6 42
# 4. Get the Classificiation Error:
simple.ce <- mean(tree.pred!=data$status[-training_indices])
simple.ce
## [1] 0.137931
## Pruning with Cross Validation
# 5. Prune the Simple Tree
cv.pr.tree = cv.tree(tree.simple, FUN=prune.misclass)
# 7. Observe the results
cv.pr.tree
## $size
## [1] 7 5 4 2 1
##
## $dev
## [1] 17 19 23 21 35
##
## $k
## [1] -Inf 0 2 3 18
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
par(mfrow=c(1,2))
plot(cv.pr.tree$size, cv.pr.tree$dev,type="b")
plot(cv.pr.tree$k, cv.pr.tree$dev,type="b")
# 8. Get the Tree with the best size that results with min cve(dev)
best_size <- cv.pr.tree$size[which.min(cv.pr.tree$dev)]
# 9. Prune the simple tree again set with size = best_size
prune.full = prune.misclass(tree.simple, best=best_size)
# 10. View the plots:
plot(prune.full)
text(prune.full, pretty=0)
## Retest the pruned tree:
tree.pred=predict(prune.full, newdata=data[-training_indices,] , type="class")
table(tree.pred, data$status[-training_indices])
##
## tree.pred 0 1
## 0 8 0
## 1 6 44
prune.ce <- mean(tree.pred!=data$status[-training_indices])
prune.ce
## [1] 0.1034483
#prune.full
Clearly, the pruned tree has higher prediction accuracy than the simple tree: 89% > 81.5%
Also, note that this prediction accuracy is very comparable with the previous best model: Elastic Net (with alpha=0.4 and ce~0.13, or one with alpha=0.8 and ce~0.10).
However, the pruned tree selects for only two variables: MDVP.Fi.Hz. and spread2
This might cause bad predictions on differnet Dataset. where the variables might not be as highly correlated as in this Dataset.
We already know from previous sections that MDVP.Fi.Hz is highly correlated with MDVP.Fo.Hz and .Flo.Hz. as they are biologically related.
set.seed(2645)
## Perform Bagging: mtry = number of predictors
## Perform Random Forests: mtry = sqrt(p) for classification
oob.values <- matrix(data=NA, nrow=ncol(data)-1, ncol=4)
mce.values <- matrix(data=NA, nrow=ncol(data)-1, ncol=4)
for(i in 1:(ncol(data)-1)) {
for(n in c(250,500,750,1000)){
temp.model <- randomForest(data$status ~ ., data=data,
importance=TRUE, subset=training_indices, mtry=i, ntree=n)
#Get the mse values
mce.values[i,n/250] <- mean(data$status[-training_indices]!=predict(temp.model,newdata=data[-training_indices,]))
#Get the oob values
oob.values[i,n/250] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
}
}
oob.values
## [,1] [,2] [,3] [,4]
## [1,] 0.1167883 0.1094891 0.1386861 0.1167883
## [2,] 0.1386861 0.1240876 0.1386861 0.1313869
## [3,] 0.1240876 0.1240876 0.1240876 0.1167883
## [4,] 0.1313869 0.1240876 0.1313869 0.1386861
## [5,] 0.1094891 0.1167883 0.1167883 0.1386861
## [6,] 0.1386861 0.1240876 0.1313869 0.1313869
## [7,] 0.1313869 0.1386861 0.1313869 0.1313869
## [8,] 0.1313869 0.1240876 0.1313869 0.1313869
## [9,] 0.1386861 0.1313869 0.1313869 0.1313869
## [10,] 0.1313869 0.1313869 0.1386861 0.1386861
## [11,] 0.1240876 0.1386861 0.1167883 0.1313869
## [12,] 0.1386861 0.1386861 0.1386861 0.1386861
## [13,] 0.1167883 0.1240876 0.1313869 0.1313869
## [14,] 0.1240876 0.1240876 0.1386861 0.1240876
## [15,] 0.1240876 0.1167883 0.1313869 0.1167883
## [16,] 0.1313869 0.1167883 0.1313869 0.1094891
## [17,] 0.1167883 0.1240876 0.1167883 0.1240876
## [18,] 0.1386861 0.1313869 0.1313869 0.1167883
## [19,] 0.1240876 0.1094891 0.1313869 0.1240876
## [20,] 0.1459854 0.1240876 0.1240876 0.1167883
## [21,] 0.1313869 0.1240876 0.1240876 0.1167883
## [22,] 0.1313869 0.1240876 0.1240876 0.1240876
mce.values
## [,1] [,2] [,3] [,4]
## [1,] 0.12068966 0.10344828 0.1034483 0.12068966
## [2,] 0.06896552 0.08620690 0.1206897 0.08620690
## [3,] 0.08620690 0.06896552 0.0862069 0.06896552
## [4,] 0.08620690 0.06896552 0.0862069 0.06896552
## [5,] 0.10344828 0.08620690 0.0862069 0.08620690
## [6,] 0.08620690 0.08620690 0.0862069 0.08620690
## [7,] 0.08620690 0.08620690 0.0862069 0.08620690
## [8,] 0.08620690 0.08620690 0.0862069 0.08620690
## [9,] 0.08620690 0.08620690 0.0862069 0.08620690
## [10,] 0.08620690 0.08620690 0.0862069 0.08620690
## [11,] 0.08620690 0.08620690 0.0862069 0.08620690
## [12,] 0.08620690 0.08620690 0.0862069 0.08620690
## [13,] 0.08620690 0.08620690 0.0862069 0.08620690
## [14,] 0.08620690 0.08620690 0.0862069 0.08620690
## [15,] 0.08620690 0.08620690 0.0862069 0.08620690
## [16,] 0.10344828 0.08620690 0.0862069 0.08620690
## [17,] 0.08620690 0.08620690 0.0862069 0.08620690
## [18,] 0.08620690 0.10344828 0.0862069 0.08620690
## [19,] 0.08620690 0.08620690 0.0862069 0.08620690
## [20,] 0.08620690 0.08620690 0.0862069 0.08620690
## [21,] 0.10344828 0.08620690 0.0862069 0.08620690
## [22,] 0.08620690 0.10344828 0.0862069 0.08620690
##Bagging:
oob.values[22]
## [1] 0.1313869
mce.values[22]
## [1] 0.0862069
##Random Forests:
oob.values[4,2]
## [1] 0.1240876
mce.values[4,2]
## [1] 0.06896552
mce.values.try1 <- vector(length=length(seq(from=600, to=900, by=25)))
for(n in seq(from=600, to=900, by=25)){
temp.model <- randomForest(data$status ~ ., data=data,
importance=TRUE, subset=training_indices, mtry=4, ntree=n)
#Get the mse values
mce.values.try1[(n/25)-23] <- mean(data$status[-training_indices]!=predict(temp.model,
newdata=data[-training_indices,]))
}
mce.values.try1
## [1] 0.06896552 0.08620690 0.08620690 0.08620690 0.08620690 0.08620690
## [7] 0.06896552 0.10344828 0.06896552 0.08620690 0.08620690 0.06896552
## [13] 0.06896552
Clearly, and not surprisingly, we get the best mce value to be for mtry=4. The number of trees needed to reach that optimal prediction accuracy was 600
Why mtry=4? Because the predictors are extremely highly correlated, random forests can deal with collinear predictors.
Then, we explored the possible ntree values that would further optimize the prediction accuracy, and it appears that overfitting starts to occure after ntree~750
Our best fit between Bagging and Random Forest is a random forest with mtry=4, resulting in mce=0.06896 which is the lowest prediction error so far.
=> Random Forest Tree is able to reach 93.02% prediction accuracy while handeling collinearity.
## Boosting:
#boost.data=gbm(data$status~.,data=data[training_indices ,],
#distribution="bernoulli", n.trees=750, interaction.depth=10, cv=5)
## Error in model.frame.default(formula = data$status ~ ., data = data[training_indices, : variable lengths differ (found for 'MDVP.Fo.Hz.')
#summary(boost.data)
## Error in h(simpleError(msg, call)) : error in evaluating the argument 'object' in selecting a method for function 'summary': object 'boost.data' not found
| Model | Classification Error | Prediction Accuracy % | Pros | Cons |
|---|---|---|---|---|
| Multiple Logistic Regression: glm cv5 | 0.2018427 | 79.82 | Simple to Implement | Low Prediction Accuracy, high variance |
| MLR with Manual Subsetting of Data: glm cv5 | 0.1724138 | 82.76 | Less Variance and higher Prediction Accuracy | Possible increase in Bias due to Manual Prediction Selection |
| Ridge Regression: glmnet alpha0 | 0.1551724 | 84.48276 | Fair Prediction Accuracy | High Variance: No Predictors Selection Performed |
| Lasso: glmnet alpha1 | NA | NA | NA | Not Selected due to Less Prediction Accuracy than Elastic Net |
| Elastic Net: glmnet alpha0.8 | 0.1034483 | 89.66 | High Prediction Accuarcy | Possible increase in Bias, but compensated by presence of Multicollinearity |
| Elastic Net: glmnet alpha0.4 | 0.1551724 | 84.48 | Balance between Bias-Variance | Pros come at the cost of Prediction Accuracy on this Dataset |
| Linear Discriminant Analysis: lda cv5 | 0.1724138 | 82.76 | Simple to Implement | Low Prediction Accuracy, identical to glm with manual data subsetting |
| Quadratic Discriminant Analysis: qda cv5 | 0.1724138 | 82.76 | Simple to Implement | Low Prediction Accuracy |
| K-th Nearest Neighbors: knn k2 | 0.1034483 | 89.66 | High Prediction Accuarcy, Simple to Implement | For K=2, will give very similar results to a Linear Model |
| Simple Decision Trees: tree | 0.13791 | 86.21 | Fair Prediction Accuracy | Could be Improved by Pruning and other Methods |
| Pruned Trees: cv.tree prune.missclass | 0.1034483 | 89.66 | High Prediction Accuracy | Pruning is highly affected by Multicollinearity |
| Bagging: randomForest mtryP ntrees250,1000 | 0.0862069 | 91.38 | Offers a Middle Ground between Random Forest and Pruned Trees | Takes the whole data and resamples, might increase multicollinearity |
| Random Forest: randomForest mtrySQRT(P) ntees250,750,1000 | 0.06896552 | 93.11 | Highest Prediction Accuracy | Requires high Time Complexity to obtain |