NOTE Before starting this assignment please remember to clear your environment, you can do that by running the following code chunk
rm(list = ls(all = TRUE))
Get the data
Data Pre-processing
Build a model
Predictions
Communication
The “parkinsons_data.csv”" dataset is composed of a range of biomedical voice measurements. Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recordings.
The dataset has 195 rows and 23 columns.
The column/variable names’ explanation is given below:
MDVP:Fo(Hz) : Average vocal fundamental frequency
MDVP:Fhi(Hz) : Maximum vocal fundamental frequency
MDVP:Flo(Hz) : Minimum vocal fundamental frequency
MDVP:Jitter(%); MDVP:Jitter(Abs); MDVP:RAP; MDVP:PPQ; Jitter:DDP :_ Several measures of variation in fundamental frequency
MDVP:Shimmer; MDVP:Shimmer(dB); Shimmer:APQ3; Shimmer:APQ5; MDVP:APQ; Shimmer:DDA : Several measures of variation in amplitude
NHR; HNR : Two measures of ratio of noise to tonal components in the voice
status : Health status of the subject (Parkinson’s) - Positive, (Normal) - Healthy
RPDE; D2 : Two nonlinear dynamical complexity measures
DFA : Signal fractal scaling exponent
spread1; spread2; PPE : Three nonlinear measures of fundamental frequency variation
setwd("C:\\Users\\C5215696\\Desktop\\Data Science\\Regression-concepts\\Logistic-reg\\20170708_Batch30_CSE7302c_LogisticRegression_Lab")
getwd()
## [1] "C:/Users/C5215696/Desktop/Data Science/Regression-concepts/Logistic-reg/20170708_Batch30_CSE7302c_LogisticRegression_Lab"
parkinson_data <- read.csv(file = "parkinsons_data.csv", header = TRUE, sep = ",")
str(parkinson_data)
## 'data.frame': 195 obs. of 23 variables:
## $ MDVP.Fo.Hz. : num 120 122 117 117 116 ...
## $ MDVP.Fhi.Hz. : num 157 149 131 138 142 ...
## $ MDVP.Flo.Hz. : num 75 114 112 111 111 ...
## $ MDVP.Jitter... : num 0.00784 0.00968 0.0105 0.00997 0.01284 ...
## $ MDVP.Jitter.Abs.: num 0.00007 0.00008 0.00009 0.00009 0.00011 0.00008 0.00003 0.00003 0.00006 0.00006 ...
## $ MDVP.RAP : num 0.0037 0.00465 0.00544 0.00502 0.00655 0.00463 0.00155 0.00144 0.00293 0.00268 ...
## $ MDVP.PPQ : num 0.00554 0.00696 0.00781 0.00698 0.00908 0.0075 0.00202 0.00182 0.00332 0.00332 ...
## $ Jitter.DDP : num 0.0111 0.0139 0.0163 0.015 0.0197 ...
## $ MDVP.Shimmer : num 0.0437 0.0613 0.0523 0.0549 0.0643 ...
## $ MDVP.Shimmer.dB.: num 0.426 0.626 0.482 0.517 0.584 0.456 0.14 0.134 0.191 0.255 ...
## $ Shimmer.APQ3 : num 0.0218 0.0313 0.0276 0.0292 0.0349 ...
## $ Shimmer.APQ5 : num 0.0313 0.0452 0.0386 0.0401 0.0483 ...
## $ MDVP.APQ : num 0.0297 0.0437 0.0359 0.0377 0.0447 ...
## $ Shimmer.DDA : num 0.0654 0.094 0.0827 0.0877 0.1047 ...
## $ NHR : num 0.0221 0.0193 0.0131 0.0135 0.0177 ...
## $ HNR : num 21 19.1 20.7 20.6 19.6 ...
## $ status : Factor w/ 2 levels "Normal","Parkinson's": 2 2 2 2 2 2 2 2 2 2 ...
## $ RPDE : num 0.415 0.458 0.43 0.435 0.417 ...
## $ DFA : num 0.815 0.82 0.825 0.819 0.823 ...
## $ spread1 : num -4.81 -4.08 -4.44 -4.12 -3.75 ...
## $ spread2 : num 0.266 0.336 0.311 0.334 0.235 ...
## $ D2 : num 2.3 2.49 2.34 2.41 2.33 ...
## $ PPE : num 0.285 0.369 0.333 0.369 0.41 ...
head(parkinson_data)
## MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter... MDVP.Jitter.Abs.
## 1 119.992 157.302 74.997 0.00784 0.00007
## 2 122.400 148.650 113.819 0.00968 0.00008
## 3 116.682 131.111 111.555 0.01050 0.00009
## 4 116.676 137.871 111.366 0.00997 0.00009
## 5 116.014 141.781 110.655 0.01284 0.00011
## 6 120.552 131.162 113.787 0.00968 0.00008
## MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3
## 1 0.00370 0.00554 0.01109 0.04374 0.426 0.02182
## 2 0.00465 0.00696 0.01394 0.06134 0.626 0.03134
## 3 0.00544 0.00781 0.01633 0.05233 0.482 0.02757
## 4 0.00502 0.00698 0.01505 0.05492 0.517 0.02924
## 5 0.00655 0.00908 0.01966 0.06425 0.584 0.03490
## 6 0.00463 0.00750 0.01388 0.04701 0.456 0.02328
## Shimmer.APQ5 MDVP.APQ Shimmer.DDA NHR HNR status RPDE
## 1 0.03130 0.02971 0.06545 0.02211 21.033 Parkinson's 0.414783
## 2 0.04518 0.04368 0.09403 0.01929 19.085 Parkinson's 0.458359
## 3 0.03858 0.03590 0.08270 0.01309 20.651 Parkinson's 0.429895
## 4 0.04005 0.03772 0.08771 0.01353 20.644 Parkinson's 0.434969
## 5 0.04825 0.04465 0.10470 0.01767 19.649 Parkinson's 0.417356
## 6 0.03526 0.03243 0.06985 0.01222 21.378 Parkinson's 0.415564
## DFA spread1 spread2 D2 PPE
## 1 0.815285 -4.813031 0.266482 2.301442 0.284654
## 2 0.819521 -4.075192 0.335590 2.486855 0.368674
## 3 0.825288 -4.443179 0.311173 2.342259 0.332634
## 4 0.819235 -4.117501 0.334147 2.405554 0.368975
## 5 0.823484 -3.747787 0.234513 2.332180 0.410335
## 6 0.825069 -4.242867 0.299111 2.187560 0.357775
tail(parkinson_data)
## MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter... MDVP.Jitter.Abs.
## 190 201.774 262.707 78.228 0.00694 3e-05
## 191 174.188 230.978 94.261 0.00459 3e-05
## 192 209.516 253.017 89.488 0.00564 3e-05
## 193 174.688 240.005 74.287 0.01360 8e-05
## 194 198.764 396.961 74.904 0.00740 4e-05
## 195 214.289 260.277 77.973 0.00567 3e-05
## MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB.
## 190 0.00412 0.00396 0.01235 0.02574 0.255
## 191 0.00263 0.00259 0.00790 0.04087 0.405
## 192 0.00331 0.00292 0.00994 0.02751 0.263
## 193 0.00624 0.00564 0.01873 0.02308 0.256
## 194 0.00370 0.00390 0.01109 0.02296 0.241
## 195 0.00295 0.00317 0.00885 0.01884 0.190
## Shimmer.APQ3 Shimmer.APQ5 MDVP.APQ Shimmer.DDA NHR HNR status
## 190 0.01454 0.01582 0.01758 0.04363 0.04441 19.368 Normal
## 191 0.02336 0.02498 0.02745 0.07008 0.02764 19.517 Normal
## 192 0.01604 0.01657 0.01879 0.04812 0.01810 19.147 Normal
## 193 0.01268 0.01365 0.01667 0.03804 0.10715 17.883 Normal
## 194 0.01265 0.01321 0.01588 0.03794 0.07223 19.020 Normal
## 195 0.01026 0.01161 0.01373 0.03078 0.04398 21.209 Normal
## RPDE DFA spread1 spread2 D2 PPE
## 190 0.508479 0.683761 -6.934474 0.159890 2.316346 0.112838
## 191 0.448439 0.657899 -6.538586 0.121952 2.657476 0.133050
## 192 0.431674 0.683244 -6.195325 0.129303 2.784312 0.168895
## 193 0.407567 0.655683 -6.787197 0.158453 2.679772 0.131728
## 194 0.451221 0.643956 -6.744577 0.207454 2.138608 0.123306
## 195 0.462803 0.664357 -5.724056 0.190667 2.555477 0.148569
sum(is.na(parkinson_data))
## [1] 0
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.3
set.seed(786)
train_rows <- createDataPartition(parkinson_data$status, p = 0.6, list = F)
train_data <- parkinson_data[train_rows, ]
test_data <- parkinson_data[-train_rows, ]
str(train_data)
## 'data.frame': 118 obs. of 23 variables:
## $ MDVP.Fo.Hz. : num 116 107.3 95.7 95.1 88.3 ...
## $ MDVP.Fhi.Hz. : num 142 114 132 120 112 ...
## $ MDVP.Flo.Hz. : num 110.7 104.3 91.8 91.2 84.1 ...
## $ MDVP.Jitter... : num 0.01284 0.0029 0.00551 0.00532 0.00505 ...
## $ MDVP.Jitter.Abs.: num 0.00011 0.00003 0.00006 0.00006 0.00006 0.00006 0.00002 0.00003 0.00002 0.00004 ...
## $ MDVP.RAP : num 0.00655 0.00144 0.00293 0.00268 0.00254 0.00281 0.00118 0.00165 0.00121 0.00211 ...
## $ MDVP.PPQ : num 0.00908 0.00182 0.00332 0.00332 0.0033 0.00336 0.00153 0.00208 0.00149 0.00292 ...
## $ Jitter.DDP : num 0.01966 0.00431 0.0088 0.00803 0.00763 ...
## $ MDVP.Shimmer : num 0.0643 0.0157 0.0209 0.0284 0.0214 ...
## $ MDVP.Shimmer.dB.: num 0.584 0.134 0.191 0.255 0.197 0.249 0.112 0.154 0.158 0.192 ...
## $ Shimmer.APQ3 : num 0.0349 0.00829 0.01073 0.01441 0.01079 ...
## $ Shimmer.APQ5 : num 0.04825 0.00946 0.01277 0.01725 0.01342 ...
## $ MDVP.APQ : num 0.0447 0.0126 0.0172 0.0244 0.0189 ...
## $ Shimmer.DDA : num 0.1047 0.0249 0.0322 0.0432 0.0324 ...
## $ NHR : num 0.01767 0.00344 0.0107 0.01022 0.01166 ...
## $ HNR : num 19.6 26.9 21.8 21.9 21.1 ...
## $ status : Factor w/ 2 levels "Normal","Parkinson's": 2 2 2 2 2 2 2 2 2 2 ...
## $ RPDE : num 0.417 0.637 0.616 0.547 0.611 ...
## $ DFA : num 0.823 0.763 0.774 0.798 0.776 ...
## $ spread1 : num -3.75 -6.17 -5.5 -5.01 -5.25 ...
## $ spread2 : num 0.235 0.184 0.328 0.326 0.391 ...
## $ D2 : num 2.33 2.06 2.32 2.43 2.41 ...
## $ PPE : num 0.41 0.164 0.232 0.271 0.25 ...
Use the glm() function to build a basic model
Build a model using all the variables, excluding the response variable, in the dataset
logit_reg <- glm(status~., data = parkinson_data, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit_reg)
##
## Call:
## glm(formula = status ~ ., family = binomial, data = parkinson_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.14880 0.00000 0.08458 0.35089 1.80839
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.224e+01 1.730e+01 -0.707 0.4794
## MDVP.Fo.Hz. -1.187e-02 2.005e-02 -0.592 0.5538
## MDVP.Fhi.Hz. -2.448e-03 3.908e-03 -0.626 0.5312
## MDVP.Flo.Hz. 3.359e-03 1.116e-02 0.301 0.7635
## MDVP.Jitter... -1.413e+03 1.131e+03 -1.249 0.2117
## MDVP.Jitter.Abs. -4.242e+04 8.770e+04 -0.484 0.6286
## MDVP.RAP 1.813e+03 1.240e+05 0.015 0.9883
## MDVP.PPQ -1.916e+03 1.830e+03 -1.047 0.2951
## Jitter.DDP 6.729e+02 4.136e+04 0.016 0.9870
## MDVP.Shimmer 3.480e+02 9.340e+02 0.373 0.7095
## MDVP.Shimmer.dB. 2.354e+01 3.072e+01 0.766 0.4435
## Shimmer.APQ3 1.286e+04 1.083e+05 0.119 0.9054
## Shimmer.APQ5 -2.627e+02 4.159e+02 -0.632 0.5277
## MDVP.APQ 2.181e+02 3.667e+02 0.595 0.5519
## Shimmer.DDA -4.563e+03 3.612e+04 -0.126 0.8995
## NHR 2.059e+01 5.085e+01 0.405 0.6856
## HNR 5.209e-02 2.052e-01 0.254 0.7997
## RPDE -3.213e+00 4.760e+00 -0.675 0.4996
## DFA 7.563e+00 8.721e+00 0.867 0.3858
## spread1 4.647e-02 1.674e+00 0.028 0.9778
## spread2 1.047e+01 6.051e+00 1.731 0.0835 .
## D2 1.224e+00 1.410e+00 0.868 0.3856
## PPE 3.658e+01 2.458e+01 1.488 0.1367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 217.647 on 194 degrees of freedom
## Residual deviance: 91.027 on 172 degrees of freedom
## AIC: 137.03
##
## Number of Fisher Scoring iterations: 9
prob_train <- predict(logit_reg, type = "response")
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred <- prediction(prob_train, parkinson_data$status)
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, col=rainbow(10), colorize=T, print.cutoffs.at=seq(0,1,0.50))
perf_auc <- performance(pred, measure="auc")
auc <- perf_auc@y.values[[1]]
print(auc)
## [1] 0.9519558
### Write your answer here
# A cutoff of 0.05 seems reasonable
prob_test <- predict(logit_reg, test_data, type = "response")
preds_test <- ifelse(prob_test > 0.05, "Parkinson's", "Normal")
table(preds_test)
## preds_test
## Normal Parkinson's
## 2 75
test_data_labs <- test_data$status
conf_matrix <- table(test_data_labs, preds_test)
print(conf_matrix)
## preds_test
## test_data_labs Normal Parkinson's
## Normal 2 17
## Parkinson's 0 58
Calculate the Specificity
The Proportion of correctly identified negatives by the test/model.
\[{Specificity} = \frac{Number~of~True~Negatives}{Number~of~True~Negatives + Number~of~False~Positives}\]
specificity <- conf_matrix[1, 1]/sum(conf_matrix[1, ])
print(specificity)
## [1] 0.1052632
Calculate the Sensitivity
The Proportion of correctly identified positives by the test/model.
\[{Sensitivity} = \frac{Number~of~True~Positives}{Number~of~True~Positives + Number~of~False~Negatives}\]
sensitivity <- conf_matrix[2, 2]/sum(conf_matrix[2, ])
print(sensitivity)
## [1] 1
Calculate the Accuracy
The Proportion of correctly identified psotivies/negatives in the entire population by the test/model
\[{Accuracy} = \frac{Number~of~True~Positives +Number~of~True~Negatives}{Number~Of~Subjects~in~the~Population}\]
accuracy <- sum(diag(conf_matrix))/sum(conf_matrix)
print(accuracy)
## [1] 0.7792208
Use the caret package to compute the evaluation metrics
Mention your reference positive class as an argument to the confusionMatrix() function
library(caret)
confusionMatrix(preds_test, test_data$status, positive = "Parkinson's")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Parkinson's
## Normal 2 0
## Parkinson's 17 58
##
## Accuracy : 0.7792
## 95% CI : (0.6702, 0.8658)
## No Information Rate : 0.7532
## P-Value [Acc > NIR] : 0.3530856
##
## Kappa : 0.1506
## Mcnemar's Test P-Value : 0.0001042
##
## Sensitivity : 1.0000
## Specificity : 0.1053
## Pos Pred Value : 0.7733
## Neg Pred Value : 1.0000
## Prevalence : 0.7532
## Detection Rate : 0.7532
## Detection Prevalence : 0.9740
## Balanced Accuracy : 0.5526
##
## 'Positive' Class : Parkinson's
##