NOTE Before starting this assignment please remember to clear your environment, you can do that by running the following code chunk

rm(list = ls(all = TRUE))

Agenda

  • Get the data

  • Data Pre-processing

  • Build a model

  • Predictions

  • Communication

Reading & Understanding the Data

  • The “parkinsons_data.csv”" dataset is composed of a range of biomedical voice measurements. Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recordings.

  • The dataset has 195 rows and 23 columns.

  • The column/variable names’ explanation is given below:

  1. MDVP:Fo(Hz) : Average vocal fundamental frequency

  2. MDVP:Fhi(Hz) : Maximum vocal fundamental frequency

  3. MDVP:Flo(Hz) : Minimum vocal fundamental frequency

  4. MDVP:Jitter(%); MDVP:Jitter(Abs); MDVP:RAP; MDVP:PPQ; Jitter:DDP :_ Several measures of variation in fundamental frequency

  5. MDVP:Shimmer; MDVP:Shimmer(dB); Shimmer:APQ3; Shimmer:APQ5; MDVP:APQ; Shimmer:DDA : Several measures of variation in amplitude

  6. NHR; HNR : Two measures of ratio of noise to tonal components in the voice

  7. status : Health status of the subject (Parkinson’s) - Positive, (Normal) - Healthy

  8. RPDE; D2 : Two nonlinear dynamical complexity measures

  9. DFA : Signal fractal scaling exponent

  10. spread1; spread2; PPE : Three nonlinear measures of fundamental frequency variation

  • Make sure the dataset is located in your current working directory and read in the data
setwd("C:\\Users\\C5215696\\Desktop\\Data Science\\Regression-concepts\\Logistic-reg\\20170708_Batch30_CSE7302c_LogisticRegression_Lab")

getwd()
## [1] "C:/Users/C5215696/Desktop/Data Science/Regression-concepts/Logistic-reg/20170708_Batch30_CSE7302c_LogisticRegression_Lab"
parkinson_data <- read.csv(file = "parkinsons_data.csv", header = TRUE, sep = ",")
  • Use the str() function to get a feel for the dataset.
str(parkinson_data)
## 'data.frame':    195 obs. of  23 variables:
##  $ MDVP.Fo.Hz.     : num  120 122 117 117 116 ...
##  $ MDVP.Fhi.Hz.    : num  157 149 131 138 142 ...
##  $ MDVP.Flo.Hz.    : num  75 114 112 111 111 ...
##  $ MDVP.Jitter...  : num  0.00784 0.00968 0.0105 0.00997 0.01284 ...
##  $ MDVP.Jitter.Abs.: num  0.00007 0.00008 0.00009 0.00009 0.00011 0.00008 0.00003 0.00003 0.00006 0.00006 ...
##  $ MDVP.RAP        : num  0.0037 0.00465 0.00544 0.00502 0.00655 0.00463 0.00155 0.00144 0.00293 0.00268 ...
##  $ MDVP.PPQ        : num  0.00554 0.00696 0.00781 0.00698 0.00908 0.0075 0.00202 0.00182 0.00332 0.00332 ...
##  $ Jitter.DDP      : num  0.0111 0.0139 0.0163 0.015 0.0197 ...
##  $ MDVP.Shimmer    : num  0.0437 0.0613 0.0523 0.0549 0.0643 ...
##  $ MDVP.Shimmer.dB.: num  0.426 0.626 0.482 0.517 0.584 0.456 0.14 0.134 0.191 0.255 ...
##  $ Shimmer.APQ3    : num  0.0218 0.0313 0.0276 0.0292 0.0349 ...
##  $ Shimmer.APQ5    : num  0.0313 0.0452 0.0386 0.0401 0.0483 ...
##  $ MDVP.APQ        : num  0.0297 0.0437 0.0359 0.0377 0.0447 ...
##  $ Shimmer.DDA     : num  0.0654 0.094 0.0827 0.0877 0.1047 ...
##  $ NHR             : num  0.0221 0.0193 0.0131 0.0135 0.0177 ...
##  $ HNR             : num  21 19.1 20.7 20.6 19.6 ...
##  $ status          : Factor w/ 2 levels "Normal","Parkinson's": 2 2 2 2 2 2 2 2 2 2 ...
##  $ RPDE            : num  0.415 0.458 0.43 0.435 0.417 ...
##  $ DFA             : num  0.815 0.82 0.825 0.819 0.823 ...
##  $ spread1         : num  -4.81 -4.08 -4.44 -4.12 -3.75 ...
##  $ spread2         : num  0.266 0.336 0.311 0.334 0.235 ...
##  $ D2              : num  2.3 2.49 2.34 2.41 2.33 ...
##  $ PPE             : num  0.285 0.369 0.333 0.369 0.41 ...
  • Take a look at the data using the “head()” and “tail()” functions
head(parkinson_data)
##   MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter... MDVP.Jitter.Abs.
## 1     119.992      157.302       74.997        0.00784          0.00007
## 2     122.400      148.650      113.819        0.00968          0.00008
## 3     116.682      131.111      111.555        0.01050          0.00009
## 4     116.676      137.871      111.366        0.00997          0.00009
## 5     116.014      141.781      110.655        0.01284          0.00011
## 6     120.552      131.162      113.787        0.00968          0.00008
##   MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3
## 1  0.00370  0.00554    0.01109      0.04374            0.426      0.02182
## 2  0.00465  0.00696    0.01394      0.06134            0.626      0.03134
## 3  0.00544  0.00781    0.01633      0.05233            0.482      0.02757
## 4  0.00502  0.00698    0.01505      0.05492            0.517      0.02924
## 5  0.00655  0.00908    0.01966      0.06425            0.584      0.03490
## 6  0.00463  0.00750    0.01388      0.04701            0.456      0.02328
##   Shimmer.APQ5 MDVP.APQ Shimmer.DDA     NHR    HNR      status     RPDE
## 1      0.03130  0.02971     0.06545 0.02211 21.033 Parkinson's 0.414783
## 2      0.04518  0.04368     0.09403 0.01929 19.085 Parkinson's 0.458359
## 3      0.03858  0.03590     0.08270 0.01309 20.651 Parkinson's 0.429895
## 4      0.04005  0.03772     0.08771 0.01353 20.644 Parkinson's 0.434969
## 5      0.04825  0.04465     0.10470 0.01767 19.649 Parkinson's 0.417356
## 6      0.03526  0.03243     0.06985 0.01222 21.378 Parkinson's 0.415564
##        DFA   spread1  spread2       D2      PPE
## 1 0.815285 -4.813031 0.266482 2.301442 0.284654
## 2 0.819521 -4.075192 0.335590 2.486855 0.368674
## 3 0.825288 -4.443179 0.311173 2.342259 0.332634
## 4 0.819235 -4.117501 0.334147 2.405554 0.368975
## 5 0.823484 -3.747787 0.234513 2.332180 0.410335
## 6 0.825069 -4.242867 0.299111 2.187560 0.357775
tail(parkinson_data)
##     MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter... MDVP.Jitter.Abs.
## 190     201.774      262.707       78.228        0.00694            3e-05
## 191     174.188      230.978       94.261        0.00459            3e-05
## 192     209.516      253.017       89.488        0.00564            3e-05
## 193     174.688      240.005       74.287        0.01360            8e-05
## 194     198.764      396.961       74.904        0.00740            4e-05
## 195     214.289      260.277       77.973        0.00567            3e-05
##     MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB.
## 190  0.00412  0.00396    0.01235      0.02574            0.255
## 191  0.00263  0.00259    0.00790      0.04087            0.405
## 192  0.00331  0.00292    0.00994      0.02751            0.263
## 193  0.00624  0.00564    0.01873      0.02308            0.256
## 194  0.00370  0.00390    0.01109      0.02296            0.241
## 195  0.00295  0.00317    0.00885      0.01884            0.190
##     Shimmer.APQ3 Shimmer.APQ5 MDVP.APQ Shimmer.DDA     NHR    HNR status
## 190      0.01454      0.01582  0.01758     0.04363 0.04441 19.368 Normal
## 191      0.02336      0.02498  0.02745     0.07008 0.02764 19.517 Normal
## 192      0.01604      0.01657  0.01879     0.04812 0.01810 19.147 Normal
## 193      0.01268      0.01365  0.01667     0.03804 0.10715 17.883 Normal
## 194      0.01265      0.01321  0.01588     0.03794 0.07223 19.020 Normal
## 195      0.01026      0.01161  0.01373     0.03078 0.04398 21.209 Normal
##         RPDE      DFA   spread1  spread2       D2      PPE
## 190 0.508479 0.683761 -6.934474 0.159890 2.316346 0.112838
## 191 0.448439 0.657899 -6.538586 0.121952 2.657476 0.133050
## 192 0.431674 0.683244 -6.195325 0.129303 2.784312 0.168895
## 193 0.407567 0.655683 -6.787197 0.158453 2.679772 0.131728
## 194 0.451221 0.643956 -6.744577 0.207454 2.138608 0.123306
## 195 0.462803 0.664357 -5.724056 0.190667 2.555477 0.148569
  • Are there any missing values in the dataset?
sum(is.na(parkinson_data))
## [1] 0

Data Pre-processing

Train/Test Split

  • Split the data 70/30 into train and test sets, using Stratified Sampling by setting the seed as “786”
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.3
set.seed(786)

train_rows <- createDataPartition(parkinson_data$status, p = 0.6, list = F)

train_data <- parkinson_data[train_rows, ]

test_data <- parkinson_data[-train_rows, ]

str(train_data)
## 'data.frame':    118 obs. of  23 variables:
##  $ MDVP.Fo.Hz.     : num  116 107.3 95.7 95.1 88.3 ...
##  $ MDVP.Fhi.Hz.    : num  142 114 132 120 112 ...
##  $ MDVP.Flo.Hz.    : num  110.7 104.3 91.8 91.2 84.1 ...
##  $ MDVP.Jitter...  : num  0.01284 0.0029 0.00551 0.00532 0.00505 ...
##  $ MDVP.Jitter.Abs.: num  0.00011 0.00003 0.00006 0.00006 0.00006 0.00006 0.00002 0.00003 0.00002 0.00004 ...
##  $ MDVP.RAP        : num  0.00655 0.00144 0.00293 0.00268 0.00254 0.00281 0.00118 0.00165 0.00121 0.00211 ...
##  $ MDVP.PPQ        : num  0.00908 0.00182 0.00332 0.00332 0.0033 0.00336 0.00153 0.00208 0.00149 0.00292 ...
##  $ Jitter.DDP      : num  0.01966 0.00431 0.0088 0.00803 0.00763 ...
##  $ MDVP.Shimmer    : num  0.0643 0.0157 0.0209 0.0284 0.0214 ...
##  $ MDVP.Shimmer.dB.: num  0.584 0.134 0.191 0.255 0.197 0.249 0.112 0.154 0.158 0.192 ...
##  $ Shimmer.APQ3    : num  0.0349 0.00829 0.01073 0.01441 0.01079 ...
##  $ Shimmer.APQ5    : num  0.04825 0.00946 0.01277 0.01725 0.01342 ...
##  $ MDVP.APQ        : num  0.0447 0.0126 0.0172 0.0244 0.0189 ...
##  $ Shimmer.DDA     : num  0.1047 0.0249 0.0322 0.0432 0.0324 ...
##  $ NHR             : num  0.01767 0.00344 0.0107 0.01022 0.01166 ...
##  $ HNR             : num  19.6 26.9 21.8 21.9 21.1 ...
##  $ status          : Factor w/ 2 levels "Normal","Parkinson's": 2 2 2 2 2 2 2 2 2 2 ...
##  $ RPDE            : num  0.417 0.637 0.616 0.547 0.611 ...
##  $ DFA             : num  0.823 0.763 0.774 0.798 0.776 ...
##  $ spread1         : num  -3.75 -6.17 -5.5 -5.01 -5.25 ...
##  $ spread2         : num  0.235 0.184 0.328 0.326 0.391 ...
##  $ D2              : num  2.33 2.06 2.32 2.43 2.41 ...
##  $ PPE             : num  0.41 0.164 0.232 0.271 0.25 ...

Build a model

Basic Logistic Regression Model

  • Use the glm() function to build a basic model

  • Build a model using all the variables, excluding the response variable, in the dataset

logit_reg <- glm(status~., data = parkinson_data, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
  • Get the summary of the model and understand the output
summary(logit_reg)
## 
## Call:
## glm(formula = status ~ ., family = binomial, data = parkinson_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.14880   0.00000   0.08458   0.35089   1.80839  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)      -1.224e+01  1.730e+01  -0.707   0.4794  
## MDVP.Fo.Hz.      -1.187e-02  2.005e-02  -0.592   0.5538  
## MDVP.Fhi.Hz.     -2.448e-03  3.908e-03  -0.626   0.5312  
## MDVP.Flo.Hz.      3.359e-03  1.116e-02   0.301   0.7635  
## MDVP.Jitter...   -1.413e+03  1.131e+03  -1.249   0.2117  
## MDVP.Jitter.Abs. -4.242e+04  8.770e+04  -0.484   0.6286  
## MDVP.RAP          1.813e+03  1.240e+05   0.015   0.9883  
## MDVP.PPQ         -1.916e+03  1.830e+03  -1.047   0.2951  
## Jitter.DDP        6.729e+02  4.136e+04   0.016   0.9870  
## MDVP.Shimmer      3.480e+02  9.340e+02   0.373   0.7095  
## MDVP.Shimmer.dB.  2.354e+01  3.072e+01   0.766   0.4435  
## Shimmer.APQ3      1.286e+04  1.083e+05   0.119   0.9054  
## Shimmer.APQ5     -2.627e+02  4.159e+02  -0.632   0.5277  
## MDVP.APQ          2.181e+02  3.667e+02   0.595   0.5519  
## Shimmer.DDA      -4.563e+03  3.612e+04  -0.126   0.8995  
## NHR               2.059e+01  5.085e+01   0.405   0.6856  
## HNR               5.209e-02  2.052e-01   0.254   0.7997  
## RPDE             -3.213e+00  4.760e+00  -0.675   0.4996  
## DFA               7.563e+00  8.721e+00   0.867   0.3858  
## spread1           4.647e-02  1.674e+00   0.028   0.9778  
## spread2           1.047e+01  6.051e+00   1.731   0.0835 .
## D2                1.224e+00  1.410e+00   0.868   0.3856  
## PPE               3.658e+01  2.458e+01   1.488   0.1367  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.647  on 194  degrees of freedom
## Residual deviance:  91.027  on 172  degrees of freedom
## AIC: 137.03
## 
## Number of Fisher Scoring iterations: 9

Create an ROC Curve

  1. Get a list of predictions (probability scores) using the predict() function
prob_train <- predict(logit_reg, type = "response")
  1. Using the ROCR package create a “prediction()” object
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
pred <- prediction(prob_train, parkinson_data$status)
  1. Extract performance measures (True Positive Rate and False Positive Rate) using the “performance()” function from the ROCR package
perf <- performance(pred, measure="tpr", x.measure="fpr")
  1. Plot the ROC curve using the extracted performance measures (TPR and FPR)
plot(perf, col=rainbow(10), colorize=T, print.cutoffs.at=seq(0,1,0.50))

  • Extract the AUC score of the ROC curve and store it in a variable named “auc”
perf_auc <- performance(pred, measure="auc")

auc <- perf_auc@y.values[[1]]

print(auc)
## [1] 0.9519558

Choose a Cutoff Value

  • Based on the trade off between TPR and FPR depending on the business domain, a call on the cutoff has to be made.
### Write your answer here

# A cutoff of 0.05 seems reasonable

Predictions on test data

  • After choosing a cutoff value, predict the class labels on the test data using our model
prob_test <- predict(logit_reg, test_data, type = "response")

preds_test <- ifelse(prob_test > 0.05, "Parkinson's", "Normal")

table(preds_test)
## preds_test
##      Normal Parkinson's 
##           2          75

Evaluation Metrics for classification

Manual Computation

Confusion Matrix

  • Create a confusion matrix using the table() function
test_data_labs <- test_data$status

conf_matrix <- table(test_data_labs, preds_test)

print(conf_matrix)
##               preds_test
## test_data_labs Normal Parkinson's
##    Normal           2          17
##    Parkinson's      0          58

Specificity

  • Calculate the Specificity

  • The Proportion of correctly identified negatives by the test/model.

\[{Specificity} = \frac{Number~of~True~Negatives}{Number~of~True~Negatives + Number~of~False~Positives}\]

specificity <- conf_matrix[1, 1]/sum(conf_matrix[1, ])

print(specificity)
## [1] 0.1052632

Sensitivity

  • Calculate the Sensitivity

  • The Proportion of correctly identified positives by the test/model.

\[{Sensitivity} = \frac{Number~of~True~Positives}{Number~of~True~Positives + Number~of~False~Negatives}\]

sensitivity <- conf_matrix[2, 2]/sum(conf_matrix[2, ])

print(sensitivity)
## [1] 1

Accuracy

  • Calculate the Accuracy

  • The Proportion of correctly identified psotivies/negatives in the entire population by the test/model

\[{Accuracy} = \frac{Number~of~True~Positives +Number~of~True~Negatives}{Number~Of~Subjects~in~the~Population}\]

accuracy <- sum(diag(conf_matrix))/sum(conf_matrix)

print(accuracy)
## [1] 0.7792208

Automated Computation through Caret

  • Use the caret package to compute the evaluation metrics

  • Mention your reference positive class as an argument to the confusionMatrix() function

library(caret)
confusionMatrix(preds_test, test_data$status, positive = "Parkinson's")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Normal Parkinson's
##   Normal           2           0
##   Parkinson's     17          58
##                                           
##                Accuracy : 0.7792          
##                  95% CI : (0.6702, 0.8658)
##     No Information Rate : 0.7532          
##     P-Value [Acc > NIR] : 0.3530856       
##                                           
##                   Kappa : 0.1506          
##  Mcnemar's Test P-Value : 0.0001042       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.1053          
##          Pos Pred Value : 0.7733          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7532          
##          Detection Rate : 0.7532          
##    Detection Prevalence : 0.9740          
##       Balanced Accuracy : 0.5526          
##                                           
##        'Positive' Class : Parkinson's     
##