Diagnosing Heart Disease

Of all the applications of machine learning, diagnosing any serious disease using a black box is always going to be a hard sell. If the output from a model is a particular course of treatment (potentially with side-effects), or surgery, or the absence of treatment, people are going to want to know why.

This dataset gives a number of variables along with a target condition of having or not having heart disease.

Reading the data

heartDf<-read.csv("heart.csv")
summary(heartDf)
#>       age             sex               cp           trestbps    
#>  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
#>  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
#>  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
#>  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
#>  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
#>  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
#>       chol            fbs            restecg          thalach     
#>  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
#>  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
#>  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
#>  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
#>  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
#>  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
#>      exang           oldpeak         slope             ca        
#>  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
#>  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
#>  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
#>  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
#>  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
#>  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
#>       thal           target      
#>  Min.   :0.000   Min.   :0.0000  
#>  1st Qu.:2.000   1st Qu.:0.0000  
#>  Median :2.000   Median :1.0000  
#>  Mean   :2.314   Mean   :0.5446  
#>  3rd Qu.:3.000   3rd Qu.:1.0000  
#>  Max.   :3.000   Max.   :1.0000

Column names for better interpretation

Meaning of column names:

age: The person’s age in years
sex: The person’s sex (1 = male, 0 = female)
cp: The chest pain experienced (documented as Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic; the summary above shows codes 0 to 3 in this file)
trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)
chol: The person’s cholesterol measurement in mg/dl
fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
thalach: The person’s maximum heart rate achieved
exang: Exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot.)
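slope: The slope of the peak exercise ST segment (documented as Value 1: upsloping, Value 2: flat, Value 3: downsloping; the summary above shows codes 0 to 2 in this file)
ca: The number of major vessels (documented as 0-3; the summary above shows a maximum of 4 in this file)
thal: A blood disorder called thalassemia (documented as 3 = normal; 6 = fixed defect; 7 = reversible defect; the summary above shows codes 0 to 3 in this file)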
target: Heart disease (0 = no, 1 = yes)

names(heartDf)<-c('age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved','exercise_induced_angina','st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target')

Avoiding HARKing (hypothesizing after the results are known)

Let us first try to understand how each variable relates to a diagnosis of heart disease, and then try to solve the case:

Looking up heart disease risk factors led me to the following: high cholesterol, high blood pressure, diabetes, weight, family history and smoking [3]. According to another source [4], the major factors that can’t be changed are increasing age, male gender and heredity. Note that thalassemia, one of the variables in this dataset, is hereditary. Major factors that can be modified are smoking, high cholesterol, high blood pressure, physical inactivity, being overweight and having diabetes. Other factors include stress, alcohol and poor diet/nutrition.

I can see no reference to the ‘number of major vessels’, but given that the definition of heart disease is “…what happens when your heart’s blood supply is blocked or interrupted by a build-up of fatty substances in the coronary arteries”, it seems logical that having more major vessels is a good thing and will therefore reduce the probability of heart disease.

Let us change some of the variables

heartDf$sex[heartDf$sex == 0]<-"F"
heartDf$sex[heartDf$sex == 1]<-"M"

heartDf$chest_pain_type[heartDf$chest_pain_type == 0] <- NA
heartDf$chest_pain_type[heartDf$chest_pain_type == 1] <- 'Typical Angina'
heartDf$chest_pain_type[heartDf$chest_pain_type == 2] <- 'Atypical Angina'
heartDf$chest_pain_type[heartDf$chest_pain_type == 3] <- 'Non-Anginal Pain'
heartDf$chest_pain_type[heartDf$chest_pain_type == 4] <- 'Asymptomatic'

heartDf$fasting_blood_sugar[heartDf$fasting_blood_sugar == 0] <- 'Lower than 120mg/ml'
heartDf$fasting_blood_sugar[heartDf$fasting_blood_sugar == 1] <- 'Greater than 120mg/ml'

heartDf$rest_ecg[heartDf$rest_ecg == 0] <- 'Normal'
heartDf$rest_ecg[heartDf$rest_ecg == 1] <- 'ST-T wave abnormality'
heartDf$rest_ecg[heartDf$rest_ecg == 2] <- 'Left Ventricular Hypertrophy'

heartDf$exercise_induced_angina[heartDf$exercise_induced_angina == 0] <- 'No'
heartDf$exercise_induced_angina[heartDf$exercise_induced_angina == 1] <- 'Yes'


heartDf$st_slope[heartDf$st_slope == 0] <- NA
heartDf$st_slope[heartDf$st_slope == 1] <- 'Upsloping'
heartDf$st_slope[heartDf$st_slope == 2] <- 'Flat'
heartDf$st_slope[heartDf$st_slope == 3] <- 'Downsloping'

heartDf$thalassemia[heartDf$thalassemia == 0] <- NA
heartDf$thalassemia[heartDf$thalassemia == 1] <- 'Normal'
heartDf$thalassemia[heartDf$thalassemia == 2] <- 'Fixed Defect'
heartDf$thalassemia[heartDf$thalassemia == 3] <- 'Reversable Defect'

head(heartDf)
#>   age sex  chest_pain_type resting_blood_pressure cholesterol
#> 1  63   M Non-Anginal Pain                    145         233
#> 2  37   M  Atypical Angina                    130         250
#> 3  41   F   Typical Angina                    130         204
#> 4  56   M   Typical Angina                    120         236
#> 5  57   F             <NA>                    120         354
#> 6  57   M             <NA>                    140         192
#>     fasting_blood_sugar              rest_ecg max_heart_rate_achieved
#> 1 Greater than 120mg/ml                Normal                     150
#> 2   Lower than 120mg/ml ST-T wave abnormality                     187
#> 3   Lower than 120mg/ml                Normal                     172
#> 4   Lower than 120mg/ml ST-T wave abnormality                     178
#> 5   Lower than 120mg/ml ST-T wave abnormality                     163
#> 6   Lower than 120mg/ml ST-T wave abnormality                     148
#>   exercise_induced_angina st_depression  st_slope num_major_vessels
#> 1                      No           2.3      <NA>                 0
#> 2                      No           3.5      <NA>                 0
#> 3                      No           1.4      Flat                 0
#> 4                      No           0.8      Flat                 0
#> 5                     Yes           0.6      Flat                 0
#> 6                      No           0.4 Upsloping                 0
#>    thalassemia target
#> 1       Normal      1
#> 2 Fixed Defect      1
#> 3 Fixed Defect      1
#> 4 Fixed Defect      1
#> 5 Fixed Defect      1
#> 6       Normal      1
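
The same kind of recoding can be sketched more compactly with factor(), which maps codes to labels in one call and turns any code not listed in levels into NA. This is only an illustrative alternative on a made-up vector (the name codes and its values are hypothetical), not the code used above.

```r
# Hypothetical chest-pain codes; 0 is deliberately not listed in `levels`
codes <- c(3, 2, 1, 1, 0)

chest_pain <- factor(
  codes,
  levels = c(1, 2, 3),
  labels = c("Typical Angina", "Atypical Angina", "Non-Anginal Pain")
)

# Codes missing from `levels` become NA, mirroring the explicit NA assignment
print(chest_pain)
print(sum(is.na(chest_pain)))
```

A side effect of this form is that the column is already a factor, so no later conversion from character is needed.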

Dropping missing values and checking the datatypes before we move ahead

heartDf<-na.omit(heartDf)
dim(heartDf)
#> [1] 149  14
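
Since na.omit() removes every row with at least one NA, it can help to count the missing values per column first and compare the row counts before and after. A minimal sketch on a made-up data frame (toy and its columns are hypothetical, not the heart data):

```r
# Toy data frame with made-up values (not the heart data)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z")
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows before and after dropping incomplete rows
rows_before <- nrow(toy)
rows_after <- nrow(na.omit(toy))
print(c(before = rows_before, after = rows_after))
```

Here two of the four rows are lost, one because of each column, which is the kind of cost worth checking before committing to na.omit().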

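Our aim is to distinguish the presence of heart disease from its absence (value 0). Where presence is coded with the values 1, 2, 3 and 4, we replace all labels greater than 1 by 1. In this file the first summary above already shows a maximum of 1 for target, so the replacement below changes nothing here.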

heartDf$target[heartDf$target > 1] <- 1
summary(heartDf)
#>       age            sex            chest_pain_type    resting_blood_pressure
#>  Min.   :29.00   Length:149         Length:149         Min.   : 94           
#>  1st Qu.:45.00   Class :character   Class :character   1st Qu.:120           
#>  Median :54.00   Mode  :character   Mode  :character   Median :130           
#>  Mean   :53.23                                         Mean   :131           
#>  3rd Qu.:60.00                                         3rd Qu.:140           
#>  Max.   :76.00                                         Max.   :192           
#>   cholesterol    fasting_blood_sugar   rest_ecg         max_heart_rate_achieved
#>  Min.   :126.0   Length:149          Length:149         Min.   : 96            
#>  1st Qu.:211.0   Class :character    Class :character   1st Qu.:149            
#>  Median :235.0   Mode  :character    Mode  :character   Median :162            
#>  Mean   :243.6                                          Mean   :158            
#>  3rd Qu.:269.0                                          3rd Qu.:172            
#>  Max.   :564.0                                          Max.   :202            
#>  exercise_induced_angina st_depression      st_slope         num_major_vessels
#>  Length:149              Min.   :0.0000   Length:149         Min.   :0.0000   
#>  Class :character        1st Qu.:0.0000   Class :character   1st Qu.:0.0000   
#>  Mode  :character        Median :0.2000   Mode  :character   Median :0.0000   
#>                          Mean   :0.6691                      Mean   :0.5503   
#>                          3rd Qu.:1.2000                      3rd Qu.:1.0000   
#>                          Max.   :3.8000                      Max.   :4.0000   
#>  thalassemia            target      
#>  Length:149         Min.   :0.0000  
#>  Class :character   1st Qu.:1.0000  
#>  Mode  :character   Median :1.0000  
#>                     Mean   :0.7785  
#>                     3rd Qu.:1.0000  
#>                     Max.   :1.0000
sapply(heartDf,class)
#>                     age                     sex         chest_pain_type 
#>               "integer"             "character"             "character" 
#>  resting_blood_pressure             cholesterol     fasting_blood_sugar 
#>               "integer"               "integer"             "character" 
#>                rest_ecg max_heart_rate_achieved exercise_induced_angina 
#>             "character"               "integer"             "character" 
#>           st_depression                st_slope       num_major_vessels 
#>               "numeric"             "character"               "integer" 
#>             thalassemia                  target 
#>             "character"               "numeric"

In R, a categorical variable (a variable that takes on a finite number of values) is a factor. As we can see, the recoded columns such as sex are stored as character vectors, and target is stored as a number even though it can only be 0 or 1. We can use the transform() function to change the built-in type of each feature.

heartDfTrans <- transform(
  heartDf,
  age=as.integer(age),
  sex=as.factor(sex),
  chest_pain_type=as.factor(chest_pain_type),
  resting_blood_pressure=as.integer(resting_blood_pressure),
  cholesterol=as.integer(cholesterol),
  fasting_blood_sugar=as.factor(fasting_blood_sugar),
  rest_ecg=as.factor(rest_ecg),
  max_heart_rate_achieved=as.integer(max_heart_rate_achieved),
  exercise_induced_angina=as.factor(exercise_induced_angina),
  st_depression=as.numeric(st_depression),
  st_slope=as.factor(st_slope),
  num_major_vessels=as.factor(num_major_vessels),
  thalassemia=as.factor(thalassemia),
  target=as.factor(target)
)

sapply(heartDfTrans, class)
#>                     age                     sex         chest_pain_type 
#>               "integer"                "factor"                "factor" 
#>  resting_blood_pressure             cholesterol     fasting_blood_sugar 
#>               "integer"               "integer"                "factor" 
#>                rest_ecg max_heart_rate_achieved exercise_induced_angina 
#>                "factor"               "integer"                "factor" 
#>           st_depression                st_slope       num_major_vessels 
#>               "numeric"                "factor"                "factor" 
#>             thalassemia                  target 
#>                "factor"                "factor"

sapply(heartDfTrans, typeof)
#>                     age                     sex         chest_pain_type 
#>               "integer"               "integer"               "integer" 
#>  resting_blood_pressure             cholesterol     fasting_blood_sugar 
#>               "integer"               "integer"               "integer" 
#>                rest_ecg max_heart_rate_achieved exercise_induced_angina 
#>               "integer"               "integer"               "integer" 
#>           st_depression                st_slope       num_major_vessels 
#>                "double"               "integer"               "integer" 
#>             thalassemia                  target 
#>               "integer"               "integer"
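
The two outputs above differ because class() reports the R-level class while typeof() reports the underlying storage, and a factor is stored as integer codes plus a set of labels. A small sketch on a made-up vector:

```r
# Made-up factor, standing in for a column such as sex
sex <- factor(c("F", "M", "F"))

print(class(sex))       # the R-level class: "factor"
print(typeof(sex))      # the underlying storage: "integer"
print(as.integer(sex))  # the integer codes behind the labels
print(levels(sex))      # the labels the codes point to
```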

summary(heartDfTrans)
#>       age        sex            chest_pain_type resting_blood_pressure
#>  Min.   :29.00   F:54   Atypical Angina :81     Min.   : 94           
#>  1st Qu.:45.00   M:95   Non-Anginal Pain:20     1st Qu.:120           
#>  Median :54.00          Typical Angina  :48     Median :130           
#>  Mean   :53.23                                  Mean   :131           
#>  3rd Qu.:60.00                                  3rd Qu.:140           
#>  Max.   :76.00                                  Max.   :192           
#>   cholesterol               fasting_blood_sugar
#>  Min.   :126.0   Greater than 120mg/ml: 24     
#>  1st Qu.:211.0   Lower than 120mg/ml  :125     
#>  Median :235.0                                 
#>  Mean   :243.6                                 
#>  3rd Qu.:269.0                                 
#>  Max.   :564.0                                 
#>                          rest_ecg  max_heart_rate_achieved
#>  Left Ventricular Hypertrophy: 1   Min.   : 96            
#>  Normal                      :63   1st Qu.:149            
#>  ST-T wave abnormality       :85   Median :162            
#>                                    Mean   :158            
#>                                    3rd Qu.:172            
#>                                    Max.   :202            
#>  exercise_induced_angina st_depression         st_slope  num_major_vessels
#>  No :131                 Min.   :0.0000   Flat     :93   0:100            
#>  Yes: 18                 1st Qu.:0.0000   Upsloping:56   1: 30            
#>                          Median :0.2000                  2:  9            
#>                          Mean   :0.6691                  3:  6            
#>                          3rd Qu.:1.2000                  4:  4            
#>                          Max.   :3.8000                                   
#>             thalassemia  target 
#>  Fixed Defect     :108   0: 33  
#>  Normal           :  5   1:116  
#>  Reversable Defect: 36          
#>                                 
#>                                 
#>

colSums(is.na(heartDfTrans))
#>                     age                     sex         chest_pain_type 
#>                       0                       0                       0 
#>  resting_blood_pressure             cholesterol     fasting_blood_sugar 
#>                       0                       0                       0 
#>                rest_ecg max_heart_rate_achieved exercise_induced_angina 
#>                       0                       0                       0 
#>           st_depression                st_slope       num_major_vessels 
#>                       0                       0                       0 
#>             thalassemia                  target 
#>                       0                       0

Hands-on: Actual Classification problem:

1. Before we begin, let us create training and test samples:

If the training data is biased, the entire model can carry that bias forward. To avoid this, it is important that we identify the bias and choose the training data accordingly.

summary(heartDfTrans$target)
#>   0   1 
#>  33 116

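Here the classes are clearly imbalanced (33 observations with target 0 against 116 with target 1, a delta of 83). It is therefore good to split the data so that the target classes are distributed in the same proportion in the training and test sets.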


# Get subset of dataframe with all the 1's
heartDfTransOnes<-subset(heartDfTrans,heartDfTrans$target==1)
dim(heartDfTransOnes)
#> [1] 116  14

# Get subset of dataframe with all the 0's
heartDfTransZeros<-subset(heartDfTrans,heartDfTrans$target==0)
dim(heartDfTransZeros)
#> [1] 33 14

# Fix the random seed so that the sampling below is reproducible
set.seed(100)

heartDfTransOnesTrainingSet<-sample(1:nrow(heartDfTransOnes),0.7*nrow(heartDfTransOnes))
heartDfTransZerosTrainingSet<-sample(1:nrow(heartDfTransZeros),0.7*nrow(heartDfTransZeros))

trainingDataOnes<-heartDfTransOnes[heartDfTransOnesTrainingSet,]
dim(trainingDataOnes)
#> [1] 81 14
trainingDataZeros<-heartDfTransZeros[heartDfTransZerosTrainingSet,]
dim(trainingDataZeros)
#> [1] 23 14
trainingData<-rbind(trainingDataOnes,trainingDataZeros)
dim(trainingData)
#> [1] 104  14

testDataOnes<-heartDfTransOnes[-heartDfTransOnesTrainingSet,]
dim(testDataOnes)
#> [1] 35 14
testDataZeros<-heartDfTransZeros[-heartDfTransZerosTrainingSet,]
dim(testDataZeros)
#> [1] 10 14
testData<-rbind(testDataOnes,testDataZeros)
dim(testData)
#> [1] 45 14

# The data is now divided into training and test sets of roughly 70% and 30% respectively (104 and 45 rows)
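
The per-class sampling above can be wrapped in a small helper so that it works for any number of classes. This is a sketch under stated assumptions: stratified_split() is a hypothetical name, the toy data is made up, and the per-class size is rounded down with floor(), which gives the same counts as the split above (81 and 23).

```r
# Hypothetical helper: sample `frac` of the rows within each class of `target_col`
stratified_split <- function(df, target_col, frac = 0.7) {
  rows_by_class <- split(seq_len(nrow(df)), df[[target_col]])
  train_idx <- unlist(lapply(rows_by_class, function(rows) {
    # Index into `rows` to avoid sample()'s special case for a single number
    rows[sample.int(length(rows), floor(frac * length(rows)))]
  }), use.names = FALSE)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

set.seed(100)
# Made-up data: 5 rows of class 0 and 15 rows of class 1
toy <- data.frame(x = rnorm(20),
                  target = factor(rep(c(0, 1), times = c(5, 15))))
parts <- stratified_split(toy, "target")

print(table(parts$train$target))
print(table(parts$test$target))
```

Both parts keep the 1:3 ratio of the toy classes as closely as the rounding allows.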

2. Logistic Regression:

One more important point concerns the “factor has new levels” error that predict() raises when the test data contains a factor level that the model never saw. To avoid it, make sure of two things: the factor levels in the training and test data frames are exactly the same, and the factor columns have no missing values.

We use the glm() function with the binomial family to fit a logistic regression. Once the fitted model is stored in logisticRegressionModel, we use the predict() function to derive the log(odds) of the Y variable, which in our case is target. The log(odds) is unbounded, whereas we want values between 0 and 1. So, to convert it into prediction probability scores bounded between 0 and 1, we use plogis().

library(InformationValue)
logisticRegressionModel<-glm(target  ~ age+
                            sex+
                            chest_pain_type+
                            resting_blood_pressure+
                            cholesterol+
                            fasting_blood_sugar+
                            rest_ecg+
                            max_heart_rate_achieved+
                            exercise_induced_angina+
                            st_depression+
                            st_slope+
                            num_major_vessels+
                            thalassemia, data=trainingData, family=binomial(link="logit"))
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
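
The conversion from log(odds) to probability described above can be checked on a small, self-contained case. The sketch below uses the built-in mtcars data as a stand-in (am plays the role of target, which is an assumption for illustration only) and shows that plogis() is the inverse logit and that it agrees with predict(type = "response").

```r
# Built-in mtcars data as a stand-in; `am` (0/1) plays the role of `target`
model <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# predict() returns log(odds) by default (type = "link")
log_odds <- predict(model, newdata = mtcars)

prob_plogis   <- plogis(log_odds)             # inverse logit
prob_manual   <- 1 / (1 + exp(-log_odds))     # the same formula by hand
prob_response <- predict(model, newdata = mtcars, type = "response")

print(all.equal(prob_plogis, prob_manual))
print(all.equal(prob_plogis, prob_response))
print(range(prob_plogis))
```

Using type = "response" is simply a shortcut for the same transformation; which form to prefer is a matter of taste.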

sapply(trainingData, levels)
#> $age
#> NULL
#> 
#> $sex
#> [1] "F" "M"
#> 
#> $chest_pain_type
#> [1] "Atypical Angina"  "Non-Anginal Pain" "Typical Angina"  
#> 
#> $resting_blood_pressure
#> NULL
#> 
#> $cholesterol
#> NULL
#> 
#> $fasting_blood_sugar
#> [1] "Greater than 120mg/ml" "Lower than 120mg/ml"  
#> 
#> $rest_ecg
#> [1] "Left Ventricular Hypertrophy" "Normal"                      
#> [3] "ST-T wave abnormality"       
#> 
#> $max_heart_rate_achieved
#> NULL
#> 
#> $exercise_induced_angina
#> [1] "No"  "Yes"
#> 
#> $st_depression
#> NULL
#> 
#> $st_slope
#> [1] "Flat"      "Upsloping"
#> 
#> $num_major_vessels
#> [1] "0" "1" "2" "3" "4"
#> 
#> $thalassemia
#> [1] "Fixed Defect"      "Normal"            "Reversable Defect"
#> 
#> $target
#> [1] "0" "1"
sapply(testData, levels)
#> $age
#> NULL
#> 
#> $sex
#> [1] "F" "M"
#> 
#> $chest_pain_type
#> [1] "Atypical Angina"  "Non-Anginal Pain" "Typical Angina"  
#> 
#> $resting_blood_pressure
#> NULL
#> 
#> $cholesterol
#> NULL
#> 
#> $fasting_blood_sugar
#> [1] "Greater than 120mg/ml" "Lower than 120mg/ml"  
#> 
#> $rest_ecg
#> [1] "Left Ventricular Hypertrophy" "Normal"                      
#> [3] "ST-T wave abnormality"       
#> 
#> $max_heart_rate_achieved
#> NULL
#> 
#> $exercise_induced_angina
#> [1] "No"  "Yes"
#> 
#> $st_depression
#> NULL
#> 
#> $st_slope
#> [1] "Flat"      "Upsloping"
#> 
#> $num_major_vessels
#> [1] "0" "1" "2" "3" "4"
#> 
#> $thalassemia
#> [1] "Fixed Defect"      "Normal"            "Reversable Defect"
#> 
#> $target
#> [1] "0" "1"
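
Comparing the two listings by eye works for a handful of columns, but the check can also be automated. A minimal sketch, assuming a hypothetical helper same_levels() and made-up data frames:

```r
# Hypothetical helper: for every factor column in `train`, check whether
# `test` carries exactly the same levels
same_levels <- function(train, test) {
  factor_cols <- names(train)[sapply(train, is.factor)]
  sapply(factor_cols, function(col) {
    identical(levels(train[[col]]), levels(test[[col]]))
  })
}

# Made-up data; row subsetting keeps the original factor levels
train <- data.frame(sex = factor(c("F", "M", "F")), age = c(30, 40, 50))
test_ok <- train[1, ]

# droplevels() removes unused levels, so "M" disappears from this copy
test_dropped <- droplevels(train[1, ])

print(same_levels(train, test_ok))
print(same_levels(train, test_dropped))
```

A FALSE entry points at the column whose levels need to be aligned before calling predict().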

The default cutoff prediction probability score is 0.5 or the ratio of 1’s and 0’s in the training data. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The InformationValue::optimalCutoff function provides ways to find the optimal cutoff to improve the prediction of 1’s, 0’s, or both 1’s and 0’s, and to reduce the misclassification error. Let’s compute the cutoff that minimizes the misclassification error for the above model. Note that the code below tunes the cutoff on the test data, so the diagnostics computed with it are likely optimistic.

predicted <- plogis(predict(logisticRegressionModel, testData))
optCutOff <- optimalCutoff(testData$target, predicted)[1]
optCutOff
#> [1] 0.01
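
To make the idea of a misclassification-minimizing cutoff concrete, here is a plain grid search in base R. It is a sketch of the general technique on made-up labels and probabilities, not the implementation inside InformationValue::optimalCutoff, and the variable names are hypothetical.

```r
# Made-up labels and predicted probabilities
actuals <- c(0, 0, 1, 1, 1)
probs   <- c(0.2, 0.6, 0.55, 0.8, 0.9)

# Candidate cutoffs to try
cutoffs <- seq(0.01, 0.99, by = 0.01)

# Misclassification error at each cutoff: share of cases labelled wrongly
errors <- sapply(cutoffs, function(cutoff) {
  predicted_class <- as.integer(probs >= cutoff)
  mean(predicted_class != actuals)
})

# First cutoff that reaches the smallest error
best_cutoff <- cutoffs[which.min(errors)]
min_error <- min(errors)

print(best_cutoff)
print(min_error)
```

Several cutoffs usually tie for the smallest error on a small sample, so the reported value depends on how ties are broken.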

2.1 Model Diagnostics:

Sensitivity measures how often a test correctly generates a positive result for people who have the condition that’s being tested for (also known as the “true positive” rate). A test that’s highly sensitive will flag almost everyone who has the disease and not generate many false-negative results. (Example: a test with 90% sensitivity will correctly return a positive result for 90% of people who have the disease, but will return a negative result — a false-negative — for 10% of the people who have the disease and should have tested positive.)

Specificity measures a test’s ability to correctly generate a negative result for people who don’t have the condition that’s being tested for (also known as the “true negative” rate). A high-specificity test will correctly rule out almost everyone who doesn’t have the disease and won’t generate many false-positive results. (Example: a test with 90% specificity will correctly return a negative result for 90% of people who don’t have the disease, but will return a positive result — a false-positive — for 10% of the people who don’t have the disease and should have tested negative.)

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings.
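
The points of a ROC curve can be computed by hand for a few thresholds. The sketch below uses made-up labels and probabilities (all names are hypothetical) and only illustrates the definition; it is not how plotROC() is implemented.

```r
# Made-up labels and predicted probabilities
actuals <- c(0, 0, 1, 1, 1)
probs   <- c(0.2, 0.6, 0.55, 0.8, 0.9)
thresholds <- c(0.1, 0.5, 0.7, 0.95)

# One row per threshold: true positive rate and false positive rate
roc_points <- t(sapply(thresholds, function(th) {
  predicted_positive <- probs >= th
  c(threshold = th,
    tpr = sum(predicted_positive & actuals == 1) / sum(actuals == 1),
    fpr = sum(predicted_positive & actuals == 0) / sum(actuals == 0))
}))

print(roc_points)
```

Plotting tpr against fpr for a fine grid of thresholds gives the ROC curve; a very low threshold lands at (1, 1) and a very high one at (0, 0).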

summary(logisticRegressionModel)
#> 
#> Call:
#> glm(formula = target ~ age + sex + chest_pain_type + resting_blood_pressure + 
#>     cholesterol + fasting_blood_sugar + rest_ecg + max_heart_rate_achieved + 
#>     exercise_induced_angina + st_depression + st_slope + num_major_vessels + 
#>     thalassemia, family = binomial(link = "logit"), data = trainingData)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -4.0765   0.0003   0.0329   0.2864   0.9101  
#> 
#> Coefficients:
#>                                          Estimate Std. Error z value Pr(>|z|)
#> (Intercept)                             1.933e+01  6.523e+03   0.003  0.99764
#> age                                     4.529e-02  8.402e-02   0.539  0.58987
#> sexM                                   -4.799e+00  1.972e+00  -2.433  0.01496
#> chest_pain_typeNon-Anginal Pain         3.819e+00  1.820e+00   2.099  0.03585
#> chest_pain_typeTypical Angina          -4.660e+00  1.765e+00  -2.639  0.00830
#> resting_blood_pressure                 -4.304e-02  3.573e-02  -1.205  0.22833
#> cholesterol                            -5.212e-03  1.087e-02  -0.479  0.63168
#> fasting_blood_sugarLower than 120mg/ml -1.320e+00  1.417e+00  -0.931  0.35167
#> rest_ecgNormal                         -1.443e+01  6.523e+03  -0.002  0.99823
#> rest_ecgST-T wave abnormality          -1.583e+01  6.523e+03  -0.002  0.99806
#> max_heart_rate_achieved                 9.684e-02  4.477e-02   2.163  0.03053
#> exercise_induced_anginaYes              1.916e+00  1.501e+00   1.277  0.20154
#> st_depression                          -2.684e+00  8.902e-01  -3.015  0.00257
#> st_slopeUpsloping                      -4.112e+00  1.670e+00  -2.462  0.01383
#> num_major_vessels1                     -6.260e+00  2.112e+00  -2.963  0.00304
#> num_major_vessels2                     -9.741e+00  3.222e+00  -3.023  0.00250
#> num_major_vessels3                     -8.785e+00  5.097e+00  -1.724  0.08479
#> num_major_vessels4                      2.228e+01  2.644e+03   0.008  0.99328
#> thalassemiaNormal                       6.059e+00  3.692e+00   1.641  0.10078
#> thalassemiaReversable Defect           -3.096e+00  1.457e+00  -2.125  0.03357
#>                                          
#> (Intercept)                              
#> age                                      
#> sexM                                   * 
#> chest_pain_typeNon-Anginal Pain        * 
#> chest_pain_typeTypical Angina          **
#> resting_blood_pressure                   
#> cholesterol                              
#> fasting_blood_sugarLower than 120mg/ml   
#> rest_ecgNormal                           
#> rest_ecgST-T wave abnormality            
#> max_heart_rate_achieved                * 
#> exercise_induced_anginaYes               
#> st_depression                          **
#> st_slopeUpsloping                      * 
#> num_major_vessels1                     **
#> num_major_vessels2                     **
#> num_major_vessels3                     . 
#> num_major_vessels4                       
#> thalassemiaNormal                        
#> thalassemiaReversable Defect           * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 109.900  on 103  degrees of freedom
#> Residual deviance:  34.318  on  84  degrees of freedom
#> AIC: 74.318
#> 
#> Number of Fisher Scoring iterations: 17
plotROC(testData$target, predicted)

Concordance(testData$target, predicted)
#> $Concordance
#> [1] 0.7571429
#> 
#> $Discordance
#> [1] 0.2428571
#> 
#> $Tied
#> [1] 2.775558e-17
#> 
#> $Pairs
#> [1] 350
sensitivity(testData$target, predicted, threshold = optCutOff)
#> [1] 0.9428571
specificity(testData$target, predicted, threshold = optCutOff)
#> [1] 0.3
confusionMatrix(testData$target, predicted, threshold = optCutOff)
#>   0  1
#> 0 3  2
#> 1 7 33
head(testData)
#>    age sex  chest_pain_type resting_blood_pressure cholesterol
#> 3   41   F   Typical Angina                    130         204
#> 10  57   M  Atypical Angina                    150         168
#> 13  49   M   Typical Angina                    130         266
#> 15  58   F Non-Anginal Pain                    150         283
#> 16  50   F  Atypical Angina                    120         219
#> 20  69   F Non-Anginal Pain                    140         239
#>      fasting_blood_sugar              rest_ecg max_heart_rate_achieved
#> 3    Lower than 120mg/ml                Normal                     172
#> 10   Lower than 120mg/ml ST-T wave abnormality                     174
#> 13   Lower than 120mg/ml ST-T wave abnormality                     171
#> 15 Greater than 120mg/ml                Normal                     162
#> 16   Lower than 120mg/ml ST-T wave abnormality                     158
#> 20   Lower than 120mg/ml ST-T wave abnormality                     151
#>    exercise_induced_angina st_depression  st_slope num_major_vessels
#> 3                       No           1.4      Flat                 0
#> 10                      No           1.6      Flat                 0
#> 13                      No           0.6      Flat                 0
#> 15                      No           1.0      Flat                 0
#> 16                      No           1.6 Upsloping                 0
#> 20                      No           1.8      Flat                 2
#>     thalassemia target
#> 3  Fixed Defect      1
#> 10 Fixed Defect      1
#> 13 Fixed Defect      1
#> 15 Fixed Defect      1
#> 16 Fixed Defect      1
#> 20 Fixed Defect      1
colnames(testData)
#>  [1] "age"                     "sex"                    
#>  [3] "chest_pain_type"         "resting_blood_pressure" 
#>  [5] "cholesterol"             "fasting_blood_sugar"    
#>  [7] "rest_ecg"                "max_heart_rate_achieved"
#>  [9] "exercise_induced_angina" "st_depression"          
#> [11] "st_slope"                "num_major_vessels"      
#> [13] "thalassemia"             "target"
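
The sensitivity and specificity reported above can be reproduced from the confusion matrix counts. The sketch assumes that the columns of the matrix are the actual classes and the rows the predicted classes, which is the reading consistent with the reported values.

```r
# Counts copied from the confusion matrix above
# (assumption: columns are actual classes, rows are predicted classes)
true_negatives  <- 3
false_negatives <- 2
false_positives <- 7
true_positives  <- 33

sensitivity_manual <- true_positives / (true_positives + false_negatives)
specificity_manual <- true_negatives / (true_negatives + false_positives)
misclassification  <- (false_positives + false_negatives) /
  (true_positives + true_negatives + false_positives + false_negatives)

print(round(c(sensitivity = sensitivity_manual,
              specificity = specificity_manual,
              misclassification = misclassification), 4))
# sensitivity 0.9429, specificity 0.3, misclassification 0.2
```

With a cutoff of 0.01 nearly every case is predicted positive, which is why sensitivity is high while only 3 of the 10 healthy cases are correctly ruled out.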

2.2 How do we use the model?


predicted_case <- plogis(predict(logisticRegressionModel, data.frame(
  age=as.integer(30),
  sex=as.factor('F'),
  chest_pain_type=as.factor('Atypical Angina'),
  resting_blood_pressure=as.integer(100),
  cholesterol=as.integer(30),
  fasting_blood_sugar=as.factor('Lower than 120mg/ml'),
  rest_ecg=as.factor('Normal'),
  max_heart_rate_achieved=as.integer(120),
  exercise_induced_angina=as.factor('No'),
  st_depression=as.numeric(1.2),
  st_slope=as.factor('Flat'),
  num_major_vessels=as.factor('0'),
  thalassemia=as.factor('Fixed Defect')
)))

sprintf("The chances of patient being diagnosed with a heart disease is %0.2f %%", predicted_case*100)
#> [1] "The chances of patient being diagnosed with a heart disease is 99.99 %"
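
When scoring a new case, it is safer to build its factor columns with the levels of the training data, so that predict() sees the same coding the model was fitted with. A minimal sketch on the built-in mtcars data (a stand-in chosen for illustration; cars, new_case and the chosen values are hypothetical):

```r
# Built-in mtcars data as a stand-in for the training data
cars <- mtcars
cars$cyl <- factor(cars$cyl)   # levels "4", "6", "8"

model <- glm(am ~ wt + cyl, data = cars, family = binomial(link = "logit"))

# The new case reuses the training levels instead of a one-level factor
new_case <- data.frame(
  wt  = 2.5,
  cyl = factor("4", levels = levels(cars$cyl))
)

# type = "response" returns the probability directly
prob <- predict(model, newdata = new_case, type = "response")
print(sprintf("Predicted probability: %0.2f %%", prob * 100))
```

Input values far outside the range seen in training deserve extra caution, because the model is extrapolating there.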

… To be continued