The objective of this project is to predict heart disease in hospital patients using the K-nearest neighbors (KNN) algorithm. The processed.cleveland.data data set from the UCI Machine Learning Repository will be used to train and test the model.

First I’ll import the dataset.

Heart <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"), header = FALSE)
head(Heart)

Add column names to the data.

colnames(Heart) <- c("age","sex","cp","trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

head(Heart)

The attribute information is the following:

age: age in years

sex: gender (1 = male; 0 = female)

cp: chest pain type

– Value 1: typical angina

– Value 2: atypical angina

– Value 3: non-anginal pain

– Value 4: asymptomatic

trestbps: resting blood pressure (in mm Hg on admission to the hospital)

chol: serum cholesterol in mg/dl

fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

restecg: resting electrocardiographic results

– Value 0: normal

– Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

– Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria

thalach: maximum heart rate achieved

exang: exercise induced angina (1 = yes; 0 = no)

oldpeak: ST depression induced by exercise relative to rest

slope: the slope of the peak exercise ST segment

– Value 1: upsloping

– Value 2: flat

– Value 3: downsloping

ca: number of major vessels (0-3) colored by fluoroscopy

thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

num: diagnosis of heart disease (angiographic disease status)

– Value 0: < 50% diameter narrowing

– Value 1: > 50% diameter narrowing

class(Heart$num)

Because num is the outcome we want the algorithm to predict, we must change it to a factor. The num variable is coded as 0 for disease free, with 1, 2, 3, and 4 indicating different degrees of severity of heart disease. We’re only interested in predicting whether or not a patient has the disease, so we’ll recode it as a binary label:

0 = No Disease

1 = Disease

I’ll first change the “num” values “2”, “3”, and “4” to “1”, so only “0” and “1” remain.

Heart$num[Heart$num=="4"] <- "1"
Heart$num[Heart$num=="3"] <- "1"
Heart$num[Heart$num=="2"] <- "1"
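
The three assignments above can also be written as a single vectorized step. The following is a minimal sketch on a toy vector rather than the real data; the names num_binary and num_factor and the text labels are illustrative assumptions (the analysis itself keeps the levels “0” and “1”).

```r
# Toy vector standing in for Heart$num (values 0-4); not the real data.
num <- c(0, 2, 1, 0, 3, 4, 0, 1)

# Any value above 0 means disease is present, so collapse 1-4 to 1.
num_binary <- ifelse(num > 0, 1, 0)

# Optionally attach readable labels while converting to a factor.
num_factor <- factor(num_binary, levels = c(0, 1),
                     labels = c("No Disease", "Disease"))
table(num_factor)
```

Because the comparison is numeric, the column never passes through a character type on its way to a factor.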

Next I’ll change the “num” variable to a factor.

Heart$num <- as.factor(Heart$num)

class(Heart$num)
head(Heart$num)
round(prop.table(table(Heart$num)) * 100, digits = 1)

As shown above, 45.9% of the patients in our data set have heart disease and 54.1% do not.

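We also need to convert “thal” and “ca” to numeric variables, because KNN requires every variable other than the outcome to be numeric. Both columns use “?” to mark missing values, so I’ll first set those entries to NA and drop the six incomplete rows, which leaves 297 observations.

Heart[Heart == "?"] <- NA
Heart <- na.omit(Heart)
Heart$thal <- as.character(Heart$thal)
Heart$thal <- as.numeric(Heart$thal)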
str(Heart)
'data.frame':   297 obs. of  14 variables:
 $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
 $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
 $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
 $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : Factor w/ 5 levels "?","0.0","1.0",..: 2 5 4 2 2 2 4 2 3 2 ...
 $ thal    : num  6 3 7 3 3 3 3 3 7 7 ...
 $ num     : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:6] 88 167 193 267 288 303
  .. ..- attr(*, "names")= chr [1:6] "88" "167" "193" "267" ...
Heart$ca <- as.character(Heart$ca)
Heart$ca <- as.numeric(Heart$ca)
str(Heart)
'data.frame':   297 obs. of  14 variables:
 $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
 $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
 $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
 $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : num  0 3 2 0 0 0 2 0 1 0 ...
 $ thal    : num  6 3 7 3 3 3 3 3 7 7 ...
 $ num     : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:6] 88 167 193 267 288 303
  .. ..- attr(*, "names")= chr [1:6] "88" "167" "193" "267" ...
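
The “?” level in the str() output for ca marks missing values in the raw file. As a hedged alternative to converting the columns afterwards, read.csv() can be told to treat “?” as NA at import time, so the affected columns arrive as numeric. The sketch below uses three made-up rows and a reduced set of columns, not the real file.

```r
# Three made-up rows in a comma-separated layout; "?" marks a missing value.
raw <- "63.0,1.0,0.0,6.0\n67.0,1.0,?,3.0\n37.0,0.0,2.0,?"

toy <- read.csv(text = raw, header = FALSE, na.strings = "?",
                col.names = c("age", "sex", "ca", "thal"))

str(toy)                  # every column is read as numeric
complete <- na.omit(toy)  # drop rows that contain any NA
nrow(complete)
```

With this approach the as.character() and as.numeric() conversions are no longer needed, because no column is ever read as text.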

We now need to normalize the data so the KNN distance calculation isn’t dominated by features measured on larger scales. I’ll first create a normalize function, test it, and then apply it to the data.

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

normalize(c(1,2,3,4,5))

normalize(c(10, 20, 30, 40, 50))

As shown above, the function appears to be working correctly - the values in the second set are exactly 10x those in the first, but after normalization both produce the same output.
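
The normalize function returns NA for every element if its input contains a missing value, and divides by zero if all values are identical. A hypothetical, more defensive variant (normalize_safe is an assumed name, not part of the original analysis) might look like this:

```r
# Hypothetical variant of normalize(): ignores NAs and guards against
# a column whose values are all the same.
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # No spread to rescale: return 0 for observed values, NA for missing ones.
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

normalize_safe(c(1, 2, 3, 4, 5))
normalize_safe(c(10, NA, 30))
normalize_safe(c(7, 7, 7))
```

Returning zeros for a constant column is one possible design choice; dropping such a column would be equally reasonable, since it carries no information for the distance calculation.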

I’ll now normalize all the numeric features in the data frame.

Heart[1:13] <- as.data.frame(lapply(Heart[1:13], normalize))

summary(Heart$chol)

As seen above, cholesterol, which used to have a min of 126 and a max of 564, now has values ranging between 0 and 1.
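
Rescaling matters because knn() from the class package ranks neighbours by Euclidean distance, so a feature measured in the hundreds can swamp a 0/1 feature. The sketch below illustrates this with four made-up patients (not rows from the Cleveland data) and compares distances before and after min-max rescaling.

```r
# Four made-up patients (not rows from the Cleveland data).
patients <- data.frame(chol = c(200, 205, 230, 400), sex = c(1, 0, 1, 0))

# Same min-max rescaling as the normalize() function defined earlier.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

d_raw  <- as.matrix(dist(patients))
d_norm <- as.matrix(dist(as.data.frame(lapply(patients, normalize))))

# Distances from patient 1 to patients 2 and 3, before and after rescaling.
rbind(raw = d_raw[1, 2:3], normalized = d_norm[1, 2:3])
```

On the raw values, patient 2 (opposite sex, cholesterol 5 mg/dl apart) is the nearest neighbour of patient 1. After rescaling, patient 3 (same sex, cholesterol 30 mg/dl apart) becomes the nearest.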

Split data into training and testing sets.

ind <- sample(2, nrow(Heart), replace=TRUE, prob=c(0.7, 0.3))
train <- Heart[ind==1,]
test <- Heart[ind==2,]

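We’ve created “train”, a random selection of 197 observations for training, and “test”, a random selection of 100 observations for testing. Because sample() is called without a fixed seed and assigns each row independently, the exact split sizes will vary from run to run.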
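
If a repeatable split with fixed sizes is preferred, one option is to set a seed and sample row indices directly. This is a sketch under assumptions: the seed value is arbitrary, and toy is a stand-in data frame with the same number of rows as Heart.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

toy <- data.frame(id = 1:297)  # stand-in with the same row count as Heart

# Draw exactly 70% of the row indices for training; the rest form the test set.
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train_toy <- toy[train_idx, , drop = FALSE]
test_toy  <- toy[-train_idx, , drop = FALSE]

c(train = nrow(train_toy), test = nrow(test_toy))
```

With 297 rows this always yields 208 training rows and 89 test rows, and rerunning with the same seed selects the same rows.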

Now to test the model

install.packages("class")
library("class")
pred <- knn(train = train[1:13], test = test[1:13], cl = train$num, k = 1)

We now have predicted disease status for each observation in the test set. To evaluate the model we’ll compare the predictions to the observed outcomes in the test data.

install.packages("gmodels")
library("gmodels")
CrossTable(x = test$num, y = pred, prop.chisq = FALSE)

As seen in the table, 78/100 cases were accurately predicted. This is a good start, but the model must improve before it could be used to diagnose patients, because mistakes in this domain are extremely consequential.
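
The accuracy figure can also be computed directly from the predicted and observed labels rather than read off the table. The sketch below uses made-up labels, not the results reported above. It also computes sensitivity (the share of diseased patients correctly flagged), which is an addition of this sketch rather than a metric used in the analysis.

```r
# Made-up labels for illustration; not the results reported above.
actual    <- factor(c(0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 0, 0, 1, 0), levels = c(0, 1))

conf_mat <- table(actual = actual, predicted = predicted)
conf_mat

# Share of all predictions that match the observed label.
accuracy <- mean(predicted == actual)

# Share of patients with disease (actual == 1) that were predicted as 1.
sensitivity <- conf_mat["1", "1"] / sum(conf_mat["1", ])

c(accuracy = accuracy, sensitivity = sensitivity)
```

In the heart data the same two lines would take test$num and pred in place of actual and predicted.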

I’m now going to try to improve the model by making the k value 5.

pred_2 <- knn(train = train[1:13], test = test[1:13], cl = train$num, k = 5)

CrossTable(x = test$num, y = pred_2, prop.chisq = FALSE)

This is a slight improvement. This model predicted 85/100 observations correctly. In the next model, I’m going to make k = 14. It’s common to make the k value equal to the square root of the number of observations in the training set - the square root of 197 is approx. 14.
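
The square-root rule of thumb is easy to compute rather than estimate by hand. A small sketch follows; the odd-k adjustment is an optional extra assumed here, since with two classes an odd k cannot produce a tied vote.

```r
n_train <- 197             # training-set size reported above
k <- round(sqrt(n_train))  # sqrt(197) is roughly 14.04

# Optional: move to the next odd number to avoid tied votes between two classes.
k_odd <- if (k %% 2 == 0) k + 1 else k

c(k = k, k_odd = k_odd)
```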

pred_3 <- knn(train = train[1:13], test = test[1:13], cl = train$num, k = 14)

CrossTable(x = test$num, y = pred_3, prop.chisq = FALSE)

This model is slightly worse - 84/100 correctly predicted. I’m now going to try one more value: k = 10.

pred_4 <- knn(train = train[1:13], test = test[1:13], cl = train$num, k = 10)

CrossTable(x = test$num, y = pred_4, prop.chisq = FALSE)

This model predicted 81/100 observations correctly. Of the models I attempted, “pred_2” with k = 5 was the most effective. It’s still not accurate enough to use to diagnose heart disease. One idea for further improving the model is to use a larger training set, since more data would likely help. It’s also possible that KNN is not the optimal machine learning method for this problem.
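
Rather than fitting each k by hand, the comparison can be run in a loop. The sketch below uses the built-in iris data as a stand-in so that it is self-contained; the seed and the split size are arbitrary choices for illustration, and its accuracies say nothing about the heart data.

```r
library(class)

set.seed(1)  # arbitrary seed for illustration

# iris ships with R; it stands in for the heart data here.
idx     <- sample(nrow(iris), size = 105)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

# Fit one model per candidate k and record the share of correct predictions.
k_values <- c(1, 5, 10, 14)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

data.frame(k = k_values, accuracy = accuracy)
```

For the heart data, train[1:13], test[1:13], train$num, and test$num would take the place of the four iris objects.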

