Intro

The World Health Organization estimates that 12 million deaths occur worldwide every year due to heart disease. Half of the deaths in the United States and other developed countries are due to cardiovascular disease. Early prognosis of cardiovascular disease can aid decisions on lifestyle changes for high-risk patients and in turn reduce complications. This project intends to pinpoint the most relevant risk factors of heart disease as well as predict the overall risk using logistic regression and k-NN.

Classification

Classification is the process of finding or discovering a model or function that helps separate the data into multiple categorical classes, i.e., discrete values. In classification, data is categorized under different labels according to some parameters given as input, and the labels are then predicted for the data.

The derived mapping function can be expressed in the form of “IF-THEN” rules. The classification process deals with problems where the data can be divided into binary or multiple discrete labels.

For example, suppose we want to predict the possibility of Team A winning a match on the basis of some parameters recorded earlier. There would then be two labels: Yes and No.

Logistic Regression

Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.

Contrary to popular belief, logistic regression IS a regression model: it builds a regression model to predict the probability that a given data entry belongs to the category numbered as “1”. Just as Linear regression assumes that the data follow a linear function, Logistic regression models the data using the sigmoid function.

Logistic regression becomes a classification technique only when a decision threshold is brought into the picture. The setting of the threshold value is a very important aspect of Logistic regression and depends on the classification problem itself.
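
As a minimal sketch of these two ingredients (the sigmoid() helper below is illustrative, not part of the original code; base R's plogis() computes the same function):

sigmoid <- function(z) 1 / (1 + exp(-z))  # maps any real number into (0, 1)
sigmoid(c(-4, 0, 4))                      # ~0.018, 0.500, ~0.982
p <- sigmoid(1.2)                         # probability from some fitted linear predictor
ifelse(p > 0.5, "Yes", "No")              # a decision threshold turns p into a class label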

The choice of threshold is driven mainly by the values of precision and recall. Ideally, we want both precision and recall to be 1, but this is seldom the case. In the case of a precision-recall tradeoff, we use the following arguments to decide upon the threshold:

  1. Low Precision/High Recall: In applications where we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision threshold with a low value of precision or a high value of recall. For example, in a cancer diagnosis application, we do not want any affected patient to be classified as not affected, even at the cost of some patients being wrongfully diagnosed with cancer. This is because the absence of cancer can be confirmed by further medical diagnosis, but the presence of the disease cannot be detected in an already rejected candidate.

  2. High Precision/Low Recall: In applications where we want to reduce the number of false positives without necessarily reducing the number of false negatives, we choose a decision threshold with a high value of precision or a low value of recall. For example, if we are classifying whether customers will react positively or negatively to a personalised advertisement, we want to be absolutely sure that the customer will react positively to the advertisement, because otherwise a negative reaction can cause a loss of potential sales from the customer.

Based on the number of categories, Logistic regression can be classified as:

binomial: the target variable can have only two possible types, “0” or “1”, which may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.

multinomial: the target variable can have three or more possible types which are not ordered (i.e., the types have no quantitative significance), like “disease A” vs “disease B” vs “disease C”.

ordinal: deals with target variables with ordered categories. For example, a test score can be categorized as “very poor”, “poor”, “good”, or “very good”. Here, each category can be given a score like 0, 1, 2, 3.

K-NN

K-Nearest Neighbors is one of the most basic yet essential classification algorithms in machine learning. It belongs to the supervised learning domain and is widely applied in pattern recognition, data mining, and intrusion detection.

It is widely used in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of the data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data).

We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.

K-NN classifies a new observation by finding the k training points nearest to it and assigning the class that holds the majority among those neighbors.
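
A toy sketch of that idea on made-up one-dimensional data (the values are purely illustrative):

train_x <- c(1, 2, 3, 10, 11, 12)                  # training feature values
train_y <- factor(c("A", "A", "A", "B", "B", "B")) # their class labels
new_x <- 4                                         # point to classify
k <- 3
nearest <- order(abs(train_x - new_x))[1:k]        # indices of the k closest points
names(which.max(table(train_y[nearest])))          # majority vote among them: "A"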

Objective

In this project, we will try to predict whether a patient has a 10-year risk of coronary heart disease (CHD) based on several supporting variables. The algorithms I will use are logistic regression and k-nearest neighbors, both of which are supervised learning methods.

Data Preparation

Library

library(tidyverse) # data wrangling and ggplot2
library(gtools)    # assorted programming utilities
library(class)     # knn()
library(caret)     # upSample(), dummyVars(), confusionMatrix()
library(DT)        # datatable() for interactive tables

Dataset

The dataset is publicly available on the Kaggle website, and it comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information and includes over 4,000 records and 15 attributes.

Variables

Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk factors.

Demographic:

  • male: sex of the patient (1 = male, 0 = female)
  • age: age of the patient (continuous; although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
  • education: levels coded 1 for some high school, 2 for a high school diploma or GED, 3 for some college or vocational school, and 4 for a college degree

Behavioral:

  • currentSmoker: whether or not the patient is a current smoker (nominal)
  • cigsPerDay: the number of cigarettes the person smoked on average per day (can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette)

Medical (history):

  • BPMeds: whether or not the patient was on blood pressure medication (nominal)
  • prevalentStroke: whether or not the patient had previously had a stroke (nominal)
  • prevalentHyp: whether or not the patient was hypertensive (nominal)
  • diabetes: whether or not the patient had diabetes (nominal)

Medical (current):

  • totChol: total cholesterol level (continuous)
  • sysBP: systolic blood pressure (continuous)
  • diaBP: diastolic blood pressure (continuous)
  • BMI: body mass index (continuous)
  • heartRate: heart rate (continuous; in medical research, variables such as heart rate, though in fact discrete, are considered continuous because of the large number of possible values)
  • glucose: glucose level (continuous)

Predicted variable (desired target):

  • TenYearCHD: 10-year risk of coronary heart disease CHD (binary: “1” means “Yes”, “0” means “No”)

framingham <- read.csv("data_input/framingham.csv") # load the raw data
datatable(framingham)                               # interactive table of the raw data
glimpse(framingham)
## Rows: 4,240
## Columns: 16
## $ male            <int> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, ...
## $ age             <int> 39, 46, 48, 61, 46, 43, 63, 45, 52, 43, 50, 43, 46,...
## $ education       <int> 4, 2, 1, 3, 3, 2, 1, 2, 1, 1, 1, 2, 1, 3, 2, 2, 3, ...
## $ currentSmoker   <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, ...
## $ cigsPerDay      <int> 0, 0, 20, 30, 23, 0, 0, 20, 0, 30, 0, 0, 15, 0, 9, ...
## $ BPMeds          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ prevalentStroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ prevalentHyp    <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, ...
## $ diabetes        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ totChol         <int> 195, 250, 245, 225, 285, 228, 205, 313, 260, 225, 2...
## $ sysBP           <dbl> 106.0, 121.0, 127.5, 150.0, 130.0, 180.0, 138.0, 10...
## $ diaBP           <dbl> 70.0, 81.0, 80.0, 95.0, 84.0, 110.0, 71.0, 71.0, 89...
## $ BMI             <dbl> 26.97, 28.73, 25.34, 28.58, 23.10, 30.30, 33.11, 21...
## $ heartRate       <int> 80, 95, 75, 65, 85, 77, 60, 79, 76, 93, 75, 72, 98,...
## $ glucose         <int> 77, 76, 70, 103, 85, 99, 85, 78, 79, 88, 76, 61, 64...
## $ TenYearCHD      <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...

The data has 4240 rows and 16 columns.

Check Missing Values

framingham %>% 
  is.na() %>% 
  colSums()
##            male             age       education   currentSmoker      cigsPerDay 
##               0               0             105               0              29 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##              53               0               0               0              50 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0              19               1             388 
##      TenYearCHD 
##               0

Drop Missing Values

framingham_clean <- framingham %>% 
  drop_na()
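
A quick sanity check that the cleanup worked (the row count matches the summaries shown later in this report):

anyNA(framingham_clean)  # FALSE: no missing values remain
nrow(framingham_clean)   # 3658 of the original 4240 rows remain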

Adjust Data Type

# convert the integer-coded nominal variables to factors
framingham_clean <- framingham_clean %>% 
  mutate(male = factor(male, levels = c(0,1), labels = c("Female", "Male")),
         education = as.factor(education),
         currentSmoker = as.factor(currentSmoker),
         BPMeds = as.factor(BPMeds),
         prevalentStroke = as.factor(prevalentStroke),
         prevalentHyp = as.factor(prevalentHyp),
         diabetes = as.factor(diabetes),
         TenYearCHD = factor(TenYearCHD, levels = c(0,1), labels = c("No", "Yes")))
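
A quick check that the recoding produced the intended factor levels:

levels(framingham_clean$male)       # "Female" "Male"
levels(framingham_clean$TenYearCHD) # "No" "Yes"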

Exploratory Data Analysis

summary(framingham_clean)
##      male           age        education currentSmoker   cigsPerDay    
##  Female:2035   Min.   :32.00   1:1526    0:1869        Min.   : 0.000  
##  Male  :1623   1st Qu.:42.00   2:1101    1:1789        1st Qu.: 0.000  
##                Median :49.00   3: 608                  Median : 0.000  
##                Mean   :49.55   4: 423                  Mean   : 9.025  
##                3rd Qu.:56.00                           3rd Qu.:20.000  
##                Max.   :70.00                           Max.   :70.000  
##  BPMeds   prevalentStroke prevalentHyp diabetes    totChol          sysBP      
##  0:3547   0:3637          0:2518       0:3559   Min.   :113.0   Min.   : 83.5  
##  1: 111   1:  21          1:1140       1:  99   1st Qu.:206.0   1st Qu.:117.0  
##                                                 Median :234.0   Median :128.0  
##                                                 Mean   :236.8   Mean   :132.4  
##                                                 3rd Qu.:263.0   3rd Qu.:143.9  
##                                                 Max.   :600.0   Max.   :295.0  
##      diaBP             BMI          heartRate         glucose       TenYearCHD
##  Min.   : 48.00   Min.   :15.54   Min.   : 44.00   Min.   : 40.00   No :3101  
##  1st Qu.: 75.00   1st Qu.:23.08   1st Qu.: 68.00   1st Qu.: 71.00   Yes: 557  
##  Median : 82.00   Median :25.38   Median : 75.00   Median : 78.00             
##  Mean   : 82.92   Mean   :25.78   Mean   : 75.73   Mean   : 81.85             
##  3rd Qu.: 90.00   3rd Qu.:28.04   3rd Qu.: 82.00   3rd Qu.: 87.00             
##  Max.   :142.50   Max.   :56.80   Max.   :143.00   Max.   :394.00

Modelling

Data Pre-Process

prop.table(table(framingham_clean$TenYearCHD))
## 
##       No      Yes 
## 0.847731 0.152269

Our data is imbalanced (about 85% “No” vs 15% “Yes”), so we need to balance it. Before balancing the data, we first split it into training and test sets.

set.seed(303)
intrain <- sample(nrow(framingham_clean), nrow(framingham_clean)*0.8) # 80% of the row indices
data_train <- framingham_clean[intrain,]  # 80% training set
data_test <- framingham_clean[-intrain,]  # 20% test set
prop.table(table(data_train$TenYearCHD)) %>% round(2)
## 
##   No  Yes 
## 0.85 0.15
prop.table(table(data_test$TenYearCHD)) %>% round(2)
## 
##   No  Yes 
## 0.83 0.17

We can balance the data with downsampling or upsampling. Balancing is done only on the training data, because the model is built on the training data alone; the test data is used only to evaluate the resulting model.

# upsample the minority class in the training data;
# x holds the predictors only: male (column 1) is left out of this model,
# and the target TenYearCHD (column 16) must not appear among the predictors
train_up <- upSample(x = data_train[ , -c(1, 16)],
                     y = data_train$TenYearCHD, yname = "TenYearCHD")
prop.table(table(train_up$TenYearCHD)) %>% round(2)
## 
##  No Yes 
## 0.5 0.5
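
For reference, caret's downSample() has the same interface and would instead discard majority-class rows; this alternative is sketched here but not used below:

train_down <- downSample(x = data_train[ , -c(1, 16)],  # same predictor set as above
                         y = data_train$TenYearCHD, yname = "TenYearCHD")
prop.table(table(train_down$TenYearCHD))                # also balanced at 0.5 / 0.5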

Logistic Regression

Modeling uses the glm() function for logistic regression. We use all the predictor variables in train_up as the base model (note that male was excluded when train_up was constructed), with TenYearCHD as the response variable.

model_glm <- glm(formula = TenYearCHD ~ ., family = "binomial", 
             data = train_up)
summary(model_glm)
## 
## Call:
## glm(formula = TenYearCHD ~ ., family = "binomial", data = train_up)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5241  -0.9881  -0.1009   1.0126   2.1554  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -7.6047676  0.4651520 -16.349  < 2e-16 ***
## age               0.0690480  0.0043944  15.713  < 2e-16 ***
## education2       -0.2441857  0.0783156  -3.118 0.001821 ** 
## education3       -0.3449055  0.0946571  -3.644 0.000269 ***
## education4       -0.1717242  0.1057381  -1.624 0.104364    
## currentSmoker1    0.1447117  0.1030959   1.404 0.160420    
## cigsPerDay        0.0315480  0.0041485   7.605 2.85e-14 ***
## BPMeds1           0.1975389  0.1746278   1.131 0.257971    
## prevalentStroke1  1.0187753  0.3883191   2.624 0.008702 ** 
## prevalentHyp1     0.0243785  0.0929777   0.262 0.793170    
## diabetes1        -0.0273554  0.2349520  -0.116 0.907312    
## totChol           0.0018704  0.0006880   2.719 0.006553 ** 
## sysBP             0.0149370  0.0025976   5.750 8.91e-09 ***
## diaBP             0.0036556  0.0042223   0.866 0.386609    
## BMI               0.0204844  0.0082884   2.471 0.013456 *  
## heartRate         0.0004562  0.0026910   0.170 0.865375    
## glucose           0.0046994  0.0016230   2.895 0.003786 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6906.5  on 4981  degrees of freedom
## Residual deviance: 5952.4  on 4965  degrees of freedom
## AIC: 5986.4
## 
## Number of Fisher Scoring iterations: 4

In this first model, there are still many predictor variables that are not significant for the target variable, so we refit the model using backward stepwise selection, which repeatedly drops the predictor whose removal lowers the AIC the most.

model_backward <- step(object = model_glm, direction = "backward")
## Start:  AIC=5986.36
## TenYearCHD ~ age + education + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + prevalentHyp + diabetes + totChol + sysBP + 
##     diaBP + BMI + heartRate + glucose
## 
##                   Df Deviance    AIC
## - diabetes         1   5952.4 5984.4
## - heartRate        1   5952.4 5984.4
## - prevalentHyp     1   5952.4 5984.4
## - diaBP            1   5953.1 5985.1
## - BPMeds           1   5953.7 5985.7
## - currentSmoker    1   5954.3 5986.3
## <none>                 5952.4 5986.4
## - BMI              1   5958.5 5990.5
## - totChol          1   5959.8 5991.8
## - prevalentStroke  1   5960.2 5992.2
## - glucose          1   5961.0 5993.0
## - education        3   5969.9 5997.9
## - sysBP            1   5986.6 6018.6
## - cigsPerDay       1   6012.4 6044.4
## - age              1   6214.5 6246.5
## 
## Step:  AIC=5984.37
## TenYearCHD ~ age + education + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + prevalentHyp + totChol + sysBP + diaBP + 
##     BMI + heartRate + glucose
## 
##                   Df Deviance    AIC
## - heartRate        1   5952.4 5982.4
## - prevalentHyp     1   5952.4 5982.4
## - diaBP            1   5953.1 5983.1
## - BPMeds           1   5953.7 5983.7
## - currentSmoker    1   5954.3 5984.3
## <none>                 5952.4 5984.4
## - BMI              1   5958.5 5988.5
## - totChol          1   5959.8 5989.8
## - prevalentStroke  1   5960.2 5990.2
## - education        3   5969.9 5995.9
## - glucose          1   5966.5 5996.5
## - sysBP            1   5986.6 6016.6
## - cigsPerDay       1   6012.4 6042.4
## - age              1   6214.6 6244.6
## 
## Step:  AIC=5982.4
## TenYearCHD ~ age + education + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + prevalentHyp + totChol + sysBP + diaBP + 
##     BMI + glucose
## 
##                   Df Deviance    AIC
## - prevalentHyp     1   5952.5 5980.5
## - diaBP            1   5953.2 5981.2
## - BPMeds           1   5953.7 5981.7
## - currentSmoker    1   5954.4 5982.4
## <none>                 5952.4 5982.4
## - BMI              1   5958.6 5986.6
## - totChol          1   5959.9 5987.9
## - prevalentStroke  1   5960.3 5988.3
## - education        3   5970.0 5994.0
## - glucose          1   5966.7 5994.7
## - sysBP            1   5986.8 6014.8
## - cigsPerDay       1   6012.7 6040.7
## - age              1   6215.2 6243.2
## 
## Step:  AIC=5980.48
## TenYearCHD ~ age + education + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + totChol + sysBP + diaBP + BMI + glucose
## 
##                   Df Deviance    AIC
## - diaBP            1   5953.3 5979.3
## - BPMeds           1   5953.9 5979.9
## - currentSmoker    1   5954.4 5980.4
## <none>                 5952.5 5980.5
## - BMI              1   5958.8 5984.8
## - totChol          1   5959.9 5985.9
## - prevalentStroke  1   5960.3 5986.3
## - education        3   5970.0 5992.0
## - glucose          1   5966.7 5992.7
## - sysBP            1   5995.9 6021.9
## - cigsPerDay       1   6012.8 6038.8
## - age              1   6217.1 6243.1
## 
## Step:  AIC=5979.29
## TenYearCHD ~ age + education + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + totChol + sysBP + BMI + glucose
## 
##                   Df Deviance    AIC
## - BPMeds           1   5954.7 5978.7
## - currentSmoker    1   5955.2 5979.2
## <none>                 5953.3 5979.3
## - totChol          1   5960.6 5984.6
## - BMI              1   5961.0 5985.0
## - prevalentStroke  1   5961.3 5985.3
## - education        3   5970.4 5990.4
## - glucose          1   5967.3 5991.3
## - cigsPerDay       1   6013.6 6037.6
## - sysBP            1   6077.6 6101.6
## - age              1   6222.4 6246.4
## 
## Step:  AIC=5978.7
## TenYearCHD ~ age + education + currentSmoker + cigsPerDay + prevalentStroke + 
##     totChol + sysBP + BMI + glucose
## 
##                   Df Deviance    AIC
## - currentSmoker    1   5956.6 5978.6
## <none>                 5954.7 5978.7
## - totChol          1   5962.2 5984.2
## - BMI              1   5962.4 5984.4
## - prevalentStroke  1   5963.1 5985.1
## - education        3   5971.7 5989.7
## - glucose          1   5968.9 5990.9
## - cigsPerDay       1   6014.8 6036.8
## - sysBP            1   6091.5 6113.5
## - age              1   6224.3 6246.3
## 
## Step:  AIC=5978.58
## TenYearCHD ~ age + education + cigsPerDay + prevalentStroke + 
##     totChol + sysBP + BMI + glucose
## 
##                   Df Deviance    AIC
## <none>                 5956.6 5978.6
## - BMI              1   5963.6 5983.6
## - totChol          1   5964.2 5984.2
## - prevalentStroke  1   5965.1 5985.1
## - education        3   5973.9 5989.9
## - glucose          1   5970.7 5990.7
## - sysBP            1   6092.2 6112.2
## - cigsPerDay       1   6145.8 6165.8
## - age              1   6224.3 6244.3

By using the stepwise backward method, we obtain the following model.

summary(model_backward)
## 
## Call:
## glm(formula = TenYearCHD ~ age + education + cigsPerDay + prevalentStroke + 
##     totChol + sysBP + BMI + glucose, family = "binomial", data = train_up)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.48003  -0.99002  -0.09612   1.00211   2.15571  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -7.4603808  0.3510091 -21.254  < 2e-16 ***
## age               0.0677250  0.0042602  15.897  < 2e-16 ***
## education2       -0.2378724  0.0780515  -3.048 0.002307 ** 
## education3       -0.3443004  0.0943314  -3.650 0.000262 ***
## education4       -0.1690404  0.1051206  -1.608 0.107822    
## cigsPerDay        0.0358574  0.0027049  13.256  < 2e-16 ***
## prevalentStroke1  1.0488749  0.3857378   2.719 0.006545 ** 
## totChol           0.0018859  0.0006864   2.748 0.006004 ** 
## sysBP             0.0171438  0.0015229  11.257  < 2e-16 ***
## BMI               0.0212253  0.0080231   2.646 0.008157 ** 
## glucose           0.0045594  0.0012712   3.587 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6906.5  on 4981  degrees of freedom
## Residual deviance: 5956.6  on 4971  degrees of freedom
## AIC: 5978.6
## 
## Number of Fisher Scoring iterations: 4

Prediction

Using model_backward from the stepwise selection, we predict on the test data we already have.

data_test$prob_TenYearCHD <- predict(model_backward, type = "response", newdata = data_test) # type = "response" returns probabilities rather than log-odds

Let’s look at the distribution of the predicted probabilities.

ggplot(data_test, aes(x=prob_TenYearCHD)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

data_test$pred_TenYearCHD <- factor(ifelse(data_test$prob_TenYearCHD > 0.5, "Yes","No")) # classify with a 0.5 threshold
data_test[1:10, c("pred_TenYearCHD", "TenYearCHD")] # compare the first 10 predictions with the actual labels
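
Given the screening context discussed in the intro (favoring recall over precision), a lower cutoff could also be tried; the 0.3 below is only an illustrative value, not a tuned one:

pred_lower <- factor(ifelse(data_test$prob_TenYearCHD > 0.3, "Yes", "No")) # more liberal cutoff
# confusionMatrix(pred_lower, data_test$TenYearCHD, positive = "Yes")      # expect higher recall, lower precision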

Model Evaluation

We use a confusion matrix to evaluate the model.

glm_conf <- confusionMatrix(data_test$pred_TenYearCHD, data_test$TenYearCHD, positive = "Yes")
glm_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  406  50
##        Yes 204  72
##                                           
##                Accuracy : 0.653           
##                  95% CI : (0.6173, 0.6875)
##     No Information Rate : 0.8333          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1699          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.59016         
##             Specificity : 0.66557         
##          Pos Pred Value : 0.26087         
##          Neg Pred Value : 0.89035         
##              Prevalence : 0.16667         
##          Detection Rate : 0.09836         
##    Detection Prevalence : 0.37705         
##       Balanced Accuracy : 0.62787         
##                                           
##        'Positive' Class : Yes             
## 
  • Recall / Sensitivity = of all the actually positive cases, the proportion my model predicts correctly.

  • Specificity = of all the actually negative cases, the proportion my model predicts correctly.

  • Accuracy = how often my model correctly guesses the target Y overall.

  • Pos Pred Value / Precision = of all the cases predicted positive, the proportion that are actually positive.
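
As a sanity check, these metrics can be recomputed by hand from the four cells of the confusion matrix above:

TP <- 72; FN <- 50; FP <- 204; TN <- 406  # cells of the confusion matrix above
TP / (TP + FN)                            # Recall / Sensitivity = 0.590
TN / (TN + FP)                            # Specificity          = 0.666
TP / (TP + FP)                            # Precision / PPV      = 0.261
(TP + TN) / (TP + TN + FP + FN)           # Accuracy             = 0.653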

K-NN

Data Pre-Process

Make dummy variables. K-NN computes distances between numeric vectors, so the factor variables are first one-hot encoded with caret’s dummyVars().

dmy <- dummyVars(" ~TenYearCHD + male + education + currentSmoker + BPMeds + prevalentStroke + prevalentHyp + diabetes", data = framingham_clean)
dmy <- data.frame(predict(dmy, newdata = framingham_clean))
str(dmy)
## 'data.frame':    3658 obs. of  18 variables:
##  $ TenYearCHD.No    : num  1 1 1 0 1 1 0 1 1 1 ...
##  $ TenYearCHD.Yes   : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ male.Female      : num  0 1 0 1 1 1 1 1 0 0 ...
##  $ male.Male        : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ education.1      : num  0 0 1 0 0 0 1 0 1 1 ...
##  $ education.2      : num  0 1 0 0 0 1 0 1 0 0 ...
##  $ education.3      : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ education.4      : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ currentSmoker.0  : num  1 1 0 0 0 1 1 0 1 0 ...
##  $ currentSmoker.1  : num  0 0 1 1 1 0 0 1 0 1 ...
##  $ BPMeds.0         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BPMeds.1         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ prevalentStroke.0: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ prevalentStroke.1: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ prevalentHyp.0   : num  1 1 1 0 1 0 1 1 0 0 ...
##  $ prevalentHyp.1   : num  0 0 0 1 0 1 0 0 1 1 ...
##  $ diabetes.0       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ diabetes.1       : num  0 0 0 0 0 0 0 0 0 0 ...
# drop the reference-level dummies (and the redundant TenYearCHD.No), keeping one column per factor
dmy$TenYearCHD.No <- NULL
dmy$male.Female <- NULL
dmy$currentSmoker.0 <- NULL
dmy$BPMeds.0 <- NULL
dmy$prevalentStroke.0 <- NULL
dmy$prevalentHyp.0 <- NULL
dmy$diabetes.0 <- NULL
names(dmy)
##  [1] "TenYearCHD.Yes"    "male.Male"         "education.1"      
##  [4] "education.2"       "education.3"       "education.4"      
##  [7] "currentSmoker.1"   "BPMeds.1"          "prevalentStroke.1"
## [10] "prevalentHyp.1"    "diabetes.1"

Make training data and testing data from the dmy data, reusing the same intrain indices as before. Note that only the dummy-coded categorical variables (columns 2 to 11) are used as k-NN predictors here; the continuous variables are not included.

set.seed(300)
dmy_train <- dmy[intrain, 2:11]  # predictor dummies, training rows
dmy_test <- dmy[-intrain, 2:11]  # predictor dummies, test rows

dmy_train_label <- dmy[intrain, 1]  # TenYearCHD.Yes labels, training rows
dmy_test_label <- dmy[-intrain, 1]  # TenYearCHD.Yes labels, test rows

Prediction with K-NN

sqrt(nrow(dmy_train))
## [1] 54.09251
# a common heuristic sets k near sqrt(n), about 54 here (rounded to an odd number);
# this analysis uses k = 17 instead
pred_knn <- class::knn(train = dmy_train,
                       test = dmy_test, 
                       cl = dmy_train_label, 
                       k = 17)
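
Other values of k could be probed with a simple scan over this same split (an illustrative sketch, not part of the original analysis; accs is a hypothetical name):

accs <- sapply(seq(3, 55, by = 2), function(k) {        # odd k values up to ~sqrt(n)
  p <- class::knn(train = dmy_train, test = dmy_test,
                  cl = dmy_train_label, k = k)
  mean(as.character(p) == as.character(dmy_test_label)) # test-set accuracy for this k
})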

Make Confusion Matrix

pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(dmy_test_label), positive = "1")
pred_knn_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 610 122
##          1   0   0
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.8043, 0.8596)
##     No Information Rate : 0.8333          
##     P-Value [Acc > NIR] : 0.5242          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.1667          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

The k-NN model predicts every test observation as “0” (no CHD): its 0.83 accuracy simply matches the no-information rate, while its sensitivity is 0.

Logistic Regression vs K-NN

eval_logit <- tibble(Accuracy = glm_conf$overall[1],
           Recall = glm_conf$byClass[1],
           Specificity = glm_conf$byClass[2],
           Precision = glm_conf$byClass[3])
eval_knn <- tibble(Accuracy = pred_knn_conf$overall[1],
           Recall = pred_knn_conf$byClass[1],
           Specificity = pred_knn_conf$byClass[2],
           Precision = pred_knn_conf$byClass[3])
eval_logit
eval_knn
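
For a side-by-side view, the two one-row tables can be stacked (a small convenience step not in the original):

bind_rows(logit = eval_logit, knn = eval_knn, .id = "model") # one labelled row per model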

Conclusion

Neither model used in this project has shown satisfactory results: in terms of accuracy and the other evaluation metrics, both are still far from what is expected. Further model development is needed in the future.