Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Outline of the Study

  • Introduction : Introduction to the dataset and the goal(s) of the study
  • Making Sense of the Data:
    • Exploration of the dataset using visualization
    • Exploration of the dataset using Chi-Square independent test, Cramer’s V test, and GoodmanKruskal tau value
  • Predictive Modeling: Logisitic Regression and RandomForest
    • Two logistic regression instances
    • Evaluation of the model instances using 5-fold Cross-validation
    • Improving the instances using variable selection
      • stats:step()
      • bestglm::bestglm()
    • RandomForest Model
    • Prediction with RandomForest and Logisitc Regression
  • Final Model Exploration Using Parallel coordinates visualization
  • Conclusion and next steps

1.Introduction

This report is an machine learning/data science investigation in the famous Framingham heart study(FHS). More specifically, the goal is making some predictive models on a FHS dataset, and reviewing some exploratory and modelling techiniques through doing so. But what is FHS?

We read about FHS:

The Framingham Heart Study is a long term prospective study of the etiology of cardiovascular disease among a population of free living subjects in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects FHS Longitudinal Data Document

The reason of choosing this dataset is two fold. Beside the importace of the FHS, it has been chosen in order to having a opportunity to commemorate a great online course, which heavily influced me and made me more interested in analytics in summer 2015. This data is borrowed from Analytics Edge course of MIT, presented on Edx platform. Salute to all the preparators of the course! Analytics Edge at EdX

The second reason of picking this dataset is rather personal! I have went recently through some medical checkups, so I can feed my own data into the predictive model to have a prediction about myself! Even though this study and the resulted models are not prepared to have such medical usage, it is still nice to have something personal here!

The dataset is a rather small subset of possible FHS dataset, having 4240 observations and 16 variables. The variables are as follows:

  • sex : the gender of the observations. The variable is a binary named “male” in the dataset.
  • age : Age at the time of medical examination in years.
  • education : A categorical variable of the participants education, with the levels: Some high school (1), high school/GED (2), some college/vocational school (3), college (4)

  • currentSmoker: Current cigarette smoking at the time of examinations
  • cigsPerDay: Number of cigarettes smoked each day

  • BPmeds: Use of Anti-hypertensive medication at exam
  • prevalentStroke: Prevalent Stroke (0 = free of disease)
  • prevalentHyp: Prevalent Hypertensive. Subject was defined as hypertensive if treated
  • diabetes: Diabetic according to criteria of first exam treated

  • totChol: Total cholesterol (mg/dL)
  • sysBP: Systolic Blood Pressure (mmHg)
  • diaBP: Diastolic blood pressure (mmHg)
  • BMI: Body Mass Index, weight (kg)/height (m)^2
  • heartRate: Heart rate (beats/minute)
  • glucose: Blood glucose level (mg/dL)

And finally the response variable : + TenYearCHD : The 10 year risk of coronary heart disease(CHD).

3658 of these 4240 records are complete cases, and the rest have some missing values.


2.Making Sense of the Data

Before every step, loading the required libraries:

require(knitr)
require(dplyr)
require(ggplot2)
require(readr)
require(gridExtra) #for multiple plots
require(GoodmanKruskal) 
require(vcd) # for Cramer's V
require(psych) # for phi() in order to compute the strenght of a found association
require(boot) #for cross-validation
require(ggparallel) # for parallel coordinate plot
require(bestglm)  # to find the best variable subset using CV
require(randomForest) # for random Forest test on data 
require(RColorBrewer) # apetizing palettes! 

Then, loading the Framingham Heart Study dataset. This data comes from the BioLINCC website. FHS Longitudinal Data Document

2.1 Numerical examination of the variables and the dataset structure

Let’s have a high level numerical look at the dataset.

##print("Dimension of the dataset")
dim(framingham_dataset)
## [1] 4240   16
##print("The first six rows of the dataset")
kable(head(framingham_dataset))
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
TRUE 39 4 FALSE 0 FALSE FALSE FALSE FALSE 195 106.0 70 26.97 80 77 FALSE
FALSE 46 2 FALSE 0 FALSE FALSE FALSE FALSE 250 121.0 81 28.73 95 76 FALSE
TRUE 48 1 TRUE 20 FALSE FALSE FALSE FALSE 245 127.5 80 25.34 75 70 FALSE
FALSE 61 3 TRUE 30 FALSE FALSE TRUE FALSE 225 150.0 95 28.58 65 103 TRUE
FALSE 46 3 TRUE 23 FALSE FALSE FALSE FALSE 285 130.0 84 23.10 85 85 FALSE
FALSE 43 2 FALSE 0 FALSE FALSE TRUE FALSE 228 180.0 110 30.30 77 99 FALSE
##print("Examination of the structure of the dataset")
str(framingham_dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame':    4240 obs. of  16 variables:
##  $ male           : logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ age            : int  39 46 48 61 46 43 63 45 52 43 ...
##  $ education      : Factor w/ 4 levels "1","2","3","4": 4 2 1 3 3 2 1 2 1 1 ...
##  $ currentSmoker  : logi  FALSE FALSE TRUE TRUE TRUE FALSE ...
##  $ cigsPerDay     : int  0 0 20 30 23 0 0 20 0 30 ...
##  $ BPMeds         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ prevalentStroke: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ prevalentHyp   : logi  FALSE FALSE FALSE TRUE FALSE TRUE ...
##  $ diabetes       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ totChol        : int  195 250 245 225 285 228 205 313 260 225 ...
##  $ sysBP          : num  106 121 128 150 130 ...
##  $ diaBP          : num  70 81 80 95 84 110 71 71 89 107 ...
##  $ BMI            : num  27 28.7 25.3 28.6 23.1 ...
##  $ heartRate      : int  80 95 75 65 85 77 60 79 76 93 ...
##  $ glucose        : int  77 76 70 103 85 99 85 78 79 88 ...
##  $ TenYearCHD     : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 16
##   .. ..$ male           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_logical" "collector"
##   .. ..$ age            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ education      :List of 3
##   .. .. ..$ levels    : num  1 2 3 4
##   .. .. ..$ ordered   : logi FALSE
##   .. .. ..$ include_na: logi FALSE
##   .. .. ..- attr(*, "class")= chr  "collector_factor" "collector"
##   .. ..$ currentSmoker  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_logical" "collector"
##   .. ..$ cigsPerDay     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BPMeds         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_logical" "collector"
##   .. ..$ prevalentStroke: list()
##   .. .. ..- attr(*, "class")= chr  "collector_logical" "collector"
##   .. ..$ prevalentHyp   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_logical" "collector"
##   .. ..$ diabetes       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_logical" "collector"
##   .. ..$ totChol        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ sysBP          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ diaBP          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ BMI            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ heartRate      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ glucose        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ TenYearCHD     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_logical" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
##print("Examination of the summary of variables")
summary(framingham_dataset)
##     male              age        education   currentSmoker  
##  Mode :logical   Min.   :32.00   1   :1720   Mode :logical  
##  FALSE:2420      1st Qu.:42.00   2   :1253   FALSE:2145     
##  TRUE :1820      Median :49.00   3   : 689   TRUE :2095     
##                  Mean   :49.58   4   : 473                  
##                  3rd Qu.:56.00   NA's: 105                  
##                  Max.   :70.00                              
##                                                             
##    cigsPerDay       BPMeds        prevalentStroke prevalentHyp   
##  Min.   : 0.000   Mode :logical   Mode :logical   Mode :logical  
##  1st Qu.: 0.000   FALSE:4063      FALSE:4215      FALSE:2923     
##  Median : 0.000   TRUE :124       TRUE :25        TRUE :1317     
##  Mean   : 9.006   NA's :53                                       
##  3rd Qu.:20.000                                                  
##  Max.   :70.000                                                  
##  NA's   :29                                                      
##   diabetes          totChol          sysBP           diaBP      
##  Mode :logical   Min.   :107.0   Min.   : 83.5   Min.   : 48.0  
##  FALSE:4131      1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0  
##  TRUE :109       Median :234.0   Median :128.0   Median : 82.0  
##                  Mean   :236.7   Mean   :132.4   Mean   : 82.9  
##                  3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0  
##                  Max.   :696.0   Max.   :295.0   Max.   :142.5  
##                  NA's   :50                                     
##       BMI          heartRate         glucose       TenYearCHD     
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Mode :logical  
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00   FALSE:3596     
##  Median :25.40   Median : 75.00   Median : 78.00   TRUE :644      
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.96                  
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00                  
##  Max.   :56.80   Max.   :143.00   Max.   :394.00                  
##  NA's   :19      NA's   :1        NA's   :388

2.2 Uni-variate plots of the dataset

The following code can automatically generates all uni-variable plots of a dataset.

# just it needs removal of character, time and date variables 
dataset <- data.frame(framingham_dataset )

         all_plots <- lapply(X = 1:ncol(dataset), function(x) ggplot()+
                                                     geom_bar(data = dataset,
                                                                  aes(x = dataset[,x]), fill = "skyblue" ,
                                                              alpha = 0.6)+ 
                                                     theme_linedraw()+
                                                     xlab(colnames(dataset)[x])
                        )

#colnames(dataset)        
marrangeGrob(grobs=all_plots, nrow=2, ncol=2)

This is for getting the least understanding of the variables. The understanding of the whole problem, and the dataset, would be achieved through multi-variate or at least bi-variate visualization. What I want do to here is getting sense of the data, i.e. feeling the data and becoming familiar with it.

2.3 bi-variate plots of the dataset: relation between response variable and predictors

Now we can venture on having some bi-variate visualizations, specially in order to see the relation between response variable(TenYearCHD) and predictors. Not all pair of variables can be investigated due to deluge number of plots! We can move back to this step later and deeper understanding.

The following code can generate all the bi-variate plots of the response variable. Since the response variable is a binary variable, we would either have boxplots, when the predictor is quantitative, or segmented bar plots, when the predictor is a qualitative variable.

ptlist_bi <- list()

for (var in colnames(dataset) ){
        if (class(dataset[,var]) %in% c("factor","logical") ) {
                ###print(class(dataset[,var]))
                ###print("first if")
                ptlist_bi[[var]] <- ggplot(data = dataset)  + 
                        geom_bar( aes_string(x = var, fill ="TenYearCHD" ), position = "fill") +
                        theme_linedraw() +
                        xlab(var)  
                        
                                                   
        } else if (class(dataset[,var]) %in% c("numeric","double","integer") ) {
                ###print(class(dataset[,var]))
                # #print("second if")
                ptlist_bi[[var]] <- ggplot(data = dataset) + 
                        geom_boxplot(aes_string(y = var, x ="TenYearCHD" )) + 
                        theme_linedraw() +
                        xlab("TenYearCHD") + 
                        ylab(var)
        }
}

#names(ptlist_bi)
marrangeGrob(grobs=ptlist_bi, nrow=2, ncol=2)

Based on what we have here, being male is directly related to TenYearCHD, thus the variable male seems a relatively good predictor. Similarly, Age seems a good predictor since the patients with TenYearCHD == TRUE, have higher median of age, with almost a similar distribution. In contrast, there seems no relation between different categories of the education and the response variable. The current Smoker variable shows a slight relation with the response variable, as the current smokers have a slightly higher risk of TenYearCHD. With the same manner we can investigate remaining plots.

2.4 Association between qualitative variable pairs using Goodman&Kruskal tau

However, one may want to have a numeric value for such associations, beside these graphs, which are essentially qualitative approaches. In order to have such numeric measures, I would like to use Goodman and Kruskal's tau measure, a measure of association between two unordered factor, i.e. two categorical/nominal variables. Among the factor variables that we have in this dataset, only the education is ordinal, i.e. its categories have meaning. This measure is more informational than Cramer’s V or chi-square measures. More on this issue, here GoodmanKruskal Package

class_list <- lapply(X = 1:ncol(dataset), function(x) class(dataset[,x]))
names(class_list) <- colnames(dataset)

dataset_cat_variables <- subset(x = dataset, select = names(which(class_list=="logical"| class_list=="factor")))

#colnames(dataset_cat_variables)
dataset_cat_variables$education <- NULL 


GKmatrix_dataset <- GKtauDataframe(dataset_cat_variables)
plot(GKmatrix_dataset)

As one can see, there is very little explanatory power in the predictors regarding the variability of the response variable. In other words, based on the Goodman and Kruskal's tau measure, there is almost no association between our predictors and the response variable. It can be seen in the values of the TenYearCHD column.

This is not a good news for the model, assuming that I am doing G&Ktau correctly!

In order to check these findings, we can test the significance of association of categorical variables with the response variable using Chi-square test, and then using Phi correlation coefficient in order to evaluate the strength of possible association. The phi is used for 2x2 contigency tables. For larger tables, i.e. variables with more levels, Cramer’s V can be utilised. About Cramer’s V in R , About Phi Correlation in R

unlist(lapply(X = 1:(ncol(dataset_cat_variables)-1) ,FUN =  function(x) chisq.test(table(dataset_cat_variables[,7],dataset_cat_variables[,x]))$p.value ))
## [1] 1.121518e-08 2.211021e-01 3.096658e-08 1.795676e-04 1.188961e-30
## [6] 5.525144e-10
unlist(lapply(X = 1:(ncol(dataset_cat_variables)-1), FUN = function(x)
        phi(matrix(table(dataset_cat_variables[,7], dataset_cat_variables[,x])), digits = 5)) )       
## [1] 0 0 0 0 0 0

It is strange that when p-value of the Chi-square is so low, the significance of the possible association is zero. This is something beyond my current level of knowledge, and I am going to seek the reasons of. This document states that these two tests (chi-squre and Phi correlation) are basically yielding the same results in large number of observations, since one is based on normal distribution and the other is based on t-distribution. Here is more explanation about the phi coefficient. Moreover, some sources suggest that in the case of presence-absence dichotomous variables in a contingency table, the tests would be different.

2.5 Bi-variate investigation for multicollinearity

The real peril for the model lies in the phenomenon of collinearity. Collinearity happens when two predictors are highly associated or correlated. We need to check such characteristics, and then move on building the logisitc regression model.

According to the Goodman and Kruskal's tau plot, we should not be worried about the collinearity. But what about the education which was excluded due to its ordered nature? The Cramer’s V test shows that the strength is not considerable.

dataset_cat_variables <- subset(x = dataset, select = names(which(class_list=="logical"| class_list=="factor")))

# Chi square independence test of education vs. other categorcial variables 
#print("Chi square P-values of variable education and other categorical variables")
unlist(lapply(X = 1:(ncol(dataset_cat_variables)) ,FUN =  function(x) chisq.test(table(dataset_cat_variables$education,dataset_cat_variables[,x]))$p.value ))
## [1] 9.850320e-19 0.000000e+00 3.583561e-04 8.575789e-01 1.512755e-01
## [6] 8.987478e-08 1.729736e-02 5.190369e-07
#re-locating the education variable to the first variable of the dataset 
dataset_cat_variables <- dataset_cat_variables[,colnames(dataset_cat_variables)[c(2,1,3,4,5,6,7)]]

#print("Cramer's V value of association of variable education and other categorical variables")
unlist(lapply(X = 2:(ncol(dataset_cat_variables)), FUN = function(x)
        assocstats(x = table(dataset_cat_variables[,1], dataset_cat_variables[,x]))$cramer ) )
## [1] 0.14501778 0.06676356 0.01369873 0.03579241 0.09281974 0.04955549

None of the variables show strong association with education. The highest value of the Cramer’s V is 0.145, which is quite weak, between education and gender.

But what about the variables such as currentSmoker and cigsPerDay ? Clearly one is predictable from the other. Having a numerical and a categorical variable, we can group the numerical variable into several categories and then using Goodman and Kruskal's tau. The function GroupNumeric() from the GoodmanKruskal package can help to convert a quantitative variable into a qualitative one, however, it is easy to use cut() function here based on subjective understading of the data, and the multimodal distribution of the cigsPerDay which was seen before.

Now let’s check the GKtau values

class_list <- lapply(X = 1:ncol(dataset_2), function(x) class(dataset_2[,x]))
names(class_list) <- colnames(dataset)
t <- sapply(X = names(class_list) , FUN = function(x) TRUE %in% ( class_list[x] %in% c("factor","logical")) )

dataset_cat_variables_2 <- subset(x = dataset_2, select = t )


GKmatrix_dataset_2 <- GKtauDataframe(dataset_cat_variables_2)
plot(GKmatrix_dataset_2)

From the tau values on the matrix-plot and their background shape, we can see that cigsPerDay can totally explain variability of currentSmoker. This is not a surprise as if we know how many cigarettes one smokes per day, from 0 to any number, we can cenrtainly claim that we know one is smoker or not!

The second association is cigsPerDay to male, but it is not strong. So the former can explain littler variability of the latter.

In the next dataset, I converted all quantitative variables into qualitative/categorical variables. Now we can have a comprehensive matrix, although some information is lost due to conversion.

#dataset3 includes only categorical variables. 
dataset_3 <- dataset_2
dataset_3$totChol <- GroupNumeric(x = dataset$totChol , n = 5 )
## Warning in classInt::classIntervals(x, n, style = style, ...): var has
## missing values, omitted in finding classes
dataset_3$sysBP <- GroupNumeric(x = dataset$sysBP , n = 5 )
dataset_3$diaBP <- GroupNumeric(x = dataset$diaBP , n = 5 )
dataset_3$BMI <- GroupNumeric(x = dataset$BMI , n = 5 )
## Warning in classInt::classIntervals(x, n, style = style, ...): var has
## missing values, omitted in finding classes
dataset_3$heartRate <-GroupNumeric(x = dataset$heartRate , n = 5 )
## Warning in classInt::classIntervals(x, n, style = style, ...): var has
## missing values, omitted in finding classes
dataset_3$glucose <- GroupNumeric(x = dataset$glucose , n = 5 )
## Warning in classInt::classIntervals(x, n, style = style, ...): var has
## missing values, omitted in finding classes
dataset_3$age <- GroupNumeric(x = dataset$age , n = 5 )

class_list <- lapply(X = 1:ncol(dataset_3), function(x) class(dataset_3[,x]))
names(class_list) <- colnames(dataset)
t <- sapply(X = names(class_list) , FUN = function(x) TRUE %in% ( class_list[[x]] %in% c("factor","logical")) )

dataset_cat_variables <- subset(x = dataset_3, select = t )


GKmatrix_dataset <- GKtauDataframe(dataset_cat_variables)

plot(GKmatrix_dataset)

As we can see, sysBP and diaBP can predict prevalentHyp, but not very strongly. (around 0.5). We can keep prevalentHyp in the model thus. The second point is about the visual output of the GK tau. Similar to many other visualizations, this visual matrix does not scale well, and reading values becomes difficult. Better to use numeric values then, i.e. using GKtauDataframe().

We are going to build two models based on first two datasets, since the last one with all categorical data does not seem to yield anything more than others.

3.Predictive Modelling: Logistic Regression, and RandomForest

Now it is the time of evaluating the model instances. Here, let’s call logistic regression the model, and let’s name each variation based on variable selection or conversion, an instance.

We have two instances: 1. A model instance including all original variables, specifically cigsPerday and currentSmoker variables 2. A model instance including all original variables except currentSmoker. cigsPerday is converted into a factor variable

In order to evaluate the model instances, we can use either mathematical adjustments of training error rates such as AIC. The other approach is using a validation dataset and evaluate the model based on its performance on this set. From the latter approach, I choose to utilitize K-fold Cross-Validation(CV) technique, and more specifically 5-fold CV. Here, there are other techniques such as Leave one out cross-validation.

3.1 Two Logistic Regression Model instances

# in order to have a reproducible report 
set.seed(7)

# since cv.glm() in the next step cannot handle missing values, 
# I keep only the complete cases in the model. 
dataset_1 <- dataset[complete.cases(dataset),] 

dataset_2$currentSmoker <- NULL
dataset_2 <- dataset_2[complete.cases(dataset_2),]

#nrow(dataset_1) == nrow(dataset_2)

model1 <- glm(data = dataset_1, formula = TenYearCHD ~ . , family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = TenYearCHD ~ ., family = "binomial", data = dataset_1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9298  -0.5927  -0.4239  -0.2826   2.8616  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -8.264593   0.710186 -11.637  < 2e-16 ***
## maleTRUE             0.534965   0.109901   4.868 1.13e-06 ***
## age                  0.062229   0.006756   9.211  < 2e-16 ***
## education2          -0.191812   0.123434  -1.554  0.12019    
## education3          -0.195935   0.150139  -1.305  0.19188    
## education4          -0.059625   0.164620  -0.362  0.71720    
## currentSmokerTRUE    0.073036   0.156749   0.466  0.64126    
## cigsPerDay           0.018005   0.006234   2.888  0.00387 ** 
## BPMedsTRUE           0.165206   0.234484   0.705  0.48109    
## prevalentStrokeTRUE  0.704867   0.491479   1.434  0.15152    
## prevalentHypTRUE     0.233424   0.138202   1.689  0.09122 .  
## diabetesTRUE         0.025920   0.316132   0.082  0.93465    
## totChol              0.002377   0.001129   2.105  0.03527 *  
## sysBP                0.015456   0.003812   4.054 5.03e-05 ***
## diaBP               -0.004121   0.006444  -0.640  0.52247    
## BMI                  0.005215   0.012786   0.408  0.68338    
## heartRate           -0.003004   0.004213  -0.713  0.47592    
## glucose              0.007216   0.002234   3.229  0.00124 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3121.2  on 3657  degrees of freedom
## Residual deviance: 2752.2  on 3640  degrees of freedom
## AIC: 2788.2
## 
## Number of Fisher Scoring iterations: 5

This model is based on the original dataset. The records with missing values are omitted from the dataset, and the model shows that variables male, age, cigsPerDay, totChol, sysBP and glucose are significant, and prevalentHyp is to somehow significant.

model2 <- glm(data = dataset_2, formula = TenYearCHD ~ . , family = "binomial")
summary(model2)
## 
## Call:
## glm(formula = TenYearCHD ~ ., family = "binomial", data = dataset_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9192  -0.5933  -0.4226  -0.2833   2.9029  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -8.304021   0.709840 -11.698  < 2e-16 ***
## maleTRUE                0.565244   0.108219   5.223 1.76e-07 ***
## age                     0.061732   0.006744   9.154  < 2e-16 ***
## education2             -0.187641   0.123379  -1.521 0.128298    
## education3             -0.204574   0.150217  -1.362 0.173243    
## education4             -0.071952   0.164771  -0.437 0.662346    
## cigsPerDay[1,20) cigs   0.310478   0.112562   2.758 0.005810 ** 
## cigsPerDay[20,30) cigs  0.652554   0.194769   3.350 0.000807 ***
## cigsPerDay[30,40) cigs  0.976745   0.273147   3.576 0.000349 ***
## cigsPerDay40+ cigs      0.489754   0.340397   1.439 0.150215    
## BPMedsTRUE              0.170878   0.234525   0.729 0.466238    
## prevalentStrokeTRUE     0.708396   0.490581   1.444 0.148741    
## prevalentHypTRUE        0.230710   0.138095   1.671 0.094790 .  
## diabetesTRUE            0.033275   0.315429   0.105 0.915985    
## totChol                 0.002478   0.001130   2.193 0.028299 *  
## sysBP                   0.015313   0.003813   4.016 5.92e-05 ***
## diaBP                  -0.003713   0.006451  -0.576 0.564881    
## BMI                     0.004756   0.012805   0.371 0.710321    
## heartRate              -0.002470   0.004200  -0.588 0.556473    
## glucose                 0.007095   0.002231   3.180 0.001471 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3121.2  on 3657  degrees of freedom
## Residual deviance: 2752.6  on 3638  degrees of freedom
## AIC: 2792.6
## 
## Number of Fisher Scoring iterations: 5

In the second model instance, the significant variables are the same as previous model instance.

A very important question is how to measure the performace of these two model instances and how to compare them? There are various approaches to measure the performance, but I choose 5-folds Cross Validation here.

In order to do the cross-validation and evaluate the model instances, we need a cost function. A function that is suggested by the boot package, is a simple function which returns the average number of mis-classified cases based on a threshold. The threshold is set by default to 0.5, which means any observation with more than 50% chance of CHD is flagged as a TRUE case with on-going disease. From the medical point of view, I rather be more cautious and discreet, so I reduce the threshold to 0.4, so even the cases with 40% chance of getting the heart disease, would be flagged to recieve further medical attentions. Reducing the threshold, increases the False Positive Rate and thus the medical costs, but reduces the False Negative Rate and saves lives! Having said that, it is possible to use either of Sensitiviety or Specificity as the cost function. Moreover, it is possible to use Area Under the Curve(AUC) joint with CV using the cvAUC package.

It is worth to mention that the proper way to pick a threshold, is using ROC curve.

3.2 Cross-Validation Evaluation of the model instances

set.seed(7)


# borrowed from https://stackoverflow.com/questions/16781551/cost-function-in-cv-glm-of-boot-library-in-r
cost <- function(r, p = 0, threshold = 0.4) { 
        mean((p < threshold) & r==1 | (p > threshold) & r==0)
}


model1_cv_delta <- cv.glm(data = dataset_1, glmfit = model1, cost = cost, K = 5)$delta[1]
model2_cv_delta <-  cv.glm(data = dataset_2, glmfit = model2, cost = cost, K = 5)$delta[1]

kable(data.frame("model1" = model1_cv_delta ,
                 "model2" = model2_cv_delta ), 
                 caption = "CV Misclassification error", digits = 4)
CV Misclassification error
model1 model2
0.158 0.1597
kable(data.frame("model1" = 1-model1_cv_delta ,
                 "model2" = 1-model2_cv_delta ),
                caption = "CV-Accuracy", digits = 4)
CV-Accuracy
model1 model2
0.842 0.8403

As we can see, the two models are very similar, however, model1 shows a slight advantage. The accuracy is quite high indeed. But let’s see whether we can improve the model2 by removing some of its variables.

3.3 Improving the model instances through variable selection.

Once again let’s have a look at the model1 summary.

summary(model1)
## 
## Call:
## glm(formula = TenYearCHD ~ ., family = "binomial", data = dataset_1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9298  -0.5927  -0.4239  -0.2826   2.8616  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -8.264593   0.710186 -11.637  < 2e-16 ***
## maleTRUE             0.534965   0.109901   4.868 1.13e-06 ***
## age                  0.062229   0.006756   9.211  < 2e-16 ***
## education2          -0.191812   0.123434  -1.554  0.12019    
## education3          -0.195935   0.150139  -1.305  0.19188    
## education4          -0.059625   0.164620  -0.362  0.71720    
## currentSmokerTRUE    0.073036   0.156749   0.466  0.64126    
## cigsPerDay           0.018005   0.006234   2.888  0.00387 ** 
## BPMedsTRUE           0.165206   0.234484   0.705  0.48109    
## prevalentStrokeTRUE  0.704867   0.491479   1.434  0.15152    
## prevalentHypTRUE     0.233424   0.138202   1.689  0.09122 .  
## diabetesTRUE         0.025920   0.316132   0.082  0.93465    
## totChol              0.002377   0.001129   2.105  0.03527 *  
## sysBP                0.015456   0.003812   4.054 5.03e-05 ***
## diaBP               -0.004121   0.006444  -0.640  0.52247    
## BMI                  0.005215   0.012786   0.408  0.68338    
## heartRate           -0.003004   0.004213  -0.713  0.47592    
## glucose              0.007216   0.002234   3.229  0.00124 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3121.2  on 3657  degrees of freedom
## Residual deviance: 2752.2  on 3640  degrees of freedom
## AIC: 2788.2
## 
## Number of Fisher Scoring iterations: 5

Till now, we had assumed that all variables must be included in the model, except for the case of collinearity. Now, we are permitted to improve the model by removing non-significant variables, i.e. variables that do not add value to our model. There are several approaches here such as forward selection and backward elimination.

For instance, backward elimination method based on the p-values of the insignificant variables. The elimination continues until the AIC shows no further improvement. There are also stats::step() and bestglm::bestglm() functions to automate the variable selection process. The latter package and its main function have many options for choosing the information criteria such as AIC, BIC, LOOCV and CV, while the former’s stepwise algorithm is based on AIC. Surprisingly, the bestglm() does not work with a model instance including categorical variable when the chosen IC is CV. It is not difficult to write a backward elimination algorithm based on cross-validation error that can handle categorical variables as well. Nevertheless, here I accede to use AIC and BIC due to lack of time.

bestglm_bic_model <- bestglm(Xy = dataset_1 , family = binomial , IC = "BIC")
step_aic_model <- step(object = model1 )

Now let’s have a look at the two instances and their CV errors.

bestglm_bic_model
## BIC
## Best Model:
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## male           1    4.0   3.970   34.03 5.89e-09 ***
## age            1   26.4  26.364  226.03  < 2e-16 ***
## cigsPerDay     1    2.3   2.294   19.67 9.49e-06 ***
## sysBP          1   10.3  10.330   88.56  < 2e-16 ***
## glucose        1    3.3   3.256   27.91 1.34e-07 ***
## Residuals   3652  426.0   0.117                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The bestglm::bestglm() based on BIC has reduced the model variables to 5: male, age, cigsPerDay, sysBP and glucose. All the variables are strongly significant, as expected.

summary(step_aic_model)
## 
## Call:
## glm(formula = TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + glucose, family = "binomial", 
##     data = dataset_1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0063  -0.5972  -0.4291  -0.2841   2.8634  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -8.745885   0.522456 -16.740  < 2e-16 ***
## maleTRUE             0.553297   0.107018   5.170 2.34e-07 ***
## age                  0.065411   0.006442  10.153  < 2e-16 ***
## cigsPerDay           0.019579   0.004181   4.683 2.82e-06 ***
## prevalentStrokeTRUE  0.751698   0.483585   1.554   0.1201    
## prevalentHypTRUE     0.225762   0.135085   1.671   0.0947 .  
## totChol              0.002257   0.001122   2.011   0.0443 *  
## sysBP                0.014218   0.002857   4.976 6.50e-07 ***
## glucose              0.007317   0.001673   4.374 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3121.2  on 3657  degrees of freedom
## Residual deviance: 2757.5  on 3649  degrees of freedom
## AIC: 2775.5
## 
## Number of Fisher Scoring iterations: 5

The stats::step() function based on AIC has reduced the model variables to 8: male, age, cigsPerDay,prevalentStroke, prevalentHyp, totChol, sysBP and glucose. Interenstingly, the optimum model instance found by step() has non-significant variable.

One interesting point is the computation time of bestglm() is more than step(), by a vast margin!

bestglm_model_cv_error <- cv.glm(data = dataset_1,
                                 glmfit = glm(formula = TenYearCHD ~ male+age+cigsPerDay+sysBP+glucose , 
                                              family = binomial, data = dataset_1),
                                 cost = cost, K = 5)$delta[1]

step_model_cv_error <- cv.glm(data = dataset_1, glmfit = step_aic_model, cost = cost, K = 5)$delta[1]

kable(data.frame("bestglm() bic model" = bestglm_model_cv_error ,
                 "step() aic model" = step_model_cv_error ), 
                 caption = "CV Misclassification error", digits = 4)
CV Misclassification error
bestglm…bic.model step…aic.model
0.1512 0.1514
kable(data.frame("bestglm() bic model" = 1-bestglm_model_cv_error ,
                 "step() aic model" = 1-step_model_cv_error ),
                caption = "CV-Accuracy", digits = 4)
CV-Accuracy
bestglm…bic.model step…aic.model
0.8488 0.8486

Both AIC approach and BIC approach yields the same accuracy. Which to choose? I rather go for the AIC since the model instance has more predictors, and thus more insight. Nevertheless, it is also rational to choose the BIC model instance, since it is more parsimonious! Comparing to the model1 accuracy, we have \(0.8475 - 0.842 = 0.0055\) improvement in the accuracy through variable selection. However, we have lost very interesting informations about the relation of other predictors and the response variable. Sometimes, it is the goal of the method to keep all predictors. Here I assumed that the variable selection is allowed.

3.4 RandomForest model

Till now, I have worked on logistic regression model only. There are more models available to model the current problem, and RandomForest is a popular one. Let’s give it a try and compare the result with the previous model.

#---- Error is the  misclassification rate of CV output for each RF model instance
#---- The RF model instances are made by changing the number of trees from 50 to 1000
#---- The final result of CV misclassification rate of each chosen ntree is plotted 
rf_df <- data.frame(ntree = NA, accuracy = NA)
counter <- 0 


# preparation of the fold labels for each observation 
fold_size<- nrow(dataset_1) %/% 5 
remaining <- nrow(dataset_1) %% 5 
fold_seq <- c(rep(1,fold_size), rep(2,fold_size), rep(3,fold_size), rep(4,fold_size),rep(5,fold_size + remaining) ) 

# for various number of trees we calculate CV error. 
for (n in seq(50,1000,50)) {
        set.seed(7)
        counter <- counter + 1 
        cv_error_vec <- vector(length = 5)
        
        # CV error is the average of misclassification error of test folds
        for (k in 1:5) {
                     rf_dataset_train <- dataset_1[fold_seq != k ,]
                     rf_dataset_test <- dataset_1[fold_seq == k , ]
                     
                     rf_model <- randomForest( formula = TenYearCHD ~ . ,
                                               ntree = n, x = rf_dataset_train[,-16] ,
                                               y = factor(rf_dataset_train[,16]), 
                                                xtest = rf_dataset_test[,-16], ytest = factor(rf_dataset_test[,16]) )
                     conf_mat<-(rf_model$test$confusion)
                     ##print(conf_mat)
                      cv_error_vec[k] <-  (conf_mat[1,1] + conf_mat[2,2]) / nrow(rf_dataset_test)
        }
        
        rf_df[counter, 1] <- n 
        rf_df[counter, 2] <- mean(cv_error_vec)
        
}
#kable(rf_df)
kable(rf_df[sort(x = rf_df[,2], decreasing = TRUE, index.return = TRUE)$ix,])
ntree accuracy
8 400 0.8485621
20 1000 0.8485588
19 950 0.8482863
13 650 0.8482829
10 500 0.8480127
11 550 0.8480127
15 750 0.8480116
18 900 0.8477424
12 600 0.8477391
5 250 0.8477335
4 200 0.8474677
3 150 0.8474610
14 700 0.8471952
6 300 0.8471941
7 350 0.8471930
16 800 0.8471919
17 850 0.8466503
2 100 0.8466469
9 450 0.8466447
1 50 0.8460964
#round(rf_df[rf_df$ntree==650,2], digits = 4)

g1<- ggplot() + 
        geom_point(data = rf_df , aes(x = ntree , y = accuracy)) +
        geom_line(data = rf_df , mapping = aes(x = ntree, y = accuracy), color = "blue") + 
        geom_hline( yintercept =   1-bestglm_model_cv_error , color = "red" ) + 
        theme_linedraw() +
        #scale_x_continuous(breaks=seq(50,1000,50)) +
        geom_text(data = rf_df , mapping = aes(x = ntree, y = accuracy, label = ntree) , nudge_x = 40 )+
        ggtitle("5-fold CV accuracy")

#----- old method: error is based on RandomForest OOB, the confusion matrix output of the RandomForest

rf_df <- data.frame(ntree = NA, accuracy = NA)
counter <- 0 
#set.seed(7)

for (n in seq(50,1000,50)) {
        set.seed(7)
        counter <- counter + 1 
        
        
        rf_model <- randomForest( formula = TenYearCHD ~ . , ntree = n, x = dataset_1[,-16] , y = factor(dataset_1[,16]) ) 
        conf_mat<-(rf_model$confusion)
        rf_df[counter, 2] <- (conf_mat[1,1] + conf_mat[2,2]) / nrow(dataset_1)
        rf_df[counter, 1] <- n 
       
        
}

g2<- ggplot() + 
        geom_point(data = rf_df , aes(x = ntree , y = accuracy)) +
        geom_line(data = rf_df , mapping = aes(x = ntree, y = accuracy), color = "blue") + 
        geom_hline( yintercept =   1-bestglm_model_cv_error , color = "red" ) + 
        theme_linedraw() +

        geom_text(data = rf_df , mapping = aes(x = ntree, y = accuracy, label = ntree) , nudge_x = 40 )+
        ggtitle( "OOB accuracy")

#--- now plotting in one page
grid.arrange(g1,g2, ncol = 2)

Here, I used both CV and out-of-bag(OOB) for evaluation of RF performance. It is not clear to me whether CV is needed or RF confusion matrix is a good measure, since it is based on OOB approach.

As we can see, the highest accuracy of the RandomForest model, in the range of 50 to 1000 trees, can be obtained by setting the number of trees equal to 400 for CV approach. The red line in the plot shows the best CV accuracy that we have gotten from logistic regression model instances. Since the highest accuracy of OOB is higher than the highest accuracy of CV, I chose the CV accuracy to be more discreet. The ntree=400 has the \(CV accuracy = 0.8486\), which is \(0.0002\) worse than the best logistic regression model! However, if we consider OOB accuracy then the RandomForest model is doing \(0.0012\) better than the best logistic regression model.

The accuracy of the model has increased in RF, but the price is losing interpretability. RF is a blackbox, and we have no clue about the relations between predictors and the response variable.

3.5 What does the model predict on my personal data?

Here in order to complete this report, I want to add a prediction part based on a new dataset. The dataset has only one record, which includes my own personal data. In other words, I have created a model and I want to know whether it predicts a CHD for me or not.

#colnames(dataset_1)
pred_data <- matrix(ncol= 15)
colnames(pred_data) <- colnames(dataset_1[-16])
pred_data <- data.frame(pred_data)

pred_data<-dataset_1[1,-16]
pred_data$age <- 32 
pred_data$education <- factor(4, levels = c(1,2,3,4)) 
pred_data$currentSmoker <- FALSE 
pred_data$cigsPerDay <- 0 
pred_data$BPMeds <- FALSE
pred_data$prevalentStroke <- FALSE 
pred_data$prevalentHyp <- FALSE
pred_data$diabetes <- FALSE
pred_data$totChol <- 211 
pred_data$sysBP <- 100 
pred_data$diaBP <- 70 
pred_data$BMI <- 23
pred_data$heartRate <- 90
pred_data$glucose <- 90

The prediction output of the logistic regression model:

set.seed(7)
var_selection <- c("male", "age", "cigsPerDay", "sysBP" , "glucose")
glm_BIC_opt <- glm(data = dataset_1 , formula = TenYearCHD ~ male+ age+ cigsPerDay+ sysBP+ glucose ,family = binomial )
predict(glm_BIC_opt, newdata = pred_data, type = "response")
##          1 
## 0.02752699

and RandomForest Prediction:

set.seed(7)
rf_model <- randomForest( formula = TenYearCHD ~ . , ntree = 400, x = dataset_1[,-16] , y = factor(dataset_1[,16]) ) 
predict(rf_model, newdata = pred_data)
##     1 
## FALSE 
## Levels: FALSE TRUE

So it seems that I am not at risk now! However, as I mentioned before, the models are for sake of education and machine learning practice, rather than medical prediction!

4.Final Model exploration Using Parallel coordinates

Let’s have the last look at the model

#colnames(dataset_3)

dataset_3 <- dataset_2[complete.cases(dataset_2),]
dataset_3$totChol <- GroupNumeric(x = dataset_3$totChol , n = 5 , orderedFactor = TRUE)

dataset_3$sysBP <- GroupNumeric(x = dataset_3$sysBP , n = 5 , orderedFactor = TRUE)
dataset_3$diaBP <- GroupNumeric(x = dataset_3$diaBP , n = 5 , orderedFactor = TRUE)
dataset_3$BMI <- GroupNumeric(x = dataset_3$BMI , n = 5, orderedFactor = TRUE )
dataset_3$heartRate <-GroupNumeric(x = dataset_3$heartRate , n = 5 , orderedFactor = TRUE)
dataset_3$glucose <- GroupNumeric(x =dataset_3$glucose , n = 5 , orderedFactor = TRUE)
dataset_3$age <- GroupNumeric(x = dataset_3$age , n = 5 , orderedFactor = TRUE)

#ordered factor needed 

dataset_3_GK <- dataset_3[,c("male","age","cigsPerDay","sysBP","glucose","TenYearCHD")]
plot(GKtauDataframe(dataset_3_GK))

colors_parcoo <- c("CCCCCC","999999",
                  "3399FF","0066CC","003366","006699","3399CC",
                  "FFCCFF","FF99FF","FF66FF","9933FF","9966CC",
                  "CCFF00","99FF00","66FF00","33FF00","006600",
                  "FFCC00","FF9900","FF6600","FF3300","660000",
                  "0066FF","000000")
colors_parcoo <- paste0("#",colors_parcoo)
# + scale_color_brewer(palette = "Set3") 
g4<- ggparallel(data = dataset_3 , vars = list("male","cigsPerDay","glucose", "sysBP","age", "TenYearCHD"),
                color = "black", label = TRUE, text.angle = 0,label.size =2 , order = 0  ) +
        theme(legend.position="none") + 
        scale_colour_manual(values = colors_parcoo)+
        scale_fill_manual(values = colors_parcoo)

g4

The result is mostly as expected. Based on the GKtau values, the predictors have the least association with each other. This is what we want in order to avoid collinearity phenomenon.

However, the parallel coordinate still shows some interesting points. For instance, the association between age groups and the TenYearCHD outcome is interesting. The lower age groups participate very little in the TenYearCHD==TRUE, which means that age is positively related to the disease. On the other side, females (male == FALSE) contribute more in 0 cigs and [1,20) cigs groups, comparing to males. In other words, men tend to smoke more cigarettes, and are heavy smokers.

Interactive parallel coordinates can yield much better insight since we can track groups of observations along the axes.

5.Conclusion

In this study, a dataset of Framingham Heart Study including 4240 observations and 16 variables was used in order to build predictive models. The models were supposed to predict coronary heart disease(CHD) in ten years.

Having explored the dataset visually and numerically, logistic Regression and Random Forest models were utilised to build models. The evaluation of the models were done using K-Fold Cross-Validation.

For expanding this study, further classification methods such as Support Vector Machine(SVM), Gradient Boosting(GB), Neural Network models, K-nearest neighbour algorithm, and even decision tree can be used.

Shahin Ashkiani Contact@shahin-ashkiani.com

September 2017