This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This report is an machine learning/data science investigation in the famous Framingham heart study(FHS). More specifically, the goal is making some predictive models on a FHS dataset, and reviewing some exploratory and modelling techiniques through doing so. But what is FHS?
We read about FHS:
The Framingham Heart Study is a long term prospective study of the etiology of cardiovascular disease among a population of free living subjects in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects FHS Longitudinal Data Document
The reason of choosing this dataset is two fold. Beside the importace of the FHS, it has been chosen in order to having a opportunity to commemorate a great online course, which heavily influced me and made me more interested in analytics in summer 2015. This data is borrowed from Analytics Edge course of MIT, presented on Edx platform. Salute to all the preparators of the course! Analytics Edge at EdX
The second reason of picking this dataset is rather personal! I have went recently through some medical checkups, so I can feed my own data into the predictive model to have a prediction about myself! Even though this study and the resulted models are not prepared to have such medical usage, it is still nice to have something personal here!
The dataset is a rather small subset of possible FHS dataset, having 4240 observations and 16 variables. The variables are as follows:
education : A categorical variable of the participants education, with the levels: Some high school (1), high school/GED (2), some college/vocational school (3), college (4)
cigsPerDay: Number of cigarettes smoked each day
diabetes: Diabetic according to criteria of first exam treated
glucose: Blood glucose level (mg/dL)
And finally the response variable : + TenYearCHD : The 10 year risk of coronary heart disease(CHD).
3658 of these 4240 records are complete cases, and the rest have some missing values.
Before every step, loading the required libraries:
require(knitr)
require(dplyr)
require(ggplot2)
require(readr)
require(gridExtra) #for multiple plots
require(GoodmanKruskal)
require(vcd) # for Cramer's V
require(psych) # for phi() in order to compute the strenght of a found association
require(boot) #for cross-validation
require(ggparallel) # for parallel coordinate plot
require(bestglm) # to find the best variable subset using CV
require(randomForest) # for random Forest test on data
require(RColorBrewer) # apetizing palettes!
Then, loading the Framingham Heart Study dataset. This data comes from the BioLINCC website. FHS Longitudinal Data Document
Let’s have a high level numerical look at the dataset.
##print("Dimension of the dataset")
dim(framingham_dataset)
## [1] 4240 16
##print("The first six rows of the dataset")
kable(head(framingham_dataset))
| male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TRUE | 39 | 4 | FALSE | 0 | FALSE | FALSE | FALSE | FALSE | 195 | 106.0 | 70 | 26.97 | 80 | 77 | FALSE |
| FALSE | 46 | 2 | FALSE | 0 | FALSE | FALSE | FALSE | FALSE | 250 | 121.0 | 81 | 28.73 | 95 | 76 | FALSE |
| TRUE | 48 | 1 | TRUE | 20 | FALSE | FALSE | FALSE | FALSE | 245 | 127.5 | 80 | 25.34 | 75 | 70 | FALSE |
| FALSE | 61 | 3 | TRUE | 30 | FALSE | FALSE | TRUE | FALSE | 225 | 150.0 | 95 | 28.58 | 65 | 103 | TRUE |
| FALSE | 46 | 3 | TRUE | 23 | FALSE | FALSE | FALSE | FALSE | 285 | 130.0 | 84 | 23.10 | 85 | 85 | FALSE |
| FALSE | 43 | 2 | FALSE | 0 | FALSE | FALSE | TRUE | FALSE | 228 | 180.0 | 110 | 30.30 | 77 | 99 | FALSE |
##print("Examination of the structure of the dataset")
str(framingham_dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4240 obs. of 16 variables:
## $ male : logi TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ age : int 39 46 48 61 46 43 63 45 52 43 ...
## $ education : Factor w/ 4 levels "1","2","3","4": 4 2 1 3 3 2 1 2 1 1 ...
## $ currentSmoker : logi FALSE FALSE TRUE TRUE TRUE FALSE ...
## $ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
## $ BPMeds : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ prevalentStroke: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ prevalentHyp : logi FALSE FALSE FALSE TRUE FALSE TRUE ...
## $ diabetes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
## $ sysBP : num 106 121 128 150 130 ...
## $ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
## $ BMI : num 27 28.7 25.3 28.6 23.1 ...
## $ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
## $ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
## $ TenYearCHD : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 16
## .. ..$ male : list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## .. ..$ age : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ education :List of 3
## .. .. ..$ levels : num 1 2 3 4
## .. .. ..$ ordered : logi FALSE
## .. .. ..$ include_na: logi FALSE
## .. .. ..- attr(*, "class")= chr "collector_factor" "collector"
## .. ..$ currentSmoker : list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## .. ..$ cigsPerDay : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BPMeds : list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## .. ..$ prevalentStroke: list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## .. ..$ prevalentHyp : list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## .. ..$ diabetes : list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## .. ..$ totChol : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ sysBP : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ diaBP : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ BMI : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ heartRate : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ glucose : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ TenYearCHD : list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
##print("Examination of the summary of variables")
summary(framingham_dataset)
## male age education currentSmoker
## Mode :logical Min. :32.00 1 :1720 Mode :logical
## FALSE:2420 1st Qu.:42.00 2 :1253 FALSE:2145
## TRUE :1820 Median :49.00 3 : 689 TRUE :2095
## Mean :49.58 4 : 473
## 3rd Qu.:56.00 NA's: 105
## Max. :70.00
##
## cigsPerDay BPMeds prevalentStroke prevalentHyp
## Min. : 0.000 Mode :logical Mode :logical Mode :logical
## 1st Qu.: 0.000 FALSE:4063 FALSE:4215 FALSE:2923
## Median : 0.000 TRUE :124 TRUE :25 TRUE :1317
## Mean : 9.006 NA's :53
## 3rd Qu.:20.000
## Max. :70.000
## NA's :29
## diabetes totChol sysBP diaBP
## Mode :logical Min. :107.0 Min. : 83.5 Min. : 48.0
## FALSE:4131 1st Qu.:206.0 1st Qu.:117.0 1st Qu.: 75.0
## TRUE :109 Median :234.0 Median :128.0 Median : 82.0
## Mean :236.7 Mean :132.4 Mean : 82.9
## 3rd Qu.:263.0 3rd Qu.:144.0 3rd Qu.: 90.0
## Max. :696.0 Max. :295.0 Max. :142.5
## NA's :50
## BMI heartRate glucose TenYearCHD
## Min. :15.54 Min. : 44.00 Min. : 40.00 Mode :logical
## 1st Qu.:23.07 1st Qu.: 68.00 1st Qu.: 71.00 FALSE:3596
## Median :25.40 Median : 75.00 Median : 78.00 TRUE :644
## Mean :25.80 Mean : 75.88 Mean : 81.96
## 3rd Qu.:28.04 3rd Qu.: 83.00 3rd Qu.: 87.00
## Max. :56.80 Max. :143.00 Max. :394.00
## NA's :19 NA's :1 NA's :388
The following code can automatically generates all uni-variable plots of a dataset.
# just it needs removal of character, time and date variables
dataset <- data.frame(framingham_dataset )
all_plots <- lapply(X = 1:ncol(dataset), function(x) ggplot()+
geom_bar(data = dataset,
aes(x = dataset[,x]), fill = "skyblue" ,
alpha = 0.6)+
theme_linedraw()+
xlab(colnames(dataset)[x])
)
#colnames(dataset)
marrangeGrob(grobs=all_plots, nrow=2, ncol=2)
This is for getting the least understanding of the variables. The understanding of the whole problem, and the dataset, would be achieved through multi-variate or at least bi-variate visualization. What I want do to here is getting sense of the data, i.e. feeling the data and becoming familiar with it.
Now we can venture on having some bi-variate visualizations, specially in order to see the relation between response variable(TenYearCHD) and predictors. Not all pair of variables can be investigated due to deluge number of plots! We can move back to this step later and deeper understanding.
The following code can generate all the bi-variate plots of the response variable. Since the response variable is a binary variable, we would either have boxplots, when the predictor is quantitative, or segmented bar plots, when the predictor is a qualitative variable.
ptlist_bi <- list()
for (var in colnames(dataset) ){
if (class(dataset[,var]) %in% c("factor","logical") ) {
###print(class(dataset[,var]))
###print("first if")
ptlist_bi[[var]] <- ggplot(data = dataset) +
geom_bar( aes_string(x = var, fill ="TenYearCHD" ), position = "fill") +
theme_linedraw() +
xlab(var)
} else if (class(dataset[,var]) %in% c("numeric","double","integer") ) {
###print(class(dataset[,var]))
# #print("second if")
ptlist_bi[[var]] <- ggplot(data = dataset) +
geom_boxplot(aes_string(y = var, x ="TenYearCHD" )) +
theme_linedraw() +
xlab("TenYearCHD") +
ylab(var)
}
}
#names(ptlist_bi)
marrangeGrob(grobs=ptlist_bi, nrow=2, ncol=2)
Based on what we have here, being male is directly related to
TenYearCHD, thus the variable male seems a relatively good predictor. Similarly, Age seems a good predictor since the patients with TenYearCHD == TRUE, have higher median of age, with almost a similar distribution. In contrast, there seems no relation between different categories of the education and the response variable. The current Smoker variable shows a slight relation with the response variable, as the current smokers have a slightly higher risk of TenYearCHD. With the same manner we can investigate remaining plots.
However, one may want to have a numeric value for such associations, beside these graphs, which are essentially qualitative approaches. In order to have such numeric measures, I would like to use Goodman and Kruskal's tau measure, a measure of association between two unordered factor, i.e. two categorical/nominal variables. Among the factor variables that we have in this dataset, only the education is ordinal, i.e. its categories have meaning. This measure is more informational than Cramer’s V or chi-square measures. More on this issue, here GoodmanKruskal Package
class_list <- lapply(X = 1:ncol(dataset), function(x) class(dataset[,x]))
names(class_list) <- colnames(dataset)
dataset_cat_variables <- subset(x = dataset, select = names(which(class_list=="logical"| class_list=="factor")))
#colnames(dataset_cat_variables)
dataset_cat_variables$education <- NULL
GKmatrix_dataset <- GKtauDataframe(dataset_cat_variables)
plot(GKmatrix_dataset)
As one can see, there is very little explanatory power in the predictors regarding the variability of the response variable. In other words, based on the Goodman and Kruskal's tau measure, there is almost no association between our predictors and the response variable. It can be seen in the values of the TenYearCHD column.
This is not a good news for the model, assuming that I am doing G&Ktau correctly!
In order to check these findings, we can test the significance of association of categorical variables with the response variable using Chi-square test, and then using Phi correlation coefficient in order to evaluate the strength of possible association. The phi is used for 2x2 contigency tables. For larger tables, i.e. variables with more levels, Cramer’s V can be utilised. About Cramer’s V in R , About Phi Correlation in R
unlist(lapply(X = 1:(ncol(dataset_cat_variables)-1) ,FUN = function(x) chisq.test(table(dataset_cat_variables[,7],dataset_cat_variables[,x]))$p.value ))
## [1] 1.121518e-08 2.211021e-01 3.096658e-08 1.795676e-04 1.188961e-30
## [6] 5.525144e-10
unlist(lapply(X = 1:(ncol(dataset_cat_variables)-1), FUN = function(x)
phi(matrix(table(dataset_cat_variables[,7], dataset_cat_variables[,x])), digits = 5)) )
## [1] 0 0 0 0 0 0
It is strange that when p-value of the Chi-square is so low, the significance of the possible association is zero. This is something beyond my current level of knowledge, and I am going to seek the reasons of. This document states that these two tests (chi-squre and Phi correlation) are basically yielding the same results in large number of observations, since one is based on normal distribution and the other is based on t-distribution. Here is more explanation about the phi coefficient. Moreover, some sources suggest that in the case of presence-absence dichotomous variables in a contingency table, the tests would be different.
The real peril for the model lies in the phenomenon of collinearity. Collinearity happens when two predictors are highly associated or correlated. We need to check such characteristics, and then move on building the logisitc regression model.
According to the Goodman and Kruskal's tau plot, we should not be worried about the collinearity. But what about the education which was excluded due to its ordered nature? The Cramer’s V test shows that the strength is not considerable.
dataset_cat_variables <- subset(x = dataset, select = names(which(class_list=="logical"| class_list=="factor")))
# Chi square independence test of education vs. other categorcial variables
#print("Chi square P-values of variable education and other categorical variables")
unlist(lapply(X = 1:(ncol(dataset_cat_variables)) ,FUN = function(x) chisq.test(table(dataset_cat_variables$education,dataset_cat_variables[,x]))$p.value ))
## [1] 9.850320e-19 0.000000e+00 3.583561e-04 8.575789e-01 1.512755e-01
## [6] 8.987478e-08 1.729736e-02 5.190369e-07
#re-locating the education variable to the first variable of the dataset
dataset_cat_variables <- dataset_cat_variables[,colnames(dataset_cat_variables)[c(2,1,3,4,5,6,7)]]
#print("Cramer's V value of association of variable education and other categorical variables")
unlist(lapply(X = 2:(ncol(dataset_cat_variables)), FUN = function(x)
assocstats(x = table(dataset_cat_variables[,1], dataset_cat_variables[,x]))$cramer ) )
## [1] 0.14501778 0.06676356 0.01369873 0.03579241 0.09281974 0.04955549
None of the variables show strong association with education. The highest value of the Cramer’s V is 0.145, which is quite weak, between education and gender.
But what about the variables such as currentSmoker and cigsPerDay ? Clearly one is predictable from the other. Having a numerical and a categorical variable, we can group the numerical variable into several categories and then using Goodman and Kruskal's tau. The function GroupNumeric() from the GoodmanKruskal package can help to convert a quantitative variable into a qualitative one, however, it is easy to use cut() function here based on subjective understading of the data, and the multimodal distribution of the cigsPerDay which was seen before.
Now let’s check the GKtau values
class_list <- lapply(X = 1:ncol(dataset_2), function(x) class(dataset_2[,x]))
names(class_list) <- colnames(dataset)
t <- sapply(X = names(class_list) , FUN = function(x) TRUE %in% ( class_list[x] %in% c("factor","logical")) )
dataset_cat_variables_2 <- subset(x = dataset_2, select = t )
GKmatrix_dataset_2 <- GKtauDataframe(dataset_cat_variables_2)
plot(GKmatrix_dataset_2)
From the tau values on the matrix-plot and their background shape, we can see that cigsPerDay can totally explain variability of currentSmoker. This is not a surprise as if we know how many cigarettes one smokes per day, from 0 to any number, we can cenrtainly claim that we know one is smoker or not!
The second association is cigsPerDay to male, but it is not strong. So the former can explain littler variability of the latter.
In the next dataset, I converted all quantitative variables into qualitative/categorical variables. Now we can have a comprehensive matrix, although some information is lost due to conversion.
#dataset3 includes only categorical variables.
dataset_3 <- dataset_2
dataset_3$totChol <- GroupNumeric(x = dataset$totChol , n = 5 )
## Warning in classInt::classIntervals(x, n, style = style, ...): var has
## missing values, omitted in finding classes
dataset_3$sysBP <- GroupNumeric(x = dataset$sysBP , n = 5 )
dataset_3$diaBP <- GroupNumeric(x = dataset$diaBP , n = 5 )
dataset_3$BMI <- GroupNumeric(x = dataset$BMI , n = 5 )
## Warning in classInt::classIntervals(x, n, style = style, ...): var has
## missing values, omitted in finding classes
dataset_3$heartRate <-GroupNumeric(x = dataset$heartRate , n = 5 )
## Warning in classInt::classIntervals(x, n, style = style, ...): var has
## missing values, omitted in finding classes
dataset_3$glucose <- GroupNumeric(x = dataset$glucose , n = 5 )
## Warning in classInt::classIntervals(x, n, style = style, ...): var has
## missing values, omitted in finding classes
dataset_3$age <- GroupNumeric(x = dataset$age , n = 5 )
class_list <- lapply(X = 1:ncol(dataset_3), function(x) class(dataset_3[,x]))
names(class_list) <- colnames(dataset)
t <- sapply(X = names(class_list) , FUN = function(x) TRUE %in% ( class_list[[x]] %in% c("factor","logical")) )
dataset_cat_variables <- subset(x = dataset_3, select = t )
GKmatrix_dataset <- GKtauDataframe(dataset_cat_variables)
plot(GKmatrix_dataset)
As we can see, sysBP and diaBP can predict prevalentHyp, but not very strongly. (around 0.5). We can keep prevalentHyp in the model thus. The second point is about the visual output of the GK tau. Similar to many other visualizations, this visual matrix does not scale well, and reading values becomes difficult. Better to use numeric values then, i.e. using GKtauDataframe().
We are going to build two models based on first two datasets, since the last one with all categorical data does not seem to yield anything more than others.
Now it is the time of evaluating the model instances. Here, let’s call logistic regression the model, and let’s name each variation based on variable selection or conversion, an instance.
We have two instances: 1. A model instance including all original variables, specifically cigsPerday and currentSmoker variables 2. A model instance including all original variables except currentSmoker. cigsPerday is converted into a factor variable
In order to evaluate the model instances, we can use either mathematical adjustments of training error rates such as AIC. The other approach is using a validation dataset and evaluate the model based on its performance on this set. From the latter approach, I choose to utilitize K-fold Cross-Validation(CV) technique, and more specifically 5-fold CV. Here, there are other techniques such as Leave one out cross-validation.
# in order to have a reproducible report
set.seed(7)
# since cv.glm() in the next step cannot handle missing values,
# I keep only the complete cases in the model.
dataset_1 <- dataset[complete.cases(dataset),]
dataset_2$currentSmoker <- NULL
dataset_2 <- dataset_2[complete.cases(dataset_2),]
#nrow(dataset_1) == nrow(dataset_2)
model1 <- glm(data = dataset_1, formula = TenYearCHD ~ . , family = "binomial")
summary(model1)
##
## Call:
## glm(formula = TenYearCHD ~ ., family = "binomial", data = dataset_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9298 -0.5927 -0.4239 -0.2826 2.8616
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.264593 0.710186 -11.637 < 2e-16 ***
## maleTRUE 0.534965 0.109901 4.868 1.13e-06 ***
## age 0.062229 0.006756 9.211 < 2e-16 ***
## education2 -0.191812 0.123434 -1.554 0.12019
## education3 -0.195935 0.150139 -1.305 0.19188
## education4 -0.059625 0.164620 -0.362 0.71720
## currentSmokerTRUE 0.073036 0.156749 0.466 0.64126
## cigsPerDay 0.018005 0.006234 2.888 0.00387 **
## BPMedsTRUE 0.165206 0.234484 0.705 0.48109
## prevalentStrokeTRUE 0.704867 0.491479 1.434 0.15152
## prevalentHypTRUE 0.233424 0.138202 1.689 0.09122 .
## diabetesTRUE 0.025920 0.316132 0.082 0.93465
## totChol 0.002377 0.001129 2.105 0.03527 *
## sysBP 0.015456 0.003812 4.054 5.03e-05 ***
## diaBP -0.004121 0.006444 -0.640 0.52247
## BMI 0.005215 0.012786 0.408 0.68338
## heartRate -0.003004 0.004213 -0.713 0.47592
## glucose 0.007216 0.002234 3.229 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3121.2 on 3657 degrees of freedom
## Residual deviance: 2752.2 on 3640 degrees of freedom
## AIC: 2788.2
##
## Number of Fisher Scoring iterations: 5
This model is based on the original dataset. The records with missing values are omitted from the dataset, and the model shows that variables male, age, cigsPerDay, totChol, sysBP and glucose are significant, and prevalentHyp is to somehow significant.
model2 <- glm(data = dataset_2, formula = TenYearCHD ~ . , family = "binomial")
summary(model2)
##
## Call:
## glm(formula = TenYearCHD ~ ., family = "binomial", data = dataset_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9192 -0.5933 -0.4226 -0.2833 2.9029
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.304021 0.709840 -11.698 < 2e-16 ***
## maleTRUE 0.565244 0.108219 5.223 1.76e-07 ***
## age 0.061732 0.006744 9.154 < 2e-16 ***
## education2 -0.187641 0.123379 -1.521 0.128298
## education3 -0.204574 0.150217 -1.362 0.173243
## education4 -0.071952 0.164771 -0.437 0.662346
## cigsPerDay[1,20) cigs 0.310478 0.112562 2.758 0.005810 **
## cigsPerDay[20,30) cigs 0.652554 0.194769 3.350 0.000807 ***
## cigsPerDay[30,40) cigs 0.976745 0.273147 3.576 0.000349 ***
## cigsPerDay40+ cigs 0.489754 0.340397 1.439 0.150215
## BPMedsTRUE 0.170878 0.234525 0.729 0.466238
## prevalentStrokeTRUE 0.708396 0.490581 1.444 0.148741
## prevalentHypTRUE 0.230710 0.138095 1.671 0.094790 .
## diabetesTRUE 0.033275 0.315429 0.105 0.915985
## totChol 0.002478 0.001130 2.193 0.028299 *
## sysBP 0.015313 0.003813 4.016 5.92e-05 ***
## diaBP -0.003713 0.006451 -0.576 0.564881
## BMI 0.004756 0.012805 0.371 0.710321
## heartRate -0.002470 0.004200 -0.588 0.556473
## glucose 0.007095 0.002231 3.180 0.001471 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3121.2 on 3657 degrees of freedom
## Residual deviance: 2752.6 on 3638 degrees of freedom
## AIC: 2792.6
##
## Number of Fisher Scoring iterations: 5
In the second model instance, the significant variables are the same as previous model instance.
A very important question is how to measure the performace of these two model instances and how to compare them? There are various approaches to measure the performance, but I choose 5-folds Cross Validation here.
In order to do the cross-validation and evaluate the model instances, we need a cost function. A function that is suggested by the boot package, is a simple function which returns the average number of mis-classified cases based on a threshold. The threshold is set by default to 0.5, which means any observation with more than 50% chance of CHD is flagged as a TRUE case with on-going disease. From the medical point of view, I rather be more cautious and discreet, so I reduce the threshold to 0.4, so even the cases with 40% chance of getting the heart disease, would be flagged to recieve further medical attentions. Reducing the threshold, increases the False Positive Rate and thus the medical costs, but reduces the False Negative Rate and saves lives! Having said that, it is possible to use either of Sensitiviety or Specificity as the cost function. Moreover, it is possible to use Area Under the Curve(AUC) joint with CV using the cvAUC package.
It is worth to mention that the proper way to pick a threshold, is using ROC curve.
set.seed(7)
# borrowed from https://stackoverflow.com/questions/16781551/cost-function-in-cv-glm-of-boot-library-in-r
cost <- function(r, p = 0, threshold = 0.4) {
mean((p < threshold) & r==1 | (p > threshold) & r==0)
}
model1_cv_delta <- cv.glm(data = dataset_1, glmfit = model1, cost = cost, K = 5)$delta[1]
model2_cv_delta <- cv.glm(data = dataset_2, glmfit = model2, cost = cost, K = 5)$delta[1]
kable(data.frame("model1" = model1_cv_delta ,
"model2" = model2_cv_delta ),
caption = "CV Misclassification error", digits = 4)
| model1 | model2 |
|---|---|
| 0.158 | 0.1597 |
kable(data.frame("model1" = 1-model1_cv_delta ,
"model2" = 1-model2_cv_delta ),
caption = "CV-Accuracy", digits = 4)
| model1 | model2 |
|---|---|
| 0.842 | 0.8403 |
As we can see, the two models are very similar, however, model1 shows a slight advantage. The accuracy is quite high indeed. But let’s see whether we can improve the model2 by removing some of its variables.
Once again let’s have a look at the model1 summary.
summary(model1)
##
## Call:
## glm(formula = TenYearCHD ~ ., family = "binomial", data = dataset_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9298 -0.5927 -0.4239 -0.2826 2.8616
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.264593 0.710186 -11.637 < 2e-16 ***
## maleTRUE 0.534965 0.109901 4.868 1.13e-06 ***
## age 0.062229 0.006756 9.211 < 2e-16 ***
## education2 -0.191812 0.123434 -1.554 0.12019
## education3 -0.195935 0.150139 -1.305 0.19188
## education4 -0.059625 0.164620 -0.362 0.71720
## currentSmokerTRUE 0.073036 0.156749 0.466 0.64126
## cigsPerDay 0.018005 0.006234 2.888 0.00387 **
## BPMedsTRUE 0.165206 0.234484 0.705 0.48109
## prevalentStrokeTRUE 0.704867 0.491479 1.434 0.15152
## prevalentHypTRUE 0.233424 0.138202 1.689 0.09122 .
## diabetesTRUE 0.025920 0.316132 0.082 0.93465
## totChol 0.002377 0.001129 2.105 0.03527 *
## sysBP 0.015456 0.003812 4.054 5.03e-05 ***
## diaBP -0.004121 0.006444 -0.640 0.52247
## BMI 0.005215 0.012786 0.408 0.68338
## heartRate -0.003004 0.004213 -0.713 0.47592
## glucose 0.007216 0.002234 3.229 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3121.2 on 3657 degrees of freedom
## Residual deviance: 2752.2 on 3640 degrees of freedom
## AIC: 2788.2
##
## Number of Fisher Scoring iterations: 5
Till now, we had assumed that all variables must be included in the model, except for the case of collinearity. Now, we are permitted to improve the model by removing non-significant variables, i.e. variables that do not add value to our model. There are several approaches here such as forward selection and backward elimination.
For instance, backward elimination method based on the p-values of the insignificant variables. The elimination continues until the AIC shows no further improvement. There are also stats::step() and bestglm::bestglm() functions to automate the variable selection process. The latter package and its main function have many options for choosing the information criteria such as AIC, BIC, LOOCV and CV, while the former’s stepwise algorithm is based on AIC. Surprisingly, the bestglm() does not work with a model instance including categorical variable when the chosen IC is CV. It is not difficult to write a backward elimination algorithm based on cross-validation error that can handle categorical variables as well. Nevertheless, here I accede to use AIC and BIC due to lack of time.
bestglm_bic_model <- bestglm(Xy = dataset_1 , family = binomial , IC = "BIC")
step_aic_model <- step(object = model1 )
Now let’s have a look at the two instances and their CV errors.
bestglm_bic_model
## BIC
## Best Model:
## Df Sum Sq Mean Sq F value Pr(>F)
## male 1 4.0 3.970 34.03 5.89e-09 ***
## age 1 26.4 26.364 226.03 < 2e-16 ***
## cigsPerDay 1 2.3 2.294 19.67 9.49e-06 ***
## sysBP 1 10.3 10.330 88.56 < 2e-16 ***
## glucose 1 3.3 3.256 27.91 1.34e-07 ***
## Residuals 3652 426.0 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The bestglm::bestglm() based on BIC has reduced the model variables to 5: male, age, cigsPerDay, sysBP and glucose. All the variables are strongly significant, as expected.
summary(step_aic_model)
##
## Call:
## glm(formula = TenYearCHD ~ male + age + cigsPerDay + prevalentStroke +
## prevalentHyp + totChol + sysBP + glucose, family = "binomial",
## data = dataset_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0063 -0.5972 -0.4291 -0.2841 2.8634
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.745885 0.522456 -16.740 < 2e-16 ***
## maleTRUE 0.553297 0.107018 5.170 2.34e-07 ***
## age 0.065411 0.006442 10.153 < 2e-16 ***
## cigsPerDay 0.019579 0.004181 4.683 2.82e-06 ***
## prevalentStrokeTRUE 0.751698 0.483585 1.554 0.1201
## prevalentHypTRUE 0.225762 0.135085 1.671 0.0947 .
## totChol 0.002257 0.001122 2.011 0.0443 *
## sysBP 0.014218 0.002857 4.976 6.50e-07 ***
## glucose 0.007317 0.001673 4.374 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3121.2 on 3657 degrees of freedom
## Residual deviance: 2757.5 on 3649 degrees of freedom
## AIC: 2775.5
##
## Number of Fisher Scoring iterations: 5
The stats::step() function based on AIC has reduced the model variables to 8: male, age, cigsPerDay,prevalentStroke, prevalentHyp, totChol, sysBP and glucose. Interenstingly, the optimum model instance found by step() has non-significant variable.
One interesting point is the computation time of bestglm() is more than step(), by a vast margin!
bestglm_model_cv_error <- cv.glm(data = dataset_1,
glmfit = glm(formula = TenYearCHD ~ male+age+cigsPerDay+sysBP+glucose ,
family = binomial, data = dataset_1),
cost = cost, K = 5)$delta[1]
step_model_cv_error <- cv.glm(data = dataset_1, glmfit = step_aic_model, cost = cost, K = 5)$delta[1]
kable(data.frame("bestglm() bic model" = bestglm_model_cv_error ,
"step() aic model" = step_model_cv_error ),
caption = "CV Misclassification error", digits = 4)
| bestglm…bic.model | step…aic.model |
|---|---|
| 0.1512 | 0.1514 |
kable(data.frame("bestglm() bic model" = 1-bestglm_model_cv_error ,
"step() aic model" = 1-step_model_cv_error ),
caption = "CV-Accuracy", digits = 4)
| bestglm…bic.model | step…aic.model |
|---|---|
| 0.8488 | 0.8486 |
Both AIC approach and BIC approach yields the same accuracy. Which to choose? I rather go for the AIC since the model instance has more predictors, and thus more insight. Nevertheless, it is also rational to choose the BIC model instance, since it is more parsimonious! Comparing to the model1 accuracy, we have \(0.8475 - 0.842 = 0.0055\) improvement in the accuracy through variable selection. However, we have lost very interesting informations about the relation of other predictors and the response variable. Sometimes, it is the goal of the method to keep all predictors. Here I assumed that the variable selection is allowed.
Till now, I have worked on logistic regression model only. There are more models available to model the current problem, and RandomForest is a popular one. Let’s give it a try and compare the result with the previous model.
#---- Error is the misclassification rate of CV output for each RF model instance
#---- The RF model instances are made by changing the number of trees from 50 to 1000
#---- The final result of CV misclassification rate of each chosen ntree is plotted
rf_df <- data.frame(ntree = NA, accuracy = NA)
counter <- 0
# preparation of the fold labels for each observation
fold_size<- nrow(dataset_1) %/% 5
remaining <- nrow(dataset_1) %% 5
fold_seq <- c(rep(1,fold_size), rep(2,fold_size), rep(3,fold_size), rep(4,fold_size),rep(5,fold_size + remaining) )
# for various number of trees we calculate CV error.
for (n in seq(50,1000,50)) {
set.seed(7)
counter <- counter + 1
cv_error_vec <- vector(length = 5)
# CV error is the average of misclassification error of test folds
for (k in 1:5) {
rf_dataset_train <- dataset_1[fold_seq != k ,]
rf_dataset_test <- dataset_1[fold_seq == k , ]
rf_model <- randomForest( formula = TenYearCHD ~ . ,
ntree = n, x = rf_dataset_train[,-16] ,
y = factor(rf_dataset_train[,16]),
xtest = rf_dataset_test[,-16], ytest = factor(rf_dataset_test[,16]) )
conf_mat<-(rf_model$test$confusion)
##print(conf_mat)
cv_error_vec[k] <- (conf_mat[1,1] + conf_mat[2,2]) / nrow(rf_dataset_test)
}
rf_df[counter, 1] <- n
rf_df[counter, 2] <- mean(cv_error_vec)
}
#kable(rf_df)
kable(rf_df[sort(x = rf_df[,2], decreasing = TRUE, index.return = TRUE)$ix,])
| ntree | accuracy | |
|---|---|---|
| 8 | 400 | 0.8485621 |
| 20 | 1000 | 0.8485588 |
| 19 | 950 | 0.8482863 |
| 13 | 650 | 0.8482829 |
| 10 | 500 | 0.8480127 |
| 11 | 550 | 0.8480127 |
| 15 | 750 | 0.8480116 |
| 18 | 900 | 0.8477424 |
| 12 | 600 | 0.8477391 |
| 5 | 250 | 0.8477335 |
| 4 | 200 | 0.8474677 |
| 3 | 150 | 0.8474610 |
| 14 | 700 | 0.8471952 |
| 6 | 300 | 0.8471941 |
| 7 | 350 | 0.8471930 |
| 16 | 800 | 0.8471919 |
| 17 | 850 | 0.8466503 |
| 2 | 100 | 0.8466469 |
| 9 | 450 | 0.8466447 |
| 1 | 50 | 0.8460964 |
#round(rf_df[rf_df$ntree==650,2], digits = 4)
g1<- ggplot() +
geom_point(data = rf_df , aes(x = ntree , y = accuracy)) +
geom_line(data = rf_df , mapping = aes(x = ntree, y = accuracy), color = "blue") +
geom_hline( yintercept = 1-bestglm_model_cv_error , color = "red" ) +
theme_linedraw() +
#scale_x_continuous(breaks=seq(50,1000,50)) +
geom_text(data = rf_df , mapping = aes(x = ntree, y = accuracy, label = ntree) , nudge_x = 40 )+
ggtitle("5-fold CV accuracy")
#----- old method: error is based on RandomForest OOB, the confusion matrix output of the RandomForest
rf_df <- data.frame(ntree = NA, accuracy = NA)
counter <- 0
#set.seed(7)
for (n in seq(50,1000,50)) {
set.seed(7)
counter <- counter + 1
rf_model <- randomForest( formula = TenYearCHD ~ . , ntree = n, x = dataset_1[,-16] , y = factor(dataset_1[,16]) )
conf_mat<-(rf_model$confusion)
rf_df[counter, 2] <- (conf_mat[1,1] + conf_mat[2,2]) / nrow(dataset_1)
rf_df[counter, 1] <- n
}
g2<- ggplot() +
geom_point(data = rf_df , aes(x = ntree , y = accuracy)) +
geom_line(data = rf_df , mapping = aes(x = ntree, y = accuracy), color = "blue") +
geom_hline( yintercept = 1-bestglm_model_cv_error , color = "red" ) +
theme_linedraw() +
geom_text(data = rf_df , mapping = aes(x = ntree, y = accuracy, label = ntree) , nudge_x = 40 )+
ggtitle( "OOB accuracy")
#--- now plotting in one page
grid.arrange(g1,g2, ncol = 2)
Here, I used both CV and out-of-bag(OOB) for evaluation of RF performance. It is not clear to me whether CV is needed or RF confusion matrix is a good measure, since it is based on OOB approach.
As we can see, the highest accuracy of the RandomForest model, in the range of 50 to 1000 trees, can be obtained by setting the number of trees equal to 400 for CV approach. The red line in the plot shows the best CV accuracy that we have gotten from logistic regression model instances. Since the highest accuracy of OOB is higher than the highest accuracy of CV, I chose the CV accuracy to be more discreet. The ntree=400 has the \(CV accuracy = 0.8486\), which is \(0.0002\) worse than the best logistic regression model! However, if we consider OOB accuracy then the RandomForest model is doing \(0.0012\) better than the best logistic regression model.
The accuracy of the model has increased in RF, but the price is losing interpretability. RF is a blackbox, and we have no clue about the relations between predictors and the response variable.
Here in order to complete this report, I want to add a prediction part based on a new dataset. The dataset has only one record, which includes my own personal data. In other words, I have created a model and I want to know whether it predicts a CHD for me or not.
#colnames(dataset_1)
pred_data <- matrix(ncol= 15)
colnames(pred_data) <- colnames(dataset_1[-16])
pred_data <- data.frame(pred_data)
pred_data<-dataset_1[1,-16]
pred_data$age <- 32
pred_data$education <- factor(4, levels = c(1,2,3,4))
pred_data$currentSmoker <- FALSE
pred_data$cigsPerDay <- 0
pred_data$BPMeds <- FALSE
pred_data$prevalentStroke <- FALSE
pred_data$prevalentHyp <- FALSE
pred_data$diabetes <- FALSE
pred_data$totChol <- 211
pred_data$sysBP <- 100
pred_data$diaBP <- 70
pred_data$BMI <- 23
pred_data$heartRate <- 90
pred_data$glucose <- 90
The prediction output of the logistic regression model:
set.seed(7)
var_selection <- c("male", "age", "cigsPerDay", "sysBP" , "glucose")
glm_BIC_opt <- glm(data = dataset_1 , formula = TenYearCHD ~ male+ age+ cigsPerDay+ sysBP+ glucose ,family = binomial )
predict(glm_BIC_opt, newdata = pred_data, type = "response")
## 1
## 0.02752699
and RandomForest Prediction:
set.seed(7)
rf_model <- randomForest( formula = TenYearCHD ~ . , ntree = 400, x = dataset_1[,-16] , y = factor(dataset_1[,16]) )
predict(rf_model, newdata = pred_data)
## 1
## FALSE
## Levels: FALSE TRUE
So it seems that I am not at risk now! However, as I mentioned before, the models are for sake of education and machine learning practice, rather than medical prediction!
Let’s have the last look at the model
#colnames(dataset_3)
dataset_3 <- dataset_2[complete.cases(dataset_2),]
dataset_3$totChol <- GroupNumeric(x = dataset_3$totChol , n = 5 , orderedFactor = TRUE)
dataset_3$sysBP <- GroupNumeric(x = dataset_3$sysBP , n = 5 , orderedFactor = TRUE)
dataset_3$diaBP <- GroupNumeric(x = dataset_3$diaBP , n = 5 , orderedFactor = TRUE)
dataset_3$BMI <- GroupNumeric(x = dataset_3$BMI , n = 5, orderedFactor = TRUE )
dataset_3$heartRate <-GroupNumeric(x = dataset_3$heartRate , n = 5 , orderedFactor = TRUE)
dataset_3$glucose <- GroupNumeric(x =dataset_3$glucose , n = 5 , orderedFactor = TRUE)
dataset_3$age <- GroupNumeric(x = dataset_3$age , n = 5 , orderedFactor = TRUE)
#ordered factor needed
dataset_3_GK <- dataset_3[,c("male","age","cigsPerDay","sysBP","glucose","TenYearCHD")]
plot(GKtauDataframe(dataset_3_GK))
colors_parcoo <- c("CCCCCC","999999",
"3399FF","0066CC","003366","006699","3399CC",
"FFCCFF","FF99FF","FF66FF","9933FF","9966CC",
"CCFF00","99FF00","66FF00","33FF00","006600",
"FFCC00","FF9900","FF6600","FF3300","660000",
"0066FF","000000")
colors_parcoo <- paste0("#",colors_parcoo)
# + scale_color_brewer(palette = "Set3")
g4<- ggparallel(data = dataset_3 , vars = list("male","cigsPerDay","glucose", "sysBP","age", "TenYearCHD"),
color = "black", label = TRUE, text.angle = 0,label.size =2 , order = 0 ) +
theme(legend.position="none") +
scale_colour_manual(values = colors_parcoo)+
scale_fill_manual(values = colors_parcoo)
g4
The result is mostly as expected. Based on the GKtau values, the predictors have the least association with each other. This is what we want in order to avoid collinearity phenomenon.
However, the parallel coordinate still shows some interesting points. For instance, the association between age groups and the TenYearCHD outcome is interesting. The lower age groups participate very little in the TenYearCHD==TRUE, which means that age is positively related to the disease. On the other side, females (male == FALSE) contribute more in 0 cigs and [1,20) cigs groups, comparing to males. In other words, men tend to smoke more cigarettes, and are heavy smokers.
Interactive parallel coordinates can yield much better insight since we can track groups of observations along the axes.
In this study, a dataset of Framingham Heart Study including 4240 observations and 16 variables was used in order to build predictive models. The models were supposed to predict coronary heart disease(CHD) in ten years.
Having explored the dataset visually and numerically, logistic Regression and Random Forest models were utilised to build models. The evaluation of the models were done using K-Fold Cross-Validation.
For expanding this study, further classification methods such as Support Vector Machine(SVM), Gradient Boosting(GB), Neural Network models, K-nearest neighbour algorithm, and even decision tree can be used.
Shahin Ashkiani Contact@shahin-ashkiani.com
September 2017