heartStudy <- read.csv("https://raw.githubusercontent.com/GauravPadawe/Framingham-Heart-Study/master/framingham.csv")
heartStudy.0 = heartStudy # make a copy of the data for data cleansing
heartStudy = na.omit(heartStudy.0) # Delete all records with missing components
y0=heartStudy$TenYearCHD
The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of free-living subjects in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects FHS Longitudinal Data Document. The study began in 1948 with 5,209 adult subjects, and is now on its third generation of participants.
Our data set includes 4240 subjects. After removing subjects with missing data for convenience, we are left with 3,658 subjects.
The variables in the data set are as follows:
male [cat]: the gender of the subjects
age [num]: Age of subject at time of medical examination in years.
Education [cat]: A categorical variable of the participants education, with the levels: Some high school(1), high school/GED(2), some college/vocational school(3), college(4)
currentSmoker [cat]: Whether or not the subject currently smokes at time of examinations
cigsPerDay [num]: Number of cigarettes the subject smokes per day
BPMeds [cat]: whether or not the subject takes anti-hypertensive medication at exam
prevalentStroke [cat]: Prevalent Stroke (0=free of disease)
prevalentHyp [cat]: Prevalent Hypertensive. Subject was defined as hypertensive if treated
diabetes [cat]: Diabetic according to criteria of first exam treated
totChol [num]: Total cholesterol (mg/dl)
sysBP [num]: Systolic Blood Pressure (mmHg)
diaBP [num]: Diastolic blood pressure (mmHg)
BMI [num]: body mass index, weight(kg)/height(m)^2
heartRate [num]: Heart Rate (beats/minute)
glucose [num]: Blood glucose level (mg/dL)
TenYearCHD(binary cat response variable): The 10 year risk of coronary heart disease(CHD)
Many studies have shown that variables like blood pressure, cholesterol, diabetes, smoking, BMI, and physical inactivity are good predictors of heart disease. The objective of this study is to explore the association between cholesterol and heart disease
According to Robert S Rosenson, a professor of Medicine,
The objective of this research is to build a logistic regression model to predict ten year heart disease risk using various risk factors associated with the individual patient.
hist(heartStudy$cigsPerDay)
hist(heartStudy$age)
pairs.panels(heartStudy[, 10:15],
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
I first create histograms and pair plots for all quantitative variables. The only issue in the histograms is that cigsPerDay appears skewed to the right. To remedy this I will discretize cigsPerDay by creating bins:
cigs = heartStudy$cigsPerDay
grp.cigs = cigs
grp.cigs[cigs %in% c(0:0)] = "0"
grp.cigs[cigs %in% c(1:10)] = "1-10"
grp.cigs[cigs %in% c(11:20)] = "11-20"
grp.cigs[cigs %in% c(20:70)] = "20+"
heartStudy$grp.cigs = grp.cigs
heartStudy$sd.totChol = (heartStudy$totChol-mean(heartStudy$totChol))/sd(heartStudy$totChol)
heartStudy$sd.sysBP = (heartStudy$sysBP-mean(heartStudy$sysBP))/sd(heartStudy$sysBP)
heartStudy$sd.diaBP = (heartStudy$diaBP-mean(heartStudy$diaBP))/sd(heartStudy$diaBP)
heartStudy$sd.BMI = (heartStudy$BMI-mean(heartStudy$BMI))/sd(heartStudy$BMI)
heartStudy$sd.heartRate = (heartStudy$heartRate-mean(heartStudy$heartRate))/sd(heartStudy$heartRate)
heartStudy$sd.glucose = (heartStudy$glucose-mean(heartStudy$glucose))/sd(heartStudy$glucose)
## drop the original variables except for the response variable
sd.heartStudy = heartStudy[c(1, 3, 4, 6, 7, 8, 9, 16, 17, 18, 19, 20, 21, 22, 23)]
Since this is a predictive model, we don’t worry about the interpretation of the coefficients. The objective is to identify a model that has the best predictive performance.
We randomly split the data into two subsets. 70% of the data will be used as training data. We will use the training data to search the candidate models, validate them and identify the final model using the cross-validation method. The 30% of hold-up sample will be used for assessing the performance of the final model.
## splitting data: 80% training and 20% testing
n <- dim(sd.heartStudy)[1]
train.n <- round(0.8*n)
train.id <- sample(1:n, train.n, replace = FALSE)
## training and testing data sets
train <- sd.heartStudy[train.id, ]
test <- sd.heartStudy[-train.id, ]
For convenience, we use 0.5 as the common cut-off for all three models to define the predicted. In a real application, you may want to find the optimal cut-off for each candidate models in the cross-validation process.
5-fold Cross-Validation Since our training data is relative small, I will use 5 fold cross validation to ensure the validation data set to have enough diabetes cases.
## 5-fold cross-validation
k=5
fold.size = round(dim(train)[1]/k)
## PE vectors for candidate models
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)
for(i in 1:k){
## Training and testing folds
valid.id = (fold.size*(i-1)+1):(fold.size*i)
valid = train[valid.id, ]
train.dat = train[-valid.id,]
## full model
candidate01 = glm(TenYearCHD ~grp.cigs + sd.totChol +sd.sysBP + sd.diaBP + sd.BMI +
sd.heartRate + sd.glucose, family = binomial(link = "logit"),
data = train.dat)
## reduced model
candidate03 = glm(diabetes ~ sd.totChol + sd.sysBP + sd.glucose,
family = binomial(link = "logit"),
data = train.dat)
## stepAIC Model
candidate02 = stepAIC(candidate01,
scope = list(lower=formula(candidate03),upper=formula(candidate01)),
direction = "forward", # forward selection
trace = 0 # do not show the details
)
## predicted probabilities of each candidate model
pred01 = predict(candidate01, newdata = valid, type="response")
pred02 = predict(candidate02, newdata = valid, type="response")
pred03 = predict(candidate03, newdata = valid, type="response")
## confusion matrix: ftable() will
confusion.matrix01 = ftable(as.vector((pred01>0.5)),(valid$TenYearCHD=="1"))
confusion.matrix02 = ftable(as.vector((pred02>0.5)),(valid$TenYearCHD=="1"))
confusion.matrix03 = ftable(as.vector((pred03>0.5)),(valid$TenYearCHD=="1"))
PE1[i] = (confusion.matrix01[1,2] + confusion.matrix01[2,1])/length(pred01)
PE2[i] = (confusion.matrix02[1,2] + confusion.matrix02[2,1])/length(pred02)
PE3[i] = (confusion.matrix03[1,2] + confusion.matrix03[2,1])/length(pred03)
}
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(avg.pe, caption = "Average of prediction errors of candidate models")
| PE1 | PE2 | PE3 |
|---|---|---|
| 0.1504274 | 0.1504274 | 0.1500855 |
The average predictive errors show that candidate models 1 and 2 have the same predictive error. Since model 2 is simpler than model 1, we choose model 2 as the final predictive model.
The actual accuracy of the final model is given by
pred02 = predict(candidate02, newdata = test, type="response")
confusion.matrix02 = ftable(as.vector((pred02>0.5)),(test$TenYearCHD=="1"))
accuracy = (confusion.matrix02[1,1] + confusion.matrix02[2,2])/length(pred02)
kable(accuracy, caption="The actual accuracy of the final model")
| x |
|---|
| 0.8442623 |
The analysis focused on predicting Ten Year Chronic Heart Disease. For illustrative purposes, we used three models as candidates and used cross-validation to select the final working model. The final model had an accuracy of 0.8510929 or 85.11%,