1 Introduction

heartStudy <- read.csv("https://raw.githubusercontent.com/GauravPadawe/Framingham-Heart-Study/master/framingham.csv")

heartStudy.0 = heartStudy    # make a copy of the data for data cleansing 
heartStudy = na.omit(heartStudy.0)       # Delete all records with missing components
y0=heartStudy$TenYearCHD

The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of free-living subjects in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects FHS Longitudinal Data Document. The study began in 1948 with 5,209 adult subjects, and is now on its third generation of participants.

Our data set includes 4240 subjects. After removing subjects with missing data for convenience, we are left with 3,658 subjects.

The variables in the data set are as follows:

  • male [cat]: the gender of the subjects

  • age [num]: Age of subject at time of medical examination in years.

  • Education [cat]: A categorical variable of the participants education, with the levels: Some high school(1), high school/GED(2), some college/vocational school(3), college(4)

  • currentSmoker [cat]: Whether or not the subject currently smokes at time of examinations

  • cigsPerDay [num]: Number of cigarettes the subject smokes per day

  • BPMeds [cat]: whether or not the subject takes anti-hypertensive medication at exam

  • prevalentStroke [cat]: Prevalent Stroke (0=free of disease)

  • prevalentHyp [cat]: Prevalent Hypertensive. Subject was defined as hypertensive if treated

  • diabetes [cat]: Diabetic according to criteria of first exam treated

  • totChol [num]: Total cholesterol (mg/dl)

  • sysBP [num]: Systolic Blood Pressure (mmHg)

  • diaBP [num]: Diastolic blood pressure (mmHg)

  • BMI [num]: body mass index, weight(kg)/height(m)^2

  • heartRate [num]: Heart Rate (beats/minute)

  • glucose [num]: Blood glucose level (mg/dL)

  • TenYearCHD(binary cat response variable): The 10 year risk of coronary heart disease(CHD)

Many studies have shown that variables like blood pressure, cholesterol, diabetes, smoking, BMI, and physical inactivity are good predictors of heart disease. The objective of this study is to explore the association between cholesterol and heart disease

According to Robert S Rosenson, a professor of Medicine,

  • A total cholesterol level of less than 200 mg/dL (5.17 mmol/L) is normal.
  • A total cholesterol level of 200 to 239 mg/dL (5.17 to 6.18 mmol/L) is borderline high.
  • A total cholesterol level of 240 mg/dL (6.21 mmol/L) or greater is high.

2 Research Question

The objective of this research is to build a logistic regression model to predict ten year heart disease risk using various risk factors associated with the individual patient.

3 Exploratory Data Analysis

hist(heartStudy$cigsPerDay)

hist(heartStudy$age)

pairs.panels(heartStudy[, 10:15], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

I first create histograms and pair plots for all quantitative variables. The only issue in the histograms is that cigsPerDay appears skewed to the right. To remedy this I will discretize cigsPerDay by creating bins:

cigs = heartStudy$cigsPerDay
grp.cigs = cigs
grp.cigs[cigs %in% c(0:0)] = "0"
grp.cigs[cigs %in% c(1:10)] = "1-10"
grp.cigs[cigs %in% c(11:20)] = "11-20"
grp.cigs[cigs %in% c(20:70)] = "20+"

heartStudy$grp.cigs = grp.cigs

3.1 Standardizing Numerical Predictor Variables

heartStudy$sd.totChol = (heartStudy$totChol-mean(heartStudy$totChol))/sd(heartStudy$totChol)
heartStudy$sd.sysBP = (heartStudy$sysBP-mean(heartStudy$sysBP))/sd(heartStudy$sysBP)
heartStudy$sd.diaBP = (heartStudy$diaBP-mean(heartStudy$diaBP))/sd(heartStudy$diaBP)
heartStudy$sd.BMI = (heartStudy$BMI-mean(heartStudy$BMI))/sd(heartStudy$BMI)
heartStudy$sd.heartRate = (heartStudy$heartRate-mean(heartStudy$heartRate))/sd(heartStudy$heartRate)
heartStudy$sd.glucose = (heartStudy$glucose-mean(heartStudy$glucose))/sd(heartStudy$glucose)
## drop the original variables except for the response variable
sd.heartStudy = heartStudy[c(1, 3, 4, 6, 7, 8, 9, 16, 17, 18, 19, 20, 21, 22, 23)]

Since this is a predictive model, we don’t worry about the interpretation of the coefficients. The objective is to identify a model that has the best predictive performance.

3.2 Data Split - Training and Test Data

We randomly split the data into two subsets. 70% of the data will be used as training data. We will use the training data to search the candidate models, validate them and identify the final model using the cross-validation method. The 30% of hold-up sample will be used for assessing the performance of the final model.

## splitting data: 80% training and 20% testing
n <- dim(sd.heartStudy)[1]
train.n <- round(0.8*n)
train.id <- sample(1:n, train.n, replace = FALSE)
## training and testing data sets
train <- sd.heartStudy[train.id, ]
test <- sd.heartStudy[-train.id, ]

3.3 Candidate Models

For convenience, we use 0.5 as the common cut-off for all three models to define the predicted. In a real application, you may want to find the optimal cut-off for each candidate models in the cross-validation process.

5-fold Cross-Validation Since our training data is relative small, I will use 5 fold cross validation to ensure the validation data set to have enough diabetes cases.

## 5-fold cross-validation
k=5
fold.size = round(dim(train)[1]/k)
## PE vectors for candidate models
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)
for(i in 1:k){
  ## Training and testing folds
  valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = train[valid.id, ]
  train.dat = train[-valid.id,]
  ##  full model
  candidate01 = glm(TenYearCHD ~grp.cigs + sd.totChol +sd.sysBP + sd.diaBP + sd.BMI + 
                    sd.heartRate + sd.glucose, family = binomial(link = "logit"),  
                    data = train.dat)  
## reduced model
  candidate03 = glm(diabetes ~ sd.totChol + sd.sysBP + sd.glucose, 
                    family = binomial(link = "logit"),  
                    data = train.dat) 
## stepAIC Model
   candidate02 = stepAIC(candidate01, 
                      scope = list(lower=formula(candidate03),upper=formula(candidate01)),
                      direction = "forward",   # forward selection
                      trace = 0                # do not show the details
                      )
  ##  predicted probabilities of each candidate model
   pred01 = predict(candidate01, newdata = valid, type="response")
   pred02 = predict(candidate02, newdata = valid, type="response")
   pred03 = predict(candidate03, newdata = valid, type="response")
  ## confusion matrix: ftable() will  
   confusion.matrix01 = ftable(as.vector((pred01>0.5)),(valid$TenYearCHD=="1"))
   confusion.matrix02 = ftable(as.vector((pred02>0.5)),(valid$TenYearCHD=="1"))
   confusion.matrix03 = ftable(as.vector((pred03>0.5)),(valid$TenYearCHD=="1"))
   PE1[i] = (confusion.matrix01[1,2] + confusion.matrix01[2,1])/length(pred01)
   PE2[i] = (confusion.matrix02[1,2] + confusion.matrix02[2,1])/length(pred02)
   PE3[i] = (confusion.matrix03[1,2] + confusion.matrix03[2,1])/length(pred03)
}
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(avg.pe, caption = "Average of prediction errors of candidate models")
Average of prediction errors of candidate models
PE1 PE2 PE3
0.1504274 0.1504274 0.1500855

The average predictive errors show that candidate models 1 and 2 have the same predictive error. Since model 2 is simpler than model 1, we choose model 2 as the final predictive model.

The actual accuracy of the final model is given by

pred02 = predict(candidate02, newdata = test, type="response")
confusion.matrix02 = ftable(as.vector((pred02>0.5)),(test$TenYearCHD=="1"))
accuracy = (confusion.matrix02[1,1] + confusion.matrix02[2,2])/length(pred02)
kable(accuracy, caption="The actual accuracy of the final model")
The actual accuracy of the final model
x
0.8442623

3.4 Summary

The analysis focused on predicting Ten Year Chronic Heart Disease. For illustrative purposes, we used three models as candidates and used cross-validation to select the final working model. The final model had an accuracy of 0.8510929 or 85.11%,