1 Introduction

Coronary heart disease is a general disease where the arteries of the heart cannot deliver enough oxygen-rich blood to the heart. Coronary artery disease is a common symptoms, including chest pain and shortness of breath. This project is going to examine relationship between Coronary heart disease and potential factors. Potential risk factors will be test by using 3 models, including Logistic Regression Model, Neural Network Model, and Decision Tree Model.

2 Description of Data

The data source is from a long-term prospective study. The dataset contains 4240 observations and 16 variables. The description of the variables are as below:

sex: the gender of the observations. The variable is a binary named “male” in the dataset.

age: Age at the time of medical examination in years.

education: A categorical variable of the participants education, with the levels: Some high school (1), high school/GED (2), some college/vocational school (3), college (4)

currentSmoker: Current cigarette smoking at the time of examinations

cigsPerDay: Number of cigarettes smoked each day

BPmeds: Use of Anti-hypertensive medication at exam

prevalentStroke: Prevalent Stroke (0 = free of disease)

prevalentHyp: Prevalent Hypertensive. Subject was defined as hypertensive if treated

diabetes: Diabetic according to criteria of first exam treated

totChol: Total cholesterol (mg/dL)

sysBP: Systolic Blood Pressure (mmHg)

diaBP: Diastolic blood pressure (mmHg)

BMI: Body Mass Index, weight (kg)/height (m)^2

heartRate: Heart rate (beats/minute)

glucose: Blood glucose level (mg/dL)

TenYearCHD(response variable): The 10 year risk of coronary heart disease(CHD).

The data path is stored as below: “https://raw.githubusercontent.com/GUANTSERN-KUO/STA551/main/Week1/FraminghamHeartStudy.csv

heatdata = read.csv("https://raw.githubusercontent.com/GUANTSERN-KUO/STA551/main/Week1/FraminghamHeartStudy.csv")

2.1 Dataset Information and Detected Outlier

There are 16 variables in the heart data, including 7 binary variables, 8 numeric variables and 1 category variable (education). TenYearCHD variable is a main interesting response variable. Refer to below table, the most missing value appears in glucose variable, which contains (388 missing observations) with approximately 9% of missing proportion. Other missing values appear in variables of education, cigsPerDay, BPMeds, totChol and BMI with missing proportion no exceeds 3%.

For totChol variable, there is one potential outlier value (totChol=696). sysBP variable has an abnormal value (sysBP=295). heartRate variable has potential outlier value (heartRate=143). glucose is equal to 394. All of these value should be further checked.

summary(heatdata)
##       male             age          education     currentSmoker   
##  Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :49.00   Median :2.000   Median :0.0000  
##  Mean   :0.4292   Mean   :49.58   Mean   :1.979   Mean   :0.4941  
##  3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000  
##                                   NA's   :105                     
##    cigsPerDay         BPMeds        prevalentStroke     prevalentHyp   
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   : 9.006   Mean   :0.02962   Mean   :0.005896   Mean   :0.3106  
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :70.000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
##  NA's   :29       NA's   :53                                           
##     diabetes          totChol          sysBP           diaBP      
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.0  
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0  
##  Median :0.00000   Median :234.0   Median :128.0   Median : 82.0  
##  Mean   :0.02571   Mean   :236.7   Mean   :132.4   Mean   : 82.9  
##  3rd Qu.:0.00000   3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0  
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.5  
##                    NA's   :50                                     
##       BMI          heartRate         glucose         TenYearCHD    
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Min.   :0.0000  
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00   1st Qu.:0.0000  
##  Median :25.40   Median : 75.00   Median : 78.00   Median :0.0000  
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.96   Mean   :0.1519  
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00   3rd Qu.:0.0000  
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000  
##  NA's   :19      NA's   :1        NA's   :388
## glucose value proportion
prop.table(table(heatdata$glucose, exclude=NULL))
## 
##           40           43           44           45           47           48 
## 0.0004716981 0.0002358491 0.0004716981 0.0009433962 0.0007075472 0.0002358491 
##           50           52           53           54           55           56 
## 0.0007075472 0.0004716981 0.0011792453 0.0011792453 0.0030660377 0.0011792453 
##           57           58           59           60           61           62 
## 0.0049528302 0.0047169811 0.0025943396 0.0148584906 0.0051886792 0.0101415094 
##           63           64           65           66           67           68 
## 0.0148584906 0.0108490566 0.0200471698 0.0143867925 0.0252358491 0.0219339623 
##           69           70           71           72           73           74 
## 0.0153301887 0.0358490566 0.0181603774 0.0254716981 0.0367924528 0.0332547170 
##           75           76           77           78           79           80 
## 0.0455188679 0.0299528302 0.0393867925 0.0349056604 0.0226415094 0.0360849057 
##           81           82           83           84           85           86 
## 0.0139150943 0.0235849057 0.0356132075 0.0252358491 0.0299528302 0.0146226415 
##           87           88           89           90           91           92 
## 0.0268867925 0.0174528302 0.0080188679 0.0191037736 0.0068396226 0.0089622642 
##           93           94           95           96           97           98 
## 0.0141509434 0.0089622642 0.0113207547 0.0063679245 0.0077830189 0.0056603774 
##           99          100          101          102          103          104 
## 0.0049528302 0.0113207547 0.0009433962 0.0044811321 0.0075471698 0.0033018868 
##          105          106          107          108          109          110 
## 0.0028301887 0.0023584906 0.0035377358 0.0033018868 0.0007075472 0.0021226415 
##          111          112          113          114          115          116 
## 0.0004716981 0.0025943396 0.0030660377 0.0011792453 0.0035377358 0.0011792453 
##          117          118          119          120          121          122 
## 0.0023584906 0.0023584906 0.0004716981 0.0021226415 0.0004716981 0.0007075472 
##          123          124          125          126          127          129 
## 0.0011792453 0.0004716981 0.0002358491 0.0009433962 0.0009433962 0.0002358491 
##          130          131          132          135          136          137 
## 0.0004716981 0.0002358491 0.0007075472 0.0004716981 0.0004716981 0.0009433962 
##          140          142          143          144          145          147 
## 0.0009433962 0.0002358491 0.0002358491 0.0002358491 0.0004716981 0.0002358491 
##          148          150          155          156          160          163 
## 0.0002358491 0.0004716981 0.0002358491 0.0002358491 0.0002358491 0.0002358491 
##          166          167          170          172          173          177 
## 0.0002358491 0.0002358491 0.0004716981 0.0002358491 0.0004716981 0.0002358491 
##          183          186          191          193          202          205 
## 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0002358491 
##          206          207          210          215          216          223 
## 0.0004716981 0.0002358491 0.0002358491 0.0004716981 0.0002358491 0.0002358491 
##          225          235          244          248          250          254 
## 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0002358491 
##          255          256          260          268          270          274 
## 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0002358491 
##          292          294          297          320          325          332 
## 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0002358491 
##          348          368          370          386          394         <NA> 
## 0.0002358491 0.0002358491 0.0002358491 0.0002358491 0.0004716981 0.0915094340

3 Missing Values - Imputation

There are several variables have missing values as indicated above. Since glucose has most of missing values, our strategy is to impute glucose variable. For those variables with missing rate below 3%, we will not perform imputation at this time.

To impute the glucose variable, variables with high linearly correlation will be needed. Based on the data, sysBP variable is the most linearly correlated with the glucose variable in heart data. The correlation is about 0.14. Unfortunately, 0.14 is not high enough to be linearly correlated with the glucose variable. So we decide to choose mean glucose value to impute the the glucose missing value, rather than using linear imputation.

## create correlationplot
par(mfrow = c(1,2))
plot(heatdata$glucose, heatdata$sysBP, xlab = "Glucose Level", ylab = "sysBP Level")
plot(heatdata$glucose, log(heatdata$sysBP), xlab = "Glucose Level", ylab = "log sysBP Level")

## correlation with glucose and sysBP
glu.sysbp.corr <- cor.test(heatdata$glucose,  heatdata$sysBP, 
                    method = "pearson")
glu.sysbp.corr$estimate
##       cor 
## 0.1405728

The missing glucose values are imputed by using the mean glucose value. Now, We are visually comparing the distributions between the original glucose and the imputed glucose.

glucose.mean <- mean(heatdata$glucose, na.rm=TRUE)
glucose.mean
## [1] 81.96366
## missing glucose values are imputed by using the mean glucose value
impute.glucose = heatdata$glucose
n=length(impute.glucose)
for (i in 1:n){
  if (is.na(impute.glucose[i]) == TRUE ) impute.glucose[i] = glucose.mean
}

heatdata$impute.glucose = impute.glucose

As per below density curves, distributions of the imputed glucose and original glucose levels are close to each other.

den.tri = density(na.omit(heatdata$glucose))
den.imput.tri = density(na.omit(heatdata$impute.glucose))


plot(den.imput.tri, col = "red", xlab = "glucose", ylab = "density", main = "original glucose vs imputed glucose")
lines(den.tri, col = "blue")
legend("topright", c("Inputed glucose", "Original glucose"), col=c("red", "blue"), lty =rep(1,2), bty="n", cex = 0.8)

heatdata02 = na.omit(heatdata[, c(
  "male", "age", "education", "currentSmoker"
      , "cigsPerDay" , "BPMeds" ,"prevalentStroke", "prevalentHyp" , "diabetes",   "totChol","sysBP" , "diaBP" ,"BMI","heartRate", "TenYearCHD" ,"impute.glucose"
  
)])

After completing imputation, we keep the all original variables and the new imputed variable (impute.glucose variable). Then we also delete all missing observation records.

4 Exploratory Data Analysis

4.1 General Objectives of EDA

Exploratory Data Analysis (EDA) is an way that is used to analyze the data and find trends, patterns, or further check assumptions in data. EDA has many tools to detect classical statistics, including graphics, histograms and scatter plots, etc.

4.2 Assess Distributions

In order to further perform modeling and otput analysis, we will re-organize and covert current numeric values to character values. There are 12 numeric type of variables in heart data, these variables need to be converted into character type. One of variable - blood pressure is converted according to below reference. (https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-022-02407-z).

heatdata02$grp.TenYearCHD <- ifelse(heatdata02$TenYearCHD < 1, 'NO','YES')

heatdata02$grp.male <- ifelse(heatdata02$male < 1, 'NO','YES')

heatdata02$grp.currentSmoker <- ifelse(heatdata02$currentSmoker < 1, 'NO','YES')

heatdata02$grp.BPMeds <- ifelse(heatdata02$BPMeds < 1, 'NO','YES')

heatdata02$grp.prevalentStroke <- ifelse(heatdata02$prevalentStroke < 1, 'NO','YES')

heatdata02$grp.prevalentHyp <- ifelse(heatdata02$prevalentHyp < 1, 'NO','YES')
heatdata02$grp.diabetes <- ifelse(heatdata02$diabetes < 1, 'NO','YES')



heatdata02$grp.education <- 
               ifelse(heatdata02$education == 1, 'Some high school',
               ifelse(heatdata02$education == 2, 'high school/GED',
               ifelse(heatdata02$education == 3, 'some college/vocational school', 
               ifelse(heatdata02$education == 4, 'college', 'Missing'
                      
                  ))))


heatdata02$grp.impute.glucose <- 
              ifelse(heatdata02$impute.glucose < 117, '(0, 117)',
               ifelse(heatdata02$impute.glucose > 137, '> 137', '[117,137]'))


heatdata02$grp.diaBP <- ifelse(heatdata02$diaBP < 80, '(0, 80)',
               ifelse(heatdata02$diaBP > 90, '> 90', '[80,90]'))

heatdata02$grp.sysBP <- ifelse(heatdata02$sysBP < 110, '(0, 110)',
               ifelse(heatdata02$sysBP > 120, '> 120', '[110,120]'))

heatdata02$grp.age <- ifelse(heatdata02$age <= 44, '[21, 44]',
               ifelse(heatdata02$age >= 65, '65+', '[45, 64]'))
## dataset  ready to do correlation analysis for only numeric variables

heatdata02.NO.num = (heatdata02[, c(
  "grp.male", "age", "grp.education", "grp.currentSmoker"
      , "cigsPerDay" , "grp.BPMeds" ,"grp.prevalentStroke", 
      "grp.prevalentHyp" , "grp.diabetes",   "totChol","sysBP" , "diaBP" ,"BMI","heartRate", "grp.TenYearCHD" ,"impute.glucose"
  
)])

4.3 Correlations Between Numerical Variables

cor(heatdata02.NO.num[sapply(heatdata02.NO.num,is.numeric)])
##                         age  cigsPerDay     totChol       sysBP       diaBP
## age             1.000000000 -0.19060031  0.26979872  0.38981443  0.20617145
## cigsPerDay     -0.190600311  1.00000000 -0.02640377 -0.08794251 -0.05126527
## totChol         0.269798723 -0.02640377  1.00000000  0.21453561  0.17153720
## sysBP           0.389814426 -0.08794251  0.21453561  1.00000000  0.78551069
## diaBP           0.206171445 -0.05126527  0.17153720  0.78551069  1.00000000
## BMI             0.134672699 -0.08783416  0.12319550  0.32963884  0.38327303
## heartRate      -0.008337091  0.06802934  0.08837479  0.18706802  0.18184920
## impute.glucose  0.113684393 -0.05146883  0.04782736  0.12975091  0.06130097
##                        BMI    heartRate impute.glucose
## age             0.13467270 -0.008337091     0.11368439
## cigsPerDay     -0.08783416  0.068029339    -0.05146883
## totChol         0.12319550  0.088374792     0.04782736
## sysBP           0.32963884  0.187068018     0.12975091
## diaBP           0.38327303  0.181849202     0.06130097
## BMI             1.00000000  0.073339162     0.07985079
## heartRate       0.07333916  1.000000000     0.09220655
## impute.glucose  0.07985079  0.092206546     1.00000000

As per above table, there is a strong positive linear relationship between sysBP and diaBP. Despite relationship between sysBP and diaBP, there is no strong linear relationship in between other variables.

5 Multiple Logistic Regression Model

The main interested of model is to examine how dose the 10 year risk of coronary heart disease relate with predictor variables. We built an initial model(full model) to further interpret result. The initial model includes all predictor variables and one response variable (The 10 year risk of coronary heart disease(CHD)) from heart data.

heatdata02 = read.csv("https://raw.githubusercontent.com/GUANTSERN-KUO/STA551/main/Week1/heartdata02.csv")
initial.model = glm(TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI + heartRate + impute.glucose , family = binomial, data = heatdata02)
coefficient.table = summary(initial.model)$coef
kable(coefficient.table, caption = "Significance tests of logistic regression model")
Significance tests of logistic regression model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.3454750 0.6918155 -12.0631509 0.0000000
male 0.5164847 0.1046483 4.9354316 0.0000008
age 0.0650880 0.0064772 10.0488215 0.0000000
education -0.0157951 0.0474442 -0.3329186 0.7391958
currentSmoker 0.0413487 0.1506534 0.2744626 0.7837291
cigsPerDay 0.0205992 0.0059561 3.4584842 0.0005432
BPMeds 0.2402326 0.2273870 1.0564924 0.2907433
prevalentStroke 0.6856163 0.4841031 1.4162608 0.1566992
prevalentHyp 0.2185459 0.1329060 1.6443644 0.1001010
diabetes 0.0483318 0.3125623 0.1546311 0.8771122
totChol 0.0016982 0.0010913 1.5560619 0.1196934
sysBP 0.0144790 0.0036808 3.9336386 0.0000837
diaBP -0.0022472 0.0062500 -0.3595542 0.7191806
BMI 0.0024734 0.0122931 0.2012036 0.8405394
heartRate -0.0019700 0.0040068 -0.4916505 0.6229664
impute.glucose 0.0072268 0.0022309 3.2393858 0.0011979

Above the significance test table, p-values of male, age, cigsPerDay, sysBP and impute.glucose are below 0.05. We will use Automatic Variable Selection (choosing backward methods) to drop some variables which are most insignificant predictor with 10 year risk of coronary heart disease from the full model.

Additional key step is building reduce model. The reduce model includes key variables in the model regardless statistical significance are not strong. The key variables that used in model will be generally provided by Sponsor/Expert. In our reduce model, sysBP and impute.glucose are included. Lastly, we build a final model, the final model variables are selected between the reduce model (smallest numbers variable )and the full model (biggest numbers variable). Below is Summary table of significant tests from final model.

full.model = initial.model  # the *biggest model* that includes all predictor variables
reduced.model = glm(TenYearCHD ~ sysBP + impute.glucose , family = binomial, data = heatdata02)
final.model =  step(full.model, 
                    scope=list(lower=formula(reduced.model),upper=formula(full.model)),
                    data = heatdata02, 
                    direction = "backward",
                    trace = 0)   # trace = 0: suppress the detailed selection process
final.model.coef = summary(final.model)$coef
kable(final.model.coef , caption = "Summary table of significant tests")
Summary table of significant tests
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.6233038 0.5029993 -17.143769 0.0000000
male 0.5140215 0.1025868 5.010602 0.0000005
age 0.0660943 0.0062364 10.598143 0.0000000
cigsPerDay 0.0215367 0.0040051 5.377342 0.0000001
prevalentStroke 0.7543964 0.4791567 1.574425 0.1153892
prevalentHyp 0.2200455 0.1299295 1.693576 0.0903458
totChol 0.0016827 0.0010873 1.547618 0.1217143
sysBP 0.0139632 0.0027611 5.057153 0.0000004
impute.glucose 0.0074305 0.0016711 4.446421 0.0000087

5.1 Interpretation - Association Analysis

Refer to above table, all predictor variables are positively associated with the response variable. prevalentStroke, prevalentHyp and totChol does not achieve statistical significance with a result of p-value greater than 0.05.

male and prevalentStroke are two variables that are most positively associated with the response variable. Male and people diagnosis with stroke are more likely to have 10 year risk of coronary heart disease(CHD).

5.2 Predictive Analysis

Now the final model is built, and it includes variables of male, age, cigsPerDay, prevalentStroke, prevalentHyp, totChol, sysBP, and impute.glucose. We are going to compare response result from heart data with predicted result generated from final model. Refer to below table, values in the first row are entered into the model, and predicted result is 0. This predicted result is same as response result (TenYearCHD) from heart data. Values in the second row are randomly selected and entered into the model. The predicted result is 0.

mynewdata = data.frame(male =c(1, 0),
                       age = c(39, 33),
                       cigsPerDay = c(0,1),
                       prevalentStroke = c(0, 1),
                       prevalentHyp = c(0,1),
                       totChol = c(195,160),
                       sysBP = c(106, 110),
                       impute.glucose = c(77, 85))
pred.success.prob = predict(final.model, newdata = mynewdata, type="response")
##
## threshold probability
cut.off.prob = 0.5
pred.response = ifelse(pred.success.prob > cut.off.prob, 1, 0)  # This predicts the response
## Add the new predicted response to Mynewdata
mynewdata$Pred.Response = pred.response
##
kable(mynewdata, caption = "Predicted Value of response variable 
      with the given cut-off probability")
Predicted Value of response variable with the given cut-off probability
male age cigsPerDay prevalentStroke prevalentHyp totChol sysBP impute.glucose Pred.Response
1 39 0 0 0 195 106 77 0
0 33 1 1 1 160 110 85 0
## be ready to find optimal model dataset
heartdata03 = (heatdata02[, c(
"male", "age", "cigsPerDay", "prevalentStroke"
      , "prevalentHyp" , "totChol" ,"sysBP", "impute.glucose", "TenYearCHD" )])
## read heartdata03 file
heartdata03 = read.csv("https://raw.githubusercontent.com/GUANTSERN-KUO/STA551/main/Week1/heartdata03.csv")

5.3 Randomly Splitting for Cross-validation

Random splitting: 70% for training (candidate models) and 30% for testing (the final model). The training set is splited into 5 smaller sets. Any of one of 5 smaller set is considered as validation data, and rest of 4 sets are considered as training data.

nn = dim(heartdata03)[1]
train.id = sample(1:nn, round(nn*0.7), replace = FALSE) 
training = heartdata03[train.id,]
testing = heartdata03[-train.id,]

5.4 Finding Optimal Cut-off Probability

Below two candidate models are built - Training model and newdata (as known as validation model). pred.prob will be generated by the training model and new data model. A cut off point 20 will be used to compared with the pred.prob. If pred.prob is larger than the cut 0ff point, pred.status in the validation model will be presented as 1. pred.status will then be compared with TenYearCHD from the training model to further support accuracy calculation.

Refer to the figure output (5-fold CV performance), the highest accuracy occurs at cut off probability of 0.52.

n0 = dim(training)[1]/5
cut.0ff.prob = seq(0,1, length = 22)[-c(1,22)]        # candidate cut off prob
pred.accuracy = matrix(0,ncol=20, nrow=5, byrow = T)  # null vector for storing prediction accuracy
## 5-fold CV
for (i in 1:5){
  valid.id = ((i-1)*n0 + 1):(i*n0)
  valid.data = training[valid.id,]
  train.data = training[-valid.id,]
  train.model = glm(TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp + totChol + sysBP + impute.glucose     , family = binomial(link = logit), data = train.data)
  
  
  newdata = data.frame(male= valid.data$male, age = valid.data$age, cigsPerDay = valid.data$cigsPerDay, prevalentStroke= valid.data$prevalentStroke, prevalentHyp= valid.data$prevalentHyp, totChol= valid.data$totChol, sysBP= valid.data$sysBP, impute.glucose = valid.data$impute.glucose      )
  
  
  pred.prob = predict.glm(train.model, newdata, type = "response")
  # define confusion matrix and accuracy
  for(j in 1:20){
    pred.status = rep(0,length(pred.prob))
    valid.data$pred.status = as.numeric(pred.prob >cut.0ff.prob[j])
    a11 = sum(valid.data$pred.status == valid.data$TenYearCHD)
    pred.accuracy[i,j] = a11/length(pred.prob)
  }
}
###  
avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))
### visual representation
tick.label = as.character(round(cut.0ff.prob,2))
plot(1:20, avg.accuracy, type = "b",
     xlim=c(1,20), 
     ylim=c(0.5,1), 
     axes = FALSE,
     xlab = "Cut-off Probability",
     ylab = "Accuracy",
     main = "5-fold CV performance"
     )
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "red")
text(max.id, avg.accuracy[max.id]+0.05, as.character(round(avg.accuracy[max.id],4)), col = "red", cex = 0.8)
Figure 5-fold CV performance plot

Figure 5-fold CV performance plot

5.5 Reporting Test

Below two models - test.model(from training) and newdata (testing data) will be used to find the regression coefficients and then use the highest cut off probability 0.52 to find the accuracy. The accuracy is 0.8746867.

test.model = glm(TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp + totChol + sysBP + impute.glucose, family = binomial(link = logit), data = training)

newdata = data.frame(male= testing$male, age = testing$age, cigsPerDay = testing$cigsPerDay, prevalentStroke= testing$prevalentStroke, prevalentHyp= testing$prevalentHyp, totChol= testing$totChol, sysBP= testing$sysBP, impute.glucose = testing$impute.glucose)

pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.52)
a11 = sum(testing$test.status == testing$TenYearCHD)
test.accuracy = a11/length(pred.prob.test)
kable(as.data.frame(test.accuracy), align='c')
test.accuracy
0.8512949

5.6 Local Performance Meaures

Since the optimal cut-off probability is 0.52, Next, we are going to use the testing data set to report the local measures.

test.model = glm(TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp + totChol + sysBP + impute.glucose, family = binomial(link = logit), data = training)

newdata = data.frame(male= testing$male, age = testing$age, cigsPerDay = testing$cigsPerDay, prevalentStroke= testing$prevalentStroke, prevalentHyp= testing$prevalentHyp, totChol= testing$totChol, sysBP= testing$sysBP, impute.glucose = testing$impute.glucose)

pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.52)
### components for defining various measures
TN = sum(testing$test.status ==0 & testing$TenYearCHD==0)
FN = sum(testing$test.status ==0 & testing$TenYearCHD ==1)
FP = sum(testing$test.status ==1 & testing$TenYearCHD==0)
TP = sum(testing$test.status ==1 & testing$TenYearCHD
         ==1)
###
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
###
precision = TP / (TP + FP)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = cbind(sensitivity = sensitivity, 
                    specificity = specificity, 
                    precision = precision,
                    recall = recall,
                    F1 = F1)
kable(as.data.frame(metric.list), align='c', caption = "Local performance metrics")
Local performance metrics
sensitivity specificity precision recall F1
0.0698925 0.9950544 0.7222222 0.0698925 0.127451

5.7 Global Measure: ROC and AUC

The more that the ROC curve hugs the top left corner of the plot, the better the model does at classifying the data into categories. To quantify this, we can calculate the AUC (area under the curve) which tells us how much of the plot is located under the curve. The closer AUC is to 1, the better the model.

Sensitivity (True Positive Rate, Recall) is defined as the probability of those who received a positive result on this test out of those who actually have a 10 year risk of coronary heart disease.

Specificity (True Negative Rate) is defined as the probability of those who received a negative result on this test out of those who do not actually have the 10 year risk of coronary heart disease.

Refer to below output, X-axis is 1 - specificity, y-axis is sensitivity. AUC result is around 0.77, which closes to 1. This AUC indicates the better model.

cut.off.seq = seq(0, 1, length = 100)
sensitivity.vec = NULL
specificity.vec = NULL
### 

training.model = glm(TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp + totChol + sysBP + impute.glucose, family = binomial(link = logit), data = training)

newdata = data.frame(male= training$male, age = training$age, cigsPerDay = training$cigsPerDay, prevalentStroke= training$prevalentStroke, prevalentHyp= training$prevalentHyp, totChol= training$totChol, sysBP= training$sysBP, impute.glucose = training$impute.glucose)

pred.prob.train = predict.glm(training.model, newdata, type = "response")
for (i in 1:100){
  training$train.status = as.numeric(pred.prob.train > cut.off.seq[i])
### components for defining various measures
TN = sum(training$train.status == 0 & training$TenYearCHD == 0)
FN = sum(training$train.status == 0 & training$TenYearCHD == 1)
FP = sum(training$train.status == 1 & training$TenYearCHD == 0)
TP = sum(training$train.status == 1 & training$TenYearCHD == 1)
###
sensitivity.vec[i] = TP / (TP + FN)
specificity.vec[i] = TN / (TN + FP)
}
one.minus.spec = 1 - specificity.vec
sens.vec = sensitivity.vec
## A better approx of ROC, need library {pROC}
  prediction = pred.prob.train
  category = training$TenYearCHD == 1
  ROCobj <- roc(category, prediction)
  AUC = round(auc(ROCobj),4)
##

logistic.model <- list(sensitivity.vec= sensitivity.vec, specificity.vec = specificity.vec,  AUC = round(AUC,3))

Calculating AUC value in reduce model

cut.off.seq = seq(0, 1, length = 100)
sensitivity.vec = NULL
specificity.vec = NULL
### 

training.model = glm(TenYearCHD ~ sysBP + impute.glucose, family = binomial(link = logit), data = training)

newdata = data.frame(sysBP= training$sysBP, impute.glucose = training$impute.glucose)

pred.prob.train = predict.glm(training.model, newdata, type = "response")
for (i in 1:100){
  training$train.status = as.numeric(pred.prob.train > cut.off.seq[i])
### components for defining various measures
TN = sum(training$train.status == 0 & training$TenYearCHD == 0)
FN = sum(training$train.status == 0 & training$TenYearCHD == 1)
FP = sum(training$train.status == 1 & training$TenYearCHD == 0)
TP = sum(training$train.status == 1 & training$TenYearCHD == 1)
###
sensitivity.vec[i] = TP / (TP + FN)
specificity.vec[i] = TN / (TN + FP)
}
one.minus.spec = 1 - specificity.vec
sens.vec = sensitivity.vec
## A better approx of ROC, need library {pROC}
  prediction = pred.prob.train
  category = training$TenYearCHD == 1
  ROCobj <- roc(category, prediction)
  AUC = round(auc(ROCobj),4)
##
  
reduced.model.auc <- list(sensitivity.vec= sensitivity.vec, specificity.vec = specificity.vec,  AUC = round(AUC,3))

Calculating AUC value in full model

heartdata02 = read.csv("https://raw.githubusercontent.com/GUANTSERN-KUO/STA551/main/Week1/heartdata02.csv")
nn = dim(heartdata02)[1]
train.id = sample(1:nn, round(nn*0.7), replace = FALSE) 
training = heartdata02[train.id,]
testing = heartdata02[-train.id,]


cut.off.seq = seq(0, 1, length = 100)
sensitivity.vec = NULL
specificity.vec = NULL
### 
training.model = glm(TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI + heartRate + impute.glucose , family = binomial, data = training)


newdata = data.frame(male= training$male, age = training$age, education = training$education, currentSmoker= training$currentSmoker, cigsPerDay= training$cigsPerDay, BPMeds= training$BPMeds, prevalentStroke= training$prevalentStroke, prevalentHyp= training$prevalentHyp, diabetes= training$diabetes, totChol= training$totChol, sysBP= training$sysBP, diaBP= training$diaBP, BMI= training$BMI, heartRate= training$heartRate, impute.glucose = training$impute.glucose)

pred.prob.train = predict.glm(training.model, newdata, type = "response")
for (i in 1:100){
  training$train.status = as.numeric(pred.prob.train > cut.off.seq[i])
### components for defining various measures
TN = sum(training$train.status == 0 & training$TenYearCHD == 0)
FN = sum(training$train.status == 0 & training$TenYearCHD == 1)
FP = sum(training$train.status == 1 & training$TenYearCHD == 0)
TP = sum(training$train.status == 1 & training$TenYearCHD == 1)
###
sensitivity.vec[i] = TP / (TP + FN)
specificity.vec[i] = TN / (TN + FP)
}
one.minus.spec = 1 - specificity.vec
sens.vec = sensitivity.vec
## A better approx of ROC, need library {pROC}
  prediction = pred.prob.train
  category = training$TenYearCHD == 1
  ROCobj <- roc(category, prediction)
  AUC = round(auc(ROCobj),4)
##
  
full.model.auc <- list(sensitivity.vec= sensitivity.vec, specificity.vec = specificity.vec,  AUC = round(AUC,3))
par(pty="s")      # set up square plot through graphic parameter
plot(1-logistic.model$specificity.vec, logistic.model$sensitivity.vec, type = "l", xlim=c(0,1), ylim=c(0,1), 
     xlab="1 - specificity", ylab="Sensitivity", col = "blue", lwd = 2,
     main="ROC Curves based on different model", cex.main = 0.9, col.main = "navy")
abline(0,1, lty = 2, col = "orchid4", lwd = 2)



lines(1-reduced.model.auc$specificity.vec, reduced.model.auc$sensitivity.vec, col = "red", lwd = 2, lty=2)

lines(1-full.model.auc$specificity.vec, full.model.auc$sensitivity.vec, col = "olivedrab", lwd = 2)



legend("bottomright", c(paste("logistic model, AUC =",logistic.model$AUC), 
                        paste("reduce model, AUC =",reduced.model.auc$AUC),
                        paste("full model, AUC =",full.model.auc$AUC)
                        ), 
                       
                     
                                           col=c("blue","red","olivedrab"), 
                        lty=rep(1,6), lwd=rep(2,6), cex = 0.5, bty = "n")
Figure Comparison of ROC curves

Figure Comparison of ROC curves

By compare logistic model (final model), reduced model and full model, the logistic model (final model) has the highest AUC, hence, logistic model (final model) is the best model among 3 models.

6 Implementing Neural Network Models

## write heartdata02 csv file from orginal heatdata02 data. 
## It is more efficient to read data from section 5 every time.

write.csv(x = heatdata02, file = "heartdata02.csv", row.names = FALSE)

6.1 Feature Conversion and Numeric Feature Scaling

Before building neural network models, we should convert variable to numeric type and add required dummy variables. In addition, numeric type variables are supposed to be converted into scaling, which has no unit of measurement.

Below is scaling formula:

\[ scaled.var = \frac{orig.var - \min(orig.var)}{\max(orig.var)-\min(orig.var)} \]

heartdata02 = read.csv("https://raw.githubusercontent.com/GUANTSERN-KUO/STA551/main/Week1/heartdata02.csv")


## only choose character variable and certain numeric variable(cigsPerDay,totChol,BMI and heartRate ).
heartdata03 <- (heartdata02[, c(
  "grp.male", "grp.age", "grp.education", "grp.currentSmoker"
      , "cigsPerDay" , "grp.BPMeds" ,"grp.prevalentStroke", 
      "grp.prevalentHyp" , "grp.diabetes",   "totChol","grp.sysBP" , "grp.diaBP" ,"BMI","heartRate" ,"grp.impute.glucose" , "grp.TenYearCHD"
  
)])

heartdata03$cigsPerDay <- as.numeric(heartdata03$cigsPerDay)
heartdata03$totChol <- as.numeric(heartdata03$totChol)
heartdata03$heartRate <- as.numeric(heartdata03$heartRate)

heartdata03$BMI = (heartdata03$BMI-min(heartdata03$BMI))/(max(heartdata03$BMI)-min(heartdata03$BMI))

heartdata03$cigsPerDay = (heartdata03$cigsPerDay-min(heartdata03$cigsPerDay))/(max(heartdata03$cigsPerDay)-min(heartdata03$cigsPerDay))

heartdata03$totChol = (heartdata03$totChol-min(heartdata03$totChol))/(max(heartdata03$totChol)-min(heartdata03$totChol))

heartdata03$heartRate = (heartdata03$heartRate-min(heartdata03$heartRate))/(max(heartdata03$heartRate)-min(heartdata03$heartRate))

6.2 Extract All Feature Names

We use the R function called model.matrix to present required variable and dummy variable. It is automatic output to show factors to a set of dummy variables and expanding interactions similarly. Moreover, we are renaming the variables to avoid some naming issue from neural network models.

heartdataMtx = model.matrix(~ ., data = heartdata03)
colnames(heartdataMtx)
##  [1] "(Intercept)"                                
##  [2] "grp.maleYES"                                
##  [3] "grp.age[45, 64]"                            
##  [4] "grp.age65+"                                 
##  [5] "grp.educationhigh school/GED"               
##  [6] "grp.educationsome college/vocational school"
##  [7] "grp.educationSome high school"              
##  [8] "grp.currentSmokerYES"                       
##  [9] "cigsPerDay"                                 
## [10] "grp.BPMedsYES"                              
## [11] "grp.prevalentStrokeYES"                     
## [12] "grp.prevalentHypYES"                        
## [13] "grp.diabetesYES"                            
## [14] "totChol"                                    
## [15] "grp.sysBP[110,120]"                         
## [16] "grp.sysBP> 120"                             
## [17] "grp.diaBP[80,90]"                           
## [18] "grp.diaBP> 90"                              
## [19] "BMI"                                        
## [20] "heartRate"                                  
## [21] "grp.impute.glucose[117,137]"                
## [22] "grp.impute.glucose> 137"                    
## [23] "grp.TenYearCHDYES"
colnames(heartdataMtx)[2] <- "maleYES"
colnames(heartdataMtx)[3] <- "age45To64"
colnames(heartdataMtx)[4] <- "age65older"
colnames(heartdataMtx)[5] <- "highschoolGED"
colnames(heartdataMtx)[6] <- "college"
colnames(heartdataMtx)[7] <- "highschool"
colnames(heartdataMtx)[8] <- "SmokerYES"
colnames(heartdataMtx)[9] <- "cigsPerDay"
colnames(heartdataMtx)[10] <- "BPMedsYES"
colnames(heartdataMtx)[11] <- "StrokeYES"
colnames(heartdataMtx)[12] <- "HypYES"
colnames(heartdataMtx)[13] <- "diabetesYES"
colnames(heartdataMtx)[14] <- "totChol"
colnames(heartdataMtx)[15] <- "sysBP110To120"
colnames(heartdataMtx)[16] <- "sysBPgt120"
colnames(heartdataMtx)[17] <- "diaBP80To90"
colnames(heartdataMtx)[18] <- "diaBPgt90"
colnames(heartdataMtx)[19] <- "BMI"
colnames(heartdataMtx)[20] <- "heartRate"
colnames(heartdataMtx)[21] <- "glucose117To137"
colnames(heartdataMtx)[22] <- "glucosegt137"
colnames(heartdataMtx)[23] <- "TenYearCHDYES"

6.3 Define Model Formula

Please refer to below model formula.

columnNames = colnames(heartdataMtx)
#remove intercept
columnList = paste(columnNames[-c(1,length(columnNames))], collapse = "+")
#TenYearCHDYES for response variable
columnList = paste(c(columnNames[length(columnNames)],"~",columnList), collapse="")
modelFormula = formula(columnList)
modelFormula
## TenYearCHDYES ~ maleYES + age45To64 + age65older + highschoolGED + 
##     college + highschool + SmokerYES + cigsPerDay + BPMedsYES + 
##     StrokeYES + HypYES + diabetesYES + totChol + sysBP110To120 + 
##     sysBPgt120 + diaBP80To90 + diaBPgt90 + BMI + heartRate + 
##     glucose117To137 + glucosegt137

6.4 Training and Testing NN Model

For Random splitting, 40% for testing (testDat) and 60% for training (trainDat).In addition, building the NN Model by using the function neuralnet().

n = dim(heartdataMtx)[1]
testID = sample(1:n, round(n*0.6), replace = FALSE)
testDat = heartdataMtx[-testID,]
trainDat = heartdataMtx[testID,]
NetworkModel = neuralnet(modelFormula,
                         data = trainDat,
                         hidden = 1,               # single layer NN
                         rep = 1,                  # number of replicates in training NN
                         threshold = 0.02,         # threshold for the partial derivatives as stopping criteria.
                         learningrate = 0.2,       # user selected rate
                         algorithm = "rprop+"
                         )
kable(NetworkModel$result.matrix)
error 135.3708263
reached.threshold 0.0179098
steps 5957.0000000
Intercept.to.1layhid1 3.6917731
maleYES.to.1layhid1 -0.3983861
age45To64.to.1layhid1 -0.6887910
age65older.to.1layhid1 -1.2598333
highschoolGED.to.1layhid1 0.1681006
college.to.1layhid1 -0.0394863
highschool.to.1layhid1 -0.2216419
SmokerYES.to.1layhid1 -0.0764684
cigsPerDay.to.1layhid1 -0.5743887
BPMedsYES.to.1layhid1 -0.4702947
StrokeYES.to.1layhid1 -0.3588336
HypYES.to.1layhid1 -0.4449071
diabetesYES.to.1layhid1 0.0379054
totChol.to.1layhid1 -1.2398940
sysBP110To120.to.1layhid1 -0.1490999
sysBPgt120.to.1layhid1 -0.3025300
diaBP80To90.to.1layhid1 0.0864177
diaBPgt90.to.1layhid1 -0.0245953
BMI.to.1layhid1 0.0670563
heartRate.to.1layhid1 -0.6163167
glucose117To137.to.1layhid1 -0.3064244
glucosegt137.to.1layhid1 -1.2162804
Intercept.to.TenYearCHDYES 1.3744028
1layhid1.to.TenYearCHDYES -1.4208678
Single-layer backpropagation Neural network model for heart disease

Single-layer backpropagation Neural network model for heart disease

6.5 About Cross-validation in Neural Network

Cross-validation is a method that we evaluate a machine learning model and test its performance. The training set is split into 5 smaller sets. Any of one of 5 smaller set is considered as validation data, and rest of 4 sets are considered as training data.

Below two candidate models are built - Training model (from train.data) and validation data (from valid.data). Predict score will be generated by the train.data and valid.data. A cut off point 20 will be used to compared with the predict score(pred.nn.score). If predict score (pred.nn.score) is larger than the cut 0ff point, predict status(pred.status) will be presented as 1. Predict status(pred.status) will be compared with TenYearCHDYES from the validation data (from valid.data) to further support accuracy calculation.

Refer to the figure output (5-fold CV performance), the highest accuracy occurs at cut off score of 0.57.

n0 = dim(trainDat)[1]/5
cut.off.score = seq(0,1, length = 22)[-c(1,22)]        # candidate cut off prob
pred.accuracy = matrix(0,ncol=20, nrow=5, byrow = T)  
# null vector for storing prediction accuracy
##
for (i in 1:5){
  valid.id = ((i-1)*n0 + 1):(i*n0)
  valid.data = trainDat[valid.id,]
  train.data = trainDat[-valid.id,]
  ####
  train.model = neuralnet(modelFormula,
                         data = train.data,
                         hidden = 1,               # single layer NN
                         rep = 1,                  
                # number of replicates in training NN
                         threshold = 0.02,         
                # threshold for the partial derivatives as stopping criteria.
                         learningrate = 0.2,       # user selected rate
                         algorithm = "rprop+"
                         )
    pred.nn.score = predict(train.model, valid.data)
    for(j in 1:20){
    #pred.status = rep(0,length(pred.nn.score))
    pred.status = as.numeric(pred.nn.score > cut.off.score[j])
    a11 = sum(pred.status == valid.data[,23])
    pred.accuracy[i,j] = a11/length(pred.nn.score)
  }
}
###  
avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))
### visual representation
tick.label = as.character(round(cut.off.score,2))
plot(1:20, avg.accuracy, type = "b",
     xlim=c(1,20), 
     ylim=c(0.5,1), 
     axes = FALSE,
     xlab = "Cut-off Score",
     ylab = "Accuracy",
     main = "5-fold CV performance"
     )
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "red")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "red", cex = 0.8)

6.6 Testing Model Performance

Below two models - NetworkModel (from trainDat) and testing data (testDat) will be used with the highest cut off score 0.57 to find the accuracy. The accuracy is 0.8512949.

#Test the resulting output
nn.results <- predict(NetworkModel, testDat)
results <- data.frame(actual = testDat[,23], prediction = nn.results > .57)


confMatrix = table(results$prediction, results$actual)    # confusion matrix
accuracy=sum(results$actua == results$prediction)/length(results$prediction)
list(confusion.matrix = confMatrix, accuracy = accuracy)       
## $confusion.matrix
##        
##            0    1
##   FALSE 1355  228
##   TRUE     5    8
## 
## $accuracy
## [1] 0.85401

6.7 ROC Analysis

As mentioned in section 5.7, Sensitivity (True Positive Rate, Recall) is defined as the probability of those who received a positive result on this test out of those who actually have a 10 year risk of coronary heart disease.

Specificity (True Negative Rate) is defined as the probability of those who received a negative result on this test out of those who do not actually have the 10 year risk of coronary heart disease. Refer to below ROC Curve output, X-axis is 1 - specificity, y-axis is sensitivity.

Below two models - NetworkModel and training data (trainDat) will be used to predict the result.

Below code detail explanation

definition of “a”: Real status is getting the 10 year risk of coronary heart disease and predicting result is the same as real status.

definition of “d”: Real status is not getting the 10 year risk of coronary heart disease and predicting result is the same as real status.

definition of “b”: Real status is not getting the 10 year risk of coronary heart disease and predicting result is not the same as real status.

definition of “c”: Real status is getting the 10 year risk of coronary heart disease and predicting result is not the same as real status.

nn.results = predict(NetworkModel, trainDat)  # Keep in mind that trainDat is a matrix!
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
    a = sum(trainDat[,"TenYearCHDYES"] == 1 & (nn.results > cut0[i]))
    d = sum(trainDat[,"TenYearCHDYES"] == 0 & (nn.results < cut0[i]))
    b = sum(trainDat[,"TenYearCHDYES"] == 0 & (nn.results > cut0[i]))    
    c = sum(trainDat[,"TenYearCHDYES"] == 1 & (nn.results < cut0[i]))   
    sen = a/(a + c)
    spe = d/(b + d)
    SenSpe[,i] = c(sen, spe)
}
# plotting ROC
plot(1-SenSpe[2,], SenSpe[1,], type ="b", xlim=c(0,1), ylim=c(0,1),
     xlab = "1 - specificity", ylab = "Sensitivity", lty = 1,
     main = "ROC Curve", col = "blue")
abline(0,1, lty = 2, col = "red")

## Calculate AUC
xx = 1-SenSpe[2,]
yy = SenSpe[1,]
width = xx[-length(xx)] - xx[-1]
height = yy[-1]

## A better approx of ROC, need library {pROC}
  prediction = as.vector(nn.results)
  category = trainDat[,"TenYearCHDYES"] == 1
  ROCobj <- roc(category, prediction)
  AUC = auc(ROCobj)[1]
  ##
###
text(0.8, 0.3, paste("AUC = ", round(AUC,4)), col = "purple", cex = 0.9)


legend("bottomright", c("ROC of the model", "Random guessing"), lty=c(1,2),
       col = c("blue", "red"), bty = "n", cex = 0.8)
Figure ROC Curve of the neural network model.

Figure ROC Curve of the neural network model.

NN.model <- list(SenSpe= SenSpe, AUC = round(AUC,3))

Refer to above ROC curve, it is clear that the neural network is better than the random guessing as the area under the curve is larger than 0.5. The better model the closer to 1 AUC result. In general, if the area under the ROC curve is greater than 0.65, the predictive power of the underlying model is acceptable.

7 Decision Tree Algorithms

knitr:: include_graphics("tree.jpg")

Decision tree algorithms is easy to interpret and implement in real wold. It is utilized for both regression tasks and classification. It consists of a root node, branches, internal nodes and leaf nodes. This project aims to test the relationship of getting 10 year risk of coronary heart disease(CHD) and potential factors (e.g. age, education…ect.). The result of getting 10 year risk of coronary heart disease are presented as either Yes or No. Based on type of the result, Classification and Regression Tree (CART) will be used as the most appropriate method to predict the result.

The following diagram indicates the basic structure of a decision tree.

knitr:: include_graphics("node.png")

7.1 Decision Tree Growing - Impurity Measures

Decision tree growing is defined as an iterative process of splitting the main space into multiple sub-spaces based on certain criteria. Certain criteria is defined based on feature variables.

The two most frequently used impurity measures in decision tree induction are Gini index and entropy.This project aims to demonstrate the process of Gini Index and entropy in decision tree model.

7.1.1 Gini Index

Gini Index determines how the features of a dataset should split nodes to form the tree.

7.1.2 Entropy and Information Gain

Entropy calculates the typical amount of data required to categorize an example.

Information Gain indicates the reduction in uncertainty attained by splitting based on a feature.

7.2 Implementing Decision Tree models

heartdata02 = read.csv("https://raw.githubusercontent.com/GUANTSERN-KUO/STA551/main/Week1/heartdata02.csv")


## only choose character variable and certain numeric variable(cigsPerDay,totChol,BMI and heartRate ). 
## Same as section 5 but no Numeric Feature Scaling
heartdata04 <- (heartdata02[, c(
  "grp.male", "grp.age", "grp.education", "grp.currentSmoker"
      , "cigsPerDay" , "grp.BPMeds" ,"grp.prevalentStroke", 
      "grp.prevalentHyp" , "grp.diabetes",   "totChol","grp.sysBP" , "grp.diaBP" ,"BMI","heartRate" ,"grp.impute.glucose" , "grp.TenYearCHD"
  
)])

# We use a random split approach
n = dim(heartdata04)[1]  # sample size
# caution: using without replacement
train.id = sample(1:n, round(0.7*n), replace = FALSE)  
train = heartdata04[train.id, ]    # training data
test = heartdata04[-train.id, ]    # testing data
# arguments to pass into rpart():
# 1. data set (training /testing); 
# 2. Penalty coefficients
# 3. Impurity measure
## 
tree.builder = function(in.data, fp, fn, purity){
   tree = rpart(grp.TenYearCHD ~ .,                # including all features
                data = in.data, 
                na.action  = na.rpart,       # By default, deleted if the outcome is missing, 
                                             # kept if predictors are missing
                method = "class",            # Classification form factor
                model  = FALSE,
                x = FALSE,
                y = TRUE,
            parms = list
            ( # loss matrix. Penalize false positives or negatives more heavily
                         loss = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),   
                         split = purity),          # "gini" or "information"
             ## rpart algorithm options (These are defaults)
             control = rpart.control(
                        minsplit = 10,  # minimum number of observations required before split
                        minbucket= 10,  # minimum number of observations in any terminal node, default = minsplit/3
                        cp  = 0.01,  # complexity parameter for stopping rule, 0.02 -> small tree 
                       xval = 10     # number of cross-validation )
                        )
             )
  }

Four different decision tree models are defined based on above functions.

Model 1: gini.tree.11 is based on the Gini index without penalizing false positives and false negatives.

Model 2: info.tree.11 is based on entropy without penalizing false positives and false negatives.

Model 3: gini.tree.110 is based on the Gini index: cost of false negatives is 10 times the positives.

Model 4: info.tree.110 is based on entropy: cost of false negatives is 10 times the positives.

## Call the tree model wrapper.
gini.tree.1.1 = tree.builder(in.data = train, fp = 1, fn = 1, purity = "gini")
info.tree.1.1 = tree.builder(in.data = train, fp = 1, fn = 1, purity = "information")



gini.tree.1.10 = tree.builder(in.data = train, fp = 1, fn = 10, purity = "gini")
info.tree.1.10 = tree.builder(in.data = train, fp = 1, fn = 10, purity = "information")


## tree plots
par(mfrow=c(1,2))
rpart.plot(gini.tree.1.1, main = "Tree with Gini index: non-penalization" ,roundint=FALSE)
rpart.plot(info.tree.1.1, main = "Tree with entropy: non-penalization",roundint=FALSE)
Figure  Non-penalized decision tree models using Gini index (left) and entropy (right).

Figure Non-penalized decision tree models using Gini index (left) and entropy (right).

par(mfrow=c(1,2))
rpart.plot(gini.tree.1.10, main = "Tree with Gini index: penalization", roundint=FALSE)
rpart.plot(info.tree.1.10, main = "Tree with entropy: penalization" ,roundint=FALSE)
Figure penalized decision tree models using Gini index (left) and entropy (right).

Figure penalized decision tree models using Gini index (left) and entropy (right).

7.3 ROC for Model Selection

The ROC analysis is used to select the best model from 4 different decision tree models created previously.

# function returning a sensitivity and specificity matrix
SenSpe = function(in.data, fp, fn, purity){
  cutoff = seq(0,1, length = 20)   # 20 cut-offs including 0 and 1. 
  model = tree.builder(in.data, fp, fn, purity) 
  
  ## Caution: decision tree returns both "success" and "failure" probabilities.
  ## We need only "success" probability to define sensitivity and specificity!!! 
  pred = predict(model, newdata = in.data, type = "prob") # two-column matrix.
  senspe.mtx = matrix(0, ncol = length(cutoff), nrow= 2, byrow = FALSE)
  for (i in 1:length(cutoff)){
  
  
    
  pred.out =  ifelse(pred[,"YES"] >= cutoff[i], "YES", "NO")  
  TP = sum(pred.out =="YES" & in.data$grp.TenYearCHD == "YES")
  TN = sum(pred.out =="NO" & in.data$grp.TenYearCHD == "NO")
  FP = sum(pred.out =="YES" & in.data$grp.TenYearCHD == "NO")
  FN = sum(pred.out =="NO" & in.data$grp.TenYearCHD == "YES")
  senspe.mtx[1,i] = TP/(TP + FN)
  senspe.mtx[2,i] = TN/(TN + FP)
  }
  ## A better approx of ROC, need library {pROC}
  prediction = pred[, "YES"]
  category = in.data$grp.TenYearCHD == "YES"
  ROCobj <- roc(category, prediction)
  AUC = auc(ROCobj)
  ##
  list(senspe.mtx= senspe.mtx, AUC = round(AUC,3))
 }

Next we use above function (SenSpe) to build 4 trees models.

giniROC11 = SenSpe(in.data = train, fp=1, fn=1, purity="gini")
infoROC11 = SenSpe(in.data = train, fp=1, fn=1, purity="information")
giniROC110 = SenSpe(in.data = train, fp=1, fn=10, purity="gini")
infoROC110 = SenSpe(in.data = train, fp=1, fn=10, purity="information")

Next, we draw the ROC curves and calculate the areas under the ROC curves.

par(pty="s")      # set up square plot through graphic parameter
plot(1-giniROC11$senspe.mtx[2,], giniROC11$senspe.mtx[1,], type = "l", xlim=c(0,1), ylim=c(0,1), 
     xlab="1 - specificity", ylab="Sensitivity", col = "blue", lwd = 2,
     main="ROC Curves of Decision Trees", cex.main = 0.9, col.main = "navy")
abline(0,1, lty = 2, col = "orchid4", lwd = 2)



lines(1-infoROC11$senspe.mtx[2,], infoROC11$senspe.mtx[1,], col = "firebrick2", lwd = 2, lty=2)
lines(1-giniROC110$senspe.mtx[2,], giniROC110$senspe.mtx[1,], col = "olivedrab", lwd = 2)
lines(1-infoROC110$senspe.mtx[2,], infoROC110$senspe.mtx[1,], col = "skyblue", lwd = 2)

legend("bottomright", c(paste("gini.1.1,  AUC =", giniROC11$AUC), 
                        paste("info.1.1,   AUC =",infoROC11$AUC), 
                        paste("gini.1.10, AUC =",giniROC110$AUC), 
                        paste("info.1.10, AUC =",infoROC110$AUC)),
                     
                                           col=c("blue","firebrick2","olivedrab","skyblue"), 
                        lty=rep(1,6), lwd=rep(2,6), cex = 0.5, bty = "n")
Figure Comparison of ROC curves

Figure Comparison of ROC curves

The above ROC curves are plotted based on the 4 tree models. gini.1.10 model has the highest AUC as the best model within the 4 tree models.

7.4 Optimal Cut-off Score Determination

The final model (gini.1.10) is determined above, we are going to find the optimal cut-off score for reporting the predictive performance of the final model with the test data.

Optm.cutoff = function(in.data, fp, fn, purity){
  n0 = dim(in.data)[1]/5
  cutoff = seq(0,1, length = 20)               # candidate cut off prob
  ## accuracy for each candidate cut-off
  accuracy.mtx = matrix(0, ncol=20, nrow=5)    # 20 candidate cutoffs and gini.11
  ##
  for (k in 1:5){
     valid.id = ((k-1)*n0 + 1):(k*n0)
     valid.dat = in.data[valid.id,]
     train.dat = in.data[-valid.id,] 
     ## tree model
     tree.model = tree.builder(in.data, fp, fn, purity)
     ## prediction 
     pred = predict(tree.model, newdata = valid.dat, type = "prob")[,2]
     ## for-loop
     for (i in 1:20){
        ## predicted probabilities
        pc.1 = ifelse(pred > cutoff[i], "YES", "NO")
        ## accuracy
        a1 = mean(pc.1 == valid.dat$grp.TenYearCHD)
        accuracy.mtx[k,i] = a1
       }
      }
   avg.acc = apply(accuracy.mtx, 2, mean)
   ## plots
   n = length(avg.acc)
   idx = which(avg.acc == max(avg.acc))
   tick.label = as.character(round(cutoff,2))
   ##
   plot(1:n, avg.acc, xlab="cut-off score", ylab="average accuracy", 
        ylim=c(min(avg.acc), 1), 
        axes = FALSE,
        main=paste("5-fold CV optimal cut-off \n ",purity,"(fp, fn) = (", fp, ",", fn,")" , collapse = ""),
        cex.main = 0.9,
        col.main = "navy")
        axis(1, at=1:20, label = tick.label, las = 2)
        axis(2)
        points(idx, avg.acc[idx], pch=19, col = "red")
        segments(idx , min(avg.acc), idx , avg.acc[idx ], col = "red")
       text(idx, avg.acc[idx]+0.03, as.character(round(avg.acc[idx],4)), col = "red", cex = 0.8) 
   }

Below figure indicates the optimal cut-off score of 4 decision trees.

gini.1.1: 0.53 gini.1.10: 0.68 info.1.1: 0.58 info.1.10: 0.68

par(mfrow=c(3,2))
Optm.cutoff(in.data = train, fp=1, fn=1, purity="gini")
Optm.cutoff(in.data = train, fp=1, fn=1, purity="information")
Optm.cutoff(in.data = train, fp=1, fn=10, purity="gini")
Optm.cutoff(in.data = train, fp=1, fn=10, purity="information")
Figure: Plot of optimal cut-off determination

Figure: Plot of optimal cut-off determination

8 Comparision of 3 different models

Logistic model (the best logistic model), Neural Network model and gini1.10 model (the best decision tree model) are compared in the below figure. The logistic model has the highest AUC, hence it is the best model to detect potential factors of Coronary heart disease.

par(pty="s")      # set up square plot through graphic parameter
plot(1-giniROC110$senspe.mtx[2,], giniROC110$senspe.mtx[1,], type = "l", xlim=c(0,1), ylim=c(0,1), 
     xlab="1 - specificity", ylab="Sensitivity", col = "blue", lwd = 2,
     main="ROC Curves based on different model", cex.main = 0.9, col.main = "navy")
abline(0,1, lty = 2, col = "orchid4", lwd = 2)



lines(1-NN.model$SenSpe[2,], NN.model$SenSpe[1,], col = "red", lwd = 2, lty=2)

lines(1-logistic.model$specificity.vec, logistic.model$sensitivity.vec, col = "olivedrab", lwd = 2)

legend("bottomright", c(paste("gini.1.10, AUC =",giniROC110$AUC), 
                        paste("NN.model, AUC =",NN.model$AUC),
                        paste("logistic.model, AUC =",logistic.model$AUC)), 
                        
                       
                     
                                           col=c("blue","red","olivedrab" ), 
                        lty=rep(1,6), lwd=rep(2,6), cex = 0.5, bty = "n")
Figure Comparison of ROC curves

Figure Comparison of ROC curves