Introduction
For this project, we will be working on creating a multiple logistic
regression model. The data set for this project looks at the various
factors affecting an individual patient’s risk of developing coronary
heart disease (CHD) in a 10-year period of time. The response variable
in this data set is binary with 1 representing yes, the patient is at
risk of developing CHD within a 10-year period of time, and 0
representing no, the patient is not at risk of developing CHD within a
10-year period of time.
Our overall goal in this project is to create a model which has the
greatest predictive performance. We will create several candidate models
and compare them with each other to determine which model should be
selected as the final model for the prediction of an individual’s odds
of being at risk for developing CHD over a 10-year period of time. We
will use the cross-validation process to determine which of the
candidate models provides the best predictive power. We will also use
ROC analysis to determine which of the candidate models has the best
global goodness. This will allow us to see which of the candidate models
provides the best utility and quality for predictive modeling.
Data Description
I found this data set on kaggle.com on the following web page: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression/data
This data set takes a look at various medical and personal factors
which may have an impact on an individual’s risk of developing CHD
within the next 10 years. The binary response variable tells whether an
individual patient is at risk of developing CHD within the next 10
years, with a response value of 1 meaning yes they are at risk, and 0
meaning no they are not at risk.
Variables
There are 16 variables in this data set.
male: The gender of the patient. A categorical variable with 1
for “male”, 0 for “female”, and 2 for “other”.
age: The age of the patient in years. This is a quantitative,
continuous variable given as an integer value.
education: The education level of the patient. A categorical
variable with 1 for “less than high school”, 2 for “high school
diploma”, 3 for “college graduate”, and 4 for “post-college
graduate”.
currentSmoker: Whether or not the patient is currently a smoker.
A binary variable with 1 for “yes” and 0 for “no”.
cigsPerDay: The average number of cigarettes the patient smokes
in a day. This is a quantitative, continuous variable.
BPMeds: Whether or not the patient takes blood pressure
medication. A binary variable with 1 for “yes” and 0 for “no”.
prevalentStroke: Whether or not the patient has a history of
strokes. A binary variable with 1 for “yes” and 0 for “no”.
prevalentHyp: Whether or not the patient has a history of
hypertension. A binary variable with 1 for “yes” and 0 for
“no”.
diabetes: Whether or not the patient has a history of diabetes. A
binary variable with 1 for “yes” and 0 for “no”.
totChol: The patient’s total cholesterol level, given in mg/dL
(milligrams per deciliter). A quantitative, continuous
variable.
sysBP: The patient’s systolic blood pressure, given in mmHG
(millimeters of mercury). A quantitative, continuous variable.
diaBP: The patient’s diastolic blood pressure, given in mmHG
(millimeters of mercury). A quantitative, continuous variable.
BMI: The patient’s body mass index (BMI). A quantitative,
continuous variable.
heartRate: The patient’s heart rate, given in beats per minute. A
numeric, quantitative variable.
glucose: The patient’s glucose level, given in mg/dL (milligrams
per deciliter). A numeric, quantitative variable.
TenYearCHD (response variable): The binary response variable of
this data set which represents whether or not the patient has a 10-year
risk of developing Coronary Heart Disease (CHD). This response variable
is binary with 1 meaning yes, the patient does have a risk of developing
CHD in the next 10 years, and 0 meaning no, the patient does not have a
risk of developing CHD in the next 10 years.
Research
Questions
The key analytical question which I would like to investigate in this
project is:
- Can we create a model which has the highest predictive power in
regards to predicting a patient’s odds of being at risk for developing
coronary heart disease within a 10-year period of time based on the
predictor variables in this data set?
This question will serve as the basis for creating the candidate
multiple logistic regression models for this data set. We will
ultimately choose a final model of these potential candidate models
based upon our findings in regards to their predictive power. The
findings to this question could provide utility for both patients and
doctors to provide them with information relating to the odds and risks
for developing CHD and which factors significantly impact these
odds.
Some further questions to consider as a starting point for this
project include:
Is there a major difference in the predictive powers of the
candidate models we create?
Do the continuous independent variables in this data set follow a
normal distribution? And if not, is there a potential explanation for
why this may be the case?
How good is the predictive quality of our final model in regards
of predicting whether an individual will be at risk for developing CHD
in a 10-year period of time?
We will use these questions as a starting point for this project in
order to create a multiple logistic regression model which has the
greatest predictive power for predicting the odds of a patient being at
risk for developing CHD in a 10-year period based on the many factors
which play a role in this prediction.
Exploratory Data
Analysis
First, we will read in the data set from Github and we will call it
“heartdisease”.
heartdisease <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/framingham.csv", header=TRUE)
str(heartdisease)
'data.frame': 4238 obs. of 16 variables:
$ male : int 1 0 1 0 0 0 0 0 1 1 ...
$ age : int 39 46 48 61 46 43 63 45 52 43 ...
$ education : int 4 2 1 3 3 2 1 2 1 1 ...
$ currentSmoker : int 0 0 1 1 1 0 0 1 0 1 ...
$ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
$ BPMeds : int 0 0 0 0 0 0 0 0 0 0 ...
$ prevalentStroke: int 0 0 0 0 0 0 0 0 0 0 ...
$ prevalentHyp : int 0 0 0 1 0 1 0 0 1 1 ...
$ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
$ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
$ sysBP : num 106 121 128 150 130 ...
$ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
$ BMI : num 27 28.7 25.3 28.6 23.1 ...
$ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
$ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
$ TenYearCHD : int 0 0 0 1 0 0 1 0 0 0 ...
We will make a copy of the data set called “heartdisease0” which will
be the data set we will analyze and transform through the exploratory
data analysis.
heartdisease0 <- heartdisease
Fixing the Missing
Observations
When looking through the data set, I noticed that some observations
appeared to be missing from the data set, as they had values of “NA” for
certain variables. Before we can begin with the logistic regression,
this is something that we should look into.
First, let’s see how many values are missing for each of the
variables.
colSums(is.na(heartdisease0))
male age education currentSmoker cigsPerDay
0 0 105 0 29
BPMeds prevalentStroke prevalentHyp diabetes totChol
53 0 0 0 50
sysBP diaBP BMI heartRate glucose
0 0 19 1 388
TenYearCHD
0
We can see that the variables education, cigsPerDay, BPMeds, totChol,
BMI, heartRate, and glucose all have missing observations.
It is important to note that single imputation methods that could be
used to fix the missing observations of the variables could lead to
potential skew in their distributions. So, instead we will use multiple
imputation methods to fill in these missing observations with randomly
approximated values. Out of the variables with missing values, none of
the variables have the majority of their data entries missing, so it
seems suitable to fill in these missing observations with approximated
values in order to prevent these entries from coming up as NA within the
data set. However, it is important to be mindful of the fact that
filling in these missing entries could still lead to a potential skew or
inaccuracy in the distribution of these variables after filling in the
missing observations. In order to properly handle this, we will first
fix the missing values and then check the distributions of the variables
to ensure that they are normally distributed, or are practically
important enough variables to include within the final multiple logistic
regression model.
Multiple Imputation
of Missing Observations
We will use multiple imputation to fill in the missing values of the
observations. Using this multiple imputation method will use a randomly
iterated method to fill in the missing observations within the data set.
This will help ensure that the missing observations are filled in the
best way possible by handling missing values in a way that reduces the
chance of false prediction. We will use the Multiple Imputation using
Chained Equations (MICE) function to use multiple imputation to fill in
the missing observations.
We will store the results of the multiple imputation in an object
called “mice” for now.
set.seed(1)
mice <- mice(heartdisease0, method = "cart")
iter imp variable
1 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
1 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
1 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
1 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
1 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
2 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
2 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
2 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
2 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
2 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
3 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
3 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
3 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
3 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
3 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
4 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
4 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
4 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
4 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
4 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
5 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
5 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
5 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
5 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
5 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
We will create a new data frame called “heartdisease1” to represent
the corrected data set that will have no missing values after multiple
imputation. We will convert the results of the multiple imputation from
a large mids object back to a regular data frame. This will be used for
the future analysis and we will call this object “heartdisease1”.
heartdisease1 <-mice::complete(mice,2)
Now that we have used the multiple imputation method to fill in all
of the missing observations, let’s double check that there are no more
missing observations. All of the variables should now have a value of
zero missing observations.
colSums(is.na(heartdisease1))
male age education currentSmoker cigsPerDay
0 0 0 0 0
BPMeds prevalentStroke prevalentHyp diabetes totChol
0 0 0 0 0
sysBP diaBP BMI heartRate glucose
0 0 0 0 0
TenYearCHD
0
As we can see, all of the variables now have zero missing
observations, so we successfully fixed all of the missing values by
filling them in with multiple imputation by the use of the mice
function. The revised data set “heartdisease1” has no missing
observations, so we will use this for further analysis and for creating
the multiple logistic regression model in the future steps of this
project.
Fixing Variable
Types
One other thing I noticed while looking through the data set is that
the variable “cigsPerDay” is given as a character variable even though
it should be a numeric variable. This variable is a measure of the
number of cigarettes a patient smokes in a day, so it should be reported
as a numeric variable since it is a continuous range of possible values
and not collected as categories. So, let’s convert cigsPerDay from a
character variable to a numeric variable. We will use the as.numeric()
function to convert cigsPerDay to a numeric variable.
heartdisease1$cigsPerDay <- as.numeric(heartdisease1$cigsPerDay)
Now, the cigsPerDay variable is correctly given as a numeric
variable.
The variable of age should also be given as a numeric value, but is
is not so we will convert this variable to a numeric as well.
heartdisease1$age <- as.numeric(heartdisease1$age)
The variable of age is now correctly given as a numeric value.
Additionally, the variable BPMeds is given as a character, however
this is a binary variable so it should be given as an integer variable
with values 0 and 1 like the other binary variables in the data set. We
will convert this variable from a character to an integer. We will use
the as.integer() function to do this conversion.
heartdisease1$BPMeds <- as.integer(heartdisease1$BPMeds)
Now, the BPMeds variable is correctly given as a binary, integer
variable.
Another issue is that the variable for education is supposed to be
categorical with each number representing a different category of
education level. However, this variable is given as an integer variable.
So, we will fix this by making education a character variable.
heartdisease1$education <- as.character(heartdisease1$education)
The education variable is now correctly given as a categorical
variable.
Now, all of the variables in the data set are given in the correct
types that they should be given as.
Checking the Variable
Distributions
Before we begin with creating and comparing the multiple logistic
regression models, we should first check the distributions of the
variables in the data set to ensure that they meet the criteria of being
approximately normally distributed, for the quantitative variables. Or,
if the variables do not appear to follow a normal distribution, we will
see if they are practically important enough to be included within the
model building process despite their lack of a normal distribution. We
will take a look at some of the variables, particuarelly some of the
quanitiative variables which had missing values that were filled in
through the multiple imputation process.
totChol
Distribution
totChol was one of the variables that had some missing observations
that were filled in during the previous steps. So, we should definetly
check the distribution of this variable.
ylimit = max(density(heartdisease1$totChol)$y)
hist(heartdisease1$totChol, probability = TRUE, main = "totChol Distribution", xlab="totChol",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$totChol, adjust=2), col="darkorchid")

Overall, totChol appears to be unimodal and approximately normally
distributed. Therefore we can say that this variable is safe to include
within the multiple logistic regression model building process and that
the process of filling in the missing observations did not lead to any
severe concerns for the distribution of this variable.
BMI
Distribution
BMI is another variable that had missing observations that were
filled in during the previous steps. We will check its distribution to
make sure it appears to be approximately normally distributed.
ylimit = max(density(heartdisease1$BMI)$y)
hist(heartdisease1$BMI, probability = TRUE, main = "BMI Distribution", xlab="BMI",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$BMI, adjust=2), col="darkorchid")

The distribution for BMI is unimodal and appears to be approximately
normally distributed. It appears that the majority of individuals have a
BMI somewhere in the range of 20-30, with the vast majority of the data
falling within this range. This allows us to conclude that it is safe to
include the variable of BMI within the multiple logistic regression
model building process and that the process of filling in the missing
values did not lead to any major concerns with the distribution of this
variable.
Glucose
Distribution
The variable glucose was another variable that had missing
observations which were filled in during the previous steps. We will
check its distribution to make sure that everything looks alright.
ylimit = max(density(heartdisease1$glucose)$y)
hist(heartdisease1$glucose, probability = TRUE, main = "glucose", xlab="glucose",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$glucose, adjust=2), col="darkorchid")

The variable glucose is unimodal and appears to be approximately
normally distributed with the exception of a few outliers to the right.
The most common glucose value appears to be around 75, with the
histogram peaking around this value. Despite the few outliers, we will
keep this variable in the final model since a patient’s glucose level is
a medically important variable which would be helpful to know in order
to assess a patient’s odds of being at risk for developing CHD.
cigsPerDay
Distribution
cigsPerDay was another variable that had missing observations which
were filled in during the previous steps. So, we should check its
distribution to make sure things look alright.
ylimit = max(density(heartdisease1$cigsPerDay)$y)
hist(heartdisease1$cigsPerDay, probability = TRUE, main = "cigsPerDay", xlab="cigsPerDay",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$cigsPerDay, adjust=2), col="darkorchid")

The variable cigsPerDay appears to be very notably skewed to the
right. This can be attributed to that this variable represents the
number of cigarettes a patient smokes per day. Many of the patients
included in the data collection were not smokers, so they would have
reported a value of 0 cigarettes smoked per day. This can explain the
skew to the right of this variable, because non-smokers would report a
value of 0 while smokers would report a value greater than 0 that
represents the number of cigarettes they smoke in a day. This is a
pratically important variable regardless of its skew, because smoking is
a major risk factor for conditions like CHD. Additionally, this is an
important piece of information, because doctors would want to know how
many cigarettes their patients smoke in a day, because this has a
significant impact on the patient’s risk factors. So, we will keep this
variable in the final model despite its skew because it is an important
variable in regards to a patient’s potential risk factors for developing
CHD. But, it is important to keep in mind that this variable is skewed
due to it being representative of both non-smokers and smokers.
heartRate
Distribution
The variable heartRate was another one which had some missing
observations that needed to be filled in during previous steps. We will
check its distribution to ensure that everything looks alright.
ylimit = max(density(heartdisease1$heartRate)$y)
hist(heartdisease1$heartRate, probability = TRUE, main = "heartRate", xlab="heartRate",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$heartRate, adjust=2), col="darkorchid")

The distribution for heart rate is unimodal and approximately
normally distributed without any notable skew or outliers that are
apparent in its density plot. This allows us to conclude that this
variable is safe to include within the multiple logistic regression
model building process and that the process of filling in the missing
values did not lead to any issues in its distribution.
Overall
Findings
We have checked the distributions of the variables which had missing
observations that were filled in during the previous steps of this
project along with other important quantitative variables in this data
set. We ensured that the variables either follows a distribution that
appears to be approximately normal, or are practically important enough
to be included in the multiple logistic regression model building
process otherwise.
The only variable that did not appear to follow an approximately
normal distribution was cigsPerDay. However, the skewness seen in this
variable’s distribution can be attributed to the fact that this data was
collected amongst both smokers and non-smokers. So, non-smokers reported
values of 0 cigarettes smoked per day, while smokers reported values
that were greater than zero. This appeared to the distribution for this
variable appearing to be skewed to the right. However, this is a
variable that is practically important because the number of cigarettes
a patient smokes per day can have a significant impact on their odds of
being at risk for developing CHD. Due to the practical importance of
this variable, it must still be included within the multiple logistic
regression model building process.
Altogether, we have ensured that the variables in this data set
appear to be good enough to use within the model building process. So,
now we will begin with the process of building the candidate multiple
logistic regression models.
Data Split
We will now split the data up into training and testing data. We will
randomly split the data so that 80% of the data will be used for
training and 20% of the data will be used for testing. By splitting the
data up into the training and the testing data, this will allow us to
build the candidate models and analyze their predictive performances. We
will use the split data to conduct the cross validation process to
analysis the predictive power and performances of each of the three
candidate models we will investigate. This will allow us to eventually
determine which of these three models is the ideal choice for our final
model based upon its overall predictive power.
# Split the data into 80% training and 20% testing.
n <- dim(sd.heartdisease)[1]
train.n <- round(0.8*n)
train.id <- sample(1:n, train.n, replace = FALSE)
train <- sd.heartdisease[train.id, ]
test <- sd.heartdisease[-train.id, ]
Now, our data has been split up into a training and a testing data
set, with 80% of the data for training and the remaining 20% for
testing. The observations were split up randomly amongst these two data
sets. We will specifically use the training data for the cross
validation process. This process will allow us to identify which
multiple logistic regression model is the best model for the purpose of
prediction. This will allow us to select a model which predicts the odds
of an individual being at risk for developing CHD with the highest
quality of prediction.
Best Model
Identification
Now that our variables have been checked and standardized to ensure
their predictive performance being the best that it possibly can be, we
will begin with the model building process. We will look at three
potential, candidate models for the multiple logistic regression and
determine which one has the best predictive performance.
Descriptions of the
Three Candidate Models
We will take a look into the specifications of the three candidate
multiple logistic regression models.
Full Model
The first of the three models will be the full model, with all of the
variables included. This full model will use all of the variables in
order to create a multiple logistic regression to predict the odds of a
patient being at risk for developing CHD in a 10-year period.
The regression equation for this full model is represented as:
TenYearCHD = male + sd.age + education + currentSmoker +
sd.cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes +
sd.totChol + sd.sysBP + sd.diaBP + sd.BMI + sd.heartRate +
sd.glucose
We will find the specific multiple logisitic regression equation for
this full model.
full.model = glm(TenYearCHD ~ male + sd.age + education + currentSmoker + sd.cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + sd.totChol + sd.sysBP + sd.diaBP + sd.BMI + sd.heartRate + sd.glucose,
family = binomial(link = "logit"),
data = sd.heartdisease)
kable(summary(full.model)$coef,
caption = "Full Model Summary of the Inferential Statistics")
Full Model Summary of the Inferential Statistics
(Intercept) |
-2.2024269 |
0.1240704 |
-17.7514313 |
0.0000000 |
male |
0.4852112 |
0.1013449 |
4.7877196 |
0.0000017 |
sd.age |
0.5168830 |
0.0539408 |
9.5824163 |
0.0000000 |
education2 |
-0.2356171 |
0.1154925 |
-2.0401079 |
0.0413396 |
education3 |
-0.1026466 |
0.1376433 |
-0.7457430 |
0.4558227 |
education4 |
0.0115144 |
0.1519726 |
0.0757666 |
0.9396048 |
currentSmoker |
0.0219380 |
0.1443470 |
0.1519811 |
0.8792019 |
sd.cigsPerDay |
0.2507207 |
0.0679008 |
3.6924543 |
0.0002221 |
BPMeds |
0.3270416 |
0.2170889 |
1.5064870 |
0.1319422 |
prevalentStroke |
0.9388714 |
0.4468276 |
2.1011937 |
0.0356240 |
prevalentHyp |
0.2311662 |
0.1287922 |
1.7948770 |
0.0726733 |
diabetes |
0.0912436 |
0.3005700 |
0.3035687 |
0.7614565 |
sd.totChol |
0.0803445 |
0.0457538 |
1.7560185 |
0.0790852 |
sd.sysBP |
0.3074739 |
0.0783167 |
3.9260304 |
0.0000864 |
sd.diaBP |
-0.0327142 |
0.0713895 |
-0.4582495 |
0.6467732 |
sd.BMI |
0.0071253 |
0.0482202 |
0.1477649 |
0.8825283 |
sd.heartRate |
-0.0137062 |
0.0467007 |
-0.2934898 |
0.7691478 |
sd.glucose |
0.1729621 |
0.0500142 |
3.4582600 |
0.0005437 |
The equation for the multiple logistic regression equation of the
full model is given as follows:
log p/(1-p) = -2.2024 + 0.4852 * male + 0.5169 * sd.age - 0.2356 *
education2 - 0.1026 * education3 + 0.0115 * education4 + 0.0219 *
currentSmoker+ 0.2507 * sd.cigsPerDay + 0.3270 * BPMeds + 0.9389 *
prevalentStroke + 0.2312 * prevalentHyp + 0.0912 * diabetes + 0.0803 *
sd.totChol + 0.3075 * sd.sysBP - 0.0327 * sd.diaBP + 0.0071 * sd.BMI -
0.0137 * sd.heartRate + 0.1730 * sd.glucose
Reduced Model
The second of the three models will be the reduced model. This model
will consist only of the most practically important variables within the
context of this situation, using various factors to best predict the
odds of an individual being at risk for developing CHD. The United
States National Library of Medicine of the National Institutes of Health
(NIH), the most notable risk factors for an individual being at risk of
developing CHD are cigarette smoking, blood pressure, and cholesterol
levels. In this data set, there are two variables related to smoking,
currentSmoker, whether a patient is currently a smoker or not, and
cigsPerDay which is was scaled to sd.cigsPerDay, how many cigarettes a
patient smokes in a day. There are also two variables related to blood
pressure, sysBP and diaBP, for systolic blood pressure and diastolic
blood pressure. Both of these blood pressure variables were scaled and
are now given as sd.sysBP and sd.diaBP. There is also a variable
included in the data set for a patient’s total cholesterol levels,
totChol, which was scaled and is now given as sd.totChol. These five
variables in the data set represent the key risk factors that the
medical report from the NIH stated as being the most significant in
predicting a patient’s risk for CHD. So, our reduced model will consist
only of these particular variables.
The general regression equation for this reduced model is represented
as:
TenYearCHD = currentSmoker + sd.cigsPerDay + sd.sysBP + sd.diaBP +
sd.totChol
We will find the specific multiple logistic regression equation for
this reduced model.
reduced.model = glm(TenYearCHD ~ currentSmoker + sd.cigsPerDay + sd.sysBP + sd.diaBP + sd.totChol,
family = binomial(link = "logit"),
data = sd.heartdisease)
kable(summary(reduced.model)$coef,
caption = "Reduced Model Summary of the Inferential Statistics")
Reduced Model Summary of the Inferential Statistics
(Intercept) |
-1.7675361 |
0.0822320 |
-21.494511 |
0.0000000 |
currentSmoker |
-0.1443498 |
0.1392986 |
-1.036262 |
0.3000800 |
sd.cigsPerDay |
0.2683287 |
0.0637373 |
4.209920 |
0.0000255 |
sd.sysBP |
0.6381074 |
0.0641837 |
9.941893 |
0.0000000 |
sd.diaBP |
-0.1409295 |
0.0650918 |
-2.165088 |
0.0303809 |
sd.totChol |
0.1113871 |
0.0431866 |
2.579205 |
0.0099028 |
The equation for the multiple logistic regression equation of the
reduced model is given as follows:
log p/(1-p) = -1.7675 - 0.1443 * currentSmoker + 0.2683 *
sd.cigsPerDay + 0.6381 * sd.sysBP - 0.1409 * sd.diaBP + 0.1114 *
sd.totChol
Forward Selection
Model
The third of three candidate models will be the model that was found
through the automatic variable selection process, specifically with
using forward selection. We will call this the fs.model for this project
to represent it being the model found through the use of forward
selection of the variables.
We will find the specific multiple logistic regression equation for
this forward selection model.
fs.model = stepAIC(reduced.model, scope = list(lower=formula(reduced.model), upper=formula(full.model)),
direction = "forward",
trace = 0
)
kable(summary(fs.model)$coef,
caption="Final Model Summary of the Inferential Statistics")
Final Model Summary of the Inferential Statistics
(Intercept) |
-2.2897777 |
0.1108872 |
-20.6496069 |
0.0000000 |
currentSmoker |
0.0083694 |
0.1431896 |
0.0584499 |
0.9533902 |
sd.cigsPerDay |
0.2497211 |
0.0677217 |
3.6874613 |
0.0002265 |
sd.sysBP |
0.3080000 |
0.0777695 |
3.9604228 |
0.0000748 |
sd.diaBP |
-0.0324529 |
0.0698694 |
-0.4644798 |
0.6423040 |
sd.totChol |
0.0755135 |
0.0454483 |
1.6615237 |
0.0966083 |
sd.age |
0.5363993 |
0.0527922 |
10.1605849 |
0.0000000 |
male |
0.5164091 |
0.0993494 |
5.1979103 |
0.0000002 |
sd.glucose |
0.1825977 |
0.0373944 |
4.8830291 |
0.0000010 |
prevalentStroke |
0.9413243 |
0.4425788 |
2.1269078 |
0.0334277 |
prevalentHyp |
0.2310484 |
0.1281164 |
1.8034261 |
0.0713213 |
BPMeds |
0.3233004 |
0.2160729 |
1.4962560 |
0.1345870 |
The equation for the multiple logistic regression equation of the
forward selection model is given as follows:
log p/(1-p) = -2.2898 + 0.0084 * currentSmoker + 0.2497 *
sd.cigsPerDay + 0.2497 * sd.sysBP - 0.0325 * sd.diaBP + 0.0755 *
sd.totChol + 0.5364 * sd.age + 0.5164 * male + 0.1826 * sd.glucose +
0.9413 * prevalentStroke + 0.2310 * prevalentHyp + 0.3233 * BPMeds
The variables that were kept in this model after forward selection
were currentSmoker, sd.cigsPerDay, sd.sysBP, sd.diaBP, sd.totChol,
sd,age, male, sd.glucose, prevalentStroke, prevalentHyp, and BPMeds.
Cross Validation
Now that we’ve looked into the specifications of the three potential
candidate models, we will use cross-validation to determine which of
these three models we should use. We want to see which model has the
best predictive performance. We will use a 5-fold cross validation to
conduct the process of finding the predictive errors for each of the
three candidate models. We will use this 5-fold cross validation process
to calculate the average predictive errors for each of the three
candidate models, the full model, the reduced model, and the forward
selection model. We are looking for which model has the lowest value of
average predictive errors. This model with the lowest value for the
average predictive errors will be the ideal choice, as having the lowest
predictive error value indicates a higher predictive power and more
accurate predictions than a model with higher average predictive error
would.
For our cross validation process, we will use a cut off probability
of 0.5 to define between the binary response values of 1, yes, and 0,
no. These values represent whether an individual will be stated as being
at risk for developing CHD in a 10-year period based upon the multiple
logistic regression model that was used for prediction.
We will label the full model as our first model (01), the reduced
model as our second model (02), and the forward selection model as our
third model (03) for this cross-validation process.
## 5 Fold Cross Validation
k = 5
fold.size = floor(dim(train)[1]/k)
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)
# Training and Testing Fold for the Models
for(i in 1:k){
valid.id = (fold.size*(i-1)+1):(fold.size*i)
valid = train[valid.id, ]
train.data = train[-valid.id,]
# Full Model: all variables
full.model = glm(TenYearCHD ~ male + sd.age + education + currentSmoker + sd.cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + sd.totChol + sd.sysBP + sd.diaBP + sd.BMI + sd.heartRate + sd.glucose,
family = binomial(link = "logit"),
data = train.data)
## Reduced Model: only the most practically important variables.
reduced.model = glm(TenYearCHD ~ currentSmoker + sd.cigsPerDay + sd.sysBP + sd.diaBP + sd.totChol,
family = binomial(link = "logit"),
data = train.data)
## Automatic Forward Selection (fs) Model: model found using forward selection
fs.model = stepAIC(full.model,
scope = list(lower=formula(reduced.model),
upper=formula(full.model)),
direction = "forward",
trace = 0
)
# Finding the predicted probabilities for each of the candidate models
pred.full = predict(full.model, newdata = valid, type = "response")
pred.reduced = predict(reduced.model, newdata = valid, type = "response")
pred.fs = predict(fs.model, newdata = valid, type = "response")
pre.outcome01 = ifelse(as.vector(pred.full) > 0.5, "1", "0")
pre.outcome02 = ifelse(as.vector(pred.reduced) > 0.5, "1", "0")
pre.outcome03 = ifelse(as.vector(pred.fs) > 0.5, "1", "0")
PE1[i] = sum(pre.outcome01 == valid$TenYearCHD)/length(pred.full)
PE2[i] = sum(pre.outcome02 == valid$TenYearCHD)/length(pred.reduced)
PE3[i] = sum(pre.outcome03 == valid$TenYearCHD)/length(pred.fs)
}
# Finding the average predictive errors of each candidate model
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(avg.pe, caption = "Average Predictive Errors of the Three Candidate Models")
Average Predictive Errors of the Three Candidate
Models
0.8545723 |
0.8495575 |
0.8545723 |
The average predictive errors of the three candidate models are given
as follows:
The full model, represented by PE1, had an average predictive error
of 0.8546. The reduced model, represented by PE2, had an average
predictive error of 0.8495.
The forward selection model, represented by PE3, had an average
predictive error of 0.8546.
We can see that both the full model and the forward selection model
had equal values for their average predictive errors, with both having a
value of 0.8546. However, the reduced model a lower average predictive
error value than the other two models, with a value of 0.8496. This
indicates that the reduced model is the one which we should choose as
our final model.
The cross-validation process of calculating the average predictive
errors for the three candidate models showed that the reduced model is
the ideal one to use for prediction due to it having the lowest value of
its average predictive error out of the three candidate models. This
makes the reduced model have the highest predictive power and
performance out of the three candidate models, as shown by how it has
the least amount of its average predictive errors in comparison to the
full model and the forward selection model. So, we will move forward
with the use of the reduced model as our final multiple logistic
regression model for prediction of the odds of an individual being at
risk for developing CHD.
Final Model
In the previous cross-validation step, we used a cut off probability
of 0.5 to identify between the response variable giving a positive or
negative result. In this case, a “positive” result, or an output of 1,
represents that an individual is at risk for developing CHD in a 10-year
period. And, a “negative” result, or an output of 0, represents that an
individual is not at risk for developing CHD in a 10-year period. We
used the cross-validation process with this cut off probability of 0.5
to find the values of the average predictive errors of the three
candidate models. We found that the reduced model was the one with the
best predictive performance, so we will use that as our final model.
However, in order to report the details of the final model, we must
find its actual accuracy based upon the test data. We previously split
the data up into a training and a testing data set. We will use the test
data set to find the actual accuracy of the final model, the reduced
model, in order to provide the fullest accuracy in the report of this
model.
pred.reduced = predict(reduced.model, newdata = test, type = "response")
pred.reduced.outcome = ifelse(as.vector(pred.reduced) > 0.5, "1", "0")
Accuracy = sum(pred.reduced.outcome == test$TenYearCHD)/length(pred.reduced)
kable(Accuracy, caption="Actual Accuracy of the Final Model (Reduced Model)")
Actual Accuracy of the Final Model (Reduced Model)
0.8443396 |
The actual accuracy of the final model, the reduced model, is given
as 0.8443. This value represents the accuracy rate of this final
model.
It appears that the final model has good accuracy as seen by its
accuracy rate of 0.8443. This indicates that the final model we selected
has good predictive performance for predicting the odds of an individual
being at risk for developing CHD within a 10-year period.
ROC Analysis
We will use ROC analysis to measure the global performance of the
multiple logistic regression model. This plot will illustrate the
relationship of the true positive rate (TPR), also known as the
sensitivity, against the false positive rate (FPR), also given as 1 -
specificity. We will plot the ROC curves for each of the three candidate
models to compare them against one another. We will also calculate the
area under the curve (AUC) values for each of the three candidate
models. The value of area under the curve represents the global goodness
of the model, and a higher AUC value is ideal. A higher value of the
area under the curve indicates better and more quality global goodness
of that model.
We will plot the ROC curves for the full model, the reduced model,
and the forward selection model. This will allow us to compare the area
under the curves for each of the three models. Additionally, we will
calculate the AUC values for each of the three models to compare them
with each other and determine which model indicates the best global
goodness.
TPR.FPR=function(pred){
prob.seq = seq(0, 1, length=50)
pn = length(prob.seq)
true.lab = as.vector(train$TenYearCHD)
TPR = NULL
FPR = NULL
for (i in 1:pn){
pred.lab = as.vector(ifelse(pred > prob.seq[i],"1", "0"))
TPR[i] = length(which(true.lab=="1" & pred.lab=="1"))/length(which(true.lab=="1"))
FPR[i] = length(which(true.lab=="0" & pred.lab=="1"))/length(which(true.lab=="0"))
}
cbind(FPR = FPR, TPR = TPR)
}
We will use the training data set to help find the predictive
probabilities for each of the three candidate models.
# The Full Model
full.model = glm(TenYearCHD ~male + sd.age + education + currentSmoker + sd.cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + sd.totChol + sd.sysBP + sd.diaBP + sd.BMI + sd.heartRate + sd.glucose,
family = binomial(link = "logit"),
data = train)
# The Reduced Model (final model)
reduced.model = glm(TenYearCHD ~ currentSmoker + sd.cigsPerDay + sd.sysBP + sd.diaBP + sd.totChol,
family = binomial(link = "logit"),
data = train)
# The Forward Selection (fs) Model
fs.model = stepAIC(full.model,
scope = list(lower=formula(reduced.model),
upper=formula(full.model)),
direction = "forward",
trace = 0
)
# Finding the Predicted Probabilities for Each Model
pred.full = predict.glm(full.model, newdata = train, type = "response")
pred.reduced = predict.glm(reduced.model, newdata = train, type = "response")
pred.fs = predict.glm(fs.model, newdata = train, type = "response")
Now, we will plot the ROC curves for each of the three candidate
models to compare them.
# Plotting the ROC curves
plot(TPR.FPR(pred.full)[,1], TPR.FPR(pred.full)[,2],
type="l", col = 2, lty = 1, xlim = c(0,1), ylim = c(0,1),
xlab = "FPR: 1 - Specificity",
ylab ="TPR: Sensitivity",
main = "ROC Curves for the Three Candidate Models",
cex.main = 0.8,
col.main = "navy")
lines(TPR.FPR(pred.reduced)[,1], TPR.FPR(pred.reduced)[,2], col=3, lty=2)
lines(TPR.FPR(pred.fs)[,1], TPR.FPR(pred.fs)[,2], col=4, lty=3)
category = train$TenYearCHD == "1"
ROCobj01 <- roc(category, as.vector(pred.full))
ROCobj02 <- roc(category, as.vector(pred.reduced))
ROCobj03 <- roc(category, as.vector(pred.fs))
AUC01 = round(auc(ROCobj01),4)
AUC02 = round(auc(ROCobj02),4)
AUC03 = round(auc(ROCobj03),4)
legend("bottomright", c(paste("Full model: AUC = ", AUC01),
paste("Reduced model: AUC =", AUC02),
paste("Forward Selection model: AUC =", AUC03)),
col = 2:4, lty = 1:3, cex = 0.8, bty = "n")

We can see that both the full model and the forward selection model
fall on the exact same curve, which is also evidenced by how these two
models have the same values for their area under the curve Both the full
model and the forward selection model have an area under the curve of
0.7356, and their ROC curves match up identically to each other. This
matches up with the previous finding of the full model and the forward
selection model having the exact same values for their average
predictive errors.
Something interesting found in this ROC analysis, is that the reduced
model actually has the smallest value of area under the curve with
0.6763. We found that the reduced model was the one with the lowest
average predictive errors previously, which made it the ideal model
choice for our final model. However, this model turned out to have the
lowest area under the curve out of the three, which is not ideal. A
model with a higher area under the curve has a better global goodness
than a model with a lower area under the curve.
Solely based upon area under the curve values found in the ROC
analysis, it would appear that the forward selection model provides the
best global goodness. Both the full model and the forward selection
model had the same value for their area under the curve, but the forward
selection model would be the better choice due to it having less
variables.
The ROC analysis revealed that the forward selection model may in
fact be a good alternative to use as the final model due to it having a
higher area under the curve than the reduced model, despite the reduced
model having a lower average value of predictive errors as was found
earlier. A potential reason for the forward selection model having a
higher AUC than the reduced model despite the reduced model having a
lower and more ideal average value of predictive errors could be due to
the reduced model having less variables. It could be a possibility that
the reduced model having fewer variables could have a higher likelihood
of false positivies or false negatives due to some of the variables
which may be significant not having been included in the reduced
model.
Conclusion
In this project, three candidate models were created to compare their
predictive powers and ultimately choose a final model that provides the
best significance and quality of prediction for predicting the odds of
an individual being at risk for developing CHD in a 10-year period.
These models all consisted of multiple logistic regression to predict
the binary response variable of an individual’s odds of being at risk
for developing CHD. Creating a model for the purpose of this prediction
can provide use for doctors and patients in order to predict their odds
of being at risk for developing CHD based upon various personal and
medical factors.
The three candidate models that were created included a full model
with all of the variables in the data set, a reduced model with only the
variables deemed as the most practically important, and a model created
through forward variable selection. These three models were compared by
their predictive power and quality for their potential use as a final
multiple logistic regression model for this project.
The average predictive errors of all three candidate models were
compared to assess their predictive powers through the process of
cross-validation. It was found that the full model and the forward
selection model had identical values for the average predictive errors.
However, the reduced model turned out to have the lowest value of its
average predictive errors out of the three candidate models. This
indicated that the reduced model was the ideal choice due to it having
the lowest value of average predictive errors, and therefore the highest
quality of predictive power. This indicated that the reduced model would
be a good choice for the final model due to its high quality predictive
power shown by how it had the smallest value of average predictive
errors out of all three candidate models.
Furthermore, ROC analysis was conducted to continue the analysis of
the three candidate models. It was found that the full model and the
forward selection model have identical ROC curves and area under the
curve values. This matched up with how it was previously found that the
full model and the forward selection model have the exact same average
value of their predictive errors. Between these two models, we would
choose the forward selection model over the full model due to it having
less variables. However, we also found that the reduced model in fact
had the lowest area under the curve value out of the three models. This
is interesting as it contradicts with the reduced model previously being
shown to have the lowest average value of predictive errors. It turns
out that despite the reduced model having the lowest average of
predictive errors and therefore the highest predictive power, it also
has the lowest area under the curve meaning that it has lower global
goodness out of the candidate models.
Overall, it was found that through cross-validation it turned out
that the reduced model had the lowest average value of predictive
errors, and therefore the highest predictive power. This made the
reduced model the ideal choice for the selection of our final model.
This reduced model focused on only the most practically important
variables for predicting the odds of an individual being at risk for
developing CHD. These most practically important variables included
currentSmoker, sd.cigsPerDay, sd.sysBP, sd.diaBP, and sd.totChol.
However, it was later found through ROC analysis that the reduced model
had the lowest area under the curve value, indicating a lower global
goodness. It was in fact the forward selection model that the ROC
analysis suggested as being the ideal choice based upon its global
goodness. The forward selection model included the variables of
currentSmoker, sd.cigsPerDay, sd.sysBP, sd.diaBP, sd.totChol, sd,age,
male, sd.glucose, prevalentStroke, prevalentHyp, and BPMeds.
Even though the reduced model provided the best predictive power due
to its lower value of average predictive errors, the forward selection
model was shown through ROC analysis to have the highest global
goodness. Both the reduced model and the full model are high quality
models for the prediction and estimation of the odds of an individual
being at risk for developing CHD. This discrepancy between the reduced
model having the highest predictive power yet the lowest global goodness
could in fact be due to it having the fewest variables. By having fewer
variables, this could potentially lead to higher rates of false
positives and false negatives than the other models with more variables
accounted for. So overall, the reduced model would be the ideal choice
as the final model for a model with the highest predictive power, which
is what we were looking for in this project. However, it is important to
keep in mind that this reduced model has a lower global goodness than
the forward selection model, likely due to the reduced model having
fewer variables and potentially a higher rate for false positives and
false negatives. This is important to keep in mind when creating a model
for the prediction of an individual’s odds of being at risk for
developing CHD, as high predictive power is ideal for ensuring that
patient’s health is accounted for properly, but it is also important to
be wary of the potential for false postives and false negatives in
regards of these predictions.
Recommendations
Some recommendations I would make for future projects include:
Expand the data collection to ensure the accuracy of the findings
found in this project. This could also involve reaching out to various
hospitals and doctors for a better understanding of the most serious
risk factors for CHD in order to ensure that the variables kept within
the final model indeed are representative of a patient’s odds of being
at risk for developing CHD.
Consider other variables which could be statistically significant
in predicting the odds of a patient’s risk for developing CHD. All of
the predictor variables that were included within this data set made
sense in terms of this project and were all factors which could play a
role in a patient’s odds of being at risk for developing CHD. However,
future projects could consider looking into additional variables that
may be statistically significant in predicting the odds of a patient
being at risk for developing CHD. For instance, one potential variable
that could be looked into is a patient’s income to see if the income of
a patient has an impact on their odds of being at risk for developing
CHD. A reason why I believe that this could be a significant variable
for this model is because patients with higher incomes tend to have more
options when it comes to medical care than patients with lower incomes
would. So, perhaps there could be a relationship between a patient’s
income and their odds of being at risk for developing CHD. This is just
one example of a variable that is not in this data set which may provide
use for future projects.
Consider other possible models that could provide quality
predictive power in predicting the odds of a patient being at risk for
developing CHD within a 10-year period of time. In this project, we
looked at three candidate multiple logistic regression models and chose
which one provided the best utility. These three models included a full
model with all of the predictor variables left in the model, a reduced
model with only the variables that the medical report described as most
significant risk factors for CHD, and a final model that was created
through automatic forward variable selection. We found that the reduced
model was the one which provided the lowest average value for predictive
errors out of the three models, and so we chose this to be our final
model due to it providing the strongest predictive power. However, this
is not to say that there is not any other possible model that could
potentially be better than the one we created. If further projects found
a model that provided better and more power in prediction than our final
model, this could help doctors and patients better understand their odds
of being at risk for developing CHD based on the various factors
included in the model.
Further investigate the topic of false postives and false
negatives amongst the predicted values. The topic of false postives and
false negatives was addressed as potentially being a reason for the
descrepancy between the reduced model having the greatest predictive
power, yet also having the lowest global goodness. Having fewer
varaibles could possibly lead to some factors being unaccounted for, and
potentially resulting in some false positive or negative predictions.
This is a potential topic which could be further looked into in further
experiments to fully understand the occurences of false positive or
negative predictions.
Overall, this project showed that the reduced model resulted in the
lowest average predictive errors out of the three models, meaning this
was the model with the highest predictive power. However, ROC analysis
showed that the reduced model had a lower area under the curve model,
and it was the forward selection model which had the highest global
goodness as indicated by the ROC analysis. This is something important,
because although the reduced model is an ideal choice for the final
model due to its high quality of predictive power, this lower global
goodness is an important thing to keep in mind when using this final
model for the prediction of an individual’s odds of being at risk for
developing CHD in a 10-year period.
---
title: "Predicting a Patient's Odds of Being at Risk for Developing CHD- Binary Predictive Modeling"
author: "Josie Gallop"
date: "2024-10-21"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if(!require("usdm")) {
   install.packages("usdm")
   library(usdm)
}
if(!require("car")) {
   install.packages("car")
   library(car)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mice")) {
   install.packages("mice")
   library(mice)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("broom.mixed")) {
   install.packages("broom.mixed")
   library(broom.mixed)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if(!require("caret")) {
   install.packages("caret")
   library(caret)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
# Specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,  # include code chunk in the output                                           file
                   warning = FALSE,  # Sometimes, your code may produce a                                         warning
                                     # messages, you can choose to include                                        the
                                     # warning messages in the output file. 
                   message = FALSE,  
                   results = TRUE,   # you can also decide whether to                                             include 
                                     # the output in the output file.
                   comment = NA      # Suppress hash-tags in the output                                           results.
                      )   
```


# Introduction

For this project, we will be working on creating a multiple logistic regression model. The data set for this project looks at the various factors affecting an individual patient's risk of developing coronary heart disease (CHD) in a 10-year period of time. The response variable in this data set is binary with 1 representing yes, the patient is at risk of developing CHD within a 10-year period of time, and 0 representing no, the patient is not at risk of developing CHD within a 10-year period of time.

Our overall goal in this project is to create a model which has the greatest predictive performance. We will create several candidate models and compare them with each other to determine which model should be selected as the final model for the prediction of an individual's odds of being at risk for developing CHD over a 10-year period of time. We will use the cross-validation process to determine which of the candidate models provides the best predictive power. We will also use ROC analysis to determine which of the candidate models has the best global goodness. This will allow us to see which of the candidate models provides the best utility and quality for predictive modeling. 



## Data Description

I found this data set on kaggle.com on the following web page:
https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression/data

This data set takes a look at various medical and personal factors which may have an impact on an individual's risk of developing CHD within the next 10 years. The binary response variable tells whether an individual patient is at risk of developing CHD within the next 10 years, with a response value of 1 meaning yes they are at risk, and 0 meaning no they are not at risk.



## Variables

There are 16 variables in this data set.

* male: The gender of the patient. A categorical variable with 1 for "male",  0 for "female", and 2 for "other". 

* age: The age of the patient in years. This is a quantitative, continuous variable given as an integer value.

* education: The education level of the patient. A categorical variable with 1 for "less than high school", 2 for "high school diploma", 3 for "college graduate", and 4 for "post-college graduate".

* currentSmoker: Whether or not the patient is currently a smoker. A binary variable with 1 for "yes" and 0 for "no".

* cigsPerDay: The average number of cigarettes the patient smokes in a day. This is a quantitative, continuous variable. 

* BPMeds: Whether or not the patient takes blood pressure medication. A binary variable with 1 for "yes" and 0 for "no".

* prevalentStroke: Whether or not the patient has a history of strokes. A binary variable with 1 for "yes" and 0 for "no".

* prevalentHyp: Whether or not the patient has a history of hypertension. A binary variable with 1 for "yes" and 0 for "no".

* diabetes: Whether or not the patient has a history of diabetes. A binary variable with 1 for "yes" and 0 for "no".

* totChol: The patient's total cholesterol level, given in mg/dL (milligrams per deciliter). A quantitative, continuous variable. 

* sysBP: The patient's systolic blood pressure, given in mmHG (millimeters of mercury). A quantitative, continuous variable. 

* diaBP: The patient's diastolic blood pressure, given in mmHG (millimeters of mercury). A quantitative, continuous variable. 

* BMI: The patient's body mass index (BMI). A quantitative, continuous variable. 

* heartRate: The patient's heart rate, given in beats per minute. A numeric, quantitative variable. 

* glucose: The patient's glucose level, given in mg/dL (milligrams per deciliter). A numeric, quantitative variable. 

* TenYearCHD (response variable): The binary response variable of this data set which represents whether or not the patient has a 10-year risk of developing Coronary Heart Disease (CHD). This response variable is binary with 1 meaning yes, the patient does have a risk of developing CHD in the next 10 years, and 0 meaning no, the patient does not have a risk of developing CHD in the next 10 years.



## Research Questions


The key analytical question which I would like to investigate in this project is:

* Can we create a model which has the highest predictive power in regards to predicting a patient's odds of being at risk for developing coronary heart disease within a 10-year period of time based on the predictor variables in this data set?

This question will serve as the basis for creating the candidate multiple logistic regression models for this data set. We will ultimately choose a final model of these potential candidate models based upon our findings in regards to their predictive power. The findings to this question could provide utility for both patients and doctors to provide them with information relating to the odds and risks for developing CHD and which factors significantly impact these odds. 

Some further questions to consider as a starting point for this project include:

* Is there a major difference in the predictive powers of the candidate models we create?

* Do the continuous independent variables in this data set follow a normal distribution? And if not, is there a potential explanation for why this may be the case?

* How good is the predictive quality of our final model in regards of predicting whether an individual will be at risk for developing CHD in a 10-year period of time?


We will use these questions as a starting point for this project in order to create a multiple logistic regression model which has the greatest predictive power for predicting the odds of a patient being at risk for developing CHD in a 10-year period based on the many factors which play a role in this prediction. 


# Exploratory Data Analysis

First, we will read in the data set from Github and we will call it "heartdisease". 

```{r}
heartdisease <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/framingham.csv", header=TRUE)

str(heartdisease)
```


We will make a copy of the data set called "heartdisease0" which will be the data set we will analyze and transform through the exploratory data analysis. 

```{r}
heartdisease0 <- heartdisease
```



## Fixing the Missing Observations


When looking through the data set, I noticed that some observations appeared to be missing from the data set, as they had values of "NA" for certain variables. Before we can begin with the logistic regression, this is something that we should look into.

First, let's see how many values are missing for each of the variables.

```{r}
colSums(is.na(heartdisease0))
```

We can see that the variables education, cigsPerDay, BPMeds, totChol, BMI, heartRate, and glucose all have missing observations.

It is important to note that single imputation methods that could be used to fix the missing observations of the variables could lead to potential skew in their distributions. So, instead we will use multiple imputation methods to fill in these missing observations with randomly approximated values. Out of the variables with missing values, none of the variables have the majority of their data entries missing, so it seems suitable to fill in these missing observations with approximated values in order to prevent these entries from coming up as NA within the data set. However, it is important to be mindful of the fact that filling in these missing entries could still lead to a potential skew or inaccuracy in the distribution of these variables after filling in the missing observations. In order to properly handle this, we will first fix the missing values and then check the distributions of the variables to ensure that they are normally distributed, or are practically important enough variables to include within the final multiple logistic regression model.


## Multiple Imputation of Missing Observations

We will use multiple imputation to fill in the missing values of the observations. Using this multiple imputation method will use a randomly iterated method to fill in the missing observations within the data set. This will help ensure that the missing observations are filled in the best way possible by handling missing values in a way that reduces the chance of false prediction. We will use the Multiple Imputation using Chained Equations (MICE) function to use multiple imputation to fill in the missing observations. 

We will store the results of the multiple imputation in an object called "mice" for now.

```{r}
set.seed(1)
mice <- mice(heartdisease0, method = "cart")
```

We will create a new data frame called "heartdisease1" to represent the corrected data set that will have no missing values after multiple imputation. We will convert the results of the multiple imputation from a large mids object back to a regular data frame. This will be used for the future analysis and we will call this object "heartdisease1".

```{r}
heartdisease1 <-mice::complete(mice,2) 
```

Now that we have used the multiple imputation method to fill in all of the missing observations, let's double check that there are no more missing observations. All of the variables should now have a value of zero missing observations. 

```{r}
colSums(is.na(heartdisease1))
```

As we can see, all of the variables now have zero missing observations, so we successfully fixed all of the missing values by filling them in with multiple imputation by the use of the mice function. The revised data set "heartdisease1" has no missing observations, so we will use this for further analysis and for creating the multiple logistic regression model in the future steps of this project.



## Fixing Variable Types

One other thing I noticed while looking through the data set is that the variable "cigsPerDay" is given as a character variable even though it should be a numeric variable. This variable is a measure of the number of cigarettes a patient smokes in a day, so it should be reported as a numeric variable since it is a continuous range of possible values and not collected as categories. So, let's convert cigsPerDay from a character variable to a numeric variable. We will use the as.numeric() function to convert cigsPerDay to a numeric variable.

```{r}
heartdisease1$cigsPerDay <- as.numeric(heartdisease1$cigsPerDay)
```

Now, the cigsPerDay variable is correctly given as a numeric variable.

The variable of age should also be given as a numeric value, but is is not so we will convert this variable to a numeric as well.

```{r}
heartdisease1$age <- as.numeric(heartdisease1$age)
```

The variable of age is now correctly given as a numeric value. 

Additionally, the variable BPMeds is given as a character, however this is a binary variable so it should be given as an integer variable with values 0 and 1 like the other binary variables in the data set. We will convert this variable from a character to an integer. We will use the as.integer() function to do this conversion.

```{r}
heartdisease1$BPMeds <- as.integer(heartdisease1$BPMeds)
```

Now, the BPMeds variable is correctly given as a binary, integer variable.


Another issue is that the variable for education is supposed to be categorical with each number representing a different category of education level. However, this variable is given as an integer variable. So, we will fix this by making education a character variable.

```{r}
heartdisease1$education <- as.character(heartdisease1$education)
```

The education variable is now correctly given as a categorical variable.

Now, all of the variables in the data set are given in the correct types that they should be given as. 



## Checking the Variable Distributions


Before we begin with creating and comparing the multiple logistic regression models, we should first check the distributions of the variables in the data set to ensure that they meet the criteria of being approximately normally distributed, for the quantitative variables. Or, if the variables do not appear to follow a normal distribution, we will see if they are practically important enough to be included within the model building process despite their lack of a normal distribution. We will take a look at some of the variables, particuarelly some of the quanitiative variables which had missing values that were filled in through the multiple imputation process. 


### totChol Distribution

totChol was one of the variables that had some missing observations that were filled in during the previous steps. So, we should definetly check the distribution of this variable.

```{r}
ylimit = max(density(heartdisease1$totChol)$y)
hist(heartdisease1$totChol, probability = TRUE, main = "totChol Distribution", xlab="totChol", 
       col = "aliceblue", border="cornflowerblue")
  lines(density(heartdisease1$totChol, adjust=2), col="darkorchid")
```

Overall, totChol appears to be unimodal and approximately normally distributed. Therefore we can say that this variable is safe to include within the multiple logistic regression model building process and that the process of filling in the missing observations did not lead to any severe concerns for the distribution of this variable.


### BMI Distribution

BMI is another variable that had missing observations that were filled in during the previous steps. We will check its distribution to make sure it appears to be approximately normally distributed. 

```{r}
ylimit = max(density(heartdisease1$BMI)$y)
hist(heartdisease1$BMI, probability = TRUE, main = "BMI Distribution", xlab="BMI", 
       col = "aliceblue", border="cornflowerblue")
  lines(density(heartdisease1$BMI, adjust=2), col="darkorchid")
```

The distribution for BMI is unimodal and appears to be approximately normally distributed. It appears that the majority of individuals have a BMI somewhere in the range of 20-30, with the vast majority of the data falling within this range. This allows us to conclude that it is safe to include the variable of BMI within the multiple logistic regression model building process and that the process of filling in the missing values did not lead to any major concerns with the distribution of this variable. 



### Glucose Distribution

The variable glucose was another variable that had missing observations which were filled in during the previous steps. We will check its distribution to make sure that everything looks alright.

```{r}
ylimit = max(density(heartdisease1$glucose)$y)
hist(heartdisease1$glucose, probability = TRUE, main = "glucose", xlab="glucose", 
       col = "aliceblue", border="cornflowerblue")
  lines(density(heartdisease1$glucose, adjust=2), col="darkorchid")
```

The variable glucose is unimodal and appears to be approximately normally distributed with the exception of a few outliers to the right. The most common glucose value appears to be around 75, with the histogram peaking around this value. Despite the few outliers, we will keep this variable in the final model since a patient's glucose level is a medically important variable which would be helpful to know in order to assess a patient's odds of being at risk for developing CHD.  



### cigsPerDay Distribution

cigsPerDay was another variable that had missing observations which were filled in during the previous steps. So, we should check its distribution to make sure things look alright.

```{r}
ylimit = max(density(heartdisease1$cigsPerDay)$y)
hist(heartdisease1$cigsPerDay, probability = TRUE, main = "cigsPerDay", xlab="cigsPerDay", 
       col = "aliceblue", border="cornflowerblue")
  lines(density(heartdisease1$cigsPerDay, adjust=2), col="darkorchid")
```

The variable cigsPerDay appears to be very notably skewed to the right. This can be attributed to that this variable represents the number of cigarettes a patient smokes per day. Many of the patients included in the data collection were not smokers, so they would have reported a value of 0 cigarettes smoked per day. This can explain the skew to the right of this variable, because non-smokers would report a value of 0 while smokers would report a value greater than 0 that represents the number of cigarettes they smoke in a day. 
This is a pratically important variable regardless of its skew, because smoking is a major risk factor for conditions like CHD. Additionally, this is an important piece of information, because doctors would want to know how many cigarettes their patients smoke in a day, because this has a significant impact on the patient's risk factors. So, we will keep this variable in the final model despite its skew because it is an important variable in regards to a patient's potential risk factors for developing CHD. But, it is important to keep in mind that this variable is skewed due to it being representative of both non-smokers and smokers. 


### heartRate Distribution

The variable heartRate was another one which had some missing observations that needed to be filled in during previous steps. We will check its distribution to ensure that everything looks alright. 

```{r}
ylimit = max(density(heartdisease1$heartRate)$y)
hist(heartdisease1$heartRate, probability = TRUE, main = "heartRate", xlab="heartRate", 
       col = "aliceblue", border="cornflowerblue")
  lines(density(heartdisease1$heartRate, adjust=2), col="darkorchid")
```

The distribution for heart rate is unimodal and approximately normally distributed without any notable skew or outliers that are apparent in its density plot. This allows us to conclude that this variable is safe to include within the multiple logistic regression model building process and that the process of filling in the missing values did not lead to any issues in its distribution.


### Overall Findings

We have checked the distributions of the variables which had missing observations that were filled in during the previous steps of this project along with other important quantitative variables in this data set. We ensured that the variables either follows a distribution that appears to be approximately normal, or are practically important enough to be included in the multiple logistic regression model building process otherwise. 

The only variable that did not appear to follow an approximately normal distribution was cigsPerDay. However, the skewness seen in this variable's distribution can be attributed to the fact that this data was collected amongst both smokers and non-smokers. So, non-smokers reported values of 0 cigarettes smoked per day, while smokers reported values that were greater than zero. This appeared to the distribution for this variable appearing to be skewed to the right. However, this is a variable that is practically important because the number of cigarettes a patient smokes per day can have a significant impact on their odds of being at risk for developing CHD. Due to the practical importance of this variable, it must still be included within the multiple logistic regression model building process.

Altogether, we have ensured that the variables in this data set appear to be good enough to use within the model building process. So, now we will begin with the process of building the candidate multiple logistic regression models. 




# Variable Transformation- Standardizing the Quantitative Variables


The main goal for this project is to find a model with the best predictive performance and power. In order to do so, we will standardize the quantitative variables in the data set to ensure that our data will be its best for predictive performance. 

We will standardize all of the numeric, quantitative variables in order to give our data the best predictive power for the future model building we will conduct in the following steps. 

```{r}
# Age
heartdisease1$sd.age = (heartdisease1$age-mean(heartdisease1$age))/sd(heartdisease1$age)

# cigsPerDay
heartdisease1$sd.cigsPerDay = (heartdisease1$cigsPerDay-mean(heartdisease1$cigsPerDay))/sd(heartdisease1$cigsPerDay)

# totChol
heartdisease1$sd.totChol = (heartdisease1$totChol-mean(heartdisease1$totChol))/sd(heartdisease1$totChol)

# sysBP
heartdisease1$sd.sysBP = (heartdisease1$sysBP-mean(heartdisease1$sysBP))/sd(heartdisease1$sysBP)

# diaBP
heartdisease1$sd.diaBP = (heartdisease1$diaBP-mean(heartdisease1$diaBP))/sd(heartdisease1$diaBP)

# BMI
heartdisease1$sd.BMI = (heartdisease1$BMI-mean(heartdisease1$BMI))/sd(heartdisease1$BMI)

# heartRate
heartdisease1$sd.heartRate = (heartdisease1$heartRate-mean(heartdisease1$heartRate))/sd(heartdisease1$heartRate)

# glucose
heartdisease1$sd.glucose = (heartdisease1$glucose-mean(heartdisease1$glucose))/sd(heartdisease1$glucose)
```

We will create a new data object to store the new standardized variables along with all of the binary and categorical predictor variables from the previous data set. We will call the new data set with all of the standardized quantitative variables "sd.heartdisease". 

```{r}
# New data set with original, non-standardized variables dropped.
sd.heartdisease = select(heartdisease1, -2,-5,-10:-15)
```

Now, all of the quantitative, numerical variables in the original data set have been standardized in order to ensure that our future model will have the best predictive power because the goal of this project is to create a model with the best predictive power and performance. 




# Data Split

We will now split the data up into training and testing data. We will randomly split the data so that 80% of the data will be used for training and 20% of the data will be used for testing. By splitting the data up into the training and the testing data, this will allow us to build the candidate models and analyze their predictive performances. We will use the split data to conduct the cross validation process to analysis the predictive power and performances of each of the three candidate models we will investigate. This will allow us to eventually determine which of these three models is the ideal choice for our final model based upon its overall predictive power. 

```{r}
# Split the data into 80% training and 20% testing.
n <- dim(sd.heartdisease)[1]
train.n <- round(0.8*n)
train.id <- sample(1:n, train.n, replace = FALSE)

train <- sd.heartdisease[train.id, ]
test <- sd.heartdisease[-train.id, ]
```

Now, our data has been split up into a training and a testing data set, with 80% of the data for training and the remaining 20% for testing. The observations were split up randomly amongst these two data sets. We will specifically use the training data for the cross validation process. This process will allow us to identify which multiple logistic regression model is the best model for the purpose of prediction. This will allow us to select a model which predicts the odds of an individual being at risk for developing CHD with the highest quality of prediction. 




# Best Model Identification

Now that our variables have been checked and standardized to ensure their predictive performance being the best that it possibly can be, we will begin with the model building process. We will look at three potential, candidate models for the multiple logistic regression and determine which one has the best predictive performance. 

## Descriptions of the Three Candidate Models 

We will take a look into the specifications of the three candidate multiple logistic regression models. 


### Full Model

The first of the three models will be the full model, with all of the variables included. This full model will use all of the variables in order to create a multiple logistic regression to predict the odds of a patient being at risk for developing CHD in a 10-year period. 

The  regression equation for this full model is represented as:

TenYearCHD = male + sd.age + education + currentSmoker + sd.cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + sd.totChol + sd.sysBP + sd.diaBP + sd.BMI + sd.heartRate + sd.glucose

We will find the specific multiple logisitic regression equation for this full model.

```{r}
full.model = glm(TenYearCHD ~ male + sd.age + education + currentSmoker + sd.cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + sd.totChol + sd.sysBP + sd.diaBP + sd.BMI + sd.heartRate + sd.glucose, 
          family = binomial(link = "logit"),  
          data = sd.heartdisease)  
kable(summary(full.model)$coef, 
      caption = "Full Model Summary of the Inferential Statistics")
```

The equation for the multiple logistic regression equation of the full model is given as follows:

log p/(1-p) = -2.2024 + 0.4852 * male + 0.5169 * sd.age - 0.2356 * education2 - 0.1026 * education3 + 0.0115 * education4 + 0.0219 * currentSmoker+ 0.2507 * sd.cigsPerDay + 0.3270 * BPMeds + 0.9389 * prevalentStroke + 0.2312 * prevalentHyp + 0.0912 * diabetes + 0.0803 * sd.totChol + 0.3075 * sd.sysBP - 0.0327 * sd.diaBP + 0.0071 * sd.BMI - 0.0137 * sd.heartRate + 0.1730 * sd.glucose


### Reduced Model

The second of the three models will be the reduced model. This model will consist only of the most practically important variables within the context of this situation, using various factors to best predict the odds of an individual being at risk for developing CHD. The United States National Library of Medicine of the National Institutes of Health (NIH), the most notable risk factors for an individual being at risk of developing CHD are cigarette smoking, blood pressure, and cholesterol levels. In this data set, there are two variables related to smoking, currentSmoker, whether a patient is currently a smoker or not, and cigsPerDay which is was scaled to sd.cigsPerDay, how many cigarettes a patient smokes in a day. There are also two variables related to blood pressure, sysBP and diaBP, for systolic blood pressure and diastolic blood pressure. Both of these blood pressure variables were scaled and are now given as sd.sysBP and sd.diaBP. There is also a variable included in the data set for a patient's total cholesterol levels, totChol, which was scaled and is now given as sd.totChol. These five variables in the data set represent the key risk factors that the medical report from the NIH stated as being the most significant in predicting a patient's risk for CHD. So, our reduced model will consist only of these particular variables. 

The general regression equation for this reduced model is represented as:

TenYearCHD = currentSmoker + sd.cigsPerDay + sd.sysBP + sd.diaBP + sd.totChol

We will find the specific multiple logistic regression equation for this reduced model.

```{r}
reduced.model = glm(TenYearCHD ~ currentSmoker + sd.cigsPerDay + sd.sysBP + sd.diaBP + sd.totChol, 
          family = binomial(link = "logit"),
          data = sd.heartdisease) 
kable(summary(reduced.model)$coef, 
      caption = "Reduced Model Summary of the Inferential Statistics")
```

The equation for the multiple logistic regression equation of the reduced model is given as follows:

log p/(1-p) = -1.7675 - 0.1443 * currentSmoker + 0.2683 * sd.cigsPerDay + 0.6381 * sd.sysBP - 0.1409 * sd.diaBP + 0.1114 * sd.totChol


### Forward Selection Model

The third of three candidate models will be the model that was found through the automatic variable selection process, specifically with using forward selection. We will call this the fs.model for this project to represent it being the model found through the use of forward selection of the variables. 

We will find the specific multiple logistic regression equation for this forward selection model.

```{r}
fs.model = stepAIC(reduced.model, scope = list(lower=formula(reduced.model), upper=formula(full.model)),
                      direction = "forward",  
                      trace = 0   
                      )
kable(summary(fs.model)$coef, 
      caption="Final Model Summary of the Inferential Statistics")
```

The equation for the multiple logistic regression equation of the forward selection model is given as follows:

log p/(1-p) = -2.2898 + 0.0084 * currentSmoker + 0.2497 * sd.cigsPerDay + 0.2497 * sd.sysBP - 0.0325 * sd.diaBP + 0.0755 * sd.totChol + 0.5364 * sd.age + 0.5164 * male + 0.1826 * sd.glucose + 0.9413 * prevalentStroke + 0.2310 * prevalentHyp + 0.3233 * BPMeds 

The variables that were kept in this model after forward selection were currentSmoker, sd.cigsPerDay, sd.sysBP, sd.diaBP, sd.totChol, sd,age, male, sd.glucose, prevalentStroke, prevalentHyp, and BPMeds. 


## Cross Validation

Now that we've looked into the specifications of the three potential candidate models, we will use cross-validation to determine which of these three models we should use. We want to see which model has the best predictive performance. We will use a 5-fold cross validation to conduct the process of finding the predictive errors for each of the three candidate models. We will use this 5-fold cross validation process to calculate the average predictive errors for each of the three candidate models, the full model, the reduced model, and the forward selection model. We are looking for which model has the lowest value of average predictive errors. This model with the lowest value for the average predictive errors will be the ideal choice, as having the lowest predictive error value indicates a higher predictive power and more accurate predictions than a model with higher average predictive error would. 

For our cross validation process, we will use a cut off probability of 0.5 to define between the binary response values of 1, yes, and 0, no. These values represent whether an individual will be stated as being at risk for developing CHD in a 10-year period based upon the multiple logistic regression model that was used for prediction.

We will label the full model as our first model (01), the reduced model as our second model (02), and the forward selection model as our third model (03) for this cross-validation process. 

```{r}
## 5 Fold Cross Validation
k = 5
fold.size = floor(dim(train)[1]/k)
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)

# Training and Testing Fold for the Models
for(i in 1:k){
  
 valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = train[valid.id, ]
  train.data = train[-valid.id,]
  
# Full Model: all variables
full.model = glm(TenYearCHD ~ male + sd.age + education + currentSmoker +  sd.cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + sd.totChol + sd.sysBP + sd.diaBP + sd.BMI + sd.heartRate + sd.glucose, 
          family = binomial(link = "logit"),  
          data = train.data)

## Reduced Model: only the most practically important variables.
reduced.model = glm(TenYearCHD ~ currentSmoker + sd.cigsPerDay + sd.sysBP + sd.diaBP + sd.totChol, 
          family = binomial(link = "logit"),
          data = train.data) 

## Automatic Forward Selection (fs) Model: model found using forward selection
fs.model = stepAIC(full.model, 
                      scope = list(lower=formula(reduced.model),   
                      upper=formula(full.model)),
                      direction = "forward",  
                      trace = 0   
                      )

# Finding the predicted probabilities for each of the candidate models
   pred.full = predict(full.model, newdata = valid, type = "response")
   pred.reduced = predict(reduced.model, newdata = valid, type = "response")
   pred.fs = predict(fs.model, newdata = valid, type = "response")
   
   pre.outcome01 = ifelse(as.vector(pred.full) > 0.5, "1", "0")
   pre.outcome02 = ifelse(as.vector(pred.reduced) > 0.5, "1", "0")
   pre.outcome03 = ifelse(as.vector(pred.fs) > 0.5, "1", "0")
   
   PE1[i] = sum(pre.outcome01 == valid$TenYearCHD)/length(pred.full)
   PE2[i] = sum(pre.outcome02 == valid$TenYearCHD)/length(pred.reduced)
   PE3[i] = sum(pre.outcome03 == valid$TenYearCHD)/length(pred.fs)
}

# Finding the average predictive errors of each candidate model  
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(avg.pe, caption = "Average Predictive Errors of the Three Candidate Models") 
```

The average predictive errors of the three candidate models are given as follows:

The full model, represented by PE1, had an average predictive error of 0.8546. 
The reduced model, represented by PE2, had an average predictive error of 0.8495. 

The forward selection model, represented by PE3, had an average predictive error of 0.8546.

We can see that both the full model and the forward selection model had equal values for their average predictive errors, with both having a value of 0.8546. However, the reduced model a lower average predictive error value than the other two models, with a value of 0.8496. This indicates that the reduced model is the one which we should choose as our final model.

The cross-validation process of calculating the average predictive errors for the three candidate models showed that the reduced model is the ideal one to use for prediction due to it having the lowest value of its average predictive error out of the three candidate models. This makes the reduced model have the highest predictive power and performance out of the three candidate models, as shown by how it has the least amount of its average predictive errors in comparison to the full model and the forward selection model. So, we will move forward with the use of the reduced model as our final multiple logistic regression model for prediction of the odds of an individual being at risk for developing CHD. 


## Final Model

In the previous cross-validation step, we used a cut off probability of 0.5 to identify between the response variable giving a positive or negative result. In this case, a "positive" result, or an output of 1, represents that an individual is at risk for developing CHD in a 10-year period. And, a "negative" result, or an output of 0, represents that an individual is not at risk for developing CHD in a 10-year period. We used the cross-validation process with this cut off probability of 0.5 to find the values of the average predictive errors of the three candidate models. We found that the reduced model was the one with the best predictive performance, so we will use that as our final model. 

However, in order to report the details of the final model, we must find its actual accuracy based upon the test data. We previously split the data up into a training and a testing data set. We will use the test data set to find the actual accuracy of the final model, the reduced model, in order to provide the fullest accuracy in the report of this model. 

```{r}
pred.reduced = predict(reduced.model, newdata = test, type = "response")
pred.reduced.outcome = ifelse(as.vector(pred.reduced) > 0.5, "1", "0")

Accuracy = sum(pred.reduced.outcome == test$TenYearCHD)/length(pred.reduced)
kable(Accuracy, caption="Actual Accuracy of the Final Model (Reduced Model)")
```

The actual accuracy of the final model, the reduced model, is given as 0.8443. This value represents the accuracy rate of this final model. 

It appears that the final model has good accuracy as seen by its accuracy rate of 0.8443. This indicates that the final model we selected has good predictive performance for predicting the odds of an individual being at risk for developing CHD within a 10-year period. 



# ROC Analysis

We will use ROC analysis to measure the global performance of the multiple logistic regression model. This plot will illustrate the relationship of the true positive rate (TPR), also known as the sensitivity, against the false positive rate (FPR), also given as 1 - specificity. We will plot the ROC curves for each of the three candidate models to compare them against one another. We will also calculate the area under the curve (AUC) values for each of the three candidate models. The value of area under the curve represents the global goodness of the model, and a higher AUC value is ideal. A higher value of the area under the curve indicates better and more quality global goodness of that model. 

We will plot the ROC curves for the full model, the reduced model, and the forward selection model. This will allow us to compare the area under the curves for each of the three models. Additionally, we will calculate the AUC values for each of the three models to compare them with each other and determine which model indicates the best global goodness. 


```{r}
TPR.FPR=function(pred){
  prob.seq = seq(0, 1, length=50)  
  pn = length(prob.seq)
  true.lab = as.vector(train$TenYearCHD)
  TPR = NULL
  FPR = NULL
  for (i in 1:pn){
   pred.lab = as.vector(ifelse(pred > prob.seq[i],"1", "0"))
   TPR[i] = length(which(true.lab=="1" & pred.lab=="1"))/length(which(true.lab=="1"))
   FPR[i] = length(which(true.lab=="0" & pred.lab=="1"))/length(which(true.lab=="0"))
  }
 cbind(FPR = FPR, TPR = TPR)
}
```

We will use the training data set to help find the predictive probabilities for each of the three candidate models. 

```{r}
# The Full Model
full.model = glm(TenYearCHD ~male + sd.age + education + currentSmoker +  sd.cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + sd.totChol + sd.sysBP + sd.diaBP + sd.BMI + sd.heartRate + sd.glucose, 
          family = binomial(link = "logit"),  
          data = train)

# The Reduced Model (final model)
reduced.model = glm(TenYearCHD ~ currentSmoker + sd.cigsPerDay + sd.sysBP + sd.diaBP + sd.totChol, 
          family = binomial(link = "logit"),
          data = train) 

# The Forward Selection (fs) Model
fs.model = stepAIC(full.model, 
                      scope = list(lower=formula(reduced.model),   
                      upper=formula(full.model)),
                      direction = "forward",  
                      trace = 0   
                      )

# Finding the Predicted Probabilities for Each Model
pred.full = predict.glm(full.model, newdata = train, type = "response") 
pred.reduced = predict.glm(reduced.model, newdata = train, type = "response")
pred.fs = predict.glm(fs.model, newdata = train, type = "response")
```

Now, we will plot the ROC curves for each of the three candidate models to compare them. 

```{r}
# Plotting the ROC curves
 plot(TPR.FPR(pred.full)[,1], TPR.FPR(pred.full)[,2], 
      type="l", col = 2, lty = 1, xlim = c(0,1), ylim = c(0,1),
      xlab = "FPR: 1 - Specificity",
      ylab ="TPR: Sensitivity",
      main = "ROC Curves for the Three Candidate Models",
      cex.main = 0.8,
      col.main = "navy")
 lines(TPR.FPR(pred.reduced)[,1], TPR.FPR(pred.reduced)[,2],  col=3, lty=2)
 lines(TPR.FPR(pred.fs)[,1], TPR.FPR(pred.fs)[,2],  col=4, lty=3)    
 
  category = train$TenYearCHD == "1"
  ROCobj01 <- roc(category, as.vector(pred.full))
  ROCobj02 <- roc(category, as.vector(pred.reduced))
  ROCobj03 <- roc(category, as.vector(pred.fs))
  AUC01 = round(auc(ROCobj01),4)
  AUC02 = round(auc(ROCobj02),4)
  AUC03 = round(auc(ROCobj03),4)
  
  legend("bottomright", c(paste("Full model: AUC = ", AUC01), 
                         paste("Reduced model: AUC =", AUC02),
                         paste("Forward Selection model: AUC =", AUC03)),
        col = 2:4, lty = 1:3, cex = 0.8, bty = "n")
```


We can see that both the full model and the forward selection model fall on the exact same curve, which is also evidenced by how these two models have the same values for their area under the curve Both the full model and the forward selection model have an area under the curve of 0.7356, and their ROC curves match up identically to each other. This matches up with the previous finding of the full model and the forward selection model having the exact same values for their average predictive errors. 

Something interesting found in this ROC analysis, is that the reduced model actually has the smallest value of area under the curve with 0.6763. We found that the reduced model was the one with the lowest average predictive errors previously, which made it the ideal model choice for our final model. However, this model turned out to have the lowest area under the curve out of the three, which is not ideal. A model with a higher area under the curve has a better global goodness than a model with a lower area under the curve. 

Solely based upon area under the curve values found in the ROC analysis, it would appear that the forward selection model provides the best global goodness. Both the full model and the forward selection model had the same value for their area under the curve, but the forward selection model would be the better choice due to it having less variables.

The ROC analysis revealed that the forward selection model may in fact be a good alternative to use as the final model due to it having a higher area under the curve than the reduced model, despite the reduced model having a lower average value of predictive errors as was found earlier. A potential reason for the forward selection model having a higher AUC than the reduced model despite the reduced model having a lower and more ideal average value of predictive errors could be due to the reduced model having less variables. It could be a possibility that the reduced model having fewer variables could have a higher likelihood of false positivies or false negatives due to some of the variables which may be significant not having been included in the reduced model. 



# Conclusion

In this project, three candidate models were created to compare their predictive powers and ultimately choose a final model that provides the best significance and quality of prediction for predicting the odds of an individual being at risk for developing CHD in a 10-year period. These models all consisted of multiple logistic regression to predict the binary response variable of an individual's odds of being at risk for developing CHD. Creating a model for the purpose of this prediction can provide use for doctors and patients in order to predict their odds of being at risk for developing CHD based upon various personal and medical factors. 

The three candidate models that were created included a full model with all of the variables in the data set, a reduced model with only the variables deemed as the most practically important, and a model created through forward variable selection. These three models were compared by their predictive power and quality for their potential use as a final multiple logistic regression model for this project. 

The average predictive errors of all three candidate models were compared to assess their predictive powers through the process of cross-validation. It was found that the full model and the forward selection model had identical values for the average predictive errors. However, the reduced model turned out to have the lowest value of its average predictive errors out of the three candidate models. This indicated that the reduced model was the ideal choice due to it having the lowest value of average predictive errors, and therefore the highest quality of predictive power. This indicated that the reduced model would be a good choice for the final model due to its high quality predictive power shown by how it had the smallest value of average predictive errors out of all three candidate models. 

Furthermore, ROC analysis was conducted to continue the analysis of the three candidate models. It was found that the full model and the forward selection model have identical ROC curves and area under the curve values. This matched up with how it was previously found that the full model and the forward selection model have the exact same average value of their predictive errors. Between these two models, we would choose the forward selection model over the full model due to it having less variables. However, we also found that the reduced model in fact had the lowest area under the curve value out of the three models. This is interesting as it contradicts with the reduced model previously being shown to have the lowest average value of predictive errors. It turns out that despite the reduced model having the lowest average of predictive errors and therefore the highest predictive power, it also has the lowest area under the curve meaning that it has lower global goodness out of the candidate models.

Overall, it was found that through cross-validation it turned out that the reduced model had the lowest average value of predictive errors, and therefore the highest predictive power. This made the reduced model the ideal choice for the selection of our final model. This reduced model focused on only the most practically important variables for predicting the odds of an individual being at risk for developing CHD. These most practically important variables included currentSmoker, sd.cigsPerDay, sd.sysBP, sd.diaBP, and sd.totChol. However, it was later found through ROC analysis that the reduced model had the lowest area under the curve value, indicating a lower global goodness. It was in fact the forward selection model that the ROC analysis suggested as being the ideal choice based upon its global goodness. The forward selection model included the variables of currentSmoker, sd.cigsPerDay, sd.sysBP, sd.diaBP, sd.totChol, sd,age, male, sd.glucose, prevalentStroke, prevalentHyp, and BPMeds. 

Even though the reduced model provided the best predictive power due to its lower value of average predictive errors, the forward selection model was shown through ROC analysis to have the highest global goodness. Both the reduced model and the full model are high quality models for the prediction and estimation of the odds of an individual being at risk for developing CHD. This discrepancy between the reduced model having the highest predictive power yet the lowest global goodness could in fact be due to it having the fewest variables. By having fewer variables, this could potentially lead to higher rates of false positives and false negatives than the other models with more variables accounted for. So overall, the reduced model would be the ideal choice as the final model for a model with the highest predictive power, which is what we were looking for in this project. However, it is important to keep in mind that this reduced model has a lower global goodness than the forward selection model, likely due to the reduced model having fewer variables and potentially a higher rate for false positives and false negatives. This is important to keep in mind when creating a model for the prediction of an individual's odds of being at risk for developing CHD, as high predictive power is ideal for ensuring that patient's health is accounted for properly, but it is also important to be wary of the potential for false postives and false negatives in regards of these predictions. 



## Recommendations

Some recommendations I would make for future projects include:

* Expand the data collection to ensure the accuracy of the findings found in this project. This could also involve reaching out to various hospitals and doctors for a better understanding of the most serious risk factors for CHD in order to ensure that the variables kept within the final model indeed are representative of a patient's odds of being at risk for developing CHD.

* Consider other variables which could be statistically significant in predicting the odds of a patient's risk for developing CHD. All of the predictor variables that were included within this data set made sense in terms of this project and were all factors which could play a role in a patient's odds of being at risk for developing CHD. However, future projects could consider looking into additional variables that may be statistically significant in predicting the odds of a patient being at risk for developing CHD. For instance, one potential variable that could be looked into is a patient's income to see if the income of a patient has an impact on their odds of being at risk for developing CHD. A reason why I believe that this could be a significant variable for this model is because patients with higher incomes tend to have more options when it comes to medical care than patients with lower incomes would. So, perhaps there could be a relationship between a patient's income and their odds of being at risk for developing CHD. This is just one example of a variable that is not in this data set which may provide use for future projects.

* Consider other possible models that could provide quality predictive power in predicting the odds of a patient being at risk for developing CHD within a 10-year period of time. In this project, we looked at three candidate multiple logistic regression models and chose which one provided the best utility. These three models included a full model with all of the predictor variables left in the model, a reduced model with only the variables that the medical report described as most significant risk factors for CHD, and a final model that was created through automatic forward variable selection. We found that the reduced model was the one which provided the lowest average value for predictive errors out of the three models, and so we chose this to be our final model due to it providing the strongest predictive power. However, this is not to say that there is not any other possible model that could potentially be better than the one we created. If further projects found a model that provided better and more power in prediction than our final model, this could help doctors and patients better understand their odds of being at risk for developing CHD based on the various factors included in the model. 

* Further investigate the topic of false postives and false negatives amongst the predicted values. The topic of false postives and false negatives was addressed as potentially being a reason for the descrepancy between the reduced model having the greatest predictive power, yet also having the lowest global goodness. Having fewer varaibles could possibly lead to some factors being unaccounted for, and potentially resulting in some false positive or negative predictions. This is a potential topic which could be further looked into in further experiments to fully understand the occurences of false positive or negative predictions. 


Overall, this project showed that the reduced model resulted in the lowest average predictive errors out of the three models, meaning this was the model with the highest predictive power. However, ROC analysis showed that the reduced model had a lower area under the curve model, and it was the forward selection model which had the highest global goodness as indicated by the ROC analysis. This is something important, because although the reduced model is an ideal choice for the final model due to its high quality of predictive power, this lower global goodness is an important thing to keep in mind when using this final model for the prediction of an individual's odds of being at risk for developing CHD in a 10-year period. 




# References

This data set was found on kaggle.com under the collection of logistic regression data sets. Included below is the citation of the web page of where I found the data set I used for my project along with the medical report I found which described which factors are the most medically significant when it comes to determining a patient's risk for developing CHD.


Dileep. (2019, June 7). Logistic regression to predict heart disease. Kaggle. https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression?resource=download&select=framingham.csv 

Hajar, R. (2017). Risk factors for coronary artery disease: Historical perspectives. Heart views : the official journal of the Gulf Heart Association. https://pmc.ncbi.nlm.nih.gov/articles/PMC5686931/ 

