In this project, we build a multiple logistic regression model. The data set examines the factors affecting an individual patient’s risk of developing coronary heart disease (CHD) within a 10-year period. The response variable is binary: 1 means the patient is at risk of developing CHD within 10 years, and 0 means the patient is not.
I found this data set on kaggle.com on the following web page: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression/data
The data set covers a range of medical and personal factors that may affect an individual’s risk of developing CHD within the next 10 years, along with the binary response variable described above.
There are 16 variables in this data set.
male: The gender of the patient. A categorical variable with 1 for “male”, 0 for “female”, and 2 for “other”.
age: The age of the patient in years. This is a quantitative, continuous variable given as an integer value.
education: The education level of the patient. A categorical variable with 1 for “less than high school”, 2 for “high school diploma”, 3 for “college graduate”, and 4 for “post-college graduate”.
currentSmoker: Whether or not the patient is currently a smoker. A binary variable with 1 for “yes” and 0 for “no”.
cigsPerDay: The average number of cigarettes the patient smokes in a day. This is a quantitative, continuous variable.
BPMeds: Whether or not the patient takes blood pressure medication. A binary variable with 1 for “yes” and 0 for “no”.
prevalentStroke: Whether or not the patient has a history of strokes. A binary variable with 1 for “yes” and 0 for “no”.
prevalentHyp: Whether or not the patient has a history of hypertension. A binary variable with 1 for “yes” and 0 for “no”.
diabetes: Whether or not the patient has a history of diabetes. A binary variable with 1 for “yes” and 0 for “no”.
totChol: The patient’s total cholesterol level, given in mg/dL (milligrams per deciliter). A quantitative, continuous variable.
sysBP: The patient’s systolic blood pressure, given in mmHg (millimeters of mercury). A quantitative, continuous variable.
diaBP: The patient’s diastolic blood pressure, given in mmHg (millimeters of mercury). A quantitative, continuous variable.
BMI: The patient’s body mass index (BMI). A quantitative, continuous variable.
heartRate: The patient’s heart rate, given in beats per minute. A numeric, quantitative variable.
glucose: The patient’s glucose level, given in mg/dL (milligrams per deciliter). A numeric, quantitative variable.
TenYearCHD (response variable): The binary response variable of this data set which represents whether or not the patient has a 10-year risk of developing Coronary Heart Disease (CHD). This response variable is binary with 1 meaning yes, the patient does have a risk of developing CHD in the next 10 years, and 0 meaning no, the patient does not have a risk of developing CHD in the next 10 years.
The key analytical question which I would like to investigate in this project is:
This question will serve as the basis for creating the logistic regression model for this data set. For now, we will create only a simple logistic regression model with one of the predictor variables. We will still ask whether this model predicts, with statistical significance, a patient’s odds of being at risk for developing CHD within a 10-year period. The answer could be useful to both patients and doctors, since it tells them about the odds of developing CHD and which factors significantly affect those odds.
Some further questions to consider as a starting point for this project include:
Which predictor variables provide the greatest statistical significance in predicting the patient’s odds of being at risk for developing CHD in a 10-year period?
Do the continuous independent variables in this data set follow a normal distribution? And if not, is there a potential explanation for why this may be the case?
How accurately can our model predict who will be at risk for developing CHD in a 10-year period of time?
We will use these questions as a starting point for this project in order to create a multiple logistic regression model which statistically significantly predicts the odds of a patient being at risk for developing CHD in a 10-year period based on the many factors which play a role in this prediction.
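For reference, a multiple logistic regression model with k predictors can be written as follows. The symbols p (the probability that TenYearCHD equals 1), x_j (the predictors), and β_j (the coefficients) are notation introduced here for illustration and do not come from the data set itself.

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,
\qquad
p = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}
```

Under this form, holding the other predictors fixed, a one-unit increase in x_j multiplies the odds p / (1 - p) by e^{β_j}.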
First, we will read in the data set from GitHub and call it “heartdisease”.
heartdisease <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/framingham.csv", header=TRUE)
str(heartdisease)
'data.frame': 4238 obs. of 16 variables:
$ male : int 1 0 1 0 0 0 0 0 1 1 ...
$ age : int 39 46 48 61 46 43 63 45 52 43 ...
$ education : int 4 2 1 3 3 2 1 2 1 1 ...
$ currentSmoker : int 0 0 1 1 1 0 0 1 0 1 ...
$ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
$ BPMeds : int 0 0 0 0 0 0 0 0 0 0 ...
$ prevalentStroke: int 0 0 0 0 0 0 0 0 0 0 ...
$ prevalentHyp : int 0 0 0 1 0 1 0 0 1 1 ...
$ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
$ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
$ sysBP : num 106 121 128 150 130 ...
$ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
$ BMI : num 27 28.7 25.3 28.6 23.1 ...
$ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
$ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
$ TenYearCHD : int 0 0 0 1 0 0 1 0 0 0 ...
When looking through the data set, I noticed that some observations appeared to be missing from the data set, as they had values of “NA” for certain variables. Before we can begin with the logistic regression, this is something that we should look into.
First, let’s see how many values are missing for each of the variables.
colSums(is.na(heartdisease))
male age education currentSmoker cigsPerDay
0 0 105 0 29
BPMeds prevalentStroke prevalentHyp diabetes totChol
53 0 0 0 50
sysBP diaBP BMI heartRate glucose
0 0 19 1 388
TenYearCHD
0
We can see that the variables education, cigsPerDay, BPMeds, totChol, BMI, heartRate, and glucose all have missing observations.
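To put these counts in context, the sketch below converts them to percentages of the 4238 observations reported by str(). The names na_counts, n_obs, and na_pct are introduced here for illustration only; the counts themselves are copied from the output above.

```r
# Missing-value counts copied from the colSums() output above
na_counts <- c(education = 105, cigsPerDay = 29, BPMeds = 53,
               totChol = 50, BMI = 19, heartRate = 1, glucose = 388)
n_obs <- 4238  # number of observations reported by str()

# Percentage of observations missing for each variable, largest first
na_pct <- sort(round(100 * na_counts / n_obs, 2), decreasing = TRUE)
print(na_pct)
```

On the actual data frame, colMeans(is.na(heartdisease)) * 100 should give the same percentages directly.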
Now that we know which variables are missing observations, we can see which specific observations are missing from these variables.
sapply(heartdisease, function(x) which(is.na(x)))
$male
integer(0)
$age
integer(0)
$education
[1] 34 37 73 185 214 294 306 307 320 401 413 430 473 500 504
[16] 623 695 720 738 783 820 917 944 967 981 1028 1039 1071 1073 1076
[31] 1130 1139 1180 1228 1240 1254 1260 1287 1289 1315 1376 1381 1395 1479 1605
[46] 1623 1642 1655 1676 1682 1731 1952 1972 1985 1999 2061 2139 2251 2268 2294
[61] 2347 2431 2469 2519 2543 2596 2670 2749 2785 2847 2883 2886 2902 2912 2929
[76] 3013 3034 3035 3114 3149 3161 3220 3235 3259 3291 3311 3471 3486 3515 3589
[91] 3604 3619 3656 3674 3764 3765 3869 3874 3944 4013 4083 4099 4122 4123 4139
$currentSmoker
integer(0)
$cigsPerDay
[1] 132 140 1047 1293 1348 1452 1498 1611 1626 1871 1964 1981 2406 2514 2543
[16] 3022 3035 3095 3107 3109 3157 3178 3310 3433 3580 3716 3848 3925 3943
$BPMeds
[1] 50 78 194 246 315 396 422 766 770 798 999 1003 1045 1105 1123
[16] 1178 1207 1285 1302 1567 1574 1617 1722 1858 1862 1914 1984 1986 1987 2003
[31] 2075 2121 2174 2182 2368 2609 2646 2739 2836 2944 3227 3314 3374 3376 3527
[46] 3645 3738 3792 3817 4009 4140 4163 4236
$prevalentStroke
integer(0)
$prevalentHyp
integer(0)
$diabetes
integer(0)
$totChol
[1] 43 155 248 430 568 578 610 674 823 835 872 952 1106 1123 1318
[16] 1355 1360 1449 1576 1745 1748 1786 1880 1941 2004 2009 2080 2210 2264 2341
[31] 2342 2418 2584 2590 2613 2806 2873 2903 3111 3112 3150 3286 3577 3608 3631
[46] 3661 3961 3962 3989 4186
$sysBP
integer(0)
$diaBP
integer(0)
$BMI
[1] 98 295 706 1156 1162 1595 1605 1625 1748 1976 2049 2068 2092 2178 2530
[16] 2720 2926 3091 3340
$heartRate
[1] 690
$glucose
[1] 15 22 27 43 55 71 112 115 132 155 204 212 216 217 247
[16] 248 251 264 274 280 283 295 297 302 303 310 316 330 339 344
[31] 346 356 383 408 414 419 426 428 429 434 437 449 456 457 468
[46] 488 491 500 512 520 541 553 564 568 577 578 592 610 646 662
[61] 674 676 680 691 706 715 756 758 761 778 779 780 813 814 823
[76] 830 833 873 884 897 906 918 923 937 940 943 973 996 1000 1020
[91] 1029 1064 1065 1102 1116 1119 1120 1123 1143 1162 1173 1176 1182 1191 1203
[106] 1212 1215 1235 1251 1276 1298 1318 1328 1336 1348 1355 1360 1362 1369 1378
[121] 1396 1401 1409 1412 1430 1449 1471 1494 1507 1522 1525 1538 1541 1570 1575
[136] 1576 1582 1618 1646 1648 1659 1672 1682 1688 1693 1695 1707 1717 1721 1724
[151] 1745 1748 1759 1771 1776 1779 1786 1787 1801 1809 1816 1822 1831 1846 1861
[166] 1867 1870 1884 1886 1916 1925 1927 1933 1941 1959 1975 1979 1981 1988 2004
[181] 2007 2009 2011 2015 2023 2027 2039 2046 2050 2065 2069 2080 2082 2086 2097
[196] 2104 2105 2108 2109 2110 2111 2119 2135 2161 2162 2175 2184 2210 2217 2220
[211] 2255 2263 2264 2270 2271 2274 2315 2316 2341 2342 2356 2376 2399 2410 2418
[226] 2432 2463 2508 2524 2525 2531 2555 2568 2571 2574 2586 2590 2594 2610 2613
[241] 2616 2627 2639 2641 2656 2659 2690 2696 2697 2700 2711 2712 2728 2729 2753
[256] 2757 2764 2772 2774 2793 2801 2826 2847 2848 2859 2873 2886 2896 2903 2905
[271] 2906 2930 2942 2945 2947 2950 2957 2968 2972 2987 2988 2995 3002 3008 3046
[286] 3047 3049 3052 3056 3063 3090 3099 3106 3111 3112 3119 3124 3147 3150 3158
[301] 3159 3164 3191 3192 3208 3291 3292 3297 3309 3314 3317 3331 3333 3340 3365
[316] 3381 3384 3387 3409 3456 3457 3466 3480 3489 3534 3559 3561 3572 3574 3577
[331] 3580 3592 3598 3608 3614 3619 3625 3631 3648 3650 3661 3662 3671 3682 3686
[346] 3713 3714 3744 3747 3752 3756 3765 3772 3774 3776 3781 3788 3865 3873 3879
[361] 3884 3894 3902 3904 3914 3926 3948 3958 3961 3962 3980 3988 3989 3990 4004
[376] 4018 4029 4050 4062 4087 4122 4154 4161 4171 4209 4230 4231 4237
$TenYearCHD
integer(0)
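The index lists above are long, and some rows (for example 43, 155, and 248) appear under more than one variable. A compact complement is to summarize missingness by row. The sketch below uses a small made-up data frame called toy, not the real data, so the values are assumptions for illustration.

```r
# Small made-up data frame with a few missing entries
toy <- data.frame(totChol = c(195, NA, 245, 225),
                  glucose = c(77, NA, NA, 103),
                  age     = c(39, 46, 48, 61))

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(toy))

# Number of missing entries in each row
n_missing_per_row <- rowSums(is.na(toy))

print(rows_with_na)
print(n_missing_per_row)
```

Applied to heartdisease, sum(!complete.cases(heartdisease)) would give the number of patients with at least one missing value.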
We will create a new data set called “heartdisease1” to represent the corrected data set that will have no missing values.
heartdisease1 <- heartdisease
It is important to note that the methods used to fill in the missing observations could skew the distributions of these variables. None of the variables with missing values is missing a majority of its entries, so it seems reasonable to fill in the missing observations with approximated values rather than leave them as NA. To handle this properly, we will first fill in the missing values and then check the distribution of each variable to confirm that it is approximately normal, or that the variable is practically important enough to include in the final multiple logistic regression model.
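One concrete effect of mean replacement is that it leaves the mean unchanged but shrinks the spread, because every filled-in value sits exactly at the center. The sketch below shows this on simulated values; the vector names and the simulated numbers are assumptions, not the real totChol data.

```r
set.seed(1)
x_full <- rnorm(200, mean = 237, sd = 45)  # made-up complete values
x_obs <- x_full
x_obs[sample(200, 20)] <- NA               # hide 10% of the values

# Fill the gaps with the mean of the observed values
x_imputed <- replace(x_obs, is.na(x_obs), mean(x_obs, na.rm = TRUE))

sd_before <- sd(x_obs, na.rm = TRUE)  # spread of the observed values
sd_after  <- sd(x_imputed)            # spread after mean replacement
print(c(before = sd_before, after = sd_after))
```

The more values are filled in, the stronger this shrinkage becomes, which is one reason to look closely at glucose, the variable with the most missing observations.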
First, let’s fix the missing values for the quantitative variables. The quantitative variables with missing values are totChol, BMI, heartRate, and glucose.
Let’s start with totChol. We will calculate the mean value of totChol and use this mean to replace the missing observations. After we replace the missing observations with the mean value, we will then check the first 20 observations of the corrected totChol variable with the missing observations filled in to make sure everything looks good.
# Replace missing values of totChol with the mean of totChol.
mean(heartdisease1$totChol, na.rm = TRUE)
[1] 236.7216
heartdisease1$totChol <-
replace(heartdisease1$totChol,
is.na(heartdisease1$totChol),
mean(heartdisease1$totChol, na.rm = TRUE))
# Let's check the first 20 observations of the fixed totChol to make sure it all looks good.
head(heartdisease1$totChol, 20)
[1] 195 250 245 225 285 228 205 313 260 225 254 247 294 332 226 221 232 291 195
[20] 195
Next, let’s fix BMI. We will calculate the mean value of BMI and use this mean to replace the missing observations. After we replace the missing observations with the mean value, we will then check the first 20 observations of the corrected BMI variable with the missing observations filled in to make sure everything looks good.
# Replace missing values of BMI with the mean of BMI.
mean(heartdisease1$BMI, na.rm = TRUE)
[1] 25.80201
heartdisease1$BMI <-
replace(heartdisease1$BMI,
is.na(heartdisease1$BMI),
mean(heartdisease1$BMI, na.rm = TRUE))
# Let's check the first 20 observations of the fixed BMI to make sure it all looks good.
head(heartdisease1$BMI, 20)
[1] 26.97 28.73 25.34 28.58 23.10 30.30 33.11 21.68 26.36 23.61 22.91 27.64
[13] 26.31 31.31 22.35 21.35 22.37 23.38 23.24 26.88
Next, let’s fix heartRate. We will calculate the mean value of heartRate and use this mean to replace the missing observations. After we replace the missing observations with the mean value, we will then check the first 20 observations of the corrected heartRate variable with the missing observations filled in to make sure everything looks good.
# Replace missing values of heartRate with the mean of heartRate.
mean(heartdisease1$heartRate, na.rm = TRUE)
[1] 75.87892
heartdisease1$heartRate <-
replace(heartdisease1$heartRate,
is.na(heartdisease1$heartRate),
mean(heartdisease1$heartRate, na.rm = TRUE))
# Let's check the first 20 observations of the fixed heartRate to make sure it all looks good.
head(heartdisease1$heartRate, 20)
[1] 80 95 75 65 85 77 60 79 76 93 75 72 98 65 85 95 64 80 75 85
And now let’s fix glucose. We will calculate the mean value of glucose and use this mean to replace the missing observations. After we replace the missing observations with the mean value, we will then check the first 20 observations of the corrected glucose variable with the missing observations filled in to make sure everything looks good.
# Replace missing values of glucose with the mean of glucose.
mean(heartdisease1$glucose, na.rm = TRUE)
[1] 81.96675
heartdisease1$glucose <-
replace(heartdisease1$glucose,
is.na(heartdisease1$glucose),
mean(heartdisease1$glucose, na.rm = TRUE))
# Let's check the first 20 observations of the fixed glucose to make sure it all looks good.
head(heartdisease1$glucose, 20)
[1] 77.00000 76.00000 70.00000 103.00000 85.00000 99.00000 85.00000
[8] 78.00000 79.00000 88.00000 76.00000 61.00000 64.00000 84.00000
[15] 81.96675 70.00000 72.00000 89.00000 78.00000 65.00000
As was stated previously, it is important to acknowledge that this process of filling in the missing observations could potentially lead to a skewness of the distribution of these variables. We will check the distributions for these variables in a couple of steps to ensure that this process did not create any severe problems with the distributions of the variables that had missing observations.
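The four mean replacements above follow the same pattern, so they could also be written once and applied to several columns. This is a minimal sketch on a made-up data frame; impute_mean, toy, and num_cols are hypothetical names introduced here.

```r
# Hypothetical helper: replace NA entries with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Made-up data frame standing in for the quantitative columns
toy <- data.frame(totChol = c(195, NA, 245),
                  BMI     = c(26.97, 28.73, NA))

# Apply the helper to each listed column
num_cols <- c("totChol", "BMI")
toy[num_cols] <- lapply(toy[num_cols], impute_mean)
print(toy)
```

With the real data, num_cols would be c("totChol", "BMI", "heartRate", "glucose") and toy would be heartdisease1.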
Now that we have fixed all of the missing observations for the quantitative variables, we will fix the remaining variables with missing observations: education, cigsPerDay, and BPMeds. To do this, we will replace the missing values with the mode of each variable. Note that cigsPerDay is a numeric count rather than a true categorical variable; we treat it like one here only for the purpose of filling in its missing values.
Let’s start with education. We will calculate the mode value of education and use this mode to replace the missing observations. After we replace the missing observations with the mode value, we will then check the first 30 observations of the corrected education variable with the missing observations filled in to make sure everything looks good.
# Replace missing values of education with the mode categorical value of education.
# Note: names() returns the mode as a character string, so the column becomes character.
heartdisease1$education <-
  replace(heartdisease1$education,
          is.na(heartdisease1$education),
          names(which.max(table(heartdisease1$education))))
# Let's check the first 30 observations of the fixed education to make sure it all looks good.
head(heartdisease1$education, 30)
[1] "4" "2" "1" "3" "3" "2" "1" "2" "1" "1" "1" "2" "1" "3" "2" "2" "3" "2" "2"
[20] "2" "2" "1" "1" "3" "2" "4" "1" "2" "3" "1"
Now let’s fix cigsPerDay. We will calculate the mode value of cigsPerDay and use this mode to replace the missing observations. After we replace the missing observations with the mode value, we will then check the first 30 observations of the corrected cigsPerDay variable with the missing observations filled in to make sure everything looks good.
# Replace missing values of cigsPerDay with the mode categorical value of cigsPerDay.
heartdisease1$cigsPerDay <-
replace(heartdisease1$cigsPerDay,
is.na(heartdisease1$cigsPerDay),
names(which.max(table(heartdisease1$cigsPerDay [1:4238]))))
# Let's check the first 30 observations of the fixed cigsPerDay to make sure it all looks good.
head(heartdisease1$cigsPerDay, 30)
[1] "0" "0" "20" "30" "23" "0" "0" "20" "0" "30" "0" "0" "15" "0" "9"
[16] "20" "10" "20" "5" "0" "30" "0" "0" "20" "30" "20" "0" "20" "0" "0"
Lastly, let’s fix BPMeds. We will calculate the mode value of BPMeds and use this mode to replace the missing observations. After we replace the missing observations with the mode value, we will then check the first 30 observations of the corrected BPMeds variable with the missing observations filled in to make sure everything looks good.
# Replace missing values of BPMeds with the mode categorical value of BPMeds.
heartdisease1$BPMeds <-
replace(heartdisease1$BPMeds,
is.na(heartdisease1$BPMeds),
names(which.max(table(heartdisease1$BPMeds [1:4238]))))
# Let's check the first 30 observations of the fixed BPMeds to make sure it all looks good.
head(heartdisease1$BPMeds, 30)
[1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0"
[20] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
Once again, we must acknowledge that this process of filling in the missing observations could potentially lead to a skewness of the distribution of these variables. We will check the distributions for these variables in a couple of steps to ensure that this process did not create any severe problems with the distributions of the variables that had missing observations.
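The mode replacements above return quoted values such as "0" because names() yields character strings. A possible alternative, sketched below under the assumption that the variable holds integer or character codes, fills in the mode while keeping the original type. The helper impute_mode and the vector bp_meds are hypothetical.

```r
# Hypothetical helper: fill NA entries with the most frequent observed value,
# keeping the original type of x (intended for integer or character codes)
impute_mode <- function(x) {
  counts <- table(x)                              # table() ignores NA by default
  mode_label <- names(counts)[which.max(counts)]  # most frequent value, as text
  # Pick the first observed entry equal to the mode, so the type of x is kept
  mode_value <- x[!is.na(x) & as.character(x) == mode_label][1]
  x[is.na(x)] <- mode_value
  x
}

bp_meds <- c(0L, 0L, NA, 1L, 0L)  # made-up binary codes with one gap
filled <- impute_mode(bp_meds)
print(filled)
print(class(filled))
```

Because the result stays an integer vector, a later type conversion would not be needed with this approach.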
Now that we have fixed all of the variables that were missing observations, let’s double check that there are no more missing observations in the new data set “heartdisease1”.
colSums(is.na(heartdisease1))
male age education currentSmoker cigsPerDay
0 0 0 0 0
BPMeds prevalentStroke prevalentHyp diabetes totChol
0 0 0 0 0
sysBP diaBP BMI heartRate glucose
0 0 0 0 0
TenYearCHD
0
As we can see, all of the variables now have zero missing observations, so we successfully fixed all of the missing values by filling them in with either the mean or mode depending on whether the variable in question was quantitative or categorical. The new, revised data set “heartdisease1” now has no missing observations, so we will use this for further analysis and for creating the multiple logistic regression model in the future steps of this project.
One other thing I noticed while looking through the data set is that the variable “cigsPerDay” is now stored as a character variable even though it should be numeric. This is a side effect of the mode replacement above: names() returns a character string, so replace() converts the whole column to character. This variable measures the number of cigarettes a patient smokes in a day, so it should be numeric, since it takes a range of numeric values and was not collected as categories. So, let’s convert cigsPerDay from a character variable to a numeric variable using the as.numeric() function.
heartdisease1$cigsPerDay <- as.numeric(heartdisease1$cigsPerDay)
Now, the cigsPerDay variable is correctly given as a numeric variable.
Additionally, the variable BPMeds is stored as a character variable. However, it is a binary variable, so it should be an integer variable with values 0 and 1 like the other binary variables in the data set. We will use the as.integer() function to convert it from character to integer.
heartdisease1$BPMeds <- as.integer(heartdisease1$BPMeds)
Now, the BPMeds variable is correctly given as a binary, integer variable.
Now that we have corrected the types of the variables that were stored with the wrong type, we will move on to the steps leading up to the multiple logistic regression models.
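The education variable also remains a character variable after the mode replacement. A modeling function such as glm() treats character predictors as categorical, but an explicit factor with readable labels can make this clearer. The sketch below uses a few made-up values and the level descriptions from the variable list; education_raw and education_fct are names introduced for illustration.

```r
# A few made-up education codes, stored as character like the cleaned column
education_raw <- c("4", "2", "1", "3", "3", "2")

# Convert to a factor, with level 1 as the reference category
education_fct <- factor(
  education_raw,
  levels = c("1", "2", "3", "4"),
  labels = c("less than high school", "high school diploma",
             "college graduate", "post-college graduate")
)

print(table(education_fct))
```

Whether to relabel the levels is a matter of taste; the fitted model is the same either way, and only the coefficient names change.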
Before creating the multiple logistic regression models, we should check the distributions of the variables in the data set. Logistic regression does not require the predictors to be normally distributed, so for the quantitative variables this check is mainly a way to spot severe skewness, outliers, or distortions introduced by filling in missing values. If a variable does not look approximately normal, we will consider whether it is practically important enough to be included in the model building process anyway.
As was previously stated when we filled in the missing observations, replacing missing values with approximate values can lead to skewness or inaccuracies in the distributions of the variables in question. Since quite a few variables in this data set had missing observations, we should check the variables’ distributions before creating the multiple logistic regression models to make sure that there are no severe problems with any of them.
Since there are many variables in this data set, we will look at them one at a time to avoid an overload of information. For each variable, we will draw a histogram with an overlaid density curve. We will look mainly at the quantitative variables to check that they are unimodal and approximately normally distributed, along with the binary variable BPMeds, which also had missing values filled in.
Although the variable age did not have any missing values that needed to be filled in, let’s still check its distribution since it is an important quantitative variable for the multiple logistic regression model.
ylimit = max(density(heartdisease1$age)$y)
hist(heartdisease1$age, probability = TRUE, main = "age", xlab="age",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$age, adjust=2), col="darkorchid")
Overall, the distribution for age appears to be fairly consistent. There do appear to be more observations for younger ages (under 50 years old) and fewer for older ages (over 60 years old), which makes the distribution look slightly skewed to the right. However, this skew does not appear severe enough to suggest a problem with the age data. We can conclude that age is suitable to include in the multiple logistic regression model building process.
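The same histogram and density code is repeated for each variable below, so it could be wrapped in a small function. This is a sketch only; plot_distribution is a hypothetical helper, and unlike the code in this report it uses the computed y-limit so that neither the bars nor the density curve are cut off.

```r
# Hypothetical helper: histogram on the density scale with an overlaid density curve
plot_distribution <- function(x, name, adjust = 2) {
  x <- x[!is.na(x)]
  dens <- density(x, adjust = adjust)  # smoothed density estimate
  bars <- hist(x, plot = FALSE)        # bar heights, computed without drawing
  ylimit <- max(dens$y, bars$density)  # tall enough for both bars and curve
  hist(x, probability = TRUE, main = name, xlab = name,
       ylim = c(0, ylimit), col = "aliceblue", border = "cornflowerblue")
  lines(dens, col = "darkorchid")
  invisible(ylimit)
}
```

For example, plot_distribution(heartdisease1$age, "age") would draw a plot similar to the one above.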
totChol was one of the variables that had some missing observations that were filled in during the previous steps. So, we should definitely check the distribution of this variable.
ylimit = max(density(heartdisease1$totChol)$y)
hist(heartdisease1$totChol, probability = TRUE, main = "totChol Distribution", xlab="totChol",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$totChol, adjust=2), col="darkorchid")
Overall, totChol appears to be unimodal and approximately normally distributed. Therefore we can say that this variable is safe to include within the multiple logistic regression model building process and that the process of filling in the missing observations did not lead to any severe concerns for the distribution of this variable.
BMI is another variable that had missing observations that were filled in during the previous steps. We will check its distribution to make sure it appears to be approximately normally distributed.
ylimit = max(density(heartdisease1$BMI)$y)
hist(heartdisease1$BMI, probability = TRUE, main = "BMI Distribution", xlab="BMI",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$BMI, adjust=2), col="darkorchid")
The distribution for BMI is unimodal and appears to be approximately normally distributed. It appears that the majority of individuals have a BMI somewhere in the range of 20-30, with the vast majority of the data falling within this range. This allows us to conclude that it is safe to include the variable of BMI within the multiple logistic regression model building process and that the process of filling in the missing values did not lead to any major concerns with the distribution of this variable.
The variable glucose was another variable that had missing observations which were filled in during the previous steps. We will check its distribution to make sure that everything looks alright.
ylimit = max(density(heartdisease1$glucose)$y)
hist(heartdisease1$glucose, probability = TRUE, main = "glucose", xlab="glucose",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$glucose, adjust=2), col="darkorchid")
The variable glucose is unimodal and appears to be approximately normally distributed with the exception of a few outliers to the right. The most common glucose value appears to be around 75, with the histogram peaking around this value. Despite the few outliers, we will keep this variable in the final model since a patient’s glucose level is a medically important variable which would be helpful to know in order to assess a patient’s odds of being at risk for developing CHD.
cigsPerDay was another variable that had missing observations which were filled in during the previous steps. So, we should check its distribution to make sure things look alright.
ylimit = max(density(heartdisease1$cigsPerDay)$y)
hist(heartdisease1$cigsPerDay, probability = TRUE, main = "cigsPerDay", xlab="cigsPerDay",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$cigsPerDay, adjust=2), col="darkorchid")
The variable cigsPerDay is very notably skewed to the right. This is because the variable represents the number of cigarettes a patient smokes per day, and many of the patients in the data set were not smokers, so they reported a value of 0. Non-smokers therefore pile up at 0, while smokers report values greater than 0, which produces the long right tail. This is a practically important variable regardless of its skew, because smoking is a major risk factor for conditions like CHD, and doctors would want to know how many cigarettes their patients smoke in a day. So, we will keep this variable in the final model despite its skew, keeping in mind that the skew arises because the variable covers both non-smokers and smokers.
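The visual impression of skew can be backed up with a number. The sketch below computes a simple moment-based sample skewness and the share of zeros on a made-up vector; sample_skewness and cigs_toy are hypothetical, and the values are not taken from the real data.

```r
# Hypothetical helper: moment-based sample skewness
# (positive values indicate a longer right tail)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up cigarette counts: many non-smokers at 0 and a few heavy smokers
cigs_toy <- c(rep(0, 12), 5, 10, 15, 20, 20, 30, 43)

skew_value <- sample_skewness(cigs_toy)  # skewness of the toy counts
share_zero <- mean(cigs_toy == 0)        # proportion of non-smokers
print(c(skewness = skew_value, share_zero = share_zero))
```

On the real data, sample_skewness(heartdisease1$cigsPerDay) would be expected to be positive, in line with the right skew seen in the histogram.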
BPMeds was another variable that had some missing observations that were filled in during previous steps. We will check its distribution to make sure that everything looks alright.
ylimit = max(density(heartdisease1$BPMeds)$y)
hist(heartdisease1$BPMeds, probability = TRUE, main = "BPMeds", xlab="BPMeds",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$BPMeds, adjust=2), col="darkorchid")
Since BPMeds is a binary variable with only the values 0 and 1, we expect its distribution to be bimodal, with data located only at x = 0 and x = 1. The distribution shows much more data at 0 than at 1, meaning that many more patients reported not taking blood pressure medication than reported taking it. Although the variable is heavily weighted toward x = 0, it is a practically important predictor, so we will keep it in the final multiple logistic regression model. Whether or not a patient takes blood pressure medication is an important piece of information that doctors would want to know, since it could affect the patient’s risk of developing heart disease.
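For a binary variable, a frequency table is often easier to read than a histogram with a density curve. The sketch below shows the idea on a made-up vector; bp_toy and its values are assumptions, not the real BPMeds data.

```r
# Made-up binary codes: 1 = takes blood pressure medication, 0 = does not
bp_toy <- c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L)

bp_counts <- table(bp_toy)           # number of patients in each category
bp_props  <- prop.table(bp_counts)   # share of patients in each category

print(bp_counts)
print(round(bp_props, 2))
```

With the real data, table(heartdisease1$BPMeds) would show directly how unbalanced the two groups are.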
The variable heartRate was another one which had some missing observations that needed to be filled in during previous steps. We will check its distribution to ensure that everything looks alright.
ylimit = max(density(heartdisease1$heartRate)$y)
hist(heartdisease1$heartRate, probability = TRUE, main = "heartRate", xlab="heartRate",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$heartRate, adjust=2), col="darkorchid")
The distribution for heart rate is unimodal and approximately normally distributed without any notable skew or outliers that are apparent in its density plot. This allows us to conclude that this variable is safe to include within the multiple logistic regression model building process and that the process of filling in the missing values did not lead to any issues in its distribution.
sysBP was not a variable that had any missing observations which needed to be filled in. However, it is an important quantitative variable so we will check its distribution to make sure it appears to be approximately normally distributed.
ylimit = max(density(heartdisease1$sysBP)$y)
hist(heartdisease1$sysBP, probability = TRUE, main = "sysBP", xlab="sysBP",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$sysBP, adjust=2), col="darkorchid")
Overall, the distribution for sysBP appears to be unimodal and approximately normally distributed without any severe skewness that raises a cause for immediate concern. We can conclude that sysBP is a variable that is safe to include within the multiple logistic regression model building process.
diaBP was not a variable that had any missing observations which needed to be filled in. However, it is an important quantitative variable so we will check its distribution to make sure it appears to be approximately normally distributed.
ylimit = max(density(heartdisease1$diaBP)$y)
hist(heartdisease1$diaBP, probability = TRUE, main = "diaBP Distribution", xlab="diaBP",
col = "aliceblue", border="cornflowerblue")
lines(density(heartdisease1$diaBP, adjust=2), col="darkorchid")
Overall, the distribution for diaBP appears to be unimodal and approximately normally distributed. We can conclude that diaBP is a variable that is safe to include within the multiple logistic regression model building process.
We have checked the distributions of the variables which had missing observations that were filled in during the previous steps of this project, along with other important quantitative variables in this data set. We ensured that each variable either follows a distribution that appears to be approximately normal or is practically important enough to be included in the multiple logistic regression model building process anyway.
Overall, the vast majority of the variables did in fact follow a distribution that was approximately normal without any noticeable or significant skewness or outliers. This allows us to conclude that filling in the missing observations in the variables that had missing values did not lead to any serious problems with the distributions of these variables.
The only quantitative variable that did not appear to follow an approximately normal distribution was cigsPerDay. However, the skewness seen in this variable’s distribution can be attributed to the fact that the data was collected from both smokers and non-smokers. Non-smokers reported values of 0 cigarettes smoked per day, while smokers reported values greater than zero. This caused the distribution for this variable to appear skewed to the right. However, this variable is practically important because the number of cigarettes a patient smokes per day can have a significant impact on their odds of being at risk for developing CHD. Due to its practical importance, it must still be included in the multiple logistic regression model building process.
Altogether, we have ensured that the variables in this data set appear to be good enough to use within the model building process. So, now we will begin with the process of building the candidate multiple logistic regression models.
Now, we will begin to build the multiple logistic regression model for the data. We will create three candidate models for this project and will later select the best of the three to use as the final multiple logistic regression model.
We will start by creating the full model that includes all of the predictor variables in the data set.
full.model = glm(TenYearCHD ~male + age + education + currentSmoker + cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI + heartRate + glucose,
family = binomial(link = "logit"),
data = heartdisease1)
kable(summary(full.model)$coef,
caption = "Full Model Summary of the Inferential Statistics")
Term | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -8.0136319 | 0.6595488 | -12.1501736 | 0.0000000 |
male | 0.4832252 | 0.1012879 | 4.7708097 | 0.0000018 |
age | 0.0606595 | 0.0062872 | 9.6481745 | 0.0000000 |
education2 | -0.1920573 | 0.1152003 | -1.6671600 | 0.0954826 |
education3 | -0.0913989 | 0.1389137 | -0.6579549 | 0.5105671 |
education4 | 0.0317405 | 0.1525457 | 0.2080723 | 0.8351725 |
currentSmoker | 0.0165846 | 0.1428593 | 0.1160903 | 0.9075810 |
cigsPerDay | 0.0214818 | 0.0056361 | 3.8114754 | 0.0001381 |
BPMeds | 0.2488193 | 0.2204520 | 1.1286778 | 0.2590338 |
prevalentStroke | 0.9713889 | 0.4439075 | 2.1882687 | 0.0286500 |
prevalentHyp | 0.2308865 | 0.1286361 | 1.7948818 | 0.0726725 |
diabetes | 0.1766062 | 0.2947860 | 0.5990999 | 0.5491063 |
totChol | 0.0018834 | 0.0010280 | 1.8320348 | 0.0669462 |
sysBP | 0.0141939 | 0.0035457 | 4.0031125 | 0.0000625 |
diaBP | -0.0028777 | 0.0059856 | -0.4807744 | 0.6306768 |
BMI | 0.0018980 | 0.0118449 | 0.1602398 | 0.8726922 |
heartRate | -0.0012247 | 0.0038861 | -0.3151523 | 0.7526460 |
glucose | 0.0067843 | 0.0021476 | 3.1590844 | 0.0015827 |
The equation of the full multiple logistic regression model is given as follows:
log p/(1-p) = -8.0136 + 0.4832 * male + 0.0607 * age - 0.1921 * education2 - 0.0914 * education3 + 0.0317 * education4 + 0.0166 * currentSmoker + 0.0215 * cigsPerDay + 0.2488 * BPMeds + 0.9714 * prevalentStroke + 0.2309 * prevalentHyp + 0.1766 * diabetes + 0.0019 * totChol + 0.0142 * sysBP - 0.0029 * diaBP + 0.0019 * BMI - 0.0012 * heartRate + 0.0068 * glucose
The full multiple logistic regression model is statistically significant with p < .001. That is, the full model containing all of the predictor variables in the data set (male, age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke, prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, and glucose) statistically significantly predicted the odds of a patient being at risk for developing CHD in a 10-year period of time.
As we can see, there is an abundance of predictor variables in this full model. Later in this project, we will use an automatic variable selection process to narrow the model down and decide which variables to include in our final multiple logistic regression model.
According to a report by the National Library of Medicine of the National Institutes of Health (NIH), the most notable risk factors for an individual being at risk of developing CHD are cigarette smoking, blood pressure, and cholesterol levels. In our data set, there are two variables related to smoking: currentSmoker, whether a patient is currently a smoker or not, and cigsPerDay, how many cigarettes a patient smokes in a day. There are also two variables related to blood pressure, sysBP and diaBP, for systolic blood pressure and diastolic blood pressure. There is also a variable for a patient’s total cholesterol level, totChol. These five variables represent the key risk factors that the NIH report identified as the most significant in predicting a patient’s risk for CHD.
Let’s begin with creating a reduced model using the facts from this report. We will make another candidate model with just these variables. Since these variables are practically important for predicting a patient’s odds of being at risk for developing CHD, the smallest model must include these factors.
reduced.model = glm(TenYearCHD ~ currentSmoker + cigsPerDay + sysBP + diaBP + totChol,
family = binomial(link = "logit"),
data = heartdisease1)
kable(summary(reduced.model)$coef,
caption = "Reduced Model Summary of the Inferential Statistics")
Term | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -5.4464334 | 0.3700477 | -14.718193 | 0.0000000 |
currentSmoker | -0.1503058 | 0.1379616 | -1.089476 | 0.2759442 |
cigsPerDay | 0.0230630 | 0.0052931 | 4.357183 | 0.0000132 |
sysBP | 0.0288884 | 0.0029109 | 9.924339 | 0.0000000 |
diaBP | -0.0117367 | 0.0054645 | -2.147804 | 0.0317293 |
totChol | 0.0026359 | 0.0009736 | 2.707332 | 0.0067826 |
The equation of the reduced multiple logistic regression model is given as follows:
log p/(1-p) = -5.4464 - 0.1503 * currentSmoker + 0.0231 * cigsPerDay + 0.0289 * sysBP - 0.0117 * diaBP + 0.0026 * totChol
The multiple logistic regression of the reduced model is statistically significant with p < .001. That means, the reduced model of currentSmoker, cigsPerDay, sysBP, diaBP, and totChol statistically significantly predicted the odds of a patient being at risk for developing CHD in a 10-year period of time.
Now that we have looked at both the full model, with all of the predictor variables kept in the multiple logistic regression model, and the reduced model, with only the predictor variables that the NIH report listed as the most significant factors in a patient’s risk of CHD, we will work on constructing the final model. This final model will be built through the automatic variable selection process. We will use forward selection to determine which variables should be kept in the final model.
We will use automatic variable selection to help build the final model.
final.model = stepAIC(reduced.model, scope = list(lower=formula(reduced.model), upper=formula(full.model)),
direction = "forward",
trace = 0
)
kable(summary(final.model)$coef,
caption="Final Model Summary of the Inferential Statistics")
Term | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -8.3106987 | 0.5490393 | -15.1368015 | 0.0000000 |
currentSmoker | 0.0046193 | 0.1416577 | 0.0326090 | 0.9739864 |
cigsPerDay | 0.0212771 | 0.0056227 | 3.7841222 | 0.0001543 |
sysBP | 0.0145661 | 0.0035076 | 4.1526800 | 0.0000329 |
diaBP | -0.0029301 | 0.0058563 | -0.5003271 | 0.6168448 |
totChol | 0.0018347 | 0.0010222 | 1.7947609 | 0.0726918 |
age | 0.0625609 | 0.0061559 | 10.1627646 | 0.0000000 |
male | 0.5078985 | 0.0992562 | 5.1170440 | 0.0000003 |
glucose | 0.0075735 | 0.0016349 | 4.6323623 | 0.0000036 |
prevalentStroke | 1.0205076 | 0.4381915 | 2.3289078 | 0.0198640 |
prevalentHyp | 0.2441433 | 0.1273426 | 1.9172164 | 0.0552104 |
The predictor variables that were kept in the final model after the forward selection process are: currentSmoker, cigsPerDay, sysBP, diaBP, totChol, age, male, glucose, prevalentStroke, and prevalentHyp. All five of the variables from the reduced model (currentSmoker, cigsPerDay, sysBP, diaBP, and totChol) remain in the final model because the reduced model was set as the lower bound of the selection scope. In addition, the forward selection process added the variables age, male, glucose, prevalentStroke, and prevalentHyp.
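To illustrate how forward selection with a lower and an upper scope behaves, here is a minimal sketch on simulated data. The data frame `toy` and its variables `x1`, `x2`, `x3`, and `y` are hypothetical and are not part of the heart disease data; the sketch only assumes that the MASS package is installed.

```r
library(MASS)  # provides stepAIC()

set.seed(1)
n <- 500

# Simulated predictors (hypothetical; not the heart disease data).
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# The outcome depends on x1 and x2 only; x3 is pure noise.
toy$y <- rbinom(n, 1, plogis(-0.5 + 0.4 * toy$x1 + 1.2 * toy$x2))

# Smallest and largest models that define the selection scope.
small <- glm(y ~ x1, family = binomial(link = "logit"), data = toy)
big   <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"), data = toy)

# Forward selection starts from `small`, never drops its terms, and
# adds terms from `big` only while doing so lowers the AIC.
chosen <- stepAIC(small,
                  scope = list(lower = formula(small), upper = formula(big)),
                  direction = "forward",
                  trace = 0)
formula(chosen)
```

Because the search only ever adds terms, every variable in the lower model is guaranteed to appear in the selected model, whether or not it is individually significant.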
The equation of the final multiple logistic regression model is given as follows:
log p/(1-p) = -8.3107 + 0.0046 * currentSmoker + 0.0213 * cigsPerDay + 0.0146 * sysBP - 0.0029 * diaBP + 0.0018 * totChol + 0.0626 * age + 0.5079 * male + 0.0076 * glucose + 1.0205 * prevalentStroke + 0.2441 * prevalentHyp
The final multiple logistic regression model is statistically significant with p < .001. That is, the final model of currentSmoker, cigsPerDay, sysBP, diaBP, totChol, age, male, glucose, prevalentStroke, and prevalentHyp statistically significantly predicted the odds of a patient being at risk for developing CHD in a 10-year period of time.
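As an illustration of how the final-model equation turns into a predicted probability, the sketch below plugs in a hypothetical patient. The coefficients are copied from the final model summary table; the patient values are made up for illustration and do not correspond to any row in the data set.

```r
# Final-model coefficients, copied from the summary table.
beta <- c("(Intercept)" = -8.3106987, currentSmoker = 0.0046193,
          cigsPerDay = 0.0212771, sysBP = 0.0145661, diaBP = -0.0029301,
          totChol = 0.0018347, age = 0.0625609, male = 0.5078985,
          glucose = 0.0075735, prevalentStroke = 1.0205076,
          prevalentHyp = 0.2441433)

# Hypothetical patient (illustrative values only).
patient <- c("(Intercept)" = 1, currentSmoker = 1, cigsPerDay = 20,
             sysBP = 140, diaBP = 90, totChol = 240, age = 55, male = 1,
             glucose = 85, prevalentStroke = 0, prevalentHyp = 1)

# Linear predictor: the log odds of being at risk for CHD.
log.odds <- sum(beta * patient[names(beta)])

# The inverse logit converts log odds to a probability.
prob <- plogis(log.odds)
round(c(log.odds = log.odds, probability = prob), 4)
```

With a fitted model object available, `predict(final.model, newdata = ..., type = "response")` would give the same kind of result without copying coefficients by hand.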
Let’s look at some overall goodness-of-fit measures for our three candidate multiple logistic regression models.
global.measure = function(s.logit){
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid, AIC = aic)
goodness
}
goodness=rbind(full.model = global.measure(full.model),
reduced.model=global.measure(reduced.model),
final.model=global.measure(final.model))
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption ="Global Goodness-of-fit Statistics Comparison")
Model | Deviance.residual | Null.Deviance.Residual | AIC |
---|---|---|---|
full.model | 3205.583 | 3611.55 | 3241.583 |
reduced.model | 3392.831 | 3611.55 | 3404.831 |
final.model | 3210.882 | 3611.55 | 3232.882 |
Comparing the global goodness-of-fit measures of all three candidate models allows us to verify the choice of the final model.
We can see that the final model has the lowest AIC value of the three candidate models. The full model has an AIC of 3,241.583, the reduced model has an AIC of 3,404.831, and the final model has an AIC of 3,232.882. A lower AIC indicates a better balance between goodness of fit and model complexity, so this supports choosing the final model, which was created through automatic forward variable selection, as our overall model for the multiple logistic regression.
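The AIC values in the table can be reproduced by hand as a sanity check. The sketch below assumes the standard relationship for ungrouped binary data, where the residual deviance equals -2 times the log-likelihood, so that AIC = deviance + 2k with k the number of estimated coefficients. The values of k are counted from the rows of each coefficient table.

```r
# Residual deviances from the goodness-of-fit table.
dev <- c(full.model = 3205.583, reduced.model = 3392.831,
         final.model = 3210.882)

# Number of estimated coefficients, including the intercept.
k <- c(full.model = 18, reduced.model = 6, final.model = 11)

# AIC = deviance + 2 * k for ungrouped binary responses.
aic <- dev + 2 * k
aic

# Name of the model with the smallest AIC.
best <- names(which.min(aic))
best
```

The full model has a slightly smaller deviance than the final model, but it spends seven more coefficients to get there, which is why its AIC is higher.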
As the goodness-of-fit comparison showed, the preferred model for the multiple logistic regression is the final model that was created through the automatic forward variable selection process.
The equation for this model was previously found to be given as:
log p/(1-p) = -8.3107 + 0.0046 * currentSmoker + 0.0213 * cigsPerDay + 0.0146 * sysBP - 0.0029 * diaBP + 0.0018 * totChol + 0.0626 * age + 0.5079 * male + 0.0076 * glucose + 1.0205 * prevalentStroke + 0.2441 * prevalentHyp
This final model was shown to be statistically significant with p < .001. Out of the three candidate models, it had the lowest AIC value, which makes it the preferred choice among the candidates.
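One way to check the overall significance of each candidate model is a likelihood-ratio test that compares the null deviance with the residual deviance. The sketch below uses the deviances reported in the goodness-of-fit table; the degrees of freedom are the number of slope coefficients in each model, counted from the coefficient tables. This is an illustrative check and not necessarily how the p-values above were obtained.

```r
# Deviances from the goodness-of-fit table.
null.dev  <- 3611.55
resid.dev <- c(full.model = 3205.583, reduced.model = 3392.831,
               final.model = 3210.882)

# Degrees of freedom: number of slope coefficients in each model.
df <- c(full.model = 17, reduced.model = 5, final.model = 10)

# Likelihood-ratio statistic: drop in deviance relative to the null model.
lr.stat <- null.dev - resid.dev

# The upper-tail chi-square probability gives the p-value.
p.value <- pchisq(lr.stat, df = df, lower.tail = FALSE)
cbind(lr.stat, df, p.value)
```

All three p-values fall far below .001, which agrees with the significance statements made for each model.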
Now, let’s look at the odds ratios and interpret them within the context of this multiple logistic regression model.
model.coef.stats = summary(final.model)$coef
odds.ratio = exp(coef(final.model))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Statistics with Odds Ratios")
Term | Estimate | Std. Error | z value | Pr(>|z|) | odds.ratio |
---|---|---|---|---|---|
(Intercept) | -8.3106987 | 0.5490393 | -15.1368015 | 0.0000000 | 0.0002459 |
currentSmoker | 0.0046193 | 0.1416577 | 0.0326090 | 0.9739864 | 1.0046300 |
cigsPerDay | 0.0212771 | 0.0056227 | 3.7841222 | 0.0001543 | 1.0215051 |
sysBP | 0.0145661 | 0.0035076 | 4.1526800 | 0.0000329 | 1.0146727 |
diaBP | -0.0029301 | 0.0058563 | -0.5003271 | 0.6168448 | 0.9970742 |
totChol | 0.0018347 | 0.0010222 | 1.7947609 | 0.0726918 | 1.0018364 |
age | 0.0625609 | 0.0061559 | 10.1627646 | 0.0000000 | 1.0645593 |
male | 0.5078985 | 0.0992562 | 5.1170440 | 0.0000003 | 1.6617953 |
glucose | 0.0075735 | 0.0016349 | 4.6323623 | 0.0000036 | 1.0076022 |
prevalentStroke | 1.0205076 | 0.4381915 | 2.3289078 | 0.0198640 | 2.7746027 |
prevalentHyp | 0.2441433 | 0.1273426 | 1.9172164 | 0.0552104 | 1.2765272 |
The odds ratio measures the association between an exposure and an outcome. In this case, the exposure is the predictor variable in question and the outcome is the patient’s odds of being at risk for developing CHD in a 10-year period of time. The odds ratio allows us to see how a patient’s odds of being at risk for developing CHD in a 10-year period change with an increase in the predictor variable being looked at.
We will look at the odds ratio for each of the predictor variables and interpret it in a manner that is practically useful for the multiple logistic regression model.
The odds ratio for currentSmoker is 1.0046. This means that the odds of a patient being at risk for developing CHD within a 10-year period of time are 0.46% greater if the patient is a current smoker than if the patient is not a current smoker, holding all other variables constant.
The odds ratio for cigsPerDay is 1.0215. This means that for every 1 additional cigarette a patient smokes in a day, the odds of this patient being at risk for developing CHD within a 10-year period of time increase by 2.15%, holding all other variables constant.
The odds ratio for sysBP is 1.0147. This means that for every 1 mmHg increase in a patient’s systolic blood pressure, the odds of this patient being at risk for developing CHD within a 10-year period of time increase by 1.47%, holding all other variables constant.
The odds ratio for diaBP is 0.9971. This means that for every 1 mmHg increase in a patient’s diastolic blood pressure, the odds of this patient being at risk for developing CHD within a 10-year period of time decrease by 0.29%, holding all other variables constant.
The odds ratio for totChol is 1.0018. This means that for every 1 mg/dL increase in a patient’s total cholesterol level, the odds of this patient being at risk for developing CHD within a 10-year period of time increase by 0.18%, holding all other variables constant.
The odds ratio for age is 1.0646. This means that for every 1-year increase in a patient’s age, the odds of this patient being at risk for developing CHD within a 10-year period of time increase by 6.46%, holding all other variables constant.
The odds ratio for male is 1.6618. This means that the odds of a patient being at risk for developing CHD within a 10-year period of time are 66.18% greater if the patient is male than if the patient is female, holding all other variables constant.
The odds ratio for glucose is 1.0076. This means that for every 1 mg/dL increase in a patient’s glucose level, the odds of this patient being at risk for developing CHD within a 10-year period of time increase by 0.76%, holding all other variables constant.
The odds ratio for prevalentStroke is 2.7746. This means that the odds of a patient being at risk for developing CHD within a 10-year period of time are 2.7746 times as large for patients with a history of strokes as for patients without a history of strokes, holding all other variables constant. This is a large effect (p = 0.0199), suggesting that a history of strokes is a major risk factor for developing CHD.
The odds ratio for prevalentHyp is 1.2765. This means that the odds of a patient being at risk for developing CHD within a 10-year period of time are 1.2765 times as large for patients with a history of hypertension as for patients without a history of hypertension, holding all other variables constant. This suggests that a history of hypertension raises a patient’s odds of being at risk for developing CHD, although with p = 0.0552 the effect falls just short of statistical significance at the 0.05 level.
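The odds ratios above are point estimates. A rough way to gauge their uncertainty is a 95% Wald confidence interval, built on the log-odds scale and then exponentiated. The sketch below uses the estimates and standard errors from the final model table; the Wald interval is an addition for illustration and was not part of the original analysis.

```r
# Estimates and standard errors from the final-model table.
est <- c(currentSmoker = 0.0046193, cigsPerDay = 0.0212771,
         sysBP = 0.0145661, diaBP = -0.0029301, totChol = 0.0018347,
         age = 0.0625609, male = 0.5078985, glucose = 0.0075735,
         prevalentStroke = 1.0205076, prevalentHyp = 0.2441433)
se <- c(0.1416577, 0.0056227, 0.0035076, 0.0058563, 0.0010222,
        0.0061559, 0.0992562, 0.0016349, 0.4381915, 0.1273426)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Build the interval on the log-odds scale, then exponentiate.
or.ci <- cbind(odds.ratio = exp(est),
               lower = exp(est - z * se),
               upper = exp(est + z * se))
round(or.ci, 4)

# Predictors whose interval contains 1 (no clear association).
includes.one <- rownames(or.ci)[or.ci[, "lower"] < 1 & or.ci[, "upper"] > 1]
includes.one
```

The intervals for currentSmoker, diaBP, totChol, and prevalentHyp contain 1, which matches their p-values above 0.05 in the table. These variables remain in the model because the first three were forced in through the reduced model and the last was added by AIC-based selection, not because each is individually significant.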
Altogether, the final model was statistically significant in predicting the odds of a patient’s risk for developing CHD based upon the predictor variables currentSmoker, cigsPerDay, sysBP, diaBP, totChol, age, male, glucose, prevalentStroke, and prevalentHyp. The final multiple logistic regression model had a statistically significant p-value of p < .001, showing that this model is useful for predicting the odds of a patient being at risk for developing CHD over a 10-year period of time.
One interesting finding in this project is that prevalentStroke had a very large impact on the odds of an individual being at risk for developing CHD. We found that the odds of a patient being at risk for developing CHD within a 10-year period of time are 2.7746 times as large for patients with a history of strokes as for patients without a history of strokes, holding all other variables constant. This is a notable value, as it shows a large difference in the odds between individuals with and without a history of strokes. This is important for both doctors and patients to be aware of, as patients who report a history of strokes should be closely monitored by doctors because they have much higher odds of being at risk for developing CHD than patients who do not have a history of strokes.
Another interesting finding in this project was that males appear to have much greater odds of being at risk for developing CHD than females do. In fact, it was found that the odds of a patient being at risk for developing CHD within a 10-year period of time is 66.18% greater if the patient is male than if the patient is female, holding all other variables constant. This is something notable found within this project because it shows that while holding all other variables constant, males tend to have higher odds of being at risk for developing CHD than females do. This is something important for doctors to know as perhaps for patients who do not have any other preexisting factors that could increase their risk, it may be worthwhile to start monitoring and testing male patients for CHD at younger ages than female patients who have the same baseline risk factors.
Something else interesting that was found in this project was how much of an impact age had on an individual’s odds of being at risk for developing CHD. It was found that for every 1-year increase in a patient’s age, the odds of being at risk for developing CHD within a 10-year period of time increased by 6.46%, holding all other variables constant. Although a 6.46% increase in odds may not seem large, this is the increase for only 1 additional year of age. Because odds ratios multiply rather than add, the effect compounds: over 10 years the odds are multiplied by 1.0646^10, or about 1.87, which is an increase of about 86.9%. This shows how rapidly the odds grow as time goes by. This finding shows the importance of having regular medical checkups, especially as you get older. It is useful for doctors, as it is something they can share with their patients to remind them why regular medical checkups are important, especially for older patients.
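Since the model is linear on the log-odds scale, the effect of several years of aging is found by multiplying the age coefficient by the number of years before exponentiating, which is the same as raising the 1-year odds ratio to that power. A short sketch of the arithmetic, using the age coefficient from the final model:

```r
beta.age <- 0.0625609            # age coefficient from the final model
or.1yr   <- exp(beta.age)        # odds ratio for a 1-year increase in age
or.10yr  <- exp(10 * beta.age)   # odds ratio for a 10-year increase in age

# Percentage increase in odds over 1 year and over 10 years.
round(100 * (c(one.year = or.1yr, ten.years = or.10yr) - 1), 1)
```

Multiplying the 1-year percentage by 10 would understate the 10-year change, because each year’s increase applies to odds that have already grown.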
Some of the factors which were shown to increase an individual’s odds of being at risk for developing CHD can be addressed with care and medical intervention, such as the number of cigarettes smoked per day. It was found that as the number of cigarettes smoked per day increases, so do the individual’s odds of being at risk for developing CHD. Doctors could use this information to provide patients who smoke with ways to help them quit or cut down on the number of cigarettes they smoke per day. However, there are other risk factors for CHD which cannot be changed by medical intervention, such as a history of strokes and the age of the patient. Although a patient cannot change these factors, sharing them with their doctor allows the doctor to help assess their odds of being at risk for developing CHD and to provide other methods of managing their risk, such as medications. Additionally, the findings of this project reinforce the importance of having regular medical checkups with a doctor, as it is important to keep an eye on
Overall, this project provided many significant findings that can be useful for both doctors and patients. There are many factors which have an impact on an individual’s odds of being at risk for developing CHD, and it is important for both patients and their doctors to be aware of these risk factors. By knowing the factors which can increase an individual’s odds of being at risk for developing CHD, patients can use this to be mindful of their health and which aspects of their life and their health may potentially increase their odds of being at risk for developing CHD.
Some recommendations I would make for future projects include:
Expand the data collection to ensure the accuracy of the findings found in this project. This could also involve reaching out to various hospitals and doctors for a better understanding of the most serious risk factors for CHD in order to ensure that the variables kept within the final model indeed are representative of a patient’s odds of being at risk for developing CHD.
Consider other variables which could be statistically significant in predicting the odds of a patient’s risk for developing CHD. All of the predictor variables that were included within this data set made sense in terms of this project and were all factors which could play a role in a patient’s odds of being at risk for developing CHD. However, future projects could consider looking into additional variables that may be statistically significant in predicting the odds of a patient being at risk for developing CHD. For instance, one potential variable that could be looked into is a patient’s income to see if the income of a patient has an impact on their odds of being at risk for developing CHD. A reason why I believe that this could be a significant variable for this model is because patients with higher incomes tend to have more options when it comes to medical care than patients with lower incomes would. So, perhaps there could be a relationship between a patient’s income and their odds of being at risk for developing CHD. This is just one example of a variable that is not in this data set which may provide use for future projects.
Consider other possible models that could provide statistical significance in predicting the odds of a patient being at risk for developing CHD in a 10-year period. In this project, we looked at three candidate multiple logistic regression models and chose the one that provided the best utility. These three models were a full model with all of the predictor variables left in the model, a reduced model with only the variables that the medical report described as the most significant risk factors for CHD, and a final model that was created through automatic forward variable selection. Overall, the final model had the lowest AIC and was chosen as the one to use. However, this is not to say that no other model could be better than the one we created. If further projects found a model that provided a better fit and more accurate predictions than our final model, this could help doctors and patients better understand their odds of being at risk for developing CHD based on the various factors included in the model.
Overall, our final model showed statistical significance and good utility in predicting the odds of a patient being at risk for developing CHD in a 10-year period of time based on the factors currentSmoker, cigsPerDay, sysBP, diaBP, totChol, age, male, glucose, prevalentStroke, and prevalentHyp. This shows that the final multiple logistic regression model provides useful insight into assessing a patient’s risk factors and their overall odds of being at risk for developing CHD based upon their life and their health.
This multiple logistic regression model can provide use for doctors who want to keep their patients informed about their odds of being at risk for developing CHD, as well as for patients who want to stay informed about their own health and the factors which may increase their odds of being at risk for developing CHD. Ultimately, these findings show the importance of staying up to date on one’s health and having regular medical appointments and check ups with a doctor to be proactive in addressing potential risk factors and being aware and mindful of the odds of being at risk for developing CHD in a 10-year period based upon various factors of one’s life and health history.
This data set was found on kaggle.com under the collection of logistic regression data sets. Included below is the citation of the web page of where I found the data set I used for my project along with the medical report I found which described which factors are the most medically significant when it comes to determining a patient’s risk for developing CHD.
Dileep. (2019, June 7). Logistic regression to predict heart disease. Kaggle. https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression?resource=download&select=framingham.csv
Hajar, R. (2017). Risk factors for coronary artery disease: Historical perspectives. Heart views : the official journal of the Gulf Heart Association. https://pmc.ncbi.nlm.nih.gov/articles/PMC5686931/