The Framingham Heart Study (FHS), backed by the National Institute of Health (NIH) is known as the longest running and most important epidemiological studies in the history of medical research. This landmark prospective research study centered around cross sectional data, following over ten thousand patients throughout the greater part of the past century.
Staring in 1948 the FHS study began by surveying five thousand (5,000) Framingham, Massachusetts residents, ages 30 through 60, regarding various aspects of their current lifestyle while conducting various physical health examinations. Patients were then followed, with regular re-examinations conducted every two years. As the life cycle of this research has progressed, additional examinations have been included in the FHS as well as enrollment of an additional five thousand (5,000) new patients directly related to the original FHS patient cohort. The FHS continues to evolve and seeks to provide a living and growing longitudinal dataset for a variety of health conditions and their risk factors, both behavioral and genetic.
A minor subset of the longitudinal FHS data was obtained from the
Course Project Data Repository
for use. At the time of utilization, this subset contained
4240 observations
and 16 variables
.
fhs_raw <- read.csv('https://nlepera.github.io/sta551/HW02/data/fhs_data.csv')
#link to variable descriptions https://pengdsci.github.io/datasets/FraminghamHeartStudy/FraminghamHeartStudy-description.pdf#
A full description of the variables can be obtained at the following
link
.
A short form summary is included below for ease of reference. Numerical
data is indicated with a bold [N], while categorical data is
indicated with a bold [C]. Categorical variables are originally
encoded with integer values for data storage and utilization
considerations.
sex
Gender of the pt. (Male = 1 | Non-Male = 0)
[C]age
Age at time of pt. exam (years) [N]education
Level of pt’s education. (Some high school =
1 | High school / GED = 2 | Some college / vocational = 3 | College = 4)
[C]currentSmoker
Smoker status at time of pt. exam (Smoker
= 1 | Non-smoker = 2) [C]cigsPerDay
Number of cigarettes smoked per day at time
of pt. exam [N]BPmeds
Pt medical history: blood pressure medicine /
anti-hypertensive (Yes BPmeds = 1 | No BPmeds = 2) [C]prevalentStroke
Pt medical history: stroke (Yes stroke
= 1 | No stroke = 0) [C]prevalentHyp
Pt medical history: hypertension (Yes
hypertension = 1 | No hypertension = 0) [C]diabetes
Pt medical history: diabetes (Yes diabetes = 1
| No diabetes = 2) [C]totCol
Total cholesterol (mg/dL) [N]sysBP
Systolic blood pressure (mmHg) [N]diaDP
Diastolic blood pressure (mmHg) [N]BMI
Body Mass Index (kg/m2) [N]heart rate
Heart rate (beats/minute) [N]glucose
Blood glucose level (mg/dL) [N]TenYearCHD
Calculated 10 year risk of coronary heart
disease (At Risk = 1 / Not at risk = 0) [C]A brief summary of the raw data set (fhs_raw
) is
included below for reference. Categorical variables have not yet been
transformed and are still represented as integer values.
summary(fhs_raw)
male age education currentSmoker
Min. :0.0000 Min. :32.00 Min. :1.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:42.00 1st Qu.:1.000 1st Qu.:0.0000
Median :0.0000 Median :49.00 Median :2.000 Median :0.0000
Mean :0.4292 Mean :49.58 Mean :1.979 Mean :0.4941
3rd Qu.:1.0000 3rd Qu.:56.00 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :1.0000 Max. :70.00 Max. :4.000 Max. :1.0000
NA's :105
cigsPerDay BPMeds prevalentStroke prevalentHyp
Min. : 0.000 Min. :0.00000 Min. :0.000000 Min. :0.0000
1st Qu.: 0.000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000
Median : 0.000 Median :0.00000 Median :0.000000 Median :0.0000
Mean : 9.006 Mean :0.02962 Mean :0.005896 Mean :0.3106
3rd Qu.:20.000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:1.0000
Max. :70.000 Max. :1.00000 Max. :1.000000 Max. :1.0000
NA's :29 NA's :53
diabetes totChol sysBP diaBP
Min. :0.00000 Min. :107.0 Min. : 83.5 Min. : 48.0
1st Qu.:0.00000 1st Qu.:206.0 1st Qu.:117.0 1st Qu.: 75.0
Median :0.00000 Median :234.0 Median :128.0 Median : 82.0
Mean :0.02571 Mean :236.7 Mean :132.4 Mean : 82.9
3rd Qu.:0.00000 3rd Qu.:263.0 3rd Qu.:144.0 3rd Qu.: 90.0
Max. :1.00000 Max. :696.0 Max. :295.0 Max. :142.5
NA's :50
BMI heartRate glucose TenYearCHD
Min. :15.54 Min. : 44.00 Min. : 40.00 Min. :0.0000
1st Qu.:23.07 1st Qu.: 68.00 1st Qu.: 71.00 1st Qu.:0.0000
Median :25.40 Median : 75.00 Median : 78.00 Median :0.0000
Mean :25.80 Mean : 75.88 Mean : 81.96 Mean :0.1519
3rd Qu.:28.04 3rd Qu.: 83.00 3rd Qu.: 87.00 3rd Qu.:0.0000
Max. :56.80 Max. :143.00 Max. :394.00 Max. :1.0000
NA's :19 NA's :1 NA's :388
str(fhs_raw)
'data.frame': 4240 obs. of 16 variables:
$ male : int 1 0 1 0 0 0 0 0 1 1 ...
$ age : int 39 46 48 61 46 43 63 45 52 43 ...
$ education : int 4 2 1 3 3 2 1 2 1 1 ...
$ currentSmoker : int 0 0 1 1 1 0 0 1 0 1 ...
$ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
$ BPMeds : int 0 0 0 0 0 0 0 0 0 0 ...
$ prevalentStroke: int 0 0 0 0 0 0 0 0 0 0 ...
$ prevalentHyp : int 0 0 0 1 0 1 0 0 1 1 ...
$ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
$ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
$ sysBP : num 106 121 128 150 130 ...
$ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
$ BMI : num 27 28.7 25.3 28.6 23.1 ...
$ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
$ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
$ TenYearCHD : int 0 0 0 1 0 0 1 0 0 0 ...
In order to determine if easily measurable patient health measures
such as glucose levels and total cholesterol levels can act as predictor
values for more salient health measures such as patient BMI and patient
10 year CHD risk diagnosis, multiple regression models were created
utilizing the fhs
subset data. Before any data could be
analyzed or models built, feature variables were engineered to improve
analysis capabilities and all missing values were imputed using the
mice
package.
The feasibility of utilizing glucose and total cholesterol levels as BMI predictors was explored first. Multiple linear regression models were built and subsequently tested with 5-fold validation, comparing the mean square error (MSE) for each validation loop. Once a final best fit model with the lowest average MSE and least variation across all rounds of validation was selected, the selected prediciton model was utilized to complete test predictions based on hypothetical patient values.
Secondly the feasibility of utilizing glucose and total cholesterol levels as 10 year CHD risk diagnosis predictors was explored. Two logistic regression models were built and subsequently compared utilizing their Receiving Operating Characteristic (ROC) curves and calculated area under the curve (AUC) values. As a part of the model comparison, best fit cutoff values, \(\a\) (type 1 error rates), and \(\b\) (type 2 error rates) were calculated to determine the point of model best fit.
In order to prevent categorical values stored as integers from being incorporated into analysis of true numerical variables, the following variables were transformed from integer to logical values.
male
currentSmoker
BPMeds
prevalentStroke
prevalentHyp
diabetes
TenYearCHD
Additionally, the education
variable was transformed
from an integer to a factor, using the following replacement values:
education
= 1 -> “SomeHS”education
= 2 -> “HS-GED”education
= 3 -> “SomeCol-Voc”education
= 4 -> “College”A summary of the transformed variables is included to illustrate the transformation and to capture the overall retention of all variables and missing values.
fhs <- fhs_raw
fhs$male <- as.logical(as.integer(fhs$male))
fhs$currentSmoker <- as.logical(as.integer(fhs$currentSmoker))
fhs$BPMeds <- as.logical(as.integer(fhs$BPMeds))
fhs$prevalentStroke <- as.logical(as.integer(fhs$prevalentStroke))
fhs$prevalentHyp <- as.logical(as.integer(fhs$prevalentHyp))
fhs$diabetes <- as.logical(as.integer(fhs$diabetes))
fhs$TenYearCHD <- as.logical(as.integer(fhs$TenYearCHD))
fhs$education <- factor(fhs$education,
levels = c(1:4),
labels = c("SomeHS", "HS-GED", "SomeCol-Voc", "College"))
str(fhs)
'data.frame': 4240 obs. of 16 variables:
$ male : logi TRUE FALSE TRUE FALSE FALSE FALSE ...
$ age : int 39 46 48 61 46 43 63 45 52 43 ...
$ education : Factor w/ 4 levels "SomeHS","HS-GED",..: 4 2 1 3 3 2 1 2 1 1 ...
$ currentSmoker : logi FALSE FALSE TRUE TRUE TRUE FALSE ...
$ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
$ BPMeds : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ prevalentStroke: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ prevalentHyp : logi FALSE FALSE FALSE TRUE FALSE TRUE ...
$ diabetes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
$ sysBP : num 106 121 128 150 130 ...
$ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
$ BMI : num 27 28.7 25.3 28.6 23.1 ...
$ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
$ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
$ TenYearCHD : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
In order to reduce the number of categorical variables for analysis,
the variables prevalentHyp
and prevalentStroke
were combined to create a new feature variable entitled
prevalentCond
. These variables were able to be combined
into a single feature variable, as they were both binary categorical
values measuring past patient medical history. This new feature variable
prevalentCond
serves to record patient medical history for
both hypertension and stroke, utilizing the following potential
values:
Hyp
: Indicates the patient has a history of
Hypertension onlyHypStroke
: Indicates the patient has a history of
both Hypertension and StrokeNone
: Indicates the patient does not have a
history of Hypertension or StrokeStroke
: Indicates the patient has a history of
Stroke onlyProvisions for missing values were included, but as will be discussed
in later sections, neither prevalentHyp
nor
prevalentStroke
have missing values.
fhs$prevalentCond <- NA
for (i in 1:nrow(fhs)) {
if (fhs$prevalentHyp[i] == TRUE && fhs$prevalentStroke[i] == FALSE) {
fhs$prevalentCond[i] <- "Hyp"
}
else if (fhs$prevalentHyp[i] == TRUE && fhs$prevalentStroke[i] == TRUE) {
fhs$prevalentCond[i] <- "HypStroke"
}
else if (fhs$prevalentHyp[i] == FALSE && fhs$prevalentStroke[i] == FALSE) {
fhs$prevalentCond[i] <- "None"
}
else if (fhs$prevalentHyp[i] == FALSE && fhs$prevalentStroke[i] == TRUE) {
fhs$prevalentCond[i] <- "Stroke"
}
else if (fhs$prevalentHyp[i] == "" || fhs$prevalentStroke[i] == "") {
fhs$prevalentCond[i] <- "Missing-Val"
}
else fhs$prevalentCond[i] <- "Error"
}
In order to allow for proper handing in later logistic regression modeling, the characher variable categories as assigned were converted to factor values. A quick summary check of the newly created feature variable illustrates that all 4240 observations were retained, and no values are missing.
fhs$prevalentCond <- as.factor(as.character(fhs$prevalentCond))
str(fhs$prevalentCond)
Factor w/ 4 levels "Hyp","HypStroke",..: 3 3 3 1 3 1 3 3 1 1 ...
summary(fhs$prevalentCond)
Hyp HypStroke None Stroke
1298 19 2917 6
In order to reduce the number of categorical variables for analysis,
the variables sysBP
(systolic blood pressure) and
diaBP
(diastolic blood pressure) were binned and combined
to create a new feature variable entitled highBP
. The
parameters to bin and combine both systolic and diastolic blood pressure
readings into a single categorical variable describing the patient’s
high blood pressure status were obtained from the American Heart
Association at the following
link
fhs$highBP <- NA
for (i in 1:nrow(fhs)) {
if (fhs$sysBP[i] < 120 && fhs$diaBP[i] < 80) {
fhs$highBP[i] <- "Normal"
}
else if (fhs$sysBP[i] < 130 && fhs$diaBP[i] < 80) {
fhs$highBP[i] <- "Elevated"
}
else if (fhs$sysBP[i] < 140 | fhs$diaBP[i] < 90) {
fhs$highBP[i] <- "HBP1"
}
else if (fhs$sysBP[i] >= 140 | fhs$diaBP[i] >= 90) {
fhs$highBP[i] <- "HBP2"
}
else if (fhs$sysBP[i] >= 180 | fhs$diaBP[i] >= 120) {
fhs$highBP[i] <- "Crisis"
}
}
In order to allow for proper handing in later logistic regression modeling, the character variable categories as assigned were converted to factor values.
fhs$highBP <- as.factor(as.character(fhs$highBP))
str(fhs)
'data.frame': 4240 obs. of 18 variables:
$ male : logi TRUE FALSE TRUE FALSE FALSE FALSE ...
$ age : int 39 46 48 61 46 43 63 45 52 43 ...
$ education : Factor w/ 4 levels "SomeHS","HS-GED",..: 4 2 1 3 3 2 1 2 1 1 ...
$ currentSmoker : logi FALSE FALSE TRUE TRUE TRUE FALSE ...
$ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
$ BPMeds : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ prevalentStroke: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ prevalentHyp : logi FALSE FALSE FALSE TRUE FALSE TRUE ...
$ diabetes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
$ sysBP : num 106 121 128 150 130 ...
$ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
$ BMI : num 27 28.7 25.3 28.6 23.1 ...
$ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
$ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
$ TenYearCHD : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ prevalentCond : Factor w/ 4 levels "Hyp","HypStroke",..: 3 3 3 1 3 1 3 3 1 1 ...
$ highBP : Factor w/ 4 levels "Elevated","HBP1",..: 4 2 2 3 2 3 2 4 2 3 ...
Prior to utilizing the FHS data in any capacity, an understanding of the missing values for each category was obtained. A total count of missing values per variable as well as a total percentage of missing values for each variable was calculated.
sapply(fhs, function (x) sum(is.na(x))) #count of missing
male age education currentSmoker cigsPerDay
0 0 105 0 29
BPMeds prevalentStroke prevalentHyp diabetes totChol
53 0 0 0 50
sysBP diaBP BMI heartRate glucose
0 0 19 1 388
TenYearCHD prevalentCond highBP
0 0 0
sapply(fhs, function (x) round((mean(is.na(x))*100), 3)) #percent of total missing
male age education currentSmoker cigsPerDay
0.000 0.000 2.476 0.000 0.684
BPMeds prevalentStroke prevalentHyp diabetes totChol
1.250 0.000 0.000 0.000 1.179
sysBP diaBP BMI heartRate glucose
0.000 0.000 0.448 0.024 9.151
TenYearCHD prevalentCond highBP
0.000 0.000 0.000
The overall structure and pattern of missing values is represented in
the below table to allow for trend visualization. Each column of the
below plot represents a variable from the fhs_raw
dataset.
Each row represents a different pattern of present vs. missing data for
each variable. Blue filled cells indicate a
data point is available for that variable in the referenced pattern,
while red squares indicate a missing value
for that variable in the referenced pattern. For example, the 2nd to the
top row of the pattern table includes blue
cells for all variables, except the red cell
in the glucose
column, indicating the only missing value
for that pattern is within the glucose
variable.
The numeric values on the left hand side represent the frequency in
which that pattern of present vs. missing data is seen in the dataset
fhs_raw
. The right hand values indicate the total number of
missing values in that data pattern. For example, the top most row has a
left hand value of 3658, and a right hand value of zero (0), indicating
there are 3658 observations in which there are zero (0) missing values.
The 2nd to top row has a left hand missing value of 331, a right hand
value of one (1), and a red cell in the glucose
variable,
indicating there are 331 observations with one (1) missing value,
specifically in the glucose
variable.
md.pattern(fhs, rotate.names = TRUE)
male age currentSmoker prevalentStroke prevalentHyp diabetes sysBP diaBP
3658 1 1 1 1 1 1 1 1
331 1 1 1 1 1 1 1 1
93 1 1 1 1 1 1 1 1
8 1 1 1 1 1 1 1 1
51 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1
38 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
23 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
13 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0
TenYearCHD prevalentCond highBP heartRate BMI cigsPerDay totChol BPMeds
3658 1 1 1 1 1 1 1 1
331 1 1 1 1 1 1 1 1
93 1 1 1 1 1 1 1 1
8 1 1 1 1 1 1 1 1
51 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 0
9 1 1 1 1 1 1 0 1
38 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 0 0
23 1 1 1 1 1 0 1 1
4 1 1 1 1 1 0 1 1
2 1 1 1 1 1 0 1 1
13 1 1 1 1 0 1 1 1
4 1 1 1 1 0 1 1 1
1 1 1 1 1 0 1 1 1
1 1 1 1 1 0 1 0 1
1 1 1 1 0 1 1 1 1
0 0 0 1 19 29 50 53
education glucose
3658 1 1 0
331 1 0 1
93 0 1 1
8 0 0 2
51 1 1 1
1 1 0 2
9 1 1 1
38 1 0 2
1 0 1 2
1 1 0 3
23 1 1 1
4 1 0 2
2 0 1 2
13 1 1 1
4 1 0 2
1 0 1 2
1 1 0 3
1 1 1 1
105 388 645
A total of 18 unique patterns of present vs. missing data
combinations are observed in the FHS subset, fhs_raw
. The
most frequently seen pattern is all values present, the second most
frequent pattern is 1 missing value in the glucose
variable, and the most frequent pattern with more than one missing value
is a combination of both missing totChol
and
glucose
values, as seen in thirty eight (38) instances.
As the variable with the largest total percentage of blank values
(9.151% missing) was glucose
, an imputation model was built
around the glucose
variable to create the best possible
fit. For this imputation model fitting, the fhs
data was
subset for only numerical categories to allow for ease of analysis and
simple prediction utilizing the MICE
imputation package.
Once this subset (fhs_num
) was created, three different
MICE imputation models were created: Predictive mean matching (PMM)
Imputation, Classification and Regression Trees (CART)
Imputation, and Lasso Linear Regression Imputation.
fhs_num <- fhs %>%
dplyr::select(age, cigsPerDay, totChol, sysBP, diaBP, BMI, heartRate, glucose)
set.seed(123) #set seed to maintain consistency with imputation
fhs_imp <- data.frame(
original = fhs_num$glucose,
imputed_pmm = complete(mice(fhs_num, method = "pmm", print = FALSE))$glucose,
imputed_cart = complete(mice(fhs_num, method = "cart", print = FALSE))$glucose,
imputed_lasso = complete(mice(fhs_num, method = "lasso.norm", print = FALSE))$glucose
)
To determine the best fit of the three MICE imputation models tested
on the glucose
variable, the original glucose
distribution with missing values omitted was plotted on a histogram to
visualize the distribution, density, and mean value. The three
imputation models were plotted with the same configuration and axis
scales to allow for ease of visual comparison. This allows for visual
comparison of the distribution of the original glucose
variable data, compared to the imputed glucose
data.
Additionally, the mean glucose
value can be compared
between the original variable data, and the imputed glucose
data.
y_lim_imp <- range(0,0.04)
imp.orig <- ggplot(fhs_imp, aes(x=original)) +
geom_histogram(aes(y = ..density..), fill = "orange", alpha = 0.25, color = "orange") +
geom_density(aes(x = original), color = "darkorange4") +
ylim(y_lim_imp) +
geom_vline(aes(xintercept = mean(original, na.rm = TRUE)), color="darkorange4", size = 0.5) +
geom_text(aes(x=mean(original, na.rm = TRUE), label = round(mean(original, na.rm = TRUE), 2)), y= 0.03, x = 120, angle = 0)
imp.pmm <- ggplot(fhs_imp, aes(x = imputed_pmm)) +
geom_histogram(aes(y=..density..), fill = "blue", alpha = 0.25, color = "blue") +
geom_density(aes(x=imputed_pmm), color = "darkblue") +
ylim(y_lim_imp) +
geom_vline(aes(xintercept = mean(imputed_pmm)), color="darkblue", size = 0.5) +
geom_text(aes(x=mean(imputed_pmm, na.rm = TRUE), label = round(mean(imputed_pmm, na.rm = TRUE), 2)), y= 0.03, x = 120, angle = 0)
imp.cart <- ggplot(fhs_imp, aes(x = imputed_cart)) +
geom_histogram(aes(y=..density..), fill = "red", alpha = 0.25, color = "red") +
geom_density(aes(x=imputed_cart), color = "darkred") +
ylim(y_lim_imp) +
geom_vline(aes(xintercept = mean(imputed_cart)), color="darkred", size = 0.5) +
geom_text(aes(x=mean(imputed_cart, na.rm = TRUE), label = round(mean(imputed_cart, na.rm = TRUE), 2)), y= 0.03, x = 120, angle = 0)
imp.lasso <- ggplot(fhs_imp, aes(x = imputed_lasso)) +
geom_histogram(aes(y = ..density..), fill = "magenta", alpha = 0.25, color = "magenta") +
geom_density(aes(x=imputed_lasso), color = "darkmagenta") +
ylim(y_lim_imp) +
geom_vline(aes(xintercept = mean(imputed_lasso)), color="darkmagenta", size = 0.5) +
geom_text(aes(x=mean(imputed_lasso, na.rm = TRUE), label = round(mean(imputed_lasso, na.rm = TRUE), 2)), y= 0.03, x = 120, angle = 0)
grid.arrange(imp.orig, imp.pmm, imp.cart, imp.lasso, nrow = 2, ncol = 2, top = textGrob("Comparrison of Glucose Imputation Models"))
Upon reviewing the various imputed glucose
data outputs
of the above MICE imputation, the CART imputation was determined as the
best fit for the imputation model. As visible in the above histogram
data, the CART imputation model is the only model that retained the
original glucose
data mean, with a near identical
distribution and density.
As the remaining variables with missing values contain less than 2.5%
missing data, compared to the 9.151% missing data from the
glucose
variable, the best fit CART imputation is applied
to all variables in the fhs
dataset. To note, the
imputation was completed on all variables for this step, not just the
numeric variable subset utilized for imputation model selection.
The newly imputed dataset (fhs_imp2
) was then verified
to contain zero missing values
set.seed(123)
fhs_imp2 <- fhs
fhs_imp2 <- complete(mice(fhs, method = "cart", print = FALSE))
fhs_imp2$BPMeds <- as.logical(as.numeric(fhs_imp2$BPMeds))
sapply(fhs_imp2, function (x) sum(is.na(x)))
male age education currentSmoker cigsPerDay
0 0 0 0 0
BPMeds prevalentStroke prevalentHyp diabetes totChol
0 0 0 0 0
sysBP diaBP BMI heartRate glucose
0 0 0 0 0
TenYearCHD prevalentCond highBP
0 0 0
summary(fhs_imp2)
male age education currentSmoker
Mode :logical Min. :32.00 SomeHS :1758 Mode :logical
FALSE:2420 1st Qu.:42.00 HS-GED :1283 FALSE:2145
TRUE :1820 Median :49.00 SomeCol-Voc: 711 TRUE :2095
Mean :49.58 College : 488
3rd Qu.:56.00
Max. :70.00
cigsPerDay BPMeds prevalentStroke prevalentHyp
Min. : 0.000 Mode :logical Mode :logical Mode :logical
1st Qu.: 0.000 FALSE:4114 FALSE:4215 FALSE:2923
Median : 0.000 TRUE :126 TRUE :25 TRUE :1317
Mean : 9.084
3rd Qu.:20.000
Max. :70.000
diabetes totChol sysBP diaBP
Mode :logical Min. :107.0 Min. : 83.5 Min. : 48.0
FALSE:4131 1st Qu.:206.0 1st Qu.:117.0 1st Qu.: 75.0
TRUE :109 Median :234.0 Median :128.0 Median : 82.0
Mean :236.8 Mean :132.4 Mean : 82.9
3rd Qu.:263.0 3rd Qu.:144.0 3rd Qu.: 90.0
Max. :696.0 Max. :295.0 Max. :142.5
BMI heartRate glucose TenYearCHD
Min. :15.54 Min. : 44.00 Min. : 40.00 Mode :logical
1st Qu.:23.07 1st Qu.: 68.00 1st Qu.: 71.00 FALSE:3596
Median :25.40 Median : 75.00 Median : 78.00 TRUE :644
Mean :25.80 Mean : 75.88 Mean : 81.87
3rd Qu.:28.04 3rd Qu.: 83.00 3rd Qu.: 87.00
Max. :56.80 Max. :143.00 Max. :394.00
prevalentCond highBP
Hyp :1298 Elevated: 410
HypStroke: 19 HBP1 :1965
None :2917 HBP2 : 832
Stroke : 6 Normal :1033
The distributions and frequencies of the categorical variables (both
feature variables and native variables) highBP
,
prevalentCond
, TenYearCHD
,
BPMeds
, diabetes
, and male
were
visualized utilizing bar charts
y_lim <- range(0,4200)
highbp.den <- ggplot(fhs_imp2, aes(x = highBP)) +
geom_bar(fill = "lightblue") +
ylim(y_lim) +
labs(title = "High Blood Pressure
")
prevcond.den <- ggplot(fhs_imp2, aes(x = prevalentCond)) +
ylim(y_lim) +
geom_bar(fill = "lightblue") +
labs(title = "Prev. Conditions
(Stroke / Hypertension / Both)")
tenyear.den <- ggplot(fhs_imp2, aes(x = TenYearCHD)) +
ylim(y_lim) +
geom_bar(fill = "lightblue") +
labs(title = "Chronic Heart Disease
@ 10 Year Follow Up")
BPmed.den <- ggplot(fhs_imp2, aes(x = BPMeds)) +
ylim(y_lim) +
geom_bar(fill = "lightblue") +
labs(title = "Blood Pressure Medicine
")
diabet.den <- ggplot(fhs_imp2, aes(x = diabetes)) +
ylim(y_lim) +
geom_bar(fill = "lightblue") +
labs(title = "Diabetes
")
male.den <- ggplot(fhs_imp2, aes(x = male)) +
ylim(y_lim) +
geom_bar(fill = "lightblue") +
labs(title = "Male
")
grid.arrange(highbp.den, prevcond.den, tenyear.den, BPmed.den, diabet.den, male.den, nrow = 2, ncol = 3)
age.den <- ggplot(fhs_imp2, aes(x = age)) +
geom_histogram(aes(y = ..density..), fill = "lightblue") +
geom_density(aes (x = age), color = "darkblue") +
labs(title = "Age")
cigs.den <- ggplot(fhs_imp2, aes(x = cigsPerDay)) +
geom_histogram(aes(y = ..density..), fill = "lightblue") +
geom_density(aes (x = cigsPerDay), color = "darkblue") +
labs(title = "Num. Cigs Smoked Per Day")
chol.den <- ggplot(fhs_imp2, aes(x = totChol)) +
geom_histogram(aes(y = ..density..), fill = "lightblue") +
geom_density(aes (x = totChol), color = "darkblue") +
labs(title = "Total Cholesterol")
glu.den <- ggplot(fhs_imp2, aes(x = glucose)) +
geom_histogram(aes(y = ..density..), fill = "lightblue") +
geom_density(aes (x = glucose), color = "darkblue") +
labs(title = "Glucose")
bp.den <- ggplot(fhs_imp2) +
geom_histogram(aes(x=sysBP, y=..density.., fill = "Systolic" ), alpha = 0.75) +
geom_histogram(aes(x=diaBP, y=..density.., fill = "Diastolic"), alpha = 0.75) +
geom_density(aes(x = sysBP, y = ..density..), color = "darkgreen") +
geom_density(aes(x = diaBP, y = ..density..), color = "darkblue") +
labs(title = "Systolic & Diastolic Blood Pressure", fill = "Blood Pressure
Values", x = "Blood Pressure Reading (mmHG)") +
scale_fill_manual(values = c("lightblue", "lightgreen"))
bmi.den <- ggplot(fhs_imp2, aes(x = BMI)) +
geom_histogram(aes(y = ..density..), fill = "lightblue") +
geom_density(aes (x = BMI), color = "darkblue") +
labs(title = "Body Mass Index (BMI)")
hr.den <- ggplot(fhs_imp2, aes(x = heartRate)) +
geom_histogram(aes(y = ..density..), fill = "lightblue") +
geom_density(aes (x = heartRate), color = "darkblue") +
labs(title = "Heart Rate")
grid.arrange(age.den, cigs.den, chol.den, glu.den, bmi.den, hr.den, nrow = 2, ncol = 3)
grid.arrange(bp.den, nrow = 1, ncol = 2)
In order to address the practical question of assessing the impact of
various health measures on patient Body Mass Index (BMI
)
value, only clinically related measures will be retained for further
analysis.
Systolic sysBP
and diastolic diaBP
variables were combined into a single categorical feature variable
highBP
, both of these variables will be omitted from
further analysis
The Body Mass Index (BMI
) scale already accounts for
age in BMI calculation, so the age
variable will be omitted
from further analysis
Total number of cigarettes smoked per day
(cigsPerDay
) is not clinically linked to BMI and will thus
be omitted from further analysis. While cigarette consumption may impact
cardiovascular health or other health indicators, there is no clinical
significance to BMI.
Total cholesterol totChol
is a common measure of
cardiovascular fitness. As cardiovascular fitness rates are often
associated with lower BMI values, total cholesterol will be included in
further analysis to examine if total cholesterol values are a good
predictor value for BMI.
Glucose glucose
levels are often clinically related
to weight related disorders, such as type-II diabetes. As weight is
clinically associated with higher BMI values, the glucose data will be
included in further analysis to examine if glucose is a good predictor
value for BMI.
ggpairs(fhs_imp2, columns = c(10,13:15), aes(alpha = 0.05), lower = list(continuous = wrap("smooth", method = "lm", se = FALSE, colour = "hotpink", alpha = 0.25)), upper = list(continuous = wrap("cor")))
In reviewing the pairwise analysis, the following patterns were identified:
glucose
) and total cholesterol
(totChol
) are linearly correlated with the response
variable Body Mass Index (BMI
)glucose
and totChol
are not linearly
correlated, indicating there will be no co-linearity issues when
comparing with a linear regression.Linear regression can be utilized to assess correlation between the dependent variable, Body Mass Index (BMI), and a variety of predictor variables. Multiple linear regression models utilizing clinically associated predictor variables were created and analyzed for their respective goodness of fit using R2 values.
The following assumptions have been met to support linear regression:
To start, two linear regression models utilizing totChol
and glucose
predictor variables with a BMI
as
a response variable were created. The first model compares BMI with
total cholesterol (totChol
), glucose
(glucose
), and heart rate (heartRate
) in their
native formats, while the second model compares the log transformation
of BMI
with with total cholesterol (totChol
)
and glucose (glucose
) in their native formats.
hr.chol.glu <- (lm(BMI ~ heartRate + totChol + glucose, data = fhs_imp2))
The below pander table indicates that totChol
has the
regression coefficient with the most statistical significance (p-value
< 0.05 indicates significantly different from zero (0) and
p-value|totChol = 1.251e-12 indicating strong statistic
significance). With a regression coefficient of 0.009918, this indicates
that for each increase of 1 mg/dL in a patient’s total cholesterol
value, the BMI will increase by 0.009918. Additionally, the second most
significant variable was identified as glucose, with a p-value|glucose =
6.6e-5 with a regression coefficient of 0.01182. This
indicates that for every 1 mg/dL increase in patient glucose, BMI will
increase 0.01182. Finally the third most significant variable was
identified as heart rate, with a p-value|heartrate = 0.0008007
indicating statistical significance. With a regression coefficient of
0.01724, an increase of 1 bpm in a patient’s heart rate is associated
with an 0.01724 increase in BMI.
An overall R2 value of 0.0217795 indicates that 2.14795% of the variance in BMI data is associated with the combination of glucose, heart rate, and total cholesterol levels.
In viewing the residual plots the following information is obtained
regarding the initial linear regression model
(hr.glu.chol
):
pander::pander(summary(hr.chol.glu)$coef, caption = "Summary statistics of the regression coefficients of the initial 3 predictor variable model")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 21.19 | 0.5223 | 40.56 | 3.32e-304 |
heartRate | 0.01737 | 0.005193 | 3.345 | 0.0008295 |
totChol | 0.009883 | 0.001397 | 7.075 | 1.738e-12 |
glucose | 0.01168 | 0.002621 | 4.457 | 8.514e-06 |
kable(summary(hr.chol.glu)$r.squared, caption = "R^2^ value of Intial Model")
x |
---|
0.0211883 |
par(mfrow=c(2,2), mar=c(2,3,2,2))
plot(hr.chol.glu)
Before completing a log transformation, a Box-Cox transformation to
identify a potential power transformation of the response variable,
BMI
, was completed. Utilizing the 95% confidence interval,
it was determined that the natural log is not the best transformation
for this regression model. This output occurs when the dependent
variable, BMI
is not normally distributed. The difference
in actual BMI
distribution and the hypothetical
BMI
normal distribution utilizing the real BMI
standard deviation and means values are visible in the distribution plot
below.
bc.bmi <- boxcox(BMI ~ totChol + glucose,
data = fhs_imp2,
lambda = seq(-1, 1, length = 20),
xlab = expression(paste(lambda)))
title(main = 'Box-Cox Transformation: 95% CI of lambda')
bmi.den +
geom_line(stat = "density", aes(x = BMI, color = "BMI Density Distribution"), size = 0.7) +
stat_function(fun = dnorm, args = list(mean=mean(fhs_imp2$BMI), sd=sd(fhs_imp2$BMI)), aes(color = "Standard Normal Curve"), size = 0.7, linetype = "dashed") +
scale_color_manual(values = c( "BMI Density Distribution" = "blue", "Standard Normal Curve" = "black")) +
labs( x = "BMI", y = "Density", title = "Distribution of BMI data vs Normal Distribution", color = "Legend")
As shown in the previous sections along with the 95% confidence
interval from the Box-Cox analysis, the BMI
data is near
normally distributed, but not quite. In order to determine the best fit
lambda value, x value associated with the the maximum y value from the
box-cox transformation plot was utilized
[-0.5959596]
.
[1] -0.5959596
Utilizing the calculated best fit \(\lambda\) value of -0.5959596, the
following Box-Cox transformation was completed on the
fhs_imp2$BMI
data:
\[\mathbf{y} = \frac{{x^\lambda}-1}{\lambda}\]
if
\[\mathbf{\lambda}\neq0\] thus
creating the power transformed BMI data (y), under the
bc2.bmi
variable.
fhs_imp2$bc2.bmi <- ((fhs_imp2$BMI**op.lambda)-1)/(op.lambda)
chol.glu <- lm(bc2.bmi ~ totChol * glucose, data = fhs_imp2)
The below pander table indicates that totChol
has the
regression coefficient with the most statistical significance (p-value
< 0.05 indicates significantly different from zero (0) and
p-value|totChol = 3.364e-12 indicating strong statistic
significance). With a regression coefficient of 0.0001657, this
indicates that for each increase of 1 mg/dL in a patient’s total
cholesterol value, the Box-Cox transformed BMI value will increase by
0.0001657. Additionally, the second most significant variable was
identified as glucose, with a p-value|glucose = 5.839e-08
with a regression coefficient of 0.0003744. This indicates that for
every 1 mg/dL increase in patient glucose,the Box-Cox transformed BMI
value will increase 0.0003744. Finally the interaction between
totChol
and glucose
was examined, and found to
have statistical significance per the evaluated p value (p value <
0.05 indicates significantly different from zero (0) and p
value|totalChol:glucose = 4.118e-06). The significance of
this relationship indicates that the Box-Cox transformed BMI value is
impacted by the product of both the total cholesterol and glucose
values; meaning the overall impact of the net change in the product of
totChol
and glucose
on BMI
is
impacted by changes to either the totChol
,
glucose
, or both variables. The impact of this relationship
is captured in the regression coefficient of This interaction’s
regression coefficient value of -1.242e-06 indicates that as
the product of totChol
and glucose
increased
by 1 mL2/dL2, the Box-Cox transformed BMI value
decreases by -1.242e-06.
An overall R2 value of 0.0261903 indicates that 2.61903% of the variance in the Box-Cox transformed BMI data is associated with the product of glucose and total cholesterol levels.
In viewing the residual plots the following information is obtained
regarding the second linear regression model
(chol.glu
):
pander::pander(summary(chol.glu)$coef, caption = "Summary statistics of the regression coefficients of the Second Model")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1.394 | 0.006052 | 230.3 | 0 |
totChol | 0.0001439 | 2.421e-05 | 5.944 | 2.997e-09 |
glucose | 0.0002968 | 6.978e-05 | 4.254 | 2.146e-05 |
totChol:glucose | -9.633e-07 | 2.764e-07 | -3.485 | 0.0004963 |
kable(summary(chol.glu)$r.squared, caption = "R^2^ value of Second Model")
x |
---|
0.0241806 |
par(mfrow=c(2,2), mar=c(2,3,2,2))
plot(chol.glu)
### Third Model (chol.glu2) An influential Value (OBS 3845) was removed
based on the results indicated in the above Residuals vs Leverage plot
for the second model
chol.glu2
. The resultant the model was
re-reviewed as chol.glu2
, and all observations were now
shown to remain within Cook’s distance, indicating no influential values
remain. This third model illustrated a potential better fit with an
increased R2 value of 0.0773365 indicating 2.73365% variance
in Box-Cox transformed BMI variable attributable to the product of
glucose and total cholesterol values, reduced variance as seen in the
Q-Q Residuals plot, and reduced outliers on the Residuals vs Fitted
plot.
fhs_imp2mod3 <- fhs_imp2[-3845,]
chol.glu2 <- lm(bc2.bmi ~ totChol * glucose, data = fhs_imp2mod3)
pander::pander(summary(chol.glu2)$coef, caption = "Summary statistics of the regression coefficients of the Second Model Without Influential Variable")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1.387 | 0.006445 | 215.2 | 0 |
totChol | 0.0001732 | 2.617e-05 | 6.619 | 4.066e-11 |
glucose | 0.0003777 | 7.496e-05 | 5.038 | 4.892e-07 |
totChol:glucose | -1.324e-06 | 3.023e-07 | -4.381 | 1.21e-05 |
kable(summary(chol.glu2)$r.squared, caption = "R^2^ value of Second Model without Influential Value")
x |
---|
0.0249064 |
par(mfrow=c(2,2), mar=c(2,3,2,2))
plot(chol.glu2)
To cross validate all three proposed linear regression models, the data was first split 80:20 for training:test data, the training data was then run through a five fold validation to compare the mean square error (MSE) of all three models. The MSE values were plotted and the overall values for each fold of validation were compared to determine the best fit model.
set.seed(123)
#shuffling the data
n <- dim(fhs_imp2)[1] #set sample size
obs.ID <- 1:n #create observation ID to allow for random splitting
n.train <- round(0.8*n) #create sample size for training data. round to keep whole number since no partial observations
shuffle.id <- sample(obs.ID, n, replace = FALSE) #randomize created observation ID, does not allow for dupes in list.
shuffle.fhs <- fhs_imp2[shuffle.id, ] #orders rows of dataframe to be in order of shuffle.id created random obs.id list
#splitting the data
train.data.linear <- shuffle.fhs[1:n.train, ] #splitting shuffled data from 1st obs to end of training data (80%)
test.data.linear <- shuffle.fhs[(n.train + 1):n, ] #splitting shuffled data from 1st obs after end of training data to end of remaining data (test)
n.fold <- round(n.train/5)-1 #number of observations in each of the 5 folds of validation
#validation loop - 5 fold
mse.lin.1 <- rep(0,5)
mse.lin.2 <- rep(0,5)
for (i in 1:5){
valid.id = ((i-1) * n.fold + 1):(1 * n.fold) #breaks down folds into 5 groups
lin.train = train.data.linear[-valid.id, ]
lin.valid = train.data.linear[valid.id, ]
#building models to run off of folded data
m1 = lm(BMI ~ heartRate + totChol + glucose, data = lin.train)
m2 = lm(bc2.bmi ~ totChol * glucose, data = lin.train)
#predicting BMI using models
predm1 = predict(m1, newdata = lin.train)
predm2 = predict(m2, newdata = lin.train)
#calculate mean square error between predicted values and values from validation data
mse.lin.1[i] = mean((predm1 - lin.valid$BMI)^2)
mse.lin.2[i] = mean((predm2 - lin.valid$bc2.bmi)^2)
}
set.seed(123)
#repeat with data that has influential value removed (model 3)
#shuffling the data
n.3 <- dim(fhs_imp2mod3)[1] #set sample size
obs.ID.3 <- 1:n.3 #create observation ID to allow for random splitting
n.train.3 <- round(0.8*n) #create sample size for training data. round to keep whole number since no partial observations
# n.train = n.train.3 - sample spiting retained from first validation run through
shuffle.id.3 <- sample(obs.ID.3, n.3, replace = FALSE) #randomize created observation ID, does not allow for dupes in list.
shuffle.fhs.3 <- fhs_imp2mod3[shuffle.id.3, ] #orders rows of dataframe to be in order of shuffle.id created random obs.id list
#splitting the data
train.data.linear.3 <- shuffle.fhs.3[1:n.train.3, ] #splitting shuffled data from 1st obs to end of training data (80%)
test.data.linear.3 <- shuffle.fhs.3[(n.train.3 + 1):n.3, ] #splitting shuffled data from 1st obs after end of training data to end of remaining data (test)
n.fold.3 <- round(n.train.3/5)-1 #number of observations in each of the 5 folds of validation
#validation loop - 5 fold
mse.lin.3 <- rep(0,5)
for (i in 1:5){
valid.id.3 = ((i-1) * n.fold.3 + 1):(1 * n.fold.3) #breaks down folds into 5 groups
lin.train.3 = train.data.linear.3[-valid.id.3, ]
lin.valid.3 = train.data.linear.3[valid.id.3, ]
#building models to run off of folded data
m3 = lm(bc2.bmi ~ totChol * glucose, data = lin.train.3)
#predicting BMI using models
predm3 = predict(m3, newdata = lin.train.3)
#calculate mean square error between predicted values and values from validation data
mse.lin.3[i] = mean((predm3 - lin.valid.3$bc2.bmi)^2)
}
mse.1 = mean(mse.lin.1)
mse.2 = mean(mse.lin.2)
mse.3 = mean(mse.lin.3)
mse.mod1 <- data.frame (
x = 1:5,
y = mse.lin.1
)
mse.mod2 <- data.frame (
x = 1:5,
y = mse.lin.2
)
mse.mod3 <- data.frame (
x = 1:5,
y = mse.lin.3
)
mse.mod1.plot <- ggplot (mse.mod1, aes(x = x, y = y)) +
geom_point(aes(color = "Model 1"))+
geom_line(aes(color = "Model 1")) +
geom_point (aes (x = mse.mod2$x, y = mse.mod2$y, color = "Model 2")) +
geom_line (aes (x = mse.mod2$x, y = mse.mod2$y, color = "Model 2")) +
geom_point(aes(x = mse.mod3$x, y = mse.mod3$y, color = "Model 3"))+
geom_line (aes(x = mse.mod3$x, y = mse.mod3$y, color = "Model 3")) +
labs (x = "Validation Fold", y = "Calculated MSE", title = "Mean Square Error (MSE)
Per Validation Round
") +
scale_color_manual (values = c("Model 1" = "orange", "Model 2" = "blue", "Model 3" = "green" ))
mse.mod2.plot <- ggplot (mse.mod2, aes(x = x, y = y)) +
geom_point (aes (x = mse.mod2$x, y = mse.mod2$y), color = "blue") +
geom_line (aes (x = mse.mod2$x, y = mse.mod2$y), color = "blue") +
geom_point(aes(x = mse.mod3$x, y = mse.mod3$y), color = "green")+
geom_line (aes(x = mse.mod3$x, y = mse.mod3$y), color = "green") +
labs (x = "Validation Fold", y = "Calculated MSE", title = "Mean Square Error (MSE)
Per Validation Round:
Models 2 & 3 Closer Look")
mse.table <- data.frame (
x = c("Model 1", "Model 2", "Model 3"),
y = c(mse.1, mse.2, mse.3)
)
colnames(mse.table) <- c("Predictive Model", "Mean MSE Across all Validation Folds")
grid.arrange(mse.mod1.plot, mse.mod2.plot, nrow = 1, ncol = 2)
Based on the visual representation of the MSE values for each fold of
validation paired with the mean MSE for each model, it has been
determined that Model 2, chol.glu
, the linear regression of
the Box-Cox transformed BMI values compared to the product of the
glucose and total cholesterol values, is the best fit for the data.
Utilizing Model 2 as the predictive model will give the closest to
accurate prediction values for a patient’s BMI based on their measured
total cholesterol and glucose values, as evidenced by the lowest mean
MSE value of all three models across all five folds of validation.
Utilizing the final model, fit to all data, the predicted BMI for a
glucose of 85 and total cholesterol of 212 was assessed. Due to the BMI
data undergoing a box cox transformation, the predicted value produced
by the chol.glu
model must undergo back transformation
using the following equation.
\[\mathbf{x} = (y * \lambda + 1)^\frac{1}{\lambda}\] Without back transformation the final value will not be an actual BMI value. The previously calculated \(\lambda\) value of -0.5959597 was utilized for the back transformation.
predictor.data.linear <- data.frame (totChol = 212,glucose = 85)
pred.boxBMI <- predict(chol.glu, newdata = predictor.data.linear, type = "response")
pred.BMI <- (pred.boxBMI * op.lambda + 1)**(1/op.lambda)
kable(pred.BMI, caption = "Predicted patient BMI with a Glucose of 85 and Total Cholesterol of 212")
x |
---|
25.10196 |
Logistic regression can be utilized to assess the correlation between
a binary response variable, 10 Year CHD risk diagnosis
(TenYearCHD
), and a variety of predictor variables.
Multiple logistic regression models utilizing clinically associated
predictor variables were created and analyzed for their respective
goodness of fit using their calculated Receiving Operating
Characteristic (ROC) curves and Area Under the Curve (AUC) values.
The following assumptions have been met to support linear regression:
To complete the logistic regression analyses and create logistic
recession models, the TenYearCHD
variable was selected as
response variable, overall looking to determine correlation with odds of
patient diagnosis with a 10 year CHD risk, characterized by a binary of
TRUE for diagnosed at risk and FALSE for diagnosed not at risk.
First a simple logistic regression model was created to compare the
odds of patient diagnosis with 10 year CHD risk as predicted by the
previously identified predictor variables used in the best fit linear
regression model: glucose levels (glucose
) and total
cholesterol levels (totChol
).
simple.model <- glm(TenYearCHD ~ totChol + glucose, family = binomial, data = fhs_imp2)
kable(summary(simple.model)$coef, caption = "Summary table of significance test for glucose & total cholesterol impact on Ten Year CHD Risk diagnosis")
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -3.6517943 | 0.2609017 | -13.996822 | 0.0e+00 |
totChol | 0.0044513 | 0.0009294 | 4.789449 | 1.7e-06 |
glucose | 0.0103189 | 0.0015144 | 6.813706 | 0.0e+00 |
Utilizing the simple logistic regression model, both glucose and total cholesterol illustrated statistically significant positive correlation with 10 year CHD risk diagnosis. In layman’s terms, as a patient’s glucose and total cholesterol values increase, the odds the patient will be diagnosed with a 10 year CHD risk increases. By exponentiating the model, it is found that the exponentiated slope coefficients for glucose and total cholesterol levels are 1.01037237 and 1.00446124 respectively. Thus we estimate a 1.037% increase in the odds of a patient being diagnosed with 10 year CHD risk for each 1 mg/dL increase in said patient’s glucose levels. Additionally, we estimate a 0.446% increase in the odds of a patient being diagnosed with 10 year CHD risk for each 1 mg/dL increase in said patient’s total cholesterol levels.
exp(simple.model$coeff)
(Intercept) totChol glucose
0.02594453 1.00446124 1.01037237
exp(confint(simple.model))
2.5 % 97.5 %
(Intercept) 0.01548977 0.04310716
totChol 1.00263070 1.00629401
glucose 1.00742438 1.01344303
Rather than utilizing the approach of pre-determining the predictor
variables similar to that utilized in the initial model
simple.model
, a cumulative model entitled
full.model
was created to utilize as the starting point for
automatically identifying the significant and non-significant variables
based on AUC values automatically calculated by the step()
function.
The same predictor values from the initial model
simple.model
were utilized as the reduced model for the
step function. After shrinking to the reduced model, only glucose has
demonstrated statistical significance with p values of 0.0000006 while
total cholesterol measured in with a p value of and 0.1204717 indicating
a lack of significance. Reviewing the regression coefficients, both
glucose and total cholesterol indicate a positive correlation of
1.0079240235 with the odds of a patient being diagnosed with 10 year CHD
risk. This indicates that for every 1 mg/dL a patient’s glucose levels
increase, the odds of diagnosis with a 10 year risk factor for chronic
heart disease increases by 0.792%.
full.model <- glm(TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + BPMeds + prevalentCond + diabetes + totChol + sysBP + diaBP + BMI + heartRate + glucose + highBP, family = binomial, data = fhs_imp2)
reduced.model <- simple.model
final.model <- step(full.model,
scope = list(lower = formula(reduced.model), upper = formula(full.model)),
data = fhs_imp2,
direction = "backward",
trace = 0)
kable(summary(final.model)$coef, caption = "Summary table of significance test for variables impacting Ten Year CHD Risk diagnosis in the reduced model")
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -8.1097060 | 0.5460834 | -14.8506734 | 0.0000000 |
maleTRUE | 0.5120966 | 0.0985004 | 5.1989289 | 0.0000002 |
age | 0.0633016 | 0.0059870 | 10.5731389 | 0.0000000 |
cigsPerDay | 0.0210554 | 0.0038440 | 5.4774737 | 0.0000000 |
BPMedsTRUE | 0.3165451 | 0.2170697 | 1.4582645 | 0.1447677 |
prevalentCondHypStroke | 1.0279179 | 0.4882259 | 2.1054148 | 0.0352552 |
prevalentCondNone | -0.2200811 | 0.1266463 | -1.7377612 | 0.0822529 |
prevalentCondStroke | 0.3556615 | 1.1321075 | 0.3141587 | 0.7534005 |
totChol | 0.0015846 | 0.0010205 | 1.5527965 | 0.1204717 |
sysBP | 0.0128276 | 0.0026910 | 4.7668855 | 0.0000019 |
glucose | 0.0078928 | 0.0015773 | 5.0041302 | 0.0000006 |
exp(final.model$coeff)
(Intercept) maleTRUE age
0.0003006072 1.6687863145 1.0653481073
cigsPerDay BPMedsTRUE prevalentCondHypStroke
1.0212786039 1.3723780869 2.7952399361
prevalentCondNone prevalentCondStroke totChol
0.8024537155 1.4271243469 1.0015858956
sysBP glucose
1.0129102239 1.0079240235
exp(confint(final.model))
2.5 % 97.5 %
(Intercept) 0.0001020661 0.0008686866
maleTRUE 1.3761292419 2.0249376415
age 1.0529774542 1.0779904685
cigsPerDay 1.0135894547 1.0289865266
BPMedsTRUE 0.8900118566 2.0878543179
prevalentCondHypStroke 1.0694788453 7.4454903517
prevalentCondNone 0.6265342414 1.0294962687
prevalentCondStroke 0.0717671519 9.7659883733
totChol 0.9995702905 1.0035822724
sysBP 1.0075875923 1.0182799243
glucose 1.0048311932 1.0110828877
A reliable way to compare the efficacy of the model fit overall and in comparison to another model is the ROC curve and associated Area Under the Curve (AUC) values.
Cutoff values were automatically calculated and are demonstrated below on the ROC curve plot. These cutoff points represent the best possible values at which the model fits the original data best. These values represent the respective false positive rates (\(\alpha = 1 - Specificity\)) and false negative rates (\(\beta = 1 - Sensitivity\)) that will produce the best fit outcomes for the predictor model.
no10yr <- which(fhs_imp2$TenYearCHD == FALSE)
yes10yr <- which(fhs_imp2$TenYearCHD == TRUE)
fhs_imp2$CHD = 0
fhs_imp2$CHD[yes10yr] = 1
cutoff <- seq(0, 1, length = 100)
sensitiv = NULL
specif = NULL
sensitiv2 = NULL
specif2 = NULL
newdata.simple <- data.frame(fhs_imp2)
predict.data.log <- predict.glm(simple.model, newdata.simple, type = "response")
predict.data.log2 <- predict.glm(final.model, newdata.simple, type = "response")
for (i in 1:100){
fhs_imp2$CHDtrain = as.numeric(predict.data.log > cutoff[i])
TN = sum(fhs_imp2$CHDtrain == 0 & fhs_imp2$CHD == 0) # true negative
FN = sum(fhs_imp2$CHDtrain == 0 & fhs_imp2$CHD == 1) # false negative
FP = sum(fhs_imp2$CHDtrain == 1 & fhs_imp2$CHD == 0) # false positive
TP = sum(fhs_imp2$CHDtrain == 1 & fhs_imp2$CHD == 1) # true positive
sensitiv[i] = TP / (TP + FN)
specif[i] = TN / (TN + FP)
}
for (i in 1:100){
fhs_imp2$CHDtrain2 = as.numeric(predict.data.log2 > cutoff[i])
TN2 = sum(fhs_imp2$CHDtrain2 == 0 & fhs_imp2$CHD == 0) # true negative
FN2 = sum(fhs_imp2$CHDtrain2 == 0 & fhs_imp2$CHD == 1) # false negative
FP2 = sum(fhs_imp2$CHDtrain2 == 1 & fhs_imp2$CHD == 0) # false positive
TP2 = sum(fhs_imp2$CHDtrain2 == 1 & fhs_imp2$CHD == 1) # true positive
sensitiv2[i] = TP2 / (TP2 + FN2)
specif2[i] = TN2 / (TN2 + FP2)
}
specificity = 1 - specif
sensitivity = sensitiv
specificity2 = 1 - specif2
sensitivity2 = sensitiv2
prediction <- predict.data.log
category <- fhs_imp2$CHD == 1
ROCobj <- roc(category, prediction)
AUC.simple <- round(auc(ROCobj), 5)
prediction2 <- predict.data.log2
category2 <- fhs_imp2$CHD == 1
ROCobj2 <- roc(category2, prediction2)
AUC.final <- round(auc(ROCobj2), 5)
specif.ori <- ROCobj$specificities
specif.ori2 <- ROCobj2$specificities
sens.ori <- ROCobj$sensitivities
sens.ori2 <- ROCobj2$sensitivities
diff <- abs(sens.ori - specif.ori)
diff2 <- abs(sens.ori2 - specif.ori2)
min.simple <- which(diff == min(diff))
min.final <- which(diff2 == min(diff2))
roc.plot <- plot(specificity, sensitivity,
type = "l",
main = "Model Comparrison Using ROC and AUC",
xlab="1-specificity",
ylab="sensitivity") +
segments (0,0,1,1, lty=2, col = "red") +
lines(specificity, sensitivity, lwd = 1, col = "navy") +
lines(specificity2, sensitivity2, lwd = 1, col = "darkgreen") +
text(0.8, 0.2, paste("Simple Model AUC = ", AUC.simple), col = "navy", cex = 0.8) +
text(0.8, 0.25, paste("Final Model AUC = ", AUC.final), col = "darkgreen", cex = 0.8) +
points((1-specif.ori[min.simple]), specif.ori[min.simple], col = "navy", pch = 17) +
points((1-specif.ori2[min.final]), specif.ori2[min.final], col = "darkgreen", pch = 17)
text(0.8, 0.18, paste( "Simple Model Calculated Cutoff = (", round((1-specif.ori[min.simple]),4), ", " , round(sens.ori[min.simple],4),")"), col = "navy", cex = 0.8) +
text(0.8, 0.23, paste( "Final Model Calculated Cutoff = (", round((1-specif.ori2[min.final]),4), ", " , round(sens.ori2[min.final],4),")"), col = "darkgreen", cex = 0.8)
integer(0)
pred.table = cbind(Error.Type = c('False Positive ($alpha$)', "False Negative ($beta$)"), Error.Value = c((1-specif.ori[min.simple]), sens.ori[min.simple]))
kable(pred.table, caption = "Error values for Simple Model Calculated Cutoff")
Error.Type | Error.Value |
---|---|
False Positive (\(alpha\)) | 0.442714126807564 |
False Negative (\(beta\)) | 0.557453416149068 |
pred.table2 = cbind(Error.Type = c('False Positive ($alpha$)', "False Negative ($beta$)"), Error.Value = c((1-specif.ori2[min.final]), sens.ori2[min.final]))
kable(pred.table2, caption = "Error values for Final Model Calculated Cutoff")
Error.Type | Error.Value |
---|---|
False Positive (\(alpha\)) | 0.324805339265851 |
False Negative (\(beta\)) | 0.675465838509317 |
Overall the final.model
returned a fair AUC value of
0.73244 indicating the model preformed fairly with acceptable
discrimination between classes. The simple.model
returned a
lower AUC value of 0.57987 indicating a near fail in the model
prediction. The closer a model’s AUC value is to a full 1.000 value, the
better fit the model is in terms of predictive ability.
This indicates the second logistic regression model
final.model
created utilizing the step()
function produced the best fit predictor model for predicting a
patient’s 10 year Chronic Heart Disease risk diagnosis
TenYearCHD
.
Patient health has been at the forefront of physician and patient concerns for decades, as evidenced by the search for predictive measures for long term patient health. The FHS study has underscored this desire for understanding what factors will impact a patient long term’s prognosis.
After the creation and analysis of various prediction models for a
patient’s impactful health factors: BMI and 10 year CHD risk diagnosis,
it was found that some models preform better than others. While full
causation cannot be conferred from these results, it was found that the
linear regression model of Box-cox transformed BMI data utilizing the
product of glucose and total cholesterol produced the most accurate
predictions of patient BMI. Additionally it was found that the final
logistic regression model comparing the significant variables as
selected by the step()
function produced the most accurate
predicitons of a patients 10 year CHD risk diagnosis when held to \(\alpha = 0.324805339265851\) and \(\beta = 0.675465838509317\)
Heart.org - American Heart Association blood pressure ranges https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings