ContextBase Logo



Section 1: Data Preparation and Exploration

In this section, we will explore the dataset used for analyzing the risk factors associated with coronary heart disease. The data is provided in the file “SA_Heart.csv”, which contains information collected from various individuals, including their demographic characteristics, lifestyle factors, and health indicators. We will begin by reading the data into R and assigning it to the object heart. This will allow us to perform further data processing, visualization, and modeling tasks on the dataset.

heart <- read.csv("SA_Heart.csv")
library(knitr)
kable(head(heart), caption = "First five rows of dataset:")
First five rows of dataset:
ID sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
1 160 12.00 5.73 23.11 Present 49 25.30 97.20 52 1
2 144 0.01 4.41 28.61 Absent 55 28.87 2.06 63 1
3 118 0.08 3.48 32.28 Present 52 29.14 3.81 46 0
4 170 7.50 6.41 38.03 Present 51 31.99 24.26 58 1
5 134 13.60 3.50 27.78 Present 60 25.99 57.34 49 1
6 132 6.20 6.47 36.21 Present 62 30.77 14.14 45 0



Section 2: Exploratory Data Analysis

Exploratory data analysis (EDA) is a crucial step in understanding the characteristics of the dataset and identifying potential patterns, relationships, and anomalies that may exist within the data. In this Section, we will perform graphical and numerical summaries for each variable in the dataset, providing insights into their distributions and relationships. These visualizations and summaries will aid in determining the appropriate statistical techniques and modeling approaches for further analysis.

library(ggplot2)
library(gridExtra)

p1 <- ggplot(heart, aes(x = sbp)) + geom_histogram()
p2 <- ggplot(heart, aes(x = tobacco)) + geom_histogram()
p3 <- ggplot(heart, aes(x = ldl)) + geom_histogram()
p4 <- ggplot(heart, aes(x = adiposity)) + geom_histogram()
p5 <- ggplot(heart, aes(x = famhist)) + geom_bar()
p6 <- ggplot(heart, aes(x = typea)) + geom_histogram()
p7 <- ggplot(heart, aes(x = obesity)) + geom_histogram()
p8 <- ggplot(heart, aes(x = alcohol)) + geom_histogram()
p9 <- ggplot(heart, aes(x = age)) + geom_histogram()
p10 <- ggplot(heart, aes(x = chd)) + geom_bar()

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, ncol = 3)

The numerical summaries and graphical displays show the distributions of the variables. The right skewed variables include sbp, tobacco, ldl, alcohol, and obesity. The left skewed variables include adiposity, typea, and age. The variable adiposity is the closet to a normal distribution. The categorical variables famhist and chd show the counts for each category.

In regression models with categorical predictors, one level of the categorical variable needs to be chosen as the reference or baseline category. The coefficients for the other levels are interpreted relative to this reference category.

It is generally recommended to choose the reference category that represents the “normal” or “control” group, or the category with the largest sample size. In this case, the famhist variable has two levels: “Present” and “Absent”. Since “Absent” represents the group without a family history of heart disease, it makes sense to use this as the reference category.

Using “Absent” as the reference category allows for a more straightforward interpretation of the model coefficients. The coefficient for “Present” would then represent the change in the log odds of having coronary heart disease (chd) for individuals with a family history of heart disease, compared to those without a family history.



Section 4: Exploring Relationships Between Continuous Predictors

In this section, we will explore the relationships between the continuous predictor variables in the dataset. Visualizing these relationships can provide valuable insights into the nature and strength of the associations between the predictors, which can inform variable selection and model building processes. One effective way to visualize these relationships is through a scatterplot matrix, which displays pairwise scatterplots for all combinations of the continuous predictors.

pairs(heart[, c("sbp", "tobacco", "ldl", "adiposity", "obesity", "alcohol", "age")])

Based on the scatterplot matrix shown in Section 4, some of the continuous predictors appear to be highly correlated with each other. Specifically, sbp (systolic blood pressure) seems to be strongly correlated with adiposity and obesity. Similarly, ldl (low-density lipoprotein) also appears to be correlated with adiposity and obesity.

Including highly correlated predictors in the same regression model can lead to issues of multicollinearity. Multicollinearity can cause instability in the coefficient estimates, inflated standard errors, and difficulty in separating the individual effects of the correlated predictors.

To avoid multicollinearity problems, it is generally recommended to include only one predictor from a set of highly correlated predictors in the model. In this case, you may want to consider excluding either sbp or adiposity/obesity, and either ldl or adiposity/obesity from the logistic regression model for chd.

The choice of which variable to include can be based on factors such as the strength of correlation with the outcome variable, the interpretability of the predictor, and any prior knowledge or theoretical considerations about the importance of the predictors.

Looking at the output from the backward stepwise selection, we can see that dropping adiposity or alcohol from the model results in the lowest AIC values, suggesting that these variables may be less important predictors of chd in the presence of the other predictors.



Section 5: Exploring Relationships with Family History

In this section, we will investigate the potential relationships between the continuous predictor variables and the family history of heart disease (famhist). Identifying and understanding these relationships is crucial for building an effective logistic regression model for predicting coronary heart disease (chd). We will employ statistical graphics to visualize and explore these relationships, followed by formal statistical tests to assess their significance.

p1 <- ggplot(heart, aes(x = famhist, y = sbp)) + geom_boxplot()
p2 <- ggplot(heart, aes(x = famhist, y = tobacco)) + geom_boxplot()
p3 <- ggplot(heart, aes(x = famhist, y = ldl)) + geom_boxplot()
p4 <- ggplot(heart, aes(x = famhist, y = adiposity)) + geom_boxplot()
p5 <- ggplot(heart, aes(x = famhist, y = obesity)) + geom_boxplot()
p6 <- ggplot(heart, aes(x = famhist, y = alcohol)) + geom_boxplot()
p7 <- ggplot(heart, aes(x = famhist, y = age)) + geom_boxplot()

grid.arrange(p1, p2, p3, p4, p5, p6, p7, ncol = 4)

# Formal tests
t.test(sbp ~ famhist, data = heart)
## 
##  Welch Two Sample t-test
## 
## data:  sbp by famhist
## t = -1.8487, df = 415.49, p-value = 0.0652
## alternative hypothesis: true difference in means between group Absent and group Present is not equal to 0
## 95 percent confidence interval:
##  -7.3412759  0.2250722
## sample estimates:
##  mean in group Absent mean in group Present 
##              136.8481              140.4062
t.test(tobacco ~ famhist, data = heart)
## 
##  Welch Two Sample t-test
## 
## data:  tobacco by famhist
## t = -1.951, df = 440.56, p-value = 0.0517
## alternative hypothesis: true difference in means between group Absent and group Present is not equal to 0
## 95 percent confidence interval:
##  -1.655805577  0.006092614
## sample estimates:
##  mean in group Absent mean in group Present 
##              3.292852              4.117708
t.test(ldl ~ famhist, data = heart)
## 
##  Welch Two Sample t-test
## 
## data:  ldl by famhist
## t = -3.4955, df = 406.78, p-value = 0.0005252
## alternative hypothesis: true difference in means between group Absent and group Present is not equal to 0
## 95 percent confidence interval:
##  -1.0581904 -0.2963975
## sample estimates:
##  mean in group Absent mean in group Present 
##              4.458852              5.136146
t.test(adiposity ~ famhist, data = heart)
## 
##  Welch Two Sample t-test
## 
## data:  adiposity by famhist
## t = -4.0128, df = 428.65, p-value = 7.081e-05
## alternative hypothesis: true difference in means between group Absent and group Present is not equal to 0
## 95 percent confidence interval:
##  -4.269669 -1.462153
## sample estimates:
##  mean in group Absent mean in group Present 
##              24.21570              27.08161
t.test(obesity ~ famhist, data = heart)
## 
##  Welch Two Sample t-test
## 
## data:  obesity by famhist
## t = -2.5002, df = 413.95, p-value = 0.0128
## alternative hypothesis: true difference in means between group Absent and group Present is not equal to 0
## 95 percent confidence interval:
##  -1.763503 -0.211055
## sample estimates:
##  mean in group Absent mean in group Present 
##              25.63381              26.62109
t.test(alcohol ~ famhist, data = heart)
## 
##  Welch Two Sample t-test
## 
## data:  alcohol by famhist
## t = -1.6608, df = 344.29, p-value = 0.09766
## alternative hypothesis: true difference in means between group Absent and group Present is not equal to 0
## 95 percent confidence interval:
##  -8.7273351  0.7363328
## sample estimates:
##  mean in group Absent mean in group Present 
##              15.38393              19.37943
t.test(age ~ famhist, data = heart)
## 
##  Welch Two Sample t-test
## 
## data:  age by famhist
## t = -5.4775, df = 451.33, p-value = 7.177e-08
## alternative hypothesis: true difference in means between group Absent and group Present is not equal to 0
## 95 percent confidence interval:
##  -9.643094 -4.550656
## sample estimates:
##  mean in group Absent mean in group Present 
##              39.86667              46.96354

The boxplots and formal tests (t-tests) show that some continuous predictors like sbp, ldl, adiposity, obesity, and alcohol have significant differences in their distributions between the two levels of famhist. However, after applying a Bonferroni correction (significance level of 0.05/7 = 0.007), only sbp, ldl, and adiposity remain significant. These variables should not be included in the same model as famhist due to potential issues with multicollinearity.



Section 6: Model Selection Using the Akaike Information Criterion

In this section, we will employ the Akaike Information Criterion (AIC) to identify the best additive logistic regression model for predicting coronary heart disease (chd) based on the available predictor variables. The AIC is a widely used model selection criterion that balances the trade-off between goodness of fit and model complexity, providing a systematic approach to choose the most parsimonious model that adequately explains the data.

# Full model
full_model <- glm(chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + alcohol + age, data = heart, family = binomial)

# Backward selection using AIC
step_model <- step(full_model, direction = "backward")
## Start:  AIC=492.14
## chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + 
##     alcohol + age
## 
##             Df Deviance    AIC
## - alcohol    1   472.14 490.14
## - adiposity  1   472.55 490.55
## - sbp        1   473.44 491.44
## <none>           472.14 492.14
## - obesity    1   474.23 492.23
## - ldl        1   481.07 499.07
## - tobacco    1   481.67 499.67
## - typea      1   483.05 501.05
## - age        1   486.53 504.53
## - famhist    1   488.89 506.89
## 
## Step:  AIC=490.14
## chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + 
##     age
## 
##             Df Deviance    AIC
## - adiposity  1   472.55 488.55
## - sbp        1   473.47 489.47
## <none>           472.14 490.14
## - obesity    1   474.24 490.24
## - ldl        1   481.15 497.15
## - tobacco    1   482.06 498.06
## - typea      1   483.06 499.06
## - age        1   486.64 502.64
## - famhist    1   488.99 504.99
## 
## Step:  AIC=488.55
## chd ~ sbp + tobacco + ldl + famhist + typea + obesity + age
## 
##           Df Deviance    AIC
## - sbp      1   473.98 487.98
## <none>         472.55 488.55
## - obesity  1   474.65 488.65
## - tobacco  1   482.54 496.54
## - ldl      1   482.95 496.95
## - typea    1   483.19 497.19
## - famhist  1   489.38 503.38
## - age      1   495.48 509.48
## 
## Step:  AIC=487.98
## chd ~ tobacco + ldl + famhist + typea + obesity + age
## 
##           Df Deviance    AIC
## - obesity  1   475.69 487.69
## <none>         473.98 487.98
## - tobacco  1   484.18 496.18
## - typea    1   484.30 496.30
## - ldl      1   484.53 496.53
## - famhist  1   490.58 502.58
## - age      1   502.11 514.11
## 
## Step:  AIC=487.69
## chd ~ tobacco + ldl + famhist + typea + age
## 
##           Df Deviance    AIC
## <none>         475.69 487.69
## - ldl      1   484.71 494.71
## - typea    1   485.44 495.44
## - tobacco  1   486.03 496.03
## - famhist  1   492.09 502.09
## - age      1   502.38 512.38
# List of models and AIC values
model_list <- list(
  "Full Model" = full_model,
  "Backward Selected Model" = step_model
)

model_AICs <- sapply(model_list, AIC)

data.frame(Model = names(model_list), AIC = model_AICs)
##                                           Model      AIC
## Full Model                           Full Model 492.1400
## Backward Selected Model Backward Selected Model 487.6856

The above section performs backward stepwise selection using the Akaike Information Criterion (AIC) to find the best additive logistic regression model for predicting coronary heart disease (chd) based on the given predictor variables.

The full_model is fitted using all the predictor variables: systolic blood pressure (sbp), tobacco use, low-density lipoprotein (ldl) level, adiposity, family history of heart disease (famhist), type A behavior pattern (typea), obesity, alcohol consumption, and age.

The step() function from the stats package is used to perform backward stepwise selection on the full_model. The direction = “backward” argument specifies that the selection process starts with the full model and removes predictor variables one by one until the AIC cannot be improved further.

The output from step() shows the steps of the backward selection process, including the predictor variables removed at each step and the corresponding AIC values.

The final step shows the Backward Selected Model, which includes the predictor variables tobacco, ldl, famhist, typea, and age. This model has the lowest AIC value of 487.69.

A list model_list is created, containing the “Full Model” and the “Backward Selected Model”.

The sapply() function is used to extract the AIC values for each model in model_list.

Finally, a data frame is created, displaying the model names and their corresponding AIC values.

The output table shows that the Backward Selected Model has the lowest AIC value of 487.69, indicating that it is the best-fitting model according to the AIC criterion. The Full Model, which includes all predictor variables, has a higher AIC value of 492.14, suggesting that it is not as good a fit as the more parsimonious Backward Selected Model.

In summary, the code illustrates the process of using backward stepwise selection with the AIC criterion to find the best additive logistic regression model for predicting coronary heart disease. The final model includes tobacco use, ldl level, family history of heart disease, type A behavior pattern, and age as significant predictors.



Section 7: Model Evaluation and Diagnostic Checks

After selecting the best model using the backward selection approach, it is essential to evaluate its performance and assess its goodness of fit to the data. This Section focuses on evaluating the model obtained in Section 6 and performing diagnostic checks to ensure its validity and reliability. By examining various model diagnostics, we can identify potential issues or violations of assumptions that may impact the model’s performance and interpretability.

# Assess model fit
best_model <- model_list[["Backward Selected Model"]]

summary(best_model)
## 
## Call:
## glm(formula = chd ~ tobacco + ldl + famhist + typea + age, family = binomial, 
##     data = heart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9165  -0.8054  -0.4430   0.9329   2.6139  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.44644    0.92087  -7.000 2.55e-12 ***
## tobacco         0.08038    0.02588   3.106  0.00190 ** 
## ldl             0.16199    0.05497   2.947  0.00321 ** 
## famhistPresent  0.90818    0.22576   4.023 5.75e-05 ***
## typea           0.03712    0.01217   3.051  0.00228 ** 
## age             0.05046    0.01021   4.944 7.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 475.69  on 456  degrees of freedom
## AIC: 487.69
## 
## Number of Fisher Scoring iterations: 5
# Compute residuals
resid <- residuals(best_model, type = "deviance")

# Compute leverage values
leverage <- hatvalues(best_model)

# Plot residuals vs. leverage
plot(leverage, resid)
abline(h = 0, lty = 2, col = "gray")

# Plot residuals vs. fitted values
plot(best_model$fitted.values, resid)
abline(h = 0, lty = 2, col = "gray")

The above section summarizes the best logistic regression model obtained through the backward selection process.

The model predicts the presence of coronary heart disease (chd) based on the following predictor variables: tobacco use, low-density lipoprotein (ldl) level, family history of heart disease (famhist), type A behavior pattern (typea), and age.

The summary provides several important pieces of information to assess the model fit:

Deviance Residuals: These are the residuals of the fitted model, which measure the difference between the observed values and the predicted values. The distribution of the deviance residuals gives an idea about the goodness of fit. In this case, the residuals seem to be reasonably well-distributed, with no extreme outliers.

Coefficients: The estimated coefficients for each predictor variable are provided, along with their standard errors, z-values, and associated p-values. All the predictor variables are statistically significant (p-value < 0.05), indicating that they contribute significantly to the model’s predictive power.

Null and Residual Deviance: The null deviance represents the deviance of a model with only the intercept, while the residual deviance is the deviance of the fitted model. A smaller residual deviance compared to the null deviance suggests that the model is an improvement over the null model.

AIC (Akaike Information Criterion): The AIC is a measure of the relative quality of the model, with lower values indicating a better fit. The AIC value can be used to compare the fit of different models.

Number of Fisher Scoring iterations: This indicates the number of iterations required for the model to converge to the maximum likelihood estimates.

Overall, the summary suggests that the selected logistic regression model has a reasonably good fit, with all predictor variables being statistically significant and the model showing an improvement over the null model.



Section 8: Explanation of Regression Model

The backward stepwise selection process has arrived at the following model: chd ~ tobacco + ldl + famhist + typea + age

This is a logistic regression model for predicting the binary outcome variable chd (coronary heart disease) using the following predictors:

tobacco: Cumulative tobacco consumption (in kg).
ldl: Low-density lipoprotein (“bad cholesterol”) level.
famhist: A categorical variable indicating the family history of heart disease, with “Absent” as the reference category.
typea: Score on the Bortner Short Rating Scale for coronary (type A) behavior.
age: Age of the individual (in years) at the time of the study.

The model equation can be written as:
log(p / (1 - p)) = β0 + β1 * tobacco + β2 * ldl + β3 * famhist_Present + β4 * typea + β5 * age

Where:
p is the probability of having coronary heart disease (chd = 1).
β0 is the intercept term, representing the log odds of having coronary heart disease when all predictors are zero or at their reference levels.
β1 is the coefficient for tobacco, indicating the change in the log odds of having coronary heart disease for a one-unit increase in cumulative tobacco consumption, holding all other variables constant.
β2 is the coefficient for ldl, indicating the change in the log odds of having coronary heart disease for a one-unit increase in the low-density lipoprotein level, holding all other variables constant.
β3 is the coefficient for famhist_Present, indicating the change in the log odds of having coronary heart disease for individuals with a family history of heart disease (famhist = “Present”), compared to those without a family history (famhist = “Absent”), holding all other variables constant.
β4 is the coefficient for typea, indicating the change in the log odds of having coronary heart disease for a one-unit increase in the score on the Bortner Short Rating Scale for coronary (type A) behavior, holding all other variables constant.
β5 is the coefficient for age, indicating the change in the log odds of having coronary heart disease for a one-year increase in age, holding all other variables constant.

The interpretation of the model coefficients (β1, β2, β3, β4, β5) depends on the scale of the predictors and the reference category for categorical variables. For example, if tobacco is measured in kilograms, then β1 represents the change in the log odds of having coronary heart disease for a one-kilogram increase in cumulative tobacco consumption. Similarly, β3 represents the difference in the log odds of having coronary heart disease between individuals with a family history of heart disease and those without, holding all other variables constant.

The overall fit of the model can be assessed using various measures, such as the deviance, AIC (Akaike Information Criterion), or other goodness-of-fit statistics. The model with the lowest AIC value (487.69) is considered the best fit among the models considered in the backward stepwise selection process.



Section 9: Interpretation of Coefficients

Here is the interpretation of the model coefficients, significance, and overall fit:

Coefficients:

Intercept (-6.44644): The log odds of having coronary heart disease when all predictors are zero. A negative value indicates that the baseline odds are less than 1.

tobacco (0.08038): For every 1 unit increase in cumulative tobacco consumption (kg), the log odds of having coronary heart disease increase by 0.08038, holding all other variables constant. This coefficient is statistically significant (p-value = 0.00190).

ldl (0.16199): For every 1 unit increase in the low-density lipoprotein level, the log odds of having coronary heart disease increase by 0.16199, holding all other variables constant. This coefficient is statistically significant (p-value = 0.00321).

famhistPresent (0.90818): For individuals with a family history of heart disease (famhist = “Present”), the log odds of having coronary heart disease are 0.90818 higher than those without a family history (famhist = “Absent”), holding all other variables constant. This coefficient is statistically significant (p-value = 5.75e-05).

typea (0.03712): For every 1 unit increase in the score on the Bortner Short Rating Scale for coronary (type A) behavior, the log odds of having coronary heart disease increase by 0.03712, holding all other variables constant. This coefficient is statistically significant (p-value = 0.00228).

age (0.05046): For every 1-year increase in age, the log odds of having coronary heart disease increase by 0.05046, holding all other variables constant. This coefficient is statistically significant (p-value = 7.65e-07).

Significance:
The model coefficients for all predictors are statistically significant at the 0.01 level (p-value < 0.01), indicating that they are important predictors of coronary heart disease in this model.

Overall fit:
The null deviance of 596.11 on 461 degrees of freedom represents the deviance of the null model (model with only an intercept).

The residual deviance of 475.69 on 456 degrees of freedom represents the deviance of the fitted model. The difference between the null deviance and the residual deviance (596.11 - 475.69 = 120.42) represents the deviance explained by the predictors in the model.

The AIC (Akaike Information Criterion) value of 487.69 is used to compare the fit of this model with other potential models. Lower AIC values indicate a better fit.

Overall, the model appears to fit the data reasonably well, with all predictors being statistically significant and the residual deviance being lower than the null deviance. However, it is important to assess other assumptions and diagnostics of the model to ensure its validity and reliability for prediction or inference.