Logistic Regression

What is logistic regression

Logistic regression is a statistical method used to model and analyze datasets where the outcome variable is binary (e.g., “yes/no” or “male/female”) and predictor variables are those believed to either influence, predict, or explain the outcome.

The best way to learn and understand logistic regression is to look at an example (and try to replicate the analysis yourself at home). We’re going to use R to analyse a dataset called “PimaIndiansDiabetes” that you can access by installing the “mlbench” package.

The data set has:

A binary outcome variable:

diabetes (positive or negative)

Predictor variables:

pregnant: number of times pregnant
glucose: blood glucose levels
pressure: diastolic blood pressure
triceps: triceps skin fold thickness
insulin: blood insulin levels
mass: body mass index (BMI)
pedigree: diabetes predigree function
age: age in years

Objective: To determine which predictor variables are associated with meaningful changes in the probability of being diagnosed with diabetes and to quantify the extent of their influence.

Understanding how predictors influence the probability of diabetes can inform risk stratification, policy, or clinical interventions.

Here is a snapshot of the first 10 rows of data.

Show the code

library(tidyverse)
library(gtExtras)
library(mlbench)

data(PimaIndiansDiabetes, package = "mlbench")

PimaIndiansDiabetes %>%
  head(10) %>%
  gt() %>%
  gt_theme_pff() %>%
  tab_header(
    title = "PimaIndiansDiabetes Dataset",
  ) %>%
  data_color(
    columns = diabetes,
    colors = scales::col_factor(
      palette = c("#f5c5bd", "#cbf5bd"),
      domain = c("neg", "pos")
    )
  )

1: Load the tidyverse package for data manipulation and piping functions.
2: Load the gtExtras package for enhanced table customization.
3: Load the mlbench package, which includes the PimaIndiansDiabetes dataset.
4: Load the PimaIndiansDiabetes dataset from the mlbench package into the environment.
5: Start processing the PimaIndiansDiabetes dataset.
6: Extract the first 10 rows of the dataset for a concise table.
7: Convert the data into a gt table for formatting.
8: Apply the gt_theme_pff() function to give the table a polished and clean look.
9: Add a header to the table.
10: Set the table’s title to “PimaIndiansDiabetes Dataset.”
11: Apply color-coding to the diabetes column for visual distinction.
12: Specify the diabetes column for color coding.
13: Define the color palette to use pink (#f5c5bd) for “neg” and green (`#cbf5bd”) for “pos.”
14: Specify the domain for the diabetes column, mapping “neg” and “pos” to the colors.

PimaIndiansDiabetes Dataset
pregnant	glucose	pressure	triceps	insulin	mass	pedigree	age	diabetes
6	148	72	35	0	33.6	0.627	50	pos
1	85	66	29	0	26.6	0.351	31	neg
8	183	64	0	0	23.3	0.672	32	pos
1	89	66	23	94	28.1	0.167	21	neg
0	137	40	35	168	43.1	2.288	33	pos
5	116	74	0	0	25.6	0.201	30	neg
3	78	50	32	88	31.0	0.248	26	pos
10	115	0	0	0	35.3	0.134	29	neg
2	197	70	45	543	30.5	0.158	53	pos
8	125	96	0	0	0.0	0.232	54	pos

A single numeric predictor

Let’s start with a single numeric predictor. In this case “glucose” and take a look at the relationship between changing levels of glucose and the likelihood of being diagnosed with diabetes. Figure 1 shows the diagnostic status for each value in the dataset of glucose. It is clear that for lower levels of glucose, fewer people are diagnosed with diabetes and for higher levels of glucose, more people are diagnosed with diabetes.

Logistic regression uses that data to create a model that can be used to predict the probability of being diagnosed with diabetes at any level of glucose. Figure 2 shows how that probability changes over the values of glucose levels.

Show the code

PimaIndiansDiabetes %>% 
  ggplot(aes(x = glucose, y = as.numeric(diabetes == "pos") * 0.5, color = diabetes)) +
  geom_jitter(height = 0.2, width = 0.2, alpha = 0.3, size = 4) +
  scale_color_manual(values = c("neg" = "darkgreen", "pos" = "red")) +
  labs(
    title = "Glucose Levels and Diabetes Status",
    x = "Glucose Level"
  ) +
  scale_y_continuous(breaks = c(0, 0.5), labels = c("No Diabetes", "Diabetes")) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.line.x = element_line(color = "black"),
    plot.title = element_text(hjust = 0.5),
    axis.title.y = element_blank()
  )

1: Start a ggplot, mapping glucose to the x-axis and a compressed numeric representation of diabetes status to the y-axis. Use color to differentiate diabetes status.
2: Add jittered points to the plot, adjusting for height and width to spread out overlapping points. Set transparency and size for better visualization.
3: Manually define colors for diabetes status: “neg” as dark green and “pos” as red.
4: Add plot labels, including a title and an x-axis label.
5: Specify the plot title: “Glucose Levels and Diabetes Status.”
6: Set the x-axis label to “Glucose Level.”
7: Adjust the y-axis to display only two breaks (0 and 0.5) and label them as “No Diabetes” and “Diabetes.”
8: Use a minimal theme for a clean and uncluttered visual appearance.
9: Customize the plot theme further:
10: Remove the legend to simplify the plot.
11: Add a black line to the x-axis for emphasis.
12: Center the plot title for better alignment.
13: Remove the y-axis title to avoid redundancy.

Show the code

library(plotly)
model <- glm(diabetes ~ glucose, data = PimaIndiansDiabetes, family = binomial)

# Generate data for the curve
glucose_range <- data.frame(glucose = seq(min(PimaIndiansDiabetes$glucose, na.rm = TRUE),
                                        max(PimaIndiansDiabetes$glucose, na.rm = TRUE),
                                        length.out = 500))
glucose_range$prob <- predict(model, newdata = glucose_range, type = "response")

# Create interactive plot
plot_ly(data = glucose_range,
                          x = ~glucose,
                          y = ~prob,
                          type = 'scatter',
                          mode = 'lines',
                          name = "Probability Curve") %>%
  layout(
    title = "Probability of Diabetes by Glucose Level",
    xaxis = list(title = "Glucose Level"),
    yaxis = list(title = "Probability of Diabetes")
  )

1: Load the plotly package for creating interactive plots.
2: Fit a logistic regression model predicting diabetes based on glucose using the glm function with a binomial family.
3: Start generating data for the probability curve.
4: Create a sequence of glucose values starting from the minimum in the dataset, excluding missing values.
5: Extend the sequence up to the maximum glucose value in the dataset.
6: Generate 500 equally spaced points in the range of glucose values to ensure a smooth curve.
7: Use the logistic regression model to predict probabilities for the generated glucose values.
8: Begin creating the interactive plot with the plot_ly function.
9: Specify the data source as the glucose_range data frame.
10: Map the glucose values to the x-axis of the plot.
11: Map the predicted probabilities (prob) to the y-axis of the plot.
12: Set the plot type to ‘scatter’ for a line chart.
13: Configure the plot to display lines for the curve.
14: Add a name to the curve for labeling in the legend or hover text.
15: Customize the layout of the plot using the layout function.
16: Set the plot’s title to “Probability of Diabetes by Glucose Level.”
17: Define the label for the x-axis as “Glucose Level.”
18: Define the label for the y-axis as “Probability of Diabetes.”

Figure 2

Creating a simple model

Here is the code to create and summarise a logistic regression model with this data:

model <- glm(diabetes ~ glucose,
             data = PimaIndiansDiabetes,
             family = binomial)

summary(model)

1: Fit a logistic regression model (glm) predicting diabetes based on glucose. This uses a binomial family, which is suitable for binary outcomes.
2: Specify the data source as the PimaIndiansDiabetes dataset.
3: Use the family = binomial argument to indicate logistic regression for modeling probabilities.
4: Display a summary of the fitted model, providing details like coefficients, standard errors, p-values, and overall model performance.

Explanatory notes:

glm() is used to fit a generalized linear model.
The formula diabetes ~ glucose specifies that we are modelling diabetes as the outcome, predicted by glucose.
The family = binomial argument indicates that we are using logistic regression (appropriate for binary outcomes).
summary(model) provides detailed results of the model

And here are the results:


Call:
glm(formula = diabetes ~ glucose, family = binomial, data = PimaIndiansDiabetes)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.350080   0.420827  -12.71   <2e-16 ***
glucose      0.037873   0.003252   11.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 808.72  on 766  degrees of freedom
AIC: 812.72

Number of Fisher Scoring iterations: 4

Explanatory notes:

Looking at the results of the logistic regression, lets focus on just a few numbers (for now).

Firstly, we have coefficients. What are these? Well, we have a mathematical model that has place holders into which specific values can be used so that the model best describes the data that we have.

You’ll notice that we have two coefficients in this case.

The intercept that tells us what the model looks like when glucose is zero (which is where the model intercepts the y axis). In this example that is a meaningless number because people can’t live with a zero glucose. So we can ignore that number.
And we have a coefficient for glucose, our predictor variable. That is going to tell us (from the data that we have) about what happens when the value of glucose changes. Specifically, how it affects the chances of a person being diagnosed with diabetes. The change can be positive (as is the case in this example), negative or zero. A positive number means that as glucose levels go up, the chances of being diagnosed with diabetes goes up too. If it were zero, then there would be no change and if it were negative then it would mean that as glucose goes up, the chances of being diagnosed with diabetes would go down.

You’ll notice that I’ve used a generic and somewhat vague term “chances” instead of “probability”. That’s because the actual number is what is called the “log-odds” (or the logarithm of the odds ratio). Don’t worry about that for now. All you need to focus on is the fact that is it positive. You’ll also notice that the results include a p value which in this case is very small, meaning that this result is statistically significant.

We’ll look at other values provided in the results as we built the model and add additional variables.

A single categorical predictor

Now let’s consider the same question using a categorical variable as a predictor. This dataset doesn’t have a categorical variable and so we’ll create one using the Age variable but dividing the observations into “Young adult”, “Middle-aged” and “Older adult”. Note that we’re doing this in order to illustrate how to interpret logistic regression results. In reality, it would be best practice to leave Age as a numeric variable and include it in the model as such.

Here is the code to create a categorical variables from a numeric variable (if you’re interested)

Show the code

age_data <- PimaIndiansDiabetes %>%
  mutate(age_group = case_when(
    age < 30 ~ "Young adult",
    age <= 50 ~ "Middle-aged",
    TRUE ~ "Older adult"
  )) %>%  
  mutate(age_group = factor(age_group, levels = c("Young adult", "Middle-aged", "Older adult"))) %>%
  select(age_group, diabetes)

1: Start processing the PimaIndiansDiabetes dataset to create an age group variable.
2: Add a new variable age_group using the mutate function and the case_when statement for conditional grouping.
3: Assign the label “Young adult” to individuals under 30 years old.
4: Assign the label “Middle-aged” to individuals between 30 and 50 years old, inclusive.
5: Assign the label “Older adult” to individuals over 50 years old.
6: Convert the age_group variable into a factor and specify the order of the levels as “Young adult,” “Middle-aged,” and “Older adult.”
7: Select only the age_group and diabetes variables for the final output.

Figure 3 shows the proportion of observations that have a diagnosis of diabetes changes across these three categories.

Show the code

plot_data2 <- age_data %>%
  group_by(age_group, diabetes) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(
    proportion = count / sum(count),
    diabetes = factor(diabetes, levels = c("neg", "pos"))
  )

ggplot(plot_data2, aes(x = age_group, fill = diabetes)) +
  geom_bar(position = "fill", aes(y = proportion), stat = "identity", color = "black", width = 0.8, alpha = 0.3) +
  scale_fill_manual(values = c("neg" = "#069145", "pos" = "red"),
                    labels = c("No Diabetes", "Diabetes")) +
  scale_y_continuous(labels = scales::percent_format(), expand = c(0, 0)) +
  labs(
    title = "Proportion of Diabetes by Age Group",
    x = "",
    y = "Proportion (%)",
    fill = ""
  ) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        plot.title = element_text(hjust = 0.5))

1: Use the age_data dataset as the starting point for data manipulation.
2: Group the data by age_group and diabetes to calculate summary statistics for each combination.
3: Compute the count of occurrences in each group using summarise and drop unnecessary grouping.
4: Add new variables using mutate.
5: Calculate the proportion of each group by dividing the count by the total count.
6: Convert the diabetes variable into a factor and set the levels to ensure “neg” (No Diabetes) appears below “pos” (Diabetes) in the plot.
7: Map age_group to the x-axis and diabetes to the fill aesthetic.
8: Add a bar plot layer with proportions, adjusting transparency, border color, and bar width.
9: Customize the fill colors: green for “neg” and red for “pos.”
10: Label the fill categories as “No Diabetes” and “Diabetes.”
11: Adjust the y-axis to display percentages and remove unnecessary space around the axis.
12: Add plot labels.
13: Set the plot title to “Proportion of Diabetes by Age Group.”
14: Leave the x-axis label blank for simplicity.
15: Label the y-axis as “Proportion (%).”
16: Leave the legend title blank.
17: Apply a minimal theme for a clean appearance.
18: Remove vertical grid lines from the plot to avoid visual clutter.
19: Center the plot title for better alignment.

Now let’s create a logistic regression model using age group (a categorical variable) as the predictor.

Show the code

model <- glm(diabetes ~ age_group, 
             data = age_data, 
             family = binomial)

summary(model)

1: Fit a logistic regression model using the glm function.
2: Display a summary of the logistic regression model.


Call:
glm(formula = diabetes ~ age_group, family = binomial, data = age_data)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -1.3122     0.1229 -10.675  < 2e-16 ***
age_groupMiddle-aged   1.3191     0.1699   7.765 8.16e-15 ***
age_groupOlder adult   1.1886     0.2543   4.673 2.96e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 924.66  on 765  degrees of freedom
AIC: 930.66

Number of Fisher Scoring iterations: 4

Explanatory note

Reference Category: You’ll notice that under the heading of Coefficients there is no row for “Young adult”. This is because it is our reference category. this is used as the baseline for comparison. Each of the other age categories will be compared to the “Young adult” categories to determine if there is a change in the chances of being diagnosed with diabetes when we move from Young adult to that category.
Coefficients:
- The intercept: when looking at the example of glucose, we decided to ignore the intercept because it represented the likelihood of diabetes when glucose is zero (which is meaningless). In the case of a categorical variable however, the intercept is the likelihood of diabetes for the reference category (Young adults). In this case is it negative. So being a Young adult decreases the chances of being diagnosed with diabetes.
- Both of the other age categories (Middle aged and Older adult) have positive estimates (which means that, all else being equal, a change from Young adult to that category would increase the likelihood of being diagnosed with diabetes.
- All three estimates have very small p values and so are considered to be statistically significant.

Adding variables

One of the key strengths of logistic regression is the ability to include multiple predictor variables in the model. For example, beyond plasma glucose concentration, we could include variables such as weight, age, pregnancy history and blood pressure, for example.

By considering multiple predictors simultaneously, logistic regression allows us to explore how each variable contributes to the likelihood of the outcome (in this case, diabetes) while accounting for the influence of others.

In the case of the risk of diabetes, there may be multiple risk factors that we want to consider. Let’s start adding variables to our model and see what happens. We’ll start by adding age to the model (we’ll use the original age (numeric) variable and not the categorical variable that we created.

Show the code

# Fit the logistic regression model
model <- glm(diabetes ~ glucose + age,
             data = PimaIndiansDiabetes, 
             family = binomial)

# Summarize the model
summary(model)


Call:
glm(formula = diabetes ~ glucose + age, family = binomial, data = PimaIndiansDiabetes)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.912449   0.462620  -12.78  < 2e-16 ***
glucose      0.035644   0.003290   10.83  < 2e-16 ***
age          0.024778   0.007374    3.36 0.000778 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 797.36  on 765  degrees of freedom
AIC: 803.36

Number of Fisher Scoring iterations: 4

Let’s first look at the coefficients:

Glucose:
- Looking at the results, we can see that glucose remains a positive and statistically significant predictor of diabetes. The coefficient is slightly lower than in the model without age. This reflects the adjusted effect of glucose on the likelihood of being diagnosed with diabetes, accounting for the effect of age. In other words, some of the previously observed effect of glucose on diabetes risk can now be attributed to the different ages of the subjects rather than solely their glucose levels.
Age:
- Age also has a positive effect on the likelihood of being diagnosed with diabetes, and the small p-value confirms this effect is statistically significant. Because glucose is included in the model, the age effect is adjusted for glucose.

Explanatory notes on other results

Akaike Information Criterion (AIC):
- A metric for comparing models, balancing goodness of fit with model complexity.
- Lower AIC values suggest a better trade-off between fit and complexity. Adding variables that improve the model without overfitting typically results in a lower AIC.
- The AIC (803.36) is also lower than the glucose-only model (814.59), indicating that adding age improves the model’s overall fit.
Residual Deviance:
- Measures how well the model fits the data. It is a summary of the discrepancies between observed outcomes and the model’s predictions.
- Lower values indicate better fit. A drop in residual deviance when adding a variable suggests the new variable improves the model.
- The residual deviance (797.36) is lower than that of the glucose-only model (808.59), showing improved fit with the addition of age.
Null Deviance:
- Represents the deviance of a model with no predictors (only an intercept).
- This is the baseline for comparison, showing how much better the model with predictors performs compared to a model with no predictors.
- When to Use: Null deviance is useful in understanding the baseline fit of a dataset. For instance, if the null deviance is very high and the residual deviance of your model remains close to it, it indicates that your predictors are not improving the fit much. A hypothetical example: In a logistic regression model predicting diabetes, a null deviance of 1000 and a residual deviance of 990 would suggest poor predictive power of the included variables.
Number of Fisher Scoring Iterations:
- Refers to the number of iterations the algorithm needed to converge on a solution when estimating model parameters.
- Fewer iterations indicate faster convergence, but this metric is not typically used to evaluate model quality.
- When to Use: If a model takes an unusually high number of iterations to converge, it might signal issues with the data or the model, such as multicollinearity or poorly scaled variables. Hypothetical example: If a logistic regression model takes 50 iterations (instead of 4-5) to converge, you might check for these issues or simplify the model.
We now have the tools we need to build a model. We can add numeric and categorical variables and look at the p value of the coefficient to determine if its predictive value is statistically significant and the look at the Residual deviance and AIC, which if lower indicate that the model is improved by adding that variable.

Before building the model however, there are a few concepts and idea that we need to explore that will help us understand how variables interact with each other.

Confounding variables

Confounding occurs when a third variable influences both the predictor and the outcome, distorting the apparent relationship between the predictor and the outcome.

Confounding creates a spurious association that might disappear or change when the confounding variable is controlled for by including it in the model.

In this model for example, we should ask if being older (age) has an effect on both the likelihood of diabetes (the outcome variable) AND glucose levels. If that were true, then some of the effect of glucose on diabetes could be accounted for simply by virtue of the fact that older people have higher glucose levels AND older people are more likely to have diabetes.

Figure 4 shows the relationship between age and glucose levels in this dataset::

Show the code

PimaIndiansDiabetes %>%
  ggplot(aes(age, glucose)) +
  geom_smooth(method = lm) +
  theme_minimal() +
  labs(title = "Age and blood glucose levels",
       x = "Age",
       y = "Glucose levels")

1: Start the data pipeline with the PimaIndiansDiabetes dataset, which contains variables like age and glucose.
2: Create a ggplot object, mapping age to the x-axis and glucose to the y-axis using the aes function.
3: Add a smoothing line to the plot with geom_smooth, specifying method = lm to use linear regression for the fit.
4: Apply the theme_minimal() function to give the plot a clean, minimalist aesthetic.
5: Add a title to the plot using the labs function, describing the relationship being visualized.
6: Label the x-axis as “Age” to indicate the independent variable.
7: Label the y-axis as “Glucose levels” to represent the dependent variable.

By including both age and glucose levels in the model, this effect is controlled for. This is why we see the coefficient estimates for glucose going from 0.038 in the first model (with just glucose as a predictor) to 0.035 in the second model (with glucose and age as predictors).

Effect modifiers

Effect modifiers (or interacting variables) are slightly different. Let’s stick with the example of glucose and age. While it is true that older people have higher glucose, it might also be true that being older (due to changes in physiology with age) modifies th effect of glucose on diabetes risk. So not only do older people have higher levels of glucose, but the glucose itself has a different effect on diabetes in older people.

Let’s repeat the model but this time, instead of just adding age as an additional predictor variable, let’s add it as a possible interaction term (or effect modifier) and see what happens.

Show the code

# Fit the logistic regression model
model <- glm(diabetes ~ glucose * age,
             data = PimaIndiansDiabetes, 
             family = binomial)

# Summarize the model
summary(model)


Call:
glm(formula = diabetes ~ glucose * age, family = binomial, data = PimaIndiansDiabetes)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.5106592  1.3016848  -7.306 2.74e-13 ***
glucose      0.0638544  0.0100514   6.353 2.11e-10 ***
age          0.1265362  0.0338114   3.742 0.000182 ***
glucose:age -0.0007889  0.0002550  -3.094 0.001973 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 788.27  on 764  degrees of freedom
AIC: 796.27

Number of Fisher Scoring iterations: 5

Firstly we notice that the interaction term is statistically significant and that the Residual deviance and AIC are both reduced. All of this suggests that this is an improved model and that this interaction terms should be included in the model.

Next, we see that the interaction term has a negative value for its estimate. This means that with increasing age, the effect of glucose on diabetes decreases. In other words, high levels of glucose impacts young people more than older people with regard to the likelihood of a diabetes diagnosis.

The estimates for both glucose and age remain statistically significant and both have a stronger effect size in the new model. That is because these values represent their individual contributions to the model. For example, for glucose the estimate represents the effect of glucose on diabetes at the baseline level of age; and we know that the effect of glucose on diabetes is higher in younger people (from this model).

Collinearity

Collinearity in logistic regression happens when two or more predictor variables are highly correlated with each other. It can lead to unstable estimates, inflated standard errors, and difficulty in interpreting the unique contribution of each predictor.

Collinearity makes it difficult for the model to figure out the individual impact of each variable on the outcome because they’re essentially explaining the same variance.

Its a good idea to create a table or use data visualization to identify pairs of variables that are highly correlated. A correlation coefficient of 0.7 or higher may mean that you need to consider the strategies described below. From ?@fig-colin and ?@fig-co it is clear that non of the variables are highly correlated and so further action is not needed.

Show the code

correlation_matrix <- cor(PimaIndiansDiabetes[, sapply(PimaIndiansDiabetes, is.numeric)])

cor_df <- as.data.frame(round(correlation_matrix, 2)) %>%
  tibble::rownames_to_column("Variable")

cor_df %>%
  gt() %>%
  gt_color_rows(
    columns = -Variable,
    palette = "RdBu",
    domain = c(-1, 1)
  )

1: Compute the correlation matrix for all numeric columns in the PimaIndiansDiabetes dataset using the cor function.
2: Convert the correlation matrix to a data frame and round the values to two decimal places for readability.
3: Use rownames_to_column to add row names as a new column called “Variable” to make the data frame tidy.
4: Start creating a gt table from the processed correlation data frame.
5: Transform the correlation data frame into a gt table for visualization.
6: Apply heatmap-style coloring to the rows of the gt table.
7: Exclude the “Variable” column from the heatmap styling by specifying columns = -Variable.
8: Use the “RdBu” color palette (red-blue) to visualize positive and negative correlations effectively.
9: Set the domain for coloring to range from -1 (perfect negative correlation) to 1 (perfect positive correlation).

Variable	pregnant	glucose	pressure	triceps	insulin	mass	pedigree	age
pregnant	1.00	0.13	0.14	-0.08	-0.07	0.02	-0.03	0.54
glucose	0.13	1.00	0.15	0.06	0.33	0.22	0.14	0.26
pressure	0.14	0.15	1.00	0.21	0.09	0.28	0.04	0.24
triceps	-0.08	0.06	0.21	1.00	0.44	0.39	0.18	-0.11
insulin	-0.07	0.33	0.09	0.44	1.00	0.20	0.19	-0.04
mass	0.02	0.22	0.28	0.39	0.20	1.00	0.14	0.04
pedigree	-0.03	0.14	0.04	0.18	0.19	0.14	1.00	0.03
age	0.54	0.26	0.24	-0.11	-0.04	0.04	0.03	1.00

Show the code

library(corrplot)
corrplot(correlation_matrix, method = "color", type = "upper")

1: Load the corrplot library which provides functions to visualize correlation matrices on one line.
2: Use the corrplot function to create a correlation plot where correlation_matrix is the input matrix, method = "color" specifies color representation, and type = "upper" visualizes only the upper triangle of the matrix for clarity.

Strategies for dealing with collinearity

Remove One of the Collinear Variables:
If two variables are highly correlated, consider dropping one from the model, especially if it provides redundant information.
Combine Variables:
Create a composite variable, such as an average or a principal component, to reduce collinearity while retaining the information.
Use Regularization Techniques:
Employ methods like Lasso or Ridge regression, which can handle collinearity by penalizing large coefficients.
Center and Scale Variables:
Standardizing predictors can help mitigate the effects of collinearity, although it doesn’t eliminate it entirely.
Investigate Variance Inflation Factor (VIF):
Use VIF to identify and address predictors contributing the most to collinearity. Remove predictors with high VIF values.
Stepwise Regression:
Use stepwise methods to iteratively add or remove predictors to identify the most parsimonious model.
Domain Expertise:
Use knowledge about the data and its context to prioritize which variables are most important and less likely to be redundant.

Now that we know how to add variables, assess the impact on the model and look consider effect modifier, confounding and collinearity, we can start the work of actually building our model

Building the model

Once you’re satisfied with a simple model as your starting point and have verified that the variables in your dataset are not highly correlated, you can proceed to incrementally improve the model by adding additional variables and interaction terms one at a time. At each step, assess whether the added variable or interaction term is statistically significant and evaluate its impact on the model’s performance.

As you build your model, at each increment check that:

AIC and Residual Deviance decrease
the new variable or interaction term is statistically significant.

Figure 5 show the the impact on the model as each of these terms were added, one at a time.

Show the code

library(readxl)
LR_variables <- read_excel("LR_variables.xlsx")

data_long <- LR_variables %>%
  pivot_longer(cols = -Variable, names_to = "variable", values_to = "value") %>%
  mutate(value = as.factor(value))

variables_order <- unique(LR_variables$Variable)
ggplot(data_long, aes(x = variable,
                      y = factor(Variable, levels = variables_order),
                      fill = value)) +
  geom_tile(color = "white") +
  scale_fill_manual(
    values = c("0" = "#dea49b", "1" = "#8ecfaf"),
    name = "Value"
  ) +
  scale_y_discrete(limits = rev(variables_order)) +
  guides(fill = "none") +
  labs(
    title = "Effect of additional variables on model parameters",
    x = "",
    y = "",
    caption = "For AIC and Residual deviance, green means a lower value and so supports the inclusion of the variable in the model. \nFor statistical significance, green means a lower p value and so supports the inclusion of the variable in the model."
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(hjust = 0.5, face = "bold"),
    axis.text.y = element_text(size = 10),
    panel.grid = element_blank()
  )

1: Load the readxl library for reading Excel files.
2: Read the Excel file “LR_variables.xlsx” into a data frame named LR_variables.
3: Start processing LR_variables with a data transformation pipeline.
4: Use pivot_longer to reshape the data frame from wide to long format by collapsing columns into two: “variable” and “value”.
5: Convert the “value” column to a factor to prepare for categorical visualizations.
6: Extract the unique values of “Variable” to define the order for the y-axis.
7: Start creating a ggplot object with “variable” on the x-axis and “Variable” on the y-axis.
8: Use factor to ensure that the order of the y-axis matches variables_order.
9: Map the fill aesthetic to “value” to show different colors for each value.
10: Add a tiled heatmap layer with white borders around the tiles.
11: Begin defining the fill scale for the heatmap.
12: Specify the color palette: pink (#dea49b) for “0” and green (#8ecfaf) for “1”.
13: Label the legend as “Value”.
14: Reverse the order of the y-axis levels using rev(variables_order).
15: Remove the legend for the fill aesthetic using guides(fill = "none").
16: Add labels to the plot.
17: Set the plot title to “Effect of additional variables on model parameters”.
18: Leave the x-axis label blank.
19: Leave the y-axis label blank.
20: Add a caption explaining the interpretation of colors in the plot.
21: Apply a minimal theme for a clean layout.
22: Customize the plot theme.
23: Center the plot title horizontally.
24: Bold the x-axis text and align it horizontally.
25: Increase the font size of the y-axis text for better readability.
26: Remove the background grid lines for a cleaner appearance.

Preliminary model

Here is the preliminary model that we’ve built up based on the principles above.

We’re calling this a preliminary model because we still need to check some assumptions about the model which may lead us to excluding certain variables.

Show the code

model <- glm(diabetes ~ glucose + age + glucose:age + pregnant + age:pregnant + mass + pressure + pedigree,
             data = PimaIndiansDiabetes,
             family = binomial)

summary(model)

1: Fit a logistic regression model with multiple predictors including glucose, age, glucose:age (interaction term), pregnant, age:pregnant (another interaction term), mass, pressure, and pedigree to predict diabetes. Each term is included to capture both main and interaction effects, allowing for the assessment of combined influences on the outcome.
2: Specify the data source as the PimaIndiansDiabetes dataset, which contains the variables for analysis.
3: Use the family = binomial argument to indicate that this is a logistic regression model for binary outcomes such as diabetes (e.g., “neg” or “pos”). Logistic regression models the log odds of the outcome as a linear combination of the predictors.
4: Use the summary(model) function to display detailed results including coefficients (showing the relationship between each predictor and the log odds of diabetes), standard errors (measuring variability in the estimates), z-values and p-values (indicating statistical significance), and deviance measures (assessing model fit).


Call:
glm(formula = diabetes ~ glucose + age + glucose:age + pregnant + 
    age:pregnant + mass + pressure + pedigree, family = binomial, 
    data = PimaIndiansDiabetes)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.174e+01  1.500e+00  -7.823 5.15e-15 ***
glucose       5.142e-02  1.062e-02   4.844 1.27e-06 ***
age           1.143e-01  3.784e-02   3.022  0.00251 ** 
pregnant      4.418e-01  1.113e-01   3.969 7.23e-05 ***
mass          8.750e-02  1.453e-02   6.022 1.73e-09 ***
pressure     -1.260e-02  5.159e-03  -2.441  0.01463 *  
pedigree      8.803e-01  2.986e-01   2.948  0.00320 ** 
glucose:age  -4.916e-04  2.691e-04  -1.827  0.06774 .  
age:pregnant -8.264e-03  2.706e-03  -3.054  0.00226 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 712.65  on 759  degrees of freedom
AIC: 730.65

Number of Fisher Scoring iterations: 5

Checking model assumptions

Model assumptions can be checked visually and using statistical methods. These methods should be used together to get a sense of the strength of your model. we’ll make reference to these plots in the text below:

Visual checks

Show the code

dataset <- PimaIndiansDiabetes %>%
  filter(mass > 10, pressure > 10, age < 70)

model <- glm(diabetes ~ glucose + pregnant + pressure + mass + pedigree + age, family = binomial, data = dataset)

log_odds <- predict(model, type = "link")

dataset <- na.omit(dataset)
dataset$log_odds <- log_odds

long_dataset <- dataset %>%
  select(-diabetes, -insulin, -triceps) %>%
  pivot_longer(cols = -log_odds, names_to = "predictor", values_to = "value")

ggplot(long_dataset, aes(x = value, y = log_odds)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_smooth(method = "loess", color = "coral") +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(title = "Log-Odds vs. Predictors",
       x = "",
       y = "Log-Odds") +
  theme_bw()

1: Use the PimaIndiansDiabetes dataset as the starting point for analysis.
2: Filter the dataset to include rows where mass > 10, pressure > 10, and age < 70 to exclude extreme or invalid values.
3: Fit a logistic regression model with predictors glucose, pregnant, pressure, mass, pedigree, and age, specifying family = binomial for binary outcome modeling.
4: Use the predict function to calculate log-odds (linear predictor) for each observation based on the fitted model.
5: Remove rows with missing values using na.omit to ensure complete data for further analysis.
6: Add the calculated log_odds as a new column to the filtered dataset.
7: Start creating a long-format dataset for visualization.
8: Select all columns except diabetes, insulin, and triceps for the long-format conversion.
9: Use pivot_longer to reshape the dataset, creating columns for predictor names (predictor) and their values (value).
10: Begin a ggplot visualization mapping predictor values to the x-axis and log-odds to the y-axis.
11: Add a scatterplot layer with semi-transparent blue points to display individual data points.
12: Add a LOESS smooth line to visualize trends in the relationship between predictors and log-odds, coloring it coral.
13: Use facet_wrap to create separate panels for each predictor, with free scaling for the x-axis.
14: Add a title to the plot, “Log-Odds vs. Predictors.”
15: Leave the x-axis label blank for simplicity.
16: Label the y-axis as “Log-Odds” to indicate the outcome variable.
17: Apply the theme_bw theme for a clean and professional appearance.

Show the code

model <- glm(diabetes ~ glucose + age + glucose:age + pregnant + age:pregnant + mass + pressure + pedigree, data = PimaIndiansDiabetes, family = binomial)

cooks_d <- cooks.distance(model)
cooks_df <- data.frame(Observation = seq_along(cooks_d), CooksDistance = cooks_d)

ggplot(cooks_df, aes(x = Observation, y = CooksDistance)) +
  geom_segment(aes(xend = Observation, yend = 0), color = "steelblue") +
  geom_point(color = "coral") +
  labs(title = "Cook's Distance", x = "", y = "Cook's Distance") +
  theme_minimal()

1: Fit a logistic regression model with predictors glucose, age, their interaction glucose:age, and additional variables pregnant, age:pregnant, mass, pressure, and pedigree to predict diabetes. Specify family = binomial for binary outcome modeling.
2: Calculate Cook’s distance for each observation using the fitted model to identify influential points that may disproportionately affect model estimates.
3: Create a data frame with observations and their corresponding Cook’s distance values for visualization.
4: Start a ggplot visualization with observation indices on the x-axis and Cook’s distance values on the y-axis.
5: Add a segment layer to connect each observation point to the x-axis, using steel blue for clarity.
6: Overlay points representing individual Cook’s distances, coloring them coral for visual distinction.
7: Add a title “Cook’s Distance” and label the y-axis accordingly, leaving the x-axis label blank for simplicity.
8: Apply the theme_minimal theme for a clean and professional plot appearance.

Show the code

model <- glm(diabetes ~ glucose + age + glucose:age + pregnant + age:pregnant + mass + pressure + pedigree, data = PimaIndiansDiabetes, family = binomial)



leverage_values <- hatvalues(model)

n <- nrow(PimaIndiansDiabetes)
p <- length(coefficients(model))
leverage_threshold <- 2 * p / n

leverage_df <- data.frame(Observation = seq_along(leverage_values), Leverage = leverage_values)

ggplot(leverage_df, aes(x = Observation, y = Leverage)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = leverage_threshold, linetype = "dashed", color = "red") +
  labs(title = "Leverage Values", x = "Observation Index", y = "Leverage") +
  theme_minimal()

1: Fit a logistic regression model with predictors glucose, age, their interaction glucose:age, and additional variables pregnant, age:pregnant, mass, pressure, and pedigree to predict diabetes. Specify family = binomial for binary outcome modeling.
2: Calculate leverage values for each observation using hatvalues, which quantifies the influence of each data point on the fitted values.
3: Compute n, the total number of observations in the PimaIndiansDiabetes dataset.
4: Calculate p, the total number of predictors (including the intercept) in the fitted model.
5: Define a threshold for identifying high leverage points, calculated as 2 * p / n, which is a common rule of thumb.
6: Create a data frame with observation indices and their corresponding leverage values for visualization.
7: Start a ggplot visualization with observation indices on the x-axis and leverage values on the y-axis.
8: Add a point layer to represent individual leverage values, coloring the points blue for clarity.
9: Overlay a horizontal dashed red line indicating the leverage threshold to identify points with high leverage.
10: Add a title “Leverage Values” and label both axes appropriately.
11: Apply the theme_minimal theme for a clean and professional plot appearance.

Show the code

model <- glm(diabetes ~ glucose + age + glucose:age + pregnant + age:pregnant + mass + pressure + pedigree, data = PimaIndiansDiabetes, family = binomial)



deviance_residuals <- residuals(model, type = "deviance")

residuals_df <- data.frame(Observation = seq_along(deviance_residuals), DevianceResiduals = deviance_residuals)

ggplot(residuals_df, aes(x = Observation, y = DevianceResiduals)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "red") +
  labs(title = "Deviance Residuals", x = "Observation Index", y = "Deviance Residuals") +
  theme_minimal()

1: Fit a logistic regression model with predictors glucose, age, their interaction glucose:age, and additional variables pregnant, age:pregnant, mass, pressure, and pedigree to predict diabetes. Specify family = binomial for binary outcome modeling.
2: Extract deviance residuals from the fitted model using residuals with type = "deviance" to evaluate the goodness-of-fit for each observation.
3: Create a data frame containing observation indices and their corresponding deviance residuals for visualization.
4: Start a ggplot visualization with observation indices on the x-axis and deviance residuals on the y-axis.
5: Add a point layer to display individual deviance residuals, coloring the points blue for clarity.
6: Overlay horizontal dashed red lines at residual values of -2 and 2 to highlight potential outliers.
7: Add a title “Deviance Residuals” and label both axes appropriately.
8: Apply the theme_minimal theme for a clean and professional plot appearance.

Considering the model assumptions needs to be done in the context of the study design and your subject matter knowledge of the variables and the data. Here is a list of the assumptions that you should consider (we’ll take a closer look at each of them below):

The dependent (outcome) variable is binary
Observations are independent
Adequate sample size
No omitted variables that bias the study
Good logistic regression model fit
No perfect multicollinearity
Linear relationship between the log-odds and predictors
Absence of highly influential outliers

Let’s look at each of these in detail.

1. The Dependent Variable is Binary

Assumption: The outcome variable must be binary (e.g., “Yes” or “No”).
Check: Ensure the dependent variable has two levels.
- In R, use table(your_data$dependent_variable) to confirm.

In this example, the outcome variable is binary (diabetes or no diabetes)

2. Observations are Independent

Assumption: Each observation is independent of the others.

This means that the outcome of one observation does not influence or is not influenced by the outcome of another. For example, data collected from different individuals in a survey is generally considered independent, whereas repeated measures from the same individual are not.

Check:

Review the study design to ensure there is no clustering or repeated measurements that could violate independence.
If clustering exists (e.g., data collected from individuals within the same group or location), consider using mixed-effects models or generalized estimating equations (GEE) to account for it.

In this example, there are no repeated measures on the same individuals (each observation is a new person) and so the observations can be considered independent of each other

3. The Relationship Between the Log-Odds and Predictors is Linear

At each value of a predictor variable, there is a chance that the outcome will (in this case) be diabetes. That likelihood is the odds (i.e. the probability of it happening over the probability of it not happening). If we take the natural logarithm of all of those odds and plot them against the associated values in the predictor variable, we should get a straight line.

Assumption: There is a linear relationship between the predictors and the log-odds of the outcome.
Check:
- Method: Use the Box-Tidwell test to detect non-linearity.
- Alternative: Create interaction terms between continuous variables and their log-transformation. Significant interaction terms suggest non-linearity.
- Visualization: Plot the log-odds against predictors or use splines to assess non-linearity.
  
  Let’s check whether or not the predictor variable “glucose” meets this assumption:

Explanatory note:

Figure 6 shows each predictor variable plotted against the log-odds of being diagnosed with diabetes (the outcome variable). Remember that the odds are an indication of the likelihood of something happening.

Looking at Figure 6 we may feel that the relationship is linear. This visual check can be biased by the subjective interpretation of the researcher. Using a statistical method to determine the linearity of this relationship can be needed. Let’s take a look at how to use the Box-Tidwell test for this.

Box-Tidwell test

The Box-Tidwell test evaluates whether the relationship between continuous predictors and the log-odds of the outcome in a logistic regression model is linear.

The Box-Tidwell test introducing an interaction term between each predictor and its natural logarithm. If the interaction term is statistically significant, it indicates a deviation from linearity, suggesting that the predictor’s relationship with the log-odds is non-linear.

Show the code

dataset$diabetes <- ifelse(dataset$diabetes == "pos", 1, 0)

dataset <- dataset %>%
  mutate(pressure_log = log(pressure), pressure_interaction = pressure * pressure_log)

model <- glm(diabetes ~ pressure + pressure_interaction, family = binomial, data = dataset)

summary(model)

1: Convert the diabetes variable to a binary format, where “pos” becomes 1 and other values become 0, for compatibility with logistic regression.
2: Add two new columns to the dataset: pressure_log, the natural logarithm of pressure, and pressure_interaction, the product of pressure and its logarithm, to capture interaction effects.
3: Transform the dataset to include interaction terms between pressure and its log transformation for modeling.
4: Fit a logistic regression model with pressure and the interaction term pressure_interaction as predictors, specifying family = binomial for binary outcome modeling.
5: Summarize the fitted model to evaluate the significance of predictors, including the interaction term, by examining coefficients, standard errors, z-values, and p-values.


Call:
glm(formula = diabetes ~ pressure + pressure_interaction, family = binomial, 
    data = dataset)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          -3.55684    3.55422  -1.001    0.317
pressure              0.08580    0.25640   0.335    0.738
pressure_interaction -0.01067    0.04835  -0.221    0.825

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 935.76  on 726  degrees of freedom
Residual deviance: 915.12  on 724  degrees of freedom
AIC: 921.12

Number of Fisher Scoring iterations: 4

Interpretation of the Results:

Look at the p-value for pressure_interaction:
- If the p-value is significant (e.g., < 0.05), this indicates a non-linear relationship between pressure and the log-odds of diabetes.
- In this model, the p value for the interaction term is more than 0.05. The relationship can therefore be considered linear

4. No perfect multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, meaning they share similar information about the outcome.

Correlation coefficients measure the linear relationship between two variables at a time. However, multicollinearity can arise from more complex interrelationships among multiple variables, which simple pairwise correlations won’t reveal.

Multicollinearity can make it difficult to determine the independent effect of each predictor, leading to inflated standard errors, unstable coefficient estimates, and reduced interpretability of the model. While it doesn’t affect the predictive power of the model as a whole, it can compromise the reliability of individual coefficients, making them less useful for inference.

Variance Inflation Factor (VIF) Detects Multivariate Issues:

VIF assesses how much the variance of a regression coefficient is inflated due to linear dependence among predictors. It considers the combined effects of all predictors in the model, not just pairwise correlations.

To calculate the Variance Inflation Factor (VIF) for your logistic regression model, you can use the car package, which provides the vif() function. Note that all of the predictor variables are included in the model passed into the VIF function except the interaction terms.

Show the code

library(car)

model <- glm(diabetes ~ glucose + age + pregnant + mass + pressure + pedigree, data = PimaIndiansDiabetes, family = binomial)

vif(model)

1: Load the car library to access tools for regression diagnostics, including the Variance Inflation Factor (VIF) function.
2: Fit a logistic regression model with predictors glucose, age, pregnant, mass, pressure, and pedigree to predict diabetes, using the glm function and specifying family = binomial for binary outcome modeling.
3: Calculate the Variance Inflation Factor (VIF) for each predictor in the model to assess multicollinearity, where higher values indicate stronger correlations among predictors.

 glucose      age pregnant     mass pressure pedigree 
1.050219 1.471357 1.407772 1.100039 1.138374 1.009366

As a rule of thumb, a VIF higher than 10 is an indication of seveve multicollinearity. A variable with a VIF greater than 10 doesn’t always need to be dropped. While a high VIF indicates potential multicollinearity, it’s crucial to consider the context, the importance of the variable, and the specific goals of the analysis. There are techniques to address multicollinearity without immediately resorting to dropping variables.

Important

In this case all of the variables included in the model have low VIF values.

When to Consider Dropping the Variable:

The variable is not critical to the research question or hypothesis.
Its removal does not significantly impact the model’s interpretability or predictive power.
Other methods to address multicollinearity fail or are not practical.

Alternative Techniques to Handle High VIF:

Centering the Variables:
- For interaction terms or highly correlated predictors, subtract the mean of the variable before creating the interaction.
- This reduces multicollinearity without losing information.
Regularized Regression:
- Use techniques like Ridge Regression or Lasso Regression to penalize large coefficients, effectively managing multicollinearity.
Combining Variables:
- If two variables are highly correlated and conceptually similar, consider combining them (e.g., averaging or summing them) to reduce redundancy.
Principal Component Analysis (PCA):
- Use PCA to create uncorrelated components from the correlated variables and use these components as predictors.
Retain High-VIF Variables if They’re Important:
- If a variable is crucial for your research question or model interpretation, you can keep it while acknowledging the multicollinearity.
- Document this decision and interpret the results cautiously.
  
  Model Simplification:
- If the high VIF is due to complex interactions, reevaluate whether the interaction terms are necessary for your specific research objectives.

Practical Advice:

Prioritize Domain Knowledge: The decision to keep or drop a variable should align with its theoretical or practical importance in your study.
Assess Impact: Examine whether the high VIF affects your coefficient estimates or their standard errors significantly.
Iterative Approach: Start by trying alternative techniques, and only drop the variable as a last resort.

5. Absence of Highly Influential Outliers

Assumption: No single observation should unduly influence the model.
Validation: Examine influence measures such as Cook’s distance, leverage values, and deviance residuals.

Cook’s Distance

Cook’s distance is a measure used to identify influential data points in a regression model. It quantifies the impact of removing an observation on the overall model fit. High Cook’s distance values indicate that the observation has a significant influence on the estimated regression coefficients. A rule of thumb is that Cook’s distance greater than 1 or significantly higher than the values for most other observations warrants further investigation. However, interpretation depends on context and the dataset. Figure 7 shows the Cook’s Distance for all of the observations in our dataset.

The fact that all of the values are substantially less than 1 suggests that none of the individual observations have an unusual influence on the model.

Leverage Values in Logistic Regression

Leverage values measure the influence of an observation on the predicted outcomes by evaluating how far the observation’s predictor values deviate from the mean of the predictors. In the context of logistic regression, observations with high leverage have the potential to disproportionately affect the model’s fit. High-leverage points may not always be problematic, but they warrant attention, particularly if they also have large residuals or Cook’s distances. Identifying and assessing leverage values ensures that influential points are not unduly skewing the model, which is critical for the robustness and interpretability of results.

Figure 8 above shows that there are a few observations with high leverage. Given that the Cook’s Distance for all of these values is low, we’re not going to be too worried. However there are a handful that are particularly high and we might want to identify those observations and take a closer look at them.

Identify specific observations with high leverage values

Show the code

leverage_values <- hatvalues(model)

leverage_df <- data.frame(Observation = seq_along(leverage_values), Leverage = leverage_values)

n <- nrow(PimaIndiansDiabetes)
p <- length(coefficients(model))
leverage_threshold <- 2 * p / n

high_leverage_df <- leverage_df[leverage_df$Leverage > leverage_threshold, ]

high_leverage_df[order(-high_leverage_df$Leverage), ] %>% head(5)

1: Calculate leverage values for each observation in the model using hatvalues to quantify each point’s influence on the fitted values.
2: Create a data frame containing observation indices and their corresponding leverage values for analysis.
3: Compute n, the total number of observations in the PimaIndiansDiabetes dataset.
4: Calculate p, the total number of predictors in the model, including the intercept.
5: Define the leverage threshold using the formula 2 * p / n, a common rule for identifying high-leverage points.
6: Filter the leverage_df data frame to include only observations with leverage values exceeding the threshold, identifying high-leverage points.
7: Rank the high-leverage points by their leverage values in descending order and display the top 5 observations for further examination.

    Observation   Leverage
454         454 0.07769392
337         337 0.04806033
8             8 0.04561866
126         126 0.04496419
295         295 0.04306046

Observation 538, upon inspection doesn’t seem unusual (the age of the person was high but that’s no reason to exclude the observation). We take note of the fact that observation 538 is high as we look at the Deviance Residuals below.

Deviance Residuals

Deviance residuals measure the discrepancy between observed and predicted values in logistic regression, providing a way to assess individual data points’ fit within the model. They are derived from the contribution of each observation to the overall deviance, with large deviance residuals indicating observations that deviate significantly from the model’s predictions. This makes them particularly relevant for identifying potential outliers or influential points that may affect the model’s validity. In logistic regression, where traditional residuals are not meaningful due to the binary nature of the outcome, deviance residuals are an essential diagnostic tool to check the model’s assumption of appropriate fit for individual observations.

Figure 9 shows a few observations that fall outside of the range that we’d like. Again, because the Cook’s Distance is low we’re not too concerned but we do want to know if any of the problematic observations here are the same as those identified in Figure 8

List those that cross the threshold of > 2

Show the code

library(car)

model <- glm(diabetes ~ glucose + age + pregnant + mass + pressure + pedigree, data = PimaIndiansDiabetes, family = binomial)



deviance_residuals <- residuals(model, type = "deviance")

residuals_df <- data.frame(Observation = seq_along(deviance_residuals), DevianceResiduals = deviance_residuals)

large_residuals_df <- residuals_df[abs(residuals_df$DevianceResiduals) > 2, ]

large_residuals_df[order(-abs(large_residuals_df$DevianceResiduals)), ] %>% head(5)

1: Load the car library to access tools for regression diagnostics, including the Variance Inflation Factor (VIF) function.
2: Fit a logistic regression model with predictors glucose, age, pregnant, mass, pressure, and pedigree to predict diabetes, using the glm function and specifying family = binomial for binary outcome modeling.
3: Extract deviance residuals from the fitted model using residuals with type = "deviance" to evaluate the goodness-of-fit for each observation.
4: Create a data frame containing observation indices and their corresponding deviance residuals for analysis.
5: Filter the data frame to include only observations with deviance residuals exceeding an absolute value of 2, identifying potentially problematic data points.
6: Rank the filtered observations by the absolute size of their deviance residuals in descending order and display the top 5 for further examination.

    Observation DevianceResiduals
350         350          2.898431
229         229         -2.738045
503         503          2.717513
10           10          2.567858
623         623         -2.466175

Once again, each of the observations that cross the threshold can be investigated. If needed, the model can be run without those observations. In this case, given that most of the values are within the threshold and the Cook’s distances are all small, the model can be left as is.

6. Adequate Sample Size

Assumption: The dataset should have sufficient observations to estimate model parameters effectively.
Validation: Follow the rule of thumb of at least 10 events per predictor variable and calculate accordingly.

Our model has about 10 observations per predictor variable and so meets this assumption

7. Logistic Regression Model Fit

Assumption: The model should fit the data well.
Validation: Use goodness-of-fit tests like the Hosmer-Lemeshow test which assesses how well the predicted probabilities align with the observed outcomes across different ranges of predicted probabilities. It provides a test of the model’s goodness of fit, especially in terms of calibration. A p-value of greater than 0.05 suggests a good fit.

Show the code

library(ResourceSelection)

hoslem.test(model$y, fitted(model), g = 10)

1: Load the ResourceSelection package to perform the Hosmer-Lemeshow test for goodness-of-fit in logistic regression models.
2: Conduct the Hosmer-Lemeshow test using hoslem.test with the observed outcomes (model$y) and the fitted probabilities from the model. Specify g = 10 to group the data into 10 bins based on predicted probabilities.


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  model$y, fitted(model)
X-squared = 5.1074, df = 8, p-value = 0.746

In this case the p value is 0.09 that suggests that this model is a good fit.

8. No Omitted Variable Bias

Assumption: All relevant variables are included in the model.
Validation: Use domain knowledge and model comparison techniques to evaluate the inclusion of relevant variables.