Including interaction terms in logit regression models can introduce several complexities and potential issues. One primary concern is the risk of multicollinearity, where the interaction terms are highly correlated with the main effects, inflating the variance of the coefficient estimates and making them unstable and difficult to interpret. This can lead to misleading conclusions about the relationships between variables. Additionally, interaction terms can complicate the model by increasing the number of parameters, which can be particularly problematic if the sample size is not large enough to support the additional complexity. This can result in overfitting, where the model captures noise rather than the true underlying relationships, reducing its generalizability to new data.
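To make the multicollinearity point concrete, here is a minimal sketch on simulated data (not the baseball dataset used below): an interaction between two predictors centered far from zero is strongly correlated with its main effects, and centering the predictors before interacting largely removes that correlation. The variable names and data-generating values are invented for illustration.
# Hypothetical illustration (simulated data): uncentered interaction terms
# are often highly correlated with their main effects
set.seed(42)
n  <- 500
x1 <- rnorm(n, mean = 5)
x2 <- rnorm(n, mean = 5)
y  <- rbinom(n, 1, plogis(-5.5 + 0.3 * x1 + 0.3 * x2 + 0.1 * x1 * x2))

cor(x1, x1 * x2)  # roughly 0.7 here: the raw interaction tracks its main effect
m_raw <- glm(y ~ x1 * x2, family = binomial)
vif(m_raw)        # car::vif(); recent car versions may warn about interactions

# Centering the predictors before interacting usually breaks this correlation
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
cor(x1c, x1c * x2c)  # near 0 after centering
vif(glm(y ~ x1c * x2c, family = binomial))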
Furthermore, including interaction terms in logit regression models can lead to misinterpretation of results due to the nonlinearity of the logit function. As Ai and Norton (2003) highlight, the marginal effects of interaction terms in nonlinear models are not simply the coefficients, as they would be in a linear regression. Instead, the true interaction effect depends on the values of all covariates and varies across observations. This complexity means that simply interpreting the interaction coefficient as an indication of whether the interaction effect is significant or not can be misleading. Additionally, standard hypothesis tests for interaction terms in logit models may not provide accurate inferences due to the way interaction effects manifest in these models.
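A small simulated sketch of the Ai and Norton point: for two binary regressors, the interaction effect on the probability scale is the discrete double difference of predicted probabilities, which varies across observations even though the logit coefficient is a single number. All names and coefficients below are invented for illustration.
# Hypothetical sketch: the interaction effect in a logit is not the coefficient
set.seed(1)
n  <- 1000
d1 <- rbinom(n, 1, 0.5)
d2 <- rbinom(n, 1, 0.5)
x  <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * d1 + 0.8 * d2 + 0.5 * d1 * d2 + 0.6 * x))

fit <- glm(y ~ d1 * d2 + x, family = binomial)

# Predicted probability at each (d1, d2) cell, holding x at its observed values
p <- function(a, b) {
  predict(fit, newdata = data.frame(d1 = a, d2 = b, x = x), type = "response")
}

# Discrete double difference: how the effect of d1 changes with d2
dd <- (p(1, 1) - p(0, 1)) - (p(1, 0) - p(0, 0))
summary(dd)          # one interaction effect per observation, not one number
coef(fit)["d1:d2"]   # the single logit-scale coefficient, by contrast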
As suggested by King, Tomz, and Wittenberg (2000) and further explored by Zelner (2009), a simulation-based approach offers a more reliable way to interpret results from logit models with interaction terms. By simulating predicted probabilities across a range of values for the interacting variables, researchers can visualize how the interaction affects the outcome rather than relying on coefficient significance alone. This approach helps address issues of nonlinearity by generating more intuitive quantities of interest, such as first differences or relative risks, rather than raw logit coefficients. Simulations can also provide confidence intervals for these quantities, improving the robustness and interpretability of findings. Overall, simulations offer a powerful tool for enhancing the reliability and interpretability of logit regression models with interaction terms.
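A sketch of that simulation logic, again on invented data: draw coefficient vectors from the estimated sampling distribution of a fitted logit, then summarize first differences in predicted probabilities over a grid, yielding point estimates with 95% intervals.
# Hypothetical sketch of the simulation-based approach
set.seed(2)
n  <- 1000
d  <- rbinom(n, 1, 0.5)
x  <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.7 * d + 0.4 * x + 0.5 * d * x))
fit <- glm(y ~ d * x, family = binomial)

# 1,000 draws from the approximate sampling distribution of the coefficients
# (MASS:: is called directly to avoid masking dplyr::select)
draws <- MASS::mvrnorm(1000, mu = coef(fit), Sigma = vcov(fit))

# Design matrices over a grid of x for d = 1 and d = 0
# (column order matches coef(fit): intercept, d, x, d:x)
x_grid <- seq(min(x), max(x), length.out = 50)
X1 <- cbind(1, 1, x_grid, x_grid)
X0 <- cbind(1, 0, x_grid, 0 * x_grid)

# First difference Pr(y = 1 | d = 1) - Pr(y = 1 | d = 0) at each grid point
fd <- plogis(draws %*% t(X1)) - plogis(draws %*% t(X0))
fd_ci <- apply(fd, 2, quantile, probs = c(0.025, 0.5, 0.975))
fd_ci[, c(1, 25, 50)]  # intervals at three points along the range of x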
In this analysis, we will use the baseball hitting data to estimate a set of logistic regression models with increasing complexity. We will conduct a likelihood ratio test and use AIC and BIC to select the best model. We will then interpret the results.
First, we load the necessary libraries and the dataset.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
# Load the data
url <- "https://raw.githubusercontent.com/angelbayron/AngelB/main/baseball_hitting.csv"
data <- read_csv(url)
## Rows: 2508 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Player name, position, Strikeouts, Caught stealing
## dbl (14): Games, At-bat, Runs, Hits, Double (2B), third baseman, home run, r...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s take a look at the first few rows of the dataset.
head(data)
## # A tibble: 6 × 18
## `Player name` position Games `At-bat` Runs Hits `Double (2B)`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B Bonds LF 2986 9847 2227 2935 601
## 2 H Aaron RF 3298 12364 2174 3771 624
## 3 B Ruth RF 2504 8399 2174 2873 506
## 4 A Pujols 1B 3080 11421 1914 3384 686
## 5 A Rodriguez SS 2784 10566 2021 3115 548
## 6 W Mays CF 2992 10881 2062 3283 523
## # ℹ 11 more variables: `third baseman` <dbl>, `home run` <dbl>,
## # `run batted in` <dbl>, `a walk` <dbl>, Strikeouts <chr>,
## # `stolen base` <dbl>, `Caught stealing` <chr>, AVG <dbl>,
## # `On-base Percentage` <dbl>, `Slugging Percentage` <dbl>,
## # `On-base Plus Slugging` <dbl>
For the logistic regression analysis we need a binary outcome. Let's create a variable High_HR indicating whether a player has hit more than 400 career home runs.
data <- data %>%
mutate(High_HR = ifelse(`home run` > 400, 1, 0))
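As a quick sanity check, we can tabulate the new variable to see how many players clear the 400-home-run threshold and whether any values are missing.
# Distribution of the new binary outcome, including any NAs
table(data$High_HR, useNA = "ifany")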
We will build logistic regression models with increasing complexity and compare them using AIC and BIC.
Our first model will include only the At-bat variable.
model1 <- glm(High_HR ~ `At-bat`, data = data, family = binomial)
summary(model1)
##
## Call:
## glm(formula = High_HR ~ `At-bat`, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.675e+00 7.071e-01 -13.68 <2e-16 ***
## `At-bat` 9.401e-04 8.473e-05 11.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 543.72 on 2499 degrees of freedom
## Residual deviance: 302.68 on 2498 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 306.68
##
## Number of Fisher Scoring iterations: 8
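Log-odds coefficients are easier to grasp as predicted probabilities. As a quick illustration (the at-bat totals below are arbitrary), we can ask model 1 for the probability of High_HR at a few career at-bat counts:
# Predicted probability of High_HR at a few illustrative career at-bat totals
newdat <- data.frame(`At-bat` = c(4000, 8000, 12000), check.names = FALSE)
predict(model1, newdata = newdat, type = "response")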
Runs
The second model will include At-bat and Runs.
model2 <- glm(High_HR ~ `At-bat` + Runs, data = data, family = binomial)
summary(model2)
##
## Call:
## glm(formula = High_HR ~ `At-bat` + Runs, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.4004420 0.7416161 -12.676 < 2e-16 ***
## `At-bat` 0.0003111 0.0001485 2.094 0.0362 *
## Runs 0.0037300 0.0007509 4.967 6.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 543.72 on 2499 degrees of freedom
## Residual deviance: 277.52 on 2497 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 283.52
##
## Number of Fisher Scoring iterations: 9
Hits
The third model will include At-bat, Runs, and Hits.
model3 <- glm(High_HR ~ `At-bat` + Runs + Hits, data = data, family = binomial)
summary(model3)
##
## Call:
## glm(formula = High_HR ~ `At-bat` + Runs + Hits, family = binomial,
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.067e+01 8.872e-01 -12.031 < 2e-16 ***
## `At-bat` 1.588e-03 2.978e-04 5.331 9.78e-08 ***
## Runs 6.905e-03 1.048e-03 6.591 4.38e-11 ***
## Hits -5.662e-03 1.136e-03 -4.982 6.31e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 543.72 on 2499 degrees of freedom
## Residual deviance: 249.16 on 2496 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 257.16
##
## Number of Fisher Scoring iterations: 9
AVG
The fourth model will include At-bat, Runs, Hits, and AVG.
model4 <- glm(High_HR ~ `At-bat` + Runs + Hits + AVG, data = data, family = binomial)
summary(model4)
##
## Call:
## glm(formula = High_HR ~ `At-bat` + Runs + Hits + AVG, family = binomial,
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.290847 9.434467 -1.727 0.0842 .
## `At-bat` 0.002245 0.001146 1.959 0.0501 .
## Runs 0.006747 0.001066 6.333 2.41e-10 ***
## Hits -0.007926 0.003971 -1.996 0.0460 *
## AVG 20.182194 33.542461 0.602 0.5474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 543.72 on 2499 degrees of freedom
## Residual deviance: 248.78 on 2495 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 258.78
##
## Number of Fisher Scoring iterations: 10
We will compare the models using AIC and BIC.
AIC(model1, model2, model3, model4)
## df AIC
## model1 2 306.6802
## model2 3 283.5179
## model3 4 257.1610
## model4 5 258.7845
BIC(model1, model2, model3, model4)
## df BIC
## model1 2 318.3283
## model2 3 300.9900
## model3 4 280.4572
## model4 5 287.9047
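Since broom is already loaded, the same comparison can be collected into one table with glance(); a small sketch:
# Gather fit statistics for all four models into a single tibble
list(model1 = model1, model2 = model2, model3 = model3, model4 = model4) %>%
  map_dfr(glance, .id = "model") %>%
  select(model, AIC, BIC, deviance, df.residual)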
We will conduct a likelihood ratio test to compare the models.
anova(model1, model2, model3, model4, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: High_HR ~ `At-bat`
## Model 2: High_HR ~ `At-bat` + Runs
## Model 3: High_HR ~ `At-bat` + Runs + Hits
## Model 4: High_HR ~ `At-bat` + Runs + Hits + AVG
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2498 302.68
## 2 2497 277.52 1 25.1623 5.270e-07 ***
## 3 2496 249.16 1 28.3569 1.009e-07 ***
## 4 2495 248.78 1 0.3765 0.5395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 3 has the lowest AIC (257.16) and the lowest BIC (280.46), and the likelihood ratio test shows that adding AVG in model 4 does not significantly improve the fit (p = 0.5395). We therefore select model 3 as the best model and interpret its results.
best_model <- model3
summary(best_model)
##
## Call:
## glm(formula = High_HR ~ `At-bat` + Runs + Hits, family = binomial,
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.067e+01 8.872e-01 -12.031 < 2e-16 ***
## `At-bat` 1.588e-03 2.978e-04 5.331 9.78e-08 ***
## Runs 6.905e-03 1.048e-03 6.591 4.38e-11 ***
## Hits -5.662e-03 1.136e-03 -4.982 6.31e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 543.72 on 2499 degrees of freedom
## Residual deviance: 249.16 on 2496 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 257.16
##
## Number of Fisher Scoring iterations: 9
Each coefficient represents the change in the log-odds of High_HR associated with a one-unit increase in the corresponding predictor, holding the other predictors constant. In the selected model, At-bat and Runs are positively associated with hitting more than 400 home runs, while Hits carries a negative coefficient once at-bats and runs are held fixed; all three predictors are statistically significant at conventional levels.
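Because log-odds are hard to read directly, it often helps to exponentiate the coefficients into odds ratios; a short sketch:
# Odds ratios with 95% profile-likelihood confidence intervals
exp(coef(best_model))
exp(confint(best_model))  # profiling may take a moment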
In this analysis, we estimated a set of logistic regression models of increasing complexity to predict whether a player has hit more than 400 home runs. AIC, BIC, and a likelihood ratio test all favored the model with At-bat, Runs, and Hits, and we interpreted that model's coefficients on the log-odds and odds-ratio scales.