Part 1: Interaction Terms in Logit Regression Models

Including interaction terms in logit regression models introduces several complications. One primary concern is multicollinearity: interaction terms are often highly correlated with their constituent main effects, which inflates the variance of the coefficient estimates and makes them unstable and difficult to interpret, potentially leading to misleading conclusions about the relationships between variables. Interaction terms also increase the number of parameters in the model, which is problematic when the sample size is too small to support the added complexity; the result can be overfitting, where the model captures noise rather than the true underlying relationships and generalizes poorly to new data.

Furthermore, including interaction terms in logit regression models can lead to misinterpretation of results due to the nonlinearity of the logit function. As Ai and Norton (2003) highlight, the marginal effects of interaction terms in nonlinear models are not simply the coefficients, as they would be in a linear regression. Instead, the true interaction effect depends on the values of all covariates and varies across observations. This complexity means that simply interpreting the interaction coefficient as an indication of whether the interaction effect is significant or not can be misleading. Additionally, standard hypothesis tests for interaction terms in logit models may not provide accurate inferences due to the way interaction effects manifest in these models.
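
Concretely, write the conditional mean as E[y | x1, x2, Z] = F(u), where F is the logistic CDF. For two continuous regressors, the interaction effect is the cross-partial derivative, not the coefficient on the product term:

$$\frac{\partial^2 F(u)}{\partial x_1\,\partial x_2} = \beta_{12}F'(u) + (\beta_1 + \beta_{12}x_2)(\beta_2 + \beta_{12}x_1)F''(u), \qquad u = \beta_1 x_1 + \beta_2 x_2 + \beta_{12}x_1 x_2 + Z\gamma$$

Because F''(u) changes sign at u = 0, this quantity can differ in sign and magnitude across observations even when the estimate of β12 is positive and statistically significant.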

As suggested by King, Tomz, and Wittenberg (2000) and further explored by Zelner (2009), a simulation-based approach offers a more reliable way to interpret results from logit models with interaction terms. By simulating predicted probabilities across a range of values for the interacting variables, researchers can visualize how the interaction affects the outcome rather than relying on coefficient significance alone. This approach helps address issues of nonlinearity by generating more intuitive quantities of interest, such as first differences or relative risks, rather than raw logit coefficients. Simulations can also provide confidence intervals for these quantities, improving the robustness and interpretability of findings. Overall, simulations offer a powerful tool for enhancing the reliability and interpretability of logit regression models with interaction terms.
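
As a minimal sketch of this simulation approach (run here on synthetic data, since the recipe is general; all names below are illustrative): draw coefficient vectors from the model's estimated sampling distribution, push each draw through the inverse logit under two scenarios, and summarize the resulting first differences.

# Minimal sketch of the King–Tomz–Wittenberg simulation approach
library(MASS)   # mvrnorm() for draws from a multivariate normal

set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.4 * x2 + 0.6 * x1 * x2))
fit <- glm(y ~ x1 * x2, family = binomial)

# 1. Draw coefficient vectors from their asymptotic sampling distribution
sims <- mvrnorm(5000, mu = coef(fit), Sigma = vcov(fit))

# 2. Two scenarios: x1 held at its mean, x2 toggled from 0 to 1
#    (entries: intercept, x1, x2, x1:x2)
x_lo <- c(1, mean(x1), 0, mean(x1) * 0)
x_hi <- c(1, mean(x1), 1, mean(x1) * 1)

# 3. Predicted probability under each scenario, for every draw
p_lo <- plogis(sims %*% x_lo)
p_hi <- plogis(sims %*% x_hi)

# First difference with a 95% interval: the effect of x2 at mean x1,
# on the probability scale rather than the logit scale
quantile(p_hi - p_lo, c(0.025, 0.5, 0.975))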

Part 2

Introduction

In this analysis, we use a baseball hitting dataset to estimate a set of logistic regression models of increasing complexity. We select the best model using AIC, BIC, and a likelihood ratio test, and then interpret its results.

Data Preparation

First, we load the necessary libraries and the dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
# Load the data
url <- "https://raw.githubusercontent.com/angelbayron/AngelB/main/baseball_hitting.csv"
data <- read_csv(url)
## Rows: 2508 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Player name, position, Strikeouts, Caught stealing
## dbl (14): Games, At-bat, Runs, Hits, Double (2B), third baseman, home run, r...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s take a look at the first few rows of the dataset.

head(data)
## # A tibble: 6 × 18
##   `Player name` position Games `At-bat`  Runs  Hits `Double (2B)`
##   <chr>         <chr>    <dbl>    <dbl> <dbl> <dbl>         <dbl>
## 1 B Bonds       LF        2986     9847  2227  2935           601
## 2 H Aaron       RF        3298    12364  2174  3771           624
## 3 B Ruth        RF        2504     8399  2174  2873           506
## 4 A Pujols      1B        3080    11421  1914  3384           686
## 5 A Rodriguez   SS        2784    10566  2021  3115           548
## 6 W Mays        CF        2992    10881  2062  3283           523
## # ℹ 11 more variables: `third baseman` <dbl>, `home run` <dbl>,
## #   `run batted in` <dbl>, `a walk` <dbl>, Strikeouts <chr>,
## #   `stolen base` <dbl>, `Caught stealing` <chr>, AVG <dbl>,
## #   `On-base Percentage` <dbl>, `Slugging Percentage` <dbl>,
## #   `On-base Plus Slugging` <dbl>

Data Cleaning

For the logistic regression analysis we need a binary outcome, so let's create a variable High_HR indicating whether a player hit more than 400 career home runs. (A few columns such as Strikeouts and Caught stealing were read as character, but we do not use them here; rows with a missing home-run count get NA for High_HR and are dropped from the model fits, which is why the summaries below report 8 deleted observations.)

data <- data %>%
  mutate(High_HR = ifelse(`home run` > 400, 1, 0))  # NA home-run counts stay NA
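
Before modeling, it is worth checking how this choice splits the sample, since only a small share of players clear 400 career home runs:

# Class balance and missingness of the new outcome
table(data$High_HR, useNA = "ifany")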

Model Building

We will build logistic regression models with increasing complexity and compare them using AIC and BIC.

Model 1: Simple Model

Our first model will include only the At-bat variable.

model1 <- glm(High_HR ~ `At-bat`, data = data, family = binomial)
summary(model1)
## 
## Call:
## glm(formula = High_HR ~ `At-bat`, family = binomial, data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.675e+00  7.071e-01  -13.68   <2e-16 ***
## `At-bat`     9.401e-04  8.473e-05   11.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 543.72  on 2499  degrees of freedom
## Residual deviance: 302.68  on 2498  degrees of freedom
##   (8 observations deleted due to missingness)
## AIC: 306.68
## 
## Number of Fisher Scoring iterations: 8
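
On this scale the At-bat coefficient looks tiny, but at-bats accumulate by the thousands over a career. A quick back-of-the-envelope conversion:

# Each additional 1,000 at-bats multiplies the odds of High_HR
# by roughly exp(0.0009401 * 1000) ≈ 2.56
exp(coef(model1)[2] * 1000)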

Model 2: Adding Runs

The second model will include At-bat and Runs.

model2 <- glm(High_HR ~ `At-bat` + Runs, data = data, family = binomial)
summary(model2)
## 
## Call:
## glm(formula = High_HR ~ `At-bat` + Runs, family = binomial, data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.4004420  0.7416161 -12.676  < 2e-16 ***
## `At-bat`     0.0003111  0.0001485   2.094   0.0362 *  
## Runs         0.0037300  0.0007509   4.967 6.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 543.72  on 2499  degrees of freedom
## Residual deviance: 277.52  on 2497  degrees of freedom
##   (8 observations deleted due to missingness)
## AIC: 283.52
## 
## Number of Fisher Scoring iterations: 9

Model 3: Adding Hits

The third model will include At-bat, Runs, and Hits.

model3 <- glm(High_HR ~ `At-bat` + Runs + Hits, data = data, family = binomial)
summary(model3)
## 
## Call:
## glm(formula = High_HR ~ `At-bat` + Runs + Hits, family = binomial, 
##     data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.067e+01  8.872e-01 -12.031  < 2e-16 ***
## `At-bat`     1.588e-03  2.978e-04   5.331 9.78e-08 ***
## Runs         6.905e-03  1.048e-03   6.591 4.38e-11 ***
## Hits        -5.662e-03  1.136e-03  -4.982 6.31e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 543.72  on 2499  degrees of freedom
## Residual deviance: 249.16  on 2496  degrees of freedom
##   (8 observations deleted due to missingness)
## AIC: 257.16
## 
## Number of Fisher Scoring iterations: 9

Model 4: Adding AVG

The fourth model adds AVG. Note that AVG is simply Hits divided by At-bat, so it is close to redundant given two predictors already in the model; the multicollinearity concerns raised in Part 1 should show up here as inflated standard errors.

model4 <- glm(High_HR ~ `At-bat` + Runs + Hits + AVG, data = data, family = binomial)
summary(model4)
## 
## Call:
## glm(formula = High_HR ~ `At-bat` + Runs + Hits + AVG, family = binomial, 
##     data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -16.290847   9.434467  -1.727   0.0842 .  
## `At-bat`      0.002245   0.001146   1.959   0.0501 .  
## Runs          0.006747   0.001066   6.333 2.41e-10 ***
## Hits         -0.007926   0.003971  -1.996   0.0460 *  
## AVG          20.182194  33.542461   0.602   0.5474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 543.72  on 2499  degrees of freedom
## Residual deviance: 248.78  on 2495  degrees of freedom
##   (8 observations deleted due to missingness)
## AIC: 258.78
## 
## Number of Fisher Scoring iterations: 10

Model Comparison

We will compare the models using AIC and BIC; for both criteria, lower values indicate a better trade-off between fit and complexity.

AIC(model1, model2, model3, model4)
##        df      AIC
## model1  2 306.6802
## model2  3 283.5179
## model3  4 257.1610
## model4  5 258.7845
BIC(model1, model2, model3, model4)
##        df      BIC
## model1  2 318.3283
## model2  3 300.9900
## model3  4 280.4572
## model4  5 287.9047
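
Both criteria are minimized by Model 3. As a sanity check, they can be reproduced from the residual deviance, which for ungrouped binary data equals −2 times the log-likelihood:

k <- length(coef(model3))        # 4 estimated parameters
n <- nobs(model3)                # 2500 complete cases
deviance(model3) + 2 * k         # AIC: 249.16 + 8    ≈ 257.16
deviance(model3) + log(n) * k    # BIC: 249.16 + 31.3 ≈ 280.46

BIC penalizes the extra AVG parameter in Model 4 more heavily than AIC does, which is why the gap between Models 3 and 4 is wider under BIC.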

Likelihood Ratio Test

We will conduct likelihood ratio tests comparing each model with the one before it; a significant result means the added predictor improves fit beyond what chance would predict.

anova(model1, model2, model3, model4, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: High_HR ~ `At-bat`
## Model 2: High_HR ~ `At-bat` + Runs
## Model 3: High_HR ~ `At-bat` + Runs + Hits
## Model 4: High_HR ~ `At-bat` + Runs + Hits + AVG
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      2498     302.68                          
## 2      2497     277.52  1  25.1623 5.270e-07 ***
## 3      2496     249.16  1  28.3569 1.009e-07 ***
## 4      2495     248.78  1   0.3765    0.5395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
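
The sequential tests show that Runs and Hits each add significant explanatory power, while AVG does not. Since car is already loaded, a complementary check is a likelihood-ratio test of each term in the full model:

# Type II likelihood-ratio tests for each term in Model 4
Anova(model4, test.statistic = "LR")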

Model Interpretation

AIC and BIC both reach their minimum at Model 3, and the likelihood ratio test confirms that adding AVG in Model 4 does not significantly improve fit (p = 0.54). We therefore select Model 3 and interpret its results.

best_model <- model3
summary(best_model)
## 
## Call:
## glm(formula = High_HR ~ `At-bat` + Runs + Hits, family = binomial, 
##     data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.067e+01  8.872e-01 -12.031  < 2e-16 ***
## `At-bat`     1.588e-03  2.978e-04   5.331 9.78e-08 ***
## Runs         6.905e-03  1.048e-03   6.591 4.38e-11 ***
## Hits        -5.662e-03  1.136e-03  -4.982 6.31e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 543.72  on 2499  degrees of freedom
## Residual deviance: 249.16  on 2496  degrees of freedom
##   (8 observations deleted due to missingness)
## AIC: 257.16
## 
## Number of Fisher Scoring iterations: 9

Each coefficient gives the change in the log-odds of High_HR for a one-unit increase in that predictor, holding the other variables constant; exponentiating a coefficient converts it to an odds ratio. In Model 3 all three predictors are highly significant. The negative coefficient on Hits may look odd at first, but it is conditional: holding at-bats and runs fixed, one plausible reading is that additional hits signal a contact hitter rather than the kind of power hitter who accumulates 400 home runs.
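
Since broom is already loaded, the coefficients are easier to communicate as odds ratios, and the model can be pushed all the way to a predicted probability for a hypothetical career line (the numbers below are purely illustrative):

# Coefficients as odds ratios with 95% confidence intervals
tidy(best_model, exponentiate = TRUE, conf.int = TRUE)

# Predicted probability of >400 home runs for a hypothetical player
new_player <- tibble(`At-bat` = 9000, Runs = 1500, Hits = 2500)
predict(best_model, newdata = new_player, type = "response")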

Conclusion

In this analysis, we estimated a set of logistic regression models of increasing complexity to predict whether a player hit more than 400 career home runs. AIC, BIC, and a likelihood ratio test all pointed to Model 3 (At-bat, Runs, and Hits): adding AVG did not improve fit, consistent with its near-redundancy given At-bat and Hits. We then interpreted the coefficients of the selected model.

Part 3