Reading Assignment

When we include interaction terms in logit regression models used to predict probabilities, interpreting the results can be challenging. In linear models, interaction terms show how the effect of one variable change depending on another variable, and this is straightforward. However, in logit models, the relationship is not linear, making the interaction effect less clear. The coefficients obtained from the model do not directly indicate how much the probability changes when both variables interact. This can lead to misunderstandings or incorrect conclusions.

For example, Ai and Norton (2003) point out that the interaction effect in logit models is not constant, it depends on the values of other variables in the model. This means it is not as simple as looking at the coefficient and concluding the effect. King, Tomz, and Wittenberg (2000) suggest that to make sense of these results, we should focus on presenting predicted probabilities or marginal effects, which are easier to understand. Instead of merely reporting coefficients, we can show how the probability of an outcome changes under different scenarios.

This is where simulation-based approaches, as suggested by Zelner, B.A. (2009), become useful. This approach allows us to create hypothetical scenarios and observe how predicted probabilities change when we adjust the variables. For instance, we can simulate what happens to the probability of an outcome when one variable increase while another remains constant, or when both change together. This provides a clearer picture of the interaction effect and helps avoid misinterpretations of the model’s results. Simulation makes the complex relationships in logit models more tangible and easier to explain.

In short, interaction terms in logit models can be difficult to interpret because their effects are not straightforward. However, by using simulation and focusing on predicted probabilities, we can better understand and communicate the true implications of our model.

Data Analysis Assignment

# Clear environment
rm(list = ls())
gc()
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 539019 28.8    1196824   64   686460 36.7
## Vcells 982446  7.5    8388608   64  1875940 14.4
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(haven)
library(stringr)
library(ggplot2)
library(broom)
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
# Set directory
air_quality <- read.csv("C:/Users/Shamp/Downloads/Air_Quality 2024NYC.csv")

# Verify the Column Name
colnames(air_quality)
##  [1] "Unique.ID"     "Indicator.ID"  "Name"          "Measure"      
##  [5] "Measure.Info"  "Geo.Type.Name" "Geo.Join.ID"   "GeoPlaceName" 
##  [9] "Time.Period"   "Start_Date"    "Data.Value"    "Message"
# Check the data structure 
str(air_quality)
## 'data.frame':    18025 obs. of  12 variables:
##  $ Unique.ID    : int  179772 221956 221806 221836 221812 179785 178540 178561 823217 221962 ...
##  $ Indicator.ID : int  640 386 386 386 386 640 365 365 365 386 ...
##  $ Name         : chr  "Boiler Emissions- Total SO2 Emissions" "Ozone (O3)" "Ozone (O3)" "Ozone (O3)" ...
##  $ Measure      : chr  "Number per km2" "Mean" "Mean" "Mean" ...
##  $ Measure.Info : chr  "number" "ppb" "ppb" "ppb" ...
##  $ Geo.Type.Name: chr  "UHF42" "UHF34" "UHF34" "UHF34" ...
##  $ Geo.Join.ID  : int  409 305307 103 204 104 209 209 409 409 306308 ...
##  $ GeoPlaceName : chr  "Southeast Queens" "Upper East Side-Gramercy" "Fordham - Bronx Pk" "East New York" ...
##  $ Time.Period  : chr  "2015" "Summer 2014" "Summer 2014" "Summer 2014" ...
##  $ Start_Date   : chr  "1/1/2015" "6/1/2014" "6/1/2014" "6/1/2014" ...
##  $ Data.Value   : num  0.3 24.9 30.7 32 31.9 1.2 8.6 8 6.1 25.3 ...
##  $ Message      : logi  NA NA NA NA NA NA ...
head(air_quality)
##   Unique.ID Indicator.ID                                  Name        Measure
## 1    179772          640 Boiler Emissions- Total SO2 Emissions Number per km2
## 2    221956          386                            Ozone (O3)           Mean
## 3    221806          386                            Ozone (O3)           Mean
## 4    221836          386                            Ozone (O3)           Mean
## 5    221812          386                            Ozone (O3)           Mean
## 6    179785          640 Boiler Emissions- Total SO2 Emissions Number per km2
##   Measure.Info Geo.Type.Name Geo.Join.ID             GeoPlaceName Time.Period
## 1       number         UHF42         409         Southeast Queens        2015
## 2          ppb         UHF34      305307 Upper East Side-Gramercy Summer 2014
## 3          ppb         UHF34         103       Fordham - Bronx Pk Summer 2014
## 4          ppb         UHF34         204            East New York Summer 2014
## 5          ppb         UHF34         104     Pelham - Throgs Neck Summer 2014
## 6       number         UHF42         209  Bensonhurst - Bay Ridge        2015
##   Start_Date Data.Value Message
## 1   1/1/2015        0.3      NA
## 2   6/1/2014       24.9      NA
## 3   6/1/2014       30.7      NA
## 4   6/1/2014       32.0      NA
## 5   6/1/2014       31.9      NA
## 6   1/1/2015        1.2      NA
# Create a binary variable (e.g., Data.Value > 25)
air_quality <- air_quality %>%
  mutate(Binary_Var = ifelse(Data.Value > 25, 1, 0))

# Create a subset of the data with only the necessary variables and remove rows with missing values
subset_data <- air_quality %>%
  select(Binary_Var, Data.Value, Indicator.ID, Geo.Type.Name, Geo.Join.ID) %>%
  na.omit()
# Fit a simple logit model
model1 <- glm(Binary_Var ~ Data.Value, data = subset_data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Fit a more complex logit model
model2 <- glm(Binary_Var ~ Data.Value + Indicator.ID + Geo.Type.Name, data = subset_data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Fit an even more complex logit model
model3 <- glm(Binary_Var ~ Data.Value + Indicator.ID + Geo.Type.Name + Geo.Join.ID, data = subset_data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Compare models using AIC and BIC
AIC(model1, model2, model3)
##        df       AIC
## model1  2  4.001003
## model2  7 14.001003
## model3  8 16.001003
BIC(model1, model2, model3)
##        df      BIC
## model1  2 19.59903
## model2  7 68.59411
## model3  8 78.39313
# Conduct likelihood ratio test
anova(model1, model2, model3, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: Binary_Var ~ Data.Value
## Model 2: Binary_Var ~ Data.Value + Indicator.ID + Geo.Type.Name
## Model 3: Binary_Var ~ Data.Value + Indicator.ID + Geo.Type.Name + Geo.Join.ID
##   Resid. Df Resid. Dev Df    Deviance Pr(>Chi)
## 1     18014  0.0010031                        
## 2     18009  0.0010028  5  2.8240e-07        1
## 3     18008  0.0010034  1 -6.2439e-07
# Summarize the best model
summary(model3)
## 
## Call:
## glm(formula = Binary_Var ~ Data.Value + Indicator.ID + Geo.Type.Name + 
##     Geo.Join.ID, family = binomial, data = subset_data)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)           -5.892e+03  1.401e+04  -0.421    0.674
## Data.Value             2.350e+02  5.423e+02   0.433    0.665
## Indicator.ID           6.591e-05  3.198e-01   0.000    1.000
## Geo.Type.NameCD        5.396e+00  3.505e+03   0.002    0.999
## Geo.Type.NameCitywide  4.314e+01  2.438e+04   0.002    0.999
## Geo.Type.NameUHF34     5.447e+00  3.505e+03   0.002    0.999
## Geo.Type.NameUHF42     5.384e+00  3.505e+03   0.002    0.999
## Geo.Join.ID            1.129e-07  3.653e-04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.1627e+04  on 18015  degrees of freedom
## Residual deviance: 1.0034e-03  on 18008  degrees of freedom
## AIC: 16.001
## 
## Number of Fisher Scoring iterations: 25

Interpretation

Model Summaries

  1. Model 1: Binary_Var ~ Data.Value
  2. Model 2: Binary_Var ~ Data.Value + Indicator.ID + Geo.Type.Name
  3. Model 3: Binary_Var ~ Data.Value + Indicator.ID + Geo.Type.Name + Geo.Join.ID

Likelihood Ratio Test

The likelihood ratio test compares the nested models to see if the more complex models significantly improve the fit compared to the simpler ones. Coefficients and P-values

• (Intercept): The intercept is not statistically significant (p = 0.674). • Data.Value: The coefficient for Data.Value is positive but not statistically significant (p = 0.665). • Indicator.ID: The coefficient for Indicator.ID is very close to zero and not statistically significant (p = 1.000). • Geo.Type.NameCD: The coefficient for Geo.Type.NameCD is not statistically significant (p = 0.999). • Geo.Type.NameCitywide: The coefficient for Geo.Type.NameCitywide is not statistically significant (p = 0.999). • Geo.Type.NameUHF34: The coefficient for Geo.Type.NameUHF34 is not statistically significant (p = 0.999). • Geo.Type.NameUHF42: The coefficient for Geo.Type.NameUHF42 is not statistically significant (p = 0.999). • Geo.Join.ID: The coefficient for Geo.Join.ID is very close to zero and not statistically significant (p = 1.000).

Model Fit

The results indicate that none of the variables included in Model 3 are statistically significant predictors of the binary outcome. The p-values for all coefficients are very high, indicating that there is no evidence to suggest that any of these variables have a noteworthy effect on the binary outcome.

Null deviance: 21627 on 18015 degrees of freedom.

This measures how well the response variable is predicted by a model that includes only the intercept but no predictors. A high null deviance indicates that the intercept-only model does not fit the data well. Residual Deviance

Residual deviance: 0.0010034 on 18008 degrees of freedom.

This measures how well the response variable is predicted by the model with predictors. A lower residual deviance indicates a better fit. In this case, the residual deviance is extremely low, suggesting that the model fits the data almost perfectly. However, this might indicate overfitting. Degrees of Freedom

Degrees of freedom: The difference between the number of observations and the number of parameters estimated. For the null deviance, it is 18015 (number of observations) minus 1 (intercept). For the residual deviance, it is 18015 minus the number of parameters in the model.

Overall, the results suggest that the model does not provide meaningful perceptions into the relationship between the predictors and the binary outcome.

AIC (Akaike Information Criterion)

• Model 1: AIC = 4.001003 • Model 2: AIC = 14.001003 • Model 3: AIC = 16.001003

Explanation

• Model 1 has the lowest AIC value (4.001003), indicating that it is the best model among the three in terms of balancing model fit and complexity. Lower AIC values are preferred because they suggest a better trade-off between goodness of fit and model complexity. • Model 2 has a higher AIC value (14.001003) compared to Model 1, suggesting that adding Indicator.ID and Geo.Type.Name does not improve the model enough to justify the increased complexity. • Model 3 has the highest AIC value (16.001003), indicating that adding Geo.Join.ID further increases the model complexity without providing any significant improvement in fit. Based on the AIC values, Model 1 is the preferred model because it has the lowest AIC, representing the best balance between fit and complexity. This aligns with the earlier interpretation that adding more variables does not significantly improve the model.

BIC (Bayesian Information Criterion)

• Model 1: BIC = 19.59903 • Model 2: BIC = 68.59411 • Model 3: BIC = 78.39313

Explanation

• Model 1 has the lowest BIC value (19.59903), representing that it is the best model among the three in terms of balancing model fit and complexity. Lower BIC values are preferred because they suggest a better trade-off between goodness of fit and model complexity. • Model 2 has a higher BIC value (68.59411) compared to Model 1, suggesting that adding Indicator.ID and Geo.Type.Name does not improve the model enough to justify the increased complexity. • Model 3 has the highest BIC value (78.39313), indicating that adding Geo.Join.ID further increases the model complexity without providing a significant improvement in fit.

Both AIC and BIC suggest that Model 1 is the best model. It is the simplest model and provides the best balance between fit and complexity. The additional variables in Models 2 and 3 do not remarkably improve the model’s performance.