Analyzing Dichotomous Dependent Variables

In this tutorial, we are going to use the binomial distribution to simulate a dichotomous dependent variable, y, with a known relationship to an independent variable, x. We will then estimate the relationship using OLS, logistic, and probit regressions and compare the results. Along the way, we will talk about the differences in the modeling approaches.

# Create X as a sequence from 1 to 11
X <- 1:11

# Define the probabilities for Y = 1 at each X value, with a steeper relationship in the middle and flatter at the edges
probabilities <- c(0.05, 0.075, 0.125, 0.2, 0.3, 0.5, 0.7, 0.8, 0.875, 0.925, 0.95)


# Simulate data with binomial distribution for Y
n_sims <- 1000  # Number of simulated draws at each x value (renamed to avoid confusion with the sample() function)

# Create an empty data frame to store the results
simulated_data <- data.frame(x = numeric(),
                             y = numeric(),
                             stringsAsFactors = FALSE)

# Simulate Y = 1 probabilistically for each X value on each iteration
for (i in 1:n_sims) {
  # Sample Y = 1 probabilistically at each X value
  simulated_Y <- sapply(probabilities, function(p) sample(c(0, 1), size = 1, prob = c(1 - p, p))) # Probabilistically draw 1 or 0 at the specified probability level
  
  # Create a data frame for the current iteration
  iteration_data <- data.frame(x = X, y = simulated_Y)
  
  # Append the current iteration data to the main data frame
  simulated_data <- rbind(simulated_data, iteration_data)
}

skim(simulated_data) # Summarize the simulated data
Data summary
Name simulated_data
Number of rows 11000
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
x 0 1 6.0 3.16 1 3 6 9 11 ▇▅▅▅▅
y 0 1 0.5 0.50 0 0 1 1 1 ▇▁▁▁▇

Reviewing Simulated Data

Using group_by from the tidyverse package, we check the probability that Y = 1 at each X level to see if the simulation worked as hoped. Results show that it did, with P(Y=1|X) roughly following the population parameters stipulated earlier.
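
The chunk that produces this check is not shown in the rendered output; a minimal version (assuming the tidyverse is already loaded, and mirroring the group_by/summarize call used again later in this tutorial) looks like this:

simulated_data %>%
  group_by(x) %>%
  summarize(mean_y = mean(y))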

## # A tibble: 11 × 2
##        x mean_y
##    <int>  <dbl>
##  1     1  0.057
##  2     2  0.073
##  3     3  0.136
##  4     4  0.211
##  5     5  0.292
##  6     6  0.506
##  7     7  0.708
##  8     8  0.798
##  9     9  0.867
## 10    10  0.915
## 11    11  0.939

Estimating Regression Models

With our simulated data in hand, we'll now estimate three separate regressions, changing the GLM family and link for each:

  • Logistic
  • Probit
  • OLS

Two of the models, logistic and probit, are designed for dichotomous DVs, so they should fit our simulated data better than the OLS estimator, which assumes a normally distributed DV. We use glm to estimate all three models, including the OLS, to more easily compare model fit across estimators. Both the logit and probit use family=binomial in the code but with different link functions to account for the different calculations. The OLS GLM model uses the Gaussian distribution family with the standard identity link function.

Estimating & Interpreting the Models

We estimate each of the three models, then compare the results using stargazer.

logit <- glm(y ~ x , data = simulated_data, family = binomial(link = "logit"))
summary(logit)
## 
## Call:
## glm(formula = y ~ x, family = binomial(link = "logit"), data = simulated_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5118  -0.7122   0.2953   0.7114   2.5107  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.73012    0.07090  -52.61   <2e-16 ***
## x            0.62191    0.01101   56.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15249.2  on 10999  degrees of freedom
## Residual deviance:  9465.7  on 10998  degrees of freedom
## AIC: 9469.7
## 
## Number of Fisher Scoring iterations: 5
probit <- glm(y ~ x , data = simulated_data, family = binomial(link = "probit"))
summary(probit)
## 
## Call:
## glm(formula = y ~ x, family = binomial(link = "probit"), data = simulated_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5638  -0.7368   0.2760   0.7375   2.5649  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.139751   0.036817  -58.12   <2e-16 ***
## x            0.356519   0.005632   63.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15249.2  on 10999  degrees of freedom
## Residual deviance:  9497.7  on 10998  degrees of freedom
## AIC: 9501.7
## 
## Number of Fisher Scoring iterations: 4
ols <- glm(y ~ x, data = simulated_data, family = gaussian(link = "identity"))
summary(ols)
## 
## Call:
## glm(formula = y ~ x, family = gaussian(link = "identity"), data = simulated_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.02568  -0.28998   0.02532   0.28962   1.02532  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.130418   0.007640  -17.07   <2e-16 ***
## x            0.105100   0.001126   93.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1395652)
## 
##     Null deviance: 2750.0  on 10999  degrees of freedom
## Residual deviance: 1534.9  on 10998  degrees of freedom
## AIC: 9559.2
## 
## Number of Fisher Scoring iterations: 2
stargazer(ols, logit, probit, type="text", digits=2)
## 
## ===============================================
##                        Dependent variable:     
##                   -----------------------------
##                                 y              
##                    normal   logistic   probit  
##                      (1)       (2)       (3)   
## -----------------------------------------------
## x                  0.11***   0.62***   0.36*** 
##                    (0.001)   (0.01)    (0.01)  
##                                                
## Constant          -0.13***  -3.73***  -2.14*** 
##                    (0.01)    (0.07)    (0.04)  
##                                                
## -----------------------------------------------
## Observations       11,000    11,000    11,000  
## Log Likelihood    -4,777.60 -4,732.87 -4,748.86
## Akaike Inf. Crit. 9,559.19  9,469.74  9,501.72 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Results show that x is a significant predictor of y across all three model estimation strategies, which it should be given the data generating process that created this set of data. However, the coefficients mean something different in each of the three models. For the OLS model, the coefficient of .11 reflects a linear estimate of how x is related to y: every 1-unit increase in x, regardless of where on the x scale you are, results in a .11, or 11 percentage point, increase in the probability of Y = 1. The OLS model assumes a linear relationship and produces the coefficient that best describes the linear relationship between X and Y.
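
As a quick check (not part of the original code), predicting y at two adjacent x values and differencing shows that the OLS slope is the same wherever you evaluate it:

# The OLS (identity link) change in predicted y is ~.105 everywhere on the x scale
diff(predict(ols, newdata = data.frame(x = c(1, 2))))
diff(predict(ols, newdata = data.frame(x = c(5, 6))))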

The coefficients in the logit and probit models do not assume linearity, meaning that the impact of X on Y varies depending on where on the X scale you are. For X values near the middle of the scale, the relationship is strongest, so a 1-unit increase in X produces the largest change in Y. For X values near the maximum or minimum, the relationship between X and Y is weaker, so a 1-unit increase in X produces a smaller change in Y.
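
To see this concretely (a quick sketch using the fitted coefficients, not part of the original code), compare the logit-implied change in P(Y = 1) for a 1-unit move at the edge of the x scale versus the middle:

# Logit-implied change in probability for x moving 1 -> 2 versus 5 -> 6
b <- coef(logit)
diff(plogis(b[[1]] + b[[2]] * c(1, 2)))  # small change at the edge (~.03)
diff(plogis(b[[1]] + b[[2]] * c(5, 6)))  # much larger change in the middle (~.15)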

For the logit model, the coefficient values represent changes in the logged odds. This is important to know so that you do not immediately interpret the coefficient values as if they were OLS coefficients. We can easily read the direction and significance of the relationship from the logged odds, but to understand the substantive impact we should convert to predicted probabilities, which we will do below. The logit coefficient of .62 means that, at its steepest slope, a one unit increase in x corresponds to a .62 increase in the logged odds of y. Note the phrase "at its steepest slope", which means that the relationship between X and Y will vary depending on where on the x-scale it is evaluated.
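
One common way to make the logged-odds scale more tangible (a side calculation, not part of the original code) is to exponentiate the coefficient into an odds ratio:

# Exponentiating the logit coefficient converts logged odds to an odds ratio
exp(coef(logit)[["x"]])  # ~1.86: each 1-unit increase in x multiplies the odds of y = 1 by about 1.86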

For the probit model, the coefficient values represent the change in the z-score for each 1-unit increase in x. In the same manner as logit, we can easily interpret the direction and significance of the relationship between x and y in the probit model, but not the substantive impact. The .36 probit coefficient means that, at its steepest slope, a one unit increase in x corresponds to a .36 increase in the z-score. To make the interpretation easier, we should also convert the probit results to predicted probabilities, just as with logit.
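
The probit analogue (again a side calculation, not part of the original code) runs the linear predictor through pnorm, the standard normal CDF, since the probit linear predictor is a z-score:

# Convert the probit linear predictor (a z-score) into a probability at x = 6
b <- coef(probit)
pnorm(b[[1]] + b[[2]] * 6)  # roughly .50 at the midpoint of the x scale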

Evaluating Model Fit Statistics

Since all three models are estimated using glm on the exact same data, we can evaluate which of the three estimators fits this set of data best. We will use the AIC since we are comparing across estimators. You could also apply the glm_fit function we reviewed earlier in the semester for additional fit metrics. Remember, for AIC a lower value indicates a better fitting model.
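
If you want the AIC values side by side without the full stargazer table, base R's AIC function accepts several fitted models at once:

# Compare AIC across the three glm fits; lower values indicate better fit
AIC(ols, logit, probit)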

With this set of data, the logistic model fits best, as it has the lowest AIC value. Probit performs slightly worse than logit here but better than OLS. This is to be expected since the logit and probit models are designed to work with a binary DV, whereas OLS is designed for a normally distributed DV.

Next, let’s calculate the predicted probability/value of Y at each value of X across the three estimators and compare how well the predictions fit the underlying data generating process.

Calculating Predicted Value/Probabilities by Model Type

Using ggpredict, we save the predicted values and probabilities from each of the three models into new data frames and then combine them by the x values. We then append the observed probability that Y = 1 at each value of X from the simulated data to check fit.

#Creates new data frames and saves individual predicted values/probabilities into them
pv_ols<- ggpredict(ols, terms=("x"))
pv_logit<- ggpredict(logit, terms=("x"))
pv_probit<- ggpredict(probit, terms=("x"))

##Combine predicted values into new tibble
# Rename columns in each tibble
pv_ols <- pv_ols %>%
  rename_with(~ paste0(., "_ols"), -x)

pv_logit <- pv_logit %>%
  rename_with(~ paste0(., "_log"), -x)

pv_probit <- pv_probit %>%
  rename_with(~ paste0(., "_pro"), -x)

# Combine the tibbles by the 'x' column
combined_pv <- pv_ols %>%
  left_join(pv_logit, by = "x") %>%
  left_join(pv_probit, by = "x")

# View the combined result
head(combined_pv)
## # Predicted values of y
## 
## x | predicted_ols | std.error_ols | conf.low_ols | conf.high_ols | group_ols | predicted_log | std.error_log | conf.low_log | conf.high_log | group_log | predicted_pro | std.error_pro | conf.low_pro | conf.high_pro | group_pro
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 |         -0.03 |          0.01 |        -0.04 |         -0.01 |         1 |          0.04 |          0.06 |         0.04 |          0.05 |         1 |          0.04 |          0.03 |         0.03 |          0.04 |         1
## 2 |          0.08 |          0.01 |         0.07 |          0.09 |         1 |          0.08 |          0.05 |         0.07 |          0.08 |         1 |          0.08 |          0.03 |         0.07 |          0.08 |         1
## 3 |          0.18 |          0.00 |         0.18 |          0.19 |         1 |          0.13 |          0.04 |         0.12 |          0.14 |         1 |          0.14 |          0.02 |         0.13 |          0.15 |         1
## 4 |          0.29 |          0.00 |         0.28 |          0.30 |         1 |          0.22 |          0.03 |         0.21 |          0.24 |         1 |          0.24 |          0.02 |         0.23 |          0.25 |         1
## 5 |          0.40 |          0.00 |         0.39 |          0.40 |         1 |          0.35 |          0.03 |         0.34 |          0.36 |         1 |          0.36 |          0.02 |         0.35 |          0.37 |         1
## 6 |          0.50 |          0.00 |         0.49 |          0.51 |         1 |          0.50 |          0.03 |         0.49 |          0.51 |         1 |          0.50 |          0.01 |         0.49 |          0.51 |         1
#Calculates the observed mean of y at each x, then merges it with the predicted values
obs_means <- simulated_data %>%
  group_by(x) %>%
  summarize(mean_y = mean(y))
new_data_combo <- merge(combined_pv, obs_means, by = "x")

#Calculates the change in predicted y at each 1 unit increase in x
log_delta <- diff(new_data_combo$predicted_log)
prob_delta <- diff(new_data_combo$predicted_pro)
ols_delta <- diff(new_data_combo$predicted_ols)
act_delta <- diff(new_data_combo$mean_y)

print(new_data_combo)
##     x predicted_ols std.error_ols conf.low_ols conf.high_ols group_ols
## 1   1   -0.02531818   0.006663867  -0.03838056   -0.01225581         1
## 2   2    0.07978182   0.005743531   0.06852347    0.09104017         1
## 3   3    0.18488182   0.004909860   0.17525761    0.19450603         1
## 4   4    0.28998182   0.004214600   0.28172045    0.29824319         1
## 5   5    0.39508182   0.003735843   0.38775889    0.40240474         1
## 6   6    0.50018182   0.003561987   0.49319968    0.50716395         1
## 7   7    0.60528182   0.003735843   0.59795889    0.61260474         1
## 8   8    0.71038182   0.004214600   0.70212045    0.71864319         1
## 9   9    0.81548182   0.004909860   0.80585761    0.82510603         1
## 10 10    0.92058182   0.005743531   0.90932347    0.93184017         1
## 11 11    1.02568182   0.006663867   1.01261944    1.03874419         1
##    predicted_log std.error_log conf.low_log conf.high_log group_log
## 1     0.04276965    0.06077370   0.03815012    0.04792069         1
## 2     0.07682371    0.05101244   0.07002604    0.08422150         1
## 3     0.13419119    0.04187683   0.12493817    0.14401672         1
## 4     0.22400284    0.03387685   0.21267306    0.23575546         1
## 5     0.34964915    0.02800329   0.33727372    0.36223047         1
## 6     0.50033165    0.02575462   0.48771459    0.51294829         1
## 7     0.65095394    0.02801765   0.63837575    0.66332535         1
## 8     0.77645802    0.03390058   0.76471388    0.78777864         1
## 9     0.86611677    0.04190563   0.85630310    0.87535778         1
## 10    0.92336425    0.05104397   0.91597816    0.93015056         1
## 11    0.95733884    0.06080678   0.95219727    0.96194949         1
##    predicted_pro std.error_pro conf.low_pro conf.high_pro group_pro mean_y
## 1     0.03727423    0.03172478   0.03248882    0.04262059         1  0.057
## 2     0.07683118    0.02684937   0.06952504    0.08470685         1  0.073
## 3     0.14226570    0.02233293   0.13264666    0.15234598         1  0.136
## 4     0.23771355    0.01844115   0.22668135    0.24903401         1  0.211
## 5     0.36048672    0.01564727   0.34907233    0.37202682         1  0.292
## 6     0.49974487    0.01459598   0.48833376    0.51115620         1  0.506
## 7     0.63903445    0.01564267   0.62749272    0.65045093         1  0.708
## 8     0.76189073    0.01843335   0.75056507    0.77292865         1  0.798
## 9     0.85744631    0.02232327   0.84735698    0.86707472         1  0.867
## 10    0.92298424    0.02683864   0.91509773    0.93030116         1  0.915
## 11    0.96262159    0.03171344   0.95726527    0.96741650         1  0.939
# Create a new data frame with the differences
differences <- data.frame(y_delta = act_delta, log_delta = log_delta, pro_delta = prob_delta, ols_delta=ols_delta)
differences$x<-seq(2,11,1)
print(differences)
##    y_delta  log_delta  pro_delta ols_delta  x
## 1    0.016 0.03405406 0.03955696    0.1051  2
## 2    0.063 0.05736748 0.06543452    0.1051  3
## 3    0.075 0.08981165 0.09544785    0.1051  4
## 4    0.081 0.12564631 0.12277317    0.1051  5
## 5    0.214 0.15068250 0.13925815    0.1051  6
## 6    0.202 0.15062229 0.13928957    0.1051  7
## 7    0.090 0.12550408 0.12285629    0.1051  8
## 8    0.069 0.08965875 0.09555558    0.1051  9
## 9    0.048 0.05724748 0.06553793    0.1051 10
## 10   0.024 0.03397459 0.03963735    0.1051 11

The first table shows the predicted probability - i.e., P(Y = 1 | X) - for each of the three estimators alongside the observed values from the DGP that we created. Notice that when X = 11 the predicted values of Y for the logit and probit models are close to 1 but never reach it, whereas the predicted value for the OLS estimator is 1.03 (and at X = 1, OLS predicts a negative value, -0.03). These predictions are out of bounds: it is impossible to have a probability greater than 1 or less than 0. This illustrates one reason why running linear probability models is not advised.
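
You can confirm the out-of-bounds predictions directly from the fitted OLS model (a quick check, not part of the original code):

# The linear probability model predicts values outside [0, 1] at the ends of the x scale
predict(ols, newdata = data.frame(x = c(1, 11)))  # about -0.03 and 1.03, matching the table above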

Before we graph the results, examine the table of differences in predicted y for the three models at each 1-unit increase in X. The OLS model, as mentioned, assumes the same change in Y of .105 at every X. The logit, probit, and, most importantly, the true DGP results reveal a non-linear relationship between x and y whereby the change in y at the extremes is much less steep than in the middle of the x scale.

For instance, moving from X = 1 to X = 2 results in a predicted probability increase of .034 for the logit model. While still larger than the observed change of .016, it is much closer than the OLS estimate of .105. The probit result of .040 shows a similar increase. At the middle of the x scale, moving from X = 5 to X = 6 results in an increase of .151 and .139 for the logit and probit estimators respectively, a much steeper estimated relationship than at the extremes. This pattern of results matches the DGP much more closely because both logit and probit estimators are designed to work with the type of distribution that created the dependent variable (the binomial distribution).

Graphing the Results

First, let's graph the logit model's predicted probability of y at each point on the x scale with its confidence interval, and then plot the predictions from all three models alongside the actual value of y at each x.

ggplot(pv_logit, aes(x = x)) +
  geom_line(aes(y = predicted_log, color = "Log"), linewidth = .75) +
   geom_ribbon(aes(ymin = conf.low_log, ymax = conf.high_log), alpha = 0.05) +
  labs(y = "Predicted Value of Y") +  # Update y-axis label here
  theme_minimal()

ggplot(new_data_combo, aes(x = x)) +
  geom_line(aes(y = predicted_ols, color = "OLS"), linewidth = .75, linetype="dashed") +
  geom_line(aes(y = predicted_log, color = "Log"), linewidth = .75) +
  geom_line(aes(y = predicted_pro, color = "Probit"), linewidth = .75) +
  geom_line(aes(y = mean_y, color = "Observed"), linewidth = .75) +
  scale_color_manual(name = "Model",
                     values = c("OLS" = "red", "Log" = "black", "Probit" = "grey", "Observed" = "blue")) +
  labs(y = "Predicted Value of Y") +  # Update y-axis label here
  theme_minimal()

By comparing the predicted value of y from each estimator with the actual mean of y at each x, we see interesting results. First, the blue line is the observed value of y, and it shows the classic s-shaped curve we built into the DGP. The dashed red line for the OLS estimator does not follow that s-shaped curve: it is the classic linear predictor that assumes x influences y at the same rate at all levels of x. The black and grey lines (the logit and probit predicted values of y, respectively) more closely follow the actual value of y. Remember, logit and probit are built on the binomial distribution and do not assume that a 1-unit increase in x has the same impact on y at every x, as linear regression does.

Let's finish by graphing the results saved in the differences data frame using ggplot and comparing the change in predicted values for each estimator visually. This gives us a better view of how well each estimator captured the impact of a 1-unit change in x on y by comparing the differences from each estimator with the true change in y at each level of x.

ggplot(differences, aes(x = x)) +
  geom_line(aes(y = ols_delta, color = "OLS"), linewidth = .75, linetype="dashed") +
  geom_line(aes(y = log_delta, color = "Log"), linewidth = .75) +
  geom_line(aes(y = pro_delta, color = "Probit"), linewidth = .75) +
  geom_line(aes(y = y_delta, color = "Observed"), linewidth = .75) +
  scale_color_manual(name = "Model",
                     values = c("OLS" = "red", "Log" = "black", "Probit" = "grey", "Observed" = "blue")) +
  labs(y = "Predicted Change in PV") +  # Update y-axis label here
  theme_minimal()

This visually reflects the change in predicted probability as you move up the x-scale from 1 to 2, 2 to 3, and so on. The true change in predicted probability is reflected in the blue line, while the black line reflects the change for the logit model, the grey line the change for the probit model, and the red line the change for the OLS estimator. The OLS estimator is a flat line at .105 since OLS assumes a linear relationship. That assumption does not match the true change in Y as reflected by the actual population change.

While the logit and probit models do not perfectly align with the true population change, they are much closer to reflecting the true relationship between x and y than the OLS estimator. This graph is a visual illustration of why we want to model dichotomous dependent variables using either logit or probit instead of OLS.

Evaluating Model Accuracy

*Note that we will evaluate model fit here on the logit results, but the approach is identical if you have estimated a probit model.

Unlike with an OLS estimator, the raw residuals - actual y minus predicted y - do not give us good information with a binary DV. Instead, we create binned residual plots to understand whether we are systematically over- or under-estimating y at different levels of our predictor.

First, let’s calculate and graph traditional residuals to understand why they are not informative.

# Fitted values
fv.logit <- logit$fitted.values  #Gets fitted values at each X

y.logit <- logit$y # The dependent variable (i.e., actual values)

# Residual plot
plot(fv.logit, y.logit - fv.logit, pch = 19)
abline(h = 0, lwd = 2, lty = 3)

Since Y is binary, the traditional residuals are not informative: at each fitted value the residual can only take one of two values (0 minus the fitted probability or 1 minus the fitted probability), so the plot shows two parallel bands rather than a meaningful spread. What we really want to know is whether we consistently over- or under-predict y at each level of x.

For this, we will use binned residual plots. Binned residuals divide the data into categories (bins), typically based on their fitted values (here, based on x, since it is the only predictor), then plot the average residual for each bin.

# Create bins based on the range of x
simulated_data$predicted_probs<-predict(logit, type="response")
simulated_data$residuals <- simulated_data$y - simulated_data$predicted_probs
simulated_data$bins <- cut(simulated_data$x, breaks = c(0:12))

# Summarize the residuals for each bin
binned_data <- simulated_data %>%
  group_by(bins) %>%
  summarize(
    mean_residual = mean(residuals),
    ci_lower = mean_residual - 1.96 * sd(residuals) / sqrt(n()),
    ci_upper = mean_residual + 1.96 * sd(residuals) / sqrt(n())
  )

#Create a plot to visualize the mean residuals and their CIs
ggplot(binned_data, aes(x = bins, y = mean_residual)) +
  geom_point(color = "black", size = 3) +      # Dot graph with points
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2, position = position_dodge(width = 0.5)) +  # CIs
  labs(x = "Bins of x", y = "Avg Binned Residual", title = "Binned Residuals with 95% CI") +
  theme_minimal() + geom_hline(yintercept=0, linetype="dashed",color="red")

Above is the binned residual plot along with the 95% CI for the average residual in each bin. Generally, we are looking for any systematic pattern in the misses that would suggest a misspecified model - omitted variable bias, etc. - or an IV in need of transformation.

Here, we see some significant misses - at x = 5 and x = 7 - but in general there is no systematic pattern to our misses. Remember the predicted value of y graph we reviewed earlier, which showed that we were least accurate at predicting y at those values. Binned residuals are just another way to examine average accuracy at different levels of x. Overall, this plot suggests that our model is performing well and that we do not need to add additional predictors or transform our IV, which is what we expect since we created the data with known parameters.

Next, we will examine the classification accuracy for the logit results; once again, the code works the same on the estimated probit model. Classification accuracy tells us how much better our predictions are than simply guessing 0 or 1, whichever is more prevalent in the data. In our simulated data, the mean of y is roughly .50 - essentially a coin flip - so always guessing the more common value would make us correct only about 50% of the time. Hopefully, our model can improve our prediction accuracy over simply guessing the most common outcome.

Calculating classification accuracy is fairly straightforward. First, we establish the null percent correct - the accuracy of simply guessing the more common outcome - which is roughly 50% for our simulated data. Then we calculate the accuracy of the predicted y values to see if we improve on that baseline. We should, since we created the DGP for this set of data.

#Create new vector with 0|1 predictions from the model
predicted_class <- ifelse(simulated_data$predicted_probs >= 0.5, 1, 0)

# Compare the predicted binary outcomes to the actual y 
actual_class <- simulated_data$y

# Calculate the classification accuracy
accuracy <- mean(predicted_class == actual_class)
print(accuracy)
## [1] 0.8149091
# Calculate the classification accuracy improvement
accuracy_improve <- accuracy-mean(simulated_data$y)
print(accuracy_improve)
## [1] 0.3147273

Looking at the classification accuracy results, we see a value of .815, which means we correctly predicted the value of y 81.5% of the time. Since the naive guessing approach would have been accurate only about 50% of the time, we also calculate the improvement in accuracy, a roughly 31 percentage point increase. This value means that adding x to our logit model predicting y improved our accuracy by about 31 percentage points over guessing the mean. That is an impressive improvement and one that you are not likely to see in real-world data.

We can also calculate something called the confusion matrix using the caret package. While the name sounds confusing, it essentially calculates the same values as above, along with a variety of other metrics.

library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Convert to factor for comparison
predicted_classes <- factor(predicted_class, levels = c(0, 1))
actual_classes <- factor(actual_class, levels = c(0, 1))

# Calculate confusion matrix
confusion_matrix <- confusionMatrix(predicted_classes, actual_classes)
confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4231  769
##          1 1267 4733
##                                           
##                Accuracy : 0.8149          
##                  95% CI : (0.8075, 0.8221)
##     No Information Rate : 0.5002          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6298          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7696          
##             Specificity : 0.8602          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.7888          
##              Prevalence : 0.4998          
##          Detection Rate : 0.3846          
##    Detection Prevalence : 0.4545          
##       Balanced Accuracy : 0.8149          
##                                           
##        'Positive' Class : 0               
## 

Notice the reported accuracy rate of .8149 and the No Information Rate of .5002. Those are the values we manually calculated in the previous code chunk. The other reported statistics - sensitivity, specificity, Kappa, and so on - provide additional ways to evaluate classification performance.

Conclusions

In this tutorial, we reviewed how to analyze a binary DV. We showed that using a linear probability model results in a worse-fitting model with predictions outside the bounds of reality. This was to be expected since OLS is an estimator for a normally distributed DV, not a binary one.

We also reviewed how to estimate logit and probit models, evaluate the results, translate the coefficients into predicted probabilities, graph predicted probabilities with confidence intervals, and finally how to evaluate the model fit.