In this tutorial, we are going to use the binomial distribution to simulate a dichotomous dependent variable, y, with a known relationship to an independent variable, x. We will then estimate the relationship using OLS, logistic, and probit regressions and compare the results. Along the way, we will discuss the differences in the modeling approaches.
# Create X as a sequence from 1 to 11
X <- 1:11

# Define the probabilities for Y = 1 at each X value with steeper relationship in middle and flatter at edges
probabilities <- c(0.05, 0.075, 0.125, 0.2, 0.3, 0.5, 0.7, 0.8, 0.875, 0.925, 0.95)

# Simulate data with binomial distribution for Y
sample <- 1000 # Number of trials for each x
# Create an empty data frame to store the results
simulated_data <- data.frame(X = numeric(),
                             Y = numeric(),
                             stringsAsFactors = FALSE)

# Simulate Y = 1 probabilistically for each X value on each iteration
for (i in 1:sample) {
  # Sample Y = 1 probabilistically at each X value
  simulated_Y <- sapply(probabilities, function(p) sample(c(0, 1), size = 1, prob = c(1 - p, p))) # Probabilistically draw 1 | 0 at the specified probability level

  # Create a data frame for the current iteration
  iteration_data <- data.frame(x = X, y = simulated_Y)

  # Append the current iteration data to the main data frame
  simulated_data <- rbind(simulated_data, iteration_data)
}
skim(simulated_data) #Summarize the simulated data
Name | simulated_data |
Number of rows | 11000 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
x | 0 | 1 | 6.0 | 3.16 | 1 | 3 | 6 | 9 | 11 | ▇▅▅▅▅ |
y | 0 | 1 | 0.5 | 0.50 | 0 | 0 | 1 | 1 | 1 | ▇▁▁▁▇ |
Using group_by from the tidyverse package, we check the probability that Y = 1 at each X level to see if the simulation worked as hoped. Results show that it did, with P(Y = 1 | X) roughly following the population parameters stipulated earlier.
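For reference, the check is just a grouped summary of the simulated data (assuming the tidyverse is already loaded, as it is used throughout):

# Check the observed proportion of Y = 1 at each value of X
simulated_data %>%
  group_by(x) %>%
  summarize(mean_y = mean(y))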
## # A tibble: 11 × 2
## x mean_y
## <int> <dbl>
## 1 1 0.057
## 2 2 0.073
## 3 3 0.136
## 4 4 0.211
## 5 5 0.292
## 6 6 0.506
## 7 7 0.708
## 8 8 0.798
## 9 9 0.867
## 10 10 0.915
## 11 11 0.939
With our simulated data in hand, we'll now estimate three separate regressions, changing the GLM family and link each time: a logistic (logit) model, a probit model, and an OLS (Gaussian) model.
Two of the models, logistic and probit, are designed for dichotomous DVs, so they should fit our simulated data better than the OLS estimator, which assumes a normally distributed DV. We use glm to estimate all three models, including the OLS, to more easily compare model fit across estimators. Both the logit and probit use family = binomial in the code but with different link functions to account for the different calculations. The OLS GLM model uses the gaussian distribution family with the standard identity link function.

We estimate each of the three models, then compare the results using stargazer.
logit <- glm(y ~ x, data = simulated_data, family = binomial(link = "logit"))
summary(logit)
##
## Call:
## glm(formula = y ~ x, family = binomial(link = "logit"), data = simulated_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5118 -0.7122 0.2953 0.7114 2.5107
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.73012 0.07090 -52.61 <2e-16 ***
## x 0.62191 0.01101 56.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15249.2 on 10999 degrees of freedom
## Residual deviance: 9465.7 on 10998 degrees of freedom
## AIC: 9469.7
##
## Number of Fisher Scoring iterations: 5
probit <- glm(y ~ x, data = simulated_data, family = binomial(link = "probit"))
summary(probit)
##
## Call:
## glm(formula = y ~ x, family = binomial(link = "probit"), data = simulated_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5638 -0.7368 0.2760 0.7375 2.5649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.139751 0.036817 -58.12 <2e-16 ***
## x 0.356519 0.005632 63.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15249.2 on 10999 degrees of freedom
## Residual deviance: 9497.7 on 10998 degrees of freedom
## AIC: 9501.7
##
## Number of Fisher Scoring iterations: 4
ols <- glm(y ~ x, data = simulated_data, family = gaussian(link = "identity"))
summary(ols)
##
## Call:
## glm(formula = y ~ x, family = gaussian(link = "identity"), data = simulated_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.02568 -0.28998 0.02532 0.28962 1.02532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.130418 0.007640 -17.07 <2e-16 ***
## x 0.105100 0.001126 93.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1395652)
##
## Null deviance: 2750.0 on 10999 degrees of freedom
## Residual deviance: 1534.9 on 10998 degrees of freedom
## AIC: 9559.2
##
## Number of Fisher Scoring iterations: 2
stargazer(ols, logit, probit, type="text", digits=2)
##
## ===============================================
## Dependent variable:
## -----------------------------
## y
## normal logistic probit
## (1) (2) (3)
## -----------------------------------------------
## x 0.11*** 0.62*** 0.36***
## (0.001) (0.01) (0.01)
##
## Constant -0.13*** -3.73*** -2.14***
## (0.01) (0.07) (0.04)
##
## -----------------------------------------------
## Observations 11,000 11,000 11,000
## Log Likelihood -4,777.60 -4,732.87 -4,748.86
## Akaike Inf. Crit. 9,559.19 9,469.74 9,501.72
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Results show that x is a significant predictor of y across all three estimation strategies, which it should be given the data generating process that created this set of data. However, the coefficients mean something different in each of the three models. For the OLS model, the coefficient of .11 reflects a linear estimate of how x is related to y: every 1 unit increase in x, regardless of where on the x scale you are, is associated with a .11, or 11 percentage point, increase in the probability that Y = 1. The OLS model assumes a linear relationship and produces the coefficient that best describes the linear relationship between X and Y.
The coefficients in the logit and probit models do not assume linearity, meaning that the impact of X on Y varies depending on where on the X scale you are. For X values closer to the middle of the scale, the relationship is stronger, so a 1 unit increase in X produces the largest change in Y. For X values closer to either the maximum or minimum, the relationship between X and Y is weaker, so a 1 unit increase in X produces a smaller change in Y.
For the logit model, the coefficient values represent logged odds. This is important to know so that you do not immediately interpret the coefficient values as if they were OLS coefficients. We can easily interpret the direction and significance of the relationship from the logged odds, but to understand the substantive impact we should convert to predicted probabilities, which we will do below. The logit coefficient of .62 means that, at its steepest slope, a one unit increase in x is associated with a .62 increase in the logged odds of y. Note the phrase "at its steepest slope," which means that the relationship between X and Y will vary depending on where on the x-scale it is evaluated.
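As a quick side check (separate from the ggpredict workflow used later), we can convert the logit's linear predictor to a probability by hand with plogis(), the inverse logit function:

# By-hand conversion of the logit linear predictor to predicted probabilities
b <- coef(logit)
plogis(b[1] + b[2] * 1)  # roughly .04 at x = 1
plogis(b[1] + b[2] * 6)  # roughly .50 at x = 6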
For the probit model, the coefficient values represent the change in the z-score for each 1-unit increase in x. In the same manner as logit, we can easily interpret the direction and significance of the relationship between x and y in the probit model, but not the substantive impact. The .36 probit coefficient means that, at its steepest slope, a one unit increase in x is associated with a .36 increase in the z-score. To make the interpretation easier, we should also convert probit results to predicted probabilities, just as with logit.
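The analogous by-hand check for the probit model runs the linear predictor through the standard normal CDF, pnorm():

# By-hand conversion of the probit linear predictor to predicted probabilities
b_pro <- coef(probit)
pnorm(b_pro[1] + b_pro[2] * 1)  # roughly .04 at x = 1
pnorm(b_pro[1] + b_pro[2] * 6)  # roughly .50 at x = 6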
Since each of the three models is estimated using glm on the exact same data, we can evaluate which of the three estimators fits this set of data best. We will evaluate the AIC since we are comparing across estimators. You could also apply the glm_fit function we reviewed earlier in the semester for additional fit metrics. Remember that for the AIC, a lower value indicates a better fitting model.
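If you prefer to pull the AIC values directly rather than reading them off the stargazer table, a quick side-by-side check works as well:

# Side-by-side AIC comparison; lower values indicate better fit
AIC(ols, logit, probit)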
With this set of data, the logistic model fits best, as it has the lowest AIC value. Probit performs slightly worse than logit here but better than OLS. This is to be expected, since the logit and probit models are designed to work with a binary DV, whereas OLS is designed for a normally distributed DV.
Next, let’s calculate the predicted probability/value of Y at each value of X across the three estimators and compare how well the predictions fit the underlying data generating process.
Using the ggpredict function from the ggeffects package, we generate the predicted values and probabilities from each of the three models (ggpredict builds a data frame with the appropriate x values for us) and save them into new data frames, one per model. Then we will append the observed probabilities from the simulated data at each value of X to check fit.
#Creates new data frames and saves individual predicted values/probabilities into them
pv_ols <- ggpredict(ols, terms = "x")
pv_logit <- ggpredict(logit, terms = "x")
pv_probit <- ggpredict(probit, terms = "x")

##Combine predicted values into new tibble
# Rename columns in each tibble
pv_ols <- pv_ols %>%
  rename_with(~ paste0(., "_ols"), -x)
pv_logit <- pv_logit %>%
  rename_with(~ paste0(., "_log"), -x)
pv_probit <- pv_probit %>%
  rename_with(~ paste0(., "_pro"), -x)

# Combine the tibbles by the 'x' column
combined_pv <- pv_ols %>%
  left_join(pv_logit, by = "x") %>%
  left_join(pv_probit, by = "x")
# View the combined result
head(combined_pv)
## # Predicted values of y
##
## x | predicted_ols | std.error_ols | conf.low_ols | conf.high_ols | group_ols | predicted_log | std.error_log | conf.low_log | conf.high_log | group_log | predicted_pro | std.error_pro | conf.low_pro | conf.high_pro | group_pro
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 | -0.03 | 0.01 | -0.04 | -0.01 | 1 | 0.04 | 0.06 | 0.04 | 0.05 | 1 | 0.04 | 0.03 | 0.03 | 0.04 | 1
## 2 | 0.08 | 0.01 | 0.07 | 0.09 | 1 | 0.08 | 0.05 | 0.07 | 0.08 | 1 | 0.08 | 0.03 | 0.07 | 0.08 | 1
## 3 | 0.18 | 0.00 | 0.18 | 0.19 | 1 | 0.13 | 0.04 | 0.12 | 0.14 | 1 | 0.14 | 0.02 | 0.13 | 0.15 | 1
## 4 | 0.29 | 0.00 | 0.28 | 0.30 | 1 | 0.22 | 0.03 | 0.21 | 0.24 | 1 | 0.24 | 0.02 | 0.23 | 0.25 | 1
## 5 | 0.40 | 0.00 | 0.39 | 0.40 | 1 | 0.35 | 0.03 | 0.34 | 0.36 | 1 | 0.36 | 0.02 | 0.35 | 0.37 | 1
## 6 | 0.50 | 0.00 | 0.49 | 0.51 | 1 | 0.50 | 0.03 | 0.49 | 0.51 | 1 | 0.50 | 0.01 | 0.49 | 0.51 | 1
#Calculates the mean observed y at each x, then merges with the predicted values
c <- simulated_data %>%
  group_by(x) %>%
  summarize(mean_y = mean(y))

new_data_combo <- merge(combined_pv, c, by = "x")

#Calculates the change in predicted y at each 1 unit increase in x
log_delta <- diff(new_data_combo$predicted_log)
prob_delta <- diff(new_data_combo$predicted_pro)
ols_delta <- diff(new_data_combo$predicted_ols)
act_delta <- diff(new_data_combo$mean_y)

print(new_data_combo)
## x predicted_ols std.error_ols conf.low_ols conf.high_ols group_ols
## 1 1 -0.02531818 0.006663867 -0.03838056 -0.01225581 1
## 2 2 0.07978182 0.005743531 0.06852347 0.09104017 1
## 3 3 0.18488182 0.004909860 0.17525761 0.19450603 1
## 4 4 0.28998182 0.004214600 0.28172045 0.29824319 1
## 5 5 0.39508182 0.003735843 0.38775889 0.40240474 1
## 6 6 0.50018182 0.003561987 0.49319968 0.50716395 1
## 7 7 0.60528182 0.003735843 0.59795889 0.61260474 1
## 8 8 0.71038182 0.004214600 0.70212045 0.71864319 1
## 9 9 0.81548182 0.004909860 0.80585761 0.82510603 1
## 10 10 0.92058182 0.005743531 0.90932347 0.93184017 1
## 11 11 1.02568182 0.006663867 1.01261944 1.03874419 1
## predicted_log std.error_log conf.low_log conf.high_log group_log
## 1 0.04276965 0.06077370 0.03815012 0.04792069 1
## 2 0.07682371 0.05101244 0.07002604 0.08422150 1
## 3 0.13419119 0.04187683 0.12493817 0.14401672 1
## 4 0.22400284 0.03387685 0.21267306 0.23575546 1
## 5 0.34964915 0.02800329 0.33727372 0.36223047 1
## 6 0.50033165 0.02575462 0.48771459 0.51294829 1
## 7 0.65095394 0.02801765 0.63837575 0.66332535 1
## 8 0.77645802 0.03390058 0.76471388 0.78777864 1
## 9 0.86611677 0.04190563 0.85630310 0.87535778 1
## 10 0.92336425 0.05104397 0.91597816 0.93015056 1
## 11 0.95733884 0.06080678 0.95219727 0.96194949 1
## predicted_pro std.error_pro conf.low_pro conf.high_pro group_pro mean_y
## 1 0.03727423 0.03172478 0.03248882 0.04262059 1 0.057
## 2 0.07683118 0.02684937 0.06952504 0.08470685 1 0.073
## 3 0.14226570 0.02233293 0.13264666 0.15234598 1 0.136
## 4 0.23771355 0.01844115 0.22668135 0.24903401 1 0.211
## 5 0.36048672 0.01564727 0.34907233 0.37202682 1 0.292
## 6 0.49974487 0.01459598 0.48833376 0.51115620 1 0.506
## 7 0.63903445 0.01564267 0.62749272 0.65045093 1 0.708
## 8 0.76189073 0.01843335 0.75056507 0.77292865 1 0.798
## 9 0.85744631 0.02232327 0.84735698 0.86707472 1 0.867
## 10 0.92298424 0.02683864 0.91509773 0.93030116 1 0.915
## 11 0.96262159 0.03171344 0.95726527 0.96741650 1 0.939
# Create a new data frame with the differences
differences <- data.frame(y_delta = act_delta, log_delta = log_delta, pro_delta = prob_delta, ols_delta = ols_delta)
differences$x <- seq(2, 11, 1)
print(differences)
## y_delta log_delta pro_delta ols_delta x
## 1 0.016 0.03405406 0.03955696 0.1051 2
## 2 0.063 0.05736748 0.06543452 0.1051 3
## 3 0.075 0.08981165 0.09544785 0.1051 4
## 4 0.081 0.12564631 0.12277317 0.1051 5
## 5 0.214 0.15068250 0.13925815 0.1051 6
## 6 0.202 0.15062229 0.13928957 0.1051 7
## 7 0.090 0.12550408 0.12285629 0.1051 8
## 8 0.069 0.08965875 0.09555558 0.1051 9
## 9 0.048 0.05724748 0.06553793 0.1051 10
## 10 0.024 0.03397459 0.03963735 0.1051 11
The first table shows the predicted probability, i.e. P(Y = 1 | X), for each of the three estimators alongside the observed values from the DGP we created. Notice that when X = 11 the predicted values of Y for the logit and probit models are close to 1 but not quite there, whereas the predicted value from the OLS estimator is 1.03 (and at X = 1 the OLS prediction is negative, at -.03). Such predictions are out of bounds: a probability cannot be greater than 1 or less than 0. This illustrates one reason why running linear probability models is not advised.
Before we graph the results, examine the table with the differences in predicted y for the three models at each 1 unit increase in X. The OLS model, as mentioned, assumes the same change in Y of .105 at each X. The logit, probit, and, most importantly, the observed DGP results reveal a non-linear relationship between x and y whereby the change in y at the extremes is much less steep than in the middle of the x scale.
For instance, moving from X = 1 to X = 2 results in a predicted probability increase of about .034 in the logit estimate. While still larger than the observed change of .016, that is much closer than the OLS estimate of .105. The probit estimate shows a similar increase of about .040. At the middle of the x scale, moving from X = 5 to X = 6 results in increases of .151 and .139 for the logit and probit estimators respectively, a much steeper estimated relationship than at the extremes. This pattern matches the DGP much more closely because both the logit and probit estimators are designed to work with the type of distribution that generated the dependent variable (the binomial distribution).
First, let’s graph the predicted value of y at each x scale-point for the three different models plus the actual value of y at each x.
ggplot(pv_logit, aes(x = x)) +
geom_line(aes(y = predicted_log, color = "Log"), linewidth = .75) +
geom_ribbon(aes(ymin = conf.low_log, ymax = conf.high_log), alpha = 0.05) +
labs(y = "Predicted Value of Y") + # Update y-axis label here
theme_minimal()
ggplot(new_data_combo, aes(x = x)) +
geom_line(aes(y = predicted_ols, color = "OLS"), linewidth = .75, linetype="dashed") +
geom_line(aes(y = predicted_log, color = "Log"), linewidth = .75) +
geom_line(aes(y = predicted_pro, color = "Probit"), linewidth = .75) +
geom_line(aes(y = mean_y, color = "Observed"), linewidth = .75) +
scale_color_manual(name = "Model",
values = c("OLS" = "red", "Log" = "black", "Probit" = "grey", "Observed" = "blue")) +
labs(y = "Predicted Value of Y") + # Update y-axis label here
theme_minimal()
Comparing the predicted value of y from each estimator with the actual mean of y at each x reveals interesting results. First, the blue line is the observed value of y, and it shows the classic S-shaped curve of the binomial distribution. The dashed red line for the OLS estimator does not follow that S-shaped curve; it is the classic linear predictor that assumes x influences y at the same rate at all levels of x. The black and grey lines (the logit and probit predicted values of y, respectively) much more closely follow the actual value of y. Remember, logit and probit are built on the binomial distribution, which does not assume that a 1-unit increase in x has the same impact on y at every x, as linear regression does.
Let's finish by graphing the results saved in the differences data frame using ggplot and comparing the change in predicted values for each estimator visually. This gives us a better visual sense of how well each estimator captured the impact of a 1-unit change in x on y, by comparing the differences from each estimator with the actual change in y at each level of x.
ggplot(differences, aes(x = x)) +
geom_line(aes(y = ols_delta, color = "OLS"), linewidth = .75, linetype="dashed") +
geom_line(aes(y = log_delta, color = "Log"), linewidth = .75) +
geom_line(aes(y = pro_delta, color = "Probit"), linewidth = .75) +
geom_line(aes(y = y_delta, color = "Observed"), linewidth = .75) +
scale_color_manual(name = "Model",
values = c("OLS" = "red", "Log" = "black", "Probit" = "grey", "Observed" = "blue")) +
labs(y = "Predicted Change in PV") + # Update y-axis label here
theme_minimal()
This visually reflects the change in predicted probability as you move up the x-scale from 1 to 2, 2 to 3, and so on. The actual change in predicted probability is reflected in the blue line, while the black line shows the change for the logit model, the grey line the change for the probit model, and the red line the change for the OLS estimator. The OLS estimator is a flat line at .105 since OLS assumes a linear relationship. That assumption does not match the true change in Y as reflected by the actual population change.
While the logit and probit models do not perfectly align with the true population change, they are much closer to reflecting the true relationship between x and y than the OLS estimator. This graph is a visual reflection of why we want to model dichotomous dependent variables using either logit or probit instead of OLS.
*Note that we will evaluate model fit here on the logit results but the approach stays identical if you have estimated a probit model.
Unlike with an OLS estimator, the residuals (actual y minus predicted y) do not give us good information with a binary DV. Instead, we create binned residual plots to understand whether we are systematically over- or under-estimating y at different levels of our predictor.
First, let’s calculate and graph traditional residuals to understand why they are not informative.
# Fitted values
fv.logit <- logit$fitted.values # Gets fitted values at each X
y.logit <- logit$y # The dependent variable (i.e., actual values)
# Residual plot
plot(fv.logit, y.logit - fv.logit, pch = 19)
abline(h = 0, lwd = 2, lty = 3)
Since Y is binary, the traditional residuals are not informative: they are simply the differences between the predicted value at each x and the actual 0/1 value of y at each x. What we really want to know is whether we consistently over- or under-predict y at each level of x.
For this, we will use binned residuals plots. Binned residuals divide the data into categories (bins) based on their fitted values then plot the average residual against the average fitted value by bin.
# Calculate predicted probabilities and residuals, then create bins based on the range of x
simulated_data$predicted_probs <- predict(logit, type = "response")
simulated_data$residuals <- simulated_data$y - simulated_data$predicted_probs
simulated_data$bins <- cut(simulated_data$x, breaks = c(0:12))
# Summarize the residuals for each bin
binned_data <- simulated_data %>%
  group_by(bins) %>%
  summarize(
    mean_residual = mean(residuals),
    ci_lower = mean_residual - 1.96 * sd(residuals) / sqrt(n()),
    ci_upper = mean_residual + 1.96 * sd(residuals) / sqrt(n())
  )
#Create a plot to visualize the mean residuals and their CIs
ggplot(binned_data, aes(x = bins, y = mean_residual)) +
geom_point(color = "black", size = 3) + # Dot graph with points
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2, position = position_dodge(width = 0.5)) + # CIs
labs(x = "Bins of x", y = "Avg Binned Residual", title = "Binned Residuals with 95% CI") +
theme_minimal() + geom_hline(yintercept=0, linetype="dashed",color="red")
Above, you will find the binned residuals plot along with the 95% CI for the average residual in each bin. Generally, we are looking for any systematic pattern in our misses that would suggest a misspecified model (omitted variable bias, etc.) or an IV in need of transformation.
Here, we see some significant misses at x = 5 and x = 7, but in general there is no systematic pattern in our misses. Recall the predicted-value graph we reviewed earlier, which showed that we were least accurate at predicting y at those values. Binned residuals are just another way to assess average accuracy at different levels of x. Overall, this plot suggests that our model is performing well and that we do not need to add additional predictors or transform our IV, which is what we expect since we created the data with known parameters.
Next, we will examine the classification accuracy for the logit results; once again, the code works the same on an estimated probit model. Classification accuracy tells us how much better our predictions are than simply guessing 0 or 1, whichever is more prevalent in the data. In our simulated data the mean of y is .50, essentially a coin flip, so always guessing the more common value would make us correct only about 50% of the time. Hopefully, our model is able to improve on that baseline.
Calculating classification accuracy is fairly straightforward. First, we calculate the null percent correct, i.e. the share of the most common outcome, which was roughly .50 for our simulated data. Then we calculate the accuracy of the predicted y to see if we improve on simply guessing the mean. Note, we should, since we created the DGP for this set of data.
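For reference, the null accuracy can be computed directly as the share of the more common outcome:

# Null accuracy: proportion correct if we always guess the more common outcome
max(mean(simulated_data$y), 1 - mean(simulated_data$y))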
#Create new vector with 0|1 predictions from the model
predicted_class <- ifelse(simulated_data$predicted_probs >= 0.5, 1, 0)

# Compare the predicted binary outcomes to the actual y
actual_class <- simulated_data$y

# Calculate the classification accuracy
accuracy <- mean(predicted_class == actual_class)
print(accuracy)
## [1] 0.8149091
# Calculate the classification accuracy improvement
accuracy_improve <- accuracy - mean(simulated_data$y)
print(accuracy_improve)
## [1] 0.3147273
Looking at the classification accuracy results, we see a value of .815, which means we correctly predicted the value of y 81.5% of the time. Since the naive guessing approach would have made us accurate only about 50% of the time, we also calculate the improvement in accuracy: adding x to our logit model improved our predictions of y by roughly 31 percentage points over guessing the mean. That is an impressive improvement and one that you are not likely to see in real-world data.
We can also calculate something called the confusion matrix using the caret package. While the name is confusing, it essentially calculates the same values as above along with a variety of other metrics.
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Convert to factors for comparison
predicted_classes <- factor(predicted_class, levels = c(0, 1))
actual_classes <- factor(actual_class, levels = c(0, 1))

# Calculate the confusion matrix
confusion_matrix <- confusionMatrix(predicted_classes, actual_classes)
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4231 769
## 1 1267 4733
##
## Accuracy : 0.8149
## 95% CI : (0.8075, 0.8221)
## No Information Rate : 0.5002
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6298
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7696
## Specificity : 0.8602
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.7888
## Prevalence : 0.4998
## Detection Rate : 0.3846
## Detection Prevalence : 0.4545
## Balanced Accuracy : 0.8149
##
## 'Positive' Class : 0
##
Notice the reported accuracy rate of .8149 and the No Information Rate of .5002. Those are the values we manually calculated in the previous code chunk. The other reported statistics, such as sensitivity, specificity, and Kappa, offer additional ways to judge how well the model classifies each outcome.
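If you want to pull individual statistics out of the confusionMatrix object rather than reading them off the printed summary, they are stored in named vectors:

# Extract selected statistics from the caret confusionMatrix object
confusion_matrix$overall["Accuracy"]
confusion_matrix$byClass[c("Sensitivity", "Specificity")]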
In this tutorial, we reviewed how to analyze a binary DV. We showed that using a linear probability model results in a worse-fitting model with predictions outside the bounds of possibility. This was to be expected, since OLS is an estimator for a normally distributed DV, not a binary one.
We also reviewed how to estimate logit and probit models, evaluate the results, translate the coefficients into predicted probabilities, graph predicted probabilities with confidence intervals, and finally how to evaluate the model fit.