Logistic Regression

Importing necessary libraries

library(tidymodels)
library(tidyverse)
library(ggplot2)
library(ggformula)
library(GGally)
library(dplyr)

Loading the dataset

nonvoters<-read.csv("/SharedProjects/Appiah/STA631/03-Logistic/nonvoters.csv") |>
  dplyr::select(ppage, educ, race, gender, income_cat, Q30, voter_category)

head(nonvoters)
##   ppage                educ  race gender    income_cat Q30 voter_category
## 1    73             College White Female      $75-125k   2         always
## 2    90             College White Female $125k or more   3         always
## 3    53             College White   Male $125k or more   2       sporadic
## 4    58        Some college Black Female       $40-75k   2       sporadic
## 5    81 High school or less White   Male       $40-75k   1         always
## 6    61 High school or less White Female       $40-75k   5   rarely/never
nonvoters <- nonvoters |>
  mutate(
    party = case_when(
      Q30 == 1 ~ "Republican",
      Q30 == 2 ~ "Democrat",
      Q30 == 3 ~ "Independent",
      Q30 %in% c(4, 5, -1) ~ "Other"
    ),
    party = factor(
      party,
      levels = c("Other", "Independent", "Democrat", "Republican")
    ),
    voter_category = factor(
      voter_category,
      levels = c("rarely/never", "sporadic", "always")
    )
  )

head(nonvoters)
##   ppage                educ  race gender    income_cat Q30 voter_category
## 1    73             College White Female      $75-125k   2         always
## 2    90             College White Female $125k or more   3         always
## 3    53             College White   Male $125k or more   2       sporadic
## 4    58        Some college Black Female       $40-75k   2       sporadic
## 5    81 High school or less White   Male       $40-75k   1         always
## 6    61 High school or less White Female       $40-75k   5   rarely/never
##         party
## 1    Democrat
## 2 Independent
## 3    Democrat
## 4    Democrat
## 5  Republican
## 6       Other

Visualizing the voter category according to the party affiliation.

nonvoters |>
  ggbivariate("voter_category", "party", title = "voter outcome by party")

From the visualization above, we can see almost similar voting pattern of people who identify themselves as a “Republican” or a “Democrat” with nearly 35% always voters, 46% sporadic voters and 19% “rarely/never” voters. For independents, the engagement is slightly lower in comparison to other big parties, with 30.2 % “always” voters, 44.3% “sporadic” voters and 25.5% “rarely/never” voters. For people in the “Other” party category, disengagement is most pronounced with nearly half (49.6%) rarely or never vote, only 15.3% always vote, and 35.1% vote sporadically. Hence, we can say that affiliation with a major political party (Democrat or Republican) is associated with higher and more consistent voting behavior, while Independents and those in the “Other” category are more likely to disengage from voting. The bivariate plot suggests that party affiliation has moderate predictive power for voting behavior: Democrats and Republicans are more likely to vote regularly than Independents or those in other parties. However, because voting patterns within each party are still quite mixed, party alone is not a strong standalone predictor of individual voting behavior.

Model instantiation

Since there are more than two labels(categories) in our response variable, we are using multinominal regression in this case.

multi_spec <- multinom_reg() |>
  set_engine("nnet")

Fitting the model

multi_mod <- multi_spec |>
  fit(voter_category ~ ppage + educ + race + gender + income_cat + party, data = nonvoters)
tidy(multi_mod) |>
  print(n = Inf) 
## # A tibble: 28 × 6
##    y.level  term                      estimate std.error statistic   p.value
##    <chr>    <chr>                        <dbl>     <dbl>     <dbl>     <dbl>
##  1 sporadic (Intercept)              -1.57       0.188    -8.34    7.58e- 17
##  2 sporadic ppage                     0.0457     0.00232  19.7     5.33e- 86
##  3 sporadic educHigh school or less  -0.853      0.0974   -8.76    1.94e- 18
##  4 sporadic educSome college         -0.293      0.0952   -3.08    2.10e-  3
##  5 sporadic raceHispanic              0.0402     0.128     0.314   7.53e-  1
##  6 sporadic raceOther/Mixed          -0.332      0.159    -2.09    3.66e-  2
##  7 sporadic raceWhite                -0.0775     0.108    -0.719   4.72e-  1
##  8 sporadic genderMale               -0.0901     0.0722   -1.25    2.12e-  1
##  9 sporadic income_cat$40-75k        -0.0738     0.111    -0.662   5.08e-  1
## 10 sporadic income_cat$75-125k        0.0125     0.107     0.117   9.07e-  1
## 11 sporadic income_catLess than $40k -0.588      0.114    -5.17    2.36e-  7
## 12 sporadic partyIndependent          0.548      0.111     4.95    7.27e-  7
## 13 sporadic partyDemocrat             0.940      0.106     8.85    8.51e- 19
## 14 sporadic partyRepublican           0.857      0.112     7.62    2.61e- 14
## 15 always   (Intercept)              -2.92       0.218   -13.4     6.73e- 41
## 16 always   ppage                     0.0582     0.00257  22.7     5.94e-114
## 17 always   educHigh school or less  -1.27       0.109   -11.6     2.46e- 31
## 18 always   educSome college         -0.330      0.102    -3.23    1.26e-  3
## 19 always   raceHispanic             -0.341      0.150    -2.28    2.28e-  2
## 20 always   raceOther/Mixed          -0.600      0.185    -3.24    1.20e-  3
## 21 always   raceWhite                 0.127      0.119     1.07    2.84e-  1
## 22 always   genderMale               -0.192      0.0797   -2.41    1.58e-  2
## 23 always   income_cat$40-75k        -0.000380   0.121    -0.00313 9.97e-  1
## 24 always   income_cat$75-125k        0.165      0.114     1.45    1.48e-  1
## 25 always   income_catLess than $40k -0.664      0.127    -5.23    1.71e-  7
## 26 always   partyIndependent          0.839      0.136     6.15    7.52e- 10
## 27 always   partyDemocrat             1.40       0.131    10.7     1.29e- 26
## 28 always   partyRepublican           1.24       0.136     9.10    8.87e- 20
multi_mod$fit$call
## nnet::multinom(formula = voter_category ~ ppage + educ + race + 
##     gender + income_cat + party, data = data, trace = FALSE)

Here, we are using repair_call() to update the model’s stored call so it correctly references the current dataset i.e nonvoters.

multi_mod <- repair_call(multi_mod, data = nonvoters)
multi_mod$fit$call
## nnet::multinom(formula = voter_category ~ ppage + educ + race + 
##     gender + income_cat + party, data = nonvoters, trace = FALSE)

After repair_call(), we can see that the old data reference i.e. data = data has been replaced with data = nonvoters.

Model Equations

Since our dataset cosists of three labels i.e always, sporadic and rarely/never in the response variable, we will need two equation. Baseline category for the model is “rarely/never”.

Using your tidy(multi_mod) |> print(n = Inf) output, the model equations will be:

\[\begin{align*} \log\left(\frac{\hat{p}_{\text{sporadic}}}{\hat{p}_{\text{rarely/never}}}\right) &= -1.57 \\ &+ 0.046 \times \text{ppage} \\ &- 0.853 \times \text{educHigh school or less} \\ &- 0.293 \times \text{educSome college} \\ &+ 0.040 \times \text{raceHispanic} \\ &- 0.332 \times \text{raceOther/Mixed} \\ &- 0.078 \times \text{raceWhite} \\ &- 0.090 \times \text{genderMale} \\ &- 0.074 \times \text{income\_cat\$40-75k} \\ &+ 0.013 \times \text{income\_cat\$75-125k} \\ &- 0.588 \times \text{income\_catLess than \$40k} \\ &+ 0.548 \times \text{partyIndependent} \\ &+ 0.940 \times \text{partyDemocrat} \\ &+ 0.857 \times \text{partyRepublican} \end{align*}\]

and

\[\begin{align*} \log\left(\frac{\hat{p}_{\text{always}}}{\hat{p}_{\text{rarely/never}}}\right) &= -2.919 \\ &+ 0.058 \times \text{ppage} \\ &- 1.267 \times \text{educHigh school or less} \\ &- 0.330 \times \text{educSome college} \\ &- 0.341 \times \text{raceHispanic} \\ &- 0.600 \times \text{raceOther/Mixed} \\ &+ 0.127 \times \text{raceWhite} \\ &- 0.192 \times \text{genderMale} \\ &- 0.0004 \times \text{income\_cat\$40-75k} \\ &+ 0.165 \times \text{income\_cat\$75-125k} \\ &- 0.664 \times \text{income\_catLess than \$40k} \\ &+ 0.839 \times \text{partyIndependent} \\ &+ 1.401 \times \text{partyDemocrat} \\ &+ 1.239 \times \text{partyRepublican} \end{align*}\]

Interpret the slope for genderMale. How did it change (if any)?

The log odds of an always(vs rarely) voter decreases by 0.192 for male in comparison to female, holding all the variables constant.

OR,

(exp(-0.192) - 1) * 100
## [1] -17.46931

The odds of an always(vs rarely) voter decreases by 17.47% for male in comparison to female, holding all the variables constant.

For the model without the Party variable, the odds of males was 19.02% less likely than females to be always voters (vs rarely/never) and the odds of males is 17.47% less likely than females for the model with Party variable, which means that both the models (with/without Party variable) are telling the same story i.e. males are less likely than females to be always voters (vs rarely/never).

Interpret the slopes for the two major parties (Republican, Democratic). What does this tell us?

For Republican: The positive slope (+1.239) suggests that individuals who identify as Republicans have higher log odds of being always voters rather than rarely/never voters compared to the reference party group (i.e. others), holding all the other variables constant. OR,

(exp(1.239) - 1) * 100
## [1] 245.216

The odds of an always(vs rarely) voter increases by 245.2% for an individuals who identify as Republicans in comparison to others, holding all the variables constant.

For Democratic: The positive slope (+1.401) suggests that individuals who identify as Democrats have higher log odds of being always voters rather than rarely/never voters compared to the reference party group (i.e. others), holding all the other variables constant. OR,

(exp(1.401) - 1)* 100
## [1] 305.9257

The odds of an always(vs rarely) voter increases by 305.9% for an individuals who identify as Democrats in comparison to others, holding all the variables constant.

Predicting

We could use this model to calculate probabilities. Generally, for categories \(2, \ldots, K\), the probability that the \(i^{th}\) observation is in the \(k^{th}\) category is,

\[ \hat{p}_{ik} = \frac{e^{\hat\beta_{0j} + \hat\beta_{1j}x_{i1} + \hat\beta_{2j}x_{i2} + \cdots + \hat\beta_{pj}x_{ip}}}{1 + \sum_{k = 2}^Ke^{\hat\beta_{0k} + \hat\beta_{1k}x_{1i} + \hat\beta_{2k}x_{2i} + \cdots + \hat\beta_{pk}x_{pi}}} \]

And the baseline category, \(k = 1\),

\[ \hat{p}_{i1} = 1 - \sum_{k = 2}^K \hat{p}_{ik} \]

voter_aug <- augment(multi_mod, new_data = nonvoters)

voter_aug
## # A tibble: 5,836 × 12
##    .pred_class `.pred_rarely/never` .pred_sporadic .pred_always ppage educ      
##    <fct>                      <dbl>          <dbl>        <dbl> <int> <chr>     
##  1 always                    0.0281          0.394        0.578    73 College   
##  2 always                    0.0208          0.423        0.556    90 College   
##  3 sporadic                  0.0947          0.480        0.425    53 College   
##  4 sporadic                  0.0924          0.482        0.425    58 Some coll…
##  5 sporadic                  0.0762          0.506        0.418    81 High scho…
##  6 sporadic                  0.353           0.436        0.212    61 High scho…
##  7 sporadic                  0.0679          0.506        0.426    80 High scho…
##  8 sporadic                  0.0782          0.504        0.417    68 Some coll…
##  9 always                    0.0467          0.474        0.480    70 College   
## 10 always                    0.0463          0.466        0.488    83 Some coll…
## # ℹ 5,826 more rows
## # ℹ 6 more variables: race <chr>, gender <chr>, income_cat <chr>, Q30 <int>,
## #   voter_category <fct>, party <fct>
voter_aug |>
  dplyr::select(contains("pred"))
## # A tibble: 5,836 × 4
##    .pred_class `.pred_rarely/never` .pred_sporadic .pred_always
##    <fct>                      <dbl>          <dbl>        <dbl>
##  1 always                    0.0281          0.394        0.578
##  2 always                    0.0208          0.423        0.556
##  3 sporadic                  0.0947          0.480        0.425
##  4 sporadic                  0.0924          0.482        0.425
##  5 sporadic                  0.0762          0.506        0.418
##  6 sporadic                  0.353           0.436        0.212
##  7 sporadic                  0.0679          0.506        0.426
##  8 sporadic                  0.0782          0.504        0.417
##  9 always                    0.0467          0.474        0.480
## 10 always                    0.0463          0.466        0.488
## # ℹ 5,826 more rows

After calculating probabilities for the three different classes, the category having the highest probability will be our predicted class. For examples, we can see predicted probabilities for three classes “rarely/never”, “sporadic” and “always” are 0.02811841, 0.3939914, and 0.57789017 respectively in the first record. Among these classes, “always” has the highest probability, and hence is voted as our predicted class.

Let’s print the confusion matrix to assess the performance of our model.

voter_conf_mat <- voter_aug %>% 
  count(voter_category, .pred_class, .drop = FALSE)

conf_wide <- voter_conf_mat %>% 
  pivot_wider(
    names_from = .pred_class,
    values_from = n,
    values_fill = 0
  )

conf_wide <- conf_wide %>% 
  mutate(Row_Total = rowSums(across(-voter_category)))

conf_with_totals <- conf_wide %>% 
  bind_rows(
    conf_wide %>% 
      summarise(
        voter_category = "Column_Total",
        across(-voter_category, sum)
      )
  )

conf_with_totals
## # A tibble: 4 × 5
##   voter_category `rarely/never` sporadic always Row_Total
##   <chr>                   <int>    <int>  <int>     <dbl>
## 1 rarely/never              642      754     55      1451
## 2 sporadic                  274     1958    342      2574
## 3 always                    216     1142    453      1811
## 4 Column_Total             1132     3854    850      5836

Interpretation:

The diagonal of the confusion matrix shows the correct prediction made by our multinominal model, whereas all the other predictions are incorrect. From the matrix, we can say that: - 642, 1958 and 453 records are classified as rarely/never, sporadic and always respectively, and all these prediction were correct. - 274 records were predicted as rarely/never, whose actual category was sporadic. Similarly, 216 records wwere classified as rarely/never, whose actual category was always. - 754 and 1142 records were classified as sporadic, whose actual category was rarely/never and always respectively. - 55 and 453 records were classified as always, whose actual category was rarely/never and sporadic respectively.

From the confusion matrix, we can say that our model is best at predicting sporadic voters similar to the model without the Party variable. Our model is struggling to classify “always” voters because out of 1811 entries, only 453 were correctly classified by our model. A large number of always voters are misclassified as sporadic. Both rarely/never and always voters are often predicted as sporadic.

We can also visualize how well these predictions fit the original values.

nonvoters %>% 
  ggplot(aes(x = voter_category)) +
  geom_bar() +
  geom_text(
    stat = "count",
    aes(label = ..count..),
    vjust = -0.3
  ) +
  labs(
    main = "Self-reported voter category"
    )

voter_conf_mat %>% 
  ggplot(aes(x = voter_category, y = n, fill = .pred_class)) +
  geom_bar(stat = "identity") +
  geom_text(
    aes(label = n),
    position = position_stack(vjust = 0.5),
    color = "white",
    size = 3
  ) +
  labs(
    main = "Predicted vs self-reported voter category"
    )

For rarely/never voters, only 642 are correctly predicted, while a larger number (754) are misclassified as sporadic, and very few (55) are predicted as always. This indicates the model has difficulty identifying infrequent voters and tends to shift them toward the middle category.

For sporadic voters, the model performs best: 1,958 individuals are correctly classified, with smaller misclassifications as rarely/never (274) and always (342). This suggests the predictors strongly capture characteristics of sporadic voting.

For always voters, correct classification is weaker: only 453 are predicted correctly, while most (1,142) are misclassified as sporadic, and 216 as rarely/never. This shows the model struggles to distinguish consistent voters from sporadic ones.

Overall, the visualization reveals a central-category bias: when uncertain, the model most often predicts sporadic, leading to strong performance for that group but poorer discrimination at the extremes of voting behavior.

Comparing the performance of our model with and without Party variable

Comparing the confusion matrix of the model without and with the party variable, we can see that 586 correctly classified for the rarely/never voters in the former one and 642 correctly classified in the latter one, 1,994 correctly classified for sporadic voters in the former one and 1,958 correctly classified in the latter one, and 418 correctly classified for always voters in the former one and 453 correctly classified in the latter one. Upon comparison, including party affiliation modestly improves classification of rarely/never and always voters, but overall model performance changes only slightly, with sporadic voters remaining the easiest group to predict in both models.

If we compare both the models in terms of accuracy and sensitivity, we can notice that adding party affiliation slightly improved the model’s performance. Overall accuracy increased from about 51.4% to 52.3%, and the model became better at correctly identifying both rarely/never voters and always voters. This indicates that party provides additional predictive information about voting behavior. However, the improvement is small, and the model still struggles to distinguish always voters from sporadic voters. Most misclassifications continue to fall into the sporadic category, suggesting that while party helps, it is not sufficient on its own to substantially improve predictions. Other factors beyond party affiliation are still needed to better capture consistent voting behavior.