Importing necessary libraries
library(tidymodels)
library(tidyverse)
library(ggplot2)
library(ggformula)
library(GGally)
library(dplyr)
Loading the dataset
nonvoters<-read.csv("/SharedProjects/Appiah/STA631/03-Logistic/nonvoters.csv") |>
dplyr::select(ppage, educ, race, gender, income_cat, Q30, voter_category)
head(nonvoters)
## ppage educ race gender income_cat Q30 voter_category
## 1 73 College White Female $75-125k 2 always
## 2 90 College White Female $125k or more 3 always
## 3 53 College White Male $125k or more 2 sporadic
## 4 58 Some college Black Female $40-75k 2 sporadic
## 5 81 High school or less White Male $40-75k 1 always
## 6 61 High school or less White Female $40-75k 5 rarely/never
nonvoters <- nonvoters |>
mutate(
party = case_when(
Q30 == 1 ~ "Republican",
Q30 == 2 ~ "Democrat",
Q30 == 3 ~ "Independent",
Q30 %in% c(4, 5, -1) ~ "Other"
),
party = factor(
party,
levels = c("Other", "Independent", "Democrat", "Republican")
),
voter_category = factor(
voter_category,
levels = c("rarely/never", "sporadic", "always")
)
)
head(nonvoters)
## ppage educ race gender income_cat Q30 voter_category
## 1 73 College White Female $75-125k 2 always
## 2 90 College White Female $125k or more 3 always
## 3 53 College White Male $125k or more 2 sporadic
## 4 58 Some college Black Female $40-75k 2 sporadic
## 5 81 High school or less White Male $40-75k 1 always
## 6 61 High school or less White Female $40-75k 5 rarely/never
## party
## 1 Democrat
## 2 Independent
## 3 Democrat
## 4 Democrat
## 5 Republican
## 6 Other
Visualizing the voter category according to the party affiliation.
nonvoters |>
ggbivariate("voter_category", "party", title = "voter outcome by party")
From the visualization above, we can see almost similar voting pattern of people who identify themselves as a “Republican” or a “Democrat” with nearly 35% always voters, 46% sporadic voters and 19% “rarely/never” voters. For independents, the engagement is slightly lower in comparison to other big parties, with 30.2 % “always” voters, 44.3% “sporadic” voters and 25.5% “rarely/never” voters. For people in the “Other” party category, disengagement is most pronounced with nearly half (49.6%) rarely or never vote, only 15.3% always vote, and 35.1% vote sporadically. Hence, we can say that affiliation with a major political party (Democrat or Republican) is associated with higher and more consistent voting behavior, while Independents and those in the “Other” category are more likely to disengage from voting. The bivariate plot suggests that party affiliation has moderate predictive power for voting behavior: Democrats and Republicans are more likely to vote regularly than Independents or those in other parties. However, because voting patterns within each party are still quite mixed, party alone is not a strong standalone predictor of individual voting behavior.
Since there are more than two labels(categories) in our response variable, we are using multinominal regression in this case.
multi_spec <- multinom_reg() |>
set_engine("nnet")
multi_mod <- multi_spec |>
fit(voter_category ~ ppage + educ + race + gender + income_cat + party, data = nonvoters)
tidy(multi_mod) |>
print(n = Inf)
## # A tibble: 28 × 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sporadic (Intercept) -1.57 0.188 -8.34 7.58e- 17
## 2 sporadic ppage 0.0457 0.00232 19.7 5.33e- 86
## 3 sporadic educHigh school or less -0.853 0.0974 -8.76 1.94e- 18
## 4 sporadic educSome college -0.293 0.0952 -3.08 2.10e- 3
## 5 sporadic raceHispanic 0.0402 0.128 0.314 7.53e- 1
## 6 sporadic raceOther/Mixed -0.332 0.159 -2.09 3.66e- 2
## 7 sporadic raceWhite -0.0775 0.108 -0.719 4.72e- 1
## 8 sporadic genderMale -0.0901 0.0722 -1.25 2.12e- 1
## 9 sporadic income_cat$40-75k -0.0738 0.111 -0.662 5.08e- 1
## 10 sporadic income_cat$75-125k 0.0125 0.107 0.117 9.07e- 1
## 11 sporadic income_catLess than $40k -0.588 0.114 -5.17 2.36e- 7
## 12 sporadic partyIndependent 0.548 0.111 4.95 7.27e- 7
## 13 sporadic partyDemocrat 0.940 0.106 8.85 8.51e- 19
## 14 sporadic partyRepublican 0.857 0.112 7.62 2.61e- 14
## 15 always (Intercept) -2.92 0.218 -13.4 6.73e- 41
## 16 always ppage 0.0582 0.00257 22.7 5.94e-114
## 17 always educHigh school or less -1.27 0.109 -11.6 2.46e- 31
## 18 always educSome college -0.330 0.102 -3.23 1.26e- 3
## 19 always raceHispanic -0.341 0.150 -2.28 2.28e- 2
## 20 always raceOther/Mixed -0.600 0.185 -3.24 1.20e- 3
## 21 always raceWhite 0.127 0.119 1.07 2.84e- 1
## 22 always genderMale -0.192 0.0797 -2.41 1.58e- 2
## 23 always income_cat$40-75k -0.000380 0.121 -0.00313 9.97e- 1
## 24 always income_cat$75-125k 0.165 0.114 1.45 1.48e- 1
## 25 always income_catLess than $40k -0.664 0.127 -5.23 1.71e- 7
## 26 always partyIndependent 0.839 0.136 6.15 7.52e- 10
## 27 always partyDemocrat 1.40 0.131 10.7 1.29e- 26
## 28 always partyRepublican 1.24 0.136 9.10 8.87e- 20
multi_mod$fit$call
## nnet::multinom(formula = voter_category ~ ppage + educ + race +
## gender + income_cat + party, data = data, trace = FALSE)
Here, we are using repair_call() to update the model’s
stored call so it correctly references the current dataset i.e
nonvoters.
multi_mod <- repair_call(multi_mod, data = nonvoters)
multi_mod$fit$call
## nnet::multinom(formula = voter_category ~ ppage + educ + race +
## gender + income_cat + party, data = nonvoters, trace = FALSE)
After repair_call(), we can see that the old data
reference i.e. data = data has been replaced with
data = nonvoters.
Since our dataset cosists of three labels i.e always,
sporadic and rarely/never in the response
variable, we will need two equation. Baseline category for the model is
“rarely/never”.
Using your tidy(multi_mod) |> print(n = Inf) output,
the model equations will be:
\[\begin{align*} \log\left(\frac{\hat{p}_{\text{sporadic}}}{\hat{p}_{\text{rarely/never}}}\right) &= -1.57 \\ &+ 0.046 \times \text{ppage} \\ &- 0.853 \times \text{educHigh school or less} \\ &- 0.293 \times \text{educSome college} \\ &+ 0.040 \times \text{raceHispanic} \\ &- 0.332 \times \text{raceOther/Mixed} \\ &- 0.078 \times \text{raceWhite} \\ &- 0.090 \times \text{genderMale} \\ &- 0.074 \times \text{income\_cat\$40-75k} \\ &+ 0.013 \times \text{income\_cat\$75-125k} \\ &- 0.588 \times \text{income\_catLess than \$40k} \\ &+ 0.548 \times \text{partyIndependent} \\ &+ 0.940 \times \text{partyDemocrat} \\ &+ 0.857 \times \text{partyRepublican} \end{align*}\]
and
\[\begin{align*} \log\left(\frac{\hat{p}_{\text{always}}}{\hat{p}_{\text{rarely/never}}}\right) &= -2.919 \\ &+ 0.058 \times \text{ppage} \\ &- 1.267 \times \text{educHigh school or less} \\ &- 0.330 \times \text{educSome college} \\ &- 0.341 \times \text{raceHispanic} \\ &- 0.600 \times \text{raceOther/Mixed} \\ &+ 0.127 \times \text{raceWhite} \\ &- 0.192 \times \text{genderMale} \\ &- 0.0004 \times \text{income\_cat\$40-75k} \\ &+ 0.165 \times \text{income\_cat\$75-125k} \\ &- 0.664 \times \text{income\_catLess than \$40k} \\ &+ 0.839 \times \text{partyIndependent} \\ &+ 1.401 \times \text{partyDemocrat} \\ &+ 1.239 \times \text{partyRepublican} \end{align*}\]
genderMale. How did it change
(if any)?The log odds of an always(vs rarely) voter decreases by 0.192 for male in comparison to female, holding all the variables constant.
OR,
(exp(-0.192) - 1) * 100
## [1] -17.46931
The odds of an always(vs rarely) voter decreases by 17.47% for male in comparison to female, holding all the variables constant.
For the model without the Party variable, the odds of
males was 19.02% less likely than females to be always voters (vs
rarely/never) and the odds of males is 17.47% less likely than females
for the model with Party variable, which means that both
the models (with/without Party variable) are telling the
same story i.e. males are less likely than females to be always voters
(vs rarely/never).
Republican, Democratic). What does this tell
us?For Republican: The positive slope (+1.239) suggests
that individuals who identify as Republicans have higher log odds of
being always voters rather than rarely/never voters compared to the
reference party group (i.e. others), holding all the other
variables constant. OR,
(exp(1.239) - 1) * 100
## [1] 245.216
The odds of an always(vs rarely) voter increases by 245.2% for an
individuals who identify as Republicans in comparison to
others, holding all the variables constant.
For Democratic: The positive slope (+1.401) suggests
that individuals who identify as Democrats have higher log odds of being
always voters rather than rarely/never voters compared to the reference
party group (i.e. others), holding all the other variables
constant. OR,
(exp(1.401) - 1)* 100
## [1] 305.9257
The odds of an always(vs rarely) voter increases by 305.9% for an
individuals who identify as Democrats in comparison to
others, holding all the variables constant.
We could use this model to calculate probabilities. Generally, for categories \(2, \ldots, K\), the probability that the \(i^{th}\) observation is in the \(k^{th}\) category is,
\[ \hat{p}_{ik} = \frac{e^{\hat\beta_{0j} + \hat\beta_{1j}x_{i1} + \hat\beta_{2j}x_{i2} + \cdots + \hat\beta_{pj}x_{ip}}}{1 + \sum_{k = 2}^Ke^{\hat\beta_{0k} + \hat\beta_{1k}x_{1i} + \hat\beta_{2k}x_{2i} + \cdots + \hat\beta_{pk}x_{pi}}} \]
And the baseline category, \(k = 1\),
\[ \hat{p}_{i1} = 1 - \sum_{k = 2}^K \hat{p}_{ik} \]
voter_aug <- augment(multi_mod, new_data = nonvoters)
voter_aug
## # A tibble: 5,836 × 12
## .pred_class `.pred_rarely/never` .pred_sporadic .pred_always ppage educ
## <fct> <dbl> <dbl> <dbl> <int> <chr>
## 1 always 0.0281 0.394 0.578 73 College
## 2 always 0.0208 0.423 0.556 90 College
## 3 sporadic 0.0947 0.480 0.425 53 College
## 4 sporadic 0.0924 0.482 0.425 58 Some coll…
## 5 sporadic 0.0762 0.506 0.418 81 High scho…
## 6 sporadic 0.353 0.436 0.212 61 High scho…
## 7 sporadic 0.0679 0.506 0.426 80 High scho…
## 8 sporadic 0.0782 0.504 0.417 68 Some coll…
## 9 always 0.0467 0.474 0.480 70 College
## 10 always 0.0463 0.466 0.488 83 Some coll…
## # ℹ 5,826 more rows
## # ℹ 6 more variables: race <chr>, gender <chr>, income_cat <chr>, Q30 <int>,
## # voter_category <fct>, party <fct>
voter_aug |>
dplyr::select(contains("pred"))
## # A tibble: 5,836 × 4
## .pred_class `.pred_rarely/never` .pred_sporadic .pred_always
## <fct> <dbl> <dbl> <dbl>
## 1 always 0.0281 0.394 0.578
## 2 always 0.0208 0.423 0.556
## 3 sporadic 0.0947 0.480 0.425
## 4 sporadic 0.0924 0.482 0.425
## 5 sporadic 0.0762 0.506 0.418
## 6 sporadic 0.353 0.436 0.212
## 7 sporadic 0.0679 0.506 0.426
## 8 sporadic 0.0782 0.504 0.417
## 9 always 0.0467 0.474 0.480
## 10 always 0.0463 0.466 0.488
## # ℹ 5,826 more rows
After calculating probabilities for the three different classes, the category having the highest probability will be our predicted class. For examples, we can see predicted probabilities for three classes “rarely/never”, “sporadic” and “always” are 0.02811841, 0.3939914, and 0.57789017 respectively in the first record. Among these classes, “always” has the highest probability, and hence is voted as our predicted class.
Let’s print the confusion matrix to assess the performance of our model.
voter_conf_mat <- voter_aug %>%
count(voter_category, .pred_class, .drop = FALSE)
conf_wide <- voter_conf_mat %>%
pivot_wider(
names_from = .pred_class,
values_from = n,
values_fill = 0
)
conf_wide <- conf_wide %>%
mutate(Row_Total = rowSums(across(-voter_category)))
conf_with_totals <- conf_wide %>%
bind_rows(
conf_wide %>%
summarise(
voter_category = "Column_Total",
across(-voter_category, sum)
)
)
conf_with_totals
## # A tibble: 4 × 5
## voter_category `rarely/never` sporadic always Row_Total
## <chr> <int> <int> <int> <dbl>
## 1 rarely/never 642 754 55 1451
## 2 sporadic 274 1958 342 2574
## 3 always 216 1142 453 1811
## 4 Column_Total 1132 3854 850 5836
Interpretation:
The diagonal of the confusion matrix shows the correct prediction made by our multinominal model, whereas all the other predictions are incorrect. From the matrix, we can say that: - 642, 1958 and 453 records are classified as rarely/never, sporadic and always respectively, and all these prediction were correct. - 274 records were predicted as rarely/never, whose actual category was sporadic. Similarly, 216 records wwere classified as rarely/never, whose actual category was always. - 754 and 1142 records were classified as sporadic, whose actual category was rarely/never and always respectively. - 55 and 453 records were classified as always, whose actual category was rarely/never and sporadic respectively.
From the confusion matrix, we can say that our model is best at
predicting sporadic voters similar to the model without the
Party variable. Our model is struggling to classify
“always” voters because out of 1811 entries, only 453 were correctly
classified by our model. A large number of always voters are
misclassified as sporadic. Both rarely/never and always voters are often
predicted as sporadic.
We can also visualize how well these predictions fit the original values.
nonvoters %>%
ggplot(aes(x = voter_category)) +
geom_bar() +
geom_text(
stat = "count",
aes(label = ..count..),
vjust = -0.3
) +
labs(
main = "Self-reported voter category"
)
voter_conf_mat %>%
ggplot(aes(x = voter_category, y = n, fill = .pred_class)) +
geom_bar(stat = "identity") +
geom_text(
aes(label = n),
position = position_stack(vjust = 0.5),
color = "white",
size = 3
) +
labs(
main = "Predicted vs self-reported voter category"
)
For rarely/never voters, only 642 are correctly predicted, while a larger number (754) are misclassified as sporadic, and very few (55) are predicted as always. This indicates the model has difficulty identifying infrequent voters and tends to shift them toward the middle category.
For sporadic voters, the model performs best: 1,958 individuals are correctly classified, with smaller misclassifications as rarely/never (274) and always (342). This suggests the predictors strongly capture characteristics of sporadic voting.
For always voters, correct classification is weaker: only 453 are predicted correctly, while most (1,142) are misclassified as sporadic, and 216 as rarely/never. This shows the model struggles to distinguish consistent voters from sporadic ones.
Overall, the visualization reveals a central-category bias: when uncertain, the model most often predicts sporadic, leading to strong performance for that group but poorer discrimination at the extremes of voting behavior.
Party variableComparing the confusion matrix of the model without and with the party variable, we can see that 586 correctly classified for the rarely/never voters in the former one and 642 correctly classified in the latter one, 1,994 correctly classified for sporadic voters in the former one and 1,958 correctly classified in the latter one, and 418 correctly classified for always voters in the former one and 453 correctly classified in the latter one. Upon comparison, including party affiliation modestly improves classification of rarely/never and always voters, but overall model performance changes only slightly, with sporadic voters remaining the easiest group to predict in both models.
If we compare both the models in terms of accuracy and sensitivity, we can notice that adding party affiliation slightly improved the model’s performance. Overall accuracy increased from about 51.4% to 52.3%, and the model became better at correctly identifying both rarely/never voters and always voters. This indicates that party provides additional predictive information about voting behavior. However, the improvement is small, and the model still struggles to distinguish always voters from sporadic voters. Most misclassifications continue to fall into the sporadic category, suggesting that while party helps, it is not sufficient on its own to substantially improve predictions. Other factors beyond party affiliation are still needed to better capture consistent voting behavior.