load("C:/Users/tophe/Downloads/States.RData")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(effects)
## Loading required package: carData
##
## Attaching package: 'carData'
##
## The following object is masked _by_ '.GlobalEnv':
##
## States
##
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(nnet)
library(fst)
library(modelsummary)
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
## backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
##
## Revert to `kableExtra` for one session:
##
## options(modelsummary_factory_default = 'kableExtra')
## options(modelsummary_factory_latex = 'kableExtra')
## options(modelsummary_factory_html = 'kableExtra')
##
## Silence this message forever:
##
## config_modelsummary(startup_message = FALSE)
library(sjPlot)
## #refugeeswelcome
library(ggplot2)
library(car)
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(Lock5Data)
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
##
## The following object is masked from 'package:Matrix':
##
## mean
##
## The following objects are masked from 'package:car':
##
## deltaMethod, logit
##
## The following object is masked from 'package:modelsummary':
##
## msummary
##
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## stat
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
Task 1
Using the States dataset, conduct a multiple linear regression with ClintonVote as the dependent variable and College, HouseholdIncome, and NonWhite as independent variables. Create regression tables using the sjPlot package. Interpret the coefficients and assess the model fit.
Model1 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite , data = States)
tab_model(Model1)
| Â | ClintonVote | ||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.99 | -11.11 – 13.10 | 0.870 |
| College | 1.12 | 0.67 – 1.58 | <0.001 |
| HouseholdIncome | -0.08 | -0.38 – 0.23 | 0.625 |
| NonWhite | 0.43 | 0.27 – 0.60 | <0.001 |
| Observations | 50 | ||
| R2 / R2 adjusted | 0.616 / 0.590 | ||
The coefficients for the model state that for every single unit increase in College, HouseholdIncome, and NonWhite, the expected percent of people in the state voting for Clinton would have risen 1.12, -0.08, and 0.43 respectively. The intercept is at 0.991 which means if there was no college graduates in the state, no household income, and only white people, Clinton would get about 1 percent of the vote. The model’s R-squared value of 0.616 states that 61.6 percent of variation is explainable using this model. This implies that the model fits decently well. Only the coefficients for College and Nonwhite are statistically significant in the data due to HouseholdIncome’s p value being well above the 0.05 cutoff.
Task 2
Extend the model from Task 1 by adding the Region variable. Compare this new model to the previous one using AIC and BIC. Create a regression table with all models with modelsummary. Interpret the results and discuss which model you prefer and why.
Model2 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)
tab_model(Model2)
| Â | ClintonVote | ||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 14.81 | 0.37 – 29.24 | 0.045 |
| College | 1.19 | 0.67 – 1.71 | <0.001 |
| HouseholdIncome | -0.43 | -0.77 – -0.08 | 0.016 |
| NonWhite | 0.52 | 0.36 – 0.69 | <0.001 |
| Region [NE] | 8.03 | 2.54 – 13.52 | 0.005 |
| Region [S] | -3.40 | -8.93 – 2.13 | 0.221 |
| Region [W] | 5.66 | 0.07 – 11.25 | 0.047 |
| Observations | 50 | ||
| R2 / R2 adjusted | 0.722 / 0.684 | ||
Models1 <- list(Model1, Model2)
modelsummary(Models1)
| (1) | (2) | |
|---|---|---|
| (Intercept) | 0.991 | 14.806 |
| (6.014) | (7.159) | |
| College | 1.125 | 1.189 |
| (0.225) | (0.257) | |
| HouseholdIncome | -0.076 | -0.429 |
| (0.153) | (0.171) | |
| NonWhite | 0.431 | 0.525 |
| (0.082) | (0.082) | |
| RegionNE | 8.031 | |
| (2.721) | ||
| RegionS | -3.404 | |
| (2.742) | ||
| RegionW | 5.658 | |
| (2.772) | ||
| Num.Obs. | 50 | 50 |
| R2 | 0.616 | 0.722 |
| R2 Adj. | 0.590 | 0.684 |
| AIC | 336.8 | 326.5 |
| BIC | 346.3 | 341.8 |
| Log.Lik. | -163.383 | -155.243 |
| RMSE | 6.35 | 5.40 |
College’s coefficient increases from 1.125 to 1.189, HouseholdIncome sinks from -0.08 to -0.43, and NonWhite grows from 0.43 to 0.525. This implies more changes with each unit increase in a variable. Each Region also has its own coefficient that predicts the increase in the odds of voting for Clinton depending on if the State is in that region. Only the southern region varible has a high enough p value to have it thrown out as all the others are below 0.05. After including the Region variable, the R-squared increased by 0.106 which means 10.6 percent more of the variation is explained by the model. AIC and BIC also both go down after Region is included in the data. Because of this, and because of the number of valid variables increasing in the second leads me to believe the second model is preferable.
Task 3 Using the States dataset, fit a model predicting ClintonVote with College, NonWhite, and their interaction. Visualize the interaction effect using the effects package and interpret the results. How does the interaction change our understanding of the relationship between education and voting patterns?
m.interact <- lm(ClintonVote ~ NonWhite + College + NonWhite:College, data=States)
interaction_plot <- effect("NonWhite*College", m.interact, xlevels=list(NonWhite=seq(0, 90, 10)))
plot(interaction_plot, main="Interaction Effect", xlab="% Non-White", ylab="% Clinton Vote")
The lines are not parallel which means there is an interaction effect.
This means that the relationship between the Nonwhite populations and
the likelihood a state votes for Clinton depends on the number of
college graduates in the state.
Task 4 Using the States dataset, create a binary outcome variable indicating whether a state voted for Clinton (1) or not (0). Perform a logistic regression with this new variable as the dependent variable and College, HouseholdIncome, and Region as predictors. Create regression tables using both modelsummary and sjPlot. Interpret the odds ratios and assess model fit.
States$ClintonVote <- abs(as.numeric(States$Elect2016) - 2)
Model4 <- glm(ClintonVote ~ College + HouseholdIncome + Region, data=States, family=binomial(link="logit"))
tab_model(Model4)
| Â | ClintonVote | ||
|---|---|---|---|
| Predictors | Odds Ratios | CI | p |
| (Intercept) | 0.00 | 0.00 – 0.00 | 0.002 |
| College | 1.29 | 1.01 – 1.71 | 0.050 |
| HouseholdIncome | 1.06 | 0.93 – 1.22 | 0.399 |
| Region [NE] | 20.54 | 1.70 – 599.19 | 0.032 |
| Region [S] | 1.37 | 0.04 – 32.70 | 0.844 |
| Region [W] | 20.52 | 1.55 – 500.93 | 0.035 |
| Observations | 50 | ||
| R2 Tjur | 0.588 | ||
modelsummary(Model4)
| (1) | |
|---|---|
| (Intercept) | -13.883 |
| (4.578) | |
| College | 0.254 |
| (0.130) | |
| HouseholdIncome | 0.056 |
| (0.066) | |
| RegionNE | 3.022 |
| (1.413) | |
| RegionS | 0.313 |
| (1.589) | |
| RegionW | 3.021 |
| (1.431) | |
| Num.Obs. | 50 |
| AIC | 44.0 |
| BIC | 55.5 |
| Log.Lik. | -16.007 |
| RMSE | 0.32 |
exp(coef(Model4))
## (Intercept) College HouseholdIncome RegionNE RegionS
## 9.346977e-07 1.289038e+00 1.057591e+00 2.054104e+01 1.366961e+00
## RegionW
## 2.051945e+01
According to this model, for each increase in one unit of College grads in the state, the state is 28.9 percent more likely to vote for Clinton. An increase in Household Income by a thousand dollars makes a state 5.8 percent more likely to vote for Clinton too. The intercept has a value of almost zero, meaning there is essentially no way Clinton would win the election with no college graduates or Household income. Because this model has a lower R-Squared value than the other models, it can only explain 58.8 percent of variation. Also, the p-values of only two variables are below the 0.05 threshold for statistical significance, meaning the model is not a good fit.
Task 5 Leveraging the same logistic regression model from Task 4, use the effects package to plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then, create a plot using sjPlot’s plot_model function to visualize the odds ratios of all predictors in the model.
States$ClintonVote <- abs(as.numeric(States$Elect2016) - 2)
Model4 <- glm(ClintonVote ~ College + HouseholdIncome + Region, data=States, family=binomial(link="logit"))
Model5 <- multinom(Region ~ College, data=States)
## # weights: 12 (6 variable)
## initial value 69.314718
## iter 10 value 52.744531
## final value 52.742812
## converged
Model6 <- allEffects(Model5)
plot(Model6, type = "probability", confint = TRUE, multiline = TRUE,ylim=c(0,1))
plot_model(Model4, show.values = TRUE, show.p = TRUE, value.offset = 0.2, wrap.title = 100)