Project 4

load("C:/Users/tophe/Downloads/States.RData")
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(effects)

## Loading required package: carData
## 
## Attaching package: 'carData'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     States
## 
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

library(nnet)
library(fst)
library(modelsummary)

## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
##   backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
## 
## Revert to `kableExtra` for one session:
## 
##   options(modelsummary_factory_default = 'kableExtra')
##   options(modelsummary_factory_latex = 'kableExtra')
##   options(modelsummary_factory_html = 'kableExtra')
## 
## Silence this message forever:
## 
##   config_modelsummary(startup_message = FALSE)

library(sjPlot)

## #refugeeswelcome

library(ggplot2)
library(car)

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(Lock5Data)
library(mosaic)

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## The following objects are masked from 'package:car':
## 
##     deltaMethod, logit
## 
## The following object is masked from 'package:modelsummary':
## 
##     msummary
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

Task 1

Using the States dataset, conduct a multiple linear regression with ClintonVote as the dependent variable and College, HouseholdIncome, and NonWhite as independent variables. Create regression tables using the sjPlot package. Interpret the coefficients and assess the model fit.

Model1 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite , data = States)
tab_model(Model1)

	ClintonVote
Predictors	Estimates	CI	p
(Intercept)	0.99	-11.11 – 13.10	0.870
College	1.12	0.67 – 1.58	<0.001
HouseholdIncome	-0.08	-0.38 – 0.23	0.625
NonWhite	0.43	0.27 – 0.60	<0.001
Observations	50
R² / R² adjusted	0.616 / 0.590

The coefficients for the model state that for every single unit increase in College, HouseholdIncome, and NonWhite, the expected percent of people in the state voting for Clinton would have risen 1.12, -0.08, and 0.43 respectively. The intercept is at 0.991 which means if there was no college graduates in the state, no household income, and only white people, Clinton would get about 1 percent of the vote. The model’s R-squared value of 0.616 states that 61.6 percent of variation is explainable using this model. This implies that the model fits decently well. Only the coefficients for College and Nonwhite are statistically significant in the data due to HouseholdIncome’s p value being well above the 0.05 cutoff.

Task 2

Extend the model from Task 1 by adding the Region variable. Compare this new model to the previous one using AIC and BIC. Create a regression table with all models with modelsummary. Interpret the results and discuss which model you prefer and why.

Model2 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)
tab_model(Model2)

	ClintonVote
Predictors	Estimates	CI	p
(Intercept)	14.81	0.37 – 29.24	0.045
College	1.19	0.67 – 1.71	<0.001
HouseholdIncome	-0.43	-0.77 – -0.08	0.016
NonWhite	0.52	0.36 – 0.69	<0.001
Region [NE]	8.03	2.54 – 13.52	0.005
Region [S]	-3.40	-8.93 – 2.13	0.221
Region [W]	5.66	0.07 – 11.25	0.047
Observations	50
R² / R² adjusted	0.722 / 0.684

Models1 <- list(Model1, Model2)
modelsummary(Models1)

tinytable_mlfmxafhyn1217335rkn

	(1)	(2)
(Intercept)	0.991	14.806
	(6.014)	(7.159)
College	1.125	1.189
	(0.225)	(0.257)
HouseholdIncome	-0.076	-0.429
	(0.153)	(0.171)
NonWhite	0.431	0.525
	(0.082)	(0.082)
RegionNE		8.031
		(2.721)
RegionS		-3.404
		(2.742)
RegionW		5.658
		(2.772)
Num.Obs.	50	50
R2	0.616	0.722
R2 Adj.	0.590	0.684
AIC	336.8	326.5
BIC	346.3	341.8
Log.Lik.	-163.383	-155.243
RMSE	6.35	5.40

College’s coefficient increases from 1.125 to 1.189, HouseholdIncome sinks from -0.08 to -0.43, and NonWhite grows from 0.43 to 0.525. This implies more changes with each unit increase in a variable. Each Region also has its own coefficient that predicts the increase in the odds of voting for Clinton depending on if the State is in that region. Only the southern region varible has a high enough p value to have it thrown out as all the others are below 0.05. After including the Region variable, the R-squared increased by 0.106 which means 10.6 percent more of the variation is explained by the model. AIC and BIC also both go down after Region is included in the data. Because of this, and because of the number of valid variables increasing in the second leads me to believe the second model is preferable. Task 3 Using the States dataset, fit a model predicting ClintonVote with College, NonWhite, and their interaction. Visualize the interaction effect using the effects package and interpret the results. How does the interaction change our understanding of the relationship between education and voting patterns?

m.interact <- lm(ClintonVote ~ NonWhite + College + NonWhite:College, data=States)
interaction_plot <- effect("NonWhite*College", m.interact, xlevels=list(NonWhite=seq(0, 90, 10)))
plot(interaction_plot, main="Interaction Effect", xlab="% Non-White", ylab="% Clinton Vote")

The lines are not parallel which means there is an interaction effect. This means that the relationship between the Nonwhite populations and the likelihood a state votes for Clinton depends on the number of college graduates in the state. Task 4 Using the States dataset, create a binary outcome variable indicating whether a state voted for Clinton (1) or not (0). Perform a logistic regression with this new variable as the dependent variable and College, HouseholdIncome, and Region as predictors. Create regression tables using both modelsummary and sjPlot. Interpret the odds ratios and assess model fit.

States$ClintonVote <- abs(as.numeric(States$Elect2016) - 2)
Model4 <- glm(ClintonVote ~ College + HouseholdIncome + Region, data=States, family=binomial(link="logit"))
tab_model(Model4)

	ClintonVote
Predictors	Odds Ratios	CI	p
(Intercept)	0.00	0.00 – 0.00	0.002
College	1.29	1.01 – 1.71	0.050
HouseholdIncome	1.06	0.93 – 1.22	0.399
Region [NE]	20.54	1.70 – 599.19	0.032
Region [S]	1.37	0.04 – 32.70	0.844
Region [W]	20.52	1.55 – 500.93	0.035
Observations	50
R² Tjur	0.588

modelsummary(Model4)

tinytable_yw7v1kjjv7kvh7wpioua

	(1)
(Intercept)	-13.883
	(4.578)
College	0.254
	(0.130)
HouseholdIncome	0.056
	(0.066)
RegionNE	3.022
	(1.413)
RegionS	0.313
	(1.589)
RegionW	3.021
	(1.431)
Num.Obs.	50
AIC	44.0
BIC	55.5
Log.Lik.	-16.007
RMSE	0.32

exp(coef(Model4))

##     (Intercept)         College HouseholdIncome        RegionNE         RegionS 
##    9.346977e-07    1.289038e+00    1.057591e+00    2.054104e+01    1.366961e+00 
##         RegionW 
##    2.051945e+01

According to this model, for each increase in one unit of College grads in the state, the state is 28.9 percent more likely to vote for Clinton. An increase in Household Income by a thousand dollars makes a state 5.8 percent more likely to vote for Clinton too. The intercept has a value of almost zero, meaning there is essentially no way Clinton would win the election with no college graduates or Household income. Because this model has a lower R-Squared value than the other models, it can only explain 58.8 percent of variation. Also, the p-values of only two variables are below the 0.05 threshold for statistical significance, meaning the model is not a good fit.

Task 5 Leveraging the same logistic regression model from Task 4, use the effects package to plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then, create a plot using sjPlot’s plot_model function to visualize the odds ratios of all predictors in the model.

States$ClintonVote <- abs(as.numeric(States$Elect2016) - 2)
Model4 <- glm(ClintonVote ~ College + HouseholdIncome + Region, data=States, family=binomial(link="logit"))

Model5 <- multinom(Region ~ College, data=States)

## # weights:  12 (6 variable)
## initial  value 69.314718 
## iter  10 value 52.744531
## final  value 52.742812 
## converged

Model6 <- allEffects(Model5)
plot(Model6, type = "probability", confint = TRUE, multiline = TRUE,ylim=c(0,1))

plot_model(Model4, show.values = TRUE, show.p = TRUE, value.offset = 0.2, wrap.title = 100)

Project 4

Topher Eaton

2024-07-24