## Instructions

- Task 1: Fit a multiple linear regression predicting ClintonVote and interpret it with sjPlot.
- Task 2: Add Region as a predictor and compare both models' AIC and BIC with modelsummary.
- Task 3: Fit an interaction model (College × NonWhite) and visualize it with the effects package.
- Task 4: Create a binary Clinton-vote outcome, fit a logistic regression, and interpret odds ratios and model fit.
- Task 5: Plot predicted probabilities by College and Region, and visualize the odds ratios with sjPlot.

## Setup

We'll start, as always, by preparing our packages and our data. We will be using the "States.RData" dataset from Quercus along with packages like "effects", "interactions", "sjPlot", and "modelsummary".
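A minimal setup sketch (the exact file path is an assumption; adjust it to wherever States.RData was saved):

```r
# Load the packages used throughout this markdown
library(effects)        # effect() plots for Task 3 and Task 5
library(interactions)   # alternative interaction plots
library(sjPlot)         # tab_model() and plot_model()
library(modelsummary)   # modelsummary() tables with AIC/BIC

# Load the States data frame (path is an assumption)
load("States.RData")
```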

## Task 1

We must fit a multiple linear regression model, with ClintonVote as our dependent variable and College, HouseholdIncome, and NonWhite as our independent variables. We'll define the model, use sjPlot to make a regression table, and interpret the results.

```r
# Let's make our multiple linear regression model
m1 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)

# Now our regression table with sjPlot's tab_model() command
tab_model(m1)
```
**ClintonVote**

| Predictors        | Estimates     | CI             | p      |
|-------------------|---------------|----------------|--------|
| (Intercept)       | 0.99          | -11.11 – 13.10 | 0.870  |
| College           | 1.12          | 0.67 – 1.58    | <0.001 |
| HouseholdIncome   | -0.08         | -0.38 – 0.23   | 0.625  |
| NonWhite          | 0.43          | 0.27 – 0.60    | <0.001 |
| Observations      | 50            |                |        |
| R² / R² adjusted  | 0.616 / 0.590 |                |        |
```r
# We'll calculate the RMSE since sjPlot's regression table does not automatically display it.
# We could pull AIC and BIC, but without another model to compare to they are not very useful.
rmse_value <- sqrt(mean(residuals(m1)^2))
cat("RMSE:", rmse_value, "\n")
## RMSE: 6.351735
```

**Interpretation of Coefficients**

College: For every one-percentage-point increase in a state's college-educated population, the predicted Clinton vote share increases by 1.12 percentage points, holding the other predictors constant.

HouseholdIncome: For every one-unit increase in household income, the predicted Clinton vote share falls by 0.08 percentage points, holding the other predictors constant (though this estimate is not statistically significant, p = 0.625).

NonWhite: For every one-percentage-point increase in a state's non-white population, the predicted Clinton vote share increases by 0.43 percentage points, holding the other predictors constant.
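Putting the estimates from the table together, the fitted regression equation is:

$$\widehat{\text{ClintonVote}} = 0.99 + 1.12 \cdot \text{College} - 0.08 \cdot \text{HouseholdIncome} + 0.43 \cdot \text{NonWhite}$$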

This suggests that states with higher household incomes might be slightly less likely to vote for Clinton, which makes sense given that Clinton's party was associated with raising taxes, and higher earners would tend to have an aversion to raised taxes. That said, as the table shows, this effect is not statistically significant.

**Model Fit**

Our model has a root mean square error (RMSE) of about 6.35, meaning the average magnitude of the model's residuals is 6.35 percentage points. The model's data points sit fairly close to the regression line, indicating a good fit.
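For reference, the quantity computed in the code above is

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$

i.e., the square root of the mean squared residual, so it is expressed in the same units as ClintonVote.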

## Task 2

We need a new model that adds Region as a predictor. After that, we'll make a regression table displaying both models with the modelsummary package, since modelsummary's regression table automatically displays AIC and BIC. We'll compare AIC and BIC and see which model we prefer.

```r
# We'll fit our second MLR model and put both models in a list for modelsummary
m1 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)
m2 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)
models <- list(m1, m2)

# Now our regression table
modelsummary(models)
```
(Model 1 in column (1), model 2 in column (2); standard errors in parentheses.)

|                 | (1)            | (2)            |
|-----------------|----------------|----------------|
| (Intercept)     | 0.991 (6.014)  | 14.806 (7.159) |
| College         | 1.125 (0.225)  | 1.189 (0.257)  |
| HouseholdIncome | -0.076 (0.153) | -0.429 (0.171) |
| NonWhite        | 0.431 (0.082)  | 0.525 (0.082)  |
| RegionNE        |                | 8.031 (2.721)  |
| RegionS         |                | -3.404 (2.742) |
| RegionW         |                | 5.658 (2.772)  |
| Num.Obs.        | 50             | 50             |
| R2              | 0.616          | 0.722          |
| R2 Adj.         | 0.590          | 0.684          |
| AIC             | 336.8          | 326.5          |
| BIC             | 346.3          | 341.8          |
| Log.Lik.        | -163.383       | -155.243       |
| RMSE            | 6.35           | 5.40           |

**Comparing AIC and BIC**

Our first model had an AIC of 336.8 and a BIC of 346.3; our new model has an AIC of 326.5 and a BIC of 341.8. Strictly comparing these criteria, model 2 fits better than model 1, since it has both a lower AIC and a lower BIC. The lower AIC means model 2 achieves a better balance of fit and complexity; the lower BIC means model 2 still fits better even under BIC's heavier penalty for additional parameters.
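As a quick sanity check, both criteria can be reproduced from the log-likelihoods in the table above (this sketch assumes k counts every estimated parameter, i.e., the coefficients plus the residual variance):

```r
# AIC = 2k - 2*logLik and BIC = k*log(n) - 2*logLik,
# where k = number of coefficients + 1 for the residual variance
k <- length(coef(m2)) + 1                # 8 parameters for model 2
n <- nobs(m2)                            # 50 states
2 * k - 2 * as.numeric(logLik(m2))       # 326.5, matching the AIC above
k * log(n) - 2 * as.numeric(logLik(m2))  # 341.8, matching the BIC above
```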

**Interpretation of the modelsummary table**

Model 1 has an R-squared of 0.616 and an adjusted R-squared of 0.590. This means that College, HouseholdIncome, and NonWhite explain 61.6% of the variance in ClintonVote, and that only drops to 59.0% when adjusting for the number of predictors, a 2.6-point decrease. Model 2 has an R-squared of 0.722 and an adjusted R-squared of 0.684. This means that College, HouseholdIncome, NonWhite, and Region explain 72.2% of the variance in ClintonVote, and that number only drops to 68.4% when adjusting for the number of predictors, a 3.8-point decrease.
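For reference, the adjustment applies the formula below, where n is the number of observations and k the number of predictors; plugging in n = 50 and k = 6 for model 2 reproduces approximately the 0.684 above:

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}$$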

Model 1 has an RMSE of 6.35; as we stated previously, this means the average magnitude of its residuals is 6.35. Model 2 has an RMSE of 5.40, which means the average magnitude of its residuals is lower than model 1's: its data points sit closer to the regression line, so it fits better by this measure as well.

**Preferred model**

While I do agree with the principle of Occam's razor, I do not prefer the simpler model in this case. The problems that come with adding too many variables, such as overfitting and loss of interpretability, are not present here. The adjusted R-squared does not fall drastically from the unadjusted R-squared in model 2, and even after adjusting for the number of predictors, model 2 still explains more of the variability in ClintonVote than model 1 does. As noted when analyzing AIC and BIC, model 2 fits better even after penalizing for complexity, and its data points are closer to the regression line on average, which is reflected in the lower RMSE. Along with fitting better and explaining more of the variation in the dependent variable, model 2 remains interpretable and sensible. For these reasons, I prefer model 2 over model 1.

## Task 3

This time we must fit a model predicting ClintonVote with just College, NonWhite, and their interaction. We'll then use the effects package to visualize this interaction and interpret the resulting plot.

```r
# Fit the interaction model
m.interact <- lm(ClintonVote ~ College + NonWhite + College:NonWhite, data = States)

# Evaluate the interaction over a grid of NonWhite values and plot it
interaction_plot <- effect("College*NonWhite", m.interact,
                           xlevels = list(NonWhite = seq(0, 90, 10)))
plot(interaction_plot, main = "Interaction Effect",
     xlab = "% Non-White", ylab = "% Clinton Vote")
```

What can we tell from this visualization? All panels show a positive slope, so across all levels of a state's college-educated percentage, the percentage of votes for Clinton tends to increase as that state's non-white population increases. No matter what percentage of college-educated individuals reside in a state, we can predict that an increase in the state's non-white population will increase the Clinton vote share.

From the shaded confidence bands in each panel, we see that the panels at 22%, 30%, and 37% college education have narrower intervals than the panel at 51% college education. This means there is less uncertainty in NonWhite's ability to predict Clinton votes in those less-educated states.

It is clear that, in any given state, the percentage of the non-white population has at least some positive effect on the percentage of Clinton votes across all levels of college education. While the effect appears at every college-educated percentage, it is most precisely estimated in states with lower percentages. We can interpret this to mean that education does not have as much of an effect on voting patterns as race does, at least when analyzing Clinton voters. These findings are very useful and might prompt future research on the hypothesis that college education does not predict political alignment.
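As an aside, since the setup loads the interactions package, the same interaction can be drawn as a single multi-line plot with interact_plot(); a minimal sketch (treating College as the moderator, mirroring the panels above):

```r
# Plot NonWhite on the x-axis with separate lines for low/mean/high College,
# including confidence bands around each line
interact_plot(m.interact, pred = NonWhite, modx = College, interval = TRUE)
```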

## Task 4

We are going to create a binary outcome variable that tells us whether a state voted for Clinton. With this new variable as the dependent variable, we will perform a logistic regression with College, HouseholdIncome, and Region as the independent variables. Finally, we will create two regression tables, one with sjPlot and one with modelsummary, and interpret the odds ratios and model fit.

```r
# Creating our binary outcome variable: 1 if the state voted Clinton, 0 otherwise.
# as.numeric(States$Elect2016) codes the two outcomes as 1 and 2; subtracting 2
# and taking the absolute value flips that to 1/0.
BOVClinton <- abs(as.numeric(States$Elect2016) - 2)
table(BOVClinton)
## BOVClinton
##  0  1 
## 30 20
```
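A more explicit recode would do the same job; this sketch assumes, as the 1/2 coding above implies, that the first factor level of Elect2016 is the Clinton outcome:

```r
# Equivalent recode: 1 for the first factor level (Clinton), 0 otherwise
BOVClinton <- ifelse(as.numeric(States$Elect2016) == 1, 1, 0)
```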
```r
# It worked nicely; on to our logistic regression
m.logit <- glm(BOVClinton ~ College + HouseholdIncome + Region,
               data = States, family = binomial(link = "logit"))

# Let's make our regression table with modelsummary
modelsummary(m.logit)
```
(Log-odds coefficients; standard errors in parentheses.)

|                 | (1)             |
|-----------------|-----------------|
| (Intercept)     | -13.883 (4.578) |
| College         | 0.254 (0.130)   |
| HouseholdIncome | 0.056 (0.066)   |
| RegionNE        | 3.022 (1.413)   |
| RegionS         | 0.313 (1.589)   |
| RegionW         | 3.021 (1.431)   |
| Num.Obs.        | 50              |
| AIC             | 44.0            |
| BIC             | 55.5            |
| RMSE            | 0.32            |
```r
# Now one with sjPlot
tab_model(m.logit)
```
**BOVClinton**

| Predictors      | Odds Ratios | CI            | p     |
|-----------------|-------------|---------------|-------|
| (Intercept)     | 0.00        | 0.00 – 0.00   | 0.002 |
| College         | 1.29        | 1.01 – 1.71   | 0.050 |
| HouseholdIncome | 1.06        | 0.93 – 1.22   | 0.399 |
| Region [NE]     | 20.54       | 1.70 – 599.19 | 0.032 |
| Region [S]      | 1.37        | 0.04 – 32.70  | 0.844 |
| Region [W]      | 20.52       | 1.55 – 500.93 | 0.035 |
| Observations    | 50          |               |       |
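Note that the two tables report the same model on different scales: modelsummary shows log-odds coefficients, while tab_model exponentiates them into odds ratios. Exponentiating the modelsummary estimates reproduces sjPlot's column:

```r
# exp() converts log-odds to odds ratios, e.g. exp(0.254) = 1.29 for College
# and exp(3.022) = 20.54 for RegionNE, matching the tab_model() output
exp(coef(m.logit))
```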

**Odds Ratios**

Intercept: The odds ratio is 0.00 here, meaning that when College and HouseholdIncome are 0 and a state is in the baseline region, the baseline odds of that state voting Clinton are extremely low. This is statistically significant at a p-value of 0.002.

College: An odds ratio of 1.29 indicates that for every one-percentage-point increase in a state's college-educated population, the odds of that state voting Clinton increase by 29%, holding the other predictors constant. This is statistically significant at a p-value of 0.050.

HouseholdIncome: An odds ratio of 1.06 would suggest that every one-unit increase in a state's household income increases its odds of voting Clinton by 6%; however, it is important to note that this effect is not statistically significant, as the p-value is 0.399.

Region[NE]: The odds of a northeastern state voting Clinton are 20.54 times those of a state in the baseline (omitted) region. This will make more sense when comparing with the other regions. This effect is also statistically significant, with a p-value of 0.032.

Region[S]: The odds of a southern state voting Clinton are only 1.37 times those of the baseline region, far lower than for the northeast. It is important to note, though, that this effect is not statistically significant, as its p-value is 0.844.

Region[W]: The odds of a western state voting Clinton are 20.52 times those of the baseline region, similar to the northeast. This effect is statistically significant at a p-value of 0.035.

Conclusion: Looking at the odds ratios of our model, it is clear that the percentage of college-educated individuals residing in a state, and whether that state is located in the northeast or the west of the USA, are factors that significantly increase the odds of that state voting for Clinton. We cannot say with statistical confidence that household income, or being in the southern region, increases those odds.

**Model Fit**

We will only look at RMSE to assess model fit, as we have no other model to compare against (which AIC and BIC analysis requires). A root mean square error of 0.32 means that the model's predicted probabilities differ from the observed 0/1 outcomes by only 0.32 units on average. This lets us know that our model has a good fit in terms of the average magnitude of its residuals.
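To make that concrete, the reported RMSE can be reproduced by hand; this sketch assumes modelsummary computes RMSE on the response scale, i.e. from the gap between the observed 0/1 outcomes and the fitted probabilities:

```r
# Residuals on the response scale: observed 0/1 minus fitted probability
sqrt(mean((BOVClinton - fitted(m.logit))^2))
## should match the 0.32 reported by modelsummary above
```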

## Task 5

Using our same logistic regression model from Task 4, we will plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then we will use sjPlot to visualize the odds ratios of all our predictors (College, HouseholdIncome, and Region).

```r
# Recall, our model from Task 4 is called "m.logit".
# effect() evaluates predicted probabilities over the observed range of College;
# the NOTE below is expected, since m.logit contains no College:Region interaction.
PredProb_m.logit <- effect("College*Region", m.logit,
                           xlevels = list(College = seq(min(States$College),
                                                        max(States$College), by = 1)))
## NOTE: College:Region does not appear in the model
plot(PredProb_m.logit,
     main = "Predicted Probabilities of Voting for Clinton by College Education and Region",
     xlab = "% College Educated", ylab = "Predicted Probability of Voting Clinton",
     multiline = TRUE, rug = FALSE, ci.style = "bands")
```

```r
# What a gorgeous visualization! Now to make our sjPlot visualization with the
# plot_model() command, which by default plots the exponentiated (odds ratio)
# estimates for a glm
plot_model(m.logit)

# Here we can see all the odds ratios at a glance
```
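If the exact values are hard to read off the default plot, plot_model() can also print them next to each point (show.values and value.offset are standard plot_model arguments; the offset value here is just a guess that may need tuning):

```r
# Label each estimate with its odds ratio; value.offset nudges labels off the points
plot_model(m.logit, show.values = TRUE, value.offset = 0.3)
```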

## End

Thanks for reading my markdown!
Mateja Dokic