library("tidyverse")
library("dplyr")
library("htmltools")
library("forcats")
library("stargazer")
library("sjPlot")
library("car")

14

Add at least one quadratic term into your model and interpret the results. Is it significant? What is the effect of a 1-unit increase in that variable at its mean value?

Returning to the graphs in the preliminary analysis from Homework8_9 and looking for variables that might be better explained by a quadratic term does not yield any candidates. For the purposes of answering the problem, I’ll pick the % of the population with 300 min of exercise of more per week (act300), and determine whether states with moderate percentages of Republican identified individuals show higher numbers of those at this activity level and whether this percentage decreases for states with larger percentages of Republican identified individuals.

RObs300Fru <- lm(R ~ wgtObs + conFru + act300, Pty_Hlth)
RObs300sqFru <- lm(R ~ wgtObs + conFru + act300 + I(act300^2), Pty_Hlth)
sjt.lm(RObs300sqFru)
    R
    B CI p
(Intercept)   -5.57 -121.00 – 109.85 .923
wgtObs   0.71 -0.24 – 1.66 .138
conFru   0.65 -0.10 – 1.39 .087
act300   -0.56 -7.25 – 6.14 .868
I(act300^2)   0.01 -0.09 – 0.12 .776
Observations   51
R2 / adj. R2   .312 / .253
sjt.lm(RObs300Fru)
    R
    B CI p
(Intercept)   -21.00 -60.24 – 18.23 .287
wgtObs   0.70 -0.23 – 1.64 .137
conFru   0.66 -0.07 – 1.39 .075
act300   0.39 -0.25 – 1.04 .226
Observations   51
R2 / adj. R2   .311 / .267

As expected based on the graphical inspection, exploring act300 as a quadratic term did not better explain the variable’s interaction with the dependent variable. The \(\beta\) is near 0 and did not switch sign indicating it does not decline in states with higher % of Republican identified individuals. The adjusted \(R^2\) value has also decreased because of the addition of this quadratic term that does not fit the data.

15

Add at least one interaction term to you model and interpret the results. Is it significant? What is the effect of a 1-unit increase in one of those interacted variables holding the other at its mean value?

Based on the findings in Q11 from Homework8_9 there appears to be an interactive effect between states with higher % of individuals living a sedentary lifestyle, and not consuming fruit in one’s diet and the % of individuals identifying as Republican. We will also look at a possible interaction between wgtObs & act0 due to the logical association between the two.

RObsFru0 <- lm(R ~ wgtObs + conFru + act0, Pty_Hlth)
RObsFru0_F0 <- lm(R ~ wgtObs + conFru + act0 + conFru * act0, Pty_Hlth)
RObsFru0_O0 <- lm(R ~ wgtObs + conFru + act0 + wgtObs * act0, Pty_Hlth)
sjt.lm(RObsFru0, RObsFru0_F0, RObsFru0_O0)
    R   R   R
    B CI p   B CI p   B CI p
(Intercept)   -1.25 -18.85 – 16.35 .887   -50.02 -163.41 – 63.38 .379   -33.20 -138.11 – 71.72 .527
wgtObs   0.80 -0.12 – 1.71 .086   0.75 -0.17 – 1.67 .108   1.69 -1.35 – 4.73 .269
conFru   0.87 0.11 – 1.63 .027   2.09 -0.82 – 5.01 .155   1.02 0.11 – 1.93 .029
act0   -0.71 -1.46 – 0.03 .060   1.05 -3.06 – 5.15 .611   0.46 -3.41 – 4.32 .813
conFru:act0       -0.04 -0.14 – 0.05 .385    
wgtObs:act0           -0.04 -0.17 – 0.09 .537
Observations   51   51   51
R2 / adj. R2   .341 / .299   .352 / .296   .347 / .290

With the tables side-by-side the results are somewhat confusing. When the lack of fruit consumption is interacting with sedentary lifestyle, most notable is the sign shift of act0 alone from -.71 to 1.05 with a concurrent drastic change in the y-intercept from -1.25 to -50.02, the only negative values remaining in the interaction model are the interaction itself and the y-intercept. We see similar changes in the interaction between wgtObs & act0 with slight successive decreases in fit (adj \(R^2\)) with the interaction models. The best explanation I can come up with for this phenomena is that in the reduced model without the interactions, the y-intercept indicates we start with states with a moderate % of Republicans, and states with more sedentary individuals is correlated with fewer % of the population as Republicans perhaps due to high density, urban coastal areas that are often liberal and the combination of convenience and lack of accessible outdoor recreation inclines people towards sedentary lifestyles, explaining the negative correlation. With the addition of interactive terms, the y-intercept jumps into very liberal territory at -50.02 and sedentary lifestyle has a far less significant but positive correlation due to it’s being fitted to a linear model that includes the confluence of lack of fruit consumption and sedentariness (indicating poor diet in addition to sedentary behavior) and being obese and sedentary respectively. Why the interaction terms are negative remains baffling, though the confidence intervals show that they span 0 and are statistically insignificant and thus may not be all too important. The lack of fruit being positively correlated with could be explained by the midwestern states where monocropped grains and factory farmed animals make up a large proportion of the agricultural product and therefore diet in contrast to places like Hawaii, California and coastal areas where fruit is more abundant.

16

Test either the model in 14 or the model in 15 using the F test for nested models. That is, estimate the full model with the variable and quadratic term, or the variable and interaction, and then estimate the reduced model without either, and run the F test to establish whether those variables significantly improve your model.

It’s apparent what the result is going to be given the Adjusted \(R^2\) values, but for the sake of practice the model with interaction between wgtObs & act0 will be tested against the reduced (original) model. \[F = \frac{(R_c^2 - R_r^2) / df_1}{(1-R_c^2)/df_2}\]

rr_org <- summary(RObsFru0)$r.squared
rc_int <- summary(RObsFru0_O0)$r.squared
(f <- ((rc_int - rr_org)/1)/((1 - rc_int)/(51 - 4 - 1)))
## [1] 0.3867096
anova(RObsFru0, RObsFru0_O0)
## Analysis of Variance Table
## 
## Model 1: R ~ wgtObs + conFru + act0
## Model 2: R ~ wgtObs + conFru + act0 + wgtObs * act0
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     47 2352.1                           
## 2     46 2332.5  1    19.609 0.3867 0.5371

The \(F_{stat}\) and \(P_{value}\) from the F-test indicate that the interaction model was not an improvement over the original model towards a better fit for the explaining the dependent variable in a statistically significant way.

1

Using the anes_2008tr.csv dataset in Course Resources, model vote_rep (whether the respondent voted Republican in the last election) as a function of age, race, income, and ideology.

anes <- read_csv(file = "anes_2008tr.csv")
Rep_AgeRacIncIde <- glm(vote_rep ~ age + race_white + income + ideology_con, family = "binomial", 
    anes)
sjt.glm(Rep_AgeRacIncIde)
    vote_rep
    Odds Ratio CI p
(Intercept)   0.00 0.00 – 0.00 <.001
age   1.00 1.00 – 1.01 .258
race_white   11.43 8.28 – 16.01 <.001
income   1.50 1.30 – 1.74 <.001
ideology_con   2.83 2.50 – 3.24 <.001
Observations   1539

a

What’s the probability of voting Republican for a white person of average age, income, and ideology?

I’m unsure as to whether the probability this question is asking for is best represented by the interacting terms model, or by a prediction based on the original model, so I’ve used both methods. Intuitively, I would guess that the prediction is what is being asked for.

Repint_rWhAgeIncIde <- glm(vote_rep ~ age * race_white * income * ideology_con, family = "binomial", 
    anes)
sjt.glm(Repint_rWhAgeIncIde)
    vote_rep
    Odds Ratio CI p
(Intercept)   0.15 0.00 – 209.35 .682
age   0.92 0.79 – 1.13 .396
race_white   0.01 0.00 – 1476.87 .415
income   0.56 0.04 – 16.88 .718
ideology_con   0.53 0.09 – 4.81 .544
age:race_white   1.11 0.88 – 1.35 .331
age:income   1.02 0.95 – 1.08 .587
race_white:income   2.45 0.05 – 81.87 .640
age:ideology_con   1.02 0.98 – 1.06 .323
race_white:ideology_con   6.55 0.48 – 70.19 .143
income:ideology_con   1.35 0.64 – 2.59 .410
age:race_white:income   0.97 0.91 – 1.06 .512
age:race_white:ideology_con   0.98 0.93 – 1.03 .331
age:income:ideology_con   1.00 0.98 – 1.01 .577
race_white:income:ideology_con   0.77 0.34 – 1.87 .546
age:race_white:income:ideology_con   1.00 0.99 – 1.02 .577
Observations   1539
input <- data.frame(race_white = 1, age = mean(anes$age), income = mean(anes$income), 
    ideology_con = mean(anes$ideology_con))
(Wh <- predict.glm(Repint_rWhAgeIncIde, input, type = "response"))
##         1 
## 0.4924732

b

What’s the change in probability of voting Republican for a person of average age, income, and ideology who switches from black to white?
input$race_white <- 0
(dp_BtoWh <- Wh - (nWh <- predict.glm(Repint_rWhAgeIncIde, input, type = "response")))
##         1 
## 0.4057051
(dodds_BtoWh <- Wh/nWh)
##        1 
## 5.675739

The change in probability is +~.41 between a black person and a white person going to vote Republican, or in terms of odds, a white person is ~5.68 times more likely to vote Republican than a black person.

c

Using the \(e^\beta\) formula from the lesson, what’s the effect on the odds ratio of shifting from black to white?

\[\begin{aligned} \hat{P}(y=1|x=a) - \hat{P}(y=1|x=b) = \textrm{invlogit}(a) - \textrm{invlogit}(b) =\\\frac{e^{\beta_0 + \beta_1 a_1 + \beta_2 a_2 + \beta_3 a_3 + \beta_4 a_4}}{1+e^{\beta_0 + \beta_1 a_1 + \beta_2 a_2 + \beta_3 a_3 + \beta_4 a_4}} - \frac{e^{\beta_0 + \beta_1 b_1 + \beta_2 b_2 + \beta_3 b_3 + \beta_4 b_4}}{1+e^{\beta_0 + \beta_1 b_1 + \beta_2 b_2 + \beta_3 b_3 + \beta_4 b_4}} \end{aligned}\] \[\begin{aligned} \frac{e^{\beta_0 + \beta_1 a}}{1+e^{\beta_0 + \beta_1 a}} - \frac{e^{\beta_0 + \beta_1 b}}{1+e^{\beta_0 + \beta_1 b}} \\ \frac{e^{-8.077 + 0.005(46.451)+2.437(1)+0.405(2.705)+1.042(4.098)}}{1+e^{-8.077 + 0.005(46.451)+2.437(0)+0.405(2.705)+1.042(4.098)}} - \frac{e^{-8.077 + 0.005(46.451)+2.437(0)+0.405(2.705)+1.042(4.098)}}{1+e^{-8.077 + 0.005(46.451)+2.437(0)+0.405(2.705)+1.042(4.098)}} \end{aligned}\] Using R to solve that craziness…

b_yint <- Rep_AgeRacIncIde$coef[1]
b_age <- Rep_AgeRacIncIde$coef[2]
b_rWh <- Rep_AgeRacIncIde$coef[3]
b_inc <- Rep_AgeRacIncIde$coef[4]
b_ide <- Rep_AgeRacIncIde$coef[5]
(`1cdodds_BtoWh` <- (exp(b_yint + b_age * input[2] + b_rWh * 1 + b_inc * input[3] + 
    b_ide * input[4])/(1 + exp(b_yint + b_age * input[2] + b_rWh * 1 + b_inc * input[3] + 
    b_ide * input[4]))) - (exp(b_yint + b_age * input[2] + b_rWh * 0 + b_inc * input[3] + 
    b_ide * input[4])/(1 + exp(b_yint + b_age * input[2] + b_rWh * 0 + b_inc * input[3] + 
    b_ide * input[4]))))
##         age
## 1 0.4096499

d

What has a greater effect on the probability of voting Republican: an age increase of 50 years, or an increase of one income bracket? (You may choose your own baseline, such as from 25 years below average to 25 years above average; and similarly for income.)
dage_lw <- input
dage_hi <- input
dinc_lw <- input
dinc_hi <- input
dage_lw[2] <- input[2] - 25
dage_hi[2] <- input[2] + 25
dinc_lw[3] <- input[3] - 1
dinc_hi[3] <- input[3] + 1
(`50yrs` <- predict.glm(Repint_rWhAgeIncIde, dage_hi, type = "response") - predict.glm(Repint_rWhAgeIncIde, 
    dage_lw, type = "response"))
##          1 
## 0.03083505
(inc_1 <- predict.glm(Repint_rWhAgeIncIde, dinc_hi, type = "response") - predict.glm(Repint_rWhAgeIncIde, 
    dinc_lw, type = "response"))
##        1 
## 0.116571
inc_1/`50yrs`
##       1 
## 3.78047

If you are of non-white race, the change in the probability of voting Republican from an age of 21.5 to 71.5 is +~3%, whereas the the change in probability of voting Republican from one income bracket below average to one income bracket above average is +~11%. In other words, a non-white person is ~3.78 times more likely to vote Republican by moving up two income brackets than by aging 50 years.

dage_lw[1] <- 1
dage_hi[1] <- 1
dinc_lw[1] <- 1
dinc_hi[1] <- 1
(`50yrswh` <- predict.glm(Repint_rWhAgeIncIde, dage_hi, type = "response") - predict.glm(Repint_rWhAgeIncIde, 
    dage_lw, type = "response"))
##          1 
## 0.01946875
(inc_1wh <- predict.glm(Repint_rWhAgeIncIde, dinc_hi, type = "response") - predict.glm(Repint_rWhAgeIncIde, 
    dinc_lw, type = "response"))
##         1 
## 0.1293148
inc_1wh/`50yrswh`
##        1 
## 6.642172

If you are white, the change in the probability of voting Republican from an age of 21.5 to 71.5 is +~2%, whereas the the change in probability of voting Republican from one income bracket below average to one income bracket above average is +~13%. In other words, a white person is ~6.64 times more likely to vote Republican by moving up two income brackets than by aging 50 years.

e

Now run the regression with all the other variables in anes_2008tr (except for voted) How do your coefficients change? What do you think explains any coefficient that became or lost significance?
Rep_AgeRacGenEduIncIdePid <- glm(vote_rep ~ age + race_white + gender_male + education + 
    income + ideology_con + partyid_rep, family = "binomial", anes)
sjt.glm(Rep_AgeRacGenEduIncIdePid, Rep_AgeRacIncIde)
    vote_rep   vote_rep
    Odds Ratio CI p   Odds Ratio CI p
(Intercept)   0.00 0.00 – 0.00 <.001   0.00 0.00 – 0.00 <.001
age   1.02 1.00 – 1.03 .005   1.00 1.00 – 1.01 .258
race_white   5.09 3.43 – 7.62 <.001   11.43 8.28 – 16.01 <.001
gender_male   0.87 0.60 – 1.26 .458        
education   1.02 0.90 – 1.16 .761        
income   1.28 1.06 – 1.55 .010   1.50 1.30 – 1.74 <.001
ideology_con   1.67 1.41 – 1.98 <.001   2.83 2.50 – 3.24 <.001
partyid_rep   2.45 2.19 – 2.75 <.001        
Observations   1539   1539
Rep_AgeRacGenEduIncIdePid$terms
## vote_rep ~ age + race_white + gender_male + education + income + 
##     ideology_con + partyid_rep
## attr(,"variables")
## list(vote_rep, age, race_white, gender_male, education, income, 
##     ideology_con, partyid_rep)
## attr(,"factors")
##              age race_white gender_male education income ideology_con
## vote_rep       0          0           0         0      0            0
## age            1          0           0         0      0            0
## race_white     0          1           0         0      0            0
## gender_male    0          0           1         0      0            0
## education      0          0           0         1      0            0
## income         0          0           0         0      1            0
## ideology_con   0          0           0         0      0            1
## partyid_rep    0          0           0         0      0            0
##              partyid_rep
## vote_rep               0
## age                    0
## race_white             0
## gender_male            0
## education              0
## income                 0
## ideology_con           0
## partyid_rep            1
## attr(,"term.labels")
## [1] "age"          "race_white"   "gender_male"  "education"   
## [5] "income"       "ideology_con" "partyid_rep" 
## attr(,"order")
## [1] 1 1 1 1 1 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
## attr(,"predvars")
## list(vote_rep, age, race_white, gender_male, education, income, 
##     ideology_con, partyid_rep)
## attr(,"dataClasses")
##     vote_rep          age   race_white  gender_male    education 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric" 
##       income ideology_con  partyid_rep 
##    "numeric"    "numeric"    "numeric"
sumCom <- summary(Rep_AgeRacGenEduIncIdePid)
sjt.lm(Rep_AgeRacGenEduIncIdePid, Rep_AgeRacIncIde)
    vote_rep   vote_rep
    B CI p   B CI p
(Intercept)   -8.65 -9.87 – -7.50 <.001   -8.08 -9.01 – -7.19 <.001
age   0.01 0.00 – 0.03 .005   0.00 -0.00 – 0.01 .258
race_white   1.63 1.23 – 2.03 <.001   2.44 2.11 – 2.77 <.001
gender_male   -0.14 -0.51 – 0.23 .458    
education   0.02 -0.11 – 0.15 .761    
income   0.25 0.06 – 0.44 .010   0.40 0.26 – 0.55 <.001
ideology_con   0.51 0.35 – 0.68 <.001   1.04 0.92 – 1.18 <.001
partyid_rep   0.89 0.79 – 1.01 <.001    
Observations   1539   1539
R2 / adj. R2   .525 / .729   .402 / .558
Considering each factor individually:
  • age
  • Age is a significant factor in the model that includes all available variables and insignificant in the reduced model, indicating that age plays a significant role in voting Republican when gender, education and party_id are factored in. However, it’s influence accounts for only a 2% change in voting Republican with each additional year.
  • race_white
  • Race is by and large the most significant and influential predictive variable in voting Republican. In the reduced model, it accounts for ~11.4x increase in the chance of voting Republican. When gender, education, and party_id are included the influence of the variable is mitigated to increasing the odds by ~5.1x. This could be due to a suppression effect with the addition of the liberalizing factor, education (liberalizing for all but those whom attained a high school degree and decided not to continue their education), or a sharing of influence with party_id where higher values also indicate conservatism and is therefore also closely linked to voting Republican.
  • gender_male
  • The addition of gender has a insignificant slightly negative effect on the probability of voting Republican in the model, this is likely due to a suppression effect in a model with significantly stronger predictors.
  • education
  • Education has a insignificant albeit slightly positive effect on P(vote_rep), perhaps due to it’s inclusion in a linear model when we know from previous exploration that it has a quadratic effect.
  • income
  • Income is a stronger, highly significant influence in the reduced model, and retains it’s significance though with a more muted effect in the complete model. This could be due to multicollinearity between education and income, given we know education has a quadratic effect. In other words, given those who have more education are likely to have greater income, and are also less likely to vote Republican, the effect of income would be suppressed in a model where age is statistically significant and that includes education.
  • ideology_con
  • A conservative ideology is going to exhibit colinearity with the response (dependent) variable which we can observe reflected in its significance level in both models. It’s effect is mitigated in the complete model, possibly due to its chained causation with the stronger indicator of party_id.
  • partyid_rep
  • Being affiliated with the Republican party is going to have the strongest colinearity with voting Republican, reflected in it’s significance and placeholder as the 2nd highest coefficient. It’s magnitude of effect is second to that of being white. However, this variable has a wide-range of values which are minimally explained:
    partyid_rep is the party ID of the respondent, with Republican being the higher value
    I could not find in the ANES documentation what variable and or question on the survey this was matched to, which would help to better explain it.