DACSS 603: Introduction to Quantitative Analysis final research project
# load gss 2021 dataset
GSS2021 <- gss_get_yr(2021)
Understanding Americans’ Opinion on the Legal Right to an Abortion
Abortion has been a social and political topic of contention for decades in the U.S. Documentation on the practice of abortion shows the practice predates the conception of America, and it wasn’t until the 1860s that states began passing anti-abortion laws (Alliance for Perinatal Research and Services, 1979). Abortion was first nationally banned in 1910, but it is well understood that abortions never stopped happening. In the 1960s, 11 states moved toward legalizing abortion, and in 1973, the Supreme Court decided on the Roe v. Wade case, which established legal right to abortion in the U.S. However, in the past few years, states have begun passing restrictive laws attempting to regulate and limit abortion, like Texas’ SB 8, which effectively bans abortion past 6 weeks of pregnancy (Planned Parenthood, 2022). Now, nearly half of U.S. states are moving toward establishing restrictive abortion laws of varying degrees (NYT, 2022). Politically, the debate has evolved to be broadly framed as a ‘red vs. blue’ issue. However, it’s not as clear how the general public distribute around abortion rights, and if it correlates to political identification in the same nature as it seems to with politicians.
How does political identification relate to one’s opinion that abortion should be legal, regardless of the reason for abortion, in the United States? And are there other demographics that relate to opinions on the matter?
There is no relationship between political ID and belief in the right to legal abortion regardless of reasoning.
GSS 2021 codebook to select and understand variables: https://gss.norc.org/Documents/codebook/GSS%202021%20Codebook.pdf
# select variables of focus from gss 2021 dataset
vars <- select(GSS2021, age, sex, educ, income, raceacs1, raceacs2, raceacs3, raceacs4, raceacs5, raceacs6, raceacs7, raceacs8, raceacs9, raceacs10, raceacs11, raceacs12, raceacs13, raceacs14, raceacs15, raceacs16, partyid, abany, abanyg)
# the gss2021 dataset did not include race variable but had separated variables by race
# recode each race to unique number
vars$raceacs1[vars$raceacs1 == "0"] <- " "
vars$raceacs2[vars$raceacs2 == "0"] <- " "
vars$raceacs2[vars$raceacs2 == "1"] <- "2"
vars$raceacs3[vars$raceacs3 == "0"] <- " "
vars$raceacs3[vars$raceacs3 == "1"] <- "3"
vars$raceacs4[vars$raceacs4 == "0"] <- " "
vars$raceacs4[vars$raceacs4 == "1"] <- "4"
vars$raceacs5[vars$raceacs5 == "0"] <- " "
vars$raceacs5[vars$raceacs5 == "1"] <- "5"
vars$raceacs6[vars$raceacs6 == "0"] <- " "
vars$raceacs6[vars$raceacs6 == "1"] <- "6"
vars$raceacs7[vars$raceacs7 == "0"] <- " "
vars$raceacs7[vars$raceacs7 == "1"] <- "7"
vars$raceacs8[vars$raceacs8 == "0"] <- " "
vars$raceacs8[vars$raceacs8 == "1"] <- "8"
vars$raceacs9[vars$raceacs9 == "0"] <- " "
vars$raceacs9[vars$raceacs9 == "1"] <- "9"
vars$raceacs10[vars$raceacs10 == "0"] <- " "
vars$raceacs10[vars$raceacs10 == "1"] <- "10"
vars$raceacs11[vars$raceacs11 == "0"] <- " "
vars$raceacs11[vars$raceacs11 == "1"] <- "11"
vars$raceacs12[vars$raceacs12 == "0"] <- " "
vars$raceacs12[vars$raceacs12 == "1"] <- "12"
vars$raceacs13[vars$raceacs13 == "0"] <- " "
vars$raceacs13[vars$raceacs13 == "1"] <- "13"
vars$raceacs14[vars$raceacs14 == "0"] <- " "
vars$raceacs14[vars$raceacs14 == "1"] <- "14"
vars$raceacs15[vars$raceacs15 == "0"] <- " "
vars$raceacs15[vars$raceacs15 == "1"] <- "15"
vars$raceacs16[vars$raceacs16 == "0"] <- " "
vars$raceacs16[vars$raceacs16 == "1"] <- "16"
# create one race variable by combining the re-coded columns
vars$race <- paste(vars$raceacs1, vars$raceacs2, vars$raceacs3, vars$raceacs4, vars$raceacs5, vars$raceacs6, vars$raceacs7, vars$raceacs8, vars$raceacs9, vars$raceacs10, vars$raceacs11, vars$raceacs12, vars$raceacs13, vars$raceacs14, vars$raceacs15, vars$raceacs16)
vars$race <- as.numeric(vars$race)
# abortion question was separated into 2 groups
# consolidate to one variable/column
vars$abany[is.na(vars$abany)] <- ""
vars$abanyg[is.na(vars$abanyg)] <- ""
vars$aban <- paste(vars$abany, vars$abanyg)
vars$aban <- as.numeric(vars$aban)
gss <- select(vars, age, sex, educ, income, race, partyid, aban)
Explanatory variable is party identification, and this decision is explained in the background section above.
Additional variables to be considered (controls) are age, sex, educ, income, race.
The selection of these additional variables were each motivated by social/political/historical context. First, gender was chosen because the question wording explicitly asks about the legal right for “a pregnant woman” to obtain an abortion. Therefore, it should be tested whether women and men hold differing views on the matter. I included age because abortion rights generally directly effect those in ‘child-bearing’ years (I’m not saying it does not effect everyone in some way though). Also, Roe v. Wade was passed in 1973, so perhaps this question may evoke strong opinions from older generations. Previous studies have indicated a correlation between higher education and greater support for abortion rights. Income has well-documented levels of privilege in the U.S., and it is often suggested that regardless of restrictive abortion bills being passed, those in upper classes will have the means to be able to obtain safe abortions elsewhere. Lastly, historically, women of color and women of lower socioeconomic statuses have higher rates of abortion, so it’s curious how opinion may trend by race and income.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3780732/
Outcome variable is belief in legal abortion for any reason. –> Specifying “for any reason” because the GSS included multiple questions on legal abortions that specified specific scenarios to permit legal abortion. This question asked about support for legal abortion regardless of the reason.
The question posed by the survey:
“Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if the woman wants it for any reason?”
# summarize variables
summary(gss)
age sex educ income
Min. :18.00 Min. :1.000 Min. : 0.00 Min. : 1.00
1st Qu.:37.00 1st Qu.:1.000 1st Qu.:12.00 1st Qu.:12.00
Median :53.00 Median :2.000 Median :15.00 Median :12.00
Mean :52.16 Mean :1.559 Mean :14.77 Mean :11.25
3rd Qu.:66.00 3rd Qu.:2.000 3rd Qu.:16.00 3rd Qu.:12.00
Max. :89.00 Max. :2.000 Max. :20.00 Max. :13.00
NA's :333 NA's :92 NA's :66 NA's :443
race partyid aban
Min. : 1.000 Min. :0.000 Min. :1.000
1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:1.000
Median : 1.000 Median :3.000 Median :1.000
Mean : 2.015 Mean :2.776 Mean :1.419
3rd Qu.: 1.000 3rd Qu.:5.000 3rd Qu.:2.000
Max. :16.000 Max. :7.000 Max. :2.000
NA's :224 NA's :32 NA's :1404
# remove NAs from abortion variable because those respondents were not polled on the topic
gss <- gss %>% filter(!is.na(aban))
summary(gss)
age sex educ income
Min. :18.00 Min. :1.000 Min. : 0.00 Min. : 1.00
1st Qu.:37.00 1st Qu.:1.000 1st Qu.:12.00 1st Qu.:12.00
Median :53.00 Median :2.000 Median :15.00 Median :12.00
Mean :52.08 Mean :1.554 Mean :14.84 Mean :11.28
3rd Qu.:67.00 3rd Qu.:2.000 3rd Qu.:16.00 3rd Qu.:12.00
Max. :89.00 Max. :2.000 Max. :20.00 Max. :13.00
NA's :194 NA's :56 NA's :41 NA's :283
race partyid aban
Min. : 1.000 Min. :0.000 Min. :1.000
1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:1.000
Median : 1.000 Median :3.000 Median :1.000
Mean : 1.973 Mean :2.755 Mean :1.419
3rd Qu.: 1.000 3rd Qu.:5.000 3rd Qu.:2.000
Max. :16.000 Max. :7.000 Max. :2.000
NA's :154 NA's :20
# A tibble: 73 x 2
age counts
<dbl> <int>
1 18 4
2 19 10
3 20 11
4 21 18
5 22 16
6 23 18
7 24 25
8 25 26
9 26 35
10 27 22
# ... with 63 more rows
# A tibble: 3 x 2
sex counts
<dbl> <int>
1 1 1148
2 2 1424
3 NA 56
# A tibble: 21 x 2
educ counts
<dbl> <int>
1 0 6
2 1 1
3 2 1
4 3 1
5 5 1
6 6 8
7 7 2
8 8 15
9 9 24
10 10 34
# ... with 11 more rows
# A tibble: 14 x 2
income counts
<dbl> <int>
1 1 51
2 2 32
3 3 11
4 4 7
5 5 15
6 6 7
7 7 14
8 8 38
9 9 107
10 10 82
11 11 98
12 12 1584
13 13 299
14 NA 283
# A tibble: 15 x 2
race counts
<dbl> <int>
1 1 1986
2 2 268
3 3 12
4 4 25
5 5 30
6 6 9
7 7 5
8 8 15
9 9 3
10 10 12
11 12 1
12 14 2
13 15 20
14 16 86
15 NA 154
# A tibble: 9 x 2
partyid counts
<dbl> <int>
1 0 559
2 1 356
3 2 303
4 3 509
5 4 206
6 5 253
7 6 347
8 7 75
9 NA 20
# A tibble: 2 x 2
aban counts
<dbl> <int>
1 1 1528
2 2 1100
# Visualize response frequency by political ID
aborpar <- gss %>%
group_by(aban, partyid) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n)) %>% na.omit(partyid)
aborpar$aban[aborpar$aban == "1"] <- "Yes"
aborpar$aban[aborpar$aban == "2"] <- "No"
aborpar$partyid[aborpar$partyid == "0"] <- "Strong Dem"
aborpar$partyid[aborpar$partyid == "1"] <- "Weak Dem"
aborpar$partyid[aborpar$partyid == "2"] <- "Lean Dem"
aborpar$partyid[aborpar$partyid == "3"] <- "Ind/Neither"
aborpar$partyid[aborpar$partyid == "4"] <- "Lean Rep"
aborpar$partyid[aborpar$partyid == "5"] <- "Weak Rep"
aborpar$partyid[aborpar$partyid == "6"] <- "Strong Rep"
aborpar$partyid[aborpar$partyid == "7"] <- "Other"
Levels <- c("Strong Dem", "Weak Dem", "Lean Dem", "Ind/Neither", "Lean Rep", "Weak Rep", "Strong Rep", "Other")
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
ggplot(data = aborpar, aes(x=aban, y=Freq, fill=factor(partyid, levels = unique(Levels)))) +
geom_bar(stat="identity", position=position_dodge())+
labs(title = "Should it be possible for one to obtain a legal abortion for ANY reason?", subtitle = "'Yes' or 'No' response frequency by political identification", x = "Resonse", y = "Frequency") +
theme_light() + scale_fill_manual(values=cbPalette, guide = guide_legend("Party ID"))
Based on the graph, it does seem that No responses skew Republican and Yes responses skew Democrat, indicating that more liberal identification experiences a greater frequency of support for legal abortion regardless of reasoning.
plot(gss)
new <- gss
# abortion ques -- yes = 1, no = 0
new$aban[new$aban == "2"] <- "0"
new$aban <- as.factor(new$aban)
# sex -- female = 0, male = 1
new$sex[new$sex == "2"] <- "0"
new$sex <- as.factor(new$sex)
# race -- nonwhite = 0, white = 1
new$race <- replace(new$race, new$race>1, 0)
new$race <- as.factor(new$race)
# partyid -- remove 'other', then recode levels
# 1=strong dem, 2=not very strong dem, 3=independent(lean dem), 4=independent/neither, 5=independent(lean rep), 6=not very strong rep, 7=strong rep
new <- new %>% filter(!str_detect(partyid, "7"))
new$partyid <- new$partyid + 1
new$partyid <- as.factor(new$partyid)
# income
new$income <- as.integer(new$income)
new <- new %>% filter(!is.na(income))
new$income <- as.factor(new$income)
# age and years of education are numbers, which is good
# now, look at summary of variables again
summary(new)
age sex educ income
Min. :18.00 0 :1232 Min. : 0.00 12 :1536
1st Qu.:38.00 1 : 996 1st Qu.:13.00 13 : 284
Median :53.00 NA's: 37 Median :15.00 9 : 102
Mean :52.58 Mean :14.93 11 : 94
3rd Qu.:67.00 3rd Qu.:16.00 10 : 79
Max. :89.00 Max. :20.00 1 : 50
NA's :125 NA's :21 (Other): 120
race partyid aban
0 : 408 1:515 0: 938
1 :1724 2:322 1:1327
NA's: 133 3:277
4:419
5:191
6:227
7:314
There are a lot of NAs for age and and race. What does it look like to remove all NAs
age sex educ income race
Min. :18.00 0:1101 Min. : 0.00 12 :1405 0: 377
1st Qu.:39.00 1: 905 1st Qu.:13.00 13 : 230 1:1629
Median :54.00 Median :16.00 9 : 87
Mean :53.05 Mean :14.98 11 : 80
3rd Qu.:67.00 3rd Qu.:16.00 10 : 67
Max. :89.00 Max. :20.00 1 : 39
(Other): 98
partyid aban
1:456 0: 833
2:298 1:1173
3:253
4:344
5:161
6:207
7:287
There’s still a large amount of responses in the sample and for the most part (other than income and education - but removing NAs wasn’t the cause of that) the summary shows that there’s a fair amount of representation for each factor/category. So I will proceed with the set that omits NAs.
# check that variables are proper structures
str(new2)
tibble [2,006 x 7] (S3: tbl_df/tbl/data.frame)
$ age : num [1:2006] 61 37 23 71 20 65 37 43 44 30 ...
..- attr(*, "label")= chr "age of respondent"
..- attr(*, "format.stata")= chr "%8.0g"
$ sex : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 1 1 ...
$ educ : num [1:2006] 16 11 15 16 14 12 14 20 16 16 ...
..- attr(*, "label")= chr "highest year of school completed"
..- attr(*, "format.stata")= chr "%8.0g"
$ income : Factor w/ 13 levels "1","2","3","4",..: 8 12 12 13 12 11 12 12 13 12 ...
$ race : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 2 2 1 1 ...
$ partyid: Factor w/ 7 levels "1","2","3","4",..: 3 4 6 1 1 1 3 4 2 2 ...
$ aban : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 2 2 ...
- attr(*, "na.action")= 'omit' Named int [1:259] 1 2 9 10 18 21 29 32 33 34 ...
..- attr(*, "names")= chr [1:259] "1" "2" "9" "10" ...
xtabs(~aban + income, data=new2)
income
aban 1 2 3 4 5 6 7 8 9 10 11 12 13
0 21 11 6 2 5 4 2 19 45 36 35 582 65
1 18 16 4 5 6 2 6 10 42 31 45 823 165
xtabs(~aban + sex, data=new2)
sex
aban 0 1
0 453 380
1 648 525
xtabs(~aban + educ, data=new2)
educ
aban 0 2 3 6 7 8 9 10 11 12 13 14 15 16 17 18
0 0 0 1 3 1 2 8 11 20 201 58 141 44 190 43 59
1 2 1 0 2 0 6 9 13 16 178 75 142 54 331 90 131
educ
aban 19 20
0 16 35
1 35 88
xtabs(~aban + age, data=new2)
age
aban 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
0 0 5 0 2 1 5 1 4 6 4 7 11 11 12 6 16 10 18 10 18 14
1 2 1 7 6 9 8 14 16 18 13 27 25 20 16 22 22 23 18 26 26 16
age
aban 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
0 13 13 16 14 20 20 9 9 15 12 13 19 13 10 16 14 18 14 18 18 18
1 20 13 28 28 18 17 12 23 16 17 16 15 24 19 20 15 24 15 20 17 29
age
aban 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
0 14 15 17 19 20 10 13 27 14 17 20 14 14 14 15 18 6 11 9 9 9
1 23 21 22 31 15 26 18 28 26 25 29 17 16 17 22 16 13 8 8 16 6
age
aban 81 82 83 84 85 86 87 88 89
0 14 4 6 4 6 5 4 2 10
1 2 3 2 9 3 0 1 3 6
xtabs(~aban + partyid, data=new2)
partyid
aban 1 2 3 4 5 6 7
0 84 78 58 168 109 126 210
1 372 220 195 176 52 81 77
xtabs(~aban + race, data=new2)
race
aban 0 1
0 149 684
1 228 945
Only 19 respondents place between 0 and 7 years of education, so I may want to remove those samples…
Now I’ll fit a binomial logistic regression model, starting with partyid as the only explanatory variable.
# model just with partyid
fit1 <- glm(aban ~ partyid, family = binomial(link = "logit"), data = new2)
summary(fit1)
Call:
glm(formula = aban ~ partyid, family = binomial(link = "logit"),
data = new2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8394 -0.9964 0.6381 0.7791 1.6221
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4881 0.1208 12.318 <2e-16 ***
partyid2 -0.4512 0.1788 -2.524 0.0116 *
partyid3 -0.2755 0.1923 -1.433 0.1518
partyid4 -1.4416 0.1619 -8.901 <2e-16 ***
partyid5 -2.2282 0.2074 -10.745 <2e-16 ***
partyid6 -1.9299 0.1867 -10.334 <2e-16 ***
partyid7 -2.4914 0.1798 -13.853 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2723.0 on 2005 degrees of freedom
Residual deviance: 2340.9 on 1999 degrees of freedom
AIC: 2354.9
Number of Fisher Scoring iterations: 4
So, with the model only including party identification as an explanatory variable shows that as party id increases (which progresses toward more independent than conservative id), there is decrease in log odds of the response to legal abortion being Yes. Also, the p-value is well below the significance level .05, which indicates the relationship is significant.
# compute McFadden's r-squared to evaluate predictive power of model
pR2(fit1)["McFadden"]
fitting null model for pseudo-r2
McFadden
0.1403234
Using the rule of thumb that values over .4 indicate well-fitting model, that’s not great.
varImp(fit1)
Overall
partyid2 2.523674
partyid3 1.433090
partyid4 8.901361
partyid5 10.745494
partyid6 10.334255
partyid7 13.853439
Conservative identification scores high in variable importance.
chisq.test(new2$partyid, new2$aban)
Pearson's Chi-squared test
data: new2$partyid and new2$aban
X-squared = 368.49, df = 6, p-value < 2.2e-16
chisq.bintest(aban ~ partyid, new2)
Pearson's Chi-squared test
data: aban by partyid
X-squared = 368.4864, df = 6, p-value < 2.2e-16
alternative hypothesis: true difference in probabilities is not equal to 0
sample estimates:
proba in group 1 proba in group 2 proba in group 3 proba in group 4
0.8157895 0.7382550 0.7707510 0.5116279
proba in group 5 proba in group 6 proba in group 7
0.3229814 0.3913043 0.2682927
Pairwise comparisons using Pearson's Chi-squared tests with Yates' continuity correction
1 2 3 4 5 6
2 1.796e-02 - - - - -
3 2.112e-01 4.339e-01 - - - -
4 3.400e-19 1.015e-08 3.726e-10 - - -
5 9.706e-30 3.483e-17 1.058e-18 1.606e-04 - -
6 1.487e-26 2.397e-14 7.271e-16 1.045e-02 0.2360 -
7 3.996e-48 8.312e-29 6.360e-30 1.547e-09 0.2768 0.007277
P value adjustment method: fdr
Chi-squared test of independence and chisq binomial test produce a p-value of 2.2e-16 which is well-below the 0.05 level of significance, so I can conclude that a relationship exists between political ID and response to the abortion question.
With the first fitted model, I’ll demonstrate how it may be used to predict response to the abortion question.
# Response from Strong Republican
new2 %>% summarise(partyid = '7') %>% predict(fit1, newdata = .) -> pred_vote
cat("Log-odds:", pred_vote)
Log-odds: -1.003302
new2 %>% summarise(partyid = '7') %>% predict(fit1, newdata = ., type = 'response') -> pred_prob
cat("Probability:", pred_prob)
Probability: 0.2682927
For a respondent who identifies as a strong republican, there will be a decrease in the log(odds) that the respondent says Yes by about 1. The probability that the respondent says Yes is 0.27.
# Response from Strong Democrat
new2 %>% summarise(partyid = '1') %>% predict(fit1, newdata = .) -> pred_vote
cat("Log-odds:", pred_vote)
Log-odds: 1.488077
new2 %>% summarise(partyid = '1') %>% predict(fit1, newdata = ., type = 'response') -> pred_prob
cat("Probability:", pred_prob)
Probability: 0.8157895
The model predicts that the probability that a strong democrat would say Yes to the abortion question is 0.82, and the log(odds) of a Yes response increases by 1.49.
plot(allEffects(fit1))
Let’s see how a model looks with all variables discussed.
# model with all variables
fit2 <- glm(aban ~ partyid + age + educ + income + race + sex, family = binomial(link = "logit"), data = new2)
summary(fit2)
Call:
glm(formula = aban ~ partyid + age + educ + income + race + sex,
family = binomial(link = "logit"), data = new2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1892 -0.9421 0.5253 0.8122 2.0871
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.718963 0.490079 1.467 0.14237
partyid2 -0.577799 0.185849 -3.109 0.00188 **
partyid3 -0.400866 0.199391 -2.010 0.04438 *
partyid4 -1.517882 0.171386 -8.857 < 2e-16 ***
partyid5 -2.365754 0.216123 -10.946 < 2e-16 ***
partyid6 -2.194851 0.197622 -11.106 < 2e-16 ***
partyid7 -2.662291 0.190225 -13.995 < 2e-16 ***
age -0.017774 0.003208 -5.541 3e-08 ***
educ 0.046670 0.020790 2.245 0.02478 *
income2 0.740003 0.551015 1.343 0.17928
income3 -0.148126 0.802940 -0.184 0.85364
income4 1.198394 1.021467 1.173 0.24071
income5 0.319761 0.755723 0.423 0.67221
income6 -1.068413 0.963065 -1.109 0.26726
income7 1.583028 0.977425 1.620 0.10532
income8 -0.276314 0.552560 -0.500 0.61703
income9 0.142837 0.424602 0.336 0.73657
income10 0.181180 0.443968 0.408 0.68320
income11 0.702179 0.431453 1.627 0.10364
income12 0.789631 0.360673 2.189 0.02857 *
income13 1.296339 0.398359 3.254 0.00114 **
race1 0.446119 0.136203 3.275 0.00106 **
sex1 0.070250 0.105189 0.668 0.50423
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2723.0 on 2005 degrees of freedom
Residual deviance: 2248.4 on 1983 degrees of freedom
AIC: 2294.4
Number of Fisher Scoring iterations: 4
Using a threshold of .05 singificance, sex is the only variable that is not influential. Although, with .01 level of significance, education would also get cut.
# check for multicollinearity
vif(fit2)
GVIF Df GVIF^(1/(2*Df))
partyid 1.236950 6 1.017879
age 1.101638 1 1.049589
educ 1.147707 1 1.071311
income 1.276925 12 1.010238
race 1.127343 1 1.061764
sex 1.037975 1 1.018810
No VIF above 5, so it’s safe to assume there is not multicollinearity occurring.
# compute McFadden's r-squared to evaluate predictive power of model
pR2(fit2)["McFadden"]
fitting null model for pseudo-r2
McFadden
0.1743009
This model has lower AIC and higher r-squared than the first, but still not great.
varImp(fit2)
Overall
partyid2 3.1089779
partyid3 2.0104536
partyid4 8.8565241
partyid5 10.9463391
partyid6 11.1063009
partyid7 13.9954637
age 5.5413175
educ 2.2448588
income2 1.3429814
income3 0.1844791
income4 1.1732095
income5 0.4231191
income6 1.1093890
income7 1.6195901
income8 0.5000615
income9 0.3364008
income10 0.4080937
income11 1.6274758
income12 2.1893286
income13 3.2541980
race1 3.2754111
sex1 0.6678427
When it comes to variable importance, partyid and age are highest, followed by income and race, then education.
plot(allEffects(fit2))
It’s surprising that there isn’t a strong relationship with gender, but I will remove it for the next model.
fit4 <- glm(aban ~ partyid + age + educ + income + race, family = binomial(link = "logit"), data = new2)
summary(fit4)
Call:
glm(formula = aban ~ partyid + age + educ + income + race, family = binomial(link = "logit"),
data = new2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2074 -0.9433 0.5295 0.8082 2.1024
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.735986 0.489183 1.505 0.132447
partyid2 -0.572146 0.185631 -3.082 0.002055 **
partyid3 -0.390295 0.198694 -1.964 0.049495 *
partyid4 -1.511857 0.171101 -8.836 < 2e-16 ***
partyid5 -2.356515 0.215666 -10.927 < 2e-16 ***
partyid6 -2.185015 0.197005 -11.091 < 2e-16 ***
partyid7 -2.651258 0.189419 -13.997 < 2e-16 ***
age -0.017663 0.003202 -5.517 3.46e-08 ***
educ 0.046713 0.020798 2.246 0.024703 *
income2 0.735874 0.550505 1.337 0.181312
income3 -0.160451 0.800665 -0.200 0.841170
income4 1.190128 1.018479 1.169 0.242591
income5 0.317291 0.757389 0.419 0.675269
income6 -1.081627 0.961948 -1.124 0.260838
income7 1.553559 0.975758 1.592 0.111350
income8 -0.278662 0.551858 -0.505 0.613593
income9 0.138778 0.424100 0.327 0.743495
income10 0.175816 0.443372 0.397 0.691705
income11 0.699884 0.430985 1.624 0.104393
income12 0.792766 0.360113 2.201 0.027705 *
income13 1.309390 0.397440 3.295 0.000986 ***
race1 0.444193 0.136178 3.262 0.001107 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2723.0 on 2005 degrees of freedom
Residual deviance: 2248.8 on 1984 degrees of freedom
AIC: 2292.8
Number of Fisher Scoring iterations: 4
# compute variable importance
varImp(fit4)
Overall
partyid2 3.0821635
partyid3 1.9643025
partyid4 8.8360580
partyid5 10.9266982
partyid6 11.0911416
partyid7 13.9967817
age 5.5165696
educ 2.2460090
income2 1.3367261
income3 0.2003974
income4 1.1685343
income5 0.4189279
income6 1.1244138
income7 1.5921564
income8 0.5049512
income9 0.3272287
income10 0.3965427
income11 1.6239174
income12 2.2014378
income13 3.2945577
race1 3.2618457
plot(allEffects(fit4))
Because the income variable is so disproportionate (majority of sample makes highest level of income option provided), the lower thresholds have huge margins and may negatively affect prediction.
# see how removing income effects the model
fit5 <- glm(aban ~ partyid + age + educ + race, family = binomial(link = "logit"), data = new2)
summary(fit5)
Call:
glm(formula = aban ~ partyid + age + educ + race, family = binomial(link = "logit"),
data = new2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1382 -0.9522 0.5627 0.8267 1.9932
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.872156 0.365297 2.388 0.016962 *
partyid2 -0.569781 0.183168 -3.111 0.001866 **
partyid3 -0.393017 0.195900 -2.006 0.044834 *
partyid4 -1.524097 0.168754 -9.031 < 2e-16 ***
partyid5 -2.296167 0.212374 -10.812 < 2e-16 ***
partyid6 -2.085436 0.193890 -10.756 < 2e-16 ***
partyid7 -2.558796 0.185707 -13.779 < 2e-16 ***
age -0.016485 0.003148 -5.236 1.64e-07 ***
educ 0.080721 0.019395 4.162 3.16e-05 ***
race1 0.463468 0.133679 3.467 0.000526 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2723.0 on 2005 degrees of freedom
Residual deviance: 2286.4 on 1996 degrees of freedom
AIC: 2306.4
Number of Fisher Scoring iterations: 4
That caused both the AIC and McFadden’s r-squared to decrease, which is not good… but not by a significant amount. The ranges of income provided as options, with the highest being “greater than $25,000 annual income, is inaccurate to 2021 income average.
plot(allEffects(fit5))
stargazer(fit1, fit2, fit4, fit5, type = "text")
=============================================================
Dependent variable:
-------------------------------------------
aban
(1) (2) (3) (4)
-------------------------------------------------------------
partyid2 -0.451** -0.578*** -0.572*** -0.570***
(0.179) (0.186) (0.186) (0.183)
partyid3 -0.276 -0.401** -0.390** -0.393**
(0.192) (0.199) (0.199) (0.196)
partyid4 -1.442*** -1.518*** -1.512*** -1.524***
(0.162) (0.171) (0.171) (0.169)
partyid5 -2.228*** -2.366*** -2.357*** -2.296***
(0.207) (0.216) (0.216) (0.212)
partyid6 -1.930*** -2.195*** -2.185*** -2.085***
(0.187) (0.198) (0.197) (0.194)
partyid7 -2.491*** -2.662*** -2.651*** -2.559***
(0.180) (0.190) (0.189) (0.186)
age -0.018*** -0.018*** -0.016***
(0.003) (0.003) (0.003)
educ 0.047** 0.047** 0.081***
(0.021) (0.021) (0.019)
income2 0.740 0.736
(0.551) (0.551)
income3 -0.148 -0.160
(0.803) (0.801)
income4 1.198 1.190
(1.021) (1.018)
income5 0.320 0.317
(0.756) (0.757)
income6 -1.068 -1.082
(0.963) (0.962)
income7 1.583 1.554
(0.977) (0.976)
income8 -0.276 -0.279
(0.553) (0.552)
income9 0.143 0.139
(0.425) (0.424)
income10 0.181 0.176
(0.444) (0.443)
income11 0.702 0.700
(0.431) (0.431)
income12 0.790** 0.793**
(0.361) (0.360)
income13 1.296*** 1.309***
(0.398) (0.397)
race1 0.446*** 0.444*** 0.463***
(0.136) (0.136) (0.134)
sex1 0.070
(0.105)
Constant 1.488*** 0.719 0.736 0.872**
(0.121) (0.490) (0.489) (0.365)
-------------------------------------------------------------
Observations 2,006 2,006 2,006 2,006
Log Likelihood -1,170.450 -1,124.189 -1,124.413 -1,143.206
Akaike Inf. Crit. 2,354.900 2,294.379 2,292.825 2,306.412
=============================================================
Note: *p<0.1; **p<0.05; ***p<0.01
I used logistic regression due to the fact that the outcome variable, opinion on right to legal abortion, is binary - the response can take one of two values, Yes or No. I denoted Yes as 1 and No as 0.
All variables were coded numerically. Age and years of education were numeric variables, while sex, race, party ID, and income were factors. First, I fit a model with just party identification as the explanatory variable, to evaluate effects on abortion opinion without accounting for other variables. While the model reflected a significant relationship between the two based upon low p-values, the r-squared value was rather low, suggesting the model is not as reliable for prediction as preferred. The coefficients in the model indicate a reduction in log(odds) that one will support legal abortion as party identification trends conservative. The model fits also reflect evidence that increase in age is associated with decreased support of the abortion question, race of white increased odds of support versus non-white respondents, and increased level of education increased odds of support for legal abortion.
When fitting a model with all selected variables, each had significant p-values besides sex. However, AIC did not decrease much compared to the simple regression model, nor did the r-squared increase to a level of reliable model predictability. Removing sex and income did not improve the model fit either. Model comparison shows insignificant changes in log likelihood across all of the models I fit. With that being said, while there is a statistically significant relationship between party identification and support for legal abortion (for any reason), as evidenced by significant p-values in the logistic regression models and chi-squared test, the explanatory variables used in this study did not produce a statistically reliable predictive model.
Further research is needed to explore additional variables that may be contributing factors to support for abortion, to create a stronger predictive model.