DACSS 603: Final Project

DACSS 603: Introduction to Quantitative Analysis final research project

Megan Georges
2022-05-14

2021 General Social Survey (GSS)

# load gss 2021 dataset 
GSS2021 <- gss_get_yr(2021)

Title

Understanding Americans’ Opinion on the Legal Right to an Abortion

Background

Abortion has been a social and political topic of contention for decades in the U.S. Documentation on the practice of abortion shows the practice predates the conception of America, and it wasn’t until the 1860s that states began passing anti-abortion laws (Alliance for Perinatal Research and Services, 1979). Abortion was first nationally banned in 1910, but it is well understood that abortions never stopped happening. In the 1960s, 11 states moved toward legalizing abortion, and in 1973, the Supreme Court decided on the Roe v. Wade case, which established legal right to abortion in the U.S. However, in the past few years, states have begun passing restrictive laws attempting to regulate and limit abortion, like Texas’ SB 8, which effectively bans abortion past 6 weeks of pregnancy (Planned Parenthood, 2022). Now, nearly half of U.S. states are moving toward establishing restrictive abortion laws of varying degrees (NYT, 2022). Politically, the debate has evolved to be broadly framed as a ‘red vs. blue’ issue. However, it’s not as clear how the general public distribute around abortion rights, and if it correlates to political identification in the same nature as it seems to with politicians.

Research Question

How does political identification relate to one’s opinion that abortion should be legal, regardless of the reason for abortion, in the United States? And are there other demographics that relate to opinions on the matter?

Hypothesis

There is no relationship between political ID and belief in the right to legal abortion regardless of reasoning.

Variables

GSS 2021 codebook to select and understand variables: https://gss.norc.org/Documents/codebook/GSS%202021%20Codebook.pdf

# select variables of focus from gss 2021 dataset
vars <- select(GSS2021, age, sex, educ, income, raceacs1, raceacs2, raceacs3, raceacs4, raceacs5, raceacs6, raceacs7, raceacs8, raceacs9, raceacs10, raceacs11, raceacs12, raceacs13, raceacs14,  raceacs15, raceacs16, partyid, abany, abanyg)
# the gss2021 dataset did not include race variable but had separated variables by race
# recode each race to unique number
vars$raceacs1[vars$raceacs1 == "0"] <- " "
vars$raceacs2[vars$raceacs2 == "0"] <- " "
vars$raceacs2[vars$raceacs2 == "1"] <- "2"
vars$raceacs3[vars$raceacs3 == "0"] <- " "
vars$raceacs3[vars$raceacs3 == "1"] <- "3"
vars$raceacs4[vars$raceacs4 == "0"] <- " "
vars$raceacs4[vars$raceacs4 == "1"] <- "4"
vars$raceacs5[vars$raceacs5 == "0"] <- " "
vars$raceacs5[vars$raceacs5 == "1"] <- "5"
vars$raceacs6[vars$raceacs6 == "0"] <- " "
vars$raceacs6[vars$raceacs6 == "1"] <- "6"
vars$raceacs7[vars$raceacs7 == "0"] <- " "
vars$raceacs7[vars$raceacs7 == "1"] <- "7"
vars$raceacs8[vars$raceacs8 == "0"] <- " "
vars$raceacs8[vars$raceacs8 == "1"] <- "8"
vars$raceacs9[vars$raceacs9 == "0"] <- " "
vars$raceacs9[vars$raceacs9 == "1"] <- "9"
vars$raceacs10[vars$raceacs10 == "0"] <- " "
vars$raceacs10[vars$raceacs10 == "1"] <- "10"
vars$raceacs11[vars$raceacs11 == "0"] <- " "
vars$raceacs11[vars$raceacs11 == "1"] <- "11"
vars$raceacs12[vars$raceacs12 == "0"] <- " "
vars$raceacs12[vars$raceacs12 == "1"] <- "12"
vars$raceacs13[vars$raceacs13 == "0"] <- " "
vars$raceacs13[vars$raceacs13 == "1"] <- "13"
vars$raceacs14[vars$raceacs14 == "0"] <- " "
vars$raceacs14[vars$raceacs14 == "1"] <- "14"
vars$raceacs15[vars$raceacs15 == "0"] <- " "
vars$raceacs15[vars$raceacs15 == "1"] <- "15"
vars$raceacs16[vars$raceacs16 == "0"] <- " "
vars$raceacs16[vars$raceacs16 == "1"] <- "16"

# create one race variable by combining the re-coded columns 
vars$race <- paste(vars$raceacs1, vars$raceacs2, vars$raceacs3, vars$raceacs4, vars$raceacs5, vars$raceacs6, vars$raceacs7, vars$raceacs8, vars$raceacs9, vars$raceacs10, vars$raceacs11, vars$raceacs12, vars$raceacs13, vars$raceacs14, vars$raceacs15, vars$raceacs16)
vars$race <- as.numeric(vars$race)
# abortion question was separated into 2 groups
# consolidate to one variable/column
vars$abany[is.na(vars$abany)] <- ""
vars$abanyg[is.na(vars$abanyg)] <- ""
vars$aban <- paste(vars$abany, vars$abanyg)
vars$aban <- as.numeric(vars$aban)
gss <- select(vars, age, sex, educ, income, race, partyid, aban)

Explanatory variable is party identification, and this decision is explained in the background section above.

Additional variables to be considered (controls) are age, sex, educ, income, race.

The selection of these additional variables were each motivated by social/political/historical context. First, gender was chosen because the question wording explicitly asks about the legal right for “a pregnant woman” to obtain an abortion. Therefore, it should be tested whether women and men hold differing views on the matter. I included age because abortion rights generally directly effect those in ‘child-bearing’ years (I’m not saying it does not effect everyone in some way though). Also, Roe v. Wade was passed in 1973, so perhaps this question may evoke strong opinions from older generations. Previous studies have indicated a correlation between higher education and greater support for abortion rights. Income has well-documented levels of privilege in the U.S., and it is often suggested that regardless of restrictive abortion bills being passed, those in upper classes will have the means to be able to obtain safe abortions elsewhere. Lastly, historically, women of color and women of lower socioeconomic statuses have higher rates of abortion, so it’s curious how opinion may trend by race and income.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3780732/

Outcome variable is belief in legal abortion for any reason. –> Specifying “for any reason” because the GSS included multiple questions on legal abortions that specified specific scenarios to permit legal abortion. This question asked about support for legal abortion regardless of the reason.

Understanding the Explanatory Variables

Understanding the Outcome Variable

The question posed by the survey:

“Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if the woman wants it for any reason?”

# summarize variables 
summary(gss)
      age             sex             educ           income     
 Min.   :18.00   Min.   :1.000   Min.   : 0.00   Min.   : 1.00  
 1st Qu.:37.00   1st Qu.:1.000   1st Qu.:12.00   1st Qu.:12.00  
 Median :53.00   Median :2.000   Median :15.00   Median :12.00  
 Mean   :52.16   Mean   :1.559   Mean   :14.77   Mean   :11.25  
 3rd Qu.:66.00   3rd Qu.:2.000   3rd Qu.:16.00   3rd Qu.:12.00  
 Max.   :89.00   Max.   :2.000   Max.   :20.00   Max.   :13.00  
 NA's   :333     NA's   :92      NA's   :66      NA's   :443    
      race           partyid           aban      
 Min.   : 1.000   Min.   :0.000   Min.   :1.000  
 1st Qu.: 1.000   1st Qu.:1.000   1st Qu.:1.000  
 Median : 1.000   Median :3.000   Median :1.000  
 Mean   : 2.015   Mean   :2.776   Mean   :1.419  
 3rd Qu.: 1.000   3rd Qu.:5.000   3rd Qu.:2.000  
 Max.   :16.000   Max.   :7.000   Max.   :2.000  
 NA's   :224      NA's   :32      NA's   :1404   
# remove NAs from abortion variable because those respondents were not polled on the topic
gss <- gss %>% filter(!is.na(aban))
summary(gss)
      age             sex             educ           income     
 Min.   :18.00   Min.   :1.000   Min.   : 0.00   Min.   : 1.00  
 1st Qu.:37.00   1st Qu.:1.000   1st Qu.:12.00   1st Qu.:12.00  
 Median :53.00   Median :2.000   Median :15.00   Median :12.00  
 Mean   :52.08   Mean   :1.554   Mean   :14.84   Mean   :11.28  
 3rd Qu.:67.00   3rd Qu.:2.000   3rd Qu.:16.00   3rd Qu.:12.00  
 Max.   :89.00   Max.   :2.000   Max.   :20.00   Max.   :13.00  
 NA's   :194     NA's   :56      NA's   :41      NA's   :283    
      race           partyid           aban      
 Min.   : 1.000   Min.   :0.000   Min.   :1.000  
 1st Qu.: 1.000   1st Qu.:1.000   1st Qu.:1.000  
 Median : 1.000   Median :3.000   Median :1.000  
 Mean   : 1.973   Mean   :2.755   Mean   :1.419  
 3rd Qu.: 1.000   3rd Qu.:5.000   3rd Qu.:2.000  
 Max.   :16.000   Max.   :7.000   Max.   :2.000  
 NA's   :154      NA's   :20                     

Summary of variables

gss %>% group_by(age) %>% summarise(counts = n())
# A tibble: 73 x 2
     age counts
   <dbl>  <int>
 1    18      4
 2    19     10
 3    20     11
 4    21     18
 5    22     16
 6    23     18
 7    24     25
 8    25     26
 9    26     35
10    27     22
# ... with 63 more rows
gss %>% group_by(sex) %>% summarise(counts = n())
# A tibble: 3 x 2
    sex counts
  <dbl>  <int>
1     1   1148
2     2   1424
3    NA     56
gss %>% group_by(educ) %>% summarise(counts = n())
# A tibble: 21 x 2
    educ counts
   <dbl>  <int>
 1     0      6
 2     1      1
 3     2      1
 4     3      1
 5     5      1
 6     6      8
 7     7      2
 8     8     15
 9     9     24
10    10     34
# ... with 11 more rows
gss %>% group_by(income) %>% summarise(counts = n())
# A tibble: 14 x 2
   income counts
    <dbl>  <int>
 1      1     51
 2      2     32
 3      3     11
 4      4      7
 5      5     15
 6      6      7
 7      7     14
 8      8     38
 9      9    107
10     10     82
11     11     98
12     12   1584
13     13    299
14     NA    283
gss %>% group_by(race) %>% summarise(counts = n())
# A tibble: 15 x 2
    race counts
   <dbl>  <int>
 1     1   1986
 2     2    268
 3     3     12
 4     4     25
 5     5     30
 6     6      9
 7     7      5
 8     8     15
 9     9      3
10    10     12
11    12      1
12    14      2
13    15     20
14    16     86
15    NA    154
gss %>% group_by(partyid) %>% summarise(counts = n())
# A tibble: 9 x 2
  partyid counts
    <dbl>  <int>
1       0    559
2       1    356
3       2    303
4       3    509
5       4    206
6       5    253
7       6    347
8       7     75
9      NA     20
gss %>% group_by(aban) %>% summarise(counts = n())
# A tibble: 2 x 2
   aban counts
  <dbl>  <int>
1     1   1528
2     2   1100
# Visualize response frequency by political ID
aborpar <- gss %>%
  group_by(aban, partyid) %>%
  summarise(n = n()) %>%
  mutate(Freq = n/sum(n)) %>% na.omit(partyid)

aborpar$aban[aborpar$aban == "1"] <- "Yes"
aborpar$aban[aborpar$aban == "2"] <- "No"

aborpar$partyid[aborpar$partyid == "0"] <- "Strong Dem"
aborpar$partyid[aborpar$partyid == "1"] <- "Weak Dem"
aborpar$partyid[aborpar$partyid == "2"] <- "Lean Dem"
aborpar$partyid[aborpar$partyid == "3"] <- "Ind/Neither"
aborpar$partyid[aborpar$partyid == "4"] <- "Lean Rep"
aborpar$partyid[aborpar$partyid == "5"] <- "Weak Rep"
aborpar$partyid[aborpar$partyid == "6"] <- "Strong Rep"
aborpar$partyid[aborpar$partyid == "7"] <- "Other"

Levels <- c("Strong Dem", "Weak Dem", "Lean Dem", "Ind/Neither", "Lean Rep", "Weak Rep", "Strong Rep", "Other")

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

ggplot(data = aborpar, aes(x=aban, y=Freq, fill=factor(partyid, levels = unique(Levels)))) +
  geom_bar(stat="identity", position=position_dodge())+
  labs(title = "Should it be possible for one to obtain a legal abortion for ANY reason?", subtitle = "'Yes' or 'No' response frequency by political identification", x = "Resonse", y = "Frequency") +
  theme_light() + scale_fill_manual(values=cbPalette, guide = guide_legend("Party ID"))

Based on the graph, it does seem that No responses skew Republican and Yes responses skew Democrat, indicating that more liberal identification experiences a greater frequency of support for legal abortion regardless of reasoning.

plot(gss)

Recode variables and fix data structure for analysis

new <- gss
# abortion ques -- yes = 1, no = 0
new$aban[new$aban == "2"] <- "0"
new$aban <- as.factor(new$aban)

# sex -- female = 0, male = 1
new$sex[new$sex == "2"] <- "0"
new$sex <- as.factor(new$sex)

# race -- nonwhite = 0, white = 1
new$race <- replace(new$race, new$race>1, 0)
new$race <- as.factor(new$race)

# partyid -- remove 'other', then recode levels
# 1=strong dem, 2=not very strong dem, 3=independent(lean dem), 4=independent/neither, 5=independent(lean rep), 6=not very strong rep, 7=strong rep
new <- new %>% filter(!str_detect(partyid, "7"))
new$partyid <- new$partyid + 1
new$partyid <- as.factor(new$partyid)

# income 
new$income <- as.integer(new$income)
new <- new %>% filter(!is.na(income))
new$income <- as.factor(new$income)

# age and years of education are numbers, which is good
# now, look at summary of variables again
summary(new)
      age          sex            educ           income    
 Min.   :18.00   0   :1232   Min.   : 0.00   12     :1536  
 1st Qu.:38.00   1   : 996   1st Qu.:13.00   13     : 284  
 Median :53.00   NA's:  37   Median :15.00   9      : 102  
 Mean   :52.58               Mean   :14.93   11     :  94  
 3rd Qu.:67.00               3rd Qu.:16.00   10     :  79  
 Max.   :89.00               Max.   :20.00   1      :  50  
 NA's   :125                 NA's   :21      (Other): 120  
   race      partyid aban    
 0   : 408   1:515   0: 938  
 1   :1724   2:322   1:1327  
 NA's: 133   3:277           
             4:419           
             5:191           
             6:227           
             7:314           

There are a lot of NAs for age and and race. What does it look like to remove all NAs

new2 <- na.omit(new)
summary(new2)
      age        sex           educ           income     race    
 Min.   :18.00   0:1101   Min.   : 0.00   12     :1405   0: 377  
 1st Qu.:39.00   1: 905   1st Qu.:13.00   13     : 230   1:1629  
 Median :54.00            Median :16.00   9      :  87           
 Mean   :53.05            Mean   :14.98   11     :  80           
 3rd Qu.:67.00            3rd Qu.:16.00   10     :  67           
 Max.   :89.00            Max.   :20.00   1      :  39           
                                          (Other):  98           
 partyid aban    
 1:456   0: 833  
 2:298   1:1173  
 3:253           
 4:344           
 5:161           
 6:207           
 7:287           

There’s still a large amount of responses in the sample and for the most part (other than income and education - but removing NAs wasn’t the cause of that) the summary shows that there’s a fair amount of representation for each factor/category. So I will proceed with the set that omits NAs.

# check that variables are proper structures
str(new2)
tibble [2,006 x 7] (S3: tbl_df/tbl/data.frame)
 $ age    : num [1:2006] 61 37 23 71 20 65 37 43 44 30 ...
  ..- attr(*, "label")= chr "age of respondent"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ sex    : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 1 1 ...
 $ educ   : num [1:2006] 16 11 15 16 14 12 14 20 16 16 ...
  ..- attr(*, "label")= chr "highest year of school completed"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ income : Factor w/ 13 levels "1","2","3","4",..: 8 12 12 13 12 11 12 12 13 12 ...
 $ race   : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 2 2 1 1 ...
 $ partyid: Factor w/ 7 levels "1","2","3","4",..: 3 4 6 1 1 1 3 4 2 2 ...
 $ aban   : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 2 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:259] 1 2 9 10 18 21 29 32 33 34 ...
  ..- attr(*, "names")= chr [1:259] "1" "2" "9" "10" ...
xtabs(~aban + income, data=new2)
    income
aban   1   2   3   4   5   6   7   8   9  10  11  12  13
   0  21  11   6   2   5   4   2  19  45  36  35 582  65
   1  18  16   4   5   6   2   6  10  42  31  45 823 165
xtabs(~aban + sex, data=new2)
    sex
aban   0   1
   0 453 380
   1 648 525
xtabs(~aban + educ, data=new2)
    educ
aban   0   2   3   6   7   8   9  10  11  12  13  14  15  16  17  18
   0   0   0   1   3   1   2   8  11  20 201  58 141  44 190  43  59
   1   2   1   0   2   0   6   9  13  16 178  75 142  54 331  90 131
    educ
aban  19  20
   0  16  35
   1  35  88
xtabs(~aban + age, data=new2)
    age
aban 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
   0  0  5  0  2  1  5  1  4  6  4  7 11 11 12  6 16 10 18 10 18 14
   1  2  1  7  6  9  8 14 16 18 13 27 25 20 16 22 22 23 18 26 26 16
    age
aban 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
   0 13 13 16 14 20 20  9  9 15 12 13 19 13 10 16 14 18 14 18 18 18
   1 20 13 28 28 18 17 12 23 16 17 16 15 24 19 20 15 24 15 20 17 29
    age
aban 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
   0 14 15 17 19 20 10 13 27 14 17 20 14 14 14 15 18  6 11  9  9  9
   1 23 21 22 31 15 26 18 28 26 25 29 17 16 17 22 16 13  8  8 16  6
    age
aban 81 82 83 84 85 86 87 88 89
   0 14  4  6  4  6  5  4  2 10
   1  2  3  2  9  3  0  1  3  6
xtabs(~aban + partyid, data=new2)
    partyid
aban   1   2   3   4   5   6   7
   0  84  78  58 168 109 126 210
   1 372 220 195 176  52  81  77
xtabs(~aban + race, data=new2)
    race
aban   0   1
   0 149 684
   1 228 945

Only 19 respondents place between 0 and 7 years of education, so I may want to remove those samples…

Now I’ll fit a binomial logistic regression model, starting with partyid as the only explanatory variable.

# model just with partyid
fit1 <- glm(aban ~ partyid, family = binomial(link = "logit"), data = new2)
summary(fit1)

Call:
glm(formula = aban ~ partyid, family = binomial(link = "logit"), 
    data = new2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8394  -0.9964   0.6381   0.7791   1.6221  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.4881     0.1208  12.318   <2e-16 ***
partyid2     -0.4512     0.1788  -2.524   0.0116 *  
partyid3     -0.2755     0.1923  -1.433   0.1518    
partyid4     -1.4416     0.1619  -8.901   <2e-16 ***
partyid5     -2.2282     0.2074 -10.745   <2e-16 ***
partyid6     -1.9299     0.1867 -10.334   <2e-16 ***
partyid7     -2.4914     0.1798 -13.853   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2723.0  on 2005  degrees of freedom
Residual deviance: 2340.9  on 1999  degrees of freedom
AIC: 2354.9

Number of Fisher Scoring iterations: 4

So, with the model only including party identification as an explanatory variable shows that as party id increases (which progresses toward more independent than conservative id), there is decrease in log odds of the response to legal abortion being Yes. Also, the p-value is well below the significance level .05, which indicates the relationship is significant.

# compute McFadden's r-squared to evaluate predictive power of model
pR2(fit1)["McFadden"]
fitting null model for pseudo-r2
 McFadden 
0.1403234 

Using the rule of thumb that values over .4 indicate well-fitting model, that’s not great.

varImp(fit1)
           Overall
partyid2  2.523674
partyid3  1.433090
partyid4  8.901361
partyid5 10.745494
partyid6 10.334255
partyid7 13.853439

Conservative identification scores high in variable importance.

chisq.test(new2$partyid, new2$aban)

    Pearson's Chi-squared test

data:  new2$partyid and new2$aban
X-squared = 368.49, df = 6, p-value < 2.2e-16
chisq.bintest(aban ~ partyid, new2)

    Pearson's Chi-squared test

data:  aban by partyid 
X-squared = 368.4864, df = 6, p-value < 2.2e-16
alternative hypothesis: true difference in probabilities is not equal to 0 
sample estimates:
proba in group 1 proba in group 2 proba in group 3 proba in group 4 
       0.8157895        0.7382550        0.7707510        0.5116279 
proba in group 5 proba in group 6 proba in group 7 
       0.3229814        0.3913043        0.2682927 

        Pairwise comparisons using Pearson's Chi-squared tests with Yates' continuity correction

          1         2         3         4      5        6
2 1.796e-02         -         -         -      -        -
3 2.112e-01 4.339e-01         -         -      -        -
4 3.400e-19 1.015e-08 3.726e-10         -      -        -
5 9.706e-30 3.483e-17 1.058e-18 1.606e-04      -        -
6 1.487e-26 2.397e-14 7.271e-16 1.045e-02 0.2360        -
7 3.996e-48 8.312e-29 6.360e-30 1.547e-09 0.2768 0.007277

P value adjustment method: fdr

Chi-squared test of independence and chisq binomial test produce a p-value of 2.2e-16 which is well-below the 0.05 level of significance, so I can conclude that a relationship exists between political ID and response to the abortion question.

With the first fitted model, I’ll demonstrate how it may be used to predict response to the abortion question.

# Response from Strong Republican
new2 %>% summarise(partyid = '7') %>% predict(fit1, newdata = .) -> pred_vote
cat("Log-odds:", pred_vote)
Log-odds: -1.003302
new2 %>% summarise(partyid = '7') %>% predict(fit1, newdata = ., type = 'response') -> pred_prob
cat("Probability:", pred_prob)
Probability: 0.2682927

For a respondent who identifies as a strong republican, there will be a decrease in the log(odds) that the respondent says Yes by about 1. The probability that the respondent says Yes is 0.27.

# Response from Strong Democrat
new2 %>% summarise(partyid = '1') %>% predict(fit1, newdata = .) -> pred_vote
cat("Log-odds:", pred_vote)
Log-odds: 1.488077
new2 %>% summarise(partyid = '1') %>% predict(fit1, newdata = ., type = 'response') -> pred_prob
cat("Probability:", pred_prob)
Probability: 0.8157895

The model predicts that the probability that a strong democrat would say Yes to the abortion question is 0.82, and the log(odds) of a Yes response increases by 1.49.

plot(allEffects(fit1))

Let’s see how a model looks with all variables discussed.

# model with all variables
fit2 <- glm(aban ~ partyid + age + educ + income + race + sex, family = binomial(link = "logit"), data = new2)
summary(fit2)

Call:
glm(formula = aban ~ partyid + age + educ + income + race + sex, 
    family = binomial(link = "logit"), data = new2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1892  -0.9421   0.5253   0.8122   2.0871  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.718963   0.490079   1.467  0.14237    
partyid2    -0.577799   0.185849  -3.109  0.00188 ** 
partyid3    -0.400866   0.199391  -2.010  0.04438 *  
partyid4    -1.517882   0.171386  -8.857  < 2e-16 ***
partyid5    -2.365754   0.216123 -10.946  < 2e-16 ***
partyid6    -2.194851   0.197622 -11.106  < 2e-16 ***
partyid7    -2.662291   0.190225 -13.995  < 2e-16 ***
age         -0.017774   0.003208  -5.541    3e-08 ***
educ         0.046670   0.020790   2.245  0.02478 *  
income2      0.740003   0.551015   1.343  0.17928    
income3     -0.148126   0.802940  -0.184  0.85364    
income4      1.198394   1.021467   1.173  0.24071    
income5      0.319761   0.755723   0.423  0.67221    
income6     -1.068413   0.963065  -1.109  0.26726    
income7      1.583028   0.977425   1.620  0.10532    
income8     -0.276314   0.552560  -0.500  0.61703    
income9      0.142837   0.424602   0.336  0.73657    
income10     0.181180   0.443968   0.408  0.68320    
income11     0.702179   0.431453   1.627  0.10364    
income12     0.789631   0.360673   2.189  0.02857 *  
income13     1.296339   0.398359   3.254  0.00114 ** 
race1        0.446119   0.136203   3.275  0.00106 ** 
sex1         0.070250   0.105189   0.668  0.50423    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2723.0  on 2005  degrees of freedom
Residual deviance: 2248.4  on 1983  degrees of freedom
AIC: 2294.4

Number of Fisher Scoring iterations: 4

Using a threshold of .05 singificance, sex is the only variable that is not influential. Although, with .01 level of significance, education would also get cut.

# check for multicollinearity
vif(fit2)
            GVIF Df GVIF^(1/(2*Df))
partyid 1.236950  6        1.017879
age     1.101638  1        1.049589
educ    1.147707  1        1.071311
income  1.276925 12        1.010238
race    1.127343  1        1.061764
sex     1.037975  1        1.018810

No VIF above 5, so it’s safe to assume there is not multicollinearity occurring.

# compute McFadden's r-squared to evaluate predictive power of model
pR2(fit2)["McFadden"]
fitting null model for pseudo-r2
 McFadden 
0.1743009 

This model has lower AIC and higher r-squared than the first, but still not great.

varImp(fit2)
            Overall
partyid2  3.1089779
partyid3  2.0104536
partyid4  8.8565241
partyid5 10.9463391
partyid6 11.1063009
partyid7 13.9954637
age       5.5413175
educ      2.2448588
income2   1.3429814
income3   0.1844791
income4   1.1732095
income5   0.4231191
income6   1.1093890
income7   1.6195901
income8   0.5000615
income9   0.3364008
income10  0.4080937
income11  1.6274758
income12  2.1893286
income13  3.2541980
race1     3.2754111
sex1      0.6678427

When it comes to variable importance, partyid and age are highest, followed by income and race, then education.

plot(allEffects(fit2))

It’s surprising that there isn’t a strong relationship with gender, but I will remove it for the next model.

fit4 <- glm(aban ~ partyid + age + educ + income + race, family = binomial(link = "logit"), data = new2)
summary(fit4)

Call:
glm(formula = aban ~ partyid + age + educ + income + race, family = binomial(link = "logit"), 
    data = new2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2074  -0.9433   0.5295   0.8082   2.1024  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.735986   0.489183   1.505 0.132447    
partyid2    -0.572146   0.185631  -3.082 0.002055 ** 
partyid3    -0.390295   0.198694  -1.964 0.049495 *  
partyid4    -1.511857   0.171101  -8.836  < 2e-16 ***
partyid5    -2.356515   0.215666 -10.927  < 2e-16 ***
partyid6    -2.185015   0.197005 -11.091  < 2e-16 ***
partyid7    -2.651258   0.189419 -13.997  < 2e-16 ***
age         -0.017663   0.003202  -5.517 3.46e-08 ***
educ         0.046713   0.020798   2.246 0.024703 *  
income2      0.735874   0.550505   1.337 0.181312    
income3     -0.160451   0.800665  -0.200 0.841170    
income4      1.190128   1.018479   1.169 0.242591    
income5      0.317291   0.757389   0.419 0.675269    
income6     -1.081627   0.961948  -1.124 0.260838    
income7      1.553559   0.975758   1.592 0.111350    
income8     -0.278662   0.551858  -0.505 0.613593    
income9      0.138778   0.424100   0.327 0.743495    
income10     0.175816   0.443372   0.397 0.691705    
income11     0.699884   0.430985   1.624 0.104393    
income12     0.792766   0.360113   2.201 0.027705 *  
income13     1.309390   0.397440   3.295 0.000986 ***
race1        0.444193   0.136178   3.262 0.001107 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2723.0  on 2005  degrees of freedom
Residual deviance: 2248.8  on 1984  degrees of freedom
AIC: 2292.8

Number of Fisher Scoring iterations: 4
pR2(fit4)["McFadden"]
fitting null model for pseudo-r2
McFadden 
0.174137 
# compute variable importance
varImp(fit4)
            Overall
partyid2  3.0821635
partyid3  1.9643025
partyid4  8.8360580
partyid5 10.9266982
partyid6 11.0911416
partyid7 13.9967817
age       5.5165696
educ      2.2460090
income2   1.3367261
income3   0.2003974
income4   1.1685343
income5   0.4189279
income6   1.1244138
income7   1.5921564
income8   0.5049512
income9   0.3272287
income10  0.3965427
income11  1.6239174
income12  2.2014378
income13  3.2945577
race1     3.2618457
plot(allEffects(fit4))

Because the income variable is so disproportionate (majority of sample makes highest level of income option provided), the lower thresholds have huge margins and may negatively affect prediction.

# see how removing income effects the model
fit5 <- glm(aban ~ partyid + age + educ + race, family = binomial(link = "logit"), data = new2)
summary(fit5)

Call:
glm(formula = aban ~ partyid + age + educ + race, family = binomial(link = "logit"), 
    data = new2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1382  -0.9522   0.5627   0.8267   1.9932  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.872156   0.365297   2.388 0.016962 *  
partyid2    -0.569781   0.183168  -3.111 0.001866 ** 
partyid3    -0.393017   0.195900  -2.006 0.044834 *  
partyid4    -1.524097   0.168754  -9.031  < 2e-16 ***
partyid5    -2.296167   0.212374 -10.812  < 2e-16 ***
partyid6    -2.085436   0.193890 -10.756  < 2e-16 ***
partyid7    -2.558796   0.185707 -13.779  < 2e-16 ***
age         -0.016485   0.003148  -5.236 1.64e-07 ***
educ         0.080721   0.019395   4.162 3.16e-05 ***
race1        0.463468   0.133679   3.467 0.000526 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2723.0  on 2005  degrees of freedom
Residual deviance: 2286.4  on 1996  degrees of freedom
AIC: 2306.4

Number of Fisher Scoring iterations: 4
pR2(fit5)["McFadden"]
fitting null model for pseudo-r2
 McFadden 
0.1603334 

That caused both the AIC and McFadden’s r-squared to decrease, which is not good… but not by a significant amount. The ranges of income provided as options, with the highest being “greater than $25,000 annual income, is inaccurate to 2021 income average.

plot(allEffects(fit5))

stargazer(fit1, fit2, fit4, fit5, type = "text")

=============================================================
                              Dependent variable:            
                  -------------------------------------------
                                     aban                    
                     (1)        (2)        (3)        (4)    
-------------------------------------------------------------
partyid2           -0.451**  -0.578***  -0.572***  -0.570*** 
                   (0.179)    (0.186)    (0.186)    (0.183)  
                                                             
partyid3            -0.276    -0.401**   -0.390**   -0.393** 
                   (0.192)    (0.199)    (0.199)    (0.196)  
                                                             
partyid4          -1.442***  -1.518***  -1.512***  -1.524*** 
                   (0.162)    (0.171)    (0.171)    (0.169)  
                                                             
partyid5          -2.228***  -2.366***  -2.357***  -2.296*** 
                   (0.207)    (0.216)    (0.216)    (0.212)  
                                                             
partyid6          -1.930***  -2.195***  -2.185***  -2.085*** 
                   (0.187)    (0.198)    (0.197)    (0.194)  
                                                             
partyid7          -2.491***  -2.662***  -2.651***  -2.559*** 
                   (0.180)    (0.190)    (0.189)    (0.186)  
                                                             
age                          -0.018***  -0.018***  -0.016*** 
                              (0.003)    (0.003)    (0.003)  
                                                             
educ                          0.047**    0.047**    0.081*** 
                              (0.021)    (0.021)    (0.019)  
                                                             
income2                        0.740      0.736              
                              (0.551)    (0.551)             
                                                             
income3                        -0.148     -0.160             
                              (0.803)    (0.801)             
                                                             
income4                        1.198      1.190              
                              (1.021)    (1.018)             
                                                             
income5                        0.320      0.317              
                              (0.756)    (0.757)             
                                                             
income6                        -1.068     -1.082             
                              (0.963)    (0.962)             
                                                             
income7                        1.583      1.554              
                              (0.977)    (0.976)             
                                                             
income8                        -0.276     -0.279             
                              (0.553)    (0.552)             
                                                             
income9                        0.143      0.139              
                              (0.425)    (0.424)             
                                                             
income10                       0.181      0.176              
                              (0.444)    (0.443)             
                                                             
income11                       0.702      0.700              
                              (0.431)    (0.431)             
                                                             
income12                      0.790**    0.793**             
                              (0.361)    (0.360)             
                                                             
income13                      1.296***   1.309***            
                              (0.398)    (0.397)             
                                                             
race1                         0.446***   0.444***   0.463*** 
                              (0.136)    (0.136)    (0.134)  
                                                             
sex1                           0.070                         
                              (0.105)                        
                                                             
Constant           1.488***    0.719      0.736     0.872**  
                   (0.121)    (0.490)    (0.489)    (0.365)  
                                                             
-------------------------------------------------------------
Observations        2,006      2,006      2,006      2,006   
Log Likelihood    -1,170.450 -1,124.189 -1,124.413 -1,143.206
Akaike Inf. Crit. 2,354.900  2,294.379  2,292.825  2,306.412 
=============================================================
Note:                             *p<0.1; **p<0.05; ***p<0.01

Concluding Thoughts

I used logistic regression due to the fact that the outcome variable, opinion on right to legal abortion, is binary - the response can take one of two values, Yes or No. I denoted Yes as 1 and No as 0.

All variables were coded numerically. Age and years of education were numeric variables, while sex, race, party ID, and income were factors. First, I fit a model with just party identification as the explanatory variable, to evaluate effects on abortion opinion without accounting for other variables. While the model reflected a significant relationship between the two based upon low p-values, the r-squared value was rather low, suggesting the model is not as reliable for prediction as preferred. The coefficients in the model indicate a reduction in log(odds) that one will support legal abortion as party identification trends conservative. The model fits also reflect evidence that increase in age is associated with decreased support of the abortion question, race of white increased odds of support versus non-white respondents, and increased level of education increased odds of support for legal abortion.

When fitting a model with all selected variables, each had significant p-values besides sex. However, AIC did not decrease much compared to the simple regression model, nor did the r-squared increase to a level of reliable model predictability. Removing sex and income did not improve the model fit either. Model comparison shows insignificant changes in log likelihood across all of the models I fit. With that being said, while there is a statistically significant relationship between party identification and support for legal abortion (for any reason), as evidenced by significant p-values in the logistic regression models and chi-squared test, the explanatory variables used in this study did not produce a statistically reliable predictive model.

Further research is needed to explore additional variables that may be contributing factors to support for abortion, to create a stronger predictive model.