The General Social Survey (GSS) is a national research project intended to track the attributes and attitudes of American society and to facilitate comparison between the US and other countries. Its scope includes respondents’ background, personal and family information, societal concerns, and workplace and economic concerns, among other topics. The data set used here covers the period from 1972 to 2012.
In this study, we will focus on the 2012 data. Specifically, we’ll study the relationship between the “degree” (highest degree) and the “abnomore” (Married - wants no more children) variables, in order to answer the question: “Is there a relationship between people’s education and whether they support abortions for women who are married and don’t want any more children?”
This question interests me because I have always wondered whether people with more education tend to be more open to abortion and people with less education more conservative. The result could be useful to others as well. For example, a non-profit organization aiming to reduce the abortion rate or to lobby for anti-abortion laws would have a clearer picture of its target audience based on the result of this study.
The data was collected by conducting “computer-assisted personal interview (CAPI), face-to-face interview, telephone interview” (please refer to Reference 1 - “ICPSR GSS Info”). According to the GSS CODEBOOK (Reference 2), the GSS survey uses “a full probability sample for the 1977+ survey”. Probability sampling is a form of random sampling and therefore inference from this study can be generalized to the American population. (Reference 3)
The survey is an observational study since no random assignment was involved. Respondents were simply surveyed using different methods, and no intervention took place before or during the interviews. As a result, only association can be inferred; any causal claim based on these data would be unfounded.
Since the survey sampled U.S. society, covering both metropolitan and non-metropolitan areas and using stratification by region, age and race, the population of interest is the American national population, and any inference made applies to this population.
The cases in the survey are individual respondents; each column is a variable that stores the answers to one survey question. The two variables we’re interested in are “degree” and “abnomore”. They are both categorical variables.
Next, we load the data, perform some pre-processing and exploratory data analysis.
setwd("C:/Users/George/Dropbox/WorkingDir/Data Analysis and Statistical Inference")
load(url("http://bit.ly/dasi_gss_data"))
suppressMessages(library(dplyr))
# Subset the needed variables
data12 <- filter(gss, year == 2012)
data <- select(data12, degree, abnomore)
# Calculate NA value percentages
sum(is.na(data$degree))/nrow(data) # degree column NA value percentage
## [1] 0.004052685
sum(is.na(data$abnomore))/nrow(data) # abnomore column NA value percentage
## [1] 0.3687943
So the “degree” variable has about 0.4% NA values, while the “abnomore” variable has almost 37%, which is fairly high. However, short of obtaining additional data, we have to remove these cases in order to proceed, since imputing the missing data with subjective values may introduce more variability. The removal of missing data is a potential source of bias for this study and is discussed further later.
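One way to gauge how much bias the removal might introduce is to check whether the share of missing “abnomore” answers differs across degree groups. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in that is not GSS data) purely to illustrate the calculation; the same `tapply()` call could be applied to the `data` object above.

```r
# Hypothetical toy data, NOT taken from the GSS; it only illustrates the calculation
toy <- data.frame(
  degree = factor(c("High School", "High School", "High School", "High School",
                    "Bachelor", "Bachelor", "Bachelor", "Bachelor")),
  abnomore = factor(c("Yes", NA, NA, "No", "Yes", "No", NA, "Yes"))
)

# Share of missing abnomore answers within each degree group
na_share <- tapply(is.na(toy$abnomore), toy$degree, mean)
print(na_share)
```

If the shares were similar across groups, dropping the incomplete cases would be less likely to distort the comparison; large differences would be a warning sign.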
# Remove cases with NA values
data.new <- data[complete.cases(data),]
# Calculate summary statistics and make a contingency table
dim(data.new)
## [1] 1240 2
str(data.new)
## 'data.frame': 1240 obs. of 2 variables:
## $ degree : Factor w/ 5 levels "Lt High School",..: 2 2 4 1 1 4 1 2 2 2 ...
## $ abnomore: Factor w/ 2 levels "Yes","No": 1 2 1 1 2 1 2 2 2 2 ...
t <- table(data.new$abnomore, data.new$degree); t
##
## Lt High School High School Junior College Bachelor Graduate
## Yes 55 254 42 132 93
## No 107 375 52 96 34
# Make a mosaic plot to visualize the data
mosaicplot(t, main = "Attitude toward abortion for different education backgrounds", color = TRUE)
Based on the data, respondents with less education, i.e. the “Lt High School” and “High School” groups, lean toward No, a more conservative attitude toward abortion; respondents with more education, i.e. those with bachelor’s and graduate degrees, lean toward Yes, most clearly in the graduate group. Thus, the “degree” variable and the “abnomore” variable do not appear to be independent. A potential explanation is that people may become more open with more education.
To validate our intuition, we resort to hypothesis testing. Before running the test, let’s first make sure its conditions are met. The expected counts for each cell of the contingency table (rows: Yes, No; columns: the five degree levels) are shown below; the code that produces them is listed at the end of this report.
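The pattern in the mosaic plot can also be expressed as proportions within each degree group. The sketch below re-enters the contingency table from the output above as `tab` (a name introduced here so the example runs on its own) and uses `prop.table()` with `margin = 2` to get column proportions.

```r
# Contingency table copied from the output above (rows: abnomore, columns: degree)
tab <- matrix(
  c(55, 254, 42, 132, 93,
    107, 375, 52, 96, 34),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    abnomore = c("Yes", "No"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Column proportions: within each degree group, the share answering Yes or No
col_props <- prop.table(tab, margin = 2)
print(round(col_props, 3))
```

The share answering Yes rises from about 34% in the “Lt High School” group to about 73% in the “Graduate” group.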
## c1 c2 c3 c4 c5
## 1 75.25161 292.1806 43.66452 105.9097 58.99355
## 2 86.74839 336.8194 50.33548 122.0903 68.00645
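As a cross-check, the expected counts can also be read from the object returned by `chisq.test()`, and the usual rule of thumb of at least 5 expected cases per cell can be tested directly. The sketch below re-enters the observed counts as `tab` (a name introduced here so the example runs on its own).

```r
# Observed counts copied from the contingency table (rows: abnomore, columns: degree)
tab <- matrix(
  c(55, 254, 42, 132, 93,
    107, 375, 52, 96, 34),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    abnomore = c("Yes", "No"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# chisq.test() stores the expected counts under independence in $expected
expected <- chisq.test(tab)$expected

# Sample-size condition: every expected count should be at least 5
cells_ok <- all(expected >= 5)
print(cells_ok)
print(min(expected))
```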
All the conditions are met: the GSS uses a probability sample, the 1,240 respondents are far fewer than 10% of the American population, each respondent falls into exactly one cell of the table, and every expected count above is well over 5. We therefore proceed with the following hypotheses:
H0: People’s view on abortion and their education level are independent, i.e. there is no relationship between people’s education and whether they support abortions for women who are married and don’t want any more children.
HA: People’s view on abortion and their education level are dependent, i.e. there is a relationship between people’s education and whether they support abortions for women who are married and don’t want any more children.
Since we are testing the independence of two categorical variables, the chi-squared test of independence is the appropriate test.
chisq.test(t)
##
## Pearson's Chi-squared test
##
## data: t
## X-squared = 68.224, df = 4, p-value = 5.38e-14
As we can see from the result, the p-value is 5.38e-14, which is almost zero and far below any conventional significance level. Therefore, we reject the null hypothesis and conclude that, for the American population, the data provide convincing evidence that people’s views on abortion do vary by education level; that is, there is a relationship between people’s education and whether they support abortions for women who are married and don’t want any more children. This is in line with our initial intuition based on the exploratory data analysis.
Because both variables are categorical and “degree” has more than two levels, the chi-squared test of independence is the only method used here, so there are no results from another method to compare against.
In summary, we conclude that attitude towards abortion does vary across groups with different education backgrounds. This is based on the hypothesis test we conducted using the full probability sample data from the GSS. As suggested earlier, the result of this study might be helpful for relevant organizations in identifying target social groups.
Since we removed a significant percentage of cases with NA values from the data set, bias may have been introduced. It is therefore recommended that future research collect more complete data in this regard and revisit our study and results as needed.
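To make the reported statistic less of a black box, the sketch below recomputes it by hand as the sum of (observed − expected)² / expected over all cells, with (rows − 1) × (columns − 1) degrees of freedom. The names `tab`, `chi_sq`, `p_value` and `pearson_resid` are introduced here for illustration; the counts are copied from the contingency table above.

```r
# Observed counts copied from the contingency table (rows: abnomore, columns: degree)
tab <- matrix(
  c(55, 254, 42, 132, 93,
    107, 375, 52, 96, 34),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    abnomore = c("Yes", "No"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Expected counts under independence
expected <- chisq.test(tab)$expected

# Chi-squared statistic, degrees of freedom and upper-tail p-value
chi_sq <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(c(chi_sq = chi_sq, df = df, p_value = p_value))

# Pearson residuals show which cells deviate most from independence
pearson_resid <- (tab - expected) / sqrt(expected)
print(round(pearson_resid, 2))
```

The largest residuals are in the “Graduate” column, so that group contributes most to the statistic.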
The code used to find the expected count in the contingency table:
row_total_1 <- sum(t[1,])
row_total_2 <- sum(t[2,])
col_total_1 <- sum(t[,1])
col_total_2 <- sum(t[,2])
col_total_3 <- sum(t[,3])
col_total_4 <- sum(t[,4])
col_total_5 <- sum(t[,5])
table_total <- row_total_1 + row_total_2
data.frame(c1=c(row_total_1*col_total_1/table_total,
row_total_2*col_total_1/table_total),
c2=c(row_total_1*col_total_2/table_total,
row_total_2*col_total_2/table_total),
c3=c(row_total_1*col_total_3/table_total,
row_total_2*col_total_3/table_total),
c4=c(row_total_1*col_total_4/table_total,
row_total_2*col_total_4/table_total),
c5=c(row_total_1*col_total_5/table_total,
row_total_2*col_total_5/table_total))
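The same expected counts can be obtained more compactly. The sketch below is an alternative formulation, not the code used for the output above: it takes the outer product of the row and column totals and divides by the table total. The name `tab` is introduced here so the example runs on its own, with counts copied from the contingency table.

```r
# Observed counts copied from the contingency table (rows: abnomore, columns: degree)
tab <- matrix(
  c(55, 254, 42, 132, 93,
    107, 375, 52, 96, 34),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    abnomore = c("Yes", "No"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Expected count for each cell = row total * column total / table total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
print(round(expected, 2))
```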