Load packages

library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.1
library(plyr)
library(lattice)
library(statsr)
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.1

Load data

load("gss.Rdata")

Part 1: Data

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

It should be noted that the information on attitudes, behaviors, etc. was obtained through telephone interviews, face-to-face interviews, and on-line questionnaires. As such, this is best described as an obervational study and may suffer from “availability” bias, that is, respondents were willing to answer the questions and may not necessarily represent a truly random sampling of the population. Consequently, one cannot easily generalize the results of this particular sample to the U.S. population at large and causality between any two variables cannot be inferred.


Part 2: Research question

Gun ownership remains a controversial topic in the United States especially after the police shootings of several black men and the subsequent shooting of five police officers in Dallas, Texas and three police officers in Baton Rouge, Louisiana on July 7, 2016 and July 17, 2016, respectively.

Here, I would like to address the question whether the proportion of people owning a gun is related to the highest year of school completed.


Part 3: Exploratory data analysis

Preamble: It should be noted that several variables are likely to be highly correlated such as years of school completed and income for example. Moreover, respondents that are 18 years of age cannot be expected to have completed 20 years of education for obvious reasons. As such, only respondents 26 years of age or older were included in subsequent analyses.

First, select the variables of interest followed by a selection for complete cases.

df <- subset(gss, age > 25)
vars <- c("age", "educ", "owngun")
df <- df[vars]
df <- df[complete.cases(df),]

The structure of the resulting data frame is as follows:

str(df)
## 'data.frame':    29870 obs. of  3 variables:
##  $ age   : int  54 51 36 32 54 41 31 26 78 46 ...
##  $ educ  : int  6 8 11 12 8 12 8 16 7 16 ...
##  $ owngun: Factor w/ 3 levels "Yes","No","Refused": 1 1 1 1 1 2 2 2 2 2 ...

Although, strictly speaking, the age of the respondents is not part of the original research question, it is possible that gun ownership is related to age and, thus, indirectly may influence the relationship between gun ownership and education.

fit <- lm(educ ~ age, data=df)
summary(fit)
## 
## Call:
## lm(formula = educ ~ age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.971  -1.722  -0.179   2.138   9.450 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.382389   0.059635  257.94   <2e-16 ***
## age         -0.054295   0.001154  -47.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.207 on 29868 degrees of freedom
## Multiple R-squared:  0.06905,    Adjusted R-squared:  0.06902 
## F-statistic:  2215 on 1 and 29868 DF,  p-value: < 2.2e-16

As shown above, he summary of a linear regression with age as the regressor and education as the response indicates a significant effect of age on education. In particular, with every unit increase of age, education seems to decrease by 0.05 years. Figure 1 below shows how the mean number of educational years varies by age.

meanEduc <- ddply(df, .(age), summarise, mean=mean(educ))
ggplot(meanEduc, aes(x=age, y=mean)) + geom_point() +
  labs(title="Fig 1: Education vs. Age") +
  labs(x="Age (years)", y="Years of Education (mean)")

As can be seen in Figure 1, the mean years of education decreases from about 50 years of age onward. In order to minimize this, subsequent analyses will include only study participants age 50 or younger.

df = subset(df, age <= 50)

This leaves 17119 repondents for investigating a possible relationship between gun ownership and years of education. A preliminary visualization of respondents owning a gun and the years of education is shown in Figure 2.

ggplot(df, aes(x=owngun, y=educ)) + geom_boxplot() +
  labs(title="Fig 2: Education and Gun Ownership") +
  labs(x="Gun Ownership", y="Education (years)")

The number of study participants who refused to answer the question of gun ownership is 146, or 0.8%, of total number of respondents and will not be considered for further analysis. From the results shown in Figure 2, it can be seen that years of education is skewed to the left for the other two groups.

df = subset(df, owngun != "Refused")

Since both distributions of years of education appera to be left-skewed (Fig. 2), the actual value of the skewness was computed using the skewness function in package e1071.

#Store groups in two data frames
df_yes <- subset(df, owngun=="Yes")
df_no <- subset(df, owngun=="No")
skewness(df_yes$educ)
## [1] 0.006814789
skewness(df_no$educ)
## [1] -0.2186594

These values indicate that years of education is slightly skewed for study participants who do own a gun and moderately skewed for those that do not own a gun.

Next, let’s compare the distribution of years of education for the two different groups to a normal distribution using a quantile-quamtile plot.

#Simulate normal distributions dn_no and dn_yes
par(mfrow = c(1, 2))
p1 <- qqnorm(df_no$educ, main="Fig3A: Does not own gun")
p2 <- qqnorm(df_yes$educ, main="Fig3B: Does own gun")

The results presented in Figures 3A and 3B suggest that the deviation from the normal distribution is not too severe such as to preclude using a regular t-test statistic.


Part 4: Inference

The null hypothesis to be tested is that for those that do or do not own a gun, the mean years of education will not be different. The alternative hypothesis states that there is a difference in the mean years of education between the two groups.

Nothwithstanding a possible “convenience” bias, it can reasonably be assumed that study participants within and between each group were sampled independently; the number of study participants considered is definitely less than 10% of the total population of the United States. In addition, although a moderate skew was noted for both groups, the number of observations for each group (10,005 and 6,968 for responding no or yes to owning a gun, respectively) should be sufficient.

diff_means <- mean(df_no$educ) - mean(df_yes$educ)
SE = sqrt((sd(df_no$educ)^2 / length(df_no$educ)) +
  (sd(df_yes$educ)^2 / length(df_yes$educ)))
t_crit = 2.08 #two-tailed test
conf_int = diff_means + c(-t_crit*SE, t_crit*SE); conf_int
## [1] 0.3227438 0.5056517

The confidence intervals show that those who do not own a gun have between 0.32 and 0.51 more years of education on average.

In order to test our original null hypothesis, it is necessary to calculate the appropriate t statistic using a value for the degrees of freedom equal to the lowest group size (6968) - 1 or, in other words, df = 6967.

t_stat = (mean(df_no$educ) - mean(df_yes$educ)) / SE; t_stat
## [1] 9.420381

The P value is essentially 0 which indicates that there is indeed a significant difference between the two groups.

Note that the calculations shown above could also have been using R’s t.test:

t.test(df_no$educ, df_yes$educ)
## 
##  Welch Two Sample t-test
## 
## data:  df_no$educ and df_yes$educ
## t = 9.4204, df = 16169, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3280151 0.5003804
## sample estimates:
## mean of x mean of y 
##  13.55972  13.14552