Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

1.1 What is the GSS?

"Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society.

The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions include such items as national spending priorities, marijuana use, crime and punishment, race relations, quality of life, and confidence in institutions. Since 1988, the GSS has also collected data on sexual behavior including number of sex partners, frequency of intercourse, extramarital relationships, and sex with prostitutes."

Source: https://www.norc.org/Research/Projects/Pages/general-social-survey.aspx

1.2 Methodology

"The target population of the GSS is adults (18+) living in households in the United States. The GSS sample is drawn using an area probability design that randomly selects respondents in households across the United States to take part in the survey. Respondents that become part of the GSS sample are from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only about a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results.

The survey is conducted face-to-face with an in-person interview by NORC at the University of Chicago. The survey was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year. The survey takes about 90 minutes to administer. As of 2014, 30 national samples with 59,599 respondents and 5,900+ variables have been collected."

Source: https://en.wikipedia.org/wiki/General_Social_Survey#Methodology

The samples are randomly selected from the population of adults living in households in the United States, and the information is collected through face-to-face interviews. This is an observational study based on large random samples: the sample statistics allow us to draw conclusions about the population parameters, but not to infer causation between the variables. Unlike in a randomized experiment, the researchers do not interfere with how the data arise; they only observe them (see OS3, 1.3.5). Over the years, the sampling procedure has apparently evolved from its original 1972-1974 form to a more advanced and reliable one. For more information on the sampling design, see the GSS codebook, Appendix A.

1.3 About this data set

“The General Social Surveys (GSS) were designed as part of a data diffusion project in 1972. The GSS replicated questionnaire items and wording in order to facilitate time-trend studies. The latest survey, GSS 2012, includes a cumulative file that merges all 29 General Social Surveys into a single file containing data from 1972 to 2012. The items appearing in the surveys are one of three types: Permanent questions that occur on each survey, rotating questions that appear on two out of every three surveys (1973, 1974, and 1976, or 1973, 1975, and 1976), and a few occasional questions such as split ballot experiments that occur in a single survey. The 2012 surveys included seven topic modules: Jewish identity, generosity, workplace violence, science, skin tone, and modules for experimental and miscellaneous questions. The International Social Survey Program (ISSP) module included in the 2012 survey was gender. The data also contain several variables describing the demographic characteristics of the respondents.”

Source: General Social Survey, 1972-2012 (Cumulative File) (ICPSR 34802)

The dataset used here is a simplified version of the cumulative GSS file, with “removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R.”


Part 2: Research question

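We are interested in the possible association between gun ownership and income. For this reason, we extract the income variable coninc for two subsets of the gss dataset, gun owners (gincome) and non-owners (ngincome):

# Income of respondents who own a gun
gincome <- subset(gss, owngun == "Yes")$coninc
# Income of respondents who do not own a gun
ngincome <- subset(gss, owngun == "No")$coninc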
summary(gincome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   24543   42215   49289   64546  180386    1140
hist(gincome)

summary(ngincome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   14761   30344   40238   53507  180386    2126
hist(ngincome)

As we can see, both distributions are right-skewed, and the mean income of gun owners (49,289) is noticeably higher than that of non-owners (40,238).
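The same group comparison can also be obtained in one step, without building a separate vector per group. The following is only a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not GSS data), using base R's `tapply`:

```r
# Hypothetical toy data standing in for gss; the values are invented for illustration.
toy <- data.frame(
  owngun = factor(c("Yes", "No", "Yes", "No", "Refused", NA)),
  coninc = c(50000, 20000, 70000, NA, 30000, 40000)
)

# Keep only the "Yes"/"No" answers; %in% returns FALSE for NA,
# so missing answers are dropped as well.
yes_no <- droplevels(toy[toy$owngun %in% c("Yes", "No"), ])

# Mean and median income per group, ignoring missing incomes.
group_mean   <- tapply(yes_no$coninc, yes_no$owngun, mean,   na.rm = TRUE)
group_median <- tapply(yes_no$coninc, yes_no$owngun, median, na.rm = TRUE)

print(group_mean)
print(group_median)
```

Applied to gss instead of `toy`, the same two `tapply` calls should give the group means and medians shown in the summaries above.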


Part 3: Exploratory data analysis

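First we clean the data for our analysis. We keep only the respondents whose owngun answer is “Yes” or “No”, dropping the “Refused” and missing answers. We also drop the respondents whose class is missing, and recode the “No Class” level of class as missing:

# Drop "Refused" (and missing) owngun answers and rows with missing class
gss1 <- subset(gss, owngun != "Refused" & !is.na(class))
# Remove factor levels that no longer occur in the data
gss2 <- droplevels(gss1)
# Treat "No Class" as missing
gss2$class <- factor(gss2$class, exclude = "No Class")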
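To see why the subset() call also removes respondents with a missing owngun answer, and why droplevels() is needed afterwards, here is a small sketch on a hypothetical data frame (`toy` and its values are invented for illustration, not taken from the GSS):

```r
# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  owngun = factor(c("Yes", "No", "Refused", NA, "Yes")),
  class  = factor(c("Working Class", "Middle Class", "Middle Class",
                    "Working Class", NA))
)

# subset() drops rows where the condition is FALSE *or* NA.
# NA != "Refused" evaluates to NA, so the row with a missing owngun is dropped too.
kept <- subset(toy, owngun != "Refused" & !is.na(class))
print(nrow(kept))

# The unused "Refused" level is still attached to the factor ...
print(levels(kept$owngun))

# ... until droplevels() removes levels that no longer occur.
cleaned <- droplevels(kept)
print(levels(cleaned$owngun))
```

Without the droplevels() step, the empty “Refused” level would still be carried along in tables and plots.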

Let’s explore this possible relationship with some graphs:

# Gun ownership against income
plot(owngun ~ coninc, data = gss2, xlab = "Income", ylab = "Gun Ownership")

# Proportion of gun owners within each social class
gss2 %>%
  filter(!is.na(class)) %>%
  ggplot(aes(x = class, fill = owngun)) +
  geom_bar(position = "fill")

# Income distribution by gun ownership
ggplot(gss2, aes(owngun, coninc)) +
  geom_boxplot() +
  xlab("Gun owners") + ylab("Income") + labs(title = "Income comparison")
## Warning: Removed 3157 rows containing non-finite values (stat_boxplot).

We notice a clear upward trend in the proportion of gun owners as income increases, followed by a slight decrease. The boxplot also shows that the median income of gun owners is higher than that of people who do not own guns. Is the difference in average income statistically significant?


Part 4: Inference

Dependency

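We first test whether gun ownership and income are independent, using a chi-square test of independence. We use the categorical variable income06, a categorical measure of income comparable to coninc, as the explanatory variable. Note that income06 still contains a “Refused” category, which is treated as one of its 26 levels. The conditions are fulfilled:

  • Independence. Each case that contributes a count to the table must be independent of all the other cases in the table. The respondents are randomly sampled and each one appears in exactly one cell.
  • Sample size / distribution. Each particular scenario (i.e. cell count) must have at least 5 expected cases. The smallest expected count in the output below is about 10.57.

inference(y = owngun, x = income06, data = gss2, statistic = "proportion", success = "Yes",
          type = "ht", method = "theoretical")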
## Warning: Use alternative = "greater" for chi-square test
## Response variable: categorical (2 levels) 
## Explanatory variable: categorical (26 levels) 
## Observed:
##                     y
## x                    Yes  No
##   Under $1 000        14  63
##   $1 000 To 2 999     14  55
##   $3 000 To 3 999      4  45
##   $4 000 To 4 999      4  27
##   $5 000 To 5 999      6  36
##   $6 000 To 6 999      7  45
##   $7 000 To 7 999     14  67
##   $8 000 To 9 999     19  89
##   $10000 To 12499     49 153
##   $12500 To 14999     24 153
##   $15000 To 17499     42 133
##   $17500 To 19999     29  97
##   $20000 To 22499     37 140
##   $22500 To 24999     46 145
##   $25000 To 29999     87 177
##   $30000 To 34999     80 190
##   $35000 To 39999     93 173
##   $40000 To 49999    156 292
##   $50000 To 59999    161 241
##   $60000 To 74999    221 284
##   $75000 To $89999   187 196
##   $90000 To $109999  149 177
##   $110000 To $129999  93 116
##   $130000 To $149999  59  79
##   $150000 Or Over    135 213
##   Refused            157 263
## 
## Expected:
##                     y
## x                          Yes        No
##   Under $1 000        26.24621  50.75379
##   $1 000 To 2 999     23.51933  45.48067
##   $3 000 To 3 999     16.70213  32.29787
##   $4 000 To 4 999     10.56665  20.43335
##   $5 000 To 5 999     14.31611  27.68389
##   $6 000 To 6 999     17.72471  34.27529
##   $7 000 To 7 999     27.60965  53.39035
##   $8 000 To 9 999     36.81286  71.18714
##   $10000 To 12499     68.85368 133.14632
##   $12500 To 14999     60.33219 116.66781
##   $15000 To 17499     59.65047 115.34953
##   $17500 To 19999     42.94834  83.05166
##   $20000 To 22499     60.33219 116.66781
##   $22500 To 24999     65.10423 125.89577
##   $25000 To 29999     89.98699 174.01301
##   $30000 To 34999     92.03215 177.96785
##   $35000 To 39999     90.66871 175.33129
##   $40000 To 49999    152.70520 295.29480
##   $50000 To 59999    137.02565 264.97435
##   $60000 To 74999    172.13421 332.86579
##   $75000 To $89999   130.54931 252.45069
##   $90000 To $109999  111.12030 214.87970
##   $110000 To $129999  71.23970 137.76030
##   $130000 To $149999  47.03866  90.96134
##   $150000 Or Over    118.61922 229.38078
##   Refused            143.16113 276.83887
## 
## H0: income06 and owngun are independent
## HA: income06 and owngun are dependent
## chi_sq = 261.5867, df = 25, p_value = 0
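The statistic reported above can be cross-checked by hand. The sketch below copies the observed counts from the output and recomputes the expected counts, the chi-square statistic and the degrees of freedom in base R. The variable names (`yes`, `no`, `observed`, `expected`) are our own, and the result should be close to the reported chi_sq = 261.5867 with df = 25:

```r
# Observed counts copied from the "Observed" table above (26 income categories).
yes <- c(14, 14, 4, 4, 6, 7, 14, 19, 49, 24, 42, 29, 37, 46, 87, 80, 93,
         156, 161, 221, 187, 149, 93, 59, 135, 157)
no  <- c(63, 55, 45, 27, 36, 45, 67, 89, 153, 153, 133, 97, 140, 145, 177,
         190, 173, 292, 241, 284, 196, 177, 116, 79, 213, 263)
observed <- cbind(Yes = yes, No = no)

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sample-size condition: every expected count should be at least 5.
print(min(expected))

# Chi-square statistic, degrees of freedom and p-value computed by hand.
chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_val  <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(c(chi_sq = chi_sq, df = df, p_value = p_val))

# Base R's chisq.test() should agree with the hand computation.
check <- chisq.test(observed)
print(check$statistic)
```

Each row contributes to the statistic according to how far its observed share of gun owners lies from the overall share, which is why the low-income and high-income rows dominate the total.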

The p-value is practically 0, so we reject \(H_{0}\): the data provide strong evidence that income and gun ownership are associated. Next, we turn to inference for the difference of two means, first checking whether the conditions necessary to apply the t-distribution are fulfilled:

  1. Independence within groups and sample size. The respondents are randomly sampled and represent less than 10% of the population, so the observations within each group are independent. Although each income distribution is strongly right-skewed, the samples are very large (12,494 gun owners and 17,555 non-owners), which compensates for the skew and lets us model each sample mean with a t-distribution.
  2. Independence between groups. Gun owners and non-owners are different respondents, so the two samples are independent of each other.

Since both conditions are satisfied, the difference in sample means may be modeled using a t-distribution.

inference(y = coninc, x = owngun, data = gss2, statistic = "mean",
          type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_Yes = 12494, y_bar_Yes = 49369.4696, s_Yes = 34742.5996
## n_No = 17555, y_bar_No = 40352.9602, s_No = 35362.8838
## H0: mu_Yes =  mu_No
## HA: mu_Yes != mu_No
## t = 22.0082, df = 12493
## p_value = < 0.0001

As the p-value is lower than 0.0001, we reject \(H_{0}\). There is a statistically significant difference between the average income of gun owners and that of non-owners.
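The test statistic can be reproduced from the summary statistics in the output alone, using \(T = (\bar{y}_{Yes} - \bar{y}_{No}) / \sqrt{s_{Yes}^{2}/n_{Yes} + s_{No}^{2}/n_{No}}\). The sketch below is a hand computation under the assumption that the degrees of freedom are the conservative \(\min(n_{Yes}, n_{No}) - 1\), which matches the df = 12493 reported above. The variable names are our own:

```r
# Summary statistics copied from the inference() output above.
n_yes <- 12494; mean_yes <- 49369.4696; sd_yes <- 34742.5996
n_no  <- 17555; mean_no  <- 40352.9602; sd_no  <- 35362.8838

# Standard error of the difference in sample means.
se <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)

# Test statistic for H0: mu_Yes = mu_No.
t_stat <- (mean_yes - mean_no) / se

# Assumption: conservative degrees of freedom, min(n1, n2) - 1.
df <- min(n_yes, n_no) - 1

# Two-sided p-value from the t-distribution.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(c(se = se, t = t_stat, df = df, p_value = p_value))
```

The difference of about 9,017 dollars is roughly 22 standard errors away from 0, which is why the p-value is so small.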

inference(y = coninc, x = owngun, data = gss2, statistic = "mean",
          type = "ci", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Yes = 12494, y_bar_Yes = 49369.4696, s_Yes = 34742.5996
## n_No = 17555, y_bar_No = 40352.9602, s_No = 35362.8838
## 95% CI (Yes - No): (8213.4551 , 9819.5637)

We are 95% confident that the average annual income of gun owners is about 8,213 to 9,820 dollars higher than the average income of people who do not own guns. The interval does not contain 0, which agrees with the result of the hypothesis test.
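The interval has the form point estimate ± \(t^{*} \times SE\). As a rough check, the sketch below rebuilds it from the reported summary statistics, assuming the same conservative degrees of freedom as in the hypothesis test (df = 12493). The variable names are our own, and the bounds should land close to the reported (8213.4551, 9819.5637):

```r
# Summary statistics copied from the inference() output above.
n_yes <- 12494; mean_yes <- 49369.4696; sd_yes <- 34742.5996
n_no  <- 17555; mean_no  <- 40352.9602; sd_no  <- 35362.8838

# Point estimate and standard error of the difference in means (Yes - No).
point_est <- mean_yes - mean_no
se        <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)

# Assumption: conservative degrees of freedom, min(n1, n2) - 1.
df     <- min(n_yes, n_no) - 1
t_star <- qt(0.975, df = df)   # critical value for a 95% interval

# 95% confidence interval: point estimate +/- margin of error.
ci <- point_est + c(-1, 1) * t_star * se
print(ci)
```

With samples this large, \(t^{*}\) is practically equal to the normal critical value 1.96, so the margin of error is about 1.96 standard errors.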