library(ggplot2)
library(dplyr)
library(statsr)
load("gss.Rdata")
"Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society.
The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
GSS questions include such items as national spending priorities, marijuana use, crime and punishment, race relations, quality of life, and confidence in institutions. Since 1988, the GSS has also collected data on sexual behavior including number of sex partners, frequency of intercourse, extramarital relationships, and sex with prostitutes."
Source: https://www.norc.org/Research/Projects/Pages/general-social-survey.aspx
"The target population of the GSS is adults (18+) living in households in the United States. The GSS sample is drawn using an area probability design that randomly selects respondents in households across the United States to take part in the survey. Respondents that become part of the GSS sample are from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only about a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results.
The survey is conducted face-to-face with an in-person interview by NORC at the University of Chicago. The survey was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year. The survey takes about 90 minutes to administer. As of 2014, 30 national samples with 59,599 respondents and 5,900+ variables have been collected."
Source: https://en.wikipedia.org/wiki/General_Social_Survey#Methodology
The samples are randomly selected from the population of adults, and the information is collected through face-to-face interviews. This is an observational study based on very large random samples: the sample statistics allow us to draw conclusions about the population parameters, but not to infer causation between the variables. Unlike in a randomized experiment, the data are collected without interfering with how they arise, and no random assignment takes place (see OS3, 1.3.5). Over the years, the sampling procedure has apparently evolved from its original 1972-1974 form to a more advanced and reliable one. For more information on the sampling design, see the GSS codebook, appendix A.
“The General Social Surveys (GSS) were designed as part of a data diffusion project in 1972. The GSS replicated questionnaire items and wording in order to facilitate time-trend studies. The latest survey, GSS 2012, includes a cumulative file that merges all 29 General Social Surveys into a single file containing data from 1972 to 2012. The items appearing in the surveys are one of three types: Permanent questions that occur on each survey, rotating questions that appear on two out of every three surveys (1973, 1974, and 1976, or 1973, 1975, and 1976), and a few occasional questions such as split ballot experiments that occur in a single survey. The 2012 surveys included seven topic modules: Jewish identity, generosity, workplace violence, science, skin tone, and modules for experimental and miscellaneous questions. The International Social Survey Program (ISSP) module included in the 2012 survey was gender. The data also contain several variables describing the demographic characteristics of the respondents.”
Source: General Social Survey, 1972-2012 (Cumulative File) (ICPSR 34802)
The dataset used here is a simplified version of the cumulative GSS, with “removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R.”
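We are interested in the possible association between gun ownership and income. For this reason, we extract two income vectors (coninc) from the gss dataset, one for gun owners and one for non-owners:
# Income of respondents who own a gun
gincome <- subset(gss, owngun == "Yes")$coninc
# Income of respondents who do not own a gun
ngincome <- subset(gss, owngun == "No")$coninc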
summary(gincome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     383   24543   42215   49289   64546  180386    1140
hist(gincome)
summary(ngincome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     383   14761   30344   40238   53507  180386    2126
hist(ngincome)
As we can see, both distributions are right skewed, and the mean of gincome is noticeably higher than that of ngincome.
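The right skew can also be read off the summaries without the histograms: in a right-skewed distribution the long upper tail pulls the mean above the median. The short sketch below only reuses the rounded medians and means printed by summary() above; the variable names (owners, non_owners, and the gap variables) are illustrative.

```r
# Medians and means copied from the summary() output above (rounded values)
owners     <- c(median = 42215, mean = 49289)
non_owners <- c(median = 30344, mean = 40238)

# Mean minus median: a positive gap is what a right-skewed distribution produces
gap_owners     <- owners[["mean"]] - owners[["median"]]
gap_non_owners <- non_owners[["mean"]] - non_owners[["median"]]

# Raw difference between the two group means (before any inference)
mean_difference <- owners[["mean"]] - non_owners[["mean"]]

gap_owners        # 7074
gap_non_owners    # 9894
mean_difference   # 9051
```

The difference of about 9,000 dollars between the two means is the quantity that the hypothesis test and confidence interval later in the analysis are about.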
First we clean the data for our analysis. We keep only the “Yes” and “No” levels of the owngun variable, dropping “Refused”, and we remove the respondents whose class is NA (note that subset() also drops the rows where owngun itself is NA):
gss1 <- subset(gss, owngun != "Refused" & !is.na(class))
# Drop the factor levels left empty by the filtering (e.g. "Refused")
gss2 <- droplevels(gss1)
# Treat the "No Class" level of class as missing
gss2$class <- factor(gss2$class, exclude = "No Class")
Let’s explore this possible relationship with some graphs:
plot(gss2$owngun ~ gss2$coninc, xlab = "Income", ylab = "Gun Ownership")
gss2 %>%
  filter(!is.na(class)) %>%
  ggplot(aes(x = class, fill = owngun)) +
  geom_bar(position = "fill")
ggplot(gss2, aes(owngun, coninc)) +
  geom_boxplot() + xlab("Gunowners") + ylab("Income") + labs(title = "Income comparison")
## Warning: Removed 3157 rows containing non-finite values (stat_boxplot).
We notice a clear upward trend in the proportion of gun owners as income increases, followed by a slight decrease at the highest incomes. The boxplot also shows that the median income of gun owners is higher than that of people who do not own guns, in line with the difference in means seen in the summaries above. Is this difference in means statistically significant?
We check the two conditions necessary to apply the t-distribution to the difference in sample means. (1) Independence: because the respondents are randomly sampled and each group represents less than 10% of its population, the observations within each group are independent; the two groups are also independent of each other, since each respondent belongs to only one of them. (2) Sample size and skew: while each distribution is strongly skewed, the sample sizes are large enough to model each sample mean using a t-distribution. Since both conditions are satisfied, the difference in sample means may be modeled using a t-distribution.
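To illustrate why a large sample compensates for skew, here is a small simulation sketch. It uses synthetic log-normal values as a stand-in for a right-skewed income variable, not the GSS data; the skewness() helper, the distribution parameters, and the sample size of 1000 are all assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical helper: moment-based sample skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# A strongly right-skewed synthetic "income" variable
raw <- rlnorm(10000, meanlog = 10, sdlog = 1)

# Sampling distribution of the mean for samples of size 1000
means <- replicate(2000, mean(rlnorm(1000, meanlog = 10, sdlog = 1)))

skewness(raw)    # large and positive: the individual values are strongly skewed
skewness(means)  # much closer to 0: the sample means are nearly symmetric
```

The groups compared here contain more than ten thousand respondents each, so their sample means should be even closer to symmetric than in this simulation with samples of 1000.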
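We can first examine whether gun ownership depends on income with a chi-square test of independence. We use the categorical variable income06, the binned counterpart of coninc, as the explanatory variable. The conditions are fulfilled (the observations are independent, and every expected count in the output below is at least 5):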
inference(y = owngun, x = income06, data = gss2, statistic = "proportion", success = "Yes",
          type = "ht", method = "theoretical")
## Warning: Use alternative = "greater" for chi-square test
## Response variable: categorical (2 levels)
## Explanatory variable: categorical (26 levels)
## Observed:
##                       y
## x                    Yes  No
##   Under $1 000        14  63
##   $1 000 To 2 999     14  55
##   $3 000 To 3 999      4  45
##   $4 000 To 4 999      4  27
##   $5 000 To 5 999      6  36
##   $6 000 To 6 999      7  45
##   $7 000 To 7 999     14  67
##   $8 000 To 9 999     19  89
##   $10000 To 12499     49 153
##   $12500 To 14999     24 153
##   $15000 To 17499     42 133
##   $17500 To 19999     29  97
##   $20000 To 22499     37 140
##   $22500 To 24999     46 145
##   $25000 To 29999     87 177
##   $30000 To 34999     80 190
##   $35000 To 39999     93 173
##   $40000 To 49999    156 292
##   $50000 To 59999    161 241
##   $60000 To 74999    221 284
##   $75000 To $89999   187 196
##   $90000 To $109999  149 177
##   $110000 To $129999  93 116
##   $130000 To $149999  59  79
##   $150000 Or Over    135 213
##   Refused            157 263
##
## Expected:
##                       y
## x                          Yes        No
##   Under $1 000        26.24621  50.75379
##   $1 000 To 2 999     23.51933  45.48067
##   $3 000 To 3 999     16.70213  32.29787
##   $4 000 To 4 999     10.56665  20.43335
##   $5 000 To 5 999     14.31611  27.68389
##   $6 000 To 6 999     17.72471  34.27529
##   $7 000 To 7 999     27.60965  53.39035
##   $8 000 To 9 999     36.81286  71.18714
##   $10000 To 12499     68.85368 133.14632
##   $12500 To 14999     60.33219 116.66781
##   $15000 To 17499     59.65047 115.34953
##   $17500 To 19999     42.94834  83.05166
##   $20000 To 22499     60.33219 116.66781
##   $22500 To 24999     65.10423 125.89577
##   $25000 To 29999     89.98699 174.01301
##   $30000 To 34999     92.03215 177.96785
##   $35000 To 39999     90.66871 175.33129
##   $40000 To 49999    152.70520 295.29480
##   $50000 To 59999    137.02565 264.97435
##   $60000 To 74999    172.13421 332.86579
##   $75000 To $89999   130.54931 252.45069
##   $90000 To $109999  111.12030 214.87970
##   $110000 To $129999  71.23970 137.76030
##   $130000 To $149999  47.03866  90.96134
##   $150000 Or Over    118.61922 229.38078
##   Refused            143.16113 276.83887
##
## H0: income06 and owngun are independent
## HA: income06 and owngun are dependent
## chi_sq = 261.5867, df = 25, p_value = 0
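As a cross-check, the statistic reported above can be recomputed from the observed counts with base R's chisq.test(). This is only a sketch: the counts are copied by hand from the Observed table, the variable names (yes, no, observed, res) are illustrative, and this is not necessarily how inference() computes the test internally.

```r
# Observed counts copied from the "Observed" table above,
# one entry per income06 level, in the same order
yes <- c(14, 14, 4, 4, 6, 7, 14, 19, 49, 24, 42, 29, 37,
         46, 87, 80, 93, 156, 161, 221, 187, 149, 93, 59, 135, 157)
no  <- c(63, 55, 45, 27, 36, 45, 67, 89, 153, 153, 133, 97, 140,
         145, 177, 190, 173, 292, 241, 284, 196, 177, 116, 79, 213, 263)
observed <- cbind(Yes = yes, No = no)

# Pearson chi-square test of independence
# (the continuity correction only applies to 2 x 2 tables)
res <- chisq.test(observed)

res$statistic       # chi-square statistic
res$parameter       # degrees of freedom: (26 - 1) * (2 - 1) = 25
res$p.value         # p-value
min(res$expected)   # smallest expected count, for the "at least 5" condition
```

Each expected count is the row total times the column total divided by the grand total; for example, the "Under $1 000" / "Yes" cell is 77 * 1887 / 5536, which is about 26.25.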
The p-value is practically 0, so we reject the null hypothesis of independence: the data provide strong evidence that income and gun ownership are associated. Next, we carry out the inference for the difference of two means; the conditions necessary to apply the t-distribution were checked above:
inference(y = coninc, x = owngun, data = gss2, statistic = "mean",
          type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Yes = 12494, y_bar_Yes = 49369.4696, s_Yes = 34742.5996
## n_No = 17555, y_bar_No = 40352.9602, s_No = 35362.8838
## H0: mu_Yes = mu_No
## HA: mu_Yes != mu_No
## t = 22.0082, df = 12493
## p_value = < 0.0001
As the p-value is lower than 0.0001, we reject \(H_{0}\). There is indeed a statistically significant difference between the average incomes of gun owners and non-owners.
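The test statistic can be reproduced by hand from the summary statistics in the output above. The sketch below assumes the unpooled standard error and the conservative degrees of freedom min(n_Yes, n_No) - 1, which is consistent with the df = 12493 printed above; the variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_yes <- 12494; ybar_yes <- 49369.4696; s_yes <- 34742.5996
n_no  <- 17555; ybar_no  <- 40352.9602; s_no  <- 35362.8838

# Standard error of the difference in sample means (unpooled)
se <- sqrt(s_yes^2 / n_yes + s_no^2 / n_no)

# t statistic under H0: mu_Yes - mu_No = 0
t_stat <- (ybar_yes - ybar_no - 0) / se

# Conservative degrees of freedom: the smaller sample size minus 1
df <- min(n_yes, n_no) - 1

# Two-sided p-value
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

t_stat
df
p_value
```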
inference(y = coninc, x = owngun, data = gss2, statistic = "mean",
          type = "ci", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Yes = 12494, y_bar_Yes = 49369.4696, s_Yes = 34742.5996
## n_No = 17555, y_bar_No = 40352.9602, s_No = 35362.8838
## 95% CI (Yes - No): (8213.4551 , 9819.5637)
We are 95% confident that the average annual income of gun owners is between 8213.46 and 9819.56 dollars higher than the average income of people who do not own guns.
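The interval is the point estimate plus or minus a critical t value times the standard error. A minimal sketch that reproduces it from the summary statistics, again assuming the unpooled standard error and df = min(n_Yes, n_No) - 1 (the variable names are illustrative):

```r
# Summary statistics copied from the inference() output above
n_yes <- 12494; ybar_yes <- 49369.4696; s_yes <- 34742.5996
n_no  <- 17555; ybar_no  <- 40352.9602; s_no  <- 35362.8838

# Point estimate and its standard error
point_estimate <- ybar_yes - ybar_no
se <- sqrt(s_yes^2 / n_yes + s_no^2 / n_no)

# Critical value for a 95% interval
df <- min(n_yes, n_no) - 1
t_star <- qt(0.975, df = df)

# 95% confidence interval for mu_Yes - mu_No
ci <- point_estimate + c(-1, 1) * t_star * se
ci
```

Because the interval does not contain 0, it agrees with the rejection of \(H_{0}\) in the two-sided test at the 5% significance level.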