Inferential Analysis - General Social Survey

Siddharth Samant

10/07/2020


Setup

The objective of this assignment is to conduct exploratory data analysis and inferential tests on the General Social Survey (GSS) dataset.

We start by loading some important R packages and datasets.


Load packages

We load the tidyverse set of packages, which includes the visualisation package ggplot2, the data manipulation package dplyr, and the forcats package for working with factors, among others.

We also load the statsr package for easy visualization of confidence intervals and hypothesis tests, and the kableExtra package for table styling and formatting.

library(tidyverse)
library(statsr)
library(kableExtra)

Load data

We load the gss dataset. It is stored as the object gss.

load("gss.Rdata")

Part 1: Data [1]

The General Social Survey (GSS) is a regular, ongoing interview survey of U.S. households conducted by the National Opinion Research Center. The mission of the GSS is to make timely, high-quality, scientifically relevant data available to social science researchers.

The GSS is a personal interview survey and collects information on a wide range of demographic characteristics of respondents and their parents, including:

  • Behavioral items such as group membership and voting.
  • Personal psychological evaluations, including measures of happiness, misanthropy, and life satisfaction.
  • Attitudinal questions on such public issues as abortion, crime and punishment, race relations, gender roles, and spending priorities.

The total number of rows and columns in the dataset are as follows:

knitr::kable(
        tibble(Rows = dim(gss)[1],
               Columns = dim(gss)[2]),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
Rows Columns
57061 114

Let’s take a quick look at the first few rows and columns of the dataset:

knitr::kable(
        gss[1:10,1:8],
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
caseid year age sex race hispanic uscitzn educ
1 1972 23 Female White NA NA 16
2 1972 70 Male White NA NA 10
3 1972 48 Female White NA NA 12
4 1972 27 Female White NA NA 17
5 1972 61 Female White NA NA 12
6 1972 26 Male White NA NA 14
7 1972 28 Male White NA NA 13
8 1972 27 Male White NA NA 16
9 1972 21 Female Black NA NA 12
10 1972 30 Female Black NA NA 12

Data Collection:

  • The basic GSS design is a repeated cross-sectional survey of a nationally representative sample of non-institutionalized adults who speak either English or Spanish.

  • The preferred interview mode is 90-minute in-person interviews; however, a few interviews will be done by telephone in the event that an in-person contact cannot be scheduled.

Survey Design & Scope of Inference:

  • The survey is an observational study. The population of interest is the adult household population of the United States of America.

  • The survey adopts a combination of cluster random sampling and stratified random sampling. The Primary Sampling Units (PSUs) employed are Standard Metropolitan Statistical Areas (SMSAs) or non-metropolitan counties (i.e., clusters). These SMSAs and counties are stratified by region, age, and race before selection.

  • It attempts to limit non-response bias by scheduling the interviews only after 3:00 PM on weekdays, or during weekends and holidays, when respondents are more likely to be at home.

  • Since the study is observational and not an experiment, there is no random assignment. Thus, we can infer correlation between the explanatory and response variables, but not causation.

  • However, since random sampling techniques have been extensively used for the survey, our inferences will be generalizable to the population of interest.


Part 2: Research question

Is there a difference in the average age of Americans who own guns, and those who do not?

In 2019, there were a total of 417 mass shootings in the USA. The non-profit Gun Violence Archive (GVA) defines a mass shooting as any incident in which at least 4 people are shot or killed, not including the shooter. Mass shooting incidents as estimated by GVA are on the rise - there were 337 such incidents in 2017, and 346 in 2018. [2] Mass shootings in the US are often linked to the high rates of gun ownership in the country.

We are interested in seeing if there are age differences in gun ownership in the US. Such a difference might give us an indication of whether gun ownership rates will rise or fall in the future. Organizations who advocate for increasing restrictions on gun ownership might also benefit from this research - they can target their interventions against gun ownership more effectively.


Part 3: Exploratory data analysis

The gss dataset contains the variable owngun. It asks respondents whether they have a gun at home.

knitr::kable(
        gss %>% count(owngun),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
owngun n
Yes 14000
No 20144
Refused 315
NA 22602

14,000 respondents said they had a gun at home, 20,144 said no, 315 refused to answer, and there are 22,602 missing values.

We will make 2 changes to the dataset:

  1. Remove missing values, and the value “Refused” from the owngun variable - we only want to capture “Yes” and “No” answers for the purposes of our research
  2. Filter out all survey years except 2012, which is the latest survey year in the dataset. There are two reasons for this:
    • We want to look at the latest trends in gun ownership. The earlier the survey year, the less relevant the data is
    • We do not want a very high sample size. The higher the sample size, the greater the chance of a statistically significant result that is NOT practically significant.
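# Keep only "Yes"/"No" answers from the most recent survey year (2012)
gss1 <- gss %>%
        filter(owngun %in% c("Yes", "No"),   # %in% also excludes NA values
               year == max(year, na.rm = TRUE)) %>%
        # Drop the now-unused "Refused" factor level
        mutate(owngun = fct_drop(owngun, only = "Refused"))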

knitr::kable(
        gss1 %>% count(owngun),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
owngun n
Yes 440
No 841

Of the survey respondents in 2012, 440 said they owned a gun, and 841 said they did not.
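
The sample-size concern raised above can be made concrete with a small calculation. The sketch below is not part of the original analysis: it assumes a hypothetical, practically negligible difference of 0.5 years between two equally sized groups and an assumed common standard deviation of 17 years, both chosen purely for illustration. It uses only base R.

```r
# Hypothetical inputs, chosen for illustration only
true_diff <- 0.5                   # assumed difference in mean age (years)
sd_age    <- 17                    # assumed standard deviation in each group
n_per_grp <- c(600, 6000, 60000)   # assumed respondents per group

# Standard error and t statistic if the sample difference equalled true_diff
se     <- sd_age * sqrt(2 / n_per_grp)
t_stat <- true_diff / se

# Two-sided p-value, using the conservative df = n - 1
p_value <- 2 * pt(-abs(t_stat), df = n_per_grp - 1)

size_effect <- data.frame(n_per_grp, se, t_stat, p_value)
print(size_effect)
```

Under these assumptions, the same 0.5-year gap is far from significant at 600 respondents per group and highly significant at 60,000 per group, although its practical importance is unchanged.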

Next, let us look at whether there is a visible difference in age between Americans who own guns and those who do not. We will look at the mean age using a summary table, and the median age using a box plot, after filtering out missing values in the age variable.

gss2 <- gss1 %>%
        filter(!is.na(age)) %>%
        group_by(owngun) %>%
        summarise(avgGunOwnerAge = mean(age, na.rm = TRUE))

knitr::kable(
        gss2,
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
owngun avgGunOwnerAge
Yes 51.01818
No 46.55024

The mean age of gun owners is about 51.0 years, while the mean age of those who do not own firearms is about 46.6 years.

gss1 <- gss1 %>%
        filter(!is.na(age))

ggplot(gss1, aes(owngun, age)) +
        geom_boxplot(fill = "seagreen1") +
        labs(
                x = "Gun ownership",
                y = "Age (years)",
                title = "Boxplot - Gun ownership vs. age in years"
        )

The boxplot suggests that gun owners have a higher median age than non-owners. The age distributions of both groups appear roughly symmetric, with no outliers, which is consistent with a nearly normal shape (a boxplot cannot show the number of modes).


Part 4: Inference

We will conduct a Hypothesis Test and a Confidence Interval estimation in order to answer our research question.


A. Null & Alternate Hypotheses

Null Hypothesis (\(H_{0}\)): There is no difference in the mean ages of adult US householders who own guns and those who don’t own guns.

Alternative Hypothesis (\(H_{A}\)): There is a difference in the mean ages of adult US householders who own guns and those who don’t own guns.


B. Conditions for Inference for Comparing Two Means

  1. Independence within groups: The sampled observations within each group are independent. This is because:

    • the observations have been obtained through random sampling
    • the samples are easily less than 10% of the population of interest
  2. Independence between groups: The two groups are independent of each other - the observations in one group are not individually paired with observations of the other group

  3. Sample size/skew: The boxplots we plotted in the previous section indicate roughly symmetric distributions. In any case, our sample sizes (440 and 836) are large enough for the sampling distribution of the difference in means to be nearly normal, even if the population distributions are somewhat skewed


C. Methods

We will be using the following 2 methods for conducting inferential analysis on our data.

  1. Two-tailed Hypothesis Test: We will either reject, or fail to reject, the null hypothesis, on the basis of a two-tailed Hypothesis Test (significance level (\(\alpha\)) = 0.05).
  • \(H_{0}: \mu_{1} - \mu_{2} = 0\)
  • \(H_{A}: \mu_{1} - \mu_{2} \neq 0\)

In the above hypotheses:

  • \(\mu_{1}\) is the mean age of adult US householders who own a gun
  • \(\mu_{2}\) is the mean age of adult US householders who do not own a gun

  2. 95% Confidence Interval: We will also use a 95% confidence interval to confirm the results of our Hypothesis Test. Using a Confidence Interval is analogous to conducting a Two-Tailed Hypothesis Test, provided that 1 - Confidence Level = Significance Level. In this instance, the condition is satisfied, since \(1 - 0.95 = 0.05\).

The justifications for using the above methods are as follows:

  • Our response variable is continuous, while our explanatory variable is categorical
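  • We are investigating the difference between 2 means; hence we use a two-sample t-based Hypothesis Test & Confidence Interval (as opposed to ANOVA, which compares more than 2 means, or a Chi-square Test, which applies to a categorical response variable)
  • Since our research question asks only whether the means differ, without specifying a direction, the Hypothesis Test has to be two-tailed (the difference between means can extend in either direction)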
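
For reference, the two methods can be written out as follows. This is the standard two-sample formulation with unpooled standard deviations; the conservative choice of degrees of freedom is an assumption, although it appears consistent with the value df = 439 reported by the inference function in the next section. Here \(\bar{y}_{1}, s_{1}, n_{1}\) and \(\bar{y}_{2}, s_{2}, n_{2}\) denote the sample mean, sample standard deviation, and sample size for gun owners and non-owners respectively.

\[ SE = \sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}, \qquad T = \frac{(\bar{y}_{1} - \bar{y}_{2}) - 0}{SE}, \qquad df = \min(n_{1} - 1,\; n_{2} - 1) \]

\[ \text{95\% CI for } \mu_{1} - \mu_{2}: \quad (\bar{y}_{1} - \bar{y}_{2}) \pm t^{*}_{df} \times SE \]

where \(t^{*}_{df}\) is the 97.5th percentile of the t-distribution with \(df\) degrees of freedom.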

D. Inference & Interpretation of Results

  1. Hypothesis Test:

Below, we use the inference function from the statsr package to conduct a two-tailed hypothesis test at a significance level of 5%.

inference(y = age, x = owngun, data = gss1, statistic = "mean", 
         type = "ht", null = 0, 
         alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_Yes = 440, y_bar_Yes = 51.0182, s_Yes = 17.1213
## n_No = 836, y_bar_No = 46.5502, s_No = 17.494
## H0: mu_Yes =  mu_No
## HA: mu_Yes != mu_No
## t = 4.3975, df = 439
## p_value = < 0.0001

We get a t test-statistic of 4.3975, which translates to a p-value of less than 0.0001.

The p-value is the probability of obtaining a test statistic as or more extreme in favour of the alternate hypothesis, if the null hypothesis is indeed true.

For a two-tailed Hypothesis Test with \(\alpha = 0.05\), we:

  • Fail to reject the null hypothesis if p-value >= \(\alpha\)
  • Reject the null hypothesis if p-value < \(\alpha\)

Since the p-value is less than 0.0001, \(p\text{-value} < \alpha\).

This leads us to reject the null hypothesis.
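
As a cross-check, the reported test statistic can be approximately reproduced by hand from the summary statistics printed in the inference() output above. The sketch below uses only base R and assumes an unpooled standard error with the conservative degrees of freedom min(n1, n2) - 1; this matches the reported df = 439, but it is an assumption about how the figures arise rather than a description of the statsr internals. The variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_yes <- 440; ybar_yes <- 51.0182; s_yes <- 17.1213
n_no  <- 836; ybar_no  <- 46.5502; s_no  <- 17.494

# Unpooled standard error of the difference in sample means
se <- sqrt(s_yes^2 / n_yes + s_no^2 / n_no)

# t statistic under H0: mu_Yes - mu_No = 0
t_stat <- (ybar_yes - ybar_no - 0) / se

# Conservative degrees of freedom (assumption: smaller sample size minus 1)
df <- min(n_yes, n_no) - 1

# Two-sided p-value and the resulting decision at alpha = 0.05
p_value  <- 2 * pt(-abs(t_stat), df)
alpha    <- 0.05
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"

print(c(t = t_stat, df = df, p_value = p_value))
print(decision)
```

Small discrepancies in the later decimal places are expected, because the inputs are the rounded values shown in the output rather than the raw data.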

  2. Confidence Interval:

Below, we use the inference function from the statsr package to conduct a 95% Confidence Interval estimation.

inference(y = age, x = owngun, data = gss1, statistic = "mean", 
         type = "ci", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Yes = 440, y_bar_Yes = 51.0182, s_Yes = 17.1213
## n_No = 836, y_bar_No = 46.5502, s_No = 17.494
## 95% CI (Yes - No): (2.4711 , 6.4648)

We obtain a 95% confidence interval of (2.4711 , 6.4648).

The confidence interval does not include 0. Thus, we are 95% confident that the difference between the two population means is not zero: gun owners are, on average, between about 2.47 and 6.46 years older than non-owners.

This leads us to reject the null hypothesis.

Furthermore, since the entire 95% confidence interval lies above 0, the data support \(\mu_{1} > \mu_{2}\). Thus, we can state that the average age of adult US householders who own guns (\(\mu_{1}\)) is greater than the average age of adult US householders who do not own guns (\(\mu_{2}\)).
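
The interval can likewise be approximated by hand from the printed summary statistics. This is a minimal base R sketch under the same assumptions as before (unpooled standard error, conservative degrees of freedom of min(n1, n2) - 1); the variable names are illustrative and the calculation is not taken from the statsr source.

```r
# Summary statistics copied from the inference() output above
n_yes <- 440; ybar_yes <- 51.0182; s_yes <- 17.1213
n_no  <- 836; ybar_no  <- 46.5502; s_no  <- 17.494

# Point estimate and unpooled standard error of the difference (Yes - No)
point_est <- ybar_yes - ybar_no
se        <- sqrt(s_yes^2 / n_yes + s_no^2 / n_no)

# Critical value for a 95% interval with conservative degrees of freedom
df     <- min(n_yes, n_no) - 1
t_star <- qt(0.975, df)

# Confidence interval: point estimate plus or minus the margin of error
ci <- point_est + c(-1, 1) * t_star * se
print(round(ci, 3))
```

The result should agree with the reported interval up to rounding of the inputs, and both limits are positive, which is what supports the directional statement above.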


E. Using Both Hypothesis Test and Confidence Interval

The reasons we used both a Hypothesis Test and a Confidence Interval are as follows:

  1. Using a Confidence Interval is analogous to conducting a Two-Tailed Hypothesis Test, provided that 1 - Confidence Level = Significance Level (This condition is satisfied in our analysis). Thus, a Confidence Interval confirms the results of our Hypothesis Test.

  2. Using a Confidence Interval in addition to a Hypothesis Test also allows us to infer whether the difference in means is greater than or less than zero.
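
The correspondence described in point 1 can be checked numerically. The sketch below, again in base R and under the same assumed standard error and degrees of freedom, tests a grid of hypothetical null values for the difference in means and compares each decision with whether that null value falls outside the 95% interval. The grid of null values is arbitrary and serves only as an illustration.

```r
# Summary statistics copied from the inference() output above
n_yes <- 440; ybar_yes <- 51.0182; s_yes <- 17.1213
n_no  <- 836; ybar_no  <- 46.5502; s_no  <- 17.494
alpha <- 0.05

point_est <- ybar_yes - ybar_no
se        <- sqrt(s_yes^2 / n_yes + s_no^2 / n_no)
df        <- min(n_yes, n_no) - 1   # assumed conservative df

# 95% confidence interval (confidence level = 1 - alpha)
ci <- point_est + c(-1, 1) * qt(1 - alpha / 2, df) * se

# Hypothetical null values for mu_Yes - mu_No (illustrative grid)
null_values <- seq(0, 8, by = 0.5)

# Two-sided test of each null value
p_values   <- 2 * pt(-abs((point_est - null_values) / se), df)
rejected   <- p_values < alpha
outside_ci <- null_values < ci[1] | null_values > ci[2]

agreement <- data.frame(null_values, p_values, rejected, outside_ci)
print(agreement)
```

Every null value rejected at the 5% level lies outside the 95% interval, and every value inside the interval is not rejected, which is the sense in which the two procedures give the same answer.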


F. Conclusion

Research Question: Is there a difference in the average age of Americans who own guns, and those who do not?

Conclusion:

  1. We reject the null hypothesis that there is no difference in the average age of US adult householders who own a gun and those who don’t.

  2. We conclude in favour of the alternate hypothesis: there is a statistically significant difference in the average age of US adult householders who own a gun and those who don’t, at a significance level of 5%.

  3. Furthermore, based on the result of a 95% confidence interval estimation, we can state that the average age of adult US householders who own guns is greater than the average age of adult US householders who do not own guns.


References