Statistical Inference with GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(plotly)

Load data and clean by removing columns where ‘NA’ values comprise more than 10% of total rows. We will also delete the caseid column since it serves no useful purpose in our analysis.

load("gss.RData")
gss_filtered <- gss[,colSums(is.na(gss)) <= 0.1*nrow(gss)] %>% select(-caseid)

Part 1: Data

According to the GSS Data Documentation, Appendix A, the GSS observations are a multi-stage area probability sample to the block or segment level. At the block level, quota sampling is used with quotas based on sex, age, and employment status. The cost of the quota samples is substantially less than the cost of a full probability sample of the same size, but there is, of course, the chance of sample biases mainly due to not-at-homes which are not controlled by the quotas. However, in order to reduce this bias, the interviewers are given instructions to canvass and interview only after 3:00 p.m. on weekdays or during the weekend or holidays. This type of sample design is most appropriate when the past experience and judgment of a project director suggest that sample biases are likely to be small relative to the precision of the measuring instrument and the decisions that are to be made.

One should conclude that random sampling is NOT being employed to collect the GSS data because of convenience bias (i.e., respondents need to be available to be interviewed). Therefore, any inferences derived from these data may not be generalizable to the population at large.

Part 2: Research question

For my analysis, I would like to investigate whether people are statistically more financially well-off when they are younger or older in age. I will look at the “Opinion of Family Income” variable, finrela, in the GSS dataset and compare proportional responses between those respondents who are below the mean age of 46 and those who are above the mean. Does the age-old question (pun intended): “Are we ‘older & wiser’ and therefore, more financially stable, as we mature?” turn out to be true or is this simply a myth? Let’s find out!

Part 3: Exploratory data analysis

First, let’s look at a histogram and density plot of `age` in `gss_filtered`

gss_filtered_no_NA <- gss_filtered[!(is.na(gss_filtered$age)),]

age_density <- density(gss_filtered_no_NA$age)
fig <- plot_ly(data=gss_filtered, x=~age,type="histogram",nbinsx=40,
               name="Histogram") %>%
  
  add_trace(x = age_density$x, y = age_density$y, type = "scatter", mode = "lines",
            fill = "tozeroy", yaxis = "y2", name = "Density") %>%
  
  layout(autosize = F, width = 800, height = 400,
         title="Histogram and Density Plot of Age",
         xaxis=list(title="Age"),yaxis=list(title="Number of Instances"),
         yaxis2 = list(overlaying = "y", side = "right", title="Density"),
         legend = list(x = 5, y = 1))
fig

Some summary statistics for the `age` variable…

summary(gss_filtered$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    18.0    31.0    43.0    45.7    59.0    89.0     202

Now, let’s look at a distribution of the values of `finrela`.

finrela_values <- as.data.frame(table(gss_filtered$finrela))

fig <- plot_ly(data=finrela_values,x=~Var1,y=~Freq,type="bar") %>%
  layout(title="Distribution of finrela Values",xaxis=list(title="finrela Value"),
         yaxis=list(title="Number of Instances"))

fig

Part 4: Inference

My hypothesis is associated with the financial satisfaction variable `finrela`.

\(H_0:\) The proportion of those respondents who rated their financial satisfaction as ‘Above Average’ or ‘Far Above Average’ is equal for those who are both above and below the mean age of 46 years.

\(H_a:\) The proportion of those respondents who rated their financial satisfaction as ‘Above Average’ of ‘Far Above Average’ is NOT equal for those who are both above and below the mean age of 46 years.

First, we must refactor the `age` and `finrela` columns.

Recoding for new variable age_group will be “younger” for ages <= 46 years and “older” for ages > 46 years.
Recoding for finrela will as follows: ‘Above Average’ and ‘Far Above Average’ responses will be recoded as ‘Success’ while ‘Average’, ‘Below Average’, and ‘Far Below Average’ responses will be recoded as ‘Failure’.

gss_filtered$age_group <- ifelse(gss_filtered$age<=46,"younger","older")

levels(gss_filtered$finrela)[levels(gss_filtered$finrela) =="Far Above Average"|
                               levels(gss_filtered$finrela)=="Above Average"] <-"Success"

levels(gss_filtered$finrela)[levels(gss_filtered$finrela) =="Average"|
                               levels(gss_filtered$finrela)=="Below Average"|
                               levels(gss_filtered$finrela)=="Far Below Average"] <-"Failure"

We perform the hypothesis test…

# Inference for a proportion - hypothesis test

inference(y = finrela, x = age_group, data = gss_filtered, statistic = "proportion",
          type = "ht",
          method = "theoretical", 
          alternative = "twosided",
          success = "Success")

## Response variable: categorical (2 levels, success: Success)
## Explanatory variable: categorical (2 levels) 
## n_older = 22635, p_hat_older = 0.2045
## n_younger = 29322, p_hat_younger = 0.205
## H0: p_older =  p_younger
## HA: p_older != p_younger
## z = -0.1506
## p_value = 0.8803

We determine the proportion statistics…

# Inference for a proportion
# Calculate 95% confidence intervals for the proportion of 'Success' responses.

inference(y = finrela, x = age_group, data = gss_filtered, statistic = "proportion",
          type = "ci",
          method = "theoretical", 
          success = "Success")

## Response variable: categorical (2 levels, success: Success)
## Explanatory variable: categorical (2 levels) 
## n_older = 22635, p_hat_older = 0.2045
## n_younger = 29322, p_hat_younger = 0.205
## 95% CI (older - younger): (-0.0075 , 0.0065)

Part 5: Summary & Conclusions

Let’s interpret the results we’ve obtained. We reproduce here the conditions for inference for comparing two independent proportions:

Independence - If we assume that each respondent’s answer for the finrela and age variables are independent from another respondent’s answers (which they are…) then:
- We have independence ‘within’ groups
- We have independence ‘between’ groups

Sample Size/Skew - Each sample meets the success-failure condition (y=“younger”, o="older’):
- \(n_{y}p_{y} = (29322)(0.205) >= 10\)
- \(n_{y}(1-p_{y}) = (29322)(0.795) >= 10\)
- \(n_{o}p_{o} = (22635)(0.2045) >= 10\)
- \(n_{o}(1-p_{o}) = (22635)(0.7955) >= 10\)

We see from our inference results that the CI for the difference in proportions passes through 0 and our p-value (0.8803) is significantly greater than 0.05. Hence, we must fail to reject \(H_0\)and conclude that there is no statistically significant difference in the proportions of ‘financially stable’ people between the younger and older age groups.

Statistical Inference with GSS data

Ken Wood

5/8/2021

Setup

Load packages

Part 1: Data

Part 2: Research question

Part 3: Exploratory data analysis

First, let’s look at a histogram and density plot of `age` in `gss_filtered`

Some summary statistics for the `age` variable…

Now, let’s look at a distribution of the values of `finrela`.

Part 4: Inference

My hypothesis is associated with the financial satisfaction variable `finrela`.

First, we must refactor the `age` and `finrela` columns.

We perform the hypothesis test…

We determine the proportion statistics…

Part 5: Summary & Conclusions

Statistical Inference with GSS data

Ken Wood

5/8/2021

Setup

Load packages

Part 1: Data

Part 2: Research question

Part 3: Exploratory data analysis

First, let’s look at a histogram and density plot of age in gss_filtered

Some summary statistics for the age variable…

Now, let’s look at a distribution of the values of finrela.

Part 4: Inference

My hypothesis is associated with the financial satisfaction variable finrela.

First, we must refactor the age and finrela columns.

We perform the hypothesis test…

We determine the proportion statistics…

Part 5: Summary & Conclusions

First, let’s look at a histogram and density plot of `age` in `gss_filtered`

Some summary statistics for the `age` variable…

Now, let’s look at a distribution of the values of `finrela`.

My hypothesis is associated with the financial satisfaction variable `finrela`.

First, we must refactor the `age` and `finrela` columns.