Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

The data file and the R Markdown file must be in the same directory. Once loaded, the data frame is called gss.

load("gss.Rdata")

Part 1: Data

This data was gathered via the General Social Survey (GSS), which aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

It comprises data from respondents surveyed between 1972 and 2012, and contains responses on each respondent’s general background along with their political views, their financial status and conditions, their attitudes towards national problems and social concerns, and their job security and job satisfaction. The survey also gathers respondents' views on controversial social issues which, along with the rest of the data, might provide some insight into current conditions.

Overall, the results generalize to the adult US population because GSS respondents are randomly sampled, but they cannot support causal conclusions: this is observational data, with no random assignment.

Part 2: Research question

Question: Is there a relationship between race and job satisfaction, i.e. does job satisfaction vary by race?

Why is this of interest? It is a way to check whether there is equality across races when it comes to jobs and related outcomes, such as salary and job satisfaction.

Part 3: Exploratory data analysis

The GSS data covers individuals surveyed between 1972 and 2012. Given the research question, the first step is to visualize how job satisfaction varies with common factors such as race and sex.

The following plot takes race and job satisfaction into account:

# Stacked bar chart: respondent counts by race, coloured by job satisfaction.
# geom_bar() counts rows itself, so no histogram-specific arguments are needed.
ggplot(data = gss, aes(x = race, fill = factor(satjob))) +
  geom_bar(width = 1) +
  xlab("Race") +
  ylab("Number of respondents") +
  labs(fill = "satjob")

The next plot further breaks the data down by sex:

# Same stacked bar chart, with one panel per sex
ggplot(data = gss, aes(x = race, fill = factor(satjob))) +
  geom_bar(width = 1) +
  facet_wrap(~sex) +
  xlab("Race") +
  ylab("Number of respondents") +
  labs(fill = "satjob")

Judging by the plots above, job satisfaction does not appear to depend strongly on race.

Let’s look at the numbers to check whether this is really the case. To begin with, we drop the NAs in “satjob” and collapse its levels into two: satisfied (“sat”) and dissatisfied (“dis”):

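# Drop missing answers first, then collapse satjob into two levels:
# the two "Satisfied" levels become "sat", every other level becomes "dis"
jobs <- gss %>%
  filter(!is.na(satjob)) %>%
  mutate(jobsat = ifelse(satjob %in% c("Very Satisfied", "Mod. Satisfied"),
                         "sat", "dis"))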

table(jobs$jobsat)

## 
##   dis   sat 
##  5824 35453

Next, we look at how these two categories are distributed across races:

jobs%>%
  filter(!is.na(race))%>%
  group_by(race, jobsat)%>%
  summarise(total = n())

## # A tibble: 6 x 3
## # Groups:   race [?]
##     race jobsat total
##   <fctr>  <chr> <int>
## 1  White    dis  4353
## 2  White    sat 29242
## 3  Black    dis  1158
## 4  Black    sat  4478
## 5  Other    dis   313
## 6  Other    sat  1733

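As these numbers show, the count of dissatisfied individuals decreases from White to Black to Other, but so does the size of each group. As a share of each group, dissatisfaction is about 13.0% among White respondents (4353 of 33595), 20.5% among Black respondents (1158 of 5636) and 15.3% among Other respondents (313 of 2046), so the rates do differ across races.

So we have our data and can begin hypothesis testing. We will use the chi-square test of independence.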
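
Raw counts are hard to compare across groups of very different sizes, so it helps to look at within-race proportions. The following is a minimal sketch that rebuilds the table from the counts printed above (the object names `counts` and `props` are illustrative, not part of the original analysis) and computes the share of each race in each satisfaction category:

```r
# Counts copied from the summary table above (rows: jobsat, columns: race).
# matrix() fills column by column, so each pair is (dis, sat) for one race.
counts <- matrix(c(4353, 29242, 1158, 4478, 313, 1733), nrow = 2,
                 dimnames = list(jobsat = c("dis", "sat"),
                                 race = c("White", "Black", "Other")))

# margin = 2 normalises within each column, i.e. within each race
props <- prop.table(counts, margin = 2)
round(props, 3)
```

With the real data frame, the same table could be obtained with `prop.table(table(jobs$jobsat, jobs$race), margin = 2)`.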

Part 4: Inference

The null hypothesis (\(H_0\)) states that job satisfaction and race are independent of each other. The alternative hypothesis (\(H_a\)) states that there is a relationship between job satisfaction and race, i.e. job satisfaction varies by race.

First, though, we need to check the conditions for the test:

INDEPENDENCE:

Random sample/assignment: GSS respondents are randomly sampled.
If sampling without replacement, n < 10% of the population: the 41277 respondents used here are far fewer than 10% of American adults.
Each case contributes to only one cell in the table: every respondent falls in exactly one race and one job satisfaction category.

And the second condition:

SAMPLE SIZE:

Each cell must have at least 5 expected cases. With 41277 respondents and the smallest group (Other) numbering 2046, every expected count is in the hundreds or more.

We will need the following formulae for the test: \(\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\), where \(O_i\) is the observed count and \(E_i\) the expected count in cell \(i\), and \(k\) is the number of cells.

\(D_f = (R-1) \times (C-1)\), where \(R\) and \(C\) are the number of rows and columns respectively.

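To get the expected count for each cell, we need the overall satisfaction and dissatisfaction rates, which come from the row and column totals of the contingency table:

\(\text{Expected count} = \frac{(\text{row total}) \times (\text{column total})}{\text{table total}}\)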

table(jobs$jobsat, jobs$race)

##      
##       White Black Other
##   dis  4353  1158   313
##   sat 29242  4478  1733

So, here, we have:

2 rows and 3 columns
Column total (White) = 33595
Column total (Black) = 5636
Column total (Other) = 2046
Row total (sat) = 35453
Row total (dis) = 5824
Table total = 41277
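
As a rough sketch of how the expected-count formula applies to these totals, the snippet below rebuilds the observed table from the counts printed above and computes every expected count at once. The names `observed` and `expected` are chosen here for illustration. It also checks the sample-size condition of at least 5 expected cases per cell.

```r
# Observed counts copied from the contingency table printed above
observed <- matrix(c(4353, 29242, 1158, 4478, 313, 1733), nrow = 2,
                   dimnames = list(c("dis", "sat"),
                                   c("White", "Black", "Other")))

# Expected count = (row total) * (column total) / (table total);
# outer() forms every row-total x column-total product in one step
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)

# Sample-size condition: every expected count should be at least 5
all(expected >= 5)
```

The same matrix is also available as the `expected` component of the object returned by `chisq.test()`.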

Now all we need to do is use these totals to compute the \(\chi^2\) statistic and the degrees of freedom \(D_f\), and from those the p-value. R does all of this in a single call:

Stat_table = table(jobs$jobsat,jobs$race)
Stat_table # our contingency table

##      
##       White Black Other
##   dis  4353  1158   313
##   sat 29242  4478  1733

chisq.test(Stat_table)

## 
##  Pearson's Chi-squared test
## 
## data:  Stat_table
## X-squared = 231.89, df = 2, p-value < 2.2e-16
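
As a cross-check, the statistic reported by `chisq.test()` can be reproduced by hand from the formulae given earlier. This is only a sketch built from the printed counts; the names `observed`, `expected`, `chi_sq`, `deg_free` and `p_value` are illustrative.

```r
# Observed counts copied from the contingency table printed above
observed <- matrix(c(4353, 29242, 1158, 4478, 313, 1733), nrow = 2,
                   dimnames = list(c("dis", "sat"),
                                   c("White", "Black", "Other")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (R - 1) * (C - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

c(chi_sq = chi_sq, df = deg_free, p_value = p_value)
```

The statistic and degrees of freedom should agree with the `X-squared` and `df` values in the output above, and the p-value should fall below the `2.2e-16` threshold that R prints.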

The p-value is extremely small (less than 2.2e-16), so we reject the null hypothesis (\(H_0\)) in favor of the alternative hypothesis (\(H_a\)): the data provide convincing evidence that race and job satisfaction are associated. Because the GSS is observational, this association should not be read as causal.