library(ggplot2)
library(dplyr)
library(statsr)
The data file must be in the same directory as the R Markdown file. Once loaded, the data frame is called gss.
load("gss.Rdata")
This data was gathered via the General Social Survey (GSS), which collects data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
It comprises responses collected from 1972 to 2012 and covers the respondents’ general background along with their political views, their financial status and conditions, their attitudes towards national problems and social concerns, and their job security and satisfaction. The survey also asks respondents about controversial social issues, which, together with the rest of the data, may offer some insight into current conditions.
Overall, the findings are reasonably generalizable, but they cannot establish causation: this is observational rather than experimental data, i.e. there is no evidence of random assignment in the way it was gathered.
Question: Is there a relationship between race and job satisfaction, i.e. does job satisfaction vary by race?
This is of interest as a check on whether there is indeed equality across races when it comes to jobs, both in their related figures (such as salary) and in their related variables (such as job satisfaction).
The GSS data cover individuals surveyed between 1972 and 2012. Given the research question, the first step is to visualize how job satisfaction varies with common factors such as sex and race.
The following plot shows job satisfaction by race:
# Bar chart of respondent counts by race, stacked by job satisfaction level
ggplot(data = gss, aes(x = race, fill = factor(satjob))) +
  geom_bar(width = 1) +
  xlab("Race") +
  ylab("Number of respondents") +
  labs(fill = "satjob")
And, the graph below further breaks the data down by sex.
# Same bar chart, with one panel per sex
ggplot(data = gss, aes(x = race, fill = factor(satjob))) +
  geom_bar(width = 1) +
  facet_wrap(~sex) +
  xlab("Race") +
  ylab("Number of respondents") +
  labs(fill = "satjob")
Judging from the plots above, job satisfaction does not appear to differ much across races.
Let’s look at the numbers to check whether this is really the case. To begin, we drop all the NAs and collapse the levels of the “satjob” variable into two, satisfied (“sat”) and dissatisfied (“dis”):
# Drop missing responses, then collapse satjob into "sat" vs. "dis"
jobs <- gss %>%
  filter(!is.na(satjob))
jobs <- jobs %>%
  mutate(jobsat = ifelse(satjob == "Very Satisfied", "sat",
                  ifelse(satjob == "Mod. Satisfied", "sat", "dis")))
table(jobs$jobsat)
##
##   dis   sat
##  5824 35453
Now, to see how these two factors are distributed across races:
# Count respondents in each combination of race and jobsat
jobs %>%
  filter(!is.na(race)) %>%
  group_by(race, jobsat) %>%
  summarise(total = n())
## # A tibble: 6 x 3
## # Groups:   race [?]
##     race jobsat total
##   <fctr>  <chr> <int>
## 1  White    dis  4353
## 2  White    sat 29242
## 3  Black    dis  1158
## 4  Black    sat  4478
## 5  Other    dis   313
## 6  Other    sat  1733
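As these numbers show, the total number of dissatisfied individuals decreases as we go from White to Black to Other. The groups differ greatly in size, though, so the shares are more telling: roughly 13% of White, 21% of Black and 15% of Other respondents are dissatisfied.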
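Because the three groups differ so much in size, within-race shares are easier to compare than raw counts. The following is a minimal sketch, assuming the counts printed above are entered by hand; the names `counts` and `props` are illustrative and not part of the original analysis.

```r
# Counts copied from the summary above (rows: jobsat, columns: race)
counts <- matrix(
  c( 4353, 1158,  313,
    29242, 4478, 1733),
  nrow = 2, byrow = TRUE,
  dimnames = list(jobsat = c("dis", "sat"),
                  race   = c("White", "Black", "Other"))
)

# Share of dissatisfied and satisfied respondents within each race;
# margin = 2 normalises by column, so each column sums to 1
props <- prop.table(counts, margin = 2)
round(props, 3)
```

With the full data frame, the same shares could presumably be obtained from `prop.table(table(jobs$jobsat, jobs$race), margin = 2)`.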
With the data prepared, we can begin our hypothesis testing. We will use the chi-square test of independence.
The null hypothesis (\(H_0\)) states that job satisfaction and race are independent of each other. The alternative hypothesis (\(H_a\)) states that there is a relationship between job satisfaction and race, i.e. job satisfaction varies by race.
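But first things first: we need to check the conditions before we begin the test. The chi-square test of independence requires independent observations (a random sample in which each respondent contributes to only one cell of the table) and an expected count of at least 5 in every cell.
The test relies on the following formulae: \(\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\), where \(O_i\) is the observed count and \(E_i\) the expected count in cell \(i\), and \(k\) is the number of cells.
\(D_f = (R-1) \times (C-1)\), where \(R\) and \(C\) are the number of rows and columns respectively.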
All that we need next is the overall satisfaction (or dissatisfaction) rate, which gives the expected count for each cell:
\(\text{Expected count} = \frac{(\text{row total}) \times (\text{column total})}{\text{table total}}\)
table(jobs$jobsat, jobs$race)
##
##       White Black Other
##   dis  4353  1158   313
##   sat 29242  4478  1733
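The expected-count formula can be applied to the whole table at once. The sketch below assumes the counts shown above are typed in manually; `observed` and `expected` are illustrative names. It also checks the usual rule of thumb that every expected count is at least 5.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c( 4353, 1158,  313,
    29242, 4478, 1733),
  nrow = 2, byrow = TRUE,
  dimnames = list(jobsat = c("dis", "sat"),
                  race   = c("White", "Black", "Other"))
)

# Expected count = (row total * column total) / table total, for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)

# Sample-size condition: every cell should have an expected count of at least 5
all(expected >= 5)
```

`outer()` builds the full grid of row-total by column-total products, so no explicit loop over cells is needed.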
Now all we need to do is use this data to obtain the \(\chi^2\) statistic and the degrees of freedom \(D_f\), and from these calculate the p-value. In R, this is simply:
Stat_table <- table(jobs$jobsat, jobs$race)
Stat_table # our contingency table
##
##       White Black Other
##   dis  4353  1158   313
##   sat 29242  4478  1733
chisq.test(Stat_table)
##
##  Pearson's Chi-squared test
##
## data:  Stat_table
## X-squared = 231.89, df = 2, p-value < 2.2e-16
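As a cross-check, the statistic reported by `chisq.test()` can be reproduced by hand from the formulae given earlier. This is a sketch under the assumption that the contingency table matches the counts printed above; the variable names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c( 4353, 1158,  313,
    29242, 4478, 1733),
  nrow = 2, byrow = TRUE,
  dimnames = list(jobsat = c("dis", "sat"),
                  race   = c("White", "Black", "Other"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The hand-computed statistic should agree with `chisq.test(observed)$statistic`, since R applies the continuity correction only to 2 x 2 tables.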
The p-value is extremely small (almost equal to 0). This provides convincing evidence against the null hypothesis (\(H_0\)), so we reject it in favor of the alternative hypothesis (\(H_a\)). That is, there is indeed a relationship between race and job satisfaction levels. Since the data are observational, this indicates an association rather than a causal effect.