library(ggplot2)
library(dplyr)
library(statsr)
load("gss.Rdata")
Generalizability: The conditions for generalizability are as follows:
This study has a sample size of 57,061, which is less than 10% of the United States population and large enough to offset the effect of skewness. This satisfies the sample size requirement for generalizing to the United States population. In addition, random sampling was used. Therefore, this study meets the criteria for generalizability to the United States population.
Causality: The survey is an observational study, with no random assignment. Causality could only be concluded from a follow-up experimental study. Therefore, we can only draw conclusions about association from this study.
Among the races recorded in the United States (White, Black, Other), do the proportions of employment types, i.e. self-employed or working for someone else, differ?
This research question is of personal interest to the author. Having grown up in a country in South East Asia, the author has observed that the proportions of employment types differ noticeably between races there. Many anthropologists hypothesize that this phenomenon has complex links to culture and to the history of how certain races came into the country. The author is eager to explore whether a similar phenomenon occurs in the United States, and to see what can be inferred about it.
We subset the dataset to only the variables of interest, i.e. race and wrkslf. First, let's look at summaries of those two variables.
summary(gss$race)
##  White  Black  Other
##  46350   7926   2785
summary(gss$wrkslf)
## Self-Employed  Someone Else          NA's
##          6197         47352          3512
We create a new data frame df with only the race and wrkslf variables, and remove all rows containing NAs.
df <- gss %>%
  select(race, wrkslf) %>%
  na.omit()
From the newly created data frame df, let's see the proportions of respondents who are self-employed and who work for someone else.
overall <- df %>%
  group_by(wrkslf) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
overall
## # A tibble: 2 x 3
## wrkslf n freq
## <fct> <int> <dbl>
## 1 Self-Employed 6197 0.116
## 2 Someone Else 47352 0.884
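Next, we compare the overall proportions with the proportions of employment types within each race.
# After summarise(), the result is still grouped by race,
# so freq is the proportion within each race.
by_race <- df %>%
  group_by(race, wrkslf) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
by_race
## # A tibble: 6 x 4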
## # Groups: race [3]
## race wrkslf n freq
## <fct> <fct> <int> <dbl>
## 1 White Self-Employed 5477 0.125
## 2 White Someone Else 38252 0.875
## 3 Black Self-Employed 454 0.0623
## 4 Black Someone Else 6829 0.938
## 5 Other Self-Employed 266 0.105
## 6 Other Someone Else 2271 0.895
ggplot(data = by_race, aes(x = race, y = freq * 100, fill = wrkslf)) +
  geom_bar(stat = "identity") +
  geom_hline(aes(yintercept = overall[["freq"]][2] * 100, linetype = "Overall proportion working \nfor someone else")) +
  labs(title = "Proportion of employment type between races", y = "Percentage")
From the summary table and plot above, we can see that Black respondents have a relatively larger proportion working for someone else: 0.938, compared to the overall proportion of 0.884. This exploratory data analysis suggests an association between race and employment type. In the inference below, we will see whether this difference could be due to chance alone, or whether employment type and race are indeed dependent.
Hypotheses:
Null hypothesis (nothing going on): Race and employment type are independent. Employment types do not vary by race.
Alternative hypothesis (something going on): Race and employment type are dependent. Employment types do vary by race.
Method: We want to examine the relationship between two categorical variables. In this problem, race has three levels (White, Black, Other) and wrkslf has two levels (Self-Employed, Someone Else). Hence, the chi-square independence test is the most suitable method for this inference. The test yields a p-value but has no associated confidence interval.
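For reference, the statistic behind the chi-square independence test can be written as follows. The symbols are standard textbook notation rather than anything defined in this report: O_ij and E_ij are the observed and expected counts in row i and column j, n is the total number of cases, R is the number of race levels and C is the number of wrkslf levels.

```latex
\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
df = (R - 1)(C - 1) = (3 - 1)(2 - 1) = 2
```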
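Conditions: The data set meets the conditions for a chi-square independence test.
1. Independence: Sampled observations are independent.
   * random sample
   * n < 10% of the population
   * each case contributes to only one cell in the table
2. Sample size: Each cell has at least 5 expected cases. The expected counts are computed below; the smallest is about 294.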
Chi-square independence test
Let’s see the actual contingency table of employment type vs race.
t <- table(df$race, df$wrkslf)
t
##
## Self-Employed Someone Else
## White 5477 38252
## Black 454 6829
## Other 266 2271
For additional information, let's compare it with the expected contingency table, i.e. the counts we would expect if the null hypothesis were true.
chisq.test(t)$expected
##
## Self-Employed Someone Else
## White 5060.5728 38668.427
## Black 842.8309 6440.169
## Other 293.5963 2243.404
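The expected counts above depend only on the row and column totals: under independence, each cell's expected count is (row total × column total) / grand total. The following is a minimal sketch that reproduces them by hand from the observed counts shown earlier. The names `observed` and `expected` are introduced here for illustration only and are not part of the original analysis.

```r
# Observed counts, copied from the contingency table of race vs wrkslf
observed <- matrix(
  c(5477, 38252,
     454,  6829,
     266,  2271),
  nrow = 3, byrow = TRUE,
  dimnames = list(race = c("White", "Black", "Other"),
                  wrkslf = c("Self-Employed", "Someone Else"))
)

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 4)

# Sample-size condition: every expected count should be at least 5
all(expected >= 5)
```

The result should agree with `chisq.test(t)$expected`, which is a useful sanity check that the table was built as intended.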
On the observed contingency table t, we run chisq.test(), R's built-in function for performing the chi-square independence test.
chisq.test(t)
##
## Pearson's Chi-squared test
##
## data: t
## X-squared = 244.54, df = 2, p-value < 2.2e-16
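To see where these numbers come from, the statistic, degrees of freedom and p-value can also be computed step by step. This is a sketch for illustration, not how `chisq.test()` is implemented internally; the names `observed`, `expected`, `x2`, `dof` and `p_value` are assumptions introduced here.

```r
# Observed counts, copied from the contingency table of race vs wrkslf
observed <- matrix(
  c(5477, 38252,
     454,  6829,
     266,  2271),
  nrow = 3, byrow = TRUE,
  dimnames = list(race = c("White", "Black", "Other"),
                  wrkslf = c("Self-Employed", "Someone Else"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

c(x_squared = x2, df = dof, p_value = p_value)
```

Most of the statistic comes from the Black / Self-Employed cell, where the observed count (454) is far below the expected count (about 843).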
The p-value is near 0. Hence, we reject the null hypothesis in favor of the alternative hypothesis. We conclude that race and employment type are dependent: employment types do vary by race. Because the data are observational, this establishes an association, not a causal relationship.