library(ggplot2)## Warning: package 'ggplot2' was built under R version 3.3.3
library(dplyr)## Warning: package 'dplyr' was built under R version 3.3.3
library(statsr)Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called gss. Delete this note when before you submit your work.
load("gss.Rdata")GSS uses random sampling for selecting individuals from the United States population in order to perform their surveys. Additionally the sample size is large enough to be representative of the true population. Based on these information, we can say any study is generalizable.
Causability can be determine only if random assignment is used during an experimental study. in this case, the data collected are not within the framework of an experiment and therefore no information that random assignment was used. Hence, no causal relationship can be proven between variables.
According to the US bureau of Labor statistics, the US unemployment rate by race & ethinicity in 2010 was as follow: rates for Blacks (16.0 percent), Hispanics (12.5 percent), Whites (8.7 percent), and Asians (7.5 percent) https://www.bls.gov/opub/ted/2011/ted_20111005.htm
Using the GSS dataset, can we verify that there is a relationship between unemployment rate and race & ethnicity? What is the average age of unemployed people in the US?
gss10 <- gss %>% filter(year==2010)
aa <- gss10 %>% group_by(race) %>% summarise(sampl_counts=n())
bb <- gss10 %>% filter(tolower(unemp)=="yes") %>% group_by(race) %>% summarise(unemp_counts=n())
cc <- round(bb$unemp_counts/aa$sampl_counts, digits=2)
dd <- c("", sum(aa$sampl_counts), sum(bb$unemp_counts), "")
sampl_summary <- rbind(cbind(aa, bb$unemp_counts, cc), dd)## Warning in `[<-.factor`(`*tmp*`, ri, value = ""): invalid factor level, NA
## generated
names(sampl_summary) <- c("Race/Ethnicity", "Sample Count 2010", "Unemp_Count_2010", "Prop_by_race_2010")
sampl_summary## Race/Ethnicity Sample Count 2010 Unemp_Count_2010 Prop_by_race_2010
## 1 White 1550 357 0.23
## 2 Black 311 103 0.33
## 3 Other 183 60 0.33
## 4 <NA> 2044 520
Below summary shows that our sample yields a rate of unemployment of white(23%), Black(33%) & others(33%).These results seem quite high as compared to the one provided for the same year 2010 by the US bureau of Labor statistic for the same year https://www.bls.gov/opub/ted/2011/ted_20111005.htm
par(mfrow=c(2,2))
gss10 %>% ggplot()+geom_bar(aes(x=race, fill=unemp), position = "stack")+labs(x="Race/Ethnicity", y="Employment Status") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))gss10 %>% ggplot()+geom_bar(aes(x=race, fill=unemp), position = "dodge")+labs(x="Race/Ethnicity", y="Employment Status") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))gss %>% ggplot()+geom_bar(aes(x=year, fill=unemp), position = "stack")+labs(x="By Years", y="Employment Status") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))gss %>% ggplot()+geom_bar(aes(x=year, fill=unemp), position = "dodge")+labs(x="By Years", y="Employment Status") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))len_unemp <- nrow(filter(gss10, tolower(unemp)=="yes"))
gss10 %>% filter(!is.na(age) & !is.na(unemp) & tolower(unemp)=="yes") %>% group_by(age<=24, age > 24 & age <=45, age >45) %>% summarise(counts=n() , unemp_rate =round(counts/len_unemp, digits=2))## Source: local data frame [3 x 5]
## Groups: age <= 24, age > 24 & age <= 45 [?]
##
## `age <= 24` `age > 24 & age <= 45` `age > 45` counts unemp_rate
## <lgl> <lgl> <lgl> <int> <dbl>
## 1 FALSE FALSE TRUE 183 0.35
## 2 FALSE TRUE FALSE 261 0.50
## 3 TRUE FALSE FALSE 76 0.15
From this we can see that the highest concentration of unemployees amoung the population aged between [24-45] years old.
gss10 %>% filter(!is.na(age) & !is.na(unemp)) %>% ggplot()+geom_histogram(aes(x=age, fill=unemp), binwidth = 1)+labs(x="Age", y="Number of unemployed") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))We can see from this plot that there is a high concentration of unemployement with a peak around the age of [24-45]. Above 45 years old, the rate of unemployment decreases compared to the employment rate.
Based on the figures retrieved from the GSS sample of 2010, may we make the inference that the unemployment rate of the Black is higher than reported by the US Bureau of Labor Statistics (BLS)?
We will address this by running hypothesis testing and if possible calculate the confidence interval of the unemployment proportion amoung blacks based on this sample to see if it captures the 2010 population statistics provided by the BLS.
H0: We assume there is nothing going on. The 2010 unemployment rate among Black Americans is 16% https://www.bls.gov/opub/ted/2011/ted_20111005.htm
HA: The unemployment rate of Black Americans is higher than 16%.
H0: \(p=0.16\) HA: \(p > 0.16\)
Note that this is a one sided hypothesis test that was formulate as we are interested in knowing if the unemployment rate is higher than reported.
bl_count <- gss10 %>% filter(tolower(race)=="black") %>% nrow()
wh_count <- gss10 %>% filter(tolower(race)=="white") %>% nrow()
b_hat <- gss10 %>% filter(tolower(race)=="black" & tolower(unemp)=="yes") %>% nrow()/bl_count
w_hat <- gss10 %>% filter(tolower(race)=="white" & tolower(unemp)=="yes") %>% nrow()/wh_count
b_hat <- round(b_hat, digits=2)
w_hat <- round(w_hat, digits=2)
unemp_rate <- data.frame(b_hat, w_hat)
names(unemp_rate) <- c("Black Unemployment rate", "White Unemployment rate")
row.names(unemp_rate) <- c("2010")
unemp_rate## Black Unemployment rate White Unemployment rate
## 2010 0.33 0.23
The estimate of the proportion of unemployment amoungst blacks and whites is calculate as displayed above and stored in the variables b_hat & w_hat.
Independence:
Sample size & skweness:
## We take the proportion given by the US BLS as the true given population proportion ##
p = 0.16
CLT_cond <- data.frame(nrow(gss10)*p, nrow(gss10)*(1-p))
names(CLT_cond) <- c("n*p", "n*(1-p)")
row.names(CLT_cond) <- c("CLT size cond.")
CLT_cond## n*p n*(1-p)
## CLT size cond. 327.04 1716.96
sampl50_2010_bl <- gss10 %>% rep_sample_n(size=200, reps=500, replace=TRUE) %>% filter(tolower(race)=="black") %>% summarise(counts = n())
sampl50_2010_bl_unemp <- gss10 %>% rep_sample_n(size=200, reps=500, replace=TRUE) %>% filter(tolower(race)=="black" & tolower(unemp)=="yes") %>% summarise(counts = n())
bl_unemp_2010 <- data.frame(round(sampl50_2010_bl_unemp$counts/sampl50_2010_bl$counts, digits=2))
names(bl_unemp_2010) <- c("unemp_prop")
ggplot(data= bl_unemp_2010, aes(x=unemp_prop))+geom_histogram(binwidth = 0.01, colour="red", fill="blue")Looking at the sampling distribution, we can already see that the proportion of unemployment rate in 2010 amoung the black american in the studied sample is somewhere around 30 and 40 percents.
As n >30 and that the population standard deviation can be calculated given the population proportion shared by the US BLS, it’s now time to calculate the Z value and find the associated P-value. For this we will use the inference function.
SE = sqrt(p*(1-p)/nrow(gss10))
z_score = (b_hat - p)/SE
P_value = pnorm(z_score, lower.tail = FALSE)
P_value## [1] 6.881791e-98
The \(P(Z\geq\ 20.96 | p=0.16)\) is approximately equal to 0
Let’s now calculate the confidence interval of the proportion of black unemployment using this sample at 95% Confidence Level.
For the confidence interval, the Standard Error formula uses the sample proportion. Therefore \(\hat{\ p}\) will be used instead of p. \(SE = \sqrt{\hat{\ p}(1-\hat{\ p})/n}\)
SE_ci=sqrt(b_hat*(1-b_hat)/nrow(gss10))
sig_lev = 0.05
Zci = qnorm(sig_lev, lower.tail = FALSE)
ME_ci = Zci * SE_ci
CI_bounds <- c(b_hat-ME_ci, b_hat+ME_ci)
CI_bounds## [1] 0.3128927 0.3471073
The confidence interval constructed is \(CI = 0.33 \pm 0.017\)
We can clearly say that the porportion of black americans unemployment reported in 2010 by the US BLS does not fall in the confidence interval built on the basis of this analysis.
We will reject the null Hypothesis that the average unemployment rate amoung the black american in 2010 was equal to 16% in favor of the alternate hypothesis Ha which defends that the average unemployment rate was greater than 16% for black americans in 2010.
Of course these conclusions did not take into account many factors including the N/A which my represent people giving unreliable information about their imployment status. Another information is that people may not be kin at all to provide their employment status as par of such a survey, therefore intorducing some biased in the result of the study.
This will compare the unemployment rate between white and black and report on possible inequality in terms of employment rate. The cause of the inequality is out of the scope of this analysis.
inference(y=unemp, x=race, data = filter(gss10, !is.na(unemp)&tolower(race) %in% c("black", "white")), type = "ht", statistic = "proportion", success = "yes", method = "theoretical", alternative = "greater", sig_level = 0.05, conf_level = 0.95)## Response variable: categorical (2 levels)
## Explanatory variable: categorical (3 levels)
## Observed:
## y
## x Yes No
## White 357 691
## Black 103 102
##
## Expected:
## y
## x Yes No
## White 384.74062 663.2594
## Black 75.25938 129.7406
##
## H0: race and unemp are independent
## HA: race and unemp are dependent
## chi_sq = 19.317, df = 1, p_value = 0
It appears that the parameters users in the current dataset do not yield a the same unemployment by race/enthnicity in 2010 reported by the US BLS. This could be due to some other consideration made by the US BLS but not taken into account in this study.