Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.3

library(statsr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called gss. Delete this note when before you submit your work.

load("gss.Rdata")

Part 1: Data

Generalization

GSS uses random sampling for selecting individuals from the United States population in order to perform their surveys. Additionally the sample size is large enough to be representative of the true population. Based on these information, we can say any study is generalizable.

Causation

Causability can be determine only if random assignment is used during an experimental study. in this case, the data collected are not within the framework of an experiment and therefore no information that random assignment was used. Hence, no causal relationship can be proven between variables.

Part 2: Research question

According to the US bureau of Labor statistics, the US unemployment rate by race & ethinicity in 2010 was as follow: rates for Blacks (16.0 percent), Hispanics (12.5 percent), Whites (8.7 percent), and Asians (7.5 percent) https://www.bls.gov/opub/ted/2011/ted_20111005.htm

Using the GSS dataset, can we verify that there is a relationship between unemployment rate and race & ethnicity? What is the average age of unemployed people in the US?

Part 3: Exploratory data analysis

Summary table of unemployment rate per race in the US.

gss10 <- gss %>% filter(year==2010)

aa <- gss10 %>% group_by(race) %>% summarise(sampl_counts=n()) 
bb <- gss10 %>% filter(tolower(unemp)=="yes") %>% group_by(race) %>% summarise(unemp_counts=n())
cc <- round(bb$unemp_counts/aa$sampl_counts, digits=2)
dd <- c("", sum(aa$sampl_counts), sum(bb$unemp_counts), "")
sampl_summary <- rbind(cbind(aa, bb$unemp_counts, cc), dd)

## Warning in `[<-.factor`(`*tmp*`, ri, value = ""): invalid factor level, NA
## generated

names(sampl_summary) <- c("Race/Ethnicity", "Sample Count 2010", "Unemp_Count_2010", "Prop_by_race_2010")
sampl_summary

##   Race/Ethnicity Sample Count 2010 Unemp_Count_2010 Prop_by_race_2010
## 1          White              1550              357              0.23
## 2          Black               311              103              0.33
## 3          Other               183               60              0.33
## 4           <NA>              2044              520

Below summary shows that our sample yields a rate of unemployment of white(23%), Black(33%) & others(33%).These results seem quite high as compared to the one provided for the same year 2010 by the US bureau of Labor statistic for the same year https://www.bls.gov/opub/ted/2011/ted_20111005.htm

Visualization of employment status distributed by race/ethnicity

par(mfrow=c(2,2))

gss10 %>% ggplot()+geom_bar(aes(x=race, fill=unemp), position = "stack")+labs(x="Race/Ethnicity", y="Employment Status") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))

gss10 %>% ggplot()+geom_bar(aes(x=race, fill=unemp), position = "dodge")+labs(x="Race/Ethnicity", y="Employment Status") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))

gss %>% ggplot()+geom_bar(aes(x=year, fill=unemp), position = "stack")+labs(x="By Years", y="Employment Status") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))

gss %>% ggplot()+geom_bar(aes(x=year, fill=unemp), position = "dodge")+labs(x="By Years", y="Employment Status") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))

Summary table of unemployment by age groups

len_unemp <- nrow(filter(gss10, tolower(unemp)=="yes"))

gss10 %>% filter(!is.na(age) & !is.na(unemp) & tolower(unemp)=="yes") %>% group_by(age<=24, age > 24 & age <=45, age >45) %>% summarise(counts=n() , unemp_rate =round(counts/len_unemp, digits=2))

## Source: local data frame [3 x 5]
## Groups: age <= 24, age > 24 & age <= 45 [?]
## 
##   `age <= 24` `age > 24 & age <= 45` `age > 45` counts unemp_rate
##         <lgl>                  <lgl>      <lgl>  <int>      <dbl>
## 1       FALSE                  FALSE       TRUE    183       0.35
## 2       FALSE                   TRUE      FALSE    261       0.50
## 3        TRUE                  FALSE      FALSE     76       0.15

From this we can see that the highest concentration of unemployees amoung the population aged between [24-45] years old.

gss10 %>% filter(!is.na(age) & !is.na(unemp)) %>% ggplot()+geom_histogram(aes(x=age, fill=unemp), binwidth = 1)+labs(x="Age", y="Number of unemployed") + theme(panel.background = NULL, axis.text.x = element_text(angle=90, vjust=1))

We can see from this plot that there is a high concentration of unemployement with a peak around the age of [24-45]. Above 45 years old, the rate of unemployment decreases compared to the employment rate.

Part 4: Inference

Hypothesis testing & Confidence Interval

Based on the figures retrieved from the GSS sample of 2010, may we make the inference that the unemployment rate of the Black is higher than reported by the US Bureau of Labor Statistics (BLS)?

We will address this by running hypothesis testing and if possible calculate the confidence interval of the unemployment proportion amoung blacks based on this sample to see if it captures the 2010 population statistics provided by the BLS.

Setting Hypothesis

H0: We assume there is nothing going on. The 2010 unemployment rate among Black Americans is 16% https://www.bls.gov/opub/ted/2011/ted_20111005.htm

HA: The unemployment rate of Black Americans is higher than 16%.

H0: \(p=0.16\) HA: \(p > 0.16\)

Note that this is a one sided hypothesis test that was formulate as we are interested in knowing if the unemployment rate is higher than reported.

Calculating the point estimate of the proportion of Black unemployment in 2010

bl_count <- gss10 %>% filter(tolower(race)=="black") %>% nrow()
wh_count <- gss10 %>% filter(tolower(race)=="white") %>% nrow()
b_hat <- gss10 %>% filter(tolower(race)=="black" & tolower(unemp)=="yes") %>% nrow()/bl_count
w_hat <- gss10 %>% filter(tolower(race)=="white" & tolower(unemp)=="yes") %>% nrow()/wh_count

b_hat <- round(b_hat, digits=2)
w_hat <- round(w_hat, digits=2)

unemp_rate <- data.frame(b_hat, w_hat)
names(unemp_rate) <- c("Black Unemployment rate", "White Unemployment rate")
row.names(unemp_rate) <- c("2010")

unemp_rate

##      Black Unemployment rate White Unemployment rate
## 2010                    0.33                    0.23

The estimate of the proportion of unemployment amoungst blacks and whites is calculate as displayed above and stored in the variables b_hat & w_hat.

Let’s now review the conditions for inference before any further calculation

Independence:

GSS used random sampling for performing this survey

The total size of the sample in 2010 is 2044 < 10% of the population of the United States

Sample size & skweness:

## We take the proportion given by the US BLS as the true given population proportion ##

p = 0.16

CLT_cond <- data.frame(nrow(gss10)*p, nrow(gss10)*(1-p))
names(CLT_cond) <- c("n*p", "n*(1-p)")
row.names(CLT_cond) <- c("CLT size cond.")

CLT_cond

##                   n*p n*(1-p)
## CLT size cond. 327.04 1716.96

n x p > 10 and n x (1-p) > 10. The size & Skewness condition is verified.

Let’s draw the sampling distribution

sampl50_2010_bl <- gss10 %>% rep_sample_n(size=200, reps=500, replace=TRUE) %>% filter(tolower(race)=="black") %>% summarise(counts = n())

sampl50_2010_bl_unemp <- gss10 %>% rep_sample_n(size=200, reps=500, replace=TRUE) %>% filter(tolower(race)=="black" & tolower(unemp)=="yes") %>% summarise(counts = n())

bl_unemp_2010 <- data.frame(round(sampl50_2010_bl_unemp$counts/sampl50_2010_bl$counts, digits=2))
names(bl_unemp_2010) <- c("unemp_prop")

ggplot(data= bl_unemp_2010, aes(x=unemp_prop))+geom_histogram(binwidth = 0.01, colour="red", fill="blue")

Looking at the sampling distribution, we can already see that the proportion of unemployment rate in 2010 amoung the black american in the studied sample is somewhere around 30 and 40 percents.

Calculaing the test statistics and the P-value

As n >30 and that the population standard deviation can be calculated given the population proportion shared by the US BLS, it’s now time to calculate the Z value and find the associated P-value. For this we will use the inference function.

SE = sqrt(p*(1-p)/nrow(gss10))
z_score = (b_hat - p)/SE
P_value = pnorm(z_score, lower.tail = FALSE)

P_value

## [1] 6.881791e-98

The \(P(Z\geq\ 20.96 | p=0.16)\) is approximately equal to 0

Confidence interval calculation to validate the result

Let’s now calculate the confidence interval of the proportion of black unemployment using this sample at 95% Confidence Level.

For the confidence interval, the Standard Error formula uses the sample proportion. Therefore \(\hat{\ p}\) will be used instead of p. \(SE = \sqrt{\hat{\ p}(1-\hat{\ p})/n}\)

SE_ci=sqrt(b_hat*(1-b_hat)/nrow(gss10))

sig_lev = 0.05
Zci = qnorm(sig_lev, lower.tail = FALSE) 

ME_ci = Zci * SE_ci

CI_bounds <- c(b_hat-ME_ci, b_hat+ME_ci)
CI_bounds

## [1] 0.3128927 0.3471073

The confidence interval constructed is \(CI = 0.33 \pm 0.017\)

We can clearly say that the porportion of black americans unemployment reported in 2010 by the US BLS does not fall in the confidence interval built on the basis of this analysis.

Conclusion

We will reject the null Hypothesis that the average unemployment rate amoung the black american in 2010 was equal to 16% in favor of the alternate hypothesis Ha which defends that the average unemployment rate was greater than 16% for black americans in 2010.

Of course these conclusions did not take into account many factors including the N/A which my represent people giving unreliable information about their imployment status. Another information is that people may not be kin at all to provide their employment status as par of such a survey, therefore intorducing some biased in the result of the study.

Extra analysis:

Chi-Quare Goodness Of Fit Test for comparing US BLS results to this sample

This will compare the unemployment rate between white and black and report on possible inequality in terms of employment rate. The cause of the inequality is out of the scope of this analysis.

inference(y=unemp, x=race, data = filter(gss10, !is.na(unemp)&tolower(race) %in% c("black", "white")), type = "ht", statistic = "proportion", success = "yes", method = "theoretical", alternative = "greater", sig_level = 0.05, conf_level = 0.95)

## Response variable: categorical (2 levels) 
## Explanatory variable: categorical (3 levels) 
## Observed:
##        y
## x       Yes  No
##   White 357 691
##   Black 103 102
## 
## Expected:
##        y
## x             Yes       No
##   White 384.74062 663.2594
##   Black  75.25938 129.7406
## 
## H0: race and unemp are independent
## HA: race and unemp are dependent
## chi_sq = 19.317, df = 1, p_value = 0

It appears that the parameters users in the current dataset do not yield a the same unemployment by race/enthnicity in 2010 reported by the US BLS. This could be due to some other consideration made by the US BLS but not taken into account in this study.