This Report is the last project for the course in Inferentail Statistics from the online coursera specialization by the Duke University in Statistics with R. The General Social Survey (GSS) data will be analyzed to answer the following question:Is there a difference in the education level across the years for people of different sex and race? During this report convincing statistical evidence is found between the relationship of education level and race. Furthermore it was found that the relationship between education and sex has become non significant. For this report the data and question will be stated, afterwards an exploratory data anlysis will be done to get a better picture of the data. A hypothesis will be defined and tested. This report uses a hypothesis test and ANOVA test to support its findings.
library(knitr)
library(ggplot2)
library(dplyr)
library(scales)
library(gridExtra)
library(kableExtra)
library(statsr)Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
load("gss.Rdata")The General Social Survey (GSS) data is composed of 57,061 cases (rows) and 114 variables (columns) provide by this course. The GSS data was collected by computer-assisted personal interview (CAPI), face-to-face interview and telephone interview of adults (18+) in randomly selected households. Each row corresponds to a person surveyed. From this data the variables that will be used for this report will be years, educ, race, sex.
The GSS is an observational study - with no explicit random assignments to treatments - so all relationships indicated may indication association, but not causation.
Is there a difference in the education level across the years for people of different sex and race?
More specifically we are going to answer the question if there exist a difference between white people and people with other ethnic background.
To answer the question a hypothesis test between the education level in 1980 for the different factors and compared the results with a the results for the year 2010. In a second step a linear model will be fitted using race and sex as variables to predict education. The significance of each variable will be then analyzed using an ANOVA Test.
In this survey the levels given for the variable race were the following
levels(gss$race)[1] "White" "Black" "Other"
To simplify the analysis we will combine both levels “Black” and “Other” into one level “Not White”
gss <- gss%>%
mutate(race = factor(ifelse(race =="White","White","Not White"),
levels = c("White", "Not White")))We can start by taking a look at the distribution of people surveyed by race and by sex.
p1 <- gss %>%
ggplot(aes(x = year,y = (..count..)/sum(..count..), fill = race)) + geom_bar(position = "fill") +
scale_y_continuous(labels = percent) +
labs(title = "Percentage of People by Race",
y = "Percentage [%]",
x = "Year of Survey")
p2 <- gss %>%
ggplot(aes(x = year, y = (..count..)/sum(..count..), fill = sex)) + geom_bar(position = "fill") +
scale_y_continuous(labels = percent) +
labs(title = "Percentage of People by sex",
y = "Percentage [%]",
x = "Year of Survey")
grid.arrange(p1,p2,nrow = 1)We see that there is a clear difference in the amount of people who were surveyed with the vast majority being white however with a slight decreasing tendency over the years. As for the difference between men and women there is an even distribution between sex across the years being nearly 50/50.
From the variables that we will be using we can see how many of them have missing values
# Race
raceNA <-sum(is.na(gss$race))
# Sex
sexNA <- sum(is.na(gss$sex))
# Education
educNA <- sum(is.na(gss$educ))
dt <- data.frame(raceNA,sexNA,educNA)
kable(dt, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")| raceNA | sexNA | educNA |
|---|---|---|
| 0 | 0 | 164 |
We see there are no NA for the variables in race and sex and 164 in education. To simplify the following steps we will neglect the 164 entries with NA values for education.
gss <- gss%>%
filter(!is.na(educ))We can take a look at the overall trend in education and ask an elementary question: How has the education level changed over the years?
gss %>%
group_by(year)%>%
summarise(tmpVar = mean(educ))%>%
ggplot(aes(x = year, y = tmpVar)) + geom_line(size = 1) +
geom_smooth(method = "lm",se = F) +
labs(y="Average Years of Education",
title = "Overall Development in Education")It is clear that there is an overall positive trend in education. Now we can take a deeper look at this information and check if this trend also applies for people of different sex and race.
gss %>%
group_by(year, sex, race) %>%
summarise(m = mean(educ)) %>%
ggplot(aes(x = year, y = m, color = sex)) +
geom_line(size = 1) +
facet_grid(.~race)+
labs(y = "Average Years of Education",
title = "Overall Development in Education")In this plot it is clear that the overall trend applies to both ethnic groups however a difference between both groups in their education level becomes apparent. To illustrate this we can use a visualization with a box plot taking the years 1980 and 2010.
gss %>%
filter(year == 1980 | year == 2010) %>%
ggplot(aes(x = race, y = educ , fill = sex)) + geom_boxplot() +
facet_grid(.~year)In a hypothesis test we start with the null hypothesis which says that there is nothing going on i.e there is no difference in the education level between people of different race and sex. The alternative hypothesis is that there is something going on and hence a difference in the education level.
\(H_0: \mu_{1980} - \mu_{2010} = 0\)
\(H_A: \mu_{1980} - \mu_{2010} \neq 0\)
In order to check condition 2 we can draw a histogram of the variable of interest which in this case is education.
gss %>%
ggplot(aes(x = educ, fill = race)) + geom_histogram(binwidth = 1) +
labs(x = "Years of Education", y = "Number of People",
title = "Histogram of Education Years")The distribution for this variable is somewhat left skewed however the number of people surveyed for both ethnic groups is way greater than 30 and so we can asume normality for the further analysis.
dat <- gss %>%
filter(year == 1980 | year == 2010) %>%
group_by(race, year) %>%
summarise(mean = mean(educ), sd = sd(educ), num = n())
dat %>% kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")| race | year | mean | sd | num |
|---|---|---|---|---|
| White | 1980 | 12.09977 | 3.146563 | 1313 |
| White | 2010 | 13.68887 | 3.047769 | 1546 |
| Not White | 1980 | 11.23333 | 3.655311 | 150 |
| Not White | 2010 | 12.74645 | 3.352376 | 493 |
We can compute the standard error associated with the data for this we will define the function computeSE which will take a data frame that contains the standard deviations and the number of observations and compute the SE.
computeSE <- function(dat){
sd <- dat$sd
n <- dat$num
tmp <- sqrt(sd[1]/n[1] + sd[2]/n[2])
return(tmp)
}
se_1980 <- dat %>%
filter(year ==1980) %>%
computeSE()
se_2010 <- dat %>%
filter(year ==2010) %>%
computeSE()With the SE we can now compute our t-statistic
computeTStat <- function(dat){
mu <- dat$mean
tmp <- (mu[1]-mu[2])/computeSE(dat)
return(abs(tmp))
}
tstat_1980 <- dat %>%
filter(year == 1980) %>%
computeTStat()
tstat_2010 <- dat %>%
filter(year == 2010) %>%
computeTStat()
dt <- data.frame(tstat_1980,tstat_2010)
kable(dt, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")| tstat_1980 | tstat_2010 |
|---|---|
| 5.296052 | 10.06267 |
We can compute the p-value i.e. the probability of getting such a t-statistic given that the null hypothesis is true and use an alpha of 95% to evaluate this hypothesis.
computeDOF <- function(dat){
n <- dat$num
return(min(n)-1)
}
dof_1980 <- dat %>%
filter(year == 1980) %>%
computeDOF()
dof_2010 <- dat %>%
filter(year == 2010) %>%
computeDOF()
pval_1980 <- pt(tstat_1980, dof_1980, lower.tail = F)*2
pval_2010 <- pt(tstat_2010, dof_2010, lower.tail = F)*2
dt <- data.frame(pval_1980,pval_2010)
kable(dt, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")| pval_1980 | pval_2010 |
|---|---|
| 4e-07 | 0 |
We see that both values are well below the 5% margin and we can conclude that this data show convincing evidence that there exist a relationship in the difference in education level between white people and people of a different race.
We can continue the analysis and see the linear effect of each variable by fitting two linear models for the years 1980 and 2010 respectively.
dat_1980 <- gss %>%
filter(year == 1980)
modelFit_1980 <- lm(educ ~ race + sex, data = dat_1980)
dat_2010 <- gss %>%
filter(year == 2010)
modelFit_2010 <- lm(educ ~ race + sex, data = dat_2010)With this models we can do an ANOVA test for the different years and compare their output.
Side Note: This models are not the best possible models to represent the data as seen with their \(R^2\) value of 0.0105738 for the 1980 model and 0.0164921 for the 2010 model.
Taking a look at the output for both models we see that race has a statistical significance seen from the small p-value for both years 1980 and 2010. Sex on the other side is no longer a significant predictor. With a very high p-value in the anova test we see that the variance in education explained by sex is minimal.
summary(modelFit_1980)
Call:
lm(formula = educ ~ race + sex, data = dat_1980)
Residuals:
Min 1Q Median 3Q Max
-12.322 -1.918 0.082 1.678 8.899
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.3219 0.1281 96.166 < 2e-16 ***
raceNot White -0.8166 0.2763 -2.955 0.00318 **
sexFemale -0.4039 0.1690 -2.391 0.01695 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.197 on 1460 degrees of freedom
Multiple R-squared: 0.01057, Adjusted R-squared: 0.009218
F-statistic: 7.801 on 2 and 1460 DF, p-value: 0.0004265
summary(modelFit_2010)
Call:
lm(formula = educ ~ race + sex, data = dat_2010)
Residuals:
Min 1Q Median 3Q Max
-13.660 -1.712 -0.660 2.288 7.284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.65999 0.11105 123.004 < 2e-16 ***
raceNot White -0.94422 0.16169 -5.840 6.07e-09 ***
sexFemale 0.05198 0.13961 0.372 0.71
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.125 on 2036 degrees of freedom
Multiple R-squared: 0.01649, Adjusted R-squared: 0.01553
F-statistic: 17.07 on 2 and 2036 DF, p-value: 4.445e-08
anova(modelFit_1980)Analysis of Variance Table
Response: educ
Df Sum Sq Mean Sq F value Pr(>F)
race 1 101.1 101.062 9.8879 0.001697 **
sex 1 58.4 58.411 5.7149 0.016948 *
Residuals 1460 14922.4 10.221
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(modelFit_2010)Analysis of Variance Table
Response: educ
Df Sum Sq Mean Sq F value Pr(>F)
race 1 332.0 332.00 34.0023 6.386e-09 ***
sex 1 1.4 1.35 0.1386 0.7097
Residuals 2036 19879.3 9.76
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We have found strong statistical evidence that there is a difference in education level within people of different race. This difference was present as far as 1980 and was still present in the year 2010. Sex on the other hand lost its significance level over the years meaning that there is no strong statistical evidence to reject the null hypothesis.
Since this was an observational study, we can generalize causality but not causation. We can only concude that there is corelation between the variables.