This document is the report from the final course project for the Inferential Statistics course, as part of the Duke University Statistics with R course in partnership with Coursera. The project consisted of exploring a real-world dataset (a subset of the General Social Survey) and create a report that included statistical inference on a question of interest (ended using three quesitons).
Note: all of the research questions are focusing on the survey data in the year 2012 to account for year-on-year variances and display the most recent view that our data could provide.
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
Source: GSS project description https://www.norc.org/Research/Projects/Pages/general-social-survey.aspx
According to the below link, each survey from 1972 to 2004 was an independently drawn sample of English-speaking persons 18 years of age or over, living in non-institutional arrangements within the United States through probability sampling (page 8). http://gss.norc.org/documents/codebook/GSS_Codebook_intro.pdf
However, there is a major difference between the individuals prior to 2006 and after, as explained below
Furthermore, the selection process was ran through systematic selection and controlled selection. Therefore, it can be assumed that the data is representative of the populations mentioned above, based on the year in focus.
Due to the observational nature of the study, no random assignment was used (test/control group), and hence causality cannot be inferred.
It would be helpful to know if this perception is dependent on gender, since sex is an important topic and can lead to serious family and health consequences, amongst others. This is especially important for single parents, or parents of the same sex, where one’s stance and actions cannot overcompensate for the other’s.
This is important as the earned income should be independent of someone’s family background. If there are differences, further measures should be put into place for helping out those in unfavourable groups.
Note: all of the research questions are focusing on the survey data in the year 2012 to account for year-on-year variances and display the most recent view that our data could provide.
We can see that there is a higher proportion of men that oppose sexual education in schools (11%) than women (8%). The absolute numbers can be seen in the table below.
question_1_SexEduc <- gss%>%
filter(year == 2012)%>%
filter(sexeduc != "Depends" & !is.na(sexeduc))%>%
group_by(sex, sexeduc)%>%
summarise(count = n())%>%
mutate(prop = round(count/sum(count),2)*100)
ggplot(question_1_SexEduc, aes(sex, prop, fill = sexeduc))+
geom_col(col = "black")+
theme_few()+
theme(legend.position = "top")+
geom_text(aes(label = paste(prop, "%", sep = "")), nudge_y = -3)+
labs(y="Proportion",
x = NULL,
fill = "Perception on Sexual Education in schools")question_1_SexEduc%>%
kbl(caption = "Summary statistics")%>%
kable_paper("hover", full_width = F)| sex | sexeduc | count | prop |
|---|---|---|---|
| Male | Favor | 519 | 89 |
| Male | Oppose | 64 | 11 |
| Female | Favor | 638 | 92 |
| Female | Oppose | 53 | 8 |
We can notice large differences both in median salary and variances across the different classes.
question_3_detail <- gss%>%
filter(year == "2012")%>%
filter(!is.na(incom16))%>%
filter(!is.na(coninc))%>%
select(incom16, coninc)%>%
droplevels()
ggplot(question_3_detail, aes(incom16, coninc, fill = incom16))+
geom_boxplot(col = "black", varwidth = TRUE)+
labs(x = NULL,
y = "Total family income (yearly)")+
scale_y_continuous(labels=scales::dollar_format())+
coord_flip()+
theme_few()+
theme(legend.position = "none")question_3_year <- gss%>%
filter(year == "2012")%>%
group_by(incom16)%>%
summarise(count = n(),
median_income = round(median(coninc, na.rm = TRUE),1))%>%
mutate(prop = round(count/sum(count)*100,2))
question_3_year%>%
kbl(caption = "Summary statistics")%>%
kable_paper("hover", full_width = F)| incom16 | count | median_income | prop |
|---|---|---|---|
| Far Below Average | 181 | 24895 | 9.17 |
| Below Average | 523 | 34470 | 26.49 |
| Average | 875 | 34470 | 44.33 |
| Above Average | 304 | 51705 | 15.40 |
| Far Above Average | 47 | 34470 | 2.38 |
| NA | 44 | 21065 | 2.23 |
question_1 <- gss%>%
filter(year == "2012")%>%
select(sex,sexeduc)%>%
droplevels()
table(question_1$sex, question_1$sexeduc)%>%
kbl(caption = "Contingency table")%>%
kable_paper("hover", full_width = F)| Favor | Oppose | |
|---|---|---|
| Male | 519 | 64 |
| Female | 638 | 53 |
In this analysis, we’re working with two categorical variables and thus, we will be using an inference based on two proportions. This means that we will be able to report both a p value and a confidence interval. Given that we met all of the conditions, we can use a theoretical based method rather than a simulation based inference.
Interpretation:
We observed a p-value lower than 5%, so for this question, we reject the null hypothesis. Thus, there is sufficient evidence to conclude that a larger proportion of males, compared to females, oppose sex education. In the context of our data, a p value of 0.02 means that given a true null hypothesis, (ie. if there were no differences between the groups), there’s only a chance of 2% for this difference between males and females to have simply happened by chance.
Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. The test provides evidence concerning the plausibility of the hypothesis, given the data. Statistical analysts test a hypothesis by measuring and examining a random sample of the population being analyzed
sexeduc_2012 <- gss%>%
filter(year == 2012)%>%
filter(sexeduc %in% c("Favor", "Oppose"))%>%
select(sex,sexeduc)%>%
droplevels()
inference(x = sex,
y = sexeduc,
data = sexeduc_2012,
statistic = "proportion",
type = "ht",
method = "theoretical",
alternative = "greater",
success = "Oppose",
null = 0)## Response variable: categorical (2 levels, success: Oppose)
## Explanatory variable: categorical (2 levels)
## n_Male = 583, p_hat_Male = 0.1098
## n_Female = 691, p_hat_Female = 0.0767
## H0: p_Male = p_Female
## HA: p_Male > p_Female
## z = 2.0367
## p_value = 0.0208
Interpretation:
The 95% confidence interval for the difference between proportions of males and females that oppose sex education is between 0.0009 to 0.0653 (it excludes 0). Though the difference is really small, it still exists. Hence the results for both tests agree, and we can say that a greater proportion of men, than females, oppose sex education, and steps can be taken to better educate the men on the issue.
inference(x = sex,
y = sexeduc,
data = sexeduc_2012,
statistic = "proportion",
type = "ci",
method = "theoretical",
#null = 0,
#alternative = "greater",
success = "Oppose")## Response variable: categorical (2 levels, success: Oppose)
## Explanatory variable: categorical (2 levels)
## n_Male = 583, p_hat_Male = 0.1098
## n_Female = 691, p_hat_Female = 0.0767
## 95% CI (Male - Female): (9e-04 , 0.0653)
question_3_year%>%
filter(!is.na(incom16))%>%
kbl(caption = "Summary statistics")%>%
kable_paper("hover", full_width = F)| incom16 | count | median_income | prop |
|---|---|---|---|
| Far Below Average | 181 | 24895 | 9.17 |
| Below Average | 523 | 34470 | 26.49 |
| Average | 875 | 34470 | 44.33 |
| Above Average | 304 | 51705 | 15.40 |
| Far Above Average | 47 | 34470 | 2.38 |
However, the sample size is relatively large for each class thus, unlike the independence of data, this assumption is not crucial.
ggplot(question_3_detail, aes(sample=coninc, col = incom16))+
stat_qq()+
stat_qq_line()+
facet_wrap(.~incom16)+
labs(title = "QQ plot")+
theme_few()+
theme(legend.position = "none")question_3_detail %>%
select(incom16, coninc) %>%
group_by(group = as.character(incom16)) %>%
do(tidy(shapiro.test(.$coninc)))%>%
#put this into a aesthetically pleasing dataframe
kbl(caption = "Shapiro-Wilk normality test")%>%
kable_paper("hover", full_width = F)| group | statistic | p.value | method |
|---|---|---|---|
| Above Average | 0.8291873 | 0.0e+00 | Shapiro-Wilk normality test |
| Average | 0.7930347 | 0.0e+00 | Shapiro-Wilk normality test |
| Below Average | 0.7740452 | 0.0e+00 | Shapiro-Wilk normality test |
| Far Above Average | 0.7785680 | 1.3e-06 | Shapiro-Wilk normality test |
| Far Below Average | 0.6971207 | 0.0e+00 | Shapiro-Wilk normality test |
When the homogeneity assumption is violated, as seen above, we could use the Welch one-way test, which does not require for the assumption of equal variances to be met.
res.aov <- aov(coninc ~ incom16, data = question_3_detail)
tidy(leveneTest(question_3_detail$coninc~question_3_detail$incom16))%>%
kbl(caption = "Levene's test")%>%
kable_paper("hover", full_width = F)| statistic | p.value | df | df.residual |
|---|---|---|---|
| 12.3667 | 0 | 4 | 1723 |
ggplot(res.aov, aes(.fitted, .resid, col = incom16))+
geom_point()+
stat_smooth(method="loess", show.legend = FALSE)+
geom_hline(yintercept=0, col="red", linetype="dashed")+
labs(x = "Fitted values",
y = "Residuals",
title = "Residual vs Fitted Plot",
col = "Class")+
theme_few()+
theme(legend.position = "top")Given that we have one categorical variable with multiple levels (class) and one numerical variable (tv hours per day), the appropriate method for analysis is ANOVA. At the same time, we violated the homogeneity assumption and a one way test is more appropriate in this case. Furthermore, we will account for the increased risk of type I error with the Tukey test(due to multple comparisons), as well as understand which groups are significant (or not). Lastly, this means that we will be using reporting a p value but not a confidence interval, since this type of statistical analysis does not support the latter.
ANOVA is used to compare differences of means among more than 2 groups. It does this by looking at variation in the data and where that variation is found (hence its name). Specifically, ANOVA compares the amount of variation between groups with the amount of variation within groups. It can be used for both observational and experimental studies.
Interpretation:
We can start by conducting a ANOVA test, despite the homogeneity assumption being violated, as a point of comparison with the one way test (below). We can see that the p value is almost 0, which means we can reject the null hypothesis and that there are significant differences between at least two of our groups.
tidy(res.aov)%>%
kbl(caption = c("ANOVA table - without intercept"), )%>%
kable_paper("hover", full_width = F)| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| incom16 | 4 | 1.248804e+11 | 31220101666 | 14.61664 | 0 |
| Residuals | 1723 | 3.680206e+12 | 2135929026 | NA | NA |
# tidy(Anova(res.aov, type =3))%>%
# kbl(caption = c("ANOVA table - with intercept"))%>%
# kable_paper("hover", full_width = F)
variance_explained <- round(
summary.lm(res.aov)$r.squared*100, 1
)In the context of the research question, the ANOVA output also shows us that only around 3.3% of the variance in total family income is explained by the respondent’s income when 16 years old. However, this only tells us that some groups are significantly different to others, not which ones. We will need to run a Tukey post hoc analysis to find out more detail.
A similar result is shown by our one way test below. No other methods were applicable and hence there’s nothing to compare other than the two described above.
res.aov2 <- oneway.test(coninc ~ incom16, data = question_3_detail)
tidy(res.aov2)%>%
kbl(caption = c("One way test"), )%>%
kable_paper("hover", full_width = F)| num.df | den.df | statistic | p.value | method |
|---|---|---|---|---|
| 4 | 246.654 | 11.30872 | 0 | One-way analysis of means (not assuming equal variances) |
Once we account for the increased risk of type I error with the Tukey test, we can still see a significant difference between the following groups: - Above Average and Far Below Average - Above Average and Below Average - Above Average and Average
tidy(TukeyHSD(res.aov))%>%
kbl(caption = c("Tukey post hoc analysis"))%>%
kable_paper("hover", full_width = F)| term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|---|
| incom16 | Below Average-Far Below Average | 0 | 4398.047 | -7110.550 | 15906.64 | 0.8350256 |
| incom16 | Average-Far Below Average | 0 | 9457.687 | -1509.506 | 20424.88 | 0.1284567 |
| incom16 | Above Average-Far Below Average | 0 | 28072.921 | 15507.856 | 40637.99 | 0.0000000 |
| incom16 | Far Above Average-Far Below Average | 0 | 20273.529 | -1404.306 | 41951.36 | 0.0796375 |
| incom16 | Average-Below Average | 0 | 5059.640 | -2264.573 | 12383.85 | 0.3250145 |
| incom16 | Above Average-Below Average | 0 | 23674.875 | 14122.616 | 33227.13 | 0.0000000 |
| incom16 | Far Above Average-Below Average | 0 | 15875.482 | -4206.682 | 35957.65 | 0.1961012 |
| incom16 | Above Average-Average | 0 | 18615.235 | 9722.700 | 27507.77 | 0.0000001 |
| incom16 | Far Above Average-Average | 0 | 10815.842 | -8961.034 | 30592.72 | 0.5669211 |
| incom16 | Far Above Average-Above Average | 0 | -7799.392 | -28505.101 | 12906.32 | 0.8421870 |