library(ggplot2)
library(dplyr)
library(reshape2)
library(tidyr)
library(statsr)
Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called gss. Delete this note before you submit your work.
load("gss.Rdata")
dim(gss)
## [1] 57061 114
The GSS replicates questionnaire items and wording in order to facilitate time-trend studies. The results can be generalized to the target population because the GSS data were sampled from all noninstitutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States. However, since the study does not use random assignment, it is an observational study: we can only establish associations, not causation, because a controlled experiment was not part of the study design. The sample may still be biased: although there is randomness in the selection of respondents, participation is voluntary, so the survey misses people who do not wish to be surveyed.
The data set includes many kinds of information. We study only some of these variables:
dat = select(gss,year, age, sex, race, educ, degree, marital, partyid, polviews)
summary(dat)
dat = na.exclude(dat)
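Dropping every row that has a missing value in any of the nine selected variables can remove more observations than a single analysis needs. A minimal sketch of how one might check this, using a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not GSS data):

```r
# Toy stand-in for two of the selected columns (values are invented)
toy <- data.frame(
  age      = c(25, NA, 40, 61),
  polviews = factor(c("Liberal", "Moderate", NA, "Conservative"))
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that survive listwise deletion, which is what na.exclude() does
toy_complete <- na.exclude(toy)
print(nrow(toy_complete))
```

Running `colSums(is.na(dat))` before the `na.exclude()` call would show which variables account for most of the dropped rows.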
Trump versus Hillary, the 2016 presidential election, is one of the hottest shows of this year! So I want to investigate the political views of Americans to get some insight into this election. My research question is: what is the relationship between political views and some other variables? Several sub-questions to this research question are listed below:
table(gss$polviews)
##
##     Extremely Liberal               Liberal      Slightly Liberal 
##                  1330                  5582                  6181 
##              Moderate Slightly Conservative          Conservative 
##                 18494                  7691                  7092 
##  Extrmly Conservative 
##                  1506 
ggplot(filter(gss, !is.na(polviews)), aes(x = polviews)) + geom_bar(fill = "green") + ylab("number of respondents") + xlab("political views")
So the sample distribution of political views is roughly symmetric and unimodal, with its peak at Moderate.
# dat already has missing values removed, so no NA handling is needed here
ggplot(dat, aes(x = polviews, y = age)) + geom_boxplot() + ylab("age of respondent") + xlab("political views")
So we can see that conservatives tend to be older.
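The visual impression from a boxplot can be backed up with a numeric summary per group. A minimal sketch on invented toy data (the `toy` data frame and the column names `median_age` and `mean_age` are assumptions for illustration):

```r
library(dplyr)

# Toy data: ages are invented, not taken from the GSS
toy <- data.frame(
  polviews = c("Liberal", "Liberal", "Conservative", "Conservative", "Conservative"),
  age      = c(30, 40, 50, 60, 70)
)

# Group size, median and mean age for each political view
age_summary <- toy %>%
  group_by(polviews) %>%
  summarise(
    n          = n(),
    median_age = median(age),
    mean_age   = mean(age)
  )

print(age_summary)
```

Applied to `dat`, the same pipeline would give the medians that the boxplot draws as the centre lines.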
dat1 = select(gss, year, polviews) %>%
  na.exclude() %>%
  group_by(year) %>%
  summarise(
    extremely_liberal = table(polviews)[1] / length(polviews),
    liberal           = table(polviews)[2] / length(polviews),
    slightly_liberal  = table(polviews)[3] / length(polviews),
    moderate          = table(polviews)[4] / length(polviews),
    slightly_conser   = table(polviews)[5] / length(polviews),
    conser            = table(polviews)[6] / length(polviews),
    extremely_conser  = table(polviews)[7] / length(polviews)
  )
dat1 = melt(dat1, id = "year", measure.vars = names(dat1)[2:8])
ggplot(dat1, aes(x = year, y = value, group = variable, colour = variable)) + geom_line() + ylab("proportions holding different political views")
So the proportions of people holding different political views remain stable over these years.
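The yearly proportions can also be computed directly in long format, which avoids indexing `table()` by position and the later `melt()` step. This is a minimal sketch on invented toy data, not the code used for the plot above; the `toy` data frame and the `prop` column name are assumptions:

```r
library(dplyr)

# Toy data: two survey years with invented responses
toy <- data.frame(
  year     = c(2000, 2000, 2000, 2000, 2002, 2002),
  polviews = c("Liberal", "Moderate", "Moderate", "Conservative",
               "Liberal", "Moderate")
)

props <- toy %>%
  count(year, polviews) %>%        # one row per year/view pair, count in n
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%    # share of each view within its year
  ungroup()

print(props)
```

One difference to keep in mind: `count()` omits year/view combinations with zero responses, whereas the `table()` approach reports them as a proportion of 0.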
From the boxplot above, there appear to be differences in age between the groups of people holding different political views. We now carry out a hypothesis test:
\[H_0: \mbox{There is no difference in mean age between the groups of people holding different political views}\]
\[H_1: \mbox{There are differences in mean age between the groups of people holding different political views}\]
Let the mean ages of the seven groups be \(\mu_1, \ldots, \mu_7\). The hypotheses can then be written as: \[H_0: \mu_1 = \cdots = \mu_7 \quad \mbox{versus} \quad H_1: \mbox{the means are not all equal to each other}\]
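Since we are testing whether mean age differs across more than two groups of people holding different political views, we can use analysis of variance (ANOVA). With political views as the single factor, this is a one-way ANOVA, which assumes independent observations, approximately normal ages within each group, and roughly equal variances across groups.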
ml = aov(age~polviews,data=gss)
summary(ml)
##                Df   Sum Sq Mean Sq F value Pr(>F)    
## polviews        6   234618   39103   131.4 <2e-16 ***
## Residuals   47722 14204374     298                   
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 9332 observations deleted due to missingness
Since the p-value is very small (below 2e-16), we reject the null hypothesis. So the data provide evidence that mean age differs between at least some of the political-view groups, provided the assumptions of the test are met.
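As a quick check on the ANOVA table, the F statistic is the ratio of the between-group mean square to the residual mean square, each being a sum of squares divided by its degrees of freedom. Using the figures printed in the table (the subscripts below are just labels for its two rows):

\[F = \frac{MS_{\mbox{polviews}}}{MS_{\mbox{residuals}}} = \frac{234618 / 6}{14204374 / 47722} \approx \frac{39103}{297.6} \approx 131.4\]

This agrees with the reported F value of 131.4 on 6 and 47722 degrees of freedom.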
One of the key assumptions of our hypothesis test is that age is approximately normally distributed within each group. We can check this with a normal QQ plot of age for each political-view group; the plot for the Liberal group is shown here.
dat = select(gss,age, polviews) %>%
filter(polviews =="Liberal") %>%
select(age)
qqnorm(dat$age)
But it seems that the normality assumption is not satisfied. Along with other doubts about the assumptions of ANOVA, this means we need more information and analysis to conclude whether the data are evidence of an association between age and political views.
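One possible follow-up, which is not part of the analysis above, is a rank-based test that does not rely on normality, such as the Kruskal-Wallis test available in base R as `kruskal.test()`. The sketch below runs it on simulated, deliberately skewed ages; the `sim` data frame, its three groups, and the rate parameters are assumptions for illustration only:

```r
set.seed(1)

# Simulated right-skewed ages for three hypothetical groups (not GSS data)
sim <- data.frame(
  polviews = factor(rep(c("Liberal", "Moderate", "Conservative"), each = 50)),
  age      = c(rexp(50, rate = 1 / 20) + 18,
               rexp(50, rate = 1 / 25) + 18,
               rexp(50, rate = 1 / 30) + 18)
)

# Kruskal-Wallis rank sum test: compares the groups using ranks of age
kw <- kruskal.test(age ~ polviews, data = sim)
print(kw)
```

On the real data the call would be `kruskal.test(age ~ polviews, data = gss)`. Note that this tests whether the age distributions differ between groups, which is a slightly different question from the equality of means tested by ANOVA.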