This report is the result of the project for the Coursera course “Data Analysis and Statistical Inference”.
Is there a relationship between a person's gender and their self-described political view (liberal/conservative)?
The data were extracted from the General Social Survey (GSS), a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The dataset contains 57,061 cases and 114 variables (one case = one interview with one person). Note that this is a cumulative data file for surveys conducted between 1972 and 2012, and that not all respondents answered all questions in all years. For this study only two fields were chosen: respondent's sex (gss$sex) and political views (gss$polviews).
RESPONDENTS SEX
Code: respondent's sex. sex is a categorical variable (regular) with only two values:
1 MALE
2 FEMALE
sex <- gss$sex[!is.na(gss$polviews)]
table(sex, useNA = "ifany")
## sex
## Male Female
## 21386 26490
The table shows no missing values for sex among the respondents who answered the political views question.
POLVIEWS
THINK OF SELF AS LIBERAL OR CONSERVATIVE
We hear a lot of talk these days about liberals and conservatives. I'm going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal - point 1 - to extremely conservative - point 7. Where would you place yourself on this scale?
Code: respondent's self-described political view. polviews is a categorical variable (ordinal) with 7 levels plus NA (missing).
polviews <- gss$polviews[!is.na(gss$polviews)]
table(polviews, useNA = "ifany")
## polviews
## Extremely Liberal Liberal Slightly Liberal
## 1330 5582 6181
## Moderate Slightly Conservative Conservative
## 18494 7691 7092
## Extrmly Conservative
## 1506
For this study we have eliminated the NA responses.
Type of study: This is an observational study, since the data were collected before this analysis was designed. Subjects were NOT randomly assigned to groups, so we cannot draw causal conclusions as one could in an experimental design. At the end of this observational study, only conclusions about association can be drawn.
Scope of inference - generalizability: The sample is large (57,061 cases), and generalization rests on how the respondents were selected rather than on size alone. Assuming the GSS respondents were randomly sampled, the results can be extended to the population of interest, i.e., residents of the U.S. between 1972 and 2012.
The main source of bias is the people who did NOT answer the political views question. There are quite a lot of non-respondents to this question (around 16%), so if the non-respondents are distributed differently from the respondents, the result may be biased.
Scope of inference - causality: No causal link can be established between the variables of interest, since this is an observational study and we can neither select nor manipulate the values of the variables.
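The non-response figure quoted above can be checked from the two counts reported in this document. This is only a sketch: the variable names are illustrative, and the number of respondents with an answer (47,876) is taken from the contingency table total further down.

```r
# Counts taken from this report (names are illustrative, not from the original code)
n_total    <- 57061  # all cases in the GSS extract
n_answered <- 47876  # cases with a non-missing polviews answer

# Share of cases with no political views answer
nonresponse_rate <- (n_total - n_answered) / n_total
round(100 * nonresponse_rate, 1)  # percentage of non-respondents
```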
Let's calculate the relative frequency distribution of political views by sex:
male <- table(polviews[sex=="Male"])/length(polviews[sex=="Male"])
female <- table(polviews[sex=="Female"])/length(polviews[sex=="Female"])
cbind(male,female)
## male female
## Extremely Liberal 0.02918 0.02665
## Liberal 0.11433 0.11842
## Slightly Liberal 0.12859 0.12952
## Moderate 0.35883 0.40846
## Slightly Conservative 0.17605 0.14821
## Conservative 0.15856 0.13971
## Extrmly Conservative 0.03446 0.02903
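The same column-wise proportions can also be obtained in one step with prop.table. The sketch below rebuilds the counts by hand from the contingency table shown later in this report, so that it runs without the gss data frame; the object names counts and col_props are assumptions made for this example.

```r
# Observed counts copied from the contingency table in this report
# (matrix() fills column by column: first Male, then Female)
counts <- matrix(
  c(624, 2445, 2750,  7674, 3765, 3391, 737,
    706, 3137, 3431, 10820, 3926, 3701, 769),
  ncol = 2,
  dimnames = list(
    polviews = c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
                 "Slightly Conservative", "Conservative", "Extrmly Conservative"),
    sex = c("Male", "Female")))

# margin = 2 divides every column by its own total,
# giving the distribution of political views within each sex
col_props <- prop.table(counts, margin = 2)
round(col_props, 5)
```

With the original vectors available, prop.table(table(polviews, sex), margin = 2) should give the same table.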
par(mfrow=c(1,2))
# Same y-axis range in both panels; the upper limit must exceed the largest bar (about 0.41)
barplot(male, xlab ="Political View", main="Male Political View", ylab="Relative frequency",las=2, ylim = c(0,0.45))
barplot(female, xlab ="Political View", main="Female Political View", ylab="Relative frequency",las=2, ylim = c(0,0.45))
par(mfrow=c(1,1))
From the data and plots, it seems that females hold more moderate political views than males, who show higher proportions at the extreme ends of the scale. In this study I will test whether these differences are significant.
We can examine the differences in frequency more closely with the following two plots:
par(mfrow=c(1,2))
diffs <- male - female
plot(table(polviews, sex), xlab="Political View %", ylab="Sex", main="Political View according to sex", las=2)
barplot(diffs, xlab ="Differences in Political View", main="Male - Female", ylab="Relative frequency",las=2)
par(mfrow=c(1,1))
We will analyze two categorical variables, gender and political view. The first has two levels (male and female); the second has seven levels.
Total <- table(polviews)
Real <- cbind(table(polviews, sex), Total)
Total <- c(table(sex, useNA = "no"), "Total" = length(polviews))
Real <- rbind(Real, Total)
Real
## Male Female Total
## Extremely Liberal 624 706 1330
## Liberal 2445 3137 5582
## Slightly Liberal 2750 3431 6181
## Moderate 7674 10820 18494
## Slightly Conservative 3765 3926 7691
## Conservative 3391 3701 7092
## Extrmly Conservative 737 769 1506
## Total 21386 26490 47876
Looking at the table above, does there appear to be a relationship between gender and political view?
Hypotheses
H0 (nothing going on): Gender and political view are independent. Political view rates do not vary by gender.
HA (something going on): Gender and political view are dependent. Political view rates do vary by gender.
As we are evaluating the relationship between two categorical variables (at least one with more than 2 levels), we will perform a chi-square test of independence. It consists of three steps:
Conditions for the chi-square test:
Independence: Sampled observations must be independent.
Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases.
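Both conditions for the chi-square test are met: each case is a separate interview with one person, and every cell has an expected count far above 5 (the smallest is about 594).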
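The sample size condition can be checked from the marginal totals alone, because the smallest expected count is the smallest row total times the smallest column total, divided by the grand total. A minimal sketch, with the totals copied from the contingency table in this report and illustrative variable names:

```r
# Marginal totals copied from the contingency table in this report
row_totals <- c(1330, 5582, 6181, 18494, 7691, 7092, 1506)  # one per political view
col_totals <- c(Male = 21386, Female = 26490)
n <- sum(col_totals)

# Smallest expected count under independence
min_expected <- min(row_totals) * min(col_totals) / n
min_expected

# TRUE when every cell satisfies the "at least 5 expected cases" condition
min_expected >= 5
```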
Calculation
The overall proportion of males in the sample is:
male_rate <- sum(sex == "Male") / length(polviews)
male_rate
## [1] 0.4467
If gender and political views are in fact independent (i.e. if H0 is true), how many males would we expect to hold liberal political views? How many moderate or conservative? Let's calculate the expected counts from the overall proportions of males and females:
Expected <- Real
Expected[1:7,1] <- Real[1:7,3]*Real[8,1]/Real[8,3]
Expected[1:7,2] <- Real[1:7,3]*Real[8,2]/Real[8,3]
Expected
## Male Female Total
## Extremely Liberal 594.1 735.9 1330
## Liberal 2493.5 3088.5 5582
## Slightly Liberal 2761.0 3420.0 6181
## Moderate 8261.2 10232.8 18494
## Slightly Conservative 3435.5 4255.5 7691
## Conservative 3168.0 3924.0 7092
## Extrmly Conservative 672.7 833.3 1506
## Total 21386.0 26490.0 47876
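Each expected count is (row total x column total) / grand total, so the whole table can be produced at once with outer. The following sketch rebuilds the observed counts by hand so that it is self-contained; the names observed and expected are assumptions made for this example, not objects from the code above.

```r
# Observed counts copied from the contingency table in this report
observed <- matrix(
  c(624, 2445, 2750,  7674, 3765, 3391, 737,
    706, 3137, 3431, 10820, 3926, 3701, 769),
  ncol = 2,
  dimnames = list(
    polviews = c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
                 "Slightly Conservative", "Conservative", "Extrmly Conservative"),
    sex = c("Male", "Female")))

# Expected count in cell (i, j) under independence:
# row total i * column total j / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)
```

By construction the expected table keeps the same row and column totals as the observed one, which is a quick sanity check.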
# Sum the contributions of all 14 cells (both the Male and Female columns)
CHI2 <- sum((Real[1:7,1:2] - Expected[1:7,1:2])^2/Expected[1:7,1:2])
df <- (length(levels(sex)) -1) * (length(levels(polviews)) -1)
CHI2
## [1] 176.5
df
## [1] 6
pchisq(CHI2, df, lower.tail = FALSE)
## [1] 1.867e-35
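As a cross-check, base R's chisq.test can run the test of independence directly on the table of observed counts. It sums the contributions of every cell in both the Male and Female columns. This is a sketch that rebuilds the counts by hand from the contingency table in this report; the names observed and test are assumptions made for this example.

```r
# Observed counts copied from the contingency table in this report
observed <- matrix(
  c(624, 2445, 2750,  7674, 3765, 3391, 737,
    706, 3137, 3431, 10820, 3926, 3701, 769),
  ncol = 2,
  dimnames = list(
    polviews = c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
                 "Slightly Conservative", "Conservative", "Extrmly Conservative"),
    sex = c("Male", "Female")))

# Pearson chi-square test of independence on the 7 x 2 table
test <- chisq.test(observed)

test$statistic  # chi-square statistic over all 14 cells
test$parameter  # degrees of freedom: (7 - 1) * (2 - 1)
test$p.value    # upper-tail probability
```

With the original vectors available, chisq.test(table(polviews, sex)) should be equivalent.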
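Conclusion: Since the p-value is far below the usual 5% significance level, we reject H0. The data provide convincing evidence that gender and political view are associated. Because this is an observational study, the result indicates an association only and not a causal relationship.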