Data Analysis and Statistical Inference Project

1. Introduction

This report is the result of the project for the Coursera course “Data Analysis and Statistical Inference”. I have chosen the following question to study:

Is there a relationship between a person’s gender and how that person describes their own political views (liberal/conservative)?

2. Data

The data come from the General Social Survey (GSS), a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The dataset contains 57,061 cases and 114 variables (one case = one interview with one person). Note that this is a cumulative data file for surveys conducted between 1972 and 2012, and that not all respondents answered all questions in all years. Only two variables are used in this study: respondent’s sex (gss$sex) and political views (gss$polviews).

RESPONDENT’S SEX

Respondent’s sex, sex, is a categorical variable (regular, not ordinal) with only two values:

1 MALE

2 FEMALE

sex <- gss$sex[!is.na(gss$polviews)]
table(sex, useNA = "ifany")

## sex
##   Male Female 
##  21386  26490

The table shows that sex has no missing values among the respondents who answered the political views question.

POLITICAL VIEWS

THINK OF SELF AS LIBERAL OR CONSERVATIVE. “We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal - point 1 - to extremely conservative - point 7. Where would you place yourself on this scale?”

Respondent’s political view of oneself, polviews, is a categorical variable (ordinal) with 7 levels plus NA (no answer):

polviews <- gss$polviews[!is.na(gss$polviews)]
table(polviews)

## polviews
##     Extremely Liberal               Liberal      Slightly Liberal 
##                  1330                  5582                  6181 
##              Moderate Slightly Conservative          Conservative 
##                 18494                  7691                  7092 
##  Extrmly Conservative 
##                  1506

For this study we have eliminated the NA responses.

Type of study: This is an observational study, since the data were collected before this analysis was designed. Subjects were NOT randomly assigned to groups, so we cannot draw causal conclusions as one could with an experimental design. At the end of this observational study, only associations can be reported.

Scope of inference - generalizability: The GSS is based on a random sample of U.S. residents and the sample is quite large (57,061 cases), so the results can be generalized to the population of interest, i.e., residents of the U.S. between 1972 and 2012.

The main source of bias is the people who did NOT answer the political views question. There are quite a lot of non-respondents to this question (around 16%), and they may not be distributed in the same way as the respondents, so the results may be somewhat biased.
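
The share of non-respondents quoted above can be reproduced from the two case counts given in this report. This is a minimal sketch; the variable names (n_total, n_answered, and so on) are chosen here for illustration and are not part of the original analysis.

```r
# Case counts reported in this document
n_total    <- 57061  # all cases in the cumulative GSS file
n_answered <- 47876  # cases with a non-missing polviews value

# Cases without a political view, as a count and as a share of all cases
n_missing     <- n_total - n_answered
missing_share <- n_missing / n_total

n_missing
round(100 * missing_share, 1)  # percentage of cases without a political view
```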

Scope of inference - causality: No causal link can be established between the variables of interest, since this is an observational study and we can neither select nor manipulate the values of the variables.

3. Exploratory data analysis

Let’s calculate the frequency distribution of political views by sex:

male <- table(polviews[sex=="Male"])/length(polviews[sex=="Male"])
female <- table(polviews[sex=="Female"])/length(polviews[sex=="Female"])
cbind(male,female)

##                          male  female
## Extremely Liberal     0.02918 0.02665
## Liberal               0.11433 0.11842
## Slightly Liberal      0.12859 0.12952
## Moderate              0.35883 0.40846
## Slightly Conservative 0.17605 0.14821
## Conservative          0.15856 0.13971
## Extrmly Conservative  0.03446 0.02903
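
As an alternative sketch, the same column proportions can be obtained in one step with prop.table(). Because the gss data frame is not available here, the example hard-codes the observed counts from the contingency table in Section 4; the names views, counts and props are illustrative assumptions.

```r
# Observed counts by political view and sex, copied from the table in Section 4
views <- c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
           "Slightly Conservative", "Conservative", "Extrmly Conservative")
counts <- matrix(c(624, 2445, 2750,  7674, 3765, 3391, 737,   # Male
                   706, 3137, 3431, 10820, 3926, 3701, 769),  # Female
                 ncol = 2, dimnames = list(views, c("Male", "Female")))

# margin = 2 divides every column by its own total,
# so each column holds the relative frequencies for one sex
props <- prop.table(counts, margin = 2)
round(props, 5)
```

Working from the two-way table of counts keeps both sexes in a single object, which is convenient for the later steps of the analysis.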

par(mfrow=c(1,2))
# ylim goes up to 0.45 so the tallest bar (0.408, female Moderate) is not cut off
barplot(male, main="Male Political View", ylab="Relative frequency",las=2, ylim = c(0,0.45))
barplot(female, main="Female Political View", ylab="Relative frequency",las=2, ylim = c(0,0.45))

[Figure: side-by-side bar plots of relative frequency by political view, “Male Political View” and “Female Political View”]

From the data and plots, it seems that females hold moderate political views more often than males, who have higher proportions at the extreme viewpoints. In this study I will test whether the differences are significant.

We can examine the differences in frequency more closely with the following two plots, which show the differences in the proportions of political views between males and females.

par(mfrow=c(1,2))
diffs <- male - female
plot(table(polviews, sex), xlab="Political View %", ylab="Sex", main="Political View according to sex", las=2)
barplot(diffs, main="Difference in Relative frequency", ylab="Male - Female",las=2)

[Figure: mosaic plot “Political View according to sex” and bar plot “Difference in Relative frequency” (Male - Female)]

The exploratory data analysis suggests that there is a difference between the political views of males and females: males hold more extreme points of view, and females are more moderate.

4. Inference

We will analyze two categorical variables, gender and political view. The first has two levels (male and female); the second has 7 levels.

Total <- table(polviews)
Real <- cbind(table(polviews, sex), Total)
Total <- c(table(sex, useNA = "no"), "Total" = length(polviews))
Real <- rbind(Real, Total)
Real

##                        Male Female Total
## Extremely Liberal       624    706  1330
## Liberal                2445   3137  5582
## Slightly Liberal       2750   3431  6181
## Moderate               7674  10820 18494
## Slightly Conservative  3765   3926  7691
## Conservative           3391   3701  7092
## Extrmly Conservative    737    769  1506
## Total                 21386  26490 47876

Observing the table above: Does there appear to be a relationship between Gender and Political View?

Hypotheses

H0 (nothing going on): Gender and Political View are independent. Political View rates do not vary by Gender.

HA (something going on): Gender and Political View are dependent. Political View rates do vary by Gender.

As we are evaluating the relationship between two categorical variables (at least one with more than 2 levels), we will perform a chi-square test of independence. The idea behind the test:

- Quantify how different the observed counts are from the expected counts.
- Large deviations from what would be expected based on sampling variation (chance) alone provide strong evidence for the alternative hypothesis.
- It is called an independence test since we are evaluating the relationship between two categorical variables.

Conditions for the chi-square test:

Independence: Sampled observations must be independent.
- random sample/assignment
- if sampling without replacement, n < 10% of population
- each case only contributes to one cell in the table.
Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases.

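We meet all the conditions for the chi-square test: the GSS respondents are sampled at random, the 47,876 cases are far fewer than 10% of the U.S. population, each respondent contributes to exactly one cell, and the smallest expected count (594.1, see the expected table below) is well above 5.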
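
The sample-size condition can be checked directly. The following is a minimal sketch that hard-codes the observed counts from the contingency table above (the names views, observed and expected are illustrative assumptions) and computes every expected count under independence as row total times column total divided by the grand total.

```r
# Observed counts by political view and sex, copied from the table above
views <- c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
           "Slightly Conservative", "Conservative", "Extrmly Conservative")
observed <- matrix(c(624, 2445, 2750,  7674, 3765, 3391, 737,   # Male
                     706, 3137, 3431, 10820, 3926, 3701, 769),  # Female
                   ncol = 2, dimnames = list(views, c("Male", "Female")))

# Expected count per cell under independence:
# (row total * column total) / grand total
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

round(expected, 1)
min(expected)        # smallest expected count in the table
all(expected >= 5)   # TRUE when the sample-size condition holds
```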

Calculation

The observed overall male rate of the sample is:

male_rate = sum(sex == "Male") /length(polviews)
male_rate

## [1] 0.4467

If gender and political views are in fact independent (i.e., if H0 is true), how many males would we expect to hold liberal political views? How many moderate or conservative? Let’s calculate it from the overall proportions of males and females:

Expected <- Real
Expected[1:7,1] <- Real[1:7,3]*Real[8,1]/Real[8,3]
Expected[1:7,2] <- Real[1:7,3]*Real[8,2]/Real[8,3]
Expected # Table with expected counts if H0 were true

##                          Male  Female Total
## Extremely Liberal       594.1   735.9  1330
## Liberal                2493.5  3088.5  5582
## Slightly Liberal       2761.0  3420.0  6181
## Moderate               8261.2 10232.8 18494
## Slightly Conservative  3435.5  4255.5  7691
## Conservative           3168.0  3924.0  7092
## Extrmly Conservative    672.7   833.3  1506
## Total                 21386.0 26490.0 47876

Now we can compute the two quantities needed for the test: the chi-square statistic and the degrees of freedom (df).

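# Sum the contributions of every cell in both columns (Male and Female)
CHI2 <- sum((Real[1:7,1:2] - Expected[1:7,1:2])^2/Expected[1:7,1:2])
df <- (length(levels(sex)) -1) * (length(levels(polviews)) -1)
CHI2

## [1] 176.5

df

## [1] 6

The chi-square statistic is very large compared with its 6 degrees of freedom (under H0 its expected value equals df).

pchisq(CHI2, df, lower.tail = FALSE)

## [1] 1.867e-35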

The p-value is practically 0, so we can reject the null hypothesis and conclude that, in fact, there is something going on.
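
As a cross-check, R’s built-in chisq.test() can be applied to the observed counts. The sketch below hard-codes the counts from the contingency table above (the names views, observed and res are illustrative assumptions); chisq.test() sums the contributions of every cell in both columns and returns the statistic, the degrees of freedom and the p-value.

```r
# Observed counts by political view and sex, copied from the table above
views <- c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
           "Slightly Conservative", "Conservative", "Extrmly Conservative")
observed <- matrix(c(624, 2445, 2750,  7674, 3765, 3391, 737,   # Male
                     706, 3137, 3431, 10820, 3926, 3701, 769),  # Female
                   ncol = 2, dimnames = list(views, c("Male", "Female")))

# Pearson chi-square test of independence on the 7 x 2 table
res <- chisq.test(observed)

res$statistic  # chi-square statistic over all 14 cells
res$parameter  # degrees of freedom: (7 - 1) * (2 - 1)
res$p.value    # upper-tail probability under H0
```

Comparing these values with the manual calculation is a quick way to catch a cell or column that was left out of the sum.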

5. Conclusion

The study has concluded that there is a statistically significant relationship between gender and a person’s own political view. Females are more moderate than males.

However, to improve the study, a follow-up study should be designed to control additional sources of bias:

- The people in this study may be a biased sample, because not everyone completed the interview.
- The answers could also be biased, because a significant number of cases have no political view available.