Data Analysis and Statistical Inference Project Report

Introduction

There is a widely held belief that the Republican party enjoys more support from the traditional religious groups in America. In this project we examine whether there is an association between religious preference and political party affiliation.

As politics in America becomes more and more polarized, with numerous hypotheses about the influence of various groups in driving the polarization, it is important that we test the hypotheses for which we have data. The General Social Survey results provide us with data to examine some of these hypotheses using observational studies. Hence we decided to examine the specific question stated above.

Data

The data was obtained from the General Social Survey Cumulative File, 1972 - 2012 Coursera Extract, provided as part of the course. The description is given here: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html The data was downloaded from the link: http://bit.ly/dasi_gss_data

Each case represents the responses of one respondent to the survey, which was administered to a random sample of the US population.
In this study we look at 2 categorical variables, relig (https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html#relig) and partyid (https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html#partyid), which represent religious preference and party affiliation respectively. Since religious views typically develop earlier in an individual than political affiliations, we treat relig as the explanatory variable and partyid as the response.

To reduce the number of levels in the relig variable, the religions with fewer counts were collapsed. ‘Buddhism’, ‘Hinduism’, ‘Other Eastern’, ‘Moslem/Islam’, ‘Native American’ and ‘Inter-Nondenominational’ were collapsed into the already existing ‘Other’ group. ‘Orthodox-Christian’ and ‘Christian’ were collapsed into a new ‘Other Christian’ group. Cases with missing values in either variable were removed, and a clean data set with the variables of interest and the required levels was created:

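# Load the GSS extract; the code below expects it to provide an object named gss
load(url("http://bit.ly/dasi_gss_data"))
# Keep only respondents with both relig and partyid recorded
analData = subset(as.data.frame(gss), !is.na(relig) & !is.na(partyid), select = c("caseid", "partyid", "relig"))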
str(analData)

## 'data.frame':    56561 obs. of  3 variables:
##  $ caseid : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ partyid: Factor w/ 8 levels "Strong Democrat",..: 3 2 4 2 1 3 3 3 1 1 ...
##  $ relig  : Factor w/ 13 levels "Protestant","Catholic",..: 3 2 1 5 1 1 2 3 1 1 ...

# Fold the low-count religions into the existing "Other" level
analData[analData$relig %in% c("Buddhism", "Hinduism", "Other Eastern", "Moslem/Islam", "Native American", "Inter-Nondenominational"), ]$relig = 'Other'
analData$relig = droplevels(analData$relig)
# "Other Christian" is a new level, so it must be added before it can be assigned
levels(analData$relig) = c(levels(analData$relig), "Other Christian")
analData[analData$relig %in% c("Orthodox-Christian", "Christian"), ]$relig = 'Other Christian'
# Drop the levels that are now empty
analData$relig = droplevels(analData$relig)
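
The collapsing step above can also be written as a single mapping from new level names to the old levels they absorb. The sketch below is only an illustration on a small made-up factor: the helper `collapse_levels` and the `collapse_map` list are hypothetical names, not part of the original analysis. It relies on the fact that assigning duplicated names to `levels()` merges the corresponding levels.

```r
# Toy factor standing in for analData$relig (values are made up for illustration)
relig <- factor(c("Protestant", "Buddhism", "Christian",
                  "Other", "Hinduism", "Catholic"))

# Hypothetical mapping: each new level name lists the old levels it absorbs
collapse_map <- list(
  "Other"           = c("Other", "Buddhism", "Hinduism"),
  "Other Christian" = c("Christian")
)

# Hypothetical helper: rename levels according to the map
collapse_levels <- function(f, map) {
  old <- levels(f)
  new <- old
  for (target in names(map)) {
    new[old %in% map[[target]]] <- target
  }
  levels(f) <- new  # duplicated names merge the corresponding levels
  f
}

collapsed <- collapse_levels(relig, collapse_map)
print(table(collapsed))
```

Keeping the mapping in one list makes it easier to compare against the prose description of which groups were merged.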

This is an observational study. The research question concerns the association between politics and religion and does not involve identifying causality. The data was collected from a representative sample over a period of time, and inferences are drawn from the relevant data points, which is consistent with an observational study. The sample was not divided into randomized groups receiving different treatments, which is an essential step in an experiment.

Since the sample is representative of the US population, the findings can be generalized to the entire US population. However, the difficulty of obtaining responses could bias the sample. Survey response rates are typically expected to be lower among (1) sections of the population with lower education and income levels and (2) the non-English-speaking population. So, while the results can largely be generalized to the US population, we have to admit that the sample may not be truly representative of these sections of the population, including minorities and people with low incomes.

It should be cautioned that the results of an observational study like this cannot be used to establish causal links, as there may be multiple interdependent factors influencing the relationship between the variables being studied. Only if we were able to control for these confounding variables and conduct an experiment that randomizes their effects across the groups could we make inferences about causality. Since that is not the case here, we can only draw conclusions about association and not causality.

Exploratory Analysis

The table below shows the number of respondents in each religious preference group:

## 
##      Protestant        Catholic          Jewish            None 
##           33329           13855            1152            6080 
##           Other Other Christian 
##            1467             678

And the one below shows the column percentages, that is, the distribution of party affiliation within each religious group:

##                     
##                      Protestant Catholic Jewish None Other Other Christian
##   Strong Democrat            17       16     25   13    15              13
##   Not Str Democrat           20       24     30   18    19              18
##   Ind,Near Dem               10       12     15   19    19              11
##   Independent                13       16      9   25    21              21
##   Ind,Near Rep                9        9      5    8     9              10
##   Not Str Republican         18       14      9    9     8              16
##   Strong Republican          12        8      4    4     6               8
##   Other Party                 1        1      2    3     4               3
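
The report does not show the calls that produced these two tables. A minimal sketch of how such tables are commonly built in base R is given here, using a small made-up data frame (`toy` is a hypothetical stand-in for analData, and its values are not GSS data). It is an assumption that the original tables were produced this way.

```r
# Hypothetical toy data; the same calls could be applied to analData
toy <- data.frame(
  partyid = factor(c("Democrat", "Democrat", "Republican", "Independent",
                     "Republican", "Democrat", "Independent", "Republican")),
  relig   = factor(c("Protestant", "Catholic", "Protestant", "None",
                     "Protestant", "Catholic", "None", "Catholic"))
)

# Frequency of each religious preference
relig_counts <- table(toy$relig)

# Two-way table: party affiliation in rows, religion in columns
two_way <- table(toy$partyid, toy$relig)

# Column proportions (margin = 2), shown as rounded percentages;
# each column sums to roughly 100, up to rounding
col_pct <- round(100 * prop.table(two_way, margin = 2))

print(relig_counts)
print(col_pct)
```

Using `margin = 2` conditions on the explanatory variable (religion), so each column describes how party affiliation is distributed within one religious group.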

Below is the mosaic plot between the two variables:
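
A mosaic plot of this kind can be drawn with base R's `mosaicplot`. The sketch below uses a small table of made-up counts (`toy_tab` is hypothetical, not GSS data), and the titles and axis labels are assumptions chosen for illustration.

```r
# Hypothetical counts, made up for illustration (not GSS figures)
toy_tab <- as.table(matrix(
  c(40, 25, 10,
    30, 30, 20,
    35, 15,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    relig   = c("Protestant", "Catholic", "None"),
    partyid = c("Democrat", "Independent", "Republican")
  )
))

# The first dimension (religion, explanatory) sets the column widths;
# the second (party, response) splits each column vertically
mosaicplot(toy_tab,
           main  = "Party affiliation by religious preference (toy data)",
           xlab  = "Religious preference",
           ylab  = "Party affiliation",
           color = TRUE,
           las   = 1)
```

If the two variables were independent, the horizontal splits would line up across all columns; visible steps between columns suggest an association.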

Inference

H0: Political affiliation and religious preference are independent.

HA: Political affiliation and religious preference are not independent.

Since both variables are categorical, we use the chi-squared test for independence.

Conditions for the chi-squared test:
1. The original data is from a random sample of the US population and so is by and large representative.
2. The sample size is less than 10% of the population, so the individual responses can be treated as independent.
3. Both variables are categorical, each with more than 2 levels.
4. The expected count in every cell is at least 5.

All the above conditions are satisfied.
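
One way to check the cell-count condition is to inspect the expected counts that `chisq.test` computes under independence. The sketch below does this on a small made-up table (`obs` and its counts are hypothetical); running the same check on `table(analData$partyid, analData$relig)` is a suggestion, not something shown in the original report.

```r
# Hypothetical observed counts, made up for illustration
obs <- matrix(c(30, 20, 10,
                25, 35, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(
                partyid = c("Democrat", "Republican"),
                relig   = c("Protestant", "Catholic", "None")
              ))

# chisq.test returns the expected counts under independence
expected <- chisq.test(obs)$expected

# The usual rule of thumb: every expected count should be at least 5
min_expected  <- min(expected)
condition_met <- min_expected >= 5

print(round(expected, 1))
cat("Smallest expected count:", round(min_expected, 2), "\n")
cat("Cell-count condition met:", condition_met, "\n")
```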

The relig variable has 6 levels and partyid has 8, giving a contingency table with 48 cells. Since it is cumbersome to show the calculations for a table of this size by hand, we use the chisq.test function available in R.

chisq.test(table(analData$partyid, analData$relig))

## 
##  Pearson's Chi-squared test
## 
## data:  table(analData$partyid, analData$relig)
## X-squared = 2508, df = 35, p-value < 2.2e-16
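
For readers who want to see what `chisq.test` computes, the sketch below reproduces the statistic, degrees of freedom and p-value by hand on a small made-up table (`obs` and its counts are hypothetical, not GSS data) and compares them with the built-in result.

```r
# Hypothetical observed counts, made up for illustration
obs <- matrix(c(30, 20, 10,
                25, 35, 15),
              nrow = 2, byrow = TRUE)

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E
x2 <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# Built-in test for comparison
ref <- chisq.test(obs)

cat("By hand:    X-squared =", x2, " df =", df, " p =", p_value, "\n")
cat("chisq.test: X-squared =", unname(ref$statistic),
    " df =", unname(ref$parameter), " p =", ref$p.value, "\n")
```

For the 8 by 6 table in this report, the same formula gives (8 - 1) * (6 - 1) = 35 degrees of freedom, which matches the output above.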

Conclusion

Since the p-value is far below 0.05, we reject the null hypothesis in favor of the alternative: political party affiliation is not independent of religious preference. In other words, the data provide strong evidence of an association between religion and politics in the United States.