Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)
library(lattice)
library(tigerstats)

## Warning: package 'tigerstats' was built under R version 3.3.2

## Warning: package 'abd' was built under R version 3.3.2

## Warning: package 'mosaic' was built under R version 3.3.2

## Warning: package 'mosaicData' was built under R version 3.3.2

library(forcats)

## Warning: package 'forcats' was built under R version 3.3.2

library(rockchalk)

## Warning: package 'rockchalk' was built under R version 3.3.2

library(vcd)

## Warning: package 'vcd' was built under R version 3.3.2

Load data

set.seed(2016)
setwd("C:/Users/karel_chajim/Desktop/coursera/inferential/peer")
load("gss.Rdata")

Situation

The General Social Survey (GSS) dataset provides politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions. The GSS is one of the most influential studies in the social sciences and is frequently referenced in leading publications, including the New York Times, the Wall Street Journal, and the Associated Press.

The survey is conducted by NORC at the University of Chicago. The vast majority of GSS data is obtained in face-to-face interviews. Computer-assisted personal interviewing (CAPI) began in the 2002 GSS. When it has proved difficult to arrange an in-person interview with a sampled respondent, GSS interviews may be conducted by telephone. The survey takes about 90 minutes to administer. It was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992), and since 1994 it has been conducted every other year. As of 2014, 30 national samples with 59,599 respondents and 5,900+ variables have been collected.

The GSS sample is drawn using an area probability design that randomly selects respondents in households across the United States to take part in the survey. Respondents who become part of the GSS sample come from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results.

The target population of the GSS is adults (18+) living in households in the United States. From 1972 to 2004 it was further restricted to those able to do the survey in English. From 2006 to present it has included those able to do the survey in English or Spanish. Those unable to do the survey in either English or Spanish are out-of-scope. Residents of institutions and group quarters are out-of-scope. Those with mental and/or physical conditions that prevent them from doing the survey, but who live in households, are part of the target population and in-scope. In the reinterviews, those who have died, moved out of the United States, or who no longer live in a household have left the target population and are out-of-scope.

Part 2: Research question

For the Duke University peer assignment, I am going to investigate whether religion is associated with job satisfaction and education level.

Part 3: Exploratory data analysis

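# Keep surveys from 2006 onward and drop respondents with missing religion
workset <- subset(gss, year >= 2006)
workset <- workset %>% drop_na(relig)

# Bar plot of religion counts, most common group first
barplot(sort(table(workset$relig), decreasing = TRUE))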

That bar plot shows some interesting information, but the raw counts are hard to compare, so I would like to see what share of respondents belongs to each religious group. For now I am interested in just the 5 most common religious groups.

freq<-table(workset$relig)

barplot(
  rev(
    sort(
      freq/length(workset$relig)
      )
    )[1:5], main = "relative frequency of religion", xlab = "religion", ylab = "Percentage"
  )

We can see that the most common group is Protestant, the second is Catholic, and the third is people without religion. The fourth is “Christian”, and it is not clear which branch of Christianity it refers to. That made me think about merging all Christian groups into one, but that would turn the analysis into Christians against everyone else, which could distort my question. Because I cannot be sure what this group means, I will ignore it.

nab <- workset
nab <- nab %>% 
  filter(relig!="Christian") %>% 
  droplevels()

freq_nab<-table(nab$relig)

barplot(
    rev(
        sort(
            freq_nab/length(nab$relig)
        )
    )[1:5], main = "relative frequency of religion", xlab = "religion", ylab = "Percentage"
)

Now we can see the 5 most common religious groups in the U.S. without those who replied “Christian”. I am leaving this group out of my computations because it is not clear what the respondents mean by “Christian”, and therefore I cannot use it to draw any conclusion.
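As a side note, the same kind of top-group shares can be computed with `prop.table()`. The sketch below uses a small made-up factor (`relig_toy` is a hypothetical stand-in, not GSS data), so the numbers it prints say nothing about the real survey.

```r
# Toy factor standing in for a religion column (hypothetical values, not GSS data)
relig_toy <- factor(c(rep("Protestant", 5), rep("Catholic", 3),
                      rep("None", 2), rep("Jewish", 1)))

# prop.table() divides each count by the total, so the shares sum to 1
shares <- prop.table(table(relig_toy))

# Sort from most to least common and keep the three largest groups
top_shares <- sort(shares, decreasing = TRUE)[1:3]
print(round(top_shares, 3))
```

Multiplying the shares by 100 would give true percentages for the y-axis.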

This information doesn't tell me anything about religiosity yet. Saying you are Protestant doesn't mean you are really religious. A better indicator is how often you attend your temple. What are the 5 most common attendance frequencies?

freq_nab_att<-table(nab$attend)

barplot(
    rev(
        sort(
            freq_nab_att/length(nab$attend)
        )
    )[1:5], main = "relative frequency of attendance", xlab = "attendance", ylab = "Percentage"
)

So there are clearly two large groups: those who attend every week and those who attend once a year. The third cluster is people who attend irregularly. How do these attendance groups compare across religious groups? Which group is the most active, the most pious?

plot(table(nab$relig,nab$attend))

What a mess! And look: there are people without religion who attend a temple regularly. How can anybody without religion be religious? Well, they probably attend because of family. But that does not explain frequent visits, so I am going to exclude them from the plot of the most pious religious groups.

no_none <- nab %>% 
  filter(relig!="None") %>% 
  droplevels()

#clear data to just levels I am going to use - most frequent attendees
no_none$attend <- combineLevels(no_none$attend, levs = c("Sevrl Times A Yr","2-3X A Month","More Thn Once Wk", "Once A Month", "Nrly Every Week", "Lt Once A Year"), newLabel = "random")

## The original levels Lt Once A Year Once A Year Sevrl Times A Yr Once A Month 2-3X A Month Nrly Every Week Every Week More Thn Once Wk 
## have been replaced by Once A Year Every Week random

#clear data to just levels I am going to use - most frequent religious groups
no_none$relig <- combineLevels(no_none$relig, levs = c("Other","Buddhism","Hinduism", "Other Eastern", "Orthodox-Christian", "Native American","Inter-Nondenominational"), newLabel = "other")

## The original levels Protestant Catholic Jewish Other Buddhism Hinduism Other Eastern Moslem/Islam Orthodox-Christian Native American Inter-Nondenominational 
## have been replaced by Protestant Catholic Jewish Moslem/Islam other
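For readers without rockchalk installed, a similar collapsing of factor levels can be sketched in base R by assigning a named list to `levels()`. This is an alternative illustration on a made-up factor (`attend_toy` is hypothetical), not the code used in this report.

```r
# Toy attendance factor (hypothetical values, not GSS data)
attend_toy <- factor(c("Every Week", "Once A Month", "Once A Year",
                       "2-3X A Month", "Every Week"))

# In the named list, each name is a new level and each value lists
# the old levels that are merged into it
levels(attend_toy) <- list(
  "Every Week"  = "Every Week",
  "Once A Year" = "Once A Year",
  "random"      = c("Once A Month", "2-3X A Month")
)

print(table(attend_toy))
```

Any old level that is not mentioned in the list would become NA, so every level to be kept has to appear explicitly.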

freq_nab2<-table(no_none$relig)
freq_nab_att2<-table(no_none$attend)

mosaicplot(table(no_none$attend,no_none$relig), main = "Mosaic table of most pious religious groups", xlab="how often", ylab="what group", las = 2)

So now I can see that, from this perspective, Protestants are the largest group among both the most pious (every-week) attendees and the random attendees.
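A mosaic plot mixes group size with attendance pattern, so it can help to cross-check the reading numerically with column proportions. The sketch below assumes a tiny made-up data frame (`toy` is hypothetical) and only illustrates the `prop.table()` call.

```r
# Toy data (hypothetical, not GSS data): attendance by religious group
toy <- data.frame(
  attend = factor(c("Every Week", "Every Week", "Once A Year", "random",
                    "Every Week", "Once A Year", "random", "random")),
  relig  = factor(c("Protestant", "Protestant", "Protestant", "Protestant",
                    "Catholic", "Catholic", "Catholic", "Catholic"))
)

tab <- table(toy$attend, toy$relig)

# margin = 2 divides each column by its total, giving the attendance
# distribution within each religious group
col_shares <- prop.table(tab, margin = 2)
print(round(col_shares, 2))
```

Comparing columns this way shows which group attends most often relative to its own size, which raw counts cannot show.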

Given this, how successful are these groups in education?

no_none$degree <- combineLevels(no_none$degree, levs = c("Lt High School","High School"), newLabel = "low")

## The original levels Lt High School High School Junior College Bachelor Graduate 
## have been replaced by Junior College Bachelor Graduate low

no_none$degree <- combineLevels(no_none$degree, levs = c("Junior College","Bachelor"), newLabel = "college")

## The original levels Junior College Bachelor Graduate low 
## have been replaced by Graduate low college

mosaic(data=no_none,~relig+attend+degree, shade=TRUE, cex=2)

Part 4: Inference

Judging by the low p-value in the mosaic plot, it looks like members of a religious group who attend randomly or once a year tend to have a higher degree. Is that true? I will use the chi-square test of independence to check whether religious group, temple attendance, and degree are independent. For each pair of variables, the null hypothesis is that the two variables are independent, and the alternative hypothesis is that they are associated.
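As a rough illustration of what the test computes, the sketch below builds the expected counts under independence and the chi-square statistic by hand, then compares them with `chisq.test()`. The table `observed` is made up for the example and is not taken from the GSS.

```r
# Toy contingency table (hypothetical counts, not GSS data)
observed <- matrix(c(30, 20, 10,
                     20, 30, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(attend = c("weekly", "rarely"),
                                   degree = c("low", "college", "graduate")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic and its degrees of freedom
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# The built-in test should agree with the manual computation
builtin <- chisq.test(observed)
print(c(manual = x2, builtin = unname(builtin$statistic)))
```

A small p-value means the observed counts are far from what independence would predict, which is evidence against the null hypothesis of independence.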

# p is dropped: chisq.test ignores it for a two-way contingency table
chisq.test(table(no_none$attend,no_none$degree))

## 
##  Pearson's Chi-squared test
## 
## data:  table(no_none$attend, no_none$degree)
## X-squared = 47.593, df = 4, p-value = 1.147e-09

chisq.test(table(no_none$relig,no_none$degree))

## 
##  Pearson's Chi-squared test
## 
## data:  table(no_none$relig, no_none$degree)
## X-squared = 173.99, df = 8, p-value < 2.2e-16

chisq.test(table(no_none$relig,no_none$attend))

## 
##  Pearson's Chi-squared test
## 
## data:  table(no_none$relig, no_none$attend)
## X-squared = 83.247, df = 8, p-value = 1.083e-14

I got p-values near zero for attendance and degree and for religious group and degree. To be sure, I also checked whether religious group and attendance are associated. All three p-values are near zero, so I reject the null hypothesis of independence in each case: the data provide strong evidence that these variables are associated with one another. But wait! I have removed people who do not have any religion. So what about them?

relig <- nab
relig$relig <- combineLevels(relig$relig, levs = c("Other","Buddhism","Hinduism", "Other Eastern", "Orthodox-Christian", "Native American","Inter-Nondenominational"), newLabel = "other")

## The original levels Protestant Catholic Jewish None Other Buddhism Hinduism Other Eastern Moslem/Islam Orthodox-Christian Native American Inter-Nondenominational 
## have been replaced by Protestant Catholic Jewish None Moslem/Islam other

relig$degree <- combineLevels(relig$degree, levs = c("Lt High School","High School"), newLabel = "low")

## The original levels Lt High School High School Junior College Bachelor Graduate 
## have been replaced by Junior College Bachelor Graduate low

relig$degree <- combineLevels(relig$degree, levs = c("Junior College","Bachelor"), newLabel = "college")

## The original levels Junior College Bachelor Graduate low 
## have been replaced by Graduate low college

relig$attend <- combineLevels(relig$attend, levs = c("Sevrl Times A Yr","2-3X A Month","More Thn Once Wk", "Once A Month", "Nrly Every Week", "Lt Once A Year"), newLabel = "random")

## The original levels Lt Once A Year Once A Year Sevrl Times A Yr Once A Month 2-3X A Month Nrly Every Week Every Week More Thn Once Wk 
## have been replaced by Once A Year Every Week random

#I am interested only in those who say they are not part of any religious group
relig2 <- relig %>% 
  filter(relig=="None") %>% 
  droplevels()

chisq.test(table(relig2$degree,relig2$relig))

## 
##  Chi-squared test for given probabilities
## 
## data:  table(relig2$degree, relig2$relig)
## X-squared = 717.41, df = 2, p-value < 2.2e-16

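Well! I have got another near-zero p-value. Note, however, that relig2 contains only the “None” group, so the table has a single column and chisq.test runs a goodness-of-fit test (“Chi-squared test for given probabilities”) instead of a test of independence. The null hypothesis here is that the three degree levels are equally common among people without religion, and I reject it. This result does not say whether degree depends on having a religion.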
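The difference between the two kinds of test can be sketched on made-up data. In the example below (`toy` is hypothetical, not GSS data), filtering to one religious group turns the table into a single column and triggers the goodness-of-fit test, while keeping at least two groups gives a test of independence.

```r
# Toy data (hypothetical): degree by religious group
toy <- data.frame(
  relig  = factor(rep(c("None", "Protestant"), times = c(60, 90))),
  degree = factor(c(rep(c("low", "college", "graduate"), times = c(30, 20, 10)),
                    rep(c("low", "college", "graduate"), times = c(20, 30, 40))))
)

# Only one religious group left: the table has a single column, so
# chisq.test() falls back to a goodness-of-fit test against equal shares
only_none <- droplevels(subset(toy, relig == "None"))
gof <- chisq.test(table(only_none$degree, only_none$relig))

# Two religious groups kept: a genuine test of independence
indep <- chisq.test(table(toy$degree, toy$relig))

print(gof$method)
print(indep$method)
```

To ask whether degree differs between people with and without a religion, the comparison group therefore has to stay in the table.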

What about the other question: does job satisfaction depend on degree? The null hypothesis says that job satisfaction and degree are independent.

chisq.test(table(relig$degree,relig$satjob))

## 
##  Pearson's Chi-squared test
## 
## data:  table(relig$degree, relig$satjob)
## X-squared = 61.425, df = 6, p-value = 2.31e-11

The null hypothesis is rejected. Job satisfaction and degree are associated!
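With several thousand respondents, even a weak association can produce a tiny p-value, so an effect size such as Cramér's V may be a useful companion to the test. The helper `cramers_v` below is a hypothetical illustration written for this note, and the table is made up; the vcd package loaded above also reports this measure through `assocstats()`.

```r
# Hypothetical helper: Cramér's V for a two-way table of counts
cramers_v <- function(tab) {
  # Pearson statistic without continuity correction
  x2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  n <- sum(tab)
  k <- min(dim(tab)) - 1
  sqrt(x2 / (n * k))
}

# Toy table (hypothetical counts): large sample, weak association
big_tab <- matrix(c(5200, 4800,
                    4800, 5200), nrow = 2)

v <- cramers_v(big_tab)
p <- chisq.test(big_tab, correct = FALSE)$p.value
print(c(cramers_v = v, p_value = p))
```

V ranges from 0 (no association) to 1 (perfect association), so a very small p-value paired with a V close to 0 points to a statistically detectable but weak relationship.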

chisq.test(table(relig$relig,relig$satjob))

## Warning in chisq.test(table(relig$relig, relig$satjob)): Chi-squared
## approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  table(relig$relig, relig$satjob)
## X-squared = 34.636, df = 15, p-value = 0.00277

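The p-value of 0.00277 is below 0.05, so at the 5% significance level the null hypothesis of independence is rejected here as well: religious group and job satisfaction appear to be associated. The warning means that some expected cell counts are small, so this p-value is less reliable than the others.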
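When that warning appears, one option is to inspect the expected counts and, if some are small, to use a simulated p-value instead of the chi-square approximation. The sketch below uses a made-up table (`tab` and its labels are hypothetical), not the GSS data.

```r
set.seed(2016)

# Toy table (hypothetical counts) with one very small religious group
tab <- matrix(c(40, 30, 20, 10,
                35, 25, 15,  5,
                 2,  1,  1,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(relig  = c("A", "B", "C"),
                              satjob = c("very sat", "mod sat",
                                         "little dissat", "very dissat")))

# Expected counts under independence; cells below 5 trigger the warning
expected <- suppressWarnings(chisq.test(tab))$expected
small_cells <- sum(expected < 5)
print(small_cells)

# Monte Carlo p-value that does not rely on the chi-square approximation
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
print(sim$p.value)
```

Merging the rarest groups into a single level before testing is another common way to raise the expected counts.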

chisq.test(table(relig$attend,relig$satjob))

## 
##  Pearson's Chi-squared test
## 
## data:  table(relig$attend, relig$satjob)
## X-squared = 22.578, df = 6, p-value = 0.0009508

Temple attendance and job satisfaction are associated!

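Putting it together: at the 5% significance level, job satisfaction is associated with degree, with temple attendance, and with religious group, although the result for religious group is the least reliable because of the small expected counts. Because the GSS is an observational study, these results show association only and cannot tell us that attending temple or earning a degree causes higher job satisfaction.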

plot((table(relig$attend,relig$satjob)), las=2)

The plot suggests that people who go to temple once a week tend to be more satisfied with their jobs.
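To see which cells drive an association like this, the standardized residuals returned by `chisq.test()` can be inspected. The sketch below uses a made-up table (`tab` and its labels are hypothetical, not GSS data).

```r
# Toy table (hypothetical counts): attendance by job satisfaction
tab <- matrix(c(50, 30, 20,
                30, 30, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(attend = c("Every Week", "Once A Year"),
                              satjob = c("Very Satisfied", "Mod. Satisfied",
                                         "Dissatisfied")))

res <- chisq.test(tab)

# Positive values: more respondents than independence predicts;
# negative values: fewer. Values beyond about +/-2 stand out.
print(round(res$stdres, 2))
```

In this toy table the weekly attendees are over-represented among the very satisfied, which is the kind of pattern the plot above is meant to reveal.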