Setup

Load packages

library(ggplot2)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.2

library(statsr)
library(rockchalk)

## Warning: package 'rockchalk' was built under R version 3.3.2

Load data

load("gss.Rdata")

Part 1: Data

This is a quick summary of the data set. It is the General Social Survey (GSS) http://www.gss.norc.org/, one of the most influential studies in the social sciences. The survey is conducted as a face-to-face, in-person interview by NORC at the University of Chicago. It was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year. As of 2014, 30 national samples with 59,599 respondents and 5,900+ variables have been collected.

The target population of the GSS is adults (18+) living in households in the United States. From 1972 to 2004 it was further restricted to those able to do the survey in English. From 2006 to the present it has included those able to do the survey in English or Spanish. Those unable to do the survey in either English or Spanish are out-of-scope. Residents of institutions and group quarters are out-of-scope. Those with mental and/or physical conditions that prevent them from doing the survey, but who live in households, are part of the target population and in-scope. In the reinterviews, those who have died, moved out of the United States, or no longer live in a household have left the target population and are out-of-scope. Because respondents are randomly sampled and the survey covers a significant number of US states (28+ currently), the results can be generalized to the overall US adult household population.

IMPORTANT: The GSS uses random sampling, so the results are generalizable to the population of adults (18 years or older) living in the U.S. The data do not involve random assignment, so we cannot infer any causality here: the data come from an observational survey.

Part 2: Research question

For the Duke University peer assignment I am going to investigate whether religion is associated with education level. Are believers more educated (do they spend more years studying) than atheists/non-believers? I expect that this is not the case, as I suppose religion is not connected in any way to the number of years spent in education. My research question is: how do believers compare with non-believers, and do they spend more years studying?

Part 3: Exploratory data analysis

Which categories does the religion variable contain? How are they organized? Are there any NAs (and if so, how many)?

#what religions do we have in data?
levels(gss$relig)

##  [1] "Protestant"              "Catholic"               
##  [3] "Jewish"                  "None"                   
##  [5] "Other"                   "Buddhism"               
##  [7] "Hinduism"                "Other Eastern"          
##  [9] "Moslem/Islam"            "Orthodox-Christian"     
## [11] "Christian"               "Native American"        
## [13] "Inter-Nondenominational"

#what is number of NA in data?
sum(is.na(gss$relig))

## [1] 233

#proportions and sums...
gss %>% 
  filter(!is.na(relig)) %>% 
  group_by(relig) %>% 
  dplyr::summarize(counts=n()) %>% 
  mutate(proportion = counts/sum(counts))

## # A tibble: 13 × 3
##                      relig counts   proportion
##                     <fctr>  <int>        <dbl>
## 1               Protestant  33472 0.5890054199
## 2                 Catholic  13926 0.2450552545
## 3                   Jewish   1155 0.0203244879
## 4                     None   6113 0.1075702119
## 5                    Other    998 0.0175617653
## 6                 Buddhism    130 0.0022876047
## 7                 Hinduism     63 0.0011086084
## 8            Other Eastern     31 0.0005455057
## 9             Moslem/Islam    108 0.0019004716
## 10      Orthodox-Christian     96 0.0016893081
## 11               Christian    588 0.0103470120
## 12         Native American     24 0.0004223270
## 13 Inter-Nondenominational    124 0.0021820229

So we can see there are 13 different categories for religion. They range from different kinds of Christianity through Judaism, Islam, and Buddhism to other belief systems, plus the category None, which I am particularly interested in and treat here as atheists. In the next steps I will group all religions into one category, as I am going to compare it with the atheists in the survey.

#how many atheists (relig == "None")?
gss %>% 
  filter(!is.na(relig)) %>% 
  group_by(relig=="None") %>% 
  dplyr::summarize(counts=n()) %>% 
  mutate(proportion = counts/sum(counts))

## # A tibble: 2 × 3
##   `relig == "None"` counts proportion
##               <lgl>  <int>      <dbl>
## 1             FALSE  50715  0.8924298
## 2              TRUE   6113  0.1075702

That means I have 6113 answers of None, which is around 11% of all non-missing answers. And what about education?

summary(gss$educ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   12.00   12.00   12.75   15.00   20.00     164

The lowest number of years spent in school is, unsurprisingly, zero; the highest is 20 years. The mean (12.75 years) is close to the median (12 years), which is also the first quartile. The third quartile is 15 years. But we have also got 164 NAs!

#give me years of education without NAs
a <- gss %>%
  filter(!is.na(educ))

#show me boxplot
boxplot(a$educ, main = "Years of education", ylab="years")

As we know from the previous summary, the median is also the first quartile. We can see that many points are flagged as outliers, so I will hide them in the plot to make the box easier to read:

boxplot(a$educ, main = "Years of education", ylab="years", outline = FALSE)
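Which points a boxplot flags as outliers follows from the usual 1.5 × IQR rule. The sketch below applies that rule to the quartiles reported by summary(gss$educ) above; it is only an approximation, since boxplot() works from Tukey's hinges, which may differ slightly from the quartiles that summary() prints.

```r
# Quartiles taken from the summary(gss$educ) output above
q1 <- 12
q3 <- 15
iqr <- q3 - q1                    # interquartile range

# Points beyond these fences are drawn as outliers by default (range = 1.5)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```

Under these assumptions, respondents reporting 7 or fewer years, or 20 years, fall outside the fences. Setting outline = FALSE only stops those points from being drawn; it does not remove them from the data.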

Now, for the sake of the research question, I am going to compare boxplots of years of education for believers and atheists. That will complete my EDA, and I will then move on to inference.

#reuse the variable from before, keeping only respondents with a recorded religion
a <- gss %>%
  filter(!is.na(relig))

#put it to data variable
data<-a

#collapse every level of data$relig except "None" into a single "Believer" level
data$relig <- combineLevels(a$relig,
                            levs = c("Protestant", "Catholic", "Jewish", "Other",
                                     "Buddhism", "Hinduism", "Other Eastern",
                                     "Moslem/Islam", "Orthodox-Christian", "Christian",
                                     "Native American", "Inter-Nondenominational"),
                            newLabel = "Believer")

## The original levels Protestant Catholic Jewish None Other Buddhism Hinduism Other Eastern Moslem/Islam Orthodox-Christian Christian Native American Inter-Nondenominational 
## have been replaced by None Believer
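The same two-level grouping can be sketched in base R without rockchalk. The toy factor below is a stand-in for gss$relig (its values are illustrative only, not taken from the data), and the name relig2 is hypothetical.

```r
# Toy factor standing in for gss$relig (illustrative values only)
relig <- factor(c("Protestant", "None", "Catholic", NA, "Jewish", "None"))

# Everything that is not "None" becomes "Believer"; missing values stay NA
relig2 <- factor(ifelse(relig == "None", "None", "Believer"),
                 levels = c("None", "Believer"))

table(relig2, useNA = "ifany")
```

Fixing the level order to None, Believer mirrors the order reported by combineLevels above, so that the group labels in later output would line up.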

#how do they stand against each other?
boxplot(educ ~ relig, data = data, main = "Years of education", ylab="years", xlab="religion", outline = FALSE)

This looks promising: atheists appear to have more variability in years spent studying and a higher median than believers. But is the difference meaningful? We will see in the inference part…

Part 4: Inference

Hypotheses

H0: There is no difference in mean years of education between believers and atheists.

Ha: There is a difference in mean years of education between believers and atheists.

Conditions check

1. Independence: the GSS uses random sampling, and the sampled respondents are fewer than 10% of the whole US population, so the first condition for testing is met.

2. Sample size: both groups are far larger than the minimum required (at least 10 successes and 10 failures for a proportion, and at least 30 observations per group for comparing means), so the second condition is met too.
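The sample-size condition can be checked numerically. This is a minimal sketch using the group counts from the grouped summary earlier in this report; the thresholds of 10 (successes and failures, for a proportion) and 30 (observations per group, for means) are the common rules of thumb and are an assumption here, not something the data dictate.

```r
# Counts taken from the grouped summary earlier in this report
n_believer <- 50715
n_none     <- 6113

# Success-failure condition for a proportion: at least 10 of each outcome
success_failure_ok <- min(n_believer, n_none) >= 10

# Rule of thumb for comparing means: at least 30 observations per group
group_size_ok <- min(n_believer, n_none) >= 30

c(success_failure = success_failure_ok, group_size = group_size_ok)
```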

We can either calculate the standard error and construct the interval by hand, or allow the inference function to do it for us.

inference(y = relig, data = data, type = "ci", statistic = "proportion", null = 0, alternative = "twosided", method = "theoretical", success = "Believer")

## Single categorical variable, success: Believer
## n = 56828, p-hat = 0.8924
## 95% CI: (0.8899 , 0.895)
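For comparison, the interval can also be built by hand from the counts in the grouped summary earlier. This is a sketch of the standard normal-approximation interval for a proportion; the variable names are illustrative.

```r
# Counts from the grouped summary earlier in this report
n_believer <- 50715
n_total    <- 50715 + 6113          # non-missing answers

p_hat <- n_believer / n_total       # sample proportion of believers
se    <- sqrt(p_hat * (1 - p_hat) / n_total)
z     <- qnorm(0.975)               # critical value for 95% confidence

ci <- p_hat + c(-1, 1) * z * se
round(c(p_hat = p_hat, lower = ci[1], upper = ci[2]), 4)
```

Rounded to four decimals, this should agree with the p-hat and the interval reported by inference().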

We are 95% confident that the proportion of believers in the population is between 0.890 and 0.895 (89.0% to 89.5%). We also know that p-hat, the sample proportion of believers, is 0.8924. How does this relate to education?

I am going to use a two-sided test, as I want to detect either lower or higher education in the religious or the atheist group.

inference(x = relig, y = educ, data = data, type = "ht", statistic = "mean", null = 0, alternative = "twosided", method = "theoretical", success = "Believer")

## Warning: Ignoring success since y is numerical

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_None = 6109, y_bar_None = 13.5304, s_None = 3.0969
## n_Believer = 50593, y_bar_Believer = 12.6573, s_Believer = 3.1795
## H0: mu_None =  mu_Believer
## HA: mu_None != mu_Believer
## t = 20.753, df = 6108
## p_value = < 0.0001

As the p-value is less than 0.05, and even less than 0.01, we reject the null hypothesis that there is no difference between atheists and believers, in favour of the alternative hypothesis that the two groups differ in mean years of education.
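The test statistic can be reproduced by hand from the summary statistics printed by inference(). In the sketch below, the degrees of freedom are taken as the smaller group size minus one; that is an assumption inferred from the df = 6108 in the output, not from the package documentation. Because the inputs are rounded, the result may differ from the printed t in the third decimal.

```r
# Summary statistics copied from the inference() output above
n_none <- 6109;  mean_none <- 13.5304; sd_none <- 3.0969
n_bel  <- 50593; mean_bel  <- 12.6573; sd_bel  <- 3.1795

# Standard error of the difference in means (unpooled variances)
se     <- sqrt(sd_none^2 / n_none + sd_bel^2 / n_bel)
t_stat <- (mean_none - mean_bel) / se

# Conservative degrees of freedom: smaller group size minus one (assumption)
df     <- min(n_none, n_bel) - 1

# Two-sided p-value
p_two  <- 2 * pt(-abs(t_stat), df)

c(t = t_stat, df = df, p_value = p_two)
```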

So what does this tell us? Does it mean that atheists are more educated than believers, or vice versa?

One sided - lower

H0: Atheists spend the same number of years in education as believers

Ha: Atheists are less educated than believers (spend fewer years in education)

inference(x = relig, y = educ, data = data, type = "ht", statistic = "mean", null = 0, alternative = "less", method = "theoretical", success = "Believer")

## Warning: Ignoring success since y is numerical

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_None = 6109, y_bar_None = 13.5304, s_None = 3.0969
## n_Believer = 50593, y_bar_Believer = 12.6573, s_Believer = 3.1795
## H0: mu_None =  mu_Believer
## HA: mu_None < mu_Believer
## t = 20.753, df = 6108
## p_value = 1

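With a p-value this high, we cannot reject the null hypothesis in favour of atheists being less educated. We know from the two-sided test that the two groups do not spend the same number of years in education, and the sample mean is higher for atheists (13.53 versus 12.66 years), so I suppose they are more educated…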
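The p-value of 1 follows from the sign of the test statistic: with a large positive t, almost all of the t distribution lies below it. This is a sketch using the t and df printed above; the variable names are illustrative.

```r
# Test statistic and degrees of freedom from the inference() output above
t_stat <- 20.753
df     <- 6108

# Lower-tail p-value, used for the alternative "less"
p_less <- pt(t_stat, df)

# Upper-tail p-value, used for the alternative "greater"
p_greater <- pt(t_stat, df, lower.tail = FALSE)

c(less = p_less, greater = p_greater)
```

The two tail probabilities sum to 1, so a lower-tail p-value near 1 implies an upper-tail p-value near 0.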

One sided - greater