Research question: Is there a relationship between family income and frequency of attends religious services by respondent?
Actually, this question is very interesting not only for me, but for a lot of people. There are a lot of debates about religious theme in the world. Poorer people cares much more about religious. It helps them to forget about everyday problems and feel more happier about life. So, lets find out is it true.
Data collection: The data were collected in such ways:
1. Computer-assisted personal interview (CAPI);
2. Face-to-face interview;
3. Telephone interview.
Interviews were provided in American society.
Cases:
Cases in GSS data are represented by responses to the survey from random Americans in 1972-2012 time range.
There are 57061 observations (cases) in GSS data.
Variables:
Income variable - coninc - total family income in constant dollars: numerical (continuous).
Outcome variable - attend - how often respondent attends religious services: categorical (ordinal).
Study:
It is an observational study because:
1. data is downloaded from the NET, so it is collected in a way that does not directly interfere with how the data arise;
2. in this studying only association will be established;
3. studying is retrospective because it uses past data.
Scope of inference - generalizability:
Population of interest - Americans aged 18 and above.
This study could be generalized to all population because random sampling is provided.
Potential source of bias is convenience sample.
Scope of inference - causality:
Not, this data couldn’t be used to establish causal link between family income and frequency of attends religious services by respondent. It is observational study and, because of this only correlational relation could be established between the variables of interest. To establish casual link experiment should be provided.
coninc
summary(gss$coninc)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 383 18440 35600 44500 59540 180400 5829
The mean of the sample equals 44500 dollars family income.
While the median is 35600 dollars. As median is less than mean, we can conclude that coninc data is right skewed.
Half of all respondents have income in range [18440, 59540] dollars.
The minimum income is 383 dollars. The maximum income is 180400 dollars.
Data also contains 5829 NA’s, which have to be removed while performing statistic.
hist(gss$coninc, breaks=25)
As displayed on plot, coninc data is right skewed, so it has to be normalized.
attend
summary(gss$attend)
## Never Lt Once A Year Once A Year Sevrl Times A Yr
## 0 4351 7476 7202
## Once A Month 2-3X A Month Nrly Every Week Every Week
## 4067 5060 3183 11383
## More Thn Once Wk NA's
## 4370 9969
attend data also contains 9969 NA’s, which have to be removed while performing statistic.
prop.table(table(gss$attend))
##
## Never Lt Once A Year Once A Year Sevrl Times A Yr
## 0.00000000 0.09239361 0.15875308 0.15293468
## Once A Month 2-3X A Month Nrly Every Week Every Week
## 0.08636286 0.10744925 0.06759110 0.24171834
## More Thn Once Wk
## 0.09279708
As we could see, most of the respondents attend religious services every week (24.17%).
barplot(sort(table(gss$attend), decreasing = TRUE), las=2)
cononc & attend
boxplot(gss$coninc ~ gss$attend, las=2)
From this boxplot we could conclude, that data in every group is right skewed. And all groups has nearly the same median (it looks like there is no relation between variables. But, to be shure in it, we need to perform hypothesis testing).
In this study we need to compare means of 8 groups of people. For this lets use F statistic and perform analysis of variance (ANOVA) test.
H0: The mean family income is the same across groups of people based on how often they attend religious services.
H1: Al least one pair of means are different from each other -> How often people attend religious services depends on their income.
Conditions for ANOVA:
Independence
Approximately normal
Distribution of response variable within each group should be approximately normal - Distribution is right skewed within each group, so data has to be normalized. Lets use log transformation.
coninc <- log(x = gss$coninc, base = 30)
Comparing two datasets, we could conclude that normalized data is more suitable for ANOVA (It is still a little bit left skewed, but much less skewed than it was before):
old.par <- par(mfrow=c(2, 1))
hist(gss$coninc, breaks=25, main = "Non-normalized data", xlab = "")
hist(coninc, breaks=25, main = "Normalized data")
Constant variance
Variability should be consistent across groups:
Firstly, lets clean data from NA values and verify how number of observations have been decrased:
data <- data.frame(coninc, gss$attend)
names(data) <- c("coninc", "attend")
data <- data[complete.cases(data), ]
Percentage of deleted observations:
(1 - dim(data)[1] / length(gss$coninc)) * 100
## [1] 25.49727
Calculating standard deviations of all of the group could be performed in next way:
by(data$coninc, data$attend, sd)
## data$attend: Never
## [1] NA
## --------------------------------------------------------
## data$attend: Lt Once A Year
## [1] 0.2903833
## --------------------------------------------------------
## data$attend: Once A Year
## [1] 0.2814978
## --------------------------------------------------------
## data$attend: Sevrl Times A Yr
## [1] 0.2846086
## --------------------------------------------------------
## data$attend: Once A Month
## [1] 0.2897008
## --------------------------------------------------------
## data$attend: 2-3X A Month
## [1] 0.3033349
## --------------------------------------------------------
## data$attend: Nrly Every Week
## [1] 0.2854247
## --------------------------------------------------------
## data$attend: Every Week
## [1] 0.2726725
## --------------------------------------------------------
## data$attend: More Thn Once Wk
## [1] 0.2711406
So, as we could see, standard deviations of all groups are near the same.
Method for inference
As all conditions are met, we could use ANOVA to perform inference. This method is suitable for this task, because we need to compare more than two means.
Inference
load(url("http://assets.datacamp.com/course/dasi/inference.Rdata"))
inference(y = data$coninc, x = data$attend, est = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_Never = NA, mean_Never = NA, sd_Never = NA
## n_Lt Once A Year = 3957, mean_Lt Once A Year = 3.0299, sd_Lt Once A Year = 0.2904
## n_Once A Year = 6915, mean_Once A Year = 3.0567, sd_Once A Year = 0.2815
## n_Sevrl Times A Yr = 6545, mean_Sevrl Times A Yr = 3.056, sd_Sevrl Times A Yr = 0.2846
## n_Once A Month = 3721, mean_Once A Month = 3.0437, sd_Once A Month = 0.2897
## n_2-3X A Month = 4558, mean_2-3X A Month = 3.0258, sd_2-3X A Month = 0.3033
## n_Nrly Every Week = 2835, mean_Nrly Every Week = 3.0387, sd_Nrly Every Week = 0.2854
## n_Every Week = 10088, mean_Every Week = 3.0611, sd_Every Week = 0.2727
## n_More Thn Once Wk = 3893, mean_More Thn Once Wk = 3.0123, sd_More Thn Once Wk = 0.2711
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 7 11.2 1.60090 19.954 < 2.2e-16
## Residuals 42504 3410.0 0.08023
##
## Pairwise tests: t tests with pooled SD
## Lt Once A Year Once A Year Sevrl Times A Yr Once A Month
## Once A Year 0.0000 NA NA NA
## Sevrl Times A Yr 0.0000 0.8751 NA NA
## Once A Month 0.0333 0.0236 0.0349 NA
## 2-3X A Month 0.5060 0.0000 0.0000 0.0043
## Nrly Every Week 0.2077 0.0043 0.0067 0.4799
## Every Week 0.0000 0.3250 0.2547 0.0014
## More Thn Once Wk 0.0057 0.0000 0.0000 0.0000
## 2-3X A Month Nrly Every Week Every Week
## Once A Year NA NA NA
## Sevrl Times A Yr NA NA NA
## Once A Month NA NA NA
## 2-3X A Month NA NA NA
## Nrly Every Week 0.0574 NA NA
## Every Week 0.0000 2e-04 NA
## More Thn Once Wk 0.0280 2e-04 0
Interpretation of results:
p values is less than 0.05 (2.2e-16) so with 95% confident we could conclude that The data provide convincing evidence that at least one pair of population means are different from each other (there is relation betweeen family income and frequency of attends religious services).
Two find out which means are different we could use table above with pairwise tests and compare values from it to modified significance level:
k <- 7 * (7 - 1) / 2
a_modified <- 0.05/k
print(a_modified)
## [1] 0.002380952
For these p values, which are less than amodified, we could conclude that their means are different with 95% confidence level (like Every week and One a month groups).
So, based on provided investigation, we could conclude that there is a relationship between family income and frequency of attends religious services. However, I wasn’t correct at the beggining of the investigation, when assumed that there is no relation between family income and frequency of attends religious services based on box plots. I have learned a lot playing with data, like using ANOVA to make decisions. In future investigation could be extended by finding out explicit relation and also investigate how frequency of attends religious services depends on living type (like city/town).
GSS project description: http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx
GSS data could be downloaded from: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/34802/version/1
Codebook: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html
data[1:20, ]
## coninc attend
## 1 2.988066 Once A Year
## 2 3.061952 Every Week
## 3 3.061952 Once A Month
## 6 3.235678 Once A Year
## 7 3.186563 Every Week
## 9 2.415963 Sevrl Times A Yr
## 10 2.988066 More Thn Once Wk
## 11 2.889145 Every Week
## 12 2.889145 Every Week
## 13 2.889145 Every Week
## 14 2.889145 Every Week
## 15 2.988066 More Thn Once Wk
## 16 2.889145 2-3X A Month
## 17 3.061952 Every Week
## 18 2.988066 Every Week
## 19 3.235678 Every Week
## 20 3.277750 2-3X A Month
## 21 3.186563 2-3X A Month
## 22 3.331356 Once A Year
## 23 2.889145 Lt Once A Year