DATA 606 Data Project Proposal

Data Preparation

First, I’ll load in the data and rename the columns.

library(dplyr)
library(tidyverse)

religionData <- read.csv('religionData.csv', header= TRUE, sep = ',')
religionData <- as_tibble(religionData)

# rename columns
colNames <- c('RELIGION','RELIGION2', 'EVANGELICAL', 'RELIGIOUS_SERVICES', 'FREQ_PRAY_WITH_MOTIONS', 'FREQ_PRAY_WITH_OBJECTS', 'FREQ_PRAY_BEFORE_MEALS',
'FREQ_PRAY_FOR_OTHERS', 'FREQ_ASK_TO_PRAY_WITH_SOMEONE', 'FREQ_BRING_UP_RELIGION',
'FREQ_ASK_ABOUT_RELIGION', 'FREQ_DECLINE_FOOD_FOR_RELIGION', 'FREQ_WEAR_RELIGIOUS_CLOTHING', 'FREQ_PARTICIPATE_IN_PUBLIC_RELIGIOUS_EVENT',

'COMFORT_OWN_PRAY_WITH_MOTIONS',
'COMFORT_OWN_PRAY_WITH_OBJECTS',
'COMFORT_OWN_PRAY_BEFORE_MEALS',
'COMFORT_OWN_PRAY_FOR_OTHERS',
'COMFORT_OWN_ASK_TO_PRAY_WITH_SOMEONE',
'COMFORT_OWN_BRING_UP_RELIGION',
'COMFORT_OWN_ASK_ABOUT_RELIGION',
'COMFORT_OWN_DECLINE_FOOD_FOR_RELIGION',
'COMFORT_OWN_WEAR_RELIGIOUS_CLOTHING',
'COMFORT_OWN_PARTICIPATE_IN_PUBLIC_RELIGIOUS_EVENT',

'COMFORT_OTHER_PRAY_WITH_MOTIONS', 'COMFORT_OTHER_PRAY_WITH_OBJECTS', 'COMFORT_OTHER_PRAY_BEFORE_MEALS', 'COMFORT_OTHER_PRAY_FOR_OTHERS', 'COMFORT_OTHER_ASK_TO_PRAY_WITH_SOMEONE', 'COMFORT_OTHER_BRING_UP_RELIGION', 'COMFORT_OTHER_ASK_ABOUT_RELIGION', 'COMFORT_OTHER_DECLINE_FOOD_FOR_RELIGION',
'COMFORT_OTHER_WEAR_RELIGIOUS_CLOTHING', 'COMFORT_OTHER_PARTICIPATE_IN_PUBLIC_RELIGIOUS_EVENT', 'COMFORT_SEE_OTHER_PRAY_WITH_MOTIONS', 'COMFORT_SEE_OTHER_PRAY_WITH_OBJECTS', 'COMFORT_SEE_OTHER_PRAY_BEFORE_MEALS', 'COMFORT_SEE_OTHER_PRAY_FOR_OTHERS', 'COMFORT_SEE_OTHER_ASK_TO_PRAY_WITH_SOMEONE', 'COMFORT_SEE_OTHER_BRING_UP_RELIGION',
'COMFORT_SEE_OTHER_ASK_ABOUT_RELIGION', 'COMFORT_SEE_OTHER_DECLINE_FOOD_FOR_RELIGION', 'COMFORT_SEE_OTHER_WEAR_RELIGIOUS_CLOTHING', 'COMFORT_SEE_OTHER_PARTICIPATE_IN_PUBLIC_RELIGIOUS_EVENT', 'AGE', 'GENDER', 'HOUSEHOLD_SALARY', 'US_REGION')

names(religionData) <- colNames
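Assigning names by position only works if the vector has exactly one name per column; a vector that is too short leaves the trailing columns with missing names. A minimal sketch of a guard is shown here, assuming a hypothetical helper called `renameColumns` and a toy data frame that stands in for the survey file.

```r
# Hypothetical helper (not part of the original analysis):
# rename columns only when the number of names matches the number of columns.
renameColumns <- function(df, newNames) {
  if (length(newNames) != ncol(df)) {
    stop("Expected ", ncol(df), " names but got ", length(newNames))
  }
  names(df) <- newNames
  df
}

# Toy data standing in for the survey file (not the real data)
toy <- data.frame(a = c("x", "y"), b = c("18 - 29", "60+"))
toy <- renameColumns(toy, c("RELIGION", "AGE"))
names(toy)
```

With the real data, the same check would compare `length(colNames)` against `ncol(religionData)` before the assignment.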

The dataset has many fields, so I will subset it to general demographics and the survey responses about how comfortable respondents are seeing religious actions from outside their own religion.

colsToKeep <- c('COMFORT_SEE_OTHER_PRAY_WITH_MOTIONS', 'COMFORT_SEE_OTHER_PRAY_WITH_OBJECTS', 'COMFORT_SEE_OTHER_PRAY_BEFORE_MEALS', 'COMFORT_SEE_OTHER_PRAY_FOR_OTHERS', 'COMFORT_SEE_OTHER_ASK_TO_PRAY_WITH_SOMEONE', 'COMFORT_SEE_OTHER_BRING_UP_RELIGION',
'COMFORT_SEE_OTHER_ASK_ABOUT_RELIGION', 'COMFORT_SEE_OTHER_DECLINE_FOOD_FOR_RELIGION', 'COMFORT_SEE_OTHER_WEAR_RELIGIOUS_CLOTHING', 'COMFORT_SEE_OTHER_PARTICIPATE_IN_PUBLIC_RELIGIOUS_EVENT', 'AGE', 'GENDER', 'HOUSEHOLD_SALARY', 'US_REGION')

religionData <- religionData %>%
  filter(RELIGION != 'Response') %>%
  select(all_of(colsToKeep))

In order to quantify each individual’s comfort with public religious displays, I need to convert my categorical features to numeric. First, I will rank each survey response in order of comfort (with “Not at all comfortable” having the lowest ranking and “Extremely comfortable” having the highest ranking).

# the first ten columns of the subset hold the comfort questions
comfortColumns <- 1:10

for (i in comfortColumns) {
  religionData[[i]] <- factor(
    religionData[[i]],
    levels = c("", "Response", "Not at all comfortable", "Not so comfortable",
               "Somewhat comfortably", "Very comfortable", "Extremely comfortable"),
    ordered = TRUE
  )
}
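To make the ranking concrete, here is a small sketch of how an ordered factor maps labels to integer codes. The level labels are copied from the loop above; the toy responses, including the deliberately unlisted one, are made up for illustration.

```r
# Level labels copied from the loop above, lowest to highest
comfortLevels <- c("", "Response", "Not at all comfortable", "Not so comfortable",
                   "Somewhat comfortably", "Very comfortable", "Extremely comfortable")

# Hypothetical responses; the last one is not among the levels
toyResponses <- c("Not at all comfortable", "Very comfortable",
                  "Extremely comfortable", "", "Unlisted label")

toyFactor <- factor(toyResponses, levels = comfortLevels, ordered = TRUE)

# Integer codes follow the position of each label in comfortLevels
codes <- as.integer(toyFactor)
codes

# Ordered factors support comparisons between responses
toyFactor[1] < toyFactor[2]
```

Two things are worth noticing: the blank and "Response" levels take positions 1 and 2, so the real answers start at code 3, and any label that does not match a level exactly becomes `NA`, which is a quick way to catch spelling mismatches between the data and the level list.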

# eliminate records where any of the survey responses are blank
toBeRemoved <- which(
  religionData$COMFORT_SEE_OTHER_PRAY_WITH_MOTIONS == "" |
  religionData$COMFORT_SEE_OTHER_PRAY_WITH_OBJECTS == "" |
  religionData$COMFORT_SEE_OTHER_PRAY_BEFORE_MEALS == "" |
  religionData$COMFORT_SEE_OTHER_PRAY_FOR_OTHERS == "" |
  religionData$COMFORT_SEE_OTHER_ASK_TO_PRAY_WITH_SOMEONE == "" |
  religionData$COMFORT_SEE_OTHER_BRING_UP_RELIGION == "" |
  religionData$COMFORT_SEE_OTHER_ASK_ABOUT_RELIGION == "" |
  religionData$COMFORT_SEE_OTHER_DECLINE_FOOD_FOR_RELIGION == "" |
  religionData$COMFORT_SEE_OTHER_WEAR_RELIGIOUS_CLOTHING == "" |
  religionData$COMFORT_SEE_OTHER_PARTICIPATE_IN_PUBLIC_RELIGIOUS_EVENT == ""
)

# a negative index built from an empty vector would drop every row, so guard against it
if (length(toBeRemoved) > 0) {
  religionData <- religionData[-toBeRemoved, ]
}
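The same filter can be written without naming every column. The sketch below is one possible alternative, shown on a toy data frame with hypothetical column names `Q1` and `Q2` stored as plain character columns; it is not the code used to produce the results in this proposal.

```r
# Toy data with two hypothetical question columns (not the real survey)
toy <- data.frame(
  Q1  = c("Very comfortable", "", "Not so comfortable"),
  Q2  = c("Extremely comfortable", "Very comfortable", ""),
  AGE = c("18 - 29", "30 - 44", "60+"),
  stringsAsFactors = FALSE
)

questionCols <- c("Q1", "Q2")

# TRUE for every row with at least one blank answer
hasBlank <- rowSums(toy[questionCols] == "") > 0

# Logical indexing keeps all rows when nothing is blank
complete <- toy[!hasBlank, ]
nrow(complete)
```

Filtering with a logical vector also behaves sensibly when no row has a blank answer, because every row is then simply kept.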

The survey responses are still categorical, so I need to convert them to numeric values. Because the columns are now ordered factors, their integer codes preserve the hierarchy. Note that the codes run from 3 to 7 rather than 1 to 5, because the blank and Response levels occupy the first two positions. I’ll also compute an average rating across the ten survey responses and store it in a new variable called AVERAGE_RATING.

religionData$AVERAGE_RATING <- 0


# loop through all survey questions and convert each response to a number 
# add number to the AVERAGE_RATING column
for (i in comfortColumns) {
  religionData[[i]]<-as.numeric(religionData[[i]])
  religionData$AVERAGE_RATING <- religionData$AVERAGE_RATING + religionData[[i]]
}

# final average rating
religionData$AVERAGE_RATING <- religionData$AVERAGE_RATING/10
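The running sum in the loop can also be expressed with `rowMeans()`. The following is a sketch on toy numeric codes, with hypothetical columns `Q1` to `Q3`; the optional shift to a 1 to 5 scale is my own assumption and is not applied anywhere else in this proposal.

```r
# Toy numeric codes for three hypothetical questions (codes run from 3 to 7)
toy <- data.frame(
  Q1 = c(3, 5, 7),
  Q2 = c(4, 5, 6),
  Q3 = c(5, 5, 5)
)

questionCols <- c("Q1", "Q2", "Q3")

# Average across the question columns in one step
toy$AVERAGE_RATING <- unname(rowMeans(toy[questionCols]))

# Optional (assumption): subtract 2 so the scale runs from 1 to 5
toy$AVERAGE_RATING_1_TO_5 <- toy$AVERAGE_RATING - 2

toy
```

On the real data, the equivalent call would average the ten comfort columns after they have been converted with `as.numeric()`.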

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Are age, income, and region predictive of an individual’s comfort level with public religious displays?
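One way this question could be examined later is with a linear model that treats the age, income, and region brackets as categorical predictors. The sketch below is only an illustration under that assumption: the data are simulated with `sample()` and `runif()`, so the fitted values say nothing about the real survey.

```r
set.seed(606)
n <- 200

# Simulated stand-in data (not the survey); a few brackets per variable
toy <- data.frame(
  AGE = factor(sample(c("18 - 29", "30 - 44", "45 - 59", "60+"), n, replace = TRUE)),
  HOUSEHOLD_SALARY = factor(sample(c("$0 to $9,999", "$25,000 to $49,999",
                                     "$50,000 to $74,999"), n, replace = TRUE)),
  US_REGION = factor(sample(c("Pacific", "Mountain", "New England"), n, replace = TRUE)),
  AVERAGE_RATING = round(runif(n, min = 3, max = 7), 1)
)

# Each factor enters the model as a set of indicator variables
fit <- lm(AVERAGE_RATING ~ AGE + HOUSEHOLD_SALARY + US_REGION, data = toy)

# One row per predictor plus the residuals
anova(fit)
```

The ANOVA table gives an F test per predictor, which is one possible way to judge whether a variable helps explain the average rating.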

Cases

What are the cases, and how many are there?

Each case represents a respondent’s survey answers. After removing incomplete responses, 979 cases remain in the dataset.

Data collection

Describe the method of data collection.

The data were collected through a SurveyMonkey poll conducted between July 29 and August 1, 2016. According to the source, the survey asked 661 respondents questions about public displays of religion.

Type of study

What type of study is this (observational/experiment)?

This is a survey, which is a type of observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

This data came from: https://github.com/fivethirtyeight/data/tree/master/religion-survey.

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is comfort level (AVERAGE_RATING), and it is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

Quantitative Variable - Age, Income (both recorded as ordered brackets in this dataset)
Qualitative Variable - Region

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(religionData$AVERAGE_RATING)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.700   5.597   6.200   7.000

table(religionData$AGE)

## 
##  18 - 29  30 - 44  45 - 59      60+ Response 
##      212      255      264      248        0

table(religionData$HOUSEHOLD_SALARY)

## 
##         $0 to $9,999   $10,000 to $24,999 $100,000 to $124,999 
##                   86                   93                   87 
## $125,000 to $149,999 $150,000 to $174,999 $175,000 to $199,999 
##                   52                   37                   15 
##      $200,000 and up   $25,000 to $49,999   $50,000 to $74,999 
##                   53                  166                  151 
##   $75,000 to $99,999 Prefer not to answer             Response 
##                  111                  128                    0

table(religionData$US_REGION)

## 
##                    East North Central East South Central 
##                 13                164                 50 
##    Middle Atlantic           Mountain        New England 
##                124                 67                 65 
##            Pacific           Response     South Atlantic 
##                146                  0                191 
## West North Central West South Central 
##                 68                 91
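The tables above still list an empty Response level and sort the salary brackets alphabetically, so $100,000 to $124,999 appears before $25,000 to $49,999. A possible cleanup is sketched below on a toy factor; the bracket labels are copied from the salary table, while the variable name `salaryOrder` and the toy values are mine.

```r
# Salary brackets in ascending order, labels copied from the table above
salaryOrder <- c("$0 to $9,999", "$10,000 to $24,999", "$25,000 to $49,999",
                 "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $124,999",
                 "$125,000 to $149,999", "$150,000 to $174,999", "$175,000 to $199,999",
                 "$200,000 and up", "Prefer not to answer")

# Toy factor that carries an unused "Response" level, like the real column
toy <- data.frame(
  HOUSEHOLD_SALARY = factor(
    c("$100,000 to $124,999", "$0 to $9,999", "$25,000 to $49,999"),
    levels = c(salaryOrder, "Response")
  )
)

# Re-level in ascending order, then drop levels that never occur
toy$HOUSEHOLD_SALARY <- droplevels(factor(toy$HOUSEHOLD_SALARY, levels = salaryOrder))

table(toy$HOUSEHOLD_SALARY)
```

Applied to the real columns, `droplevels()` alone would remove the zero-count Response entries from the tables, and the explicit level order would make the salary boxplot read from lowest to highest bracket.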

religionData %>%
  group_by(AGE) %>%
  summarise(MEAN_BY_AGE = mean(AVERAGE_RATING),
            MEDIAN_BY_AGE = median(AVERAGE_RATING),
            STDEV_BY_AGE = sd(AVERAGE_RATING))

## # A tibble: 4 x 4
##   AGE     MEAN_BY_AGE MEDIAN_BY_AGE STDEV_BY_AGE
##   <fct>         <dbl>         <dbl>        <dbl>
## 1 18 - 29        5.67          5.75        0.966
## 2 30 - 44        5.54          5.6         1.02 
## 3 45 - 59        5.61          5.7         0.938
## 4 60+            5.59          5.6         0.808

boxplot(religionData$AVERAGE_RATING~religionData$AGE)

religionData %>%
  group_by(HOUSEHOLD_SALARY) %>%
  summarise(MEAN_BY_SALARY = mean(AVERAGE_RATING),
            MEDIAN_BY_SALARY = median(AVERAGE_RATING),
            STDEV_BY_SALARY = sd(AVERAGE_RATING))

## # A tibble: 11 x 4
##    HOUSEHOLD_SALARY     MEAN_BY_SALARY MEDIAN_BY_SALARY STDEV_BY_SALARY
##    <fct>                         <dbl>            <dbl>           <dbl>
##  1 $0 to $9,999                   5.54             5.75           1.16 
##  2 $10,000 to $24,999             5.68             5.8            0.937
##  3 $100,000 to $124,999           5.54             5.6            0.939
##  4 $125,000 to $149,999           5.58             5.7            0.983
##  5 $150,000 to $174,999           5.45             5.6            1.01 
##  6 $175,000 to $199,999           5.61             5.7            0.728
##  7 $200,000 and up                5.60             5.7            0.855
##  8 $25,000 to $49,999             5.63             5.8            1.01 
##  9 $50,000 to $74,999             5.76             5.9            0.832
## 10 $75,000 to $99,999             5.47             5.6            0.945
## 11 Prefer not to answer           5.53             5.55           0.778

boxplot(religionData$AVERAGE_RATING~religionData$HOUSEHOLD_SALARY)

religionData %>%
  group_by(US_REGION) %>%
  summarise(MEAN_BY_REGION = mean(AVERAGE_RATING),
            MEDIAN_BY_REGION = median(AVERAGE_RATING),
            STDEV_BY_REGION = sd(AVERAGE_RATING))

## # A tibble: 10 x 4
##    US_REGION          MEAN_BY_REGION MEDIAN_BY_REGION STDEV_BY_REGION
##    <fct>                       <dbl>            <dbl>           <dbl>
##  1 ""                           5.19             5.3            1.07 
##  2 East North Central           5.69             5.9            0.986
##  3 East South Central           5.85             5.95           0.735
##  4 Middle Atlantic              5.56             5.6            0.916
##  5 Mountain                     5.41             5.5            1.05 
##  6 New England                  5.69             5.7            0.782
##  7 Pacific                      5.59             5.7            1.02 
##  8 South Atlantic               5.58             5.6            0.953
##  9 West North Central           5.51             5.65           0.880
## 10 West South Central           5.59             5.6            0.807

boxplot(religionData$AVERAGE_RATING~religionData$US_REGION)

religionData %>%
  group_by(GENDER) %>%
  summarise(MEAN_BY_GENDER = mean(AVERAGE_RATING),
            MEDIAN_BY_GENDER = median(AVERAGE_RATING),
            STDEV_BY_GENDER = sd(AVERAGE_RATING))

## # A tibble: 2 x 4
##   GENDER MEAN_BY_GENDER MEDIAN_BY_GENDER STDEV_BY_GENDER
##   <fct>           <dbl>            <dbl>           <dbl>
## 1 Female           5.70              5.8           0.852
## 2 Male             5.49              5.6           1.01

boxplot(religionData$AVERAGE_RATING~religionData$GENDER)

religionData %>%
  group_by(AGE,GENDER) %>%
  summarise(MEAN_BY_AGE_GENDER = mean(AVERAGE_RATING),
            MEDIAN_BY_AGE_GENDER = median(AVERAGE_RATING),
            STDEV_BY_AGE_GENDER = sd(AVERAGE_RATING))

## # A tibble: 8 x 5
## # Groups:   AGE [4]
##   AGE     GENDER MEAN_BY_AGE_GENDER MEDIAN_BY_AGE_GENDER STDEV_BY_AGE_GEND~
##   <fct>   <fct>               <dbl>                <dbl>              <dbl>
## 1 18 - 29 Female               5.77                 5.9               0.909
## 2 18 - 29 Male                 5.55                 5.5               1.01 
## 3 30 - 44 Female               5.61                 5.7               0.872
## 4 30 - 44 Male                 5.46                 5.5               1.16 
## 5 45 - 59 Female               5.73                 5.8               0.844
## 6 45 - 59 Male                 5.48                 5.65              1.02 
## 7 60+     Female               5.68                 5.7               0.794
## 8 60+     Male                 5.47                 5.55              0.813

ggplot(religionData, aes(x=AVERAGE_RATING))+ geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
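The message above appears because no bin width was supplied. Since AVERAGE_RATING is a sum of ten integer codes divided by 10, it can only move in steps of 0.1, so a bin width of 0.1 is one reasonable choice. The sketch below uses a handful of made-up values rather than the survey data.

```r
library(ggplot2)

# Made-up ratings for illustration; each is a multiple of 0.1
toy <- data.frame(AVERAGE_RATING = c(3.0, 5.0, 5.7, 5.7, 6.2, 7.0))

# An explicit binwidth avoids the default of 30 bins and the message
p <- ggplot(toy, aes(x = AVERAGE_RATING)) +
  geom_histogram(binwidth = 0.1)

p
```

A wider value such as 0.5 would give a smoother picture of the overall shape; which one is preferable depends on how much detail the plot needs to show.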