The Data

The data comes from our class survey. First load the data. We will call the dataset sd as an abbreviation for survey data.

sd<-read_excel("survey_data.xlsx")

Explore the Data

Each observation (or row) is an individual survey response.

We can see what variables are in the dataset. We can match these variables to the survey.

names(sd)
##  [1] "consent"           "college"           "rel_extent"       
##  [4] "rel_services"      "rel_pray"          "rel_environment"  
##  [7] "rel_change"        "college_major"     "college_year"     
## [10] "anxiety_1"         "anxiety_2"         "anxiety_3"        
## [13] "anxiety_4"         "anxiety_5"         "anxiety_6"        
## [16] "anxiety_7"         "anxiety_8"         "cheating_1"       
## [19] "cheating_2"        "cheating_3"        "alcohol_frequency"
## [22] "mj"                "q17"               "alcohol_drinks"

We can also look at the dataset by clicking on sd in the Data window.

Categorical Variables

Many of our variables are categorical. The values that they take are not meaningful unless you look at the corresponding category. Open sd and look at the rel_extent column. What does it mean to go from a 5 to a 6? Notice that this variable is not like income or age where it is easy to understand what a change in the values mean.

Since almost all of the survey involves these types of variables, we’ll work through how to best use them.

Generate Measure of Religiosity

We will typically need to generate new variables based on the data. Let’s use rel_extent where is a stated preference measure of one’s religiosity. We can see the values the variable takes and the proportions.

table(sd$rel_extent)
## 
##   1   4   5   6 
##   8  68  91 142
prop.table(table(sd$rel_extent))
## 
##          1          4          5          6 
## 0.02588997 0.22006472 0.29449838 0.45954693

We see that 8 individuals or 2.6% are “Very Religious”, 68 (22%) are “Moderately Religious”, 91 (29.4%) are “Slightly Religious”, and 142 (46%) are “Not religious at all.”

Should we use this variable as a measure of religiosity? We can. An alternative that I prefer is to generate an indicator variable that puts everyone into two groups. The first group are “religious” and the second group will be "non/less religious. Note this distinction is arbitrary.

Let’s define religious as an individual who identifies as “Very Religious” or “Moderately Religious.”

Method 1: Using mutate we generate the variable religious that takes a value of “1” if the rel_extent equals 1 OR 4, and 0 otherwise. The %>% is a pipe: it tells R which dataset to use. Finally the sd <- tells R to modify the dataset.

sd<-sd %>% mutate(religious= ifelse((rel_extent == 1 | rel_extent == 4) ,1,0 ))

table(sd$religious)
## 
##   0   1 
## 233  76
prop.table(table(sd$religious))
## 
##         0         1 
## 0.7540453 0.2459547

Method 2: Same idea as above, except now religious2 takes a value of “1” if rel_extent is between the values of 1 and 4. You get the same result as above, but this method might be useful for other variables.

sd<-sd %>% mutate(religious2= ifelse((rel_extent >= 1 & rel_extent <= 4) ,1,0 ))
table(sd$religious2)
## 
##   0   1 
## 233  76
prop.table(table(sd$religious2))
## 
##         0         1 
## 0.7540453 0.2459547
Q1: Create an indicator that takes a value of 1 if someone ever goes to a religious service and 0 if the indivdual never goes.

Outcomes

Let’s define a “frequent drinker” as someone who drinks at least 2-3 times a week. Note that this is arbitrary. You can define a new variable such as “somewhat frequent drinker” as someone who drinks at least 2-4 a month. One useful thing is to look at the distribution of the variable and then deciding how to arbitrarily generate the new variable.

table(sd$alcohol_frequency)
## 
##   1   2   3   4   5 
##  30  26  73 136  44

We’ll generate the drink_freq variable.

sd<-sd %>%
  mutate(drink_freq = ifelse((alcohol_frequency>=4 & alcohol_frequency<=5), 1, 0))

table(sd$drink_freq)
## 
##   0   1 
## 129 180

Analysis

Do religious people drink more? We can look at the association. First, let’s look at the percentage of people who are frequent drinkers between the religious and non-religious. We create a table drink_rel that uses the sd dataset and measurevar takes the mean of the variable drink_feq for both the religious and non-religious group using groupvars. Finally, na.rm=TRUE tells R to remove missing values from the analysis.

drink_rel <-summarySE(sd,measurevar="drink_freq",groupvars="religious", na.rm=TRUE)
drink_rel
##   religious   N drink_freq        sd         se         ci
## 1         0 233  0.5879828 0.4932578 0.03231439 0.06366717
## 2         1  76  0.5657895 0.4989463 0.05723306 0.11401410

We see that 58.8% of the non-religious frequent drinkers (1st row) and 56.6% of the religious are frequent drinkers (2nd row).

We’re going to do one more thing that will make are figures look nicer.

drink_rel<-drink_rel%>%
  mutate(religious2 = ifelse(religious==1, "Religious", "Not Religious"))
drink_rel
##   religious   N drink_freq        sd         se         ci    religious2
## 1         0 233  0.5879828 0.4932578 0.03231439 0.06366717 Not Religious
## 2         1  76  0.5657895 0.4989463 0.05723306 0.11401410     Religious

All we did is create a new variable that defines each group.

Bar Plot

We use the drink_rel table, and define the x-axis and y-axis by the variables we want to plot. with fill defining the color and legend.

ggplot(drink_rel, aes(x=religious2)) +
       geom_col(aes(y= drink_freq))

We can format it by adding a title, labeling the axis, and adding color (via fill).

ggplot(drink_rel, aes(religious2)) +
geom_col(aes(y    = drink_freq,
            fill  = drink_freq)) + ggtitle("Religiosity is associated with Less Drinking") +xlab("Religion") + ylab("Frequent Drinker")

Is the difference statistically significant? Let’s do a linear regression. NOTE: for the regression we will use the sd dataset which has 300 observations and NOT the drink_rel table. This is why we use religious and not religious2.

reg_results<-lm(drink_freq ~ religious, data = sd)
summary(reg_results)
## 
## Call:
## lm(formula = drink_freq ~ religious, data = sd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5880 -0.5880  0.4120  0.4120  0.4342 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.58798    0.03241   18.14   <2e-16 ***
## religious   -0.02219    0.06534   -0.34    0.734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4947 on 307 degrees of freedom
## Multiple R-squared:  0.0003756,  Adjusted R-squared:  -0.00288 
## F-statistic: 0.1154 on 1 and 307 DF,  p-value: 0.7344

When using an indicator variable as the independent variable in a regression (religious = 1 if religious and 0 otherwise), we can interpret the results as follows:

  • The estimate for the (Intercept) is .60, so the percentage of non-religious who drink frequently is 60%.

  • The estimate for religious is -.02069. Religious individuals are about 2% less likely to be frequent drinkers. The percentage of religious who drink is 60% - 2% = 58%.

  • The p-value for the religious estimate is .754 and it has no stars (*). Therefore this estimate is not statistically significant.

We can thus infer that while religious individuals drink at a lower rate, this difference is not statistically significant. We conclude that we do not have evidence that religiosity leads to a change in drinking behavior.

Let’s revise our title to reflect this.

ggplot(drink_rel, aes(religious2)) +
geom_col(aes(y    = drink_freq,
            fill  = drink_freq)) + ggtitle("Little Evidence that Religiosity associated with Less Drinking") +xlab("Religion") + ylab("Frequent Drinker")

Q2 Is the revealed preference measure of relgiosity associated with drinking? Generate a figure and use linear regression to see if the difference is statistically significant. Interpret the results of the linear regression.

Aggregating multiple variables

The anxiety questionnaire had multiple questions that one could use to create an index. The index score can be interpreted using this scoring system.

table(sd$anxiety_1)
## 
##   0   1   2   3 
##  36 111  95  66
table(sd$anxiety_2)
## 
##   0   1   2   3 
##  86 117  61  43

Look at the dataset and we see that to generate an index we can aggregate all the values for the anxiety measures for each row.
- We update the sd dataset with sd<-.
- We use the sd dataset for the mutate function with the pipe %>% - mutate generates a new variable anxiety which selects the variables anxiety_1 to anxiety 7 and sums up the values for each individual rowSums and removes missing values/responses na.rm = TRUE

sd<-sd %>%
    mutate(anxiety = select(. , anxiety_1:anxiety_7) %>% rowSums(na.rm = TRUE))

sd %>%
    select(starts_with('anxiety'))
## # A tibble: 309 x 9
##    anxiety_1 anxiety_2 anxiety_3 anxiety_4 anxiety_5 anxiety_6 anxiety_7
##        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1         1         2         2         2         1         1         1
##  2         2         2         2         2         2         2         2
##  3         3         3         3         2         2         2         3
##  4         1         1         1         0         0         0         1
##  5         2         2         2         2         1         1         1
##  6         3         2         3         3         3         3         1
##  7         2         1         2         1         0         0         0
##  8         0         0         0         1         0         0         0
##  9         2         1         1         2         1         1         1
## 10         0         0         0         0         0         0         0
## # … with 299 more rows, and 2 more variables: anxiety_8 <dbl>, anxiety <dbl>

Note I did not include anxiety_8 in the index as this was not part of the original GAD-7.

Q3 For the honor code questions, how would you aggregate them? Would you create an index?
Q4 Is the stated preference measure of religiosity religious associated with greater or less anxiety? Can you conclude that religiosity can explain the change in anxiety? Why or why not?