The data comes from our class survey. First load the data. We will call the dataset sd as an abbreviation for survey data.
sd<-read_excel("survey_data.xlsx")
Each observation (or row) is an individual survey response.
We can see what variables are in the dataset. We can match these variables to the survey.
names(sd)
## [1] "consent" "college" "rel_extent"
## [4] "rel_services" "rel_pray" "rel_environment"
## [7] "rel_change" "college_major" "college_year"
## [10] "anxiety_1" "anxiety_2" "anxiety_3"
## [13] "anxiety_4" "anxiety_5" "anxiety_6"
## [16] "anxiety_7" "anxiety_8" "cheating_1"
## [19] "cheating_2" "cheating_3" "alcohol_frequency"
## [22] "mj" "q17" "alcohol_drinks"
We can also look at the dataset by clicking on sd in the Data window.
Many of our variables are categorical. The values that they take are not meaningful unless you look at the corresponding category. Open sd and look at the rel_extent column. What does it mean to go from a 5 to a 6? Notice that this variable is not like income or age where it is easy to understand what a change in the values mean.
Since almost all of the survey involves these types of variables, we’ll work through how to best use them.
We will typically need to generate new variables based on the data. Let’s use rel_extent where is a stated preference measure of one’s religiosity. We can see the values the variable takes and the proportions.
table(sd$rel_extent)
##
## 1 4 5 6
## 8 68 91 142
prop.table(table(sd$rel_extent))
##
## 1 4 5 6
## 0.02588997 0.22006472 0.29449838 0.45954693
We see that 8 individuals or 2.6% are “Very Religious”, 68 (22%) are “Moderately Religious”, 91 (29.4%) are “Slightly Religious”, and 142 (46%) are “Not religious at all.”
Should we use this variable as a measure of religiosity? We can. An alternative that I prefer is to generate an indicator variable that puts everyone into two groups. The first group are “religious” and the second group will be "non/less religious. Note this distinction is arbitrary.
Let’s define religious as an individual who identifies as “Very Religious” or “Moderately Religious.”
Method 1: Using mutate we generate the variable religious that takes a value of “1” if the rel_extent equals 1 OR 4, and 0 otherwise. The %>% is a pipe: it tells R which dataset to use. Finally the sd <- tells R to modify the dataset.
sd<-sd %>% mutate(religious= ifelse((rel_extent == 1 | rel_extent == 4) ,1,0 ))
table(sd$religious)
##
## 0 1
## 233 76
prop.table(table(sd$religious))
##
## 0 1
## 0.7540453 0.2459547
Method 2: Same idea as above, except now religious2 takes a value of “1” if rel_extent is between the values of 1 and 4. You get the same result as above, but this method might be useful for other variables.
sd<-sd %>% mutate(religious2= ifelse((rel_extent >= 1 & rel_extent <= 4) ,1,0 ))
table(sd$religious2)
##
## 0 1
## 233 76
prop.table(table(sd$religious2))
##
## 0 1
## 0.7540453 0.2459547
Let’s define a “frequent drinker” as someone who drinks at least 2-3 times a week. Note that this is arbitrary. You can define a new variable such as “somewhat frequent drinker” as someone who drinks at least 2-4 a month. One useful thing is to look at the distribution of the variable and then deciding how to arbitrarily generate the new variable.
table(sd$alcohol_frequency)
##
## 1 2 3 4 5
## 30 26 73 136 44
We’ll generate the drink_freq variable.
sd<-sd %>%
mutate(drink_freq = ifelse((alcohol_frequency>=4 & alcohol_frequency<=5), 1, 0))
table(sd$drink_freq)
##
## 0 1
## 129 180
Do religious people drink more? We can look at the association. First, let’s look at the percentage of people who are frequent drinkers between the religious and non-religious. We create a table drink_rel that uses the sd dataset and measurevar takes the mean of the variable drink_feq for both the religious and non-religious group using groupvars. Finally, na.rm=TRUE tells R to remove missing values from the analysis.
drink_rel <-summarySE(sd,measurevar="drink_freq",groupvars="religious", na.rm=TRUE)
drink_rel
## religious N drink_freq sd se ci
## 1 0 233 0.5879828 0.4932578 0.03231439 0.06366717
## 2 1 76 0.5657895 0.4989463 0.05723306 0.11401410
We see that 58.8% of the non-religious frequent drinkers (1st row) and 56.6% of the religious are frequent drinkers (2nd row).
We’re going to do one more thing that will make are figures look nicer.
drink_rel<-drink_rel%>%
mutate(religious2 = ifelse(religious==1, "Religious", "Not Religious"))
drink_rel
## religious N drink_freq sd se ci religious2
## 1 0 233 0.5879828 0.4932578 0.03231439 0.06366717 Not Religious
## 2 1 76 0.5657895 0.4989463 0.05723306 0.11401410 Religious
All we did is create a new variable that defines each group.
We use the drink_rel table, and define the x-axis and y-axis by the variables we want to plot. with fill defining the color and legend.
ggplot(drink_rel, aes(x=religious2)) +
geom_col(aes(y= drink_freq))
We can format it by adding a title, labeling the axis, and adding color (via fill).
ggplot(drink_rel, aes(religious2)) +
geom_col(aes(y = drink_freq,
fill = drink_freq)) + ggtitle("Religiosity is associated with Less Drinking") +xlab("Religion") + ylab("Frequent Drinker")
Is the difference statistically significant? Let’s do a linear regression. NOTE: for the regression we will use the sd dataset which has 300 observations and NOT the drink_rel table. This is why we use religious and not religious2.
reg_results<-lm(drink_freq ~ religious, data = sd)
summary(reg_results)
##
## Call:
## lm(formula = drink_freq ~ religious, data = sd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5880 -0.5880 0.4120 0.4120 0.4342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.58798 0.03241 18.14 <2e-16 ***
## religious -0.02219 0.06534 -0.34 0.734
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4947 on 307 degrees of freedom
## Multiple R-squared: 0.0003756, Adjusted R-squared: -0.00288
## F-statistic: 0.1154 on 1 and 307 DF, p-value: 0.7344
When using an indicator variable as the independent variable in a regression (religious = 1 if religious and 0 otherwise), we can interpret the results as follows:
The estimate for the (Intercept) is .60, so the percentage of non-religious who drink frequently is 60%.
The estimate for religious is -.02069. Religious individuals are about 2% less likely to be frequent drinkers. The percentage of religious who drink is 60% - 2% = 58%.
The p-value for the religious estimate is .754 and it has no stars (*). Therefore this estimate is not statistically significant.
We can thus infer that while religious individuals drink at a lower rate, this difference is not statistically significant. We conclude that we do not have evidence that religiosity leads to a change in drinking behavior.
Let’s revise our title to reflect this.
ggplot(drink_rel, aes(religious2)) +
geom_col(aes(y = drink_freq,
fill = drink_freq)) + ggtitle("Little Evidence that Religiosity associated with Less Drinking") +xlab("Religion") + ylab("Frequent Drinker")
The anxiety questionnaire had multiple questions that one could use to create an index. The index score can be interpreted using this scoring system.
table(sd$anxiety_1)
##
## 0 1 2 3
## 36 111 95 66
table(sd$anxiety_2)
##
## 0 1 2 3
## 86 117 61 43
Look at the dataset and we see that to generate an index we can aggregate all the values for the anxiety measures for each row.
- We update the sd dataset with sd<-.
- We use the sd dataset for the mutate function with the pipe %>% - mutate generates a new variable anxiety which selects the variables anxiety_1 to anxiety 7 and sums up the values for each individual rowSums and removes missing values/responses na.rm = TRUE
sd<-sd %>%
mutate(anxiety = select(. , anxiety_1:anxiety_7) %>% rowSums(na.rm = TRUE))
sd %>%
select(starts_with('anxiety'))
## # A tibble: 309 x 9
## anxiety_1 anxiety_2 anxiety_3 anxiety_4 anxiety_5 anxiety_6 anxiety_7
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 2 2 1 1 1
## 2 2 2 2 2 2 2 2
## 3 3 3 3 2 2 2 3
## 4 1 1 1 0 0 0 1
## 5 2 2 2 2 1 1 1
## 6 3 2 3 3 3 3 1
## 7 2 1 2 1 0 0 0
## 8 0 0 0 1 0 0 0
## 9 2 1 1 2 1 1 1
## 10 0 0 0 0 0 0 0
## # … with 299 more rows, and 2 more variables: anxiety_8 <dbl>, anxiety <dbl>
Note I did not include anxiety_8 in the index as this was not part of the original GAD-7.
religious associated with greater or less anxiety? Can you conclude that religiosity can explain the change in anxiety? Why or why not?