if (!require(tidyverse)) { install.packages("tidyverse"); library(tidyverse) }

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any R code chunks embedded in the document. You can embed an R code chunk like this:

## Rows: 31 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): ID, Group_by_PA, coping_stress
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

One-Way ANOVA

The one-way ANOVA is also referred to as a between-subjects ANOVA or one-factor (one independent variable) ANOVA. Although it can be used with an independent variable with only two groups, the independent-samples t-test is typically used in this situation instead. For this reason, you will come across the one-way ANOVA being described as a test to use when you have three or more groups (rather than two or more groups).
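The relationship between the two-group ANOVA and the independent-samples t-test can be checked numerically. The following is a minimal sketch on simulated data (the groups "A" and "B" and their scores are hypothetical and are not the study data); it assumes the equal-variance form of the t-test, which is the form that matches the ANOVA.

```r
# Simulated scores for two hypothetical groups (not the study data)
set.seed(42)
scores <- c(rnorm(10, mean = 5, sd = 1), rnorm(10, mean = 6, sd = 1))
group  <- factor(rep(c("A", "B"), each = 10))

# Independent-samples t-test assuming equal variances
t_res <- t.test(scores ~ group, var.equal = TRUE)

# One-way ANOVA on the same data
aov_tab <- summary(aov(scores ~ group))[[1]]

t_stat  <- unname(t_res$statistic)
f_stat  <- aov_tab[["F value"]][1]
p_anova <- aov_tab[["Pr(>F)"]][1]

# With two groups, F equals the squared t statistic and the p-values agree
c(t_squared = t_stat^2, F = f_stat)
c(p_t_test = t_res$p.value, p_anova = p_anova)
```

Both tests lead to the same decision with two groups, which is why the t-test is usually reported in that case.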

Scenario: A researcher believes that individuals who are more physically active are better able to cope with stress in the workplace. To test this theory, the researcher recruited 31 subjects and measured how many minutes of physical activity they performed per week and their ability to cope with workplace stress. The subjects were categorized into four groups based on the number of minutes of physical activity they performed: “sedentary” (1), “low” (2), “moderate” (3) and “high” (4). These groups (levels of physical activity) formed an independent variable called Group_by_PA. The ability to cope with workplace stress was assessed as the average score of a series of Likert items on a questionnaire, which allowed an overall “coping with workplace stress” score to be calculated; higher scores indicate a greater ability to cope with workplace-related stress. This dependent variable was called coping_stress. The researcher would like to know whether the ability to cope with stress depends on physical activity level.

Question: What is the research question? What is the null hypothesis in this scenario?

Variable Names

You can check out the exact variable names:

## [1] "ID"            "Group_by_PA"   "coping_stress"

Note that the echo = FALSE parameter was added to certain code chunks so that only their output, and not the R code that produced it, is printed.

Let us assign data to these variable names:

PA <- as.factor(cop$Group_by_PA)
c_stress <- cop$coping_stress
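The conversion with as.factor() matters because the group codes are stored as numbers. The sketch below uses simulated data (the variables group_code and score are hypothetical, not the study data) to show what changes: with a numeric predictor, aov() fits a single slope, whereas with a factor it compares the four group means.

```r
# Simulated data with four hypothetical group codes (not the study data)
set.seed(1)
group_code <- rep(1:4, each = 8)
score <- rnorm(32, mean = 4 + group_code, sd = 1.5)

# Numeric predictor: aov() fits one slope, so the term has 1 degree of freedom
df_numeric <- summary(aov(score ~ group_code))[[1]][["Df"]][1]

# Factor predictor: aov() compares four group means, so 4 - 1 = 3 df
df_factor <- summary(aov(score ~ as.factor(group_code)))[[1]][["Df"]][1]

c(numeric = df_numeric, factor = df_factor)
```

A term with 3 degrees of freedom in the ANOVA table is a quick check that the grouping variable was treated as a factor.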

Next, we can calculate summary statistics for coping stress by physical activity level:

cop %>%
  group_by(Group_by_PA) %>%
  summarise(
    mean_coping = mean(coping_stress),
    sd_coping = sd(coping_stress)
  )
## # A tibble: 4 × 3
##   Group_by_PA mean_coping sd_coping
##         <dbl>       <dbl>     <dbl>
## 1           1        4.15     0.771
## 2           2        5.88     1.69 
## 3           3        7.12     1.57 
## 4           4        7.51     1.24

We can generate a plot of the two variables:

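# Treat the group code as a factor and show each group's distribution;
# geom_col() on the raw rows would stack the scores into per-group sums
ggplot(cop, aes(x = factor(Group_by_PA), y = coping_stress)) +
  geom_boxplot()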

Now we can perform a one-way ANOVA by using the following code:

# one-way ANOVA
anova <- aov(c_stress~PA)
summary(anova)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## PA           3  49.03  16.344   8.316 0.000445 ***
## Residuals   27  53.07   1.965                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
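To see where the F value and p-value in this table come from, they can be recomputed from the mean squares and degrees of freedom. This is a minimal sketch using the rounded values printed above, so the results agree with the table only up to rounding; the variable names are illustrative.

```r
# Values copied from the ANOVA table above
ms_between <- 16.344  # Mean Sq for PA
ms_within  <- 1.965   # Mean Sq for Residuals
df_between <- 3
df_within  <- 27

# F is the ratio of the between-group to the within-group mean square
f_value <- ms_between / ms_within

# The upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(f_value, 2)   # about 8.32, matching the table up to rounding
signif(p_value, 2)
```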

Question: What can you conclude about the differences in coping stress between all four groups of physical activity level?

Post-Hoc Test

It is important to realize that the one-way ANOVA is an omnibus test and cannot tell you which specific groups were significantly different from each other; it only tells you that at least two groups were different. Since you may have three, four, five or more groups in your study design, determining which of these groups differ from each other is important. You can do this using follow-up (aka post-hoc) tests. Since there are four groups of PA level in this scenario and the one-way ANOVA determined that the differences between at least two groups are statistically significant, we need to run a post-hoc test. We will use the Tukey post-hoc test to determine which pairwise differences are statistically significant.

# post-hoc test
TukeyHSD(anova)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = c_stress ~ PA)
## 
## $PA
##          diff        lwr      upr     p adj
## 2-1 1.7276175 -0.2057757 3.661011 0.0923527
## 3-1 2.9715262  0.9859704 4.957082 0.0018413
## 4-1 3.3540854  1.3034122 5.404759 0.0006806
## 3-2 1.2439086 -0.6202750 3.108092 0.2835038
## 4-2 1.6264679 -0.3069254 3.559861 0.1226045
## 4-3 0.3825593 -1.6029965 2.368115 0.9517285

The comparison in the first row states that the mean coping stress score in the “Low” group is 1.72762 points higher than in the “Sedentary” group. To determine whether this mean difference is statistically significant, you need to examine the p adj column, which presents the adjusted p-value. For this comparison, the p-value is .092 (i.e., p = .092). If p < .05, you have a statistically significant result, but if p > .05, you do not have a statistically significant result. For our comparison between the “Sedentary” and “Low” groups, p = .092, which is greater than .05 and, therefore, the mean difference between these two groups is not statistically significant (i.e., we cannot reject the null hypothesis that the mean difference is zero in the population).
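The values in the p adj column are adjusted because six pairwise comparisons are made at once. The sketch below illustrates why an adjustment is needed; it relies on the simplifying assumption that the six tests are independent, which is not strictly true for pairwise comparisons, so the result is only a rough guide.

```r
k <- 4                      # number of physical activity groups
n_pairs <- choose(k, 2)     # number of pairwise comparisons: 6
alpha <- 0.05

# Chance of at least one false positive across all comparisons if each
# were tested at alpha without adjustment (assumes independent tests)
familywise_error <- 1 - (1 - alpha)^n_pairs

n_pairs
round(familywise_error, 3)  # 0.265
```

Under that assumption, unadjusted testing would carry roughly a 26% chance of at least one false positive, whereas the Tukey procedure holds the family-wise level at 5%.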

Confidence Intervals

Because a p-value alone says nothing about the size or precision of the difference, we can also interpret the 95% confidence interval, which for this comparison runs from -0.2058 to 3.6610 (the “lwr” and “upr” columns, respectively). It is best to report the confidence interval as well as the p-value in your reports, as shown here in APA format:

There was an increase in coping stress score from the sedentary group (M = 4.2, SD = 0.8) to the group performing a low level of physical activity (M = 5.9, SD = 1.7), a mean increase of 1.7, 95% CI [-0.2, 3.7], which was not statistically significant (p = .092).

There is a link between the 95% confidence interval of the mean difference and the statistical significance of the mean difference. If the confidence interval does not contain 0 (zero), you have a statistically significant mean difference (p < .05). If it does contain zero, you do not have a statistically significant mean difference (p > .05). In this example, the 95% confidence interval runs from -0.2058 to 3.6610, so it includes zero and indicates a result that is not statistically significant.
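This link can be checked for every row of the Tukey output at once. The following sketch re-enters the values printed by TukeyHSD() above into a data frame (the data frame and column names are illustrative) and compares the two criteria.

```r
# Values copied from the TukeyHSD() output above
tukey <- data.frame(
  comparison = c("2-1", "3-1", "4-1", "3-2", "4-2", "4-3"),
  lwr   = c(-0.2057757, 0.9859704, 1.3034122, -0.6202750, -0.3069254, -1.6029965),
  upr   = c(3.661011, 4.957082, 5.404759, 3.108092, 3.559861, 2.368115),
  p_adj = c(0.0923527, 0.0018413, 0.0006806, 0.2835038, 0.1226045, 0.9517285)
)

# An interval excludes zero when both limits have the same sign
tukey$ci_excludes_zero <- tukey$lwr > 0 | tukey$upr < 0
tukey$significant <- tukey$p_adj < 0.05

tukey[, c("comparison", "ci_excludes_zero", "significant")]
```

The two logical columns agree in every row, as the rule above predicts.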

Question: What conclusion can you make about the mean difference between the high (4) and sedentary (1) groups? Write out your interpretation in APA format.

Resources for learning R and working in RStudio

That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses. You might find the following tips and resources helpful.