library(afex) # to run the ANOVA and plot results
library(psych) # for the describe() command
library(ggplot2) # to visualize our results
library(expss) # for the cross_cases() command
library(car) # for the leveneTest() command
library(emmeans) # for posthoc testsANOVA HW
Loading Libraries
Importing Data
#change this to my data for HW
d <- read.csv(file="Data/mydata.csv", header=T)
#
# # new code! this adds a column with a number for each row. it acts as an identifier for our participations
d$row_id <- 1:nrow(d)State Your Hypothesis - PART OF YOUR WRITEUP
Note: You can chose to run either a one-way ANOVA (a single IV with more than 3 levels) or a two-way/factorial ANOVA (at least two IVs) for the homework. You will need to specify your hypothesis and customize your code based on the choice you make. I will run both versions of the test here for illustrative purposes.
I predict that under age 18 group will report the highest scores on the intolerance of uncertainty scale than the 18-45 group and the 45 and over group.
#State your hypotheses. Remember, your DV will be a continuous variable. For your IV, you need either one categorical variable with three levels, or two categorical variables with at least two levels each.
Check Your Assumptions
ANOVA Assumptions
- Independence of observations (confirmed by data report)
- All levels of the IVs should have equal number of cases (ideally; in the real world, this varies) and there should be no empty cells. Cells with low numbers increase chance of Type II error. (we will check this below)
- Homogeneity of variance should be assured (we will check this below)
- Outliers should be identified and removed (we will check this below)
- DV should be normally distributed for each level of the IV (we will check this below)
Check levels of IVs
# # for a one-way ANOVA
# # COMMENT THIS SECTION OUR FROM THE HW IF YOU DO NOT USE IT
table(d$age, useNA = "always")
1 under 18 2 between 18 and 25 3 between 26 and 35 4 between 36 and 45
619 56 5 82
5 over 45 <NA>
164 0
# # if you need to recode
# # to drop levels from your variable
# # this subsets the data and says that any participant who is coded as 'LEVEL BAD' should be removed
# # if you don't need this for the homework, comment it out (add a # at the beginning of the line)
# d <- subset(d, pet != "bird")
# d <- subset(d, pet != "cat and dog")
# d <- subset(d, pet != "fish")
# d <- subset(d, pet != "multiple types of pet")
# d <- subset(d, pet != "other")
#
# # to combine levels
# # this says that where any participant is coded as 'LEVEL BAD' it should be replaced by 'LEVEL GOOD'
# # you can repeat this as needed, changing 'LEVEL BAD' if you have multiple levels that you want to combine into a single level
# # if you don't need this for the homework, comment it out (add a # at the beginning of the line)
d$age_rc[d$age == "2 between 18 and 25"] <- "between 18 and 45"
d$age_rc[d$age == "3 between 26 and 35"] <- "between 18 and 45"
d$age_rc[d$age == "4 between 36 and 45"] <- "between 18 and 45"
d$age_rc[d$age == "1 under 18"] <- "under 18"
d$age_rc[d$age == "5 over 45"] <- "45 and over"
# # preview your changes and make sure everything is correct
table(d$age_rc, useNA = "always")
45 and over between 18 and 45 under 18 <NA>
164 143 619 0
#
# # check your variable types
str(d)'data.frame': 926 obs. of 8 variables:
$ age : chr "1 under 18" "1 under 18" "4 between 36 and 45" "4 between 36 and 45" ...
$ education: chr "prefer not to say" "2 equivalent to high school completion" "5 undergraduate degree" "6 graduate degree or higher" ...
$ pswq : num 0.851 -1.124 1.163 -0.342 -0.127 ...
$ iou : num 4 1.59 3.37 1.7 1.11 ...
$ phq : num 3.33 1 2.33 1.11 2.33 ...
$ rse : num 1.6 3.9 1.7 3.9 1.8 3.5 3 3.5 2.5 3.4 ...
$ row_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ age_rc : chr "under 18" "under 18" "between 18 and 45" "between 18 and 45" ...
#
# # make sure that your IV is recognized as a factor by R
d$age<- as.factor(d$age)Check homogeneity of variance
# # use the leveneTest() command from the car package to test homogeneity of variance
# # uses the 'formula' setup: formula is y~x1*x2, where y is our DV and x1 is our first IV and x2 is our second IV
# WE WANT THE LEVENES TEST TO BE NOT SIGNIFICANT. SIGNIFICANT IS BAD, WE WANT HOMOGENEUITY, NOT HETEROGENEITY. WE WANT IT TO BE BIGGER THAN .05, if there is not homogeneity the results are unreliable!!!!!!!!
# # for a one-way ANOVA
# # COMMENT THIS SECTION OUR FROM THE HW IF YOU DO NOT USE IT
leveneTest(iou~age, data = d)Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 11.81 2.326e-09 ***
921
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#I do have heterogeneity of varianceCheck for outliers using Cook’s distance and Residuals vs Leverage plot
Run a Regression
# # use this commented out section only if you need to remove outliers.
# # to drop a single outlier, remove the # at the beginning of the line and use this code:
#d <- subset(d, row_id!=c(618))
#
# # to drop multiple outliers, remove the # at the beginning of the line and use this code:
# # d <- subset(d, row_id!=c(1108) & row_id!=c(602))
# # use the lm() command to run the regression
# # formula is y~x1*x2 + c, where y is our DV, x1 is our first IV, x2 is our second IV, and c is our covariate
#
# # for a one-way ANOVA
# # COMMENT THIS SECTION OUR FROM THE HW IF YOU DO NOT USE IT
reg_model <- lm(iou ~ age, data = d) #for one-wayCheck for outliers
# # for a one-way ANOVA
# # COMMENT THIS SECTION OUR FROM THE HW IF YOU DO NOT USE IT
# # Cook's distance THE DISTANCE CUTOFF IS .5, NOT .05!!!!!!! We good
plot(reg_model, 4)#
# # Residuals vs Leverage WE WANT THE RED TO BE AS CLOSE TO THE DOTTED LINE AS POSSIBLE!ALSO THE COOKS DISTANCE SCORE OUTLIERS CAN SOMETIMES POP UP ON THHIS GRAPH TOO. LOOK FOR WOBBLE/ANY PARTICPANTS OUTSIDE THE DASHED LINES FOR COOKS DISTANCE. We good
plot(reg_model, 5)Check Your Variables
# # we'll use the describeBy() command to view skew and kurtosis across our IVs and make sure the DV is normally distributed across all of the levels. REMEMBER SKEW AND KURTOSIS SHOULD BE BETWEEN -2 AND +2.
describeBy(d$iou, group = d$age_rc)
Descriptive statistics by group
group: 45 and over
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 164 2.05 0.65 1.93 1.98 0.55 1.11 4.78 3.67 1.14 1.64 0.05
------------------------------------------------------------
group: between 18 and 45
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 143 2.59 0.84 2.48 2.56 1.04 1.04 4.44 3.41 0.3 -0.91 0.07
------------------------------------------------------------
group: under 18
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 619 2.66 0.92 2.52 2.62 1.04 1.04 4.89 3.85 0.36 -0.77 0.04
Issues with My Data - PART OF YOUR WRITEUP
To begin, I combined the “between 18 and 25”, “between 26 and 35”, and “between 36 and 45” into one category of “between 18 and 45”. I found that I did not have homogeneity of variance, as p < .05. This means the results could potentially be less reliable. There were no outliers above the .5 cutoff. Skew and kurtosis fell within the normal range.
Run an ANOVA
# # for a one-way ANOVA
# # COMMENT THIS SECTION OUR FROM THE HW IF YOU DO NOT USE IT
aov_model <- aov_ez(data = d,
id = "row_id",
between = c("age_rc"),
dv = "iou",
anova_table = list(es = "pes"))Converting to factor: age_rc
Contrasts set to contr.sum for the following variables: age_rc
View Output
Effect size cutoffs from Cohen (1988):
- ηp2 = 0.01 indicates a small effect
- ηp2 = 0.06 indicates a medium effect
- ηp2 = 0.14 indicates a large effect
# # for a one-way ANOVA. WITH ANOVA, THERES TWO DF, A WITHIN AND BETWEEN DF!!!!
# # COMMENT THIS SECTION OUR FROM THE HW IF YOU DO NOT USE IT
nice(aov_model)Anova Table (Type 3 tests)
Response: iou
Effect df MSE F pes p.value
1 age_rc 2, 923 0.75 32.22 *** .065 <.001
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
# Visualize Results
# # for a one-way ANOVA
# # COMMENT THIS SECTION OUR FROM THE HW IF YOU DO NOT USE IT
afex_plot(aov_model, x = "age_rc")# Run Posthoc Tests
Only run posthocs if the test is significant! E.g., only run the posthoc tests on gender if there is a main effect for gender.
# # for a one-way ANOVA
# # COMMENT THIS SECTION OUR FROM THE HW IF YOU DO NOT USE IT
emmeans(aov_model, specs="age_rc", adjust="tukey")Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
age_rc emmean SE df lower.CL upper.CL
45 and over 2.05 0.0675 923 1.89 2.21
between 18 and 45 2.59 0.0723 923 2.42 2.77
under 18 2.66 0.0348 923 2.58 2.74
Confidence level used: 0.95
Conf-level adjustment: sidak method for 3 estimates
pairs(emmeans(aov_model, specs="age_rc", adjust="tukey")) contrast estimate SE df t.ratio p.value
45 and over - between 18 and 45 -0.5423 0.0990 923 -5.480 <.0001
45 and over - under 18 -0.6068 0.0760 923 -7.988 <.0001
between 18 and 45 - under 18 -0.0645 0.0802 923 -0.804 0.7006
P value adjustment: tukey method for comparing a family of 3 estimates
Write Up Results
One-Way ANOVA
Write up your results. Again, make sure to maintain an appropriate tone, and follow APA guidelines for reporting statistical results. I recommend following the below outline:
- Briefly restate your hypothesis
- Describe any issues with your data (you can copy/paste from above, just make sure everything flows).
- Report your results. Make them a new paragraph. Make sure to include your F-value, degrees of freedom, p-value, and effect size. Since we are showing our means and standard deviations for the levels of our IV in our plot, you do NOT have to report them in the text (normally, you would report it like this: (M = #.##, SD = .##), repeating for each level).
- Specify where the differences occur. For instance, one level of the IV may be significantly different than the other two, or there may be multiple differences.
- Describe your interaction, if you ran a two-way ANOVA and there is one.
- Interpret your effect size (trivial, small, medium, or large) and include the citation.
- Make sure to include a reference to Figure 1 (created using the code below)
I predicted that under age 18 group will report the highest scores on the intolerance of uncertainty scale than the 18-45 group and the 45 and over group. To begin, I combined the “between 18 and 25”, “between 26 and 35”, and “between 36 and 45” into one category of “between 18 and 45”. I found that I did not have homogeneity of variance, as p < .05. This means the results could potentially be less reliable. There were no outliers above the .5 cutoff. Skew and kurtosis fell within the normal range.
F(2,922) = 32.23, p <.001, ηp2 < .065
Our overall model is significant, and post hoc tests showed significant effects in the “45 and over” and “between 18 and 45”, p < .001, and a significant effect of “45 and over” and “under 18”, p < .001. There was a medium effect size of ηp2 = 0.065. See figure 1
References
Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.