library(psych) # for the describe() command
library(ggplot2) # to visualize our results
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(expss) # for the cross_cases() command
## Loading required package: maditr
##
## To select rows from data: rows(mtcars, am==0)
##
## Attaching package: 'maditr'
## The following object is masked from 'package:base':
##
## sort_by
##
## Attaching package: 'expss'
## The following object is masked from 'package:ggplot2':
##
## vars
library(car) # for the leveneTest() command
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:expss':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
library(afex) # to run the ANOVA and plot results
## Loading required package: lme4
## Loading required package: Matrix
##
## Attaching package: 'lme4'
## The following object is masked from 'package:expss':
##
## dummy
## ************
## Welcome to afex. For support visit: http://afex.singmann.science/
## - Functions for ANOVAs: aov_car(), aov_ez(), and aov_4()
## - Methods for calculating p-values with mixed(): 'S', 'KR', 'LRT', and 'PB'
## - 'afex_aov' and 'mixed' objects can be passed to emmeans() for follow-up tests
## - Get and set global package options with: afex_options()
## - Set sum-to-zero contrasts globally: set_sum_contrasts()
## - For example analyses see: browseVignettes("afex")
## ************
##
## Attaching package: 'afex'
## The following object is masked from 'package:lme4':
##
## lmer
library(emmeans) # for posthoc tests
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use ARC data
d <- read.csv(file="Data/mydata.csv", header=T)
# new code! this adds a column with a number for each row. it makes it easier when we drop outliers later
d$row_id <- 1:nrow(d)
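The row numbers in this new column are what the diagnostic plots below report, so if a plot flags a case you can pull up that row by its ID. A minimal sketch (1108 is just a placeholder, the same example row number reused later in this document):

```r
# look up a flagged case by its row_id (1108 is a placeholder example)
d[d$row_id == 1108, ]
```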
Note: You can choose to run either a one-way ANOVA (a single IV with more than 2 levels) or a two-way/factorial ANOVA (at least two IVs) for the homework. You will need to specify your hypothesis and customize your code based on the choice you make. I will run both versions of the test here for illustrative purposes.
One-Way: We predict that there will be a significant effect of gender on maturity, as measured by the Markers of Adulthood scale.
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 3133 obs. of 7 variables:
## $ gender : chr "f" "m" "m" "f" ...
## $ sibling : chr "at least one sibling" "at least one sibling" "at least one sibling" "at least one sibling" ...
## $ moa_maturity: num 3.67 3.33 3.67 3 3.67 ...
## $ support : num 6 6.75 5.17 5.58 6 ...
## $ socmeduse : int 47 23 34 35 37 13 37 43 37 29 ...
## $ stress : num 3.3 3.3 4 3.2 3.1 3.5 3.3 2.4 2.9 2.7 ...
## $ row_id : int 1 2 3 4 5 6 7 8 9 10 ...
# make our categorical variables factors
d$row_id <- as.factor(d$row_id) #we'll actually use our ID variable for this analysis, so make sure it's coded as a factor
d$gender <- as.factor(d$gender)
d$sibling <- as.factor(d$sibling)
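If you want to confirm that the conversion worked, you can check which levels R found for each factor. A quick sketch:

```r
# confirm the factor conversion and list the levels R detected
levels(d$gender)   # expect "f", "m", "nb"
levels(d$sibling)  # expect "at least one sibling", "only child"
```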
# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(d$moa_maturity)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3133 3.59 0.43 3.67 3.65 0.49 1 4 3 -1.2 1.87 0.01
# we'll use the describeBy() command to view skew and kurtosis across our IVs
describeBy(d$moa_maturity, group = d$gender)
##
## Descriptive statistics by group
## group: f
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2298 3.61 0.42 3.67 3.67 0.49 1 4 3 -1.21 1.91 0.01
## ------------------------------------------------------------
## group: m
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 781 3.54 0.46 3.67 3.6 0.49 1.33 4 2.67 -1.16 1.7 0.02
## ------------------------------------------------------------
## group: nb
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 54 3.5 0.4 3.33 3.53 0.49 2.33 4 1.67 -0.5 -0.13 0.05
describeBy(d$moa_maturity, group = d$sibling)
##
## Descriptive statistics by group
## group: at least one sibling
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2832 3.6 0.43 3.67 3.66 0.49 1 4 3 -1.22 1.92 0.01
## ------------------------------------------------------------
## group: only child
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 301 3.55 0.44 3.67 3.6 0.49 1.67 4 2.33 -1.02 1.44 0.03
# also use histograms to examine your continuous variable
hist(d$moa_maturity)
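The base hist() call is all you need here, but since ggplot2 is already loaded you could draw the same histogram with it instead. A minimal sketch (the binwidth is an arbitrary choice):

```r
# optional ggplot2 version of the same histogram
ggplot(d, aes(x = moa_maturity)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "MoA maturity score", y = "Count")
```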
# and cross_cases() to examine your categorical variables
cross_cases(d, gender, sibling)
| gender | sibling: at least one sibling | sibling: only child |
|---|---|---|
| f | 2083 | 215 |
| m | 700 | 81 |
| nb | 49 | 5 |
| #Total cases | 2832 | 301 |
table(d$gender)
##
## f m nb
## 2298 781 54
# use the leveneTest() command from the car package to test homogeneity of variance
# uses the 'formula' setup: formula is y~x1*x2, where y is our DV and x1 is our first IV and x2 is our second IV
leveneTest(moa_maturity~gender, data = d)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)
## group    2  3.7325 0.02404 *
##       3130
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
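If you are running the factorial version, the same command just takes the interaction formula so variances are checked across every cell. A sketch, assuming sibling as the second IV:

```r
# Levene's test for the two-way design: variance across all gender x sibling cells
leveneTest(moa_maturity ~ gender*sibling, data = d)
```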
# use this commented-out section only if you need to remove outliers
# to drop a single outlier, remove the # at the beginning of the line and use this code:
# d <- subset(d, row_id != 1108)
# to drop multiple outliers, remove the # at the beginning of the line and use this code:
# d <- subset(d, !(row_id %in% c(1108, 602, 220)))
# use the lm() command to run the regression
# formula is y~x1*x2 + c, where y is our DV, x1 is our first IV, x2 is our second IV, and c is our covariate
reg_model <- lm(moa_maturity ~ gender, data = d) #for one-way
# Cook's distance
plot(reg_model, 4)
# Residuals vs Leverage
plot(reg_model, 5)
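The plots above only label the most extreme cases. If you want a full list of the rows being flagged, one common (and somewhat arbitrary) rule of thumb is to flag any case with a Cook's distance above 4/n. A sketch, assuming that cutoff:

```r
# Cook's distance for every case used in reg_model
cooks <- cooks.distance(reg_model)
# names of flagged cases are the data-frame row names, which match row_id here
# because row_id was created as 1:nrow(d)
names(which(cooks > 4 / length(cooks)))
```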
Our cell sizes for gender are unbalanced. A small sample size for one of the levels of our variable limits our power and increases our Type II error rate.
Levene’s test is significant for our three-level gender variable. We are ignoring this and continuing with the analysis anyway, but in the real world this is something we would have to correct for.
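For reference, one common correction when Levene's test is significant is Welch's one-way ANOVA, which does not assume equal variances. A minimal sketch using base R's oneway.test() (not part of the graded workflow here):

```r
# Welch's one-way ANOVA: robust to unequal variances across groups
oneway.test(moa_maturity ~ gender, data = d, var.equal = FALSE)
```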
aov_model <- aov_ez(data = d,
                    id = "row_id",
                    between = c("gender"),
                    dv = "moa_maturity",
                    anova_table = list(es = "pes"))
## Contrasts set to contr.sum for the following variables: gender
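For the two-way/factorial version, the only change is listing both IVs in the between argument; aov_ez() includes the interaction automatically. A sketch, assuming sibling as the second IV:

```r
# two-way/factorial version: gender and sibling as between-subjects IVs
aov_model2 <- aov_ez(data = d,
                     id = "row_id",
                     between = c("gender", "sibling"),
                     dv = "moa_maturity",
                     anova_table = list(es = "pes"))
nice(aov_model2)
```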
Effect size cutoffs for partial eta-squared from Cohen (1988): .01 = small, .06 = medium, .14 = large.
nice(aov_model)
## Anova Table (Type 3 tests)
##
## Response: moa_maturity
## Effect df MSE F pes p.value
## 1 gender 2, 3130 0.19 8.47 *** .005 <.001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
afex_plot(aov_model, x = "gender")
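afex_plot() returns a ggplot2 object, so you can add ggplot2 layers if you want to adjust the labels or theme. A minimal sketch:

```r
# afex_plot() output is a ggplot2 object, so ggplot2 layers can be added
afex_plot(aov_model, x = "gender") +
  labs(x = "Gender", y = "MoA maturity score") +
  theme_minimal()
```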
Only run posthocs if the test is significant! E.g., only run the posthoc tests on gender if there is a main effect for gender.
emmeans(aov_model, specs="gender", adjust="tukey")
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## gender emmean SE df lower.CL upper.CL
## f 3.610 0.008977 3130 3.589 3.632
## m 3.542 0.015399 3130 3.506 3.579
## nb 3.500 0.058564 3130 3.360 3.640
##
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 3 estimates
pairs(emmeans(aov_model, specs="gender", adjust="tukey"))
## contrast estimate SE df t.ratio p.value
## f - m 0.0678 0.0178 3130 3.802 0.0004
## f - nb 0.1102 0.0592 3130 1.861 0.1504
## m - nb 0.0425 0.0606 3130 0.701 0.7627
##
## P value adjustment: tukey method for comparing a family of 3 estimates
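If you also want confidence intervals around the pairwise differences, you can wrap the same call in confint(). A minimal sketch:

```r
# Tukey-adjusted confidence intervals for the pairwise differences
confint(pairs(emmeans(aov_model, specs = "gender"), adjust = "tukey"))
```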
To test our hypothesis that there would be a significant effect of gender on maturity, we used a one-way ANOVA. Our data were unbalanced, with many more women participating in our survey (n = 2298) than men (n = 781) or non-binary and other gender participants (n = 54). This reduces the power of our test and increases the chance of a Type II error. A significant Levene’s test (p = .024) also indicates that our data violate the assumption of homogeneity of variance, which increases the chance of a Type I error. We continued with our analysis for the purposes of this class.
We found a significant effect of gender, F(2, 3130) = 8.47, p < .001, ηp² = .005 (a small effect size; Cohen, 1988). Posthoc tests using the Tukey adjustment revealed that women reported significantly higher maturity than men (p < .001); the comparisons between non-binary and other gender participants and either women (p = .150) or men (p = .763) were not significant (see Figure 1 for a comparison).
References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York, NY: Routledge Academic.