library(psych) # for the describe() command
library(ggplot2) # to visualize our results
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(expss) # for the cross_cases() command
## Loading required package: maditr
##
## To select columns from data: columns(mtcars, mpg, vs:carb)
##
## Use 'expss_output_rnotebook()' to display tables inside R Notebooks.
## To return to the console output, use 'expss_output_default()'.
##
## Attaching package: 'expss'
## The following object is masked from 'package:ggplot2':
##
## vars
library(car) # for the leveneTest() command
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:expss':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
library(afex) # to run the ANOVA and plot results
## Loading required package: lme4
## Loading required package: Matrix
##
## Attaching package: 'lme4'
## The following object is masked from 'package:expss':
##
## dummy
## ************
## Welcome to afex. For support visit: http://afex.singmann.science/
## - Functions for ANOVAs: aov_car(), aov_ez(), and aov_4()
## - Methods for calculating p-values with mixed(): 'S', 'KR', 'LRT', and 'PB'
## - 'afex_aov' and 'mixed' objects can be passed to emmeans() for follow-up tests
## - Get and set global package options with: afex_options()
## - Set sum-to-zero contrasts globally: set_sum_contrasts()
## - For example analyses see: browseVignettes("afex")
## ************
##
## Attaching package: 'afex'
## The following object is masked from 'package:lme4':
##
## lmer
library(emmeans) # for posthoc tests
# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use ARC data
d <- read.csv(file="Data/arcdata_final.csv", header=T)
# for the HW, you may or may not need to use the code below this comment
# check to see if you have a variable called 'X' or 'ResponseId' in your data
names(d)
## [1] "X" "trans" "ethnicity" "rse" "gad" "support"
## [7] "mfq_26"
Note: You can chose to run either a one-way ANOVA (a single IV with more than 3 levels) or a two-way/factorial ANOVA (at least two IVs) for the homework. You will need to specify your hypothesis and customize your code based on the choice you make. I will run both versions of the test here for illustrative purposes.
I predict that there will be a significant effect of ethnicity on self-esteem, as measured by the Rosenberg’s Self-Esteem Scale (RSE).
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 1210 obs. of 7 variables:
## $ X : int 1 20 30 31 33 49 57 68 81 86 ...
## $ trans : chr "no" "no" "no" "no" ...
## $ ethnicity: chr "White - British, Irish, other" "White - British, Irish, other" "White - British, Irish, other" "White - British, Irish, other" ...
## $ rse : num 2.3 1.6 3.9 1.7 3.9 2.4 1.8 1.3 3.5 2.6 ...
## $ gad : num 1.86 3.86 1.14 2 1.43 ...
## $ support : num 2.5 2.17 5 2.5 3.67 ...
## $ mfq_26 : num 4.2 3.35 4.65 4.65 4.5 4.3 5.25 5 4.7 4.05 ...
# make our categorical variables factors
d$X <- as.factor(d$X) #we'll actually use our ID variable for this analysis, so make sure it's coded as a factor
d$ethnicity <- as.factor(d$ethnicity)
# check your categorical variables
table(d$ethnicity, useNA = "always")
##
## Asian/Asian British - Indian, Pakistani, Bangladeshi, other
## 123
## Black/Black British - Caribbean, African, other
## 24
## Chinese/Chinese British
## 10
## Middle Eastern/Middle Eastern British - Arab, Turkish, other
## 12
## Mixed race - other
## 37
## Mixed race - White and Black/Black British
## 21
## Other ethnic group
## 10
## Prefer not to say
## 24
## White - British, Irish, other
## 949
## <NA>
## 0
# also use histograms to examine your continuous variable
hist(d$rse)
Assumptions checked below:
If you have confirmed everything else…
# check your categorical variables and make sure they have decent cell sizes
# they should have at least 5 participants in each cell
# but larger numbers are always better
table(d$ethnicity)
##
## Asian/Asian British - Indian, Pakistani, Bangladeshi, other
## 123
## Black/Black British - Caribbean, African, other
## 24
## Chinese/Chinese British
## 10
## Middle Eastern/Middle Eastern British - Arab, Turkish, other
## 12
## Mixed race - other
## 37
## Mixed race - White and Black/Black British
## 21
## Other ethnic group
## 10
## Prefer not to say
## 24
## White - British, Irish, other
## 949
cross_cases(d, ethnicity)
|  #Total | |
|---|---|
|  ethnicity | |
|    Asian/Asian British - Indian, Pakistani, Bangladeshi, other | 123 |
|    Black/Black British - Caribbean, African, other | 24 |
|    Chinese/Chinese British | 10 |
|    Middle Eastern/Middle Eastern British - Arab, Turkish, other | 12 |
|    Mixed race - other | 37 |
|    Mixed race - White and Black/Black British | 21 |
|    Other ethnic group | 10 |
|    Prefer not to say | 24 |
|    White - British, Irish, other | 949 |
|    #Total cases | 1210 |
d$other[d$ethnicity == "Other ethnic group"] <- "other"
d$other[d$ethnicity == "Prefer not to say"] <- NA
d$other[d$ethnicity == "Mixed race - other"] <- "other"
d$other[d$ethnicity == "Mixed race - White and Black/Black British"] <- "other"
d$other[d$ethnicity == "White - British, Irish, other"] <- "White"
d$other[d$ethnicity == "Asain/Asain British - Indian, Pakistani, Bangladeshi, other"] <- "Asain"
d$other[d$ethnicity == "Middle Eastern/Middle Eastern British - Arab, Turkish, other"] <- "Middle Eastern"
d$other[d$ethnicity == "Black/Black British - Caribbean, African, other"] <- "Black"
d$other[d$ethnicity == "Chinese/Chinese British"] <- "Chinese"
table(d$other)
##
## Black Chinese Middle Eastern other White
## 24 10 12 68 949
d$other <- as.factor(d$other)
# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(d$rse)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1210 2.63 0.72 2.7 2.64 0.74 1 4 3 -0.22 -0.72 0.02
# we'll use the describeBy() command to view skew and kurtosis across our IVs
describeBy(d$rse, group = d$other)
##
## Descriptive statistics by group
## group: Black
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 24 2.56 0.75 2.6 2.56 0.96 1.1 4 2.9 0.04 -1.03 0.15
## ------------------------------------------------------------
## group: Chinese
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 2.66 0.55 3 2.69 0.22 1.9 3.2 1.3 -0.33 -1.98 0.18
## ------------------------------------------------------------
## group: Middle Eastern
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 2.79 0.79 2.9 2.8 0.67 1.5 4 2.5 -0.06 -1.28 0.23
## ------------------------------------------------------------
## group: other
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 68 2.57 0.65 2.6 2.57 0.59 1.2 4 2.8 -0.05 -0.61 0.08
## ------------------------------------------------------------
## group: White
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 949 2.64 0.73 2.7 2.66 0.74 1 4 3 -0.24 -0.74 0.02
# use the leveneTest() command from the car package to test homogeneity of variance
# uses the 'formula' setup: formula is y~x1*x2, where y is our DV and x1 is our first IV and x2 is our second IV
leveneTest(rse~other, data = d)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.8838 0.473
## 1058
# use the lm() command to run the regression
# formula is y~x1*x2, where y is our DV, x1 is our first IV and x2 is our second IV
reg_model <- lm(rse~other, data = d) #for one-way
# Cook's distance
plot(reg_model, 4)
# Residuals vs Leverage
plot(reg_model, 5)
My cell sizes are very unbalanced. A small sample size for three of the levels of my variable limits my power and increases my Type II error rate.
Levene’s test is not significant for my five-level other variable, showing homogeneity of variance.
After running Cook’s Distance, there seems to be an observation of one outlier above the 0.5 cutoff. I am ignoring this and continuing with the analysis anyway, but in the real world this is something I would have to correct for.
aov_model <- aov_ez(data = d,
id = "X",
between = c("other"),
dv = "rse",
anova_table = list(es = "pes"))
## Warning: Missing values for 147 ID(s), which were removed before analysis:
## 119, 140, 490, 520, 580, 824, 851, 1065, 1180, 1224, ... [showing first 10 only]
## Below the first few rows (in wide format) of the removed cases with missing data.
## X other .
## # 14 119 <NA> 3.4
## # 20 140 <NA> 3.6
## # 66 490 <NA> 2.9
## # 67 520 <NA> 2.6
## # 75 580 <NA> 2.7
## # 100 824 <NA> 3.2
## Contrasts set to contr.sum for the following variables: other
Effect size cutoffs from Cohen (1988):
nice(aov_model)
## Anova Table (Type 3 tests)
##
## Response: rse
## Effect df MSE F pes p.value
## 1 other 4, 1058 0.52 0.35 .001 .846
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
afex_plot(aov_model, x = "other")
Only run posthocs if the test is significant! E.g., only run the posthoc tests on gender if there is a main effect for gender.
To test my hypothesis that there would be a significant effect of ethnicity on self-esteem, I used a one-way ANOVA. My data was unbalanced, with many more White participants participating in my survey (n = 949), than Black participants (n = 24), Chinese participants (n = 10), Middle Eastern participants (n = 12), and Others (n = 68). This significantly reduces the power of my test and increases the chances of a Type II error. I did not identify any outliers following visual analysis of a Residuals vs Leverage plot, however I did observe an outlier that was above the threshold on 0.5 utilizng Cook’s Distance. An insignificant Levene’s test (p = .473) indicates that my data follows the assumption of homogeneity of variance. I continued with my analysis for the purpose of this class.
I found a very small effect of ethnicity on self-esteem, F(4,1058) = 0.35, p .846, ηp2 = .001 (small effect size; Cohen, 1988). Since there was no main effect found for ethnicity on self-esteem, Posthoc tests using Tukey’s HSD was not ran.
```
References
Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.