All weekly notes and codes are licensed under CC-By-SA 4.0 unless otherwise stated. For other resources redirected from this page, please refer to the authorsâ website for licensing information.
This is an R Markdown document. You may download the codes for this html document or read the notes on this site. Markdown is a simple formatting syntax (coding language) for authoring HTML, PDF, and MS Word documents. In a R Markdown document, you will have text section (like the textbox we used in the google colab document) and code chunk (like the code box we used in google colab). The text section is where we use the Markdown language. Markdown language is a very straightforward and simple coding language. For basic knowledge of Markdown, see Basic Syntax in the Markdown Guide.
For more details on using R Markdown see the R Studio R Markdown resource. When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
By now, you should be very familiar with the common packages that we use in psychology data analyses. You can just call your packages upfront so that you don’t have to do that later.
If you have never install these packages in your computer, use install.packages() first. (the statement eval=FALSE is to ask R not to run this when it knits the document - if you let the run during knitting, it will take you a long time to knit every time). If you have already installed these packages, skip to the next codes (i.e., library()).
install.packages("Hmisc")
install.packages("tidyverse") #contains a bunch of important packages
install.packages("gtsumamry")
As you see, we always start and end the code chunk using three symbols `. When you open the code chunk, you need to tell R that your code is a “r” code. Then, you may add statements to control what shows or not show when you knit your final RMarkdown document. For example,
eval=False means that you do not run the code when you knit.results=FALSE means that you do not want to output being included in your RMarkdown document.message=FALSE means that you do not want to print any messages (e.g., warnings).echo=FALSE means that you do not want to print the codes (the input) when you knit your document.include=FALSE means that you want to hide both input and output.Calling a package is not relevant to our main analyses, so I have decided to only print the codes (so people know what packages I have used) but not the messages (readers do not need to see what has been loaded or masked, etc.!). Try deleting “message=FALSE” in the code file and you will see a long list of unimportant outputs being printed!
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.6.3
## Loading required package: lattice
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.6.3
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.1
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v tibble 3.1.3 v purrr 0.3.4
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x psych::%+%() masks ggplot2::%+%()
## x psych::alpha() masks ggplot2::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::src() masks Hmisc::src()
## x dplyr::summarize() masks Hmisc::summarize()
library(gt) #for drawing tables
## Warning: package 'gt' was built under R version 3.6.3
##
## Attaching package: 'gt'
## The following object is masked from 'package:Hmisc':
##
## html
library(gtsummary) #for drawing tables
library(stargazer) #for drawing tables
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(ggplot2) #for plotting graphs
Also, since we have already discussed how to install packages and call packages multiple times, starting from here, I will only list the library() function in our notes and codes. If you do not have them, install it first using install.packages().
We will continue to use the World Value Survey - Wave 6 Algeria data for examples in this week. Last week, we discussed how to recode values into missing for a specific variable. However, the world value survey data have a uniform way of setting missing data, that is, any values in the raw variables that are negative are missing. In this case, we can set NA for all variables in the beginning.
data <- read.csv("https://raw.githubusercontent.com/manyu26/PsychLearnR/master/psychlearnr_data.csv")
data[data<0]<-NA
In Week 2 notes, we have created subscales score for the schwartz scale. (Note that they do not have good reliability, but we are using them to demonstrate our examples.)
The codes are repeated here.
caretake <- dplyr::select(data, V74B,V78,V79)
caretake[sapply(caretake, function(x) x %in% "-2")] <- NA
data$caretake_mean<-rowMeans(caretake)
security<-dplyr::select(data,V72,V77)
security[sapply(security,function(x) x %in% "-2")] <-NA
data$security_mean<-rowMeans(security)
excite <- dplyr::select(data, V70, V71, V73, V75, V76)
excite[sapply(excite, function(x) x %in% "-2")] <- NA
data$excite_mean<-rowMeans(excite)
Additionally, the World Value Survey calculated two value scores
demo <- dplyr::select(data, V57, V240, V229)
tbl_summary(demo) #gtsummary package
| Characteristic | N = 1,2001 |
|---|---|
| V57 | |
| 1 | 562 (47%) |
| 3 | 24 (2.0%) |
| 4 | 14 (1.2%) |
| 5 | 38 (3.2%) |
| 6 | 562 (47%) |
| V240 | |
| 1 | 608 (51%) |
| 2 | 592 (49%) |
| V229 | |
| 1 | 236 (20%) |
| 2 | 139 (12%) |
| 3 | 108 (9.0%) |
| 4 | 78 (6.5%) |
| 5 | 269 (22%) |
| 6 | 206 (17%) |
| 7 | 158 (13%) |
| 8 | 6 (0.5%) |
|
1
n (%)
|
|
You can add label to the table subtitles.
demo <- dplyr::select(data, V57, V240, V229)
tbl_summary(demo,
label=(list(V57="Relationship Status",V240="sex",V229="Perceived Social Status"))
)
| Characteristic | N = 1,2001 |
|---|---|
| Relationship Status | |
| 1 | 562 (47%) |
| 3 | 24 (2.0%) |
| 4 | 14 (1.2%) |
| 5 | 38 (3.2%) |
| 6 | 562 (47%) |
| sex | |
| 1 | 608 (51%) |
| 2 | 592 (49%) |
| Perceived Social Status | |
| 1 | 236 (20%) |
| 2 | 139 (12%) |
| 3 | 108 (9.0%) |
| 4 | 78 (6.5%) |
| 5 | 269 (22%) |
| 6 | 206 (17%) |
| 7 | 158 (13%) |
| 8 | 6 (0.5%) |
|
1
n (%)
|
|
There are tons of other good packages and functions for tables. Depending on your need, you may end up choosing different packages for different tables.
stargazer(demo) #default type is latex
##
## % Table created by stargazer v.5.2.2 by Marek Hlavac, Harvard University. E-mail: hlavac at fas.harvard.edu
## % Date and time: Wed, Jul 28, 2021 - 4:27:49 PM
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lccccccc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## Statistic & \multicolumn{1}{c}{N} & \multicolumn{1}{c}{Mean} & \multicolumn{1}{c}{St. Dev.} & \multicolumn{1}{c}{Min} & \multicolumn{1}{c}{Pctl(25)} & \multicolumn{1}{c}{Pctl(75)} & \multicolumn{1}{c}{Max} \\
## \hline \\[-1.8ex]
## V57 & 1,200 & 3.543 & 2.436 & 1 & 1 & 6 & 6 \\
## V240 & 1,200 & 1.493 & 0.500 & 1 & 1 & 2 & 2 \\
## V229 & 1,200 & 4.071 & 2.121 & 1 & 2 & 6 & 8 \\
## \hline \\[-1.8ex]
## \end{tabular}
## \end{table}
stargazer(demo, type="html")
##
## <table style="text-align:center"><tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Pctl(25)</td><td>Pctl(75)</td><td>Max</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">V57</td><td>1,200</td><td>3.543</td><td>2.436</td><td>1</td><td>1</td><td>6</td><td>6</td></tr>
## <tr><td style="text-align:left">V240</td><td>1,200</td><td>1.493</td><td>0.500</td><td>1</td><td>1</td><td>2</td><td>2</td></tr>
## <tr><td style="text-align:left">V229</td><td>1,200</td><td>4.071</td><td>2.121</td><td>1</td><td>2</td><td>6</td><td>8</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr></table>
stargazer(demo, type="text")
##
## ========================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## --------------------------------------------------------
## V57 1,200 3.543 2.436 1 1 6 6
## V240 1,200 1.493 0.500 1 1 2 2
## V229 1,200 4.071 2.121 1 2 6 8
## --------------------------------------------------------
When you have two categorical variables, you will likely be using chi-square test of independence. In R, the easiest way is to create a frequency table of the two variables you want to examine using the table() function. Then, run the chi-square test on the table.
Let’s look at a new variable, relationship status (V57). Relationship status is coded as married (1), living together as married (2), divorced (3), separated (4), widowed (5), and single (6). Does relationship status differ by sex (V240 - 1 = Male; 2 = Female)? Let’s call this table “table_sexmarried”.
table_sexmarried<- table(data$V57, data$V240)
table_sexmarried #print the raw table
##
## 1 2
## 1 289 273
## 3 7 17
## 4 6 8
## 5 15 23
## 6 291 271
chisq.test(table_sexmarried)
##
## Pearson's Chi-squared test
##
## data: table_sexmarried
## X-squared = 7.0918, df = 4, p-value = 0.1311
As shown in the output above, the null hypothesis “sex and relationship status are independent” is not rejected. Sex and relationship status are not related. From the table, we can also observe that the distribution of relationship status across male and female is very similar. Most participants were either married or single.
Using R to draw tables and figures have several advantages. First, you can avoid errors when moving numbers from R to your text-editing document. Second, once you have the codes developed for a specific type of table, you only need to re-run the codes with any new analysis to get a new table. It is time-saving in the long run.
However, there are many different ways to create table in R. What packages you use depend on what analyses you are doing. We will introduce the package gt here. To build a table, you first need to transfer your results (table output from another function) to a data frame.
table_sexmarried<-as.data.frame.matrix(table_sexmarried) #coerce your table to data frame
table_sexmarried
## 1 2
## 1 289 273
## 3 7 17
## 4 6 8
## 5 15 23
## 6 291 271
As you see above, our column names and row names are numeric (i.e., not properly named). Let’s take a moment to name the columns and rows.
colnames(table_sexmarried)<- c("Male", "Female")
rownames(table_sexmarried)<- c("married" , "divorced", "separated ", "widowed", "single")
table_sexmarried
## Male Female
## married 289 273
## divorced 7 17
## separated 6 8
## widowed 15 23
## single 291 271
Now we see the table we want. We can use the gt function in the gt package to build a table. See Introduction to Creating gt Tables for the structure of gt.
The very basic format of gt table is to simply use gt(your_table_in_df).
gt_sexmarried<- gt(table_sexmarried)
gt_sexmarried
| Male | Female |
|---|---|
| 289 | 273 |
| 7 | 17 |
| 6 | 8 |
| 15 | 23 |
| 291 | 271 |
But as you see, it still lacks components we want. For example, it doesn’t have the row names. To do that, we add to the argument more statemetns to clarify what we need. The infix operator %>% pass whatever is on the left hand-side of the operator to the first arugment of the right hand side of the operator. table_sexmarried %>% gt() is thus equivalent to gt(table_sexmarried). Changing codes from nesting to chaining using %>% is simply a personal choice. However, this practice makes the codes easier to read (left to right instead of inside out). Also, when your codes get really complicated, e.g.,summary(head(data)), chaining helps to see the different functions being applied step-by-step, i.e., data %>% head() %>% summary(). When building plots and tables, this operator is a gem.
Let’s add our rownames to stub. “stub” is the name given to leftmost, non-data column in a table) as defined by the gt table structure.
gt_sexmarried<-
table_sexmarried %>%
gt(rownames_to_stub = TRUE)
gt_sexmarried
| Male | Female | |
|---|---|---|
| married | 289 | 273 |
| divorced | 7 | 17 |
| separated | 6 | 8 |
| widowed | 15 | 23 |
| single | 291 | 271 |
Let’s try adding different things to the table.
gt_sexmarried<-
table_sexmarried %>%
gt(rownames_to_stub = TRUE) %>%
tab_header(title = "Relationship Status by sex") %>%
tab_stubhead(label = "Relationship Status") %>%
tab_source_note(source_note = "Note: Participants were also given the option ''living
togehter as married'' but no participants chose this option" )
gt_sexmarried
| Relationship Status by sex | ||
|---|---|---|
| Relationship Status | Male | Female |
| married | 289 | 273 |
| divorced | 7 | 17 |
| separated | 6 | 8 |
| widowed | 15 | 23 |
| single | 291 | 271 |
| Note: Participants were also given the option ''living togehter as married'' but no participants chose this option | ||
A common assumption across tests that we discuss in this chapter is the test of homoscedasticity, or test of equal variance. Below are codes for the common variance tests, including F-test var.test (for one IV only), Bartlett’s test bartlett.test (for one more more IVs) and Levene’s test leveneTest.
Let’s conduct a equal variance test for testing sex differences (V240) on seucrity subscale of Schwartz values scale (security_mean). To do that, we only need to input the formula within a function of our choice.
First, make sure your IV is set as “factor” variable. It doesn’t matter if you are only running the diagnostic or anova test, but if you are conducting post-hoc and other analyses, you may get an error message if you don’t.
data$V238<-as.factor(data$V238) #social class
data$V240<-as.factor(data$V240) #sex
var.test(data$security_mean~data$V240) #F-test to compare two variances
##
## F test to compare two variances
##
## data: data$security_mean by data$V240
## F = 0.91343, num df = 565, denom df = 556, p-value = 0.284
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7738725 1.0780417
## sample estimates:
## ratio of variances
## 0.9134331
Barlett’s test can handle more than one IVs.
bartlett.test(data$security_mean~interaction(data$V240, data$V238), data) #bartlett test
##
## Bartlett test of homogeneity of variances
##
## data: data$security_mean by interaction(data$V240, data$V238)
## Bartlett's K-squared = 11.453, df = 9, p-value = 0.246
Levine’s test is the most common test of homogeneity. The car package is required.
data$V240<-as.factor(data$V240)
library(car)
## Warning: package 'car' was built under R version 3.6.1
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(data$security_mean~data$V240, data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.2787 0.2584
## 1121
To conduct an independent t-test, we need to input the target formula, that is, to tell R what the independent and dependent variables are. As shown below, we put the dependent variable on the left (i.e. security_mean), follow by “~” and then put the independent variable on the right (i.e. sex, V240). We need to indicate what data frame we are using in this hypothesis testing (i.e. “data”). Finally, the t.test() function allows you to choose whether the t-test assume equal variance or not. Since we found in the last section that the variances across the two groups were equal, we put var.equal=TRUE.
t.test(security_mean~V240, data = data, var.equal=TRUE)
##
## Two Sample t-test
##
## data: security_mean by V240
## t = -0.77612, df = 1121, p-value = 0.4378
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.18124734 0.07850105
## sample estimates:
## mean in group 1 mean in group 2
## 2.234982 2.286355
The default test is two-sided/two-tailed, but users also have options to use one-sided/one-tailed test. For one-tailed test hypothesizing that true difference in means is greater than 0, use the additional statement alternative="greater". To test the hypothesis that true difference in means is less than 0, use the additional statement alternative="less". To adjust the confidence level, use the statement conf.level=. If you forget what you should do, you may look up the document through google or ?t.test
If you are conducting dependent t-test, you just need to list your repeated measures and add paired=TRUE. Something like this: t.test(data$prescore, data$postscore, paired=TRUE)
To conduct ANOVA, simply replace t.test() with the aov() function. However, the function only returns sum of squares and degree of freedom. To obtain other statistics, we need to first assign an object to the function. Let’s name it anova_model. The next step is to use summary() to obtain summary statistics.
Let’s look at whether security subscale differ between people from different social class (V238).
anova <- aov(security_mean~V238, data = data)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## V238 4 12.5 3.127 2.68 0.0305 *
## Residuals 1019 1188.7 1.167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 176 observations deleted due to missingness
You may also request diagnostic plots
plot(anova) #diagnostic plots
If you have a covariate or a blocking variable, you may simply add it to the formula.
anova <- aov(security_mean~V238+caretake_mean, data = data)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## V238 4 14.2 3.55 4.04 0.00296 **
## caretake_mean 1 268.0 267.95 305.02 < 2e-16 ***
## Residuals 983 863.5 0.88
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 211 observations deleted due to missingness
To conduct post-hoc test, we need to examine the pairwise differences. The function pairwise.t.test() can be used when comparing pairs with no adjustment, bonferroni correction or Holmâ“Bonferroni correction method.
pairwise.t.test(data$security_mean, data$V238, p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$security_mean and data$V238
##
## 1 2 3 4
## 2 0.0147 - - -
## 3 0.0042 0.3092 - -
## 4 0.0026 0.1691 0.6536 -
## 5 0.0043 0.2725 0.6796 0.9277
##
## P value adjustment method: none
pairwise.t.test(data$security_mean, data$V238, p.adj = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$security_mean and data$V238
##
## 1 2 3 4
## 2 0.147 - - -
## 3 0.042 1.000 - -
## 4 0.026 1.000 1.000 -
## 5 0.043 1.000 1.000 1.000
##
## P value adjustment method: bonferroni
pairwise.t.test(data$security_mean, data$V238, p.adj = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$security_mean and data$V238
##
## 1 2 3 4
## 2 0.103 - - -
## 3 0.038 1.000 - -
## 4 0.026 1.000 1.000 -
## 5 0.038 1.000 1.000 1.000
##
## P value adjustment method: holm
To obtain mean for each groups:
aggregate(data$security_mean, by=list(data$V238), FUN=mean, na.rm=TRUE)
## Group.1 x
## 1 1 1.673077
## 2 2 2.214545
## 3 3 2.302817
## 4 4 2.342007
## 5 5 2.353535
To obtain adjusted means, emmeans is a great package.
If Tukey’s HSD test is desired, then a different function TukeyHSD needs to be used. If you don’t have any covariates in your model (i.e., non-factor), you can simply write TukeyHSD(model_name). But since we have a covariate here, we need to tell the function which variable in the formula we are doing the posthoc.
TukeyHSD(anova, "V238", ordered=T) #if true, all differences will be positive
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## caretake_mean
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## factor levels have been ordered
##
## Fit: aov(formula = security_mean ~ V238 + caretake_mean, data = data)
##
## $V238
## diff lwr upr p adj
## 2-1 0.53904429 0.01255270 1.0655359 0.0417502
## 3-1 0.61654844 0.09573273 1.1373641 0.0109895
## 5-1 0.67429150 0.10736693 1.2412161 0.0104503
## 4-1 0.71408261 0.18694858 1.2412166 0.0021004
## 3-2 0.07750415 -0.13168299 0.2866913 0.8496721
## 5-2 0.13524721 -0.17120625 0.4417007 0.7478834
## 4-2 0.17503832 -0.04941778 0.3994944 0.2076441
## 5-3 0.05774306 -0.23885320 0.3543393 0.9840460
## 4-3 0.09753417 -0.11326468 0.3083330 0.7131329
## 4-5 0.03979111 -0.26776477 0.3473470 0.9966530
There are multiple ways to obtain effect sizes. Here is one example with the package effectsize
library(effectsize)
## Warning: package 'effectsize' was built under R version 3.6.3
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::eta_squared(anova, ci=.95)
## Parameter | Eta2 (partial) | 95% CI
## ---------------------------------------------
## V238 | 0.02 | [0.00, 0.03]
## caretake_mean | 0.24 | [0.19, 0.28]
effectsize::omega_squared(anova, ci=.95)
## Parameter | Omega2 (partial) | 95% CI
## ------------------------------------------------
## V238 | 0.01 | [ 0.00, 0.03]
## caretake_mean | 0.24 | [ 0.19, 0.28]
effectsize::cohens_f(anova, ci=.95)
## Parameter | Cohen's f (partial) | 95% CI
## --------------------------------------------------
## V238 | 0.13 | [0.05, 0.18]
## caretake_mean | 0.56 | [0.49, 0.62]
If you want to organize all outputs in one table, try the following:
library(sjstats)
## Warning: package 'sjstats' was built under R version 3.6.3
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
##
## Attaching package: 'sjstats'
## The following objects are masked from 'package:effectsize':
##
## cohens_f, phi
## The following object is masked from 'package:psych':
##
## phi
sjstats::anova_stats(anova, digits=2)
## term df sumsq meansq statistic p.value etasq
## V238 V238 4 14.20 3.55 4.04 0 0.01
## caretake_mean caretake_mean 1 267.95 267.95 305.02 0 0.23
## ...3 Residuals 983 863.55 0.88 NA NA NA
## partial.etasq omegasq partial.omegasq epsilonsq cohens.f power
## V238 0.02 0.01 0.01 0.01 0.13 0.91
## caretake_mean 0.24 0.23 0.24 0.23 0.56 1.00
## ...3 NA NA NA NA NA NA
To conduct two-way ANOVA, you simply add your second IV to your formular. If you have more IVs, you add in the same manner. Let’s change our DV to the caretake subscale (caretake_mean) demostrate a different example than our one-way.
anova2way <- aov(caretake_mean~V240*V238, data = data) #or
anova2way <- aov(caretake_mean~V240+V238+V240:V238, data = data)
summary(anova2way)
## Df Sum Sq Mean Sq F value Pr(>F)
## V240 1 0.0 0.008 0.008 0.928
## V238 4 2.6 0.646 0.658 0.621
## V240:V238 4 30.2 7.553 7.694 4.18e-06 ***
## Residuals 989 970.9 0.982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 201 observations deleted due to missingness
Practice: Can you find the effect sizes using this model? Can you do a Tukey HSD post-hoc test?
TukeyHSD(anova2way)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = caretake_mean ~ V240 + V238 + V240:V238, data = data)
##
## $V240
## diff lwr upr p adj
## 2-1 -0.005652571 -0.1286996 0.1173944 0.9281881
##
## $V238
## diff lwr upr p adj
## 2-1 0.080324658 -0.4759596 0.6366090 0.9948807
## 3-1 0.047077668 -0.5032143 0.5973697 0.9993389
## 4-1 0.048923941 -0.5081222 0.6059701 0.9992666
## 5-1 0.219185292 -0.3801219 0.8184925 0.8556403
## 3-2 -0.033246990 -0.2529943 0.1865004 0.9938729
## 4-2 -0.031400717 -0.2675533 0.2047518 0.9962782
## 5-2 0.138860634 -0.1846154 0.4623367 0.7667871
## 4-3 0.001846273 -0.2198226 0.2235151 0.9999999
## 5-3 0.172107624 -0.1409512 0.4851664 0.5611370
## 5-4 0.170261351 -0.1545231 0.4950458 0.6066624
##
## $`V240:V238`
## diff lwr upr p adj
## 2:1-1:1 1.01742919 -0.27774069 2.31259908 0.2744837
## 1:2-1:1 0.57870370 -0.50714360 1.66455100 0.8012110
## 2:2-1:1 0.88284203 -0.19602559 1.96170964 0.2213294
## 1:3-1:1 0.77798942 -0.29588726 1.85186610 0.3916095
## 2:3-1:1 0.64908977 -0.42449011 1.72266965 0.6573430
## 1:4-1:1 0.68242711 -0.39776400 1.76261821 0.5968897
## 2:4-1:1 0.75455116 -0.33193796 1.84104028 0.4558830
## 1:5-1:1 1.18757467 0.06685278 2.30829656 0.0277463
## 2:5-1:1 0.31986532 -0.86163053 1.50136117 0.9975470
## 1:2-2:1 -0.43872549 -1.25292527 0.37547429 0.7906264
## 2:2-2:1 -0.13458717 -0.93945501 0.67028067 0.9999521
## 1:3-2:1 -0.23943978 -1.03760520 0.55872564 0.9946522
## 2:3-2:1 -0.36833943 -1.16610548 0.42942662 0.9059067
## 1:4-2:1 -0.33500209 -1.14164311 0.47163894 0.9498049
## 2:4-2:1 -0.26287803 -1.07793356 0.55217750 0.9908951
## 1:5-2:1 0.17014548 -0.69001408 1.03030503 0.9998035
## 2:5-2:1 -0.69756387 -1.63553474 0.24040699 0.3524780
## 2:2-1:2 0.30413832 -0.08239961 0.69067625 0.2722885
## 1:3-1:2 0.19928571 -0.17309496 0.57166639 0.7972621
## 2:3-1:2 0.07038606 -0.30113783 0.44190996 0.9998628
## 1:4-1:2 0.10372340 -0.28649330 0.49394011 0.9978719
## 2:4-1:2 0.17584746 -0.23147901 0.58317392 0.9364024
## 1:5-1:2 0.60887097 0.11747113 1.10027080 0.0036135
## 2:5-1:2 -0.25883838 -0.87640513 0.35872836 0.9468807
## 1:3-2:2 -0.10485261 -0.45636126 0.24665605 0.9948736
## 2:3-2:2 -0.23375226 -0.58435312 0.11684861 0.5178213
## 1:4-2:2 -0.20041492 -0.57076615 0.16993632 0.7864121
## 2:4-2:2 -0.12829086 -0.51662811 0.26004638 0.9891806
## 1:5-2:2 0.30473265 -0.17104541 0.78051070 0.5771780
## 2:5-2:2 -0.56297671 -1.16818712 0.04223371 0.0936816
## 2:3-1:3 -0.12889965 -0.46382765 0.20602835 0.9691301
## 1:4-1:3 -0.09556231 -0.45111236 0.25998774 0.9976808
## 2:4-1:3 -0.02343826 -0.39768632 0.35080981 1.0000000
## 1:5-1:3 0.40958525 -0.05476435 0.87393486 0.1387675
## 2:5-1:3 -0.45812410 -1.05439203 0.13814383 0.3053053
## 1:4-2:3 0.03333734 -0.32131526 0.38798994 0.9999997
## 2:4-2:3 0.10546139 -0.26793417 0.47885695 0.9966052
## 1:5-2:3 0.53848490 0.07482211 1.00214770 0.0091578
## 2:5-2:3 -0.32922445 -0.92495767 0.26650878 0.7648353
## 2:4-1:4 0.07212405 -0.31987507 0.46412318 0.9998926
## 1:5-1:4 0.50514756 0.02637595 0.98391918 0.0290718
## 2:5-1:4 -0.36256179 -0.97012837 0.24500479 0.6744327
## 1:5-2:4 0.43302351 -0.05979293 0.92583995 0.1424775
## 2:5-2:4 -0.43468584 -1.05338038 0.18400869 0.4381901
## 2:5-1:5 -0.86770935 -1.54471886 -0.19069985 0.0021070
The post-hoc test results can get really messy when you have high number of group comparisons. We can plot them out. There are tons of ways to present an interaction plot. Here, we will introduce “faceting” from ggplot. Facets divide a plot into subplots based on a categorical variable (or multiple categorical variables).
filter(data, !is.na(V238)) %>% #filter NA of social status so it doesn't appear on graph
filter(!is.na(caretake_mean)) %>% #this line is optional; ggplot will removes missing for DV.
ggplot()+
aes( x=V238, y = caretake_mean) +
geom_boxplot() +
facet_wrap(~V240 ) +
xlab("Perceived Social Status") +
ylab("Schwartz's Caretake Subscale")
The aov() function is for balanced design only. If you want to conduct a two-way ANOVA for unbalanced designs (e.g., Type III ANOVA), you may use Anova() in the car package.
anovaunbal<-aov(caretake_mean~V240*V238, data = data)
Anova(anovaunbal, type="III") #type III anova
## Anova Table (Type III tests)
##
## Response: caretake_mean
## Sum Sq Df F value Pr(>F)
## (Intercept) 23.90 1 24.3462 9.439e-07 ***
## V240 6.09 1 6.2049 0.0129026 *
## V238 21.19 4 5.3962 0.0002669 ***
## V240:V238 30.21 4 7.6938 4.180e-06 ***
## Residuals 970.93 989
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The World Value Survey does not have repeated measure data. But let’s create a fake Time 2 measure for caretake_mean.
data$caretake_t1<-data$caretake_mean
data$caretake_t2<-data$excite_mean
data$caretake_t3<-data$security_mean
Currently, we call the data form you have as a “wide form”, that is, time 1 and time 2 scores belong to separate columns. You need to first transform your data to a “long form”. But make sure you have given your participants a unique ID before you do so.
data_long<-gather(data,
time, caretake_t1t2t3, caretake_t1, caretake_t2, caretake_t3,
factor_key = T)
nrow(data)
## [1] 1200
nrow(data_long)
## [1] 3600
Before conducting the tests, let’s check assumptions. First, outliers:
library(rstatix) #for identify_outliers
## Warning: package 'rstatix' was built under R version 3.6.3
##
## Attaching package: 'rstatix'
## The following objects are masked from 'package:effectsize':
##
## cohens_d, eta_squared
## The following object is masked from 'package:stats':
##
## filter
data_long %>%
group_by(time) %>%
identify_outliers(caretake_t1t2t3)
## # A tibble: 38 x 438
## time ID V1 V2 V2A V3 V4 V5 V6 V7 V8 V9 V10
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 care~ 40 6 12 12 40 4 4 4 4 4 1 2
## 2 care~ 379 6 12 12 379 1 1 1 4 1 1 2
## 3 care~ 464 6 12 12 464 1 3 1 3 1 1 1
## 4 care~ 521 6 12 12 521 1 1 2 4 1 1 4
## 5 care~ 530 6 12 12 530 2 4 3 4 2 1 3
## 6 care~ 533 6 12 12 533 1 1 2 4 1 1 4
## 7 care~ 590 6 12 12 590 1 1 1 4 1 1 3
## 8 care~ 1180 6 12 12 1180 2 2 2 4 2 2 NA
## 9 care~ 40 6 12 12 40 4 4 4 4 4 1 2
## 10 care~ 172 6 12 12 172 1 1 1 1 1 1 1
## # ... with 28 more rows, and 425 more variables: V11 <int>, V12 <int>,
## # V13 <int>, V14 <int>, V15 <int>, V16 <int>, V17 <int>, V18 <int>,
## # V19 <int>, V20 <int>, V21 <int>, V22 <int>, V23 <int>, V24 <int>,
## # V25 <int>, V26 <int>, V27 <int>, V28 <int>, V29 <int>, V30 <int>,
## # V31 <int>, V32 <int>, V33 <int>, V34 <int>, V35 <int>, V36 <int>,
## # V37 <int>, V38 <int>, V39 <int>, V40 <int>, V41 <int>, V42 <int>,
## # V43 <int>, V44 <int>, V44_ES <int>, V45 <int>, V46 <int>, V47 <int>,
## # V48 <int>, V49 <int>, V50 <int>, V51 <int>, V52 <int>, V53 <int>,
## # V54 <int>, V55 <int>, V56 <int>, V56_NZ <int>, V57 <int>, V58 <int>,
## # V59 <int>, V60 <int>, V61 <int>, V62 <int>, V63 <int>, V64 <int>,
## # V65 <int>, V66 <int>, V67 <int>, V68 <int>, V69 <int>, V70 <int>,
## # V71 <int>, V72 <int>, V73 <int>, V74 <int>, V74B <int>, V75 <int>,
## # V76 <int>, V77 <int>, V78 <int>, V79 <int>, V80 <int>, V81 <int>,
## # V82 <int>, V83 <int>, V84 <int>, V85 <int>, V86 <int>, V87 <int>,
## # V88 <int>, V89 <int>, V90 <int>, V91 <int>, V92 <int>, V93 <int>,
## # V94 <int>, V95 <int>, V96 <int>, V97 <int>, V98 <int>, V99 <int>,
## # V100 <int>, V101 <int>, V102 <int>, V103 <int>, V104 <int>, V105 <int>,
## # V106 <int>, V107 <int>, ...
The output shows participants with extreme values.
Next, normality assumption.
data_long %>%
group_by(time) %>%
shapiro_test(caretake_t1t2t3)
## # A tibble: 3 x 4
## time variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 caretake_t1 caretake_t1t2t3 0.935 2.01e-21
## 2 caretake_t2 caretake_t1t2t3 0.971 1.25e-13
## 3 caretake_t3 caretake_t1t2t3 0.904 5.35e-26
library(ggpubr) #similar to ggplot2
## Warning: package 'ggpubr' was built under R version 3.6.3
ggqqplot(data_long,"caretake_t1t2t3", facet.by="time")
## Warning: Removed 317 rows containing non-finite values (stat_qq).
## Warning: Removed 317 rows containing non-finite values (stat_qq_line).
## Warning: Removed 317 rows containing non-finite values (stat_qq_line).
Assumption of sphericity does not need to be tested alone. Instead, you can use the anova_test function in the rstatix package to run the repeated measure tests. The results will display the Mauchly’s test for sphericity and the sphericity corrected results. The output will displace two corrected results, Greenhouse-Geisser epsilon (GGe), and Huynh-Feldt epsilon (HFe).
Now that we have a long form and we finished testing assumptions, we can do a repeated measure ANOVA. Let’s assume that we want to see if sex (V240) differs on their caretake change over time. The format is the same, your repeated measure (caretake_t1t2) is your DV, so it appears on the left hand side of the formula. Sex (V240) is your IV, so it appears on the right hand side. Error(ID) helps identify who is being repeated here (i.e., the participants)
rep_data<-select(data_long, ID, caretake_t1t2t3,time, V240)
rep_data<-rep_data[complete.cases(rep_data),]
repeated_aov<-anova_test(rep_data, dv=caretake_t1t2t3, wid=ID
, within=time
, between=V240
)
repeated_aov
## ANOVA Table (type III tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 V240 1 1032 0.217 6.41e-01 0.000115
## 2 time 2 2064 89.393 5.85e-38 * 0.038000
## 3 V240:time 2 2064 0.266 7.66e-01 0.000117
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 time 0.896 3.13e-25 *
## 2 V240:time 0.896 3.13e-25 *
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 time 0.906 1.81, 1870.11 1.15e-34 * 0.908 1.82, 1873.21 1.02e-34
## 2 V240:time 0.906 1.81, 1870.11 7.44e-01 0.908 1.82, 1873.21 7.45e-01
## p[HF]<.05
## 1 *
## 2
To see the corrected solution:
get_anova_table(repeated_aov)
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 V240 1.00 1032.00 0.217 6.41e-01 0.000115
## 2 time 1.81 1870.11 89.393 1.15e-34 * 0.038000
## 3 V240:time 1.81 1870.11 0.266 7.44e-01 0.000117
The aov() function can also handle repeated measure, but the codes are not as straight forward and it doesn’t produce spherity test.
anova_within<-aov(caretake_t1t2t3~V240*time+Error(ID/V240),data_long) #when V240 is within-subj
anova_within<-aov(caretake_t1t2t3~V240*time+Error(ID),data_long) #when V240 is between-subj
summary(anova_within)
##
## Error: ID
## Df Sum Sq Mean Sq
## V240 1 6.646 6.646
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## V240 1 0 0.05 0.042 0.837
## time 2 167 83.62 71.997 <2e-16 ***
## V240:time 2 1 0.47 0.406 0.667
## Residuals 3276 3805 1.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Post-hoc
#effect of sex on DV at each level of time
sexeffect <- rep_data %>%
group_by(time) %>%
anova_test(dv = caretake_t1t2t3, wid = ID, between=V240) %>%
get_anova_table() %>%
adjust_pvalue(method = "bonferroni")
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
sexeffect
## # A tibble: 3 x 9
## time Effect DFn DFd F p `p<.05` ges p.adj
## * <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 caretake_t1 V240 1 1099 0.26 0.61 "" 0.000237 1
## 2 caretake_t2 V240 1 1057 0.031 0.861 "" 0.0000292 1
## 3 caretake_t3 V240 1 1121 0.602 0.438 "" 0.000537 1
#effect of time on DV at each level of sex
timeeffect <- rep_data %>%
group_by(V240) %>%
anova_test(dv = caretake_t1t2t3, wid = ID, within=time) %>%
get_anova_table() %>%
adjust_pvalue(method = "bonferroni")
timeeffect
## # A tibble: 2 x 9
## V240 Effect DFn DFd F p `p<.05` ges p.adj
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 time 1.81 946. 53.3 7.14e-21 * 0.041 1.43e-20
## 2 2 time 1.81 921. 37.7 3.61e-15 * 0.035 7.22e-15
Linear Mixed Model (LMM) can also deal with repeated data and is generally considered to be more powerful. It can be used if you have participants nested in groups (e.g., students in class, class in school), if you want to model multiple random effects for groups/participants, when you have unbalanced design, when you have covariates (random or fixed), etc. We won’t be able to cover the details of it, but here is the general form in case you want to pursue further:
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
fit<-lmer(caretake_t1t2t3~V240*time+(1|ID),data_long)
summary(fit)
## Linear mixed model fit by REML ['lmerMod']
## Formula: caretake_t1t2t3 ~ V240 * time + (1 | ID)
## Data: data_long
##
## REML criterion at convergence: 9521.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4021 -0.6552 -0.1271 0.5886 3.0734
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 0.3857 0.6210
## Residual 0.7787 0.8825
## Number of obs: 3283, groups: ID, 1139
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.33782 0.04604 50.774
## V2402 -0.02704 0.06489 -0.417
## timecaretake_t2 0.41323 0.05379 7.682
## timecaretake_t3 -0.10186 0.05314 -1.917
## V2402:timecaretake_t2 0.03391 0.07634 0.444
## V2402:timecaretake_t3 0.07288 0.07512 0.970
##
## Correlation of Fixed Effects:
## (Intr) V2402 tmcr_2 tmcr_3 V2402:_2
## V2402 -0.710
## timecrtk_t2 -0.579 0.411
## timecrtk_t3 -0.590 0.419 0.502
## V2402:tmc_2 0.408 -0.574 -0.705 -0.354
## V2402:tmc_3 0.418 -0.586 -0.355 -0.707 0.497
In the above example, we only have model one random effect - separate slopes for each participant (ID). If you have nested data (e.g., participants nested in their age group-V242), you may add random effects to each grouping.
library(lme4)
fit<-lmer(caretake_t1t2t3~V240*time+(1|V242)+(1|V242:ID),data_long)
summary(fit)
## Linear mixed model fit by REML ['lmerMod']
## Formula: caretake_t1t2t3 ~ V240 * time + (1 | V242) + (1 | V242:ID)
## Data: data_long
##
## REML criterion at convergence: 9521.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4033 -0.6565 -0.1241 0.5861 3.0676
##
## Random effects:
## Groups Name Variance Std.Dev.
## V242:ID (Intercept) 0.384612 0.62017
## V242 (Intercept) 0.001058 0.03253
## Residual 0.778748 0.88247
## Number of obs: 3283, groups: V242:ID, 1139; V242, 62
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.33793 0.04631 50.484
## V2402 -0.02813 0.06489 -0.433
## timecaretake_t2 0.41320 0.05380 7.681
## timecaretake_t3 -0.10193 0.05314 -1.918
## V2402:timecaretake_t2 0.03393 0.07634 0.444
## V2402:timecaretake_t3 0.07296 0.07512 0.971
##
## Correlation of Fixed Effects:
## (Intr) V2402 tmcr_2 tmcr_3 V2402:_2
## V2402 -0.705
## timecrtk_t2 -0.575 0.411
## timecrtk_t3 -0.587 0.419 0.502
## V2402:tmc_2 0.406 -0.574 -0.705 -0.354
## V2402:tmc_3 0.415 -0.586 -0.355 -0.707 0.497
In Week 4 notes, we will cover some ways to compute simple slope/effects. Those methods apply to LMM as well.