library(tidyverse)
library(ggplot2)
library(reshape2)
library(rstatix)
source("/Users/claire/Desktop/gradstats2022/gradstats_funcs.R")
load data on self esteem across three time points for 10 participants
data("selfesteem", package = "datarium")
head(selfesteem)
## # A tibble: 6 × 4
## id t1 t2 t3
## <int> <dbl> <dbl> <dbl>
## 1 1 4.01 5.18 7.11
## 2 2 2.56 6.91 6.31
## 3 3 3.24 4.44 9.78
## 4 4 3.42 4.71 8.35
## 5 5 2.87 3.91 6.46
## 6 6 2.05 5.34 6.65
convert data to long form for sake of practice
data often comes in long form – as it is collected this way maybe more often than not?
selfesteem_long<- melt(selfesteem, id.vars = "id", value.name = "score", measure.vars = c("t1", "t2", "t3"), variable.name = "time")
head(selfesteem_long)
## id time score
## 1 1 t1 4.005027
## 2 2 t1 2.558124
## 3 3 t1 3.244241
## 4 4 t1 3.419538
## 5 5 t1 2.871243
## 6 6 t1 2.045868
from long form we can still get very useful information (means, plots etc):
selfesteem_long %>%
group_by(time) %>%
get_summary_stats(score, type = "mean_sd")
## # A tibble: 3 × 5
## time variable n mean sd
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 t1 score 10 3.14 0.552
## 2 t2 score 10 4.93 0.863
## 3 t3 score 10 7.64 1.14
but if we were to analyze the data, it would be incorrect and would not be controlling for the non independence that comes with each ID having multiple rows.
if we plot the density of each self esteem score by time, they have very different distributions, with self esteem largely increasing over time
ggplot(selfesteem_long, aes(x=score))+
geom_density(color="darkblue", fill="lightblue") +
facet_wrap(~time)
similarly, we we can see that when we plot the linear effect by participant:
selfesteem_long %>%
ggplot( aes(x=time, y=score)) +
geom_line( color="grey") +
geom_point(shape=21, color="black", fill="#69b3a2", size=6)+
facet_wrap(~id)
there is generally a linear increase but people have their own trends
if we wanted to see whether self esteem increases over time, we could run an analysis:
selfesteem_long$lin[selfesteem_long$time=="t1"]<- -1
selfesteem_long$lin[selfesteem_long$time=="t2"]<- 0
selfesteem_long$lin[selfesteem_long$time=="t3"]<- 1
selfesteem_long$quad[selfesteem_long$time=="t1"]<- -1
selfesteem_long$quad[selfesteem_long$time=="t2"]<- 2
selfesteem_long$quad[selfesteem_long$time=="t3"]<- -1
mcSummary(lm(score~lin+quad, data=selfesteem_long))
## lm(formula = score ~ lin + quad, data = selfesteem_long)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 102.456 2 51.228 0.829 65.261 0
## Error 21.194 27 0.785
## Corr Total 123.650 29 4.264
##
## RMSE AdjEtaSq
## 0.886 0.816
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 5.237 0.162 32.374 822.723 0.975 NA 4.905 5.569 0.000
## lin 2.248 0.198 11.348 101.080 0.827 1 1.842 2.655 0.000
## quad -0.151 0.114 -1.324 1.376 0.061 1 -0.386 0.083 0.197
we see a significant linear effect – but remember this doesn’t control for non independence!
so let’s bring the data back to it’s wide form:
selfesteem_wide <- dcast(selfesteem_long, id ~ time, value.var="score")
head(selfesteem_wide)
## id t1 t2 t3
## 1 1 4.005027 5.182286 7.107831
## 2 2 2.558124 6.912915 6.308434
## 3 3 3.244241 4.443434 9.778410
## 4 4 3.419538 4.711696 8.347124
## 5 5 2.871243 3.908429 6.457287
## 6 6 2.045868 5.340549 6.653224
to run this analysis controlling for dependence – we need to do several things: