library(tidyverse)
library(ggplot2)
library(reshape2)
library(rstatix)
source("/Users/claire/Desktop/gradstats2022/gradstats_funcs.R")

load data on self esteem across three time points for 10 participants

data("selfesteem", package = "datarium")
head(selfesteem)
## # A tibble: 6 × 4
##      id    t1    t2    t3
##   <int> <dbl> <dbl> <dbl>
## 1     1  4.01  5.18  7.11
## 2     2  2.56  6.91  6.31
## 3     3  3.24  4.44  9.78
## 4     4  3.42  4.71  8.35
## 5     5  2.87  3.91  6.46
## 6     6  2.05  5.34  6.65

convert data to long form for sake of practice

data often comes in long form – as it is collected this way maybe more often than not?

selfesteem_long<- melt(selfesteem, id.vars = "id", value.name = "score", measure.vars = c("t1", "t2", "t3"), variable.name = "time")

head(selfesteem_long)
##   id time    score
## 1  1   t1 4.005027
## 2  2   t1 2.558124
## 3  3   t1 3.244241
## 4  4   t1 3.419538
## 5  5   t1 2.871243
## 6  6   t1 2.045868

from long form we can still get very useful information (means, plots etc):

selfesteem_long %>%
  group_by(time) %>%
  get_summary_stats(score, type = "mean_sd")
## # A tibble: 3 × 5
##   time  variable     n  mean    sd
##   <fct> <fct>    <dbl> <dbl> <dbl>
## 1 t1    score       10  3.14 0.552
## 2 t2    score       10  4.93 0.863
## 3 t3    score       10  7.64 1.14

but if we were to analyze the data, it would be incorrect and would not be controlling for the non independence that comes with each ID having multiple rows.

if we plot the density of each self esteem score by time, they have very different distributions, with self esteem largely increasing over time

ggplot(selfesteem_long, aes(x=score))+
  geom_density(color="darkblue", fill="lightblue") +
  facet_wrap(~time)

similarly, we we can see that when we plot the linear effect by participant:

selfesteem_long %>%
  ggplot( aes(x=time, y=score)) +
    geom_line( color="grey") +
    geom_point(shape=21, color="black", fill="#69b3a2", size=6)+
  facet_wrap(~id)

there is generally a linear increase but people have their own trends

if we wanted to see whether self esteem increases over time, we could run an analysis:

selfesteem_long$lin[selfesteem_long$time=="t1"]<- -1
selfesteem_long$lin[selfesteem_long$time=="t2"]<- 0
selfesteem_long$lin[selfesteem_long$time=="t3"]<- 1

selfesteem_long$quad[selfesteem_long$time=="t1"]<- -1
selfesteem_long$quad[selfesteem_long$time=="t2"]<- 2
selfesteem_long$quad[selfesteem_long$time=="t3"]<- -1

mcSummary(lm(score~lin+quad, data=selfesteem_long))
## lm(formula = score ~ lin + quad, data = selfesteem_long)
## 
## Omnibus ANOVA
##                 SS df     MS EtaSq      F p
## Model      102.456  2 51.228 0.829 65.261 0
## Error       21.194 27  0.785               
## Corr Total 123.650 29  4.264               
## 
##   RMSE AdjEtaSq
##  0.886    0.816
## 
## Coefficients
##                Est StErr      t  SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept)  5.237 0.162 32.374 822.723 0.975  NA  4.905   5.569 0.000
## lin          2.248 0.198 11.348 101.080 0.827   1  1.842   2.655 0.000
## quad        -0.151 0.114 -1.324   1.376 0.061   1 -0.386   0.083 0.197

we see a significant linear effect – but remember this doesn’t control for non independence!

so let’s bring the data back to it’s wide form:

selfesteem_wide <- dcast(selfesteem_long, id ~ time, value.var="score")
head(selfesteem_wide)
##   id       t1       t2       t3
## 1  1 4.005027 5.182286 7.107831
## 2  2 2.558124 6.912915 6.308434
## 3  3 3.244241 4.443434 9.778410
## 4  4 3.419538 4.711696 8.347124
## 5  5 2.871243 3.908429 6.457287
## 6  6 2.045868 5.340549 6.653224

to run this analysis controlling for dependence – we need to do several things:

  1. create the w0 (between subject score)
  2. create the w1s (within subject score)
  1. run a model where w0 is regression on an intercept (w0~1)
  2. run a model for each w1 where w1 is regressed on an intercept (w1~1)
  3. make inferences!