Get long-form data, keeping level (Character, Radical, Stroke) as a single variable, which lm will automatically treat as a treatment-coded factor:
library(dplyr)  # provides %>%, select(), mutate(), across()
Df <- readxl::read_excel("../data/SimplifiedAdultDatabase.xlsx",
                         sheet = "Hierarchy") %>%
  janitor::clean_names() %>%
  select(character, level, duration, phonogram, sro, z_regularity, log_homo_den) %>%
  # Standardise the four lexical predictors (mean 0, SD 1)
  mutate(across(c(phonogram, sro, z_regularity, log_homo_den), ~scale(as.numeric(.x))))
# Each lexical predictor is allowed to interact with level
fit <- lm(duration ~ level*(phonogram + sro + z_regularity + log_homo_den),
          data = Df)
summary(fit)$coef %>% knitr::kable()
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2227.62996 | 11.64608 | 191.2771528 | 0.0000000 |
| levelRadical | -1521.44134 | 16.69708 | -91.1202104 | 0.0000000 |
| levelStroke | -2079.71876 | 16.47005 | -126.2727561 | 0.0000000 |
| phonogram | 171.65724 | 14.13595 | 12.1433082 | 0.0000000 |
| sro | 27.41315 | 12.34275 | 2.2209918 | 0.0264148 |
| z_regularity | 46.59324 | 13.81930 | 3.3716072 | 0.0007554 |
| log_homo_den | -17.21184 | 11.70933 | -1.4699259 | 0.1416711 |
| levelRadical:phonogram | -149.61936 | 20.52758 | -7.2887011 | 0.0000000 |
| levelStroke:phonogram | -174.10521 | 19.99126 | -8.7090679 | 0.0000000 |
| levelRadical:sro | -43.51703 | 17.47391 | -2.4903995 | 0.0128057 |
| levelStroke:sro | -25.71078 | 17.45529 | -1.4729510 | 0.1408535 |
| levelRadical:z_regularity | -45.77456 | 19.66488 | -2.3277319 | 0.0199827 |
| levelStroke:z_regularity | -44.70008 | 19.54343 | -2.2872173 | 0.0222421 |
| levelRadical:log_homo_den | 15.45687 | 16.80968 | 0.9195219 | 0.3578856 |
| levelStroke:log_homo_den | 16.51703 | 16.55949 | 0.9974360 | 0.3186214 |
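To make the treatment coding concrete, here is a minimal sketch on simulated data. The data frame `sim`, the predictor `x`, and the slope values are made up for illustration and are not taken from the database. It shows that the plain `x` coefficient is the slope at the baseline level, and that adding an interaction coefficient gives the slope at the corresponding non-baseline level.

```r
# Simulated illustration only: names and effect sizes are invented.
set.seed(1)
n <- 300
sim <- data.frame(
  level = factor(rep(c("Character", "Radical", "Stroke"), each = n)),
  x = rnorm(3 * n)
)
# Arbitrary "true" slopes of x within each level
true_slope <- c(Character = 30, Radical = -15, Stroke = 5)
sim$y <- 100 + unname(true_slope[as.character(sim$level)]) * sim$x +
  rnorm(3 * n, sd = 10)

fit_sim <- lm(y ~ level * x, data = sim)
b <- coef(fit_sim)

# Treatment coding: "x" is the slope at the baseline level (Character);
# each interaction term is the difference from that baseline slope.
slopes <- c(
  Character = unname(b["x"]),
  Radical   = unname(b["x"] + b["levelRadical:x"]),
  Stroke    = unname(b["x"] + b["levelStroke:x"])
)

# Cross-check against a separate regression fitted within each level
separate <- sapply(split(sim, sim$level),
                   function(d) coef(lm(y ~ x, data = d))[["x"]])
print(round(rbind(interaction_model = slopes, separate_fits = separate), 3))
```

Read the same way, the table above implies an sro slope of about 27.41 - 43.52 = -16.10 at the Radical level and about 27.41 - 25.71 = 1.70 at the Stroke level (per standard deviation of sro, since the predictors were scaled).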
For example, take sro. The first two comparisons come for free in the model (which has Character as the baseline): they are the levelRadical:sro and levelStroke:sro rows of the coefficient table. The only new information here is for Radical vs. Stroke.
Radical vs. Character
car::linearHypothesis(fit, "levelRadical:sro = 0")
##
## Linear hypothesis test:
## levelRadical:sro = 0
##
## Model 1: restricted model
## Model 2: duration ~ level * (phonogram + sro + z_regularity + log_homo_den)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3526 574327134
## 2 3525 573318404 1 1008730 6.2021 0.01281 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Stroke vs. Character
car::linearHypothesis(fit, "levelStroke:sro = 0")
##
## Linear hypothesis test:
## levelStroke:sro = 0
##
## Model 1: restricted model
## Model 2: duration ~ level * (phonogram + sro + z_regularity + log_homo_den)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3526 573671273
## 2 3525 573318404 1 352869 2.1696 0.1409
Radical vs. Stroke
car::linearHypothesis(fit, "levelRadical:sro - levelStroke:sro = 0")
##
## Linear hypothesis test:
## levelRadical:sro - levelStroke:sro = 0
##
## Model 1: restricted model
## Model 2: duration ~ level * (phonogram + sro + z_regularity + log_homo_den)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3526 573487293
## 2 3525 573318404 1 168889 1.0384 0.3083
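As a cross-check on what the Radical vs. Stroke test is doing, the same contrast can be computed by hand from the coefficients and their covariance matrix, or read directly from a refit with Stroke as the baseline. The sketch below uses simulated data (the data frame `sim` and its effect sizes are invented) and base R only, so it is an illustration of the idea rather than a rerun of the analysis above.

```r
# Simulated illustration only: data and effect sizes are invented.
set.seed(2)
n <- 200
sim <- data.frame(
  level = factor(rep(c("Character", "Radical", "Stroke"), each = n)),
  sro = rnorm(3 * n)
)
sim$duration <- 2000 + 20 * sim$sro * (sim$level == "Character") +
  rnorm(3 * n, sd = 100)
fit_sim <- lm(duration ~ level * sro, data = sim)

# Contrast vector: +1 on levelRadical:sro, -1 on levelStroke:sro
k <- setNames(numeric(length(coef(fit_sim))), names(coef(fit_sim)))
k["levelRadical:sro"] <- 1
k["levelStroke:sro"] <- -1

est   <- sum(k * coef(fit_sim))                 # estimated slope difference
se    <- sqrt(drop(k %*% vcov(fit_sim) %*% k))  # its standard error
t_val <- est / se
p_val <- 2 * pt(abs(t_val), df = df.residual(fit_sim), lower.tail = FALSE)

# Same comparison after making Stroke the baseline level
sim2 <- transform(sim, level = relevel(level, ref = "Stroke"))
fit_relevel <- lm(duration ~ level * sro, data = sim2)
t_relevel <- summary(fit_relevel)$coefficients["levelRadical:sro", "t value"]

print(c(t_by_hand = t_val, t_relevel = t_relevel, p = p_val))
```

For a single-coefficient hypothesis the F statistic is the square of the t value, which is why the first two tests above agree with the coefficient table: (-2.4904)^2 ≈ 6.2021 and (-1.4730)^2 ≈ 2.1696.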
This reproduces the current analysis, but without having to use different models for main and interaction effects. I think this is the correct approach: if we believe that there are interaction effects, then these will affect the size of the main effects, so main effects should be reported only from a model that also includes the interactions.
My problem with that model for duration (and similar arguments might apply, to a lesser extent, to the other dependent variables, I think) is that the time it takes someone to produce a character is largely, but not entirely, the sum of the times to produce its radicals, and the time to complete each radical is in turn the sum of the times to complete its component strokes. The question we really want to answer is whether psycholinguistic features affect the duration of character production over and above their effects on radicals or strokes. Does character explain any additional variance that is not accounted for by radical and stroke? For me, this is a direct test of “Are the lexical effects across these different levels hierarchical?”.
If that’s the correct understanding, then what we need is this sort of analysis. The example is character duration, controlling for radical and stroke duration.
Get wide-form data (separate columns for stroke, radical, and character duration):
library(tidyr)  # provides pivot_wider(), drop_na()
Df2 <- readxl::read_excel("../data/SimplifiedAdultDatabase.xlsx",
                          sheet = "Hierarchy") %>%
  janitor::clean_names() %>%
  select(character, level, duration, phonogram, sro, z_regularity, log_homo_den) %>%
  # One duration column per level; clean_names() then gives dur_character, dur_radical, dur_stroke
  pivot_wider(names_from = level, values_from = duration, names_prefix = "dur_") %>%
  janitor::clean_names() %>%
  drop_na()
# Nested models: intercept only, + radical, + stroke, + lexical predictors
m0 <- lm(dur_character ~ 1, data = Df2)
m1 <- lm(dur_character ~ dur_radical, data = Df2)
m2 <- lm(dur_character ~ dur_radical + dur_stroke, data = Df2)
m4 <- lm(dur_character ~ dur_radical + dur_stroke
         + phonogram + sro + z_regularity + log_homo_den,
         data = Df2)
# Sequential F tests between successive models
anova(m0, m1, m2, m4)
## Analysis of Variance Table
##
## Model 1: dur_character ~ 1
## Model 2: dur_character ~ dur_radical
## Model 3: dur_character ~ dur_radical + dur_stroke
## Model 4: dur_character ~ dur_radical + dur_stroke + phonogram + sro +
## z_regularity + log_homo_den
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1139 476580623
## 2 1138 453437800 1 23142823 63.037 4.855e-15 ***
## 3 1137 447104625 1 6333175 17.250 3.522e-05 ***
## 4 1133 415961783 4 31142842 21.207 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
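The question of how much additional variance each step explains can be answered from the residual sums of squares in the table above. This is a small sketch: the RSS values and degrees of freedom are copied from that output, and since m0 is the intercept-only model, its RSS is the total sum of squares.

```r
# Residual sums of squares copied from the anova() output above
rss <- c(m0 = 476580623, m1 = 453437800, m2 = 447104625, m4 = 415961783)

# R-squared of each model relative to the intercept-only model m0
r2 <- 1 - rss / rss[["m0"]]

# Increment in R-squared contributed at each step
delta_r2 <- diff(r2)

# Recompute the F statistic for the final step (4 added terms, 1133 residual df)
f_m4 <- ((rss[["m2"]] - rss[["m4"]]) / 4) / (rss[["m4"]] / 1133)

print(round(r2, 4))
print(round(delta_r2, 4))
print(round(f_m4, 3))
```

On these numbers, radical and stroke duration together account for roughly 6% of the variance in character duration, and the four lexical predictors add about a further 6.5 percentage points, taking R-squared to roughly 0.13.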
summary(m4)$coef %>% knitr::kable()
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2117.6850917 | 119.7018073 | 17.6913377 | 0.0000000 |
| dur_radical | 0.4502129 | 0.0580543 | 7.7550311 | 0.0000000 |
| dur_stroke | -2.7625446 | 0.6420009 | -4.3030231 | 0.0000183 |
| phonogram | 319.3085127 | 52.3979333 | 6.0939143 | 0.0000000 |
| sro | 88.2272285 | 47.7738746 | 1.8467673 | 0.0650415 |
| z_regularity | 45.4359500 | 24.9134011 | 1.8237554 | 0.0684524 |
| log_homo_den | -44.1389488 | 51.6508451 | -0.8545639 | 0.3929733 |
You would also want to model radical duration, controlling for stroke duration, ideally at the radical level with random effects for character. The same goes for stroke duration, ideally with random effects for character and (if this makes sense) for radical.
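A radical-level model of that kind might look something like the sketch below. Everything in it is an assumption for illustration: the data are simulated, the column names are hypothetical, and the lme4 package is not used anywhere in the analysis above. It fits radical duration on stroke duration with a random intercept for character, then tests whether a lexical predictor adds anything.

```r
# Simulated illustration only: data, column names and effect sizes are invented.
set.seed(3)
n_char <- 60        # number of characters
rad_per_char <- 3   # radicals per character
sim <- data.frame(character = factor(rep(seq_len(n_char), each = rad_per_char)))
char_effect <- rnorm(n_char, sd = 50)                     # character random intercepts
sim$phonogram  <- rep(rnorm(n_char), each = rad_per_char) # character-level predictor
sim$dur_stroke <- rnorm(nrow(sim), mean = 150, sd = 30)   # stroke duration per radical
sim$dur_radical <- 200 + 1.2 * sim$dur_stroke + 20 * sim$phonogram +
  char_effect[as.integer(sim$character)] + rnorm(nrow(sim), sd = 40)

if (requireNamespace("lme4", quietly = TRUE)) {
  # Baseline: stroke duration only, random intercept for character
  m_stroke <- lme4::lmer(dur_radical ~ dur_stroke + (1 | character),
                         data = sim, REML = FALSE)
  # Add the lexical predictor
  m_lex <- lme4::lmer(dur_radical ~ dur_stroke + phonogram + (1 | character),
                      data = sim, REML = FALSE)
  # Likelihood-ratio test for the added predictor
  print(anova(m_stroke, m_lex))
}
```

For the stroke-level model, nested random intercepts for radical within character would usually be written `(1 | character/radical)` in lme4, assuming the data identify which radical each stroke belongs to.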