Two approaches to character / stroke / radical comparisons

Current approach, but with subsequent analysis comparing slopes

Get long-form data, keeping level (Character, Radical, Stroke) as a single variable, which lm will automatically treat as a treatment-coded factor.

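library(dplyr)  # provides %>%, select(), mutate(), across()

# Long-form data: durations for all levels in one column, predictors standardised
Df <- readxl::read_excel("../data/SimplifiedAdultDatabase.xlsx",
                         sheet = "Hierarchy") %>%
  janitor::clean_names() %>%
  select(character, level, duration, phonogram, sro, z_regularity, log_homo_den) %>%
  # the outer as.numeric() drops the one-column matrix that scale() returns
  mutate(across(c(phonogram, sro, z_regularity, log_homo_den), ~ as.numeric(scale(as.numeric(.x)))))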

Model…

fit <- lm(duration ~ level * (phonogram + sro + z_regularity + log_homo_den),
          data = Df)

summary(fit)$coef %>% knitr::kable()

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	2227.62996	11.64608	191.2771528	0.0000000
levelRadical	-1521.44134	16.69708	-91.1202104	0.0000000
levelStroke	-2079.71876	16.47005	-126.2727561	0.0000000
phonogram	171.65724	14.13595	12.1433082	0.0000000
sro	27.41315	12.34275	2.2209918	0.0264148
z_regularity	46.59324	13.81930	3.3716072	0.0007554
log_homo_den	-17.21184	11.70933	-1.4699259	0.1416711
levelRadical:phonogram	-149.61936	20.52758	-7.2887011	0.0000000
levelStroke:phonogram	-174.10521	19.99126	-8.7090679	0.0000000
levelRadical:sro	-43.51703	17.47391	-2.4903995	0.0128057
levelStroke:sro	-25.71078	17.45529	-1.4729510	0.1408535
levelRadical:z_regularity	-45.77456	19.66488	-2.3277319	0.0199827
levelStroke:z_regularity	-44.70008	19.54343	-2.2872173	0.0222421
levelRadical:log_homo_den	15.45687	16.80968	0.9195219	0.3578856
levelStroke:log_homo_den	16.51703	16.55949	0.9974360	0.3186214
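
Because level is treatment-coded, the four predictor rows are the slopes at the Character level, and each interaction row is the difference between that level's slope and the Character slope. As a reading aid, here is a small sketch that adds them up to get the implied slope per level. The estimates are copied from the table above; the object names (`base_slope`, `radical_adj`, `stroke_adj`, `slopes`) are mine. Since the predictors were passed through scale(), the slopes are in duration units per standard deviation of the predictor.

```r
# Estimates copied from the coefficient table above
base_slope  <- c(phonogram = 171.65724, sro = 27.41315,
                 z_regularity = 46.59324, log_homo_den = -17.21184)
radical_adj <- c(phonogram = -149.61936, sro = -43.51703,
                 z_regularity = -45.77456, log_homo_den = 15.45687)
stroke_adj  <- c(phonogram = -174.10521, sro = -25.71078,
                 z_regularity = -44.70008, log_homo_den = 16.51703)

# Implied slope of each predictor at each level
slopes <- rbind(Character = base_slope,
                Radical   = base_slope + radical_adj,
                Stroke    = base_slope + stroke_adj)
round(slopes, 2)
```

For sro, for instance, this gives about 27.41 for Character, -16.10 for Radical, and 1.70 for Stroke. These are point estimates only; whether they differ from each other is what the slope comparisons test.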

… then compare slopes …

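Take sro as an example. The first two comparisons come for free from the model, which has Character as the baseline level: they are the same tests as the levelRadical:sro and levelStroke:sro rows of the coefficient table. The only new information here is the Radical vs. Stroke comparison.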

Radical vs. Character

car::linearHypothesis(fit, "levelRadical:sro = 0")

## 
## Linear hypothesis test:
## levelRadical:sro = 0
## 
## Model 1: restricted model
## Model 2: duration ~ level * (phonogram + sro + z_regularity + log_homo_den)
## 
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)  
## 1   3526 574327134                              
## 2   3525 573318404  1   1008730 6.2021 0.01281 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stroke vs. Character

car::linearHypothesis(fit, "levelStroke:sro = 0")

## 
## Linear hypothesis test:
## levelStroke:sro = 0
## 
## Model 1: restricted model
## Model 2: duration ~ level * (phonogram + sro + z_regularity + log_homo_den)
## 
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1   3526 573671273                           
## 2   3525 573318404  1    352869 2.1696 0.1409

Radical vs. Stroke

car::linearHypothesis(fit, "levelRadical:sro - levelStroke:sro = 0")

## 
## Linear hypothesis test:
## levelRadical:sro - levelStroke:sro = 0
## 
## Model 1: restricted model
## Model 2: duration ~ level * (phonogram + sro + z_regularity + log_homo_den)
## 
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1   3526 573487293                           
## 2   3525 573318404  1    168889 1.0384 0.3083
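
A quick check on the outputs above: for a single-coefficient hypothesis the F statistic is the square of the t value in the coefficient table (6.2021 is -2.4904 squared for levelRadical:sro), which is why only the third test is new. The sketch below uses simulated data, so `sim` and its effect sizes are made up and are not the real database. It shows two equivalent base-R ways of getting the Radical vs. Stroke slope comparison: as a contrast of the two interaction coefficients, and by refitting with Stroke as the baseline.

```r
# Simulated stand-in data; names mirror the model above, values are invented
set.seed(1)
n <- 300
sim <- data.frame(
  level = factor(rep(c("Character", "Radical", "Stroke"), each = n / 3)),
  sro   = rnorm(n)
)
sim$duration <- 2000 - 1500 * (sim$level == "Radical") -
  2000 * (sim$level == "Stroke") + 30 * sim$sro + rnorm(n, sd = 100)

fit_sim <- lm(duration ~ level * sro, data = sim)

# Way 1: contrast (Radical slope adjustment) - (Stroke slope adjustment)
k <- setNames(numeric(length(coef(fit_sim))), names(coef(fit_sim)))
k[c("levelRadical:sro", "levelStroke:sro")] <- c(1, -1)
est <- sum(k * coef(fit_sim))
se  <- sqrt(drop(t(k) %*% vcov(fit_sim) %*% k))
t_contrast <- est / se
p_contrast <- 2 * pt(abs(t_contrast), df.residual(fit_sim), lower.tail = FALSE)

# Way 2: refit with Stroke as the reference level and read off the coefficient
sim$level_s <- relevel(sim$level, ref = "Stroke")
fit_relevel <- lm(duration ~ level_s * sro, data = sim)
t_relevel <- summary(fit_relevel)$coefficients["level_sRadical:sro", "t value"]

c(contrast = t_contrast, relevel = t_relevel)  # the two t values agree
```

Squaring either t value gives the F that linearHypothesis would report for the same comparison on the same fit.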

The linearHypothesis comparisons above reproduce the current analysis without having to use different models for main and interaction effects. I think this is the correct approach. If we believe that there are interaction effects, then these will affect the size of the main effects, so main effects should be reported only from a model that also includes the interactions.

My problem with that model for duration (and similar arguments might apply, to a lesser extent, to other dependent variables) is that the time it takes someone to produce a character is largely, but not entirely, the sum of the times to produce its radicals, and the time to complete each radical is in turn the sum of the times to complete its component strokes. The question we really want to answer is whether psycholinguistic features affect the duration of character production over and above their effects on radicals or strokes. Does character explain any additional variance that is not accounted for by radical and stroke? For me, this is a direct test of “Are the lexical effects across these different levels hierarchical?”.

Modelling unique variance due to level, controlling for lower levels

If that’s the correct understanding, then what we need is this sort of analysis. The example is character duration, controlling for radical and stroke duration.

Get short-form data (different columns for stroke, radical, and character duration):

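library(dplyr)  # %>%, select()
library(tidyr)  # pivot_wider(), drop_na()

# Short-form data: one duration column per level
Df2 <- readxl::read_excel("../data/SimplifiedAdultDatabase.xlsx",
                          sheet = "Hierarchy") %>%
  janitor::clean_names() %>%
  select(character, level, duration, phonogram, sro, z_regularity, log_homo_den) %>%
  pivot_wider(names_from = level, values_from = duration, names_prefix = "dur_") %>%
  janitor::clean_names() %>%  # e.g. dur_Character -> dur_character
  drop_na()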

model

m0 <- lm(dur_character ~ 1, data = Df2)
m1 <- lm(dur_character ~ dur_radical, data = Df2)
m2 <- lm(dur_character ~ dur_radical + dur_stroke, data = Df2)
m4 <- lm(dur_character ~ dur_radical + dur_stroke 
         + phonogram + sro + z_regularity + log_homo_den
         , data = Df2)

compare models

anova(m0, m1, m2, m4)

## Analysis of Variance Table
## 
## Model 1: dur_character ~ 1
## Model 2: dur_character ~ dur_radical
## Model 3: dur_character ~ dur_radical + dur_stroke
## Model 4: dur_character ~ dur_radical + dur_stroke + phonogram + sro + 
##     z_regularity + log_homo_den
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1   1139 476580623                                  
## 2   1138 453437800  1  23142823 63.037 4.855e-15 ***
## 3   1137 447104625  1   6333175 17.250 3.522e-05 ***
## 4   1133 415961783  4  31142842 21.207 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
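
To put the F tests on a more interpretable scale, the residual sums of squares in the table can be turned into the share of variance in character duration that each step adds. This is just arithmetic on the RSS column above; the object names (`rss`, `r2`, `delta_r2`) are mine.

```r
# RSS values copied from the anova() table above (Models 1 to 4)
rss <- c(m0 = 476580623, m1 = 453437800, m2 = 447104625, m4 = 415961783)

# R-squared of each model relative to the intercept-only model m0
r2 <- 1 - rss / rss[["m0"]]

# Increment in R-squared at each step: + dur_radical, + dur_stroke, + features
delta_r2 <- diff(r2)
round(delta_r2, 4)
```

On these numbers the increments are about 0.049 for radical duration, 0.013 for stroke duration, and 0.065 for the four psycholinguistic features once radical and stroke duration are already in the model.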

summarise final model

summary(m4)$coef %>% knitr::kable()

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	2117.6850917	119.7018073	17.6913377	0.0000000
dur_radical	0.4502129	0.0580543	7.7550311	0.0000000
dur_stroke	-2.7625446	0.6420009	-4.3030231	0.0000183
phonogram	319.3085127	52.3979333	6.0939143	0.0000000
sro	88.2272285	47.7738746	1.8467673	0.0650415
z_regularity	45.4359500	24.9134011	1.8237554	0.0684524
log_homo_den	-44.1389488	51.6508451	-0.8545639	0.3929733

You would also want to model radical duration, controlling for stroke duration, ideally with the data at the radical level and random effects for character. The same goes for stroke duration, ideally with random effects for character and (if this makes sense) for radical.
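
A rough sketch of what the radical-level model might look like, assuming lme4 is used for the random effects. Everything here is simulated: `sim_rad`, the three-radicals-per-character layout, the effect sizes, and the treatment of phonogram as a character-level predictor are assumptions for illustration, not properties of the real database.

```r
library(lme4)

# Simulated radical-level data: one row per radical, 3 radicals per character
set.seed(42)
n_char  <- 60
char_id <- rep(seq_len(n_char), each = 3)
sim_rad <- data.frame(character = factor(char_id))
sim_rad$dur_stroke <- rnorm(nrow(sim_rad), mean = 150, sd = 30)
sim_rad$phonogram  <- rnorm(n_char)[char_id]        # constant within character
char_effect        <- rnorm(n_char, sd = 40)[char_id]
sim_rad$dur_radical <- 300 + 1.5 * sim_rad$dur_stroke + 20 * sim_rad$phonogram +
  char_effect + rnorm(nrow(sim_rad), sd = 30)

# Radical duration controlling for stroke duration, random intercepts for character
m_base <- lmer(dur_radical ~ dur_stroke + (1 | character),
               data = sim_rad, REML = FALSE)

# Does a psycholinguistic feature add anything over and above stroke duration?
m_feat <- lmer(dur_radical ~ dur_stroke + phonogram + (1 | character),
               data = sim_rad, REML = FALSE)

anova(m_base, m_feat)  # likelihood-ratio test for the added feature
```

For the stroke-level model, the analogous lme4 syntax for radicals nested within characters would be `(1 | character/radical)`, given a stroke-level data set with a radical identifier.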