Overview

These analyses are based on re-running the candide unordered models (at training_size == 300 and 1000) with by-item error data saved for both training and test items. Previous implementations of candide unordered (with the brute force machine teacher) saved only model-wise accuracy data. To understand the characteristics of successful training sets, we needed by-item error data, which we didn’t have prior to 1.22.19. Note that this analysis currently covers only models trained on 300 and 1000 words.

For a more complete description of these item-level characteristics see the rpubs notebook here.

library(tidyverse)  # assumed source of %>%, the dplyr verbs, and ggplot2 used below
load(file = './data/data_clean/by_item_error.rda')

Test error

There is potentially interesting variability in generalization (test) error across all items in the set.

ggplot(by_item_error, aes(test_error, fill = factor(training_size))) +
  geom_histogram() +
  labs(x = "test error", y = 'count', title = 'Test error across items', subtitle = '(0 = low error, 1 = high error)', fill = "training set size") +
  theme(plot.title = element_text(hjust = .5), plot.subtitle = element_text(hjust = .5))

by_item_error %>% 
  filter(!is.na(training_size)) %>%
  ggplot(aes(factor(training_size), test_error, fill = factor(training_size))) +
  geom_violin() +
  stat_summary(fun = "median", geom = "point", size = 12, shape = "-") +  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
  labs(title = 'Test error by training set size', x = "training set size", subtitle = "medians shown as line", fill = "training set size") +
  theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 13, hjust = .5))

Regardless of training set size, many words are never generalized to (test error of 1). These are the most difficult words when they are not taught to a given learner. For a training set size of 300, this amounts to 37 items, shown below. For reference, words are shown with their age of acquisition rating (Kuperman et al., 2012) and word frequency.

require(knitr)

by_item_error %>% filter(test_error == 1 & training_size == 300) %>%
  select(word, aoa = aoa_mean, frequency = freq) %>%
  kable(caption = "Words never generalized to with training set size == 300")
Words never generalized to with training set size == 300
word      aoa         frequency
school    3.890000    16989
times     6.700000    11221
one       3.227100    156684
choir     6.530000    271
quay      14.700000   1
zounds    NA          1
bas       9.000000    1
rolled    NA          432
the       3.983747    1501908
sure      4.850000    56091
was       NA          288391
schnook   12.910000   1
warmth    6.260000    227
chic      9.530000    119
yeah      NA          152262
view      5.630000    1965
angst     13.120000   47
rheum     17.380000   1
two       4.239515    54384
sixth     5.358500    551
gauche    15.500000   1
phrase    8.440000    464
scheme    9.650000    370
coup      12.060000   133
ache      5.790000    127
sword     5.450000    1335
klan      NA          1
rouge     12.330000   165
beau      12.232265   298
fixed     NA          1647
butte     NA          1
world     5.320000    23216
jinx      7.890000    204
what      3.855863    501965
quartz    9.280000    27
draught   12.530000   24
sphinx    10.220000   52

For models trained on 1000 words, 59 words were never generalized to. This is an interesting tradeoff: despite the larger training set, more words are never generalized to than in the 300 word condition (37).

by_item_error %>% filter(test_error == 1 & training_size == 1000) %>%
  select(word, aoa = aoa_mean, frequency = freq) %>%
  kable(caption = "Words never generalized to with training set size == 1000")
Words never generalized to with training set size == 1000
word      aoa         frequency
school    3.890000    16989
christ    NA          1
times     6.700000    11221
one       3.227100    156684
choir     6.530000    271
quay      14.700000   1
zounds    NA          1
do        3.600000    312915
bas       9.000000    1
rolled    NA          432
the       3.983747    1501908
sure      4.850000    56091
was       NA          288391
schnook   12.910000   1
warmth    6.260000    227
queue     12.170000   60
chic      9.530000    119
eighth    6.051205    350
yeah      NA          152262
view      5.630000    1965
angst     13.120000   47
egg       3.890000    1328
axe       6.110000    249
rheum     17.380000   1
two       4.239515    54384
sixth     5.358500    551
corps     11.560000   555
suite     9.370000    849
gauche    15.500000   1
hour      5.850000    8277
phrase    8.440000    464
scheme    9.650000    370
torque    13.290000   37
coup      12.060000   133
chord     9.710000    93
ache      5.790000    127
tech      12.900000   326
feud      10.330000   66
blitz     10.400000   64
sword     5.450000    1335
climb     5.300000    1007
klan      NA          1
czar      11.710000   35
have      3.720000    314232
rouge     12.330000   165
beau      12.232265   298
fixed     NA          1647
butte     NA          1
plaid     8.560000    82
lose      5.780000    8382
does      NA          34002
world     5.320000    23216
beige     7.740000    69
jinx      7.890000    204
smooth    5.610000    932
what      3.855863    501965
quartz    9.280000    27
draught   12.530000   24
sphinx    10.220000   52

Models trained on 300 words showed no words that are always generalized to, while the more experienced models - those trained on 1000 words - had two words they always generalized to: “rip” and “rack”.
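
The counts of never- and always-generalized words come from filtering on the two extremes of test_error. Here is a minimal sketch of tallying both extremes per training set size in one pass. The data frame toy_error and all of its values are made up for illustration and only stand in for by_item_error.

```r
library(dplyr)

# Hypothetical stand-in for by_item_error (values invented for illustration)
toy_error <- data.frame(
  word          = c("rip", "quay", "the", "rip", "quay", "cat"),
  training_size = c(300, 300, 300, 1000, 1000, 1000),
  test_error    = c(0.2, 1, 1, 0, 1, 0.4)
)

# Count items at the ceiling (never generalized to) and at the floor
# (always generalized to) of test error, separately for each training size
extremes <- toy_error %>%
  group_by(training_size) %>%
  summarise(
    never_generalized  = sum(test_error == 1),
    always_generalized = sum(test_error == 0)
  )

print(extremes)
```

Applied to the real data, the same summary should reproduce the 37 and 59 counts reported above, assuming test_error is stored as exactly 0 or 1 at the extremes.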

Item-wise attributes

The following item-level characteristics are included in the visuals below:

There are some mild correlations (absolute values of roughly .20 to .28, as reported in the plot subtitles below) between test error (test_error) and four variables of interest: orthographic length, pairwise mean orthographic distance (computed as Manhattan distance), pairwise mean phonological distance (also computed as Manhattan distance), and pairwise hidden layer spread (how spread out each word is from all other words).

by_item_error %>% 
  rename(`**TEST_ERROR**` = test_error) %>%
  mutate(aoa = as.numeric(aoa_mean)) %>%
  select(-c(word, aoa_mean, test_freq, training_size)) %>%
  cor(use = 'pairwise.complete.obs') %>%
  reshape2::melt() %>%  # matrix method returns the Var1, Var2, value columns used below
  ggplot(aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, digits = 2)), size = 2.5) +
  scale_fill_gradient2() +
  theme(axis.title = element_blank(), axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(fill = 'Correlation')

Orthography

For bivariate relationships, we can look to orthography first. Here is test error by orthographic length, which is simply the string length of the item. Here the points are rendered as the items themselves to help see the words on the margins. Note: all bivariate plots are rendered with loess lines.

by_item_error %>% filter(!is.na(training_size)) %>%
  ggplot(aes(orth_length, test_error, color = factor(training_size))) +
    geom_jitter(width = .1, height = .1, size = 1) +  # jittered points only; a separate geom_point() would draw every item twice
    geom_smooth(method = 'loess') +
    labs(title = 'Correlation of test error with orthographic length', subtitle = '(r = .23)', color = "training set size") +
    ylim(0, 1) +
    theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))

The other orthographic variable that showed a mild correlation with test error was mean orthographic distance. This is computed as the average distance in orthographic (Manhattan) space between a given item and every other item in the set. You can think of this as a measure of how dissimilar (high value = less similar) an item is, on average, from every other item in the set.

by_item_error %>% filter(!is.na(training_size)) %>%
  ggplot(aes(orth_dist, test_error, color = factor(training_size))) +
    geom_point(size = 1) +
    geom_smooth(method = 'loess') +
    labs(title = 'Correlation of test error with average pairwise orth distance', subtitle = '(r = .26)', color = "training set size") +
    ylim(0, 1) +
    theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))
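
The mean orthographic distance described above can be sketched with base R. This is an illustration under assumptions, not the code used to build orth_dist: the four words and their binary feature vectors are hypothetical, and excluding each item's zero distance to itself is a choice the source does not specify.

```r
# Hypothetical orthographic feature vectors, one row per word
orth_vectors <- rbind(
  cat = c(1, 0, 1, 0),
  cot = c(1, 1, 1, 0),
  dog = c(0, 1, 0, 1),
  cut = c(1, 0, 1, 1)
)

# Symmetric matrix of pairwise Manhattan (city-block) distances
d <- as.matrix(dist(orth_vectors, method = "manhattan"))

# Mean distance from each word to every other word; the diagonal is zero,
# so dividing the row sum by n - 1 leaves the self-distance out of the average
orth_dist_toy <- rowSums(d) / (nrow(d) - 1)

print(orth_dist_toy)
```

In this toy set "dog" gets the largest value because it shares the fewest features with the other words. The same computation on phonological vectors would give the phonological analogue.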

Phonology

In terms of phonology, the only correlation that shows up is with phonological distance, computed the same way as orthographic distance, but in phonological space.

by_item_error %>% filter(!is.na(training_size)) %>%
  ggplot(aes(phon_dist, test_error, color = factor(training_size))) +
    geom_point(size = 1) +
    geom_smooth(method = 'loess') +
    labs(title = 'Correlation of test error with average pairwise phon distance', subtitle = '(r = .20)', color = "training set size") +
    ylim(0, 1) +
    theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))

There is a mild negative correlation between test error and “hidden_spread”. This variable measures how spread out the other words are from a given word in the test set. It is calculated by taking the pairwise distances between a given word and all other words in the test set, then taking the standard deviation of those distances for that word.

ggplot(by_item_error, aes(hidden_spread, test_error, color = factor(training_size))) +
  geom_point(size = 1) +
  geom_smooth(method = 'loess') +
  labs(title = 'Correlation of test error with pairwise hidden layer spread', subtitle = '(r = -.28)', color = "training set size") +
  ylim(0, 1) +
  theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))
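
The hidden_spread calculation described above can be sketched as follows. The activation values are hypothetical, and Euclidean distance is an assumption here, since the source does not state which metric was used in hidden-layer space.

```r
# Hypothetical hidden-layer activations: one row per test word, one column per hidden unit
hidden <- rbind(
  w1 = c(0, 0),
  w2 = c(3, 4),
  w3 = c(6, 8),
  w4 = c(0, 5)
)

# Pairwise distances between words (Euclidean is an assumption, see above)
d <- as.matrix(dist(hidden, method = "euclidean"))

# Drop each word's distance to itself before summarising
diag(d) <- NA

# Spread for a word = standard deviation of its distances to all other words
hidden_spread_toy <- apply(d, 1, sd, na.rm = TRUE)

print(hidden_spread_toy)
```

A word whose distances to the other words are all similar gets a low value, while a word that is close to some words and far from others gets a high value.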

Modeling orthographic and phonological characteristics

We fit a multiple regression model with orthographic distance, phonological distance, hidden spread, and training size predicting generalization error. This linear model captures 26.2% of the overall variance. Training set size has the largest effect (partial eta-squared = .14, delta R-squared = .12), but hidden spread also accounts for a non-trivial amount (partial eta-squared = .09, delta R-squared = .07).

require(lmSupport)
require(lme4)

model_1 = lm(test_error ~ orth_dist + phon_dist + hidden_spread + training_size, data = by_item_error)
modelSummary(model_1)
## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread + 
##     training_size, data = by_item_error)
## Observations: 5762
## 
## Linear model fit by least squares
## 
## Coefficients:
##                 Estimate         SE      t Pr(>|t|)    
## (Intercept)    8.463e-01  6.700e-02  12.63   <2e-16 ***
## orth_dist      6.629e-02  6.230e-03  10.64   <2e-16 ***
## phon_dist      3.058e-02  3.005e-03  10.18   <2e-16 ***
## hidden_spread -2.143e-01  9.189e-03 -23.32   <2e-16 ***
## training_size -3.348e-04  1.094e-05 -30.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 486.3, Error df: 5757
## R-squared:  0.2618
modelEffectSizes(model_1)
## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread + 
##     training_size, data = by_item_error)
## 
## Coefficients
##                   SSR df pEta-sqr dR-sqr
## (Intercept)   13.4751  1   0.0270     NA
## orth_dist      9.5617  1   0.0193 0.0145
## phon_dist      8.7468  1   0.0177 0.0133
## hidden_spread 45.9402  1   0.0863 0.0697
## training_size 79.1130  1   0.1399 0.1201
## 
## Sum of squared errors (SSE): 486.3
## Sum of squared total  (SST): 658.8
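
To make the two effect-size columns easier to read, the following sketch recomputes them from the sums of squares printed above. The formulas (delta R-squared = SSR / SST, partial eta-squared = SSR / (SSR + SSE)) are the standard definitions and reproduce the printed values to rounding. The variable names are illustrative.

```r
# Sums of squares copied from the modelEffectSizes(model_1) output
sse <- 486.3
sst <- 658.8
ssr <- c(orth_dist = 9.5617, phon_dist = 8.7468,
         hidden_spread = 45.9402, training_size = 79.1130)

# Overall proportion of variance explained by the full model
r_squared <- 1 - sse / sst

# Unique contribution of each predictor as a share of total variance
delta_r2 <- ssr / sst

# Contribution of each predictor relative to the variance left
# unexplained by the other predictors
partial_eta2 <- ssr / (ssr + sse)

print(round(r_squared, 4))
print(round(cbind(delta_r2, partial_eta2), 4))
```

Partial eta-squared is always at least as large as delta R-squared because its denominator excludes variance explained by the other predictors, which is why training_size shows .14 in one column and .12 in the other.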

Here is the same model without training_size as a predictor, fit only to the items from models trained on 300 words.

require(lmSupport)
require(lme4)

by_item_error %>% filter(training_size == 300) %>%
  lm(test_error ~ orth_dist + phon_dist + hidden_spread, data = .) -> model
modelSummary(model)
## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread, 
##     data = .)
## Observations: 2881
## 
## Linear model fit by least squares
## 
## Coefficients:
##                Estimate        SE       t Pr(>|t|)    
## (Intercept)    0.734367  0.085627   8.576  < 2e-16 ***
## orth_dist      0.059176  0.008008   7.390 1.91e-13 ***
## phon_dist      0.042430  0.003862  10.986  < 2e-16 ***
## hidden_spread -0.231163  0.011809 -19.575  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 200.7, Error df: 2877
## R-squared:  0.2180
modelEffectSizes(model)
## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread, 
##     data = .)
## 
## Coefficients
##                   SSR df pEta-sqr dR-sqr
## (Intercept)    5.1312  1   0.0249     NA
## orth_dist      3.8098  1   0.0186 0.0148
## phon_dist      8.4192  1   0.0403 0.0328
## hidden_spread 26.7300  1   0.1175 0.1042
## 
## Sum of squared errors (SSE): 200.7
## Sum of squared total  (SST): 256.6

Here are the plotted model predictions for the additive model, showing the most salient effect: generalization error predicted by hidden_spread. This trend is for training_size == 300 only.

by_item_error %>% filter(training_size == 300) %>%
  lm(test_error ~ orth_dist + phon_dist + hidden_spread, data = .) -> model


plot_data <- by_item_error %>% filter(training_size == 300)

# Grid of hidden_spread values spanning its observed range; the other predictors are held at their means
dNew <- data.frame(hidden_spread = seq(min(plot_data$hidden_spread), max(plot_data$hidden_spread), length.out = 100),
                   orth_dist = mean(plot_data$orth_dist), phon_dist = mean(plot_data$phon_dist))
pY <- modelPredictions(model, dNew)  # lmSupport: predictions with the CILo/CIHi bounds plotted below

# plot
ggplot(plot_data, aes(hidden_spread, test_error)) + geom_point(color = 'grey49', size = .5) + 
  geom_smooth(data = pY, aes(ymin = CILo, ymax = CIHi, x = hidden_spread, y = Predicted), stat = "identity", color="black") +
  theme_bw(base_size = 14) + 
  labs(x = 'Variability in pairwise average distance in latent space', # indicate a label for the x-axis
       y = 'Generalization error') +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        axis.title.y = element_text(size = 20),
        axis.title.x = element_text(size = 16))

rm(plot_data, dNew, pY)