By-item error analysis (brute force 2)

Overview

These analyses are based on re-running the candide unordered models (specifically at training_size == 300 and 1000) with by-item error data saved for both training and test items. Previous implementations of candide unordered (with the brute force machine teacher) only saved model-wise accuracy data. To understand the characteristics of successful training sets, we needed this by-item error data, which we didn’t have prior to 1.22.19. Note that this analysis currently covers only models trained on 300 and 1000 words.

For a more complete description of these item-level characteristics see the rpubs notebook here.

library(tidyverse) # provides %>%, the dplyr verbs, and ggplot2, all used below

load(file = './data/data_clean/by_item_error.rda')

Test error

There is potentially interesting variability in generalization (test) error across all items in the set.

ggplot(by_item_error, aes(test_error, fill = factor(training_size))) +
  geom_histogram() +
  labs(x = "test error", y = 'count', title = 'Test error across items', subtitle = '(0 = low error, 1 = high error)', fill = "training set size") +
  theme(plot.title = element_text(hjust = .5), plot.subtitle = element_text(hjust = .5))

by_item_error %>% 
  filter(!is.na(training_size)) %>%
  ggplot(aes(factor(training_size), test_error, fill = factor(training_size))) +
  geom_violin() +
  stat_summary(fun = median, geom = "point", size = 12, shape = "-") + # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
  labs(title = 'Test error by training set size', x = "training set size", subtitle = "medians shown as line", fill = "training set size") +
  theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 13, hjust = .5))

Regardless of training set size, there are many words that are never generalized to (test_error == 1). These represent the most difficult words for a given learner. For a training set size of 300, this amounts to 37 items, shown below. For reference, words are shown with their age of acquisition rating (Kuperman et al., 2012) and word frequency.

require(knitr)

by_item_error %>% filter(test_error == 1 & training_size == 300) %>%
  select(word, aoa = aoa_mean, frequency = freq) %>%
  kable(caption = "Words never generalized to with training set size == 300")

Words never generalized to with training set size == 300
word	aoa	frequency
school	3.890000	16989
times	6.700000	11221
one	3.227100	156684
choir	6.530000	271
quay	14.700000	1
zounds	NA	1
bas	9.000000	1
rolled	NA	432
the	3.983747	1501908
sure	4.850000	56091
was	NA	288391
schnook	12.910000	1
warmth	6.260000	227
chic	9.530000	119
yeah	NA	152262
view	5.630000	1965
angst	13.120000	47
rheum	17.380000	1
two	4.239515	54384
sixth	5.358500	551
gauche	15.500000	1
phrase	8.440000	464
scheme	9.650000	370
coup	12.060000	133
ache	5.790000	127
sword	5.450000	1335
klan	NA	1
rouge	12.330000	165
beau	12.232265	298
fixed	NA	1647
butte	NA	1
world	5.320000	23216
jinx	7.890000	204
what	3.855863	501965
quartz	9.280000	27
draught	12.530000	24
sphinx	10.220000	52

For models trained on 1000 words, 59 words were never generalized to. This set contains all 37 words from the 300-word condition plus 22 others, an interesting tradeoff given that the larger training set yields more never-generalized words rather than fewer.

by_item_error %>% filter(test_error == 1 & training_size == 1000) %>%
  select(word, aoa = aoa_mean, frequency = freq) %>%
  kable(caption = "Words never generalized to with training set size == 1000")

Words never generalized to with training set size == 1000
word	aoa	frequency
school	3.890000	16989
christ	NA	1
times	6.700000	11221
one	3.227100	156684
choir	6.530000	271
quay	14.700000	1
zounds	NA	1
do	3.600000	312915
bas	9.000000	1
rolled	NA	432
the	3.983747	1501908
sure	4.850000	56091
was	NA	288391
schnook	12.910000	1
warmth	6.260000	227
queue	12.170000	60
chic	9.530000	119
eighth	6.051205	350
yeah	NA	152262
view	5.630000	1965
angst	13.120000	47
egg	3.890000	1328
axe	6.110000	249
rheum	17.380000	1
two	4.239515	54384
sixth	5.358500	551
corps	11.560000	555
suite	9.370000	849
gauche	15.500000	1
hour	5.850000	8277
phrase	8.440000	464
scheme	9.650000	370
torque	13.290000	37
coup	12.060000	133
chord	9.710000	93
ache	5.790000	127
tech	12.900000	326
feud	10.330000	66
blitz	10.400000	64
sword	5.450000	1335
climb	5.300000	1007
klan	NA	1
czar	11.710000	35
have	3.720000	314232
rouge	12.330000	165
beau	12.232265	298
fixed	NA	1647
butte	NA	1
plaid	8.560000	82
lose	5.780000	8382
does	NA	34002
world	5.320000	23216
beige	7.740000	69
jinx	7.890000	204
smooth	5.610000	932
what	3.855863	501965
quartz	9.280000	27
draught	12.530000	24
sphinx	10.220000	52
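
The overlap between the two lists can be checked with set operations rather than by eye. The following is a minimal sketch on a small stand-in data frame (`toy` and the helper `never_generalized()` are hypothetical, not part of the original analysis); the same calls should work on by_item_error, assuming it has one row per word per training size.

```r
# Hypothetical stand-in for by_item_error: one row per word per training size.
toy <- data.frame(
  word          = c("quay", "rip", "egg", "quay", "rip", "egg"),
  training_size = c(300, 300, 300, 1000, 1000, 1000),
  test_error    = c(1, 0.4, 0.7, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Words with test_error == 1 in a given training-size condition.
never_generalized <- function(df, size) {
  df$word[df$training_size == size & df$test_error == 1]
}

never_300  <- never_generalized(toy, 300)
never_1000 <- never_generalized(toy, 1000)

shared    <- intersect(never_300, never_1000)  # never generalized in both conditions
only_1000 <- setdiff(never_1000, never_300)    # never generalized only at 1000
only_300  <- setdiff(never_300, never_1000)    # never generalized only at 300
```

With the real data, an empty `only_300` would confirm that the 300-word list is fully contained in the 1000-word list.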

Models trained on 300 words had no words that they always generalized to, while the more experienced models (those trained on 1000 words) had two: “rip” and “rack”.

Item-wise attributes

The following item-level characteristics are included in the visuals below:

aoa_num: age of acquisition (aoa) rating for a given word from the Kuperman et al. (2012) BRM norming study
test_error: mean error per word on the generalization set (0 = low error, 1 = high error)
test_freq: the number of times an item was included in the generalization testing process
hidden_neighb_size: the size of the neighborhood in latent space to which the item belongs
hidden_cluster: which cluster the item belongs to in latent space
hidden_spread: for each word, this indicates how spread out all the other words are from it in latent space
hidden_dist: the average distance in latent space between each item and every other in the set
phon_nighb_size: the size of the neighborhood in phonological space to which the item belongs
phon_cluster: which cluster the item belongs to in phonological space
phon_spread: for each word, this indicates how spread out all the other words are from it in phon space
phon_dist: the average distance in phon space between each item and every other in the set
orth_nighb_size: the size of the neighborhood in orthographic space to which the item belongs
orth_cluster: which cluster the item belongs to in orthographic space
orth_spread: for each word, this indicates how spread out all the other words are from it in orth space
orth_dist: the average distance in orth space between each item and every other in the set
orth_length: the orthographic length of each word
freq: the frequency of each word taken from a large corpus

There are some mild correlations (|r| between roughly .20 and .28) between test error (test_error) and four variables of interest: orthographic length (r = .23), mean pairwise orthographic distance (r = .26; computed as Manhattan distance), mean pairwise phonological distance (r = .20; also computed as Manhattan distance), and pairwise hidden layer spread (r = -.28; how spread out each word is from all other words).

by_item_error %>% 
  rename(`**TEST_ERROR**` = test_error) %>%
  mutate(aoa = as.numeric(aoa_mean)) %>%
  select(-c(word, aoa_mean, test_freq, training_size)) %>%
  cor(use = 'pairwise.complete.obs') %>%
  reshape2::melt() %>% # matrix method: one row per (Var1, Var2) pair, with the correlation in `value`
  ggplot(aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, digits = 2)), size = 2.5) +
  scale_fill_gradient2() +
  theme(axis.title = element_blank(), axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(fill = 'Correlation')
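
When only the relationships with test error matter, it can be easier to read a ranked list than the full heatmap. This is a minimal base R sketch on simulated data (the data frame `toy` and its coefficients are made up for illustration); it pulls the test_error column out of the correlation matrix and sorts it by absolute size.

```r
set.seed(1)
n <- 200

# Simulated stand-in with two numeric predictors (values are arbitrary).
toy <- data.frame(
  orth_length   = sample(3:8, n, replace = TRUE),
  hidden_spread = rnorm(n)
)
# Fake error scores bounded in (0, 1), loosely tied to both predictors.
toy$test_error <- plogis(0.3 * toy$orth_length - 0.8 * toy$hidden_spread - 1.5 + rnorm(n))

# Correlation of every column with test_error, dropping the self-correlation.
cors <- cor(toy, use = "pairwise.complete.obs")[, "test_error"]
cors <- cors[names(cors) != "test_error"]

# Strongest relationships first, regardless of sign.
cors[order(abs(cors), decreasing = TRUE)]
```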

Orthography

For bivariate relationships, we can look to orthography first. Here is test error by orthographic length, which is simply the string length of the item. Points are jittered slightly so that items sharing the same length do not sit exactly on top of one another. Note: all bivariate plots are rendered with loess lines.

by_item_error %>% filter(!is.na(training_size)) %>%
  ggplot(aes(orth_length, test_error, color = factor(training_size))) +
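    # jitter horizontally only: drawing geom_point() as well would plot every item twice,
    # and vertical jitter would push points outside ylim(0, 1), where they are dropped
    geom_jitter(width = .1, height = 0, size = 1) +
    geom_smooth(method = 'loess') +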
    labs(title = 'Correlation of test error with orthographic length', subtitle = '(r = .23)', color = "training set size") +
    ylim(0, 1) +
    theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))

The other orthographic variable that showed a mild correlation with test error was mean orthographic distance. This is computed as the average Manhattan distance in orthographic space between a given item and every other item in the set. You can think of this as a measure of how dissimilar an item is, on average, from the rest of the set (higher value = less similar).
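
As an illustration of the distance measure, here is a minimal base R sketch. The three-word matrix `reps` and its binary coding are hypothetical; the actual orthographic representations used by the models are not shown in this notebook.

```r
# Toy binary "orthographic" codes, one row per word (hypothetical encoding).
reps <- rbind(
  cat = c(1, 0, 1, 0),
  cot = c(1, 1, 1, 0),
  dog = c(0, 1, 0, 1)
)

# Full pairwise Manhattan distance matrix.
d <- as.matrix(dist(reps, method = "manhattan"))

# Mean distance from each word to every OTHER word (ignore the zero diagonal).
diag(d) <- NA
mean_dist <- rowMeans(d, na.rm = TRUE)
mean_dist
```

Here "dog" has the largest mean distance because it shares the fewest features with the other two words.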

by_item_error %>% filter(!is.na(training_size)) %>%
  ggplot(aes(orth_dist, test_error, color = factor(training_size))) +
    geom_point(size = 1) +
    geom_smooth(method = 'loess') +
    labs(title = 'Correlation of test error with average pairwise orth distance', subtitle = '(r = .26)', color = "training set size") +
    ylim(0, 1) +
    theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))

Phonology

In terms of phonology, the only correlation that shows up is with phonological distance, computed the same way as orthographic distance, but in phonological space.

by_item_error %>% filter(!is.na(training_size)) %>%
  ggplot(aes(phon_dist, test_error, color = factor(training_size))) +
    geom_point(size = 1) +
    geom_smooth(method = 'loess') +
    labs(title = 'Correlation of test error with average pairwise phon distance', subtitle = '(r = .20)', color = "training set size") +
    ylim(0, 1) +
    theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))

Hidden layer

There is a mild negative correlation between test error and hidden_spread. This variable measures how spread out the other words are from a given word in latent (hidden layer) space. It is calculated by taking the distances between a given word and all other words in the test set, then taking the standard deviation of those distances for that word.
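
The calculation just described can be sketched in a few lines of base R. The function name `spread_from()`, the toy matrix `hidden`, and the choice of Euclidean distance are all assumptions made for illustration; the notebook does not state which metric was used in latent space.

```r
# Per-word spread: SD of that word's distances to every other word.
# `method` is passed to dist(); Euclidean is an assumption, not the documented metric.
spread_from <- function(reps, method = "euclidean") {
  d <- as.matrix(dist(reps, method = method))
  diag(d) <- NA                    # drop each word's distance to itself
  apply(d, 1, sd, na.rm = TRUE)    # SD across one word's distances to the rest
}

# Toy 2-unit "hidden layer" activations for four words (hypothetical values).
hidden <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8), d = c(0, 1))
spread <- spread_from(hidden)
spread
```

A high value means the word is close to some items and far from others; a low value means it sits at a similar distance from everything else.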

ggplot(by_item_error, aes(hidden_spread, test_error, color = factor(training_size))) +
  geom_point(size = 1) +
  geom_smooth(method = 'loess') +
  labs(title = 'Correlation of test error with pairwise hidden layer spread', subtitle = '(r = -.28)', color = "training set size") +
  ylim(0, 1) +
  theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))

Modeling orthographic and phonological characteristics

We fit a multiple regression model with orthographic distance, phonological distance, hidden spread, and training size predicting generalization error. This linear model captures 26.2% of the overall variance. Training set size has the largest effect (partial eta-squared = .14, unique variance dR-sqr = .12), and the hidden spread measure also accounts for a non-trivial share (partial eta-squared = .09, dR-sqr = .07).

require(lmSupport)
require(lme4)

model_1 = lm(test_error ~ orth_dist + phon_dist + hidden_spread + training_size, data = by_item_error)
modelSummary(model_1)

## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread + 
##     training_size, data = by_item_error)
## Observations: 5762
## 
## Linear model fit by least squares
## 
## Coefficients:
##                 Estimate         SE      t Pr(>|t|)    
## (Intercept)    8.463e-01  6.700e-02  12.63   <2e-16 ***
## orth_dist      6.629e-02  6.230e-03  10.64   <2e-16 ***
## phon_dist      3.058e-02  3.005e-03  10.18   <2e-16 ***
## hidden_spread -2.143e-01  9.189e-03 -23.32   <2e-16 ***
## training_size -3.348e-04  1.094e-05 -30.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 486.3, Error df: 5757
## R-squared:  0.2618

modelEffectSizes(model_1)

## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread + 
##     training_size, data = by_item_error)
## 
## Coefficients
##                   SSR df pEta-sqr dR-sqr
## (Intercept)   13.4751  1   0.0270     NA
## orth_dist      9.5617  1   0.0193 0.0145
## phon_dist      8.7468  1   0.0177 0.0133
## hidden_spread 45.9402  1   0.0863 0.0697
## training_size 79.1130  1   0.1399 0.1201
## 
## Sum of squared errors (SSE): 486.3
## Sum of squared total  (SST): 658.8
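
The effect sizes in this table can be reproduced by hand from the sums of squares. The reported values are consistent with partial eta-squared = SSR / (SSR + SSE) and dR-sqr = SSR / SST; the sketch below simply plugs in the numbers printed above (small differences in the last digit are possible because SSE and SST are rounded in the output).

```r
# Values copied from the modelEffectSizes() output above.
sse <- 486.3   # sum of squared errors
sst <- 658.8   # total sum of squares
ssr <- c(orth_dist = 9.5617, phon_dist = 8.7468,
         hidden_spread = 45.9402, training_size = 79.1130)

r_squared   <- 1 - sse / sst       # overall variance explained
partial_eta <- ssr / (ssr + sse)   # effect relative to effect + error
delta_r2    <- ssr / sst           # unique share of total variance

round(r_squared, 4)
round(cbind(partial_eta, delta_r2), 4)
```

Partial eta-squared is always at least as large as dR-sqr because its denominator excludes variance explained by the other predictors.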

And the same model without training_size as a predictor, fit only to the models trained on 300 words (2881 observations).

require(lmSupport)
require(lme4)

by_item_error %>% filter(training_size == 300) %>%
  lm(test_error ~ orth_dist + phon_dist + hidden_spread, data = .) -> model
modelSummary(model)

## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread, 
##     data = .)
## Observations: 2881
## 
## Linear model fit by least squares
## 
## Coefficients:
##                Estimate        SE       t Pr(>|t|)    
## (Intercept)    0.734367  0.085627   8.576  < 2e-16 ***
## orth_dist      0.059176  0.008008   7.390 1.91e-13 ***
## phon_dist      0.042430  0.003862  10.986  < 2e-16 ***
## hidden_spread -0.231163  0.011809 -19.575  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 200.7, Error df: 2877
## R-squared:  0.2180

modelEffectSizes(model)

## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread, 
##     data = .)
## 
## Coefficients
##                   SSR df pEta-sqr dR-sqr
## (Intercept)    5.1312  1   0.0249     NA
## orth_dist      3.8098  1   0.0186 0.0148
## phon_dist      8.4192  1   0.0403 0.0328
## hidden_spread 26.7300  1   0.1175 0.1042
## 
## Sum of squared errors (SSE): 200.7
## Sum of squared total  (SST): 256.6

Here are the plotted model predictions for the additive model, showing the most salient effect: generalization error predicted by hidden_spread, with orth_dist and phon_dist held at their means. This trend is for training_size == 300 only.

by_item_error %>% filter(training_size == 300) %>%
  lm(test_error ~ orth_dist + phon_dist + hidden_spread, data = .) -> model


plot_data <- by_item_error %>% filter(training_size == 300)

# Prediction grid: vary hidden_spread across its observed range and
# hold the other two predictors at their means.
dNew <- data.frame(hidden_spread = seq(min(plot_data$hidden_spread), max(plot_data$hidden_spread), length.out = 100),
                   orth_dist = mean(plot_data$orth_dist),
                   phon_dist = mean(plot_data$phon_dist))
pY <- modelPredictions(model, dNew) # use modelPredictions() to get Y-hats with their standard errors

# plot
ggplot(plot_data, aes(hidden_spread, test_error)) + geom_point(color = 'grey49', size = .5) + 
  geom_smooth(data = pY, aes(ymin = CILo, ymax = CIHi, x = hidden_spread, y = Predicted), stat = "identity", color="black") +
  theme_bw(base_size = 14) + 
  labs(x = 'Variability in pairwise average distance in latent space', # indicate a label for the x-axis
       y = 'Generalization error') +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        axis.title.y = element_text(size = 20),
        axis.title.x = element_text(size = 16))

rm(plot_data, dNew, pY)
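
If lmSupport is not available, the same kind of prediction band can be obtained from base R. This is a sketch on simulated data (the data frame `toy` and its coefficients are invented for illustration); `predict()` with `interval = "confidence"` returns the fitted value and the lower and upper bounds, which play the role of Predicted, CILo, and CIHi above.

```r
set.seed(42)
n <- 150

# Simulated stand-in for the training_size == 300 subset (values are arbitrary).
toy <- data.frame(
  orth_dist     = rnorm(n, 10, 1),
  phon_dist     = rnorm(n, 20, 2),
  hidden_spread = runif(n, 0.5, 2.5)
)
toy$test_error <- 0.8 - 0.2 * toy$hidden_spread + rnorm(n, sd = 0.1)

fit <- lm(test_error ~ orth_dist + phon_dist + hidden_spread, data = toy)

# Vary hidden_spread over its range; hold the other predictors at their means.
grid <- data.frame(
  hidden_spread = seq(min(toy$hidden_spread), max(toy$hidden_spread), length.out = 100),
  orth_dist     = mean(toy$orth_dist),
  phon_dist     = mean(toy$phon_dist)
)

# 95% confidence band for the regression line: columns fit, lwr, upr.
ci   <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred <- cbind(grid, as.data.frame(ci))
head(pred)
```

The resulting `pred` can be passed to geom_smooth(stat = "identity") in the same way as pY, mapping y = fit, ymin = lwr, and ymax = upr.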