These analyses are based on re-running the candide unordered models (at training_size == 300 and 1000) with by-item error data saved for both training and test items. Previous implementations of candide unordered (with the brute force machine teacher) saved only model-wise accuracy data. To understand the characteristics of successful training sets, we needed by-item error data, which we did not have prior to 1.22.19. Note that this analysis currently covers only models trained on 300 and 1000 words.
For a more complete description of these item-level characteristics see the rpubs notebook here.
library(tidyverse)  # provides the dplyr verbs, %>% and ggplot2 used throughout
load(file = './data/data_clean/by_item_error.rda')
There is potentially interesting variability in generalization (test) error across all items in the set.
ggplot(by_item_error, aes(test_error, fill = factor(training_size))) +
geom_histogram() +
labs(x = "test error", y = 'count', title = 'Test error across items', subtitle = '(0 = low error, 1 = high error)', fill = "training set size") +
theme(plot.title = element_text(hjust = .5), plot.subtitle = element_text(hjust = .5))
by_item_error %>%
filter(!is.na(training_size)) %>%
ggplot(aes(factor(training_size), test_error, fill = factor(training_size))) +
geom_violin() +
stat_summary(fun = "median", geom = "point", size = 12, shape = "-") +  # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
labs(title = 'Test error by training set size', x = "training set size", subtitle = "medians shown as line", fill = "training set size") +
theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 13, hjust = .5))
Regardless of training set size, many words are never generalized to (test_error == 1). These represent the most difficult words, which never get taught to a given learner. With a training set size of 300 this amounts to 37 items, shown below with their age of acquisition rating (Kuperman et al., 2012) and word frequency for reference.
require(knitr)
by_item_error %>% filter(test_error == 1 & training_size == 300) %>%
select(word, aoa = aoa_mean, frequency = freq) %>%
kable(caption = "Words never generalized to with training set size == 300")
word | aoa | frequency |
---|---|---|
school | 3.890000 | 16989 |
times | 6.700000 | 11221 |
one | 3.227100 | 156684 |
choir | 6.530000 | 271 |
quay | 14.700000 | 1 |
zounds | NA | 1 |
bas | 9.000000 | 1 |
rolled | NA | 432 |
the | 3.983747 | 1501908 |
sure | 4.850000 | 56091 |
was | NA | 288391 |
schnook | 12.910000 | 1 |
warmth | 6.260000 | 227 |
chic | 9.530000 | 119 |
yeah | NA | 152262 |
view | 5.630000 | 1965 |
angst | 13.120000 | 47 |
rheum | 17.380000 | 1 |
two | 4.239515 | 54384 |
sixth | 5.358500 | 551 |
gauche | 15.500000 | 1 |
phrase | 8.440000 | 464 |
scheme | 9.650000 | 370 |
coup | 12.060000 | 133 |
ache | 5.790000 | 127 |
sword | 5.450000 | 1335 |
klan | NA | 1 |
rouge | 12.330000 | 165 |
beau | 12.232265 | 298 |
fixed | NA | 1647 |
butte | NA | 1 |
world | 5.320000 | 23216 |
jinx | 7.890000 | 204 |
what | 3.855863 | 501965 |
quartz | 9.280000 | 27 |
draught | 12.530000 | 24 |
sphinx | 10.220000 | 52 |
For models trained on 1000 words, 59 words were never generalized to. This is an interesting tradeoff, since it is more than in the 300-word condition: all 37 words from the 300-word list reappear, along with 22 additional words.
by_item_error %>% filter(test_error == 1 & training_size == 1000) %>%
select(word, aoa = aoa_mean, frequency = freq) %>%
kable(caption = "Words never generalized to with training set size == 1000")
word | aoa | frequency |
---|---|---|
school | 3.890000 | 16989 |
christ | NA | 1 |
times | 6.700000 | 11221 |
one | 3.227100 | 156684 |
choir | 6.530000 | 271 |
quay | 14.700000 | 1 |
zounds | NA | 1 |
do | 3.600000 | 312915 |
bas | 9.000000 | 1 |
rolled | NA | 432 |
the | 3.983747 | 1501908 |
sure | 4.850000 | 56091 |
was | NA | 288391 |
schnook | 12.910000 | 1 |
warmth | 6.260000 | 227 |
queue | 12.170000 | 60 |
chic | 9.530000 | 119 |
eighth | 6.051205 | 350 |
yeah | NA | 152262 |
view | 5.630000 | 1965 |
angst | 13.120000 | 47 |
egg | 3.890000 | 1328 |
axe | 6.110000 | 249 |
rheum | 17.380000 | 1 |
two | 4.239515 | 54384 |
sixth | 5.358500 | 551 |
corps | 11.560000 | 555 |
suite | 9.370000 | 849 |
gauche | 15.500000 | 1 |
hour | 5.850000 | 8277 |
phrase | 8.440000 | 464 |
scheme | 9.650000 | 370 |
torque | 13.290000 | 37 |
coup | 12.060000 | 133 |
chord | 9.710000 | 93 |
ache | 5.790000 | 127 |
tech | 12.900000 | 326 |
feud | 10.330000 | 66 |
blitz | 10.400000 | 64 |
sword | 5.450000 | 1335 |
climb | 5.300000 | 1007 |
klan | NA | 1 |
czar | 11.710000 | 35 |
have | 3.720000 | 314232 |
rouge | 12.330000 | 165 |
beau | 12.232265 | 298 |
fixed | NA | 1647 |
butte | NA | 1 |
plaid | 8.560000 | 82 |
lose | 5.780000 | 8382 |
does | NA | 34002 |
world | 5.320000 | 23216 |
beige | 7.740000 | 69 |
jinx | 7.890000 | 204 |
smooth | 5.610000 | 932 |
what | 3.855863 | 501965 |
quartz | 9.280000 | 27 |
draught | 12.530000 | 24 |
sphinx | 10.220000 | 52 |
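The overlap between the two lists can be checked with set operations. The sketch below uses a small made-up data frame (hypothetical words `w1` to `w3` with invented error values) in place of `by_item_error`, assuming one row per word and training size condition.

```r
# Toy stand-in for by_item_error; words and error values are invented for illustration.
toy <- data.frame(
  word          = c("w1", "w2", "w3", "w1", "w2", "w3"),
  training_size = c(300, 300, 300, 1000, 1000, 1000),
  test_error    = c(1, 0.4, 0.9, 1, 0, 1)
)

# Words with test_error == 1 in each condition.
never_300  <- toy$word[toy$test_error == 1 & toy$training_size == 300]
never_1000 <- toy$word[toy$test_error == 1 & toy$training_size == 1000]

shared    <- intersect(never_300, never_1000)  # never generalized to in both conditions
only_1000 <- setdiff(never_1000, never_300)    # never generalized to only with 1000 words
only_300  <- setdiff(never_300, never_1000)    # never generalized to only with 300 words
```

Applied to the real data, an empty `only_300` would indicate that the larger training set did not rescue any of the words that failed at 300.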
Models trained on 300 words showed no words that are always generalized to, while the more experienced models - those trained on 1000 words - had two words they always generalized to: “rip” and “rack”.
The following item-level characteristics are included in the visuals below:
There are some mild correlations (absolute values of roughly .20 to .28) between test error (test_error) and four variables of interest: orthographic length, pairwise mean orthographic distance (computed as Manhattan distance), pairwise mean phonological distance (also computed as Manhattan distance), and pairwise hidden layer spread (how spread out each word is from all other words).
by_item_error %>%
rename(`**TEST_ERROR**` = test_error) %>%
mutate(aoa = as.numeric(aoa_mean)) %>%
select(-c(word, aoa_mean, test_freq, training_size)) %>%
cor(use = 'pairwise.complete.obs') %>%
reshape2::melt() %>%  # matrix method yields the Var1/Var2/value columns used below
ggplot(aes(Var1, Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, digits = 2)), size = 2.5) +
scale_fill_gradient2() +
theme(axis.title = element_blank(), axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(fill = 'Correlation')
For bivariate relationships, we can look at orthography first. Here is test error by orthographic length, which is simply the string length of the item. The points are rendered as the items themselves to help see the words on the margins. Note: all bivariate plots are rendered with loess lines.
by_item_error %>% filter(!is.na(training_size)) %>%
ggplot(aes(orth_length, test_error, color = factor(training_size))) +
geom_smooth(method = 'loess') +
geom_jitter(size = 1, width = .1, height = .1) +  # jitter alone, so each item is drawn once rather than twice
labs(title = 'Correlation of test error with orthographic length', subtitle = '(r = .23)', color = "training set size") +
ylim(0, 1) +
theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))
The other orthographic variable that showed a mild correlation with test error was mean orthographic distance. This is computed as the average distance in orthographic (Manhattan) space between a given item and every other item in the set. You can think of it as a measure of how dissimilar an item is, on average, from the rest of the set (higher value = less similar).
by_item_error %>% filter(!is.na(training_size)) %>%
ggplot(aes(orth_dist, test_error, color = factor(training_size))) +
geom_point(size = 1) +
geom_smooth(method = 'loess') +
labs(title = 'Correlation of test error with average pairwise orth distance', subtitle = '(r = .26)', color = "training set size") +
ylim(0, 1) +
theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))
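As a rough illustration of the mean orthographic distance measure, here is a minimal sketch on toy data. The binary feature codes are hypothetical, and excluding each word's zero distance to itself from the average is an assumption; the original computation may differ on both points.

```r
# Hypothetical binary orthographic codes: one row per word, one column per feature.
orth <- rbind(
  cat = c(1, 0, 1, 0),
  cot = c(1, 1, 1, 0),
  dog = c(0, 1, 0, 1)
)

# Full pairwise Manhattan distance matrix between words.
d <- as.matrix(dist(orth, method = "manhattan"))

# Mean distance from each word to every other word.
# The diagonal is zero, so dividing the row sum by (n - 1) excludes the self-distance.
orth_dist <- rowSums(d) / (nrow(d) - 1)
```

The same pattern would apply to phonological distance by swapping in phonological feature vectors.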
In terms of phonology, the only correlation that shows up is with phonological distance, computed the same way as orthographic distance, but in phonological space.
by_item_error %>% filter(!is.na(training_size)) %>%
ggplot(aes(phon_dist, test_error, color = factor(training_size))) +
geom_point(size = 1) +
geom_smooth(method = 'loess') +
labs(title = 'Correlation of test error with average pairwise phon distance', subtitle = '(r = .20)', color = "training set size") +
ylim(0, 1) +
theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))
There is a mild negative correlation between test error and “hidden_spread”. This variable measures how variable a word's distances to the other test words are in hidden layer space. It is calculated by taking the pairwise distances between a given word and all other words in the test set and computing the standard deviation of those values for that word.
ggplot(by_item_error, aes(hidden_spread, test_error, color = factor(training_size))) +
geom_point(size = 1) +
geom_smooth(method = 'loess') +
labs(title = 'Correlation of test error with pairwise hidden layer spread', subtitle = '(r = -.28)', color = "training set size") +
ylim(0, 1) +
theme(plot.title = element_text(size = 16, hjust = .5), plot.subtitle = element_text(size = 18, hjust = .5))
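The hidden_spread calculation described above can be sketched on toy data as follows. The activation values are made up, and the use of Euclidean distance is an assumption, since the distance metric for the hidden layer is not stated here.

```r
# Hypothetical hidden-layer activations: one row per test word, one column per hidden unit.
hidden <- rbind(
  w1 = c(0, 0),
  w2 = c(3, 4),
  w3 = c(6, 8),
  w4 = c(0, 5)
)

d <- as.matrix(dist(hidden))  # Euclidean distance (an assumption)
diag(d) <- NA                 # drop each word's distance to itself

# Standard deviation of each word's distances to all other words.
hidden_spread <- apply(d, 1, sd, na.rm = TRUE)
```

Under this definition a low value means the word sits at a similar distance from all other words, while a high value means some words are much closer to it than others.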
We fit a multiple regression model with orthographic distance, phonological distance, hidden spread, and training size predicting generalization error. This linear model captures 26.2% of the overall variance. Training set size has the largest effect (partial eta-squared = .14, uniquely explaining 12% of the total variance), but the hidden spread measure also captures a non-trivial amount (partial eta-squared = .09, 7% unique variance).
require(lmSupport)
require(lme4)
model_1 = lm(test_error ~ orth_dist + phon_dist + hidden_spread + training_size, data = by_item_error)
modelSummary(model_1)
## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread +
## training_size, data = by_item_error)
## Observations: 5762
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) 8.463e-01 6.700e-02 12.63 <2e-16 ***
## orth_dist 6.629e-02 6.230e-03 10.64 <2e-16 ***
## phon_dist 3.058e-02 3.005e-03 10.18 <2e-16 ***
## hidden_spread -2.143e-01 9.189e-03 -23.32 <2e-16 ***
## training_size -3.348e-04 1.094e-05 -30.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 486.3, Error df: 5757
## R-squared: 0.2618
modelEffectSizes(model_1)
## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread +
## training_size, data = by_item_error)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 13.4751 1 0.0270 NA
## orth_dist 9.5617 1 0.0193 0.0145
## phon_dist 8.7468 1 0.0177 0.0133
## hidden_spread 45.9402 1 0.0863 0.0697
## training_size 79.1130 1 0.1399 0.1201
##
## Sum of squared errors (SSE): 486.3
## Sum of squared total (SST): 658.8
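For readers unfamiliar with the two effect size columns, the following sketch recomputes them from the sums of squares printed above. The variable names are illustrative, and the formulas are the standard definitions of partial eta-squared and delta R-squared, which reproduce the printed values to rounding.

```r
# Values copied from the modelEffectSizes(model_1) output above.
sse <- 486.3    # sum of squared errors of the full model
sst <- 658.8    # total sum of squares
ssr <- c(orth_dist = 9.5617, phon_dist = 8.7468,
         hidden_spread = 45.9402, training_size = 79.1130)

r_squared   <- 1 - sse / sst       # overall R-squared of the model
delta_r2    <- ssr / sst           # unique share of total variance (dR-sqr)
partial_eta <- ssr / (ssr + sse)   # partial eta-squared (pEta-sqr)

round(rbind(delta_r2, partial_eta), 4)
```

Partial eta-squared is always at least as large as delta R-squared because its denominator leaves out the variance explained by the other predictors.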
Here is the same model without training_size as a predictor, fit only to items from the training_size == 300 condition.
require(lmSupport)
require(lme4)
by_item_error %>% filter(training_size == 300) %>%
lm(test_error ~ orth_dist + phon_dist + hidden_spread, data = .) -> model
modelSummary(model)
## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread,
## data = .)
## Observations: 2881
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) 0.734367 0.085627 8.576 < 2e-16 ***
## orth_dist 0.059176 0.008008 7.390 1.91e-13 ***
## phon_dist 0.042430 0.003862 10.986 < 2e-16 ***
## hidden_spread -0.231163 0.011809 -19.575 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 200.7, Error df: 2877
## R-squared: 0.2180
modelEffectSizes(model)
## lm(formula = test_error ~ orth_dist + phon_dist + hidden_spread,
## data = .)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 5.1312 1 0.0249 NA
## orth_dist 3.8098 1 0.0186 0.0148
## phon_dist 8.4192 1 0.0403 0.0328
## hidden_spread 26.7300 1 0.1175 0.1042
##
## Sum of squared errors (SSE): 200.7
## Sum of squared total (SST): 256.6
Here are the plotted model predictions for the additive model, showing the most salient effect: generalization error predicted by hidden_spread, with orth_dist and phon_dist held at their means. This trend is for training_size == 300 only.
by_item_error %>% filter(training_size == 300) %>%
lm(test_error ~ orth_dist + phon_dist + hidden_spread, data = .) -> model
plot_data <- by_item_error %>% filter(training_size == 300)
dNew <- data.frame(hidden_spread=seq(min(plot_data$hidden_spread),max(plot_data$hidden_spread),length=100),
orth_dist = mean(plot_data$orth_dist), phon_dist = mean(plot_data$phon_dist)) #creating data frame for predictor values, first two numbers are range of predictor
pY <- modelPredictions(model,dNew) #use modelPredictions() to get standard error of Y-hats
# plot
ggplot(plot_data, aes(hidden_spread, test_error)) + geom_point(color = 'grey49', size = .5) +
geom_smooth(data = pY, aes(ymin = CILo, ymax = CIHi, x = hidden_spread, y = Predicted), stat = "identity", color="black") +
theme_bw(base_size = 14) +
labs(x = 'Variability in pairwise average distance in latent space', # indicate a label for the x-axis
y = 'Generalization error') +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.title.y = element_text(size = 20),
axis.title.x = element_text(size = 16))
rm(plot_data, dNew, pY)