Overview

When a child learns to map printed language onto their existing spoken language knowledge (an essential and usually early component of learning to read), the characteristics of the printed words they are taught matter for how well that learning generalizes to new words and patterns encountered later in the language. Programs of instruction often rest on an intuitive theory of how the input to children ought to be structured: introduce simple units first and gradually (and exhaustively) increase the perceived complexity of the language for the sake of generalizing to novel printed words. Our approach is different and derived rather than intuited: using a machine teaching algorithm, we estimate the best next training item for a cognitive model of a reading child so as to optimize error descent in support of generalization. For a recent poster on this project see here. NOTE: The experiments in this poster involve a learner with a different number of hidden units than those discussed below, but the difference does not impact the results. The models described below were selected for this summary because they most closely relate to other similar modeling activities being undertaken by the lab.

Cognitive model

The cognitive model used here takes a monosyllabic orthographic wordform as input (printed words like “post”, “splash”, or “main”) and produces the spoken form of that word as output (i.e., /poʊst/, /splæʃ/, and /me:n/). This mapping occurs through a hidden layer of 100 units. The inputs and outputs are centered on their (first) vowel: the orthographic vowel in the case of the input layer (i.e., the “o” in “post” or the “a” in “main”) and the phonological vowel on the output (the /o/ in /poʊst/ and the /e/ in /me:n/).
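
To make the vowel-centered alignment concrete, here is a minimal sketch of one way such a slot encoding could be set up. The slot counts, padding character, and letter-based vowel detection are placeholders for illustration, not the lab's actual coding scheme.

```python
# Illustrative vowel-centered slot alignment (placeholder scheme,
# not the lab's actual input/output coding).
VOWEL_LETTERS = set("aeiou")

def vowel_centered_slots(word, n_onset=3, n_rest=5):
    """Place a word into fixed slots so its first vowel lands in the same
    position for every word, e.g. 'post', 'splash', and 'main' align on
    'o', 'a', and 'a' respectively."""
    v = next(i for i, ch in enumerate(word) if ch in VOWEL_LETTERS)
    onset = word[:v].rjust(n_onset, "_")   # pad the onset on the left
    rest = word[v:].ljust(n_rest, "_")     # vowel + coda, padded on the right
    return onset + rest

for w in ["post", "splash", "main"]:
    print(vowel_centered_slots(w))   # the vowel falls in the same slot each time
```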

We are interested in testing whether the order of training experiences provided to a neural network affects how quickly the training set is learned. Specifically, we are evaluating the Candide machine teacher as a candidate algorithm for discovering good training sequences. The Candide algorithm is as follows (a minimal code sketch follows the list).

Until the model is trained, do:

1. Loop over the training set. For each word in the training set:
    1. Train the network on that word, updating the weights based on the error for that single example.
    2. Test the network on all words in the training set.
    3. Store the error associated with this candidate word.
    4. Reset the network to the state just prior to training on this word.
2. Identify the word associated with the lowest error.
3. Train on that optimal word.
4. Write the error and word associated with this step to file.
5. Repeat.
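
As an illustration only, below is a minimal sketch of this selection loop using a scikit-learn estimator. The toy data, the MLPRegressor stand-in (rather than the MLPClassifier used in the actual implementation), the learning rate, and the squared-error loss are all assumptions made to keep the example self-contained and runnable.

```python
import copy
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Toy stand-ins for the orthographic inputs (X) and phonological targets (Y).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 30)).astype(float)
Y = rng.integers(0, 2, size=(50, 20)).astype(float)

# One hidden layer of 100 sigmoidal units, trained one example at a time.
net = MLPRegressor(hidden_layer_sizes=(100,), activation="logistic",
                   solver="sgd", learning_rate_init=0.1)
net.partial_fit(X[:1], Y[:1])   # initialize the weights

def candide_step(net, X, Y):
    """Return the index of the item whose single-example update most
    reduces error on the full training set, along with that error."""
    errors = []
    for i in range(len(X)):
        trial = copy.deepcopy(net)                 # snapshot = reset point
        trial.partial_fit(X[i:i+1], Y[i:i+1])      # train on this word only
        errors.append(mean_squared_error(Y, trial.predict(X)))  # test on all words
    best = int(np.argmin(errors))                  # word with the lowest error
    return best, errors[best]

for step in range(20):                               # "until the model is trained"
    best, err = candide_step(net, X, Y)
    net.partial_fit(X[best:best+1], Y[best:best+1])  # train on the optimal word
    print(step, best, round(err, 4))                 # log the error and the word
```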

Data shown in this summary come from applying this method in 30 separate implementations of 1,000 examples each. Each training set is a uniformly random sample from the full corpus (2,881 examples). The 30 training sets were fit with the same feed-forward neural network described above. The networks were trained incrementally, with batch size = 1, and each training example was chosen with the Candide algorithm. Models were composed and fit using the scikit-learn Python library, which allowed tighter integration between Candide and the model than was easily possible with Lens, the other modeling architecture used for subsequent simulations. See below for more details about implementation.

Clearly, the Candide algorithm is computationally intensive. Each training experience, which consists of a single example, is chosen only after evaluating the outcome of every candidate training experience in terms of error reduction on the full training set. In effect, this amounts to performing a pair of computations for every combination of training items. For a training set of 1,000 items, this means 1,000,000 combinations and 2,000,000 computations. Over the course of about 3 days, 30 models fit to different training sets of 1,000 examples were optimized using Candide to pick each example.

Results

The logistic loss at each training event is shown in the figure below.


What we want to know is whether the sequences underlying these learning trajectories look different from random sequences. One way to quantify this is in terms of the number of unique training experiences that the model has accumulated over the course of training. Early in training, it is highly likely that each new training experience will be unique, while later in training it is more likely that any new sample from the 1,000 examples in the training set will have been seen at least once before. This is true even of random samples.

However, the probability of selecting unique examples might differ under Candide and random training sequences. If it is better to sample broadly, then Candide might enforce that behavior and prevent resampling of previously seen examples. If, on the other hand, it helps to focus on certain subsets of the training set, then Candide may accumulate unique training experiences more slowly than a random sample.

In Figure 2, we present the number of unique examples accumulated over the training sequence for the 30 models run. Each subsequent point is the sum of previous points, plus one if the training experience associated with that point is unique, or plus zero if it is a repetition (the sketch below shows this calculation). Colored lines are the 30 Candide sequences, and the grey lines are 30 random sequences. Candide sequences accumulate unique training experiences more slowly than would be expected in a random sequence, beginning to diverge at around 300 training experiences. That is to say, the first 300 training experiences Candide selects are effectively random, while the remaining sequence appears to be more structured.
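
For reference, the cumulative unique count described above can be computed directly from a sequence of item indices; a minimal sketch follows, with a uniformly random baseline sequence as a stand-in for an actual training sequence.

```python
import numpy as np

def cumulative_unique(sequence):
    """Running count of distinct training items seen so far."""
    seen, counts = set(), []
    for item in sequence:
        seen.add(item)
        counts.append(len(seen))
    return np.array(counts)

# Example: a uniformly random baseline drawing from 1,000 items.
rng = np.random.default_rng(1)
random_seq = rng.integers(0, 1000, size=5000)
print(cumulative_unique(random_seq)[-1])   # unique items seen after 5,000 draws
```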

Comparing Candide sequences to alternatives

In addition to the Candide sequence, a variety of other training sequences were attempted, with condition labels given in parentheses. These include simulations that used the following (a sketch of how such sequences might be constructed follows the list)…

  1. Shuffled Candide sequence, maintaining which examples were trained and their frequencies, but eliminating Candide’s intended order (“Candide shuffled”)
  2. Uniformly random with replacement (“random unweighted”)
  3. Frequency-weighted random with replacement (“random frequency”)
  4. Uniformly random with replacement, but reducing the training set to only those items selected by Candide, disregarding both the frequency and order of those items (“random unweightedCU”)
  5. Frequency-weighted random with replacement, restricting the training set as in 4 (“random frequencyCU”)
  6. Permuted passes over the full training set. Here, sequences are constructed by sampling uniformly at random without replacement until all 1000 training examples have been used. Then, all items are replaced at once, and the sequence is continued by again sampling uniformly at random without replacement. This procedure is looped until a sequence of the desired length is constructed. If the total sequence length is divisible by the size of the training set, then all items are presented with identical frequency; if the sequence length is not divisible by the size of the training set, then only a subset of items will be sampled on the final pass through this loop, resulting in slightly unbalanced frequencies. (“Permuted wholesample”)
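
Below is a rough sketch of how sequences like these might be constructed. The frequency weights, sequence length, and the placeholder Candide sequence are illustrative assumptions, not the values used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, seq_len = 1000, 5000
freq = rng.pareto(1.5, size=n_items) + 1          # hypothetical word frequencies
p_freq = freq / freq.sum()

candide_seq = rng.integers(0, n_items, size=seq_len)   # placeholder for the real sequence

# 1. Candide shuffled: same items and frequencies, order destroyed.
candide_shuffled = rng.permutation(candide_seq)

# 2. Uniformly random with replacement.
random_unweighted = rng.integers(0, n_items, size=seq_len)

# 3. Frequency-weighted random with replacement.
random_frequency = rng.choice(n_items, size=seq_len, p=p_freq)

# 4. Uniform over only the items Candide selected (frequency and order ignored).
candide_pool = np.unique(candide_seq)
random_unweightedCU = rng.choice(candide_pool, size=seq_len)

# 5. Frequency-weighted over the Candide-selected pool.
p_pool = p_freq[candide_pool] / p_freq[candide_pool].sum()
random_frequencyCU = rng.choice(candide_pool, size=seq_len, p=p_pool)

# 6. Permuted whole-sample passes: shuffle all items, repeat, truncate.
n_passes = int(np.ceil(seq_len / n_items))
permuted_wholesample = np.concatenate(
    [rng.permutation(n_items) for _ in range(n_passes)])[:seq_len]
```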

These 6 conditions were modelled using Lens. To compare apples to apples, the Candide sequence obtained through the analysis above (using scikit-learn in Python) is run through exactly the same Lens protocol as these 6. In total, 7 sequences can be compared.

Measures

  • Unit-wise accuracy (adj_acc_unitwise): For each word, the number of output units that were on when they should have been on (hits), off when they should have been off (true negatives), on when they should have been off (false positives), and off when they should have been on (misses) are recorded. From this information, we compute the proportion of hits (relative to the number of units that should have been on) and the proportion of true negatives (relative to the number of units that should have been off). The sum of these two proportions divided by two is an accuracy measure adjusted for unequal numbers of units that should have been on or off (see the sketch after this list).
  • Proportion accurate (mean_acc): This is the proportion of words for which adj_acc_unitwise == 1; in other words, the number of words with perfectly accurate output activation patterns, divided by the total number of words.
  • Iterations to criterion (maxiter): This is the model iteration count at the moment the model hit the criterion (which was mean_acc > 0.9).
  • Rank distance to target indicates, on average, the number of times a given teacher produced a computed output whose closest target phonological (i.e., output) pattern was the correct one, relative to all the other teachers.
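
A minimal sketch of the first two measures, assuming binarized outputs at a 0.5 threshold (the threshold and the array shapes are assumptions for illustration):

```python
import numpy as np

def adj_acc_unitwise(target, output, threshold=0.5):
    """Balanced unit-wise accuracy for one word: the mean of the hit rate
    (over units that should be on) and the true-negative rate (over units
    that should be off)."""
    pred = (np.asarray(output) >= threshold).astype(int)
    target = np.asarray(target).astype(int)
    should_on, should_off = target == 1, target == 0
    hit_rate = pred[should_on].mean() if should_on.any() else 1.0
    tn_rate = (1 - pred[should_off]).mean() if should_off.any() else 1.0
    return (hit_rate + tn_rate) / 2

def mean_acc(targets, outputs):
    """Proportion of words whose output pattern is perfectly accurate."""
    scores = [adj_acc_unitwise(t, o) for t, o in zip(targets, outputs)]
    return float(np.mean(np.isclose(scores, 1.0)))
```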

All 7 sequences were pushed through a model (Lens) with 100 hidden units. We know from experimentation not reported here that a hidden layer of this size provides ample representational power. As shown below, the Candide-selected sequence performs reliably worse than any random sampling method on the test set.

Generalization

While performance on the training set is poorer in the Candide condition than others, the generalization patterns tell a different story.

Interim Conclusions

Our results suggest that there is a tradeoff between training set accuracy and generalization capacity across our key conditions. The permuted whole-sample-trained models outperform other conditions on training set accuracy. This is not surprising given that, in the permuted whole-sample condition, the model is guaranteed exposure to every word in the training set and is tasked with learning as much as possible on each trial. In a sense, this condition ‘crams for the test’, in that it learns a good deal about the training set in the time provided. On the other hand, the Candide-sequenced algorithm underperforms on test items, but outperforms other conditions on the generalization set. Ultimately, learning is about generalization and the efficiency with which we are able to transfer limited knowledge to novel forms. In this sense, Candide is doing well relative to other teaching strategies implemented in these models. While the performance of Candide relative to other teachers is only modestly better, the results suggest that we can push the algorithm, and indeed the concept, further.

Future directions

We think that Candide is promoting an undesirable traversal of the error landscape and getting caught in local minima during training. See related documentation about jittering the orthographic inputs in an autoencoding architecture (Jitternet) for more on this point. As in other simulations related to this sequence optimization procedure, we would like to test this method using more ecologically valid training sets. The 2,881-word corpus gives the model an opportunity to learn some words relevant to early reading development, but it contains many low-frequency words that are not common in children’s texts. We would like either to incorporate a larger corpus and implement a frequency-weighted, developmentally oriented feeding procedure, or to select a smaller set of kid-appropriate words for modeling. Ultimately, the algorithm proposed here should provide insight into the actual words that would be ideal for early reading instruction, so subsequent analyses (ideally with new implementations that have some of the enhanced features described here) would be aimed at understanding which words were selected and why. Finally, the current training procedure selects words based on an error calculation against a test set that actually contains the training set. Additional work could control for this by conditioning item choice on a test set that is separate from the set from which training items are drawn.

Other implementation details

Models were fit on the Manchester Computational Shared Facility (CSF), which enabled the 30 models to run in parallel and to return results frequently, so that progress could be easily evaluated during training. This second convenience would have been more difficult to achieve with HTCondor at Wisconsin (at least at that time). Sigmoidal transfer functions are used at each layer of the model, including the hidden layer. The Python implementation of each model uses MLPClassifier() under Python 3, and the Lens build is sourced from Jay McClelland’s lab at Stanford (documentation for which has since been removed, but can be accessed through Ted Gibson’s lab at MIT).
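
For concreteness, a configuration consistent with the description above might look like the following; the learning rate and other hyperparameter values shown here are assumptions for illustration, not necessarily those actually used.

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 100 units with logistic (sigmoidal) activation,
# trained incrementally with batch size 1 via stochastic gradient descent.
# Hyperparameter values here are illustrative assumptions.
net = MLPClassifier(hidden_layer_sizes=(100,), activation="logistic",
                    solver="sgd", batch_size=1, learning_rate_init=0.1,
                    max_iter=1, warm_start=True)
```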