These are the analyses used for the “Choosing to go to work: Using reinforcement-based methods to balance animal welfare with research needs” manuscript in the Hackenberg Lab (Reed College).

For the most part, these analyses measure acquisition of the S+/S- cue discrimination. Each pigeon was paired with one color (acting as an S+ cue) which signaled that food was available in the box; trials with the other colors (S-) never had food available. The behavior that was tracked was box entry or non-entry and, hence, there were four distinct trial outcomes recorded as data:

  1. Entry during an S+ trial (correct)

  2. Non-entry during an S+ trial (incorrect)

  3. Entry during an S- trial (incorrect)

  4. Non-entry during an S- trial (correct)

Training Data (DBT)

First, let’s just see how many sessions each subject ran per phase of DBT:

PhaseNum P12 P16 P18 P22 P23 P5
0 12 10 10 10 11 10
1 22 21 23 22 23 24
2 13 13 13 13 13 13
3 10 10 10 10 9 10
4 36 36 36 36 37 36
5 46 46 44 46 46 46
6 11 11 11 11 11 11
7 19 20 19 20 19 19
8 33 33 32 33 33 33
9 39 39 39 39 39 39

Next, we can start to quantify our behavioral data. Our goal here was to measure the pigeons’ discriminability, or sensitivity to the presence of food (S+) versus its absence (S-). To do this, we first tried calculating the signal detection theory measure d’ from the hit rate (proportion of correct responses during S+ trials) and the false alarm rate (proportion of incorrect responses during S- trials) for each session, combined across birds to give us a population of trials. Using R, we first formatted our data into a tidy dataframe (a “tibble”) and used the dplyr package to group the data by session and compute these rates. We then calculated d’ for each session by applying the inverse normal distribution function (qnorm) to each rate and subtracting the false alarm value from the hit value. This approach let us quantify the pigeons’ ability to discriminate between S+ and S- trials across multiple training sessions.

Note that we’ve jittered any 0 or 1.0 rates to 0.01 and 0.99, respectively; qnorm(0) and qnorm(1) return negative and positive infinity, which would make d’ undefined.
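A minimal sketch of that calculation, assuming a tibble named `trials` with columns `session`, `trial_type` (“S+” or “S-”), and `entered` (TRUE/FALSE); these column names are illustrative, not necessarily the ones in our records:

```r
library(dplyr)

# Per-session d', pooled across birds. `trials`, `session`, `trial_type`,
# and `entered` are assumed names for the tidy trial-level data.
dprime_by_session <- trials %>%
  group_by(session) %>%
  summarise(
    hit_rate = mean(entered[trial_type == "S+"]),  # entries on S+ trials
    fa_rate  = mean(entered[trial_type == "S-"])   # entries on S- trials
  ) %>%
  mutate(
    # Clamp 0 and 1 to 0.01 and 0.99 so qnorm() stays finite
    hit_rate = pmin(pmax(hit_rate, 0.01), 0.99),
    fa_rate  = pmin(pmax(fa_rate, 0.01), 0.99),
    dprime   = qnorm(hit_rate) - qnorm(fa_rate)
  )
```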

We can also do this individually by subject, but this makes our grouped populations of data VERY small. The drawbacks show up as ceiling effects, session-to-session jumpiness, and an overall lack of coherent trend in the data.

As we can see, our d’ calculations really struggle to show learning performance. A big issue here is that our populations of signal (S+) and noise (S-) trials are very small, especially for per-subject data (even when we group sessions). Because the d’ metric depends on z-scores, it is much better suited to larger samples.

Instead, we can develop a Unified Performance Metric (UPM) that uses just single-session performance and accounts for both S+ and S- trials within each session. One possible metric is the difference between the proportion of correct responses on S+ and S- trials, weighted by the relative importance of each trial type. The weighting can be tuned to get a clearer picture of the learning data, but we have to be careful when performing any parametric statistics on it; any such tuning here is done purely for visualization purposes.
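As one plausible reading of that definition (the manuscript’s exact weighting may differ), here is a sketch that scores each session as a weighted combination of S+ accuracy (entry) and S- accuracy (non-entry), reusing the `trials` tibble assumed above:

```r
# One possible UPM: a weighted combination of S+ accuracy (entry) and
# S- accuracy (non-entry), per session. `w` sets the relative importance
# of S+ trials; this is an illustrative definition, not necessarily the
# manuscript's.
w <- 0.5
upm_by_session <- trials %>%
  group_by(session) %>%
  summarise(
    splus_acc  = mean(entered[trial_type == "S+"]),   # correct = entry
    sminus_acc = mean(!entered[trial_type == "S-"])   # correct = non-entry
  ) %>%
  mutate(upm = w * splus_acc + (1 - w) * sminus_acc)
```

With `w = 0.5` this reduces to the mean of S+ and S- accuracy; shifting `w` emphasizes one trial type over the other.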

Maintenance Data (DBR)

Next up, we can move into comparisons across the “Maintenance” phases of the experiment, where we used removal of the disco box as a tool to extract birds. First, we want to see whether any pattern emerges in UPM score across phases (shown in the boxplots below). Clearly, some phases feature high box entry (see WAm1, WAm7, and PAm12.5), while others do not (GR7, WAm4, PAm87.5). To understand what might be causing this, we can dive into two additional data points:

  1. Tokens Earned (TE) in the previous session - this indicates how much food the bird consumed the day before, perhaps correlating with motivation to enter the disco box.

  2. Weight - the pigeon’s weight compared to its baseline might show the same relationship.

First, let’s just see how many sessions each subject ran per experiment:

Exp P5 P12 P16 P18 P22 P23
1.1-GR7 5 5 5 5 5 5
1.2-WAm4 5 5 5 5 5 5
1.3-WAm1 5 5 5 5 5 5
1.4-WAm4 5 5 5 5 5 5
1.5-PAm87.5 5 5 5 5 5 5
1.6-PAm12.5 5 5 5 5 5 5
1.7-PAm2.5 5 5 5 5 5 5

Then look at overall UPM accuracy across experiments:

Which we can also color by subject and add a line to show continuity across experiments.

And we can make the plot colorblind-safe by also introducing symbols.
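For instance, a sketch of that double encoding with ggplot2, assuming a summary tibble `upm_by_exp` with columns `experiment`, `subject`, and `upm` (all assumed names); mapping subject to both color and shape keeps the series distinguishable without relying on hue alone:

```r
library(ggplot2)

# Color AND shape by subject so the series survive a colorblind rendering.
# `upm_by_exp` is an assumed summary table, not a variable from the repo.
ggplot(upm_by_exp, aes(x = experiment, y = upm,
                       color = subject, shape = subject, group = subject)) +
  geom_point(size = 2) +
  geom_line() +
  scale_color_viridis_d() +  # perceptually uniform, colorblind-friendly palette
  labs(x = "Experiment", y = "UPM accuracy")
```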

Or we can make them sequential…

Or look by subject

With 95% confidence intervals

Or standard error of the mean
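For reference, a sketch of how those error bars could be computed, assuming `upm_by_session` also carries an `experiment` column (an assumption on our part):

```r
# Per-experiment mean UPM with SEM and a t-based 95% CI half-width.
upm_summary <- upm_by_session %>%
  group_by(experiment) %>%
  summarise(
    mean_upm = mean(upm),
    sem      = sd(upm) / sqrt(n()),           # standard error of the mean
    ci95     = qt(0.975, df = n() - 1) * sem  # 95% CI half-width
  )
```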

Correlation between Tokens Exchanged and UPM in Maintenance Data

Just looking at the correlation between raw UPM scores and our tokens exchanged variable seems to reveal a negative relationship between TE and UPM, but the visualization is quite messy, largely due to clustering and ceiling effects at 100% UPM accuracy.

Note that tokens exchanged was not recorded for every one of our data points, so we’re dropping some data in this analysis.

One of the issues here might be that different subjects systematically earn different numbers of tokens each session; for example, P18 once earned 304 tokens in a session, while P22 never earned more than 200. Therefore, we can calculate a RANKING of sessions for each bird based upon the relative number of tokens earned. If each bird had 35 sessions here (five sessions in each of seven experiments), then we’d assign each session a rank (1-35) based upon the tokens earned in the prior session.

Note that the rank doesn’t always go up to 35 because we’re missing some “tokens exchanged” data in our records.
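A sketch of the ranking step, assuming a tibble `sessions` with columns `subject`, `upm`, and `tokens_prev` (tokens exchanged in the prior session); all three names are placeholders. The NA filter is exactly why the ranks don’t always reach 35:

```r
# Rank each bird's sessions by the previous session's tokens exchanged.
# Sessions with no token record are dropped, so ranks can top out below 35.
ranked <- sessions %>%
  filter(!is.na(tokens_prev)) %>%
  group_by(subject) %>%
  mutate(te_rank = rank(tokens_prev)) %>%  # 1 = smallest previous meal
  ungroup()
```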

Now a more normalized (and more “significant”) trend emerges! It looks like UPM decreases as a function of tokens-earned rank, which points to a negative relationship between DBR performance and a big previous meal. But we still have the visual issue of ceiling effects and clustering that we had before ranking.
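To attach a number to that trend, one natural option for ranked data is Spearman’s rho, sketched here on the assumed `ranked` table (the manuscript’s exact test may differ):

```r
# Spearman correlation between tokens-earned rank and UPM. Ties in the
# ranks will trigger a warning about exact p-values; rho is still reported.
cor.test(ranked$te_rank, ranked$upm, method = "spearman")
```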

To address the ceiling and clustering issue, let’s consolidate the colored data points by pooling all the session data within each experiment. We can preserve a semblance of the variance by scaling each point’s size to represent UPM variance.
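A sketch of that consolidation, assuming `ranked` from above also carries an `experiment` column:

```r
# Pool sessions within each experiment; point size encodes UPM variance.
exp_summary <- ranked %>%
  group_by(experiment) %>%
  summarise(
    mean_rank = mean(te_rank),
    mean_upm  = mean(upm),
    upm_var   = var(upm)
  )

ggplot(exp_summary, aes(x = mean_rank, y = mean_upm, size = upm_var)) +
  geom_point(alpha = 0.8) +
  labs(x = "Mean tokens-earned rank", y = "Mean UPM", size = "UPM variance")
```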

We can look at separate trends for each subject

Or colored by subject

Or color AND shape by subject

Correlation between Pigeon Weight and UPM in Maintenance Data

We can apply the same approach to look for a correlation between ranked pre-session pigeon weight (measured directly following box entry) and UPM performance, but see no such effect.

Same for post-weight trends…

We can also calculate weight gain during the following session, but don’t see a “significant” correlation.

Which holds true if we do a proportion normalization instead of a rank correlation.