These are the analyses used for the “Choosing to go to work: Using
reinforcement-based methods to balance animal welfare with research
needs” manuscript in the Hackenberg Lab (Reed College).
For the most part, these analyses measure acquisition of the S+/S- cue
discrimination. Each pigeon was paired with one color (acting as an S+
cue) which signaled that food was available in the box; trials with the
other color (S-) never had food available. The tracked behavior was box
entry or non-entry and, hence, there were four distinct trial outcomes
recorded as data (coded in the sketch after this list):
1) Entry during an S+ trial (correct)
2) Non-entry during an S+ trial (incorrect)
3) Entry during an S- trial (incorrect)
4) Non-entry during an S- trial (correct)
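In code, each trial’s outcome can be recovered from its cue type and the
entry response. Here’s a minimal sketch; the tibble and column names
(`trials`, `trial_type`, `entered`) are illustrative, not the lab’s
actual variable names:

```r
library(dplyr)
library(tibble)

# One illustrative trial of each type/response combination, all from a
# single session.
trials <- tibble(
  session    = 1,
  trial_type = c("S+", "S+", "S-", "S-"),
  entered    = c(TRUE, FALSE, TRUE, FALSE)
)

# A response is correct when entry status matches the cue: entry on S+
# trials, non-entry on S- trials.
trials <- trials %>%
  mutate(correct = (trial_type == "S+") == entered)
```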
Training Data (DBT)
First, let’s just see how many sessions each subject ran per phase of
DBT:
| PhaseNum | P12 | P16 | P18 | P22 | P23 | P5 |
|----------|-----|-----|-----|-----|-----|----|
| 0 | 12 | 10 | 10 | 10 | 11 | 10 |
| 1 | 22 | 21 | 23 | 22 | 23 | 24 |
| 2 | 13 | 13 | 13 | 13 | 13 | 13 |
| 3 | 10 | 10 | 10 | 10 | 9 | 10 |
| 4 | 36 | 36 | 36 | 36 | 37 | 36 |
| 5 | 46 | 46 | 44 | 46 | 46 | 46 |
| 6 | 11 | 11 | 11 | 11 | 11 | 11 |
| 7 | 19 | 20 | 19 | 20 | 19 | 19 |
| 8 | 33 | 33 | 32 | 33 | 33 | 33 |
| 9 | 39 | 39 | 39 | 39 | 39 | 39 |
Next, we can start to quantify our behavioral data. Our goal here was
to measure the pigeons’ discriminability of, or sensitivity to, the
presence of food (S+) versus its absence (S-). To achieve this, we first
tried calculating the signal detection theory measure d’ using the hit
rate (proportion of correct responses during S+ trials) and the false
alarm rate (proportion of incorrect responses during S- trials) for each
session, combined across birds to give us a population of trials. Using
R, we first formatted our data into a tidy dataframe (a “tibble”) and
employed the dplyr package to group the data by session and compute
these rates. We then calculated d’ for each session by applying the
inverse normal distribution function (qnorm) to the hit and false alarm
rates and taking the difference. This approach allowed us to quantify
the pigeons’ ability to discriminate between S+ and S- trials across
multiple training sessions.
Note that we’ve jittered any 0 or 1.0 rates to 0.01 and 0.99,
respectively, to avoid the infinite z-scores qnorm returns at exactly 0
and 1.
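As a sketch, the session-level pipeline might look like this (the
`trials` tibble continues from the earlier example; all names are
assumptions, not the lab’s actual code):

```r
library(dplyr)

# Hit rate = P(entry | S+); false alarm rate = P(entry | S-), per session.
dprime_by_session <- trials %>%
  group_by(session) %>%
  summarise(
    hit_rate = mean(entered[trial_type == "S+"]),
    fa_rate  = mean(entered[trial_type == "S-"]),
    .groups  = "drop"
  ) %>%
  mutate(
    # Jitter rates of 0 and 1 to 0.01 and 0.99 so qnorm() stays finite.
    hit_rate = pmin(pmax(hit_rate, 0.01), 0.99),
    fa_rate  = pmin(pmax(fa_rate, 0.01), 0.99),
    dprime   = qnorm(hit_rate) - qnorm(fa_rate)
  )
```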

We can also do this individually by subject, but this makes our
grouped populations of data VERY small. We can see the detriments of
this in the ceiling effects, jumpiness, and overall lack of coherent
trend in our data.

As we can see, our d’ calculations really struggle to show learning
performance. A big issue here is that our populations of signal (S+) and
noise (S-) trials are very small, especially for individual-subject data
(even when we group sessions). The d’ metric depends on z-scores, so it
is much better suited to larger samples of trials.
Instead, we can develop a Unified Performance Metric (UPM) that takes
just single-session performance and accounts for both S+ and S- trials
within each session. One possible metric is the difference between the
proportion of correct responses for S+ and S- trials, weighted by the
relative importance of each trial type. This weighting can be adjusted
a bit to get a sense of the learning data, but we have to be careful
when performing any parametric statistics here; any cherry-picking and
manipulation of our data is done purely for visualization purposes.
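As one sketch of such a metric, scored so that perfect discrimination
gives 100% (the manuscript’s exact weighting may differ; `w` is an
assumed weight on S+ trials):

```r
library(dplyr)

w <- 0.5  # assumed weight on S+ trials; equal weighting by default

# Weighted session accuracy: S+ accuracy is the entry rate on S+ trials,
# S- accuracy is the non-entry rate on S- trials. With w = 0.5 this is
# an affine rescaling of hit rate minus false alarm rate.
upm_by_session <- trials %>%
  group_by(session) %>%
  summarise(
    splus_acc  = mean(entered[trial_type == "S+"]),
    sminus_acc = mean(!entered[trial_type == "S-"]),
    .groups    = "drop"
  ) %>%
  mutate(upm = w * splus_acc + (1 - w) * sminus_acc)
```

Raising `w` emphasizes entering on S+ trials; lowering it emphasizes
withholding entry on S- trials.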

Maintenance Data (DBR)
Next up, we can move into comparisons across the “Maintenance” phases
of the experiment, where we used the disco box removal as a tool to
extract birds. First we want to see whether any pattern emerges in UPM
score across phases (shown in the boxplots below). Clearly, some phases
feature great box entry (see WAm1, WAm7, and PAm12.5), while others do
not (GR7, WAm4, PAm87.5). To understand what might be causing this, we
can dive into two additional data points:
1) Tokens Earned (TE) in the previous session: this indicates how much
food the bird consumed the day before, perhaps correlating with
motivation to enter the disco box.
2) Weight: the pigeon’s weight compared to its baseline might show the
same effect.
First, let’s just see how many sessions each subject ran per
experiment:
| Exp | P5 | P12 | P16 | P18 | P22 | P23 |
|-----|----|-----|-----|-----|-----|-----|
| 1.1-GR7 | 5 | 5 | 5 | 5 | 5 | 5 |
| 1.2-WAm4 | 5 | 5 | 5 | 5 | 5 | 5 |
| 1.3-WAm1 | 5 | 5 | 5 | 5 | 5 | 5 |
| 1.4-WAm4 | 5 | 5 | 5 | 5 | 5 | 5 |
| 1.5-PAm87.5 | 5 | 5 | 5 | 5 | 5 | 5 |
| 1.6-PAm12.5 | 5 | 5 | 5 | 5 | 5 | 5 |
| 1.7-PAm2.5 | 5 | 5 | 5 | 5 | 5 | 5 |
Then look at overall UPM accuracy across experiments:
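A plotting sketch (assuming a maintenance-phase tibble `maint` with
per-session `exp` and `upm` columns; names are illustrative):

```r
library(ggplot2)

# One box per experiment, showing the spread of session UPM scores.
ggplot(maint, aes(x = exp, y = upm)) +
  geom_boxplot() +
  labs(x = "Experiment", y = "UPM")
```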

Correlation between Tokens Exchanged and UPM in Maintenance Data
Just looking at the correlation between raw UPM scores and our tokens
exchanged variable seems to reveal a negative relationship between TE
and UPM, but the visualization is quite messy, largely due to
clustering and ceiling effects at 100% UPM accuracy.
Note that tokens exchanged was not recorded for every one of our data
points, so we’re dropping some data in this analysis.

One of the issues here might be that different subjects systematically
earn different numbers of tokens each session; for example, P18 once
earned 304 tokens in a session while P22 never earned more than 200.
Therefore, we can calculate a RANKING for each session (for each bird)
based upon the relative number of tokens earned. If each bird had 40
sessions here (five sessions in each of eight experiments), then we’d
assign each session a rank (1-40) based upon tokens earned in the prior
session.
Note that the rank doesn’t always go up to 40 because we’re missing
some “tokens exchanged” data in our records.
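A sketch of the ranking step (assuming `maint` also carries `subject`
and `tokens_prev`, the tokens earned in the prior session, with NA where
the record is missing):

```r
library(dplyr)

# Rank each bird's sessions by prior-session tokens earned, dropping
# sessions with no tokens-exchanged record.
ranked <- maint %>%
  filter(!is.na(tokens_prev)) %>%
  group_by(subject) %>%
  mutate(te_rank = min_rank(tokens_prev)) %>%
  ungroup()
```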

Now we see a more normalized (and more “significant”) trend emerge!
It looks like UPM decreases as a function of tokens-earned rank, which
points to a negative relationship between DBR performance and a big
previous meal. But we still have the visual issue of ceiling effects
and clustering that we had before ranking.
To address this, let’s consolidate our colored data points by
unifying all the session data across each experiment. We can maintain a
semblance of variance by changing the size of each point to represent
UPM variance.
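In code, the consolidation might look like this (continuing from
`ranked` above):

```r
library(dplyr)
library(ggplot2)

# Collapse sessions to one point per experiment, sized by UPM variance.
ranked %>%
  group_by(exp) %>%
  summarise(
    mean_rank = mean(te_rank),
    mean_upm  = mean(upm),
    upm_var   = var(upm),
    .groups   = "drop"
  ) %>%
  ggplot(aes(x = mean_rank, y = mean_upm, size = upm_var, color = exp)) +
  geom_point()
```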

Correlation between Pigeon Weight and UPM in Maintenance Data
We can apply the same approach to look for a correlation between ranked
pre-session pigeon weight (measured directly following box entry) and
UPM performance, but see no such effect.
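For a quick check of the rank relationship, a Spearman correlation is
one option (assuming a `weight_rank` column computed per bird the same
way as `te_rank` above):

```r
# Spearman rank correlation between pre-session weight rank and UPM.
cor.test(ranked$weight_rank, ranked$upm, method = "spearman")
```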

The same holds for post-session weight trends…

We can also calculate weight gain during the following session, but
don’t see a “significant” correlation.

This holds true if we use a proportion normalization instead of a rank
correlation.
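A sketch of that normalization (same assumed columns as before):

```r
library(dplyr)

# Scale each bird's tokens earned by that bird's own maximum, then
# correlate the resulting proportion with UPM (default Pearson test).
normalized <- maint %>%
  filter(!is.na(tokens_prev)) %>%
  group_by(subject) %>%
  mutate(te_prop = tokens_prev / max(tokens_prev)) %>%
  ungroup()

cor.test(normalized$te_prop, normalized$upm)
```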
