Lymphocyte study

CLIENT

Janae Mudge

UPDATED

August 22, 2025

From client:

The purpose of this study is to determine if oral consumption of berberine (BBR) when paired with physical activity will increase T lymphocyte presence while decreasing the presence of myeloid-derived suppressor cells (MDSC) within the tumor microenvironment, spleen, bone marrow, and blood of the immunocompetent BALB/c 4T1 mammary adenocarcinoma model.

Study design. Mice were randomly allocated to groups defined by the factorial combinations of:

Response structure: physical samples were collected from blood, spleen tissue, lung tissue, and tumor tissue and four different lymphocyte frequencies were measured from each sample.

Requests:

  1. Visualize multivariate data using PCA
Note

Comments/clarifications needing client’s attention will appear like this.

Study data

I reorganized the raw data slightly to make it more analysis-friendly in two key respects:

  1. shorter column names
  2. factorial treatment structure encoded explicitly (rather than as a single ‘group’ variable)

After rearrangment the data look like so:

Table: example rows from study data
id bbr exercise cancer tissue l.cd4 l.foxp3 l.cd8
1 FALSE FALSE FALSE spleen 0.577 0.006 0.242
1 FALSE FALSE FALSE blood 0.128 0.0016 0.0621
1 FALSE FALSE FALSE lung 0.479 0.0276 0.152
1 FALSE FALSE FALSE tumor NA NA NA
2 FALSE FALSE FALSE spleen 0.767 0.0111 0.0629
2 FALSE FALSE FALSE blood NA NA NA

There are many missing values, but missigness does not appear to be at the mouse level as it’s a little uneven across tissue types – e.g., complete observations are available for the exercise-only group for lung tissue but not for blood samples.

Table: missing values by tissue type and treatment group
tissue bbr exercise cancer n.obs n.total pct.obs pct.miss
blood FALSE FALSE FALSE 2 3 0.6667 0.3333
blood TRUE FALSE FALSE 2 3 0.6667 0.3333
blood FALSE TRUE FALSE 3 4 0.75 0.25
blood TRUE TRUE FALSE 2 3 0.6667 0.3333
blood FALSE FALSE TRUE 7 8 0.875 0.125
blood TRUE FALSE TRUE 7 7 1 0
blood FALSE TRUE TRUE 5 6 0.8333 0.1667
blood TRUE TRUE TRUE 7 8 0.875 0.125
lung FALSE FALSE FALSE 2 3 0.6667 0.3333
lung TRUE FALSE FALSE 2 3 0.6667 0.3333
lung FALSE TRUE FALSE 4 4 1 0
lung TRUE TRUE FALSE 3 3 1 0
lung FALSE FALSE TRUE 6 8 0.75 0.25
lung TRUE FALSE TRUE 5 7 0.7143 0.2857
lung FALSE TRUE TRUE 6 6 1 0
lung TRUE TRUE TRUE 7 8 0.875 0.125
spleen FALSE FALSE FALSE 3 3 1 0
spleen TRUE FALSE FALSE 3 3 1 0
spleen FALSE TRUE FALSE 4 4 1 0
spleen TRUE TRUE FALSE 3 3 1 0
spleen FALSE FALSE TRUE 7 8 0.875 0.125
spleen TRUE FALSE TRUE 6 7 0.8571 0.1429
spleen FALSE TRUE TRUE 4 6 0.6667 0.3333
spleen TRUE TRUE TRUE 6 8 0.75 0.25
tumor FALSE FALSE TRUE 7 8 0.875 0.125
tumor TRUE FALSE TRUE 6 7 0.8571 0.1429
tumor FALSE TRUE TRUE 6 6 1 0
tumor TRUE TRUE TRUE 7 8 0.875 0.125

Note

Please review for accuracy and clarify reason for missing observations.

Since I’m not focusing on statistical modeling I’ll just ignore the missing values for now, but this is an important response feature that shouldn’t be overlooked in formal analysis (hypothesis tests, ANOVA, etc.).

Direct visualizations of florescence data

Although client’s request was to produce PCA visualizations of the multivariate response structure by treatment factors and tissue types, there are some direct visualizations possible since there are only four response variables.

One option is to display individual frequency observations for each response by tissue type and treatment group. This is shown below with group means overlaid. The treatment encoding uses the following convention:

  • first digit indicates cancer status (0 for absent and 1 for present)
  • second digit indicates BBR treatment (0 for no consumption and 1 for consumption)
  • third digit indicates exercise status (0 for no exercise and 1 for exercise)

For example: 1.0.0 would denote the 4T1 only group; 1.0.1 would denote the 4T1 + EX group.

Figure: lymphocyte frequencies by tissue type and treatment group with treatment group means shown as red triangles and tissue type means shown as blue dashed lines

This is the most fine-grained look at the data I can think of:

  • range and concentration of black points within row shows individual variability by group/tissue/response
  • range and concentration of red triangles within panel shows treatment group variability, with more diffuse group means being more strongly suggestive of an effect
  • relative position of red triangles and blue dashed lines shows whether frequencies appeared higher or lower than expected for each treatment group
  • samples size captured by number of points per row
  • (perhaps self-evident) no tumor tissue samples collected for mice without cancer

As another option, one could look at factorial effects, or the marginal differences between levels of each treatment factor averaged over the other treatment factors:

  • effect of cancer: estimated change in frequency when cancer is present, averaged over berberine and exercise
  • effect of berberine: estimated change in frequency when berberine is consumed, averaged over cancer status and exercise
  • effect of exercise: estimated change in frequency when mice exercise, averaged over cancer status and berberine consumption

In addition, one can consider interaction effects, which quantify how the effect of one factor changes in the presence of the other:

  • interaction of cancer and berberine: change in effect of berberine when cancer is present (or vice versa)
  • interaction of exercise and berberine: change in effect of berberine when combined with exercise (or vice versa)
  • interaction of exercise and cancer: change in effect of exercise when cancer is present (or vice versa)

Plots of these effects are shown below with pointwise 95% confidence intervals. The zero line is shown by the vertical dash; estimated effects to the right of the reference line indicate increases in mean frequency, and estimated effects to the left of the reference line indicate decreases in mean frequency. I have not shown tumor tissue because these samples were collected for only half of the treatment groups.

From this one can get a sense of the main effects for each response and tissue type – for example, in spleen tissue cancer is associated with an estimated increase in lmyphocyte frequency but is not associated with any change in lymphocyte/foxp3 frequency.

Note that the confidence intervals are not simultaneous, so this shouldn’t be used as a proxy for testing main and interaction effects. Few effects are likely significant judging from this plot. That said, looking at the point estimates:

  1. Blood

    • exercise and berberine have minimal estimated marginal effects on mean frequency across all three cell types
    • cancer is estimated to suppress mean CD4 and CD8 frequency but increase FOXP3 frequency
    • berberine in combination with exercise is estimated to suppress mean CD4 and CD8 frequency
    • when cancer is present, berberine is estimated to suppress mean CD8 frequency
    • when cancer is present, exercise is estimated to elevate mean CD8 frequency
  2. Lung tissue

    • all three factors are estimated to suppress mean FOXP3 frequency
    • berberine and cancer are estimated to suppress mean CD4 frequency
    • when cancer is present, berberine is estiamted to elevate mean CD4 frequency and (to a lesser extent) mean FOXP3 frequency
    • berberine in combination with exercise is estimated to elevate mean FOXP3 frequency
    • when cancer is present, exercise is estimated to elevate mean FOXP3 frequency
  3. Spleen tissue

    • exercise and berberine are estiamted to suppress mean FOXP3 frequency
    • presence of cancer is estimated to suppress mean CD4 frequency
    • exercise is estimated to suppress mean CD4 frequency
    • berberine in combination with exercise is estimated to suppress mean CD4 frequency and elevate mean FOXP3 frequency
    • when cancer is present, berberine is estimated to increase mean CD4 frequency
    • when cancer is present, exercise is estimated to suppress mean CD4 and CD8 frequencies
Note

Many of these effects will probably disappear after intervals are adjusted for multiple inference. Formal tests with an appropriate ANOVA model are recommended here.

PCA

Client requested some clarification on what principal component analysis (PCA) is/does. In statistical terms, it finds a low-dimensional representation of multivariate data that retains as much of the total variation in the original data as possible. In the context of this application, think of it as providing a way to explore frequency data in aggregate across cell types.

In technical detail, “principal components” are derived variables: \[ \begin{align*} \text{PC}1 &= (a_1 \times\text{CD4}) + (a_2 \times\text{CD8}) + (a_3 \times\text{FOXP3}) \\ \text{PC}2 &= (b_1 \times\text{CD4}) + (b_2 \times\text{CD8}) + (b_3 \times\text{FOXP3}) \\ \text{PC}3 &= (c_1 \times\text{CD4}) + (c_2 \times\text{CD8}) + (c_3 \times\text{FOXP3}) \end{align*} \] The variable “loadings” \(a, b, c\) are chosen to maximize the variance of the each component after accounting for previous components. Each component accounts for a certain share of total variance across the three cell types.

Only three components are possible since there are three variables considered. If all three PC’s are used, then the principal components are just a one-to-one transformation of the three cell type frequencies. When PCA is used principally for data visualization, only 2 components are used so that the data can be projected into an X-Y plane for plotting.

Principal component analysis (PCA) is easy to implement in R. There are a few different packages/functions that do this – below I’ve shown prcomp.

# compute first two principal components
pca_out <- prcomp(~ l.cd4 + l.cd8 + l.foxp3, 
                  data = raw,
                  center = T,
                  scale = T,
                  rank = 2)

Inspecting the resulting object shows the variable loadings:

# inspect sds and loadings
pca_out
Standard deviations (1, .., p=3):
[1] 1.1735320 0.9836247 0.8095091

Rotation (n x k) = (3 x 2):
               PC1        PC2
l.cd4   -0.6713582  0.2488316
l.cd8   -0.6832401  0.1571725
l.foxp3 -0.2871603 -0.9557090

The summary method for the resulting object shows the variance explained information.

# variance explained
summary(pca_out)
Importance of first k=2 (out of 3) components:
                          PC1    PC2
Standard deviation     1.1735 0.9836
Proportion of Variance 0.4591 0.3225
Cumulative Proportion  0.4591 0.7816

Jointly, the first two components capture 68.58% of the total (multivariate) variation in the data. Appending the values of the derived variables to the original data yields:

# append pc scores to treatment structure
pcdata <- bind_cols(drop_na(raw), as_tibble(pca_out$x))
id bbr exercise cancer tissue l.cd4 l.foxp3 l.cd8 PC1 PC2
1 FALSE FALSE FALSE spleen 0.577 0.006 0.242 -1.605 0.5874
1 FALSE FALSE FALSE blood 0.128 0.0016 0.0621 1.477 0.2618
1 FALSE FALSE FALSE lung 0.479 0.0276 0.152 -1.43 -2.475
2 FALSE FALSE FALSE spleen 0.767 0.0111 0.0629 -1.148 -0.1257
3 FALSE FALSE FALSE spleen 0.252 0.0069 0.156 0.1413 -0.09918
3 FALSE FALSE FALSE blood 0.358 0.005 0.349 -1.585 0.612

Note that missing values were dropped to perform PCA, and so must also be dropped from the original data when appending treatment structure.

Visualizations typically look for cluster separation according to treatment factors. For example, we could look for separation according to BBR by tissue type:

The vectors point to the group means and serve as a helpful proxy for assessing separation: comparing colors gives an approximate sense of the separation between microenvironments with respect to lymphocyte frequencies, whereas comparing solid/dash gives an approximate sense of the treatment effect.

Separating this out by tissue type and showing estimated (main) treatment effects across cell types yields the figure below.