L. parviflorus gene expression

CLIENT

Weston Gronor

UPDATED

July 31, 2025

Project summary

Background from client:

I am studying a California native plant that has different distribution of flower color when growing in different soil types. I believe that you have already met with my collaborator Magdalene Lo. She has grown these plants in different soil types to measure fitness and survival. I have taken the plant samples she has grown and am doing genetic analysis on them.

Research questions from client:

How does gene expression of anthocyanin biosynthetic genes (the types of genes I am studying) differ in different flower colors of L. parviflorus (species) when grown in different soil types?
How does gene expression differ between flower colors when grown in the same soil type?

Study design

Plants with white and pink flowers were grown in four soil types. There were 4 trays of the control, 5 trays of drought (water limitation), 5 trays of magnesium, and 6 trays of MgxDr (interaction). Each tray had both white and pink flower morphs. Client does not have information about how many were in each tray or how individuals were allocated to trays, only that several hundred seeds were planted for each flower color. Client reports “there were 12 families of pink flowers and 6 families of white flowers crossed so there are about double the number of pink seeds planted.”

Client collected individual plants of each flower color from each soil type. With respect to soil type, some individuals were collected from different trays but some were not. The individuals were chosen based on the family. For the plants that were collected, the client sampled tissue from three locations: flower, leaf, and root. These tissue samples were combined across individuals to form pooled samples from which RNA was extracted. The RNA samples were divided in two – the ‘replicate’ column identifies these subsamples – and six genes were amplified independently via qPCR to obtain gene expression data.

Sample data

Sample data provided:

Code

# read in and preview data
raw <- read_csv('sample-data.csv') |> rename_with(tolower)
head(raw) |> pander()

tissue	color	treatment	ct	absolute	replicate
leaf	pink	control	27.8	3.575	1
leaf	pink	control	27.42	3.693	2
leaf	white	control	26.2	4.065	1
leaf	white	control	26.18	4.07	2
leaf	pink	mgxdr	28.21	3.449	1
leaf	pink	mgxdr	27.66	3.619	2

The responses are:

CT values: number of PCR cycles until concentration threshold is crossed
absolute concentration: back-computed from CT values using qPCR standard

Save for small differences in the conversion factor, these are scalar multiples of each other.

Client request

Questions:

What type of analysis is best? It will be some kind of comparison of means but what exactly makes the most sense.
How can we potentially bring multiple tissue types into the analysis so we do not have to multiply the analyses by 3?
How can we control for the fact that the conversion factor may be slightly different between each run?
Some kind of R or JMP template!
[added 7/30] Plots of the differences between pink and white morphs in different soil conditions.

Recommendations

Treatment coding

Since the soil types themselves have factorial structure, the treatment structure should be recorded like so:

treatment	drought	magnesium
control	N	N
mg	N	Y
drought	Y	N
mgxdr	Y	Y

If balanced groups of seeds were randomly allocated by family to soil types and grown in such a way that the same number of individuals of each family were allowed to mature, then the treatment allocation is randomized with respect to soil type and family (not color) and the observational unit is a plant. If so, then the treatment allocation is also randomized with respect to soil type and color, but the design is not balanced. Client should confirm with collaborator as this will be relevant for peer/committee review.

Pseudoreplication

Because there is just one physical sample of each tissue type per treatment combination, there is no replication in this study. Treatments were not independently allocated to the RNA subsamples, so there is not sufficient data to estimate the observational variation in gene expression levels. This significantly constrains possible analyses of study data, as ANOVA cannot be performed. The values of the responses (CT value and absolute concentration) should be averaged across pseudoreplicates.

Code

# recode treatments and average pseudoreplicates
clean <- raw |>
  mutate(drought = if_else(treatment %in% c('drought', 'mgxdr'), 'Y', 'N') |> factor(levels = c('N', 'Y')),
         magnesium = if_else(treatment %in% c('mg', 'mgxdr'), 'Y', 'N') |> factor(levels = c('N', 'Y')),
         color = factor(color),
         tissue = factor(tissue)) |>
  group_by(drought, magnesium, color, tissue) |>
  summarize(across(.cols = c(ct, absolute), .fns = list(mean = mean), .names = '{.col}.{.fn}'),
            .groups = 'drop')
head(clean) |> pander()

drought	magnesium	color	tissue	ct.mean	absolute.mean
N	N	pink	flower	26.14	3.674
N	N	pink	leaf	27.61	3.634
N	N	pink	root	31.48	2.396
N	N	white	flower	25.43	3.912
N	N	white	leaf	26.19	4.068
N	N	white	root	30.93	2.569

Analysis strategy

Interaction plots

Visualization can be used to identify possible effects focusing on the difference in expression levels between pink and white flowers. Examples of complete interaction plots (no aggregation across any treatment factors) are shown below. The client should consider examining similar plots with aggregation to explore marginal effects of treatment factors.

Code

# interaction contrasts (absolute concentration)
clean |>
  mutate(magnesium = fct_recode(magnesium, magnesium = 'Y', control = 'N'),
         drought = fct_recode(drought, drought = 'Y', control = 'N')) |>
  ggplot(aes(x = color, y = absolute.mean, color = magnesium)) +
  geom_point() +
  facet_grid(tissue ~ drought) +
  geom_path(aes(group = magnesium)) +
  guides(color = guide_legend(title = NULL)) +
  theme_bw() +
  labs(title = 'absolute concentration')

Code

# interaction contrasts (ct values)
clean |>
  mutate(magnesium = fct_recode(magnesium, magnesium = 'Y', control = 'N'),
         drought = fct_recode(drought, drought = 'Y', control = 'N')) |>
  ggplot(aes(x = color, y = ct.mean, color = magnesium)) +
  geom_point() +
  facet_grid(tissue ~ drought) +
  geom_path(aes(group = magnesium)) +
  guides(color = guide_legend(title = NULL)) +
  theme_bw() +
  labs(title = 'ct values')

I note the graphics are mirror images, so conclusions should not differ regardless of whether CT values or absolute concentrations are used. My recommendation is to choose just one response to analyze; examining both will multiply your results without adding information.

Contrasts

The research questions pertain to comparing gene expression levels between pink and white flower phenotypes under various conditions. These are referred to as contrasts and are estimated via arithmetic differences between the corresponding averages (P - W, P/dr - W/dr, etc.). The recommended analysis thus focuses on contrasts for the color phenotype, and interprets interaction effects accordingly.

The codes provided use lm and emmeans to compute contrasts, but they’re not really fitting a model, since the treatment mean “estimates” are really just the observations.

Code

# "model" to use for computing contrasts of interest
fit <- lm(absolute.mean ~ drought*magnesium*color*tissue, data = clean)
fit.emm <- emmeans(fit, specs = ~ color*drought*magnesium | tissue)

Research question 1

How does gene expression of anthocyanin biosynthetic genes (the types of genes I am studying) differ in different flower colors of L. parviflorus (species) when grown in different soil types?

I’ll rephrase this as: does the difference in gene expression level between color phenotypes depend on soil type? And then I’ll construe this as whether there are significant interaction effects involving color.

The factorial effects of color, drought, and magnesium are the differences between ‘high’ and ‘low’ levels of each factor conditioned on the levels of other factors in various constellations.

The main effects of each are the differences between the average response at each level of the factor, averaged across the other factors. There are three main effects:

color: the difference in average gene expression between samples from pink and white flowers, averaged across soil type
magnesium: the difference in average gene expression between samples from magnesium and non-magnesium soils, averaged across color and drought type
drought: the difference in average gene expression between samples from drought and non-drought soils, averaged across color and magnesium

The interaction effects then characterize the dependence of the differences above on the levels of other factors. There are four interaction effects:

color/magnesium: the difference in the effect of color between levels of the magnesium factor, averaged across drought type
color/drought: the difference in the effect of color between levels of the drought type, averaged across levels of the magnesium factor
magnesium/drought: the difference in the effect of magnesium between levels of the drought type, averaged across color
color/magnesium/drought: the difference in the color/magnesium interaction between levels of the drought type

The interaction effects are symmetric in the sense that the color/magnesium effect is the same as the magnesium/color effect. Because the research question focuses on color phenotypes, contrasts are interpreted primarily in terms of the effect of color when applicable.

The estimated factorial effects for each tissue type are as follows:

Code

# wrap in function for use with emmeans
fac.eff.emmc <- function(levels, ...){
  data_grid(clean, magnesium, drought, color) |>
    mutate(mg = 0.25*c(rep(-1, 4), rep(1, 4)),
           dr = 0.25*rep(c(rep(-1, 2), rep(1, 2)), 2),
           cl = 0.25*rep(c(1, -1), 4),
           clxdr = 0.5*(4*cl)*(4*dr),
           clxmg = 0.5*(4*cl)*(4*mg),
           drxmg = 0.5*(4*mg)*(4*dr),
           clxdrxmg = (4*cl)*(4*dr)*(4*mg)) |>
    arrange(magnesium, drought, color) |>
    select(-(1:3))
}

# point estimates of factorial effects
fit.fac.eff <- fit.emm |> 
  contrast('fac.eff') |>
  tidy() |>
  select(tissue, contrast, estimate)

# examine
fit.fac.eff |> 
  spread(tissue, estimate) |>
  pander(caption = 'Estimates of factorial effects by tissue type.')

Estimates of factorial effects by tissue type.
contrast	flower	leaf	root
cl	0.8575	0.07569	0.1034
clxdr	0.0145	-0.5297	0.3753
clxdrxmg	-3.389	-1.768	-1.172
clxmg	0.4824	0.6649	-0.4079
dr	-0.4589	-0.1424	-0.1894
drxmg	2.57	1.531	0.1463
mg	0.721	-0.02505	0.2525

We can’t answer the question of whether color interactions are significant the usual way but R. V. Lenth (1989) provides a clever method of testing the hypotheses \(H_{0i}: \theta_i = 0\) vs \(H_{Ai}: \theta_i \neq 0\) for each factorial effect \(\theta_i, i = 1, \dots, k\). The unrepx package provides an implementation that generates p-values for these tests from the vector of interaction effects (R. Lenth 2022).

The full set of results for all three tissue types is below. Remember that these are separate analyses shown jointly. Lenth’s method assumes that estimates for factorial effects are independent, which is only true here within one tissue type; since tissue samples of different types are collected from the same individuals, the independence does not hold across tissue types.

Code

# apply to all three tissue types
fit.fac.eff |> group_by(tissue) |> 
  nest(data = c(contrast, estimate)) |>
  mutate(test = map(data, ~eff.test(.x$estimate, method = 'Lenth'))) |>
  mutate(rslt = map2(data, test, ~inner_join(.x, .y, join_by(estimate == effect)))) |>
  select(tissue, rslt) |>
  unnest(everything()) |>
  pander(caption = 'Inference for factorial effects using Lenth\'s method.')

Inference for factorial effects using Lenth’s method.
tissue	contrast	estimate	Lenth_PSE	t.ratio	p.value	simult.pval
flower	mg	0.721	0.9026	0.799	0.3684	0.9904
flower	dr	-0.4589	0.9026	-0.508	0.6728	1
flower	cl	0.8575	0.9026	0.95	0.2901	0.9346
flower	clxdr	0.0145	0.9026	0.016	0.9893	1
flower	clxmg	0.4824	0.9026	0.534	0.6577	1
flower	drxmg	2.57	0.9026	2.848	0.0365	0.1752
flower	clxdrxmg	-3.389	0.9026	-3.755	0.0221	0.1045
leaf	mg	-0.02505	0.7946	-0.032	0.9791	1
leaf	dr	-0.1424	0.7946	-0.179	0.8797	1
leaf	cl	0.07569	0.7946	0.095	0.9371	1
leaf	clxdr	-0.5297	0.7946	-0.667	0.5789	1
leaf	clxmg	0.6649	0.7946	0.837	0.3463	0.9792
leaf	drxmg	1.531	0.7946	1.927	0.0773	0.348
leaf	clxdrxmg	-1.768	0.7946	-2.225	0.0564	0.2553
root	mg	0.2525	0.3315	0.762	0.3912	0.9953
root	dr	-0.1894	0.3315	-0.571	0.6368	1
root	cl	0.1034	0.3315	0.312	0.7936	1
root	clxdr	0.3753	0.3315	1.132	0.2221	0.822
root	clxmg	-0.4079	0.3315	-1.231	0.1932	0.7517
root	drxmg	0.1463	0.3315	0.441	0.7136	1
root	clxdrxmg	-1.172	0.3315	-3.535	0.0249	0.119

I’m choosing to look at the unadjusted p-values, as the F tests in ANOVA are not usually adjusted for multiplicity. However, the client may wish to consider simultaneous inference adjustments when looking at results across genes.

For this gene, the color/drought/magnesium interactions are significant for all tissue types according to the unadjusted p-values, suggesting that the effect of color depends on the combination of soil types.

However, somewhat counterintuitively, the main effects and two-way interactions involving color are not significant. This is addressed to some extent by the plot below: the expression level differs by color phenotype less when averaged across soil type (solid black) than when disaggregated, and even then, expression levels for pink and white color phenotypes only differ in certain soil combinations. The client should consider carefully how to explain this result.

Code

# visualize marginal effect of color
color.means <- clean |>
  group_by(color, tissue) |>
  summarize(absolute.mean = mean(absolute.mean))
ggplot(clean, 
       aes(x = color, y = absolute.mean, 
           color = magnesium, linetype = drought)) +
  facet_wrap(~tissue) +
  geom_point(alpha = 0.5) +
  geom_path(aes(group = interaction(drought, magnesium)),
            alpha = 0.5) +
  geom_point(data = color.means, 
             inherit.aes = F,
             aes(x = color, y = absolute.mean),
             shape = 17, size = 4) +
  geom_path(data = color.means,
            inherit.aes = F,
            aes(x = color, y = absolute.mean, group = tissue),
            linewidth = 1.25) +
  theme_bw()

Research question 2

How does gene expression differ between flower colors when grown in the same soil type?

This can be approached analogous to the analysis given above, but using specific interaction contrasts that compare expression levels between pink and white color phenotypes separately for each combination of the drought/magnesium factors in place of all factorial effects. This analysis is shown below.

Code

# specify contrasts of interest
contr.fn.emmc <- function(levels, ...){
  data_grid(clean, color, magnesium, drought) |>
    mutate('p-w.ctrl' = c(1, 0, 0, 0, -1, 0, 0, 0),
           'p-w.dr' = c(0, 1, 0, 0, 0, -1, 0, 0),
           'p-w.mg' = c(0, 0, 1, 0, 0, 0, -1, 0),
           'p-w.dxm' = c(0, 0, 0, 1, 0, 0, 0, -1)) |>
    arrange(magnesium, drought, color) |>
    select(starts_with('p'))
}
fit.contr <- fit.emm |> 
  contrast('contr.fn') |>
  tidy() |>
  select(tissue, contrast, estimate) |>
  mutate(contrast = factor(contrast, 
                           levels = paste('p-w', c('ctrl', 'mg', 'dr', 'dxm'), sep = '.')))

# examine
fit.contr |> 
  spread(tissue, estimate) |>
  pander(caption = 'Estimates of P-W contrasts by tissue type and soil treatment combination.')

Estimates of P-W contrasts by tissue type and soil treatment combination.
contrast	flower	leaf	root
p-w.ctrl	-0.2382	-0.4339	-0.1732
p-w.mg	1.939	1.115	0.004789
p-w.dr	1.471	-0.07968	0.788
p-w.dxm	0.2587	-0.2987	-0.2058

Code

# apply lenth's method to estimated contrasts
fit.contr |> group_by(tissue) |> 
  nest(data = c(contrast, estimate)) |>
  mutate(test = map(data, ~eff.test(.x$estimate, method = 'Lenth'))) |>
  mutate(rslt = map2(data, test, ~inner_join(.x, .y, join_by(estimate == effect)))) |>
  select(tissue, rslt) |>
  unnest(everything()) |>
  pander(caption = 'Inference using Lenth\'s method.')

Inference using Lenth’s method.
tissue	contrast	estimate	Lenth_PSE	t.ratio	p.value	simult.pval
flower	p-w.ctrl	-0.2382	1.297	-0.184	0.8881	1
flower	p-w.dr	1.471	1.297	1.134	0.1767	0.6135
flower	p-w.mg	1.939	1.297	1.495	0.0969	0.3484
flower	p-w.dxm	0.2587	1.297	0.199	0.8793	1
leaf	p-w.ctrl	-0.4339	0.5494	-0.79	0.3804	0.9483
leaf	p-w.dr	-0.07968	0.5494	-0.145	0.9102	1
leaf	p-w.mg	1.115	0.5494	2.029	0.0493	0.1723
leaf	p-w.dxm	-0.2987	0.5494	-0.544	0.6408	1
root	p-w.ctrl	-0.1732	0.2598	-0.667	0.5255	1
root	p-w.dr	0.788	0.2598	3.033	0.0267	0.0944
root	p-w.mg	0.004789	0.2598	0.018	0.9895	1
root	p-w.dxm	-0.2058	0.2598	-0.792	0.378	0.9456

Gene expression levels differ significantly between pink and white color phenotypes in flower tissue and leaf tissue in the magnesium soil and in root tissue in the drought soil. Visually:

Code

# plot estimated contrasts above
fit.contr |> 
  group_by(tissue) |> 
  nest(data = c(contrast, estimate)) |>
  mutate(test = map(data, ~eff.test(.x$estimate, method = 'Lenth'))) |>
  mutate(rslt = map2(data, test, ~inner_join(.x, .y, join_by(estimate == effect)))) |>
  select(tissue, rslt) |>
  unnest(everything()) |>
  mutate(lwr = estimate - 2*Lenth_PSE,
         upr = estimate + 2*Lenth_PSE) |>
  select(tissue, contrast, estimate, lwr, upr) |>
  mutate(contrast = fct_recode(contrast,
                               "control" = "p-w.ctrl",
                               "magnesium" = "p-w.mg",
                               "drought" = "p-w.dr",
                               "both" = "p-w.dxm")) |>
  arrange(contrast, tissue) |>
  ggplot(aes(x = contrast, y = estimate, color = tissue)) +
  geom_point(position = position_dodge(width = 0.5)) +
  geom_errorbar(aes(ymin = lwr, ymax = upr), 
                width = 0.1, 
                position = position_dodge(width = 0.5)) +
  geom_path(aes(group = tissue), position = position_dodge(width = 0.5)) +
  theme_bw() +
  labs(x = "soil type", y = "estimated (P - W) difference")

If this is too messy, here is the same in separate panels:

Code

# same as above but faceted
fit.contr |> 
  group_by(tissue) |> 
  nest(data = c(contrast, estimate)) |>
  mutate(test = map(data, ~eff.test(.x$estimate, method = 'Lenth'))) |>
  mutate(rslt = map2(data, test, ~inner_join(.x, .y, join_by(estimate == effect)))) |>
  select(tissue, rslt) |>
  unnest(everything()) |>
  mutate(lwr = estimate - 2*Lenth_PSE,
         upr = estimate + 2*Lenth_PSE) |>
  select(tissue, contrast, estimate, lwr, upr) |>
  mutate(contrast = fct_recode(contrast,
                               "control" = "p-w.ctrl",
                               "magnesium" = "p-w.mg",
                               "drought" = "p-w.dr",
                               "both" = "p-w.dxm")) |>
  arrange(contrast, tissue) |>
  ggplot(aes(x = contrast, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), 
                width = 0.1) +
  geom_path(aes(group = tissue)) +
  facet_grid(~tissue) +
  theme_bw() +
  labs(x = "soil type", y = "estimated (P - W) difference")

These plots show the estimated difference in gene expression levels (concentration values) between pink and white flowers by soil type.

Summary

Due to pseudoreplication, ANOVA should not be performed. The approach above employs the method of R. V. Lenth (1989) for analysis of unreplicated factorial designs. This analysis pipeline should be easily transferrable to the data from the other five genes with minor to no adjustments to code.

Returning to the original questions:

What type of analysis is best? It will be some kind of comparison of means but what exactly makes the most sense.

There is no replication in the study, so analyses should be largely limited to visual assessments. R. V. Lenth (1989) provides a method of analyzing data from unreplicated factorial designs; this approach can be used to address the main research questions and example implementations are given above.

How can we potentially bring multiple tissue types into the analysis so we do not have to multiply the analyses by 3?

It is recommended to carry out analyses separately by tissue type and compare qualitatively as originally planned. Under different circumstances and with proper design considerations, effects associated with tissue type could be estimated along with other factors in the ANOVA framework; client is recommended to consult with a statistician at design stage on experimental design and analysis plan if this is of interest in future.

How can we control for the fact that the conversion factor may be slightly different between each run?

Small variation in the conversion factors between samples shouldn’t affect conclusions. It is also worth noting that the conversions need not be uniform if they are part of the measurement process. In this study, direct comparison of CT values and absolute concentrations indicates near-identical trends with respect to the experimental factors.

Some kind of R or JMP template!

A copy of the source .qmd for this document is provided for client’s use.

References

Lenth, Russell. 2022. Unrepx: Analysis and Graphics for Unreplicated Experiments. https://CRAN.R-project.org/package=unrepx.

Lenth, Russell V. 1989. “Quick and Easy Analysis of Unreplicated Factorials.” Technometrics 31 (4): 469–73.