Levering the Rodene et al., 2022 UAV dataset as we used in lab11 to test a hypothesis regarding the relationship between heritability and fitness (approximately equal to yield in an agriculture setting. In this case, canopy coverage on different dates can be considered as a proxy to reflect different levels of fitness):
The github repository for this homework can be found at the following: https://github.com/Mlien89/Agro932_HW2
My hypothesis is that canopy coverage can affect fitness in a positive manner, meaning genotypes that have a higher canopy coverage at early growth stages will perform better(higher yields). This produces greater leaf area for the plant to photosynthesize and produce more engery for plant growth and development. But it is germplasm dependent and some inbreds will perform better than other’s due to the nature of how the germplasm performs in a given environment.
July6-august12 –> heritability can vary greater at this point due to genotypes and environment. fitness is affected greatly
August14-30 –> heritability is on average highest, where oversaturation occurs due from the full canopy experienced in corn plots. canopy coverage values stay high and therefore heritability is on average high. fitness is affected during pollination and grain fill reproductive stages.
August30+ –> heritability begins to drop due to plant maturity rates and senesence. nitrogen uptake begins to slow and completly stop at maturity.
#load necessary packages and data from lab11
library(tidyr)
library(dplyr)
library(broom)
library(ggplot2)
library(ggpmisc)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "replValueSp"; definition not updated
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "mMatrix"; definition not updated
#read in the ppj220030-su-0002-tables1.csv file from lab11
df <- read.csv("C:/Users/u942451/OneDrive - University of Nebraska-Lincoln/school/PhD/2022-agro932-lab/data/ppj220030-sup-0002-tables1.csv")
#view frequency of days and number of observations
table(df$date)
##
## Aug12 Aug14 Aug16 Aug20 Aug22 Aug23 Aug26 Aug30 July6 Sept1 Sept3 Sept5
## 878 878 878 878 878 878 877 878 878 878 878 878
#add replication
df$Rep <- "Rep2"
df[df$Row < 3000,] $Rep <- "Rep1"
#view first part of data frame
head(df)
## Row Row.Numbers Genotype Pedigree Treatment date Canopy_Coverage Rep
## 1 1001 1001 and 1002 A641 Ames 19311 Nitrogen July6 29.747379 Rep1
## 2 1001 1001 and 1002 A641 Ames 19311 Nitrogen Aug23 15.537977 Rep1
## 3 1001 1001 and 1002 A641 Ames 19311 Nitrogen Sept1 15.189726 Rep1
## 4 1001 1001 and 1002 A641 Ames 19311 Nitrogen Aug22 19.037283 Rep1
## 5 1001 1001 and 1002 A641 Ames 19311 Nitrogen Aug12 9.793073 Rep1
## 6 1001 1001 and 1002 A641 Ames 19311 Nitrogen Aug26 21.378869 Rep1
the following linear equation applies to the produced ANOVAs for this assignment.
\(p_{ijk} = \mu + g_i + t_i + g_i x t_i + r_k + e_{ijk}\),
where \(p_{ijk}\) is the phenotype value of the ith genotype evaluated in the jth treatment with the kth rep, \(g_i\) is the effect of the ith genotype, \(t_i\) is the effect of the jth treatment (or environment), \(g_i x t_j\) is the interaction effect of the ith genotype with the jth treatment, \(r_k\) is the effect of the kth rep, and \(e_{ijk}\) is the residual error. The \(e_{ijk}\) have expectation to zero
the reference table below details the ANOVA for an inbred population.
| Source | df | Observed MS | E(MS) |
|---|---|---|---|
| Environment | \(e-1=1\) | ||
| Replications/E | \(r-1=1\) | ||
| Inbred lines | \(n-1=232\) | \(MS_{progeny}=275\) | \(V_e + rV_{G \times E} + reV_{progeny}\) |
| Inbreds x E | \((n-1)(e-1)=224\) | \(MS_{PE}=31\) | \(V_e + rV_{G \times E}\) |
| pooled error | \((n-1)(r-1)e=419\) | \(MS_{error}=32\) | \(V_e\) |
–
let’s calculate and summarize ANOVA’s for all dates and store in a single data frame for later use.
#create a dataframe with the summaries of all of the anovas for all of the dates for canopy coverage
summaries <- df %>% group_by(date) %>% do(tidy(aov(Canopy_Coverage ~ Genotype + Treatment + Genotype:Treatment + Rep, data = .)))
Looking at the Aug12 ANOVA we can the observed MS and E(MS) produced for each source. from this table we can estimate heritability. heritability is influenced by allele frequencies, and differ from one population to another. The quantified difference depends on environments and number of measurements, which varies across traits. narrow-sense h^2 is a fundamental statistic used in predicting response to selection. observing the anova table below we can see that the main effects of genotype, treatment, and rep all have significance, but the interaction term does not prove significant for this given observation. the df shows the degrees of freedom for each variable, the sumsq is the sum of squares, or the variation between the group means created by the levels of the independent variables and the overall mean. The meansq shows the mean sum of squares, or the sum of squares divided by the degrees of freedom. the statistic or F-value is the test statisctic, and this is the mean square of the variable divided by the mean square of each parameter. The p.value is the p-value of F-stat, and shows how likely the F-value calculated from the F-test would have occurred if the null hypothesis of no difference was true.
#lets look at the anova for Aug12
summaries[summaries$date == "Aug12",]
## # A tibble: 5 x 7
## # Groups: date [1]
## date term df sumsq meansq statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aug12 Genotype 232 336570. 1451. 16.6 7.01e-127
## 2 Aug12 Treatment 1 1581. 1581. 18.1 2.58e- 5
## 3 Aug12 Rep 1 2096. 2096. 24.0 1.37e- 6
## 4 Aug12 Genotype:Treatment 224 20498. 91.5 1.05 3.39e- 1
## 5 Aug12 Residuals 419 36578. 87.3 NA NA
The \(h^2\) on a plot-mean basis can be estimated as
\[\begin{align*} h^2 & = \frac{V_{A}}{V_{A} + V_{\bar{Y}}} \\ & = \frac{V_{A}}{V_{A} + V_{G \times E}/e + V_{e}/(re)} \\ & = \frac{61}{61 + 31/4} = 0.89 \\ \end{align*}\]
let’s calculate narrow-sense heritability for each day and store in a table for visualization
#store unique dates to loop through and calcualte heritability
dates <- as.character(unique(summaries$date))
#create list to store dataframes
h2_list <- list()
#loop to calculate h^2
for (i in dates){
tmp <- summaries[summaries$date == i,]
VA <- (tmp[1,5] - tmp[4,5])/(2*2)
VP <- (tmp[4,5])/((length(unique(df$Rep)))*length(unique(df$Treatment)))
h2 <- VA/(VA + (tmp[4,5])/((length(unique(df$Rep)))*length(unique(df$Treatment))))
h2_list[[i]] <- cbind(i, h2)
}
#combine into single data frame
h2_df <- do.call("rbind", h2_list)
names(h2_df) <- c("date", "heritability")
#order by date
order <- c("July6","Aug12","Aug14","Aug16","Aug20","Aug22","Aug23","Aug26","Aug30","Sept1","Sept3","Sept5")
h2_df <- h2_df[match(order, h2_df$date),]
#add column with ascending order to use for visualization
h2_df <- h2_df %>% mutate(day_number = 1:n())
We can now plot the heritability data for each date and see how the values change over time
ggplot(h2_df, aes(x = day_number, y = heritability))+
geom_point() +
scale_x_continuous(breaks = seq(1,12, by = 1)) +
labs(title = "Canopy Coverage" ~h^2, x = "observation", y = "heritability (%)") +
geom_smooth(method = "lm", se = F) +
stat_poly_eq(formula = y ~ x,
aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE)
## `geom_smooth()` using formula 'y ~ x'
The heritability of canopy coverage changes over time. As hypothesized, I thought the heritability would increase and stay high for August14-30 due to over-saturation and coverage reamining high for many time points. there seems to be a flucation that occurs throughout the time-points indicating that there could be additional factors that influence trait heritability, for example how the data was collected during the time of day, interval of collection, or outside sources of variation not being explained by the linear model. I am wondering if there was any spatial correction prior to analysis of any field variation that could be removed to improve heritability? Overall, the paper by Rodene and this exercise suggest that canopy coverage is useful in estimating fitness, where a number newly found loci contribute to fitness in regards to canopy coverage in inbreds and response to nitrogen uptake.