Correlation between Additive-genetic effects and residuals

library(breedR)
library(ggplot2)
library(plyr)
library(dplyr)

theme_set(theme_bw())

Simulate a progeny-trial dataset with a somewhat “extreme” pedigree (only 6 families) in order to highlight the desired structure

simdat <- function(N = 1e3, vc = list(gen = .5, res = 1)) {
  breedR.sample.phenotype(
    fixed   = c(mu = 1),
    genetic = list(
      model    = 'add_animal',
      Nparents = c(3, 2),  # 3*2 "families"
      sigma2_a = vc$gen,
      check.factorial = FALSE),
    residual.variance = vc$res,
    N = N)[-(1:20),]   # remove founders
}

Fit an additive-genetic model

fitdat <- function(dat = simdat()) {
  suppressMessages(
    remlf90(
      fixed = phenotype ~ 1,
      genetic = list(
        model = 'add_animal',
        ped   = dat[, 1:3],
        id    = 'self'
      ),
      data = dat
    )
  )
}

Observed effect: positive correlation between PBV and residuals

set.seed(12345)
res <- fitdat()
PBV = model.matrix(res)$genetic %*%
  ranef(res)$genetic
qplot(resid(res), PBV) +
  geom_smooth(method = 'lm') +
  coord_fixed()

summary(res)
## Linear Mixed Model with pedigree and spatial effects fit by AI-REMLF90 ver. 1.122 
##    Data: dat 
##   AIC  BIC logLik
##  2950 2960  -1473
## 
## Parameters of special components:
## 
## 
## Variance components:
##          Estimated variances   S.E.
## genetic               0.2874 0.2448
## Residual              1.0069 0.1331
## 
##              Estimate   S.E.
## Heritability   0.2047 0.1752
## 
## Fixed effects:
##            value   s.e.
## Intercept 1.2688 0.2471
lm(PBV ~ resid(res))
## 
## Call:
## lm(formula = PBV ~ resid(res))
## 
## Coefficients:
## (Intercept)   resid(res)  
##    -0.01832      0.14605

The observed phenotype is split into the model components:

all.equal(
  model.matrix(res)$genetic %*% ranef(res)$genetic + resid(res),
  res$mf$phenotype - fixef(res)$Intercept[, 'value'],
  check.attributes = FALSE
)
## [1] TRUE

… in a way such that the ratio between the genetic blups and empirical residual is constant and equal to the heritability, up to some shift which pools the estimates towards the family mean.

Replicate simulation experiment

Assess correlation between BVs and residuals

simcorr <- function(res = fitdat()) {
  PBV = model.matrix(res)$genetic %*%
    ranef(res)$genetic
  return(
    data.frame(
      est_sigma_a = res$var[1, 1],
      est_sigma_e = res$var[2, 1],
      corr_a_e    = cor(PBV, resid(res))
    )
  )
}
Nrep <- 30
repdat <- as.data.frame(t(replicate(Nrep, unlist(simcorr()))))

Plot density estimators of the sampling distributions of variance component estimators and the correlation between PBV and residuals

repdat %>% 
  lapply(density) %>% 
  ldply(function(x) as.data.frame(x[c('x', 'y')])) %>% 
  ggplot(aes(x, y)) +
  geom_line() +
  facet_grid(~ .id) +
  theme_bw()

Conclusions:

  • The estimators are working fine (i.e. unbiased)
  • The expected correlation is about 0.46

Interpretation:

  • The prediction of random effects essentially splits the deviation between the observed phenotype and the expected value (given by the fixed effects - in this case only the global mean) between the available random effects, acording to the relative magnitude of variance estimates.

  • This means that individuals whose phenotype (i.e. “real” BV+residual) is higher than average will have positive estimated residual and breeding value.

  • The kinship structure and the spatial effect only introduce some relative perturbations around this, but the underlying effect remains.

  • Note that the regression slope shown above is very similar to the estimated heritability

Some more insight:

Simulate two independent random variables with a variance ratio of 2 (say, a spatial effect and the residual effect). But in this case we remove the autocorrelation in the random effect such that each blup is predicted based on the individual’s phenotype only.

set.seed(1234)
N <- 500
dat <- 
  data.frame(
    S = rnorm(N),
    R = rnorm(N, sd = sqrt(2))
  )

For every individual (point), we only observe the phenotype, which is the sum of both

ind <- 11
ggplot(dat, aes(R, S)) +
  geom_point(col = 'gray', size = .1) +
  geom_point(dat = dat[ind, ], colour = 'red') +
  geom_abline(intercept = sum(dat[ind,]), slope = -1, col = 'darkgray') +
  coord_fixed()

The predicted blups are computed by splitting the observed phenotype according to the estimated variance ratios. This is, the point of intersection between the phenotypic line and the variance ratio line of slope \(1/(1+2)\) The prediction is a projection of the true points on this line.

ggplot(dat, aes(R, S)) +
  geom_point(col = 'gray', size = .1) +
  geom_point(dat = dat[ind, ], colour = 'red') +
  geom_abline(intercept = sum(dat[ind,]), slope = -1, col = 'darkgray') +
  geom_abline(intercept = 0, slope = 1/3, col = 'blue') +
  coord_fixed()

While this model is not identifiable, we force blup prediction by making only one round or reml estimation. The variance estimates are very close to the initial values

res2 <- remlf90(
  fixed = pheno ~ 1,
  random = ~ ind,
  data = transform(
    dat,
    pheno = R + S,
    ind   = seq_len(nrow(dat))
  ),
  var.ini = list(ind = 1, resid = 2),
  progsf90.options = 'maxrounds 1'
)

summary(res2)
## Linear Mixed Model with pedigree and spatial effects fit by AI-REMLF90 ver. 1.122 
##    Data: transform(dat, pheno = R + S, ind = seq_len(nrow(dat))) 
##   AIC  BIC logLik
##  1997 2006 -996.7
## 
## Parameters of special components:
## 
## 
## Variance components:
##          Estimated variances  S.E.
## ind                    1.070 92845
## Residual               2.062 92845
## 
## Fixed effects:
##              value   s.e.
## Intercept -0.07599 0.0775
vratio <- res2$var[1, 1]/sum(res2$var[, 1])

The interecept also need to be split acording to the variance ratio

dat <- transform(
  dat,
  Shat = fixef(res2)$Intercept[, 'value']*vratio + 
    model.matrix(res2)$ind %*% ranef(res2)$ind,
  Rhat = fixef(res2)$Intercept[, 'value']*(1-vratio) + resid(res2)
)

ggplot(dat) +
  geom_point(aes(R, S), col = 'gray', size = .1) +
  geom_point(aes(R, S), dat = dat[ind, ], colour = 'red') +
  geom_abline(intercept = sum(dat[ind, c('R', 'S')]), slope = -1, col = 'darkgray') +
  geom_abline(intercept = 0, slope = 1/2, col = 'blue') +
  geom_point(aes(Rhat, Shat), size = .1) +
  geom_point(aes(Rhat, Shat), dat = dat[ind, ], colour = 'darkred') +
  coord_fixed()

This procedure of prediction produces two effects:

  1. Even if we know (or assume) that the random effect and the residuals are independent, the respective blups are conditioned to the observed phenotype. This necessarily introduces a correlation in the predictions that is more or less strong according to how much additional information about an individual we can get from other individuals (e.g. kinship, neighbourhood).

  2. The variances of the blups are smaller than the estimated variances, as a consequance of the projection.

Furthermore, the sum of the variances and twice the covariance gives an approximation to the phenotypic variance:

var(dat$Rhat)+
var(dat$Shat)+
2*cov(dat$Rhat, dat$Shat)
## [1] 3.138393