This demonstration will show you how to fit Genomic BLUP (G-BLUP) and Ridge Regression BLUP (RR-BLUP) models. The data that we will use is a simulated dataset if 250 inbred lines evaluated in 4 environments. The data have already been summarized within location, thus the data consists of the adjusted mean phenotypic value per line per environment and the standard error of each mean. Because the standard errors are not all the same, we must account for heterogeneous error variance.
First we load the RData file. This contains the objects bv, M, ped and pheno. bv is a matrix of true breeding values for 530 individuals. M is a matrix of genotypic data for 530 individuals, ped is a data frame of the pedigrees for all 530 individuals. pheno is a data frame of the phenotypic data for 250 genotypes. The 250 genotypes that were phenotyped are lines, whereas the 280 genotypes that were not phenotyped are the plants used in the line development process.
Next, we load the library ‘rrBLUP’ we will use the mixed.solve function from this package.
load('Data250Lines.RData')
library(rrBLUP)
First use str() to check the structure of the phenotypic data.
str(pheno)
## 'data.frame': 1000 obs. of 5 variables:
## $ phenoGID: int 281 282 283 284 285 286 287 288 289 290 ...
## $ env : num 1 1 1 1 1 1 1 1 1 1 ...
## $ error : num 6 6 6 6 6 6 6 6 6 6 ...
## $ pValue : num -4.445 1.959 0.554 1.211 2.596 ...
## $ se : num 2.45 2.45 2.45 2.45 2.45 ...
Notice that all the variables are being considered as integers or numeric. The phenoGID, and env variables are factors thus we need to change them to factors.
pheno$env<- as.factor(pheno$env)#change env to a factor
pheno$phenoGID<- as.factor(pheno$phenoGID)#change phenoGID factor
The next steps are to create the response variable vector, y, which are our phenotypes, and the fixed-effect design matrix X. We will use these objects both for G-BLUP and for RR-BLUP.
First the response variable vector
y<- pheno$pValue
Next, the fixed-effect design matrix X. To create this matrix we use the function model.matrix, and specify the fixed effects portion of the model.
X<- model.matrix(pValue~1+env, pheno)
X[1:5,]
## (Intercept) env2 env3 env4
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 1 0 0 0
G-BLUP uses a marker-based relationship matrix. This is assumed proportional to the genetic covariance matrix: \(\mathbf{G}\propto\mathbf{A}\times\sigma^2_{a}\). We will use the A.mat() function to calculate the marker-based relationship matrix
A<- A.mat(M)
A[1:5, 1:5]
## 1 2 3 4 5
## 1 1.1879740 -0.19279642 -0.12013471 0.282500768 -0.161504019
## 2 -0.1927964 0.85838710 0.02025822 0.036642193 0.155324783
## 3 -0.1201347 0.02025822 0.90357123 0.071155605 0.075393305
## 4 0.2825008 0.03664219 0.07115560 0.969718941 -0.008362001
## 5 -0.1615040 0.15532478 0.07539330 -0.008362001 0.644396744
Next we create the random-effects design matrix, Z. Beforehand, we will need to make sure that the factor levels of phenoGID are the same as the row and column names of the relationship matrix. This is important because the columns of Z must exactly match up with the row and columns of A.
pheno$phenoGID<- factor(pheno$phenoGID, levels=colnames(A))
Z<- model.matrix(pValue~0+phenoGID, pheno)
Z[1:5, 1:5]
## phenoGID1 phenoGID2 phenoGID3 phenoGID4 phenoGID5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
First we will fit the G-BLUP model ignoring the fact that we have a heterogeneous error variance. We use the mixed.solve() function. The arguments are y, Z, K, and X (in that order). y is the response variable, Z is the random effects design matrix, K is the relationship matrix that specifies the relationship between the genotypes, and X is the fixed effects design matrix. With G-BLUP, the K matrix is the marker-based relationship matrix.
mod0<- mixed.solve(y= y, Z=Z, K=A, X=X)
str(mod0)
## List of 5
## $ Vu : num 1.09
## $ Ve : num 15.6
## $ beta: num [1:4(1d)] 2.4429 -0.1769 -0.0841 -0.1206
## ..- attr(*, "dimnames")=List of 1
## .. ..$ : chr [1:4] "(Intercept)" "env2" "env3" "env4"
## $ u : num [1:530(1d)] 0.681 -1.773 1.102 -0.433 -0.65 ...
## ..- attr(*, "dimnames")=List of 1
## .. ..$ : chr [1:530] "1" "2" "3" "4" ...
## $ LL : num -2820
This model is not entirely correct because our response variable consists of means with different standard errors. Therefore we cannot assume homogeneity of variance.
Now we will fit the G-BLUP model accounting for the fact that we have a heterogeneous error variance. We do this by specifying the R matrix. The R * the residual error variance is the error covariance variance matrix. Usually R is a diagonal matrix that contains 1/weight on the diagonal. The larger the weight, the smaller the error variance. In our case, the weight is the reciprocal of the known error variances (which is the se^2).
Since we are using the package mixed.solve(), we cannot directly specify the R matrix. Instead, we can use a workaround which is to pre-multiply y, Z, and X by the inverse of the square root of R. Luckily, this turns out to be very easy. We simply divide y, Z, and X by the vector of the standard errors of y.
y2<- y/pheno$se
X2<- X/pheno$se
Z2<- Z/pheno$se
Now we can fit the G-BLUP model again, this time using the y2, X2, and Z2 objects which are pre-multiplied by the inverse of the square root of R.
mod1<- mixed.solve(y= y2, Z=Z2, K=A, X=X2)
str(mod1)
## List of 5
## $ Vu : num 1.18
## $ Ve : num 1.37
## $ beta: num [1:4(1d)] 2.4392 -0.1769 -0.0841 -0.1206
## ..- attr(*, "dimnames")=List of 1
## .. ..$ : chr [1:4] "(Intercept)" "env2" "env3" "env4"
## $ u : num [1:530(1d)] 0.783 -1.757 1.036 -0.806 -0.577 ...
## ..- attr(*, "dimnames")=List of 1
## .. ..$ : chr [1:530] "1" "2" "3" "4" ...
## $ LL : num -1613
Now, since this is a simulated dataset, we have the true breeding values, object bv, we can compare the genomic estimated breeding values (GEBVs) with the true breeding values
cor(mod1$u, bv) #accuracy of the better G-BLUP model
## [,1]
## [1,] 0.7651347
cor(mod0$u, bv) #accuracy of the worse G-BLUP model
## [,1]
## [1,] 0.7539713
We can also make a scatterplot to visualize how well the GEBVs correspond to the true breeding values
plot(mod1$u, bv)
Recall that only genotypes 281:530 were phenotyped. We can check the accuracy of this specific group of individuals
cor(mod1$u[281:530], bv[281:530]) #accuracy of phenotyped
## [1] 0.772162
RR-BLUP and G-BLUP will give us the same results, but they are implemented in a slightly different way. In G-BLUP we use a relationship matrix and the random effects design matrix Z must relate the y vector to the relationship matrix. In RR-BLUP, the random effects design matrix, Z, is the marker data, and the rows of Z must relate to the y vector.
To fit the RR-BLUP model we can use the y2 and the X2 matrix that we already made above. We only need to prepare the Z matrix. To do that, we will take the marker matrix, M and ensure the rows in M exactly correspond to the phenotypes.
ixM<- match(pheno$phenoGID, row.names(M))
ixM #row index vector to make Z from M
## [1] 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298
## [19] 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316
## [37] 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334
## [55] 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## [73] 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370
## [91] 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388
## [109] 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406
## [127] 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424
## [145] 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442
## [163] 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## [181] 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478
## [199] 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496
## [217] 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514
## [235] 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 281 282
## [253] 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## [271] 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318
## [289] 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## [307] 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354
## [325] 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372
## [343] 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390
## [361] 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408
## [379] 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426
## [397] 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444
## [415] 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462
## [433] 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## [451] 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498
## [469] 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516
## [487] 517 518 519 520 521 522 523 524 525 526 527 528 529 530 281 282 283 284
## [505] 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302
## [523] 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## [541] 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338
## [559] 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356
## [577] 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374
## [595] 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392
## [613] 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410
## [631] 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428
## [649] 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446
## [667] 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
## [685] 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482
## [703] 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## [721] 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518
## [739] 519 520 521 522 523 524 525 526 527 528 529 530 281 282 283 284 285 286
## [757] 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## [775] 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322
## [793] 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## [811] 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358
## [829] 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376
## [847] 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394
## [865] 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412
## [883] 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430
## [901] 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448
## [919] 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466
## [937] 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484
## [955] 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502
## [973] 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520
## [991] 521 522 523 524 525 526 527 528 529 530
Prepare the Z matrix using the row index vector that indicates how to order the rows of M so they correspond to the phenotypic data
Z<- M[ixM,] #row index vector to make Z from M
As we did before with G-BLUP, we need to divide Z by the vector of the standard errors of y to account for heterogeneous error variances.
Z2<- Z/pheno$se
Now we can fit the RR-BLUP model. We use y2 and X2 that we prepared when we fit G-BLUP, and we will us the Z2 matrix that we just prepared. We will put K=NULL, because in RR-BLUP our random effects are markers and we assume that the markers are independent and identically distribute (i.i.d).
mod1rr<- mixed.solve(y= y2, Z=Z2, K=NULL, X=X2)
str(mod1rr) #look at the results output
## List of 5
## $ Vu : num 0.00563
## $ Ve : num 1.37
## $ beta: num [1:4(1d)] 2.0395 -0.1769 -0.0841 -0.1206
## ..- attr(*, "dimnames")=List of 1
## .. ..$ : chr [1:4] "(Intercept)" "env2" "env3" "env4"
## $ u : num [1:1100(1d)] 8.48e-17 -4.09e-03 1.89e-02 1.50e-02 2.09e-03 ...
## ..- attr(*, "dimnames")=List of 1
## .. ..$ : chr [1:1100] "mrk1" "mrk2" "mrk3" "mrk4" ...
## $ LL : num -1613
From this model, the random effects (the BLUPs) are the marker effects. To get Genomic Estimated Breeding Values (GEBVs), we must multiply the marker effects by the marker matrix
gebv_rr<- M %*% mod1rr$u
Now, lets check the accuracy by correlating the GEBVs that we just obtained with the true breeding values.
cor(gebv_rr, bv) #accuracy of RR-BLUP
## [,1]
## [1,] 0.7651348
Notice that the accuracy is the same as G-BLUP. This is expected because the two models are equivalent.