This tutorial is aiming to explain how to perform GWAS with R.
Here I describe a method for performing GWAS with R. We use a dataset used in GWAS with rice germplasm collection (Zhao et al. 2011; Nature Communications 2:467), and a dataset of SNPs genotyped with High Density Rice Array (HDRA) (McCouch et al. 2016; Nature Communication 7:10532). Both datasets can be downloaded from Rice Diversity homepage (http://www.ricediversity.org/data/). The HDRA genotyping data contained 0.7 million SNPs. Of them, we used 38,769 SNPs which have no redundancy among them, low missing rate (< 1%) and minor allele frequency higher than 5%.
We will use gaston and pcaMethods packages for reading and analyzing SNP data. rrBLUP, gaston, qqman and BGLR packages will be used for GWAS. All the packages can be easily downloaded and installed by using the R package installer.
# required packages
require(gaston) # for many functions
require(pcaMethods) # for probablistic PCA (a package in Bioconductor)
require(LEA)
require(RColorBrewer)
require(manhattanly)
require(BGLR) # for Bayesian regression
require(GAPIT)
Here we assume that SNP genotype data is in a vcf (variant call format) file. Read the file with the read.vcf function in the gaston package. The gaston package allow you to load data also from a file in bed (PLINK binary biallelic genotype table, http://zzz.bwh.harvard.edu/plink/binary.shtml) format or files in bim (PLINK estented MAP file) and fam (PLINK sample information file) formats.
# read file
bm <- read.vcf("HDRA-G6-4-RDP1-RDP2-NIAS.AGCT_MAF005MIS001NR_imputed.vcf.gz")
## ped stats and snps stats have been set.
## 'p' has been set.
## 'mu' and 'sigma' have been set.
bm
## A bed.matrix with 1568 individuals and 38769 markers.
## snps stats are set
## ped stats are set
The vcf file contains genotypes of 38769 markers of 1568 samples.
Next we load files for phenotypic data and the information of samples (lines/varieties).
# read phenotypic data and line data
line <- read.csv("RiceDiversityLine4GWASGS.csv",
row.names = 1, stringsAsFactors = T)
pheno <- read.csv("RiceDiversityPheno4GWASGS.csv",
row.names = 1)
head(line)
## NSFTV.ID GSOR.ID IRGC.ID Accession.Name
## NSFTV1@86f75d2b.0 1 301001 To be assigned Agostano
## NSFTV3@5ef1be74.0 3 301003 117636 Ai-Chiao-Hong
## NSFTV4@81d03b86.0 4 301004 117601 NSF-TV 4
## NSFTV5@5533f406.0 5 301005 117641 NSF-TV 5
## NSFTV6@0d125c0e.0 6 301006 117603 ARC 7229
## NSFTV7@e37be9e5.0 7 301007 To be assigned Arias
## Country.of.origin Latitude Longitude Sub.population
## NSFTV1@86f75d2b.0 Italy 41.871940 12.56738 TEJ
## NSFTV3@5ef1be74.0 China 27.902527 116.87256 IND
## NSFTV4@81d03b86.0 India 22.903081 87.12158 AUS
## NSFTV5@5533f406.0 India 30.472664 75.34424 AROMATIC
## NSFTV6@0d125c0e.0 India 22.903081 87.12158 AUS
## NSFTV7@e37be9e5.0 Indonesia -0.789275 113.92133 TRJ
str(pheno)
## 'data.frame': 388 obs. of 36 variables:
## $ Flowering.time.at.Arkansas : num 75.1 89.5 94.5 87.5 89.1 ...
## $ Flowering.time.at.Faridpur : int 64 66 67 70 73 NA 74 75 55 NA ...
## $ Flowering.time.at.Aberdeen : int 81 83 93 108 101 158 122 81 74 NA ...
## $ FT.ratio.of.Arkansas.Aberdeen : num 0.927 1.078 1.016 0.81 0.882 ...
## $ FT.ratio.of.Faridpur.Aberdeen : num 0.79 0.795 0.72 0.648 0.723 ...
## $ Culm.habit : num 4 7.5 6 3.5 6 3 2.5 6.5 3 NA ...
## $ Leaf.pubescence : int 1 0 1 1 1 1 NA 1 NA NA ...
## $ Flag.leaf.length : num 28.4 39 27.7 30.4 36.9 ...
## $ Flag.leaf.width : num 1.283 1 1.517 0.892 1.75 ...
## $ Awn.presence : int 0 0 0 0 0 1 0 1 1 NA ...
## $ Panicle.number.per.plant : num 3.07 4.05 3.12 3.7 2.86 ...
## $ Plant.height : num 111 144 128 154 148 ...
## $ Panicle.length : num 20.5 26.8 23.5 29 30.9 ...
## $ Primary.panicle.branch.number : num 9.27 9.17 8.67 6.36 11.17 ...
## $ Seed.number.per.panicle : num 4.79 4.44 5.08 4.52 5.54 ...
## $ Florets.per.panicle : num 4.91 4.73 5.15 4.73 5.68 ...
## $ Panicle.fertility : num 0.879 0.75 0.935 0.813 0.871 0.907 0.781 0.887 0.537 NA ...
## $ Seed.length : num 8.06 7.71 8.24 9.71 7.12 ...
## $ Seed.width : num 3.69 2.95 2.93 2.38 3.28 ...
## $ Seed.volume : num 2.59 2.11 2.15 1.94 2.2 ...
## $ Seed.surface.area : num 3.91 3.66 3.71 3.7 3.66 ...
## $ Brown.rice.seed.length : num 5.79 5.6 6.12 7.09 5.14 ...
## $ Brown.rice.seed.width : num 3.11 2.57 2.47 2.04 2.8 ...
## $ Brown.rice.surface.area : num 3.51 3.29 3.33 3.3 3.28 ...
## $ Brown.rice.volume : num 7.74 5.11 5.13 4.09 5.56 ...
## $ Seed.length.width.ratio : num 2.19 2.61 2.81 4.08 2.17 ...
## $ Brown.rice.length.width.ratio : num 1.86 2.18 2.48 3.47 1.83 ...
## $ Seed.color : int 0 0 0 0 0 0 0 0 0 NA ...
## $ Pericarp.color : int 0 0 0 0 0 0 0 0 0 NA ...
## $ Straighthead.suseptability : num 4.83 7.83 NA 8.33 8.17 ...
## $ Blast.resistance : int 8 4 3 5 3 2 2 9 2 1 ...
## $ Amylose.content : num 15.6 23.3 23.1 19.3 23.2 ...
## $ Alkali.spreading.value : num 6.08 5.64 5.53 6.03 5.44 ...
## $ Protein.content : num 8.45 9.2 8 9.6 8.5 ...
## $ Year07Flowering.time.at.Arkansas: num 73 88 105.5 86.5 85.5 ...
## $ Year06Flowering.time.at.Arkansas: num 77.2 91 83.5 88.5 92.7 ...
The line data contains brief passport data of lines/varieties and the name of subpopulation to which samples belong. The pheno data contain 36 traits (in some cases, the combitions of a trait andlocations and/or years). Four traits (Leaf.puescence, Awn.presnce, Seed.color, Pericarp.color) are binary traits, one (Blast.resistance) ordinal categorical, and the remainders are continuous.
Next, extract marker genotype data of samples with phenotypic data
from bm data.
Information about samples are included in bm@ped.
head(bm@ped)
## famid id father mother sex pheno N0
## 1 IRGC121316@c88dcbba.0 IRGC121316@c88dcbba.0 0 0 0 NA 25217
## 2 IRGC121578@950ba1f2.0 IRGC121578@950ba1f2.0 0 0 0 NA 35744
## 3 IRGC121491@0b7d031b.0 IRGC121491@0b7d031b.0 0 0 0 NA 28683
## 4 IRGC121440@d02dfe4e.0 IRGC121440@d02dfe4e.0 0 0 0 NA 28244
## 5 IRGC121896@89688646.0 IRGC121896@89688646.0 0 0 0 NA 31812
## 6 IRGC121323@57c71a3b.0 IRGC121323@57c71a3b.0 0 0 0 NA 27630
## N1 N2 NAs N0.x N1.x N2.x NAs.x N0.y N1.y N2.y NAs.y N0.mt N1.mt N2.mt
## 1 249 13303 0 0 0 0 0 0 0 0 0 0 0 0
## 2 163 2862 0 0 0 0 0 0 0 0 0 0 0 0
## 3 231 9855 0 0 0 0 0 0 0 0 0 0 0 0
## 4 798 9727 0 0 0 0 0 0 0 0 0 0 0 0
## 5 196 6761 0 0 0 0 0 0 0 0 0 0 0 0
## 6 1160 9979 0 0 0 0 0 0 0 0 0 0 0 0
## NAs.mt callrate hz callrate.x hz.x callrate.y hz.y callrate.mt hz.mt
## 1 0 1 0.006422657 NaN NaN NaN NaN NaN NaN
## 2 0 1 0.004204390 NaN NaN NaN NaN NaN NaN
## 3 0 1 0.005958369 NaN NaN NaN NaN NaN NaN
## 4 0 1 0.020583456 NaN NaN NaN NaN NaN NaN
## 5 0 1 0.005055586 NaN NaN NaN NaN NaN NaN
## 6 0 1 0.029920813 NaN NaN NaN NaN NaN NaN
The names of samples are in the id column. In bm@ped, we can find the number of missing data (NAs), the proportion of non-missing data (callrate), the number of three genotypes (A1/A1, A1/A2, A2/A2 as N0, N1 and N2), and the proportion of heterozygous genotypes (i.e., heterozygosity, hz) etc. The actual alleles of A1 and A2 for each marker can be found in bm@snps as described below.
We can also check the status of SNPs stored as bm@snps.
head(bm@snps)
## chr id dist pos A1 A2 quality filter N0 N1 N2 NAs N0.f N1.f
## 1 1 SNP-1.8562. 0 9563 A T 0 PASS 1343 7 218 0 NA NA
## 2 1 SNP-1.18594. 0 19595 G A 0 PASS 1481 8 79 0 NA NA
## 3 1 SNP-1.24921. 0 25922 C T 0 PASS 1348 24 196 0 NA NA
## 4 1 SNP-1.25253. 0 26254 A T 0 PASS 1152 10 406 0 NA NA
## 5 1 SNP-1.29213. 0 30214 T A 0 PASS 1346 4 218 0 NA NA
## 6 1 SNP-1.30477. 0 31478 C T 0 PASS 1013 15 540 0 NA NA
## N2.f NAs.f callrate maf hz
## 1 NA NA 1 0.14126276 0.004464286
## 2 NA NA 1 0.05293367 0.005102041
## 3 NA NA 1 0.13265306 0.015306122
## 4 NA NA 1 0.26211735 0.006377551
## 5 NA NA 1 0.14030612 0.002551020
## 6 NA NA 1 0.34917092 0.009566327
In bm@snps, we can find chromosome number (chr), marker name (id), linkage map distance (dist), physical position (pos), A1 and A2 alleles (A1 and A2), the proportion of non-missing data (callrate), the number of three genotypes (A1/A1, A1/A2, A2/A2 as N0, N1 and N2), minor allele frequency (maf), and the proportion of heterozygous genotypes (i.e., heterozygosity, hz) etc.
The samples we need (samples with phenotypic data) can be selected in the same way as a matrix object.
bm.wp <- bm[bm@ped$id %in% rownames(pheno),]
bm.wp
## A bed.matrix with 388 individuals and 38769 markers.
## snps stats are set
## There is one monomorphic SNP
## ped stats are set
The marker genotype data of 388 individuals were successfully extracted.
Because we selected subset of individuals from all, some markers may becomes monomorphic. Here, we remove the markers from bm.wp. Later, we will again remove some markers which have small MAF and large missing rate.
bm.wp <- select.snps(bm.wp, maf > 0)
bm.wp
## A bed.matrix with 388 individuals and 38768 markers.
## snps stats are set
## ped stats are set
One marker was removed from the data. You can also perform the same processing as bm.wp <- bm.wp[, bm@snps$maf > 0].
To see the subpopulation structure of samples, we perform principal component analysis of marker genotype data. The marker genotype data are scored and stored in bm as a score matrix. We can extract the matrix with as.matrix function.
# extract the score matrix
gt.score <- as.matrix(bm.wp)
head(gt.score)[,1:10]
## SNP-1.8562. SNP-1.18594. SNP-1.24921. SNP-1.25253.
## NSFTV80@02d095ba.0 0 0 0 0
## NSFTV87@64fa0112.0 0 0 0 0
## NSFTV96@32a6808e.0 0 0 0 0
## NSFTV100@1f10be3d.0 0 0 0 0
## NSFTV114@eac19fb8.0 0 0 0 0
## NSFTV140@85551f9c.0 0 0 0 0
## SNP-1.29213. SNP-1.30477. SNP-1.31732. SNP-1.32666.
## NSFTV80@02d095ba.0 0 0 0 2
## NSFTV87@64fa0112.0 0 2 0 2
## NSFTV96@32a6808e.0 0 0 0 2
## NSFTV100@1f10be3d.0 0 0 0 2
## NSFTV114@eac19fb8.0 0 0 0 2
## NSFTV140@85551f9c.0 0 2 0 2
## SNP-1.40457. SNP-1.75851.
## NSFTV80@02d095ba.0 0 0
## NSFTV87@64fa0112.0 0 0
## NSFTV96@32a6808e.0 0 0
## NSFTV100@1f10be3d.0 0 0
## NSFTV114@eac19fb8.0 0 0
## NSFTV140@85551f9c.0 0 0
As described above, marker genotypes of A1/A1, A1/A2 and A2/A2 are scored as 0, 1, 2.
To evaluate the relationship with know subpopulation structure and also with phenotypic data, we reorder the gt.score data in the same order as one in pheno and line data.
gt.score <- gt.score[rownames(pheno),]
Because the number of markers in gt.score is high, it takes a very long time with the prcomp function, which is generally used for PCA. Here, we conduct PCA with the pcaMethods package, which is specialized for PCA. Here, we perform probablistic principal component analysis (PPCA). PPCA allows us to conduct PCA even with missing data.
# perform probablistic principal component analysis (PPCA)
# PPCA can be performed even when a dataset has missing entries
pca <- pcaMethods::pca(gt.score, "ppca", nPcs = 6)
pcs <- scores(pca)
pairs(pcs[,1:6], col = line$Sub.population, oma = c(3, 3, 3, 15))
op <- par(xpd = T)
legend("bottomright", levels(line$Sub.population), col = 1:nlevels(line$Sub.population), pch = 1)
par(op)
As the result of analysis, PC1 to 4 captured the subpopulation structure of the germplasm collection well. PC5 and 6 captured within-subpopulation variation in tropical japonica (TRJ) and indica (IND).
Next, we check the relationship between principal component scores and phenotypic values of lines/varieties. Here we use the length-width ratio of seeds as an example.
# see the relationship between phenotypic values of the target trait and PC scores
plot(pcs[,1], pheno$Seed.length.width.ratio, col = line$Sub.population)
plot(pcs[,2], pheno$Seed.length.width.ratio, col = line$Sub.population)
plot(pcs[,3], pheno$Seed.length.width.ratio, col = line$Sub.population)
plot(pcs[,4], pheno$Seed.length.width.ratio, col = line$Sub.population)
The graphs show unclear relationship between phenotypic values and
principal component scores. The relationship between phenotypic values
and the attribution to the subpopulations is more clear. We will come
back to this result later.
If we do not have prior (known) information of subpopulation structure in target population, we may be able to estimate it with clustering analysis. Here, we simply apply k-means clustering, which is a famous non-hierarchical clustering method.
# k-means clustering
km <- kmeans(gt.score, centers = 5, nstart = 10, iter.max = 100)
grp <- as.factor(km$cluster)
pairs(pcs[,1:6], col = grp, oma = c(3, 3, 3, 15))
op <- par(xpd = T)
legend("bottomright", levels(grp), col = 1:nlevels(grp), pch = 1)
par(op)
The result shows k-means clustering works well in this case. The classification in the clustering method has a similar pattern to the know subpopulation information.
Actually, we can also perform an almost identical analysis via eigen value decomposition of genetic relationship matrix (GRM).GRM is calculated as \(\frac{1}{q} XX'\) with \(X\) the standardized genotype matrix and \(q\) the number of SNPs (ncol(x)). GRM can be calculated with the GRM function in the gaston package.
grm <- GRM(bm.wp)
grm[1:6, 1:6]
## NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0 2.29390430 -0.0782388 -0.09368214
## NSFTV87@64fa0112.0 -0.07823880 1.3258848 0.68903351
## NSFTV96@32a6808e.0 -0.09368214 0.6890335 1.27583718
## NSFTV100@1f10be3d.0 -0.21102436 0.2894305 0.33597934
## NSFTV114@eac19fb8.0 -0.18258514 0.2570685 0.37205958
## NSFTV140@85551f9c.0 -0.12167218 0.3007855 0.37542513
## NSFTV100@1f10be3d.0 NSFTV114@eac19fb8.0 NSFTV140@85551f9c.0
## NSFTV80@02d095ba.0 -0.2110244 -0.1825851 -0.1216722
## NSFTV87@64fa0112.0 0.2894305 0.2570685 0.3007855
## NSFTV96@32a6808e.0 0.3359793 0.3720596 0.3754251
## NSFTV100@1f10be3d.0 1.2132517 0.9044130 0.6716095
## NSFTV114@eac19fb8.0 0.9044130 1.2369840 0.6397730
## NSFTV140@85551f9c.0 0.6716095 0.6397730 1.2719293
Perform eigen value decomposion of the grm matrix. First, reorder the matrix in the identical order to that of phenoypic data.
grm <- grm[rownames(pheno), rownames(pheno)]
eig <- eigen(grm)
Calculate the scores of principal component and draw graphs.
pcs <- eig$vectors %*% diag(sqrt(eig$values))
pairs(pcs[,1:6], col = line$Sub.population, oma = c(3, 3, 3, 15))
op <- par(xpd = T)
legend("bottomright", levels(line$Sub.population), col = 1:nlevels(line$Sub.population), pch = 1)
par(op)
While there is small differences between the previous and current analyses, the general pattern is same between two, suggesting PC scores can be also calculated via eigen value decomposion of GRM. This also suggests that GRM itself has the same information contained in PC scores.
Here, we estimate genetic background with a method specialized for population genetic structure analysis. The snmf function in the LEA package estimates ancestry coefficients using the sparse non-negative matrix factorization (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982712/). The function returns a project object containing all runs of the snmf program for the input data. It can be useful to perform several runs of snmf for various numbers of ancestral populations (K).
First, the input file to be used by the LEA must be prepared; the data must be transposed so that it is in the orientation assumed by the LEA (samples in columns, markers in rows).
write.table(t(gt.score), "rice.geno", col.names = F, row.names = F, sep = "")
Estimate ancestry coefficients at K = 5 with four replications.
project <- snmf("rice.geno",
K = 5,
I = 1000,
entropy = T,
repetitions = 4,
project = "new")
## The project is saved into :
## rice.snmfProject
##
## To load the project, use:
## project = load.snmfProject("rice.snmfProject")
##
## To remove the project, use:
## remove.snmfProject("rice.snmfProject")
##
## [1] 2010135985
## [1] "*************************************"
## [1] "* create.dataset *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -s (seed random init) 2010135985
## -r (percentage of masked data) 0.05
## -x (genotype file in .geno format) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
## -o (output file in .geno format) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##
## Write genotype file with masked data, /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno: OK.
##
## [1] "*************************************"
## [1] "* sNMF K = 5 repetition 1 *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -K (number of ancestral pops) 5
## -x (input file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## -q (individual admixture file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.Q
## -g (ancestral frequencies file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.G
## -i (number max of iterations) 200
## -a (regularization parameter) 10
## -s (seed random init) 6305103281
## -e (tolerance error) 1E-05
## -p (number of processes) 1
## -I (number of SNPs used to init Q) 1000
## - diploid
##
## Read genotype file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno: OK.
##
## Initialization of Q with a random subset of 1000 SNPs:
## [ ]
## [============]
## Number of iterations: 31
##
## Main algorithm:
## [ ]
## [======]
## Number of iterations: 16
##
## Least-square error: 2348539.016665
## Write individual ancestry coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.Q: OK.
## Write ancestral allele frequency coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.G: OK.
##
## [1] "*************************************"
## [1] "* cross-entropy estimation *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -K (number of ancestral pops) 5
## -x (genotype file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
## -q (individual admixture) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.Q
## -g (ancestral frequencies) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.G
## -i (with masked genotypes) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## - diploid
##
## Cross-Entropy (all data): 0.242812
## Cross-Entropy (masked data): 0.258944
## The project is saved into :
## rice.snmfProject
##
## To load the project, use:
## project = load.snmfProject("rice.snmfProject")
##
## To remove the project, use:
## remove.snmfProject("rice.snmfProject")
##
## [1] 63915391
## [1] "*************************************"
## [1] "* create.dataset *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -s (seed random init) 63915391
## -r (percentage of masked data) 0.05
## -x (genotype file in .geno format) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
## -o (output file in .geno format) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##
## Write genotype file with masked data, /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno: OK.
##
## [1] "*************************************"
## [1] "* sNMF K = 5 repetition 2 *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -K (number of ancestral pops) 5
## -x (input file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## -q (individual admixture file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.Q
## -g (ancestral frequencies file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.G
## -i (number max of iterations) 200
## -a (regularization parameter) 10
## -s (seed random init) 63915391
## -e (tolerance error) 1E-05
## -p (number of processes) 1
## -I (number of SNPs used to init Q) 1000
## - diploid
##
## Read genotype file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno: OK.
##
## Initialization of Q with a random subset of 1000 SNPs:
## [ ]
## [==========]
## Number of iterations: 26
##
## Main algorithm:
## [ ]
## [===================]
## Number of iterations: 52
##
## Least-square error: 2368944.352814
## Write individual ancestry coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.Q: OK.
## Write ancestral allele frequency coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.G: OK.
##
## [1] "*************************************"
## [1] "* cross-entropy estimation *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -K (number of ancestral pops) 5
## -x (genotype file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
## -q (individual admixture) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.Q
## -g (ancestral frequencies) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.G
## -i (with masked genotypes) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## - diploid
##
## Cross-Entropy (all data): 0.246049
## Cross-Entropy (masked data): 0.263974
## The project is saved into :
## rice.snmfProject
##
## To load the project, use:
## project = load.snmfProject("rice.snmfProject")
##
## To remove the project, use:
## remove.snmfProject("rice.snmfProject")
##
## [1] 73289461
## [1] "*************************************"
## [1] "* create.dataset *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -s (seed random init) 73289461
## -r (percentage of masked data) 0.05
## -x (genotype file in .geno format) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
## -o (output file in .geno format) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##
## Write genotype file with masked data, /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno: OK.
##
## [1] "*************************************"
## [1] "* sNMF K = 5 repetition 3 *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -K (number of ancestral pops) 5
## -x (input file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## -q (individual admixture file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.Q
## -g (ancestral frequencies file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.G
## -i (number max of iterations) 200
## -a (regularization parameter) 10
## -s (seed random init) 73289461
## -e (tolerance error) 1E-05
## -p (number of processes) 1
## -I (number of SNPs used to init Q) 1000
## - diploid
##
## Read genotype file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno: OK.
##
## Initialization of Q with a random subset of 1000 SNPs:
## [ ]
## [=================]
## Number of iterations: 45
##
## Main algorithm:
## [ ]
## [=======]
## Number of iterations: 18
##
## Least-square error: 2383751.257163
## Write individual ancestry coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.Q: OK.
## Write ancestral allele frequency coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.G: OK.
##
## [1] "*************************************"
## [1] "* cross-entropy estimation *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -K (number of ancestral pops) 5
## -x (genotype file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
## -q (individual admixture) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.Q
## -g (ancestral frequencies) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.G
## -i (with masked genotypes) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## - diploid
##
## Cross-Entropy (all data): 0.24721
## Cross-Entropy (masked data): 0.264126
## The project is saved into :
## rice.snmfProject
##
## To load the project, use:
## project = load.snmfProject("rice.snmfProject")
##
## To remove the project, use:
## remove.snmfProject("rice.snmfProject")
##
## [1] 1616129687
## [1] "*************************************"
## [1] "* create.dataset *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -s (seed random init) 1616129687
## -r (percentage of masked data) 0.05
## -x (genotype file in .geno format) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
## -o (output file in .geno format) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##
## Write genotype file with masked data, /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno: OK.
##
## [1] "*************************************"
## [1] "* sNMF K = 5 repetition 4 *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -K (number of ancestral pops) 5
## -x (input file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## -q (individual admixture file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.Q
## -g (ancestral frequencies file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.G
## -i (number max of iterations) 200
## -a (regularization parameter) 10
## -s (seed random init) 1616129687
## -e (tolerance error) 1E-05
## -p (number of processes) 1
## -I (number of SNPs used to init Q) 1000
## - diploid
##
## Read genotype file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno: OK.
##
## Initialization of Q with a random subset of 1000 SNPs:
## [ ]
## [=======]
## Number of iterations: 20
##
## Main algorithm:
## [ ]
## [======]
## Number of iterations: 16
##
## Least-square error: 2349650.871035
## Write individual ancestry coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.Q: OK.
## Write ancestral allele frequency coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.G: OK.
##
## [1] "*************************************"
## [1] "* cross-entropy estimation *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 388
## -L (number of loci) 38768
## -K (number of ancestral pops) 5
## -x (genotype file) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
## -q (individual admixture) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.Q
## -g (ancestral frequencies) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.G
## -i (with masked genotypes) /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## - diploid
##
## Cross-Entropy (all data): 0.242809
## Cross-Entropy (masked data): 0.25901
## The project is saved into :
## rice.snmfProject
##
## To load the project, use:
## project = load.snmfProject("rice.snmfProject")
##
## To remove the project, use:
## remove.snmfProject("rice.snmfProject")
ce <- cross.entropy(project, K = 5); ce
## K = 5
## run 1 0.2589444
## run 2 0.2639744
## run 3 0.2641262
## run 4 0.2590100
Qmat <- Q(project, K = 5, run = which.min(ce))
rownames(Qmat) <- rownames(gt.score)
head(Qmat)
## V1 V2 V3 V4 V5
## NSFTV1@86f75d2b.0 0.000099973 9.99730e-05 5.87250e-03 9.99730e-05 0.993828000
## NSFTV3@5ef1be74.0 0.982354000 9.99729e-05 9.99729e-05 9.99729e-05 0.017345600
## NSFTV4@81d03b86.0 0.149472000 7.75379e-01 9.99820e-05 7.49493e-02 0.000099982
## NSFTV5@5533f406.0 0.000099970 9.99700e-05 9.99600e-01 9.99700e-05 0.000099970
## NSFTV6@0d125c0e.0 0.147150000 7.60223e-01 9.99910e-05 8.61347e-02 0.006391970
## NSFTV7@e37be9e5.0 0.020806400 2.14061e-02 5.34589e-02 8.51703e-01 0.052625700
Visualize the Q matrix for each of known subpopulations.
palette <- brewer.pal(5, "Set2")
barplot(t(Qmat[line$Sub.population == "IND",]), col= palette)
barplot(t(Qmat[line$Sub.population == "TEJ",]), col= palette)
barplot(t(Qmat[line$Sub.population == "TRJ",]), col= palette)
barplot(t(Qmat[line$Sub.population == "AUS",]), col= palette)
barplot(t(Qmat[line$Sub.population == "AROMATIC",]), col = palette)
barplot(t(Qmat[line$Sub.population == "ADMIX",]), col= palette)
For more beautiful visualization of this result, please refer to this home page, https://connor-french.github.io/intro-pop-structure-r/
From here, we are going into GWAS analysis.
As we checked before, bm@ped also has a column for phenotypic data.
head(bm.wp@ped)
## famid id father mother sex pheno N0 N1
## 1077 NSFTV80@02d095ba.0 NSFTV80@02d095ba.0 0 0 0 NA 26224 92
## 1078 NSFTV87@64fa0112.0 NSFTV87@64fa0112.0 0 0 0 NA 31639 102
## 1079 NSFTV96@32a6808e.0 NSFTV96@32a6808e.0 0 0 0 NA 31417 97
## 1080 NSFTV100@1f10be3d.0 NSFTV100@1f10be3d.0 0 0 0 NA 33200 28
## 1081 NSFTV114@eac19fb8.0 NSFTV114@eac19fb8.0 0 0 0 NA 32355 35
## 1082 NSFTV140@85551f9c.0 NSFTV140@85551f9c.0 0 0 0 NA 31226 33
## N2 NAs N0.x N1.x N2.x NAs.x N0.y N1.y N2.y NAs.y N0.mt N1.mt N2.mt
## 1077 12452 0 0 0 0 0 0 0 0 0 0 0 0
## 1078 7027 0 0 0 0 0 0 0 0 0 0 0 0
## 1079 7254 0 0 0 0 0 0 0 0 0 0 0 0
## 1080 5540 0 0 0 0 0 0 0 0 0 0 0 0
## 1081 6378 0 0 0 0 0 0 0 0 0 0 0 0
## 1082 7509 0 0 0 0 0 0 0 0 0 0 0 0
## NAs.mt callrate hz callrate.x hz.x callrate.y hz.y callrate.mt
## 1077 0 1 0.0023730912 NaN NaN NaN NaN NaN
## 1078 0 1 0.0026310359 NaN NaN NaN NaN NaN
## 1079 0 1 0.0025020636 NaN NaN NaN NaN NaN
## 1080 0 1 0.0007222452 NaN NaN NaN NaN NaN
## 1081 0 1 0.0009028064 NaN NaN NaN NaN NaN
## 1082 0 1 0.0008512175 NaN NaN NaN NaN NaN
## hz.mt
## 1077 NaN
## 1078 NaN
## 1079 NaN
## 1080 NaN
## 1081 NaN
## 1082 NaN
Here we use this column for linking phenotypic data with marker genotype data. We use length-width ratio of seeds as an example trait.
bm.wp@ped$pheno <- pheno[bm.wp@ped$id, "Seed.length.width.ratio"]
Using select.inds function in gaston package, remove samples with missing data in the target trait.
bm.wom <- select.inds(bm.wp, !is.na(bm.wp@ped$pheno))
bm.wom
## A bed.matrix with 361 individuals and 38768 markers.
## snps stats are set
## ped stats are set
361 samples have non-missing data
Next, select SNPs with MAF > 0.05 and callrate > 0.8.
bm.wom <- select.snps(bm.wom, maf > 0.05)
bm.wom <- select.snps(bm.wom, callrate > 0.8)
bm.wom
## A bed.matrix with 361 individuals and 36113 markers.
## snps stats are set
## ped stats are set
36113 markers remain.
The model used is GWAS is \[Y = X \alpha + G \beta+\omega + \epsilon\] where \(\omega \sim N(0, \tau K)\) and \(\epsilon \sim N(0, \sigma^2 I_n)\). \(G\) is the genotypic vector of SNP tested in GWAS. The matrix \(K\) is GRM, while the matrix \(X\) is the concatenation of the covariable matrix of fixed effects. Prior to the estimation of model parameters, we have to prepare \(K\) and \(X\).
To obtain \(K\), recalculate GRM based on the selected samples and selected SNPs.
K <- GRM(bm.wom)
As \(X\), association.test function in gaston has the option to include the first \(p\) PCs (based on the matrix \(K\)). With this option, the covariable matrix of fixed effects is concatenated with the matrix of the first \(p\) PC scores.
We can also prepare the fixed effect matrix by ourselves. Here, we prepare \(X\) corresponding to the clusters obtained in the k-means clustering. Function model.matrix can be used for preparing a design matrix based on the clusters.
Xkm <- model.matrix(~grp[bm.wom@ped$id])
head(Xkm)
## (Intercept) grp[bm.wom@ped$id]2 grp[bm.wom@ped$id]3 grp[bm.wom@ped$id]4
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## grp[bm.wom@ped$id]5
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
We can also prepare $X$ corresponding to the Q matrix obtained in sNMF (sparse Nonnegative Matrix Factorization).
Xqm <- Qmat[bm.wom@ped$id, ]
Perform GWAS with associaiton.test function in gaston package.
First, perform GWAS with a linear mixed model (\(Y = X \alpha + G\beta + \omega + \epsilon\)), likelihood ratio test (“lrt”) and the first 4 principal components.
gwa <- association.test(bm.wom, method = "lmm", response = "quantitative",
test = "lrt", eigenK = eigen(K), p = 4)
See the first 6 rows of the result.
head(gwa)
## chr pos id A1 A2 freqA2 h2 LRT p
## 1 1 9563 SNP-1.8562. A T 0.1398892 0.8279960 0.22814072 0.6329059
## 2 1 25922 SNP-1.24921. C T 0.1343490 0.8284383 0.48835936 0.4846601
## 3 1 26254 SNP-1.25253. A T 0.2963989 0.8279921 0.25333106 0.6147393
## 4 1 30214 SNP-1.29213. T A 0.1371191 0.8280157 0.19370190 0.6598533
## 5 1 31478 SNP-1.30477. C T 0.2243767 0.8276972 0.04035891 0.8407805
## 6 1 32733 SNP-1.31732. T G 0.3033241 0.8280499 0.11842655 0.7307473
As the result of analysis, we got h2, LRT and p for each SNP with SNP information (chr, pos, alleles, freqA2 etc).
h2 represents estimated heritability based on the random effects \(\omega \sim N(0, \tau K)\) and \(\epsilon \sim N(0, \sigma^2 I_n)\). That is, the esimates are calculated as \(\frac{\tau}{\tau+\sigma^2}\).
LRT is the value of \(-2log(\Lambda)\), where \(\Lambda\) is the likelihoood ratio between full and reduced models. The full model is \(Y = X \alpha + G\beta + \omega + \epsilon\), while the reduced model is one without tested SNP effect, i.e., \(Y = X \alpha + \omega + \epsilon\).
p is the p value in the chi-square test of LRT value. When the reduced model is correct (i.e., \(\beta = 0\)), LRT follows the chi-square distribution with 1 degree of freedom.
The association.test function has two more methods for testing the significance of the \(G\beta\) term. One of them is “Wald test”. In the association.test function, we can specify the method with test option. That is,
gwa.wald <- association.test(bm.wom, method = "lmm", response = "quantitative",
test = "wald", eigenK = eigen(K), p = 4)
head(gwa.wald)
## chr pos id A1 A2 freqA2 h2 beta sd
## 1 1 9563 SNP-1.8562. A T 0.1398892 0.8316850 0.021182732 0.04452256
## 2 1 25922 SNP-1.24921. C T 0.1343490 0.8320361 0.030970388 0.04444931
## 3 1 26254 SNP-1.25253. A T 0.2963989 0.8316693 0.022011048 0.04392335
## 4 1 30214 SNP-1.29213. T A 0.1371191 0.8317340 0.019788208 0.04508654
## 5 1 31478 SNP-1.30477. C T 0.2243767 0.8309326 -0.005794986 0.02905650
## 6 1 32733 SNP-1.31732. T G 0.3033241 0.8316936 0.015362921 0.04457715
## p
## 1 0.6342345
## 2 0.4859546
## 3 0.6162838
## 4 0.6607384
## 5 0.8419197
## 6 0.7303675
Here, we again got h2 as same as the lrt test. beta is the estimated value of \(\beta\) , while sd is the estimated standard deviation of \(\beta\) estimation.
In Wald test, we fit the full model to data to etimate MLE of \(\beta\), i.e., \(\hat\beta\). Intuitively, when \(\hat\beta\) is close to 0, we are willing to accept the null hypothesis. Thus, the distance between \(\beta\) and 0 is the basis of the constructing the test statistic.
p is the result of Wald test.
In the association.test function, we can also apply the “score test”. In the score test, we can apply the model with multiple random effects. That is \[Y = X \alpha + G \beta+\omega_1 + \omega_2 +...+\omega_l + \epsilon\] Thus, even when we have only one random effect (other than one for the error term), we have to specify random effects in the function.
gwa.score <- association.test(bm.wom, method = "lmm", response = "quantitative",
test = "score", K = K, eigenK = eigen(K), p = 4)
head(gwa.score)
## chr pos id A1 A2 freqA2 score p
## 1 1 9563 SNP-1.8562. A T 0.1398892 0.22641518 0.6341948
## 2 1 25922 SNP-1.24921. C T 0.1343490 0.48465853 0.4863197
## 3 1 26254 SNP-1.25253. A T 0.2963989 0.25122499 0.6162138
## 4 1 30214 SNP-1.29213. T A 0.1371191 0.19256777 0.6607880
## 5 1 31478 SNP-1.30477. C T 0.2243767 0.03987976 0.8417159
## 6 1 32733 SNP-1.31732. T G 0.3033241 0.11853390 0.7306301
In the score test, association.test function use AIREML (avarage information REML) algorithm for estimating parameters in the linear mixed model. AIREML accerelates computing speed and enables models with multiple random effects.
In score test, we estimate parameters of the reduced model (i.e., model without \(G\beta\) term) and also estimate the slope of the likelihood function based on the full model. This estimated slope is score, and used as a test statistic.
p is the result of score test.
Let’s compare \(-log_{10}(p)\) values among methods.
plot(-log10(gwa$p), -log10(gwa.wald$p))
abline(0,1)
plot(-log10(gwa$p), -log10(gwa.score$p))
abline(0,1)
The test statistics of LRT, Wald, score tests follow the chi-square distribution of the same degree of freedom under the null hypothesis, but they are know to have the follwing relationship Wald \(\geq\) LRT \(\geq\) score. When you have sufficient computing power and have only single random effect, LRT model is recommended.
Next, we draw a manhattan plot with manhattan function in gaston package.
manhattan(gwa, pch = 20)
Because we have a large number of SNPs, it is necessary to adjust p-values for taking into account false positives occuring in the multiple comparison.
To adjusting p-values for multiple comparison, first we apply the Bonferroni correction to the p-values.
p.adj <- p.adjust(gwa$p, method = "bonferroni")
gwa[p.adj < 0.05, ]
## chr pos id A1 A2 freqA2 h2 LRT
## 10556 3 16688126 SNP-3.16686770. G A 0.31301939 0.8084444 27.81301
## 10560 3 16719368 SNP-3.16718013. G A 0.25623269 0.8012029 31.55608
## 10561 3 16733441 SNP-3.16732086. G T 0.37257618 0.7871791 70.92125
## 15893 5 5249951 SNP-5.5249928. G A 0.40304709 0.8012601 25.44338
## 15894 5 5265914 SNP-5.5265891. C T 0.40443213 0.8020949 25.66848
## 15901 5 5327913 SNP-5.5327890. A G 0.38504155 0.7906462 33.42147
## 15904 5 5362484 SNP-5.5362461. G A 0.21052632 0.8035727 33.23475
## 15907 5 5371772 SNP-5.5371749. G A 0.47783934 0.7434928 61.51409
## 15908 5 5372955 SNP-5.5372932. A G 0.49030471 0.7557589 50.98443
## 15916 5 5442823 SNP-5.5442800. G A 0.24515235 0.7962602 25.67027
## 17883 5 28533147 SNP-5.28470501. A G 0.20914127 0.8087377 24.02196
## 17884 5 28548001 SNP-5.28485355. C T 0.07063712 0.8067769 27.76584
## 22974 7 22113010 SNP-7.22112016. A T 0.10941828 0.7956951 27.68030
## 23268 7 25214651 SNP-7.25213656. C T 0.08587258 0.7946965 27.71881
## p
## 10556 1.336246e-07
## 10560 1.937627e-08
## 10561 3.717712e-17
## 15893 4.555571e-07
## 15894 4.053949e-07
## 15901 7.420040e-09
## 15904 8.167799e-09
## 15907 4.395863e-15
## 15908 9.310132e-13
## 15916 4.050197e-07
## 17883 9.524305e-07
## 17884 1.369224e-07
## 22974 1.431124e-07
## 23268 1.402917e-07
We found three significant peaks on chromosomes 3, 5 and 7. While the Bonferroni correction is known to be concervative. Less conservative corrections are also availables.
Here, we calculate false discovery rate (FDR) proposed by Benjamini and Hochberg (1995).
fdr <- p.adjust(gwa$p, method = "BH")
gwa[fdr < 0.05, ]
## chr pos id A1 A2 freqA2 h2 LRT
## 5015 2 2679995 SNP-2.2679992. C T 0.05263158 0.8965545 16.95033
## 8883 3 1596846 SNP-3.1595841. A G 0.35318560 0.8647457 19.38137
## 10556 3 16688126 SNP-3.16686770. G A 0.31301939 0.8084444 27.81301
## 10560 3 16719368 SNP-3.16718013. G A 0.25623269 0.8012029 31.55608
## 10561 3 16733441 SNP-3.16732086. G T 0.37257618 0.7871791 70.92125
## 10562 3 16737323 SNP-3.16735968. C T 0.18698061 0.8181589 17.91413
## 15892 5 5247991 SNP-5.5247968. C A 0.40027701 0.8068695 17.90365
## 15893 5 5249951 SNP-5.5249928. G A 0.40304709 0.8012601 25.44338
## 15894 5 5265914 SNP-5.5265891. C T 0.40443213 0.8020949 25.66848
## 15896 5 5298292 SNP-5.5298269. G A 0.11634349 0.8150177 17.12572
## 15900 5 5321857 SNP-5.5321834. G A 0.12188366 0.8175423 20.72151
## 15901 5 5327913 SNP-5.5327890. A G 0.38504155 0.7906462 33.42147
## 15902 5 5338205 SNP-5.5338182. T C 0.27562327 0.8065463 23.15532
## 15904 5 5362484 SNP-5.5362461. G A 0.21052632 0.8035727 33.23475
## 15907 5 5371772 SNP-5.5371749. G A 0.47783934 0.7434928 61.51409
## 15908 5 5372955 SNP-5.5372932. A G 0.49030471 0.7557589 50.98443
## 15909 5 5375849 SNP-5.5375826. C A 0.08310249 0.8103570 17.58997
## 15912 5 5388363 SNP-5.5388340. C T 0.11080332 0.8012170 20.02733
## 15916 5 5442823 SNP-5.5442800. G A 0.24515235 0.7962602 25.67027
## 15917 5 5458379 SNP-5.5458356. A T 0.44182825 0.8094049 22.30818
## 15922 5 5516194 SNP-5.5516131. G A 0.64404432 0.8167627 22.31333
## 17883 5 28533147 SNP-5.28470501. A G 0.20914127 0.8087377 24.02196
## 17884 5 28548001 SNP-5.28485355. C T 0.07063712 0.8067769 27.76584
## 22969 7 22057564 SNP-7.22056570. A G 0.56786704 0.8056860 21.44077
## 22971 7 22074195 SNP-7.22073201. A C 0.63157895 0.8007114 16.66650
## 22974 7 22113010 SNP-7.22112016. A T 0.10941828 0.7956951 27.68030
## 22975 7 22118606 SNP-7.22117612. C T 0.64542936 0.7974147 17.32605
## 22976 7 22119568 SNP-7.22118574. G A 0.64265928 0.7987932 16.72201
## 22999 7 22252173 SNP-7.22251179. A T 0.08725762 0.8064662 17.96129
## 23078 7 23048478 SNP-7.23047484. A G 0.11495845 0.7989903 18.68074
## 23085 7 23104178 SNP-7.23103184. C T 0.15096953 0.8137814 17.75307
## 23251 7 24845026 SNP-7.24844031. C T 0.09695291 0.8052351 21.62142
## 23268 7 25214651 SNP-7.25213656. C T 0.08587258 0.7946965 27.71881
## p
## 5015 3.837057e-05
## 8883 1.070464e-05
## 10556 1.336246e-07
## 10560 1.937627e-08
## 10561 3.717712e-17
## 10562 2.310985e-05
## 15892 2.323745e-05
## 15893 4.555571e-07
## 15894 4.053949e-07
## 15896 3.498529e-05
## 15900 5.311609e-06
## 15901 7.420040e-09
## 15902 1.494295e-06
## 15904 8.167799e-09
## 15907 4.395863e-15
## 15908 9.310132e-13
## 15909 2.740298e-05
## 15912 7.634326e-06
## 15916 4.050197e-07
## 15917 2.322152e-06
## 15922 2.315936e-06
## 17883 9.524305e-07
## 17884 1.369224e-07
## 22969 3.649278e-06
## 22971 4.456111e-05
## 22974 1.431124e-07
## 22975 3.148410e-05
## 22976 4.327598e-05
## 22999 2.254431e-05
## 23078 1.545356e-05
## 23085 2.515107e-05
## 23251 3.321212e-06
## 23268 1.402917e-07
In this case, while the number of significant SNPs increase when FDR is used as critatia, the position of peaks does not change largely.
Visualize these results with a Manhattan plot.
col <- rep("black", nrow(gwa))
col[gwa$chr %% 2 == 0] <- "gray50"
col[fdr < 0.05] <- "green"
manhattan(gwa, pch = 20, col = col)
col <- rep("black", nrow(gwa))
col[gwa$chr %% 2 == 0] <- "gray50"
col[p.adj < 0.05] <- "green"
manhattan(gwa, pch = 20, col = col)
In association.test function, we can also set the fixed effect of subpopulation structure via a design matrix.
Here, we use \(X\) corresponding to the clusters obtained in the k-means clustering, i.e, Xkm.
gwa.km <- association.test(bm.wom, method = "lmm", response = "quantitative",
test = "lrt", X = Xkm, eigenK = eigen(K))
We can also use \(X\) corresponding to the Q matrix obtained in sNMF, i.e.m Xqm.
gwa.qm <- association.test(bm.wom, method = "lmm", response = "quantitative",
test = "lrt", X = Xqm, eigenK = eigen(K))
## Warning in trans.X(X, mean.y = mean(Y)): An intercept column was added to the
## covariate matrix X
gwa.qm <- association.test(bm.wom, method = "lmm", response = "quantitative",
test = "lrt", X = Xqm[,-1], eigenK = eigen(K))
## Warning in trans.X(X, mean.y = mean(Y)): An intercept column was added to the
## covariate matrix X
We can also remove the subpopulation structure effect to see the influence of the structure. Here, we used \(X\) corresponding to the clusters obtained in the k-means clustering, i.e, Xkm.
gwa.no <- association.test(bm.wom, method = "lmm", response = "quantitative",
test = "lrt", eigenK = eigen(K))
Compare p-values with Q-Q plot. Draw Q-Q plots with the previous and current results.
qqplot.pvalues(gwa$p, pch = 20, ylim = range(-log10(c(gwa$p, gwa.km$p, gwa.no$p))))
qqplot.pvalues(gwa.km$p, pch = 20, ylim = range(-log10(c(gwa$p, gwa.km$p, gwa.no$p))))
qqplot.pvalues(gwa.qm$p, pch = 20, ylim = range(-log10(c(gwa$p, gwa.km$p, gwa.no$p))))
qqplot.pvalues(gwa.no$p, pch = 20, ylim = range(-log10(c(gwa$p, gwa.km$p, gwa.no$p))))
The significance levels are not largely different between models. This may be owing to the random effect \(\omega\).
Finally, we perform GWAS without the random effect. That is, the model is \(y = X\alpha + G\beta + \epsilon\). Because this model does not contain random effect, we perform simple liner regression with the option method = “lm”.
gwa.lm <- association.test(bm.wom, method = "lm", response = "quantitative", test = "wald")
Draw Q-Q plot for the simple linear regression model.
qqplot.pvalues(gwa.lm$p, pch = 20)
The observed \(-log_{10}(p)\) values is largely inflated under the simple linear regression model, suggesting the importance of the term \(\omega\).
The package manhattanly, which uses the functionality of plotly, can be used to draw interactive Manhattan plots. This makes it easy to see which of the tens of thousands of markers are actually significant.
mr <- manhattanr(gwa, chr = "chr", bp = "pos", p = "p", snp = "id")
manhattanly(mr,
suggestiveline = ,
genomewideline = )
Q-Q plots can also be drawn.
qqly(mr$data, snp = "id")
Using gaston package, we can also draw the pattern of local linkage disequilibrium.
Here, we draw a LD heatmap around a GWAS significant peak. Let’s focus on a region around the highest peak on the chromosome 5.
selector <- gwa$chr == 5 & fdr < 0.05
gwa[selector,]
## chr pos id A1 A2 freqA2 h2 LRT
## 15892 5 5247991 SNP-5.5247968. C A 0.40027701 0.8068695 17.90365
## 15893 5 5249951 SNP-5.5249928. G A 0.40304709 0.8012601 25.44338
## 15894 5 5265914 SNP-5.5265891. C T 0.40443213 0.8020949 25.66848
## 15896 5 5298292 SNP-5.5298269. G A 0.11634349 0.8150177 17.12572
## 15900 5 5321857 SNP-5.5321834. G A 0.12188366 0.8175423 20.72151
## 15901 5 5327913 SNP-5.5327890. A G 0.38504155 0.7906462 33.42147
## 15902 5 5338205 SNP-5.5338182. T C 0.27562327 0.8065463 23.15532
## 15904 5 5362484 SNP-5.5362461. G A 0.21052632 0.8035727 33.23475
## 15907 5 5371772 SNP-5.5371749. G A 0.47783934 0.7434928 61.51409
## 15908 5 5372955 SNP-5.5372932. A G 0.49030471 0.7557589 50.98443
## 15909 5 5375849 SNP-5.5375826. C A 0.08310249 0.8103570 17.58997
## 15912 5 5388363 SNP-5.5388340. C T 0.11080332 0.8012170 20.02733
## 15916 5 5442823 SNP-5.5442800. G A 0.24515235 0.7962602 25.67027
## 15917 5 5458379 SNP-5.5458356. A T 0.44182825 0.8094049 22.30818
## 15922 5 5516194 SNP-5.5516131. G A 0.64404432 0.8167627 22.31333
## 17883 5 28533147 SNP-5.28470501. A G 0.20914127 0.8087377 24.02196
## 17884 5 28548001 SNP-5.28485355. C T 0.07063712 0.8067769 27.76584
## p
## 15892 2.323745e-05
## 15893 4.555571e-07
## 15894 4.053949e-07
## 15896 3.498529e-05
## 15900 5.311609e-06
## 15901 7.420040e-09
## 15902 1.494295e-06
## 15904 8.167799e-09
## 15907 4.395863e-15
## 15908 9.310132e-13
## 15909 2.740298e-05
## 15912 7.634326e-06
## 15916 4.050197e-07
## 15917 2.322152e-06
## 15922 2.315936e-06
## 17883 9.524305e-07
## 17884 1.369224e-07
Calculate the local LD around the significant SNPs and draw a LD heatmap.
from <- 15882 - 10
to <- 15912 + 10
ld <- LD(bm.wom, c(from, to), measure = "r2")
LD.plot(ld, snp.positions = bm.wom@snps$pos[from:to], pdf.file = "ldplot.pdf")
## quartz_off_screen
## 2
The association.test function can perform GWAS for a binary trait. The model used in the analysis is \[logit(P[Y=1|X,G,\omega])=X\alpha + G\beta + \omega\]
Prepare data for using “Awn.presense” as an example trait.
bm.wp@ped$pheno <- pheno[bm.wp@ped$id, "Leaf.pubescence"]
bm.wom <- select.inds(bm.wp, !is.na(bm.wp@ped$pheno))
bm.wom <- select.snps(bm.wom, maf > 0.05)
bm.wom <- select.snps(bm.wom, callrate > 0.8)
bm.wom
## A bed.matrix with 277 individuals and 35723 markers.
## snps stats are set
## ped stats are set
Prepare GRM for this trait (because the subset of samples may be different from the previous trait).
K <- GRM(bm.wom)
Perform GWAS with association.test function. Make response option as “binary”. In the logistic regression, “score test” is used.
gwa <- association.test(bm.wom, method = "lmm", response = "binary", test = "score",
K = K, eigenK = eigen(K), p = 4)
Show the results.
manhattan(gwa, pch = 20)
qqplot.pvalues(gwa$p, pch = 20)
fdr <- p.adjust(gwa$p, method = "BH")
gwa[fdr < 0.05, ]
## chr pos id A1 A2 freqA2 score p
## 15277 5 990754 SNP-5.990736. T A 0.1444043 29.50638 5.573304e-08
One significant association was found in this trait.
Here, we apply Bayesian regression models, which are frequently used in genomic prediction, to GWAS. The major advantage of Bayesian regression is the simultanous estimation of all SNP effects.
The following thinning procedure is not alway necessary. Here, it is done for reducing colliniarity and also the number of markers.
bm.thin <- LD.thin(bm.wp, threshold = 0.4)
bm.thin
## A bed.matrix with 388 individuals and 8312 markers.
## snps stats are set
## ped stats are set
Set phenotypic values to ped$pheno.
bm.thin@ped$pheno <- pheno[bm.thin@ped$id, "Seed.length.width.ratio"]
bm.thin.wom <- select.inds(bm.thin, !is.na(bm.thin@ped$pheno))
bm.thin.wom <- select.snps(bm.thin.wom, maf > 0.05)
bm.thin.wom <- select.snps(bm.thin.wom, callrate > 0.8)
bm.thin.wom
## A bed.matrix with 361 individuals and 7122 markers.
## snps stats are set
## ped stats are set
Prepare a design matrix for subpopulation structure and GRM.
X <- as.matrix(bm.thin.wom)
Xqm <- Qmat[bm.thin.wom@ped$id, ]
grm <- GRM(bm.thin.wom)
Perform Bayesian regression (BayesB, Meuwissen et al. 2001, Genetics 157:1819) with the prepared data. MCMC algorithm is used for the estimation of parameters.
ETA <- list(list(X = X, model = "BayesB"),
list(X = Xqm, model = "FIXED"),
list(K = grm, model = "RKHS"))
y <- bm.thin.wom@ped$pheno
model <- BGLR(y = y, ETA = ETA, nIter = 4000, burnIn = 1000)
In this seminar, nIter and burnIn are set as small numbers. In the actual application of BGLR to your data, nIter and burnIn should be much larger (e.g., nIter = 12000, burnIn = 2000) than these numbers.
Visualize the result.
col <- rep("black", ncol(bm.thin.wom))
col[bm.thin.wom@snps$chr %% 2 == 0] <- "gray50"
plot(abs(model$ETA[[1]]$b), pch = 20, col = col)
Clear associations were detected on the chromosomes 3 and 5. These are GS3 and GS5 (qSW5).
geno.data <- data.frame(Taxa = bm.wp@ped$id, as.matrix(bm.wp),
row.names = 1:nrow(bm.wp@ped))
head(geno.data[,1:5])
## Taxa SNP.1.8562. SNP.1.18594. SNP.1.24921. SNP.1.25253.
## 1 NSFTV80@02d095ba.0 0 0 0 0
## 2 NSFTV87@64fa0112.0 0 0 0 0
## 3 NSFTV96@32a6808e.0 0 0 0 0
## 4 NSFTV100@1f10be3d.0 0 0 0 0
## 5 NSFTV114@eac19fb8.0 0 0 0 0
## 6 NSFTV140@85551f9c.0 0 0 0 0
map.data <- data.frame(SNP = bm.wp@snps$id, Chromosome = bm.wp@snps$chr,
Position = bm.wp@snps$pos)
head(map.data)
## SNP Chromosome Position
## 1 SNP-1.8562. 1 9563
## 2 SNP-1.18594. 1 19595
## 3 SNP-1.24921. 1 25922
## 4 SNP-1.25253. 1 26254
## 5 SNP-1.29213. 1 30214
## 6 SNP-1.30477. 1 31478
pheno.data <- data.frame(Taxa = bm.wp@ped$id, bm.wp@ped$pheno)
head(pheno.data)
## Taxa bm.wp.ped.pheno
## 1 NSFTV80@02d095ba.0 1
## 2 NSFTV87@64fa0112.0 1
## 3 NSFTV96@32a6808e.0 NA
## 4 NSFTV100@1f10be3d.0 0
## 5 NSFTV114@eac19fb8.0 0
## 6 NSFTV140@85551f9c.0 0
pwd <- getwd()
setwd("gapit")
res.gapit <- GAPIT(Y = pheno.data,
GD = geno.data,
GM = map.data,
PCA.total = 4,
model = c("MLMM", "FarmCPU", "Blink"))
## [1] "--------------------- Welcome to GAPIT ----------------------------"
## [1] "MLMM" "FarmCPU" "Blink"
## [1] "--------------------Processing traits----------------------------------"
## [1] "Phenotype provided!"
## [1] "The 1 model in all."
## [1] "MLMM"
## [1] "GAPIT.DP in process..."
## [1] "GAPIT will filter marker with MAF setting !!"
## [1] "The markers will be filtered by SNP.MAF: 0"
## maf_index
## TRUE
## 38768
## [1] "Calling prcomp..."
## [1] "Creating PCA graphs..."
## [1] "Joining taxa..."
## [1] "Exporting PCs..."
## [1] "PC created"
## [1] "Filting marker for GAPIT.Genotype.View function ..."
## [1] "GAPIT.Genotype.View . pdfs generate.successfully!"
## [1] 388 5
## [1] "GAPIT.DP accomplished successfully for multiple traits. Results are saved"
## [1] "Processing trait: bm.wp.ped.pheno"
## [1] "GAPIT.Phenotype.View in press..."
## [1] "GAPIT.Phenotype.View output pdf has been generated successfully!"
## [1] "--------------------Phenotype and Genotype ----------------------------------"
## [1] "MLMM"
## [1] TRUE
## [1] "There are 1 traits in phenotype data."
## [1] "There are 388 individuals in phenotype data."
## [1] "There are 38768 markers in genotype data."
## [1] "Phenotype and Genotype are test OK !!"
## [1] "--------------------GAPIT Logical Done----------------------------------"
## [1] "GAPIT.IC in process..."
## [1] "There is 0 Covarinces."
## [1] "There are 277 common individuals in genotype , phenotype and CV files."
## [1] "The dimension of total CV is "
## [1] 277 5
## [1] "GAPIT.IC accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT.SS in process..."
## [1] "GAPIT will be into GWAS approach..."
## [1] "MLMM"
## [1] "The GAPIT would go into Bus..."
## [1] "GWAS by MLMM method !!"
## [1] "Calculating kinship with VanRaden method..."
## [1] "substracting P..."
## [1] "Getting X'X..."
## [1] "Adjusting..."
## [1] "Calculating kinship with VanRaden method: done"
## null model done! pseudo-h= 0.862
## step 1 done! pseudo-h= 0.694
## step 2 done! pseudo-h= 0.695
## step 3 done! pseudo-h= 0.587
## step 4 done! pseudo-h= 0.482
## step 5 done! pseudo-h= 0.449
## step 6 done! pseudo-h= 0.292
## step 7 done! pseudo-h= 0.23
## step 8 done! pseudo-h= 0.167
## step 9 done! pseudo-h= 0.049
## backward analysis
## creating output
## Loading required package: lme4
## Loading required package: Matrix
## [1] "GAPIT.RandomModel beginning..."
## [1] FALSE
## [1] "Candidate Genes could Phenotype_Variance_Explained(%) :"
## [1] 6.879883 2.527924 17.079389 18.714826 20.388430
## [1] "Creating marker p-value, MAF, estimated effect, PVE 3 plot..."
## [1] 16562 16541 28938 13628 10240 10514 13264 7316 10179
## [1] "GAPIT.ID in process..."
## [1] "Filtering SNPs with MAF..."
## [1] "Calculating FDR..."
## [1] "QQ plot..."
## [1] "Manhattan plot (Genomewise)..."
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Manhattan plot (Chromosomewise)..."
## [1] "select 0 candidate significont markers in 1 chromosome "
## [1] "select 0 candidate significont markers in 2 chromosome "
## [1] "select 1 candidate significont markers in 3 chromosome "
## [1] "select 1 candidate significont markers in 4 chromosome "
## [1] "select 1 candidate significont markers in 5 chromosome "
## [1] "select 0 candidate significont markers in 6 chromosome "
## [1] "select 0 candidate significont markers in 7 chromosome "
## [1] "select 0 candidate significont markers in 8 chromosome "
## [1] "select 1 candidate significont markers in 9 chromosome "
## [1] "select 0 candidate significont markers in 10 chromosome "
## [1] "select 0 candidate significont markers in 11 chromosome "
## [1] "select 0 candidate significont markers in 12 chromosome "
## [1] "manhattan plot on chromosome finished"
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Association table..."
## [1] "Joining tvalue and stderr"
## [1] "GAPIT Phenotype distribution with significant markers in process..."
## [1] FALSE
## [1] 5 8
## [1] "GAPIT.ID accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT accomplished successfully for multiple traits. Result are saved"
## [1] "--------------------Processing traits----------------------------------"
## [1] "Phenotype provided!"
## [1] "The 2 model in all."
## [1] "FarmCPU"
## [1] "Processing trait: bm.wp.ped.pheno"
## [1] "--------------------Phenotype and Genotype ----------------------------------"
## [1] "FarmCPU"
## [1] TRUE
## [1] "There are 1 traits in phenotype data."
## [1] "There are 388 individuals in phenotype data."
## [1] "There are 38768 markers in genotype data."
## [1] "Phenotype and Genotype are test OK !!"
## [1] "--------------------GAPIT Logical Done----------------------------------"
## [1] "GAPIT.IC in process..."
## [1] "There is 0 Covarinces."
## [1] "There are 277 common individuals in genotype , phenotype and CV files."
## [1] "The dimension of total CV is "
## [1] 277 5
## [1] "GAPIT.IC accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT.SS in process..."
## [1] "GAPIT will be into GWAS approach..."
## [1] "FarmCPU"
## [1] "The GAPIT would go into Bus..."
## [1] "--------------------- Welcome to FarmCPU ----------------------------"
## [1] "FarmCPU Started..."
## [1] "Current loop: 1 out of maximum of 10"
## [1] "seqQTN"
## NULL
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 4
## [1] "Current loop: 2 out of maximum of 10"
## [1] "optimizing possible QTNs..."
## [1] "seqQTN"
## [1] 16562 5012 24914 35802 15255 5783
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 10
## [1] "Current loop: 3 out of maximum of 10"
## [1] "optimizing possible QTNs..."
## [1] "seqQTN"
## [1] 16562 19638 33258 35736 15255 14320 8528 5012 35802 24914 5783
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 15
## [1] "Current loop: 4 out of maximum of 10"
## [1] "optimizing possible QTNs..."
## [1] "seqQTN"
## [1] 16562 14320 5012 8528 18275 31159 27999 19638 15255 33258 35736 5783
## [13] 24914 35802
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 18
## [1] "Current loop: 5 out of maximum of 10"
## [1] "optimizing possible QTNs..."
## [1] "seqQTN"
## [1] 16562 14320 31159 5012 8528 33258 18275 19638 15255 27999 24914 35736
## [13] 5783 35802
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 18
## [1] "**********FarmCPU ACCOMPLISHED SUCCESSFULLY**********"
## [1] "Calculating Orignal GWAS result..."
## [1] "GAPIT.RandomModel beginning..."
## [1] FALSE
## boundary (singular) fit: see help('isSingular')
## [1] "Candidate Genes could Phenotype_Variance_Explained(%) :"
## [1] 1.190997e+01 1.731541e+00 5.268729e+00 8.136881e+00 1.779986e+01
## [6] 9.493885e-01 2.449731e-06 6.676072e-09 5.201408e-06 1.237090e+01
## [11] 2.835158e-08 0.000000e+00 0.000000e+00 3.589689e+00
## [1] "Creating marker p-value, MAF, estimated effect, PVE 3 plot..."
## [1] "FarmCPU has been done succeedly!!"
## [1] 16562 14320 31159 5012 8528 33258 18275 19638 15255 27999 24914 35736
## [13] 5783 35802
## [1] "GAPIT.ID in process..."
## [1] "Filtering SNPs with MAF..."
## [1] "Calculating FDR..."
## [1] "QQ plot..."
## [1] "Manhattan plot (Genomewise)..."
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Manhattan plot (Chromosomewise)..."
## [1] "select 1 candidate significont markers in 1 chromosome "
## [1] "select 1 candidate significont markers in 2 chromosome "
## [1] "select 0 candidate significont markers in 3 chromosome "
## [1] "select 1 candidate significont markers in 4 chromosome "
## [1] "select 2 candidate significont markers in 5 chromosome "
## [1] "select 0 candidate significont markers in 6 chromosome "
## [1] "select 0 candidate significont markers in 7 chromosome "
## [1] "select 0 candidate significont markers in 8 chromosome "
## [1] "select 0 candidate significont markers in 9 chromosome "
## [1] "select 1 candidate significont markers in 10 chromosome "
## [1] "select 1 candidate significont markers in 11 chromosome "
## [1] "select 0 candidate significont markers in 12 chromosome "
## [1] "manhattan plot on chromosome finished"
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Association table..."
## [1] "Joining tvalue and stderr"
## [1] "GAPIT Phenotype distribution with significant markers in process..."
## [1] FALSE
## [1] 14 8
## [1] "GAPIT.ID accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT accomplished successfully for multiple traits. Result are saved"
## [1] "--------------------Processing traits----------------------------------"
## [1] "Phenotype provided!"
## [1] "The 3 model in all."
## [1] "BLINK"
## [1] "Processing trait: bm.wp.ped.pheno"
## [1] "--------------------Phenotype and Genotype ----------------------------------"
## [1] "BLINK"
## [1] TRUE
## [1] "There are 1 traits in phenotype data."
## [1] "There are 388 individuals in phenotype data."
## [1] "There are 38768 markers in genotype data."
## [1] "Phenotype and Genotype are test OK !!"
## [1] "--------------------GAPIT Logical Done----------------------------------"
## [1] "GAPIT.IC in process..."
## [1] "There is 0 Covarinces."
## [1] "There are 277 common individuals in genotype , phenotype and CV files."
## [1] "The dimension of total CV is "
## [1] 277 5
## [1] "GAPIT.IC accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT.SS in process..."
## [1] "GAPIT will be into GWAS approach..."
## [1] "BLINK"
## [1] "The GAPIT would go into Bus..."
## [1] "----------------------Welcome to Blink----------------------"
## [1] "----------------------Iteration: 1 ----------------------"
## [1] "seqQTN:"
## NULL
## [1] "----------------------Iteration: 2 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 610
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 147
## [1] "seqQTN:"
## [1] 16562 16541 5012 16842 16376 867 16420
## [1] "----------------------Iteration: 3 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 35
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 13
## [1] "seqQTN:"
## [1] 16541 16420 16562 28630 28938 9073 16839 19638 5012
## [1] "----------------------Iteration: 4 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 5
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 5
## [1] "seqQTN:"
## [1] 16541 16562 16420 8528 28630 28938 9073 16839 19638 5012
## [1] "----------------------Iteration: 5 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 11
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 9
## [1] "seqQTN:"
## [1] 16541 16562 16420 8528 16839 5012 33257 30873 7316 28630 28938 9073
## [1] "----------------------Iteration: 6 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 25
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 14
## [1] "seqQTN:"
## [1] 16541 16562 16420 8528 5012 9073 18540 16839 7316 30873 28085 33257
## [13] 28938 33617 28630
## [1] "----------------------Iteration: 7 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 14
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 13
## [1] "seqQTN:"
## [1] 16541 16420 16562 8528 30873 5012 7316 16839 28085 9073 33257 33617
## [13] 28938 18540 28630
## [1] "LD.time(sec):"
## [1] 0.000 0.050 0.001 0.000 0.001 0.001 0.001
## [1] "BIC.time(sec):"
## [1] 0.000 0.006 0.001 0.000 0.000 0.001 0.001
## [1] "GLM.time(sec):"
## [1] 1.313 1.498 1.547 1.553 1.649 1.736 1.727
## [1] "-------------Blink finished successfully in 13.29 seconds!-----------------"
## [1] "Calculating Orignal GWAS result..."
## [1] "GAPIT.RandomModel beginning..."
## [1] FALSE
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00248241 (tol = 0.002, component 1)
## [1] "Candidate Genes could Phenotype_Variance_Explained(%) :"
## [1] 11.264217 1.965852 1.875484 1.108550 6.432448 15.570494 4.572542
## [8] 1.174148 8.864626 6.081775 4.610798 1.620718 4.830481
## [1] "Creating marker p-value, MAF, estimated effect, PVE 3 plot..."
## [1] "BLINK R is done !!!!!"
## [1] 16541 16420 16562 8528 30873 5012 7316 16839 28085 9073 33257 33617
## [13] 28938 18540 28630
## [1] "GAPIT.ID in process..."
## [1] "Filtering SNPs with MAF..."
## [1] "Calculating FDR..."
## [1] "QQ plot..."
## [1] "Manhattan plot (Genomewise)..."
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Manhattan plot (Chromosomewise)..."
## [1] "select 1 candidate significont markers in 1 chromosome "
## [1] "select 3 candidate significont markers in 2 chromosome "
## [1] "select 0 candidate significont markers in 3 chromosome "
## [1] "select 0 candidate significont markers in 4 chromosome "
## [1] "select 2 candidate significont markers in 5 chromosome "
## [1] "select 0 candidate significont markers in 6 chromosome "
## [1] "select 0 candidate significont markers in 7 chromosome "
## [1] "select 1 candidate significont markers in 8 chromosome "
## [1] "select 1 candidate significont markers in 9 chromosome "
## [1] "select 1 candidate significont markers in 10 chromosome "
## [1] "select 2 candidate significont markers in 11 chromosome "
## [1] "select 0 candidate significont markers in 12 chromosome "
## [1] "manhattan plot on chromosome finished"
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Association table..."
## [1] "Joining tvalue and stderr"
## [1] "GAPIT Phenotype distribution with significant markers in process..."
## [1] FALSE
## [1] 13 8
## [1] "GAPIT.ID accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT accomplished successfully for multiple traits. Result are saved"
## [1] "Reading GWAS result with MLMM.bm.wp.ped.pheno"
## [1] "Reading GWAS result with FarmCPU.bm.wp.ped.pheno"
## [1] "Reading GWAS result with BLINK.bm.wp.ped.pheno"
## Warning in plot.formula(1 ~ 1, col = "white", xlab = "", ylab = "", ylim = c(0,
## : the formula '1 ~ 1' is treated as '1 ~ 1'
## [1] "GAPIT.Association.Manhattans has done !!!"
## [1] "GAPIT has output Multiple Manhattan figure with Symphysic type!!!"
## [1] "MLMM" "FarmCPU" "BLINK"
## [1] "Reading GWAS result with MLMM.bm.wp.ped.pheno"
## [1] "Reading GWAS result with FarmCPU.bm.wp.ped.pheno"
## [1] "Reading GWAS result with BLINK.bm.wp.ped.pheno"
## Warning in mtext(side = 4, paste(environ_name[k], sep = ""), line = 2, cex = 1,
## : "base_family" is not a graphical parameter
## Warning in mtext(side = 4, paste(environ_name[k], sep = ""), line = 2, cex = 1,
## : "base_family" is not a graphical parameter
## Warning in mtext(side = 4, paste(environ_name[k], sep = ""), line = 2, cex = 1,
## : "base_family" is not a graphical parameter
## [1] "GAPIT.Association.Manhattans has done !!!"
## [1] "GAPIT has output Multiple Manhattan figures with Wide and High types!!!"
## Warning in GAPIT.Circle.Manhattan.Plot(band = 1, r = 3, GMM$multip_mapP, : NAs
## introduced by coercion
## Warning in max(numeric.chr, na.rm = TRUE): no non-missing arguments to max;
## returning -Inf
## [1] "Multracks_QQ Plotting MLMM.bm.wp.ped.pheno..."
## [1] "Multracks_QQ Plotting FarmCPU.bm.wp.ped.pheno..."
## [1] "Multracks_QQ Plotting BLINK.bm.wp.ped.pheno..."
## [1] "Multiple QQ plot has been finished!"
## [1] "GAPIT has output Multiple Manhattan and QQ figures with Circle types!!!"
## [1] "GAPIT has done all analysis!!!"
## [1] "Please find your all results in :"
## [1] "/Users/hiro/Documents/Rprojects/京大gwasAll/gapit"
setwd(pwd)