This tutorial is aiming to explain how to perform GWAS with R.

Example datasets

Here I describe a method for performing GWAS with R. We use a dataset used in GWAS with rice germplasm collection (Zhao et al. 2011; Nature Communications 2:467), and a dataset of SNPs genotyped with High Density Rice Array (HDRA) (McCouch et al. 2016; Nature Communication 7:10532). Both datasets can be downloaded from Rice Diversity homepage (http://www.ricediversity.org/data/). The HDRA genotyping data contained 0.7 million SNPs. Of them, we used 38,769 SNPs which have no redundancy among them, low missing rate (< 1%) and minor allele frequency higher than 5%.

R packages required for the tutorial

We will use gaston and pcaMethods packages for reading and analyzing SNP data. rrBLUP, gaston, qqman and BGLR packages will be used for GWAS. All the packages can be easily downloaded and installed by using the R package installer.

# required packages
require(gaston) # for many functions
require(pcaMethods) # for probablistic PCA (a package in Bioconductor)
require(LEA)
require(RColorBrewer)
require(manhattanly)
require(BGLR) # for Bayesian regression
require(GAPIT)

Load SNP genotype data and phenotypic data

Here we assume that SNP genotype data is in a vcf (variant call format) file. Read the file with the read.vcf function in the gaston package. The gaston package allow you to load data also from a file in bed (PLINK binary biallelic genotype table, http://zzz.bwh.harvard.edu/plink/binary.shtml) format or files in bim (PLINK estented MAP file) and fam (PLINK sample information file) formats.

# read file
bm <- read.vcf("HDRA-G6-4-RDP1-RDP2-NIAS.AGCT_MAF005MIS001NR_imputed.vcf.gz")
## ped stats and snps stats have been set. 
## 'p' has been set. 
## 'mu' and 'sigma' have been set.
bm
## A bed.matrix with 1568 individuals and 38769 markers.
## snps stats are set
## ped stats are set

The vcf file contains genotypes of 38769 markers of 1568 samples.

Next we load files for phenotypic data and the information of samples (lines/varieties).

# read phenotypic data and line data
line <- read.csv("RiceDiversityLine4GWASGS.csv", 
                 row.names = 1, stringsAsFactors = T)
pheno <- read.csv("RiceDiversityPheno4GWASGS.csv", 
                  row.names = 1)
head(line)
##                   NSFTV.ID GSOR.ID        IRGC.ID Accession.Name
## NSFTV1@86f75d2b.0        1  301001 To be assigned       Agostano
## NSFTV3@5ef1be74.0        3  301003         117636  Ai-Chiao-Hong
## NSFTV4@81d03b86.0        4  301004         117601       NSF-TV 4
## NSFTV5@5533f406.0        5  301005         117641       NSF-TV 5
## NSFTV6@0d125c0e.0        6  301006         117603       ARC 7229
## NSFTV7@e37be9e5.0        7  301007 To be assigned          Arias
##                   Country.of.origin  Latitude Longitude Sub.population
## NSFTV1@86f75d2b.0             Italy 41.871940  12.56738            TEJ
## NSFTV3@5ef1be74.0             China 27.902527 116.87256            IND
## NSFTV4@81d03b86.0             India 22.903081  87.12158            AUS
## NSFTV5@5533f406.0             India 30.472664  75.34424       AROMATIC
## NSFTV6@0d125c0e.0             India 22.903081  87.12158            AUS
## NSFTV7@e37be9e5.0         Indonesia -0.789275 113.92133            TRJ
str(pheno)
## 'data.frame':    388 obs. of  36 variables:
##  $ Flowering.time.at.Arkansas      : num  75.1 89.5 94.5 87.5 89.1 ...
##  $ Flowering.time.at.Faridpur      : int  64 66 67 70 73 NA 74 75 55 NA ...
##  $ Flowering.time.at.Aberdeen      : int  81 83 93 108 101 158 122 81 74 NA ...
##  $ FT.ratio.of.Arkansas.Aberdeen   : num  0.927 1.078 1.016 0.81 0.882 ...
##  $ FT.ratio.of.Faridpur.Aberdeen   : num  0.79 0.795 0.72 0.648 0.723 ...
##  $ Culm.habit                      : num  4 7.5 6 3.5 6 3 2.5 6.5 3 NA ...
##  $ Leaf.pubescence                 : int  1 0 1 1 1 1 NA 1 NA NA ...
##  $ Flag.leaf.length                : num  28.4 39 27.7 30.4 36.9 ...
##  $ Flag.leaf.width                 : num  1.283 1 1.517 0.892 1.75 ...
##  $ Awn.presence                    : int  0 0 0 0 0 1 0 1 1 NA ...
##  $ Panicle.number.per.plant        : num  3.07 4.05 3.12 3.7 2.86 ...
##  $ Plant.height                    : num  111 144 128 154 148 ...
##  $ Panicle.length                  : num  20.5 26.8 23.5 29 30.9 ...
##  $ Primary.panicle.branch.number   : num  9.27 9.17 8.67 6.36 11.17 ...
##  $ Seed.number.per.panicle         : num  4.79 4.44 5.08 4.52 5.54 ...
##  $ Florets.per.panicle             : num  4.91 4.73 5.15 4.73 5.68 ...
##  $ Panicle.fertility               : num  0.879 0.75 0.935 0.813 0.871 0.907 0.781 0.887 0.537 NA ...
##  $ Seed.length                     : num  8.06 7.71 8.24 9.71 7.12 ...
##  $ Seed.width                      : num  3.69 2.95 2.93 2.38 3.28 ...
##  $ Seed.volume                     : num  2.59 2.11 2.15 1.94 2.2 ...
##  $ Seed.surface.area               : num  3.91 3.66 3.71 3.7 3.66 ...
##  $ Brown.rice.seed.length          : num  5.79 5.6 6.12 7.09 5.14 ...
##  $ Brown.rice.seed.width           : num  3.11 2.57 2.47 2.04 2.8 ...
##  $ Brown.rice.surface.area         : num  3.51 3.29 3.33 3.3 3.28 ...
##  $ Brown.rice.volume               : num  7.74 5.11 5.13 4.09 5.56 ...
##  $ Seed.length.width.ratio         : num  2.19 2.61 2.81 4.08 2.17 ...
##  $ Brown.rice.length.width.ratio   : num  1.86 2.18 2.48 3.47 1.83 ...
##  $ Seed.color                      : int  0 0 0 0 0 0 0 0 0 NA ...
##  $ Pericarp.color                  : int  0 0 0 0 0 0 0 0 0 NA ...
##  $ Straighthead.suseptability      : num  4.83 7.83 NA 8.33 8.17 ...
##  $ Blast.resistance                : int  8 4 3 5 3 2 2 9 2 1 ...
##  $ Amylose.content                 : num  15.6 23.3 23.1 19.3 23.2 ...
##  $ Alkali.spreading.value          : num  6.08 5.64 5.53 6.03 5.44 ...
##  $ Protein.content                 : num  8.45 9.2 8 9.6 8.5 ...
##  $ Year07Flowering.time.at.Arkansas: num  73 88 105.5 86.5 85.5 ...
##  $ Year06Flowering.time.at.Arkansas: num  77.2 91 83.5 88.5 92.7 ...

The line data contains brief passport data of lines/varieties and the name of subpopulation to which samples belong. The pheno data contain 36 traits (in some cases, the combitions of a trait andlocations and/or years). Four traits (Leaf.puescence, Awn.presnce, Seed.color, Pericarp.color) are binary traits, one (Blast.resistance) ordinal categorical, and the remainders are continuous.

Next, extract marker genotype data of samples with phenotypic data from bm data.
Information about samples are included in .

head(bm@ped)
##                   famid                    id father mother sex pheno    N0
## 1 IRGC121316@c88dcbba.0 IRGC121316@c88dcbba.0      0      0   0    NA 25217
## 2 IRGC121578@950ba1f2.0 IRGC121578@950ba1f2.0      0      0   0    NA 35744
## 3 IRGC121491@0b7d031b.0 IRGC121491@0b7d031b.0      0      0   0    NA 28683
## 4 IRGC121440@d02dfe4e.0 IRGC121440@d02dfe4e.0      0      0   0    NA 28244
## 5 IRGC121896@89688646.0 IRGC121896@89688646.0      0      0   0    NA 31812
## 6 IRGC121323@57c71a3b.0 IRGC121323@57c71a3b.0      0      0   0    NA 27630
##     N1    N2 NAs N0.x N1.x N2.x NAs.x N0.y N1.y N2.y NAs.y N0.mt N1.mt N2.mt
## 1  249 13303   0    0    0    0     0    0    0    0     0     0     0     0
## 2  163  2862   0    0    0    0     0    0    0    0     0     0     0     0
## 3  231  9855   0    0    0    0     0    0    0    0     0     0     0     0
## 4  798  9727   0    0    0    0     0    0    0    0     0     0     0     0
## 5  196  6761   0    0    0    0     0    0    0    0     0     0     0     0
## 6 1160  9979   0    0    0    0     0    0    0    0     0     0     0     0
##   NAs.mt callrate          hz callrate.x hz.x callrate.y hz.y callrate.mt hz.mt
## 1      0        1 0.006422657        NaN  NaN        NaN  NaN         NaN   NaN
## 2      0        1 0.004204390        NaN  NaN        NaN  NaN         NaN   NaN
## 3      0        1 0.005958369        NaN  NaN        NaN  NaN         NaN   NaN
## 4      0        1 0.020583456        NaN  NaN        NaN  NaN         NaN   NaN
## 5      0        1 0.005055586        NaN  NaN        NaN  NaN         NaN   NaN
## 6      0        1 0.029920813        NaN  NaN        NaN  NaN         NaN   NaN

The names of samples are in the id column. In , we can find the number of missing data (NAs), the proportion of non-missing data (callrate), the number of three genotypes (A1/A1, A1/A2, A2/A2 as N0, N1 and N2), and the proportion of heterozygous genotypes (i.e., heterozygosity, hz) etc. The actual alleles of A1 and A2 for each marker can be found in as described below.

We can also check the status of SNPs stored as .

head(bm@snps)
##   chr           id dist   pos A1 A2 quality filter   N0 N1  N2 NAs N0.f N1.f
## 1   1  SNP-1.8562.    0  9563  A  T       0   PASS 1343  7 218   0   NA   NA
## 2   1 SNP-1.18594.    0 19595  G  A       0   PASS 1481  8  79   0   NA   NA
## 3   1 SNP-1.24921.    0 25922  C  T       0   PASS 1348 24 196   0   NA   NA
## 4   1 SNP-1.25253.    0 26254  A  T       0   PASS 1152 10 406   0   NA   NA
## 5   1 SNP-1.29213.    0 30214  T  A       0   PASS 1346  4 218   0   NA   NA
## 6   1 SNP-1.30477.    0 31478  C  T       0   PASS 1013 15 540   0   NA   NA
##   N2.f NAs.f callrate        maf          hz
## 1   NA    NA        1 0.14126276 0.004464286
## 2   NA    NA        1 0.05293367 0.005102041
## 3   NA    NA        1 0.13265306 0.015306122
## 4   NA    NA        1 0.26211735 0.006377551
## 5   NA    NA        1 0.14030612 0.002551020
## 6   NA    NA        1 0.34917092 0.009566327

In , we can find chromosome number (chr), marker name (id), linkage map distance (dist), physical position (pos), A1 and A2 alleles (A1 and A2), the proportion of non-missing data (callrate), the number of three genotypes (A1/A1, A1/A2, A2/A2 as N0, N1 and N2), minor allele frequency (maf), and the proportion of heterozygous genotypes (i.e., heterozygosity, hz) etc.

Extract data we need for analysis

The samples we need (samples with phenotypic data) can be selected in the same way as a matrix object.

bm.wp <- bm[bm@ped$id %in% rownames(pheno),]
bm.wp
## A bed.matrix with 388 individuals and 38769 markers.
## snps stats are set
##   There is one monomorphic SNP
## ped stats are set

The marker genotype data of 388 individuals were successfully extracted.

Because we selected subset of individuals from all, some markers may becomes monomorphic. Here, we remove the markers from bm.wp. Later, we will again remove some markers which have small MAF and large missing rate.

bm.wp <- select.snps(bm.wp, maf > 0)
bm.wp
## A bed.matrix with 388 individuals and 38768 markers.
## snps stats are set
## ped stats are set

One marker was removed from the data. You can also perform the same processing as bm.wp <- bm.wp[, $maf > 0].

Subpopulatino structure

To see the subpopulation structure of samples, we perform principal component analysis of marker genotype data. The marker genotype data are scored and stored in bm as a score matrix. We can extract the matrix with as.matrix function.

# extract the score matrix
gt.score <- as.matrix(bm.wp)
head(gt.score)[,1:10]
##                     SNP-1.8562. SNP-1.18594. SNP-1.24921. SNP-1.25253.
## NSFTV80@02d095ba.0            0            0            0            0
## NSFTV87@64fa0112.0            0            0            0            0
## NSFTV96@32a6808e.0            0            0            0            0
## NSFTV100@1f10be3d.0           0            0            0            0
## NSFTV114@eac19fb8.0           0            0            0            0
## NSFTV140@85551f9c.0           0            0            0            0
##                     SNP-1.29213. SNP-1.30477. SNP-1.31732. SNP-1.32666.
## NSFTV80@02d095ba.0             0            0            0            2
## NSFTV87@64fa0112.0             0            2            0            2
## NSFTV96@32a6808e.0             0            0            0            2
## NSFTV100@1f10be3d.0            0            0            0            2
## NSFTV114@eac19fb8.0            0            0            0            2
## NSFTV140@85551f9c.0            0            2            0            2
##                     SNP-1.40457. SNP-1.75851.
## NSFTV80@02d095ba.0             0            0
## NSFTV87@64fa0112.0             0            0
## NSFTV96@32a6808e.0             0            0
## NSFTV100@1f10be3d.0            0            0
## NSFTV114@eac19fb8.0            0            0
## NSFTV140@85551f9c.0            0            0

As described above, marker genotypes of A1/A1, A1/A2 and A2/A2 are scored as 0, 1, 2.

To evaluate the relationship with know subpopulation structure and also with phenotypic data, we reorder the gt.score data in the same order as one in pheno and line data.

gt.score <- gt.score[rownames(pheno),]

Because the number of markers in gt.score is high, it takes a very long time with the prcomp function, which is generally used for PCA. Here, we conduct PCA with the pcaMethods package, which is specialized for PCA. Here, we perform probablistic principal component analysis (PPCA). PPCA allows us to conduct PCA even with missing data.

# perform probablistic principal component analysis (PPCA)
# PPCA can be performed even when a dataset has missing entries
pca <- pcaMethods::pca(gt.score, "ppca", nPcs = 6)
pcs <- scores(pca)
pairs(pcs[,1:6], col = line$Sub.population, oma = c(3, 3, 3, 15))
op <- par(xpd = T)
legend("bottomright", levels(line$Sub.population), col = 1:nlevels(line$Sub.population), pch = 1)

par(op)

As the result of analysis, PC1 to 4 captured the subpopulation structure of the germplasm collection well. PC5 and 6 captured within-subpopulation variation in tropical japonica (TRJ) and indica (IND).

Next, we check the relationship between principal component scores and phenotypic values of lines/varieties. Here we use the length-width ratio of seeds as an example.

# see the relationship between phenotypic values of the target trait and PC scores
plot(pcs[,1], pheno$Seed.length.width.ratio, col = line$Sub.population)

plot(pcs[,2], pheno$Seed.length.width.ratio, col = line$Sub.population)

plot(pcs[,3], pheno$Seed.length.width.ratio, col = line$Sub.population)

plot(pcs[,4], pheno$Seed.length.width.ratio, col = line$Sub.population)

The graphs show unclear relationship between phenotypic values and principal component scores. The relationship between phenotypic values and the attribution to the subpopulations is more clear. We will come back to this result later.

If we do not have prior (known) information of subpopulation structure in target population, we may be able to estimate it with clustering analysis. Here, we simply apply k-means clustering, which is a famous non-hierarchical clustering method.

# k-means clustering
km <- kmeans(gt.score, centers = 5, nstart = 10, iter.max = 100)
grp <- as.factor(km$cluster)
pairs(pcs[,1:6], col = grp, oma = c(3, 3, 3, 15))
op <- par(xpd = T)
legend("bottomright", levels(grp), col = 1:nlevels(grp), pch = 1)

par(op)

The result shows k-means clustering works well in this case. The classification in the clustering method has a similar pattern to the know subpopulation information.

Actually, we can also perform an almost identical analysis via eigen value decomposition of genetic relationship matrix (GRM).GRM is calculated as \(\frac{1}{q} XX'\) with \(X\) the standardized genotype matrix and \(q\) the number of SNPs (ncol(x)). GRM can be calculated with the GRM function in the gaston package.

grm <- GRM(bm.wp)
grm[1:6, 1:6]
##                     NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0          2.29390430         -0.0782388        -0.09368214
## NSFTV87@64fa0112.0         -0.07823880          1.3258848         0.68903351
## NSFTV96@32a6808e.0         -0.09368214          0.6890335         1.27583718
## NSFTV100@1f10be3d.0        -0.21102436          0.2894305         0.33597934
## NSFTV114@eac19fb8.0        -0.18258514          0.2570685         0.37205958
## NSFTV140@85551f9c.0        -0.12167218          0.3007855         0.37542513
##                     NSFTV100@1f10be3d.0 NSFTV114@eac19fb8.0 NSFTV140@85551f9c.0
## NSFTV80@02d095ba.0           -0.2110244          -0.1825851          -0.1216722
## NSFTV87@64fa0112.0            0.2894305           0.2570685           0.3007855
## NSFTV96@32a6808e.0            0.3359793           0.3720596           0.3754251
## NSFTV100@1f10be3d.0           1.2132517           0.9044130           0.6716095
## NSFTV114@eac19fb8.0           0.9044130           1.2369840           0.6397730
## NSFTV140@85551f9c.0           0.6716095           0.6397730           1.2719293

Perform eigen value decomposion of the grm matrix. First, reorder the matrix in the identical order to that of phenoypic data.

grm <- grm[rownames(pheno), rownames(pheno)]
eig <- eigen(grm)

Calculate the scores of principal component and draw graphs.

pcs <- eig$vectors %*% diag(sqrt(eig$values))
pairs(pcs[,1:6], col = line$Sub.population, oma = c(3, 3, 3, 15))
op <- par(xpd = T)
legend("bottomright", levels(line$Sub.population), col = 1:nlevels(line$Sub.population), pch = 1)

par(op)

While there is small differences between the previous and current analyses, the general pattern is same between two, suggesting PC scores can be also calculated via eigen value decomposion of GRM. This also suggests that GRM itself has the same information contained in PC scores.

Here, we estimate genetic background with a method specialized for population genetic structure analysis. The snmf function in the LEA package estimates ancestry coefficients using the sparse non-negative matrix factorization (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982712/). The function returns a project object containing all runs of the snmf program for the input data. It can be useful to perform several runs of snmf for various numbers of ancestral populations (K).

First, the input file to be used by the LEA must be prepared; the data must be transposed so that it is in the orientation assumed by the LEA (samples in columns, markers in rows).

write.table(t(gt.score), "rice.geno", col.names = F, row.names = F, sep = "")

Estimate ancestry coefficients at K = 5 with four replications.

project <- snmf("rice.geno",
                K = 5, 
                I = 1000,
                entropy = T,
                repetitions = 4,
                project = "new")
## The project is saved into :
##  rice.snmfProject 
## 
## To load the project, use:
##  project = load.snmfProject("rice.snmfProject")
## 
## To remove the project, use:
##  remove.snmfProject("rice.snmfProject")
## 
## [1] 2010135985
## [1] "*************************************"
## [1] "*          create.dataset            *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)                 388
##         -L (number of loci)                        38768
##         -s (seed random init)                      2010135985
##         -r (percentage of masked data)             0.05
##         -x (genotype file in .geno format)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
##         -o (output file in .geno format)           /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## 
##  Write genotype file with masked data, /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno:       OK.
## 
## [1] "*************************************"
## [1] "* sNMF K = 5  repetition 1      *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)             388
##         -L (number of loci)                    38768
##         -K (number of ancestral pops)          5
##         -x (input file)                        /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##         -q (individual admixture file)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.Q
##         -g (ancestral frequencies file)        /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.G
##         -i (number max of iterations)          200
##         -a (regularization parameter)          10
##         -s (seed random init)                  6305103281
##         -e (tolerance error)                   1E-05
##         -p (number of processes)               1
##         -I (number of SNPs used to init Q)     1000
##         - diploid
## 
## Read genotype file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno:       OK.
## 
## Initialization of Q with a random subset of 1000 SNPs:
##  [                                                                           ]
##  [============]
## Number of iterations: 31
## 
## Main algorithm:
##  [                                                                           ]
##  [======]
## Number of iterations: 16
## 
## Least-square error: 2348539.016665
## Write individual ancestry coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.Q:      OK.
## Write ancestral allele frequency coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.G:   OK.
## 
## [1] "*************************************"
## [1] "*    cross-entropy estimation       *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)         388
##         -L (number of loci)                38768
##         -K (number of ancestral pops)      5
##         -x (genotype file)                 /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
##         -q (individual admixture)          /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.Q
##         -g (ancestral frequencies)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run1/rice_r1.5.G
##         -i (with masked genotypes)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##         - diploid
## 
## Cross-Entropy (all data):     0.242812
## Cross-Entropy (masked data):  0.258944
## The project is saved into :
##  rice.snmfProject 
## 
## To load the project, use:
##  project = load.snmfProject("rice.snmfProject")
## 
## To remove the project, use:
##  remove.snmfProject("rice.snmfProject")
## 
## [1] 63915391
## [1] "*************************************"
## [1] "*          create.dataset            *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)                 388
##         -L (number of loci)                        38768
##         -s (seed random init)                      63915391
##         -r (percentage of masked data)             0.05
##         -x (genotype file in .geno format)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
##         -o (output file in .geno format)           /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## 
##  Write genotype file with masked data, /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno:       OK.
## 
## [1] "*************************************"
## [1] "* sNMF K = 5  repetition 2      *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)             388
##         -L (number of loci)                    38768
##         -K (number of ancestral pops)          5
##         -x (input file)                        /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##         -q (individual admixture file)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.Q
##         -g (ancestral frequencies file)        /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.G
##         -i (number max of iterations)          200
##         -a (regularization parameter)          10
##         -s (seed random init)                  63915391
##         -e (tolerance error)                   1E-05
##         -p (number of processes)               1
##         -I (number of SNPs used to init Q)     1000
##         - diploid
## 
## Read genotype file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno:       OK.
## 
## Initialization of Q with a random subset of 1000 SNPs:
##  [                                                                           ]
##  [==========]
## Number of iterations: 26
## 
## Main algorithm:
##  [                                                                           ]
##  [===================]
## Number of iterations: 52
## 
## Least-square error: 2368944.352814
## Write individual ancestry coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.Q:      OK.
## Write ancestral allele frequency coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.G:   OK.
## 
## [1] "*************************************"
## [1] "*    cross-entropy estimation       *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)         388
##         -L (number of loci)                38768
##         -K (number of ancestral pops)      5
##         -x (genotype file)                 /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
##         -q (individual admixture)          /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.Q
##         -g (ancestral frequencies)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run2/rice_r2.5.G
##         -i (with masked genotypes)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##         - diploid
## 
## Cross-Entropy (all data):     0.246049
## Cross-Entropy (masked data):  0.263974
## The project is saved into :
##  rice.snmfProject 
## 
## To load the project, use:
##  project = load.snmfProject("rice.snmfProject")
## 
## To remove the project, use:
##  remove.snmfProject("rice.snmfProject")
## 
## [1] 73289461
## [1] "*************************************"
## [1] "*          create.dataset            *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)                 388
##         -L (number of loci)                        38768
##         -s (seed random init)                      73289461
##         -r (percentage of masked data)             0.05
##         -x (genotype file in .geno format)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
##         -o (output file in .geno format)           /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## 
##  Write genotype file with masked data, /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno:       OK.
## 
## [1] "*************************************"
## [1] "* sNMF K = 5  repetition 3      *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)             388
##         -L (number of loci)                    38768
##         -K (number of ancestral pops)          5
##         -x (input file)                        /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##         -q (individual admixture file)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.Q
##         -g (ancestral frequencies file)        /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.G
##         -i (number max of iterations)          200
##         -a (regularization parameter)          10
##         -s (seed random init)                  73289461
##         -e (tolerance error)                   1E-05
##         -p (number of processes)               1
##         -I (number of SNPs used to init Q)     1000
##         - diploid
## 
## Read genotype file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno:       OK.
## 
## Initialization of Q with a random subset of 1000 SNPs:
##  [                                                                           ]
##  [=================]
## Number of iterations: 45
## 
## Main algorithm:
##  [                                                                           ]
##  [=======]
## Number of iterations: 18
## 
## Least-square error: 2383751.257163
## Write individual ancestry coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.Q:      OK.
## Write ancestral allele frequency coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.G:   OK.
## 
## [1] "*************************************"
## [1] "*    cross-entropy estimation       *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)         388
##         -L (number of loci)                38768
##         -K (number of ancestral pops)      5
##         -x (genotype file)                 /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
##         -q (individual admixture)          /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.Q
##         -g (ancestral frequencies)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run3/rice_r3.5.G
##         -i (with masked genotypes)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##         - diploid
## 
## Cross-Entropy (all data):     0.24721
## Cross-Entropy (masked data):  0.264126
## The project is saved into :
##  rice.snmfProject 
## 
## To load the project, use:
##  project = load.snmfProject("rice.snmfProject")
## 
## To remove the project, use:
##  remove.snmfProject("rice.snmfProject")
## 
## [1] 1616129687
## [1] "*************************************"
## [1] "*          create.dataset            *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)                 388
##         -L (number of loci)                        38768
##         -s (seed random init)                      1616129687
##         -r (percentage of masked data)             0.05
##         -x (genotype file in .geno format)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
##         -o (output file in .geno format)           /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
## 
##  Write genotype file with masked data, /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno:       OK.
## 
## [1] "*************************************"
## [1] "* sNMF K = 5  repetition 4      *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)             388
##         -L (number of loci)                    38768
##         -K (number of ancestral pops)          5
##         -x (input file)                        /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##         -q (individual admixture file)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.Q
##         -g (ancestral frequencies file)        /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.G
##         -i (number max of iterations)          200
##         -a (regularization parameter)          10
##         -s (seed random init)                  1616129687
##         -e (tolerance error)                   1E-05
##         -p (number of processes)               1
##         -I (number of SNPs used to init Q)     1000
##         - diploid
## 
## Read genotype file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno:       OK.
## 
## Initialization of Q with a random subset of 1000 SNPs:
##  [                                                                           ]
##  [=======]
## Number of iterations: 20
## 
## Main algorithm:
##  [                                                                           ]
##  [======]
## Number of iterations: 16
## 
## Least-square error: 2349650.871035
## Write individual ancestry coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.Q:      OK.
## Write ancestral allele frequency coefficient file /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.G:   OK.
## 
## [1] "*************************************"
## [1] "*    cross-entropy estimation       *"
## [1] "*************************************"
## summary of the options:
## 
##         -n (number of individuals)         388
##         -L (number of loci)                38768
##         -K (number of ancestral pops)      5
##         -x (genotype file)                 /Users/hiro/Documents/Rprojects/京大gwasAll/rice.geno
##         -q (individual admixture)          /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.Q
##         -g (ancestral frequencies)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/K5/run4/rice_r4.5.G
##         -i (with masked genotypes)         /Users/hiro/Documents/Rprojects/京大gwasAll/rice.snmf/masked/rice_I.geno
##         - diploid
## 
## Cross-Entropy (all data):     0.242809
## Cross-Entropy (masked data):  0.25901
## The project is saved into :
##  rice.snmfProject 
## 
## To load the project, use:
##  project = load.snmfProject("rice.snmfProject")
## 
## To remove the project, use:
##  remove.snmfProject("rice.snmfProject")
ce <- cross.entropy(project, K = 5); ce
##           K = 5
## run 1 0.2589444
## run 2 0.2639744
## run 3 0.2641262
## run 4 0.2590100
Qmat <- Q(project, K = 5, run = which.min(ce))
rownames(Qmat) <- rownames(gt.score)
head(Qmat)
##                            V1          V2          V3          V4          V5
## NSFTV1@86f75d2b.0 0.000099973 9.99730e-05 5.87250e-03 9.99730e-05 0.993828000
## NSFTV3@5ef1be74.0 0.982354000 9.99729e-05 9.99729e-05 9.99729e-05 0.017345600
## NSFTV4@81d03b86.0 0.149472000 7.75379e-01 9.99820e-05 7.49493e-02 0.000099982
## NSFTV5@5533f406.0 0.000099970 9.99700e-05 9.99600e-01 9.99700e-05 0.000099970
## NSFTV6@0d125c0e.0 0.147150000 7.60223e-01 9.99910e-05 8.61347e-02 0.006391970
## NSFTV7@e37be9e5.0 0.020806400 2.14061e-02 5.34589e-02 8.51703e-01 0.052625700

Visualize the Q matrix for each of known subpopulations.

palette <- brewer.pal(5, "Set2")
barplot(t(Qmat[line$Sub.population == "IND",]),  col= palette)

barplot(t(Qmat[line$Sub.population == "TEJ",]),  col= palette)

barplot(t(Qmat[line$Sub.population == "TRJ",]),  col= palette)

barplot(t(Qmat[line$Sub.population == "AUS",]),  col= palette)

barplot(t(Qmat[line$Sub.population == "AROMATIC",]),  col = palette)

barplot(t(Qmat[line$Sub.population == "ADMIX",]),  col= palette)

For more beautiful visualization of this result, please refer to this home page, https://connor-french.github.io/intro-pop-structure-r/

Preparation for GWAS

From here, we are going into GWAS analysis.

As we checked before, also has a column for phenotypic data.

head(bm.wp@ped)
##                    famid                  id father mother sex pheno    N0  N1
## 1077  NSFTV80@02d095ba.0  NSFTV80@02d095ba.0      0      0   0    NA 26224  92
## 1078  NSFTV87@64fa0112.0  NSFTV87@64fa0112.0      0      0   0    NA 31639 102
## 1079  NSFTV96@32a6808e.0  NSFTV96@32a6808e.0      0      0   0    NA 31417  97
## 1080 NSFTV100@1f10be3d.0 NSFTV100@1f10be3d.0      0      0   0    NA 33200  28
## 1081 NSFTV114@eac19fb8.0 NSFTV114@eac19fb8.0      0      0   0    NA 32355  35
## 1082 NSFTV140@85551f9c.0 NSFTV140@85551f9c.0      0      0   0    NA 31226  33
##         N2 NAs N0.x N1.x N2.x NAs.x N0.y N1.y N2.y NAs.y N0.mt N1.mt N2.mt
## 1077 12452   0    0    0    0     0    0    0    0     0     0     0     0
## 1078  7027   0    0    0    0     0    0    0    0     0     0     0     0
## 1079  7254   0    0    0    0     0    0    0    0     0     0     0     0
## 1080  5540   0    0    0    0     0    0    0    0     0     0     0     0
## 1081  6378   0    0    0    0     0    0    0    0     0     0     0     0
## 1082  7509   0    0    0    0     0    0    0    0     0     0     0     0
##      NAs.mt callrate           hz callrate.x hz.x callrate.y hz.y callrate.mt
## 1077      0        1 0.0023730912        NaN  NaN        NaN  NaN         NaN
## 1078      0        1 0.0026310359        NaN  NaN        NaN  NaN         NaN
## 1079      0        1 0.0025020636        NaN  NaN        NaN  NaN         NaN
## 1080      0        1 0.0007222452        NaN  NaN        NaN  NaN         NaN
## 1081      0        1 0.0009028064        NaN  NaN        NaN  NaN         NaN
## 1082      0        1 0.0008512175        NaN  NaN        NaN  NaN         NaN
##      hz.mt
## 1077   NaN
## 1078   NaN
## 1079   NaN
## 1080   NaN
## 1081   NaN
## 1082   NaN

Here we use this column for linking phenotypic data with marker genotype data. We use length-width ratio of seeds as an example trait.

bm.wp@ped$pheno <- pheno[bm.wp@ped$id, "Seed.length.width.ratio"]

Using select.inds function in gaston package, remove samples with missing data in the target trait.

bm.wom <- select.inds(bm.wp, !is.na(bm.wp@ped$pheno))
bm.wom
## A bed.matrix with 361 individuals and 38768 markers.
## snps stats are set
## ped stats are set

361 samples have non-missing data

Next, select SNPs with MAF > 0.05 and callrate > 0.8.

bm.wom <- select.snps(bm.wom, maf > 0.05)
bm.wom <- select.snps(bm.wom, callrate > 0.8)
bm.wom
## A bed.matrix with 361 individuals and 36113 markers.
## snps stats are set
## ped stats are set

36113 markers remain.

The model used is GWAS is \[Y = X \alpha + G \beta+\omega + \epsilon\] where \(\omega \sim N(0, \tau K)\) and \(\epsilon \sim N(0, \sigma^2 I_n)\). \(G\) is the genotypic vector of SNP tested in GWAS. The matrix \(K\) is GRM, while the matrix \(X\) is the concatenation of the covariable matrix of fixed effects. Prior to the estimation of model parameters, we have to prepare \(K\) and \(X\).

To obtain \(K\), recalculate GRM based on the selected samples and selected SNPs.

K <- GRM(bm.wom)

As \(X\), association.test function in gaston has the option to include the first \(p\) PCs (based on the matrix \(K\)). With this option, the covariable matrix of fixed effects is concatenated with the matrix of the first \(p\) PC scores.

We can also prepare the fixed effect matrix by ourselves. Here, we prepare \(X\) corresponding to the clusters obtained in the k-means clustering. Function model.matrix can be used for preparing a design matrix based on the clusters.

Xkm <- model.matrix(~grp[bm.wom@ped$id])
head(Xkm)
##   (Intercept) grp[bm.wom@ped$id]2 grp[bm.wom@ped$id]3 grp[bm.wom@ped$id]4
## 1           1                   0                   0                   0
## 2           1                   0                   0                   0
## 3           1                   0                   0                   0
## 4           1                   0                   0                   0
## 5           1                   0                   0                   0
## 6           1                   0                   0                   0
##   grp[bm.wom@ped$id]5
## 1                   1
## 2                   0
## 3                   0
## 4                   0
## 5                   0
## 6                   0

We can also prepare $X$ corresponding to the Q matrix obtained in sNMF (sparse Nonnegative Matrix Factorization).

Xqm <- Qmat[bm.wom@ped$id, ]

GWAS with association.test function

Perform GWAS with associaiton.test function in gaston package.

First, perform GWAS with a linear mixed model (\(Y = X \alpha + G\beta + \omega + \epsilon\)), likelihood ratio test (“lrt”) and the first 4 principal components.

gwa <- association.test(bm.wom, method = "lmm", response = "quantitative", 
                        test = "lrt", eigenK = eigen(K), p = 4)

See the first 6 rows of the result.

head(gwa)
##   chr   pos           id A1 A2    freqA2        h2        LRT         p
## 1   1  9563  SNP-1.8562.  A  T 0.1398892 0.8279960 0.22814072 0.6329059
## 2   1 25922 SNP-1.24921.  C  T 0.1343490 0.8284383 0.48835936 0.4846601
## 3   1 26254 SNP-1.25253.  A  T 0.2963989 0.8279921 0.25333106 0.6147393
## 4   1 30214 SNP-1.29213.  T  A 0.1371191 0.8280157 0.19370190 0.6598533
## 5   1 31478 SNP-1.30477.  C  T 0.2243767 0.8276972 0.04035891 0.8407805
## 6   1 32733 SNP-1.31732.  T  G 0.3033241 0.8280499 0.11842655 0.7307473

As the result of analysis, we got h2, LRT and p for each SNP with SNP information (chr, pos, alleles, freqA2 etc).

h2 represents estimated heritability based on the random effects \(\omega \sim N(0, \tau K)\) and \(\epsilon \sim N(0, \sigma^2 I_n)\). That is, the esimates are calculated as \(\frac{\tau}{\tau+\sigma^2}\).

LRT is the value of \(-2log(\Lambda)\), where \(\Lambda\) is the likelihoood ratio between full and reduced models. The full model is \(Y = X \alpha + G\beta + \omega + \epsilon\), while the reduced model is one without tested SNP effect, i.e., \(Y = X \alpha + \omega + \epsilon\).

p is the p value in the chi-square test of LRT value. When the reduced model is correct (i.e., \(\beta = 0\)), LRT follows the chi-square distribution with 1 degree of freedom.

The association.test function has two more methods for testing the significance of the \(G\beta\) term. One of them is “Wald test”. In the association.test function, we can specify the method with test option. That is,

gwa.wald <- association.test(bm.wom, method = "lmm", response = "quantitative", 
                             test = "wald", eigenK = eigen(K), p = 4)
head(gwa.wald)
##   chr   pos           id A1 A2    freqA2        h2         beta         sd
## 1   1  9563  SNP-1.8562.  A  T 0.1398892 0.8316850  0.021182732 0.04452256
## 2   1 25922 SNP-1.24921.  C  T 0.1343490 0.8320361  0.030970388 0.04444931
## 3   1 26254 SNP-1.25253.  A  T 0.2963989 0.8316693  0.022011048 0.04392335
## 4   1 30214 SNP-1.29213.  T  A 0.1371191 0.8317340  0.019788208 0.04508654
## 5   1 31478 SNP-1.30477.  C  T 0.2243767 0.8309326 -0.005794986 0.02905650
## 6   1 32733 SNP-1.31732.  T  G 0.3033241 0.8316936  0.015362921 0.04457715
##           p
## 1 0.6342345
## 2 0.4859546
## 3 0.6162838
## 4 0.6607384
## 5 0.8419197
## 6 0.7303675

Here, we again got h2 as same as the lrt test. beta is the estimated value of \(\beta\) , while sd is the estimated standard deviation of \(\beta\) estimation.

In Wald test, we fit the full model to data to etimate MLE of \(\beta\), i.e., \(\hat\beta\). Intuitively, when \(\hat\beta\) is close to 0, we are willing to accept the null hypothesis. Thus, the distance between \(\beta\) and 0 is the basis of the constructing the test statistic.

p is the result of Wald test.

In the association.test function, we can also apply the “score test”. In the score test, we can apply the model with multiple random effects. That is \[Y = X \alpha + G \beta+\omega_1 + \omega_2 +...+\omega_l + \epsilon\] Thus, even when we have only one random effect (other than one for the error term), we have to specify random effects in the function.

gwa.score <- association.test(bm.wom, method = "lmm", response = "quantitative", 
                              test = "score", K = K, eigenK = eigen(K), p = 4)
head(gwa.score)
##   chr   pos           id A1 A2    freqA2      score         p
## 1   1  9563  SNP-1.8562.  A  T 0.1398892 0.22641518 0.6341948
## 2   1 25922 SNP-1.24921.  C  T 0.1343490 0.48465853 0.4863197
## 3   1 26254 SNP-1.25253.  A  T 0.2963989 0.25122499 0.6162138
## 4   1 30214 SNP-1.29213.  T  A 0.1371191 0.19256777 0.6607880
## 5   1 31478 SNP-1.30477.  C  T 0.2243767 0.03987976 0.8417159
## 6   1 32733 SNP-1.31732.  T  G 0.3033241 0.11853390 0.7306301

In the score test, association.test function use AIREML (avarage information REML) algorithm for estimating parameters in the linear mixed model. AIREML accerelates computing speed and enables models with multiple random effects.

In score test, we estimate parameters of the reduced model (i.e., model without \(G\beta\) term) and also estimate the slope of the likelihood function based on the full model. This estimated slope is score, and used as a test statistic.

p is the result of score test.

Let’s compare \(-log_{10}(p)\) values among methods.

plot(-log10(gwa$p), -log10(gwa.wald$p))
abline(0,1)

plot(-log10(gwa$p), -log10(gwa.score$p))
abline(0,1)

The test statistics of LRT, Wald, score tests follow the chi-square distribution of the same degree of freedom under the null hypothesis, but they are know to have the follwing relationship Wald \(\geq\) LRT \(\geq\) score. When you have sufficient computing power and have only single random effect, LRT model is recommended.

Manhattan plot and Q-Q plot

Next, we draw a manhattan plot with manhattan function in gaston package.

manhattan(gwa, pch = 20)

Because we have a large number of SNPs, it is necessary to adjust p-values for taking into account false positives occuring in the multiple comparison.

To adjusting p-values for multiple comparison, first we apply the Bonferroni correction to the p-values.

p.adj <- p.adjust(gwa$p, method = "bonferroni")
gwa[p.adj < 0.05, ]
##       chr      pos              id A1 A2     freqA2        h2      LRT
## 10556   3 16688126 SNP-3.16686770.  G  A 0.31301939 0.8084444 27.81301
## 10560   3 16719368 SNP-3.16718013.  G  A 0.25623269 0.8012029 31.55608
## 10561   3 16733441 SNP-3.16732086.  G  T 0.37257618 0.7871791 70.92125
## 15893   5  5249951  SNP-5.5249928.  G  A 0.40304709 0.8012601 25.44338
## 15894   5  5265914  SNP-5.5265891.  C  T 0.40443213 0.8020949 25.66848
## 15901   5  5327913  SNP-5.5327890.  A  G 0.38504155 0.7906462 33.42147
## 15904   5  5362484  SNP-5.5362461.  G  A 0.21052632 0.8035727 33.23475
## 15907   5  5371772  SNP-5.5371749.  G  A 0.47783934 0.7434928 61.51409
## 15908   5  5372955  SNP-5.5372932.  A  G 0.49030471 0.7557589 50.98443
## 15916   5  5442823  SNP-5.5442800.  G  A 0.24515235 0.7962602 25.67027
## 17883   5 28533147 SNP-5.28470501.  A  G 0.20914127 0.8087377 24.02196
## 17884   5 28548001 SNP-5.28485355.  C  T 0.07063712 0.8067769 27.76584
## 22974   7 22113010 SNP-7.22112016.  A  T 0.10941828 0.7956951 27.68030
## 23268   7 25214651 SNP-7.25213656.  C  T 0.08587258 0.7946965 27.71881
##                  p
## 10556 1.336246e-07
## 10560 1.937627e-08
## 10561 3.717712e-17
## 15893 4.555571e-07
## 15894 4.053949e-07
## 15901 7.420040e-09
## 15904 8.167799e-09
## 15907 4.395863e-15
## 15908 9.310132e-13
## 15916 4.050197e-07
## 17883 9.524305e-07
## 17884 1.369224e-07
## 22974 1.431124e-07
## 23268 1.402917e-07

We found three significant peaks on chromosomes 3, 5 and 7. While the Bonferroni correction is known to be concervative. Less conservative corrections are also availables.

Here, we calculate false discovery rate (FDR) proposed by Benjamini and Hochberg (1995).

fdr <- p.adjust(gwa$p, method = "BH")
gwa[fdr < 0.05, ]
##       chr      pos              id A1 A2     freqA2        h2      LRT
## 5015    2  2679995  SNP-2.2679992.  C  T 0.05263158 0.8965545 16.95033
## 8883    3  1596846  SNP-3.1595841.  A  G 0.35318560 0.8647457 19.38137
## 10556   3 16688126 SNP-3.16686770.  G  A 0.31301939 0.8084444 27.81301
## 10560   3 16719368 SNP-3.16718013.  G  A 0.25623269 0.8012029 31.55608
## 10561   3 16733441 SNP-3.16732086.  G  T 0.37257618 0.7871791 70.92125
## 10562   3 16737323 SNP-3.16735968.  C  T 0.18698061 0.8181589 17.91413
## 15892   5  5247991  SNP-5.5247968.  C  A 0.40027701 0.8068695 17.90365
## 15893   5  5249951  SNP-5.5249928.  G  A 0.40304709 0.8012601 25.44338
## 15894   5  5265914  SNP-5.5265891.  C  T 0.40443213 0.8020949 25.66848
## 15896   5  5298292  SNP-5.5298269.  G  A 0.11634349 0.8150177 17.12572
## 15900   5  5321857  SNP-5.5321834.  G  A 0.12188366 0.8175423 20.72151
## 15901   5  5327913  SNP-5.5327890.  A  G 0.38504155 0.7906462 33.42147
## 15902   5  5338205  SNP-5.5338182.  T  C 0.27562327 0.8065463 23.15532
## 15904   5  5362484  SNP-5.5362461.  G  A 0.21052632 0.8035727 33.23475
## 15907   5  5371772  SNP-5.5371749.  G  A 0.47783934 0.7434928 61.51409
## 15908   5  5372955  SNP-5.5372932.  A  G 0.49030471 0.7557589 50.98443
## 15909   5  5375849  SNP-5.5375826.  C  A 0.08310249 0.8103570 17.58997
## 15912   5  5388363  SNP-5.5388340.  C  T 0.11080332 0.8012170 20.02733
## 15916   5  5442823  SNP-5.5442800.  G  A 0.24515235 0.7962602 25.67027
## 15917   5  5458379  SNP-5.5458356.  A  T 0.44182825 0.8094049 22.30818
## 15922   5  5516194  SNP-5.5516131.  G  A 0.64404432 0.8167627 22.31333
## 17883   5 28533147 SNP-5.28470501.  A  G 0.20914127 0.8087377 24.02196
## 17884   5 28548001 SNP-5.28485355.  C  T 0.07063712 0.8067769 27.76584
## 22969   7 22057564 SNP-7.22056570.  A  G 0.56786704 0.8056860 21.44077
## 22971   7 22074195 SNP-7.22073201.  A  C 0.63157895 0.8007114 16.66650
## 22974   7 22113010 SNP-7.22112016.  A  T 0.10941828 0.7956951 27.68030
## 22975   7 22118606 SNP-7.22117612.  C  T 0.64542936 0.7974147 17.32605
## 22976   7 22119568 SNP-7.22118574.  G  A 0.64265928 0.7987932 16.72201
## 22999   7 22252173 SNP-7.22251179.  A  T 0.08725762 0.8064662 17.96129
## 23078   7 23048478 SNP-7.23047484.  A  G 0.11495845 0.7989903 18.68074
## 23085   7 23104178 SNP-7.23103184.  C  T 0.15096953 0.8137814 17.75307
## 23251   7 24845026 SNP-7.24844031.  C  T 0.09695291 0.8052351 21.62142
## 23268   7 25214651 SNP-7.25213656.  C  T 0.08587258 0.7946965 27.71881
##                  p
## 5015  3.837057e-05
## 8883  1.070464e-05
## 10556 1.336246e-07
## 10560 1.937627e-08
## 10561 3.717712e-17
## 10562 2.310985e-05
## 15892 2.323745e-05
## 15893 4.555571e-07
## 15894 4.053949e-07
## 15896 3.498529e-05
## 15900 5.311609e-06
## 15901 7.420040e-09
## 15902 1.494295e-06
## 15904 8.167799e-09
## 15907 4.395863e-15
## 15908 9.310132e-13
## 15909 2.740298e-05
## 15912 7.634326e-06
## 15916 4.050197e-07
## 15917 2.322152e-06
## 15922 2.315936e-06
## 17883 9.524305e-07
## 17884 1.369224e-07
## 22969 3.649278e-06
## 22971 4.456111e-05
## 22974 1.431124e-07
## 22975 3.148410e-05
## 22976 4.327598e-05
## 22999 2.254431e-05
## 23078 1.545356e-05
## 23085 2.515107e-05
## 23251 3.321212e-06
## 23268 1.402917e-07

In this case, while the number of significant SNPs increase when FDR is used as critatia, the position of peaks does not change largely.

Visualize these results with a Manhattan plot.

col <- rep("black", nrow(gwa))
col[gwa$chr %% 2 == 0] <- "gray50"
col[fdr < 0.05] <- "green"
manhattan(gwa, pch = 20, col = col)

col <- rep("black", nrow(gwa))
col[gwa$chr %% 2 == 0] <- "gray50"
col[p.adj < 0.05] <- "green"
manhattan(gwa, pch = 20, col = col)

Subpopulation effect

In association.test function, we can also set the fixed effect of subpopulation structure via a design matrix.

Here, we use \(X\) corresponding to the clusters obtained in the k-means clustering, i.e, Xkm.

gwa.km <- association.test(bm.wom, method = "lmm", response = "quantitative", 
                           test = "lrt", X = Xkm, eigenK = eigen(K))

We can also use \(X\) corresponding to the Q matrix obtained in sNMF, i.e.m Xqm.

gwa.qm <- association.test(bm.wom, method = "lmm", response = "quantitative", 
                           test = "lrt", X = Xqm, eigenK = eigen(K))
## Warning in trans.X(X, mean.y = mean(Y)): An intercept column was added to the
## covariate matrix X
gwa.qm <- association.test(bm.wom, method = "lmm", response = "quantitative", 
                           test = "lrt", X = Xqm[,-1], eigenK = eigen(K))
## Warning in trans.X(X, mean.y = mean(Y)): An intercept column was added to the
## covariate matrix X

We can also remove the subpopulation structure effect to see the influence of the structure. Here, we used \(X\) corresponding to the clusters obtained in the k-means clustering, i.e, Xkm.

gwa.no <- association.test(bm.wom, method = "lmm", response = "quantitative", 
                           test = "lrt", eigenK = eigen(K))

Compare p-values with Q-Q plot. Draw Q-Q plots with the previous and current results.

qqplot.pvalues(gwa$p, pch = 20, ylim = range(-log10(c(gwa$p, gwa.km$p, gwa.no$p))))

qqplot.pvalues(gwa.km$p, pch = 20, ylim = range(-log10(c(gwa$p, gwa.km$p, gwa.no$p))))

qqplot.pvalues(gwa.qm$p, pch = 20, ylim = range(-log10(c(gwa$p, gwa.km$p, gwa.no$p))))

qqplot.pvalues(gwa.no$p, pch = 20, ylim = range(-log10(c(gwa$p, gwa.km$p, gwa.no$p))))

The significance levels are not largely different between models. This may be owing to the random effect \(\omega\).

Finally, we perform GWAS without the random effect. That is, the model is \(y = X\alpha + G\beta + \epsilon\). Because this model does not contain random effect, we perform simple liner regression with the option method = “lm”.

gwa.lm <- association.test(bm.wom, method = "lm", response = "quantitative", test = "wald")

Draw Q-Q plot for the simple linear regression model.

qqplot.pvalues(gwa.lm$p, pch = 20)

The observed \(-log_{10}(p)\) values is largely inflated under the simple linear regression model, suggesting the importance of the term \(\omega\).

Interactive Manhattan plot

The package manhattanly, which uses the functionality of plotly, can be used to draw interactive Manhattan plots. This makes it easy to see which of the tens of thousands of markers are actually significant.

mr <- manhattanr(gwa, chr = "chr", bp = "pos", p = "p", snp = "id")
manhattanly(mr, 
            suggestiveline = ,
            genomewideline = )

Q-Q plots can also be drawn.

qqly(mr$data, snp = "id")

Local LD pattern

Using gaston package, we can also draw the pattern of local linkage disequilibrium.

Here, we draw a LD heatmap around a GWAS significant peak. Let’s focus on a region around the highest peak on the chromosome 5.

selector <- gwa$chr == 5 & fdr < 0.05
gwa[selector,]
##       chr      pos              id A1 A2     freqA2        h2      LRT
## 15892   5  5247991  SNP-5.5247968.  C  A 0.40027701 0.8068695 17.90365
## 15893   5  5249951  SNP-5.5249928.  G  A 0.40304709 0.8012601 25.44338
## 15894   5  5265914  SNP-5.5265891.  C  T 0.40443213 0.8020949 25.66848
## 15896   5  5298292  SNP-5.5298269.  G  A 0.11634349 0.8150177 17.12572
## 15900   5  5321857  SNP-5.5321834.  G  A 0.12188366 0.8175423 20.72151
## 15901   5  5327913  SNP-5.5327890.  A  G 0.38504155 0.7906462 33.42147
## 15902   5  5338205  SNP-5.5338182.  T  C 0.27562327 0.8065463 23.15532
## 15904   5  5362484  SNP-5.5362461.  G  A 0.21052632 0.8035727 33.23475
## 15907   5  5371772  SNP-5.5371749.  G  A 0.47783934 0.7434928 61.51409
## 15908   5  5372955  SNP-5.5372932.  A  G 0.49030471 0.7557589 50.98443
## 15909   5  5375849  SNP-5.5375826.  C  A 0.08310249 0.8103570 17.58997
## 15912   5  5388363  SNP-5.5388340.  C  T 0.11080332 0.8012170 20.02733
## 15916   5  5442823  SNP-5.5442800.  G  A 0.24515235 0.7962602 25.67027
## 15917   5  5458379  SNP-5.5458356.  A  T 0.44182825 0.8094049 22.30818
## 15922   5  5516194  SNP-5.5516131.  G  A 0.64404432 0.8167627 22.31333
## 17883   5 28533147 SNP-5.28470501.  A  G 0.20914127 0.8087377 24.02196
## 17884   5 28548001 SNP-5.28485355.  C  T 0.07063712 0.8067769 27.76584
##                  p
## 15892 2.323745e-05
## 15893 4.555571e-07
## 15894 4.053949e-07
## 15896 3.498529e-05
## 15900 5.311609e-06
## 15901 7.420040e-09
## 15902 1.494295e-06
## 15904 8.167799e-09
## 15907 4.395863e-15
## 15908 9.310132e-13
## 15909 2.740298e-05
## 15912 7.634326e-06
## 15916 4.050197e-07
## 15917 2.322152e-06
## 15922 2.315936e-06
## 17883 9.524305e-07
## 17884 1.369224e-07

Calculate the local LD around the significant SNPs and draw a LD heatmap.

from <- 15882 - 10
to <- 15912 + 10
ld <- LD(bm.wom, c(from, to), measure = "r2")
LD.plot(ld, snp.positions = bm.wom@snps$pos[from:to], pdf.file = "ldplot.pdf")
## quartz_off_screen 
##                 2

Binary traits

The association.test function can perform GWAS for a binary trait. The model used in the analysis is \[logit(P[Y=1|X,G,\omega])=X\alpha + G\beta + \omega\]

Prepare data for using “Awn.presense” as an example trait.

bm.wp@ped$pheno <- pheno[bm.wp@ped$id, "Leaf.pubescence"]
bm.wom <- select.inds(bm.wp, !is.na(bm.wp@ped$pheno))
bm.wom <- select.snps(bm.wom, maf > 0.05)
bm.wom <- select.snps(bm.wom, callrate > 0.8)
bm.wom
## A bed.matrix with 277 individuals and 35723 markers.
## snps stats are set
## ped stats are set

Prepare GRM for this trait (because the subset of samples may be different from the previous trait).

K <- GRM(bm.wom)

Perform GWAS with association.test function. Make response option as “binary”. In the logistic regression, “score test” is used.

gwa <- association.test(bm.wom, method = "lmm", response = "binary", test = "score", 
                        K = K, eigenK = eigen(K), p = 4)

Show the results.

manhattan(gwa, pch = 20)

qqplot.pvalues(gwa$p, pch = 20)

fdr <- p.adjust(gwa$p, method = "BH")
gwa[fdr < 0.05, ]
##       chr    pos            id A1 A2    freqA2    score            p
## 15277   5 990754 SNP-5.990736.  T  A 0.1444043 29.50638 5.573304e-08

One significant association was found in this trait.

GWAS with Bayesian regression

Here, we apply Bayesian regression models, which are frequently used in genomic prediction, to GWAS. The major advantage of Bayesian regression is the simultanous estimation of all SNP effects.

The following thinning procedure is not alway necessary. Here, it is done for reducing colliniarity and also the number of markers.

bm.thin <- LD.thin(bm.wp, threshold = 0.4)
bm.thin
## A bed.matrix with 388 individuals and 8312 markers.
## snps stats are set
## ped stats are set

Set phenotypic values to ped$pheno.

bm.thin@ped$pheno <- pheno[bm.thin@ped$id, "Seed.length.width.ratio"]
bm.thin.wom <- select.inds(bm.thin, !is.na(bm.thin@ped$pheno))
bm.thin.wom <- select.snps(bm.thin.wom, maf > 0.05)
bm.thin.wom <- select.snps(bm.thin.wom, callrate > 0.8)
bm.thin.wom
## A bed.matrix with 361 individuals and 7122 markers.
## snps stats are set
## ped stats are set

Prepare a design matrix for subpopulation structure and GRM.

X <- as.matrix(bm.thin.wom)
Xqm <- Qmat[bm.thin.wom@ped$id, ]
grm <- GRM(bm.thin.wom)

Perform Bayesian regression (BayesB, Meuwissen et al. 2001, Genetics 157:1819) with the prepared data. MCMC algorithm is used for the estimation of parameters.

ETA <- list(list(X = X, model = "BayesB"), 
            list(X = Xqm, model = "FIXED"), 
            list(K = grm, model = "RKHS"))
y <- bm.thin.wom@ped$pheno
model <- BGLR(y = y, ETA = ETA, nIter = 4000, burnIn = 1000)

In this seminar, nIter and burnIn are set as small numbers. In the actual application of BGLR to your data, nIter and burnIn should be much larger (e.g., nIter = 12000, burnIn = 2000) than these numbers.

Visualize the result.

col <- rep("black", ncol(bm.thin.wom))
col[bm.thin.wom@snps$chr %% 2 == 0] <- "gray50"
plot(abs(model$ETA[[1]]$b), pch = 20, col = col)

Clear associations were detected on the chromosomes 3 and 5. These are GS3 and GS5 (qSW5).

GAPIT

geno.data <- data.frame(Taxa = bm.wp@ped$id, as.matrix(bm.wp), 
                        row.names = 1:nrow(bm.wp@ped))
head(geno.data[,1:5])
##                  Taxa SNP.1.8562. SNP.1.18594. SNP.1.24921. SNP.1.25253.
## 1  NSFTV80@02d095ba.0           0            0            0            0
## 2  NSFTV87@64fa0112.0           0            0            0            0
## 3  NSFTV96@32a6808e.0           0            0            0            0
## 4 NSFTV100@1f10be3d.0           0            0            0            0
## 5 NSFTV114@eac19fb8.0           0            0            0            0
## 6 NSFTV140@85551f9c.0           0            0            0            0
map.data <- data.frame(SNP = bm.wp@snps$id, Chromosome = bm.wp@snps$chr,
                       Position = bm.wp@snps$pos)
head(map.data)
##            SNP Chromosome Position
## 1  SNP-1.8562.          1     9563
## 2 SNP-1.18594.          1    19595
## 3 SNP-1.24921.          1    25922
## 4 SNP-1.25253.          1    26254
## 5 SNP-1.29213.          1    30214
## 6 SNP-1.30477.          1    31478
pheno.data <- data.frame(Taxa = bm.wp@ped$id, bm.wp@ped$pheno)
head(pheno.data)
##                  Taxa bm.wp.ped.pheno
## 1  NSFTV80@02d095ba.0               1
## 2  NSFTV87@64fa0112.0               1
## 3  NSFTV96@32a6808e.0              NA
## 4 NSFTV100@1f10be3d.0               0
## 5 NSFTV114@eac19fb8.0               0
## 6 NSFTV140@85551f9c.0               0
pwd <- getwd()
setwd("gapit")
res.gapit <- GAPIT(Y = pheno.data,
                   GD = geno.data,
                   GM = map.data,
                   PCA.total = 4,
                   model = c("MLMM", "FarmCPU", "Blink"))
## [1] "--------------------- Welcome to GAPIT ----------------------------"
## [1] "MLMM"    "FarmCPU" "Blink"  
## [1] "--------------------Processing traits----------------------------------"
## [1] "Phenotype provided!"
## [1] "The 1 model in all."
## [1] "MLMM"
## [1] "GAPIT.DP in process..."
## [1] "GAPIT will filter marker with MAF setting !!"
## [1] "The markers will be filtered by SNP.MAF: 0"
## maf_index
##  TRUE 
## 38768 
## [1] "Calling prcomp..."
## [1] "Creating PCA graphs..."
## [1] "Joining taxa..."
## [1] "Exporting PCs..."
## [1] "PC created"
## [1] "Filting marker for GAPIT.Genotype.View function ..."
## [1] "GAPIT.Genotype.View . pdfs generate.successfully!"
## [1] 388   5
## [1] "GAPIT.DP accomplished successfully for multiple traits. Results are saved"
## [1] "Processing trait: bm.wp.ped.pheno"
## [1] "GAPIT.Phenotype.View in press..."
## [1] "GAPIT.Phenotype.View output pdf has been generated successfully!"
## [1] "--------------------Phenotype and Genotype ----------------------------------"
## [1] "MLMM"
## [1] TRUE
## [1] "There are  1  traits in phenotype data."
## [1] "There are  388  individuals in phenotype data."
## [1] "There are  38768  markers in genotype data."
## [1] "Phenotype and Genotype are test OK !!"
## [1] "--------------------GAPIT Logical Done----------------------------------"
## [1] "GAPIT.IC in process..."
## [1] "There is 0 Covarinces."
## [1] "There are 277 common individuals in genotype , phenotype and CV files."
## [1] "The dimension of total CV is "
## [1] 277   5
## [1] "GAPIT.IC accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT.SS in process..."
## [1] "GAPIT will be into GWAS approach..."
## [1] "MLMM"
## [1] "The GAPIT would go into Bus..."
## [1] "GWAS by MLMM method !!"
## [1] "Calculating kinship with VanRaden method..."
## [1] "substracting P..."
## [1] "Getting X'X..."
## [1] "Adjusting..."
## [1] "Calculating kinship with VanRaden method: done"
## null model done! pseudo-h= 0.862 
## step 1 done! pseudo-h= 0.694 
## step  2  done! pseudo-h= 0.695 
## step  3  done! pseudo-h= 0.587 
## step  4  done! pseudo-h= 0.482 
## step  5  done! pseudo-h= 0.449 
## step  6  done! pseudo-h= 0.292 
## step  7  done! pseudo-h= 0.23 
## step  8  done! pseudo-h= 0.167 
## step  9  done! pseudo-h= 0.049 
## backward analysis 
## creating output
## Loading required package: lme4
## Loading required package: Matrix
## [1] "GAPIT.RandomModel beginning..."
## [1] FALSE
## [1] "Candidate Genes could Phenotype_Variance_Explained(%) :"
## [1]  6.879883  2.527924 17.079389 18.714826 20.388430
## [1] "Creating marker p-value, MAF, estimated effect, PVE 3 plot..."
## [1] 16562 16541 28938 13628 10240 10514 13264  7316 10179
## [1] "GAPIT.ID in process..."
## [1] "Filtering SNPs with MAF..."
## [1] "Calculating FDR..."
## [1] "QQ plot..."
## [1] "Manhattan plot (Genomewise)..."
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Manhattan plot (Chromosomewise)..."
## [1] "select 0 candidate significont markers in 1 chromosome "
## [1] "select 0 candidate significont markers in 2 chromosome "
## [1] "select 1 candidate significont markers in 3 chromosome "
## [1] "select 1 candidate significont markers in 4 chromosome "
## [1] "select 1 candidate significont markers in 5 chromosome "
## [1] "select 0 candidate significont markers in 6 chromosome "
## [1] "select 0 candidate significont markers in 7 chromosome "
## [1] "select 0 candidate significont markers in 8 chromosome "
## [1] "select 1 candidate significont markers in 9 chromosome "
## [1] "select 0 candidate significont markers in 10 chromosome "
## [1] "select 0 candidate significont markers in 11 chromosome "
## [1] "select 0 candidate significont markers in 12 chromosome "
## [1] "manhattan plot on chromosome finished"
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Association table..."
## [1] "Joining tvalue and stderr"
## [1] "GAPIT Phenotype distribution with significant markers in process..."
## [1] FALSE
## [1] 5 8
## [1] "GAPIT.ID accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT accomplished successfully for multiple traits. Result are saved"
## [1] "--------------------Processing traits----------------------------------"
## [1] "Phenotype provided!"
## [1] "The 2 model in all."
## [1] "FarmCPU"
## [1] "Processing trait: bm.wp.ped.pheno"
## [1] "--------------------Phenotype and Genotype ----------------------------------"
## [1] "FarmCPU"
## [1] TRUE
## [1] "There are  1  traits in phenotype data."
## [1] "There are  388  individuals in phenotype data."
## [1] "There are  38768  markers in genotype data."
## [1] "Phenotype and Genotype are test OK !!"
## [1] "--------------------GAPIT Logical Done----------------------------------"
## [1] "GAPIT.IC in process..."
## [1] "There is 0 Covarinces."
## [1] "There are 277 common individuals in genotype , phenotype and CV files."
## [1] "The dimension of total CV is "
## [1] 277   5
## [1] "GAPIT.IC accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT.SS in process..."
## [1] "GAPIT will be into GWAS approach..."
## [1] "FarmCPU"
## [1] "The GAPIT would go into Bus..."
## [1] "--------------------- Welcome to FarmCPU ----------------------------"
## [1] "FarmCPU Started..."
## [1] "Current loop: 1 out of maximum of 10"
## [1] "seqQTN"
## NULL
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 4
## [1] "Current loop: 2 out of maximum of 10"
## [1] "optimizing possible QTNs..."
## [1] "seqQTN"
## [1] 16562  5012 24914 35802 15255  5783
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 10
## [1] "Current loop: 3 out of maximum of 10"
## [1] "optimizing possible QTNs..."
## [1] "seqQTN"
##  [1] 16562 19638 33258 35736 15255 14320  8528  5012 35802 24914  5783
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 15
## [1] "Current loop: 4 out of maximum of 10"
## [1] "optimizing possible QTNs..."
## [1] "seqQTN"
##  [1] 16562 14320  5012  8528 18275 31159 27999 19638 15255 33258 35736  5783
## [13] 24914 35802
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 18
## [1] "Current loop: 5 out of maximum of 10"
## [1] "optimizing possible QTNs..."
## [1] "seqQTN"
##  [1] 16562 14320 31159  5012  8528 33258 18275 19638 15255 27999 24914 35736
## [13]  5783 35802
## [1] "scanning..."
## [1] "number of covariates in current loop is:"
## [1] 18
## [1] "**********FarmCPU ACCOMPLISHED SUCCESSFULLY**********"
## [1] "Calculating Orignal GWAS result..."
## [1] "GAPIT.RandomModel beginning..."
## [1] FALSE
## boundary (singular) fit: see help('isSingular')
## [1] "Candidate Genes could Phenotype_Variance_Explained(%) :"
##  [1] 1.190997e+01 1.731541e+00 5.268729e+00 8.136881e+00 1.779986e+01
##  [6] 9.493885e-01 2.449731e-06 6.676072e-09 5.201408e-06 1.237090e+01
## [11] 2.835158e-08 0.000000e+00 0.000000e+00 3.589689e+00
## [1] "Creating marker p-value, MAF, estimated effect, PVE 3 plot..."
## [1] "FarmCPU has been done succeedly!!"
##  [1] 16562 14320 31159  5012  8528 33258 18275 19638 15255 27999 24914 35736
## [13]  5783 35802
## [1] "GAPIT.ID in process..."
## [1] "Filtering SNPs with MAF..."
## [1] "Calculating FDR..."
## [1] "QQ plot..."
## [1] "Manhattan plot (Genomewise)..."
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Manhattan plot (Chromosomewise)..."
## [1] "select 1 candidate significont markers in 1 chromosome "
## [1] "select 1 candidate significont markers in 2 chromosome "
## [1] "select 0 candidate significont markers in 3 chromosome "
## [1] "select 1 candidate significont markers in 4 chromosome "
## [1] "select 2 candidate significont markers in 5 chromosome "
## [1] "select 0 candidate significont markers in 6 chromosome "
## [1] "select 0 candidate significont markers in 7 chromosome "
## [1] "select 0 candidate significont markers in 8 chromosome "
## [1] "select 0 candidate significont markers in 9 chromosome "
## [1] "select 1 candidate significont markers in 10 chromosome "
## [1] "select 1 candidate significont markers in 11 chromosome "
## [1] "select 0 candidate significont markers in 12 chromosome "
## [1] "manhattan plot on chromosome finished"
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Association table..."
## [1] "Joining tvalue and stderr"
## [1] "GAPIT Phenotype distribution with significant markers in process..."
## [1] FALSE
## [1] 14  8
## [1] "GAPIT.ID accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT accomplished successfully for multiple traits. Result are saved"
## [1] "--------------------Processing traits----------------------------------"
## [1] "Phenotype provided!"
## [1] "The 3 model in all."
## [1] "BLINK"
## [1] "Processing trait: bm.wp.ped.pheno"
## [1] "--------------------Phenotype and Genotype ----------------------------------"
## [1] "BLINK"
## [1] TRUE
## [1] "There are  1  traits in phenotype data."
## [1] "There are  388  individuals in phenotype data."
## [1] "There are  38768  markers in genotype data."
## [1] "Phenotype and Genotype are test OK !!"
## [1] "--------------------GAPIT Logical Done----------------------------------"
## [1] "GAPIT.IC in process..."
## [1] "There is 0 Covarinces."
## [1] "There are 277 common individuals in genotype , phenotype and CV files."
## [1] "The dimension of total CV is "
## [1] 277   5
## [1] "GAPIT.IC accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT.SS in process..."
## [1] "GAPIT will be into GWAS approach..."
## [1] "BLINK"
## [1] "The GAPIT would go into Bus..."
## [1] "----------------------Welcome to Blink----------------------"
## [1] "----------------------Iteration: 1 ----------------------"
## [1] "seqQTN:"
## NULL
## [1] "----------------------Iteration: 2 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 610
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 147
## [1] "seqQTN:"
## [1] 16562 16541  5012 16842 16376   867 16420
## [1] "----------------------Iteration: 3 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 35
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 13
## [1] "seqQTN:"
## [1] 16541 16420 16562 28630 28938  9073 16839 19638  5012
## [1] "----------------------Iteration: 4 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 5
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 5
## [1] "seqQTN:"
##  [1] 16541 16562 16420  8528 28630 28938  9073 16839 19638  5012
## [1] "----------------------Iteration: 5 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 11
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 9
## [1] "seqQTN:"
##  [1] 16541 16562 16420  8528 16839  5012 33257 30873  7316 28630 28938  9073
## [1] "----------------------Iteration: 6 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 25
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 14
## [1] "seqQTN:"
##  [1] 16541 16562 16420  8528  5012  9073 18540 16839  7316 30873 28085 33257
## [13] 28938 33617 28630
## [1] "----------------------Iteration: 7 ----------------------"
## [1] "LD remove is working...."
## [1] "Number SNPs for LD remove:"
## [1] 14
## [1] "Model selection based on BIC is working...."
## [1] "Number of SNPs for BIC selection:"
## [1] 13
## [1] "seqQTN:"
##  [1] 16541 16420 16562  8528 30873  5012  7316 16839 28085  9073 33257 33617
## [13] 28938 18540 28630
## [1] "LD.time(sec):"
## [1] 0.000 0.050 0.001 0.000 0.001 0.001 0.001
## [1] "BIC.time(sec):"
## [1] 0.000 0.006 0.001 0.000 0.000 0.001 0.001
## [1] "GLM.time(sec):"
## [1] 1.313 1.498 1.547 1.553 1.649 1.736 1.727
## [1] "-------------Blink finished successfully in 13.29 seconds!-----------------"
## [1] "Calculating Orignal GWAS result..."
## [1] "GAPIT.RandomModel beginning..."
## [1] FALSE
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00248241 (tol = 0.002, component 1)
## [1] "Candidate Genes could Phenotype_Variance_Explained(%) :"
##  [1] 11.264217  1.965852  1.875484  1.108550  6.432448 15.570494  4.572542
##  [8]  1.174148  8.864626  6.081775  4.610798  1.620718  4.830481
## [1] "Creating marker p-value, MAF, estimated effect, PVE 3 plot..."
## [1] "BLINK R is done !!!!!"
##  [1] 16541 16420 16562  8528 30873  5012  7316 16839 28085  9073 33257 33617
## [13] 28938 18540 28630
## [1] "GAPIT.ID in process..."
## [1] "Filtering SNPs with MAF..."
## [1] "Calculating FDR..."
## [1] "QQ plot..."
## [1] "Manhattan plot (Genomewise)..."
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Manhattan plot (Chromosomewise)..."
## [1] "select 1 candidate significont markers in 1 chromosome "
## [1] "select 3 candidate significont markers in 2 chromosome "
## [1] "select 0 candidate significont markers in 3 chromosome "
## [1] "select 0 candidate significont markers in 4 chromosome "
## [1] "select 2 candidate significont markers in 5 chromosome "
## [1] "select 0 candidate significont markers in 6 chromosome "
## [1] "select 0 candidate significont markers in 7 chromosome "
## [1] "select 1 candidate significont markers in 8 chromosome "
## [1] "select 1 candidate significont markers in 9 chromosome "
## [1] "select 1 candidate significont markers in 10 chromosome "
## [1] "select 2 candidate significont markers in 11 chromosome "
## [1] "select 0 candidate significont markers in 12 chromosome "
## [1] "manhattan plot on chromosome finished"
## [1] "GAPIT.Manhattan accomplished successfully!zw"
## [1] "Association table..."
## [1] "Joining tvalue and stderr"
## [1] "GAPIT Phenotype distribution with significant markers in process..."
## [1] FALSE
## [1] 13  8
## [1] "GAPIT.ID accomplished successfully for multiple traits. Results are saved"
## [1] "GAPIT accomplished successfully for multiple traits. Result are saved"
## [1] "Reading GWAS result with MLMM.bm.wp.ped.pheno"
## [1] "Reading GWAS result with FarmCPU.bm.wp.ped.pheno"
## [1] "Reading GWAS result with BLINK.bm.wp.ped.pheno"
## Warning in plot.formula(1 ~ 1, col = "white", xlab = "", ylab = "", ylim = c(0,
## : the formula '1 ~ 1' is treated as '1 ~ 1'
## [1] "GAPIT.Association.Manhattans has done !!!"
## [1] "GAPIT has output Multiple Manhattan figure with Symphysic type!!!"
## [1] "MLMM"    "FarmCPU" "BLINK"  
## [1] "Reading GWAS result with MLMM.bm.wp.ped.pheno"
## [1] "Reading GWAS result with FarmCPU.bm.wp.ped.pheno"
## [1] "Reading GWAS result with BLINK.bm.wp.ped.pheno"
## Warning in mtext(side = 4, paste(environ_name[k], sep = ""), line = 2, cex = 1,
## : "base_family" is not a graphical parameter
## Warning in mtext(side = 4, paste(environ_name[k], sep = ""), line = 2, cex = 1,
## : "base_family" is not a graphical parameter

## Warning in mtext(side = 4, paste(environ_name[k], sep = ""), line = 2, cex = 1,
## : "base_family" is not a graphical parameter
## [1] "GAPIT.Association.Manhattans has done !!!"
## [1] "GAPIT has output Multiple Manhattan figures with Wide and High types!!!"
## Warning in GAPIT.Circle.Manhattan.Plot(band = 1, r = 3, GMM$multip_mapP, : NAs
## introduced by coercion
## Warning in max(numeric.chr, na.rm = TRUE): no non-missing arguments to max;
## returning -Inf
## [1] "Multracks_QQ Plotting MLMM.bm.wp.ped.pheno..."
## [1] "Multracks_QQ Plotting FarmCPU.bm.wp.ped.pheno..."
## [1] "Multracks_QQ Plotting BLINK.bm.wp.ped.pheno..."
## [1] "Multiple QQ plot has been finished!"
## [1] "GAPIT has output Multiple Manhattan and QQ figures with Circle types!!!"
## [1] "GAPIT has done all analysis!!!"
## [1] "Please find your all results in :"
## [1] "/Users/hiro/Documents/Rprojects/京大gwasAll/gapit"
setwd(pwd)