1. Plan

I plan to run some analysis using the Complex Traits Genetics Virtual Lab (CTG-VL):
As detailed at link, the input format of the genome-wide association summary statistics (GWASS) file should be:
Tab separated and should minimum contain the columns CHR, BP, SNP, A1, A2, FREQ, BETA, SE and P. The column “N” is optional but highly recommended if there are large discrepancies between sample sizes for each SNP. Prof Nyholt informed us that “A1” is the effect allele (EA), and “A2” is the non-effect allele (NEA).

Example of an input files is as follows:

CHR BP  SNP A1  A2  FREQ    BETA    SE  P   N
7   92383888    rs10    A   C   0.06431  0.0013 0.0042   7.5e-01    85000
12  126890980   rs1000000   A   G   0.2219   0.0001 0.0021   9.6e-01    85000
4   21618674    rs10000010  T   C   0.5086  -0.0001 0.0016   9.4e-01    85000
4   1357325 rs10000012  C   G   0.8634   0.0047 0.0025   5.7e-02    85000
4   37225069    rs10000013  A   C   0.7708  -0.0061 0.0021   3.3e-03    85000

2. Downloading GWAS summary statistics

2.1. Migraine

Migraine GWAS summary statistics used in this study was obtained from the Hautakangas et al. 2021 meta-analysis of migraine risk through the contact of one of the authors, Dr. Dale Nyholt.

The downloading was completed in the following steps:

  1. Downloaded the GWAS summary statistics from the link sent to me by Dr. Dale Nyholt.
  2. Moved the GWAS summary statistics to the HPC by the use of FileZilla.
  3. Connected to the QUT HPC (lyra.qut.edu.au) via SSH (terminal).
  4. Created a working directory.

mkdir Datasets

  1. Changed to my working directory.

cd Datasets

  1. Checked format of the downloaded GWASS file using the head command to view first 10 rows.

head migraine_IHGC2021_no23andMe_gwama_v2.txt

rs_number chromosome position reference_allele other_allele eaf beta se beta_95L beta_95U   z   p.value _-log10_p-value q_statistic q_p-value   i2 n_studies n_samples effects Neff rsid_ukbb   marker

1:592368    1   592368  G   A   0.010199    0.136408  0.299302 -0.450224    0.72304   0.455754  0.648588    0.188031    0   1   1   594    ??+? 1106    rs561532399 1:592368
1:636285    1   636285  C   T   0.096043    0.037062  0.027023 -0.015904    0.090028  1.371469  0.170212    0.769011    0   1   1   341050 +??? 15773   rs545945172 1:636285
1:637420    1   637420  T   C   0.019151    -0.028    0.195107 -0.41041   0.35441     -0.143512 0.885866    0.052632    0   1   1   5941   ??-  1398    1:637420    1:637420
1:649192    1   649192  T   A   0.117561    0.032302  0.02445  -0.015621    0.080224  1.321107  0.186447    0.729445    0   1   1   341050 +??? 16125   rs201942322 1:649192
1:657788    1   657788  G   C   0.219788    -0.009857 0.063584 -0.134481    0.114767  -0.155025 0.876783    0.057108    0   1   1   5941     ??-?   1442    1:657788    1:657788
1:657963    1   657963  T   C   0.013471    0.456059  0.204272 0.055686   0.856432  2.232607    0.025597    1.591816    0   1   1   5941     ??+?   1803    rs535013214 1:657963
1:658080    1   658080  A   G   0.012222    -0.077539 0.275975 -0.61845   0.463372  -0.280965   0.778738    0.108609    0   1   1   5941     ??-?   1088    1:658080    1:658080
1:658178    1   658178  G   A   0.019984    0.091726  0.199317 -0.298935    0.482387  0.460202  0.645393    0.190176    0   1   1   5941     ??+?   1285    1:658178    1:658178
1:662414    1   662414  T   C   0.134918    0.028631  0.024013 -0.018434    0.075696  1.192313  0.233118    0.632424    0   1   1   341050 +??? 14859   rs371628865 1:662414  

2.2. Alzheimer’s Disease

Alzheimer’s disease GWAS summary statistics used in this study was obtained from Wightman et al. 2021 through the Psychiatric Genomics Consortium.

The downloading was completed in the following steps:

  1. Downloaded the GWAS summary statistics from the Psychiatric Genomics Consortium.
  2. Moved the GWAS summary statistics to the HPC by the use of FileZilla.
  3. Connected to the QUT HPC (lyra.qut.edu.au) via SSH (terminal).
  4. Changed to my working directory.

cd Datasets

  1. Checked format of the downloaded GWAS summary statistic file using the head command to view first 10 rows.

head PGCALZ2sumstatsExcluding23andMe.txt

chr PosGRCh37   testedAllele    otherAllele     z                       p                   N
1   13668           A               G           -0.33742041121370514    0.7358              74004
1   14506           A               G           -0.511215737572647      0.6092              74004
1   14773           T               C           1.3712404899915154      0.1702999999999999  74004
1   14860           T               G           -1.2758741791491297     0.20200000000000018 74004
1   17101           C               G           -0.47105704329760717    0.6375999999999998  74004
1   17147           A               G           0.08532879488562903     0.932               74004
1   17516           C               T           -0.8723820309097523     0.3829999999999999  74004
1   17532           A               C           0.1705215348636156      0.8646              74004
1   17626           A               G           0.7388468491852137      0.45999999999999996 74004

3. Column overview

3.1. Migraine

Below I have copied some of the details of the columns in the order that they are provided in the GWASS file:

  • rs_number: Marker ID
  • chromosome: Marker chromosome
  • position: Marker position (bp)
  • reference_allele: Effect allele
  • other_allele: Non-effect allele
  • eaf: Effect allele frequency
  • beta: Overall beta value for meta-analysis
  • se: Standard error
  • beta_95L: Lower 95% confidence interval for beta
  • beta_95U: Upper 95% confidence interval for beta
  • z: Z-score
  • p.value: Meta-analysis p-value
  • _-log10_p-value: Absolute value of logarithm of meta-analysis p-value to the base of 10
  • q_statistic: Cochran’s heterogeneity statistic
  • q_p-value: Cochran’s heterogeneity statistic’s p-value
  • i2: Heterogeneity index I2 by Higgins et al 2003
  • n_studies: Number of studies with marker present
  • n_samples: Number of samples with marker present (will be NA if marker is present in any input file where N column is not present)
  • effects: Summary of effect directions (‘+’ - positive effect of reference allele, ‘-’ - negative effect of reference allele, ‘0’ - no effect (or non-significant) effect of reference allele, ‘?’ - missing data)
  • Neff: Effective sample size
  • rsid_ukbb:
  • marker:

The columns needed to be extracted in order to run analysis using the CTG-VL is as follows:

  • chromosome (corresponds to CHR)
  • position (corresponds to BP)
  • rs_number (corresponds to SNP)
  • reference_allele (corresponds to A1)
  • other_allele (corresponds to A2)
  • eaf (corresponds to FREQ)
  • beta (corresponds to BETA)
  • se (corresponds to SE)
  • p.value (corresponds to P)
  • n_samples (corresponds to N)

As can be seen we should already have all the columns needed to run analysis using the CTG-VL. We just need to extract the columns and rename them.

3.2. Alzheimer’s Disease

Below I have copied some of the details of the columns in the order that they are provided in the GWASS file:

  • chr: Marker chromosome
  • PosGRCh37: Marker position (BP) according to the GRCh37
  • testedAllele: Effect allele
  • otherAllele: Non-effect allele
  • z: Z-score
  • p: P-value
  • N: Sample size

The columns needed to be extracted in order to run analysis using the CTG-VL is as follows:

  • chr (corresponds to CHR)
  • PosGRCh37 (corresponds to BP)
  • Missing (corresponds to SNP)
  • testedAllele (corresponds to A1)
  • otherAllele (corresponds to A2)
  • Missing (corresponds to FREQ)
  • Missing (corresponds to BETA)
  • Missing (corresponds to SE)
  • p (corresponds to P)
  • N (corresponds to N)

As can be seen the columns: SNP, FREQ, BETA and SE are missing, so we will need to manipulate the data in order to get these.


4. Manipulation of data

4.1. Migraine

4.1.1. Extracting columns

As for the migraine GWAS summary statistics we only need to extract some of the columns and rename them in order for the data to be in the right format so we can use CTG-VL to analyse it. This is done by the following steps:

  1. Extract all the column names and write them to a new file.

head -n1 migraine_IHGC2021_no23andMe_gwama_v2.txt > migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames

  1. Transpose the column names in order to make it easier to identify the relevant columns.

sed -e 'y/\t/\n/' migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames > migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames.transposed

  1. Add line number to the transposed file.

cat -n migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames.transposed > migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames.transposed.lineNo

  1. View the resulting file.

more migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames.transposed.lineNo

 1  rs_number
 2  chromosome
 3  position
 4  reference_allele
 5  other_allele
 6  eaf
 7  beta
 8  se
 9  beta_95L
10  beta_95U
11  z
12  p.value
13  _-log10_p-value
14  q_statistic
15  q_p-value
16  i2
17  n_studies
18  n_samples
19  effects
20  Neff
21  rsid_ukbb
22  marker

So referring back to the format we want:

CHR BP SNP A1 A2 FREQ BETA SE P N

We want the following column numbers from ‘migraine_IHGC2021_no23andMe_gwama_v2.txt’

2 3 1 4 5 6 7 8 12 18

  1. Choose selected columns
    Test command on the file containing the first 10 rows of ‘migraine_IHGC2021_no23andMe_gwama_v2.txt’

awk -v OFS='\t' '{ print $2, $3, $1, $4, $5, $6, $7, $8, $12, $18 }' migraine_IHGC2021_no23andMe_gwama_v2.txt.head

chromosome  position    rs_number   reference_allele    other_allele    eaf beta    se  p.value n_samples
1   592368  1:592368    G   A   0.010199    0.136408    0.299302    0.648588    ??+?
1   636285  1:636285    C   T   0.096043    0.037062    0.027023    0.170212    +???
1   637420  1:637420    T   C   0.019151    -0.028    0.195107  0.885866    ??-?
1   649192  1:649192    T   A   0.117561    0.032302    0.02445   0.186447  +???
1   657788  1:657788    G   C   0.219788    -0.009857   0.063584    0.876783    ??-?
1   657963  1:657963    T   C   0.013471    0.456059    0.204272    0.025597    ??+?
1   658080  1:658080    A   G   0.012222    -0.077539   0.275975    0.778738    ??-?
1   658178  1:658178    G   A   0.019984    0.091726    0.199317    0.645393    ??+?
1   662414  1:662414    T   C   0.134918    0.028631    0.024013    0.233118    +???

Looks like it worked. However, the sample size does not seem to be useful, so we won’t extract that column in the full file.

  1. Adapt it to run on the full file.

awk -v OFS='\t' '{ print $2, $3, $1, $4, $5, $6, $7, $8, $12 }' migraine_IHGC2021_no23andMe_gwama_v2.txt > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL

  1. Check if it worked.

head migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL

chromosome  position    rs_number   reference_allele    other_allele    eaf beta    se  p.value
1   592368  1:592368    G   A   0.010199    0.136408    0.299302    0.648588
1   636285  1:636285    C   T   0.096043    0.037062    0.027023    0.170212
1   637420  1:637420    T   C   0.019151    -0.028    0.195107  0.885866
1   649192  1:649192    T   A   0.117561    0.032302    0.02445   0.186447
1   657788  1:657788    G   C   0.219788    -0.009857   0.063584    0.876783
1   657963  1:657963    T   C   0.013471    0.456059    0.204272    0.025597
1   658080  1:658080    A   G   0.012222    -0.077539   0.275975    0.778738
1   658178  1:658178    G   A   0.019984    0.091726    0.199317    0.645393
1   662414  1:662414    T   C   0.134918    0.028631    0.024013    0.233118

tail migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL

X   155226657   rs139594651 T   C   0.325104    -0.051982   0.022151    0.018962
X   155227607   rs141061503 G   T   0.333626    -0.04959    0.021159    0.019116
X   155228631   rs148242753 A   G   0.346415    -0.052932   0.022249    0.017378
X   155229003   rs188340375 A   G   0.347217    -0.052858   0.02223   0.017438
X   155229189   rs189652562 C   T   0.349702    -0.053048   0.022319    0.017487
X   155229483   rs139273633 T   C   0.346997    -0.052996   0.022289    0.017445
X   155229622   rs150057047 C   G   0.341794    -0.053095   0.022362    0.017603
X   155229796   rs145421232 A   G   0.341067    -0.053227   0.022415    0.017589
X   155230932   rs144749717 G   A   0.342003    -0.053848   0.022663    0.017521
X   155233098   rs138096180 C   T   0.352228    -0.053986   0.022706    0.017448  

Looks like it worked. However, we can see that some of the rows in the rs_number column does not contain rsIDs, which is preferred. Therefore some substantial manipulations are made.

4.1.2. Removing rows that do not contain rsIDs

Use grep to extract rows containing “rs”.

grep "rs" migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly

4.1.3. Renaming column names

I use Joe text editor to rename the column names in ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly’
From:
chromosome position rs_number reference_allele other_allele eaf beta se p.value n_samples

To:
CHR BP SNP A1 A2 FREQ BETA SE P N

Save as ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname’

Then I view the resulting file.

head migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname

CHR BP  SNP A1  A2  FREQ    BETA    SE  P
1   662613  rs151190501 A   G   0.023833    -0.002305   0.173807    0.989415
1   662622  rs61769339  A   G   0.11161 0.027319    0.023048    0.235865
1   693731  rs61769350  G   A   0.117222    0.009883    0.02157 0.646838
1   704637  rs142559957 A   G   0.083049    0.020961    0.0957  0.826616
1   705882  rs72631875  A   G   0.066553    -0.05509    0.033577    0.100859
1   706368  rs55727773  A   G   0.482703    0.002583    0.016164    0.873
1   707886  rs115229562 C   G   0.066665    0.0938  0.109672    0.392398
1   713337  rs184249808 A   G   0.020187    0.0673  0.199301    0.735613
1   717587  rs144155419 A   G   0.015726    -0.043106   0.06268 0.491641

This looks fine.

4.1.4. Looking for errors

I tried uploading the ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname’ file to the CTG-VL. However, when doing so, an error was raised in the ‘Status’ column under ‘My Data’. I then tried sorting all the different columns in the ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname’ file to check if something unusual was happening.

  1. Sorting by chromosome.

sort -k1,1n migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortCHR

When viewing the data, everything looks fine.

head migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortCHR

CHR BP  SNP A1  A2  FREQ    BETA    SE  P
X   100000500   rs5920848   T   A   0.452611    -0.006459   0.007078    0.361441
X   100000719   rs4828056   G   A   0.45257 -0.00665    0.007077    0.347377
X   100003215   rs2154370   T   C   0.451382    -0.006564   0.00707 0.35318
X   100003774   rs75945819  C   T   0.461256    -0.005969   0.007187    0.406287
X   100005339   rs188135282 G   T   0.014273    -0.0581 0.19495 0.765686
X   100006043   rs4828057   C   A   0.451512    -0.006674   0.007059    0.34439
X   100006377   rs5967183   A   T   0.55286 0.005702    0.007104    0.422218
X   100007433   rs2022471   A   G   0.450063    -0.005641   0.007097    0.426745
X   100008593   rs5921641   G   A   0.480213    -0.007165   0.007096    0.312645

tail migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortCHR

22  51227891    rs6010091   A   G   0.42802 -0.001905   0.014723    0.897024
22  51228259    rs138384468 G   A   0.073761    -0.000634   0.043196    0.988287
22  51228910    rs145146472 A   G   0.300584    -0.001419   0.015942    0.929049
22  51229805    rs9616985   C   T   0.072746    0.010011    0.022734    0.659699
22  51229855    rs144549712 A   G   0.143121    0.030003    0.021895    0.170578
22  51233300    rs62240042  T   C   0.333872    0.010257    0.016812    0.541804
22  51234048    rs141330630 C   T   0.013537    -0.379597   0.243448    0.11893
22  51234799    rs191117135 A   G   0.015694    -0.075994   0.064729    0.240359
22  51237063    rs3896457   C   T   0.299954    -0.003727   0.015613    0.811307
22  51238249    rs149733995 C   A   0.072902    -0.006451   0.0441  0.883671
  1. Sorting by base position.

sort -k2,2n migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP

When viewing the head result the first 7 rows of the BP column is in scientific notation, which might cause a problem when uploading it. To test whether this was the problem I deleted these rows.

head migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP

CHR BP  SNP A1  A2  FREQ    BETA    SE  P
10  1.2e+07 rs74733400  T   A   0.018302    -0.00314    0.0316  0.920837
6   1.36e+08    rs55926138  C   T   0.023115    0.005944    0.027572    0.829296
12  4.8e+07 rs855274    T   C   0.100324    -0.025858   0.013134    0.049001
2   7.2e+07 rs61573637  A   G   0.091582    -0.000782   0.01372 0.954553
9   8e+07   rs78198903  T   A   0.111289    0.016326    0.012676    0.197716
9   9e+05   rs72699266  T   C   0.011967    -0.104202   0.0929  0.261993
1   9.3e+07 rs188690587 G   A   0.023369    -0.072327   0.166705    0.66441
17  828 rs62053745  T   C   0.197069    0.011277    0.014306    0.430516
17  834 rs9747082   G   A   0.134363    0.009148    0.013838    0.508602

tail migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP

1   249223485   rs113286866 A   G   0.010701    -0.005186   0.084472    0.951032
1   249224155   rs115216350 T   C   0.014103    -0.239172   0.249257    0.337277
1   249225064   rs191382835 T   A   0.015645    -0.062227   0.209466    0.766414
1   249225077   rs183571322 A   G   0.010032    0.07725 0.240278    0.747836
1   249226310   rs146557682 A   C   0.010032    0.07725 0.240278    0.747836
1   249228009   rs143586772 A   G   0.019903    -0.011325   0.059645    0.849393
1   249228324   rs185293715 A   T   0.010032    0.07725 0.240278    0.747836
1   249228388   rs189826851 G   T   0.014114    0.106908    0.234435    0.648394
1   249229734   rs75254746  C   T   0.028278    0.009297    0.043954    0.832473
1   249230279   rs146821690 G   C   0.010322    0.071252    0.240329    0.766868
  1. Deleting first 7 rows.

sed '2,8d' migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP.test

Now the data might be ready for to the CTG-VL upload.

4.2. Alzheimer’s Disease

As for the Alzheimer’s disease GWAS summary statistics some of the columns we need in order to analyse the data using CTG-VL are missing. The columns missing is:

  • SNP
  • FREQ
  • BETA
  • SE

Therefore we need to find a way to calculate these, and then add them to the file.

4.2.1. Calculating Beta, SE and frequency

The following links was used to figure out how to calculate beta and se. As Prof Dale Nyholt had told us, that we did not really need the frequency column in order to run the analysis, it should just be a fixed frequency.

4.2.2. Adding BETA, SE and FREQ columns

  1. A file only containing the first 10 rows of the full file was made in order to test if the clolumns seemed to be added properly.

head PGCALZ2sumstatsExcluding23andMe.txt > PGCALZ2sumstatsExcluding23andMe.txt.head

  1. Then I tried to add the Beta and SE, and frequency (I set the frequency to 0.1) to the head file.

awk -v OFS='\t' 'NR!=1{$8 = (0.1); $9 = ($5*sqrt((1/((1+($5*$5))/$7))/$7)); $10 = ($9/$5); print}' PGCALZ2sumstatsExcluding23andMe.txt.head > PGCALZ2sumstatsExcluding23andMe.txt.head.freqbetase

By viewing the file it looks like it worked. However, the column names have been deleted from the file, but these could just be added using joe text editor.

more PGCALZ2sumstatsExcluding23andMe.txt.head.freqbetase

1   13668   A   G   -0.33742041121370514    0.7358  74004   0.1 -0.319711   0.947515
1   14506   A   G   -0.511215737572647  0.6092  74004   0.1 -0.455185   0.890397
1   14773   T   C   1.3712404899915154  0.1702999999999999  74004   0.1 0.807969    0.589225
1   14860   T   G   -1.2758741791491297 0.20200000000000018 74004   0.1 -0.787059   0.616878
1   17101   C   G   -0.47105704329760717    0.6375999999999998  74004   0.1 -0.426144   0.904655
1   17147   A   G   0.08532879488562903 0.932   74004   0.1 0.0850198   0.996379
1   17516   C   T   -0.8723820309097523 0.3829999999999999  74004   0.1 -0.657387   0.753554
1   17532   A   C   0.1705215348636156  0.8646  74004   0.1 0.168095    0.985771
1   17626   A   G   0.7388468491852137  0.45999999999999996 74004   0.1 0.594244    0.804285
  1. Then I tried to add the Beta and SE, and frequency (I set the frequency to 0.1) to the full file.

awk -v OFS='\t' 'NR!=1{$8 = (0.1); $9 = ($5*sqrt((1/((1+($5*$5))/$7))/$7)); $10 = ($9/$5); print}' PGCALZ2sumstatsExcluding23andMe.txt > PGCALZ2sumstatsExcluding23andMe.txt.freqbetase

However, this gave the following error:

awk: cmd. line:1: (FILENAME=PGCALZ2sumstatsExcluding23andMe.txt FNR=11987715) fatal: division by zero attempted
  1. I tried to take a look at the line that raised the error.

sed -n '11987715p' PGCALZ2sumstatsExcluding23andMe.txt

This gave the following output:

19  45387459    G   C   inf 0.0 758138

Indicating that the error was raised because the z-score was infinity.

  1. To fix this problem I deleted the rows containing inf values as these could not be used in the analysis anyway because they also had a p-value of 0.

grep -v "inf" PGCALZ2sumstatsExcluding23andMe.txt > PGCALZ2sumstatsExcluding23andMe.txt.noinf

  1. Tried adding the Beta and SE, and frequency columns one last time.

awk -v OFS='\t' 'NR!=1{$8 = (0.1); $9 = ($5*sqrt((1/((1+($5*$5))/$7))/$7)); $10 = ($9/$5); print}' PGCALZ2sumstatsExcluding23andMe.txt.noinf > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase

By viewing the data, it seemed like it worked.

head PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase

1   13668   A   G   -0.33742041121370514    0.7358  74004   0.1 -0.319711   0.947515
1   14506   A   G   -0.511215737572647  0.6092  74004   0.1 -0.455185   0.890397
1   14773   T   C   1.3712404899915154  0.1702999999999999  74004   0.1 0.807969    0.589225
1   14860   T   G   -1.2758741791491297 0.20200000000000018 74004   0.1 -0.787059   0.616878
1   17101   C   G   -0.47105704329760717    0.6375999999999998  74004   0.1 -0.426144   0.904655
1   17147   A   G   0.08532879488562903 0.932   74004   0.1 0.0850198   0.996379
1   17516   C   T   -0.8723820309097523 0.3829999999999999  74004   0.1 -0.657387   0.753554
1   17532   A   C   0.1705215348636156  0.8646  74004   0.1 0.168095    0.985771
1   17626   A   G   0.7388468491852137  0.45999999999999996 74004   0.1 0.594244    0.804285
1   17765   A   G   0.09388564134428469 0.9252  74004   0.1 0.0934746   0.995622

By manually adding the column names using joe text editor and viewing the file, it seems to be fine.

head PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames

CHR BP  A1  A2  Z   P   N   FREQ    BETA    SE
1   13668   A   G   -0.33742041121370514    0.7358  74004   0.1 -0.319711   0.947515
1   14506   A   G   -0.511215737572647  0.6092  74004   0.1 -0.455185   0.890397
1   14773   T   C   1.3712404899915154  0.1702999999999999  74004   0.1 0.807969    0.589225
1   14860   T   G   -1.2758741791491297 0.20200000000000018 74004   0.1 -0.787059   0.616878
1   17101   C   G   -0.47105704329760717    0.6375999999999998  74004   0.1 -0.426144   0.904655
1   17147   A   G   0.08532879488562903 0.932   74004   0.1 0.0850198   0.996379
1   17516   C   T   -0.8723820309097523 0.3829999999999999  74004   0.1 -0.657387   0.753554
1   17532   A   C   0.1705215348636156  0.8646  74004   0.1 0.168095    0.985771
1   17626   A   G   0.7388468491852137  0.45999999999999996 74004   0.1 0.594244    0.804285

4.2.3. Adding SNP column

The last thing we need to do in order for us to upload the data to the CTG-VL is to get the rsID’s, which is done in the following steps.

  1. Download the ‘1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt’ file.
  2. Upload the file to HPC via Filezilla.
  3. Connect to the QUT HPC.
  4. Change working directory.

cd Datasets

  1. Checking the format of the downloaded GWASS file.

head 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt

minimac_names       minimac_names_if_no_rsID  chr_b37    pos_b37  Intrpl_pos_b36  chr_b36    pos_b36
1:10583             rs58108140                1            10583  0               1              583
1:10611             rs189107123               1            10611  0               1              611
1:13302             rs180734498               1            13302  0               1             3165
1:13327             rs144762171               1            13327  0               1             3190
1:13957:TC_T        1:13957:TC_T              1            13957  0               1             3820
1:13980             rs151276478               1            13980  0               1             3843
1:30923             rs140337953               1            30923  0               1            20786
1:46402:C_CTG       1:46402:C_CTG             1            46402  0               1            36265
1:47190:G_GA        1:47190:G_GA              1            47190  0               1            37053
  1. Creating a new column containing the chr and pos values joined by an underscore in both files.

awk -v OFS='\t' '{print $1"_"$2, $0}' PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP

awk -v OFS='\t' '{print $3"_"$4, $0}' 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt > 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP

By viewing the files, it looks like it worked.

head PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP

CHR_BP  CHR BP  A1  A2  Z   P   N   FREQ    BETA    SE
1_13668 1   13668   A   G   -0.33742041121370514    0.7358  74004   0.1 -0.319711   0.947515
1_14506 1   14506   A   G   -0.511215737572647  0.6092  74004   0.1 -0.455185   0.890397
1_14773 1   14773   T   C   1.3712404899915154  0.1702999999999999  74004   0.1 0.807969    0.589225
1_14860 1   14860   T   G   -1.2758741791491297 0.20200000000000018 74004   0.1 -0.787059   0.616878
1_17101 1   17101   C   G   -0.47105704329760717    0.6375999999999998  74004   0.1 -0.426144   0.904655
1_17147 1   17147   A   G   0.08532879488562903 0.932   74004   0.1 0.0850198   0.996379
1_17516 1   17516   C   T   -0.8723820309097523 0.3829999999999999  74004   0.1 -0.657387   0.753554
1_17532 1   17532   A   C   0.1705215348636156  0.8646  74004   0.1 0.168095    0.985771
1_17626 1   17626   A   G   0.7388468491852137  0.45999999999999996 74004   0.1 0.594244    0.804285

head 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP

chr_b37_pos_b37 minimac_names       minimac_names_if_no_rsID  chr_b37    pos_b37  Intrpl_pos_b36  chr_b36    pos_b36
1_10583 1:10583             rs58108140                1            10583  0               1              583
1_10611 1:10611             rs189107123               1            10611  0               1              611
1_13302 1:13302             rs180734498               1            13302  0               1             3165
1_13327 1:13327             rs144762171               1            13327  0               1             3190
1_13957 1:13957:TC_T        1:13957:TC_T              1            13957  0               1             3820
1_13980 1:13980             rs151276478               1            13980  0               1             3843
1_30923 1:30923             rs140337953               1            30923  0               1            20786
1_46402 1:46402:C_CTG       1:46402:C_CTG             1            46402  0               1            36265
1_47190 1:47190:G_GA        1:47190:G_GA              1            47190  0               1            37053

Before merging the two files in order to get the rsID’s we need to handle possible duplicate CHR_BP variant names first, to prevent incorrect matching.

  1. Create files that only contain CHR_BP variant names.

cut -f1 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1 cut -f1 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP > 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1

  1. Create a file containing only the unique chr_bp variant names in both the files.

sort PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1 | uniq -u > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1.uniq sort 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1 | uniq -u > 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1.uniq

  1. Do a line count of the original and generate unique file to see whether duplicates were removed.

    wc -l PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1.uniq

    12688309 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1 12631385 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1.uniq

In the GWASS file, 56,924 CHR_BP variant names were removed.

wc -l 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1.uniq

31326390 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1
31281348 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1.uniq

In the 1000G reference file, 45,042 CHR_BP variant names were removed.

Now the rsID’s can be added to the GWASS file.

  1. Create new GWASS and 1000G references files with the duplicate CHR_BP variante names removed.

awk 'NR==FNR {FILE1[$1]=$0; next} ($1 in FILE1) {print $0}' PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1.uniq PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.uniq

awk 'NR==FNR {FILE1[$1]=$0; next} ($1 in FILE1) {print $0}' 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1.uniq 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP > 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.uniq

  1. Merge the RSIDs in the 1000G reference file with the GWASS file.

awk -v OFS='\t' 'NR==FNR { FILE1[$1]=$3; next} ($1 in FILE1) {print FILE1[$1], $0}' 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.uniq PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.uniq > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq

  1. Count how many variants are left in the file compared to the one we started out with.

wc -l PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq

12688309 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP
11079499 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq

So 1,608,810 variants have been removed due to the merge.

  1. Column names manually added using the joe text editor the file was viewed.

head PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames

SNP CHR_BP  CHR BP  A1  A2  Z   P   N   FREQ    BETA    SE
rs146756510 1_126108    1   126108  A   G   1.4293627464142322  0.15290000000000006 74004   0.1 0.819381    0.573249
rs148209574 1_135032    1   135032  A   G   1.3141373113403905  0.18880000000000008 74004   0.1 0.795796    0.605565
rs141415251 1_233476    1   233476  A   G   0.6176592182045477  0.5368  74004   0.1 0.5255  0.850793
rs144425991 1_529782    1   529782  T   C   -1.2629698923484767 0.2066  74004   0.1 -0.784001   0.62076
rs188948234 1_637753    1   637753  A   C   2.021577678426751   0.04321999999999999 74004   0.1 0.896332    0.443383
rs191780594 1_662571    1   662571  T   C   0.7629386243130007  0.4454999999999999  74004   0.1 0.606563    0.795035
rs151190501 1_662613    1   662613  A   G   0.022310842464422612    0.9822  74004   0.1 0.0223053   0.999751
rs61769339  1_662622    1   662622  A   G   -0.7675339112394487 0.4427641389654222  137930  0.1 -0.608865   0.793274
rs150820983 1_676127    1   676127  T   C   1.1090699664504071  0.26739999999999997 74004   0.1 0.742682    0.669644
  1. Make sure the SNP names are only rsIDs as some variants may still not have an rsID assigned.

egrep "rs|SNP" PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly

  1. Checking how many variants have been removed.

wc -l PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly

11079500 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames
11063896 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly

Thereby, 15,604 variants have been removed.


5. Uploading to CTG-VL

5.1. Migraine

The file ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP.test’ was succesfully uploaded to the CTG-VL.

NOTE: When I uploaded the data, I manually specified the total sample size, as specified in Hautakangas et al. 2021 meta-analysis of migraine risk (i.e., the article that the data comes from) to be 589,356, which corresponds to the total sample size excluding 23andMe.

5.2. Alzheimer’s Disease

The file ‘PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly’ was succesfully uploaded to the CTG-VL.

NOTE: When I uploaded the data, I specified the total sample size, as specified in Wightman et al. 2021 genome-wide association study of Alzheimer’s disease to be 1,261,563. However, the data contained an N column so this might not matter.

6. SNP-based heritability analysis

6.1 Migraine

In order to analyse the SNP-based heritability, I went to the CTG-VL. From there under ‘Analyses’ I chose ‘LD score’. Under ‘select analysis’ I chose ‘Heritability’, and under ‘select GWAS’ i chose ‘Migraine_GWAS_1’ and then I hit ‘Run Analysis’.

From the analysis I got the following table:

Migraine_GWAS_1 - h2
Observed h2 Lambda GC   Mean Chi^2  Intercept   Ratio
0.0283 (0.0015) 1.2664  1.3506  1.0297 (0.0078) 0.0847 (0.0222)

As this yields a significant P-value, it indicates that there is a significant SNP-based heritibility in this dataset.

6.2. Alzheimer’s Disease

In order to analyse the SNP-based heritability, I went to the CTG-VL. From there under ‘Analyses’ I chose ‘LD score’. Under ‘select analysis’ I chose ‘Heritability’, and under ‘select GWAS’ i chose ‘Alzheimer_GWAS_1’ and then I hit ‘Run Analysis’.

From the analysis I got the following table:

Alzheimer_GWAS - h2
Observed h2 Lambda GC   Mean Chi^2  Intercept   Ratio
0.0111 (0.002)  1.0957  1.1785  1.0194 (0.0132) 0.1089 (0.0739)

As this yields a significant P-value, it indicates that there is a significant SNP-based heritibility in this dataset.

7. Genetic correlation analyses

7.1. Cross-trait LD Score regression

In order to analyse the genetic correlation between the two traits, I went to the CTG-VL. From there under ‘Analyses’ I chose ‘LD score’. Under ‘select analysis’ I chose ‘Genetic correlation’, and under ‘select GWAS’ i chose ‘Alzheimer_GWAS’ and ‘Migraine_GWAS_1’ then I hit ‘Run Analysis’.

From the analysis I got the following table:

Alzheimer_GWAS vs Migraine_GWAS_1 - rG
rG  se  z   P   h2_obs  h2_obs_se   h2_int  h2_int_se   gcov_int    gcov_int_se
0.0466  0.0528  0.8828  0.3774  0.0286  0.0018  1.0271  0.0104  0.0081  0.0053

From the table above, the p-value of 0.3774 indicates that there is no genetic correlation between the two traits.

7.2. SECA

As the Cross-trait LD Score regression did not indicate any genetic correlation, I tried running SECA, which seems to be more sensitive in order to detect a genetic correlation.

7.2.1. Formatting data

The input files should contain the following five columns of data with a header row listing column names of ‘SNP’, ‘EA’, ‘NEA’, ‘P’ and ‘BETA’. The formatting is done in the following steps.

  1. Extracting the proper columns from the files.

Migraine (dataset 1): awk -v OFS='\t' '{ print $3, $4, $5, $9, $7 }' migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP.test > migraine.SECA

Alzheimer’s disease (dataset 2): awk -v OFS='\t' '{ print $1, $5, $6, $8, $11 }' PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly > alzheimer.SECA

  1. Viewing the data.

head migraine.SECA alzheimer.SECA

==> migraine.SECA <==
SNP A1  A2  P   BETA
rs62053745  T   C   0.430516    0.011277
rs9747082   G   A   0.508602    0.009148
rs62053747  A   G   0.278197    0.028343
rs34151105  C   T   0.330109    -0.015252
rs62053748  T   A   0.842967    0.007108
rs77383171  T   C   0.806836    0.004359
rs75157665  A   G   0.334874    -0.015101
rs35054424  G   A   0.502227    0.009247
rs17055023  G   A   0.512025    0.009035

==> alzheimer.SECA <==
SNP A1  A2  P   BETA
rs146756510 A   G   0.15290000000000006 0.819381
rs148209574 A   G   0.18880000000000008 0.795796
rs141415251 A   G   0.5368  0.5255
rs144425991 T   C   0.2066  -0.784001
rs188948234 A   C   0.04321999999999999 0.896332
rs191780594 T   C   0.4454999999999999  0.606563
rs151190501 A   G   0.9822  0.0223053
rs61769339  A   G   0.4427641389654222  -0.608865
rs150820983 T   C   0.26739999999999997 0.742682
  1. Giving the columns the appropriate names.

    The column names are changed using joe editor, and saves as migraine.SECA.correctcolname and alzheimer.SECA.correctcolname respectively.

  2. Viewing the data.

head migraine.SECA.correctcolname alzheimer.SECA.correctcolname

==> migraine.SECA.correctcolname <==
SNP EA  NEA P   BETA
rs62053745  T   C   0.430516    0.011277
rs9747082   G   A   0.508602    0.009148
rs62053747  A   G   0.278197    0.028343
rs34151105  C   T   0.330109    -0.015252
rs62053748  T   A   0.842967    0.007108
rs77383171  T   C   0.806836    0.004359
rs75157665  A   G   0.334874    -0.015101
rs35054424  G   A   0.502227    0.009247
rs17055023  G   A   0.512025    0.009035

==> alzheimer.SECA.correctcolname <==
SNP EA  NEA P   BETA
rs146756510 A   G   0.15290000000000006 0.819381
rs148209574 A   G   0.18880000000000008 0.795796
rs141415251 A   G   0.5368  0.5255
rs144425991 T   C   0.2066  -0.784001
rs188948234 A   C   0.04321999999999999 0.896332
rs191780594 T   C   0.4454999999999999  0.606563
rs151190501 A   G   0.9822  0.0223053
rs61769339  A   G   0.4427641389654222  -0.608865
rs150820983 T   C   0.26739999999999997 0.742682

Now the format should be correct.

7.2.2. LD clumping

LD clumping is an approach that finds the LD independent SNPs, and only these should be used to run SECA. LD clumping is performed in the following steps.

  1. Obtain the overlap of SNPs between the two datasets.
    As I have downloaded plink version 1.07, I first have to extract the SNPs that are in the LD reference file.

awk 'NR==FNR {FILE1[$1]=$0; next} ($1 in FILE1) {print $0}' alzheimer.SECA.correctcolname migraine.SECA.correctcolname > migraine_subset_overlapping_alzheimer.tmp

Then I find the overlap of SNPs between the two datasets.

awk 'NR==FNR {FILE1[$2]=$0; next} ($1 in FILE1) {print $0}' hapmap_CEU_r23a_filtered.bim migraine_subset_overlapping_alzheimer.tmp > migraine_subset_overlapping_alzheimer

  1. Check how many SNP are overlapping.

wc -l migraine_subset_overlapping_alzheimer

2227755 migraine_subset_overlapping_alzheimer
  1. View the data.

head migraine_subset_overlapping_alzheimer

rs6565733   G   A   0.899924    -0.001644
rs1106175   G   A   0.76494 0.004095
rs7235612   A   G   0.197555    0.020496
rs9950836   A   G   0.41199 0.0175
rs11127467  G   C   0.383763    -0.027588
rs10193286  C   G   0.622834    -0.016573
rs11898038  A   G   0.3966  0.020979
rs7595668   A   G   0.902585    -0.00335
rs10195681  A   C   0.847215    -0.005271
rs11901199  A   G   0.548693    0.014687

As can be seen the column names are missing, so these are being added again using joe editor.

  1. Identify a set of independent SNPs in migraine data (overlapping Alzheimer’s data) via linkage disequilibrium (LD) clumping.
    To do this, I made a script, as it takes a long time to run the command below:

plink --noweb --bfile hapmap_CEU_r23a_filtered --clump migraine_subset_overlapping_alzheimer --clump-field P --clump-p1 1 --clump-p2 1 --clump-r2 0.1 --clump-kb 1000 --out migraine_subset_overlapping_alzheimer_clump1

The script looks as follows:

#!/bin/bash

# Run this as "qsub submit_simple.sh"

# Set the resource requirements; 1 CPU, 10 GB memory and 12 hours wall time.
#PBS -l ncpus=1
#PBS -l mem=10GB
#PBS -l walltime=12:00:00 

# Send an email when this job aborts, begins or ends.
#PBS -m abe 
#PBS -M n11368292@qut.edu.au

# Run your program.

cd $PBS_O_WORKDIR
./plink --noweb --bfile hapmap_CEU_r23a_filtered --clump migraine_subset_overlapping_alzheimer --clump-field P --clump-p1 1 --clump-p2 1 --clump-r2 0.1 --clump-kb 1000 --out migraine_subset_overlapping_alzheimer_clump1

To run the script i wrote the following command.

qsub script_clump1.py

  1. Extracting the SNP column.

awk '{ print $3 }' migraine_subset_overlapping_alzheimer_clump1.clumped > migraine_subset_overlapping_alzheimer_clump1.SNPs

  1. Extracting the index SNP data from the ‘dataset1_subset_overlapping_dataset2’ file.
    This can be done by matching on field #3 in FILE1 and field #1 in FILE2, using the following awk command:

awk 'NR==FNR {FILE1[$3]=$0; next} ($1 in FILE1) {print $0}' migraine_subset_overlapping_alzheimer_clump1.clumped migraine_subset_overlapping_alzheimer > migraine_subset_overlapping_alzheimer.clump1

  1. Viewing the file.

head migraine_subset_overlapping_alzheimer.clump1

SNP EA  NEA P   BETA
rs10193286  C   G   0.622834    -0.016573
rs11898038  A   G   0.3966  0.020979
rs8064924   G   A   0.474664    0.009721
rs10045830  G   T   0.507788    0.005939
rs3794811   G   C   0.76548 0.002978
rs4451252   T   C   0.309267    0.012697
rs4461849   A   C   0.794341    -0.006941
rs1609550   A   G   0.25041 0.010513
rs11185553  G   C   0.744019    0.003372

This file contains a subset of GWAS summary results from the migraine dataset for a set of independent (index) SNPs (r2 > 0.1) for all pairs of SNPs within 1 Mb of each other.

  1. P-value informed LD clumping is performed. It is performed on the migraine_subset_overlapping_alzheimer.clump1 file to ensure none of the index SNPs within 10 Mb of each other are in long-range LD (r2 > 0.1).
    To do this, I made a script to run the command below:

./plink --noweb --bfile hapmap_CEU_r23a_filtered --extract migraine_subset_overlapping_alzheimer_clump1.SNPs --clump migraine_subset_overlapping_alzheimer.clump1 --clump-field P --clump-p1 1 --clump-p2 1 --clump-r2 0.1 --clump-kb 10000 --out migraine_subset_overlapping_alzheimer_clump2

The script looks as follows:

#!/bin/bash

# Run this as "qsub submit_simple.sh"

# Set the resource requirements; 1 CPU, 10 GB memory and 12 hours wall time.
#PBS -l ncpus=1
#PBS -l mem=10GB
#PBS -l walltime=12:00:00 

# Send an email when this job aborts, begins or ends.
#PBS -m abe 
#PBS -M n11368292@qut.edu.au

# Run your program.

cd $PBS_O_WORKDIR
./plink --noweb --bfile hapmap_CEU_r23a_filtered --extract migraine_subset_overlapping_alzheimer_clump1.SNPs --clump migraine_subset_overlapping_alzheimer.clump1 --clump-field P --clump-p1 1 --clump-p2 1 --clump-r2 0.1 --clump-kb 10000 --out migraine_subset_overlapping_alzheimer_clump2

To run the script i wrote the following command:

qsub script_clump2.py

  1. Extract the set of independent SNPs from both datasets.
    Extract independent SNPs from migraine data.

awk 'NR==FNR {FILE1[$3]=$0; next} ($1 in FILE1) {print $0}' migraine_subset_overlapping_alzheimer_clump2.clumped migraine_subset_overlapping_alzheimer > migraine.independent

awk 'NR==FNR {FILE1[$3]=$0; next} ($1 in FILE1) {print $0}' migraine_subset_overlapping_alzheimer_clump2.clumped alzheimer.SECA.correctcolname > alzheimer.independent

  1. View the files.

head migraine.independent

SNP EA  NEA P   BETA
rs10045830  G   T   0.507788    0.005939
rs4451252   T   C   0.309267    0.012697
rs12718064  C   T   0.427238    -0.008721
rs11649979  T   C   0.649061    -0.006907
rs3930589   T   C   0.134194    -0.034377
rs7213702   A   G   0.022943    -0.037337
rs7215219   G   A   0.026133    -0.020486
rs11185557  G   A   0.133448    -0.014336
rs4133102   C   G   0.003636    0.063565

head alzheimer.independent

SNP EA  NEA P   BETA
rs2980300   T   C   0.3739393122267137  0.66446
rs4951864   C   T   0.09179162970999584 -0.860095
rs1891905   C   T   0.29342569121177464 -0.724347
rs11260549  A   G   0.1753068387723845  0.804681
rs3813199   A   G   0.4115305156064484  0.634635
rs6603781   G   A   0.8838024363734281  0.144614
rs1810745   A   G   0.5941115614725418  -0.470282
rs7531530   T   C   0.8473090387788222  0.18908
rs13303201  A   C   0.28713670613181275 -0.728818  
  1. Count how many SNP there is left in the datasets.

wc -l migraine.independent migraine.SECA.correctcolname

27006 migraine.independent
9428582 migraine.SECA.correctcolname

wc -l alzheimer.independent alzheimer.SECA.correctcolname

27006 alzheimer.independent
11063896 alzheimer.SECA.correctcolname

The migraine.independent and alzheimer.independent files can now be used to perform SECA.

7.2.3. Performing SECA

SECA was performed in the following steps.

  1. Preparing input using process.pl.

./process.pl migraine.independent > migraine.independent.input

./process.pl alzheimer.independent > alzheimer.independent.input

  1. Viewing the data.

head migraine.independent.input alzheimer.independent.input

==> migraine.independent.input <==
SNP EA  NEA P   BETA
rs10045830  G   T   0.507788    0.005939
rs4451252   T   C   0.309267    0.012697
rs12718064  C   T   0.427238    -0.008721
rs11649979  T   C   0.649061    -0.006907
rs3930589   T   C   0.134194    -0.034377
rs7213702   A   G   0.022943    -0.037337
rs7215219   G   A   0.026133    -0.020486
rs11185557  G   A   0.133448    -0.014336
rs4133102   C   G   0.003636    0.063565

==> alzheimer.independent.input <==
SNP EA  NEA P   BETA
rs2980300   T   C   0.3739393122267137  0.66446
rs4951864   C   T   0.09179162970999584 -0.860095
rs1891905   C   T   0.29342569121177464 -0.724347
rs11260549  A   G   0.1753068387723845  0.804681
rs3813199   A   G   0.4115305156064484  0.634635
rs6603781   G   A   0.8838024363734281  0.144614
rs1810745   A   G   0.5941115614725418  -0.470282
rs7531530   T   C   0.8473090387788222  0.18908
rs13303201  A   C   0.28713670613181275 -0.728818
  1. Preparing a text file named ‘filelist.txt’ which lists ‘migraine.independent.input’ on line 1 and ‘alzheimer.independent.input’ on line 2.

    The text file was made using Visual Studio Code and then uploaded to the HPC using FileZilla.

  2. Aligning the SNP effects (BETAs) in the second file listed in ‘filelist.txt’ to the same effect allele (‘EA’) for the SNP effects (BETAs) in the first file listed in ‘filelist.txt’.

./gwama2.1_SECAalign

It gave the following output in the terminal:

File filelist.txt contained 2 studies.
------------------------------------
Reading file: migraine.independent.input
Strand column missing! Expecting always positive strand.
Marker count: 27005 Markers passing sanity check: 27005
Strand problems: 0 Wrong alleles: 0
Effect problems: 0 Multiple occurances: 0
------------------------------------
Reading file: alzheimer.independent.input
Strand column missing! Expecting always positive strand.
Marker count: 27005 Markers passing sanity check: 27004
Strand problems: 0 Wrong alleles: 0
Effect problems: 1 Multiple occurances: 0
------------------------------------
Preparing output...
------------------------------------
GWAMA program finished current job successfully!
Please check gwama.log.out for full information.
Analysis finished.
  1. Changing the file name.

    As the SNPs are LD-independent, the ‘aligned_effects.txt’ file was renamed to ‘independent_aligned_effects.txt’ using joe editor.

  2. View the file.

head independent_aligned_effects.txt

SNP EA  NEA BETA1   P1  EFF1    BETA2   P2  EFF2
rs10045830  G   T   0.005939    0.507788    +   0.793212    0.192712    +
rs4451252   T   C   0.012697    0.309267    +   0.763799    0.236670    +
rs12718064  C   T   -8.7210e-03 0.427238    -   -7.0439e-01 0.321021    -
rs11649979  T   C   -6.9070e-03 0.649061    -   -9.0128e-02 0.927893    -
rs3930589   T   C   -3.4377e-02 0.134194    -   -3.8131e-01 0.679995    -
rs7213702   A   G   -3.7337e-02 0.022943    -   0.777669    0.216087    +
rs7215219   G   A   -2.0486e-02 0.026133    -   0.653641    0.387761    +
rs11185557  G   A   -1.4336e-02 0.133448    -   -1.4347e-01 0.884730    -
rs4133102   C   G   0.063565    0.003636    +   -3.9227e-01 0.669780    -
  1. Count how many SNPs are in the file.

wc -l independent_aligned_effects.txt

27006 independent_aligned_effects.txt
  1. Running SECA. To run SECA the following commands were used:

module load r

And then:

R

Then within R, the following command was used to run SECA:

source("SECA_Ranalysis.R")

  1. Running SECA heatmap permutation. To run the SECA heatmap permutation, the following commands were used:

module load r

And then:

R

Then within R, the following command was used to run the SECA heatmap permutation:

source("SECA_Ranalysis_HeatmapPerm.R")

In contrast to cross-trait LD Score regression, SECA indicated that there was a genetic correlation between the traits.

8. Causal association analysis

8.1. GSMR using CTG-VL

In order to analyse the genetic causality between traits, I went to the CTG-VL. From there under ‘Analyses’ I chose ‘GSMR’. Under ‘select outcome’ i chose ‘Alzheimer_GWAS’ and under ‘select exposure’ i chose ‘Migraine_GWAS_1’ then I hit ‘Run Analysis’.

When trying to run this analysis, I eventually got an error.

I kept getting an error every time I tried running GSMR on CTG-VL either as Alzheimer’s Disease as the outcome and Migraine as the exposure or reversed. The first step towards finding the problem was to find two other datasets that was already in the Neal Lab, and then run each of them with my own data. I used the ‘Illnesses of mother: Alzheimer’s disease/dementia’ from Neal lab and tried to run it with my own migraine data. This worked both when the migraine data was the outcome and the exposure. Afterwards, I used the ‘Diagnoses - main ICD10: R51 Headache’ from Neal lab and tried to run it with my own Alzheimer’s Disease data. This only worked when the Alzheimer’s data was set as the exposure, as the ‘Diagnoses - main ICD10: R51 Headache’ did not seem to have enough significant loci.

The above first of all indicated that GSMR worked, and furthermore it indicated that my datasets both worked independently. However, I still could not figure out the exact problem.

8.2. GCTA using terminal

I then tried to run GCTA, which is a software package that can be run on the terminal, that supports the GSMR analysis (among other things).

8.2.1 Preparing for GCTA

The preparation of GCTA was done in the following steps in my local terminal, as the HPC kept killing my job, when I ran the analysis on there.

  1. Downloading the software package.
  2. Unzipping the file.

unzip gcta-1.94.1-linux-kernel-3-x86_64.zip

  1. Downloading European reference data from 1000 Genomes.
  2. Making two text files.
    Two text files was then made, one containing the name of the outcome data file (Alzheimer’s data) and one containing the name of the exposure data file (Migraine data). These text files looked as follows.

Outcome text file:

alzheimer alzheimer_GSMR_colnames

Exposure text file:

migraine migraine_GSMR_N_colnames

8.2.2. Running GCTA

GCTA was run using the following command.

gcta-1.94.1 --bfile g1000_eur --gsmr-file migraine_GSMR.txt alzheimer_GSMR.txt --gsmr-direction 0 --effect-plot --out gsmr_result

After running for some time, an error was raised.

Error: there are too many SNPs that have large difference in allele frequency. Please check the GWAS summary data.

This got me thinking that I had used fixed frequencies when preparing my Alzheimer’s dataset CTG-VL upload, which could be the problem. I also noticed, that a file called ‘gsmr_result.freq.badsnps’ had been produced, which might contain all the SNPs that was raising the error.

I tried counting how many SNPs the file contained.

wc -l gsmr_result.freq.badsnps

2794 gsmr_result.freq.badsnps

As there was only 2794 bad SNPs, I decided to remove these, to solve the problem.

awk 'NR == FNR {a[$1]; next} !($3 in a)' gsmr_result.freq.badsnps alzheimer_GSMR_colnames > alzheimer_GSMR_colnames_nobadsnps

Then I changed the outcome text file, so it looked like this.

alzheimer alzheimer_GSMR_colnames_nobadsnps

I then tried running GCTA again, using Alzheimer’s Disease as the outcome, and Migraine as the exposure.

gcta-1.94.1 --bfile g1000_eur --gsmr-file migraine_GSMR.txt alzheimer_GSMR.txt --gsmr-direction 0 --effect-plot --out gsmrMA_result

This yielded the following results.

Exposure        Outcome bxy     se      p       nsnp
migraine        alzheimer       5.29027 1.38816 0.000138403     26

The table above indicates, that there is a significant causal relationship.

Then i ran GCTA using Migraine as the outcome and Alzheimer’s Disease as the exposure.

gcta-1.94.1 --bfile g1000_eur --gsmr-file alzheimer_GSMR.txt migraine_GSMR.txt --gsmr-direction 0 --effect-plot --out gsmrAM_result

This yielded the following results.

Exposure        Outcome bxy     se      p       nsnp
alzheimer       migraine        0.00216948      0.00153883      0.158593        83

The table above indicates, that there is not a significant causal relationship.

8.2.3. Creating effect plots

The plot are generated in the following steps.

  1. Downloading R script (gsmr_plot.r).
  2. Creating plot for Alzheimer’s Disease as outcome and Migraine as exposure.
source("gsmr_plot.r")
gsmr_data = read_gsmr_data("gsmrMA_result.eff_plot.gz")
gsmr_summary(gsmr_data)      # show a summary of the data
plot_gsmr_effect(gsmr_data, "migraine", "alzheimer", colors()[75])
  1. Creating plot for Alzheimer’s Disease as outcome and Migraine as exposure.
source("gsmr_plot.r")
gsmr_data = read_gsmr_data("gsmrAM_result.eff_plot.gz")
gsmr_summary(gsmr_data)      # show a summary of the data
plot_gsmr_effect(gsmr_data, "alzheimer", "migraine", colors()[75])