I plan to run some analysis using the Complex Traits Genetics Virtual
Lab (CTG-VL):
As detailed at link,
the input format of the genome-wide association summary statistics
(GWASS) file should be:
Tab separated and should minimum contain the columns CHR, BP, SNP, A1,
A2, FREQ, BETA, SE and P. The column “N” is optional but highly
recommended if there are large discrepancies between sample sizes for
each SNP. Prof Nyholt informed us that “A1” is the effect allele (EA),
and “A2” is the non-effect allele (NEA).
Example of an input files is as follows:
CHR BP SNP A1 A2 FREQ BETA SE P N
7 92383888 rs10 A C 0.06431 0.0013 0.0042 7.5e-01 85000
12 126890980 rs1000000 A G 0.2219 0.0001 0.0021 9.6e-01 85000
4 21618674 rs10000010 T C 0.5086 -0.0001 0.0016 9.4e-01 85000
4 1357325 rs10000012 C G 0.8634 0.0047 0.0025 5.7e-02 85000
4 37225069 rs10000013 A C 0.7708 -0.0061 0.0021 3.3e-03 85000
Migraine GWAS summary statistics used in this study was obtained from the Hautakangas et al. 2021 meta-analysis of migraine risk through the contact of one of the authors, Dr. Dale Nyholt.
The downloading was completed in the following steps:
mkdir Datasets
cd Datasets
head migraine_IHGC2021_no23andMe_gwama_v2.txt
rs_number chromosome position reference_allele other_allele eaf beta se beta_95L beta_95U z p.value _-log10_p-value q_statistic q_p-value i2 n_studies n_samples effects Neff rsid_ukbb marker
1:592368 1 592368 G A 0.010199 0.136408 0.299302 -0.450224 0.72304 0.455754 0.648588 0.188031 0 1 1 594 ??+? 1106 rs561532399 1:592368
1:636285 1 636285 C T 0.096043 0.037062 0.027023 -0.015904 0.090028 1.371469 0.170212 0.769011 0 1 1 341050 +??? 15773 rs545945172 1:636285
1:637420 1 637420 T C 0.019151 -0.028 0.195107 -0.41041 0.35441 -0.143512 0.885866 0.052632 0 1 1 5941 ??- 1398 1:637420 1:637420
1:649192 1 649192 T A 0.117561 0.032302 0.02445 -0.015621 0.080224 1.321107 0.186447 0.729445 0 1 1 341050 +??? 16125 rs201942322 1:649192
1:657788 1 657788 G C 0.219788 -0.009857 0.063584 -0.134481 0.114767 -0.155025 0.876783 0.057108 0 1 1 5941 ??-? 1442 1:657788 1:657788
1:657963 1 657963 T C 0.013471 0.456059 0.204272 0.055686 0.856432 2.232607 0.025597 1.591816 0 1 1 5941 ??+? 1803 rs535013214 1:657963
1:658080 1 658080 A G 0.012222 -0.077539 0.275975 -0.61845 0.463372 -0.280965 0.778738 0.108609 0 1 1 5941 ??-? 1088 1:658080 1:658080
1:658178 1 658178 G A 0.019984 0.091726 0.199317 -0.298935 0.482387 0.460202 0.645393 0.190176 0 1 1 5941 ??+? 1285 1:658178 1:658178
1:662414 1 662414 T C 0.134918 0.028631 0.024013 -0.018434 0.075696 1.192313 0.233118 0.632424 0 1 1 341050 +??? 14859 rs371628865 1:662414
Alzheimer’s disease GWAS summary statistics used in this study was obtained from Wightman et al. 2021 through the Psychiatric Genomics Consortium.
The downloading was completed in the following steps:
cd Datasets
head PGCALZ2sumstatsExcluding23andMe.txt
chr PosGRCh37 testedAllele otherAllele z p N
1 13668 A G -0.33742041121370514 0.7358 74004
1 14506 A G -0.511215737572647 0.6092 74004
1 14773 T C 1.3712404899915154 0.1702999999999999 74004
1 14860 T G -1.2758741791491297 0.20200000000000018 74004
1 17101 C G -0.47105704329760717 0.6375999999999998 74004
1 17147 A G 0.08532879488562903 0.932 74004
1 17516 C T -0.8723820309097523 0.3829999999999999 74004
1 17532 A C 0.1705215348636156 0.8646 74004
1 17626 A G 0.7388468491852137 0.45999999999999996 74004
Below I have copied some of the details of the columns in the order that they are provided in the GWASS file:
The columns needed to be extracted in order to run analysis using the CTG-VL is as follows:
As can be seen we should already have all the columns needed to run analysis using the CTG-VL. We just need to extract the columns and rename them.
Below I have copied some of the details of the columns in the order that they are provided in the GWASS file:
The columns needed to be extracted in order to run analysis using the CTG-VL is as follows:
As can be seen the columns: SNP, FREQ, BETA and SE are missing, so we will need to manipulate the data in order to get these.
As for the migraine GWAS summary statistics we only need to extract some of the columns and rename them in order for the data to be in the right format so we can use CTG-VL to analyse it. This is done by the following steps:
head -n1 migraine_IHGC2021_no23andMe_gwama_v2.txt > migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames
sed -e 'y/\t/\n/' migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames > migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames.transposed
cat -n migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames.transposed > migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames.transposed.lineNo
more migraine_IHGC2021_no23andMe_gwama_v2.txt.colnames.transposed.lineNo
1 rs_number
2 chromosome
3 position
4 reference_allele
5 other_allele
6 eaf
7 beta
8 se
9 beta_95L
10 beta_95U
11 z
12 p.value
13 _-log10_p-value
14 q_statistic
15 q_p-value
16 i2
17 n_studies
18 n_samples
19 effects
20 Neff
21 rsid_ukbb
22 marker
So referring back to the format we want:
CHR BP SNP A1 A2 FREQ BETA SE P N
We want the following column numbers from ‘migraine_IHGC2021_no23andMe_gwama_v2.txt’
2 3 1 4 5 6 7 8 12 18
awk -v OFS='\t' '{ print $2, $3, $1, $4, $5, $6, $7, $8, $12, $18 }' migraine_IHGC2021_no23andMe_gwama_v2.txt.head
chromosome position rs_number reference_allele other_allele eaf beta se p.value n_samples
1 592368 1:592368 G A 0.010199 0.136408 0.299302 0.648588 ??+?
1 636285 1:636285 C T 0.096043 0.037062 0.027023 0.170212 +???
1 637420 1:637420 T C 0.019151 -0.028 0.195107 0.885866 ??-?
1 649192 1:649192 T A 0.117561 0.032302 0.02445 0.186447 +???
1 657788 1:657788 G C 0.219788 -0.009857 0.063584 0.876783 ??-?
1 657963 1:657963 T C 0.013471 0.456059 0.204272 0.025597 ??+?
1 658080 1:658080 A G 0.012222 -0.077539 0.275975 0.778738 ??-?
1 658178 1:658178 G A 0.019984 0.091726 0.199317 0.645393 ??+?
1 662414 1:662414 T C 0.134918 0.028631 0.024013 0.233118 +???
Looks like it worked. However, the sample size does not seem to be useful, so we won’t extract that column in the full file.
awk -v OFS='\t' '{ print $2, $3, $1, $4, $5, $6, $7, $8, $12 }' migraine_IHGC2021_no23andMe_gwama_v2.txt > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL
head migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL
chromosome position rs_number reference_allele other_allele eaf beta se p.value
1 592368 1:592368 G A 0.010199 0.136408 0.299302 0.648588
1 636285 1:636285 C T 0.096043 0.037062 0.027023 0.170212
1 637420 1:637420 T C 0.019151 -0.028 0.195107 0.885866
1 649192 1:649192 T A 0.117561 0.032302 0.02445 0.186447
1 657788 1:657788 G C 0.219788 -0.009857 0.063584 0.876783
1 657963 1:657963 T C 0.013471 0.456059 0.204272 0.025597
1 658080 1:658080 A G 0.012222 -0.077539 0.275975 0.778738
1 658178 1:658178 G A 0.019984 0.091726 0.199317 0.645393
1 662414 1:662414 T C 0.134918 0.028631 0.024013 0.233118
tail migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL
X 155226657 rs139594651 T C 0.325104 -0.051982 0.022151 0.018962
X 155227607 rs141061503 G T 0.333626 -0.04959 0.021159 0.019116
X 155228631 rs148242753 A G 0.346415 -0.052932 0.022249 0.017378
X 155229003 rs188340375 A G 0.347217 -0.052858 0.02223 0.017438
X 155229189 rs189652562 C T 0.349702 -0.053048 0.022319 0.017487
X 155229483 rs139273633 T C 0.346997 -0.052996 0.022289 0.017445
X 155229622 rs150057047 C G 0.341794 -0.053095 0.022362 0.017603
X 155229796 rs145421232 A G 0.341067 -0.053227 0.022415 0.017589
X 155230932 rs144749717 G A 0.342003 -0.053848 0.022663 0.017521
X 155233098 rs138096180 C T 0.352228 -0.053986 0.022706 0.017448
Looks like it worked. However, we can see that some of the rows in the rs_number column does not contain rsIDs, which is preferred. Therefore some substantial manipulations are made.
Use grep to extract rows containing “rs”.
grep "rs" migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly
I use Joe text editor to rename the column names in
‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly’
From:
chromosome position rs_number reference_allele other_allele eaf beta se
p.value n_samples
To:
CHR BP SNP A1 A2 FREQ BETA SE P N
Save as ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname’
Then I view the resulting file.
head migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname
CHR BP SNP A1 A2 FREQ BETA SE P
1 662613 rs151190501 A G 0.023833 -0.002305 0.173807 0.989415
1 662622 rs61769339 A G 0.11161 0.027319 0.023048 0.235865
1 693731 rs61769350 G A 0.117222 0.009883 0.02157 0.646838
1 704637 rs142559957 A G 0.083049 0.020961 0.0957 0.826616
1 705882 rs72631875 A G 0.066553 -0.05509 0.033577 0.100859
1 706368 rs55727773 A G 0.482703 0.002583 0.016164 0.873
1 707886 rs115229562 C G 0.066665 0.0938 0.109672 0.392398
1 713337 rs184249808 A G 0.020187 0.0673 0.199301 0.735613
1 717587 rs144155419 A G 0.015726 -0.043106 0.06268 0.491641
This looks fine.
I tried uploading the ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname’ file to the CTG-VL. However, when doing so, an error was raised in the ‘Status’ column under ‘My Data’. I then tried sorting all the different columns in the ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname’ file to check if something unusual was happening.
sort -k1,1n migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortCHR
When viewing the data, everything looks fine.
head migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortCHR
CHR BP SNP A1 A2 FREQ BETA SE P
X 100000500 rs5920848 T A 0.452611 -0.006459 0.007078 0.361441
X 100000719 rs4828056 G A 0.45257 -0.00665 0.007077 0.347377
X 100003215 rs2154370 T C 0.451382 -0.006564 0.00707 0.35318
X 100003774 rs75945819 C T 0.461256 -0.005969 0.007187 0.406287
X 100005339 rs188135282 G T 0.014273 -0.0581 0.19495 0.765686
X 100006043 rs4828057 C A 0.451512 -0.006674 0.007059 0.34439
X 100006377 rs5967183 A T 0.55286 0.005702 0.007104 0.422218
X 100007433 rs2022471 A G 0.450063 -0.005641 0.007097 0.426745
X 100008593 rs5921641 G A 0.480213 -0.007165 0.007096 0.312645
tail migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortCHR
22 51227891 rs6010091 A G 0.42802 -0.001905 0.014723 0.897024
22 51228259 rs138384468 G A 0.073761 -0.000634 0.043196 0.988287
22 51228910 rs145146472 A G 0.300584 -0.001419 0.015942 0.929049
22 51229805 rs9616985 C T 0.072746 0.010011 0.022734 0.659699
22 51229855 rs144549712 A G 0.143121 0.030003 0.021895 0.170578
22 51233300 rs62240042 T C 0.333872 0.010257 0.016812 0.541804
22 51234048 rs141330630 C T 0.013537 -0.379597 0.243448 0.11893
22 51234799 rs191117135 A G 0.015694 -0.075994 0.064729 0.240359
22 51237063 rs3896457 C T 0.299954 -0.003727 0.015613 0.811307
22 51238249 rs149733995 C A 0.072902 -0.006451 0.0441 0.883671
sort -k2,2n migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP
When viewing the head result the first 7 rows of the BP column is in scientific notation, which might cause a problem when uploading it. To test whether this was the problem I deleted these rows.
head migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP
CHR BP SNP A1 A2 FREQ BETA SE P
10 1.2e+07 rs74733400 T A 0.018302 -0.00314 0.0316 0.920837
6 1.36e+08 rs55926138 C T 0.023115 0.005944 0.027572 0.829296
12 4.8e+07 rs855274 T C 0.100324 -0.025858 0.013134 0.049001
2 7.2e+07 rs61573637 A G 0.091582 -0.000782 0.01372 0.954553
9 8e+07 rs78198903 T A 0.111289 0.016326 0.012676 0.197716
9 9e+05 rs72699266 T C 0.011967 -0.104202 0.0929 0.261993
1 9.3e+07 rs188690587 G A 0.023369 -0.072327 0.166705 0.66441
17 828 rs62053745 T C 0.197069 0.011277 0.014306 0.430516
17 834 rs9747082 G A 0.134363 0.009148 0.013838 0.508602
tail migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP
1 249223485 rs113286866 A G 0.010701 -0.005186 0.084472 0.951032
1 249224155 rs115216350 T C 0.014103 -0.239172 0.249257 0.337277
1 249225064 rs191382835 T A 0.015645 -0.062227 0.209466 0.766414
1 249225077 rs183571322 A G 0.010032 0.07725 0.240278 0.747836
1 249226310 rs146557682 A C 0.010032 0.07725 0.240278 0.747836
1 249228009 rs143586772 A G 0.019903 -0.011325 0.059645 0.849393
1 249228324 rs185293715 A T 0.010032 0.07725 0.240278 0.747836
1 249228388 rs189826851 G T 0.014114 0.106908 0.234435 0.648394
1 249229734 rs75254746 C T 0.028278 0.009297 0.043954 0.832473
1 249230279 rs146821690 G C 0.010322 0.071252 0.240329 0.766868
sed '2,8d' migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP > migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP.test
Now the data might be ready for to the CTG-VL upload.
As for the Alzheimer’s disease GWAS summary statistics some of the columns we need in order to analyse the data using CTG-VL are missing. The columns missing is:
Therefore we need to find a way to calculate these, and then add them to the file.
The following links was used to figure out how to calculate beta and se. As Prof Dale Nyholt had told us, that we did not really need the frequency column in order to run the analysis, it should just be a fixed frequency.
head PGCALZ2sumstatsExcluding23andMe.txt > PGCALZ2sumstatsExcluding23andMe.txt.head
awk -v OFS='\t' 'NR!=1{$8 = (0.1); $9 = ($5*sqrt((1/((1+($5*$5))/$7))/$7)); $10 = ($9/$5); print}' PGCALZ2sumstatsExcluding23andMe.txt.head > PGCALZ2sumstatsExcluding23andMe.txt.head.freqbetase
By viewing the file it looks like it worked. However, the column names have been deleted from the file, but these could just be added using joe text editor.
more PGCALZ2sumstatsExcluding23andMe.txt.head.freqbetase
1 13668 A G -0.33742041121370514 0.7358 74004 0.1 -0.319711 0.947515
1 14506 A G -0.511215737572647 0.6092 74004 0.1 -0.455185 0.890397
1 14773 T C 1.3712404899915154 0.1702999999999999 74004 0.1 0.807969 0.589225
1 14860 T G -1.2758741791491297 0.20200000000000018 74004 0.1 -0.787059 0.616878
1 17101 C G -0.47105704329760717 0.6375999999999998 74004 0.1 -0.426144 0.904655
1 17147 A G 0.08532879488562903 0.932 74004 0.1 0.0850198 0.996379
1 17516 C T -0.8723820309097523 0.3829999999999999 74004 0.1 -0.657387 0.753554
1 17532 A C 0.1705215348636156 0.8646 74004 0.1 0.168095 0.985771
1 17626 A G 0.7388468491852137 0.45999999999999996 74004 0.1 0.594244 0.804285
awk -v OFS='\t' 'NR!=1{$8 = (0.1); $9 = ($5*sqrt((1/((1+($5*$5))/$7))/$7)); $10 = ($9/$5); print}' PGCALZ2sumstatsExcluding23andMe.txt > PGCALZ2sumstatsExcluding23andMe.txt.freqbetase
However, this gave the following error:
awk: cmd. line:1: (FILENAME=PGCALZ2sumstatsExcluding23andMe.txt FNR=11987715) fatal: division by zero attempted
sed -n '11987715p' PGCALZ2sumstatsExcluding23andMe.txt
This gave the following output:
19 45387459 G C inf 0.0 758138
Indicating that the error was raised because the z-score was infinity.
grep -v "inf" PGCALZ2sumstatsExcluding23andMe.txt > PGCALZ2sumstatsExcluding23andMe.txt.noinf
awk -v OFS='\t' 'NR!=1{$8 = (0.1); $9 = ($5*sqrt((1/((1+($5*$5))/$7))/$7)); $10 = ($9/$5); print}' PGCALZ2sumstatsExcluding23andMe.txt.noinf > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase
By viewing the data, it seemed like it worked.
head PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase
1 13668 A G -0.33742041121370514 0.7358 74004 0.1 -0.319711 0.947515
1 14506 A G -0.511215737572647 0.6092 74004 0.1 -0.455185 0.890397
1 14773 T C 1.3712404899915154 0.1702999999999999 74004 0.1 0.807969 0.589225
1 14860 T G -1.2758741791491297 0.20200000000000018 74004 0.1 -0.787059 0.616878
1 17101 C G -0.47105704329760717 0.6375999999999998 74004 0.1 -0.426144 0.904655
1 17147 A G 0.08532879488562903 0.932 74004 0.1 0.0850198 0.996379
1 17516 C T -0.8723820309097523 0.3829999999999999 74004 0.1 -0.657387 0.753554
1 17532 A C 0.1705215348636156 0.8646 74004 0.1 0.168095 0.985771
1 17626 A G 0.7388468491852137 0.45999999999999996 74004 0.1 0.594244 0.804285
1 17765 A G 0.09388564134428469 0.9252 74004 0.1 0.0934746 0.995622
By manually adding the column names using joe text editor and viewing the file, it seems to be fine.
head PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames
CHR BP A1 A2 Z P N FREQ BETA SE
1 13668 A G -0.33742041121370514 0.7358 74004 0.1 -0.319711 0.947515
1 14506 A G -0.511215737572647 0.6092 74004 0.1 -0.455185 0.890397
1 14773 T C 1.3712404899915154 0.1702999999999999 74004 0.1 0.807969 0.589225
1 14860 T G -1.2758741791491297 0.20200000000000018 74004 0.1 -0.787059 0.616878
1 17101 C G -0.47105704329760717 0.6375999999999998 74004 0.1 -0.426144 0.904655
1 17147 A G 0.08532879488562903 0.932 74004 0.1 0.0850198 0.996379
1 17516 C T -0.8723820309097523 0.3829999999999999 74004 0.1 -0.657387 0.753554
1 17532 A C 0.1705215348636156 0.8646 74004 0.1 0.168095 0.985771
1 17626 A G 0.7388468491852137 0.45999999999999996 74004 0.1 0.594244 0.804285
The last thing we need to do in order for us to upload the data to the CTG-VL is to get the rsID’s, which is done in the following steps.
cd Datasets
head 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt
minimac_names minimac_names_if_no_rsID chr_b37 pos_b37 Intrpl_pos_b36 chr_b36 pos_b36
1:10583 rs58108140 1 10583 0 1 583
1:10611 rs189107123 1 10611 0 1 611
1:13302 rs180734498 1 13302 0 1 3165
1:13327 rs144762171 1 13327 0 1 3190
1:13957:TC_T 1:13957:TC_T 1 13957 0 1 3820
1:13980 rs151276478 1 13980 0 1 3843
1:30923 rs140337953 1 30923 0 1 20786
1:46402:C_CTG 1:46402:C_CTG 1 46402 0 1 36265
1:47190:G_GA 1:47190:G_GA 1 47190 0 1 37053
awk -v OFS='\t' '{print $1"_"$2, $0}' PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP
awk -v OFS='\t' '{print $3"_"$4, $0}' 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt > 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP
By viewing the files, it looks like it worked.
head PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP
CHR_BP CHR BP A1 A2 Z P N FREQ BETA SE
1_13668 1 13668 A G -0.33742041121370514 0.7358 74004 0.1 -0.319711 0.947515
1_14506 1 14506 A G -0.511215737572647 0.6092 74004 0.1 -0.455185 0.890397
1_14773 1 14773 T C 1.3712404899915154 0.1702999999999999 74004 0.1 0.807969 0.589225
1_14860 1 14860 T G -1.2758741791491297 0.20200000000000018 74004 0.1 -0.787059 0.616878
1_17101 1 17101 C G -0.47105704329760717 0.6375999999999998 74004 0.1 -0.426144 0.904655
1_17147 1 17147 A G 0.08532879488562903 0.932 74004 0.1 0.0850198 0.996379
1_17516 1 17516 C T -0.8723820309097523 0.3829999999999999 74004 0.1 -0.657387 0.753554
1_17532 1 17532 A C 0.1705215348636156 0.8646 74004 0.1 0.168095 0.985771
1_17626 1 17626 A G 0.7388468491852137 0.45999999999999996 74004 0.1 0.594244 0.804285
head 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP
chr_b37_pos_b37 minimac_names minimac_names_if_no_rsID chr_b37 pos_b37 Intrpl_pos_b36 chr_b36 pos_b36
1_10583 1:10583 rs58108140 1 10583 0 1 583
1_10611 1:10611 rs189107123 1 10611 0 1 611
1_13302 1:13302 rs180734498 1 13302 0 1 3165
1_13327 1:13327 rs144762171 1 13327 0 1 3190
1_13957 1:13957:TC_T 1:13957:TC_T 1 13957 0 1 3820
1_13980 1:13980 rs151276478 1 13980 0 1 3843
1_30923 1:30923 rs140337953 1 30923 0 1 20786
1_46402 1:46402:C_CTG 1:46402:C_CTG 1 46402 0 1 36265
1_47190 1:47190:G_GA 1:47190:G_GA 1 47190 0 1 37053
Before merging the two files in order to get the rsID’s we need to handle possible duplicate CHR_BP variant names first, to prevent incorrect matching.
cut -f1 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1
cut -f1 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP > 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1
sort PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1 | uniq -u > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1.uniq
sort 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1 | uniq -u > 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1.uniq
Do a line count of the original and generate unique file to see
whether duplicates were removed.
wc -l PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1.uniq
12688309 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1 12631385 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1.uniq
In the GWASS file, 56,924 CHR_BP variant names were removed.
wc -l 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1.uniq
31326390 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1
31281348 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1.uniq
In the 1000G reference file, 45,042 CHR_BP variant names were removed.
Now the rsID’s can be added to the GWASS file.
awk 'NR==FNR {FILE1[$1]=$0; next} ($1 in FILE1) {print $0}' PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.col1.uniq PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.uniq
awk 'NR==FNR {FILE1[$1]=$0; next} ($1 in FILE1) {print $0}' 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.col1.uniq 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP > 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.uniq
awk -v OFS='\t' 'NR==FNR { FILE1[$1]=$3; next} ($1 in FILE1) {print FILE1[$1], $0}' 1000G_20101123_v3_GIANT_chr1_23_Marker_Details.txt.CHR_BP.uniq PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP.uniq > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq
wc -l PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq
12688309 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP
11079499 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq
So 1,608,810 variants have been removed due to the merge.
head PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames
SNP CHR_BP CHR BP A1 A2 Z P N FREQ BETA SE
rs146756510 1_126108 1 126108 A G 1.4293627464142322 0.15290000000000006 74004 0.1 0.819381 0.573249
rs148209574 1_135032 1 135032 A G 1.3141373113403905 0.18880000000000008 74004 0.1 0.795796 0.605565
rs141415251 1_233476 1 233476 A G 0.6176592182045477 0.5368 74004 0.1 0.5255 0.850793
rs144425991 1_529782 1 529782 T C -1.2629698923484767 0.2066 74004 0.1 -0.784001 0.62076
rs188948234 1_637753 1 637753 A C 2.021577678426751 0.04321999999999999 74004 0.1 0.896332 0.443383
rs191780594 1_662571 1 662571 T C 0.7629386243130007 0.4454999999999999 74004 0.1 0.606563 0.795035
rs151190501 1_662613 1 662613 A G 0.022310842464422612 0.9822 74004 0.1 0.0223053 0.999751
rs61769339 1_662622 1 662622 A G -0.7675339112394487 0.4427641389654222 137930 0.1 -0.608865 0.793274
rs150820983 1_676127 1 676127 T C 1.1090699664504071 0.26739999999999997 74004 0.1 0.742682 0.669644
egrep "rs|SNP" PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames > PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly
wc -l PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly
11079500 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames
11063896 PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly
Thereby, 15,604 variants have been removed.
The file ‘migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP.test’ was succesfully uploaded to the CTG-VL.
NOTE: When I uploaded the data, I manually specified the total sample size, as specified in Hautakangas et al. 2021 meta-analysis of migraine risk (i.e., the article that the data comes from) to be 589,356, which corresponds to the total sample size excluding 23andMe.
The file ‘PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly’ was succesfully uploaded to the CTG-VL.
NOTE: When I uploaded the data, I specified the total sample size, as specified in Wightman et al. 2021 genome-wide association study of Alzheimer’s disease to be 1,261,563. However, the data contained an N column so this might not matter.
In order to analyse the SNP-based heritability, I went to the CTG-VL. From there under ‘Analyses’ I chose ‘LD score’. Under ‘select analysis’ I chose ‘Heritability’, and under ‘select GWAS’ i chose ‘Migraine_GWAS_1’ and then I hit ‘Run Analysis’.
From the analysis I got the following table:
Migraine_GWAS_1 - h2
Observed h2 Lambda GC Mean Chi^2 Intercept Ratio
0.0283 (0.0015) 1.2664 1.3506 1.0297 (0.0078) 0.0847 (0.0222)
As this yields a significant P-value, it indicates that there is a significant SNP-based heritibility in this dataset.
In order to analyse the SNP-based heritability, I went to the CTG-VL. From there under ‘Analyses’ I chose ‘LD score’. Under ‘select analysis’ I chose ‘Heritability’, and under ‘select GWAS’ i chose ‘Alzheimer_GWAS_1’ and then I hit ‘Run Analysis’.
From the analysis I got the following table:
Alzheimer_GWAS - h2
Observed h2 Lambda GC Mean Chi^2 Intercept Ratio
0.0111 (0.002) 1.0957 1.1785 1.0194 (0.0132) 0.1089 (0.0739)
As this yields a significant P-value, it indicates that there is a significant SNP-based heritibility in this dataset.
In order to analyse the genetic correlation between the two traits, I went to the CTG-VL. From there under ‘Analyses’ I chose ‘LD score’. Under ‘select analysis’ I chose ‘Genetic correlation’, and under ‘select GWAS’ i chose ‘Alzheimer_GWAS’ and ‘Migraine_GWAS_1’ then I hit ‘Run Analysis’.
From the analysis I got the following table:
Alzheimer_GWAS vs Migraine_GWAS_1 - rG
rG se z P h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se
0.0466 0.0528 0.8828 0.3774 0.0286 0.0018 1.0271 0.0104 0.0081 0.0053
From the table above, the p-value of 0.3774 indicates that there is no genetic correlation between the two traits.
As the Cross-trait LD Score regression did not indicate any genetic correlation, I tried running SECA, which seems to be more sensitive in order to detect a genetic correlation.
The input files should contain the following five columns of data with a header row listing column names of ‘SNP’, ‘EA’, ‘NEA’, ‘P’ and ‘BETA’. The formatting is done in the following steps.
Migraine (dataset 1):
awk -v OFS='\t' '{ print $3, $4, $5, $9, $7 }' migraine_IHGC2021_no23andMe_gwama_v2.txt.CTGVL.rsidonly.correctcolname.sortBP.test > migraine.SECA
Alzheimer’s disease (dataset 2):
awk -v OFS='\t' '{ print $1, $5, $6, $8, $11 }' PGCALZ2sumstatsExcluding23andMe.txt.noinf.freqbetase.colnames.CHR_BP_1000G_20101123.withRSID.uniq.colnames.rsOnly > alzheimer.SECA
head migraine.SECA alzheimer.SECA
==> migraine.SECA <==
SNP A1 A2 P BETA
rs62053745 T C 0.430516 0.011277
rs9747082 G A 0.508602 0.009148
rs62053747 A G 0.278197 0.028343
rs34151105 C T 0.330109 -0.015252
rs62053748 T A 0.842967 0.007108
rs77383171 T C 0.806836 0.004359
rs75157665 A G 0.334874 -0.015101
rs35054424 G A 0.502227 0.009247
rs17055023 G A 0.512025 0.009035
==> alzheimer.SECA <==
SNP A1 A2 P BETA
rs146756510 A G 0.15290000000000006 0.819381
rs148209574 A G 0.18880000000000008 0.795796
rs141415251 A G 0.5368 0.5255
rs144425991 T C 0.2066 -0.784001
rs188948234 A C 0.04321999999999999 0.896332
rs191780594 T C 0.4454999999999999 0.606563
rs151190501 A G 0.9822 0.0223053
rs61769339 A G 0.4427641389654222 -0.608865
rs150820983 T C 0.26739999999999997 0.742682
Giving the columns the appropriate names.
The column names are changed using joe editor, and saves as migraine.SECA.correctcolname and alzheimer.SECA.correctcolname respectively.
Viewing the data.
head migraine.SECA.correctcolname alzheimer.SECA.correctcolname
==> migraine.SECA.correctcolname <==
SNP EA NEA P BETA
rs62053745 T C 0.430516 0.011277
rs9747082 G A 0.508602 0.009148
rs62053747 A G 0.278197 0.028343
rs34151105 C T 0.330109 -0.015252
rs62053748 T A 0.842967 0.007108
rs77383171 T C 0.806836 0.004359
rs75157665 A G 0.334874 -0.015101
rs35054424 G A 0.502227 0.009247
rs17055023 G A 0.512025 0.009035
==> alzheimer.SECA.correctcolname <==
SNP EA NEA P BETA
rs146756510 A G 0.15290000000000006 0.819381
rs148209574 A G 0.18880000000000008 0.795796
rs141415251 A G 0.5368 0.5255
rs144425991 T C 0.2066 -0.784001
rs188948234 A C 0.04321999999999999 0.896332
rs191780594 T C 0.4454999999999999 0.606563
rs151190501 A G 0.9822 0.0223053
rs61769339 A G 0.4427641389654222 -0.608865
rs150820983 T C 0.26739999999999997 0.742682
Now the format should be correct.
LD clumping is an approach that finds the LD independent SNPs, and only these should be used to run SECA. LD clumping is performed in the following steps.
awk 'NR==FNR {FILE1[$1]=$0; next} ($1 in FILE1) {print $0}' alzheimer.SECA.correctcolname migraine.SECA.correctcolname > migraine_subset_overlapping_alzheimer.tmp
Then I find the overlap of SNPs between the two datasets.
awk 'NR==FNR {FILE1[$2]=$0; next} ($1 in FILE1) {print $0}' hapmap_CEU_r23a_filtered.bim migraine_subset_overlapping_alzheimer.tmp > migraine_subset_overlapping_alzheimer
wc -l migraine_subset_overlapping_alzheimer
2227755 migraine_subset_overlapping_alzheimer
head migraine_subset_overlapping_alzheimer
rs6565733 G A 0.899924 -0.001644
rs1106175 G A 0.76494 0.004095
rs7235612 A G 0.197555 0.020496
rs9950836 A G 0.41199 0.0175
rs11127467 G C 0.383763 -0.027588
rs10193286 C G 0.622834 -0.016573
rs11898038 A G 0.3966 0.020979
rs7595668 A G 0.902585 -0.00335
rs10195681 A C 0.847215 -0.005271
rs11901199 A G 0.548693 0.014687
As can be seen the column names are missing, so these are being added again using joe editor.
plink --noweb --bfile hapmap_CEU_r23a_filtered --clump migraine_subset_overlapping_alzheimer --clump-field P --clump-p1 1 --clump-p2 1 --clump-r2 0.1 --clump-kb 1000 --out migraine_subset_overlapping_alzheimer_clump1
The script looks as follows:
#!/bin/bash
# Run this as "qsub submit_simple.sh"
# Set the resource requirements; 1 CPU, 10 GB memory and 12 hours wall time.
#PBS -l ncpus=1
#PBS -l mem=10GB
#PBS -l walltime=12:00:00
# Send an email when this job aborts, begins or ends.
#PBS -m abe
#PBS -M n11368292@qut.edu.au
# Run your program.
cd $PBS_O_WORKDIR
./plink --noweb --bfile hapmap_CEU_r23a_filtered --clump migraine_subset_overlapping_alzheimer --clump-field P --clump-p1 1 --clump-p2 1 --clump-r2 0.1 --clump-kb 1000 --out migraine_subset_overlapping_alzheimer_clump1
To run the script i wrote the following command.
qsub script_clump1.py
awk '{ print $3 }' migraine_subset_overlapping_alzheimer_clump1.clumped > migraine_subset_overlapping_alzheimer_clump1.SNPs
awk 'NR==FNR {FILE1[$3]=$0; next} ($1 in FILE1) {print $0}' migraine_subset_overlapping_alzheimer_clump1.clumped migraine_subset_overlapping_alzheimer > migraine_subset_overlapping_alzheimer.clump1
head migraine_subset_overlapping_alzheimer.clump1
SNP EA NEA P BETA
rs10193286 C G 0.622834 -0.016573
rs11898038 A G 0.3966 0.020979
rs8064924 G A 0.474664 0.009721
rs10045830 G T 0.507788 0.005939
rs3794811 G C 0.76548 0.002978
rs4451252 T C 0.309267 0.012697
rs4461849 A C 0.794341 -0.006941
rs1609550 A G 0.25041 0.010513
rs11185553 G C 0.744019 0.003372
This file contains a subset of GWAS summary results from the migraine dataset for a set of independent (index) SNPs (r2 > 0.1) for all pairs of SNPs within 1 Mb of each other.
migraine_subset_overlapping_alzheimer.clump1 file to ensure
none of the index SNPs within 10 Mb of each other are in long-range LD
(r2 > 0.1)../plink --noweb --bfile hapmap_CEU_r23a_filtered --extract migraine_subset_overlapping_alzheimer_clump1.SNPs --clump migraine_subset_overlapping_alzheimer.clump1 --clump-field P --clump-p1 1 --clump-p2 1 --clump-r2 0.1 --clump-kb 10000 --out migraine_subset_overlapping_alzheimer_clump2
The script looks as follows:
#!/bin/bash
# Run this as "qsub submit_simple.sh"
# Set the resource requirements; 1 CPU, 10 GB memory and 12 hours wall time.
#PBS -l ncpus=1
#PBS -l mem=10GB
#PBS -l walltime=12:00:00
# Send an email when this job aborts, begins or ends.
#PBS -m abe
#PBS -M n11368292@qut.edu.au
# Run your program.
cd $PBS_O_WORKDIR
./plink --noweb --bfile hapmap_CEU_r23a_filtered --extract migraine_subset_overlapping_alzheimer_clump1.SNPs --clump migraine_subset_overlapping_alzheimer.clump1 --clump-field P --clump-p1 1 --clump-p2 1 --clump-r2 0.1 --clump-kb 10000 --out migraine_subset_overlapping_alzheimer_clump2
To run the script i wrote the following command:
qsub script_clump2.py
awk 'NR==FNR {FILE1[$3]=$0; next} ($1 in FILE1) {print $0}' migraine_subset_overlapping_alzheimer_clump2.clumped migraine_subset_overlapping_alzheimer > migraine.independent
awk 'NR==FNR {FILE1[$3]=$0; next} ($1 in FILE1) {print $0}' migraine_subset_overlapping_alzheimer_clump2.clumped alzheimer.SECA.correctcolname > alzheimer.independent
head migraine.independent
SNP EA NEA P BETA
rs10045830 G T 0.507788 0.005939
rs4451252 T C 0.309267 0.012697
rs12718064 C T 0.427238 -0.008721
rs11649979 T C 0.649061 -0.006907
rs3930589 T C 0.134194 -0.034377
rs7213702 A G 0.022943 -0.037337
rs7215219 G A 0.026133 -0.020486
rs11185557 G A 0.133448 -0.014336
rs4133102 C G 0.003636 0.063565
head alzheimer.independent
SNP EA NEA P BETA
rs2980300 T C 0.3739393122267137 0.66446
rs4951864 C T 0.09179162970999584 -0.860095
rs1891905 C T 0.29342569121177464 -0.724347
rs11260549 A G 0.1753068387723845 0.804681
rs3813199 A G 0.4115305156064484 0.634635
rs6603781 G A 0.8838024363734281 0.144614
rs1810745 A G 0.5941115614725418 -0.470282
rs7531530 T C 0.8473090387788222 0.18908
rs13303201 A C 0.28713670613181275 -0.728818
wc -l migraine.independent migraine.SECA.correctcolname
27006 migraine.independent
9428582 migraine.SECA.correctcolname
wc -l alzheimer.independent alzheimer.SECA.correctcolname
27006 alzheimer.independent
11063896 alzheimer.SECA.correctcolname
The migraine.independent and alzheimer.independent files can now be used to perform SECA.
SECA was performed in the following steps.
./process.pl migraine.independent > migraine.independent.input
./process.pl alzheimer.independent > alzheimer.independent.input
head migraine.independent.input alzheimer.independent.input
==> migraine.independent.input <==
SNP EA NEA P BETA
rs10045830 G T 0.507788 0.005939
rs4451252 T C 0.309267 0.012697
rs12718064 C T 0.427238 -0.008721
rs11649979 T C 0.649061 -0.006907
rs3930589 T C 0.134194 -0.034377
rs7213702 A G 0.022943 -0.037337
rs7215219 G A 0.026133 -0.020486
rs11185557 G A 0.133448 -0.014336
rs4133102 C G 0.003636 0.063565
==> alzheimer.independent.input <==
SNP EA NEA P BETA
rs2980300 T C 0.3739393122267137 0.66446
rs4951864 C T 0.09179162970999584 -0.860095
rs1891905 C T 0.29342569121177464 -0.724347
rs11260549 A G 0.1753068387723845 0.804681
rs3813199 A G 0.4115305156064484 0.634635
rs6603781 G A 0.8838024363734281 0.144614
rs1810745 A G 0.5941115614725418 -0.470282
rs7531530 T C 0.8473090387788222 0.18908
rs13303201 A C 0.28713670613181275 -0.728818
Preparing a text file named ‘filelist.txt’ which lists
‘migraine.independent.input’ on line 1 and ‘alzheimer.independent.input’
on line 2.
The text file was made using Visual Studio Code and then uploaded to the HPC using FileZilla.
Aligning the SNP effects (BETAs) in the second file listed in ‘filelist.txt’ to the same effect allele (‘EA’) for the SNP effects (BETAs) in the first file listed in ‘filelist.txt’.
./gwama2.1_SECAalign
It gave the following output in the terminal:
File filelist.txt contained 2 studies.
------------------------------------
Reading file: migraine.independent.input
Strand column missing! Expecting always positive strand.
Marker count: 27005 Markers passing sanity check: 27005
Strand problems: 0 Wrong alleles: 0
Effect problems: 0 Multiple occurances: 0
------------------------------------
Reading file: alzheimer.independent.input
Strand column missing! Expecting always positive strand.
Marker count: 27005 Markers passing sanity check: 27004
Strand problems: 0 Wrong alleles: 0
Effect problems: 1 Multiple occurances: 0
------------------------------------
Preparing output...
------------------------------------
GWAMA program finished current job successfully!
Please check gwama.log.out for full information.
Analysis finished.
Changing the file name.
As the SNPs are LD-independent, the ‘aligned_effects.txt’ file was renamed to ‘independent_aligned_effects.txt’ using joe editor.
View the file.
head independent_aligned_effects.txt
SNP EA NEA BETA1 P1 EFF1 BETA2 P2 EFF2
rs10045830 G T 0.005939 0.507788 + 0.793212 0.192712 +
rs4451252 T C 0.012697 0.309267 + 0.763799 0.236670 +
rs12718064 C T -8.7210e-03 0.427238 - -7.0439e-01 0.321021 -
rs11649979 T C -6.9070e-03 0.649061 - -9.0128e-02 0.927893 -
rs3930589 T C -3.4377e-02 0.134194 - -3.8131e-01 0.679995 -
rs7213702 A G -3.7337e-02 0.022943 - 0.777669 0.216087 +
rs7215219 G A -2.0486e-02 0.026133 - 0.653641 0.387761 +
rs11185557 G A -1.4336e-02 0.133448 - -1.4347e-01 0.884730 -
rs4133102 C G 0.063565 0.003636 + -3.9227e-01 0.669780 -
wc -l independent_aligned_effects.txt
27006 independent_aligned_effects.txt
module load r
And then:
R
Then within R, the following command was used to run SECA:
source("SECA_Ranalysis.R")
module load r
And then:
R
Then within R, the following command was used to run the SECA heatmap permutation:
source("SECA_Ranalysis_HeatmapPerm.R")
In contrast to cross-trait LD Score regression, SECA indicated that there was a genetic correlation between the traits.
In order to analyse the genetic causality between traits, I went to the CTG-VL. From there under ‘Analyses’ I chose ‘GSMR’. Under ‘select outcome’ i chose ‘Alzheimer_GWAS’ and under ‘select exposure’ i chose ‘Migraine_GWAS_1’ then I hit ‘Run Analysis’.
When trying to run this analysis, I eventually got an error.
I kept getting an error every time I tried running GSMR on CTG-VL either as Alzheimer’s Disease as the outcome and Migraine as the exposure or reversed. The first step towards finding the problem was to find two other datasets that was already in the Neal Lab, and then run each of them with my own data. I used the ‘Illnesses of mother: Alzheimer’s disease/dementia’ from Neal lab and tried to run it with my own migraine data. This worked both when the migraine data was the outcome and the exposure. Afterwards, I used the ‘Diagnoses - main ICD10: R51 Headache’ from Neal lab and tried to run it with my own Alzheimer’s Disease data. This only worked when the Alzheimer’s data was set as the exposure, as the ‘Diagnoses - main ICD10: R51 Headache’ did not seem to have enough significant loci.
The above first of all indicated that GSMR worked, and furthermore it indicated that my datasets both worked independently. However, I still could not figure out the exact problem.
I then tried to run GCTA, which is a software package that can be run on the terminal, that supports the GSMR analysis (among other things).
The preparation of GCTA was done in the following steps in my local terminal, as the HPC kept killing my job, when I ran the analysis on there.
unzip gcta-1.94.1-linux-kernel-3-x86_64.zip
Outcome text file:
alzheimer alzheimer_GSMR_colnames
Exposure text file:
migraine migraine_GSMR_N_colnames
GCTA was run using the following command.
gcta-1.94.1 --bfile g1000_eur --gsmr-file migraine_GSMR.txt alzheimer_GSMR.txt --gsmr-direction 0 --effect-plot --out gsmr_result
After running for some time, an error was raised.
Error: there are too many SNPs that have large difference in allele frequency. Please check the GWAS summary data.
This got me thinking that I had used fixed frequencies when preparing my Alzheimer’s dataset CTG-VL upload, which could be the problem. I also noticed, that a file called ‘gsmr_result.freq.badsnps’ had been produced, which might contain all the SNPs that was raising the error.
I tried counting how many SNPs the file contained.
wc -l gsmr_result.freq.badsnps
2794 gsmr_result.freq.badsnps
As there was only 2794 bad SNPs, I decided to remove these, to solve the problem.
awk 'NR == FNR {a[$1]; next} !($3 in a)' gsmr_result.freq.badsnps alzheimer_GSMR_colnames > alzheimer_GSMR_colnames_nobadsnps
Then I changed the outcome text file, so it looked like this.
alzheimer alzheimer_GSMR_colnames_nobadsnps
I then tried running GCTA again, using Alzheimer’s Disease as the outcome, and Migraine as the exposure.
gcta-1.94.1 --bfile g1000_eur --gsmr-file migraine_GSMR.txt alzheimer_GSMR.txt --gsmr-direction 0 --effect-plot --out gsmrMA_result
This yielded the following results.
Exposure Outcome bxy se p nsnp
migraine alzheimer 5.29027 1.38816 0.000138403 26
The table above indicates, that there is a significant causal relationship.
Then i ran GCTA using Migraine as the outcome and Alzheimer’s Disease as the exposure.
gcta-1.94.1 --bfile g1000_eur --gsmr-file alzheimer_GSMR.txt migraine_GSMR.txt --gsmr-direction 0 --effect-plot --out gsmrAM_result
This yielded the following results.
Exposure Outcome bxy se p nsnp
alzheimer migraine 0.00216948 0.00153883 0.158593 83
The table above indicates, that there is not a significant causal relationship.
The plot are generated in the following steps.
source("gsmr_plot.r")
gsmr_data = read_gsmr_data("gsmrMA_result.eff_plot.gz")
gsmr_summary(gsmr_data) # show a summary of the data
plot_gsmr_effect(gsmr_data, "migraine", "alzheimer", colors()[75])
source("gsmr_plot.r")
gsmr_data = read_gsmr_data("gsmrAM_result.eff_plot.gz")
gsmr_summary(gsmr_data) # show a summary of the data
plot_gsmr_effect(gsmr_data, "alzheimer", "migraine", colors()[75])