genedropper: conduct a gene-drop analysis through a pedigree.A “gene-drop” simulation approach can model expected changes in allele frequencies due to genetic drift, given an identical pedigree structure.
To install this package, you can use the install_github command in library(devtools):
library(devtools)
install_github("susjoh/genedroppeR")
NB This is a work in progress. More description will be given in time. The functions are also slower than they could be - we are working on it.
Things to add: * Weighted regression * Selection
Do you have comments, polite criticism, feedback, helpful suggestions? Please contact me at Susan.Johnston (at) ed.ac.uk.
library(genedroppeR)
The library comes with an example dataset from a long-term study of unicorns on the island of Áiteigin off the South coast of Scotland (Figure 1). They have been genotyped for three loci: HornSNP, a major quantitative trait locus for horn length; MHC, encompassing haplotype variation at the Magic Histocompatibility Locus; and ColourSNP, a SNP responsible for a rare glitter colour polymorphism.
Figure 1: A typical unicorn.
data(unicorn)
str(unicorn)
## 'data.frame': 3951 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ mother : int NA NA NA NA NA NA NA NA NA NA ...
## $ father : int NA NA NA NA NA NA NA NA NA NA ...
## $ cohort : int 1 1 1 1 1 1 1 1 1 1 ...
## $ HornSNP : Factor w/ 3 levels "AA","GA","GG": NA 3 NA 2 2 3 1 NA 2 1 ...
## $ MHC : Factor w/ 16 levels "AA","AB","AC",..: NA 7 11 8 6 4 NA 12 10 10 ...
## $ ColourSNP: Factor w/ 3 levels "AA","AB","BB": NA 3 NA NA NA NA 3 3 3 NA ...
Unicorns have a promiscuous mating system, in common with other magical beasts. We can see an example of the pedigree below. The numbers at the side indicate the cohorts of when each unicorn was born. This can be any sequential series of numbers.
We can summarise this data in terms of the allele frequency changes using the function summary_cohort(). Let’s start with the HornSNP locus.
unicorn.summ <- summary_cohort(id = unicorn$id,
mother = unicorn$mother,
father = unicorn$father,
cohort = unicorn$cohort,
genotype = unicorn$HornSNP)
unicorn.summ
## cohort A G GenoCount FullCount PropGenotyped
## 1 1 0.4090909 0.5909091 33 47 0.7021277
## 2 2 0.3918919 0.6081081 74 105 0.7047619
## 3 3 0.3939394 0.6060606 33 99 0.3333333
## 4 4 0.4611111 0.5388889 90 103 0.8737864
## 5 5 0.3818898 0.6181102 127 144 0.8819444
## 6 6 0.3721591 0.6278409 176 188 0.9361702
## 7 7 0.3850575 0.6149425 174 181 0.9613260
## 8 8 0.3954082 0.6045918 196 206 0.9514563
## 9 9 0.4184783 0.5815217 184 194 0.9484536
## 10 10 0.4139785 0.5860215 186 190 0.9789474
## 11 11 0.3985149 0.6014851 202 236 0.8559322
## 12 12 0.4243119 0.5756881 218 231 0.9437229
## 13 13 0.4400826 0.5599174 242 255 0.9490196
## 14 14 0.4540541 0.5459459 185 187 0.9893048
## 15 15 0.4537037 0.5462963 216 219 0.9863014
## 16 16 0.4466912 0.5533088 272 343 0.7930029
## 17 17 0.4523810 0.5476190 147 204 0.7205882
## 18 18 0.4782609 0.5217391 253 272 0.9301471
## 19 19 0.4506803 0.5493197 294 322 0.9130435
## 20 20 0.4581281 0.5418719 203 225 0.9022222
## 21 NA NaN NaN 0 0 NaN
## NonFounders Founders PropFounders
## 1 0 47 1.00000000
## 2 28 77 0.73333333
## 3 70 29 0.29292929
## 4 52 51 0.49514563
## 5 99 45 0.31250000
## 6 139 49 0.26063830
## 7 151 30 0.16574586
## 8 188 18 0.08737864
## 9 187 7 0.03608247
## 10 176 14 0.07368421
## 11 224 12 0.05084746
## 12 220 11 0.04761905
## 13 244 11 0.04313725
## 14 167 20 0.10695187
## 15 202 17 0.07762557
## 16 306 37 0.10787172
## 17 187 17 0.08333333
## 18 252 20 0.07352941
## 19 316 6 0.01863354
## 20 214 11 0.04888889
## 21 NA NA NA
This output provides the allele frequencies for the A and G alleles per cohort, as well as a count of the number of genotyped individuals and the total number of individuals per cohort, the proportion of genotyped individuals, the numbers of non founders and founders, and the proportion of founders per cohort. A founder is defined as an individual where neither the mother or father is known.
The frequency of the A allele, which confers a larger horn, seems to be increasing in the population:
library(ggplot2)
ggplot(unicorn.summ, aes(cohort, A)) +
geom_line() +
stat_smooth(method = "lm") +
ggtitle("Temporal dynamics of A allele")
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_path).
However, given the complex structure of the pedigree, it may be that this increase in allele frequency is more likely to be due to drift than selection. One approach to model this is to simulate allele frequency changes given a pedigree of the same structure, and see if the observed allele frequency change is significantly different from what we would expect by chance.
In early cohorts, many individuals are likely to be founders. To run the gene-drop, we select some cohorts at the start to be the “founder” cohorts, where allele frequencies are sampled from the observed frequencies in each year. To determine which cohorts we define as founder cohorts, we need to explore the proportion of genotyped and founder individuals to determine the best way to set up our simulation.
ggplot(unicorn.summ, aes(cohort, PropFounders)) +
geom_line() +
ggtitle("Proportion of Founder IDs per cohort")
## Warning: Removed 1 rows containing missing values (geom_path).
ggplot(unicorn.summ, aes(cohort, PropGenotyped)) +
geom_line() +
ggtitle("Proportion of Genotyped IDs per cohort")
## Warning: Removed 1 rows containing missing values (geom_path).
The number of cohorts declines rapidly until around 5 cohorts, so lets define cohorts 1:5 as founder cohorts, and 6:20 as simulated cohorts i.e. we will investigate how allele frequencies change in the second 15 year period from year 6 to year 20.
As Horns is a biallelic locus, we can model the gene-drop simulation using the function genedrop_snp(). We define the id, mother, father, cohort and genotype using columns from the unicorn dataset. We then define the following:
nsim - The number of simulations to runn_founder_cohorts - The number of founder cohorts - here we will choose 5.fix_founders - This is an option to keep the founder genotypes fixed in genotyped individuals, i.e. how allele frequencies would change from an almost identical starting point. In this analysis, we will keep this as F to allow some flexibility.verbose - This will output the function progress.interval - If verbose is TRUE, then this defines the interval at which to output the function progress.Let’s try it:
unicorn.UF <- genedrop_snp(id = unicorn$id,
mother = unicorn$mother,
father = unicorn$father,
cohort = unicorn$cohort,
genotype = unicorn$HornSNP,
nsim = 1000,
n_founder_cohorts = 5,
fix_founders = F,
verbose = T,
interval = 200)
## Running simulation 1 of 1000.
## Running simulation 201 of 1000.
## Running simulation 401 of 1000.
## Running simulation 601 of 1000.
## Running simulation 801 of 1000.
str(unicorn.UF)
## 'data.frame': 3951000 obs. of 7 variables:
## $ ID : chr "1" "2" "3" "4" ...
## $ True.Geno : num NA 2 NA 1 1 2 0 NA 1 0 ...
## $ cohort : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Mum.Allele : num 1 1 1 0 0 1 0 1 0 0 ...
## $ Dad.Allele : num 1 1 1 1 1 1 0 0 1 0 ...
## $ Simulation : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Simulated.Geno: num 2 2 2 1 1 2 0 1 1 0 ...
This outputs a data frame that includes the simulated genotypes for each individual. For computational reasons, the genotypes have been reported as allele dosages, e.g. 0 = AA, 1 = AG, 2 = GG.
This is a very large dataset, but we can summarise the allele frequencies per cohort with the function summary_genedrop:
unicorn.UF.summ <- summary_genedrop(unicorn.UF)
str(unicorn.UF.summ)
## List of 2
## $ observed_frequencies :'data.frame': 20 obs. of 4 variables:
## ..$ Cohort : num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ Simulation: num [1:20] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ Count : int [1:20(1d)] 33 74 33 90 127 176 174 196 184 186 ...
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ : chr [1:20] "1" "2" "3" "4" ...
## ..$ p : num [1:20(1d)] 0.409 0.392 0.394 0.461 0.382 ...
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ : chr [1:20] "1" "2" "3" "4" ...
## $ simulated_frequencies:'data.frame': 20000 obs. of 4 variables:
## ..$ Cohort : int [1:20000] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ Simulation: int [1:20000] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Count : int [1:20000] 33 74 33 90 127 176 174 196 184 186 ...
## ..$ p : num [1:20000] 0.409 0.372 0.394 0.45 0.386 ...
This object contains two lists: the observed frequencies at Horn and the simulated frequencies.
unicorn.UF.summ$observed_frequencies
## Cohort Simulation Count p
## 1 1 0 33 0.4090909
## 2 2 0 74 0.3918919
## 3 3 0 33 0.3939394
## 4 4 0 90 0.4611111
## 5 5 0 127 0.3818898
## 6 6 0 176 0.3721591
## 7 7 0 174 0.3850575
## 8 8 0 196 0.3954082
## 9 9 0 184 0.4184783
## 10 10 0 186 0.4139785
## 11 11 0 202 0.3985149
## 12 12 0 218 0.4243119
## 13 13 0 242 0.4400826
## 14 14 0 185 0.4540541
## 15 15 0 216 0.4537037
## 16 16 0 272 0.4466912
## 17 17 0 147 0.4523810
## 18 18 0 253 0.4782609
## 19 19 0 294 0.4506803
## 20 20 0 203 0.4581281
To visualise how the observed and simulated frequencies compare, we can use the plot_genedrop_results() function:
plot_genedrop_results(unicorn.UF.summ)
We can see that the founder cohorts (1:5) were sampled based on the observed frequencies, but this constraint is then removed after 5 generations. The red line shows the observed allele frequency change.
We can then use the function plot_genedrop_lm_slopes() to look at how the linear regression slopes of the allele frequencies compare to the observed slope (red vertical line). This function also outputs the number of simulated slopes that were higher or lower than the observed slope.
plot_genedrop_lm_slopes(unicorn.UF.summ,
n_founder_cohorts = 5,
remove_founders = T)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Iteration Allele Slope Slopes.Lower Slopes.Higher
## 1 0 p 0.006370185 999 1
There is some suggestion that the increase in frequency of the A allele could be due to selection in this population.
One apparent shortcoming of this approach is that we can only detect strong, directional selection. This is a valid criticism. However, one could potentially identify signature of balancing selection at a locus if there is less cumulative change in allele frequency over time than expected due to change (NB. At this stage this is just an idea - any feedback on this approach is welcome). The function plot_genedrop_cumulative_change() plots and reports the cumulated change over the non-founder cohorts:
plot_genedrop_cumulative_change(unicorn.UF.summ,
n_founder_cohorts = 5,
remove_founders = T)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Var1 Var2 value Cumulative.Change.Lower Cumulative.Change.Higher
## 1 0 p 0.1957828 1 999
This also suggests that the observed allele frequency change from year to year is less than expected by chance.
This locus characterises haplotype variation at a moderately variable locus controlling magic compatibility within unicorns.
table(unicorn$MHC)
##
## AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD
## 22 40 104 121 51 98 203 246 114 161 350 406 129 203 399 504
As you can see, there are 4 alleles at this locus. Because there are more than two alleles, we can conduct gene-dropping simulations with the function genedrop_multi(). NB. If you are running this on genotypes with a delimiter e.g. “A/C” or “Lo_Hi”, you will need to define it using genotype_delim = "/" or genotype_delim = "_" within the function.
unicorn.summ <- summary_cohort(id = unicorn$id,
mother = unicorn$mother,
father = unicorn$father,
cohort = unicorn$cohort,
genotype = unicorn$MHC)
unicorn.summ
## cohort A B C D GenoCount FullCount
## 1 1 0.04054054 0.2837838 0.3243243 0.3513514 37 47
## 2 2 0.12837838 0.2094595 0.2972973 0.3648649 74 105
## 3 3 0.10526316 0.2236842 0.3026316 0.3684211 76 99
## 4 4 0.08536585 0.1768293 0.3658537 0.3719512 82 103
## 5 5 0.09734513 0.2035398 0.2964602 0.4026549 113 144
## 6 6 0.09935897 0.2019231 0.3044872 0.3942308 156 188
## 7 7 0.09507042 0.2007042 0.3133803 0.3908451 142 181
## 8 8 0.09567901 0.2067901 0.2993827 0.3981481 162 206
## 9 9 0.08450704 0.1760563 0.3274648 0.4119718 142 194
## 10 10 0.09235669 0.1783439 0.3375796 0.3917197 157 190
## 11 11 0.07853403 0.1570681 0.3691099 0.3952880 191 236
## 12 12 0.08196721 0.1912568 0.2814208 0.4453552 183 231
## 13 13 0.08080808 0.1717172 0.3409091 0.4065657 198 255
## 14 14 0.08389262 0.1677852 0.3557047 0.3926174 149 187
## 15 15 0.11578947 0.1710526 0.3289474 0.3842105 190 219
## 16 16 0.09821429 0.1946429 0.2982143 0.4089286 280 343
## 17 17 0.09177215 0.1518987 0.3354430 0.4208861 158 204
## 18 18 0.11659193 0.1345291 0.3834081 0.3654709 223 272
## 19 19 0.11264822 0.1343874 0.3636364 0.3893281 253 322
## 20 20 0.08918919 0.1297297 0.3459459 0.4351351 185 225
## 21 NA NaN NaN NaN NaN 0 0
## PropGenotyped NonFounders Founders PropFounders
## 1 0.7872340 0 47 1.00000000
## 2 0.7047619 28 77 0.73333333
## 3 0.7676768 70 29 0.29292929
## 4 0.7961165 52 51 0.49514563
## 5 0.7847222 99 45 0.31250000
## 6 0.8297872 139 49 0.26063830
## 7 0.7845304 151 30 0.16574586
## 8 0.7864078 188 18 0.08737864
## 9 0.7319588 187 7 0.03608247
## 10 0.8263158 176 14 0.07368421
## 11 0.8093220 224 12 0.05084746
## 12 0.7922078 220 11 0.04761905
## 13 0.7764706 244 11 0.04313725
## 14 0.7967914 167 20 0.10695187
## 15 0.8675799 202 17 0.07762557
## 16 0.8163265 306 37 0.10787172
## 17 0.7745098 187 17 0.08333333
## 18 0.8198529 252 20 0.07352941
## 19 0.7857143 316 6 0.01863354
## 20 0.8222222 214 11 0.04888889
## 21 NaN NA NA NA
unicorn.MHC.UF <- genedrop_multi(id = unicorn$id,
mother = unicorn$mother,
father = unicorn$father,
cohort = unicorn$cohort,
genotype = unicorn$MHC,
nsim = 1000,
n_founder_cohorts = 5,
fix_founders = F,
verbose = T,
interval = 200)
## Running simulation 1 of 1000.
## Running simulation 201 of 1000.
## Running simulation 401 of 1000.
## Running simulation 601 of 1000.
## Running simulation 801 of 1000.
unicorn.MHC.UF.summ <- summary_genedrop(unicorn.MHC.UF)
plot_genedrop_results(unicorn.MHC.UF.summ)
plot_genedrop_lm_slopes(unicorn.MHC.UF.summ)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Iteration Allele Slope Slopes.Lower Slopes.Higher
## 1 0 A 0.0006718187 700 300
## 2 0 B -0.0049788334 247 753
## 3 0 C 0.0022493400 661 339
## 4 0 D 0.0020576748 433 567
plot_genedrop_cumulative_change(unicorn.MHC.UF.summ)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Var1 Var2 value Cumulative.Change.Lower Cumulative.Change.Higher
## 1 0 A 0.2983984 12 988
## 2 0 B 0.3747645 13 987
## 3 0 C 0.6077412 379 621
## 4 0 D 0.3810273 14 986
unicorn.summ <- summary_cohort(id = unicorn$id,
mother = unicorn$mother,
father = unicorn$father,
cohort = unicorn$cohort,
genotype = unicorn$ColourSNP)
unicorn.summ
## cohort A B GenoCount FullCount PropGenotyped
## 1 1 0.01388889 0.9861111 36 47 0.7659574
## 2 2 0.03623188 0.9637681 69 105 0.6571429
## 3 3 0.05421687 0.9457831 83 99 0.8383838
## 4 4 0.01315789 0.9868421 76 103 0.7378641
## 5 5 0.02020202 0.9797980 99 144 0.6875000
## 6 6 0.04609929 0.9539007 141 188 0.7500000
## 7 7 0.04850746 0.9514925 134 181 0.7403315
## 8 8 0.03548387 0.9645161 155 206 0.7524272
## 9 9 0.05944056 0.9405594 143 194 0.7371134
## 10 10 0.04577465 0.9542254 142 190 0.7473684
## 11 11 0.06084656 0.9391534 189 236 0.8008475
## 12 12 0.04324324 0.9567568 185 231 0.8008658
## 13 13 0.05329949 0.9467005 197 255 0.7725490
## 14 14 0.05882353 0.9411765 136 187 0.7272727
## 15 15 0.04518072 0.9548193 166 219 0.7579909
## 16 16 0.04395604 0.9560440 273 343 0.7959184
## 17 17 0.04516129 0.9548387 155 204 0.7598039
## 18 18 0.04656863 0.9534314 204 272 0.7500000
## 19 19 0.06779661 0.9322034 236 322 0.7329193
## 20 20 0.05029586 0.9497041 169 225 0.7511111
## 21 NA NaN NaN 0 0 NaN
## NonFounders Founders PropFounders
## 1 0 47 1.00000000
## 2 28 77 0.73333333
## 3 70 29 0.29292929
## 4 52 51 0.49514563
## 5 99 45 0.31250000
## 6 139 49 0.26063830
## 7 151 30 0.16574586
## 8 188 18 0.08737864
## 9 187 7 0.03608247
## 10 176 14 0.07368421
## 11 224 12 0.05084746
## 12 220 11 0.04761905
## 13 244 11 0.04313725
## 14 167 20 0.10695187
## 15 202 17 0.07762557
## 16 306 37 0.10787172
## 17 187 17 0.08333333
## 18 252 20 0.07352941
## 19 316 6 0.01863354
## 20 214 11 0.04888889
## 21 NA NA NA
unicorn.colour.UF <- genedrop_snp(id = unicorn$id,
mother = unicorn$mother,
father = unicorn$father,
cohort = unicorn$cohort,
genotype = unicorn$ColourSNP,
nsim = 1000,
n_founder_cohorts = 5,
fix_founders = F,
verbose = T,
interval = 200)
## Running simulation 1 of 1000.
## Running simulation 201 of 1000.
## Running simulation 401 of 1000.
## Running simulation 601 of 1000.
## Running simulation 801 of 1000.
unicorn.colour.UF.summ <- summary_genedrop(unicorn.colour.UF)
plot_genedrop_results(unicorn.colour.UF.summ)
plot_genedrop_lm_slopes(unicorn.colour.UF.summ)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Iteration Allele Slope Slopes.Lower Slopes.Higher
## 1 0 p 0.001436429 891 109
plot_genedrop_cumulative_change(unicorn.colour.UF.summ)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Var1 Var2 value Cumulative.Change.Lower Cumulative.Change.Higher
## 1 0 p 0.271847 917 83