Genotypes are from Brenna
Forester tutorial
Original wolf data are from Schweizer
et al. (2015) Molecular Ecology, Dryad c9b25
Please see README.md at https://github.com/z0on/RDA-forest for
RDAforest package installation instructions and overview of
functions.
Loading packages:
library(RDAforest)
library(rnaturalearth)
library(rnaturalearthdata)
library(terra)
library(viridis)
landscape=st_as_sf(ne_countries(scale="medium",continent="north america"))
ll=load("./wolf_v5.RData")
ll
## [1] "geno" "env" "latlon" "envc" "envf" "ecotype" "env.full"
geno: genotypes of 94 wolves at sites with allele
freq >0.05, in 0,1,2 format
env: subset of environmental variables that we will use
for modeling, for the 94 wolves
latlon: coordinates of sampled wolves
envc: raster of coordinates and variable values for
today
envf: raster of coordinates and variable values for
future (2041-2060)
ecotype: table of sample names and their designated
ecotype, according to Schweizer et al., 2015
env.full: all bioclimatic variables (+ treecover) for
the 94 wolves. Will not use it here (for consistency with the RDA-forest
paper), but you can try exploring how it will work if we use all 19
variables instead of a small subset.
Let’s compute genetic distances based on genotypic correlation, and make an unconstrained ordination:
# distances:
cordist=1-cor(t(geno))
# ordination:
ord=capscale(cordist~1)
Proportion of variance explained by PCs (“MDS” is the mathematically identical to principal components so we will call them PCs):
plot(ord$CA$eig/sum(ord$CA$eig),xlab="PC",ylab="proportion of variance explained")
So we clearly have some very interesting leading principal
components, perhaps 7-10.
Extracting scores, plotting ordination:
so=data.frame(scores(ord,scaling=1,display="sites"))
ggplot(so,aes(MDS1,MDS2,color=ecotype$ecotype))+geom_point()+coord_equal()+theme_bw()
This looks pretty but the question is, how much of this structure is
due to just isolation-by-distance (IBD)? Meaning, the
fact that wolves found near each other are more similar
genetically?
Plotting genetic distance against geographic distance to see if there is a sloping cline (or multiple sloping clines):
# converting lat, lon to great circle distances
GCD=gcd.dist(latlon)
latlon.gcd=GCD[[1]]
distGCD=GCD[[2]]
plot(as.dist(cordist)~distGCD,pch=16,cex=0.6,col=rgb(0,0,0,alpha=0.2))
# Alternative: use Universal Transverse Mercator coordinates instead of great circle-based
# epsg.code=epsg.maker(mean(range(latlon$lon)),mean(range(latlon$lat)))
# latlon.utm=latlon2UTM(latlon, epsg.code)
# dist.utm=dist(latlon.utm)
#plot(as.dist(cordist)~dist.utm,pch=16,cex=0.6,col=rgb(0,0,0,alpha=0.2))
Looks like there is a sloping cline! To make it more formal, let’s test for significance of correlation between genetic and geographic distances:
protest(capscale(distGCD~1),capscale(cordist~1))
##
## Call:
## protest(X = capscale(distGCD ~ 1), Y = capscale(cordist ~ 1))
##
## Procrustes Sum of Squares (m12 squared): 0.402
## Correlation in a symmetric Procrustes rotation: 0.7733
## Significance: 0.001
##
## Permutation: free
## Number of permutations: 999
Definitely a highly significant dependence.
Since under pure isolation-by-distance genetic distances are expected to
look like logarithm of geographical distances, we will compute PCs of
log(geographical distances) and use them as covariates to regress out
when computing genetic ordination.
# getting the first two PCs of log-distance matrix
d.ord=capscale(distGCD~1)
pcs.d=scores(d.ord,scaling=1,display="sites",choices=c(1:2))
# constructing and plotting partial ordination
ord1=capscale(cordist~1+Condition(as.matrix(pcs.d)))
plot(ord1$CA$eig/sum(ord1$CA$eig),xlab="PC",ylab="proportion of variance explained")
so1=data.frame(scores(ord1,scaling=1,display="sites"))
ggplot(so1,aes(MDS1,MDS2,color=ecotype$ecotype))+geom_point()+coord_equal()+theme_bw()
Note that in this plot, ecotypes form compact clouds of points, as opposed to elongated shapes like on the uncorrected plot above. This is an indication of IBD correction working. At the same time, the ecotypes remain well-separated and there are no far-flung outliers. Merging of the previously separated clusters and appearance of outliers would be an indication of over-correction (removal of variation of interest).
Let’s see if any of our predictors (bioclimatic variables + tree cover) are super correlated with each other
pc=hclust(as.dist(1-cor(env)))
plot(pc)
abline(h=0.1, col="red")
Variables below red line are correlated with r > 0.9! Still let’s keep all of them for now, for consistency with the RDA-forest paper.
We will use all predictors and more leading PCs (40, specified by the
option pcs2keep=c(1:40)) that we will likely need, just to
see how it looks. Note that we are using ord1 ordination object
here, the one with lat and lon regressed out.
gf=makeGF(ord1,env,pcs2keep=c(1:20))
Which PCs are predicted?
# predicted PCs, and how well they are predicted (per-PC R2s)
gf$result
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7
## 0.60485934 0.82322985 0.51326041 0.37836228 0.35600004 0.38743002 0.19866178
## MDS8 MDS9 MDS12 MDS15
## 0.13670559 0.08358239 0.02933609 0.06225810
These values are cross-validation R-squares (R2), the proportion of
variance explained along each PC. Some are pretty high!
Looks like we must keep the first 9 MDSes .
So how much total variance does our model capture?
# rescaling to proportion of total variance (based on eigenvalues in ord1)
eigen.var=(ord1$CA$eig/sum(ord1$CA$eig))[names(gf$result)]
# total variance explained by model
sum(eigen.var*gf$result)
## [1] 0.1259879
about 12.5%, it looks like.
Let’s plot importances of all predictors. These are properly scaled to
correspond to the proportion of variance explained by the predictor in
the whole dataset.
# setting the number of PCs to keep
tokeep=10
# computing properly scaled importances:
imps=data.frame(importance_RDAforest(gf,ord1))
# some data frame housekeeping...
names(imps)="R2"
imps$var=row.names(imps)
# reordering predictors by their importances:
imps$var=factor(imps$var,levels=imps$var[order(imps$R2)])
# plotting
ggplot(imps,aes(var,R2))+geom_bar(stat="identity")+coord_flip()+theme_bw()
R2 values above may seem low to people used to linear regressions.
The main reason is that, unlike in linear regressions, the R2 here is
computed on samples that were not used for model
building (“hold-out” samples). So this is a true predictive power. In
linear regression, the model fit is tested on the data used to build the
same model, which always over-estimates the predictive power of the
model (I wonder why people keep doing this).
Turnover curves is the central concept of the gradient forest method,
Ellis et al 2012,
greatly helping interpretation of random forest models. These curves
reflect how much the predicted variable (y axis) changes as you move
along the range of predictor (x axis). Importantly, although
turnover curves look like monotonous accumulation of difference across
the predictor scale, this may not necessarily be the case. The
key meaningful information in these plots are boundaries between
distinct states, which look like steps, and the magnitude of
transitions, which are heights of those steps. Remember that
multiple subsequent steps may lead back to the same
state.
Let’s make a small model for just the first 3 PCs (for better
readability of the plots) and examine which predictors describe them,
and how.
gf3=makeGF(ord1,env,pcs2keep=c(1:3))
## Calculating forests for 3 species
## ...
plot_gf_turnovers(gf3,imps$var[1:8])
The colored plots are turnovers of individual PCs (from
MDS1 to MDS3), and the black-and-white plots
are their averages. Colored plots don’t have predictor labels below
x-axes, but their order corresponds to the black and white plots, which
are labeled. You can see three things:
- which PC is linked with which predictor For example, PC1
(MDS1, red line on colored plots) is associated with pretty
much every predictor of those that we plotted, but especially with
MaxT_WarmM .
- where in the predictor range major gPC transitions happen. For
example, for MinT_ColdM the largest transition (which
affects both MDS2 and MDS3) is at about -13
degrees C (black and white plot).
- scale of the y axes shows how much variance is explained
by each predictor for each PC (colored plots), and average across all
PCs (back and white curves). Proportion of total data variance explained
is actually lower because each PC explains a different fraction of it;
we will have properly scaled turnover curves later on.
To discard predictors that only appear important because they are
correlated with some truly important variables, we use the
mtry criterion. mtry is the number of randomly
chosen predictors to find the next split in the tree. With higher
mtry there is a higher chance that the actual driver is
chosen together with the non-influential correlated variable and is then
used for the split. As a result, the correlated variable is used less
often, which drives its importance down, as observed in simulations by
Strobl
et al 2008. So, we fit two models with different mtry
settings to each ordination jackknife replicate. Predictors consistently
showing diminished raw importance at the higher mtry setting are then
discarded.
This procedure can be made less strict (i.e. retain more predictors) by
setting prop.positive.cutoff, the proportion of replicates
in which importance has to increase under higher mtry, to
something less than 0.5. Another way to achieve less strict selection is
to manually specify mintry = 3, maxtry = 6 (these would be
used by default for situations with less than 12 predictors).
Note: this step will take a bit.
mm=mtrySelJack(Y=cordist,X=env,covariates=pcs.d,nreps=30,prop.positive.cutoff=0.5, top.pcs=tokeep)
Which environmental variables pass the selection process?
mm$goodvars
## [1] "MeanT_DryQ" "MaxT_WarmM" "T_Season" "Prec_DryQ"
Boxplot of predictor importance change depending on
mtry:
ggplot(mm$delta,aes(var,values))+
geom_boxplot(outlier.shape = NA)+
coord_flip()+
geom_hline(yintercept=0,col="red")
Bar chart of proportion of positive change in response to higher
mtry. Good variables would be the ones above the red line
(change yintersept setting here to match
prop.positive.cutoff in the call to
mtrySelJack) :
ggplot(mm$prop.positive,aes(var,prop.positive))+
geom_bar(stat="identity")+
coord_flip()+
geom_hline(yintercept=0.5,col="red")
We are only using the variables we selected above. Note that we are
still using pcs.d as a covariate, to account for
IBD. newX here is a dataframe of environmental conditions
across the landscape where adaptation is to be predicted. This data
table must contain all the predictors that the model uses.
We will make predictions for the present-day environment, and for the
future environment.
# present-day
oj=ordinationJackknife(Y=cordist,X=env[,mm$goodvars],newX=envc,covariates=pcs.d,nreps=20,top.pcs=tokeep,extra=0.1)
# future
ojf=ordinationJackknife(Y=cordist,X=env[,mm$goodvars],newX=envf,covariates=pcs.d,nreps=20,top.pcs=tokeep,extra=0.1)
# save(mm,gf,oj,ojf,file="models_wolf_vFeb25.RData")
Plotting predicted turnover curves for the top three predictors. This time the numbers on y axis are meaningful - they reflect the proportion of total variance explained by the predictor.
for(i in 1:length(mm$goodvars)){
plot_turnover(oj,envc[oj$goodrows,],names(oj$median.importance)[i])
}
Plotting boxplots of importance (proportion of total variance explained)
ggplot(oj$all.importances,aes(variable,importance))+geom_boxplot(outlier.shape = NA)+coord_flip()
Forming predictions based on averaging replicates from
ordinationJackknife:
# spots on the map that are within modeled parameter range:
goods=oj$goodrows
# predictor data restricted to only those spots:
ras2=envc[which(goods),]
xy2=envc[goods,c("x","y")]
names(xy2)=c("lon","lat")
rfpreds=oj$predictions.direct
turnovers=oj$predictions.turnover
bests=names(oj$median.importance)
First, we plot unclustered random forest predictions. This will
produce two plots:
- PCA plot of predicted values with arrows showing how the top four
variables that passed mtrySelJack procedure fit to it (by
linear regression, RDA-style). Number of predictor arrows shown can be
changed in the chunk above. The proportions of that plot are controlled
by three parameters:
- rangeExp : increasing this value will make the PCA plot
smaller relative to the plotting area
- scal : increasing this value will make the arrows
shorter
- jitscale : increasing this value will increase the
distance from arrow tips to arrow labels. The labels are jittered
somewhat every time, so you might want to rerun the same command several
times until you get them positioned nicely.
- Actual map of predicted adaptations, in square coordinates (we will
replot it later in “nice” coordinates). Similar colors indicate similar
adaptations. To change colors on this map (and on the PCA plot), try
varying color.scheme option to plot_adaptation, like
“001”,“100”,“010”,“111” etc. The saturation of the colors is controlled
by lighten option; setting it to zero will produce maximum
saturation.
pa0=plot_adaptation(rfpreds,ras2[,bests],xy2,main="unclustered",
# options affecting PCA plot:
rangeExp=1.5,
scal=10,
jitscale=0.05,
# options affecting colors:
color.scheme="011",
lighten=0.8
)
Now our goal is to break the continuous colors in the map above into
“adaptive neighborhoods” - bounded areas likely to contain
similarly-adapted organisms. There are two ways doing it. One
way is to simply cluster spatial points based on random forest
predictions, corresponding to colors on the map above. We will
ask for more clusters than we expect (based on the number of wolf
ecotypes, which is six) with the idea that we will then merge the
clusters that are too similar. There are two options affecting this
process:
- nclust : number of initial clusters;
- cluster.merge.level : threshold for merging, as a
fraction of the maximum dissimilarity observed between clusters.
With these options to plot_adaptation, an additional plot
will be produced, showing hierarchical clustering tree of the spatial
clusters, with the red line showing the merging threshold. This can be
used as the guide to adjust nclust and
cluster.merge.threshold, although there are no formal
criteria how to do it.
Also note that the map now has numbers on it - these correspond to
merged clusters in the tree plot.
pa1=plot_adaptation(rfpreds,ras2[,bests],xy2,main="direct preds",
# options affecting PCA plot:
rangeExp=1.5,
scal=10,
jitscale=0.05,
# options affecting map and PCA colors:
color.scheme="011",
lighten=0.8,
# options affecting clustering:
cluster.guide = NULL,
nclust=20,
cluster.merge.level=0.35
)
The second way of clustering is to form clusters based on
turnover curves, but merge them according to similarity of direct random
forest predictions within them. In simulations this mode
generates less noisy adaptive neighborhood maps than clustering based on
predictions themselves. This is done by including additional option,
cluster.guide:
pa2=plot_adaptation(rfpreds,ras2[,bests],xy2,main="turnovers",
# options affecting PCA plot:
rangeExp=1.5,
scal=10,
jitscale=0.05,
# options affecting map and PCA colors:
color.scheme="011",
lighten=0.8,
# options affecting clustering:
cluster.guide = turnovers,
nclust=20,
cluster.merge.level=0.35
)
adding “overlay.points” of original samples, colored by
ecotype.
Unclustered:
plot_nice_map(xy2,mapdata=landscape,map.on.top=F,size=1,cols=pa0$colors,overlay.points = latlon,size.points=2,cols.points = ecotype$ecotype)
Direct clustering based on RF predictions.
plot_nice_map(xy2,mapdata=landscape,map.on.top=F,cols=pa1$colors,size=1,overlay.points = latlon,cols.points = ecotype$ecotype,size.points=2)
This map separates ecotypes faril well.
Finally, clustering based on turnover curves. This mode is supposed to result in less noisy clusters.
plot_nice_map(xy2,mapdata=landscape,map.on.top=F,size=1,cols=pa2$colors,overlay.points = latlon,cols.points = ecotype$ecotype,size.points=2)
Maps based on different clustering are almost identical for this dataset. They delineate ecotypes fairly well, identifying major boundaries between Arctic, mid-continent, and the British Columbia environments. Atlantic forest and West forest ecotypes seem to share similar adaptations, also quite close to Boreal forest ecotype. Somewhat surprisingly, the boundary between Arctic and High Arctic ecotypes remains uncertain, despite these ecotypes being very distinct in the IBD-corrected PCA plot (see above). The likely explanation is that our predictor set simply does not include an environmental parameter that either drives their separation or is correlated with one.
Genetic offset is simply the Euclidean distance between future and
present-day gPC predictions for each point on the landscape. To make it
more interpretable, we scale it to the 90%-th quantile of the difference
observed between present-day gPC predictions across landscape: then the
offset of 1 would mean that survival at this specific location in the
future will require almost as much adaptation as currently observed
across the whole study area.
We have already generated future predictions (ojf), so
let’s compare them to the present-day (oj). The function
gen_offset_oj() requires four arguments: the two
ordinationJaccknife() objects for present-day and future,
and two sets of spatial coordinates, for present-day and future
predictions (for aligning the two).
OFFS=gen_offset_oj(X=oj,Y=ojf,sx=envc[,1:2],sy=envf[,1:2])
# plotting the offset as a raster on the map (square coordinates):
# # simple way
# plot(terra::rast(OFFS),col=rev(map.pal("inferno")))
# fancy way
#pdf("wolves_gen_offset.pdf",height=5, width=7)
gg=ggplot()+geom_sf(data=landscape)+
xlim(min(OFFS$x),max(OFFS$x))+
ylim(min(OFFS$y), max(OFFS$y))+
theme_minimal()+
geom_raster(data=OFFS,aes(x=x,y=y,fill=offset))+scale_fill_viridis(option="inferno",direction = -1)
plot(gg)
#dev.off()
Here (and in the subsequent maps) darker color means worse. We see that East and West coast wolves are not very endangered, Arctic wolves are in real trouble, and mid-continent wolves are in between the two extremes.
Let’s imagine we are managing a wildlife park, and we want to breed
local wolves with wolves from elsewhere to bring genetic variants that
would promote adaptation to future conditions. Where should we get such
wolves from? From a location where wolf adaptation now is
similar to what will be needed in our park in the future. In
other words, we need to find where present-day gPCs predictions are
similar to the future gPCs predictions at our park’s location. Function
env_mismatch() does that:
# lat, lon coordinates of our park (somewhere in the middle of Canada)
park=c(-107, 55)
# raster of future gPCs
rfut=terra::rast(data.frame(cbind(envf[ojf$goodrows,1:2],ojf$predictions.direct)))
# extracting future adaptation needed in our park
park.fut=terra::extract(x = rfut, y = data.frame(rbind(park)), method="bilinear" )[-1]
# rescaling factor: 90% quantile of present-day adaptation disparity
sc=adapt_scale(oj$predictions.direct)[2]
# computing distances between future-needed and present-day adaptation
agf=env_mismatch(X=park.fut,Y=oj,sy=envc[,1:2],sc=sc)
# plot agf suitability, simple way
# plot(terra::rast(agf),col=rev(map.pal("inferno")))
# points(park[1],park[2],pch=0)
# fancy way
#pdf("wolves_agf.pdf",height=5, width=7)
gg=ggplot()+geom_sf(data=landscape)+
xlim(min(agf$x),max(agf$x))+
ylim(min(agf$y), max(agf$y))+
theme_minimal()+
geom_raster(data=agf,aes(x=x,y=y,fill=env.mismatch))+scale_fill_viridis(option="inferno",direction = -1)+
geom_point(data=data.frame(rbind(park)),aes(X1,X2),pch=0,size=2,col="cyan")
plot(gg)
#dev.off()
Our park is the little square in the middle. The preferred area for our assisted gene flow provenance is south of the park, next to Canada-USA border.
Now consider a situation when we have a genotyped wolf of unknown
origin, and we want to release it into the wild so it has the best
chance of survival. This is the same problem that we just solved above
for assisted gene flow: we need to calculate environmental mismatch
across lanscape for a specific vector or gPCs, only this time this
vector characterizes an individual wolf, not the future adaptation
needed at our wildlife park.
So let’s pick an “lone wolf” from our pack of 94, build a
ordinationJackknife model without it, and then predict
where it belongs.
# This one is High Arctic ecotype. Feel free to pick any other, it is quite fun.
i=65
print(ecotype[i,"ecotype"])
## [1] "Pop_4_HighArctic"
# data to build a model
geno0=geno[-i,]
# our lone wolf
geno.i=geno[i,]
# distance matrix for model building
cordist0=1-cor(t(geno0))
# adding the lone wolf to the matrix, to predict its gPCs later
cordist.i=1-cor(t(rbind(geno0,geno.i)))
Building the landscape model without the lone wolf (will take a
while…):
oj0=ordinationJackknife(Y=cordist0,X=env[-i,mm$goodvars],newX=envc,covariates=pcs.d[-i,],nreps=20,top.pcs=tokeep,extra=0.1)
Now we predict gPCs for the unknown wolf and calculate its mismatch
with the landscape:
# step one: getting gPCs for the other (model-building) wolves
ord0=capscale(cordist0~1+Condition(as.matrix(pcs.d[-i,])))
sc0=scores(ord0, scaling=1, display="sites",choices=c(1:40))
# predicting gPCs for the same old wolves plus the new "lone wolf"
scp=predict(ord0,cordist.i,type='sp',scaling="sites")[,1:40]
# making sure the predictions for the new wolf align (mostly in terms of gPC sign) with the model
pro=procrustes(sc0,scp[-nrow(cordist.i),],scale=FALSE)
# at last, getting the predicted gPCs for the lone wolf
wi=as.vector(as.numeric(pro$rotation %*% scp[nrow(scp),]))[1:tokeep]
sc=adapt_scale(oj0$predictions.direct)[2]
# calculating environmental mismatch
home=env_mismatch(X=wi,Y=oj0,sy=envc[,1:2],sc=sc)
# plot habitat mismatch for our wolf:
# # simple way
# plot(terra::rast(home),col=rev(map.pal("magma")))
# # this is where the wolf actually came from:
# points(latlon[i,1],latlon[i,2],pch=0)
# fancy way
#pdf("wolves_loneWolf.pdf",height=5, width=7)
gg=ggplot()+geom_sf(data=landscape)+
xlim(min(home$x),max(home$x))+
ylim(min(home$y), max(home$y))+
theme_minimal()+
geom_raster(data=home,aes(x=x,y=y,fill=env.mismatch))+scale_fill_viridis(option="inferno",direction = -1)+
geom_point(data=data.frame(latlon[i,]),aes(lon,lat),pch=0,col="cyan",size=2)
plot(gg)
#dev.off()
Looks like our lone wolf wandered to the edge of its comfort zone -
the lightest area on the map (best habitat suitability for its genotype)
shows that it belongs to the High Arctic (which is the correct ecotype),
but the location where it was sampled is in a darker area next to
it.