Choosing variables and some preliminary results

Summary

Identified 14 important covariates (none of which were phosphorus).
Optimised JSDM by scaling covariates.
A model using all data predicted almost perfectly onto itself.
Models trained and tested on different data had mixed results. Randomly selecting half the plots as training data still predicted really well onto the remaining plots. However, using two sites to predict onto a third severely decreased predictive power.
The proportion of variance explained by the environment vs biotic interactions seemed sensitive to the data used. A similar pattern happened with the correlation results.

Identifying important covariates

I first created a correlation matrix and picked out the variables that had an absolute correlation of 0.7 or greater (|r| >= 0.7) with at least one other variable. These were then plotted against each other.

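library(dplyr)  # provides %>% and select()

# Correlation matrix of the highly correlated soil variables
corr_vars <- occ_df %>% select(acidity,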
                               Mg,
                               Na,
                               K,
                               Clay,
                               Silt,
                               Sand,
                               conductivity_ms,
                               ph_kcl, 
                               C_perc, 
                               corr_dC,
                               C_N_ratio) %>% 
                         cor()
corrplot::corrplot(corr_vars, type = "lower", method = "number")
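
The filtering step itself is not shown above. As a rough sketch of how pairs with |r| >= 0.7 could be picked out of a correlation matrix, here is a base R example on simulated data. The data frame toy and its columns a, b and c are made up for illustration and are not part of occ_df.

```r
# Simulated stand-in for the soil covariates (not real data).
set.seed(1)
n <- 100
toy <- data.frame(a = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.1)  # built to correlate strongly with a
toy$c <- rnorm(n)                    # independent of a and b

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) where the absolute correlation is >= 0.7.
idx <- which(abs(cor_mat) >= 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

Only the a/b pair is flagged here, since c was generated independently.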

There were strong correlations between pH, acidity, Mg, K and corrected dC. The pH was used as a proxy variable and renamed ph_et_al. Clay, silt and sand formed another group. Clay was used as a proxy and renamed texture. Conductivity and Na were correlated; the latter was chosen and renamed salt. Finally, there was a correlation between C_N_ratio and percent C; C_N_ratio was kept and renamed carbon. Munsell colour was also excluded.

dat_occ <- cbind(occ_df[,1:10], occ_df %>% select(lon, 
                                                  lat,
                                                  percent_over1,
                                                  percent_over2,
                                                  Ca,
                                                  Na,
                                                  P,
                                                  Olsen,
                                                  Clay,
                                                  ph_kcl, 
                                                  N_perc,
                                                  corr_dN,
                                                  C_N_ratio,
                                                  elevation,
                                                  slope,
                                                  aspect,
                                                  drainage, 
                                                  Q_cover))
colnames(dat_occ)[colnames(dat_occ) == "ph_kcl"] <- "ph_et_al"
colnames(dat_occ)[colnames(dat_occ) == "Clay"] <- "texture"
colnames(dat_occ)[colnames(dat_occ) == "Na"] <- "salt"
colnames(dat_occ)[colnames(dat_occ) == "C_N_ratio"] <- "carbon"

A boosted regression tree (BRT) was then run for each species with the above covariates (tree complexity Tc = 2, learning rate lr = 0.0005). A variable was selected for the JSDM analysis if it had a relative importance of at least 5% in at least one of the BRTs. The final variables are shown below. Interestingly, neither inorganic nor Olsen P was influential.

fin_occ_df <- dat_occ %>% 
  select(1:10, 
         ph_et_al,
         salt,
         carbon,
         Ca,
         Q_cover,
         elevation,
         percent_over1,
         percent_over2,
         lat,
         lon,
         aspect,
         texture,
         corr_dN,
         drainage)
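
The selection rule (keep a covariate if it reaches 5% relative importance in at least one species' BRT) can be sketched as below. The importance table rel_imp, the covariate names and the species names are invented purely to illustrate the rule; they are not the BRT output from this analysis.

```r
# Hypothetical relative-importance table (%): rows are covariates, columns are species.
# Each column sums to 100, as relative importances do within one BRT.
rel_imp <- matrix(
  c(93, 90, 96,
     4,  4,  3,
     3,  6,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("var_a", "var_b", "var_c"), c("sp1", "sp2", "sp3"))
)

threshold <- 5  # percent

# Keep a covariate if it reaches the threshold in at least one species' BRT.
keep <- apply(rel_imp, 1, function(x) any(x >= threshold))
selected <- names(keep)[keep]
print(selected)
```

Here var_b is dropped because it never reaches 5%, while var_c is kept on the strength of a single species.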

Fitting the JSDM

I first ran a model using all 150 plots and three latent variables. However, the diagnostic plots looked unpromising.

I then retried after scaling the covariates. Each column is first centred by subtracting its mean, and then scaled by dividing the centred column by its standard deviation. The author of boral did this in one of his examples. The diagnostic plots of this model looked much better, so from now on I’ll be using scaled covariates.
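
As a small illustration of the centring and scaling just described, base R's scale() does both steps by default. The matrix x and its values are made up for the example.

```r
# Toy covariate matrix (hypothetical values).
x <- cbind(elev = c(100, 200, 300, 400), ph = c(5.5, 6.0, 6.5, 8.0))

# scale() centres and scales each column by default (center = TRUE, scale = TRUE).
x_scaled <- scale(x)

# The same thing done by hand: subtract column means, divide by column SDs.
x_manual <- sweep(x, 2, colMeans(x), "-")
x_manual <- sweep(x_manual, 2, apply(x, 2, sd), "/")

print(round(x_scaled, 3))
```

scale() stores the means and standard deviations it used in the attributes "scaled:center" and "scaled:scale", which may be handy if new plots need to be put on the same scale as the training data.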

Predicting with JSDMs

The above model was used to predict onto the same data. The AUC values were all close to one, suggesting very good in-sample predictive power.

Species	AUC
R_burtoniae	1.000
R_comptonii	0.999
D_diversifolium	0.998
A_delaetii	1.000
A_fissum	1.000
A_framesii	0.999
C_spissum	0.999
C_staminodiosum	0.996
Dicrocaulon_sp	1.000
Oophytum_sp	0.983
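
For reference, AUC can be computed from observed presences and predicted probabilities using ranks. This is a minimal sketch of one standard way to do it, not necessarily the function used for the tables here; auc_rank, obs and pred are names made up for the example.

```r
# AUC via the Mann-Whitney rank statistic.
# obs: 0/1 observed presence; pred: predicted probability (or any score).
auc_rank <- function(obs, pred) {
  n_pos <- sum(obs == 1)
  n_neg <- sum(obs == 0)
  r <- rank(pred)  # ties receive average ranks
  (sum(r[obs == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: one presence is ranked below one absence.
obs  <- c(0, 0, 1, 1)
pred <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(obs, pred)  # 0.75
```

Note that the value is undefined (NaN) when a species has no presences or no absences in the test data.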

Let’s use sites 2 and 3 to predict site 1. Some species are predicted really well, others very badly.

Species	AUC
R_burtoniae	0.908
R_comptonii	0.626
D_diversifolium	0.329
A_delaetii	0.908
A_fissum	0.415
A_framesii	0.409
C_spissum	0.730
C_staminodiosum	0.729
Dicrocaulon_sp	0.359
Oophytum_sp	0.917

Now sites 1 and 2 on 3. Predictions are much worse than they were for site 1. A_framesii is not included as it didn’t occur in site 3.

Species	AUC
R_burtoniae	0.851
R_comptonii	0.516
D_diversifolium	0.417
A_delaetii	0.575
A_fissum	0.419
C_spissum	0.490
C_staminodiosum	0.731
Dicrocaulon_sp	0.478
Oophytum_sp	0.660

Now sites 1 and 3 on 2. Predictions are much worse than they were. R_comptonii and C_staminodiosum are not included as they didn’t occur in site 2.

Species	AUC
R_burtoniae	0.943
D_diversifolium	0.321
A_delaetii	0.681
A_fissum	0.078
A_framesii	0.531
C_spissum	0.400
Dicrocaulon_sp	0.637
Oophytum_sp	0.650

Let’s try by randomly selecting 75 plots and predicting onto the remaining 75. It predicts a lot better than site by site.

Species	AUC
R_burtoniae	0.918
R_comptonii	0.790
D_diversifolium	0.832
A_delaetii	0.794
A_fissum	0.601
A_framesii	0.623
C_spissum	0.739
C_staminodiosum	0.882
Dicrocaulon_sp	0.832
Oophytum_sp	0.916

Overall then, there is some indication that distribution is deterministic.
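
The two kinds of train/test split used above can be sketched as follows. The site vector assumes 50 plots per site, which is an assumption made for the example only; the real number of plots per site is not given here.

```r
set.seed(42)

# Hypothetical layout: 150 plots, 50 per site (assumed, not taken from the data).
site <- rep(1:3, each = 50)

# Site hold-out: train on sites 2 and 3, test on site 1.
train_site <- which(site != 1)
test_site  <- which(site == 1)

# Random split: 75 plots for training, the other 75 for testing.
train_rand <- sample(seq_along(site), 75)
test_rand  <- setdiff(seq_along(site), train_rand)

# A random split draws training plots from every site.
table(site[train_rand])
```

With the random split, both halves contain plots from all three sites, so the test plots are likely to resemble the training plots more closely than a whole held-out site does. That may be part of the reason the random split predicts better.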

Abiotic or biotic?

Boral can partition variance into that explained by environment and latent variables. However, this seems quite sensitive to the data used. The full model is shown on the left, and the randomly chosen one on the right.
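
To make the idea of the partition concrete, here is a sketch for a single species using simulated values. It only illustrates the general principle (comparing the variance of the covariate part of the linear predictor with that of the latent-variable part) and is not boral's own code; X, Z, beta and theta are all invented for the example.

```r
set.seed(7)
n <- 200

X <- scale(matrix(rnorm(n * 2), n, 2))  # two scaled covariates (simulated)
Z <- matrix(rnorm(n * 2), n, 2)         # two latent variables (simulated)
beta  <- c(1.0, -0.5)                   # assumed covariate coefficients, one species
theta <- c(0.3, 0.2)                    # assumed latent-variable loadings, same species

# Variance of each component of the linear predictor.
var_env <- var(as.numeric(X %*% beta))
var_lv  <- var(as.numeric(Z %*% theta))

# Proportion of variance attributed to each component.
prop_env <- var_env / (var_env + var_lv)
prop_lv  <- var_lv  / (var_env + var_lv)
round(c(environment = prop_env, latent = prop_lv), 2)
```

Because both proportions depend on the fitted coefficients and on the spread of the covariates in the plots used, it is plausible that they shift when the model is refitted to a different subset of the data.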

Correlations

Using the full model. Correlation due to environment on the left, correlation due to latent variables on the right.

Using the random model.

Important covariates

Full model