The Analysis

The following analysis pulled geochemistry data from the 2015 Reference Larvae data set and tested the effects of larval source and co-variance matrix prior on the ability of the infinite mixture model (IMM) to correctly assign larvae to sources. During analysis, each larva was removed from the overall data set in turn, and the remaining larvae were used as baseline data to train the IMM. The removed larva was then assigned to a source by the IMM, allowing for 7 possible extra, untrained source assignments. This was repeated for each of four different co-variance matrix priors (labeled ID, Source_and_ID, Source_and_Universal, and Universal in the model output further down).

For each co-variance matrix prior and each larva, the IMM was run for an initial 1000 adaptation and 2000 burn-in iterations. The posterior distribution of source assignments from a final 3000 iterations was retained and used for further analysis.
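The leave-one-out procedure described above can be sketched as a simple loop. This is only a minimal illustration: the IMM itself is not reproduced here, so a nearest-centroid rule stands in for it, and the function names (`leave_one_out`, `nearest_centroid`) and the toy two-element signatures are hypothetical.

```python
import math

def nearest_centroid(train, query):
    """Stand-in classifier (NOT the IMM): pick the source with the closest mean."""
    by_source = {}
    for source, values in train:
        by_source.setdefault(source, []).append(values)
    best_source, best_dist = None, math.inf
    for source, rows in by_source.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = math.dist(centroid, query)
        if dist < best_dist:
            best_source, best_dist = source, dist
    return best_source

def leave_one_out(larvae, classify):
    """Hold out each larva in turn and classify it from the remaining larvae."""
    results = []
    for i, (actual, values) in enumerate(larvae):
        baseline = larvae[:i] + larvae[i + 1:]  # everything except larva i
        results.append((actual, classify(baseline, values)))
    return results

# Made-up signatures for two sources (illustrative values only)
toy_larvae = [
    ("CHR", [0.1, 0.2]), ("CHR", [0.0, 0.3]), ("CHR", [0.2, 0.1]),
    ("PHB", [2.0, 2.1]), ("PHB", [2.2, 1.9]), ("PHB", [1.9, 2.0]),
]
loo_results = leave_one_out(toy_larvae, nearest_centroid)
loo_accuracy = sum(a == p for a, p in loo_results) / len(loo_results)
```

In the actual analysis, the `classify` step would be the IMM run (adaptation, burn-in, then retained iterations), repeated once per co-variance matrix prior.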

The Data Set

Data were \(\log(x+1)\) transformed, then centered (by elemental mean) and scaled (by elemental standard deviation) prior to analysis. Together, these transformations were meant to normalize the elemental data and then standardize the center and variance so that unknown sources could be easily modeled in the IMM. Principal Component Analysis suggested substantial overlap in the multivariate geochemical signals among sources along the first 2 principal components (accounting for 36 and 22 percent of total variation, respectively). Of all elements, Mg accounted for the most variation among individual larvae, while La accounted for the least.
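As a rough sketch of the transformation step, the snippet below applies log(x + 1) and then centers and scales each element (column). It assumes a natural logarithm and the sample standard deviation (n - 1), neither of which is stated above, and the function name and concentrations are made up for illustration.

```python
import math
import statistics

def log1p_standardize(rows):
    """Apply log(x + 1), then center and scale each column (element)."""
    logged = [[math.log1p(x) for x in row] for row in rows]
    columns = list(zip(*logged))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample SD (assumption)
    return [
        [(x - m) / s for x, m, s in zip(row, means, sds)]
        for row in logged
    ]

# Rows are larvae, columns are elements (illustrative values only)
toy_concentrations = [[0.5, 120.0], [1.5, 80.0], [0.9, 200.0], [2.4, 95.0]]
standardized = log1p_standardize(toy_concentrations)
```

After this step every element has mean 0 and standard deviation 1, so no single element dominates purely because of its measurement scale.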

Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7     PC8
Standard deviation     0.8440 0.6633 0.4559 0.43804 0.38877 0.34044 0.32125 0.20066
Proportion of Variance 0.3629 0.2242 0.1059 0.09778 0.07702 0.05906 0.05259 0.02052
Cumulative Proportion  0.3629 0.5871 0.6930 0.79082 0.86784 0.92689 0.97948 1.00000
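The proportions in this table follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. A short check, using the values printed above (so results agree only up to rounding):

```python
# Standard deviations copied from the "Importance of components" table
sdev = [0.8440, 0.6633, 0.4559, 0.43804, 0.38877, 0.34044, 0.32125, 0.20066]

variances = [s ** 2 for s in sdev]
total = sum(variances)
proportion = [v / total for v in variances]

# Running sum gives the "Cumulative Proportion" row
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)
```

The first two entries of `cumulative` reproduce the statement that PC1 and PC2 together capture a little under 59 percent of total variation.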

Results by Individual

Below are the posterior distribution results of the IMM assignment for each individual larva (color) to each potential source (x-axis; numbers represent possible extra sources). Results are separated by the actual larval source (panel rows) and the co-variance matrix prior used in the IMM (panel columns).

We can see that, generally, individual larvae assign most frequently to one source, which is usually the correct one. Larvae from sources like CHR, FBE and GB are more multi-modal and diffuse in their assignments, though.

Results by Source

To summarize assignment results by source, we take the modal source assignment across the retained posterior iterations for each individual and use that as the IMM predicted source assignment. Then, for each co-variance matrix prior (panels), we plot the percent of individuals from each source (Actual Source) assigned to each possible source by the IMM (Predicted Source).
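The summary step above can be sketched as follows: take the most frequent posterior assignment per individual, then tabulate the percent of individuals from each actual source that land in each predicted source. The function names and the tiny posterior draws are hypothetical, and ties for the mode are broken by first occurrence here, which may differ from the original analysis.

```python
from collections import Counter, defaultdict

def modal_assignment(draws):
    """Most frequent source across posterior draws (ties: first seen wins)."""
    return Counter(draws).most_common(1)[0][0]

def percent_assigned(records):
    """Map (actual_source, draws) records to {actual: {predicted: percent}}."""
    counts = defaultdict(Counter)
    for actual, draws in records:
        counts[actual][modal_assignment(draws)] += 1
    return {
        actual: {pred: 100.0 * n / sum(c.values()) for pred, n in c.items()}
        for actual, c in counts.items()
    }

# Illustrative draws; "1" stands for one of the extra, untrained sources
toy_records = [
    ("FBE", ["FBE", "FBE", "GB"]),
    ("FBE", ["FBE", "1", "FBE"]),
    ("GB", ["FBW", "FBW", "GB"]),
    ("GB", ["GB", "GB", "FBW"]),
]
table = percent_assigned(toy_records)
```

Running this once per co-variance matrix prior would give one such table per panel.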

As with the individual-level results, we generally see IMM assignment to the correct source, but CHR is mostly confused with PHB and FBW, and GB and FBW are confused with each other. Even the best-assigned source, FBE, reached a maximum of only 62% correctly assigned.

Effects of Source and Covariance Matrix Prior on Correct Assignment

To understand what is driving misclassification by the IMM, we test the effects of Actual Source, Co-variance Prior, and their interaction with a generalized linear mixed model (GLMM; family=binomial, link=logit), using individual nested within actual source as a random effect.

Analysis of Deviance Table (Type III Wald chisquare tests)

Response: Correct
                          Chisq Df Pr(>Chisq)    
(Intercept)             12.9177  1  0.0003255 ***
Actual_Source            7.9274  4  0.0942732 .  
Cov_Prior                4.9148  3  0.1781406    
Actual_Source:Cov_Prior 11.7227 12  0.4681974    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

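We find that a larva’s actual source has the strongest effect on correct classification, although this effect is marginally non-significant (p = 0.094). Neither the co-variance prior (p = 0.178) nor the interaction (p = 0.468) is significant.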
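The p-values in the deviance table are upper-tail chi-square probabilities for each Wald statistic and its degrees of freedom. As a sanity check, they can be reproduced with the closed-form tail for integer degrees of freedom; `chi2_sf` below is a hypothetical helper written for this illustration, not a function from the original analysis.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum of Poisson-type terms
        return math.exp(-half) * sum(
            half ** k / math.factorial(k) for k in range(df // 2)
        )
    # Odd df: start from the df = 1 tail, then add half-integer terms
    tail = math.erfc(math.sqrt(half))
    tail += math.exp(-half) * sum(
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return tail

# (Chisq, Df) pairs copied from the Analysis of Deviance table
p_intercept = chi2_sf(12.9177, 1)
p_source = chi2_sf(7.9274, 4)
p_prior = chi2_sf(4.9148, 3)
p_interaction = chi2_sf(11.7227, 12)
```

These values should agree with the `Pr(>Chisq)` column up to rounding of the printed statistics.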

We can use the regression coefficient summaries for a more in-depth look.

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: binomial  ( logit )
Formula: Correct ~ Actual_Source * Cov_Prior + (1 | Actual_Source/Individual_Code)
   Data: leave.one.out.results.byind.assignment.table
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 20000))

     AIC      BIC   logLik deviance df.resid 
   621.0    722.4   -288.5    577.0      722 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2871 -0.1157 -0.0579  0.1652  2.4500 

Random effects:
 Groups                        Name        Variance  Std.Dev. 
 Individual_Code:Actual_Source (Intercept) 4.722e+01 6.872e+00
 Actual_Source                 (Intercept) 1.018e-15 3.191e-08
Number of obs: 744, groups:  Individual_Code:Actual_Source, 186; Actual_Source, 5

Fixed effects:
                                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                                     -6.5205     1.8142  -3.594 0.000325 ***
Actual_SourceFBE                                 6.3633     2.9343   2.169 0.030117 *  
Actual_SourceFBW                                 4.9025     2.4853   1.973 0.048542 *  
Actual_SourceGB                                  3.5646     2.1834   1.633 0.102548    
Actual_SourcePHB                                 5.4763     2.2603   2.423 0.015399 *  
Cov_PriorSource_and_ID                           1.3127     1.1820   1.111 0.266759    
Cov_PriorSource_and_Universal                    1.8499     1.1897   1.555 0.119957    
Cov_PriorUniversal                              -0.7606     1.2383  -0.614 0.539071    
Actual_SourceFBE:Cov_PriorSource_and_ID          1.2209     1.7028   0.717 0.473394    
Actual_SourceFBW:Cov_PriorSource_and_ID         -0.2013     1.5930  -0.126 0.899459    
Actual_SourceGB:Cov_PriorSource_and_ID          -2.3905     1.4677  -1.629 0.103378    
Actual_SourcePHB:Cov_PriorSource_and_ID         -0.1891     1.4697  -0.129 0.897644    
Actual_SourceFBE:Cov_PriorSource_and_Universal   0.6837     1.7017   0.402 0.687865    
Actual_SourceFBW:Cov_PriorSource_and_Universal  -0.7385     1.5954  -0.463 0.643451    
Actual_SourceGB:Cov_PriorSource_and_Universal   -2.5551     1.4638  -1.746 0.080885 .  
Actual_SourcePHB:Cov_PriorSource_and_Universal  -0.7263     1.4709  -0.494 0.621479    
Actual_SourceFBE:Cov_PriorUniversal              1.9547     1.6763   1.166 0.243569    
Actual_SourceFBW:Cov_PriorUniversal              0.7606     1.6338   0.466 0.641565    
Actual_SourceGB:Cov_PriorUniversal               0.7606     1.4870   0.511 0.609020    
Actual_SourcePHB:Cov_PriorUniversal              1.1372     1.5158   0.750 0.453104    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Relative to the intercept (reflecting the CHR source under the ID co-variance matrix prior), larvae from FBE, FBW and PHB had significantly higher odds of correct assignment, while GB was the only source whose frequency of correct assignments did not differ significantly from CHR (p = 0.103). This is consistent with what we saw in the figures before: CHR and GB had low frequencies of correct assignments, while other sources were better.
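To make the coefficients easier to read, the log-odds can be converted to probabilities with the inverse logit. The sketch below uses the fixed-effect estimates for the ID prior only. These are conditional probabilities for an individual whose random intercept is zero; because the between-individual standard deviation is large (6.87 on the logit scale), they should not be read as the observed percent correct for each source.

```python
import math

def inv_logit(eta):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Estimates copied from the fixed-effects table (ID prior is the reference)
intercept = -6.5205  # CHR under the ID prior
source_effects = {
    "CHR": 0.0,
    "FBE": 6.3633,
    "FBW": 4.9025,
    "GB": 3.5646,
    "PHB": 5.4763,
}

conditional_prob = {
    source: inv_logit(intercept + effect)
    for source, effect in source_effects.items()
}
```

The ordering of `conditional_prob` mirrors the text: CHR is lowest, GB next, and FBE highest.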

Results by Covariance Matrix Prior

Looking at the effect of the different co-variance priors, we see that, generally, using the source-specific co-variance matrices leads to the best assignments.

Taken together, the results suggest that for best assignment, we should run the IMM with source-specific and universal co-variance priors. Taking these IMM assignments and projecting the correct and incorrect assignments onto the PCA from before, we see no clear geochemical pattern explaining why individuals are misassigned.

Overall, we learn that source-specific and universal co-variance matrices produce marginally better IMM results when assigning larvae, but that the greatest influence on IMM misclassifications is source-specific geochemistry.

