The Analysis

The following analysis pulled geochemistry data from shells of juvenile mussels rasied in separate bays in eastern Maine in 2015 (along side the larvae from the previous analysis), and tested the effects of source (bay), and co-variance matrix prior, on the ability of the infinite mixture model (IMM) to correctly assign juvenile chemistry to sources. During analysis, each juvenile was removed from the overall data set, and the remaining juveniles used as baseline data to train the IMM. The individual juvenile removed from the data set was then assigned to a source based on the IMM, allowing for 7 possible extra, and untrained, source assignments, using each of four different co-variance matrix priors:

For each co-variance matrix prior for each juvenile, the IMM was run for an initial 1000 adaptation and 2000 burn-in iterations. The posterior distribution of source assignments for a final 3000 iterations were retained and used for further analysis.

The Data Set

Data were \(Log(x+1)\) transformed, and centered (by elemental mean) and scaled (by elemental standard deviation) prior to analysis. Together, these transformations were meant to normalize the elemental data, and then standardize the center and variance so that unknown sources could be easily modeled in the IMM. Principal Component Analysis suggested a lot of overlap in the multivariate geochemical signals among sources along the first 2 principle components (accounting for 72 and 12 percent of total variation respectively). Of all elements, Mn and Co accounted for the most variation among individual larvae, while La accounted for the least.

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7     PC8
Standard deviation     0.8952 0.3571 0.24539 0.19649 0.17896 0.15648 0.11961 0.08027
Proportion of Variance 0.7252 0.1154 0.05449 0.03494 0.02898 0.02216 0.01295 0.00583
Cumulative Proportion  0.7252 0.8407 0.89514 0.93008 0.95907 0.98122 0.99417 1.00000

Results by Individual

Below are the posterior distribution results of the IMM assignment for each individual juveniles (color) to each potential source (x-axis; numbers represent possible extra sources). Results are separated by the actual juvenile source (panel rows) and the co-variance matrix prior used in the IMM (panel columns).

We can see that generally, individual sources assign most frequently to one source, which is usually the correct one. Depending on the covariance used, sources like MBR, GB and DB are more multi-modal and diffuse in their assignments though.

Results by Source

To summarize assignment results by site, we take the mode source assignment for each individual and use that as the IMM predicted source assignment. Then, for each co-variance matrix prior (panels), we plot the percent of individuals from each source (Actual Source) assigned to each possible source by the IMM (Predicted Source).

Just as before, when we looked at data on an individual level, we generally see IMM assignment to correct sources, but, depending on the covariance used, MBR and GB are mostly confused with DB, and DB is confused with GB. Maximum percentage correctly assigned was ~ 88% for FBE though.

Effects of Source and Coviariance Matrix Prior on Correct Assignment

To understand what is driving misclassification by the IMM, we test the effects of Actual Source, Co-variance Prior, and their interaction with a GLM (family=binomial, link=log), using individual nested within actual source as a random effect.

Analysis of Deviance Table (Type III Wald chisquare tests)

Response: Correct
                          Chisq Df Pr(>Chisq)    
(Intercept)              0.9668  1   0.325467    
Actual_Source           17.3475  4   0.001654 ** 
Cov_Prior               15.9057  3   0.001186 ** 
Actual_Source:Cov_Prior 44.2806 12  1.368e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We find that a juveniles’s actual source interacts with the covariance used to affect correct classifications.

We can use the regression coefficient summaries for a more in-depth look.

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: Correct ~ Actual_Source * Cov_Prior + (1 | Actual_Source/Individual_Code)
   Data: leave.one.out.results.byind.assignment.table
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 20000))

     AIC      BIC   logLik deviance df.resid 
   404.7    499.5   -180.4    360.7      526 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-28.9230  -0.0205   0.0143   0.0461  25.5696 

Random effects:
 Groups                        Name        Variance  Std.Dev. 
 Individual_Code:Actual_Source (Intercept) 3.751e+01 6.124e+00
 Actual_Source                 (Intercept) 3.435e-15 5.861e-08
Number of obs: 548, groups:  Individual_Code:Actual_Source, 137; Actual_Source, 5

Fixed effects:
                                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                                     -1.4624     1.4873  -0.983 0.325467    
Actual_SourceFBE                                 9.8616     3.0361   3.248 0.001162 ** 
Actual_SourceGB                                  4.4980     2.0893   2.153 0.031331 *  
Actual_SourceMBR                                 8.6472     2.5131   3.441 0.000580 ***
Actual_SourcePHB                                 7.5958     2.1069   3.605 0.000312 ***
Cov_PriorSource_and_ID                           5.8002     1.6774   3.458 0.000544 ***
Cov_PriorSource_and_Universal                    5.8002     1.6775   3.458 0.000545 ***
Cov_PriorUniversal                              -0.8929     0.9713  -0.919 0.357924    
Actual_SourceFBE:Cov_PriorSource_and_ID         -5.8002     3.0633  -1.893 0.058294 .  
Actual_SourceGB:Cov_PriorSource_and_ID         -12.9372     2.5655  -5.043 4.59e-07 ***
Actual_SourceMBR:Cov_PriorSource_and_ID        -20.8495     4.0346  -5.168 2.37e-07 ***
Actual_SourcePHB:Cov_PriorSource_and_ID         -5.1599     2.0087  -2.569 0.010207 *  
Actual_SourceFBE:Cov_PriorSource_and_Universal  -8.0393     3.1013  -2.592 0.009536 ** 
Actual_SourceGB:Cov_PriorSource_and_Universal  -12.5134     2.5307  -4.945 7.63e-07 ***
Actual_SourceMBR:Cov_PriorSource_and_Universal -20.8495     4.0349  -5.167 2.37e-07 ***
Actual_SourcePHB:Cov_PriorSource_and_Universal  -5.8002     1.9956  -2.906 0.003655 ** 
Actual_SourceFBE:Cov_PriorUniversal              0.8929     2.7411   0.326 0.744604    
Actual_SourceGB:Cov_PriorUniversal               1.3311     1.3566   0.981 0.326494    
Actual_SourceMBR:Cov_PriorUniversal             -0.3952     1.9531  -0.202 0.839635    
Actual_SourcePHB:Cov_PriorUniversal              1.5333     1.5010   1.021 0.307028    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Relative to the intercept (reflecting the DB source and using the ID co-variance matrix), it seems that, generally, using the ID co-variance matrix, or the universal covariance matrix, give better assignments than integrating source specific covariance-matrices.

Results by Covariance Matrix Prior

If we wanted to look at the effect of different co-variance matrices in more detail, we again see that, generally, using the ID or University co-variance matrices leads to the best assignments.

Taken together, the results suggest that for best assignment, we should run the IMM with ID or universal co-variance priors, which is somewhat different than what the larval data suggested (that using source-specific covariance matrices is better). Taking these IMM assignments and projecting the correct and incorrect assignments onto the PCA from before, we see that where there is multivariate overlap in geochemistry, individuals are misassigned.

Overall, we learn that ID or universal co-variance matrices produce better IMM results when assigning juveniles, and that IMM misclassifications are mostly confined to the mutivariate overlap in geochemistry.

---
title: "Infinite Mixture Model Leave-One-Out Analysis for *Mytilus edulis* Juveniles"
output: html_notebook
---

```{r, message=FALSE, warning=FALSE, include=FALSE}
library("plyr")
library("abind")
library("reshape")
library("ggplot2")
library("car")
library("lme4")
library("ggfortify")

### some element data

ICPMS.all.2015.dat<-read.csv("/Users/scottmorello/Dropbox/Archives/Academic/UMass Boston/Etter Lab/Projects/Trace Element Mussel Connectivity/2015 ICPMS Datasets/2015.All.Data.csv")

ICPMS.all.2015.juvdat<-subset(ICPMS.all.2015.dat,Type=="Juvenile")
top.sources<-names(sort(summary(ICPMS.all.2015.juvdat$Site),decreasing = TRUE))[1:5]
ICPMS.all.2015.juvdat<-subset(ICPMS.all.2015.juvdat,Site==top.sources[1]|Site==top.sources[2]|Site==top.sources[3]|Site==top.sources[4]|Site==top.sources[5])
ICPMS.all.2015.juvdat$Site<-factor(ICPMS.all.2015.juvdat$Site)
ICPMS.all.2015.juvdat.env<-ICPMS.all.2015.juvdat[,c(1:4)]
ICPMS.all.2015.juvdat.val<-ICPMS.all.2015.juvdat[,-c(1:4)]
ICPMS.all.2015.juvdat.val.trans<-log(ICPMS.all.2015.juvdat.val+1,10)


leave.one.out.results<-readRDS("/Users/scottmorello/Dropbox/Archives/Academic/UMass Boston/Etter Lab/Projects/Trace Element Mussel Connectivity/R Work/Infinite Mixture Model in R/Jags_Work/leaveoneoutresultsJuv.rds")

```
## The Analysis
  The following analysis pulled geochemistry data from shells of juvenile mussels rasied in separate bays in eastern Maine in 2015 (along side the larvae from the previous analysis), and tested the effects of source (bay), and co-variance matrix prior, on the ability of the infinite mixture model (IMM) to correctly assign juvenile chemistry to sources. During analysis, each juvenile was removed from the overall data set, and the remaining juveniles used as baseline data to train the IMM. The individual juvenile removed from the data set was then assigned to a source based on the IMM, allowing for 7 possible extra, and untrained, source assignments, using each of four different co-variance matrix priors:

* The Identity Matrix for all sources
* Source Specific Co-variance Matrices for baseline sources, and then the Identity Matrix for extra sources
* Source Specific Co-variance Matrices for baseline sources, and then the Universal Co-variance Matrix for extra sources
* The Universal Co-variance Matrix for all sources

For each co-variance matrix prior for each juvenile, the IMM was run for an initial 1000 adaptation and 2000 burn-in iterations. The posterior distribution of source assignments for a final 3000 iterations were retained and used for further analysis.


## The Data Set

  Data were $Log(x+1)$ transformed, and centered (by elemental mean) and scaled (by elemental standard deviation) prior to analysis. Together, these transformations were meant to normalize the elemental data, and then standardize the center and variance so that unknown sources could be easily modeled in the IMM. Principal Component Analysis suggested a lot of overlap in the multivariate geochemical signals among sources along the first 2 principle components (accounting for 72 and 12 percent of total variation respectively). Of all elements, **Mn** and **Co** accounted for the most variation among individual larvae, while **La** accounted for the least.
```{r, echo=FALSE, fig.height=2, fig.width=3, message=FALSE, warning=FALSE}

autoplot(prcomp(ICPMS.all.2015.juvdat.val.trans), 
         data = ICPMS.all.2015.juvdat.env, 
         colour = "Site",
         loadings = TRUE,
         loadings.label = TRUE,
         frame = TRUE, 
         frame.type = 'norm')+
  theme_bw()
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
summary(prcomp(ICPMS.all.2015.juvdat.val.trans))
```


## Results by Individual

  Below are the posterior distribution results of the IMM assignment for each individual juveniles (color) to each potential source (x-axis; numbers represent possible extra sources). Results are separated by the actual juvenile source (panel rows) and the co-variance matrix prior used in the IMM (panel columns).
```{r, echo=FALSE, fig.height=3, fig.width=4, message=FALSE, warning=FALSE}
leave.one.out.results.byind.table<-melt(leave.one.out.results,id.vars=c("Individual","Cov_Prior","Actual"))
colnames(leave.one.out.results.byind.table)[3:5]<-c("Actual_Source","Predicted_Source","Percent_Assignment")

leave.one.out.results.byind.table$Predicted_Source<-factor(leave.one.out.results.byind.table$Predicted_Source)
levels(leave.one.out.results.byind.table$Predicted_Source)<-c(levels(ICPMS.all.2015.juvdat.env[sampind,"Site"]),c(refsite.n+1:extrasite.n))

leave.one.out.results.byind.table$Actual_Source<-factor(leave.one.out.results.byind.table$Actual_Source)
levels(leave.one.out.results.byind.table$Actual_Source)<-c(levels(ICPMS.all.2015.juvdat.env[sampind,"Site"]),c(refsite.n+1:extrasite.n))


colour.vec<-rainbow(max(leave.one.out.results.byind.table$Individual), s=.6, v=.9)[sample(c(1:max(leave.one.out.results.byind.table$Individual)),max(leave.one.out.results.byind.table$Individual))]
colour.vec<-colour.vec[match(leave.one.out.results.byind.table$Individual,c(1:max(leave.one.out.results.byind.table$Individual)))]


ggplot(leave.one.out.results.byind.table,aes(x=Predicted_Source,group=Individual,y=Percent_Assignment,colour=factor(Individual)))+
    facet_grid(Actual_Source~Cov_Prior)+
    geom_line() + 
    scale_colour_manual(values = colour.vec)+
    theme_bw()+
    theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position="none")
```
We can see that generally, individual sources assign most frequently to one source, which is usually the correct one. Depending on the covariance used, sources like MBR, GB and DB are more multi-modal and diffuse in their assignments though.


## Results by Source

To summarize assignment results by site, we take the mode source assignment for each individual and use that as the IMM predicted source assignment. Then, for each co-variance matrix prior (panels), we plot the percent of individuals from each source (Actual Source) assigned to each possible source by the IMM (Predicted Source).
```{r, echo=FALSE, fig.height=1, fig.width=4, message=FALSE, warning=FALSE}
leave.one.out.results.byind.assignment<-ddply(leave.one.out.results.byind.table,.(Individual,Cov_Prior,Actual_Source),summarize,Predicted_Source=which.max(Percent_Assignment))
leave.one.out.results.byind.assignment$Predicted_Source<-factor(leave.one.out.results.byind.assignment$Predicted_Source)
levels(leave.one.out.results.byind.assignment$Predicted_Source)<-c(levels(ICPMS.all.2015.juvdat.env[sampind,"Site"]),c(refsite.n+1:extrasite.n))[1:length(unique(leave.one.out.results.byind.assignment$Predicted_Source))]

#ID
ID.assign.table<-(table(subset(leave.one.out.results.byind.assignment,Cov_Prior=="ID")[,c(3:4)])/rowSums(table(subset(leave.one.out.results.byind.assignment,Cov_Prior=="ID")[,c(3:4)])))[c(1:5),]

#Source_and_ID
Source_and_ID.assign.table<-(table(subset(leave.one.out.results.byind.assignment,Cov_Prior=="Source_and_ID")[,c(3:4)])/rowSums(table(subset(leave.one.out.results.byind.assignment,Cov_Prior=="Source_and_ID")[,c(3:4)])))[c(1:5),]

#Source_and_Universal
Source_and_Universal.assign.table<-(table(subset(leave.one.out.results.byind.assignment,Cov_Prior=="Source_and_Universal")[,c(3:4)])/rowSums(table(subset(leave.one.out.results.byind.assignment,Cov_Prior=="Source_and_Universal")[,c(3:4)])))[c(1:5),]

#Universal
Universal.assign.table<-(table(subset(leave.one.out.results.byind.assignment,Cov_Prior=="Universal")[,c(3:4)])/rowSums(table(subset(leave.one.out.results.byind.assignment,Cov_Prior=="Universal")[,c(3:4)])))[c(1:5),]

leave.one.out.results.all.assignment<-cbind(data.frame(Cov_Prior=rep(c("ID","Source_and_ID","Source_and_Universal","Universal"),each=5),Actual_Source=rep(row.names(ID.assign.table),4)),rbind(ID.assign.table,Source_and_ID.assign.table,Source_and_Universal.assign.table,Universal.assign.table))
leave.one.out.results.all.assignment<-melt(leave.one.out.results.all.assignment)
colnames(leave.one.out.results.all.assignment)[3:4]<-c("Predicted_Source","Percent_Assigned")


ggplot(leave.one.out.results.all.assignment,aes(x=Predicted_Source,y=Actual_Source,fill=Percent_Assigned))+
  facet_wrap(~Cov_Prior,ncol=4)+
  geom_tile(colour = "white") + 
  scale_fill_gradient(low = "white", high = "red",limits=c(0,1))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Just as before, when we looked at data on an individual level, we generally see IMM assignment to correct sources, but, depending on the covariance used, MBR and GB are mostly confused with DB, and DB is confused with GB. Maximum percentage correctly assigned was ~ 88% for FBE though.


## Effects of Source and Coviariance Matrix Prior on Correct Assignment

To understand what is driving misclassification by the IMM, we test the effects of Actual Source, Co-variance Prior, and their interaction with a GLM (family=binomial, link=log), using individual nested within actual source as a random effect.
```{r, echo=FALSE, message=FALSE, warning=FALSE}
leave.one.out.results.byind.assignment.table<-leave.one.out.results.byind.assignment
leave.one.out.results.byind.assignment.table$Correct<-as.numeric(as.character(leave.one.out.results.byind.assignment.table$Actual_Source)==as.character(leave.one.out.results.byind.assignment.table$Predicted_Source))
leave.one.out.results.byind.assignment.table$Individual_Code<-paste0(leave.one.out.results.byind.assignment.table$Actual_Source,leave.one.out.results.byind.assignment.table$Individual)

leave.one.out.results.correct.assignment.lm2<-glmer(Correct~Actual_Source*Cov_Prior+(1|Actual_Source/Individual_Code),data=leave.one.out.results.byind.assignment.table,family=binomial)
ss <- getME(leave.one.out.results.correct.assignment.lm2,c("theta","fixef"))
leave.one.out.results.correct.assignment.lm<-update(leave.one.out.results.correct.assignment.lm2,start=ss,control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e4)))
Anova(leave.one.out.results.correct.assignment.lm,type=3)
```
We find that a juveniles's actual source interacts with the covariance used to affect correct classifications.

We can use the regression coefficient summaries for a more in-depth look.
```{r, echo=FALSE, message=FALSE, warning=FALSE}
summary(leave.one.out.results.correct.assignment.lm)
```
Relative to the intercept (reflecting the DB source and using the ID co-variance matrix), it seems that, generally, using the ID co-variance matrix, or the universal covariance matrix, give better assignments than integrating source specific covariance-matrices.


## Results by Covariance Matrix Prior
If we wanted to look at the effect of different co-variance matrices in more detail, we again see that, generally, using the ID or University co-variance matrices leads to the best assignments.
```{r, echo=FALSE, fig.height=2, fig.width=4, message=FALSE, warning=FALSE}
leave.one.out.results.correct.assignment<-ddply(leave.one.out.results.byind.assignment.table,.(Cov_Prior,Actual_Source),summarize,Correctly_Assigned=sum(Correct),Total=length(Correct))
leave.one.out.results.correct.assignment$Percent_Correct<-leave.one.out.results.correct.assignment$Correctly_Assigned/leave.one.out.results.correct.assignment$Total


ggplot(leave.one.out.results.correct.assignment,aes(x=Actual_Source,y=Percent_Correct,fill=Cov_Prior))+
  geom_bar(stat="identity",position="dodge")+
  scale_fill_brewer(name="Covariance Prior",breaks=c("ID", "Source_and_ID","Source_and_Universal","Universal"),labels=c("Identity Matrix", "Source Specific Covariance,\nthen Identity Matrices","Source Specific Covariance,\nthen Universal Covariance Matrices","Universal Covariance Matrix"))+
  xlab("Actual Source")+
  ylab("Percent Correctly Assigned")+
  theme_bw()
```

Taken together, the results suggest that for best assignment, we should run the IMM with ID or universal co-variance priors, which is somewhat different than what the larval data suggested (that using source-specific covariance matrices is better). Taking these IMM assignments and projecting the correct and incorrect assignments onto the PCA from before, we see that where there is multivariate overlap in geochemistry, individuals are misassigned.
```{r, echo=FALSE, fig.height=2, fig.width=3, message=FALSE, warning=FALSE}
leave.one.out.results.byind.assignment.correct<-subset(leave.one.out.results.byind.assignment.table,Cov_Prior=="Source_and_Universal")
leave.one.out.results.byind.assignment.correct$Correct[which(leave.one.out.results.byind.assignment.correct$Correct==1)]<-"Correct"
leave.one.out.results.byind.assignment.correct$Correct[-which(leave.one.out.results.byind.assignment.correct$Correct=="Correct")]<-"Incorrect"


ICPMS.all.2015.juvdat.env2<-ICPMS.all.2015.juvdat.env
ICPMS.all.2015.juvdat.env2$Correct<-factor(leave.one.out.results.byind.assignment.correct$Correct)

autoplot(prcomp(ICPMS.all.2015.juvdat.val.trans), 
         data = ICPMS.all.2015.juvdat.env2, 
         colour = "Site",
         shape="Correct",
         loadings = TRUE,
         loadings.label = TRUE,
         frame = TRUE, 
         frame.type = 'norm')+
  theme_bw()
```

Overall, we learn that ID or universal co-variance matrices produce better IMM results when assigning juveniles, and that IMM misclassifications are mostly confined to the mutivariate overlap in geochemistry.