Linear discriminant analysis within the Rattus rattus complex

1 Morphological disparity test between Rattus tanezumi and Rattus R3

Rattus tanezumi and Rattus R3 are two mitotypes, which are delimited according to their mitochondrial genotypes (Pagès M. 2013). Here, we studied the geographical distribution and morphological disparity of the two mitotypes. We hypothesized that the two mitotypes Rattus tanezumi and Rattus R3 should differ morphologically. For this, we used linear discriminant analysis to evaluate the morphological discrimination of individuals previously identified by genetic barcoding.

1.1 Gathering data from new barcoded individuals and CERoPath

We gathered data from last barcoding sequences of Rattus rattus complex, and merged them to the CERoPath database, selecting only previously barcoded individuals (Rattus tanezumi and Rattus R3)

Code
load("data_rattus.RData")

1.2 geographic distribution

The visualisation of the data is done with the library leaflet .

Code
library(leaflet)

map <-  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = data_rattus |>
                     dplyr::filter(taxonomyID == "Rattus_r3"),
                   ~longitude, ~latitude,
                   label = ~ as.character(taxonomyID),
                   #popup = popupTable(scrub_villages_total |>
                   #                    dplyr::select(nameE,n_women,n_men)),
                   fillColor = "blue", weight = 0.2,
                   group = "Rattus R3",
                   stroke = FALSE, fillOpacity = 0.4) %>%
  addCircleMarkers(data = data_rattus |>
                     dplyr::filter(taxonomyID == "Rattus_tanezumi"),
                   ~longitude, ~latitude,
                   label = ~ as.character(taxonomyID),
                   #popup = popupTable(scrub_villages_total |>
                   #                    dplyr::select(nameE,n_women,n_men)),
                   fillColor = "red", weight = 0.2,
                   group = "Rattus tanezumi",
                   stroke = FALSE, fillOpacity = 0.6) %>%
  addProviderTiles(providers$Esri.WorldStreetMap,group ="WSM") %>%
  addTiles(group = "OSM") %>%
  addTiles(urlTemplate = "https://mts1.google.com/vt/lyrs=s&hl=en&src=app&x={x}&y={y}&z={z}&s=G",
           attribution = 'Google',group ="Google") %>%
  addLayersControl(baseGroups = c("WSM","OSM","Google"),
                   overlayGroups = c("Rattus tanezumi",
                                     "Rattus R3"),
                   options = layersControlOptions(collapsed = FALSE)) %>%
  addScaleBar(position = "bottomright",
              scaleBarOptions(maxWidth = 100, metric = TRUE, imperial = FALSE,
                              updateWhenIdle = TRUE)) 

map

2 Principal component analysis on morphometrics

Principal component analysis (PCA) on morphometrical variables of individuals was performed using the library factoextra allowing extracting the projection of each individual and the components (axes) of the analysis.

Code
library(factoextra)

rattus_measurements <- data_rattus |>
  dplyr::select(id_indiv,taxonomyID,bodyWeight:headMeasurement) |>
  tidyr::drop_na()

rattus_measurements[,3:8] <- scale(rattus_measurements[,3:8], center = TRUE, scale = TRUE)

# Compute PCAsg
res.pca <- prcomp(rattus_measurements[,3:8])

data_pca <- res.pca$x

fig_eigen <- fviz_eig(res.pca)

rattus_measurements$taxonomyID <- factor(rattus_measurements$taxonomyID, 
                                         levels=c("Rattus_tanezumi", "Rattus_r3"))
groups <-  as.factor(rattus_measurements$taxonomyID)

fig_pca <- fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "black",
                alpha.var = 1,
                col.ind = groups, # color by groups
                palette = c("#2E9FDF", "#FC4E07"),
                addEllipses = TRUE, # Concentration ellipses
                ellipse.type = "confidence",
                geom.ind = c("point"),
                legend.title = "Groups")

library(patchwork)
(fig_pca | fig_eigen) +
  plot_annotation(
  title = 'Results of PCA on morphometric variables',
  tag_levels = 'A')

3 Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) on morphometrical variables of individuals was performed using the library MASS to find combination of the variables that give best possible separation between the two mitotypes.

Code
library(MASS)

rattus_measurements <- data_rattus |>
  dplyr::filter(sex %in% c("male","female"))|>
  dplyr::select(taxonomyID,bodyWeight:headMeasurement) |>
  tidyr::drop_na()

rattus_measurements[,2:7] <- scale(rattus_measurements[,2:7], center = TRUE, scale = TRUE)

lda.model = lda (factor(taxonomyID) ~
                   bodyWeight + headBodyMeasurement + tailMeasurement +
                 hindfootMeasurement + earMeasurement + headMeasurement,
                 data = rattus_measurements)

# Panels of histograms and overlayed density plots
plot(lda.model, dimen=1, type="both") # fit from lda 

3.1 Confusion matrix (training error)

To check how well the model predicts our training data, we can construct a confusion matrix of the fitted vs observed Taxa using the table function.

Code
#|
lda_prediction <- predict(lda.model)
conf <- table(list(predicted=lda_prediction$class, observed= rattus_measurements$taxonomyID))

caret::confusionMatrix(conf)
Confusion Matrix and Statistics

                 observed
predicted         Rattus_r3 Rattus_tanezumi
  Rattus_r3             226             100
  Rattus_tanezumi       147             284
                                        
               Accuracy : 0.6737        
                 95% CI : (0.639, 0.707)
    No Information Rate : 0.5073        
    P-Value [Acc > NIR] : < 2.2e-16     
                                        
                  Kappa : 0.3461        
                                        
 Mcnemar's Test P-Value : 0.003423      
                                        
            Sensitivity : 0.6059        
            Specificity : 0.7396        
         Pos Pred Value : 0.6933        
         Neg Pred Value : 0.6589        
             Prevalence : 0.4927        
         Detection Rate : 0.2985        
   Detection Prevalence : 0.4306        
      Balanced Accuracy : 0.6727        
                                        
       'Positive' Class : Rattus_r3     
                                        

Statistics

Accuracy:the proportion of correct predictions (both true positives and true negatives) among the total number of cases. The 95% CI (confidence interval) for the accuracy means we can be 95% confident that the true accuracy lies within this range.

No Information Rate (NIR): the accuracy that could be obtained by always predicting the majority class (class 0 in this case).

P-Value [Acc > NIR]: the p-value for a statistical test comparing the accuracy of the model to the NIR. A small p-value (typically less than 0.05) indicates that the model’s accuracy is significantly better than the NIR.

Kappa: a metric that considers both the true positive rate and the false positive rate, providing a more balanced assessment of the model’s performance. Kappa ranges from -1 to 1, with 0 indicating no better than random chance, and 1 indicating perfect agreement between predictions and true values.

Mcnemar’s Test P-Value: the p-value for a statistical test comparing the number of false positives and false negatives. A large p-value (typically greater than 0.05) indicates that there is no significant difference between the number of false positives and false negatives.

3.2 Visualizing Confusion Matrix

Code
#|
fourfoldplot(as.table(conf),color=c("yellow","pink"),main = "Confusion Matrix")

3.3 Visualizing the ROC curve (Receiver Operating Characteristic curve)

A receiver operating characteristic curve (or ROC curve) is a graphical plot that illustrates the performance of a binary classifier model. The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting. It needs to split the dataset used for the LDA into training (80%) and testing (20%) sets in order to create predicted assignation using a GLM (general linear modelling). The library pROC is used to plot the ROC.

Code
#|
set.seed(42)

rattus_measurements1 <-rattus_measurements |>
  dplyr::mutate(taxonomyID_bino = dplyr::recode(taxonomyID,
                                                "Rattus_r3" ='0',
                                                "Rattus_tanezumi"='1')) |>
  dplyr::mutate(taxonomyID_bino = as.integer(taxonomyID_bino)) |>
  dplyr::select(- taxonomyID)

train_index <- caret::createDataPartition(rattus_measurements1$taxonomyID_bino, p = 0.8, list = FALSE)
rattus_train <- rattus_measurements1[train_index, ]
rattus_test <- rattus_measurements1[-train_index, ]

# Create the logistic regression model
model <- glm(taxonomyID_bino ~ . , data = rattus_train, family = "binomial")

# Make predictions on the test dataset
predicted_probs <- predict(model, rattus_test, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)

# Load the pROC package
library(pROC)
# Create the ROC curve
roc_obj <- roc(rattus_test$taxonomyID_bino , predicted_probs)

# Plot the ROC curve
plot(roc_obj, main = "ROC Curve for the Logistic Regression Model")
abline(0, 1, lty = 2, col = "gray")  # Add a reference line for a random classifier

3.4 Calculate the AUC (Area Under the Curve)

The AUC (Area Under Curve) of the ROC curve determine the specificity and sensitivity of the model. The closer the AUC value is to the 1, the better the given model fits the data.

Code
#|
auc_value <- auc(roc_obj)
cat("AUC value:", auc_value, "\n")
AUC value: 0.6740506 

4 Linear Discriminant Analysis (LDA) including location (longitude, latitude)

Linear Discriminant Analysis (LDA) is now used on morphometrical variables and geographic locations (longitude, latitude) of individuals.

Code
rattus_morpho_loc <- data_rattus |>
  dplyr::select(taxonomyID, bodyWeight:latitude) |>
  tidyr::drop_na()

rattus_morpho_loc[,2:9] <- scale(rattus_morpho_loc[,2:9], center = TRUE, scale = TRUE)

lda.model = lda (factor(taxonomyID) ~
                   bodyWeight + headBodyMeasurement + tailMeasurement +
                   hindfootMeasurement + earMeasurement + headMeasurement +
                   longitude + latitude,
                 data = rattus_morpho_loc)

# Panels of histograms and overlayed density plots
plot(lda.model, dimen=1, type="both") # fit from lda 

4.1 New confusion matrix (training error)

Code
#|
lda_prediction <- predict(lda.model)
conf <- table(list(predicted=lda_prediction$class, observed= rattus_measurements$taxonomyID))

caret::confusionMatrix(conf)
Confusion Matrix and Statistics

                 observed
predicted         Rattus_r3 Rattus_tanezumi
  Rattus_r3             334              27
  Rattus_tanezumi        39             357
                                          
               Accuracy : 0.9128          
                 95% CI : (0.8904, 0.9319)
    No Information Rate : 0.5073          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8255          
                                          
 Mcnemar's Test P-Value : 0.1757          
                                          
            Sensitivity : 0.8954          
            Specificity : 0.9297          
         Pos Pred Value : 0.9252          
         Neg Pred Value : 0.9015          
             Prevalence : 0.4927          
         Detection Rate : 0.4412          
   Detection Prevalence : 0.4769          
      Balanced Accuracy : 0.9126          
                                          
       'Positive' Class : Rattus_r3       
                                          

4.2 Visualizing of the new confusion matrix

Code
#|
fourfoldplot(as.table(conf),color=c("yellow","pink"),main = "Confusion Matrix")

4.3 Visualizing the ROC curve (Receiver Operating Characteristic curve)

Code
#|
set.seed(42)

rattus_morpho_loc1 <- rattus_morpho_loc |>
  dplyr::mutate(taxonomyID_bino = dplyr::recode(taxonomyID,
                                                "Rattus_r3" ='0',
                                                "Rattus_tanezumi"='1')) |>
  dplyr::mutate(taxonomyID_bino = as.integer(taxonomyID_bino)) |>
  dplyr::select(- taxonomyID)

train_index <- caret::createDataPartition(rattus_morpho_loc1$taxonomyID_bino, 
                                          p = 0.8, list = FALSE)

rattus_train <- rattus_morpho_loc1[train_index, ]
rattus_test <- rattus_morpho_loc1[-train_index, ]

# Create the logistic regression model
model <- glm(taxonomyID_bino ~ . , data = rattus_train, family = "binomial")

# Make predictions on the test dataset
predicted_probs <- predict(model, rattus_test, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)

# Create the ROC curve
roc_obj <- roc(rattus_test$taxonomyID_bino , predicted_probs)

# Plot the ROC curve
plot(roc_obj, main = "ROC Curve for the Logistic Regression Model")
abline(0, 1, lty = 2, col = "gray")  # Add a reference line for a random classifier

4.4 Calculate the new AUC (Area Under the Curve)

Code
#|
auc_value <- auc(roc_obj)
cat("AUC value:", auc_value, "\n")
AUC value: 0.9854079 

4.5 First conclusion

Incorporating geographic locations (longitude, latitude) of individuals improved the discrimination of the two mitotypes with an accuracy of 0.913 (compared to a value of 0.674 when using morphometric data alone). Confusion matrix and ROC visualizations confirmed the improvement when adding geography.

5 Conclusion

Although the origins and maintenance of the two mitotypes, Rattus tanezumi and Rattus R3, are not yet explained, our results showed that these two mitotypes can coexist in their sympatric distribution through a morphometric change in Rattus tanezumi. Rattus R3 does not seem to be morphologically affected when living in sympatry with Rattus tanezumi.

6 References

Pagès M., Galan M., Bazin E. 2013. “Cytonuclear Discordance Among Southeast Asian Black Rats (Rattus Rattus Complex).” Molecular Ecology 22 (4): 1019–34.