Code
load("data_rattus.RData")
Rattus tanezumi and Rattus R3 are two mitotypes, which are delimited according to their mitochondrial genotypes (Pagès M. 2013). Here, we studied the geographical distribution and morphological disparity of the two mitotypes. We hypothesized that the two mitotypes Rattus tanezumi and Rattus R3 should differ morphologically. For this, we used linear discriminant analysis to evaluate the morphological discrimination of individuals previously identified by genetic barcoding.
We gathered data from last barcoding sequences of Rattus rattus complex, and merged them to the CERoPath database, selecting only previously barcoded individuals (Rattus tanezumi and Rattus R3)
load("data_rattus.RData")
The visualisation of the data is done with the library leaflet .
library(leaflet)
<- leaflet() %>%
map addTiles() %>%
addCircleMarkers(data = data_rattus |>
::filter(taxonomyID == "Rattus_r3"),
dplyr~longitude, ~latitude,
label = ~ as.character(taxonomyID),
#popup = popupTable(scrub_villages_total |>
# dplyr::select(nameE,n_women,n_men)),
fillColor = "blue", weight = 0.2,
group = "Rattus R3",
stroke = FALSE, fillOpacity = 0.4) %>%
addCircleMarkers(data = data_rattus |>
::filter(taxonomyID == "Rattus_tanezumi"),
dplyr~longitude, ~latitude,
label = ~ as.character(taxonomyID),
#popup = popupTable(scrub_villages_total |>
# dplyr::select(nameE,n_women,n_men)),
fillColor = "red", weight = 0.2,
group = "Rattus tanezumi",
stroke = FALSE, fillOpacity = 0.6) %>%
addProviderTiles(providers$Esri.WorldStreetMap,group ="WSM") %>%
addTiles(group = "OSM") %>%
addTiles(urlTemplate = "https://mts1.google.com/vt/lyrs=s&hl=en&src=app&x={x}&y={y}&z={z}&s=G",
attribution = 'Google',group ="Google") %>%
addLayersControl(baseGroups = c("WSM","OSM","Google"),
overlayGroups = c("Rattus tanezumi",
"Rattus R3"),
options = layersControlOptions(collapsed = FALSE)) %>%
addScaleBar(position = "bottomright",
scaleBarOptions(maxWidth = 100, metric = TRUE, imperial = FALSE,
updateWhenIdle = TRUE))
map
Principal component analysis (PCA) on morphometrical variables of individuals was performed using the library factoextra allowing extracting the projection of each individual and the components (axes) of the analysis.
library(factoextra)
<- data_rattus |>
rattus_measurements ::select(id_indiv,taxonomyID,bodyWeight:headMeasurement) |>
dplyr::drop_na()
tidyr
3:8] <- scale(rattus_measurements[,3:8], center = TRUE, scale = TRUE)
rattus_measurements[,
# Compute PCAsg
<- prcomp(rattus_measurements[,3:8])
res.pca
<- res.pca$x
data_pca
<- fviz_eig(res.pca)
fig_eigen
$taxonomyID <- factor(rattus_measurements$taxonomyID,
rattus_measurementslevels=c("Rattus_tanezumi", "Rattus_r3"))
<- as.factor(rattus_measurements$taxonomyID)
groups
<- fviz_pca_biplot(res.pca, repel = TRUE,
fig_pca col.var = "black",
alpha.var = 1,
col.ind = groups, # color by groups
palette = c("#2E9FDF", "#FC4E07"),
addEllipses = TRUE, # Concentration ellipses
ellipse.type = "confidence",
geom.ind = c("point"),
legend.title = "Groups")
library(patchwork)
| fig_eigen) +
(fig_pca plot_annotation(
title = 'Results of PCA on morphometric variables',
tag_levels = 'A')
Linear Discriminant Analysis (LDA) on morphometrical variables of individuals was performed using the library MASS to find combination of the variables that give best possible separation between the two mitotypes.
library(MASS)
<- data_rattus |>
rattus_measurements ::filter(sex %in% c("male","female"))|>
dplyr::select(taxonomyID,bodyWeight:headMeasurement) |>
dplyr::drop_na()
tidyr
2:7] <- scale(rattus_measurements[,2:7], center = TRUE, scale = TRUE)
rattus_measurements[,
= lda (factor(taxonomyID) ~
lda.model + headBodyMeasurement + tailMeasurement +
bodyWeight + earMeasurement + headMeasurement,
hindfootMeasurement data = rattus_measurements)
# Panels of histograms and overlayed density plots
plot(lda.model, dimen=1, type="both") # fit from lda
To check how well the model predicts our training data, we can construct a confusion matrix of the fitted vs observed Taxa using the table function.
#|
<- predict(lda.model)
lda_prediction <- table(list(predicted=lda_prediction$class, observed= rattus_measurements$taxonomyID))
conf
::confusionMatrix(conf) caret
Confusion Matrix and Statistics
observed
predicted Rattus_r3 Rattus_tanezumi
Rattus_r3 226 100
Rattus_tanezumi 147 284
Accuracy : 0.6737
95% CI : (0.639, 0.707)
No Information Rate : 0.5073
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3461
Mcnemar's Test P-Value : 0.003423
Sensitivity : 0.6059
Specificity : 0.7396
Pos Pred Value : 0.6933
Neg Pred Value : 0.6589
Prevalence : 0.4927
Detection Rate : 0.2985
Detection Prevalence : 0.4306
Balanced Accuracy : 0.6727
'Positive' Class : Rattus_r3
Statistics
Accuracy:the proportion of correct predictions (both true positives and true negatives) among the total number of cases. The 95% CI (confidence interval) for the accuracy means we can be 95% confident that the true accuracy lies within this range.
No Information Rate (NIR): the accuracy that could be obtained by always predicting the majority class (class 0 in this case).
P-Value [Acc > NIR]: the p-value for a statistical test comparing the accuracy of the model to the NIR. A small p-value (typically less than 0.05) indicates that the model’s accuracy is significantly better than the NIR.
Kappa: a metric that considers both the true positive rate and the false positive rate, providing a more balanced assessment of the model’s performance. Kappa ranges from -1 to 1, with 0 indicating no better than random chance, and 1 indicating perfect agreement between predictions and true values.
Mcnemar’s Test P-Value: the p-value for a statistical test comparing the number of false positives and false negatives. A large p-value (typically greater than 0.05) indicates that there is no significant difference between the number of false positives and false negatives.
#|
fourfoldplot(as.table(conf),color=c("yellow","pink"),main = "Confusion Matrix")
A receiver operating characteristic curve (or ROC curve) is a graphical plot that illustrates the performance of a binary classifier model. The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting. It needs to split the dataset used for the LDA into training (80%) and testing (20%) sets in order to create predicted assignation using a GLM (general linear modelling). The library pROC is used to plot the ROC.
#|
set.seed(42)
<-rattus_measurements |>
rattus_measurements1 ::mutate(taxonomyID_bino = dplyr::recode(taxonomyID,
dplyr"Rattus_r3" ='0',
"Rattus_tanezumi"='1')) |>
::mutate(taxonomyID_bino = as.integer(taxonomyID_bino)) |>
dplyr::select(- taxonomyID)
dplyr
<- caret::createDataPartition(rattus_measurements1$taxonomyID_bino, p = 0.8, list = FALSE)
train_index <- rattus_measurements1[train_index, ]
rattus_train <- rattus_measurements1[-train_index, ]
rattus_test
# Create the logistic regression model
<- glm(taxonomyID_bino ~ . , data = rattus_train, family = "binomial")
model
# Make predictions on the test dataset
<- predict(model, rattus_test, type = "response")
predicted_probs <- ifelse(predicted_probs > 0.5, 1, 0)
predicted_classes
# Load the pROC package
library(pROC)
# Create the ROC curve
<- roc(rattus_test$taxonomyID_bino , predicted_probs)
roc_obj
# Plot the ROC curve
plot(roc_obj, main = "ROC Curve for the Logistic Regression Model")
abline(0, 1, lty = 2, col = "gray") # Add a reference line for a random classifier
The AUC (Area Under Curve) of the ROC curve determine the specificity and sensitivity of the model. The closer the AUC value is to the 1, the better the given model fits the data.
#|
<- auc(roc_obj)
auc_value cat("AUC value:", auc_value, "\n")
AUC value: 0.6740506
Linear Discriminant Analysis (LDA) is now used on morphometrical variables and geographic locations (longitude, latitude) of individuals.
<- data_rattus |>
rattus_morpho_loc ::select(taxonomyID, bodyWeight:latitude) |>
dplyr::drop_na()
tidyr
2:9] <- scale(rattus_morpho_loc[,2:9], center = TRUE, scale = TRUE)
rattus_morpho_loc[,
= lda (factor(taxonomyID) ~
lda.model + headBodyMeasurement + tailMeasurement +
bodyWeight + earMeasurement + headMeasurement +
hindfootMeasurement + latitude,
longitude data = rattus_morpho_loc)
# Panels of histograms and overlayed density plots
plot(lda.model, dimen=1, type="both") # fit from lda
#|
<- predict(lda.model)
lda_prediction <- table(list(predicted=lda_prediction$class, observed= rattus_measurements$taxonomyID))
conf
::confusionMatrix(conf) caret
Confusion Matrix and Statistics
observed
predicted Rattus_r3 Rattus_tanezumi
Rattus_r3 334 27
Rattus_tanezumi 39 357
Accuracy : 0.9128
95% CI : (0.8904, 0.9319)
No Information Rate : 0.5073
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8255
Mcnemar's Test P-Value : 0.1757
Sensitivity : 0.8954
Specificity : 0.9297
Pos Pred Value : 0.9252
Neg Pred Value : 0.9015
Prevalence : 0.4927
Detection Rate : 0.4412
Detection Prevalence : 0.4769
Balanced Accuracy : 0.9126
'Positive' Class : Rattus_r3
#|
fourfoldplot(as.table(conf),color=c("yellow","pink"),main = "Confusion Matrix")
#|
set.seed(42)
<- rattus_morpho_loc |>
rattus_morpho_loc1 ::mutate(taxonomyID_bino = dplyr::recode(taxonomyID,
dplyr"Rattus_r3" ='0',
"Rattus_tanezumi"='1')) |>
::mutate(taxonomyID_bino = as.integer(taxonomyID_bino)) |>
dplyr::select(- taxonomyID)
dplyr
<- caret::createDataPartition(rattus_morpho_loc1$taxonomyID_bino,
train_index p = 0.8, list = FALSE)
<- rattus_morpho_loc1[train_index, ]
rattus_train <- rattus_morpho_loc1[-train_index, ]
rattus_test
# Create the logistic regression model
<- glm(taxonomyID_bino ~ . , data = rattus_train, family = "binomial")
model
# Make predictions on the test dataset
<- predict(model, rattus_test, type = "response")
predicted_probs <- ifelse(predicted_probs > 0.5, 1, 0)
predicted_classes
# Create the ROC curve
<- roc(rattus_test$taxonomyID_bino , predicted_probs)
roc_obj
# Plot the ROC curve
plot(roc_obj, main = "ROC Curve for the Logistic Regression Model")
abline(0, 1, lty = 2, col = "gray") # Add a reference line for a random classifier
#|
<- auc(roc_obj)
auc_value cat("AUC value:", auc_value, "\n")
AUC value: 0.9854079
Incorporating geographic locations (longitude, latitude) of individuals improved the discrimination of the two mitotypes with an accuracy of 0.913 (compared to a value of 0.674 when using morphometric data alone). Confusion matrix and ROC visualizations confirmed the improvement when adding geography.
Although the origins and maintenance of the two mitotypes, Rattus tanezumi and Rattus R3, are not yet explained, our results showed that these two mitotypes can coexist in their sympatric distribution through a morphometric change in Rattus tanezumi. Rattus R3 does not seem to be morphologically affected when living in sympatry with Rattus tanezumi.