1 Objective

This document presents the strategy and code necessary for the identification of kinematic sperm subpopulations, based on individual sperm data evaluated with a Computer Assisted Sperm Analysis (CASA) system. The strategy and code used are detailed in Rivas et al (2022) paper.

2 Introduction

For the identification of kinematic sperm subpopulations, motility parameters values obtained with a CASA system are typically utilized. These values are calculated for each sperm. The number of motility parameters depends on the CASA system manufacturer; however, commonly evaluated parameters include VCL, VAP, VSL, LIN, STR, BCF, WOB, and ALH.

Various statistical strategies employ the motility parameters values evaluated in a CASA system (Martínez-Pastor et al., 2011; Ramón and Martínez-Pastor, 2018), to identify kinematic sperm subpopulations. It is customary to conduct the identification in two stages. In the first stage the aim is to reduce the dataset’s dimensionality accomplished through a principal component analysis. In the principal component analysis, components with eigenvalues greater than 1 are retained. In the second stage, these principal components serve as input for the clustering algorithms such as Kmeans or hierarchical agglomerative clustering.

In the case of hierarchical agglomerative clustering, the goal is to group data points that are closest to each other and farthest from other groups, using various types of distances, although the most commonly used is Euclidean distance. The number of groups is an important aspect to consider; therefore, the aim is to use an objective criterion for selection, such as inertia gain. In this regard, graphical analysis of inertia gain is highly beneficial.

In this document, we will present the strategy and procedure for identifying kinematic sperm subpopulations from boar sperm data exposed to different pH values (Rivas et al., 2022).

3 Requirements

Libraries: To replicate the procedure presented in this document, it is necessary to have the FactoMineR, and dplyr libraries.

Data: It is required to download the dataset “230819_pluginnuevo3.tab” into your working directory. The dataset is located at https://doi.org/10.7910/DVN/CRVVSI (Rivas et al., 2022).

Next, download the file:

download.file(url = "https://dataverse.harvard.edu/api/access/datafile/4549661", destfile="motility_results_ph.csv")

We will create an object called mr and place the content of the downloaded dataset into it.

mr<-read.csv("motility_results_ph.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)

Th file motility_results_ph.csv contain 11,620 lines of data, but for the purpose of this document, a sample of 500 lines of data will be used.

mr_sample<-sample(mr, 500, replace=FALSE, prob=NULL)

4 Procedure

4.1 Stage 1: Acquisition and creation of objects

The dataset used in this workflow can be downloaded from our account on the Harvard Dataverse site: https://dataverse.harvard.edu, as indicated above, or directly using the following code:

setwd("/Users/andresammx/Documents/RStudio/Markdown/Identificación_de_subpoblaciones")

Download the dataset.

download.file(url = "https://dataverse.harvard.edu/api/access/datafile/4549661", destfile="230819_pluginnuevo.tab")

Read the file:

mr<-read.csv("230819_pluginnuevo3.tab", header=TRUE, sep="\t", stringsAsFactors=TRUE)

Take a sample of 500 rows from the dataset and save them in a new object called mr_sample:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mr_sample<- mr %>% group_by(ID2) %>% slice_sample(n=100, replace=FALSE)

Convert the mr_sample object to a dataframe and verify its structure:

mr_sample<-as.data.frame(mr_sample)
str(mr_sample)
## 'data.frame':    500 obs. of  11 variables:
##  $ ID1      : Factor w/ 75 levels "220219-1","220219-10",..: 69 28 58 46 70 6 7 28 69 58 ...
##  $ ID2      : Factor w/ 5 levels "pH.70","pH.74",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID3      : Factor w/ 6 levels "","balder","berthrand",..: 2 3 2 4 2 6 6 3 2 2 ...
##  $ VCL      : num  265.6 47.2 356.8 239.3 94.3 ...
##  $ VAP      : num  108.6 36.6 143.3 86 65.6 ...
##  $ VSL      : num  68.9 28 57 55.4 41.3 ...
##  $ LIN      : num  0.259 0.592 0.16 0.232 0.438 ...
##  $ STR      : num  0.635 0.764 0.398 0.644 0.631 ...
##  $ WOB      : num  0.409 0.775 0.402 0.359 0.695 ...
##  $ BeatCross: num  30.8 38.7 28.2 38.1 29 ...
##  $ ALH      : num  8.62 1.96 12.42 6.7 4.69 ...

The mr_sample object have 500 rows and 11 columns. Column 1 (ID1) corresponds to the identifier of the capture routine video; column 2 (ID2) corresponds to the treatment, representing different pH values; and column 3 (ID3) corresponds to the identifier of the donor male.

4.2 Stage 2: Principal Component Analysis (PCA)

The necessary libraries are loaded:

library(FactoMineR)

Now, we create a new object called mr_sample_2 containing only the variables (columns) that we will use:

mr_sample_2<-mr_sample[,c(2,4:11)]

An object res.pca is created to contain the results of the principal component analysis function. The arguments for the PCA function indicate that the data should be scaled, three principal components are expected, a qualitative variable with no weight in the results is being entered, and we request the results to be plotted.

res.pca = PCA(mr_sample_2, scale.unit=TRUE, ncp=3, quali.sup=1, graph=TRUE)

The PCA function generates two plots. In the PCA graph of individuals, the values of the first and second principal components are plotted, and the corresponding line are observed next to each point. The PCA graph of variables shows a correlation circle, where each arrow indicates the magnitude and sign of the correlation of a variable with respect to the first and second principal components. For example, we observe that VCL has a high positive correlation with the first principal component and a low correlation with the second principal component.

res.pca is a list-type object, and its content can be accessed using the $ operator. Typing res.pca in the console will first display the objects that make up the list and then show the description of each object:

res.pca
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 500 individuals, described by 9 variables
## *The results are available in the following objects:
## 
##    name                description                                          
## 1  "$eig"              "eigenvalues"                                        
## 2  "$var"              "results for the variables"                          
## 3  "$var$coord"        "coord. for the variables"                           
## 4  "$var$cor"          "correlations variables - dimensions"                
## 5  "$var$cos2"         "cos2 for the variables"                             
## 6  "$var$contrib"      "contributions of the variables"                     
## 7  "$ind"              "results for the individuals"                        
## 8  "$ind$coord"        "coord. for the individuals"                         
## 9  "$ind$cos2"         "cos2 for the individuals"                           
## 10 "$ind$contrib"      "contributions of the individuals"                   
## 11 "$quali.sup"        "results for the supplementary categorical variables"
## 12 "$quali.sup$coord"  "coord. for the supplementary categories"            
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"             
## 14 "$call"             "summary statistics"                                 
## 15 "$call$centre"      "mean of the variables"                              
## 16 "$call$ecart.type"  "standard error of the variables"                    
## 17 "$call$row.w"       "weights for the individuals"                        
## 18 "$call$col.w"       "weights for the variables"

The eigenvalues and the cumulative variance values presented in Rivas et al., (2022) are found in the res.pca$eig object:

res.pca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.96582357             49.5727946                          49.57279
## comp 2 2.39521837             29.9402297                          79.51302
## comp 3 0.99221034             12.4026292                          91.91565
## comp 4 0.42550528              5.3188160                          97.23447
## comp 5 0.12786974              1.5983717                          98.83284
## comp 6 0.04754866              0.5943582                          99.42720
## comp 7 0.03306028              0.4132535                          99.84045
## comp 8 0.01276377              0.1595471                         100.00000

According to the Kaiser criterion, we choose the principal components with eigenvalues greater than 1.

The values of the coordinates of each principal component can be saved for the purpose of generating clean and customized graphics:

dind<-as.data.frame(res.pca$ind$coord)
names(dind)
## [1] "Dim.1" "Dim.2" "Dim.3"

And those columns can be appended to the mr_sample_2, for example:

tab2<-cbind(mr_sample_2$ID2,dind)
head(tab2)
##   mr_sample_2$ID2        Dim.1       Dim.2      Dim.3
## 1           pH.70  1.155339564  0.06880405  0.8731306
## 2           pH.70 -2.785419405  2.61534418 -1.2313661
## 3           pH.70  2.671369221 -1.14843568  0.2633520
## 4           pH.70 -0.006226783 -0.46870882  1.4537604
## 5           pH.70 -0.884580976  1.51072808 -1.5240606
## 6           pH.70  0.112930932  1.56553832 -0.4460328

Prepare tab2 to generate the custom graph:

tab2<-as.data.frame(tab2)
names(tab2)
## [1] "mr_sample_2$ID2" "Dim.1"           "Dim.2"           "Dim.3"
head(tab2)
##   mr_sample_2$ID2        Dim.1       Dim.2      Dim.3
## 1           pH.70  1.155339564  0.06880405  0.8731306
## 2           pH.70 -2.785419405  2.61534418 -1.2313661
## 3           pH.70  2.671369221 -1.14843568  0.2633520
## 4           pH.70 -0.006226783 -0.46870882  1.4537604
## 5           pH.70 -0.884580976  1.51072808 -1.5240606
## 6           pH.70  0.112930932  1.56553832 -0.4460328
colnames(tab2)[1]<-"pH"
names(tab2)
## [1] "pH"    "Dim.1" "Dim.2" "Dim.3"

Construct the graph:

plot(tab2$Dim.1, tab2$Dim.2, col=tab2$pH, xlab = "PC1", ylab = "PC2", xlim = c(-4,8), ylim = c(-4,8))
legend("topright", legend = levels(tab2$pH), col = 1:length(levels(tab2$pH)), pch = 1, cex = 1.2)

4.3 Stage 3: Hierarchical clustering on principal components

Once the principal component analysis has been performed, the res.pca object will serve as input for the HCPC function; a new object called res.hcpc is created to store the results of HCPC.

res.hcpc <- HCPC(res.pca, nb.clust=-1, conso=0, min=3, max=10, metric="euclidean", method="ward")

The HCPC function generates three plots. The first one is the “Hierarchical clustering”, which has a horizontal line suggesting the cutoff point for group formation in the dendrogram. This graph includes an inset with a bar chart indicating the inertia gain. Once you click on the bar or at some other level on the graph, two additional plots are generated (for the purposes of this document, the nb.clustargument was set to -1 for automatic cutoff, but setting it to 0 allows the user to choose the cutoff height). The “Hierarchical clustering on the factor map” grpah is a three-dimensional representation of the first and second principal components on the x and y axes, respectively, while the dendrogram is presented bidimensionally in the z axis. The “Factor map” graph displays the values of the first and second principal components, now colored according to the formed groups or clusters. These groups represent the kinematic subpopulations.

The res.hcpc object is a list-type object, and its contents can be accessed using the $operator. Typing res.hcpc in the console will first display the objects that make up list and then show the the description of each object:

res.hcpc
## **Results for the Hierarchical Clustering on Principal Components**
##    name                   
## 1  "$data.clust"          
## 2  "$desc.var"            
## 3  "$desc.var$quanti.var" 
## 4  "$desc.var$quanti"     
## 5  "$desc.var$test.chi2"  
## 6  "$desc.axes$category"  
## 7  "$desc.axes"           
## 8  "$desc.axes$quanti.var"
## 9  "$desc.axes$quanti"    
## 10 "$desc.ind"            
## 11 "$desc.ind$para"       
## 12 "$desc.ind$dist"       
## 13 "$call"                
## 14 "$call$t"              
##    description                                              
## 1  "dataset with the cluster of the individuals"            
## 2  "description of the clusters by the variables"           
## 3  "description of the cluster var. by the continuous var." 
## 4  "description of the clusters by the continuous var."     
## 5  "description of the cluster var. by the categorical var."
## 6  "description of the clusters by the categories."         
## 7  "description of the clusters by the dimensions"          
## 8  "description of the cluster var. by the axes"            
## 9  "description of the clusters by the axes"                
## 10 "description of the clusters by the individuals"         
## 11 "parangons of each clusters"                             
## 12 "specific individuals"                                   
## 13 "summary statistics"                                     
## 14 "description of the tree"

The res.hcpc$data.clust object contain data of rm_sample_2, and at the end, it has a column named clust, which indicates the cluster to which each data line belongs:

head(res.hcpc$data.clust)
##     ID2       VCL       VAP      VSL      LIN      STR      WOB BeatCross
## 1 pH.70 265.64667 108.57502 68.93514 0.259499 0.634908 0.408720  30.79646
## 2 pH.70  47.19655  36.58363 27.96065 0.592430 0.764294 0.775134  38.72727
## 3 pH.70 356.77786 143.29716 56.96533 0.159666 0.397533 0.401643  28.17391
## 4 pH.70 239.27805  86.00893 55.40223 0.231539 0.644145 0.359452  38.08696
## 5 pH.70  94.31265  65.55947 41.34975 0.438433 0.630721 0.695129  28.96552
## 6 pH.70 165.31839  97.35522 67.48124 0.408190 0.693145 0.588895  28.44828
##         ALH clust
## 1  8.619123     3
## 2  1.959340     1
## 3 12.417836     3
## 4  6.697638     3
## 5  4.690355     3
## 6  5.007826     3

This object is crucial as it allows us to analyze the dataset based on treatments and clusters. The content of res.hcpc$data.clust can be stored in a new object for subsequent analysis:

mr_sub<-res.hcpc$data.clust

The res.hcpc$desc.axes object contain the values of each cluster based on quantitative variables (motility parameters).

4.4 Stage 4: Subpopulations in the dataset

One of the objectives of identifying kinematic sperm subpopulations is to describe the effect of experimental treatments on the structure of these subpopulations, starting with understanding the proportion of sperm from each treatment in each subpopulation. Similarly, it is relevant to describe the effect of each treatment on each cluster. This aspect will be addressed in another Rpubs document.

5 Conclusions

With the proposed workflow in this document, it is possible to handle a file of motility parameters, obtained with a CASA system, to identify kinematic subpopulations in two steps. The result is an object that contains the same input data plus a column indicating the cluster to which each data line belongs. Handling the resulting object will allow for identyfying the proportion of sperm in each subpopulation and the effect of each treatment on each subpopulation.

References

Martínez-Pastor F, Tizado EJ, Garde JJ, Anel L and Paz P de (2011) Statistical Series: Opportunities and challenges of sperm motility subpopulation analysis. Theriogenology 75 783–795.
Ramón M and Martínez-Pastor F (2018) Implementation of novel statistical procedures and other advanced approaches to improve analysis of CASA data. Reproduction, Fertility, and Development 30 860–866.
Rivas AC, Ayala EME and Aragon MA (2022) Effect of various pH levels on the sperm kinematic parameters of boars. South African Journal of Animal Science 52 693–704.