1 Objective

This document presents a workflow for obtaining a file of sperm trajectory coordinates, formatted as input for the Jupyter notebook that builds individual images of sperm trajectories (traj-ah-6-full-0.ipynb, available at 10.7910/DVN/CBMKVA). These images are the input for the hierarchical clustering machine learning algorithm implemented by Rodríguez-Martínez et al. (2023).

2 Introduction

Some Computer Assisted Sperm Analysis (CASA) systems allow for the retrieval of data for each analyzed sperm in a capture routine. By capture routine, we mean the recording of a video sequence for a defined period (usually one or two seconds). Among the data that can be recovered are the traditional kinematic or motility parameters (VCL, VAP, VSL, LIN, STR, BCF, and ALH), although others can be obtained depending on the manufacturer and the version of the software used by the CASA system.

The motility parameters are calculated internally by the CASA system software, and the calculation algorithms may vary (Amann and Waberski, 2014). However, the input data for the algorithms that calculate the velocities (VCL, VAP, and VSL) are based on tracking the sperm head. Tracking is performed by processing each of the images that compose the video sequence, and it yields the location (coordinates) of the sperm head over the capture time. With some equipment, the tracking coordinates can be recovered in addition to the motility parameters.

A notable case of software for CASA systems is the CASA plugin (Wilson-Leedy and Ingermann, 2007) for ImageJ (Rasband, 1997). This plugin has undergone some modifications, allowing its adaptation for the evaluation of sperm from different species (Giaretta et al., 2017; Rivas et al., 2022). With this software, it is possible to retrieve the coordinates of the analyzed sperm.

Given that motility parameters are calculated from coordinates, these parameters represent a condensed measure of the kinematic behavior of sperm. Consequently, there is a wealth of information within the coordinates, enabling the reconstruction not only of motility parameters but also of the trajectory followed by each analyzed sperm. It is crucial to note that these trajectories cannot be reconstructed solely from motility parameters; hence, we emphasize that each trajectory possesses an associated set of motility parameters (Rodríguez-Martínez et al., 2023). Despite the underlying information embedded in the coordinates, current approaches to identify kinematic subpopulations in a dataset have exclusively utilized motility parameters (Ramón and Martínez-Pastor, 2018).
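To illustrate how coordinates condense into motility parameters, the following minimal sketch computes VCL, VSL, and LIN for one toy trajectory. It assumes the textbook definitions (path length over time, straight-line distance over time, and their ratio expressed as a percentage). The coordinate values are invented, and the internal algorithms of a given CASA system, including SMAS, may differ.

```r
# Toy head positions for one sperm (invented values, arbitrary length units)
x <- c(0, 3, 6, 6, 10)
y <- c(0, 4, 8, 13, 16)
fps <- 50  # capture speed reported for the data in this document

# Frame-to-frame displacements and elapsed time in seconds
step_len <- sqrt(diff(x)^2 + diff(y)^2)
duration <- (length(x) - 1) / fps

# Curvilinear velocity: total path length divided by elapsed time
vcl <- sum(step_len) / duration

# Straight-line velocity: distance from first to last point divided by time
n <- length(x)
vsl <- sqrt((x[n] - x[1])^2 + (y[n] - y[1])^2) / duration

# Linearity: ratio of VSL to VCL, as a percentage
lin <- 100 * vsl / vcl

round(c(VCL = vcl, VSL = vsl, LIN = lin), 2)
```

Many different trajectories can share the same three values, which is why the trajectory cannot be recovered from the parameters alone.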

Coordinates can be used to reconstruct the image of each trajectory followed by the sperm analyzed in a CASA system. These images can then be used as input for machine learning algorithms that cluster the images, followed by statistically describing those groups (subpopulations) based on the associated motility parameters.

In this document, we describe how we constructed a coordinate dataset that serves as input for machine learning algorithms, particularly the one implemented by Rodríguez-Martínez et al. (2023). The data correspond to coordinates of hamster sperm analyzed in a CASA system (SMAS Version 3.18); the capture speed was set at 50 fps for one second (Fujinoki M, personal communication). After each capture routine, a pair of files was obtained from the SMAS system: the first file contains the motility parameters, and the second contains the coordinates of the detected sperm.

The procedure consists of three stages: Acquisition and initial adjustments, adding identifiers, and constructing the final object. We save the files with the “ods” extension (LibreOffice Calc application) and use the readODS library to read them.

2.1 Requirements

Two files are required. The first one should be the file with the coordinates of the trajectories (see Figure 1); the second file should contain the IDs of the analyzed sperm (see Figure 2). The second file can be generated from the file containing the motility parameters (see Figure 3). All necessary files can be downloaded from https://doi.org/10.7910/DVN/JAN8RE. In Figure 1 you can observe the data structure in the original ods file. It can be seen that the first column contains the sperm identifiers and the respective coordinate (x or y); the number corresponds to the identifier, and the x or y in parentheses indicates the value of the respective coordinate.

Figure 1. Screenshot of a LibreOffice Calc spreadsheet with coordinate data. The file traj_data_test.ods contains 20 rows and 151 columns. Each row corresponds to one coordinate (x or y) of one sperm. The first column contains the sperm identifier, where the number is the identifier and the letter in parentheses indicates the coordinate (x or y). The remaining 150 columns contain the coordinate value of each sperm in each frame analyzed by the CASA system. Some rows contain zeros instead of coordinate values; this is because some sperm may not have been detected throughout the capture routine.

Figure 2. Screenshot of a LibreOffice Calc spreadsheet with sperm identifier data. The file has 11 rows: the first row contains the column names, and the remaining 10 rows contain identifier data. The column names correspond to two different identifiers, called ID1 and ID5.

Figure 3. Screenshot of a LibreOffice Calc spreadsheet of the mr_data_test.ods file. This file contains motility parameters data. The file comprises 11 rows; the first row contains the column names (motility parameters), and the subsequent 10 rows contain data for various evaluated motility parameters. The first four columns correspond to identifiers of the analyzed sperm.

3 Procedure

All files used in this workflow can be downloaded from our account on the Harvard Dataverse site https://dataverse.harvard.edu, as indicated above, or directly using the following code:

download.file(url = "https://dataverse.harvard.edu/api/access/datafile/8076976", destfile="traj_data_test.ods")
download.file(url = "https://dataverse.harvard.edu/api/access/datafile/8076975", destfile="ID5_data_test.ods")
download.file(url = "https://dataverse.harvard.edu/api/access/datafile/8076974", destfile="mr_data_test.ods")

3.1 Stage 1: Acquisition and Initial Adjustments

Load the necessary library for reading ods files:

library(readODS)

Set the working directory:

setwd("/Users/andresammx/Documents/RStudio/Markdown/Coordenadas_espermaticas_formato")

An object named coord is created, and the content of the working file containing the coordinates is loaded into this object. The source file is read using the read_ods command. The argument col_names=FALSE indicates that the first line of the file should not be interpreted as column names. The argument as_tibble=FALSE indicates the desire to obtain an object with a dataframe structure, and not of the tibble type.

coord<-read_ods("traj_data_test.ods", col_names=FALSE, as_tibble=FALSE)

## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
## • `` -> `...28`
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
## • `` -> `...32`
## • `` -> `...33`
## • `` -> `...34`
## • `` -> `...35`
## • `` -> `...36`
## • `` -> `...37`
## • `` -> `...38`
## • `` -> `...39`
## • `` -> `...40`
## • `` -> `...41`
## • `` -> `...42`
## • `` -> `...43`
## • `` -> `...44`
## • `` -> `...45`
## • `` -> `...46`
## • `` -> `...47`
## • `` -> `...48`
## • `` -> `...49`
## • `` -> `...50`
## • `` -> `...51`
## • `` -> `...52`
## • `` -> `...53`
## • `` -> `...54`
## • `` -> `...55`
## • `` -> `...56`
## • `` -> `...57`
## • `` -> `...58`
## • `` -> `...59`
## • `` -> `...60`
## • `` -> `...61`
## • `` -> `...62`
## • `` -> `...63`
## • `` -> `...64`
## • `` -> `...65`
## • `` -> `...66`
## • `` -> `...67`
## • `` -> `...68`
## • `` -> `...69`
## • `` -> `...70`
## • `` -> `...71`
## • `` -> `...72`
## • `` -> `...73`
## • `` -> `...74`
## • `` -> `...75`
## • `` -> `...76`
## • `` -> `...77`
## • `` -> `...78`
## • `` -> `...79`
## • `` -> `...80`
## • `` -> `...81`
## • `` -> `...82`
## • `` -> `...83`
## • `` -> `...84`
## • `` -> `...85`
## • `` -> `...86`
## • `` -> `...87`
## • `` -> `...88`
## • `` -> `...89`
## • `` -> `...90`
## • `` -> `...91`
## • `` -> `...92`
## • `` -> `...93`
## • `` -> `...94`
## • `` -> `...95`
## • `` -> `...96`
## • `` -> `...97`
## • `` -> `...98`
## • `` -> `...99`
## • `` -> `...100`
## • `` -> `...101`
## • `` -> `...102`
## • `` -> `...103`
## • `` -> `...104`
## • `` -> `...105`
## • `` -> `...106`
## • `` -> `...107`
## • `` -> `...108`
## • `` -> `...109`
## • `` -> `...110`
## • `` -> `...111`
## • `` -> `...112`
## • `` -> `...113`
## • `` -> `...114`
## • `` -> `...115`
## • `` -> `...116`
## • `` -> `...117`
## • `` -> `...118`
## • `` -> `...119`
## • `` -> `...120`
## • `` -> `...121`
## • `` -> `...122`
## • `` -> `...123`
## • `` -> `...124`
## • `` -> `...125`
## • `` -> `...126`
## • `` -> `...127`
## • `` -> `...128`
## • `` -> `...129`
## • `` -> `...130`
## • `` -> `...131`
## • `` -> `...132`
## • `` -> `...133`
## • `` -> `...134`
## • `` -> `...135`
## • `` -> `...136`
## • `` -> `...137`
## • `` -> `...138`
## • `` -> `...139`
## • `` -> `...140`
## • `` -> `...141`
## • `` -> `...142`
## • `` -> `...143`
## • `` -> `...144`
## • `` -> `...145`
## • `` -> `...146`
## • `` -> `...147`
## • `` -> `...148`
## • `` -> `...149`
## • `` -> `...150`
## • `` -> `...151`

The output obtained upon executing the function indicates that, because the first line was not read as column names, each column has been assigned a positional name (...1 to ...151).

Now, let’s review the structure of the data in the coord object:

str(coord)

## 'data.frame':    20 obs. of  151 variables:
##  $ ...1  : chr  "2(x)" "2(y)" "3(x)" "3(y)" ...
##  $ ...2  : num  1054.1 59.1 0 0 556.9 ...
##  $ ...3  : num  1054 59 0 0 557 ...
##  $ ...4  : num  1054.1 58.9 0 0 557 ...
##  $ ...5  : num  1053.9 58.8 0 0 557 ...
##  $ ...6  : num  1053.9 58.8 0 0 557.1 ...
##  $ ...7  : num  1053.8 58.7 0 0 557 ...
##  $ ...8  : num  1053.5 58.7 0 0 556.8 ...
##  $ ...9  : num  1053.7 58.7 0 0 556.9 ...
##  $ ...10 : num  1053.6 58.5 0 0 557 ...
##  $ ...11 : num  1053.5 58.4 0 0 556.9 ...
##  $ ...12 : num  1053.5 58.3 0 0 557 ...
##  $ ...13 : num  1053.5 58.4 0 0 557 ...
##  $ ...14 : num  1053.5 58.2 0 0 556.8 ...
##  $ ...15 : num  1053.5 58.2 1821 76.7 556.7 ...
##  $ ...16 : num  1053.5 58.4 1815.1 82.9 556.6 ...
##  $ ...17 : num  1053.6 58.4 1806.9 86.4 556.6 ...
##  $ ...18 : num  1053.6 58.1 1798.5 82.3 556.8 ...
##  $ ...19 : num  1053.7 58.1 1793.4 79.3 556.6 ...
##  $ ...20 : num  1053.5 58 1789.5 76.6 556.6 ...
##  $ ...21 : num  1053.3 57.8 1787.1 74.7 556.6 ...
##  $ ...22 : num  1053.3 57.8 1787.2 82.1 556.5 ...
##  $ ...23 : num  1053.3 57.6 1793.8 82.6 556.6 ...
##  $ ...24 : num  1053.3 57.6 1800.7 79.9 556.7 ...
##  $ ...25 : num  1053 57.4 1805.8 76.5 556.5 ...
##  $ ...26 : num  1052.9 57.3 1807.2 72 556.4 ...
##  $ ...27 : num  1053.2 57.2 1807.7 67.4 556.5 ...
##  $ ...28 : num  1053.1 57.1 1807.5 63.1 556.7 ...
##  $ ...29 : num  1053.3 56.9 1806.7 59.6 556.6 ...
##  $ ...30 : num  1053 57 1807 56 557 ...
##  $ ...31 : num  1053.2 56.8 1807.8 54.2 556.6 ...
##  $ ...32 : num  1053.3 56.7 1807.1 54.2 556.5 ...
##  $ ...33 : num  1053 56.5 1807.2 52.3 556.5 ...
##  $ ...34 : num  1053.1 56.4 1811.1 42.5 556.5 ...
##  $ ...35 : num  1053.3 56.1 1822 56.3 556.6 ...
##  $ ...36 : num  1053.3 56.2 1832.9 70 556.5 ...
##  $ ...37 : num  1053.2 56.2 1843.8 83.8 556.4 ...
##  $ ...38 : num  1053.2 56 1835.8 89.5 556.4 ...
##  $ ...39 : num  1053.1 55.9 1830.9 93.3 556.4 ...
##  $ ...40 : num  1052.9 55.9 1822.9 94.7 556.4 ...
##  $ ...41 : num  1052.9 55.8 1815.2 93.8 556.4 ...
##  $ ...42 : num  1052.9 55.8 1809.9 92.3 556.4 ...
##  $ ...43 : num  1052.8 55.7 1808.7 91.9 556.4 ...
##  $ ...44 : num  1052.4 55.7 1815.9 93 556.4 ...
##  $ ...45 : num  1052.4 55.6 1823.7 92.9 556.3 ...
##  $ ...46 : num  1052.4 55.5 1828.1 90 556.5 ...
##  $ ...47 : num  1052.4 55.5 1830 86 556.4 ...
##  $ ...48 : num  1052.5 55.2 1830.1 81.8 556.4 ...
##  $ ...49 : num  1052.5 55 1829.9 77.1 556.5 ...
##  $ ...50 : num  1052.3 55 1830 72.8 556.5 ...
##  $ ...51 : num  1052.3 54.9 1828.6 70.4 556.5 ...
##  $ ...52 : num  1052.4 54.7 1827.7 68.7 556.4 ...
##  $ ...53 : num  1052.3 54.7 1827.5 68.7 556.4 ...
##  $ ...54 : num  1052.1 54.7 1827 66.1 556.3 ...
##  $ ...55 : num  1052.1 54.7 1828.4 57.1 556.4 ...
##  $ ...56 : num  1051.9 54.6 1836.3 66.6 556.4 ...
##  $ ...57 : num  1051.9 54.3 1844.1 76.2 556.2 ...
##  $ ...58 : num  1052.1 54.2 1852 85.8 556.4 ...
##  $ ...59 : num  1051.9 54.1 1859.9 95.4 556.2 ...
##  $ ...60 : num  1052 54.1 1855.6 100.8 556.2 ...
##  $ ...61 : num  1052 53.9 1849.1 103.1 556.1 ...
##  $ ...62 : num  1052 53.8 1839.2 102.1 556.2 ...
##  $ ...63 : num  1052 53.7 1831.6 100.5 556 ...
##  $ ...64 : num  1052 53.8 1829.4 100.6 556.1 ...
##  $ ...65 : num  1051.9 53.8 1833.8 102.7 556 ...
##  $ ...66 : num  1052 53.8 1842.8 103.5 556.1 ...
##  $ ...67 : num  1051.7 53.6 1848.3 100.6 556.2 ...
##  $ ...68 : num  1051.5 53.4 1851.2 96.3 556.3 ...
##  $ ...69 : num  1051.7 53.4 1851.6 92 556.2 ...
##  $ ...70 : num  1051.6 53.6 1849.9 87.7 556.1 ...
##  $ ...71 : num  1051.5 53.2 1847.8 85 556.1 ...
##  $ ...72 : num  1051.7 53.1 1846.1 81.8 556 ...
##  $ ...73 : num  1051.7 53 1844.2 79.1 555.8 ...
##  $ ...74 : num  1051.9 53 1843.1 76.6 555.8 ...
##  $ ...75 : num  1052 53 1844.3 74.2 555.9 ...
##  $ ...76 : num  1051.9 53.1 1843.9 64.3 555.7 ...
##  $ ...77 : num  1051.8 53 1853.2 73.9 555.9 ...
##  $ ...78 : num  1051.9 52.9 1862.6 83.5 555.9 ...
##  $ ...79 : num  1051.9 52.8 1871.9 93.1 555.9 ...
##  $ ...80 : num  1051.9 52.8 1881.3 102.7 555.9 ...
##  $ ...81 : num  1051.9 52.8 1873 105.4 555.8 ...
##  $ ...82 : num  1051.9 52.8 1865.4 107.8 555.8 ...
##  $ ...83 : num  1051.9 52.8 1855.8 106.4 555.8 ...
##  $ ...84 : num  1051.7 52.8 1848.2 104 555.7 ...
##  $ ...85 : num  1051.6 52.8 1845.2 102.6 555.8 ...
##  $ ...86 : num  1051.6 52.8 1848 104.9 555.7 ...
##  $ ...87 : num  1051.7 52.8 1857.1 106.3 555.8 ...
##  $ ...88 : num  1051.7 52.8 1861.8 104.1 555.7 ...
##  $ ...89 : num  1051.9 52.9 1864.8 100.5 555.7 ...
##  $ ...90 : num  1051.9 52.8 1866 96.7 555.8 ...
##  $ ...91 : num  1051.9 52.9 1865.7 93.2 555.8 ...
##  $ ...92 : num  1051.5 52.9 1864.1 90.7 555.7 ...
##  $ ...93 : num  1051.5 52.8 1862.9 87.9 555.7 ...
##  $ ...94 : num  1051.5 52.7 1861.6 85.3 555.6 ...
##  $ ...95 : num  1051.6 52.8 1861.3 83.4 555.5 ...
##  $ ...96 : num  1051.3 52.7 1860.7 82.7 555.6 ...
##  $ ...97 : num  1051.4 52.7 1860.2 81.3 555.6 ...
##  $ ...98 : num  1051.4 52.6 1857.6 68.8 555.6 ...
##  $ ...99 : num  1051.4 52.6 1860 84.7 555.7 ...
##   [list output truncated]

The coord object has 20 observations (rows) and 151 variables (columns). The first column contains the sperm identifiers, and the remaining 150 columns contain the values of the x or y coordinates.

The next step involves transposing rows into columns. First, sapply applies as.numeric to each column of coord, which returns a numeric matrix; the t function then transposes that matrix. A new object named coord2 is created with these instructions and is then given the structure of a dataframe:

coord2<-t(sapply(coord, as.numeric))

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

coord2<-as.data.frame(coord2)

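A warning is noted, indicating that NAs were introduced by coercion: the identifiers in the first column of coord (for example, "2(x)") are strings that cannot be converted to numbers. Therefore, we proceed to verify the structure of the coord2 object: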

str(coord2)

## 'data.frame':    151 obs. of  20 variables:
##  $ V1 : num  NA 1054 1054 1054 1054 ...
##  $ V2 : num  NA 59.1 59 58.9 58.8 ...
##  $ V3 : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ V4 : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ V5 : num  NA 557 557 557 557 ...
##  $ V6 : num  NA 257 256 256 256 ...
##  $ V7 : num  NA 1558 1562 1565 1566 ...
##  $ V8 : num  NA 456 456 455 453 ...
##  $ V9 : num  NA 1418 1419 1419 1420 ...
##  $ V10: num  NA 630 629 627 627 ...
##  $ V11: num  NA 1508 1509 1510 1508 ...
##  $ V12: num  NA 0 0 0 0 ...
##  $ V13: num  NA 0 0 0 0 ...
##  $ V14: num  NA 660 660 659 659 ...
##  $ V15: num  NA 549 548 551 550 ...
##  $ V16: num  NA 735 736 736 736 ...
##  $ V17: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ V18: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ V19: num  NA 1793 1793 1793 1793 ...
##  $ V20: num  NA 930 930 930 931 ...

Now, we have a dataframe with 151 observations (rows) and 20 variables (columns). The first row contains NAs. Let’s check the beginning of the object.

head(coord2)

##            V1       V2 V3 V4       V5       V6       V7       V8       V9
## ...1       NA       NA NA NA       NA       NA       NA       NA       NA
## ...2 1054.062 59.09278  0  0 556.8834 256.5092 1557.705 455.9799 1417.506
## ...3 1053.990 58.95833  0  0 556.9383 256.4753 1561.764 455.6351 1418.618
## ...4 1054.083 58.87629  0  0 557.0184 256.3190 1564.940 455.1192 1419.067
## ...5 1053.896 58.82292  0  0 557.0000 256.2822 1565.549 452.6732 1420.282
## ...6 1053.937 58.78947  0  0 557.0975 256.3232 1566.778 450.3038 1420.919
##           V10      V11 V12 V13      V14      V15      V16 V17 V18      V19
## ...1       NA       NA  NA  NA       NA       NA       NA  NA  NA       NA
## ...2 629.9663 1508.214   0   0 659.8587 549.4211 735.4737   0   0 1792.862
## ...3 628.6404 1508.910   0   0 659.7297 548.4578 735.6265   0   0 1793.273
## ...4 627.4832 1509.527   0   0 659.3681 550.9625 735.5875   0   0 1793.246
## ...5 626.8588 1508.331   0   0 659.2418 550.0617 736.4691   0   0 1793.362
## ...6 626.3837 1508.648   0   0 658.7337 553.6901 735.1549   0   0 1793.853
##           V20
## ...1       NA
## ...2 930.0461
## ...3 930.0606
## ...4 930.2308
## ...5 930.5797
## ...6 930.0735

We can observe that the first row of the coord2 object contains only NAs; it corresponds to the identifier column of coord, which could not be converted to numbers. To clean the object, we select only the rows that we need:

coord2<-coord2[2:151,]
head(coord2)

##            V1       V2 V3 V4       V5       V6       V7       V8       V9
## ...2 1054.062 59.09278  0  0 556.8834 256.5092 1557.705 455.9799 1417.506
## ...3 1053.990 58.95833  0  0 556.9383 256.4753 1561.764 455.6351 1418.618
## ...4 1054.083 58.87629  0  0 557.0184 256.3190 1564.940 455.1192 1419.067
## ...5 1053.896 58.82292  0  0 557.0000 256.2822 1565.549 452.6732 1420.282
## ...6 1053.937 58.78947  0  0 557.0975 256.3232 1566.778 450.3038 1420.919
## ...7 1053.844 58.73958  0  0 556.9684 256.1202 1568.443 451.9823 1421.352
##           V10      V11 V12 V13      V14      V15      V16 V17 V18      V19
## ...2 629.9663 1508.214   0   0 659.8587 549.4211 735.4737   0   0 1792.862
## ...3 628.6404 1508.910   0   0 659.7297 548.4578 735.6265   0   0 1793.273
## ...4 627.4832 1509.527   0   0 659.3681 550.9625 735.5875   0   0 1793.246
## ...5 626.8588 1508.331   0   0 659.2418 550.0617 736.4691   0   0 1793.362
## ...6 626.3837 1508.648   0   0 658.7337 553.6901 735.1549   0   0 1793.853
## ...7 625.9091 1511.569   0   0 663.0067 556.5714 735.1948   0   0 1794.060
##           V20
## ...2 930.0461
## ...3 930.0606
## ...4 930.2308
## ...5 930.5797
## ...6 930.0735
## ...7 930.0150

In the coord2 object, odd-numbered columns therefore correspond to x, while even-numbered columns correspond to y. The order has been maintained, but the sperm identifiers have been lost.
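The identifiers are still available in the first column of coord, so the x/y alternation can be checked rather than assumed. The following is a minimal sketch on invented toy data that follows the label pattern shown in Figure 1. The names coord_toy, sperm_id, coord_axis, and coord2_toy are hypothetical, and dropping the label column before transposing is a variant of the steps above, not the procedure used in this document.

```r
# Toy stand-in for coord: the first column holds labels such as "2(x)",
# the remaining columns hold one coordinate value per frame (invented values)
coord_toy <- data.frame(
  label = c("2(x)", "2(y)", "3(x)", "3(y)"),
  f1 = c(1054.1, 59.1, 0, 0),
  f2 = c(1054.0, 59.0, 1821.0, 76.7),
  f3 = c(1054.1, 58.9, 1815.1, 82.9)
)

# Parse the labels: the digits before the parenthesis are the sperm ID,
# the letter inside the parentheses is the coordinate
sperm_id   <- as.integer(sub("\\(.*$", "", coord_toy$label))
coord_axis <- sub("^.*\\((.)\\)$", "\\1", coord_toy$label)

# Confirm that rows alternate x, y, x, y, ... before relying on column parity
stopifnot(all(coord_axis == rep(c("x", "y"), length.out = length(coord_axis))))

# Transpose only the numeric columns, so that no row of NAs is created
coord2_toy <- as.data.frame(t(as.matrix(coord_toy[, -1])))

unique(sperm_id)  # sperm IDs in file order
dim(coord2_toy)   # frames by (2 * number of sperm)
```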

The next task is to reshape the coord2 object into a long format. For this purpose, column 3 should be placed below column 1, column 5 below column 3, and so forth. The same should happen for the even-numbered columns. To achieve this, we first create an object named col_odd, a vector with one element per column of coord2 whose value is 1 for odd-numbered columns and 0 for even-numbered columns:

col_odd<-seq_len(ncol(coord2)) %% 2

Next, we will create two objects named only_x and only_y, where we will place the odd-numbered columns (corresponding to x) and even-numbered columns (corresponding to y), respectively:

only_x<-coord2[, col_odd == 1]
only_y<-coord2[, col_odd == 0]

Verify the content of each object:

head(only_x)

##            V1 V3       V5       V7       V9      V11 V13      V15 V17      V19
## ...2 1054.062  0 556.8834 1557.705 1417.506 1508.214   0 549.4211   0 1792.862
## ...3 1053.990  0 556.9383 1561.764 1418.618 1508.910   0 548.4578   0 1793.273
## ...4 1054.083  0 557.0184 1564.940 1419.067 1509.527   0 550.9625   0 1793.246
## ...5 1053.896  0 557.0000 1565.549 1420.282 1508.331   0 550.0617   0 1793.362
## ...6 1053.937  0 557.0975 1566.778 1420.919 1508.648   0 553.6901   0 1793.853
## ...7 1053.844  0 556.9684 1568.443 1421.352 1511.569   0 556.5714   0 1794.060

head(only_y)

##            V2 V4       V6       V8      V10 V12      V14      V16 V18      V20
## ...2 59.09278  0 256.5092 455.9799 629.9663   0 659.8587 735.4737   0 930.0461
## ...3 58.95833  0 256.4753 455.6351 628.6404   0 659.7297 735.6265   0 930.0606
## ...4 58.87629  0 256.3190 455.1192 627.4832   0 659.3681 735.5875   0 930.2308
## ...5 58.82292  0 256.2822 452.6732 626.8588   0 659.2418 736.4691   0 930.5797
## ...6 58.78947  0 256.3232 450.3038 626.3837   0 658.7337 735.1549   0 930.0735
## ...7 58.73958  0 256.1202 451.9823 625.9091   0 663.0067 735.1948   0 930.0150

Now, two new objects must be created, which will contain the stacked data from the x columns or the y columns:

only_x<-data.frame(x=unlist(only_x, use.names=FALSE))
only_y<-data.frame(y=unlist(only_y, use.names=FALSE))

We will create a new object named traj, which will consist of two columns. The first column will contain the content of the only_x object, and the second column will contain the content of the only_y object:

traj<-cbind(only_x, only_y)

Verify the structure of the traj object:

str(traj)

## 'data.frame':    1500 obs. of  2 variables:
##  $ x: num  1054 1054 1054 1054 1054 ...
##  $ y: num  59.1 59 58.9 58.8 58.8 ...

The traj object now contains two variables (the x and y columns) and 1500 observations or lines of data: 150 frames for each of the 10 sperm. In total, the dataframe contains 3000 data points, that is, 1500 observations multiplied by 2 coordinates.
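The Stage 1 reshaping (splitting by column parity, then stacking) can be wrapped in a small helper so that it is applied identically to every input file. This is a minimal sketch. The function name coords_to_long and the toy data are hypothetical, and it assumes the columns are ordered x, y, x, y, as in coord2.

```r
# Hypothetical helper: wide frame (columns x, y, x, y, ...) to long x/y frame
coords_to_long <- function(wide) {
  stopifnot(ncol(wide) %% 2 == 0)            # x and y columns must be paired
  is_x <- seq_len(ncol(wide)) %% 2 == 1      # TRUE for odd-numbered columns
  data.frame(
    # unlist stacks the selected columns one below the other, left to right
    x = unlist(wide[, is_x, drop = FALSE], use.names = FALSE),
    y = unlist(wide[, !is_x, drop = FALSE], use.names = FALSE)
  )
}

# Toy data: 3 frames (rows) for 2 sperm (4 columns), invented values
wide_toy <- data.frame(V1 = c(1, 2, 3), V2 = c(10, 20, 30),
                       V3 = c(4, 5, 6), V4 = c(40, 50, 60))
long_toy <- coords_to_long(wide_toy)
long_toy

# Rows in long format = frames * number of sperm
nrow(long_toy) == nrow(wide_toy) * ncol(wide_toy) / 2
```

The drop = FALSE argument keeps the selection a dataframe even when a file contains a single sperm.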

3.2 Stage 2: Identifier creation

Columns with identifiers for each sperm must be added to the traj object. These identifiers are of string type. The columns should be named ID1 to ID5. The content of each column may change according to the experiment conducted. In this example, the content of each ID is as follows:
1. ID1 is the key to the capture routine.
2. ID2 is the experiment number (or the number of the male or experimental unit).
3. ID3 is the experimental treatment (or factor level).
4. ID4 is the incubation time.
5. ID5 is the sperm ID in the capture routine. This ID must match exactly the one in the motility parameters file.

Note that the content of the identifier columns will change depending on the input files.

Now, a file with the coordinates for the entire experiment must be created. To do so, a file listing the IDs of the analyzed sperm, in the same order as in the coordinates file, must be carefully prepared. This file is opened in R, and the objects ID1 and ID5 are obtained from it.

An object containing the missing IDs is created. This object will contain the data presented in Figure 2.

ID_faltantes<-read_ods("ID5_data_test.ods", col_names=TRUE, as_tibble=FALSE)

Verify the structure of the ID_faltantes object:

str(ID_faltantes)

## 'data.frame':    10 obs. of  2 variables:
##  $ ID1 : chr  "obs1_1" "obs1_1" "obs1_1" "obs1_1" ...
##  $ ID.5: num  2 3 5 7 9 10 11 13 14 15

As the traj object has 150 coordinates for each sperm, we should have each identifier repeated 150 times. Therefore, from the ID_faltantes object, we take column 1 and place it in a new object called ID1. Next, we create 150 repetitions of each value in ID1:

ID1<-ID_faltantes[,1]
ID1<-rep(ID1, each=150)
length(ID1)

## [1] 1500
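The each argument matters here because the coordinates of each sperm are stacked in one contiguous block. A brief sketch with toy identifiers (the values and the names ids, by_each, and by_times are illustrative only) contrasts it with times:

```r
ids <- c(2, 3, 5)  # toy sperm identifiers

# each: every value is repeated in place, matching blocks of stacked frames
by_each <- rep(ids, each = 2)    # 2 2 3 3 5 5

# times: the whole vector is repeated, which would misalign IDs and frames
by_times <- rep(ids, times = 2)  # 2 3 5 2 3 5

by_each
by_times
```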

Since each file used as input for this workflow may have a different number of lines (analyzed sperm), we will now create an object called num_row. This object will contain the necessary number of lines for each specific input file.

num_row<-(ncol(coord2)/2)*150
num_row

## [1] 1500

The num_row object will be used to generate as many lines as necessary for ID2, ID3, and ID4.

ID2<-rep("exp1", num_row)
ID3<-rep("tratamiento1", num_row)
ID4<-rep("0h", num_row)

To create the ID5 object, we take the content of column 2 from the ID_faltantes object and repeat each value 150 times.

ID5<-ID_faltantes[,2]
ID5<-rep(ID5, each=150)
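The literal 150 used above assumes that every input file has 150 frames. The following minimal sketch, on invented toy data, derives the repetition count from the transposed coordinate object and builds all identifier columns in one step. The names coord2_toy, ids_toy, n_frames, n_sperm, and id_cols are hypothetical, and the constant values for ID2 to ID4 are those of the example in this document.

```r
# Toy stand-ins: 3 frames (rows) for 2 sperm (4 columns), and their IDs
coord2_toy <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9, V4 = 10:12)
ids_toy <- data.frame(ID1 = c("obs1_1", "obs1_1"), ID5 = c(2, 3))

n_frames <- nrow(coord2_toy)      # frames per sperm, read from the data
n_sperm  <- ncol(coord2_toy) / 2  # one x column and one y column per sperm

# The ID file must list exactly one row per sperm in the coordinate file
stopifnot(n_sperm == nrow(ids_toy))

# Length-one values (ID2 to ID4) are recycled to the full number of rows
id_cols <- data.frame(
  ID1 = rep(ids_toy$ID1, each = n_frames),
  ID2 = "exp1",
  ID3 = "tratamiento1",
  ID4 = "0h",
  ID5 = rep(ids_toy$ID5, each = n_frames)
)
id_cols
```

The stopifnot check fails early if the identifier file and the coordinate file describe a different number of sperm.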

3.3 Stage 3: Final object creation

Finally, we create a new object, called traj2 by merging the identifying columns and traj. Note that the sperm identifiers change throughout the dataframe.

traj2<-as.data.frame(cbind(ID1, ID2, ID3, ID4, ID5, traj))
str(traj2)

## 'data.frame':    1500 obs. of  7 variables:
##  $ ID1: chr  "obs1_1" "obs1_1" "obs1_1" "obs1_1" ...
##  $ ID2: chr  "exp1" "exp1" "exp1" "exp1" ...
##  $ ID3: chr  "tratamiento1" "tratamiento1" "tratamiento1" "tratamiento1" ...
##  $ ID4: chr  "0h" "0h" "0h" "0h" ...
##  $ ID5: num  2 2 2 2 2 2 2 2 2 2 ...
##  $ x  : num  1054 1054 1054 1054 1054 ...
##  $ y  : num  59.1 59 58.9 58.8 58.8 ...

head(traj2)

##      ID1  ID2          ID3 ID4 ID5        x        y
## 1 obs1_1 exp1 tratamiento1  0h   2 1054.062 59.09278
## 2 obs1_1 exp1 tratamiento1  0h   2 1053.990 58.95833
## 3 obs1_1 exp1 tratamiento1  0h   2 1054.083 58.87629
## 4 obs1_1 exp1 tratamiento1  0h   2 1053.896 58.82292
## 5 obs1_1 exp1 tratamiento1  0h   2 1053.937 58.78947
## 6 obs1_1 exp1 tratamiento1  0h   2 1053.844 58.73958

tail(traj2)

##         ID1  ID2          ID3 ID4 ID5        x        y
## 1495 obs1_1 exp1 tratamiento1  0h  15 1805.636 931.2121
## 1496 obs1_1 exp1 tratamiento1  0h  15 1805.554 931.2769
## 1497 obs1_1 exp1 tratamiento1  0h  15 1805.403 931.4478
## 1498 obs1_1 exp1 tratamiento1  0h  15 1804.460 930.5397
## 1499 obs1_1 exp1 tratamiento1  0h  15 1804.113 931.2903
## 1500 obs1_1 exp1 tratamiento1  0h  15 1803.900 931.3500

Depending on the settings of the CASA system, some sperm may not have been detected in all 150 frames, resulting in zero values. These zero values pose a problem when reconstructing the images of the trajectories, so we must eliminate them. This is done in two steps; first, we replace all zeros with NAs:

traj2[traj2 == 0]<-NA

Then, with the help of the drop_na function from the tidyr library, we remove all rows with NAs:

library(tidyr)
traj2<- drop_na(traj2)

Verify the traj2 object:

str(traj2)

## 'data.frame':    1379 obs. of  7 variables:
##  $ ID1: chr  "obs1_1" "obs1_1" "obs1_1" "obs1_1" ...
##  $ ID2: chr  "exp1" "exp1" "exp1" "exp1" ...
##  $ ID3: chr  "tratamiento1" "tratamiento1" "tratamiento1" "tratamiento1" ...
##  $ ID4: chr  "0h" "0h" "0h" "0h" ...
##  $ ID5: num  2 2 2 2 2 2 2 2 2 2 ...
##  $ x  : num  1054 1054 1054 1054 1054 ...
##  $ y  : num  59.1 59 58.9 58.8 58.8 ...

The object now has 1379 observations instead of 1500, because all rows with zeros have been eliminated:

head(traj2)

##      ID1  ID2          ID3 ID4 ID5        x        y
## 1 obs1_1 exp1 tratamiento1  0h   2 1054.062 59.09278
## 2 obs1_1 exp1 tratamiento1  0h   2 1053.990 58.95833
## 3 obs1_1 exp1 tratamiento1  0h   2 1054.083 58.87629
## 4 obs1_1 exp1 tratamiento1  0h   2 1053.896 58.82292
## 5 obs1_1 exp1 tratamiento1  0h   2 1053.937 58.78947
## 6 obs1_1 exp1 tratamiento1  0h   2 1053.844 58.73958
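Because the replacement of zeros above is applied to every column, an identifier equal to 0 would also be turned into NA and its rows dropped. A minimal base R sketch of a variant that inspects only the coordinate columns follows. The toy data and the names traj_toy, undetected, and traj_clean are hypothetical, and treating a frame as undetected only when both x and y are zero is an assumption.

```r
# Toy long-format data: sperm 3 is not detected in its first frame
traj_toy <- data.frame(
  ID5 = c(2, 2, 3, 3, 3),
  x = c(1054.1, 1054.0, 0, 1821.0, 1815.1),
  y = c(59.1, 59.0, 0, 76.7, 82.9)
)

# Assumption: an undetected frame has both coordinates recorded as zero
undetected <- traj_toy$x == 0 & traj_toy$y == 0
traj_clean <- traj_toy[!undetected, ]

# Frames retained per sperm; the counts can differ after filtering
table(traj_clean$ID5)
```

Counting the retained frames per sperm shows which trajectories are incomplete.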

Verify that the sperm IDs in the traj2 object correspond to the IDs of the rows in the motility parameters file.

unique(traj2$ID5)

##  [1]  2  3  5  7  9 10 11 13 14 15
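This comparison can also be made programmatically. The sketch below uses toy ID vectors, because the name of the sperm ID column in the motility parameters file is not shown here. The names traj_ids, mr_ids, and missing_in_traj are hypothetical.

```r
# Toy IDs: sperm 9 has motility parameters but no remaining coordinates,
# as would happen if all of its frames had been removed as zeros
traj_ids <- c(2, 3, 5, 7)
mr_ids   <- c(2, 3, 5, 7, 9)

setequal(traj_ids, mr_ids)                   # TRUE only if both sets match
missing_in_traj <- setdiff(mr_ids, traj_ids) # IDs lacking coordinates
missing_in_mr   <- setdiff(traj_ids, mr_ids) # IDs lacking parameters

missing_in_traj
missing_in_mr
```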

At this point, we can create a CSV file for later use:

write.csv(traj2, "traj_exp1.csv")

Alternatively, we can create a new object with a different name, for example:

end_1<-traj2

In case we repeat our workflow with other files, the last created object (end_1) would be useful to create a final object containing all the data from the processed files. This can be done as follows:

end_all<-as.data.frame(rbind(end_1, end_2, end_3, end_4, end_5, end_6, end_7, end_8))
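When the number of processed files varies, the objects can be collected in a list and combined in one call. This is a minimal sketch with two invented toy objects; the name parts is hypothetical, and all objects are assumed to have the same columns.

```r
# Toy stand-ins for the objects produced by two runs of the workflow
end_1 <- data.frame(ID1 = "obs1_1", ID5 = c(2, 2), x = c(1, 2), y = c(3, 4))
end_2 <- data.frame(ID1 = "obs1_2", ID5 = c(4, 4), x = c(5, 6), y = c(7, 8))

# Collect the objects in a list, then stack them with a single rbind call
parts <- list(end_1, end_2)
end_all <- do.call(rbind, parts)

str(end_all)
```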

Whether we process one or more files with trajectory data, it should be remembered that, to combine the traj file and the motility parameters file in the traj-ah-6-full-0.ipynb Jupyter notebook, our final object traj2 and the motility parameters file must have the same number of observations (lines) (see Figure 3).

It is important to remember that the files intended to be used as input for the traj-ah-6-full-0.ipynb Jupyter notebook must have a CSV extension. In the case of the coordinate file, the first line (containing the column names) should be removed, whereas the motility parameters file must include the line with the column names.
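The header line of the coordinate file can be left out at write time instead of being deleted afterwards. Since write.csv does not allow the column names to be suppressed, the sketch below uses write.table with a comma separator. The toy data and the temporary file name are hypothetical, and keeping the row names (as write.csv does by default) is an assumption about what the notebook expects.

```r
# Toy stand-in for traj2 (invented subset of columns and rows)
traj_toy <- data.frame(ID1 = "obs1_1", ID5 = c(2, 2),
                       x = c(1054.062, 1053.990),
                       y = c(59.09278, 58.95833))

# Write comma-separated values without the header line
out_file <- file.path(tempdir(), "traj_toy.csv")
write.table(traj_toy, out_file, sep = ",",
            col.names = FALSE,  # omit the line with the column names
            row.names = TRUE)   # assumption: keep row names, as write.csv does

readLines(out_file)  # one line per observation, no header
```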

4 Conclusions

With the proposed workflow in this document, one can modify a coordinate file to obtain the long format. Similarly, sperm IDs can be automatically generated so that each set of coordinates has its own identifier.

5 Acknowledgments

The authors express their gratitude to Dr. Masakatsu Fujinoki for sharing his data.

References

Amann RP and Waberski D (2014) Computer-assisted sperm analysis (CASA): Capabilities and potential developments. Theriogenology 81 5-17.e1-3.

Giaretta E, Munerato M, Yeste M, Galeati G, Spinaci M, Tamanini C, Mari G and Bucci D (2017) Implementing an open-access CASA software for the assessment of stallion sperm motility: Relationship with other sperm quality parameters. Animal Reproduction Science 176 11–19.

Ramón M and Martínez-Pastor F (2018) Implementation of novel statistical procedures and other advanced approaches to improve analysis of CASA data. Reproduction, Fertility, and Development 30 860–866.

Rasband WS (1997) ImageJ, US National Institutes of Health. Bethesda, Maryland, USA.

Rivas AC, Ayala EME and Aragon MA (2022) Effect of various pH levels on the sperm kinematic parameters of boars. South African Journal of Animal Science 52 693–704.

Rodríguez-Martínez EA, Rivas CU, Ayala ME, Blanco-Rodríguez R, Juarez N, Hernandez-Vargas EA and Aragón A (2023) A new computational approach, based on images trajectories, to identify the subjacent heterogeneity of sperm to the effects of ketanserin. Cytometry. Part A 103 655–663.

Wilson-Leedy JG and Ingermann R (2007) Development of a novel CASA system based on open source software for characterization of zebrafish sperm motility parameters. Theriogenology 67 661–672.

Streamlining CASA Data: Converting Sperm Coordinates to Long Format for Improved Analysis:

Preparation for machine learning algorithms

Andrés Aragón Martínez

Cindy U. Rivas Arzaluz

María E. Ayala Escobar

2023-12-26